5 Globalization Support

This chapter describes TimesTen globalization support features. It includes the following topics:

Overview of globalization support features
Choosing a database character set
Length semantics and data storage
Connection character set
Linguistic sorts
SQL string and character functions
Setting globalization support attributes
Globalization support during migration
Supported character sets
Supported linguistic sorts

Overview of globalization support features

TimesTen globalization support includes the following features:

Character set support

You must choose a database character set when you create a data store. See "Supported character sets" for a list of supported character sets. You can also choose a connection character set for a session. See "Connection character set".

Length semantics

You can specify byte semantics or character semantics for defining the storage measurement of character data types. See "Length semantics and data storage".

Linguistic sorts and indexes. You can sort data based on linguistic rules. See "Linguistic sorts". You can use linguistic indexes to improve performance of linguistic sorts. See "Using linguistic indexes".
SQL string and character functions

TimesTen provides SQL functions that return information about character strings. TimesTen also provides SQL functions that return a character from an encoded value. See "SQL string and character functions".

Note:

This release of TimesTen does not support session language and territory.

Choosing a database character set

TimesTen uses the database character set to define the encoding of data stored in character data types, such as CHAR and VARCHAR2.

Use the DatabaseCharacterSet data store attribute to specify the database character set during data store creation. You cannot alter the database character set after data store creation, and there is no default value for DatabaseCharacterSet. See "Supported character sets" for a list of supported character sets.

Consider the following questions when you choose a character set for a data store:

What languages does the data store need to support now and in the future?
Is the character set available on the operating system?
What character sets are used on clients?
How well does the application handle the character set?
What are the performance implications of the character set?

If you are using Oracle In-Memory Database Cache (IMDB Cache) to cache Oracle tables, you must create the data store with the same database character set as the Oracle Database.

This section includes the following topics:

Character sets and languages
Client operating system and application compatibility
Performance and storage implications
Character sets and replication

Character sets and languages

Choosing a database character set determines what languages can be represented in the database.

A group of characters, such as alphabetic characters, ideographs, symbols, punctuation marks, and control characters, can be encoded as a character set. An encoded character set assigns unique numeric codes to each character in the character repertoire. The numeric codes are called code points or encoded values.

Character sets can be single-byte or multibyte. Single-byte 7-bit encoding schemes can define up to 128 characters and normally support just one language. Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese because these languages use thousands of characters. These encoding schemes use either a fixed number or a variable number of bytes to represent each character. Unicode is a universal encoded character set that enables information from any language to be stored using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.

Client operating system and application compatibility

The database character set is independent of the operating system. On an English operating system, you can create and run a database with a Japanese character set. However, when an application in the client operating system accesses the database, the client operating system must be able to support the database character set with appropriate fonts and input methods. For example, you cannot insert or retrieve Japanese data on the English Windows operating system without first installing a Japanese font and input method. Another way to insert and retrieve Japanese data is to use a Japanese operating system remotely to access the database server.

If all client applications use the same character set, then that character set is usually the best choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the application character sets. This ensures that every character is represented when converting from an application character set to the database character set.

Performance and storage implications

For best performance, choose a character set that avoids character set conversion and uses the most efficient encoding for the languages desired. Single-byte character sets result in better performance than multibyte character sets, and they also are the most efficient in terms of space requirements. However, single-byte character sets limit how many languages you can support.

Character sets and replication

All data stores in a replication scheme must have the same database character set. No character set conversion occurs during replication.

Length semantics and data storage

In single-byte c haracter sets, the number of bytes and the number of characters in a string are the same. In multibyte character sets, a character or code point consists of one or more bytes. Calculating the number of characters based on byte lengths can be difficult in a variable-width character set. Calculating column lengths in bytes is called byte semantics, while measuring column lengths in characters is called character semantics.

Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8), suppose that you need to define a VARCHAR2 column that can store up to five Chinese characters together with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, which are three bytes long, and 5 bytes for the English characters, which are one byte long, for a total of 20 bytes. Using character semantics, the column requires 10 characters.

The expressions in the following list use byte semantics. Note the BYTE qualifier in the CHAR and VARCHAR2 expressions.

CHAR (5 BYTE)
VARCHAR2(20 BYTE)

The expressions in the following list use character semantics. Note the CHAR qualifier in the VARCHAR2 expression.

VARCHAR2(20 CHAR)
SUBSTR(string, 1, 20)

By default, the CHAR and VARCHAR2 character data types are specified in bytes, not characters. Therefore, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

The NLS_LENGTH_SEMANTICS general connection attribute determines whether a new column of character data type uses byte or character semantics. It enables you to create CHAR and VARCHAR2 columns using either byte-length or character-length semantics without having to add the explicit qualifier. NCHAR and NVARCHAR2 columns are always character-based. Existing columns are not affected.

The default value for NLS_LENGTH_SEMANTICS is BYTE. Specifying the BYTE or CHAR qualifier in a data type expression overrides the NLS_LENGTH_SEMANTICS value.

Connection character set

The database character set determines the encoding of CHAR and VARCHAR2 character data types. The connection character set is used to describe the encoding of the incoming and outgoing application data, so that TimesTen can perform the necessary character set conversion between the application and the database. For example, this allows a non-Unicode application to communicate with a Unicode (AL32UTF8) database.

The ConnectionCharacterSet general connection attribute sets the character encoding for the connection, which can be different than the database character set. The connection uses the connection character set for information that passes through the connection, such as parameters, SQL query text, results and error messages. Choose a connection character set that matches the application environment or the character set of your data source.

Best performance results when the connection character set and the database character set are the same because no conversion occurs. When the connection character set and the database character set are different, data conversion is performed in the ODBC layer. Characters that cannot be converted to the target character set are changed to replacement characters.

The default connection character set is US7ASCII. This setting applies to both direct and client connections.

Linguistic sorts

Different languages have different sorting rules. Text is conventionally sorted inside a database according to the binary codes used to encode the characters. Typically, this does not produce a sort order that is linguistically meaningful. A linguistic sort handles the complex sorting requirements of different languages and cultures. It enables text in character data types, such as CHAR, VARCHAR2, NCHAR, and NVARCHAR2, to be sorted according to specific linguistic conventions.

A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order. TimesTen offers two kinds of linguistic sorts: monolingual and multilingual.

This section includes the following topics:

Monolingual linguistic sorts
Multilingual linguistic sorts
Case-insensitive and accent-insensitive linguistic sorts
Performing a linguistic sort
Using linguistic indexes

Monolingual linguistic sorts

TimesTen compares character strings in two steps for monolingual sorts. The first step compares the major value of the entire string from a table of major values. Usually, letters with the same appearance have the same major value. The second step compares the minor value from a table of minor values. The major and minor values are defined by TimesTen. TimesTen defines letters with accent and case differences as having the same major value but different minor values.

Monolingual linguistic sorting is available only for single-byte and Unicode database character sets. If a monolingual linguistic sort is specified when the database character set is non-Unicode multibyte, then the default sort order is the binary sort order of the database character set.

For a list of supported sorts, see "Supported linguistic sorts".

Multilingual linguistic sorts

TimesTen provides multilingual linguistic sorts so that you can sort data for multiple languages in one sort. Multilingual linguistic sort is based on the ISO/OEC 14651 - International String Ordering and the Unicode Collation algorithm standards. This framework enables the database to handle languages that have complex sorting rules, such as those in Asian languages, as well as providing linguistic support for databases with multilingual data.

In addition, multilingual sorts can handle canonical equivalence and supplementary characters. Canonical equivalence is a basic equivalence between characters or sequences of characters. For example, ç is equivalent to the combination of c and ,.

For example, TimesTen supports a monolingual French sort (FRENCH), but you can specify a multilingual French sort (FRENCH_M). _M represents the ISO 14651 standard for multilingual sorting. The sorting order is based on the GENERIC_M sorting order and can sort accents from right to left. TimesTen recommends using a multilingual linguistic sort if the tables contain multilingual data. If the tables contain only French, then a monolingual French sort may have better performance because it uses less memory. It uses less memory because fewer characters are defined in a monolingual French sort than in a multilingual French sort. There is a trade-off between the scope and the performance of a sort.

For a list of supported multilingual sorts, see "Supported linguistic sorts".

Case-insensitive and accent-insensitive linguistic sorts

Operations inside a database are sensitive to the case and the accents of the characters. Sometimes you might need to perform case-insensitive or accent-insensitive comparisons.

To specify a case-insensitive or accent-insensitive sort:

Append _CI to a TimesTen sort name for a case-insensitive sort. For example:

BINARY_CI: accent-sensitive and case-insensitive binary sort

GENERIC_M_CI: accent-sensitive and case-insensitive GENERIC_M sort
Append _AI to a TimesTen sort name for an accent-insensitive and case-insensitive sort. For example:

BINARY_AI: accent-insensitive and case-insensitive binary sort

FRENCH_M_AI: accent-insensitive and case-insensitive FRENCH_M sort

Performing a linguistic sort

The NLS_SORT data store connection attribute indicates which collating sequence to use for linguistic comparisons. The NLS_SORT value affects the SQL string comparison operators and the ORDER BY clause.

You can use the ALTER SESSION statement to change the value of NLS_SORT:

ALTER SESSION SET NLS_SORT=SWEDISH;
SELECT product_name
  FROM product
  ORDER BY product_name;

PRODUCT NAME
------------
aerial
Antenne
Lcd
ächzen
Ähre

You can also override the NLS_SORT setting by using the NLSSORT SQL function to perform a linguistic sort:

SELECT * FROM test ORDER BY NLSSORT(name,'NLS_SORT=SPANISH');

For more extensive examples of using NLSSORT, see "NLSSORT" in the Oracle TimesTen In-Memory Database SQL Reference.

Using linguistic indexes

You can create a linguistic index to achieve better performance during linguistic comparisons. A linguistic index requires storage for the sort key values.

To create a linguistic index, use a statement similar to the following:

CREATE INDEX german_index ON employees
(NLSSORT(employee_id, 'NLS_SORT=GERMAN'));

The optimizer chooses the appropriate index based on the values for NLSSORT and NLS_SORT.

You must create multiple linguistic indexes if you want more than one linguistic sort on a column. For example, if you want both GERMAN and GERMAN_CI sorts against the same column, create two linguistic indexes.

For more information, see "CREATE INDEX" in the Oracle TimesTen In-Memory Database SQL Reference.

SQL string and character functions

The following table summarizes SQL functions that operate on character strings:

SQL function	Description
ASCIISTR	Takes as its argument either a string or an expression that resolves to a string in any character set. It returns the ASCII version of the string in the database character set. Non-ASCII characters are converted to Unicode escapes.
INSTR INSTRB INSTR4	Determines the first position, if any, at which one string occurs within another string. INSTRB uses bytes instead of characters. INSTR4 uses UCS4 code points.
LENGTH LENGTHB LENGTH4	Returns the length of a character string in an expression as number of characters. LENGTHB uses bytes instead of characters. LENGTH4 uses UCS4 code points.
LOWER and UPPER	The LOWER function converts expressions of type CHAR, NCHAR, VARCHAR2 or NVARCHAR2 to lowercase. The UPPER function converts expressions of type CHAR, NCHAR, VARCHAR2 or NVARCHAR2 to uppercase. Character semantics is supported for CHAR and VARCHAR2 types. The data type of the result is the same as the data type of the expression.
RTRIM	Removes trailing spaces from CHAR, VARCHAR2, NCHAR or NVARCHAR2 strings.
SUBSTR SUBSTRB SUBSTR4	Returns a VARCHAR2 or NVARCHAR2 string that represents a substring of a CHAR or NCHAR string. The returned substring is a specified number of characters, beginning from a designated starting point. SUBSTRB uses bytes instead of characters. SUBSTR4 uses UCS4 code points.
UNISTR	Takes as its argument a string that resolves to data of type NVARCHAR2. It returns the value in UTF-16 format. Unicode escapes are supported.

The following functions return characters:

CHR: Returns the character with the specified binary value in the database character set.
NCHR: Returns the character with the specified Unicode value.

See "Expressions" in the Oracle TimesTen In-Memory Database SQL Reference for more information including examples.

Setting globalization support attributes

The globalization support attributes are summarized in the following table:

Parameter	Description
DatabaseCharacterSet	Indicates the character encoding used by a data store.
ConnectionCharacterSet	Determines the character encoding for the connection, which may be different from the database character set.
NLS_SORT	Indicates the collating sequence to use for linguistic comparisons.
NLS_LENGTH_SEMANTICS	Sets the default length semantics.
NLS_NCHAR_CONV_EXCP	Determines whether an error is reported when there is data loss during an implicit or explicit data type conversion between NCHAR/NVARCHAR2 data and CHAR/VARCHAR2 data.

DatabaseCharacterSet must be set during data store creation. There is no default. See "Choosing a database character set".

The rest of the attributes are set during connection to a data store. For more information about ConnectionCharacterSet, see "Connection character set".

You can use the ALTER SESSION statement to change the following attributes during a session:

NLS_SORT
NLS_LENGTH_SEMANTICS
NLS_NCHAR_CONV_EXCP

For more information, see "ALTER SESSION" in the Oracle TimesTen In-Memory Database SQL Reference and "Data Store Attributes" in Oracle TimesTen In-Memory Database Reference.

Backward compatibility using TIMESTEN8

TIMESTEN8 is a restricted database character set that specifies behavior from TimesTen releases before 7.0. It is supported for backward compatibility only.

TIMESTEN8 has the following restrictions:

There is no support for character set conversion of any kind. This includes:
- Conversions between the application and the database. If DatabaseCharacterSet is TIMESTEN8, then ConnectionCharacterSet must also be TIMESTEN8.
- Conversions between CHAR/VARCHAR2 data and NCHAR/NVARCHAR2 data.
Sorting for CHAR and VARCHAR data types is limited to binary ordering. NLS_SORT=BINARY is the only sort allowed.
TIMESTEN8 is not supported in IMDB Cache.

During database creation, customers should select the database character set matching the actual encoding of the data being stored in CHAR and VARCHAR2 columns whenever possible. Select TIMESTEN8 only when backwards compatibility to existing TimesTen data is required.

Globalization support during migration

The ttMigrate utility saves one or more migrate objects from a TimesTen data store into a binary data file or restores the objects from the binary data files into a TimesTen data store. Migrate objects include tables, cache group definitions, views and sequences.

This section includes the following topics:

Object migration and character sets
Migration and length semantics
Migrating linguistic indexes
Migrating cache group tables

See also "Copying, migrating, backing up and restoring a data store" of this guide and the description of ttMigrate in Oracle TimesTen In-Memory Database Reference.

Object migration and character sets

ttMigrate tags each object it saves with the object's character set. By default, ttMigrate stores object data in the database character set, but you can choose a different character set by using the -saveAsCharset option. You can specify this option in create mode (-c) or append mode (-a).

When you use ttMigrate to restore an object, its data is implicitly converted to the database character set of the target data store if needed. Character set conversion can result in data loss if the database character set of the target data store cannot represent all of the data that it receives.

If you know that the data is in encoded in the database character set of the target data store, you can use the -noCharsetConversion option. This option can be specified only in restore mode (-r). If you use the -noCharsetConversion option, ttMigrate treats the data as if it is in the database character set of the target data store.

When you restore untagged character data from a data store that was created before release 7.0 into a data store from release 7.0 and later, ttMigrate treats the data as if it is in the database character set of the target data store.

ttMigrate issues a warning whenever there is an implicit or explicit character set conversion while saving or restoring data.

Migration and length semantics

ttMigrate saves length semantic information about CHAR and VARCHAR2 columns. It restores the length semantic information when restoring objects into data stores created in TimesTen release 7.0 or later.

When objects are migrated back into a TimesTen release before 7.0, columns with character semantics are converted to byte semantics and the column length is adjusted to match the byte length of the original columns.

When objects are migrated from a release before 7.0 to release 7.0 and later, byte semantics is used.

Migrating linguistic indexes

ttMigrate supports migration of linguistic indexes into TimesTen releases that support them. When migrating back to a TimesTen release before 7.0, ttMigrate issues a warning indicating that the linguistic indexes cannot be restored. Migration of the table proceeds without the linguistic indexes.

Migrating cache group tables

You cannot restore cache group tables containing NCHAR/NVARCHAR2 columns to a release before 7.0. Releases before 7.0 do not allow these data types in cache group tables.

Supported character sets

The tables in this section describe the character sets supported in TimesTen.

Asian character sets

Name	Description
JA16EUC	EUC 24-bit Japanese
JA16EUCTILDE	The same as JA16EUC except for the way that the wave dash and the tilde are mapped to and from Unicode
JA16SJIS	Shift-JIS 16-bit Japanese
JA16SJISTILDE	The same as JA16SJIS except for the way that the wave dash and the tilde are mapped to and from Unicode
KO16KSC5601	KSC5601 16-bit Korean
KO16MSWIN949	Microsoft Windows Code Page 949 Korean
TH8TISASCII	Thai Industrial Standard 620-2533 - ASCII 8-bit
VN8MSWIN1258	Microsoft Windows Code Page 1258 8-bit Vietnamese
ZHS16CGB231280	CGB2312-80 16-bit Simplified Chinese
ZHS16GBK	GBK 16-bit Simplified Chinese
ZHS32GB18030	GB18030-2000
ZHT16BIG5	BIG5 16-bit Traditional Chinese
ZHT16HKSCS	Microsoft Windows Code Page 950 with Hong Kong Supplementary Character Set HKSCS-2001. Character set conversion to and from Unicode is based on Unicode 3.0.
ZHT16MSWIN950	Microsoft Windows Code Page 950 Traditional Chinese
ZHT32EUC	EUC 32-bit Traditional Chinese

European character sets

Name	Description
BLT8CP921	Latvian Standard LVS8-92(1) Windows/UNIX 8-bit Baltic
BLT8ISO8859P13	ISO 8859-13 Baltic
BLT8MSWIN1257	Microsoft Windows Code Page 1257 8-bit Baltic
BLT8PC775	IBM-PC Code Page 775 8-bit Baltic
CEL8ISO8859P14	ISO 8859-13 Celtic
CL8ISO8859P5	ISO 8859-5 Latin/Cyrillic
CL8KOI8R	RELCOM Internet Standard 8-bit Latin/Cyrillic
CL8KOI8U	KOI8 Ukrainian Cyrillic
CL8MSWIN1251	Microsoft Windows Code Page 1251 8-bit Latin/Cyrillic
EE8ISO8859P2	ISO 8859-2 East European
EL8ISO8859P7	ISO 8859-7 Latin/Greek
ET8MSWIN923	Microsoft Windows Code Page 923 8-bit Estonian
EE8MSWIN1250	Microsoft Windows Code Page 1250 8-bit East European
EL8MSWIN1253	Microsoft Windows Code Page 1253 8-bit Latin/Greek
EL8PC737	IBM-PC Code Page 737 8-bit Greek/Latin
EE8PC852	IBM-PC Code Page 852 8-bit East European
LT8MSWIN921	Microsoft Windows Code Page 921 8-bit Lithuanian
NE8ISO8859P10	ISO 8859-10 North European
NEE8ISO8859P4	ISO 8859-4 North and North-East European
RU8PC866	IBM-PC Code Page 866 8-bit Latin/Cyrillic
SE8ISO8859P3	ISO 8859-3 South European
US7ASCII	ASCII 7-bit American
US8PC437	IBM-PC Code Page 437 8-bit American
WE8ISO8859P1	ISO 8859-1 West European
WE8ISO8859P15	ISO 8859-15 West European
WE8MSWIN1252	Microsoft Windows Code Page 1252 8-bit West European
WE8PC850	IBM-PC Code Page 850 8-bit West European
WE8PC858	IBM-PC Code Page 858 8-bit West European

Middle Eastern character sets

Name	Description
AR8ADOS720	Arabic MS-DOS 720 Server 8-bit Latin/Arabic
AR8ASMO8X	ASMO Extended 708 8-bit Latin/Arabic
AR8ISO8859P6	ISO 8859-6 Latin/Arabic
AR8MSWIN1256	Microsoft Windows Code Page 1256 8-Bit Latin/Arabic
AZ8ISO8859P9E	ISO 8859-9 Latin Azerbaijani
IW8ISO8859P8	ISO 8859-8 Latin/Hebrew
IW8MSWIN1255	Microsoft Windows Code Page 1255 8-bit Latin/Hebrew
TR8MSWIN1254	Microsoft Windows Code Page 1254 8-bit Turkish
TR8PC857	IBM-PC Code Page 857 8-bit Turkish
WE8ISO8859P9	ISO 8859-9 West European & Turkish

TimesTen character set

Name	Description
TIMESTEN8	TimesTen legacy character semantics

Universal character sets

Name	Description
AL16UTF16	Unicode 4.0 UTF-16 Universal character set. This is the implicit TimesTen national character set.
AL32UTF8	Unicode 4.0 UTF-8 Universal character set
UTF8	Unicode 3.0 UTF-8 Universal character set, CESU-8 compliant

Supported linguistic sorts

The tables in this section list the supported values for the NLS_SORT general connection attribute and the NLS_SORT SQL function.

Monolingual linguistic sorts

Basic name	Extended name
ARABIC	-
ARABIC_MATCH	-
ARABIC_ABJ_SORT	-
ARABIC_ABJ_MATCH	-
ASCII7	-
AZERBAIJANI	XAZERBAIJANI
BENGALI	-
BIG5	-
BINARY	-
BULGARIAN	-
CANADIAN FRENCH	-
CATALAN	XCATALAN
CROATIAN	XCROATIAN
CZECH	XCZECH
CZECH_PUNCTUATION	XCZECH_PUNCTUATION
DANISH	XDANISH
DUTCH	XDUTCH
EBCDIC	-
EEC_EURO	-
EEC_EUROPA3	-
ESTONIAN	-
FINNISH	-
FRENCH	XFRENCH
GERMAN	XGERMAN
GERMAN_DIN	XGERMAN_DIN
GBK	-
GREEK	-
HEBREW	-
HKSCS	-
HUNGARIAN	XHUNGARIAN
ICELANDIC	-
INDONESIAN	-
ITALIAN	-
LATIN	-
LATVIAN	-
LITHUANIAN	-
MALAY	-
NORWEGIAN	-
POLISH	-
PUNCTUATION	XPUNCTUATION
ROMANIAN	-
RUSSIAN	-
SLOVAK	XSLOVAK
SLOVENIAN	XSLOVENIAN
SPANISH	XSPANISH
SWEDISH	-
SWISS	XSWISS
THAI_DICTIONARY	-
TURKISH	XTURKISH
UKRAINIAN	-
UNICODE_BINARY	-
VIETNAMESE	-
WEST_EUROPEAN	XWEST_EUROPEAN

Multilingual linguistic sorts

Sort name	Description
CANADIAN_M	Canadian French sort supports reverse secondary, special expanding characters.
DANISH_M	Danish sort supports sorting uppercase characters before lowercase characters.
FRENCH_M	French sort supports reverse sort for secondary.
GENERIC_M	Generic sorting order which is based on ISO14651 and Unicode canonical equivalence rules but excluding compatible equivalence rules.
JAPANESE_M	Japanese sort supports SJIS character set order and EUC characters which are not included in SJIS.
KOREAN_M	Korean sort Hangul characters are based on Unicode binary order. Hanja characters based on pronunciation order. All Hangul characters are before Hanja characters.
SPANISH_M	Traditional Spanish sort supports special contracting characters.
THAI_M	Thai sort supports swap characters for some vowels and consonants.
SCHINESE_RADICAL_M	Simplified Chinese sort is based on radical as primary order and number of strokes order as secondary order.
SCHINESE_STROKE_M	Simplified Chinese sort uses number of strokes as primary order and radical as secondary order.
SCHINESE_PINYIN_M	Simplified Chinese Pinyin sorting order.
TCHINESE_RADICAL_M	Traditional Chinese sort based on radical as primary order and number of strokes order as secondary order.
TCHINESE_STROKE_M	Traditional Chinese sort uses number of strokes as primary order and radical as secondary order. It supports supplementary characters.