TorqueUnicode

From TDN

This page is a Work In Progress.

Introduction

Image:Chinesemenu2.jpg

Internals, UTF-8 vs UTF-16

Torque uses UTF-16 wherever possible, and UTF-8 for a few particular tasks. The dT macro is provided in platform.h for hard-coded wide strings. In Win32/Visual Studio parlance, this means that you need to use dT("foo") instead of L"foo" if you want to be portable and consistent.

Note that the wide characters are 16 bits wide on Windows, and 32 bits wide on Mac and Linux.

In order for text in script or gui files to display correctly, the file must be saved as UTF-8 with no BOM. If you include a BOM, Torque's script parser will get confused. Notepad automatically adds a BOM and is not suitable for editing UTF-8 scripts.

[If using jEdit: under the Utilities menu, in General settings, select UTF-8 as the Default character encoding. If you also de-select "Auto-detect file encoding when possible", it should expose the BOM character in any improperly saved files so you can get rid of it permanently.] --- Jake T


You can use this resource to patch your engine so that it can compile/exec scripts even if they are saved with a BOM.
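
The BOM in question is the three-byte sequence EF BB BF at the very start of the file. As a rough sketch of the kind of check such a patch could perform, a loader can step past those bytes before handing the buffer to the parser. The helper name skipUtf8Bom is hypothetical and is not part of Torque; where it would hook into the script loader is not shown here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a Torque API: returns a pointer just past a
// leading UTF-8 BOM (EF BB BF) if one is present, otherwise returns the
// buffer unchanged. 'len' is the number of bytes available in 'buffer'.
const char *skipUtf8Bom(const char *buffer, std::size_t len)
{
   assert(buffer != nullptr || len == 0);
   if (len >= 3 &&
       static_cast<unsigned char>(buffer[0]) == 0xEF &&
       static_cast<unsigned char>(buffer[1]) == 0xBB &&
       static_cast<unsigned char>(buffer[2]) == 0xBF)
   {
      return buffer + 3;
   }
   return buffer;
}
```

A caller would also have to shorten its own length count by three bytes when the returned pointer differs from the one passed in.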

Building a Unicode Torque

Note that the TORQUE_UNICODE define is set by default in 1.4.

Building a Unicode version of Torque on Win32 is simply a matter of adding a UNICODE define in the build settings of your IDE, as well as setting TORQUE_UNICODE. As this also ensures that any Windows API calls use their Unicode equivalents, if you have any custom Windows code in the platform layer you will have to ensure it passes strings as UTF-16. Conversion functions between UTF-8 and UTF-16 are provided and are detailed later.

If you want to build a Unicode version of Torque that is compatible with Windows 98 and Windows ME then you will need to refer to these additional directions.

Building Unicode Torque on Mac is easy: just set the TORQUE_UNICODE define in core/torqueConfig.h. Since all versions of Mac OS X use Unicode internally, there's no reason to do a non-Unicode build on the Mac. The Mac system APIs use Unicode strings. The BSD Unix layer expects you to pass strings in UTF-8 format.

Building Unicode Torque on Linux - ???

Fonts, Character Sets

In general, little has changed. Character sets are going to be removed before 1.4 final; they were in place to select which character set of a font you wanted, but this turned out to be unnecessary.

Caching, .gft / .uft

For maximum compatibility on shipping games, Torque's font rendering code implements a caching feature. This allows fonts to be quickly rendered from pre-generated textures for each character, as well as allowing games to use fonts which may not be installed on the user's computer (or even customized fonts!).

There are a few functions which expose this functionality to users. They allow the font cache to be pre-populated with font information, as well as allowing font image data to be exported and imported for more complex image processing. The functions are:

dumpFontCacheStatus()
Displays a full description of all cached fonts, along with info on the codepoints each contains. This basically gives you an overview of what's in the font cache for your game. It's very useful for figuring out what parameters to give the other font cache related functions.

writeFontCache()
Force all fonts to write themselves out to the cache. This is a good workaround if you have a game that crashes at shutdown (i.e., it's in development) or if you just need to guarantee the font cache is in sync with memory.

duplicateCachedFont(oldFont, oldSize, newFontName)
Take an existing font, and copy all of its data to a new font (of the same size). This allows font variants to be stored under separate names. Typically you'll duplicate a font to a new name, then modify the new font with the export/import calls described next to apply some extra effect to it, e.g. "Lucida Console On Fire", "Arial 2px Blur", "Arial Bold Outline".

exportCachedFont(fontName, size, fileName, padding, kerning)
Export the specified font to the specified filename as a PNG. The image can then be processed in Photoshop or another tool and reimported using importCachedFont. Characters in the font are exported as one long strip, so it's easy to apply vertical gradients or other effects. Padding makes sure that all characters have that many pixels of padding around them, and kerning gives characters that many pixels of additional horizontal spacing.

importCachedFont(fontName, size, fileName, padding, kerning)
Call with the same parameters you used in exportCachedFont. This loads the image strip back in, along with any changes you've made to it. You can convert the greyscale image to an RGBA image if you want to supply color information. Important Note: This function will actually change parameters on the specified font, so calling it more than once on the same font without wiping the cache will cause padding/kerning settings to get modified in ways that probably won't do what you want.

populateFontCacheString(faceName, size, string)
Populate the font cache for the specified font with characters from the specified string. For example, you might take all the strings from your localization tables and run them through this in order to make sure they'll all be displayed properly.

populateFontCacheRange(faceName, size, rangeStart, rangeEnd)
Similar to populateFontCacheString, but takes a numerical range of characters to put into the cache. So for instance, if you only want to support English you might give 0 and 128 as your range values. The values are Unicode code points; Torque supports the Basic Multilingual Plane (plane 0), so the first 65536 code points (U+0000 through U+FFFF) are available from this function.

populateAllFontCacheString(string)
Same as populateFontCacheString, but it does it for all active fonts.

populateAllFontCacheRange(rangeStart, rangeEnd)
Same as populateFontCacheRange, but it does it for all active fonts.

New C++ Types and Functions

Two new types have been added, UTF8 and UTF16 for UTF-8 and UTF-16 characters respectively. Although you can technically treat a char * as UTF-8, UTF-8 strings should use the UTF8 type. Caution: UTF8 is defined as the same type as char, and so the compiler will not warn you if you try to use it as a char or vice versa, when you shouldn't. All UTF-8 safe code should use the UTF8 type.

Because one UTF-8 character can span 1, 2, 3, or 4 bytes, performing string processing on UTF-8 strings is inconvenient, expensive, and not recommended. You should instead convert your UTF-8 string to a UTF-16 string to process it. Since conversion to and from UTF-8 is slightly expensive, you should avoid unnecessary conversion by sticking to UTF-16 for any heavy duty string parsing and processing, rather than jumping in and out of UTF-8 format.
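
To make the 1 to 4 byte span concrete, here is a small sketch that reads the length of a UTF-8 sequence from its first byte. The helper utf8SequenceLength is hypothetical and not part of Torque. It only inspects the lead byte, so it reports what the sequence claims to be and does not validate the bytes that follow.

```cpp
#include <cassert>

// Hypothetical helper, not a Torque API: given a pointer to the first byte
// of a UTF-8 sequence, return how many bytes the lead byte announces
// (1 to 4), or 0 if the byte cannot start a sequence (a continuation byte
// 10xxxxxx or an invalid lead byte 11111xxx).
unsigned utf8SequenceLength(const char *p)
{
   assert(p != nullptr);
   const unsigned char lead = static_cast<unsigned char>(*p);
   if (lead < 0x80)           return 1;  // 0xxxxxxx: plain ASCII
   if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
   if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
   if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: above the BMP
   return 0;
}
```

Every index calculation on a UTF-8 string needs a check like this for each character, which is the main reason processing in UTF-16 is simpler.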

Converting between UTF-8 and UTF-16

Although Torque uses UTF-16 wherever possible, it is still necessary to use UTF-8 for many tasks. The Windows API uses UTF-16 (technically, before Windows XP it used UCS2), and on Macs the BSD file handling interfaces expect path names in UTF-8 format. Thus conversion is often necessary in the platform layer. Conversion functions have been provided to convert between the two when needed.

U32 convertUTF16toUTF8( const UTF16 *unistring, UTF8  *outbuffer, U32 len);
U32 convertUTF8toUTF16( const UTF8  *unistring, UTF16 *outbuffer, U32 len);

The functions both convert 'unistring' into an 'outbuffer' that has 'len' length in bytes. They return the number of Unicode code points in the string. In Unicode vocabulary, 'code point' is used instead of 'character' to avoid confusion with the C char type.

The Unicode conversion functions guarantee that the Unicode strings they output will be on the Basic Multilingual Plane, or BMP. This means that they will not output a Unicode code point above U+FFFF ( 0xFFFF ). This greatly simplifies UTF-16 processing, and reduces the buffer sizes that you have to allocate to convert between UTF-16 and UTF-8.

When they encounter a code point that is outside the BMP, or when they encounter any invalid code point, or any error in 'unistring', the offending byte is replaced with the standard Unicode replacement character U+FFFD ( 0xFFFD ). This way, the converter guarantees that 'outbuffer' will contain a valid Unicode string on return.

Because UTF-8 uses a variable number of bytes for each character, it is not possible to know the exact size of the required buffer in advance. However, because our conversion functions guarantee results will be on the BMP, you don't have to worry about UTF-16 surrogate pairs, and UTF-8 code points are limited to 3 bytes in length. This means that when converting to UTF16, you need a buffer of dStrlen((char*)unistring) * 2 bytes, and when converting to UTF8, you need a buffer of dStrlen((UTF16*)unistring) * 3 bytes, plus room for the terminating NULL in both cases. Note that there are 2 different (overloaded) dStrlen() functions, one for char & UTF8, and one for UTF16. The following two code snippets demonstrate converting Unicode strings in Torque:

To convert from UTF8 to UTF16:

U32 numCodePoints, bufferLen;
bufferLen = dStrlen(utf8string)+1;                   // Need the +1 to hold the NULL
UTF16  *buffer16 = new UTF16[bufferLen];     // need len * 2 bytes == len * sizeof(UTF16)
numCodePoints = convertUTF8toUTF16(utf8string, buffer16, bufferLen);

To convert from UTF16 to UTF8:

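U32 numCodePoints, bufferLen;
bufferLen = dStrlen(utf16string) * 3 + 1;    // worst case 3 bytes per code point, +1 to hold the NULL
UTF8  *buffer8 = new UTF8[bufferLen];        // need len * 3 + 1 bytes; sizeof(UTF8) == 1
numCodePoints = convertUTF16toUTF8(utf16string, buffer8, bufferLen);
// ... use buffer8, then release it with delete [] buffer8;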
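
The sizing rules used in the two snippets can be written down as a pair of helpers. This is only a sketch: the function names are hypothetical and not part of Torque, and both rely on the assumption stated in the text that converter output is clamped to the BMP.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helpers, not Torque APIs.

// UTF-8 to UTF-16, worst case: every input byte is ASCII and becomes one
// 16-bit code unit. The extra unit holds the terminating NULL. The result
// is a count of UTF-16 code units, not bytes.
std::size_t utf16UnitsNeeded(std::size_t utf8Bytes)
{
   assert(utf8Bytes < std::numeric_limits<std::size_t>::max());
   return utf8Bytes + 1;
}

// UTF-16 to UTF-8, worst case: every code unit is a BMP code point in the
// range U+0800..U+FFFF and expands to three bytes. The extra byte holds
// the terminating NULL.
std::size_t utf8BytesNeeded(std::size_t utf16Units)
{
   // Guard against overflow in the multiplication below.
   assert(utf16Units <= (std::numeric_limits<std::size_t>::max() - 1) / 3);
   return utf16Units * 3 + 1;
}
```

For a five-unit UTF-16 string this gives a 16-byte UTF-8 buffer, and for a five-byte UTF-8 string it gives a six-unit (12-byte) UTF-16 buffer.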

Scanning Unicode strings

As UTF-8 uses a variable number of bytes for each character, you cannot simply scan a UTF-8 string as you would a normal ASCII string. Usually, the best and preferred solution is to convert your entire string to UTF16 and scan it normally. The conversion process does involve extra overhead, but since the process of finding the boundaries between UTF-8 code points implicitly converts the string to UTF-16 or UTF-32, the overhead is minimal. Additionally, a good optimizing compiler is better able to optimize your code if you first convert to UTF16.
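
As an illustration of why byte-by-byte scanning does not map onto characters, the sketch below counts code points in a UTF-8 string by skipping continuation bytes. The helper countUtf8CodePoints is hypothetical, not a Torque function, and it assumes the input is well-formed and NULL terminated.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a Torque API: counts code points in a
// well-formed, NULL-terminated UTF-8 string. Continuation bytes have the
// bit pattern 10xxxxxx, so every byte that does not match that pattern
// starts a new code point.
std::size_t countUtf8CodePoints(const char *s)
{
   assert(s != nullptr);
   std::size_t count = 0;
   for (; *s != '\0'; ++s)
   {
      if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
         ++count;
   }
   return count;
}
```

The byte length and the code point count diverge as soon as the string leaves ASCII, so an index into the byte array is not an index into the text.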

Even so, sometimes, you'll find the best solution is to convert single UTF-8 code points as needed. The following functions are provided for converting single Unicode code points. These functions are used internally by the buffer conversion functions above.

UTF32  oneUTF8toUTF32( const UTF8 *codepoint,  U32 *unitsWalked = NULL);
UTF32  oneUTF16toUTF32(const UTF16 *codepoint, U32 *unitsWalked = NULL);
UTF16  oneUTF32toUTF16(const UTF32 codepoint);
U32    oneUTF32toUTF8( const UTF32 codepoint, UTF8 *threeByteCodeunitBuf);

Canonical UTF-8 and UTF-16 are variable length encodings. UTF-32 is the only constant length Unicode encoding, and is the native encoding. Since UTF-32 is the basic, native Unicode encoding, any conversion of the variable length encodings UTF-16 and UTF-8 implicitly goes through UTF-32. To simplify UTF16 processing, Torque's Unicode converter forces its output onto the BMP. This means that any UTF32 character output by Torque's converter can be safely downcast to UTF16 without data loss.

In Unicode parlance, 'code unit' is used to denote a single unit of a variable length encoded code point. For UTF-8, a code unit is one byte (8 bits); for UTF-16 a code unit is 2 bytes (16 bits). See the pattern?
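
The relationship between a code point and its code unit count can be sketched as follows. Both helper names are hypothetical and not part of Torque; they describe standard UTF-8 and UTF-16, before Torque's BMP restriction is applied.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not Torque APIs.

// Number of 8-bit code units (bytes) standard UTF-8 uses for 'cp'.
unsigned utf8CodeUnits(std::uint32_t cp)
{
   assert(cp <= 0x10FFFF);      // highest Unicode code point
   if (cp < 0x80)    return 1;
   if (cp < 0x800)   return 2;
   if (cp < 0x10000) return 3;  // the rest of the BMP
   return 4;
}

// Number of 16-bit code units standard UTF-16 uses for 'cp'.
unsigned utf16CodeUnits(std::uint32_t cp)
{
   assert(cp <= 0x10FFFF);
   return cp < 0x10000 ? 1 : 2; // above the BMP needs a surrogate pair
}
```

Within the BMP the answers never exceed three UTF-8 units or one UTF-16 unit, which is where the buffer size limits for Torque's converters come from.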

UTF32  oneUTF8toUTF32( const UTF8 *codepoint,  U32 *unitsWalked = NULL);

Given a pointer to a UTF8 character, oneUTF8toUTF32() returns the character in UTF32 format, and returns the length of the UTF-8 character in code units in 'unitsWalked'. If it encounters an error, the Unicode replacement character U+FFFD is returned and 'unitsWalked' is set to 1. You can pass NULL as 'unitsWalked' if you don't care about that information, or you can skip that argument altogether, and it will default to NULL. The UTF32 character returned is guaranteed to be on the BMP, and can be safely downcast to a UTF16.
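
The behaviour described here can be approximated by the stand-in below. It is a sketch written for illustration, not Torque's implementation of oneUTF8toUTF32(). The name decodeOneUtf8 is hypothetical, the input is assumed to be NULL terminated, and the choice to walk the full sequence for a valid code point above the BMP is an assumption the text does not confirm.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in sketch, NOT Torque's oneUTF8toUTF32(). Decodes one code point
// from NULL-terminated UTF-8. Malformed input yields U+FFFD with one unit
// walked. A valid code point above the BMP yields U+FFFD with the whole
// sequence walked (an assumption, see the lead-in).
std::uint32_t decodeOneUtf8(const unsigned char *p, unsigned *unitsWalked = nullptr)
{
   assert(p != nullptr);
   const std::uint32_t kReplacement = 0xFFFD;
   static const std::uint32_t kMinForLen[5] = { 0, 0, 0x80, 0x800, 0x10000 };
   std::uint32_t cp = 0;
   unsigned len = 0;
   bool ok = true;

   if (p[0] < 0x80)                { cp = p[0];        len = 1; }
   else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
   else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
   else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
   else                            { ok = false; }  // stray continuation or bad lead

   // Each trailing byte must be 10xxxxxx. A NULL terminator fails this
   // test, so the loop never reads past the end of the string.
   for (unsigned i = 1; ok && i < len; ++i)
   {
      if ((p[i] & 0xC0) != 0x80) ok = false;
      else cp = (cp << 6) | (p[i] & 0x3F);
   }

   // Reject overlong forms, surrogates, and values beyond U+10FFFF.
   if (ok && (cp < kMinForLen[len] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF))
      ok = false;

   if (!ok)
   {
      if (unitsWalked) *unitsWalked = 1;
      return kReplacement;
   }
   if (unitsWalked) *unitsWalked = len;
   return cp > 0xFFFF ? kReplacement : cp;  // clamp to the BMP
}
```

Advancing by the reported unit count after every call lets a scanning loop make progress even over damaged input.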

UTF32  oneUTF16toUTF32(const UTF16 *codepoint, U32 *unitsWalked = NULL);

Given a pointer to a UTF16 character, oneUTF16toUTF32() returns the character in UTF32 format, and returns the length of the UTF-16 character in code units in 'unitsWalked'. If it encounters an error, the Unicode replacement character U+FFFD is returned and 'unitsWalked' is set to 1. You can pass NULL as 'unitsWalked' if you don't care about that information, or you can skip that argument altogether, and it will default to NULL. The UTF32 character returned is guaranteed to be on the BMP.
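
A stand-in for this behaviour might look like the sketch below. It is not Torque's implementation of oneUTF16toUTF32(). The name decodeOneUtf16 is hypothetical, the input is assumed to be NULL terminated, and reporting two units walked for a valid surrogate pair is an assumption based on the note in the sample code that two UTF16 code units may be walked.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in sketch, NOT Torque's oneUTF16toUTF32(). Decodes one code point
// from NULL-terminated UTF-16 and clamps the result to the BMP.
std::uint32_t decodeOneUtf16(const std::uint16_t *p, unsigned *unitsWalked = nullptr)
{
   assert(p != nullptr);
   const std::uint32_t kReplacement = 0xFFFD;
   const std::uint16_t first = p[0];
   std::uint32_t result = first;
   unsigned walked = 1;

   if (first >= 0xD800 && first <= 0xDBFF)        // high surrogate
   {
      // Reading p[1] is safe: p[0] was not the NULL terminator.
      const std::uint16_t second = p[1];
      if (second >= 0xDC00 && second <= 0xDFFF)
         walked = 2;                              // valid pair, but above the BMP
      result = kReplacement;                      // pair or lone high surrogate
   }
   else if (first >= 0xDC00 && first <= 0xDFFF)   // lone low surrogate: an error
   {
      result = kReplacement;
   }

   if (unitsWalked) *unitsWalked = walked;
   return result;
}
```

Because every surrogate pair collapses to U+FFFD, the returned value always fits in 16 bits.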

UTF16  oneUTF32toUTF16(const UTF32 codepoint);

Given a 32 bit UTF32 character, oneUTF32toUTF16() returns the character in UTF16 format. This is far safer than simply downcasting the UTF32 character, as it checks for invalid Unicode characters, and forces the results onto the BMP. If it encounters an invalid Unicode code point, or one that is above the BMP, the Unicode replacement character U+FFFD is returned.
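
To see why a plain downcast is unsafe, compare it with a checked narrowing. The function narrowToBmp below is a hypothetical stand-in that follows the rules in the description. It is not Torque's implementation of oneUTF32toUTF16().

```cpp
#include <cassert>
#include <cstdint>

// Stand-in sketch, NOT Torque's oneUTF32toUTF16(). Returns 'cp' as a
// single 16-bit unit, or U+FFFD when 'cp' is a surrogate value or lies
// above the BMP.
std::uint16_t narrowToBmp(std::uint32_t cp)
{
   const bool surrogate = (cp >= 0xD800 && cp <= 0xDFFF);
   if (surrogate || cp > 0xFFFF)
      return 0xFFFD;
   assert(cp <= 0xFFFF);  // the cast below cannot drop any bits
   return static_cast<std::uint16_t>(cp);
}
```

A raw cast of U+1F600 silently produces 0xF600, an unrelated private-use code point, whereas the checked version returns the replacement character.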

U32    oneUTF32toUTF8( const UTF32 codepoint, UTF8 *threeByteCodeunitBuf);

Given a 32 bit UTF32 character, oneUTF32toUTF8() places the character in 'threeByteCodeunitBuf' in UTF8 format, and returns the number of code units (bytes in this case) it required to store the UTF8 character. The argument 'threeByteCodeunitBuf' must be a pointer to a buffer no less than three bytes in length, or the function results are undefined. We do not place a null termination byte in the return buffer.
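
The three-byte limit follows directly from the UTF-8 bit layout for BMP code points, as the sketch below shows. It is a stand-in with the hypothetical name encodeOneUtf8, not Torque's implementation of oneUTF32toUTF8(). Replacing surrogates and code points above the BMP with U+FFFD is an assumption carried over from the BMP policy of the buffer converters; the text does not say what the Torque function does in that case.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in sketch, NOT Torque's oneUTF32toUTF8(). Writes the UTF-8 form of
// 'cp' into 'out', which must have room for at least 3 bytes, and returns
// the number of bytes written. No NULL terminator is written.
unsigned encodeOneUtf8(std::uint32_t cp, unsigned char *out)
{
   assert(out != nullptr);

   // Assumption: anything outside the BMP, and any surrogate value,
   // becomes the replacement character.
   if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      cp = 0xFFFD;

   if (cp < 0x80)                                   // 0xxxxxxx
   {
      out[0] = static_cast<unsigned char>(cp);
      return 1;
   }
   if (cp < 0x800)                                  // 110xxxxx 10xxxxxx
   {
      out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
      out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
      return 2;
   }
   // 1110xxxx 10xxxxxx 10xxxxxx covers the rest of the BMP.
   out[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
   out[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
   out[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
   return 3;
}
```

Since nothing terminates the output, the caller has to use the returned length when copying the bytes elsewhere.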

The following code snippets show how to convert a single UTF8 code point to a UTF16 code point, and vice versa:

To convert a single UTF8 code point:

// nest the 2 conversion functions, to get a utf16
U32 walked;
UTF16 singleChar16;
singleChar16 = oneUTF32toUTF16( oneUTF8toUTF32( utf8string, &walked));

// and to move ahead to where the next character should start:
UTF8 *nextChar8 = utf8string + walked;

To convert a single UTF16 code point:

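// nest the 2 conversion functions, to get a utf8
UTF8 singleChar8[3];   // receives up to 3 bytes, not NULL terminated
U32 utf8charLen;
U32 walked;
// pass the pointers themselves: utf16string is already a UTF16 *, and the array decays to a UTF8 *
utf8charLen = oneUTF32toUTF8( oneUTF16toUTF32( utf16string, &walked), singleChar8 );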

// remember, utf-16 is a variable length format... it is possible that we walked 2 UTF16 code units...
UTF16 *nextChar16 = utf16string + walked;

Standard Library Functions

One of the really nice features about UTF-8 is that you can use most of the standard library string functions with it. The string comparison functions are guaranteed to function the same for UTF-8 and ASCII, though in Torque you may have to cast UTF8 strings to const char *.
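
A short sketch of what this means in practice, using only the C standard library. The helper names are hypothetical and stand in for however your own code wraps these calls. Note the two limits it illustrates: comparison is byte-wise, so two strings are equal only if their encoded bytes are identical, and strlen() reports bytes rather than code points.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helpers, not Torque APIs.

// Byte-wise equality works unchanged on UTF-8 strings.
bool utf8Equal(const char *a, const char *b)
{
   assert(a != nullptr && b != nullptr);
   return std::strcmp(a, b) == 0;
}

// strlen() counts bytes up to the NULL terminator, which is the right
// number for sizing a copy but not for counting characters.
std::size_t utf8ByteLength(const char *s)
{
   assert(s != nullptr);
   return std::strlen(s);
}
```

strcmp() compares bytes as unsigned values, and UTF-8 is designed so that this byte order matches code point order, which means sorting with strcmp() stays consistent between ASCII and non-ASCII text.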

As use of UTF-16 internally is fairly new, only those library functions needed by the platform layer have UTF-16 versions. At present, these are dStrcmp() and dStrlen().

OS Considerations

Win98 needs the Microsoft Layer for Unicode (MSLU). XP needs the appropriate language packs installed.

OS X works out of the box for Unicode.

Linux is an unknown at the moment.

Integration with Localization

Localization is already UTF-8 safe, so there is not much to say.