June 93 - WRITING LOCALIZABLE APPLICATIONS
WRITING LOCALIZABLE APPLICATIONS
JOSEPH TERNASKY AND BRYAN K. ("BEAKER") RESSLER
More and more software companies are finding rich new markets overseas.
Unfortunately, many of these developers have also discovered that localizing an
application involves a lot more than translating a bunch of STR# resources. In fact,
localization often becomes an unexpectedly long, complex, and expensive development
cycle. This article describes some common problems and gives proactive engineering
advice you can use during initial U.S. development to speed your localization efforts
later on.
Most software localization headaches are associated with text drawing and character handling, so
that's what this article stresses. Four common areas of difficulty are:
- keyboard input (specifically for two-byte scripts)
- choice of fonts and sizes for screen display
- date, time, number, and currency formats and sorting order
- character encodings
We discuss each of these potential pitfalls in detail and provide data structures and example code.
PRELIMINARIES
Throughout the discussion, we assume you're developing primarily for the U.S. market, but you're
planning to publish internationally eventually (or at least you're trying to keep your options open). As
you're developing your strategy, here are a few points to keep in mind:
- Don't dismiss any markets out of hand -- investigate the potential rewards for
entry into a particular market and the features required for that market.
- The amount of effort required to support western Europe is relatively small.
Depending on the type of application you're developing, the additional effort
required for other countries isn't that much more. There's also a growing market
for non-Roman script systems inside the U.S.
- The labor required to build a truly global program is much less if you do the work
up front, rather than writing quick-and-dirty code for the U.S. and having to
rewrite it later.
- Consider market growth trends. A market that's small now may be big later.
This article concentrates on features for western Europe and Japan because those are the markets
we're most familiar with. We encourage you to investigate other markets on your own.
LINGO LESSON 101
This international software thing is rife with specialized lingo. For a complete explanation of all the
terms, see the hefty "Worldwide Software Overview," Chapter 14 of Inside Macintosh Volume VI. But
we're not here to intimidate, so let's go over a few basic terms.
Script. A writing system that can be used to represent one or more human languages. For example,
the Roman script is used to represent English, Spanish, Hungarian, and so on. Scripts fall into several
categories, as described in the next section, "Script Categories."
Script code. An integer that identifies a script on the Macintosh.
Encoding. A mapping between characters and integers. Each character in the character set is assigned
a unique integer, called its character code. If a character appears in more than one character set, it may
have more than one encoding, a situation discussed later in the section "Dealing With Character
Encodings." Since each script has a unique encoding, sometimes the terms script and encoding are used
interchangeably.
Character code. An integer that's associated with a given character in a script.
Glyph. The displayed form of a character. The glyph for a given character code may not always be
the same -- in some scripts the codes of the surrounding characters provide a context for choosing a
particular glyph.
Line orientation. The overall direction of text flow within a line. For instance, English has left-to-right
line orientation, while Japanese can use either top-to-bottom (vertical) or left-to-right (horizontal)
line orientation.
Character orientation. The relationship between a character's baseline and the line orientation. When
the line orientation and the character baselines go in the same direction, it's called with-stream
character orientation. When the line orientation differs from the character baseline direction, it's
called cross-stream character orientation. For instance, in Japanese, when the line orientation is left-
to-right, characters are also oriented left-to-right (with-stream). Japanese can also be formatted with
a top-to-bottom (vertical) line orientation, in which case character baselines can be left-to-right
(cross-stream) or top-to-bottom (with-stream). See Figure 1.
Figure 1 Line and Character Orientation in Mixed Japanese/English Text
SCRIPT CATEGORIES
Scripts fall into different categories that require different software solutions. Here are the basic
categories:
- Simple scripts have small character sets (fewer than 256 characters), and no context
information is required to choose a glyph for a given character code. They have
left-to-right lines and top-to-bottom pages. Simple scripts encompass the
languages of the U.S. and Europe, as well as many other countries worldwide. For
example, some simple scripts are Roman, Cyrillic, and Greek.
- Two-byte scripts have large character sets (up to 28,000 characters) and require no
context information for glyph choice. They use various combinations of left-to-
right or top-to-bottom lines and top-to-bottom or right-to-left pages. Two-byte
scripts include the languages of Japan, China, Hong Kong, Taiwan, and Korea.
- Context-sensitive scripts have a small character set (fewer than 256 characters) but
may have a larger glyph set, since there are potentially several graphic
representations for any given character code. The mapping from a given character
code to a glyph depends on surrounding characters. Most languages that use a
context-sensitive script have left-to-right lines and top-to-bottom pages, such as
Devanagari and Bengali.
- Bidirectional scripts can have runs of left-to-right and right-to-left characters
appearing simultaneously in a single line of text. These scripts have small character
sets (fewer than 256 characters) and require no context information for glyph
choice. Bidirectional scripts are used for languages such as Hebrew that have both
left-to-right and right-to-left characters, with top-to-bottom pages.
There are a few exceptional scripts that fall into more than one of these categories, such as Arabic
and Urdu. Arabic, for instance, is both context sensitive and bidirectional.
Now with the preliminaries out of the way, we're ready to discuss some localization pitfalls.
KEYBOARD INPUT
Sooner or later, your users are going to start typing. You can't stop them. So
now what do you do?
One approach is to simply ignore keyboard input. While open-minded engineers like yourself may
find this perfectly acceptable, your Marketing colleagues probably won't. So, let's
examine what happens when two-byte script users type on their keyboards.
Obviously, a Macintosh keyboard doesn't have enough keys to allow users of two-byte script systems
to simply press the key corresponding to the one character they want out of 28,000. Instead, two-
byte systems are equipped with a software input method, also called a front-end processor or FEP, which
allows users to type phonetically on a keyboard similar to the standard U.S. keyboard. (Some input
methods use strokes or codes instead of phonetics, but the mechanism is the same.)
As soon as the user begins typing, a small input window appears at the bottom of the screen. When
the user signals the input method, it displays various readings that correspond to the typed input.
These readings may include one or more two-byte characters. There may be more than one valid
reading of a given "clause" of input, in which case the user must choose the appropriate reading.
When satisfied, the user accepts the readings, which are then flushed from the input window and sent
to the application as key-down events. Since the Macintosh was never really designed for two-byte
characters, a two-byte character is sent to the application as two separate one-byte key-down events.
Interspersed in the stream of key-down events there may also be one-byte characters, encoded as
ASCII.
Before getting overwhelmed by all this, consider two important points. First, the input method is taking
the keystrokes for you. The keystrokes the user types are not being sent directly into your application
-- they're being processed first. Also, since the user can type a lot into the input method before
accepting the processed input, you can get a big chunk of key-down events at once.
So let's see what your main event loop should look like in its simplest form if you want to properly
accept mixed one- and two-byte characters:
// Globals
unsigned short gCharBuf; // Buffer that holds our (possibly
// two-byte) character
Boolean gNeed2ndByte; // Flag that tells us we're waiting
// for the second byte of a two-byte
// character
void EventLoop(void)
{
EventRecord event; // The current event
short cbResult; // The result of our CharByte call
unsigned char oneByte; // Single byte extracted from event
Boolean processChar; // Whether we should send our
// application a key message
if (WaitNextEvent(everyEvent, &event, SleepTime(), nil)) {
switch (event.what) {
. . .
case keyDown:
case autoKey:
. . .
// Your code checks for Command-key equivalents here.
. . .
processChar = false;
oneByte = (event.message & charCodeMask);
if (gNeed2ndByte) {
// We're expecting the second byte of a two-byte
// character. So OR the byte into the low byte of
// our accumulated two-byte character.
gCharBuf = (gCharBuf << 8) | oneByte;
cbResult = CharByte((Ptr)&gCharBuf, 1);
if (cbResult == smLastByte)
processChar = true;
gNeed2ndByte = false;
} else {
// We're not expecting anything in particular. We
// might get a one-byte character, or we might
// get the first byte of a two-byte character.
gCharBuf = oneByte;
cbResult = CharByte((Ptr)&gCharBuf, 1);
if (cbResult == smFirstByte)
gNeed2ndByte = true;
else if (cbResult == smSingleByte)
processChar = true;
}
// Now possibly send the typed character to the rest
// of the application.
if (processChar)
AppKey(gCharBuf);
break;
case . . .
}
}
}
CharByte returns smSingleByte, smFirstByte, or smLastByte. You use this information to determine
what to do with a given key event. Notice that the AppKey routine takes an unsigned short as a
parameter. That's very important. For an application to be two-byte script compatible, you need to
always pass unsigned shorts around for a single character. This example is also completely one-byte
compatible -- if you put this event loop in your application, it works in the U.S.
The example assumes that the grafPort is set to the document window and the port's font is set
correctly, which is important because the Script Manager's behavior is governed by the font of the
current grafPort (see "Script Manager Caveats"). Although this event loop works fine on both one-
byte and two-byte systems, it could be made more efficient. For example, since input methods
sometimes send you a whole mess of characters at a time, you could buffer up the characters into a
string and send them wholesale to AppKey, making it possible for your application to do less
redrawing on the screen.
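If you decide to buffer characters that way, the bookkeeping can be as simple as the following sketch. Everything here -- the CharBuffer type, its size, and the flush-on-full protocol -- is our own invention for illustration; your application's text engine will dictate the real shape.

```c
#define kCharBufMax 255

/* A hypothetical accumulator for completed (possibly two-byte)
   characters. Instead of calling AppKey once per character, the
   event loop appends characters here and hands the whole chunk to
   the application at the end of the event burst. */
typedef struct {
    unsigned char bytes[kCharBufMax];
    short         count;
} CharBuffer;

/* Append a completed character to the buffer. Returns 1 on success,
   0 if the buffer is full -- in that case the caller should flush
   the chunk to the application, reset count to 0, and retry. */
static int BufferChar(CharBuffer *buf, unsigned short ch, int isTwoByte)
{
    short need = isTwoByte ? 2 : 1;
    if (buf->count + need > kCharBufMax)
        return 0;
    if (isTwoByte)
        buf->bytes[buf->count++] = (unsigned char)(ch >> 8);
    buf->bytes[buf->count++] = (unsigned char)(ch & 0xFF);
    return 1;
}
```

Because the chunk preserves the original byte order, the buffered text is still a valid mixed one- and two-byte string, so routines like CharByte work on it directly.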
AVOIDING FONT TYRANNY
Have you ever written the following lines of code?
void DrawMessage(short messageNum)
{
Str255 theString;
GetIndString(theString, kMessageStrList, messageNum);
TextFont(geneva);
TextSize(9);
MoveTo(kMessageXPos, kMessageYPos);
DrawString(theString);
}
If so, you're overdue for a good spanking. While we're very proud of you for putting that string into
a resource like a good international programmer, the font, size, and pen position are a little too, well,
specific. Granted, it's hard to talk yourself out of using all those nice constants defined in Fonts.h,
but if you're trying to write a localizable application, this is definitely the wrong approach.
A better approach is to do this:
TextFont(applFont);
TextSize(0);
GetFontInfo(&fontInfo);
MoveTo(kMessageXPos, kMessageYMargin + fontInfo.ascent +
fontInfo.leading);
Since applFont is always a font in the system script, and TextSize(0) gives a size appropriate to the
system script, you get the right output. Plus, you're now positioning the pen based on the font,
instead of using absolute coordinates. This is important. For instance, on a Japanese system
TextSize(0) results in a point size of 12, so the code in the preceding example might not work if the
pen-positioning constants were set up to assume a 9-point font height.
If you want to make life even easier for your localizers, you could eliminate the pen-positioning
constants altogether. Instead, use an existing resource type (the 'DITL' type is appropriate for this
example) to store the layout of the text items in the window. Even though you're drawing the items
yourself, you can still use the information in the resource to determine the layout, and the localizers
can then change the layout using a resource editor -- which is a lot better than hacking your code.
There are some other interesting ways to approach this problem. Depending on what you're
drawing, the Script Manager may be able to tell you both which font and which size to use. Suppose
you need to draw some help text. You can use the following code:
void DrawHelpText(Str255 helpText, Rect *helpZone)
{
long fondSize;
fondSize = GetScript(smSystemScript, smScriptHelpFondSize);
TextFont(HiWord(fondSize));
TextSize(LoWord(fondSize));
NeoTextBox(&helpText[1], helpText[0], helpZone, GetSysJust(),
0, nil, nil);
}
Here the Script Manager tells you the appropriate font and size for help text. On a U.S. system, that
would be Geneva 9; on a Japanese system, it's Osaka 9. NeoTextBox is a fast, flexible replacement for
the Toolbox routine TextBox and is Script Manager compatible. You can learn more about
NeoTextBox by reading "The TextBox You've Always Wanted" in develop Issue 9.
The Script Manager has some other nice combinations:
smScriptMonoFondSize // Default monospace font and size (use when
// you feel the urge to use Courier 12)
smScriptSmallFondSize // Default small font and size (use when you
// feel the urge to use Geneva 9)
smScriptSysFondSize // Default system font and size (use when you
// feel the urge to use Chicago 12)
smScriptAppFondSize // Default application font and size (use as
// default document font)
The various FondSize constants are available only in System 7. If you're writing
for earlier systems, you should at least use GetSysFont, GetAppFont, and GetDefFontSize, as
described in Chapter 17 of Inside Macintosh Volume V. And if you're too lazy to do even that, please
use TextFont(0) and TextSize(0) to get the system font, which will be appropriate for the system
script. This is, by the way, how grafPorts are initialized by QuickDraw. In other words, if you don't
touch the port, it will already be set up correctly for drawing text in the system script.
INTERNATIONAL DATING
Before you get too excited, you should know that we're not talking about the true-love variety of date
here. No, we're talking about something much more tedious -- input and output of international
dates, times, numbers, and currency values. First we'll look at output formatting, and then input
parsing.
OUTPUT OF DATES, TIMES, NUMBERS, AND CURRENCY VALUES
To output dates, times, numbers, and currency values (which we'll call formatted values), you need to
know the script you're formatting for. This can be a user preference, or you can determine the script
from the current font of the field associated with the value you're formatting (use Font2Script).
You can use these International Utilities routines to format dates, times, and numbers:
- Use IUDateString for formatting a date.
- Use IUTimeString for formatting a time.
- Use NumToString for simple numbers without separators.
- Use Str2Format and FormatX2Str for complete number formatting with
separators.
Formatting a currency value is a bit trickier. You have to format the number and then add the
currency symbol in the right place. We'll show you how to get the currency symbol and the
positioning information from the 'itl0' resource.
First, let's look at an example of date and time formatting:
#define kWantSeconds true // For IUTimeString
#define kNoSeconds false
unsigned long secs;
Str255 theDate, theTime;
// Get the current date and time into Pascal strings.
GetDateTime(&secs);
IUDateString(secs, shortDate, theDate);
IUTimeString(secs, kNoSeconds, theTime);
Formatting a number with FormatX2Str is a little more complicated, because FormatX2Str requires
a canonical number format string (type NumFormatString) that describes the output format. You
make a NumFormatString by converting a literal string, like
##,###.00;-##,###.00;0.00
The strings are in the format
positiveFormat;negativeFormat;zeroFormat
where the last two parts are optional. The example string would format the number 32767 as
32,767.00, -32767 as -32,767.00, and zero as 0.00. The exact format of these strings can be quite
complicated and is described in Macintosh Worldwide Development: Guide to System Software.
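To make the three-part structure concrete, here's a standalone sketch that merely selects which sub-format applies to a given value. It does no actual formatting (that's the job of Str2Format and FormatX2Str), and the fall-back-to-positive behavior for missing parts is our simplification.

```c
#include <string.h>

/* A sketch, not FormatX2Str: given a "pos;neg;zero" format spec,
   copy the sub-format that applies to a value into out. The last
   two parts are optional; when one is missing we simply fall back
   to the positive format. */
static void PickFormat(const char *spec, double value,
                       char *out, int outMax)
{
    const char *parts[3] = { spec, NULL, NULL };
    const char *p;
    int         n = 1, i, want;

    /* Split the spec at the semicolons (up to three parts). */
    for (p = spec; *p; p++)
        if (*p == ';' && n < 3)
            parts[n++] = p + 1;

    want = (value > 0) ? 0 : (value < 0) ? 1 : 2;
    if (want >= n)
        want = 0;               /* optional part missing: use positive */

    p = parts[want];
    for (i = 0; i < outMax - 1 && p[i] && p[i] != ';'; i++)
        out[i] = p[i];
    out[i] = '\0';
}
```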
The following handy routine formats a number using a format read from a string list. You provide
the string list resource and specify which item in the list to use when formatting a given number.
OSErr FormatANum(short theFormat, extended theNum, Str255 theString)
{
NItl4Handle itl4;
OSErr err;
NumberParts numberParts;
Str255 textFormatStr; // "Textual" number format spec
NumFormatString formatStr; // Opaque number format
// Load the 'itl4' and copy the NumberParts record out of it.
itl4 = (NItl4Handle)IUGetIntl(4);
if (itl4 == nil)
return resNotFound;
numberParts = *(NumberParts *)((char *)*itl4 +
(*itl4)->defPartsOffset);
// Get the format string, convert it to a NumFormatString, and
// then use it to format the input number.
GetIndString(textFormatStr, kFormatStrs, theFormat);
err = Str2Format(textFormatStr, &numberParts, &formatStr);
if (err != noErr)
return err;
err = FormatX2Str(theNum, &formatStr, &numberParts, theString);
return err;
}
Given a currency value, the following routine formats the number and then adds the currency symbol
in the appropriate place. This routine assumes that you use a particular number format for currency
values, but you can easily modify it to include an argument that specifies the format item in the string
list.
OSErr FormatCurrency(extended theNum, Str255 theString)
{
Intl0Hndl itl0;
OSErr err;
Str255 currencySymbol, formattedValue;
// First, format the number like this: ##,###.00. FormatX2Str
// will replace the "," and "." separators
// appropriately for the font script.
err = FormatANum(kCurrencyFormat, theNum, formattedValue);
if (err != noErr)
return err;
// Get the currency symbol from the 'itl0' resource. The currency
// symbol is stored as up to three bytes. If any of the bytes
// aren't used they're set to zero. So, we use strncpy to copy
// out the currency symbol as a C string and forcibly terminate
// it in case it's three bytes long.
itl0 = (Intl0Hndl)IUGetIntl(0);
if (itl0 == nil)
return resNotFound;
strncpy((char *)currencySymbol, &(*itl0)->currSym1, 3);
currencySymbol[3] = 0x00;
c2pstr((char *)currencySymbol);
// Now put the currency symbol and the formatted value together
// according to the currency symbol position.
if ((*itl0)->currFmt & currSymLead) {
StringCopy(theString, currencySymbol);
StringAppend(theString, formattedValue);
} else {
StringCopy(theString, formattedValue);
StringAppend(theString, currencySymbol);
}
return noErr;
}
The 'itl0' resource also includes the decimal and thousands separators. These should be the same
values used by FormatX2Str, which gets these symbols from the NumberParts structure in the 'itl4'
resource.
If using the extended type in your application makes you queasy, you can easily modify these routines
to work with the Fixed type. Just use Fix2X in the FormatX2Str call to convert the Fixed type to
extended.
INPUT OF DATES, TIMES, AND NUMBERS
The Script Manager includes routines for parsing formatted values to retrieve a date, time, or
number. The process is logically the reverse of formatting a value for output. Most applications don't
even deal with formatted numbers. They just read raw numbers (no thousands separators or currency
symbols), locate the decimal separator, convert the integer and fraction parts using StringToNum,
and then put the integer and fraction parts back together.
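As a sketch of that split-and-convert approach, the following standalone routine parses a raw number given the decimal separator, which a real application would read from the 'itl0' resource. We've used the standard C library in place of the Toolbox conversion call so that the example stands alone; the name ParseRawNumber is ours.

```c
#include <stdlib.h>
#include <string.h>

/* Split text at the script's decimal separator, convert the integer
   and fraction parts separately, and recombine them. */
static double ParseRawNumber(const char *text, char decimalSep)
{
    char        intPart[32], fracPart[32];
    const char *sep = strchr(text, decimalSep);
    double      value, scale = 1.0;
    size_t      i, n;

    if (sep == NULL)
        return (double)atol(text);      /* no fraction part at all */

    n = (size_t)(sep - text);
    if (n >= sizeof intPart)
        n = sizeof intPart - 1;
    memcpy(intPart, text, n);
    intPart[n] = '\0';
    strncpy(fracPart, sep + 1, sizeof fracPart - 1);
    fracPart[sizeof fracPart - 1] = '\0';

    value = (double)atol(intPart);
    for (i = 0; i < strlen(fracPart); i++)
        scale *= 10.0;
    /* Keep the fraction on the same side of zero as the whole value. */
    if (text[0] == '-')
        value -= (double)atol(fracPart) / scale;
    else
        value += (double)atol(fracPart) / scale;
    return value;
}
```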
DEALING WITH CHARACTER ENCODINGS
When writing Macintosh applications, most developers make certain assumptions that cause
problems when the application is used in other countries. One of these assumptions is that all
characters are represented by a single byte; another is that a given character code always represents
the same character. The first assumption causes immediate problems because the two-byte script
systems use both one-byte and two-byte character codes. An application that relies on one-byte
character codes often breaks up a two-byte character into two one-byte characters, rendering the
application useless for two-byte text. The second assumption causes more subtle problems, which
prevent the user from mixing text in several different scripts together in one document.
Different versions of the Macintosh system software use a different script by default. Systems sold in
the U.S. and Europe use the Roman script. Those sold in Japan, Hong Kong, or Korea use the
Japanese, traditional Chinese, or Korean script, respectively. In addition, some sophisticated users
have several script systems installed at one time, and System 7.1 makes this even easier. Actually, even
unsophisticated users can have two script systems installed at one time. All systems have the Roman
script installed, so Japanese users, for example, have both the Japanese and the Roman script
available.
For an application to work correctly with any international system software, it must be able to handle
different character encodings simultaneously. That is, the user should be able to enter characters in
different scripts and edit the text without damaging the associated script information. This section
discusses three ways to handle character encodings. These methods require different amounts of
effort to implement and provide different capabilities. Of course, those that require the most effort
also provide the most flexibility and power for your users. Before we discuss these methods, let's
define some more terms.
Language. A human language that's written using a particular script. Several languages can share the
same script. For example, the Roman script is used by English, French, German, and so on. It's also
possible for the same language to be written in more than one script, although that's a rare exception.
Alphabet, syllabary, ideograph set. A collection of characters used by a language. Some scripts include
more than one of these collections. As a simple example, the Roman script includes both an
uppercase and a lowercase alphabet. As a more complicated example, the Japanese script includes the
Roman alphabet, the Hiragana and Katakana syllabaries, and the Kanji ideograph set. An alphabet,
syllabary, or ideograph set isn't necessarily encoded in the same way in two different scripts. For
example, the Roman alphabet in the Roman script uses one-byte codes, but the Roman alphabet in
the Japanese script uses either one-byte or two-byte codes.
Segment. A subset of an encoding that may be shared by one or more scripts. For example, the simple
(7-bit) ASCII characters make up a segment that's shared by all the scripts on the Macintosh.
Characters in this segment have the same code in any Macintosh encoding.
Unicode. An international character encoding that encompasses all the written languages of the world.
Each character is assigned a unique 16-bit integer. Unicode is a unified encoding -- all characters that
have the same abstract shape share a common character code, even if they're used in more than one
language.
METHOD 1: NATIVE ENCODING
The easiest method is to simply pick one character encoding for your localization and stick with it
throughout the application. This is usually the native character encoding for the country (and
language) that you're targeting with the localized application. For example, if you're
localizing an application for the Japanese market, you choose the shift-JIS (Shifted Japanese Industrial Standard)
character encoding and modify all your text-handling routines to use this encoding.
The shift-JIS encoding uses both one-byte and two-byte character codes, so you need to use the
Script Manager's CharByte routine whenever you're stepping through a string. For a random byte in
a shift-JIS encoded string, CharByte tells you if the byte represents a one-byte character, the low byte
of a two-byte character, or the high byte of a two-byte character. You also have to handle two-byte
characters on input (as described earlier in the section "Keyboard Input") and use the native system
and application fonts for text (as described in the section "Avoiding Font Tyranny").
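CharByte is the portable way to do this, since it works for whatever script the current font belongs to. Purely as an illustration of the decision it's making, here's a standalone sketch hard-wired to the shift-JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC, which come from the shift-JIS standard, not from this article):

```c
/* A byte in these ranges begins a two-byte shift-JIS character;
   everything else stands alone (7-bit ASCII or one-byte Katakana).
   In a real application use CharByte, which handles every
   installed script, not just shift-JIS. */
static int IsShiftJISLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) in a shift-JIS string by stepping
   one or two bytes at a time. */
static long CountShiftJISChars(const unsigned char *p, long byteLen)
{
    long chars = 0;
    while (byteLen > 0) {
        long step = (IsShiftJISLeadByte(*p) && byteLen >= 2) ? 2 : 1;
        p += step;
        byteLen -= step;
        chars++;
    }
    return chars;
}
```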
To summarize, the native encoding method has a few advantages:
- It's very easy to implement, so most of your code will work with simple
modifications.
- Since you're using the native encoding, the users in the country for which you're
localizing will be able to manipulate text using the conventions of the native
language.
- Every encoding includes the simple (7-bit) ASCII characters, so they'll also be able
to use English.
Unfortunately, this method has many disadvantages:
- You have to create one version of the application for every localization that you
do. Each version will use a different native encoding.
- Documents created with one version of the application can't necessarily be used
with another version of the application. For example, a document created with the
Japanese version that includes two-byte Japanese text will be displayed incorrectly
when opened with the French version.
- The user doesn't have access to all the characters in the Roman script (extended
ASCII encoding) because these are also used by the native encoding. For example,
the ASCII accented characters and extended punctuation use character codes that
are also used by the one-byte Katakana syllabary in the shift-JIS encoding.
Remember, even the simple international systems really use two scripts -- the
native script and the Roman script.
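The last point is easy to see just by counting codes. In shift-JIS, nearly every byte value above 0x7F is spoken for -- either as one-byte Katakana or as the lead byte of a two-byte character -- leaving almost nothing free to mean its extended Roman character. The ranges below come from the shift-JIS standard; the counting function itself is ours.

```c
/* Is this byte value claimed by shift-JIS for something other than
   the shared 7-bit ASCII segment? */
static int ShiftJISClaims(unsigned char b)
{
    if (b < 0x80)
        return 0;                           /* shared 7-bit ASCII   */
    if (b >= 0xA1 && b <= 0xDF)
        return 1;                           /* one-byte Katakana    */
    if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC))
        return 1;                           /* two-byte lead byte   */
    return 0;
}

/* How many of the 128 extended codes are left for Roman use? */
static int CountFreeExtendedCodes(void)
{
    int b, freeCodes = 0;
    for (b = 0x80; b <= 0xFF; b++)
        if (!ShiftJISClaims((unsigned char)b))
            freeCodes++;
    return freeCodes;
}
```

Only five byte values survive, which is why the extended Roman characters are effectively unavailable under the native encoding method.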
METHOD 2: MULTIPLE ENCODINGS
The most complete method for handling character encodings is to keep track of the encoding for
every bit of text that your application stores. In this method the encoding (or the script code) is
stored with a run of text just like a font family, style, or point size. The first step is to determine
which languages you may want to support. Once this is determined, you can decide which encodings
are necessary to implement support for those languages. For example, suppose your Marketing
department wants to do localized versions for French, German, Italian, Russian, Japanese, and
Korean. French, German, and Italian all use the Roman script. Russian uses the Cyrillic script;
Japanese uses the Japanese script; and Korean uses the Korean script. To support these languages,
you have to handle the Roman, Cyrillic, Japanese, and Korean encodings.
In general, each script that you include requires support for its encoding and, possibly, additional
features that are specific to that script. For example, Japanese script can be drawn left-to-right or top-
to-bottom, so a complete implementation would handle vertical text. There are other features
specific to the Japanese script (amikake, furigana, and so on) that you may also want to implement.
If any of the encodings include two-byte characters, the data structures that you use to represent text
runs must be able to handle two-byte codes. When you're processing a text run, the encoding of that
run determines how you can treat the characters in the run. For example, you can munge a Roman
text run in the usual way, safe and secure in the familiar world of one-byte character codes. In
contrast, your dealings with Japanese text runs may be wrought with angst since these runs can
include both one-byte and two-byte characters in any combination. The Script Manager is designed
to support applications that tag text runs with a script code. As long
as the font of the current grafPort is set correctly, all the Script Manager routines work with the
correct encoding for that script. For example, if you specify a Japanese font in the current grafPort,
the Script Manager routines assume that any text passed to them is stored in the shift-JIS encoding.
Keyboard script. During this discussion of the multiple encodings method, we've been assuming that
you already know the script (and therefore the encoding) of text that the user has entered. How
exactly do you know this? The Script Manager keeps track of the script of the text being entered
from the keyboard in a global variable. Your application should read this variable programmatically
after receiving a keyboard event, as follows:
short keyboardScript;
keyboardScript = GetEnvirons(smKeyScript);
Once you know the keyboard script, make sure that this information stays with the character as it
becomes part of a text run. If the keyboard script is the same as the script of the text run, you can just
add this character to the text run. Otherwise, you must create a new text run, tag it with the keyboard
script, and place the character in it.
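Here's a sketch of that bookkeeping with a deliberately simple run store. The TextRun and RunList types, their fixed sizes, and EnterChar are all invented for illustration; a real application would use its own text-run data structures.

```c
#define kMaxRuns  16
#define kRunBytes 64

/* One run of text in a single script. */
typedef struct {
    short         script;
    unsigned char bytes[kRunBytes];
    short         count;
} TextRun;

typedef struct {
    TextRun runs[kMaxRuns];
    short   runCount;
} RunList;

/* Append one byte of typed text. If the keyboard script matches the
   run being typed into, extend it; otherwise start a new run tagged
   with the keyboard script. (Overflow checks omitted for brevity.) */
static void EnterChar(RunList *list, short keyboardScript,
                      unsigned char byte)
{
    TextRun *run;
    if (list->runCount == 0 ||
            list->runs[list->runCount - 1].script != keyboardScript) {
        run = &list->runs[list->runCount++];
        run->script = keyboardScript;
        run->count = 0;
    } else {
        run = &list->runs[list->runCount - 1];
    }
    run->bytes[run->count++] = byte;
}
```

Note that both bytes of a two-byte character arrive with the same keyboard script, so they always land in the same run.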
You can also set the keyboard script directly when the user selects text with the mouse or changes the
current font. The question is, which script do you set the keyboard to use? That depends on the font
of the selected text or the new font the user has chosen. The first step is to convert the font into a
script and then use the resulting script code to set the keyboard script. This process is known as
keyboard forcing.
short fontScript;
fontScript = Font2Script(myFontID);
KeyScript(fontScript);
The user can always change the keyboard script by clicking the keyboard icon (in System 6) or by
choosing a keyboard layout from the Keyboard menu (in System 7). As a result, you're no longer sure
that the keyboard script and the font script agree when the user actually types something. You should
always check the keyboard script against the font script before entering a typed character into a text
run. If the keyboard script and the font script don't agree, a new current font is derived from the
keyboard script. This process is known as font forcing.
short fontScript, keyboardScript;
fontScript = Font2Script(myFontID);
keyboardScript = GetEnvirons(smKeyScript);
if (fontScript != keyboardScript)
myFontID = GetScript(keyboardScript, smScriptAppFond);
The combination of keyboard forcing and font forcing is called font/keyboard synchronization. Both
keyboard and font forcing should be optional; the user should be able to turn these features off with a
preferences setting.
Changing fonts. An application that works with multiple encodings must pay special attention to font
changes. For each text run in the selection, the application should check the script of the text run
against the script of the new font. If the scripts agree, the text run can use the new font. If the scripts
don't agree, the application can either ignore the new font for that text run or apply some special
processing.
short fontScript;
short textRunIndex, textRunCount;
fontScript = Font2Script(myNewFontID);
for (textRunIndex = 0;
textRunIndex < textRunCount;
textRunIndex++) {
if (textRunStore[textRunIndex].script == fontScript)
textRunStore[textRunIndex].fontID = myNewFontID;
else
SpecialProcessing(&textRunIndex, &textRunCount, myNewFontID);
}
All the encodings used by the Macintosh script systems include the simple (7-bit) ASCII characters,
so it's often possible to convert these characters from one script to another. The special processing
consists of these two steps:
1. Breaking a text run into pieces, some of which contain only simple ASCII
characters and others that contain all the characters not included in simple ASCII
2. Applying the new font to the runs that contain only simple ASCII characters and
leaving the other runs with the old font
Boolean FindASCIIRun(unsigned char *textPtr, long textLength,
long *runLength)
{
*runLength = 0;
if (*textPtr < 0x80) {
// We know that this character is simple ASCII, since values
// less than 128 can't be the first byte of a two-byte
// character, and they're shared among all scripts. So, let's
// block up a run of simple ASCII.
while (textLength > 0 && *textPtr < 0x80) {
textPtr++;
textLength--;
(*runLength)++;
}
return true; // Run is simple ASCII.
} else {
// We know this character is not simple ASCII. It may be
// two-byte or it may be some character in a non-Roman
// script. So, let's block up a run of non-simple-ASCII
// characters.
while (textLength > 0) {
if (CharByte(textPtr, 0) == smFirstByte) {
// Skip over two-byte character.
textPtr += 2;
textLength -= 2;
*runLength += 2;
} else if (*textPtr >= 0x80) {
// Skip over one-byte character.
textPtr++;
textLength--;
(*runLength)++;
} else
break;
}
return false; // Run is NOT simple ASCII.
}
}
void SpecialProcessing(short *runIndex, short *runCount,
short myNewFontID)
{
TextRunRecord originalRun, createdRun;
unsigned char *textPtr;
long textLength, runLength, runFollow;
Boolean simpleASCII;
// Retrieve this run and remove it from the run list.
GetTextRun(*runIndex, &originalRun);
RemoveTextRun(*runIndex);
// Get the pointer and length of the original text.
textPtr = originalRun.text;
textLength = originalRun.count;
// Loop through all of the sub-runs in this run.
runFollow = *runIndex;
while (textLength > 0) {
// Find the length of the sub-run and its type.
TextFont(originalRun.fontID);
simpleASCII = FindASCIIRun(textPtr, textLength, &runLength);
// Create the sub-run and duplicate the characters.
createdRun = originalRun; // Same formats.
createdRun.text = NewPtr(runLength);
// Real programs check for nil pointer here.
createdRun.count = runLength;
BlockMove(textPtr, createdRun.text, runLength);
// Roman runs can use the new font.
if (simpleASCII)
createdRun.fontID = myNewFontID;
// Add the new sub-run and advance the run index.
AddTextRun(runFollow++, createdRun);
// Advance over this sub-run and continue looping.
textPtr += runLength;
textLength -= runLength;
}
// Dispose of the original run information.
DisposeTextRun(originalRun);
}
Searching and sorting. Applications that work with multiple encodings must also take care during
searching or sorting operations. An application that uses only the native encoding can assume that
character codes are unique and that any two text runs can be compared directly using the sorting
routines in the International Utilities Package. On the other hand, an application that uses multiple
encodings must always consider a character code within the context of a text run and its associated
script. In this case, character codes are unique only within a script, not across script boundaries, so
text runs can't be compared directly using International Utilities routines unless they have the same
script. If the script codes are different, the International Utilities routines provide a mechanism for
first ordering the scripts themselves (IUScriptOrder).
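The shape of such a comparison can be sketched as follows. This is a self-contained illustration, not Toolbox code: ScriptOrder stands in for IUScriptOrder, and the same-script case uses a plain byte comparison where a real application would call the International Utilities string-comparison routines.

```c
#include <string.h>

/* Hypothetical script codes, standing in for the Script Manager's. */
enum { kScriptRoman = 0, kScriptJapanese = 1 };

/* Stand-in for IUScriptOrder: order the scripts themselves. */
static int ScriptOrder(short script1, short script2)
{
    if (script1 < script2) return -1;
    if (script1 > script2) return 1;
    return 0;
}

/* Compare two text runs. Runs in the same script are compared byte
   for byte here (a real application would use the International
   Utilities routines instead of memcmp); runs in different scripts
   are ordered by their scripts. */
int CompareRuns(const unsigned char *text1, long len1, short script1,
                const unsigned char *text2, long len2, short script2)
{
    int order = ScriptOrder(script1, script2);
    long minLen;
    int byteOrder;

    if (order != 0)
        return order;        /* Different scripts: scripts decide. */
    minLen = (len1 < len2) ? len1 : len2;
    byteOrder = memcmp(text1, text2, (size_t)minLen);
    if (byteOrder != 0)
        return (byteOrder < 0) ? -1 : 1;
    if (len1 == len2)
        return 0;
    return (len1 < len2) ? -1 : 1;   /* Shorter run sorts first. */
}
```

The point is the ordering of the two tests: scripts are compared first, and only same-script runs are ever compared character by character.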
You have the same problems with searching as with sorting. In addition, search commands usually
include options for case sensitivity that could be extended in a multiple encodings application. For
example, the Japanese script includes both one-byte and two-byte versions of the Roman characters.
For purposes of searching, the user might want to consider these equivalent. The simplified Chinese
script also includes both one-byte and two-byte versions of the Roman characters, and these should
also be equivalent to Roman characters in the Roman script and in the Japanese script. Just like case
sensitivity, considering one- and two-byte versions of a character as equivalent should be an option in
your search dialog box. You can use the Script Manager's Transliterate routine to implement a byte-size insensitive
comparison. Use Transliterate to convert both the source and the target text into one-byte
characters, then compare the resulting strings. Because all the scripts share the same simple (7-bit)
ASCII character codes, this mechanism treats all the Roman characters, both one-byte and two-byte,
in every script as equivalent.
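To make the folding concrete, here's a self-contained sketch of the kind of conversion Transliterate performs for the Japanese script. It handles only the full-width Roman capitals of the Shift-JIS encoding (0x8260 through 0x8279, consistent with the 0x8260 example later in this article); FoldFullWidthCapitals and IsLeadByte are our own illustrative routines, and the real Transliterate covers far more than this.

```c
/* True if b can be the first byte of a two-byte Shift-JIS character. */
static int IsLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Fold a Shift-JIS buffer so that full-width Roman capitals (two-byte
   codes 0x8260..0x8279) become their one-byte ASCII equivalents
   'A'..'Z'. All other characters are copied through unchanged.
   Assumes src begins on a character boundary. Returns the length of
   the folded text. */
long FoldFullWidthCapitals(const unsigned char *src, long srcLen,
                           unsigned char *dst)
{
    long dstLen = 0, i = 0;
    while (i < srcLen) {
        if (i + 1 < srcLen && src[i] == 0x82 &&
                src[i+1] >= 0x60 && src[i+1] <= 0x79) {
            /* Full-width capital: map to one-byte ASCII. */
            dst[dstLen++] = (unsigned char)('A' + (src[i+1] - 0x60));
            i += 2;
        } else if (i + 1 < srcLen && IsLeadByte(src[i])) {
            dst[dstLen++] = src[i++];   /* Copy other two-byte */
            dst[dstLen++] = src[i++];   /* characters through. */
        } else {
            dst[dstLen++] = src[i++];   /* One-byte character. */
        }
    }
    return dstLen;
}
```

After folding both the source and target text this way, an ordinary byte comparison treats one-byte and two-byte Roman capitals as equivalent.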
Summary. The multiple encodings method has several advantages:
- The user can mix text in any number of scripts within one document.
- You can produce several localized versions of the application from a single code
base.
- Users in a particular region can use features intended for users in a different
region, even if the product isn't advertised to provide those features.
The disadvantages of this method are apparent from the examples:
- It's much more difficult to implement than the native encoding method.
- The two-byte scripts use mixed one-byte and two-byte encodings, so even though
you're keeping track of the script of each text run, you still need to worry about
mixed character sizes within a run.
- Because some characters are duplicated between scripts, you need to treat their
corresponding character codes as equivalent. This further complicates the basic
algorithms you use for text editing, searching, sorting, and so on.
METHOD 3: POOR MAN'S UNIFICATION
Our favorite method combines the power of the multiple encodings method with the simplicity of
the native encoding method. The idea is to create a single "native" encoding that encompasses all the
scripts included in the multiple encodings method. In the multiple encodings method, some
characters are encoded several times: a character can have the same code value in different scripts, or
it can have different code values in the same script. For example, the letter A has the code value 0x41
in the Roman script and the same one-byte code value in Japanese and traditional Chinese. However,
Japanese also encodes the letter A as the two-byte value 0x8260, and traditional Chinese also encodes
it as the two-byte value 0xA2CF. A unified encoding would map all of the identical characters in the
multiple encodings to one unique code value.
You might have noticed that this method has some of the same goals as the Unicode scheme -- a
single character encoding for all languages with one unique code for every character. Unicode
extends this goal to the unification of the two-byte scripts. Characters that have the same abstract
shape in the simplified Chinese, traditional Chinese, Japanese, and Korean scripts have been grouped
together as a single character under Unicode. Our method doesn't go that far. We unify the simple
ASCII characters from all scripts but leave the various two-byte scripts to their unique encodings.
Thus the name for this method -- poor man's unification.
Segments. The poor man's unification method relies on the concept of a segment.
A segment is a subset of an encoding with characters that are all the same byte size. For example, the
Roman script is divided into two segments -- the simple ASCII segment and the extended
ASCII/European segment. The Japanese script has three segments -- the simple ASCII segment, the
one-byte Katakana segment, and the two-byte segment (including symbols, Hiragana, Katakana,
Roman, Cyrillic, and Kanji).
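Because every character in a segment has the same byte size, a simple lookup can answer the "how wide is this character?" question for an entire run. Here's a sketch using the four sample segment constants defined below (the constants and the SegmentCharSize routine itself are ours, not Script Manager API):

```c
#define asciiSegment    0
#define europeanSegment 1
#define katakanaSegment 2
#define japaneseSegment 3

/* Bytes per character for each segment. Only the two-byte Japanese
   segment uses two bytes per character; the simple ASCII, extended
   ASCII/European, and one-byte Katakana segments are all one byte. */
short SegmentCharSize(short segment)
{
    switch (segment) {
        case japaneseSegment:
            return 2;
        case asciiSegment:
        case europeanSegment:
        case katakanaSegment:
        default:
            return 1;
    }
}
```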
The key to poor man's unification is the simple ASCII segment. This segment is shared among all
the scripts on the Macintosh (see Figure 2). Furthermore, poor man's unification treats the various
encodings as collections of segments that can be shared among encodings. There's logically only one
simple ASCII segment, and all the scripts share it. In the multiple encodings method, characters in
this range could be found in each script. That is, the word "Beaker" could be stored in both a Roman
text run and in a Japanese text run (as one-byte ASCII). In contrast, an application that uses poor
man's unification would tag text runs with the segment, not the script, so these two occurrences of
"Beaker" would be indistinguishable.
Figure 2 Scripts Sharing the ASCII Segment
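The text run records used in the fragments throughout this article might look like the following. This is only one possible layout; the field names match the code in this article, but the exact record is up to your application.

```c
typedef struct TextRunRecord {
    unsigned char  *text;     /* Pointer to the run's characters. */
    long           count;     /* Number of bytes in the run. */
    short          fontID;    /* Font family ID used to draw the run. */
    short          segment;   /* Segment tag: with poor man's
                                 unification, runs are tagged with the
                                 segment rather than the script. */
} TextRunRecord;
```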
The best way to see the advantages of this method is to solve a problem we already considered --
changing the font of the selection. With the multiple encodings method, this entailed breaking text
runs into smaller runs using our FindASCIIRun routine. With poor man's unification, the same
problem is much easier to solve because the runs are already divided into segments and the simple
ASCII segment is allowed to take on any font. Other segments are only allowed to use fonts that
belong to the same script they do.
#define asciiSegment 0
#define europeanSegment 1
#define katakanaSegment 2
#define japaneseSegment 3
short fontScript, runSegment;
short textRunIndex, textRunCount;
fontScript = Font2Script(myNewFontID);
for (textRunIndex = 0;
textRunIndex < textRunCount;
textRunIndex++) {
runSegment = textRunStore[textRunIndex].segment;
if (SegmentAllowedInScript(runSegment, fontScript))
textRunStore[textRunIndex].fontID = myNewFontID;
}
The special processing is gone (surprise). Once you know that a segment isn't included in the script
of the font, you can't go any further. Such a segment consists entirely of characters that aren't in the
script of the font.
Boolean SegmentAllowedInScript(short segment, short script)
{
switch (script) {
case smRoman:
switch (segment) {
case asciiSegment:
case europeanSegment:
return true;
default:
return false;
}
case smJapanese:
switch (segment) {
case asciiSegment:
case katakanaSegment:
case japaneseSegment:
return true;
default:
return false;
}
default:
switch (segment) {
case asciiSegment:
return true;
default:
return false;
}
}
}
Determining a segment from keyboard input. How do you determine the segment of a character when
it's entered from the keyboard?
1. First determine which script the character belongs to by checking the keyboard
script.
2. Then use the character-code value and the encoding definitions to assign the
character a particular segment.
#define ksJISSpace 0x8140
unsigned short keyboardScript;
unsigned short charSegment;
EventRecord lowByteEvent;
keyboardScript = GetEnvirons(smKeyScript);
charSegment = ScriptAndByteToSegment(keyboardScript, charCode);
if (charSegment == japaneseSegment) {
// Get low byte of two-byte character from keyboard.
do {
// You can get null events between two bytes of a two-byte
// character.
GetNextEvent(keyDownMask | keyUpMask | autoKeyMask,
&lowByteEvent);
if (lowByteEvent.what == nullEvent)
GetNextEvent(keyDownMask | keyUpMask | autoKeyMask,
&lowByteEvent);
} while (lowByteEvent.what == keyUp);
if ((lowByteEvent.what == keyDown) ||
(lowByteEvent.what == autoKey))
charCode = (charCode << 8) |
(lowByteEvent.message & charCodeMask);
else
// We've gotten a valid high byte under the Japanese keyboard
// with no subsequent low byte forthcoming. Something serious
// is wrong with the current input method. Return a Japanese
// space for now. Hmmmm.
charCode = ksJISSpace;
}
#define kASCIILow 0x00
#define kASCIIHigh 0x7f
#define kRange1Low 0x81
#define kRange1High 0x9f
#define kRange2Low 0xe0
#define kRange2High 0xfc
short ScriptAndByteToSegment(unsigned short script,
unsigned char byte)
{
switch (script) {
case smRoman:
if ((byte >= kASCIILow) && (byte <= kASCIIHigh))
return asciiSegment;
else
return europeanSegment;
case smJapanese:
if ((byte >= kASCIILow) && (byte <= kASCIIHigh))
return asciiSegment;
else if ((byte >= kRange1Low) && (byte <= kRange1High))
return japaneseSegment;
else if ((byte >= kRange2Low) && (byte <= kRange2High))
return japaneseSegment;
else
return katakanaSegment;
default:
// New scripts and segments added before this.
return asciiSegment;
}
}
You might think this is quite a bit of effort just to get the low byte of a two-byte character. You're
right. And as for Joe's use of the antiquated GetNextEvent instead of the more modern
WaitNextEvent, Beaker notes that "It's a cooperative multitasking world and Joe's not cooperating."
Joe replies, "Yeah, but I don't want a context switch while I'm trying to get the low byte of a two-
byte character."
Changing fonts. Applications that employ poor man's unification still have to worry about font forcing.
Here's an algorithm for "smart" font forcing that tries to anticipate which fonts the user will select
for text in each segment. When you find a case where the current keyboard script and font script
don't agree, instead of using the application font for the keyboard script, search the surrounding text
runs for a font that does agree with the keyboard script. Only if you can't find a font that agrees do
you default to the application font of the keyboard script. From the user's perspective, this is much
nicer. Once the user has selected a font for each script, the application goes back and forth between
the fonts automatically as the keyboard script is changed.
short fontScript, keyboardScript;
short textRunIndex;
fontScript = Font2Script(myFontID);
keyboardScript = GetEnvirons(smKeyScript);
// Search backward.
if (fontScript != keyboardScript) {
for (textRunIndex = currentRunIndex - 1;
textRunIndex >= 0;
textRunIndex--) {
myFontID = textRunStore[textRunIndex].fontID;
fontScript = Font2Script(myFontID);
if (fontScript == keyboardScript)
break;
}
}
// Search forward.
if (fontScript != keyboardScript) {
for (textRunIndex = currentRunIndex + 1;
textRunIndex < textRunCount;
textRunIndex++) {
myFontID = textRunStore[textRunIndex].fontID;
fontScript = Font2Script(myFontID);
if (fontScript == keyboardScript)
break;
}
}
// Punt if we couldn't find an appropriate run.
if (fontScript != keyboardScript)
myFontID = GetScript(keyboardScript, smScriptAppFond);
Applications that use font forcing also have to worry about keyboard forcing. However, if the
application includes the feature just described, keyboard forcing is not as important. Many users will
prefer to leave the keyboard completely under manual control and allow the "smart" font forcing to
choose the correct font when they start typing. The keyboard script is always visible in the menu bar,
but the current font is not.
Summary. The poor man's unification method has more advantages than the other two:
- All characters in a run belong to the same segment and therefore take the same
number of bytes for their code values. That is, any given run will be all one-byte
characters or all two-byte characters, which makes it easier to step through text for
deleting or cursor movement. This is in contrast to the multiple encodings
method, which can mix one-byte and two-byte characters in a single text run.
- Runs of simple ASCII and European characters still take one byte per character to
store. If you're working on a word processor and plan to keep large amounts of
text in memory, this can be an advantage.
- Like the multiple encodings method, this method is easy to extend as you add
more scripts to the set your application supports. Each time you add a new script,
you need to define the new segments that make up that script and then modify the
classification routines to correctly handle the new script and segment codes. Once
you locate a specification for the new encoding, these modifications should be
straightforward.
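The first advantage pays off directly in editing code: stepping through a run becomes a fixed-width walk instead of a CharByte call per character. In this sketch, charSize would come from the run's segment (2 for the two-byte Japanese segment, 1 for the others); the routines themselves are our own illustration.

```c
/* Count the characters in a run whose segment gives every character
   the same byte size. No per-character CharByte calls needed. */
long CountRunCharacters(long byteCount, short charSize)
{
    return byteCount / charSize;
}

/* Move a cursor one character backward within such a run, as a
   Backspace or left-arrow handler would. */
long PreviousCharOffset(long offset, short charSize)
{
    return (offset >= charSize) ? offset - charSize : 0;
}
```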
Unfortunately, this method has two disadvantages when compared to pure Unicode:
- You still have to deal with one-byte and two-byte characters, even though they
won't be mixed together (see "The Demise of One-Byte Characters").
- The application needs to tag each text run with a segment code because the
character codes aren't unique across segments.
Unicode does away with both of these disadvantages by making all characters two bytes wide and
insisting on one huge set of unique character codes.
PAY ME NOW OR PAY ME LATER
Perhaps the moral of the story is "globalization, not localization," as Joe says. The more
generalizations you can build into your application during initial development, the more
straightforward your localization process is destined to be, and the less your localized code base will
diverge from your original product.
Weigh the size and growth potential of a given language market against the amount of effort
required to implement that language. Stick to the markets where your product is most likely to
flourish. This article has shown that in some cases you can dramatically reduce code complexity by
taking shortcuts -- the poor man's unification scheme in this article is a good example. A healthy
balance between Script Manager techniques and custom code will help you bring your localized
product to market fast and make it a winner.
SCRIPT MANAGER CAVEATS
When you use a char to store a character or part of a character, use an unsigned char. In two-byte scripts,
the high byte of a two-byte character often has the high bit set, which would make a signed char negative,
possibly ruining your day. The same goes for the use of a short to store a full one- or two-byte character --
use an unsigned short.
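The pitfall is easy to demonstrate. This sketch uses signed char explicitly so the behavior is the same on every compiler (plain char is signed on most classic Mac compilers); 0x82 is a typical first byte of a two-byte character.

```c
/* With a signed char, (signed char)0x82 becomes -126 when promoted
   to int, so the comparison against 0x80 is always false. */
int HighBitSetSigned(signed char c)
{
    return c >= 0x80;           /* Never true: the wrong result. */
}

/* With an unsigned char, 0x82 stays 0x82 and the test works. */
int HighBitSetUnsigned(unsigned char c)
{
    return c >= 0x80;           /* True for two-byte first bytes. */
}
```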
Another important point is that most Script Manager routines rely on the font of the current grafPort for their
operation. That means you should always be sure that the port is set appropriately and that the font of the
current port is correct before making any Script Manager calls.
A new set of interfaces has been provided for System 7.1. While the old Script Manager's text routines still
work, the new routines add flexibility. For example, you can use CharacterByteType instead of CharByte.
THE DEMISE OF ONE-BYTE CHARACTERS
The point of poor man's unification is to simplify your life. On that theme, there's another technique that will
help. You can simply decide that characters are two bytes. Period. Expand one-byte characters into an
unsigned short, with the character code in the low byte and the segment code in the high byte. Then just use
unsigned shorts everywhere instead of unsigned chars. You'll find that your code gets easier to write and
easier to understand, and that lots of special cases where you would have broken everything out into one-
and two-byte cases collapse into one case.
Putting the segment code into the high byte of one-byte characters ensures that the one-byte character codes
are unique. If your program handles only one two-byte script, the two-byte codes are also unique. When
both these conditions are true, there's no need to store the segment codes in runs, since they're implied by
the high byte of each character code.
Here are a few examples using the codes from the sample segments in the section on poor man's unification:
- 0x0041 is the letter A in the asciiSegment.
- 0x0191 is the letter ë in the europeanSegment.
- 0x8140 is a two-byte Japanese space character.
In other words, one-byte characters carry their segment code in their high byte, and the two-byte characters
all belong to the same segment. You can imagine how much easier searching and sorting algorithms are if you know you can always
advance your pointers by two bytes instead of constantly calling CharByte to find out how big each
character is. Plus, you might as well get used to storing 16 bits per character, since that's how Unicode
works. Yes, it's an extra byte per character -- deal with it.
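A minimal sketch of this expansion, using the sample segment codes from the section on poor man's unification (the routines are ours; SegmentOfChar assumes a single two-byte script, as described above, so any value with a first byte of 0x81 or higher must be a two-byte Japanese character):

```c
#define asciiSegment    0
#define europeanSegment 1
#define japaneseSegment 3

/* Expand a one-byte character into an unsigned short with the segment
   code in the high byte and the character code in the low byte.
   Two-byte characters already fill both bytes. */
unsigned short ExpandOneByteChar(short segment, unsigned char code)
{
    return (unsigned short)((segment << 8) | code);
}

/* Recover the segment from a 16-bit character. One-byte characters
   carry their segment in the high byte; anything at or above 0x8100
   is a two-byte Japanese character in this single-two-byte-script
   setup. */
short SegmentOfChar(unsigned short ch)
{
    return (ch >= 0x8100) ? japaneseSegment : (short)(ch >> 8);
}
```

With this representation, the examples above fall out directly: 0x0041 expands the ASCII letter A, 0x0191 expands the European ë, and 0x8140 is already a complete two-byte Japanese space.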
RECOMMENDED READING
- Inside Macintosh Volume VI (Addison-Wesley, 1991), Chapter 14, "Worldwide Software Overview."
- Inside Macintosh Volume V (Addison-Wesley, 1986), Chapter 16, "The International Utilities Package,"
and Chapter 17, "The Script Manager."
- Inside Macintosh Volume I (Addison-Wesley, 1985), Chapter 18, "The Binary-Decimal Conversion
Package," and Chapter 19, "The International Utilities Package."
- Inside Macintosh: Text (Addison-Wesley, 1993).
- The Unicode Standard, Version 1.0, Volume 2 (Addison-Wesley, 1992).
- Macintosh Worldwide Development: Guide to System Software, APDA #M7047/B.
- Localization for Japan, APDA #R0250LL/A.
- Guide to Macintosh Software Localization, APDA #M1528LL/B.
- "The TextBox You've Always Wanted" by Bryan K. ("Beaker") Ressler, develop Issue 9.
JOSEPH TERNASKY wrote accounting software for a "Big Eight" firm until a senior partner recruited him into the Order of
the Free Masons. He showed great promise as an Adept, and the Order sent him to the Continent to continue his studies
under the notorious Aleister Crowley, founder of the Temple of the Golden Dawn. After years of study in the Great Art,
Joseph was sent back to America to accelerate the breakdown
of civil order and the Immanentizing of the Eschaton. He resumed his former identity and now spends his remaining years
adding hopelessly complicated international features to the Macintosh system software and various third-party
applications.*
BRYAN K. ("BEAKER") RESSLER (AppleLink ADOBE.BEAKER) had his arm twisted by develop editor Caroline Rose, forcing
him to write develop articles on demand. He resides in a snow cave in Tibet, where he fields questions ranging from
"Master, what is the meaning of life?" to "Master, why would anyone want to live in a Tibetan snow cave and answer
questions for free?" When he's not busy answering the queries of his itinerant clientele, he can usually be found writing
some esoteric sound or MIDI application. Back in his days of worldly endeavor, Beaker wrote some of the tools that were
used for testing Kanji TrueType fonts, and then worked on System 7 in the TrueType group. Hence the retreat to his current
colder but more enlightened and sane environment.*
System 7.1 provides a standard inline input interface and the system input method supports inline input. With inline input,
the input translation process can occur within the document window, and no input window is used. *
Amikake is a variable shading behind text. Furigana is annotation of a Kanji character that appears in small type above
the character (or to the right of the character, in the case of vertical line orientation). *
THANKS TO OUR TECHNICAL REVIEWERS Jeanette Cheng, Peter Edberg, Neville Nason, Gideon Shalom-Bendor *