Sax Appeal: Unicode and the encoding problem

Note:

This post is complemented by a later one, concerning especially encoding problems in databases used for web sites.

Introduction

Over the course of (too many) years, I have struggled with computer representation of text. Each tiem I had to solve what I call the fundamental problem o text representation I resorted to instainct and ad hoc methods. Only recently did I force myself to systematically gain a clear understanding of the theorical underpinnings of the incantations I have been using. Having done this, I thought I had to write what I'd learned (lest I forgot); once written (in Italian), I thought that, had I published it, I might have helped some poor schmuck struggling with the same problems; once published, I thought: why italians only? Hence this article. The root motivation, though, is that the subject matter is obscure (or obscured by decades of committee-generated specs) and much of the information that circulates about it is either incomprehensible, partial, or plain wrong.

Intended Audience

The core subject is (should be) comprehensible by anybody possessing basic computer knowledge. Its ramifications, though lie in programming techniques that are rather arkane even for software engineers. I have tried to treat the various subjects in order of increasing obscurity: when what you'llbe reading stops making any sense, most of what follows will be of the same ilk.

Conversely, if you feel that the stuff you're reading is trivial, have a look at the section headings: something juicier may be in the works. I do not advise skipping around, though. Most of this article usefulness (if any) lies in a clear definition of the encoding/decoding process, an apparently trivial pursuit: after all, everybody thinks hat he knows what encoding and decoding are about. Read on and you may find out that you were mistaken...

The fundamental problem

The fundamental problem is posed by a perplexed, yet angry, user that wlaks up to your desk saying:

"On X the text Y displays all right. On W, the same text Y, is displayed in the wrong Z-way"

Here X may be a computer or a program or anything else somebody might use to display text. Y may be coming from an actual text file, or a message blurted out by some program, or a page coming from a web server. Z ranges from "Some charachters are funny" to "It's a jumble of totally garbled, random characters"

With some luck, the original text and its displayed representation are somewhat readable, and are on accessible machines. On a bad day, it's a Dewanagari text, translated in Tokyo and displayed in Brisbane.

At any rate you, the IT professional, mutter knowingly "Oh yeah, an encoding problem. I'll be calling you" and pretend to be absorbed in deep meditation, hoping the user leaves. All the while you are cursing under your breath.

Let's see why.

A character issue

Characters are the elementary components of the written word (here 'written' i importante: by contrast, the elementary components of the spoken word are called phonemes, a different kettle of fish)

It is essential to keep a character and its visual representation distinct: the character "LATIN SMALL LETTER A" ('a') must not be confused with any of the shapes it assumes when printed or displayed on a screen, a newspaper and so on. While the character is the idea of an elementary component of a word, its realization is called grapheme. To be even more precise, a grapheme is a the specification of a character shape abstracted from its typographic font (a foggy concept that I won't explore): a printed character, font and all, should be called a glyph. A single character may have more graphemes (like the Greek "SIGMA", who has a different shapes at the end and at the boeginning/middle of a word), while more characters, can be combined in a same grafeme/glyph (like the syllable "fi", whose typographic representation ligates the two letters and omits the dot on the 'i')

At this point a "character" is already an idea much wider than "alphabet letter": it includes punctuations, diacritics and several varieties of white space. I will call "alphabet" the collection of characters used by a written language.

The stupid computer

Up to the first half of the 20th century, most of the preceding notions were basically subsumed in tha activity of learning the alphabet of a language. The coming of age of computers did change a few things about text representation and posed radically new problems.

The Turing/Von Neumann computers we're all familiar with are - at their core - quite stupid when it comes to symbolic representations. The only 'alphabet' they can cope directly with is, in fact, composed by a total of two symbols, namely, 0 and 1 (this is known as a bit). In other words, the only information a computer can directly manipulate consists of binary integer numbers. Since bits are too small to be used as a practical data unit, most computers deal with data through the exchange of 8 bit packages almost universally known as use a larger bytes (sometimes also octets).We may think to a byte as the smallest data unit that a computer can exchange through an input/output device (or from/to memory) with its central processing unit. This means that, to get at the 20th bit, a computer will have to access the entire third byte - which contains it.

Computers can process non-numerical information by converting it to numbers. If we need to process musical notation - for instance - we must beforehand devise a method that converts musical notation symbols to numbers and viceversa. This is normally done by numbering the repertoir of symbols (the alphabet) that will be used.

In what follows, I will call this correspondence between symbols and numbers a code. This is a somewhat non-standard terminology: most of the times, the same concept is called a codepage or - in a somewhat ambiguous fashion - an encoding. The reason for this change in verbiage is that the encoding process - discussed later - is actually a rather different from the code, and grasping this difference is 90% of the rules of thi particular game, while the term "codepage" is used rather inconsistently, often with reference to a particular brand of computers, or operating systems.

I will restrict my scope to alphabetic notation, neglecting more specialistic fields such as mathematical and musical notation for which several codes have been developed over the years. Suffice it to say the most important code (Unicode) can be used to process a repertoire of symbols that spans the non-textual alphabets that have been mentioned, as well as several others, such as the alphabets and writing systems of ancient languages.

I will also neglect all the problems involving text layout,
writing direction, word collation, date and currency format. Most of
these are - however - taken into consideration by "http://www.unicode.org">Unicode.

Codes

What we have said to this point can be summarized as follows:

"Computer-mediated text processing requires a procedure of character numbering called a code."

The number that - in a code - represents a given character is called codepoint.

A code can represent any given collection of symbols: an entire alphabet, just a part of it, or even several alphabets. Thus the adoption of a code limits the repertoir of characters that can be used in a given application, which can severely handicap - for instance - the processing of multilingual text. As an example, an interlinear italian translation of K'ung-tzu's work cannot be represented by using the venerable ASCII code or a large number of the national codes (like Latin-1). This was ne of the reasons of the invention of Unicode, which tries to include the entire repertoire of ll the human languages and which, to this writing, actually includes all the characters used by the principal modenr languages.

Representations

We must now distinguish between the internal and external representations of a character. I call internal representation the representation which a computer program uses to identify a given character. For our purposes ths can be regarded as completely abstract. I call external representation of a charachter the one that is used to write it to disk, or to send it to some other program.

While the internal representation is used, the code is used as a pure reference: it's there, it is (somehow) used, but it remains mostly invisible.

By contrast, the code plays a paramount role when the internal representation is converted to external, or viceversa. When transforming the internal represtation to external the following happens:

i) the character is converted to a number (codepoint) according to the code;

ii) the codepoint must be converted to a byte sequence: this is called encoding;

iii) the byte sequence is transmitted.

When going from the external represtation to internal, the prcedure is reversed:

i) the byte sequence (external representation) is received.

ii) the byte sequence is converted to a sequnce of numbers/codepoint: this is called decoding;

iii) the codepoints are converted to internal representation by using the code.

As will be showed, the procedures of converting a codepoint to a byte sequence and viceversa are not uniquely determined. If they were, converting to/from the external and internal representation would be possible by just knowing the code that is been used, and the "fundamental problem" would be much easier.

The decoding and encoding concepts are so central to the definition of our problem that they deserve to be clearly defined as follows:

The encoding process associates a codepoint to a character in tis internla representation and then converts the codepoint to a concrete numerical representation expressed as a byte sequence (external representation.

The decoding process converts a byte sequence (a character in external representation) to a codepoint and then (by the application of a suitable code) to its internal representation.

How text is displayed

So, how does a byte sequence becomes readable text on a screen (or on a printed page)? And how can it happen that the same byte sequence can instead be displayed as an unreadable jumble of cruft?

The process of actually displaying sequnce of characters can be summarized as follows.

1) The text is received by the software environment as a byte sequence (numbers in the range 0-255).

2) The decoding process converts the byte stream to a sequence of codepoints

3) The codepoint sequence is identified as a character sequence by applying a code.

4) After text processing is completed, character are associated to glyphs that can be sent to the display device (say a screen, or a printer).

The last step of this process - important though it may be for the actual representation of text - is seldom relevant to the problem at hand. At its core it consists in determining what shape the character is given within a given font. This is a well defined task, once the code is known. The only thing that can go wrong (rendering the text unreadable) happens when a wrong font (a font which does not contain the desired shapes). This may be a non-alphabetic font, or a font that does not contain the alphabet we need. This type of error is easily solved by using one of the standard (Unicode) alphabetic fonts which are found on modern systems (such as Arial, Helvetica, Courier...).

Errors that happens in phases (1) through (3) are more common, and harder to solve. Any mismatch between code and encoding will render the entire sequence unintelligible, and may even result in software errors.

So why do we have to cope with more than one (several, actually) code and encoding? To understand it, we need to look back at history?

Once upon a time

...computers were uniquely dedicate to numerical computing. Numerical results require a fairly limited character set (ten digits, a few latin letters, the '+', '-', ',' and '.' signs) which are well understood in most human languages. Besides most results were printed (or punched) for human consumption. Our problem did not exist.

This happy situation changed over the course of few years, as the computers' symbolic capabilities were understood and exploited. Codes were invented to satisfy the need to represent symbols. Anarchy set in as each manufacturer devised its own code: exchanging data among machines quickly became a problem, and the industry eventually settled on two codes: EBCDIC (1963) and ASCII (1963).

The ASCII code was - and is - an actual standard, devised by an industry comittee. EBCDIC had been independently developed by IBM and had been, over the years, adopted by many other mainframe manufacturers. Today its use is basically limited to IBM mainframes, and is waning.

The root of all evil

The goals that were set for the first codes were, by today's standards, extremely limited:

1) they were designed to be used in a single-language environment, the single language being very often English;

2) they needed to be space efficient (by using the least possible amount of bits per character) so that their transmission could be fast.

Because of (1), the set of coded symbols was cut down to the minimum: ASCII, for instance, dealt only with the characters and punctuation used in the English language.

Because of (2) a number of (unprintable) symbols that were only used for transmission control (things like Start-Of-Message, End-Of-Message) were crammed in the same code, reducing the number of codepoint available to actual characters.( This happened also because the most popular transmission methods used a serial line, so the control signalling was forcibly interleaved with the character stream.)

Mostly for these reasons ASCII is a 7 bit code. The availble number of codepoints is therefore 128 of which the first 32 are control characters: the reamining (printable) codepoint can represent mosto of the characters found on the keyboard of an (english language) typewriter keyboard. The remaining (eighth) bit was reserved for checksumming (a process that allows the detection of transmission errors)

What about the rest of the world?

Ninetysix character are not enough to represent all the alphabets of the world. They are not even nearly enough to represent all the alphabets in use for european western languages.

This was obviously quite unsatisfactory and it became ever more so as computing was becoming commonplace in countries whose alphabet was completely different from the one coded by ASCII.

The codepage

The situation was initially solved by patching the existing standards (which, alas, were not designed to tolerate much patching. A first step was seizing the eighth bit that ASCII had reseved to checksumming. This expanded the codepoint space of 128 units. Every manufacturer then proceded to create codes mapping these newly available codepoints to additional alphabetic characters (sometimes, to graphic symbols used for drawing). Borrowing an IBM word, these codes were (and are) called codepages./p>
Information Technology Babel quickly ensued: correct interpretation of a received text now required the knowledge of the (vendor dependent) codepage: if the text had to be used on a different system, a compatible codepage had to exist on the system. The situation was, however somewhat mitigated by the existence of e de-facto-standard setting Behemoth (IBM) that dictated most of the practices for the industry, and by the fact that computer communcations were still mediated by highly specialized professionals which were capable of making sense of sentences like: "I am sending over a .5 inch ANSI tape in the 443 codepage".

LATIN-1 & Co.

Standardization was obviously called for and this time the task was assumed by ISO towards the end of the 1980s.

A uniform naming system was established for the codes that had become a de-facto standard. As an (important) instance, a 256 codepoint set describing the alphabets of several western european languages was named iso-8859-1 (or, Latin-1). The names iso-8859-x (from iso-8859-2 to iso-8859-16) were chosen to designate codes for european languages with special alphabets (such as Greek and Cyrillic).

Codes forAsian languages (essentially japanese, chinese, korean) were similarly grouped under the ISO/IEC 2022 set of specifications.

Most, if not all, the ISO codepages map the first 127 codepoints in the same way ASCII does, making them (almost) ASCII compatible.

Some regional and special purpose codes were left behind in the process, and some monsters came into existence: Latin-1, for instance is to this day almost, but not quite, identical to the widely used windows-1252, codepage.

One of the outcomes of this process was underlining the need for a unified code, capable of cataloguing all the alphabets used by humanity: this resulted in the creation of Unicode.

Encodings

We defined a byte as the minimal transmission unit fora computer. Because a byte is an can represent 256 integers, any code that uses less than this amount of codepoints can have an external representation tht fits within a single byte. For these codes, internal and external representation may be made coincident, by mapping each codepoint to its single-byte value. For this codes, code and encoding are one and the same.

As we all know, several languages use more than 256 symbols (Chinese and Japanese, as well as several other asian languages, use up to tens of thousands of symbols). Codes for these languages do not have a single byte per character: some characters must be represented by more than one byte, a feat that may be accomplished in at least two ways.

Wide-char encodings

The most natural choice is using the same number of bytes to encode all the codepoints.For instance an alphabet having more than 256, but less than 65536, symbols is amenable to a two byte (00000000-00000000 to 11111111-11111111) encoding. Such encodings are called "wide-char" encodings. In spite of their being quite intuitive, wide-char encodings suffer from a number of shortcomings, that I will discuss later.

An example: UCS-2 (UTF-16)

Let us conider a U encoding, having the following properties (I am essentially describing - save a few, minor details - the UNICODE encoding known as UCS-2).

1) U is a wide-char, two byte per codepoint encoding

2) U uses the first 256 codepoints in the same order and meaning as the Latin-1 codepage. This means that all the alphabets of the principal western european language fit in the first byte of this encoding.

The first problem with U us that it is spatially inefficient. U containst 511 symbols encoded by sequences with at least a null byte (all the bits of the byte are zero). When U is used for texts using Western Europeans alphabets (fitting int he first byte of the encoding), every other byte is null - so basically half of the space (and of transmission time) is wasted.

A second problem of U relates to endianness. (The word comes from the inhabitants of the legendary islands oof the mythical islands of Lilliput and Blefuscu, who - as related by Swift in the novel "Gulliver's Travels" - could not agree on which end of an egg should be broken first. Lilliput's inhabitants - by royal decree - used the largest (big endians),Blefuscu's, who opposed the King, used the smallest (little endians). Because of this disagreement, the two peoples fought a bloody war.per protesta contro il re: little endians).

Even though the basic transmission uniti, for computers is the byte, the need of larger data units was soon felt. Among these a certain regard is attached to the so called word, adjacent pair of bytes. Internally, computers often manipulates words as a whole: integer numbers, for instance, are represented by one, two or four words.

A word, however, is never seen as basic (unsplittable). So when a word leaves the computer memory it can be sent (externally represented) in one of two ways:

1) the Most Significant Byte is sent first, followed by the Least Significant Byte (big endian)

2) the Least Significant Byte is sent first, followed by the Most Significant Byte (big endian)

(Similar differences may arise in the representation of pair of words, but fortunately they are far rarer)

If we picture bytes as decimal digits, and given the number "ninety-one", we can see that big endian machine would write/memorize it as "9" "1", whereas a little endian machine would write/memorize it as "1" "9".

Unbelievable (or stupid) as it may seem, for years nobody mandated the word order in external representation, so either order has been used with comparable frequency. This obviously made endianness (AKA byte-ordering) another stumbling block on the way towards computer communication. So pesky a problem, in fact, that at some point it was actually solved with a blitz operated by da Sun by deciding that, over a TCPI/IP network, a network byte order existed, to which all computers must submit (the network byte order is big endian, the same that Sun machine used at the time). While that fixed for network communication, no such fix exists for files, which are still being written with different endianness on different machines.

For our U encoding this means that its correct decoding will be possible only after its endianness is known.

A last problem with U is apparent to programmers only. We have seen that a U encoded character stream can contain null bytes (indeed up to half of the bytes may be null). Traditionally though (traditionally meaning from circa 1960 until sometime around the year 2000) a null byte had a almost universal meaning of "end of string" for a large body of software, including software devoted to text manipulation in Western European countries. This also means that U is not compatible with the above mentioned software, which will behave unpredictably when handed a U-encoded string.

Multibyte encodings

We have multibyte encodings if we translate different codepoints using a different number of bytes.

An example: UTF-8

Let us consider a F encoding, having the following properties (I am essentially describing - save a few, minor details - the UNICODE encoding known as UTF-8).

1) F is a multibyte encoding, an uses the first 127 codepoints in the same order and meaning as the ASCII codepage. F is therefore ASCII compatible

2) When the most significant bit of a given byte (in F) is set, the byte is part of the encoding of a codepoint which translates to a multiple byte sequence. If one (or more) bits following the most significant bit are set, followed by a zero ((110xyyzz, 1110yyzz, ...) the byte starts a sequence composed by one, two... bytes. Bytes of the form (10xxyyzz) follow the first in a multibyte sequence for a codepoint.

While F solves some of the problems of the wide-char encodings, it introduces some of its own. Let us compare it to the encoding F, above

1) F is much more space efficient than U when coding the first 127 codepoints (ASCII). F becomes less efficient than U for codepoints requiring three- or more bytes to represent (the majority of the asian languages), which have a space penalty of about 30% on the average.

2) F is not affected by endianness: codepoints are written as sequences of bytes (not words - a concept that is intrinsically ordered)

3) F does not contain null-bytes and is ASCII compatible, making it possible its processing with "traditional" software tools.

4) F is not latin-1 compatible.

5) Decoding F is harder then decoding U. F makes also harder performing tasks such as "finding the fifth character of a string". If U (or a similar wide-char encoding) is in use, the task can be effected by fetching the fifth "word" (pair of bytes) of the string. With F, one has to read all the bytes starting from the first and convert them to characters, until the fifth codepoint is reached.

6) F does not allow arbitrary byte sequences (for instanceo 110xyyzz-0qxxyyzz is an illegal F sequence). Therefore, upon encountering a byte stream containing a an illegal sequence it is possible to estblish, with certainty, that it has not been encoded using F. This may seem a trivial property, but it is not: many single-byte and wide char encodings ofer no such validation method. In particular, any random byte sequence can be interpreted as a valid ISO-8859-X encoded char sequence. This is an important part of the fundamental problem.

Over there years, several multibyte encodings were devised and describing them is byond the scope of this paper. It is worth mentioning, though the "shift" family of encodings in which the appearance of a particular byte sequence (the "upshifit" sequence) changes the meaning of all the upcoming sequences until a corresponding "downshift" sequence is met. Many such encodings are grouped under the ISO/IEC-2022, specification which deals with the encoding of several east Asia languages.

It must be said that, for most of the codes/codepages defined by the ISO specifications, the coupling between code and encoding is actually unique, meaning that, if the code is known so is the encoding (and viceversa). Unicode differs on this, as we will see shorly.

Unicode

The often mentioned Unicode standard, as specified from the Unicode consortium, is an initiative whose purpose is creating a unified repertoire (a code) of all the characters used by humanity including contemporary languages, antique languages, invented languages (there is a codepoint space dedicated to the Klingon alphabet) and with enough space to include yet-to-be-coded languages. The most recent Unicode formulation also includes math and music symbols as well as other more specialistic symbol catalogs. More than half of the codepoint space (10 planes) is currently unassigned (i.e. it contains no characters) and it is unlikely it will be in the near future.

In other words Unicode aims to be the code to end all codes: it, for instance, the creation of multilingual text without switching codepage - actually Unicode makes the existence of other codepages irrelevant (assuming all world could be persuaded to use Unicode).

As of today the standard includes1 114 112 (one milion one hudred fourteen thousand one hundred and twelve) codepoints, divided in 17 planes; every plane comprises 56 536 codepoints ordered on 256 rows of 256 columns)

Plan 0, which includes the first 65536 codepoints, is more often called Basic Multilingual Plane (BMP) and it contains most of the repertoire for modern languagese. In order to stay compatible with legacy practices, the BMP maps its first 127 codepoints in the exact same way as ASCII.

An important part of Unicode is the definition of the "Unicode transformation format" (UTF) and the "Universal character set" (UCS): these are the encodings used for Unicode's external representation.

Of these the most important (and the ones that are met with most often) are just two:

UTF-8: a multibyte encoding maximizing the ASCII retro-compatibility (I have already described this as the "F" encoding). UTF-8 encodes every character with a byte sequence whose lenght varies between one an four octects (bytes).

UTF-16 (formerly UCS-2: I have already described this as the "U" encoding): a multibyte encoding of the entire Unicode repertoire. UTF-16 represnts the entire BMP as a wide-char two-byte encoding: this was the original definition of UCS-2 -a no longer used encoding - which could only represent the BMP. UTF-16 encodes every codepoint with a byte sequence whose length varies between two and four octets. The four-byte encoding is reserved to very rare codepoints lying outside the BMP.

UTF 16 also defines a particular value(Byte-Order-Mark or BOM) which is used to start a text and that is used to infer the text's endianness.The BOM is represented by the hexadeciimal codepoint U+FEFF which on a big-endian machine is presented as the 0xFE,0xFF sequence and as the sequence 0xFF,0xFE on a little endian machine. Because the U+FEFF codepoint (Zero-Width No-Break Space) may never start an encodend sequence, while the U+FFFE codepoint is not (and never will be) assigned to a valid character, finding one of these codepoints at the start of a codified sequnec allows to determine the endianness of the entire sequence.

UTF-32/UCS-4: a "wide" fixed length encoding: every codepoint is encoded in a 4 byte long sequence. BOM is applied as seen in UTF-16. This encoding is - in practice - seldom used.

Because of the stated advantages of the F encoding over U, UTF-8 is the most used encoding for the external representation of Unicode texts, while UTF-16 is often used for the internal representation of text. In particular, UTF-16 is used in the Microsoft operating systems.

The fundamental problem, revisited.

At this point we have the background needed to understand the causes of the fundamental problem, as stated a few sections ago.

The root cause of the problem is that a text (file) that was prepared to be displayd using a given triad (code, enconding, endianness) is actually being displayed with the wrong triad (i.e. at least one of the three components is wrong).

There is another possibility, namely, a font mismatch (for instance no japanese characters in the selected font). This error source is easily elimintaed by installing (and directing the application to use) a complete set of fonts (often termed Unicode fonts).

The fundamental problem is solved if we can somehow infer the source triad, the target triad, and determin the correct translation between the two.

Theorem of the uncomputability of the transcoding.

What we know about the subject is also sufficient to proof what I (pompously) call "the theorem of the uncomputability of the transcoding":

"There is no algorithmic procedure capable to uniquely determine the encoding/codepage of an arbitrary byte sequence."

The proof is easy: it is sufficient to observe that any arbitrary byte sequence is a "correct" sequence in the Latin-1 encoding (actually in any Latin-X encoding). This means that any byte sequence "could" be a Latin-1 encoded character sequence regardless of the encoding that was actually used to produce it. Therefore the original sequence cannot be determined with certainty.

This unfortunate result means that any technique to solve the fundamental problem will have to rely on euristics and/or statistics.

Let's see some of this techniques.

Wrong endianness (for a wide char encoding).

These days, this error seldom happens, except when exchanging tapes between systems. UTF-16 encoded files must start with the BOM, and are therefore easily detected (this should actually be done for you by the application). Othe encodings may need that both possible ordering be tested, with the correct one being inferred by inspection.

Wrong code or encoding

The file was generated by using a code/encoding different from the one expected on the target machine. This is the type of errors that happen more often.

The first thing to ascertain is the (code,encoing,endianness) expected on the target system. This may not be easy as it seems. Web browsers, for instance, try to guess code and encoding of web pages, and frequently guess wrong. Operating systems assume a codepage: on windows systems the ANSI CP_ACP codepage must be examined, as well as the regional setings; on Linux systems the LOCALE family of environment variables must be taken in account. While a complete list of indicators is hardo to come by, some inferences can be made: an italian speaking Windows machine, for instance, is often working under the windows-1252 locale.

We must then turn our attention to the source triad. because of the uncomputability theorem we know that we have no sure-fire procedure, but we can still try to take a good guess.

First off, if the representation on the source machine shows a high percentage of graphical and/or control characters, we may assume that the encoding in use is wrong. Unless of course, the file we are looking at is not a binary file to start with. This may seem trivial, but, if in doubt, using the output of the Linux command "file <nomefile>" can save you many a frustrating hours, especially if the oputput turns out to be something like "PDF file" or "compressed file".

The easy case

For systems in the italian language (but the same shouldgo for most european languages), and in my experience, the most common occurence is having Unicode - encoded text on a iso-8859-1 system, or viceversa. On the Unicode side, you can exepct the encoding to be UTF-8 9 times out of 10, and UTF-16 for the rest. UTF-16 is easily detected (for european languages) by looking at the file with a binary editor (odump on linux) and noting extensive areas of the file where null and non-null bytes alternate (this method is worthless for CJK languages). Also the presence of the BOM will give UTF-16 away.

A strong evidence for UTF-8 encoding is gined by observing the way in which accented letters are modified (this is mostly useful for languages where accented letters are common, such as Italian and French). If a UTF-8 encoded file is transmitted to a ISO-8859-x (with Latin-1 or windows-1252 instances being the most common occurences) most of the non-accented characters will be displayed correctly, while accented letters (which are encoded as a pair of bytes which translate to two charaters in Latin-x encodings) will display as a pair of characters, the first being a capital "A" capped by a tilde (Ã). Diacritic marks of other western european languages (e.g. the German tilde) will behave similarly, as will most currency symbols (the S dollar being the exception) and other semi-graphic characters, such as the copyright (©) mark.

The opposite case (ISO-8859-X on a UTF-8 system) is a little thornier. All accented letters and diacritics will be mistaken as the beginning of a multibyte sequence, which will be either a valid or invalid UTF-8 mulitbyte sequence. In either case, one (or more) of the subsequent characters will disappear; what will be displayed will either be a random "exotic" character - often a white question mark on a black diamond - or an error message. Sequences of ASCII characters will display normally.

The general case

In the most general case, a more systematic approach must be taken. The following checklist may help here.

1) Collect as much information as possible on the file origin. If possible, found out which application produced it and look for its manual. Use Google. If possible, speak with the human originator of the file.

2) Examine the file through other applications. A good text editor is of great help (the one I use - emacs, has very good multilingual support from release 23 - sometimes opening the culprit in emacs is enough to solve the problem)
3) Remember the obvious. The file intended purpose is often enough to solve the problem. XML files, for instance are UTF-8 enacoded unless they explicitely declare an alternative encoding.

4) Have a good (software) toolbox and use it to try all the possible transcodings in order of decreasing probability (for example, if the file should contain japanese tex, the JIS encoding is a likely starting point; for chinese, Big5 would be more likely). If you can, make your experiments on a sample of the file that also contains recognizable sequences of western characters (things like addresses or personal names), because english/ASCII characters tend to be encoding-invariant. Some tools exist that help in automating this process (e.g. the Universal Encoding Detector uses the same euristics most browsers use).

See also appendix A for a couple of (python) functions which may be handy in this activity.

Having a clear mental picture of the purpose of your actions and of the intended behavior of your tools is an important part of this puzzle solving activity. This (python oriented) link may shed further light on the subject http://code.activestate.com/recipes/466341/

Some programming tips

The brute force approach to cracking the "fundamental problem" almost invariably requires some programming. Most programming languages, these days, claim "Unicode support", but what is meant with this is far from clear. To the best of my understanding, a programming language with "Unicode support" is a language that can represent text objects (i.e. strings) as Unicode codepoints, and that can convert a sequence of Unicode codepoints to a text object. You will notice that this tells us nothing about the process by which one can convert the language internal representation of textual objects to a suitable external representation (which is what mostly concerns us). What happens is that basically every language has its own way of doing it, which is, to say the least, confusing.

The purist approach can be found in the C language. Here external and internal representation are one and the same, because C represents strings as byte sequences. Encodings are handled through external libraries (if memory serves, IBM distributes a free ICU mulitilingual library). I believ that C++ behaves similarly. This is not for the weak, and, unless you already spend your days with Developer Studio or automake, I'd suggest a less purist aproach.

Unicode and Dynamic Languages

In ths section I will focus mostly on one dynamic language, namely, python. Even though I am normally a satisfied perl user, python appears to have better - at least where consistency and stability are concerned - UNICODE support. Once the basic concepts have been grasped, however, I'd say that the two languages' capability is almost on par.

Python, has internal supprto for two types of text objects: Unicode and ordinary (i.e. encoded) strings. Unicode objects can be thought of as sequnces of codepoints; ordinary strings can be thought of as byte sequences.

Creating a unicode string is easy:

us=u'\u00e8\u00e1'

us, represents the "èá" sequnce: 00e8 (232 hex) e 00e1 (225 hex) are the relevant codepoints.

If the encoding of an ordinary string is known, the Unicode string can be created, by decoding the string itself:

    us=cs.decode('string_encoding')

For instance from "èá":

    cs='\xe8\xe1'
    us=cs.decode('Latin-1')

There is a formally different - yet equivalent - way of obtaining the same result:

    us=unicode(cs,'Latin-1') # or the string encoding

Because I find this confusing, I mentally read this as "build a Unicode string by decoding cs from the 'Latin-1' encoding".

What above will work correctly if and only if the right encoding is specified. A wrong encoding will either result in a runtime exception or in wrong string, depending on the content of the ordinary string and on the encoding specified.

A python Unicode object is rather abstract. It cannot be directly saved, printed or represented without applying an encoding: attempts to do so will trigger a run-time exception. It may come as a surprise that a Unicode object can be encoded using any supported encoding, rather than the canonical, UTF and UCS encodings defined by Unicode.

It is in fact perfectly possible - and proper - to encode a sequence of Unicode codepoints in the (say) Latin-1 encoding provided that the codepoints are representable in the target encoding. It is for instance possible to encode as 'Latin-1' the 'U+00e8' codepoint, whereas the same cannot be done for the Kanji codepoint 'U+4e01'. Both codepoints in the preceding example, however, can be represented in the shift-jis-2004 encoding, as well as in UTF8 or UTF16. A partial list of the encodings supported by a standard python installation can be found in appendix B. UTF8 and UTF16 are special, because they are the only encodings that can always be safely specified as targets (as they are capable of represent the entire Unicode repertoire)

Going from a Unicode string (us) to a coded string is also easy:

    cs=us.encode(desired_encoding)

for instance:

    us=u'u'\u00e8'
    cs=us.encode('Latin-1') #contains '\xe8'

    us=u'\u00e8\u4e01'      #contains an ideographic char: è丁
    cs=us.encode('Latin-1') #runtime error

  UnicodeEncodeError: 'latin-1' codec can't encode character
  u'\u4e01' in position 1: ordinal not in range(256)

    cs=us.encode('shift-jis-2004') # contains '\x85}\x92\x9a'
    cs=us.encode('utf8') #contains '\xc3\xa8\xe4\xb8\x81'
    cs=us.encode('utf16') #contains '\xff\xfe\xe8\x00\x01N'

By putting the two steps together, we can translate from an encoding to another (transcoding):

    us=cs_source.encode(source_encoding)
    cs_target=us.encode(target_encoding)

Again, this is only possible for compatible encodings (target can represent all the codepoints present in source).

In particular, transcoding to UTF8 is always possible, if the codec for the source encoding is installed (Python's standard codecs are listed in appendix B):

    us=cs_source.encode(source_encoding)
    cs_target=us.encode('utf8')

As mentioned, unencoded Unicode strings cannot be sent to output streams:

    cs=u'u'\u00e8' f=file('/tmp/ciccio','a') f.write(cs)

    UnicodeEncodeError: 'ascii' codec can't encode characters in
    position 0-1: ordinal not in range(128)

Here we can see that the python interpreter tries to apply a default encoding to us (ASCII, in this case) and fails because us contains an accented character that is not part of the ASCII specs.

So the pythonic way of working with Unicode requires that we 1) decode strings coming from input and 2) encode strings going to output.

The 'codecs' module can, alternatively, decorate our I/O handles for us with suitable encodings: :

    import codecs

    f=codecs.open('/tmp/ciccio','UTF-8','r')
    g=codecs.open('/tmp/ciccia','latin-1','w') 
    us=f.read()
    g.write(us)

Anything we read from 'f' is decoded as UTF-8, while any Unicode object we write to 'g' is encoded in Latin-1. (So we may receive a runtime error if 'f' contained korean text, for instance). One should also refrain from writing ordinary - encoded - strings to g because, at this point, the interpreter would implicitely decode the original string applying a default codec (normally ASCII) which is probably not what one would expect, or desire.

It should be obvious that, for regular python programming - outside of multilingual text processing - Unicode objects are not normally used, as ordinary strings are perfectly suited to most tasks.

A different kind of "Unicode support" is the interpreter capability of processing source files containing non-ASCII characters. This is doable, by inserting a directive like:

#-*- coding: iso-8859-1 -*-

- (or other encoding) towards the beginning of the file. I advise against this, as a practice that will end up annoying you and your coworkers, as well as any other perspective user of the file. Stick to ASCII for source code.

The Curse of Implicit Encodings

Most I/O peripherals, these days, try to "help" their user by taking a guess on the encodings of the strings that are sent to them. This is good for normal use, atrocious if your aim is solving problems akin to those we have been tackling so far. Relationships between string types and encodings are confusing enough even without layering on top of them other encodings implicitely brought on by I/O devices.

This is a sample interaction, on my current system (emacs 23.1, Fedora Core 11, IPython):

    In [270]: import sys
    In [270]: sys.stdin.encoding
    Out[271]:
  'UTF-8'
    In [272]: cs='è'
    In [273]: repr(cs)
    Out[273]:
  "'\\xc3\\xa8'"

this can be translated as "writing the sequence 'è' on this interpreters console, which is using the implicit input encoding UTF-8, results in a coded string whose content is '\xc3\xa8'"

The same interaction on a different system is:

    In [270]: import sys
    In [270]: sys.stdin.encoding
    Out[271]:
  'latin_1'
    In [272]: cs='è'
    In [273]: repr(cs)
    Out[273]:
  "'\\xe8'"

this can be translated as "writing the sequence 'è' on this interpreters console, which is using the implicit input encoding Latin-1, results in a coded string whose content is '\xe8'"

Seems harmless? It is not. If we want a Unicode string, on system (1) we have to do:

    us=cs.decode('utf-8')

on system 2:

    us=cs.decode('latin-1')

I don't know you, but I find it bewildering.

My point: in source code -and outside the ASCII domain - stick to codepoint, even if writing literal characters may seem more convenient.

Unicode, encodings and HTML

Like XML, HTML had early awareness of multilingual environments. Too bad that the permissive attitude of prevalent browsers spoiled the fun for everybody.

Waht follows is my laundry list of multilingual HTML facts - check with the W£ consortium if you need complete assessments.

Named entities

In HTML, a (limited) number of national characters can be specified by using the so called 'named entitites': for instance the sequence "à" is displayed as "à".

Numeric entitities

In HTML, the entire Unicode codepoint repertoire cna be represented through numeric entities, which are written by preceding the decimal codpeoint identifier withj the sequence &# and following it with ";", like this:

— displays as: '—'

or,if you favore HEX codes:

— displays as: '—'

Obviously, no sane Japanese will ever want to write a novel this way (unless her word processor takes care of this for her). Also, all this makes for quite unreadable HTML source.

HTML content declaration

A simpler way for multilingual HTML is declaring the document charset ("charset" is HTML-speak for encoding) like this:

<meta http-equiv="content-type" content="text-html; charset=utf-8">

this way of specifying the charset is rather safe, provided the reader's browser supports the charset and that the web server does not spoil everything by slapping a different charset on your document, overwriting the one speciied by you (This happened to me when I first published this document).

Explicit charset specification, however, seldom happens so what we have instead is:

1) Assumed charset (by the editing tools, or by the server). This is seldom correct, besides, it will easily be a national codepage (i.e. windows-1252, for italians) rather than the preferable UTF-8.

2) No charset at all. Pages like this should contain only ASCII characters and named or numeric entities. But this does not happen because:

3) Browsers attempt to infer charset for a document's content, besides

4) ...several webservers "help" by providing a default charset.

All of this makes solving HTML display problems no fun. My advice follows closely the advice already given for simple text.

Database

Multilingual text in databases is a thorny issue, and I am not really up to give copentent advice on it. So I'm just going to give my very limited take of the issue, without taking any responsibility for it. Caveat emptor.

Databases, as a family, are among the oldest pieces of software. This means that they started to tackle multilingual text in the wild frontier years, when men were men and databases encoded text every which way they damn well pleased. Fun, but hardly conducive to consistency.

This said, if the amount of your interaction is going to concern the encoding of a report output, then not to worry: that's a simple text file conversion.

If on the other hand, you are going to build a DBMS system from scratch, I advise you to make everything UTF-8 and to test the entire toolchain with multilanguge strings. Here the tooolchain includes the DBMS, the relevant OSes and communication protocols, programming laguages and libraries, and command line tools.

Lastly if you are going to convert a legacy system to multilingual: stick to the manufacturers' manual(s), and may God help you.

While most databases do document the encoding used for text data (which is often a propriety of the entire database or even of the instance), this is basically ruined by the fact that (for old databases) several architects devised their own encoding schemes that they then crammed inside one of the available data types. Another difficulty is that many databases use(d) non-standard naming for encodings.

Some moderns DBMSs (among them - to my knowledge - is SQL Server since tits 2005 release), atoned for a taking a too rigid encoding stance in the past by allowing to have different encodings for the database, each table contained therein, and even for each column of every table. My opinion is that this is a recipe for disaster, insanity, or both

Appendice A

import unicodedata

def unilist(u):
    """ prints the unicde description of a string """
    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')

Sax Appeal

Wednesday, December 30, 2009

Unicode and the encoding problem