The Unicode HOWTO: Introduction


People in different countries use different characters to represent the words of their native languages. Nowadays most applications, including email systems and web browsers, are 8-bit clean, i.e. they can operate on and display text correctly provided that it is represented in an 8-bit character set, like ISO-8859-1.

There are far more than 256 characters in the world (think of Cyrillic, Hebrew, Arabic, Chinese, Japanese, Korean and Thai), and new characters are invented from time to time. The problems that come up for users are:

It is impossible to store text with characters from different character sets in the same document. For example, I can cite Russian papers in a German or French publication if I use TeX, xdvi and PostScript, but I cannot do it in plain text.
As long as every document has its own character set, and recognition of the character set is not automatic, manual user intervention is inevitable. For example, in order to view the homepage of the XTeamLinux distribution http://www.xteamlinux.com.cn/ I had to tell Netscape that the web page is coded in GB2312.
New symbols like the Euro are being invented. ISO has issued a new standard, ISO-8859-15, which is mostly like ISO-8859-1 except that it removes some rarely used characters (the old currency sign, for example) and replaces them with others, including the Euro sign. If users adopt this standard, they will have documents in different character sets on their disks, and they will have to start thinking about it daily. But computers should make things simpler, not more complicated.
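
The Euro example can be made concrete with a small sketch. It is not part of the original HOWTO; it uses Python's built-in codecs (the codec names `iso8859_1' and `iso8859_15' are Python's spellings) to show that one and the same byte means different characters depending on which character set the reader assumes.

```python
# The same byte, interpreted under two different 8-bit character sets.
raw = b"\xa4"

as_latin1 = raw.decode("iso8859_1")    # the old currency sign
as_latin9 = raw.decode("iso8859_15")   # the Euro sign

print(f"ISO-8859-1:  U+{ord(as_latin1):04X}")
print(f"ISO-8859-15: U+{ord(as_latin9):04X}")
```

Nothing in the byte itself says which interpretation is intended, which is why the user has to know the character set of every document.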

The solution to this problem is the adoption of a character set that is usable world-wide. This character set is Unicode http://www.unicode.org/. For more information about Unicode, run `man 7 unicode' (manpage contained in the man-pages-1.20 package).

1.2 Unicode encodings

This reduces the user's problem of dealing with character sets to a technical problem: how can Unicode characters be transported using 8-bit bytes? 8-bit units are the smallest addressable units of most computers and also the unit used by TCP/IP network connections. The use of 1 byte to represent 1 character is, however, an accident of history, caused by the fact that computer development started in Europe and the U.S., where 96 characters were found to be sufficient for a long time.

There are basically four ways to encode Unicode characters in bytes:

UTF-8: 128 characters are encoded using 1 byte (the ASCII characters). 1920 characters are encoded using 2 bytes (Roman, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic characters). 63488 characters are encoded using 3 bytes (Chinese and Japanese among others). The other 2147418112 characters (not assigned yet) can be encoded using 4, 5 or 6 bytes. For more info about UTF-8, do `man 7 utf-8' (manpage contained in the man-pages-1.20 package).
UCS-2: Every character is represented as two bytes. This encoding can only represent the first 65536 Unicode characters.
UTF-16: This is an extension of UCS-2 which can represent 1112064 Unicode characters. The first 65536 Unicode characters are represented as two bytes, the other ones as four bytes.
UCS-4: Every character is represented as four bytes.
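
The character counts quoted above follow directly from the code point ranges. The following sketch (not part of the original HOWTO) recomputes them; `utf8_length' is a hypothetical helper written for this illustration, and it follows the older scheme of up to 6 bytes described in the text, whereas the current UTF-8 definition (RFC 3629) stops at 4 bytes and U+10FFFF.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed for one code point in the original (up to 6-byte) UTF-8 scheme."""
    # Each entry is the first code point that no longer fits in that many bytes.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point out of range")

# Sizes of the ranges, counted the same way as in the text.
one_byte = 0x80                      # 128
two_bytes = 0x800 - 0x80             # 1920
three_bytes = 0x10000 - 0x800        # 63488
longer = 0x80000000 - 0x10000        # 2147418112
# UTF-16 reaches 17 planes of 65536 code points, minus the 2048 surrogates.
utf16_total = 0x110000 - 0x800       # 1112064

for cp in (0x41, 0x3B1, 0x4E2D):     # Latin A, Greek alpha, a Chinese character
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

For code points up to U+10FFFF, the helper agrees with what Python's own `utf-8' codec produces.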

The space requirements for encoding a text, compared to encodings currently in use (8 bits per character for European languages, more for Chinese/Japanese/Korean), are as follows. This has an influence on disk storage space and network download speed (when no form of compression is used).

UTF-8: No change for US ASCII, just a few percent more for ISO-8859-1, 50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.
UCS-2 and UTF-16: No change for Chinese/Japanese/Korean. 100% more for US ASCII and ISO-8859-1, Greek and Cyrillic.
UCS-4: 100% more for Chinese/Japanese/Korean. 300% more for US ASCII and ISO-8859-1, Greek and Cyrillic.
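
These percentages can be checked with a short sketch (not part of the original HOWTO). The helper `overhead_percent' and the sample strings are made up for this illustration, and the legacy encodings chosen (ASCII, ISO-8859-7, GB2312) are just examples of encodings currently in use. Python's `utf-16-be' and `utf-32-be' codecs stand in for UCS-2/UTF-16 and UCS-4 without a byte-order mark.

```python
def overhead_percent(text: str, legacy: str, unicode_encoding: str) -> float:
    """Extra space, in percent, when `text` moves from a legacy to a Unicode encoding."""
    legacy_size = len(text.encode(legacy))
    unicode_size = len(text.encode(unicode_encoding))
    return 100.0 * (unicode_size - legacy_size) / legacy_size

samples = [
    ("hello", "ascii"),                          # US ASCII
    ("\u03b1\u03b2\u03b3\u03b4", "iso8859_7"),   # Greek: alpha beta gamma delta
    ("\u4e2d\u6587", "gb2312"),                  # Chinese: "zhongwen"
]

for text, legacy in samples:
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        extra = overhead_percent(text, legacy, encoding)
        print(f"{legacy:>10} -> {encoding:<9}: {extra:+.0f}%")
```

The figure for ISO-8859-1 text depends on how many accented letters it contains, so it is left out of the sketch.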

Given the penalty for US and European documents caused by UCS-2, UTF-16, and UCS-4, it seems unlikely that these encodings have a potential for wide-scale use. The Microsoft Win32 API has supported the UCS-2 encoding since at least 1995, yet this encoding has not been widely adopted for documents; SJIS remains prevalent in Japan.

UTF-8 on the other hand has the potential for wide-scale use, since it doesn't penalize US and European users, and since many text processing programs don't need to be changed for UTF-8 support.

In the following, we will describe how to change your Linux system so it uses UTF-8 as text encoding.

Footnotes for C/C++ developers

The Microsoft Win32 approach makes it easy for developers to produce Unicode versions of their programs: You "#define UNICODE" at the top of your program and then change many occurrences of `char' to `TCHAR', until your program compiles without warnings. The problem with it is that you end up with two versions of your program: one which understands UCS-2 text but no 8-bit encodings, and one which understands only old 8-bit encodings.

Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA character set registry http://www.isi.edu/in-notes/iana/assignments/character-sets says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify". Network byte order is big endian. RFC 2152 is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters the UCS-2 form are serialized as octets, that the most significant octet appear first." Microsoft, on the other hand, recommends in its C/C++ development tools using machine-dependent endianness (i.e. little endian on ix86 processors) together with either a byte-order mark at the beginning of the document or some statistical heuristics(!).
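
The endianness issue is easy to see on a single character. The sketch below is not part of the original HOWTO; `decode_ucs2' is a hypothetical helper, and Python's UTF-16 codecs are used as a stand-in for UCS-2 (they behave identically for the first 65536 characters). It honours a byte-order mark when one is present and otherwise assumes network byte order, as the quoted standards suggest.

```python
import codecs

def decode_ucs2(data: bytes) -> str:
    """Decode 16-bit text: use the byte-order mark if present, else assume big endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")   # no mark: network byte order

euro = "\u20ac"
big = euro.encode("utf-16-be")        # bytes 20 AC, most significant octet first
little = euro.encode("utf-16-le")     # bytes AC 20, as on ix86 processors

print(big.hex(), little.hex())
print(decode_ucs2(codecs.BOM_UTF16_LE + little) == euro)   # mark tells us the order
print(decode_ucs2(little) == euro)                         # no mark: misread as U+AC20
```

Without a byte-order mark, little endian data is silently read as a different character, which is why a fixed byte order or a mark is needed.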

The UTF-8 approach on the other hand keeps `char*' as the standard C string type. As a result, your program will handle US ASCII text, independently of any environment variables, and will handle both ISO-8859-1 and UTF-8 encoded text provided the LANG environment variable is set accordingly.
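
The reason byte-oriented code keeps working is that UTF-8 never uses an ASCII byte value inside a multi-byte character. The sketch below is not part of the original HOWTO; it uses Python `bytes' as a stand-in for a C `char*' buffer, and `split_fields' and the sample record are invented for this illustration.

```python
def split_fields(line: bytes, delimiter: bytes = b":"):
    """Byte-level split, much like scanning a C char* buffer for an ASCII delimiter."""
    return line.split(delimiter)

# A record with German, Russian and Japanese fields, encoded as UTF-8.
record = "M\u00fcller:\u041c\u043e\u0441\u043a\u0432\u0430:\u6771\u4eac".encode("utf-8")
fields = [field.decode("utf-8") for field in split_fields(record)]
print(len(fields))

# Every byte of a non-ASCII character is >= 0x80, so it cannot be mistaken for ':'.
non_ascii_bytes = "\u00fc\u041c\u6771".encode("utf-8")
print(all(byte >= 0x80 for byte in non_ascii_bytes))

# A 16-bit encoding, in contrast, puts NUL bytes into plain ASCII text,
# which would terminate a C string early.
print(b"\x00" in "A".encode("utf-16-be"))
```

Code that only looks for ASCII delimiters, such as `:' or `/', therefore needs no change, while code that counts characters or display columns still has to be adapted.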