www.ibiblio.org/pub/Linux/docs/HOWTO/Unicode-HOWTO
People in different countries use different characters to represent the words of their native languages. Nowadays most applications, including email systems and web browsers, are 8-bit clean, i.e. they can operate on and display text correctly provided that it is represented in an 8-bit character set, like ISO-8859-1.

There are far more than 256 characters in the world (think of Cyrillic, Hebrew, Arabic, Chinese, Japanese, Korean and Thai), and new characters are being invented now and then. With 8-bit character sets, it is impossible to store text with characters from different character sets in the same document. For example, I can cite Russian papers in a German or French publication if I use TeX, xdvi and PostScript, but I cannot do it in plain text.

As long as every document has its own character set, and recognition of the character set is not automatic, manual user intervention is inevitable. cn/ I had to tell Netscape that the web page is coded in GB2312.

ISO has issued a new standard, ISO-8859-15, which is mostly like ISO-8859-1 except that it removes some rarely used characters (such as the old currency sign) and replaces them with the Euro sign. If users adopt this standard, they have documents in different character sets on their disk, and they start having to think about it daily. But computers should make things simpler, not more complicated.

The solution to this problem is the adoption of a world-wide usable character set: Unicode. The use of 1 byte to represent 1 character is an accident of history, caused by the fact that computer development started in Europe and the US, where 96 characters were found to be sufficient for a long time.

There are basically four ways to encode Unicode characters in bytes:

UTF-8
    128 characters are encoded using 1 byte (the ASCII characters). 1920 characters are encoded using 2 bytes, and 63488 characters using 3 bytes. The other 2147418112 characters (not assigned yet) can be encoded using 4, 5 or 6 bytes.

UCS-2
    Every character is represented as two bytes. This encoding can only represent the first 65536 Unicode characters.
UTF-16
    This is an extension of UCS-2 which can represent 1112064 Unicode characters. The first 65536 Unicode characters are represented as two bytes, the other ones as four bytes.

UCS-4
    Every character is represented as four bytes.

The space requirements for encoding a text, compared to encodings currently in use (8 bits per character for European languages, more for Chinese/Japanese/Korean), are as follows. This has an influence on disk storage space and network download speed (when no form of compression is used).

UTF-8
    No change for US ASCII, just a few percent more for ISO-8859-1, 50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.

Given the penalty for US and European documents caused by UCS-2, UTF-16, and UCS-4, it seems unlikely that these encodings have a potential for wide-scale use. The Microsoft Win32 API has supported the UCS-2 encoding since 1995 (at least), yet this encoding has not been widely adopted for documents - SJIS remains prevalent in Japan.

UTF-8 on the other hand has the potential for wide-scale use, since it doesn't penalize US and European users, and since many text processing programs don't need to be changed for UTF-8 support.

In the following, we will describe how to change your Linux system so it uses UTF-8 as text encoding.

The problem with the UCS-2 approach is that you end up with two versions of your program: one which understands UCS-2 text but no 8-bit encodings, and one which understands only old 8-bit encodings.

Moreover, there is an endianness issue with UCS-2 and UCS-4. edu/in-notes/iana/assignments/character-sets says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify". And RFC 2152 is even clearer: "ISO/IEC 10646-1:1993 specifies that when characters the UCS-2 form are serialized as octets, that the most significant octet appear first."
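The size and byte-order remarks above can be checked with a few lines of Python. This is only a sketch: Python is not used elsewhere in this HOWTO, the helper names encoded_sizes and byte_orders and the sample strings are made up for illustration, and Python's "utf-32" codec stands in for UCS-4. Python's UTF-8 codec also follows the later definition that stops at 4 bytes per character, so the 5- and 6-byte forms are not shown.

```python
# Illustrative sketch only: the sample strings and helper names are
# assumptions made for this example, not part of the HOWTO.

SAMPLES = {
    "US ASCII": "Hello",
    "ISO-8859-1": "Gr\u00fc\u00dfe",   # contains two non-ASCII Latin-1 letters
    "Greek": "\u03b1\u03b2\u03b3",     # alpha, beta, gamma
    "Chinese": "\u4e2d\u6587",         # two CJK characters
}


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UCS-4 (UTF-32)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # explicit order, no BOM
        "ucs-4": len(text.encode("utf-32-be")),   # explicit order, no BOM
    }


def byte_orders(ch):
    """Serialize a character as big-endian and little-endian 16-bit units."""
    return ch.encode("utf-16-be"), ch.encode("utf-16-le")


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, encoded_sizes(text))

    big, little = byte_orders("A")
    print("big endian:", big.hex(), "little endian:", little.hex())

    # The generic "utf-16" codec uses the machine's byte order and prepends
    # a byte-order mark (U+FEFF) so that a reader can detect that order.
    print("byte-order mark:", "A".encode("utf-16")[:2].hex())
```

The three Greek letters take 6 bytes in UTF-8 against 3 bytes in an 8-bit Greek character set, which is the "100% more" quoted above. The two Chinese characters take 6 bytes in UTF-8 against 4 bytes in a two-byte encoding such as GB2312, which is the "50% more".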
Microsoft, by contrast, recommends in its C/C++ development tools using machine-dependent endianness (i.e. little endian on ix86 processors) together with either a byte-order mark at the beginning of the document or some statistical heuristics.

The UTF-8 approach on the other hand keeps `char*' as the standard C string type. As a result, your program will handle US ASCII text, independently of any environment variables, and will handle both ISO-8859-1 and UTF-8 encoded text provided the LANG environment variable is set accordingly.

html

2 Display setup

We assume you have already adapted your Linux console and X11 configuration to your keyboard and locale. This is explained in the Danish/International HOWTO, and in the other national HOWTOs: Finnish, French, German, Italian, Polish, Slovenian, Spanish, Cyrillic, Hebrew, Chinese, Thai, Esperanto.

Doing so will only cause problems when you switch to Unicode.

When you call `unicode_start', the console's screen output is interpreted as UTF-8. Also, the keyboard is put into Unicode mode (see "man kbd_mode").

You will want to display characters from different scripts on the same screen. psf) which covers Latin, Cyrillic, Hebrew, Arabic scripts. It covers ISO 8859 parts 1,2,3,4,5,6,8,9,10 all at once.

To work around the constraint that a VGA font can only cover 512 characters simultaneously, he provides a rich Unicode font (2279 characters, covering Latin, Greek, Cyrillic, Hebrew, Armenian, IPA, math symbols, arrows, and more) in the typical 8x16 size and a script which extracts any 512 characters as a console font.

diff from Edmund Thomas Grimley Evans and Stanislav Voronyi.

org> has implemented a UTF-8 console terminal emulator. It uses Unicode fonts and relies on the Linux frame buffer device.

Even if they are not Unicode fonts, they will help in displaying Unicode documents: at least Netscape Communicator 4 and Java will make use of foreign fonts when available.

The following programs are useful when installing fonts:
* "mkfontdir directory" prepares a font directory for use by the X server. It needs to be executed after installing fonts in a directory.

* "xset -q | sed -e '1,/^Font Path:/d' | sed -e '2,$d' -e 's/^ //'" displays the X server's current font path.

* "xset fp+ directory" adds a directory to the X server's current font path. To add a directory permanently, add a "FontPath" line to your /etc/XF86Config file, in section "Files".

* "xset fp rehash" needs to be executed after calling mkfontdir on a directory that is already contained in the X server's current font path.

* "xfontsel" allows you to browse the installed fonts by selecting various font properties.

* "xlsfonts -fn fontpattern" lists all fonts matching a font pattern. In particular, "xlsfonts -ll -fn font" lists the font properties CHARSET_REGISTRY and CHARSET_ENCODING, which together determine the font's encoding.

The following fonts are freely available (not a complete list):

* The ones contained in XFree86, sometimes packaged in separate packages. For example, SuSE has only normal 75dpi fonts in the base `xf86' package. The other fonts are in the packages `xfnt100', `xfntbig', `xfntcyr', `xfntscl'.

* gz As already mentioned, they are useful even if you prefer XEmacs to GNU Emacs or don't use any Emacs at all.

However, this approach is more complicated, because instead of working with `Font' and `XFontStruct', the programmer has to deal with `XFontSet', and also because not all fonts in the font set need to have the same dimensions.

* Markus Kuhn has assembled fixed-width 75dpi fonts with Unicode encoding covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew scripts and many symbols. They cover ISO 8859 parts 1,2,3,4,5,7,8,9,10,13,14,15,16 all at once. These fonts are required for running xterm in UTF-8 mode.

* Markus Kuhn has also assembled double-width fixed 75dpi fonts with Unicode encoding covering Chinese, Japanese and Korean.
* Roman Czyborra has assembled an 8x16 / 16x16 75dpi font with Unicode encoding covering a huge part of Unicode. It is not fixed-width: 8 pixels wide for European characters, 16 pixels wide for Chinese characters.

    gz /usr/X11R6/lib/X11/fonts/misc
    # cd /usr/X11R6/lib/X11/fonts/misc
    # mkfontdir
    # xset fp rehash

* Primoz Peterlin has assembled the ETL family of fonts covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew scripts.

* Mark Leisher has assembled a proportional, 17 pixel high (12 point) font called ClearlyU, covering Latin, Greek...