Some notes on making multilingual webpages
G. Benoit, October 25, 2005

Making a webpage is easy; making one that supports multiple languages and scripts, however, has turned into quite a set of contradictions.
One way of making a page is to write characters by their decimal or hexadecimal code values. The codes must be escaped; that is, a special sequence signals that the next set of characters should be interpreted by the browser differently than usual. For example, there are three ways of representing characters:

  1. Character Entity References - use a complete name for the character. For the em dash, write the ampersand &, then the name of the character (here, mdash), and terminate the escape sequence with the semicolon ; so that &mdash; displays as —
  2. Numeric Character References - use &# followed by the decimal code and a semicolon, e.g., &#8212; displays as —
    here is decimal 2309 [Devanagari letter, short a]: &#2309; displays as अ
  3. Hexadecimal Character References - use &#x followed by the hexadecimal code and a semicolon, e.g., &#x2014; displays as —
    here is hex 0905: &#x0905; displays as अ
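
The three forms can be generated and checked mechanically. The following is a minimal sketch in Python, not part of the original page; the function name character_references is made up for illustration, and it assumes the standard-library html module, whose entity table covers named entities such as mdash but has no name for the Devanagari letter.

```python
import html
from html.entities import codepoint2name


def character_references(ch: str) -> dict:
    """Return the escaped forms of a single character (hypothetical helper)."""
    cp = ord(ch)
    refs = {
        "decimal": f"&#{cp};",    # numeric character reference, decimal
        "hex": f"&#x{cp:04X};",   # numeric character reference, hexadecimal
    }
    # Only some characters have a named entity; add it when one exists.
    name = codepoint2name.get(cp)
    if name is not None:
        refs["entity"] = f"&{name};"
    return refs


em_dash = character_references("\u2014")   # the em dash
letter_a = character_references("\u0905")  # Devanagari letter, short a

# Every form decodes back to the same character.
for ref in list(em_dash.values()) + list(letter_a.values()):
    print(ref, "->", html.unescape(ref))
```

Note that the Devanagari letter only gets the two numeric forms, which is why numeric references matter for most non-European scripts.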

First of all, make sure the character set is UTF-8. A web page can declare only one character set, so it has to be one that covers every script on the page. Right after the <head> tag, this line should appear: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
You'll need a font that can display the characters. If the page is multiscript, and you're using UTF-8, then you need a font set that handles all the characters. Arial Unicode MS has more than 56,000 glyphs; Lucida Grande and Palatino Linotype (not the regular Palatino that came with your OS) have what you'll need for European languages, but not Chinese-Japanese-Korean (CJK). The next version of this page will use a language-specific font as the default and give users an option to choose the font.
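
To make the UTF-8 declaration concrete, here is a small Python sketch that builds a page containing the meta line and encodes it as UTF-8 bytes, so that the declared character set matches what is actually stored. The function name build_page and the sample text are assumptions made for illustration only.

```python
import html


def build_page(title: str, body: str) -> bytes:
    """Build a minimal HTML page and return it encoded as UTF-8 (hypothetical helper)."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        # The declaration goes right after the <head> tag.
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        f"<title>{html.escape(title, quote=False)}</title>\n"
        "</head>\n<body>\n"
        f"<p>{html.escape(body, quote=False)}</p>\n"
        "</body>\n</html>\n"
    )
    # The bytes must really be UTF-8; the meta line alone does not convert anything.
    return page.encode("utf-8")


page_bytes = build_page("Multiscript test", "Devanagari \u0905 and an em dash \u2014")
print(len(page_bytes), "bytes")
```

With the page stored as UTF-8, the characters can be typed directly and the escaped forms become optional rather than required.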
(Some browsers require reloading the page to return from the codepage back to this web page.)
NB: Not all of the tables are listed here. See also http://www.unicode.org/charts

Code Tables: u = hexadecimal range; numbers in () are decimal.
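
The relation between the two notations is a plain base conversion. As a hedged illustration in Python (the function name range_to_decimal is hypothetical, and the Devanagari block 0900 to 097F is used as a sample range rather than taken from the tables):

```python
def range_to_decimal(first_hex: str, last_hex: str) -> tuple:
    """Convert a hexadecimal code range to its decimal endpoints (hypothetical helper)."""
    return int(first_hex, 16), int(last_hex, 16)


# Sample range: the Devanagari block, which contains the letter at hex 0905.
devanagari = range_to_decimal("0900", "097F")
print(f"u0900-097F ({devanagari[0]}-{devanagari[1]})")
```

The decimal endpoints are the numbers to use after &# in a numeric character reference, and the hexadecimal ones go after &#x.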


©Benoit, 2005