TEI Workshop 7. Introduction to Unicode, Dec. 2006
Encoding
● Bit: 0 or 1, the smallest unit of data the computer stores
● 8 bits = 1 byte (a string of a fixed number of bits, usually 8)
● 2^8 = 256 possible distinct values or states
Binary, hexadecimal, decimal
● Different ways of writing the same value:
00000000 = 00 = 0
00001010 = 0A = 10
10000000 = 80 = 128
11111111 = FF = 255
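The three notations can be checked with a few lines of Python. This is only a minimal sketch; the helper name `notations` is made up for this illustration.

```python
def notations(value: int) -> tuple[str, str, str]:
    """Return one byte value as binary, hexadecimal and decimal strings."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return f"{value:08b}", f"{value:02X}", str(value)


# Print the same four byte values in all three notations.
for v in (0, 10, 128, 255):
    print(" = ".join(notations(v)))
```

Reading a notation back works the same way in reverse: `int("0A", 16)` and `int("00001010", 2)` both give 10.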
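● The computer stores information in bytes. But how does it know that a byte such as #x61 stands for the character |a|?
● Code page: the assignment of a graphic character or a control function to every code point
#x61 = 97 = |a|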
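As a small illustration of the lookup, Python can treat ASCII as the code page and map the byte value to a character and back. The variable names are arbitrary.

```python
raw = bytes([0x61])          # one byte with the value #x61 (decimal 97)
char = raw.decode("ascii")   # look the value up in the ASCII table

print(raw[0], hex(raw[0]), char)

# The reverse direction: from the character back to its numeric value.
value = ord(char)
```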
● With a code page the computer can tell that the byte stands for |a|, but which |a|?
glyph vs. character
● The computer must also decide which glyph to use when it renders a given character on screen or on paper
character: #x61 = 97 = |a|
glyph: a in Comic Sans MS, 60 pt, italic
● The code page determines the character
● Font settings determine the glyph used to render that character
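Problem
● The 256 code points of an 8-bit encoding (7-bit ASCII has only 128) are very limited and cannot accommodate CJK and many other scripts
● This is why there are so many code pages (see the list in your browser when you want to change the encoding)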
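The consequence of having many code pages can be seen by decoding one and the same byte under several of them. The sketch below uses three code pages that ship with Python (Latin-1, Windows-1251 and ISO 8859-7); the choice of the byte #xE9 is arbitrary.

```python
raw = bytes([0xE9])  # a single byte above the 7-bit ASCII range

# Three of the many 8-bit code pages known to Python.
CODE_PAGES = ("latin-1", "cp1251", "iso8859_7")

# The same byte turns into a different character under each code page.
decoded = {name: raw.decode(name) for name in CODE_PAGES}

for name, char in decoded.items():
    print(f"{name}: {char} (U+{ord(char):04X})")
```

Under Latin-1 the byte is a Latin letter with an accent, under Windows-1251 a Cyrillic letter, and under ISO 8859-7 a Greek letter, so a file is unreadable unless the reader knows which code page was used.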
Enter the Unicode industry standard
● Managed by the Unicode Consortium, a nonprofit group with corporate, institutional, and individual members
● Originally planned as a 16-bit specification (2^16 = 65,536 code points)
● -> too small for CJK variants and scholarly needs
● -> the Unicode code space now runs from U+0000 to U+10FFFF (21 bits), so every code point fits in 32 bits, i.e. 4 bytes
History
● 1991 Unicode 1.0
● 2001 Unicode 3.1 (+Ext B)
● 2005 Unicode 4.1
● 2006 Unicode 5.0
● Unicode currently contains nearly 100,000 characters, but it has room for 1,114,112 code points in total.
Organisation in planes
Plane 0: BMP (Basic Multilingual Plane), 65,536 code points (2^16)
Plane 1: 65,536 code points
Plane 2: ...
Plane 3: ...
Plane 15: Private Use
Plane 16: Private Use
● Almost all characters are in the BMP (16 bits are enough to represent them)
● Planes 1-16 are known as the astral planes; their code points need more than 16 bits. Most of these planes are empty.
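The plane arithmetic can be sketched in Python: 17 planes of 2^16 code points give the total of 1,114,112, and the plane of a code point is simply its value divided by 65,536. The names `PLANE_SIZE`, `PLANE_COUNT` and `plane_of` are invented for this example.

```python
PLANE_SIZE = 2 ** 16   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 to 16


def plane_of(code_point: int) -> int:
    """Return the number of the plane that contains the code point."""
    if not 0 <= code_point < PLANE_SIZE * PLANE_COUNT:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16   # same as integer division by 65,536


print(PLANE_SIZE * PLANE_COUNT)   # size of the whole code space
print(plane_of(ord("a")))         # a character in the BMP
print(plane_of(0x24759))          # a character in an astral plane
```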
Exploring Unicode
● Use the Unibook:
– Can you change the font?
– Where are the 漢字 (Han characters)?
– What is the hexadecimal code point of the final character of CJK Extension A? (use Index Style View)
– On which plane is the CJK Unified Ideographs Extension B block?
– Find the function that allows you to see only the characters of one single code page
CJK blocks
● 2E80-2EF3: Radical Supplement
● 2F00-2FD5: Kangxi Radicals (康熙辭典部首)
● 31C0-31CF: Strokes
● 3400-4DB5: Unified Han Ideographs Extension A
● 4E00-9FFF: Unified Han Ideographs
● F900-FA2D: Compatibility Ideographs
● 20000-2A6D6: Unified Han Ideographs Extension B
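A block lookup over these ranges might look like the sketch below. The table copies the ranges from the list above; the start points of Extension A (U+3400) and Extension B (U+20000) are filled in here as assumptions, and the names `CJK_BLOCKS` and `cjk_block` are hypothetical.

```python
from typing import Optional

# (first code point, last code point, block name), following the slide.
# Assumption: Extension A starts at U+3400 and Extension B at U+20000.
CJK_BLOCKS = [
    (0x2E80, 0x2EF3, "Radical Supplement"),
    (0x2F00, 0x2FD5, "Kangxi Radicals"),
    (0x31C0, 0x31CF, "Strokes"),
    (0x3400, 0x4DB5, "Unified Han Ideographs Extension A"),
    (0x4E00, 0x9FFF, "Unified Han Ideographs"),
    (0xF900, 0xFA2D, "Compatibility Ideographs"),
    (0x20000, 0x2A6D6, "Unified Han Ideographs Extension B"),
]


def cjk_block(char: str) -> Optional[str]:
    """Return the name of the CJK block holding char, or None."""
    code_point = ord(char)
    for first, last, name in CJK_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


print(cjk_block("\u4e2d"))       # a common Han character
print(cjk_block("\U00024759"))   # a character outside the BMP
print(cjk_block("a"))            # not in any CJK block
```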
Han character sets
Exploring Unicode
● Go to the Unicode website and have a look around
● Have a look at which scripts are defined
● Look up CJK characters (Unihan)
Exploring Unicode
● Using the Unihan radical look-up tool, find the code points for the character:
● Find all 65,536 BMP characters at:
Unicode Transformation Format (UTF)
● The numerical code point of any character stays the same, but there are several ways for the computer to encode it. These different ways are called Transformation Formats (their byte values are written in hex)
● UTF-8, UTF-16 and UTF-32 are the most important UTFs
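To see one code point under the three formats, the sketch below encodes a character with Python's built-in codecs and prints the bytes in hex. The function name `utf_forms` is made up, and the big-endian variants are chosen only so that no byte order mark is added.

```python
def utf_forms(char: str) -> dict[str, str]:
    """Return the code point of char and its bytes in three UTFs (hex)."""
    forms = {"code point": f"U+{ord(char):04X}"}
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        forms[codec] = char.encode(codec).hex(" ").upper()
    return forms


for char in ("a", "\u4e2d"):
    print(utf_forms(char))
```

The code point is the same in every row; only the byte sequence that stores it changes.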
UTF-32
● UTF-32: every character is encoded in four bytes, e.g. “a” = 00 00 00 61
● Considering that most characters live in the BMP, that is a few too many zeros, no?
UTF-16
● UTF-16: generally a two-byte encoding
● Different computer architectures store bytes in different order (endianness)
● UTF-16 can declare the byte order with a Byte Order Mark (BOM) at the start of the file
● The BOM is the Zero-Width No-Break Space character (U+FEFF)
● FE FF = big-endian; FF FE = little-endian
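A byte order check based on the BOM could be sketched as follows. The constants come from Python's standard `codecs` module; the function name `utf16_byte_order` is hypothetical, and the sketch assumes the data is UTF-16 in the first place.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown (no BOM)"


# The letter "a" stored in both byte orders, each with its BOM in front.
big = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")      # FE FF 00 61
little = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")   # FF FE 61 00

print(utf16_byte_order(big), big.hex(" "))
print(utf16_byte_order(little), little.hex(" "))
```

Python's generic `"utf-16"` decoder reads the BOM itself, so both byte strings decode to the same text.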
UTF-16 with four bytes
● UTF-16 represents a character from the “astral planes” with four bytes, as a surrogate pair
● The pair consists of one 16-bit unit (two bytes) from each of the two surrogate blocks (“high” surrogates start at U+D800, “low” surrogates at U+DC00)
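The surrogate arithmetic can be written out in a few lines. This is a sketch of the standard calculation (subtract #x10000, then split the remaining 20 bits into two halves of 10 bits); the function names are invented for the example.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split an astral-plane code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x24759)
print(f"{high:04X} {low:04X}")
print(f"U+{from_surrogates(high, low):X}")
```

The code point U+24759 used here is the one whose UTF-16 form appears as D851 DF59 at the end of these slides.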
UTF-8
● UTF-8: encodes a character in one, two, three or four bytes
● 1 byte for the 128 US-ASCII characters
● 2 bytes for Latin letters with diacritics, and for Greek, etc. (range U+0080 to U+07FF)
● 3 bytes for the rest of the BMP (Indic scripts and CJK)
● 4 bytes for characters in the astral planes
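The ranges above translate directly into a length rule. The sketch below states the rule and compares it with what Python's own UTF-8 encoder produces; `utf8_length` is a made-up name.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for the given code point."""
    if code_point < 0:
        raise ValueError("code points are not negative")
    if code_point < 0x80:         # US-ASCII
        return 1
    if code_point < 0x800:        # U+0080 to U+07FF
        return 2
    if code_point < 0x10000:      # rest of the BMP
        return 3
    if code_point <= 0x10FFFF:    # astral planes
        return 4
    raise ValueError("outside the Unicode code space")


# Compare the rule with the real encoder for one character of each kind.
for char in ("a", "\u0113", "\u4e2d", "\U00024759"):
    print(f"U+{ord(char):04X}", utf8_length(ord(char)), len(char.encode("utf-8")))
```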
● You can quickly find the different UTFs for a character by using the code converter by Richard Ishida & François Yergeau
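Conclusion
● If the texts you work with are mainly in Latin script, UTF-8 is the best choice (most characters take 1 byte)
● If the texts contain Indic scripts or Chinese characters, UTF-16 is preferable (2 bytes per BMP character instead of 3 in UTF-8)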
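The size argument behind this conclusion can be checked by encoding a short Latin string and a short Han string both ways. The sample strings are arbitrary (the Han one is the slide title 漢字字元集), and `encoded_sizes` is a hypothetical helper.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes text needs in UTF-8 and in UTF-16."""
    return {codec: len(text.encode(codec)) for codec in ("utf-8", "utf-16-le")}


latin = "Unicode workshop"   # 16 ASCII characters
han = "漢字字元集"             # 5 Han characters, all in the BMP

print("latin:", encoded_sizes(latin))
print("han:  ", encoded_sizes(han))
```

For the Latin sample UTF-16 doubles the size, while for the Han sample it saves a third compared with UTF-8. Real documents usually mix both kinds of text (markup is ASCII), so the actual gain depends on the proportions.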
Exploring Unicode: the letter "a"
● a
– decimal code point: 97
– 61 (hexadecimal, UTF-8)
– 61 00 (hexadecimal, UTF-16, little-endian)
– 00 61 (hexadecimal, UTF-16, big-endian)
– 00 00 00 61 (UTF-32)
● How about ē?
And the character U+24759?
● Decimal: 149337
● UTF-8: F0 A4 9D 99
● UTF-16: D851 DF59
● UTF-32: 00 02 47 59
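The byte sequences on this slide can be decoded again to confirm that they all name the same code point. The sketch assumes big-endian order for the UTF-16 value, which is how the slide writes it; the variable names are arbitrary.

```python
utf8_bytes = bytes.fromhex("F0 A4 9D 99")
utf16_bytes = bytes.fromhex("D851 DF59")   # assumed to be big-endian

from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16-be")

code_point = ord(from_utf8)
print(f"U+{code_point:X}", code_point)            # hex and decimal
print(from_utf8 == from_utf16)                    # same character either way
print(from_utf8.encode("utf-32-be").hex(" ").upper())
```

The same steps answer the earlier question about ē: `"ē".encode("utf-8")` gives the two bytes C4 93.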
© marcus bingenheimer