
1 TEI Workshop 7. Introduction to Unicode, Dec. 2006

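2 Encoding
● Bit: 0 or 1, the smallest unit of stored information (the slide describes it as the smallest unit of electrical charge on the hard disk)
● 8 bits = 1 byte (a string made up of a fixed number of bits, usually 8), e.g. 01001101, 00001111, 01010101
● 2^8 = 256 possible distinct values or states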

3 Binary, hexadecimal, decimal
● Different ways of writing the same value (binary = hexadecimal = decimal):
00000000 = 00 = 000
00001010 = 0A = 010
10000000 = 80 = 128
11111111 = FF = 255
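
The table above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `show_bases` is made up for this example and does not come from the slides.

```python
def show_bases(value: int) -> str:
    """Return one value written in binary, hexadecimal and decimal."""
    # 08b = 8 binary digits, 02X = 2 uppercase hex digits, 03d = 3 decimal digits
    return f"{value:08b} = {value:02X} = {value:03d}"


for value in (0, 10, 128, 255):
    print(show_bases(value))
```

Going the other way, `int("11111111", 2)` and `int("FF", 16)` both return 255.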

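4 ● Computers store information in bytes. But how does the computer know that 01100001 stands for the character |a|?
● Code page: the assignment of a graphic character or control function to every code point
01100001 = #x61 = 97 = |a|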
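
As a small sketch of that lookup, the byte from the slide can be decoded with Python's built-in ASCII table. The variable names are chosen for this example only.

```python
byte_value = 0b01100001  # the bit pattern from the slide

# The same value written in hexadecimal and in decimal.
same_value = (byte_value == 0x61 == 97)

# Decoding the single byte with the ASCII table yields a character.
char = bytes([byte_value]).decode("ascii")

print(same_value)
print(char, ord(char))
```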

5 ● With a code page the computer can tell that the byte 01100001 stands for |a|, but which |a|?

6 Glyph vs. character 1
● The computer must also decide which glyph to use when it renders a given character on screen or on paper
[Figure: 01100001 = #x61 = 97 = |a|, drawn as an "a" in Comic Sans MS, 60 pt, italic]

7 Glyph vs. character 2
● The code page determines the character
● Font-related settings determine the glyph used to render that character
[Figure: 01100001 = #x61 = 97 = |a|, drawn as an "a" in Comic Sans MS, 60 pt, italic]

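8 Problem
● The 256 code points of an 8-bit encoding (7-bit ASCII has only 128) are very limited and cannot cover CJK and many other scripts
● This is why there are so many code pages (see the list in your browser when you want to change the encoding)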
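
The consequence of having many code pages can be sketched in Python: the same byte turns into a different character depending on which code page is used to read it. The three code pages below are an arbitrary selection for illustration, not a list taken from the slides.

```python
raw = bytes([0xE9])  # a single byte with the value 233

# Decode the same byte under three different 8-bit code pages.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp1251", "iso8859_7")}

for name, char in decoded.items():
    print(f"{name}: {char} (U+{ord(char):04X})")

# 7-bit ASCII has no entry for values above 127 at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("decodable as ASCII:", ascii_ok)
```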

9 Enter the Unicode industry standard
● Managed by the Unicode Consortium, a nonprofit group with corporate, institutional, and individual members
● Originally planned as a 16-bit specification (2^16 = 65,536 code points)
● -> too small for CJK variants and scholarly use
● -> the Unicode code space now runs from U+0000 to U+10FFFF (21 bits), so every code point fits into 32 bits, i.e. 4 bytes

10 History
● 1991 Unicode 1.0
● 2001 Unicode 3.1 (adds CJK Extension B)
● 2005 Unicode 4.1
● 2006 Unicode 5.0
● Unicode currently contains over 90,000 characters, but it has room for 1,114,112 code points in total

11 Organisation in planes
Plane 0: BMP (Basic Multilingual Plane), 65,536 code points (2^16)
Plane 1: 65,536 code points
Plane 2: ...
Plane 3: ...
...
Plane 15: Private Use
Plane 16: Private Use
● Almost all commonly used characters are in the BMP (16 bits are enough to represent them)
● Planes 1-16 are called the astral planes; their code points need more than 16 bits, and most of these planes are empty
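
The plane layout is simple arithmetic: 17 planes of 65,536 code points each give the 1,114,112 code points mentioned earlier, and the plane of a code point is its value divided by 65,536. A minimal sketch, with `plane_of` as a made-up helper name:

```python
PLANE_SIZE = 2 ** 16   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 to 16
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1  # 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16  # same as integer division by 65,536


print(PLANE_SIZE * PLANE_COUNT)
print(plane_of(ord("a")))
print(plane_of(0x20000))
```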

12 Exploring Unicode
● Use the Unibook:
– Can you change the font?
– Where are the Chinese characters (漢字)?
– What is the hexadecimal codepoint of the final character of CJK Extension A? (use Index Style View)
– On which plane is the CJK Unified Ideographs Extension B?
– Find the function that allows you to see only the characters of one single codepage

13 CJK blocks
● 2E80-2EF3: Radical Supplement
● 2F00-2FD5: Kangxi Radicals (radicals of the Kangxi Dictionary)
● 31C0-31CF: Strokes
● 3400-4DB5: Unified Han Ideographs Extension A
● 4E00-9FFF: Unified Han Ideographs
● F900-FA2D: Compatibility Ideographs
● 20000-2A6D6: Unified Han Ideographs Extension B
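
A range list like this maps directly onto a lookup table. The sketch below uses the ranges exactly as given on the slide, which reflect the state of Unicode in 2006 and may differ from current versions; `CJK_BLOCKS` and `cjk_block` are names invented for this example.

```python
from typing import Optional

# (first code point, last code point, name), copied from the slide.
CJK_BLOCKS = [
    (0x2E80, 0x2EF3, "Radical Supplement"),
    (0x2F00, 0x2FD5, "Kangxi Radicals"),
    (0x31C0, 0x31CF, "Strokes"),
    (0x3400, 0x4DB5, "Unified Han Ideographs Extension A"),
    (0x4E00, 0x9FFF, "Unified Han Ideographs"),
    (0xF900, 0xFA2D, "Compatibility Ideographs"),
    (0x20000, 0x2A6D6, "Unified Han Ideographs Extension B"),
]


def cjk_block(char: str) -> Optional[str]:
    """Return the name of the listed range containing char, or None."""
    code_point = ord(char)
    for first, last, name in CJK_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


print(cjk_block("\u6f22"))       # the character 漢
print(cjk_block("\U00024759"))   # a character beyond the BMP
print(cjk_block("a"))
```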

14 Han character sets

15 Exploring Unicode
● Go to the Unicode website (www.unicode.org) and have a look around
● For a look at what scripts are defined: http://www.unicode.org/charts/
● Look up CJK characters (Unihan): http://www.unicode.org/charts/unihanrsindex.html

16 Exploring Unicode
● Using the Unihan radical look-up tool (www.unicode.org/charts/unihanrsindex.html), find the codepoints for the character: [character shown as an image on the slide]

17 ● Find all 65,536 BMP characters at: http://unicode.coeurlumiere.com/

18 Unicode Transformation Format (UTF)
● The numerical codepoint of any character stays the same, but there are several ways for the computer to encode it. These different ways are called Transformation Formats (their byte values are usually written in hexadecimal)
● UTF-8, UTF-16 and UTF-32 are the most important UTFs. See a comparison at: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

19 UTF-32
● UTF-32: every character is encoded in four bytes, e.g. "a" = 00 00 00 61
● Considering that most characters live in the BMP, that is a few too many zeros, no?
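
The four-byte form can be checked with Python's built-in codecs. This is a sketch only; `hex_bytes` is a helper written for this example. It uses the `utf-32-be` codec, because the plain `utf-32` codec additionally writes a byte order mark in front of the text.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


encoded = "a".encode("utf-32-be")  # big-endian, no byte order mark
print(hex_bytes(encoded))
print(len(encoded), "bytes for one character")
```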

20 UTF-16
● UTF-16: generally a two-byte encoding
● Different computer architectures store bytes in different orders (endianness)
● UTF-16 declares the byte order with a Byte Order Mark (BOM) at the start of the file
● The BOM is the Zero-Width No-Break Space character (U+FEFF)
● FE FF = big-endian; FF FE = little-endian
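
A short sketch of the BOM in practice, using the constants from Python's standard `codecs` module. The helper `hex_bytes` and the variable names are made up for this example.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


text = "a"
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(hex_bytes(big))     # BOM FE FF, then the character
print(hex_bytes(little))  # BOM FF FE, then the character

# The generic "utf-16" decoder reads the BOM, picks the byte order, and drops the BOM.
print(big.decode("utf-16") == little.decode("utf-16") == text)
```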

21 UTF-16 with four bytes
● UTF-16 represents a character from the "astral planes" with four bytes, as a surrogate pair
● Two bytes each from the two surrogate blocks ("high" surrogates starting at U+D800, "low" surrogates starting at U+DC00)
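
The surrogate pair is computed by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two halves of 10 bits. A minimal sketch of that arithmetic, with `surrogate_pair` as a made-up function name, checked against Python's own UTF-16 encoder:

```python
from typing import Tuple


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split an astral-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = code_point - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


high, low = surrogate_pair(0x24759)
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in encoder (big-endian, no BOM).
print(chr(0x24759).encode("utf-16-be").hex().upper())
```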

22 UTF-8
● UTF-8: encodes a character in one, two, three or four bytes
● 1 byte for the 128 US-ASCII characters
● 2 bytes for Latin letters with diacritics, and for Greek, etc. (range U+0080 to U+07FF)
● 3 bytes for the rest of the BMP (Indic scripts and CJK)
● 4 bytes for characters in the astral planes
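
The four length classes can be observed by encoding one sample character from each range. The sample characters and the helper name `utf8_length` are choices made for this sketch, not taken from the slides.

```python
def utf8_length(char: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))


samples = [
    "a",           # U+0061, US-ASCII
    "\u00e9",      # U+00E9, Latin letter with diacritic
    "\u03b1",      # U+03B1, Greek
    "\u0905",      # U+0905, Devanagari
    "\u6f22",      # U+6F22, CJK
    "\U00024759",  # U+24759, astral plane (CJK Extension B)
]

for char in samples:
    print(f"U+{ord(char):04X}: {utf8_length(char)} byte(s)")
```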

23

24 ● You can quickly find the different UTFs for a character by using the code converter by Richard Ishida & François Yergeau: http://people.w3.org/rishida/scripts/uniview/conversion

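25 Conclusion
● If the text you are working with is mainly in Latin script, UTF-8 is the best choice
● If the text contains Indic scripts or Chinese characters, UTF-16 is the better choice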
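
The size argument behind this recommendation can be sketched by encoding a Latin and a Chinese sample both ways. The sample strings and the helper `encoded_sizes` are chosen for illustration, and storage size is only one of several considerations when picking an encoding.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


latin_sample = "Unicode"
han_sample = "\u6f22\u5b57\u5b57\u5143\u96c6"  # 漢字字元集

for label, sample in (("Latin", latin_sample), ("Han", han_sample)):
    utf8_size, utf16_size = encoded_sizes(sample)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```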

26 Exploring Unicode: the letter "a"
● a
– decimal codepoint: 97
– 61 (hexadecimal, UTF-8)
– 61 00 (hexadecimal, UTF-16, little-endian)
– 00 61 (hexadecimal, UTF-16, big-endian)
– 00 00 00 61 (UTF-32)
● How about ē?
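
One way to answer the question for ē is to let Python do the encoding. The sketch assumes the precomposed character U+0113 (a decomposed "e" plus combining macron would give different bytes); `hex_bytes` is a helper written for this example.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and format the bytes as uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


char = "\u0113"  # ē as a single precomposed code point (assumption)

print("decimal codepoint:", ord(char))
print("UTF-8:           ", hex_bytes(char, "utf-8"))
print("UTF-16 (little): ", hex_bytes(char, "utf-16-le"))
print("UTF-16 (big):    ", hex_bytes(char, "utf-16-be"))
print("UTF-32 (big):    ", hex_bytes(char, "utf-32-be"))
```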

27 And 𤝙 (U+24759)?

28 ● Decimal: 149337
● UTF-8: F0 A4 9D 99
● UTF-16: D851 DF59
● UTF-32: 00 02 47 59
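
These values can be cross-checked in the reverse direction: decoding each byte sequence should lead back to the same code point. A minimal sketch, assuming big-endian byte order for UTF-16 and UTF-32; the dictionary name is made up for this example.

```python
# Byte sequences copied from the slide, keyed by the codec used to read them.
encoded_forms = {
    "utf-8": bytes.fromhex("F0A49D99"),
    "utf-16-be": bytes.fromhex("D851DF59"),
    "utf-32-be": bytes.fromhex("00024759"),
}

code_points = {name: ord(data.decode(name)) for name, data in encoded_forms.items()}

for name, code_point in code_points.items():
    print(f"{name}: decimal {code_point}, U+{code_point:X}")
```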

29 © Marcus Bingenheimer

