TEI Workshop 7. Introduction to Unicode, Dec. 2006

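Encoding (編碼)
● Bit (位元): 0 or 1, the smallest unit of information a computer stores (physically, for example, an electrical or magnetic state on the disk)
● 8 bits = 1 byte (位元組): a string made up of a fixed number of bits, usually 8, e.g. 01001101, 00001111, 01010101
● 2^8 = 256 possible distinct values or states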

Binary, hexadecimal, decimal
● Different ways of writing the same value:
  00000000 = 00 = 0
  00001010 = 0A = 10
  10000000 = 80 = 128
  11111111 = FF = 255
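
The equivalences on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `notations` is made up for the example and is not part of the workshop material.

```python
def notations(value):
    """Return the (binary, hexadecimal, decimal) spellings of one byte value."""
    if not 0 <= value <= 255:
        raise ValueError("a single byte holds the values 0..255 only")
    # "08b" = 8 binary digits, zero-padded; "02X" = 2 upper-case hex digits.
    return format(value, "08b"), format(value, "02X"), str(value)


for value in (0, 10, 128, 255):
    print(" = ".join(notations(value)))

# Parsing goes the other way: int() takes the base as its second argument.
print(int("11111111", 2), int("FF", 16))
```

Each printed line shows one value in all three notations, in the same order as on the slide.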

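● Computers store information in bytes. But how does the computer know that a byte such as 01100001 stands for the character |a|?
● Code page (代碼頁): a table that assigns a graphic character or a control function to every code point
  01100001 = #x61 = 97 = |a|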
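
As a small illustration of the lookup described here, the sketch below takes the byte value #x61 and asks Python to interpret it through the US-ASCII table. US-ASCII is chosen only as a convenient example of a code page; the variable names are hypothetical.

```python
byte = 0b01100001                 # the bit pattern 01100001

print(byte)                       # the same value written in decimal
print(hex(byte))                  # ... and in hexadecimal

# Decoding applies the table: US-ASCII assigns |a| to code point 97.
char = bytes([byte]).decode("ascii")
print(char)
```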

● With a code page the computer can tell that the byte stands for the character |a|. But which |a|?

Glyph (字形) vs. character (文字)
● The computer must also decide which glyph to use when it renders a given character on screen or on paper
● Character: #x61 = 97 = |a|
● Glyph: "a" in Comic Sans MS, 60 pt, italics

Glyph (字形) vs. character (文字)
● The code page determines the character
● Font-related settings determine the glyph used to render that character
● Character: #x61 = 97 = |a|; glyph: "a" in Comic Sans MS, 60 pt, italics

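Problem
● The 256 code points of an 8-bit encoding (7-bit ASCII has only 128) are very limited and cannot accommodate CJK and many other scripts
● This is why there are so many code pages (see the list in your browser when you want to change the encoding)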
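
To make the problem concrete, the sketch below decodes one and the same byte under several legacy code pages. The byte value 0xE9 and the particular code pages are arbitrary choices for illustration, and `decode_with_code_pages` is a hypothetical helper; the codec names are the ones Python's standard library uses.

```python
def decode_with_code_pages(raw, code_pages):
    """Decode the same bytes under several code pages.

    Returns a dict mapping code page name to the decoded text,
    or to None when the bytes are not valid in that code page.
    """
    results = {}
    for name in code_pages:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


raw = bytes([0xE9])  # one byte, value 233
results = decode_with_code_pages(raw, ["ascii", "latin-1", "cp1251", "iso8859_7"])
for name, char in results.items():
    print(name, repr(char))
```

The byte is outside 7-bit ASCII, and each 8-bit code page gives it a different meaning (a Latin, a Cyrillic and a Greek letter), so the bytes alone do not say which character was intended.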

Enter the Unicode industry standard
● Managed by the Unicode Consortium, a nonprofit group with corporate, institutional, and individual members
● Originally planned as a 16-bit specification (2^16 = 65,536 code points)
● This proved too small for CJK variants and scholarly use
● The code space now runs from U+0000 to U+10FFFF, which needs 21 bits; every code point therefore fits in 32 bits, i.e. 4 bytes

History
● 1991: Unicode 1.0
● 2001: Unicode 3.1 (+ CJK Extension B)
● 2005: Unicode 4.1
● 2006: Unicode 5.0
● Unicode currently contains about 90,000 characters, but it has room for 1,114,112 code points in total

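Organisation in planes (字面)
● Plane 0: BMP (Basic Multilingual Plane, 基本多語文字面), 65,536 code points (2^16)
● Plane 1: 65,536 code points
● Plane 2: ...
● Plane 3:
● Plane 15: Private Use
● Plane 16: Private Use
● Almost all characters in everyday use are in the BMP and can be represented with just 16 bits
● Planes 1-16 are called astral planes (星界字面) and need units larger than 16 bits; most of these planes are empty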
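
Since every plane holds 65,536 code points, the plane of a character is simply its code point divided by 65,536. The following sketch shows this; `plane_of` and `PLANE_SIZE` are names invented for the example, and the sample characters are arbitrary.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(char):
    """Return the number of the plane (0..16) that holds this character."""
    return ord(char) // PLANE_SIZE


# 17 planes (0..16) of 65,536 code points each give the total code space.
print(17 * PLANE_SIZE)

# "\u6f22" is the Han character 漢; U+24759 lies in CJK Extension B.
for char in ("a", "\u6f22", "\U00024759"):
    print(f"U+{ord(char):04X}", "is on plane", plane_of(char))
```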

Exploring Unicode
● Use the Unibook:
– Can you change the font?
– Where are the 漢字 (Han characters)?
– What is the hexadecimal code point of the final character of CJK Extension A? (use Index Style View)
– On which plane is the CJK Unified Ideographs Extension B block?
– Find the function that allows you to see only the characters of one single code page

CJK blocks
● 2E80-2EF3: CJK Radicals Supplement
● 2F00-2FD5: Kangxi Radicals (康熙辭典部首)
● 31C0-31CF: CJK Strokes (筆劃)
● 3400-4DB5: Unified Han Ideographs Extension A
● 4E00-9FFF: Unified Han Ideographs
● F900-FA2D: Compatibility Ideographs
● 20000-2A6D6: Unified Han Ideographs Extension B

Han character sets (漢字字元集)

Exploring Unicode
● Go to the Unicode website and have a look around
● Look at which scripts are defined
● Look up CJK characters (Unihan)

Exploring Unicode
● Using the Unihan radical look-up tool, find the code points for the character:

● Find all 65,536 BMP characters at:

Unicode Transformation Format (UTF)
● The numerical code point of any character stays the same, but there are several ways for the computer to encode it as bytes. These different ways are called Transformation Formats; their byte values are usually written in hexadecimal
● UTF-8, UTF-16 and UTF-32 are the most important UTFs

UTF-32
● UTF-32: every character is encoded in four bytes, e.g. "a" = 00 00 00 61
● Considering that most characters live in the BMP, that is a few too many zeros, no?
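
The point about the zeros can be seen directly by encoding a character and counting the zero bytes. This is a minimal sketch using Python's built-in `utf-32-be` codec (big-endian, no byte order mark); `hex_bytes` is a helper made up for the example.

```python
def hex_bytes(data):
    """Format bytes as space-separated, upper-case hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in data)


utf32 = "a".encode("utf-32-be")   # big-endian, without a byte order mark
print(hex_bytes(utf32))
print(utf32.count(0), "of", len(utf32), "bytes are zero")
```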

UTF-16
● UTF-16: generally a two-byte encoding
● Different computer architectures store bytes in different orders (endianness)
● UTF-16 declares the byte order with a Byte Order Mark (BOM) at the start of the file
● The BOM is the Zero-Width No-Break Space character (U+FEFF)
● FE FF = big-endian; FF FE = little-endian
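
The effect of byte order and of the BOM can be tried out with Python's standard `codecs` module. The sketch below is only an illustration; `decode_utf16_file` is a hypothetical helper name, and the "file" is just a bytes object built in memory.

```python
import codecs

# The same character, stored in the two possible byte orders.
big = "a".encode("utf-16-be")
little = "a".encode("utf-16-le")
print(big.hex(), little.hex())

# The two byte order marks, as defined in the standard library.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())


def decode_utf16_file(data):
    """Decode UTF-16 bytes that start with a BOM.

    Python's generic "utf-16" codec reads the BOM, picks the matching
    byte order and removes the BOM from the result.
    """
    return data.decode("utf-16")


print(decode_utf16_file(codecs.BOM_UTF16_BE + big))
print(decode_utf16_file(codecs.BOM_UTF16_LE + little))
```

Both calls return the same text, because the BOM tells the decoder how to read the bytes that follow it.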

UTF-16 with four bytes
● UTF-16 represents a character from the "astral planes" with four bytes, as a surrogate pair
● Two bytes come from each of the two surrogate blocks ("high" surrogates start at U+D800, "low" surrogates at U+DC00)
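
The arithmetic behind a surrogate pair is short enough to write out. The sketch below follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits); the function name `to_surrogates` is made up for this example.

```python
def to_surrogates(code_point):
    """Split an astral code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # a 20-bit number
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x24759)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
print(chr(0x24759).encode("utf-16-be").hex().upper())
```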

UTF-8
● UTF-8: encodes a character in one, two, three or four bytes
● 1 byte for the 128 US-ASCII characters
● 2 bytes for Latin letters with diacritics, and for Greek, etc. (range U+0080 to U+07FF)
● 3 bytes for the rest of the BMP (Indic scripts and CJK)
● 4 bytes for characters in the astral planes
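
The ranges listed on this slide translate directly into a small function. This is a sketch for illustration only; `utf8_length` is a hypothetical name, and the sample characters (written as escapes: ē, Devanagari अ, 漢, and U+24759 from Extension B) are arbitrary picks from each range.

```python
def utf8_length(char):
    """Return how many bytes UTF-8 needs for one character."""
    code_point = ord(char)
    if code_point < 0x80:        # US-ASCII
        return 1
    if code_point < 0x800:       # U+0080 to U+07FF
        return 2
    if code_point < 0x10000:     # rest of the BMP
        return 3
    return 4                     # astral planes


for char in ("a", "\u0113", "\u0905", "\u6f22", "\U00024759"):
    # Compare the rule with what Python's UTF-8 encoder actually produces.
    print(f"U+{ord(char):04X}", utf8_length(char), len(char.encode("utf-8")))
```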

● You can quickly find the different UTFs for a character by using the code converter by Richard Ishida & François Yergeau

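Conclusion
● If the text you work with is mainly in Latin script, UTF-8 is the best choice (1 byte per ASCII character, against 2 in UTF-16)
● If the text contains Indic scripts or Chinese characters, UTF-16 is the better choice (2 bytes per BMP character, against 3 in UTF-8)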
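
The size argument behind this conclusion can be checked by encoding two short sample strings. The strings below are arbitrary examples chosen for this sketch, and `encoded_sizes` is a hypothetical helper; the little-endian codecs are used so that no BOM is counted.

```python
def encoded_sizes(text):
    """Return the size in bytes of the text in each encoding form."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


latin = "Text Encoding Initiative"   # Latin script, all ASCII
han = "\u6f22\u5b57\u7de8\u78bc"     # 漢字編碼, four BMP Han characters

print(len(latin), "characters:", encoded_sizes(latin))
print(len(han), "characters:", encoded_sizes(han))
```

For the ASCII string UTF-8 is half the size of UTF-16, while for the Han string UTF-16 needs two bytes per character and UTF-8 needs three.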

Exploring Unicode: the letter "a"
● a
– decimal code point: 97
– 61 (hexadecimal, UTF-8)
– 61 00 (hexadecimal, UTF-16, little-endian)
– 00 61 (hexadecimal, UTF-16, big-endian)
– 00 00 00 61 (hexadecimal, UTF-32, big-endian)
● How about ē?

And 𤝙?

● Decimal: 149337
● UTF-8: F0 A4 9D 99
● UTF-16: D851 DF59
● UTF-32: 00024759
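
The exercises in this section can be reproduced with a very small converter. The sketch below is not the online code converter mentioned earlier, only a rough stand-in built on Python's standard codecs; `all_forms` and `hex_bytes` are names invented for the example. The UTF-8 bytes F0 A4 9D 99 given above belong to code point U+24759, which is used as the third test character.

```python
def hex_bytes(data):
    """Format bytes as space-separated, upper-case hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in data)


def all_forms(char):
    """Return the code point of a character and its bytes in each UTF."""
    code_point = ord(char)
    return {
        "decimal": code_point,
        "code point": f"U+{code_point:04X}",
        "UTF-8": hex_bytes(char.encode("utf-8")),
        "UTF-16LE": hex_bytes(char.encode("utf-16-le")),
        "UTF-16BE": hex_bytes(char.encode("utf-16-be")),
        "UTF-32BE": hex_bytes(char.encode("utf-32-be")),
    }


# "a", then e with macron (U+0113), then the Extension B character U+24759.
for char in ("a", "\u0113", "\U00024759"):
    for label, value in all_forms(char).items():
        print(f"{label:>10}: {value}")
    print()
```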

© marcus bingenheimer