11. Digitization of Text 文字數位化 September 2006 Marcus Bingenheimer


1 11. Digitization of Text
TEI Workshop: 11. Digitization of Text, September 2006, Marcus Bingenheimer

2 Humanities Computing / Digital Humanities 1
Main applications (so far): digitization and digital editions of encyclopedias, dictionaries, bibliographies and indices; new types of knowledge bases (GIS etc.)

3 Humanities Computing / Digital Humanities 2
Digital publication and distribution of text and images. New forms of information production and dissemination: wiki, blog... New research questions: authorship attribution and stylistic analysis; literary analysis; linguistic analysis and corpus linguistics.

4 Example: authorship attribution 1
Mosteller and Wallace (1964), Inference and Disputed Authorship: The Federalist. The Federalist (1787-8): 85 papers by Hamilton, Madison and Jay, 12 of them of disputed authorship (either Hamilton or Madison).

5 Example: authorship attribution 2
Counting sentence length: ☹. Quantitative analysis of vocabulary usage: ☺. Compare the frequency of 30 marker words, e.g. “upon”: Hamilton (2.93 per 1000 words), Madison (0.16 per 1000 words).
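
The marker-word comparison above can be sketched in a few lines of Python. This is only an illustration of the idea of a rate per 1000 words, not Mosteller and Wallace's actual statistical procedure. The function name `rate_per_1000`, the simple tokenizer, and both sample sentences are assumptions made up for this example.

```python
import re
from collections import Counter


def rate_per_1000(text, marker):
    """Return how often `marker` occurs per 1000 word tokens in `text`."""
    # Very simple tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[marker] / len(tokens)


# Toy samples invented for illustration; they are not taken from The Federalist.
sample_a = "the power rests upon the people and upon the states"
sample_b = "the power rests on the people and on the states"

print(rate_per_1000(sample_a, "upon"))
print(rate_per_1000(sample_b, "upon"))
```

With real texts, the same function would be run over each author's undisputed papers and over each disputed paper, once per marker word.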

6 Example: analysing literary texts
Estrella Irizarry (1992) compares two Mexican writers, Octavio Paz (♂) and Rosario Castellanos (♀), on language use and gender. ♀ uses more and longer questions. ♂ uses more words like ‘always’ and ‘absolutely’, i.e. expressions of certitude. Words of compassion (taken from a thesaurus) appear only in the ♀ work.

7 Example: corpus linguistics 1
British National Corpus (BNC): 100 million words, in samples of 45,000 words. Markup with TEI (P3). Automated part-of-speech (PoS) tagging.

8 Example: corpus linguistics 2
The BNC is: balanced (written and spoken material from diverse sources); monolingual (only English); synchronic (20th century).

9 Core Technologies
XML technologies (XSLT, XQuery, SVG...) (since 1998); markup standards (TEI (Text Encoding Initiative), Dublin Core, EAD...); web standards (HTML, RSS...); databases.

10 5 stages in the production of high-quality digital texts
1. Input
2. Basic Markup
3. Deep Markup
4. Content Delivery
5. Archiving

11 1. Input
Basic data input of texts: keyboarding (double keying) or scanning (OCR: Optical Character Recognition) ⇨ a file (perhaps a .txt file)
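
In double keying, two people type the same text independently and the two versions are compared, so that disagreements point to likely typing errors. A minimal sketch of that comparison step follows. It assumes both typists preserved the line breaks of the source; the function name `keying_disagreements` and the sample strings are hypothetical.

```python
from itertools import zip_longest


def keying_disagreements(version_a, version_b):
    """Return (line_number, text_a, text_b) for every line where two keyings differ."""
    lines_a = version_a.splitlines()
    lines_b = version_b.splitlines()
    return [
        (number, a, b)
        # zip_longest pads the shorter version with "" so missing lines are reported too.
        for number, (a, b) in enumerate(zip_longest(lines_a, lines_b, fillvalue=""), start=1)
        if a != b
    ]


# Two hypothetical keyings of the same two lines; typist B drops a letter.
typist_a = "To the People of the State\nAfter an unequivocal experience"
typist_b = "To the People of the State\nAfter an unequivocal experence"

for number, a, b in keying_disagreements(typist_a, typist_b):
    print(f"line {number}: {a!r} vs {b!r}")
```

Each reported line still has to be checked against the source by a person; the script only shows where the two versions disagree.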

12 2. Basic Markup
File management system (formats, file names etc.); metadata about the digitization process (e.g. teiHeader); basic structural content markup (e.g. with TEI) ⇨ probably an .xml file
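
As a rough sketch of what this stage produces, the following Python script wraps plain paragraphs in a minimal TEI document with a teiHeader. It assumes TEI P5 element names and the P5 namespace (older P4 files use a different root element); the helper names `tei` and `build_tei`, the title, and the sample paragraphs are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name for a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, source_note, paragraphs):
    """Build a minimal TEI tree: a teiHeader plus a body of <p> elements."""
    root = ET.Element(tei("TEI"))
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Unpublished working file."
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = source_note
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for content in paragraphs:
        ET.SubElement(body, tei("p")).text = content
    return root


root = build_tei(
    title="Sample text",
    source_note="Keyed from a printed edition (hypothetical).",
    paragraphs=["First paragraph.", "Second paragraph."],
)
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

The output is well-formed XML; whether it is valid TEI still has to be checked against a TEI schema in an XML editor.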

13 3. Scholarly in-depth markup
Adding value through encoding: encode (with TEI) what you wish to say about the text ⇨ a TEI-conformant .xml file (hopefully)
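
One way to see the value added by in-depth markup: once names are tagged, a program can list them without guessing. The sketch below assumes the TEI elements `persName` and `placeName` in the TEI P5 namespace; the sample sentence and the function name `tagged_values` are made up for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A hypothetical encoded sentence; the editor has marked persons and places.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0">In 1787 <persName>Hamilton</persName>
and <persName>Madison</persName> published essays in
<placeName>New York</placeName>.</p>"""


def tagged_values(xml_text, element):
    """Return the text content of every TEI `element` found in `xml_text`."""
    root = ET.fromstring(xml_text)
    return ["".join(node.itertext()) for node in root.iterfind(f".//tei:{element}", TEI_NS)]


print(tagged_values(SAMPLE, "persName"))
print(tagged_values(SAMPLE, "placeName"))
```

The same query on an unencoded .txt file would need name-recognition heuristics; here the encoder's decisions are recorded in the file itself.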

14 4. Content Delivery
Making the content available, e.g. online, as a CD, or in a database. This needs skills beyond markup.
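
For online delivery, the encoded file is usually transformed into something a browser can show. In practice this is often done with XSLT; the following Python sketch only illustrates the idea for a very small TEI file. It assumes TEI P5 element names, handles nothing but the title and top-level paragraphs, and the function name `tei_to_html` and the sample document are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A hypothetical, minimal TEI file as it might come out of the markup stages.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>Sample text</title></titleStmt>
    <publicationStmt><p>Unpublished.</p></publicationStmt>
    <sourceDesc><p>Born digital.</p></sourceDesc>
  </fileDesc></teiHeader>
  <text><body><p>First paragraph.</p><p>Fish &amp; chips.</p></body></text>
</TEI>"""


def tei_to_html(tei_xml):
    """Turn the title and body paragraphs of a small TEI file into an HTML page."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="Untitled", namespaces=TEI_NS)
    paragraphs = ["".join(p.itertext()) for p in root.iterfind(".//tei:body/tei:p", TEI_NS)]
    parts = ["<html><body>", f"<h1>{html.escape(title)}</h1>"]
    # Escape text so characters like & and < stay valid in the HTML output.
    parts += [f"<p>{html.escape(text)}</p>" for text in paragraphs]
    parts.append("</body></html>")
    return "\n".join(parts)


page = tei_to_html(SAMPLE_TEI)
print(page)
```

The header paragraphs are deliberately left out of the page, which shows one benefit of structural markup: the delivery step can choose what to display.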

15 5. Archiving
Make sure your edition finds its way into larger collections, digital libraries, repositories or archives, e.g. the OTA (Oxford Text Archive) or Project Gutenberg. Let other projects transform and reuse your content!

16 Tools for this class: XML Copy Editor, Firefox, OpenOffice 2

17 © marcus bingenheimer 2006

