11. Digitization of Text. TEI Workshop, September 2006. Marcus Bingenheimer
Humanities Computing (Digital Humanities) 1
Main applications (so far):
- Digitization and digital editions: encyclopedias, dictionaries, bibliographies, indices
- New types of knowledge bases (GIS etc.)
Humanities Computing (Digital Humanities) 2
- Digital publication of text and images
- New forms of information production & dissemination: wiki, blog...
- New research questions: authorship attribution & stylistic analysis; literary analysis; linguistic analysis, corpus linguistics
Example: authorship attribution 1
Mosteller and Wallace (1964): Inference and Disputed Authorship: The Federalist
- The Federalist, 1787-8: 85 papers by Hamilton, Madison, Jay
- 12 of disputed authorship: either Hamilton or Madison
Example: authorship attribution 2
- Count sentence length ☹
- Quantitative analysis of vocabulary usage ☺
- Compare frequencies for 30 marker words, e.g. "upon": Hamilton (2.93 per 1000 words), Madison (0.16 per 1000 words)
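The marker-word rate on this slide can be sketched in a few lines of Python. This is only an illustration of the "occurrences per 1000 words" measure, not a reconstruction of Mosteller and Wallace's statistical procedure. The function name `rate_per_1000`, the simple tokenizer and the sample sentence are all assumptions made up for the example.

```python
import re

def rate_per_1000(text: str, marker: str) -> float:
    """Occurrences of `marker` per 1000 word tokens in `text`."""
    # Very simple tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(marker.lower()) / len(tokens)

# Invented sample sentence, not taken from The Federalist.
sample = "Upon reflection, the plan depends upon the consent of the states."
print(round(rate_per_1000(sample, "upon"), 1))
```

On a real corpus one would compute this rate per author for each marker word and compare the resulting profiles with those of the disputed papers.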
Example: analysing literary texts
Estrella Irizarry (1992) compares two Mexican writers (O. Paz ♂ and Rosario Castellanos ♀) on language use & gender
- ♀ uses more and longer questions
- ♂ uses more words like 'always' and 'absolutely', expressions of certitude
- Words of compassion (taken from a thesaurus) appear only in ♀'s work
Example: corpus linguistics 1
British National Corpus (BNC) (http://www.natcorp.ox.ac.uk/)
- 100 million words, in samples of 45,000 words
- Markup with TEI (P3)
- Automated part-of-speech (PoS) tagging
Example: corpus linguistics 2
The BNC is:
- balanced: written and spoken material from diverse sources
- monolingual: only English
- synchronic: 20th century
Core Technologies
- XML technologies (XSLT, XQuery, SVG...) (since 1998)
- Markup standards (TEI (Text Encoding Initiative), Dublin Core, EAD...)
- Web standards (HTML, RSS...)
- Databases
5 stages in the production of high-quality digital texts
1. Input
2. Basic Markup
3. Deep Markup
4. Content Delivery
5. Archiving
1. Input
Basic data input for texts:
- Keyboarding (double keying)
- Scanning (OCR: Optical Character Recognition)
⇨ a file (perhaps a .txt file)
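In double keying, two typists enter the same text independently and the two versions are compared, so that disagreements point to likely typos. A minimal sketch of that comparison step, using Python's standard `difflib` module, is given here. The function name `keying_disagreements` and the two sample inputs are assumptions for illustration only.

```python
import difflib

def keying_disagreements(first: str, second: str):
    """List the line spans where two independently keyed versions differ."""
    a, b = first.splitlines(), second.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    # Keep only the opcodes that describe a difference between the versions.
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Illustrative input: the second typist has mistyped one word.
typist_a = "To the People of the State\nAfter an unequivocal experience"
typist_b = "To the People of the State\nAfter an unequivocal experiense"
for tag, left, right in keying_disagreements(typist_a, typist_b):
    print(tag, left, right)
```

Each reported span still has to be checked by a person against the source. The comparison only shows where the two versions disagree, not which one is right.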
2. Basic Markup
- File management (formats, file names etc.)
- Metadata about the digitization process (e.g. teiHeader)
- Basic structural content markup (e.g. with TEI)
⇨ probably an .xml file
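As a rough sketch of what this stage produces, the following Python snippet wraps plain paragraphs in a skeleton TEI document with a minimal teiHeader. It assumes the TEI P5 namespace and element names (TEI, teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc, text, body, p). The helper names `q` and `build_tei` and all sample values are hypothetical. A real project would fill the header far more fully and validate against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace (assumed here)
ET.register_namespace("", TEI_NS)

def q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"

def build_tei(title: str, source_note: str, paragraphs: list[str]) -> ET.Element:
    """Build a skeleton TEI document: minimal header plus body paragraphs."""
    root = ET.Element(q("TEI"))
    file_desc = ET.SubElement(ET.SubElement(root, q("teiHeader")), q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("p")).text = "Unpublished working file."
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = source_note
    body = ET.SubElement(ET.SubElement(root, q("text")), q("body"))
    for para in paragraphs:
        ET.SubElement(body, q("p")).text = para
    return root

# Hypothetical sample values.
doc = build_tei("Sample text", "Keyed from a printed edition.",
                ["First paragraph.", "Second paragraph."])
print(ET.tostring(doc, encoding="unicode"))
```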
3. Scholarly in-depth markup
- Value adding through encoding
- Encode (with TEI) what you wish to say about the text
⇨ a TEI-conformant .xml file (hopefully)
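One way to see the value added by in-depth markup: once names are tagged, a program can answer questions the plain text cannot. The sketch below counts how often each person is mentioned in a small TEI fragment, using the persName element with a key attribute. The fragment and its sentence, the key values and the function name `count_persons` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI P5 namespace (assumed here)

# Invented fragment: the same person is tagged under two surface forms.
fragment = """<p xmlns="http://www.tei-c.org/ns/1.0">The essays were written by
<persName key="hamilton">Hamilton</persName>, <persName key="madison">Madison</persName>
and <persName key="jay">Jay</persName>; <persName key="hamilton">Mr. Hamilton</persName>
wrote the largest share.</p>"""

def count_persons(xml_text: str) -> Counter:
    """Count mentions per person, identified by the key attribute of persName."""
    root = ET.fromstring(xml_text)
    return Counter(el.get("key") for el in root.iter(TEI + "persName"))

print(count_persons(fragment))
```

A plain-text search for "Hamilton" would find both mentions here, but only the markup states that "Hamilton" and "Mr. Hamilton" refer to the same person.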
4. Content Delivery
Making the content available, e.g.:
- online
- on CD
- in a database
This needs skills beyond markup
5. Archiving
- Make sure your edition finds its way into larger collections, digital libraries, repositories or archives, e.g. OTA (Oxford Text Archive), Project Gutenberg
- Let other projects transform and reuse your content!
Tools for this class: XML Copy Editor, Firefox, OpenOffice 2
© marcus bingenheimer 2006