Corpora and Their Basic Operations
Yang Linwei, Research Center for Foreign Language Educational Technology, Yantai University
1 Corpora: concept and brief history
2 Corpus tools and software
3 Building a small corpus
4 Teaching practice and applications
1 Corpora: concept and brief history
Definition of a corpus
A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research. (Sinclair, 1991)
A collection of sampled texts, written or spoken, in machine-readable form which may be annotated with various forms of linguistic information. (McEnery et al. 2006)
A large collection of well-sampled and processed electronic texts, on which language studies, theoretical or applied, can be conducted with the aid of computer tools. (BFSU CRG members)
Corpora at the million-word scale
1959: SEU (Survey of English Usage), the first attempt to provide an ongoing collection of present-day English … was a precursor of later corpora such as the British National Corpus and the American National Corpus.
1961: The Brown Corpus, compiled at Brown University, was the first computer-readable general corpus of texts prepared for linguistic research on modern English.
1970s: The Lancaster-Oslo/Bergen Corpus (LOB Corpus) was compiled to provide a British counterpart to the Brown Corpus.
1975: The London-Lund Corpus (LLC) was the computerised spoken part of SEU, used as the basis for the famous Comprehensive Grammar (Quirk et al. 1985).
Corpora at the ten-million-word scale
1980s: COBUILD (Collins Birmingham University International Language Database). In 1991, the success of COBUILD led to the development of a large monitor corpus, the Bank of English.
1980s: Longman/Lancaster Corpus. As part of the Longman Corpus Network, the Longman/Lancaster Corpus is not available for public access.
Corpora at the hundred-million-word scale
1980s to early 1990s: BNC (British National Corpus), 100 million words
1990s: COCA (Corpus of Contemporary American English), 450 million words
Hot topic: learner corpora
Late 1990s to 2002: ICLE (The International Corpus of Learner English)
Late 1990s: CLEC (Chinese Learner English Corpus)
HKUST Learner Corpus
See more corpora: http://www.lancaster.ac.uk/fass/projects/corpus/cbls/corpora.asp
Hot topic: bilingual corpora
The BFSU (Beijing Foreign Studies University) Chinese-English Parallel Corpus contains 30 million words. It is presently the largest parallel corpus of English and Chinese. The corpus is composed of four subcorpora: Balanced Corpus, Translation Corpus, Bilingual Sentences Corpus and Corpus for Specific Purpose.
Hot topic: web corpora
WaC (Web as Corpus), Wa/fC, WfC (Web for Corpus)
2 Corpus tools and software
Concordancing tools and software
WordSmith Tools
MonoConc / ParaConc
AntConc: freeware, copyleft
Xaira: BNC
CQPWeb: Sketch Engine, BFSU CQPWeb
WebCorp
Practice 1
KWIC
Wordlist and Collocation
N-gram
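The Practice 1 tasks are normally done in a concordancer such as AntConc, but the underlying operations are simple enough to sketch in plain Python. The block below is only an illustration under stated assumptions: the sample text, the crude tokenizer and the function names (`tokenize`, `kwic`, `wordlist`, `collocates`, `ngrams`) are all hypothetical and do not come from any of the tools listed above.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration.
SAMPLE = "The cat sat on the mat. The cat ate the fish."


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, window=3):
    """Return Key Word In Context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines


def wordlist(tokens):
    """Frequency-ranked word list as (word, count) pairs."""
    return Counter(tokens).most_common()


def collocates(tokens, node, window=1):
    """Count the words found within `window` tokens of each `node` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def ngrams(tokens, n=2):
    """Count contiguous n-token sequences."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = tokenize(SAMPLE)
for line in kwic(tokens, "cat"):
    print(line)
print(wordlist(tokens)[:3])
print(collocates(tokens, "cat").most_common(3))
print(ngrams(tokens, 2).most_common(3))
```

Raw co-occurrence counts are used here for collocation; real tools usually rank collocates with an association measure, which this sketch leaves out.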
Corpus annotation tools
Stanford POS Tagger
TreeTagger
CLAWS 5
Practice 2
Stanford POS Tagger, 9/13 = 69.2% of tokens tagged correctly:
Can/MD you/PRP can/MD a/DT can/MD as/IN a/DT canner/NN can/MD can/MD a/DT can/MD ?/.
TreeTagger, 11/13 = 84.6% of tokens tagged correctly:
Can_MD you_PP can_MD a_DT can_NN as_IN a_DT canner_NN can_MD can_MD a_DT can_NN ?_SENT
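The two accuracy figures in Practice 2 can be reproduced by comparing each tagger's output with a hand-made gold standard. The sketch below assumes a gold tagging that is not given on the slide (the lexical verb "can" as VB, the noun "can" as NN); it is an assumption, but it yields exactly the 9/13 and 11/13 quoted. The small map from TreeTagger tags to Penn Treebank tags and all the names in the code are likewise hypothetical.

```python
# Assumed gold standard (not on the slide): Penn Treebank tags chosen by hand.
GOLD = ("Can/MD you/PRP can/VB a/DT can/NN as/IN a/DT canner/NN "
        "can/MD can/VB a/DT can/NN ?/.")

# Tagger outputs copied from the slide.
STANFORD = ("Can/MD you/PRP can/MD a/DT can/MD as/IN a/DT canner/NN "
            "can/MD can/MD a/DT can/MD ?/.")
TREETAGGER = ("Can_MD you_PP can_MD a_DT can_NN as_IN a_DT canner_NN "
              "can_MD can_MD a_DT can_NN ?_SENT")

# Assumed mapping so TreeTagger's tag names are comparable with the gold tags.
TREETAGGER_TO_PENN = {"PP": "PRP", "SENT": "."}


def parse_tagged(text, sep):
    """Split 'word<sep>TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def accuracy(output, gold, tag_map=None):
    """Return (correct, total) after optionally renaming the output tags."""
    tag_map = tag_map or {}
    correct = sum(
        tag_map.get(out_tag, out_tag) == gold_tag
        for (_, out_tag), (_, gold_tag) in zip(output, gold)
    )
    return correct, len(gold)


gold = parse_tagged(GOLD, "/")
stanford_score = accuracy(parse_tagged(STANFORD, "/"), gold)
treetagger_score = accuracy(parse_tagged(TREETAGGER, "_"), gold, TREETAGGER_TO_PENN)

for name, (correct, total) in [("Stanford", stanford_score),
                               ("TreeTagger", treetagger_score)]:
    print(f"{name}: {correct}/{total} = {100 * correct / total:.1f}%")
```

Both taggers tag the lexical verb "can" as a modal; under the assumed gold standard, only TreeTagger recognises the noun "can".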
Corpus text-processing tools
Regex (regular expressions)
Wordless
EditPad Pro
PowerGREP
RegexBuddy
Example patterns:
\ba\w*\b (words beginning with "a")
\d+ (runs of digits)
\b\w{6}\b (words of exactly six characters)
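The three example patterns can also be tried outside a regex editor. The sketch below runs them through Python's standard `re` module on a made-up sentence; the sentence and the variable names are illustrative assumptions only.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "An apple a day keeps 2 doctors away in 2024, almost always."

# The three patterns from the slide, with a short description of each.
PATTERNS = {
    "words beginning with a": r"\ba\w*\b",
    "runs of digits": r"\d+",
    "six-character words": r"\b\w{6}\b",
}

results = {label: re.findall(pattern, sample) for label, pattern in PATTERNS.items()}

for label, hits in results.items():
    print(f"{label}: {hits}")
```

Matching is case-sensitive by default, so "An" is not returned by the first pattern; passing `re.IGNORECASE` to `re.findall` would include it.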
Practice 3
Remove the tags
Remove the words
Collect all the sentences of the structure: It be … that…
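One possible way to approach the three Practice 3 tasks with regular expressions is sketched below. It assumes tags are attached as `word_TAG` (the underscore format of the TreeTagger output), and it covers only a few forms of "be"; the sample strings, the pattern name `IT_BE_THAT` and the naive sentence splitter are hypothetical choices, not a prescribed solution.

```python
import re

# Hypothetical tagged sample in word_TAG format.
tagged = "It_PP is_VBZ clear_JJ that_IN corpora_NNS help_VBP ._SENT"

# Task 1: remove the tags (drop the underscore and everything up to the next space).
words_only = re.sub(r"_\S+", "", tagged)

# Task 2: remove the words (keep only what follows the underscore).
tags_only = re.sub(r"\S+?_(\S+)", r"\1", tagged)

# Task 3: collect sentences of the form "It be ... that ...".
raw = ("It is clear that corpora help. We like corpora. "
       "It was in 1961 that the Brown Corpus appeared. That is it.")

# Only a handful of forms of "be" are listed; extend the alternation as needed.
IT_BE_THAT = re.compile(r"^It\s+(?:is|was|will\s+be|has\s+been)\b.*\bthat\b")

# Naive sentence split: break after ., ? or ! followed by whitespace.
sentences = re.split(r"(?<=[.?!])\s+", raw)
hits = [s for s in sentences if IT_BE_THAT.search(s)]

print(words_only)
print(tags_only)
for sentence in hits:
    print(sentence)
```

The same patterns can be pasted into a regex-capable editor; the Python wrapper only makes the results easy to check.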
Advanced corpus tools
ActivePerl
Python, NLTK: Natural Language Toolkit
text1.concordance("monstrous")
3 Building a small corpus
Principles of corpus construction
Machine readable
Authentic
Authoritative
Representative and balanced sampling
Steps in building a corpus
Practice 4
1 Text OCR, downloading, collecting
2 Text cleaning and formatting
3 Text markup, tagging, meta information
Taking webpages as an example
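Steps 2 and 3 can be sketched for a webpage using only Python's standard library. In the example below the HTML snippet, the `<text source=… date=…>` header format and the names `TextExtractor`, `html_to_text` and `add_meta` are all assumptions made for illustration; a real project would define its own markup scheme and would fetch the page first (step 1), which is left out here.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Step 2: strip the markup and normalise whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def add_meta(text, source, date):
    """Step 3: wrap the text in a simple, hypothetical metadata header."""
    return f'<text source="{source}" date="{date}">\n{text}\n</text>'


# Hypothetical page content standing in for a downloaded webpage.
page = ("<html><head><title>News</title><style>p{color:red}</style></head>"
        "<body><p>Corpora are  useful.</p><script>var x=1;</script>"
        "<p>Very useful.</p></body></html>")

clean = html_to_text(page)
record = add_meta(clean, source="example-news-site", date="2015-01-01")
print(record)
```

Part-of-speech tagging (the rest of step 3) would then be run on `clean` with one of the taggers listed earlier.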
4 Teaching practice and applications
My applications
1 Web-based multimedia news corpus
2 Mini text corpus