Corpora in Foreign Language Teaching and Research Luo Ling 109883613@qq.com 1376774196
Outline 1. Corpus-based language teaching and learning 2. Corpus-based research
What is a corpus? Bodies of natural language material (whole texts, samples from texts, or sometimes just unconnected sentences), which are stored in machine-readable form.
Speech data also counts as a corpus!
Why use a corpus? Why use electronic text? To study knowledge of language through specimens of language use: naturally-occurring data ... Accessibility. Speed: e-text can be analyzed more quickly. Accuracy: for some tasks, processing e-text is more accurate than scanning by eye.
Types: printed, electronic text, digitized speech, video, mixed; monolingual vs. multilingual; original vs. translations (parallel); native speaker vs. L2 learner; plain vs. annotated/tagged
Scale: 50,000-100,000 words (small); >1 million words (medium); 50 million words (large); >100 million words (very large)
3. Corpus-based language teaching and learning
Direct/Indirect Use of Corpora
Direct (how): “classroom concordancing”; “corpus-based CALL”; “DDL”; learner as researcher; consciousness/awareness raising. Sounds good, but ...
Indirect (what): dictionary making; developing learning lists; LT material design; syllabus design; language testing; teacher education.
Two dominant ways of applying corpora to language teaching:
(1) Indirect: the “Birmingham” approach, with a view to improving the quality of published ELT material, primarily dictionaries and course books.
Dictionary making: COBUILD, LDOCE, OALD, CIDE
Learning lists: Dieter Mindt
LT material design: evaluation of EFL textbooks by Kennedy, Ljung, etc.
Syllabus design: lexical syllabus by David Willis
Language testing: automatic cloze test construction/CALT by Alderson
Teacher education: not yet published
(2) Direct: the data-driven learning (DDL) approach, i.e. direct in-course application of corpus analysis methods, primarily concordancing, in order to stimulate learners’ own self-discovery of the language taught (Kaszubski).
Direct use: does it really work?
Yes, for certain groups of learners: students of linguistics, students learning translation skills, and advanced learners. For less advanced learners, a lot of adjustments will be needed.
In most of the studies conducted so far, the target learners are university students or adult learners. One reason is that most teachers involved in corpus research are university teachers. A second reason may be that universities are the only places that can offer access to such resources. General megacorpora were used for some case studies, primarily for teaching linguistics or syntax courses, and the learners are usually at quite an advanced level.
Are NS corpora too difficult?
P. Nation: guessing does not work when more than 5% of the words in a text are unknown.
COBUILD: the top 11,491 words account for 95% of all text in English (Coniam 1996).
For instance, we had a sample quiz at the beginning of the session. Paul Nation, an SLA researcher specializing in L2 vocabulary acquisition, says that in order to guess the meaning of unknown words from context, you must know at least 95% of all the words in the text. That puts the threshold at 1 unknown word per 20 words. If one ought to know 95% of the words in average NS texts, how many words should one know already? COBUILD data show that the top 11,491 words cover 95% of all the texts in English (200 million tokens). For Japanese learners, this is the reality: they are far from this vocabulary level. Even ordinary university students may not be able to handle authentic text without any support.
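The 95% coverage idea can be checked on any word-frequency count. The sketch below is a minimal illustration, not the COBUILD procedure: the function names and the toy sentence are hypothetical, and a real study would load a large tokenized corpus instead.

```python
from collections import Counter

def coverage_of_top_n(tokens, n):
    """Share of all tokens accounted for by the n most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(n))
    return covered / len(tokens)

def types_needed(tokens, target=0.95):
    """Smallest number of top-frequency types whose tokens reach the target coverage."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Toy text (hypothetical data), 13 tokens and 8 types.
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(coverage_of_top_n(toy_tokens, 3))   # share covered by the 3 most frequent types
print(types_needed(toy_tokens, 0.95))     # types a reader must know for 95% coverage
```

On a learner-oriented corpus, comparing `types_needed` with the learners' estimated vocabulary size gives a rough sense of whether the texts are within reach.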
More finely-tuned data
Use of authentic but easy-to-understand texts: children’s storybooks/encyclopaedias
Use of parallel corpora (T. Johns): comparing L1 and L2 structures and use
Use of non-authentic texts (learner corpora): topic-specific corpora/hard-to-translate words/peer feedback & conferencing/NS involvement
So there are a couple of ways to tune the data more finely to the learners’ needs. Explain each one.
Use of non-authentic texts: this has not been done much, but I personally feel that using learners’ data has great potential. There is empirical evidence that learners are very concerned about their peers’ performance and comments. By sharing writing data, they can learn a lot from each other. If we collect written or spoken data on a particular topic, we can then use the same topic with a vocabulary list made from peers’ writing. Hard-to-translate words can be highlighted if they are properly tagged. A parallel corpus containing learners’ original writing along with versions corrected by native speakers could be a good resource as well. Learner data could facilitate peer feedback and conferencing, with proper involvement by native-speaker instructors.
Self-made corpus for teaching: uses textbook material; suits learners’ level; supports data-driven learning
Example
Corpus                                                   Total tokens   Vocabulary size (types)   Mean word length
English majors' Level 7 proficiency test practice set    110,689        11,995                    4.62
English majors' Level 8 proficiency test papers          117,755        12,435                    4.75
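The three figures in this example (total tokens, number of types, mean word length) can be computed for any self-made corpus. The following is a minimal sketch under stated assumptions: the tokenizer is a simple regular expression that treats runs of letters (with an optional internal apostrophe) as words, which will not match the counts of any particular corpus tool exactly, and `basic_stats` is a hypothetical helper name.

```python
import re

def basic_stats(text):
    """Return (tokens, types, mean word length) for a plain-text sample."""
    # Assumption: a word is a run of letters, optionally with one apostrophe part.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text.lower())
    tokens = len(words)
    types = len(set(words))
    mean_length = sum(len(w) for w in words) / tokens if tokens else 0.0
    return tokens, types, round(mean_length, 2)

sample = "The test is hard. The test is long."
print(basic_stats(sample))  # (8, 5, 3.25)
```

Running the same function over each textbook file gives one comparable row per corpus.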
Using CONCORD for teaching: lexis; phrases/collocations; sentence patterns
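For readers without a concordancer at hand, the idea behind a concordance line can be sketched in a few lines. This is not the CONCORD tool itself, only a minimal key-word-in-context (KWIC) illustration; the function name `kwic`, the bracket format, and the sample sentence are assumptions made for the example.

```python
def kwic(tokens, node, width=3):
    """Return one key-word-in-context line for every occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # The node word is shown in square brackets between its contexts.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "I consider him a friend and they consider it a problem".split()
for line in kwic(sample, "consider", width=2):
    print(line)
# I [consider] him a
# and they [consider] it a
```

Sorting the lines by the first word to the right of the node is a common next step when looking for patterns.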
Vocabulary teaching: consider
Vocabulary teaching: feel
Phrase/collocation teaching: What words go with “obvious”?
Collocates of the adjective “obvious”: difference, difficulty, challenge, example(s), fact, problem(s), question(s), reason(s), way
Collocates: little/small/large. What words go with “little”, “small”, and “large”?
Collocates of “little”: baby, bag, bit(s), boy(s), dog, girl(s), kid(s), man, while, thing
Collocates of “small”: amount(s), letters, part, piece, proportion, quantities, sum, size, world
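Collocate lists like the ones for “little” and “small” come from counting which words occur next to the node word. A minimal sketch, assuming a tokenized text and looking only at the word(s) immediately to the right; `right_collocates` and the sample sentence are hypothetical, and real collocation work would usually add a wider window and an association measure.

```python
from collections import Counter

def right_collocates(tokens, node, span=1):
    """Count the words that occur within `span` positions to the right of node."""
    lowered = [t.lower() for t in tokens]
    node = node.lower()
    counts = Counter()
    for i, tok in enumerate(lowered):
        if tok == node:
            counts.update(lowered[i + 1:i + 1 + span])
    return counts

sample = ("a little boy and a little girl saw a small amount "
          "of a small part of the little boy").split()
print(right_collocates(sample, "little"))
print(right_collocates(sample, "small"))
```

Comparing the two counters side by side is what lets learners see that near-synonyms attract different nouns.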
Phrase teaching: “on the one hand” vs. “on the other hand”. Which is used more often?
Sentence-pattern teaching: It is time that…
Sentence-pattern teaching: It is important that / necessary that / possible that / true that / significant that
Sentence-pattern teaching: the existential “there be” construction
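Compiling materials/testing: Which words should a Business English test word list contain? Which words are the core vocabulary of Business English textbooks? Which nouns, verbs, and adjectives are used most often in Business English? A corpus will give you some surprising findings…
Taking business genres as an example: business reports, business texts, business letters
Measures to compute for each textbook text: word frequency, word length, sentence length, lexical richness, text length, lexical chunks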
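Some of these per-text measures can be sketched as follows. This is an illustration under stated assumptions rather than a fixed method: sentences are split on `.`, `!` and `?`, words are runs of letters, and the type-token ratio stands in for lexical richness (it is sensitive to text length, so it is only comparable across texts of similar size). The name `text_profile` and the sample text are hypothetical.

```python
import re
from collections import Counter

def text_profile(text, top=3):
    """Mean sentence length, type-token ratio, and most frequent words of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {"mean_sentence_length": 0.0, "type_token_ratio": 0.0, "top_words": []}
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
        "top_words": Counter(words).most_common(top),
    }

sample = "We sell goods. We ship goods fast. Prices rise!"
print(text_profile(sample, top=2))
```

Applied to each business report, text, and letter in turn, the profiles can be compared across genres.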
4. Learner corpora-based research: major learner corpora; two research approaches; scope; main features of learner corpora; implications
What is a written learner corpus? Students’ compositions stored as electronic text; unannotated/annotated; untagged/tagged
What is a spoken learner corpus? Students’ speech recordings and their transcriptions, stored in electronic form
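Major learner corpora in China and abroad: ICLE (Granger et al. 2002); LIND-SEI (under construction); CLEC (Gui Shichun & Yang Huizhong, 2003); SWECCL (Wen Qiufang, Wang Lifei & Liang Maocheng, 2005)
International Corpus of Learner English (ICLE): 2 million words of written data; third- and fourth-year university English majors from 11 European countries; timed and untimed essays written in and out of class; argumentative genre
The Louvain International Database of Spoken English Interlanguage (LIND-SEI): 2 million words of spoken data; third- and fourth-year English majors from 5 European and 2 Asian countries; designed to match ICLE; under construction
Chinese Learner English Corpus (CLEC): 1 million words of written data; secondary school students; non-English majors (Bands 4 and 6); English majors (Bands 4 and 8); error-annotated, though not exhaustively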
Spoken and Written English Corpus of Chinese Learners (SWECCL): WECCL (written), one million words; SECCL (spoken), one million words
SECCL: recordings of the TEM-4 examination, 1996-2002; 1,148 digitized speech samples; 1,148 transcribed electronic texts
Tasks: reading aloud; retelling a story; talking on a given topic (narrative); talking on a given topic (argumentative); conversation (role play); discussion on a given topic
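SECCL file naming. Principle: simple and clear, no duplicate names (letters + digits). SECCL names use a three-level code: year-group-number. Example: 01-47-01 is the speech sample of candidate No. 1 in group 47 of 2001. Speech samples from the same group are stored in one folder, named by year and group number (e.g. 2001-47).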
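The three-level naming scheme is easy to handle in a script. The sketch below parses a sample ID and derives the folder name; it assumes two-digit years, expanded with the rule that 96-99 mean 1996-1999 and anything else means 20xx (an assumption based on the corpus covering 1996-2002), and `parse_sample_id` is a hypothetical helper, not part of any SECCL tool.

```python
import re

def parse_sample_id(sample_id):
    """Split a year-group-number ID such as '01-47-01' into its parts."""
    match = re.fullmatch(r"(\d{2})-(\d+)-(\d+)", sample_id)
    if match is None:
        raise ValueError(f"not a year-group-number ID: {sample_id!r}")
    yy, group, number = match.groups()
    # Assumption: the corpus spans 1996-2002, so 96-99 are 19xx and the rest 20xx.
    year = 1900 + int(yy) if int(yy) >= 96 else 2000 + int(yy)
    return {
        "year": year,
        "group": int(group),
        "number": int(number),
        "folder": f"{year}-{group}",  # folder named by year and group number
    }

print(parse_sample_id("01-47-01"))
# {'year': 2001, 'group': 47, 'number': 1, 'folder': '2001-47'}
```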
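Three types of annotation: text-header annotation; error annotation; spoken-feature annotation
Text-header annotation: 1) <SPOKEN> = Spoken 2) <TEM4> = Test for English Majors, Band 4 3) <GRADE2> = Grade 2 (second year) 4) <YEAR02> = Year 2002 (sample from 2002) 5) <GROUP01> = Group 01 6) <TASKTYPE1> = Task Type 1 (oral test task type 1) 7) <SEX1F> = Sex 1 Female (speaker 1: female); <Sex20> = Sex 2 Absent (speaker 2: no male student) 8) <RANK07> = Rank 07 (ranked 7th within the oral test group)
Text-header annotation, example: <SPOKEN> <TEM 4> <GRADE 2> <YEAR00> <GROUP65> <TASKTYPE 1> <SEX 1 F> <Sex 2 0> <RANK 07>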
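Spoken-feature annotation
Conversation roles: recorded as speakers A and B.
a) Self-repetition/repair: transcribe the actual number of repetitions. For example, if “think” is heard twice, record it as: think think.
b) Long pause: a natural mid-sentence pause is marked with a comma <,>; a pause between complete sentences is marked with a full stop <.>. A disfluent pause (0.3 seconds) is marked with an ellipsis <…>, e.g. I … think
c) Wrong pronunciation: write the correct form in the transcript, then spell out the mispronunciation as heard and put it in angle brackets < >. For example, a mispronounced “very” is recorded as very <weri>; “noise” heard as “loise” is recorded as noise<loise>; “ship” heard as “sheep” is recorded as ship<sheep>.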
Grammatical-error annotation: put the error in < > and the correct form in the running text. For example, if “runned” is heard, record it as ran <runned>. He likes <like> to stay in the hotel.
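Because the correct form sits in the running text and the learner's form follows in < >, the error pairs can be pulled out with a regular expression. A minimal sketch, assuming it is applied to the transcript body only (not the text header) and that each annotation covers a single preceding word; the pattern and the function names are illustrative, not part of the SECCL tools. Pronunciation and grammar errors share the same notation, so the sketch cannot tell them apart.

```python
import re

# A corrected word, optional whitespace, then the learner's form in angle brackets.
PAIR = re.compile(r"(\w+)\s*<([^<>]+)>")

def extract_corrections(transcript):
    """Return (learner_form, corrected_form) pairs found in a transcript."""
    return [(m.group(2), m.group(1)) for m in PAIR.finditer(transcript)]

def learner_version(transcript):
    """Rebuild what the learner actually said by restoring the bracketed forms."""
    return PAIR.sub(lambda m: m.group(2), transcript)

line = "He likes <like> to stay in the hotel and ran <runned> away."
print(extract_corrections(line))  # [('like', 'likes'), ('runned', 'ran')]
print(learner_version(line))      # He like to stay in the hotel and runned away.
```

Counting the extracted pairs across many transcripts is one simple route into computer-aided error analysis.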
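Advantages of SECCL: the spoken data come from random samples and are representative; the data are stored year by year over a seven-year span, which makes it possible to examine the development of Chinese students’ speaking ability.
Advantages of SECCL: the spoken data are classified by task type, which makes it possible to examine the effect of task type on spoken output.
Advantages of SECCL: all texts were POS-tagged with the automatic grammatical tagger CLAWS4, which facilitates research on morphological and syntactic patterns in students’ speech.
Advantages of SECCL: every transcript has a corresponding audio file that a computer can read and play directly. Researchers can do text-based research on speech, or annotate the audio files and do research based on the speech data.
Advantages of SECCL: the header of every transcript records the candidate’s rank within the group, which facilitates research on the effect of speaking proficiency on spoken development.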
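WECCL (written): Year 1, Year 2, Year 3, Year 4
WECCL: 3,059 argumentative essays; 529 narrative essays
WECCL feature 1: essays are classified as timed or untimed, which makes it easy to examine the effect of time on L2 writing.
WECCL feature 2: essays are classified by genre and year of study, which makes it easy to examine the development of students’ writing ability.
WECCL feature 3: all written data are POS-tagged, which facilitates research on the morphological and syntactic development of Chinese students’ interlanguage.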
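Main research methods for learner corpora: contrastive interlanguage analysis; computer-aided error analysis
Contrastive interlanguage analysis: comparing learner data across proficiency levels; comparing learner data across L1 backgrounds; comparing learners’ spoken and written data; comparing learner data with native-speaker data
Computer-aided error analysis: manual error annotation; batch extraction with WordSmith
Scope of research: phonology (pauses, rhythm, intonation); lexis (overall features, particular word classes); grammar (past tense, articles, NP, VP); discourse (discourse markers, questions, turn-taking)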
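Advantage 1 of learner corpora: they are large, authentic, and highly representative, so findings no longer rest on scattered examples.
Advantage 2 of learner corpora: they can be used as a whole or in parts, which makes it possible to combine large-sample quantitative analysis with qualitative analysis of individual texts.
Advantage 3 of learner corpora: they can be stored, copied, and searched over the long term, which makes replication studies possible and helps improve the reliability and validity of research.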
What learner corpora can and cannot show
CAN             CANNOT
Product         Process
Productive      Receptive
Group trends    Individual differences
Language use    Language knowledge
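Implication 1: look at learners’ language use from different perspectives: correct use, overuse, underuse, non-use, misuse
Implication 2: distinguish L1-related features from developmental features in interlanguage. For example, examine the use of 20 high-frequency adverbs: Do Chinese university students overuse or underuse them in speech and writing? How does their use of these 20 adverbs in speech and writing differ from that of native speakers?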
Finding 1: three adverbs are underused
TTFAs          BNCW   CLW   Dif
ever           257    228   -29
increasingly   73     8     -65
normally       79     12    -67
Comparison across L1 groups
                ever   increasingly   normally
BNC (Written)   257    73             79
Polish          205    22             30
Chinese         228    8              12
Spanish         230    15             46
French          243    23             19
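The “Dif” column is simply the learner frequency minus the BNC written frequency, and the same subtraction can be applied to every L1 group. A minimal sketch using the figures from the slides, assuming they are frequencies on a common base (the slides do not say whether they are raw or normalized); the variable and function names are illustrative.

```python
# Figures copied from the slides; the reference row is BNC (Written).
reference = {"ever": 257, "increasingly": 73, "normally": 79}
learners = {
    "Polish": {"ever": 205, "increasingly": 22, "normally": 30},
    "Chinese": {"ever": 228, "increasingly": 8, "normally": 12},
    "Spanish": {"ever": 230, "increasingly": 15, "normally": 46},
    "French": {"ever": 243, "increasingly": 23, "normally": 19},
}

def differences(reference, group):
    """Learner frequency minus reference frequency; negative means underuse."""
    return {word: group[word] - reference[word] for word in reference}

diffs = {name: differences(reference, freqs) for name, freqs in learners.items()}
for name, row in diffs.items():
    print(name, row)
```

All four groups come out negative on all three adverbs, and the Chinese row reproduces the -29, -65 and -67 reported in Finding 1.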
Implication 3: foreign language teaching and research must give equal weight to language use and language knowledge.
Some resources
Websites for corpus tools and corpus resources: http://www.360doc.com/content/10/0907/13/617416_51837914.shtml
Corpus Linguistics Online: http://www.corpus4u.org/
Some online corpora: http://blog.renren.com/share/241405628/4497484561
…
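References:
Yang Huizhong. An Introduction to Corpus Linguistics (语料库语言学导论). Shanghai Foreign Language Education Press, 2002.
Gui Shichun, Yang Huizhong. Chinese Learner English Corpus (中国学习者英语语料库). Shanghai Foreign Language Education Press, 2002.
Yang Huizhong, Gui Shichun. Construction and Study of a Spoken English Corpus of Chinese Learners (中国学习者英语口语语料库建设与研究). Shanghai Foreign Language Education Press, 2005.
Yang Huizhong. An Analysis of Chinese Learners’ English Based on the CLEC Corpus (基于CLEC语料库的中国学习者英语分析). Shanghai Foreign Language Education Press, 2005.
Wen Qiufang, Wang Lifei, Liang Maocheng. Spoken and Written English Corpus of Chinese Learners (中国学生英语口笔语语料库). Foreign Language Teaching and Research Press, 2008.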
Thank You!