Corpora in Foreign Language Teaching and Research Luo Ling 109883613@qq.com 1376774196.

Outline 1. Corpus-based language teaching and learning
2. Corpus-based research

What is a corpus? Bodies of natural language material (whole texts, samples from texts, or sometimes just unconnected sentences), which are stored in machine-readable form.

Voice data also counts as a corpus!

Why use a corpus? Why use electronic text?
To study knowledge of language through specimens of language use: naturally-occurring data ...
Accessibility
Speed: can be analyzed more quickly
Accuracy: for some tasks, processing e-text is more accurate than eye scan

Types
printed, electronic text, digitized speech, video, mixed
monolingual vs. multilingual
original vs. translations (parallel)
native speaker vs. L2 learner
plain vs. annotated/tagged

Scale
50,000-100,000 words (small)
over 1 million words (medium)
50 million words (large)
over 100 million words (very large)

3. Corpus-based language teaching and learning

Direct/Indirect Use of Corpora
Direct (how): “Classroom concordancing”; “Corpus-based CALL”; “DDL”; learner as researcher; consciousness/awareness raising. Sounds good, but ...
Indirect (what): dictionary making; developing learning lists; LT material design; syllabus design; language testing; teacher education.
Two dominant ways of applying corpora to language teaching:
(1) Indirect: the “Birmingham” approach, with a view to improving the quality of published ELT material, primarily dictionaries and course books.
Dictionary making: COBUILD, LDOCE, OALD, CIDE
Learning lists: Dieter Mindt
LT material design: evaluation of EFL textbooks by Kennedy, Ljung, etc.
Syllabus design: lexical syllabus by David Willis
Language testing: automatic cloze test construction / CALT by Alderson
Teacher education: not yet published
(2) Direct: the data-driven learning (DDL) approach, i.e. direct in-course application of corpus analysis methods, primarily concordancing, in order to stimulate learners’ own self-discovery of the language taught (Kaszubski).

Direct use: does it really work?
Yes, for certain groups of learners:
a student of linguistics
a student learning translation skills
advanced learners
For less advanced learners, a lot of adjustments will be needed.
In most of the studies conducted so far, the target learners are university students or adult learners. One reason is that most teachers involved in corpus research are university teachers. A second reason may be that universities are the only places that can offer access to such resources. General megacorpora were used for some case studies, primarily for teaching linguistics or syntax courses, and the learners were usually at quite an advanced level.

Are NS corpora too difficult?
P. Nation: guessing does not work when more than 5% of the words in a text are unknown.
COBUILD: the top 11,491 words account for 95% of all text in English (Coniam 1996).
For instance, we had a sample quiz at the beginning of the session. Paul Nation, an SLA researcher specializing in L2 vocabulary acquisition, says that in order to guess the meaning of unknown words from context, you must know at least 95% of all the words in the text. That puts the threshold at 1 unknown word per 20 words. If one ought to know 95% of the words in an average NS text, how many words should he or she already know? COBUILD data show that the top 11,491 words cover 95% of all text in English (200 million tokens). For Japanese learners, this is the reality: they are far from this vocabulary level. Even ordinary university students may not be able to handle authentic text without support.
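The 95% threshold can be checked against any text and word list. The following is a minimal Python sketch, assuming a deliberately simple tokenizer and counting word forms rather than lemmas or word families; the function names are illustrative and are not taken from COBUILD or from Nation's work.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def coverage(tokens, known_words):
    """Share of running words (tokens) that appear in known_words."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in known_words)
    return known / len(tokens)

def words_needed(tokens, target=0.95):
    """Number of most frequent word types needed to reach the target coverage."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)
```

Calling `coverage(tokenize(text), learner_vocabulary)` on a reading passage gives a rough idea of whether a learner is above or below the 1-unknown-word-in-20 threshold for that passage.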

More finely-tuned data
Use of authentic but easy-to-understand texts: children’s storybooks / encyclopaedias
Use of parallel corpora (T. Johns): comparing L1 and L2 structures and use
Use of non-authentic texts (learner corpora): topic-specific corpora / hard-to-translate words / peer feedback & conferencing / NS involvement
So there are a couple of ways to tune the data more finely to the learners’ needs. Explain each one. Use of non-authentic texts: this has not been done much, but I personally feel that using learners’ data has great potential. There is empirical evidence that learners are very concerned about their peers’ performance and comments. By sharing writing data, they can learn a lot from each other. If we collect written or spoken data on a particular topic, we can teach the same topic with a vocabulary list made from peers’ writing. Hard-to-translate words can be highlighted if they are properly tagged. A parallel corpus that pairs learners’ original writing with versions corrected by native speakers could be a good resource as well. Learner data could facilitate peer feedback and conferencing, with proper involvement by native-speaker instructors.

Self-made corpus for teaching
Using textbook material
Suits learners’ level
For data-driven learning

Example
                                                     Tokens    Types (vocabulary size)   Mean word length
English-major Band 7 proficiency test practice set   110,689   11,995                    4.62
English-major Band 8 proficiency test papers         117,755   12,435                    4.75
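Figures of this kind (tokens, types, mean word length) can be computed for any self-made corpus. The sketch below is one possible way to do it in Python; it assumes that types are distinct lowercase word forms, which may not match how the figures on the slide were counted.

```python
import re

def basic_stats(text):
    """Token count, type count and mean word length for one text."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    types = {t.lower() for t in tokens}  # assumption: types = distinct word forms
    mean_len = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "types": len(types),
        "mean_word_length": round(mean_len, 2),
    }
```

Running `basic_stats` on each textbook file and comparing the results gives a quick first check that two sets of material are of similar difficulty.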

Using CONCORD for teaching
lexis
phrase/collocation
sentence pattern

Vocabulary teaching: consider

Vocabulary teaching: feel
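A concordance display for a word such as consider or feel can be approximated without dedicated software. The following is a minimal keyword-in-context (KWIC) sketch in Python; it is not the CONCORD tool itself, and it matches the exact word form only, so considered or feels would need separate searches.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every whole-word match of keyword."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # right-align the left context so the keywords line up in a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines
```

Printing the result with `print("\n".join(kwic(corpus_text, "consider")))` lines up the keyword in one column, which is what lets learners scan the left and right contexts for patterns.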

Phrase/collocation teaching
What words go with “obvious”?

Adjective collocates:
obvious + difference, difficulty, challenge, example(s), fact, problem(s), question(s), reason(s), way

Collocates: little/small/large
What words go with “little”, “small”, and “large”?

Collocates: little/small
little + baby, bag, bit(s), boy(s), dog, girl(s), kid(s), man, while, thing

Collocates: little/small
small + amount(s), letters, part, piece, proportion, quantities, sum, size, world
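Collocate lists like those for little and small can be drawn from a corpus by counting the word that immediately follows the node word. This is a rough Python sketch under simple assumptions: plain-text input, no part-of-speech tags, and no attention to sentence boundaries.

```python
import re
from collections import Counter

def right_collocates(text, node, top=10):
    """Count the words that immediately follow `node` (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    following = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node.lower()
    )
    return following.most_common(top)
```

Comparing `right_collocates(corpus_text, "little")` with `right_collocates(corpus_text, "small")` shows learners which nouns prefer which adjective. A tagged corpus would allow the search to be restricted to nouns.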

Phrase teaching: “on the one hand” vs. “on the other hand”
Which is used more often?
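The question of which phrase is used more often can be answered by counting both in the same corpus. A minimal Python sketch follows; the per-million normalization is an added convenience for comparing corpora of different sizes, not something shown on the slide.

```python
import re

def phrase_count(text, phrase):
    """Count whole-phrase matches, ignoring case and extra whitespace."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))

def per_million(count, total_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / total_tokens
```

For example, `phrase_count(corpus_text, "on the one hand")` and `phrase_count(corpus_text, "on the other hand")` can be compared directly when both come from the same corpus.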

Sentence pattern teaching: It is time that…

Sentence pattern teaching: It is ... that
important that
necessary that
possible that
true that
significant that
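The words that fill the slot in this pattern can be collected with a regular expression. The Python sketch below is only an approximation: without part-of-speech tags the slot is not limited to adjectives, so a match such as it is time that is counted as well.

```python
import re
from collections import Counter

# One word between "it is" and "that"; word class is not checked (assumption).
PATTERN = re.compile(r"\bit\s+is\s+(\w+)\s+that\b", re.IGNORECASE)

def it_is_adj_that(text):
    """Count the words filling the slot in 'it is ___ that'."""
    return Counter(m.group(1).lower() for m in PATTERN.finditer(text))
```

Sorting the result with `it_is_adj_that(corpus_text).most_common()` gives a frequency-ranked list that can be turned into a teaching handout.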

Sentence pattern teaching: the existential construction (there be)

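Compiling materials/testing
Which words should a Business English exam vocabulary list include?
Which words are core vocabulary in Business English textbooks?
What are the most frequently used nouns, verbs, and adjectives in Business English?
A corpus will give you some surprising findings…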

Taking business genres as an example:
business reports
business texts
business letters

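Compute the following indicators for each textbook text:
word frequency
word length
sentence length
lexical richness
text length
lexical chunks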
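Most of these indicators can be computed with a few lines of code. The following Python sketch makes several simplifying assumptions: sentences are split on ., ! and ?, lexical richness is measured as a plain type-token ratio, and chunks are contiguous n-grams that may run across sentence boundaries. The function name and the dictionary keys are illustrative.

```python
import re
from collections import Counter

def text_indicators(text, n=3, top=5):
    """Length, sentence length, type-token ratio, top words and top n-grams."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # contiguous n-grams, built by zipping the token list with shifted copies
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    return {
        "length_in_tokens": len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "top_words": Counter(tokens).most_common(top),
        "top_chunks": [(" ".join(g), c) for g, c in ngrams.most_common(top)],
    }
```

Because the type-token ratio falls as texts get longer, it is safer to compare texts of similar length or to compute the ratio on equal-sized samples.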

4. Learner corpora-based research
Major learner corpora
Two research approaches
Scope
Main features of learner corpora
Implications

What is a learner written corpus?
Students’ compositions stored as electronic text
Unannotated / annotated
Untagged / tagged

What is a learner spoken corpus?
Students’ speech recordings and the transcripts of those recordings, stored in electronic form

Major learner corpora in China and abroad
ICLE (Granger et al. 2002)
LIND-SEI (under construction)
CLEC (Gui Shichun & Yang Huizhong, 2003)
SWECCL (Wen Qiufang, Wang Lifei & Liang Maocheng, 2005)

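International Corpus of Learner English (ICLE)
2 million words of written data
11 European countries
Third- and fourth-year university English majors
In-class and out-of-class, timed and untimed essays
Argumentative genre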

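The Louvain International Database of Spoken English Interlanguage (LIND-SEI)
2 million words of spoken data
5 European and 2 Asian countries
Third- and fourth-year English majors
Planned as a match for ICLE; under construction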

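Chinese Learner English Corpus (CLEC)
1 million words of written data
Secondary school students; non-English majors (Bands 4 and 6); English majors (Bands 4 and 8)
Error-tagged, though not exhaustively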

Spoken and Written English Corpus of Chinese Learners (SWECCL)
WECCL: one million words
SECCL: one million words

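SECCL
Recordings from the TEM-4 (Test for English Majors, Band 4) examination, 1996-2002
1,148 digital speech samples
1,148 transcribed electronic texts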

Tasks
Reading aloud
Retelling a story
Talking on a given topic (narrative)
Talking on a given topic (argumentative)
Conversation (role play)
Discussion on a given topic

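SECCL file-naming principles: simple and clear, no duplicate names (letters + digits)
SECCL names use a three-level code: year-group-number.
For example, […] is the speech sample of candidate No. 1 in Group 47 of 2001. Speech samples from the same group are stored in one folder, which is named by year and group number (e.g. […]).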

Three types of annotation:
text header annotation
error annotation
spoken-feature annotation

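Text header annotation:
1) <SPOKEN> = Spoken
2) <TEM4> = TEM-4 (Test for English Majors, Band 4)
3) <GRADE2> = Grade 2 (second-year students)
4) <YEAR02> = Year 2002 (sample from 2002)
5) <GROUP01> = Group 01
6) <TASKTYPE1> = Task Type 1 (oral test task type 1)
7) <SEX1F> = Sex 1 Female (speaker 1 is female); <Sex20> = Sex 2 Absent (Sex 2: no male student)
8) <RANK07> = Rank 07 (ranked 7th within the oral test group)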

Text header annotation (example): <SPOKEN> <TEM 4> <GRADE 2> <YEAR00> <GROUP65> <TASKTYPE 1> <SEX 1 F> <Sex 2 0> <RANK 07>
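Header tags of this kind can be read automatically, for instance to select all samples from one year or one task type. The Python sketch below is a hypothetical reader, not part of the SECCL distribution; it assumes each tag is a run of letters followed by an optional value, and it ignores spacing and case differences such as those between <TEM4> and <TEM 4>.

```python
import re

# A tag name made of letters, then an optional value; spaces are tolerated.
TAG = re.compile(r"<\s*([A-Za-z]+)\s*([^<>]*?)\s*>")

def parse_header(header):
    """Turn header tags into (name, value) pairs; spacing inside tags is ignored."""
    pairs = []
    for name, value in TAG.findall(header):
        pairs.append((name.upper(), value.replace(" ", "")))
    return pairs
```

The function returns a list of pairs rather than a dictionary because the SEX tag occurs once per speaker. It should be applied to the header line only, since angle brackets are also used for error annotation in the body of a transcript.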

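Spoken-feature annotation
Speaker roles: recorded as A and B.
a) Self-repetition/repair: recorded exactly as many times as it occurs. For example, if think is heard twice, it is transcribed as think think.
b) Long pause: a natural mid-sentence pause is marked with a comma <,>; a pause between complete sentences is marked with a full stop <.>. A disfluent pause (0.3 seconds) is marked with an ellipsis <…>, e.g. I … think
c) Wrong pronunciation: the transcriber writes the correct form and then spells out the mispronunciation as heard, inside angle brackets < >. For example, a mispronounced very is recorded as very <weri>, noise pronounced as “loise” is recorded as noise <loise>, and ship pronounced as “sheep” is recorded as ship <sheep>.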

Grammatical error annotation: the error is placed in < > and the correct form is kept in the text. For example, if runned is heard, it is recorded as ran <runned>.
He likes <like> to stay in the hotel.
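Because the correct form stays in the text and the learner's form sits in angle brackets, both versions can be recovered with a regular expression. This is a minimal Python sketch, assuming the text header has already been removed and that each annotation follows a single corrected word; it does not tell pronunciation errors from grammatical ones, since both use the same notation.

```python
import re

# A corrected word followed by the form actually produced, in angle brackets.
ANNOTATION = re.compile(r"(\w+)\s*<\s*([^<>]+?)\s*>")

def extract_errors(transcript):
    """Return (corrected form, form actually produced) pairs."""
    return ANNOTATION.findall(transcript)

def as_produced(transcript):
    """Rebuild what the learner actually said from the annotated text."""
    return ANNOTATION.sub(lambda m: m.group(2), transcript)
```

Counting the pairs returned by `extract_errors` over a whole corpus gives a first frequency list of errors, which can then be checked by hand.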

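Advantages of SECCL: the spoken data come from random samples and are representative. The data are stored year by year across a seven-year span, which makes it possible to study how Chinese students’ speaking ability develops.

Advantages of SECCL: the spoken data are classified by task type, which makes it possible to study the effect of task type on spoken output.

Advantages of SECCL: all texts have been part-of-speech tagged with the automatic tagger CLAWS4, which makes it easier to study patterns of morphological and syntactic variation in students’ speech.

Advantages of SECCL: every transcript has a matching sound file that a computer can read and play directly. Researchers can carry out text-based research on speech, or annotate the sound files and carry out research based on the audio data.

Advantages of SECCL: the header of every transcript records the candidate’s rank within the group, which makes it easier to study the effect of speaking proficiency on spoken language development.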

WECCL (written): Year 1, Year 2, Year 3, Year 4

WECCL
3,059 argumentative essays
529 narrative essays

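WECCL feature 1: essays are classified as timed or untimed, which makes it convenient to study the effect of time on L2 writing.

WECCL feature 2: essays are classified by genre and by year of study, which makes it easier to trace the development of students’ writing ability.

WECCL feature 3: all written data have been part-of-speech tagged, which makes it easier to study how the morphology and syntax of Chinese students’ interlanguage develop.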

Main research methods for learner corpora
Contrastive interlanguage analysis
Computer-aided error analysis

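Contrastive interlanguage analysis
Comparing data from learners at different proficiency levels
Comparing data from learners with different L1 backgrounds
Comparing learners’ spoken and written data
Comparing learner data with native-speaker data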

Computer-aided error analysis
Errors are tagged by hand
Tagged errors are extracted in batches with WordSmith

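Research scope
Phonology (pauses, rhythm, intonation)
Vocabulary (overall features, particular word classes)
Grammar (past tense, articles, NP, VP)
Discourse (discourse markers, questions, turn-taking)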

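Advantage 1 of learner corpora: they are large, authentic, and highly representative, so findings no longer rest on scattered examples.

Advantage 2 of learner corpora: they can be used whole or in parts, which makes it possible to combine large-sample quantitative analysis with qualitative analysis of individual texts.

Advantage 3 of learner corpora: they can be stored, copied, and searched over the long term, which makes replication studies possible and helps improve the reliability and validity of research.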

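What learner corpora can and cannot show
CAN                CANNOT
product            process
productive         receptive
group tendencies   individual differences
language use       language knowledge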

Implication 1: examine learners’ language use from different perspectives
correct use
overuse
underuse
non-use
misuse

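Implication 2: distinguish L1-related features from developmental features in interlanguage
For example, in a study of 20 high-frequency adverbs:
Do Chinese university students overuse or underuse them in speech and writing?
How does their use of the 20 adverbs in speech and writing differ from that of native speakers?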

Finding 1: three adverbs are underused
TTFAs          BNCW   CLW   Dif
ever           257    228   -29
increasingly   73     8     -65
normally       79     12    -67

                Ever   Increasingly   Normally
BNC (Written)   257    73             79
Polish          205    22             30
Chinese         228    8              12
Spanish         230    15             46
French          243    23             19
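The Dif column can be recomputed for every learner group by subtracting the BNC (Written) figure from the learner figure. The Python sketch below uses the frequencies from the table above. The slides do not say whether these are raw or normalized frequencies, so the subtraction is meaningful only on the assumption that all figures share the same base, and it does not test statistical significance.

```python
# Frequencies as given in the table; the basis (raw or normalized) is not stated.
FREQ = {
    "BNC (Written)": {"ever": 257, "increasingly": 73, "normally": 79},
    "Polish": {"ever": 205, "increasingly": 22, "normally": 30},
    "Chinese": {"ever": 228, "increasingly": 8, "normally": 12},
    "Spanish": {"ever": 230, "increasingly": 15, "normally": 46},
    "French": {"ever": 243, "increasingly": 23, "normally": 19},
}

def differences(freq, reference="BNC (Written)"):
    """Learner frequency minus reference frequency, per group and word."""
    ref = freq[reference]
    return {
        group: {word: counts[word] - ref[word] for word in ref}
        for group, counts in freq.items()
        if group != reference
    }

def label(diff):
    """Name the direction of a difference."""
    return "underuse" if diff < 0 else "overuse" if diff > 0 else "same"
```

For the Chinese learners this reproduces the Dif values reported in Finding 1, and on these figures all four learner groups fall below the BNC (Written) frequency for the three adverbs.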

Implication 3: foreign language teaching and research must give equal weight to language use and language knowledge.

Some resources
URLs of some corpus tools and corpus resources
Corpus Linguistics Online
Some online corpora …

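References:
Yang Huizhong. 语料库语言学导论 [An Introduction to Corpus Linguistics]. Shanghai Foreign Language Education Press, 2002.
Gui Shichun & Yang Huizhong. 中国学习者英语语料库 [Chinese Learner English Corpus]. Shanghai Foreign Language Education Press, 2002.
Yang Huizhong & Gui Shichun. 中国学习者英语口语语料库建设与研究 [Construction and Study of a Spoken English Corpus of Chinese Learners]. Shanghai Foreign Language Education Press, 2005.
Yang Huizhong. 基于CLEC语料库的中国学习者英语分析 [A CLEC-based Analysis of Chinese Learners’ English]. Shanghai Foreign Language Education Press, 2005.
Wen Qiufang, Wang Lifei & Liang Maocheng. 中国学生英语口笔语语料库 [Spoken and Written English Corpus of Chinese Learners]. Foreign Language Teaching and Research Press, 2008.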

Thank You！
