Deep contextualized word representations

Presentation transcript:

Deep contextualized word representations (NAACL 2018 best paper) Lecturer: Zhaoyang Wang The paper's url: https://arxiv.org/pdf/1802.05365.pdf

1 Introduction

Motivation Pre-trained word representations should model both (1) complex characteristics of word use (e.g., syntax and semantics) and (2) how these uses vary across linguistic contexts (i.e., model polysemy). Traditional word embeddings: these approaches for learning word vectors allow only a single, context-independent representation for each word. Example: 1. Jobs was the CEO of Apple. 2. He finally ate the apple. The word "apple" does not have the same meaning in the two sentences, but a traditional word embedding gives it only one representation.

Introduction We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use and (2) how these uses vary across linguistic contexts. ELMo: Embeddings from Language Models These representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems.

Introduction Due to their ability to capture syntactic and semantic information of words from large scale unlabeled text, pretrained word vectors are a standard component of most state-of-the-art NLP architectures. Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus.

Introduction In this paper, we take full advantage of access to plentiful monolingual data, and train our biLM on a corpus with approximately 30 million sentences. We also generalize these approaches to deep contextual representations, which we show work well across a broad range of diverse NLP tasks.

2 biLM: Bidirectional language models

biLM: Bidirectional language models Given a sequence of N tokens (t1, t2, …, tN), a forward LM computes the probability of the sequence by modeling the probability of token tk given the history (t1, …, tk-1). A backward LM is similar to a forward LM, except that it runs over the sequence in reverse, predicting the previous token given the future context. Both factorizations are written out below.
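The two factorizations, reconstructed here in the paper's standard notation, are:

```latex
% Forward LM: condition each token on its left context
p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})

% Backward LM: condition each token on its right context
p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)
```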

biLM: Bidirectional language models A biLM combines both a forward and a backward LM. Our formulation jointly maximizes the log likelihood of the forward and backward directions, as written below. We tie the parameters for both the token representation (Θx) and the Softmax layer (Θs) in the forward and backward directions, while maintaining separate parameters for the LSTMs in each direction.
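A reconstruction of the joint training objective in the paper's notation:

```latex
\sum_{k=1}^{N} \Big(
    \log p(t_k \mid t_1, \ldots, t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
  + \log p(t_k \mid t_{k+1}, \ldots, t_N;\ \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s)
\Big)
```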

3 ELMo: Embeddings from Language Models

ELMo: Embeddings from Language Models Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token tk, an L-layer biLM computes a set of 2L + 1 representations.
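The full set of representations for token tk, written in the paper's notation, is:

```latex
R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \,\}
    = \{\, h_{k,j}^{LM} \mid j = 0, \ldots, L \,\}
```

where h_{k,0}^{LM} is the token layer and each h_{k,j}^{LM} concatenates the forward and backward biLSTM hidden states at layer j.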

ELMo: Embeddings from Language Models For a specific downstream task, ELMo learns a set of weights to combine these representations (in the simplest case, ELMo just selects the top layer). In equation (1) below, s^task are softmax-normalized weights and the scalar parameter γ^task allows the task model to scale the entire ELMo vector.
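Equation (1), reconstructed from the paper:

```latex
\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task})
                       = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM} \qquad (1)
```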

ELMo: Embeddings from Language Models Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model. We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations.
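As a concrete illustration of this recipe, here is a minimal sketch using the AllenNLP Elmo module (the package linked at the end of this deck). The options/weights file paths are placeholders for files downloaded from allennlp.org/elmo, and the 1024-dimensional output assumes the standard pretrained model.

```python
# Minimal sketch: compute ELMo vectors for a downstream task with AllenNLP.
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

OPTIONS_FILE = "elmo_options.json"   # placeholder path for the downloaded options file
WEIGHT_FILE = "elmo_weights.hdf5"    # placeholder path for the downloaded weights file

# num_output_representations=1: one task-specific weighted combination of the biLM layers
elmo = Elmo(OPTIONS_FILE, WEIGHT_FILE, num_output_representations=1, dropout=0.0)

sentences = [["Jobs", "was", "the", "CEO", "of", "Apple"],
             ["He", "finally", "ate", "the", "apple"]]
character_ids = batch_to_ids(sentences)              # (batch, max_len, 50) character ids

output = elmo(character_ids)
elmo_embeddings = output["elmo_representations"][0]  # (batch, max_len, 1024) for the standard model

# The two occurrences of "Apple"/"apple" now receive different, context-dependent vectors,
# which the end task model can consume alongside (or instead of) static word embeddings.
print(elmo_embeddings.shape)
```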

4 Analysis

Analysis The table shows the performance of ELMo across a diverse set of six benchmark NLP tasks: question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis. For each task, a baseline model is built first and ELMo is then added; all six tasks improve, with gains of roughly 2 points, and the final results all exceed the previous state of the art (SOTA). The performance metric varies across tasks: accuracy for SNLI and SST-5; F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small test sizes for NER and SST-5, we report the mean and standard deviation across five runs with different random seeds.

Analysis All of the task architectures in this paper include word embeddings only as input to the lowest layer biRNN. However, we find that including ELMo at the output of the biRNN in task-specific architectures improves overall results for some tasks. One possible explanation for this result is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM's internal representations. In the SRL case, the task-specific context representations are likely more important than those from the biLM. Table: development set performance for SQuAD, SNLI and SRL when including ELMo at different locations in the supervised model.

Analysis Adding ELMo to a model increases the sample efficiency considerably, both in terms of the number of parameter updates needed to reach state-of-the-art performance and in terms of the overall training set size. For example, the SRL model reaches its maximum development F1 after 486 epochs of training without ELMo. After adding ELMo, the model exceeds the baseline maximum at epoch 10, a 98% relative decrease in the number of updates needed to reach the same level of performance.
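Using epochs as a rough proxy for the number of parameter updates, the 98% figure follows directly:

```latex
\frac{486 - 10}{486} \approx 0.979 \approx 98\%
```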

Analysis The figure compares the performance of baseline models with and without ELMo as the percentage of the full training set is varied from 0.1% to 100%. Improvements with ELMo are largest for smaller training sets and significantly reduce the amount of training data needed to reach a given level of performance. In the SRL case, the ELMo model trained on 1% of the training set has about the same F1 as the baseline model trained on 10% of the training set.

5 Conclusion

Conclusion The paper introduced a general approach for learning high-quality deep context-dependent representations from biLMs, and showed large improvements when applying ELMo to a broad range of NLP tasks. Building a model in which the same word can have different representations depending on its context is clearly beneficial.

Thank you! The paper’s url: https://arxiv.org/pdf/1802.05365.pdf Source code:  http://allennlp.org/elmo