Deep contextualized word representations (NAACL 2018 best paper)
Lecturer: Zhaoyang Wang
Paper: https://arxiv.org/pdf/1802.05365.pdf
1 Introduction
Motivation
Pre-trained word representations should model both:
(1) complex characteristics of word use (e.g., syntax and semantics), and
(2) how these uses vary across linguistic contexts (i.e., to model polysemy).
Traditional word embeddings
These approaches for learning word vectors only allow a single, context-independent representation for each word.
Example
1. Jobs was the CEO of Apple.
2. He finally ate the apple.
The word "apple" does not mean the same thing in the two sentences, but a traditional word embedding gives it only one representation.
Introduction
The paper introduces a new type of deep contextualized word representation that models both (1) complex characteristics of word use and (2) how these uses vary across linguistic contexts. The representations are called ELMo (Embeddings from Language Models). They can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems.
Introduction
Due to their ability to capture syntactic and semantic information of words from large-scale unlabeled text, pretrained word vectors are a standard component of most state-of-the-art NLP architectures. Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus.
Introduction
In this paper, we take full advantage of access to plentiful monolingual data, and train our biLM on a corpus with approximately 30 million sentences. We also generalize these approaches to deep contextual representations, which we show work well across a broad range of diverse NLP tasks.
2 biLM: Bidirectional language models
biLM: Bidirectional language models
Given a sequence of N tokens (t_1, t_2, ..., t_N), a forward LM computes the probability of the sequence by modeling the probability of token t_k given the history (t_1, ..., t_{k-1}):

p(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} p(t_k | t_1, t_2, ..., t_{k-1})

A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

p(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)
biLM: Bidirectional language models
A biLM combines both a forward and a backward LM. Our formulation jointly maximizes the log likelihood of the forward and backward directions:

\sum_{k=1}^{N} ( \log p(t_k | t_1, ..., t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k | t_{k+1}, ..., t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) )

We tie the parameters for both the token representation (Θ_x) and the Softmax layer (Θ_s) in the forward and backward directions, while maintaining separate parameters for the LSTMs in each direction.
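To make the tying concrete, here is a minimal PyTorch sketch of this joint objective. It is a toy stand-in, not the paper's model: the real biLM uses a character-CNN token representation and two large projected LSTM layers per direction, and all names and sizes below are illustrative.

```python
import torch
import torch.nn as nn

class TinyBiLM(nn.Module):
    """Toy biLM: token embedding (Theta_x) and Softmax layer (Theta_s) are shared
    across directions, while each direction keeps its own LSTM parameters."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)            # Theta_x (tied)
        self.fwd_lstm = nn.LSTM(dim, dim, batch_first=True)   # forward-direction LSTM
        self.bwd_lstm = nn.LSTM(dim, dim, batch_first=True)   # backward-direction LSTM
        self.softmax = nn.Linear(dim, vocab_size)             # Theta_s (tied)
        self.nll = nn.CrossEntropyLoss()

    def forward(self, tokens):                                # tokens: [batch, N]
        x = self.embed(tokens)
        # Forward LM: predict t_k from (t_1, ..., t_{k-1}).
        h_fwd, _ = self.fwd_lstm(x[:, :-1])
        fwd_loss = self.nll(self.softmax(h_fwd).reshape(-1, self.softmax.out_features),
                            tokens[:, 1:].reshape(-1))
        # Backward LM: run over the reversed sequence, i.e. predict t_k from (t_{k+1}, ..., t_N).
        x_rev = torch.flip(x, dims=[1])
        h_bwd, _ = self.bwd_lstm(x_rev[:, :-1])
        bwd_loss = self.nll(self.softmax(h_bwd).reshape(-1, self.softmax.out_features),
                            torch.flip(tokens, dims=[1])[:, 1:].reshape(-1))
        # Minimizing the summed NLL = jointly maximizing both log likelihoods.
        return fwd_loss + bwd_loss

loss = TinyBiLM()(torch.randint(0, 1000, (2, 7)))
loss.backward()
```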
3 ELMo: Embeddings from Language Models
ELMo: Embeddings from Language Models
Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token t_k, an L-layer biLM computes a set of 2L + 1 representations:

R_k = { x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} | j = 1, ..., L } = { h_{k,j}^{LM} | j = 0, ..., L }

where h_{k,0}^{LM} is the (context-independent) token layer and, for each biLSTM layer, h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}] concatenates the forward and backward hidden states.
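A shapes-only PyTorch sketch of this bookkeeping for L = 2 (all sizes and tensors here are illustrative placeholders; in the released model x_k comes from a character CNN and each direction has 512-dimensional projected hidden states):

```python
import torch

# Illustrative 2-layer biLM (L = 2), 512 units per direction, one sentence of 6 tokens.
batch, seq_len, dim = 1, 6, 512
x = torch.randn(batch, seq_len, 2 * dim)                    # token layer x_k (context-independent)
fwd = [torch.randn(batch, seq_len, dim) for _ in range(2)]  # forward hidden states, layers 1..L
bwd = [torch.randn(batch, seq_len, dim) for _ in range(2)]  # backward hidden states, layers 1..L

# 2L + 1 = 5 representations per token; concatenating the two directions
# per layer gives the L + 1 layer vectors h_{k,j} used by ELMo.
layers = [x] + [torch.cat([f, b], dim=-1) for f, b in zip(fwd, bwd)]
assert len(layers) == 3 and all(t.shape[-1] == 2 * dim for t in layers)
```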
ELMo: Embeddings from Language Models
For a specific downstream task, ELMo learns a set of weights to collapse these representations into a single vector (in the simplest case, ELMo just selects the top layer, E(R_k) = h_{k,L}^{LM}):

ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}    (1)

In (1), s^{task} are softmax-normalized weights and the scalar parameter \gamma^{task} allows the task model to scale the entire ELMo vector.
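A minimal PyTorch sketch of the weighted combination in (1); the function and argument names are illustrative (in the released AllenNLP code this role is played by a scalar-mix module), and it could equally consume the `layers` list from the previous sketch.

```python
import torch

def elmo_task_vector(layer_reps, s_logits, gamma):
    """Eq. (1): ELMo_k^task = gamma^task * sum_j softmax(s^task)_j * h_{k,j}.
    layer_reps: list of L+1 tensors h_{k,j}, each of shape [batch, seq_len, dim]."""
    s = torch.softmax(s_logits, dim=0)                  # softmax-normalized weights s^task
    stacked = torch.stack(layer_reps, dim=0)            # [L+1, batch, seq_len, dim]
    return gamma * (s.view(-1, 1, 1, 1) * stacked).sum(dim=0)

layers = [torch.randn(1, 6, 1024) for _ in range(3)]    # L + 1 = 3 layer representations
s_logits = torch.zeros(3, requires_grad=True)           # learned jointly with the task model
gamma = torch.ones(1, requires_grad=True)               # learned scale gamma^task
elmo_vec = elmo_task_vector(layers, s_logits, gamma)    # [1, 6, 1024]
```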
ELMo: Embeddings from Language Models
Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model. We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations.
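For example, the released AllenNLP implementation (linked at the end of these slides) exposes the pre-trained biLM roughly as follows. This sketch follows the classic allennlp 0.x tutorial API; the options and weight file paths are placeholders for the published pre-trained model files.

```python
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "elmo_options.json"   # placeholder: pre-trained biLM configuration
weight_file = "elmo_weights.hdf5"    # placeholder: pre-trained biLM weights

# num_output_representations=1 -> one learned scalar mix over the biLM layers.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["Jobs", "was", "the", "CEO", "of", "Apple"],
             ["He", "finally", "ate", "the", "apple"]]
character_ids = batch_to_ids(sentences)             # [batch, max_len, 50] character ids

out = elmo(character_ids)
elmo_vectors = out["elmo_representations"][0]       # [batch, max_len, dim], context-dependent
# The two occurrences of "apple"/"Apple" now receive different vectors, which the
# downstream task model can concatenate with its usual word embeddings.
```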
4 Analysis
Analysis
The table shows the performance of ELMo across a diverse set of six benchmark NLP tasks: question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis. For each task, a baseline model is first built and ELMo is then added. All six tasks improve, with gains of roughly 2 points, and the final results all exceed the previous state of the art (SOTA). The performance metric varies across tasks: accuracy for SNLI and SST-5, F1 for SQuAD, SRL and NER, and average F1 for Coref. Due to the small test sizes for NER and SST-5, we report the mean and standard deviation across five runs with different random seeds.
Analysis
All of the task architectures in this paper include word embeddings only as input to the lowest-layer biRNN. However, we find that including ELMo at the output of the biRNN in task-specific architectures improves overall results for some tasks. One possible explanation for this result is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM's internal representations. In the SRL case, the task-specific context representations are likely more important than those from the biLM.
Table: Development set performance for SQuAD, SNLI and SRL when including ELMo at different locations in the supervised model.
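A shapes-only sketch of the two placements being compared (all dimensions and tensors below are illustrative placeholders, not the paper's exact task architectures):

```python
import torch
import torch.nn as nn

batch, seq_len = 2, 10
word_dim, elmo_dim, hidden = 100, 1024, 128

word_emb = torch.randn(batch, seq_len, word_dim)    # usual context-independent embeddings
elmo_in  = torch.randn(batch, seq_len, elmo_dim)    # ELMo mix used at the biRNN input
elmo_out = torch.randn(batch, seq_len, elmo_dim)    # a second ELMo mix used at the biRNN output

birnn = nn.LSTM(word_dim + elmo_dim, hidden, batch_first=True, bidirectional=True)

# "Input only": concatenate ELMo with the word embeddings before the biRNN.
h, _ = birnn(torch.cat([word_emb, elmo_in], dim=-1))

# "Input & output": additionally concatenate ELMo with the biRNN outputs, e.g. right
# before an attention layer (as in the SNLI and SQuAD architectures).
h_with_elmo = torch.cat([h, elmo_out], dim=-1)      # [batch, seq_len, 2*hidden + elmo_dim]
```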
Analysis
Adding ELMo to a model increases the sample efficiency considerably, both in terms of number of parameter updates to reach state-of-the-art performance and the overall training set size. For example, the SRL model reaches a maximum development F1 after 486 epochs of training without ELMo. After adding ELMo, the model exceeds the baseline maximum at epoch 10, a 98% relative decrease in the number of updates needed to reach the same level of performance.
Analysis
The figure compares the performance of baseline models with and without ELMo as the percentage of the full training set is varied from 0.1% to 100%. Improvements with ELMo are largest for smaller training sets and significantly reduce the amount of training data needed to reach a given level of performance. In the SRL case, the ELMo model with 1% of the training set has about the same F1 as the baseline model with 10% of the training set.
5 Conclusion
Conclusion
The paper introduced a general approach for learning high-quality deep context-dependent representations from biLMs, and showed large improvements when applying ELMo to a broad range of NLP tasks. In short, building models in which a word can take on different meanings in different contexts pays off.
Thank you!
Paper: https://arxiv.org/pdf/1802.05365.pdf
Source code: http://allennlp.org/elmo