BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding¶
Organization: Google AI Language
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
1 Introduction¶
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.
The feature-based approach uses task-specific architectures that include the pre-trained representations as additional features.
The fine-tuning approach introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.
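A schematic contrast between the two strategies, sketched in PyTorch with stand-in modules (illustrative only, not the actual BERT architecture):

```python
import torch

# Stand-ins for a pre-trained encoder and a task-specific output head.
encoder = torch.nn.Linear(768, 768)   # pretend this is a pre-trained model
head = torch.nn.Linear(768, 2)        # task-specific parameters

# Feature-based strategy: the pre-trained representations are frozen and used
# as additional features inside a task-specific architecture.
for p in encoder.parameters():
    p.requires_grad = False
feature_optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Fine-tuning strategy (BERT): minimal task-specific parameters are added and
# all parameters, pre-trained ones included, are updated on the downstream task.
for p in encoder.parameters():
    p.requires_grad = True
finetune_optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=2e-5
)
```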
We demonstrate the importance of bidirectional pre-training for language representations.
Pre-trained representations reduce the need for many heavily-engineered task-specific architectures.
BERT is the first fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.
3 BERT¶
3.1 Pre-training BERT¶
Masked LM: to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random and then predict those masked tokens (a cloze-style task).
Next Sentence Prediction (NSP): to teach the model sentence relationships, it is also pre-trained on a binarized next-sentence prediction task that can be trivially generated from any monolingual corpus.
Pre-training data: the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia, only the text passages are extracted; lists, tables, and headers are ignored. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus.
3.2 Fine-tuning BERT¶
Fine-tuning is straightforward, since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve single text or text pairs, by swapping out the appropriate inputs and outputs.
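As a rough illustration of this input/output swapping, here is a minimal sketch using the Hugging Face `transformers` library (not part of the original paper); the checkpoint name and classification heads are illustrative:

```python
# A sketch, not the paper's implementation: the same pre-trained encoder is
# reused for a single-sentence task and a sentence-pair task by changing how
# the input is packed and which output head reads the [CLS] representation.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
hidden = encoder.config.hidden_size

# Single text (e.g., sentiment classification): one sentence in, [CLS] out.
single = tokenizer("the movie was great", return_tensors="pt")
cls_single = encoder(**single).last_hidden_state[:, 0]
sentiment_logits = torch.nn.Linear(hidden, 2)(cls_single)

# Text pair (e.g., natural language inference): sentences A and B are packed
# into one sequence; the classifier again reads the [CLS] representation.
pair = tokenizer("a man is sleeping", "a person is awake", return_tensors="pt")
cls_pair = encoder(**pair).last_hidden_state[:, 0]
nli_logits = torch.nn.Linear(hidden, 3)(cls_pair)

# During fine-tuning, all parameters (encoder + task head) are updated end to end.
```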
Appendix A Additional Details for BERT¶
A.1 Illustration of the Pre-training Tasks¶
Masked LM and the Masking Procedure:
* 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
* 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
* 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
Notes:
1. 15% of the tokens are selected for masking/prediction.
2. Only 1.5% of all tokens (10% of the 15%) are replaced with random words; this proportion is small enough not to hurt the model's language understanding ability (a code sketch of the full procedure follows below).
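A small Python sketch of this masking procedure (simplified: no whole-word handling, special tokens are never selected, and the vocabulary is a toy list):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply the 80/10/10 masking scheme to a list of WordPiece tokens.
    Returns (masked_tokens, labels); labels[i] holds the original token at
    positions chosen for prediction and None elsewhere."""
    masked, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)                  # not predicted
            continue
        labels.append(tok)                       # the model must recover this token
        r = random.random()
        if r < 0.8:
            masked.append("[MASK]")              # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(random.choice(vocab))  # 10%: replace with a random word
        else:
            masked.append(tok)                   # 10%: keep the observed word
    return masked, labels

print(mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"],
                  vocab=["apple", "car", "run"]))
```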
Next Sentence Prediction:
Input = [CLS] the man went to [MASK] store [SEP]
he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP]
penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
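A sketch of how such IsNext/NotNext pairs can be generated from a tokenized corpus (function and variable names are illustrative, and each document is assumed to contain at least two sentences):

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Build one next-sentence-prediction example.
    doc_sentences: consecutive tokenized sentences from a single document.
    corpus_sentences: pool of sentences from the whole corpus for negatives."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"              # 50%: actual next sentence
    else:
        sent_b, label = random.choice(corpus_sentences), "NotNext"  # 50%: random sentence
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)    # sentence A/B embeddings
    return tokens, segments, label

doc = [["the", "man", "went", "to", "the", "store"],
       ["he", "bought", "a", "gallon", "of", "milk"]]
pool = [["penguin", "##s", "are", "flight", "##less", "birds"]]
print(make_nsp_example(doc, doc + pool))
```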
A.2 Pre-training Procedure¶
To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as "sentences" even though they are typically much longer than single sentences (but can also be shorter).
The first sentence receives the A embedding and the second receives the B embedding.
50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus; this is done for the "next sentence prediction" task.
LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration is given to partial word pieces.
A.3 Fine-tuning Procedure¶
The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:
• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Number of epochs: 2, 3, 4
Large datasets (e.g., 100k+ labeled training examples) are far less sensitive to hyperparameter choice than small datasets. A grid-search sketch over these ranges follows below.
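A sketch of an exhaustive search over this grid; `train_and_evaluate` is a hypothetical placeholder for a routine that fine-tunes all BERT parameters with the given settings and returns the development-set metric:

```python
import itertools
import random

def train_and_evaluate(checkpoint, batch_size, learning_rate, epochs):
    # Hypothetical helper: fine-tune BERT plus a task head with these settings
    # and return the dev-set score. A random number stands in for real training.
    return random.random()

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

best_score, best_config = float("-inf"), None
for bs, lr, n_epochs in itertools.product(batch_sizes, learning_rates, epoch_counts):
    score = train_and_evaluate("bert-base-uncased", batch_size=bs,
                               learning_rate=lr, epochs=n_epochs)
    if score > best_score:
        best_score, best_config = score, (bs, lr, n_epochs)

print("best dev score:", best_score, "with (batch size, lr, epochs) =", best_config)
```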
A.4 Comparison of BERT and OpenAI GPT¶
GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.
GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.
GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.