BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • https://arxiv.org/abs/1810.04805

  • Organization: Google AI Language

  • We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

  • BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

  • GitHub: https://github.com/google-research/bert

1 Introduction

  • There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning (a sketch contrasting the two follows this list).

    1. The feature-based approach uses task-specific architectures that include the pre-trained representations as additional features.

    2. The fine-tuning approach introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.

  • We demonstrate the importance of bidirectional pre-training for language representations.

  • Pre-trained representations reduce the need for many heavily-engineered task-specific architectures.

  • BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.
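
A minimal PyTorch sketch (my own illustration, not code from the paper or repository) of the two strategies above; the embedding layer stands in for a pre-trained encoder, the linear layer for a task-specific head, and the learning rates are only illustrative.

import torch.nn as nn
from torch.optim import Adam

encoder = nn.Embedding(30522, 768)  # hypothetical stand-in for a pre-trained model
head = nn.Linear(768, 2)            # small task-specific output layer

# Feature-based: freeze the pre-trained weights and train only the task-specific
# model, which consumes the fixed representations as additional features.
for p in encoder.parameters():
    p.requires_grad = False
feature_based_optimizer = Adam(head.parameters(), lr=1e-3)

# Fine-tuning: unfreeze everything and update all pre-trained parameters together
# with the minimal task-specific parameters.
for p in encoder.parameters():
    p.requires_grad = True
fine_tuning_optimizer = Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=2e-5
)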

3 BERT

3.1 Pre-training BERT

  • Masked LM: to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random and then predict those masked tokens (a cloze-style task).

  • Next Sentence Prediction (NSP): the model is additionally trained to predict whether sentence B is the actual sentence that follows sentence A in the corpus (see Appendix A.1 for examples and A.2 for the sampling procedure).

  • Pre-training data: the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia, we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus.

3.2 Fine-tuning BERT

  • Fine-tuning is straightforward, since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve single text or text pairs, by swapping out the appropriate inputs and outputs.
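
A short sketch, under my own assumptions rather than the official google-research/bert code, of what "swapping out the appropriate inputs" looks like: single texts and text pairs share the same [CLS]/[SEP] packing, with segment ids marking sentence A and sentence B.

def build_bert_input(tokens_a, tokens_b=None):
    # Pack one example: [CLS] sentence A [SEP] (sentence B [SEP]), plus A/B segment ids.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:  # text-pair tasks such as question answering or NLI
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# Single text (e.g., sentiment classification) vs. a text pair (e.g., language inference):
print(build_bert_input(["the", "movie", "was", "great"]))
print(build_bert_input(["a", "man", "runs"], ["someone", "is", "moving"]))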

Appendix A Additional Details for BERT

https://img.zhaoweiguo.com/uPic/2024/09/6dVA56.png

Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Of the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

A.1 Illustration of the Pre-training Tasks

Masked LM and the Masking Procedure:

* 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
* 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
* 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy.
* The purpose of this is to bias the representation towards the actual observed word.

Notes:
    1. 15% of the tokens are masked.
    2. Only 1.5% of the tokens (10% of the 15%) are replaced with random words; because the proportion is so small, this does not harm the model's language understanding ability.
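
A small Python sketch of the 80%/10%/10% rule above; the toy vocabulary and random seed are illustrative assumptions, and the real procedure operates on WordPiece tokens.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random.Random(0)):
    # Returns the corrupted sequence plus (position, original token) prediction targets.
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:      # only ~15% of tokens are selected
            continue
        targets.append((i, tok))           # the original token is always the label
        r = rng.random()
        if r < 0.8:                        # 80% of the time: use [MASK]
            masked[i] = "[MASK]"
        elif r < 0.9:                      # 10% of the time: a random word
            masked[i] = rng.choice(vocab)
        # remaining 10% of the time: keep the observed word unchanged
    return masked, targets

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "juice", "runs"]))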

Next Sentence Prediction:

Input = [CLS] the man went to [MASK] store [SEP]
        he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]
        penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

A.2 Pre-training Procedure

  • To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as "sentences" even though they are typically much longer than single sentences (but can also be shorter).

  • The first sentence receives the A embedding and the second receives the B embedding.

  • 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence; this is done for the "next sentence prediction" task.

  • The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration is given to partial word pieces.
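
A minimal sketch of the sentence-pair sampling described above, under my own assumptions about the data layout: `corpus` is a list of documents, each a list of text spans, and the NotNext span is drawn from a randomly chosen document.

import random

def sample_nsp_pair(corpus, rng=random.Random(0)):
    # Pick a document with at least two spans and a starting position for span A.
    doc = rng.choice([d for d in corpus if len(d) > 1])
    i = rng.randrange(len(doc) - 1)
    span_a = doc[i]
    if rng.random() < 0.5:
        return span_a, doc[i + 1], "IsNext"               # 50%: the actual next span
    random_doc = rng.choice(corpus)
    return span_a, rng.choice(random_doc), "NotNext"      # 50%: a random span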

A.3 Fine-tuning Procedure

The optimal hyperparameter values are task-specific, but we found the following ranges of possible values to work well across all tasks:

• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Number of epochs: 2, 3, 4
  • Large datasets (e.g., 100k+ labeled training examples) are far less sensitive to hyperparameter choice than small datasets.
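
A tiny sketch of enumerating the ranges above as a grid; selecting the best configuration on the development set (as the paper does) is left to a hypothetical training loop.

from itertools import product

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
num_epochs = [2, 3, 4]

# All candidate fine-tuning configurations: 2 * 3 * 4 = 24 runs per task.
grid = [
    {"batch_size": bs, "learning_rate": lr, "epochs": ep}
    for bs, lr, ep in product(batch_sizes, learning_rates, num_epochs)
]
print(len(grid), "configurations")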

A.4 Comparison of BERT and OpenAI GPT

    • GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

    • GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

    • GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

    • GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.
