
CPM: A Large-scale Generative Chinese Pre-trained Language Model

  • https://arxiv.org/abs/2012.00413

  • GitHub: https://github.com/TsinghuaAI/CPM

  • GitHub: https://github.com/TsinghuaAI/CPM-Generate

  • Organization: Department of Computer Science and Technology, Tsinghua University & BAAI (Beijing Academy of Artificial Intelligence)

  • Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning. The code and parameters are available at https://github.com/TsinghuaAI/CPM-Generate.
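
The abstract describes CPM as a GPT-style, left-to-right generative model used in few-shot or zero-shot mode by plain text continuation. Below is a minimal zero-shot generation sketch; it assumes the checkpoint is mirrored on the Hugging Face Hub as `TsinghuaAI/CPM-Generate` and that `transformers`, `torch`, `sentencepiece`, and `jieba` are installed. The official repository ships its own inference scripts, so this is only an illustration under those assumptions, not the authors' release code.

```python
# Sketch: zero-shot generation with CPM through the transformers library.
# Assumption: the "TsinghuaAI/CPM-Generate" checkpoint on the Hugging Face Hub
# uses a GPT-2-style decoder config, so AutoModelForCausalLM can load it.
from transformers import AutoModelForCausalLM, CpmTokenizer

tokenizer = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")  # needs jieba + sentencepiece
model = AutoModelForCausalLM.from_pretrained("TsinghuaAI/CPM-Generate")

prompt = "清华大学的校训是"  # "The motto of Tsinghua University is"
inputs = tokenizer(prompt, return_tensors="pt")

# CPM is an autoregressive LM, so zero-shot use is just sampling a continuation.
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Few-shot use follows the same pattern: concatenate a handful of input/output demonstrations into the prompt string and let the model continue it.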

Warning

The content above is outdated; the GitHub repositories have not been updated since around 2020.
