GPT2: Language Models are Unsupervised Multitask Learners
##########################################################

* GitHub: https://github.com/openai/gpt-2
* Blog: https://openai.com/blog/better-language-models/

The Illustrated GPT-2
=====================

.. figure:: https://img.zhaoweiguo.com/uPic/2024/11/vjHN8R.png

.. figure:: https://img.zhaoweiguo.com/uPic/2024/11/vWghLu.png

   The different GPT-2 model sizes

.. figure:: https://img.zhaoweiguo.com/uPic/2024/11/2nJfMI.png

   Self-attention (BERT) vs. masked self-attention (GPT-2)

It is important to keep the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses) clear: a normal self-attention block allows a position to peek at tokens to its right, while masked self-attention prevents that from happening.

References
==========

* The Illustrated GPT-2 (Visualizing Transformer Language Models): https://jalammar.github.io/illustrated-gpt2/
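A minimal sketch of that difference, assuming a plain NumPy implementation of scaled dot-product attention (the names ``self_attention`` and ``causal`` below are illustrative, not from the post or the GPT-2 code): with ``causal=True`` the upper-triangular part of the score matrix is set to negative infinity before the softmax, so each position attends only to itself and the tokens to its left.

.. code-block:: python

   import numpy as np

   def softmax(x, axis=-1):
       x = x - x.max(axis=axis, keepdims=True)
       e = np.exp(x)
       return e / e.sum(axis=axis, keepdims=True)

   def self_attention(q, k, v, causal=False):
       """Scaled dot-product attention over one sequence.

       q, k, v: arrays of shape (seq_len, d_k).
       causal=False: every position sees every token (BERT-style).
       causal=True:  positions to the right are masked out (GPT-2-style).
       """
       d_k = q.shape[-1]
       scores = q @ k.T / np.sqrt(d_k)                        # (seq_len, seq_len)
       if causal:
           n = scores.shape[0]
           future = np.triu(np.ones((n, n), dtype=bool), k=1)  # entries with j > i
           scores = np.where(future, -np.inf, scores)          # block "peeking right"
       weights = softmax(scores, axis=-1)                     # each row sums to 1
       return weights @ v, weights

   rng = np.random.default_rng(0)
   x = rng.normal(size=(4, 8))                                # 4 toy tokens, d_k = 8
   _, w_full = self_attention(x, x, x, causal=False)
   _, w_masked = self_attention(x, x, x, causal=True)
   print(np.round(w_masked, 2))   # upper triangle is 0: no attention to future tokens

Running the sketch shows that ``w_masked`` is lower-triangular while ``w_full`` is dense, which is exactly the constraint that lets GPT-2 be trained and sampled left to right.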