GPT2: Language Models are Unsupervised Multitask Learners
The Illustrated GPT-2


Different GPT-2 model sizes
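As a stand-in for the size-comparison figure from the linked post, here is a small summary of the four released GPT-2 checkpoints; the figures are as reported in The Illustrated GPT-2 and should be read as approximate:

```python
# Approximate sizes of the four released GPT-2 models, as reported in
# "The Illustrated GPT-2" (layers = decoder blocks, d_model = embedding width).
GPT2_SIZES = {
    "small":       {"layers": 12, "d_model": 768,  "params": "117M"},
    "medium":      {"layers": 24, "d_model": 1024, "params": "345M"},
    "large":       {"layers": 36, "d_model": 1280, "params": "762M"},
    "extra-large": {"layers": 48, "d_model": 1600, "params": "1542M"},
}
```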

Self-attention (BERT) vs. masked self-attention (GPT-2)
It is important to be clear about the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses). A normal self-attention block allows a position to peek at tokens to its right; masked self-attention prevents that from happening.
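To make the difference concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention with an optional causal mask; the function names and toy shapes are illustrative assumptions, not GPT-2's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention.

    q, k, v: arrays of shape (seq_len, d_k).
    causal=False -> normal self-attention: every position attends to all
                    tokens, left and right (BERT-style).
    causal=True  -> masked self-attention: a position attends only to itself
                    and tokens to its left (GPT-2-style).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                # (seq_len, seq_len)
    if causal:
        seq_len = scores.shape[0]
        # Upper-triangular mask blocks attention to future (right-side) tokens.
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ v

# Toy example: 4 tokens with 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out_bert_style = self_attention(x, x, x, causal=False)
out_gpt2_style = self_attention(x, x, x, causal=True)
```

With `causal=True`, the attention weights for position i are zero for all positions j > i, which is exactly what lets GPT-2 generate text left to right without seeing the future.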
References
The Illustrated GPT-2 (Visualizing Transformer Language Models): https://jalammar.github.io/illustrated-gpt2/