GPT2: Language Models are Unsupervised Multitask Learners
##########################################################

* GitHub: https://github.com/openai/gpt-2
* Blog: https://openai.com/blog/better-language-models/

The Illustrated GPT-2
=====================

.. figure:: https://img.zhaoweiguo.com/uPic/2024/11/vjHN8R.png

.. figure:: https://img.zhaoweiguo.com/uPic/2024/11/vWghLu.png

   The different GPT-2 model sizes

.. figure:: https://img.zhaoweiguo.com/uPic/2024/11/2nJfMI.png

   Self-attention (BERT) vs. masked self-attention (GPT-2)

It is important to keep the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses) clear: a normal self-attention block allows a position to peek at tokens to its right, while masked self-attention prevents that from happening.

References
==========

* The Illustrated GPT-2 (Visualizing Transformer Language Models): https://jalammar.github.io/illustrated-gpt2/
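A minimal sketch of that difference, assuming a plain NumPy implementation of scaled dot-product attention (the names ``self_attention`` and ``causal`` below are illustrative, not from the post or the GPT-2 code): with ``causal=True`` the upper-triangular part of the score matrix is set to negative infinity before the softmax, so each position attends only to itself and the tokens to its left.

.. code-block:: python

   import numpy as np

   def softmax(x, axis=-1):
       x = x - x.max(axis=axis, keepdims=True)
       e = np.exp(x)
       return e / e.sum(axis=axis, keepdims=True)

   def self_attention(q, k, v, causal=False):
       """Scaled dot-product attention over one sequence.

       q, k, v: arrays of shape (seq_len, d_k).
       causal=False: every position sees every token (BERT-style).
       causal=True:  positions to the right are masked out (GPT-2-style).
       """
       d_k = q.shape[-1]
       scores = q @ k.T / np.sqrt(d_k)                        # (seq_len, seq_len)
       if causal:
           n = scores.shape[0]
           future = np.triu(np.ones((n, n), dtype=bool), k=1)  # entries with j > i
           scores = np.where(future, -np.inf, scores)          # block "peeking right"
       weights = softmax(scores, axis=-1)                     # each row sums to 1
       return weights @ v, weights

   rng = np.random.default_rng(0)
   x = rng.normal(size=(4, 8))                                # 4 toy tokens, d_k = 8
   _, w_full = self_attention(x, x, x, causal=False)
   _, w_masked = self_attention(x, x, x, causal=True)
   print(np.round(w_masked, 2))   # upper triangle is 0: no attention to future tokens

Running the sketch shows that ``w_masked`` is lower-triangular while ``w_full`` is dense, which is exactly the constraint that lets GPT-2 be trained and sampled left to right.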