Build a Large Language Model (From Scratch)¶
GitHub (companion code): https://github.com/rasbt/LLMs-from-scratch
[Additional reference] The Illustrated Transformer (a visual walkthrough of the transformer, covering the query/key/value computation, multi-head attention, and other topics in this book): https://jalammar.github.io/illustrated-transformer/
1. Understanding LLM¶
Chapters:
1.1 What is an LLM?
1.2 Applications of LLMs
1.3 Stages of building and using LLMs
1.4 Using LLMs for different tasks
1.5 Utilizing large datasets
1.6 A closer look at the GPT architecture
1.7 Building a large language model
1.8 Summary
AI encompasses, among others:
machine learning
deep learning
rule-based systems
genetic algorithms
expert systems
fuzzy logic
symbolic reasoning
traditional machine learning: human experts might manually extract features from email text such as the frequency of certain trigger words (“prize,” “win,” “free”), the number of exclamation marks, use of all uppercase words, or the presence of suspicious links.
deep learning: does not require manual feature extraction. This means that human experts do not need to identify and select the most relevant features for a deep learning model.
Note
Both traditional machine learning and deep learning require humans to collect labeled data, i.e., which emails are spam and which are not. Traditional machine learning additionally requires experts to manually engineer features; for spam, typical features include frequent trigger words such as "win" or "free", suspicious links, heavy use of uppercase letters, and so on.
The two most popular categories of finetuning LLMs:
1. instruction-finetuning
2. finetuning for classification tasks
A key component of transformers and LLMs is the self-attention mechanism, which allows the model to weigh the importance of different words or tokens in a sequence relative to each other.
* Wikipedia corpus consists of English-language Wikipedia
* Books1 is likely a sample from Project Gutenberg: https://www.gutenberg.org/
* Books2 is likely from Libgen: https://en.wikipedia.org/wiki/Library_Genesis
* CommonCrawl is a filtered subset of the CommonCrawl database: https://commoncrawl.org/
* WebText2 is the text of web pages from all outbound Reddit links from posts with 3+ upvotes.
The ability to perform tasks that the model wasn't explicitly trained to perform is called an "emergent behavior." Emergent behavior refers to capabilities (such as reasoning or arithmetic) that a model exhibits spontaneously once it reaches sufficient scale, without being explicitly trained for them; they are not driven by individual parameters or specific tasks but arise from the increased scale and complexity of the system.
2. Working with Text Data¶
2.1 Understanding word embeddings¶
While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for retrieval-augmented generation.
Word embeddings can have varying dimensions, from one to thousands. As shown in Figure 2.3, we can choose two-dimensional word embeddings for visualization purposes.
2.2 Tokenizing text¶
2.3 Converting tokens into token IDs¶
2.4 Adding special context tokens¶
additional special tokens:
1. [BOS] (beginning of sequence)
2. [EOS] (end of sequence)
3. [PAD] (padding)
4. <|endoftext|>
5. <|unk|>
2.5 Byte pair encoding¶
The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
We use the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance.
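A minimal usage sketch of the tiktoken BPE tokenizer (the sample text here is made up for illustration):
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
# <|endoftext|> must be explicitly allowed as a special token
token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(token_ids)                    # a list of integer token IDs
print(tokenizer.decode(token_ids))  # reconstructs the original text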
2.6 Data sampling with a sliding window¶
For each text chunk, we want the inputs and targets
Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right
The predictions (input-target pairs) look as follows:
and ----> established
and established ----> himself
and established himself ----> in
and established himself in ----> a
Small batch sizes require less memory during training but lead to more noisy model updates. The batch size is a trade-off and hyperparameter to experiment with when training LLMs.
Example (batch_size=1, max_length=4, stride=1):
[tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Example (batch_size=8, max_length=4, stride=4):
Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
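For reference, a minimal sketch of how such input-target pairs can be produced with a sliding window over the token IDs (the class and function names here are illustrative and may differ from the book's):
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)
        # Slide a window of size max_length over the token IDs;
        # the targets are the inputs shifted by one position
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i:i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader_v1(txt, batch_size=8, max_length=4, stride=4, shuffle=False):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)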
2.7 Creating token embeddings¶
We converted the token IDs into a continuous vector representation, the so-called token embeddings.
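A minimal sketch of creating token embeddings with PyTorch's nn.Embedding (the 256-dimensional output is an illustrative choice; the vocabulary size matches the GPT-2 BPE tokenizer):
import torch

vocab_size = 50257   # size of the GPT-2 BPE vocabulary
output_dim = 256     # embedding dimension chosen for illustration

torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

token_ids = torch.tensor([40, 367, 2885, 1464])       # example token IDs from above
token_embeddings = token_embedding_layer(token_ids)   # lookup: one vector per token ID
print(token_embeddings.shape)                         # torch.Size([4, 256])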
2.8 Encoding word positions¶
In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes.
However, since the self-attention mechanism of LLMs itself is also position-agnostic, it is helpful to inject additional position information into the LLM.
two broad categories of position-aware embeddings:
1. relative positional embeddings
2. absolute positional embeddings
Instead of focusing on the absolute position of a token, the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of “how far apart” rather than “at which exact position.” The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn’t seen such lengths during training.
OpenAI’s GPT models use absolute positional embeddings that are optimized during the training process rather than being fixed or predefined like the positional encodings in the original Transformer model. This optimization process is part of the model training itself, which we will implement later in this book. For now, let’s create the initial positional embeddings to create the LLM inputs for the upcoming chapters.
context_length is a variable that represents the supported input size of the LLM.
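A minimal sketch of absolute positional embeddings being created and added to the token embeddings (dimensions follow the previous sketch and are illustrative):
import torch

context_length = 4   # supported input length for this illustration
output_dim = 256

torch.manual_seed(123)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)   # torch.Size([4, 256])

# The positional embeddings are simply added to the token embeddings
# (token_embeddings from the previous sketch has shape (4, 256)):
# input_embeddings = token_embeddings + pos_embeddings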
2.9 Summary¶
Since LLMs can't process raw text, textual data must be converted into numerical vectors, known as embeddings. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.
As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.
Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.
The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.
We use a sliding window approach on tokenized data to generate input-target pairs for LLM training.
Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.
While token embeddings provide consistent vector representations for each token, they lack a sense of the token’s position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI’s GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training.
3. Coding Attention Mechanisms¶
Exploring the reasons for using attention mechanisms in neural networks
Introducing a basic self-attention framework and progressing to an enhanced self-attention mechanism
Implementing a causal attention module that allows LLMs to generate one token at a time
Masking randomly selected attention weights with dropout to reduce overfitting
Stacking multiple causal attention modules into a multi-head attention module
3.1 The problem with modeling long sequences¶
To address the issue that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder. The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text, e.g., encoder-decoder RNNs.
The big issue and limitation of encoder-decoder RNNs is that the RNN can’t directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances.
3.2 Capturing data dependencies with attention mechanisms¶
Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series.
3.3 Attending to different parts of the input with self-attention¶
The “self” in self-attention
In self-attention, the “self” refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image.
This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence, such as the example depicted in Figure 3.5.
3.3.1 A simple self-attention mechanism without trainable weights¶
For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as \(x^{(1)}\), corresponds to a d-dimensional embedding vector representing a specific token, like "Your." These input vectors are shown as 3-dimensional embeddings.
Let's focus on the embedding vector of the second input element, \(x^{(2)}\) (which corresponds to the token "journey"), and the corresponding context vector, \(z^{(2)}\). This enhanced context vector, \(z^{(2)}\), is an embedding that contains information about \(x^{(2)}\) and all other input elements \(x^{(1)}\) to \(x^{(T)}\).
In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence. This is essential in LLMs, which need to understand the relationship and relevance of words in a sentence to each other. Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token.
import torch
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
query = inputs[1] #A
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
# Output
# tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
Beyond viewing the dot product as a mathematical tool that combines two vectors into a scalar value, the dot product is also a measure of similarity, because it quantifies how well two vectors are aligned: a higher dot product indicates a greater degree of alignment or similarity between the vectors. In the context of self-attention, the dot product determines the extent to which elements in a sequence attend to each other: the higher the dot product, the higher the similarity and attention score between two elements.
The main goal behind the normalization is to obtain attention weights that sum to 1.
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())
# Output
# Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
# Sum: tensor(1.0000)
In practice, it’s more common and advisable to use the softmax function for normalization.
In addition, the softmax function ensures that the attention weights are always positive.
def softmax_naive(x):
return torch.exp(x) / torch.exp(x).sum(dim=0)
attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())
# Output
# Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
# Sum: tensor(1.)
Note that this naive softmax implementation (softmax_naive) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values.
PyTorch implementation of softmax
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
# Output
# Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
# Sum: tensor(1.)
query = inputs[1] # 2nd input token is the query
context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
context_vec_2 += attn_weights_2[i]*x_i
print(context_vec_2)
# Output
# tensor([0.4419, 0.6515, 0.5683])
3.3.2 Computing attention weights for all input tokens¶
import torch
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
attn_scores = inputs @ inputs.T
print(attn_scores)
# Output
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
[0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
[0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
[0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
[0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
[0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
attn_weights = torch.softmax(attn_scores, dim=1)
print(attn_weights)
# Output
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
[0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
[0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
[0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
[0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
[0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)
# Output
tensor([[0.4421, 0.5931, 0.5790],
[0.4419, 0.6515, 0.5683],
[0.4431, 0.6496, 0.5671],
[0.4304, 0.6298, 0.5510],
[0.4671, 0.5910, 0.5266],
[0.4177, 0.6503, 0.5645]])
3.4 Implementing self-attention with trainable weights¶
The self-attention mechanism is also called scaled dot-product attention.
The main work in this section is to add, on top of the previous approach, trainable weights that are updated during training, so that the model can learn better weights.
3.4.1 Computing the attention weights step by step¶
three trainable weight matrices \(W_q\) , \(W_k\) , and \(W_v\) .
These three matrices are used to project the embedded input tokens, \(x^{(i)}\), into query, key, and value vectors
Note that in GPT-like models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input (d_in=3) and output (d_out=2) dimensions here.
import torch
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
x_2 = inputs[1] #A
d_in = inputs.shape[1] #B = 3
d_out = 2 #C
# Initialize the three weight matrices Wq, Wk, and Wv
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
# Example structure: shape=(3,2)
# tensor([[0.3821, 0.6605],
# [0.8536, 0.5932],
# [0.6367, 0.9826]])
# Note
# setting requires_grad=False to reduce clutter in the outputs for illustration purposes
# for actual training, set requires_grad=True
# compute the query, key, and value vectors
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)
# Output
# tensor([0.4306, 1.4551])
[Note] Do not confuse the weight parameters with the attention weights.
Weight parameters: the values of a neural network that are optimized during training; "weight" is short for "weight parameters."
Attention weights: determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input.
In short, weight parameters are the fundamental, learned coefficients that define the network's connections, while attention weights are dynamic, context-specific values.
# Although we only compute the context vector z^(2) here, we still need the key and value vectors of all input elements
# obtain all keys and values via matrix multiplication
keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
# Output
# keys.shape: torch.Size([6, 2])
Compute the attention score \(\omega_{22}\):
keys_2 = keys[1] #A
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)
# tensor(1.8524)
In the same way, compute all of \(\omega_2\) (i.e., \(\omega_{21}\) through \(\omega_{2T}\)):
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)
# Output
# tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)
# Output
# tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
The attention scores are scaled by dividing them by the square root of the embedding dimension of the keys (i.e., d_k**0.5).
When scaling up the embedding dimension, which is typically greater than a thousand (e.g., in GPT-4-class models), large dot products can result in very small gradients during backpropagation due to the softmax function applied to them.
As dot products increase, the softmax function behaves more like a step function, resulting in gradients nearing zero.
These small gradients can drastically slow down learning or cause training to stagnate.
The scaling by the square root of the embedding dimension is the reason why this self-attention mechanism is also called scaled dot-product attention.
[Key point] Why the dot products in self-attention are scaled by the square root of the embedding dimension:
Purpose of the scaling: dividing by the square root of the embedding dimension avoids very small gradients during training. Without it, training can run into tiny gradients, which slows learning or causes it to stagnate.
Why the gradients become small: (1) As the embedding dimension (the length of the vectors) grows, the dot product between two vectors grows as well; in GPT-like LLMs the embedding dimension is often in the thousands, so the dot products become large. (2) When softmax is applied to large values, its output distribution becomes very sharp, approximating a step function; most of the probability mass concentrates on a few entries and the gradients elsewhere are nearly zero, so the model's weights are not updated sufficiently during training.
Effect of the scaling: dividing the dot products by the square root of the embedding dimension keeps their magnitude in a reasonable range, which makes the softmax output smoother and the gradients larger, so the model can learn more effectively. This scaled self-attention mechanism is therefore called "scaled dot-product attention."
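A small illustrative sketch (not from the book) of how larger dot products sharpen the softmax output, which is exactly what the division by the square root of d_k counteracts:
import torch

scores = torch.tensor([0.1, -0.2, 0.3, 0.2, -0.1])

# Small-magnitude scores: the softmax output is relatively smooth
print(torch.softmax(scores, dim=-1))

# Scaling the scores up by 8 (mimicking large dot products from a high
# embedding dimension) makes the distribution much sharper, so most
# positions receive near-zero probability and near-zero gradients
print(torch.softmax(scores * 8, dim=-1))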
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)
# Output
# tensor([0.3061, 0.8210])
Note
[Why query, key, and value?] These three terms are borrowed from information retrieval and databases; it helps to understand them together.
query: the "query" represents the item the model is currently focusing on (e.g., a word or token in a sentence). The model uses the query to probe the other parts of the input and decide how much attention to pay to each of them.
key: every item in the input sequence (e.g., every word in the sentence) has a corresponding "key", and these keys are matched against the query. Through this matching, the model finds out which keys (i.e., which parts of the input) are most relevant to the query.
value: once the model has determined which keys are most relevant to the query, it retrieves the corresponding "values". The values hold the actual content or representation of the input items; by extracting them, the model obtains the specific content the current query should attend to.
Note
The previous section relied on the fact that the dot product is larger for more similar vectors: it computed a weight for each word from these similarities and then took a weighted sum of the input vectors to obtain position-dependent context vectors. This section instead uses three trainable weight matrices (Wq, Wk, Wv): Wq and Wk are used to compute the weight of each word, while Wv provides the feature information at each position, and the attention mechanism automatically adjusts how much information to extract from the values at different positions.
Note: why can't we use the input directly instead of Wv? Because Wv determines what content is extracted and what feature information ends up in the output. It can capture more complex features, especially for context dependence and long-sequence modeling, where it distinguishes subtle differences between features at different positions. Using the raw input directly, the model would lack this separation between content and attention distribution, making it hard to weight and combine features from different positions effectively, which would reduce the expressive power of the attention mechanism.
3.4.2 Implementing a compact self-attention Python class¶
Listing 3.1 A compact self-attention class
import torch.nn as nn
class SelfAttention_v1(nn.Module):
def __init__(self, d_in, d_out):
super().__init__()
self.d_out = d_out
self.W_query = nn.Parameter(torch.rand(d_in, d_out))
self.W_key = nn.Parameter(torch.rand(d_in, d_out))
self.W_value = nn.Parameter(torch.rand(d_in, d_out))
def forward(self, x):
keys = x @ self.W_key
queries = x @ self.W_query
values = x @ self.W_value
attn_scores = queries @ keys.T # omega
attn_weights = torch.softmax(
attn_scores / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weights @ values
return context_vec
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))
# Output
tensor([[0.2996, 0.8053],
[0.3061, 0.8210],
[0.3058, 0.8203],
[0.2948, 0.7939],
[0.2927, 0.7891],
[0.2990, 0.8040]], grad_fn=<MmBackward0>)
An advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.
Listing 3.2 A self-attention class using PyTorch's Linear layers
class SelfAttention_v2(nn.Module):
def __init__(self, d_in, d_out, qkv_bias=False):
super().__init__()
self.d_out = d_out
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
def forward(self, x):
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)
attn_scores = queries @ keys.T
attn_scores / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weights @ values
return context_vec
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
# Output
tensor([[-0.0739, 0.0713],
[-0.0748, 0.0703],
[-0.0749, 0.0702],
[-0.0760, 0.0685],
[-0.0763, 0.0679],
[-0.0754, 0.0693]], grad_fn=<MmBackward0>)
Note that SelfAttention_v1 and SelfAttention_v2 give different outputs because they use different initial weights for the weight matrices since nn.Linear uses a more sophisticated weight initialization scheme.
3.5 Hiding future words with causal attention¶
Causal attention, also known as masked attention, is a specialized form of self-attention. It restricts a model to only consider previous and current inputs in a sequence when processing any given token. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once.
3.5.1 Applying a causal attention mask¶
[Step 1] Obtain the attention weights using the approach above
print(attn_weights)
tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
[0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
[0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
[0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
[0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<SoftmaxBackward0>)
[Step 2] Set the entries above the diagonal to 0
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
# Output
tensor([[1., 0., 0., 0., 0., 0.],
[1., 1., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 1., 0.],
[1., 1., 1., 1., 1., 1.]])
# multiply this mask with the attention weights
masked_simple = attn_weights*mask_simple
print(masked_simple)
# Output
tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
[0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
[0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<MulBackward0>)
[Step 3] Renormalize the attention weights so that each row sums to 1
row_sums = masked_simple.sum(dim=1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)
# Output
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
[0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
[0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<DivBackward0>)
[Note on information leakage] When we apply the mask and then renormalize the attention weights, it might seem that information from future tokens (now masked) could still influence the current token, because their values were part of the softmax computation. The key insight, however, is that when we renormalize the attention weights after masking, we are essentially recomputing the softmax over a smaller subset (the masked positions do not contribute to the softmax values). The mathematical elegance of softmax is that, although all positions were initially included in the denominator, after masking and renormalization the effect of the masked positions is cancelled out: they do not contribute to the softmax scores in any meaningful way. So there is no information leakage.
We can exploit a mathematical property of the softmax function to implement the computation of the masked attention weights more efficiently, in fewer steps.
[Key property] When negative-infinity values (-∞) are present in a row, the softmax function treats them as zero probability (mathematically, because \(e^{-\infty}\) approaches 0).
Here we replace the zero entries above with -∞ (i.e., -inf):
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
# Output
tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1)
print(attn_weights)
# Output (already normalized; no further renormalization is needed, which saves a step)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
[0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
[0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<SoftmaxBackward0>)
3.5.2 Masking additional attention weights with dropout¶
Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively “dropping” them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It’s important to emphasize that dropout is only used during training and is disabled afterward.
Note
In the transformer architecture, including models like GPT, dropout in the attention mechanism is typically applied in two specific areas: after calculating the attention scores, or after applying the attention weights to the value vectors. (I have not yet fully pinned down the exact locations.)
My understanding (the explanations from GPT were contradictory; the following is my own reading):
[Dropout location 1] After calculating the attention scores: dropout is applied after the dot products between the Q (query) and K (key) vectors are computed, but before those scores are passed to the softmax function. Dropout at this step helps the model avoid over-relying on particular features, because it randomly invalidates some attention scores, encouraging a more robust attention distribution.
Pseudocode:
function scaledDotProductAttention(Q, K, V, dropoutRate):
    # Compute the attention scores (d_k is the dimension of the key vectors)
    scores = matmul(Q, transpose(K)) / sqrt(d_k)
    # Apply dropout
    scores = applyDropout(scores, dropoutRate)
    # Apply the softmax function
    attentionWeights = softmax(scores)
    # Apply the attention weights to the value vectors
    output = matmul(attentionWeights, V)
    return output
[Dropout location 2] After computing the attention weights: dropout is applied to the attention weights after the softmax step, before they are multiplied with the value vectors. This further helps the model generalize, because even when some attention weights are randomly dropped, the model still has to make accurate predictions from the remaining information.
Pseudocode:
function scaledDotProductAttention(Q, K, V, dropoutRate):
    # Compute the attention scores (d_k is the dimension of the key vectors)
    scores = matmul(Q, transpose(K)) / sqrt(d_k)
    # Apply the softmax function
    attentionWeights = softmax(scores)
    # Apply dropout
    attentionWeights = applyDropout(attentionWeights, dropoutRate)
    # Apply the attention weights to the value vectors
    output = matmul(attentionWeights, V)
    return output
Note
The example below demonstrates "apply the dropout mask after computing the attention weights".
In the example below we drop out 50% of the entries; when training the GPT model later, we will use a dropout rate of 10%-20%.
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) #A
example = torch.ones(6, 6) #B
print(dropout(example))
# Output (about half of the entries are zero)
tensor([[2., 2., 0., 2., 2., 0.],
[0., 0., 0., 2., 0., 2.],
[2., 2., 2., 2., 0., 2.],
[0., 2., 2., 0., 0., 2.],
[0., 2., 0., 2., 0., 2.],
[0., 2., 2., 2., 2., 0.]])
apply dropout to the attention weight matrix:
torch.manual_seed(123)
print(dropout(attn_weights))
# Output
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
[0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
[0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
[0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
grad_fn=<MulBackward0>
3.5.3 Implementing a compact causal attention class¶
Listing 3.3 A compact causal attention class
class CausalAttention(nn.Module):
def __init__(self, d_in, d_out, context_length,
dropout, qkv_bias=False):
super().__init__()
self.d_out = d_out
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.dropout = nn.Dropout(dropout) # A
self.register_buffer(
'mask',
torch.triu(
torch.ones(context_length, context_length),
diagonal=1
)
) #B
def forward(self, x):
b, num_tokens, d_in = x.shape #C
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)
attn_scores = queries @ keys.transpose(1, 2) #C
attn_scores.masked_fill_( #D
self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
context_vec = attn_weights @ values
return context_vec
import torch
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
batch = torch.stack((inputs, inputs), dim=0)
torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)
# Output
# context_vecs.shape: torch.Size([2, 6, 2])
3.6 Extending single-head attention to multi-head attention¶
3.6.1 Stacking multiple single-head attention layers¶
Listing 3.4 A wrapper class to implement multi-head attention
class MultiHeadAttentionWrapper(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
self.heads = nn.ModuleList(
[CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
for _ in range(num_heads)]
)
def forward(self, x):
return torch.cat([head(x) for head in self.heads], dim=-1)
torch.manual_seed(123)
context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
d_in, d_out, context_length, 0.0, num_heads=2
)
context_vecs = mha(batch)
print("context_vecs.shape:", context_vecs.shape)
#context_vecs.shape: torch.Size([2, 6, 4])
print(context_vecs)
# Output
# tensor([[[-0.4519, 0.2216, 0.4772, 0.1063],
# [-0.5874, 0.0058, 0.5891, 0.3257],
# [-0.6300, -0.0632, 0.6202, 0.3860],
# [-0.5675, -0.0843, 0.5478, 0.3589],
# [-0.5526, -0.0981, 0.5321, 0.3428],
# [-0.5299, -0.1081, 0.5077, 0.3493]],
# [[-0.4519, 0.2216, 0.4772, 0.1063],
# [-0.5874, 0.0058, 0.5891, 0.3257],
# [-0.6300, -0.0632, 0.6202, 0.3860],
# [-0.5675, -0.0843, 0.5478, 0.3589],
# [-0.5526, -0.0981, 0.5321, 0.3428],
# [-0.5299, -0.1081, 0.5077, 0.3493]]], grad_fn=<CatBackward0>)
# Notes
# first dimension: context_vecs tensor is 2 since we have two input texts
# (the two outputs are identical because the two input texts in the batch are identical)
# second dimension: 6 tokens in each input
# third dimension: 4-dimensional embedding of each token
3.6.2 Implementing multi-head attention with weight splits¶
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
assert (d_out % num_heads == 0), \
"d_out must be divisible by num_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
self.dropout = nn.Dropout(dropout)
self.register_buffer(
"mask",
torch.triu(torch.ones(context_length, context_length),
diagonal=1)
)
def forward(self, x):
b, num_tokens, d_in = x.shape
keys = self.W_key(x) # Shape: (b, num_tokens, d_out) => [2, 6, 2]
queries = self.W_query(x)
values = self.W_value(x)
# We implicitly split the matrix by adding a `num_heads` dimension
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)  # -> [b, num_tokens, num_heads, head_dim] = [2, 6, 2, 1]
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
keys = keys.transpose(1, 2)  # -> [b, num_heads, num_tokens, head_dim] = [2, 2, 6, 1]
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)
# Compute scaled dot-product attention (aka self-attention) with a causal mask
attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head: [2, 2, 6, 1] @ [2, 2, 1, 6] -> [2, 2, 6, 6] = [b, num_heads, num_tokens, num_tokens]
# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)  # [2, 2, 6, 6] @ [2, 2, 6, 1] -> [2, 2, 6, 1]; transpose -> [2, 6, 2, 1]
# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)  # -> [2, 6, 2]
context_vec = self.out_proj(context_vec) # optional projection
return context_vec
Usage example:
torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print("context_vecs.shape:", context_vecs.shape)
# context_vecs.shape: torch.Size([2, 6, 2])
print(context_vecs)
# tensor([[[0.3190, 0.4858],
# [0.2943, 0.3897],
# [0.2856, 0.3593],
# [0.2693, 0.3873],
# [0.2639, 0.3928],
# [0.2575, 0.4028]],
# [[0.3190, 0.4858],
# [0.2943, 0.3897],
# [0.2856, 0.3593],
# [0.2693, 0.3873],
# [0.2639, 0.3928],
# [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
3.7 Summary¶
Attention mechanisms transform input elements into enhanced context vector representations that incorporate information about all inputs.
A self-attention mechanism computes the context vector representation as a weighted sum over the inputs.
In a simplified attention mechanism, the attention weights are computed via dot products.
A dot product is just a concise way of multiplying two vectors element-wise and then summing the products.
Matrix multiplications, while not strictly required, help us to implement computations more efficiently and compactly by replacing nested for-loops.
In self-attention mechanisms that are used in LLMs, also called scaled-dot product attention, we include trainable weight matrices to compute intermediate transformations of the inputs: queries, values, and keys. When working with LLMs that read and generate text from left to right, we add a causal attention mask to prevent the LLM from accessing future tokens.
Next to causal attention masks to zero out attention weights, we can also add a dropout mask to reduce overfitting in LLMs.
The attention modules in transformer-based LLMs involve multiple instances of causal attention, which is called multi-head attention.
We can create a multi-head attention module by stacking multiple instances of causal attention modules.
A more efficient way of creating multi-head attention modules involves batched matrix multiplications.
4 Implementing a GPT model from Scratch To Generate Text¶
Coding a GPT-like large language model (LLM) that can be trained to generate human-like text
Normalizing layer activations to stabilize neural network training
Adding shortcut connections in deep neural networks to train models more effectively
Implementing transformer blocks to create GPT models of various sizes
Computing the number of parameters and storage requirements of GPT models
4.1 Coding an LLM architecture¶
In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page
In this chapter, we consider embedding and model sizes akin to a small GPT-2 model
We’ll specifically code the architecture of the smallest GPT-2 model (124 million parameters)
configuration of the small GPT-2 model:
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
import torch
import torch.nn as nn
class DummyGPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
# Use a placeholder for TransformerBlock
self.trf_blocks = nn.Sequential(
*[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
# Use a placeholder for LayerNorm
self.final_norm = DummyLayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
class DummyTransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
# A simple placeholder
def forward(self, x):
# This block does nothing and just returns its input.
return x
class DummyLayerNorm(nn.Module):
def __init__(self, normalized_shape, eps=1e-5):
super().__init__()
# The parameters here are just to mimic the LayerNorm interface.
def forward(self, x):
# This layer does nothing and just returns its input.
return x
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
# tensor([[ 6109, 3626, 6100, 345], #A
# [ 6109, 1110, 6622, 257]])
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)
# Output shape: torch.Size([2, 4, 50257])
# tensor([[[-1.2034, 0.3201, -0.7130, ..., -1.5548, -0.2390, -0.4667],
# [-0.1192, 0.4539, -0.4432, ..., 0.2392, 1.3469, 1.2430],
# [ 0.5307, 1.6720, -0.4695, ..., 1.1966, 0.0111, 0.5835],
# [ 0.0139, 1.6755, -0.3388, ..., 1.1586, -0.0435, -1.0400]],
# [[-1.0908, 0.1798, -0.9484, ..., -1.6047, 0.2439, -0.4530],
# [-0.7860, 0.5581, -0.0610, ..., 0.4835, -0.0077, 1.6621],
# [ 0.3567, 1.2698, -0.6398, ..., -0.0162, -0.1296, 0.3717],
# [-0.2407, -0.7349, -0.5102, ..., 2.0057, -0.3694, 0.1814]]],
# grad_fn=<UnsafeViewBackward0>)
4.2 Normalizing activations with layer normalization¶
Training deep neural networks with many layers can sometimes prove challenging due to issues like vanishing or exploding gradients.
These issues lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimizes the loss function.
In other words, the network has difficulty learning the underlying patterns in the data to a degree that would allow it to make accurate predictions or decisions.
The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance.
# Layer normalization: standardize activations to zero mean and unit variance along the last (feature) dimension
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
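A quick usage sketch (illustrative; it assumes the LayerNorm class above), verifying that each row of the output has roughly zero mean and unit variance:
torch.manual_seed(123)
batch_example = torch.randn(2, 5)   # 2 samples with 5 features each (illustrative)
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
# Each row of the normalized output has (approximately) mean 0 and variance 1
print("Mean:\n", out_ln.mean(dim=-1, keepdim=True))
print("Variance:\n", out_ln.var(dim=-1, unbiased=False, keepdim=True))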
4.3 Implementing a feed forward network with GELU activations¶
This section mainly covers the implementation and characteristics of the GELU activation.
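For reference, a sketch of the tanh-based GELU approximation and the feed forward module built around it (the same formulation appears again in the consolidated chapter code further below):
import torch
import torch.nn as nn

class GELU(nn.Module):
    # Tanh-based approximation of the GELU activation
    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    # Expands the embedding dimension by a factor of 4, applies GELU, projects back
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)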
4.4 Adding shortcut connections¶
Shortcut connections, also known as skip or residual connections, were originally proposed for deep networks in computer vision (specifically, in residual networks) to mitigate the challenge of vanishing gradients.
The vanishing gradient problem refers to the issue where gradients (which guide weight updates during training) become progressively smaller as they propagate backward through the layers, making it difficult to effectively train earlier layers.
class ExampleDeepNeuralNetwork(nn.Module):
def __init__(self, layer_sizes, use_shortcut):
super().__init__()
self.use_shortcut = use_shortcut
self.layers = nn.ModuleList([
nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
])
def forward(self, x):
for layer in self.layers:
# Compute the output of the current layer
layer_output = layer(x)
# Check if shortcut can be applied
if self.use_shortcut and x.shape == layer_output.shape:
x = x + layer_output
else:
x = layer_output
return x
def print_gradients(model, x):
# Forward pass
output = model(x)
target = torch.tensor([[0.]])
# Calculate loss based on how close the target
# and output are
loss = nn.MSELoss()
loss = loss(output, target)
# Backward pass to calculate the gradients
loss.backward()
for name, param in model.named_parameters():
if 'weight' in name:
# Print the mean absolute gradient of the weights
print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")
print the gradient values without shortcut connections:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
layer_sizes, use_shortcut=False
)
print_gradients(model_without_shortcut, sample_input)
# Output
# layers.0.0.weight has gradient mean of 0.00020173587836325169
# layers.1.0.weight has gradient mean of 0.0001201116101583466
# layers.2.0.weight has gradient mean of 0.0007152041653171182
# layers.3.0.weight has gradient mean of 0.001398873864673078
# layers.4.0.weight has gradient mean of 0.005049646366387606
print the gradient values with shortcut connections:
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)
# Output
# layers.0.0.weight has gradient mean of 0.22169792652130127
# layers.1.0.weight has gradient mean of 0.20694105327129364
# layers.2.0.weight has gradient mean of 0.32896995544433594
# layers.3.0.weight has gradient mean of 0.2665732502937317
# layers.4.0.weight has gradient mean of 1.3258541822433472
4.5 Connecting attention and linear layers in a transformer block¶
The idea is that the self-attention mechanism in the multi-head attention block identifies and analyzes relationships between elements in the input sequence.
In contrast, the feed forward network modifies the data individually at each position.
This combination not only enables a more nuanced understanding and processing of the input but also enhances the model's overall capacity for handling complex data patterns.
[Why the combination matters, from GPT] The self-attention mechanism provides global information: by capturing relationships between elements of the sequence, it lets the model understand context and structural complexity. The feed forward network strengthens local information: it applies a position-specific non-linear transformation that improves each position's own feature representation. Together, the model can capture global patterns while also handling local details, giving it stronger capacity for complex data patterns. Example: in a sentence translation task, self-attention helps the model understand the sentence structure and the relationships between words, while the feed forward network adjusts and refines the specific information of each word, producing a more accurate translation.
from previous_chapters import MultiHeadAttention
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"])
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
# Shortcut connection for feed forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x)
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
return x
The input and output having the same shape is a key design aspect:
The preservation of shape throughout the transformer block architecture is not incidental but a crucial aspect of its design.
This design enables its effective application across a wide range of sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship.
However, the output is a context vector that encapsulates information from the entire input sequence, as we learned in chapter 3.
This means that while the physical dimensions of the sequence (length and feature size) remain unchanged as it passes through the transformer block, the content of each output vector is re-encoded to integrate contextual information from across the entire input sequence.
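A quick sanity-check sketch (illustrative; it assumes the TransformerBlock class above and GPT_CONFIG_124M):
torch.manual_seed(123)
x = torch.rand(2, 4, 768)                  # [batch_size, num_tokens, emb_dim]
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)
print("Input shape: ", x.shape)            # torch.Size([2, 4, 768])
print("Output shape:", output.shape)       # torch.Size([2, 4, 768])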
4.6 Coding the GPT model¶
the transformer block is repeated many times throughout a GPT model architecture.
In the case of the 124 million parameter GPT-2 model, it’s repeated 12 times
In the case of the largest GPT-2 model with 1,542 million parameters, this transformer block is repeated 36 times.
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
Run a forward pass:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
# Output
# Input batch:
# tensor([[6109, 3626, 6100, 345],
# [6109, 1110, 6622, 257]])
# Output shape: torch.Size([2, 4, 50257])
# tensor([[[ 0.3613, 0.4222, -0.0711, ..., 0.3483, 0.4661, -0.2838],
# [-0.1792, -0.5660, -0.9485, ..., 0.0477, 0.5181, -0.3168],
# [ 0.7120, 0.0332, 0.1085, ..., 0.1018, -0.4327, -0.2553],
# [-1.0076, 0.3418, -0.1190, ..., 0.7195, 0.4023, 0.0532]],
# [[-0.2564, 0.0900, 0.0335, ..., 0.2659, 0.4454, -0.6806],
# [ 0.1230, 0.3653, -0.2074, ..., 0.7705, 0.2710, 0.2246],
# [ 1.0558, 1.0318, -0.2800, ..., 0.6936, 0.3205, -0.3178],
# [-0.1565, 0.3926, 0.3288, ..., 1.2630, -0.1858, 0.0388]]],
# grad_fn=<UnsafeViewBackward0>)
Check the total number of parameters:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Output
Total number of parameters: 163,009,536
# Question
Why is this not the 124M reported in the GPT-2 paper?
Note
In the original GPT-2 paper, the researchers applied weight tying, which means that they reused the token embedding layer (tok_emb) as the output layer, i.e., they set self.out_head.weight = self.tok_emb.weight. The token embedding and output layers are very large due to the 50,257 rows for the tokenizer's vocabulary. Weight tying reduces the overall memory footprint and computational complexity of the model. See WeightTying for details.
Subtract the output-layer parameters that GPT-2 reuses via weight tying:
total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")
# Output
Number of trainable parameters considering weight tying: 124,412,160
Memory footprint:
# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
total_size_bytes = total_params * 4
# Convert to megabytes
total_size_mb = total_size_bytes / (1024 * 1024)
print(f"Total size of the model: {total_size_mb:.2f} MB")
# Output
Total size of the model: 621.83 MB
Exercise (see the config-selection sketch after this list):
- **GPT2-small** (the 124M configuration we already implemented):
- "emb_dim" = 768
- "n_layers" = 12
- "n_heads" = 12
- **GPT2-medium:**
- "emb_dim" = 1024
- "n_layers" = 24
- "n_heads" = 16
- **GPT2-large:**
- "emb_dim" = 1280
- "n_layers" = 36
- "n_heads" = 20
- **GPT2-XL:**
- "emb_dim" = 1600
- "n_layers" = 48
- "n_heads" = 25
4.7 Generating text¶
def generate_text_simple(model, idx, max_new_tokens, context_size):
# idx is (batch, n_tokens) array of indices in the current context
for _ in range(max_new_tokens):
# Crop current context if it exceeds the supported context size
# E.g., if LLM supports only 5 tokens, and the context size is 10
# then only the last 5 tokens are used as context
idx_cond = idx[:, -context_size:]
# Get the predictions
with torch.no_grad():
logits = model(idx_cond)
# Focus only on the last time step
# (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
logits = logits[:, -1, :]
# Apply softmax to get probabilities
probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)
# Get the idx of the vocab entry with the highest probability value
idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)
# Append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)
return idx
Note
We don't actually need the torch.softmax call here; applying torch.argmax directly to the logits yields the same result, since softmax is monotonic and does not change which entry is the largest. We coded the conversion to illustrate the full process of transforming logits to probabilities, which can add additional intuition: the model generates the most likely next token, which is known as greedy decoding. In the next chapter, when we implement the GPT training code, we will also introduce additional sampling techniques where we modify the softmax outputs so that the model doesn't always select the most likely token, which introduces variability and creativity in the generated text.
Prepare the input data:
start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)
# Output
# encoded: [15496, 11, 314, 716]
# encoded_tensor.shape: torch.Size([1, 4])
Put the model into .eval() mode, which disables random components like dropout that are only used during training:
model.eval() # disable dropout
out = generate_text_simple(
model=model,
idx=encoded_tensor,
max_new_tokens=6,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))
# Output
# Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])
# Output length: 10
Remove batch dimension and convert back into text:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
# Output (from the untrained model)
# Hello, I am Featureiman Byeswickattribute argue
4.8 Summary¶
Layer normalization stabilizes training by ensuring that each layer’s outputs have a consistent mean and variance.
Shortcut connections are connections that skip one or more layers by feeding the output of one layer directly to a deeper layer, which helps mitigate the vanishing gradient problem when training deep neural networks, such as LLMs.
Transformer blocks are a core structural component of GPT models, combining masked multi-head attention modules with fully connected feed-forward networks that use the GELU activation function.
GPT models are LLMs with many repeated transformer blocks that have millions to billions of parameters.
GPT models come in various sizes, for example, 124 million and 1,542 million parameters, which we can implement with the same GPTModel Python class.
The text generation capability of a GPT-like LLM involves decoding output tensors into human-readable text by sequentially predicting one token at a time based on a given input context.
Without training, a GPT model generates incoherent text, which underscores the importance of model training for coherent text generation, which is the topic of subsequent chapters.
import torch
import torch.nn as nn
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return 0.5 * x * (1 + torch.tanh(
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
(x + 0.044715 * torch.pow(x, 3))
))
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
self.dropout = nn.Dropout(dropout)
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
def forward(self, x):
b, num_tokens, d_in = x.shape
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
queries = self.W_query(x)
values = self.W_value(x)
# We implicitly split the matrix by adding a `num_heads` dimension
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
keys = keys.transpose(1, 2)
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)
# Compute scaled dot-product attention (aka self-attention) with a causal mask
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)
# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional projection
return context_vec
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"])
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
# Shortcut connection for feed forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x)
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
return x
def generate_text_simple(model, idx, max_new_tokens, context_size):
# idx is (batch, n_tokens) array of indices in the current context
for _ in range(max_new_tokens):
# Crop current context if it exceeds the supported context size
# E.g., if LLM supports only 5 tokens, and the context size is 10
# then only the last 5 tokens are used as context
idx_cond = idx[:, -context_size:]
# Get the predictions
with torch.no_grad():
logits = model(idx_cond)
# Focus only on the last time step
# (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
logits = logits[:, -1, :]
# Apply softmax to get probabilities
probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)
# Get the idx of the vocab entry with the highest probability value
idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)
# Append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)
return idx
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval() # disable dropout
out = generate_text_simple(
model=model,
idx=encoded_tensor,
max_new_tokens=6,
context_size=GPT_CONFIG_124M["context_length"]
)
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
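As a quick check (a small sketch), we can count the model's trainable parameters. The total comes out above 124M because, unlike the original GPT-2, this implementation does not tie the output head's weights to the token embedding:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Subtracting the output head shows where the "124M" name comes from (GPT-2 reuses the token embedding as the output layer)
total_params_tied = total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"Number of parameters assuming weight tying: {total_params_tied:,}")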
5. Pretraining on Unlabeled Data¶
Computing the training and validation set losses to assess the quality of LLM-generated text during training
Implementing a training function and pretraining the LLM
Saving and loading model weights to continue training an LLM
Loading pretrained weights from OpenAI
5.1 Evaluating generative text models¶
5.1.1 Using GPT to generate text¶
Utility functions for text to token ID conversion
import tiktoken
from previous_chapters import generate_text_simple
def text_to_token_ids(text, tokenizer):
encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
return encoded_tensor
def token_ids_to_text(token_ids, tokenizer):
flat = token_ids.squeeze(0) # remove batch dimension
return tokenizer.decode(flat.tolist())
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(start_context, tokenizer),
max_new_tokens=10,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
# Output
# Output text:
# Every effort moves you rentingetic wasnم refres RexMeCHicular stren
5.1.2 Calculating the text generation loss¶
First, a quick review of how the data is loaded (chapter 2) and how text is generated via the generate_text_simple function (chapter 4).
Inputs and expected targets:
inputs = torch.tensor([[16833, 3626, 6100], # ["every effort moves",
[40, 1107, 588]]) # "I really like"]
targets = torch.tensor([[3626, 6100, 345 ], # [" effort moves you",
[1107, 588, 11311]]) # " really like chocolate"]
We feed the inputs into the model to calculate logit vectors for the two input examples, each comprising three tokens, and apply the softmax function to transform these logit values into probability scores:
with torch.no_grad():
logits = model(inputs)
probas = torch.softmax(logits, dim=-1) # Probability of each token in vocabulary
print(probas.shape) # Shape: (batch_size, num_tokens, vocab_size)
# Output
# torch.Size([2, 3, 50257])
We can obtain the corresponding token IDs by applying the argmax function to the probability scores:
# This gives the model's final output as token IDs (the model is still untrained, so the results are of course nonsense)
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)
# Output
# Token IDs:
# tensor([[[36397],
# [39619],
# [20610]],
# [[ 8615],
# [49289],
# [47105]]])
Part of the text evaluation process is to measure “how far” the generated tokens are from the correct predictions (targets).
The training function will use this information to adjust the model weights to generate text that is more similar to (or ideally matches) the target text.
The token probabilities corresponding to the target indices are as follows:
# Inspect the probabilities currently assigned to the target tokens
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)
text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
# Output
# Text 1: tensor([2.3466e-05, 2.0531e-05, 1.1733e-05])
# Text 2: tensor([4.2794e-05, 1.6248e-05, 1.1586e-05])
Note
The goal of training an LLM is to maximize these values, aiming to get them as close to a probability of 1 as possible. This way, we ensure the LLM consistently picks the target token (essentially the next word in the sentence) as the next token it generates.
Backpropagation¶
# Compute logarithm of all token probabilities
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)
# tensor([ -9.5042, -10.3796, -11.3677, -11.4798, -9.7764, -12.2561])
# Calculate the average log probability across all tokens
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
# tensor(-10.7940)
# In deep learning, the common practice is to work with the negative average log probability and bring it as close to 0 as possible.
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
# tensor(10.7940)
Note
The term for this negative value, -10.7940 turning into 10.7940, is known as the cross entropy loss in deep learning.
Cross entropy loss¶
the shape of the logits and target tensors:
# Logits have shape (batch_size, num_tokens, vocab_size)
print("Logits shape:", logits.shape)
# Targets have shape (batch_size, num_tokens)
print("Targets shape:", targets.shape)
For the cross_entropy function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)
# Output
# Flattened logits: torch.Size([6, 50257])
# Flattened targets: torch.Size([6])
PyTorch’s cross_entropy function:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
Note
See the definition of _cross_entropy.
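As a sanity check (a minimal sketch using the tensors from above), the built-in result matches the negative average log probability we computed manually:
print(torch.allclose(loss, neg_avg_log_probas))  # True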
Perplexity¶
Perplexity is a measure often used alongside cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.
Note
See the definition of Perplexity.
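Concretely, perplexity is the exponential of the cross entropy loss; a one-line example using the loss computed above:
perplexity = torch.exp(loss)
print(perplexity)  # exp(10.7940) ≈ 48726: the model is as uncertain as a uniform choice among roughly 48,700 vocabulary entries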
5.1.3 Calculating the training and validation set losses¶
The cost of pretraining LLMs¶
For visualization purposes, Figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training.
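The train_loader, val_loader, and device used below are not shown in this excerpt; a minimal sketch of how they might be set up, assuming the create_dataloader_v1 function from chapter 2, a raw text_data string, and an illustrative 90/10 train/validation split:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_loader = create_dataloader_v1(
    text_data[:split_idx],
    batch_size=2,
    max_length=256,  # the 256-token training context length mentioned above
    stride=256,
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    text_data[split_idx:],
    batch_size=2,
    max_length=256,
    stride=256,
    drop_last=False,
    shuffle=False,
    num_workers=0
)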
Training with variable lengths¶
def calc_loss_batch(input_batch, target_batch, model, device):
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
logits = model(input_batch)
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
return loss
def calc_loss_loader(data_loader, model, device, num_batches=None):
total_loss = 0.
if len(data_loader) == 0:
return float("nan")
elif num_batches is None:
num_batches = len(data_loader)
else:
# Reduce the number of batches to match the total number of batches in the data loader
# if num_batches exceeds the number of batches in the data loader
num_batches = min(num_batches, len(data_loader))
for i, (input_batch, target_batch) in enumerate(data_loader):
if i < num_batches:
loss = calc_loss_batch(input_batch, target_batch, model, device)
total_loss += loss.item()
else:
break
return total_loss / num_batches
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
train_loss = calc_loss_loader(train_loader, model, device)
val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)
5.2 Training an LLM¶
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
eval_freq, eval_iter, start_context, tokenizer):
# Initialize lists to track losses and tokens seen
train_losses, val_losses, track_tokens_seen = [], [], []
tokens_seen, global_step = 0, -1
# Main training loop
for epoch in range(num_epochs):
model.train() # Set model to training mode
for input_batch, target_batch in train_loader:
optimizer.zero_grad() # Reset loss gradients from previous batch iteration
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward() # Calculate loss gradients
optimizer.step() # Update model weights using loss gradients
tokens_seen += input_batch.numel()
global_step += 1
# Optional evaluation step
if global_step % eval_freq == 0:
train_loss, val_loss = evaluate_model(
model, train_loader, val_loader, device, eval_iter)
train_losses.append(train_loss)
val_losses.append(val_loss)
track_tokens_seen.append(tokens_seen)
print(f"Ep {epoch+1} (Step {global_step:06d}): "
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
# Print a sample text after each epoch
generate_and_print_sample(
model, tokenizer, device, start_context
)
return train_losses, val_losses, track_tokens_seen
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
model.eval()
with torch.no_grad():
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
model.train()
return train_loss, val_loss
def generate_and_print_sample(model, tokenizer, device, start_context):
model.eval()
context_size = model.pos_emb.weight.shape[0]
encoded = text_to_token_ids(start_context, tokenizer).to(device)
with torch.no_grad():
token_ids = generate_text_simple(
model=model, idx=encoded,
max_new_tokens=50, context_size=context_size
)
decoded_text = token_ids_to_text(token_ids, tokenizer)
print(decoded_text.replace("\n", " ")) # Compact print format
model.train()
Adam optimizers are a popular choice for training deep neural networks.
However, in our training loop, we opt for the AdamW optimizer.
AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing larger weights.
This adjustment allows AdamW to achieve more effective regularization and better generalization; it is therefore frequently used in the training of LLMs.
Run:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
start_context="Every effort moves you", tokenizer=tokenizer
)
Output:
Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933
Ep 1 (Step 000005): Train loss 8.111, Val loss 8.339
Every effort moves you,,,,,,,,,,,,.
Ep 2 (Step 000010): Train loss 6.661, Val loss 7.048
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and,, and, and,
...
Ep 9 (Step 000075): Train loss 0.717, Val loss 6.293
Ep 9 (Step 000080): Train loss 0.541, Val loss 6.393
Every effort moves you?" "Yes--quite insensible to the irony. She wanted him vindicated--and by me!" He laughed again, and threw back the window-curtains, I had the donkey. "There were days when I
Ep 10 (Step 000085): Train loss 0.391, Val loss 6.452
Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gis
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
fig, ax1 = plt.subplots(figsize=(5, 3))
# Plot training and validation loss against epochs
ax1.plot(epochs_seen, train_losses, label="Training loss")
ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss")
ax1.legend(loc="upper right")
ax1.xaxis.set_major_locator(MaxNLocator(integer=True)) # only show integer labels on x-axis
# Create a second x-axis for tokens seen
ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis
ax2.plot(tokens_seen, train_losses, alpha=0) # Invisible plot for aligning ticks
ax2.set_xlabel("Tokens seen")
fig.tight_layout() # Adjust layout to make room
plt.savefig("loss-plot.pdf")
plt.show()
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
Looking at the results above, we can see that the model starts out generating incomprehensible strings of words, whereas towards the end it is able to produce grammatically more or less correct sentences.
However, based on the training and validation set losses, we can see that the model starts overfitting (the final recorded training loss is 0.391, while the validation loss stays around 6.4).
If we were to check a few passages it writes towards the end, we would find that they are contained verbatim in the training set – the model simply memorizes the training data.
Later, we will cover decoding strategies that can mitigate this memorization to a certain degree.
Note that the overfitting here occurs because we have a very small training set and iterate over it many times.
5.3 Decoding strategies to control randomness¶
This section reimplements generate_text_simple() from chapter 4.7 and covers two techniques, temperature scaling and top-k sampling, to improve this function. The version implemented earlier uses greedy decoding: each call always selects the token with the highest probability, so the same prompt always produces the same output, which makes the results monotonous for generative tasks.
5.3.1 Temperature scaling¶
This section introduces temperature scaling, a technique that adds a probabilistic selection process to the next-token generation task. Previously, inside the generate_text_simple function, we always sampled the token with the highest probability as the next token using torch.argmax, also known as greedy decoding. To generate text with more variety, we can replace the argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).
small vocabulary for illustration purposes:
vocab = {
"closer": 0,
"every": 1,
"effort": 2,
"forward": 3,
"inches": 4,
"moves": 5,
"pizza": 6,
"toward": 7,
"you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}
Greedy decoding, as discussed in the previous section:
next_token_logits = torch.tensor(
[4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])
# "forward"
# Explanation:
# Inspecting next_token_logits shows that the largest logit is 6.75,
# which also yields the largest softmax probability.
# Its index is 3, so the next token is: "forward"
The corresponding probabilistic sampling process:
# replace the argmax with the multinomial function in PyTorch
torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])
# Note: every token has a chance of being selected, no matter how low its probability
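To see how often each token would be chosen under this sampling scheme, we can repeat the sampling many times and count the outcomes (an illustrative sketch; print_sampled_tokens is a helper name introduced here):
def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item() for _ in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample), minlength=len(probas))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")
print_sampled_tokens(probas)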
What temperature scaling does:
# Temperatures greater than 1 result in more uniformly distributed token probabilities
# Temperatures smaller than 1 will result in more confident (sharper or more peaky) distributions.
def softmax_with_temperature(logits, temperature):
scaled_logits = logits / temperature
return torch.softmax(scaled_logits, dim=0)
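For instance, comparing the distribution at a few temperature values (a small sketch using the next_token_logits from above; the temperature values are illustrative):
temperatures = [1, 0.1, 5]  # original, lower, and higher temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]
for T, p in zip(temperatures, scaled_probas):
    print(f"Temperature {T}: max probability {p.max():.4f}")
# Lower temperatures concentrate probability mass on the top token;
# higher temperatures flatten the distribution toward uniform.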
5.3.2 Top-k sampling¶
The approach from the previous section allows for exploring less likely but potentially more interesting and creative paths in the generation process. However, one downside is that it sometimes leads to grammatically incorrect or completely nonsensical output (a token with a very low probability can still be sampled, and such a next token is usually a poor fit, producing meaningless text).
Note
In top-k sampling, we can restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
new_logits = torch.where(
condition=next_token_logits < top_logits[-1],
input=torch.tensor(float("-inf")),
other=next_token_logits
)
topk_probas = torch.softmax(new_logits, dim=0)
# A more efficient implementation
new_logits = torch.full_like(  # create tensor containing -inf values
    next_token_logits, -torch.inf
)
new_logits[top_pos] = next_token_logits[top_pos]  # copy the top-k values into the -inf tensor
topk_probas = torch.softmax(new_logits, dim=0)
5.3.3 Modifying the text generation function¶
This section merges the temperature scaling and top-k sampling techniques from the previous two sections into the generation function.
def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
# For-loop is the same as before: Get logits, and only focus on last time step
for _ in range(max_new_tokens):
idx_cond = idx[:, -context_size:]
with torch.no_grad():
logits = model(idx_cond)
logits = logits[:, -1, :]
# New: Filter logits with top_k sampling
if top_k is not None:
# Keep only top_k values
top_logits, _ = torch.topk(logits, top_k)
min_val = top_logits[:, -1]
logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)
# New: Apply temperature scaling
if temperature > 0.0:
logits = logits / temperature
# Apply softmax to get probabilities
probs = torch.softmax(logits, dim=-1) # (batch_size, vocab_size)
# Sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1) # (batch_size, 1)
# Otherwise same as before: get idx of the vocab entry with the highest logits value
else:
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch_size, 1)
if idx_next == eos_id: # Stop generating early if end-of-sequence token is encountered and eos_id is specified
break
# Same as before: append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (batch_size, num_tokens+1)
return idx
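A usage sketch (the top_k and temperature values here are illustrative):
torch.manual_seed(123)
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=15,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))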
5.4 Loading and saving model weights in PyTorch¶
It’s common to train LLMs with adaptive optimizers like Adam or AdamW instead of regular SGD
These adaptive optimizers store additional parameters for each model weight, so it makes sense to save them as well in case we plan to continue the pretraining later:
torch.save({ "model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict(), }, "model_and_optimizer.pth" )
loading:
checkpoint = torch.load("model_and_optimizer.pth", weights_only=True)
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0005, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();
5.5 Loading pretrained weights from OpenAI¶
This section covers how to load the pretrained GPT-2 weights released by OpenAI into our model.
Because the parameter names and structure in the GPT-2 checkpoint differ from those in our GPTModel implementation, the weights need to be mapped and copied over with dedicated loading code.
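The full mapping code is not reproduced here; a minimal sketch of the kind of shape-checked copy it relies on (the assign helper and the params dictionary of pretrained arrays are assumptions following the book's convention):
def assign(left, right):
    # Refuse silent shape mismatches when copying pretrained values into an existing parameter
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))
# Example usage (params is assumed to hold the pretrained GPT-2 arrays):
# gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])
# gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])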
5.6 Summary¶
When LLMs generate text, they output one token at a time.
By default, the next token is generated by converting the model outputs into probability scores and selecting the token from the vocabulary that corresponds to the highest probability score, which is known as “greedy decoding.”
Using probabilistic sampling and temperature scaling, we can influence the diversity and coherence of the generated text.
Training and validation set losses can be used to gauge the quality of the text generated by the LLM during training.
Pretraining an LLM involves changing its weights to minimize the training loss.
The training loop for LLMs itself is a standard procedure in deep learning, using a conventional cross entropy loss and the AdamW optimizer.
Pretraining an LLM on a large text corpus is time- and resource-intensive, so we can load openly available weights from OpenAI as an alternative to pretraining the model on a large dataset ourselves.
Appendix A. Introduction to PyTorch¶
A.1 What is PyTorch¶
Firstly, PyTorch is a tensor library that extends the concept of the array-oriented programming library NumPy with the additional feature of accelerated computation on GPUs, thus providing a seamless switch between CPUs and GPUs.
Secondly, PyTorch is an automatic differentiation engine, also known as autograd, which enables the automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization.
Finally, PyTorch is a deep learning library, meaning that it offers modular, flexible, and efficient building blocks (including pre-trained models, loss functions, and optimizers) for designing and training a wide range of deep learning models, catering to both researchers and developers.
A.2 Understanding tensors¶
PyTorch’s has a NumPy-like API
A.3 Seeing models as computation graphs¶
PyTorch's automatic differentiation engine, also known as autograd, provides functions to compute gradients in dynamic computational graphs automatically.
A computation graph is a directed graph that allows us to express and visualize mathematical expressions.
In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network – we will need this later to compute the required gradients for backpropagation, which is the main training algorithm for neural networks.
In fact, PyTorch builds such a computation graph in the background, and we can use this to calculate gradients of a loss function with respect to the model parameters (here w1 and b) to train the model
A.4 Automatic differentiation made easy¶
Gradients are required when training neural networks via the popular backpropagation algorithm, which can be thought of as an implementation of the chain rule from calculus for neural networks
Partial derivatives and gradients¶
A partial derivative measures the rate at which a function changes with respect to one of its variables.
A gradient is a vector containing all of the partial derivatives of a multivariate function, a function with more than one variable as input.
This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model’s performance, using a method such as gradient descent.
Listing A.3 Computing gradients via autograd
import torch.nn.functional as F
from torch.autograd import grad
y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)
grad_L_w1 = grad(loss, w1, retain_graph=True) #A
grad_L_b = grad(loss, b, retain_graph=True)
PyTorch provides even more high-level tools to automate this process:
loss.backward()
print(w1.grad)
print(b.grad)
A.5 Implementing multilayer neural networks¶
Listing A.4 A multilayer perceptron with two hidden layers
class NeuralNetwork(torch.nn.Module):
def __init__(self, num_inputs, num_outputs): #A
super().__init__()
self.layers = torch.nn.Sequential(
# 1st hidden layer
torch.nn.Linear(num_inputs, 30), #B
torch.nn.ReLU(), #C
# 2nd hidden layer
torch.nn.Linear(30, 20), #D
torch.nn.ReLU(),
# output layer
torch.nn.Linear(20, num_outputs),
)
def forward(self, x):
logits = self.layers(x)
return logits #E
Instantiate a new neural network object and print its structure:
>>> model = NeuralNetwork(50, 3)
>>> print(model)
NeuralNetwork(
(layers): Sequential(
(0): Linear(in_features=50, out_features=30, bias=True)
(1): ReLU()
(2): Linear(in_features=30, out_features=20, bias=True)
(3): ReLU()
(4): Linear(in_features=20, out_features=3, bias=True)
)
)
check the total number of trainable parameters:
>>> sum(p.numel() for p in model.parameters() if p.requires_grad)
2213
Manual calculation:
First hidden layer: 50 inputs × 30 hidden units, plus 30 bias units:
50*30+30 = 1530
Second hidden layer: 30 input units × 20 nodes, plus 20 bias units:
30*20+20 = 620
Output layer: 20 input nodes × 3 output nodes, plus 3 bias units:
20*3+3 = 63
Total: 1530 + 620 + 63 = 2213
A linear layer multiplies the inputs with a weight matrix and adds a bias vector ( \(wx+b\) ). This is sometimes also referred to as a feedforward or fully connected layer.
# weight parameter matrix
>>> print(model.layers[0].weight)
# bias parameter vector
>>> print(model.layers[0].bias)
we can make the random number initialization reproducible by seeding PyTorch’s random number generator:
torch.manual_seed(123)
model = NeuralNetwork(50, 3)
print(model.layers[0].weight)
When using a model for inference rather than training, it is best practice to use the torch.no_grad() context manager:
# This tells PyTorch that it doesn't need to keep track of the gradients,
# which can result in significant savings in memory and computation.
with torch.no_grad():
out = model(X)
print(out)
In PyTorch, it’s common practice to code models such that they return the outputs of the last layer (logits) without passing them to a nonlinear activation function.
That’s because PyTorch’s commonly used loss functions combine the softmax (or sigmoid for binary classification) operation with the negative log-likelihood loss in a single class.
The reason for this is numerical efficiency and stability.
So, if we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)
A.6 Setting up efficient data loaders¶
train_loader = DataLoader(
dataset=train_ds,
batch_size=2,
shuffle=True,
num_workers=0, # number of background worker processes; values > 0 parallelize data loading and preprocessing
drop_last=True # drop the last incomplete batch in each epoch (its size may be smaller than batch_size)
)
Warning
For Jupyter notebooks, setting num_workers to greater than 0 can sometimes lead to issues related to the sharing of resources between different processes, resulting in errors or notebook crashes.
Note
As a rule of thumb, num_workers=4 usually yields the best performance on many real-world datasets, but the optimal setting depends on your hardware and on the code used to load the training examples defined in the Dataset class.
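The train_ds object above is assumed to be an instance of a custom Dataset; a minimal sketch of what such a class might look like (ToyDataset, X_train, and y_train are illustrative names):
from torch.utils.data import Dataset, DataLoader
class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y
    def __getitem__(self, index):
        # return a single (feature, label) pair
        return self.features[index], self.labels[index]
    def __len__(self):
        # number of examples in the dataset
        return self.labels.shape[0]
# X_train and y_train are assumed to be tensors of features and class labels:
# train_ds = ToyDataset(X_train, y_train)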
A.7 A typical training loop¶
Listing A.9 Neural network training in PyTorch
import torch.nn.functional as F
torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2) #A
optimizer = torch.optim.SGD(model.parameters(), lr=0.5) #B
num_epochs = 3
for epoch in range(num_epochs):
model.train()
for batch_idx, (features, labels) in enumerate(train_loader):
logits = model(features)
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad() #C
loss.backward() #D
optimizer.step() #E
### LOGGING
print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
f" | Train Loss: {loss:.2f}")
model.eval()
# Optional model evaluation
Listing A.10 A function to compute the prediction accuracy
def compute_accuracy(model, dataloader):
model = model.eval()
correct = 0.0
total_examples = 0
for idx, (features, labels) in enumerate(dataloader):
with torch.no_grad():
logits = model(features)
predictions = torch.argmax(logits, dim=1)
compare = labels == predictions #A
correct += torch.sum(compare) #B
total_examples += len(compare)
return (correct / total_examples).item() #C
A.8 Saving and loading models¶
Saving the model:
torch.save(model.state_dict(), "model.pth")
# model.state_dict() is a Python dictionary that maps each layer of the model to its trainable parameters (weights and biases)
Loading the model:
model = NeuralNetwork(2, 2) # the architecture must exactly match the originally saved model
model.load_state_dict(torch.load("model.pth"))
A.9 Optimizing training performance with GPUs¶
A.9.1 PyTorch computations on GPU devices¶
CPU:
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
print(tensor_1 + tensor_2)
# Output
# tensor([5., 7., 9.])
GPU:
tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")
print(tensor_1 + tensor_2)
# Output
# tensor([5., 7., 9.], device='cuda:0')
A.9.2 Single-GPU training¶
# Nvidia GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Apple Silicon chips
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
Comparing computation speed on the CPU vs. the GPU:
# CPU
a = torch.rand(100, 200)
b = torch.rand(200, 300)
%timeit a@b
# GPU
a, b = a.to("cuda"), b.to("cuda")
%timeit a @ b
A.9.3 Training with multiple GPUs¶
Distributed training refers to splitting model training across multiple GPUs and machines.
Note
DDP does not function properly within interactive Python environments like Jupyter notebooks, which don’t handle multiprocessing in the same way a standalone Python script does.
If your machine has four GPUs and you only want to use the first and the third GPU:
CUDA_VISIBLE_DEVICES=0,2 python some_script.py
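A minimal sketch of the key DistributedDataParallel (DDP) setup steps, intended to run as a standalone script (the function names, port number, and the reuse of NeuralNetwork/train_ds from earlier sections are illustrative assumptions):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
def ddp_setup(rank, world_size):
    # rank: index of the current process/GPU; world_size: total number of processes
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
def main(rank, world_size):
    ddp_setup(rank, world_size)
    # DistributedSampler gives each GPU a different, non-overlapping shard of the data
    train_loader = DataLoader(train_ds, batch_size=2, shuffle=False,
                              sampler=DistributedSampler(train_ds))
    model = NeuralNetwork(num_inputs=2, num_outputs=2).to(rank)
    model = DDP(model, device_ids=[rank])  # gradients are synchronized across GPUs
    # ... the usual training loop goes here ...
    dist.destroy_process_group()
# Typically launched with:
# torch.multiprocessing.spawn(main, args=(world_size,), nprocs=world_size)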
A.10 Summary¶
PyTorch is an open-source library that consists of three core components: a tensor library, automatic differentiation functions, and deep learning utilities.
PyTorch's tensor library is similar to array libraries like NumPy.
In the context of PyTorch, tensors are array-like data structures to represent scalars, vectors, matrices, and higher-dimensional arrays.
PyTorch tensors can be executed on the CPU, but one major advantage of PyTorch's tensor format is its GPU support to accelerate computations.
The automatic differentiation (autograd) capabilities in PyTorch allow us to conveniently train neural networks using backpropagation without manually deriving gradients.
The deep learning utilities in PyTorch provide building blocks for creating custom deep neural networks.
PyTorch includes Dataset and DataLoader classes to set up efficient data loading pipelines.
It's easiest to train models on a CPU or a single GPU.
Using DistributedDataParallel is the simplest way in PyTorch to accelerate training if multiple GPUs are available.