Build a Large Language Model (From Scratch)¶

GitHub(相关代码): https://github.com/rasbt/LLMs-from-scratch
【其他参考】The Illustrated Transformer(图的形式讲解transformer，涉及到本书要讲的qkv计算、多head注意力等): https://jalammar.github.io/illustrated-transformer/

1. Understanding LLM¶

章节:

1 What is an LLM?
2 Applications of LLMs
3 Stages of building and using LLMs
4 Using LLMs for different tasks
5 Utilizing large datasets
6 A closer look at the GPT architecture
7 Building a large language model
8 Summary

https://img.zhaoweiguo.com/uPic/2024/11/uwgOhZ.png

AI including:

machine learning
deep learning

rule-based systems
genetic algorithms
expert systems
fuzzy logic
symbolic reasoning

traditional machine learning: human experts might manually extract features from email text such as the frequency of certain trigger words (“prize,” “win,” “free”), the number of exclamation marks, use of all uppercase words, or the presence of suspicious links.
deep learning: does not require manual feature extraction. This means that human experts do not need to identify and select the most relevant features for a deep learning model.

备注

不管是traditional machine learning还是deep learning，都需要人工收集一些数据，哪此邮件是spam，哪些不是spam。但 traditional machine learning 还需要专家进行人工的特征提取，对垃圾邮件事例特征有：win, free单词很多，有可疑链接，使用大写字母……

https://img.zhaoweiguo.com/uPic/2024/11/ntzxz1.png — Figure 1.3¶

The two most popular categories of finetuning LLMs:

1. instruction-finetuning
2. finetuning for classification tasks

https://img.zhaoweiguo.com/uPic/2024/11/Pg4s7f.png — Figure 1.4 A simplified depiction of the original transformer architecture, which is a deep learning model for language translation. The transformer consists of two parts, an encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text that the decoder can use to generate the translated text one word at a time. Note that this figure shows the final stage of the translation process where the decoder has to generate only the final word (“Beispiel”), given the original input text (“This is an example”) and a partially translated sentence (“Das ist ein”), to complete the translation.¶

A key component of transformers and LLMs is the self-attention mechanism, which allows the model to weigh the importance of different words or tokens in a sequence relative to each other.

https://img.zhaoweiguo.com/uPic/2024/11/DMgI9v.png — Figure 1.5 A visual representation of the transformer’s `encoder` and `decoder` submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification. On the right, the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences.¶

https://img.zhaoweiguo.com/uPic/2024/11/jiIOSX.png — Table 1.1 The pretraining dataset of the popular GPT-3 LLM¶

* Wikipedia corpus consists of English-language Wikipedia
* Books1 is likely a sample from Project Gutenberg: https://www.gutenberg.org/
* Books2 is likely from Libgen: https://en.wikipedia.org/wiki/Library_Genesis
* CommonCrawl is a filtered subset of the CommonCrawl database: https://commoncrawl.org/
* WebText2 is the text of web pages from all outbound Reddit links from posts with 3+ upvotes.

The ability to perform tasks that the model wasn’t explicitly trained to perform is called an “emergent behavior(涌现行为).”“涌现行为” 指的是模型在规模增大后自发展现出一些未明确训练的能力（如推理、算术等），这些能力不是直接由个体参数或特定任务驱动的，而是系统在规模和复杂度增加时的自发产物。

https://img.zhaoweiguo.com/uPic/2024/11/VoBxrx.png — Figure 1.9 The stages of building LLMs covered in this book include implementing the LLM architecture and data preparation process, pretraining an LLM to create a foundation model, and finetuning the foundation model to become a personal assistant or text classifier.¶

2. Working with Text Data¶

https://img.zhaoweiguo.com/uPic/2024/11/tfp8db.png — Figure 2.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter will explain and code the data preparation and sampling pipeline that provides the LLM with the text data for pretraining.¶

2.1 Understanding word embeddings¶

https://img.zhaoweiguo.com/uPic/2024/11/enlX0I.png — Figure 2.2 Deep learning models cannot process data formats like video, audio, and text in their raw form. Thus, we use an embedding model to transform this raw data into a dense vector representation that deep learning architectures can easily understand and process. Specifically, this figure illustrates the process of converting raw data into a three-dimensional numerical vector.¶

While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for retrieval- augmented generation.

https://img.zhaoweiguo.com/uPic/2024/11/unvhYf.png — Figure 2.3 If word embeddings are two-dimensional, we can plot them in a two-dimensional scatterplot for visualization purposes as shown here. When using word embedding techniques, such as Word2Vec, words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space compared to countries and cities.¶

Word embeddings can have varying dimensions, from one to thousands. As shown in Figure 2.3, we can choose two-dimensional word embeddings for visualization purposes.

2.2 Tokenizing text¶

https://img.zhaoweiguo.com/uPic/2024/11/Ujvr7k.png — Figure 2.4 A view of the text processing steps covered in this section in the context of an LLM. Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters. In upcoming sections, we will convert the text into token IDs and create token embeddings.¶

2.3 Converting tokens into token IDs¶

https://img.zhaoweiguo.com/uPic/2024/11/OUFhDM.png — Figure 2.6 We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity.¶

https://img.zhaoweiguo.com/uPic/2024/11/AmM7uq.png — Figure 2.7 Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs. The vocabulary is built from the entire training set and can be applied to the training set itself and any new text samples. The depicted vocabulary contains no punctuation or special characters for simplicity.¶

https://img.zhaoweiguo.com/uPic/2024/11/RAYbq0.png — Figure 2.8 Tokenizer implementations share two common methods: an encode method and a decode method. The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary. The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text.¶

2.4 Adding special context tokens¶

https://img.zhaoweiguo.com/uPic/2024/11/FS7Wg5.png — Figure 2.9 We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an `<|unk|>` token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an `<|endoftext|>` token that we can use to separate two unrelated text sources.¶

https://img.zhaoweiguo.com/uPic/2024/11/sm50GQ.png — Figure 2.10 When working with multiple independent text source, we add `<|endoftext|>` tokens between these texts. These `<|endoftext|>` tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM.¶

additional special tokens:

[BOS] (beginning of sequence)
[EOS] (end of sequence)
[PAD] (padding)
|endoftext|
|unk|

2.5 Byte pair encoding¶

https://img.zhaoweiguo.com/uPic/2024/11/OcQqcV.png — Figure 2.11 BPE tokenizers break down unknown words into subwords and individual characters. This way, a BPE tokenizer can parse any word and doesn’t need to replace unknown words with special tokens, such as `<|unk|>`.¶

The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
The BPE tokenizer from OpenAI’s open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance

2.6 Data sampling with a sliding window¶

https://img.zhaoweiguo.com/uPic/2024/11/bCiR3G.png — Figure 2.12 Given a text sample, extract input blocks as subsamples that serve as input to the LLM, and the LLM’s prediction task during training is to predict the next word that follows the input block. During training, we mask out all words that are past the target. Note that the text shown in this figure would undergo tokenization before the LLM can process it; however, this figure omits the tokenization step for clarity.¶

For each text chunk, we want the inputs and targets
Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right

The prediction would look like as follows:(input-target pairs):

and ---->  established
and established ---->  himself
and established himself ---->  in
and established himself in ---->  a

https://img.zhaoweiguo.com/uPic/2024/11/abgbDL.png — Figure 2.13 To implement efficient data loaders, we collect the inputs in a tensor, x, where each row represents one input context. A second tensor, y, contains the corresponding prediction targets (next words), which are created by shifting the input by one position.¶

https://img.zhaoweiguo.com/uPic/2024/11/khbYvv.png — Figure 2.14 When creating multiple batches from the input dataset, we slide an input window across the text. If the stride is set to 1, we shift the input window by 1 position when creating the next batch. If we set the stride equal to the input window size, we can prevent overlaps between the batches.¶

Small batch sizes require less memory during training but lead to more noisy model updates. The batch size is a trade-off and hyperparameter to experiment with when training LLMs.

示例(batch_size=1, max_length=4, stride=1):

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]

示例(batch_size=8, max_length=4, stride=4):

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

2.7 Creating token embeddings¶

we converted the token IDs into a continuous vector representation, the so-called token embeddings.

https://img.zhaoweiguo.com/uPic/2024/11/NGr1lp.png — Figure 2.15 Preparing the input text for an LLM involves tokenizing text, converting text tokens to token IDs, and converting token IDs into vector embedding vectors. In this section, we consider the token IDs created in previous sections to create the token embedding vectors.¶

https://img.zhaoweiguo.com/uPic/2024/11/OrtaX4.png — Figure 2.16 Embedding layers perform a look-up operation, retrieving the embedding vector corresponding to the token ID from the embedding layer’s weight matrix. input text => token ID => embedded token ID¶

2.8 Encoding word positions¶

https://img.zhaoweiguo.com/uPic/2024/11/35sgO7.png — Figure 2.17 The embedding layer converts a token ID into the same vector representation regardless of where it is located in the input sequence. For example, the token ID 2, whether it’s in the first or last position in the token ID input vector, will result in the same embedding vector.¶

In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes.
However, since the self-attention mechanism of LLMs itself is also position-agnostic, it is helpful to inject additional position information into the LLM.

two broad categories of position-aware embeddings:

1. relative positional embeddings
2. absolute positional embeddings

https://img.zhaoweiguo.com/uPic/2024/11/q49SEA.png — Figure 2.18 Positional embeddings are added to the token embedding vector to create the input embeddings for an LLM. The positional vectors have the same dimension as the original token embeddings. The token embeddings are shown with value 1 for simplicity.¶

Instead of focusing on the absolute position of a token, the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of “how far apart” rather than “at which exact position.” The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn’t seen such lengths during training.
OpenAI’s GPT models use absolute positional embeddings that are optimized during the training process rather than being fixed or predefined like the positional encodings in the original Transformer model. This optimization process is part of the model training itself, which we will implement later in this book. For now, let’s create the initial positional embeddings to create the LLM inputs for the upcoming chapters.

https://img.zhaoweiguo.com/uPic/2024/11/zMvBpf.png — Figure 2.19 As part of the input processing pipeline, input text is first broken up into individual tokens. These tokens are then converted into token IDs using a vocabulary. The token IDs are converted into embedding vectors to which positional embeddings of a similar size are added, resulting in input embeddings that are used as input for the main LLM layers.¶

context_length is a variable that represents the supported input size of the LLM.

2.9 Summary¶

LLMs require textual data to be converted into numerical vectors, known as embeddings since they can’t process raw text. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.
As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.
Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance the model’s understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.
The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.
We use a sliding window approach on tokenized data to generate input-target pairs for LLM training.
Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.
While token embeddings provide consistent vector representations for each token, they lack a sense of the token’s position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI’s GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training.

3. Coding Attention Mechanisms¶

Exploring the reasons for using attention mechanisms in neural networks
Introducing a basic self-attention framework and progressing to an enhanced self-attention mechanism
Implementing a causal attention module that allows LLMs to generate one token at a time
Masking randomly selected attention weights with dropout to reduce overfitting
Stacking multiple causal attention modules into a multi-head attention module

https://img.zhaoweiguo.com/uPic/2024/11/Z0VzmK.png — Figure 3.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on attention mechanisms, which are an integral part of an LLM architecture.¶

https://img.zhaoweiguo.com/uPic/2024/11/XwmbT3.png — Figure 3.2 The figure depicts different attention mechanisms we will code in this chapter, starting with a simplified version of self-attention before adding the trainable weights. The causal attention mechanism adds a mask to self-attention that allows the LLM to generate one word at a time. Finally, multi-head attention organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel.¶

3.1 The problem with modeling long sequences¶

https://img.zhaoweiguo.com/uPic/2024/11/fHcmYY.png — Figure 3.3 When translating text from one language to another, such as German to English, it’s not possible to merely translate word by word. Instead, the translation process requires contextual understanding and grammar alignment.¶

To address the issue that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder. The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text. e.g. encoder-decoder RNNs

https://img.zhaoweiguo.com/uPic/2024/11/9nYx0o.png — Figure 3.4 Before the advent of transformer models, `encoder-decoder RNNs` were a popular choice for machine translation. The encoder takes a sequence of tokens from the source language as input, where a hidden state (an intermediate neural network layer) of the encoder encodes a compressed representation of the entire input sequence. Then, the decoder uses its current hidden state to begin the translation, token by token.¶

The big issue and limitation of encoder-decoder RNNs is that the RNN can’t directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances.

3.2 Capturing data dependencies with attention mechanisms¶

https://img.zhaoweiguo.com/uPic/2024/11/yidBLJ.png — Figure 3.5 Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by the so-called `attention weights`, which we will compute later. Note that this figure shows the general idea behind attention and does not depict(描述) the exact implementation of the `Bahdanau mechanism`, which is an RNN method outside this book’s scope.¶

Self-attention is a mechanism that allows each position in the input sequence to attend to(注意) all positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary(当代) LLMs based on the transformer architecture, such as the GPT series.

https://img.zhaoweiguo.com/uPic/2024/11/KTZwGT.png — Figure 3.6 Self-attention is a mechanism in transformers that is used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence. In this chapter, we will code this self-attention mechanism from the ground up before we code the remaining parts of the GPT-like LLM in the following chapter.¶

3.3 Attending to different parts of the input with self-attention¶

The “self” in self-attention
In self-attention, the “self” refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image.
This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence, such as the example depicted in Figure 3.5.

3.3.1 A simple self-attention mechanism without trainable weights¶

https://img.zhaoweiguo.com/uPic/2024/11/NuE983.png — Figure 3.7 The goal of self-attention is to compute a context vector, for each input element, that combines information from all other input elements. In the example depicted in this figure, we compute the context vector \(z^{(2)}\) . The importance or contribution of each input element for computing \(z^{(2)}\) is determined by the attention weights \(α_{21}\) to \(α_{2T}\) . When computing \(z^{(2)}\) , the attention weights are calculated with respect to input element \(x^{(2)}\) and all other inputs. The exact computation of these attention weights is discussed later in this section.¶

For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as \(x^{(1)}\), corresponds to a d-dimensional embedding vector representing a specific token, like “Your.” These input vectors are shown as 3-dimensional embeddings.
Let’s focus on the embedding vector of the second input element, \(x^{(2)}\) (which corresponds to the token “journey”), and the corresponding context vector, \(z^{(2)}\). This enhanced context vector, \(z^{(2)}\), is an embedding that contains information about \(x^{(2)}\) and all other input elements \(x^{(1)}\) to \(x^{(T)}\).
In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence. This is essential in LLMs, which need to understand the relationship and relevance of words in a sentence to each other. Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token.

https://img.zhaoweiguo.com/uPic/2024/11/1ArQGZ.png — Figure 3.8 The overall goal of this section is to illustrate the computation of the context vector \(z^{(2)}\) using the second input sequence, \(x^{(2)}\) as a query. This figure shows the first intermediate step, computing the attention scores `ω` between the query \(x^{(2)}\) and all other input elements as a dot product.¶

import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
    [0.55, 0.87, 0.66], # journey  (x^2)
    [0.57, 0.85, 0.64], # starts    (x^3)
    [0.22, 0.58, 0.33], # with      (x^4)
    [0.77, 0.25, 0.10], # one       (x^5)
    [0.05, 0.80, 0.55]] # step      (x^6)
)


query = inputs[1]  #A
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
# 输出
# tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

除了将点积运算视为组合两个向量以产生标量值的数学工具之外，点积还是相似性的度量，因为它量化了两个向量的对齐程度：更高的点积表示更大程度的对齐或相似性向量之间。在自注意力机制的背景下，点积决定了序列中元素相互关注的程度：点积越高，两个元素之间的相似度和注意力分数就越高。

https://img.zhaoweiguo.com/uPic/2024/11/Gzln85.png — Figure 3.9 After computing the attention scores \(ω_{21}\) to \(ω_{2T}\) with respect to the input query \(x^{(2)}\), the next step is to obtain the attention weights \(α_{21}\) to \(α_{2T}\) by normalizing the attention scores.¶

标准化背后的主要目标是获得总和为 1 的注意力权重。

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())
# 输出
# Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
# Sum: tensor(1.0000)

In practice, it’s more common and advisable to use the softmax function for normalization.
In addition, the softmax function ensures that the attention weights are always positive.

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)
attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())
# 输出
# Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
# Sum: tensor(1.)

Note that this naive softmax implementation (softmax_naive) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values.
PyTorch implementation of softmax

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
# 输出
# Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
# Sum: tensor(1.)

https://img.zhaoweiguo.com/uPic/2024/11/iVEcj6.png — Figure 3.10 The final step, after calculating and normalizing the attention scores to obtain the attention weights for query \(x^{(2)}\), is to compute the context vector \(z^{(2)}\). This context vector is a combination of all input vectors \(x^{(1)}\) to \(x^{(T)}\) weighted by the attention weights.¶

query = inputs[1] # 2nd input token is the query
context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i
print(context_vec_2)
# 输出
# tensor([0.4419, 0.6515, 0.5683])

3.3.2 Computing attention weights for all input tokens¶

https://img.zhaoweiguo.com/uPic/2024/11/sZt4xS.png — Figure 3.11 The highlighted row shows the attention weights for the second input element as a query, as we computed in the previous section. This section generalizes the computation to obtain all other attention weights.(这是上面用softmax计算出的 attention weights \(α_{21}\) to \(α_{2T}\) )。用同样的方法，可以计算出其他的 attention weights¶

https://img.zhaoweiguo.com/uPic/2024/11/Itah8v.png — Figure 3.12 计算的3步骤¶

import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
    [0.55, 0.87, 0.66], # journey  (x^2)
    [0.57, 0.85, 0.64], # starts    (x^3)
    [0.22, 0.58, 0.33], # with      (x^4)
    [0.77, 0.25, 0.10], # one       (x^5)
    [0.05, 0.80, 0.55]] # step      (x^6)
)

attn_scores = inputs @ inputs.T
print(attn_scores)
# 输出
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
    [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
    [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
    [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
    [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
    [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

attn_weights = torch.softmax(attn_scores, dim=1)
print(attn_weights)
# 输出
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
    [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
    [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
    [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
    [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
    [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

all_context_vecs = attn_weights @ inputs
print(all_context_vecs)
# 输出
tensor([[0.4421, 0.5931, 0.5790],
    [0.4419, 0.6515, 0.5683],
    [0.4431, 0.6496, 0.5671],
    [0.4304, 0.6298, 0.5510],
    [0.4671, 0.5910, 0.5266],
    [0.4177, 0.6503, 0.5645]])

3.4 Implementing self-attention with trainable weights¶

Self-attention mechanism is also called scaled dot-product attention.

https://img.zhaoweiguo.com/uPic/2024/11/PXVtSP.png — Figure 3.13 A mental model illustrating how the self-attention mechanism we code in this section fits into the broader context of this book and chapter. In the previous section, we coded a simplified attention mechanism to understand the basic mechanism behind attention mechanisms. In this section, we add trainable weights to this attention mechanism. In the upcoming sections, we will then extend this self-attention mechanism by adding a causal mask and multiple heads.¶

本节主要工作是在前面的基础上增加 训练时更新权重 的功能，以实现在训练时学习以得到更好的权重。

3.4.1 Computing the attention weights step by step¶

three trainable weight matrices \(W_q\) , \(W_k\) , and \(W_v\) .
These three matrices are used to project the embedded input tokens, \(x^{(i)}\), into query, key, and value vectors

https://img.zhaoweiguo.com/uPic/2024/11/5i96Wp.png — Figure 3.14 In the first step of the self-attention mechanism with trainable weight matrices, we compute query (q), key (k), and value (v) vectors for input elements x. Similar to previous sections, we designate the second input, \(x^{(2)}\), as the query input. The query vector \(q^{(2)}\) is obtained via matrix multiplication between the input \(x^{(2)}\) and the weight matrix Wq. Similarly, we obtain the key and value vectors(即: \(k^{(T)}\) and \(v^{(T)}\) ) via matrix multiplication involving the weight matrices Wk and Wv.¶

Note that in GPT-like models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input (d_in=3) and output (d_out=2) dimensions here.

import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
    [0.55, 0.87, 0.66], # journey  (x^2)
    [0.57, 0.85, 0.64], # starts    (x^3)
    [0.22, 0.58, 0.33], # with      (x^4)
    [0.77, 0.25, 0.10], # one       (x^5)
    [0.05, 0.80, 0.55]] # step      (x^6)
)

x_2 = inputs[1] #A
d_in = inputs.shape[1] #B  = 3
d_out = 2 #C

# Initialize the three weight matrices Wq, Wk, and Wv
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
# 示例结构: shape=(3,2)
# tensor([[0.3821, 0.6605],
#         [0.8536, 0.5932],
#         [0.6367, 0.9826]])
# 说明
# setting requires_grad=False to reduce clutter in the outputs for illustration purposes
# 正式使用时需要设置 requires_grad=True

# compute the query, key, and value vectors
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)
# 输出
# tensor([0.4306, 1.4551])

【注意】不要将权重参数(weight parameters)与注意力权重(attention weights)混淆。
权重参数(weight parameters): 指的是神经网络在训练过程中被调优的权重，有时会缩写为weight。(“weight” is short for “weight parameters,” the values of a neural network that are optimized during training. )
注意力权重(attention weights): 决定了上下文向量依赖输入不同部分的程度，即网络关注输入不同部分的程度。(attention weights determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input.)
总之，权重参数是定义网络连接的基本学习系数，而注意力权重是动态的、特定于上下文的值。

# 虽然这儿只计算context vector`Z^(2)`，但仍然需要所有输入元素的键和值向量
# obtain all keys and values via matrix multiplication
keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
# 输出
# keys.shape: torch.Size([6, 2])

https://img.zhaoweiguo.com/uPic/2024/11/xpgGR5.png — Figure 3.15 The attention score computation is a dot-product computation similar to what we have used in the simplified self-attention mechanism in section 3.3. The new aspect here is that we are not directly computing the dot-product between the input elements but using the query and key obtained by transforming the inputs via the respective weight matrices.¶

compute the attention score \(ω_{22}\)

keys_2 = keys[1] #A
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)
# tensor(1.8524)

同样的道理计算出所有的 \(ω_2\) (即 \(ω_{21}\) 到 \(ω_{2T}\) )

attn_scores_2 = query_2 @ keys.T    # All attention scores for given query
print(attn_scores_2)
# 输出
# tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])

https://img.zhaoweiguo.com/uPic/2024/11/ZOhueo.png — Figure 3.16 After computing the attention scores ω, the next step is to normalize these scores using the softmax function to obtain the attention weights α.¶

d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)
# 输出
# tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])

通过将注意力分数除以嵌入的平方根(square root, 即d_k**0.5)来缩放注意力分数键的维度
when scaling up the embedding dimension, which is typically greater than thousand(比较gpt4等), large dot products can result in very small gradients during backpropagation due to the softmax function applied to them.
As dot products increase, the softmax function behaves more like a step function, resulting in gradients nearing zero.
These small gradients can drastically slow down learning or cause training to stagnate.
The scaling by the square root of the embedding dimension is the reason why this self-attention mechanism is also called scaled-dot product attention .
【关键点】为什么在自注意力机制中会用嵌入维度的平方根来缩放点积
缩放的目的：用嵌入维度的平方根来缩放，是为了避免训练过程中出现过小的梯度。若不做缩放，训练时可能会遇到梯度非常小的情况，导致模型学习变慢，甚至陷入停滞。
出现梯度变小的原因：1、当嵌入维度（即向量的维度）增加时，两个向量的点积值会变大。在GPT等大型语言模型（LLM）中，嵌入维度往往很高，可能达到上千，因此点积也变得很大。2、在点积结果上应用 softmax 函数时，如果数值较大，softmax 输出的概率分布会变得很尖锐，近似于阶跃函数。此时，大部分概率集中在几个值上，导致其他部分的梯度几乎为零。这样就会导致模型训练时更新不充分。
缩放的效果：通过用嵌入维度的平方根缩放点积的大小，可以让点积的数值控制在合理范围，使得 softmax 函数的输出更加平滑，从而使得梯度较大，模型可以更有效地学习。这种缩放的自注意力机制因此被称为“缩放点积注意力” (scaled-dot product attention)。

https://img.zhaoweiguo.com/uPic/2024/11/hTqYgH.png — Figure 3.17 In the final step of the self-attention computation, we compute the `context vector` by combining all `value vectors` via the `attention weights`.¶

context_vec_2 = attn_weights_2 @ values
print(context_vec_2)
# 输出
# tensor([0.3061, 0.8210])

备注

【为啥需要query, key, and value】这3个是借用信息检索和数据库领域的概念。接合起来理解这3个概念。

query（查询）：”query” 代表模型当前关注的项(如一个句子中的单词或词元)。模型通过 “query” 来探测其他部分，判断需要关注的程度。
key（键）：输入序列中的每个项（如句子中的每个词）都有一个对应的 “key”，这些 “key” 会和 “query” 进行匹配。通过这种匹配，模型可以找出哪些 “key”（即哪些输入部分）和 “query” 更相关。
value（值）：一旦模型确定了与 “query” 最相关的 “key”，就会取出相应的 “value”。这些 “value” 包含输入项的实际内容或表示，通过提取这些 “value”，模型获得当前 “query” 应关注的具体内容。

备注

上节讲的是根据 关系相近的向量点积更大 与计算相似度的关系，算出每个单词的权重，然后用这个计算出的权重和输入向量点击，得到与位置相关的上下文向量。而本节是用3个可训练参数(Wq, Wk, Wv)，由Wq和Wk来计算出每个单词的权重，而Wv提供了该位置上的特征信息，模型通过注意力机制自动调整从不同位置的 value 中提取多少信息。

注：为啥 Wv 不能直接用 input来表示，因为 Wv 直接决定了提取的内容和最终输出的特征信息。它能够捕捉更复杂的特征，尤其是在上下文依赖和长序列建模中，可以更好地识别出不同位置特征的细微差别。直接使用 input，模型会缺乏这种内容和注意力分布的区分，使得不同位置的特征难以有效加权合成，从而降低注意力机制的表达能力。

3.4.2 Implementing a compact self-attention Python class¶

Listing 3.1 A compact self-attention class

import torch.nn as nn
class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))
    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value

        attn_scores = queries @ keys.T # omega

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))
# 输出
tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)

https://img.zhaoweiguo.com/uPic/2024/11/0d15vS.png — Figure 3.18 In self-attention, we transform the input vectors in the input matrix X with the three weight matrices, Wq, Wk, and Wv. Then, we compute the attention weight matrix based on the resulting queries (Q) and keys (K). Using the attention weights and values (V), we then compute the context vectors (Z). (For visual clarity, we focus on a single input text with n tokens in this figure, not a batch of multiple inputs. Consequently, the 3D input tensor is simplified to a 2D matrix in this context. This approach allows for a more straightforward visualization and understanding of the processes involved.)¶

Using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.
Listing 3.2 A self-attention class using PyTorch’s Linear layers

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
# 输出
tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)

Note that SelfAttention_v1 and SelfAttention_v2 give different outputs because they use different initial weights for the weight matrices since nn.Linear uses a more sophisticated weight initialization scheme.

3.5 Hiding future words with causal attention¶

Causal attention, also known as masked attention, is a specialized form of self-attention. It restricts a model to only consider previous and current inputs in a sequence when processing any given token. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once.

https://img.zhaoweiguo.com/uPic/2024/11/blTF9A.png — Figure 3.19 In causal attention, we mask out the attention weights above the diagonal such that for a given input, the LLM can’t access future tokens when computing the context vectors using the attention weights. For example, for the word “journey” in the second row, we only keep the attention weights for the words before (“Your”) and in the current position (“journey”).¶

3.5.1 Applying a causal attention mask¶

https://img.zhaoweiguo.com/uPic/2024/11/e795Ol.png — Figure 3.20 One way to obtain the masked attention weight matrix in causal attention is to apply the softmax function to the attention scores, zeroing out the elements above the diagonal and normalizing the resulting matrix.¶

【第一步】使用上面的方法得到 attention weights

print(attn_weights)
tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<SoftmaxBackward0>)

【第二步】对象线上面的设置为0

context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
# 输出
tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])

# multiply this mask with the attention weights
masked_simple = attn_weights*mask_simple
print(masked_simple)
# 输出
tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)

【第三步】重新规范化(renormalize) attention weights使每一行值的和为1

row_sums = masked_simple.sum(dim=1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)
# 输出
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)

【注意-信息泄露】当我们应用掩码然后重新规范化注意力权重时，最初来自未来标记（现在已经掩码）的信息仍然可能影响当前标记，因为它们的值是 softmax 计算的一部分。然而，关键的洞察是，当我们在屏蔽后重新规范化注意力权重时，我们本质上所做的是在较小的子集上重新计算 softmax（因为屏蔽位置对 softmax 值没有贡献）。Softmax 的数学优雅之处在于，尽管最初在分母中包含了所有位置，但在屏蔽和重新归一化之后，屏蔽位置的影响被抵消了——它们不会以任何有意义的方式对 SoftMax 得分做出贡献。所以不会有信息泄露。

https://img.zhaoweiguo.com/uPic/2024/11/lqdqs2.png — Figure 3.21 A more efficient way to obtain the masked attention weight matrix in causal attention is to mask the attention scores with negative infinity values before applying the softmax function.¶

这儿可以利用 softmax 函数的数学特性，以更少的步骤更有效地实现屏蔽注意力权重的计算
【关键特征】当负无穷大值 (-∞) 连续存在时，softmax 函数将它们视为零概率（从数学上讲，这是因为 \(e^{-\infty}\) 接近 0）。
这儿把上面的0改为 -∞(即-inf)

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
# 输出
tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)


attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1)
print(attn_weights)
# 输出(已经规范化，不用再额外操作了，节省了操作)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)

3.5.2 Masking additional attention weights with dropout¶

Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively “dropping” them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It’s important to emphasize that dropout is only used during training and is disabled afterward.

备注

In the transformer architecture, including models like GPT, dropout in the attention mechanism is typically applied in two specific areas: after calculating the attention scores or after applying the attention weights to the value vectors.(我还没完全理解具体的位置)

我的理解(GPT给的说法有矛盾，下面是我的理解)
【Dropout 的应用位置】1、after calculating the attention scores：在计算注意力分数后应用Dropout意味着在得到Q（查询）、K（键）和V（值）向量之间的点积之后，但在将这些分数传递给Softmax函数之前，进行Dropout操作。这一步骤中的Dropout可以帮助模型避免对某些特定特征的过度依赖，因为它可能会随机地使一些注意力分数失效，从而促使模型学习到更加健壮的注意力分布。

伪代码:

function scaledDotProductAttention(Q, K, V, dropoutRate):
    # 计算注意力分数
    scores = matmul(Q, transpose(K)) / sqrt(d_k)  # d_k 是键向量的维度

    # 应用Dropout
    scores = applyDropout(scores, dropoutRate)

    # 应用Softmax函数
    attentionWeights = softmax(scores)

    # 将注意力权重应用到值向量上
    output = matmul(attentionWeights, V)

    return output

【Dropout 的应用位置】2、after applying the attention weights to the value vectors：在将注意力权重应用到值向量之后应用Dropout意味着在Softmax后的注意力权重与值向量相乘得到加权值向量之后，再执行Dropout操作。这样做可以进一步帮助模型泛化，因为即使某些信息被Dropout随机丢弃了，模型仍然需要能够利用剩余的信息来做出准确的预测。

伪代码:

function scaledDotProductAttention(Q, K, V, dropoutRate):
    # 计算注意力分数
    scores = matmul(Q, transpose(K)) / sqrt(d_k)  # d_k 是键向量的维度

    # 应用Softmax函数
    attentionWeights = softmax(scores)

    # 应用Dropout
    attentionWeights = applyDropout(attentionWeights, dropoutRate)

    # 将注意力权重应用到值向量上
    output = matmul(attentionWeights, V)

    return output

备注

下面示例演示了 “apply the dropout mask after computing the attention weights”

https://img.zhaoweiguo.com/uPic/2024/11/TVHyRq.png — Figure 3.22 Using the causal attention mask (upper left), we apply an additional dropout mask (upper right) to zero out additional attention weights to reduce overfitting during training.¶

在下面的示例中我们dropout 50% 的比例，在后面训练gpt 模型时，我们使用 10%-20% 的 dropout 比例。

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) #A
example = torch.ones(6, 6) #B
print(dropout(example))
# 输出(有近一半是0)
tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])

apply dropout to the attention weight matrix:

torch.manual_seed(123)
print(dropout(attn_weights))
# 输出
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>

3.5.3 Implementing a compact causal attention class¶

Listing 3.3 A compact causal attention class

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # A
        self.register_buffer(
            'mask',
            torch.triu(
                torch.ones(context_length, context_length),
                diagonal=1
            )
        ) #B


    def forward(self, x):
        b, num_tokens, d_in = x.shape  #C
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  #C
        attn_scores.masked_fill_(  #D
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec


import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
batch = torch.stack((inputs, inputs), dim=0)

torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)
# 输出
# context_vecs.shape: torch.Size([2, 6, 2])

https://img.zhaoweiguo.com/uPic/2024/11/NpD3Za.png — Figure 3.23 A mental model summarizing the four different attention modules we are coding in this chapter. We began with a simplified attention mechanism, added trainable weights, and then added a casual attention mask. In the remainder of this chapter, we will extend the causal attention mechanism and code multi-head attention, which is the final module we will use in the LLM implementation in the next chapter.¶

3.6 Extending single-head attention to multi-head attention¶

3.6.1 Stacking multiple single-head attention layers¶

https://img.zhaoweiguo.com/uPic/2024/11/FxjRwY.png — Figure 3.24 The multi-head attention module in this figure depicts two single-head attention modules stacked on top of each other. So, instead of using a single matrix Wv for computing the value matrices, in a multi-head attention module with two heads, we now have two value weight matrices: \(W_{v1}\) and \(W_{v2}\). The same applies to the other weight matrices, Wq and Wk. We obtain two sets of context vectors Z1 and Z2 that we can combine into a single context vector matrix Z.¶

Listing 3.4 A wrapper class to implement multi-head attention

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

https://img.zhaoweiguo.com/uPic/2024/11/cXgCA8.png — Figure 3.25 Using the MultiHeadAttentionWrapper, we specified the number of attention heads (num_heads). If we set num_heads=2, as shown in this figure, we obtain a tensor with two sets of context vector matrices. In each context vector matrix, the rows represent the context vectors corresponding to the tokens, and the columns correspond to the embedding dimension specified via d_out=4. We concatenate these context vector matrices along the column dimension. Since we have 2 attention heads and an embedding dimension of 2, the final embedding dimension is 2 × 2 = 4.¶

torch.manual_seed(123)

context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)

context_vecs = mha(batch)

print("context_vecs.shape:", context_vecs.shape)
#context_vecs.shape: torch.Size([2, 6, 4])
print(context_vecs)
# 输出
# tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
#          [-0.5874,  0.0058,  0.5891,  0.3257],
#          [-0.6300, -0.0632,  0.6202,  0.3860],
#          [-0.5675, -0.0843,  0.5478,  0.3589],
#          [-0.5526, -0.0981,  0.5321,  0.3428],
#          [-0.5299, -0.1081,  0.5077,  0.3493]],

#         [[-0.4519,  0.2216,  0.4772,  0.1063],
#          [-0.5874,  0.0058,  0.5891,  0.3257],
#          [-0.6300, -0.0632,  0.6202,  0.3860],
#          [-0.5675, -0.0843,  0.5478,  0.3589],
#          [-0.5526, -0.0981,  0.5321,  0.3428],
#          [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)

# 说明
# first  dimension: context_vecs tensor is 2 since we have two input texts
#       (两个值完全一样是因为输入文本batch, 是完全一样的)
# second dimension: 6 tokens in each input
# third  dimension: 4-dimensional embedding of each token

3.6.2 Implementing multi-head attention with weight splits¶

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)    # Shape: (b, num_tokens, d_out) => [2, 6, 2]
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)        =>[b, num_tokens, num_headers, head_dim]  => [2, 6, 2, 1]
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)             => [b, num_headers, num_tokens, head_dim] => [2, 2, 6, 1]
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head   => [2, 2, 6, 1] @ [2, 2, 1, 6] => [2, 2, 6, 6] => [b, num_headers, num_tokens, num_tokens]

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)       => [2, 2, 6, 6] @ [2, 2, 6, 1] => [2, 2, 6, 1] => [2, 6, 2, 1]

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)   => [2, 6, 2]
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

https://img.zhaoweiguo.com/uPic/2024/11/nUNeSc.png — Figure 3.26 In the `MultiheadAttentionWrapper` class with two attention heads, we initialized two weight matrices \(W_{q1}\) and \(W_{q2}\) and computed two query matrices Q1 and Q2 as illustrated at the top of this figure. In the `MultiheadAttention` class, we initialize one larger weight matrix \(W_q\) , only perform one matrix multiplication with the inputs to obtain a query matrix Q, and then split the query matrix into Q1 and Q2 as shown at the bottom of this figure. We do the same for the keys and values, which are not shown to reduce visual clutter.¶

使用示例

torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print("context_vecs.shape:", context_vecs.shape)
# context_vecs.shape: torch.Size([2, 6, 2])
print(context_vecs)
# tensor([[[0.3190, 0.4858],
#          [0.2943, 0.3897],
#          [0.2856, 0.3593],
#          [0.2693, 0.3873],
#          [0.2639, 0.3928],
#          [0.2575, 0.4028]],
#         [[0.3190, 0.4858],
#          [0.2943, 0.3897],
#          [0.2856, 0.3593],
#          [0.2693, 0.3873],
#          [0.2639, 0.3928],
#          [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)

3.7 Summary¶

Attention mechanisms transform input elements into enhanced context vector representations that incorporate(包含,使合并) information about all inputs.
A self-attention mechanism computes the context vector representation as a weighted sum over the inputs.
In a simplified attention mechanism, the attention weights are computed via dot products.
A dot product is just a concise way of multiplying two vectors element-wise and then summing the products.
Matrix multiplications, while not strictly required, help us to implement computations more efficiently and compactly by replacing nested for-loops.
In self-attention mechanisms that are used in LLMs, also called scaled-dot product attention, we include trainable weight matrices to compute intermediate transformations of the inputs: queries, values, and keys. When working with LLMs that read and generate text from left to right, we add a causal attention mask to prevent the LLM from accessing future tokens.
Next to causal attention masks to zero out attention weights, we can also add a dropout mask to reduce overfitting in LLMs.
The attention modules in transformer-based LLMs involve multiple instances of causal attention, which is called multi-head attention.
We can create a multi-head attention module by stacking(堆叠) multiple instances of causal attention modules.
A more efficient way of creating multi-head attention modules involves batched matrix multiplications.

4 Implementing a GPT model from Scratch To Generate Text¶

Coding a GPT-like large language model (LLM) that can be trained to generate human-like text
Normalizing layer activations to stabilize neural network training
Adding shortcut connections in deep neural networks to train models more effectively
Implementing transformer blocks to create GPT models of various sizes
Computing the number of parameters and storage requirements of GPT models

https://img.zhaoweiguo.com/uPic/2024/11/EHDcf5.png — Figure 4.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on implementing the LLM architecture, which we will train in the next chapter.¶

4.1 Coding an LLM architecture¶

In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page
In this chapter, we consider embedding and model sizes akin to a small GPT-2 model
We’ll specifically code the architecture of the smallest GPT-2 model (124 million parameters)

https://img.zhaoweiguo.com/uPic/2024/11/UuEOhL.png — Figure 4.2 A mental model of a GPT model. Next to the embedding layers, it consists of one or more transformer blocks containing the masked multi-head attention module we implemented in the previous chapter.¶

configuration of the small GPT-2 model:

GPT_CONFIG_124M = {
    "vocab_size": 50257,        # Vocabulary size
    "context_length": 1024,     # Context length
    "emb_dim": 768,             # Embedding dimension
    "n_heads": 12,              # Number of attention heads
    "n_layers": 12,             # Number of layers
    "drop_rate": 0.1,           # Dropout rate
    "qkv_bias": False           # Query-Key-Value bias
}

https://img.zhaoweiguo.com/uPic/2024/11/xEM2A7.png — Figure 4.3 A mental model outlining the order in which we code the GPT architecture. In this chapter, we will start with the GPT backbone, a placeholder architecture, before we get to the individual core pieces and eventually assemble them in a transformer block for the final GPT architecture.¶

import torch
import torch.nn as nn


class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.
        return x

https://img.zhaoweiguo.com/uPic/2024/11/phEjtL.png — Figure 4.4 A big-picture overview showing how the input data is tokenized, embedded, and fed to the GPT model. Note that in our DummyGPTClass coded earlier, the token embedding is handled inside the GPT model. In LLMs, the embedded input token dimension typically matches the output dimension. The output embeddings here represent the context vectors we discussed in chapter 3.¶

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
# tensor([[ 6109,  3626,  6100,   345], #A
#         [ 6109,  1110,  6622,   257]])

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)
# Output shape: torch.Size([2, 4, 50257])
# tensor([[[-1.2034,  0.3201, -0.7130,  ..., -1.5548, -0.2390, -0.4667],
#          [-0.1192,  0.4539, -0.4432,  ...,  0.2392,  1.3469,  1.2430],
#          [ 0.5307,  1.6720, -0.4695,  ...,  1.1966,  0.0111,  0.5835],
#          [ 0.0139,  1.6755, -0.3388,  ...,  1.1586, -0.0435, -1.0400]],
#         [[-1.0908,  0.1798, -0.9484,  ..., -1.6047,  0.2439, -0.4530],
#          [-0.7860,  0.5581, -0.0610,  ...,  0.4835, -0.0077,  1.6621],
#          [ 0.3567,  1.2698, -0.6398,  ..., -0.0162, -0.1296,  0.3717],
#          [-0.2407, -0.7349, -0.5102,  ...,  2.0057, -0.3694,  0.1814]]],
#        grad_fn=<UnsafeViewBackward0>)

4.2 Normalizing activations with layer normalization¶

Training deep neural networks with many layers can sometimes prove challenging due to issues like vanishing or exploding gradients.
These issues lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimizes the loss function.
In other words, the network has difficulty learning the underlying patterns in the data to a degree that would allow it to make accurate predictions or decisions.
The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance.

https://img.zhaoweiguo.com/uPic/2024/11/7J1nUv.png — Figure 4.5 An illustration of layer normalization where the 5 layer outputs, also called activations, are normalized such that they have a zero mean and variance of 1.¶

https://img.zhaoweiguo.com/uPic/2024/11/uQxKaT.png — Figure 4.6 An illustration of the dim parameter when calculating the mean of a tensor. For instance,if we have a 2D tensor(matrix) with dimensions[rows, columns], using dim=0 will perform the operation across rows (vertically, as shown at the bottom), resulting in an output that aggregates the data for each column. Using dim=1 or dim=-1 will perform the operation across columns (horizontally, as shown at the top), resulting in an output aggregating the data for each row.¶

# 归一化L2(欧几里德范数归一)
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

https://img.zhaoweiguo.com/uPic/2024/11/jeZEJm.png — Figure 4.7 A mental model listing the different building blocks we implement in this chapter to assemble the GPT architecture.¶

4.3 Implementing a feed forward network with GELU activations¶

主要讲了GELU的实现和特点

https://img.zhaoweiguo.com/uPic/2024/11/G5BwCr.png — Figure 4.11 A mental model showing the topics we cover in this chapter, with the black checkmarks indicating those that we have already covered.¶

4.4 Adding shortcut connections¶

shortcut connections, also known as skip or residual connections.
Originally, shortcut connections were proposed for deep networks in computer vision (specifically, in residual networks) to mitigate the challenge of vanishing gradients.
The vanishing gradient problem refers to the issue where gradients (which guide weight updates during training) become progressively smaller as they propagate backward through the layers, making it difficult to effectively train earlier layers

https://img.zhaoweiguo.com/uPic/2024/11/bGNkqW.png — Figure 4.12 A comparison between a deep neural network consisting of 5 layers without (on the left) and with shortcut connections (on the right). Shortcut connections involve adding the inputs of a layer to its outputs, effectively creating an alternate path that bypasses certain layers. The gradient illustrated in Figure 1.1 denotes the mean absolute gradient at each layer, which we will compute in the code example that follows.¶

class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            layer_output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

print the gradient values without shortcut connections:

layer_sizes = [3, 3, 3, 3, 3, 1]

sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
print_gradients(model_without_shortcut, sample_input)

# 输出 ==>
# layers.0.0.weight has gradient mean of 0.00020173587836325169
# layers.1.0.weight has gradient mean of 0.0001201116101583466
# layers.2.0.weight has gradient mean of 0.0007152041653171182
# layers.3.0.weight has gradient mean of 0.001398873864673078
# layers.4.0.weight has gradient mean of 0.005049646366387606

print the gradient values with shortcut connections:

torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)
# 输出 ==>
# layers.0.0.weight has gradient mean of 0.22169792652130127
# layers.1.0.weight has gradient mean of 0.20694105327129364
# layers.2.0.weight has gradient mean of 0.32896995544433594
# layers.3.0.weight has gradient mean of 0.2665732502937317
# layers.4.0.weight has gradient mean of 1.3258541822433472

4.5 Connecting attention and linear layers in a transformer block¶

https://img.zhaoweiguo.com/uPic/2024/11/d9Qu7J.png — Figure 4.13 An illustration of a transformer block. The bottom of the diagram shows input tokens that have been embedded into 768-dimensional vectors. Each row corresponds to one token’s vector representation. The outputs of the transformer block are vectors of the same dimension as the input, which can then be fed into subsequent layers in an LLM.¶

The idea is that the self-attention mechanism in the multi-head attention block identifies and analyzes relationships between elements in the input sequence.
In contrast, the feed forward network modifies the data individually at each position.
This combination not only enables a more nuanced(细微差别) understanding and processing of the input but also enhances the model’s overall capacity for handling complex data patterns.
【组合的意义 fromGPT】自注意力机制提供全局信息：通过捕捉序列中元素之间的关系，让模型理解上下文和结构的复杂性。前馈网络强化局部信息：对每个位置进行特定的非线性变换，提升其独立特征表达。协同效果：这种组合让模型既能捕捉全局模式，又能处理局部细节，从而在面对复杂数据模式时表现出更强的处理能力。【例】句子翻译任务：自注意力机制帮助模型理解句子结构和词语之间的关系；前馈网络对每个单词的特定信息进行调整和优化，从而生成更准确的翻译。

from previous_chapters import MultiHeadAttention

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x

输入&输出形状相同，是一个关键设计
The preservation of shape throughout the transformer block architecture is not incidental but a crucial aspect of its design.
This design enables its effective application across a wide range of sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship.
However, the output is a context vector that encapsulates information from the entire input sequence, as we learned in chapter 3.
This means that while the physical dimensions of the sequence (length and feature size) remain unchanged as it passes through the transformer block, the content of each output vector is re-encoded to integrate contextual information from across the entire input sequence.

https://img.zhaoweiguo.com/uPic/2024/11/REg2ra.png — Figure 4.14 A mental model of the different concepts we have implemented in this chapter so far.¶

4.6 Coding the GPT model¶

https://img.zhaoweiguo.com/uPic/2024/11/kbWfLF.png — Figure 4.15 An overview of the GPT model architecture. This figure illustrates the flow of data through the GPT model. Starting from the bottom, tokenized text is first converted into token embeddings, which are then augmented with positional embeddings. This combined information forms a tensor that is passed through a series of transformer blocks shown in the center (each containing multi-head attention and feed forward neural network layers with dropout and layer normalization), which are stacked on top of each other and repeated 12 times.¶

the transformer block is repeated many times throughout a GPT model architecture.
In the case of the 124 million parameter GPT-2 model, it’s repeated 12 times
In the case of the largest GPT-2 model with 1,542 million parameters, this transformer block is repeated 36 times.

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

请求:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

# 输出
# Input batch:
#  tensor([[6109, 3626, 6100,  345],
#         [6109, 1110, 6622,  257]])

# Output shape: torch.Size([2, 4, 50257])
# tensor([[[ 0.3613,  0.4222, -0.0711,  ...,  0.3483,  0.4661, -0.2838],
#          [-0.1792, -0.5660, -0.9485,  ...,  0.0477,  0.5181, -0.3168],
#          [ 0.7120,  0.0332,  0.1085,  ...,  0.1018, -0.4327, -0.2553],
#          [-1.0076,  0.3418, -0.1190,  ...,  0.7195,  0.4023,  0.0532]],

#         [[-0.2564,  0.0900,  0.0335,  ...,  0.2659,  0.4454, -0.6806],
#          [ 0.1230,  0.3653, -0.2074,  ...,  0.7705,  0.2710,  0.2246],
#          [ 1.0558,  1.0318, -0.2800,  ...,  0.6936,  0.3205, -0.3178],
#          [-0.1565,  0.3926,  0.3288,  ...,  1.2630, -0.1858,  0.0388]]],
#        grad_fn=<UnsafeViewBackward0>)

参看参数总数:

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

# 输出
Total number of parameters: 163,009,536
# 疑问
为啥不是 GPT2 论文说的 124M 呢？

备注

In the original GPT-2 paper, the researchers applied weight tying , which means that they reused the token embedding layer (tok_emb) as the output layer, which means setting self.out_head.weight = self.tok_emb.weight . The token embedding and output layers are very large due to the number of rows for the 50,257 in the tokenizer’s vocabulary. Weight tying reduces the overall memory footprint and computational complexity of the model. 具体参见 WeightTying

去除GPT2重用的参数:

total_params_gpt2 =  total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")

# 输出
Number of trainable parameters considering weight tying: 124,412,160

内存占用:

# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
total_size_bytes = total_params * 4

# Convert to megabytes
total_size_mb = total_size_bytes / (1024 * 1024)

print(f"Total size of the model: {total_size_mb:.2f} MB")
# 输出
Total size of the model: 621.83 MB

Exercise:

- **GPT2-small** (the 124M configuration we already implemented):
    - "emb_dim" = 768
    - "n_layers" = 12
    - "n_heads" = 12

- **GPT2-medium:**
    - "emb_dim" = 1024
    - "n_layers" = 24
    - "n_heads" = 16

- **GPT2-large:**
    - "emb_dim" = 1280
    - "n_layers" = 36
    - "n_heads" = 20

- **GPT2-XL:**
    - "emb_dim" = 1600
    - "n_layers" = 48
    - "n_heads" = 25

4.7 Generating text¶

https://img.zhaoweiguo.com/uPic/2024/11/ZfRLSw.png — Figure 4.16 This diagram illustrates the step-by-step process by which an LLM generates text, one token at a time. Starting with an initial input context (“Hello, I am”), the model predicts a subsequent token during each iteration, appending it to the input context for the next round of prediction. As shown, the first iteration adds “a”, the second “model”, and the third “ready”, progressively building the sentence.¶

https://img.zhaoweiguo.com/uPic/2024/11/MaELd0.png — Figure 4.17 details the mechanics of text generation in a GPT model by showing a single iteration in the token generation process. The process begins by encoding the input text into token IDs, which are then fed into the GPT model. The outputs of the model are then converted back into text and appended to the original input text.¶

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

备注

其实这儿不用执行 torch.softmax 直接执行 torch.softmax 可以得到同样的结果。we coded the conversion to illustrate the full process of transforming logits to probabilities, which can add additional intuition, such as that the model generates the most likely next token, which is known as greedy decoding. In the next chapter, when we will implement the GPT training code, we will also introduce additional sampling techniques where we modify the softmax outputs such that the model doesn’t always select the most likely token, which introduces variability and creativity in the generated text.

https://img.zhaoweiguo.com/uPic/2024/11/7DkeWM.png — Figure 4.18 An illustration showing six iterations of a token prediction cycle, where the model takes a sequence of initial token IDs as input, predicts the next token, and appends this token to the input sequence for the next iteration. (The token IDs are also translated into their corresponding text for better understanding.)¶

准备数据:

start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)
# 输出
# encoded: [15496, 11, 314, 716]
# encoded_tensor.shape: torch.Size([1, 4])

put the model into .eval() mode, which disables random components like dropout, which are only used during training:

model.eval() # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))
# 输出
# Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])
# Output length: 10

Remove batch dimension and convert back into text:

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
# 输出(未训练的)
# Hello, I am Featureiman Byeswickattribute argue

4.8 Summary¶

Layer normalization stabilizes training by ensuring that each layer’s outputs have a consistent mean and variance.
Shortcut connections are connections that skip one or more layers by feeding the output of one layer directly to a deeper layer, which helps mitigate the vanishing gradient problem when training deep neural networks, such as LLMs.
Transformer blocks are a core structural component of GPT models, combining masked multi-head attention modules with fully connected feed-forward networks that use the GELU activation function.
GPT models are LLMs with many repeated transformer blocks that have millions to billions of parameters.
GPT models come in various sizes, for example, 124 million, and 1542 million parameters, which we can implement with the same GPTModel Python class.
The text generation capability of a GPT-like LLM involves decoding output tensors into human-readable text by sequentially predicting one token at a time based on a given input context.
Without training, a GPT model generates incoherent text, which underscores the importance of model training for coherent text generation, which is the topic of subsequent chapters.

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)



class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec



class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"], 
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]
        
        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)
        
        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]  

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval() # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor, 
    max_new_tokens=6, 
    context_size=GPT_CONFIG_124M["context_length"]
)

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)

5 Pretraining on Unlabeled Data¶

Computing the training and validation set losses to assess the quality of LLM-generated text during training
Implementing a training function and pretraining the LLM
Saving and loading model weights to continue training an LLM Loading pretrained weights from OpenAI

https://img.zhaoweiguo.com/uPic/2024/11/fk4Mc8.png — Figure 5.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset and finetuning it on a labeled dataset. This chapter focuses on pretraining the LLM, which includes implementing the training code, evaluating the performance, and saving and loading model weights.¶

5.1 Evaluating generative text models¶

https://img.zhaoweiguo.com/uPic/2024/11/jKVWDh.png — Figure 5.2 An overview of the topics covered in this chapter. We begin by recapping the text generation from the previous chapter and implementing basic model evaluation techniques that we can use during the pretraining stage.¶

5.1.1 Using GPT to generate text¶

https://img.zhaoweiguo.com/uPic/2024/11/L2KVis.png — Figure 5.3 Generating text involves encoding text into token IDs that the LLM processes into logit vectors. The logit vectors are then converted back into token IDs, detokenized into a text representation.¶

Utility functions for text to token ID conversion

import tiktoken
from previous_chapters import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

# 输出
# Output text:
#  Every effort moves you rentingetic wasnم refres RexMeCHicular stren

5.1.2 Calculating the text generation loss¶

先回顾一下前面 how the data is loaded from chapter 2 and how the text is generated via the generate_text_simple function from chapter 4.

https://img.zhaoweiguo.com/uPic/2024/11/S5XB0w.png — Figure 5.4 For each of the 3 input tokens, shown on the left, we compute a vector containing probability scores corresponding to each token in the vocabulary. The index position of the highest probability score in each vector represents the most likely next token ID. These token IDs associated with the highest probability scores are selected and mapped back into a text that represents the text generated by the model.¶

https://img.zhaoweiguo.com/uPic/2024/11/sdWuJt.png — Figure 5.5 We now implement the text evaluation function in the remainder of this section. In the next section, we apply this evaluation function to the entire dataset we use for model training¶

输入&期望输出:

inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [40,    1107, 588]])   #  "I really like"]

targets = torch.tensor([[3626, 6100, 345  ],  # [" effort moves you",
                        [1107,  588, 11311]]) #  " really like chocolate"]

We feed the inputs into the model to calculate logit vectors for the two input examples, each comprising three tokens, and apply the softmax function to transform these logit values into probability scores:

with torch.no_grad():
    logits = model(inputs)

probas = torch.softmax(logits, dim=-1) # Probability of each token in vocabulary
print(probas.shape) # Shape: (batch_size, num_tokens, vocab_size)
# 输出
# torch.Size([2, 3, 50257])

by applying the argmax function to the probability scores to obtain the corresponding token IDs:

# 这儿得到的是模型的最终输出和token ID(当然现在还没训练，结果当然乱七八糟)
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)
# 输出
# Token IDs:
# tensor([[[36397],
#         [39619],
#         [20610]],
#        [[ 8615],
#         [49289],
#         [47105]]])

Part of the text evaluation process is to measure “how far” the generated tokens are from the correct predictions (targets).
The training function will use this information to adjust the model weights to generate text that is more similar to (or ideally matches) the target text.

https://img.zhaoweiguo.com/uPic/2024/11/yLleo1.png — Figure 5.6 Before training, the model produces random next-token probability vectors. The goal of model training is to ensure that the probability values corresponding to the highlighted target token IDs are maximized.¶

The token probabilities corresponding to the target indices are as follows:

# 看看现在期望的输出的概率(probabilities)
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
# 输出
# Text 1: tensor([2.3466e-05, 2.0531e-05, 1.1733e-05])
# Text 2: tensor([4.2794e-05, 1.6248e-05, 1.1586e-05])

备注

The goal of training an LLM is to maximize these values, aiming to get them as close to a probability of 1. This way, we ensure the LLM consistently picks the target token(essentially the next word in the sentence)as the next token it generates.

Backpropagation¶

https://img.zhaoweiguo.com/uPic/2024/11/pZeCF1.png — Figure 5.7 Calculating the loss involves several steps. Steps 1 to 3 calculate the token probabilities corresponding to the target tensors. These probabilities are then transformed via a logarithm and averaged in steps 4-6.¶

# Compute logarithm of all token probabilities
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)
# tensor([ -9.5042, -10.3796, -11.3677, -11.4798,  -9.7764, -12.2561])

# Calculate the average probability for each token
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
# tensor(-10.7940)

# In deep learning, the common practice is to bring the negative average log probability down to 0.
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
# tensor(10.7940)

备注

The term for this negative value, -10.7722 turning into 10.7722, is known as the cross entropy loss in deep learning.

Cross entropy loss¶

the shape of the logits and target tensors:

# Logits have shape (batch_size, num_tokens, vocab_size)
print("Logits shape:", logits.shape)
# Targets have shape (batch_size, num_tokens)
print("Targets shape:", targets.shape)

For the cross_entropy function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:

logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)
# 输出
# Flattened logits: torch.Size([6, 50257])
# Flattened targets: torch.Size([6])

PyTorch’s cross_entropy function:

loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)

备注

详见定义 _cross_entropy:

Perplexity¶

Perplexity is a measure often used alongside cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.

备注

详见定义 Perplexity 困惑度

5.1.3 Calculating the training and validation set losses¶

https://img.zhaoweiguo.com/uPic/2024/11/10h7zN.png — Figure 5.8 After computing the cross entropy loss in the previous section, we now apply this loss computation to the entire text dataset that we will use for model training.¶

The cost of pretraining LLMs¶

https://img.zhaoweiguo.com/uPic/2024/11/Uid7fP.png — Figure 5.9 When preparing the data loaders, we split the input text into training and validation set portions. Then, we tokenize the text (only shown for the training set portion for simplicity) and divide the tokenized text into chunks of a user-specified length (here 6). Finally, we shuffle the rows and organize the chunked text into batches (here, batch size 2), which we can use for model training.¶

For visualization purposes, Figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training.

Training with variable lengths¶

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches




with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

https://img.zhaoweiguo.com/uPic/2024/11/u7CMTh.png — Figure 5.10 We have recapped the text generation process and implemented basic model evaluation techniques to compute the training and validation set losses. Next, we will go to the training functions and pretrain the LLM.¶

5.2 Training an LLM¶

https://img.zhaoweiguo.com/uPic/2024/11/G8oVmR.png — Figure 5.11 A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights so that the training set loss is minimized.¶

def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()

Adam optimizers are a popular choice for training deep neural networks.
However, in our training loop, we opt for the AdamW optimizer.
AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing(处罚) larger weights.
This adjustment allows AdamW to achieve more effective regularization and better generalization and is thus frequently used in the training of LLMs.
运行

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

输出:

Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933
Ep 1 (Step 000005): Train loss 8.111, Val loss 8.339
Every effort moves you,,,,,,,,,,,,.
Ep 2 (Step 000010): Train loss 6.661, Val loss 7.048
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and,, and, and,
... ...
Ep 9 (Step 000075): Train loss 0.717, Val loss 6.293
Ep 9 (Step 000080): Train loss 0.541, Val loss 6.393
Every effort moves you?"  "Yes--quite insensible to the irony. She wanted him vindicated--and by me!"  He laughed again, and threw back the window-curtains, I had the donkey. "There were days when I
Ep 10 (Step 000085): Train loss 0.391, Val loss 6.452
Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gis

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator


def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))

    # Plot training and validation loss against epochs
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))  # only show integer labels on x-axis

    # Create a second x-axis for tokens seen
    ax2 = ax1.twiny()  # Create a second x-axis that shares the same y-axis
    ax2.plot(tokens_seen, train_losses, alpha=0)  # Invisible plot for aligning ticks
    ax2.set_xlabel("Tokens seen")

    fig.tight_layout()  # Adjust layout to make room
    plt.savefig("loss-plot.pdf")
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

https://img.zhaoweiguo.com/uPic/2024/11/qdFTHI.png — Figure 5.12 At the beginning of the training, we observe that both the training and validation set losses sharply decrease, which is a sign that the model is learning. However, the training set loss continues to decrease past the second epoch, whereas the validation loss stagnates. This is a sign that the model is still learning, but it’s overfitting to the training set past epoch 2.¶

Looking at the results above, we can see that the model starts out generating incomprehensible strings of words, whereas towards the end, it’s able to produce grammatically more or less correct sentences
However, based on the training and validation set losses, we can see that the model starts overfitting(最终训练集损失是0.541，而验证集损失是6.393)
If we were to check a few passages it writes towards the end, we would find that they are contained in the training set verbatim(逐字翻译) – it simply memorizes the training data
Later, we will cover decoding strategies that can mitigate this memorization by a certain degree
Note that the overfitting here occurs because we have a very, very small training set, and we iterate over it so many times

https://img.zhaoweiguo.com/uPic/2024/11/u3Vqxc.png — Figure 5.13 Our model can generate coherent text after implementing the training function. However, it often memorizes passages from the training set verbatim. The following section covers strategies to generate more diverse output texts.¶

5.3 Decoding strategies to control randomness¶

本节将重新实现 generate_text_simple() in chapter 4.7 但 we will cover two techniques, temperature scaling, and top-k sampling, to improve this function.
之前实现的版本使用的贪心解码(greedy decoding)，每一次请求的返回结果都是选择概率最高的那个，导致每一次请求的结果都是相同的，这样对生成式的任务而言，效果单一。

5.3.1 Temperature scaling¶

This section introduces temperature scaling , a technique that adds a probabilistic selection process to the next-token generation task.
Previously, inside the generate_text_simple function, we always sampled the token with the highest probability as the next token using torch.argmax, also known as greedy decoding. To generate text with more variety, we can replace the argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).

small vocabulary for illustration purposes:

vocab = {
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}

上节讲的 greedy decoding

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])
# "forward"
# 讲解:
#   通过对 next_token_logits 分析知道
#   最大概率是 6.75, 对应的softmax的最大值
#   所以next token的index是3，得到最终结果: forward

对应的 probabilistic sampling process

# replace the argmax with the multinomial function in PyTorch
torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])
# 说明： 不管概率高低，都有被选中的可能

temperature scaling 的作用:

# Temperatures greater than 1 result in more uniformly distributed token probabilities
# Temperatures smaller than 1 will result in more confident (sharper or more peaky) distributions.
def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

https://img.zhaoweiguo.com/uPic/2024/11/peGC1C.png — Figure 5.14 A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature to 0.1 sharpens the distribution, so the most likely token (here “forward”) will have an even higher probability score. Vice versa, increasing the temperature to 5 makes the distribution more uniform.¶

5.3.2 Top-k sampling¶

上节的方法allows for exploring less likely but potentially more interesting and creative paths in the generation process. However, One downside of this approach is that it sometimes leads to grammatically incorrect or completely nonsensicaloutputs(就是超低概率被命中，正常来说这种nexttoken是不太合适的，也就是无意义的语句生成)

备注

In top-k sampling, we can restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores

https://img.zhaoweiguo.com/uPic/2024/11/xtYDXg.tif — Figure 5.15 Using top-k sampling with k=3, we focus on the 3 tokens associated with the highest logits and mask out all other tokens with negative infinity (-inf) before applying the softmax function. This results in a probability distribution with a probability value 0 assigned to all non- top-k tokens.¶

top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],
    input=torch.tensor(float("-inf")),
    other=next_token_logits
)
topk_probas = torch.softmax(new_logits, dim=0)

# 更高效实现
> new_logits = torch.full_like( # create tensor containing -inf values
>    next_token_logits, -torch.inf
>)
> new_logits[top_pos] = next_token_logits[top_pos] # copy top k values into the -inf tensor
> topk_probas = torch.softmax(new_logits, dim=0)

5.3.3 Modifying the text generation function¶

本节把前面两节的 temperature scale 和 top-k sampling 合并到generator函数中

def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, context_len)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        if idx_next == eos_id:  # Stop generating early if end-of-sequence token is encountered and eos_id is specified
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

5.4 Loading and saving model weights in PyTorch¶

It’s common to train LLMs with adaptive optimizers like Adam or AdamW instead of regular SGD

These adaptive optimizers store additional parameters for each model weight, so it makes sense to save them as well in case we plan to continue the pretraining later:

torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth"
)

loading:

checkpoint = torch.load("model_and_optimizer.pth", weights_only=True)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0005, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();

5.5 Loading pretrained weights from OpenAI¶

本节主要讲了，如何从gpt2中加载权重
因为gpt2和我们这儿定义的数据结构名称不相同，所以需要专门进行定制加载

5.6 Summary¶

When LLMs generate text, they output one token at a time.
By default, the next token is generated by converting the model outputs into probability scores and selecting the token from the vocabulary that corresponds to the highest probability score, which is known as “greedy decoding.”
Using probabilistic sampling and temperature scaling, we can influence the diversity and coherence of the generated text.
Training and validation set losses can be used to gauge(测量) the quality of text generated by LLM during training.
Pretraining an LLM involves changing its weights to minimize the training loss.
The training loop for LLMs itself is a standard procedure in deep learning, using a conventional cross entropy loss and AdamW optimizer . Pretraining an LLM on a large text corpus is time- and resource- intensive so we can load openly available weights from OpenAI as an alternative to pretraining the model on a large dataset ourselves.

Appendix A. Introduction to PyTorch¶

A.1 What is PyTorch¶

https://img.zhaoweiguo.com/uPic/2024/11/LiTYmO.png — Figure A.1 PyTorch’s three main components include a tensor library as a fundamental building block for computing, automatic differentiation for model optimization, and deep learning utility functions, making it easier to implement and train deep neural network models.¶

Firstly, PyTorch is a tensor library that extends the concept of array-oriented programming library NumPy with the additional feature of accelerated computation on GPUs, thus providing a seamless switch between CPUs and GPUs.
Secondly, PyTorch is an automatic differentiation engine, also known as autograd, which enables the automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization.
Finally, PyTorch is a deep learning library, meaning that it offers modular, flexible, and efficient building blocks (including pre-trained models, loss functions, and optimizers) for designing and training a wide range of deep learning models, catering to both researchers and developers.

A.2 Understanding tensors¶

https://img.zhaoweiguo.com/uPic/2024/11/fXXxjF.png — Figure A.6 An illustration of tensors with different ranks. Here 0D corresponds to rank 0, 1D to rank 1, and 2D to rank 2. Note that a 3D vector, which consists of 3 elements, is still a rank 1 tensor.¶

PyTorch’s has a NumPy-like API

A.3 Seeing models as computation graphs¶

Automatic differentiation engine, also known as autograd. PyTorch’s autograd system provides functions to compute gradients in dynamic computational graphs automatically.
Computation graph is a directed graph that allows us to express and visualize mathematical expressions.
In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network – we will need this later to compute the required gradients for backpropagation, which is the main training algorithm for neural networks.

https://img.zhaoweiguo.com/uPic/2024/11/3m8Ye0.png — Figure A.7 A logistic regression forward pass as a computation graph. The input feature x1 is multiplied by a model weight w1 and passed through an activation function σ after adding the bias. The loss is computed by comparing the model output a with a given label y.¶

In fact, PyTorch builds such a computation graph in the background, and we can use this to calculate gradients of a loss function with respect to the model parameters (here w1 and b) to train the model

A.4 Automatic differentiation made easy¶

https://img.zhaoweiguo.com/uPic/2024/11/6lneDt.png — Figure A.8 The most common way of computing the loss gradients in a computation graph involves applying the chain rule from right to left, which is also called `reverse-model automatic differentiation` or `backpropagation`. It means we start from the output layer (or the loss itself) and work backward through the network to the input layer. This is done to compute the gradient of the loss with respect to each parameter (weights and biases) in the network, which informs how we update these parameters during training.¶

Gradients are required when training neural networks via the popular backpropagation algorithm, which can be thought of as an implementation of the chain rule from calculus for neural networks

Partial derivatives and gradients¶

A partial derivatives, which measure the rate at which a function changes with respect to one of its variables.
A gradient is a vector containing all of the partial derivatives of a multivariate function, a function with more than one variable as input.
This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model’s performance, using a method such as gradient descent.
Listing A.3 Computing gradients via autograd

import torch.nn.functional as F
from torch.autograd import grad
y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)
grad_L_w1 = grad(loss, w1, retain_graph=True)  #A
grad_L_b = grad(loss, b, retain_graph=True)

PyTorch provides even more high-level tools to automate this process:

loss.backward()
print(w1.grad)
print(b.grad)

A.5 Implementing multilayer neural networks¶

https://img.zhaoweiguo.com/uPic/2024/11/WVZ7B9.png — Figure A.9 An illustration of a multilayer perceptron with 2 hidden layers. Each node represents a unit in the respective layer. Each layer has only a very small number of nodes for illustration purposes.¶

Listing A.4 A multilayer perceptron with two hidden layers

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):  #A
        super().__init__()
        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),  #B
            torch.nn.ReLU(),  #C
            # 2nd hidden layer
            torch.nn.Linear(30, 20),  #D
            torch.nn.ReLU(),
            # output layer
            torch.nn.Linear(20, num_outputs),
        )
    def forward(self, x):
        logits = self.layers(x)
        return logits   #E

instantiate a new neural network object:

>>> model = NeuralNetwork(50, 3)
NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)

check the total number of trainable parameters:

>>> sum(p.numel() for p in model.parameters() if p.requires_grad)
2213

手动计算:
第一个隐藏层：50 个输入乘以 30 个隐藏单元加上 30 个偏置单元。
    50*30+30
第二个隐藏层：30 个输入单元乘以 20 个节点加上 20 个偏置单元。
    30*20+20
输出层：20 个输入节点乘以 3 个输出节点加上 3 个偏置单元。
    20*3+3
总共等于：1530+620+63=2213

A linear layer multiplies the inputs with a weight matrix and adds a bias vector( \(wx+b\) ). This is sometimes also referred to as a feedforward or fully connected layer.

# weight parameter matrix
>>> print(model.layers[0].weight)

# bias parameter matrix
>>> print(model.layers[0].bias)

we can make the random number initialization reproducible by seeding PyTorch’s random number generator:

torch.manual_seed(123)
model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

When using for inference rather than training, it is a best practice to use torch.no_grad() context manager:

# This tells PyTorch that it doesn't need to keep track of the gradients,
#   which can result in significant savings in memory and computation.
with torch.no_grad():
    out = model(X)
print(out)

In PyTorch, it’s common practice to code models such that they return the outputs of the last layer (logits) without passing them to a nonlinear activation function.
That’s because PyTorch’s commonly used loss functions combine the softmax (or sigmoid for binary classification) operation with the negative log-likelihood loss in a single class.
The reason for this is numerical efficiency and stability.
So, if we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly:
```
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)
```

A.6 Setting up efficient data loaders¶

https://img.zhaoweiguo.com/uPic/2024/11/9fcMmn.png — Figure A.10 PyTorch implements a Dataset and a DataLoader class. The Dataset class is used to instantiate objects that define how each data record is loaded. The DataLoader handles how the data is shuffled and assembled into batches.¶

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,      # crucial for parallelizing data loading and preprocessing.
    drop_last=True      # drop the last batch in each epoch(因为最后一个批次的数量可能不够)
)

https://img.zhaoweiguo.com/uPic/2024/11/KvFx0K.png — Figure A.11 参数 `num_workers` 用于控制数据loading的工作线程，当num_workers=0时（像如图左边一样），计算损失值和加载数据一次只能做一个（计算完损失后，gpu会空闲等待数据载入）；而 `num_workers` 启动后，载入数据和计算损失值可以并发执行（像如图右边一样）。Loading data without multiple workers (setting num_workers=0) will create a data loading bottleneck where the model sits idle until the next batch is loaded as illustrated in the left subpanel. If multiple workers are enabled, the data loader can already queue up the next batch in the background as shown in the right subpanel.¶

警告

or Jupyter notebooks, setting num_workers to greater than 0 can sometimes lead to issues related to the sharing of resources between different processes, resulting in errors or notebook crashes.

备注

根据经验，设置 num_workers=4 通常会在许多实际数据集上带来最佳性能，但最佳设置取决于您的硬件以及用于加载 Dataset 类中定义的训练示例的代码。

A.7 A typical training loop¶

Listing A.9 Neural network training in PyTorch

import torch.nn.functional as F

torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2) #A
optimizer = torch.optim.SGD(model.parameters(), lr=0.5) #B

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):
        logits = model(features)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad() #C
        loss.backward() #D
        optimizer.step() #E
        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
            f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
            f" | Train Loss: {loss:.2f}")
    model.eval()
    # Optional model evaluation

Listing A.10 A function to compute the prediction accuracy

def compute_accuracy(model, dataloader):
    model = model.eval()
    correct = 0.0
    total_examples = 0
    for idx, (features, labels) in enumerate(dataloader):
        with torch.no_grad():
            logits = model(features)
        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions #A
        correct += torch.sum(compare) #B
        total_examples += len(compare)
    return (correct / total_examples).item() #C

A.8 Saving and loading models¶

模型保存:

torch.save(model.state_dict(), "model.pth")

# model.state_dict 是一个 Python 字典对象，它将模型中的每一层映射到其可训练参数（权重和偏差）

模型载入:

model = NeuralNetwork(2, 2)  # 架构需要与原始保存的模型完全匹配
model.load_state_dict(torch.load("model.pth"))

A.9 Optimizing training performance with GPUs¶

A.9.1 PyTorch computations on GPU devices¶

CPU:

tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
print(tensor_1 + tensor_2)

# 输出
# tensor([5., 7., 9.])

GPU:

tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")
print(tensor_1 + tensor_2)

# 输出
tensor([5., 7., 9.], device='cuda:0')

A.9.2 Single-GPU training¶

# Nvidia GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Apple Silicon 芯片
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

计算CPU&GPU的计算速度:

# CPU
a = torch.rand(100, 200)
b = torch.rand(200, 300)
%timeit a@b

# GPU
a, b = a.to("cuda"), b.to("cuda")
%timeit a @ b

A.9.3 Training with multiple GPUs¶

分布式训练是将模型训练划分到多个 GPU 和机器上的概念。

https://img.zhaoweiguo.com/uPic/2024/11/f1XfZm.png — Figure A.12 The model and data transfer in DDP involves two key steps. First, we create a copy of the model on each of the GPUs. Then we divide the input data into unique minibatches that we pass on to each model copy.¶

https://img.zhaoweiguo.com/uPic/2024/11/0NwzaD.png — Figure A.13 The forward and backward pass in DDP are executed independently on each GPU with its corresponding data subset. Once the forward and backward passes are completed, gradients from each model replica (on each GPU) are synchronized across all GPUs. This ensures that every model replica has the same updated weights.¶

备注

DDP does not function properly within interactive Python environments like Jupyter notebooks, which don’t handle multiprocessing in the same way a standalone Python script does.

如果您的机器有四个 GPU 并且您只想使用第一个和第三个 GPU:

CUDA_VISIBLE_DEVICES=0,2 python some_script.py

A.10 Summary¶

PyTorch is an open-source library that consists of three core components: a tensor library, automatic differentiation functions, and deep learning utilities.
PyTorch’s tensor library is similar to array libraries like NumPy
In the context of PyTorch, tensors are array-like data structures to represent scalars, vectors, matrices, and higher-dimensional arrays. PyTorch tensors can be executed on the CPU, but one major advantage of PyTorch’s tensor format is its GPU support to accelerate computations.
The automatic differentiation (autograd) capabilities in PyTorch allow us to conveniently train neural networks using backpropagation without manually deriving gradients.
The deep learning utilities in PyTorch provide building blocks for creating custom deep neural networks.
PyTorch includes Dataset and DataLoader classes to set up efficient data loading pipelines.
It’s easiest to train models on a CPU or single GPU.
Using DistributedDataParallel is the simplest way in PyTorch to accelerate the training if multiple GPUs are available.

Appendix B. References and Further Reading¶

Appendix C. Exercise Solutions¶

Appendix D. Adding Bells and Whistles to the Training Loop¶

其他¶

https://img.zhaoweiguo.com/uPic/2024/11/transformer_decoding_1.gif — The encoder start by processing the input sequence. The output of the top encoder is then transformed into a set of `attention vectors K and V`. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence:¶

https://img.zhaoweiguo.com/uPic/2024/11/transformer_decoding_2.gif — The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.¶

https://img.zhaoweiguo.com/uPic/2024/11/1GKHaZ.png — The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector. The `softmax` layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.¶