# GPU


* Text generation with chat models is bottlenecked by **memory bandwidth** rather than **compute**, because every active parameter must be read from memory for each token the model generates.
* As a result, the number of tokens per second you can generate is roughly proportional to ``total memory bandwidth / model size``.
* An 8B model loaded in bfloat16 precision is ~16GB in size, so 16GB must be read from memory for every token generated.
* Total memory bandwidth:
    - Consumer CPUs: 20-100GB/sec
    - Consumer GPUs and specialized CPUs such as Intel Xeon, AMD Threadripper/Epyc, or high-end Apple silicon: 200-900GB/sec
    - Data center GPUs such as the Nvidia A100 or H100: up to 2-3TB/sec
* These figures should give you a good idea of the generation speed you can expect from each type of hardware.
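As a rough sanity check, the proportionality above can be turned into a back-of-the-envelope estimate. The sketch below assumes generation is purely bandwidth-bound and ignores KV-cache reads, batching, and compute overhead. The function name and the sample bandwidth values (picked from within the ranges listed above) are illustrative, not measurements.

```python
def tokens_per_second_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if every token requires reading all weights once."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 16.0  # 8B parameters x 2 bytes (bfloat16)

# Sample bandwidths chosen from within the ranges above (illustrative only).
SAMPLE_BANDWIDTHS_GB_S = {
    "consumer CPU": 50,
    "consumer GPU": 900,
    "data center GPU": 2000,
}

for name, bandwidth in SAMPLE_BANDWIDTHS_GB_S.items():
    bound = tokens_per_second_upper_bound(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{bound:.1f} tokens/s")
```

Real throughput will be lower than this bound, but the ratio between hardware classes is usually a reasonable first guess.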


![](https://img.zhaoweiguo.com/uPic/2025/03/pyig7w.png)

Bandwidth of GPU SRAM and HBM compared with CPU DRAM


## OPS

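* Related concepts:
    * FLOPS: floating point operations per second (a measure of hardware performance)
    * FLOPs: floating point operations (a measure of algorithm/model complexity)
      * FLOPs counts are usually derived from the theoretical FP32 compute graph.
    * TOPS: Tera Operations Per Second (1 trillion operations per second = 10¹² ops/sec)
    * GOPS: Giga Operations Per Second (10⁹ ops/sec)
    * MOPS: Mega (Million) Operations Per Second (10⁶ ops/sec)
    * MACs: Multiply-Accumulate Operations (the number of multiply-add operations)
      * Note: usually 1 MAC = 2 FLOPs (1 multiplication + 1 addition)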

### FLOPs vs. FLOPS

* Relationship: $\text{runtime (seconds)} = \frac{\text{FLOPs}}{\text{FLOPS}}$
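A minimal sketch of this relationship follows. The `utilization` parameter is an added assumption (real hardware rarely sustains its peak FLOPS), and the workload and device numbers are hypothetical.

```python
def ideal_runtime_seconds(
    workload_flops: float,
    hardware_flops_per_s: float,
    utilization: float = 1.0,
) -> float:
    """Runtime = FLOPs / FLOPS, optionally derated by an assumed utilization in (0, 1]."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return workload_flops / (hardware_flops_per_s * utilization)


# Hypothetical numbers: a 10 GFLOPs workload on a device with 1 TFLOPS peak.
print(ideal_runtime_seconds(10e9, 1e12))       # ideal case, full utilization
print(ideal_runtime_seconds(10e9, 1e12, 0.5))  # assuming 50% utilization
```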


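### Calculation steps (estimation method)

* Step 1: Compute the model's total compute cost (MACs)
    - Model compute is usually measured in MACs (multiply-accumulate operations) or FLOPs (floating point operations).
        + Tools: for CNN models, tools such as ptflops (PyTorch) or the TensorFlow Model Analyzer can produce an estimate.
        + By hand, for one convolution layer: $\text{MACs} = K \times K \times C_{in} \times C_{out} \times H_{out} \times W_{out}$
            * where:
            * $K$: kernel size
            * $C_{in}, C_{out}$: number of input and output channels
            * $H_{out}, W_{out}$: height and width of the output feature map
* Step 2: Account for the frame rate (inference frequency)
    * If you need to process `N` frames per second: ``total MACs per second (MACs/s) = MACs per frame × N``
    * For example:
        * One inference takes 5 GOPs (5 × 10⁹ ops)
        * You want real-time processing at 30 FPS
        * Then: $$5 \times 10^{9} \times 30 = 150 \times 10^{9}\ \text{ops/s} = 150\ \text{GOPS} = 0.15\ \text{TOPS}$$

* Step 3: Derive the required TOPS: $\text{required TOPS} = \frac{\text{ops per frame} \times \text{frame rate}}{10^{12}}$, where ops per frame ≈ 2 × MACs per frame (1 MAC = 2 operations).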
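The three steps above can be sketched in a few lines. The layer shape used here is hypothetical, the helper names are illustrative, and the convolution formula assumes no bias and no grouped convolution.

```python
def conv2d_macs(k: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    """MACs of one K x K convolution layer (no bias, no groups)."""
    return k * k * c_in * c_out * h_out * w_out


def required_tops(ops_per_frame: float, fps: float) -> float:
    """Sustained throughput in TOPS needed to process `fps` frames per second."""
    return ops_per_frame * fps / 1e12


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 output feature map.
macs = conv2d_macs(3, 64, 128, 56, 56)
ops = 2 * macs  # 1 MAC = 2 operations (one multiply + one add)
print(f"layer MACs: {macs:,}, ops: {ops:,}")

# The example from the text: 5 GOPs per inference at 30 FPS.
print(f"required throughput: {required_tops(5e9, 30):.2f} TOPS")
```

For a full network, sum `conv2d_macs` over all layers before converting to operations and multiplying by the frame rate.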

### Tools

* PyTorch models:

```python
from ptflops import get_model_complexity_info
from torchvision.models import resnet18

model = resnet18()
macs, params = get_model_complexity_info(model, (3, 224, 224), as_strings=True, print_per_layer_stat=False)
print(f"MACs: {macs}, Params: {params}")
```

Analyzing a multimodal PyTorch model:

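```python
import torch
from fvcore.nn import FlopCountAnalysis

# MyMultimodalModel is a placeholder for your own nn.Module taking (image, text) inputs.
model = MyMultimodalModel().eval()
img_tensor = torch.randn(1, 3, 224, 224)        # dummy image batch (shape assumed)
text_tensor = torch.randint(0, 30000, (1, 32))  # dummy token IDs (vocab size assumed)
inputs = (img_tensor, text_tensor)
flops = FlopCountAnalysis(model, inputs)
# Note: fvcore counts one fused multiply-add as one FLOP.
print(f"Total FLOPs: {flops.total() / 1e9} GFLOPs")
```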


* TensorFlow models:
    - Use `tf.profiler` or Netron to analyze the compute graph.


* ONNX model analysis
    - Use [Netron](https://netron.app) to inspect the structure, or tools such as [onnxruntime-tools](https://github.com/microsoft/onnxruntime-tools) (`onnxruntime-tools.flops`).


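### Estimating TOPS for large multimodal models

Estimating TOPS for a large multimodal model, for example vision + language (V+L), speech + text, or image-text + action, is more involved than for a single-modality model, because such a model usually contains several submodules (ViT, Transformer, Q-Former, LLaMA, and so on) and a large amount of cross-attention. The steps below walk through the estimate.

* Split the model into modules
    - Identify each submodule and its role. A typical structure looks like this:

        | Module          | Type                      | Example components          |
        | --------------- | ------------------------- | --------------------------- |
        | Image encoder   | CNN / ViT                 | ResNet, Swin, CLIP ViT-B/32 |
        | Text encoder    | Transformer Encoder       | BERT, Q-Former              |
        | Decoder / LLM   | Transformer Decoder       | LLaMA, GPT, T5              |
        | Cross Attention | Multimodal alignment module | Fuses image-text / audio-text features |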


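* Estimate the FLOPs or MACs of each submodule separately
    - Image encoder (e.g. ViT-B/16)
        * ViT FLOPs can be estimated from these rule-of-thumb values:
          * ViT-B/16 @ 224x224 images: roughly **17.6 GFLOPs**
          * ViT-L/14 @ 224x224: about **60 GFLOPs**
          * Notes
            * **ViT-B/16**:
              * "B" stands for the **Base** model;
              * "/16" means the input image is split into **16×16-pixel patches**;
              * the input resolution is **224×224** (the common ImageNet size);
              * the total is about **17.6 GFLOPs**, which can be read as the cost of one forward pass on a single image.
            * **ViT-L/14**:
              * "L" stands for the **Large** model, which has more parameters;
              * the patch size is 14×14; smaller patches mean more patches per image (256 instead of 196 at 224×224);
              * the resolution is also 224×224;
              * FLOPs rise sharply to about **60 GFLOPs**.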

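    - Text modules (e.g. Q-Former, LLaMA)
        * Assume an input text length of 32 tokens.
        * Q-Former has 12 layers and LLaMA 7B has 32 layers; each layer consists of attention + FFN.
        * The FLOPs of an $L$-layer transformer are approximately: $\text{FLOPs} \approx 2 \cdot L \cdot T^2 \cdot d + 4 \cdot L \cdot T \cdot d^2$
        * where:
            * $L$: number of layers
            * $T$: number of tokens
            * $d$: hidden size (4096 for LLaMA 7B)
        * Example:
            * LLaMA 7B inference on a 32-token input takes roughly **350\~400 GFLOPs**.
            * The simplified formula gives ``2*32*32^2*4096 + 4*32*32*4096^2 = 68,987,912,192 ≈ 69 GFLOPs``. It comes out lower because its $4 \cdot d^2$ term undercounts the projection and FFN matrix multiplications; the 350\~400 figure is in line with the common rule of thumb of about 2 FLOPs per parameter per token (2 × 7×10⁹ × 32 ≈ 448 GFLOPs).
    - Cross-attention cost also has to be included
        * Assume 32 image tokens and 32 text tokens.
        * FLOPs per cross-attention layer: $\text{FLOPs} = 2 \cdot T_q \cdot T_k \cdot d$
        * With $T_q = 32$, $T_k = 32$, $d = 768$, this is about 1.57 MFLOPs per layer.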
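The two approximations above are easy to evaluate directly, which helps catch arithmetic slips. This is a minimal sketch of the formulas as written in these notes; the function names are illustrative, and the formulas themselves are coarse estimates rather than exact operation counts.

```python
def transformer_flops_simplified(n_layers: int, n_tokens: int, d_model: int) -> int:
    """Simplified estimate from the notes: 2*L*T^2*d + 4*L*T*d^2."""
    attention_term = 2 * n_layers * n_tokens**2 * d_model
    projection_term = 4 * n_layers * n_tokens * d_model**2
    return attention_term + projection_term


def cross_attention_flops(t_q: int, t_k: int, d_model: int) -> int:
    """Per-layer cross-attention estimate from the notes: 2*T_q*T_k*d."""
    return 2 * t_q * t_k * d_model


# LLaMA 7B-like setting from the text: 32 layers, 32 tokens, hidden size 4096.
print(f"{transformer_flops_simplified(32, 32, 4096) / 1e9:.2f} GFLOPs")

# Cross-attention setting from the text: 32 query tokens, 32 key tokens, d = 768.
print(f"{cross_attention_flops(32, 32, 768) / 1e6:.2f} MFLOPs per layer")
```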

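* Sum the FLOPs and multiply by the frame rate or request rate
    * Assume the model runs once per second, with these per-inference costs:
        * Image encoder: 20 GFLOPs
        * Text module: 300 GFLOPs
        * LLM decoding: 100 GFLOPs
        * Cross-attention: 50 GFLOPs
    * One inference costs: $\text{total} = 20 + 300 + 100 + 50 = 470\ \text{GFLOPs}$, so running once per second requires 0.47 TOPS.
    * Running twice per second requires 0.94 TOPS.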
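A small sketch of this aggregation step, using the assumed per-module costs from the list above. The dictionary keys and the function name are illustrative.

```python
# Assumed per-inference costs in GFLOPs, taken from the example above.
MODULE_GFLOPS = {
    "image encoder": 20,
    "text module": 300,
    "LLM decoding": 100,
    "cross-attention": 50,
}


def sustained_tops(gflops_per_inference: float, inferences_per_second: float) -> float:
    """Convert a per-inference cost in GFLOPs into a sustained TOPS requirement."""
    return gflops_per_inference * inferences_per_second / 1e3


total_gflops = sum(MODULE_GFLOPS.values())
for rate in (1, 2):
    print(f"{rate} inference(s)/s -> {sustained_tops(total_gflops, rate):.2f} TOPS")
```

Keeping the per-module numbers in one table makes it easy to see which submodule dominates and to swap in measured values later.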


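### Estimating TOPS for speech-to-speech models

* A typical speech-to-speech model consists of the modules below:
    * End-to-end models (such as Translatotron 2) still contain these components under the hood.

| Module                        | Function                                        | Example components           |
| ----------------------------- | ----------------------------------------------- | ---------------------------- |
| 1. Speech encoder             | Extracts audio features                         | Wav2Vec2, Whisper Encoder    |
| 2. Text decoder (optional)    | Decodes to intermediate text (ASR)              | Transformer Decoder          |
| 3. LLM / understanding module | Understanding + reasoning + response generation | GPT, BERT, Whisper           |
| 4. Text-to-speech (TTS)       | Generates speech from text                      | Tacotron2, FastSpeech2, VITS |
| 5. Vocoder                    | Converts mel spectrograms into a waveform       | HiFi-GAN, WaveRNN            |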


* 1\. Speech encoder (ASR encoder, e.g. Whisper)
    - FLOPs for Whisper Base on 30 seconds of audio:
        * **Base model**: about 1.5 GFLOPs
        * This is the cost of one inference over one 30-second audio clip.
    - If you process audio in real time (one 30-second clip every 30 seconds):
        * Whisper base: about 1.5 / 30 = **0.05 GFLOPS** sustained
    - OpenAI Whisper official benchmark:

        | Model  | FLOPs (30s audio) |
        | ------ | ----------------- |
        | tiny   | 1.0 GFLOPs        |
        | base   | 1.5 GFLOPs        |
        | small  | 5.7 GFLOPs        |
        | medium | 24.6 GFLOPs       |
        | large  | 118 GFLOPs        |
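To turn the per-clip figures in the table into a sustained requirement, divide by the clip length. The sketch below uses the table values as given in these notes; `realtime_factor` is an added, hypothetical knob for processing faster than real time.

```python
# Per-clip encoder cost in GFLOPs for 30 s of audio, copied from the table above.
WHISPER_GFLOPS_PER_CLIP = {
    "tiny": 1.0,
    "base": 1.5,
    "small": 5.7,
    "medium": 24.6,
    "large": 118.0,
}
CLIP_SECONDS = 30.0


def sustained_gflops(gflops_per_clip: float, realtime_factor: float = 1.0) -> float:
    """GFLOPS needed to keep up with audio at `realtime_factor` x real time."""
    return gflops_per_clip / CLIP_SECONDS * realtime_factor


for model_name, cost in WHISPER_GFLOPS_PER_CLIP.items():
    print(f"{model_name}: {sustained_gflops(cost):.3f} GFLOPS at real time")
```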
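* 2\. Understanding module (e.g. GPT)
    * GPT-2 small (117M): about ``20~30 GFLOPs per inference``
    * LLaMA 7B: about ``350~400 GFLOPs per inference (32 tokens)``
* 3\. Text-to-speech module (TTS)
    - For example, FastSpeech2:
        * FastSpeech2 needs 3\~5 GFLOPs to synthesize one sentence (30 tokens).
        * The HiFi-GAN vocoder needs about 10\~20 GFLOPs to generate 1 s of speech (INT8 can reduce this considerably).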

* 4\. Total FLOPs estimate (speech → speech, 30-second clip)
    * Running once per second: $450\ \text{GFLOPs/s} = 0.45\ \text{TOPS}$

| Module                      | Estimated FLOPs      |
| --------------------------- | -------------------- |
| Whisper Large               | 118 GFLOPs           |
| LLM inference               | 300 GFLOPs           |
| TTS (FastSpeech2 + HiFiGAN) | 30 GFLOPs            |
| **Total**                   | **about 450 GFLOPs** |

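* ⚠️ Caveats
    1. **Inference frequency drives the TOPS requirement**
       * For real-time speech (streaming recognition + spoken response), multiply the TOPS requirement by the number of segments processed per second.
    2. **Numeric precision (FP32 / FP16 / INT8) makes a large difference**
       * Whisper large
           * FP16: > 0.4 TOPS
           * INT8: < 0.1 TOPS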


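### Effect of quantization on compute

* Theoretical speedup
    - Taking FP32 as the baseline, the approximate storage and compute ratios for common data types are:

        | Precision | Storage reduction | Compute speedup (theoretical) |
        | --------- | ----------------- | ----------------------------- |
        | FP16      | 2×                | 2×                            |
        | INT8      | 4×                | 2\~4×                         |
        | **INT4**  | **8×**            | **4\~8×**                     |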
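The storage column follows directly from the bit width per parameter. As an illustration, the sketch below computes the weight footprint of a hypothetical 7B-parameter model at each precision; it ignores quantization metadata (scales, zero-points), activations, and the KV cache.

```python
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_size_gb(n_params: float, precision: str) -> float:
    """Weight storage in GB (10^9 bytes) for a given precision."""
    return n_params * BITS_PER_PARAM[precision] / 8 / 1e9


N_PARAMS = 7e9  # hypothetical 7B-parameter model
baseline = weight_size_gb(N_PARAMS, "FP32")
for precision in BITS_PER_PARAM:
    size = weight_size_gb(N_PARAMS, precision)
    print(f"{precision}: {size:.1f} GB ({baseline / size:.0f}x smaller than FP32)")
```

Because generation is bandwidth-bound, a smaller weight footprint also reduces the bytes read per token, which is a large part of why quantized models generate faster in practice.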

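### FLOPs of one Transformer layer

* In a decoder-only architecture (LLaMA, GPT, ChatGPT, and so on), each layer consists mainly of two parts:
  1. **Self-attention** (QKV projections + attention computation + output projection)
  2. **Feed-forward network (FFN)**
* Notation
    * `L`: sequence length (e.g. 2048)
    * `H`: hidden size (e.g. 4096)
    * `n_head`: number of heads (e.g. 32)
    * `d_head`: dimension per head = `H / n_head` (e.g. 128)
    * The counts below treat one multiply-accumulate as one operation; under the 1 MAC = 2 FLOPs convention, double them.
* Attention FLOPs
  * $\underbrace{3L H^2}_{\text{QKV projections}} + \underbrace{2L^2 H}_{\text{scores + weighted sum}} + \underbrace{L H^2}_{\text{output projection}} = 4L H^2 + 2L^2 H$
* Feed-forward network (FFN) FLOPs
  * $2 \times L \times H \times D_{ffn} = 2 \times L \times H \times (2.7H) = 5.4L H^2$, assuming two weight matrices with $D_{ffn} \approx 2.7H$
* Total FLOPs per layer
    * $\boxed{\underbrace{4L H^2 + 2L^2 H}_{\text{Attention}} + \underbrace{5.4L H^2}_{\text{FFN}} = 9.4L H^2 + 2L^2 H}$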

* Example (LLaMA 7B)
    * `L = 2048`
    * `H = 4096`
    * FLOPs per layer
        - `= 9.4L H^2 + 2L^2 H`
        - `= 9.4 * 2048 * 4096*4096 + 2 * 2048*2048 * 4096`
        - `= 357*10^9`
        - ≈ 357 GFLOPs
    * FLOPs for the whole model
        - = number of layers × FLOPs per layer
        - LLaMA 7B has 32 layers, so one forward pass over the full sequence costs about
        - `357 GFLOPs * 32 = 11.4 TFLOPs`
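The per-layer formula and the worked example can be checked with a short script. This is a sketch of the approximation used in these notes (FFN width of about 2.7 × hidden size, two FFN matrices); the function names are illustrative and the counts are not exact for any particular implementation.

```python
def attention_ops(seq_len: int, hidden: int) -> int:
    """QKV projections + scores/weighted sum + output projection: 4*L*H^2 + 2*L^2*H."""
    return 4 * seq_len * hidden**2 + 2 * seq_len**2 * hidden


def ffn_ops(seq_len: int, hidden: int, ffn_ratio: float = 2.7) -> float:
    """Two FFN matrices of width ffn_ratio * H: 2*L*H*(ffn_ratio*H)."""
    return 2 * seq_len * hidden * (ffn_ratio * hidden)


def layer_ops(seq_len: int, hidden: int, ffn_ratio: float = 2.7) -> float:
    """Total per-layer count: attention + FFN."""
    return attention_ops(seq_len, hidden) + ffn_ops(seq_len, hidden, ffn_ratio)


def model_ops(seq_len: int, hidden: int, n_layers: int) -> float:
    """Whole-model count: number of layers x per-layer count."""
    return n_layers * layer_ops(seq_len, hidden)


# LLaMA 7B-like setting from the example: L = 2048, H = 4096, 32 layers.
print(f"per layer: {layer_ops(2048, 4096) / 1e9:.1f} G")
print(f"whole model: {model_ops(2048, 4096, 32) / 1e12:.2f} T")
```

Varying `seq_len` shows how the quadratic `2*L^2*H` term grows relative to the linear terms as the context gets longer.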

## References

* Chatting with Transformers: https://huggingface.co/docs/transformers/main/en/conversations