# TOPS计算

## 计算步骤


### 计算步骤(估算方法)

* 步骤1：计算模型的总计算量（MACs）
    - 模型计算量通常以 MACs（乘加操作数）或 FLOPs（浮点运算次数）来衡量。
        + 工具: 对于 CNN 模型，可以使用 ptflops（PyTorch）或 TensorFlow Model Analyzer 等工具来估算。
        + 手动计算: $\text{MACs} = K \times K \times C_{in} \times C_{out} \times H_{out} \times W_{out}$
            * 其中：
            * $K$: 卷积核大小
            * $C_{in}, C_{out}$: 输入输出通道数
            * $H_{out}, W_{out}$: 输出特征图的高和宽
* 步骤2：考虑运行帧率（推理频率）
    * 假设你要每秒处理 `N` 帧：``每秒总运算量（MACs/s）=每帧MACs×N``
    * 例如：
        * 一个模型推理一次需要 5 GOPs（5 × 10⁹ ops）
        * 想要 30 FPS 的实时处理
        * 则：$$5 × 10⁹ \times 30 = 150 × 10⁹ = 150 GOPs/s = 0.15 TOPS$$

* 步骤3：最终得到所需 TOPS: $\text{所需 TOPS} = \frac{\text{每帧 MACs} \times \text{帧率}}{10^{12}}$

### 多模态大模型计算 TOPS

运行多模态大模型时，比如视觉+语言（V+L）、语音+文本或图文+动作等组合，其 **TOPS 计算方式** 比单一模型更复杂，因为它通常包含多个子模块（如 ViT、Transformer、Q-Former、LLaMA 等）且存在大量 cross-attention。下面我们按步骤系统讲解如何计算多模态大模型所需的 TOPS。

* 划分模块
    - 明确模型中各个子模块的组成和功能。通常结构如下：

        | 模块              | 类型                  | 示例组件                        |
        | --------------- | ------------------- | --------------------------- |
        | 图像编码器           | CNN / ViT           | ResNet, Swin, CLIP ViT-B/32 |
        | 文本编码器           | Transformer Encoder | BERT, Q-Former              |
        | 解码器 / LLM       | Transformer Decoder | LLaMA, GPT, T5              |
        | Cross Attention | 多模态对齐模块             | 用于图文/音文融合                   |


* 分别估算各子模块 FLOPs 或 MACs
    - 图像编码器部分（如 ViT-B/16）
        * ViT 模型 FLOPs 可以通过下面经验值估算：
          * ViT-B/16 @ 224x224 图像，大概需要 **17.6 GFLOPs**
          * ViT-L/14 @ 224x224，约 **60 GFLOPs**
          * 说明
            * **ViT-B/16**：
              * “B”是 **Base** 模型
              * “/16”表示输入图像被划分成 **16×16 的 Patch**；
              * 输入分辨率是 **224×224**（常见的 ImageNet 大小）；
              * 计算总量大约是 **17.6 GFLOPs**，可以视为一张图像推理时的开销。
            * **ViT-L/14**：
              * “L”是 **Large** 模型，参数更多；
              * Patch 大小是 14×14，Patch 更小，意味着 Patch 数更多；
              * 分辨率也是 224×224；
              * FLOPs 大幅上升到 **60 GFLOPs**

    - 文本处理模块（如 Q-Former、LLaMA）
        * 假设输入文本长度为 32 token
        * Q-Former 为 12 层，LLaMA 7B有32层，每层注意力 + FFN。
        * 一层 transformer 的 FLOPs 近似为：$\text{FLOPs} \approx 2 \cdot L \cdot T^2 \cdot d + 4 \cdot L \cdot T \cdot d^2$
        * 其中：
            * $L$：层数
            * $T$：token 长度
            * $d$：hidden size(如 LLaMA7B 是 4096)
        * 示例:
            * LLaMA 7B 推理一个 32 token 输入大约需要 **350\~400 GFLOPs**
            * ``2* 32*32^2*4096 + 4*32*32*4096^2 = 16787458 = 16.787GFLOPs``
    - Cross-Attention 计算开销也要考虑
        * 假设有 32 个图像token 和 32 个text token
        * 每层 cross-attention 的 FLOPs 为：$FLOPs = 2 \cdot T_q \cdot T_k \cdot d$
        * 若 T\_q=32，T\_k=32，d=768，总 FLOPs 每层约为 1.5M。

* 累加 FLOPs 总量，乘以帧率或请求频率
    * 假设每秒运行 1 次，则：
        * 图像编码器：20 GFLOPs
        * 文本模块：300 GFLOPs
        * LLM 解码：100 GFLOPs
        * Cross-Attention：50 GFLOPs
    * 总推理一次为：$\text{总计算量} = 20 + 300 + 100 + 50 = 470 GFLOPs = 0.47 TOPs$
    * 如每秒运行 2 次，则需要 0.94 TOPs。


### 语音→语音模型计算 TOPS

* 常见语音到语音模型的模块如下：
    * end-to-end模型（如 Translatotron 2），背后仍包含这些组件。

| 模块              | 功能           | 示例组件                         |
| --------------- | ------------ | ---------------------------- |
| 1. 语音编码器        | 提取音频特征       | Wav2Vec2, Whisper Encoder    |
| 2. 文本解码器（可选）    | 解码为中间文本（ASR） | Transformer Decoder          |
| 3. LLM/中间理解模块   | 理解+推理+生成回答   | GPT, BERT, Whisper           |
| 4. 文本到语音（TTS）   | 文本生成语音       | Tacotron2, FastSpeech2, VITS |
| 5. 声码器（Vocoder） | 把mel谱转换为语音信号 | HiFi-GAN, WaveRNN            |


* 1.语音编码器（ASR Encoder，如 Whisper）
    - Whisper Base 模型对 30 秒语音处理 FLOPs 大概是：
        * **Base 模型**：约 1.5 GFLOPs
        * 这些是一次推理一次音频片段（30 秒）
    - 如果你每秒运行一次推理，则：
        * Whisper base：约 **0.05 GOPS/s**
    - OpenAI Whisper 官方 benchmark：
        | Model  | FLOPs (30s audio) |
        | ------ | ----------------- |
        | tiny   | 1.0 GFLOPs        |
        | base   | 1.5 GFLOPs        |
        | small  | 5.7 GFLOPs        |
        | medium | 24.6 GFLOPs       |
        | large  | 118 GFLOPs        |
* 2.中间理解模块（如 GPT）
    * GPT-2 small (117M)：约 ``20~30 GFLOPs / 推理``
    * LLaMA 7B：约 ``350~400 GFLOPs / 推理（32 token）``
* 3.文本到语音模块（TTS）
    - 例如 FastSpeech2：
        * FastSpeech2 推理一个句子（30 token）需要 3~5 GFLOPs
        * HiFi-GAN vocoder（生成 1s 语音）约 10~20 GFLOPs（INT8 可以降很多）

* 4.总 FLOPs 估算（语音→语音，30秒片段）
    * 若每秒运行一次：$450 GFLOPs = 0.45 TOPS$

| 模块                          | 估算 FLOPs         |
| --------------------------- | ---------------- |
| Whisper Large               | 118 GFLOPs       |
| LLM推理                       | 300 GFLOPs       |
| TTS (FastSpeech2 + HiFiGAN) | 30 GFLOPs        |
| **总计**                      | **约 450 GFLOPs** |

* ⚠️ 计算注意点
    1. **推理频率影响 TOPS 需求**
       * 如果是实时语音（流式识别 + 语音响应），则 TOPS 需求需乘以每秒片段数量。
    2. **不同数据精度（FP32 / FP16 / INT8）差异极大**
       * Whisper large 
           * FP16 模式 > 0.4 TOPS
           * INT8 版 < 0.1 TOPS。


### 量化对计算量的影响

* 理论计算加速比
    - 通常我们以 FP32 为基准，几种数据类型的计算/存储压缩比大致如下：

        | 精度类型     | 存储缩小比  | 计算加速比（理论） |
        | -------- | ------ | --------- |
        | FP16     | 2×     | 2×        |
        | INT8     | 4×     | 2\~4×     |
        | **INT4** | **8×** | **4\~8×** |

### 一层 Transformer 的 FLOPs

* 对于 decoder-only 的架构（比如 LLaMA、GPT、ChatGPT 等），每一层主要包括以下两部分：
  1. **Self-Attention 部分**（QKV + Attention 计算 + 输出 projection）
  2. **Feed-Forward Network（FFN）部分**
* 参数说明
    * `L`：序列长度（例如 2048）
    * `H`：hidden size（例如 4096）
    * `n_head`：head 数量（例如 32）
    * `d_head`：每个 head 的维度 = `H / n_head`（例如 128）
* Attention 部分 FLOPs
  * $\underbrace{3L H^2}_{QKV投影} + \underbrace{2L^2 H}_{打分+输出} + \underbrace{L H^2}_{输出投影}= 4L H^2 + 2L^2 H$
* Feed-Forward Network（FFN）部分 FLOPs
  * $2 \times L \times H \times D_{ffn} = 2 \times L \times H \times (2.7H) = 5.4L H^2$
* 总 FLOPs / Layer
    * $\boxed{\underbrace{4L H^2 + 2L^2 H}_{Attention} + \underbrace{5.4L H^2}_{FFN}= (9.4L H^2 + 2L^2 H)}$

* 示例(LLaMA 7B)
    * `L = 2048`
    * `H = 4096`
    * 总FLOPs / Layer 
        - = 9.4L H^2 + 2L^2 H 
        - = 9.4 * 2048 * 4096*4096 + 2 * 2048*2048 * 4096
        - = 357\*10^9
        - ≈ 357 GFLOPs
    * 总模型 FLOPs 
        - = 层数 × 每层 FLOPs 
        - → LLaMA 7B 有 32 层，所以推理一次约 
        - = 357 GFLOPs * 32 = 11.4 TFLOPs


## 计算步骤2

```note
**FLOP:Byte比** 是衡量「计算 vs 带宽瓶颈」的重要指标；
```

* AMD Ryzen 7950X has 
    * 67 GB/s memory bandwidth
    * 2735 GFLOPS
    * 40:1 FLOP:byte ratio
      * 计算方式: 2735/67 = 40
* NVidia GeForce RTX 4090 has 
    * 1008 GB/s memory bandwidth 
    * 83 TFLOPS,
    * 82:1 FLOP:byte ratio
        * 计算方式: 83\*1000/1008 = 82

* NVidia H100 SXM (which is a data-center card) has 
    * 3350 GB/s memory bandwidth and 
    * 67 TFLOPS,
    * 20:1 FLOP:byte; however(seemingly more modest), 
        * 计算方式: 67\*1000/3350 = 20
        * 对于像 Transformer 中常见的 矩阵乘法类问题（GEMM），启用专用 Tensor Core后算力激增到 494 TFLOPS
            * ~494 TFLOPS using **tensor cores** without sparsity
            * 147:1 FLOP:byte ratio.


### 示例-Mistral 7B

4 * 8192 * 4096^2 + 8192^2 * 4096


* 参数组成
    * 基本参数
        * 序列长度(**L**): 8192
        * hidden size(**H**): 4096
        * Attention head 数(**n_head**): 32
        * 每个 head 的维度(**d_head**): 4096/32=128
        * Transformer层数(**layers**): 32
    * 嵌入矩阵参数: 
        * 4096 * 32000 = 131M
        * 此矩阵不用于矩阵向量乘法，因为每个标记仅读取矩阵的一行，因此我们不会将其包含在带宽计算中
    * computing attention-related vectors
        * 32 * (4096 * (128 * 32 + 128 * 8 * 2) + 4096 * 128 * 32) = 1342M
    * transforming hidden state via a feed-forward network
        * 32 * (4096 * 14336 * 3) = 5637M


## 工具

* PyTorch 模型：

```python
from ptflops import get_model_complexity_info
from torchvision.models import resnet18

model = resnet18()
macs, params = get_model_complexity_info(model, (3, 224, 224), as_strings=True, print_per_layer_stat=False)
print(f"MACs: {macs}, Params: {params}")
```

对 PyTorch 多模态模型进行分析

```python
from fvcore.nn import FlopCountAnalysis
model = MyMultimodalModel()
inputs = (img_tensor, text_tensor)
flops = FlopCountAnalysis(model, inputs)
print(f"Total FLOPs: {flops.total() / 1e9} GFLOPs")
```


* TensorFlow 模型：
    - 使用 `tf.profiler` 或 Netron 分析计算图。


* ONNX 模型分析
    - 使用 [Netron](https://netron.app) 查看结构，或使用 tools like [onnxruntime-tools](https://github.com/microsoft/onnxruntime-tools) 的 `onnxruntime-tools.flops`.