TOPS Calculation

Calculation Steps

Calculation Steps (Estimation Method)

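Step 1: Compute the model's total compute per inference (MACs)
- Model compute is usually measured in MACs (multiply-accumulate operations) or FLOPs (floating-point operations). One MAC is conventionally counted as two operations (one multiply and one add).
  - Tools: for CNN models, ptflops (PyTorch) or the TensorFlow Model Analyzer can produce an estimate.
  - Manual calculation for one convolution layer: $MACs = K \times K \times C_{in} \times C_{out} \times H_{out} \times W_{out}$
    - where:
    - $K$: kernel size
    - $C_{in}, C_{out}$: number of input and output channels
    - $H_{out}, W_{out}$: height and width of the output feature map
Step 2: Account for the frame rate (inference frequency)
- If you need to process N frames per second: total compute per second (MACs/s) = MACs per frame × N
- For example:
  - one inference of the model takes 5 GOPs ($5 \times 10^{9}$ ops)
  - the target is real-time processing at 30 FPS
  - then: $5 \times 10^{9} \times 30 = 150 \times 10^{9}\ \text{ops/s} = 150\ \text{GOPS} = 0.15\ \text{TOPS}$
Step 3: Obtain the required TOPS: $\text{Required TOPS} = \frac{\text{MACs per frame} \times \text{FPS}}{10^{12}}$. If the hardware's TOPS rating counts one MAC as two operations, multiply the result by 2.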
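The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `conv2d_macs` and `required_tops` and the example layer shape are assumptions made here, not part of any library.

```python
def conv2d_macs(k, c_in, c_out, h_out, w_out):
    """MACs of one standard (non-grouped) KxK convolution layer, ignoring bias."""
    return k * k * c_in * c_out * h_out * w_out


def required_tops(ops_per_frame, fps):
    """Sustained throughput in TOPS needed to run ops_per_frame at fps."""
    return ops_per_frame * fps / 1e12


# Example from the text: 5 GOPs per inference at 30 FPS
print(required_tops(5e9, 30))

# Hypothetical layer: 3x3 conv, 64 -> 128 channels, 56x56 output feature map
layer_macs = conv2d_macs(3, 64, 128, 56, 56)
print(layer_macs)
# Counting one MAC as two operations when converting to TOPS
print(required_tops(2 * layer_macs, 30))
```

For a whole network, the per-layer MACs would be summed before calling `required_tops`.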

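Computing TOPS for Multimodal Large Models

Running a multimodal large model, such as vision + language (V+L), speech + text, or image-text + action, makes the TOPS calculation more involved than for a single model, because the model usually consists of several submodules (ViT, Transformer, Q-Former, LLaMA, and so on) and relies heavily on cross-attention. The steps below walk through how to estimate the TOPS such a model requires.

Split the model into modules

Identify the composition and function of each submodule. A typical structure looks like this:

Module	Type	Example components
Image encoder	CNN / ViT	ResNet, Swin, CLIP ViT-B/32
Text encoder	Transformer Encoder	BERT, Q-Former
Decoder / LLM	Transformer Decoder	LLaMA, GPT, T5
Cross Attention	Multimodal alignment module	Fuses image-text / audio-text features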

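Estimate the FLOPs or MACs of each submodule separately
- Image encoder (e.g. ViT-B/16)
  - ViT FLOPs can be estimated from these empirical values:
    - ViT-B/16 at 224×224 input: roughly 17.6 GFLOPs
    - ViT-L/14 at 224×224 input: roughly 60 GFLOPs
    - Notes
      - ViT-B/16:
        - "B" stands for the Base model;
        - "/16" means the input image is split into patches of 16×16 pixels;
        - the input resolution is 224×224 (the common ImageNet size), which gives 14×14 = 196 patches;
        - the total is about 17.6 GFLOPs, which can be treated as the cost of one image inference.
      - ViT-L/14:
        - "L" stands for the Large model, which has more parameters;
        - the patch size is 14×14 pixels; smaller patches mean more patches (16×16 = 256);
        - the resolution is again 224×224;
        - FLOPs rise sharply, to about 60 GFLOPs.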
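- Text processing modules (e.g. Q-Former, LLaMA)
  - Assume an input text length of 32 tokens.
  - Q-Former has 12 layers and LLaMA 7B has 32 layers; each layer consists of attention + FFN.
  - The FLOPs of a Transformer stack with $L$ layers are approximately: $FLOPs \approx 2 \cdot L \cdot T^{2} \cdot d + 4 \cdot L \cdot T \cdot d^{2}$
  - where:
    - $L$: number of layers
    - $T$: sequence length in tokens
    - $d$: hidden size (4096 for LLaMA 7B)
  - Example:
    - As a rule of thumb, LLaMA 7B inference on a 32-token input needs about 350~400 GFLOPs.
    - Plugging $L = 32$, $T = 32$, $d = 4096$ into the simplified formula gives $2 \cdot 32 \cdot 32^{2} \cdot 4096 + 4 \cdot 32 \cdot 32 \cdot 4096^{2} \approx 0.27 \times 10^{9} + 68.72 \times 10^{9} \approx 69.0$ GFLOPs. This is lower than the rule-of-thumb figure because the simplified formula covers only the attention terms and leaves out the FFN.
- The cost of cross-attention must also be counted
  - Assume 32 image tokens and 32 text tokens.
  - The FLOPs of each cross-attention layer are: $FLOPs = 2 \cdot T_{q} \cdot T_{k} \cdot d$
  - With $T_q = 32$, $T_k = 32$, $d = 768$, each layer costs about 1.57 MFLOPs.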
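The two approximations above are easy to check numerically. The sketch below simply evaluates the formulas as written in the text; the function names are made up for this illustration.

```python
def transformer_stack_flops(num_layers, seq_len, hidden):
    """Simplified estimate from the text: 2*L*T^2*d + 4*L*T*d^2."""
    attention_scores = 2 * num_layers * seq_len**2 * hidden
    projections = 4 * num_layers * seq_len * hidden**2
    return attention_scores + projections


def cross_attention_flops(t_q, t_k, hidden):
    """Per-layer cross-attention estimate from the text: 2*T_q*T_k*d."""
    return 2 * t_q * t_k * hidden


# LLaMA 7B-like stack: 32 layers, 32 tokens, hidden size 4096
print(transformer_stack_flops(32, 32, 4096) / 1e9, "GFLOPs")

# One cross-attention layer: 32 query tokens, 32 key tokens, d = 768
print(cross_attention_flops(32, 32, 768) / 1e6, "MFLOPs")
```

Because the $T^{2}$ term grows quadratically, rerunning the first call with a longer sequence shows how quickly the attention-score cost catches up with the projection cost.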
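Sum the FLOPs and multiply by the frame rate or request rate
- Assume the model runs once per second, with these per-inference costs:
  - Image encoder: 20 GFLOPs
  - Text module: 300 GFLOPs
  - LLM decoding: 100 GFLOPs
  - Cross-attention: 50 GFLOPs
- One inference therefore costs: $\text{Total compute} = 20 + 300 + 100 + 50 = 470\ \text{GFLOPs}$, which at one inference per second corresponds to 0.47 TOPS.
- At 2 inferences per second, 0.94 TOPS is required.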
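The same bookkeeping can be written as a small script, which is handy once the per-module numbers come from a profiler rather than from assumptions. The dictionary keys and the helper name `pipeline_tops` are illustrative choices, and the values are the assumed figures from the list above.

```python
# Per-inference cost of each module in GFLOPs (assumed values from the text)
module_gflops = {
    "image_encoder": 20,
    "text_module": 300,
    "llm_decoding": 100,
    "cross_attention": 50,
}


def pipeline_tops(gflops_by_module, runs_per_second):
    """Required throughput in TOPS for a given number of inferences per second."""
    total_gflops = sum(gflops_by_module.values())
    return total_gflops * runs_per_second / 1000  # GFLOPS -> TOPS


print(pipeline_tops(module_gflops, 1))
print(pipeline_tops(module_gflops, 2))
```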

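Computing TOPS for Speech-to-Speech Models

A typical speech-to-speech model is built from the following modules:
- End-to-end models (such as Translatotron 2) still contain these components internally.

Module	Function	Example components
1. Speech encoder	Extracts audio features	Wav2Vec2, Whisper Encoder
2. Text decoder (optional)	Decodes to intermediate text (ASR)	Transformer Decoder
3. LLM / intermediate understanding module	Understanding + reasoning + response generation	GPT, BERT, Whisper
4. Text-to-speech (TTS)	Generates speech from text	Tacotron2, FastSpeech2, VITS
5. Vocoder	Converts the mel spectrogram into a speech waveform	HiFi-GAN, WaveRNN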

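1. Speech encoder (ASR encoder, e.g. Whisper)
- FLOPs for the Whisper Base model on 30 seconds of speech:
  - Base model: about 1.5 GFLOPs
  - This is the cost of one inference on one audio clip (30 seconds).
- For real-time processing, one 30-second clip has to be processed every 30 seconds, so the sustained rate is:
  - Whisper base: 1.5 GFLOPs / 30 s ≈ 0.05 GFLOPS
- OpenAI Whisper official benchmark: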
  
Model	FLOPs (30s audio)
tiny	1.0 GFLOPs
base	1.5 GFLOPs
small	5.7 GFLOPs
medium	24.6 GFLOPs
large	118 GFLOPs

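2. Intermediate understanding module (e.g. GPT)
- GPT-2 small (117M): about 20~30 GFLOPs per inference
- LLaMA 7B: about 350~400 GFLOPs per inference (32 tokens)
3. Text-to-speech module (TTS)
- For example, FastSpeech2:
  - FastSpeech2 needs 3~5 GFLOPs to synthesize one sentence (30 tokens)
  - The HiFi-GAN vocoder needs about 10~20 GFLOPs to generate 1 s of speech (INT8 can reduce this considerably)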
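4. Total FLOPs estimate (speech-to-speech, 30-second clip)
- If the pipeline runs once per second: $450\ \text{GFLOPs per run} \times 1\ \text{run/s} = 0.45\ \text{TOPS}$

Module	Estimated FLOPs
Whisper Large	118 GFLOPs
LLM inference	300 GFLOPs
TTS (FastSpeech2 + HiFiGAN)	30 GFLOPs
Total	about 450 GFLOPs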
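How often the pipeline is invoked changes the sustained requirement considerably. The sketch below contrasts one run per second with one 30-second clip every 30 seconds, using the estimates from the table; `sustained_tops` and the dictionary keys are hypothetical names introduced for this illustration.

```python
# Estimated per-clip costs in GFLOPs, taken from the table
pipeline_gflops = {
    "whisper_large": 118,
    "llm_inference": 300,
    "tts_fastspeech2_hifigan": 30,
}


def sustained_tops(gflops_per_clip, clips_per_second):
    """Sustained TOPS when clips_per_second clips are processed each second."""
    return gflops_per_clip * clips_per_second / 1000


total = sum(pipeline_gflops.values())  # 448, i.e. about 450 GFLOPs
print(total)
print(sustained_tops(total, 1.0))       # one clip per second
print(sustained_tops(total, 1.0 / 30))  # one 30 s clip every 30 s
```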

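⚠️ Notes on the calculation
1. Inference frequency affects the TOPS requirement
  - For real-time speech (streaming recognition + spoken response), multiply the TOPS requirement by the number of clips processed per second.
2. Data precision (FP32 / FP16 / INT8) makes a very large difference
  - Whisper large
    - FP16 mode: > 0.4 TOPS
    - INT8 version: < 0.1 TOPS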

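Effect of Quantization on Compute

Theoretical compute speedup
- Taking FP32 as the baseline, the compute and storage ratios of common data types are roughly:

Precision	Storage reduction	Compute speedup (theoretical)
FP16	2×	2×
INT8	4×	2~4×
INT4	8×	4~8×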
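The storage column follows directly from the bit width of each data type. As a rough sketch, the snippet below computes the weight footprint of a hypothetical 7-billion-parameter model at each precision; it ignores quantization metadata such as scales and zero points, and the names `BITS`, `weight_bytes`, and `storage_reduction` are assumptions of this example.

```python
# Bits used to store one weight at each precision
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params, precision):
    """Bytes needed to store num_params weights (no quantization metadata)."""
    return num_params * BITS[precision] // 8


def storage_reduction(precision, baseline="FP32"):
    """Storage reduction factor relative to the baseline precision."""
    return BITS[baseline] / BITS[precision]


num_params = 7_000_000_000  # hypothetical 7B-parameter model
for precision in BITS:
    gigabytes = weight_bytes(num_params, precision) / 1e9
    print(f"{precision}: {gigabytes:.1f} GB, {storage_reduction(precision):.0f}x smaller than FP32")
```

The compute speedup column cannot be derived this way, since it depends on which low-precision units the hardware actually provides.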

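FLOPs of One Transformer Layer

For decoder-only architectures (LLaMA, GPT, ChatGPT, and so on), each layer consists mainly of two parts:
1. Self-attention (QKV projections + attention computation + output projection)
2. Feed-forward network (FFN)
Parameters (in this section L denotes the sequence length, not the number of layers)
- L: sequence length (e.g. 2048)
- H: hidden size (e.g. 4096)
- n_head: number of heads (e.g. 32)
- d_head: dimension of each head = H / n_head (e.g. 128)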
Attention FLOPs
- $\underbrace{3 L H^{2}}_{\text{QKV projections}} + \underbrace{2 L^{2} H}_{\text{scores + weighted output}} + \underbrace{L H^{2}}_{\text{output projection}} = 4 L H^{2} + 2 L^{2} H$
Feed-forward network (FFN) FLOPs
- $2 \times L \times H \times D_{ffn} = 2 \times L \times H \times (2.7 H) = 5.4 L H^{2}$
Total FLOPs per layer
- $\underbrace{4 L H^{2} + 2 L^{2} H}_{\text{Attention}} + \underbrace{5.4 L H^{2}}_{\text{FFN}} = 9.4 L H^{2} + 2 L^{2} H$
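Example (LLaMA 7B)
- L = 2048
- H = 4096
- Total FLOPs per layer
  - $= 9.4 L H^{2} + 2 L^{2} H$
  - $= 9.4 \times 2048 \times 4096^{2} + 2 \times 2048^{2} \times 4096$
  - $\approx 323 \times 10^{9} + 34 \times 10^{9} = 357 \times 10^{9}$
  - ≈ 357 GFLOPs
- Total model FLOPs
  - = number of layers × FLOPs per layer
  - LLaMA 7B has 32 layers, so one pass over the 2048-token sequence costs about
  - = 357 GFLOPs × 32 ≈ 11.4 TFLOPs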
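The per-layer formula can be evaluated directly to reproduce the example. This sketch uses the same assumptions as the derivation ($D_{ffn} = 2.7H$ and the FLOP counting used above); `layer_flops` is a name chosen for this illustration.

```python
def layer_flops(seq_len, hidden, ffn_ratio=2.7):
    """FLOPs of one decoder layer: (4LH^2 + 2L^2H) attention + 2*L*H*D_ffn FFN."""
    attention = 4 * seq_len * hidden**2 + 2 * seq_len**2 * hidden
    ffn = 2 * seq_len * hidden * (ffn_ratio * hidden)
    return attention + ffn


per_layer = layer_flops(2048, 4096)
print(f"{per_layer / 1e9:.1f} GFLOPs per layer")
print(f"{32 * per_layer / 1e12:.2f} TFLOPs for 32 layers")
```

Doubling `seq_len` more than doubles the result, because the $2L^{2}H$ attention term grows quadratically with sequence length.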

Calculation Steps 2

Notes

The FLOP:byte ratio is an important indicator of whether a workload is limited by compute or by memory bandwidth.

AMD Ryzen 7950X has
- 67 GB/s memory bandwidth
- 2735 GFLOPS
- about 40:1 FLOP:byte ratio
  - Calculation: 2735 / 67 ≈ 40.8
NVIDIA GeForce RTX 4090 has
- 1008 GB/s memory bandwidth
- 83 TFLOPS
- 82:1 FLOP:byte ratio
  - Calculation: 83 × 1000 / 1008 ≈ 82
NVIDIA H100 SXM (a data-center card) has
- 3350 GB/s memory bandwidth
- 67 TFLOPS
- a seemingly more modest 20:1 FLOP:byte ratio
  - Calculation: 67 × 1000 / 3350 = 20
  - However, for matrix-multiplication workloads (GEMM), which are common in Transformers, enabling the dedicated Tensor Cores raises throughput sharply:
    - ~494 TFLOPS using tensor cores without sparsity
    - 147:1 FLOP:byte ratio
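The ratios above are simply peak compute divided by memory bandwidth. A minimal sketch that recomputes them from the quoted figures follows; the dictionary layout and the function name `flop_per_byte` are illustrative.

```python
# (peak GFLOPS, memory bandwidth in GB/s), figures quoted in the text
devices = {
    "AMD Ryzen 7950X": (2735, 67),
    "NVIDIA GeForce RTX 4090": (83_000, 1008),
    "NVIDIA H100 SXM": (67_000, 3350),
    "NVIDIA H100 SXM (tensor cores)": (494_000, 3350),
}


def flop_per_byte(gflops, gb_per_s):
    """FLOPs the device can execute per byte read from memory."""
    return gflops / gb_per_s


for name, (gflops, bandwidth) in devices.items():
    print(f"{name}: {flop_per_byte(gflops, bandwidth):.1f} FLOP per byte")
```

A workload that performs fewer FLOPs per byte than the device's ratio is limited by memory bandwidth rather than by compute.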

Example: Mistral 7B

4 * 8192 * 4096^2 + 8192^2 * 4096

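Parameter breakdown
- Basic parameters
  - Sequence length (L): 8192
  - Hidden size (H): 4096
  - Number of attention heads (n_head): 32
  - Dimension per head (d_head): 4096 / 32 = 128
  - Number of Transformer layers (layers): 32
- Embedding matrix parameters:
  - 4096 * 32000 = 131M
  - This matrix is not used in a matrix-vector multiplication, because each token reads only one row of it, so it is not included in the bandwidth calculation.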
- computing attention-related vectors
  - 32 * (4096 * (128 * 32 + 128 * 8 * 2) + 4096 * 128 * 32) = 1342M
- transforming hidden state via a feed-forward network
  - 32 * (4096 * 14336 * 3) = 5637M
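The parameter counts above can be reproduced with a few lines of arithmetic. The variable names are chosen for this sketch; reading the `128 * 8 * 2` term as the K and V projections of 8 key/value heads, and the factor 3 as three FFN weight matrices per layer, is an interpretation of the expressions in the text.

```python
hidden, layers, vocab = 4096, 32, 32000
n_head, n_kv_head, d_head, ffn_dim = 32, 8, 128, 14336

# Embedding table: one row of size `hidden` per vocabulary entry
embedding = hidden * vocab

# Per layer: Q projection + K and V projections (8 KV heads each) + output projection
attention = layers * (
    hidden * (d_head * n_head + d_head * n_kv_head * 2) + hidden * d_head * n_head
)

# Per layer: three hidden x ffn_dim weight matrices
ffn = layers * (hidden * ffn_dim * 3)

for name, count in [("embedding", embedding), ("attention", attention), ("ffn", ffn)]:
    print(f"{name}: {count / 1e6:.0f}M parameters")
```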

Tools

PyTorch models:

from ptflops import get_model_complexity_info
from torchvision.models import resnet18

model = resnet18()
macs, params = get_model_complexity_info(model, (3, 224, 224), as_strings=True, print_per_layer_stat=False)
print(f"MACs: {macs}, Params: {params}")

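Analyzing a PyTorch multimodal model

from fvcore.nn import FlopCountAnalysis

# MyMultimodalModel, img_tensor and text_tensor are placeholders for your own model and example inputs
model = MyMultimodalModel().eval()
inputs = (img_tensor, text_tensor)
flops = FlopCountAnalysis(model, inputs)
# fvcore counts one fused multiply-add as one FLOP
print(f"Total FLOPs: {flops.total() / 1e9:.2f} GFLOPs")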

TensorFlow models:
- Use tf.profiler to count operations, or Netron to inspect the computation graph.
ONNX model analysis
- Use Netron to inspect the structure, or use tools such as onnxruntime-tools (onnxruntime-tools.flops).
