# General

* Text generation with a chat model is bottlenecked by **memory bandwidth**, not **compute**, because every active parameter must be read from memory for each token the model generates.
* As a result, the number of tokens per second you can generate from a chat model is roughly proportional to ``total memory bandwidth / model size`` (a back-of-the-envelope Python sketch appears at the end of this note).
* An 8B model loaded in bfloat16 precision takes up ~16GB, so 16GB must be read from memory for every token the model generates.
* Total memory bandwidth:
    - 20-100GB/sec for consumer CPUs
    - 200-900GB/sec for consumer GPUs and specialized CPUs such as Intel Xeon, AMD Threadripper/Epyc, or high-end Apple Silicon
    - up to 2-3TB/sec for data center GPUs such as the Nvidia A100 or H100
* These figures give a good sense of the generation speed you can expect from each class of hardware.

![](https://img.zhaoweiguo.com/uPic/2025/03/pyig7w.png)

Bandwidth of SRAM and HBM inside a GPU versus CPU DRAM

## OPS

* Related concepts:
    * FLOPS: floating point operations per second (a measure of hardware performance)
    * FLOPs: Floating Point Operations (a count of floating-point operations, a measure of algorithm/model complexity)
        * FLOPs counts are usually based on the FP32 theoretical compute graph
    * TOPS: Tera Operations Per Second (10¹² ops/sec)
    * GOPS: Giga Operations Per Second
    * MOPS: Million Operations Per Second
    * MACs: Multiply-Accumulate Operations
        * Note: 1 MAC = 2 FLOPs (1 multiply + 1 add)

### FLOPs vs. FLOPS

* Relationship: $\text{runtime (seconds)} = \frac{\text{FLOPs}}{\text{FLOPS}}$ (see the second sketch at the end of this note)

## References

* Chatting with Transformers: [https://huggingface.co/docs/transformers/main/en/conversations](https://huggingface.co/docs/transformers/main/en/conversations)
* LLM inference speed of light: [https://zeux.io/2024/03/15/llm-inference-sol/](https://zeux.io/2024/03/15/llm-inference-sol/)
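To make the bandwidth-bound estimate from the General section concrete, here is a minimal sketch. The model size and bandwidth figures are the illustrative ballpark numbers from the bullets above, not measurements of any specific hardware.

```python
def estimate_tokens_per_sec(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when generation is memory-bandwidth bound:
    every active parameter is read from memory once per generated token."""
    model_size_gb = params_billion * bytes_per_param  # 8B params * 2 bytes (bfloat16) ~= 16GB
    return bandwidth_gb_s / model_size_gb

# Ballpark bandwidth figures from the bullets above (GB/sec); real hardware varies.
hardware_bandwidth_gb_s = {
    "consumer CPU (~50 GB/s)": 50,
    "consumer GPU / high-end Apple Silicon (~800 GB/s)": 800,
    "Nvidia A100/H100 (~2.5 TB/s)": 2500,
}

for name, bw in hardware_bandwidth_gb_s.items():
    tps = estimate_tokens_per_sec(params_billion=8, bytes_per_param=2, bandwidth_gb_s=bw)
    print(f"{name}: ~{tps:.1f} tokens/sec for an 8B bfloat16 model (~16GB)")
```

This is only an upper bound: it ignores the KV cache, activations, and the compute-heavy prefill phase, so real decode throughput will be somewhat lower.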
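Similarly, a minimal sketch of the runtime = FLOPs / FLOPS relationship and the 1 MAC = 2 FLOPs convention from the OPS section. The workload size and peak throughput below are made-up placeholder numbers.

```python
def macs_to_flops(macs: float) -> float:
    """One multiply-accumulate = 1 multiply + 1 add = 2 FLOPs."""
    return 2.0 * macs

def ideal_runtime_seconds(work_flops: float, hardware_flops_per_sec: float) -> float:
    """runtime (seconds) = FLOPs (amount of work) / FLOPS (hardware throughput).
    A compute-only lower bound that ignores memory bandwidth and other overheads."""
    return work_flops / hardware_flops_per_sec

# Placeholder example: a forward pass costing 2e12 MACs on hardware with a
# theoretical peak of 100 TFLOPS (1e14 floating point operations per second).
work = macs_to_flops(2e12)  # 4e12 FLOPs
print(f"Compute-bound lower bound: {ideal_runtime_seconds(work, 100e12) * 1000:.0f} ms")
```

During token-by-token decoding, the memory-bandwidth bound from the General section is usually the tighter of the two limits.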