7.6.1. huggingface¶

Transformers¶

如果您的模型适合单个 GPU 并且您有足够的空间来容纳小批量大小，则不需要使用 DeepSpeed，因为它只会减慢速度。
但是，如果模型不适合单个 GPU，或者您无法适应小批量，则可以利用 DeepSpeed ZeRO + CPU Offload 或 NVMe Offload 来处理更大的模型。
其核心是零冗余优化器（ZeRO），它可以大规模训练大型模型。

DeepSpeed是一个专为具有数十亿参数的大型模型的分布式训练的速度和规模而设计的库。其核心是零冗余优化器 (ZeRO)，它将优化器状态 (ZeRO-1)、梯度 (ZeRO-2) 和参数 (ZeRO-3) 跨数据并行进程进行分片。这大大减少了内存使用量，使您能够将训练扩展到十亿个参数模型。为了释放更高的内存效率，ZeRO-Offload 在优化过程中利用 CPU 资源来减少 GPU 计算和内存。

DeepSpeed implements everything described in the ZeRO paper: https://arxiv.org/abs/1910.02054
ZeRO-Offload has its own dedicated paper: ZeRO-Offload: Democratizing Billion-Scale Model Training: https://arxiv.org/abs/2101.06840
ZeRO-Infinity(NVMe-support is described): Breaking the GPU Memory Wall for Extreme Scale Deep Learning: https://arxiv.org/abs/2104.07857
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training: https://arxiv.org/abs/2306.10209
DeepSpeed Zero-2主要仅用于训练，因为其特征对推理无用。