Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (arXiv:2104.04473)

Abstract

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress.

In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of theoretical peak.
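
To make the two challenges in the first paragraph concrete, here is a minimal back-of-the-envelope sketch. The specific numbers are illustrative assumptions rather than figures from the abstract: fp16 weights plus gradients and fp32 Adam optimizer state (roughly 16 bytes per parameter, a common mixed-precision accounting), the ~6 × parameters × tokens FLOP rule of thumb for transformer training, 450B training tokens, and 80 GB / 312 teraFLOP/s per GPU.

```python
# Back-of-the-envelope estimates for training a 1-trillion-parameter model.
# All constants below are illustrative assumptions, not figures from the paper.

params = 1e12            # 1 trillion parameters
tokens = 450e9           # assumed number of training tokens
bytes_per_param = 16     # fp16 weights + gradients + fp32 Adam states (assumed)
gpu_peak_flops = 312e12  # assumed per-GPU fp16 peak (A100-class)
gpu_mem = 80e9           # assumed 80 GB of memory per GPU

# a) Memory: model state alone far exceeds one multi-GPU (8-GPU) server.
state_bytes = params * bytes_per_param
print(f"model state: {state_bytes / 1e12:.0f} TB "
      f"vs. {8 * gpu_mem / 1e12:.2f} TB on one 8-GPU server")

# b) Compute: even running at full peak on 8 GPUs, training would take decades.
total_flops = 6 * params * tokens
seconds = total_flops / (8 * gpu_peak_flops)
print(f"~{seconds / 86400 / 365:.0f} years at peak on 8 GPUs")
```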
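As a sanity check on the headline numbers in the second paragraph: 502 petaFLOP/s aggregated over 3072 GPUs works out to roughly 163 teraFLOP/s per GPU, which is about 52% of the 312 teraFLOP/s fp16 tensor-core peak of an NVIDIA A100. The A100 peak and the specific (tensor, pipeline, data) split below are assumptions for illustration; the abstract only states the aggregate figures.

```python
# Sanity-check the per-GPU throughput claim and show how the three
# parallelism degrees compose. Peak FLOP/s and the split are assumed.

total_gpus = 3072
aggregate_flops = 502e15     # 502 petaFLOP/s, from the abstract
gpu_peak_flops = 312e12      # assumed A100 fp16 peak

per_gpu = aggregate_flops / total_gpus
print(f"per-GPU: {per_gpu / 1e12:.0f} TFLOP/s, "
      f"{per_gpu / gpu_peak_flops:.0%} of peak")   # ~163 TFLOP/s, ~52%

# Tensor, pipeline, and data parallel degrees multiply to the GPU count;
# the split below (tensor within a node, pipeline across nodes, data
# replicas of each model-parallel group) is a hypothetical example.
tensor, pipeline, data = 8, 64, 6
assert tensor * pipeline * data == total_gpus
```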