6.2.5. torchrun (Elastic Launch)
Torch Distributed Elastic: https://pytorch.org/docs/stable/distributed.elastic.html
Quickstart
MIN_SIZE:MAX_SIZE # at least MIN_SIZE nodes and at most MAX_SIZE nodes
HOST_NODE_ADDR # in the form <host>[:<port>]; the port defaults to 29400 if omitted
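The two value formats above can be illustrated with a small parser. This is only a sketch of how the strings are read, not torchrun's own implementation: the helper names `parse_nnodes` and `parse_host_node_addr` are hypothetical, and IPv6 addresses are deliberately not handled.

```python
from typing import Tuple

DEFAULT_RDZV_PORT = 29400  # port assumed when HOST_NODE_ADDR omits one


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Parse 'N' or 'MIN_SIZE:MAX_SIZE' into (min_nodes, max_nodes)."""
    parts = spec.split(":")
    if len(parts) == 1:
        # A single number means a fixed-size job: min == max.
        min_nodes = max_nodes = int(parts[0])
    elif len(parts) == 2:
        min_nodes, max_nodes = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"invalid --nnodes value: {spec!r}")
    if not 0 < min_nodes <= max_nodes:
        raise ValueError(f"need 0 < MIN_SIZE <= MAX_SIZE, got {spec!r}")
    return min_nodes, max_nodes


def parse_host_node_addr(addr: str) -> Tuple[str, int]:
    """Parse '<host>[:<port>]' into (host, port), applying the default port."""
    host, sep, port = addr.rpartition(":")
    if not sep:
        # No colon present, so the whole string is the host name.
        return addr, DEFAULT_RDZV_PORT
    return host, int(port)
```

For example, `parse_nnodes("1:4")` yields `(1, 4)`, and `parse_host_node_addr("node1.example.com")` falls back to port 29400.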
To launch a fault-tolerant job:
torchrun \
    --nnodes=NUM_NODES \
    --nproc-per-node=TRAINERS_PER_NODE \
    --max-restarts=NUM_ALLOWED_FAILURES \
    --rdzv-id=JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
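Inside the training script, each worker learns its place in the job from environment variables that torchrun sets, such as `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and `TORCHELASTIC_RESTART_COUNT`. A minimal sketch of reading them follows; the `WorkerEnv` class and `read_worker_env` function are hypothetical names, and the fallback values for running without torchrun are an assumption made for local debugging.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerEnv:
    rank: int           # global rank across all nodes
    local_rank: int     # rank within this node
    world_size: int     # total number of workers in the job
    restart_count: int  # how many times the worker group has been restarted


def read_worker_env(environ=os.environ) -> WorkerEnv:
    """Read the worker's identity from torchrun's environment variables."""
    # The defaults below are an assumption: they describe a single
    # process started directly with `python`, not under torchrun.
    return WorkerEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
        restart_count=int(environ.get("TORCHELASTIC_RESTART_COUNT", "0")),
    )


if __name__ == "__main__":
    env = read_worker_env()
    print(f"worker {env.rank}/{env.world_size} (local rank {env.local_rank}, "
          f"restart {env.restart_count})")
```

Passing the mapping as an argument, rather than reading `os.environ` directly inside the function, keeps the helper easy to exercise without launching a real job.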
To launch an elastic job:
torchrun \
    --nnodes=MIN_SIZE:MAX_SIZE \
    --nproc-per-node=TRAINERS_PER_NODE \
    --max-restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES \
    --rdzv-id=JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
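Because a worker failure or a membership change causes the workers to be restarted, the training script is usually written so that it can resume from its last saved state. The sketch below shows that pattern with the standard library only. It is an illustration under assumptions, not a prescribed API: the function names and the JSON file format are hypothetical, and a real job would typically save model and optimizer state with `torch.save` and let a single rank do the writing.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the finished temp file into place in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state on the first start."""
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_epochs: int) -> dict:
    """Resume from the last completed epoch and checkpoint after each one."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of training would run here ...
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

After a restart, `train` picks up at the epoch recorded in the checkpoint file instead of starting from zero, so at most one epoch of work is repeated.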