torchrun (Elastic Launch) ######################### * from: https://pytorch.org/docs/stable/elastic/run.html * Torch Distributed Elastic: https://pytorch.org/docs/stable/distributed.elastic.html Quickstart ========== :: MIN_SIZE:MAX_SIZE # 至少MIN_SIZE节点和最多MAX_SIZE节点 HOST_NODE_ADDR # 默认为 29400 To launch a fault-tolerant job:: torchrun --nnodes=NUM_NODES --nproc-per-node=TRAINERS_PER_NODE --max-restarts=NUM_ALLOWED_FAILURES --rdzv-id=JOB_ID --rdzv-backend=c10d --rdzv-endpoint=HOST_NODE_ADDR YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...) To launch an elastic job:: torchrun --nnodes=MIN_SIZE:MAX_SIZE --nproc-per-node=TRAINERS_PER_NODE --max-restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES --rdzv-id=JOB_ID --rdzv-backend=c10d --rdzv-endpoint=HOST_NODE_ADDR YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)