Multi-node TorchTitan Training Job¶
In this tutorial, we will run a 2-node TorchTitan training job on a Verda Kubernetes instant cluster, using SkyPilot as the submission layer.
You will end up with a sky launch one-liner that trains Llama 3 debug_model for 50 steps across 16 GPUs (2 nodes × 8 B300) and logs loss converging from ~8.2 down to ~5.7.
Info
The debug_model config is a tiny toy model meant to validate the cluster and plumbing end-to-end — it is not a real training run. See Scaling up for pointers to real Llama 3 8B / 70B runs.
Prerequisites¶
For this tutorial you need:
- A Verda Kubernetes instant cluster with at least two GPU nodes. This tutorial assumes B300 nodes; adjust the accelerator type for other SKUs.
- Local `kubectl` configured to talk to the cluster. Running `kubectl get nodes` should list your GPU worker nodes in `Ready` state.
- Python 3.10+ on your workstation.
Tip
This tutorial does not require Kubeflow, Kueue, or any operator installation. SkyPilot talks directly to the Kubernetes API and creates plain pods.
Install SkyPilot¶
Install into an isolated virtual environment so it does not interfere with any system Python:
python3 -m venv ~/sky-env
source ~/sky-env/bin/activate
pip install 'skypilot[kubernetes]'
sky --version
Expect something like skypilot, version 0.12.0 or newer.
Verify SkyPilot sees your cluster¶
Confirm SkyPilot can reach the cluster and see your GPUs:

sky check k8s
sky show-gpus --infra k8s

Expected output includes Kubernetes marked as enabled, and your GPU type (e.g. B300) listed with per-node utilization info.
Warning
If this step fails, SkyPilot cannot reach the cluster. Re-check your kubeconfig and that your context is selected with kubectl config current-context.
Write the task YAML¶
Save the following as torchtitan-sky.yaml:
name: torchtitan-tutorial

num_nodes: 2

resources:
  infra: k8s
  accelerators: B300:8  # adjust for your GPU SKU
  image_id: docker:nvcr.io/nvidia/pytorch:25.08-py3
  cpus: 32+
  memory: 192+

# Inject RDMA resource + IPC_LOCK capability. SkyPilot's short form covers
# CPU/memory/GPU but not custom cluster resources, so we pass them as a raw
# pod spec fragment that SkyPilot merges into its generated pod template.
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                rdma/rdma_shared_device_a: "1"
              requests:
                rdma/rdma_shared_device_a: "1"
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]

envs:
  NCCL_DEBUG: INFO
  NCCL_SOCKET_IFNAME: eth0
  NCCL_IB_TIMEOUT: "22"
  NCCL_IB_RETRY_CNT: "7"
  NCCL_NET_GDR_LEVEL: "0"

setup: |
  set -ex
  cd ~
  if [ ! -d torchtitan ]; then
    git clone https://github.com/pytorch/torchtitan.git
  fi
  cd torchtitan
  # Pin to a commit whose torch API usage matches NGC 25.08's torch build.
  # Newer torchtitan commits import APIs (HuggingFaceStorageWriter,
  # DefaultStager) that are still private in this torch build.
  git checkout b0902b29
  # Install torchtitan's deps, minus torch* (NGC image already provides it).
  grep -E -v '^(torch|torchvision|torchaudio)([[:space:]]|$|[<>=])' \
    requirements.txt > /tmp/req-notorch.txt
  pip install --no-cache-dir -r /tmp/req-notorch.txt
  pip install --no-cache-dir tiktoken blobfile pyyaml
  pip install --no-cache-dir --no-deps -e .
  python -c "import torchtitan, torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"

run: |
  set -ex
  cd ~/torchtitan
  MASTER_ADDR=$(echo $SKYPILOT_NODE_IPS | awk '{print $1}')
  MASTER_PORT=29500
  echo "rank=$SKYPILOT_NODE_RANK nnodes=$SKYPILOT_NUM_NODES gpus/node=$SKYPILOT_NUM_GPUS_PER_NODE master=$MASTER_ADDR"
  CONFIG=./torchtitan/models/llama3/train_configs/debug_model.toml
  torchrun \
    --nproc-per-node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$SKYPILOT_NUM_NODES \
    --node-rank=$SKYPILOT_NODE_RANK \
    --rdzv-id=100 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m torchtitan.train \
    --job.config-file "$CONFIG" \
    --training.steps=50
What the YAML does¶
- `num_nodes: 2` + `accelerators: B300:8` — request 2 pods, each with 8 B300 GPUs. SkyPilot translates this into the right `nvidia.com/gpu` request and `nvidia.com/gpu.product` nodeSelector automatically.
- `image_id` — NGC PyTorch container as the base. Ships with a Blackwell-tuned `torch` build, so no CUDA install needed. The `25.08-py3` tag in particular has the torch==2.8 nightly that pairs with our pinned torchtitan commit.
- `config.kubernetes.pod_config` — SkyPilot's escape hatch for Kubernetes fields it does not model natively. Here we add the RDMA resource request and the `IPC_LOCK` Linux capability that RDMA requires for memory pinning.
- `envs` — NCCL tuning for the RDMA fabric.
- `setup` — runs once per node on first launch. Clones torchtitan, pins to a known-good commit, installs its deps (skipping the torch that NGC already provides), and installs torchtitan itself in editable mode.
- `run` — the actual training command. SkyPilot populates `$SKYPILOT_NODE_IPS`, `$SKYPILOT_NODE_RANK`, etc. on each pod so `torchrun` can wire up the distributed rendezvous.
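Conceptually, the `pod_config` block is deep-merged into the pod spec SkyPilot generates. The sketch below illustrates the idea with a plain recursive dict merge — it is not SkyPilot's actual merge code, and SkyPilot additionally merges the `containers` list element-wise, which is how the `resources`/`securityContext` fragment above attaches to the generated container:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; non-dict override values win."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

# A generated pod spec plus a user fragment (dict-only fields for simplicity).
generated = {"spec": {"restartPolicy": "Never",
                      "nodeSelector": {"nvidia.com/gpu.product": "B300"}}}
fragment = {"spec": {"hostNetwork": True}}
merged = deep_merge(generated, fragment)
print(merged["spec"]["hostNetwork"])   # -> True
print(merged["spec"]["nodeSelector"])  # nodeSelector from the template survives
```

The key property is that the fragment only has to spell out the fields it adds; everything SkyPilot already generated is preserved.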
Launch the job¶
sky launch -c tt-tutorial torchtitan-sky.yaml

This does four things in sequence:
1. Provisions 2 pods on the cluster (takes ~30–60 s; mostly image pull).
2. Syncs files and env.
3. Runs `setup` on each pod (clones + installs torchtitan — ~1–2 min first time).
4. Runs `run` — `torchrun` fires up 16 ranks and starts training.
Watch the live output in your terminal, or stream logs from a separate shell:

sky logs tt-tutorial
What success looks like¶
After setup finishes, you should see per-rank training step logs like:
step: 1 loss: 8.1898 grad_norm: 0.2366 tps: 763 tflops: 0.05 mfu: 0.02%
step: 2 loss: 8.1361 grad_norm: 0.2334 tps: 405,648 tflops: 29.17 mfu: 9.35%
...
step: 36 loss: 5.6757 grad_norm: 0.1597 tps: 564,898 tflops: 40.62 mfu: 13.02%
At the end of the run, the job reaches step 50 and all ranks exit cleanly.
Things to check:
- All 16 ranks show the same loss at each step — NCCL collectives are correct.
- Loss decreases monotonically from ~8.2 to ~5.7 — the model is actually training.
- TFLOPS settles in the 30–45 range per rank — GPUs are doing work, not just sitting.
- MFU around 10–15% is expected and low — `debug_model` is a toy (tens of thousands of parameters). Real Llama 3 8B/70B runs will show MFU in the 40–55% range.
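MFU is simply achieved TFLOPS divided by the GPU's peak TFLOPS. As a sanity check, the tflops/mfu pairs in the step logs above are mutually consistent with a 312 TFLOPS reference peak — a value inferred from the logs themselves, not a documented constant:

```python
def mfu_percent(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization: achieved / peak, as a percentage."""
    return 100.0 * achieved_tflops / peak_tflops

PEAK_TFLOPS = 312.0  # reference peak implied by the step logs above
print(round(mfu_percent(29.17, PEAK_TFLOPS), 2))  # step 2  -> 9.35
print(round(mfu_percent(40.62, PEAK_TFLOPS), 2))  # step 36 -> 13.02
```

If your own logs show a very different tflops-to-mfu ratio, the trainer is assuming a different peak for your GPU SKU.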
Tear down¶
The cluster stays up after the job finishes so you can re-use it:
sky logs tt-tutorial # re-stream logs
sky exec tt-tutorial ... # run more commands
sky launch -c tt-tutorial torchtitan-sky.yaml # re-run (setup is cached)
When you are done:

sky down tt-tutorial

This deletes the pods. GPUs go back to the cluster's free pool.
Scaling up¶
To go from the smoke test to a real run:
- Bigger model — swap `debug_model.toml` for `llama3_8b.toml` or `llama3_70b.toml` under `torchtitan/models/llama3/train_configs/`. You will also need Hugging Face tokens for the tokenizer and dataset — set them as SkyPilot `envs`.
- More nodes — increase `num_nodes` up to your cluster's GPU capacity. The rest of the YAML does not change.
- More steps — bump `--training.steps=50` in the `run` section.
- Checkpointing — add `--checkpoint.enable_checkpoint=true` and mount shared storage (e.g. a `cephfs-pvc`) via `pod_config` so all ranks write to one volume.
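The first three bullets amount to small YAML edits. A sketch of the deltas for an 8B run — the config path follows the torchtitan repo layout, while the `HF_TOKEN` variable name and step count are illustrative assumptions:

```yaml
num_nodes: 4                 # was 2

envs:
  HF_TOKEN: ""               # pass at launch time: sky launch --env HF_TOKEN=hf_...

run: |
  set -ex
  cd ~/torchtitan
  CONFIG=./torchtitan/models/llama3/train_configs/llama3_8b.toml  # was debug_model.toml
  # torchrun invocation unchanged from the tutorial YAML, except:
  #   --training.steps=1000                                       # was 50
```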
Troubleshooting¶
ImportError: cannot import name 'HuggingFaceStorageWriter'¶
TorchTitan main uses a torch API that a given NGC image does not expose yet. The pin git checkout b0902b29 in setup avoids this; if you adopt a newer torchtitan commit, you may need a newer NGC image (25.09-py3, 25.10-py3, etc.) to match.
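Before bumping the pin, you can probe whether your image's torch build exposes the offending API. A small sketch — the `torch.distributed.checkpoint` module path is an assumption about where torchtitan imports the symbol from, and the helper degrades to `False` when torch itself is absent:

```python
import importlib

def has_attr(module: str, name: str) -> bool:
    """True if `module` imports cleanly and exposes attribute `name`."""
    try:
        return hasattr(importlib.import_module(module), name)
    except ImportError:
        return False

# If this prints False, stay on the pinned torchtitan commit or move to a
# newer NGC tag whose torch build exposes the API.
print(has_attr("torch.distributed.checkpoint", "HuggingFaceStorageWriter"))
```

You can run this inside the pod with `sky exec tt-tutorial` to check an image before committing to a re-launch.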
Pods stuck Pending¶
Run `kubectl describe pod <pod-name>` — this usually means insufficient GPUs or a wrong nodeSelector. Verify `sky show-gpus --infra k8s` shows free GPUs matching your `accelerators:` request.
Setup takes forever¶
First-run setup clones torchtitan and pip-installs dependencies inside the pod. ~1–2 minutes is normal. Re-launches on the same cluster re-use the existing ~/torchtitan checkout and skip the clone.
NCCL hangs or bandwidth looks low¶
Confirm the RDMA resource request actually landed:

kubectl describe pod <pod-name> | grep rdma
Output should show rdma/rdma_shared_device_a. If missing, the config.kubernetes.pod_config block did not merge — check indentation.
What's next¶
- Gang scheduling with Kueue — for shared clusters where multiple jobs compete for GPUs, adding Kueue gives you a FIFO queue and atomic pod-group admission. See the companion tutorial on Kueue integration.
- Managed jobs — use `sky jobs launch` instead of `sky launch` to run the job under a controller that handles restarts, preemption recovery, etc.
- Dynamo inference — once you have a checkpoint, deploy it with Dynamo. See the Dynamo inference tutorial.