
Multi-node TorchTitan Training Job

In this tutorial, we will run a 2-node TorchTitan training job on a Verda Kubernetes instant cluster, using SkyPilot as the submission layer.

You will end up with a sky launch one-liner that trains Llama 3 debug_model for 50 steps across 16 GPUs (2 nodes × 8 B300) and logs loss converging from ~8.2 down to ~5.7.

Info

The debug_model config is a tiny toy model meant to validate the cluster and plumbing end-to-end — it is not a real training run. See Scaling up for pointers to real Llama 3 8B / 70B runs.

Prerequisites

For this tutorial you need:

  1. A Verda Kubernetes instant cluster with at least two GPU nodes. This tutorial assumes B300 nodes; adjust the accelerator type for other SKUs.
  2. Local kubectl configured to talk to the cluster. Running kubectl get nodes should list your GPU worker nodes in Ready state.
  3. Python 3.10+ on your workstation.

Tip

This tutorial does not require Kubeflow, Kueue, or any operator installation. SkyPilot talks directly to the Kubernetes API and creates plain pods.

Install SkyPilot

Install into an isolated virtual environment so it does not interfere with any system Python:

python3 -m venv ~/sky-env
source ~/sky-env/bin/activate
pip install 'skypilot[kubernetes]'
sky --version

Expect something like skypilot, version 0.12.0 or newer.

Verify SkyPilot sees your cluster

sky check k8s

Expected output includes:

Kubernetes: enabled [compute]
    Allowed contexts:
    └── <your-context-name>: enabled.

Confirm SkyPilot can see your GPUs:

sky show-gpus --infra k8s

You should see your GPU type (e.g. B300) listed with per-node utilization info.

Warning

If this step fails, SkyPilot cannot reach the cluster. Re-check your kubeconfig and that your context is selected with kubectl config current-context.

Write the task YAML

Save the following as torchtitan-sky.yaml:

name: torchtitan-tutorial

num_nodes: 2

resources:
  infra: k8s
  accelerators: B300:8           # adjust for your GPU SKU
  image_id: docker:nvcr.io/nvidia/pytorch:25.08-py3
  cpus: 32+
  memory: 192+

# Inject RDMA resource + IPC_LOCK capability. SkyPilot's short form covers
# CPU/memory/GPU but not custom cluster resources, so we pass them as a raw
# pod spec fragment that SkyPilot merges into its generated pod template.
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                rdma/rdma_shared_device_a: "1"
              requests:
                rdma/rdma_shared_device_a: "1"
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]

envs:
  NCCL_DEBUG: INFO
  NCCL_SOCKET_IFNAME: eth0
  NCCL_IB_TIMEOUT: "22"
  NCCL_IB_RETRY_CNT: "7"
  NCCL_NET_GDR_LEVEL: "0"

setup: |
  set -ex
  cd ~
  if [ ! -d torchtitan ]; then
    git clone https://github.com/pytorch/torchtitan.git
  fi
  cd torchtitan
  # Pin to a commit whose torch API usage matches NGC 25.08's torch build.
  # Newer torchtitan commits import APIs (HuggingFaceStorageWriter,
  # DefaultStager) that are still private in this torch build.
  git checkout b0902b29
  # Install torchtitan's deps, minus torch* (NGC image already provides it).
  grep -E -v '^(torch|torchvision|torchaudio)([[:space:]]|$|[<>=])' \
    requirements.txt > /tmp/req-notorch.txt
  pip install --no-cache-dir -r /tmp/req-notorch.txt
  pip install --no-cache-dir tiktoken blobfile pyyaml
  pip install --no-cache-dir --no-deps -e .
  python -c "import torchtitan, torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"

run: |
  set -ex
  cd ~/torchtitan
  MASTER_ADDR=$(echo $SKYPILOT_NODE_IPS | awk '{print $1}')
  MASTER_PORT=29500
  echo "rank=$SKYPILOT_NODE_RANK nnodes=$SKYPILOT_NUM_NODES gpus/node=$SKYPILOT_NUM_GPUS_PER_NODE master=$MASTER_ADDR"

  CONFIG=./torchtitan/models/llama3/train_configs/debug_model.toml

  torchrun \
    --nproc-per-node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$SKYPILOT_NUM_NODES \
    --node-rank=$SKYPILOT_NODE_RANK \
    --rdzv-id=100 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m torchtitan.train \
    --job.config-file "$CONFIG" \
    --training.steps=50

What the YAML does

  • num_nodes: 2 + accelerators: B300:8 — request 2 pods, each with 8 B300 GPUs. SkyPilot translates this into the right nvidia.com/gpu request and nvidia.com/gpu.product nodeSelector automatically.
  • image_id — NGC PyTorch container as the base. Ships with a Blackwell-tuned torch build, so no CUDA install needed. The 25.08-py3 tag in particular has the torch==2.8 nightly that pairs with our pinned torchtitan commit.
  • config.kubernetes.pod_config — SkyPilot's escape hatch for Kubernetes fields it does not model natively. Here we add the RDMA resource request and the IPC_LOCK Linux capability that RDMA requires for memory pinning.
  • envs — NCCL tuning for the RDMA fabric.
  • setup — runs once per node on first launch. Clones torchtitan, pins to a known-good commit, installs its deps (avoiding the torch that NGC already provides), and installs torchtitan itself in editable mode.
  • run — the actual training command. SkyPilot populates $SKYPILOT_NODE_IPS, $SKYPILOT_NODE_RANK, etc. on each pod so torchrun can wire up the distributed rendezvous.
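
The rendezvous arithmetic in the run section can be sketched standalone. The values below are assumed stand-ins for what SkyPilot injects on each pod (SKYPILOT_NODE_IPS is taken here to be a newline-separated list, first IP = node rank 0):

```shell
# Stand-in values for SkyPilot's injected variables (2 nodes x 8 GPUs,
# pretending we are the second pod).
SKYPILOT_NUM_NODES=2
SKYPILOT_NUM_GPUS_PER_NODE=8
SKYPILOT_NODE_RANK=1                # this pod's index, 0-based
SKYPILOT_NODE_IPS='10.0.0.4
10.0.0.5'

# Every pod independently picks the first IP as the rendezvous master,
# so they all agree without any extra coordination.
MASTER_ADDR=$(echo $SKYPILOT_NODE_IPS | awk '{print $1}')

# torchrun derives the global rank space from the --nnodes / --nproc-per-node
# / --node-rank flags: world size is nodes x GPUs, and this node hosts a
# contiguous block of global ranks.
WORLD_SIZE=$((SKYPILOT_NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE))
FIRST_RANK=$((SKYPILOT_NODE_RANK * SKYPILOT_NUM_GPUS_PER_NODE))
LAST_RANK=$((FIRST_RANK + SKYPILOT_NUM_GPUS_PER_NODE - 1))
echo "master=$MASTER_ADDR world_size=$WORLD_SIZE this node's ranks=$FIRST_RANK-$LAST_RANK"
```

With the assumed values this prints master=10.0.0.4, a world size of 16, and ranks 8–15 for the second pod.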

Launch the job

sky launch -c tt-tutorial -y torchtitan-sky.yaml

This does four things in sequence:

  1. Provisions 2 pods on the cluster (takes ~30–60 s; mostly image pull).
  2. Syncs files and env.
  3. Runs setup on each pod (clones + installs torchtitan — ~1–2 min first time).
  4. Runs run: torchrun fires up 16 ranks and starts training.

Watch the live output in your terminal, or stream logs from a separate shell:

sky logs tt-tutorial

What success looks like

After setup finishes, you should see per-rank training step logs like:

step:  1  loss:  8.1898  grad_norm:  0.2366  tps: 763      tflops: 0.05   mfu: 0.02%
step:  2  loss:  8.1361  grad_norm:  0.2334  tps: 405,648  tflops: 29.17  mfu: 9.35%
...
step: 36  loss:  5.6757  grad_norm:  0.1597  tps: 564,898  tflops: 40.62  mfu: 13.02%

And at the end:

✓ Job finished (status: SUCCEEDED).

Things to check:

  • All 16 ranks show the same loss at each step — NCCL collectives are correct.
  • Loss decreases monotonically from ~8.2 to ~5.7 — the model is actually training.
  • TFLOPS settles in the 30–45 range per rank — GPUs are doing work, not just sitting.
  • MFU around 10–15% is expected and low — debug_model is a toy (tens of thousands of parameters). Real Llama 3 8B/70B runs will show MFU in the 40–55% range.
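
As a sanity check on that last bullet, MFU is just achieved FLOPS divided by the GPU's peak. A quick awk sketch; the peak value here is a placeholder, not a B300 spec — substitute your GPU's dense peak for the dtype you train in:

```shell
ACHIEVED_TFLOPS=40.6   # from the step logs above
PEAK_TFLOPS=312        # placeholder peak; use your GPU/dtype's real number
awk -v a="$ACHIEVED_TFLOPS" -v p="$PEAK_TFLOPS" \
  'BEGIN { printf "mfu = %.1f%%\n", 100 * a / p }'
```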

Tear down

The cluster stays up after the job finishes so you can re-use it:

sky logs tt-tutorial       # re-stream logs
sky exec tt-tutorial ...   # run more commands
sky launch -c tt-tutorial torchtitan-sky.yaml  # re-run (setup is cached)

When you are done:

sky down tt-tutorial

This deletes the pods. GPUs go back to the cluster's free pool.

Scaling up

To go from the smoke test to a real run:

  • Bigger model — swap debug_model.toml for llama3_8b.toml or llama3_70b.toml under torchtitan/models/llama3/train_configs/. You will also need Hugging Face tokens for the tokenizer and dataset — set them as SkyPilot envs.
  • More nodes — increase num_nodes up to your cluster's GPU capacity. The rest of the YAML does not change.
  • More steps — bump --training.steps=50 in the run section.
  • Checkpointing — add --checkpoint.enable_checkpoint=true and mount shared storage (e.g. cephfs-pvc) via pod_config so ranks write to one volume.
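
Put together, a scaled-up variant differs from the tutorial YAML in only a few places. A hedged sketch — the HF_TOKEN name is an assumption (pass the real value with sky launch --env rather than committing it), and checkpoint flag names should be checked against the torchtitan commit you run:

```yaml
num_nodes: 4                     # was 2; resources section unchanged

envs:
  HF_TOKEN: ""                   # set via: sky launch --env HF_TOKEN=... 

run: |
  torchrun \
    ... \
    --job.config-file ./torchtitan/models/llama3/train_configs/llama3_8b.toml \
    --training.steps=1000 \
    --checkpoint.enable_checkpoint=true
```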

Troubleshooting

ImportError: cannot import name 'HuggingFaceStorageWriter'

TorchTitan main uses a torch API that a given NGC image does not expose yet. The pin git checkout b0902b29 in setup avoids this; if you adopt a newer torchtitan commit, you may need a newer NGC image (25.09-py3, 25.10-py3, etc.) to match.

Pods stuck Pending

Run kubectl describe pod <pod-name> — usually means insufficient GPUs or wrong nodeSelector. Verify sky show-gpus --infra k8s shows free GPUs matching your accelerators: request.

Setup takes forever

First-run setup clones torchtitan and pip-installs dependencies inside the pod. ~1–2 minutes is normal. Re-launches reuse the existing ~/torchtitan checkout and skip the clone.

NCCL hangs or bandwidth looks low

Confirm the RDMA resource request actually landed:

kubectl get pod <pod> -o json | jq '.spec.containers[0].resources'

Output should show rdma/rdma_shared_device_a. If missing, the config.kubernetes.pod_config block did not merge — check indentation.

What's next

  • Gang scheduling with Kueue — for shared clusters where multiple jobs compete for GPUs, adding Kueue gives you a FIFO queue and atomic pod-group admission. See the companion tutorial on Kueue integration.
  • Managed jobs — use sky jobs launch instead of sky launch to run the job under a controller that handles restarts, preemption recovery, etc.
  • Dynamo inference — once you have a checkpoint, deploy it with Dynamo. See the Dynamo inference tutorial.