SkyPilot¶
SkyPilot is an open-source framework for running AI and batch workloads on any cloud or Kubernetes cluster. It provides a unified CLI and YAML spec for launching interactive clusters, managed jobs, and model-serving replicas - and it ships with first-class support for Verda.
With SkyPilot on Verda, you can:
- Provision on-demand GPU instances with a single sky launch.
- Run managed jobs that recover automatically on failure.
- Deploy model endpoints behind a built-in load balancer with SkyServe.
- Reuse the same task YAML across your laptop, CI, and any other cloud you already use.
This guide walks through installing SkyPilot, authenticating against Verda, and launching your first GPU workload.
Prerequisites¶
- A Verda account with an active project. Sign up.
- A Client ID and Client Secret for the Verda API (see below).
- Python 3.9 - 3.13 on your local machine.
Install SkyPilot¶
Verda support is built into the upstream SkyPilot project and requires no extra dependencies. We recommend installing SkyPilot into a dedicated virtual environment so it doesn't conflict with other Python projects.
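A minimal install might look like the following; the virtual-environment path is just an example, and `skypilot` is the upstream PyPI package:

```shell
# Create and activate a dedicated virtual environment (path is arbitrary)
python3 -m venv ~/sky-env
source ~/sky-env/bin/activate

# Install the latest SkyPilot release from PyPI
pip install -U skypilot
```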
To use SkyPilot against several providers, combine the extras:
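For example, to add AWS and GCP support alongside Verda (the extras names here follow SkyPilot's usual bracketed convention; substitute the providers you actually use):

```shell
# Extras are comma-separated; Verda itself needs no extra
pip install -U "skypilot[aws,gcp]"
```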
Verify the install:
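A quick sanity check is to print the installed version:

```shell
sky --version
```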
Create API credentials¶
SkyPilot authenticates with Verda using OAuth2 client credentials.
- Log in to the Verda Console.
- Open the Credentials page from the sidebar.
- Under Cloud API credentials, click + Create.
- Copy the Client ID and Client Secret - the secret is shown only once.
Configure credentials for SkyPilot¶
Provide the credentials to SkyPilot either through a config file or environment variables. The config file is persistent and works well for workstations; environment variables are easier to inject in CI/CD pipelines.
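As an illustration, the environment-variable form could look like this; the exact variable names (and the config-file path, if you prefer that route) are assumptions here, so confirm them against the Verda credentials documentation:

```shell
# Hypothetical variable names -- check the Verda docs for the exact ones
export VERDA_CLIENT_ID="your-client-id"
export VERDA_CLIENT_SECRET="your-client-secret"
```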
Optionally override the default region (falls back to FIN-03), either via the config file or environment variable:
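For example (the `VERDA_DEFAULT_REGION` variable name is an assumption; check the Verda docs for the exact name):

```shell
export VERDA_DEFAULT_REGION="FIN-03"
```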
Check that SkyPilot can reach Verda:
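```shell
sky check verda
```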
The output should list Verda as enabled. If you added credentials after starting the SkyPilot API server, restart it so the new credentials take effect:
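In recent SkyPilot releases that run a local API server, restarting it looks like this:

```shell
sky api stop
sky api start
```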
Launch your first cluster¶
Create train.yaml:
```yaml
#
# Example job to run on Verda (formerly DataCrunch).
#
name: minGPT-ddp

resources:
  # Use H100 x 1 node from Verda
  infra: verda
  accelerators: H100:1

run: |
  set -e
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples/distributed/minGPT-ddp
  git pull
  uv venv --python 3.11
  uv pip install -r requirements.txt "numpy<2" torch torchvision --extra-index-url https://download.pytorch.org/whl/cu126
  export LOGLEVEL=INFO
  echo "Starting minGPT-ddp training"
  uv run torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE mingpt/main.py
```
Launch it:
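For example, naming the cluster test-verda:

```shell
sky launch -c test-verda train.yaml
```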
SkyPilot provisions a single H100 instance on Verda, syncs your working directory, runs the task, and streams the output to your terminal.
Inspect the cluster:
```shell
sky status             # list all your clusters
sky queue test-verda   # job queue on this cluster
sky logs test-verda 1  # stream logs for job 1
```
Run additional commands on the same cluster without re-provisioning:
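For example, checking GPU utilization on the running cluster:

```shell
# Runs on the existing cluster; no new provisioning
sky exec test-verda nvidia-smi
```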
Connect over SSH using the cluster name as the host:
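SkyPilot writes an SSH config entry for each cluster, so this works directly:

```shell
ssh test-verda
```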
Terminate the cluster when you're done:
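```shell
sky down test-verda
```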
Info
Verda instances can only be terminated, not stopped. Use sky down to release resources. For cheaper, shorter experiments, use spot (preemptible) instances and smaller GPUs, and tear them down promptly.
Discover available GPUs¶
List the GPU types SkyPilot can provision:
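For example (older SkyPilot releases use `--cloud` instead of `--infra`):

```shell
sky show-gpus --infra verda
```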
To see pricing and availability for a specific accelerator across regions:
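```shell
sky show-gpus H100 --infra verda
```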
Verda's default image is ubuntu-24.04-cuda-12.8-open-docker (Ubuntu 24.04 with CUDA 12.8 and Docker pre-installed). Override it per task with image_id: in your resources block, or globally with the SKYPILOT_VERDA_IMAGE_ID environment variable.
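For instance, pinning the image explicitly in a task spec (the image name here is the default mentioned above):

```yaml
resources:
  infra: verda
  accelerators: H100:1
  image_id: ubuntu-24.04-cuda-12.8-open-docker
```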
A complete training task¶
```yaml
name: train-verda

resources:
  infra: verda
  accelerators: H100:8
  disk_size: 500
  ports: 6006  # TensorBoard

workdir: .

envs:
  MODEL: meta-llama/Llama-3.1-8B
  EPOCHS: "3"

secrets:
  HF_TOKEN: null  # passed at launch time

setup: |
  pip install torch transformers datasets accelerate wandb

run: |
  huggingface-cli login --token $HF_TOKEN
  accelerate launch train.py \
    --model $MODEL \
    --epochs $EPOCHS \
    --output /workdir/checkpoints
```
Launch it with your Hugging Face token passed as a secret:
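For example, assuming the task YAML above is saved as train-verda.yaml:

```shell
sky launch -c train-verda train-verda.yaml --secret HF_TOKEN=hf_xxx
```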
Stream logs:
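Assuming a cluster named train-verda:

```shell
sky logs train-verda
```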
For persistent checkpoints across cluster restarts, mount a Verda shared filesystem or attach a block volume to the instance.
Managed jobs¶
Managed jobs run on ephemeral clusters that SkyPilot provisions, monitors, and re-launches automatically if the instance fails. They're well-suited to long training runs where you don't want to hand-hold provisioning.
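A managed job is launched with sky jobs launch; for example, reusing the training YAML from earlier:

```shell
sky jobs launch -n mingpt train.yaml
```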
Monitor:
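```shell
sky jobs queue            # list managed jobs and their status
sky jobs logs -n mingpt   # stream logs for the job by name
```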
Cancel:
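```shell
sky jobs cancel -n mingpt
```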
Write checkpoints to a mounted shared filesystem or block volume so retries resume from the latest checkpoint rather than starting over.
Serving models with SkyServe¶
SkyServe deploys replicated model endpoints with health checks, autoscaling, and a built-in load balancer. Add a service: block to any task YAML - here's a vLLM example serving Llama 3.1 on a Verda H100:
```yaml
# vllm-llama.sky.yaml
name: llama-serve

service:
  readiness_probe: /v1/models
  replicas: 2

resources:
  infra: verda
  accelerators: H100:1
  ports: 8000

envs:
  MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct

secrets:
  HF_TOKEN: null

setup: |
  pip install vllm

run: |
  vllm serve $MODEL_NAME --host 0.0.0.0 --port 8000
```
Bring it up:
```shell
sky serve up -n llama vllm-llama.sky.yaml --secret HF_TOKEN=hf_xxx
sky serve status llama --endpoint
```
Call the endpoint with any OpenAI-compatible client:
```shell
ENDPOINT=$(sky serve status llama --endpoint)
curl $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```
Scale replicas or roll out a new version by editing the YAML and running sky serve update llama vllm-llama.sky.yaml. Tear the service down with:
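```shell
sky serve down llama
```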
Tip
For fully managed, autoscaling inference without provisioning clusters yourself, see Verda Serverless Containers and our inference API.
Restrict SkyPilot to Verda only¶
If you want SkyPilot to consider only Verda when picking a cloud - for example, in a Verda-only CI pipeline - add this to ~/.sky/config.yaml:
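For example, using SkyPilot's allowed_clouds config key:

```yaml
# ~/.sky/config.yaml
allowed_clouds:
  - verda
```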
SkyPilot will skip credential checks for other providers and fail fast if Verda is unreachable.
Capabilities and current limits¶
SkyPilot runs natively on Verda with a few provider-specific notes:
| Feature | Status on Verda |
|---|---|
| On-demand GPU instances | Supported |
| Managed jobs (`sky jobs`) | Supported |
| SkyServe replica serving | Supported |
| `sky stop` / `sky autostop` | Not supported - use `sky down` |
| Custom Docker images (`image_id: docker:...`) | Not supported - use `setup:` commands on the default image |
| Multi-node clusters | Not supported |
| Mounting object storage as directories | Use `mode: COPY` rather than `mode: MOUNT` |
| Opening ports post-launch | Declare all required ports in `resources.ports` at launch time |
For production multi-node distributed training, we recommend using Verda Instant Clusters with Slurm or Kubernetes.
Further reading¶
- SkyPilot documentation
- Task YAML reference
- CLI reference
- Managed jobs guide
- SkyServe guide
- SkyPilot on GitHub
If you hit issues, reach out via chat in the Verda console or [email protected].