SkyPilot

SkyPilot is an open-source framework for running AI and batch workloads on any cloud or Kubernetes cluster. It provides a unified CLI and YAML spec for launching interactive clusters, managed jobs, and model-serving replicas - and it ships with first-class support for Verda.

With SkyPilot on Verda, you can:

  • Provision on-demand GPU instances with a single sky launch.
  • Run managed jobs that recover automatically on failure.
  • Deploy model endpoints behind a built-in load balancer with SkyServe.
  • Reuse the same task YAML across your laptop, CI, and any other cloud you already use.

This guide walks through installing SkyPilot, authenticating against Verda, and launching your first GPU workload.

Prerequisites

  • A Verda account with an active project. Sign up.
  • A Client ID and Client Secret for the Verda API (see below).
  • Python 3.9 - 3.13 on your local machine.

Install SkyPilot

Verda support is built into the upstream SkyPilot project and requires no extra dependencies. We recommend installing SkyPilot into a dedicated virtual environment so it doesn't conflict with other Python projects.

# Option 1: uv
uv venv --seed --python 3.10 ~/.venvs/sky
source ~/.venvs/sky/bin/activate
uv pip install "skypilot[verda]"

# Option 2: venv
python -m venv ~/.venvs/sky
source ~/.venvs/sky/bin/activate
pip install "skypilot[verda]"

# Option 3: conda
conda create -y -n sky python=3.10
conda activate sky
pip install "skypilot[verda]"

To use SkyPilot against several providers, combine the extras:

pip install "skypilot[verda,kubernetes,aws]"

Verify the install:

sky --version

Create API credentials

SkyPilot authenticates with Verda using OAuth2 client credentials.

  1. Log in to the Verda Console.
  2. Open the Credentials page from the sidebar.
  3. Under Cloud API credentials, click + Create.
  4. Copy the Client ID and Client Secret - the secret is shown only once.

Configure credentials for SkyPilot

Provide the credentials to SkyPilot either through a config file or environment variables. The config file is persistent and works well for workstations; environment variables are easier to inject in CI/CD pipelines.

Create ~/.verda/config.json:

mkdir -p ~/.verda
cat > ~/.verda/config.json <<'EOF'
{
  "client_id": "your-client-id",
  "client_secret": "your-client-secret"
}
EOF
chmod 600 ~/.verda/config.json
Alternatively, set environment variables (convenient for CI/CD pipelines):

export VERDA_CLIENT_ID="your-client-id"
export VERDA_CLIENT_SECRET="your-client-secret"
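A quick way to catch a malformed credentials file before SkyPilot does is to parse it yourself. This is a convenience sketch, not a SkyPilot or Verda feature; the VERDA_CONFIG override is purely local to this snippet:

```shell
# Sanity-check the credentials file: well-formed JSON with both keys present.
CONFIG="${VERDA_CONFIG:-$HOME/.verda/config.json}"
python3 - "$CONFIG" <<'PY'
import json, sys
try:
    cfg = json.load(open(sys.argv[1]))
except (OSError, ValueError) as err:
    print(f"problem reading config: {err}")
else:
    missing = [k for k in ("client_id", "client_secret") if not cfg.get(k)]
    print("config OK" if not missing else f"missing keys: {missing}")
PY
```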

Optionally override the default region (falls back to FIN-03) by adding a "default_region" key to the config file, or via an environment variable. Note that the file must remain a single JSON object:

cat > ~/.verda/config.json <<'EOF'
{
  "client_id": "your-client-id",
  "client_secret": "your-client-secret",
  "default_region": "FIN-02"
}
EOF
export VERDA_DEFAULT_REGION="FIN-02"

Check that SkyPilot can reach Verda:

sky check verda

The output should list Verda as enabled. If you added credentials after starting the SkyPilot API server, restart it so the new credentials take effect:

sky api stop && sky api start

Launch your first cluster

Create train.yaml:

#
# Example job to run on Verda (formerly DataCrunch).
#
name: minGPT-ddp

resources:
    # Use H100 x 1 node from Verda
    infra: verda
    accelerators: H100:1

run: |
    set -e
    git clone --depth 1 https://github.com/pytorch/examples || true
    cd examples/distributed/minGPT-ddp
    git pull
    uv venv --python 3.11
    uv pip install -r requirements.txt "numpy<2" torch torchvision --extra-index-url https://download.pytorch.org/whl/cu126
    export LOGLEVEL=INFO
    echo "Starting minGPT-ddp training"
    uv run torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE mingpt/main.py

Launch it:

sky launch -c test-verda train.yaml

SkyPilot provisions a single H100 instance on Verda, syncs your working directory, runs the task, and streams the output to your terminal.

Inspect the cluster:

sky status                # list all your clusters
sky queue test-verda      # job queue on this cluster
sky logs test-verda 1     # stream logs for job 1

Run additional commands on the same cluster without re-provisioning:

sky exec test-verda train.yaml

Connect over SSH using the cluster name as the host:

ssh test-verda

Terminate the cluster when you're done:

sky down test-verda

Info

Verda instances can only be terminated, not stopped. Use sky down to release resources. For cheaper, shorter experiments, use spot (preemptible) instances and smaller GPUs, and tear them down promptly.

Discover available GPUs

List the GPU types SkyPilot can provision:

sky gpus list --infra verda

To see pricing and availability for a specific accelerator across regions:

sky gpus list H100 --infra verda

Verda's default image is ubuntu-24.04-cuda-12.8-open-docker (Ubuntu 24.04 with CUDA 12.8 and Docker pre-installed). Override it per task with image_id: in your resources block, or globally with the SKYPILOT_VERDA_IMAGE_ID environment variable.
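For example, pinning the image for a single task looks like this (a sketch; the image name shown is the default one quoted above, repeated here only for illustration):

```yaml
resources:
  infra: verda
  accelerators: H100:1
  image_id: ubuntu-24.04-cuda-12.8-open-docker
```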

A complete training task

name: train-verda

resources:
  infra: verda
  accelerators: H100:8
  disk_size: 500
  ports: 6006          # TensorBoard

workdir: .

envs:
  MODEL: meta-llama/Llama-3.1-8B
  EPOCHS: "3"

secrets:
  HF_TOKEN: null       # passed at launch time

setup: |
  pip install torch transformers datasets accelerate wandb

run: |
  huggingface-cli login --token $HF_TOKEN
  accelerate launch train.py \
    --model $MODEL \
    --epochs $EPOCHS \
    --output /workdir/checkpoints

Launch it with your Hugging Face token passed as a secret:

sky launch -c train-verda train-verda.yaml --secret HF_TOKEN=hf_xxx

Stream logs:

sky logs train-verda

For persistent checkpoints across cluster restarts, mount a Verda shared filesystem or attach a block volume to the instance.

Managed jobs

Managed jobs run on ephemeral clusters that SkyPilot provisions, monitors, and re-launches automatically if the instance fails. They're well suited to long training runs where you don't want to babysit provisioning and recovery.

sky jobs launch -n train-run train-verda.yaml --secret HF_TOKEN=hf_xxx

Monitor:

sky jobs queue          # all managed jobs
sky jobs logs <job-id>  # stream logs

Cancel:

sky jobs cancel <job-id>

Write checkpoints to a mounted shared filesystem or block volume so retries resume from the latest checkpoint rather than starting over.
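A common pattern is to make the run: section resume from the newest checkpoint if one exists. This is a minimal sketch: the mount path /mnt/ckpts and the train.py flags are illustrative placeholders, not Verda or SkyPilot conventions:

```yaml
run: |
  CKPT_DIR=/mnt/ckpts    # mounted shared filesystem or block volume (placeholder path)
  LATEST=$(ls -t "$CKPT_DIR"/*.pt 2>/dev/null | head -n1)
  if [ -n "$LATEST" ]; then
    echo "Resuming from $LATEST"
    python train.py --resume-from "$LATEST" --output "$CKPT_DIR"
  else
    echo "No checkpoint found; starting fresh"
    python train.py --output "$CKPT_DIR"
  fi
```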

Serving models with SkyServe

SkyServe deploys replicated model endpoints with health checks, autoscaling, and a built-in load balancer. Add a service: block to any task YAML - here's a vLLM example serving Llama 3.1 on a Verda H100:

# vllm-llama.sky.yaml
name: llama-serve

service:
  readiness_probe: /v1/models
  replicas: 2

resources:
  infra: verda
  accelerators: H100:1
  ports: 8000

envs:
  MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct

secrets:
  HF_TOKEN: null

setup: |
  pip install vllm

run: |
  vllm serve $MODEL_NAME --host 0.0.0.0 --port 8000

Bring it up:

sky serve up -n llama vllm-llama.sky.yaml --secret HF_TOKEN=hf_xxx
sky serve status llama --endpoint

Call the endpoint with any OpenAI-compatible client:

ENDPOINT=$(sky serve status llama --endpoint)
curl $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello!"}]}'

Scale replicas or roll out a new version by editing the YAML and running sky serve update llama vllm-llama.sky.yaml. Tear the service down with:

sky serve down llama

Tip

For fully managed, autoscaling inference without provisioning clusters yourself, see Verda Serverless Containers and our inference API.

Restrict SkyPilot to Verda only

If you want SkyPilot to consider only Verda when picking a cloud - for example, in a Verda-only CI pipeline - add this to ~/.sky/config.yaml:

allowed_clouds:
  - verda

SkyPilot will skip credential checks for other providers and fail fast if Verda is unreachable.

Capabilities and current limits

SkyPilot runs natively on Verda with a few provider-specific notes:

  • On-demand GPU instances: Supported
  • Managed jobs (sky jobs): Supported
  • SkyServe replica serving: Supported
  • sky stop / sky autostop: Not supported; use sky down instead
  • Custom Docker images (image_id: docker:...): Not supported; use setup: commands on the default image
  • Multi-node clusters: Not supported
  • Mounting object storage as directories: Use mode: COPY rather than mode: MOUNT
  • Opening ports post-launch: Not supported; declare all required ports in resources.ports at launch time
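For the object-storage case, that means copying a bucket's contents onto the instance at provision time rather than mounting it. A sketch, assuming SkyPilot's upstream file_mounts syntax; the bucket name is a placeholder:

```yaml
file_mounts:
  /data:
    source: s3://my-bucket/dataset   # placeholder bucket
    mode: COPY                       # on Verda, use COPY rather than MOUNT
```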

For production multi-node distributed training, we recommend using Verda Instant Clusters with Slurm or Kubernetes.

Getting help

If you hit issues, reach out via chat in the Verda console or [email protected].