Instant Cluster Release Notes¶
Changes to the Instant GPU Cluster offering: new GPU accelerators, orchestrator images, software upgrades, and infrastructure improvements.
April 2026¶
NVIDIA B300 GPU Accelerator support¶
Instant Clusters now support NVIDIA B300 GPUs, with InfiniBand XDR networking and NVMe passthrough. NCCL tests are compiled for the B300 compute capability.
SLURM upgraded to 25.11.4¶
The SLURM scheduler has been updated to version 25.11.4.
March 2026¶
Kubernetes cluster image¶
A Kubernetes orchestrator image is now available as an alternative to SLURM. Clusters can be deployed with Kubernetes, including the GPU device plugin and feature discovery components (NFD, GFD), local NVMe storage classes, and InfiniBand networking support.
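As a quick orientation, the sketch below lists the GPU labels that NFD/GFD publish on each node. It assumes the official `kubernetes` Python client and a reachable kubeconfig; it is not part of the cluster image itself.

```python
# Minimal sketch: show the nvidia.com/gpu.* labels that GPU Feature
# Discovery (GFD) attaches to nodes. Assumes a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpu_labels = {k: v for k, v in node.metadata.labels.items()
                  if k.startswith("nvidia.com/gpu")}
    print(node.metadata.name, gpu_labels)
```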
Kubernetes MPI Operator¶
Kubernetes clusters now deploy the MPI Operator, enabling distributed multi-node training jobs using MPIJob resources.
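A hedged sketch of submitting an MPIJob through the operator's CRD follows. The image name, command, and replica counts are illustrative assumptions; only the `kubeflow.org/v2beta1` MPIJob shape comes from the MPI Operator.

```python
# Illustrative sketch: create a two-worker MPIJob via the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()

mpijob = {
    "apiVersion": "kubeflow.org/v2beta1",
    "kind": "MPIJob",
    "metadata": {"name": "allreduce-demo"},
    "spec": {
        "slotsPerWorker": 8,
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "launcher",
                    "image": "mpi-demo:latest",  # hypothetical image
                    "command": ["mpirun", "python", "train.py"],
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [{
                    "name": "worker",
                    "image": "mpi-demo:latest",  # hypothetical image
                    "resources": {"limits": {"nvidia.com/gpu": 8}},
                }]}},
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v2beta1", namespace="default",
    plural="mpijobs", body=mpijob)
```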
Slurmrestd REST API¶
SLURM clusters now run slurmrestd, enabling programmatic job submission and cluster management via the SLURM REST API or the s9s tool.
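The sketch below submits a batch job over the REST API. The API version path (`v0.0.40` here), the slurmrestd address, and the partition name are assumptions; check your slurmrestd build for the exact version and obtain a token with `scontrol token`.

```python
# Hedged sketch: submit a batch job through slurmrestd's JSON API.
import os
import requests

BASE = "http://localhost:6820"  # assumed slurmrestd address
HEADERS = {
    "X-SLURM-USER-NAME": os.environ["USER"],
    "X-SLURM-USER-TOKEN": os.environ["SLURM_JWT"],  # from `scontrol token`
}

payload = {
    "job": {
        "name": "hello",
        "partition": "main",  # assumed partition name
        "current_working_directory": "/home/" + os.environ["USER"],
        "environment": ["PATH=/bin:/usr/bin"],
    },
    "script": "#!/bin/bash\nsrun hostname\n",
}

r = requests.post(f"{BASE}/slurm/v0.0.40/job/submit",
                  headers=HEADERS, json=payload)
r.raise_for_status()
print(r.json()["job_id"])
```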
DOCA networking updated to 3.2.2¶
The NVIDIA DOCA SDK has been updated to 3.2.2, keeping InfiniBand and networking drivers current.
February 2026¶
Instant GPU Clusters Generally Available¶
The Beta label has been removed from clusters, marking their general availability.
Centralized log aggregation with VictoriaMetrics¶
Cluster logs are now aggregated centrally and can be viewed in Grafana. Logs are retained up to a size-based limit of 50 GB.
Production-ready alerting¶
The observability stack now includes production-ready alert rules with tuned thresholds for GPU health, node availability, disk usage, and cluster state.
To configure alert destinations, see the monitoring section.
January 2026¶
Reduced availability¶
Most Instant Cluster capacity has been reallocated to a few dedicated customers. More availability is expected later in spring.
kanidm for identity management¶
Cluster-internal authentication now uses kanidm.
kanidm provides centralized POSIX identity management across all cluster nodes, including SSH key distribution and user/group synchronization.
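A minimal check that a kanidm-managed account resolves on a node is sketched below, assuming kanidm's unix daemon is wired into NSS; `alice` is a hypothetical user name.

```python
# Minimal sketch: confirm a POSIX identity resolves through NSS -> kanidm.
import grp
import pwd

user = pwd.getpwnam("alice")  # "alice" is a hypothetical user
print(user.pw_uid, user.pw_dir, user.pw_shell)
print([g.gr_name for g in grp.getgrall() if "alice" in g.gr_mem])
```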
SLURM upgraded to 25.11.1¶
The SLURM scheduler has been upgraded from 25.05 to the 25.11 series, bringing improved scheduling performance and new features.
gpud upgraded to 0.9.2¶
The gpud daemon has been upgraded to 0.9.2, improving GPU health monitoring.
October 2025¶
SLURM upgraded to 25.05.3¶
The SLURM scheduler has been upgraded to 25.05.3. This is the first major version upgrade from the initial SLURM build.
Ubuntu 24.04 cluster image¶
Cluster images are now built on Ubuntu 24.04 with the 6.14 HWE kernel, replacing the previous Ubuntu 22.04 base.
CUDA 12.9 removed¶
CUDA 12.9 has been removed from the cluster image. Clusters now use CUDA 13.0.
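Code built against CUDA 12.9 may need rebuilding. A quick, illustrative sanity check of the toolkit version the image actually provides:

```python
# Illustrative sketch: verify the installed CUDA toolkit is 13.x.
import re
import subprocess

out = subprocess.run(["nvcc", "--version"],
                     capture_output=True, text=True, check=True).stdout
ver = re.search(r"release (\d+\.\d+)", out).group(1)
assert ver.startswith("13."), f"expected CUDA 13.x, found {ver}"
print("CUDA toolkit", ver)
```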
gpud integrated with Node Health Check¶
The gpud daemon is now used within SLURM Node Health Check (NHC) to continuously monitor GPU state, InfiniBand port health, and driver status. Nodes with degraded GPUs or network links are automatically drained and can be rebooted.
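To see which nodes NHC has drained and why, standard SLURM tooling suffices; a small sketch (the parsing is illustrative):

```python
# Sketch: list drained nodes with the reason NHC/gpud recorded.
import subprocess

out = subprocess.run(["sinfo", "-R", "--noheader"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    print(line)  # REASON, USER, TIMESTAMP, NODELIST columns
```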
Chrony time synchronization¶
Cluster nodes now use chrony for NTP time synchronization, ensuring consistent timestamps across all nodes.
September 2025¶
virtiofs replaces NFS for /home¶
The shared /home filesystem now uses virtiofs instead of NFS-backed SFS, improving file I/O performance for cluster workloads.
August 2025¶
NVIDIA B200 GPU Accelerator support¶
Instant Clusters now support NVIDIA B200 GPUs.
HPC-X pre-installed¶
NVIDIA HPC-X is now pre-installed in /opt on all cluster nodes, providing optimized MPI, SHMEM, and PGAS libraries for multi-node communication over InfiniBand.
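A minimal multi-node MPI check is sketched below, assuming mpi4py is built against the HPC-X Open MPI (for example after sourcing its init script); the launcher wiring is an assumption.

```python
# Minimal sketch: all-reduce across ranks to verify MPI connectivity.
# Launch e.g. with: mpirun -np 16 python mpi_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
total = comm.allreduce(rank, op=MPI.SUM)  # crosses InfiniBand between nodes
if rank == 0:
    print(f"{comm.Get_size()} ranks, sum of ranks = {total}")
```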
Grafana monitoring dashboards¶
Clusters now include pre-provisioned Grafana dashboards for GPU metrics (DCGM), node resource usage, SLURM job status, and IPMI sensor data.
Customer-facing alerts for common failure modes are included.
NCCL all_reduce_perf validation¶
Cluster deployment now includes automated NCCL all_reduce_perf benchmarks to validate GPU-to-GPU communication performance across nodes before the cluster is marked as ready.
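The sketch below shows the kind of check the validation performs: run nccl-tests' all_reduce_perf under SLURM and inspect the reported average bus bandwidth. The binary path, node count, and 100 GB/s floor are assumptions, not the product's actual acceptance threshold.

```python
# Illustrative sketch: run all_reduce_perf and check the reported bandwidth.
import re
import subprocess

cmd = ["srun", "-N", "2", "--ntasks-per-node=8",
       "/opt/nccl-tests/build/all_reduce_perf",  # assumed install path
       "-b", "1G", "-e", "8G", "-f", "2", "-g", "1"]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

match = re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", out)
assert match and float(match.group(1)) > 100.0, "bus bandwidth below floor"
print("all_reduce_perf OK:", match.group(1), "GB/s")
```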
Enroot and Pyxis container support¶
Enroot and Pyxis are pre-installed, allowing SLURM jobs to run inside unprivileged containers pulled directly from container registries.
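A short sketch of a containerized step via Pyxis' `--container-image` flag (pulled with Enroot); the image choice is illustrative.

```python
# Sketch: run a SLURM step inside an unprivileged container.
import subprocess

subprocess.run(
    ["srun", "--container-image=ubuntu:24.04", "cat", "/etc/os-release"],
    check=True)
```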
April 2025¶
NFS-backed SFS¶
The /home filesystem is now backed by NFS instead of CephFS.
March 2025¶
Observability stack¶
Prometheus and Grafana are deployed on the cluster jumphost, providing metrics collection and visualization from day one. Node Exporter and DCGM Exporter run on all compute nodes.
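Metrics can also be pulled programmatically over Prometheus' standard HTTP API; a minimal sketch, where the jumphost name and port are assumptions:

```python
# Minimal sketch: query a DCGM GPU-utilization metric from Prometheus.
import requests

resp = requests.get("http://jumphost:9090/api/v1/query",
                    params={"query": "DCGM_FI_DEV_GPU_UTIL"})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    m = series["metric"]
    print(m.get("instance"), "gpu", m.get("gpu"), "=", series["value"][1])
```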
February 2025¶
NVIDIA H200 GPU Accelerator support¶
Initial release of Instant Clusters, with support for NVIDIA H200 GPUs.