vastai-prod-checklist

Name: vastai-prod-checklist - Agent Skill
Rating: 5.0 (1036 reviews)
Author: jeremylongshore

1,036

MIT

Execute Vast.ai production deployment checklist for GPU workloads. Use when deploying training pipelines to production, preparing for large-scale GPU jobs, or auditing production readiness. Trigger with phrases like "vastai production", "deploy vastai", "vastai go-live", "vastai launch checklist".

Download Folder View on GitHub

CLI Install

Recommended

Use the skills CLI to install this skill with one command. Auto-detects all installed AI assistants.

Method 1 - skills CLI

npx skills i jeremylongshore/claude-code-plugins-plus-skills/plugins/saas-packs/vastai-pack/skills/vastai-prod-checklist

Method 2 - openskills (supports sync & update)

npx openskills install jeremylongshore/claude-code-plugins-plus-skills

Auto-detects Claude Code, Cursor, Codex CLI, Gemini CLI, and more. One install, works everywhere.

Installation Path

Download and extract to one of the following locations:

~/.claude/skills/vastai-prod-checklist/

SKILL.md

Skill Instructions

Back

Run with Cloud Agent

No setup needed. Let our cloud agents run this skill for you.

Select Provider

Select Model

Claude Sonnet 4.5

$0.20/task

Best for coding tasks

No setup required

Vast.ai Production Checklist

Overview

Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.

Prerequisites

Vast.ai account with sufficient credits
Docker images tested and published to registry
Checkpoint-based training pipeline

Instructions

Account & Authentication

API key stored in secrets manager (not in code or env files)
Dedicated SSH key pair for Vast.ai (not shared with other services)
Account balance sufficient for planned workload duration + 50% buffer
Billing alerts configured at cloud.vast.ai

Instance Selection

GPU type validated for workload (VRAM, compute capability)
Reliability filter set to >= 0.98 for production jobs
Internet speed filter set to inet_down >= 200 for data transfer
Disk allocation includes room for checkpoints + data + 20% overhead
CUDA version on host matches Docker image requirements

Data Safety

Training data encrypted before upload to instances
Checkpoint saving every N steps (not just per epoch)
Checkpoints uploaded to persistent storage (S3/GCS) periodically
Instance cleanup script removes data before destruction
No sensitive data (API keys, PII) embedded in Docker images

Spot Instance Protection

Spot preemption handler implemented (save checkpoint on SIGTERM)
Auto-recovery: detect destroyed instance, provision replacement, resume
On-demand fallback configured for critical final training stages
Checkpoint integrity verification after recovery

Monitoring & Alerting

GPU utilization monitoring (alert if < 50% for > 10 min)
Instance health polling every 60 seconds
Cost accumulation tracking with budget threshold alerts
Training loss/metrics logged to external service (W&B, MLflow)
Dead instance detection (auto-destroy stuck instances)

Cost Controls

Maximum dph_total set in search queries
Auto-destroy timeout for all instances (e.g., 24h max)
Daily spending limit configured
Cost-per-job tracking for budget reporting

Verification Script

#!/bin/bash
set -euo pipefail
echo "Vast.ai Production Readiness Check"
 
# 1. Auth
vastai show user --raw | python3 -c "
import sys, json; u=json.load(sys.stdin)
balance = u.get('balance', 0)
print(f'  Auth: OK | Balance: \${balance:.2f}')
assert balance >= 10, f'Balance too low: \${balance:.2f}'

Output

Production readiness checklist verified
Verification script passes all checks
Cost controls and monitoring configured
Data safety measures in place

Error Handling

Error	Cause	Solution
Insufficient balance	Credits depleted mid-job	Set up auto-top-up or balance alerts
Instance preempted during final epoch	Spot instance reclaimed	Use on-demand for final training stage
Checkpoint corrupted	Interrupted mid-save	Implement atomic checkpoint writes (save to temp, rename)
GPU utilization drops to 0%	Data pipeline bottleneck	Profile data loading; increase disk I/O

Resources

Next Steps

For version upgrades, see vastai-upgrade-migration.

Examples

Pre-launch audit: Run the verification script, check all boxes, confirm Docker image pulls successfully, and verify at least 3 matching offers are available before starting a production training run.

Budget-safe launch: Set max_dph=2.00, auto-destroy timeout of 12 hours, and daily spend alert at $50 to prevent cost overruns.

# 2. Offer availability

COUNT=$(vastai search offers 'reliability>0.98 num_gpus=1 rentable=true' --raw --limit 1 | python3 -c "import sys,json; print(len(json.load(sys.stdin)))")

echo " Offers available: $COUNT+ | PASS"

# 3. Docker image pullable

docker pull pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime > /dev/null 2>&1 && echo " Docker image: PASS" || echo " Docker image: FAIL"

echo "Pre-flight checks complete."