TPU Dispatch Guide¶
Theseus can dispatch jobs to Google Cloud TPU VMs. It manages the full lifecycle: creating the TPU VM if it doesn't exist, shipping code to all workers, and launching the job across the pod.
Prerequisites¶
Before using the TPU backend you need:
- The `gcloud` CLI installed and authenticated.
- TPU quota in your GCP project for the accelerator type you want (e.g. `v4-32`, `v5e-16`). Check your quotas in the GCP Console.
- A GCP zone with TPU availability. Common zones:
    - `us-central2-b` (v4)
    - `us-east1-d` (v5e)
    - `us-east5-b` (v5p)
Check what's available:
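One way to check is to query gcloud directly (the zone below is only an example):

```shell
# List the TPU accelerator types offered in a zone:
gcloud compute tpus accelerator-types list --zone=us-central2-b

# List the zones/locations that support TPUs at all:
gcloud compute tpus locations list
```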
Dispatch Config¶
Add a TPU host to your ~/.theseus.yaml. Here is a walkthrough of every
field:
```yaml
clusters:
  # Paths on the TPU VM filesystem.
  # These are local to the VM, not a shared filesystem.
  gcp:
    root: /home/user/theseus-data  # data, checkpoints
    work: /home/user/theseus-work  # scratch/working directory
    log: /home/user/theseus-logs   # log directory

hosts:
  # The host key below becomes the TPU VM name in Google Cloud.
  # This is the name used in all gcloud commands (create, ssh, delete, etc.)
  # and what shows up in `gcloud compute tpus tpu-vm list`.
  # Pick something descriptive; you cannot change it after creation.
  my-tpu-v4:
    type: tpu

    # Must match a cluster entry above.
    cluster: gcp

    # GCP zone where the TPU will be created.
    zone: us-central2-b

    # GCP project. If omitted, uses your gcloud default project.
    project: my-gcp-project

    # TPU accelerator type. Format: "v{version}-{chips}".
    # The number is the total chip count across the pod.
    # Examples: "v4-8" (single host), "v4-32" (4 hosts x 8 chips),
    # "v5e-16", "v5p-128".
    accelerator_type: v4-32

    # TPU software/runtime version. List available versions with:
    #   gcloud compute tpus versions list --zone=us-central2-b
    version: tpu-ubuntu2204-base

    # --- Pricing options (pick one or neither) ---
    # Spot VMs: cheaper, but GCP can preempt them at any time.
    spot: true
    # Preemptible VMs: cheaper, 24-hour time limit, can be preempted.
    # preemptible: false

    # --- Optional fields ---
    # VPC network and subnetwork (if your project uses custom networking).
    # network: my-vpc
    # subnetwork: my-subnet

    # GCP service account for the TPU VM.
    # service_account: sa@proj.iam.gserviceaccount.com

    # Use internal IP for SSH/SCP (required in some VPC setups).
    # internal_ip: false

    # Instance metadata key-value pairs.
    # metadata:
    #   startup-script: "echo hello"

    # uv dependency groups to sync in the bootstrap script.
    uv_groups: [tpu]

priority:
  - my-tpu-v4
```
TPU VM Naming¶
The host key in your config *is* the TPU VM name in Google Cloud. There is no
separate name field; the YAML key is used directly.
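For example, with the host defined in the config walkthrough above (names illustrative):

```yaml
hosts:
  my-tpu-v4:   # <- the TPU VM will be named "my-tpu-v4" in GCP
    type: tpu
    cluster: gcp
```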
This is the name that appears in `gcloud compute tpus tpu-vm list` and that you
use in all gcloud commands (`ssh`, `describe`, `delete`, etc.). If the VM
doesn't exist yet, theseus will prompt you to create it with this name.
Submitting a Job¶
```shell
# Basic submit: the solver matches your chip request against the TPU host:
theseus submit my-run experiment.yaml --chip tpu-v4 -n 32

# If you only have one TPU host, the chip/count flags are optional:
theseus submit my-run experiment.yaml

# Override the TPU software version:
theseus submit my-run experiment.yaml --tpu-version tpu-vm-v4-base

# Use spot pricing (overrides the config):
theseus submit my-run experiment.yaml --tpu-spot

# Use on-demand pricing (overrides the config):
theseus submit my-run experiment.yaml --tpu-on-demand

# Include uncommitted changes:
theseus submit my-run experiment.yaml --dirty
```
How It Works Under the Hood¶
When you run theseus submit targeting a TPU host:
1. Solver: matches your hardware request against the TPU host's `accelerator_type`. The chip name is parsed from the accelerator type (e.g. `v4-32` → chip `tpu-v4`, 32 chips).
2. TPU VM lifecycle: if the TPU VM doesn't exist, theseus prompts you for confirmation (creating a TPU incurs GCP costs), then creates it via `gcloud compute tpus tpu-vm create` and waits for it to reach the `READY` state. If it already exists and is `READY`, this step is skipped.
3. Code shipping: your repo is `git archive`'d into a tarball, SCP'd to all workers in the TPU pod via `gcloud compute tpus tpu-vm scp --worker=all`, then extracted in place. This ensures identical code on every host.
4. Bootstrap scripts: the bootstrap shell script and Python dispatch script(s) are SCP'd to all workers the same way.
5. Launch: the bootstrap script is executed on all workers simultaneously via `gcloud compute tpus tpu-vm ssh --worker=all`. Each worker runs `uv sync --group tpu`, then executes the dispatch Python script. JAX's `jax.distributed.initialize()` coordinates the workers into a single pod.
6. Done: the job runs in the background via `nohup`. Logs are written to the `log` directory on the TPU VM.
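The chip parsing in step 1 can be sketched as follows (a hypothetical helper; the real theseus internals may differ):

```python
def parse_accelerator_type(accelerator_type: str) -> tuple[str, int]:
    """Split an accelerator type like "v4-32" into (chip name, total chips)."""
    version, chips = accelerator_type.split("-")
    return f"tpu-{version}", int(chips)

print(parse_accelerator_type("v4-32"))   # -> ('tpu-v4', 32)
print(parse_accelerator_type("v5e-16"))  # -> ('tpu-v5e', 16)
```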
Monitoring Jobs¶
Log file naming¶
TPU jobs run as background processes on the VM. Logs are written to
`<log dir>/<project>_<group>_<run name>_<YYYYMMDD_HHMMSS>.log`, where
`project` defaults to `general` and `group` defaults to `default` if not
specified. For example:

```shell
theseus submit my_run experiment.yaml --project myproj --group exp1
# -> /home/user/theseus-logs/myproj_exp1_my_run_20250304_143022.log

theseus submit train_gpt experiment.yaml
# -> /home/user/theseus-logs/general_default_train_gpt_20250304_143022.log
```
The exact path (including timestamp) is printed when the job is submitted.
Checking on a job¶
```shell
# SSH into worker 0:
gcloud compute tpus tpu-vm ssh my-tpu-v4 --zone=us-central2-b --worker=0

# List log files to find the right one:
ls -lt /home/user/theseus-logs/

# Tail the log file:
tail -f /home/user/theseus-logs/myproj_exp1_my_run_20250304_143022.log

# Or do it in one command without an interactive shell:
gcloud compute tpus tpu-vm ssh my-tpu-v4 --zone=us-central2-b --worker=0 \
  --command="tail -f /home/user/theseus-logs/myproj_exp1_my_run_*.log"

# Check whether the process is still running:
gcloud compute tpus tpu-vm ssh my-tpu-v4 --zone=us-central2-b --worker=0 \
  --command="ps aux | grep python"
```
Deleting TPU VMs¶
TPU VMs incur costs while they exist, even when idle. Delete them when you're done:
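For example (VM name and zone are from the sample config; substitute your own):

```shell
gcloud compute tpus tpu-vm delete my-tpu-v4 --zone=us-central2-b
```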
List all your TPU VMs:
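One way (the zone is an example):

```shell
gcloud compute tpus tpu-vm list --zone=us-central2-b
```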
Troubleshooting¶
TPU VM creation fails with quota error¶
You've hit your TPU quota. Check and request increases in the GCP Console;
filter by "TPU" to find the relevant quota for your accelerator type.
"PREEMPTED" or "TERMINATED" state¶
Spot and preemptible TPUs can be reclaimed by GCP at any time. Re-submit the job — theseus will recreate the TPU VM automatically.
SSH connection fails¶
- Check that your `gcloud auth login` credentials are current.
- If using `internal_ip: true`, make sure you're on the same VPC or have a VPN/IAP tunnel configured.
- Verify the TPU VM is in the `READY` state:
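One way to check the state (VM name and zone are examples):

```shell
gcloud compute tpus tpu-vm describe my-tpu-v4 \
  --zone=us-central2-b --format="value(state)"
```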
JAX doesn't see all chips¶
- Verify code was shipped to all workers (check that the work directory exists on each worker).
- Check that `jax.distributed.initialize()` is being called. The bootstrap sets `THESEUS_TPU_MODE=1` so the training code knows to initialize distributed mode.
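A minimal sketch of that gate in training code, assuming the `THESEUS_TPU_MODE` convention above (the helper name is illustrative):

```python
import os


def maybe_init_distributed() -> bool:
    """Initialize JAX distributed mode only when theseus launched us in TPU
    mode (the bootstrap exports THESEUS_TPU_MODE=1). Returns True if
    distributed mode was initialized."""
    if os.environ.get("THESEUS_TPU_MODE") != "1":
        return False
    import jax  # lazy import: non-TPU runs don't need JAX here

    # Auto-detects coordinator address and worker topology on Cloud TPU VMs.
    jax.distributed.initialize()
    return True
```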
Wrong TPU software version¶
List available versions and pick one that matches your JAX version:
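For example (the zone is an example; this is the same command noted in the config walkthrough):

```shell
gcloud compute tpus versions list --zone=us-central2-b
```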