TPU Dispatch Guide

Theseus can dispatch jobs to Google Cloud TPU VMs. It manages the full lifecycle: creating the TPU VM if it doesn't exist, shipping code to all workers, and launching the job across the pod.

Prerequisites

Before using the TPU backend you need:

  1. The gcloud CLI installed and authenticated.

    gcloud auth login
    gcloud config set project my-project
    

  2. TPU quota in your GCP project for the accelerator type you want (e.g. v4-32, v5e-16). Check your quotas in the GCP Console.

  3. A GCP zone with TPU availability. Common zones:

     • us-central2-b (v4)
     • us-east1-d (v5e)
     • us-east5-b (v5p)

Check what's available:

gcloud compute tpus accelerator-types list --zone=us-central2-b

Dispatch Config

Add a TPU host to your ~/.theseus.yaml. Here is a walkthrough of every field:

clusters:
  # Paths on the TPU VM filesystem.
  # These are local to the VM, not a shared filesystem.
  gcp:
    root: /home/user/theseus-data     # data, checkpoints
    work: /home/user/theseus-work     # scratch/working directory
    log: /home/user/theseus-logs      # log directory

hosts:
  # The host key below becomes the TPU VM name in Google Cloud.
  # This is the name used in all gcloud commands (create, ssh, delete, etc.)
  # and what shows up in `gcloud compute tpus tpu-vm list`.
  # Pick something descriptive — you cannot change it after creation.
  my-tpu-v4:
    type: tpu

    # Must match a cluster entry above.
    cluster: gcp

    # GCP zone where the TPU will be created.
    zone: us-central2-b

    # GCP project. If omitted, uses your gcloud default project.
    project: my-gcp-project

    # TPU accelerator type. Format: "v{version}-{chips}".
    # The number is the total chip count across the pod.
    # Examples: "v4-8" (single host), "v4-32" (4 hosts x 8 chips),
    #           "v5e-16", "v5p-128"
    accelerator_type: v4-32

    # TPU software/runtime version.
    # List available versions:
    #   gcloud compute tpus versions list --zone=us-central2-b
    version: tpu-ubuntu2204-base

    # --- Pricing options (pick one or neither) ---

    # Spot VMs: cheaper, but GCP can preempt at any time.
    spot: true

    # Preemptible: cheaper, 24h time limit, can be preempted.
    # preemptible: false

    # --- Optional fields ---

    # VPC network and subnetwork (if your project uses custom networking).
    # network: my-vpc
    # subnetwork: my-subnet

    # GCP service account for the TPU VM.
    # service_account: sa@proj.iam.gserviceaccount.com

    # Use internal IP for SSH/SCP (required in some VPC setups).
    # internal_ip: false

    # Instance metadata key-value pairs.
    # metadata:
    #   startup-script: "echo hello"

    # uv dependency groups to sync in the bootstrap script.
    uv_groups: [tpu]

priority:
  - my-tpu-v4
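
A config like the one above is easy to get subtly wrong (a host pointing at a missing cluster, or both pricing options enabled). Here is a small sanity-check sketch; the required-field list and the helper name are inferred from the walkthrough above, not from theseus itself:

```python
# Hypothetical sanity check for TPU host entries in ~/.theseus.yaml.
# Field requirements are inferred from the walkthrough above.

REQUIRED_HOST_FIELDS = {"type", "cluster", "zone", "accelerator_type", "version"}

def check_tpu_host(name: str, host: dict, clusters: dict) -> list:
    """Return a list of problems found in a single TPU host entry."""
    problems = []
    missing = REQUIRED_HOST_FIELDS - host.keys()
    if missing:
        problems.append(f"{name}: missing fields {sorted(missing)}")
    if host.get("type") != "tpu":
        problems.append(f"{name}: type must be 'tpu'")
    if host.get("cluster") not in clusters:
        problems.append(f"{name}: cluster {host.get('cluster')!r} not defined")
    if host.get("spot") and host.get("preemptible"):
        problems.append(f"{name}: pick spot OR preemptible, not both")
    return problems

config = {
    "clusters": {"gcp": {"root": "/home/user/theseus-data"}},
    "hosts": {
        "my-tpu-v4": {
            "type": "tpu",
            "cluster": "gcp",
            "zone": "us-central2-b",
            "accelerator_type": "v4-32",
            "version": "tpu-ubuntu2204-base",
            "spot": True,
        }
    },
}

for host_name, host_cfg in config["hosts"].items():
    print(host_name, check_tpu_host(host_name, host_cfg, config["clusters"]))
```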

TPU VM Naming

The host key in your config IS the TPU VM name in Google Cloud. There is no separate name field — the YAML key is used directly:

hosts:
  my-tpu-v4:     # <- this creates/uses a GCP TPU VM named "my-tpu-v4"
    type: tpu
    ...

This is the name that appears in gcloud compute tpus tpu-vm list and that you use in all gcloud commands (ssh, describe, delete, etc.). If the VM doesn't exist yet, theseus will prompt you to create it with this name.

Submitting a Job

# Basic submit — solver matches your chip request against the TPU host:
theseus submit my-run experiment.yaml --chip tpu-v4 -n 32

# If you only have one TPU host, chip/n flags are optional:
theseus submit my-run experiment.yaml

# Override TPU software version:
theseus submit my-run experiment.yaml --tpu-version tpu-vm-v4-base

# Use spot pricing (override config):
theseus submit my-run experiment.yaml --tpu-spot

# Use on-demand pricing (override config):
theseus submit my-run experiment.yaml --tpu-on-demand

# Include uncommitted changes:
theseus submit my-run experiment.yaml --dirty

How It Works Under the Hood

When you run theseus submit targeting a TPU host:

  1. Solver matches your hardware request against the TPU host's accelerator_type. The chip name is parsed from the accelerator type (e.g. v4-32 → chip tpu-v4, 32 chips).

  2. TPU VM lifecycle: If the TPU VM doesn't exist, theseus prompts you for confirmation (creating a TPU incurs GCP costs), then creates it via gcloud compute tpus tpu-vm create and waits for it to reach READY state. If it already exists and is READY, this step is skipped.

  3. Code shipping: Your repo is git archive'd into a tarball, SCP'd to all workers in the TPU pod via gcloud compute tpus tpu-vm scp --worker=all, then extracted in-place. This ensures identical code on every host.

  4. Bootstrap scripts: The bootstrap shell script and Python dispatch script(s) are SCP'd to all workers the same way.

  5. Launch: The bootstrap script is executed on all workers simultaneously via gcloud compute tpus tpu-vm ssh --worker=all. Each worker runs uv sync --group tpu, then executes the dispatch Python script. JAX's jax.distributed.initialize() coordinates the workers into a single pod.

  6. Done: The job runs in the background via nohup. Logs are written to the log directory on the TPU VM.
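
The chip-name parsing in step 1 can be sketched as follows (an illustration of the mapping described above, not theseus's actual implementation):

```python
# Split an accelerator type like "v4-32" into the chip name the solver
# matches against ("tpu-v4") and the total chip count (32).

def parse_accelerator_type(accel: str):
    version, chips = accel.rsplit("-", 1)
    return f"tpu-{version}", int(chips)

print(parse_accelerator_type("v4-32"))   # ("tpu-v4", 32)
print(parse_accelerator_type("v5e-16"))  # ("tpu-v5e", 16)
```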

Monitoring Jobs

Log file naming

TPU jobs run as background processes on the VM. Logs are written to:

{log_dir}/{project}_{group}_{name}_{timestamp}.log

Where project defaults to "general" and group defaults to "default" if not specified. For example:

theseus submit my_run experiment.yaml --project myproj --group exp1
# -> /home/user/theseus-logs/myproj_exp1_my_run_20250304_143022.log

theseus submit train_gpt experiment.yaml
# -> /home/user/theseus-logs/general_default_train_gpt_20250304_143022.log

The exact path (including timestamp) is printed when the job is submitted.
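
The naming scheme can be expressed as a short function (a sketch; the timestamp format is an assumption based on the example paths above):

```python
# Reconstruct the log path scheme: {log_dir}/{project}_{group}_{name}_{timestamp}.log
from datetime import datetime

def log_path(log_dir, name, project="general", group="default", now=None):
    ts = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{log_dir}/{project}_{group}_{name}_{ts}.log"

print(log_path("/home/user/theseus-logs", "my_run",
               project="myproj", group="exp1",
               now=datetime(2025, 3, 4, 14, 30, 22)))
# -> /home/user/theseus-logs/myproj_exp1_my_run_20250304_143022.log
```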

Checking on a job

# SSH into worker 0:
gcloud compute tpus tpu-vm ssh my-tpu-v4 --zone=us-central2-b --worker=0

# List log files to find the right one:
ls -lt /home/user/theseus-logs/

# Tail the log file:
tail -f /home/user/theseus-logs/myproj_exp1_my_run_20250304_143022.log

# Or do it in one command without an interactive shell:
gcloud compute tpus tpu-vm ssh my-tpu-v4 --zone=us-central2-b --worker=0 \
  --command="tail -f /home/user/theseus-logs/myproj_exp1_my_run_*.log"

# Check if the process is still running (pgrep avoids matching itself,
# unlike a plain `ps aux | grep python`):
gcloud compute tpus tpu-vm ssh my-tpu-v4 --zone=us-central2-b --worker=0 \
  --command="pgrep -af python"

Deleting TPU VMs

TPU VMs incur costs while they exist, even when idle. Delete them when you're done:

gcloud compute tpus tpu-vm delete my-tpu-v4 --zone=us-central2-b --quiet

List all your TPU VMs:

gcloud compute tpus tpu-vm list --zone=us-central2-b

Troubleshooting

TPU VM creation fails with quota error

You've hit your TPU quota. Check and request increases in the GCP Console. Filter by TPU to find the relevant quota for your accelerator type.

"PREEMPTED" or "TERMINATED" state

Spot and preemptible TPUs can be reclaimed by GCP at any time. Re-submit the job; theseus will recreate the TPU VM (as with first-time creation, it prompts for confirmation first).

SSH connection fails

  • Check that gcloud auth login is current.
  • If using internal_ip: true, make sure you're on the same VPC or have a VPN/IAP tunnel configured.
  • Verify the TPU VM is in READY state:
    gcloud compute tpus tpu-vm describe my-tpu-v4 --zone=us-central2-b
    

JAX doesn't see all chips

  • Verify code was shipped to all workers (check that the work directory exists on each worker).
  • Check that jax.distributed.initialize() is being called. The bootstrap sets THESEUS_TPU_MODE=1 so the training code knows to initialize distributed.
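
The env-var gating described above can be sketched like this (the helper names are illustrative, not part of theseus; only the THESEUS_TPU_MODE variable comes from the bootstrap):

```python
# Training code checks the bootstrap's flag before initializing
# JAX's distributed runtime.
import os

def should_init_distributed(env=os.environ):
    """True when the bootstrap has flagged a multi-host TPU run."""
    return env.get("THESEUS_TPU_MODE") == "1"

def maybe_init_distributed():
    if should_init_distributed():
        import jax
        # Coordinates all pod workers into a single JAX process group.
        jax.distributed.initialize()

print(should_init_distributed({"THESEUS_TPU_MODE": "1"}))  # True
print(should_init_distributed({}))                          # False
```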

Wrong TPU software version

List available versions and pick one that matches your JAX version:

gcloud compute tpus versions list --zone=us-central2-b