tpu

Google Cloud TPU VM utilities for remote dispatch.

Wraps gcloud compute tpus tpu-vm commands to provide an SSH-like interface for creating, managing, and executing commands on TPU VMs.

Assumes the user has gcloud installed and authenticated locally.

parse_accelerator_type(accel_type: str) -> tuple[str, int]

Parse TPU accelerator type into (chip_name, total_chips).

Examples:

"v4-32" -> ("tpu-v4", 32)
"v5e-16" -> ("tpu-v5e", 16)
"v3-8" -> ("tpu-v3", 8)

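The mapping in the examples above can be reproduced with a few lines of string handling. The following is a minimal sketch, not the module's implementation: the name parse_accelerator_type_sketch is hypothetical, and the error handling for malformed input is an assumption, since the text does not say how bad input is treated.

```python
def parse_accelerator_type_sketch(accel_type: str) -> tuple[str, int]:
    """Split an accelerator type such as "v4-32" into (chip_name, total_chips)."""
    # rpartition splits on the last hyphen, so the count is always the final field.
    generation, sep, count = accel_type.rpartition("-")
    if not sep or not generation or not (count.isascii() and count.isdigit()):
        # Assumption: malformed input raises ValueError.
        raise ValueError(f"unrecognized accelerator type: {accel_type!r}")
    return f"tpu-{generation}", int(count)


for example in ("v4-32", "v5e-16", "v3-8"):
    print(example, "->", parse_accelerator_type_sketch(example))
```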
run(cmd: str, tpu_name: str, zone: str, project: str | None = None, worker: str = 'all', internal_ip: bool = False, timeout: float | None = None) -> RunResult

Execute a command on a TPU VM via gcloud compute tpus tpu-vm ssh.

Uses --worker=all by default so the same command runs on every worker in a TPU pod simultaneously.
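To make the wrapping concrete, here is one way the underlying gcloud invocation might be assembled from the parameters of run. This is a sketch under assumptions: build_ssh_argv is a hypothetical helper, the TPU name and zone are placeholders, and the module may build its command differently. The flags used (--zone, --worker, --command, --project, --internal-ip) are standard gcloud flags.

```python
from __future__ import annotations


def build_ssh_argv(
    cmd: str,
    tpu_name: str,
    zone: str,
    project: str | None = None,
    worker: str = "all",
    internal_ip: bool = False,
) -> list[str]:
    """Assemble a gcloud argv list; nothing is executed here."""
    argv = [
        "gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        f"--zone={zone}",
        f"--worker={worker}",   # "all" targets every worker in the pod
        f"--command={cmd}",
    ]
    if project is not None:
        argv.append(f"--project={project}")
    if internal_ip:
        argv.append("--internal-ip")
    return argv


# Placeholder TPU name and zone, for illustration only.
print(build_ssh_argv("hostname", "my-tpu", "us-central2-b"))
```

Passing the list to subprocess.run(argv, capture_output=True, text=True, timeout=timeout) avoids a local shell, so the remote command string reaches gcloud without extra local quoting.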

copy_to(local_path: str | Path, tpu_name: str, remote_path: str, zone: str, project: str | None = None, worker: str = 'all', internal_ip: bool = False, timeout: float | None = None) -> RunResult

Copy a local file or directory to a TPU VM via gcloud compute tpus tpu-vm scp.

Copies to all workers by default so every host gets identical files.

create(name: str, zone: str, accelerator_type: str, version: str, project: str | None = None, spot: bool = False, preemptible: bool = False, network: str | None = None, subnetwork: str | None = None, service_account: str | None = None, metadata: dict[str, str] | None = None, timeout: float | None = None) -> RunResult

Create a TPU VM.

Warning: Creating a TPU VM incurs GCP costs. The dispatch layer prompts the user for confirmation before calling this function.
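The optional parameters of create map naturally onto gcloud flags. The sketch below shows one plausible mapping; build_create_argv is a hypothetical helper, the example values are placeholders, and the real function may order or encode its arguments differently. It only builds the argument list, so running it creates nothing and costs nothing.

```python
from __future__ import annotations


def build_create_argv(
    name: str,
    zone: str,
    accelerator_type: str,
    version: str,
    project: str | None = None,
    spot: bool = False,
    preemptible: bool = False,
    network: str | None = None,
    subnetwork: str | None = None,
    service_account: str | None = None,
    metadata: dict[str, str] | None = None,
) -> list[str]:
    """Assemble the argv for `gcloud compute tpus tpu-vm create` (not executed)."""
    argv = [
        "gcloud", "compute", "tpus", "tpu-vm", "create", name,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        f"--version={version}",
    ]
    optional = {
        "--project": project,
        "--network": network,
        "--subnetwork": subnetwork,
        "--service-account": service_account,
    }
    for flag, value in optional.items():
        if value is not None:
            argv.append(f"{flag}={value}")
    if spot:
        argv.append("--spot")
    if preemptible:
        argv.append("--preemptible")
    if metadata:
        # Assumption: keys and values contain no commas or equals signs.
        pairs = ",".join(f"{key}={value}" for key, value in metadata.items())
        argv.append(f"--metadata={pairs}")
    return argv


print(build_create_argv("demo", "us-central2-b", "v4-32", "tpu-ubuntu2204-base", spot=True))
```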

delete(name: str, zone: str, project: str | None = None, timeout: float | None = None) -> RunResult

Delete a TPU VM.

Uses --quiet to skip interactive confirmation from gcloud itself.

describe(name: str, zone: str, project: str | None = None, timeout: float | None = None) -> dict[str, Any] | None

Get TPU VM description as parsed JSON. Returns None if not found.

get_status(name: str, zone: str, project: str | None = None, timeout: float | None = None) -> str | None

Get TPU VM state (e.g. READY, CREATING). Returns None if not found.
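The relationship between describe and get_status can be sketched as two small parsing steps. This assumes the description comes from gcloud with --format=json and that the state sits under a top-level "state" key. Both helper names are hypothetical, the sample JSON is made up for illustration, and treating unparseable output as "not found" is an assumption.

```python
from __future__ import annotations

import json
from typing import Any


def parse_describe_output(stdout: str) -> dict[str, Any] | None:
    """Parse gcloud's JSON output; return None when there is nothing usable."""
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError:
        # Assumption: empty or non-JSON output means the TPU VM was not found.
        return None
    return data if isinstance(data, dict) else None


def state_of(description: dict[str, Any] | None) -> str | None:
    """Extract the state string (e.g. READY, CREATING) from a description."""
    if description is None:
        return None
    state = description.get("state")
    return state if isinstance(state, str) else None


# Illustrative sample, not real gcloud output.
sample = '{"acceleratorType": "v4-32", "state": "READY"}'
print(state_of(parse_describe_output(sample)))
print(state_of(parse_describe_output("")))
```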

wait_ready(name: str, zone: str, project: str | None = None, timeout: float = 600.0, poll_interval: float = 15.0) -> bool

Block until the TPU VM reaches READY state.

Returns True once the VM is READY, and False if the timeout elapses or the VM enters a terminal state first.
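The polling behavior described here might look like the loop below. It is a sketch, not the module's code: wait_ready_sketch takes a hypothetical get_state callable in place of get_status so it can run without gcloud, the sleep and clock parameters exist only to make it testable, and the set of terminal states is an assumption because the text does not list them.

```python
from __future__ import annotations

import time
from typing import Callable

# Assumption: the states treated as terminal are not named in the docs above.
TERMINAL_STATES = frozenset({"PREEMPTED", "TERMINATED"})


def wait_ready_sketch(
    get_state: Callable[[], str | None],
    timeout: float = 600.0,
    poll_interval: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state until READY, a terminal state, or the deadline."""
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state == "READY":
            return True
        if state in TERMINAL_STATES:
            return False
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))


# Simulated state sequence with a no-op sleep, so the demo finishes instantly.
states = iter(["CREATING", "CREATING", "READY"])
print(wait_ready_sketch(lambda: next(states), sleep=lambda _seconds: None))
```

A monotonic clock is used for the deadline so that wall-clock adjustments during a long wait do not shorten or extend the timeout.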

forward_port(tpu_name: str, zone: str, local_port: int, remote_port: int, project: str | None = None, worker: str = '0', internal_ip: bool = False) -> TunnelResult

Start a background SSH tunnel via gcloud compute tpus tpu-vm ssh.

Forwards local_port on the dispatching machine to remote_port on the TPU VM worker. The tunnel targets a single worker (default 0) because REPL sessions are single-host.
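A port forward of this kind is typically an ssh -L tunnel passed through gcloud. The sketch below shows one way the command might be assembled. build_tunnel_argv is a hypothetical helper, and handing -N and -L to ssh after a "--" separator is an assumption about how the module does it; gcloud also offers --ssh-flag for the same purpose.

```python
from __future__ import annotations


def build_tunnel_argv(
    tpu_name: str,
    zone: str,
    local_port: int,
    remote_port: int,
    project: str | None = None,
    worker: str = "0",
    internal_ip: bool = False,
) -> list[str]:
    """Assemble the argv for an SSH tunnel to one worker (not executed)."""
    argv = [
        "gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        f"--zone={zone}",
        f"--worker={worker}",
    ]
    if project is not None:
        argv.append(f"--project={project}")
    if internal_ip:
        argv.append("--internal-ip")
    # Arguments after "--" are handed to the underlying ssh client:
    # -N opens no remote shell, -L forwards local_port to remote_port.
    argv += ["--", "-N", "-L", f"{local_port}:localhost:{remote_port}"]
    return argv


# Placeholder TPU name, zone, and ports.
print(build_tunnel_argv("my-tpu", "us-central2-b", 8888, 8888))
```

Starting this with subprocess.Popen rather than subprocess.run would keep the tunnel alive in the background while the caller continues.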

ship(tpu_name: str, remote_path: str, zone: str, project: str | None = None, internal_ip: bool = False, repo_path: str | Path | None = None, ref: str = 'HEAD', timeout: float | None = None) -> RunResult

Ship a code snapshot to all TPU VM workers.

Creates a tarball via git archive, copies it to every worker with scp, then extracts it in place. Because every host unpacks the same archive, all hosts end up with identical code.
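The archive and extract steps can be sketched as two command builders. This is an illustration under assumptions: both helper names are hypothetical, the tarball locations are placeholders, and the remote extraction command is a guess at a reasonable shape rather than the module's own. The copy step in between would use the scp wrapper described earlier.

```python
from __future__ import annotations

import shlex


def build_archive_argv(tarball: str, ref: str = "HEAD", repo_path: str = ".") -> list[str]:
    """argv for snapshotting a committed ref into a gzip-compressed tarball."""
    return [
        "git", "-C", repo_path, "archive",
        "--format=tar.gz",
        f"--output={tarball}",
        ref,
    ]


def build_extract_command(remote_tarball: str, remote_path: str) -> str:
    """Shell command to run on each worker: create the target dir, then unpack."""
    dest = shlex.quote(remote_path)
    tarball = shlex.quote(remote_tarball)
    return f"mkdir -p {dest} && tar -xzf {tarball} -C {dest}"


# Placeholder paths for illustration.
print(build_archive_argv("/tmp/snapshot.tar.gz"))
print(build_extract_command("/tmp/snapshot.tar.gz", "/home/user/my code"))
```

Because git archive reads from the object database rather than the working tree, the snapshot contains only what is committed at ref. Quoting with shlex.quote keeps remote paths containing spaces intact when the command string is run on the workers.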

ship_dirty(tpu_name: str, remote_path: str, zone: str, project: str | None = None, internal_ip: bool = False, repo_path: str | Path | None = None, timeout: float | None = None) -> RunResult

Ship code including uncommitted changes to all TPU VM workers.