tpu
tpu
¶
Google Cloud TPU VM utilities for remote dispatch.
Wraps gcloud compute tpus tpu-vm commands to provide an SSH-like interface for creating, managing, and executing commands on TPU VMs.
Assumes the user has gcloud installed and authenticated locally.
parse_accelerator_type(accel_type: str) -> tuple[str, int]
¶
Parse TPU accelerator type into (chip_name, total_chips).
Examples:
"v4-32" -> ("tpu-v4", 32) "v5e-16" -> ("tpu-v5e", 16) "v3-8" -> ("tpu-v3", 8)
run(cmd: str, tpu_name: str, zone: str, project: str | None = None, worker: str = 'all', internal_ip: bool = False, timeout: float | None = None) -> RunResult
¶
Execute a command on a TPU VM via gcloud compute tpus tpu-vm ssh.
Uses --worker=all by default so the same command runs on every worker
in a TPU pod simultaneously.
copy_to(local_path: str | Path, tpu_name: str, remote_path: str, zone: str, project: str | None = None, worker: str = 'all', internal_ip: bool = False, timeout: float | None = None) -> RunResult
¶
Copy a local file or directory to a TPU VM via gcloud compute tpus tpu-vm scp.
Copies to all workers by default so every host gets identical files.
create(name: str, zone: str, accelerator_type: str, version: str, project: str | None = None, spot: bool = False, preemptible: bool = False, network: str | None = None, subnetwork: str | None = None, service_account: str | None = None, metadata: dict[str, str] | None = None, timeout: float | None = None) -> RunResult
¶
Create a TPU VM.
.. warning:: This incurs GCP costs. The dispatch layer prompts the user for confirmation before calling this function.
delete(name: str, zone: str, project: str | None = None, timeout: float | None = None) -> RunResult
¶
Delete a TPU VM.
Uses --quiet to skip interactive confirmation from gcloud itself.
describe(name: str, zone: str, project: str | None = None, timeout: float | None = None) -> dict[str, Any] | None
¶
Get TPU VM description as parsed JSON. Returns None if not found.
get_status(name: str, zone: str, project: str | None = None, timeout: float | None = None) -> str | None
¶
Get TPU VM state (e.g. READY, CREATING). Returns None if not found.
wait_ready(name: str, zone: str, project: str | None = None, timeout: float = 600.0, poll_interval: float = 15.0) -> bool
¶
Block until the TPU VM reaches READY state.
Returns True on success, False on timeout or terminal state.
forward_port(tpu_name: str, zone: str, local_port: int, remote_port: int, project: str | None = None, worker: str = '0', internal_ip: bool = False) -> TunnelResult
¶
Start a background SSH tunnel via gcloud compute tpus tpu-vm ssh.
Forwards local_port on the dispatching machine to remote_port on the
TPU VM worker. Only runs on a single worker (default 0) since REPL
sessions are single-host.
ship(tpu_name: str, remote_path: str, zone: str, project: str | None = None, internal_ip: bool = False, repo_path: str | Path | None = None, ref: str = 'HEAD', timeout: float | None = None) -> RunResult
¶
Ship a code snapshot to all TPU VM workers.
Creates a tarball via git archive, SCPs it to every worker, then
extracts in-place. This guarantees identical code across all hosts.
ship_dirty(tpu_name: str, remote_path: str, zone: str, project: str | None = None, internal_ip: bool = False, repo_path: str | Path | None = None, timeout: float | None = None) -> RunResult
¶
Ship code including uncommitted changes to all TPU VM workers.