checkpoints

Checkpoint service: browses checkpoints_dir for saved checkpoints.

Checkpoint directory structure (from CheckpointedJob):

    checkpoints_dir/
      {project}/
        {group}/
          {job_name}/
            latest              # text file with latest suffix
            {nested_dirs}/      # can be nested up to 3 levels
              config.yaml       # This marks a checkpoint directory
              checkpoint/       # Orbax checkpoint data
              job.json
              rng.npy
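The layout above implies that a checkpoint directory can be recognised by its config.yaml marker, at most 3 levels below the job directory. The following is a minimal sketch of such a scan, not the service's actual implementation; the function name `find_checkpoint_dirs` and the choice to stop descending once a marker is found are assumptions made for illustration.

```python
from pathlib import Path

MAX_NESTING = 3  # from the layout: nested dirs go at most 3 levels deep


def find_checkpoint_dirs(job_dir: Path, max_depth: int = MAX_NESTING) -> list[Path]:
    """Return directories under job_dir that contain a config.yaml marker.

    Hypothetical helper: the real service may scan differently.
    """
    found: list[Path] = []

    def walk(current: Path, depth: int) -> None:
        if (current / "config.yaml").is_file():
            found.append(current)
            return  # assumption: checkpoints are not nested inside checkpoints
        if depth >= max_depth:
            return
        for child in sorted(current.iterdir()):
            if child.is_dir():
                walk(child, depth + 1)

    # Direct children of the job directory are nesting level 1.
    for child in sorted(job_dir.iterdir()):
        if child.is_dir():
            walk(child, 1)
    return found
```

The path of each result relative to the job directory would then play the role of the checkpoint suffix, assuming the suffix is the nested path.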

CheckpointService(checkpoints_dir: Path)

Service for browsing checkpoints.

list_all_checkpoints(project: Optional[str] = None, group: Optional[str] = None, job_name: Optional[str] = None, limit: int = 100) -> list[CheckpointInfo]

List all checkpoints, optionally filtered.

Returns checkpoints sorted by creation time (most recent first).
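The filtering, ordering, and limit behaviour can be sketched as follows. The fields of CheckpointInfo are not documented here, so `Info` is a hypothetical stand-in with an assumed `created_at` field, and `filter_and_sort` is an illustrative helper rather than part of the service.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Info:
    """Hypothetical stand-in for CheckpointInfo; field names are assumptions."""
    project: str
    group: str
    job_name: str
    created_at: datetime


def filter_and_sort(
    items: list[Info],
    project: Optional[str] = None,
    group: Optional[str] = None,
    job_name: Optional[str] = None,
    limit: int = 100,
) -> list[Info]:
    # A filter set to None matches everything.
    matches = [
        i for i in items
        if (project is None or i.project == project)
        and (group is None or i.group == group)
        and (job_name is None or i.job_name == job_name)
    ]
    # Most recent first, then truncate to the requested limit.
    matches.sort(key=lambda i: i.created_at, reverse=True)
    return matches[:limit]
```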

get_checkpoint(project: str, group: str, job_name: str, suffix: str) -> Optional[CheckpointInfo]

Get a specific checkpoint.

get_latest_checkpoint(project: str, group: str, job_name: str) -> Optional[CheckpointInfo]

Get the latest checkpoint for a job.
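Given the directory layout, one plausible way to resolve the latest checkpoint is to read the suffix from the job's `latest` text file and check that the directory it names carries the config.yaml marker. This is a sketch under that assumption; both helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_latest_suffix(job_dir: Path) -> Optional[str]:
    """Return the suffix stored in job_dir/latest, or None if absent or empty."""
    latest_file = job_dir / "latest"
    if not latest_file.is_file():
        return None
    suffix = latest_file.read_text().strip()
    return suffix or None


def latest_checkpoint_dir(job_dir: Path) -> Optional[Path]:
    """Resolve the latest suffix to a checkpoint directory, if it is valid."""
    suffix = read_latest_suffix(job_dir)
    if suffix is None:
        return None
    candidate = job_dir / suffix
    # Assumption: a directory only counts as a checkpoint if config.yaml exists.
    return candidate if (candidate / "config.yaml").is_file() else None
```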

list_job_checkpoints(project: str, group: str, job_name: str) -> list[CheckpointInfo]

List all checkpoints for a specific job.

get_checkpoint_config(project: str, group: str, job_name: str, suffix: str) -> Optional[dict[str, Any]]

Read the config from a checkpoint (tries config.yaml then config.json).
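The YAML-then-JSON fallback can be sketched like this. It is illustrative only: `read_config` is a hypothetical name, PyYAML is assumed as the YAML parser and treated as optional, and the real method may handle parse errors differently.

```python
import json
from pathlib import Path
from typing import Any, Optional

try:
    import yaml  # PyYAML; assumed parser, optional in this sketch
except ImportError:
    yaml = None


def read_config(checkpoint_dir: Path) -> Optional[dict[str, Any]]:
    """Try config.yaml first, then config.json; return None if neither is readable."""
    yaml_path = checkpoint_dir / "config.yaml"
    if yaml is not None and yaml_path.is_file():
        with yaml_path.open() as f:
            return yaml.safe_load(f)

    json_path = checkpoint_dir / "config.json"
    if json_path.is_file():
        with json_path.open() as f:
            return json.load(f)

    return None
```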

get_checkpoint_job_spec(project: str, group: str, job_name: str, suffix: str) -> Optional[dict[str, Any]]

Read the job.json from a checkpoint.

count_checkpoints() -> int

Count total number of checkpoints.

get_total_size() -> int

Get total size of all checkpoints in bytes.

format_size(size_bytes: int) -> str

Format bytes as human-readable string.
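A minimal sketch of the two size helpers follows. The exact output format of format_size is not specified above, so `human_readable_size` assumes 1024-based steps, one decimal place, and the labels B/KB/MB/GB/TB; `total_size` assumes the total is the sum of all regular files under the root.

```python
from pathlib import Path


def total_size(root: Path) -> int:
    """Sum the sizes in bytes of all regular files below root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def human_readable_size(size_bytes: int) -> str:
    """Format a byte count using 1024-based units (assumed format)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(size_bytes)
    idx = 0
    # Step up one unit at a time until the value fits or units run out.
    while size >= 1024 and idx < len(units) - 1:
        size /= 1024
        idx += 1
    return f"{size:.1f} {units[idx]}"
```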

delete_job_checkpoints(project: str, group: str, job_name: str) -> bool

Delete all checkpoints for a job. Returns True if successful.
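Deleting a job's checkpoints amounts to removing the job directory, so a sketch might look like the following. `delete_job_dir` is a hypothetical helper, and the guard that refuses paths resolving outside the checkpoints root is an added assumption, not documented behaviour.

```python
import shutil
from pathlib import Path


def delete_job_dir(checkpoints_dir: Path, project: str, group: str, job_name: str) -> bool:
    """Remove {project}/{group}/{job_name} under checkpoints_dir.

    Returns True if the directory existed and was removed, False otherwise.
    """
    root = checkpoints_dir.resolve()
    job_dir = (root / project / group / job_name).resolve()
    # Safety guard (assumption): never delete anything outside the root.
    if root not in job_dir.parents:
        return False
    if not job_dir.is_dir():
        return False
    try:
        shutil.rmtree(job_dir)
    except OSError:
        return False
    return True
```

Because deletion is irreversible, resolving the path and checking it against the root before calling `shutil.rmtree` is a cheap protection against components such as `..` in the arguments.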