job

BasicJob(spec: ExecutionSpec)

Bases: _BaseJob, Generic[C]

done: bool property

Check if job is already complete (idempotency check)

run() -> None abstractmethod

Run the job, assuming all hosts have completed setup.
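
The done/run contract can be illustrated with a stand-in base class. This is a minimal sketch, not the library's code: SketchJob, WriteMarkerJob, run_once, and the marker-file convention are hypothetical names chosen for illustration. The real BasicJob is constructed from an ExecutionSpec, and this reference does not say how done is computed.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchJob(ABC):
    """Hypothetical stand-in for BasicJob; mirrors only `done` and `run`."""

    @property
    @abstractmethod
    def done(self) -> bool:
        """Return True if the job's work is already complete."""

    @abstractmethod
    def run(self) -> None:
        """Do the work."""


class WriteMarkerJob(SketchJob):
    """Toy job whose completion is recorded by a marker file (an assumption)."""

    def __init__(self, out_dir: Path) -> None:
        self.marker = Path(out_dir) / "DONE"

    @property
    def done(self) -> bool:
        return self.marker.exists()

    def run(self) -> None:
        self.marker.write_text("ok")


def run_once(job: SketchJob) -> bool:
    """Run the job only if it is not done; return True if it actually ran."""
    if job.done:
        return False
    job.run()
    return True
```

Calling run_once twice on the same job runs it the first time and skips it the second, which is the behaviour an idempotency check is meant to give a relaunched job.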

CheckpointedJob(spec: ExecutionSpec)

Bases: BasicJob[C], Generic[C]

Job with checkpoint save/load support.

Checkpoint path resolution

Paths are split into two parts:

checkpoints_dir / rel_path

where checkpoints_dir comes from the cluster config and rel_path is project/group/job_name/suffix.
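
The two-part split can be sketched with pathlib. The helper names and the example values below are hypothetical; only the checkpoints_dir / rel_path layout and the project/group/job_name/suffix ordering come from the description above.

```python
from pathlib import Path


def rel_checkpoint_path(project: str, group: str, job_name: str, suffix: str) -> Path:
    """Build rel_path as project/group/job_name/suffix."""
    return Path(project) / group / job_name / suffix


def full_checkpoint_path(checkpoints_dir: Path, rel_path: Path) -> Path:
    """Join the cluster-level root with the relative path."""
    return Path(checkpoints_dir) / rel_path


# Example values are made up for illustration.
rel = rel_checkpoint_path("proj", "sweep-a", "train-01", "step_1000")
full = full_checkpoint_path(Path("/mnt/checkpoints"), rel)
```

Keeping rel_path separate from the root means the same relative path can resolve on clusters whose checkpoints_dir differs.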

The *_from_path methods accept an arbitrary rel_path, which lets a job load/save checkpoints belonging to a different job. The plain get_tree_and_metadata / save_tree_and_metadata methods derive rel_path from self.spec automatically — they exist for backwards compatibility and are thin wrappers around the *_from_path variants.
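
The thin-wrapper relationship might look roughly like the sketch below. It uses a hypothetical in-memory class, SketchCheckpointStore, that stores metadata only; it is not the real CheckpointedJob, and save_metadata and save_metadata_from_path are invented here to keep the example self-contained.

```python
from pathlib import Path
from typing import Any, Dict, Union

PathLike = Union[str, Path]


class SketchCheckpointStore:
    """Hypothetical in-memory stand-in; not the real CheckpointedJob."""

    def __init__(self, project: str, group: str, job_name: str) -> None:
        # Prefix that the real class would derive from self.spec.
        self._own_prefix = Path(project) / group / job_name
        self._saved: Dict[str, Dict[str, Any]] = {}

    # *_from_path variants: accept any rel_path, including another job's.
    def save_metadata_from_path(self, rel_path: PathLike, metadata: Dict[str, Any]) -> None:
        self._saved[Path(rel_path).as_posix()] = dict(metadata)

    def get_metadata_from_path(self, rel_path: PathLike) -> Dict[str, Any]:
        return self._saved[Path(rel_path).as_posix()]

    # Plain variants: derive rel_path from this job's identity, then delegate.
    def save_metadata(self, suffix: PathLike, metadata: Dict[str, Any]) -> None:
        self.save_metadata_from_path(self._own_prefix / suffix, metadata)

    def get_metadata(self, suffix: PathLike) -> Dict[str, Any]:
        return self.get_metadata_from_path(self._own_prefix / suffix)
```

With this shape, get_metadata("step_1") and get_metadata_from_path("p/g/j/step_1") return the same entry for a job identified by p/g/j, while only the *_from_path form can reach a different job's checkpoint.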

_get_checkpoint_path is a legacy static helper used by external callers (scripts, inference, RestoreableJob) that returns the full absolute path. It is kept for backwards compatibility.

get_tree_and_metadata_from_path(rel_path: str | Path, template_tree: PyTree[Any], partial: bool = False) -> Tuple[PyTree[Any], Dict[str, Any]]

Load tree and metadata from rel_path under checkpoints_dir.

Parameters:

rel_path (str | Path, required): Relative path under checkpoints_dir.

template_tree (PyTree[Any], required): Template pytree for shape/sharding info.

partial (bool, default False): If True, only restore leaves present in template_tree and silently skip mismatched subtrees (e.g. optimizer state when loading a trainer checkpoint into an inference template).
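
The effect of partial=True can be illustrated with plain nested dicts standing in for pytrees. This is a sketch of the semantics described above, not the library's restore logic: restore_partial is a hypothetical helper, and raising KeyError for a template leaf that is absent from the checkpoint is a choice made for this sketch, since that case is not documented here.

```python
from typing import Any, Dict


def restore_partial(saved: Dict[str, Any], template: Dict[str, Any]) -> Dict[str, Any]:
    """Return values from `saved` for every leaf present in `template`.

    Subtrees that exist only in `saved` (e.g. optimizer state) are skipped.
    """
    out: Dict[str, Any] = {}
    for key, tmpl_value in template.items():
        if key not in saved:
            # Assumption of this sketch: a missing leaf is an error.
            raise KeyError(f"checkpoint has no entry for {key!r}")
        if isinstance(tmpl_value, dict):
            out[key] = restore_partial(saved[key], tmpl_value)
        else:
            out[key] = saved[key]
    return out


# A trainer checkpoint loaded into an inference-only template.
saved_tree = {"params": {"w": 1.0, "b": 2.0}, "opt_state": {"mu": 0.1}}
inference_template = {"params": {"w": 0.0, "b": 0.0}}
restored = restore_partial(saved_tree, inference_template)
```

Here restored contains only the params subtree; opt_state is dropped because the template never asks for it.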

get_tree_and_metadata(suffix: str | Path, template_tree: PyTree[Any]) -> Tuple[PyTree[Any], Dict[str, Any]]

Load from this job's own checkpoint. Wrapper for backwards compat.

save_tree_and_metadata_from_path(rel_path: str | Path, tree: PyTree[Any], metadata: Dict[str, Any]) -> None

Save tree and metadata to rel_path under checkpoints_dir.

save_tree_and_metadata(suffix: str | Path, tree: PyTree[Any], metadata: Dict[str, Any]) -> None

Save to this job's own checkpoint. Wrapper for backwards compat.

get_metadata_from_path(rel_path: str | Path) -> Dict[str, Any]

Load metadata only from rel_path under checkpoints_dir.

get_metadata(suffix: str | Path) -> Dict[str, Any]

Load metadata only from this job's own checkpoint. Wrapper for backwards compat.

RestoreableJob(spec: ExecutionSpec)

Bases: CheckpointedJob[C], Generic[C]

restore_from_path(rel_path: str | Path) -> None abstractmethod

Restore job state from rel_path under checkpoints_dir.

restore(suffix: str | Path) -> None

Restore from this job's own checkpoint. Wrapper for backwards compat.

register(suffix: str | Path) -> None

Register this checkpoint as the latest, for idempotent restore.

latest(spec: ExecutionSpec) -> str | None classmethod

Get the latest checkpoint suffix, or None if no checkpoint exists.
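
One common way to implement a register/latest pair is a small pointer file next to the checkpoints. The sketch below assumes that approach; the LATEST file name and the job_dir argument are hypothetical, and this reference does not say how the real methods store the pointer (the real signatures take a suffix and an ExecutionSpec).

```python
from pathlib import Path
from typing import Optional

LATEST_FILE = "LATEST"  # hypothetical pointer-file name


def register(job_dir: Path, suffix: str) -> None:
    """Record `suffix` as the latest checkpoint for the job in `job_dir`."""
    job_dir.mkdir(parents=True, exist_ok=True)
    tmp = job_dir / (LATEST_FILE + ".tmp")
    tmp.write_text(suffix)
    # Rename over the old pointer so readers never see a half-written file.
    tmp.replace(job_dir / LATEST_FILE)


def latest(job_dir: Path) -> Optional[str]:
    """Return the registered suffix, or None if nothing was registered."""
    pointer = job_dir / LATEST_FILE
    if not pointer.exists():
        return None
    return pointer.read_text().strip()
```

A relaunched job can call latest first and restore only when the result is not None, which keeps restarts idempotent.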

from_checkpoint_path(rel_path: str | Path, spec: ExecutionSpec, runtime_cfg: Any | None = None, resume: bool = False) -> Tuple[Self, Any] classmethod

Load and instantiate a checkpointed job from rel_path under checkpoints_dir.

Parameters:

rel_path (str | Path, required): Relative path under checkpoints_dir.

spec (ExecutionSpec, required): Execution spec used to locate the checkpoint.

runtime_cfg (Any | None, default None): Config values from the current launch to overlay onto the checkpoint config before job initialization.

resume (bool, default False): If True, this is an idempotent resume of the same job (e.g. after preemption). All saved spec fields, including the wandb run id, are restored from the checkpoint's job.json so that logging sessions can be rejoined. If False, the caller's spec identity is kept intact (useful for --restore, which loads weights from a different job's checkpoint).

Returns:

Tuple[Self, Any]: The restored job instance and its configuration.
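
The difference between resume=True and resume=False, and the runtime_cfg overlay, can be sketched with plain dicts. Both helpers are hypothetical, the field names are made up, and the shallow merge is an assumption; the real method works on ExecutionSpec objects and may merge configs differently.

```python
from typing import Any, Dict, Optional


def resolve_spec(
    caller_spec: Dict[str, Any], saved_spec: Dict[str, Any], resume: bool
) -> Dict[str, Any]:
    """Pick the spec identity the restored job should carry."""
    if resume:
        # Same job resuming: saved fields (e.g. the wandb run id) win.
        return {**caller_spec, **saved_spec}
    # Loading another job's weights: keep the caller's identity.
    return dict(caller_spec)


def overlay_config(
    checkpoint_cfg: Dict[str, Any], runtime_cfg: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Overlay current-launch values onto the checkpoint config (shallow)."""
    merged = dict(checkpoint_cfg)
    if runtime_cfg:
        merged.update(runtime_cfg)
    return merged


caller = {"job_name": "finetune", "wandb_run_id": "new-run"}
saved = {"job_name": "pretrain", "wandb_run_id": "abc123"}
resumed = resolve_spec(caller, saved, resume=True)
restored_only = resolve_spec(caller, saved, resume=False)
cfg = overlay_config({"lr": 1e-3, "steps": 100}, {"steps": 200})
```

In this sketch a resumed job rejoins run abc123, while a restore-only load keeps the caller's new-run identity and only the overlaid config values change.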

from_checkpoint(suffix: str | Path, spec: ExecutionSpec, runtime_cfg: Any | None = None, resume: bool = False) -> Tuple[Self, Any] classmethod

Load from this job's own checkpoint. Wrapper for backwards compat.

checkpoints(spec: ExecutionSpec) -> List[str] classmethod

Given the execution spec, list the checkpoints available to restore from.