job
BasicJob(spec: ExecutionSpec)
CheckpointedJob(spec: ExecutionSpec)
Bases: BasicJob[C], Generic[C]
Job with checkpoint save/load support.
Checkpoint path resolution
Paths are split into two parts:
checkpoints_dir / rel_path
where checkpoints_dir comes from the cluster config and rel_path
is project/group/job_name/suffix.
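The two-part split can be sketched with pathlib. This is a minimal illustration, not the library's code: `SpecSketch`, `rel_path_for`, and `full_path` are hypothetical names, and the assumption that the spec exposes `project`, `group`, and `job_name` as plain string fields is inferred only from the rel_path layout given above.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for the ExecutionSpec fields used in rel_path."""

    project: str
    group: str
    job_name: str


def rel_path_for(spec: SpecSketch, suffix: str | Path) -> Path:
    # rel_path is project/group/job_name/suffix.
    return Path(spec.project) / spec.group / spec.job_name / suffix


def full_path(checkpoints_dir: str | Path, rel_path: str | Path) -> Path:
    # checkpoints_dir would come from the cluster config; rel_path sits under it.
    return Path(checkpoints_dir) / rel_path
```

Keeping rel_path separate from checkpoints_dir means the same relative path can be reused on a cluster whose checkpoint root differs.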
The *_from_path methods accept an arbitrary rel_path, which lets
a job load/save checkpoints belonging to a different job. The plain
get_tree_and_metadata / save_tree_and_metadata methods derive
rel_path from self.spec automatically — they exist for backwards
compatibility and are thin wrappers around the *_from_path variants.
_get_checkpoint_path is a legacy static helper used by external
callers (scripts, inference, RestoreableJob) that returns the full
absolute path. It is kept for backwards compatibility.
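The relationship between the suffix-based methods and the *_from_path variants can be sketched as a thin delegation. The class below is a hypothetical in-memory stand-in, not CheckpointedJob itself: `CheckpointStoreSketch`, `own_prefix`, and the metadata-only save method are assumptions made for illustration, and real checkpoints would live on disk rather than in a dict.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Dict


class CheckpointStoreSketch:
    """Hypothetical in-memory stand-in for the wrapper pattern."""

    def __init__(self, own_prefix: str | Path) -> None:
        # own_prefix plays the role of project/group/job_name derived from the spec.
        self.own_prefix = Path(own_prefix)
        self._metadata: Dict[Path, Dict[str, Any]] = {}

    def save_metadata_from_path(self, rel_path: str | Path, metadata: Dict[str, Any]) -> None:
        # Accepts any rel_path, so it can write under another job's prefix.
        self._metadata[Path(rel_path)] = dict(metadata)

    def get_metadata_from_path(self, rel_path: str | Path) -> Dict[str, Any]:
        return self._metadata[Path(rel_path)]

    def save_metadata(self, suffix: str | Path, metadata: Dict[str, Any]) -> None:
        # Thin wrapper: build rel_path from this job's prefix, then delegate.
        self.save_metadata_from_path(self.own_prefix / suffix, metadata)

    def get_metadata(self, suffix: str | Path) -> Dict[str, Any]:
        # Thin wrapper: same delegation on the read side.
        return self.get_metadata_from_path(self.own_prefix / suffix)
```

With this shape, a job reads its own checkpoint through the suffix methods and reaches another job's checkpoint by passing that job's rel_path to the *_from_path methods.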
get_tree_and_metadata_from_path(rel_path: str | Path, template_tree: PyTree[Any], partial: bool = False) -> Tuple[PyTree[Any], Dict[str, Any]]
Load tree and metadata from rel_path under checkpoints_dir.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rel_path | str \| Path | Relative path under checkpoints_dir. | required |
| template_tree | PyTree[Any] | Template pytree for shape/sharding info. | required |
| partial | bool | If True, only restore leaves present in | False |
get_tree_and_metadata(suffix: str | Path, template_tree: PyTree[Any]) -> Tuple[PyTree[Any], Dict[str, Any]]
Load from this job's own checkpoint. Wrapper for backwards compat.
save_tree_and_metadata_from_path(rel_path: str | Path, tree: PyTree[Any], metadata: Dict[str, Any]) -> None
Save tree and metadata to rel_path under checkpoints_dir.
save_tree_and_metadata(suffix: str | Path, tree: PyTree[Any], metadata: Dict[str, Any]) -> None
Save to this job's own checkpoint. Wrapper for backwards compat.
get_metadata_from_path(rel_path: str | Path) -> Dict[str, Any]
Load metadata only from rel_path under checkpoints_dir.
get_metadata(suffix: str | Path) -> Dict[str, Any]
Load metadata only from this job's own checkpoint. Wrapper for backwards compat.
RestoreableJob(spec: ExecutionSpec)
Bases: CheckpointedJob[C], Generic[C]
restore_from_path(rel_path: str | Path) -> None
abstractmethod
Restore job state from rel_path under checkpoints_dir.
restore(suffix: str | Path) -> None
Restore from this job's own checkpoint. Wrapper for backwards compat.
register(suffix: str | Path) -> None
Register this checkpoint as the latest, for idempotent restore.
latest(spec: ExecutionSpec) -> str | None
classmethod
Get the latest checkpoint suffix, or None if no checkpoint exists.
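The source does not say how the "latest" record is stored. As one possible reading of the register/latest contract, the sketch below assumes a small marker file inside the job's checkpoint directory; `LATEST_FILE`, `register_latest`, and `latest_suffix` are hypothetical names, and the real implementation may work quite differently.

```python
from __future__ import annotations

from pathlib import Path

LATEST_FILE = "LATEST"  # hypothetical marker file name


def register_latest(job_dir: Path, suffix: str) -> None:
    """Record suffix as the latest checkpoint for the job stored in job_dir."""
    job_dir.mkdir(parents=True, exist_ok=True)
    tmp = job_dir / (LATEST_FILE + ".tmp")
    tmp.write_text(suffix)
    # Swap the marker in with a single rename so readers never see a partial write.
    tmp.replace(job_dir / LATEST_FILE)


def latest_suffix(job_dir: Path) -> str | None:
    """Return the registered suffix, or None if nothing has been registered."""
    marker = job_dir / LATEST_FILE
    if not marker.exists():
        return None
    return marker.read_text().strip()
```

Registering the same suffix twice leaves the marker unchanged, which is the property that makes a restart-then-restore sequence repeatable.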
from_checkpoint_path(rel_path: str | Path, spec: ExecutionSpec, runtime_cfg: Any | None = None, resume: bool = False) -> Tuple[Self, Any]
classmethod
Load and instantiate a checkpointed job from rel_path under checkpoints_dir.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rel_path | str \| Path | Relative path under checkpoints_dir. | required |
| spec | ExecutionSpec | Execution spec to use for locating the checkpoint. | required |
| runtime_cfg | Any \| None | Config values from the current launch to overlay onto the checkpoint config before job initialization. | None |
| resume | bool | If | False |
Returns:
| Type | Description |
|---|---|
| Tuple[Self, Any] | Restored job instance and configuration. |
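The runtime_cfg overlay can be pictured as a merge in which values from the current launch win over values stored with the checkpoint. The sketch below assumes both configs are nested dicts and that nested sections merge key by key; the source types the config only as Any and does not state the merge rules, so `overlay` is a hypothetical helper rather than the library's behaviour.

```python
from __future__ import annotations

import copy
from typing import Any, Mapping


def overlay(base: Mapping[str, Any], override: Mapping[str, Any] | None) -> dict[str, Any]:
    """Return a copy of base with override applied on top, recursing into nested mappings."""
    merged = copy.deepcopy(dict(base))
    if override is None:
        # No runtime config supplied: the checkpoint config is used unchanged.
        return merged
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides hold a nested section: merge key by key.
            merged[key] = overlay(merged[key], value)
        else:
            # Leaf value or type mismatch: the current launch's value replaces it.
            merged[key] = copy.deepcopy(value)
    return merged
```

Under these assumptions, overlaying `{"optimizer": {"lr": 5e-4}}` onto a checkpoint config changes only the learning rate and leaves every other stored setting in place.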
from_checkpoint(suffix: str | Path, spec: ExecutionSpec, runtime_cfg: Any | None = None, resume: bool = False) -> Tuple[Self, Any]
classmethod
Load from this job's own checkpoint. Wrapper for backwards compat.
checkpoints(spec: ExecutionSpec) -> List[str]
classmethod
Given the execution spec, list the checkpoints available to restore from.