slurm
SLURM dispatch utilities
SlurmJob(name: str, command: str, partition: str | None = None, nodes: int = 1, ntasks: int = 1, ntasks_per_node: int | None = None, cpus_per_task: int | None = None, gpus: int | None = None, gpus_per_node: int | None = None, gpu_type: str | None = None, mem: str | None = None, mem_per_cpu: str | None = None, time: str | None = None, output: str | None = None, error: str | None = None, workdir: str | None = None, root_dir: str | None = None, env: dict[str, str] = dict(), modules: list[str] = list(), uv_groups: list[str] = list(), dependency: str | None = None, exclusive: bool = False, constraint: str | None = None, account: str | None = None, qos: str | None = None, exclude: list[str] = list(), extra_directives: list[str] = list(), setup_commands: list[str] = list(), payload: str | None = None, payload_extract_to: str = '$SLURM_TMPDIR/code', juicefs_mount: JuiceFSMount | None = None, is_slurm: bool = True, bootstrap_py: str | None = None)
dataclass
Configuration for an sbatch script.
pack(tarball: bytes) -> 'SlurmJob'
Return a new SlurmJob with the given tarball as payload.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tarball | bytes | gzip-compressed tarball bytes (from sync.snapshot()) | required |
Returns:
| Type | Description |
|---|---|
| 'SlurmJob' | New SlurmJob with payload set |
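
Because `pack` returns a new SlurmJob rather than modifying the existing one, it can be pictured as a non-mutating copy. The sketch below is only an illustration: `MiniJob` is a hypothetical stand-in with three fields, and storing the payload as base64 text is an assumption suggested by the `payload: str | None` field, not something this page confirms.

```python
import base64
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MiniJob:
    """Hypothetical stand-in for SlurmJob with only the fields needed here."""
    name: str
    command: str
    payload: Optional[str] = None

    def pack(self, tarball: bytes) -> "MiniJob":
        # Assumption: the bytes are base64-encoded so they fit in a str field.
        encoded = base64.b64encode(tarball).decode("ascii")
        # replace() builds a new instance; self is left untouched.
        return replace(self, payload=encoded)


job = MiniJob(name="train", command="python train.py")
packed = job.pack(b"\x1f\x8b fake gzip bytes")
```

After this runs, `job.payload` is still `None` and only `packed` carries the encoded tarball, which is the property the description above relies on.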
to_bootstrap_script() -> str
Generate the bootstrap.sh script that runs on each node.
to_sbatch_script(bootstrap_script_path: str) -> str
Generate the sbatch wrapper script that calls srun on bootstrap.sh.
to_script() -> str
Generate script for backward compatibility.
- For SLURM: returns bootstrap script (sbatch wrapper generated separately)
- For SSH: returns bootstrap script
SlurmResult(job_id: int | None, ssh_result: RunResult)
dataclass
Result of a SLURM job submission.
JobStatus(job_id: int, partition: str, name: str, user: str, state: str, time_elapsed: str, nodes: int, nodelist: str)
dataclass
Status of a SLURM job from squeue.
JobInfo(job_id: str, name: str, partition: str, state: str, exit_code: str, elapsed: str, max_rss: str, nodelist: str)
dataclass
Detailed info about a SLURM job from sacct.
QueueResult(jobs: list[JobStatus], ssh_result: RunResult)
dataclass
Result of a queue query.
StatusResult(job: JobStatus | None, ssh_result: RunResult)
dataclass
Result of a job status query.
JobInfoResult(steps: list[JobInfo], ssh_result: RunResult)
dataclass
Result of a job info query.
main: JobInfo | None
property
Get the main job entry (without step suffix).
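
To make "without step suffix" concrete: sacct typically reports one row for the job itself (for example `123`) plus rows for its steps (`123.batch`, `123.0`). The sketch below shows one plausible way to pick the main row. `Step` and `main_entry` are hypothetical names used for illustration, and the dot-based test is an assumption rather than the library's confirmed logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """Hypothetical stand-in for JobInfo, reduced to two fields."""
    job_id: str
    state: str


def main_entry(steps: list[Step]) -> Optional[Step]:
    # Assumption: step rows carry a suffix after a dot ("123.batch", "123.0"),
    # so the main entry is the first row whose job_id contains no dot.
    for step in steps:
        if "." not in step.job_id:
            return step
    return None


steps = [
    Step("123", "COMPLETED"),
    Step("123.batch", "COMPLETED"),
    Step("123.0", "COMPLETED"),
]
main = main_entry(steps)
```

Returning `None` when no suffix-free row exists mirrors the `JobInfo | None` type of the property.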
NodeGres(name: str, type: str | None, configured: int, allocated: int)
dataclass
GRES (generic resource) info for a node.
NodeInfo(name: str, state: str, cpus_total: int, cpus_allocated: int, memory_total: int, memory_allocated: int, gres: list[NodeGres], partitions: list[str], features: list[str])
dataclass
Detailed info about a SLURM node.
get_gres(name: str) -> NodeGres | None
Get GRES by name (e.g., 'gpu').
PartitionInfo(name: str, state: str, nodes: list[str], total_cpus: int, total_nodes: int)
dataclass
Info about a SLURM partition.
submit(job: SlurmJob, host: str, share_dir: str | None = None, script_path: str | None = None, timeout: float | None = None) -> SlurmResult
Submit a SLURM job to a remote host via SSH.
Creates two scripts on the remote host:
- bootstrap.sh: runs on each node via srun (setup + command)
- sbatch wrapper: contains SBATCH directives, calls srun bootstrap.sh
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job | SlurmJob | SlurmJob configuration | required |
| host | str | SSH host with SLURM access | required |
| share_dir | str \| None | Shared directory visible to all nodes (required for multi-node jobs) | None |
| script_path | str \| None | Optional remote path prefix for scripts | None |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| SlurmResult | SlurmResult with job_id and SSH result |
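
The split between the sbatch wrapper and bootstrap.sh can be illustrated with a small renderer. This is a rough sketch, not the output of `to_sbatch_script`: `render_sbatch_wrapper` is a hypothetical helper, it covers only a handful of fields, and the exact layout the library emits may differ. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--ntasks`, `--partition`, `--time`) are standard sbatch options.

```python
import shlex
from typing import Optional


def render_sbatch_wrapper(name: str, nodes: int, ntasks: int,
                          bootstrap_script_path: str,
                          partition: Optional[str] = None,
                          time: Optional[str] = None) -> str:
    """Hypothetical renderer: directives first, then one srun call."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
    ]
    # Optional fields produce a directive only when they are set.
    if partition is not None:
        lines.append(f"#SBATCH --partition={partition}")
    if time is not None:
        lines.append(f"#SBATCH --time={time}")
    # The wrapper does no setup itself; srun starts bootstrap.sh for each task.
    lines.append(f"srun bash {shlex.quote(bootstrap_script_path)}")
    return "\n".join(lines) + "\n"


script = render_sbatch_wrapper("train", 2, 2, "/shared/bootstrap.sh", partition="gpu")
```

Keeping setup in bootstrap.sh and directives in the wrapper is also why multi-node jobs need `share_dir`: every node must be able to read the bootstrap script that srun launches.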
submit_packed(job: SlurmJob, host: str, repo_path: str | None = None, share_dir: str | None = None, dirty: bool = False, script_path: str | None = None, timeout: float | None = None) -> SlurmResult
Submit a SLURM job with code packed into the script.
The code tarball is embedded in the sbatch script and extracted at runtime on the compute node (to $SLURM_TMPDIR/code by default).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job | SlurmJob | SlurmJob configuration | required |
| host | str | SSH host with SLURM access | required |
| repo_path | str \| None | Local git repo to pack (default: cwd) | None |
| share_dir | str \| None | Shared directory visible to all nodes for scripts | None |
| dirty | bool | Include uncommitted changes (default: False) | False |
| script_path | str \| None | Optional remote path for script | None |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| SlurmResult | SlurmResult with job_id and SSH result |
status(job_id: int, host: str, timeout: float | None = None) -> StatusResult
Check the status of a SLURM job.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | int | SLURM job ID | required |
| host | str | SSH host with SLURM access | required |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| StatusResult | StatusResult with parsed JobStatus |
cancel(job_id: int, host: str, timeout: float | None = None) -> RunResult
Cancel a SLURM job.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | int | SLURM job ID to cancel | required |
| host | str | SSH host with SLURM access | required |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| RunResult | RunResult from scancel |
job_info(job_id: int, host: str, timeout: float | None = None) -> JobInfoResult
Get detailed info about a SLURM job (including completed jobs).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | int | SLURM job ID | required |
| host | str | SSH host with SLURM access | required |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| JobInfoResult | JobInfoResult with parsed job steps |
queue(host: str, user: str | None = None, timeout: float | None = None) -> QueueResult
List jobs in the SLURM queue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| host | str | SSH host with SLURM access | required |
| user | str \| None | Filter by user (default: all users) | None |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| QueueResult | QueueResult with list of parsed JobStatus |
partitions(host: str, timeout: float | None = None) -> list[PartitionInfo]
List all SLURM partitions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| host | str | SSH host with SLURM access | required |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| list[PartitionInfo] | List of PartitionInfo |
partition_nodes(partition: str, host: str, timeout: float | None = None) -> list[str]
List all nodes in a SLURM partition.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| partition | str | Partition name | required |
| host | str | SSH host with SLURM access | required |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of node names |
first_node_from_nodelist(nodelist: str) -> str | None
Get first hostname from a SLURM nodelist expression.
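
A nodelist expression compresses hostnames with bracketed ranges, for example `gpu[01-04,07],cpu3`. The sketch below shows one way the first hostname could be extracted from simple expressions of that shape. `first_host` is a hypothetical helper written for illustration, not the library's implementation, and it does not attempt to cover every form SLURM accepts; on a cluster, `scontrol show hostnames` performs the full expansion.

```python
import re
from typing import Optional


def first_host(nodelist: str) -> Optional[str]:
    """Return the first hostname of a simple nodelist such as 'gpu[01-04,07],cpu3'."""
    nodelist = nodelist.strip()
    if not nodelist:
        return None
    # Leading prefix, an optional bracket group, and an optional trailing suffix.
    match = re.match(r"([^,\[]+)(?:\[([^\]]+)\])?([^,\[]*)", nodelist)
    if match is None:
        return None
    prefix, ranges, suffix = match.groups()
    if ranges is None:
        return prefix  # plain hostname such as 'cpu3'
    # First item of the bracket group; '01-04' starts at '01', zero padding kept.
    first = ranges.split(",")[0].split("-")[0]
    return f"{prefix}{first}{suffix}"
```

For the example expression this returns `gpu01`, and an empty string yields `None`, matching the `str | None` return type.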
node_info(nodename: str, host: str, timeout: float | None = None) -> NodeInfo | None
Get detailed info about a SLURM node.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| nodename | str | Name of the node | required |
| host | str | SSH host with SLURM access | required |
| timeout | float \| None | SSH timeout in seconds | None |
Returns:
| Type | Description |
|---|---|
| NodeInfo \| None | NodeInfo or None if node not found |
nodes_info(nodenames: list[str], host: str, timeout: float | None = None) -> dict[str, NodeInfo]
Get info about multiple nodes in parallel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| nodenames | list[str] | List of node names | required |
| host | str | SSH host with SLURM access | required |
| timeout | float \| None | SSH timeout per node | None |
Returns:
| Type | Description |
|---|---|
| dict[str, NodeInfo] | Dict mapping nodename -> NodeInfo (excludes failed lookups) |
available_gpus(partition: str, host: str, gpu_type: str | None = None, timeout: float | None = None) -> list[tuple[str, int]]
Find nodes with available GPUs in a partition.
Uses a single sinfo command to get all node GPU info efficiently, avoiding rate limits from multiple SSH connections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| partition | str | Partition name | required |
| host | str | SSH host with SLURM access | required |
| gpu_type | str \| None | Optional GPU type filter (e.g., "a100") | None |
| timeout | float \| None | SSH timeout | None |
Returns:
| Type | Description |
|---|---|
| list[tuple[str, int]] | List of (nodename, available_gpu_count) tuples, sorted by availability descending |
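
The shape of the return value can be illustrated by ranking per-node counts. In this sketch `GpuCount` and `rank_free_gpus` are hypothetical names, the counts are assumed to have been parsed from the sinfo output already, and dropping nodes with no free GPUs is an assumption about what "available" means here.

```python
from dataclasses import dataclass


@dataclass
class GpuCount:
    """Hypothetical per-node GPU counts, e.g. parsed from sinfo output."""
    node: str
    configured: int
    allocated: int


def rank_free_gpus(counts: list[GpuCount]) -> list[tuple[str, int]]:
    # Free GPUs = configured - allocated.
    free = [(c.node, c.configured - c.allocated) for c in counts]
    # Assumption: nodes with nothing free are left out of the result.
    free = [(node, n) for node, n in free if n > 0]
    # Largest free count first; sorted() is stable, so ties keep input order.
    return sorted(free, key=lambda item: item[1], reverse=True)


counts = [GpuCount("n1", 4, 3), GpuCount("n2", 8, 2), GpuCount("n3", 4, 4)]
ranked = rank_free_gpus(counts)
```

With this ordering, the first tuple names the node with the most free GPUs, which is usually the one to target first.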
partition_gpu_types(host: str, partitions: list[str] | None = None, timeout: float | None = None) -> dict[str, set[str]]
Get GPU types for all partitions on a host in a single query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| host | str | SSH host with SLURM access | required |
| partitions | list[str] \| None | Optional list of partitions to filter (queries all if None) | None |
| timeout | float \| None | SSH timeout | None |
Returns:
| Type | Description |
|---|---|
| dict[str, set[str]] | Dict mapping partition name -> set of GPU type names |
wait(job_id: int, host: str, poll_interval: float = 10.0, timeout: float | None = None) -> JobInfoResult
Wait for a SLURM job to complete.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | int | SLURM job ID | required |
| host | str | SSH host with SLURM access | required |
| poll_interval | float | Seconds between status checks | 10.0 |
| timeout | float \| None | Total timeout in seconds (None = wait forever) | None |
Returns:
| Type | Description |
|---|---|
| JobInfoResult | JobInfoResult with final job state |
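
The interplay of `poll_interval` and `timeout` can be sketched as a generic polling loop. This is an assumption-laden illustration rather than the library's code: `poll_until_done` and `TERMINAL_STATES` are hypothetical, the state source is passed in as a callable instead of querying sacct over SSH, and raising `TimeoutError` when the deadline passes is a guess, since this page does not say what `wait` does on timeout.

```python
import time
from typing import Callable, Optional

# Assumption: a small set of states treated as terminal; SLURM defines more.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def poll_until_done(get_state: Callable[[], str],
                    poll_interval: float = 10.0,
                    timeout: Optional[float] = None) -> str:
    """Call get_state() until it reports a terminal state or the deadline passes."""
    # timeout=None means there is no deadline, so the loop waits forever.
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state} after {timeout} seconds")
        time.sleep(poll_interval)


# Simulated job that is pending, then running, then finished.
fake_states = iter(["PENDING", "RUNNING", "COMPLETED"])
final_state = poll_until_done(lambda: next(fake_states), poll_interval=0.0)
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by wall-clock adjustments during a long wait.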
wait_until_running(job_id: int, host: str, poll_interval: float = 10.0, timeout: float | None = None) -> tuple[str | None, StatusResult]
Wait until a job is running and return the first allocated hostname.