slurm

SLURM dispatch utilities

SlurmJob(name: str, command: str, partition: str | None = None, nodes: int = 1, ntasks: int = 1, ntasks_per_node: int | None = None, cpus_per_task: int | None = None, gpus: int | None = None, gpus_per_node: int | None = None, gpu_type: str | None = None, mem: str | None = None, mem_per_cpu: str | None = None, time: str | None = None, output: str | None = None, error: str | None = None, workdir: str | None = None, root_dir: str | None = None, env: dict[str, str] = dict(), modules: list[str] = list(), uv_groups: list[str] = list(), dependency: str | None = None, exclusive: bool = False, constraint: str | None = None, account: str | None = None, qos: str | None = None, exclude: list[str] = list(), extra_directives: list[str] = list(), setup_commands: list[str] = list(), payload: str | None = None, payload_extract_to: str = '$SLURM_TMPDIR/code', juicefs_mount: JuiceFSMount | None = None, is_slurm: bool = True, bootstrap_py: str | None = None) dataclass

Configuration for an sbatch script.

pack(tarball: bytes) -> 'SlurmJob'

Return a new SlurmJob with the given tarball as payload.

Parameters:

- tarball (bytes, required): gzip-compressed tarball bytes (from sync.snapshot())

Returns:

- SlurmJob: new SlurmJob with payload set
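Because pack returns a new SlurmJob instead of mutating the existing one, the original job can be reused safely. A minimal sketch of that pattern follows. `JobSketch` is a hypothetical stand-in for SlurmJob, and storing the payload as base64 text is an assumption, since the real encoding is not documented here.

```python
import base64
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobSketch:
    """Hypothetical stand-in for SlurmJob with only the fields needed here."""

    name: str
    command: str
    payload: Optional[str] = None


def pack_sketch(job: JobSketch, tarball: bytes) -> JobSketch:
    # Assumption: the payload is kept as base64 text so it can be embedded
    # in a shell script. The library's real encoding may differ.
    encoded = base64.b64encode(tarball).decode("ascii")
    # dataclasses.replace copies every field and overrides only `payload`.
    return dataclasses.replace(job, payload=encoded)


original = JobSketch(name="train", command="python train.py")
packed = pack_sketch(original, b"\x1f\x8b fake gzip bytes")
```

The original object keeps `payload=None`, so the same base configuration can be packed again with a different snapshot.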

to_bootstrap_script() -> str

Generate the bootstrap.sh script that runs on each node.

to_sbatch_script(bootstrap_script_path: str) -> str

Generate the sbatch wrapper script that calls srun on bootstrap.sh.
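To illustrate the split between the wrapper and bootstrap.sh, here is a rough sketch of what such a wrapper could look like. The function `sbatch_wrapper_sketch` and its exact output are assumptions for illustration only. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--ntasks`, `--partition`, `--time`) are standard sbatch options, but the script the library actually generates may differ.

```python
import shlex
from typing import Optional


def sbatch_wrapper_sketch(
    name: str,
    bootstrap_script_path: str,
    nodes: int = 1,
    ntasks: int = 1,
    partition: Optional[str] = None,
    time: Optional[str] = None,
) -> str:
    """Build a minimal sbatch wrapper that runs a bootstrap script via srun."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
    ]
    # Optional directives are emitted only when a value is set.
    if partition:
        lines.append(f"#SBATCH --partition={partition}")
    if time:
        lines.append(f"#SBATCH --time={time}")
    lines.append("")
    # srun launches the bootstrap script for each task of the allocation.
    lines.append(f"srun bash {shlex.quote(bootstrap_script_path)}")
    return "\n".join(lines) + "\n"


script = sbatch_wrapper_sketch(
    "train", "/shared/bootstrap.sh", nodes=2, ntasks=2, partition="gpu", time="01:00:00"
)
```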

to_script() -> str

Generate the job script. Kept for backward compatibility.

- For SLURM: returns the bootstrap script (the sbatch wrapper is generated separately)
- For SSH: returns the bootstrap script

SlurmResult(job_id: int | None, ssh_result: RunResult) dataclass

Result of a SLURM job submission.

JobStatus(job_id: int, partition: str, name: str, user: str, state: str, time_elapsed: str, nodes: int, nodelist: str) dataclass

Status of a SLURM job from squeue.
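The JobStatus fields line up with standard squeue format codes. The sketch below shows one way a line of squeue output could be parsed into such a record. `JobStatusSketch`, `SQUEUE_FORMAT`, and `parse_squeue_line` are hypothetical names, and the format string the library actually passes to squeue is not documented here.

```python
from dataclasses import dataclass

# Standard squeue format codes: %i job ID, %P partition, %j job name, %u user,
# %T state, %M time used, %D node count, %N nodelist.
# Assumed invocation: squeue -h -o "%i|%P|%j|%u|%T|%M|%D|%N"
SQUEUE_FORMAT = "%i|%P|%j|%u|%T|%M|%D|%N"


@dataclass
class JobStatusSketch:
    """Hypothetical stand-in mirroring the JobStatus fields."""

    job_id: int
    partition: str
    name: str
    user: str
    state: str
    time_elapsed: str
    nodes: int
    nodelist: str


def parse_squeue_line(line: str) -> JobStatusSketch:
    # Simplification: assumes no field contains "|" and that the job ID is a
    # plain integer (array job IDs such as "123_4" are not handled).
    job_id, partition, name, user, state, elapsed, nodes, nodelist = (
        line.strip().split("|")
    )
    return JobStatusSketch(
        job_id=int(job_id),
        partition=partition,
        name=name,
        user=user,
        state=state,
        time_elapsed=elapsed,
        nodes=int(nodes),
        nodelist=nodelist,
    )


job_status = parse_squeue_line("12345|gpu|train|alice|RUNNING|1:23:45|2|node[01-02]\n")
```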

JobInfo(job_id: str, name: str, partition: str, state: str, exit_code: str, elapsed: str, max_rss: str, nodelist: str) dataclass

Detailed info about a SLURM job from sacct.

QueueResult(jobs: list[JobStatus], ssh_result: RunResult) dataclass

Result of a queue query.

StatusResult(job: JobStatus | None, ssh_result: RunResult) dataclass

Result of a job status query.

JobInfoResult(steps: list[JobInfo], ssh_result: RunResult) dataclass

Result of a job info query.

main: JobInfo | None property

Get the main job entry (without step suffix).
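sacct lists the job itself alongside its steps, whose IDs carry a suffix such as `12345.batch` or `12345.0`. A minimal sketch of how the main entry could be selected, assuming the step suffix is everything after the first dot (`JobInfoSketch` and `main_entry` are hypothetical names):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobInfoSketch:
    """Hypothetical stand-in with only the JobInfo fields used here."""

    job_id: str
    state: str


def main_entry(steps: list[JobInfoSketch]) -> Optional[JobInfoSketch]:
    # The main entry is the first one whose ID has no ".step" suffix.
    for step in steps:
        if "." not in step.job_id:
            return step
    return None


steps = [
    JobInfoSketch("12345.batch", "COMPLETED"),
    JobInfoSketch("12345", "COMPLETED"),
    JobInfoSketch("12345.0", "COMPLETED"),
]
main = main_entry(steps)
```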

NodeGres(name: str, type: str | None, configured: int, allocated: int) dataclass

GRES (generic resource) info for a node.

NodeInfo(name: str, state: str, cpus_total: int, cpus_allocated: int, memory_total: int, memory_allocated: int, gres: list[NodeGres], partitions: list[str], features: list[str]) dataclass

Detailed info about a SLURM node.

get_gres(name: str) -> NodeGres | None

Get GRES by name (e.g., 'gpu').
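A typical use of this lookup is computing how many GPUs on a node are still free. The sketch below uses a hypothetical `GresSketch` stand-in with the same fields as NodeGres. Returning the first entry whose name matches is an assumption about how the real method behaves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GresSketch:
    """Hypothetical stand-in mirroring the NodeGres fields."""

    name: str
    type: Optional[str]
    configured: int
    allocated: int


def get_gres_sketch(gres: list[GresSketch], name: str) -> Optional[GresSketch]:
    # Return the first matching resource, or None when the node has none.
    return next((g for g in gres if g.name == name), None)


node_gres = [GresSketch("gpu", "a100", configured=4, allocated=1)]
gpu = get_gres_sketch(node_gres, "gpu")
free_gpus = gpu.configured - gpu.allocated if gpu is not None else 0
```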

PartitionInfo(name: str, state: str, nodes: list[str], total_cpus: int, total_nodes: int) dataclass

Info about a SLURM partition.

submit(job: SlurmJob, host: str, share_dir: str | None = None, script_path: str | None = None, timeout: float | None = None) -> SlurmResult

Submit a SLURM job to a remote host via SSH.

Creates two scripts on the remote host:

- bootstrap.sh: runs on each node via srun (setup + command)
- sbatch wrapper: contains the SBATCH directives and calls srun on bootstrap.sh

Parameters:

- job (SlurmJob, required): SlurmJob configuration
- host (str, required): SSH host with SLURM access
- share_dir (str | None, default None): Shared directory visible to all nodes (required for multi-node jobs)
- script_path (str | None, default None): Optional remote path prefix for scripts
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- SlurmResult: SlurmResult with job_id and SSH result

submit_packed(job: SlurmJob, host: str, repo_path: str | None = None, share_dir: str | None = None, dirty: bool = False, script_path: str | None = None, timeout: float | None = None) -> SlurmResult

Submit a SLURM job with code packed into the script.

The code tarball is embedded in the sbatch script and extracted at runtime on the compute node (to $SLURM_TMPDIR/code by default).

Parameters:

- job (SlurmJob, required): SlurmJob configuration
- host (str, required): SSH host with SLURM access
- repo_path (str | None, default None): Local git repo to pack (defaults to the current working directory)
- share_dir (str | None, default None): Shared directory visible to all nodes for scripts
- dirty (bool, default False): Include uncommitted changes
- script_path (str | None, default None): Optional remote path for script
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- SlurmResult: SlurmResult with job_id and SSH result

status(job_id: int, host: str, timeout: float | None = None) -> StatusResult

Check the status of a SLURM job.

Parameters:

- job_id (int, required): SLURM job ID
- host (str, required): SSH host with SLURM access
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- StatusResult: StatusResult with parsed JobStatus

cancel(job_id: int, host: str, timeout: float | None = None) -> RunResult

Cancel a SLURM job.

Parameters:

- job_id (int, required): SLURM job ID to cancel
- host (str, required): SSH host with SLURM access
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- RunResult: RunResult from scancel

job_info(job_id: int, host: str, timeout: float | None = None) -> JobInfoResult

Get detailed info about a SLURM job (including completed jobs).

Parameters:

- job_id (int, required): SLURM job ID
- host (str, required): SSH host with SLURM access
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- JobInfoResult: JobInfoResult with parsed job steps

queue(host: str, user: str | None = None, timeout: float | None = None) -> QueueResult

List jobs in the SLURM queue.

Parameters:

- host (str, required): SSH host with SLURM access
- user (str | None, default None): Filter by user (None lists all users)
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- QueueResult: QueueResult with list of parsed JobStatus

partitions(host: str, timeout: float | None = None) -> list[PartitionInfo]

List all SLURM partitions.

Parameters:

- host (str, required): SSH host with SLURM access
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- list[PartitionInfo]: List of PartitionInfo

partition_nodes(partition: str, host: str, timeout: float | None = None) -> list[str]

List all nodes in a SLURM partition.

Parameters:

- partition (str, required): Partition name
- host (str, required): SSH host with SLURM access
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- list[str]: List of node names

first_node_from_nodelist(nodelist: str) -> str | None

Get the first hostname from a SLURM nodelist expression.
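SLURM nodelists compress hostnames with bracketed ranges, for example `node[01-03,07]`. The sketch below shows one way the first hostname could be extracted with a regular expression. `first_node_sketch` is a hypothetical name, and the sketch handles only a single bracket group with no text after it. On a host with SLURM installed, `scontrol show hostnames` expands such expressions fully.

```python
import re
from typing import Optional


def first_node_sketch(nodelist: str) -> Optional[str]:
    """Return the first hostname of a simple nodelist expression, or None."""
    nodelist = nodelist.strip()
    if not nodelist:
        return None
    # A prefix up to the first "[" or ",", optionally followed by "[ranges]".
    match = re.match(r"^([^\[,]+)(?:\[([^\]]+)\])?", nodelist)
    if match is None:
        return None
    prefix, ranges = match.group(1), match.group(2)
    if ranges is None:
        return prefix
    # "01-03,07" -> first item "01-03" -> range start "01" (zero padding kept).
    first = ranges.split(",")[0].split("-")[0]
    return prefix + first
```

For instance, `first_node_sketch("node[01-03,07]")` evaluates to `"node01"`, and a plain hostname such as `"node01"` is returned unchanged.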

node_info(nodename: str, host: str, timeout: float | None = None) -> NodeInfo | None

Get detailed info about a SLURM node.

Parameters:

- nodename (str, required): Name of the node
- host (str, required): SSH host with SLURM access
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- NodeInfo | None: NodeInfo, or None if the node is not found

nodes_info(nodenames: list[str], host: str, timeout: float | None = None) -> dict[str, NodeInfo]

Get info about multiple nodes in parallel.

Parameters:

- nodenames (list[str], required): List of node names
- host (str, required): SSH host with SLURM access
- timeout (float | None, default None): SSH timeout per node, in seconds

Returns:

- dict[str, NodeInfo]: Dict mapping nodename -> NodeInfo (excludes failed lookups)

available_gpus(partition: str, host: str, gpu_type: str | None = None, timeout: float | None = None) -> list[tuple[str, int]]

Find nodes with available GPUs in a partition.

Uses a single sinfo command to get all node GPU info efficiently, avoiding rate limits from multiple SSH connections.

Parameters:

- partition (str, required): Partition name
- host (str, required): SSH host with SLURM access
- gpu_type (str | None, default None): Optional GPU type filter (e.g., "a100")
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- list[tuple[str, int]]: List of (nodename, available_gpu_count) tuples, sorted by availability descending
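The returned ranking can be pictured as "configured GPUs minus used GPUs per node, sorted descending". The sketch below computes this from GRES strings in the form sinfo reports them (for example via `sinfo -N -O NodeList,Gres,GresUsed`). The helper names are hypothetical. The exact sinfo invocation and the choice to drop nodes with no free GPUs are assumptions about the library.

```python
import re
from typing import Optional

# Matches "gpu:4", "gpu:a100:4", and "gpu:a100:2(IDX:0-1)".
_GPU_RE = re.compile(r"gpu:(?:(?P<type>[^:(,]+):)?(?P<count>\d+)")


def gpu_count(gres: str, gpu_type: Optional[str] = None) -> int:
    """Sum GPU counts in a GRES string, optionally for a single GPU type."""
    total = 0
    for match in _GPU_RE.finditer(gres):
        if gpu_type is None or match.group("type") == gpu_type:
            total += int(match.group("count"))
    return total


def rank_free_gpus(
    rows: list[tuple[str, str, str]], gpu_type: Optional[str] = None
) -> list[tuple[str, int]]:
    """rows holds (nodename, configured_gres, used_gres) per node."""
    result = []
    for node, configured, used in rows:
        free = gpu_count(configured, gpu_type) - gpu_count(used, gpu_type)
        if free > 0:  # assumption: fully allocated nodes are left out
            result.append((node, free))
    # Most available GPUs first.
    result.sort(key=lambda item: item[1], reverse=True)
    return result


rows = [
    ("node01", "gpu:a100:4", "gpu:a100:4(IDX:0-3)"),
    ("node02", "gpu:a100:4", "gpu:a100:1(IDX:0)"),
    ("node03", "gpu:v100:8", "gpu:v100:2(IDX:0-1)"),
]
ranked = rank_free_gpus(rows)
```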

partition_gpu_types(host: str, partitions: list[str] | None = None, timeout: float | None = None) -> dict[str, set[str]]

Get GPU types for all partitions on a host in a single query.

Parameters:

- host (str, required): SSH host with SLURM access
- partitions (list[str] | None, default None): Optional list of partitions to filter (queries all if None)
- timeout (float | None, default None): SSH timeout in seconds

Returns:

- dict[str, set[str]]: Dict mapping partition name -> set of GPU type names

wait(job_id: int, host: str, poll_interval: float = 10.0, timeout: float | None = None) -> JobInfoResult

Wait for a SLURM job to complete.

Parameters:

- job_id (int, required): SLURM job ID
- host (str, required): SSH host with SLURM access
- poll_interval (float, default 10.0): Seconds between status checks
- timeout (float | None, default None): Total timeout in seconds (None = wait forever)

Returns:

- JobInfoResult: JobInfoResult with final job state
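The interaction between poll_interval and timeout can be sketched as a simple polling loop. Here `fetch_state` is a hypothetical callable standing in for a job_info query, and raising TimeoutError when the deadline passes is an assumption, since the documentation does not say how the real function signals a timeout. The state names are standard SLURM job states.

```python
import time
from typing import Callable, Optional

# Job states treated as final in this sketch.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
    "NODE_FAIL", "OUT_OF_MEMORY", "BOOT_FAIL", "DEADLINE",
}


def wait_sketch(
    fetch_state: Callable[[], Optional[str]],
    poll_interval: float = 10.0,
    timeout: Optional[float] = None,
) -> str:
    """Poll fetch_state until it reports a final state, then return it."""
    # A monotonic clock is unaffected by wall-clock adjustments.
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        state = fetch_state()
        # sacct can report e.g. "CANCELLED by 1000"; compare the first word.
        if state and state.split()[0] in TERMINAL_STATES:
            return state
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a final state in time")
        time.sleep(poll_interval)


# Usage with a fake state source that finishes on the third poll.
_states = iter(["PENDING", "RUNNING", "COMPLETED"])
final_state = wait_sketch(lambda: next(_states), poll_interval=0)
```

Passing the state source as a callable keeps the loop testable without SSH access.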

wait_until_running(job_id: int, host: str, poll_interval: float = 10.0, timeout: float | None = None) -> tuple[str | None, StatusResult]

Wait until a job is running, then return the first allocated hostname (or None) together with the StatusResult.