ezpz PBS Guide
Hardware-agnostic distributed PyTorch on PBS with automatic topology inference and CPU binding.
Quick Start
To build a PBS-aware launch command (using the current job's hostfile when available):
- Discover the active PBS job ID:

    >>> import ezpz
    >>> jobid = ezpz.pbs.get_pbs_jobid_of_active_job()

- Get the PBS hostfile:

    >>> hostfile = ezpz.pbs.get_pbs_nodefile(jobid=jobid)

- Build the launch command with inferred topology:

    >>> launch_cmd = ezpz.pbs.get_pbs_launch_cmd(hostfile=hostfile)
    [2026-01-16 08:58:58,119421][I][ezpz/pbs:267:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    >>> # Build the mpiexec/mpirun command with inferred topology + CPU binding
    >>> print(launch_cmd)
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96

- Launch directly from Python using subprocess:

    >>> import subprocess
    >>> subprocess.run(f"{launch_cmd} --line-buffer python3 -c 'import ezpz; print(ezpz.setup_torch())'", shell=True, check=True)

  Expected output (ranks print in arbitrary order):

    0
    Using [24 / 24] available "xpu" devices !!
    1
    2
    12
    13
    3
    14
    15
    16
    17
    4
    18
    19
    20
    21
    22
    23
    5
    6
    7
    8
    9
    10
    11
    CompletedProcess(args="mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 --line-buffer python3 -c 'import ezpz; print(ezpz.setup_torch())'", returncode=0)
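The quick-start call passes a single f-string through the shell. As a hedged alternative, the sketch below splits the launcher string into an argument list so that subprocess.run can be called without shell=True. The helper name build_argv is hypothetical and not part of ezpz, and the launcher string is an abbreviated stand-in for the real output of get_pbs_launch_cmd.

```python
import shlex


def build_argv(launch_cmd: str, program: list[str]) -> list[str]:
    """Split a launcher string into argv tokens and append the program to run.

    Hypothetical helper (not part of ezpz): it avoids shell=True, so the
    program arguments need no extra shell quoting.
    """
    return shlex.split(launch_cmd) + list(program)


# Abbreviated stand-in for the string returned by get_pbs_launch_cmd
launch_cmd = "mpiexec --envall --np=24 --ppn=12 --no-vni"
argv = build_argv(
    launch_cmd,
    ["--line-buffer", "python3", "-c", "import ezpz; print(ezpz.setup_torch())"],
)
print(argv)
# On a PBS node one could then call: subprocess.run(argv, check=True)
```

Passing a list keeps the inline Python snippet as one argument, which sidesteps nested-quote problems when the program string itself contains quotes.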
What the PBS helpers do
- Detect your active PBS job (if any) and locate its nodefile.
- Infer topology (ngpus, nhosts, ngpu_per_host) from the hostfile and machine limits unless you override them.
- Build the correct launcher (mpiexec/mpirun) with sensible CPU binding:
  - Intel GPU machines (aurora, sunspot) get --no-vni and vendor binding lists.
  - If CPU_BIND is set, its value is forwarded verbatim.
  - Otherwise, a generic --cpu-bind=depth --depth=8 is applied.
- Optionally inject PBS environment metadata (PBS_NODEFILE, host list, launch command) via get_pbs_env.
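The CPU binding rules in the list above can be sketched as a small selection function. This is an illustration only, not the ezpz implementation: the name cpu_bind_flags and its vendor_list parameter are hypothetical, and two behaviors are assumptions, namely that CPU_BIND holds a complete flag string and that --no-vni is still added on Intel GPU machines when CPU_BIND is set.

```python
from typing import Mapping, Optional

# Machines the guide names as Intel GPU systems
INTEL_GPU_MACHINES = {"aurora", "sunspot"}


def cpu_bind_flags(
    machine: str,
    env: Mapping[str, str],
    vendor_list: Optional[str] = None,
) -> list[str]:
    """Pick launcher flags: CPU_BIND env > machine defaults > generic depth.

    Hypothetical sketch, not the ezpz implementation.
    """
    is_intel = machine.lower() in INTEL_GPU_MACHINES
    # Assumption: --no-vni is added on Intel GPU machines in every case
    flags = ["--no-vni"] if is_intel else []
    user_bind = env.get("CPU_BIND")
    if user_bind:
        # Assumption: CPU_BIND already holds a complete flag string
        flags.append(user_bind)
    elif is_intel and vendor_list:
        flags.append(f"--cpu-bind=verbose,list:{vendor_list}")
    else:
        flags += ["--cpu-bind=depth", "--depth=8"]
    return flags


print(cpu_bind_flags("sunspot", {}, vendor_list="2-4:10-12"))
print(cpu_bind_flags("other", {}))
print(cpu_bind_flags("aurora", {"CPU_BIND": "--cpu-bind=core"}))
```

Taking the environment as a mapping argument, instead of reading os.environ inside the function, makes the precedence easy to check without touching the real environment.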
flowchart TD
A["Detect scheduler"] --> B{"Active PBS job?"}
B -- yes --> C["get_pbs_jobid_of_active_job"]
C --> D["get_pbs_nodefile(jobid)"]
B -- no --> E["get_hostfile_with_fallback"]
D --> F["infer_topology"]
E --> F
F --> G["get_pbs_launch_cmd"]
G --> H["mpiexec/mpirun command"]
H --> I["Launch distributed script"]
Discovering jobs and hostfiles
- pbs.get_pbs_running_jobs_for_user() -> dict[str, list[str]]: all running jobs for the current user with their node lists.
- pbs.get_pbs_jobid_of_active_job() -> str | None: job that includes the current host (or None if not on a PBS job).
- pbs.get_pbs_nodefile(jobid=None) -> str | None: path to the nodefile for a job (active job by default).
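To make the discovery step concrete, here is a minimal sketch of the two ideas behind these helpers: reading unique hostnames from a nodefile, and picking the job whose node list contains the current host. The functions read_hosts and find_active_jobid are hypothetical stand-ins, not ezpz APIs, the job ID is made up, and matching on the short hostname is an assumption.

```python
import socket
import tempfile
from pathlib import Path
from typing import Optional


def read_hosts(nodefile: str) -> list[str]:
    """Return the unique hostnames listed in a nodefile, in file order."""
    hosts: list[str] = []
    for line in Path(nodefile).read_text().splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def short_name(host: str) -> str:
    """Drop any domain suffix (assumption: jobs list short or full names)."""
    return host.split(".")[0]


def find_active_jobid(
    jobs: dict[str, list[str]], hostname: Optional[str] = None
) -> Optional[str]:
    """Return the first job whose node list includes this host, else None."""
    name = short_name(hostname or socket.gethostname())
    for jobid, nodes in jobs.items():
        if name in {short_name(node) for node in nodes}:
            return jobid
    return None


# Demo with a temporary nodefile instead of a real PBS job
with tempfile.TemporaryDirectory() as tmp:
    nodefile = Path(tmp) / "nodefile"
    nodefile.write_text("x1921c1s2b0n0-hsn0\nx1921c1s5b0n0-hsn0\nx1921c1s5b0n0-hsn0\n")
    hosts = read_hosts(str(nodefile))

jobs = {"12345.example-pbs": hosts}  # hypothetical job ID
print(hosts)
print(find_active_jobid(jobs, "x1921c1s5b0n0-hsn0"))
```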
Topology inference
get_pbs_launch_cmd will infer topology when you omit values:
- If you set nhosts: it uses all GPUs per host for that many hosts.
- If you set ngpu_per_host: ngpus = nhosts * ngpu_per_host.
- If you set ngpus only: ngpu_per_host = ngpus / nhosts (must divide evenly).
- If nothing is specified: infer topology automatically and use all GPUs on all hosts in the hostfile:

    >>> launch_cmd = ezpz.pbs.get_pbs_launch_cmd()
    [2026-01-16 09:17:26,226619][I][ezpz/pbs:267:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    >>> print(launch_cmd)
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96

- Any inconsistent combination raises ValueError.
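The inference rules above can be written out as a short function. This is a sketch under stated assumptions, not the ezpz source: the name infer_topology and the parameters hosts_available and gpus_per_host_max (standing in for the hostfile size and the machine limit) are hypothetical, and so are the range checks against those limits.

```python
from typing import Optional


def infer_topology(
    hosts_available: int,
    gpus_per_host_max: int,
    ngpus: Optional[int] = None,
    nhosts: Optional[int] = None,
    ngpu_per_host: Optional[int] = None,
) -> tuple[int, int, int]:
    """Return (ngpus, nhosts, ngpu_per_host), filling in omitted values."""
    if nhosts is None:
        nhosts = hosts_available
    if not 1 <= nhosts <= hosts_available:
        raise ValueError(f"nhosts={nhosts} outside 1..{hosts_available}")
    if ngpu_per_host is None:
        if ngpus is None:
            # Nothing given for GPUs: use every GPU on each selected host
            ngpu_per_host = gpus_per_host_max
        else:
            if ngpus % nhosts != 0:
                raise ValueError(f"ngpus={ngpus} not divisible by nhosts={nhosts}")
            ngpu_per_host = ngpus // nhosts
    if ngpus is None:
        ngpus = nhosts * ngpu_per_host
    if ngpus != nhosts * ngpu_per_host:
        raise ValueError("ngpus must equal nhosts * ngpu_per_host")
    if not 1 <= ngpu_per_host <= gpus_per_host_max:
        raise ValueError(f"ngpu_per_host={ngpu_per_host} outside 1..{gpus_per_host_max}")
    return ngpus, nhosts, ngpu_per_host


# Two hosts with 12 GPUs each, as in the example session
print(infer_topology(2, 12))                     # all GPUs on all hosts
print(infer_topology(2, 12, ngpus=8, nhosts=1))  # explicit override
```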
Building launch commands
- pbs.get_pbs_launch_cmd(...) -> str: build the launcher string (mpiexec/mpirun) with CPU binding.
- pbs.build_launch_cmd(...) -> str: scheduler-aware wrapper; currently dispatches to PBS or Slurm.
- Override explicitly when needed:

    >>> # from 2 nodes of Sunspot
    >>> launch_cmd = ezpz.pbs.get_pbs_launch_cmd(ngpus=8, nhosts=1)
    [2026-01-16 09:14:53,127734][I][ezpz/pbs:267:get_pbs_launch_cmd] Using [8/24] GPUs [1 hosts] x [8 GPU/host]
    [2026-01-16 09:14:53,128623][W][ezpz/pbs:273:get_pbs_launch_cmd] [WARNING] Using only [8/24] available GPUs!!
    >>> print(launch_cmd)
    mpiexec --envall --np=8 --ppn=8 --hostfile=/var/spool/pbs/aux/12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
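The printed commands follow a regular shape: the launcher, --envall, the total rank count, the ranks per node, the hostfile, then binding flags. The sketch below assembles a string in that shape. The function format_launch_cmd is hypothetical and not an ezpz API, the hostfile path is a placeholder, and the flag order simply mirrors the output shown in this guide.

```python
from typing import Sequence


def format_launch_cmd(
    ngpus: int,
    ngpu_per_host: int,
    hostfile: str,
    extra_flags: Sequence[str] = (),
    launcher: str = "mpiexec",
) -> str:
    """Assemble a launcher string in the shape printed in this guide.

    Hypothetical helper: --np is the total rank count and --ppn the
    number of ranks per node.
    """
    parts = [
        launcher,
        "--envall",
        f"--np={ngpus}",
        f"--ppn={ngpu_per_host}",
        f"--hostfile={hostfile}",
        *extra_flags,
    ]
    return " ".join(parts)


# Placeholder hostfile path; a real one comes from get_pbs_nodefile
cmd = format_launch_cmd(8, 8, "/path/to/hostfile", extra_flags=["--no-vni"])
print(cmd)
```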
Environment injection
- pbs.get_pbs_env(hostfile=None, jobid=None, verbose=False) -> dict[str, str]:
  - Adds PBS-derived metadata (host list, counts, LAUNCH_CMD) into os.environ.
  - Useful for passing context into downstream tools or logging.

    >>> env = pbs.get_pbs_env(verbose=True)
    [2026-01-16 09:19:21,899146][I][ezpz/pbs:267:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [pbsenv]:
    • PBS_O_WORKDIR=/lus/tegu/projects/datascience/foremans/projects/saforem2/ezpz
    • PBS_JOBID=12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov
    • PBS_NODEFILE=/var/spool/pbs/aux/12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov
    • HOSTFILE=/var/spool/pbs/aux/12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov
    • HOSTS=[x1921c1s2b0n0-hsn0, x1921c1s5b0n0-hsn0]
    • NHOSTS=2
    • NGPU_PER_HOST=12
    • NGPUS=24
    • NGPUS_AVAILABLE=24
    • MACHINE=SunSpot
    • DEVICE=xpu
    • BACKEND=xccl
    • LAUNCH_CMD=mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/12459205.sunspot-pbs-0001.head.cm.sunspot.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    • WORLD_SIZE_TOTAL=24
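As a rough illustration of the injection step, the sketch below builds a few of the string-valued entries shown in the verbose output and merges them into a mapping. The helpers build_pbs_metadata and inject are hypothetical, not ezpz APIs; only the key names and the HOSTS formatting are taken from the output above. The demo writes into a plain dict so that the real process environment is left untouched.

```python
import os
from typing import MutableMapping, Optional, Sequence


def build_pbs_metadata(
    hosts: Sequence[str], ngpu_per_host: int, launch_cmd: str
) -> dict[str, str]:
    """Build a subset of the metadata keys shown in the verbose output."""
    nhosts = len(hosts)
    return {
        "HOSTS": "[" + ", ".join(hosts) + "]",
        "NHOSTS": str(nhosts),
        "NGPU_PER_HOST": str(ngpu_per_host),
        "NGPUS": str(nhosts * ngpu_per_host),
        "LAUNCH_CMD": launch_cmd,
    }


def inject(
    metadata: dict[str, str],
    environ: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Merge metadata into environ (os.environ when none is given)."""
    target = os.environ if environ is None else environ
    target.update(metadata)
    return target


hosts = ["x1921c1s2b0n0-hsn0", "x1921c1s5b0n0-hsn0"]
metadata = build_pbs_metadata(hosts, 12, "mpiexec --envall --np=24 --ppn=12")
env = inject(metadata, environ={})  # plain dict keeps the demo side-effect free
print(env["NGPUS"], env["HOSTS"])
```

Environment values must be strings, which is why the counts are converted with str before injection.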
Minimal end-to-end launch
import subprocess
from ezpz import pbs
hostfile = pbs.get_pbs_nodefile() # falls back if not on PBS
cmd = pbs.get_pbs_launch_cmd(hostfile=hostfile, ngpus=4)
# Run a distributed smoke test
subprocess.run(f"{cmd} python -m ezpz.examples.test", shell=True, check=True)
Notes
- If no PBS job is active, get_pbs_launch_cmd still works using the fallback hostfile and machine introspection.
- CPU binding precedence: CPU_BIND env > machine-specific defaults > generic depth binding.
- Combine with ezpz.launch if you want scheduler-agnostic CLI parsing and fallbacks.