ezpz launch
Single entry point for launching distributed applications.
This will:
- Automatically detect your PBS/Slurm job, and
- Launch <cmd> across all available accelerators.
This is done by detecting if ezpz launch is being executed from inside a
PBS/Slurm job [2].
If so, it determines the specifics of the active job (number of nodes, and
number of GPUs per node), and uses this information to build and execute the
appropriate launch command (e.g. mpiexec, srun).
When not running inside a PBS/Slurm job, ezpz launch falls back to mpirun
with sensible defaults.
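Arguments can be passed through to the mpiexec/srun launcher by separating
them from the <cmd> with -- [1], in the form
ezpz launch <launcher flags> -- <command> <args>.
For example, to run with 8 processes total, 4 processes per node, on 2 hosts, we can
pass -n 8 -ppn 4 -nh 2 ahead of the command, i.e.
ezpz launch -n 8 -ppn 4 -nh 2 -- python3 -m ezpz.examples.fsdp_tp.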
Assuming your current job can satisfy this (i.e. at least 4 accelerators per
node, and at least 2 nodes), this would launch python3 -m
ezpz.examples.fsdp_tp across 8 processes, 4 per node, on the first two hosts
allocated to your job.
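As a rough illustration of the pass-through convention described above, the sketch below splits an argument list at the first -- and assembles an mpiexec-style command line. The names split_launch_args and build_mpiexec_cmd are hypothetical helpers, not part of ezpz, and the flag spelling (--np=, --ppn=, --hostfile=) is copied from the logged Aurora command further down this page rather than from the launcher's source.

```python
def split_launch_args(argv):
    """Split argv at the first '--' into (flags, command).

    Hypothetical helper: with no '--', every argument is treated as part
    of the command to run (see note 1).
    """
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return [], list(argv)


def build_mpiexec_cmd(nproc, ppn, hostfile, passthrough, command):
    """Assemble an mpiexec-style argument list (illustrative flag spelling)."""
    launch = [
        "mpiexec",
        "--envall",
        f"--np={nproc}",
        f"--ppn={ppn}",
        f"--hostfile={hostfile}",
    ]
    # Pass-through flags go to the launcher, ahead of the user command.
    return launch + list(passthrough) + list(command)


flags, command = split_launch_args(
    ["--verbose", "--", "python3", "-m", "ezpz.examples.fsdp_tp"]
)
# "hostfile" is a placeholder path used only for this illustration.
print(" ".join(build_mpiexec_cmd(8, 4, "hostfile", flags, command)))
```

Keeping the split separate from the command assembly makes the "no separator" case easy to reason about: the flag list is simply empty.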
ezpz launch --help

usage: ezpz launch [-h] [--print-source] [--filter FILTER [FILTER ...]]
                   [-n NPROC] [-ppn NPROC_PER_NODE] [-nh NHOSTS]
                   [--hostfile HOSTFILE] [--cpu-bind CPU_BIND]
                   ...

Launch a command on the current PBS/SLURM job.

Additional `<launcher flags>` can be passed through directly to the launcher
by including '--' as a separator before the command.

Examples:
    ezpz launch <launcher flags> -- <command> <args>
    ezpz launch -n 8 -ppn 4 --verbose --tag-output -- python3 -m ezpz.examples.fsdp_tp
    ezpz launch --nproc 8 -x EZPZ_LOG_LEVEL=DEBUG -- python3 my_script.py --my-arg val

positional arguments:
  command               Command (and arguments) to execute. Use '--' to
                        separate options when needed.

options:
  -h, --help            show this help message and exit
  --print-source        Print the location of the launch CLI source and exit.
  --filter FILTER [FILTER ...]
                        Deprecated: output filtering has been removed. This
                        flag is ignored.
  -n NPROC, -np NPROC, --n NPROC, --np NPROC, --nproc NPROC, --world_size NPROC, --nprocs NPROC
                        Number of processes.
  -ppn NPROC_PER_NODE, --ppn NPROC_PER_NODE, --nproc_per_node NPROC_PER_NODE
                        Processes per node.
  -nh NHOSTS, --nh NHOSTS, --nhost NHOSTS, --nnode NHOSTS, --nnodes NHOSTS, --nhosts NHOSTS, --nhosts NHOSTS
                        Number of nodes to use.
  --hostfile HOSTFILE   Hostfile to use for launching.
  --cpu-bind CPU_BIND   CPU binding value to pass to the launcher. Takes
                        precedence over CPU_BIND when both are specified.
  --timeout IDLE_TIMEOUT_S
                        Idle-output watchdog timeout in seconds. Off by
                        default.
  --retries RETRIES     Re-execute on non-zero exit, up to N times.
                        Default: 0.
  --auto-retry          Unbounded bad-node failover loop. Mutually exclusive
                        with --retries. Requires explicit --nproc.
  --spare-nodes SPARE_NODES
                        Spare-node pool for --auto-retry. "auto" (default)
                        derives from total_pbs_nodes - ceil($nproc / $ppn);
                        pass an int for an explicit cap.
  --max-failover-retries MAX_FAILOVER_RETRIES
                        Optional upper bound on --auto-retry attempts.
                        Default: unbounded (see termination matrix).
Idle-output watchdog (--timeout)
--timeout SECONDS arms a watchdog that monitors the launched
process's output. If no output appears (on stdout or stderr;
they are merged at the watchdog) for SECONDS consecutive seconds,
the watchdog sends SIGTERM, waits up to 10 seconds for a clean
shutdown, then sends SIGKILL. The exit code returned by
ezpz launch is 124 (matching GNU timeout(1) convention) so
shell wrappers can distinguish "killed for going silent" from
"command failed". Passing --timeout 0 disables the watchdog (same
as omitting the flag).
# Abort if the training script goes silent for 10 minutes.
ezpz launch --timeout 600 -- python3 -m my_app.train
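The idle-output semantics can be illustrated with a small Python sketch. This is not the ezpz implementation: run_with_idle_watchdog is a hypothetical name, and the polling loop and reader thread are assumptions made for the example. It mirrors the behavior described above: stdout and stderr are merged, each line of output resets the idle clock, and a silent child gets SIGTERM, a grace period, then SIGKILL, with 124 returned to the caller.

```python
import os
import subprocess
import sys
import threading
import time


def run_with_idle_watchdog(cmd, idle_timeout, grace=10.0, poll=0.05):
    """Run cmd; return its exit code, or 124 if it stays silent too long.

    Hypothetical sketch. idle_timeout <= 0 disables the watchdog.
    """
    # Ask Python children not to block-buffer their output.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams
        env=env,
        text=True,
    )
    last_output = [time.monotonic()]

    def pump():
        # Forward every line and record when it arrived.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            sys.stdout.write(line)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    while proc.poll() is None:
        idle = time.monotonic() - last_output[0]
        if idle_timeout > 0 and idle > idle_timeout:
            proc.terminate()  # SIGTERM first
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                proc.kill()  # SIGKILL after the grace period
                proc.wait()
            reader.join(timeout=1.0)
            return 124
        time.sleep(poll)

    reader.join(timeout=1.0)
    return proc.returncode
```

Calling run_with_idle_watchdog([sys.executable, "-m", "my_app.train"], idle_timeout=600) would correspond to the shell example above; the elapsed-since-last-line check is what makes this an idle limit rather than a walltime limit.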
Idle, not walltime. The process can run indefinitely as long as
it keeps emitting at least one line per SECONDS on either stream.
This is the right semantics for catching collective hangs (e.g.
xccl on XPU silently deadlocking) where the process is alive but
every rank is blocked in the same collective and nothing reaches
either stream. For a hard walltime limit, use the scheduler's
existing mechanism (#PBS -l walltime=...).
Python buffering. The watchdog sets PYTHONUNBUFFERED=1 in the
child environment so Python's default block-buffering (which kicks
in when stdout isn't a TTY) doesn't fool the watchdog into killing a
healthy job that's accumulating output in a 4-8 KB buffer. The
variable is benign for non-Python children: they ignore it.
Scope caveat. The watchdog only watches the process ezpz launch
spawns directly. If you qsub a job script that internally invokes
python train.py, the watchdog needs to live inside that script
(or call ezpz launch from inside it), not the outer qsub.
Retry on non-zero exit (--retries)
--retries N re-executes the command up to N additional times
whenever the previous attempt returns a non-zero exit code, including
the watchdog's 124. Exponential backoff is applied between attempts
(5s, 10s, 20s, 40s, then capped at 60s).
# Up to 3 retries with watchdog protection. Useful for flaky fabrics
# or transient EC2 spot interruptions.
ezpz launch --timeout 600 --retries 3 -- python3 -m my_app.train
A clean exit on any attempt short-circuits the loop and returns 0.
If every attempt fails, the final attempt's exit code is returned.
Combine with --timeout to convert silent hangs into retryable
failures.
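The retry schedule can be written down in a few lines. The sketch below is an illustration under stated assumptions, not the ezpz source: backoff_seconds and run_with_retries are hypothetical names, and run_once stands in for "execute the command once and return its exit code".

```python
import time


def backoff_seconds(attempt, base=5, cap=60):
    """Delay before retry number `attempt` (0-based): 5, 10, 20, 40, then 60."""
    return min(base * 2 ** attempt, cap)


def run_with_retries(run_once, retries, sleep=time.sleep):
    """Call run_once() until it returns 0 or `retries` extra attempts are used.

    Returns 0 on the first clean exit, otherwise the final attempt's code.
    """
    rc = run_once()
    for attempt in range(retries):
        if rc == 0:
            return 0
        sleep(backoff_seconds(attempt))
        rc = run_once()
    return rc
```

Passing sleep as a parameter keeps the loop testable without waiting out the real delays.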
Auto-retry on bad-node failure (--auto-retry)
--auto-retry engages the failover loop. On every non-zero exit,
ezpz scrapes the log for known bad-node signatures (Aurora PALS
shepherd-9, gloo TCP peer-closed), swaps each named host out for a
spare from the rest of the PBS allocation, and re-runs the command.
Unlike --retries N, the loop is unbounded by default: it
continues until one of the conditions in the termination matrix
below fires.
# 50 nodes allocated by PBS on Aurora (12 ranks/node = 600 GPUs).
# Train on 512 ranks (= ceil(512/12) = 43 hosts), reserve the
# remaining 7 hosts as spares. Loop until success / walltime /
# spare exhaustion.
ezpz launch --auto-retry --np 512 -- python3 -m ezpz.examples.test
--auto-retry is mutually exclusive with --retries. They model
different things: --retries N is a bounded process-level retry
that re-launches the same command on the same nodes. --auto-retry
is an unbounded node-level failover that swaps bad hosts out
between attempts.
Decision flow at a glance
flowchart TD
Start(["ezpz launch --auto-retry --np N"]) --> Validate{"nproc set<br/>explicitly?"}
Validate -->|no| ErrParse["SystemExit at parse:<br/>requires --nproc"]
Validate -->|yes| Split["Split PBS nodelist<br/>into active + spare,<br/>write active.hostfile"]
Split --> Attempt["Run attempt i<br/>tee to attempt-i.log,<br/>watchdog armed<br/>(default 1800s)"]
Attempt -->|"SIGINT<br/>(Ctrl-C)"| Interrupted(["FAILOVER STOP:<br/>interrupted<br/>return 130"])
Attempt --> GotRC["rc = child exit<br/>124 if watchdog fired"]
GotRC --> Strip["Strip ANSI codes,<br/>strip innocent<br/>rank-cascade lines"]
Strip --> CheckSuccess{"rc==0 AND<br/>no crash patterns<br/>AND inner_rc==0?"}
CheckSuccess -->|yes| Success(["FAILOVER STOP:<br/>success<br/>return 0"])
CheckSuccess -->|no| CheckWalltime{"rc==143 AND<br/>no crash patterns?"}
CheckWalltime -->|yes| Walltime(["FAILOVER STOP:<br/>walltime<br/>return 143"])
CheckWalltime -->|no| CheckStuck{"prior AND current<br/>attempt both have<br/>0 step= markers?"}
CheckStuck -->|yes| Stuck(["FAILOVER STOP:<br/>stuck_pre_training<br/>return rc"])
CheckStuck -->|no| CheckSpares{"spares left?"}
CheckSpares -->|no| Exhausted(["FAILOVER STOP:<br/>exhausted<br/>return rc"])
CheckSpares -->|yes| CheckScraper{"scraper found<br/>named host(s)?"}
CheckScraper -->|yes| Swap["Swap bad host(s)<br/>for spare(s),<br/>rewrite active.hostfile"]
CheckScraper -->|no| Swap
Swap --> Backoff["Backoff sleep:<br/>5/10/20/40/60s"]
Backoff --> Attempt
The single Swap node above is swap_in when the scraper named
hosts and swap_one_blind when it didn't. See the
empty-swap_in fallback note for the
edge case where swap_in finds no live hosts to replace and
falls through to a blind rotation.
Required: explicit --nproc
--auto-retry needs to know how many ranks are training so it can
split the PBS allocation into active + spare. We do not guess
the active-host count; pass --nproc N (or -n N / --np N)
explicitly. The CLI errors out at parse time otherwise:
$ ezpz launch --auto-retry -- python3 train.py
--auto-retry requires --nproc (-n/--np) to be set explicitly. ...
Spare-node policy (--spare-nodes)
By default (--spare-nodes auto), the spare pool is
total_pbs_nodes - ceil($nproc / $ppn). The --nproc (or -n,
--np) flag counts ranks, not nodes; we ceiling-divide by the
ranks-per-node (--ppn or the cluster's get_gpus_per_node()) to
get the number of hosts actually needed for training. Any
allocated nodes beyond that go into the spare pool.
So if PBS gave you 50 nodes on Aurora (12 GPUs/node) and you ask
for 512 training ranks, the active set is ceil(512/12) = 43
hosts and the spare pool is 50 - 43 = 7. Pass --spare-nodes N
to cap the pool explicitly (useful when you'd rather not use the
full leftover slice):
# Cap the spare pool at 5, regardless of how many nodes PBS gave us.
ezpz launch --auto-retry --np 512 --spare-nodes 5 -- ...
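The arithmetic behind the split is small enough to check by hand. The sketch below reproduces it; split_allocation is a hypothetical helper, and taking the active set from the front of the node list is an assumption of this example rather than something stated by ezpz.

```python
import math


def split_allocation(hosts, nproc, ppn, spare_cap=None):
    """Split an allocation into (active, spares).

    active gets ceil(nproc / ppn) hosts; the leftover hosts are spares,
    optionally capped at spare_cap.
    """
    needed = math.ceil(nproc / ppn)
    if needed > len(hosts):
        raise ValueError(
            f"need {needed} hosts for {nproc} ranks at {ppn}/node, "
            f"but only {len(hosts)} are allocated"
        )
    active = list(hosts[:needed])
    spares = list(hosts[needed:])
    if spare_cap is not None:
        spares = spares[:spare_cap]
    return active, spares


# The Aurora example above: 50 nodes, 12 ranks per node, 512 training ranks.
hosts = [f"node{i:02d}" for i in range(50)]  # placeholder host names
active, spares = split_allocation(hosts, nproc=512, ppn=12)
print(len(active), len(spares))
```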
Termination matrix
Every termination logs a single FAILOVER STOP: <reason> line so
post-mortem grep is reliable.
| # | Condition | Result |
|---|---|---|
| 1 | exit 0 (clean inner trailer, no crash patterns) | SUCCESS → return 0 |
| 2 | exit 143 (walltime SIGTERM), no crash patterns | WALLTIME → return rc |
| 3 | exit 143 with crash patterns in log | bad-node retry (real failure raced the walltime kill) |
| 4 | exit 124 (idle-output watchdog tripped) | bad-node retry (silent hang → blind rotation) |
| 5 | any other non-zero, scraper found named host(s) | swap_in named → retry |
| 6 | any other non-zero, scraper found nothing | swap_one_blind → retry |
| 7 | two consecutive attempts with zero step= markers | STUCK_PRE_TRAINING → return rc |
| 8 | bad-node verdict but no spares left | EXHAUSTED → return rc |
| 9 | SIGINT (Ctrl-C) | INTERRUPTED → return 130 |
Empty-swap_in fallback. Row 5's swap_in skips any host that
isn't currently in the active set (the named host was already
replaced on a prior attempt, the scraper picked up stale lines from
an older log, etc.). If swap_in ends up swapping zero hosts, the
loop falls through to row 6's swap_one_blind so it still makes
forward progress instead of looping on the same bad set.
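The fallback can be sketched with plain lists. The function names swap_in and swap_one_blind come from the text above, but their bodies here are assumptions made for illustration; in particular, which host a blind rotation evicts is not stated on this page, so this sketch arbitrarily picks the first active host. swap_bad_hosts is a hypothetical wrapper name.

```python
def swap_in(active, spares, named_bad):
    """Replace each named host that is still active with the next spare."""
    active, spares = list(active), list(spares)
    swapped = []
    for host in named_bad:
        # Skip hosts that were already replaced on an earlier attempt.
        if host in active and spares:
            active[active.index(host)] = spares.pop(0)
            swapped.append(host)
    return active, spares, swapped


def swap_one_blind(active, spares):
    """Rotate one active host out for a spare (victim choice is an assumption)."""
    active, spares = list(active), list(spares)
    if not active or not spares:
        return active, spares, []
    victim = active[0]
    active[0] = spares.pop(0)
    return active, spares, [victim]


def swap_bad_hosts(active, spares, named_bad):
    """Try a named swap; fall back to a blind rotation if nothing was swapped."""
    new_active, new_spares, swapped = swap_in(active, spares, named_bad)
    if not swapped:
        return swap_one_blind(active, spares)
    return new_active, new_spares, swapped
```

Returning the list of evicted hosts lets the caller append them to a bad-node record and notice when no swap was possible at all.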
The step= marker guard (#7) replaces a numeric "max consecutive
blind rotations" cap. The intent is to catch broken configs / missing
datasets / pre-training-loop bugs before they burn the entire spare
pool: if two attempts in a row die before History.update prints
its first step=N line, no amount of node-swapping will help.
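Read together with the flowchart, the matrix amounts to an ordered series of checks. The sketch below encodes that order as a pure function; classify_attempt and its parameter names are hypothetical, the verdict strings follow the FAILOVER STOP reasons, and SIGINT handling (row 9) is left out because it interrupts an attempt rather than classifying a finished one.

```python
def classify_attempt(rc, has_crash_patterns, inner_rc,
                     prev_steps, cur_steps, spares_left, named_hosts):
    """Return a verdict for one finished attempt, checked in flowchart order.

    prev_steps is None on the first attempt; step counts are the number of
    step= markers seen in each attempt's log.
    """
    if rc == 0 and not has_crash_patterns and inner_rc == 0:
        return "success"
    if rc == 143 and not has_crash_patterns:
        return "walltime"
    if prev_steps == 0 and cur_steps == 0:
        return "stuck_pre_training"
    if spares_left <= 0:
        return "exhausted"
    # Bad-node retry: a named swap if the scraper found hosts, else blind.
    return "swap_in" if named_hosts else "swap_one_blind"
```

Because the function has no side effects, each row of the matrix can be pinned with a one-line assertion.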
Worked example: real Aurora UR_RESULT_ERROR_OUT_OF_RESOURCES
Here's an excerpt from a real Aurora torchtitan job that the
classifier handles correctly. The relevant signals from
attempt-1.log:
[2026-05-12 08:04:23][I][components/metrics:526:log] step: 1 loss: 12.94587 ...
[2026-05-12 08:04:30][I][components/metrics:526:log] step: 2 loss: 12.90856 ...
... (16 more clean training steps) ...
[2026-05-12 08:06:24][I][components/metrics:526:log] step: 18 loss: 10.27772 ...
[rank7]: RuntimeError: level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
x4610c4s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 7 exited with code 1
x4610c4s5b0n0.hsn.cm.aurora.alcf.anl.gov: rank 14 died from signal 15
[ezpz/launch] Execution finished with 143.
What the classifier does step by step:
1. rc=143 from the shell (mpiexec teardown after the GPU OOM).
2. Strip ANSI codes from the log.
3. Strip innocent rank-cascade lines: rank 14 died from signal 15 is a downstream cascade from the primary kill on x4610c4s3b0n0, not a bad-node indicator on x4610c4s5b0n0. This line is excluded before the crash-pattern match runs (job 8466848 postmortem: tagging cascade victims as bad nodes burns spares for nothing).
4. Run the crash-pattern grep on the stripped text: UR_RESULT_ERROR_OUT_OF_RESOURCES matches, so there IS a real hardware failure in the log.
5. rc==143 AND crash_patterns → bad-node retry path, not WALLTIME. Without the strip we'd still get to bad-node retry (the cascade lines also contain died from signal), but we'd reach it via the wrong condition, and we'd be at risk of tagging x4610c4s5b0n0 as a bad node when only x4610c4s3b0n0 is the actual culprit.
6. Scraper picks up the hostname from the x4610c4s3b0n0.hsn...: rank 7 exited with code 1 line (or, if the scraper missed it because it's not in the explicit pattern set, falls through to swap_one_blind).
7. bad_nodes.txt gets x4610c4s3b0n0.hsn.cm.aurora.alcf.anl.gov.
8. active.hostfile is rewritten with that host replaced by the next spare.
9. Backoff 5 seconds, then run attempt 2 (logged to attempt-2.log). The retry uses the updated active set; the bad GPU is no longer in the training pool.
This exact log shape is pinned as a regression test in both code
paths:
test_crash_patterns_real_ur_oom_with_cascade_regression
(Python) and
test_run_walltime_143_retries_on_real_aurora_ur_oom_with_cascade
(bash).
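The strip-then-match order from the worked example can be reproduced on the log excerpt itself. The regular expressions below are assumptions made for this sketch: the crash alternatives are borrowed from the postmortem grep shown elsewhere on this page, and the cascade pattern only covers the "died from signal 15" form seen in the excerpt. The real pattern set lives in ezpz and may differ.

```python
import re

# Assumed patterns, for illustration only.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
CASCADE_RE = re.compile(r"rank \d+ died from signal 15")
CRASH_RE = re.compile(
    r"OutOfMemoryError|UR_RESULT_ERROR|gloo.*Connection closed|shepherd died"
)

SAMPLE = """\
[rank7]: RuntimeError: level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
x4610c4s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 7 exited with code 1
x4610c4s5b0n0.hsn.cm.aurora.alcf.anl.gov: rank 14 died from signal 15
[ezpz/launch] Execution finished with 143.
"""


def clean_log(text):
    """Drop ANSI escape codes, then drop rank-cascade lines."""
    text = ANSI_RE.sub("", text)
    kept = [line for line in text.splitlines() if not CASCADE_RE.search(line)]
    return "\n".join(kept)


def has_crash_patterns(text):
    """True if a crash signature survives the cleaning pass."""
    return CRASH_RE.search(clean_log(text)) is not None


print(has_crash_patterns(SAMPLE))
print("x4610c4s5b0n0" in clean_log(SAMPLE))
```

Running the match on the cleaned text means the cascade victim's hostname never reaches the scraper, which is the property the regression tests above are meant to protect.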
Reading the postmortem log
After a run finishes (success, walltime, or exhausted), the
logs/failover-<jobid>/ directory is the postmortem entry point.
A few one-liners:
# What was the final verdict?
grep "FAILOVER STOP" logs/failover-*/attempt-*.log
# Which nodes were swapped out across all attempts?
cat logs/failover-*/bad_nodes.txt
# Which step did each attempt reach before dying?
for f in logs/failover-*/attempt-*.log; do
echo "=== $f ==="
grep -oE "step:[[:space:]]+[0-9]+" "$f" | tail -1
done
# Was the failure a real hardware death or just a walltime hit?
grep -E "OutOfMemoryError|UR_RESULT_ERROR|gloo.*Connection closed|shepherd died" \
logs/failover-*/attempt-*.log
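For scripted postmortems, the "which step did each attempt reach" one-liner has a simple Python counterpart. This is a sketch: last_step is a hypothetical helper, and the regular expression is an assumption that accepts both the step: N form from the torchtitan excerpt and the step=N form mentioned for the marker guard.

```python
import re

# Matches "step: 18" as well as "step=18" (assumed marker formats).
STEP_RE = re.compile(r"step[:=]\s*(\d+)")


def last_step(log_text):
    """Return the last step number found in a log, or None if there is none."""
    matches = STEP_RE.findall(log_text)
    return int(matches[-1]) if matches else None
```

A result of None for two attempts in a row corresponds to the zero-marker situation that the STUCK_PRE_TRAINING guard looks for.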
Default idle-output watchdog
When --auto-retry is set, --timeout defaults to 1800 seconds
(30 minutes) instead of being off. This matches the
FAILOVER_IDLE_TIMEOUT default in src/ezpz/bin/failover.sh and
prevents silent xccl hangs from burning the full PBS walltime
before the loop can intervene. Pass --timeout 0 to disable, or
--timeout N to override.
Optional cap (--max-failover-retries)
--max-failover-retries N is an additional belt-and-suspenders
cap. Default is unbounded: the loop terminates only via the matrix above.
Useful for short jobs where you'd rather give up than retry 100
times.
Files written
Per-job, in $(pwd)/logs/failover-<jobid>/:
- active.hostfile: the current active node set, mutated in place as nodes are swapped. Always reflects what the next attempt will run on.
- bad_nodes.txt: every host that's been swapped out (named swap_in and blind rotations). Append-only.
- attempt-N.log: combined stdout+stderr of attempt N.
Note on active.hostfile mutation. The file is rewritten in place between attempts. If you cat it from a second shell mid-run, you'll see the current active set, not the original PBS allocation. The launcher reads it fresh at each attempt, so no re-launch of ezpz launch itself is needed for the new contents to take effect: the launcher subprocess re-resolves the hostfile path's contents per spawn. To inspect the original PBS allocation, look at $PBS_NODEFILE (unmodified) instead.
Relationship to src/ezpz/bin/failover.sh
failover.sh is the bash equivalent for users who can't put
ezpz launch at the top of their job script, for example when
qsub'ing a wrapper that already invokes python directly. The
scrape source is identical (ezpz.failover.scrape_bad_nodes); the
retry/swap mechanics are independent re-implementations because the
classifier is much easier to test in Python. Prefer --auto-retry
when you can, fall back to sourcing the bash lib when you can't.
Testing on real Aurora / Sunspot allocations
Two ready-to-submit PBS drivers ship with the package, both
following the same scenario-runner pattern (each scenario writes
PASS|FAIL to a summary at the end):
# In an ezpz checkout on Aurora:
qsub src/ezpz/bin/test_launch_auto_retry_aurora.pbs
# Or for the pre-#144 watchdog/retries (PR #136):
qsub src/ezpz/bin/test_launch_timeout_retries_aurora.pbs
The --auto-retry driver requests 4 nodes and runs 7 scenarios
(~30 min walltime budget):
| Scenario | What it covers |
|---|---|
| A | Happy path: succeeds first attempt, 0 swaps |
| B | Blind rotation: scraper finds nothing, swap_one_blind, succeed |
| C | Named swap: emit a PALS shepherd died from signal 9 line; verify the named host (not a blind one) gets swapped out |
| D | --max-failover-retries 2 cap: always-fail, exactly 3 attempts before bail |
| E | STUCK_PRE_TRAINING guard: 2 consecutive zero-step attempts, bail at attempt 2 |
| F | Realistic: ezpz.examples.test --model debug --train-iters 20 under --auto-retry |
| G | --hostfile honored: pass a 2-node filtered hostfile, verify the loop splits THAT (not the full PBS allocation) |
The driver works on Sunspot too: swap -A AuroraGPT → -A datascience
and filesystems=flare:home → filesystems=tegu:home in the PBS header
(everything else is identical: same scrape patterns, same workflow).
Inspect after the run:
tail -f logs/ezpz-launch-auto-retry-test.*.log # scenario summary
ls /tmp/ezpz-aurora-auto-retry-*/scen-*/logs/ # per-scenario failover dirs
For a one-off manual test against your own workload (no scenario
harness, just --auto-retry on a job you'd be running anyway):
# Inside any PBS allocation with .venv set up:
ezpz launch --auto-retry --np <N> --timeout 600 -- python3 -m your.module
If you want to force-trigger a swap to validate the loop end-to-end, make your script emit one of the recognized Aurora crash signatures on rank 0 of the first attempt, e.g.:
# Sentinel must live on SHARED filesystem (PBS_O_WORKDIR is your
# submit dir, on Lustre/GPFS); /tmp would be per-node and the
# sentinel would vanish after the failover swaps the active host out.
SENTINEL="${PBS_O_WORKDIR}/.ezpz-attempt2-$$"
# Aurora scraper recognizes the .hsn.cm.aurora.alcf.anl.gov FQDN
# form; `hostname` on compute nodes returns just the short name. Use
# the first PBS_NODEFILE entry; those entries are always in HSN form.
ACTIVE_HOST=$(head -n 1 "${PBS_NODEFILE}")
ezpz launch --auto-retry --np 12 -ppn 12 --timeout 60 -- bash -c "
if [[ ! -f '${SENTINEL}' ]]; then
touch '${SENTINEL}'
echo '${ACTIVE_HOST}: shepherd died from signal 9'
exit 1
fi
echo 'iter step=1 loss=0.5'
exit 0
"
# Should produce 2 attempts, ${ACTIVE_HOST} swapped into bad_nodes.txt.
# After: `cat logs/failover-*/bad_nodes.txt` shows the swapped host;
# `cat logs/failover-*/active.hostfile` shows the replacement.
# Clean up: rm "${SENTINEL}"
Python interpreter resolution
When ezpz launch needs to invoke python3 (e.g.
ezpz launch python3 -m my.module), it picks the interpreter in this
order:
1. $VIRTUAL_ENV/bin/python3 if $VIRTUAL_ENV is set and exists
2. shutil.which("python3"): first python3 on $PATH
3. sys.executable as a last resort
Why not just sys.executable? It's frozen at interpreter startup. If
you ran ezpz yeet to copy your env to /tmp/
and then source /tmp/.venv/bin/activate, sys.executable would still
point to the original Lustre path because the ezpz CLI script's
shebang is baked in at install time. Reading $VIRTUAL_ENV (set by
activate) lets the launcher follow the user's actual current venv.
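The resolution order reads naturally as a three-step function. The sketch below is an approximation, not the launcher's code: resolve_python is a hypothetical name, the env/which/fallback parameters exist only to make it testable, and checking that the bin/python3 file exists is one interpretation of "set and exists".

```python
import os
import shutil
import sys


def resolve_python(env=None, which=shutil.which, fallback=sys.executable):
    """Pick a python3 interpreter: active venv, then $PATH, then fallback."""
    env = os.environ if env is None else env

    # 1. Follow the currently activated virtual environment, if any.
    venv = env.get("VIRTUAL_ENV")
    if venv:
        candidate = os.path.join(venv, "bin", "python3")
        if os.path.exists(candidate):
            return candidate

    # 2. First python3 on $PATH.
    found = which("python3")
    if found:
        return found

    # 3. Last resort: the interpreter running this process.
    return fallback


print(resolve_python())
```

Reading the environment at call time, rather than relying on a value frozen at startup, is what lets this follow a venv that was activated after installation.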
Examples
Use it to launch:
- Arbitrary command(s)
- Arbitrary Python string
- One of the Distributed Training examples
- Your own distributed training script, e.g. to launch your_app.train across 16 processes, 8 per node.
Sequence Diagram
Two primary control paths drive ezpz launch: a scheduler-aware path used when
running inside PBS/SLURM allocations, and a local fallback that shells out to
mpirun when no scheduler metadata is available.
sequenceDiagram
autonumber
actor User
participant CLI as ezpz_launch
participant Scheduler as PBS_or_Slurm
participant MPI as mpirun_mpiexec
participant App as User_application
User->>CLI: ezpz launch <launch_flags> -- <cmd> <cmd_flags>
CLI->>Scheduler: detect_scheduler()
alt scheduler_detected
Scheduler-->>CLI: scheduler_type, job_metadata
CLI->>Scheduler: build_scheduler_command(cmd_to_launch)
Scheduler-->>CLI: launch_cmd (mpiexec_or_srun)
CLI->>MPI: run_command(launch_cmd)
MPI->>App: start_ranks_and_execute
App-->>MPI: return_codes
MPI-->>CLI: aggregate_status
else no_scheduler_detected
Scheduler-->>CLI: unknown
CLI->>MPI: mpirun -np 2 <cmd> <cmd_flags>
MPI->>App: start_local_ranks
App-->>MPI: return_codes
MPI-->>CLI: aggregate_status
end
CLI-->>User: exit_code
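The two control paths in the diagram can be approximated with an environment check. This is a sketch under assumptions: detect_scheduler and choose_launcher are hypothetical names, and keying off PBS_JOBID / PBS_NODEFILE and SLURM_JOB_ID (variables those schedulers set inside a job) is a guess at one reasonable detection method, not a description of how ezpz does it. The mpirun -np 2 fallback matches the diagram.

```python
def detect_scheduler(env):
    """Guess the active scheduler from job environment variables."""
    if env.get("PBS_JOBID") or env.get("PBS_NODEFILE"):
        return "pbs"
    if env.get("SLURM_JOB_ID"):
        return "slurm"
    return "unknown"


def choose_launcher(env, command):
    """Prefix command with a scheduler-appropriate launcher."""
    scheduler = detect_scheduler(env)
    if scheduler == "pbs":
        return ["mpiexec"] + list(command)
    if scheduler == "slurm":
        return ["srun"] + list(command)
    # No scheduler metadata: local fallback, as in the diagram.
    return ["mpirun", "-np", "2"] + list(command)


print(choose_launcher({}, ["python3", "demo.py"]))
```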
Distributed Training Examples
Examples: Scalable and ready-to-go!
Any of the examples can be launched with ezpz launch python3 -m ezpz.examples.<name>.
🤗 HF Integration
- ezpz.examples.{fsdp_tp,diffusion,hf,hf_trainer} all support arbitrary 🤗 Hugging Face datasets, e.g.:
    dataset="stanfordnlp/imdb"  # or any other HF dataset
    ezpz launch python3 -m ezpz.examples.fsdp_tp --dataset "${dataset}"
    ezpz launch python3 -m ezpz.examples.diffusion --dataset "${dataset}"
    ezpz launch python3 -m ezpz.examples.hf \
        --model_name_or_path meta-llama/Llama-3.2-1B \
        --dataset_name="${dataset}" \
        --streaming \
        --bf16=true
    ezpz launch python3 -m ezpz.examples.hf_trainer \
        --model_name_or_path meta-llama/Llama-3.2-1B \
        --dataset_name="${dataset}" \
        --streaming \
        --bf16=true
- ezpz.examples.hf and ezpz.examples.hf_trainer both support arbitrary combinations of (compatible) transformers.from_pretrained models, and HF Datasets (with support for streaming!). hf uses an explicit training loop with Accelerate, while hf_trainer wraps the HF Trainer API.
    ezpz launch python3 -m ezpz.examples.hf \
        --streaming \
        --dataset_name=eliplutchok/fineweb-small-sample \
        --tokenizer_name meta-llama/Llama-3.2-1B \
        --model_name_or_path meta-llama/Llama-3.2-1B \
        --bf16=true
    ezpz launch python3 -m ezpz.examples.hf_trainer \
        --streaming \
        --dataset_name=eliplutchok/fineweb-small-sample \
        --tokenizer_name meta-llama/Llama-3.2-1B \
        --model_name_or_path meta-llama/Llama-3.2-1B \
        --bf16=true
Simple Example
Output
MacBook Pro
#[01/08/26 @ 14:56:50][~/v/s/ezpz][dev][$!?] [4s]
; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
[2026-01-08 14:56:54,307030][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 -c 'import ezpz; print(ezpz.setup_torch())'
Using [2 / 2] available "mps" devices !!
0
1
[2025-12-23-162222] Execution time: 4s sec

Aurora (2 Nodes)
#[aurora_frameworks-2025.2.0](torchtitan-aurora_frameworks-2025.2.0)[1m9s]
#[01/08/26,14:56:42][x4418c6s1b0n0][/f/d/f/p/p/torchtitan][main][?]
; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
[2026-01-08 14:58:01,994729][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 14:58:01,997067][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 14:58:01,997545][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
[2026-01-08 14:58:02,465850][I][ezpz/launch:396:launch] ----[ezpz.launch][started][2026-01-08-145802]----
[2026-01-08 14:58:04,765720][I][ezpz/launch:416:launch] Job ID: 8247203
[2026-01-08 14:58:04,766527][I][ezpz/launch:417:launch] nodelist: ['x4418c6s1b0n0', 'x4717c0s6b0n0']
[2026-01-08 14:58:04,766930][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2026-01-08 14:58:04,767616][I][ezpz/pbs:264:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
[2026-01-08 14:58:04,768399][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
[2026-01-08 14:58:04,768802][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
[2026-01-08 14:58:04,769517][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 -c 'import ezpz; print(ezpz.setup_torch())'
[2026-01-08 14:58:04,770278][I][ezpz/launch:433:launch] Took: 3.01 seconds to build command.
[2026-01-08 14:58:04,770660][I][ezpz/launch:436:launch] Executing: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c import ezpz; print(ezpz.setup_torch())
[2026-01-08 14:58:04,772125][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
[2026-01-08 14:58:04,772651][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-145804...
[2026-01-08 14:58:04,773070][I][ezpz/launch:138:run_command] Caught 24 filters
[2026-01-08 14:58:04,773429][I][ezpz/launch:139:run_command] Running command: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c 'import ezpz; print(ezpz.setup_torch())'
CPU bind output (24 lines)
Raw rank output (24 lines)
[2026-01-08 14:58:14,252433][I][ezpz/launch:447:launch] ----[ezpz.launch][stop][2026-01-08-145814]----
[2026-01-08 14:58:14,253726][I][ezpz/launch:448:launch] Execution finished with 0.
[2026-01-08 14:58:14,254184][I][ezpz/launch:449:launch] Executing finished in 9.48 seconds.
[2026-01-08 14:58:14,254555][I][ezpz/launch:450:launch] Took 9.48 seconds to run. Exiting.
took: 18s

demo.py
import ezpz

# automatic device + backend setup for distributed PyTorch
_ = ezpz.setup_torch()  # CUDA/NCCL, XPU/XCCL, {MPS, CPU}/GLOO, ...

device = ezpz.get_torch_device()  # {cuda, xpu, mps, cpu, ...}
rank = ezpz.get_rank()
world_size = ezpz.get_world_size()
# ...etc

if rank == 0:
    print(f"Hello from rank {rank} / {world_size} on {device}!")

We can launch this script with:
ezpz launch python3 demo.py
Output(s)
MacBook Pro
Aurora (2 nodes)
# from 2 nodes of Aurora:
#[aurora_frameworks-2025.2.0](foremans-aurora_frameworks-2025.2.0)[C v7.5.0-gcc][43s]
#[01/08/26,07:26:10][x4604c5s2b0n0][~]
; ezpz launch python3 demo.py
[2026-01-08 07:26:19,723138][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 07:26:19,725453][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 07:26:19,725932][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
[2026-01-08 07:26:20,290222][I][ezpz/launch:396:launch] ----[ezpz.launch][started][2026-01-08-072620]----
[2026-01-08 07:26:21,566797][I][ezpz/launch:416:launch] Job ID: 8246832
[2026-01-08 07:26:21,567684][I][ezpz/launch:417:launch] nodelist: ['x4604c5s2b0n0', 'x4604c5s3b0n0']
[2026-01-08 07:26:21,568082][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2026-01-08 07:26:21,568770][I][ezpz/pbs:264:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
[2026-01-08 07:26:21,569557][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
[2026-01-08 07:26:21,569959][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
[2026-01-08 07:26:21,570821][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 demo.py
[2026-01-08 07:26:21,571548][I][ezpz/launch:433:launch] Took: 2.11 seconds to build command.
[2026-01-08 07:26:21,571918][I][ezpz/launch:436:launch] Executing: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
[2026-01-08 07:26:21,573262][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
[2026-01-08 07:26:21,573781][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-072621...
[2026-01-08 07:26:21,574195][I][ezpz/launch:138:run_command] Caught 24 filters
[2026-01-08 07:26:21,574532][I][ezpz/launch:139:run_command] Running command: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
CPU bind output (24 lines)
Using [24 / 24] available "xpu" devices !!
Hello from rank 0 / 24 on xpu!
[2026-01-08 07:26:33,060432][I][ezpz/launch:447:launch] ----[ezpz.launch][stop][2026-01-08-072633]----
[2026-01-08 07:26:33,061512][I][ezpz/launch:448:launch] Execution finished with 0.
[2026-01-08 07:26:33,062045][I][ezpz/launch:449:launch] Executing finished in 11.49 seconds.
[2026-01-08 07:26:33,062531][I][ezpz/launch:450:launch] Took 11.49 seconds to run. Exiting.
took: 22s
[1] When no -- is present, all arguments are treated as part of the command to run.
[2] By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm).
If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:
- The number of available nodes
- How many GPUs are present on each of these nodes
- How many GPUs we have total
It will then use this information to automatically construct the appropriate {mpiexec, srun} command to launch, and finally, execute the launch cmd.
[3] The ezpz.History class automatically computes distributed statistics (min, max, mean, std. dev) across ranks for all recorded metrics.
NOTE: This is automatically disabled when ezpz.get_world_size() >= 384 (e.g. >= {32, 96} {Aurora, Polaris} nodes) due to the additional overhead introduced (but can be manually enabled, if desired).