ezpz launch
Single entry point for distributed jobs.
ezpz detects PBS/Slurm automatically and falls back to mpirun, forwarding
useful environment variables so your script behaves the same on laptops and
clusters.
Add your own args to any command (--config, --batch-size, etc.) and ezpz
will propagate them through the detected launcher.
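As a rough illustration of how arguments can be routed, the sketch below splits an argument list at the first `--`, the separator that the `ezpz launch --help` text describes. The helper name `split_launch_args` is hypothetical and not part of ezpz, and treating everything as the command when no `--` is present is a simplifying assumption rather than ezpz's actual parsing logic.

```python
from typing import List, Tuple


def split_launch_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first '--' into (launcher_flags, command).

    Hypothetical helper for illustration; not part of ezpz's API.
    Simplifying assumption: without a '--', the whole argv is the command.
    """
    if "--" in argv:
        idx = argv.index("--")
        # Everything before '--' is for the launcher, everything after is the command.
        return argv[:idx], argv[idx + 1:]
    return [], list(argv)


flags, cmd = split_launch_args(
    ["-n", "8", "-ppn", "4", "--", "python3", "my_script.py", "--batch-size", "32"]
)
print("launcher flags:", flags)
print("command:", cmd)
```

The script's own arguments (`--batch-size 32` here) stay attached to the command and are never interpreted as launcher flags.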
Use the provided ezpz launch <cmd> to automatically launch <cmd> across all available accelerators [1].
-
ezpz launch --help

    usage: ezpz launch [-h] [--print-source] [--filter FILTER [FILTER ...]]
                       [-n NPROC] [-ppn NPROC_PER_NODE] [-nh NHOSTS]
                       [--hostfile HOSTFILE]
                       ...

    Launch a command on the current PBS/SLURM job.

    Additional `<launcher flags>` can be passed through directly
    to the launcher by including '--' as a separator before
    the command.

    Examples:

        ezpz launch <launcher flags> -- <command> <args>

        ezpz launch -n 8 -ppn 4 --verbose --tag-output -- python3 -m ezpz.examples.fsdp_tp

        ezpz launch --nproc 8 -x EZPZ_LOG_LEVEL=DEBUG -- python3 my_script.py --my-arg val

    positional arguments:
      command               Command (and arguments) to execute. Use '--' to separate options when needed.

    options:
      -h, --help            show this help message and exit
      --print-source        Print the location of the launch CLI source and exit.
      --filter FILTER [FILTER ...]
                            Filter output lines by these strings.
      -n NPROC, -np NPROC, --n NPROC, --np NPROC, --nproc NPROC, --world_size NPROC, --nprocs NPROC
                            Number of processes.
      -ppn NPROC_PER_NODE, --ppn NPROC_PER_NODE, --nproc_per_node NPROC_PER_NODE
                            Processes per node.
      -nh NHOSTS, --nh NHOSTS, --nhost NHOSTS, --nnode NHOSTS, --nnodes NHOSTS, --nhosts NHOSTS, --nhosts NHOSTS
                            Number of nodes to use.
      --hostfile HOSTFILE   Hostfile to use for launching.
-
Scheduler smarts: detects PBS/Slurm automatically;
otherwise falls back to mpirun with sensible env forwarding. For launcher-only flags/env (e.g., -x FOO=bar), place them before --; everything after -- is the command to run (ezpz launch <launcher flags> -- <command> <args>).
-
Automatic distributed initialization using
ezpz.setup_torch() with automatic {device, backend} selection
-
Automatic single-process logging with rank-aware filtering for distributed runs
-
Metric tracking, aggregation, and recording via ezpz.History():
- Automatic distributed statistics (min, max, mean, stddev) across ranks [2]
- Weights & Biases integration
- Persistent storage of metrics in .h5 format
- Plotting support:
  - Graphical plots (svg, png) via matplotlib
  - Terminal-based ASCII plots via plotext

[Example plotext output omitted: line plots of dt and loss versus iter, each with min, max, mean, and std panels, plus a combined summary plot and histograms. The text renderings are saved under /Users/samforeman/vibes/saforem2/ezpz/outputs/History-2026-01-15-162549/2026-01-15-162549/plots/tplot/ as dt.txt, dt_summary.txt, dt_hist.txt, loss.txt, loss_summary.txt, and loss_hist.txt.]
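To make the "statistics across ranks" idea concrete, here is a minimal sketch that reduces one metric's per-rank values to the four summary keys that appear as plot panels (`<name>/min`, `<name>/max`, `<name>/mean`, `<name>/std`). It does not use ezpz and is not how `ezpz.History` is implemented: the function name `summarize_across_ranks` is hypothetical, the per-rank values are passed in as a plain list instead of being gathered with a collective, and the use of the population standard deviation is an assumption.

```python
import statistics
from typing import Dict, List


def summarize_across_ranks(name: str, per_rank: List[float]) -> Dict[str, float]:
    """Reduce one metric's per-rank values to min / max / mean / std.

    `per_rank[i]` stands in for the value recorded on rank i. In a real
    distributed run these values would first have to be gathered from every
    rank, which this sketch does not do.
    """
    return {
        f"{name}/min": min(per_rank),
        f"{name}/max": max(per_rank),
        f"{name}/mean": statistics.fmean(per_rank),
        # Assumption: population standard deviation over the ranks.
        f"{name}/std": statistics.pstdev(per_rank),
    }


stats = summarize_across_ranks("dt", [0.10, 0.12, 0.11, 0.13])
for key, value in stats.items():
    print(f"{key}: {value:.4f}")
```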
Use it to launch:
-
Arbitrary command(s):
-
Arbitrary Python string (e.g. ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())')
-
One of the ready-to-go examples (e.g. ezpz launch python3 -m ezpz.examples.fsdp_tp)
-
Your own distributed training script, e.g. ezpz launch -n 16 -ppn 8 -- python3 -m your_app.train
to launch your_app.train across 16 processes, 8 per node.
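The arithmetic behind "16 processes, 8 per node" can be sketched as follows. The helpers `nodes_needed` and `build_launch_argv` are hypothetical names introduced for illustration; only the `-n` and `-ppn` flags and the `--` separator are taken from the `ezpz launch --help` text, and invoking the script as `python3 -m your_app.train` is an assumption.

```python
import math
import shlex
from typing import List


def nodes_needed(nproc: int, ppn: int) -> int:
    """Number of nodes required to host `nproc` processes at `ppn` per node."""
    if nproc <= 0 or ppn <= 0:
        raise ValueError("nproc and ppn must be positive")
    return math.ceil(nproc / ppn)


def build_launch_argv(nproc: int, ppn: int, command: List[str]) -> List[str]:
    """Assemble an `ezpz launch` command line as an argv list.

    Hypothetical helper: it only strings together flags documented in the
    help text and does not call ezpz.
    """
    return ["ezpz", "launch", "-n", str(nproc), "-ppn", str(ppn), "--", *command]


argv = build_launch_argv(16, 8, ["python3", "-m", "your_app.train"])
print(shlex.join(argv))
print("nodes needed:", nodes_needed(16, 8))
```

Building the command as a list and joining it with `shlex.join` keeps any arguments that contain spaces correctly quoted if the line is later pasted into a shell.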
Sequence Diagram
Two primary control paths drive ezpz launch: a scheduler-aware path used when
running inside PBS/SLURM allocations, and a local fallback that shells out to
mpirun when no scheduler metadata is available.
sequenceDiagram
autonumber
actor User
participant CLI as ezpz_launch
participant Scheduler as PBS_or_Slurm
participant MPI as mpirun_mpiexec
participant App as User_application
User->>CLI: ezpz launch <launch_flags> -- <cmd> <cmd_flags>
CLI->>Scheduler: detect_scheduler()
alt scheduler_detected
Scheduler-->>CLI: scheduler_type, job_metadata
CLI->>Scheduler: build_scheduler_command(cmd_to_launch)
Scheduler-->>CLI: launch_cmd (mpiexec_or_srun)
CLI->>MPI: run_command(launch_cmd)
MPI->>App: start_ranks_and_execute
App-->>MPI: return_codes
MPI-->>CLI: aggregate_status
else no_scheduler_detected
Scheduler-->>CLI: unknown
CLI->>MPI: mpirun -np 2 <cmd> <cmd_flags>
MPI->>App: start_local_ranks
App-->>MPI: return_codes
MPI-->>CLI: aggregate_status
end
CLI-->>User: exit_code
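The two branches of the diagram can be approximated in a few lines. This is a sketch under stated assumptions, not ezpz's implementation: the function names are hypothetical, scheduler detection is reduced to checking the `PBS_JOBID` and `SLURM_JOB_ID` environment variables, and the process count is passed in as a parameter instead of being derived from job metadata. The `mpirun -np 2` fallback mirrors the one shown in the diagram.

```python
import os
from typing import List, Mapping, Optional


def detect_scheduler(env: Mapping[str, str]) -> Optional[str]:
    """Guess the active scheduler from environment variables.

    Assumption: PBS exports PBS_JOBID and Slurm exports SLURM_JOB_ID inside
    a job; ezpz's real detection may rely on other signals.
    """
    if "PBS_JOBID" in env:
        return "pbs"
    if "SLURM_JOB_ID" in env:
        return "slurm"
    return None


def build_launch_cmd(cmd: List[str], env: Mapping[str, str], nproc: int = 2) -> List[str]:
    """Pick a launcher for `cmd` based on the detected scheduler."""
    scheduler = detect_scheduler(env)
    if scheduler == "pbs":
        # Scheduler-aware path (PBS): launch through mpiexec.
        return ["mpiexec", f"--np={nproc}", *cmd]
    if scheduler == "slurm":
        # Scheduler-aware path (Slurm): launch through srun.
        return ["srun", "-n", str(nproc), *cmd]
    # Local fallback: no scheduler metadata, shell out to mpirun.
    return ["mpirun", "-np", str(nproc), *cmd]


print(build_launch_cmd(["python3", "demo.py"], os.environ))
```

Passing the environment in as a mapping, instead of reading `os.environ` inside the functions, keeps both branches easy to exercise on a laptop.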
Ready-to-go Examples
-
Examples: Scalable and ready-to-go!
Running Examples
Any of the examples below can be launched with (sensible defaults if not specified):
🤗 HF Integration
-
ezpz.examples.{fsdp_tp,diffusion,hf_trainer} all support arbitrary 🤗 Hugging Face datasets, e.g.:

    dataset="stanfordnlp/imdb"  # or any other HF dataset
    ezpz launch python3 -m ezpz.examples.fsdp_tp --dataset "${dataset}"
    ezpz launch python3 -m ezpz.examples.diffusion --dataset "${dataset}"
    ezpz launch python3 -m ezpz.examples.hf_trainer \
        --model_name_or_path meta-llama/Llama-3.2-1B \
        --dataset_name="${dataset}" \
        --streaming \
        --bf16=true
-
ezpz.examples.hf_trainer supports arbitrary combinations of (compatible) transformers.from_pretrained models and HF Datasets (with support for streaming!)
Simple Example
Output
Macbook Pro
    #[01/08/26 @ 14:56:50][~/v/s/ezpz][dev][$!?] [4s]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:56:54,307030][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    Using [2 / 2] available "mps" devices !!
    0 1
    [2025-12-23-162222] Execution time: 4s sec

Aurora (2 Nodes)
    #[aurora_frameworks-2025.2.0](torchtitan-aurora_frameworks-2025.2.0)[1m9s]
    #[01/08/26,14:56:42][x4418c6s1b0n0][/f/d/f/p/p/torchtitan][main][?]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:58:01,994729][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 14:58:01,997067][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 14:58:01,997545][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 14:58:02,465850][I][ezpz/launch:396:launch] ----[ezpz.launch][started][2026-01-08-145802]----
    [2026-01-08 14:58:04,765720][I][ezpz/launch:416:launch] Job ID: 8247203
    [2026-01-08 14:58:04,766527][I][ezpz/launch:417:launch] nodelist: ['x4418c6s1b0n0', 'x4717c0s6b0n0']
    [2026-01-08 14:58:04,766930][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 14:58:04,767616][I][ezpz/pbs:264:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 14:58:04,768399][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 14:58:04,768802][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 14:58:04,769517][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:58:04,770278][I][ezpz/launch:433:launch] Took: 3.01 seconds to build command.
    [2026-01-08 14:58:04,770660][I][ezpz/launch:436:launch] Executing: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c import ezpz; print(ezpz.setup_torch())
    [2026-01-08 14:58:04,772125][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 14:58:04,772651][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-145804...
    [2026-01-08 14:58:04,773070][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 14:58:04,773429][I][ezpz/launch:139:run_command] Running command: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    cpubind:list x4717c0s6b0n0 pid 118589 rank 12 0: mask 0x1c
    cpubind:list x4717c0s6b0n0 pid 118590 rank 13 1: mask 0x1c00
    cpubind:list x4717c0s6b0n0 pid 118591 rank 14 2: mask 0x1c0000
    cpubind:list x4717c0s6b0n0 pid 118592 rank 15 3: mask 0x1c000000
    cpubind:list x4717c0s6b0n0 pid 118593 rank 16 4: mask 0x1c00000000
    cpubind:list x4717c0s6b0n0 pid 118594 rank 17 5: mask 0x1c0000000000
    cpubind:list x4717c0s6b0n0 pid 118595 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4717c0s6b0n0 pid 118596 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4717c0s6b0n0 pid 118597 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4717c0s6b0n0 pid 118598 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4717c0s6b0n0 pid 118599 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4717c0s6b0n0 pid 118600 rank 23 11: mask 0x1c00000000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66450 rank 0 0: mask 0x1c
    cpubind:list x4418c6s1b0n0 pid 66451 rank 1 1: mask 0x1c00
    cpubind:list x4418c6s1b0n0 pid 66452 rank 2 2: mask 0x1c0000
    cpubind:list x4418c6s1b0n0 pid 66453 rank 3 3: mask 0x1c000000
    cpubind:list x4418c6s1b0n0 pid 66454 rank 4 4: mask 0x1c00000000
    cpubind:list x4418c6s1b0n0 pid 66455 rank 5 5: mask 0x1c0000000000
    cpubind:list x4418c6s1b0n0 pid 66456 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4418c6s1b0n0 pid 66457 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4418c6s1b0n0 pid 66458 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4418c6s1b0n0 pid 66459 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66460 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66461 rank 11 11: mask 0x1c00000000000000000000000
    Using [24 / 24] available "xpu" devices !!
    8 10 0 4 3 5 7 11 6 1 9 2 14 15 12 13 16 17 19 22 20 23 18 21
    [2026-01-08 14:58:14,252433][I][ezpz/launch:447:launch] ----[ezpz.launch][stop][2026-01-08-145814]----
    [2026-01-08 14:58:14,253726][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 14:58:14,254184][I][ezpz/launch:449:launch] Executing finished in 9.48 seconds.
    [2026-01-08 14:58:14,254555][I][ezpz/launch:450:launch] Took 9.48 seconds to run. Exiting.
    took: 18s

demo.py

    import ezpz

    # automatic device + backend setup for distributed PyTorch
    _ = ezpz.setup_torch()  # CUDA/NCCL, XPU/XCCL, {MPS, CPU}/GLOO, ...

    device = ezpz.get_torch_device()  # {cuda, xpu, mps, cpu, ...}
    rank = ezpz.get_rank()
    world_size = ezpz.get_world_size()
    # ...etc

    # only rank 0 prints, so the message appears once per job
    if rank == 0:
        print(f"Hello from rank {rank} / {world_size} on {device}!")

We can launch this script with:

    ezpz launch python3 demo.py
Output(s)
MacBook Pro
Aurora (2 nodes)
    # from 2 nodes of Aurora:
    #[aurora_frameworks-2025.2.0](foremans-aurora_frameworks-2025.2.0)[C v7.5.0-gcc][43s]
    #[01/08/26,07:26:10][x4604c5s2b0n0][~]
    ; ezpz launch python3 demo.py
    [2026-01-08 07:26:19,723138][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 07:26:19,725453][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 07:26:19,725932][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 07:26:20,290222][I][ezpz/launch:396:launch] ----[ezpz.launch][started][2026-01-08-072620]----
    [2026-01-08 07:26:21,566797][I][ezpz/launch:416:launch] Job ID: 8246832
    [2026-01-08 07:26:21,567684][I][ezpz/launch:417:launch] nodelist: ['x4604c5s2b0n0', 'x4604c5s3b0n0']
    [2026-01-08 07:26:21,568082][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 07:26:21,568770][I][ezpz/pbs:264:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 07:26:21,569557][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 07:26:21,569959][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 07:26:21,570821][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 demo.py
    [2026-01-08 07:26:21,571548][I][ezpz/launch:433:launch] Took: 2.11 seconds to build command.
    [2026-01-08 07:26:21,571918][I][ezpz/launch:436:launch] Executing: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
    [2026-01-08 07:26:21,573262][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 07:26:21,573781][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-072621...
    [2026-01-08 07:26:21,574195][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 07:26:21,574532][I][ezpz/launch:139:run_command] Running command: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
    cpubind:list x4604c5s3b0n0 pid 131587 rank 12 0: mask 0x1c
    cpubind:list x4604c5s3b0n0 pid 131588 rank 13 1: mask 0x1c00
    cpubind:list x4604c5s3b0n0 pid 131589 rank 14 2: mask 0x1c0000
    cpubind:list x4604c5s3b0n0 pid 131590 rank 15 3: mask 0x1c000000
    cpubind:list x4604c5s3b0n0 pid 131591 rank 16 4: mask 0x1c00000000
    cpubind:list x4604c5s3b0n0 pid 131592 rank 17 5: mask 0x1c0000000000
    cpubind:list x4604c5s3b0n0 pid 131593 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4604c5s3b0n0 pid 131594 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4604c5s3b0n0 pid 131595 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4604c5s3b0n0 pid 131596 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4604c5s3b0n0 pid 131597 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4604c5s3b0n0 pid 131598 rank 23 11: mask 0x1c00000000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121225 rank 0 0: mask 0x1c
    cpubind:list x4604c5s2b0n0 pid 121226 rank 1 1: mask 0x1c00
    cpubind:list x4604c5s2b0n0 pid 121227 rank 2 2: mask 0x1c0000
    cpubind:list x4604c5s2b0n0 pid 121228 rank 3 3: mask 0x1c000000
    cpubind:list x4604c5s2b0n0 pid 121229 rank 4 4: mask 0x1c00000000
    cpubind:list x4604c5s2b0n0 pid 121230 rank 5 5: mask 0x1c0000000000
    cpubind:list x4604c5s2b0n0 pid 121231 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4604c5s2b0n0 pid 121232 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4604c5s2b0n0 pid 121233 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4604c5s2b0n0 pid 121234 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121235 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121236 rank 11 11: mask 0x1c00000000000000000000000
    Using [24 / 24] available "xpu" devices !!
    Hello from rank 0 / 24 on xpu!
    [2026-01-08 07:26:33,060432][I][ezpz/launch:447:launch] ----[ezpz.launch][stop][2026-01-08-072633]----
    [2026-01-08 07:26:33,061512][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 07:26:33,062045][I][ezpz/launch:449:launch] Executing finished in 11.49 seconds.
    [2026-01-08 07:26:33,062531][I][ezpz/launch:450:launch] Took 11.49 seconds to run. Exiting.
    took: 22s
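The Aurora log above reports `Using [24/24] GPUs [2 hosts] x [12 GPU/host]` and then passes `--np=24 --ppn=12` to mpiexec. A minimal sketch of that arithmetic follows. The helper names are hypothetical, the hostfile is assumed to contain one hostname per line (possibly repeated), and one process per GPU is assumed; ezpz's own logic may differ.

```python
from typing import List, Tuple


def parse_hostfile(text: str) -> List[str]:
    """Return unique hostnames in first-seen order, skipping blank lines."""
    hosts: List[str] = []
    for line in text.splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def job_geometry(hostfile_text: str, gpus_per_host: int) -> Tuple[int, int, int]:
    """Return (nhosts, ppn, nproc), assuming one process per GPU."""
    nhosts = len(parse_hostfile(hostfile_text))
    return nhosts, gpus_per_host, nhosts * gpus_per_host


# Hostnames taken from the nodelist in the log; 12 GPUs per host as reported there.
hostfile = "x4604c5s2b0n0\nx4604c5s3b0n0\n"
nhosts, ppn, nproc = job_geometry(hostfile, gpus_per_host=12)
print(f"mpiexec --np={nproc} --ppn={ppn}")
```

For these two hosts the sketch prints `mpiexec --np=24 --ppn=12`, matching the process counts in the logged launch command.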
-
[1] By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm).
If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:
- The number of available nodes
- How many GPUs are present on each of these nodes
- How many GPUs we have total
It will then use this information to automatically construct the appropriate {mpiexec, srun} command to launch, and finally, execute the launch cmd.
-
[2] The ezpz.History class automatically computes distributed statistics (min, max, mean, std. dev) across ranks for all recorded metrics.
NOTE: This is automatically disabled when ezpz.get_world_size() >= 384 (e.g. >= {32, 96} {Aurora, Polaris} nodes) due to the additional overhead introduced (but can be manually enabled, if desired).
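The node counts quoted in the note follow directly from the 384-rank threshold. The sketch below works through that arithmetic. It assumes one rank per GPU, 12 GPUs per Aurora node (as reported in the logs), and 4 GPUs per Polaris node (an assumption consistent with the 96-node figure). The names `distributed_stats_enabled` and `force` are hypothetical and do not correspond to a documented ezpz option.

```python
import math

STATS_WORLD_SIZE_LIMIT = 384  # threshold quoted in the note above


def distributed_stats_enabled(world_size: int, force: bool = False) -> bool:
    """Hypothetical predicate mirroring the note.

    Cross-rank statistics are skipped once world_size reaches the limit,
    unless they are explicitly re-enabled (`force` is an illustrative name).
    """
    return force or world_size < STATS_WORLD_SIZE_LIMIT


def nodes_at_limit(gpus_per_node: int) -> int:
    """Smallest node count whose world size reaches the limit (one rank per GPU)."""
    return math.ceil(STATS_WORLD_SIZE_LIMIT / gpus_per_node)


for system, gpus in {"Aurora": 12, "Polaris": 4}.items():
    print(f"{system}: statistics disabled from {nodes_at_limit(gpus)} nodes upward")
```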