ezpz launch
Single entry point for launching distributed applications.
This will:
- Automatically detect your PBS/Slurm job, and
- Launch <cmd> across all available accelerators.
This is done by detecting if ezpz launch is being executed from inside a
PBS/Slurm job [2].
If so, it determines the specifics of the active job (number of nodes, and
number of GPUs per node), and uses this information to build and execute the
appropriate launch command (e.g. mpiexec, srun).
When not running inside a PBS/Slurm job, ezpz launch falls back to mpirun
with sensible defaults.
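As a rough illustration of the detection step, the sketch below checks the environment variables that PBS and Slurm export inside a job (`PBS_JOBID`, `SLURM_JOB_ID`) and derives a total process count from a hostfile. The helper names `detect_scheduler`, `unique_hosts`, and `world_size` are hypothetical and are not ezpz's API; the real implementation may rely on different signals.

```python
import os
import tempfile
from pathlib import Path


def detect_scheduler(env=None):
    """Return "pbs", "slurm", or "unknown" based on job environment variables."""
    env = os.environ if env is None else env
    if env.get("PBS_JOBID"):
        return "pbs"
    if env.get("SLURM_JOB_ID"):
        return "slurm"
    return "unknown"


def unique_hosts(hostfile):
    """Hostnames listed in the hostfile, de-duplicated, in first-seen order."""
    names = Path(hostfile).read_text().split()
    return list(dict.fromkeys(names))


def world_size(hostfile, gpus_per_node):
    """Total processes = number of nodes x accelerators per node."""
    return len(unique_hosts(hostfile)) * gpus_per_node


if __name__ == "__main__":
    # Assumption: a two-node hostfile written to a temp file for the demo.
    with tempfile.NamedTemporaryFile("w", suffix=".hosts", delete=False) as f:
        f.write("x4418c6s1b0n0\nx4717c0s6b0n0\n")
    print(detect_scheduler({"PBS_JOBID": "8247203"}))
    print(world_size(f.name, gpus_per_node=12))
    os.unlink(f.name)
```

With two hosts and 12 accelerators per host, the product is 24, which matches the `--np=24 --ppn=12` values in the Aurora logs further down this page.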
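Arguments can be passed through to the mpiexec/srun launcher by separating
them from the <cmd> with -- [1], e.g.:
ezpz launch <launcher flags> -- <command> <args>
For example, to run with 8 processes total, 4 processes per node, on 2 hosts, we can pass -n 8 -ppn 4 -nh 2 ahead of the separator:
ezpz launch -n 8 -ppn 4 -nh 2 -- python3 -m ezpz.examples.fsdp_tp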
Assuming your current job can satisfy this (i.e. at least 4 accelerators per
node, and at least 2 nodes), this would launch python3 -m
ezpz.examples.fsdp_tp across 8 processes, 4 per node, on the first two hosts
allocated to your job.
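The separator rule and the "can the job satisfy this" check above can be sketched as follows. This is a minimal illustration, not ezpz's parser: `split_launch_args` and `fits_allocation` are hypothetical names, and splitting at only the first `--` is an assumption.

```python
def split_launch_args(argv):
    """Split argv at the first "--" into (launcher_flags, command).

    When no "--" is present, everything is treated as the command.
    """
    argv = list(argv)
    if "--" not in argv:
        return [], argv
    idx = argv.index("--")
    return argv[:idx], argv[idx + 1:]


def fits_allocation(nproc, ppn, nodes_available, accelerators_per_node):
    """True if nproc processes at ppn per node fit inside the allocation."""
    nodes_needed = -(-nproc // ppn)  # ceiling division
    return ppn <= accelerators_per_node and nodes_needed <= nodes_available


if __name__ == "__main__":
    flags, cmd = split_launch_args(
        "-n 8 -ppn 4 -- python3 -m ezpz.examples.fsdp_tp".split()
    )
    print(flags)  # ['-n', '8', '-ppn', '4']
    print(cmd)    # ['python3', '-m', 'ezpz.examples.fsdp_tp']
    print(fits_allocation(8, 4, nodes_available=2, accelerators_per_node=4))  # True
```

Here 8 processes at 4 per node need exactly 2 nodes, so the request fits a 2-node allocation with at least 4 accelerators per node and fails on anything smaller.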
ezpz launch --help
usage: ezpz launch [-h] [--print-source] [--filter FILTER [FILTER ...]] [-n NPROC] [-ppn NPROC_PER_NODE] [-nh NHOSTS] [--hostfile HOSTFILE] [--cpu-bind CPU_BIND] ...

Launch a command on the current PBS/SLURM job.

Additional `<launcher flags>` can be passed through directly
to the launcher by including '--' as a separator before
the command.

Examples:

    ezpz launch <launcher flags> -- <command> <args>

    ezpz launch -n 8 -ppn 4 --verbose --tag-output -- python3 -m ezpz.examples.fsdp_tp

    ezpz launch --nproc 8 -x EZPZ_LOG_LEVEL=DEBUG -- python3 my_script.py --my-arg val

positional arguments:
  command               Command (and arguments) to execute. Use '--' to separate options when needed.

options:
  -h, --help            show this help message and exit
  --print-source        Print the location of the launch CLI source and exit.
  --filter FILTER [FILTER ...]
                        Deprecated: output filtering has been removed. This flag is ignored.
  -n NPROC, -np NPROC, --n NPROC, --np NPROC, --nproc NPROC, --world_size NPROC, --nprocs NPROC
                        Number of processes.
  -ppn NPROC_PER_NODE, --ppn NPROC_PER_NODE, --nproc_per_node NPROC_PER_NODE
                        Processes per node.
  -nh NHOSTS, --nh NHOSTS, --nhost NHOSTS, --nnode NHOSTS, --nnodes NHOSTS, --nhosts NHOSTS, --nhosts NHOSTS
                        Number of nodes to use.
  --hostfile HOSTFILE   Hostfile to use for launching.
  --cpu-bind CPU_BIND   CPU binding value to pass to the launcher. Takes precedence over CPU_BIND when both are specified.
  --timeout IDLE_TIMEOUT_S
                        Idle-output watchdog timeout in seconds. Off by default.
  --retries RETRIES     Re-execute on non-zero exit, up to N times. Default: 0.
Idle-output watchdog (--timeout)
--timeout SECONDS arms a watchdog that monitors the launched
process's output. If no output appears (on stdout or stderr;
they are merged at the watchdog) for SECONDS consecutive seconds,
the watchdog sends SIGTERM, waits up to 10 seconds for a clean
shutdown, then sends SIGKILL. The exit code returned by
ezpz launch is 124 (matching GNU timeout(1) convention) so
shell wrappers can distinguish "killed for going silent" from
"command failed". Passing --timeout 0 disables the watchdog (same
as omitting the flag).
# Abort if the training script goes silent for 10 minutes.
ezpz launch --timeout 600 -- python3 -m my_app.train
Idle, not walltime. The process can run indefinitely as long as
it keeps emitting at least one line per SECONDS on either stream.
This is the right semantics for catching collective hangs (e.g.
xccl on XPU silently deadlocking) where the process is alive but
every rank is blocked in the same collective and nothing reaches
either stream. For a hard walltime limit, use the scheduler's
existing mechanism (#PBS -l walltime=...).
Python buffering. The watchdog sets PYTHONUNBUFFERED=1 in the
child environment so Python's default block-buffering (which kicks
in when stdout isn't a TTY) doesn't fool the watchdog into killing a
healthy job that's accumulating output in a 4-8 KB buffer. The
variable is benign for non-Python children: they ignore it.
Scope caveat. The watchdog only watches the process ezpz launch
spawns directly. If you qsub a job script that internally invokes
python train.py, the watchdog needs to live inside that script
(or call ezpz launch from inside it), not the outer qsub.
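The watchdog behaviour described in this section can be sketched as below. This is a minimal, POSIX-only illustration (it uses `selectors` on a pipe) written from the description above, not ezpz's implementation; `run_with_idle_watchdog` and `IDLE_EXIT_CODE` are hypothetical names. It merges stderr into stdout, sets `PYTHONUNBUFFERED=1` for the child, and on idle timeout sends SIGTERM, waits up to 10 seconds, then sends SIGKILL and returns 124.

```python
import os
import selectors
import subprocess
import sys
import time

IDLE_EXIT_CODE = 124  # same convention as GNU timeout(1)


def run_with_idle_watchdog(cmd, idle_timeout_s, grace_s=10.0):
    """Run cmd; terminate it if it prints nothing for idle_timeout_s seconds.

    idle_timeout_s <= 0 disables the watchdog.
    """
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=env
    ) as proc, selectors.DefaultSelector() as sel:
        fd = proc.stdout.fileno()
        sel.register(fd, selectors.EVENT_READ)
        last_output = time.monotonic()
        while True:
            wait_s = None  # block indefinitely when the watchdog is off
            if idle_timeout_s > 0:
                wait_s = idle_timeout_s - (time.monotonic() - last_output)
                if wait_s <= 0:
                    proc.terminate()  # SIGTERM first
                    try:
                        proc.wait(timeout=grace_s)
                    except subprocess.TimeoutExpired:
                        proc.kill()  # SIGKILL after the grace period
                        proc.wait()
                    return IDLE_EXIT_CODE
            if sel.select(timeout=wait_s):
                chunk = os.read(fd, 4096)
                if not chunk:  # EOF: the child closed its output
                    break
                sys.stdout.write(chunk.decode(errors="replace"))
                sys.stdout.flush()
                last_output = time.monotonic()  # any output resets the timer
        return proc.wait()


if __name__ == "__main__":
    silent = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_idle_watchdog(silent, idle_timeout_s=1.0))
```

The timer resets on every chunk of output, so a long-running job that keeps logging is never killed; only silence counts against it.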
Retry on non-zero exit (--retries)
--retries N re-executes the command up to N additional times
whenever the previous attempt returns a non-zero exit code, including
the watchdog's 124. Exponential backoff is applied between attempts
(5s, 10s, 20s, 40s, then capped at 60s).
# Up to 3 retries with watchdog protection. Useful for flaky fabrics
# or transient EC2 spot interruptions.
ezpz launch --timeout 600 --retries 3 -- python3 -m my_app.train
A clean exit on any attempt short-circuits the loop and returns 0.
If every attempt fails, the final attempt's exit code is returned.
Combine with --timeout to convert silent hangs into retryable
failures.
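The retry loop and its backoff schedule can be sketched as follows. This is an illustration built from the description above rather than ezpz's code: `backoff_delay` and `run_with_retries` are hypothetical names, and `run_once` stands in for "launch the command and return its exit code".

```python
import time


def backoff_delay(attempt_index):
    """Delay before retry number attempt_index (0-based): 5, 10, 20, 40, 60, 60, ..."""
    return min(5 * 2 ** attempt_index, 60)


def run_with_retries(run_once, retries, sleep=time.sleep):
    """Call run_once() until it returns 0 or `retries` extra attempts are used.

    Returns 0 on the first clean exit, otherwise the final attempt's exit code.
    """
    code = run_once()
    for attempt_index in range(retries):
        if code == 0:
            return 0
        sleep(backoff_delay(attempt_index))
        code = run_once()
    return code


if __name__ == "__main__":
    # Simulated exit codes: watchdog kill (124), generic failure (1), success (0).
    codes = iter([124, 1, 0])
    delays = []
    result = run_with_retries(lambda: next(codes), retries=3, sleep=delays.append)
    print(result, delays)  # 0 [5, 10]
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting; in real use the default `time.sleep` applies.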
Python interpreter resolution
When ezpz launch needs to invoke python3 (e.g.
ezpz launch python3 -m my.module), it picks the interpreter in this
order:
1. $VIRTUAL_ENV/bin/python3, if $VIRTUAL_ENV is set and exists
2. shutil.which("python3"), the first python3 on $PATH
3. sys.executable, as a last resort
Why not just sys.executable? It's frozen at interpreter startup. If
you ran ezpz yeet to copy your env to /tmp/
and then source /tmp/.venv/bin/activate, sys.executable would still
point to the original Lustre path because the ezpz CLI script's
shebang is baked in at install time. Reading $VIRTUAL_ENV (set by
activate) lets the launcher follow the user's actual current venv.
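The resolution order above can be sketched in a few lines. This is an illustration only: `resolve_python` is a hypothetical name, and treating "exists" as "the `bin/python3` file exists" is an assumption about what the launcher checks.

```python
import os
import shutil
import sys


def resolve_python(env=None):
    """Pick a python3 interpreter: active venv, then $PATH, then sys.executable."""
    env = os.environ if env is None else env

    # 1. The currently activated virtual environment, if its python3 exists.
    venv = env.get("VIRTUAL_ENV")
    if venv:
        candidate = os.path.join(venv, "bin", "python3")
        if os.path.exists(candidate):
            return candidate

    # 2. The first python3 found on $PATH.
    found = shutil.which("python3", path=env.get("PATH"))
    if found:
        return found

    # 3. Last resort: the interpreter running this process.
    return sys.executable


if __name__ == "__main__":
    print(resolve_python())
```

Because `$VIRTUAL_ENV` is read at call time, activating a different venv after installation changes the result, which is the behaviour the paragraph above motivates.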
Examples
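Use it to launch:
- Arbitrary command(s).
- An arbitrary Python string, e.g. ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
- One of the Distributed Training examples, e.g. ezpz launch python3 -m ezpz.examples.fsdp_tp
- Your own distributed training script: passing -n 16 -ppn 8 launches your_app.train across 16 processes, 8 per node.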
Sequence Diagram
Two primary control paths drive ezpz launch: a scheduler-aware path used when
running inside PBS/SLURM allocations, and a local fallback that shells out to
mpirun when no scheduler metadata is available.
sequenceDiagram
autonumber
actor User
participant CLI as ezpz_launch
participant Scheduler as PBS_or_Slurm
participant MPI as mpirun_mpiexec
participant App as User_application
User->>CLI: ezpz launch <launch_flags> -- <cmd> <cmd_flags>
CLI->>Scheduler: detect_scheduler()
alt scheduler_detected
Scheduler-->>CLI: scheduler_type, job_metadata
CLI->>Scheduler: build_scheduler_command(cmd_to_launch)
Scheduler-->>CLI: launch_cmd (mpiexec_or_srun)
CLI->>MPI: run_command(launch_cmd)
MPI->>App: start_ranks_and_execute
App-->>MPI: return_codes
MPI-->>CLI: aggregate_status
else no_scheduler_detected
Scheduler-->>CLI: unknown
CLI->>MPI: mpirun -np 2 <cmd> <cmd_flags>
MPI->>App: start_local_ranks
App-->>MPI: return_codes
MPI-->>CLI: aggregate_status
end
CLI-->>User: exit_code
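The two control paths in the diagram can be sketched as a single command builder. The `mpiexec` flags mirror the ones that appear in the Aurora logs on this page and the fallback mirrors `mpirun -np 2`; the `srun` flags (`--ntasks`, `--ntasks-per-node`) are standard Slurm options chosen here as an assumption, since this page does not show the Slurm command. `build_launch_cmd` is a hypothetical name, not ezpz's API.

```python
import shlex


def build_launch_cmd(scheduler, cmd, nproc=None, ppn=None, hostfile=None):
    """Return the full argv for launching cmd under the given scheduler."""
    if scheduler == "pbs":
        launcher = ["mpiexec", "--envall", f"--np={nproc}", f"--ppn={ppn}"]
        if hostfile:
            launcher.append(f"--hostfile={hostfile}")
    elif scheduler == "slurm":
        # Assumption: plain srun task-count flags.
        launcher = ["srun", f"--ntasks={nproc}", f"--ntasks-per-node={ppn}"]
    else:
        # No scheduler detected: local fallback with two ranks.
        launcher = ["mpirun", "-np", "2"]
    return launcher + list(cmd)


if __name__ == "__main__":
    cmd = ["python3", "-c", "import ezpz; print(ezpz.setup_torch())"]
    print(shlex.join(build_launch_cmd("unknown", cmd)))
    print(shlex.join(build_launch_cmd("pbs", cmd, nproc=24, ppn=12)))
```

Returning an argv list rather than a shell string avoids quoting problems when the user command itself contains quotes or semicolons, as the `python3 -c '...'` example does.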
Distributed Training Examples
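- Examples: Scalable and ready-to-go!
  Any of the examples can be launched with ezpz launch python3 -m ezpz.examples.<name>, for instance:
  ezpz launch python3 -m ezpz.examples.fsdp_tp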
HF Integration
- ezpz.examples.{fsdp_tp,diffusion,hf,hf_trainer} all support arbitrary Hugging Face datasets, e.g.:
dataset="stanfordnlp/imdb"  # or any other HF dataset
ezpz launch python3 -m ezpz.examples.fsdp_tp --dataset "${dataset}"
ezpz launch python3 -m ezpz.examples.diffusion --dataset "${dataset}"
ezpz launch python3 -m ezpz.examples.hf \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --dataset_name="${dataset}" \
    --streaming \
    --bf16=true
ezpz launch python3 -m ezpz.examples.hf_trainer \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --dataset_name="${dataset}" \
    --streaming \
    --bf16=true
- ezpz.examples.hf and ezpz.examples.hf_trainer both support arbitrary combinations of (compatible) transformers.from_pretrained models, and HF Datasets (with support for streaming!). hf uses an explicit training loop with Accelerate, while hf_trainer wraps the HF Trainer API.
ezpz launch python3 -m ezpz.examples.hf \
    --streaming \
    --dataset_name=eliplutchok/fineweb-small-sample \
    --tokenizer_name meta-llama/Llama-3.2-1B \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --bf16=true
ezpz launch python3 -m ezpz.examples.hf_trainer \
    --streaming \
    --dataset_name=eliplutchok/fineweb-small-sample \
    --tokenizer_name meta-llama/Llama-3.2-1B \
    --model_name_or_path meta-llama/Llama-3.2-1B \
    --bf16=true
Simple Example
Output
Macbook Pro
#[01/08/26 @ 14:56:50][~/v/s/ezpz][dev][$!?] [4s]
; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
[2026-01-08 14:56:54,307030][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 -c 'import ezpz; print(ezpz.setup_torch())'
Using [2 / 2] available "mps" devices !!
0
1
[2025-12-23-162222] Execution time: 4s sec
Aurora (2 Nodes)
#[aurora_frameworks-2025.2.0](torchtitan-aurora_frameworks-2025.2.0)[1m9s]
#[01/08/26,14:56:42][x4418c6s1b0n0][/f/d/f/p/p/torchtitan][main][?]
; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
[2026-01-08 14:58:01,994729][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 14:58:01,997067][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 14:58:01,997545][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
[2026-01-08 14:58:02,465850][I][ezpz/launch:396:launch] ----[ezpz.launch][started][2026-01-08-145802]----
[2026-01-08 14:58:04,765720][I][ezpz/launch:416:launch] Job ID: 8247203
[2026-01-08 14:58:04,766527][I][ezpz/launch:417:launch] nodelist: ['x4418c6s1b0n0', 'x4717c0s6b0n0']
[2026-01-08 14:58:04,766930][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2026-01-08 14:58:04,767616][I][ezpz/pbs:264:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
[2026-01-08 14:58:04,768399][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
[2026-01-08 14:58:04,768802][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
[2026-01-08 14:58:04,769517][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 -c 'import ezpz; print(ezpz.setup_torch())'
[2026-01-08 14:58:04,770278][I][ezpz/launch:433:launch] Took: 3.01 seconds to build command.
[2026-01-08 14:58:04,770660][I][ezpz/launch:436:launch] Executing: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c import ezpz; print(ezpz.setup_torch())
[2026-01-08 14:58:04,772125][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
[2026-01-08 14:58:04,772651][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-145804...
[2026-01-08 14:58:04,773070][I][ezpz/launch:138:run_command] Caught 24 filters
[2026-01-08 14:58:04,773429][I][ezpz/launch:139:run_command] Running command: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c 'import ezpz; print(ezpz.setup_torch())'
CPU bind output (24 lines)
Raw rank output (24 lines)
[2026-01-08 14:58:14,252433][I][ezpz/launch:447:launch] ----[ezpz.launch][stop][2026-01-08-145814]----
[2026-01-08 14:58:14,253726][I][ezpz/launch:448:launch] Execution finished with 0.
[2026-01-08 14:58:14,254184][I][ezpz/launch:449:launch] Executing finished in 9.48 seconds.
[2026-01-08 14:58:14,254555][I][ezpz/launch:450:launch] Took 9.48 seconds to run. Exiting.
took: 18s
demo.py
import ezpz

# automatic device + backend setup for distributed PyTorch
_ = ezpz.setup_torch()  # CUDA/NCCL, XPU/XCCL, {MPS, CPU}/GLOO, ...

device = ezpz.get_torch_device()  # {cuda, xpu, mps, cpu, ...}
rank = ezpz.get_rank()
world_size = ezpz.get_world_size()
# ...etc

if rank == 0:
    print(f"Hello from rank {rank} / {world_size} on {device}!")
We can launch this script with:
ezpz launch python3 demo.py
Output(s)
MacBook Pro
Aurora (2 nodes)
# from 2 nodes of Aurora:
#[aurora_frameworks-2025.2.0](foremans-aurora_frameworks-2025.2.0)[C v7.5.0-gcc][43s]
#[01/08/26,07:26:10][x4604c5s2b0n0][~]
; ezpz launch python3 demo.py
[2026-01-08 07:26:19,723138][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-01-08 07:26:19,725453][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-01-08 07:26:19,725932][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
[2026-01-08 07:26:20,290222][I][ezpz/launch:396:launch] ----[ezpz.launch][started][2026-01-08-072620]----
[2026-01-08 07:26:21,566797][I][ezpz/launch:416:launch] Job ID: 8246832
[2026-01-08 07:26:21,567684][I][ezpz/launch:417:launch] nodelist: ['x4604c5s2b0n0', 'x4604c5s3b0n0']
[2026-01-08 07:26:21,568082][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2026-01-08 07:26:21,568770][I][ezpz/pbs:264:get_pbs_launch_cmd] Using [24/24] GPUs [2 hosts] x [12 GPU/host]
[2026-01-08 07:26:21,569557][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
[2026-01-08 07:26:21,569959][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
[2026-01-08 07:26:21,570821][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 demo.py
[2026-01-08 07:26:21,571548][I][ezpz/launch:433:launch] Took: 2.11 seconds to build command.
[2026-01-08 07:26:21,571918][I][ezpz/launch:436:launch] Executing: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
[2026-01-08 07:26:21,573262][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
[2026-01-08 07:26:21,573781][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-072621...
[2026-01-08 07:26:21,574195][I][ezpz/launch:138:run_command] Caught 24 filters
[2026-01-08 07:26:21,574532][I][ezpz/launch:139:run_command] Running command: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
CPU bind output (24 lines)
Using [24 / 24] available "xpu" devices !!
Hello from rank 0 / 24 on xpu!
[2026-01-08 07:26:33,060432][I][ezpz/launch:447:launch] ----[ezpz.launch][stop][2026-01-08-072633]----
[2026-01-08 07:26:33,061512][I][ezpz/launch:448:launch] Execution finished with 0.
[2026-01-08 07:26:33,062045][I][ezpz/launch:449:launch] Executing finished in 11.49 seconds.
[2026-01-08 07:26:33,062531][I][ezpz/launch:450:launch] Took 11.49 seconds to run. Exiting.
took: 22s
[1] When no -- is present, all arguments are treated as part of the command to run.
[2] By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm).
If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:
- The number of available nodes
- How many GPUs are present on each of these nodes
- How many GPUs we have total
It will then use this information to automatically construct the appropriate {mpiexec, srun} command to launch, and finally, execute the launch cmd.
[3] The ezpz.History class automatically computes distributed statistics (min, max, mean, std. dev) across ranks for all recorded metrics.
NOTE: This is automatically disabled when ezpz.get_world_size() >= 384 (e.g. >= {32, 96} {Aurora, Polaris} nodes) due to the additional overhead introduced (but can be manually enabled, if desired).