🚀 ezpz launch

Single entry point for launching distributed applications.

ezpz launch <cmd>

This will:

  1. Automatically detect your PBS/Slurm job and
  2. Launch <cmd> across all available accelerators.

It does this by checking whether ezpz launch is being executed from inside a PBS/Slurm job [2].

If so, it determines the specifics of the active job (number of nodes, and number of GPUs per node), and uses this information to build and execute the appropriate launch command (e.g. mpiexec, srun).

When not running inside a PBS/Slurm job, ezpz launch falls back to mpirun with sensible defaults.
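As a rough illustration of the detection step, here is a minimal sketch. It assumes the check is based on the job ID environment variables that the schedulers export inside a job (PBS_JOBID for PBS, SLURM_JOB_ID for Slurm). The helper names detect_scheduler and launcher_for are hypothetical and are not ezpz's actual API.

```python
import os
from typing import Mapping, Optional


def detect_scheduler(env: Optional[Mapping[str, str]] = None) -> str:
    """Guess the active scheduler from environment variables.

    Hypothetical helper: PBS exports PBS_JOBID and Slurm exports
    SLURM_JOB_ID inside a job, so their presence is a reasonable signal.
    """
    env = os.environ if env is None else env
    if env.get("PBS_JOBID"):
        return "pbs"
    if env.get("SLURM_JOB_ID"):
        return "slurm"
    return "unknown"  # the caller would fall back to a local mpirun


def launcher_for(scheduler: str) -> str:
    """Map a scheduler name to a launcher (assumed pairing, for illustration)."""
    return {"pbs": "mpiexec", "slurm": "srun"}.get(scheduler, "mpirun")


if __name__ == "__main__":
    sched = detect_scheduler()
    print(f"scheduler={sched} launcher={launcher_for(sched)}")
```

Passing the environment in as a mapping keeps the check easy to exercise outside a real job.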

Arguments can be passed through to the mpiexec/srun launcher by separating them from <cmd> with -- [1], e.g.:

ezpz launch <launch-args> -- <cmd> <cmd-args>

For example, to run with 8 processes total, 4 processes per node, on 2 hosts:

ezpz launch -n 8 -ppn 4 -nh 2 -- python3 -m ezpz.examples.fsdp_tp

Assuming your current job can satisfy this (i.e. at least 4 accelerators per node, and at least 2 nodes), this would launch python3 -m ezpz.examples.fsdp_tp across 8 processes, 4 per node, on the first two hosts allocated to your job.
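The separator convention can be sketched in a few lines. This is an illustrative helper (split_launch_args is a hypothetical name), not the parser ezpz uses. It assumes the split happens at the first --, and that without a separator every argument belongs to the command.

```python
from typing import List, Tuple


def split_launch_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv into (launcher_args, command) at the first '--'.

    With no '--' present, everything is treated as the command to run.
    """
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return [], list(argv)


if __name__ == "__main__":
    args = ["-n", "8", "-ppn", "4", "-nh", "2", "--",
            "python3", "-m", "ezpz.examples.fsdp_tp"]
    launcher_args, command = split_launch_args(args)
    print("launcher args:", launcher_args)
    print("command:      ", command)
```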

  • ezpz launch --help
    ezpz launch --help
    usage: ezpz launch [-h] [--print-source] [--filter FILTER [FILTER ...]]
                       [-n NPROC] [-ppn NPROC_PER_NODE] [-nh NHOSTS]
                       [--hostfile HOSTFILE] [--cpu-bind CPU_BIND] ...
    
    Launch a command on the current PBS/SLURM job.
    
    Additional `<launcher flags>` can be passed through directly
    to the launcher by including '--' as a separator before
    the command.
    
    Examples:
    
        ezpz launch <launcher flags> -- <command> <args>
    
        ezpz launch -n 8 -ppn 4 --verbose --tag-output -- python3 -m ezpz.examples.fsdp_tp
    
        ezpz launch --nproc 8 -x EZPZ_LOG_LEVEL=DEBUG -- python3 my_script.py --my-arg val
    
    positional arguments:
    command               Command (and arguments) to execute. Use '--' to separate options when needed.
    
    options:
    -h, --help            show this help message and exit
    --print-source        Print the location of the launch CLI source and exit.
    --filter FILTER [FILTER ...]
                            Deprecated: output filtering has been removed. This flag is ignored.
    -n NPROC, -np NPROC, --n NPROC, --np NPROC, --nproc NPROC, --world_size NPROC, --nprocs NPROC
                            Number of processes.
    -ppn NPROC_PER_NODE, --ppn NPROC_PER_NODE, --nproc_per_node NPROC_PER_NODE
                            Processes per node.
    -nh NHOSTS, --nh NHOSTS, --nhost NHOSTS, --nnode NHOSTS, --nnodes NHOSTS, --nhosts NHOSTS, --nhosts NHOSTS
                            Number of nodes to use.
    --hostfile HOSTFILE   Hostfile to use for launching.
    --cpu-bind CPU_BIND   CPU binding value to pass to the launcher.
                            Takes precedence over CPU_BIND when both are specified.
    --timeout IDLE_TIMEOUT_S
                            Idle-output watchdog timeout in seconds. Off by default.
    --retries RETRIES     Re-execute on non-zero exit, up to N times. Default: 0.
    --auto-retry          Unbounded bad-node failover loop. Mutually
                            exclusive with --retries. Requires explicit --nproc.
    --spare-nodes SPARE_NODES
                            Spare-node pool for --auto-retry. "auto" (default)
                            derives from total_pbs_nodes - ceil($nproc / $ppn);
                            pass an int for an explicit cap.
    --max-failover-retries MAX_FAILOVER_RETRIES
                            Optional upper bound on --auto-retry attempts.
                            Default: unbounded (see termination matrix).
    

Idle-output watchdog (--timeout)

--timeout SECONDS arms a watchdog that monitors the launched process's output. If no output appears (on stdout or stderr; they are merged at the watchdog) for SECONDS consecutive seconds, the watchdog sends SIGTERM, waits up to 10 seconds for a clean shutdown, then sends SIGKILL. The exit code returned by ezpz launch is 124 (matching the GNU timeout(1) convention) so shell wrappers can distinguish "killed for going silent" from "command failed". Passing --timeout 0 disables the watchdog (same as omitting the flag).

# Abort if the training script goes silent for 10 minutes.
ezpz launch --timeout 600 -- python3 -m my_app.train

Idle, not walltime. The process can run indefinitely as long as it keeps emitting at least one line per SECONDS on either stream. This is the right semantics for catching collective hangs (e.g. xccl on XPU silently deadlocking) where the process is alive but every rank is blocked in the same collective and nothing reaches either stream. For a hard walltime limit, use the scheduler's existing mechanism (#PBS -l walltime=...).

Python buffering. The watchdog sets PYTHONUNBUFFERED=1 in the child environment so Python's default block-buffering (which kicks in when stdout isn't a TTY) doesn't fool the watchdog into killing a healthy job that's accumulating output in a 4-8 KB buffer. The variable is benign for non-Python children: they ignore it.

Scope caveat. The watchdog only watches the process ezpz launch spawns directly. If you qsub a job script that internally invokes python train.py, the watchdog needs to live inside that script (or call ezpz launch from inside it), not the outer qsub.
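To make the idle-versus-walltime distinction concrete, the following is a minimal sketch of an idle-output watchdog built on the standard library. It is not ezpz's implementation: run_with_idle_watchdog is a hypothetical name, and the 50 ms polling interval and reader thread are assumptions made for the sketch. It follows the behaviour described above: merged stdout/stderr, SIGTERM followed by SIGKILL after a grace period, exit code 124, and PYTHONUNBUFFERED=1 in the child environment.

```python
import os
import subprocess
import sys
import threading
import time
from typing import List

IDLE_EXIT_CODE = 124  # same convention as GNU timeout(1)


def run_with_idle_watchdog(cmd: List[str], idle_timeout_s: float,
                           grace_s: float = 10.0) -> int:
    """Run cmd; terminate it if it prints nothing for idle_timeout_s seconds.

    idle_timeout_s <= 0 disables the watchdog.
    """
    env = dict(os.environ, PYTHONUNBUFFERED="1")  # avoid block-buffered output
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,  # merge both streams
                            text=True, errors="replace", env=env)
    last_seen = [time.monotonic()]

    def pump() -> None:
        # Forward each line and record when it arrived.
        for line in proc.stdout:
            last_seen[0] = time.monotonic()
            print(line, end="", flush=True)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    fired = False
    while proc.poll() is None:
        idle = time.monotonic() - last_seen[0]
        if idle_timeout_s > 0 and idle > idle_timeout_s:
            fired = True
            proc.terminate()              # SIGTERM first
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()               # SIGKILL after the grace period
                proc.wait()
            break
        time.sleep(0.05)
    reader.join(timeout=1.0)
    return IDLE_EXIT_CODE if fired else proc.returncode


if __name__ == "__main__":
    rc = run_with_idle_watchdog(
        [sys.executable, "-c", "import time; time.sleep(5)"],
        idle_timeout_s=1.0)
    print("exit code:", rc)
```

Because the timer resets on every line, a job that keeps logging is never killed, however long it runs.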

Retry on non-zero exit (--retries)

--retries N re-executes the command up to N additional times whenever the previous attempt returns a non-zero exit code, including the watchdog's 124. Exponential backoff is applied between attempts (5s, 10s, 20s, 40s, then capped at 60s).

# Up to 3 retries with watchdog protection. Useful for flaky fabrics
# or transient EC2 spot interruptions.
ezpz launch --timeout 600 --retries 3 -- python3 -m my_app.train

A clean exit on any attempt short-circuits the loop and returns 0. If every attempt fails, the final attempt's exit code is returned. Combine with --timeout to convert silent hangs into retryable failures.
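The retry semantics above can be sketched as follows. The backoff schedule (5, 10, 20, 40, then 60 seconds) comes from the text. The function names backoff_s and run_with_retries are hypothetical, and the injectable sleep argument is only there to make the sketch easy to exercise.

```python
import time
from typing import Callable


def backoff_s(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (1-based): 5, 10, 20, 40, 60, 60, ..."""
    return min(cap, base * 2 ** (attempt - 1))


def run_with_retries(run_once: Callable[[], int], retries: int,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call run_once up to 1 + retries times; stop early on exit code 0.

    Returns 0 on the first clean attempt, otherwise the last exit code.
    """
    rc = run_once()
    for attempt in range(1, retries + 1):
        if rc == 0:
            break
        sleep(backoff_s(attempt))
        rc = run_once()
    return rc


if __name__ == "__main__":
    # Simulated exit codes: a failure, a watchdog kill (124), then success.
    codes = iter([1, 124, 0])
    delays = []
    print(run_with_retries(lambda: next(codes), retries=3, sleep=delays.append))
    print(delays)
```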

Auto-retry on bad-node failure (--auto-retry)

--auto-retry engages the failover loop. On every non-zero exit, ezpz scrapes the log for known bad-node signatures (Aurora PALS shepherd-9, gloo TCP peer-closed), swaps each named host out for a spare from the rest of the PBS allocation, and re-runs the command. Unlike --retries N, the loop is unbounded by default: it continues until one of the conditions in the termination matrix below fires.

# 50 nodes allocated by PBS on Aurora (12 ranks/node = 600 GPUs).
# Train on 512 ranks (= ceil(512/12) = 43 hosts), reserve the
# remaining 7 hosts as spares. Loop until success / walltime /
# spare exhaustion.
ezpz launch --auto-retry --np 512 -- python3 -m ezpz.examples.test

--auto-retry is mutually exclusive with --retries. They model different things: --retries N is a bounded process-level retry that re-launches the same command on the same nodes. --auto-retry is an unbounded node-level failover that swaps bad hosts out between attempts.

Decision flow at a glance

flowchart TD
    Start(["ezpz launch --auto-retry --np N"]) --> Validate{"nproc set<br/>explicitly?"}
    Validate -->|no| ErrParse["SystemExit at parse:<br/>requires --nproc"]
    Validate -->|yes| Split["Split PBS nodelist<br/>into active + spare,<br/>write active.hostfile"]
    Split --> Attempt["Run attempt i<br/>tee to attempt-i.log,<br/>watchdog armed<br/>(default 1800s)"]
    Attempt -->|"SIGINT<br/>(Ctrl-C)"| Interrupted(["FAILOVER STOP:<br/>interrupted<br/>return 130"])
    Attempt --> GotRC["rc = child exit<br/>124 if watchdog fired"]
    GotRC --> Strip["Strip ANSI codes,<br/>strip innocent<br/>rank-cascade lines"]
    Strip --> CheckSuccess{"rc==0 AND<br/>no crash patterns<br/>AND inner_rc==0?"}
    CheckSuccess -->|yes| Success(["FAILOVER STOP:<br/>success<br/>return 0"])
    CheckSuccess -->|no| CheckWalltime{"rc==143 AND<br/>no crash patterns?"}
    CheckWalltime -->|yes| Walltime(["FAILOVER STOP:<br/>walltime<br/>return 143"])
    CheckWalltime -->|no| CheckStuck{"prior AND current<br/>attempt both have<br/>0 step= markers?"}
    CheckStuck -->|yes| Stuck(["FAILOVER STOP:<br/>stuck_pre_training<br/>return rc"])
    CheckStuck -->|no| CheckSpares{"spares left?"}
    CheckSpares -->|no| Exhausted(["FAILOVER STOP:<br/>exhausted<br/>return rc"])
    CheckSpares -->|yes| CheckScraper{"scraper found<br/>named host(s)?"}
    CheckScraper -->|yes| Swap["Swap bad host(s)<br/>for spare(s),<br/>rewrite active.hostfile"]
    CheckScraper -->|no| Swap
    Swap --> Backoff["Backoff sleep:<br/>5/10/20/40/60s"]
    Backoff --> Attempt

The single Swap node above is swap_in when the scraper named hosts and swap_one_blind when it didn't. See the empty-swap_in fallback note for the edge case where swap_in finds no live hosts to replace and falls through to a blind rotation.
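The decision order in the flowchart can be written down as a small pure function. This is a sketch of the ordering only, under stated assumptions: the Attempt fields and the decide function are hypothetical names, SIGINT handling is left out, and the real classifier derives these inputs from the attempt log rather than receiving them as arguments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    rc: int                # child exit code (124 if the watchdog fired)
    crash_patterns: bool   # a crash signature was found in the stripped log
    inner_rc_ok: bool      # the inner trailer reported a clean exit
    step_markers: int      # number of step markers seen in this attempt
    named_hosts: int       # number of hosts named by the scraper


def decide(cur: Attempt, prev_step_markers: Optional[int],
           spares_left: int) -> str:
    """Return the next action, checking conditions in flowchart order.

    prev_step_markers is None on the first attempt.
    """
    if cur.rc == 0 and not cur.crash_patterns and cur.inner_rc_ok:
        return "success"
    if cur.rc == 143 and not cur.crash_patterns:
        return "walltime"
    if prev_step_markers == 0 and cur.step_markers == 0:
        return "stuck_pre_training"
    if spares_left <= 0:
        return "exhausted"
    return "swap_in" if cur.named_hosts > 0 else "swap_one_blind"


if __name__ == "__main__":
    # rc=143 with a crash pattern is a bad-node retry, not a walltime stop.
    print(decide(Attempt(143, True, False, 18, 1), None, spares_left=7))
```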

Required: explicit --nproc

--auto-retry needs to know how many ranks are training so it can split the PBS allocation into active + spare. We do not guess the active-host count; pass --nproc N (or -n N / --np N) explicitly. The CLI errors out at parse time otherwise:

$ ezpz launch --auto-retry -- python3 train.py
--auto-retry requires --nproc (-n/--np) to be set explicitly. ...

Spare-node policy (--spare-nodes)

By default (--spare-nodes auto), the spare pool is total_pbs_nodes - ceil($nproc / $ppn). The --nproc (or -n, --np) flag counts ranks, not nodes; we ceiling-divide by the ranks-per-node (--ppn or the cluster's get_gpus_per_node()) to get the number of hosts actually needed for training. Any allocated nodes beyond that go into the spare pool.

So if PBS gave you 50 nodes on Aurora (12 GPUs/node) and you ask for 512 training ranks, the active set is ceil(512/12) = 43 hosts and the spare pool is 50 - 43 = 7. Pass --spare-nodes N to cap the pool explicitly (useful when you'd rather not use the full leftover slice):

# Cap the spare pool at 5, regardless of how many nodes PBS gave us.
ezpz launch --auto-retry --np 512 --spare-nodes 5 -- ...
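The spare-pool arithmetic above is small enough to check by hand, but a sketch makes the ceiling division explicit. split_allocation is a hypothetical helper, and raising an error when the allocation is too small is an assumption of this sketch, not documented ezpz behaviour.

```python
import math
from typing import Optional, Tuple


def split_allocation(total_nodes: int, nproc: int, ppn: int,
                     spare_cap: Optional[int] = None) -> Tuple[int, int]:
    """Return (active_hosts, spare_hosts) for an allocation.

    active = ceil(nproc / ppn); spares are the leftover nodes,
    optionally capped (mirroring --spare-nodes N).
    """
    active = math.ceil(nproc / ppn)
    if active > total_nodes:
        # Assumption: treat an undersized allocation as an error.
        raise ValueError(f"need {active} hosts but only {total_nodes} allocated")
    spares = total_nodes - active
    if spare_cap is not None:
        spares = min(spares, spare_cap)
    return active, spares


if __name__ == "__main__":
    print(split_allocation(50, 512, 12))               # (43, 7)
    print(split_allocation(50, 512, 12, spare_cap=5))  # (43, 5)
```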

Termination matrix

Every termination logs a single FAILOVER STOP: <reason> line so post-mortem grep is reliable.

| # | Condition | Result |
|---|-----------|--------|
| 1 | exit 0 (clean inner trailer, no crash patterns) | SUCCESS → return 0 |
| 2 | exit 143 (walltime SIGTERM), no crash patterns | WALLTIME → return rc |
| 3 | exit 143 with crash patterns in log | bad-node retry (real failure raced the walltime kill) |
| 4 | exit 124 (idle-output watchdog tripped) | bad-node retry (silent hang → blind rotation) |
| 5 | any other non-zero, scraper found named host(s) | swap_in named → retry |
| 6 | any other non-zero, scraper found nothing | swap_one_blind → retry |
| 7 | two consecutive attempts with zero step= markers | STUCK_PRE_TRAINING → return rc |
| 8 | bad-node verdict but no spares left | EXHAUSTED → return rc |
| 9 | SIGINT (Ctrl-C) | INTERRUPTED → return 130 |

Empty-swap_in fallback. Row 5's swap_in skips any host that isn't currently in the active set (the named host was already replaced on a prior attempt, the scraper picked up stale lines from an older log, etc.). If swap_in ends up swapping zero hosts, the loop falls through to row 6's swap_one_blind so it still makes forward progress instead of looping on the same bad set.
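The fallback can be sketched with plain lists. This is illustrative only: swap_hosts is a hypothetical function that folds swap_in and swap_one_blind into one call, and rotating out the first active host during a blind rotation is an assumption, since the text does not say which host is chosen.

```python
from typing import List, Tuple


def swap_hosts(active: List[str], spares: List[str],
               named_bad: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (new_active, remaining_spares, swapped_out)."""
    active, spares = list(active), list(spares)
    swapped: List[str] = []
    for host in named_bad:
        # Skip hosts that are no longer active (already replaced, or
        # picked up from stale log lines).
        if host in active and spares:
            active[active.index(host)] = spares.pop(0)
            swapped.append(host)
    if not swapped and spares and active:
        # Blind rotation fallback: nothing usable was named, so rotate
        # one host out anyway to keep making progress.
        swapped.append(active[0])
        active[0] = spares.pop(0)
    return active, spares, swapped


if __name__ == "__main__":
    active, spares, out = swap_hosts(["n1", "n2", "n3"], ["s1", "s2"], ["n2"])
    print(active, spares, out)
    # The same (now stale) name on the next attempt triggers a blind rotation.
    print(swap_hosts(active, spares, ["n2"]))
```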

The step= marker guard (#7) replaces a numeric "max consecutive blind rotations" cap. The intent is to catch broken configs / missing datasets / pre-training-loop bugs before they burn the entire spare pool: if two attempts in a row die before History.update prints its first step=N line, no amount of node-swapping will help.

Worked example: real Aurora UR_RESULT_ERROR_OUT_OF_RESOURCES

Here's an excerpt from a real Aurora torchtitan job that the classifier handles correctly. The relevant signals from attempt-1.log:

[2026-05-12 08:04:23][I][components/metrics:526:log] step:  1  loss: 12.94587  ...
[2026-05-12 08:04:30][I][components/metrics:526:log] step:  2  loss: 12.90856  ...
... (16 more clean training steps) ...
[2026-05-12 08:06:24][I][components/metrics:526:log] step: 18  loss: 10.27772  ...
[rank7]: RuntimeError: level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
x4610c4s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 7 exited with code 1
x4610c4s5b0n0.hsn.cm.aurora.alcf.anl.gov: rank 14 died from signal 15
[ezpz/launch] Execution finished with 143.

What the classifier does step by step:

  1. rc=143 from the shell (mpiexec teardown after the GPU OOM).
  2. Strip ANSI codes from the log.
  3. Strip innocent rank-cascade lines: rank 14 died from signal 15 is a downstream cascade from the primary kill on x4610c4s3b0n0, not a bad-node indicator on x4610c4s5b0n0. This line is excluded before the crash-pattern match runs (job 8466848 postmortem: tagging cascade victims as bad nodes burns spares for nothing).
  4. Run the crash-pattern grep on the stripped text: UR_RESULT_ERROR_OUT_OF_RESOURCES matches → there IS a real hardware failure in the log.
  5. rc==143 AND crash_patterns → bad-node retry path, not WALLTIME. Without the strip we'd still get to bad-node retry (the cascade lines also contain died from signal), but we'd reach it via the wrong condition, and we'd be at risk of tagging x4610c4s5b0n0 as a bad node when only x4610c4s3b0n0 is the actual culprit.
  6. Scraper picks up the hostname from the x4610c4s3b0n0.hsn...: rank 7 exited with code 1 line (or, if the scraper missed it because it's not in the explicit pattern set, falls through to swap_one_blind).
  7. bad_nodes.txt gets x4610c4s3b0n0.hsn.cm.aurora.alcf.anl.gov. active.hostfile is rewritten with that host replaced by the next spare.
  8. Backoff 5 seconds, run attempt-2.log. The retry uses the updated active set; the bad GPU is no longer in the training pool.

This exact log shape is pinned as a regression test in both code paths: test_crash_patterns_real_ur_oom_with_cascade_regression (Python) and test_run_walltime_143_retries_on_real_aurora_ur_oom_with_cascade (bash).
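Steps 2 to 4 of the walkthrough can be approximated with three regular expressions. This is a sketch, not the ezpz classifier. The crash alternatives are copied from the postmortem grep shown later in this page. The cascade pattern (a rank killed by signal 15) and the ANSI pattern are assumptions, and strip_log and has_crash_patterns are hypothetical names.

```python
import re

# Common ANSI CSI escape sequences (colors, cursor movement).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# Assumed shape of an innocent cascade line: a rank killed by SIGTERM (15).
CASCADE_RE = re.compile(r"rank \d+ died from signal 15\b")
# Same alternatives as the postmortem grep in this page.
CRASH_RE = re.compile(
    r"OutOfMemoryError|UR_RESULT_ERROR|gloo.*Connection closed|shepherd died"
)


def strip_log(text: str) -> str:
    """Remove ANSI escape codes, then drop cascade-victim lines."""
    clean = ANSI_RE.sub("", text)
    kept = [line for line in clean.splitlines() if not CASCADE_RE.search(line)]
    return "\n".join(kept)


def has_crash_patterns(text: str) -> bool:
    """True if the stripped log contains a known crash signature."""
    return CRASH_RE.search(strip_log(text)) is not None


if __name__ == "__main__":
    log = (
        "[rank7]: RuntimeError: level_zero backend failed with error: 40 "
        "(UR_RESULT_ERROR_OUT_OF_RESOURCES)\n"
        "x4610c4s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 7 exited with code 1\n"
        "x4610c4s5b0n0.hsn.cm.aurora.alcf.anl.gov: rank 14 died from signal 15\n"
    )
    print(strip_log(log))
    print(has_crash_patterns(log))
```

Stripping before matching keeps the cascade victim's hostname out of any later host scrape, which is the point of step 3.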

Reading the postmortem log

After a run finishes (success, walltime, or exhausted), the logs/failover-<jobid>/ directory is the postmortem entry point. A few one-liners:

# What was the final verdict?
grep "FAILOVER STOP" logs/failover-*/attempt-*.log

# Which nodes were swapped out across all attempts?
cat logs/failover-*/bad_nodes.txt

# Which step did each attempt reach before dying?
for f in logs/failover-*/attempt-*.log; do
  echo "=== $f ==="
  grep -oE "step:[[:space:]]+[0-9]+" "$f" | tail -1
done

# Was the failure a real hardware death or just a walltime hit?
grep -E "OutOfMemoryError|UR_RESULT_ERROR|gloo.*Connection closed|shepherd died" \
  logs/failover-*/attempt-*.log
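If you prefer to collect the per-attempt progress in Python, the "which step did each attempt reach" one-liner can be sketched like this. It assumes the same step: N log format that the grep above targets; last_step and last_steps are hypothetical helper names.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Mirrors the grep pattern above: "step:" followed by whitespace and digits.
STEP_RE = re.compile(r"step:\s+(\d+)")


def last_step(log_text: str) -> Optional[int]:
    """Return the last step number logged, or None if no step was reached."""
    matches = STEP_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def last_steps(log_dir: Path) -> Dict[str, Optional[int]]:
    """Map each attempt-*.log file name in log_dir to its last logged step."""
    return {
        path.name: last_step(path.read_text(errors="replace"))
        for path in sorted(log_dir.glob("attempt-*.log"))
    }


if __name__ == "__main__":
    for failover_dir in sorted(Path("logs").glob("failover-*")):
        print(failover_dir, last_steps(failover_dir))
```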

Default idle-output watchdog

When --auto-retry is set, --timeout defaults to 1800 seconds (30 minutes) instead of being off. This matches the FAILOVER_IDLE_TIMEOUT default in src/ezpz/bin/failover.sh and prevents silent xccl hangs from burning the full PBS walltime before the loop can intervene. Pass --timeout 0 to disable, or --timeout N to override.

Optional cap (--max-failover-retries)

--max-failover-retries N is an additional belt-and-suspenders cap. The default is unbounded: the loop terminates only via the matrix above. The cap is useful for short jobs where you'd rather give up than retry 100 times.

Files written

Per-job, in $(pwd)/logs/failover-<jobid>/:

  • active.hostfile: the current active node set, mutated in place as nodes are swapped. Always reflects what the next attempt will run on.
  • bad_nodes.txt: every host that's been swapped out (named swap_in and blind rotations). Append-only.
  • attempt-N.log: combined stdout+stderr of attempt N.

Note on active.hostfile mutation. The file is rewritten in place between attempts. If you cat it from a second shell mid-run, you'll see the current active set, not the original PBS allocation. The launcher reads it fresh at each attempt, so no re-launch of ezpz launch itself is needed for the new contents to take effect: the launcher subprocess re-resolves the hostfile path's contents per spawn. To inspect the original PBS allocation, look at $PBS_NODEFILE (unmodified) instead.

Relationship to src/ezpz/bin/failover.sh

failover.sh is the bash equivalent for users who can't put ezpz launch at the top of their job script, for example when qsub'ing a wrapper that already invokes python directly. The scrape source is identical (ezpz.failover.scrape_bad_nodes); the retry/swap mechanics are independent re-implementations because the classifier is much easier to test in Python. Prefer --auto-retry when you can, fall back to sourcing the bash lib when you can't.

Testing on real Aurora / Sunspot allocations

Two ready-to-submit PBS drivers ship with the package, both following the same scenario-runner pattern (each scenario writes PASS|FAIL to a summary at the end):

# In an ezpz checkout on Aurora:
qsub src/ezpz/bin/test_launch_auto_retry_aurora.pbs

# Or for the pre-#144 watchdog/retries (PR #136):
qsub src/ezpz/bin/test_launch_timeout_retries_aurora.pbs

The --auto-retry driver requests 4 nodes and runs 7 scenarios (~30 min walltime budget):

| Scenario | What it covers |
|----------|----------------|
| A | Happy path: succeeds first attempt, 0 swaps |
| B | Blind rotation: scraper finds nothing, swap_one_blind, succeed |
| C | Named swap: emit a PALS shepherd died from signal 9 line; verify the named host (not a blind one) gets swapped out |
| D | --max-failover-retries 2 cap: always-fail, exactly 3 attempts before bail |
| E | STUCK_PRE_TRAINING guard: 2 consecutive zero-step attempts, bail at attempt 2 |
| F | Realistic: ezpz.examples.test --model debug --train-iters 20 under --auto-retry |
| G | --hostfile honored: pass a 2-node filtered hostfile, verify the loop splits THAT (not the full PBS allocation) |

The driver works on Sunspot too: swap -A AuroraGPT → -A datascience and filesystems=flare:home → filesystems=tegu:home in the PBS header (everything else is identical: same scrape patterns, same workflow).

Inspect after the run:

tail -f logs/ezpz-launch-auto-retry-test.*.log   # scenario summary
ls /tmp/ezpz-aurora-auto-retry-*/scen-*/logs/   # per-scenario failover dirs

For a one-off manual test against your own workload (no scenario harness, just --auto-retry on a job you'd be running anyway):

# Inside any PBS allocation with .venv set up:
ezpz launch --auto-retry --np <N> --timeout 600 -- python3 -m your.module

If you want to force-trigger a swap to validate the loop end-to-end, make your script emit one of the recognized Aurora crash signatures on rank 0 of the first attempt โ€” e.g.:

# Sentinel must live on SHARED filesystem (PBS_O_WORKDIR is your
# submit dir, on Lustre/GPFS); /tmp would be per-node and the
# sentinel would vanish after the failover swaps the active host out.
SENTINEL="${PBS_O_WORKDIR}/.ezpz-attempt2-$$"

# Aurora scraper recognizes the .hsn.cm.aurora.alcf.anl.gov FQDN
# form; `hostname` on compute nodes returns just the short name. Use
# the first PBS_NODEFILE entry โ€” those entries are always in HSN form.
ACTIVE_HOST=$(head -n 1 "${PBS_NODEFILE}")

ezpz launch --auto-retry --np 12 -ppn 12 --timeout 60 -- bash -c "
  if [[ ! -f '${SENTINEL}' ]]; then
    touch '${SENTINEL}'
    echo '${ACTIVE_HOST}: shepherd died from signal 9'
    exit 1
  fi
  echo 'iter step=1 loss=0.5'
  exit 0
"
# Should produce 2 attempts, ${ACTIVE_HOST} swapped into bad_nodes.txt.
# After: `cat logs/failover-*/bad_nodes.txt` shows the swapped host;
# `cat logs/failover-*/active.hostfile` shows the replacement.
# Clean up: rm "${SENTINEL}"

Python interpreter resolution

When ezpz launch needs to invoke python3 (e.g. ezpz launch python3 -m my.module), it picks the interpreter in this order:

  1. $VIRTUAL_ENV/bin/python3 if $VIRTUAL_ENV is set and exists
  2. shutil.which("python3"): first python3 on $PATH
  3. sys.executable as a last resort

Why not just sys.executable? It's frozen at interpreter startup. If you ran ezpz yeet to copy your env to /tmp/ and then sourced /tmp/.venv/bin/activate, sys.executable would still point to the original Lustre path because the ezpz CLI script's shebang is baked in at install time. Reading $VIRTUAL_ENV (set by activate) lets the launcher follow the user's actual current venv.
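The resolution order can be sketched with the standard library alone. resolve_python is a hypothetical name, and this sketch interprets "set and exists" as "the bin/python3 file under $VIRTUAL_ENV exists"; the actual check in ezpz may differ.

```python
import os
import shutil
import sys
from typing import Mapping, Optional


def resolve_python(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a python3 interpreter: active venv, then $PATH, then sys.executable."""
    env = os.environ if env is None else env
    venv = env.get("VIRTUAL_ENV")
    if venv:
        candidate = os.path.join(venv, "bin", "python3")
        if os.path.exists(candidate):
            return candidate
    # shutil.which falls back to the process PATH when path is None.
    found = shutil.which("python3", path=env.get("PATH"))
    return found or sys.executable


if __name__ == "__main__":
    print(resolve_python())
```

Checking $VIRTUAL_ENV first means a venv activated after installation wins over the interpreter that originally launched the CLI.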

Examples

Use it to launch:

  • Arbitrary command(s):

    ezpz launch hostname
    
  • Arbitrary Python string:

    ezpz launch python3 -c 'import ezpz; ezpz.setup_torch()'
    
  • One of the Distributed Training examples:

    ezpz launch python3 -m ezpz.examples.test --profile
    ezpz launch -n 8 -- python3 -m ezpz.examples.fsdp_tp --tp 4
    
  • Your own distributed training script:

    ezpz launch -n 16 -ppn 8 -- python3 -m your_app.train --config configs/your_config.yaml
    

    to launch your_app.train across 16 processes, 8 per node.

Sequence Diagram

Two primary control paths drive ezpz launch: a scheduler-aware path used when running inside PBS/SLURM allocations, and a local fallback that shells out to mpirun when no scheduler metadata is available.

sequenceDiagram
    autonumber
    actor User
    participant CLI as ezpz_launch
    participant Scheduler as PBS_or_Slurm
    participant MPI as mpirun_mpiexec
    participant App as User_application

    User->>CLI: ezpz launch <launch_flags> -- <cmd> <cmd_flags>
    CLI->>Scheduler: detect_scheduler()
    alt scheduler_detected
        Scheduler-->>CLI: scheduler_type, job_metadata
        CLI->>Scheduler: build_scheduler_command(cmd_to_launch)
        Scheduler-->>CLI: launch_cmd (mpiexec_or_srun)
        CLI->>MPI: run_command(launch_cmd)
        MPI->>App: start_ranks_and_execute
        App-->>MPI: return_codes
        MPI-->>CLI: aggregate_status
    else no_scheduler_detected
        Scheduler-->>CLI: unknown
        CLI->>MPI: mpirun -np 2 <cmd> <cmd_flags>
        MPI->>App: start_local_ranks
        App-->>MPI: return_codes
        MPI-->>CLI: aggregate_status
    end
    CLI-->>User: exit_code

Distributed Training Examples

  1. ๐Ÿ“ Examples: Scalable and ready-to-go!

    | Example Module | What it Does |
    |----------------|--------------|
    | ezpz.examples.test | Train MLP with DDP on MNIST |
    | ezpz.examples.fsdp | Train CNN with FSDP on MNIST |
    | ezpz.examples.vit | Train ViT with FSDP on MNIST |
    | ezpz.examples.fsdp_tp | Train Transformer with FSDP + TP on HF Datasets |
    | ezpz.examples.diffusion | Train Diffusion LLM with FSDP on HF Datasets |
    | ezpz.examples.hf | Fine-tune causal LM with Accelerate + FSDP |
    | ezpz.examples.hf_trainer | Train LLM with FSDP + HF Trainer on HF Datasets |

    Any of the examples can be launched with:

    ezpz launch python3 -m ezpz.examples.<example>
    
    🤗 HF Integration
    1. ezpz.examples.{fsdp_tp, diffusion, hf, hf_trainer} all support arbitrary 🤗 Hugging Face datasets, e.g.:

      dataset="stanfordnlp/imdb"  # or any other HF dataset
      ezpz launch python3 -m ezpz.examples.fsdp_tp --dataset "${dataset}"
      ezpz launch python3 -m ezpz.examples.diffusion --dataset "${dataset}"
      ezpz launch python3 -m ezpz.examples.hf \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --dataset_name="${dataset}" \
          --streaming \
          --bf16=true
      ezpz launch python3 -m ezpz.examples.hf_trainer \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --dataset_name="${dataset}" \
          --streaming \
          --bf16=true
      
    2. ezpz.examples.hf and ezpz.examples.hf_trainer both support arbitrary combinations of (compatible) transformers.from_pretrained models, and HF Datasets (with support for streaming!). hf uses an explicit training loop with Accelerate, while hf_trainer wraps the HF Trainer API.

      ezpz launch python3 -m ezpz.examples.hf \
          --streaming \
          --dataset_name=eliplutchok/fineweb-small-sample \
          --tokenizer_name meta-llama/Llama-3.2-1B \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --bf16=true
      
      ezpz launch python3 -m ezpz.examples.hf_trainer \
          --streaming \
          --dataset_name=eliplutchok/fineweb-small-sample \
          --tokenizer_name meta-llama/Llama-3.2-1B \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --bf16=true
      
    Simple Example
    ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    
    Output
    MacBook Pro
    #[01/08/26 @ 14:56:50][~/v/s/ezpz][dev][$✘!?] [4s]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:56:54,307030][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    Using [2 / 2] available "mps" devices !!
    0
    1
    [2025-12-23-162222] Execution time: 4s sec
    
    Aurora (2 Nodes)
    #[aurora_frameworks-2025.2.0](torchtitan-aurora_frameworks-2025.2.0)[1m9s]
    #[01/08/26,14:56:42][x4418c6s1b0n0][/f/d/f/p/p/torchtitan][main][?]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    
    
    [2026-01-08 14:58:01,994729][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 14:58:01,997067][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 14:58:01,997545][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 14:58:02,465850][I][ezpz/launch:396:launch] ----[🍋 ezpz.launch][started][2026-01-08-145802]----
    [2026-01-08 14:58:04,765720][I][ezpz/launch:416:launch] Job ID: 8247203
    [2026-01-08 14:58:04,766527][I][ezpz/launch:417:launch] nodelist: ['x4418c6s1b0n0', 'x4717c0s6b0n0']
    [2026-01-08 14:58:04,766930][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 14:58:04,767616][I][ezpz/pbs:264:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 14:58:04,768399][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 14:58:04,768802][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 14:58:04,769517][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:58:04,770278][I][ezpz/launch:433:launch] Took: 3.01 seconds to build command.
    [2026-01-08 14:58:04,770660][I][ezpz/launch:436:launch] Executing:
    mpiexec
    --envall
    --np=24
    --ppn=12
    --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    --no-vni
    --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    python3
    -c
    import ezpz; print(ezpz.setup_torch())
    [2026-01-08 14:58:04,772125][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 14:58:04,772651][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-145804...
    [2026-01-08 14:58:04,773070][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 14:58:04,773429][I][ezpz/launch:139:run_command] Running command:
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    

    CPU bind output (24 lines)

    cpubind:list x4717c0s6b0n0 pid 118589 rank 12 0: mask 0x1c
    cpubind:list x4717c0s6b0n0 pid 118590 rank 13 1: mask 0x1c00
    ...
    cpubind:list x4418c6s1b0n0 pid 66460 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66461 rank 11 11: mask 0x1c00000000000000000000000
    
    Using [24 / 24] available "xpu" devices !!
    

    Raw rank output (24 lines)

    8
    10
    0
    4
    ...
    18
    21
    
    [2026-01-08 14:58:14,252433][I][ezpz/launch:447:launch] ----[🍋 ezpz.launch][stop][2026-01-08-145814]----
    [2026-01-08 14:58:14,253726][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 14:58:14,254184][I][ezpz/launch:449:launch] Executing finished in 9.48 seconds.
    [2026-01-08 14:58:14,254555][I][ezpz/launch:450:launch] Took 9.48 seconds to run. Exiting.
    took: 18s
    
    demo.py
    import ezpz
    
    # automatic device + backend setup for distributed PyTorch
    _ = ezpz.setup_torch()  # CUDA/NCCL, XPU/XCCL, {MPS, CPU}/GLOO, ...
    
    device = ezpz.get_torch_device() # {cuda, xpu, mps, cpu, ...}
    rank = ezpz.get_rank()
    world_size = ezpz.get_world_size()
    # ...etc
    
    if rank == 0:
        print(f"Hello from rank {rank} / {world_size} on {device}!")
    

    We can launch this script with:

    ezpz launch python3 demo.py
    
    Output(s)
    MacBook Pro
    # from MacBook Pro
    $ ezpz launch python3 demo.py
    [2026-01-08 07:22:31,989741][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 /Users/samforeman/python/ezpz_demo.py
    Using [2 / 2] available "mps" devices !!
    Hello from rank 0 / 2 on mps!
    
    Aurora (2 nodes)
    # from 2 nodes of Aurora:
    #[aurora_frameworks-2025.2.0](foremans-aurora_frameworks-2025.2.0)[C v7.5.0-gcc][43s]
    #[01/08/26,07:26:10][x4604c5s2b0n0][~]
    ; ezpz launch python3 demo.py
    
    [2026-01-08 07:26:19,723138][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 07:26:19,725453][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 07:26:19,725932][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 07:26:20,290222][I][ezpz/launch:396:launch] ----[🍋 ezpz.launch][started][2026-01-08-072620]----
    [2026-01-08 07:26:21,566797][I][ezpz/launch:416:launch] Job ID: 8246832
    [2026-01-08 07:26:21,567684][I][ezpz/launch:417:launch] nodelist: ['x4604c5s2b0n0', 'x4604c5s3b0n0']
    [2026-01-08 07:26:21,568082][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 07:26:21,568770][I][ezpz/pbs:264:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 07:26:21,569557][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 07:26:21,569959][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 07:26:21,570821][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 demo.py
    [2026-01-08 07:26:21,571548][I][ezpz/launch:433:launch] Took: 2.11 seconds to build command.
    [2026-01-08 07:26:21,571918][I][ezpz/launch:436:launch] Executing:
    mpiexec
    --envall
    --np=24
    --ppn=12
    --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    --no-vni
    --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    python3
    demo.py
    [2026-01-08 07:26:21,573262][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 07:26:21,573781][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-072621...
    [2026-01-08 07:26:21,574195][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 07:26:21,574532][I][ezpz/launch:139:run_command] Running command:
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
    

    CPU bind output (24 lines)

    cpubind:list x4604c5s3b0n0 pid 131587 rank 12 0: mask 0x1c
    cpubind:list x4604c5s3b0n0 pid 131588 rank 13 1: mask 0x1c00
    ...
    cpubind:list x4604c5s2b0n0 pid 121235 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121236 rank 11 11: mask 0x1c00000000000000000000000
    
    Using [24 / 24] available "xpu" devices !!
    Hello from rank 0 / 24 on xpu!
    [2026-01-08 07:26:33,060432][I][ezpz/launch:447:launch] ----[🍋 ezpz.launch][stop][2026-01-08-072633]----
    [2026-01-08 07:26:33,061512][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 07:26:33,062045][I][ezpz/launch:449:launch] Executing finished in 11.49 seconds.
    [2026-01-08 07:26:33,062531][I][ezpz/launch:450:launch] Took 11.49 seconds to run. Exiting.
    took: 22s
    

  1. When no -- is present, all arguments are treated as part of the command to run. 

  2. By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm).
    If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:

    1. The number of available nodes
    2. How many GPUs are present on each of these nodes
    3. How many GPUs we have total

    It will then use this information to automatically construct the appropriate {mpiexec, srun} command to launch, and finally, execute the launch cmd. 

  3. The ezpz.History class automatically computes distributed statistics (min, max, mean, std. dev) across ranks for all recorded metrics.
    NOTE: This is automatically disabled when ezpz.get_world_size() >= 384 (e.g. >= {32, 96} {Aurora, Polaris} nodes) due to the additional overhead introduced (but can be manually enabled, if desired).