Skip to content

๐Ÿš€ ezpz launchโš“๏ธŽ

Single entry point for distributed jobs.

ezpz detects PBS/Slurm automatically and falls back to mpirun, forwarding useful environment variables so your script behaves the same on laptops and clusters.

Add your own args to any command (--config, --batch-size, etc.) and ezpz will propagate them through the detected launcher.

Use the provided:

ezpz launch <launch flags> -- <cmd> <cmd flags>

to automatically launch <cmd> across all available1 accelerators.

  • ezpz launch --help
    ezpz launch --help
    usage: ezpz launch [-h] [--print-source] [--filter FILTER [FILTER ...]] [-n NPROC] [-ppn NPROC_PER_NODE] [-nh NHOSTS] [--hostfile HOSTFILE] ...
    
    Launch a command on the current PBS/SLURM job.
    
    Additional `<launcher flags>` can be passed through directly
    to the launcher by including '--' as a separator before
    the command.
    
    Examples:
    
        ezpz launch <launcher flags> -- <command> <args>
    
        ezpz launch -n 8 -ppn 4 --verbose --tag-output -- python3 -m ezpz.examples.fsdp_tp
    
        ezpz launch --nproc 8 -x EZPZ_LOG_LEVEL=DEBUG -- python3 my_script.py --my-arg val
    
    positional arguments:
    command               Command (and arguments) to execute. Use '--' to separate options when needed.
    
    options:
    -h, --help            show this help message and exit
    --print-source        Print the location of the launch CLI source and exit.
    --filter FILTER [FILTER ...]
                            Filter output lines by these strings.
    -n NPROC, -np NPROC, --n NPROC, --np NPROC, --nproc NPROC, --world_size NPROC, --nprocs NPROC
                            Number of processes.
    -ppn NPROC_PER_NODE, --ppn NPROC_PER_NODE, --nproc_per_node NPROC_PER_NODE
                            Processes per node.
    -nh NHOSTS, --nh NHOSTS, --nhost NHOSTS, --nnode NHOSTS, --nnodes NHOSTS, --nhosts NHOSTS, --nhosts NHOSTS
                            Number of nodes to use.
    --hostfile HOSTFILE   Hostfile to use for launching.
    
  • Scheduler smarts: detects PBS/Slurm automatically;
    Otherwise falls back to mpirun with sensible env forwarding. For launcher-only flags/env (e.g., -x FOO=bar), place them before --; everything after -- is the command to run:

    ezpz launch <launch flags> -- <command to run> <command args>
    
    Examples

    e.g.:

    ezpz launch -- python3 -m ezpz.examples.fsdp
    

    or, specify -n 8 processes, forward a specific PYTHONPATH, and set EZPZ_LOG_LEVEL=DEBUG:

    ezpz launch -n 8 \
        -x PYTHONPATH=/tmp/.venv/bin:${PYTHONPATH} \
        -x EZPZ_LOG_LEVEL=DEBUG \
        -- \
        python3 -m ezpz.examples.fsdp
    
  • Automatic distributed initialization using ezpz.setup_torch() with automatic {device, backend} selection

    import ezpz
    _ = ezpz.setup_torch()
    
    device = ezpz.get_torch_device()
    # cuda, xpu, mps, cpu, ...
    
  • Automatic single-process logging with rank-aware filtering for distributed runs:

    logger = ezpz.get_logger(__name__)
    
  • Metric tracking, aggregation, and recording via ezpz.History():

    • Automatic distributed statistics (min, max, mean, stddev) across ranks2
    • Weights & Biases integration
    • Persistent storage of metrics in .h5 format
    • Plotting support:
    Graphical plots (svg, png) via matplotlib

    Accuracy Loss Forward time Backward time

    Terminal-based ASCII plots via plotext

                        dt                                    dt/min
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    0.384โ”คโ–Œ                                โ”‚0.384โ”ค-                                โ”‚
    0.320โ”คโ–                                โ”‚0.129โ”ค --------------------------------โ”‚
    0.256โ”ค โ–š                               โ”‚     โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜
    0.129โ”ค โ–โ––                              โ”‚     1.0     3.2     5.5     7.8   10.0 
    0.066โ”ค  โ–                              โ”‚dt/min              iter
    0.002โ”ค   โ–šโ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ”‚                    dt/std
         โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         1.0     3.2     5.5     7.8   10.0 0.00068โ”ค             *      *      *   โ”‚
    dt                  iter                0.00046โ”ค       ****** **   * ****** ***โ”‚
                    dt/mean                 0.00011โ”ค*******         ***            โ”‚
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜
    0.384โ”คยท                                โ”‚       1.0     3.2    5.5     7.8  10.0 
    0.320โ”คยท                                โ”‚dt/std               iter
    0.256โ”ค ยท                               โ”‚                   dt/max
    0.193โ”ค  ยท                              โ”‚     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    0.129โ”ค  ยท                              โ”‚0.384โ”ค+                                โ”‚
    0.066โ”ค   ยท                             โ”‚0.257โ”ค ++                              โ”‚
    0.002โ”ค    ยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทยทโ”‚0.066โ”ค   ++++++++++++++++++++++++++++++โ”‚
         โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜     โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜
        1.0     3.2     5.5     7.8   10.0      1.0     3.2     5.5     7.8   10.0 
    dt/mean             iter                dt/max              iter              
    text saved in /Users/samforeman/vibes/saforem2/ezpz/outputs/History-2026-01-15-162549/2026-01-15-162549/plots/tplot/dt.txt โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 0.384โ”ค ++ dt/max โ”‚ โ”‚ -- dt/min โ”‚ โ”‚ ยทยท dt/mean โ”‚ 0.320โ”ค โ–žโ–ž dt โ”‚ โ”‚ โ– โ”‚ โ”‚ โ–Œ โ”‚ 0.256โ”ค โ–š โ”‚ โ”‚ โ–โ–– โ”‚ โ”‚ โ–Œ โ”‚ 0.193โ”ค โ– โ”‚ โ”‚ โ–Œ โ”‚ โ”‚ โ– โ”‚ โ”‚ โ–โ–– โ”‚ 0.129โ”ค โ–š โ”‚ โ”‚ โ– โ”‚ โ”‚ โ–Œ โ”‚ 0.065โ”ค โ– โ”‚ โ”‚ โ–โ–– โ”‚ โ”‚ โ–š โ”‚ 0.002โ”ค โ–โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ–„โ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ 1.0 3.2 5.5 7.8 10.0 text saved in /Users/samforeman/vibes/saforem2/ezpz/outputs/History-2026-01-15-162549/2026-01-15-162549/plots/tplot/dt_summary.txt dt/mean hist dt/max hist
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 9.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚9.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 7.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚7.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 6.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚6.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 4.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚4.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 3.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚3.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 1.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚1.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 0.0โ”คโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.0โ”คโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ -0.01 0.09 0.19 0.30 0.40 -0.01 0.09 0.19 0.30 0.40 dt/min hist dt/std hist โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 9.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚2.00โ”ค โ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 7.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚1.67โ”ค โ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 6.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚1.33โ”ค โ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 4.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚1.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 3.0โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚0.67โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 1.5โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.33โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 0.0โ”คโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -0.02 0.09 0.19 0.30 0.40 -0.00003 0.00016 0.00034 0.00053
    text saved in /Users/samforeman/vibes/saforem2/ezpz/outputs/History-2026-01-15-162549/2026-01-15-162549/plots/tplot/dt_hist.txt loss loss/min
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 43.4โ”ค โ–—โ–€โ–€โ–€โ–€โ–„โ–„โ–„โ–„ โ”‚39.3โ”ค - ------------ โ”‚ 38.6โ”ค โ–Ÿ โ–—โ–˜ โ–šโ–– โ”‚22.7โ”ค---- ---------- -------โ”‚ 33.7โ”ค โ–ž โ–š โ–—โ–˜ โ–โ–šโ–– โ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ 24.1โ”ค โ– โ–š โ–—โ–˜ โ–โ–€โ–šโ–„โ–– โ”‚ 1.0 3.2 5.5 7.8 10.0 19.3โ”คโ–—โ–˜ โ–š โ–„โ–˜ โ–โ–€โ–€โ–€โ”‚loss/min iter 14.4โ”คโ–Œ โ–šโ–„โ–€ โ”‚ loss/std โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 1.0 3.2 5.5 7.8 10.0 17.4โ”ค * โ”‚ loss iter 11.8โ”ค **** ** * โ”‚ loss/mean 3.3โ”ค******* *************** ****โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ 41.4โ”ค ยทยทยทยท โ”‚ 1.0 3.2 5.5 7.8 10.0 37.6โ”ค ยท ยทยทยทยท ยทยทยทยท โ”‚loss/std iter
    33.9โ”ค ยท ยท ยท ยทยทยทยทยทยทยท โ”‚ loss/max
    30.2โ”ค ยท ยท ยท ยทยท โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 26.4โ”ค ยท ยทยท ยทยทโ”‚56.3โ”ค + โ”‚ 22.7โ”คยท โ”‚45.3โ”ค +++++++ ++++++++++++++++++ โ”‚ 18.9โ”คยท โ”‚28.9โ”ค++++ ++++โ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ 1.0 3.2 5.5 7.8 10.0 1.0 3.2 5.5 7.8 10.0 loss/mean iter loss/max iter text saved in /Users/samforeman/vibes/saforem2/ezpz/outputs/History-2026-01-15-162549/2026-01-15-162549/plots/tplot/loss.txt โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 56.3โ”ค ++ loss/max + โ”‚ โ”‚ -- loss/min + + โ”‚ โ”‚ ยทยท loss/mean + + โ”‚ 49.3โ”ค โ–žโ–ž loss + ++ โ”‚ โ”‚ + + โ”‚ โ”‚ + + โ”‚ 42.3โ”ค + +โ–žโ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–šโ–„โ–„โ–„โ–– โ”‚ โ”‚ + โ–žยท +++โ–โ–€โ–€โ–€โ–š + โ”‚ โ”‚ โ––++++++++ ยทยทยทยทยทยทโ–ยท-ยทยทยทยทยทยทยทยทยท โ–€โ–– ++ + โ”‚ 35.4โ”ค โ–žโ–šยท ยท โ–—โ–˜- ---------ยทยทยทยทยทยทยทยท โ–โ–šโ––+ +++ + โ”‚ โ”‚ โ–--โ–šยทยท ยทยท โ–—โ–˜- ---- ยทยทยทโ–โ–„+++++ ยท ++ โ”‚ โ”‚ โ–—โ–˜ -โ–Œ ยท ยท โ–ž- ---- ยทโ–šโ––ยทยทยทยทยทยทยทยท ยทยท + โ”‚ โ”‚ ยทโ–Œ โ–โ–– ยทยท ยทยท โ–ž- ------โ–โ–šโ–„โ–– ยทยท + โ”‚ 28.4โ”ค ยทโ–ž โ–โ–– ยทยทยท โ–- -โ–โ–€โ–šโ–„โ–– ยทยท++โ”‚ โ”‚ ยทโ–—โ–˜ โ–š โ–—โ–˜ -โ–โ–€โ–€โ–„โ–„โ–– ยทยทโ”‚ โ”‚+ยทโ–—โ–˜ โ–š โ–—โ–˜ --โ–โ–€โ–€โ–„โ–„โ–„โ”‚ 21.4โ”คยท โ–ž โ–Œ โ–—โ–ž โ”‚ โ”‚ยทโ–ž โ–โ–– โ–—โ–„โ–€โ–˜ โ”‚ โ”‚โ–—โ–˜ โ–โ––-โ–„โ–žโ–˜ โ”‚ 14.4โ”คโ–Œ โ–โ–€ โ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ 1.0 3.2 5.5 7.8 10.0 text saved in /Users/samforeman/vibes/saforem2/ezpz/outputs/History-2026-01-15-162549/2026-01-15-162549/plots/tplot/loss_summary.txt loss/mean hist loss/max hist
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 2.00โ”ค โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚2.00โ”ค โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 1.67โ”ค โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚1.67โ”ค โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 1.33โ”ค โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚1.33โ”ค โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 1.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚1.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 0.67โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.67โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 0.33โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.33โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 0.00โ”คโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ 17.9 24.0 30.2 36.3 42.4 22.0 30.9 39.8 48.8 57.7 loss/min hist loss/std hist
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” 2.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚3.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 1.67โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚2.50โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 1.33โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚2.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 1.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚1.50โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ”‚ 0.67โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚1.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 0.33โ”คโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.50โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ 0.00โ”คโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ”‚0.00โ”คโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ”‚ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ โ””โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”˜ 13.3 20.1 26.9 33.7 40.5 -0.2 4.4 8.9 13.5 18.1 text saved in /Users/samforeman/vibes/saforem2/ezpz/outputs/History-2026-01-15-162549/2026-01-15-162549/plots/tplot/loss_hist.txt

Use it to launch:

  • Arbitrary command(s):

    ezpz launch hostname
    
  • Arbitrary Python string:

    ezpz launch python3 -c 'import ezpz; ezpz.setup_torch()'
    
  • One of the ready-to-go examples:

    ezpz launch python3 -m ezpz.examples.test --profile
    ezpz launch -n 8 -- python3 -m ezpz.examples.fsdp_tp --tp 4
    
  • Your own distributed training script:

    ezpz launch -n 16 -ppn 8 -- python3 -m your_app.train --config configs/your_config.yaml
    

    to launch your_app.train across 16 processes, 8 per node.

Sequence Diagram

Two primary control paths drive ezpz launch: a scheduler-aware path used when running inside PBS/SLURM allocations, and a local fallback that shells out to mpirun when no scheduler metadata is available.

sequenceDiagram
    autonumber
    actor User
    participant CLI as ezpz_launch
    participant Scheduler as PBS_or_Slurm
    participant MPI as mpirun_mpiexec
    participant App as User_application

    User->>CLI: ezpz launch <launch_flags> -- <cmd> <cmd_flags>
    CLI->>Scheduler: detect_scheduler()
    alt scheduler_detected
        Scheduler-->>CLI: scheduler_type, job_metadata
        CLI->>Scheduler: build_scheduler_command(cmd_to_launch)
        Scheduler-->>CLI: launch_cmd (mpiexec_or_srun)
        CLI->>MPI: run_command(launch_cmd)
        MPI->>App: start_ranks_and_execute
        App-->>MPI: return_codes
        MPI-->>CLI: aggregate_status
    else no_scheduler_detected
        Scheduler-->>CLI: unknown
        CLI->>MPI: mpirun -np 2 <cmd> <cmd_flags>
        MPI->>App: start_local_ranks
        App-->>MPI: return_codes
        MPI-->>CLI: aggregate_status
    end
    CLI-->>User: exit_code

๐Ÿ“ Ready-to-go Examplesโš“๏ธŽ

  1. ๐Ÿ“ Examples: Scalable and ready-to-go!

    Links Example Module What it Does
    ยท ยท ezpz.examples.test Train MLP with DDP on MNIST
    ยท ยท ezpz.examples.fsdp Train CNN with FSDP on MNIST
    ยท ยท ezpz.examples.vit Train ViT with FSDP on MNIST
    ยท ยท ezpz.examples.fsdp_tp Train Transformer with FSDP + TP on HF Datasets
    ยท ยท ezpz.examples.diffusion Train Diffusion LLM with FSDP on HF Datasets
    ยท ยท ezpz.examples.hf_trainer Train LLM with FSDP + HF Trainer on HF Datasets
    Running Examples

    Any of the examples below can be launched with (sensible defaults if not specified):

    ezpz launch python3 -m ezpz.examples.fsdp
    ezpz launch python3 -m ezpz.examples.fsdp_tp
    # ...etc
    ezpz launch python3 -m ezpz.examples.hf_trainer
    
    ๐Ÿค— HF Integration
    1. ezpz.examples.{fsdp_tp, diffusion, hf_trainer, hf_trainer} all support arbitrary ๐Ÿค— Hugging Face datasets e.g.:

      dataset="stanfordnlp/imdb"  # or any other HF dataset
      ezpz launch python3 -m ezpz.examples.fsdp_tp --dataset "${dataset}"
      ezpz launch python3 -m ezpz.examples.diffusion --dataset "${dataset}"
      ezpz launch python3 -m ezpz.examples.hf_trainer \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --dataset_name="${dataset}" \
          --streaming \
          --bf16=true
      
    2. ezpz.examples.hf_trainer supports arbitrary combinations of (compatible) transformers.from_pretrained models, and HF Datasets (with support for streaming!)

      ezpz launch python3 -m ezpz.examples.hf_trainer \
          --streaming \
          --dataset_name=eliplutchok/fineweb-small-sample \
          --tokenizer_name meta-llama/Llama-3.2-1B \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --bf16=true
          # ...etc.
      
    Simple Example
    ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    
    Output
    Macbook Pro
    #[01/08/26 @ 14:56:50][~/v/s/ezpz][dev][$โœ˜!?] [4s]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:56:54,307030][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    Using [2 / 2] available "mps" devices !!
    0
    1
    [2025-12-23-162222] Execution time: 4s sec
    
    Aurora (2 Nodes)
    #[aurora_frameworks-2025.2.0](torchtitan-aurora_frameworks-2025.2.0)[1m9s]
    #[01/08/26,14:56:42][x4418c6s1b0n0][/f/d/f/p/p/torchtitan][main][?]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    
    
    [2026-01-08 14:58:01,994729][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 14:58:01,997067][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 14:58:01,997545][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 14:58:02,465850][I][ezpz/launch:396:launch] ----[๐Ÿ‹ ezpz.launch][started][2026-01-08-145802]----
    [2026-01-08 14:58:04,765720][I][ezpz/launch:416:launch] Job ID: 8247203
    [2026-01-08 14:58:04,766527][I][ezpz/launch:417:launch] nodelist: ['x4418c6s1b0n0', 'x4717c0s6b0n0']
    [2026-01-08 14:58:04,766930][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 14:58:04,767616][I][ezpz/pbs:264:get_pbs_launch_cmd] โœ… Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 14:58:04,768399][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 14:58:04,768802][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 14:58:04,769517][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:58:04,770278][I][ezpz/launch:433:launch] Took: 3.01 seconds to build command.
    [2026-01-08 14:58:04,770660][I][ezpz/launch:436:launch] Executing:
    mpiexec
    --envall
    --np=24
    --ppn=12
    --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    --no-vni
    --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    python3
    -c
    import ezpz; print(ezpz.setup_torch())
    [2026-01-08 14:58:04,772125][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 14:58:04,772651][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-145804...
    [2026-01-08 14:58:04,773070][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 14:58:04,773429][I][ezpz/launch:139:run_command] Running command:
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    cpubind:list x4717c0s6b0n0 pid 118589 rank 12 0: mask 0x1c
    cpubind:list x4717c0s6b0n0 pid 118590 rank 13 1: mask 0x1c00
    cpubind:list x4717c0s6b0n0 pid 118591 rank 14 2: mask 0x1c0000
    cpubind:list x4717c0s6b0n0 pid 118592 rank 15 3: mask 0x1c000000
    cpubind:list x4717c0s6b0n0 pid 118593 rank 16 4: mask 0x1c00000000
    cpubind:list x4717c0s6b0n0 pid 118594 rank 17 5: mask 0x1c0000000000
    cpubind:list x4717c0s6b0n0 pid 118595 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4717c0s6b0n0 pid 118596 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4717c0s6b0n0 pid 118597 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4717c0s6b0n0 pid 118598 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4717c0s6b0n0 pid 118599 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4717c0s6b0n0 pid 118600 rank 23 11: mask 0x1c00000000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66450 rank 0 0: mask 0x1c
    cpubind:list x4418c6s1b0n0 pid 66451 rank 1 1: mask 0x1c00
    cpubind:list x4418c6s1b0n0 pid 66452 rank 2 2: mask 0x1c0000
    cpubind:list x4418c6s1b0n0 pid 66453 rank 3 3: mask 0x1c000000
    cpubind:list x4418c6s1b0n0 pid 66454 rank 4 4: mask 0x1c00000000
    cpubind:list x4418c6s1b0n0 pid 66455 rank 5 5: mask 0x1c0000000000
    cpubind:list x4418c6s1b0n0 pid 66456 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4418c6s1b0n0 pid 66457 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4418c6s1b0n0 pid 66458 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4418c6s1b0n0 pid 66459 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66460 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66461 rank 11 11: mask 0x1c00000000000000000000000
    Using [24 / 24] available "xpu" devices !!
    8
    10
    0
    4
    3
    5
    7
    11
    6
    1
    9
    2
    14
    15
    12
    13
    16
    17
    19
    22
    20
    23
    18
    21
    [2026-01-08 14:58:14,252433][I][ezpz/launch:447:launch] ----[๐Ÿ‹ ezpz.launch][stop][2026-01-08-145814]----
    [2026-01-08 14:58:14,253726][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 14:58:14,254184][I][ezpz/launch:449:launch] Executing finished in 9.48 seconds.
    [2026-01-08 14:58:14,254555][I][ezpz/launch:450:launch] Took 9.48 seconds to run. Exiting.
    took: 18s
    
    demo.py
    demo.py
    import ezpz
    
    # automatic device + backend setup for distributed PyTorch
    _ = ezpz.setup_torch()  # CUDA/NCCL, XPU/XCCL, {MPS, CPU}/GLOO, ...
    
    device = ezpz.get_torch_device() # {cuda, xpu, mps, cpu, ...}
    rank = ezpz.get_rank()
    world_size = ezpz.get_world_size()
    # ...etc
    
    if rank == 0:
        print(f"Hello from rank {rank} / {world_size} on {device}!")
    

    We can launch this script with:

    ezpz launch python3 demo.py
    
    Output(s)
    MacBook Pro
    # from MacBook Pro
    $ ezpz launch python3 demo.py
    [2026-01-08 07:22:31,989741][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 /Users/samforeman/python/ezpz_demo.py
    Using [2 / 2] available "mps" devices !!
    Hello from rank 0 / 2 on mps!
    
    Aurora (2 nodes)
    # from 2 nodes of Aurora:
    #[aurora_frameworks-2025.2.0](foremans-aurora_frameworks-2025.2.0)[C v7.5.0-gcc][43s]
    #[01/08/26,07:26:10][x4604c5s2b0n0][~]
    ; ezpz launch python3 demo.py
    
    [2026-01-08 07:26:19,723138][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 07:26:19,725453][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 07:26:19,725932][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 07:26:20,290222][I][ezpz/launch:396:launch] ----[๐Ÿ‹ ezpz.launch][started][2026-01-08-072620]----
    [2026-01-08 07:26:21,566797][I][ezpz/launch:416:launch] Job ID: 8246832
    [2026-01-08 07:26:21,567684][I][ezpz/launch:417:launch] nodelist: ['x4604c5s2b0n0', 'x4604c5s3b0n0']
    [2026-01-08 07:26:21,568082][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 07:26:21,568770][I][ezpz/pbs:264:get_pbs_launch_cmd] โœ… Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 07:26:21,569557][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 07:26:21,569959][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 07:26:21,570821][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 demo.py
    [2026-01-08 07:26:21,571548][I][ezpz/launch:433:launch] Took: 2.11 seconds to build command.
    [2026-01-08 07:26:21,571918][I][ezpz/launch:436:launch] Executing:
    mpiexec
    --envall
    --np=24
    --ppn=12
    --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    --no-vni
    --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    python3
    demo.py
    [2026-01-08 07:26:21,573262][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 07:26:21,573781][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-072621...
    [2026-01-08 07:26:21,574195][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 07:26:21,574532][I][ezpz/launch:139:run_command] Running command:
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
    cpubind:list x4604c5s3b0n0 pid 131587 rank 12 0: mask 0x1c
    cpubind:list x4604c5s3b0n0 pid 131588 rank 13 1: mask 0x1c00
    cpubind:list x4604c5s3b0n0 pid 131589 rank 14 2: mask 0x1c0000
    cpubind:list x4604c5s3b0n0 pid 131590 rank 15 3: mask 0x1c000000
    cpubind:list x4604c5s3b0n0 pid 131591 rank 16 4: mask 0x1c00000000
    cpubind:list x4604c5s3b0n0 pid 131592 rank 17 5: mask 0x1c0000000000
    cpubind:list x4604c5s3b0n0 pid 131593 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4604c5s3b0n0 pid 131594 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4604c5s3b0n0 pid 131595 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4604c5s3b0n0 pid 131596 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4604c5s3b0n0 pid 131597 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4604c5s3b0n0 pid 131598 rank 23 11: mask 0x1c00000000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121225 rank 0 0: mask 0x1c
    cpubind:list x4604c5s2b0n0 pid 121226 rank 1 1: mask 0x1c00
    cpubind:list x4604c5s2b0n0 pid 121227 rank 2 2: mask 0x1c0000
    cpubind:list x4604c5s2b0n0 pid 121228 rank 3 3: mask 0x1c000000
    cpubind:list x4604c5s2b0n0 pid 121229 rank 4 4: mask 0x1c00000000
    cpubind:list x4604c5s2b0n0 pid 121230 rank 5 5: mask 0x1c0000000000
    cpubind:list x4604c5s2b0n0 pid 121231 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4604c5s2b0n0 pid 121232 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4604c5s2b0n0 pid 121233 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4604c5s2b0n0 pid 121234 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121235 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121236 rank 11 11: mask 0x1c00000000000000000000000
    Using [24 / 24] available "xpu" devices !!
    Hello from rank 0 / 24 on xpu!
    [2026-01-08 07:26:33,060432][I][ezpz/launch:447:launch] ----[๐Ÿ‹ ezpz.launch][stop][2026-01-08-072633]----
    [2026-01-08 07:26:33,061512][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 07:26:33,062045][I][ezpz/launch:449:launch] Executing finished in 11.49 seconds.
    [2026-01-08 07:26:33,062531][I][ezpz/launch:450:launch] Took 11.49 seconds to run. Exiting.
    took: 22s
    

  1. By default, this will detect if we're running behind a job scheduler (e.g. PBS or Slurm).
    If so, we automatically determine the specifics of the currently active job; explicitly, this will determine:

    1. The number of available nodes
    2. How many GPUs are present on each of these nodes
    3. How many GPUs we have total

    It will then use this information to automatically construct the appropriate {mpiexec, srun} command to launch, and finally, execute the launch cmd. 

  2. The ezpz.History class automatically computes distributed statistics (min, max, mean, std. dev) across ranks for all recorded metrics.
    NOTE: This is automatically disabled when ezpz.get_world_size() >= 384 (e.g. >= {32, 96} {Aurora, Polaris} nodes) due to the additional overhead introduced (but can be manually enabled, if desired).