Skip to content

๐Ÿงฐ ezpz CLIโš“๏ธŽ

Once installed, ezpz provides a CLI with a few useful utilities to help launch distributed PyTorch applications.

Explicitly, these are ezpz <command>:

  • ๐Ÿฉบ ezpz doctor: Health check your environment
  • ๐Ÿš€ ezpz launch: Launch commands with automatic job scheduler detection (PBS, Slurm)

    • ๐Ÿ’ฏ ezpz test: Run simple distributed smoke test
    • ๐Ÿ“ Examples: Scalable and ready-to-go!

      Links Example Module What it Does
      ยท ยท ezpz.examples.test Train MLP with DDP on MNIST
      ยท ยท ezpz.examples.fsdp Train CNN with FSDP on MNIST
      ยท ยท ezpz.examples.vit Train ViT with FSDP on MNIST
      ยท ยท ezpz.examples.fsdp_tp Train Transformer with FSDP + TP on HF Datasets
      ยท ยท ezpz.examples.diffusion Train Diffusion LLM with FSDP on HF Datasets
      ยท ยท ezpz.examples.hf_trainer Train LLM with FSDP + HF Trainer on HF Datasets
      Running Examples

      Any of the examples below can be launched with (sensible defaults if not specified):

      ezpz launch python3 -m ezpz.examples.fsdp
      ezpz launch python3 -m ezpz.examples.fsdp_tp
      # ...etc
      ezpz launch python3 -m ezpz.examples.hf_trainer
      
      ๐Ÿค— HF Integration
      1. ezpz.examples.{fsdp_tp, diffusion, hf_trainer, hf_trainer} all support arbitrary ๐Ÿค— Hugging Face datasets e.g.:

        dataset="stanfordnlp/imdb"  # or any other HF dataset
        ezpz launch python3 -m ezpz.examples.fsdp_tp --dataset "${dataset}"
        ezpz launch python3 -m ezpz.examples.diffusion --dataset "${dataset}"
        ezpz launch python3 -m ezpz.examples.hf_trainer \
            --model_name_or_path meta-llama/Llama-3.2-1B \
            --dataset_name="${dataset}" \
            --streaming \
            --bf16=true
        
      2. ezpz.examples.hf_trainer supports arbitrary combinations of (compatible) transformers.from_pretrained models, and HF Datasets (with support for streaming!)

        ezpz launch python3 -m ezpz.examples.hf_trainer \
            --streaming \
            --dataset_name=eliplutchok/fineweb-small-sample \
            --tokenizer_name meta-llama/Llama-3.2-1B \
            --model_name_or_path meta-llama/Llama-3.2-1B \
            --bf16=true
            # ...etc.
        
      Simple Example
      ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
      
      Output
      Macbook Pro
      #[01/08/26 @ 14:56:50][~/v/s/ezpz][dev][$โœ˜!?] [4s]
      ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
      [2026-01-08 14:56:54,307030][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 -c 'import ezpz; print(ezpz.setup_torch())'
      Using [2 / 2] available "mps" devices !!
      0
      1
      [2025-12-23-162222] Execution time: 4s sec
      
      Aurora (2 Nodes)
      #[aurora_frameworks-2025.2.0](torchtitan-aurora_frameworks-2025.2.0)[1m9s]
      #[01/08/26,14:56:42][x4418c6s1b0n0][/f/d/f/p/p/torchtitan][main][?]
      ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
      
      
      [2026-01-08 14:58:01,994729][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
      [2026-01-08 14:58:01,997067][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
      [2026-01-08 14:58:01,997545][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
      [2026-01-08 14:58:02,465850][I][ezpz/launch:396:launch] ----[๐Ÿ‹ ezpz.launch][started][2026-01-08-145802]----
      [2026-01-08 14:58:04,765720][I][ezpz/launch:416:launch] Job ID: 8247203
      [2026-01-08 14:58:04,766527][I][ezpz/launch:417:launch] nodelist: ['x4418c6s1b0n0', 'x4717c0s6b0n0']
      [2026-01-08 14:58:04,766930][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      [2026-01-08 14:58:04,767616][I][ezpz/pbs:264:get_pbs_launch_cmd] โœ… Using [24/24] GPUs [2 hosts] x [12 GPU/host]
      [2026-01-08 14:58:04,768399][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
      [2026-01-08 14:58:04,768802][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      [2026-01-08 14:58:04,769517][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 -c 'import ezpz; print(ezpz.setup_torch())'
      [2026-01-08 14:58:04,770278][I][ezpz/launch:433:launch] Took: 3.01 seconds to build command.
      [2026-01-08 14:58:04,770660][I][ezpz/launch:436:launch] Executing:
      mpiexec
      --envall
      --np=24
      --ppn=12
      --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      --no-vni
      --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      python3
      -c
      import ezpz; print(ezpz.setup_torch())
      [2026-01-08 14:58:04,772125][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
      [2026-01-08 14:58:04,772651][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-145804...
      [2026-01-08 14:58:04,773070][I][ezpz/launch:138:run_command] Caught 24 filters
      [2026-01-08 14:58:04,773429][I][ezpz/launch:139:run_command] Running command:
      mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c 'import ezpz; print(ezpz.setup_torch())'
      cpubind:list x4717c0s6b0n0 pid 118589 rank 12 0: mask 0x1c
      cpubind:list x4717c0s6b0n0 pid 118590 rank 13 1: mask 0x1c00
      cpubind:list x4717c0s6b0n0 pid 118591 rank 14 2: mask 0x1c0000
      cpubind:list x4717c0s6b0n0 pid 118592 rank 15 3: mask 0x1c000000
      cpubind:list x4717c0s6b0n0 pid 118593 rank 16 4: mask 0x1c00000000
      cpubind:list x4717c0s6b0n0 pid 118594 rank 17 5: mask 0x1c0000000000
      cpubind:list x4717c0s6b0n0 pid 118595 rank 18 6: mask 0x1c0000000000000
      cpubind:list x4717c0s6b0n0 pid 118596 rank 19 7: mask 0x1c000000000000000
      cpubind:list x4717c0s6b0n0 pid 118597 rank 20 8: mask 0x1c00000000000000000
      cpubind:list x4717c0s6b0n0 pid 118598 rank 21 9: mask 0x1c0000000000000000000
      cpubind:list x4717c0s6b0n0 pid 118599 rank 22 10: mask 0x1c000000000000000000000
      cpubind:list x4717c0s6b0n0 pid 118600 rank 23 11: mask 0x1c00000000000000000000000
      cpubind:list x4418c6s1b0n0 pid 66450 rank 0 0: mask 0x1c
      cpubind:list x4418c6s1b0n0 pid 66451 rank 1 1: mask 0x1c00
      cpubind:list x4418c6s1b0n0 pid 66452 rank 2 2: mask 0x1c0000
      cpubind:list x4418c6s1b0n0 pid 66453 rank 3 3: mask 0x1c000000
      cpubind:list x4418c6s1b0n0 pid 66454 rank 4 4: mask 0x1c00000000
      cpubind:list x4418c6s1b0n0 pid 66455 rank 5 5: mask 0x1c0000000000
      cpubind:list x4418c6s1b0n0 pid 66456 rank 6 6: mask 0x1c0000000000000
      cpubind:list x4418c6s1b0n0 pid 66457 rank 7 7: mask 0x1c000000000000000
      cpubind:list x4418c6s1b0n0 pid 66458 rank 8 8: mask 0x1c00000000000000000
      cpubind:list x4418c6s1b0n0 pid 66459 rank 9 9: mask 0x1c0000000000000000000
      cpubind:list x4418c6s1b0n0 pid 66460 rank 10 10: mask 0x1c000000000000000000000
      cpubind:list x4418c6s1b0n0 pid 66461 rank 11 11: mask 0x1c00000000000000000000000
      Using [24 / 24] available "xpu" devices !!
      8
      10
      0
      4
      3
      5
      7
      11
      6
      1
      9
      2
      14
      15
      12
      13
      16
      17
      19
      22
      20
      23
      18
      21
      [2026-01-08 14:58:14,252433][I][ezpz/launch:447:launch] ----[๐Ÿ‹ ezpz.launch][stop][2026-01-08-145814]----
      [2026-01-08 14:58:14,253726][I][ezpz/launch:448:launch] Execution finished with 0.
      [2026-01-08 14:58:14,254184][I][ezpz/launch:449:launch] Executing finished in 9.48 seconds.
      [2026-01-08 14:58:14,254555][I][ezpz/launch:450:launch] Took 9.48 seconds to run. Exiting.
      took: 18s
      
      demo.py
      demo.py
      import ezpz
      
      # automatic device + backend setup for distributed PyTorch
      _ = ezpz.setup_torch()  # CUDA/NCCL, XPU/XCCL, {MPS, CPU}/GLOO, ...
      
      device = ezpz.get_torch_device() # {cuda, xpu, mps, cpu, ...}
      rank = ezpz.get_rank()
      world_size = ezpz.get_world_size()
      # ...etc
      
      if rank == 0:
          print(f"Hello from rank {rank} / {world_size} on {device}!")
      

      We can launch this script with:

      ezpz launch python3 demo.py
      
      Output(s)
      MacBook Pro
      # from MacBook Pro
      $ ezpz launch python3 demo.py
      [2026-01-08 07:22:31,989741][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 /Users/samforeman/python/ezpz_demo.py
      Using [2 / 2] available "mps" devices !!
      Hello from rank 0 / 2 on mps!
      
      Aurora (2 nodes)
      # from 2 nodes of Aurora:
      #[aurora_frameworks-2025.2.0](foremans-aurora_frameworks-2025.2.0)[C v7.5.0-gcc][43s]
      #[01/08/26,07:26:10][x4604c5s2b0n0][~]
      ; ezpz launch python3 demo.py
      
      [2026-01-08 07:26:19,723138][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
      [2026-01-08 07:26:19,725453][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
      [2026-01-08 07:26:19,725932][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
      [2026-01-08 07:26:20,290222][I][ezpz/launch:396:launch] ----[๐Ÿ‹ ezpz.launch][started][2026-01-08-072620]----
      [2026-01-08 07:26:21,566797][I][ezpz/launch:416:launch] Job ID: 8246832
      [2026-01-08 07:26:21,567684][I][ezpz/launch:417:launch] nodelist: ['x4604c5s2b0n0', 'x4604c5s3b0n0']
      [2026-01-08 07:26:21,568082][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      [2026-01-08 07:26:21,568770][I][ezpz/pbs:264:get_pbs_launch_cmd] โœ… Using [24/24] GPUs [2 hosts] x [12 GPU/host]
      [2026-01-08 07:26:21,569557][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
      [2026-01-08 07:26:21,569959][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      [2026-01-08 07:26:21,570821][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 demo.py
      [2026-01-08 07:26:21,571548][I][ezpz/launch:433:launch] Took: 2.11 seconds to build command.
      [2026-01-08 07:26:21,571918][I][ezpz/launch:436:launch] Executing:
      mpiexec
      --envall
      --np=24
      --ppn=12
      --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      --no-vni
      --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      python3
      demo.py
      [2026-01-08 07:26:21,573262][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
      [2026-01-08 07:26:21,573781][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-072621...
      [2026-01-08 07:26:21,574195][I][ezpz/launch:138:run_command] Caught 24 filters
      [2026-01-08 07:26:21,574532][I][ezpz/launch:139:run_command] Running command:
      mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
      cpubind:list x4604c5s3b0n0 pid 131587 rank 12 0: mask 0x1c
      cpubind:list x4604c5s3b0n0 pid 131588 rank 13 1: mask 0x1c00
      cpubind:list x4604c5s3b0n0 pid 131589 rank 14 2: mask 0x1c0000
      cpubind:list x4604c5s3b0n0 pid 131590 rank 15 3: mask 0x1c000000
      cpubind:list x4604c5s3b0n0 pid 131591 rank 16 4: mask 0x1c00000000
      cpubind:list x4604c5s3b0n0 pid 131592 rank 17 5: mask 0x1c0000000000
      cpubind:list x4604c5s3b0n0 pid 131593 rank 18 6: mask 0x1c0000000000000
      cpubind:list x4604c5s3b0n0 pid 131594 rank 19 7: mask 0x1c000000000000000
      cpubind:list x4604c5s3b0n0 pid 131595 rank 20 8: mask 0x1c00000000000000000
      cpubind:list x4604c5s3b0n0 pid 131596 rank 21 9: mask 0x1c0000000000000000000
      cpubind:list x4604c5s3b0n0 pid 131597 rank 22 10: mask 0x1c000000000000000000000
      cpubind:list x4604c5s3b0n0 pid 131598 rank 23 11: mask 0x1c00000000000000000000000
      cpubind:list x4604c5s2b0n0 pid 121225 rank 0 0: mask 0x1c
      cpubind:list x4604c5s2b0n0 pid 121226 rank 1 1: mask 0x1c00
      cpubind:list x4604c5s2b0n0 pid 121227 rank 2 2: mask 0x1c0000
      cpubind:list x4604c5s2b0n0 pid 121228 rank 3 3: mask 0x1c000000
      cpubind:list x4604c5s2b0n0 pid 121229 rank 4 4: mask 0x1c00000000
      cpubind:list x4604c5s2b0n0 pid 121230 rank 5 5: mask 0x1c0000000000
      cpubind:list x4604c5s2b0n0 pid 121231 rank 6 6: mask 0x1c0000000000000
      cpubind:list x4604c5s2b0n0 pid 121232 rank 7 7: mask 0x1c000000000000000
      cpubind:list x4604c5s2b0n0 pid 121233 rank 8 8: mask 0x1c00000000000000000
      cpubind:list x4604c5s2b0n0 pid 121234 rank 9 9: mask 0x1c0000000000000000000
      cpubind:list x4604c5s2b0n0 pid 121235 rank 10 10: mask 0x1c000000000000000000000
      cpubind:list x4604c5s2b0n0 pid 121236 rank 11 11: mask 0x1c00000000000000000000000
      Using [24 / 24] available "xpu" devices !!
      Hello from rank 0 / 24 on xpu!
      [2026-01-08 07:26:33,060432][I][ezpz/launch:447:launch] ----[๐Ÿ‹ ezpz.launch][stop][2026-01-08-072633]----
      [2026-01-08 07:26:33,061512][I][ezpz/launch:448:launch] Execution finished with 0.
      [2026-01-08 07:26:33,062045][I][ezpz/launch:449:launch] Executing finished in 11.49 seconds.
      [2026-01-08 07:26:33,062531][I][ezpz/launch:450:launch] Took 11.49 seconds to run. Exiting.
      took: 22s
      
  • ezpz --help

    To see the list of available commands, run:

    $ ezpz --help
    Usage: ezpz [OPTIONS] COMMAND [ARGS]...
    
    ezpz distributed utilities.
    
    Options:
    --version   Show the version and exit.
    -h, --help  Show this message and exit.
    
    Commands:
    doctor    Inspect the environment for ezpz launch readiness.
    launch    Launch a command across the active scheduler.
    test      Run the distributed smoke test.