Skip to content

๐Ÿ“ Ready-to-go Examplesโš“๏ธŽ

This section contains ready-to-run examples demonstrating various features of the ezpz library.

  1. ๐Ÿ“ Examples: Scalable and ready-to-go!

    Links Example Module What it Does
    ยท ยท ezpz.examples.test Train MLP with DDP on MNIST
    ยท ยท ezpz.examples.fsdp Train CNN with FSDP on MNIST
    ยท ยท ezpz.examples.vit Train ViT with FSDP on MNIST
    ยท ยท ezpz.examples.fsdp_tp Train Transformer with FSDP + TP on HF Datasets
    ยท ยท ezpz.examples.diffusion Train Diffusion LLM with FSDP on HF Datasets
    ยท ยท ezpz.examples.hf_trainer Train LLM with FSDP + HF Trainer on HF Datasets
    Running Examples

    Any of the examples below can be launched with (sensible defaults if not specified):

    ezpz launch python3 -m ezpz.examples.fsdp
    ezpz launch python3 -m ezpz.examples.fsdp_tp
    # ...etc
    ezpz launch python3 -m ezpz.examples.hf_trainer
    
    ๐Ÿค— HF Integration
    1. ezpz.examples.{fsdp_tp, diffusion, hf_trainer, hf_trainer} all support arbitrary ๐Ÿค— Hugging Face datasets e.g.:

      dataset="stanfordnlp/imdb"  # or any other HF dataset
      ezpz launch python3 -m ezpz.examples.fsdp_tp --dataset "${dataset}"
      ezpz launch python3 -m ezpz.examples.diffusion --dataset "${dataset}"
      ezpz launch python3 -m ezpz.examples.hf_trainer \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --dataset_name="${dataset}" \
          --streaming \
          --bf16=true
      
    2. ezpz.examples.hf_trainer supports arbitrary combinations of (compatible) transformers.from_pretrained models, and HF Datasets (with support for streaming!)

      ezpz launch python3 -m ezpz.examples.hf_trainer \
          --streaming \
          --dataset_name=eliplutchok/fineweb-small-sample \
          --tokenizer_name meta-llama/Llama-3.2-1B \
          --model_name_or_path meta-llama/Llama-3.2-1B \
          --bf16=true
          # ...etc.
      
    Simple Example
    ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    
    Output
    Macbook Pro
    #[01/08/26 @ 14:56:50][~/v/s/ezpz][dev][$โœ˜!?] [4s]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:56:54,307030][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    Using [2 / 2] available "mps" devices !!
    0
    1
    [2025-12-23-162222] Execution time: 4s sec
    
    Aurora (2 Nodes)
    #[aurora_frameworks-2025.2.0](torchtitan-aurora_frameworks-2025.2.0)[1m9s]
    #[01/08/26,14:56:42][x4418c6s1b0n0][/f/d/f/p/p/torchtitan][main][?]
    ; ezpz launch python3 -c 'import ezpz; print(ezpz.setup_torch())'
    
    
    [2026-01-08 14:58:01,994729][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 14:58:01,997067][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 14:58:01,997545][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 14:58:02,465850][I][ezpz/launch:396:launch] ----[๐Ÿ‹ ezpz.launch][started][2026-01-08-145802]----
    [2026-01-08 14:58:04,765720][I][ezpz/launch:416:launch] Job ID: 8247203
    [2026-01-08 14:58:04,766527][I][ezpz/launch:417:launch] nodelist: ['x4418c6s1b0n0', 'x4717c0s6b0n0']
    [2026-01-08 14:58:04,766930][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 14:58:04,767616][I][ezpz/pbs:264:get_pbs_launch_cmd] โœ… Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 14:58:04,768399][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 14:58:04,768802][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 14:58:04,769517][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 -c 'import ezpz; print(ezpz.setup_torch())'
    [2026-01-08 14:58:04,770278][I][ezpz/launch:433:launch] Took: 3.01 seconds to build command.
    [2026-01-08 14:58:04,770660][I][ezpz/launch:436:launch] Executing:
    mpiexec
    --envall
    --np=24
    --ppn=12
    --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    --no-vni
    --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    python3
    -c
    import ezpz; print(ezpz.setup_torch())
    [2026-01-08 14:58:04,772125][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 14:58:04,772651][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-145804...
    [2026-01-08 14:58:04,773070][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 14:58:04,773429][I][ezpz/launch:139:run_command] Running command:
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8247203.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -c 'import ezpz; print(ezpz.setup_torch())'
    cpubind:list x4717c0s6b0n0 pid 118589 rank 12 0: mask 0x1c
    cpubind:list x4717c0s6b0n0 pid 118590 rank 13 1: mask 0x1c00
    cpubind:list x4717c0s6b0n0 pid 118591 rank 14 2: mask 0x1c0000
    cpubind:list x4717c0s6b0n0 pid 118592 rank 15 3: mask 0x1c000000
    cpubind:list x4717c0s6b0n0 pid 118593 rank 16 4: mask 0x1c00000000
    cpubind:list x4717c0s6b0n0 pid 118594 rank 17 5: mask 0x1c0000000000
    cpubind:list x4717c0s6b0n0 pid 118595 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4717c0s6b0n0 pid 118596 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4717c0s6b0n0 pid 118597 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4717c0s6b0n0 pid 118598 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4717c0s6b0n0 pid 118599 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4717c0s6b0n0 pid 118600 rank 23 11: mask 0x1c00000000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66450 rank 0 0: mask 0x1c
    cpubind:list x4418c6s1b0n0 pid 66451 rank 1 1: mask 0x1c00
    cpubind:list x4418c6s1b0n0 pid 66452 rank 2 2: mask 0x1c0000
    cpubind:list x4418c6s1b0n0 pid 66453 rank 3 3: mask 0x1c000000
    cpubind:list x4418c6s1b0n0 pid 66454 rank 4 4: mask 0x1c00000000
    cpubind:list x4418c6s1b0n0 pid 66455 rank 5 5: mask 0x1c0000000000
    cpubind:list x4418c6s1b0n0 pid 66456 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4418c6s1b0n0 pid 66457 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4418c6s1b0n0 pid 66458 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4418c6s1b0n0 pid 66459 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66460 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4418c6s1b0n0 pid 66461 rank 11 11: mask 0x1c00000000000000000000000
    Using [24 / 24] available "xpu" devices !!
    8
    10
    0
    4
    3
    5
    7
    11
    6
    1
    9
    2
    14
    15
    12
    13
    16
    17
    19
    22
    20
    23
    18
    21
    [2026-01-08 14:58:14,252433][I][ezpz/launch:447:launch] ----[๐Ÿ‹ ezpz.launch][stop][2026-01-08-145814]----
    [2026-01-08 14:58:14,253726][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 14:58:14,254184][I][ezpz/launch:449:launch] Executing finished in 9.48 seconds.
    [2026-01-08 14:58:14,254555][I][ezpz/launch:450:launch] Took 9.48 seconds to run. Exiting.
    took: 18s
    
    demo.py
    demo.py
    import ezpz
    
    # automatic device + backend setup for distributed PyTorch
    _ = ezpz.setup_torch()  # CUDA/NCCL, XPU/XCCL, {MPS, CPU}/GLOO, ...
    
    device = ezpz.get_torch_device() # {cuda, xpu, mps, cpu, ...}
    rank = ezpz.get_rank()
    world_size = ezpz.get_world_size()
    # ...etc
    
    if rank == 0:
        print(f"Hello from rank {rank} / {world_size} on {device}!")
    

    We can launch this script with:

    ezpz launch python3 demo.py
    
    Output(s)
    MacBook Pro
    # from MacBook Pro
    $ ezpz launch python3 demo.py
    [2026-01-08 07:22:31,989741][I][ezpz/launch:515:run] No active scheduler detected; falling back to local mpirun: mpirun -np 2 python3 /Users/samforeman/python/ezpz_demo.py
    Using [2 / 2] available "mps" devices !!
    Hello from rank 0 / 2 on mps!
    
    Aurora (2 nodes)
    # from 2 nodes of Aurora:
    #[aurora_frameworks-2025.2.0](foremans-aurora_frameworks-2025.2.0)[C v7.5.0-gcc][43s]
    #[01/08/26,07:26:10][x4604c5s2b0n0][~]
    ; ezpz launch python3 demo.py
    
    [2026-01-08 07:26:19,723138][I][numexpr/utils:148:_init_num_threads] Note: detected 208 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    [2026-01-08 07:26:19,725453][I][numexpr/utils:151:_init_num_threads] Note: NumExpr detected 208 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
    [2026-01-08 07:26:19,725932][I][numexpr/utils:164:_init_num_threads] NumExpr defaulting to 16 threads.
    [2026-01-08 07:26:20,290222][I][ezpz/launch:396:launch] ----[๐Ÿ‹ ezpz.launch][started][2026-01-08-072620]----
    [2026-01-08 07:26:21,566797][I][ezpz/launch:416:launch] Job ID: 8246832
    [2026-01-08 07:26:21,567684][I][ezpz/launch:417:launch] nodelist: ['x4604c5s2b0n0', 'x4604c5s3b0n0']
    [2026-01-08 07:26:21,568082][I][ezpz/launch:418:launch] hostfile: /var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-01-08 07:26:21,568770][I][ezpz/pbs:264:get_pbs_launch_cmd] โœ… Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2026-01-08 07:26:21,569557][I][ezpz/launch:367:build_executable] Building command to execute by piecing together:
    [2026-01-08 07:26:21,569959][I][ezpz/launch:368:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-01-08 07:26:21,570821][I][ezpz/launch:369:build_executable] (2.) cmd_to_launch: python3 demo.py
    [2026-01-08 07:26:21,571548][I][ezpz/launch:433:launch] Took: 2.11 seconds to build command.
    [2026-01-08 07:26:21,571918][I][ezpz/launch:436:launch] Executing:
    mpiexec
    --envall
    --np=24
    --ppn=12
    --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    --no-vni
    --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    python3
    demo.py
    [2026-01-08 07:26:21,573262][I][ezpz/launch:220:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-01-08 07:26:21,573781][I][ezpz/launch:443:launch] Execution started @ 2026-01-08-072621...
    [2026-01-08 07:26:21,574195][I][ezpz/launch:138:run_command] Caught 24 filters
    [2026-01-08 07:26:21,574532][I][ezpz/launch:139:run_command] Running command:
    mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8246832.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 demo.py
    cpubind:list x4604c5s3b0n0 pid 131587 rank 12 0: mask 0x1c
    cpubind:list x4604c5s3b0n0 pid 131588 rank 13 1: mask 0x1c00
    cpubind:list x4604c5s3b0n0 pid 131589 rank 14 2: mask 0x1c0000
    cpubind:list x4604c5s3b0n0 pid 131590 rank 15 3: mask 0x1c000000
    cpubind:list x4604c5s3b0n0 pid 131591 rank 16 4: mask 0x1c00000000
    cpubind:list x4604c5s3b0n0 pid 131592 rank 17 5: mask 0x1c0000000000
    cpubind:list x4604c5s3b0n0 pid 131593 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4604c5s3b0n0 pid 131594 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4604c5s3b0n0 pid 131595 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4604c5s3b0n0 pid 131596 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4604c5s3b0n0 pid 131597 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4604c5s3b0n0 pid 131598 rank 23 11: mask 0x1c00000000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121225 rank 0 0: mask 0x1c
    cpubind:list x4604c5s2b0n0 pid 121226 rank 1 1: mask 0x1c00
    cpubind:list x4604c5s2b0n0 pid 121227 rank 2 2: mask 0x1c0000
    cpubind:list x4604c5s2b0n0 pid 121228 rank 3 3: mask 0x1c000000
    cpubind:list x4604c5s2b0n0 pid 121229 rank 4 4: mask 0x1c00000000
    cpubind:list x4604c5s2b0n0 pid 121230 rank 5 5: mask 0x1c0000000000
    cpubind:list x4604c5s2b0n0 pid 121231 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4604c5s2b0n0 pid 121232 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4604c5s2b0n0 pid 121233 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4604c5s2b0n0 pid 121234 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121235 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4604c5s2b0n0 pid 121236 rank 11 11: mask 0x1c00000000000000000000000
    Using [24 / 24] available "xpu" devices !!
    Hello from rank 0 / 24 on xpu!
    [2026-01-08 07:26:33,060432][I][ezpz/launch:447:launch] ----[๐Ÿ‹ ezpz.launch][stop][2026-01-08-072633]----
    [2026-01-08 07:26:33,061512][I][ezpz/launch:448:launch] Execution finished with 0.
    [2026-01-08 07:26:33,062045][I][ezpz/launch:449:launch] Executing finished in 11.49 seconds.
    [2026-01-08 07:26:33,062531][I][ezpz/launch:450:launch] Took 11.49 seconds to run. Exiting.
    took: 22s