# 🍋 ezpz
Write once, run anywhere.
ezpz makes distributed PyTorch code portable across any supported hardware
{NVIDIA, AMD, Intel, MPS, CPU} with zero code changes.
Built for researchers and engineers running distributed PyTorch on HPC
systems (ALCF, NERSC, OLCF) or local workstations.
ezpz lets you write Python applications that run anywhere, at any scale, with native job scheduler (PBS, Slurm) integration and graceful fallbacks for running locally on Mac or Linux machines.
## 🤔 Why ezpz?
Distributed PyTorch requires boilerplate that varies by hardware, backend, and
job scheduler. ezpz replaces all of it.
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

backend = "gloo"
device_type = "cpu"
if torch.cuda.is_available():
    backend = "nccl"
    device_type = "cuda"
elif torch.xpu.is_available():
    backend = "xccl"
    device_type = "xpu"

rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
dist.init_process_group(backend, rank=rank, world_size=world_size)

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
elif torch.xpu.is_available():
    torch.xpu.set_device(local_rank)
    device = torch.device(f"xpu:{local_rank}")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(128, 10).to(device)
model = DDP(
    model,
    device_ids=[local_rank] if backend in ["nccl", "xccl"] else None,
)
```
```bash
# Different launch command per {scheduler, cluster}:
mpiexec -np 8 --ppn 4 python3 train.py    # Polaris @ ALCF      [NVIDIA / PBS]
mpiexec -np 24 --ppn 12 python3 train.py  # Aurora @ ALCF       [Intel / PBS]
srun -N 2 -n 8 python3 train.py           # Frontier @ ORNL     [AMD / Slurm]
srun -N 2 -n 8 python3 train.py           # Perlmutter @ NERSC  [NVIDIA / Slurm]
torchrun --nproc_per_node=4 train.py      # Generic @ ???       [NVIDIA / ???]
```
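With ezpz, a single launch command replaces all of the above: `ezpz launch` (see the CLI overview below) detects the active scheduler and picks the right launcher for you. The invocation below is an illustrative sketch of that workflow, not a verbatim transcript from any particular machine:

```shell
# Same command on Polaris, Aurora, Frontier, Perlmutter, or a laptop:
# `ezpz launch` auto-detects PBS or Slurm and falls back to mpirun locally.
ezpz launch python3 train.py
```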
## 🏃‍♂️ Try it out!

No cluster required — this runs on a laptop:
```python
import ezpz

rank = ezpz.setup_torch()
print(f"Hello from rank {rank} on {ezpz.get_torch_device()}!")
ezpz.cleanup()
```
Ready to get started? See the Quick Start.
## 👀 Overview
ezpz is, at its core, a Python library that provides utilities for both
writing and launching distributed PyTorch applications.
These break down, roughly, into:
- 🐍 Python library: `import ezpz`
  Python API for writing hardware-agnostic, distributed PyTorch code.
- 🧰 CLI: `ezpz <command>`
  Utilities for launching and managing distributed PyTorch jobs:
    - 🚀 `ezpz launch`: Launch commands with automatic job scheduler detection (PBS, Slurm)
    - 📤 `ezpz submit`: Submit batch jobs to PBS or Slurm
    - 💯 `ezpz test`: Run a simple distributed smoke test
    - 📊 `ezpz benchmark`: Run and compare example benchmarks
    - 🩺 `ezpz doctor`: Health-check your environment
## ✨ Features
- Automatic distributed initialization — `setup_torch()` detects device + backend
- Universal launcher — `ezpz launch` auto-detects PBS, Slurm, or falls back to `mpirun`
- Batch job submission — `ezpz submit` generates and submits PBS/Slurm job scripts
- Model wrapping — `wrap_model()` for DDP, FSDP, or FSDP+TP with one call
- Multi-backend experiment tracking — `History` with distributed statistics and automatic dispatch to W&B, MLflow, and CSV
- Environment diagnostics — `ezpz doctor` checks your setup
- Cross-backend timing — `synchronize()` works on CUDA, XPU, MPS, and CPU
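To see why cross-backend timing needs a helper at all: GPU kernels run asynchronously, so accurate wall-clock timing requires a device-specific barrier before reading the clock. The sketch below is our own minimal, pure-PyTorch illustration of that idea — it is not ezpz's actual implementation, just the pattern such a `synchronize()` has to follow:

```python
import torch


def synchronize(device) -> None:
    """Block until all queued work on `device` finishes (no-op on CPU).

    Illustrative sketch only; ezpz's own synchronize() may differ.
    """
    dev = torch.device(device)
    if dev.type == "cuda" and torch.cuda.is_available():
        torch.cuda.synchronize(dev)
    elif dev.type == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize(dev)
    elif dev.type == "mps" and torch.backends.mps.is_available():
        torch.mps.synchronize()
    # CPU work is already synchronous: nothing to wait for.
```

Because every branch is guarded by an availability check, the same call works unchanged on a CUDA node, an Intel XPU node, an Apple-silicon laptop, or a plain CPU box — which is the portability property the feature list above describes.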
## 🔗 Next Steps
- Quick Start — install, write a script, launch it
- Distributed Training Guide — progressive tutorial from hello world to production
- Recipes — copy-pasteable patterns for common tasks
- End-to-End Walkthrough — full runnable example with real terminal output
- Experiment Tracking — `History` guide: distributed stats, multi-backend logging, plots
- Examples — end-to-end training scripts (FSDP, ViT, Diffusion, etc.)
- FAQ — common questions and troubleshooting
- Architecture — how `ezpz` works under the hood