# 🍋 ezpz
Write once, run anywhere.
ezpz makes distributed PyTorch code portable across any supported hardware
{NVIDIA, AMD, Intel, MPS, CPU} with zero code changes.
Built for researchers and engineers running distributed PyTorch on HPC
systems (ALCF, NERSC, OLCF) or local workstations.
ezpz lets you write Python applications that run anywhere, at any scale, with native job scheduler (PBS, Slurm) integration and graceful fallbacks for running locally on Mac or Linux machines.
## 🤔 Why ezpz?
Distributed PyTorch requires boilerplate that varies by hardware, backend, and
job scheduler. ezpz replaces all of it.
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

backend = "gloo"
device_type = "cpu"
if torch.cuda.is_available():
    backend = "nccl"
    device_type = "cuda"
elif torch.xpu.is_available():
    backend = "xccl"
    device_type = "xpu"

rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
dist.init_process_group(backend, rank=rank, world_size=world_size)

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
elif torch.xpu.is_available():
    torch.xpu.set_device(local_rank)
    device = torch.device(f"xpu:{local_rank}")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(128, 10).to(device)
model = DDP(
    model,
    device_ids=[local_rank] if backend in ["nccl", "xccl"] else None,
)
```
```bash
# Different launch command per {scheduler, cluster}:
mpiexec -np 8 --ppn 4 python3 train.py    # Polaris @ ALCF      [NVIDIA / PBS]
mpiexec -np 24 --ppn 12 python3 train.py  # Aurora @ ALCF       [Intel / PBS]
srun -N 2 -n 8 python3 train.py           # Frontier @ ORNL     [AMD / Slurm]
srun -N 2 -n 8 python3 train.py           # Perlmutter @ NERSC  [NVIDIA / Slurm]
torchrun --nproc_per_node=4 train.py      # Generic @ ???       [NVIDIA / ???]
```
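With ezpz, a single launch command replaces all of the above: `ezpz launch` (see the CLI overview below) detects the active scheduler and picks the right launcher for you. The invocation below is an illustrative sketch of that workflow, not a verbatim transcript from any particular machine:

```shell
# Same command on Polaris, Aurora, Frontier, Perlmutter, or a laptop:
# `ezpz launch` auto-detects PBS or Slurm and falls back to mpirun locally.
ezpz launch python3 train.py
```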
## 🏃‍♂️ Try it out!

No cluster required — this runs on a laptop:
```python
import ezpz

rank = ezpz.setup_torch()
print(f"Hello from rank {rank} on {ezpz.get_torch_device()}!")
ezpz.cleanup()
```
Ready to get started? See the Quick Start.
## 👀 Overview
ezpz is, at its core, a Python library that provides utilities for both
writing and launching distributed PyTorch applications.
These break down, roughly, into:
- 🐍 Python library: `import ezpz`
  Python API for writing hardware-agnostic, distributed PyTorch code.
- 🧰 CLI: `ezpz <command>`
  Utilities for launching and managing distributed PyTorch jobs:
    - 🚀 `ezpz launch`: Launch commands with automatic job scheduler detection (PBS, Slurm)
    - 📤 `ezpz submit`: Submit batch jobs to PBS or Slurm
    - 💯 `ezpz test`: Run a simple distributed smoke test
    - 📊 `ezpz benchmark`: Run and compare example benchmarks
    - 🩺 `ezpz doctor`: Health-check your environment
## ✨ Features
- Automatic distributed initialization — `setup_torch()` detects device + backend
- Universal launcher — `ezpz launch` auto-detects PBS, Slurm, or falls back to `mpirun`
- Batch job submission — `ezpz submit` generates and submits PBS/Slurm job scripts
- Model wrapping — `wrap_model()` for DDP, FSDP, or FSDP+TP with one call
- Multi-backend experiment tracking — `History` with distributed statistics and automatic dispatch to W&B, MLflow, and CSV
- Environment diagnostics — `ezpz doctor` checks your setup
- Cross-backend timing — `synchronize()` works on CUDA, XPU, MPS, and CPU
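To see why cross-backend timing needs a helper at all: GPU kernels run asynchronously, so accurate wall-clock timing requires a device-specific barrier before reading the clock. The sketch below is our own minimal, pure-PyTorch illustration of that idea — it is not ezpz's actual implementation, just the pattern such a `synchronize()` has to follow:

```python
import torch


def synchronize(device) -> None:
    """Block until all queued work on `device` finishes (no-op on CPU).

    Illustrative sketch only; ezpz's own synchronize() may differ.
    """
    dev = torch.device(device)
    if dev.type == "cuda" and torch.cuda.is_available():
        torch.cuda.synchronize(dev)
    elif dev.type == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize(dev)
    elif dev.type == "mps" and torch.backends.mps.is_available():
        torch.mps.synchronize()
    # CPU work is already synchronous: nothing to wait for.
```

Because every branch is guarded by an availability check, the same call works unchanged on a CUDA node, an Intel XPU node, an Apple-silicon laptop, or a plain CPU box — which is the portability property the feature list above describes.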
## 🔗 Next Steps
- Quick Start — install, write a script, launch it
- Distributed Training Guide — progressive tutorial from hello world to production
- Recipes — copy-pasteable patterns for common tasks
- End-to-End Walkthrough — full runnable example with real terminal output
- Experiment Tracking — `History` guide: distributed stats, multi-backend logging, plots
- Examples — end-to-end training scripts (FSDP, ViT, Diffusion, etc.)
- FAQ — common questions and troubleshooting
- Architecture — how `ezpz` works under the hood