
🍋 ezpz

Write once, run anywhere.

ezpz makes distributed PyTorch code portable across any supported hardware {NVIDIA, AMD, Intel, MPS, CPU} with zero code changes. Built for researchers and engineers running distributed PyTorch on HPC systems (ALCF, NERSC, OLCF) or local workstations.

This lets you write Python applications that run anywhere, at any scale, with native job scheduler (PBS, Slurm)[1] integration and graceful fallbacks for running locally[2] on Mac or Linux machines.

🤔 Why ezpz?

Distributed PyTorch requires boilerplate that varies by hardware, backend, and job scheduler. ezpz replaces all of it.

train.py

```python
import ezpz
import torch

rank = ezpz.setup_torch()       # auto-detects device + backend
device = ezpz.get_torch_device()
model = torch.nn.Linear(128, 10).to(device)
model = ezpz.wrap_model(model)  # DDP by default
```

Same command everywhere (Mac laptop, NVIDIA cluster, Intel Aurora):

```bash
ezpz launch python3 train.py
```
The same train.py without ezpz:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

backend = "gloo"
device_type = "cpu"
if torch.cuda.is_available():
    backend = "nccl"
    device_type = "cuda"
elif torch.xpu.is_available():
    backend = "xccl"
    device_type = "xpu"

rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
dist.init_process_group(backend, rank=rank, world_size=world_size)

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
elif torch.xpu.is_available():
    torch.xpu.set_device(local_rank)
    device = torch.device(f"xpu:{local_rank}")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(128, 10).to(device)
model = DDP(
    model,
    device_ids=[local_rank] if backend in ["nccl", "xccl"] else None,
)
```
And a different launch command per {scheduler, cluster}:

```bash
mpiexec -np 8 --ppn 4 python3 train.py      # Polaris    @ ALCF  [NVIDIA / PBS]
mpiexec -np 24 --ppn 12 python3 train.py    # Aurora     @ ALCF  [INTEL / PBS]
srun -N 2 -n 8 python3 train.py             # Frontier   @ ORNL  [AMD / SLURM]
srun -N 2 -n 8 python3 train.py             # Perlmutter @ NERSC [NVIDIA / SLURM]
torchrun --nproc_per_node=4 train.py        # Generic    @ ???   [NVIDIA / ???]
```
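The heart of a universal launcher is noticing which scheduler you are inside. Both PBS and Slurm export well-known environment variables into job environments, so a minimal sketch of the idea (illustrative only, not ezpz's actual detection logic; the function name `detect_scheduler` is ours) might look like:

```python
import os


def detect_scheduler(env=os.environ) -> str:
    """Illustrative sketch: guess the active job scheduler from the environment."""
    if "PBS_NODEFILE" in env:  # PBS Pro / OpenPBS sets this inside a job
        return "pbs"
    if "SLURM_JOB_ID" in env:  # Slurm sets this inside an allocation
        return "slurm"
    return "local"  # no scheduler found: fall back to mpirun / single process
```

A launcher can then map the result to the right command (mpiexec for PBS, srun for Slurm, and a plain local fallback otherwise).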

🏃‍♂️ Try it out!

No cluster required — this runs on a laptop:

```bash
pip install "git+https://github.com/saforem2/ezpz"
```

hello.py

```python
import ezpz

rank = ezpz.setup_torch()
print(f"Hello from rank {rank} on {ezpz.get_torch_device()}!")
ezpz.cleanup()
```

```bash
python3 hello.py
# Hello from rank 0 on mps!
```

Ready to get started? See the Quick Start.

👀 Overview

ezpz is, at its core, a Python library that provides a variety of utilities for both writing and launching distributed PyTorch applications.

These can be roughly broken down into:

  1. 🐍 Python library (import ezpz): Python API for writing hardware-agnostic, distributed PyTorch code.

  2. 🧰 CLI (ezpz <command>): utilities for launching distributed PyTorch applications:

    • 🚀 ezpz launch: Launch commands with automatic job scheduler detection (PBS, Slurm)
    • 💯 ezpz test: Run a simple distributed smoke test
    • 🩺 ezpz doctor: Health-check your environment

✨ Features

  • Automatic distributed initialization: setup_torch() detects device + backend
  • Universal launcher: ezpz launch auto-detects PBS, Slurm, or falls back to mpirun
  • Model wrapping: wrap_model() for DDP or FSDP with one flag
  • Metric tracking: History with distributed statistics, W&B integration, and plot generation
  • Environment diagnostics: ezpz doctor checks your setup
  • Cross-backend timing: synchronize() works on CUDA, XPU, MPS, and CPU
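Cross-backend synchronization matters when timing accelerator work, because kernel launches are asynchronous: the host moves on before the device finishes. A minimal sketch of the dispatch such a helper implies (illustrative only, not ezpz's actual implementation) might look like:

```python
import torch


def synchronize(device_type: str) -> None:
    """Illustrative sketch: block until outstanding work on the device finishes."""
    if device_type == "cuda" and torch.cuda.is_available():
        torch.cuda.synchronize()
    elif device_type == "xpu" and hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    elif device_type == "mps" and hasattr(torch, "mps") and torch.backends.mps.is_available():
        torch.mps.synchronize()
    # "cpu": host code is already synchronous; nothing to do
```

Calling such a helper immediately before reading a wall-clock timer is what makes device timings meaningful.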

🔗 Next Steps

  • Quick Start — install, write a script, launch it
  • Recipes — copy-pasteable patterns for common tasks
  • Reference — complete runnable example with terminal output
  • Metric Tracking — full History guide: distributed stats, W&B, plots
  • Examples — end-to-end training scripts (FSDP, ViT, Diffusion, etc.)
  • FAQ — common questions and troubleshooting
  • Architecture — how ezpz works under the hood

  1. With first-class support for the major HPC supercomputing centers (e.g., ALCF, OLCF, NERSC).

  2. This is particularly useful if you'd like to run development or debugging experiments locally.