Quick Start
Everything you need to get started: install, write a script, launch it, track metrics, and a complete API cheat sheet with before/after diffs.
For a full walkthrough with real terminal output, see the End-to-End Walkthrough.
Install
Don't have uv?
Install it first (one-liner, no Python needed):
Or use plain pip:
Editable install for development
Try without installing via uv run
If you already have a Python environment with
{torch, mpi4py} installed, you can try ezpz without installing
it:
Verify: ezpz test
After installing, run a quick smoke test to verify distributed functionality and device detection:
This trains a simple MLP on MNIST using DDP and reports timing metrics. See the W&B Report for example output.
[Optional] Shell Environment and Setup
- ezpz/bin/utils.sh: Shell script containing a collection of functions that I've accumulated over time and found to be useful. To use these, we can source the file directly from the command line:
- savejobenv: Shell function that will save relevant job-specific environment variables to a file for later use.
  Running from a PBS job: savejobenv will automatically save the relevant environment variables to a file, ~/.pbsenv, which can later be sourced via source ~/.pbsenv from, e.g., another terminal:
Distributed Training Script
import ezpz
import torch
rank = ezpz.setup_torch() # auto-detects device + backend, returns global rank
device = ezpz.get_torch_device()
model = torch.nn.Linear(128, 64).to(device)
model = ezpz.wrap_model(model, dtype="bfloat16") # FSDP by default; use_fsdp=False for DDP
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    x = torch.randn(32, 128, device=device)
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
ezpz.cleanup()
That's it: ezpz detects the available backend, initializes the process
group, wraps your model in FSDP/DDP, and assigns each rank to the correct
device. For help choosing between DDP, FSDP, and FSDP+TP, see the
Distributed Training Guide.
ezpz.synchronize()
Use ezpz.synchronize() instead of torch.cuda.synchronize() to get
correct timing measurements on any backend (CUDA, XPU, MPS, CPU).
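Why the synchronize call matters for timing: accelerator work is usually queued asynchronously, so a host-side clock can stop before the device has finished. The sketch below is only an analogy that uses a standard-library thread pool in place of a real device. The names fake_kernel and timed are hypothetical, and nothing here is part of ezpz or PyTorch.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def fake_kernel(seconds: float) -> None:
    """Stand-in for device work that runs off the host thread."""
    time.sleep(seconds)


def timed(pool: ThreadPoolExecutor, seconds: float, wait: bool) -> float:
    """Time one asynchronous launch, optionally waiting for it to finish."""
    t0 = time.perf_counter()
    future: Future = pool.submit(fake_kernel, seconds)  # returns immediately
    if wait:
        future.result()  # plays the role of a synchronize call
    elapsed = time.perf_counter() - t0
    future.result()  # always drain the work before returning
    return elapsed


with ThreadPoolExecutor(max_workers=1) as pool:
    unsynced = timed(pool, 0.05, wait=False)
    synced = timed(pool, 0.05, wait=True)

print(f"without sync: {unsynced:.4f}s, with sync: {synced:.4f}s")
```

The unsynchronized measurement captures only the cost of queuing the work, which is why it comes out far smaller than the time the work actually takes.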
Launch It
This single command works everywhere because the launcher detects the active job scheduler automatically:
| Environment | What ezpz launch runs |
|---|---|
| PBS job (Polaris, Aurora, Sunspot) | mpiexec with hostfile from $PBS_NODEFILE |
| SLURM job (Frontier, Perlmutter) | srun with SLURM topology |
| No scheduler (laptop / workstation) | mpirun -np <ngpus> fallback |
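The dispatch in the table can be pictured as a check on scheduler environment variables. This is a minimal sketch, not the actual ezpz launcher logic: pick_launcher is a hypothetical helper, and the use of SLURM_JOB_ID to detect Slurm is an assumption (only $PBS_NODEFILE is named above).

```python
import os
from typing import Mapping, Optional


def pick_launcher(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the launcher named in the table for a given environment."""
    env = os.environ if env is None else env
    if "PBS_NODEFILE" in env:  # PBS exports the path to the job's hostfile
        return "mpiexec"
    if "SLURM_JOB_ID" in env:  # assumption: Slurm is detected via this variable
        return "srun"
    return "mpirun"  # no scheduler found: local fallback


print(pick_launcher({"PBS_NODEFILE": "/tmp/nodefile"}))
print(pick_launcher({"SLURM_JOB_ID": "42"}))
print(pick_launcher({}))
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.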
Flag aliases
-n, -np, and --nproc are all aliases for the same flag (number of
processes). Similarly, -ppn and --nproc_per_node are aliases.
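For readers unfamiliar with flag aliasing, here is how such aliases could be declared with the standard argparse module. It is an illustrative sketch only; the real ezpz parser may be built differently, and the program name launch-sketch is made up.

```python
import argparse

parser = argparse.ArgumentParser(prog="launch-sketch")
# Every spelling of a flag writes to the same destination attribute.
parser.add_argument("-n", "-np", "--nproc", dest="nproc", type=int)
parser.add_argument("-ppn", "--nproc_per_node", dest="nproc_per_node", type=int)

a = parser.parse_args(["-n", "4", "-ppn", "2"])
b = parser.parse_args(["-np", "4", "--nproc_per_node", "2"])
c = parser.parse_args(["--nproc", "4", "-ppn", "2"])

print(a == b == c)
```

All three invocations parse to the same namespace, so downstream code only ever reads nproc and nproc_per_node.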
Advanced Launcher Examples
To pass arguments through to the launcher [1]:
$ ezpz launch -- python3 -m ezpz.examples.fsdp
# pass --line-buffer through to mpiexec:
$ ezpz launch --line-buffer -- python3 \
    -m ezpz.examples.vit --compile --fsdp
# Create and use a custom hostfile
$ head -n 2 "${PBS_NODEFILE}" > hostfile0-2
$ ezpz launch --hostfile hostfile0-2 -- python3 \
    -m ezpz.examples.fsdp_tp
# use explicit np/ppn/nhosts
$ ezpz launch \
    -np 4 \
    -ppn 2 \
    --nhosts 2 \
    --hostfile hostfile0-2 \
    -- \
    python3 -m ezpz.examples.diffusion
# forward the PYTHONPATH environment variable
$ ezpz launch -x PYTHONPATH=/tmp/.venv/bin:${PYTHONPATH} \
    -- \
    python3 -m ezpz.examples.fsdp
For pass-through launcher flags, custom hostfiles, and advanced usage, see the CLI reference.
Submit It
ezpz launch runs inside an existing allocation. To submit a batch job
to the scheduler queue, use ezpz submit:
This auto-generates a PBS or SLURM job script, wraps your command with
ezpz launch, and submits it. Preview the generated script first with
--dry-run:
You can also submit an existing job script directly:
See the CLI reference for the full option list, or the Distributed Training Guide for a complete production workflow walkthrough.
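To make the idea of a generated job script concrete, the sketch below renders a minimal PBS or SLURM script that wraps a command with ezpz launch. The function render_job_script and its parameters are hypothetical, and the output is not what ezpz submit actually writes. Only the scheduler directives (#PBS -l, #SBATCH) are standard.

```python
def render_job_script(command: str, scheduler: str, nodes: int, walltime: str) -> str:
    """Build a minimal batch script that wraps `command` with `ezpz launch`."""
    if scheduler == "pbs":
        header = [f"#PBS -l select={nodes}", f"#PBS -l walltime={walltime}"]
    elif scheduler == "slurm":
        header = [f"#SBATCH --nodes={nodes}", f"#SBATCH --time={walltime}"]
    else:
        raise ValueError(f"unknown scheduler: {scheduler!r}")
    # Hypothetical layout: shebang, scheduler directives, then the launch line.
    lines = ["#!/bin/bash", *header, "", f"ezpz launch -- {command}"]
    return "\n".join(lines) + "\n"


script = render_job_script("python3 -m ezpz.examples.fsdp", "pbs", 2, "01:00:00")
print(script)
```

Printing the script before submitting it mirrors the preview step described above: the text can be inspected without anything reaching the queue.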
API Cheat Sheet
Each ezpz component can be used independently; pick only what you need.
Setup & Distributed Init
- import os, torch.distributed as dist
- dist.init_process_group(backend="nccl", ...)
- rank = int(os.environ["RANK"])
- local_rank = int(os.environ["LOCAL_RANK"])
- world_size = int(os.environ["WORLD_SIZE"])
+ import ezpz
+ rank = ezpz.setup_torch() # returns global rank
+ local_rank = ezpz.get_local_rank()
+ world_size = ezpz.get_world_size()
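One detail of the removed lines is easy to miss: os.environ["RANK"] raises KeyError when the script runs outside a launcher. A hand-rolled version therefore usually needs single-process defaults, roughly as sketched here. RankInfo and read_rank_info are hypothetical names, and this is not a description of how ezpz resolves ranks internally.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RankInfo(NamedTuple):
    rank: int
    local_rank: int
    world_size: int


def read_rank_info(env: Optional[Mapping[str, str]] = None) -> RankInfo:
    """Read launcher-provided variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return RankInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


print(read_rank_info({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "8"}))
print(read_rank_info({}))
```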
Device Management
- device = torch.device("cuda")
- model.to("cuda")
- batch = batch.to("cuda")
+ device = ezpz.get_torch_device() # cuda, xpu, mps, or cpu
+ model.to(device)
+ batch = batch.to(device)
Model Wrapping
- from torch.nn.parallel import DistributedDataParallel as DDP
- model = DDP(model, device_ids=[local_rank], output_device=local_rank)
+ model = ezpz.wrap_model(model) # FSDP (default)
+ model = ezpz.wrap_model(model, use_fsdp=False) # DDP
Training Loop
  for step, batch in enumerate(dataloader):
-     batch = batch.to("cuda")
+     batch = batch.to(ezpz.get_torch_device())
      t0 = time.perf_counter()
      loss = train_step(...)
-     torch.cuda.synchronize()
+     ezpz.synchronize()
      dt = time.perf_counter() - t0
Metric Tracking
import ezpz
logger = ezpz.get_logger(__name__)
history = ezpz.History(
    project_name="my-project",  # optional
    backends="wandb",  # or "mlflow", "wandb,mlflow,csv", etc.
)
for step in range(100):
    loss = train_step(...)
    logger.info(history.update({"loss": loss.item()}, step=step))
history.finalize(outdir="./outputs")  # saves dataset + plots
History automatically computes distributed statistics (min, max, mean, std)
across all ranks, with no extra code needed on worker ranks.
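As a rough picture of what those four statistics mean, the sketch below reduces one value per rank into a summary using only the standard library. The per-rank values are made-up illustration data, summarize_across_ranks is a hypothetical name, and the choice of population standard deviation is an assumption; the source does not say which definition History uses or how it gathers values.

```python
import statistics
from typing import Dict, Sequence


def summarize_across_ranks(values: Sequence[float]) -> Dict[str, float]:
    """Reduce one metric value per rank into min/max/mean/std."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # assumption: population std
    }


per_rank_loss = [2.0, 4.0, 4.0, 6.0]  # pretend each entry came from one rank
summary = summarize_across_ranks(per_rank_loss)
print(summary)
```

In a real run the list would come from a collective operation across ranks rather than a literal.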
What finalize produces
Calling history.finalize() writes a summary dataset and generates
loss curves and other plots, ready for inspection or inclusion in
reports. See the Walkthrough
for sample output with terminal plots.
For the full History API (distributed aggregation, environment variables,
StopWatch, and more), see the Metric Tracking guide.
Next Steps
Read next: Distributed Training Guide, the progressive tutorial that takes the hello-world above through to production-grade FSDP + tensor parallelism.
Other references
- Recipes: copy-pasteable patterns (data loading, checkpointing, gradient accumulation)
- End-to-End Walkthrough: full runnable example with real terminal output
- Experiment Tracking: History guide covering distributed stats, multi-backend logging, plots
- Examples: end-to-end training scripts (FSDP, ViT, Diffusion, etc.)
- CLI Reference: ezpz launch, ezpz submit, and more
- Configuration: environment variables and config dataclasses
- Comparison vs. alternatives: vs. raw torchrun, Accelerate, DeepSpeed
- Architecture: how ezpz works under the hood
[1] This will be srun if a Slurm scheduler is detected, mpirun/mpiexec otherwise.