# Supported Systems
ezpz auto-detects the machine, scheduler, accelerator type, and distributed backend so that the same user code runs everywhere. This page documents the supported systems, how detection works, and the environment variables you can use to override defaults.
## Systems Matrix

| System | Scheduler | Launcher | GPU Type | Backend | Hostfile Source |
|---|---|---|---|---|---|
| Aurora | PBS Pro | `mpiexec` | Intel XPU | `ccl` | `/var/spool/pbs/aux/$PBS_JOBID.*` |
| Sunspot | PBS Pro | `mpiexec` | Intel XPU | `ccl` | `/var/spool/pbs/aux/$PBS_JOBID.*` |
| Polaris | PBS Pro | `mpiexec` | NVIDIA GPU | `nccl` | `/var/spool/pbs/aux/$PBS_JOBID.*` |
| Sophia | PBS Pro | `mpiexec` | NVIDIA GPU | `nccl` | `/var/spool/pbs/aux/$PBS_JOBID.*` |
| Sirius | PBS Pro | `mpiexec` | NVIDIA GPU | `nccl` | `/var/spool/pbs/aux/$PBS_JOBID.*` |
| Frontier | SLURM | `srun` | AMD GPU | `nccl` | `$SLURM_NODELIST` |
| Perlmutter | SLURM | `srun` | NVIDIA GPU | `nccl` | `$SLURM_NODELIST` |
| Local | None | `mpirun` | CPU or GPU | `gloo` | None |
## Machine Detection

Both the Python and shell sides detect the machine from the hostname:

| Hostname Prefix | Machine | Notes |
|---|---|---|
| `x4*` | Aurora | Or `aurora*` on login nodes |
| `x1*` | Sunspot | Or `uan*` on login nodes |
| `x3*` | Polaris | Sirius if "sirius" appears in `$PBS_O_HOST` |
| `sophia-*` | Sophia | |
| `frontier*` | Frontier | |
| `login*` / `nid*` | Perlmutter | |

- Python: `get_machine()` in `ezpz.distributed`.
- Shell: `ezpz_get_machine_name()` in `utils.sh`.
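As a rough sketch, the prefix mapping above can be expressed as a single `case` statement. This is a simplified illustration, not the actual implementation; in particular, the real `ezpz_get_machine_name()` also distinguishes Sirius from Polaris via `$PBS_O_HOST`.

```shell
#!/bin/sh
# Simplified sketch of hostname-prefix machine detection.
# The real ezpz_get_machine_name() also distinguishes Sirius from Polaris
# by inspecting $PBS_O_HOST.
get_machine_name() {
  case "$1" in
    x4* | aurora*) echo "aurora" ;;
    x1* | uan*)    echo "sunspot" ;;
    x3*)           echo "polaris" ;;
    sophia-*)      echo "sophia" ;;
    frontier*)     echo "frontier" ;;
    login* | nid*) echo "perlmutter" ;;
    *)             echo "unknown" ;;
  esac
}

get_machine_name "$(hostname)"
```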
## Scheduler Detection

Python: `get_scheduler()` in `ezpz.configs`:

1. `PBS_JOBID` set → `"PBS"`
2. `SLURM_JOB_ID` or `SLURM_JOBID` set → `"SLURM"`
3. Machine-name fallback (ALCF → PBS, Frontier/Perlmutter → SLURM)
4. Otherwise → `"UNKNOWN"`

Shell: `ezpz_get_scheduler_type()` applies the same logic.
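The same logic, sketched as a shell function. This mirrors the steps above but omits the machine-name fallback for brevity; it is an illustration, not the actual `ezpz_get_scheduler_type()` body.

```shell
#!/bin/sh
# Sketch of scheduler detection: env vars first, "UNKNOWN" last.
# (The machine-name fallback step is omitted for brevity.)
get_scheduler_type() {
  if [ -n "${PBS_JOBID:-}" ]; then
    echo "PBS"
  elif [ -n "${SLURM_JOB_ID:-}" ] || [ -n "${SLURM_JOBID:-}" ]; then
    echo "SLURM"
  else
    echo "UNKNOWN"
  fi
}

get_scheduler_type
```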
## Backend Selection

`get_torch_backend()` selects the `torch.distributed` backend:

| Condition | Backend |
|---|---|
| `TORCH_BACKEND` env var set | Value of env var |
| Device is `xpu` (Intel) | `ccl` |
| Device is `cuda` (NVIDIA/AMD) | `nccl` |
| Device is `cpu` or `mps` | `gloo` |

Override with `export TORCH_BACKEND=<backend>`.
## Environment Variables Reference

### ezpz Internal

| Variable | Default | Description |
|---|---|---|
| `EZPZ_LOG_LEVEL` | `"INFO"` | Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`) |
| `EZPZ_LOG_FROM_ALL_RANKS` | unset | If set (any truthy value), all ranks log, not just rank 0 |
| `EZPZ_RUN_COMMAND` | unset | Set by `ezpz.launch` to track the running command |
### Distributed Topology

These variables are read during `setup_torch()` and written after MPI initialization so that downstream code can access them:

| Variable | Description |
|---|---|
| `RANK` | Global MPI rank |
| `LOCAL_RANK` | Rank within the current node |
| `WORLD_SIZE` | Total number of processes |
| `LOCAL_WORLD_SIZE` | Processes per node |
| `MASTER_ADDR` | Address for `torch.distributed` rendezvous |
| `MASTER_PORT` | Port for `torch.distributed` rendezvous |
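Inside a launched job, a quick loop can confirm these were exported (illustrative; any unset variable prints `unset`):

```shell
#!/bin/sh
# Print the torch.distributed topology variables set by ezpz.
print_topology() {
  for v in RANK LOCAL_RANK WORLD_SIZE LOCAL_WORLD_SIZE MASTER_ADDR MASTER_PORT; do
    # Indirect expansion via eval: prints "NAME=value" (or "NAME=unset").
    eval "printf '%s=%s\n' \"$v\" \"\${$v:-unset}\""
  done
}

print_topology
```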
### Local Rank Fallback Chain

If `LOCAL_RANK` is not set, these are checked in order:

1. `PMI_LOCAL_RANK`
2. `OMPI_COMM_WORLD_LOCAL_RANK`
3. `MPI_LOCALRANKID`
4. `MPICH_LOCALRANKID`
5. `SLURM_LOCAL_ID`
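A sketch of how such a chain can be walked in shell. The trailing `0` default is an assumption added for illustration, not necessarily what ezpz does when every variable is unset.

```shell
#!/bin/sh
# Walk the LOCAL_RANK fallback chain; the first non-empty variable wins.
get_local_rank() {
  for var in LOCAL_RANK PMI_LOCAL_RANK OMPI_COMM_WORLD_LOCAL_RANK \
             MPI_LOCALRANKID MPICH_LOCALRANKID SLURM_LOCAL_ID; do
    eval "val=\${$var:-}"
    if [ -n "$val" ]; then
      echo "$val"
      return 0
    fi
  done
  echo "0"  # assumed default, for illustration only
}

get_local_rank
```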
### GPUs-Per-Node Fallback Chain

If `NGPU_PER_HOST` is not set, these are checked in order:

1. `LOCAL_WORLD_SIZE`
2. `PMI_LOCAL_SIZE`
3. `SLURM_NTASKS_PER_NODE`
### Device and Backend Overrides

| Variable | Default | Description |
|---|---|---|
| `TORCH_DEVICE` | auto-detected | Force device type: `cuda`, `xpu`, `mps`, `cpu` |
| `TORCH_BACKEND` | auto-detected | Force distributed backend: `nccl`, `ccl`, `gloo` |
| `TORCH_DDP_TIMEOUT` | `3600` | Timeout (seconds) for `init_process_group` |
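For example, to force a CPU-only run with `gloo` while debugging connectivity (the timeout value here is illustrative):

```shell
#!/bin/sh
# Force CPU + gloo and shorten the DDP timeout while debugging connectivity.
export TORCH_DEVICE=cpu
export TORCH_BACKEND=gloo
export TORCH_DDP_TIMEOUT=600  # seconds; default is 3600
```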
### Scheduler / Job

| Variable | Set By | Description |
|---|---|---|
| `PBS_JOBID` | PBS Pro | Job ID (triggers PBS detection) |
| `PBS_NODEFILE` | PBS Pro | Path to allocated-nodes file |
| `PBS_O_WORKDIR` | PBS Pro | Submission directory |
| `SLURM_JOB_ID` | SLURM | Job ID (triggers SLURM detection) |
| `SLURM_NODELIST` | SLURM | Compact node list |
| `SLURM_NNODES` | SLURM | Number of allocated nodes |
| `SLURM_SUBMIT_DIR` | SLURM | Submission directory |
| `HOSTFILE` | User | Override hostfile path for any scheduler |
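For instance, to supply your own hostfile regardless of scheduler (path and node names are illustrative):

```shell
#!/bin/sh
# Write a custom hostfile (one hostname per line) and point ezpz at it.
hostfile="${TMPDIR:-/tmp}/my_hostfile"
printf '%s\n' node01 node02 > "$hostfile"
export HOSTFILE="$hostfile"
```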
### Weights & Biases

| Variable | Default | Description |
|---|---|---|
| `WANDB_DISABLED` | unset | Disable wandb entirely |
| `WANDB_MODE` | `"offline"` | `online`, `offline`, `disabled`, `shared` |
| `WANDB_API_KEY` | unset | API authentication key |
| `WANDB_PROJECT` | unset | Project name (also checks `WB_PROJECT`, `WB_PROJECT_NAME`) |
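A typical offline setup for compute nodes without outbound network access (the project name is hypothetical):

```shell
#!/bin/sh
# Log offline during the job; sync from a login node afterwards with:
#   wandb sync <offline-run-directory>
export WANDB_MODE=offline
export WANDB_PROJECT=my-project  # hypothetical project name
```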
## Overrides and Tips

- Custom hostfile: pass `--hostfile /path/to/hostfile` to `ezpz launch`, or `export HOSTFILE=/path/to/hostfile`.
- Override rank counts: use `-n` (total ranks) and `-ppn` (ranks per node) with `mpiexec`, or `--ntasks` / `--ntasks-per-node` with `srun`.
- Debugging backend issues: set `TORCH_BACKEND=gloo` to fall back to a CPU-only backend while debugging connectivity.
- Wandb network issues: set `WANDB_MODE=offline` and sync later with `wandb sync`.
## Known Failure Modes

| Symptom | Diagnosis | Fix |
|---|---|---|
| Scheduler not detected | `PBS_NODEFILE` / `SLURM_NODELIST` not in env | Set `EZPZ_LOG_LEVEL=DEBUG` and check output; use `--hostfile` |
| `mpiexec` / `srun` not found | Module not loaded | `module load` the appropriate MPI or use the full path |
| Backend init fails (`nccl`/`ccl`) | Driver or module mismatch | Verify GPU drivers and modules; fall back to `TORCH_BACKEND=gloo` |
| Wandb hangs on init | No network on compute nodes | `export WANDB_MODE=offline` |
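When detection misbehaves, a quick first check is to turn on debug logging and dump the scheduler-related environment (a minimal sketch):

```shell
#!/bin/sh
# Enable verbose ezpz logging and list scheduler variables, if any.
export EZPZ_LOG_LEVEL=DEBUG
env | grep -E '^(PBS|SLURM)' || echo "no scheduler variables in environment"
```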