ezpz submitโ๏ธ
Submit jobs to PBS (qsub) or SLURM (sbatch) schedulers directly from
the command line.
Two Modesโ๏ธ
1. Wrap a commandโ๏ธ
Provide a command after -- and ezpz submit generates a job script
automatically:
The generated script includes:
- Scheduler directives (
#PBSor#SBATCH) - Activation of your current Python environment (venv or conda)
ezpz launchwrapping for distributed execution
2. Submit an existing scriptโ๏ธ
Pass a .sh file directly:
Optionsโ๏ธ
| Flag | Description |
|---|---|
-N, --nodes |
Number of compute nodes (default: 1) |
-t, --time |
Walltime in HH:MM:SS format (default: 01:00:00) |
-q, --queue |
Queue (PBS) or partition (SLURM) (default: debug) |
-A, --account |
Project/account for billing |
--filesystems |
PBS filesystems directive (default: home) |
--job-name |
Job name (auto-derived from command if omitted) |
--scheduler |
Force PBS or SLURM (auto-detected by default) |
--dry-run |
Print the script without submitting |
--launch |
Wrap the command with ezpz launch |
Examplesโ๏ธ
Dry-run to preview the generated scriptโ๏ธ
Submit with specific filesystems (PBS/Aurora)โ๏ธ
ezpz submit -N 2 -q debug -t 01:00:00 \
--filesystems home:eagle:grand \
-A myproject \
-- python3 -m ezpz.examples.test
Submit with ezpz launch wrappingโ๏ธ
Environment Detectionโ๏ธ
The generated script automatically activates your current environment:
- venv: If
VIRTUAL_ENVis set, addssource $VIRTUAL_ENV/bin/activate - conda: If
CONDA_PREFIXis set, addsconda activate <env_name> - Custom: If
EZPZ_SETUP_ENVpoints to a file, sources it
Example: ezpz benchmark on Auroraโ๏ธ
Submit ezpz benchmark to run on 2 nodes:
#[aurora_frameworks-2025.3.1](ezpz-aurora_frameworks-2025.3.1)
#[04/02/26,10:24:48][aurora-uan-0009][/f/A/f/p/s/ezpz][dev][$?]
; ezpz submit -A AuroraGPT -N 2 -q capacity -t 00:30:00 -- ezpz benchmark
Generated job script:
------------------------------------------------------------
#!/bin/bash --login
#PBS -l select=2
#PBS -l walltime=00:30:00
#PBS -l filesystems=home
#PBS -A AuroraGPT
#PBS -k doe
#PBS -j oe
#PBS -q capacity
#PBS -N ezpz
set -eo pipefail
cd /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/ezpz
# โโ Environment setup โโ
source <(curl -fsSL https://bit.ly/ezpz-utils) && ezpz_setup_env
ezpz benchmark
------------------------------------------------------------
Submitted job 8414055.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
Script saved to /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/ezpz/.ezpz-submit-20260402_102453.sh
took: 5s
Outputโ๏ธ
Job output
[2026-04-02-104328][I][/dev/fd/63:2850] Detected PBS scheduler environment.
[2026-04-02-104328][W][/dev/fd/63:2886] Current working directory does not match PBS_O_WORKDIR! This may cause issues with the job submission.
[2026-04-02-104328][W][/dev/fd/63:2887] PBS_O_WORKDIR /flare/AuroraGPT/foremans/projects/saforem2/ezpz
[2026-04-02-104328][W][/dev/fd/63:2888] WORKING_DIR /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/ezpz
[2026-04-02-104328][W][/dev/fd/63:2889] Exporting PBS_O_WORKDIR=WORKING_DIR=/lus/flare/projects/AuroraGPT/foremans/projects/saforem2/ezpz and continuing...
[2026-04-02-104328][I][/dev/fd/63:2582] [ezpz_setup_env]...
[2026-04-02-104328][I][/dev/fd/63:1363] [PYTHON]
[2026-04-02-104328][I][/dev/fd/63:1370] - No conda_prefix OR pythonuserbase OR virtual_env found in environment. Setting up conda...
Lmod Warning: ONEAPI_DEVICE_SELECTOR has been set to
"opencl:gpu;level_zero:gpu"
to enable Triton-XPU, vLLM, Ray, and dpctl functionality.
If you encounter issues, you can revert to:
export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
If you do need to revert, please email [email protected]
so we can track and address compatibility issues.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
frameworks/2025.3.1 /opt/aurora/26.26.0/modulefiles/frameworks/2025.3.1.lua
Lmod Warning: ONEAPI_DEVICE_SELECTOR has been set to
"opencl:gpu;level_zero:gpu"
to enable Triton-XPU, vLLM, Ray, and dpctl functionality.
If you encounter issues, you can revert to:
export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
If you do need to revert, please email [email protected]
so we can track and address compatibility issues.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
frameworks/2025.3.1 /opt/aurora/26.26.0/modulefiles/frameworks/2025.3.1.lua
[2026-04-02-104330][I][/dev/fd/63:852] Setting FI_MR_CACHE_MONITOR=userfaultfd
[2026-04-02-104330][I][/dev/fd/63:934] List of active modules:
Currently Loaded Modules:
1) gcc/13.4.0
2) oneapi/release/2025.3.1
3) mpich/opt/5.0.0.aurora_test.3c70a61
4) libfabric/1.22.0
5) cray-pals/1.8.0
6) cray-libpals/1.8.0
7) gcc-runtime/13.4.0-2tg3zy7 (H)
8) intel-oneapi-runtime/2025.3.1-h4uj4w3 (H)
9) hdf5/1.14.6
10) pti-gpu/0.16.0-rc1
11) miniforge3/25.11.0-1
12) frameworks/2025.3.1
Where:
H: Hidden Module
[2026-04-02-104330][I][/dev/fd/63:1383] - Found Python at /opt/aurora/26.26.0/frameworks/aurora_frameworks-2025.3.1
[2026-04-02-104330][I][/dev/fd/63:1204] - Found python root at /opt/aurora/26.26.0/frameworks/aurora_frameworks-2025.3.1
[2026-04-02-104330][I][/dev/fd/63:1219] - No VIRTUAL_ENV found in environment!
[2026-04-02-104330][I][/dev/fd/63:1222] - Looking for venv in venvs/aurora/ezpz-aurora_frameworks-2025.3.1...
[2026-04-02-104330][I][/dev/fd/63:1246] - Activating existing venv in VENV_DIR=venvs/ezpz-aurora_frameworks-2025.3.1
[2026-04-02-104330][I][/dev/fd/63:1248] - Found /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/ezpz/venvs/aurora/ezpz-aurora_frameworks-2025.3.1/bin/activate
[2026-04-02-104330][I][/dev/fd/63:1418] - Using python from: /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/ezpz/venvs/aurora/ezpz-aurora_frameworks-2025.3.1/bin/python3
[2026-04-02-104330][I][/dev/fd/63:2424] [JOB]
[2026-04-02-104330][I][/dev/fd/63:2425] - Parsing job env for foremans
[2026-04-02-104330][I][/dev/fd/63:2426] - Detected pbs scheduler
[2026-04-02-104330][I][/dev/fd/63:2427] - Machine: aurora
[2026-04-02-104330][I][/dev/fd/63:2428] - Hostname: x4514c6s0b0n0
[2026-04-02-104330][I][/dev/fd/63:2338] - PBS_JOBID=8414055.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
to calculate:
- num_hosts: 2
- num_cores_per_host: 208
- num_cpus_per_host: 104
- num_gpus_per_host: 12
- depth: 8
- num_gpus: 24
[2026-04-02-104331][I][/dev/fd/63:1844] [HOSTS]
[2026-04-02-104331][I][/dev/fd/63:1846] - Detected PBS Scheduler
[2026-04-02-104331][I][/dev/fd/63:1864] - HOSTFILE=/var/spool/pbs/aux/8414055.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2026-04-02-104331][I][/dev/fd/63:1865] - NHOSTS=2
[2026-04-02-104331][I][/dev/fd/63:1866] - HOSTS:
[2026-04-02-104331][I][/dev/fd/63:1869] - [host:0] - x4514c6s0b0n0.hsn.cm.aurora.alcf.anl.gov
[2026-04-02-104331][I][/dev/fd/63:1869] - [host:1] - x4514c6s1b0n0.hsn.cm.aurora.alcf.anl.gov
[2026-04-02-104331][I][/dev/fd/63:2030] [DIST_INFO]
[2026-04-02-104331][I][/dev/fd/63:2031] - HOSTFILE=/var/spool/pbs/aux/8414055.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2026-04-02-104331][I][/dev/fd/63:2032] - NHOSTS=2
[2026-04-02-104331][I][/dev/fd/63:2033] - NGPU_PER_HOST=12
[2026-04-02-104331][I][/dev/fd/63:2034] - NGPUS=24
[2026-04-02-104331][I][/dev/fd/63:2606] [โ] Finished [ezpz_setup_env]
Benchmark output directory: outputs/benchmarks/20260402_104339
Environment info written to outputs/benchmarks/20260402_104339/env.json
Running 7 example(s): test, fsdp, vit, fsdp_tp, diffusion, hf, hf_trainer
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
[1/7] Running: test
cmd: ezpz launch python3 -m ezpz.examples.test --model small
log: outputs/benchmarks/20260402_104339/test.log
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ test completed in 2m 36s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
[2/7] Running: fsdp
ETA: ~15m 37s remaining
cmd: ezpz launch python3 -m ezpz.examples.fsdp --model small
log: outputs/benchmarks/20260402_104339/fsdp.log
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ fsdp completed in 40s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
[3/7] Running: vit
ETA: ~8m 12s remaining
cmd: ezpz launch python3 -m ezpz.examples.vit --model small --warmup 0 --fsdp
log: outputs/benchmarks/20260402_104339/vit.log
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ vit completed in 59s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
[4/7] Running: fsdp_tp
ETA: ~5m 41s remaining
cmd: ezpz launch python3 -m ezpz.examples.fsdp_tp --model small --dataset stanfordnlp/imdb
log: outputs/benchmarks/20260402_104339/fsdp_tp.log
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ fsdp_tp completed in 2m 05s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
[5/7] Running: diffusion
ETA: ~4m 46s remaining
cmd: ezpz launch python3 -m ezpz.examples.diffusion --model small --dataset stanfordnlp/imdb
log: outputs/benchmarks/20260402_104339/diffusion.log
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ diffusion completed in 45s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
[6/7] Running: hf
ETA: ~2m 50s remaining
cmd: ezpz launch python3 -m ezpz.examples.hf --dataset_name=eliplutchok/fineweb-small-sample --streaming --model_name_or_path meta-llama/Llama-3.2-1B --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --max-steps=100 --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=100 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=2048 --fsdp=auto_wrap --output_dir=outputs/benchmarks/20260402_104339/outputs/ezpz.hf/20260402_104339
log: outputs/benchmarks/20260402_104339/hf.log
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ hf completed in 3m 53s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
[7/7] Running: hf_trainer
ETA: ~1m 50s remaining
cmd: ezpz launch python3 -m ezpz.examples.hf_trainer --dataset_name=eliplutchok/fineweb-small-sample --streaming --model_name_or_path meta-llama/Llama-3.2-1B --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --max-steps=100 --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=100 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=2048 --fsdp=auto_wrap --output_dir=outputs/benchmarks/20260402_104339/outputs/ezpz.hf_trainer/20260402_104339
log: outputs/benchmarks/20260402_104339/hf_trainer.log
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ hf_trainer completed in 2m 53s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
7/7 passed in 13m 54s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ test 2m 36s
โ fsdp 40s
โ vit 59s
โ fsdp_tp 2m 05s
โ diffusion 45s
โ hf 3m 53s
โ hf_trainer 2m 53s
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Generating report...
Benchmark Reportโ๏ธ
| Key | Value |
|---|---|
| Date | 2026-04-02T15:43:39 |
| Git Commit | 3f8cb86 (branch: dev) |
| Job ID | 8414055.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov (PBS) |
| Nodes | 2 ร 12 GPUs = 24 total |
| Python | 3.12.12 |
| PyTorch | 2.10.0a0+git449b176 |
| ezpz | 0.11.3 |
| Example | Status | Wall Time | Steps | Final Loss | Mean dt (s) | Throughput | W&B |
|---|---|---|---|---|---|---|---|
| test | โ | 2m36s | 199 | 0.2045 | 0.0151 | โ | link |
| fsdp | โ | 40s | โ | โ | โ | โ | link |
| vit | โ | 59s | 4 | โ | โ | โ | link |
| fsdp_tp | โ | 2m05s | โ | โ | โ | โ | link |
| diffusion | โ | 45s | โ | โ | โ | โ | link |
| hf | โ | 3m53s | 105 | 1.5889 | 0.4126 | 124692 tok/s | link |
| hf_trainer | โ | 2m53s | โ | โ | โ | โ | link |
Account Fallbackโ๏ธ
If --account is not provided, ezpz submit checks these environment
variables in order:
PBS_ACCOUNT(PBS)SLURM_ACCOUNT(SLURM)PROJECT