Shell Environmentโ๏ธ
ezpz ships a collection of Bash helper functions (all prefixed ezpz_) that
handle Python environment setup, job discovery, and launch-command construction
on HPC systems. The two main entry points are:
Already have a working environment?
If you already have a Python environment with torch and mpi4py, you can
skip the shell helpers entirely and run:
Sourcing utils.shโ๏ธ
The helpers live in
src/ezpz/bin/utils.sh.
Source them into your current shell session:
Full list of provided functions
$ functions | egrep "ezpz.+\(" | awk '{print $1}' | sort
_ezpz_install_from_git
_ezpz_setup_conda_polaris
ezpz_activate_or_create_micromamba_env
ezpz_build_bdist_wheel_from_github_repo
ezpz_check_and_kill_if_running
ezpz_check_if_already_built
ezpz_check_working_dir
ezpz_check_working_dir_pbs
ezpz_check_working_dir_slurm
ezpz_ensure_micromamba_hook
ezpz_ensure_uv
ezpz_generate_cpu_ranges
ezpz_get_cpu_bind_aurora
ezpz_get_dist_launch_cmd
ezpz_get_job_env
ezpz_get_jobenv_file
ezpz_get_jobid_from_hostname
ezpz_get_machine_name
ezpz_get_num_gpus_nvidia
ezpz_get_num_gpus_per_host
ezpz_get_num_gpus_total
ezpz_get_num_hosts
ezpz_get_num_xpus
ezpz_get_pbs_env
ezpz_get_pbs_jobid
ezpz_get_pbs_jobid_from_nodefile
ezpz_get_pbs_nodefile_from_hostname
ezpz_get_python_prefix_nersc
ezpz_get_python_root
ezpz_get_scheduler_type
ezpz_get_shell_name
ezpz_get_slurm_env
ezpz_get_slurm_running_jobid
ezpz_get_slurm_running_nodelist
ezpz_get_tstamp
ezpz_get_venv_dir
ezpz_get_working_dir
ezpz_getjobenv_main
ezpz_has
ezpz_head_n_from_pbs_nodefile
ezpz_install
ezpz_install_and_setup_micromamba
ezpz_install_micromamba
ezpz_install_uv
ezpz_is_sourced
ezpz_kill_mpi
ezpz_launch
ezpz_load_new_pt_modules_aurora
ezpz_load_python_modules_nersc
ezpz_make_slurm_nodefile
ezpz_parse_hostfile
ezpz_prepare_repo_in_build_dir
ezpz_print_hosts
ezpz_print_job_env
ezpz_qsme_running
ezpz_realpath
ezpz_require_file
ezpz_reset
ezpz_reset_pbs_vars
ezpz_save_deepspeed_env
ezpz_save_dotenv
ezpz_save_ds_env
ezpz_save_pbs_env
ezpz_save_slurm_env
ezpz_savejobenv_main
ezpz_set_proxy_alcf
ezpz_setup_alcf
ezpz_setup_conda
ezpz_setup_conda_aurora
ezpz_setup_conda_frontier
ezpz_setup_conda_perlmutter
ezpz_setup_conda_polaris
ezpz_setup_conda_sirius
ezpz_setup_conda_sophia
ezpz_setup_conda_sunspot
ezpz_setup_env
ezpz_setup_env_pt28_aurora
ezpz_setup_env_pt29_aurora
ezpz_setup_host
ezpz_setup_host_pbs
ezpz_setup_host_slurm
ezpz_setup_install
ezpz_setup_job
ezpz_setup_job_alcf
ezpz_setup_job_slurm
ezpz_setup_new_uv_venv
ezpz_setup_python
ezpz_setup_python_alcf
ezpz_setup_python_nersc
ezpz_setup_python_pt28_aurora
ezpz_setup_python_pt29_aurora
ezpz_setup_python_pt_new_aurora
ezpz_setup_srun
ezpz_setup_uv_venv
ezpz_setup_venv_from_conda
ezpz_setup_venv_from_pythonuserbase
ezpz_setup_xpu
ezpz_show_env
ezpz_tail_n_from_pbs_nodefile
ezpz_timeit
ezpz_write_job_info
ezpz_write_job_info_slurm
log_message
These are especially useful on DOE leadership computing facilities (ALCF, OLCF, NERSC) where you need to load the right modules, build virtual environments on top of system conda, and construct MPI launch commands.
ezpz_setup_envโ๏ธ
The recommended one-liner. Internally it calls:
ezpz_setup_pythonโ load modules and activate a virtual environmentezpz_setup_jobโ discover job resources and build alaunchalias
Key Functionsโ๏ธ
| Function | Description |
|---|---|
ezpz_setup_env |
Wrapper: ezpz_setup_python && ezpz_setup_job |
ezpz_setup_job |
Determine NGPUS, NGPU_PER_HOST, NHOSTS; build launch alias |
ezpz_setup_python |
Wrapper: ezpz_setup_conda && ezpz_setup_venv_from_conda |
ezpz_setup_conda |
Find and activate the appropriate conda module1 |
ezpz_setup_venv_from_conda |
From ${CONDA_NAME}, build or activate the venv in venvs/${CONDA_NAME}/ |
Shell Variables Exportedโ๏ธ
After ezpz_setup_env completes, the following are available in your shell
and visible to the Python side via os.environ:
| Variable | Example | Description |
|---|---|---|
NHOSTS |
2 |
Number of hosts in the job |
NGPU_PER_HOST |
12 |
Accelerators per host |
NGPUS |
24 |
Total accelerators |
HOSTFILE |
/var/spool/... |
Path to hostfile |
DIST_LAUNCH |
mpiexec ... |
Full launch command prefix |
JOBENV_FILE |
.jobenv |
Saved job environment file |
Setup Pythonโ๏ธ
This will:
-
Load and activate conda via
ezpz_setup_conda.How this works varies by machine:
- ALCF (Aurora, Polaris, Sophia, Sunspot, Sirius): Load the most recent conda module and activate the base environment.
- Frontier: Load AMD modules (ROCm, RCCL, etc.) and activate base conda.
- Perlmutter: Load the appropriate
pytorchmodule and activate. - Unknown: Look for a
conda,mamba, ormicromambaexecutable and use that to activate base.
/// tip | Using your own conda
If you are already in a conda environment when calling
ezpz_setup_python, it will use that environment as the base. For example, if~/conda/envs/customis active, the virtual env will be created invenvs/custom/.///
-
Build or activate a virtual environment on top of the active conda base, at
venvs/${CONDA_NAME}/.The venv directory is resolved relative to the first non-empty match:
$PBS_O_WORKDIR$SLURM_SUBMIT_DIR$(pwd)
If the venv doesn't exist, it is created with:
Setup Jobโ๏ธ
Once inside a suitable Python environment, this function constructs the launch command. It needs:
- Which machine we're on (and which scheduler: PBS or SLURM)
- How many nodes are allocated
- How many GPUs per node
- What type of GPUs they are
With this it builds the mpiexec / mpirun / srun command and exports
it as the DIST_LAUNCH variable (and a convenient launch alias).
Machine Detectionโ๏ธ
The function ezpz_get_machine_name inspects $(hostname):
| Hostname prefix | Machine | Scheduler |
|---|---|---|
x4* |
Aurora | PBS |
x1* |
Sunspot | PBS |
x3* |
Polaris2 | PBS |
sophia-* |
Sophia | PBS |
frontier* |
Frontier | SLURM |
login*/nid* |
Perlmutter | SLURM |
| (other) | Unknown | fallback |
PBS Hostfile Discoveryโ๏ธ
On PBS systems, the hostfile is found by:
ezpz_qsme_runningโ list all running jobs for$USERwith their hostsezpz_get_pbs_nodefile_from_hostnameโ match$(hostname)to a job ID- Resolve the hostfile at
/var/spool/pbs/aux/${PBS_JOBID}.*
SLURM Hostfile Discoveryโ๏ธ
On SLURM systems, the hostfile is generated from $SLURM_NODELIST using
scontrol show hostnames and written to a local file.
Intel XPU Setup (ezpz_setup_xpu)โ๏ธ
For Intel GPU training (Aurora, Sunspot), ezpz_setup_xpu loads the
oneAPI software stack and sets the runtime environment:
This is equivalent to:
module load oneapi/release/2025.3.1 hdf5 pti-gpu
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
export CCL_PROCESS_LAUNCHER=pmix
export CCL_OP_SYNC=1
export ONEAPI_DEVICE_SELECTOR="opencl:gpu;level_zero:gpu"
export TORCH_CPP_LOG_LEVEL=ERROR
| Variable | Purpose |
|---|---|
ZE_FLAT_DEVICE_HIERARCHY=FLAT |
Expose each PVC tile as a separate device |
CCL_PROCESS_LAUNCHER=pmix |
Use PMIx for oneCCL bootstrap (matches mpiexec) |
CCL_OP_SYNC=1 |
Synchronous oneCCL ops (avoids deadlocks in some workloads) |
ONEAPI_DEVICE_SELECTOR |
Restrict to GPU devices (skip CPU OpenCL backend) |
TORCH_CPP_LOG_LEVEL=ERROR |
Suppress noisy PyTorch C++ logs |
ALCF System Module Loaders (ezpz_load_modules_*)โ๏ธ
When you want to bring up your own Python / venv on top of an
ALCF system stack โ instead of using the prebuilt frameworks
(Aurora/Sunspot) or conda (Polaris) module โ use the matching
ezpz_load_modules_* helper. Each loads everything the canonical
environment module would have pulled in except the framework /
conda module itself:
ezpz_load_modules_aurora # or _sunspot / _polaris
uv venv --python=$(which python3)
source .venv/bin/activate
uv pip install ...
| Function | Loads | Excludes |
|---|---|---|
ezpz_load_modules_aurora |
oneapi, hdf5, pti-gpu + XPU runtime env + FI_MR_CACHE_MONITOR=userfaultfd |
frameworks |
ezpz_load_modules_sunspot |
oneapi, hdf5, pti-gpu + XPU runtime env |
frameworks |
ezpz_load_modules_polaris |
PrgEnv-gnu, craype-x86-milan, cray-hdf5-parallel, cudnn, gcc-native + CUDA paths, NCCL/TensorRT, ALCF proxy, MPI/JAX hints |
conda |
The Polaris helper reverse-engineers the env vars the canonical
conda module sets (/soft/modulefiles/conda/<DATE>.lua) โ CUDA
paths, NCCL/TensorRT lib paths, ALCF HTTP proxy, MPI/JAX hints โ
so the resulting environment matches what the conda module would
have produced, just without the prebuilt PyTorch/Python.
Pairs naturally with ezpz tar-env +
ezpz yeet for distributing the venv to all
worker nodes.
-
System-dependent. See
ezpz_setup_condafor the full implementation. ↩ -
For
x3*hostnames,$PBS_O_HOSTis checked to distinguish Polaris from Sirius. ↩