HF Trainer Comparison⚓︎
📊 Data⚓︎
| CONFIG | Polaris | Aurora |
|---|---|---|
| Local Batch Size | 1 | 1 |
| World Size | 8 | 24 |
| Global Batch Size | 8 | 24 |
| Sequence Length | 8192 | 8192 |
| Epoch | 1.0 | 2.24 |
| Perplexity | 16.6743 | 12.7268 |
| Input tokens seen | 6,553,600 | 19,660,800 |
| Total FLOPs | 6,691,434 GF | 6,691,434 GF |
| METRICS | Polaris | Aurora |
|---|---|---|
steps/sec |
0.245 | 0.334 |
samples/sec |
1.958 | 8.021 |
tokens/sec |
2,004.87 | 2,737.98 |
tokens/sample |
~ 10243 | ~ 3414 |
tokens/step |
~ 81835 | ~ 81976 |
tokens/sec/gpu |
250.61 | 114.08 |
| TRAIN | Polaris | Aurora |
|---|---|---|
| loss | 3.0482 | 2.8478 |
| runtime | 0:06:48.60 | 0:04:59.19 |
| samples | 25,000 | 25,000 |
| EVAL | Polaris | Aurora |
|---|---|---|
| accuracy | 0.4328 | 0.4701 |
| loss | 2.8139 | 2.5437 |
| runtime | 0:00:06.26 | 0:00:09.12 |
| samples | 50 | 50 |
samples/sec |
1.118 | 0.329 |
steps/sec |
0.16 | 0.11 |
| W&B run | glad-moon-11 | cosmic-sunset-52 |
Model Config⚓︎
AuroraGPT-2B
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"dtype": "bfloat16",
"eos_token_id": 128001,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"transformers_version": "4.57.3",
"use_cache": true,
"vocab_size": 128256
}
🔍 Details⚓︎
Samples vs. Tokens per Step
-
[samples/step] = [samples/sec] / [steps/sec]- Polaris:
1.958 / 0.245 ≈ 8.0samples/step (✅ matches global batch 8) - Aurora :
8.021 / 0.334 ≈ 24.0samples/step (✅ matches global batch 24)
- Polaris:
-
[tokens/step] = [tokens/sec]/[steps/sec]- Polaris:
2004.87 / 0.245 ≈ 8183tokens/step - Aurora :
2737.98 / 0.334 ≈ 8197tokens/step
- Polaris:
So both runs are doing ~8192 tokens/step (close enough).
Since we know that \(\text{tokens/sec} = (\text{tokens/step}) \times (\text{steps/sec})\), then:
The difference in
tokens/secis almost entirely explained by the difference insteps/sec.
Fair Comparison⚓︎
Since tokens/step is ~ equal, compare steps/sec (or step time):
- Polaris:
0.245 steps/sec- → step time ≈ 4.08 s/step
- Aurora :
0.334 steps/sec- → step time ≈ 2.99 s/step
Aurora is ~1.36× faster per optimizer step (0.334 / 0.245), and therefore
~1.36× faster in tokens/sec, for this exact training configuration.
We can normalize by device count (though this penalizes Aurora with 3× more devices):
-
steps/sec/device:- Polaris:
0.245 / 8 = 0.0306 - Aurora :
0.334 / 24 = 0.0139
So per device, Polaris is ~2.2× "better" on this metric.
- Polaris:
But this is to be somewhat expected since they're operating at different scales, i.e.: 24-way data parallel will usually lose per-device efficiency to comms/overheads vs 8-way, unless you increase work per device (bigger microbatch, longer seq, etc.).
So:
- Time-to-train / throughput at chosen scale:
- Aurora: 1.36× higher
steps/secandtokens/sec; this is what we'd care about, operationally.
- Aurora: 1.36× higher
- Scaling efficiency
-
Per-device efficiency ratio:
i.e. Aurora with 24 devices is delivering ~45% of the per-device step throughput on Polaris with 8 devices.
(how much we're paying for using more devices)
-
🫸 Packing in our Sequences⚓︎
Note that the tokens/sample are different:
- Polaris: ~
8183 / 8 ≈ 1024[tokens/sample] - Aurora : ~
8197 / 24 ≈ 341[tokens/sample]
This means that we're not feeding the same "sample" definition (sequence length / packing / truncation).
This is OK, as long as we're comparing per-step or per-token, but not per-sample!
In order to truly do a fair comparison, we'd need to:
- fix
seq_lenand packing so tokens/sample matches, then - sweep DP size (8 vs 24) on each system and plot step time vs devices
This would tell us whether Aurora is "less efficient per device" because of comms, kernel maturity, input pipeline, or just the scaling point we chose.
🪵 Logs⚓︎
Aurora
#[aurora_frameworks-2025.2.0](ezpz-distributed-metrics-aurora_frameworks-2025.2.0)
#[12/18/25,13:21:37][x4310c7s4b0n0][/f/A/A/E/A/t/s/ezpz-distributed-metrics][distributed-metrics][$?] [1m11s]
; ckpt=/flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650
; ezpz launch python3 -m ezpz.examples.hf_trainer \
--streaming \
--dataset_name=stanfordnlp/imdb \
--model_name_or_path="${ckpt}" \
--bf16=true \
--do_train=true \
--do_eval=true \
--report-to=wandb \
--logging-steps=1 \
--include-tokens-per-second=true \
--max-steps=100 \
--include-num-input-tokens-seen=true \
--optim=adamw_torch \
--logging-first-step \
--include-for-metrics='inputs,loss' \
--max-eval-samples=50 \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--block_size=8192 \
--gradient_checkpointing=true \
--fsdp=auto_wrap \
--output_dir=$(tstamp)
-
Output
# [2025-12-18 13:21:50,091003][I][ezpz/launch:378:launch] ----[🍋 ezpz.launch][started][2025-12-18-132150]---- # [2025-12-18 13:21:51,231621][I][ezpz/launch:396:launch] Job ID: 8219131 # [2025-12-18 13:21:51,232430][I][ezpz/launch:397:launch] nodelist: ['x4310c7s4b0n0', 'x4418c6s1b0n0'] # [2025-12-18 13:21:51,232855][I][ezpz/launch:398:launch] hostfile: /var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov # [2025-12-18 13:21:51,233490][I][ezpz/pbs:329:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host] # [2025-12-18 13:21:51,234263][I][ezpz/launch:354:build_executable] Building command to execute by piecing together: # [2025-12-18 13:21:51,234692][I][ezpz/launch:355:build_executable] (1.) launch_cmd: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 [2025-12-18 13:21:51,235394][I][ezpz/launch:356:build_executable] (2.) cmd_to_launch: python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-132143 # [2025-12-18 13:21:51,237243][I][ezpz/launch:412:launch] Took: 1.22 seconds to build command. # [2025-12-18 13:21:51,237645][I][ezpz/launch:413:launch] Executing: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-132143 [2025-12-18 13:21:51,240032][I][ezpz/launch:213:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG [2025-12-18 13:21:51,240580][I][ezpz/launch:420:launch] Execution started @ 2025-12-18-132151... [2025-12-18 13:21:51,241009][I][ezpz/launch:421:launch] ----[🍋 ezpz.launch][stop][2025-12-18-132151]---- [2025-12-18 13:21:51,241473][I][ezpz/launch:132:run_command] Caught 24 filters [2025-12-18 13:21:51,241853][I][ezpz/launch:133:run_command] Running command: mpiexec --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/8219131.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /flare/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/public/sophiag/hf/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-132143 [2025-12-18 13:22:09,760765][I][ezpz/dist:1926:setup_wandb] Using WB_PROJECT=ezpz-hf_trainer--flare-AuroraGPT-AuroraGPT-v1-Experiments-AuroraGPT-2B-public-sophiag-hf-global_step138650 [2025-12-18 13:22:11,079430][I][ezpz/dist:1955:setup_wandb] wandb.run=[cosmic-sunset-5](https://wandb.ai/aurora_gpt/ezpz-hf_trainer--flare-AuroraGPT-AuroraGPT-v1-Experiments-AuroraGPT-2B-public-sophiag-hf-global_step138650/runs/pqytcarn) [WARNING|trainer.py:982] 2025-12-18 13:23:46,860 >> The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly,being updated with the tokenizers values. Updated tokens: {'eos_token_id': 1, 'bos_token_id': 2, 'pad_token_id': 0}. [INFO|trainer.py:2519] 2025-12-18 13:23:51,164 >> ***** Running training ***** [INFO|trainer.py:2520] 2025-12-18 13:23:51,164 >> Num examples = 2,400 [INFO|trainer.py:2521] 2025-12-18 13:23:51,164 >> Num Epochs = 9,223,372,036,854,775,807 [INFO|trainer.py:2522] 2025-12-18 13:23:51,164 >> Instantaneous batch size per device = 1 [INFO|trainer.py:2525] 2025-12-18 13:23:51,164 >> Total train batch size (w. parallel, distributed & accumulation) = 24 [INFO|trainer.py:2526] 2025-12-18 13:23:51,165 >> Gradient Accumulation steps = 1 [INFO|trainer.py:2527] 2025-12-18 13:23:51,165 >> Total optimization steps = 100 [INFO|trainer.py:2528] 2025-12-18 13:23:51,165 >> Number of trainable parameters = 82,752,264 [INFO|integration_utils.py:867] 2025-12-18 13:23:51,171 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" 0%| | 0/100 [00:00<?, ?it/s] # [...clipped...] {'loss': 2.6031, 'grad_norm': 1.8801486492156982, 'learning_rate': 5.000000000000001e-07, 'epoch': 2.24, 'num_input_tokens_seen': 19660800, 'train_runtime': 252.007, 'train_tokens_per_second': 78016.894} 100%|##########| 100/100 [04:11<00:00, 2.27s/it] # [...clipped...] [INFO|trainer.py:4309] 2025-12-18 13:28:05,079 >> Saving model checkpoint to 2025-12-18-132143/checkpoint-100 {'train_runtime': 299.1983, 'train_samples_per_second': 8.021, 'train_steps_per_second': 0.334, 'train_tokens_per_second': 2737.984, 'train_loss': 2.8478338932991027, 'epoch': 2.24, 'num_input_tokens_seen': 19660800} 100%|##########| 100/100 [04:59<00:00, 2.27s/it] [INFO|trainer.py:4309] 2025-12-18 13:28:52,199 >> Saving model checkpoint to 2025-12-18-132143 ***** train metrics ***** epoch = 2.24 num_input_tokens_seen = 19660800 total_flos = 6691434GF train_loss = 2.8478 train_runtime = 0:04:59.19 train_samples = 25000 train_samples_per_second = 8.021 train_steps_per_second = 0.334 train_tokens_per_second = 2737.984 [INFO|trainer.py:4643] 2025-12-18 13:29:00,076 >> ***** Running Evaluation ***** [INFO|trainer.py:4647] 2025-12-18 13:29:00,077 >> Num examples: Unknown [INFO|trainer.py:4648] 2025-12-18 13:29:00,077 >> Batch size = 1 ***** eval metrics ***** epoch = 2.24 eval_accuracy = 0.4701 eval_loss = 2.5437 eval_runtime = 0:00:09.12 eval_samples = 50 eval_samples_per_second = 0.329 eval_steps_per_second = 0.11 num_input_tokens_seen = 19660800 perplexity = 12.7268 wandb: wandb: 🚀 View run cosmic-sunset-5 at: wandb: Find logs at: ../../../../../../../../lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/ezpz-distributed-metrics/wandb/run-20251218_132210-pqytcarn/logs
Polaris
# (2025-09-25/base) (ezpz-distributed-metrics-mconda3)
#[/e/A/f/p/s/ezpz-distributed-metrics][🌱 distributed-metrics][🤷✓] [⏱️ 1m28s]
#[12/18/25 @ 13:32:34][x3006c0s1b0n0]
; ckpt=/eagle/AuroraGPT/foremans/Downloads/global_step138650
; ezpz launch python3 -m ezpz.examples.hf_trainer \
--streaming \
--dataset_name=stanfordnlp/imdb \
--model_name_or_path "${ckpt}" \
--bf16=true \
--do_train=true \
--do_eval=true \
--report-to=wandb \
--logging-steps=1 \
--include-tokens-per-second=true \
--max-steps=100 \
--include-num-input-tokens-seen=true \
--optim=adamw_torch \
--logging-first-step \
--include-for-metrics='inputs,loss' \
--max-eval-samples=50 \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--block_size=8192 \
--gradient_checkpointing=true \
--fsdp=auto_wrap \
--output_dir=$(tstamp)
-
Output
[2025-12-18 13:32:45,480638][i][ezpz/launch:378:launch] ----[🍋 ezpz.launch][started][2025-12-18-133245]---- [2025-12-18 13:32:46,548182][i][ezpz/launch:396:launch] job id: 6815207 [2025-12-18 13:32:46,549238][i][ezpz/launch:397:launch] nodelist: ['x3006c0s1b0n0', 'x3006c0s25b0n0'] [2025-12-18 13:32:46,549674][i][ezpz/launch:398:launch] hostfile: /var/spool/pbs/aux/6815207.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov [2025-12-18 13:32:46,550353][i][ezpz/pbs:329:get_pbs_launch_cmd] ✅ using [8/8] gpus [2 hosts] x [4 gpu/host] [2025-12-18 13:32:46,551713][i][ezpz/launch:354:build_executable] building command to execute by piecing together: [2025-12-18 13:32:46,552123][i][ezpz/launch:355:build_executable] (1.) launch_cmd: mpiexec --envall --np=8 --ppn=4 --hostfile=/var/spool/pbs/aux/6815207.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind=depth --depth=8 [2025-12-18 13:32:46,552678][i][ezpz/launch:356:build_executable] (2.) cmd_to_launch: python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /eagle/auroragpt/foremans/downloads/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1--include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-133234 [2025-12-18 13:32:46,554200][i][ezpz/launch:412:launch] took: 1.43 seconds to build command. [2025-12-18 13:32:46,554592][i][ezpz/launch:413:launch] executing: mpiexec --envall --np=8 --ppn=4 --hostfile=/var/spool/pbs/aux/6815207.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind=depth --depth=8 python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /eagle/auroragpt/foremans/downloads/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-133234 [2025-12-18 13:32:46,557424][i][ezpz/launch:420:launch] execution started @ 2025-12-18-133246... [2025-12-18 13:32:46,557890][i][ezpz/launch:421:launch] ----[🍋 ezpz.launch][stop][2025-12-18-133246]---- [2025-12-18 13:32:46,558395][i][ezpz/launch:133:run_command] running command: mpiexec --envall --np=8 --ppn=4 --hostfile=/var/spool/pbs/aux/6815207.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind=depth --depth=8 python3 -m ezpz.examples.hf_trainer --streaming --dataset_name=stanfordnlp/imdb --model_name_or_path /eagle/auroragpt/foremans/downloads/global_step138650 --bf16=true --do_train=true --do_eval=true --report-to=wandb --logging-steps=1 --include-tokens-per-second=true --max-steps=100 --include-num-input-tokens-seen=true --optim=adamw_torch --logging-first-step --include-for-metrics=inputs,loss --max-eval-samples=50 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --block_size=8192 --gradient_checkpointing=true --fsdp=auto_wrap --output_dir=2025-12-18-133234 [2025-12-18 13:33:46,605589][i][ezpz/dist:1926:setup_wandb] using wb_project=ezpz-hf_trainer--eagle-auroragpt-foremans-downloads-global_step138650 [2025-12-18 13:33:54,462981][i][ezpz/dist:1955:setup_wandb] wandb.run=[glad-moon-1](https://wandb.ai/aurora_gpt/ezpz-hf_trainer--eagle-auroragpt-foremans-downloads-global_step138650/runs/k1rvbdmc) [info|trainer.py:2519] 2025-12-18 13:34:42,086 >> ***** running training ***** [info|trainer.py:2520] 2025-12-18 13:34:42,087 >> num examples = 800 [info|trainer.py:2521] 2025-12-18 13:34:42,087 >> num epochs = 9,223,372,036,854,775,807 [info|trainer.py:2522] 2025-12-18 13:34:42,087 >> instantaneous batch size per device = 1 [info|trainer.py:2525] 2025-12-18 13:34:42,087 >> total train batch size (w. parallel, distributed & accumulation) = 8 [info|trainer.py:2526] 2025-12-18 13:34:42,087 >> gradient accumulation steps = 1 [info|trainer.py:2527] 2025-12-18 13:34:42,087 >> total optimization steps = 100 [info|trainer.py:2528] 2025-12-18 13:34:42,087 >> number of trainable parameters = 248,256,768 [info|integration_utils.py:867] 2025-12-18 13:34:43,444 >> automatic weights & biases logging enabled, to disable set os.environ["wandb_disabled"] = "true" 0%| | 0/100 [00:00<?, ?it/s] # [...clipped...] {'loss': 2.9705, 'grad_norm': 1.2514474391937256, 'learning_rate': 5.000000000000001e-07, 'epoch': 1.0, 'num_input_tokens_seen': 6553600, 'train_runtime': 356.0234, 'train_tokens_per_second': 18407.78} 100%|##########| 100/100 [05:54<00:00, 3.45s/it] # [...clipped...] [info|trainer.py:4309] 2025-12-18 13:40:42,445 >> saving model checkpoint to 2025-12-18-133234/checkpoint-100 {'train_runtime': 408.6051, 'train_samples_per_second': 1.958, 'train_steps_per_second': 0.245, 'train_tokens_per_second': 2004.87, 'train_loss': 3.048191022872925, 'epoch': 1.0, 'num_input_tokens_seen': 6553600} 100%|##########| 100/100 [06:47<00:00, 3.45s/it] 100%|##########| 100/100 [06:47<00:00, 4.07s/it] [info|trainer.py:4309] 2025-12-18 13:41:34,895 >> saving model checkpoint to 2025-12-18-133234 ***** train metrics ***** epoch = 1.0 num_input_tokens_seen = 6553600 total_flos = 6691434gf train_loss = 3.0482 train_runtime = 0:06:48.60 train_samples = 25000 train_samples_per_second = 1.958 train_steps_per_second = 0.245 train_tokens_per_second = 2004.87 [info|trainer.py:4643] 2025-12-18 13:41:43,747 >> ***** running evaluation ***** [info|trainer.py:4647] 2025-12-18 13:41:43,747 >> num examples: unknown [info|trainer.py:4648] 2025-12-18 13:41:43,747 >> batch size = 1 ***** eval metrics ***** epoch = 1.0 eval_accuracy = 0.4328 eval_loss = 2.8139 eval_runtime = 0:00:06.26 eval_samples = 50 eval_samples_per_second = 1.118 eval_steps_per_second = 0.16 num_input_tokens_seen = 6553600 perplexity = 16.6743 wandb: wandb: 🚀 view run glad-moon-1 at: wandb: find logs at: ../../../../../../lus/eagle/projects/auroragpt/foremans/projects/saforem2/ezpz-distributed-metrics/wandb/run-20251218_133347-k1rvbdmc/logs
-
W&B Run: glad-moon-1 ↩
-
W&B Run: cosmic-sunset-5 ↩
-
Tokens per sample on Polaris: ↩
-
Tokens per sample on Aurora: ↩
-
Tokens per optimizer step on Polaris: - ~ 8183 = (2004.87 / 0.245) [
tokens_per_second/steps_per_second] ↩ -
Tokens per optimizer step on Aurora: - ~ 8197 = (2737.984 / 0.334) [
tokens_per_second/steps_per_second] ↩