AuroraGPT: Training Foundation Models on Supercomputers

Pranesh Nikhar Dec 16, 2025 12/16/25 3 min read

 󰈤

Pranesh Nikhar 2025-12-16

🧰 AuroraGPT: Toolbox
👥 Team Leads
🤝 Teams
🏋️ Challenges
🧬 MProt-DPO
🌎 AERIS (2025)
📓 References
❤️ Acknowledgements
Extras

🧰 AuroraGPT: Toolbox

Datasets and data pipelines (how do we deal with scientific data?)
Software infrastructure and workflows (scalable, robust, extensible)
Evaluation of state-of-the-art LLM Models (how do they perform on scientific tasks?)

🍋 ezpz

praneshnikhar/ezpz

Write once, run anywhere

🚂 Training

argonne-lcf/Megatron-DeepSpeed

For the largest of large language models

🏃‍♂️ Running

argonne-lcf/inference-endpoints

Inference endpoints for LLMs, hosted @ ALCF

👥 Team Leads

Planning

Data

Training

Evaluation

Post

Inference

Comms

Distribution

🤝 Teams

Planning
Data Prep
- Accumulate 20+ T tokens of high-quality scientific text and structured data
Models / Training
¹ - Train (entirely from scratch) a series of models on publicly available data
Evaluation
- Skills, trustworthiness, safety, robustness, privacy, machine ethics

Post-Training
- Fine-tuning, alignment
Inference
- Model serving, API development / public-facing web services
Distribution
- Licensing, generating and distributing artifacts for public consumption
Communication

🏋️ Challenges

This is incredibly difficult in practice, due in part to:

Brand new hardware, architecture, software
Lack of native support in existing frameworks (though getting better!)
General system stability
+10k Nodes $\left(\times \frac{12\,\,\mathrm{XPU}}{1\,\,\mathrm{Node}}\right)\Rightarrow$ $(\times \frac{12 XPU}{1 Node}) \Rightarrow$ +100k XPUs
- network performance
- file system stability (impacted by other users !)
- many unexpected difficulties occur at increasingly large scales
Combinatorial explosion of possible configurations and experiments
{hyperparameters, architectures, tokenizers, learning rates, …}

💾 AuroraGPT: Training

To train a fixed model on trillions of tokens requires:
1. Aggregating data from multiple different corpora
  (e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, etc.)
2. Sampling each training batch according to a fixed distribution across corpora
3. Building indices that map batches of tokens into these files (indexing)

The original implementation was slow:

Designed to run serially on a single device
Major bottleneck when debugging data pipeline at scale

🍹 AuroraGPT: Blending Data, Efficiently

🐢 Original implementation:
- Slow (serial, single device)
- ~ 1 hr/2T tokens
🐇 New implementation:
- Fast! (distributed, asynchronous)
- ~ 2 min/2T tokens (30x faster !!)

Figure 1: Time spent preparing 2T tokens

📉 Training AuroraGPT-7B on 2T Tokens

📉 Training AuroraGPT-2B on 7T Tokens

$Forward Diffusion Process (\pi\rightarrow \mathcal{N})$

🌀 Sequence-Window-Pipeline Parallelism `SWiPe`

SWiPe is a novel parallelism strategy for Swin-based Transformers
Hybrid 3D Parallelism strategy, combining:
- Sequence parallelism (SP)
- Window parallelism (WP)
- Pipeline parallelism (PP)

Figure 10

Figure 11: SWiPe Communication Patterns

🚀 AERIS: Scaling Results

Figure 12: AERIS: Scaling Results

10 EFLOPs (sustained) @ 120,960 GPUs
See (Hatanpää et al. (2025)) for additional details
arXiv:2509.13523

🌪️ Hurricane Laura

Figure 13: Hurricane Laura tracks (top) and intensity (bottom). Initialized 7(a), 5(b) and 3(c) days prior to 2020-08-28T00z.

📓 References

Dharuman, Gautham, Kyle Hippe, Alexander Brace, et al. 2024. “MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization.” Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (Atlanta, GA, USA), SC ’24. https://doi.org/10.1109/SC41406.2024.00013.

Hatanpää, Väinö, Eugene Ku, Jason Stock, et al. 2025. AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions. https://arxiv.org/abs/2509.13523.

Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, et al. 2024. GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather. https://arxiv.org/abs/2312.15796.

Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, et al. 2023. DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies. https://arxiv.org/abs/2310.04610.

❤️ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Extras

Co-led by: Venkat Vishwanath, Pranesh Nikhar ↩

← [b]ack

talks/ Production Pre-Training at Scale: The Good, the Bad, and the Restarts [n]ext → talks/ AuroraGPT: Training Foundation Models on Supercomputers