Model Forge

Democratizing training of large neural networks: The NSF NCAR ML Model Forge

Although the CREDIT framework lowers many of the barriers to developing AI-based weather prediction (AIWP) and related Earth system ML models, training large neural networks remains both technically challenging and time-intensive. Training or fine-tuning models often requires access to petabyte-scale datasets, storage systems with high I/O throughput, high-performance parallel file systems, and multi-GPU clusters capable of distributed training. Beyond these infrastructure demands, there are substantial workforce challenges. Among other things, researchers must be proficient in machine learning training frameworks such as PyTorch or JAX, understand HPC job scheduling environments, work effectively with parallel I/O systems, and be familiar with geophysical data standards and conventions. Effective workflows also depend on a supporting software ecosystem, including tools such as Xarray and Dask for scalable data preprocessing and post-processing. As a result, relatively few Earth system scientists are currently positioned to train new models or adapt existing ones for their research applications.

To address this gap, NSF NCAR is building a pilot model training service, the ML Model Forge, designed to remove many of these technical barriers. Once available, the ML Model Forge will provide a no-code or low-code interface for reproducibly training a range of AI models relevant to Earth system science, thereby broadening access beyond ML specialists. In addition to supporting model training for non-experts, the service will emphasize reproducibility and version control, streamlined deployment pathways, and responsible, reliable model development and use. By abstracting away much of the underlying complexity, it aims to democratize advanced AI capabilities while maintaining scientific rigor and operational robustness.

For more information or for partnership opportunities, please contact John Clyne.