Dataset Development
Still image from an animation showing Earth system variability using CONUS404 dataset variables across time scales. View the full animation here.
Dataset Development in Support of AI
Developing high-quality community datasets has long been central to NSF NCAR’s mission. One of the earliest examples is the first global reanalysis, the NCEP/NCAR Reanalysis Project, spearheaded by Kalnay at NCAR in the 1990s. That foundational work helped shape the modern datasets that power today’s AI weather models, including widely used products such as ERA5.
NSF NCAR continues to generate new, large-scale datasets that support both scientific discovery and AI model training. NSF NCAR’s dataset portfolio evolves continually as new observational products, reanalyses, and model simulations are developed to meet emerging AI and Earth system science needs. A subset of our AI Ready Data examples:
- Output from the Community Earth System Model (CESM) is used to train the CAMulator global CREDIT model, and will enable new studies into coupled AI modeling frameworks.
- At finer scales, the CONUS404 dataset, a 4-km long-term regional hydroclimate reanalysis over the continental United States generated with the Weather Research & Forecasting (WRF) Model, was developed through a collaboration among NSF NCAR’s Research Applications Lab (RAL), the Mesoscale and Microscale Meteorology Laboratory (MMM), and the U.S. Geological Survey (USGS). This dataset is enabling the first convection-permitting regional AI weather models.
- Private companies are also adopting the CONUS404 dataset to train their own AI models and improve their forecasts.
AI systems are often limited by the quality of the data they are trained on and the computing resources available to process that data. NSF NCAR uniquely combines large-scale data production, national computing infrastructure, and expert consulting support within a single integrated ecosystem.
Scientists, university researchers, and students can obtain computing allocations on NCAR resources to generate data, host and share datasets through the Geoscience Data Exchange (GDEX), train AI models on readily accessible GPUs, and receive expert guidance throughout the research process, all with no data transfer costs. This end-to-end pipeline makes NSF NCAR a uniquely integrated environment for AI-driven geoscience research.
For the university community and early-career scientists in particular, this integrated model removes a significant barrier to entry for AI research. It helps ensure that the next generation of geoscience AI is built by a broad and diverse community, not only by those with access to specific computational resources.
For computing allocation requests, visit the NSF NCAR Allocations page. For general inquiries and data support questions, please contact the NSF NCAR Research Data Help Desk at datahelp@ucar.edu or through the NSF NCAR Geoscience Data Exchange Help Desk.

