BEACON Overview
BEACON: Benchmark Evaluations for AI and Conventional NWP
Billions of dollars in decisions increasingly depend on AI-driven weather and Earth system models. Yet there is no independent benchmark to determine which models are truly decision-ready and perform well enough on relevant metrics for real-world use.
BEACON is a new AI/ML decision-grade evaluation framework in development at NSF NCAR to address exactly this gap.
Figure: Direct model performance comparison of PanguWeather to GraphCast-4DWX, taken from BEACON.
BEACON provides independent, side-by-side evaluations of emerging AI weather and Earth system models against established, physics-based forecasting systems. The goal is to provide decision-makers with transparent, evidence-based evaluations that clarify where AI models deliver measurable value, where limitations remain, and when they are ready for operational use. These decision-makers include insurers and reinsurers, catastrophe model developers and assessors, operational forecast centers and infrastructure operators, defense and national security operators and planners, private industry leaders, and federal, state, and local government officials, all of whom rely on credible and defensible hazard information to guide safety-critical, financial, and mission-critical decisions.
BEACON is built on METplus, NSF NCAR’s internationally adopted verification and evaluation framework for numerical weather prediction. METplus provides standardized datasets, transparent performance metrics, and reproducible workflows for evaluating Earth system model performance. It is already widely used across research and operational forecast centers worldwide. By building on this established foundation, BEACON extends gold-standard scientific verification practices to AI-based systems, ensuring consistent, transparent, and reproducible evaluation.
BEACON is designed to serve a range of application areas where reliable forecasts carry real consequences, including transportation safety, energy infrastructure development, air quality and health, wildfire prediction, extreme weather risk, agricultural planning, and insurance risk assessment, all of which contribute to economic resilience and national security. By validating AI tools before they are deployed, the evolving platform enables sponsors, stakeholders, and the broader public to avoid costly missteps and invest with confidence in modeling systems and innovation grounded in transparent, scientifically rigorous evaluation.

