Benchmarking and Evaluation

Developing AI tools and models is only part of what NSF NCAR does. We also maintain systematic approaches for evaluating them rigorously. Drawing on decades of building benchmarking frameworks for physical Earth system models, NSF NCAR has established comprehensive evaluation infrastructure and principles for assessing AI performance across multiple dimensions, from physical consistency to uncertainty quantification to generalized reliability. As a facility serving the broader research community, NSF NCAR provides open platforms, standardized protocols, and leadership in global evaluation standards. Together these enable researchers to test AI approaches thoroughly and build the evidence base needed for confidence in AI in scientific applications.

Learn more about how NSF NCAR benchmarks and evaluates AI models, including NSF NCAR’s standards and approach.

Learn more about transparent, rigorous benchmarking for AI, NSF NCAR’s BEACON physics-based forecasting, and the tools available to the community.

Quality AI requires high-quality data: carefully curated, thoroughly documented, and fit for purpose. Learn more about our commitments to high-quality data practices and standards.