Inadequate evaluation, testing and benchmarking
Evaluation or testing is incomplete or unrepresentative (for example, contaminated benchmarks or missing safety evaluations), giving false assurance.
- Risk family: Governance & process
- MIT domain: 6. Socioeconomic and Environmental
- MIT subdomain: 6.5 Governance failure
- AI type: GPAI, Classical ML, Agentic
- Scope: Organization
- Source standard: MIT AI Risk Repository v4
Framework crosswalk
Every framework item mapped to this risk. Items marked "partial" overlap with this risk only in part.
- ISO/IEC 23894 Annex A, A.9
- ISO/IEC 42001 Annex A, A.6.2.4
- EU AI Act, Art. 15
- EU AI Act, Art. 9
- CoP S&S Ch. Commitments 2-5
- ibm-incomplete-ai-agent-evaluation: Incomplete AI agent evaluation
- ibm-incorrect-risk-testing: Incorrect risk testing
- ibm-lack-of-testing-diversity: Lack of testing diversity
- ibm-reproducibility: Reproducibility (partial)
- ibm-unrepresentative-risk-testing: Unrepresentative risk testing
Part of the Deployer AI Risk Register, an open-source resource developed by MindXO. Version 1.0, 3 July 2026. Derived from the MIT AI Risk Repository (V4, December 2025) under CC BY 4.0; an independent derivative work, not endorsed by or affiliated with MIT. Sub-risk decomposition references MITRE ATLAS™ v5.6.0 (© 2021-2026 The MITRE Corporation, reproduced and distributed with permission). ISO/IEC and EU AI Act references are by number only. License: CC BY 4.0.