New research looks at how leading AI models hold up doing actual white-collar work tasks, drawn from consulting, investment ...
On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity's Last Exam (HLE), a new academic benchmark aiming to "test the limits of AI knowledge at the frontiers of human expertise," ...
AI chatbots have been linked to serious mental health harms in heavy users, but there have been few standards for measuring whether they safeguard human well-being or just maximize for engagement. A ...
On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that ...
To fix the way we test and measure models, AI is learning tricks from social science. It’s not easy being one of Silicon Valley’s favorite benchmarks. SWE-Bench (pronounced “swee bench”) launched in ...
Our Secure Future (OSF), an organization dedicated to the advancement of the Women, Peace and Security (WPS) agenda, is leading the development of a WPS-specific Artificial Intelligence (AI) benchmark ...
As more AI models show evidence of being able to deceive their creators, researchers from the Center for AI Safety and Scale AI have developed a first-of-its-kind lie detector. On Wednesday, the ...
Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...