What Is a Benchmark - Search News

6d on MSN

Are AI agents ready for the workplace? A new benchmark raises doubts.

New research looks at how leading AI models hold up doing actual white-collar work tasks, drawn from consulting, investment ...

ZDNet

'Humanity's Last Exam' benchmark is stumping top AI models - can you do any better?

On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity's Last Exam (HLE), a new academic benchmark aiming to "test the limits of AI knowledge at the frontiers of human expertise," ...

TechCrunch

A new AI benchmark tests whether chatbots protect human well-being

AI chatbots have been linked to serious mental health harms in heavy users, but there have been few standards for measuring whether they safeguard human well-being or just maximize for engagement. A ...

Ars Technica

New secret math benchmark stumps AI models and PhDs alike

On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that ...

MIT Technology Review

How to build a better AI benchmark

To fix the way we test and measure models, AI is learning tricks from social science. It’s not easy being one of Silicon Valley’s favorite benchmarks. SWE-Bench (pronounced “swee bench”) launched in ...

Tech Times

AI Without Women Is a Risk: A Benchmark for Peace and Security

Our Secure Future (OSF), an organization dedicated to the advancement of the Women, Peace and Security (WPS) agenda, is leading the development of a WPS-specific Artificial Intelligence (AI) benchmark ...

ZDNet

This new AI benchmark measures how much models lie

As more AI models show evidence of being able to deceive their creators, researchers from the Center for AI Safety and Scale AI have developed a first-of-its-kind lie detector. On Wednesday, the ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...
