New benchmark study results show leading AI models, including ChatGPT, Claude, and Gemini, still lag humans in visual math reasoning.
On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity's Last Exam (HLE), a new academic benchmark aiming to "test the limits of AI knowledge at the frontiers of human expertise," ...
OpenAI secretly funded and had access to a benchmarking dataset, raising questions about high scores achieved by its new o3 AI model. Revelations that OpenAI secretly funded and had access to the ...
A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices. When OpenAI unveiled o3 in ...
OpenAI has long been touting the capabilities of its artificial intelligence (AI) developments, especially with their o-series models that are capable of reasoning and more advanced capabilities. The ...
Every time Hasan publishes a story, you’ll get an alert straight to your inbox! Enter your email By clicking “Sign up”, you agree to receive emails from ...