News

Crowdsourced AI benchmarks like Chatbot Arena, which have become popular among AI labs, have serious flaws, some experts say.
The FrontierMath benchmark from Epoch AI tests generative models on difficult math problems. Find out how OpenAI’s o3 and ...
A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the ...
AI models are numerous and hard to navigate, and the benchmarks used to measure their performance can be just as confusing.
OpenAI’s o3 model shows inflated benchmark results; real-world tests reflect performance far below initial FrontierMath ...
Benchmark performance results typically accompany the launch of every new AI model to showcase how well the models can ...
Victor Lazarte, a general partner at Benchmark, said AI is "fully replacing people." ...
OpenAI’s newest LLM, o3, is facing scrutiny after independent tests found it solved far fewer tough math problems ...
IBM and the European Space Agency (ESA) have launched TerraMind, billed as the best-performing AI model for Earth observation ...
Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI's highest ...