News
Crowdsourced AI benchmarks like Chatbot Arena, which have become popular among AI labs, have serious flaws, some experts say.
The FrontierMath benchmark from Epoch AI tests generative models on difficult math problems. Find out how OpenAI’s o3 and ...
A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the ...
OpenAI’s o3 model shows inflated benchmark results; real-world tests reflect performance far below initial FrontierMath ...
Benchmark performance results typically accompany the launch of every new AI model to showcase how well the models can ...
AI models are numerous and confusing to navigate, and the benchmarks used to measure their performance can be just as challenging to interpret.
Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the ...
Artificial intelligence group MLCommons unveiled two new benchmarks that it said can help determine how quickly top-of-the-line hardware and software can run AI applications.
A new test of AI capabilities consists of puzzles that humans are able to solve without too much trouble, but which all ...
Meta, for its part, wasn't hiding the fact that this was an experimental build. In its launch blog post, the Instagram parent wrote ...
There are a lot of AI models, and it can be tricky to know which are best. Tech companies often use "benchmarks" to measure how an AI model performs. But industry observers are becoming increasingly ...
The AI Index shows an industry in flux, with models increasing in complexity even as public perception remains, at times, negative.