News
Crowdsourced AI benchmarks like Chatbot Arena, which have become popular among AI labs, have serious flaws, some experts say.
Chinese startup Manus AI has picked up $75 million in a funding round led by Benchmark at a roughly $500 million valuation, ...
The FrontierMath benchmark from Epoch AI tests generative models on difficult math problems. Find out how OpenAI’s o3 and ...
Graph neural nets have grown in importance as a component of programs that use gen AI. For example, Google's DeepMind unit used graph nets extensively to make stunning breakthroughs in protein-folding ...
Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the ...
AI models are numerous and confusing to navigate, and the benchmarks used to measure their performance are just as challenging.
OpenAI’s o3 model shows inflated benchmark results; real-world tests reflect performance far below initial FrontierMath ...
Google's Gemini AI beats Anthropic's Claude in Pokémon—but with a custom cheat map, sparking fresh controversy over AI benchmark fairness.
Every few months, a new large language model (LLM) is anointed AI champion, with record-breaking benchmark scores. But these celebrated metrics of LLM performance—such as testing graduate-level ...
The co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute said that he thinks benchmarks like Chatbot Arena are being "co-opted" by AI labs to "promote exaggerated ...