GPT, Claude, Llama? How to tell which AI model is best

When Meta, the parent company of Facebook, announced its latest open-source large language model (LLM) on July 23rd, it claimed that the most powerful version of Llama 3.1 had “state-of-the-art capabilities that rival the best closed-source models” such as GPT-4o and Claude 3.5 Sonnet. Meta’s announcement included a table showing the scores achieved by these and other models on a series of popular benchmarks with names such as MMLU, GSM8K and GPQA.

On MMLU, for example, the most powerful version of Llama 3.1 scored 88.6%, against 88.7% for GPT-4o and 88.3% for Claude 3.5 Sonnet, rival models made by OpenAI and Anthropic, two AI startups, respectively. Claude 3.5 Sonnet had itself been unveiled on June 20th, again with a table of impressive benchmark scores. And on July 24th, the day after Llama 3.1’s debut, Mistral, a French AI startup, announced Mistral Large 2, its latest LLM, with—you’ve guessed it—yet another table of benchmarks. Where do such numbers come from, and can they be trusted?

Having accurate, reliable benchmarks for AI models matters, and not just for the bragging rights of the firms making them. Benchmarks “define and drive progress”, telling model-makers where they stand and incentivising them to improve, says Percy Liang of the Institute for Human-Centred Artificial Intelligence at Stanford University. Benchmarks chart the field’s overall progress and show how AI systems compare with humans at specific tasks. They can also help users decide which model to use for a particular job and identify promising new entrants in the space, says Clémentine Fourrier, a specialist in evaluating LLMs at Hugging Face, a startup that provides tools for AI developers.

But, says Dr Fourrier, benchmark scores “should be taken with a pinch of salt”. Model-makers are, in effect, marking their own homework—and then using the results to hype their products and talk up their company valuations. Yet all too often, she says, their grandiose claims fail to match real-world performance, because existing benchmarks, and the ways they are applied, are flawed in various ways.

One problem with benchmarks such as MMLU (massive multi-task language understanding) is that they are simply too easy for today’s models. MMLU was created in 2020 and consists of 15,908 multiple-choice questions, each with four possible answers, across 57 topics including maths, American history, science and law. At the time, most language models scored little better than 25% on MMLU, which is what you would get by picking answers at random; OpenAI’s GPT-3 did best, with a score of 43.9%. But since then, models have improved, with the best now scoring between 88% and 90%.

This means it is difficult to draw meaningful distinctions from their scores, a problem known as “saturation” (see chart). “It’s like grading high-school students on middle-school tests,” says Dr Fourrier. More difficult benchmarks have been devised—MMLU-Pro has tougher questions and ten possible answers rather than four. GPQA is like MMLU at PhD level, on selected science topics; today’s best models tend to score between 50% and 60% on it. Another benchmark, MuSR (multi-step soft reasoning), tests reasoning ability using, for example, murder-mystery scenarios. When a person reads such a story and works out who the killer is, they are combining an understanding of motivation with language comprehension and logical deduction. AI models are not so good at this kind of “soft reasoning” over multiple steps. So far, few models score better than random on MuSR.
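The arithmetic behind chance level and saturation can be made concrete with a rough sketch. It rests on assumptions that are not in the article: every question is treated as an independent trial, and all 15,908 MMLU questions are assumed to count towards the score. The function names are illustrative only.

```python
import math


def chance_accuracy(num_choices):
    """Expected score from picking one of num_choices answers at random."""
    return 1.0 / num_choices


def standard_error(accuracy, num_questions):
    """Binomial standard error of an accuracy measured on num_questions items.

    Assumes each question is an independent trial, which is a simplification.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


# Four answer options (MMLU) versus ten (MMLU-Pro).
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

# Sampling noise around a score of 88.6% on 15,908 questions (assumed count).
noise = standard_error(0.886, 15908)

# Gap between the two closest MMLU scores quoted earlier in the article.
gap = 0.887 - 0.886

print(f"chance on 4 options:  {mmlu_chance:.0%}")
print(f"chance on 10 options: {mmlu_pro_chance:.0%}")
print(f"standard error:       {noise:.4f}")
print(f"gap below noise:      {gap < noise}")
```

Under these assumptions a gap of a tenth of a percentage point is smaller than the sampling noise of the test itself, which is one way to see why saturated scores say little about which model is better.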

MMLU also highlights two other problems. One is that the answers in such tests are sometimes wrong. A study carried out by Aryo Gema of the University of Edinburgh and colleagues, published in June, found that, of the questions they sampled, 57% of MMLU’s virology questions and 26% of its logical-fallacy ones contained errors. Some had no correct answer; others had more than one. (The researchers cleaned up the MMLU questions to create a new benchmark, MMLU-Redux.)

Then there is a deeper issue, known as “contamination”. LLMs are trained using data from the internet, which may include the exact questions and answers for MMLU and other benchmarks. Intentionally or not, the models may be cheating, in short, because they have seen the tests in advance. Indeed, some model-makers may deliberately train a model with benchmark data to boost its score. But the score then fails to reflect the model’s true ability. One way to get around this problem is to create “private” benchmarks for which the questions are kept secret, or released only in a tightly controlled manner, to ensure that they are not used for training (GPQA does this). But then only those with access can independently verify a model’s scores.
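One simple way to look for contamination, sketched here as an illustration rather than as the method any lab actually uses, is to check how many of a benchmark question's word sequences appear verbatim in a sample of training text. The function names and the choice of five-word sequences are assumptions made for this example.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences in text, lower-cased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question, corpus_text, n=5):
    """Fraction of the question's n-grams that also occur in corpus_text.

    A value near 1.0 suggests the question appears almost verbatim in the
    corpus; a value near 0.0 suggests it does not. This is a crude heuristic:
    it misses paraphrases and translations of the question.
    """
    question_ngrams = word_ngrams(question, n)
    if not question_ngrams:
        return 0.0
    corpus_ngrams = word_ngrams(corpus_text, n)
    return len(question_ngrams & corpus_ngrams) / len(question_ngrams)


# Hypothetical data, made up for the example.
question = "Which planet in the solar system has the largest number of moons"
leaked_page = "Quiz answers: " + question + " ? Saturn."
clean_page = "The weather in Paris was mild for most of the spring this year."

print(overlap_fraction(question, leaked_page))
print(overlap_fraction(question, clean_page))
```

A check of this kind only works for text one can inspect, which is part of why secret or tightly controlled question sets are attractive.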

To complicate matters further, it turns out that small changes in the way questions are posed to models can significantly affect their scores. In a multiple-choice test, asking an AI model to state the answer directly, or to reply with the letter or number corresponding to the correct answer, can produce different results. That affects reproducibility and comparability.
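The two ways of posing a multiple-choice question can be sketched as follows. The prompt wording, the function names and the scoring rules are all hypothetical, chosen only to show that the same question needs a different prompt and a different marking scheme in each format.

```python
from string import ascii_uppercase


def letter_prompt(question, choices):
    """Ask the model to reply with the letter of the correct option."""
    lines = [question]
    for label, choice in zip(ascii_uppercase, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Reply with the letter of the correct answer.")
    return "\n".join(lines)


def direct_prompt(question, choices):
    """Ask the model to state the answer itself rather than a letter."""
    options = "; ".join(choices)
    return f"{question}\nOptions: {options}\nState the correct answer."


def score_letter_reply(reply, correct_index):
    """Mark a reply correct if its first token is the expected letter."""
    tokens = reply.strip().split()
    if not tokens:
        return False
    first = tokens[0].strip(".):").upper()
    return first == ascii_uppercase[correct_index]


def score_direct_reply(reply, choices, correct_index):
    """Mark a reply correct if it contains the text of the right option."""
    return choices[correct_index].lower() in reply.lower()


# Hypothetical question, made up for the example.
question = "What is the capital of France?"
choices = ["London", "Paris", "Rome", "Madrid"]

print(letter_prompt(question, choices))
print(direct_prompt(question, choices))
```

A reply of "Paris" is right under the second scheme and wrong under the first, so the same model can earn different scores on the same question depending on how it is asked and marked.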

Automated testing systems are now used to test models against benchmarks in a standardised manner. Dr Liang’s team at Stanford has built one such system, called HELM (holistic evaluation of language models), which generates leaderboards showing how a range of models perform on various benchmarks. Dr Fourrier’s team at Hugging Face uses another such system, EleutherAI Harness, to generate leaderboards for open-source models. These leaderboards are more trustworthy than the tables of results provided by model-makers, because the benchmark scores have been generated in a consistent way.

The greatest trick AI ever pulled

As models gain new skills, new benchmarks are being developed to assess them. GAIA, for example, tests AI models on real-world problem-solving. (Some of the answers are kept secret to avoid contamination.) NoCha (novel challenge), announced in June, is a “long context” benchmark consisting of 1,001 questions about 67 recently published English-language novels. The answers depend on having read and understood the whole book, which is supplied to the model as part of the test. Recent novels were chosen because they are unlikely to have been used as training data. Other benchmarks assess models’ ability to solve biology problems or their tendency to hallucinate.

But new benchmarks can be expensive to develop, because they often require human experts to create a detailed set of questions and answers. One answer is to use LLMs themselves to develop new benchmarks. Dr Liang is doing this with a project called AutoBencher, which extracts questions and answers from source documents and identifies the hardest ones.

Anthropic, the startup behind the Claude LLM, has started funding the creation of benchmarks directly, with a particular emphasis on AI safety. “We are super-undersupplied on benchmarks for safety,” says Logan Graham, a researcher at Anthropic. “We are in a dark forest of not knowing what the models are capable of.” On July 1st the company began inviting proposals for new benchmarks, and tools for generating them, which it will co-fund, with a view to making them available to all. This might involve developing ways to assess a model’s ability to develop cyber-attack tools, say, or its willingness to provide advice on making chemical or biological weapons. These benchmarks can then be used to assess the safety of a model before public release.

Historically, says Dr Graham, AI benchmarks have been devised by academics. But as AI is commercialised and deployed in a range of fields, there is a growing need for reliable and specific benchmarks. Startups that specialise in providing AI benchmarks are starting to appear, he notes. “Our goal is to pump-prime the market,” he says, to give researchers, regulators and academics the tools they need to assess the capabilities of AI models, good and bad. The days of AI labs marking their own homework could soon be over.

© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
