AI firms will soon exhaust most of the internet’s data

The internet provided not only the images, but also the resources for labelling them. Once search engines had delivered pictures of what they took to be dogs, cats, chairs or whatever, these images were inspected and annotated by humans recruited through Mechanical Turk, a crowdsourcing service provided by Amazon which allows people to earn money by doing mundane tasks. The result was a database of millions of curated, verified images. It was through using parts of ImageNet for its training that, in 2012, a program called AlexNet demonstrated the remarkable potential of “deep learning”—that is to say, of neural networks with many more layers than had previously been used. This was the beginning of the AI boom, and of a labelling industry designed to provide it with training data.

The later development of large language models (LLMs) also depended on internet data, but in a different way. The classic training exercise for an LLM is not predicting what word best describes the contents of an image; it is predicting what a word cut from a piece of text is, on the basis of the other words around it.

In this sort of training there is no need for labelled and curated data; the system can blank out words, take guesses and grade its answers in a process known as “self-supervised training”. There is, though, a need for copious data. The more text the system is given to train on, the better it gets. Given that the internet offers hundreds of trillions of words of text, it became to LLMs what aeons of carbon randomly deposited in sediments have been to modern industry: something to be refined into miraculous fuel.
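
To make the blank-and-guess idea concrete, here is a toy sketch in Python. A simple word-count model stands in for the neural network, and the one-sentence corpus and two-word context window are made up for illustration; real LLMs train on billions of documents, not a dozen words.

```python
import random

# Toy illustration of self-supervised training: blank out a word, guess it
# from the surrounding context, and grade the guess -- no human labels needed.
# (Illustrative only; real LLMs predict tokens with neural networks.)

corpus = "the cat sat on the mat because the mat was warm".split()

# Count how often each word follows a given pair of context words.
counts = {}
for i in range(2, len(corpus)):
    context = (corpus[i - 2], corpus[i - 1])
    counts.setdefault(context, {}).setdefault(corpus[i], 0)
    counts[context][corpus[i]] += 1

def guess(context):
    """Predict the most frequent word seen after this two-word context."""
    options = counts.get(context, {})
    return max(options, key=options.get) if options else None

# "Training signal": blank a word at random and check the guess against it.
i = random.randrange(2, len(corpus))
blanked, context = corpus[i], (corpus[i - 2], corpus[i - 1])
prediction = guess(context)
print(f"context={context} blanked={blanked!r} guessed={prediction!r} "
      f"correct={prediction == blanked}")
```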

Common Crawl, an archive of much of the open internet including 50bn web pages, became widely used in AI research. Newer models supplemented it with data from more and more sources, such as Books3, a widely used compilation of thousands of books. But the machines’ appetites for text have grown at a rate the internet cannot match. Epoch AI, a research firm, estimates that, by 2028, the stock of high-quality textual data on the internet will all have been used. In the industry this is known as the “data wall”. How to deal with this wall is one of AI’s great looming questions, and perhaps the one most likely to slow its progress.

One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the order in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised at maths but forget some other concepts.
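
A rough sketch of what such a data mixture and ordering might look like in code; the source names and weights below are hypothetical, not any lab's actual recipe.

```python
import random

# A minimal sketch of a training "data mixture": each source gets a weight,
# and batches are drawn by interleaving sources rather than exhausting one
# topic (say, maths) at the very end of training.

mixture = {
    "web_text":  0.55,
    "textbooks": 0.25,   # reasoning-dense sources can be up-weighted
    "code":      0.15,
    "maths":     0.05,
}

def sample_source(weights):
    """Pick a data source in proportion to its mixture weight."""
    sources, probs = zip(*weights.items())
    return random.choices(sources, weights=probs, k=1)[0]

# Interleaved schedule: every batch can come from any source, so no topic is
# concentrated at the end where it could crowd out earlier learning.
schedule = [sample_source(mixture) for _ in range(10)]
print(schedule)
```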

These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.
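
As a rough illustration of that simplification, the snippet below keeps only a handful of evenly spaced frames from a clip; the frame counts are arbitrary, not any particular model's settings.

```python
# Frame subsampling sketch: rather than processing every frame of a video,
# a model might look only at a small, evenly spaced subset.

def sample_frames(total_frames: int, frames_to_keep: int) -> list[int]:
    """Return indices of evenly spaced frames to feed to the model."""
    if frames_to_keep >= total_frames:
        return list(range(total_frames))
    step = total_frames / frames_to_keep
    return [int(i * step) for i in range(frames_to_keep)]

# A 10-second clip at 30 frames per second has 300 frames; keep only 8.
print(sample_frames(total_frames=300, frames_to_keep=8))
```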

Whatever data models are trained on, ownership is increasingly recognised as an issue. The material used in training LLMs is often copyrighted and used without consent from, or payment to, the rights holders. Some AI models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. AI models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”.

Different rights holders are adopting different tactics. Getty Images has sued Stability AI, an image-generation firm, for unauthorised use of its image store. The New York Times has sued OpenAI and Microsoft for copyright infringement of millions of articles. Other papers have struck deals to license their content. News Corp, owner of the Wall Street Journal, signed a deal worth $250m over five years. (The Economist has not taken a position on its relationship with AI firms.) Other sources of text and video are doing the same. Stack Overflow, a coding help-site, Reddit, a social-media site, and X (formerly Twitter) are now charging for access to their content for training.

The situation differs between jurisdictions. Japan and Israel have taken a permissive stance to promote their AI industries. The European Union has no generic “fair use” concept, so it could prove stricter. Where markets are set up, different types of data will command different prices: models will need access to timely information from the real world to stay up to date.

Model capabilities can also be improved when the version produced by self-supervised learning, known as the pre-trained version, is refined through additional data in post-training. “Supervised fine-tuning”, for example, involves feeding a model question-and-answer pairs collected or handcrafted by humans. This teaches models what good answers look like. “Reinforcement-learning from human feedback” (RLHF), on the other hand, tells them if the answer satisfied the questioner (a subtly different matter).
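
The difference between the two kinds of post-training data can be sketched as follows; the questions, answers and field names are invented for illustration.

```python
# Supervised fine-tuning: question-and-answer pairs collected or handcrafted
# by humans, showing the model what a good answer looks like.
sft_examples = [
    {"prompt": "What causes tides?",
     "answer": "Mainly the gravitational pull of the moon and, to a lesser "
               "extent, the sun."},
]

# RLHF-style feedback: the model's own answer plus a judgement of whether it
# satisfied the questioner, later used to nudge the model's weights.
rlhf_examples = [
    {"prompt": "What causes tides?",
     "model_answer": "Mostly the wind.",
     "satisfied": False},
]

def to_training_text(example: dict) -> str:
    """Format an SFT pair as one training string (a common convention)."""
    return f"Question: {example['prompt']}\nAnswer: {example['answer']}"

print(to_training_text(sft_examples[0]))
print("Satisfied questioner:", rlhf_examples[0]["satisfied"])
```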

In RLHF users give a model feedback on the quality of its outputs; that feedback is then used to tweak the model’s parameters, or “weights”. User interactions with chatbots, such as a thumbs-up or -down, are especially useful for RLHF. This creates what techies call a “data flywheel”, in which more users lead to more data which feed back into tuning a better model. AI startups are keenly watching what types of questions users ask their models, and then collecting data to tune their models on those topics.
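
A minimal sketch of the data-collection side of that flywheel, assuming a hypothetical interaction log with topic labels and thumbs-up flags:

```python
from collections import Counter

# Log user interactions (including thumbs up/down), then see which kinds of
# question come up most often, so extra tuning data can be gathered there.
interaction_log = [
    {"topic": "coding",  "thumbs_up": True},
    {"topic": "coding",  "thumbs_up": False},
    {"topic": "recipes", "thumbs_up": True},
    {"topic": "coding",  "thumbs_up": True},
]

def topics_by_volume(log):
    """Count questions per topic to decide where to collect more tuning data."""
    return Counter(entry["topic"] for entry in log).most_common()

def approval_rate(log, topic):
    """Share of answers on a topic that received a thumbs-up."""
    relevant = [e for e in log if e["topic"] == topic]
    return sum(e["thumbs_up"] for e in relevant) / len(relevant)

print(topics_by_volume(interaction_log))          # coding is asked about most
print(approval_rate(interaction_log, "coding"))   # and only 2/3 approved
```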

Scale it up

As pre-training data on the internet dry up, post-training becomes more important. Labelling companies such as Scale AI and Surge AI earn hundreds of millions of dollars a year collecting post-training data. Scale recently raised $1bn at a $14bn valuation. Things have moved on from the Mechanical Turk days: the best labellers earn up to $100 an hour. But, though post-training helps produce better models and is sufficient for many commercial applications, it is ultimately incremental.

Rather than pushing the data wall back bit by bit, another solution would be to jump over it entirely. One approach is to use synthetic data, which are machine-created and therefore limitless. AlphaGo Zero, a model produced by DeepMind, a Google subsidiary, is a good example. The company’s first successful Go-playing model had been trained using data on millions of moves from amateur games. AlphaGo Zero used no pre-existing data. Instead it learned Go by playing 4.9m matches against itself over three days, noting the winning strategies. That “reinforcement learning” taught it how to respond to its opponent’s moves by simulating a large number of possible responses and choosing the one with the best chance of winning.
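
The flavour of learning from self-play alone can be captured with a toy game: the sketch below learns a take-one-or-two-counters game from nothing but its own matches. It is a drastic simplification for illustration, not AlphaGo Zero's actual algorithm.

```python
import random

# Toy self-play: two copies of the same agent play a tiny game (take 1 or 2
# counters; whoever takes the last one wins) and note which moves led to wins.
wins = {}    # (pile_size, move) -> (wins, plays)

def choose(pile, explore=0.2):
    """Pick a move: sometimes explore, otherwise take the best-scoring move."""
    moves = [m for m in (1, 2) if m <= pile]
    if random.random() < explore:
        return random.choice(moves)
    def rate(m):
        w, n = wins.get((pile, m), (0, 0))
        return w / n if n else 0.5
    return max(moves, key=rate)

for _ in range(20000):                      # self-play games
    pile, history, player = 10, {0: [], 1: []}, 0
    while pile > 0:
        move = choose(pile)
        history[player].append((pile, move))
        pile -= move
        winner, player = player, 1 - player  # last mover is the winner
    for p in (0, 1):                         # credit the winner's moves
        for state_move in history[p]:
            w, n = wins.get(state_move, (0, 0))
            wins[state_move] = (w + (p == winner), n + 1)

# With enough self-play the agent tends to learn the winning strategy of
# leaving a multiple of 3 counters for its opponent.
print(choose(10, explore=0.0), choose(4, explore=0.0))
```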

A similar approach could be used for LLMs writing, say, a maths proof, step-by-step. An LLM might build up an answer by first generating many first steps. A separate “helper” AI, trained on data from human experts to judge quality, would identify which was best and worth building on. Such AI-produced feedback is a form of synthetic data, and can be used to further train the first model. Eventually you might have a higher-quality answer than if the LLM answered in one go, and an improved LLM to boot. This ability to improve the quality of output by taking more time to think is like the slower, deliberative “system 2” thinking in humans, as described in a recent talk by Andrej Karpathy, a co-founder of OpenAI. Currently, LLMs employ “system 1” thinking, generating a response without deliberation, similar to a human’s reflexive response.
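
That generate-then-judge loop can be sketched as follows, with both the answering model and the helper replaced by stand-in stubs; a real system would call actual models at both steps.

```python
import random

# Best-of-n with a helper model: the main model proposes several candidate
# first steps, a separate scorer judges them, and only the best one is kept
# (and could later be reused as synthetic training data).

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n different first steps from an LLM."""
    return [f"{prompt} -> candidate step {i} (variant {random.randint(0, 9)})"
            for i in range(n)]

def helper_score(candidate: str) -> float:
    """Stand-in for a helper model trained to judge step quality."""
    return random.random()   # a real scorer would rate mathematical soundness

def best_of_n(prompt: str, n: int = 4) -> str:
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=helper_score)

chosen = best_of_n("Prove that the sum of two even numbers is even")
print(chosen)   # the chosen step can seed the next round, or become training data
```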

The difficulty is extending the approach to settings like health care or education. In gaming, there is a clear definition of winning and it is easier to collect data on whether a move is advantageous. Elsewhere it is trickier. Data on what is a “good” decision are typically collected from experts. But that is costly, takes time and is only a patchy solution. And how do you know if a particular expert is correct?

It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives.

© 2024, The Economist Newspaper Ltd. All rights reserved.

From The Economist, published under licence. The original content can be found on www.economist.com
