Large Language Models require a surprising amount of human effort
Without the help of human teachers, AI models would reflect the absolute worst the world has to offer, and they’d hallucinate a lot more
Generative AI is impressive. There’s no denying that. So, what actually goes into training one of these giant, impressively intelligent models? Let’s take ChatGPT as an example. In the popular imagination, clever engineers at OpenAI wrote code to crawl nearly the entire internet, slurping up text (and, presumably, “intelligence”), and then used some nifty deepnet math to create models that can answer questions across the range of human knowledge. (Clearly we’re driving towards a world where if it’s not on the internet, it never happened.)
Of course, there’s more to it than that. For one thing, the storage and computing resources required to pull this off are enormous. The tech companies are not forthcoming about the actual costs, but estimates for training and ongoing inference services put them in the hundreds of millions of dollars. Contributing to the high cost is the herculean amount of human effort required to improve raw models that have learned from, let’s say, less-than-ideal training data. Armies of people are employed to teach AI models how to interpret data and to correct many of their mistakes. Text-based generative models work by predicting the next bit of text in a sequence. They have to infer these predictions from their training data, which in this case includes lots and lots of public data scraped from the web. The quality of that data is uneven, to say the least, so post-training corrections have to be made.
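To make "predicting the next bit of text" a little more concrete, here is a deliberately tiny sketch in Python. It is not how ChatGPT works internally (real models use large neural networks, not word counts); it only illustrates the idea of guessing what comes next from what has been seen before. The corpus and the function names are invented for this example.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Made-up training data: the "model" can only echo patterns it has seen.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigram(corpus)

print(predict_next(model, "the"))      # most common word after "the" in the corpus
print(predict_next(model, "unicorn"))  # a word the model has never seen
```

Even at this scale, the point from the paragraph above shows through: the predictions are only as good as the training text, and the model has nothing to say about anything it has not seen.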
There is also the problem of hallucinations: AI’s propensity to make stuff up and confidently state it as fact. Remember that ChatGPT is simply predicting what comes next based on what has come before, without any understanding of what it’s saying. It has to learn which answers are good and which are bad through feedback, which comes from developers, users, and contractors who have been hired to label its output. If you’ve used ChatGPT yourself and given it feedback, you’ve also contributed to its post-training training.
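As a rough illustration of how human labels can steer a model towards better answers, the sketch below averages thumbs-up and thumbs-down ratings for competing answers and picks the one people preferred. This is an assumption-laden toy, not OpenAI's actual pipeline: production systems use the labels to train further models rather than tallying votes, and the answers, ratings, and function names here are invented.

```python
from collections import defaultdict

def score_answers(feedback):
    """Average the human ratings (+1 good, -1 bad) given to each candidate answer."""
    ratings = defaultdict(list)
    for answer, rating in feedback:
        ratings[answer].append(rating)
    return {answer: sum(r) / len(r) for answer, r in ratings.items()}

def best_answer(feedback):
    """Pick the answer with the highest average human rating."""
    scores = score_answers(feedback)
    return max(scores, key=scores.get)

# Hypothetical labels from human reviewers for two competing answers.
feedback = [
    ("Paris is the capital of France.", +1),
    ("Paris is the capital of France.", +1),
    ("Lyon is the capital of France.", -1),
    ("Lyon is the capital of France.", +1),
    ("Lyon is the capital of France.", -1),
]

print(best_answer(feedback))
```

Notice that one reviewer rated the wrong answer as good. Feedback is noisy, which is part of why so many labels, and so many people, are needed.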
Without human input, LLMs are basically sociopaths. As a matter of fact, GPT-3 (the current freely available version of ChatGPT is based on GPT-3.5) had a tendency towards sexist, racist, and other vile output. To improve the situation for version 3.5, OpenAI hired teams of workers in Kenya to help guide the AI towards its better self. According to TIME, the employees in Kenya were paid less than $2 per hour and were exposed to some of the most toxic content the internet has to offer. Many reported suffering psychological trauma from the experience.
Any updates to ChatGPT’s knowledge base must also come from humans. If you ask it for facts about events that happened since its training period ended, it can’t help you.

If building or extending these models were just a matter of turning them loose on new content, OpenAI could simply feed the latest news into the models to update them. But the models can’t evaluate the accuracy and quality of new information without help, so the expensive and time-consuming process of guiding the models to reliability has to start over again. On a related note, now that media companies have realized how generative models were trained, they are not too happy about having their content co-opted by AI companies as training data (see below).
Another one bites the dust
Gannett is the latest in a string of media companies whose attempt to bring AI into their news operations has failed. But the idea of robot reporters is so compelling that they just keep trying. Gannett has now stopped using an AI tool it had deployed to write high school sports stories after the tool made several major errors in articles. Abandoning the tool comes after Gannett had already eliminated 6% of the workforce in its news division. This latest fiasco follows a similar failed attempt at CNET. Other news organizations, like The Guardian and the Associated Press, are treading more carefully, saying that they are experimenting with AI but not using it to generate publishable content. The “yet” is left unsaid but is clearly implied.
Meanwhile, these same publishers have taken steps to block tech companies from scraping their websites for training data in the future. There is a kind of catch-22 at play here. Media companies want generative models to work better so they can assist with news operations, but they don’t want their valuable archives and intellectual property, which are necessary to train the models, used to build technology that could put them out of business. We’re still a long way from fully automated news operations (see above), and there is no reason publishing companies should help to bring them about. But they are resource constrained, so they want better models… and around we go.
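One common way a website signals that a crawler is unwelcome is a robots.txt file. The sketch below, using Python's standard urllib.robotparser module, shows how such a rule might look and how a well-behaved crawler would read it. Some assumptions to flag: the article above does not say which mechanism each publisher used; GPTBot is the user-agent name OpenAI has published for its web crawler; the site URL and "SomeSearchBot" are made up. Also note that robots.txt is a request, not a lock, so it only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that turns away one AI crawler but allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def can_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up archive URL on a placeholder domain.
url = "https://news.example.com/archive/2023/story.html"

print("GPTBot allowed:", can_crawl(ROBOTS_TXT, "GPTBot", url))
print("SomeSearchBot allowed:", can_crawl(ROBOTS_TXT, "SomeSearchBot", url))
```

With these rules, the AI crawler is refused while an ordinary search crawler is still let in, which matches the publishers' position: stay findable, but stop feeding the training sets.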
Nvidia strikes it rich(er) mining veins of AI gold
The fact that Nvidia’s share price has more than tripled this year is largely due to the world’s insatiable appetite for AI. Nvidia’s GPUs power the calculations necessary for training AI of all kinds, but especially deepnet generative models like DALL-E, GPT, and LaMDA. Nvidia’s second quarter revenue was up 88% from the first quarter. The latest earnings report beat expectations by a lot (almost 25%), bumping up the share price even more. Supply cannot keep up with demand, which is driving high profits for Nvidia, but it’s also making life difficult for startups and others that are unable to source the chips they need.
Naturally, people are clamoring to invest in AI companies of all kinds but, curiously, AI itself is a bit skeptical. Investment funds that use AI to make investment decisions are not enthusiastic about investing in the AI sector. For example, one AI-powered fund has decided that Meta is overvalued and recommended against it as an investment. Maybe it knows something we don’t.
Links:
https://time.com/6247678/openai-chatgpt-kenya-workers/
https://www.washingtonpost.com/opinions/2023/08/25/nvidia-chip-ai-bottleneck/
https://www.npr.org/transcripts/1196130707
https://edition.cnn.com/2023/08/30/tech/gannett-ai-experiment-paused/
https://www.cnn.com/2023/08/28/media/media-companies-blocking-chatgpt-reliable-sources/index.html


