DeepMind tests the limits of large AI language systems with 280 billion-parameter model
Language generation is the hottest thing in AI right now, with a class of systems known as "large language models" (or LLMs) being used for everything from improving Google's search engine to creating text-based fantasy games. But these programs also have serious problems, including regurgitating sexist and racist language and failing tests of logical reasoning. A big question is whether these weaknesses can be fixed simply by adding more data and computing power, or whether we are reaching the limits of this technological paradigm.

This is one of the topics that Alphabet's AI lab DeepMind tackles in a trio of research papers published today. The company's conclusion is that scaling up these systems should deliver substantial improvements. "One key finding of the paper is that the progress and capabilities of large language models are still increasing. This is not an area that has plateaued," DeepMind research scientist Jack Rae told reporters in a briefing call.

DeepMind, which regularly feeds its work into Google products, has probed the capabilities of LLMs by building a language model with 280 billion parameters named Gopher. Parameters are a quick measure of a model's size and complexity, meaning that Gopher is larger than OpenAI's GPT-3 (175 billion parameters) but not as large as some more experimental systems, such as Microsoft and Nvidia's Megatron model (530 billion parameters).
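To give a rough sense of what these parameter counts mean in practice, the sketch below estimates the memory needed just to store each model's weights, assuming 16-bit (2-byte) parameters. The byte-per-parameter figure is an illustrative assumption, not a number reported by the labs.

```python
# Back-of-envelope weight-storage estimate for the models named above.
# Assumes 2 bytes per parameter (fp16) -- an illustrative assumption.
models = {
    "Gopher (DeepMind)": 280e9,
    "GPT-3 (OpenAI)": 175e9,
    "Megatron (Microsoft/Nvidia)": 530e9,
}

BYTES_PER_PARAM = 2  # fp16 assumption

for name, params in models.items():
    gigabytes = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: {params / 1e9:.0f}B parameters ~ {gigabytes:.0f} GB of weights")
```

Even under this optimistic assumption, Gopher's weights alone would occupy roughly 560 GB, far beyond a single consumer GPU, which is part of why only a handful of labs train models at this scale.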

In the AI world it is generally true that bigger is better, with larger models usually offering higher performance. DeepMind's research confirms this trend and suggests that scaling up LLMs does deliver better performance on the most common benchmarks, testing things like sentiment analysis and summarization. However, the researchers also cautioned that some of the problems inherent in language models will require more than just additional data and compute to fix.

"I think right now it looks like the model can fail in a number of ways," said Rae. "Some subset of those ways are because the model just doesn't have a sufficiently good understanding of what it's reading, and I think, for those classes of problems, we are going to see improved performance with more data and scale."

But, he said, there are other categories of problems, "like the model perpetuating stereotypical biases or the model being coaxed into giving out falsehoods, that [...] no one at DeepMind thinks scale will be the solution to." In these cases, language models will need "additional training routines," such as feedback from human users, he said.

To reach these findings, DeepMind's researchers evaluated a range of language models of different sizes on 152 language tasks or benchmarks. They found that larger models generally delivered better results, with Gopher itself offering state-of-the-art performance on roughly 80 percent of the tests the scientists selected.

In another paper, the company also surveyed the wide range of potential harms associated with deploying LLMs. These include the systems' use of toxic language, their capacity to share misinformation, and their potential to be used for malicious purposes, such as spreading spam or propaganda. All of these issues will become increasingly important as AI language models are more widely deployed, for example as chatbots and sales agents.

However, it is worth remembering that performance on benchmarks is not everything when evaluating machine learning systems. In a recent paper, several AI researchers (including two from Google) explored the limitations of benchmarks, noting that these datasets will always be limited in scope and unable to match the complexity of the real world. As is often the case with new technology, the only reliable way to test these systems is to see how they perform in reality. With large language models, we will be seeing more of these applications soon.
