Observation | The wave of large models is about to exhaust the universe of text. Where will high-quality data come from?

Source: The Paper

Author: Shao Wen

Experts warn that AI-powered bots such as ChatGPT could soon “run out of text in the universe.” Meanwhile, using AI-generated data to “feed back” into AI can cause models to collapse. The high-quality data used to train future models may become ever more expensive, and the web may grow fragmented and closed.

"When the development of large-scale models goes deeper, such as large-scale industry models, the data required is not free and open data on the Internet. To train a model with high precision, what is needed is industry expertise or even commercial secrets. Knowledge. For everyone to contribute to such a corpus, there must be a mechanism for the distribution of rights and interests.”

Image source: Generated by Unbounded AI

Data is one of the “troika” of artificial intelligence infrastructure, and its importance has always been self-evident. As the boom in large language models reaches its peak, the industry is paying more attention to data than ever before.

In early July, Stuart Russell, a professor of computer science at the University of California, Berkeley and author of “Artificial Intelligence: A Modern Approach,” warned that AI-powered bots such as ChatGPT could soon “run out of text in the universe,” and that the technique of training bots by harvesting large amounts of text is “starting to run into difficulties.” Research firm Epoch estimates that machine learning datasets could exhaust all “high-quality language data” by 2026.

“Data quality and data volume will be the key to the next stage of emergent capability in large models,” Wu Chao, director of the expert committee of the CITIC Think Tank and head of the China Securities research institute, said in a speech at the 2023 World Artificial Intelligence Conference (WAIC). He estimated that “in the future, 20% of a model’s quality will be determined by the algorithm and 80% by the quality of the data. Next, high-quality data will be the key to improving model performance.”

However, where does high-quality data come from? At present, the data industry still faces many pressing questions, such as what counts as high-quality data, how to promote data sharing and circulation, and how to design a system for pricing and revenue distribution.

High-quality data is urgently needed

Wei Zhilin, deputy general manager of the Shanghai Data Exchange, said in an interview with The Paper on July 8 that within the “troika” of data, computing power, and algorithms, data is the core element, the one with the longest chain, and the most foundational.

Large language models (LLMs) deliver astonishing performance today, and the mechanism behind it is often summed up as “emergent intelligence”: put simply, the AI picks up skills it was never explicitly taught. Massive datasets are an important foundation for this emergence.

A large language model is a deep neural network with billions to trillions of parameters, “pre-trained” on an enormous natural-language corpus of several terabytes (1 TB = 1,024 GB), including structured data, online books, and other content. Shan Haijun, vice president of the China Electronics Jinxin Research Institute, told The Paper during the 2023 World Artificial Intelligence Conference that large models are essentially probabilistic generative models; their core strengths are understanding (in-context prompt learning), reasoning (chain of thought), and values (reinforcement learning from human feedback). ChatGPT’s biggest breakthrough came with GPT-3, which has roughly 175 billion parameters and was trained on about 45 TB of data.
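
Shan Haijun’s description of a large model as a probabilistic generative model can be illustrated with a toy example. The sketch below, written for this article rather than taken from any speaker or vendor quoted here, is a minimal word-level bigram model: it predicts each next word purely from observed frequencies. A real LLM does this with a neural network over tokens and billions of parameters, but the generation principle, sampling the next item from a learned probability distribution, is the same.

```python
import random
from collections import defaultdict

# Toy "probabilistic generative model" of text: a word-level bigram model.
# Illustration of the sampling principle only, not the GPT architecture.
corpus = "the model learns the statistics of the training data".split()

# Count how often each word follows each other word.
transitions = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def sample_next(word: str) -> str:
    """Sample the next word in proportion to observed frequencies."""
    words, counts = zip(*transitions[word].items())
    return random.choices(words, weights=counts, k=1)[0]

# Generate a short continuation, one probabilistic step at a time.
word = "the"
generated = [word]
for _ in range(5):
    if word not in transitions:
        break
    word = sample_next(word)
    generated.append(word)

print(" ".join(generated))
```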

Figure: an overview of the datasets behind curated language models from GPT-1 to Gopher, 2018 to early 2022; unweighted sizes in GB. Credit: Alan D. Thompson

“OpenAI has always worked to obtain more high-quality data and to mine the existing data more deeply, which is how its capabilities keep growing,” Xiao Yanghua, a professor at Fudan University and director of the Shanghai Key Laboratory of Data Science, told The Paper on July 12. “Acquiring large-scale, high-quality, diverse data and analyzing it in depth may be one of the important paths for advancing large models.”

However, high-quality data is in short supply.

A study published last November by Epoch, a group of artificial intelligence researchers, estimated that machine learning datasets could exhaust all “high-quality language data” by 2026, and that was before the global boom in large models had even begun. According to the study, the language data in “high-quality” sets comes from “books, news articles, scientific papers, Wikipedia, and filtered web content.”

At the same time, the data collection practices that generative AI developers such as OpenAI use to train large language models are becoming increasingly controversial. At the end of June, OpenAI was hit with a class action lawsuit accusing it of stealing “a large amount of personal data” to train ChatGPT. Social media platforms including Reddit and Twitter have voiced dissatisfaction with the indiscriminate use of data from their sites; on July 1, Musk imposed a temporary limit on the number of tweets users could read for this reason.

In an interview with the technology and financial outlet Insider on July 12, Russell said that many reports, although unconfirmed, detail that OpenAI has purchased text datasets from private sources. While there are various possible explanations for such purchases, “the natural inference is that there is not enough high-quality public data.”

Some experts have suggested that new solutions may emerge before the data runs out: for example, a large model could continuously generate new data itself, run it through some quality filtering, and then use it for its own training, a practice called self-learning or “feeding back.” However, according to a paper posted on the preprint platform arXiv in May by researchers at the University of Oxford, the University of Cambridge, and Imperial College London, training AI on AI-generated data leads to irreversible defects in the model, a phenomenon they call model collapse. This means the high-quality data used for future model training will become more and more expensive, the web will grow fragmented and closed, and content creators will do their best to prevent their content from being crawled for free.
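
The “model collapse” effect described above can be sketched with a toy simulation. The code below is a loose illustration written for this article, not a reproduction of the Oxford, Cambridge, and Imperial College experiments: a simple one-dimensional Gaussian “model” is refit, generation after generation, on data it generated itself after a naive quality filter, and the distribution’s spread visibly shrinks as the tails disappear.

```python
import random
import statistics

# Toy illustration of "model collapse": each generation's model is fit
# only on data produced by the previous generation's model, after a
# naive "quality filter" that keeps only the most typical samples.
random.seed(0)

mu, sigma = 0.0, 1.0          # generation 0: the "real" data distribution
for generation in range(8):
    # The current model generates the next generation's training data.
    samples = [random.gauss(mu, sigma) for _ in range(10_000)]
    # Naive quality filtering: discard anything far from the mean.
    kept = [x for x in samples if abs(x - mu) < 1.5 * sigma]
    # Refit the model on its predecessor's filtered output.
    mu = statistics.fmean(kept)
    sigma = statistics.stdev(kept)
    print(f"generation {generation + 1}: std of learned model = {sigma:.3f}")
```

Run repeatedly, the printed standard deviation shrinks every generation: the model progressively forgets the rarer parts of the original distribution, which is the intuition behind the irreversible defects the paper describes.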

Clearly, acquiring high-quality data will only get harder. “Most of our data now comes from the Internet. Where will the data for the second half come from? I think this matters a great deal. In the end, it will come down to everyone sharing private data, or to you holding data that I do not have,” He Conghui, a young scientist at the Shanghai Artificial Intelligence Laboratory responsible for OpenDataLab, said at the 2023 World Artificial Intelligence Conference.

Wu Chao also told The Paper that whoever holds higher-quality data next, or can generate a steady stream of it, will hold the key to improving model performance.

The troubles of being “data-centric”

He Conghui believes that the paradigm of model development will gradually shift from “model-centric” to “data-centric.” But data-centricity has a problem: the lack of standards. The criticality of data quality is often mentioned, yet at present hardly anyone can say clearly what good data quality is or what the standard should be.

He Conghui has run into this problem in practice. “Our approach is to keep breaking the data down into finer and finer pieces; as each sub-field and sub-topic is carved out, quality standards for the data gradually get defined at that finer granularity. At the same time, it is not enough to look at the data alone; you have to look behind it. We tie each dataset to the model performance gains it is intended to produce, and together these form a data quality iteration mechanism.”
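
As a rough illustration of the kind of data-quality iteration loop He Conghui describes, the hypothetical sketch below splits a corpus into topic buckets, applies a per-bucket quality filter, and keeps the filter only if a stand-in downstream metric improves. All function names and scoring rules here are illustrative assumptions, not OpenDataLab’s actual pipeline.

```python
# Hypothetical per-topic data quality iteration loop.
# quality_score() and evaluate_model() are illustrative stand-ins,
# not OpenDataLab's actual metrics or training procedure.

def quality_score(doc: str) -> float:
    """Stand-in heuristic: less repetitive text scores higher."""
    words = doc.split()
    return len(set(words)) / max(len(words), 1)

def evaluate_model(training_docs: list[str]) -> float:
    """Stand-in for 'train a model on these docs and measure a benchmark'."""
    if not training_docs:
        return 0.0
    return sum(quality_score(d) for d in training_docs) / len(training_docs)

def iterate_bucket(bucket: list[str], threshold: float) -> list[str]:
    """Keep a bucket's quality filter only if the downstream metric improves."""
    baseline = evaluate_model(bucket)
    filtered = [d for d in bucket if quality_score(d) >= threshold]
    return filtered if evaluate_model(filtered) > baseline else bucket

corpus_by_topic = {
    "finance": ["rates rise as banks adjust policy", "buy buy buy buy buy"],
    "medicine": ["the trial enrolled two hundred patients", "data data data data"],
}

for topic, docs in corpus_by_topic.items():
    kept = iterate_bucket(docs, threshold=0.8)
    print(topic, f"kept {len(kept)} of {len(docs)} documents")
```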

Last year, the Shanghai Artificial Intelligence Laboratory, where He Conghui works, released OpenDataLab, an open data platform for artificial intelligence offering more than 5,500 high-quality datasets. “But this is only at the level of public datasets. We hope the large model corpus data alliance established two days ago can give research institutions and enterprises better channels for data circulation.”

On July 6, at the 2023 World Artificial Intelligence Conference, the large model corpus data alliance jointly initiated by the Shanghai Artificial Intelligence Laboratory, the China Institute of Scientific and Technological Information, Shanghai Data Group, the Shanghai Digital Business Association, the National Meteorological Center, China Central Radio and Television, the Shanghai Press Industry Group, and other organizations announced its formal establishment.

On July 7, the Shanghai Data Exchange officially launched a corpus section on its website, with nearly 30 corpus data products listed, spanning text, audio, image, and other modalities and covering finance, transportation, and medicine.

But building such a corpus is not something to be taken for granted. “Can it supply the high-quality corpora that large model companies need? Will the target parties be willing to open up their data?” Tang Qifeng, general manager of the Shanghai Data Exchange, said at the 2023 World Artificial Intelligence Conference that the difficulty lies mainly in two areas: the degree of openness and data quality.

Wei Zhilin said that data supply now faces multiple challenges: leading companies are reluctant to open up their data; there are widespread concerns about security mechanisms in the data sharing process; and doubts remain about the revenue distribution mechanism for the open circulation of data.

Specifically, data sharing must solve three problems, Lin Le, founder and CEO of Shanghai Lingshu Technology Co., Ltd., explained to The Paper. First, data is easy to falsify, so it must be guaranteed to be authentic and credible. Second, data is easy to copy, which blurs ownership, so blockchain is needed to confirm rights and authorize use. Third, data easily leaks privacy; blockchain can be combined with privacy-preserving computation so that data is “usable but not visible.”

How to solve revenue distribution

Tang Qifeng pointed out that for suppliers whose data quality is high but whose openness is low, the trust problem in corpus data circulation can be effectively addressed through the data transaction chain. “One of the core issues is property rights, and how benefits are distributed once data goes into a large model.”

Lin Changle, executive vice president of Tsinghua University’s Institute for Interdisciplinary Information Core Technology, is designing a theoretical framework for how to price data and distribute the resulting benefits.

“To some extent, something like ChatGPT may consume a large share of human knowledge for free within a few months. We see large models learn from certain writers’ articles and produce pieces in the same style, or generate paintings in the style of Van Gogh, without paying for it; the people who supplied that data have not benefited,” Lin Changle said at the 2023 World Artificial Intelligence Conference. Hence a more radical view: intellectual property does not exist in the era of large models, or rather, traditional intellectual property protection no longer works.

However, Lin Changle believes that in the era of large models, intellectual property protection will evolve toward the confirmation, pricing, and trading of data rights. “As the development of large models goes deeper, for example into industry-specific models, the data required is no longer free, open data from the Internet. Training a highly accurate model calls for industry expertise and even knowledge that amounts to trade secrets. For everyone to contribute to such a corpus, there must be a mechanism for distributing rights and interests.”

The “data asset map” Lin Changle is now working on aims to prove, mathematically, a revenue distribution mechanism that allocates data rights and interests fairly.

How to solve data circulation

Liu Quan, deputy chief engineer of the CCID Research Institute under the Ministry of Industry and Information Technology and a foreign academician of the Russian Academy of Natural Sciences, said at WAIC’s “Integration of the Digital and the Real, Intelligence Leading the Future” industrial blockchain ecology forum that the recently issued Beijing version of the “Twenty Data Articles” has drawn a strong response in the industry because it addresses core problems in data circulation. Most notably, it clarifies who owns government data: public data belongs to the government. What about corporate data and personal data? “The Beijing Municipal Data Exchange can be entrusted to operate it.”

On July 5, the Beijing Municipal Committee of the Communist Party of China and the Beijing Municipal People’s Government issued the “Implementation Opinions on Making Better Use of Data Elements and Further Accelerating the Development of the Digital Economy.” The “Implementation Opinions” is divided into nine parts, constructing a foundational data system across data property rights, circulation and transactions, revenue distribution, and security governance, and puts forward 23 specific requirements; the industry calls it the Beijing version of the “Twenty Data Articles.”

“Domestically, statistics suggest that 80% of data resources are concentrated in public and government institutions. To solve data supply, we largely hope to build on the ‘Twenty Data Articles’ (the ‘Opinions of the CPC Central Committee and the State Council on Building a Basic Data System to Better Play the Role of Data Elements’) so that the open sharing of public data forms a set of replicable mechanisms and paradigms, letting data generated in public services go on to serve the public,” Wei Zhilin said.

Wei Zhilin said that by current estimates China’s overall stock of data resources ranks second in the world, but the data is scattered across many places. According to Zhan Yubao, deputy director of the Digital China Research Institute at the State Information Center, speaking at the 2023 World Artificial Intelligence Conference on July 7, China’s national data circulation system currently comprises two data exchanges, the Shanghai Data Exchange and the Shenzhen Data Exchange, plus 17 data exchange centers, including the Beijing Data Exchange Center.
