Generative AI Development Faces a Training Data Acquisition Emergency

JK1152 
Created at Aug 28, 2024 02:44:47
Updated at Sep 21, 2024 13:53:15 

Securing data, the most important ingredient in generative artificial intelligence (AI) development, has become an urgent problem. Until now, AI developers have trained their models on data scraped from across the web by automated crawling programs (bots). Recently, however, content companies, including media outlets, have begun blocking access to their websites. Previously, they sued AI companies for copyright infringement when content was used without permission; now they are blocking data collection itself.

Big tech companies need as much training data as possible to improve the performance of their AI models, but there are concerns that both the amount and the quality of the data available to them are declining. Epoch AI, an American AI research institute, has even predicted that if the current trend continues, it will become almost impossible to obtain new AI training data sometime between 2026 and 2032. In other words, acquiring new training data will become increasingly difficult for developers who do not pay properly for copyrighted material.

Recently, the U.S. cybersecurity company Cloudflare launched a free tool that blocks unauthorized data scraping. It stops crawlers from companies such as OpenAI, Google, and Apple from accessing a site without the website owner's consent. The company said, "It will provide tools to prevent malicious operators from crawling websites on a large scale."

Reddit, the largest online community in the U.S. and a heavily used source of AI training data, strengthened its anti-crawling tools last month. Reddit signed contracts this year to provide its content to Google and OpenAI for a fee, but it is trying to strictly prevent unauthorized crawling of its content.

In particular, media companies, which hold high-quality data, have already blocked data collection. According to Reuters, as of the end of last year 638 of 1,165 media companies, more than half, had blocked crawling of their sites by OpenAI, Google, and Common Crawl, a non-profit data collection organization.

According to the Data Provenance Initiative (DPI), a research group led by MIT researchers, 5 percent of the 14,000 websites used to collect AI training data blocked crawler access last year. Among sources of high-quality content such as media outlets, 25 percent prohibit crawlers. The DPI said, "The number of measures to ban data collection across online websites is increasing rapidly."
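The article does not say how each site blocks crawlers, but one common mechanism is a robots.txt file that names the crawler user agents a site disallows. The following is a minimal sketch of how such rules can be checked, assuming a made-up robots.txt and the crawler names GPTBot (OpenAI) and CCBot (Common Crawl). The helper name `blocked_agents` and the example URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the user agents that the given robots.txt disallows for url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    agents = ["GPTBot", "CCBot", "Googlebot"]
    print(blocked_agents(ROBOTS_TXT, agents, "https://example.com/news/story"))
```

Note that robots.txt is only a request: it works when a crawler chooses to honor it, which is why the bypassing reported later in this article matters.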

With "crawler blocking" spreading, AI model developers are having difficulty securing data. New data keeps appearing online, but not fast enough to meet the demand for AI training data. OpenAI's GPT-3, released in 2020, was trained on about 300 billion tokens (a token is the smallest unit of text a model processes). GPT-4, launched three years later, was trained on about 12 trillion tokens, a 40-fold increase. Meta's generative AI model Llama 3, released this year, was trained on more than 15 trillion tokens. According to Epoch AI, GPT-5 is expected to require about 60 trillion tokens, but even with all the high-quality data currently available, it may fall short by 10 to 20 trillion tokens or more.
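The token figures above can be checked with simple arithmetic. The sketch below uses only the numbers quoted in this article. The function names are hypothetical, and the "implied available" range is an inference from the stated shortfall, not a figure reported by the source.

```python
BILLION = 10**9
TRILLION = 10**12


def growth_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times larger the newer training set is."""
    return new_tokens / old_tokens


def implied_available(required: int, shortfall_low: int, shortfall_high: int) -> tuple[int, int]:
    """Range of usable tokens implied by a requirement and a shortfall range."""
    return required - shortfall_high, required - shortfall_low


if __name__ == "__main__":
    # GPT-3 (about 300 billion tokens) versus GPT-4 (about 12 trillion tokens).
    factor = growth_factor(300 * BILLION, 12 * TRILLION)
    print(f"GPT-3 to GPT-4 growth: {factor:.0f}x")

    # GPT-5 estimate of about 60 trillion tokens, short by 10 to 20 trillion.
    low, high = implied_available(60 * TRILLION, 10 * TRILLION, 20 * TRILLION)
    print(f"Implied available tokens: {low // TRILLION} to {high // TRILLION} trillion")
```

With the article's figures this gives a 40-fold increase from GPT-3 to GPT-4 and an implied supply of roughly 40 to 50 trillion high-quality tokens.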

The New York Times said, "As media companies, creators, and copyright holders restrict data collection, AI developers who need to constantly secure high-quality data to keep AI models up to date are feeling threatened."

Before the advent of generative AI, creators did not know how data obtained by crawling was being used. As the generative AI boom took off, the value of data rose, and as media companies and creators began demanding fair compensation for it, their reluctance to allow crawling grew.

Recently, criticism of AI developers' crawling practices has been intensifying. This is because AI developers have been found to keep collecting data through crawling despite growing demands to respect copyright. According to a recent Business Insider report, OpenAI and Anthropic were found to be bypassing tools that websites use to prevent crawling. Perplexity, an AI search startup backed by investors including Nvidia and Amazon founder Jeff Bezos, was also found to have collected data by bypassing the anti-crawling tools of the magazines Wired and Forbes.


Key Takeaways: Data Scarcity and Access Issues in Generative AI Development

  • Data is becoming increasingly scarce for AI models: Content companies are actively blocking access to their websites, preventing AI developers from using their content to train models.
  • Copyright concerns are driving the shift: Media companies and creators are demanding fair compensation for their content, leading to a reluctance to allow crawling.
  • AI developers are facing a data crunch: While data is constantly being generated online, the demand for high-quality data for training powerful AI models is outpacing the supply.
  • Data acquisition is becoming more expensive: AI companies are now forced to pay for copyright access, making data acquisition more expensive and potentially limiting the scale of future models.
  • Anti-crawling tools are on the rise: Companies like Cloudflare and Reddit are implementing measures to prevent unauthorized data collection, further restricting access.
  • Ethical concerns are coming to the forefront: AI developers are facing criticism for bypassing anti-crawling tools and collecting data without proper consent.
  • The future of AI development is at stake: The scarcity of high-quality data could significantly impact the development of future AI models, particularly those requiring massive datasets for training.
  • The value of data is being recognized: The rise of generative AI has highlighted the importance of data and its economic value, leading to a shift in how data is accessed and used.


Tags: AI Crawling Bots, Amazonbot, Anti-Crawling Tools, Applebot, ByteSpider, ClaudeBot, Copyright, DPI, Data Provenance Initiative, GPTBot, Generative AI, GoogleOther, ImagesiftBot
