Generative AI Development Faces a Training Data Acquisition Emergency

JK1612 
Created at Aug 28, 2024 02:44:47
Updated at Sep 21, 2024 13:53:15 

Access to data, the most important ingredient in generative artificial intelligence (AI) development, is under threat. Until now, AI developers have trained their models on data scraped from across the web by automated crawling programs (bots). Recently, content companies, including media companies, have begun blocking those crawlers from their websites. Previously, they responded to unauthorized use of their content by suing AI companies for copyright infringement; now they are blocking data collection itself.

Big tech companies need as much training data as possible to improve the performance of their AI models, but there are concerns that both the amount and the quality of the data available to them are declining. Epoch AI, an American AI research institute, has even predicted that if the current trend continues, it will become almost impossible to obtain new AI training data sometime between 2026 and 2032. In other words, training on new data will become increasingly difficult for developers who do not properly pay for copyrighted content.

Recently, the U.S. cybersecurity company Cloudflare launched a free tool that prevents data from being scraped without authorization. It blocks crawlers run by companies such as OpenAI, Google, and Apple from accessing a site without the consent of the website owner. The company said, "It will provide tools to prevent malicious operators from crawling websites on a large scale."

Reddit, the largest online community in the U.S. and a heavily used source of AI training data, strengthened its anti-crawling tools last month. Reddit signed contracts this year to provide its content to Google and OpenAI for a fee, but it is working to strictly prevent unauthorized crawling of that content.

Media companies, which hold high-quality data, have been especially quick to block data collection. According to Reuters, 638 of 1,165 media companies surveyed, more than half, had blocked crawlers from OpenAI, Google, and Common Crawl, a non-profit data collection organization, as of the end of last year.

According to the Data Provenance Initiative (DPI), a research group led by MIT researchers, 5 percent of the 14,000 websites used to collect AI training data blocked crawler access last year. Among sources of high-quality content such as the media, 25 percent prohibit crawlers. The DPI said, "The number of measures to ban data collection across online websites is increasing rapidly."
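To make "blocking crawler access" concrete, here is a minimal sketch of one common mechanism, a robots.txt file, checked with Python's standard `urllib.robotparser`. The article does not say which mechanism each site uses, so treating robots.txt as the blocking method is an assumption. The file contents, the `example.com` URL, and the helper name `crawl_permissions` are hypothetical; the crawler names are only illustrative user agents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site (not taken from any real site).
# It turns away two AI crawlers by name and leaves every other bot unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    permissions = crawl_permissions(
        ROBOTS_TXT,
        ["GPTBot", "ClaudeBot", "Googlebot"],
        "https://example.com/news/article-1",
    )
    for agent, allowed in permissions.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is only a request: it works when a crawler chooses to honor it, which is why sites also turn to enforcement tools such as the Cloudflare and Reddit measures described above.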

With the spread of crawler blocking, AI model developers are having difficulty securing data. New data keeps appearing online, but not fast enough to keep up with the demand for AI training data. OpenAI's GPT-3, released in 2020, was trained on about 300 billion tokens (the smallest unit of text an AI model learns from). GPT-4, launched three years later, was trained on about 12 trillion tokens, a 40-fold increase. Meta's generative AI model Llama 3 was trained on more than 15 trillion tokens this year. According to the research institute Epoch AI, GPT-5 is expected to require about 60 trillion tokens, and even with all the high-quality data currently available, it may still be short by 10 to 20 trillion tokens or more.
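The token figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article. The "implied available" range is an inference from those numbers (expected need minus shortfall), not a figure reported by any of the sources, and the function names are illustrative.

```python
BILLION = 10**9
TRILLION = 10**12

# Figures as quoted in the article (approximate).
GPT3_TOKENS = 300 * BILLION
GPT4_TOKENS = 12 * TRILLION
GPT5_EXPECTED_TOKENS = 60 * TRILLION
SHORTFALL_LOW = 10 * TRILLION
SHORTFALL_HIGH = 20 * TRILLION


def growth_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times larger the newer training set is than the older one."""
    return new_tokens / old_tokens


def implied_available(needed_tokens: int, shortfall_tokens: int) -> int:
    """Tokens that would be available if `needed` exceeds supply by `shortfall`."""
    return needed_tokens - shortfall_tokens


if __name__ == "__main__":
    print(f"GPT-3 to GPT-4 growth: {growth_factor(GPT3_TOKENS, GPT4_TOKENS):.0f}x")
    high = implied_available(GPT5_EXPECTED_TOKENS, SHORTFALL_LOW) // TRILLION
    low = implied_available(GPT5_EXPECTED_TOKENS, SHORTFALL_HIGH) // TRILLION
    print(f"Implied high-quality supply: {low} to {high} trillion tokens")
```

Read this way, a shortfall of 10 to 20 trillion tokens against a 60 trillion token requirement implies roughly 40 to 50 trillion usable high-quality tokens, under the assumption that the quoted figures are directly comparable.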

The New York Times said, "As media companies, creators, and copyright holders restrict data collection, AI developers who need to constantly secure high-quality data to keep AI models up to date are feeling threatened."

Before the advent of generative AI, creators did not know how data obtained by crawling was being used. As the generative AI boom took hold, the value of data rose, and media companies and creators began demanding fair compensation for it and became increasingly reluctant to allow crawling.

Criticism of AI developers' crawling practices has been intensifying in recent years, because it has emerged that developers have continued to collect data through crawling despite growing demands to respect copyright. According to a recent Business Insider report, OpenAI and Anthropic have been found to bypass tools that websites use to prevent crawling. Perplexity, an AI search startup backed by Amazon founder Jeff Bezos and Nvidia, has also reportedly collected data by bypassing the anti-crawling tools of the publications Wired and Forbes.


Key Takeaways: Data Scarcity and Access Issues in Generative AI Development

  • Data is becoming increasingly scarce for AI models: Content companies are actively blocking access to their websites, preventing AI developers from using their content to train models.
  • Copyright concerns are driving the shift: Media companies and creators are demanding fair compensation for their content, leading to a reluctance to allow crawling.
  • AI developers are facing a data crunch: While data is constantly being generated online, the demand for high-quality data for training powerful AI models is outpacing the supply.
  • Data acquisition is becoming more expensive: AI companies are now forced to pay for copyright access, making data acquisition more expensive and potentially limiting the scale of future models.
  • Anti-crawling tools are on the rise: Companies like Cloudflare and Reddit are implementing measures to prevent unauthorized data collection, further restricting access.
  • Ethical concerns are coming to the forefront: AI developers are facing criticism for bypassing anti-crawling tools and collecting data without proper consent.
  • The future of AI development is at stake: The scarcity of high-quality data could significantly impact the development of future AI models, particularly those requiring massive datasets for training.
  • The value of data is being recognized: The rise of generative AI has highlighted the importance of data and its economic value, leading to a shift in how data is accessed and used.


Tags: AI Crawling Bots Amazonbot Anti-Crawling Tools Applebot ByteSpider ClaudeBot Copyright DPI Data Provenance Initiative GPTBot Generative AI GoogleOther ImagesiftBot
