When the internet appears to be too small, cut corners. It seems that’s exactly what OpenAI has been doing in order to find data for its shiny artificial intelligence systems.
In a new report, The New York Times says that OpenAI transcribed over a million hours of YouTube videos to train GPT-4, its most advanced large language model.
The AI lab decided to do this because, in late 2021, it was desperate for training data, having exhausted most of the high-quality English-language text available on the web. OpenAI then created a speech recognition tool called Whisper that could transcribe the audio from YouTube videos – and got to work.
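Whisper has since been released as open-source software. The Times report doesn't detail OpenAI's internal pipeline, but a minimal sketch of what transcription with the public Python package looks like might read as follows (the model size and file name below are illustrative assumptions, not OpenAI's production setup):

```python
# Minimal sketch: transcribe a local audio file with the open-source
# Whisper package (pip install openai-whisper). The model size ("base")
# and the file name are placeholders chosen for illustration.
import whisper

model = whisper.load_model("base")              # load a pretrained Whisper checkpoint
result = model.transcribe("episode_audio.mp3")  # run speech-to-text on the audio file
print(result["text"])                           # the transcript as plain text
```

At that point, the transcript is simply more text to feed into a training corpus – which, according to the report, is how OpenAI used it for GPT-4.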
According to The New York Times, the company knew perfectly well that it was operating in a gray area of AI copyright law but believed its approach was covered by fair use. Reportedly, OpenAI’s president, Greg Brockman, was personally involved in collecting the videos.
Only last year, OpenAI announced that it was seeking partnerships with organizations to produce public and private datasets for training AI models, after several news outlets blocked AI firms from harvesting their content.
One could, of course, debate the logic of these bans, but YouTube’s Terms of Service, at the very least, specifically prohibit scraping the platform’s content without permission.
“You are not allowed to access the Service using any automated means (such as robots, botnets, or scrapers) except: (a) in the case of public search engines, in accordance with YouTube’s robots.txt file; (b) with YouTube’s prior written permission; or (c) as permitted by applicable law,” say the terms.
YouTube CEO Neal Mohan has addressed the possibility that OpenAI used YouTube videos to train its Sora model: in an interview with Bloomberg, he said doing so would be a “clear violation” of the platform’s policies.
To be sure, The New York Times says Google also collected transcripts from YouTube, though the platform is owned by the tech giant. The report adds that Google has also considered expanding what it could do with consumer data from tools such as Google Docs.
Meta, meanwhile, discussed buying the publishing house Simon & Schuster last year to procure long-form works to train its AI models on, according to The Times.
In short, the race to lead AI has indeed become a desperate hunt for data. The forest, though, is getting smaller every day: last week, The Wall Street Journal cited sources saying the industry’s demand for high-quality text data could outstrip supply within two years, slowing AI’s development.
Leading chatbots have learned from pools of digital text containing as many as three trillion words, or around twice the number of words stored in Oxford University’s Bodleian Library, which has been collecting manuscripts since 1602.