OpenAI reportedly used over a million hours of YouTube videos to train its AI model

Google, the owner of YouTube, said it had encountered "unverified reports" regarding OpenAI's actions.

By Shantanu Poswal

OpenAI, led by Sam Altman, reportedly transcribed more than a million hours of YouTube videos as part of its training regimen for the GPT-4 AI model, according to a recent report in The New York Times. The report noted that OpenAI recognized the potential legal issues with this undertaking but defended the practice under the concept of "fair use."
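
The article does not name the transcription tooling, but OpenAI's own open-source Whisper speech-to-text model is the kind of system such a pipeline could use. A minimal, hypothetical sketch of transcribing a single locally saved audio file with the openai-whisper Python package (the filename is a placeholder):

```python
import whisper  # pip install openai-whisper

# Load a small pretrained speech-to-text model.
model = whisper.load_model("base")

# Transcribe a locally saved audio file; the path is a placeholder,
# not a reference to any actual dataset described in the report.
result = model.transcribe("sample_video_audio.mp3")
print(result["text"])
```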

According to the report, OpenAI president Greg Brockman was personally involved in collecting the videos used for training, underscoring the significance of the project and the direct oversight the company exerted over its data collection.

In response to inquiries, an OpenAI spokesperson told The Verge that the company uses a variety of data sources, including publicly available data and proprietary partnerships, to remain competitive in global AI research.

Despite these assertions, Google pointed to its own safeguards, emphasizing that both its robots.txt files and YouTube's Terms of Service explicitly prohibit unauthorized scraping or downloading of YouTube content. This stance reflects the legal and ethical considerations surrounding data usage in AI development.
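
For context, robots.txt is a plain-text file that tells automated crawlers which paths on a site they may fetch. A minimal sketch, using only Python's standard library, of how a well-behaved crawler might consult YouTube's robots.txt before downloading anything (the user-agent string and video URL below are hypothetical placeholders):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse YouTube's robots.txt.
parser = RobotFileParser()
parser.set_url("https://www.youtube.com/robots.txt")
parser.read()

# Check whether a crawler may fetch a given watch page.
# "ExampleBot/1.0" is a placeholder user agent, not a real crawler.
url = "https://www.youtube.com/watch?v=example"
if parser.can_fetch("ExampleBot/1.0", url):
    print("robots.txt permits fetching", url)
else:
    print("robots.txt disallows fetching", url)
```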

Notably, earlier reports, including one from The Information last year, have shed light on OpenAI's use of YouTube data to train AI models. While this approach draws on a vast pool of multimedia content, it raises questions about data rights, privacy, and the boundaries of fair use in AI research and development.