The story opens with OpenAI, which, desperate for training data, reportedly developed its Whisper audio transcription model to get over the hump, transcribing more than a million hours of YouTube videos to train GPT-4, its most advanced large language model. That’s according to The New York Times, which reports that the company knew this was legally questionable but believed it to be fair use. OpenAI president Greg Brockman was personally involved in collecting the videos that were used, the Times writes.
OpenAI spokesperson Lindsay Held told The Verge in an email that the company curates “unique” datasets for each of its models to “help their understanding of the world” and maintain its global research competitiveness. Held added that the company uses “numerous sources including publicly available data and partnerships for non-public data,” and that it’s looking into generating its own synthetic data.