OpenAI seeks partnerships to get access to publicly unavailable data


The AI company has announced that it is looking to partner with organizations to produce public and private datasets for training AI models. The aim is to broaden AI's understanding across all subject areas.

According to OpenAI, for AI to deeply understand all industries, cultures, and languages, it needs as broad a training dataset as possible.

“Modern AI technology learns skills and aspects of our world – of people, our motivations, interactions, and the way we communicate – by making sense of the data on which it’s trained,” writes the company.

OpenAI invites organizations and other interested parties to share large-scale datasets that reflect human society and are not already easily accessible online to the public. The data will go either into an open-source archive that is publicly available for AI model training, or into private datasets used to train proprietary AI models.

Submitted data can be in text, image, audio, or video formats. The company states that it has tools to transcribe and digitize PDFs, along with other ways to process raw data.

OpenAI says it is not seeking datasets that contain sensitive or personal information, or information belonging to a third party, and that it can help remove such information from submitted data.

Expanding the data that AI models are trained on should improve their understanding of a particular domain or topic.

“We’re already working with many partners who are eager to represent data from their country or industry,” says the company.

OpenAI has collaborated with the Icelandic Government and Miðeind ehf to enhance GPT-4's proficiency in Icelandic by incorporating their curated datasets. OpenAI has also joined forces with Free Law Project, a non-profit organization dedicated to democratizing access to legal knowledge, and has included its extensive collection of legal documents in AI training.
