New generative AI model will plug language gap in the field


Most models powering generative AI tools today are trained on data in English and Mandarin Chinese. A new model, able to follow instructions in more than 100 languages, is an attempt to fill that gap.

The open-source multilingual large language model (LLM) called Aya was released this week by Cohere for AI, the nonprofit AI research lab at Cohere, a Canadian tech company.

The firm says the new model has the potential to provide open access to the powerful technology for billions of people. It’s the result of a long project involving 3,000 researchers in more than 100 countries, and it covers more than twice as many languages as other existing open-source models.

Perhaps most important, the Aya Collection consists of 513 million prompts and completions that were curated and annotated by fluent speakers in multiple languages.

“Many languages in this collection had no representation in instruction-style datasets before. The fully permissive and open-sourced dataset includes a wide spectrum of language examples, encompassing a variety of dialects and original contributions that authentically reflect organic, natural, and informal language use,” said Cohere for AI.

The researchers said that by focusing primarily on English and one or two dozen other languages as training resources, most models tend to reflect inherent cultural bias. The Aya project was started to address this gap.

The data sources include machine translations of several existing datasets into more than 100 languages. Half of those languages are considered underrepresented or unrepresented in existing text datasets, including Azeri, Welsh, Bemba, Somali, Uzbek, and Gujarati.

Other open-source multilingual models are also in development, including BLOOM, which can generate text in 46 languages, a model covering African languages, and a bilingual Arabic-English LLM called Jais.

However, according to the team, Aya outperforms other existing open-source multilingual models when evaluated by humans or using GPT-4 and is “a massive leap forward.”

Researchers told Axios that the team envisioned Aya being used for language research and to preserve and represent languages and cultures at risk of being left out of AI advances.
