The web is flooded with low-quality machine-translated content, particularly in languages of the Global South, according to new research.
A significant portion of the content online is generated using machine translation (MT), which raises “serious concerns” about the training of large language models that rely on data scraped from the web, the Amazon-led study has warned.
Researchers found that more than half of the sentences in their web-crawled corpus are multi-way parallel translations – sentences translated into three or more languages – with the low quality of these translations indicating they were likely created using MT.
The study used a set of 6.4 billion unique sentences in 90 languages – the largest multi-way corpus to date, according to the paper published on arXiv, an open-access research platform.
It found that machine-translated content was distributed differently between “high-resource” languages such as English and French and “lower-resource” languages, including many African and other Global South languages.
“Multi-way parallel, machine-generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages,” researchers said.
The study warns that large language models trained on lower-quality machine-translated content could be less accurate, less fluent, and prone to more hallucinations.
“We also find evidence of a selection bias in the type of content which is translated into many languages, and therefore over-represented in lower resource languages,” researchers said.
Such content includes low-quality text translated “en masse” from English to other languages and aimed at generating ad revenue.
Machine-translated content is also “shorter, more predictable, and has a different topic distribution compared to content translated into a single language,” which could further affect artificial intelligence models in those languages, the researchers concluded.