Personal pictures of kids used to train AI

Personal pictures of Brazilian children are being used to train popular AI tools, Human Rights Watch (HRW) has revealed.

The photos are being scraped off the internet into a large dataset that companies then use to train their AI tools without the children’s knowledge or consent, HRW said. These tools can then be used to create malicious deepfakes, putting more children at risk of exploitation and harm, it warned.

“Children should not have to live in fear that their photos might be stolen and weaponized against them,” said Hye Jung Han, children’s rights and technology researcher and advocate at HRW.


“The government should urgently adopt policies to protect children’s data from AI-fueled misuse.”

HRW said it found that the openly accessible LAION-5B dataset contains links to identifiable pictures of Brazilian children. In many cases, their identities are easily traceable: some photos are accompanied by captions at the URL where the image is stored that list the children’s names, along with information about where and when the photo was taken.

LAION, a German non-profit that manages the dataset, pledged to remove the pictures. However, it said that children and their guardians were responsible for removing their personal photos from the internet, which it noted was the most effective protection against misuse.

Many of the pictures reviewed by HRW included “intimate moments” that span the entirety of childhood, the organization said. These included birth, young children blowing out candles on their birthday cake or dancing in their underwear at home, and teenagers posing for photos at school events.

Many of these photos were originally seen by few people and appear to have previously had a measure of privacy, HRW said. It noted that the images do not appear to be findable through an online search and that many were taken years before the dataset was created.

Upon its release in March 2022, LAION-5B was the largest freely available image-text dataset in the world. It was used to train a number of high-profile AI models, including Stable Diffusion and Google’s Imagen.

The dataset was built from billions of images scraped from the internet. Research published in late 2023 found that “thousands” of those pictures contained child sexual abuse material.

According to the Stanford Internet Observatory, which carried out the analysis, it was the result of “unguided crawling that includes a significant amount of explicit material.”


In response, LAION said it had a “zero tolerance policy for illegal content” and temporarily took LAION-5B, as well as its earlier version LAION-400M, offline.

While efforts to ban the nonconsensual use of AI to generate sexually explicit images of people, including children, were “urgent and important,” governments should also prohibit scraping children’s personal data into AI systems, HRW said. The nonconsensual digital replication or manipulation of children’s likeness should also be banned, it said.