Child sexual abuse material (CSAM) has been found in LAION, a major dataset used to train AI models.
The Stanford Internet Observatory identified thousands of images of child sexual abuse in the LAION-5B dataset, which underpins many different AI models.
The report shows that AI models such as Stable Diffusion and Google’s Imagen “were trained on billions of scraped images in the LAION-5B dataset.” This dataset is said to have been created through “unguided crawling that includes a significant amount of explicit material.”
These images have allowed AI systems to generate realistic, explicit images of imaginary children, and to transform photos of clothed individuals into fake nude images.
Previous Stanford Internet Observatory reports had concluded that machine-learning models could produce CSAM, but that work assumed this was only possible by combining “two concepts,” such as children and explicit acts.
Although LAION attempted to classify content as sexually explicit and to flag explicit material depicting minors, models were nonetheless trained on a mix of benign and graphic content.
The report concludes that possessing a copy of the LAION-5B dataset implies possessing “thousands of illegal images – not including all of the intimate imagery published and gathered non-consensually.”
There is no direct evidence tying any individual CSAM image to a specific model output, and the influence of any single image is likely slim.
Despite LAION’s stated “zero tolerance policy for illegal content,” a multitude of CSAM images were present in its open-source dataset.
The LAION-5B dataset has since been taken offline, and the non-profit says it is working closely with the Internet Watch Foundation, a charity dedicated to protecting children worldwide by removing and preventing abusive content online.