A huge dataset used to train AI tools that detect not-safe-for-work (NSFW) content contained child sexual abuse material (CSAM) until a child protection charity examined it and had it taken down.

The NudeNet dataset contains more than 700,000 images scraped from various areas of the web, including social media, image hosting services, and pornography websites.

A child protection charity has discovered that roughly 0.1% of the dataset is made up of child sexual abuse material.

The Canadian Centre for Child Protection (C3P) found around 680 images in the NudeNet dataset that are suspected or confirmed to depict child sexual abuse or exploitation.

Using tools from Project Arachnid, a “victim-centric set of tools to combat the growing proliferation of CSAM on the internet,” C3P found that there were images of known survivors, up-close photos of intimate areas, and pictures depicting sexually abusive acts.

Over 120 images were found to depict known victims of child sexual abuse, including victims in Canada and the United States.

Almost 70 images showed the genitalia and anuses of prepubescent children, and 130 showed those of post-pubescent children.

In some extreme cases, there were images of children and teenagers engaging in sexually abusive acts.

Anyone who downloaded the dataset would have unknowingly downloaded CSAM, which is illegal, warns 404 Media, which first reported the story.

However, researchers and developers wouldn’t know this unless they thoroughly combed through the dataset.

According to C3P, the charity asked Academic Torrents, the platform that made the dataset available, to remove the content, and the request was successful.

This situation feels like déjà vu, as a similar incident happened in 2023, when the Stanford Internet Observatory identified CSAM in LAION-5B, a large dataset used to train models like Stable Diffusion and Google’s Imagen.

The issue with large datasets like this is that the information is scraped from all parts of the web, which means that much of the content can’t be properly vetted.

This naturally leads to illegal and harmful content finding its way into datasets that are then used, innocently, by researchers, AI developers, and even the general public in the case of LAION-5B.

