Data leak reveals auto giant and others harvesting user data to train AI models

Van Mossel, the biggest auto dealer in Benelux, and other companies used the services of an obscure data analytics company to gather data for training AI models. That company left their clients' data exposed to anyone on the internet.

On February 1st, our research team uncovered a concerning misconfiguration on systems belonging to Rawdamental, a data collection and analysis company, that caused a leak of personal data.

Even though Rawdamental could not be found in the Dutch company register, its services have been used by numerous Dutch companies. The discovered security incident affected users of ten companies that likely used its data-gathering services, including multinational auto dealer Van Mossel, which employs nearly 7,000 people.

Companies affected by the leak:

Auto dealer – Van Mossel
Software companies – Simpul.nl and Divtag.nl
Marketplace for motorcycle parts – Motorparts-online.com
Marketing agency – InovaMedia
Fireworks retailer – Vuurwerkbestel.nl
Interior retailer – Oletti.nl
Christmas gift services – Kerstpakkettenexpress.nl and kerstcomplimenten.nl
Netherlands’ Motorsport fan club – Ttassen-fanbase.com

The harvested user data was meant to provide Rawdamental’s clients with a starting dataset to train AI models to predict user behavior. While the ethics of using corporate AI models is debatable, the current data leak shows that the security of such services is far from assured.

Cybernews reached out to the companies that used Rawdamental services but has yet to receive a response.

Our investigation determined that the leak was caused by missing authentication on the company’s Kibana dashboard, a popular online tool for searching, visualizing, and analyzing stored data.

The missing authentication went unnoticed, leaving data publicly accessible since December 2021.

The company has not responded to attempts by either Cybernews or the Computer Emergency Response Team (CERT) in the Netherlands to contact it.

Just before publishing the article, researchers noticed that the company had closed the instance.

Training AI models on private data

Rawdamental’s business model is based on gathering huge amounts of data for its clients to create unique profiles of website visitors. By collecting clickstream data, the company compiles large volumes of data on users' journeys and behavior, which companies can later use to train their AI models, as claimed on the Rawdamental website.

Image: Leaked IPs, user agents, and fingerprints

Using such datasets to train AI is dangerous. Our investigation of the leaked traffic data revealed that the gathered data included private user information. A model trained on private data may reveal sensitive information without user consent.

“This is a well-known risk with AI tools in workplaces, which has led multiple organizations to ban their use, fearing that sensitive company information might be leaked to the tool's operator. This leak also serves as a reminder that such risks are also present in traditional online tools,” said Aras Nazarovas, a security researcher at Cybernews.

Leaked web traffic included:

Users' IP addresses
Accessed URLs
Titles of visited pages
User agents
In some cases, user names and projects they have been working on
Unique user identifiers, created based on different types of metadata
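
To illustrate why an identifier built from metadata still counts as personal data, here is a minimal sketch of how such an identifier could be derived. The article does not say how Rawdamental built its identifiers, so the function name, the choice of inputs, and the hashing scheme below are all assumptions made for illustration.

```python
import hashlib


def fingerprint(ip: str, user_agent: str, accept_language: str) -> str:
    """Hypothetical example: derive a stable identifier from request metadata.

    The inputs and the hashing scheme are assumptions, not Rawdamental's
    actual method.
    """
    # Join the metadata with a separator so field boundaries stay unambiguous.
    raw = "|".join([ip, user_agent, accept_language])
    # Hash the combined string and keep a short, fixed-length prefix.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


# Made-up values; 203.0.113.0/24 is a range reserved for documentation.
visit_one = fingerprint("203.0.113.57", "Mozilla/5.0 (X11; Linux x86_64)", "nl-NL")
visit_two = fingerprint("203.0.113.57", "Mozilla/5.0 (X11; Linux x86_64)", "nl-NL")
other_user = fingerprint("203.0.113.58", "Mozilla/5.0 (X11; Linux x86_64)", "nl-NL")
```

Because the same metadata always produces the same value, repeat visits from one browser can be linked together even though no name is stored, which is why leaked identifiers of this kind remain sensitive.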

Failure to anonymize user data

Apart from the obvious cybersecurity loopholes that caused a data leak and created a treasure trove for threat actors, another major concern is the company's poor anonymization of user data.

“For services such as Rawdamental, it is crucial to anonymize user data. Despite the company's intention to anonymize user data, the investigation revealed that they failed to anticipate all potential scenarios,” said Nazarovas.

Image: Requests revealing personal and project information

One example involved the company's client platforms, which are most likely dedicated to accounting. Some platforms displayed personally identifiable information, such as names and projects, in the page's title tag, which is shown in the browser tab.

Rawdamental had not implemented safeguards for such scenarios, so sensitive user data was collected. Furthermore, user IP addresses were also present in the gathered data, indicating another failure to fully anonymize the dataset.
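
The two gaps described above, page titles carrying names and full IP addresses being stored, can be sketched as a cleaning step applied before a clickstream event is saved. This is only an illustration under stated assumptions: the field names (`ip`, `page_title`, `url`) and the truncation widths are hypothetical, and nothing here reflects Rawdamental's actual schema or pipeline.

```python
import ipaddress
from urllib.parse import urlsplit


def truncate_ip(ip: str) -> str:
    """Zero the host part of an address: keep /24 for IPv4, /48 for IPv6.

    The prefix lengths are an assumption chosen for illustration.
    """
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def anonymize_event(event: dict) -> dict:
    """Return a copy of a clickstream event with identifying details reduced.

    Field names are hypothetical.
    """
    cleaned = dict(event)
    if "ip" in cleaned:
        cleaned["ip"] = truncate_ip(cleaned["ip"])
    # Page titles are free text controlled by the client site and may
    # contain names or project details, so they are dropped entirely.
    cleaned.pop("page_title", None)
    if "url" in cleaned:
        # Remove the query string and fragment, which often carry user data.
        parts = urlsplit(cleaned["url"])
        cleaned["url"] = parts._replace(query="", fragment="").geturl()
    return cleaned


# Made-up event; the address comes from a range reserved for documentation.
sample = {
    "ip": "203.0.113.57",
    "page_title": "Jan Jansen - Project Alpha",
    "url": "https://example.com/projects/42?user=jan#top",
    "user_agent": "Mozilla/5.0",
}
cleaned_sample = anonymize_event(sample)
```

Truncation and field removal reduce what a leak exposes, but they do not guarantee anonymity on their own: the remaining fields can still identify a visitor when combined.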

Most of the affected services, except Van Mossel, have not disclosed third-party cookies used for tracking and fingerprinting purposes.

The ambiguity of companies' privacy policies leaves the user in the dark about whether their personal information is shared with third-party services such as Rawdamental.

Image: Leaked e-commerce dashboard

Company’s response

After the article was published, Rawdamental contacted Cybernews and said it had started an investigation into the incident and begun implementing measures to “enhance its system's security.”

According to a company spokesperson, the open Kibana instance was part of a test project. An error was made in securing the IP address authentication, and the data was not used for AI training purposes but “solely for data collection.”

“Our priority is now to secure the data and inform the affected parties. We will work closely with the involved companies and assist them in addressing any potential consequences of this incident,” said a spokesperson in an email.

“We deeply regret that this incident has occurred and extend our sincere apologies to the affected clients.”

According to the company, the delay in responding happened because the disclosure was sent to an email address that was not actively monitored.
