Largest ever analysis of breached dataset content reveals monumental risks


When we hear about a new data breach, we usually care most about the volume of the exposure. But a new study claims that breach analysts should pay a lot more attention to data content, as that’s what cybercriminals are most interested in.

KNP, a 158-year-old transport company based in the United Kingdom, was doing perfectly fine back in 2023. The firm was running hundreds of lorries all over the country, and its IT department complied with industry standards.

But when a hacking gang known as Akira targeted KNP’s systems, a single employee’s weak password was all it took to take them over.

The hackers simply guessed the password, encrypted the company data, locked the systems, and demanded a ransom.

Now, KNP is no more. According to the BBC, the company didn’t have the five million pounds the hackers wanted and went under. Seven hundred employees lost their jobs.

Sure, it wasn’t a large company, and the story didn’t hit the headlines of the largest news outlets. But that’s also because incidents such as these are pretty common these days – tens of thousands of businesses globally are attacked by ransomware gangs every year.

And the damages are – or can be in the worst scenarios – enormous. In a new report, Lab-1, a UK-based cybersecurity company specializing in exposed data intelligence, says that a vast majority of breached datasets contain important financial information.

Based on an analysis of 141 million files across 1,297 data breach incidents, Lab-1’s Anatomy of a Breach 2025 report reveals the risk of fraud that data breaches pose to organizations, their employees, and customers.

It’s the largest ever content-level analysis of breached datasets, with nearly all of them including financial, HR, and customer data.

Financial and HR data prevalent

Most reports on data breaches overemphasize the volume of stolen information rather than content. According to Lab-1, organizations need to take a more content-aware approach to breach analysis.

Unsurprisingly, the company itself is doing so. Lab-1 uses AI agents to scrape breached datasets and analyze every exposed file, including unstructured files such as PDFs, emails, spreadsheets, and code files.

“While typically overlooked in data breach analysis techniques, the information can be leveraged for sophisticated cyberattacks, social engineering attacks, and fraud against companies and their customers,” says the firm.

Staggeringly, financial documents appear in 93% of incidents and account for 41% of all exposed files. Sensitive financial information types were also highly prevalent, revealing how personal data, as well as commercial information, is being leaked into the public domain.

For instance, bank statements, which enable identity fraud, were present in 49% of incidents, and IBANs, which can be used for mandate scams and payment redirection, were included in 36% of breached data sets.

HR data – often containing personally identifiable information (PII), payroll, and resumes – appeared in 82% of breaches. And two-thirds of breaches (67%) involved communications and records concerning customer service interactions and support.

Emails, the most prevalent type of exposed sensitive information, were leaked in 86% of all data breaches. But perhaps most concerningly, half of all incidents analyzed (51%) included US Social Security Numbers.

Exposure of personal data can obviously lead to targeted phishing, identity theft, and regulatory violations under laws like GDPR or the FTC Act, opening organizations up to the risk of substantial fines, legal action, and erosion of customer trust, said Lab-1.

Cybercrooks are now like data scientists

According to the company, the dataset used in the Anatomy of a Breach Study comprises 141,168,340 individual file records sourced from 1,297 ransomware and data breach incidents, all of which are in the public domain and were reconstructed from forensic acquisitions of compromised systems.

Some of the revelations illustrate the wealth of data that experienced cybercriminals can expect to find and misuse.

For example, cryptographic keys (SSH and RSA keys), which enable attackers to bypass authentication and access secure systems, were present in 18% of all incidents, a smaller proportion than other data types.

Cloud and infrastructure indicators, such as AWS S3 paths and virtual hosts, each featured in roughly a fifth of breaches (20% and 23% respectively). Such indicators can facilitate data exfiltration or the discovery of unsecured cloud storage endpoints.

Code files, which were exposed in 87% of incidents and account for 17% of all exposed files, also introduce vulnerabilities to the Software Bill of Materials by undermining the integrity and trustworthiness of the software supply chain.
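
To give a rough sense of what content-level analysis means in practice, here is a minimal Python sketch that counts a few of the sensitive information types mentioned above in a piece of text. It is not Lab-1's method, which the company says relies on AI agents. The pattern set, the function names, and the sample text are illustrative assumptions, and plain regular expressions like these would miss many cases and flag some false positives.

```python
import re

# Hypothetical, deliberately simple patterns for a few information types
# named in the report. Real classifiers need far more validation and context.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |OPENSSH )?PRIVATE KEY-----"),
    "s3_path": re.compile(r"\bs3://[a-z0-9][a-z0-9.\-]{1,61}[a-z0-9](?:/\S*)?"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def iban_checksum_ok(candidate: str) -> bool:
    """Apply the standard IBAN mod-97 check to cut down on false positives."""
    # Move the country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and test the remainder.
    rearranged = candidate[4:] + candidate[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def classify_text(text: str) -> dict[str, int]:
    """Return how many matches of each information type appear in the text."""
    counts: dict[str, int] = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if name == "iban":
            matches = [m for m in matches if iban_checksum_ok(m)]
        if matches:
            counts[name] = len(matches)
    return counts


# Made-up sample text; the IBAN is the widely used documentation example.
sample = (
    "Contact jane.doe@example.com, SSN 123-45-6789, "
    "pay to GB82WEST12345698765432. "
    "Backup at s3://acme-backups/payroll/2023.csv\n"
    "-----BEGIN OPENSSH PRIVATE KEY-----"
)
print(classify_text(sample))
```

Run over the text extracted from every file in a leaked dataset, counts like these would show what kind of data was exposed, not just how much of it.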

“Rather than focus on mega data dumps of structured and primarily credential-based information, we've focused on the huge risks associated with unstructured files that often hold high-value information, such as cryptographic keys, customer account data, or sensitive commercial contracts,” said Robin Brattel, co-founder and CEO of Lab-1.

“With cybercriminals now behaving like data scientists to unearth these valuable insights to fuel cyberattacks and fraud, unstructured data cannot be ignored.”

“Ultimately, organizations must understand what information has been leaked, how it can be used, and who might be affected. And faster than it can be used against them,” said Brattel.