
An unprotected database has exposed billions of records, revealing terabytes of personal information. What appears to be 16TB of professional and corporate intelligence data includes LinkedIn URLs and profile handles, alongside other personal information.
- An unprotected database exposed 4.3 billion records, some with LinkedIn-derived personal information.
- The 16TB-strong instance contained emails, photos, employment histories, and other personal data.
- A single collection alone contained 732 million records, including photographs.
- Researchers suggest the data may have been collected within the last two years, spanning multiple regions worldwide.
While massive contact databases can be a significant time-saver for businesses, they also have a major drawback – security. If left unprotected, a single exposed dataset can endanger the privacy of millions of users. That’s exactly what the Cybernews research team discovered in a recent major data leak.
The team found an unprotected MongoDB instance containing a staggering 16.14 terabytes of professional and corporate intelligence data. In total, researchers discovered nearly 4.3 billion documents, making it one of the largest lead-generation datasets to have ever leaked.
Bob Diachenko, a Cybernews contributor, cybersecurity researcher, and owner of SecurityDiscovery.com, is behind this major discovery. Diachenko uncovered the 4.3 billion-strong database on November 23rd, 2025, with the instance’s owners securing it two days later.
While researchers do not know how long the instance was exposed before it was found, if our team was able to find it, less scrupulous individuals may have found it too. Attackers prize large, well-organized datasets rich in personal information because they enable large-scale automated attacks.
What’s inside the massive data set?
All 4.3 billion exposed records were stored in a MongoDB instance, a database often used by businesses to store and process large volumes of data. In this case, the leak most likely stemmed from a common mistake where databases are left exposed without proper authentication due to human error.
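To make the "left exposed without proper authentication" mistake concrete, here is a minimal sketch of a check an administrator could run against their own MongoDB configuration. It assumes the `mongod` configuration file has already been parsed into a Python dictionary. The function name `audit_mongod_config` and the two sample configurations are hypothetical and purely illustrative; nothing here reflects how the exposed instance was actually configured.

```python
from typing import Any


def audit_mongod_config(config: dict[str, Any]) -> list[str]:
    """Return warnings for settings that could leave a MongoDB instance open.

    Hypothetical helper: it only inspects security.authorization,
    net.bindIp and net.bindIpAll from an already-parsed mongod config.
    """
    warnings: list[str] = []

    # Access control is off unless authorization is explicitly enabled.
    security = config.get("security") or {}
    if security.get("authorization") != "enabled":
        warnings.append("security.authorization is not 'enabled'")

    # mongod binds to localhost by default; wildcard addresses expose it.
    net = config.get("net") or {}
    bind_ip = str(net.get("bindIp", "localhost"))
    addresses = {part.strip() for part in bind_ip.split(",")}
    if net.get("bindIpAll") or addresses & {"0.0.0.0", "::", "*"}:
        warnings.append("instance listens on all network interfaces")

    return warnings


# Illustrative configurations (assumptions, not taken from the leak).
exposed = {"net": {"port": 27017, "bindIp": "0.0.0.0"}}
hardened = {
    "net": {"port": 27017, "bindIp": "127.0.0.1"},
    "security": {"authorization": "enabled"},
}

print(audit_mongod_config(exposed))
print(audit_mongod_config(hardened))
```

The first configuration triggers both warnings, while the second triggers none. A check like this is only a starting point: firewall rules and network placement matter as much as the database settings themselves.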
According to the team, the instance, which contained over 16TB of data, was fully structured and likely comprised scraped professional and corporate intelligence data. The database exposed deeply detailed LinkedIn-derived profiles, contact information, corporate relationships, and employment histories.
In total, nine collections were uncovered within the dataset, with each name most likely indicating the type of information it holds. The collections vary widely in size and in the number of records they contain:
- intent – 2,054,410,607 docs (604.76 GB)
- profiles – 1,135,462,992 docs (5.85 TB)
- unique_profiles – 732,412,172 docs (5.63 TB)
- people – 169,061,357 docs (3.95 TB)
- sitemap – 163,765,524 docs (20.22 GB)
- companies – 17,302,088 docs (72.9 GB)
- company_sitemap – 17,301,617 docs (3.76 GB)
- address_cache – 8,126,667 docs (26.78 GB)
- intent_archive – 2,073,723 docs (620 MB)
According to our researchers, all records within a specific collection are unique. However, there could be duplicates between different collections within the exposed dataset.
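As a quick sanity check, the per-collection figures listed above can be added up and compared with the headline totals. The sketch below is illustrative only; it assumes the reported sizes use binary units (1 TB = 1,024 GB), which is an assumption rather than something the researchers state.

```python
# (collection name, document count, reported size, unit) as listed above
COLLECTIONS = [
    ("intent", 2_054_410_607, 604.76, "GB"),
    ("profiles", 1_135_462_992, 5.85, "TB"),
    ("unique_profiles", 732_412_172, 5.63, "TB"),
    ("people", 169_061_357, 3.95, "TB"),
    ("sitemap", 163_765_524, 20.22, "GB"),
    ("companies", 17_302_088, 72.9, "GB"),
    ("company_sitemap", 17_301_617, 3.76, "GB"),
    ("address_cache", 8_126_667, 26.78, "GB"),
    ("intent_archive", 2_073_723, 620, "MB"),
]

# Assumption: binary units (1 TB = 1,024 GB = 1,024**2 MB).
UNIT_TO_TB = {"MB": 1 / 1024**2, "GB": 1 / 1024, "TB": 1.0}

total_docs = sum(docs for _, docs, _, _ in COLLECTIONS)
total_tb = sum(size * UNIT_TO_TB[unit] for _, _, size, unit in COLLECTIONS)

print(f"{total_docs:,} documents")  # 4,299,916,747 documents
print(f"{total_tb:.2f} TB")         # 16.14 TB
```

Under that unit assumption, the sums line up with the figures reported in this article: just under 4.3 billion documents and roughly 16.14 TB of data.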
While different collections contain different sets of information, the researchers confirmed that at least three of them (profiles, unique_profiles, and people) contained personally identifiable information (PII). These three collections, which together held roughly 2 billion records, exposed details such as:
- Full names
- Emails and phone numbers
- LinkedIn URLs and profile handles
- Position titles, employers, employment histories
- Education, degrees, certifications
- Location data
- Languages, skills, functions
- Social media accounts
- Image URLs (unique_profiles)
- Email confidence scoring (people)
- “Apollo ID”
The humongous volume of the database has serious privacy implications for all parties involved. For one, the database structure is indicative of LinkedIn-style scraping, which often means that most of the data, such as emails, phone numbers, job roles, and social graphs, is up-to-date and accurate.
Moreover, the unique_profiles collection, which had over 732 million records, included photographs. At the same time, the people collection contained email validation, enrichment scores, and social accounts. Marketers, sales teams, and recruiters add enrichment scores to user profiles to assess how well a lead or a candidate matches a desired profile or person.
It is challenging to determine the age of the LinkedIn data included in the dataset. The database’s “updated at” timestamps indicate that the information was collected and/or updated within 2025. However, in 2021, threat actors published claims alleging that they had scraped hundreds of millions of LinkedIn records. The exposed MongoDB database could contain records scraped in the past.
Researchers also observed that the database used uniform schemas for profiles, contacts, and employment histories. The sitemap and company_sitemap collections, which together contain about 181 million records, link URLs to profile IDs. The team believes that the sheer volume of the leaked database strongly points to automated scraping and enrichment pipelines.
While it’s unclear what the “Apollo ID” stands for, the nature of the dataset strongly points to information from the sales intelligence tool Apollo.io. The presence of “Apollo ID” integrates two major lead-gen ecosystems, LinkedIn and Apollo, creating a unified surveillance-grade dataset.
Who owns the leaked dataset?
At the time of this article’s publication, the attribution remains unconfirmed. However, there are some indications of the dataset’s owner. The team discovered that the database included sitemap collections linking “/people” and “/company” to a website of a lead-generation company.
The company helps businesses find and connect with potential customers, providing access to a large-scale B2B database of leads that strongly correlates with the type of information included in the exposed database.
The company’s website states that it connects with over 700 million professionals, which closely matches the number of records in the unique_profiles collection. Moreover, after researchers notified the company about the potential data leak, the exposed instance was closed the next day.
However, the team reserves the right not to attribute the leak to the company. There is a chance that the company’s presence in the leak points to its databases being scraped by the real owner of the data.
We have reached out to the company for comment and will update the article once we receive a reply.
Why is the leak dangerous?
Cybercriminals can turn large, unprotected databases into a gold mine. For example, attackers can use the data to carry out targeted phishing attacks. Malicious actors can cherry-pick CEOs from the dataset for CEO fraud attacks, in which the head of a company is impersonated to trick employees into transferring funds.
Another avenue for exploitation is corporate reconnaissance. Security professionals use large amounts of personal employee information to test organizational defenses against social engineering attacks, and malicious actors can employ the same tactics to identify vulnerabilities that allow them to penetrate company systems.
Attackers often target major corporations as their data is a valuable asset on the dark web. Since it’s almost certain that Fortune 500 company employees are included in the list, threat actors can use the data to focus their sights on specific businesses.
While attackers don’t need a specific database to carry out this type of attack, having the data collected for them increases their chances of success and reduces preparation time.
Malicious actors can also utilize a large dataset for automated attacks. Cybercriminals are as invested in AI-assisted operations as any company, and a 4.3 billion-record-strong dataset is a perfect candidate for this type of activity.
Large language models (LLMs) are capable of generating personalized messages based on user profile information. With some additional effort, tens of millions of malicious emails can be sent to victims, and it only takes one high-value target for the whole operation to be profitable for the attacker.
“Large datasets like this one are a prime target for malicious actors, as they act as a strong foundational base for profile enrichment based on other data leaks, enabling malicious actors to craft a large, searchable database of personal data that, after enrichment, could also include passwords, device identifiers, links to other social media, etc. Such datasets simplify social engineering and credential stuffing attacks,” our researchers explained.
Billions of leaked records: a new reality
Major data leaks containing billions of exposed records have become nearly ubiquitous. In June, Cybernews reported on what is likely the largest data leak to ever affect China, comprising billions of documents containing financial data, WeChat and Alipay details, as well as other sensitive personal information.
Last summer, the largest password compilation, with nearly ten billion unique passwords, known as RockYou2024, was leaked on a popular hacking forum. In 2021, a similar compilation with over eight billion records was leaked online.
In early 2024, the Cybernews research team discovered what is likely still the largest data leak ever: the Mother of All Breaches (MOAB), with a mind-boggling 26 billion records.
Large data leaks involving professional and company information are not entirely new, either. In 2018, Apollo.io left an unprotected database containing billions of records and 125M unique email addresses.
In 2019, People Data Labs, a US-based data broker, suffered a data breach impacting 622 million individuals. Last year, Cybernews researchers discovered an unprotected instance with over 170 million sensitive data records, attributed to PDL.
Meanwhile, LinkedIn has been adamantly fighting companies scraping its members' profiles. In early October, LinkedIn filed a lawsuit against software company ProAPIs and its CEO, claiming that the firm unlawfully created hundreds of thousands of fake accounts used for scraping millions of LinkedIn member profiles.
LinkedIn said its User Agreement prohibits data scraping by automated bots, as well as impersonating others or creating fake accounts. The company emphasized that scraping poses a risk to its users.
“Neither LinkedIn nor its members can then prevent Defendants or their customers from using that scraped data to send spam, from selling or exposing member data to scammers, or from combining LinkedIn member data with other data to create extensive private databases, among other activities,” LinkedIn said in the lawsuit.
Vilius Petkauskas is a deputy editor at Cybernews. Vilius brings over a decade of experience in journalism to his role at Cybernews. He oversees content quality, topic pitching and research article development. Before joining Cybernews, Vilius sharpened his pen as a journalist for both print and online media, covering a diverse range of topics from local business to international politics.
- Leak discovered: November 23rd, 2025
- Responsible disclosure: November 24th, 2025
- Leak closed: November 25th, 2025