
Security researchers have scanned a massive dataset used to train DeepSeek and other AI models and found almost 12,000 live secrets, exposing the services they unlock.
“Live” secrets refer to API keys, passwords, and other credentials that successfully authenticate with their respective services.
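For instance, a scanner can test an AWS key pair non-destructively by calling STS GetCallerIdentity, which requires no IAM permissions and merely reports whose key it is. The snippet below is a minimal sketch using boto3, offered as an illustration rather than Truffle Security's actual verification code:

```python
import boto3
from botocore.exceptions import ClientError

def aws_key_is_live(access_key_id: str, secret_access_key: str) -> bool:
    """Check whether an AWS key pair still authenticates.

    sts:GetCallerIdentity requires no IAM permissions, so it is a
    non-destructive way to test a candidate credential.
    """
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name="us-east-1",
    )
    try:
        identity = sts.get_caller_identity()  # returns Account, Arn, UserId
        print(f"Live key belonging to: {identity['Arn']}")
        return True
    except ClientError:
        return False  # e.g. InvalidClientTokenId for a revoked or fake key
```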
Truffle Security, an open-source security software company, said it found an astounding 11,908 live secrets in the Common Crawl archive. This 400-terabyte dataset contains website snapshots from 47.5 million hosts across 38.3 million registered domains and represents a broad cross-section of the internet.
The 11,908 live secrets appeared on almost three million web pages, which means many websites reuse the same secrets.
“In one extreme case, a single WalkScore API key appeared 57,029 times across 1,871 subdomains!” Truffle Security said in a report.
The firm explains that developers expose secrets by hardcoding them in the front-end HTML and JavaScript of public web pages, where copies then spread beyond their control. The crawling organizations are not at fault, as they shouldn't be tasked with redacting the crawl data used by researchers.
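As a hypothetical illustration of the pattern, the sketch below shows the kind of hardcoded key a crawler would capture verbatim in a page snapshot, along with a crude regex a scanner might use to flag it. Both the page and the key are made up; real scanners, such as the firm's open-source TruffleHog, use per-provider detectors and verification instead:

```python
import re

# A made-up snapshot of front-end code, as a crawler would archive it.
PAGE_SNAPSHOT = """
<script>
  // Anyone who views the page source sees this key.
  const apiKey = "sk_live_51Habc123example000000000";
  fetch("https://api.example.com/v1/data", {
    headers: { Authorization: "Bearer " + apiKey },
  });
</script>
"""

# Crude pattern for 'apiKey = "..."' style assignments (illustrative only).
HARDCODED_KEY = re.compile(
    r"""api[_-]?key\s*[:=]\s*["']([^"']+)["']""", re.IGNORECASE
)

for match in HARDCODED_KEY.finditer(PAGE_SNAPSHOT):
    print("Possible exposed secret:", match.group(1))
```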
However, when this data is ingested by large language models (LLMs), the exposure “likely contributes to LLMs suggesting hardcoded secrets.”
The researchers also highlighted the risk that AI models trained on insecure code will reproduce unsafe practices. For example, they may suggest hardcoding credentials, exposing organizations to attack.
“Popular LLMs, including DeepSeek, are trained on Common Crawl, a massive dataset containing website snapshots. Given our experience finding exposed secrets on the public internet, we suspected that hardcoded credentials might be present in the training data, potentially influencing model behavior,” the researchers said.
Model outputs are further shaped by other training datasets, fine-tuning, alignment techniques, and prompt context.
The firm previously tested 10 LLMs, including those behind ChatGPT, VS Code's AI assistants, and other widely used coding tools, and demonstrated that most recommend hardcoding API keys and passwords.
“The real risk? Inexperienced (and non) coders might follow this advice blindly, unaware they’re introducing major security flaws,” the researchers explained.
Truffle Security said it contacted the vendors whose users were most affected by the exposure and worked with them to revoke and rotate “several thousand keys.”
Among the 219 distinct types of discovered secrets, MailChimp API keys leaked most frequently; attackers can exploit them in phishing campaigns, data exfiltration, and brand impersonation. Some websites exposed AWS root keys, and one web page included 17 live Slack webhooks.
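MailChimp keys are comparatively easy to verify: the suffix after the final dash names the data center to query, and the API's ping endpoint accepts HTTP Basic auth with any username and the key as the password. The sketch below is based on Mailchimp's public API documentation, not on Truffle Security's tooling, so treat the key format and endpoint details as assumptions:

```python
import requests

def mailchimp_key_is_live(api_key: str) -> bool:
    """Probe Mailchimp's health-check endpoint with a candidate key.

    Assumed key format: 32 hex chars plus a data-center suffix, e.g.
    'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-us6'. A 200 response means the
    key authenticates; anything else means revoked or malformed.
    """
    if "-" not in api_key:
        return False
    dc = api_key.rsplit("-", 1)[1]  # e.g. 'us6'
    url = f"https://{dc}.api.mailchimp.com/3.0/ping"
    resp = requests.get(url, auth=("anystring", api_key), timeout=10)
    return resp.status_code == 200
```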
For developers, the researchers recommend adding hard rules to their AI prompts that forbid hardcoded credentials and other insecure code patterns. Developers should also scan their code and public websites for any exposed keys.
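One way to apply that advice is to pin such a rule into the system prompt of whatever assistant you use. The sketch below uses the OpenAI Python client as an example host for the rule; the model name and the rule's exact wording are illustrative assumptions, not text prescribed in the report:

```python
import os
from openai import OpenAI  # OpenAI's v1+ Python client

# Illustrative wording; the report recommends a hard rule, not this exact text.
HARD_RULE = (
    "Never output hardcoded credentials (API keys, passwords, tokens, or "
    "webhook URLs) in code. Always load secrets from environment variables "
    "or a secrets manager, and point this out in the answer."
)

# The client's own key comes from the environment, practicing what we preach.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute your own
    messages=[
        {"role": "system", "content": HARD_RULE},
        {"role": "user", "content": "Write Python that calls the MailChimp API."},
    ],
)
print(response.choices[0].message.content)
```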