The following content is a press release provided by a third party.

How to Find the Best Cybersecurity Datasets for Machine Learning Projects


Introduction

In the fast-changing world of cybersecurity, machine learning (ML) has become a key tool for spotting and stopping threats. However, the effectiveness of ML models depends significantly on the quality and applicability of the training data. This matters more as cyber threats continue to escalate: the global cost of cybercrime is projected to reach $10.5 trillion annually by 2025, according to Boise State University's Institute for Pervasive Cybersecurity. Whether you're developing a next-gen threat detection system or enhancing SOC operations, the right data can make all the difference.

Why Are Cybersecurity Datasets Important for Machine Learning?

Machine learning models need to learn from examples. In cybersecurity, these examples often come in the form of labeled attack data, normal traffic patterns, malware samples, phishing emails, or user behavior logs. Without access to cybersecurity datasets for machine learning, your models won’t generalize well, and they’ll likely miss modern attack patterns.

A good cybersecurity dataset for ML must be relevant to the specific threat, with accurate labels for reliable model training. It should contain diverse, realistic data (e.g., network traffic, logs) that is up-to-date with modern threats. The dataset also needs to be sufficiently large, anonymized for privacy, and well-documented, so users can understand its contents and purpose.

Where to Find the Best Cybersecurity Datasets

A. Public Datasets

CICIDS 2017 / CICIDS 2018

These datasets include a variety of attack scenarios and are widely used for intrusion detection research.

NSL-KDD

A classic dataset for IDS benchmarking, NSL-KDD is an improved version of the older KDD Cup 1999 dataset. It addresses issues like redundancy and provides balanced data for anomaly detection. However, it’s outdated for modern threats, as it was derived from 1990s DARPA data.

UNSW-NB15

This dataset features modern network traffic with normal and attack data (e.g., fuzzers, backdoors, exploits). It includes 49 features extracted from 100 GB of raw traffic, making it suitable for contemporary IDS research.

CTU-13 Botnet Dataset

CTU-13 captures botnet, normal, and background traffic across 13 scenarios. It is labeled flow by flow and is one of the largest botnet datasets available for analyzing botnet behavior.

MalwareBazaar

A repository of malware samples and metadata, MalwareBazaar provides daily updated samples for malware classification tasks. It’s useful for training models to detect malicious files.

VirusShare

VirusShare offers a large collection of malicious files for researchers. It’s often used for malware analysis and classification projects.

PhishTank

A live feed of phishing URLs submitted by users, PhishTank provides examples for training models that classify sites as malicious or non-malicious.

APT Notes & Threat Intelligence Feeds

These resources provide reports and data on advanced persistent threats (APTs) and adversary behavior for modeling attacker tactics and building threat intelligence systems.

B. Research and Academic Repositories

Kaggle Cybersecurity Competitions

Kaggle hosts various competitions and datasets related to cybersecurity, providing a platform for practitioners to test and improve their models.

UCI Machine Learning Repository

A collection of databases and data generators widely used in the ML community.

IEEE DataPort

Offers a variety of datasets across different domains, including cybersecurity.

Zenodo

An open-access repository developed under CERN, hosting datasets from various scientific domains.

GitHub (Security-Focused Repositories)

Many researchers and organizations share cybersecurity datasets and tools on GitHub.

C. Commercial / Private Datasets

Commercial Threat Intelligence Vendors

Companies like Recorded Future, Palo Alto Networks, and CrowdStrike offer proprietary datasets and threat intelligence feeds.

Cybersecurity Data Marketplaces

Platforms that provide access to cybersecurity datasets for purchase or subscription.

Partnerships with MSSPs and SOC Providers

Collaborating with Managed Security Service Providers (MSSPs) and Security Operations Centers (SOCs) can grant access to real-world data for model training.

Enterprise Honeypot Data

Organizations can deploy honeypots to collect data on attack patterns and behaviors — invaluable for training ML models on emerging threats.

Tips for Selecting the Right Cybersecurity Dataset

Choosing the right dataset requires careful consideration:

  • Align the dataset to your ML use case
  • Check for recent updates
  • Evaluate label quality and balance
  • Beware of “toy” datasets
  • Consider synthetic vs. real-world data
  • Ensure legal/ethical compliance
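
The label balance check in the list above can be approximated before any training starts. The snippet below is a minimal sketch using only the Python standard library. The function name `label_report` and the sample labels are hypothetical and do not come from any of the datasets named in this article.

```python
from collections import Counter


def label_report(labels):
    """Summarize class counts, class shares, and the majority/minority ratio."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    # A ratio far above 1 signals that the rarest class is badly underrepresented.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "shares": shares,
        "imbalance_ratio": imbalance_ratio,
    }


# Hypothetical label column: 90 benign flows, 8 DoS flows, 2 botnet flows.
sample_labels = ["benign"] * 90 + ["dos"] * 8 + ["botnet"] * 2
report = label_report(sample_labels)
print(report["counts"])
print(report["imbalance_ratio"])  # 90 / 2 = 45.0
```

A high ratio like this one does not disqualify a dataset, but it suggests that plain accuracy will be a misleading metric and that some rebalancing step is probably needed.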

How to Prepare and Clean Cybersecurity Datasets

Data Preprocessing Essentials

  • Normalization. Rescale numerical features to a common range so that large-valued fields (e.g., byte counts) do not dominate smaller ones.
  • Encoding Categorical Variables. Convert categorical variables (protocol types, domain names) into a numerical format.
  • Handling Imbalanced Classes. Address class imbalance using techniques such as SMOTE, undersampling, or cost-sensitive learning.
  • Noise Reduction. Remove irrelevant or duplicate data to reduce model overfitting.
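
The four preprocessing steps above can be illustrated on a toy set of flow records. This is a sketch in standard-library Python, not a production pipeline. The field names (`bytes`, `protocol`, `label`), the helper functions, and the sample rows are all hypothetical. Random undersampling stands in for the imbalance step because it needs no extra dependencies; SMOTE and cost-sensitive learning are alternatives that would need a library implementation.

```python
import random
from collections import defaultdict


def drop_duplicates(rows):
    """Noise reduction: remove exact duplicate records, keeping first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def min_max_normalize(rows, field):
    """Normalization: rescale one numeric field to the range [0, 1]."""
    values = [row[field] for row in rows]
    lo, span = min(values), max(values) - min(values)
    return [{**row, field: (row[field] - lo) / span if span else 0.0} for row in rows]


def one_hot_encode(rows, field):
    """Encoding: replace a categorical field with one 0/1 column per category."""
    categories = sorted({row[field] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != field}
        for category in categories:
            new_row[f"{field}={category}"] = 1 if row[field] == category else 0
        encoded.append(new_row)
    return encoded


def undersample(rows, label_field, seed=0):
    """Imbalance handling: randomly shrink every class to the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_field]].append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical flow records; the second row is an exact duplicate of the first.
flows = [
    {"bytes": 100, "protocol": "tcp", "label": "benign"},
    {"bytes": 100, "protocol": "tcp", "label": "benign"},
    {"bytes": 300, "protocol": "udp", "label": "benign"},
    {"bytes": 500, "protocol": "tcp", "label": "benign"},
    {"bytes": 900, "protocol": "icmp", "label": "attack"},
]

cleaned = drop_duplicates(flows)              # 4 rows remain
scaled = min_max_normalize(cleaned, "bytes")  # bytes become 0.0, 0.25, 0.5, 1.0
encoded = one_hot_encode(scaled, "protocol")  # adds protocol=icmp/tcp/udp columns
balanced = undersample(encoded, "label")      # 1 benign row + 1 attack row
print(balanced)
```

In a real project, the scaling range and category list would normally be computed on the training split only and then reused on validation and test data, so that no information leaks from the held-out sets.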

Conclusion

Selecting and preparing appropriate cybersecurity datasets for machine learning are foundational steps in building effective security solutions. By understanding what makes a dataset high quality, knowing where to find such datasets, and applying best practices in preprocessing, cybersecurity professionals can improve the accuracy and resilience of ML-based defenses.
