Cybersecurity under scrutiny

Machine learning in cybersecurity research is prone to subtle pitfalls

Pitfalls in cybersecurity are widespread - even in carefully conducted top research. — © pixabay

Cybersecurity has become a key asset for our digital society. In security research, machine learning (ML) has emerged as one of the most important tools for investigating security-related problems: However, a group of European researchers from TU Berlin, TU Braunschweig, University College London, King’s College London, Royal Holloway University of London, and Karlsruhe Institute of Technology (KIT)/KASTEL Security Research Labs, led by BIFOLD researchers from TU Berlin, have shown recently that research with ML in cybersecurity contexts is often prone to error. Their conference paper "Dos and Don'ts of Machine Learning in Computer Security" on pitfalls in the application of machine learning in security research was honored with a Distinguished Paper Award at the renowned USENIX Security Symposium 2022.

Machine learning techniques have led to major breakthroughs in a wide range of application fields, such as computer vision and natural language processing. Consequently, this success has also influenced cybersecurity, where not only vendors advertise their AI-driven products to be more efficient and effective than previous solutions. Also, many researchers prominently apply these techniques, as algorithms seemingly often outperform traditional methods by large extent. For example, machine learning techniques are used to learn attack tactics and adapt defenses to new threats.

Where modern cybersecurity approaches falter

“In the paper, we provide a critical analysis of using ML for cybersecurity research”, describes first author Dr. Daniel Arp, postdoctoral researcher at TU Berlin: “First, we identify common pitfalls in the design, implementation, and evaluation of learning-based security systems.” One example for this pitfall is the use of non-representative data. For instance, a dataset where the number of attacks is over-represented compared to their prevalence in the wild. ML models trained on such data might show different performance when applied in practice. In the worst case, it might even turn out that they do not work outside an experimental setting. Similarly, the results of experiments might be presented in a way that leads to misinterpretations of a system’s capabilities. In a second step, the researchers conducted a prevalence analysis, based on the identified pitfalls, in which they studied 30 papers from top-tier security conferences published between 2010 and 2020. “Concerningly, we could confirm that these pitfalls are widespread even in carefully conducted top research”, says BIFOLD Fellow Prof. Dr. Konrad Rieck, TU Braunschweig.

Understanding pitfalls and their prevalence in research is crucial for promoting a sound adoption of learning-based security systems. However, an equally important point is to understand whether such pitfalls, prevalent in all the surveyed research, affect models’ performance and interpretation. This impact analysis on four representative case studies, based on examples taken from the literature, highlighted pitfalls can indeed lead to unrealistic performance and misleading findings. One of the examined case studies deals with the detection of mobile malware. Due to the large number of new malicious apps for mobile devices, traditional anti-virus scanners often have problems to keep up with the evolution of malware and only provide poor detection performance. To cope with this problem, researchers suggested and developed learning-based methods that can automatically adapt to new malware variants.

“Unfortunately, the performance of the learning-based systems might have been overestimated in many cases. Due to the lack of open datasets provided by companies, researchers compose learning-datasets on their own, thus merging applications from different sources”, explains Dr. Daniel Arp. “But, merging data from different sources, leads to a sampling bias: official app stores of smartphone manufacturers tend to have less problems with malware, whereas alternative sources usually offer less protection than these well-known stores. In consequence, we could demonstrate that state-of-the-art approaches tend to rely on the source of an app instead of learning actual malicious characteristics to detect malware. This is only one of many examples in our paper that showed how a subtle pitfall can introduce a severe bias and affect the overall outcome of an experiment.”

The difficulties of operating learning-based methods in cybersecurity is exacerbated by the need to operate in adversarial contexts. Understanding the perils this work outlines should raise awareness of possible pitfalls in the experimental design and how to avoid these – so the researchers hope.

The publication in detail:

Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, Konrad Rieck: Dos and Don'ts of Machine Learning in Computer Security