Why AI-based vulnerability detection isn’t as advanced as we thought…
Whether it’s an online banking app, a hospital’s patient portal, or the software in your own car, even the simplest everyday activity today depends on thousands of lines of code. A single undiscovered bug can create security vulnerabilities with potentially serious consequences, such as the theft of sensitive data or the failure of critical systems. Large language models (LLMs) are now frequently used to test such systems for vulnerabilities before deployment. A research team from BIFOLD has shown in a recent study that the enormous technical effort behind these LLMs does not always pay off. The publication “LLM-based Vulnerability Discovery through the Lens of Code Metrics” by Felix Weißberg, Lukas Pirch, Erik Imgrund, Jonas Möller, Dr. Thorsten Eisenhofer, and Prof. Dr. Konrad Rieck was recently presented at the 48th IEEE/ACM International Conference on Software Engineering (ICSE) 2026, one of the world’s leading conferences in software engineering.
Large language models are considered particularly powerful when it comes to generating, understanding, and especially analyzing code. Based on the assumption that greater complexity leads to better results in code analysis, research in recent years has focused on building ever larger and more complex models with increasing numbers of parameters. The BIFOLD team asks the opposite question in its paper: What added value do these large models provide compared to an analysis using simple code metrics, which have been in use since the 1970s? Code metrics are simple quantitative indicators that, for example, measure source code size in lines or characters. More advanced metrics attempt to assess how understandable the code is. “Following the principle that correlation does not imply causation, such metrics can only indicate a possible vulnerability; they are not direct proof,” explains author Lukas Pirch. In contrast, LLMs are supposed to “understand” what a piece of code does.
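To make the idea concrete, the following minimal sketch computes a few metrics of this kind for a code snippet. The article does not list the metrics used in the study; the function names and the branch-keyword heuristic below are illustrative assumptions only.

```python
# Minimal sketch of simple, decades-old code metrics (illustrative only).

def count_lines(source: str) -> int:
    """Source size in lines, one of the classic 1970s-era metrics."""
    return len(source.splitlines())

def count_characters(source: str) -> int:
    """Source size in characters."""
    return len(source)

def branch_count(source: str) -> int:
    """Crude proxy for control-flow complexity: count branching tokens.
    Related in spirit to McCabe's cyclomatic complexity, but simplified."""
    tokens = ("if", "for", "while", "case", "&&", "||")
    return sum(source.count(t) for t in tokens)

snippet = """
int copy_name(char *dst, const char *src) {
    if (src == 0) return -1;
    while (*src) { *dst++ = *src++; }
    *dst = 0;
    return 0;
}
"""

print(count_lines(snippet), count_characters(snippet), branch_count(snippet))
```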
The study’s key finding: A traditional detection system based on just 23 code metrics already achieves 98 percent of the detection rate of the best modern LLMs, while requiring only 6 percent of the parameters. Even a detection system relying on just a single metric still achieves more than 90 percent of the detection performance of a far more resource-intensive language model.
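What such a “traditional detection system” might look like in its simplest form is sketched below: a plain classifier trained on a small vector of code metrics per function. The study’s actual 23 metrics, model, and data are not reproduced here; the features and labels in this sketch are synthetic placeholders, constructed so the toy model has something to learn.

```python
# Hypothetical sketch of a metric-based vulnerability detector:
# a plain logistic-regression classifier over code-metric features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy dataset: each row = [lines, characters, branch_count] for one function.
X = rng.integers(low=1, high=500, size=(1000, 3)).astype(float)
# Synthetic labels (1 = "vulnerable"), deliberately tied to the metrics
# plus noise so the classifier has a learnable signal.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=50.0, size=1000) > 300).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"toy detection accuracy: {clf.score(X_test, y_test):.2f}")
```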
Good news for IT security
In a second step, the researchers investigated the cause of this surprising tie. Author Felix Weißberg explains: “Using statistical methods, we were able to show that all examined LLMs use code metrics or exhibit very similar patterns, and that their predictions are closely correlated with them. For some models, we were even able to demonstrate strong indicators of causality: the LLMs’ decisions were based, at least in part, on these simple patterns that have been known for decades.”
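The article does not detail the statistical methods used. As a rough illustration of the kind of relationship described, the sketch below measures the rank correlation between a single code metric and hypothetical LLM vulnerability scores for the same set of functions; all values are synthetic placeholders.

```python
# Illustrative only: rank correlation between one code metric and
# made-up LLM vulnerability scores over the same functions.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

metric_values = rng.integers(5, 400, size=200).astype(float)          # e.g. lines of code
llm_scores = 0.002 * metric_values + rng.normal(scale=0.2, size=200)  # hypothetical model output

rho, p_value = spearmanr(metric_values, llm_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```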
“We were surprised that the difference between the two approaches was so small under realistic conditions,” said Konrad Rieck, summarizing his team’s findings. “Our results show that recent progress in AI-based vulnerability detection is due less to the capabilities of the LLMs themselves than to the tools and environments in which they operate. This raises the question of whether the immense size of today’s models is even necessary for this task. For IT security, that is good news: We may be able to find and fix many software bugs using far fewer resources.”
Publication:
Felix Weißberg, Lukas Pirch, Erik Imgrund, Jonas Möller, Thorsten Eisenhofer, Konrad Rieck: LLM-based Vulnerability Discovery through the Lens of Code Metrics. Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE), 2026.