CodeGraphSMOTE: Data Augmentation for Vulnerability Discovery

Tom Ganz

Erik Imgrund

Martin Härterich

Konrad Rieck

July 12, 2023

The automated discovery of vulnerabilities at scale is a crucial area of research in software security. While numerous machine learning models for detecting vulnerabilities are known, recent studies show that their generalizability and transferability heavily depend on the quality of the training data. Due to the scarcity of real vulnerabilities, available datasets are highly imbalanced, making it difficult for deep learning models to learn and generalize effectively. Based on the fact that programs can inherently be represented by graphs and to leverage recent advances in graph neural networks, we propose a novel method to generate synthetic code graphs for data augmentation to enhance vulnerability discovery. Our method includes two significant contributions: a novel approach for generating synthetic code graphs and a graph-to-code transformer to convert code graphs into their code representation. Applying our augmentation strategy to vulnerability discovery models achieves the same originally reported F1-score with less than 20% of the original dataset and we outperform the F1-score of prior work on augmentation strategies by up to 25.6% in detection performance.

https://doi.org/10.1007/978-3-031-37586-6_17

BIFOLD AUTHORS

Prof. Dr. Konrad Rieck