Machine Learning on large dynamic graphs

Space-Efficient Random Walks on Streaming Graphs

The paper „Space-Efficient Random Walks on Streaming Graphs“, by Serafeim Papadias, Dr. Zoi Kaoudi, Dr. Jorge Arnulfo Quiane Ruiz, and Prof. Dr. Volker Markl has been accepted for publication at PVLDB 16 / VLDB 2023. The work of this paper was done in the context of the BIFOLD Agility Project: Machine learning on large dynamic graphs. In their paper, the researchers present Wharf, the first step towards scalable incremental machine learning on graphs.

“For example: A digital marketplace can be seen as a graph, where nodes represent either users or products and edges represent the act of a user buying a product. The graph can change rapidly – people all over the world buy different products, meaning that thousands of new edges in this graph may be created in only a second. In this setting, we need to maintain on the fly and very efficiently random walks in the graph so that we can build an up-to-date machine learning model for predicting missing links (i.e., the connection between a specific product and a certain user, who will buy it) and thus recommend this product to the user", explains BIFOLD researcher Dr. Zoi Kaoudi.

Specifically, Wharf generates, succinctly maintains, and efficiently updates random walk corpuses in main memory. The maintained random walks serve as input to many downstream machine learning applications such as node classification (i.e., deciding the type of each node in a graph for detecting communities), but also to other applications, such as (personalized) page rank and streaming recommendations.

“Nowadays, graph-structured data are ubiquitous, because every potential interaction between affiliated entities in the real world can be expressed via graphs such as in social networks (user-user interactions) and marketplaces (user-product interactions)” knows Zoi Kaoudi. Even though a plethora of information exists and as a matter of fact is constantly generated, due to the dynamicity of graphs today, training machine learning models (ML models) on it is difficult, because of the graph's unstructured nature. Specifically, the ML models accept data in a tabular form. Random walk-based graph embeddings are a class of ML algorithms that turn, through random walks, the unstructured graph data into a tabular form, which at last is readily accepted as input to well-known and frequently used downstream ML models that do vertex classification, link prediction, etc. Since the graphs recently became more and more dynamic, incremental solutions are compulsory, and thus, the maintained random walks need to be efficiently and frequently updated in order to be up-to-date for the downstream ML models.

Wharf achieves scalable updating of the maintained random walks by utilizing a purely functional tree-of-tree structure that encodes and space-efficiently stores the walk-related information. Wharf's approach couples purely-functional tree structures with pairing functions, which in the end renders it capable of efficient search in trees that keep the walk information through ordering properties that pairing functions possess. Finally, Wharf updates random walks collectively in a batch and parallel manner leveraging the properties of the purely-functional tree structures, which enables it to achieve 2.6x better throughput and 2.1x less memory footprint than the existing solutions.

The publication in detail:

Serafeim Papadias, Dr. Zoi Kaoudi, Dr. Jorge Arnulfo Quiane Ruiz, Prof. Dr. Volker Mark: "Space-Efficient Random Walks on Streaming Graphs"