Since the official announcement of the Berlin Institute for the Foundations of Learning and Data in January 2020, BIFOLD researchers have achieved a wide array of advancements in the domains of Machine Learning and Big Data Management, as well as in a variety of application areas, by developing new systems and producing impactful publications. The following summary provides an overview of recent research activities and successes.

BIFOLD seeks to lay the foundations for Big Data and Machine Learning and offers internationally visible impulses from the interlocking research of Big Data and Machine Learning, a new scientific discipline. In this way, BIFOLD advances the scientific and industrial use of both technology domains. For the broad, productive application of AI, it is crucial to develop new, automatically scalable technologies that organize the constantly growing flood of data and use intelligent procedures to derive well-founded information for data-driven decisions. Only by combining Big Data and Machine Learning can new value be derived from large, heterogeneous amounts of data, which is of fundamental importance for more and more applications, from healthcare to digital humanities. The close interaction between Big Data and Machine Learning is the unique selling point of BIFOLD within the network of German AI centers of excellence. Training, start-ups, transfer, cooperation and networking activities make BIFOLD a nucleus for technology transfer and innovation for the interdisciplinary ecosystem of Big Data and Machine Learning in Germany and Europe. Several BIFOLD employees are active in the consortia ELLIS (European Lab for Learning & Intelligent Systems, specifically the Berlin ELLIS unit) and CLAIRE (Confederation of Laboratories for Artificial Intelligence Research in Europe) and contribute, for example, new AI methods for statistical mechanics and quantum chemistry.

STATE OF RESEARCH

During the first half of 2020, BIFOLD Research Groups produced more than 55 new scientific publications and won several prestigious awards.

MACHINE LEARNING

In the field of Machine Learning, BIFOLD has made and continues to make significant progress in both basic and applied research across a wide range of applications.

Under the direction of FU Berlin’s Prof. Dr. Tim Conrad, researchers at the Zuse Institute Berlin are working on the analysis of dynamic networks. The focus is on learning the laws of varying change processes in the context of medical and social phenomena. In collaboration with Université Paris-Saclay’s Prof. Dr. Gilles Blanchard, an improved estimation method for kernel embeddings of distributions was developed.

Under the direction of TU Berlin’s Prof. Dr. Manfred Opper, researchers developed dynamic language models that combine approximate Bayesian inference with deep neural networks and are suitable for conducting sensitivity analysis in systems. In addition, research on neural variational inference for the circuits of dynamic systems and on stochastic process approximations for the training of Bayesian neural networks is currently underway.

Under the direction of TU Berlin’s Prof. Dr. Klaus-Robert Müller and Dr. Grégoire Montavon, two novel explanation methods were developed in the field of Explainable AI (XAI): BiLRP for deep similarity models and GNN-LRP for deep graph-based neural networks. Additionally, the researchers gained insights into mechanisms that protect interpretation methods against manipulation attacks. Furthermore, a BIFOLD publication on quantum chemistry was among the 50 most read Nature Communications articles in chemistry and materials sciences published in 2019.

TU Berlin’s Prof. Dr. Gitta Kutyniok, Head of the Applied Functional Analysis Group, conducts research on the theoretical foundations of artificial neural networks and is the coordinator of a new DFG Priority Program on the Theoretical Foundations of Deep Learning.

Under the direction of TU Berlin’s Prof. Dr. Giuseppe Caire, Head of the Communication and Information Theory Group, researchers are employing ML to solve various communications problems (e.g., real-time localization, cellular optimization, D2D connection planning) and to improve existing solutions, jointly with TU Berlin mathematicians.

At the Fraunhofer Heinrich Hertz Institute, Dr. Wojciech Samek and his colleagues have gained important insights into federated learning. To avoid suboptimal results (e.g., when data distributions diverge across clients), a novel framework for federated multi-task learning was developed. Clustered federated learning facilitates managing dynamic client populations that evolve over time under privacy constraints.

BIG DATA

Researchers in TU Berlin’s Database Systems and Information Management (DIMA) Group, led by Prof. Dr. Volker Markl, have been conducting research on the scalable real-time processing of very large, heterogeneous and geographically distributed data streams. Jointly with DIMA researchers, Project Lead Dr. Steffen Zeuch is currently developing an end-to-end data processing system for the Internet of Things (IoT) called NebulaStream. NebulaStream is being designed to cope with the heterogeneity and distribution of data and systems. It supports various data and programming models that go beyond relational algebra and addresses potentially unreliable communication. The NebulaStream platform enables the development of novel IoT applications. The first test results confirm that it offers fast and efficient data delivery. In cooperation with Prof. Dr. Matthias Böhm at the University of Graz, investigations into the declarative specification and automatic optimization of data science pipelines are underway.

Among the more recent accomplishments was a BIFOLD publication from the DIMA Group on the use of modern GPUs to accelerate the processing of database queries, which earned the 2020 ACM SIGMOD Best Paper Award. It resulted from a collaboration with HPI’s Prof. Dr. Tilmann Rabl, Head of the Data Engineering Systems Group and BIFOLD Principal Investigator. Furthermore, the Data Engineering Systems Group and DIMA developed a new message broker system, which addresses weaknesses of older systems regarding new storage and access technologies. This research work won second place in the highly regarded ACM SIGMOD 2020 Student Research Competition.

Today, in order to perform AI and Data Science (DS) on a multitude of assets, an enormous amount of resources is required. However, only a few players have this capability, which results in a lock-in effect. Project Leads Dr. Jorge-Arnulfo Quiané-Ruiz and Dr. Jonas Traub, jointly with DIMA researchers, are currently designing and building an ecosystem called Agora to unify data, algorithms, models and computing resources, to enable exchange across a broad audience. Agora treats data-related assets as first-class citizens, uses fine-grained asset exchange, enables the combination of assets to create novel applications, and offers the flexibility required to run applications on the available resources.

Dr. Jorge-Arnulfo Quiané-Ruiz and Dr. Kaustubh Beedkar, jointly with Prof. Dr. Volker Markl, are currently investigating how the processing of geo-distributed queries can be reconciled with data movement restrictions imposed by guidelines. To address these challenges, they have created a declarative specification language for guidelines, a conformance-based optimizer for distributed queries, and efficient mechanisms for policy evaluation. Moreover, Dr. Jorge-Arnulfo Quiané-Ruiz and Dr. Zoi Kaoudi, jointly with researchers at QCRI, have developed a novel approach for the debugging of large-scale dataflow jobs.

TU Berlin’s Prof. Dr. Begüm Demir, Head of the Remote Sensing Image Analysis Group, conducts research in remote sensing, signal processing, image processing, machine learning, and big data in earth observation. In particular, she currently holds an ERC Starting Grant called BigEarth, which aims to develop a scalable earth observation (EO) image search and retrieval system for the fast discovery of critical information contained in massive EO archives. Additionally, she is the recipient of the prestigious 2018 Early Career Award presented by the IEEE Geoscience and Remote Sensing Society.

TU Berlin’s Prof. Dr. Ziawasch Abedjan, Head of the Big Data Management (BigDaMa) Group, conducts research in data cleaning, data integration, machine learning, and data science. BigDaMa created a prototype for the integration of web data with data sets for prediction tasks, which was awarded first place in the 2019 GI Data Science Challenge. In collaboration with DIMA researchers, the group is investigating how to optimize machine learning processes via the integration of artifacts. A recent publication on configuration-free error detection in datasets received the ACM SIGMOD Most Reproducible Paper Award.

TU Berlin’s Prof. Dr. Odej Kao, Head of the Distributed and Operating Systems Group, conducts research on the collaborative processing of sensor data using heterogeneous distributed computing resources and achieving automatic compliance with prescribed properties. In collaboration with Dr. Fuyuki Ishikawa of the National Institute of Informatics in Tokyo, Japan, there is ongoing research in the area of reliability testing. Additionally, there is an ongoing collaboration with Prof. Dr. Gjorgji Madjarov of the University of Skopje in North Macedonia on self-monitored learning for medical decision support.

TU Berlin’s Prof. Dr. Georgios Smaragdakis is currently Head of the Internet Network Architectures and Internet Measurement and Analysis Groups. His focus is on improving the performance of geo-distributed analyses via software-defined networks and an intelligent scheduler across varying analysis platforms. More recently, his groups analyzed data to assess the impact of the COVID-19 pandemic on network performance. The collaboration partners include researchers from a number of international and German institutions, including Imperial College London, Northeastern University, IMDEA Networks, FORTH, Akamai Technologies, Max Planck Institute for Informatics, and DE-CIX. His work was recognized with a best paper award at ACM CoNEXT 2019, a Best of ACM SIGCOMM Computer Communication Review 2019 award, and two IETF/IRTF Applied Networking Research Prizes in 2019 and 2020.

DFKI’s Prof. Dr. Sebastian Möller, Head of the Speech and Language Technology Group, is conducting research on scalable cross-lingual information extraction methods for the identification and prevention of interactions from social media and forums. The project is conducted in close cooperation with LIMSI, CNRS in France.

TU Berlin’s Prof. Dr. Manfred Hauswirth, Head of the Open Distributed Systems (ODS) Group and Managing Director of the Fraunhofer Institute for Open Communication Systems (FOKUS), conducts research on open distributed systems, including sensor/stream middleware, cyber-physical systems, P2P and Semantic Web/Linked Data. Together with Dr. Danh Le Phuoc, the ODS team has been building the CQELS Framework, which supports neural-symbolic reasoning operations on multimodal stream data such as video streams, cameras/LIDARs and semantic streaming graphs. The CQELS Framework helped the ODS team win the Best Paper Award at the 8th International Conference on the Internet of Things, 2018. A recent ODS publication on “autonomous semantic stream processing” received the Best Paper Runner-up at the 9th Joint International Semantic Technology Conference, 2019.

This year, BIFOLD research on Big Data Science received very high international visibility at prestigious conferences such as SIGMOD, VLDB, ICDE and CIDR.

APPLICATION-ORIENTED RESEARCH

MEDICINE

Various biomedical applications are tackled in BIFOLD. Prof. Dr. Frank Noé, Head of the Computational Molecular Biology Group at FU Berlin, has been working on the development of SARS-CoV-2 drugs using simulation and ML for the JEDI COVID-19 Grand Challenge, jointly with virologists and physicians in Germany and the USA. Prof. Dr. Anja Hennemuth at Charité is conducting research on the analysis of the heart muscle and blood vessel walls using ML. Further focal points include the development of novel approaches for the interactive exploration of 4D radiomics features of the heart as well as the clinical evaluation of research software. Under the direction of Prof. Dr. Hennemuth, the CADA Challenge for the detection and analysis of cranial aneurysms is currently underway. The results will be evaluated and published at the upcoming MICCAI conference, to be held in September 2020.

Prof. Dr. Martin Vingron, Director of the Computational Molecular Biology Department at the Max Planck Institute for Molecular Genetics, is investigating the mechanisms of cell type-specific gene regulation. Machine learning and statistical methods are employed to identify regulatory DNA elements in the genome and link them to their target genes. In close cooperation with Charité, under the direction of Prof. Dr. Frederick Klauschen, a clinically relevant contribution to the computer-assisted diagnosis of lung carcinomas and metastases of head and neck tumors has been made. The developed method exceeds the prediction accuracy of SVMs and randomized tests. The group of BIFOLD Senior Researcher Dr. Roland Schwarz at the Max Delbrück Center for Molecular Medicine conducts research in cancer genomics using ML. Earlier this year, in conjunction with members of the International Cancer Genome Consortium, a paper on the Pan-Cancer Analysis of Whole Genomes (PCAWG) was published in Nature. The widely cited study identifies common mutation patterns in over 2600 whole cancer genomes.

Prof. Dr. Thomas Wiegand, Executive Director at the Fraunhofer Heinrich Hertz Institute (HHI), conducts research in signal processing, data and video compression, communications, and applied ML. Currently, he chairs the ITU/WHO Focus Group on AI for Health.

Prof. Dr. Uwe Ohler’s group at the Max Delbrück Center is active in applying ML to understand the gene regulatory code of complex organisms. The specific goal is to interpret how genetic sequence variation changes the activity of genes, and initial results show great promise for using deep learning to map and score the target sequences of regulatory proteins.

SECURITY

In the field of computer security, Prof. Dr. Jean-Pierre Seifert, Head of the Security in Telecommunications (SECT) Group at TU Berlin, is conducting research on multiple fronts: on modeling physically unclonable functions (PUFs), in cooperation with researchers at CWI (Netherlands) and the University of Florida (USA), and on the separation of the PAC (Probably Approximately Correct) learning model for quantum computers from the classical PAC learning model, jointly with Prof. Dr. Jens Eisert, Head of a Research Group in the Dahlem Center for Complex Quantum Systems at FU Berlin.

At TU Braunschweig, Prof. Dr. Konrad Rieck is working on attacks and protective measures for learning-based and data-driven IT systems within the framework of BIFOLD. This research is carried out in cooperation with the Cyber Security in the Age of Large-Scale Adversaries (CASA) Cluster of Excellence at the Ruhr University Bochum and researchers from King’s College London.

DIGITAL HUMANITIES

In the Digital Humanities, Prof. Dr. Matteo Valleriani at the Max Planck Institute for the History of Science is working on the development of a deep neural network to calculate similarities between astronomical tables as printed in early modern scientific papers. This effort would enable us, in cooperation with experts from the humanities and researchers in the Explainable AI community, to automatically determine semantic similarity across a large number of historical numerical tables.
