
BIFOLD Symposium


September 15, 2022  09:30 - September 16, 2022  18:00


Forum Digitale Technologien & BBAW in Berlin


Contact person
Katharina Jung
+49 (0)30 314-70851


September 15, 2022: Forum Digitale Technologien

September 16, 2022: BBAW


Time | Event/Topic | Speaker | Institution | Chair
09:30 | Coffee
09:45 | Welcome | Prof. Dr. Volker Markl, Prof. Dr. Klaus-Robert Müller | BIFOLD Directors
10:00 | "Adversarial Preprocessing: Image-Scaling Attacks in Machine Learning" | Prof. Dr. Konrad Rieck | Technische Universität Braunschweig | Prof. Dr. Volker Markl
10:40 | "A Vision for Data Alignment and Integration in Data Lakes" | Prof. Dr. Renée Miller | Khoury College of Computer Sciences, Northeastern University, Boston
11:20 | Break
11:50 | "Advancing molecular simulation with Deep Learning" | Prof. Dr. Frank Noé | Freie Universität Berlin | Prof. Dr. Konrad Rieck
12:30 | "Do you see what I see? Large-scale learning from multimodal videos" | Prof. Dr. Cordelia Schmid | Research Director, INRIA; Research Scientist, Google
13:10 | Lunch break
14:00 | "Physics-aware Machine learning in the Earth sciences" | Prof. Dr. Gustau Camps-Valls | Universitat de València, Image and Signal Processing (ISP) group | Prof. Dr. Klaus-Robert Müller
14:40 | "Saga: Continuous Construction and Serving of Large Scale Knowledge Graphs" | Prof. Ihab F. Ilyas Kaldas | Cheriton School of Computer Science, University of Waterloo
15:20 | Break
15:50 | "Matrix and Tensor Factorizations for Data Fusion: Focus on Model Match, Interpretability, and Reproducibility" | Prof. Dr. Tülay Adali | University of Maryland, Baltimore County | Prof. Dr. Matthias Böhm
16:30 | "A Next-Generation Multi-Objective Optimizer for Resource Management in Cloud Data Analytics" | Prof. Dr. Yanlei Diao | University of Massachusetts Amherst & École Polytechnique
17:00 | Poster session and finger food

Time | Event/Topic | Speaker | Institution | Chair
09:30 | Coffee
09:55 | Welcome | Prof. Dr. Volker Markl, Prof. Dr. Klaus-Robert Müller | BIFOLD Directors
10:00 | "Indexing and Querying Earth from Big Satellite Data Archives" | Prof. Dr. Begüm Demir | Technische Universität Berlin | tbd
10:40 | "Quantum Machine Learning in chemical compound space" | Prof. Dr. Anatole von Lilienfeld | Professor & Clark Chair of Advanced Materials at the Vector Institute & University of Toronto
11:20 | Break
11:40 | "Optimizing Compiler Infrastructure for Data-centric ML Pipelines" | Prof. Dr. Matthias Böhm | Technische Universität Berlin | Prof. Dr. Begüm Demir
12:20 | "Learning and Data as the foundation for medical breakthroughs - the case of AI in critical care" | Prof. Dr. Alexander Meyer | Charité - Universitätsmedizin Berlin, Head of DHZB Medical Data Science
13:00 | Lunch
14:00 | Welcome | Wolfgang Richter (Host)
14:10 | Greeting | Franziska Giffey | Governing Mayor of Berlin
14:20 | Video message | Judith Pirscher | State Secretary, Bundesministerium für Bildung und Forschung
14:30 | Greeting | Prof. Dr. Geraldine Rauch | President of Technische Universität Berlin
14:40 | Greeting | Prof. Dr. Heyo Kroemer | CEO, Charité - Universitätsmedizin Berlin
14:50 | Presentation of BIFOLD | Prof. Dr. Volker Markl, Prof. Dr. Klaus-Robert Müller | BIFOLD Directors
15:10 | "Recent advances in representation learning" | Dr. Samy Bengio | Senior Director of AI and Machine Learning Research, Apple
15:40 | BIFOLD Artist in Residence
16:00 | Music & Reception




Yanlei Diao is Professor of Computer Science at École Polytechnique, France, and the University of Massachusetts Amherst, USA. She recently joined Amazon as an Amazon Scholar. Her research interests lie in big data analytics and scalable intelligent information systems, with a focus on optimization in cloud analytics, data stream analytics, explainable anomaly detection, interactive data exploration, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005.

Prof. Diao was a recipient of the 2016 ERC Consolidator Award, the 2013 CRA-W Borg Early Career Award (given to one female computer scientist each year for outstanding contributions), an IBM Scalable Innovation Faculty Award, and an NSF CAREER Award. She has given keynote speeches at the ACM DEBS Conference, the Max Planck Institute (MPI) for Informatics, the IBM Almaden Research Center, Naver Research Labs, Northeastern University, Technische Universität Darmstadt, and the University of Texas at Austin. She has served as Chair of the ACM SIGMOD Awards Committee, Chair of the ACM SIGMOD Research Highlight Award Committee, Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, and as a member of the SIGMOD and PVLDB Executive Committees. She was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and has served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.


Cloud data analytics at production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting the performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, the placement of parallel instances on machines, and the resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of cloud environments while having to meet stringent time constraints for scheduling. In this talk, I present UDAO, a multi-objective resource optimizer that solves the above problems by bringing large-scale machine learning to automated performance modeling and a principled MOO approach to enabling intelligent resource optimization decisions. I present our solutions in two settings: 1) parameter tuning of Spark jobs in single-query execution environments and 2) making placement and resource allocation decisions in multi-tenant serverless environments. Evaluation using production workloads shows that our optimizer can reduce latency by 37-72% and, at the same time, cloud cost by 43-78%, while running in 0.02-0.03 s.
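The multi-objective core of such an optimizer can be sketched as Pareto filtering over candidate resource configurations: keep only configurations that no other candidate beats on both latency and cost. The configurations, metrics, and the `pareto_front` helper below are illustrative assumptions, not the UDAO implementation:

```python
# Hedged sketch of multi-objective resource optimization: given candidate
# configurations scored on latency and cost, keep only the Pareto-optimal
# ones (no other candidate is at least as good on both objectives and
# strictly better on one). Data and names are illustrative, not from UDAO.

def pareto_front(candidates):
    """Return the candidates not dominated on both latency and cost."""
    front = []
    for c in candidates:
        dominated = any(
            o["latency"] <= c["latency"] and o["cost"] <= c["cost"]
            and (o["latency"] < c["latency"] or o["cost"] < c["cost"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

configs = [
    {"cores": 4,  "latency": 120.0, "cost": 0.8},
    {"cores": 8,  "latency": 70.0,  "cost": 1.5},
    {"cores": 16, "latency": 65.0,  "cost": 3.1},  # faster but much costlier
    {"cores": 8,  "latency": 90.0,  "cost": 1.9},  # dominated by the 70.0/1.5 config
]
best = pareto_front(configs)  # the optimizer then picks from this front
```

A real optimizer would predict the latency and cost of each configuration with learned performance models and pick from the front according to user preferences; the enumeration above only illustrates the dominance test.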



Anatole has been the inaugural Clark Chair in Advanced Materials at the Vector Institute and the University of Toronto since 2022. Prior to that, he was a Full Professor for “Computational Materials Discovery” at the Faculty of Physics of the University of Vienna. From 2013 to 2020, Anatole held Associate and Assistant Professorships at the University of Basel and the Free University of Brussels. Until 2013, he worked as an Assistant Computational Scientist at the Argonne National Laboratory's Leadership Computing Facility. In spring 2011, he chaired the three-month program “Navigating Chemical Compound Space for Materials and Bio Design” at the Institute for Pure and Applied Mathematics at UCLA. From 2007 to 2010, Anatole was a Distinguished Harry S. Truman Fellow at Sandia National Laboratories.

Anatole carried out postdoctoral research at the Max-Planck Institute for Polymer Research (2007) and at New York University (2006). He received a PhD in computational chemistry from EPF Lausanne in 2005. He performed his diploma thesis work within an Erasmus exchange program at ETH Zürich and the University of Cambridge. He studied chemistry as an undergraduate at ETH Zürich, the École de Chimie, Polymères, et Matériaux in Strasbourg, and at the University of Leipzig.


Many of the most relevant observables of matter depend explicitly on atomistic and electronic details, rendering a first principles approach to computational materials design mandatory. Alas, even when using high-performance computers, brute force high-throughput screening of material candidates is beyond any capacity for all but the simplest systems and properties due to the combinatorial nature of compound space, i.e. all the possible combinations of compositional and structural degrees of freedom. Consequently, efficient exploration algorithms exploit implicit redundancies and correlations. I will discuss recently developed statistical learning based approaches for interpolating relevant chemical properties throughout compound space. Numerical results indicate promising performance in terms of efficiency, accuracy, scalability and transferability.
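As a toy illustration of such statistical-learning-based interpolation, the sketch below fits kernel ridge regression with a Gaussian kernel to synthetic "descriptor" vectors. The descriptors, the property being learned, and all parameter values are invented for illustration and are not the methods or data from the talk:

```python
import numpy as np

# Toy sketch: kernel ridge regression (KRR) with a Gaussian kernel, a common
# statistical-learning approach for interpolating a property across compound
# space. Descriptors and the "property" are synthetic placeholders.

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))   # stand-in descriptor vectors
y_train = X_train.sum(axis=1)        # stand-in property to interpolate

lam = 1e-3                           # regularization strength
K = gaussian_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

X_test = rng.normal(size=(5, 3))     # new "compounds" to predict
y_pred = gaussian_kernel(X_test, X_train) @ alpha
```

In practice, the accuracy of such models hinges on the choice of representation (descriptor) and kernel, which is where much of the research effort in this area lies.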



Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate in Computer Science from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis on "Local Greyvalue Invariants for Image Matching and Retrieval" received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled "From Image Matching to Learning Visual Models". Dr. Schmid was a postdoctoral research assistant in the Robotics Research Group of Oxford University in 1996-1997. Since 1997 she has held a permanent research position at Inria, where she is a research director. Dr. Schmid is a member of the German National Academy of Sciences Leopoldina and a fellow of the IEEE and the ELLIS society. She was awarded the Longuet-Higgins Prize in 2006, 2014 and 2016 and the Koenderink Prize in 2018, each for fundamental contributions in computer vision that have withstood the test of time. She received an ERC Advanced Grant in 2013, the Humboldt Research Award in 2015, the Inria & French Academy of Science Grand Prix in 2016, the Royal Society Milner Award in 2020 and the PAMI Distinguished Researcher Award in 2021. Dr. Schmid has been an Associate Editor for IEEE PAMI (2001-2005) and for IJCV (2004-2012), an editor-in-chief for IJCV (2013-2018), a program chair of IEEE CVPR 2005 and ECCV 2012, as well as a general chair of IEEE CVPR 2015, ECCV 2020 and ICCV 2023. Since 2018 she has held a joint appointment with Google Research.


In this talk, we present recent progress on large-scale learning of multimodal video representations. We start by presenting VideoBert, a joint model for video and language, repurposing the Bert model for multimodal data. This model achieves state-of-the-art results on zero shot prediction and video captioning. Next, we present an approach for video question answering which relies on training from instruction videos and cross-modal supervision with a textual question answer module. We show state-of-the-art results for video question answering without any supervision (zero-shot VQA) and demonstrate that our approach obtains competitive results for pre-training and then fine-tuning on video question answering datasets. We conclude our talk by presenting the recent VideoCC dataset, which transfers image captions to video and allows obtaining state-of-the-art performance for zero-shot video and audio retrieval and video captioning.



Frank Noé holds degrees in electrical engineering and computer science and received his PhD in molecular physics from the University of Heidelberg. He came to FU Berlin in 2007 as a group leader and became a professor in 2013. Frank also holds an adjunct professorship at Rice University (Houston) and is a visiting researcher at Microsoft Research Cambridge. He has held two ERC grants, is a member of the Berlin-Brandenburg Academy of Sciences and Humanities, and is on the ISI Highly Cited list. Frank's research interest lies in advancing molecular physics with new methods from machine learning and artificial intelligence.


Molecular simulation may ideally serve as a "computational laboratory", with its ability to observe both structure and dynamics at high resolution and to simulate molecules that are difficult to synthesize. However, it also suffers from fundamental limitations, in particular in the accurate modeling of molecules and in the efficient computation of experimental observables. By leveraging the latest developments in machine learning, we can advance molecular simulation algorithms to make significant progress on these fronts without sacrificing rigorous physics. In this talk, I will give an overview of our work on the highly accurate computation of quantum states with deep fermionic neural networks and quantum Monte Carlo, and on addressing the many-body sampling problem using generative deep learning.



Gustau Camps-Valls (born 1972 in València) is a Physicist and Full Professor in Electrical Engineering at the Universitat de València, Spain, where he lectures on machine learning, remote sensing and signal processing. He is the head of the Image and Signal Processing (ISP) group, an interdisciplinary group of 40 researchers working on AI for the Earth and climate sciences. Prof. Camps-Valls has published 250+ peer-reviewed international journal papers, 350+ international conference papers, 25 book chapters, and 5 international books on remote sensing, image processing and machine learning. He has an h-index of 78 with 29,000+ citations on Google Scholar. He was listed as a Highly Cited Researcher in 2011, 2020 and 2021, and currently has 13 «Highly Cited Papers» and 1 «Hot Paper». Thomson Reuters ScienceWatch identified his activities as a Fast Moving Front research area (2011) and as the most-cited paper in the area of Engineering in 2011. He received the Google Classic Paper award (2019), and the Stanford metrics place him in the top 2% most-cited researchers of 2017-2020. He publishes in both technical and scientific journals, from IEEE and PLOS One to Nature, Nature Communications, Science Advances, and PNAS.

He has been a Program Committee member of international conferences (IEEE, SPIE, EGU, AGU), Technical Program Chair of IEEE IGARSS 2018 (2,400+ attendees), and General Chair of AISTATS 2022. He has served on technical committees of the IEEE GRSS and IEEE SPS, as Associate Editor of 5 top IEEE journals, and in the prestigious IEEE GRSS Distinguished Lecturer program (2017-2019), promoting «AI in Earth sciences» globally. He has given 100+ talks, been a keynote speaker at 10+ conferences, and (co-)advised 10+ PhD theses.

He has coordinated or participated in 60+ research projects involving industry and academia at national and European levels. He has assisted the aerospace industry on advisory boards, is a Fellow Consultant of the ESA PhiLab (2019), and is a member of the EUMETSAT MTG-IRS Science Team. He is committed to open source and open access in science, and is a regular panel evaluator for H2020 (ERC, FET) and for the NSF and the Chinese and Swiss science foundations.

He coordinates the ‘Machine Learning for Earth and Climate Sciences' research program of ELLIS, the top network of excellence on AI in Europe. He was elevated to IEEE Fellow (2018) in two societies (Geoscience and Remote Sensing, and Signal Processing) and to ELLIS Fellow (2019). Prof. Camps-Valls is the only researcher to receive two European Research Council (ERC) grants in two different areas: an ERC Consolidator grant (2015, Computer Science) and an ERC Synergy grant (2019, Physical Sciences), both to advance AI for the Earth and climate sciences. In 2021 he became a member of the ESSC panel of the European Science Foundation (ESF), and in 2022 he was elevated to Fellow of the European Academy of Sciences (EurASc), Fellow of the Academia Europaea (AE), and Fellow of the Asia-Pacific Artificial Intelligence Association (AAIA).


Most problems in the Earth sciences aim at inference about the system, where accurate predictions are just a small part of the whole problem. Inference means understanding the relations among variables and deriving models that are physically plausible, simple and parsimonious, and mathematically tractable. Machine learning models alone are excellent approximators, but they very often do not respect the most elementary laws of physics, such as mass or energy conservation, so consistency and confidence are compromised. I will review the main challenges ahead in the field and introduce several ways to work at the interplay of physics and machine learning. Physics-aware machine learning models are just a step towards understanding the data-generating process, for which causality promises great advances; I will review some recent methodologies to cope with this as well. This is a collective long-term AI agenda towards developing and applying algorithms capable of discovering knowledge in the Earth system.



Ihab Ilyas is a professor in the Cheriton School of Computer Science and the NSERC-Thomson Reuters Research Chair on data quality at the University of Waterloo. He is currently on leave, leading the Knowledge Graph Platform at Apple. His main research focuses on the areas of data science and data management, with special interest in data quality and integration, knowledge construction, machine learning for structured data, and information extraction. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration, and of inductiv (acquired by Apple), a Waterloo-based startup using AI for structured data cleaning. He is a recipient of the Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award, and he is an ACM Fellow and an IEEE Fellow.


In this talk I present Saga, an end-to-end platform for the incremental and continuous construction of large-scale knowledge graphs. Saga demonstrates the complexity of building such a platform in industrial settings with strong consistency, latency, and coverage requirements. In the talk, I will discuss challenges around: building source adapters for ingesting heterogeneous data sources; building entity linking and fusion pipelines that construct coherent knowledge graphs adhering to a common controlled vocabulary; updating the knowledge graphs with real-time streams; and, finally, exposing the constructed knowledge via a variety of services. These graph services include low-latency query answering; graph analytics; ML-based entity disambiguation and semantic annotation; and graph-embedding services that power multiple downstream applications. Saga is used in production at large scale to power a variety of user-facing knowledge features.



Konrad Rieck is a Full Professor at TU Braunschweig, where he leads the Institute of System Security. Previously, he worked at the University of Göttingen, TU Berlin and the Fraunhofer Institute FIRST. Konrad's research interests revolve around computer security and machine learning. Together with his group, he develops learning-based methods for detecting computer attacks, analyzing malicious code, and discovering vulnerabilities. His research has received several awards, including a Google Faculty Research Award, the German IT Security Award, and, recently, an ERC Consolidator Grant.


The remarkable advances of machine learning are overshadowed by attacks that thwart its proper operation. While previous work has mainly focused on attacking learning algorithms directly, another weak spot in intelligent systems has been overlooked: preprocessing. As an example of this threat, I present a recent class of attacks against image scaling. These attacks are agnostic to learning algorithms and affect the preprocessing of all vision systems that use vulnerable implementations, including versions of TensorFlow, OpenCV, and Pillow. Based on a root-cause analysis of the vulnerabilities, I introduce novel defenses that effectively block image-scaling attacks in practice and can be easily added to existing systems.
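The mechanism behind such attacks can be demonstrated in a few lines: nearest-neighbor downscaling by a factor s reads only every s-th pixel, so an attacker who modifies just those pixels controls the downscaled image while leaving the vast majority of the full-resolution image untouched. The sketch below is a simplified illustration of this principle, not the attack code from the talk (real libraries use more elaborate interpolation, which the actual attacks account for):

```python
import numpy as np

# Simplified illustration of an image-scaling attack: nearest-neighbor
# downscaling by factor s samples only every s-th pixel, so modifying just
# those sampled pixels swaps in an attacker-chosen downscaled image.

def nn_downscale(img, s):
    """Nearest-neighbor downscale by an integer factor s."""
    return img[::s, ::s].copy()

s = 6
benign = np.full((60, 60), 200, dtype=np.uint8)  # bright, innocuous image
target = np.zeros((10, 10), dtype=np.uint8)      # dark image the attacker wants

attack = benign.copy()
attack[::s, ::s] = target     # modify only the pixels the scaler reads

recovered = nn_downscale(attack, s)    # what the ML pipeline actually sees
changed = np.mean(attack != benign)    # fraction of pixels modified (~2.8%)
```

Here `recovered` equals the attacker's target image even though roughly 97% of the full-resolution pixels are unchanged, which is why the attack is hard to spot by eye.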



Matthias Boehm is a full professor for large-scale data engineering at Technische Universität Berlin and BIFOLD. His cross-organizational research group focuses on high-level, data-science-centric abstractions as well as systems and tools to execute these tasks in an efficient and scalable manner. From 2018 through 2022, Matthias was a BMK-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the co-located Know-Center GmbH. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his PhD from Dresden University of Technology, Germany, in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing.


The trend towards data-centric AI leads to increasingly complex, composite ML pipelines with outer loops for data integration and cleaning, data programming and augmentation, model and feature selection, hyper-parameter tuning and cross-validation, as well as data validation and model debugging. Interestingly, state-of-the-art techniques for data integration, cleaning, and augmentation as well as model debugging are themselves based on machine learning, which motivates their integration into ML systems. In this talk, we make a case for optimizing compiler infrastructure in Apache SystemDS and DAPHNE, two sibling open-source ML systems. We discuss recent feature highlights and how they all fit together. The covered topics range from linear-algebra-based data cleaning pipeline enumeration and slice finding; over lineage-based reuse and workload-aware redundancy exploitation; to federated learning, vectorized execution on heterogeneous hardware devices, and extensibility.



Begüm Demir is a Professor and the founding head of the Remote Sensing Image Analysis (RSiM) group at the Faculty of Electrical Engineering and Computer Science, TU Berlin, and the head of the Big Data Analytics for Earth Observation research group at the Berlin Institute for the Foundations of Learning and Data (BIFOLD).

She performs research in the field of processing and analysis of large-scale Earth observation data acquired by airborne and satellite-borne systems. She was awarded the prestigious 2018 Early Career Award by the IEEE Geoscience and Remote Sensing Society for her research contributions to machine learning for information retrieval in remote sensing. In 2018, she received a Starting Grant from the European Research Council (ERC) for her project “BigEarth: Accurate and Scalable Processing of Big Data in Earth Observation”. She has been an IEEE Senior Member since 2016.


Earth observation (EO) data archives are growing explosively as a result of advances in satellite systems. As an example, remote sensing (RS) images acquired by ESA's Sentinel satellites (part of the EU's Copernicus program) amount to more than 10 TB per day. Such “big EO data” is a great source for information discovery and extraction for monitoring Earth from space, and accurate, scalable techniques for RS image understanding, search and retrieval have recently emerged. In this talk, a general overview of scientific and practical problems related to RS image characterization, indexing and search in massive archives will first be given. Then, our recent developments that can overcome these problems will be presented. Particular attention will be given to our deep hashing network, which learns a semantic-based metric space while simultaneously producing binary hash codes for scalable and accurate content-based indexing and retrieval of RS images. Finally, the BigEarthNet benchmark archive, one of the largest benchmark archives supporting deep learning studies in EO, will be introduced.
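The retrieval side of such a hashing approach is easy to sketch: once each image is represented by a binary code, content-based search reduces to ranking the archive by Hamming distance to the query code. The codes below are random stand-ins for the outputs of a trained network; this illustrates only the retrieval step, not the deep hashing network itself:

```python
import numpy as np

# Sketch of hash-based retrieval: archive images are represented by binary
# codes (here random stand-ins for the outputs of a trained hashing network),
# and a query is answered by ranking all codes by Hamming distance.

def hamming_rank(query_code, archive_codes):
    """Return archive indices sorted by Hamming distance to the query code."""
    dists = (archive_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable")

rng = np.random.default_rng(42)
archive = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)  # 64-bit codes

query = archive[123].copy()   # a near-duplicate of archive image 123...
query[:2] ^= 1                # ...with two bits flipped

ranking = hamming_rank(query, archive)
```

Because Hamming distances on compact codes are computed with cheap bit operations, this ranking scales to archives far larger than exact nearest-neighbor search in a high-dimensional feature space would.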



Renée J. Miller is a University Distinguished Professor of Computer Science at Northeastern University. She is a Fellow of the Royal Society of Canada and received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Ontario Premier's Research Excellence Award, and an IBM Faculty Award. She formerly held the Bell Canada Chair of Information Systems at the University of Toronto and is a fellow of the ACM. Her work has focused on the long-standing open problem of data integration and has achieved the goal of building practical data integration systems. She and her colleagues received the ICDT Test-of-Time Award and the 2020 Alonzo Church Award for Outstanding Contributions to Logic and Computation for their influential work establishing the foundations of data exchange. She received the CS-Can/Info-Can Lifetime Achievement Award in Computer Science. Professor Miller is an Editor-in-Chief of the VLDB Journal and former president of the Very Large Data Base (VLDB) Foundation. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor's degrees in Mathematics and Cognitive Science from MIT.


The requirements for integration over massive, heterogeneous table repositories (aka data lakes) are fundamentally different from those for federated data integration (where the data owned by an enterprise is integrated into a cohesive whole) or data exchange (where data is exchanged and shared among a small set of autonomous peers). In this talk, I will outline a vision for data alignment and integration in data lakes. Data lakes afford new opportunities for using new methods to discover emergent semantics from large heterogeneous collections of data sets. I will illustrate these ideas by discussing the problem of data lake disambiguation, work which received the best paper award at EDBT 2021.



Tülay Adali received the PhD in Electrical Engineering from North Carolina State University, Raleigh, NC, USA, in 1992 and joined the faculty at the University of Maryland Baltimore County (UMBC), Baltimore, MD, the same year. She is currently a Distinguished University Professor in the Department of Computer Science and Electrical Engineering at UMBC.

She has been active in conference organization. She served or will serve as Technical Chair (2017), Special Sessions Chair (2018, 2024), and Publicity Chair (2000, 2005) for the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), and as General/Technical Chair for the IEEE Machine Learning for Signal Processing (MLSP) and Neural Networks for Signal Processing workshops (2001-2009, 2014, 2023). She has served, or is currently serving, on numerous boards and technical committees of the IEEE Signal Processing Society (SPS). She was Chair of the NNSP/MLSP Technical Committee (2003-2005 and 2011-2013) and SPS Vice President for Technical Directions (2019-2022), and is currently Chair-Elect for the IEEE Brain Initiative. Prof. Adali is a Fellow of the IEEE and the AIMBE, a Fulbright Scholar, and an IEEE SPS Distinguished Lecturer. She is the recipient of a Humboldt Research Award, an IEEE SPS Best Paper Award, the SPIE Unsupervised Learning and ICA Pioneer Award, the University System of Maryland Regents' Award for Research, and an NSF CAREER Award. Her current research interests are in statistical signal processing, machine learning, and their applications, with emphasis on medical image analysis and fusion.


In many fields today, such as neuroscience, remote sensing, computational social science, and physical sciences, multiple sets of data are readily available. Matrix and tensor factorizations enable joint analysis, i.e., fusion, of these multiple datasets such that they can fully interact and inform each other while also minimizing the assumptions placed on their inherent relationships. A key advantage of these methods is the direct interpretability of their results. This talk presents an overview of the main models that have been successfully used for fusion of multiple datasets. Examples based on independent component and independent vector analysis as well as canonical polyadic decomposition are discussed in more detail with examples in fusion of neuroimaging data. Importance of computational reproducibility is also addressed, with a focus on its relationship to model match and interpretability.



Samy Bengio (PhD in computer science, University of Montreal, 1993) has been a senior director of machine learning research at Apple since 2021. Before that, he was a distinguished scientist at Google Research from 2007, where he headed the Google Brain team, and before that at IDIAP in the early 2000s, where he co-wrote the well-known open-source Torch machine learning library.

His research interests span many areas of machine learning such as deep architectures, representation learning, sequence processing, speech recognition, and image understanding.
He is an action editor of the Journal of Machine Learning Research and serves on the board of the NeurIPS Foundation. He was on the editorial board of the Machine Learning Journal, has been program chair (2017) and general chair (2018) of NeurIPS, program chair of ICLR (2015, 2016), general chair of BayLearn (2012-2015), MLMI (2004-2006) and NNSP (2002), and has served on the program committees of several international conferences such as NeurIPS, ICML, ICLR, ECML and IJCAI.


Deep learning has made a lot of progress in the past decade which yielded impressive results transforming applications such as speech recognition or machine translation, but much more needs to be done in order to build and understand better representation learning approaches in terms of fairness, efficiency and robustness. In this presentation, I will go over a set of recent research work from my new team at Apple towards better understanding important aspects of representation learning.