
Welcome Prof. Dr. Ziawasch Abedjan

New Chair for Information Integration and Data Cleansing

As of March 1, 2024, BIFOLD welcomes a new Research Group Lead in Berlin: Prof. Dr. Ziawasch Abedjan heads the new chair for Data Integration and Data Preparation. He develops new algorithms for the automatic preparation, extraction, and cleaning of datasets for data science workflows. With this addition, BIFOLD now has 11 research groups and around 120 scientific staff members. Another new professor will start in June, and two BIFOLD groups at Charité – Universitätsmedizin Berlin are in the process of being established.
“I am very happy to return to Berlin and to join BIFOLD as a research group lead,” said the 37-year-old scientist, who completed his doctorate at the Hasso Plattner Institute in Potsdam in 2014, spent two years at MIT as a postdoc, and then served as a junior professor at TU Berlin from 2016 to 2020. Before his move to BIFOLD, Ziawasch Abedjan headed the Databases and Information Systems group as a full professor at Leibniz Universität Hannover.
 

Special lectures on Data Science for computer science as well as non-STEM students

“My core area is information integration and data preparation. Our research sits right at the interface between data processing, machine learning, and knowledge discovery, ideally complementing my colleagues at BIFOLD,” Ziawasch Abedjan stated. “Before data from various sources can be put to use in an application, it must first be standardized. A simple example would be merging different Excel tables with, say, personal data: umlauts are written differently, the order of the data changes, there are errors in the tables, and so on. In research, often hundreds of thousands of data points need to be standardized, and this is still largely done manually today. Most data scientists report spending 60–80% of their time on cleaning and transforming data, next to data discovery and extraction, and most admit that this is the least enjoyable task in their pipeline. So far, we have focused on data cleaning, which aims at identifying data inconsistencies and errors and correcting them, and we have developed approaches to reduce the human effort this requires. Data cleaning is a tedious preprocessing step in the vast majority of data science applications. Our solutions make use of historical data and novel applications of machine learning techniques to reduce the user effort in this regard. In Berlin we will be working on automating and accelerating data cleaning and other integration tasks, including data discovery and data resolution, at the scale of large data lakes.”
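To make the kind of manual standardization described above concrete, here is a minimal, purely illustrative Python sketch; the table columns, example names, and the naive umlaut mapping are assumptions for this example and not part of any BIFOLD tooling:

```python
# Illustrative sketch only: two small "Excel tables" with personal data in which
# umlauts are spelled differently, columns appear in a different order, and one
# record has stray whitespace -- the kind of cleaning usually done by hand.
import pandas as pd

a = pd.DataFrame({"name": ["Müller, Anna", "Schäfer, Jonas"],
                  "city": ["Berlin", "Hannover"]})
b = pd.DataFrame({"city": ["Berlin", "Potsdam"],
                  "name": ["Mueller, Anna", "Krüger, Lea "]})

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Naive rule for the example: rewrite "ue" as "ü" and strip whitespace.
    # A blanket rule like this would break names that legitimately contain "ue".
    out["name"] = out["name"].str.strip().str.replace("ue", "ü")
    # Bring the columns into one fixed order before merging.
    return out[["name", "city"]]

# Merge both sources and drop records that turn out to be duplicates.
merged = pd.concat([normalize(a), normalize(b)]).drop_duplicates("name")
print(merged)
```

Hand-written rules like these only fit one specific table layout; exactly this case-by-case manual effort is what the group's research on automated data cleaning aims to reduce.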
For structured data in well-defined applications, initial successes have already been achieved. Merging DNA-sequence data, laboratory results, or program code is far more complicated. In addition, future work will increasingly involve models that merge data for arbitrary applications and autonomously learn the requirements of each new application. “Despite all the progress in this area, automatic data preparation and integration will not work without some manual post-processing of the data in the foreseeable future,” said the data expert.


Ziawasch Abedjan currently supervises six doctoral students who are moving with him to Berlin. In teaching, he will offer a lecture on data integration as well as a foundational lecture on data science for computer science students and for students from non-STEM fields. “As more and more sciences work with large datasets, it is important to me to familiarize scientists from non-STEM fields with these fundamentals as well. This lecture will therefore be aimed at students from non-STEM fields as much as at computer scientists.”