
Simulations of large biomolecules with quantum accuracy

Universal applicability of the AI model paves the way for accelerated drug discovery

An international team of researchers from the Berlin Institute for the Foundations of Learning and Data (BIFOLD) at TU Berlin, the University of Luxembourg, and Google DeepMind has developed a new machine learning foundation model capable of simulating a wide variety of molecular systems – for example, large and complex biological molecules – with quantum-mechanical accuracy. The results have now been published in the prestigious Journal of the American Chemical Society (JACS). The new method, called SO3LR, combines the latest developments in neural network design with physical laws and was trained on a specially curated dataset of four million different molecular structures. This enables the model to be applied not only to large biomolecules like proteins, sugars, or cell membranes, but also to a broad spectrum of other molecules without the need for retraining. This universal applicability of SO3LR paves the way for accelerated drug discovery and a deeper understanding of molecular biology.

Pushing the frontier of efficiency, accuracy, scalability, and transferability


Molecular dynamics (MD) simulations enable us to understand and predict the behavior of molecules. They describe molecular interactions over time and provide insights into molecular structure, dynamics, and function. An exact simulation of the interactions of large biomolecules could, for example, enable the development of new drugs without first conducting time-consuming, material-intensive, and costly experiments. As such, MD is a cornerstone of modern science, and efforts to improve its accuracy and applicability have a long history in computational physics and chemistry. For decades, scientists have faced a fundamental trade-off: methods were either fast but only approximate and not transferable between different molecules, or highly accurate but computationally extraordinarily expensive. This trade-off restricted accurate simulations to small systems of a few hundred atoms. Large and complex biomolecules – e.g. proteins or sugars – can contain tens of thousands of atoms, which has limited our ability to accurately model and understand fundamental dynamic processes like protein folding or cell assembly.

In recent years, AI-based models have started to bridge this gap between approximate (classical) methods and highly accurate (quantum-mechanical) methods. Despite great advances in the field, scaling AI-based approaches to biomolecular systems of realistic size has remained a persistent challenge. One central reason that has hindered widespread adoption for large and complex biomolecules is the lack of an accurate treatment of quantum effects at long distances between atoms. Simply put, the atoms in a molecule interact not only with nearby atoms but also with atoms far away. The larger the molecule, the more important these long-range effects become. Without such long-range interactions, life as we know it would not be possible, as biomolecules would not be functional.

The new SO3LR model overcomes these challenges, pushing the frontier of efficiency, accuracy, scalability, and transferability for simulating organic and biological molecules. The scientists achieved this by designing SO3LR as a hybrid approach that divides the complex task of calculating the quantum-mechanical interactions between atoms into two complementary components: a fast and highly accurate machine learning model, which learns the complex quantum many-body interactions that occur at short and medium distances, is combined with universal, physically grounded equations that accurately describe the interactions between atoms at long distances. “Reliable simulations at the biomolecular scale hinge on long-range effects, so SO3LR encodes them by design,” explains Adil Kabylda of the University of Luxembourg, who led the project. “This allows our model to focus its powerful learning capacity on capturing the complex quantum effects that traditional models are missing to date,” adds Thorben Frank, postdoctoral researcher at TU Berlin and BIFOLD.
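
The division of labor described above can be illustrated with a toy sketch. It is not the SO3LR implementation: the function names, the cosine-switched pair term standing in for the learned short-range model, the bare Coulomb sum standing in for the physical long-range equations, and the unit system are all assumptions chosen for illustration only.

```python
import math
from itertools import combinations


def short_range_energy(positions, cutoff=3.0):
    """Hypothetical stand-in for a learned short-range model.

    A real model would be a neural network; here each pair inside the
    cutoff contributes a smooth attractive term that goes to zero at
    the cutoff, so distant atoms contribute nothing.
    """
    energy = 0.0
    for a, b in combinations(positions, 2):
        r = math.dist(a, b)
        if r < cutoff:
            energy += -0.5 * (math.cos(math.pi * r / cutoff) + 1.0)
    return energy


def long_range_energy(positions, charges):
    """Physics-based long-range term: a plain Coulomb sum over all pairs.

    Toy units with the Coulomb constant set to 1 (an assumption).
    """
    energy = 0.0
    for (i, a), (j, b) in combinations(enumerate(positions), 2):
        energy += charges[i] * charges[j] / math.dist(a, b)
    return energy


def total_energy(positions, charges, cutoff=3.0):
    """Hybrid energy: short-range stand-in plus explicit long-range physics."""
    return short_range_energy(positions, cutoff) + long_range_energy(positions, charges)


# Two opposite unit charges, 10 length units apart: beyond the cutoff,
# only the long-range Coulomb term contributes.
far_pair = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(total_energy(far_pair, charges=[1.0, -1.0]))
```

In this sketch, atoms separated by more than the cutoff are invisible to the short-range term, so their interaction is carried entirely by the explicit physical term. This is the gap that, according to the text above, SO3LR closes by encoding long-range effects by design.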

The crucial breakthrough with SO3LR lies in its universality

The second challenge that needed to be solved was the universal applicability of a single model to many different molecules. To address it, the team created an extensive and diverse dataset of over 4 million carefully curated molecular structures. This dataset was a key factor in “teaching” SO3LR how to accurately describe the vast diversity of molecules that exist in nature, achieving a level of transferability beyond that of earlier methods.

To demonstrate the capabilities of SO3LR, the research team performed a series of challenging simulations for all four major types of biomolecules found in nature. For example, they simulated large biomolecular systems in an explicit water environment, including the crambin protein and a complex glycoprotein. They also simulated a POPC lipid bilayer, which serves as a model system for human cell membranes.

“The crucial breakthrough with SO3LR lies in its universality. Instead of having to go through the lengthy and complex process of data generation and subsequent model training for every new molecule, we provide a single, ready-to-use foundation model. This saves researchers the time and compute-intensive preparation steps and allows them to directly test hypotheses with quantum-mechanical accuracy,” explains Prof. Klaus-Robert Müller, Co-Director of BIFOLD. “SO3LR represents a crucial step towards this goal. By combining machine learning with physical principles, we are opening the door to modeling realistic biological processes with quantum accuracy, which has profound implications for understanding health and disease and designing the next generation of drugs,” says Prof. Alexandre Tkatchenko from the University of Luxembourg, summarizing the impact of this work.

At a time when AI models are increasingly in the hands of private companies, the model and its underlying datasets are being made openly accessible to the scientific community to accelerate further advances in the field.

Publication: doi.org/10.1021/jacs.5c09558