Generalized convolutional many body distribution functional representations

Danish Khan
O. Anatole von Lilienfeld

September 30, 2024

Modern machine learning (ML) models of chemical and materials systems with billions of parameters require vast training datasets and considerable computational effort. Lightweight kernel- or decision-tree-based methods, however, can be trained rapidly, leading to a considerably lower carbon footprint. We introduce generalized convolutional many-body distribution functionals (cMBDF) as highly compute- and data-efficient atomic representations for accurate kernels that excel in low-data regimes. Generalizing the many-body distribution functional (MBDF) framework, cMBDF encodes local chemical environments compactly using translationally and rotationally invariant functionals of smooth, atom-centered Gaussian electron density proxy distributions weighted by interaction potentials. The functional values can be evaluated efficiently by expressing them as convolutions, which are calculated via fast Fourier transforms and stored on pre-defined grids. In the generalized form, each atomic environment is described by a set of functionals uniformly defined by three integers: the many-body, derivative, and weighting orders. Irrespective of system size or composition, cMBDF atomic vectors remain compact and constant in size for a fixed choice of these orders, which control the structural and compositional resolution. While being up to two orders of magnitude more compact than other popular representations, cMBDF is shown to be more accurate for learning various quantum properties such as energies, dipole moments, HOMO-LUMO gaps, heat capacity, polarizability, optimal exact-exchange admixtures, and basis-set scaling factors. Applicability to organic and inorganic chemistry is tested on the QM7b, QM9, and VQM24 data sets. The versatility, accuracy, and computational efficiency obtained suggest that cMBDF holds great promise as a crucial ingredient for foundational yet green ML models in the chemical and materials sciences.
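
The convolution step mentioned in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the grid, the Gaussian width `sigma`, the exponential `weight` function, and the helper name `fft_convolve` are all assumptions chosen for illustration. It only shows that a Gaussian convolved with a weighting function on a pre-defined grid gives the same result via fast Fourier transforms as via direct summation.

```python
import numpy as np

def fft_convolve(f, g, dx):
    """Linear convolution of two sampled functions via zero-padded FFTs.

    Hypothetical helper for illustration; dx is the grid spacing used to
    approximate the convolution integral by a Riemann sum.
    """
    n = len(f) + len(g) - 1          # length of the full linear convolution
    F = np.fft.rfft(f, n)            # zero-pads f to length n
    G = np.fft.rfft(g, n)            # zero-pads g to length n
    return np.fft.irfft(F * G, n) * dx

# Pre-defined radial grid (assumed range and resolution).
r = np.linspace(-10.0, 10.0, 2001)
dx = r[1] - r[0]

# Smooth Gaussian standing in for an atom-centered density proxy (assumed width).
sigma = 0.5
gaussian = np.exp(-r**2 / (2.0 * sigma**2))

# Assumed weighting function; the paper's interaction potentials may differ.
weight = np.exp(-np.abs(r))

conv_fft = fft_convolve(gaussian, weight, dx)
conv_direct = np.convolve(gaussian, weight) * dx   # O(N^2) reference

max_abs_diff = float(np.max(np.abs(conv_fft - conv_direct)))
print(f"max |FFT - direct| = {max_abs_diff:.2e}")
```

The FFT route scales as O(N log N) in the number of grid points, compared with O(N^2) for the direct sum, which is the usual motivation for evaluating such integrals as convolutions on a grid.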