Unlocking the power of past and future QM data generation efforts
Molecular dynamics (MD) simulations are indispensable in materials science and drug discovery, and they rely on accurate potential energy functions, often derived from quantum mechanics (QM) calculations. The computational cost of QM limits scalability, which has led to the emergence of Machine Learning Interatomic Potentials (MLIPs) as faster alternatives. MLIPs, however, must be trained on high-fidelity QM data, and the sparse coverage of chemical space in existing datasets limits their applicability. In this talk, we introduce openQDC, a comprehensive repository that consolidates diverse QM datasets and democratizes access to the high-quality, standardized data needed to train MLIPs for MD simulations. We then discuss two ways to improve future QM data generation efforts: Implicit Delta Learning (IDL) and the study of dataset biases that affect MLIP generalization. IDL reduces reliance on expensive QM data by training models to correlate lower-fidelity semi-empirical (SE) energies with high-fidelity QM energies, improving data efficiency and generalizability without requiring SE data at inference. IDL results indicate promising avenues for enhancing MLIP generalization and scalability. Additionally, studying the tradeoff between structural and conformational diversity highlights how dataset biases affect MLIP performance, providing insights for optimizing data generation efforts. Together, these contributions pave the way for more impactful MLIPs, enabling the faster and more accurate molecular simulations that scientific research and discovery depend on.
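To make the idea of learning from two fidelity levels concrete, here is a minimal sketch of one way such a training objective could look. It is an illustration only, not the IDL formulation presented in the talk: the two-head setup, the function name `multi_fidelity_loss`, the `se_weight` parameter, and the use of `None` for missing labels are all assumptions made for this example. The sketch assumes a model that outputs both a QM and an SE energy per structure, with cheap SE labels available for most structures and expensive QM labels for only some.

```python
from typing import Optional, Sequence


def _masked_mse(preds: Sequence[float], targets: Sequence[Optional[float]]) -> float:
    """Mean squared error over the entries that actually have a label."""
    errors = [(p - t) ** 2 for p, t in zip(preds, targets) if t is not None]
    # If no structure carries this label, the term contributes nothing.
    return sum(errors) / len(errors) if errors else 0.0


def multi_fidelity_loss(
    pred_qm: Sequence[float],
    pred_se: Sequence[float],
    target_qm: Sequence[Optional[float]],
    target_se: Sequence[Optional[float]],
    se_weight: float = 1.0,  # hypothetical knob balancing the two fidelities
) -> float:
    """Combine errors on high-fidelity (QM) and low-fidelity (SE) energies.

    Assumption for this sketch: both predictions come from one shared model,
    so the plentiful SE labels help shape the representation used for QM.
    """
    qm_term = _masked_mse(pred_qm, target_qm)
    se_term = _masked_mse(pred_se, target_se)
    return qm_term + se_weight * se_term


if __name__ == "__main__":
    # Two structures: only the first has a QM label, both have SE labels.
    loss = multi_fidelity_loss(
        pred_qm=[1.0, 2.0],
        pred_se=[0.0, 2.0],
        target_qm=[1.5, None],
        target_se=[1.0, 4.0],
        se_weight=0.5,
    )
    print(loss)  # 0.25 + 0.5 * 2.5 = 1.5
```

In a setup like this, only the QM prediction would be read out at inference, so no SE calculation is needed once training is done, which matches the property described above.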