Development of a Descriptor-Based Foundation Model for Molecular Property Prediction
Jackson Burns, Massachusetts Institute of Technology
Artificial intelligence has revolutionized science by enabling researchers to draw meaningful conclusions from previously intractable datasets. Unfortunately, this impact has been highly uneven across domains, favoring those where data is abundant. In chemistry, data is often in limited supply because experiments are costly to perform. It is in this context that foundation models are promising. Foundation models are general-purpose AI models pre-trained on enormous amounts of data and then ‘fine-tuned’ for a specific application. Pre-training reduces the amount of data required for the fine-tuning step, which makes the approach well suited to the typically data-poor chemistry domain. This work focuses on the ongoing development of a foundation model for molecular property prediction. The model is a self-supervised autoencoder that compresses a set of 1,613 molecular descriptors, computed for molecules in the ChEMBL database, into a dense latent representation of only 64 features. These 64 features are then applied to various cheminformatics benchmarks, achieving reasonable performance with much less training data than state-of-the-art models require.
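To make the compression step concrete, the following is a minimal sketch of an autoencoder that maps 1,613 input features to 64 latent features and back. Only those two sizes come from the abstract. Everything else is an assumption made for illustration: the single-layer encoder and decoder, the tanh activation, the mean squared reconstruction loss, plain gradient descent, the learning rate, the standardization step, and the random placeholder matrix standing in for real descriptors. It is not the architecture used in this work.

```python
import numpy as np

N_DESCRIPTORS = 1613  # descriptor count stated in the abstract
N_LATENT = 64         # latent size stated in the abstract


def standardize(x, eps=1e-8):
    """Scale each column to zero mean and unit variance (assumed preprocessing)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def init_params(rng, n_in=N_DESCRIPTORS, n_latent=N_LATENT):
    """Small random weights for a one-layer encoder and a one-layer decoder."""
    return {
        "W_enc": rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_latent)),
        "b_enc": np.zeros(n_latent),
        "W_dec": rng.normal(0.0, 1.0 / np.sqrt(n_latent), size=(n_latent, n_in)),
        "b_dec": np.zeros(n_in),
    }


def encode(params, x):
    """Map descriptors to the dense latent representation."""
    return np.tanh(x @ params["W_enc"] + params["b_enc"])


def decode(params, z):
    """Reconstruct descriptors from the latent representation."""
    return z @ params["W_dec"] + params["b_dec"]


def train_step(params, x, lr=0.01):
    """One full-batch gradient step; returns the loss measured before the update."""
    z = encode(params, x)
    err = decode(params, z) - x
    loss = float(np.mean(err ** 2))

    # Manual backpropagation of the mean squared reconstruction error.
    g_out = 2.0 * err / err.size
    g_pre = (g_out @ params["W_dec"].T) * (1.0 - z ** 2)  # tanh derivative

    params["W_dec"] -= lr * (z.T @ g_out)
    params["b_dec"] -= lr * g_out.sum(axis=0)
    params["W_enc"] -= lr * (x.T @ g_pre)
    params["b_enc"] -= lr * g_pre.sum(axis=0)
    return loss


# Placeholder data: random numbers standing in for a real descriptor matrix.
rng = np.random.default_rng(0)
descriptors = standardize(rng.normal(size=(256, N_DESCRIPTORS)))

params = init_params(rng)
for _ in range(5):
    train_step(params, descriptors)

latent = encode(params, descriptors)
print(latent.shape)  # one 64-feature vector per molecule
```

Training is self-supervised in the sense that the reconstruction target is the input itself, so no property labels are needed. In a setup like this, the `latent` array would then serve as the feature matrix for a small downstream model fitted on a benchmark's labeled molecules.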