Using Natural Language Processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters

Pang, Jiayun (ORCID: https://orcid.org/0000-0003-0689-8440), Pine, Alexander and Sulemana, Abdulai (2023) Using Natural Language Processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters. Digital Discovery. ISSN 2635-098X (Online) (doi:10.1039/d3dd00119a)

PDF (Publisher VoR)
45041_PANG_Using_Natural_Language_Processing_NLP_Inspired_Molecular_Embedding_Approach.pdf - Published Version

Abstract

Hansen solubility parameters (HSPs) have three components, δd, δp and δh, which account for the dispersion forces, polar forces and hydrogen bonding of a molecule, and were designed to better understand how molecular structure affects miscibility/solubility. HSPs are widely used throughout the pharmaceutical research pipeline, yet have not been studied computationally as thoroughly as aqueous solubility. In the current study, we predicted HSPs using only the SMILES of molecules, utilising a molecular embedding approach inspired by Natural Language Processing (NLP). Two pre-trained deep learning models, Mol2Vec and ChemBERTa, were used to derive the embeddings. A dataset of ∼1200 organic molecules with experimentally determined HSPs was used as the labelled dataset. Upon finetuning, the ChemBERTa model “learned” relevant molecular features and shifted attention to functional groups that give rise to the relevant HSPs. The finetuned ChemBERTa model outperforms both the Mol2Vec model and the baseline Morgan fingerprint method, albeit not to a significant extent. Interestingly, the embedding models can predict δd significantly better than δh and δp and, overall, the accuracy of the predicted HSPs is lower than that of the well-benchmarked ESOL aqueous solubility. Our study indicates that the extent of transfer learning leveraged from the pre-trained models is related to the labelled molecular properties. It also highlights how δp and δh may have large intrinsic errors in the way they are defined, which introduces inherent limitations to their accurate prediction using machine learning models. Our work reveals several interesting findings that will help explore the potential of BERT-based models for molecular property prediction. It may also guide the possible refinement of the Hansen solubility framework, which would have a wide impact across the pharmaceutical industry and research.
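
For readers unfamiliar with how the three components are typically combined, the sketch below illustrates the standard Hansen relations: the total solubility parameter, δt² = δd² + δp² + δh², and the Hansen distance between two substances, Ra² = 4(δd1 − δd2)² + (δp1 − δp2)² + (δh1 − δh2)². It is not taken from the paper and is not the authors' prediction method. The function names and the numeric values are hypothetical and chosen only for illustration; parameters are assumed to be in MPa^0.5.

```python
import math


def total_solubility_parameter(d, p, h):
    """Combine the three Hansen components into the total parameter.

    Uses the standard relation delta_t^2 = delta_d^2 + delta_p^2 + delta_h^2.
    """
    return math.sqrt(d * d + p * p + h * h)


def hansen_distance(a, b):
    """Hansen distance Ra between two (delta_d, delta_p, delta_h) triples.

    The dispersion term carries the conventional factor of 4.
    """
    d1, p1, h1 = a
    d2, p2, h2 = b
    return math.sqrt(4 * (d1 - d2) ** 2 + (p1 - p2) ** 2 + (h1 - h2) ** 2)


# Hypothetical values for illustration only (not measured data).
solute = (18.0, 3.0, 4.0)
solvent = (16.0, 7.0, 7.0)

print(total_solubility_parameter(*solute))
print(hansen_distance(solute, solvent))
```

A smaller Ra is conventionally read as a sign that two substances are more likely to be miscible, which is why errors in the predicted δp and δh carry through to any solubility judgement built on them.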

Item Type: Article
Uncontrolled Keywords: Hansen solubility parameter; NLP; molecular embedding; deep learning; ESOL solubility parameter; SMILES; transfer learning; finetuning; Hugging Face
Subjects: Q Science > Q Science (General)
Q Science > QD Chemistry
Faculty / School / Research Centre / Research Group: Faculty of Engineering & Science
Faculty of Engineering & Science > School of Science (SCI)
Last Modified: 07 Feb 2024 10:18
URI: http://gala.gre.ac.uk/id/eprint/45041
