A unified ML framework for solubility prediction across organic solvents. (8th February 2023)
- Record Type:
- Journal Article
- Title:
- A unified ML framework for solubility prediction across organic solvents. (8th February 2023)
- Main Title:
- A unified ML framework for solubility prediction across organic solvents
- Authors:
- Vassileiou, Antony D.
Robertson, Murray N.
Wareham, Bruce G.
Soundaranathan, Mithushan
Ottoboni, Sara
Florence, Alastair J.
Hartwig, Thoralf
Johnston, Blair F.
- Abstract:
- A generic framework for enhancing an initial solubility prediction with ML, even with simple methods and a modestly sized, sparse dataset. We dissect the setup to show the model "locking on" to the target system as more data are made available.
We report a single machine learning (ML)-based model to predict the solubility of drug/drug-like compounds across 49 organic solvents, extensible to more. By adopting a cross-solvent data structure, we enable the exploitation of valuable relational information between systems. The effect is major, with even a single experimental measurement of a solute in a different solvent being enough to significantly improve predictions on it, and successive ones improving them further. Working with a sparse dataset of only 714 experimental data points spanning 75 solutes and 49 solvents (81% sparsity), a ML-based model with a prediction RMSE of 0.75 log S (g/100 g) for unseen solutes was produced. This compares favourably with conductor-like screening model for real solvents (COSMO-RS), an industry-standard model based on thermodynamic laws, which yielded a prediction RMSE of 0.97 for the same dataset. The error for our method reduced to a mean RMSE of 0.65 when one instance of the solute (in a different solvent) was included in the training data; this iteratively reduced further to 0.60, 0.57 and 0.56 when two, three and four instances were available, respectively. This standard of performance not only meets or exceeds those of alternative ML-based solubility models insofar as they can be compared but reaches the perceived ceiling for solubility prediction models of this type. In parallel, we assess the performance of the model with and without the addition of COSMO-RS output as an additional descriptor. We find that a significant benefit is gained from its addition, indicating that mechanistic methods can bring insight that simple molecular descriptors cannot and should be incorporated into a data-driven prediction of molecular properties where possible.
- Is Part Of:
- Digital discovery. Volume 2:Number 2(2023)
- Journal:
- Digital discovery
- Issue:
- Volume 2:Number 2(2023)
- Issue Display:
- Volume 2, Issue 2 (2023)
- Year:
- 2023
- Volume:
- 2
- Issue:
- 2
- Issue Sort Value:
- 2023-0002-0002-0000
- Page Start:
- 356
- Page End:
- 367
- Publication Date:
- 2023-02-08
- Subjects:
- Chemistry -- Data processing -- Periodicals
Medical sciences -- Data processing -- Periodicals
Machine learning -- Periodicals
542.85
- Journal URLs:
- https://www.rsc.org/journals-books-databases/about-journals/digital-discovery/
http://www.rsc.org/
- DOI:
- 10.1039/d2dd00024e
- Languages:
- English
- ISSNs:
- 2635-098X
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 26931.xml