A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing. (14th October 2022)
- Record Type:
- Journal Article
- Title:
- A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing. (14th October 2022)
- Main Title:
- A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing
- Authors:
- Winter, Benedikt
Winter, Clemens
Schilling, Johannes
Bardow, André
- Abstract:
- SPT is a natural language processing model that predicts limiting activity coefficients from SMILES. High accuracy is achieved by pre-training the model on millions of synthetic data points and fine-tuning the model on limited experimental data.
The knowledge of mixtures' phase equilibria is crucial in nature and technical chemistry. Phase equilibria calculations of mixtures require activity coefficients. However, experimental data on activity coefficients are often limited due to the high cost of experiments. For an accurate and efficient prediction of activity coefficients, machine learning approaches have been recently developed. However, current machine learning approaches still extrapolate poorly for activity coefficients of unknown molecules. In this work, we introduce a SMILES-to-properties-transformer (SPT), a natural language processing network, to predict binary limiting activity coefficients from SMILES codes. To overcome the limitations of available experimental data, we initially train our network on a large dataset of synthetic data sampled from COSMO-RS (10 million data points) and then fine-tune the model on experimental data (20 870 data points). This training strategy enables the SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient predictions such as COSMO-RS and UNIFAC (Dortmund), and improving on recent machine learning approaches.
- Is Part Of:
- Digital discovery. Volume 1:Number 6(2022)
- Journal:
- Digital discovery
- Issue:
- Volume 1:Number 6(2022)
- Issue Display:
- Volume 1, Issue 6 (2022)
- Year:
- 2022
- Volume:
- 1
- Issue:
- 6
- Issue Sort Value:
- 2022-0001-0006-0000
- Page Start:
- 859
- Page End:
- 869
- Publication Date:
- 2022-10-14
- Subjects:
- Chemistry -- Data processing -- Periodicals
Medical sciences -- Data processing -- Periodicals
Machine learning -- Periodicals
542.85
- Journal URLs:
- https://www.rsc.org/journals-books-databases/about-journals/digital-discovery/
http://www.rsc.org/
- DOI:
- 10.1039/d2dd00058j
- Languages:
- English
- ISSNs:
- 2635-098X
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 24610.xml