A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties. (1st August 2022)
- Record Type:
- Journal Article
- Title:
- A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties. (1st August 2022)
- Main Title:
- A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties
- Authors:
- Comesana, Ana E.
Huntington, Tyler T.
Scown, Corinne D.
Niemeyer, Kyle E.
Rapp, Vi H.
- Abstract:
- Abstract: Machine learning has proven to be a powerful tool for accelerating biofuel development. Although numerous models are available to predict a range of properties using chemical descriptors, there is a trade-off between interpretability and performance. Neural networks provide predictive models with high accuracy at the expense of some interpretability, while simpler models such as linear regression often lack in accuracy. In addition to model architecture, feature selection is also critical for developing interpretable and accurate predictive models. We present a method for systematically selecting molecular descriptor features and developing interpretable machine learning models without sacrificing accuracy. Our method simplifies the process of selecting features by reducing feature multicollinearity and enables discoveries of new relationships between global properties and molecular descriptors. To demonstrate our approach, we developed models for predicting melting point, boiling point, flash point, yield sooting index, and net heat of combustion with the help of the Tree-based Pipeline Optimization Tool (TPOT). For training, we used publicly available experimental data for up to 8351 molecules. Our models accurately predict various molecular properties for organic molecules (mean absolute percent error (MAPE) ranges from 3.3% to 10.5%) and provide a set of features that are well-correlated to the property. This method enables researchers to explore sets of features that significantly contribute to the prediction of the property, offering new scientific insights. To help accelerate early stage biofuel research and development, we also integrated the data and models into an open-source, interactive web tool.
Highlights:
Developed method for selecting chemical descriptors and minimizing collinearity
Trained five property prediction models using diverse data sets
Models are interpretable and yield excellent performance
Feature importances are consistent and agree with previous research
Web tool available at feedstock-to-function.lbl.gov
- Is Part Of:
- Fuel. Volume 321(2022)
- Journal:
- Fuel
- Issue:
- Volume 321(2022)
- Issue Display:
- Volume 321, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 321
- Issue:
- 2022
- Issue Sort Value:
- 2022-0321-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-08-01
- Subjects:
- Chemical descriptors -- Machine learning -- Biofuel -- TPOT
Fuel -- Periodicals
Coal -- Periodicals
Coal
Fuel
Periodicals
662.6
- Journal URLs:
- http://www.sciencedirect.com/science/journal/latest/00162361
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.fuel.2022.123836
- Languages:
- English
- ISSNs:
- 0016-2361
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4048.000000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 21589.xml