Blend Tool
We developed our bio-jet fuel blend property prediction tool with supervised machine learning models that correlate each blend's ATR-FTIR (infrared) spectroscopy data with three key fuel properties: flash point, freezing point, and cetane number. The current models predict these properties for select high-interest bio-based molecules blended with the jet fuel F-24, and were built with open-source, Python-based machine learning packages. Specifically, we trained and tested multi-linear regression and tree-based models. The training and testing data comprised F-24 jet fuel blend data, gasoline surrogate blend data, and neat molecule data. We collected ATR-FTIR spectra of the jet fuel blends at additive-relevant concentrations of 10%, 20%, and 30% for use in training and validation. Generally, mid-range blend concentrations were placed in the testing set, while the corresponding lower- and upper-range blend concentrations and the neat molecules were placed in the training set.
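The concentration-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the `Sample` record, the molecule names, the choice of 20% as the "mid-range" concentration, and the use of 100% to mark a neat molecule are all hypothetical and not taken from the actual dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one measured spectrum."""
    molecule: str
    additive_pct: float  # assumed convention: 100.0 marks a neat molecule


def split_by_concentration(samples, test_pcts=(20.0,)):
    """Put mid-range blends in the test set and everything else in the train set."""
    train = [s for s in samples if s.additive_pct not in test_pcts]
    test = [s for s in samples if s.additive_pct in test_pcts]
    return train, test


# Placeholder molecules, each measured at 10%, 20%, 30%, and neat (100%).
samples = [
    Sample(name, pct)
    for name in ("molecule_A", "molecule_B")
    for pct in (10.0, 20.0, 30.0, 100.0)
]

train, test = split_by_concentration(samples)
print(len(train), "training samples;", len(test), "testing samples")
```

Splitting by concentration rather than at random means every test blend is bracketed by lower and higher concentrations of the same molecule in the training set, so the models are evaluated on interpolation within the measured range.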
For feature selection, we used LASSO, a regularized multi-linear regression algorithm, which systematically reduces the available independent variables to a much smaller set of the most important features and so lowers the likelihood of an overfit, over-complicated model. The regularization threshold that controls this reduction was tuned with 5-fold cross-validation. The features retained by the regularized model were then reduced further by consolidating highly correlated variables to eliminate redundancy. We tested consolidating variables with Spearman correlation coefficients greater than 0.95, 0.9, 0.8, and 0.7. Each reduced feature set was then used as input to various algorithms from the scikit-learn library, including both linear models and nonlinear models such as tree-based ensembles. For each property, we chose the model whose train and test performance were most similar, which indicates less overfitting. This routine encourages simplified models that retain the complexity needed to predict the fuel properties well.
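A minimal sketch of this two-stage feature reduction, on synthetic data, is shown below. It assumes that "consolidating" correlated variables means greedily keeping one feature from each highly correlated group; the actual consolidation rule is not specified here. The helper names `lasso_select` and `drop_correlated` are hypothetical, and the features are standardized before LASSO as a common (assumed) preprocessing step.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def lasso_select(X, y, cv=5):
    """Return indices of features with nonzero LASSO coefficients.

    The regularization strength is chosen by k-fold cross-validation.
    """
    X_scaled = StandardScaler().fit_transform(X)
    model = LassoCV(cv=cv, random_state=0).fit(X_scaled, y)
    return np.flatnonzero(model.coef_)


def drop_correlated(X, candidates, threshold=0.9):
    """Greedily keep a feature only if its |Spearman rho| with every
    already-kept feature is at or below the threshold (assumed rule)."""
    kept = []
    for j in candidates:
        if all(abs(spearmanr(X[:, j], X[:, k])[0]) <= threshold for k in kept):
            kept.append(j)
    return kept


# Synthetic stand-in for spectra: 60 samples, 20 "wavenumber" features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=60)  # near-duplicate of feature 0
y = 3.0 * X[:, 0] + 2.0 * X[:, 5] + 0.1 * rng.normal(size=60)

selected = lasso_select(X, y)
reduced = drop_correlated(X, selected, threshold=0.9)
print("after LASSO:", list(selected))
print("after correlation filter:", list(reduced))
```

Spearman correlation is rank-based, so the filter catches any monotonic relationship between two features, not only linear ones. Repeating the second step with thresholds of 0.95, 0.9, 0.8, and 0.7 yields the candidate feature sets described above.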
All three final models use features that were first reduced by LASSO regularization and then reduced further by consolidating features whose Spearman (monotonic) correlation exceeds a property-specific threshold. The flash point model was trained with the tree-based ensemble AdaBoost regression algorithm, with a correlation threshold of 0.9. The freezing point model was also trained with AdaBoost regression, with a correlation threshold of 0.95. The cetane number model was trained with LASSO, the regularized multi-linear regression algorithm, with a correlation threshold of 0.8.
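The selection of an algorithm per property by comparing train and test performance could look roughly like the sketch below. It runs on synthetic data, and the error metric (mean absolute error), the hyperparameters, and the helper name `train_test_errors` are assumptions made for illustration, not the settings used for the models above.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error


def train_test_errors(model, X_train, y_train, X_test, y_test):
    """Fit the model and return (train MAE, test MAE)."""
    model.fit(X_train, y_train)
    mae_train = mean_absolute_error(y_train, model.predict(X_train))
    mae_test = mean_absolute_error(y_test, model.predict(X_test))
    return mae_train, mae_test


# Synthetic stand-in for a reduced feature set and one fuel property.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 40.0 + 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=80)
X_train, X_test, y_train, y_test = X[:60], X[60:], y[:60], y[60:]

candidates = {
    "adaboost": AdaBoostRegressor(random_state=0),
    "lasso": Lasso(alpha=0.1),
}

results = {}
for name, model in candidates.items():
    mae_train, mae_test = train_test_errors(model, X_train, y_train, X_test, y_test)
    # A small gap between train and test error suggests less overfitting.
    results[name] = {
        "train_mae": mae_train,
        "test_mae": mae_test,
        "gap": abs(mae_test - mae_train),
    }
    print(f"{name}: train MAE={mae_train:.2f}, test MAE={mae_test:.2f}")
```

In practice the gap would be weighed together with the absolute test error, since a model can have matching but uniformly poor train and test scores.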