\(^{31}\)P NMR versus \(^{1}\)H NMR versus IR integration accuracy based on data set Wolfs.A
The obtained \(\mathrm{DS}_{\mathrm{1H}}\) values based on the described integration routine and the reported \(\mathrm{DS}_{\mathrm{IR,int}}\) values are compared to the reported \(\mathrm{DS}_{\mathrm{31P}}\) values of Wolfs.A. It is assumed that \(^{31}\)P NMR provides the most accurate \(\mathrm{DS}\) values. Please note that only data points with \(\mathrm{DS}>1.21\) can be considered due to the insolubility of the samples below this \(\mathrm{DS}\) value. \(^{1}\)H NMR data achieved a better \(\mathrm{MAE}\) of 0.032 (\(\mathrm{MRE}=1.53\,\%\)), while the previously reported IR integration showed an \(\mathrm{MAE}\) of 0.078 (\(\mathrm{MRE}=3.64\,\%\)). The results are visualized in Fig. 1.
To extend the study also to \(\mathrm{DS}\) values smaller than 1.21, all following investigations are performed with \(\mathrm{DS}_{\mathrm{1H}}\) values as ground truth, i.e. as training data. Based on Fig. 1 and considering that \(^{31}\)P NMR data is also susceptible to experimental errors, this assumption appears justified.

\(\mathrm{DS}_{\mathrm{31P}}\) values versus \(\mathrm{DS}_{\mathrm{1H}}\) and \(\mathrm{DS}_{\mathrm{IR,int}}\) values for the Wolfs.A data set. Only values with \(\mathrm{DS}>1.21\) are considered due to the insolubility of the samples below this value.

\(\mathrm{DS}_{\mathrm{1H}}\) vs. \(\mathrm{DS}_{\mathrm{IR,ML}}\) and \(\mathrm{DS}_{\mathrm{IR,int}}\) values for the Wolfs.A data set. The mean values and corresponding standard deviations of all test data points from N repetitions and k folds are shown.
ML prediction accuracy on data set Wolfs.A
Baseline accuracy
A repeated k-fold cross validation (CV, see methods section) was performed with randomized data assignment between repetitions on the \(\mathrm{DS}_{\mathrm{1H}}\) data of Wolfs.A. A multiple linear regression (MLR) model is used with \(N=1000\) repetitions and \(k=8\) folds, meaning that roughly \(12.5\,\%\) of the data points were reserved for testing in each iteration. The results are shown in Fig. 2. A \(\mathrm{MAE}_{\mathrm{IR,ML}}\) of 0.069 (\(\mathrm{MRE}_{\mathrm{IR,ML}}=5.73\,\%\)) is obtained. Additionally, the \(\mathrm{DS}_{\mathrm{IR,int}}\) values are plotted against the \(\mathrm{DS}_{\mathrm{1H}}\) values. A \(\mathrm{MAE}_{\mathrm{IR,int}}\) of 0.083 (\(\mathrm{MRE}_{\mathrm{IR,int}}=5.34\,\%\)) is obtained. Hence, the ML evaluation achieved a slightly smaller \(\mathrm{MAE}\) and a slightly larger \(\mathrm{MRE}\) than the manual IR integration method, although both evaluation methods are accurate and achieve \(\mathrm{MAE}<0.1\). This immediately demonstrates that a simple linear regression model is able to make good predictions of the \(\mathrm{DS}\) based on provided raw IR spectra. It should be highlighted that only the prediction accuracy on the test data, i.e. unseen data that was not used in training, is shown, while for the IR integration data, no test/train split was performed. It is apparent that the integration method consistently overpredicts the \(\mathrm{DS}\) values, while the ML evaluation appears less biased. The opposing behavior in \(\mathrm{MAE}\) and \(\mathrm{MRE}\) can be explained by the large prediction error of the ML model at the relatively small \(\mathrm{DS}_{\mathrm{1H}}\) value of 0.70, which results in a larger relative error.
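The baseline procedure can be sketched with scikit-learn. The data below is a random stand-in (the real inputs would be the normalized IR spectra and the \(\mathrm{DS}_{\mathrm{1H}}\) reference values), and the repetition count is reduced to keep the sketch fast:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(0)
# Random stand-ins: rows are IR spectra (absorbance per wavenumber),
# y holds the DS_1H reference values.
X = rng.normal(size=(40, 1750))
y = rng.uniform(0.5, 3.0, size=40)

# k = 8 folds -> roughly 12.5 % of the points are held out in each split.
# The paper uses N = 1000 repetitions; 10 keep this sketch fast.
cv = RepeatedKFold(n_splits=8, n_repeats=10, random_state=0)
maes = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print(f"mean MAE over all folds: {np.mean(maes):.3f}")
```

Only the held-out fold contributes to each MAE, mirroring the test-data-only reporting described above.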
Influence of k-fold
To investigate the influence of the amount of training data, the parameter k was varied within \(k\in \{2,\ldots,16\}\). These values correspond to \(\{50\,\%,\ldots,6.25\,\%\}\) of the data being used for testing, i.e. higher k values result in less data reserved for the test set and more data being used for training. The results are given in Fig. 3 as a box plot. The box spans from the first (Q1) to the third quartile (Q3) and contains the central \(50\,\%\) of the data, while the whiskers extend to the farthest data point lying within 1.5 times the inter-quartile range (IQR). The median value is indicated by a horizontal line and the mean value by a point inside the box.
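The box-plot statistics described above can be computed directly; the fold MAEs below are invented toy values for illustration:

```python
import numpy as np

# Toy fold MAEs (invented numbers) to illustrate the box-plot statistics
maes = np.array([0.05, 0.06, 0.07, 0.08, 0.09, 0.11, 0.06, 0.30])

q1, q3 = np.percentile(maes, [25, 75])   # box edges, central 50 % of the data
iqr = q3 - q1                            # inter-quartile range
# Whiskers: farthest data points still within 1.5 * IQR of the box edges
lower_whisker = maes[maes >= q1 - 1.5 * iqr].min()
upper_whisker = maes[maes <= q3 + 1.5 * iqr].max()
median, mean = np.median(maes), maes.mean()
print(q1, q3, lower_whisker, upper_whisker)  # 0.30 falls outside the whiskers
```

The value 0.30 lies beyond Q3 + 1.5 IQR and would be drawn as an outlier point rather than covered by the whisker.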
Increasing k, i.e. increasing the amount of data used for training versus testing, lowers both the mean and the median \(\mathrm{MAE}\) across all \(N\times k\) folds. However, even for \(k=2\), i.e. with only \(50\,\%\) of the data used for training, an \(\mathrm{MAE}<0.1\) is achieved in the majority of folds. This shows that the linear regression model is robust with respect to the required amount of training data and is not prone to overfitting. The span of the whiskers is quite large, which means that the \(\mathrm{MAE}\) exceeds 0.1 in some folds and predictions are less accurate than with the integration-based method. However, these cases can be traced back to extreme test/train splits, e.g. training only on small \(\mathrm{DS}\) values while predicting only large ones. Please note that this is only due to the applied cross validation with randomized test/train splits and will not be relevant for production models, i.e. routine evaluation. Here, either all or sensibly selected data is used for training, effectively eliminating these extreme cases.
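The k-variation study can be reproduced in the same manner; random stand-in data and fewer repetitions are used here, so the absolute numbers are not those reported above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 1750))   # stand-in IR spectra
y = rng.uniform(0.5, 3.0, 40)     # stand-in DS values

# Higher k -> smaller test share (1/k) -> more data used for training.
mean_mae = {}
for k in (2, 4, 8, 16):
    cv = RepeatedKFold(n_splits=k, n_repeats=5, random_state=0)
    scores = -cross_val_score(LinearRegression(), X, y,
                              scoring="neg_mean_absolute_error", cv=cv)
    mean_mae[k] = scores.mean()
    print(f"k={k:2d}  test share={1/k:.2%}  mean MAE={mean_mae[k]:.3f}")
```

The distribution of fold-wise MAEs for each k is what the box plot in Fig. 3 summarizes.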

\(\mathrm{MAE}_{\mathrm{IR,ML}}\) for varying k during k-fold CV. \(k\in \{2,4,8,16\}\) corresponds to \(50\,\%\), \(25\,\%\), \(12.5\,\%\) and \(6.25\,\%\) of the data being used for testing.
Influence of wavenumber ranges and feature selection
To evaluate if and how different parts of the spectrum hold more or less information relevant for model training, model training was performed on specific wavenumber ranges, while any data outside the respective range was omitted. Fig. 4c visualizes the investigated ranges together with an exemplary spectrum: a non-fingerprint region is defined for \(\nu >1500\,\mathrm{cm}^{-1}\) and a fingerprint region for \(\nu <1500\,\mathrm{cm}^{-1}\). Additionally, peak-specific ranges for C=O (\(1600\,\mathrm{cm}^{-1}<\nu <1800\,\mathrm{cm}^{-1}\)), C–H (\(1325\,\mathrm{cm}^{-1}<\nu <1425\,\mathrm{cm}^{-1}\)), C–O (\(1150\,\mathrm{cm}^{-1}<\nu <1300\,\mathrm{cm}^{-1}\)) and a combined C=O, C–H, C–O range (\(1150\,\mathrm{cm}^{-1}<\nu <1800\,\mathrm{cm}^{-1}\)) are defined. For reference, both the full range and a randomly selected area with no apparent information (\(2000\,\mathrm{cm}^{-1}<\nu <2500\,\mathrm{cm}^{-1}\)) are investigated. For each area, a repeated CV was performed with \(N=1000\) and \(k=8\). The resulting \(\mathrm{MAE}\) values are given as a box plot in Fig. 4a and compared to the integration-based result from above, indicated by a horizontal line.
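Restricting training to a wavenumber range amounts to masking spectral columns. A minimal sketch, assuming a hypothetical evenly spaced wavenumber axis of 1750 points (the actual instrument axis may differ):

```python
import numpy as np

# Hypothetical wavenumber axis; the real axis comes from the spectrometer.
wavenumbers = np.linspace(400, 4000, 1750)

# Investigated ranges in cm^-1, as defined in the text
ranges = {
    "C=O":             (1600, 1800),
    "C-H":             (1325, 1425),
    "C-O":             (1150, 1300),
    "combined":        (1150, 1800),
    "fingerprint":     (400, 1500),
    "non-fingerprint": (1500, 4000),
    "random":          (2000, 2500),
}

def restrict(X, name):
    """Keep only the spectral columns whose wavenumber lies in the named range."""
    lo, hi = ranges[name]
    mask = (wavenumbers >= lo) & (wavenumbers <= hi)
    return X[:, mask]
```

Each restricted matrix is then fed into the same repeated CV as before.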
As expected, the randomly selected area was not able to yield reliable predictions of the \(\mathrm{DS}\). Similarly, predicting the \(\mathrm{DS}\) solely based on the C=O or C–O peaks produced large deviations and was less accurate than the integration-based method. The best predictions were obtained for the full and the C–H range, followed by the fingerprint region. These results are surprising in the sense that Wolfs et al.27 reported the highest accuracy for evaluation of the C=O peak, followed by the C–O and lastly the C–H peak. These differences can be attributed to the different model architecture used, with the C–H peak being more suited to a linear regression. Fig. 4a highlights the counter-intuitive fact that manually selecting specific wavenumber ranges, a routinely applied procedure in academia and industry, does not necessarily improve model accuracy. In fact, it might lead to drastically worse predictions.

(a) Box plots of \(\mathrm{MAE}_{\mathrm{IR,ML}}\) for the varying wavenumber ranges visualized in (c). (b) Box plots of \(\mathrm{MAE}_{\mathrm{IR,ML}}\) for varying n of the n-best feature selection. The optimum value (\(n=250\)) is colored in red and the selected wavenumbers are visualized in (c).
However, the search for the region that contains the most information and yields the best predictions can also be performed automatically via the n-best feature selection algorithm described in the methods section. A parameter study was performed varying n within \(n\in \left[ 50,1750\right]\) in steps of 100, performing a repeated CV with \(N=1000\) and \(k=8\) for each value. The resulting \(\mathrm{MAE}\) values are shown as box plots in Fig. 4b. The full data set contains 1750 features and is shown on the far right. Reducing the number of features reduces both the \(\mathrm{MAE}\) and its spread down to \(n=950\), while further reduction initially increases the prediction error. Reducing n further yields a global minimum at \(n_{\mathrm{opt}}=250\), where \(\mathrm{MAE}_{\mathrm{opt}}=0.052\) is achieved. This represents a significant increase in prediction accuracy compared to the integration-based evaluation method. Increasing prediction accuracy by limiting the number of features, i.e. the available information, seems counter-intuitive; however, this can be explained by a reduced amount of overfitting. The initially present 1750 features contain plenty of wavenumbers that do not store information on the \(\mathrm{DS}\). Excluding these from training makes the resulting models more robust and more accurate on unseen data. Figuratively speaking, the model is forced to learn the underlying relationships rather than experiment-specific random deviations. Fig. 4c shows the selected wavenumbers as scattered points at \(T=0\). Besides the expected peaks for C=O, C–H and C–O, the model also evaluates the ranges \(3250\,\mathrm{cm}^{-1}<\nu <3500\,\mathrm{cm}^{-1}\) and \(2850\,\mathrm{cm}^{-1}<\nu <2950\,\mathrm{cm}^{-1}\) and, most surprisingly, the range \(500\,\mathrm{cm}^{-1}<\nu <650\,\mathrm{cm}^{-1}\) of the fingerprint region. For conventional, i.e.
manual ATR-FTIR characterization of organic and polymeric compounds (including cellulose derivatives), especially vibrational bands at higher wavenumbers (\(\ge 1500\,\mathrm{cm}^{-1}\)) outside the fingerprint region are essential. In particular, the fingerprint region is not frequently examined in data evaluation due to its complexity. The ML approach thus seems to provide benefits for data analysis that a researcher would not easily identify. Hence, not only did the n-best feature selection increase the overall prediction accuracy, it simultaneously returned relevant information about where information on the \(\mathrm{DS}\) is stored in the IR spectrum. It should be noted that only raw IR data was used for model training and the feature selection algorithm identified the chemically relevant areas without additional mechanistic information or guidance.
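An n-best selection step of this kind can be sketched with scikit-learn's `SelectKBest`; whether the scoring function used in the study matches `f_regression` is an assumption, and the data below is again a random stand-in:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 1750))   # stand-in IR spectra
y = rng.uniform(0.5, 3.0, 40)     # stand-in DS_1H values

# Keep the n wavenumbers scoring highest against the DS target,
# then fit the MLR on the reduced feature set (n = 250 is the optimum above).
model = make_pipeline(SelectKBest(f_regression, k=250), LinearRegression())
model.fit(X, y)

# Indices of the retained wavenumbers
selected = model.named_steps["selectkbest"].get_support(indices=True)
print(len(selected))
```

Mapping the retained indices back onto the wavenumber axis reproduces a scatter like the one in Fig. 4c.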
Application to other data sets
To assess the applicability of the ML evaluation method, i.e. the extrapolation to other data sets, two linear regression models were trained on the entire Wolfs.A data set. No cross validation is required as all points are included in training. One model is trained on the optimum number of features found in the previous section (\(n=250\)) and one is trained on the entire range of wavenumbers. Both models are used to predict \(\mathrm{DS}_{\mathrm{IR,ML}}\) for samples contained in the Wolfs.B, Sehn.A, Sehn.B and Sehn.C data sets, which are compared to the respective \(\mathrm{DS}_{\mathrm{1H}}\) values. Fig. 5a shows the results for the model with feature selection and Fig. 5b for the one without. The resulting \(\mathrm{MAE}\) and \(\mathrm{MRE}\) values are given in Tab. 2.
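The extrapolation experiment can be sketched as follows; synthetic stand-ins replace the Wolfs.A training set and the external test sets, and the `f_regression` scoring is an assumption:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
# Stand-ins for the Wolfs.A training set and one external test set
X_train, y_train = rng.normal(size=(40, 1750)), rng.uniform(0.5, 3.0, 40)
X_other, y_other = rng.normal(size=(15, 1750)), rng.uniform(0.5, 3.0, 15)

# No CV here: both models are trained on the full training set, as in the text.
full_model = LinearRegression().fit(X_train, y_train)
reduced_model = make_pipeline(
    SelectKBest(f_regression, k=250), LinearRegression()
).fit(X_train, y_train)

for name, m in [("full range", full_model), ("n = 250", reduced_model)]:
    print(f"{name}: MAE = {mean_absolute_error(y_other, m.predict(X_other)):.3f}")
```

Repeating the prediction for each external data set yields the per-set MAE and MRE values of Tab. 2.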

\(\mathrm{DS}_{\mathrm{1H}}\) versus \(\mathrm{DS}_{\mathrm{IR,ML}}\) values for the Wolfs.B, Sehn.A, Sehn.B and Sehn.C data sets with applied n-best feature selection (\(n=250\)) (a) and without feature selection (b). All models were trained on the full Wolfs.A data set. (c) Exemplary IR spectra at \(\mathrm{DS}_{\mathrm{1H}}\approx 2\).
Both Wolfs.B and Sehn.A are predicted with high accuracy, demonstrating that the MLR model is robust and can be applied to IR data of different experimenters (Wolfs.B) and even different synthesis routes (Sehn.A). Additionally, the selected features seem to be universal for CA, as using the optimized model with feature selection results in overall lower errors. These results show that routine evaluations of the \(\mathrm{DS}\) of CAs can be performed without the need for laborious and time-consuming \(^{1}\)H or even \(^{31}\)P NMR measurements and without the need for manual peak integration and evaluation of IR data as was done by Wolfs et al.27. It should be noted that all IR data was measured on the same instrument and future studies should include the influence of instrument-specific deviations on the robustness of the evaluation. However, the normalization applied in Eq. 1 alleviates this effect as long as the relative intensities between wavenumbers stay similar.
Applying the trained model to Sehn.B and Sehn.C, i.e. to cellulose esters other than CA, should be regarded as a large extrapolation task, as no IR data of different CEs was included during training. However, Fig. 5b shows that when the MLR model has access to all wavenumbers, it is indeed capable of predicting the \(\mathrm{DS}\) to an accuracy of approximately 0.1–0.2, with only the cellulose hexanoate of Sehn.B being predicted poorly (\(\Delta \mathrm{DS}=0.62\)). This shows that, to some degree, the \(\mathrm{DS}\) influences the IR spectra of different CEs quite similarly and that a machine learning model is capable of learning these effects. However, the model with feature selection shown in Fig. 5a is unable to predict other esters, with the \(\mathrm{MAE}\) even exceeding 1 for Sehn.B. To discuss this drastically different behavior, exemplary IR spectra of all data sets are visualized in Fig. 5c. The samples closest to \(\mathrm{DS}_{\mathrm{1H}}=2\) were selected, meaning that for the linear model to yield similar predictions, all curves need to be similar. Again, the selected wavenumbers (\(n=250\)) are indicated with scattered points. In general, all spectra are similar, especially in the expected ranges for the C=O, C–H and C–O peaks, but also at \(\nu \approx 3500\,\mathrm{cm}^{-1}\). However, both the peak at \(\nu \approx 2900\,\mathrm{cm}^{-1}\) and the fingerprint region differ quite strongly between the different CEs. As these regions contain many of the selected wavenumbers, it is not surprising that the reduced model has difficulties in predicting the \(\mathrm{DS}\). Without feature selection, these areas carry less weight and prediction accuracy for these different CEs increases.
It should be noted that the objective of the trained model was not to predict the \(\mathrm{DS}\) of CEs other than CA and that it lacked the required data to do so accurately. Nevertheless, Fig. 5c suggests that this may indeed be possible with enough relevant training data. Two adjusted model structures can be conceptualized: First, a combined model with two output parameters could be trained, one being the \(\mathrm{DS}\) and one being e.g. the chain length of the ester or the SMILES string of the substituent. Secondly, it is sensible to train multiple ester-specific models for prediction of the \(\mathrm{DS}\) and combine them with a superordinate classification model, which predicts the type of ester and hence which model to apply.
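The second conceptualized structure, a superordinate classifier routing spectra to ester-specific regressors, might look as follows; everything here (ester labels, data, model choices) is a hypothetical sketch, not part of the study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
esters = ["acetate", "hexanoate", "benzoate"]   # hypothetical ester classes
X = rng.normal(size=(60, 1750))                 # stand-in spectra
labels = rng.choice(esters, 60)                 # stand-in ester labels
y = rng.uniform(0.5, 3.0, 60)                   # stand-in DS values

# Superordinate classifier: which ester does a spectrum belong to?
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# One ester-specific DS regressor per class
regressors = {
    e: LinearRegression().fit(X[labels == e], y[labels == e]) for e in esters
}

def predict_ds(spectrum):
    """Route a spectrum to the regressor of its predicted ester type."""
    ester = clf.predict(spectrum.reshape(1, -1))[0]
    return ester, regressors[ester].predict(spectrum.reshape(1, -1))[0]
```

With enough labeled spectra per ester, this two-stage layout keeps each regressor a simple, interpretable MLR.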
A note on model architecture
This work did not apply model architectures other than the MLR, although in the contemporary machine learning trend, neural networks (NN) are often portrayed as somehow superior. It is important to stress that NNs are only one of many model architectures and that all have specific benefits and downsides. The MLR achieved a prediction accuracy close to the \(^{1}\)H-based evaluation, especially after feature selection. Considering that \(\mathrm{DS}_{\mathrm{1H}}\) values also contain experimental errors, it is argued that a NN has little room for improvement, and even if slightly lower numbers were obtained, this might not represent an improvement with respect to the true \(\mathrm{DS}\). Therefore, if multiple models do not differ significantly in accuracy, one should always choose the one with fewer hyper-parameters. For a NN to perform well, the number of layers, the number of nodes in each layer and the activation functions have to be chosen correctly, i.e. optimized with respect to the training data. Otherwise, NNs are highly prone to overfitting and are known for their poor extrapolation capabilities. A previous study showed how this can be done directly during training47; however, the increased work and complexity did not seem justified in this context. Additionally, a NN is a black box without any apparent meaning to the individual weights, while the magnitude of each specific weight of the MLR herein directly corresponds to the relative importance of a specific wavenumber.
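The interpretability argument can be illustrated directly: a fitted MLR exposes one coefficient per wavenumber, and ranking their magnitudes gives the relative importance referred to above (random stand-in data again):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 1750))   # stand-in IR spectra
y = rng.uniform(0.5, 3.0, 40)     # stand-in DS values

model = LinearRegression().fit(X, y)

# One weight per wavenumber: its magnitude reflects how strongly that
# wavenumber contributes to the predicted DS.
importance = np.abs(model.coef_)
top10 = np.argsort(importance)[::-1][:10]
print("most influential wavenumber indices:", top10)
```

No such direct per-feature reading exists for the weights of a trained neural network.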