Evaluation of classical machine learning models built from public EEG datasets
We trained seizure detection models with different classical machine learning approaches, using each of the eight feature subsets individually as well as the combination of all features. Figure 1A shows a ROC curve comparing the sensitivity and false positive rate of the different algorithms trained with the combination of all features and default parameters. Using the AUROC as the metric, the best performing model was the Random Forest Classifier (RFC) with a value of 0.972, followed by the Support Vector Machine (SVM) with 0.934 and K-Nearest Neighbors (KNN) with 0.924. The other algorithms obtained values below 0.9, with the Naive Bayes Classifier (NBC) being the lowest at 0.729.
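A minimal sketch of this comparison is shown below, assuming a precomputed feature matrix X (one row per EEG window) and binary window labels y (1 = seizure). The variable names, the train/test split and the default-parameter settings are illustrative, not the exact pipeline used here.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier

# X: (n_windows, n_features) feature matrix, y: binary labels (assumed precomputed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

classifiers = {
    "RFC": RandomForestClassifier(),
    "SVM": SVC(probability=True),
    "KNN": KNeighborsClassifier(),
    "NBC": GaussianNB(),
    "DTC": DecisionTreeClassifier(),
    "SGD": SGDClassifier(loss="log_loss"),  # log loss so predict_proba is available
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    print(name, "AUROC =", round(roc_auc_score(y_test, scores), 3))
```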

(A) ROC curve for models trained with different algorithms and the union of all features (All). SGD: stochastic gradient descent. SVM: support vector machine. RFC: random forest classifier. DTC: decision tree classifier. KNN: k-nearest neighbors. NBC: naive Bayes classifier. (B) F-score for each algorithm trained using only one particular group of features, compared to the F-score using the union of all features (last group of bars). (C, D) ROC curve and F-score values for the random forest classifier trained using only one particular group of features, with parameters optimized by grid search.
Figure 1B shows the best F1-score obtained by each model on the test data. The results of the models combining all the features (All) are consistent with those obtained in the ROC curve. Additionally, for all groups of features, the models trained with RFC were the best performing, with F-scores above 0.83, the only exception being the Coherence feature, for which the best performing model was trained with SVM and reached an F-score of 0.846 on the test data. In general, the other machine learning algorithms produced varying results depending on the feature group used. However, NBC was consistently the worst performer, with an F-score varying between 0.517 and 0.657. Overall, the best models trained with default parameters were the RFC with the 'All' feature set, with an F-score of 0.917 on the test data, followed by the RFC with the Wavelets features with a value of 0.916 and the RFC with the PBand features with an F-score of 0.913.
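One common way to obtain a "best" F1-score per model is to sweep the decision threshold over the predicted seizure probabilities and keep the maximum; the sketch below illustrates that idea and is not necessarily the exact procedure used here.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1(y_true, y_scores):
    """Return the maximum F1 over all decision thresholds and the threshold reaching it."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = np.argmax(f1)
    return f1[best], thresholds[min(best, len(thresholds) - 1)]

# Example (hypothetical variables): scores from a random forest trained on one feature group
# f1, thr = best_f1(y_test, rfc.predict_proba(X_test)[:, 1])
```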
A grid search was executed on the RFC models to identify parameter values that improve their performance. Figure 1C shows the ROC curves on the test data after the grid search. As expected, the AUROC values improved for all models, reaching values above 0.91 for all calculated features. The best models under this metric were those trained with all features (All), with the Wavelets features, and with PBand, with very similar values of 0.975, 0.972 and 0.971, respectively. The next best was TimeD with 0.957. The other models also performed well, but with an AUROC below 0.93. Figure 1D shows the corresponding F1-score values, all of them higher than 0.82. Consistent with the ROC analysis, the models exceeding an F-score of 0.9 were those built from the combination of all features (0.918), using only Wavelets (0.917) and using only power bands (0.916). For all models, the best parameter values were 1 for min_samples_leaf, 2 for min_samples_split and 300 for n_estimators. As for the criterion parameter, it was entropy for the All and PBand features and gini for Wavelets. These were the three best performing models from the machine learning experiments and were chosen for further use on the clinical data from HOMI.
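A sketch of this grid search over the random forest hyperparameters named above (n_estimators, min_samples_split, min_samples_leaf and criterion) is shown below; the candidate values in the grid are illustrative except for those reported in the text, and X_train/y_train are the assumed training data from the previous sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",   # AUROC was the main selection metric
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)  # e.g. {'criterion': 'entropy', 'min_samples_leaf': 1, ...}
```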
Evaluation of retrained deep learning models with public EEG datasets
We reproduced and retrained different neural network architectures from the literature to build deep learning models for seizure classification using the raw signals from public databases. Figure 2A shows the ROC curve for each model, comparing the annotated windows with the gold standard annotations in the test dataset. All models obtained an AUROC higher than 0.9, except for FCN and ConvLSTM, which had lower performance with values of 0.845 and 0.881, respectively. The best performing models were CNN1 with an AUROC of 0.977, DCNN with 0.959, and CNN3 with 0.943. The model with the AE + biLSTM architecture obtained 0.917. Additionally, the graph includes the ROC curve for the best machine learning model obtained in the previous section, which ranks as the second best model with an AUROC of 0.975. Figure 2B shows the best F-score obtained by each method. The ranking of models based on this metric was consistent with that inferred from the ROC analysis. CNN1 and DCNN are the only neural network models that exceed an F-score of 0.9, with 0.923 and 0.903, respectively. The RFC model with the combination of all features also ranked as the second best for this metric.
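Since the neural networks operate on the raw signals, they require the recording to be split into fixed-length windows. The sketch below shows one way to do this; the 5 s window length matches the fragments described later, while the sampling rate and the lack of overlap are assumptions.

```python
import numpy as np

def segment_eeg(signal, fs=256, window_s=5.0):
    """Split a multichannel EEG recording into non-overlapping fixed-length windows.

    signal: array of shape (n_channels, n_samples).
    Returns an array of shape (n_windows, window_len, n_channels), i.e. one
    time-major window per row, ready to feed to a 1-D convolutional model.
    """
    window_len = int(fs * window_s)
    n_windows = signal.shape[1] // window_len
    trimmed = signal[:, : n_windows * window_len]
    windows = trimmed.reshape(signal.shape[0], n_windows, window_len)
    return np.transpose(windows, (1, 2, 0))
```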

(A) ROC curve comparing the performance of neural network models. The performance of the random forest model with all features (RFC-All) is shown as a reference. (B) F-score achieved by the models in the test dataset. DCNN: deep convolutional neural network. CNN1: convolutional neural network 1. CNN3: convolutional neural network 3. FCN: fully convolutional neural network. AE + BiLSTM: autoencoder with LSTM layers. ConvLSTM: convolutional neural network with LSTM layers. RFC: random forest classifier with the combination of all calculated features (All). (C) Loss vs. epoch for the CNN1 neural network model. (D) Architecture of the CNN1 neural network model.
Figure 2C shows the loss value across training epochs for the CNN1 model. The model reached its minimum loss at epoch 94, with a value of 0.041. At this epoch, the validation accuracy was 0.917. For the other models, the epoch with the lowest loss was 107 for FCN, 47 for CNN3, 78 for DCNN and 119 for AE + biLSTM, with validation accuracy values of 0.732, 0.883, 0.894 and 0.822, respectively. The architecture of the best performing model (CNN1) is shown in Fig. 2D. CNN1 starts with 5 blocks of convolutional layers and max pooling, followed by 2 dense layers that perform the classification based on the representation produced by the convolutional layers. From these results, we concluded that the best deep learning models were those using the CNN1, CNN3 and DCNN architectures. These models were subsequently evaluated on EEG exams from HOMI.
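An illustrative Keras sketch of a CNN1-like architecture is shown below: five blocks of 1-D convolution plus max pooling followed by two dense layers. The filter counts, kernel sizes, input shape and training settings are assumptions for illustration, not the published configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn1_like(n_channels=21, window_len=1280):
    """CNN1-like model: 5 conv + max-pooling blocks, then 2 dense layers (assumed sizes)."""
    inputs = keras.Input(shape=(window_len, n_channels))
    x = inputs
    for filters in (32, 32, 64, 64, 128):  # five convolutional blocks
        x = layers.Conv1D(filters, kernel_size=5, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)           # first dense layer
    outputs = layers.Dense(1, activation="sigmoid")(x)   # second dense layer: seizure probability
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```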
Evaluation of best trained models in HOMI clinical data
Our main goal for the evaluation experiments presented in this work was to select the most promising models for automated seizure detection in EEG exams taken at HOMI. Hence, we further assessed the performance of the best models from the previous evaluations on an EEG generated and manually annotated by experts at HOMI. Figure 3A shows the ROC curve of the chosen models tested on HOMI data. The performance of all models dropped substantially when tested on this new data set. Moreover, the model with the best metrics on the test data from public databases (CNN1) ranked last in this case, with an AUROC of 0.495. The models achieving the best performance on these new data were the RFC with all features and the RFC with only the Wavelets features, which obtained AUROC values of 0.656 and 0.602, respectively. The RFC model using only power band features obtained a value of 0.574. The best neural network model in this case was CNN3 (AUROC = 0.553), followed by DCNN (AUROC = 0.510). Figure 3B shows the best F1 score obtained by each model. The RFC with all features was also the best model according to this metric, followed in this case by the CNN3 and DCNN neural networks. The other models showed F1 scores below 0.6 (Supplementary Table 3).

(A) ROC curve for the prediction of seizures by the selected models in the HOMI EEG. (B) F-score achieved by each model. DCNN: deep convolutional neural network. CNN1: convolutional neural network 1. CNN3: convolutional neural network 3. RFC: random forest classifier. The RFC was trained with the combination of all calculated features (All), with only the power band features (PBand), and with only the wavelet transform features (Wavelets). (C) Sum of distances from manually annotated seizures to seizures automatically detected by each model.
To identify reasons for the accuracy gap between the two experiments, we investigated whether imprecision at the boundaries of manually annotated seizures could inflate the number of detection errors (either false positives or false negatives). Hence, we developed alternative measures, calculating for each automated annotation the distance to the closest manual annotation, and for each manual annotation the distance to the closest automated annotation. These measures penalize detection errors around the boundaries of the annotated events less than other error types. Figure 3C shows the results obtained with these measures. The RFC with all features and the CNN3 model are the most balanced, having the smallest distances overall. The random forest models trained on power bands or Wavelets, and the CNN1 model, show a similar behavior, with automated annotations lying at a large distance from the closest manually annotated fragment, indicating a large number of false positives unrelated to the boundaries of the annotations. Conversely, DCNN shows the opposite behavior, with a large distance from the manually annotated fragments to the closest automatically annotated fragment, which suggests a large number of false negative fragments.
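A minimal sketch of these boundary-tolerant distance measures is shown below, assuming each annotation is represented by the index of its 5 s fragment and that both annotation lists are non-empty; the function name and representation are illustrative.

```python
import numpy as np

def annotation_distances(manual_idx, auto_idx):
    """Return (sum of distances from each automated fragment to the closest manual
    fragment, sum of distances from each manual fragment to the closest automated one)."""
    manual = np.asarray(sorted(manual_idx))
    auto = np.asarray(sorted(auto_idx))
    auto_to_manual = sum(np.min(np.abs(manual - a)) for a in auto)
    manual_to_auto = sum(np.min(np.abs(auto - m)) for m in manual)
    return auto_to_manual, manual_to_auto
```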
We confirmed the hypotheses drawn from the distance analysis by visualizing and comparing the seizure annotations. Figure 4 shows the annotations produced by the models (colored tracks) compared to the manual annotation made by experts at the hospital (black track). The 5 s fragments of this examination are presented chronologically along the tracks. The manual annotation track shows 243 fragments marked as seizure by the experts. Regarding the neural networks, only 107 fragments are annotated by the DCNN model, confirming its low sensitivity. Conversely, the CNN1 model annotates seizures in 1275 fragments, which represents 33.6% of the total number of fragments (3788) and more than five times the number of manually annotated seizure fragments. Although the CNN3 model annotates 298 fragments, they show poor overlap with the manually annotated fragments. The random forest models trained solely on power bands or on Wavelets appear to perform well in the first 70% of the exam, annotating seizure regions similar to those in the manual annotation. However, they produce a large number of false positives towards the end of the exam, with a final count of fragments annotated as seizure of 1077 for Wavelets and 1038 for PBand, which translates into poor specificity. Using all proposed features seems to alleviate this issue towards the end of the exam, without sacrificing accuracy in the first 70% of the exam. In total, this model classified 441 fragments as seizure, achieving the best performance across all metrics.

Results of the seizure detection process for the HOMI EEG across the 5.3 h of the exam. The track colored black represents the manual annotation performed by experts at the hospital, which was used as the gold standard for benchmarking. The remaining tracks display the results of the models for automated detection. Lines on each track represent 5-second fragments labeled as part of a seizure event. DCNN: deep convolutional neural network. CNN1: convolutional neural network 1. CNN3: convolutional neural network 3. RFC: random forest classifier. The RFC was trained with the combination of all calculated features (All), with only the power band features (PBand), and with only the wavelet transform features (Wavelets).
Computational efficiency of the detection models
Another aspect of interest was the computational efficiency of the trained models, in both the training and the classification process. Figure 5A shows the training time for each of the algorithms used on the clinical data, using a server with 64 GB of memory. The random forest with all features was the model with the longest training time (over 50 h), mainly due to the time needed to calculate the features. It was followed by DCNN with 38.5 h. The remaining models took between 8 and 16 h to train.

(A) Total training time in hours for each model used on the HOMI EEG. (B) Total classification time of each model on the HOMI EEG. DCNN: deep convolutional neural network. CNN1: convolutional neural network 1. CNN3: convolutional neural network 3. RFC: random forest classifier. The RFC was trained with the combination of all calculated features (All), with only the power band features (PBand), and with only the wavelet transform features (Wavelets).
Given the potential clinical utility of these models, we also measured the time taken to analyze an exam. Figure 5B shows this time for each of the models. The RFC + All model takes by far the longest: for an exam of approximately 6 h, it took a little over 12 h to complete the classification task, mainly because it needs to calculate all features. The PBand and Wavelets models take 1.37 and 1.30 h, respectively, which means they need about a quarter of the exam duration to perform the analysis. Finally, the neural network models were the fastest in the classification task, all taking approximately one hour.
