The accurate and timely detection of COVID-19 remains a critical challenge in combating the ongoing pandemic. This study presents “Cough2COVID19,” an enhanced framework for COVID-19 detection. This section outlines the methodology used to develop the framework, which leverages multilayer ensemble deep learning techniques and a novel feature ranking approach called CoughFeatureRanker, as shown in Fig. 1.
Dataset collection
To establish a robust and comprehensive dataset, we used publicly available cough audio datasets from diverse sources, encompassing both COVID-19 positive and negative cases. The data collection process adhered to clinical ethical guidelines, ensuring participant privacy and consent. Our experiments involved multiple datasets: COUGHVID^{42}, Coswara^{20}, Virufy^{43}, and ComParE^{44}. The ComParE dataset was obtained through direct email correspondence with the challenge organizers. These collections furnished a diverse range of cough sounds from individuals with confirmed COVID-19 and from those with various other respiratory conditions. To maintain focus on COVID-19 detection, we filtered each dataset and retained only two subsets: recordings from participants with confirmed COVID-19 positivity and recordings from healthy individuals.
Table 3 provides a detailed description of the datasets used in our experiments for COVID-19 detection. These datasets were selected based on their relevance and contribution to our study. The ComParE dataset encompassed 517 samples, comprising 119 COVID-19-positive samples and 398 COVID-19-negative samples. This dataset proved invaluable for our analysis of respiratory audio recordings. The Virufy dataset comprised 1,190 samples, equally divided between COVID-19-positive and COVID-19-negative cases, with 595 samples in each category. This dataset specifically focused on cough and breath sounds from confirmed COVID-19 cases. The COUGHVID dataset featured 1,311 samples, including 651 COVID-19-positive samples and 660 COVID-19-negative samples. This dataset provided a diverse array of cough sounds for our analysis. The Coswara dataset contributed 1,319 samples, consisting of 185 COVID-19-positive and 1,134 COVID-19-negative samples. This comprehensive dataset encompassed various respiratory data, including cough sounds, breath sounds, and demographic information. The combined datasets totaled 4,337 samples, comprising 1,550 COVID-19-positive samples and 2,787 COVID-19-negative samples. This extensive data collection facilitated a comprehensive analysis and the development of accurate models for COVID-19 detection based on cough audio signals.
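As a quick consistency check, the per-dataset counts quoted above can be summed to reproduce the combined totals and to make the class imbalance explicit. The dictionary layout and variable names are illustrative only.

```python
# (positive, negative) sample counts per dataset, as reported in the text.
counts = {
    "ComParE": (119, 398),
    "Virufy": (595, 595),
    "COUGHVID": (651, 660),
    "Coswara": (185, 1134),
}

total_pos = sum(pos for pos, _ in counts.values())
total_neg = sum(neg for _, neg in counts.values())
total = total_pos + total_neg

# Ratio of negative to positive samples in the combined data.
imbalance_ratio = total_neg / total_pos

print(total_pos, total_neg, total)  # 1550 2787 4337
```

The combined data contain roughly 1.8 negative samples per positive sample, with Coswara being the most skewed individual source.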
Because data imbalance can degrade neural network performance, we applied SMOTE^{45,46}. This technique addresses class imbalance in the training set by generating synthetic samples to oversample the minority class. SMOTE has proven effective in mitigating such imbalance in previous studies, particularly in cough detection and cough classification.
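The core idea of SMOTE can be sketched in a few lines of NumPy: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the implementation used in the study; the function name, the choice of k = 5, and the brute-force neighbour search are assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Brute-force pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])
```

Oversampling of this kind should be applied to the training split only, so that synthetic points never leak into validation or test data.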
The initial classifiers used in our proposed ensemble method (CoughFeatureRanker) were evaluated using the validation data from each respective dataset. The validation data served to fine-tune the model hyperparameters and ensure optimal performance for each classifier before combining them into the ensemble. Each classifier was trained on the training data and validated on the validation set, and its performance was measured on this validation data before the ensemble was created. The final performance of the CoughFeatureRanker ensemble was then evaluated using the test datasets to ensure an unbiased assessment of its generalization capabilities.
Audio cough features engineering and preprocessing
The preprocessing phase played a pivotal role in ensuring the quality and compatibility of the cough audio samples. This stage applied several techniques, including noise reduction, resampling, and segmentation, to improve the signal-to-noise ratio and the effectiveness of the subsequent feature extraction^{47}. Thorough cleaning of the audio data was essential, given the lack of control over recording devices and environments. To ensure data quality and uniformity, we executed the preprocessing steps using the Python toolkit librosa^{48}.
A critical step was removing the leading and trailing silences from the audio recordings. Removing these silent segments eliminated potential noise and extraneous content, so that the analysis concentrated exclusively on the pertinent cough sounds^{49}. Moreover, we normalized the amplitude of the audio signals to the range (-1, 1). This amplitude normalization enabled fair and consistent comparisons among audio samples by removing amplitude disparities that might otherwise have influenced the subsequent analysis and modelling. Figure 2 shows the sample audio signals used in our study, and Table 4 shows the evaluation of audio cough features for COVID-19 detection.
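The two steps described above, silence trimming and amplitude normalization, can be illustrated with a minimal NumPy sketch. The study used librosa, which offers `librosa.effects.trim` for frame-based trimming; the per-sample threshold used here is a simpler stand-in, and the function names and the 30 dB threshold are assumptions for the example.

```python
import numpy as np

def trim_silence(y, top_db=30.0):
    """Drop leading/trailing samples more than top_db below the peak level."""
    y = np.asarray(y, dtype=float)
    peak = np.max(np.abs(y)) if len(y) else 0.0
    if peak == 0:
        return y[:0]                           # all-silent input
    threshold = peak * 10.0 ** (-top_db / 20.0)
    loud = np.flatnonzero(np.abs(y) >= threshold)
    return y[loud[0]: loud[-1] + 1]

def peak_normalize(y):
    """Scale the waveform so that its largest absolute value is 1."""
    y = np.asarray(y, dtype=float)
    peak = np.max(np.abs(y)) if len(y) else 0.0
    return y / peak if peak > 0 else y
```

A typical call chain would be `peak_normalize(trim_silence(waveform))`, applied to each recording before feature extraction.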
Time domain features
Temporal audio features are vital for capturing cough signal patterns^{50}. They reveal energy distribution, amplitude fluctuations, and temporal variations, aiding COVID-19 detection. These features are essential for constructing algorithms and models to analyze and classify cough signals.

Root Mean Square Energy (RMSE)^{51}:
$$\begin{aligned} \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2} \end{aligned}$$
(1)
where N represents the total number of samples, and \(x_i\) denotes the amplitude of the \(i\)th sample.

Zero-Crossing Rate (ZCR):
$$\begin{aligned} \text{ZCR} = \frac{1}{2N}\sum_{i=1}^{N-1} \left| \operatorname{sgn}(x_i) - \operatorname{sgn}(x_{i+1}) \right| \end{aligned}$$
(2)
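Both time-domain features can be computed directly from a waveform. The sketch below follows Eq. (1) and reads Eq. (2) as the sum of absolute sign differences between consecutive samples, scaled by 1/(2N); that reading, like the function names, is an assumption of this example.

```python
import numpy as np

def rms_energy(x):
    """Root mean square energy of a 1D signal, as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2))

def zero_crossing_rate(x):
    """Zero-crossing rate: each sign change contributes 2 to the sum, hence the 1/(2N) factor."""
    x = np.asarray(x, dtype=float)
    signs = np.sign(x)
    return np.sum(np.abs(signs[:-1] - signs[1:])) / (2 * len(x))

x = np.array([1.0, -1.0, 1.0, -1.0])
print(rms_energy(x))          # 1.0
print(zero_crossing_rate(x))  # 0.75
```

In practice both quantities are computed per frame, yielding one value per analysis window rather than one per recording.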
Frequency domain
Frequency-domain features play a crucial role in capturing the spectral characteristics of cough signals, offering valuable insights into their frequency content and distribution^{52}. These features provide information about the central frequency, tonality, and temporal variations within the cough spectrum. By analyzing these characteristics, we can uncover key attributes that may contribute to identifying and classifying cough signals, particularly in COVID-19 detection. The frequency-domain audio features serve as fundamental building blocks for understanding and characterizing cough signals, enabling the development of effective algorithms and models for automated cough analysis.
Spectral bandwidth
The spectral bandwidth (\(S\text{BW}\)), shown in Eq. (3)^{53}, measures how widely the signal’s energy is spread around the expected frequency, i.e., its energy concentration or variance.
$$\begin{aligned} S\text{BW} = \sqrt{\sum_{k=1}^{K} (f_k - E)^2 \cdot P_k} \end{aligned}$$
(3)
It is calculated using the frequency of each band (\(f_k\)), the expected frequency (E), and the energy in each band (\(P_k\)). With a total of K frequency bands, \(SBW\) helps quantify energy distribution across the signal’s spectrum.
Spectral centroid
The spectral centroid (\(S\text{CENT}\)), shown in Eq. (4), is the ratio of the frequency-weighted sum of spectral magnitudes to their unweighted sum^{54}. It uses the energy in each band (\(P_k\)) and the corresponding frequency (\(f_k\)) of the bands.
$$\begin{aligned} S\text{CENT} = \frac{\sum_{k=1}^{K} P_k \cdot f_k}{\sum_{k=1}^{K} P_k} \end{aligned}$$
(4)
With a total of K frequency bands, the spectral centroid provides information about the average frequency content of the signal.
Spectral contrast
The spectral contrast in frequency band k, denoted as \(S\text{CONT}_k\), quantifies the difference between spectral peaks and valleys, as shown in Eq. (5)^{55}.
$$\begin{aligned} S\text{CONT}_k = P_k - V_k = \left( \log \frac{1}{N} \sum_{n=1}^{N} x'_{k,n}\right) - \left( \log \frac{1}{N} \sum_{n=1}^{N} x'_{k,N-n+1}\right) \end{aligned}$$
(5)
It considers the spectral peaks (\(P_k\)) and valleys (\(V_k\)) within the frequency band. The kth of K FFT vector coefficients in frame n, represented as \(x'_{k,n}\), is used to calculate the spectral contrast. The number of frames is denoted as N.
The spectral contrast feature provides insights into the variation and distinctiveness of different frequency components within the specified band by analyzing the relationship between spectral peaks and valleys.
Spectral flatness
In Eq. (6), \(S\text{FLAT}\) represents spectral flatness, which measures the similarity of a signal to white noise^{56}.
$$\begin{aligned} S\text{FLAT} = \frac{\exp \left( \frac{1}{K} \sum_{k=1}^{K} \ln (P_k)\right)}{\frac{1}{K} \sum_{k=1}^{K} P_k} \end{aligned}$$
(6)
It is determined by the energy in each frequency band (\(P_k\)) and the total number of frequency bands (K).
Spectral flux
The spectral flux (\(S\text{FLUX}\)), shown in Eq. (7), quantifies the energy change between consecutive frames in an audio signal^{57}.
$$\begin{aligned} S\text{FLUX}_n = \sum_{k=1}^{K} \left( E_{n,k} - E_{n-1,k} \right)^2 \end{aligned}$$
(7)
It is calculated by comparing the discrete Fourier transform coefficients in the current frame (\(E_{n,k}\)) with the coefficients in the previous frame (\(E_{n-1,k}\)).
Spectral rolloff
The spectral rolloff (\(S\text{ROLL}\)), given in Eq. (8), finds the frequency \(f_R\) at which the accumulated energy is no less than a proportion S of the total energy^{58}.
$$\begin{aligned} S\text{ROLL} = \arg \min_{f_R \in \{1, \ldots , K\}} \left\{ \frac{\sum_{k=1}^{f_R} P_k}{\sum_{k=1}^{K} P_k} \ge S \right\} \end{aligned}$$
(8)
It utilizes the energy in each frequency band (\(P_k\)), the total number of frequency bands (K), and the proportion threshold (S). The equation identifies the smallest \(f_R\) that satisfies this condition, indicating the spectral cutoff point in the audio signal.
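The frequency-domain features above can be sketched for a single frame as follows. Several choices here are assumptions made for illustration: the expected frequency E is taken to be the spectral centroid, the band energies are normalized to sum to one before computing the bandwidth, a small constant guards the logarithm, and S = 0.85 is an arbitrary roll-off proportion.

```python
import numpy as np

def spectral_summary(P, f, S=0.85):
    """Centroid, bandwidth, flatness and roll-off frequency of one frame.

    P: band energies (K,), f: band centre frequencies (K,), S: roll-off proportion.
    """
    P = np.asarray(P, dtype=float)
    f = np.asarray(f, dtype=float)
    p = P / P.sum()                                   # normalised energy distribution
    centroid = np.sum(p * f)                          # Eq. (4)
    bandwidth = np.sqrt(np.sum(p * (f - centroid) ** 2))         # Eq. (3) with E = centroid
    flatness = np.exp(np.mean(np.log(P + 1e-12))) / np.mean(P)   # Eq. (6)
    rolloff_band = int(np.argmax(np.cumsum(p) >= S))  # Eq. (8), zero-based band index
    return centroid, bandwidth, flatness, f[rolloff_band]

def spectral_flux(E_prev, E_curr):
    """Sum of squared differences between consecutive frames, as in Eq. (7)."""
    diff = np.asarray(E_curr, dtype=float) - np.asarray(E_prev, dtype=float)
    return float(np.sum(diff ** 2))
```

For a perfectly flat spectrum the flatness is 1 and the centroid sits at the mean band frequency, which gives a convenient sanity check for any implementation.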
Time-frequency domain
This section explores the time-frequency domain analysis of audio signals, which is essential for respiratory classification and COVID-19 detection. By examining cepstral and tonal features, such as Mel-frequency cepstral coefficients (MFCC), we can differentiate between healthy and COVID-19-affected respiratory sounds. These techniques capture critical changes in pitch and timbre, aiding in the development of non-invasive diagnostic tools.
Cepstral features
The time-frequency features considered here are divided into cepstral (timbre or tone colour) and tonal (pitch) features^{59}.
Nonlinear Mel-frequency cepstrum (MFC)
The nonlinear Mel-frequency cepstrum (MFC) is commonly used in respiratory classification as it captures the temporal frequency content of a signal. It has also been utilized for COVID-19 analysis.
Melfrequency cepstral coefficients (MFCC)
The transformation of the signal is represented by Eq. (9). The logarithmic energy of the \(k\)th of K coefficients at frame n is denoted by \(s(k)\).
$$\begin{aligned} \text{MFCC}_n = \sum_{k=1}^{K} s(k) \cos \left( \frac{\pi n (k - 0.5)}{K} \right) \end{aligned}$$
(9)
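The cosine transform in Eq. (9) can be sketched as below. The example assumes the standard DCT form in which the cosine argument is divided by the number of bands K, and it indexes the output coefficients from n = 0; both choices, and the function name, are assumptions of this illustration rather than details confirmed by the study.

```python
import numpy as np

def mfcc_from_log_energies(s, n_coeffs):
    """Cosine transform of K log filter-bank energies into n_coeffs cepstral coefficients."""
    s = np.asarray(s, dtype=float)
    K = len(s)
    k = np.arange(1, K + 1)                  # band index 1..K
    n = np.arange(n_coeffs)[:, None]         # coefficient index 0..n_coeffs-1
    basis = np.cos(np.pi * n * (k - 0.5) / K)
    return basis @ s
```

With a constant input such as `np.ones(8)`, only the zeroth coefficient is non-zero, reflecting that higher coefficients describe the shape of the spectral envelope rather than its overall level.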
Tonal features
Tonal features are based on the human perception of periodic pitch. They are relevant for analyzing respiratory signals affected by conditions like COVID-19, which may alter the pitch of inhalation and exhalation^{60}.
Chroma energy normalized (CENS)
CENS is a chroma abstraction that considers short-time statistics within chroma bands. It is designed to be resistant to timbre variations^{61}.
Constant-Q chromagram (CCQT)
The constant-Q chromagram is obtained from a time-frequency representation using the constant-Q transform (CCQT), which offers good frequency resolution for low frequencies^{62}.
Short-time Fourier transform chromagram (STFT)
The short-time Fourier transform chromagram is similar to CCQT, but the initial transformation used is the short-time Fourier transform (STFT)^{63}.
Tonnetz
Tonnetz is a lattice graph that represents harmonic information. Geometric areas between points encode pitch, making the distances between points meaningful^{64}.
CoughFeatureRanker algorithm
A key contribution of this study is the CoughFeatureRanker algorithm, a novel approach designed to identify the most informative features extracted from cough signals. Its objective is to rank features according to their relevance in detecting COVID-19, thereby reducing the dimensionality of the input space. This not only enhances the efficiency of the model but also improves its interpretability. Algorithm 1 details CoughFeatureRanker and its integration within the overall framework. CoughFeatureRanker ranks cough audio features based on evaluation metrics, including ROC, Precision, and Recall. It uses the machine learning algorithms KNN^{65}, SVM^{66}, RF^{67}, and LR^{68} to calculate these metrics for each feature. Based on their scores, the algorithm sorts the features in descending order and categorizes them into time-domain, tonal, spectral, and cepstral categories. The top-ranked features identified by the algorithm are incorporated into our Cough2COVID19 framework, a multilayer ensemble deep learning framework that aims to enhance the accuracy of COVID-19 detection based on cough analysis.
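The score-then-sort structure of such a ranking can be sketched as follows. This is not Algorithm 1: instead of training KNN, SVM, RF, and LR classifiers, the example scores each feature with a single-feature ROC AUC computed from the rank-sum statistic, purely as a lightweight stand-in. The function names and the direction-agnostic scoring are assumptions of the example.

```python
import numpy as np

def auc_score(values, labels):
    """ROC AUC of one scalar feature via the rank-sum statistic (ties not averaged)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels).astype(bool)
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    auc = (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return max(auc, 1 - auc)                 # treat either direction as informative

def rank_features(X, y, names):
    """Score every feature column and return (name, score) pairs, best first."""
    scores = {name: auc_score(X[:, j], y) for j, name in enumerate(names)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Replacing `auc_score` with a cross-validated classifier metric would bring the sketch closer to the classifier-based scoring described above, at a higher computational cost.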
In our investigations, we systematically evaluated 15 distinct audio features originating from three signal domains. Our intent was to establish a uniform dimensionality^{69} across machine learning (ML) models, so that they apply regardless of the duration of the audio sample, which ranged from 1 to 30 seconds. To this end, we computed seven summary statistics that describe the distribution of each feature across frames: minimum, maximum, mean, median, variance, first quartile, and third quartile. These statistics succinctly capture the distribution patterns of the audio features and provide a fixed-length representation of the underlying data. To safeguard against overfitting^{70} and preserve the validity of our evaluations, we did not assess and rank the complete feature set at once. Instead, we systematically considered and evaluated small, manageable feature subsets. This methodology allowed us to concentrate on well-defined subsets, ensuring careful scrutiny and mitigating the potential biases linked to overfitting.
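The seven summary statistics turn a frame-level feature matrix of any length into a fixed-size vector. A minimal NumPy sketch follows; the function name and the (features × frames) layout are assumptions of the example.

```python
import numpy as np

def summarize_frames(feature_frames):
    """Map a (n_features, n_frames) matrix to a vector of 7 statistics per feature."""
    F = np.atleast_2d(np.asarray(feature_frames, dtype=float))
    stats = [
        F.min(axis=1),
        F.max(axis=1),
        F.mean(axis=1),
        np.median(F, axis=1),
        F.var(axis=1),
        np.percentile(F, 25, axis=1),   # 1st quartile
        np.percentile(F, 75, axis=1),   # 3rd quartile
    ]
    return np.stack(stats, axis=1).ravel()

print(summarize_frames([[1, 2, 3, 4, 5]]))  # [1. 5. 3. 3. 2. 2. 4.]
```

A 1-second and a 30-second recording therefore yield vectors of identical length, which is what allows the same ML model to consume both.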
The experiments were performed on a system equipped with an Intel Core i7-7700K 4.20 GHz processor, 16 GB RAM, and a GTX 1060 6 GB GPU. This hardware configuration provided the computational resources needed for running the experiments and training the machine learning models, with the GPU accelerating the training of the deep learning models through parallel computation. The proposed method required 1.36 seconds of training time per epoch, allowing rapid iteration during model optimization. Furthermore, the prediction time for classifying a single cough audio signal was measured at 0.48 milliseconds, indicating the model’s capability for swift decision-making.
Multilayer ensemble deep learning
Our study employed a multilayer ensemble methodology to leverage the capabilities of deep learning models for COVID-19 detection^{71}. This approach encompassed the training and combination of multiple deep neural networks, each specializing in a distinct facet of cough audio analysis. We describe the architecture, training regimen, and optimization strategies employed for each network, as well as the ensemble technique adopted to combine their outputs.
We devised and assessed four distinct deep-learning-based models for COVID-19 detection. We explored, analyzed, and experimented with 15 audio features, with the intention of extracting the most informative ones. These features were ranked by our CoughFeatureRanker algorithm.
From the pool of 15 audio features, our algorithm identified the MFCC, Spectrogram, and Chromagram features as top-ranked. These features were deemed to offer superior performance and robust discriminatory capabilities, making them pivotal for accurate COVID-19 detection.
MFCC-MLP
Our methodology for extracting the MFCC coefficients follows a sequence of steps. Initially, the coughing audio waveform is resampled to a sampling rate of 22.5 kHz. Subsequently, features are extracted from the resampled signal using a hop length of 23 ms and a window length of 93 ms, with a Hann window. The resulting MFCC features are then averaged along the time axis, yielding a compact 1D vector of 39 coefficients that encapsulates the crucial details of the coughing audio signal. We engineered a streamlined Multi-Layer Perceptron (MLP) network for our model, illustrated in Fig. 1. The network architecture comprises four fully connected (FC) layers and a single output layer. The FC layers contain 1024, 2048, 512, and 512 nodes, respectively. Rectified Linear Unit (ReLU) activation functions and dropout layers are integrated into each FC layer to introduce non-linearity and mitigate overfitting. The final layer of the MLP network is a dense layer comprising a single node activated by a Sigmoid function. This node’s output is the probability that the cough signal indicates the presence of COVID-19. Through training on labelled coughing audio data, the network learns to categorize cough signals according to their likelihood of being linked to the disease. The approach thus combines MFCC feature extraction with a compact MLP network to establish a dependable COVID-19 detection mechanism based on the analysis of coughing audio signals.
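The layer sizes described above can be made concrete with a small NumPy forward pass. This is only a shape-level sketch: the weights are random placeholders rather than trained parameters, the initialization scheme is an assumption, dropout is omitted because it is inactive at inference time, and the MFCC matrix is synthetic.

```python
import numpy as np

LAYER_SIZES = [39, 1024, 2048, 512, 512, 1]   # input, four FC layers, output node

def init_mlp(sizes, seed=0):
    """Random placeholder weights (He-style scaling is an assumption)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0, np.sqrt(2 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, params):
    h = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)            # FC layer followed by ReLU
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))     # sigmoid probability

params = init_mlp(LAYER_SIZES)
n_parameters = sum(W.size + b.size for W, b in params)

# Synthetic stand-in for a (39, n_frames) MFCC matrix, averaged over time.
mfcc_matrix = np.random.default_rng(1).normal(size=(39, 120))
probability = mlp_forward(mfcc_matrix.mean(axis=1), params)

print(n_parameters)  # 3452417
```

Under these layer sizes the network has about 3.45 million trainable parameters, most of them in the 1024-to-2048 layer.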
Spectrogram-CNN
Spectrograms have demonstrated their efficacy in various tasks, including speech recognition, speaker verification, and speech enhancement. In our spectrogram extraction process, we leverage the librosa library, building upon the foundation of the previously acquired MFCC coefficients. These coefficients are pivotal in generating spectrograms, providing visual representations of audio signals. Subsequently, the generated spectrograms are resized to a standardized dimension of 128 × 40. Ensuring consistent input across the network, we further normalize the spectrograms within the range of [0, 1].
Given the two-dimensional nature of spectrograms, we designed a streamlined Convolutional Deep Neural Network (CDNN) inspired by the influential VGG network. Our CDNN comprises three convolutional layers, a flatten layer, three fully connected (FC) layers, and a single output layer for classification^{72}. Each convolutional layer is a composite function: a convolution with 32, 64, and 64 filters, respectively, followed by a ReLU activation layer, a max pooling layer with a 2 × 2 filter and a stride of 2, and a Batch Normalization layer. The FC layers contain 256, 64, and 64 nodes, each accompanied by ReLU activation functions and dropout layers. The final dense layer consists of a single node with a Sigmoid activation function. The spectrogram images are resized to 128 × 40 before being input into the network. The model is depicted in Fig. 1.
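To see how the 128 × 40 input shrinks through the three convolution and pooling stages, the feature-map shapes can be traced with simple arithmetic. The sketch assumes that each convolution preserves height and width (i.e., "same" padding), which the text does not state; with unpadded convolutions the numbers would be smaller.

```python
def conv_pool_shapes(height, width, filters, pool=2, stride=2):
    """Trace (height, width, channels) after each conv + max-pool stage."""
    shapes = []
    for n_filters in filters:
        # Assumption: the convolution keeps height/width; only pooling shrinks them.
        height = (height - pool) // stride + 1
        width = (width - pool) // stride + 1
        shapes.append((height, width, n_filters))
    return shapes

shapes = conv_pool_shapes(128, 40, [32, 64, 64])
flattened_size = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

print(shapes)          # [(64, 20, 32), (32, 10, 64), (16, 5, 64)]
print(flattened_size)  # 5120
```

Under this assumption the flatten layer hands a 5120-dimensional vector to the first 256-node FC layer.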
Chromagram-MLP
The Chromagram audio feature potentially aids in COVID-19 detection by analyzing cough and respiratory sounds.
$$\begin{aligned} x_i = \frac{\sum_{f \in \text{Pitch Class } i} m(f)}{\sum_{f} m(f)} \end{aligned}$$
(10)
In Eq. (10), \(x_i\) represents the chroma value for pitch class i, and \(m(f)\) represents the spectral magnitude at frequency f. For our study, we extracted 12-element 1D features from each cough audio signal through librosa. These features were then used as inputs to our custom neural network architecture. Like the MFCC-MLP model, this model comprises four fully connected (FC) layers with 1024, 2048, 512, and 512 nodes, respectively, each with ReLU activation functions and dropout layers for regularization. The final dense layer comprises a single node activated by the Sigmoid function. Figure 1 illustrates the model we employed for our study.
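Eq. (10) can be illustrated by folding spectral magnitudes into 12 pitch classes and normalizing by the total magnitude. The mapping from frequency to pitch class used here (equal temperament with A4 = 440 Hz, nearest-semitone rounding) is an assumption of the sketch; the study itself relied on librosa for chroma extraction.

```python
import numpy as np

def chroma_vector(freqs, mags, ref_hz=440.0):
    """12-element chroma vector: per-pitch-class magnitude over total magnitude."""
    freqs = np.asarray(freqs, dtype=float)
    mags = np.asarray(mags, dtype=float)
    valid = freqs > 0                                   # log2 is undefined at 0 Hz
    # MIDI-style pitch number (A4 = 69); pitch class 0 corresponds to C.
    midi = 69 + 12 * np.log2(freqs[valid] / ref_hz)
    pitch_class = np.round(midi).astype(int) % 12
    x = np.bincount(pitch_class, weights=mags[valid], minlength=12)
    total = x.sum()
    return x / total if total > 0 else x

chroma = chroma_vector([440.0, 880.0, 261.63], [1.0, 1.0, 2.0])
print(chroma[9], chroma[0])  # 0.5 0.5
```

In the example, 440 Hz and 880 Hz both fall into pitch class A (index 9), while 261.63 Hz falls into C (index 0), so each class receives half of the total magnitude.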
Ensemble Cough2COVID19 approach
We formulated a robust Ensemble Convolutional Neural Network (CNN) architecture, named Cough2COVID19, designed to detect COVID-19 from cough audio signals. The design combines neural features, selected through the CoughFeatureRanker algorithm, from distinct domains: Mel-frequency cepstral coefficient (MFCC) features, spectrogram images, and chroma features (chromagram) extracted from coughing audio signals. This fusion of multiple neural characteristics improves the accuracy and efficacy of COVID-19 detection within cough audio signals. Specifically, the MFCC branch contributes 83% to 89% accuracy, with chromagram features further boosting it to 86% to 89%. Spectrogram features demonstrate slightly lower performance, contributing 73% to 86%. However, the Cough2COVID19 (MLEDL) model, which integrates all these characteristics, achieves superior results, with 93% to 98% accuracy across multiple datasets.
Figure 1 and Table 5 provide a detailed representation of the architectural framework of the Cough2COVID19 Multi-Layer Ensemble Deep Learning Network (MLEDN). The network comprises three distinct branches, each extracting discriminative neural attributes from one of the aforementioned sources. The extracted attributes are combined via concatenation and then passed to the classification network. The complete hyperparameter details of the proposed architecture are given in Table 5.
In the first branch of the proposed architecture, neural attributes, denoted as \(F_n^1 \in \mathbb{R}^{C'_1}\), are derived from the Mel-frequency cepstral coefficients (MFCC) source. This extraction is accomplished through a two-layer dense configuration. The first dense layer comprises 512 nodes activated by the Rectified Linear Unit (ReLU) function and is followed by a Dropout layer. The second dense layer comprises 256 nodes with ReLU activation, followed by another Dropout layer, so that \(C'_1 = 256\). The Dropout layers address potential overfitting, balancing performance and generalization.
The second branch is dedicated to extracting neural features, denoted as \(F_n^2 \in \mathbb{R}^{C'_2}\), from spectrogram images sized \(128 \times 40\). This branch comprises three composite function layers, a flatten layer, a dense layer with 256 nodes, and a Dropout layer, so that \(C'_2 = 256\). Each composite function layer consists of a convolutional layer with 32, 64, and 64 filters, respectively, followed by a ReLU activation layer, a max pooling layer with a \(2 \times 2\) filter size and a stride of 2, and a Batch Normalization layer.
The third branch shares a similar architecture with the MFCC branch and is tailored for extracting neural features \(F_n^3 \in \mathbb{R}^{C'_3}\) from chroma-based features. It comprises two dense layers with 512 and 256 nodes, respectively, each employing a ReLU activation function and a Dropout layer, so that \(C'_3 = 256\).
The extracted neural features from the three branches are combined to generate a composite neural feature vector. This can be expressed in the Eq. (11).
$$\begin{aligned} \mathcal{F} = \left[ F_n^1; F_n^2; F_n^3\right] \end{aligned}$$
(11)
The composite neural feature vector \(\mathcal{F}\) of size \(768 \times 1\) is obtained by concatenating the extracted neural features \(F_n^1 \in \mathbb{R}^{256}\), \(F_n^2 \in \mathbb{R}^{256}\), and \(F_n^3 \in \mathbb{R}^{256}\) from the MFCCs, spectrogram images, and chroma-based features, respectively. This vector is then fed into the classification network, a shallow network comprising two dense neural blocks. Each dense neural block consists of 64 nodes with ReLU activation and a Dropout layer. The final node in the network is a single-unit neural block with a Sigmoid function, providing the probability of a given coughing audio signal being COVID-19 positive. Figure 1 shows the complete structural layout of our model.
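The fusion step in Eq. (11) and the shallow classification head can be traced at the level of array shapes. In this sketch the three branch outputs are random stand-ins for real branch activations, the dense weights are untrained placeholders, and dropout is omitted because it is inactive at inference time; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out, activation):
    """One dense layer with random placeholder weights (not trained parameters)."""
    W = rng.normal(0, np.sqrt(2 / x.shape[-1]), (x.shape[-1], n_out))
    z = x @ W
    return np.maximum(z, 0.0) if activation == "relu" else 1.0 / (1.0 + np.exp(-z))

# Stand-ins for the 256-dimensional outputs of the MFCC, spectrogram and chroma branches.
F1, F2, F3 = (rng.random(256) for _ in range(3))

F = np.concatenate([F1, F2, F3])                    # Eq. (11): 768-dimensional vector
hidden = dense(dense(F, 64, "relu"), 64, "relu")    # two 64-node dense blocks
p_covid = dense(hidden, 1, "sigmoid")               # probability of COVID-19 positive

print(F.shape, hidden.shape, p_covid.shape)  # (768,) (64,) (1,)
```

Concatenation keeps the three feature sets separate within one vector, leaving it to the dense blocks to learn how much weight each branch should receive.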