The accurate and timely detection of COVID-19 remains a critical challenge in effectively combating the ongoing pandemic. This study presents “Cough2COVID-19,” an enhanced framework for COVID-19 detection. This section outlines the methodology employed in developing the framework, which leverages multi-layer ensemble deep learning techniques and a novel feature ranking approach called CoughFeatureRanker, as shown in Fig. 1.

Cough2COVID-19: proposed framework for COVID-19 detection according to our study.
Dataset collection
To establish a robust and comprehensive dataset, we employed publicly available cough audio datasets from diverse sources, encompassing both COVID-19 positive and negative cases. The data collection process adhered rigorously to clinical ethical guidelines, ensuring participant privacy and consent. Our experimentation involved multiple datasets, including COUGHVID42, Coswara20, Virufy43, and the ComParE44 datasets. The ComParE dataset was acquired through direct correspondence with the challenge organizers via email. These collections furnished a diverse range of cough sounds, encompassing individuals with confirmed COVID-19 cases and those with various respiratory conditions. To maintain focus on COVID-19 detection, we employed a filtering process, exclusively selecting subsets of the dataset. Specifically, we utilized data from participants with confirmed COVID-19 positivity and data from healthy individuals.
Table 3 provides a detailed description of the dataset used in our experiments for COVID-19 detection. These datasets were selected based on their relevance and contribution to our study. The ComParE dataset encompassed 517 samples, comprising 119 samples from individuals with COVID-19 positivity and 398 samples from COVID-19-negative cases. This dataset proved invaluable for our analysis of respiratory audio recordings. The Virufy dataset comprised 1,190 samples, equally divided between COVID-19 positive and negative cases, with 595 samples in each category. This dataset specifically focused on cough and breath sounds from confirmed COVID-19 cases. The COUGHVID dataset featured 1,311 samples, including 651 samples from individuals with COVID-19 positivity and 660 samples from COVID-19-negative cases. This dataset provided a diverse array of cough sounds for our analysis. The Coswara dataset contributed 1,319 samples, consisting of 185 COVID-19 positive and 1,134 COVID-19 negative samples. This comprehensive dataset encompassed various respiratory data, including cough sounds, breath sounds, and demographic information. The combined datasets totaled 4,337 samples, comprising 1,550 samples from individuals with COVID-19 positivity and 2,787 samples from COVID-19-negative cases. This extensive data collection from the aforementioned sources facilitated a comprehensive analysis and the development of accurate models for COVID-19 detection based on cough audio signals.
Due to the potential adverse impact of data imbalance on neural network performance, we have implemented SMOTE45,46 as a solution. This technique addresses class imbalance in the training set by generating synthetic samples to oversample the minority class. SMOTE has proven effective in mitigating such imbalances in previous studies, particularly in cough detection and cough classification.
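As a minimal sketch, this oversampling step can be realized with the imbalanced-learn implementation of SMOTE; `X_train`, `y_train`, and the fixed random seed below are illustrative placeholders rather than values from our pipeline.

```python
# Minimal sketch of oversampling the minority (COVID-19 positive) class with
# SMOTE. X_train is a 2-D feature matrix, y_train holds binary labels; the
# seed of 42 is illustrative, not a value reported in the paper.
from collections import Counter

from imblearn.over_sampling import SMOTE

def balance_training_set(X_train, y_train, random_state=42):
    """Return a training set whose minority class has been oversampled
    with synthetic SMOTE examples."""
    smote = SMOTE(random_state=random_state)
    X_bal, y_bal = smote.fit_resample(X_train, y_train)
    print("before:", Counter(y_train), "after:", Counter(y_bal))
    return X_bal, y_bal
```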
The initial classifiers used in our proposed ensemble method (CoughFeatureRanker) were evaluated using the validation data from each respective dataset. The purpose of using validation data was to fine-tune the model hyperparameters and ensure optimal performance for each classifier before combining them into the ensemble. Each classifier was trained on the training data and validated on the validation set, and its performance was measured on this validation data before the ensemble was created. The final performance of the CoughFeatureRanker ensemble was then evaluated on the test datasets to ensure an unbiased assessment of its generalization capabilities.
Cough audio feature engineering and pre-processing
The pre-processing phase played a pivotal role in guaranteeing the caliber and compatibility of the cough audio samples. This stage involved employing diverse techniques, including noise reduction, resampling, and segmentation, to bolster the signal-to-noise ratio and enhance the subsequent feature extraction’s overall efficacy47. Thoroughly cleansing the audio data proved paramount, given the lack of control over recording devices and environments. To ensure data quality and uniformity, we executed essential pre-processing steps using the Python toolkit librosa48.
A critical stride encompassed eliminating the leading and trailing silences within the audio recordings. This elimination of silent segments served to eradicate potential noise or extraneous details from the dataset, thereby concentrating exclusively on the pertinent cough sounds49. Moreover, we standardized the amplitude of the audio signals within a range of (-1, 1). This amplitude normalization procedure fostered equitable and consistent comparisons among diverse audio samples, obviating any amplitude disparities that might have influenced the subsequent analysis and modelling. Figure 2 shows the sample audio signals used in our study and Table 4 shows the evaluation of audio cough features for COVID-19 detection.
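For illustration, the silence trimming and amplitude normalization described above can be reproduced with librosa; the 22,050 Hz sampling rate and 20 dB trimming threshold in this sketch are assumed defaults, not values reported in the paper.

```python
# Illustrative pre-processing pipeline with librosa: load/resample, trim
# leading and trailing silence, and peak-normalize into (-1, 1).
import librosa
import numpy as np

def preprocess_cough(path, sr=22050, top_db=20):
    y, sr = librosa.load(path, sr=sr)               # resample on load
    y, _ = librosa.effects.trim(y, top_db=top_db)   # drop silent ends
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak                                # amplitude in [-1, 1]
    return y, sr
```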

Sample audio signals: (a) initial cough waveform and spectrogram; (b) noise-reduced cough waveform and spectrogram.
Time domain features
Temporal audio features are vital for capturing cough signal patterns50. They reveal energy distribution, amplitude fluctuations, and temporal variations, aiding COVID-19 detection. These features are essential for constructing algorithms and models to analyze and classify cough signals.
- Root Mean Square Energy (RMSE)51:
$$\begin{aligned} \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2} \end{aligned}$$
(1)
where N represents the total number of samples, and \(x_i\) denotes the amplitude of the \(i\)-th sample.
- Zero-Crossing Rate (ZCR):
$$\begin{aligned} \text{ZCR} = \frac{1}{2N}\sum_{i=1}^{N-1}\left| \operatorname{sgn}(x_i) - \operatorname{sgn}(x_{i+1}) \right| \end{aligned}$$
(2)
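Both time-domain measures translate directly into NumPy; a minimal sketch of Eqs. (1) and (2) follows, assuming `x` is a mono waveform array (librosa.feature.rms and librosa.feature.zero_crossing_rate provide framewise equivalents).

```python
# Direct NumPy rendering of Eqs. (1) and (2) over a whole mono signal x.
import numpy as np

def rmse(x):
    """Root mean square energy, Eq. (1)."""
    return np.sqrt(np.mean(x ** 2))

def zcr(x):
    """Zero-crossing rate, Eq. (2): sign changes between neighbouring samples."""
    n = len(x)
    return np.sum(np.abs(np.sign(x[:-1]) - np.sign(x[1:]))) / (2 * n)
```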
Frequency domain
Frequency-domain features play a crucial role in capturing the spectral characteristics of cough signals, offering valuable insights into their frequency content and distribution52. These features provide information about the central frequency, tonality, and temporal variations within the cough spectrum. By analyzing these characteristics, we can uncover key attributes that may contribute to identifying and classifying cough signals, particularly in COVID-19 detection. The frequency-domain audio features serve as fundamental building blocks for understanding and characterizing cough signals, enabling the development of effective algorithms and models for automated cough analysis.
Spectral bandwidth The spectral bandwidth (\(S\text{-}BW\)), shown in Eq. (3)53, represents the concentration (variance) of the signal's energy around the expected frequency.
$$\begin{aligned} S\text{-}BW = \sqrt{\sum_{k=1}^{K}(f_k - E)^2 \cdot P_k} \end{aligned}$$
(3)
It is calculated using the frequency of each band (\(f_k\)), the expected frequency (E), and the energy in each band (\(P_k\)). With a total of K frequency bands, \(S\text{-}BW\) helps quantify energy distribution across the signal’s spectrum.
Spectral centroid
In the spectral centroid, as shown in Eq. (4), the (\(S\text{-}CENT\)) is the ratio of the frequency-weighted sum of spectral magnitudes to their unweighted sum54. It utilizes the energy in each band (\(P_k\)) and the corresponding frequency (\(f_k\)) of the bands.
$$\begin{aligned} S\text{-}CENT = \frac{\sum_{k=1}^{K} P_k \cdot f_k}{\sum_{k=1}^{K} P_k} \end{aligned}$$
(4)
With a total of K frequency bands, the spectral centroid provides information about the average frequency content of the signal.
Spectral contrast The spectral contrast in frequency band k, denoted as \(S\text{-}CONT_k\), quantifies the comparison between spectral peaks and valleys as shown in Eq. (5)55.
$$\begin{aligned} S\text{-}CONT_k = P_k - V_k = \left( \log \frac{1}{N} \sum_{n=1}^{N} x'_{k,n}\right) - \left( \log \frac{1}{N} \sum_{n=1}^{N} x'_{k,N-n+1}\right) \end{aligned}$$
(5)
It considers the spectral peaks (\(P_k\)) and valleys (\(V_k\)) within the frequency band. The k-th of K FFT vector coefficients in frame n, represented as \(x'_{k,n}\), are considered to calculate the spectral contrast. The number of frames is denoted as N.
The spectral contrast feature provides insights into the variation and distinctiveness of different frequency components within the specified band by analyzing the relationship between spectral peaks and valleys.
Spectral flatness In Eq. (6), \(S\text{-}FLAT\) represents spectral flatness, which measures the similarity of a signal to white noise56.
$$\begin{aligned} S\text{-}FLAT = \frac{\exp \left( \frac{1}{K} \sum_{k=1}^{K} \ln (P_k)\right)}{\frac{1}{K} \sum_{k=1}^{K} P_k} \end{aligned}$$
(6)
It is determined by the energy in each frequency band (\(P_k\)) and the total number of frequency bands (K).
Spectral flux The spectral flux (\(S\text{-}FLUX\)), shown in Eq. (7), quantifies the energy change between consecutive frames in an audio signal57.
$$\begin{aligned} S\text{-}FLUX_n = \sum_{k=1}^{K} \left( E_{n,k} - E_{n-1,k} \right)^2 \end{aligned}$$
(7)
It is calculated by comparing the discrete Fourier transform coefficients in the current frame (\(E_{n,k}\)) with those in the previous frame (\(E_{n-1,k}\)).
Spectral rolloff
The spectral rolloff (\(S\text{-}ROLL\)), Eq. (8), finds the frequency \(f_R\) at which the accumulated energy is no less than a proportion S of the total energy58.
$$\begin{aligned} S\text{-}ROLL = \arg \min_{f_R \in \{1, \ldots, K\}} \left\{ \frac{\sum_{k=1}^{f_R} P_k}{\sum_{k=1}^{K} P_k} \ge S \right\} \end{aligned}$$
(8)
It utilizes the energy in each frequency band (\(P_k\)), the total number of frequency bands (K), and the proportion threshold (S). The equation identifies the smallest \(f_R\) that satisfies this condition, indicating the spectral cutoff point in the audio signal.
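As an illustration, the frequency-domain descriptors above map onto librosa's feature module; the sketch below is hedged in that the onset-strength envelope stands in for spectral flux, and the 0.85 rolloff proportion S is an assumed value rather than one taken from the paper.

```python
# Framewise frequency-domain descriptors roughly matching Eqs. (3)-(8).
# `y` is the pre-processed mono waveform.
import librosa
import numpy as np

def frequency_domain_features(y, sr=22050):
    S_mag, _ = librosa.magphase(librosa.stft(y))  # magnitude spectrogram
    return {
        "bandwidth": librosa.feature.spectral_bandwidth(S=S_mag, sr=sr),
        "centroid": librosa.feature.spectral_centroid(S=S_mag, sr=sr),
        "contrast": librosa.feature.spectral_contrast(S=S_mag, sr=sr),
        "flatness": librosa.feature.spectral_flatness(S=S_mag),
        # onset strength is a common stand-in for framewise spectral flux
        "flux": librosa.onset.onset_strength(S=librosa.amplitude_to_db(S_mag)),
        "rolloff": librosa.feature.spectral_rolloff(S=S_mag, sr=sr,
                                                    roll_percent=0.85),
    }
```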
Time-frequency domain
This section explores the time-frequency domain analysis of audio signals, essential for respiratory classification and COVID-19 detection. By examining cepstral and tonal features, such as Mel-frequency cepstral coefficients (MFCC), we can differentiate between healthy and COVID-19-affected respiratory sounds. These techniques capture critical changes in pitch and timbre, aiding in the development of non-invasive diagnostic tools.
Cepstral features Features in this domain are divided into cepstral (timbre or tone colour) and tonal (pitch) descriptors59.
Non-linear mel-frequency cepstrum (MFC) The non-linear Mel-frequency Cepstrum (MFC) is commonly used in respiratory classification as it captures the temporal frequency content of a signal. It has also been utilized for COVID-19 analysis.
Mel-frequency cepstral coefficients (MFCC)
The transformation of the signal is represented by Eq. (9). The logarithmic energy of the \(k\)-th of K coefficients at frame n is denoted by s(k).
$$\begin{aligned} \text{MFCC}_n = \sum_{k=1}^{K} s(k) \cos \left( \pi n (k - 0.5) \right) \end{aligned}$$
(9)
Tonal features Tonal features are based on the human perception of periodic pitch. They are relevant for analyzing respiratory signals affected by conditions like COVID-19, which may alter the pitch of inhalation and exhalation60.
Chroma energy normalized (C-ENS) C-ENS is a chroma abstraction that considers short-time statistics within chroma bands. It is designed to be resistant to timbre variations61.
Constant-Q chromagram (C-CQT) The constant-Q chromagram is obtained from a time-frequency representation using the constant-Q transform (C-CQT), which offers good frequency resolution for low frequencies62.
Short-time Fourier transform chromagram (STFT) The short-time Fourier transform chromagram is similar to C-CQT, but the initial transformation used is the Short-time Fourier Transform (STFT)63.
Tonnetz Tonnetz is a lattice graph that represents harmonic information. Geometric areas between points encode pitch, making the distances between points meaningful64.
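For reference, the cepstral and tonal descriptors above can all be extracted with librosa. In the minimal sketch below, the choice of 39 MFCCs matches the coefficient count used later in the MFCC-MLP branch, and all other settings are library defaults rather than values stated in the paper.

```python
# Cepstral (MFCC) and tonal (chroma/Tonnetz) descriptors via librosa;
# `y` is the pre-processed mono waveform.
import librosa

def time_frequency_features(y, sr=22050):
    return {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39),
        "c_ens": librosa.feature.chroma_cens(y=y, sr=sr),   # C-ENS
        "c_cqt": librosa.feature.chroma_cqt(y=y, sr=sr),    # constant-Q chroma
        "c_stft": librosa.feature.chroma_stft(y=y, sr=sr),  # STFT chroma
        "tonnetz": librosa.feature.tonnetz(y=y, sr=sr),
    }
```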
CoughFeatureRanker algorithm
An exceptional contribution of this study lies in the proposal of the CoughFeatureRanker algorithm, a novel approach designed to identify the most informative features extracted from cough signals. The objective of this approach is to rank features according to their relevance in detecting COVID-19, thereby reducing the dimensionality of the input space. This not only enhances the efficiency of the model but also improves its interpretability. The algorithmic intricacies of the CoughFeatureRanker are comprehensively elucidated in Algorithm 1, including its seamless integration within the overarching framework. The CoughFeatureRanker algorithm (Algorithm 1) is designed to rank cough audio features based on evaluation metrics, including ROC, Precision, and Recall. It utilizes machine learning algorithms including KNN65, SVM66, RF67, and LR68 to calculate these metrics for each feature. Based on their scores, the algorithm sorts the features in descending order and categorizes them into the time domain, tonal, spectral, and cepstral categories. The top-ranked features this algorithm identifies will be incorporated into our Cough2COVID-19 framework. Cough2COVID-19 is a multi-layer ensemble deep learning framework that aims to enhance the accuracy of COVID-19 detection based on cough analysis.

CoughFeatureRanker: ranking cough audio features
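A condensed sketch of this ranking logic is given below. It is our assumed reading of Algorithm 1, scoring each candidate feature with cross-validated ROC-AUC, precision, and recall under KNN, SVM, RF, and LR, then sorting by the mean score; the function and variable names are illustrative.

```python
# Hedged sketch of the CoughFeatureRanker logic with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

CLASSIFIERS = [KNeighborsClassifier(), SVC(probability=True),
               RandomForestClassifier(), LogisticRegression(max_iter=1000)]
METRICS = ["roc_auc", "precision", "recall"]

def cough_feature_ranker(features, y):
    """features: dict mapping feature name -> 2-D matrix (samples x dims).
    Returns (name, score) pairs sorted in descending order of mean score."""
    scores = {}
    for name, X in features.items():
        per_clf = []
        for clf in CLASSIFIERS:
            cv = cross_validate(clf, X, y, scoring=METRICS, cv=5)
            per_clf.append(np.mean([cv[f"test_{m}"].mean() for m in METRICS]))
        scores[name] = float(np.mean(per_clf))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```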
In our investigations, we systematically evaluated a compilation of 15 distinct audio features originating from three diverse signal domains. Our intent was to establish a uniform dimensionality69 across machine learning (ML) models, thereby ensuring harmonious applicability regardless of the audio sample’s temporal extent, which ranged from 1 to 30 seconds. For a comprehensive analysis, we computed seven cardinal summary statistics to encapsulate the feature distribution across frames, thereby fostering effective comparison. This set of statistics encompassed essential measures such as minimum, maximum, mean, median, variance, 1st quartile, and 3rd quartile. By employing these summary statistics, we were able to succinctly encapsulate the distribution patterns of the audio features, leading to a comprehensive representation of the underlying data dynamics. A judicious approach was adopted to safeguard against overfitting70 and uphold the veracity of our evaluations. Rather than concurrently assessing and ranking the complete feature set, we systematically considered and evaluated small, manageable feature subsets. This methodology allowed us to concentrate on well-defined subsets, ensuring meticulous scrutiny and mitigating the potential biases linked to overfitting phenomena.
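The pooling step described above can be written compactly; below is a minimal NumPy sketch of the seven summary statistics, assuming each framewise feature arrives as a (dims × frames) matrix.

```python
# Pool framewise features into a fixed-length vector, independent of clip
# duration (1-30 s), using the seven summary statistics named above.
import numpy as np

def summarize_frames(feat):
    """feat: (dims, frames) matrix -> (dims * 7,) summary vector."""
    stats = [np.min(feat, axis=1), np.max(feat, axis=1),
             np.mean(feat, axis=1), np.median(feat, axis=1),
             np.var(feat, axis=1),
             np.percentile(feat, 25, axis=1),   # 1st quartile
             np.percentile(feat, 75, axis=1)]   # 3rd quartile
    return np.concatenate(stats)
```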
The experiments were performed using a system equipped with an Intel Core i7-7700K 4.20GHz processor, 16GB RAM, and a GTX1060 6GB GPU. This hardware configuration provided the necessary computational resources for running the experiments and training the machine learning models. The i7-7700K processor ensured fast and efficient processing, while the 16GB RAM allowed for smooth execution of the experiments. The proposed method demonstrated a training time of 1.36 seconds per epoch, ensuring rapid iteration during model optimization. Furthermore, the prediction time for classifying a single cough audio signal was measured at 0.48 milliseconds, indicating the model’s capability for swift decision-making. The GTX1060 6GB GPU accelerated the training process, enabling efficient utilization of parallel computing capabilities for training deep learning models.
Multi-layer ensemble deep learning
Our study employed a sophisticated multi-layer ensemble methodology to leverage the capabilities of deep learning models in the context of COVID-19 detection71. This approach encompassed the training and amalgamation of multiple deep neural networks, each specializing in distinct facets of cough audio analysis. We provide an exhaustive discourse on the architecture, training regimen, optimization strategies employed for each network, and the ensemble technique adopted to synthesize their collective outcomes.
Within our investigation, we meticulously devised and assessed four distinct deep learning-based models for the purpose of COVID-19 detection. Our endeavour encompassed a meticulous exploration, analysis, and experimentation on 15 audio features, with the intention of extracting the most informative elements. These features underwent a rigorous ranking process executed by our pioneering CoughFeatureRanker algorithm.
From the pool of 15 audio features, our algorithm discerned that the MFCC, Spectrogram, and Chromagram features stood out as top-ranked. These specific features were deemed to offer superior performance and robust discriminatory capabilities, rendering them pivotal and invaluable for accurate COVID-19 detection.
MFCC-MLP
Our methodology for extracting the MFCC coefficients follows a sequence of steps. Initially, the coughing audio waveform is resampled to a sampling rate of 22.5 KHz. Subsequently, the resampled signal undergoes feature extraction utilizing a hop length of 23 ms and a window length of 93 ms. A Hann window type is employed during this process to enhance accuracy. Upon the acquisition of the MFCC features, further processing is conducted. These features are averaged along the time axis, culminating in a compact set of 39 one-dimensional coefficients. These coefficients encapsulate crucial details of the coughing audio signal. We engineered a streamlined Multi-Layer Perceptron (MLP) network for our model development, illustrated in Fig. 1. The network architecture comprises four fully connected (FC) layers and a singular output layer. Each FC layer contains a distinct number of nodes: 1024, 2048, 512, and 512 nodes, respectively. Rectified Linear Unit (ReLU) activation functions and dropout layers are thoughtfully integrated into each FC layer to introduce non-linearity and mitigate overfitting. The ultimate layer of the MLP network is a dense layer comprising a solitary node, which is activated by a Sigmoid function. This node’s output signifies the probability of the cough signal indicating the presence of COVID-19. Through the training process on labelled coughing audio data, the network can categorize cough signals according to their likelihood of being linked to the disease. The outlined approach integrates MFCC feature extraction with a compact MLP network to establish a dependable detection mechanism for COVID-19, hinging on the analysis of coughing audio signals. Figure 1 provides a visual representation of the model we put into use.
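A plausible Keras rendering of this MFCC-MLP is sketched below; the layer widths and activations follow the text, while the 0.3 dropout rate, Adam optimizer, and binary cross-entropy loss are assumptions not stated in the paper.

```python
# Sketch of the described MFCC-MLP: four FC layers of 1024/2048/512/512
# ReLU units with dropout, over the 39 time-averaged MFCC coefficients.
from tensorflow.keras import layers, models

def build_mfcc_mlp(input_dim=39, dropout=0.3):
    model = models.Sequential([layers.Input(shape=(input_dim,))])
    for units in (1024, 2048, 512, 512):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="sigmoid"))  # P(COVID-19 positive)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```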
Spectrogram-CNN
Spectrograms have demonstrated their efficacy in various tasks, including speech recognition, speaker verification, and speech enhancement. In our spectrogram extraction process, we leverage the librosa library, building upon the foundation of the previously acquired MFCC coefficients. These coefficients are pivotal in generating spectrograms, providing visual representations of audio signals. Subsequently, the generated spectrograms are resized to a standardized dimension of 128 × 40. Ensuring consistent input across the network, we further normalize the spectrograms within the range of [0, 1].
Given the two-dimensional nature of spectrograms, we architect a streamlined Convolutional Deep Neural Network (CDNN) inspired by the influential VGG network. Our CDNN design entails three convolutional layers, a flattened layer, three fully connected (FC) layers, and a single output layer for classification72. The convolutional layers encompass composite functions: a convolutional function with filter sizes of 32, 64, and 64, sequentially followed by a ReLU activation layer, a max pooling layer with a filter size of 2 × 2 and a stride of 2, and a Batch Normalization layer. In contrast, the FC layers integrate 256, 64, and 64 nodes, each accompanied by ReLU activation functions and dropout layers. The ultimate dense layer consists of a solitary node with a Sigmoid activation function. To accommodate the spectrogram images, they are resized to a dimension of 128 × 40 before input into the network. The model we utilized is visually depicted in Fig. 1.
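The sketch below is one plausible Keras realization of this Spectrogram-CNN; the conv block sizes, pooling, batch normalization, and FC widths follow the text, whereas the 3 × 3 kernel size, single-channel 128 × 40 input, 0.3 dropout rate, and training configuration are assumptions.

```python
# VGG-style CNN sketch: three conv blocks (32/64/64 filters, ReLU, 2x2 max
# pooling with stride 2, batch norm), then FC layers of 256/64/64 with
# dropout and a sigmoid output node.
from tensorflow.keras import layers, models

def build_spectrogram_cnn(input_shape=(128, 40, 1), dropout=0.3):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 64):
        model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=2))
        model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    for units in (256, 64, 64):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```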
Chromagram-MLP
The Chromagram audio feature potentially aids in COVID-19 detection by analyzing cough and respiratory sounds.
$$\begin{aligned} x_i = \frac{\sum_{f \in \text{Pitch Class } i} m(f)}{\sum_{f} m(f)} \end{aligned}$$
(10)
In Eq. (10), \(x_i\) represents the chroma value for pitch class i, and m(f) represents the spectral magnitude at frequency f. For our study, we adopted a distinct approach by extracting 12-element 1D features from each cough audio signal through librosa. These features were then utilized as inputs for our custom neural network architecture. Like the MFCC-MLP model, our model comprises four fully connected (FC) layers. These FC layers consist of 1024, 2048, 512, and 512 nodes, respectively, incorporating ReLU activation functions and featuring dropout layers for regularization. The ultimate dense layer, culminating in our network, comprises a sole node activated by the Sigmoid function. Figure 1 illustrates the model we employed for our study.
Ensemble Cough2COVID-19 approach
We have successfully formulated a robust Ensemble Convolutional Neural Network (CNN) architecture, named Cough2COVID-19, meticulously crafted to precisely detect COVID-19 from cough audio signals. Our design synergizes diverse neural features, harnessed through the CoughFeatureRanker algorithm, originating from distinct domains: the Mel-frequency cepstral coefficient (MFCC) features, spectrogram images, and chroma features (chromagram) extracted from coughing audio signals. This fusion of multiple neural characteristics significantly amplifies the accuracy and efficacy of COVID-19 detection within cough audio signals. Specifically, the MFCC branch contributes 83%-89% accuracy, with chromagram features further boosting it to 86%-89%. Spectrogram features demonstrate a slightly lower performance, contributing 73%-86%. However, the Cough2COVID-19 (MLEDL) model, which integrates all these characteristics, achieves superior results, contributing 93%-98% in accuracy across multiple datasets.
Figure 1 and Table 5 provide a detailed representation of the architectural framework that supports the Cough2COVID-19 Multi-Layer Ensemble Deep Learning Network (MLEDN). This innovative construct encompasses three distinct branches, each proficient in extracting unique and discerning neural attributes from the aforementioned sources. The extracted neural attributes are then amalgamated via concatenation and fed into the classification network for advanced processing and meticulous analysis. Complete hyperparameter details for the proposed architecture are given in Table 5.
In the primary segment of the proposed architecture, significant neural attributes, denoted as \(F_n^1 \in \mathbb{R}^{C'_1}\), are derived from the Mel-frequency cepstral coefficients (MFCC) source. This extraction is accomplished through a two-layer dense node configuration. The initial dense layer encompasses 512 nodes activated by the Rectified Linear Unit (ReLU) function and is supplemented by a Dropout layer. Subsequently, the second dense layer integrates 256 nodes with ReLU activation, accompanied by another Dropout layer, where \(C'_1 = 256\). The incorporation of Dropout layers effectively addresses potential overfitting concerns, ensuring an optimal balance between performance and generalization capabilities.
The second branch is dedicated to extracting neural features, denoted as \(F_n^2 \in \mathbb{R}^{C'_2}\), from spectrogram images sized \(128 \times 40\). This branch encompasses a network comprising three composite function layers, a flattened layer, a dense layer with 256 nodes, and a Dropout layer. Each composite function layer consists of a convolutional layer with filter sizes of 32, 64, and 64, sequentially followed by a ReLU activation layer, a max pooling layer with a \(2 \times 2\) filter size and a stride of 2, and a Batch Normalization layer, where \(C'_2 = 256\).
The third branch shares a similar architecture with the MFCC branch and is tailored for extracting neural features \(F_n^3 \in \mathbb{R}^{C'_3}\) from Chroma-based features. It comprises two dense layers with 512 and 256 nodes, respectively, both employing a ReLU activation function and a Dropout layer, where \(C'_3 = 256\).
The extracted neural features from the three branches are combined to generate a composite neural feature vector. This can be expressed in the Eq. (11).
$$\begin{aligned} \mathcal{F} = \left[ F_n^1; F_n^2; F_n^3\right] \end{aligned}$$
(11)
The composite neural feature vector \(\mathcal{F}\) of size \(768 \times 1\) is obtained by concatenating the extracted neural features \(F_n^1 \in \mathbb{R}^{256}\), \(F_n^2 \in \mathbb{R}^{256}\), and \(F_n^3 \in \mathbb{R}^{256}\) from the MFCCs, spectrogram images, and Chroma-based features, respectively. This composite vector is then fed into the classification network, which consists of a shallow network comprising two dense neural blocks. Each dense neural block consists of 64 filters with ReLU activations and Dropout layers. The final node in the network is a single-unit neural block with a Sigmoid function, providing the probability of a given coughing audio signal being COVID-19 positive. In Fig. 1, the complete structural layout of our model is visually encapsulated.
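To make the fusion concrete, below is a hedged Keras functional-API sketch of this multi-branch design. Layer sizes follow the text (512/256 dense branches, 32/64/64 conv blocks, 768-dim concatenation, two 64-unit classification blocks); the 3 × 3 kernels, 0.3 dropout rate, and Adam optimizer are assumptions not stated in the paper.

```python
# Multi-layer ensemble sketch: three branches -> 256-dim vectors each,
# concatenated into the 768-dim composite feature of Eq. (11).
from tensorflow.keras import layers, models

def build_cough2covid19(dropout=0.3):
    # Branch 1: MFCC -> Dense 512 -> Dense 256
    mfcc_in = layers.Input(shape=(39,), name="mfcc")
    x1 = layers.Dropout(dropout)(layers.Dense(512, activation="relu")(mfcc_in))
    x1 = layers.Dropout(dropout)(layers.Dense(256, activation="relu")(x1))

    # Branch 2: spectrogram -> three conv blocks -> flatten -> Dense 256
    spec_in = layers.Input(shape=(128, 40, 1), name="spectrogram")
    x2 = spec_in
    for filters in (32, 64, 64):
        x2 = layers.Conv2D(filters, (3, 3), padding="same",
                           activation="relu")(x2)
        x2 = layers.MaxPooling2D((2, 2), strides=2)(x2)
        x2 = layers.BatchNormalization()(x2)
    x2 = layers.Flatten()(x2)
    x2 = layers.Dropout(dropout)(layers.Dense(256, activation="relu")(x2))

    # Branch 3: chromagram -> Dense 512 -> Dense 256
    chroma_in = layers.Input(shape=(12,), name="chromagram")
    x3 = layers.Dropout(dropout)(layers.Dense(512, activation="relu")(chroma_in))
    x3 = layers.Dropout(dropout)(layers.Dense(256, activation="relu")(x3))

    # Eq. (11): concatenate to the 768-dim composite feature vector
    f = layers.Concatenate()([x1, x2, x3])
    for units in (64, 64):
        f = layers.Dropout(dropout)(layers.Dense(units, activation="relu")(f))
    out = layers.Dense(1, activation="sigmoid")(f)  # P(COVID-19 positive)

    model = models.Model([mfcc_in, spec_in, chroma_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```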