Table 1.
Characteristics of the voice dataset.
Fig 1.
Architecture of the 1D CNN (1-Dimensional Convolutional Neural Network) model.
The diagram shows how raw audio features pass through the feature extraction layers to the final classification layer, an arrangement suited to sequential time-series acoustic data.
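As a rough illustration of the flow in this caption, the sketch below runs a toy signal through one convolutional layer, a ReLU, global max pooling, and a logistic output unit. The layer count, kernel values, weights, and function names are all hypothetical; the caption does not specify the actual configuration of the model.

```python
import math

def conv1d(signal, kernel, bias=0.0):
    """Valid (no padding) 1D cross-correlation of a signal with one kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]

def relu(xs):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in xs]

def global_max_pool(xs):
    """Collapse one feature map to its largest activation."""
    return max(xs)

def classify(features, weights, bias):
    """Logistic output unit: probability of the positive class."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy input and hypothetical parameters (not taken from the study).
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
kernels = [[1.0, -1.0], [0.5, 0.5]]

feature_maps = [relu(conv1d(signal, k)) for k in kernels]
pooled = [global_max_pool(fm) for fm in feature_maps]
prob = classify(pooled, weights=[1.0, -1.0], bias=0.0)
```

Each kernel slides along the time axis, so the same filter responds to a pattern wherever it occurs in the sequence; pooling then reduces every feature map to a fixed-size input for the classifier.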
Fig 2.
Architecture of the AST (Audio Spectrogram Transformer) model.
This schematic shows how a 2D audio spectrogram is divided into patches that are then processed by transformer encoder blocks, whose self-attention mechanisms learn global acoustic context.
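The patch-splitting step can be sketched as follows. This is a minimal illustration on a toy 4 x 6 grid; the patch size, the stride, and the function name `split_into_patches` are assumptions chosen for readability, not the settings used in the study.

```python
def split_into_patches(spec, patch_h, patch_w, stride_h=None, stride_w=None):
    """Split a 2D spectrogram (list of rows) into flattened patches.

    Strides default to the patch size, which gives non-overlapping patches;
    smaller strides give overlapping ones.
    """
    stride_h = stride_h or patch_h
    stride_w = stride_w or patch_w
    n_rows, n_cols = len(spec), len(spec[0])
    patches = []
    for top in range(0, n_rows - patch_h + 1, stride_h):
        for left in range(0, n_cols - patch_w + 1, stride_w):
            patch = [
                spec[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ]
            patches.append(patch)
    return patches

# Toy "spectrogram": 4 frequency bins x 6 time frames (hypothetical values).
spec = [[r * 6 + c for c in range(6)] for r in range(4)]
patches = split_into_patches(spec, patch_h=2, patch_w=2)
```

Each flattened patch would then be linearly projected to an embedding and combined with a positional embedding before entering the encoder, so that self-attention can relate any patch to any other regardless of their distance in time or frequency.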
Fig 3.
Architecture of the Wav2Vec 2.0 model.
The figure shows the self-supervised learning pipeline: a CNN block first encodes the raw speech waveform, and a transformer network then learns deep contextualized representations from the encoded features.
Table 2.
Performance metrics and statistical comparisons for MCI and AD classification models.
Fig 4.
Confusion matrices for the Wav2Vec 2.0 and 1D CNN (Spectrogram) models across binary classification tasks.
(A) Wav2Vec 2.0 for NC vs. MCI. (B) Wav2Vec 2.0 for NC vs. AD. (C) 1D CNN (Spectrogram) for NC vs. MCI. (D) 1D CNN (Spectrogram) for NC vs. AD. Values represent aggregated predictions across all five cross-validation folds. NC, Normal Cognition; MCI, Mild Cognitive Impairment; AD, Alzheimer’s Disease.
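For readers unfamiliar with aggregated confusion matrices, the sketch below shows one way such a matrix could be formed: a matrix is computed for each fold's held-out predictions and the counts are summed cell by cell. The labels are toy data (0 = NC, 1 = the impaired class) over two folds, not results from the study, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def aggregate_folds(folds, n_classes=2):
    """Sum the per-fold confusion matrices element-wise."""
    total = [[0] * n_classes for _ in range(n_classes)]
    for y_true, y_pred in folds:
        fold_matrix = confusion_matrix(y_true, y_pred, n_classes)
        for i in range(n_classes):
            for j in range(n_classes):
                total[i][j] += fold_matrix[i][j]
    return total

# Toy (true, predicted) labels for two folds; illustrative only.
folds = [
    ([0, 0, 1, 1], [0, 1, 1, 1]),
    ([0, 1, 1, 0], [0, 1, 0, 0]),
]
aggregated = aggregate_folds(folds)
```

Because every sample appears in exactly one held-out fold, the summed matrix counts each sample once, and its entries total the size of the dataset.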