
Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings

  • Daniel M. Low ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Writing – original draft

    Affiliations Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, Massachusetts, United States of America, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America

  • Vishwanatha Rao,

    Roles Data curation, Formal analysis, Writing – original draft

    Affiliations Department of Biomedical Engineering, Columbia University, New York, New York, United States of America, Department of Otolaryngology–Head and Neck Surgery, Massachusetts Eye and Ear Infirmary, Boston, Massachusetts, United States of America

  • Gregory Randolph,

    Roles Writing – review & editing

    Affiliations Department of Otolaryngology–Head and Neck Surgery, Massachusetts Eye and Ear Infirmary, Boston, Massachusetts, United States of America, Department of Otolaryngology–Head and Neck Surgery, Harvard Medical School, Boston, Massachusetts, United States of America

  • Phillip C. Song ,

    Contributed equally to this work with: Phillip C. Song, Satrajit S. Ghosh

    Roles Conceptualization, Data curation, Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Department of Otolaryngology–Head and Neck Surgery, Massachusetts Eye and Ear Infirmary, Boston, Massachusetts, United States of America, Department of Otolaryngology–Head and Neck Surgery, Harvard Medical School, Boston, Massachusetts, United States of America

  • Satrajit S. Ghosh

    Contributed equally to this work with: Phillip C. Song, Satrajit S. Ghosh

    Roles Conceptualization, Funding acquisition, Methodology, Software, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, Massachusetts, United States of America, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America, Department of Otolaryngology–Head and Neck Surgery, Harvard Medical School, Boston, Massachusetts, United States of America


Detecting voice disorders from voice recordings could allow for frequent, remote, and low-cost screening before costly clinical visits and a more invasive laryngoscopy examination. Our goals were to detect unilateral vocal fold paralysis (UVFP) from voice recordings using machine learning, to identify which acoustic variables were important for prediction in order to increase trust, and to determine model performance relative to clinician performance. Patients with UVFP confirmed through endoscopic examination (N = 77) and controls with normal voices matched for age and sex (N = 77) were included. Voice samples were elicited by reading the Rainbow Passage and sustaining phonation of the vowel "a". Four machine learning models of differing complexity were used. SHapley Additive exPlanations (SHAP) was used to identify important features. The highest median bootstrapped ROC AUC score was 0.87, exceeding clinicians' performance on the same recordings (range: 0.74–0.81). However, recording durations differed between UVFP recordings and controls because of how the data were originally processed and stored, and we show that duration alone can classify the two groups. Counterintuitively, many UVFP recordings also had higher intensity than controls, even though UVFP patients tend to have weaker voices, revealing a dataset-specific bias that we mitigate in an additional analysis. We demonstrate that recording biases in audio duration and intensity created dataset-specific differences between patients and controls, which models used to improve classification. Clinicians' ratings provide further evidence that patients were over-projecting their voices or were being recorded at a higher signal amplitude than controls. Notably, after matching audio duration and removing variables associated with intensity to mitigate these biases, the models still achieved similarly high performance.
We provide a set of recommendations to avoid bias when building and evaluating machine learning models for screening in laryngology.

Author summary

The diagnosis of certain voice disorders can involve costly and time-consuming methods such as video laryngoscopy. An alternative is to screen using machine learning models that predict risk given just a short audio recording from a mobile device. However, these models can be biased if they detect recording idiosyncrasies of a given dataset that would not generalize to new samples with a different recording protocol, making the model unusable. These types of biases are not always evaluated in clinical machine learning studies. We found that a model we trained to detect unilateral vocal fold paralysis from healthy voices from brief audio recordings was biased: patients with a softer voice may have been induced to over-project their voice to obtain clearer recordings or the gain on the microphone may have been increased only for these participants, creating a bias that is unlikely to generalize. We demonstrate how to detect such biases using explainable machine learning and clinician ratings as well as how to potentially mitigate the effect of the bias. We also provide recommendations for identifying and mitigating bias in machine learning models that use audio recordings for screening in laryngology in general.


Voice recordings provide a rich source of information related to vocal tract physiology and human physical and mental health. Given advances in smartphones and wearables, these recordings can be made anytime and anywhere. Thus, the search for disorder-specific acoustic biomarkers has been gaining momentum. Voice biomarkers have been reported for detecting Parkinson's disease [1] as well as psychiatric disorders including depression, schizophrenia, and bipolar disorder (for a systematic review, see Low et al., 2020 [2]). Given our scientific understanding of the complexity of speech production, multiple acoustic features have been devised for use in machine learning models. In Fig 1, we describe a schematic of speech production and the process of extracting certain acoustic features from an audio signal (see also Quatieri, 2008 [3]), which is an important part of explaining how pathophysiology would affect the acoustic features used in machine learning classifiers. Panel (A) depicts speech as the result of the neural coordination of three subsystems: the respiratory system (lungs), the laryngeal system (vocal folds), and the resonatory system of the vocal tract (pharynx, oral cavity, nasal cavity, articulators, and subglottal effects). Speech production requires air flow from the lungs to generate sound sources that are filtered by the vocal tract. Panel (B) captures the fact that environmental, microphone, and digital sampling characteristics (e.g., background noise, microphone gain, sampling rate) can affect acoustic features. Panel (C) shows the waveform of the audio signal, representing areas of compression (positive amplitude; higher air pressure) and rarefaction (negative amplitude; lower air pressure). Higher amplitudes can lead to higher perceived loudness. Prosodic features arise from changes over longer segments of time, which are perceived as the rhythm, stress, and intonation of speech.
A segment of the waveform is shown in the right panel, indicating a periodic signal from the vocal folds. Panel (D) shows that, for a given time window, a spectrum (right panel) can be obtained through a fast Fourier transform (FFT), which represents the magnitude of the frequencies in the signal with peaks (formants F1–F3) due to vocal tract filtering of the source signal produced by the vocal folds. The spectrogram (left panel) is a representation of the spectrum as it varies over time and can be obtained through a short-term Fourier transform (STFT). The approximate locations of F0 and the first formants are displayed. Finally, panel (E) shows that it is possible to separate source and filter components by computing the inverse FFT of the log magnitude of the spectrum, called the cepstrum (right panel). The peak in the cepstrum reflects periodic glottal fold vibration, while lower quefrency components reflect properties of the resonatory subsystem. For speech recognition, Mel filters are applied to the spectrum to better approximate human hearing. A conversion of the Mel spectrum to a cepstrum using a discrete cosine transform (DCT) generates mel-frequency cepstral coefficients (MFCCs). Similar to the cepstrum, lower MFCCs track vocal-tract filter information.

Fig 1. Schematic of speech production and the process of extracting certain acoustic features from an audio signal.

(A) Speech production, (B) recording characteristics, (C) waveform of audio signal with fundamental frequency (f0), (D) spectrogram with formants F1-F3 and intensity, (E) mel-frequency cepstral coefficients (MFCCs). Full description in the main text.
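As a minimal numerical sketch of the spectrum-to-cepstrum step described for panel (E) (using an arbitrary synthetic 120 Hz source, not data or code from the study), the cepstral peak recovers the period of a harmonic signal:

```python
import numpy as np

# Synthetic "glottal" source: a 120 Hz harmonic signal in a 40 ms window.
# All parameter values here are illustrative, not those used in the study.
sr = 16000                      # sampling rate (Hz)
t = np.arange(0, 0.04, 1 / sr)  # 40 ms analysis window
f0 = 120.0                      # fundamental frequency of the source
signal = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, 6))

spectrum = np.fft.rfft(signal * np.hanning(len(signal)))
log_mag = np.log(np.abs(spectrum) + 1e-10)   # log-magnitude spectrum
cepstrum = np.fft.irfft(log_mag)             # inverse FFT -> cepstrum

# The cepstral peak at quefrency ~1/f0 reflects glottal periodicity.
quefrency = np.arange(len(cepstrum)) / sr
peak_region = (quefrency > 1 / 300) & (quefrency < 1 / 60)  # 60-300 Hz pitch range
peak_quefrency = quefrency[peak_region][np.argmax(cepstrum[peak_region])]
print(f"estimated f0: {1 / peak_quefrency:.1f} Hz")  # should recover roughly 120 Hz
```

The peak search is restricted to a typical pitch range (60–300 Hz here) so that low-quefrency vocal-tract components are excluded.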

Furthermore, while machine learning (ML) can be a powerful and successful approach for diagnostics, ML models are often treated as "black boxes". It can be difficult to determine how a model makes a decision, that is, how it combines input features from a given patient to generate a prediction. This is particularly worrisome given that ML algorithms can detect and exploit unintended or clinically irrelevant relationships, introducing bias that may be difficult to anticipate. Explainable ML refers to a series of methods and quantitative analyses for uncovering and "explaining" the rationale behind the decisions made by complex algorithms, which is particularly critical for the high-stakes decisions of medicine in order to increase trust among clinicians and patients [4].

There are many challenges in applying acoustic analysis to detect specific disorders. Voice characteristics are highly varied and change over time. Laryngeal pathology, age, gender, size, weight, general state of health, smoking/vaping, and medications can impact vocal acoustic characteristics. Diseases of the larynx and phonatory system (i.e., larynx, resonating structures, lungs) and/or the neurological system will also affect the voice. Compensatory production strategies and environmental conditions can also change the vocal signal. Furthermore, because hoarseness is such a frequent occurrence and specialty voice centers are rare, vocal fold disorders are often undiagnosed, under-reported, or misdiagnosed [5].

We chose vocal fold paralysis as the study cohort for several reasons. First, it is clinically important. UVFP can have detrimental effects on voice and quality of life, with resultant morbidity related to respiration, swallowing, and aspiration [6]. Vocal fold paralysis may occur due to iatrogenic injury, malignancy, idiopathic causes, or neurological disease [7]. Overall, surgical iatrogenic injury accounts for 46% of all UVFP in adults, and thyroid and parathyroid surgeries are responsible for 32% of postsurgical UVFP [8]. There is a significant need for a screening tool for the diagnosis and tracking of UVFP because of the high impact of this condition on productivity and quality of life. Screening could be done remotely and frequently, especially when surgical specialists and laryngeal exams are not readily accessible due to geographical, financial, and other barriers [9]. Using an explainable ML model as a screening tool for UVFP can provide greater clarity as to who most needs laryngoscopy and provide insight into the key voice characteristics related to the pathophysiology [10–14]. The costs associated with UVFP relate not only to patient morbidity and diminished quality of life but also to the economic burden placed on our healthcare system. Greater lengths of hospitalization and increased hospital costs have been associated with postsurgical VFP [15,16]. Access to specialists for diagnosis is limited, and early detection and management of UVFP appear to improve length of stay and surgical outcomes [17]. Special consideration should be given to what the model can actually classify: a model that generalizes well in classifying UVFP versus controls may not be able to screen for UVFP among other voice disorders, but it could be used to monitor UVFP patients remotely and affordably during treatment, or to detect risk for UVFP when it is the most likely cause, such as dysphonia after thyroid surgery.

Furthermore, UVFP is an ideal model for demonstrating the explainability of ML. First, UVFP occurs when the mobility of a single vocal fold is impaired as a consequence of neurological injury, and diagnosis is consistently verified through routine laryngoscopy; therefore, ground-truth labels are available. Second, the clinical signs of UVFP are well described. These characteristics include a weak, breathy voice quality, early vocal fatigue, reduced cough strength, and aspiration with thin liquids [18,19]. Therefore, the acoustic differences between UVFP patients and healthy controls can be interpreted with regard to perceptual symptoms and a well-understood pathophysiology. In contrast, explaining the important variables for predicting a disorder that is hard to diagnose (e.g., has low inter-rater reliability) and has an unclear pathophysiology would ironically result in a poor explanation, because it would be puzzling how, or even whether, the disorder could modulate the important acoustic variables. Of course, machine learning models can also offer novel explanations of a disorder by characterizing novel characteristics. However, if these models use high-dimensional feature vectors, they are more likely to overfit on small datasets [20,21], which should lead to more skepticism of such novel explanations.

There have been several studies detecting unilateral vocal fold paralysis (UVFP) using machine learning [22–30]; however, most have included the disorder among a set of voice disorders to be predicted. The limitations of these prior studies fall into one of the following types: not reporting performance for the subset of participants with UVFP among the participants with dysphonia being detected; small sample sizes, given that most studies contained 10 or fewer participants with UVFP, with one study containing 50 [31]; a lack of algorithmic explanation, whether by not reporting the relative importance of each acoustic variable, by feeding input data such as spectrograms into black-box deep learning models (where image-based explanation attempts such as saliency maps are more opaque than feature importance computed over handcrafted features), or by using a black-box model such as a neural network without attempting to explain its predictions with deep learning explainability methods [32]; using a single type of model, which may pick up on certain types of patterns but miss others, leading to incomplete conclusions about feature importance; using only a few features, which may impede better predictive performance by not capturing certain relevant information; and/or not publicly sharing models or data to help test their generalizability on new data.

The objectives of our study were: to detect UVFP using ML; to evaluate the effectiveness of different models in differentiating the acoustic signals between patients with UVFP and patients with normal functioning vocal folds (i.e., controls); to explain which features are most important to the diagnostic models and examine the pathophysiological relevance; and to compare performance to human clinicians evaluating audio recordings. To achieve these objectives, we evaluated four different classes of machine learning algorithms to assess classification performance, obtained the minimal set of features necessary for detection, and identified the most important acoustic features for model construction after removing redundant features. Ultimately, we wanted to see if the most important features identified by the machine learning models matched clinically-known relevant acoustic changes.

Materials and methods

This study was approved by the Institutional Review Board at Massachusetts Eye and Ear Infirmary and Partners Healthcare (IRB 2019002711).

Participants and voice samples

Through retrospective chart analysis from 2009 to 2019, a total of 1043 charts were reviewed of patients from a tertiary care laryngology practice who had undergone endoscopic evaluation and voice testing. Of those, 53 patients with confirmed UVFP were identified. They had documented vocal fold paralysis by endoscopic examination and had undergone acoustic analysis as part of routine clinical care. Each patient had four acoustic recordings: three sustained vocalizations of the vowel "a" (/ɑ/ in the International Phonetic Alphabet) and a reading of the introductory paragraph of the Rainbow Passage [33]. The acoustic recordings were all taken in an acoustically shielded room. For each of these 53 patients, a board-certified otolaryngologist reviewed the clinical history, video laryngoscopy, and audio samples to confirm that they were correctly classified as having UVFP. Voice samples from an additional 24 patients being treated for UVFP were collected prospectively using mobile software (OperaVOX™) on an iPad. These patients provided the same four acoustic recordings as the patients from the retrospective chart review. This combination of data collection yielded a total of 77 UVFP patients for analysis, of which 48 had left UVFP and 29 had right UVFP.

All of the patients were then matched with control samples from a database of patients without UVFP who had also undergone acoustic analysis. Each control was of the same sex and smoking status as the matched UVFP patient, was within three years of age, and had documented laryngeal examinations verifying the absence of vocal fold mucosal pathology. Controls were excluded if they had prior laryngeal surgery, vocal fold lesions, radiation, head and neck cancer, or neurological disease. The controls provided the same four acoustic recordings as the retrospectively gathered UVFP group. A board-certified otolaryngologist confirmed that the voice recordings and video laryngoscopies of these controls were within normal expectations.

The reading samples were divided into thirds to match the number of vowel production samples, resulting in six samples for most participants. Reading recordings were not available for three patients, and three patient vowel samples were removed because they contained multiple vowel productions or a cough. The final dataset analyzed is described in Table 1.

Reading+vowel refers to including all samples (i.e., ~6 samples) from the same participant, with the goal of either obtaining higher performance or discovering features that vary with diagnosis consistently across tasks. Mean (SD) audio lengths were 6.81 s (5.47) for reading samples and 3.95 s (1.00) for vowel samples. The audio samples were processed using OpenSmile with the eGeMAPS configuration file (article [34], source code [35]), which applies different summarization statistics to each time series depending on the feature, resulting in 88 features per sample covering information related to the vocal folds (F0, jitter, shimmer), intensity (loudness, HNR), vocal tract (F1–3 frequency, bandwidth, amplitude), spectral balance (alpha ratio, Hammarberg index, spectral slope, MFCC 1–4, spectral flux), and prosody (voiced and unvoiced segments, loudness peaks per second). See section Text A in S1 Appendix ("eGeMAPS features") for the full list.
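To illustrate the summarization idea (this is not the OpenSmile implementation; the statistics shown are only a subset of the eGeMAPS functionals, and the f0 track below is synthetic), each frame-level contour is collapsed into per-recording statistics:

```python
import numpy as np

# Synthetic frame-level f0 contour (Hz); a real pipeline would extract
# this, plus loudness, MFCCs, etc., from each recording.
rng = np.random.default_rng(0)
f0_track = 180 + 10 * rng.standard_normal(200)

def functionals(contour):
    """Collapse a frame-level contour into per-recording summary features."""
    return {
        "mean": float(np.mean(contour)),
        "coef_variation": float(np.std(contour) / np.mean(contour)),
        "pct_20": float(np.percentile(contour, 20)),
        "pct_80": float(np.percentile(contour, 80)),
    }

f0_features = functionals(f0_track)
```

Applying such statistics across all low-level descriptors is what yields a fixed-length feature vector (88 features under eGeMAPS) regardless of recording duration.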

Machine learning models of increasing complexity

With the goal of classifying voice recordings as either UVFP or controls, we used four machine learning algorithms of increasing complexity from scikit-learn (v0.21.3) via the pydra-ml (v0.3.1) toolbox [36] (default parameters were used unless otherwise specified). By complexity we mean that models are more complex if they are harder to simulate, that is, if it is harder to take the input data and model parameters and step through every calculation required to produce a prediction in a reasonable time, which increases with the number of parameters and interactions [37].

  1. Logistic Regression: a simple linear model that is constrained to use few features due to an L1 penalty, making it the simplest model (the "liblinear" solver was used, which is well suited to smaller datasets).
  2. Stochastic Gradient Descent (SGD) Classifier: we used the log loss, which implements logistic regression; therefore, it is also a linear model, but it tends to use more features due to an elastic net penalty, making it slightly more complex (the max_iter parameter was set to 5000 and early_stopping was set to True).
  3. Random Forest: an algorithm that fits simpler decision trees (i.e., weak learners) on feature subsets and then takes the majority vote of the decision trees' predictions to create a stronger learner, making it harder to interpret which features are important across trees.
  4. Multi-Layer Perceptron: a neural network classifier which incorporates, in our case, 100 perceptrons (artificial neurons) connected to each input feature through weights, with a ReLU activation function to capture nonlinear relationships in the data. It is not possible to know exactly how the hundreds of internal weights interact to determine feature importance, making the model difficult to interpret directly from its parameters (the max_iter parameter was set to 1000; alpha, the L2 penalty parameter, was set to 1).

To generate independent train and test data splits, a bootstrapped group shuffle split sampling scheme was used. Bootstrapping is better suited than cross-validation for smaller datasets and provides a measure of uncertainty through a confidence interval [38]. For each bootstrapping iteration, a random selection of 20% of the participants, balanced between the two groups, was used to create a held-out test set. The remaining 80% of participants were used for training. This process was repeated 50 times, and the four classifiers were fitted and tested on each train/test split. We used the default of 50 bootstrapping splits from pydra-ml to reduce computational time: median ROC AUC stabilized at around 40 splits for logistic regression models across tasks, approximating values obtained with larger numbers of splits while reducing runtime (see Figure A in S1 Appendix). The Area Under the Receiver Operating Characteristic Curve (ROC AUC; perfect classification = 1; chance = 0.5) was computed to evaluate the performance of the models on each bootstrapping iteration, resulting in a distribution of 50 ROC AUC scores for each classifier. To ensure results were not due to scikit-learn's default hyperparameter settings, hyperparameter tuning was performed on the main models using all features and achieved performance similar to the non-tuned models (see Table A in S1 Appendix). Since the focus of our study is identifying bias rather than achieving a small increment in performance, and given the large number of models, analyses, and bootstrapping samples, we used default parameters. Additionally, for each iteration, each classifier was trained with randomized patient/control labels to generate a null distribution of ROC AUC scores (i.e., a permutation test). 
Each model's performance was statistically compared to its null distribution using an empirical p-value, a common and effective measure for evaluating classifier performance (see Definition 1 in [39]). The significance level was set to alpha = 0.05.
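A minimal sketch of this evaluation scheme on synthetic data (fewer participants and splits than the study; the participant-level grouping ensures recordings from one participant never span train and test):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic cohort: 15 participants per group, 4 recordings each.
rng = np.random.default_rng(0)
n_per_group, samples_per = 15, 4
part_labels = np.array([0] * n_per_group + [1] * n_per_group)
groups = np.repeat(np.arange(2 * n_per_group), samples_per)  # participant id per sample
y = np.repeat(part_labels, samples_per)
X = rng.standard_normal((len(y), 10)) + y[:, None]           # informative features

true_aucs, null_aucs = [], []
for _ in range(20):
    # Hold out 20% of participants, balanced across groups.
    held = np.concatenate([rng.choice(np.where(part_labels == g)[0], 3, replace=False)
                           for g in (0, 1)])
    test = np.isin(groups, held)
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    clf.fit(X[~test], y[~test])
    true_aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    clf.fit(X[~test], rng.permutation(y[~test]))             # permuted labels -> null
    null_aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))

# Empirical p-value: how often the null reaches the observed median AUC.
p = (np.sum(np.array(null_aucs) >= np.median(true_aucs)) + 1) / (len(null_aucs) + 1)
```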

Assessing feature importance

Kernel SHAP (SHapley Additive exPlanations) was used to determine which acoustic features were most important for each model to detect UVFP. This method is model agnostic in that it can take any trained target model (even “black box” neural networks) and compute feature importance [40]. It does so by performing regression with L1 penalty between different sets of input features and a single prediction made by the target model. It then uses the coefficients of the additional regression model as a measure of feature importance for a single prediction. We took the average of the absolute SHAP values across all test predictions (positive and negative values are both important for classification). We then weighted the average values by the model’s median performance since an important feature for a bad model could be a less important feature for a good model and vice versa. Since we trained each model 50 times (i.e., one for each bootstrapping split), we computed the mean SHAP values across splits for each model. This pipeline (i.e., machine learning models, bootstrapping scheme, SHAP analysis) was done using pydra-ml.
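The aggregation described above can be sketched on synthetic SHAP arrays (shapes and the performance weight are illustrative, not outputs of the study's pipeline):

```python
import numpy as np

# shap_values[s][i, j]: SHAP value of feature j for test prediction i in
# bootstrapping split s; here these are random placeholders standing in
# for the output of a Kernel SHAP explainer.
rng = np.random.default_rng(0)
n_splits, n_test, n_features = 50, 30, 88
shap_values = [rng.standard_normal((n_test, n_features)) for _ in range(n_splits)]
median_auc = 0.87  # illustrative median performance used as the weight

# Per split: mean |SHAP| over test predictions (sign does not matter for
# importance). Then weight by median AUC and average across splits.
per_split = np.stack([np.mean(np.abs(s), axis=0) for s in shap_values])
importance = median_auc * per_split.mean(axis=0)  # one weighted score per feature
top5 = np.argsort(importance)[::-1][:5]           # indices of the top 5 features
```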

Reducing collinearity to do explainability analysis using Independence Factor

Highly correlated features (i.e., collinearity) can influence model generation and interpretation. Two models may obtain similar performance while using different features or placing different weights on the same features (i.e., underspecification [20,41]). This makes it difficult to compare algorithmic explanations across models. For instance, mean F1 frequency may be less important to a given model because the model uses mean F2 frequency, which happens to capture very similar information in a particular dataset (i.e., the two are highly correlated), whereas a different model may use F1 instead of F2, or use both but assign less importance to each, and still obtain the same performance. To force models to use the same features that capture very similar information, and to make feature importance comparable across models, we kept a single feature out of each set of features that share similar information above a given threshold.

We used a custom algorithm we call the Independence Factor, whereby for each feature in alphabetical (i.e., arbitrary) order, we removed the remaining features that showed dependence with it above a given threshold; this step was then repeated for the remaining features. We used distance correlation from the Python dcor package (v0.4) because, unlike Pearson's r or Spearman's rho, it can capture non-monotonic relationships [42,43]. We include several examples of non-monotonic associations between variables in our dataset that are captured better by dcor (see Figure B in S1 Appendix). We used the following threshold values for the distance correlation (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2) to compute the Independence Factor, which removed increasingly more features (i.e., 1.0 keeps all features and 0.2 removes features that have a distance correlation above 0.2). We chose the smallest feature-set size for which at least one model scored within three percentage points of the performance using all features, with the goal of obtaining a more parsimonious model for subsequent explanation while maintaining high accuracy. Removing redundant features in this way makes the models easier to interpret for clinical relevance. To visualize the original redundancy across features, we computed clustermaps using the seaborn package (v0.10.1), performing hierarchical clustering with the average-linkage method and Euclidean distance. This was performed on the pairwise distance correlations, computed separately on data from UVFP patients, controls, and UVFP+controls, and on reading, vowel, and reading+vowel samples.
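A sketch of the greedy filtering step, using absolute Pearson correlation as a stand-in for distance correlation so the example runs without the dcor package (the feature names are hypothetical):

```python
import numpy as np

def independence_factor(X, names, threshold):
    """Greedy filter: walk features in alphabetical order and drop any
    feature whose dependence with an already-kept feature exceeds the
    threshold. |Pearson r| stands in for distance correlation here."""
    order = np.argsort(names)                   # alphabetical (arbitrary) order
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for i in order:
        if all(corr[i, j] <= threshold for j in kept):
            kept.append(i)
    return sorted(names[k] for k in kept)

rng = np.random.default_rng(0)
a = rng.standard_normal(100)
X = np.column_stack([a,
                     a + 0.01 * rng.standard_normal(100),  # near-duplicate of a
                     rng.standard_normal(100)])            # independent feature
names = ["f0_mean", "f0_median", "loudness_mean"]
print(independence_factor(X, names, threshold=0.9))  # → ['f0_mean', 'loudness_mean']
```

With threshold 1.0 all three features survive; lowering the threshold removes the near-duplicate first, mirroring the behavior described in the text.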

Performance using most important and least important features

Studies tend to report and describe the top N features out of M features, but it is not clear what performance a model would obtain using only those top N features; it might perform substantially worse than the full model. We therefore report performance using only the top 5 features, as well as performance without the top 5 features, to provide a more realistic evaluation of their importance.

Performance using audio duration

Fig 2 indicates clear differences in the distributions of audio recording duration between UVFP patients and controls. This is due to how recordings were processed and saved, not necessarily to an intrinsic property of UVFP (e.g., slower speech), which reveals a bias that models can leverage but that is not expected to generalize under different audio processing procedures. Therefore, we examined whether audio duration alone could classify UVFP. The mean (SD) audio duration for the reading task was 3.5 s (0.00 s) for controls and 10.25 s (6.17 s) for UVFP patients; for the sustained vowel task, it was 4.11 s (0.07 s) for controls and 3.74 s (1.3 s) for UVFP patients.

Fig 2. Distribution of audio duration for reading and vowel tasks split by group reveals a dataset bias.

The mode of the audio durations for the controls is 3.5 s for reading samples and 4.11 s for vowel samples.
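The following sketch illustrates how such a duration artifact can classify on its own; the simulated durations loosely mimic the reading-task means and SDs reported above (the values are synthetic, not the study's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# If control recordings were trimmed to a fixed length while patient
# recordings were not, duration alone yields a high ROC AUC, with no
# voice information involved at all.
rng = np.random.default_rng(0)
control_dur = np.full(77, 3.5)                           # fixed-length controls
patient_dur = np.clip(rng.normal(10.25, 6.17, 77), 1, None)
y = np.array([0] * 77 + [1] * 77)                        # 0 = control, 1 = UVFP
auc = roc_auc_score(y, np.concatenate([control_dur, patient_dur]))
```

Because ROC AUC is rank-based, no classifier needs to be trained: the duration values themselves serve as decision scores.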

Performance using cepstral peak prominence

To evaluate whether results are sensitive to the choice of features, we used a different feature set derived from cepstral peak prominence (CPP), which has been shown to be a good measure of breathiness and dysphonia [44,45]. We matched the summary statistics that eGeMAPS computes across each audio recording: CPP mean, CPP coefficient of variation (standard deviation normalized by the mean), CPP 20th percentile, and CPP 80th percentile. We used our custom Python implementation, which matches MATLAB's COVAREP output [46].
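A simplified stand-in for a CPP computation (one frame; not the COVAREP-matched implementation used in the study) under the common definition of peak height above a regression line fit to the cepstrum:

```python
import numpy as np

def cpp_frame(frame, sr, f0_range=(60.0, 300.0)):
    """Cepstral peak prominence of one frame: cepstral peak height above a
    linear regression baseline, searched within a typical pitch range."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    quefrency = np.arange(len(cepstrum)) / sr
    region = (quefrency > 1 / f0_range[1]) & (quefrency < 1 / f0_range[0])
    q, c = quefrency[region], cepstrum[region]
    peak = np.argmax(c)
    slope, intercept = np.polyfit(q, c, 1)          # regression baseline
    return c[peak] - (slope * q[peak] + intercept)  # prominence above baseline

sr = 16000
t = np.arange(0, 0.04, 1 / sr)
periodic = sum(np.sin(2 * np.pi * 120 * k * t) for k in range(1, 6))
noise = np.random.default_rng(0).standard_normal(len(t))
# A strongly periodic (less breathy) frame should yield a higher CPP than noise.
```

In practice a per-recording CPP contour would be computed frame by frame and then summarized with the four statistics listed above.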

Clinician ratings

In order to corroborate whether there were unintended recording differences between UVFP patients and controls that may lead to bias, one otorhinolaryngologist and two speech-language pathologists rated each audio recording of the reading task (one per participant, not split into thirds) on the following variables (possible responses in parentheses), in order: background noise (none, some, high); UVFP (yes, no); CAPE-V severity (0–100); CAPE-V roughness (0–100); CAPE-V breathiness (0–100); CAPE-V strain (0–100); CAPE-V pitch (0–100); CAPE-V loudness (0–100; estimated loudness as if the rater were in the recording room); and recording loudness (low, medium, high; loudness of the recording itself). Inter-rater agreement was assessed using the intra-class correlation for all numerical variables and Light's kappa for the binary presence of UVFP [47], using the R package irr (v0.84.1) [48]. The entire reading task was provided, rather than the task split into thirds, to make the assignment easier for clinicians. The reading task was chosen over the sustained vowel because we expected it to make UVFP easier for clinicians to detect.
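Light's kappa is commonly computed as the mean of Cohen's kappa over all rater pairs; below is a small sketch with made-up binary UVFP ratings for three hypothetical raters (the study itself used the R irr package):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Made-up yes/no (1/0) UVFP ratings for ten recordings by three raters.
ratings = {
    "otolaryngologist": [1, 1, 0, 1, 0, 0, 1, 1, 0, 0],
    "slp_1":            [1, 1, 0, 1, 0, 1, 1, 1, 0, 0],
    "slp_2":            [1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
}

# Light's kappa: average pairwise Cohen's kappa across raters.
pairs = list(combinations(ratings.values(), 2))
lights_kappa = sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)
print(round(lights_kappa, 3))  # → 0.6 for these made-up ratings
```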


Performance of models using acoustic features

In Table 2, we report performance for models using all features, models after removing redundant features, models using only the top 5 features (to understand their unique role in performance), models using all 88 features except the top 5 (to understand whether the top 5 features are necessary for high performance), models using audio duration alone, and models using a different feature set based on CPP. Performance was high across most models except the CPP-based models. Some models using only audio duration achieved close to the highest performance, which reflects the expected effect of the duration differences in the dataset. Given that dependent features provide similar information (see Figures C to K in S1 Appendix) and distort feature importance analyses, we then tested performance after removing redundant features using the Independence Factor method previously described. Figure L in S1 Appendix shows performance for feature sets of different sizes containing increasing amounts of redundant features. From this analysis, we selected the feature-set size that resulted in the best performance using the fewest features for subsequent analyses: 39 features (reading), 13 (vowel), and 19 (reading+vowel). After removing related features (i.e., reducing collinearity) from the original 88 features, similar performance was obtained (median ROC AUC = 0.84–0.87) using fewer features (see Table A in S1 Appendix for an analysis of how this method compares to removing features within each training set).

The bootstrapped ROC AUC distributions and permutation tests for the reduced (parsimonious) models using the non-redundant feature set are shown in Fig 3. Each model's score distribution was significantly different from its null distribution after correcting for multiple comparisons using the Benjamini-Hochberg procedure.
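The significance test can be sketched as follows; this is a simplified illustration (a one-sided empirical p-value of the median true score against the null score distribution, followed by Benjamini-Hochberg correction), not the exact pipeline used in the study.

```python
def perm_pvalue(true_scores, null_scores):
    """One-sided empirical p-value: how often a permuted-label (null) score
    matches or beats the median true score (with add-one smoothing)."""
    obs = sorted(true_scores)[len(true_scores) // 2]
    exceed = sum(s >= obs for s in null_scores)
    return (exceed + 1) / (len(null_scores) + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg FDR control: return a reject flag per p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject
```

One p-value per model (each from its bootstrapped true and null distributions) would be passed together to `benjamini_hochberg`.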

Fig 3. Model performance comparison using a permutation test using non-redundant features.

Scores from models trained on true labels (blue) and trained on permuted labels (orange) over bootstrapping splits.

Given that 24 UVFP patients were recorded with a different device (an iPad), we trained models without their samples to make sure these recording differences were not driving performance. There was a small drop in performance, which could reflect a bias (the full, original model using information about the recording device) but could also result simply from removing training samples. The drop in performance is not large enough to suspect that differences in recording are driving the full original model's performance (see Tables C and D in S1 Appendix).

Assessing feature importance

Fig 4 reports feature importance using SHAP for all models. For the reading-based models, all models tend to use the same top 5 features except SGD, which also has the lowest performance. For further description of features and the chosen classification of features, see Eyben et al. (2015) [34] and Low et al. (2020) [2]. When reviewing important features, it is key to note that any feature that is codependent or associated with an important feature could itself reasonably be considered important (see clusters of redundant features in Figures C to K in S1 Appendix). The variance in feature-importance rank is evidence that models can use different feature information and still obtain similarly high (although not perfect) performance. We further display the distribution of each top feature and its individual performance in Fig 5, which shows that no single feature is enough to dissociate groups with high performance. This figure also revealed the bias: the intensity-related feature equivalent sound level was counterintuitively higher for UVFP patients than controls. Fig 6 reports similarity between the top 5 features and all 88 original eGeMAPS features. Features that have a high distance correlation (dcor) with the top 5 features (i.e., that cluster with them) were not used in models to avoid redundancy, but they still share similar information and can therefore be considered important features as well. Hierarchically-clustered heatmaps for other data types (vowel, reading, both) and groups (UVFP patients, controls, both) are displayed in Figures C to K in S1 Appendix. Clustering tends to reflect pre-defined feature types such as those reflecting patterns from vocal folds, intensity, vocal tract, spectral analyses, and prosody.

Fig 4. Feature importance parallel coordinate plot.

Rank reads from bottom (most important) to top (least important). Mean rank is weighted by performance of each model to avoid a lower performing model biasing the mean rank.
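One plausible implementation of the performance-weighted mean rank (the exact weighting scheme is not specified here, so this is an assumption): each model's feature ranks contribute in proportion to its ROC AUC, so a weak model pulls the ordering less.

```python
def weighted_mean_rank(ranks_by_model, auc_by_model):
    """Aggregate per-model feature ranks into a performance-weighted mean rank.
    ranks_by_model: {model: {feature: rank}} with rank 1 = most important.
    auc_by_model:   {model: ROC AUC} used as the weight."""
    total_w = sum(auc_by_model.values())
    features = next(iter(ranks_by_model.values())).keys()
    return {
        f: sum(auc_by_model[m] * ranks[f]
               for m, ranks in ranks_by_model.items()) / total_w
        for f in features
    }
```

With two equally performing models that rank two features in opposite orders, both features end up tied at a mean rank of 1.5, as expected.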

Fig 5. Distributions for top 5 features and corresponding performance for single features.

Logistic Regression with L1 penalty was used. No single feature is enough to dissociate groups with high performance. Null models’ median performance was 0.5.

Fig 6. Feature redundancy with top 5 features highlighted.

Top 5 features are highlighted in bold and their rank is displayed. Squares are clusters of redundant features. Computed with all participants on the reading task.

Clinician ratings

The median ROC AUC for humans was 0.78 (min. = 0.74, max. = 0.81), meaning the machine learning models performed better than the highest performing clinician on the limited available data, that is, the audio samples of the reading task. Interestingly, using the average clinician's CAPE-V ratings within machine learning models obtained a maximum median ROC AUC of 0.84 (0.72–0.92) with the Random Forest model (Table 3). Using clinicians' perceptual ratings of background noise and recording loudness achieved a maximum median ROC AUC of 0.77 (0.63–0.87).

Table 3. Performance using clinician ratings as variables for machine learning models.

In Figs 7 and 8 we report the inter-rater reliability (Light's kappa and ICC) along with the distribution of the ratings. Common cutoffs for inter-rater agreement are: poor for values below .40, fair for values between .40 and .59, good for values between .60 and .74, and excellent for values between .75 and 1.0 [49]. Background noise had poor reliability across raters, UVFP and recording loudness had fair reliability (see Fig 7), and the CAPE-V-inspired ratings scored good to excellent except for pitch, which was fair (see Fig 8).
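These cutoffs can be expressed as a small lookup, included only to make the banding explicit (the function name is ours):

```python
def agreement_label(value):
    """Map a reliability coefficient (kappa or ICC) to the conventional
    cutoff labels cited in the text [49]."""
    if value < 0.40:
        return "poor"
    if value < 0.60:
        return "fair"
    if value < 0.75:
        return "good"
    return "excellent"
```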

Fig 7. Descriptive statistics and inter-rater reliability of clinician ratings for unilateral vocal fold paralysis (UVFP), background noise, and recording loudness indicating likely bias.

Controls and UVFP labels are the ground-truth diagnoses from the full clinical interview. Ratings are on brief reading samples. Bars indicate the maximum and minimum count across the three raters. The disproportionate number of UVFP samples rated as having high background noise and high loudness indicates likely bias: the gain might have been raised for some UVFP patients, and they may have phonated more intensely. kappa: Light's kappa; ICC: intra-class correlation coefficient.

Fig 8. How clinicians rate the audio recordings of read speech: descriptive statistics and inter-rater reliability of average clinician ratings.

The average across raters was taken for each recording. ICC: intra-class correlation coefficient.

Bias mitigation: Matching audio duration and removing features associated with intensity

We trimmed the longer UVFP samples so they matched the control samples (all samples had the same duration), removing the audio duration difference. Vowel samples could not be matched by trimming because some UVFP samples were shorter and some longer than the control samples; we therefore demonstrate an attempt at bias mitigation only with reading samples. In Table 4, we show results on these samples after additionally removing all intensity features as well as variables with a distance correlation (dcor) ≥ 0.3 or ≥ 0.4 with any of them, based on the reading samples. Models have comparable performance to models with the original duration and intensity-related biases. See Table E in S1 Appendix for a list of the 44 features associated with audio duration and the 14 intensity-related features. For distance correlations between audio duration and features, see Table F in S1 Appendix.
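The feature-exclusion step can be sketched as follows, assuming pairwise associations (e.g., dcor values) have already been computed; the function and argument names are illustrative.

```python
def drop_bias_associated(features, biased, assoc, threshold=0.3):
    """Drop every feature whose association with ANY known-biased feature
    meets or exceeds `threshold`, along with the biased features themselves.
    features:  ordered list of feature names.
    biased:    set of feature names known to carry the bias (e.g., intensity).
    assoc:     dict of (feat_a, feat_b) -> association in [0, 1]."""
    def a(x, y):
        # Associations are symmetric; look up either key order.
        return assoc.get((x, y), assoc.get((y, x), 0.0))
    return [
        f for f in features
        if f not in biased and all(a(f, b) < threshold for b in biased)
    ]
```

Raising `threshold` from 0.3 to 0.4 keeps more features, matching the two exclusion levels reported in Table 4.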

Table 4. Performance keeping features least associated with intensity features on samples of equal audio length after trimming.


Discussion

This study achieves high performance in detecting UVFP from healthy voices using a few seconds of audio recordings, surpassing clinician evaluations even after mitigating the biases we found in the dataset. The explainability analysis uncovered a likely bias: intensity features were, on average, higher for UVFP patients than controls (Fig 5), when UVFP patients should have weaker voices. There are two likely causes. First, the recording software prompted users with weak voices to speak louder in order to achieve an audible recording. Second, supported by the clinicians' ratings, clinicians rated UVFP patients' recordings as louder and with more background noise than controls' on average (when the levels should have been similar), both proxies for the microphone gain having been increased. Either mechanism would have helped the models improve performance using characteristics stemming from recording idiosyncrasies rather than from pathophysiology. However, we removed features correlating with the clearly biased features and still achieved high performance.

Our study expands on prior studies, which have used pre-existing commercial databases, smaller sample sizes, fewer features, and/or model-evaluation methods that can be biased in small datasets because the test sets may not be representative (for a discussion on bootstrapping for clinical datasets, see [2]). Critically, we provide a roadmap for evaluating models more thoroughly, including quantitatively explaining models and checking their robustness to different choices of speech-eliciting tasks, algorithms, and feature sets. All of this should increase trust when no bias is found and when explanations are robust across models and make sense to experts. Such a model could fulfill several clinical needs: (1) postoperative screening for thyroid surgery-related UVFP, which is common after thyroid surgery, occurring in up to 5 to 10% of cases [27]; furthermore, laryngoscopy is not readily available to all postoperative populations and symptomatic changes are notoriously variable, so an ML-based screening could help identify patients needing further workup and treatment, with earlier diagnosis essential to optimize long-term outcomes [28,29]; (2) monitoring voice during speech therapy and after surgical treatment for confirmed UVFP, to measure when and whether the patient's voice is approximating a healthy voice; and (3) preoperative screening prior to surgeries at high risk for developing UVFP, such as thyroid, head and neck, cardiac, thoracic, esophageal, and cervical spine operations.

In Table 5 we summarize several key recommendations to avoid bias when building and explaining machine learning tools for laryngology, although more could be added, and we expand upon how we dealt with some of these steps in the following sections.

Table 5. Recommendations to avoid bias for explainable machine learning models that use audio recordings for screening in laryngology.

Explaining acoustic features relevant to detecting vocal fold paralysis

Objective acoustic measurement changes associated with vocal fold paralysis have been described, including reduced loudness and maximum phonation time, higher perturbation measurements such as jitter and shimmer, and increased signal-to-noise ratio [19,58,59]; however, these were univariate models, and we have demonstrated that single variables do not seem to provide high predictive performance. While other multivariate machine learning models have been used, they relied on few features and small or undefined samples and reported feature-importance results for only one model; it is therefore not clear whether the important features reported would hold with larger feature sets or how other models would perform. Using a much larger initial set of acoustic features for analysis, we demonstrate that several machine learning algorithms of increasing complexity (using more parameters) identify vocal fold paralysis from healthy voices. We also report that these models can use different features to achieve similar performance. Different models emphasize different features not simply because of their relevance to a disorder, but because of the mathematics associated with the model (e.g., containing different degrees of interaction effects, regularization, or propensity to underfitting or overfitting) [60]. The variability of feature rankings across our individual models also illustrates the potential danger of using the single highest performing model, as is commonly done in the published literature.

Instead of simply reporting the important features from the highest performing model, we analyzed the models to find common features. The most important features across models were somewhat associated with intensity features (Table F in S1 Appendix); therefore, even if not strongly associated with intensity features, they could be important due to a combination of intrinsic differences between UVFP and controls (for which we provide hypotheses) or because of how intensity influences them; a new unbiased dataset would be needed to confirm this. These top features were: intensity, especially equivalent sound pressure level, which was redundant with multiple loudness features and seems to reflect some patients trying to use more breath for projection or being recorded with a higher microphone gain; Mel-frequency cepstral coefficients (especially the first coefficient, which captures spectral envelope or slope and has been shown to be important for predicting UVFP [29]); mean F0 semitones (given that F0 originates from vocal-fold oscillation, a vocal fold paralysis is expected to alter F0, which has been shown to help predict pathological speech including UVFP [28]); mean F1 amplitude and frequency (influenced by how the vocal tract filters F0 and by the shape of the glottal pulse, both of which would be affected by UVFP); voiced and unvoiced segments (prosodic and speech-articulation features which may be altered by changes in the periodicity of F0); and CPP features (which indicate voice-quality degradations that could include more breathiness, a typical feature of UVFP [61]). Shimmer variability was important just for reading; it captures variability in glottal pulses and pressure patterns, which ultimately affect F0, and has been found to differ significantly between UVFP patients and a control group [62].
When we removed the top 5 features from the full feature set, performance was practically equivalent to using all 88 features, as expected, since other features are redundant with the top 5. Therefore, it is not these 5 specific features alone that drive performance, but rather the information they contain, which in this dataset is also captured by other features, as shown in Fig 6.

These acoustic features would corroborate our clinical understanding of glottal incompetence from UVFP and with common patient complaints of reduced loudness, vocal instability, hoarseness, and rough voice; however, they could also be important due to their associations with intensity features. Uncovering and understanding the basic mechanisms and features that models use to generate predictions and outcomes are important as these tools become part of the clinical decision making process.

Identifying and addressing bias

Equivalent sound level was higher in UVFP patients than controls. This is counterintuitive because UVFP patients are known to have softer voices, as described above; yet clinicians rated most UVFP samples as louder than controls. The bias discovered was likely due to the microphone gain being increased for some UVFP patients, which would also explain the increased background noise in their recordings. A second source of bias may have come from requesting UVFP patients to speak louder in order to meet the minimum intensity threshold of the recording software programs Computerized Speech Lab™ and OperaVOX, or patients may have tried this on their own knowing they were being recorded. Such behavioral compensation is likely to occur in biomarker research when the participant has a soft voice, especially in retrospective studies like ours, where the study goal is not known at the time of recording, or when certain software properties lead individuals with weak voices to speak louder. Even though the current models perform better than the clinicians, a systematic comparison would require more clinician and model assessments across datasets. A model trained on a single dataset might learn intrinsic characteristics of that dataset that do not generalize as well as clinical expertise might.

Having said this, this line of research would help us understand the extent to which UVFP detection is generalizable from acoustic data alone. Finding an objective measure of hoarseness is important given a "normal voice" is a fundamentally subjective classification that is not well defined [63,64] and varies with training [65,66], which may result in low reliability of evaluation of disordered voices among clinical rating scales [67].

As a post hoc analysis, we addressed bias by trying to mitigate its effect: we removed variables associated with intensity variables on samples matched for audio duration. After removing these features, the models obtained similar performance using a very different set of features. It is possible that these remaining features better reflect pathophysiology, or that they are still influenced by intensity; further studies should address their generalizability and their relation to intensity variation.

Evaluating the sensitivity to tasks, model complexity, and features used

In addition to gaining a better understanding of features, we explored performance in the context of different vocal tasks. Participants carried out two tasks to elicit voice: reading, which captures more complex speech dynamics, and sustaining a vowel, which is a simpler measure of vocalization and the respiratory subsystem. The richer dynamics of the reading task may explain the improved model performance we observed. Comparing simpler and more complex models is also important: simpler models such as logistic regression may be preferred because they tend to generalize better, being less prone to overfitting the training set, and they are more interpretable, so biases can be assessed more directly [68].

By removing redundant features, we can concentrate on finding the most useful features for further analysis. Performance decreased only slightly while we made models more parsimonious and explainable. This approach is key given the curse of dimensionality in machine learning that may make models unnecessarily complex and harder to generalize [20].

Studies often report the top N features but not how predictive they are in isolation. In our study we ran models on the top 5 features together (Table 2). Their lower performance relative to a richer feature set helps demonstrate that model performance depends on interactions across multiple additional features (with the exception of the reading task, which obtained an AUC of 0.86 using just those 5 features). We also ran models without the top 5 features to demonstrate that leaving in features that are redundant with them results in performance almost equivalent to using all 88 features, since the redundant features share information. Furthermore, training models on the individual features from within the top 5, one at a time, reduced performance considerably, with scores from 0.55 to 0.71. This indicates that these models need to combine multiple features to achieve high performance, and that model evaluation should not focus only on the common or top features without testing their predictive performance.

Limitations and future directions

We cannot determine how the bias will affect the model’s performance on future samples, but it will likely underperform in samples where length was not different between groups, where gain cannot be changed, and where participants are instructed to not overproject their voice; however, it is possible the model could underperform for other reasons including dataset shift (e.g., the distribution of voice characteristics or demographics is different in a new sample).

The classification using duration alone varied across models, and clinicians who listened to the reading passage in its entirety did not classify as well as the best performing models. Duration itself was not included as a feature in the eGeMAPS-based models and has a complex effect on both machines and humans. For example, duration could have affected eGeMAPS features (e.g., introducing more variability into the functionals computed over sliding time windows), and the duration of vowels varied extensively across the UVFP group and thus cannot itself be tied to underlying pathophysiology. Therefore, important future work should analyze how duration may affect these features, address the intrinsic variability in how long UVFP patients take to respond to speech items, and incorporate models of production that consider respiratory capabilities, articulation changes, and vocal fold pathophysiology.

It is not clear whether these models could distinguish UVFP from other voice disorders rather than only from healthier voices; however, a model that generalizes well in classifying UVFP from controls could be used to monitor UVFP patients remotely and affordably during treatment, or to detect risk for UVFP when it is the most likely cause (e.g., dysphonia after thyroid surgery). Larger sample sizes with curated examinations can increase diverse representation across voice quality and thereby potentially reduce bias in classifier performance. We did not analyze potential racial bias given these data were not extracted in the chart review.

Our choice of a standardized feature set worked well in this setting, but may fail to work for differential voice disorder diagnosis or when generalizing to larger datasets, which may bring in additional sources of variance unaccounted for in this dataset. With the availability of more data, additional features could be extracted that better capture changes in coordination (e.g., XCORR [69]).

Furthermore, while our feature-importance evaluation method, SHAP, shows a certain amount of robustness across models, alternative model-agnostic feature-importance methods (e.g., LOFO, permutation importance) as well as model-specific methods (coefficient values for linear models, mean decrease in impurity for Random Forest) could be compared. Model understandability (how easily the explanations are understood by a speech scientist or a clinician) could also be assessed rigorously [55].

Finally, we attempted to debias the models by removing features correlated with the biased ones, although it is not clear exactly how intensity may influence certain features; we assume that if intensity influences a variable, it generally creates a considerable association, which we discarded using dcor. The remaining effect of the bias can therefore be assessed by testing the model's generalizability to new, unbiased datasets. We are consequently not promoting our final debiased models as completely unbiased or ready to use: our debiasing strategies may be only partially effective, additional biases may remain, and/or additional ways of debiasing may not have been considered.

We tested how well a model using only the top 5 features performed independently of the model with all features; it is possible to also test how well the incremental set of top features performs (1st, 1st and 2nd, 1st–3rd, etc.), which would be useful in order to compare different models’ performance as a function of which features are being used.
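The incremental scheme described above could be sketched generically, where `evaluate` stands in for the existing scoring pipeline (e.g., a function returning the median bootstrapped ROC AUC for a given feature subset); the function name is illustrative.

```python
def incremental_top_n_scores(ranked_features, evaluate):
    """Score nested feature sets: top-1, top-2, ..., top-k.
    ranked_features: feature names ordered from most to least important.
    evaluate: callable mapping a feature list to a scalar score."""
    return {
        n: evaluate(ranked_features[:n])
        for n in range(1, len(ranked_features) + 1)
    }
```

Plotting score against n would show how quickly each model saturates, allowing models to be compared as a function of which features they are given.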


Conclusion

Using one of the largest UVFP datasets to date, our study demonstrates the importance of checking for biases using explainable machine learning and clinician perceptual ratings. To explain the models, we first tackled collinearity (i.e., redundant or highly correlated independent variables), which biases feature importance, using a custom method called Independence Factor that selects one out of each set of associated features without losing predictive performance. We then compared how results change across different speech-eliciting tasks, training algorithms, features, feature-set sizes, and highest and lowest performing features to better understand the process models use to predict vocal changes associated with laryngeal disease, since analyzing a single model gives a biased view of how predictions are achieved. During this process, we discovered a difference in audio duration between groups that was clearly not related to intrinsic differences in UVFP speech rate, but rather to all control recordings having been cropped to a certain length during audio storage. We also discovered that equivalent sound level was counterintuitively higher in UVFP patients, a likely bias resulting from the weak or underprojected voice that characterizes many UVFP patients: patients were prompted by the recording software to speak louder, and the microphone gain was likely raised selectively for patients with weaker voices, possibly generating the higher background noise detected through the clinicians' ratings. The models therefore picked up on the acoustic correlates of this increased intensity, which would impede generalization under different recording procedures and natural audio durations. This kind of bias is more likely to occur in laryngology datasets when patients have softer voices.

We found that matching audio duration between groups and removing all variables clearly related to intensity (i.e., bias mitigation) resulted in similarly high performance. In this case, the models may be using information more related to pathophysiology, which would need to be confirmed in future unbiased samples. Machine learning models tended to surpass clinicians' evaluations of the same audio recordings. Interestingly, using clinicians' voice-quality ratings of the recordings within machine learning models performed better than their binary evaluation of whether a recording contained a UVFP voice or not.

We hope to promote moving beyond using a single model and reporting only top features, toward better explanations of how these models work and an understanding of the variance across modeling and evaluation choices. We believe these are all aspects of machine learning that clinicians need to understand before using such applications.

With these considerations along with the recommendations we make, machine learning applications could aid in laryngology screening, allowing for the potential development of in-home screening assessments and continuous pre- and post-treatment monitoring.

Supporting information

S1 Appendix.

Text A. List of eGeMAPS features. Figure A. Controls, reading+vowel tasks: Visualization of features with shared information using pairwise distance correlation across the 88 features. Squares are clusters of redundant features. Figure B. Non-monotonic associations between features. Figure C. All participants, reading task: Visualization of features with shared information using pairwise distance correlation across the 88 eGeMAPS features. Squares are clusters of redundant features. Figure D. All participants, vowel task: Visualization of features with shared information using pairwise distance correlation across the 88 eGeMAPS features. Squares are clusters of redundant features. Figure E. All participants, reading+vowel tasks: Visualization of features with shared information using pairwise distance correlation across the 88 features. Squares are clusters of redundant features. Figure F. Patients, reading task: Visualization of features with shared information using pairwise distance correlation across the 88 eGeMAPS features. Squares are clusters of redundant features. Figure G. Patients, vowel task: Visualization of features with shared information using pairwise distance correlation across the 88 eGeMAPS features. Squares are clusters of redundant features. Figure H. Patients, reading+vowel tasks: Visualization of features with shared information using pairwise distance correlation across the 88 eGeMAPS features. Squares are clusters of redundant features. Figure I. Controls, reading task: Visualization of features with shared information using pairwise distance correlation across the 88 eGeMAPS features. Squares are clusters of redundant features. Figure J. Controls, vowel task: Visualization of features with shared information using pairwise distance correlation across the 88 eGeMAPS features. Squares are clusters of redundant features. Figure K. Controls, reading+vowel tasks: Visualization of features with shared information using pairwise distance correlation across the 88 features. Squares are clusters of redundant features. Figure L. Performance as a function of feature-set size using the Independence Factor method for reducing feature redundancy. The feature sets remove features with distance correlation ≥ 0.2 up to 1.0 (i.e., keeping all features) in increments of 0.1. Table A. Model performance after hyperparameter tuning, which increased performance by 0.01 on average across models and tasks. Table B. Comparison of selecting features on the entire dataset (useful for explainability) versus selecting on 50 bootstrap (80–20) train splits. Original total features: 88. CI = confidence interval. Table C. Performance of models without the 24 patients recorded on iPad. Median ROC AUC score from 50 bootstrapping splits (90% confidence interval; median score of null model). The control group represents 60% of the training samples. MLP: Multi-Layer Perceptron; SGD: Stochastic Gradient Descent Classifier. Table D. False negative rate (FNR) of training on one recording device and testing on the 24 UVFP patients recorded on iPad. FNR is generally quite low. Performance can also be influenced by having a smaller training set in order to balance the classes. Table E. Features with distance correlation (dcor) > 0.3 with biased intensity-related features. Table F. Features with distance correlation (dcor) > 0.3 with biased audio duration.



Acknowledgments

We would like to thank Cody Sullivan and Carolyn Hsu for their help in rating the audio samples, and Daryush Mehta, Robert Hillman, and John Guttag for their feedback on an earlier version of this study.


  1. 1. Wroge TJ, Özkanca Y, Demiroglu C, Si D. Parkinson’s disease diagnosis using machine learning and voice. 2018 IEEE signal.
  2. 2. Low DM, Bentley KH, Ghosh SS. Automated assessment of psychiatric disorders using speech: A systematic review. Laryngoscope Investig Otolaryngol. 2020 Feb;5(1):96–116. pmid:32128436
  3. 3. Quatieri TF. Discrete-Time Speech Signal Processing: Principles and Practice. Pearson Education; 2008.
  4. 4. Molnar C. Interpretable Machine Learning.; 2019.
  5. 5. Stachler RJ, Francis DO, Schwartz SR, Damask CC, Digoy GP, Krouse HJ, et al. Clinical practice guideline: Hoarseness (dysphonia). Otolaryngol Head Neck Surg. 2018 Mar;158:S1–42.
  6. 6. Brunner E, Friedrich G, Kiesler K, Chibidziura-Priesching J, Gugatschka M. Subjective breathing impairment in unilateral vocal fold paralysis. Folia Phoniatr Logop. 2011;63(3):142–6. pmid:20938194
  7. 7. Spataro EA, Grindler DJ, Paniello RC. Etiology and Time to Presentation of Unilateral Vocal Fold Paralysis. Otolaryngol Head Neck Surg. 2014 Aug;151(2):286–93. pmid:24796331
  8. 8. Sritharan N, Chase M, Kamani D. The vagus nerve. 2015
  9. 9. Randolph GW, Kamani D. The importance of preoperative laryngoscopy in patients undergoing thyroidectomy: voice, vocal cord function, and the preoperative detection of invasive thyroid malignancy. Surgery. 2006 Mar;139(3):357–62. pmid:16546500
  10. 10. Colton RH, Paseman A, Kelley RT, Stepp D, Casper JK. Spectral moment analysis of unilateral vocal fold paralysis. J Voice. 2011 May;25(3):330–6. pmid:20813498
  11. 11. Balasubramanium RK, Bhat JS, Fahim S 3rd, Raju R 3rd. Cepstral analysis of voice in unilateral adductor vocal fold palsy. J Voice. 2011 May;25(3):326–9. pmid:20346619
  12. 12. Little M, Costello D, Harries M. Objective dysphonia quantification in vocal fold paralysis: comparing nonlinear with classical measures. Nature Precedings. 2009 Apr 21;1–1. pmid:19900790
  13. 13. Bielamowicz S, Stager SV. Diagnosis of unilateral recurrent laryngeal nerve paralysis: laryngeal electromyography, subjective rating scales, acoustic and aerodynamic measures. Laryngoscope. 2006 Mar;116(3):359–64. pmid:16540889
  14. 14. Hartl DAM, Hans S, Vaissière J, Brasnu DAMF. Objective acoustic and aerodynamic measures of breathiness in paralytic dysphonia. Eur Arch Otorhinolaryngol. 2003 Apr;260(4):175–82. pmid:12709799
  15. Francis DO, Pearce EC, Ni S, Garrett CG, Penson DF. Epidemiology of vocal fold paralyses after total thyroidectomy for well-differentiated thyroid cancer in a Medicare population. Otolaryngol Head Neck Surg. 2014 Apr;150(4):548–57. pmid:24482349
  16. Jeannon JP, Orabi AA, Bruch GA, Abdalsalam HA, Simo R. Diagnosis of recurrent laryngeal nerve palsy after thyroidectomy: a systematic review. Int J Clin Pract. 2009 Apr;63(4):624–9. pmid:19335706
  17. Bhattacharyya N, Kotz T, Shapiro J. Dysphagia and aspiration with unilateral vocal cord immobility: incidence, characterization, and response to surgical treatment. Ann Otol Rhinol Laryngol. 2002 Aug;111(8):672–9. pmid:12184586
  18. Pinho CMR, Jesus LMT, Barney A. Aerodynamic measures of speech in unilateral vocal fold paralysis (UVFP) patients. Logoped Phoniatr Vocol. 2013 Apr;38(1):19–34. pmid:22741532
  19. Hartl DM, Crevier-Buchman L, Vaissière J, Brasnu DF. Phonetic effects of paralytic dysphonia. Ann Otol Rhinol Laryngol. 2005 Oct;114(10):792–8. pmid:16285270
  20. Berisha V, Krantsevich C, Hahn PR, Hahn S, Dasarathy G, Turaga P, Liss J. Digital medicine and the curse of dimensionality. NPJ Digit Med. 2021 Dec;4(1). pmid:34711924
  21. Rusz J, Švihlík J, Krýže P, Novotný M, Tykalová T. Reproducibility of voice analysis with machine learning. Mov Disord. 2021 May;36(5):1282–3. pmid:33991447
  22. Schönweiler R, Hess M, Wübbelt P, Ptok M. Novel approach to acoustical voice analysis using artificial neural networks. J Assoc Res Otolaryngol. 2000 Dec;1(4):270–82. pmid:11547807
  23. Godino-Llorente JI, Gómez-Vilda P. Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. IEEE Trans Biomed Eng. 2004 Feb;51(2):380–4. pmid:14765711
  24. Fraile R, Saenz-Lechon N, Godino-Llorente JI, Osma-Ruiz V, Fredouille C. Automatic detection of laryngeal pathologies in records of sustained vowels by means of mel-frequency cepstral coefficient parameters and differentiation of patients by sex. Folia Phoniatr Logop. 2009;61(3):146–52. pmid:19571549
  25. Voigt D, Döllinger M, Yang A, Eysholdt U, Lohscheller J. Automatic diagnosis of vocal fold paresis by employing phonovibrogram features and machine learning methods. Comput Methods Programs Biomed. 2010 Sep;99(3):275–88. pmid:20138386
  26. Lopes LW, Batista Simões L, Delfino da Silva J, da Silva Evangelista D, da Nóbrega E Ugulino AC, Oliveira Costa Silva P, et al. Accuracy of acoustic analysis measurements in the evaluation of patients with different laryngeal diagnoses. J Voice. 2017 May;31(3):382.e15–382.e26. pmid:27742492
  27. Powell ME, Rodriguez Cancio M, Young D, Nock W, Abdelmessih B, Zeller A, et al. Decoding phonation with artificial intelligence (DeP AI): Proof of concept. Laryngoscope Investig Otolaryngol. 2019 Jun;4(3):328–34. pmid:31236467
  28. Dibazar AA, Narayanan S, Berger TW. Feature analysis for automatic detection of pathological speech. In: Proceedings of the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society. Engineering in Medicine and Biology. 2002. p. 182–3 vol. 1.
  29. Seedat N, Aharonson V, Hamzany Y. Automated and interpretable m-health discrimination of vocal cord pathology enabled by machine learning. In: 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). 2020. p. 1–6.
  30. Mittal V, Sharma RK. Deep learning approach for voice pathology detection and classification. IJHISI. 2021 Oct 1;16(4):1–30.
  31. Hu HC, Chang SY, Wang CH, Li KJ, Cho HY, Chen YT, et al. Deep learning application for vocal fold disease prediction through voice recognition: Preliminary development study. J Med Internet Res. 2021 Jun 8;23(6):e25247. pmid:34100770
  32. Ras G, Xie N, van Gerven M, Doran D. Explainable deep learning: A field guide for the uninitiated. J Artif Intell Res. 2022 Jan 25;73:329–96.
  33. Fairbanks G. Voice and Articulation Drillbook. Harper; 1960. 196 p.
  34. Eyben F, Scherer KR, Schuller BW, Sundberg J, André E, Busso C, et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing. 2016 Apr;7(2):190–202.
  35. audEERING GmbH. openSMILE (Version 2.3) Internet. 2017. Available from: 7f/config/gemaps/eGeMAPSv01a.conf
  36. Ghosh SS, Low DM, Rajaei H, et al. Pydra-ML Internet. Available from:
  37. Lipton ZC. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queueing Syst. 2018 Jun 1;16(3):31–57.
  38. Raschka S. Model evaluation, model selection, and algorithm selection in machine learning Internet. arXiv cs.LG. 2018. Available from:
  39. Ojala M, Garriga GC. Permutation tests for studying classifier performance. In: 2009 Ninth IEEE International Conference on Data Mining. IEEE; 2009. p. 1833–63.
  40. Lundberg S, Lee SI. A unified approach to interpreting model predictions Internet. arXiv cs.AI. 2017. Available from:
  41. D’Amour A, Heller K, Moldovan D, Adlam B, Alipanahi B, Beutel A, et al. Underspecification presents challenges for credibility in modern machine learning. J Mach Learn Res. 2022 Jan 1;23(1):10237–97.
  42. de Siqueira Santos S, Takahashi DY, Nakata A, Fujita A. A comparative study of statistical methods used to identify dependencies between gene expression signals. Brief Bioinform. 2014 Nov;15(6):906–18. pmid:23962479
  43. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769–94.
  44. Hillenbrand J, Houde RA. Acoustic correlates of breathy vocal quality: dysphonic voices and continuous speech. J Speech Hear Res. 1996 Apr;39(2):311–21. pmid:8729919
  45. Murton O, Hillman R, Mehta D. Cepstral peak prominence values for clinical voice evaluation. Am J Speech Lang Pathol. 2020 Aug 4;29(3):1596–607. pmid:32658592
  46. Degottex G, Kane J, Drugman T, Raitio T, Scherer S. COVAREP—A collaborative voice analysis repository for speech technologies. Proc IEEE Int Conf Acoust Speech Signal Process. 2014.
  47. Hallgren KA. Computing inter-rater reliability for observational data: An overview and tutorial. Tutor Quant Methods Psychol. 2012;8(1):23–34. pmid:22833776
  48. Gamer M, Lemon J, Fellows I, Singh P. irr: Various coefficients of interrater reliability and agreement. R package. 2012.
  49. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess. 1994 Dec;6(4):284–90.
  50. Nosek BA, Ebersole CR, DeHaven AC, Mellor DT. The preregistration revolution Internet. Vol. 115, Proceedings of the National Academy of Sciences. 2018. p. 2600–6. Available from:
  51. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on bias and fairness in machine learning. ACM Comput Surv. 2021 Jul 13;54(6):1–35.
  52. Osborne JW, Overbay A. The power of outliers (and why researchers should ALWAYS check for them). Practical Assessment, Research, and Evaluation. 2019;9(1):6.
  53. Kapoor S, Cantrell E, Peng K, Pham TH, Bail CA, Gundersen OE, et al. REFORMS: Reporting standards for machine learning based science Internet. arXiv cs.LG. 2023. Available from:
  54. Thompson CG, Kim RS, Aloe AM, Becker BJ. Extracting the variance inflation factor and other multicollinearity diagnostics from typical regression results. Basic Appl Soc Psych. 2017 Mar 4;39(2):81–90.
  55. Zhou Y, Ribeiro MT, Shah J. ExSum: From local explanations to model understanding Internet. arXiv cs.CL. 2022. Available from:
  56. Hort M, Chen Z, Zhang JM, Harman M, Sarro F. Bias mitigation for machine learning classifiers: A comprehensive survey. ACM J Responsib Comput Internet. 2023 Nov 1; Available from:
  57. Dockès J, Varoquaux G, Poline JB. Preventing dataset shift from breaking machine-learning biomarkers. Gigascience Internet. 2021 Sep 28;10(9). Available from: pmid:34585237
  58. Ramig LA, Scherer RC, Titze IR, Ringel SP. Acoustic analysis of voices of patients with neurologic disease: rationale and preliminary data. Ann Otol Rhinol Laryngol. 1988 Mar-Apr;97(2 Pt 1):164–72. pmid:2965542
  59. Morsomme D, Jamart J, Wéry C, Giovanni A, Remacle M. Comparison between the GIRBAS scale and the acoustic and aerodynamic measures provided by EVA for the assessment of dysphonia following unilateral vocal fold paralysis. Folia Phoniatr Logop. 2001 Nov-Dec;53(6):317–25. pmid:11721138
  60. Kriegeskorte N, Douglas PK. Interpreting encoding and decoding models. Curr Opin Neurobiol. 2019 Apr;55:167–79. pmid:31039527
  61. Hartl DM, Hans S, Vaissière J, Riquet M, Brasnu DF. Objective voice quality analysis before and after onset of unilateral vocal fold paralysis. J Voice. 2001 Sep;15(3):351–61. pmid:11575632
  62. Ma Y, Xu X, Hou G, Zhou L, Zhuang P. Acoustic analysis in patients with unilateral arytenoid dislocation and unilateral vocal fold paralysis. Lin Chung Er Bi Yan Hou Tou Jing Wai Ke Za Zhi. 2016 Feb;30(4):268–71. pmid:27373031
  63. Misono S. The voice and the larynx in older adults: What’s normal, and who decides? JAMA Otolaryngol Head Neck Surg. 2018 Jul 1;144(7):572–3. pmid:29799923
  64. Eadie T, Sroka A, Wright DR, Merati A. Does knowledge of medical diagnosis bias auditory-perceptual judgments of dysphonia? J Voice. 2011 Jul;25(4):420–9. pmid:20347262
  65. Helou LB, Solomon NP, Henry LR, Coppit GL, Howard RS, Stojadinovic A. The role of listener experience on Consensus Auditory-perceptual Evaluation of Voice (CAPE-V) ratings of postthyroidectomy voice. Am J Speech Lang Pathol. 2010 Aug;19(3):248–58. pmid:20484704
  66. Eadie TL, Baylor CR. The effect of perceptual training on inexperienced listeners’ judgments of dysphonic voice. J Voice. 2006 Dec;20(4):527–44. pmid:16324823
  67. Karnell MP, Melton SD, Childes JM, Coleman TC, Dailey SA, Hoffman HT. Reliability of clinician-based (GRBAS and CAPE-V) and patient-based (V-RQOL and IPVI) documentation of voice disorders. J Voice. 2007 Sep;21(5):576–90. pmid:16822648
  68. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019 May;1(5):206–15. pmid:35603010
  69. Williamson JR, Quatieri TF, Helfer BS, Ciccarelli G, Mehta DD. Vocal and facial biomarkers of depression based on motor incoordination and timing. In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. New York, NY, USA: Association for Computing Machinery; 2014. p. 65–72. (AVEC ‘14).