Skip to main content
  • Loading metrics

Sleep deprivation detected by voice analysis

  • Etienne Thoret ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Laboratoire des systèmes perceptifs, Département d’études cognitives, École normale supérieure, PSL University, CNRS, Paris, France, Aix-Marseille University, CNRS, Institut de Neurosciences de la Timone (INT) UMR7289, Perception Representation Image Sound Music (PRISM) UMR7061, Laboratoire d’Informatique et Systèmes (LIS) UMR7020, Marseille, France, Institute of Language Communication and the Brain, Aix-Marseille University, Marseille, France

  • Thomas Andrillon,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Sorbonne Université, Institut du Cerveau - Paris Brain Institute - ICM, Mov’it team, Inserm, CNRS, Paris, France, Université Paris Cité, VIFASOM, ERC 7330, Vigilance Fatigue Sommeil et santé publique, Paris, France, APHP, Hôtel-Dieu, Centre du Sommeil et de la Vigilance, Paris, France

  • Caroline Gauriau,

    Roles Investigation, Methodology

    Affiliations Université Paris Cité, VIFASOM, ERC 7330, Vigilance Fatigue Sommeil et santé publique, Paris, France, APHP, Hôtel-Dieu, Centre du Sommeil et de la Vigilance, Paris, France

  • Damien Léger ,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    ‡ These authors joint last first authorship on this work.

    Affiliations Université Paris Cité, VIFASOM, ERC 7330, Vigilance Fatigue Sommeil et santé publique, Paris, France, APHP, Hôtel-Dieu, Centre du Sommeil et de la Vigilance, Paris, France

  • Daniel Pressnitzer

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    ‡ These authors joint last first authorship on this work.

    Affiliation Laboratoire des systèmes perceptifs, Département d’études cognitives, École normale supérieure, PSL University, CNRS, Paris, France


This is an uncorrected proof.


Sleep deprivation has an ever-increasing impact on individuals and societies. Yet, to date, there is no quick and objective test for sleep deprivation. Here, we used automated acoustic analyses of the voice to detect sleep deprivation. Building on current machine-learning approaches, we focused on interpretability by introducing two novel ideas: the use of a fully generic auditory representation as input feature space, combined with an interpretation technique based on reverse correlation. The auditory representation consisted of a spectro-temporal modulation analysis derived from neurophysiology. The interpretation method aimed to reveal the regions of the auditory representation that supported the classifiers’ decisions. Results showed that generic auditory features could be used to detect sleep deprivation successfully, with an accuracy comparable to state-of-the-art speech features. Furthermore, the interpretation revealed two distinct effects of sleep deprivation on the voice: changes in slow temporal modulations related to prosody and changes in spectral features related to voice quality. Importantly, the relative balance of the two effects varied widely across individuals, even though the amount of sleep deprivation was controlled, thus confirming the need to characterize sleep deprivation at the individual level. Moreover, while the prosody factor correlated with subjective sleepiness reports, the voice quality factor did not, consistent with the presence of both explicit and implicit consequences of sleep deprivation. Overall, the findings show that individual effects of sleep deprivation may be observed in vocal biomarkers. Future investigations correlating such markers with objective physiological measures of sleep deprivation could enable “sleep stethoscopes” for the cost-effective diagnosis of the individual effects of sleep deprivation.

Author summary

Sleep deprivation has an ever-increasing impact on individuals and societies, from accidents to chronic conditions costing billions to health systems. Yet, to date, there is no quick and objective test for sleep deprivation. We show that sleep deprivation can be detected at the individual level with voice recordings. Importantly, we focused on interpretability, which allowed us to identify two independent effects of sleep deprivation on the voice: changes in prosody and changes in voice quality or timbre. The results also revealed a striking variability in individual reactions to the same deprivation, further confirming the need to consider the effects of sleep deprivation at the individual level. Vocal markers could be correlated to specific underlying physiological factors in future studies, outlining possible cost-effective and non-invasive “sleep stethoscopes”.


In the last decade or so, insufficient sleep has become a prominent public health issue, with one third of the adult population sleeping less than six hours per night [13]. This chronic sleep debt is associated with an increased risk of chronic disease, such as obesity, type 2 diabetes, cardiovascular diseases, inflammation, addictions, accidents and cancer [48]. Sleep debt also increase the risk of developing multiple comorbidities [9]. Moreover, more than one worker out of five operates at night and suffers from a high level of sleep deprivation [10], which causes accidents in the workplace or when driving [11]. In the present study, we use automated acoustic analyses of the voice to detect sleep deprivation. The aim is not to improve the accuracy of current machine-learning approaches [1214], but, rather, to build on them to introduce a new focus on interpretability. Ideally, our method should not only detect whether an individual is sleep deprived or not, but also help to formulate specific hypotheses as to the physiological consequences of sleep deprivation for a given individual at a given moment in time.

Currently, there are several techniques aiming to measure sleep deprivation and its associated physiological consequences. First, sleep deprivation may be simply assessed in terms of the loss of sleep time, as measured in hours. Remarkably, however, the impact of a given amount of sleep deprivation varies massively across individuals. In laboratory settings where the amount of deprivation could be precisely controlled, up to 90% of the variance in cognitive performance was related to individual traits and not to the actual time spent asleep [15,16]. Second, sleep deprivation may also be measured through subjective sleepiness, which participants can explicitly report using rating scales [1719]. However, subjective sleepiness could be influenced by other factors than sleep deprivation, such as the time of the day, motivation, or stress. Besides, it is not clear whether reported subjective sleepiness captures the full physiological impact of sleep deprivation, given the variety of the potentially implicit processes involved [20]. Third, objective methods have been developed to measure tangible consequences of sleep deprivation. The multiple sleep latency test [21], the gold standard in clinical settings, uses electro-encephalography (EEG) to estimate sleep latency (e.g. the amount of time to go from wake to sleep) along five successive naps sampled every two hours during daytime. The psychomotor vigilance test [22], often used in research settings, tests for the ability to respond quickly to infrequent stimuli, with slower reaction times assumed to be markers of attentional lapses. More recently, new approaches have attempted to measure the concentration of key molecules in the urine, saliva or breath [23]. Although these objective methods are complementary to subjective reports, they are often costly, time consuming, or difficult to deploy outside of laboratories. So, whereas there are cheap and fast objective diagnosis tools for other causes of temporary cognitive impairment, such as alcohol or drug abuse, there is currently no established means to estimate sleep deprivation effects, at the individual level, in real-life settings.

If sleep deprivation could be detected through voice recordings, this would fill this gap by providing a quick, non-invasive, and cost-effective objective measure of sleep deprivation. Indeed, because the voice is easy to record with off-the-shelf equipment, there is a growing interest in finding vocal biomarkers to diagnose a variety of medical conditions [24,25]. For sleep, the idea was first explored by Morris et al. [26]. Free speech was produced by sleep deprived participants and rated by the authors. A slowing down of speech and a “flatness of the voice” were noted after deprivation. These observations were extended by Harrison and Horne [27], who found that raters blind to the amount of deprivation of the speakers could detect effects on the intonation of speech after deprivation. More recently, an experiment using a larger database found that, indeed, raters could detect sleepy versus non sleepy voices with an accuracy above 90% [28]. So, it does seem that there are acoustic cues in the voice that reflect sleep deprivation and/or sleepiness.

Machine learning has been applied to automate the detection of sleep deprivation and/or sleepiness from the voice. In an early study [29], sleep deprivation was inferred with high accuracy from vocal recordings (86%) but it should be noted that the deprivation was extreme, consisting of 60 hours without sleep, with unknown applicability to the much more common situation of mild sleep deprivation. Two “computational paralinguistic challenges” have since been launched, with sub-challenges aimed at assessing sleepiness from vocal recordings [30,31]. We will not review all of the entries to these challenges here, as they are quite technical in nature. To summarize, all of them used a similar framework: i) selection of a set of acoustic features, such as pitch, spectral and cepstral coefficients, duration estimates, and functionals of those features; ii) dimensionality reduction of the feature set; iii) supervised learning of target classification using various machine learning techniques, such as support vector machines or neural networks. The best results varied depending on the challenge. Subjective sleepiness proved difficult to predict [28], but the binary categorization of sleepy versus non-sleepy voices could be achieved with high accuracy (over 80%) in the best performing classifiers [32].

The framework described above will be familiar -and effective- for many machine learning problems, but it has two major limitations from a neuroscientific perspective. First, the initial selection of features is based on a somewhat arbitrary choice. Often, the choice of features was guided by the “voice flatness” hypothesis [26,27]. However, other, perhaps more subtle acoustic markers of sleep deprivation or sleepiness may have been overlooked by human raters. Second, the acoustic features discovered by the classifiers are not necessarily interpretable and can be difficult to relate to plausible mediating mechanisms [14]. Interestingly, the best-performing system so far used a carefully hand-crafted small feature set inspired from auditory processing models, suggesting that “perceptual” features may be a promising route for sleepiness detection in the voice [32]. A more recent study has again attempted to focus on “simple” acoustic descriptors for one of the databases of the paralinguistic challenge, with the explicit aim to facilitate interpretation [33]. Accurate classification was possible with the simpler feature set of about 20 features, with a resulting accuracy of 76%.

Here, we aim to extend these findings In four different ways. First, we use our own vocal database, which has been collected in a controlled laboratory setting where the amount of sleep deprivation could be precisely controlled. Vocal recordings were obtained from reading out loud the same texts for all participants, in random order across participants. This is important to avoid biases confounding sleep deprivation with e.g. participant identity, which is easily picked up by classifiers [28,34]. Second, we use a fully generic acoustic feature set, spectro-temporal modulations (STMs), computed with a neurophysiologically-inspired model of auditory processing [35]. The STM representation has been successfully applied to various machine-learning problems such as musical instruments classification [36], timbre perception [36,37], or speech detection and enhancement [38]. Third, we apply our own technique to interpret the cues discovered by the classifier [39]. This technique, similar in spirit to the reverse correlation method used in neuroscience and psychophysics, identifies the parts of the input representation that have the most weight in the classifiers’ decisions. The main outcome of the analysis consists in determining the parts of auditory feature space impacted by sleep deprivation. Fourth, by fitting classifiers to individual participants, we aim to uncover plausible physiological factors underlying the large and as of yet unexplained variability observed in the responses to mild sleep deprivation in normal healthy adults.


Twenty-two healthy women between 30–50 years of age (42.7 ± 6.8) were sleep deprived during a controlled laboratory protocol. An all-female experimental group was chosen because the current experiment took place in parallel with a dermatology study [40], but also because such a choice was expected to homogenize the vocal pitch range across participants. After a first “Control night” spent in the laboratory, participants were restricted to no more than 3 hours of sleep per night during two subsequent “Restriction nights”, also monitored in the laboratory. Such a sleep restriction is both more ecological than total sleep deprivation and better controlled than observational paradigms. Vocal recordings were obtained throughout the protocol, during six reading sessions. The first three reading sessions occurred at different times of the day right after the control night (no sleep deprivation). The last three reading sessions occurred at the same times of the day but right after the second restriction night (see Methods for further details). All participants read, for about 10 minutes, chapters of the same French classic book: “Le Comte de Monte Christo” by Alexandre Dumas. The order of the excerpts was randomized across sessions for each participant to avoid a confound with deprivation. In total, our database thus consisted of 22 healthy participants producing about half an hour of vocal recordings each (M = 31min, SD = 5min) evenly split between before and after two nights of mild sleep deprivation. Each recording session was further split into 15-s frames. The main classification task was to decide whether a frame was recorded before deprivation or after deprivation.

Subjective sleepiness reports are highly variable

Sleepiness was self-reported by participants at different times during the day (see Methods) using the Stanford Sleepiness Scale (SSS) questionnaire [19]. Fig 1A shows the distributions of SSS ratings. On average, sleep deprivation had an effect on self-reported sleepiness: sleepiness was low right after the control night, but increased after the deprivation nights. This was confirmed by an ANOVA on the SSS, with factors Day (2 levels, before and after deprivation) and Time of Report (4 levels). Both factors had a significant effect, but with a much larger effect size for Day (F(1,46) = 52.14, p<0.001, ηp2 = 0.221) compared to Time of report (F(3,92) = 3.07, p = 0.029, ηp2 = 0.048). Moreover, there was no interaction between Day and Time of report (F(3,92) = 0.59, p = 0.621). Because of this lack of interaction, we now consider average SSS values for all Times of Reports in a Day, to focus on the effect of sleep deprivation.

Fig 1.

A. Subjective sleepiness. Sleepiness was evaluated by self-reports on the Stanford Scale before sleep deprivation (Control) and after two nights of mild sleep deprivation (Sleep deprived). The abscissa indicates the time of day when sleepiness reports were collected. B. Average reported sleepiness before and after sleep restriction. Lines connect data points for each participant.

Fig 1B illustrates the data aggregated in that way, with individual changes in sleepiness now identified across the control and sleep deprived day. A remarkable individual variability was obvious in reported sleepiness. Note that this was in spite of our precise control of the amount of sleep deprivation, which was equated across all participants. Even so, some participants showed little effect of sleep deprivation, with even cases of decreases in subjective sleepiness after deprivation. Such unexpected effects were observed for all baseline sleepiness, low or high, as measured before deprivation. This striking variability is in fact consistent with previous observations involving objective measures of sleep deprivation [16]. It also further justifies that vocal biomarkers of sleep deprivation should be investigated at the individual level.

Acoustic features of speech before and after sleep deprivation are broadly similar

To get a first qualitative overview of the effects of sleep deprivation on the acoustic features of speech, and in particular to test whether deprivation and any obvious average effect on the voice, we computed STM representations before and after deprivation.

Let us describe the STM representation in more details. At each moment in time, STMs contain the dimensions of frequency, rate, and scale. The frequency dimension, in Hz but on a logarithmic scale, reflects the spectral content of the sound. It is obtained by bandpass filtering the temporal waveform into several sub-bands, in a way intended to simulate peripheral auditory filtering. Rates and scales are then obtained using a bank of spectro-temporal modulation filters at the output of each peripheral channel. The rate dimension, in Hz, reflects the modulations in sound amplitude in the time domain. Slow modulations have low rates, whereas fast modulations have high rates. Positive rates indicate temporal modulations coupled with downward changes in frequency, whereas negative rates indicate temporal modulations coupled with upward changes in frequency. The scale dimension, in cycle per octave, reflects modulations in the spectral domain. Sounds with fine details in their spectral envelopes have high scale values, while sounds with relatively flat spectral envelopes have low scale values. For speech, the dominant rates are between 2 Hz and 8 Hz [41], while dominant scales, related to the harmonic structure of vowels, are around 2 cyc/oct [42]. The STM representation is similar to other widely used auditory front-ends such as the Modulation Power Spectrum (MPS) [43,44], the main difference being the logarithmic vs linear frequency scales used by the STM and MPS models, respectively.

The full STMs thus have four dimensions of time, frequency, rate, and scale. To have a look at the overall effect of sleep deprivation on acoustic features, we averaged the STMs along the time dimension, separately before and after deprivation. Average STMs before (Fig 2A) and after (Fig 2B) deprivation were qualitatively similar. The rate-scale projections showed that, unsurprisingly, high energy in the STMs was focused in regions associated to speech [38]. The frequency-rate projection simply showed the average spectrum of our vocal recordings.

Fig 2. Acoustic analyses.

A. Spectro-Temporal Modulations before sleep deprivation. Projections on the rate-scale and rate-frequency plane are shown. Arbitrary model units. B. As in A., but after sleep deprivation. C. Acoustic difference before and after sleep deprivation, shown as 2 * abs(B-A) / (A+B). Units of percent. D. Speech features before (green) and after (orange) sleep deprivation. Displayed are four openSMILE features related to average pitch (mean of the fundamental frequency f0), pitch variation (standard deviation of f0), voice creakiness (jitter) and voice breathiness (logarithm of the Harmonic to Noise Ratio). Lines connect data points for each participant.

To further investigate the candidate acoustic differences caused by deprivation, we subtracted STMs before and after deprivation (Fig 2C for the population-level results, S1 Fig for individual-level results). At the population level, maximal differences in the rate-scale projection were less than 3%, while differences up to 11% were observed in the frequency-rate projection. At the subject level, differences in the rate-scale projection were around 24.68% on average (SD = 6), while differences up to 40.83% on average (SD = 12) were observed in the frequency-rate projection. Larger differences seem therefore observable at individual level but there is no obvious structure to the differences: they appear noisy and do not necessarily match the STM regions of high energy in speech (see S1 Fig).

For comparison with the state-of-the art of sleepiness detection from the voice [33], we also computed speech features using the openSMILE library [45]. The full 4368 speech features suggested in [33] were extracted (see Methods). Four of them are illustrated in Fig 2D, averaged before and after deprivation. These features were selected according to the “voice flatness” hypothesis. According to this hypothesis, it could be that sleep deprivation lowered the average pitch of the voice and reduced with its variation. It could also be that the quality of the voice, described with such adjectives as “creakiness” or “breathiness”, could systematically change after deprivation. The closest openSMILE correlates of such perceptual descriptors are shown in Fig 2D. Visually, no obvious change was induced by sleep deprivation, with increases or decreases for all four features.

At this point, it is unclear whether the raw acoustic differences illustrated in Fig 2 are meaningful compared to the within- and across-participant variability. Also, the choice to illustrate 4 features out of 4368 is somewhat arbitrary. So, it remains to be tested whether the STM or openSMILE features have any predictive power to detect sleep deprivation. To address this point in a principled manner, we now turn to machine-learning, for the new STM representation and also for the openSMILE feature set.

Detection of sleep deprivation from the voice is possible based on generic auditory features

A first question raised by the present study is whether fully generic auditory features can be used to detect sleep deprivation from the voice. To address this question, the STM representation was used as the input feature space for a standard machine-learning pipeline [13,14,36]. Recordings were converted to 15-s long frames, to be classified as “before sleep deprivation” or “after sleep deprivation” on the basis of their STM representation. The dataset was first transformed into train and test splits. We then reduced the high dimensionality of the feature space by means of a principal component analysis (PCA) on the training set (see Methods). The PCA spaces were then fed to a support vector machine classifier (SVM) with a Gaussian kernel (radial basis function). We opted for an SVM and not a deep-learning architecture mainly because of the relatively modest size of our dataset, but also because SVMs have been shown to outperform more complex classifiers in similar tasks [14]. The performance of the SVM was evaluated with Balanced Accuracy (BAcc, see Methods).

At the population level, two cross-validation strategies were used. First, a Leave-One-Subject-Out (LOSO) strategy, in which one subject was left out of the training set and constituted the test set. The procedure was repeated for each participant. This procedure is the most stringent test of generalization of prediction for unknown participants. However, in small datasets with a large amount of individual variability, it has been argued that LOSO may be inappropriate [46]. This is likely the case for our dataset, with 22 participants and a large expected variability for the effects of sleep deprivation. Thus, we also report cross-validation using a 50-times repeated splitting of 25% of the data (50-splits, 25% test) randomly selected among the whole pool of participants, as suggested in [46]. At the participant level, the LOSO strategy does not make sense, so only the 50-splits, 25% test validation was applied.

Classification performance is shown in Fig 3. At the population level, the classifier was able to detect sleep deprivation significantly above chance (50-splits, 20% test: BAcc, M = .77, SD = .01, t-test against .5: t(49) = 145.27 p < 0.001; LOSO: BAcc, M = .56, SD = .09, t-test against .5: t(21) = 3.01, p = .006). This seems on par with the state of the art obtained with different speech databases [32,33]. Interestingly, and as expected from the sizeable individual variability observed in the SSS reports, the same machine-learning pipeline was more accurate when applied at the individual level (BAcc, M = .86, SD = .09). Noticeably, for half of the participants, the classifiers’ accuracies displayed BAccs above .9, outperforming the state of the art and matching human performance on a similar task [28]. For two participants, the classifiers’ accuracies were relatively poor. Participant #1 displayed a decrease in sleepiness after deprivation (-0.75 for the sleepiness ratings averaged after and before deprivation), and was the only participant to exhibit such a trend in the group for which vocal recordings were available (another participant exhibited such a decrease in Fig 1B, but was could not be included in the vocal analysis). This may have contributed to the poor accuracy of the classifier. Participant #2 did exhibit an increase in sleepiness after deprivation (+0.75), so there are no obvious reasons for the classifier’s poor performance in this case.

Fig 3. Machine learning classification results with STM input features.

A. Balanced Accuracies for the population-level classifier using the generic STM representation as input feature space. Two cross-validation procedures are reported (see text). Error bars show standard deviations. Stars indicate the significance level of t-tests against chance level (** < .01; *** < .001). B. Balanced Accuracies for the classifiers tuned to individual participants, obtained with the 50-splits, 25% test cross-validation procedure. Participants are ranked according to classification accuracy.

Overall, this shows that there is enough information in the fully generic STM representation of vocal recordings to detect mild sleep deprivation in otherwise normal and healthy participants. The classification performance at the population level is poor using a LOSO cross-validation procedure, and it is also possible that the population classifiers benefited from a high input dimensionality to learn several individual recordings from different features sets, so the generalizability of our approach across speakers is not warranted. However, performance is generally excellent at the individual level, strengthening the idea that individual variability is key when considering vocal correlates of sleep deprivation.

Individual classifiers detect voice changes related to sleep deprivation

Individual classifiers were excellent at discriminating recordings before and after deprivation, but there is one critical question that needs be addressed before any further investigation: were the classifiers sensitive to changes in the voice related to the experimental manipulation, sleep deprivation, or were they rather only picking up random acoustic variations that are bound to exist across different reading sessions?

Some sources of random acoustic variation can be ruled out by our experimental protocol. All participants were recorded with the same microphone before and after deprivation, in the same room, at the same location in the room. This minimizes confounds such as microphone frequency response or room reverberation.

However, there remains the possibility that participants simply talked in different ways when reading the same text more than once, for reasons unrelated to sleep deprivation. To test for this possibility, we took advantage of the availability of multiple reading sessions per participant: three reading sessions before deprivation and three reading sessions after deprivation, recorded at different times of the day. The before and after sessions were pooled together for the main analysis of Fig 3. Here, we trained and cross-validated individual classifiers using the same pipeline as for the main analysis, but this time to discriminate between all possible pairs of different reading sessions for a given participant. We then averaged the average accuracy for all sessions “Within” the same state (discriminate two different sessions recorded both before or both after deprivation) or “Across” states (discriminate one session recorded before deprivation and one session recorded after deprivation). If our experimental manipulation had no effect, the accuracies “Within” and “Across” should be identical.

This was not the case (Fig 4, two-tailed comparison t(21) = 5.1, p = 5e-5). Note that the analysis likely underestimates the effect of the experimental manipulation, because a lot of “Across” accuracies were near ceiling (hence the use of Rationalized Arcsine Units for display and statistical analysis purposes, see [47]). Also, there are interesting, non-random causes of variability within a state: the circadian rhythm and the homeostatic pressure for sleep that are both associated with fatigue, which could explain the relatively high performance already in the “Within” case.

Fig 4. Classification accuracy for voice variability unrelated versus related to sleep deprivation.

Balanced Accuracies for discriminating reading sessions recorded both before or after sleep deprivation (within) or for discriminating one reading sessions recorded before and one reading session recorded after sleep deprivation. As 15 out of 22 classifiers produced accuracies above 0.9 for the Across comparison, BAccs were converted to rationalized arcsine units (RAU). Points represent individual participants. The median and interquartile intervals are also shown.

Using standard speech features does not improve sleep deprivation detection accuracy

Even if the STM representation successfully supported sleep detection at the individual level, it could be that it missed important speech features such as “pitch” or “pitch variation”, which are at the core of the “voice flatness” hypothesis and are part of most automatic sleepiness detection pipelines.

To investigate this possibility, we used the full openSMILE feature set (4368 features) as input feature space. We then applied the same classification pipeline as for the STM representation, consisting of dimensionality reduction followed by a Gaussian kernel SVM. This resulted in a pipeline matching the state-of-the art for sleepiness [33] while allowing comparison between the two input spaces.

Results are displayed in Fig 5. At the population level, the openSMILE classifier accuracy was similar to the STM classifier (50-splits, 20% test: BAcc, M = .70 SD = .01, t-test against .5: t(49) = 91.6, p < 0.01; LOSO: BAcc, M = .56, SD = .09, t-test against .5: t(21) = 2.98, p = .007). At the individual level, the accuracies of the openSMILE classifiers were on average poorer than those observed with the STM classifiers (BAcc, M = 0.67, SD = 0.9). The correlation between classification performance using STM or openSMILE feature was low (r(20) = .34, p = .11). Interestingly, however, participants #1 and #2 for whom poor classification performance was observed using the STM input feature space also displayed poor classification using the openSMILE input feature space.

Fig 5. Machine learning classification results with openSMILE input features.

Format as in Fig 3. In particular, for B., participants’ labels (#) are identical to Fig 3.

These results show that, for our voice database at least, using standard speech features decreased the accuracy of sleep deprivation detection. The relevant information to detect sleep deprivation from the voice was thus better expressed in the STM representation, with the added benefit, from our perspective, that generic auditory features should be easier to interpret. We thus now focus on the STM representation to interpret the features used for classification.

Interpreting classifiers to identify vocal biomarkers of sleep deprivation

To gain insight about the nature of the acoustic features distinguishing speech before and after sleep deprivation, we probed the trained classifiers with STM using a recent interpretation technique based on reverse correlation [39]. Briefly, the technique consists in randomly perturbing the input to the trained classifier, over thousands of trials, and then averaging all of the noisy representations leading to correct classification. This aims to identify the portion of the input that participates the most to the classifier’s performance. The input representation was perturbed using additive noise in the PCA-reduced feature space [48]. Averaging all masks leading to a correct classification decision revealed, in our case, the discriminative features of a voice after deprivation compared to before deprivation (for details, see Methods and 39).

As a preliminary step, we evaluated the consistency of the interpretation masks. Because of our cross-validation technique, 50 classifiers were fitted either for the whole dataset for the population-level classifier or for each participant’s classifier. To check internal consistency, we computed the pairwise Pearson’s correlation coefficients between all 50 interpretation maps. At the population-level, this “consistency correlation” was low albeit significantly above chance (r(22527): M = .20, SD = .34; all but 28 over 1225 pairwise correlations were significant, p < .05) which is consistent with the large variability suspected across listeners. At the participant-level, however, consistency correlations were very high (r(22527): M = .91, SD = .06, min = .73; all but 3 over 26950 pairwise correlations were significant, p < .05). Furthermore, because individual classifiers varied in accuracy, we could check whether the consistency of the interpretation improved with accuracy. As expected, the correlation between BAccs and consistency correlation was strong (r(20) = .71, p = .0003). These consistency results confirm that caution should be applied when considering population-level interpretations, but that individual results are robust and can be interpreted.

Fig 6 shows the interpretation maps for the population-level classifier. Maps should be read as follows: red areas correspond to STM features where the presence of energy is associated with sleep deprivation for the classifier, whereas blue areas represent STM features where the absence of energy is associated to sleep deprivation for the classifier. For the population-level map, the rate-scale projection resembles the raw difference before and after deprivation, although less noisy, whereas the frequency-rate projection does not match such raw acoustic differences (compare with Fig 2C). As these population-level interpretations are not robust, we simply show them for illustrative purposes and refrain from further description of their features.

Fig 6. Interpretation of the population-level classifier.

Discriminative features (see main text) are shown in the input STM space, for the rate-scale and frequency-scale projections. Red areas indicate features positively associated to sleep deprivation by the classifier. Blue areas correspond to features negatively associated to sleep deprivation by the classifier. Color bar indicate the averaged value of the reverse correlation mask. Values are low because of the relative low consistency of the interpretation masks for this population-level classifier.

Fig 7A shows all individual classifiers on the rate-scale projection, ordered along increasing accuracy (BAcc) of the corresponding classifier. We chose to interpret in priority the rate-scale projections, as is done in most speech applications [38]. The frequency-rate projections are provided as S3 Fig. The main feature of the results is the striking diversity of the individual maps, which is not related to classifier accuracy in any obvious manner. For some participants, sleep deprivation was detected through a reduction in energy over a range of STM features (blue color), consistent with a “flattening” of speech modulations. But the opposite was also observed for other participants (red color). Moreover, the details of the discriminative features also varied across participants. As shown before, these details are robust and warrant interpretation.

Fig 7. Interpretation of the participant-level classifiers.

A. As for Fig 6, but for individual participants identified by their participant #. B. Projection of participants # in the interpretation-PCA space of all participant’s masks (see text for details). C. and D. Variance of the idealized masks along the first two dimensions of the interpretation-PCA. Idealized masks are obtained by first sampling the PCA latent space between -2 and 2 for the two first dimensions with 30 values and then inverting the latent space into the input feature space by using the inverse transform of the PCA. Red areas show the discriminative features that vary the most along each interpretation-PCA dimension. Units: variance in the feature space.

To get a better understanding of this variability across individual maps, we performed a PCA on the maps themselves, which we will term interpretation-PCA for clarity. A first interpretation-PCA dimension explained 35.9% of the variance, while a second dimension explained 24.2% of the variance. There is a drop for all other dimensions (N = 3) which explain less that 13% of the variance, see S4 Fig. Participants ordered on the first two interpretation-PCA dimensions are shown in Fig 6B. We computed the variance of all STM features along each interpretation-PCA dimension, to visualize the features that distinguished the interpretation maps along these main axes of variation. Results are shown in Fig 7C and 7D. The features defining the first interpretation-PCA dimension were clustered between rates of about 2 Hz to 8 Hz, which is exactly the amplitude modulation range corresponding to speech prosody and syllabic rate [41]. This shows that the amplitude modulation characteristics of speech was affected by sleep deprivation. Importantly, depending on the individual, the classifiers used the presence or absence of energy around these rates to detect sleep deprivation. This shows that while some participants spoke in a “flattened” voice after deprivation, consistent with classic hypotheses [26,33], others instead spoke in a more “animated” voice after deprivation. The features defining the second interpretation-PCA dimension clustered at very low rates and covered a broad scale range, peaking at about 2 cyc/oct. This corresponds to long-term spectral characteristics of speech and vowel sounds. In speech, such voice-quality or timbre features are determined by the precise shape of the various resonators inside the vocal tract, such as the throat and nasal cavities: by filtering the sound produced by the vocal folds, resonators impose formants that impact the timbre of vowels and other speech sounds.

Correlation with subjective sleepiness reports

All participants were subjected to the exact same amount of sleep deprivation. Nevertheless, their subjective sleepiness reports varied widely (Fig 1). We investigated whether the variability in subjective sleepiness reports could be accounted for by characteristics of the individual machine-learning classifiers.

First, we simply correlated the individual classifier’s accuracies to the individual SSS reports after sleep deprivation. If subjective sleepiness was a full measure of the impact of sleep deprivation, we would expect a high correlation between the classifier’s accuracy and SSS reports. Results are shown in Fig 8A. There was no significant correlation between BAccs and SSS reports (r(20) = .32, p = .14, BF10 = .47), suggesting that subjective sleepiness did not express all of the objective effects of sleep deprivation, at least as captured by our voice classifiers.

Fig 8. Relation between subjective sleepiness and voice classifiers.

A. Subjective sleepiness is plotted as a function of balanced accuracy of each participant-level classifier. B. Subjective sleepiness is plotted as a function of the coordinate of each participant-level classifier on the first dimension of the interpretation-PCA space. C. As in B., but for the second dimension of the interpretation-PCA space.

Next, we investigated whether the classifier’s interpretation maps could account for the SSS reports variability. In particular, we reasoned that the prosodic and rhythmic changes captured by the first interpretation-PCA dimension could be due to cognitive factors, inducing flattened or animated speech. Such factors could be explicit to participants–if only by self-monitoring their own speech rate and intonation. In contrast, the voice quality cues captured by the second interpretation-PCA dimension could be more subtle and remain implicit. Results are shown in Fig 8B and 8C. Consistent with our hypothesis, we observed a moderate but significant correlation between the location of participants on the first interpretation dimension and sleepiness reports (r(20) = -.44, p = .03, BF10 = 1.34). In contrast, the location of participants on the second interpretation dimension did not show any significant correlation with sleepiness reports (r(20) = .19, p = .38, BF10 = .23).

Finally, to assess the full information contained in the interpretation maps, we fitted a linear model that used coordinates on both interpretation-PCA dimensions to predict SSS scores after deprivation (see Methods). Results showed that it was possible to predict sleepiness from interpretation maps (R2: M = .29, SD = .18) significantly above chance (two-sample t-test to 0: p < .00001). This results further suggests that the classifiers detected voice changes related to sleep deprivation and not random variation across reading sessions (see also Fig 4).


Summary of findings

We ran a sleep deprivation protocol with normal and healthy participants, collecting subjective reports of sleepiness plus vocal recordings before and after deprivation. After two nights of mild sleep deprivation, subjective sleepiness increased on average, although with striking individual differences—including some participants even reporting decreases in subjective sleepiness after deprivation. Nevertheless, sleep deprivation could be detected accurately by means of machine-learning analysis of vocal recordings. Classification was most accurate at the individual level, with 85% balanced accuracy on average. Importantly, such a classification was based on a fully generic auditory representation. This allowed us to interpret the discriminative features discovered by classifiers to detect sleep deprivation. Two broad classes of features were revealed: changes in temporal modulations within the rhythmic range characteristic of speech sentences, and changes in spectral modulations within the timbre range of speech sounds. Furthermore, the interpretation maps could account for some of the variability in subjective sleepiness reports, which were correlated to the changes in temporal modulations (“flattened” or “animated” voice).

Candidate mechanisms underlying the vocal biomarkers of sleep deprivation

The individual classifiers using the STM input features learnt to detect sleep deprivation with a high accuracy, matching human performance previously measured on a similar task [28]. A data-driven interpretation of the classifier revealed that classification was based on two classes of auditory features: temporal modulations in the 2 Hz to 8 Hz range, and spectral modulations around 2 cyc/oct. We now speculatively relate these vocal features to two classes of well-established neurophysiological effects of sleep deprivation.

The temporal modulation features associated to sleep deprivation were in a range which has been robustly found as characteristic of speech across a variety of languages, to the extent that they have been described as “universal rhythmic properties of human speech” [41]. Such a universal rhythm is imposed by the biomechanical constraints of the vocal apparatus and by the neurodynamics of its control and perception systems. The changes in speech rhythms observed after sleep deprivation could thus result from a temporary impairment of the cognitive control of the speech production process. Sleep deprivation impacts cognitive function [20], presumably through changes in glucose consumption in frontal and motor brain regions [49,50]. Accordingly, previous studies showed lower activity in the dorsolateral prefrontal cortex and in the intraparietal sulcus in cognitive tasks requiring attention, with large inter-individual variability [51]. A reduced connectivity was also observed within the default mode network, the dorsal attention network, and the auditory, visual and motor network following sleep deprivation [50,52,53]. Finally, extended wakefulness has been associated with an increase in the intrusion of sleep-like patterns of brain activity in wakefulness [54,55]. All these results suggest that sleep deprivation is akin to a minor cognitive frontal dysfunction, and may thus plausibly affect the fluency of vocal production. Interestingly, compensatory responses were also observed in cognitive tasks, which may explain why some of our participants responded to deprivation with less speech modulation, consistent with the classic “flattened voice” hypothesis [26,27], whereas others unexpectedly responded with speech over-modulation and instead produced an “animated voice” after deprivation.

The spectral modulation changes detected by our classifiers were consistent with changes in the timbre of speech sounds, and in particular vowel sounds [35,38,42]. Such sounds acquire their distinctive spectral envelopes by means of the resonances of the vocal tract, including the throat and nasal cavities. Inflammation of the throat and nose could be responsible for these changes in timbre. Sleep deprivation is known to trigger an immune response leading to inflammation. A cortisol increment can be observed after a single night of sleep deprivation [8,56,57], so is plausible in our protocol that included two nights of mild sleep deprivation. In terms of mechanisms, sleep restriction and deprivation disturb the normal secretion of hormones like cortisol or testosterone, and is associated with increased rates of interleukin-6 and CRP as assessed on salivary samples in normal subjects. This inflammatory response could be linked to an elevated blood pressure following sleep deprivation [58] and could affect the vocal tract and plausibly impact the spectral envelope of speech. It should be noted that other variables, such as changes in hydration or food intake due to deprivation, might also impact characteristics of the vocal apparatus and induce timbre changes instead or in addition to putative inflammation. Such additional variables were not controlled in our protocol.

Limitations of the study

There are both technical and conceptual limitations to the present study. We chose to use a controlled protocol to precisely equate sleep deprivation in our participants, but this came at the expense of a relatively small dataset compared to the online databases used by machine-learning challenges [30,31]. Our protocol prevented biases in the database, such as associating the identity of speakers with the amount of sleep deprivation [28], but also limited our choice of possible machine-learning techniques to perform the classification. We thus used an SVM classifier, and not potentially more powerful deep-learning architectures. We note however that in the studies that compared SVMs with other classifier types, SVM performed best, including in state-of-the-art studies [14,32,33]. In any case, the interpretation method we used could be applied to any kind of classifier [39], including more complex ones.

All participants were female, mainly for practical reasons. Sleep deprivation might affect females and males differently, in particular with respect to inflammation, although the evidence is still mixed [59]. The generalizability of our findings to males thus remains to be tested experimentally. In addition, the modest performance observed for population-level classifiers limits the generalization of our approach to unknown speakers, which would be desirable for practical use cases involving pre-trained classifiers. However, this also confirms the interest to apply interpretation techniques at the individual level, to capture the variability that seems inherent to the effects of speech deprivation.

The feature set we used was a generic auditory representation, which is a major difference with previous machine-learning oriented studies. On the one hand, some studies were fully data-driven and selected the best-performing features from thousands of speech descriptors. The resulting features were often difficult to interpret. On the other hand, there were also studies using a small set of features, but these features were carefully hand-crafted and potentially lacked genericity. Our approach represents a trade-off between these two ideas: we applied a data-driven approach to select a small subset of features, but because these features were from a generic representation, they remained interpretable. A clear limitation is that we did not include features related to pitch or functionals of pitch such as contour features, which have been repeatedly shown to be useful for sleepiness detection [14,32,60]. However, average estimates of pitch and pitch variation (Fig 2D) suggested that there were no obvious effects on these features in our database. Furthermore, our classification pipeline applied to standard speech features performed worse than using the STM representation. We believe that these omissions were compensated by the richness of the STM representation. Pitch and pitch functionals will in fact be indirectly reflected in the STMs, which analyses sounds over a broad range of temporal scales simultaneously.

The possible physiological mechanisms that we put forward as a mediation between sleep deprivation and vocal features have to be considered as fully speculative for now. We did not collect the objective measures required to confirm or infirm these interpretations. The cognitive factor could be assessed with objective behavioral measures, such as the psychomotor vigilance test [22], or with brain imaging data [49,50]. The inflammatory factor could be assessed by biological analyses of e.g. cortisol in the saliva [8,57]. Because we have not gathered such measurements, we can only argue that both minor cognitive dysfunction and inflammation effects are likely for our participants as a group. In any case, the present study is the first one to suggest that such factors may be measured at the individual level from voice biomarkers, and it raises the possibility for future investigations to confirm or reject this hypothesis by actually correlating vocal features with more invasive objective markers.

The classification task we investigated consisted only in detecting whether a vocal recording was performed before or after sleep deprivation. We did not attempt to decode the effect of more subtle factors on the voice, such as the time of the day, which would reflect the interactions between circadian rhythms and sleep deprivation. These interactions have been shown in a recent study [61], albeit using a more severe deprivation protocol (60 hours without sleep). Unfortunately, our experimental design does not provide the statistical power to examine within-day variations before sleep deprivation (half the dataset) or the interaction between within-day variations and deprivation (second-order effect). In the same study, a regression approach was implemented to provide predictions beyond binary classification. Interestingly, this approach was successful only for predicting objective measures, such as sleep latency, but failed for subjective reports. This is consistent with the claim that subjective scales incompletely characterize the full effects of sleep deprivation and can usefully be complemented by objective measures such as voice analysis. In any case, as we did not collect objective measures of sleepiness beyond the voice, we did not attempt a regression analysis.

Finally, on a conceptual level, we wish to raise a basic but inescapable limitation of any study of sleep deprivation. Sleep deprivation may be defined, as we did, by the amount of sleep available to each individual. However, as has been repeatedly pointed out and again observed here, there is a remarkable diversity of responses to the same amount of sleep deprivation. Thus, it should not be expected that any one measure will capture all of the effects of sleep deprivation. Subjective reports may capture explicit feelings of fatigue, but be blind to implicit effects [61]. With objective measures, which are by necessity indirect, there is an issue with interpreting negative outcomes. In our case for instance, how to interpret a relatively poor accuracy for a sleep deprivation classifier, such as was observed for two participants? It cannot be decided whether this poor accuracy showed that sleep deprivation had no effect on these participants, or that sleep deprivation had effects that were not expressed in the voice, or that the classifiers simply failed for technical reasons. Measuring multiple markers of sleep deprivation, including the novel ones we suggest, and incorporate them into a holistic model of the neurophysiological effects of sleep deprivation seems to be a promising way forward.


Keeping these limitations in mind, the demonstration of vocal biomarkers for sleep deprivation could have major clinical implications. Subjective sleepiness reports do not capture the whole effect of a lack of sleep [61]. Moreover, such reports rely on the honest cooperation of participants, which is not a given if self-reports of excessive sleepiness can have negative work-related or financial consequences for the individual. Objective correlates of sleepiness exist [21,22], but vocal biomarkers would represent a considerably cheaper and faster alternative, requiring no specialized equipment and increasing their practicality for real-life clinical assessment. Crucially, our technique also goes beyond the simple binary detection of sleep deprivation: thanks to the application of interpretability techniques [39], we suggest that different neurophysiological processes related to sleep deprivation may be untangled through the voice alone. Such measures could in turn be used to design interventions tailored to each individual and situation, if the effects of sleep deprivation needed to be temporarily alleviated for instance. More generally, there is a growing realization that interpretability is key to future clinical applications of artificial intelligence, as both patients and clinicians would understandably want to understand the reason for a diagnostic [62]. For application to real-life settings, it is particularly interesting to identify features that do not correlate with subjective sleepiness, as one of the biggest dangers of sleep loss is the partial agnosia for one’s own sleepiness.

To finish, it is useful to point out that the methodological pipeline we introduced here is fully generic, as the audio features representation used is itself generic and the interpretation method can be applied to any classifier. Therefore, the present study could pave the way for future investigations of vocal biomarkers over the broad range of fundamental or clinical applications that are currently only starting to be considered [24,25].

Materials and Methods

Ethics statement

The study was conducted according to French regulations on human research including agreements from the Hotel-Dieu Hospital Ethics Committee (CPP Ile de France 1—N° 2017-sept.-13690), with signed consent from participants who received financial compensation. Our protocol was conducted in accordance with the 2016 version of the Declaration of Helsinki and the ICH guidelines for Good Clinical Practice.

Experimental design

A group of twenty-four healthy women between 30–50 years old (42.7 ± 6.8) took part in the experiment. This study was part of a dermatological study and only Caucasian phototypes I-IV (Fitzpatrick classification) were recruited. Participants were non-smokers and did not report a history of substance abuse. They had a Body Mass Index (BMI) between 19 and 25, no sleep disorders or chronic disease, no daytime vigilance issues (Epworth Sleepiness Scale ≤ 10), and were not under any medical treatment (exclusion criteria).

Before the experiment, participants wore an actigraph for 7 days and were instructed to maintain a regular sleep-wake behavior with their usual 7–8 h of sleep (i.e., in bed from 23:00–01:00 until 07:00–09:00). The compliance with these recommendations was verified through the actigraphic recordings (MW8, CamTech; UK) that were inspected by the research team at the participant’s arrival the morning before the first night of sleep restriction (day 1). No sleep episodes were detected outside of the scheduled experimental time in bed (see 40 for details). The protocol lasted for 3 days (day 1: before sleep restriction; day 2: during sleep restriction; day 3: after sleep restriction), which included 2 night of sleep deprivation (at the end of day 1 and 2). During the “sleep restriction” session, the participants were instructed to restrict their sleep time to 3h for 2 consecutive nights (i.e., in bed from 03:00 to 06:00) and to follow their usual routine outside the laboratory. After the second sleep-restricted night (day 3), the participants went to the laboratory on the morning and their actigraphy recordings were immediately analysed to ensure their compliance with the imposed sleep-wake hours. During day 1 (after habitual sleep and before sleep restriction: baseline condition) and day 3 of each session, the participants remained in the sleep laboratory from 09:00 to 19:00 under continuous supervision. In order to help the participants stay awake, from the moment they left the laboratory at the end of day 1 until their return to the laboratory at the beginning of day 3 at 09:00, two investigators exchanged text messages with the participants at random times during the entire period outside of the laboratory. Text messages were sent throughout the night (except during the period where participants were instructed to sleep, that is between 3 and 6 a.m.). Participants had to respond right after receiving these messages. In case of an absence of response, participants were immediately called on their personal phone. For lunch in the laboratory (day 1 and 3), participants received controlled meals consisting of a maximum of 2,500 calories/day with a balanced proportion of nutrients (protein, fat, and carbohydrates).

Voice recordings

During day 1 (before sleep deprivation) and day 3 (after), at three different times during the day (9am, 3pm, 5 pm), participants were seated and instructed to read 10 minutes of different chapters of the same French classic book: “Le Comte de Monte Christo” (Alexandre Dumas, 1844). Their voice was recorded with a portable recorder (Zoom H1/MB, stereo-recording). Then, during one minute, participants produced free speech, but these recordings were not used in the present analyses. Two participants had to be discarded at this stage, as technical issues prevented the completion of all recording sessions.

Baseline speech feature set

We computed basic speech features using the openSMILE library [45], in the configuration recommended for the Interspeech 2011 challenge. This feature set has been used for in state-of-the-art studies detecting sleepiness from voice recordings [33]. It consists of 59 low-level descriptors, including 4 energy descriptors, 50 spectral descriptors, and 5 voice descriptors. These descriptors were then combined with 33 base functionals and 5 f0 functionals, resulting in a total of 4,368 features.

Spectro-Temporal Modulations (STM)

The sound files, initially sampled at 44.1 kHz, were down-sampled to 16 kHz. Spectro-Temporal Modulations (STMs) were computed with our own toolkit, available on the repository associated with the paper and which is directly adapted from the standard NSL Toolbox [35]. Sounds were processed through a bank of 128 constant-Q asymmetric bandpass filters equally spaced on a logarithmic frequency scale spanning 5.3 octaves, which resulted in an auditory spectrogram, a two-dimensional time-frequency array. The STM were then computed by applying a spectro-temporal modulation filterbank to the auditory spectrogram. We generally followed the procedure detailed in [36], with minor adaptations. A 2D Fourier transform was first applied to the spectrogram resulting in a two-dimensional array [44] whose dimensions were spectral modulation (scale) and temporal modulation (rate). Then, the STM representation was derived by filtering the MPS according to different rates and scales and then transforming back to the time-frequency domain. We chose the following scale (s) and rate (r) center values as 2D Gaussian filters to generate the STMs: s = [0.71, 1.0, 1.41, 2.00, 2.83, 4.00, 5.66, 8.00] cyc/oct, r = ±[.25, .5, 1, 2, 4, 5.70, 8, 11.3, 16, 22.6, 32] Hz. Such a range covers the relevant spectral and temporal modulations of speech sounds as already used in different studies [63]. The resulting representation thus corresponds to a 4D matrix with dimensions of time, frequency, scale, and rate.

Classification pipeline

For all recordings, STMs were computed and used as the basis for the input feature space. STMs were computed with 22 rates x 8 scales x 128 frequencies per 40ms temporal windows. One data point for the classifier consisted of the average of 3841 successive temporal windows, yielding frames of 15 seconds. Thus, there were 22528 features for every 15-s long frame.

Standard machine-learning pipeline were then used [13,14,36] to predict whether a frame belonged to the sleep deprived class or not. First, the whole dataset was randomly separated into a training set and a testing set, either by randomly holding 25% of the data into the testing set or, only at population level, by holding-out the data from one subject to define the training and the testing in a Leave-One-Subject-Out (LOSO) cross-validation procedure. We then reduced this high dimensionality of the feature space by means of a principal component analysis (PCA). At the population level, we trained a PCA on the whole dataset and retained the 250 main dimensions, explaining 99% of the variance. We further checked that the exact choice of PCA dimensions did not affect our conclusions, about the performance but also about the interpretation of the classifiers (see S2 Fig). At the participant level, for each participant we trained a PCA on the data from all other participants, to reduce a possible contamination of the reduced space by peculiarities of the considered participant. We next retained the 30 main dimensions of the PCA. The number of PCA dimensions in this case was chosen empirically, so that the reduced feature space still explained more than 90% of the variance and provided a dimensionality lower than the number of frames available for each participant (between 98 and 194 frames of 15 s each), to avoid overfitting. We checked that the exact choice of PCA dimensions did not affect our conclusions, in particular on the interpreted features that are consistent for PCA dimensions above 30.

The PCA spaces were then fed to a support vector machine classifier (SVM) with a gaussian kernel (radial basis function). The training set was used to fit the SVM through an hyperparameter grid-search, using a stratified 5-folds cross-validation. The fitted SVM was then evaluated on the testing set by computing Balanced Accuracy (BAcc, defined as the average of true positive rate, or sensitivity, with true negative rate, or specificity). For the randomly selected train/test split, we repeated the fitting procedure 50 times, generating 50 distinct train/test sets for both the population and individual levels (denoted 50-splits, 25% test). In the Leave-One-Subject-Out (LOSO) approach, we replicated the fitting procedure with 22 subjects, each excluded once and designated as the testing set. In each instance, we computed the final balanced accuracies and then averaged them across either the 50 different train/test splits (in the case of 50-splits, 25% test) or across the 22 subjects (in the case of LOSO). Lastly, for each cross-validation procedure, we conducted a t-test against the threshold of 0.5 using the distributions of balanced accuracies. This allowed us to evaluate the classifier’s capability to predict sleepiness from voice, assessing its performance compared to random chance. All the classification pipelines from PCA to RBF + SVM are implemented with the sci-kit learn library [64].

Interpretation of the classifiers

Each classifier fitted in the study is probed with the reverse correlation technique which provides an interpretation of which features are relevant in the sense of the classifier. Theoretically, for each feature of the input space, a correlation is made between the array of random values from each noise sample with the array of decision values, 0 or 1, 0 corresponding to excerpts classified as before sleep restriction and 1 to excerpts recorded after sleep restriction. Here, as the noise were of null average in the feature space, we simply subtracted the average values of the noises that led to a classification in the class ‘after restriction’ with the average values that led to classification in the class ‘before restriction’ [65]. We refer to our method paper for a full description of the method [39]. Here, we used the version of the method which consists of pseudo-random noise as perturbation at the input of the system. Pseudo-random noises allow to accurately fool the classifier while using a white noise may implicate complication as the classifier can tend to classify all the stimuli + noise excerpt in only one class. One specificity of this method is that it requires a large number of trials to provide an accurate description of the importance of each feature in the input space. Here we chose to use a number of trials equal to 100 times the number of frames which represents between 9800 and 20000 trials. Each interpretation provides a “interpretation mask” which are composed of positive and negative values, positive values correspond to features which are diagnostic of sleep loss and negative ones conversely. Here we refer to this mask as the "discriminative information", which has also been termed "represented features" previously [39].

For each classification task, 50 classifiers were fitted. In order to test the independence of the prediction accuracy from the 50 different random training set. Each of these 50 classifiers were interpreted with the previously described method and a second test was then performed in order to test the similarities between the 50 interpretations. Pairwise Pearson’s correlation coefficients between all 50 interpretation maps were computed and then averaged.

Supporting information

S1 Fig. Acoustic difference before and after sleep deprivation before and after sleep deprivation computed as 2*abs(After-Before) / (After+Before)–for each subject ranked according to the classifier accuracies.

Units: Percent.


S2 Fig. Performance of the classifiers as a function of the number of principal components, averaged across the 22 subjects.

(Right) Average pairwise correlations between the interpretation masks for different number of Principal Components averaged across the 22 subjects. In order to evaluate for how many PCs the interpretations are stabilized, pairwise correlations between each masks, 1 one for each PC, has been done for each and averaged across the whole 22 subjects. We observe that the interpretation is stabilized around 20 PCs which. (Left) Error bars represent standard deviations.


S3 Fig. Average masks in the frequency-rate projection for all subjects (ranked by balanced accuracy).

Red parts indicate the areas of the frequency rate projections that are used by classifier to predict a sleepy voice excerpt and conversely, blue parts indicate frequency-rate parts that characterize a non-sleepy subject.


S4 Fig. Explained variance in percent with respect to the number of principal components for the interpretation-PCA analysis.



We would like to thank Andrey Anikin for suggesting the analysis presented in Fig 4.


  1. 1. Bin Y. S., Marshall N. S., Glozier N., Sleeping at the Limits: The Changing Prevalence of Short and Long Sleep Durations in 10 Countries. Am J Epidemiol 177, 826–833 (2013). pmid:23524039
  2. 2. Wang S., Rossheim M. E., Nandy R. R., Trends in prevalence of short sleep duration and trouble sleeping among US adults, 2005–2018. Sleep (2022) pmid:36152031
  3. 3. Leger D., Stepnowsky C., The Economic and Societal Burden of Excessive Daytime Sleepiness in Patients with Obstructive Sleep Apnea. Sleep Med Rev 51, 101275 (2020). pmid:32169792
  4. 4. Bayon V., Leger D., Gomez-Merino D., Vecchierini M.-F., Chennaoui M., Sleep debt and obesity. Ann Med 46, 264–272 (2014). pmid:25012962
  5. 5. Smiley A., King D., Bidulescu A., The Association between Sleep Duration and Metabolic Syndrome: The NHANES 2013/2014. Nutrients 11, 2582 (2019). pmid:31717770
  6. 6. Cappuccio F. P., Cooper D., D’Elia L., Strazzullo P., Miller M. A., Sleep duration predicts cardiovascular outcomes: a systematic review and meta-analysis of prospective studies. Eur Heart J 32, 1484–1492 (2011). pmid:21300732
  7. 7. Leger D., Andler R., Richard J. B., Nguyen‐Thanh V., Collin O., Chennaoui M., et al., Sleep, substance misuse and addictions: a nationwide observational survey on smoking, alcohol, cannabis and sleep in 12,637 adults. J Sleep Res 31, e13553 (2022). pmid:35088480
  8. 8. Faraut B., Cordina-Duverger E., Aristizabal G., Drogou C., Gauriau C., Sauvet F., et al., Immune disruptions and night shift work in hospital healthcare professionals: The intricate effects of social jet-lag and sleep debt. Front Immunol 13, 939829 (2022). pmid:36164341
  9. 9. Sabia S., et al., Association of sleep duration at age 50, 60, and 70 years with risk of multimorbidity in the UK: 25-year follow-up of the Whitehall II cohort study. Plos Med 19, e1004109 (2022). pmid:36256607
  10. 10. Kecklund G., Axelsson J., Health consequences of shift work and insufficient sleep. Bmj 355, i5210 (2016). pmid:27803010
  11. 11. Larsen A. D., Hannerz H., Møller S. V., Dyreborg J., Bonde J. P., Hansen J., et al., Night work, long work weeks, and risk of accidental injuries. A register-based study. Scand J Work Environ Heal 43, 578–586 (2017). pmid:28914325
  12. 12. Krajewski, J., Schnieder, S., Monschau, C., Titt, R., Sommer, D., & Golz, M., Large Sleepy Reading Corpus (LSRC): Applying Read Speech for Detecting Sleepiness in (2016), pp. 1–4.
  13. 13. Krajewski J., Schnieder S., Sommer D., Batliner A., Schuller B., Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech. Neurocomputing 84, 65–75 (2012).
  14. 14. Krajewski J., Batliner A., Golz M., Acoustic sleepiness detection: Framework and validation of a speech-adapted pattern recognition approach. Behav Res Methods 41, 795–804 (2009). pmid:19587194
  15. 15. VanDongen H. P. A., Vitellaro K. M., Dinges D. F., Individual Differences in Adult Human Sleep and Wakefulness: Leitmotif for a Research Agenda. Sleep 28, 479–496 (2005). pmid:16171293
  16. 16. VanDongen P. A., Baynard M. D., Maislin G., Dinges D. F., Systematic Interindividual Differences in Neurobehavioral Impairment from Sleep Loss: Evidence of Trait-Like Differential Vulnerability. Sleep 27, 423–433 (2004). pmid:15164894
  17. 17. MacLean A. W., Cynthia G., Saskin P., Knowles J. B., Psychometric evaluation of the Stanford Sleepiness Scale. J Sleep Res 1, 35–39 (1992). pmid:10607023
  18. 18. Shahid A., Wilkinson K., Marcu S., Shapiro C. M., STOP, THAT and One Hundred Other Sleep Scales. 209–210 (2011).
  19. 19. Hoddes E., Zarcone V., Smythe H., Phillips R., Dement W. C., Quantification of Sleepiness: A New Approach. Psychophysiology 10, 431–436 (1973). pmid:4719486
  20. 20. Durmer J., Dinges D., Neurocognitive Consequences of Sleep Deprivation. Semin Neurol 25, 117–129 (2005). pmid:15798944
  21. 21. Arand D. L., Bonnet M. H., Chapter 26 The multiple sleep latency test. Handb Clin Neurology 160, 393–403 (2019).
  22. 22. Lim J., Dinges D. F., Sleep Deprivation and Vigilant Attention. Ann Ny Acad Sci 1129, 305–322 (2008). pmid:18591490
  23. 23. Bougard C., VanBeers P., Sauvet F., Drogou C., Guillard M., Dorey R., et al., Motorcycling performance and sleepiness during an extended ride on a dynamic simulator: relationship with stress biomarkers. Physiol Meas 41, 104004 (2020). pmid:33164915
  24. 24. Editorial T. L. D. H., Do I sound sick? The Lancet Digital Health, e534 (2021). pmid:34446262
  25. 25. Fagherazzi G., Fischer A., Ismael M., Despotovic V., Voice for Health: The Use of Vocal Biomarkers from Research to Clinical Practice. Digital Biomarkers 5, 78–88 (2021). pmid:34056518
  26. 26. Morris G. O., Williams H. L., Lubin A., Misperception and Disorientation During Sleep Deprivation. M Archives Gen Psychiatry 2, 247–254 (1960).
  27. 27. Harrison Y., Horne J. A., Sleep Deprivation Affects Speech. Sleep 20, 871–877 (1997). pmid:9415947
  28. 28. Huckvale M., Beke A., Ikushima M., Prediction of Sleepiness Ratings from Voice by Man and Machine. Interspeech 2020, 4571–4575 (2020).
  29. 29. T. L. Nwe, H. Li, M. Dong, Analysis and detection of speech under sleep deprivation. Interspeech 2006, paper 1934-Wed2BuP.15-0 (2006).
  30. 30. Schuller B., Batliner A., Bergler C., Pokorny F. B., Krajewski J., Cychosz M., et al., The INTERSPEECH 2019 Computational Paralinguistics Challenge: Styrian Dialects, Continuous Sleepiness, Baby Sounds & Orca Activity. Interspeech 2019, 2378–2382 (2019).
  31. 31. Schuller B., Steidl S., Batliner A., Schiel F., Krajewski J., The INTERSPEECH 2011 speaker state challenge. Interspeech 2011, 3201–3204 (2011).
  32. 32. B. Günsel, C. Sezgin, J. Krajewski, SLEEPINESS DETECTION FROM SPEECH BY PERCEPTUAL FEATURES. 2013 Ieee Int Conf Acoust Speech Signal Process, 788–792 (2013).
  33. 33. V. P. Martin, J.-L. Rouas, P. Thivel, J. Krajewski, Sleepiness detection on read speech using simple features. 2019 Int Conf Speech Technology Human-computer Dialogue Sped 00, 1–7 (2019).
  34. 34. Martin V. P., Rouas J.-L., Micoulaud-Franchi J.-A., Philip P., Krajewski J., How to Design a Relevant Corpus for Sleepiness Detection Through Voice? Frontiers Digital Heal 3, 686068 (2021). pmid:34713156
  35. 35. Chi T., Ru P., Shamma S. A., Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118, 887–20 (2005). pmid:16158645
  36. 36. Patil K., Pressnitzer D., Shamma S., Elhilali M., Music in our ears: the biological bases of musical timbre perception. PLoS Comput Biol 8, e1002759 (2012). pmid:23133363
  37. 37. Thoret E., Caramiaux B., Depalle P., McAdams S., Learning metrics on spectrotemporal modulations reveals the perception of musical instrument timbre. Nat Hum Behav 5, 369–377 (2021). pmid:33257878
  38. 38. N. Mesgarani, S. Shamma, SPEECH PROCESSING WITH A CORTICAL REPRESENTATION OF AUDIO. 2011 Ieee Int Conf Acoust Speech Signal Process Icassp 1, 5872–5875 (2011).
  39. 39. Thoret E., Andrillon T., Léger D., Pressnitzer D., Probing machine-learning classifiers using noise, bubbles, and reverse correlation. J Neurosci Meth 362, 109297 (2021). pmid:34320410
  40. 40. Léger D., Gauriau C., Etzi C., Ralambondrainy S., Heusèle C., Schnebert S., et al., “You look sleepy…” The impact of sleep restriction on skin parameters and facial appearance of 24 women. Sleep Med 89, 97–103 (2022).
  41. 41. Ding N., Patel A. D., Chen L., Butler H., Luo C., & Poeppel D., Temporal modulations in speech and music. Neuroscience and Biobehavioral Reviews, 1–7 (2017). pmid:28212857
  42. 42. Joly O., Ramus F., Pressnitzer D., Vanduffel W., Orban G. A., Interhemispheric differences in auditory processing revealed by fMRI in awake rhesus monkeys. Cereb. Cortex 22, 838–853 (2012). pmid:21709178
  43. 43. Elliott T. M., Theunissen F. E., The Modulation Transfer Function for Speech Intelligibility. PLoS Comput Biol 5, e1000302 (2009). pmid:19266016
  44. 44. Thoret E., Depalle P., Mcadams S., Perceptually Salient Regions of the Modulation Power Spectrum for Musical Instrument Identification. Front. Psychol. 8, 1–10 (2017).
  45. 45. Eyben, F., Wöllmer, M., & Schuller, B. (2010, October). Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia (pp. 1459–1462).
  46. 46. Varoquaux G., Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage 180, 68–77 (2018). pmid:28655633
  47. 47. Studebaker G. A., A Rationalized Arcsine Transform. J. Speech, Lang., Hear. Res. 28, 455–462 (1985). pmid:4046587
  48. 48. Gosselin F., Schyns P. G., RAP: a new framework for visual categorization. Trends Cogn Sci 6, 70–77 (2002). pmid:15866190
  49. 49. Knutson K. L., Spiegel K., Penev P., V. Cauter E., The metabolic consequences of sleep deprivation. Sleep Med Rev 11, 163–178 (2007). pmid:17442599
  50. 50. Krause A. J., et al., The sleep-deprived human brain. Nat Rev Neurosci 18, 404–418 (2017). pmid:28515433
  51. 51. Chee M. W. L., Tan J. C., Lapsing when sleep deprived: Neural activation characteristics of resistant and vulnerable individuals. Neuroimage 51, 835–843 (2010). pmid:20171288
  52. 52. Yeo B. T. T., Tandi J., Chee M. W. L., Functional connectivity during rested wakefulness predicts vulnerability to sleep deprivation. Neuroimage 111, 147–158 (2015). pmid:25700949
  53. 53. Kaufmann T., et al., The brain functional connectome is robustly altered by lack of sleep. Neuroimage 127, 324–332 (2016). pmid:26712339
  54. 54. Bernardi G., et al., Neural and Behavioral Correlates of Extended Training during Sleep Deprivation in Humans: Evidence for Local, Task-Specific Effects. J Neurosci 35, 4487–4500 (2015). pmid:25788668
  55. 55. Hung C. S., Sarasso S., Ferrarelli F., Riedner B., Ghilardi M. F., Cirelli C., et al., Local Experience-Dependent Changes in the Wake EEG after Prolonged Wakefulness. Sleep 36, 59–72 (2013). pmid:23288972
  56. 56. Faraut B., Andrillon T., Vecchierini M.-F., Leger D., Napping: A public health issue. From epidemiological to laboratory studies. Sleep Med Rev 35, 85–100 (2017). pmid:27751677
  57. 57. Faraut B., Bayon V., Léger D., Neuroendocrine, immune and oxidative stress in shift workers. Sleep Med Rev 17, 433–444 (2013). pmid:23618533
  58. 58. Sauvet F., Leftheriotis G., Gomez-Merino D., Langrume C., Drogou C., Van Beers P., et al., Effect of acute sleep deprivation on vascular function in healthy subjects. J Appl Physiol 108, 68–75 (2010). pmid:19910332
  59. 59. Dolsen E. A., Crosswell A. D., Prather A. A., Links Between Stress, Sleep, and Inflammation: Are there Sex Differences? Curr. Psychiatry Rep. 21, 8 (2019). pmid:30729328
  60. 60. Martin V. P., Arnaud B., Rouas J.-L., Philip P., Does sleepiness influence reading pauses in hypersomniac patients? Speech Prosody 2022, 62–66 (2022).
  61. 61. Baykaner K. R., Huckvale M., Whiteley I., Andreeva S., Ryumin O., Predicting Fatigue and Psychophysiological Test Performance from Speech for Safety-Critical Environments. Front. Bioeng. Biotechnol. 3, 124 (2015). pmid:26380259
  62. 62. Ghassemi M., Oakden-Rayner L., Beam A. L., The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digital Heal 3, e745–e750 (2021). pmid:34711379
  63. 63. Pasley B. N., et al., Reconstructing Speech from Human Auditory Cortex. Plos Biol 10, e1001251 (2012). pmid:22303281
  64. 64. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., et al. (2011). Scikit-learn: Machine learning in Python. The Journal of machine Learning research, 12, 2825–2830.
  65. 65. Murray R. F., Classification images: A review. J Vision 11, 2–2 (2011). pmid:21536726