Wastewater-Based Epidemiology of Stimulant Drugs: Functional Data Analysis Compared to Traditional Statistical Methods

Background Wastewater-based epidemiology (WBE) is a new methodology for estimating the drug load in a population. Simple summary statistics and specification tests have typically been used to analyze WBE data, comparing differences between weekday and weekend loads. Such standard statistical methods may, however, overlook important nuanced information in the data. In this study, we apply functional data analysis (FDA) to WBE data and compare the results to those obtained from more traditional summary measures. Methods We analysed temporal WBE data from 42 European cities, using sewage samples collected daily for one week in March 2013. For each city, the main temporal features of two selected drugs were extracted using functional principal component (FPC) analysis, along with simpler measures such as the area under the curve (AUC). The individual cities’ scores on each of the temporal FPCs were then used as outcome variables in multiple linear regression analysis with various city and country characteristics as predictors. The results were compared to those of functional analysis of variance (FANOVA). Results The three first FPCs explained more than 99% of the temporal variation. The first component (FPC1) represented the level of the drug load, while the second and third temporal components represented the level and the timing of a weekend peak. AUC was highly correlated with FPC1, but other temporal characteristic were not captured by the simple summary measures. FANOVA was less flexible than the FPCA-based regression, and even showed concordance results. Geographical location was the main predictor for the general level of the drug load. Conclusion FDA of WBE data extracts more detailed information about drug load patterns during the week which are not identified by more traditional statistical methods. Results also suggest that regression based on FPC results is a valuable addition to FANOVA for estimating associations between temporal patterns and covariate information.


Introduction
Illicit drug use is a growing global health concern, and it is estimated that around a quarter of the European adult population has used illicit drugs at some point in their lives [1]. In Europe, central nervous system stimulants such as amphetamine and ecstasy (MDMA) are among the most commonly used illicit drugs [1]. The drugs may cause appetite suppression and euphoria with feelings of increased confidence, sociability and energy, making them popular drugs of abuse, particularly in the young [2]. Stimulant use has, however, numerous negative effects, such as insomnia, anxiety, mood disturbance, violent behaviour, dependence and psychosis making them a public health concern [3].
Because of this considerable health risk, reliable estimates of the extent of drug use in a population are important for health professionals and policy makers. Traditionally, estimates of the consumption of stimulants are calculated from data collected from sources such as treatment programmes [4], hospital emergency departments [5,6], drivers apprehended by the police [7,8], prisoners [9] and from population surveys (e.g., internet, population, school) [10]. These types of data, however, have their limitations, mostly related to difficulties in capturing representative survey populations. General population surveys may have poor response rates and there is often unwillingness to supply information about an activity that may have a social stigma or legal implications [10]. Further, while data from drug treatment programmes may underestimate prevalence because of limited places in treatment, data gathered from the police may overestimate prevalence as investigations are targeted towards selected populations [5][6][7][8][9].
Wastewater-based epidemiology (WBE) is an alternative and complementary approach for estimating the collective illicit drug use in a community [11]. The concentration of various illicit drugs in the wastewater can be measured directly, overcoming the problems related to surveys and sampling bias. WBE has shown promising results, at both local national and international level [11][12][13], and analyses of wastewater data have indicated differences in drug loads detected in wastewater on weekdays and at weekends [14][15][16]. However, as WBE is a relatively new research field, data are often analysed using simple statistical methods which do not take the temporal nature of the data fully into account, potentially overlooking important information. The aim of this study was to move beyond the simple statistical analyses often applied to wastewater-based data, in order to explore whether more advanced statistical methods can extract more information about the patterns of stimulant use.
We reanalysed a WBE dataset on 42 European cities [17] using the framework of functional data analysis (FDA), a statistical method specifically developed for analyzing temporal data [18], and we compared these results with more traditional statistical analyses. For the purpose of the study, we selected two drugs with different patterns throughout the week; ecstasy (MDMA) which is mostly a "party drug" with high expected weekend loads, and amphetamine which is expected to be used more regularly throughout the week [13]. The main temporal features for the illicit drugs throughout the course of a week were estimated using functional principal component analysis (FPCA). FPCA has recently been applied for improved statistical analysis of glucose regulation [19] and monitoring of fetal movement [20] among other things. In order to explore whether differences in temporal drug loads in the wastewater between cities could be related to various geographic or other urban characteristics, we performed both functional analysis of variance (FANOVA) as well as multiple regression analyses on the FPCA results.

Data Material
No specific permissions were required for the present study. The use of wastewater data to study trends in illicit drug use in large catchment areas does not raise any major ethical issues as individuals cannot be identified, and thus cannot be harmed by such a study. The ethics of this approach has been thoroughly discussed in a previous paper [21].
Raw sewage was collected from the inlet of 47 sewage treatment plants in 42 cities from 21 European countries, servicing a combined population of approximately 24.7 million inhabitants. Samples were collected from each location over seven consecutive days, starting for 36 of the 42 cities on Wednesday 6th March 2013 and ending on Tuesday 12th March 2013. For the remaining six cities sampling during this week was not possible, and a different week in the same month was chosen. At all locations, automated sampling devices were used to collect subsamples over 24 hours. These subsamples were then pooled to a 24 hour composite sample. For cities with more than one sewage treatment plant, results were combined to a city average using a weighted mean. More background information, details regarding wastewater plant (WWTP) characteristics and so on can be found in an earlier publication based on the same material [17]. The data sets supporting the results of this article are also freely available [17]. For this study, we selected two drugs with very different consumption patterns; ecstasy (MDMA) which is mostly a party drug and amphetamine which is mostly used in more regular amounts throughout the week [13].
A specific tailored questionnaire was developed in cooperation with local sewage and treatment plant operators in order to evaluate information about the structure state of the sewer and the variability of the population size [17]. For the purpose of this analysis, daily mass loads were normalized by the population size of the catchment (mg/10000 people/day). Moreover concentrations for each drug below the limit of quantification (LOQ) were replaced by LOQ/2 [22] if at least one day in the week had a concentration value above the LOQ. Cities with no measurements above LOQ were excluded. Four cities (9.5%) were excluded for ecstasy (MDMA) and nine cities (21.4%) were excluded for amphetamine. Information on the characteristics of each city and WWTP is provided as supplementary material (S1 Table).

Data description
The unit of observation in the analyses is a seven day week starting Wednesday and ending Tuesday. For six (14.3%) cities, the data sampling started later in the week. Missing data for the two drugs across all the 42 cities ranged from 1.7% to 2.2%. This is low [23], but since functional data analysis (FDA) requires complete datasets we performed single imputation [24] before proceeding with the analysis on the imputed dataset.
The drug loads for weekdays (Mon-Fri) and weekends (Sat-Sun) were described using median and quartiles (Q1, Q3). Wastewater drug load data was heavily skewed, and the data was log-transformed prior to further analysis.

Traditional data analysis
For the log-transformed data for each city we calculated the overall mean throughout the week, the area under the curve (AUC) and the difference d between weekdays and weekends. The significance of the latter was assessed using the Wilcox test.

Functional data analysis
The temporal pattern of wastewater drug loads was analyzed using FDA [18]. In FDA, mathematical functions are first fitted to the observations. Statistical analysis is then performed on the fitted functions rather than the original data. The seven consecutive observations for each of the 42 European cities were discrete samples of an underlying continuous process, and were converted into 42 continuous smooth curves using B-splines with seven basis functions [18,25]. The optimal smoothing of the functions was estimated using the generalized cross validation (GCV) criterion [26] with a single choice of smoothing parameter for all cities [27]. The smoothing parameter was defined as proportion of the integrated square second derivative of the fitted curves [25]. This smoothing removes the random day-to-day variation, e.g. non-systematic error, measurement error and normal fluctuations in the load of drugs, and so extracts the underlying temporal behaviour.

Functional principal component analysis
Principal component analysis (PCA) is a statistical methodology which is used to reveal the internal structure of the data in order to explain variability [28]. Functional principal component analysis (FPCA) is a generalization of traditional PCA to functional data [18]. A common practice in FPCA is to first normalize the data, that is, to first subtract the mean, as the mean curve is a mode of variation that tends to be shared by most curves [25]. However, as we were interested in the mean temporal differences in drug loads between cities, and also compare to traditional statistical methods, we did not normalize the data. The percentage of explained variation for each FPCs thus cannot be interpreted in the same way as for PCA on normalized data.
We used FPCA to identify the main temporal patterns across the 42 fitted smooth curves. The result of an FPCA is a set of mutually uncorrelated functional principal component (FPC) curves, which explain the main modes of temporal variability across the fitted curves for all cities. The analysis further provides each city with a score for each of the extracted FPC curves, representing the intensity with which that particular temporal pattern is present in the fitted function for that particular city. Cities with close to zero scores on all FPCs have temporal drug loads similar to the overall mean curve, while cities with a high score on a particular FPC have a temporal drug load closer to that specific FPC pattern. Each estimated FPC was interpreted and labelled according to the temporal information it exhibited.
The association between the more traditional statistical measures of wastewater drug loads and the FPCA was assessed by calculating the Pearson correlation coefficient (r) between the FPC scores, the overall mean of the log-transformed data, AUC and the difference d between weekdays and weekend means.

Functional analysis of variance
To move beyond mere exploration of patterns, we wanted to see whether the various temporal patterns of wastewater drug loads throughout a week were associated with basic characteristics of the city: latitude; longitude; gross domestic product (GDP) of country; relative size of the city, i.e., number of inhabitants in the city divided by the number of inhabitants in the country and density of the city, i.e., number of inhabitants in the city divided by the urban area of the city.
Traditional analysis of variance (ANOVA) explores the mean difference in a continuous response between the various categories in a categorical explanatory variable [29]. Functional analysis of variance (FANOVA) is the generalization of ANOVA to functional outcomes [18], and is often the suggested approach when exploring covariates in FDA [18]. We used FANOVA to analyze the effect of the five possible predictors, listed above, on the shape of the wastewater drug load curves [25]. As FANOVA must have categorical covariates we had to dichotomize each of the continuous explanatory variables and compared the mean curves in the two resulting groups. We explored the impact of cut-off point by selecting cut-off points across the whole observed range of the covariates, and considered p-values <0.05 to be statistically significant. Functional confidence intervals (95%) and p-value curves, as well as an overall p-value, were calculated for each covariate using a functional permutation F-test [25].

Multiple regression
FANOVA can be seen as a univariate ANOVA problem for each specific point in time [18], and thus cannot control for covariates. In order to explore multiple predictors simultaneously, without the need for dichotomization, we used the cities' scores for the estimated FPCs as outcome variables in multiple linear regression models. From the full multiple model, including all covariates, an optimal sub-model was chosen using Akaike's Information Criterion (AIC) [30]. AIC is a weighting between model parsimony and fit to the data, and is a measure of the "goodness" of a model [31], and can be used to compare statistical models.

Robustness analysis
To explore the robustness of the FDA results, i.e. whether temporal patterns would emerge purely by chance due to the nature of the curve fitting process, we also performed all of the above FDA on a dataset obtained by random sorting of the original data.

Software
All analyses were performed in R 3.1.0 [32]. The imputation was performed using Amelia II and the amelia package [33], and FDA, FPCA and FANOVA using the fda package [25].

Data summary
Wastewater drug loads for the 42 cities throughout the week are shown in Fig 1.1 and 1.2, and summarized in Table 1. The median load of ecstasy (MDMA) increased significantly at the weekend (p<0.001) but not for amphetamine (p = 0.369). The overall mean of the log-transformed data throughout the seven day week was highly correlated (r = 0.999) with the area under the curve (AUC) ( Tables 2 and 3).

Functional principal component analysis
The original data with the corresponding fitted smooth curves of log-transformed data are shown in Fig 1.3 and 1.4.
For both drugs, the first functional principal component (FPC1; representing the normalization of the original data) explained more than 90% of the total temporal variation between the fitted curves for the cities (Fig 1.5 and 1.6), while the three first FPCs in sum explained more than 99% of the total variation (Fig 1.5-1.10). For ecstasy (MDMA) the first FPC (FPC1) mainly represented the general level, including a small increase towards the second half of the weekend (Fig 1.5). The second FPC (FPC2) represented an additional increase in the second half of the weekend (Fig 1.7), and the third FPC (FPC3) the timing of this weekend peak (Fig 1.9). For amphetamine, FPC1 represented the general level and this alone explained 98.9% of the variance (Fig 1.6). FPC2 represented a miniscule, linear increase throughout the week (Fig 1.8) and FPC3 an increase at the weekend (Fig 1.10). Pearson's correlation between the simple summary measures and the scores for the first three FPCs for ecstasy (MDMA) and amphetamine are shown in Tables 2 and 3. For ecstasy (MDMA), the mean week load and the AUC statistics were almost perfectly correlated with FPC1 (r = 0.999), while the difference between weekday and weekend loads was moderately correlated with FPC2 (r = 0.762) and FPC3 (r = -0.497). For amphetamine, the mean week load and the AUC statistics were almost perfectly correlated with FPC1 (r = 0.999), while the difference between weekday and weekend loads was moderately correlated with FPC3 (r = -0.760). The rest of the correlations were miniscule.

Functional analysis of variance
The functional analysis of variance (FANOVA) showed that latitude, longitude and gross domestic product (GDP) were all potentially associated with ecstasy (MDMA) load throughout the week, depending on the cut-off value, while only latitude and GDP were associated with the mean load curve for amphetamine. For some of the predictors, choice of cut-off value for the dichotomization had a major impact on the estimated significance of the difference between the mean of the corresponding groups (Fig 2).

Multiple regression analyses
For ecstasy (MDMA), multiple linear regression using FPC scores as outcome showed that longitude was negatively associated with scores on FPC1, i.e. the general level of the ecstasy load in the wastewater tended to increase in a westerly direction, while latitude was positively associated with scores on FPC1, i.e. the level of the ecstasy load tended to increase in a northerly direction ( Table 4). The longitude of the city was also negatively associated with scores on FPC2, i.e. more pronounced weekend peaks were observed in a westerly direction. For amphetamine, multiple linear regression using FPC scores as outcome showed that latitude was positively associated with scores on FPC1, i.e. the level of the amphetamine load in the wastewater tended to increase in a northerly direction. Latitude was also negatively associated with the scores on FPC3, i.e. more pronounced weekend peaks were observed in a northerly direction (Table 5).

Robustness analysis
For the randomly sorted data, the smoothing parameter for both drugs was the local maximum of the chosen interval, indicating that all the variability from the data was composed by random variation.
The first FPC explained 64-71% of the total temporal variation between the fitted curves, while the second and third FPCs explained 30-36% and 0% of the total variation respectively. Neither of the FPCs showed any specific pattern. Moreover, the functional permutation F-test of the FANOVA analysis was not able to distinguish between groups of curves for either of the drugs (not shown).

Discussion
The objective of wastewater-based epidemiology (WBE) is to provide objective and reliable estimates of the abuse of various drugs within a population. It is a relatively new methodology within the health sciences, but has already shown promising results [11][12][13][14][15][16][17], and promises to be a valuable addition to more traditional data sources. How to best analyse such data is, however, unclear.
As in many medical research fields, the analysis of WBE data tends to be performed by researchers with their primary field of expertise outside of statistics, and WBE data have generally been analyzed using traditional statistical methods, such as simple summary measures and specification tests [13,15], which focus only on level of use [13,17]. Simple statistical methods have the advantage that they are easily understood and performed by most quantitative scientists. Such methods are, however, problematic if they do not utilize the data properly or, worse, lead to the wrong conclusions. Using traditional statistical methods and specification tests we were not able to identify any weekend pattern for amphetamine throughout the week, but we were able to demonstrate this using functional principal component analysis (FPCA). Understanding temporal patterns of stimulant drug consumption could help us to understand the extent of illicit drug problems better and suggest more effective preventive actions. This study is the first to use the framework of FDA to extract shape information from wastewater-based drug load data. While the mean of the fitted curves obtained from FDA represents information about the use of these two substances across Europe which was already known [13,17] using FDA, and in particular FPCA, we were also able to extract valuable, nuanced temporal information on the use of stimulant drugs throughout the week that simpler statistical methods missed.
FPCA decomposes the variation between curves into a set of uncorrelated temporal features, but the usefulness of this analysis depends on how the FPCs are interpreted. In our study, FPC1 mainly represented the general drug load, accounting alone for more than 90% of the temporal variability between cities. Interestingly, AUC was almost perfectly correlated with the FPC1 for both drugs, demonstrating that AUC carries both valuable and precise information about an important part of the temporal drug load.
The second and third FPCs roughly represented how pronounced a weekend peak was and the timing of such a peak. Even though they account for only 0.1-6.2% of the temporal variability, they paint a more nuanced picture of the drug use pattern that would be lost when using traditional statistical analyses. The difference d between weekday and weekend means was somewhat correlated with FPC2, but neither of the simple summary measures can be said to capture fully the information in the FPCs beyond FPC1.  Our results suggest that even when considering a drug with a smoothed behaviour throughout the week such as amphetamine, FPCA is able to capture difference in variability between weekdays and weekends. Moreover, since the second and third FPCs are uncorrelated, our analysis was able to untangle the part of the variability mainly due to the increasing drug load at the weekend and the timing of such an increase. FDA results also demonstrate that the "weekend" is a somewhat less well defined time period than the traditional cultural understanding of it. Using this approach one may estimate what constitutes the "weekend" for each city and each drug, without having to define it a priori, as is needed when applying standard statistical tests.
Performing multiple regression analyses using FPC scores as outcome variables, we found that the temporal patterns were associated with the geographical position of the city; the load of ecstasy increased significantly in north-west Europe, while the load of amphetamine increased in a northerly direction. This is generally in line with previous findings [34].
Usually, FANOVA is the suggested way to analyse the association between functional data and covariates [18]. However, FANOVA needs dichotomous explanatory variables, and most of the predictors that we investigated were continuous. Categorizing continuous predictors in regression models has been thoroughly examined in the statistical literature, and repeatedly argued against, as it reduces power and introduces bias of unknown direction and magnitude [35][36][37]. In our study applying FANOVA would introduce bias in the analysis, due to the arbitrary choice of the city groups [35]. The significance of the F-test strongly depended on the chosen cut-off level of the explanatory variable. Moreover, FANOVA cannot adjust for other covariates in a multiple regression model and it only looks at the mean temporal pattern. While this will verify differences between cities, it will not identify the mode of the difference. Our suggested multiple regression is not part of the original FDA framework, but opens for more flexibility. It has been proposed previously for the analysis of glucose and fetal movement data [19,20].
FDA shows several advantages over traditional approaches to analysing temporal data from wastewater treatment plants, and should be explored for drugs other than the two stimulant drugs we studied here. However, a major issue in the application of FPCA is the ability to  interpret the patterns shown by the most important FPCs, so as to consider appropriate predictors to explain them. Also, our use of FANOVA as the sole means of introducing covariate information in FDA regressions could be explored further.
FDA is developed for analysing temporal data, but a potential concern is that the smooth basis functions applied in FDA run the risk of smoothing over more abrupt changes in the drug load throughout the week, consequently underestimating for example the difference between week and weekend load. Wavelets have a long tradition in time series analysis, and allows for statistical modelling of less smooth temporal changes. Wavelet based PCA has recently been applied successfully to among others fetal movement data [38,39] and a similar approach might be worthwhile exploring also for WWA data.
Other future research should include more information about the structure of the sewage system, as well as longer periods of observation and more cities to further improve the statistical analysis, monitor the temporal variation and achieve a better overall picture of the use of illicit drugs in Europe.
Supporting Information S1