Bayesian structural time series for biomedical sensor data: A flexible modeling framework for evaluating interventions

The development of mobile-health technology has the potential to revolutionize personalized medicine. Biomedical sensors (e.g., wearables) can assist with determining treatment plans for individuals, provide quantitative information to healthcare providers, and give objective measurements of health, leading to the goal of precise phenotypic correlates for genotypes. Even though treatments and interventions are becoming more specific and datasets more abundant, measuring the causal impact of health interventions requires careful considerations of complex covariate structures, as well as knowledge of the temporal and spatial properties of the data. Thus, interpreting biomedical sensor data needs to make use of specialized statistical models. Here, we show how the Bayesian structural time series framework, widely used in economics, can be applied to these data. This framework corrects for covariates to provide accurate assessments of the significance of interventions. Furthermore, it allows for a time-dependent confidence interval of impact, which is useful for considering individualized assessments of intervention efficacy. We provide a customized biomedical adaptor tool, MhealthCI, around a specific implementation of the Bayesian structural time series framework that uniformly processes, prepares, and registers diverse biomedical data. We apply the software implementation of MhealthCI to a structured set of examples in biomedicine to showcase the ability of the framework to evaluate interventions with varying levels of data richness and covariate complexity and also compare the performance to other models. Specifically, we show how the framework is able to evaluate an exercise intervention’s effect on stabilizing blood glucose in a diabetes dataset. We also provide a future-anticipating illustration from a behavioral dataset showcasing how the framework integrates complex spatial covariates. Overall, we show the robustness of the Bayesian structural time series framework when applied to biomedical sensor data, highlighting its increasing value for current and future datasets.


Background
As mobile technology advances rapidly, the global mobile healthcare market is projected to be over 90 billion USD in 2022 [1]. Investment is bolstered by the great potential for advancing precision medicine in the near future [2][3][4]. Whereas medicine has previously focused on determining the right interventions, it is now more focused on for whom and when [5]. Identifying the right time for and timing of treatments remains relatively understudied, but this trend is expected to change soon as large streams of sensor data are released [6]. As sensor technology develops, data-rich features such as physical, chemical, behavioral, and biological variables will be measurable. In addition to time series data, spatial information is becoming more popular as well [7], all of which can be used for a more detailed understanding of interventions. A survey of distinct types of data are presented in Fig 1. Though increasing amounts of sensor data is being released and publications highlighting their usefulness in personal health exist [8][9][10], there is still a paucity of analytical methods in the biomedical field that can accommodate the complex covariate structures as well as the temporal and local trend considerations necessary to analyze these longitudinal data. Existing PLOS COMPUTATIONAL BIOLOGY methods applied to these situations are limited. In the simple case, t-tests can evaluate differences of averages before and after an intervention, but do not account for temporal aspects. Another type of method, auto regressive integrated moving average (ARIMA), is common for addressing longitudinally observed patterns that could inform the timing of an intervention but does not evaluate the impact and efficacy of such an intervention or provide dynamic pvalues (more details in S1 Text). Specifically, the ARIMA model makes use of lagged forecast errors in order to evaluate the best fit and forecast future values, though it is generally not set up to forecast multiple, connected time points in the future. Other methods like cumulative sum control chart (CUSUM) are used to detect changes in time series data, though they do not evaluate the degree and length that a change may last, and are thus not suitable for evaluating long term impact of an intervention [11][12][13].
While the nature of data from sensors and wearables will vary depending on the context, most of them share certain properties, such as being densely sampled and longitudinal. A variety of such data can be seen in Fig 1. It is important to discern these properties and to establish a flexible model for emerging sensor and wearable data, as it will have broad implications in fields such as personalized medicine [2]. Therefore, in this study we aim to contextualize a model and establish a flexible statistical framework to model various sensor and mobile health data. The model we adapted here is a combination of the Bayesian structural time series model and the Causal Impact model from Google [14]. The principles of this modeling framework stem from Bayesian inference and the analysis of time series data, which have been well established for decades [15]. While the idea of using Bayesian inference models has been extensively used in fields such as finance [16], ad campaigns [17], and marketing [18], they have been rarely used in the context of wearable and sensor data analysis to evaluate impact intervention over various time periods. A simple literature search for the application of Bayesian structural time series modeling of biosensor data yields few results. Therefore, here we apply this widely accepted statistical framework to the specialized context of biomedicine. We showcase the effectiveness of detecting the strength and duration of an intervention when applied to a variety of sensor data we collected and accurately assess the impact that various interventions have on individuals.

Modeling framework
To establish an intuitive modeling framework for data taken from sensors and other mobile health sources, it is important to lay out the structure as well as any assumptions that may be essential for the model. First, we must recognize that the dataset is a time series data with a known response variable that evolves over time, such as a person's weight. This response variable could be a direct measurement taken from a sensor, or it could be a derived value-calculated from various underlying variables measured by a device-such as calories burned [19] or a "Nike Fuel Score" [20]. The response variable's evolution in time is important, as it should be dynamic and change based on covariates and interventions. In the example of weight, the measured weight fluctuates over time based on things such as seasonality [21], temperature [22], or diet [23].
Because we are interested in the impact that an intervention has on such a variable, an important assumption is that we know when the intervention occurs as well as its duration. There exist methods that detect intervention times [24,25] though to jointly assess the impact and time of an intervention would result in a significant disadvantage in statistical power. Therefore, by having prior knowledge on the time of occurrence and duration of the intervention, we are able to more accurately assess the impact on an individualized level.
Given a specific intervention and its duration, we can split our time series into two segments-the pre-and post-intervention periods. These periods are used to describe all variables, known and unknown, as well as any parameters associated with them. For the example of weight and diet, the pre-intervention would be the period before the diet starts and the postintervention is the period after the diet starts. This is distinct from "post-cessation of intervention" (e.g., return to pre-diet eating following several weeks of dieting) which is not addressed by our present model.
Our goal is to model the pre-intervention period including covariates to most accurately assess the behavior of the response variable before any intervention has occurred. We also assume that any covariates used in our model should not be affected by the intervention, to provide a similar control in the post-intervention period. Though the post-intervention covariates themselves may change in value, the assumption is that they are derived from the same distribution as in the pre-intervention. Finally, using the model derived from the pre-intervention period and the covariates in the post-intervention, we can calculate a counterfactual in the post-intervention. The counterfactual predicts the response variable without intervention. It serves as a baseline to compare the actual observation in the post-intervention and ultimately is what we use to calculate the impact of an intervention. Compared to linear models, this framework allows for an evolving measure of impact, due to the dynamic confidence interval for the difference between counterfactual and observation that is inherent to using Bayesian structural time series. This temporal consideration, in addition to the more common advantage of using hyper-parameters and priors, is an important consideration for the Bayesian framework, and sets it apart from other models. This Bayesian structural time series framework can make use of complex covariate structures, which is useful and necessary to get an unbiased measure of impact. Two main types of covariates can be used: those that have known effects on the response and those that can account for hidden effects. The first type refers to covariates that could be correlated to the response, or cause changes to the response unrelated to the intervention. In relation to our example about weight, covariates of this type may include temperature, weather, or season. Since it is very unlikely to know all covariates of the first type that may perturb the response variable, we also introduce the notion of a second type of covariate, termed a "paired covariate". The paired covariate is an independent stream of sensor data of the same type as our original data in question. They could be collected in different locations, but the ideal case is they receive little to no treatment or intervention. If we imagine a scenario where the roommate of our subject is subject to many of the same external factors, but the roommate does not participate in the dieting, then this roommate's weight could be considered the paired covariate. Since both the original data and paired covariate share underlying biases, by using the paired covariates we can determine the true impact of the intervention.
As a whole, the Bayesian structural time series framework addresses many of the limitations of other methods and serves to advance the field by providing a structured method for analyzing data from sensors and mobile health sources. We give more details in the Statistical Formalism section and provide a schematic of the framework in Fig 2.

Bayesian structural time series
In order to understand the causal impact of interventions on longitudinal datasets that may be affected by a wide range of factors it is important to have a statistical framework that considers carefully the prior and posterior measurements as well as covariates that may affect the response variable. We detail aspects of this framework below and include a graphical schematic in Fig 2. We assume for a given time point, t, there exists an observation y t , linked to a variety of other parameters, a t : Here, a t represents all of the state parameters at time t. The parameters are defined below. As time progresses from t to t+1, the other parameters also progress due to their time dependency.
We use the following equations to represent our time series data [14,26].
Here we see that the equation involving the observation or response variable, y t , depends  on μ t , and additionally a variable x t to represent the covariates. Furthermore, we additionally include another layer of dependency, δ t , which μ t depends on. Because our observation, y t , can be biased, it is important to take into account as many covariates in the form of x t . Each equation has an error term represented by η t . Since each of the processes are different, the error term is also different. For example, the error associated with μ t is Z m t . These η t are drawn from some distribution satisfying Nð0; s 2 t Þ. Specifically, the parameters {σ μ , σ δ , σ y } correspond to the standard deviations of the distribution that each of fZ m t ; Z d t ; Z y t g are drawn from, respectively. In general, the {σ μ , σ δ , σ y } variables are initially sampled based on the prior distribution, usually a Gamma v 2 ; s 2 À � . Similar to other regression-based models, the covariates here are scaled by a coefficient vector β T . The model also employs a spike and slab method to penalize and select for important covariates when the number of covariates is large. Additionally, one important feature of each covariate is the posterior inclusion probability (PIP), which can be calculated from the sum of all posterior probabilities of all regressions that include that particular covariate [27]. The PIP gives a ranking measure to show how favorable the inclusion of a particular covariate is. The variable μ t , representing the local trend of the model, contributes to our response variable, and its representation is given as a time dependent equation. If we ignore the δ t variable, our equation becomes and this is a representation for the random walk. That is to say, in this simplified form of μ t , our observation is a random walk, where at each time point, there is some progression to the next time point in random fashion, based on the parameter η t,μ . However, one optimization that can be done for this parameter is the inclusion of δ t , which results in the pair of equations Here δ t follows a random walk, and the μ t equation is dependent on δ t . That is to say, δ t serves as a trajectory or slope parameter that helps to guide the behavior of μ t . If μ t is steadily increasing, it is likely that at μ t+1 there is also an increase, due to the inclusion of δ t . In other words, the δ t allows for more stability between each time point and serves a purpose similar to a slope parameter. We provide more examples in Figs A-E in S1 Text.
Evaluating intervention impact. While the above model describes longitudinal sensor data well, it is important to consider how the pre-and post-intervention periods differ. To determine if an intervention was effective in bringing about a change in our dependent measurement, y t , it is important to have a framework in which comparisons can be quantified. To do so, we define the pre-intervention period as time points t = 1,. . .,n, while the post-intervention period is defined as t = n+1,. . .,N. Furthermore, we define observations as y and predictions of the model as y 0 . Therefore, the set of observations in the pre-intervention period, y 1,. . ., n , serve as the training data, based on all covariates. At each time point t = 1,. . .,n, we update our parameter set of the model to better fit y 1,. . .,n using a Markov Chain Monte Carlo (MCMC) method of sampling from the posterior distribution. After estimation of parameters in the pre-intervention period, we predict the counterfactual in the post-intervention, y 0 n+1,. . ., N . This is done by using the parameter set defined in the pre-intervention period and the covariates from the post-intervention period. In particular, because a Bayesian approach coupled with MCMC is used to estimate the state parameter distributions, predicted values in the post-intervention are estimated based on distributions of parameters. This enables the counterfactual predictions to have reliable and accurate credible intervals, unlike many other time series methods. Additionally, because the counterfactual predictions are described as a distribution, these credible intervals can be propagated for the whole post-intervention period (which gives rise to the cone shape credible interval). Specifically, the counterfactual output value itself is the mean of the distribution and the credible interval is determined from the bounds (i.e. 95%) of the counterfactual distribution. Due to the stochastic nature of MCMC, small variations between runs of the framework may arise, though this can be mitigated with more iterations. Given y 0 n+1,. . .,N has an associated credible interval, we can calculate a significance p-value associated with the difference between y 0 n+1,. . .,N and y n+1,. . .,N . While we can get a p-value at every time point in the post intervention, it is more useful to consider the impact that an intervention had on the whole post-intervention period. The p-value associated with the full post-intervention time period (t = n+1,. . .,N) is known as the cumulative impact. It should be noted that the credible interval associated with y 0 n+1,. . .,N generally will increase as time progresses. This is one advantage of the Bayesian model and allows for an evolving cumulative impact. This is due to the fact that at every time point after the intervention, the variance associated with the distribution that the prediction is drawn from is compounded at every time point. It is useful to factor in this temporal aspect since the confidence that an intervention resulted in a causal impact may be dependent on time.
Biomedical adapter tool and software implementation. We provide a customized biomedical adaptor tool, MhealthCI, around a specific Google implementation of the Causal Impact model, which makes use of the bsts R package. Altogether, our wrapper uniformly processes, prepares, and registers diverse biomedical data. Specifically, time series data from biosensor data that are measured at different time points and intervals are unified so that given a time interval set t = {1,. . .,N}, there exists a set of variables (observation and covariates), y t and x t for each time point in t. The adaptor tool that applies Causal Impact and bsts to evaluate a user defined intervention and gives a report of results [14,28].

Results
To showcase the wide applicability of the Bayesian structural time series framework for biomedical applications, we provide examples of analysis from various types of sensor data. First, we apply the model to a real-world example-environmental sensor data collected from an air quality device-and show the usefulness of our model in identifying the effect of simple interventions in real world data. Second, we apply the Bayesian structural time series framework on data collected from an Android phone sensor in order to give intuition for-and demonstrate several key features of-our model. In this case, a known intervention is implemented, and allows for model comparison to ARIMA and model performance analysis. Third, we provide a core biomedical example that showcases a strong application of the framework and the potential it has in personalized medicine. In particular, we collected extensive biosensor data from a diabetes and exercise study which aims to understand how structured exercise can help to stabilize glucose levels throughout the day. The framework was then used to quantify the impact of an exercise regimen on clinical health markers of glucose level. We find a strong stabilizing effect on glucose after this exercise intervention and believe it is the first report of such a finding using data of this type. While these data are informative, they currently lack several attributes that we expect future biomedical data to have. In particular, we anticipate future biomedical datasets to have a large number of context-rich covariates (e.g., activity in different locations or environments) that can dramatically increase intervention assessment. Thus, we give a final example dataset rich in covariates to showcase how these (paired) covariates can be used creatively to more accurately assess the effect of interventions. Specifically, we collected human behavioral sensor data to showcase the patterns of crime across different neighborhoods in a city, and how the introduction of a mobile application aimed at increasing social cohesion affected these patterns. Below, we give more details regarding each of the specific sub-studies and their corresponding findings.

Simple real-world example: Environmental sensor data
We then transitioned to a simple real-world example where we performed a similar analysis using data collected from an AWAIR monitor, with measurements of CO 2 , dust, humidity, and temperature ( Fig 1E). The data were collected over a one-month period, with measurements taken every 15 minutes. The data were aggregated at each time point across all 30 days to give a smoother signal and limit the analysis to one intervention with a clearly delineated a pre-and post-intervention period, namely exposure of the room to people. It should be noted that though intervention often implies some treatment put in place, it can be widely adapted in the realm of Bayesian structural time series to allow for any disruption or change in status such as the effect that people have on environmental CO 2 levels. After correcting for covariates such as dust, humidity, and temperature [29], there was a significant increase in CO 2 in the hours when workers were in in the room with the sensor (p-value < 0.001). Fig 3A shows the CO 2 measurements across the aggregated time points in a 24-hour time frame. We can see that as CO 2 levels increased drastically after the 9am time point, the cumulative causal impact in the post-intervention period increased. This cumulative impact increases until around 5pm in the evening, where the cumulative impact tapered off, signaling a state was reached in the post-intervention that was similar to that of the pre-intervention period. Although there is one p-value associated with the whole post-intervention period, this framework can show confidence intervals for each time point in the post-intervention period as a function of both the observation and covariates. This helps us to define period of time where the intervention is most effective.

Wearable sensor timeseries with a controlled, defined intervention
To evaluate the performance of the BSTS modeling framework we designed an experiment with an intervention of defined size. In addition, we used this framework to demonstrate key features of the flexible covariate structure that set the BSTS framework apart from more commonly used methods. We used the Google Science Journal app on an Android phone and collected data for a person spinning and then extending the phone to arm's length while maintaining the same spin rate (intervention = change in centripetal force, see schematic in Fig 3B-3D). In the simplest case a single user's longitudinal data was used to estimate the effect of the intervention (Fig 3B). The model successfully detected the state change associated with the intervention, as shown by the cumulative change relative to the predicted values. For comparison, we also modeled these data using the more commonly used Autoregressive Integrated Moving Average (ARIMA) framework, which did not confidently detect an invention (p-value was greater than 0.05) due to larger confidence intervals (Fig 3B, left-middle panel). To further demonstrate the performance difference between the BSTS and ARIMA models, the size of the intervention was computationally altered (Fig 3B, middle panel). By scaling the effect size to 10 times larger than was experimentally generated (the equivalent of holding the sensor 30 ft away from the body while maintaining a constant rate of spin), we were able to compare the performance of two methods and show that our modeling framework confidently detected the intervention in half of the models when the effect size matched the experimental condition, while ARIMA required a 3x increase in the effect size to reach the same confidence.
Next, we demonstrate how the previous scenario is susceptible to noise, i.e. something that affects the signal but is not related to the intervention. In this case the noise is represented by a hop, which occurred after the intervention began and was therefore not predicted by the model (Fig 3C). In this case the model does not detect an effect of the intervention-the hop affected the signal to such a degree that the confidence in the prediction decreased, i.e. the credible intervals widened greatly. Neither BSTS, nor ARIMA confidently identified that an intervention occurred in this condition (Fig 3C, middle panel).
Finally, we show how including a paired covariate-another data stream that affects the state being measured but is not related to the intervention-can effectively correct for noise ( Fig 3D). In this case two sensors were held while spinning, but only one was subjected to the The intervention (arrival of office members) causes an increase in CO 2 and is determined to be impactful using the model. B) Accelerometer measurements of a person spinning in a chair holding sensor near body and then extended to arm's length at the intervention; C) with simulated "noise" produced by a hop during the intervention period and D) with a paired covariate (second sensor) that is not affected by the intervention but experiences the "noise" hop. Sensitivity analysis shows comparison of mhealthCI to ARIMA, with the vertical axis as the fraction of the intervention that correctly identified a non-zero impact. https://doi.org/10.1371/journal.pcbi.1009303.g003 intervention (extended away from the chest). Both, however, experience the noise (hop), which the model is then able to control. Here again the BSTS model correctly identifies the intervention similar to the case without noise (Fig 3B). The difference in the performance of the ARIMA model is greater even than in the simplest case, with a five times difference in effect size needed to detect an intervention with the same confidence as the BSTS model. A real-world analogy comparable to this simple experiment could be sleep data (accelerometer displacement) and the effect of changing one's pillow. Having no other information than one individual's sleep data is equivalent to the first experiment ( Fig 3B). Next, the "noise" that we showed with a hop could be loud neighbors moving in next door (adding noise in a literal sense), which leads to poor sleep in a way that is unrelated to the pillow intervention. Finally, the paired covariate that we demonstrated with a second phone could be another device measuring the sleep of one's partner sharing the same bed, but who did not change their pillow. This is distinct from a non-paired covariate that uses a different, relevant data stream. For example, a non-paired covariate could be a device that measures the decibel level in the room; it is relevant to the state (sleep quality), is not affected by the intervention (pillow change) and identifies the noise that affects the state (loud neighbors). Both paired and non-paired covariates are effective, however, a paired covariate has the characteristic that it is able to control for unknown confounders, akin to a control arm of a randomized controlled trial.

Core biomedical example: Clinical sensor data
We next provide a core biomedical example of how the framework can be applied to clinical sensor data. In particular, we collected biosensor data from a person with type 1 diabetes over a 12-week period who completed an exercise regimen (intervention). We chose to model this set of biomedical data because it is arguably one of the most established applications of personalized medicine today [30,31]. People with type 1 diabetes are advised to intensively monitor and manage their blood glucose to maintain it within the target range. Since this process is quite involved and requires continuous adjustments based on numerous biological factors, the result is a complex and high-stakes problem in personalized medicine. Fig 4A shows the data from a continuous glucose sensor and insulin pump as well as a comprehensive set of Apple Watch data from the study. The 12-week evaluation spanned an initial 2-week sedentary period followed by a 10-week exercise regimen. The Apple Watch and the insulin pump used by the patient provided several potential covariates, which could help us understand the glucose sensor data. We took glucose readings from the participant's glucose sensor and aggregated them into 24-hour values by transforming the values into two clinically relevant indicators of glucose stability, percent-in-target and percent-above-target [32]. In general, values above the target range are predictive of long-term organ damage, while values below the target range lead to acute hypoglycemia [33], a state of low blood glucose with symptoms that can range from mild fatigue and confusion to life-threatening coma, posing immediate threats to safety as well as long-term psychological consequences (e.g., fear of hypoglycemia [34]).
Maintenance of glucose levels in the target range is achieved by strategically managing a triad of factors: insulin administration, diet, and exercise. The timing and dosing of insulin (which decreases blood glucose) with carbohydrate ingestion (which increases blood glucose) must be carefully balanced. The role of exercise, however, is less clearly defined because it can either decrease or increase blood glucose during and up to 24hr after a session. Determinants of the direction and magnitude of the glycemic response to exercise are numerous and include 1) exercise characteristics (intensity and duration), 2) individual characteristics (endogenous insulin sensitivity, and the effect of physical fitness to increase insulin sensitivity), and 3) contextual factors (pre-exercise blood glucose level, insulin-and carbohydrates-on-board, and concentration of counter regulatory hormones) [30]. We applied our Bayesian structural time series framework to two different participants. Participant #1 entered the study with excessive blood glucose time above-target indicated by baseline glycosylated hemoglobin (HbA1c) 8.6% compared to recommended goal 7.0% [35]. Her HbA1c at the end of the exercise program reached the goal (6.9%), consistent with the lowering effect that exercise has on blood glucose. Participant #2 entered the study with extremely little blood glucose time above-target (baseline HbA1c 5.2%) which did not change by the end of the exercise program (HbA1c 5.3%). Our developed method accurately predicted both of these situations. Specifically, for participant #1, we found that the exercise regimen was effective in increasing the percent-in-target and decreasing the percent-above-target (p = 0.014 and 0.002 respectively). These results are shown in Fig 4B and 4C. Thus, our model supports the existence of a positive causal effect of this particular exercise regimen on maintaining a healthy glucose level for this individual. For participant 2, we found that the exercise regimen did not coincide with increased percent-in-target, which is supported by our model in Fig 4D  and 4E (p = 0.076 and 0.323 respectively). We also compared the model to a simple aggregation style test, which our model outperformed (see S1 Text). To further check the validity of our results, we also performed a negative control analysis to showcase no significant impact was detected when no intervention occurred (see S1 Text). Thus, our model accurately reflected that there was minimal or insignificant difference in this biological variable with the introduction of exercise. Furthermore, of all the covariates we used, insulin on board (IOB) was found to have the highest PIP. This is reasonable since insulin is essential to glucose control. In summary, the Bayesian structural time series framework was effective in determining the change for both of the participants analyzed.

Example with data rich in covariates: Human behavioral data
In our final example, we demonstrate how data rich in covariates can significantly improve assessments of intervention impact. While the example we provide is not clinical in the traditional sense, we provide this example as a way to showcase how the Bayesian structural time series framework performs when given extensive data and paired covariates-as we expect in the future. One aspect of a rich covariate set could be location information, an important factor in many clinical or personal health applications. For example, many watches collect GPS tracks of a run as well as steps and heart rate (e.g., Apple Watch, Garmin Forerunner). The Bayesian structural time series framework can flexibly accommodate location data through the inclusion of spatial correlation matrices. In addition, the spatial information can be used to segment the data where the intervention does and does not have an effect. By this method one can create "synthetic" paired covariates, in that data are from different spatial segments from the same source. However, spatial data are often protected for privacy concerns and are therefore less accessible to researchers. We therefore demonstrate the utility of the framework on more accessible behavioral data that share many characteristics: the effect of a social monitoring application called SeeClickFix on negative behavioral patterns (crime).
Crime patterns show similar characteristics to spatial mobile health data. They are affected by covariates such as temperature and precipitation [36] and can be linked to many other features such as census data of household incomes or education [37] (analogous to covariates of movement such as age, physical fitness, etc. [38]). SeeClickFix is a smartphone and web application developed to allow users to report issues in their communities including non-violent crimes. Posts can be voted on and supported by other users' comments and local government agencies acknowledge issues and post when they have been addressed. It has been hypothesized that SeeClickFix and similar tools may reduce crime through establishing social cohesion, and promoting collective efficacy [39].
Following the intervention (creation of SeeClickFix) there was no detectable decrease in crime across the entire area (Fig 5A), nor in particular neighborhoods (Fig 5B). However, one would expect many other factors besides the introduction of SeeClickFix to affect crime in this time (e.g., increases in the police force, changes to local employment, city-wide initiatives). The Bayesian structural time series framework can leverage the spatial information in these behavioral data to search for a paired covariate, in this case referring to locations not affected by the intervention (no SeeClickFix use) in order to control for other, unobserved effects on the outcome variable (increases in the city police force). We aggregated crimes and SeeClick-Fix posts by neighborhood (Fig 5C) and then modeled the crime using a neighborhood that was not affected by the intervention as a control (i.e. a neighborhood without SeeClickFix posts), but did experience any city-wide initiatives that might affect crime (Fig 5D). Neighborhoods with heavy SeeClickFix use showed an effect of the intervention on crime when controlling for unobserved factors with synthetic paired covariates (Fig 5E).

Discussion
In this paper, we demonstrated how the Bayesian structural time series framework can be applied to biomedical sensor data and personalized medicine, as well as created a wrapper software to facilitate the framework's use. We also demonstrated better performance as compared to other commonly used time series methods such as ARIMA. We successfully evaluated the impact of interventions in a variety of examples and additionally showcase how a rich dataset with complex covariates can benefit from the framework.
While some studies demonstrating the success of generalized linear models for mobile health data do exist [24,25,[40][41][42], there is a lack of emphasis on considerations for the temporal aspect of interventions and the effect-size of interventions at each time point in the postintervention period. Specifically, generalized linear modeling frameworks lack the flexibility to evaluate the intervention's impact cumulatively in the post-intervention period. In this study, we illustrate the benefit of using a Bayesian structural time series framework for modeling the behavior of various longitudinal data collected via apps and sensors. These data demonstrate properties that are commonly found across other wearable sensor data, which are increasingly gaining in popularity [3]. In order to effectively leverage such data for precision medicine and health in the future, we must understand how specific timings and types of interventions impact individual patients [2][3][4].
We show that the Bayesian modeling framework can take into account the rich covariates in our behavioral sensor dataset-specifically the paired covariate structure between different locations-and give results that are unbiased. Furthermore, the framework also considers the temporal properties, ensuring that predictions of intervention impact are conditional on duration of the intervention. This is especially important for future data from sensors and mobile health sources as one of the major concerns of personalized medicine is ensuring the right time and duration of a treatment or intervention is considered [5]. With time varying confidence intervals, we can begin to understand not only the effectiveness of an intervention but determine the most effective intervention plan necessary to reach a desired result.
When applying this method to various datasets, an additional consideration is how informative the covariates are. For example, in the case of the sensor data derived from the diabetes study, even though we were able to isolate the effect of the 10-week exercise regimen upon both metrics of glucose stability, more covariates could be used in the future to improve the prediction. Similar biological studies in the future could benefit from obtaining other clinically PLOS COMPUTATIONAL BIOLOGY relevant covariates such as plasma cortisol, epinephrine, growth hormone, glucagon, and directly sensed (not estimated) insulin levels or more to use high precision sensors with reduced noise. Furthermore, paired covariates (e.g., control group) should be taken into consideration when designing studies, as they can greatly increase model accuracy. While it is true that even an improved set of covariates could lead to more accurate results, we showcase here that the framework we have now is able to demonstrate results that converge with current literature showing favorable impact of exercise upon blood glucose control metrics [43]. However, these prior reports utilized HbA1c measured every 3-6 months as a chronic indicator of blood glucose control. We are the first study to our knowledge measuring changes in continuously measured blood glucose over several months, thus allowing a Bayesian time series approach that permitted analysis and adjustment for covariates at the individual participant level. We identified respective individuals whose blood glucose responded favorably and neutrally to the exercise intervention, after accounting for the blood glucose confounders of insulin dosing, physical activity outside of structured exercise, and autonomic nervous system function. This information would be useful feedback for the individuals and their healthcare providers since it can be otherwise challenging to predict whether exercise will increase or decrease blood glucose in type 1 diabetes. It is also important to consider the contexts in which this modeling framework does not yield significant advantage over other frameworks. For example, a stable and consistent intervention effect over the entire post-intervention period should be evaluated equally well by linear models. Given the computational complexity of the Bayesian method due to MCMC sampling for each time point, such considerations could be very important for large datasets. Furthermore, due to the nature of this type of model and the slight degree of randomness in parameter estimation, it is possible to have varying results in calculating the impact of the intervention. This is in contrast to the results found from linear models, which generally demonstrate a singular solution.
Additionally, we also note that one assumption of our model is that the intervention does not affect any of the covariates in the post-intervention period, and the strength of the results are based on such an assumption. In some cases, a synthetic paired covariate may be difficult to find. For example, though the exercise regimen is aimed at stabilizing glucose levels, we note that exercise and physical activity can have a general effect on various biological factors within the human body, including some of the covariates used. We minimize the effect of this through our transformation of variables and aggregation on a 24-hour block.
Overall, the Bayesian structural time series framework is a flexible modeling approach that, with relatively minor specialization we provide through our tool, can be applied to diverse biosensor data. It has features that few methods in biomedicine share, such as dynamic forecasts to observe the effect of an intervention and its evolution over time. These features are critical to advancing personalized medicine and realizing the challenge of relating genotype and phenotype data in the context of human research.

Data collection
There exist many datasets similar to those of mobile sensors and we should not confine ourselves to just the traditional accelerometer and gyroscope data that one would traditionally think of when envisioning sensor data. In studying the causal impact of an intervention, we show that most data sharing longitudinal qualities allow for development of algorithms and exploring how such algorithms can be useful in analyzing various sensor data. The data we analyze in this paper consists of environmental sensor data, physical activity sensor data and human behavioral sensor data.
Due to the longitudinal nature of sensor and wearable data, these data not only demonstrate interesting patterns in a response variable, but also are closely tied to the temporal property of a phenomena. By introducing the aspect of time, it becomes important to find models that leverage temporal considerations and use them to make accurate assessments about the data. Also, some of the data showcase complex covariates (paired) and can be used to better correct for unknown and hidden biases. More details about the data collection and data types are given below.
Google science journal data collection. Data were collected using the linear accelerometer measurement function in the Google Science Journal application on a Samsung Galaxy S8 smartphone. Each experiment lasted 20 seconds. One or two instruments were held at arm's length while spinning at a constant rate for 10 seconds. Next, one instrument was brought in close to the body for 10 seconds while maintaining the spinning rate. In the case of the noise simulation, the experimenter hopped once after 15 seconds (halfway through the intervention period). Data were exported from Science Journal and analyzed using the CausalImpact package in R.
AWAIR collection. Data were collected from a one-month period from the AWAIR device. The device was placed in an office lab setting where individuals frequented on a daily basis. CO 2 levels were measured in units of ppm. AWAIR also measures dust, temperature and humidity, which were used as covariates.
SeeClickFix data collection. The application SeeClickFix is a smartphone and web application developed in New Haven, Connecticut, where users report issues in their communities including non-violent crimes. SeeClickFix posts can be supported and commented on by other users, and local government agencies acknowledge and address issues. The SeeClickFix data are publicly available, providing a rich longitudinal and spatial dataset for monitoring behavior and interactions with other users and city representatives. Posts were aggregated by month for the New Haven metropolitan area and by neighborhood (n = 19) from 2007-2015.
Aggregated crime data were shared through a memorandum of understanding with the New Haven Police Department for 2000-2013. Rates were calculated using the 2014 ACS 5-year population estimates (crimes / 10K population per unit area).
Diabetes data collection. We used data from two participants in a single-group clinical trial that was evaluating an exercise intervention for previously sedentary adults with type 1 diabetes. Participants completed a 2-week baseline period then a 10-week exercise intervention, while wearing sensors that continuously monitored blood glucose, heart rate, heart rate variability and physical activity. They continued their normal prescribed insulin therapy, and shared device-recorded dosing logs with the research team. Besides these continuous measures, we assessed chronic diabetes control at the beginning of the baseline period and the end of the 10-week intervention using blood glycosylated hemoglobin concentrations. The 10-week intervention included motivational enhancement of exercise (i.e., patient-centered exercise coaching including instructional videos) and health feedback from biosensors (NCT04204733) [44]. Both were delivered through a customized mobile digital application and supported by a coach internally certified in exercise for diabetes (GlucoseZone TM , Fitscript LLC , New Haven, CT). The study was approved and overseen by the Yale University Institutional Review Board, and all participants provided written informed consent.
Biodata Collection. 1) 24hr blood glucose was measured every 5 minutes by the Dexcom G6 continuous glucose monitor (San Diego, CA), a subcutaneous wire sensor sampling interstitial fluid glucose content which is converted to estimated blood glucose (validated against venous blood glucose with mean absolute relative difference 9%) [45]. 2). Insulin was delivered according to each participant's usual prescribed therapy. The participants in this manuscript received lispro insulin via the Tandem t-slim Control IQ pump (San Diego, CA). The pump subcutaneously infuses insulin every 5 minutes according to the patient's individualized settings and current blood glucose levels using proprietary algorithms [46]. The patient can also manually dose insulin or adjust some standard settings for meals or other disturbances (e.g., planned exercise). Infusion doses are recorded, uploaded to a central server for exporting and analysis, and converted to estimated insulin on board by the manufacturer's proprietary pharmacokinetics algorithm. 3) Heart rate (beats per minute, validated against electrocardiography with mean absolute percentage error 1.1%-6.7%) [47] and heart rate variability (standard deviation of interbeat intervals, validated against electrocardiography with intraclass correlation coefficient 0.98) [48] were measured by the Apple Watch 3 (Cupertino, CA) using photoplethysmography. 4) Physical activity was measured by the Apple Watch 3 using accelerometry and converted to kcals per day (validated against calorimetry with mean absolute percentage error~40%) [49]. 5) Diabetes control was measured by glycosylated hemoglobin. Participant #1 used the DCA Vantage Analyzer (Bayer, Tarrytown, NY) at baseline and PTS Diagnostics A1cNow+ (Indianapolis, IN) at 10 weeks. Participant #2 used the AccuBase A1c Home Test Kit (DTI Laboratories, Thomasville, GA).
Computational Method. A computational method was developed in order to accurately determine the response to the exercise study. There were three phases to the algorithm: 1) align the imported data to the today study time period 2) match apple watch, insulin and glucose data to five-minute intervals 3) causal impact analysis. This alignment is required to test impact from the covariates. Following alignment, the casual impact analysis took covariates, target range, intervention period, and other features to predict the total time in range, above range, and below range. (more details in S1 Text).
Clinical Outcomes. Participant #1 was a 63-year-old white non-Hispanic female with type 1 diabetes for 50 years, receiving 81 units/day of insulin (0.9 units/kg body weight/day) and performing no regular exercise at baseline. During the 10-week intervention she received 93 units/day of insulin (1.1 units/kg body weight/day) and exercised on average 2.5 days per week, 26 minutes per session, at easy to moderate intensity (Borg rating of perceived exertion 2.5 / 10). Participant #2 was a 53-year-old white non-Hispanic female with type 1 diabetes for 35 years, receiving 22 units/day of insulin (0.3 units/kg body weight/day) and performing no regular exercise at baseline. During the 10-week intervention she received 19 units/day of insulin (0.2 units/kg body weight/day) and exercised on average 4.3 days per week, 40 minutes per session, at moderate to hard intensity (Borg rating of perceived exertion 4.0 / 10). The exercise routines were dynamic, interval-based, and equally emphasized all major muscle groups.
Processing and analysis of all data was done in R and Python and our packaged tool can be found at https://github.com/gersteinlab/mhealthci. Supporting information S1 Text. Supplemental information text file with additional details for manuscript. Fig A.  Generated Data. Fig B. Using a linear model with no covariates to model the generated data.