Comparative Effectiveness of Cognitive Therapies Delivered Face-to-Face or over the Telephone: An Observational Study Using Propensity Methods

Objectives To compare the clinical and cost-effectiveness of face-to-face (FTF) with over-the-telephone (OTT) delivery of low intensity cognitive behavioural therapy.
Design Observational study following STROBE guidelines. Selection effects were controlled using propensity scores. Non-inferiority comparisons assessed effectiveness.
Setting IAPT (improving access to psychological therapies) services in the East of England.
Participants 39,227 adults referred to IAPT services. Propensity score strata included 4,106 individuals; 147 pairs participated in 1:1 matching.
Intervention Two or more sessions of computerised cognitive behavioural therapy (CBT).
Main outcome measures Patient-reported outcomes: Patient Health Questionnaire (PHQ-9) for depression; Generalised Anxiety Disorder questionnaire (GAD-7); Work and Social Adjustment Scale (WSAS). Differences between groups were summarised as standardised effect sizes (ES), adjusted mean differences and minimally important difference for PHQ-9. Cost per session for OTT was compared with FTF.
Results Analysis of covariance controlling for number of assessments, provider site, and baseline PHQ-9, GAD-7 and WSAS indicated statistically significantly greater reductions in scores for OTT treatment with moderate (PHQ-9: ES: 0.14; GAD-7: ES: 0.10) or small (WSAS: ES: 0.03) effect sizes. Non-inferiority in favour of OTT treatment for symptom severity persisted as small to moderate effects for all but individuals with the highest symptom severity. In the most stringent comparison, the one-to-one propensity matching, adjusted mean differences in treatment outcomes indicated non-inferiority between OTT and FTF treatments for PHQ-9 and GAD-7, whereas the evidence was moderate for WSAS. The per-session cost for OTT was 36.2% lower than FTF.
Conclusions The clinical effectiveness of low intensity CBT-based interventions delivered OTT was not inferior to those delivered FTF except for people with more severe illness where FTF was superior. This provides evidence for better targeting of therapy, efficiencies for patients, cost savings for services and greater access to psychological therapies for people with common mental disorders.


Introduction
The programme to improve access to psychological therapies (IAPT) is the most significant development in English mental health services since the closure of the asylums and the advent of community care. Whilst those developments concerned severe mental illness and secondary care, IAPT targets mild to moderate depression and anxiety. These are common conditions seen frequently in general practice [1,2]; they cause enormous disability at the population level [3,4]. The financial cost of depression in the UK was estimated at approximately 150 billion pounds in 2009/2010, of which 30 billion is thought to be related to inability to work [1].
Predicated on economic arguments and clinical evidence, IAPT promotes access to talking treatments based on cognitive behavioural therapy (CBT) approved by the National Institute of Clinical Excellence (NICE). IAPT services solicit referrals (including self-referral) and reduce waiting times [1,2] for CBT by substantially increasing the numbers of therapists. More than 300 new therapists were recruited between 2008 and 2011 in the East of England (EoE) alone. Maintaining or increasing the working capacity of patients is an important secondary goal that underpins economic arguments for IAPT [1]. There are two tiers of IAPT therapy, depending on clinical severity, and corresponding to NICE steps 2 and 3 for the treatment of depression and anxiety. More intense therapy is delivered by more experienced clinicians in the higher tier. The present study concerns the lower tier that provides treatment for the majority of referrals from primary care and other sources.
Two IAPT demonstration sites in England (Newham and Doncaster) provided observational evidence that face-to-face (FTF) and over-the-telephone (OTT) delivery of psychological therapy were both effective in depression and anxiety [3]. Brief, CBT-based interventions led to significant reductions in symptoms of a magnitude similar to that reported in specialist out-patient psychotherapy services [4,5]. These findings are in line with a randomised controlled trial of telephone CBT [6] and a recent non-inferiority trial comparing OTT with FTF interventions for obsessive compulsive disorder [7,8].
Telephone mediated psychological interventions are convenient for patients and therapists, with a 40% reduction in treatment time [7,8] and removal of barriers to treatment [9]. Services are no longer constrained by working hours or treatment space. However, the evidence for these benefits relies on small samples in specialised settings, and may not be relevant to the relatively brief interventions (fewer than six sessions) delivered in the lower tier of IAPT [10].
OTT may represent a cost-effective option for IAPT. Both OTT and FTF services have already been implemented by the National Health Service across England, preventing the assessment of comparative effectiveness of mode of delivery of the therapy through randomised designs at the individual or even cluster (therapist or service) level. Therefore, we employed a stepped approach to the analysis of observational data, attempting to minimise the disadvantages of a non-randomised design and maximise the information relevant to the assessment of comparative effectiveness of OTT and FTF therapy. We used patient-rated outcomes from 190,128 treatment sessions within IAPT sites in the EoE to compare the clinical and cost effectiveness of low intensity, CBT-based psychological interventions delivered OTT versus FTF.

Ethics Statement
The study design and database were reviewed by the National Research and Ethics Service (NRES) for England. NRES considered the work to be an evaluation of existing services using anonymised clinical record data and did not require further ethical review.

Setting and participants
All subjects entered treatment after referral to IAPT services commissioned by seven primary care trusts (PCTs; organisations charged with commissioning care from health providers) in the EoE region from September 2008 until September 2010. These were: NE Herts, NE Essex, Suffolk, W Herts and Mid Essex, Bedfordshire, and Cambridgeshire. These are referred to in a different order as services A-G. All had been providing treatment for longer than 12 months at the time of data extraction; five remaining PCTs delivering IAPT services in the EoE had yet to achieve this stability and were excluded.
All services implemented a stepped-care model: patients could receive either high or low intensity interventions, as deemed appropriate by a standard initial assessment. We focused our analysis on patients treated exclusively with low intensity interventions where OTT is a treatment option for trained therapists working in accordance with service implementation guidelines [11]. Six low intensity interventions, alone or in combination, were approved for delivery either FTF or OTT, a decision taken by the therapist at assessment in collaboration with the patient; there were no operational guidelines. The low intensity interventions were computerised cognitive behavioural therapy (C-CBT), books on prescription or guided self-help, behavioural activation, structured physical exercise, or attendance at psycho-educational groups.
Individuals were excluded from the analysis if they: a) received or were scheduled to receive one or more high intensity treatments (defined as either receiving visits marked as high intensity and/or receiving treatments associated with a high intensity intervention); b) attended fewer than two sessions; c) attended fewer than two sessions in which the outcome measures (psychometric rating scales assessing depression, anxiety) were completed; d) had no recorded treatment outcome; e) had more than a single recorded session of behavioural activation or structured exercise.
For the cost-minimisation analysis, information about the number and type of treatment sessions was extracted from the clinical data while IAPT cost information, also anonymised, was extracted from reports prepared by Mental Health Strategies for the Department of Health [12]. The evaluation was funded and supported by NHS East of England and the NIHR CLAHRC for Cambridgeshire & Peterborough. The design, analyses and drafting of this manuscript were undertaken in accordance with STROBE guidelines [13].

Defining intervention groups: CBT delivered over-the-telephone (OTT) or face-to-face (FTF)
People referred to IAPT underwent a routine, baseline face-to-face assessment, after which they were allocated to high or low intensity treatment. The treatment comparison groups for this study were derived from those allocated to low intensity treatments, who were themselves divided between FTF therapy, where all subsequent sessions were face-to-face, and therapy given over the telephone (OTT). This latter group includes a small number of people who did not receive the routine FTF baseline assessment because a single IAPT service provider arranged for this to be undertaken by the referrer (see Results); subjects in the OTT group received all subsequent therapy OTT regardless of where the initial assessment took place. Subjects who, following assessment, received a mixture of FTF and OTT therapies were excluded from the analysis.

Patient-reported Outcomes (PROs)
Three PROs are mandated for use in patient evaluation as per IAPT implementation guidelines [11]. They are measured at baseline assessment and before each subsequent treatment session. Patients complete the three outcome measures themselves, either on paper or on screen, as they prefer. The scores are available for discussion between the therapist and patient during the subsequent treatment session, whether OTT or FTF. Neither therapists nor patients were aware that the current comparative effectiveness study would take place.
The Patient Health Questionnaire Depression scale (PHQ-9). Nine questions assess symptoms of depression and are scored from 0 ("Not at all bothered by the problem") to 3 ("Bothered nearly every day"). Sum scores range from 0 to 27. A score of 10 has been suggested as a cut point for a clinical diagnosis of depression [14]; severity bands for symptom levels in terms of PHQ-9 scores are shown later in Results, Table 1.
The Generalised Anxiety Disorder scale (GAD-7). Seven items measure the severity of anxiety symptoms [15] using the same response options and item scores as the PHQ-9. Sum scores range from 0 to 21. A score of 8 or higher on the GAD-7 has been suggested as a threshold for a likely diagnosis of clinical anxiety [15]. Severity bands applicable to GAD-7 scores are shown in Table 1.
The Work and Social Adjustment Scale (WSAS). Five self-report items regarding ability to work, home management, social leisure, private leisure and close relationships are each scored 0-8; zero indicates no impairment and eight is very severe impairment. The total score assesses overall functional impairment [16]. Severity bands are: 0-10 subclinical impairment; 10-20 functional impairment; 20-30 moderately severe impairment; 30+ severe impairment.
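The scoring rules for the three PROs can be sketched as follows. This is a minimal illustration, not the services' scoring software: the function names are hypothetical, and because the WSAS bands quoted above share their endpoints (10, 20 and 30), the sketch assumes that each shared endpoint belongs to the higher band.

```python
def sum_score(item_scores, n_items, max_item):
    """Sum item responses after checking their count and range."""
    if len(item_scores) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(item_scores)}")
    if any(not 0 <= score <= max_item for score in item_scores):
        raise ValueError(f"item scores must lie between 0 and {max_item}")
    return sum(item_scores)


def phq9_total(items):
    """PHQ-9: nine items scored 0-3, total 0-27."""
    return sum_score(items, n_items=9, max_item=3)


def gad7_total(items):
    """GAD-7: seven items scored 0-3, total 0-21."""
    return sum_score(items, n_items=7, max_item=3)


def wsas_total(items):
    """WSAS: five items scored 0-8, total 0-40."""
    return sum_score(items, n_items=5, max_item=8)


def above_cut_point(phq9, gad7):
    """Apply the suggested cut points: PHQ-9 of 10 or more, GAD-7 of 8 or more."""
    return {"depression": phq9 >= 10, "anxiety": gad7 >= 8}


def wsas_band(total):
    """Map a WSAS total to a severity band.

    Assumption: a total on a shared boundary (10, 20, 30) is placed
    in the higher band, consistent with "30+" denoting severe impairment.
    """
    if total < 10:
        return "subclinical impairment"
    if total < 20:
        return "functional impairment"
    if total < 30:
        return "moderately severe impairment"
    return "severe impairment"
```

For example, a patient answering 3 on every PHQ-9 item reaches the scale maximum of 27, and a WSAS total of 15 falls in the functional impairment band.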

Analytical Approaches
We used three approaches to analyse these observational data. The first approach involved naïve comparisons between PROs for FTF and OTT treatments (unadjusted and adjusted, as described below). The second and third approaches used sampling methods based on propensity scores. This allowed us to adjust for potential confounding in the non-randomised design, particularly concerning selection by assessing clinicians of patients with certain characteristics to either FTF or OTT. These propensity methods allowed "matched" non-inferiority comparisons across the two groups using the patient-reported outcomes: symptom reduction (depression and anxiety), work and social adjustment. A cost-minimisation analysis based on treatment duration and previous estimates of per session costs was used to compare the cost implications of OTT and FTF sessions.
Approach 1: Naïve treatment comparisons. As a first step, differences in FTF versus OTT treatment effectiveness were assessed using an unadjusted ANCOVA model with only treatment as a factor. A second adjusted model controlling for baseline symptom severity (as measured using PHQ-9, GAD-7, and WSAS baseline scores), PCT, and treatment duration (represented as the number of attended sessions) was then developed.
Approaches 2 & 3: Propensity Score Development. In observational, non-randomised evaluations, naïve comparisons between individuals who receive different treatments (FTF vs. OTT) are confounded; treatment choice is influenced by an individual's baseline clinical, social, or behavioural factors that may themselves affect the outcome, independent of treatment allocation. A propensity score approach attempts to mitigate this selection bias by balancing as many observed covariates (which potentially influence selection to treatment) as possible across treatments, increasing the validity of comparisons between non-randomised treatment groups.
In this analysis, the propensity score represents the probability of receiving OTT rather than FTF treatment conditional on covariates entered into a logistic regression model. Covariates included demographic indicators (age and gender), baseline measures of symptom severity, work, social adjustment, and service level predictors. Three binary coded items assessing the presence of general or specific phobias were also included. Employment, benefit status, receipt of psychotropic medication at baseline, and referral source were also included. PCT (indicating a particular IAPT service or set of commissioned services) was included in the model to ensure that treatment selection was not biased by differences in treatment effectiveness or policy regarding OTT use. Other service characteristics entered included referral source and also how long the service had been operating when the patient was seen.
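To make the definition concrete, the sketch below shows how a fitted logistic model turns a patient's covariates into a propensity score. It is illustrative only: the function name is hypothetical, the coefficients must come from a model fitted elsewhere, and no coefficient values from this study are assumed.

```python
import math


def propensity_score(intercept, coefficients, covariates):
    """Probability of OTT rather than FTF under a fitted logistic model.

    coefficients and covariates are dicts keyed by covariate name;
    categorical covariates are assumed to be dummy coded already.
    """
    linear = intercept + sum(
        coefficients[name] * value for name, value in covariates.items()
    )
    # Numerically stable logistic transform of the linear predictor.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    exp_linear = math.exp(linear)
    return exp_linear / (1.0 + exp_linear)
```

A linear predictor of zero corresponds to a propensity of 0.5, meaning that the covariates give no indication of which delivery mode the patient would receive.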
Individuals receiving different treatments with similar propensity scores can be considered matched and their outcomes can be directly compared. Unlike randomisation, however, propensity methods rest on the assumption that the covariates included in the model largely account for treatment selection; this assumption requires careful examination of the regression results.
Given that baseline characteristics were routinely assessed but not always systematically recorded (lack of recording occurred in approximately 15% of cases), missing baseline data were imputed using corresponding variables collected at the second visit so as to increase the available analysis sample.

Deriving the matched on propensity score samples for comparison
The two separate propensity score matching approaches used probability estimates resulting from the model. In Approach 2, patients were assigned to one of five strata based on probability estimates such that differences in propensity for OTT vs. FTF were minimal within strata and maximal between strata. Those in the first stratum were similar to individuals receiving OTT interventions; those in the fifth stratum were similar to individuals receiving FTF treatment (see Results). Effectiveness of the OTT versus FTF intervention was compared between treatments within each stratum.
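A minimal sketch of the stratification step is given below. It assumes five strata of (near) equal size formed by ranking patients on their propensity score; the text does not state how stratum boundaries were chosen, so the equal-size rule and the function name are assumptions made for illustration.

```python
def assign_strata(scores, n_strata=5):
    """Assign each propensity score to a stratum numbered 1..n_strata.

    Stratum 1 holds the highest propensity for OTT and the last
    stratum the lowest, mirroring the ordering described in the text.
    Assumption: strata are equal-sized groups of the ranked scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    strata = [0] * len(scores)
    for rank, index in enumerate(order):
        strata[index] = rank * n_strata // len(scores) + 1
    return strata
```

Outcomes for OTT and FTF patients would then be compared only among patients sharing the same stratum number.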
Approach 3 used nearest-neighbour, 1:1, non-replacement matching to assess the sensitivity of the stratification approach. Each individual receiving OTT treatment was matched to an individual receiving FTF treatment with a near equivalent propensity score. Once matched, both individuals were removed from the sample and the process repeated until no further pairs could be matched. This process produced two samples with identical sample sizes matched on propensity score values, but excluding a large number of unmatched individuals.
The 1:1 matching was conducted within each of the seven PCTs to control for differences in the magnitude of the symptom change and differences in the balance of OTT and FTF implementation across providers. To ensure that services with a larger treatment volume or more matches did not bias the results, each service contributed the same number of matched pairs to the sample (n = 21 pairs; see Results), including all the matched pairs from the PCT with the smallest number. For the remainder, we drew random sub-samples with the same number of pairs (n = 21 pairs). Results were verified by repeatedly re-sampling the matched pairs with replacement. The level of tolerable difference in 1:1 matching was specified a priori using a value for the calliper estimate, which determines the width of the propensity score interval. We adopted a conservative calliper of 0.2 standard deviations of the logit of the propensity score.
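The matching procedure described above can be sketched as a greedy nearest-neighbour algorithm. This is an illustration under stated assumptions rather than the study's implementation: the function names are hypothetical, the calliper is taken as 0.2 standard deviations of the logit pooled across both groups, OTT patients are processed in the order supplied, and the code would be run separately within each PCT.

```python
import math
import random
import statistics


def logit(p):
    """Log-odds of a propensity score strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def match_pairs(ott_scores, ftf_scores, calliper_sd=0.2):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    ott_scores, ftf_scores: dicts mapping patient id to propensity score.
    Returns a list of (ott_id, ftf_id) pairs within the calliper.
    """
    ott_logits = {pid: logit(p) for pid, p in ott_scores.items()}
    ftf_logits = {pid: logit(p) for pid, p in ftf_scores.items()}
    # Assumption: calliper width is based on the SD pooled over both groups.
    pooled = list(ott_logits.values()) + list(ftf_logits.values())
    calliper = calliper_sd * statistics.pstdev(pooled)

    available = dict(ftf_logits)  # FTF patients not yet matched
    pairs = []
    for ott_id, ott_logit in ott_logits.items():
        if not available:
            break
        nearest = min(available, key=lambda pid: abs(available[pid] - ott_logit))
        if abs(available[nearest] - ott_logit) <= calliper:
            pairs.append((ott_id, nearest))
            del available[nearest]  # non-replacement: remove once matched
    return pairs


def balance_across_services(pairs_by_service, seed=0):
    """Draw the same number of pairs from every service.

    The number drawn equals the smallest number of matched pairs
    found in any service; the seed is a hypothetical default.
    """
    rng = random.Random(seed)
    n = min(len(pairs) for pairs in pairs_by_service.values())
    return {svc: rng.sample(pairs, n) for svc, pairs in pairs_by_service.items()}
```

Repeating `balance_across_services` with different seeds gives a set of random draws over which test statistics can be averaged, in the spirit of the re-sampling check described above.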

Statistical comparisons
To compare baseline and treatment characteristics between OTT and FTF, independent sample t-tests were used for differences in continuous outcomes. Chi-square or Fisher's exact tests were used for categorical/dichotomous variables. Within each stratum and 1:1 matching sample, a random-effects ANCOVA model using the same covariates in the adjusted comparisons was used to compare the two treatments in terms of the three PROs: PHQ-9, GAD-7 and WSAS. Effect size (ES) estimates (Cohen's d) were provided with all OTT and FTF comparisons. Using established guidelines, ES values less than 0.10 were considered small, 0.11 to 0.25 were considered moderate, and those above 0.25 were considered large.
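For orientation, the standard unadjusted form of Cohen's d and the size labels used here can be sketched as below. The effect sizes reported in this paper come from adjusted models and may have been computed differently; the function names are hypothetical, and values between 0.10 and 0.11, which the stated bands leave unassigned, are assumed to count as moderate.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


def label_effect_size(d):
    """Label an effect size using the thresholds stated in the text.

    Assumption: the interval from 0.10 to 0.11 is treated as moderate.
    """
    size = abs(d)
    if size < 0.10:
        return "small"
    if size <= 0.25:
        return "moderate"
    return "large"
```

Under these labels an ES of 0.14 is moderate and an ES of 0.03 is small.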
To assess non-inferiority between the treatments we used two-sided significance tests and 95% confidence intervals for score differences between treatment groups on all three outcome measures. The lower limit of the confidence interval (LCL) represents a boundary of non-inferiority. For all three measures, the LCLs were compared with small (0.2 × pooled S.D.) and medium (0.5 × pooled S.D.) estimates of statistical uncertainty. For the PHQ-9 depression scale, the minimally important difference estimate of 5 units was an additional measure with which to assess non-inferiority and the importance of any differences between treatments [17].
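The comparison of an LCL with these margins can be sketched as follows. The sign convention is an assumption: the difference is taken as the OTT score reduction minus the FTF score reduction, so that negative values favour FTF, and the check is shown for OTT relative to FTF only. The function name is hypothetical.

```python
def check_non_inferiority(lcl, pooled_sd, mid=None):
    """Compare a lower confidence limit with the non-inferiority margins.

    lcl: lower 95% confidence limit of the adjusted mean difference
         (assumed to be OTT reduction minus FTF reduction).
    pooled_sd: pooled standard deviation of the outcome.
    mid: optional minimally important difference (5 for the PHQ-9).
    """
    return {
        "within_small_margin": lcl > -0.2 * pooled_sd,
        "within_medium_margin": lcl > -0.5 * pooled_sd,
        "within_mid": None if mid is None else lcl > -mid,
    }
```

An LCL that stays inside the small margin gives the strongest support for non-inferiority; one that falls outside the medium margin suggests the comparison treatment may be superior.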

Assessing cost-effectiveness
Data on session volume and corresponding total spend were available for five of the seven PCTs for financial year 2009/2010, allowing us to estimate OTT and FTF session costs. A cost-minimisation approach was selected on the basis of treatment equivalence [18]. The reported total spend on IAPT in each PCT was divided into a cost for low and high intensity activity. Initial micro-costing data indicated low intensity activity was 1.8 times less expensive than high intensity; representative but anonymous data from a PCT are available from the authors. A local tariff developed for IAPT services in the EoE region also confirmed that the cost ratio was close to this estimate [19]. The proportion of low intensity sessions was adjusted for this base-case cost ratio of 1.8, providing a cost estimate of all low intensity activity. The 1.8 ratio was also varied from 1.2 to 2.0 to test the assumption's sensitivity.
Literature reports indicated that OTT requires shorter treatment durations than FTF [8,9]. To derive the cost per OTT and FTF session, the difference in session duration was assumed to translate directly into an equivalent difference in cost. We used the observed ratio of 1.5 between FTF and OTT treatment durations to calculate cost per session. The estimated total cost for all low intensity sessions was apportioned based on the proportion of OTT and FTF sessions for each PCT, adjusting for the 1.5 ratio to arrive at the total cost of OTT and FTF. Dividing the total costs by the number of sessions provided session costs for each treatment.
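One way to read this apportionment is as a session-weighted split of spend, sketched below. The weighting scheme is our interpretation of the description above rather than a documented formula, and the function and parameter names are hypothetical.

```python
def low_intensity_cost(total_spend, n_low, n_high, high_to_low_ratio=1.8):
    """Share of total spend attributed to low intensity sessions.

    Assumption: each high intensity session is weighted as
    high_to_low_ratio low intensity sessions.
    """
    weighted_sessions = n_low + high_to_low_ratio * n_high
    return total_spend * n_low / weighted_sessions


def session_costs(low_cost, n_ott, n_ftf, ftf_to_ott_ratio=1.5):
    """Cost per OTT and per FTF session from the low intensity total.

    Assumption: an FTF session costs ftf_to_ott_ratio times an OTT
    session, reflecting the difference in session duration.
    """
    ott_cost = low_cost / (n_ott + ftf_to_ott_ratio * n_ftf)
    return ott_cost, ftf_to_ott_ratio * ott_cost
```

With a fixed ratio of 1.5, the FTF session cost is by construction 1.5 times the OTT cost. Re-running `low_intensity_cost` with `high_to_low_ratio` set between 1.2 and 2.0 corresponds to the sensitivity analysis described above.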

Results
Patient flow, follow-up, and sample characteristics
Patient flow is described in Figure 1. During the survey period, 39,227 individuals were referred to IAPT services in seven PCTs. Of those, 21,452 (55%) attended at least two sessions during which treatment was administered: 11,401 (53%) of those received or were scheduled to receive a high intensity intervention, leaving 10,051 (47%) individuals receiving solely low intensity interventions. Of these people allocated to low intensity interventions, 6,873 (68%) had information on baseline and endpoint measures and all baseline covariates required for propensity score matching and calculation of change scores. These were the potential participants in this study.
The 6,873 potential participants were then divided into three groups based on receipt of OTT and/or FTF therapy:
1. The FTF group: 2,560 people (32.3%) received exclusively FTF therapy; of these, 1,791 (70%) had completed treatment at the time of data extraction and comprise the FTF group.
2. The OTT group: 2,928 (46.0%) received only OTT interventions after a baseline FTF assessment (this included 311 people from a single IAPT service who received their baseline FTF assessment by the referrer, not that IAPT service); of these, 2,315 (79%) had completed treatment at the time of data extraction and comprise the OTT group.
3. The remaining 1,385 (21.7%) participants received a blend of OTT and FTF interventions and could not be reliably allocated to either treatment group; this third group of people receiving low-intensity interventions were excluded from our analyses.
Differences in baseline attributes across treatment groups. Table 1 displays baseline characteristics for both treatment groups before propensity score matching. The FTF group was more likely to be unemployed, referred by a general practitioner (family medical practitioner; GP) and to be receiving psychotropic medication at baseline. More people received FTF in PCTs B, C, and E. Individuals receiving FTF were also more likely to report moderately-severe or severe scores on the PHQ-9 and to have slightly higher average WSAS scores. Those who received OTT interventions were more likely to be economically inactive (e.g. student or homemaker). OTT interventions were more likely to have been received in PCTs A, D, and F. Small but significant differences were observed across groups in terms of age distribution, active service duration at first visit, and phobia questions 1 and 2.

Comparison of outcomes for CBT delivered OTT versus FTF
Propensity score estimates (odds of OTT vs. FTF) are shown in Table 2. Older age was associated with FTF treatment. Individuals were less likely to receive OTT interventions if they were unemployed. The longer an IAPT service had been in operation, the more likely OTT interventions were to be used. Receipt of OTT treatment was less likely if the individual was seen in PCTs B, C, D and E relative to the reference provider (A). The model R-square was 0.50, indicating excellent predictive accuracy.
Approach 1: Naïve Comparisons. Figure 2 displays the mean reduction in PHQ-9, GAD-7 and WSAS for both OTT and FTF treatments in the overall sample (unadjusted and adjusted for provider, number of treatment sessions), each propensity stratum, and the 1:1 matching sample. An unadjusted comparison shown in the top section indicated significant differences between treatments in the reductions of PHQ-9 and GAD-7 symptom scores; OTT interventions appeared to be more effective (PHQ-9: F = 17.

Approach 2: Propensity Scores & Stratification Approach. Five propensity strata were developed using estimated propensity scores. This approach minimizes differences in propensity to be prescribed OTT vs. FTF by restricting the
Stratum one, which included those most likely to receive telephone interventions, had the lowest overall age and the highest percentage of women (71%) of all five strata. It also had the lowest average PHQ-9, GAD-7, and WSAS scores in addition to the lowest average scores for each phobia measure. 68% of individuals were employed (full or part-time) and 9.4% of individuals were receiving benefits. The majority of individuals in the stratum were seen 12 months or longer after the service initiated. Slightly less than half received psychotropic medication at baseline. It also contained the lowest proportion of patients referred by GPs (75%). Providers were not proportionately represented across strata.
Compared with stratum one, stratum two comprised a slightly older population with a smaller proportion of women (63%). PHQ-9, GAD-7, WSAS scores, and all phobia scores were markedly higher than stratum one. Percentage of individuals in full or part time employment was similar to stratum one (67%), although a higher proportion of individuals received benefits (16.6%). 45% of individuals were seen more than 12 months after the service was initiated. 83% of individuals were GP referrals.
Stratum three had age and gender characteristics similar to stratum two (63.6% women) with significantly higher average PHQ-9 scores but comparable GAD-7, WSAS scores. Average phobia scores were higher than strata one and two. 55.7% of individuals were employed (lowest of all strata), and 22.9% of individuals were receiving benefits at baseline. Only 10% of individuals were seen after the service had been active for longer than 12 months. 83.2% of individuals were referred by their GP.
Stratum four had the second-oldest age profile and the second-highest proportion of women (64.4%). PHQ-9, GAD-7, and WSAS scores were slightly lower than strata two and three, similar to stratum one. Average phobia measures were similar to stratum two. 65.8% of individuals were in full or part-time employment at baseline; 14.8% were receiving benefits. 97% of individuals in stratum four were referred by GPs, and only 36.1% of individuals were seen after the service had been active for longer than 12 months.
Stratum five was most like stratum three in terms of its age and gender distribution (63.3% women); it had the highest PHQ-9, GAD-7, and WSAS scores of all. Phobia scores were high, also similar to stratum three. 65.6% of individuals were in part or full-time employment and 23.2% of individuals were receiving benefits at baseline. 25.9% of individuals were seen after services had been active for longer than 12 months.
Significant differences in baseline means for PHQ-9 between treatments were observed only in stratum one (t-statistic: 2.34, p = 0.02). No significant baseline differences were observed in GAD-7 and WSAS across treatment groups in any stratum. All group average scores for GAD-7 and PHQ-9 indicated moderate impairment (10-14) while all group average WSAS scores were between 10 and 20, indicating functional impairment. Figure 2 displays the average score reduction within each stratum for all three outcome measures. Figure 3 displays adjusted mean differences in score reduction between OTT and FTF treatment so as to assess non-inferiority. In strata one to three, the lower confidence limit (LCL) of the adjusted mean difference did not fall below 0.2 S.D. on any of the measures; this is strong evidence that neither treatment was inferior to the other. In stratum four (and in the 1:1 propensity matching, see below) the LCL for adjusted mean difference in WSAS exceeded the 0.2 S.D. threshold, indicating only marginal support for non-inferiority regarding work and social adjustment improvement. The situation was different in stratum five, the group with most severe symptoms. Here, the LCL exceeded 0.2 and 0.5 S.D. for the PHQ-9 and GAD-7 scores and 0.2 S.D. for the WSAS, indicating potentially superior symptom reduction in all domains for individuals receiving FTF interventions.
The minimally important difference (MID) estimate of 5 points on the PHQ-9 [17] is represented by the extreme limits of the x-axis in Figure 3 for this outcome measure (−5 favouring FTF and +5 favouring OTT). No estimates or LCL approached this MID estimate although some statistically significant differences between OTT and FTF groups were apparent. In strata 2 and 3, reductions in PHQ-9 scores were significantly larger for individuals receiving OTT versus FTF interventions (Stratum two: F: 4.05, p = .045, ES: 0.22, Stratum three: F: 4.09, p = .043, ES: 0.18). The average reduction in GAD-7 scores for those receiving OTT interventions was significantly higher in stratum two (F: 4.20, p = .041, ES: 0.22). Thus, whether they favoured OTT or FTF, any statistically significant effects identified in the comparisons within strata were small to medium (defined as Cohen's d estimates between 0.10 and 0.25) and, for PHQ-9, below the threshold defined as a minimally important difference.
(Table 4). All univariate comparisons of matching variables across treatment groups were non-significant (data not presented).
Baseline PHQ-9 and GAD-7 mean scores for both OTT and FTF interventions indicated moderate depression and anxiety (PHQ-9: 13.0 and 11.6, GAD-7: 12.0 and 10.9 for OTT and FTF, respectively). In the 1:1 matched sample, the LCL of the adjusted mean difference for the WSAS exceeded the 0.2 S.D. threshold, whereas those for the PHQ-9 and GAD-7 did not, indicating that neither OTT nor FTF was an inferior treatment to the other. As for the within strata comparisons, the lower limit of the confidence interval for the PHQ-9 did not exceed the five point change estimate that would have indicated the possibility of a minimally important difference between scores.
Figure 3. Adjusted mean differences in PHQ-9, GAD-7 & WSAS measures shown as effect sizes. To assess non-inferiority between the treatments we used two-sided significance tests and 95% confidence intervals for score differences between treatment groups. The first pane is a key to the other six that show the results for the five strata (analysis approach 2) and for the 1:1 matched sample (approach 3). The lower limit of the confidence interval (LCL) represents a boundary of non-inferiority. For all three measures, the LCLs were compared with two estimates of statistical uncertainty: small (0.2 × pooled SD; inner vertical line closer to line of equivalence) and medium (0.5 × pooled SD; outer line). The next six panes display adjusted mean differences in score reduction between OTT and FTF treatment assessing non-inferiority. In strata one to three, the lower confidence limit (LCL) of the adjusted mean difference did not fall below 0.2 SD on any of the measures, indicating strong evidence that neither treatment was inferior to the other. In stratum four and in the 1:1 propensity matching, the WSAS LCL exceeded the 0.2 SD threshold, indicating only marginal support for non-inferiority regarding work and social adjustment improvement. The situation was different in stratum five, the group with the highest symptom scores, where the LCL exceeded 0.2 and 0.5 SD for the PHQ-9 and GAD-7 scores and 0.2 SD for the WSAS. This indicates potentially superior symptom reduction in all domains for individuals receiving FTF CBT. The a priori minimally important difference (MID) estimate of 5 points on the PHQ-9 is represented by the extreme limits of the x-axis in Figure 3 (−5 favouring FTF and +5 favouring OTT). No estimates or LCL approached this MID estimate. Furthermore, the effect size is small for all the potential differences, including those reaching statistical significance in strata 2 and 3.
Results from the random-effects ANCOVA revealed no significant difference in the magnitude of reductions in PHQ-9 (OTT: 5.6 (6.4), FTF
Treatment characteristics and course in the FTF and OTT groups matched 1:1 on the basis of propensity scores.
Table 4 provides information on the type, duration, and content of treatments provided in both treatment groups derived from the sensitivity analysis (random draws). For all outcomes, reported proportions were taken from a single, representative random draw while significance levels were calculated using iterative re-sampling of matched pairs. No significant difference was observed across groups in terms of type of treatment outcomes (average χ²: 3.73, average p = 0.375, 95% CI: 0.296-0.454), indicating no difference in the manner in which patients terminated treatment across groups. Those receiving OTT interventions were significantly more likely to receive computerised CBT (average χ²: 16.30, average p<0.001, 95% CI: 0.001-0.002). Those receiving OTT interventions were less likely to receive psycho-educational group interventions, although the difference was marginal (average χ²: 4.565, p<0.005, 95% CI: 0.048-0.096). All other treatment type comparisons were non-significant. No significant differences across groups in the number of modalities received during the course of treatment were observed (average χ²: 1.27, p<0.005, 95% CI: 0.690-0.798).
No significant difference was observed in the number of treatments received (average t-statistic: 0.2, p = 0.84, 95% CI: 0.55-0.80), although a significant difference in the total duration of treatment was observed (average t-statistic: 3.74, p<0.001, 95% CI: <0.001-0.210). On average, individuals receiving OTT treatment received 32.6% less contact time than equivalent FTF patients.
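One way to obtain "average" test statistics and p-values of the kind reported above is to repeat a test over many random draws and average the results. The sketch below assumes bootstrap resampling of matched pairs and a Pearson chi-squared test on a 2×2 table; the text does not specify the resampling scheme actually used, so both choices, along with the function names and the simulated data, are assumptions for illustration.

```python
import math
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:          # an empty row or column: no evidence of a difference
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-squared with 1 df
    return stat, p_value


def averaged_chi2(pairs, n_draws=500, seed=0):
    """pairs: list of (ott_received, ftf_received) booleans, one tuple per matched pair."""
    rng = random.Random(seed)
    stats, p_values = [], []
    for _ in range(n_draws):
        draw = [rng.choice(pairs) for _ in pairs]   # resample pairs with replacement
        ott_yes = sum(1 for ott, _ in draw if ott)
        ftf_yes = sum(1 for _, ftf in draw if ftf)
        stat, p = chi2_2x2(ott_yes, len(draw) - ott_yes,
                           ftf_yes, len(draw) - ftf_yes)
        stats.append(stat)
        p_values.append(p)
    return sum(stats) / n_draws, sum(p_values) / n_draws


# Simulated data, not study data: 147 pairs, treatment type more common under OTT.
demo_rng = random.Random(1)
demo_pairs = [(demo_rng.random() < 0.4, demo_rng.random() < 0.2) for _ in range(147)]
mean_stat, mean_p = averaged_chi2(demo_pairs)
print(f"average chi-squared = {mean_stat:.2f}, average p = {mean_p:.3f}")
```

Averaging over draws reduces the dependence of the reported statistic on any single random selection of matched pairs, which appears to be the motivation for the approach described in the text.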

Cost-minimisation Analysis
Under the assumption that OTT costs 32.6% less than FTF (that is, FTF costs approximately 1.5 times as much as OTT), the cost per session was estimated for a base-case scenario (Table 5). The mean cost per session for OTT was £79.19 (95% CI 55.0 to 103.3) and FTF was £118.76 (95% CI 82.5 to 155.0). Even when the cost ratio was varied from 1.2 times to 2 times, OTT was still cost-effective.
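The arithmetic behind the base case and the sensitivity range can be sketched as follows. The sketch assumes that the OTT cost is obtained by dividing the FTF cost per session by the assumed cost ratio; the function name is hypothetical, and small rounding differences from the reported £79.19 are to be expected.

```python
def ott_cost_per_session(ftf_cost, cost_ratio):
    """OTT cost per session if an FTF session costs `cost_ratio` times as much."""
    return ftf_cost / cost_ratio


FTF_COST = 118.76  # mean FTF cost per session reported in the text (GBP)

# Base case (ratio 1.5) plus the two ends of the sensitivity range in the text.
for ratio in (1.2, 1.5, 2.0):
    ott = ott_cost_per_session(FTF_COST, ratio)
    saving = 1 - ott / FTF_COST
    print(f"ratio {ratio:.1f}: OTT £{ott:.2f}, saving {saving:.1%}")
```

Under this assumption the OTT session remains cheaper at every ratio above 1, with the proportional saving equal to 1 − 1/ratio.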

Discussion
This comparison of talking therapies delivered OTT or FTF was a naturalistic study in established, low intensity IAPT services across an entire region of England; the sample is large and the study uses routine, systematic and prospectively collected patient-reported outcomes. Existing evidence for the effectiveness of talking therapies based on CBT for mild to moderate depression and anxiety is strong and based on gold-standard randomised controlled trials (RCTs); the same is true for therapy delivered over the telephone. However, these trials are predominantly in the context of small samples from research-based clinical settings. On the basis of Government mandate, IAPT services have been introduced at great pace throughout England, negating the possibility of extending the evidence from RCTs to the question as to whether CBT or its mode of delivery (e.g. FTF or OTT) are clinically or cost-effective when scaled up from the clinical laboratory to the entire population. In these circumstances, where even cluster randomisation is impossible, a range of further methodologies must be deployed in order to assess the comparative effectiveness of health services [22]. The results of observational studies need to be interpreted with great care because of potential bias and confounding. Nevertheless, it is important to exploit routinely collected information in order to inform policy makers, clinicians and patients in the common circumstances that are beyond randomised trials.
In this vein, we have carefully applied a series of analytical approaches to the question of comparative effectiveness in common mental disorders of CBT delivered face-to-face or over-the-telephone in a network of health services that are already in place. We believe that some conclusions can be drawn and recommendations made for further work. The first analytical approach, a naïve comparison between groups, is presented for completeness and in order to show how findings develop as the approach becomes more sophisticated. The subsequent propensity score approaches guard against bias and confounding resulting from selection-to-treatment effects such as the systematic use of one treatment modality or another according to characteristics of the particular IAPT service or individual participants.
The two propensity score approaches compared the efficacy of the two treatments in individuals who were similar in their baseline treatment profiles and controlled for systematic differences between providers of the IAPT service. We demonstrated that OTT and FTF showed equivalent effectiveness for anxiety symptoms (GAD-7), depression symptoms (PHQ-9), and work and social functioning (WSAS) in all but the most severely affected patients who were identified within stratum five. Initial, unadjusted comparisons suggested OTT was actually more effective than an FTF approach but these results were heavily contaminated by selection-to-treatment effects and should be discounted in favour of results from the propensity approaches.

Strengths and Limitations of the study
Major strengths include the regional setting across several services, use of routine PROs and the large number of covariates available for the propensity score. Despite exclusions due to lack of treatment and missing data, the sample is much larger than comparable RCTs that often have sample selection effects, limited generalisability and unsatisfactory comparison groups such as waiting list controls. Subjects were inevitably excluded from the propensity sampling but each individual stratum and the 1:1 matched sample amount to some of the largest samples ever used to assess the effectiveness of telephone-based interventions in routine psychological care. Reliance on comparisons with individuals on a waiting list and seen in routine GP care is a recognised weakness in the current literature assessing telephone-based psychological interventions [6]. Our results are widely applicable given that the sample was drawn from a range of different IAPT providers implementing different service structures and treatment models, with individual participants representing a broad cross-section of the population within an English region.
The study has limitations. We cannot be certain whether the benefits seen across both groups were genuine effects of the IAPT treatment or were due to natural resolution of symptoms and regression to the mean. However, the CBT interventions have been shown to have efficacy in well-designed RCTs so it is a reasonable assumption that there were real treatment effects to be compared. There may be a degree of hidden bias and residual confounding due to un-assessed covariates (including unimagined factors) that would have been equalised in a randomised design; patient safety, mobility and previous illness history represent possible factors here. We see the rating of the outcomes by patients, themselves, as a positive aspect of the study. The measures were used in the same way in the two treatment arms and neither patients nor clinicians were aware that the comparison of FTF and OTT would subsequently be made. Independent, blind assessments by a third party would have been a useful corroboration but were impractical in an in-service evaluation such as this. The representativeness of the data remains a concern; those individuals who fail to report all of the items on the IAPT minimum dataset may be more (or less) likely to drop-out of treatment or to experience less symptom reduction. This can be better assessed as further data accrue in a wider range of IAPT service system providers.
Another limitation is that the effectiveness of OTT and FTF interventions cannot be assessed according to an exact clinical diagnosis given likely differences in diagnostic accuracy and conventions across the numerous health-care professionals within the sample of providers. All that can be said is that telephone-based interventions are effective in the treatment of many individuals presenting to the lower intensity tier of IAPT services in the East of England. The vast majority of these people will have mixed anxiety and depression, sometimes referred to as common mental disorder. Nevertheless, the accuracy and standardisation of clinician diagnosis and possible differential effects of treatment according to more detailed diagnostic categories remain important areas of future research, albeit that these are unlikely ever to be used consistently on a large scale. A fifth of potentially eligible subjects were excluded because they received a mixture of FTF and OTT treatments following their baseline assessment. It would have been useful to have included these subjects on the basis of an intention-to-treat analysis. This was not possible because we had no indication of what the intention was; we could have allocated subjects to either OTT or FTF on the basis of the nature of their contact in the session following baseline, but this would have been an assumption of intention that may not have been warranted. It is likely that mixtures of FTF and OTT arise because of the uncertainty as to which may be better, something our study was intended to resolve, or because the true intention was a blend of treatment, a third treatment arm.
The total sample size was large but became restricted as the analysis moved into the propensity strata and the 1:1 matched sample. Thus, the statistical power needs consideration. We were careful to ground the analyses in the context of conventionally defined effect sizes and, for the measure of depression, the PHQ-9, in an accepted definition of a minimally important difference [17] that equates to a large effect size. Thus, we can be clear from our results that we have not rejected such a treatment difference or greater on the basis of a Type II statistical error; none of the confidence limits in the 1:1 analysis approached such a difference and a post-hoc power analysis indicated that this analysis was sufficiently powered, with conventional parameters, to exclude, had they existed, a large effect for PHQ-9 (ES 0.27) and moderate effects for GAD-7 and WSAS (ES 0.16 and 0.24, respectively).
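As a rough illustration of the kind of calculation behind a post-hoc power statement, the sketch below computes the smallest standardised difference detectable between two equal-sized groups using a normal approximation. The significance level, the power and the covariate adjustment term `r_squared` are assumptions, since the text gives only "conventional parameters"; the sketch is therefore not expected to reproduce the effect sizes quoted above.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_per_group, alpha=0.05, power=0.80, r_squared=0.0):
    """Smallest standardised difference detectable between two groups of equal size.

    Normal approximation to a two-sided test; r_squared is the (assumed) share of
    outcome variance explained by covariates, which shrinks the residual SD.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 * (1 - r_squared) / n_per_group)


# 147 matched pairs are reported; alpha, power and r_squared here are assumptions.
print(round(min_detectable_effect(147), 3))
print(round(min_detectable_effect(147, r_squared=0.3), 3))
```

Adjusting for baseline covariates lowers the detectable effect size, which is one plausible reason why the values differ across the three outcome measures.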
A limitation of the costing approach adopted is that it was primarily based on assumptions concerning estimates of the relative costs of OTT and FTF treatments. Given that the costs of IAPT are largely the costs of staff and clinical facilities, with education costs being similar in both groups, it is likely that our results based on staff time will be valid, with differences remaining similar even if a more detailed micro-costing approach were to be undertaken. Estimates of cost savings are also incomplete; additional savings in the form of reduced travel and reduced need for clinical accommodation remain important considerations to be incorporated into future calculations. Our view is that these would magnify the differences that we have observed.

The findings
Our results indicate that symptoms decrease and social function increases under both treatment conditions, and that OTT is a convenient and effective CBT modality for the majority of patients treated within the lower intensity tier of IAPT services. There was an important indication of heterogeneity of effect: the propensity strata showed that those who were older and had higher symptom scores (stratum 5) may do better with FTF. This conclusion has face validity given that these people would be closest to the threshold for the more intensive IAPT therapies given face-to-face, which fall outside the scope of this study of lower intensity treatments.
Thus, for the majority, the convenience of OTT to patients and services carries a likely economic advantage. The cost-minimisation analysis focused on service costs alone. It indicated that the cost per OTT session was approximately one third lower than that of an FTF session, a result that was robust when the model assumptions were changed. The delivery of OTT or FTF therapy appeared to depend not on patient characteristics but mainly on where and when the treatment took place. This suggests that it is within the remit of commissioners and services to design IAPT services accordingly, though further work is needed to define the characteristics of those people with severe disorder, such as those in stratum five, who will, in fact, do better with an FTF model. Overall, the results indicate that increasing the proportion of low intensity talking therapy delivered OTT in some areas may reduce the cost of the IAPT programme, increase its productivity and maintain the quality of the service. Further research is needed to better identify the group of people with more severe illness in stratum 5, initially assessed clinically as appropriate for the lower level therapies; they may have better outcomes with FTF therapy.
Comparison with other studies. A recent meta-analysis [20] demonstrated benefits for the use of technology-mediated interventions in the treatment of depression and anxiety, but only two studies compared telephone with face-to-face delivery. One RCT of 72 patients showed that CBT delivered by telephone was equivalent to treatment delivered face-to-face with similar levels of satisfaction [7]. This study involved OCD sufferers, a specific anxiety disorder, and highly trained therapists akin to those in the higher intensity IAPT tier. A meta-analysis comparing the effect of self-guided interventions with brief therapeutic input, common in the IAPT low intensity setting, found that brief contact did boost effectiveness [10] but FTF or OTT contacts were not compared. Nevertheless, our large-scale, pragmatic evaluation of IAPT in a real-world setting is consistent with the findings of the prior work, strengthening the case for action.
The results from the 1:1 matched sample indicated that OTT interventions can provide significant reductions in the total amount of time each patient is seen. Our estimates were somewhat lower than the 40% reported in other published accounts but further efficiencies may be found as services mature (our IAPT services had been established for only 1-2 years) and if more attention is paid to the design of the OTT approach [7]. Layard and colleagues, seminal contributors to the original IAPT model, estimated that the cost of providing a standard course of roughly ten sessions of CBT is £750 or £75 per session [21], an assumed rather than an observed estimate. Our OTT session cost estimates exceed this but still represent a significant route for cost savings and effective, flexible, and readily available treatment.

Conclusions and Policy Implications
The clinical effectiveness of low intensity CBT-based interventions delivered over-the-telephone was not inferior to those delivered face-to-face in the majority of patients with the common mental disorders of depression and anxiety. A minority of people with more severe illness and who tended to be older appeared to gain more benefit from face-to-face therapy; research is required in order for services to identify them efficiently. For most, CBT delivered over-the-telephone is a cost-effective and probably convenient option, providing the potential for significant financial savings to the IAPT programme in the lower intensity context. Increasing the proportion of low intensity talking therapy delivered over-the-telephone may reduce the cost of the IAPT programme, increase its productivity and maintain the quality of the service.
In a global context, the potential is enormous for spreading access to effective psychological therapies to the millions of people affected by depression and anxiety. As the availability of mobile phone technology in low and middle income countries grows, people now have the potential of having a therapist in their pocket, transcending traditional barriers to the receipt of effective treatments. Should this opportunity be taken, we recommend that randomised evaluations are used to evaluate what could potentially amount to another major step forward in care for common mental disorders.