Effectiveness of Terbutaline Pump for the Prevention of Preterm Birth. A Systematic Review and Meta-Analysis

Background Subcutaneous terbutaline (SQ terbutaline) infusion by pump is used in pregnant women as a prolonged (beyond 48–72 h) maintenance tocolytic following acute treatment of preterm contractions. The effectiveness and safety of this maintenance tocolysis have not been clearly established. We aimed to systematically evaluate the effectiveness and safety of subcutaneous (SQ) terbutaline infusion by pump for maintenance tocolysis. Methodology/Principal Findings MEDLINE, EMBASE, CINAHL, the Cochrane Library, the Centre for Reviews and Dissemination databases, post-marketing surveillance data and grey literature were searched up to April 2011 for relevant experimental and observational studies. Two randomized trials, one nonrandomized trial, and 11 observational studies met inclusion criteria. Non-comparative studies were considered only for pump-related harms. We excluded case-reports but sought FDA summaries of post-marketing surveillance data. Non-English records without an English abstract were excluded. Evidence of low strength from observational studies with risk of bias favored SQ terbutaline pump for the outcomes of delivery at <32 and <37 weeks, mean days of pregnancy prolongation, and neonatal death. Observational studies of medium to high risk of bias also demonstrated benefit for other surrogate outcomes, such as birthweight and neonatal intensive care unit (NICU) admission. Several cases of maternal deaths and maternal cardiovascular events have been reported in patients receiving terbutaline tocolysis. Conclusions/Significance Although evidence suggests that pump therapy may be beneficial as maintenance tocolysis, our confidence in its validity and reproducibility is low, suggesting that its use should be limited to the research setting. Concerns regarding safety of therapy persist.


Executive Summary
Background Preterm birth is defined as delivery before the completion of the 37th week of gestation, and it affects 13 percent of live births in the United States. 1 According to the 2010 National Vital Statistics report, there were 542,893 preterm births in the United States in 2006. 2 Rates of preterm birth result in a significant disease burden to the health care system. Although overall rates of neonatal mortality continue to decline, infants born too early are at risk for long-term morbidity. 3 Tocolytics are drugs used to delay or inhibit contractions during the labor process. Several tocolytics are available to prevent preterm birth. These agents may be administered as primary therapy to control acute episodes of preterm labor or as maintenance therapy to prevent subsequent episodes. Maintenance tocolysis is usually provided for prolonged periods beyond 48 to 72 hours after arrest of acute preterm labor to inhibit the process of parturition until full term.
While several studies have examined these agents for the control of acute episodes of preterm labor, the evidence to support their safety and efficacy as maintenance therapy is limited.
The β-agonist agent, terbutaline sulfate, has been used orally and subcutaneously as maintenance tocolytic therapy in women following acute treatment and arrest of confirmed preterm labor. As with all other contemporary tocolytics, the use of terbutaline for maintenance tocolysis is off-label. The Food and Drug Administration (FDA) has approved terbutaline for the management of acute and chronic obstructive pulmonary disease only. When administered through the subcutaneous (SQ) route, terbutaline may be administered by a pump that provides a steady continuous infusion with allowance for boluses. Compared with the oral route of administration, the SQ terbutaline pump uses lower doses (usual basal rate is 0.03-0.05 mg/hr with an intermittent bolus of 0.25 mg every 4 to 6 hours) and has less potential for tachyphylaxis. 4 The effectiveness and safety of the SQ terbutaline pump for maintenance tocolytic therapy were examined in two systematic reviews. One review, which was based on two small randomized controlled trials (RCTs), concluded that the SQ terbutaline pump offers no advantages compared with the saline pump or oral terbutaline. 4 The second review found contradictory results among RCTs and observational studies; the RCTs found no difference between the SQ terbutaline pump and comparators, although the observational studies demonstrated positive effect estimates in favor of the pump. 5 Despite previous systematic reviews, uncertainty surrounding the use of terbutaline and other tocolytics as maintenance therapy to prevent recurrent episodes of preterm labor still exists. No clear first-line maintenance tocolytic therapy has yet emerged. The possibility of maternal side effects and unclear evidence on perinatal outcomes contribute to the ambiguity of terbutaline's role in obstetrical practice. Moreover, in a recent cost analysis of four tocolytic agents, subcutaneous terbutaline had the highest cost. 6 The expense is due not only to the device, but also to the need for increased monitoring and management of adverse events associated with this therapy. 6
Given the importance to clinicians, patients, and policymakers of the ongoing use of the terbutaline pump for maintenance tocolysis, and the uncertainty about its appropriateness, a review about the effectiveness and safety of SQ terbutaline pump was commissioned by the Agency for Healthcare Research and Quality (AHRQ) to address six Key Questions. This evidence report will add to previous systematic reviews by performing an up-to-date search of the literature, synthesizing evidence in the context of specific populations of women, addressing confounding by level of maternal activity and level of care, and grading the strength of evidence for important outcomes to help decisionmakers develop evidence-based recommendations and policies.

Analytic Framework
We developed an analytic framework depicting links between the intervention and related clinical and intermediate efficacy and harms outcomes and other unintended adverse effects (Figure A). As the framework shows, the Key Questions together cover the topic comprehensively.

Input From Stakeholders
We formulated the population, intervention, comparator, outcome, timing, setting (PICOTS) conceptual framework and Key Questions in consultation with key informants during a topic refinement stage. The public was invited to provide comments on the Key Questions. During the review process, we followed a research protocol we developed with the clinical and methodological input of a technical expert panel. The protocol followed the Effective Health Care Program's Methods Guide for Effectiveness and Comparative Effectiveness Reviews. 7

Data Sources and Searches
We developed a peer-reviewed search strategy and searched the following databases: MEDLINE In-Process & Other Non-Indexed Citations and MEDLINE (1950 to April 1, 2011); Embase (1980 to April 1, 2011); Cumulative Index to Nursing and Allied Health Literature (CINAHL) via EBSCOhost (1985 to December 7, 2009); the Cochrane Library via the Wiley interface (April 1, 2011), including CENTRAL, Cochrane Database of Systematic Reviews, Database of Abstracts of Reviews of Effects [DARE], Health Technology Assessment [HTA], and the National Health Service Economic Evaluation Database [NHS EED]; and the Centre for Reviews and Dissemination (CRD) databases (January 2, 2010). Appendix A provides details of the search strategies. We hand-searched the bibliographies and text of review articles, letters to editors, and commentaries and the reference lists of included studies for additional references. We also reviewed grey literature sources and information received from pharmaceutical companies (see Appendixes B and C), and sought unpublished information from Matria (now called Alere) Healthcare about their perinatal program and associated database.
In February 2011, the FDA issued new warnings against the use of terbutaline to treat preterm labor, so we also accessed a summary of the FDA postmarketing surveillance results. This decision was made post hoc.

Study Selection
Two reviewers screened abstracts and full-text reports with conflicts resolved by consensus or third-party adjudication. Studies were included if they met the following criteria: evaluated pregnant women between 24 and 36 weeks' gestation having had acute preterm labor arrested with primary tocolytic therapy; contained at least one group that was administered the SQ terbutaline pump; and assessed one of the specified outcomes listed in the key questions or described a long-term childhood outcome. Noncomparative studies (i.e., case series) were assessed only for pump-related harms outcomes, such as incidence of pump failure, missed doses, or overdose. Non-English records without an English abstract were excluded. We also excluded case reports, but in a post hoc decision sought FDA summaries of postmarketing data highlighting serious harms.

Data Extraction and Risk of Bias Assessment
One reviewer extracted data into a standardized electronic form and assessed study risk of bias and applicability. Extraction items included general study characteristics (e.g., year of publication, study design), population characteristics (e.g., inclusion/exclusion criteria, age, race, level of activity), intervention characteristics (e.g., dose, duration, details about comparators, level of care), and outcomes with their estimates. A second reviewer verified outcomes data and study risk of bias assessments. Ratings for level of activity, level of care, and assessments of applicability were verified by a clinical expert. Level of activity and level of care were rated based on composite assessments across preidentified variables.
We assessed study risk of bias given the study design, by outcome, using generic items to assess confounding and various types of bias (e.g., selection, performance, detection bias, attrition bias). Selected items from the McMaster Quality Assessment Scale of Harms were also incorporated into the risk of bias assessment for harm-related outcomes. 8 Appendix D provides the data extraction, risk of bias, and applicability forms.
Certain criteria were specific to particular study designs (e.g., allocation generation and concealment applied only to RCTs). We rated each relevant outcome in a study with an overall risk of bias rating designated as high, medium, or low. Outcomes were rated as high risk of bias if there was an apparent and major flaw in the study that would invalidate results.

Data Synthesis and Analysis
We meta-analyzed the RCTs with a random effects model, following a DerSimonian and Laird approach, when they were clinically and methodologically similar. To assess statistical heterogeneity and its magnitude, we used Cochran's Q (α=0.10) and the I² statistic, respectively. Odds ratios (ORs) were calculated for dichotomous outcomes and mean differences for continuous outcomes. All analyses were performed using Comprehensive Meta Analysis version 2.2.046 or version 2.2.055 (New Jersey, USA). We did not meta-analyze observational studies because of potential differences in confounders, nor did we combine studies of singleton and multiple pregnancies. Synthesis of evidence from observational studies was, therefore, undertaken qualitatively. Due to the small number of studies, we could not perform any meta-regression to explore statistical heterogeneity in effect estimates.

Strength of Evidence and Applicability
Based on published guidance for the Effective Health Care Program, 9 two reviewers graded the strength of evidence using the four primary domains (i.e., risk of bias, consistency, directness, and precision) for the following outcomes: incidence of delivery at various gestational ages (<28 weeks, <32 weeks, <34 weeks, <37 weeks), mean prolongation of pregnancy, bronchopulmonary dysplasia, significant intraventricular hemorrhage (grade III/IV), neonatal death, death within initial hospitalization, and maternal withdrawal due to adverse effects (Withdrawal-AE). We described population, intervention, comparison, outcome, timing, and setting characteristics to summarize the applicability of the body of evidence.
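The random-effects pooling described under Data Synthesis and Analysis can be illustrated with a minimal Python sketch. It is not the Comprehensive Meta Analysis implementation used in the review; it only shows the textbook DerSimonian and Laird estimator, Cochran's Q, and I² applied to log odds ratios. The function names and the two 2x2 tables are hypothetical and chosen purely for illustration.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance from a 2x2 table.

    a, b: events and non-events in the treatment arm
    c, d: events and non-events in the comparator arm
    Assumes no cell is zero.
    """
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def dersimonian_laird(effects, variances):
    """Pool study effects with a DerSimonian-Laird random-effects model."""
    k = len(effects)
    w = [1 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Method-of-moments estimate of between-study variance, truncated at zero
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # I^2: share of total variability attributed to heterogeneity, in percent
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "pooled": pooled,
        "se": se,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical 2x2 tables for two small trials (not data from the review)
tables = [(6, 20, 7, 19), (9, 12, 8, 13)]
effects, variances = zip(*(log_odds_ratio(*t) for t in tables))
result = dersimonian_laird(list(effects), list(variances))
print("Pooled OR:", math.exp(result["pooled"]))
print("95% CI:", math.exp(result["ci_low"]), math.exp(result["ci_high"]))
print("Q:", result["q"], "I2 (%):", result["i2"])
```

Pooling is done on the log scale and exponentiated only for reporting, which keeps the confidence interval symmetric around the log odds ratio.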

Results

Study Selection
We screened 427 citations and included 14 unique records in the review. The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) diagram below depicts the flow of records from identification to inclusion (Figure B). Most records were excluded at full-text screening (n=197) based on the reasons listed in the diagram. Appendix E provides a list of excluded studies, and Appendix F provides individual-level study data. Table A presents general summary characteristics of the included studies. Most studies were observational and included cohorts and case series. Two studies were RCTs, and one was a nonrandomized trial. Sample sizes ranged from 9 to 1,366, but greater than 70 percent of studies included at least 200 participants (average 291 ± 395). All studies were from the United States, and participants were recruited either from single-center study sites or from a national proprietary database run by Matria Healthcare. The Matria database provides an outpatient perinatal program consisting of 24-hour nursing and pharmacy support, home uterine activity monitoring, individualized education, and provision of tocolytic therapy to women with preterm labor. Because five studies originated in the Matria database, and not all reported geographic region and/or years over which participants were recruited, the question of overlap in participants across these studies was an important concern of reviewers. Through the Scientific Resource Center (SRC), we requested this missing information from Matria (now called Alere) Healthcare but did not receive a response. Therefore, where appropriate, we report this risk of double-counting of participants.

Risk of Bias Assessment
We rated studies as low, medium, or high risk of bias for the relevant reported outcomes. Although the randomization procedures in the two RCTs were appropriate, we rated one RCT as low risk of bias 10 and the second RCT as high risk of bias because more than 90 percent of eligible participants declined to participate, the study was underpowered, and blinding was ineffective. 11 The single nonrandomized trial was high risk of bias for the outcomes of birth weight and gestational age at delivery due to potential prognostic imbalances in groups. However, we did not anticipate that such imbalances would impact the outcome of maternal hyperglycemia, which we rated as medium risk of bias, due to insufficient information to assess several other criteria. 12 We rated most of the cohort studies as high risk of bias because there were important group imbalances in baseline characteristics or prognostic factors. [13][14][15]18,21 The other cohort studies we rated as medium risk of bias; although these studies had no identifiable flaws, several criteria could not be assessed due to incomplete reporting. 16,17,19,20 Lastly, we rated the two case series as medium risk of bias because neither study provided clear definitions for the pump-related harm outcomes, and several criteria, such as compliance, adequacy of sample size, and selective outcome reporting, were unclear. 22,23

Neonatal Health Outcomes (KQ1)
Strength of evidence is insufficient for bronchopulmonary dysplasia, death within initial hospitalization, and significant intraventricular hemorrhage (grade III/IV). Based on one retrospective cohort of medium risk of bias, the strength of evidence favoring the SQ terbutaline pump compared with oral tocolytics for neonatal death in women with twin gestation and recurrent preterm labor (RPTL) is low (Table B). This study investigated women from the Matria database and reported a statistically significant difference in neonatal death in favor of SQ terbutaline pump (OR = 0.09, 95% CI: 0.01, 0.70). 19 Sparse evidence from underpowered studies addressed necrotizing enterocolitis, retinopathy of prematurity, and sepsis with inconclusive results. 11,13 No data were available for periventricular leukomalacia and seizures.
Three retrospective cohort studies from the Matria database reported stillbirths in women with RPTL and single or twin gestation. [17][18][19] All three studies found nonsignificant differences between the SQ terbutaline pump and oral tocolytics. However, these studies were likely underpowered to detect a difference in stillbirth, given the small number of events (<1%).

Table B footnotes:
* [...] 11 No events occurred in the SQ terbutaline pump group or in the two comparator groups. We did not grade this evidence here because it did not pertain to any of the subgroups of interest.
† One RCT reported significant intraventricular hemorrhage. 10 No events were observed in pump or comparator groups. We did not grade this evidence here because it did not pertain to any of the subgroups of interest.
‡ Incidence of delivery < 34 weeks was reported in one RCT, which showed a nonsignificant difference between SQ terbutaline pump and placebo (OR = 0.95, 95% CI: 0.32, 2.87). We did not grade this evidence here because it did not pertain to any of the subgroups of interest.
§ Incidence of delivery < 37 weeks was also reported in one RCT, which showed a nonsignificant difference between SQ terbutaline pump and placebo (OR = 1.57, 95% CI: 0.49, 5.02). We did not grade this evidence here because it did not pertain to any of the subgroups of interest.
** Mean prolongation of pregnancy was also reported in two RCTs, with nonsignificant effect estimates. We did not grade this evidence here because it did not apply to any of the subgroups of interest.
†† Studies were not pooled. Also, there was risk of double-counting of participants across these studies.

Other Surrogate Outcomes (KQ2)
Studies reported surrogate outcomes of preterm labor much more frequently than neonatal or maternal clinical endpoints. However, none of the included studies examined incidence of delivery < 28 weeks (strength of evidence is insufficient, Table B), need for oxygen per nasal cannula, or ratio of birth weight/gestational age at delivery.

Incidence of Delivery at Various Gestational Ages
Incidence of delivery < 32 weeks: The strength of evidence favoring SQ terbutaline pump compared with either oral tocolytics or no treatment is low for women with RPTL and those additionally with twin gestation (OR range = 0.04-0.52, 95% CI range: 0.00-0.35, 0.50-0.76) (Table B). The evidence originated in six, mostly Matria-based, cohort studies of medium to high risk of bias. 13,[15][16][17][18][19]
Incidence of delivery < 34 weeks: The strength of evidence for this outcome is insufficient (Table B). One small RCT (n=52), which did not address any of the populations of interest, showed a nonsignificant difference between SQ terbutaline pump and placebo in women with singleton gestation. 10
Incidence of delivery < 37 weeks: The strength of evidence favoring SQ terbutaline pump compared with oral tocolytics or no treatment is insufficient or low for women with RPTL (Table B). Four of five cohort studies of medium to high risk of bias, mostly from the Matria database, reported statistically significant differences in favor of SQ terbutaline pump (OR range = 0.04-0.75, 95% CI range: 0.01-0.58, 0.23-1.20). 13,15,17,18,20

Mean Gestational Age at Delivery
Larger cohort studies of medium to high risk of bias in women with RPTL and single or twin gestation demonstrated consistent benefit of SQ terbutaline pump compared with oral tocolytics or no treatment (RPTL and singleton gestation: difference in means range = 0.70-3.40 weeks, 95% CI range: 0.28-1.80 weeks, 0.98-5.00 weeks; RPTL and twin gestation: difference in means = 0.70 weeks, 95% CI range: 0.43-0.48 weeks, 0.92-0.97 weeks). 13,[15][16][17][18][19] Most participants in the cohort studies came from the Matria database. RCT evidence not directly addressing the populations of interest yielded a nonsignificant effect estimate between the pump and placebo (n=52 and n=42). 10,11
For mean prolongation of pregnancy, the strength of evidence favoring SQ terbutaline pump compared with oral tocolytics or no treatment is insufficient or low for women with twin gestation and/or RPTL (difference in means range 5. [...]) (Table B). 13,[15][16][17][18] This evidence came from five cohort studies of medium to high risk of bias, mostly from the Matria database. Two small RCTs (n=52 and n=42), which did not pertain to any of the populations of interest, showed nonsignificant differences between SQ terbutaline pump and placebo.

Birth Weight
Cohort studies of women with RPTL and single or twin gestation demonstrated statistically significant differences in mean birth weight in favor of SQ terbutaline pump compared with oral tocolytics or no treatment (range of mean difference in grams = 136-721, 95% CI range: [...]). 13,[16][17][18][19] Aside from one study, all were from the Matria database. [16][17][18][19] Two small RCTs (n=52 and n=42), which did not pertain to any of the populations of interest, reported nonsignificant differences between SQ terbutaline pump and placebo.

Need for Assisted Ventilation
Both found statistically significant differences in favor of the SQ terbutaline pump compared with either no treatment or oral terbutaline (mean difference = 0.41, 95% CI: 0.26, 0.56; and 0.14, 95% CI: 0.02-0.26).
One cohort study from the Matria database reported a nonsignificant difference between the SQ terbutaline pump and oral tocolytics in requirement for ventilator among infants with NICU admission. 18

NICU Admission
Incidence of NICU admission: Statistically significant differences in favor of the SQ terbutaline pump compared with oral tocolytics or no treatment were reported in cohort studies of women with RPTL and single or twin gestation (OR range 0.28-0.72, 95% CI range: 0.08-0.58, 0.63-0.97). 13,[15][16][17][18][19] Again, most of these studies were from the Matria database. One small RCT (n=52), which did not pertain to any of the populations of interest, reported a nonsignificant difference between the SQ terbutaline pump and placebo. 10
NICU length of stay: Statistically significant differences in favor of the SQ terbutaline pump compared with oral tocolytics or no treatment were also reported for NICU length of stay in mostly Matria-based cohort studies of women with RPTL and single or twin gestation (range of mean difference in days: [...]). 13,15,18,19 Another small RCT (n=42), which did not address any of the subgroups of interest, reported a nonsignificant difference between the SQ terbutaline pump and placebo or oral terbutaline. 11

Maternal Harms (KQ3)
The strength of evidence is insufficient for Withdrawal-AE (Table B). One prospective cohort in women with singleton gestation and RPTL yielded a highly imprecise odds ratio favoring no treatment compared with the pump for tachycardia/nervousness (OR=25.48, 95% CI: 1.23, 526.6). 13 Underpowered studies demonstrated indeterminate results for the outcomes of mortality, pulmonary edema, and therapy discontinuation (i.e., type II error cannot be excluded). 10,18,19 Two studies, a retrospective cohort and a nonrandomized trial, demonstrated nonsignificant differences between the SQ terbutaline pump and oral terbutaline in the incidence of gestational diabetes, though type II error cannot be excluded. No data were available on heart failure, myocardial infarction, refractory hypotension, and hypokalemia.
Until 2009, 16 maternal deaths and 12 cases of maternal cardiovascular events (hypertension, myocardial infarction, tachycardia, arrhythmias, and pulmonary edema) in association with terbutaline tocolysis were reported to the FDA. Of these, at least three maternal deaths and three cardiovascular adverse events were clearly reported to be in association with the use of the SQ terbutaline pump. 24
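Very wide odds ratio intervals, such as the one reported above for tachycardia/nervousness, typically arise when one cell of the 2x2 table is empty or nearly so. The Python sketch below illustrates that general point with a Wald interval and a 0.5 continuity correction. The review does not state how its intervals were computed, so both the method and the counts here are assumptions made for illustration only.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a, b: events and non-events in the exposed group
    c, d: events and non-events in the comparator group
    If any cell is zero, 0.5 is added to every cell (a common
    continuity correction; assumed here, not taken from the review).
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Hypothetical counts: 5 of 15 with the event versus 0 of 20
or_, low, high = odds_ratio_ci(5, 10, 0, 20)
print(f"OR = {or_:.2f}, 95% CI: {low:.2f}, {high:.2f}")
```

With a zero cell, the corrected 0.5 count dominates the standard error, so the interval spans orders of magnitude even when its lower limit sits just above 1.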

Neonatal Harms (KQ4)
Neonatal harms data were very sparse. Neonatal hypoglycemia was reported in only one RCT that compared the SQ terbutaline pump with placebo and oral terbutaline. 11 Differences between the SQ terbutaline pump and placebo or oral terbutaline were nonsignificant. However, given the small number of events and limited sample size (n=42), the RCT was underpowered and the results are inconclusive. No studies reported neonatal hypocalcemia or ileus.

Assessment of Confounding by Level of Activity and Level of Care (KQ5)
Only a small number of studies could be rated for level of activity and level of care. Therefore, we could not carry out meta-regressions to explore the effect of these variables on maternal and neonatal outcomes. Furthermore, we could not explore the impact of level of activity on effect estimates even qualitatively because all studies that could be rated were designated as having "low" level of activity. No apparent trends in effect estimates according to level of care based on qualitative assessments were observed.

Incidence of Pump Failure (KQ6)
Two case series and one RCT reported outcomes related to the pump device. 11,22,23 In a case series of 51 women, one participant had dislodgment of catheter (2 percent, exact central CI: 0.5%, 10%) and there was one pump that malfunctioned (2 percent, exact central CI: 0.5%, 10%). 22 No infusion site infections or mechanical failures were observed in a case series of nine women. 23 An underpowered RCT demonstrated indeterminate results for the outcomes of local pain and local skin irritation. 11
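An exact central (Clopper-Pearson) confidence interval for a proportion, as cited for the pump-related events above, can be computed directly from the binomial distribution. The following Python sketch is a minimal standard-library illustration, not the software used in the review; the function names are hypothetical, and the interval limits are found by simple bisection.

```python
import math


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x + 1))


def _root_increasing(f, iterations=100):
    """Bisection root of a function that increases from negative to positive on [0, 1]."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if f(mid) < 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact central confidence interval for x events among n participants."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2
    lower = 0.0 if x == 0 else _root_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2
    )
    # Upper limit: the p at which P(X <= x) equals alpha / 2
    upper = 1.0 if x == n else _root_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p)
    )
    return lower, upper


# One event among 51 women, and no events among 9 women
print(clopper_pearson(1, 51))
print(clopper_pearson(0, 9))
```

For one event among 51 women the upper limit comes out near 10 percent, and for no events among nine women the upper limit is roughly one third, which shows how little a small case series can rule out.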

No data were available for missed doses or overdoses.

Applicability
In Table C below, we summarize the overall applicability of the evidence base, according to the domains of population, intervention, comparison, outcomes, timing, and setting.

Population
The majority of evidence pertained to women with recurrent preterm labor and singleton gestation in the United States. Very little information was reported about the study populations' demographic and clinical characteristics. Nine of 14 studies (64 percent) included women judged to be in labor on account of persistent contractions and cervical change. The definition of labor was unclear in other studies. Among the studies that suggested that the pump was efficacious, 50 percent reported cervical change and contractions as part of the definition of labor while 50 percent did not report how labor was defined.

Intervention
Although there were gaps in reporting, the intervention generally did not pose any serious limitations to applicability. Very few details were reported on cointerventions that could modify the effectiveness of therapy, such as administration of corticosteroids. In several studies, participants received specialized outpatient services from Matria Healthcare.

Comparison
Comparators included oral tocolytics, no treatment, and placebo.

Outcomes
Surrogate outcomes were the most commonly reported. Data on clinical outcomes, neonatal/maternal harms, and pump-related outcomes were sparse. Long-term outcomes have not been reported at all.

Timing of Outcomes Measurement
The absence of followup beyond delivery is a major limitation because important long-term outcomes have not been evaluated.

Setting
All studies were from the United States and participant data were acquired from a national database (Matria) or from single center sites. Women from the Matria database generally received a high level of care from an outpatient perinatal program. However, the distribution of regions from which patient data were included into the national database is unknown and information about the standards followed by the individual practice sites that provided obstetrical care was not reported. Similarly, for those studies that took place at single center sites, the standards of care followed at these sites are unclear.

Discussion
In this small review of 14 studies, most data came from observational designs, and several studies analyzed data from the Matria database. Aside from two RCTs, the studies exhibited considerable clinical and methodological heterogeneity. For the gradable outcomes, the available evidence addressed only two specific populations of interest-women with RPTL or those additionally with twin gestation. The strength of evidence favoring the SQ terbutaline pump compared with oral tocolytics for neonatal death in women with twin gestation and RPTL is low (OR = 0.09, 95% CI: 0.01, 0.70). While this result is striking in the presence of insufficient findings on other neonatal health outcomes summarized below, it is apparent that it stems from the largest of studies contributing data on neonatal health outcomes with more than 700 patients. As such, it is the only outcome that appears to be adequately powered to reach statistical significance. Strength of evidence favoring terbutaline pump compared to oral tocolytics or no treatment is also low for women with twin gestation and/or RPTL for the surrogate outcomes of pregnancy prolongation. For bronchopulmonary dysplasia, significant intraventricular hemorrhage, death within initial hospitalization, and Withdrawal-AE, strength of evidence is insufficient. The evidence was inconclusive for all other neonatal health outcomes, neonatal harms, maternal harms, and pump-related outcomes.
Based on postmarketing surveillance data, the FDA has issued a new warning against the use of terbutaline in general, and as an injection in particular, as maintenance tocolysis (i.e., beyond 48-72 hours) in pregnant women. 24 Although meriting transparent disclosure in the form of a warning, evidence emerging from case reports is usually regarded as a noncomparative, hypothesis-generating signal rather than a hypothesis-testing confirmation. 25 Furthermore, case reports are useful in identifying rare and unexpected adverse events: the rarer the adverse event, the stronger is the effect size, and the magnitude of effect size is an important criterion that increases our confidence in an estimate. 9 However, adverse events such as death, hypertension, myocardial infarction, tachycardia, arrhythmias, and pulmonary edema that were reported with the use of terbutaline are not so unexpected in any adult population; pregnant women may experience these adverse events for other reasons in the absence of terbutaline therapy.
Observational studies of medium to high risk of bias, primarily from the Matria database, showed benefit of SQ terbutaline pump compared with oral tocolytics or no treatment for other surrogate outcomes, such as birth weight and NICU admission, for women with twin gestation and/or RPTL. In contrast, two small RCTs that did not address any of the populations of interest, reported nonsignificant differences for several surrogate outcomes.
The evidence base for this review contained several limitations. Most evidence came from observational designs of medium to high risk of bias. Several outcomes revealed nonsignificant results that could be attributed to type II error. Type II error is a statistical term for the failure of a study to detect a difference that truly exists, typically because of a small sample size (false negative). Many important variables, such as race, socioeconomic status, and fetal fibronectin level, were not reported. Furthermore, cointerventions, such as administration of corticosteroids, were rarely described. None of the included studies assessed long-term childhood outcomes, such as childhood development, neurobehavioral testing, long-term lung function, and long-term vision. We searched the literature comprehensively and selected reports based on well-defined inclusion and exclusion criteria. However, one potential limitation of our review process is that we excluded potentially relevant non-English publications. Also, we could not investigate the impact of publication bias. However, in completing this review, we undertook an extensive grey literature search. Further, we requested relevant scientific information from the industry and had many experts in the field participate in the review process. Despite this thorough process, the number of identified studies was very small, and we had too few studies per outcome to perform statistical assessment of publication bias. We believe that all relevant data regarding the use of subcutaneous terbutaline for the prevention of preterm labor are captured in this review. Any exaggerated positive findings are more likely due to the medium to high risk of bias detected in observational studies than publication bias.
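The role of type II error in the limitations above can be made concrete with an approximate power calculation, since the type II error rate is one minus power. The Python sketch below uses the normal approximation for a two-sided comparison of two proportions with equal group sizes. The event rates and sample sizes are hypothetical and are not drawn from the included studies.

```python
import math
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    Uses the normal approximation with equal group sizes and ignores the
    negligible chance of rejecting in the wrong direction.
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error under the null (pooled) and under the alternative
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    return normal.cdf((abs(p1 - p2) - z_alpha * se_null) / se_alt)


# Hypothetical rare outcome: 1.0 percent versus 0.5 percent event rates
for n in (21, 350, 10000):
    power = power_two_proportions(0.01, 0.005, n)
    print(f"n per group = {n}: power = {power:.3f}, type II error = {1 - power:.3f}")
```

For rare outcomes, power stays low until group sizes reach the thousands, so a nonsignificant result from a trial with a few dozen participants per arm says little about whether a true difference exists.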
In conclusion, the available evidence suggests that pump therapy is beneficial as maintenance tocolysis. However, our confidence in the validity and reproducibility of this evidence is low. While postmarketing surveillance has detected cases of serious harms, safety of the therapy remains unclear.

Future Research
Although cohort studies have provided a glimpse of the potential for the SQ terbutaline pump to improve short-term neonatal outcomes for fetuses at risk for preterm birth, several important questions remain unanswered. Most importantly, it remains to be seen whether SQ terbutaline pump therapy alters long-term development or systemic impairment of offspring, and neonatal/maternal morbidity and mortality. The limitations of the available data must also be recognized. Most of the cohort studies had medium to high risk of bias. In addition, several of the cohort studies investigated participants from a single proprietary database (Matria), which raises concerns regarding double-counting of patients and common biases. Therefore, results showing effectiveness should be interpreted with caution, especially in light of the most recent FDA warning recommending against the use of terbutaline for maintenance tocolysis.
Information is lacking on the effectiveness and safety of SQ terbutaline pump as a maintenance tocolytic treatment in specific populations, including women who deliver at specific gestational ages, women of different racial or ethnic backgrounds, and women with previous preterm birth or preeclampsia. Future studies, whether observational or experimental in design, should focus on garnering evidence for these specific populations.
Below we provide some specific recommendations for the conduct of RCTs and observational studies to further elucidate the potential benefits and harms of SQ terbutaline pump for maintenance tocolysis.

Randomized Trials
We recommend that an adequately powered randomized controlled and pragmatic clinical trial that assesses the SQ terbutaline pump as a maintenance tocolytic be conducted. A pragmatic RCT is designed to have broad applicability so that the results can guide decisions about practice. 26 Conducting RCTs to assess the efficacy of tocolytics in general is notoriously difficult. A definitive trial in this domain must include a focus on accurate diagnosis of preterm labor (perhaps combining stringent clinical criteria with factors such as positive fetal fibronectin and shortened transvaginal cervical length). Emphasis must also be placed on securing funding and maintaining followup for an appropriate duration of time to allow assessment of long-term childhood outcomes, including neurobehavioral testing and developmental assessment.
Such a trial should be placebo controlled and include blinding of study participants, care providers, and study personnel. Consideration should be given to employing multiple treatment arms in order to evaluate the pump against other tocolytic agents and conservative management. Furthermore, the level of care provided to participants (i.e., nursing assessments, home uterine monitoring, education, telephone support, and restriction of activities) should be practical, feasible, and likely to be adopted in routine practice. Important cointerventions, such as administration of corticosteroids, should be reported. A full accounting of the number of women approached but not enrolled should be included to allow users to assess the impact of respondent bias. The analysis should be "intent to treat," where all participants assigned by randomization to each group are included in the primary comparisons, regardless of whether the assigned medication was received. Outcomes to be examined should go beyond those of prolongation of pregnancy and birth weight to hard clinical endpoints of neonatal morbidity, such as bronchopulmonary dysplasia, necrotizing enterocolitis, significant intraventricular hemorrhage (grade III/IV), retinopathy of prematurity, sepsis, stillbirth, and neonatal death. Lastly, there should be long-term followup to assess subsequent childhood outcomes. Pharmacodynamic and pharmacokinetic outcome measures can additionally be studied to understand inter-individual differences in effectiveness and toxicity and avoidance of β-agonist related tachyphylaxis.

Observational Studies
Although the RCT is the ideal study design for evaluating the efficacy of interventions, it may not be feasible for a number of reasons, such as prohibitive sample size requirements and ethical considerations. We realize that collecting RCT evidence on clinically important outcomes may not be possible because a large number of patients will need to be recruited to detect rare events, such as maternal deaths. Therefore, we additionally propose:
• Well-designed, well-powered cohort studies examining clinical outcomes. These studies should include a representative inception cohort of all patients with arrested preterm labor. Since observational studies are susceptible to the effects of confounding, future observational studies should measure, report, and adjust for potential confounders such as fetal fibronectin, cervical length/dilation, cerclage, maternal characteristics (e.g., age, race), level of care and activity, and concomitant medications. Propensity scores based on these variables may be considered. Other considerations about power, multiple comparison groups, level of care, reporting of cointerventions, and long-term followup are the same as for RCTs.
• Record linkage studies in which mothers' prenatal and infants' NICU and childhood developmental electronic health records are linked may be a more practical research proposition for the near future with improvements in quality and accessibility of electronic patient records. NICU registries in which prenatal data of mothers are available can be a very valuable source. However, such linkage-based studies may also be affected by the biases common to cohort study designs, especially confounding because of unmeasured or unrecorded variables with important prognostic implications.

Introduction
Preterm birth continues to be one of the largest contributors to neonatal morbidity and mortality worldwide and is associated with both short- and long-term disability. Preterm birth is defined as delivery before the completion of the 37th week of gestation and affects 12.3 percent of live births in the United States. 1 According to the 2010 National Vital Statistics report, there were 542,893 preterm births in the United States in 2006. 2 Approximately 40 percent of preterm births occur after the spontaneous onset of preterm labor. 27 Rates of preterm birth result in a significant disease burden to the health care system. Interestingly, the most recent data suggest a modest decrease in the preterm birth rate. 28 When medically indicated preterm births are excluded, the rates of spontaneous labor appear to have fallen by 20 to 30 percent. 29 The reasons behind this encouraging improvement deserve further elucidation.
Although overall rates of neonatal mortality are declining, infants born too early continue to be at risk. Neonatal survival increases steadily with increasing gestational age, particularly prior to 32 weeks' gestation, largely due to advances in neonatal care. Although this improvement is promising, the resulting increase in short- and long-term morbidity and the effect on quality of life for survivors are of concern. 30 Perinatal morbidity among premature infants can be loosely divided into short-term and long-term sequelae. Short-term outcomes include respiratory distress syndrome, intraventricular hemorrhage, necrotizing enterocolitis, sepsis, hypoglycemia, thermal instability, and jaundice. Although these immediate concerns are important and often require treatment in the neonatal intensive care unit (NICU), long-term outcomes determine the overall quality of life for children and their families. These outcomes include bronchopulmonary dysplasia, significant intraventricular hemorrhage (grade III/IV), and retinopathy of prematurity.
The diagnosis of preterm labor is elusive, largely because the exact sequence and timing of events are poorly described and incompletely understood. This is partly due to the multifactorial nature of preterm labor, which arises from several different pathways but culminates in the same outcome of preterm birth. Further, the symptoms of preterm labor (i.e., pelvic pressure, increased vaginal discharge, backache, and menstrual-like cramping) are often vague. In the past, the presence of regular uterine contractions was sufficient to diagnose preterm labor and much of the literature uses this criterion. Due to the lack of precision involved in diagnosis, up to 40 percent of women with a preterm labor diagnosis are not actually in labor. Consequently, a significant proportion of women enrolled in clinical trials of tocolytic efficacy were not destined to deliver preterm. 31,32
Currently, most clinicians require appropriate evidence of progressive change in cervical dilation and/or effacement before a diagnosis of preterm labor is made. A diagnosis of preterm labor made based on contraction frequency of ≥ six per hour and cervical dilation ≥ 3 cm and/or effacement ≥ 80 percent, or if membranes rupture or bleeding occurs, is reasonably accurate. Most clinicians, however, view any documented cervical change accompanied by regular contractions to indicate preterm labor, and intervene before the aforementioned criteria are met. Although doing so results in treatment of more women who are not destined to deliver preterm, early intervention is believed to benefit the infants of women who are experiencing true preterm labor. Documentation of cervical change is thought to increase the sensitivity and specificity of the diagnosis; this may allow for more rigorous evaluation of the effectiveness of treatments for preterm labor and reduce the number of patients treated unnecessarily. 33
Preterm labor is notoriously difficult to treat, owing in large part to the multifactorial nature of the condition. Risk factors for preterm labor include history of preterm birth, multiple gestation, maternal nonwhite race, low socioeconomic status, maternal underweight status, and maternal stress. [34][35][36] A tocolytic is a drug used to inhibit contractions and delay the parturitional process. Tocolytic therapy has thus far demonstrated poor efficacy, likely because the parturitional process is already well established. The goals of tocolysis are to reduce neonatal morbidity and mortality without causing significant maternal or neonatal side effects. Tocolytic therapy to date has primarily focused on short-term delay in delivery to allow for maternal administration of corticosteroids and transport to an appropriate facility for neonatal care. The most appropriate measures to assess the efficacy of tocolytic agents should focus on improved health outcomes for infants. To date, most tocolytic trials of maintenance therapy have insufficient power to assess such endpoints.
Spontaneous preterm labor occurs in the absence of maternal or fetal illness. The ultimate goal of treating preterm labor is to reduce long-term mortality and morbidity of the offspring. Unfortunately, most studies to date have focused on the prevention of preterm birth using surrogate endpoints such as gestational age at delivery, birthweight, and NICU admission.
The majority of tocolytic agents used to inhibit uterine contractions are efficacious for a period of approximately 48 hours. In contrast, terbutaline sulfate has been used in selected patients as maintenance tocolytic therapy to inhibit uterine contractions for longer periods of time after an episode of preterm labor has been arrested acutely with first-line tocolytic agents (including but not limited to indomethacin, magnesium sulfate, nifedipine, and nitroglycerin). Terbutaline is a β-sympathomimetic drug that relaxes smooth muscle in the bronchial tree, blood vessels, and myometrium. 4 For maintenance tocolysis, terbutaline is delivered by a subcutaneous (SQ) pump, usually at a basal rate of 0.03-0.05 mg/hr with intermittent boluses of 0.25 mg every 4 to 6 hours. 37 Maternal side effects are common because terbutaline does not act on the myometrium alone. Although most side effects are mild and self-limiting, such as shortness of breath, chest pain, anxiety, and fatigue, serious adverse reactions such as pulmonary edema, myocardial ischemia, cardiac arrhythmias, hypotension, and metabolic alterations may also occur. As with all other contemporary tocolytics, the use of terbutaline for maintenance tocolysis is off-label. The Food and Drug Administration has approved terbutaline for the management of acute and chronic obstructive pulmonary disease. 10
Maintenance tocolytic therapy with a terbutaline pump has been evaluated in two systematic reviews. A Cochrane review concluded that the pump does not decrease the risk of preterm birth by prolonging pregnancy based on two small randomized trials (Guinn n=52 and Wenstrom n=42). 4 Further, lack of information on safety and costs to implement therapy do not support use of the pump for the clinical management of arrested preterm labor. Another review included both randomized trials from the Cochrane review and also four additional observational studies. 5 Results were contradictory, with randomized trials failing to show efficacy and observational studies demonstrating positive results.
An Agency for Healthcare Research and Quality (AHRQ) review examined the use of β-mimetic agents for maintenance tocolysis. 38 No benefits were observed for gestational age at birth, prolongation of pregnancy, or birthweight. In addition, β-mimetics were classified as conferring a high probability of maternal risk, including cardiovascular harms. However, the investigators did not distinguish between first-line and maintenance therapies when assessing harms and the SQ terbutaline pump was not examined specifically.
Despite conflicting evidence from previous systematic reviews, proponents of the terbutaline pump believe it still has a role as a maintenance tocolytic agent in women with arrested preterm labor. Given the discrepancy between available data and clinical practice, AHRQ commissioned a new systematic review of the literature to clarify the benefits and harms of the SQ terbutaline pump for maintenance tocolysis.
In this report we have systematically reviewed and summarized the available literature on the use of terbutaline pump for maintenance tocolytic therapy in women with arrested preterm labor. This evidence report will add to previous systematic reviews by performing an up to date search of the literature, synthesizing evidence in the context of specific populations of women, addressing confounding by level of maternal activity and level of maternal care, and grading the strength of evidence for important outcomes to help decision-makers develop evidence-based recommendations and policies.
A systematic review of terbutaline pump for maintenance tocolysis will inform clinicians, patients, and policymakers about the appropriateness of ongoing use and provide support for clinical guidelines. We hope that this review will result in safer and more effective treatment of women with preterm labor and, ultimately, in a reduction in mortality and long-term morbidity for their offspring.
Based on comments received during the peer-review process, we made modifications to the format of the Key Questions (For details, see Appendix G).

Conceptual Framework and Key Questions
As shown in the analytic framework (Figure A, Executive Summary), we focused our evidence review on the following six Key Questions.
In women with arrested preterm labor, does treatment with a SQ infusion of terbutaline delivered by a pump, in comparison with placebo, conservative treatment or other interventions:
Key Question 1: improve neonatal health outcomes, including bronchopulmonary dysplasia, neonatal death, death within initial hospitalization, significant intraventricular hemorrhage (grade III/IV), necrotizing enterocolitis, periventricular leukomalacia, retinopathy of prematurity, seizures, sepsis, and stillbirth for the following subgroups:
a. Women <28 weeks, 0 days of gestation (extremely preterm)?
b. Women between 28 weeks, 0 days and 31 weeks, 6 days of gestation (very preterm)?
c. Women between 32 weeks, 0 days and 33 weeks, 6 days of gestation (

Topic Development and Refinement
With input from key informants, we developed the PICOTS (population, intervention, comparator, outcome, timing, setting), conceptual framework, and key questions during the topic refinement stage. The Key Questions were posted to the Effective Health Care Web site. The public was invited to comment on the Key Questions. After reviewing the public commentary, we drafted the final Key Questions and submitted them to the Agency for Healthcare Research and Quality for approval. The Technical Expert Panel (TEP) reviewed the protocol and provided additional clinical and methodological input. The analytic framework (Figure 1), which was developed by the review team in consultation with the TEP, outlines the main elements of each Key Question.

Search Strategy
In consultation with the rest of the team, our medical information specialist developed and tested electronic search strategies through an iterative process. Following published recommendations, MEDLINE and Embase strategies were peer reviewed by another information specialist using the PRESS Checklist and any amendments were subsequently applied to all databases. 39 We obtained additional references by hand-searching the bibliographies and text of review articles, letters to the editor, and commentaries identified during the screening of titles, abstracts, and full texts and with input from members of the TEP. We also hand-searched the reference lists of included studies for relevant citations. We conducted a grey (unpublished) literature search by scanning the Web sites of relevant specialty societies and organizations, health technology assessment agencies, guideline collections, regulatory agencies, and trial registries (see Appendix B). The Scientific Resource Center (SRC) also conducted a grey literature search of regulatory information, clinical trial registries, abstracts and conference papers, grants and federally funded research, and the New York Academy of Medicine's Grey Literature Index (see Appendix B). Materials obtained from the grey literature searches were evaluated by one reviewer for additional relevant references. In February 2011, the US Food and Drug Administration (FDA) issued new warnings against the use of terbutaline to treat preterm labor, so we also accessed a summary of the FDA postmarketing surveillance results. This decision was made post hoc.
The SRC requested information about published and unpublished randomized controlled trials (RCTs) and observational studies from pharmaceutical companies (see Appendix C). We screened the Scientific Information Packages that were submitted by industries, and sought unpublished information from Matria (now called Alere) Healthcare about their perinatal program and associated database.
The searches yielded a total of 431 citations after removal of duplicates. All citations were imported into an electronic database for screening and data extraction (Distiller Systematic Review Software; an Internet-based software program intended to facilitate collaboration among reviewers during the screening of abstracts and full texts, data extraction, exclusion reports, and table construction).

Study Selection
We developed inclusion and exclusion criteria based on the patient population, intervention, outcome measures, and study designs specified for the Key Questions. We screened titles and abstracts at Level 1 and full texts at Levels 2 and 3. The full-text articles of relevant abstracts, as assessed at Level 1 screening, were retrieved and assessed for relevancy by reapplying the inclusion criteria at Level 2 and Level 3 screening. The purpose of Level 3 screening was to further classify studies based on outcome and study design. Articles that passed through Level 1 to Level 3 screening were included in the review (see Appendix D for Level 1, 2, and 3 screening forms). Non-English language records without an English abstract were excluded. Results published only in abstract form were considered for inclusion only if sufficient information was presented to assess eligibility and validity. Two reviewers independently screened abstracts and full-text articles. Conflicts were resolved by consensus or by third-party adjudication.
Studies with the following population, intervention, comparators, and outcomes were included:
• Population: Pregnant women 24-36 weeks' gestation and with preterm labor that had been arrested with primary tocolytic therapy
• Intervention: Subcutaneous terbutaline (SQ terbutaline) delivered by infusion pump
• Comparators: Either placebo, conservative treatment, or any other intervention
• Outcomes:
1. Primary (neonatal) outcomes included bronchopulmonary dysplasia, necrotizing enterocolitis, significant intraventricular hemorrhage (grade III/IV), periventricular leukomalacia, seizures, sepsis, stillbirth, retinopathy of prematurity, death within initial hospitalization, and neonatal death.
2. Secondary (surrogate) outcomes included gestational age at delivery (continuous variable), incidence of delivery at various gestational ages (<28, <32, <34, <37 weeks), mean prolongation of pregnancy (days), need for assisted ventilation, need for oxygen per nasal cannula, neonatal intensive care unit admission, birth weight, ratio of birth weight/gestational age at delivery, and mean pregnancy prolongation index. Although not specified in the protocol, prolongation of pregnancy was also extracted as a dichotomous variable (i.e., prolongation > 7 days and prolongation > 14 days).
3. Maternal side effects included pulmonary edema, heart failure, arrhythmia, myocardial infarction, refractory hypotension, hypokalemia, hyperglycemia, maternal withdrawal due to adverse effects (withdrawal-AE), maternal discontinuation of therapy, and death. Neonatal side effects included hypoglycemia, hypocalcemia, and ileus.
4. Outcomes of pump failure included missed doses, dislodgment, and overdose.
5. Long-term childhood outcomes included childhood development, neurobehavioral testing, long-term lung function, and long-term vision.
We also included observational studies because very few RCTs were available on this topic.
We considered prospective and retrospective cohort studies, case-control studies, cross-sectional studies, and case series (exclusively for outcomes related to pump failure) as eligible study designs. As a post hoc decision, we sought FDA summaries of postmarketing data highlighting serious harms.
We did not undertake indirect comparisons of RCTs of other tocolytics because, based on a scoping literature search, sparse evidence was anticipated for maintenance tocolytic therapies, mostly single RCTs of various tocolytics, such as atosiban, nifedipine, and ritodrine. [40][41][42] Comparisons from such scant indirect evidence would likely have been inconclusive. Furthermore, indirect comparisons are premature at this point because the efficacy of maintenance tocolysis versus no maintenance tocolysis or placebo remains to be clearly established. Indirect comparisons are helpful when direct comparisons of otherwise efficacious treatments are not available.

Data Extraction
We extracted the following items using the online program Distiller Systematic Review Software: general study characteristics (e.g., year of publication, country of origin, study design, setting, number screened, number included), population characteristics (e.g., inclusion/exclusion criteria, age, race, ratings for maternal level of activity), intervention characteristics (e.g., dose, duration, ratings for maternal level of care, details about comparators), outcomes (definitions and results), risk of bias, and applicability. One reviewer provided ratings for maternal level of activity and maternal level of care and summarized applicability characteristics. Level of activity was rated as low, normal, or high based on a composite assessment of the following variables: marital status, working status, caring for other children in the home, available social support, bed rest, and restriction of maternal activities. Level of care was rated as low, moderate, or high based on the following variables: nursing assessments, home uterine activity monitoring, home visits, education about preterm labor, telephone support, restriction of maternal activities, and other cointerventions. Each variable was given a rating based on predefined criteria (Tables F12 and F14 in Appendix F). We categorized responses into three tier levels and compared each level with another to decide the ratings of low, moderate/normal, and high. These assessments were verified by a clinical expert, with consensus reached by discussion. All other data were extracted by one reviewer, and outcome data were verified by a second reviewer.
When there were multiple reports of the same study, we referenced the most relevant record as the primary identifying study and extracted additional data as available from the companion report(s).

Risk of Bias Assessment
We evaluated risk of bias for each relevant outcome in individual studies using generic criteria for controlled trials and observational studies. Selected items from the McMaster Quality Assessment Scale of Harms were also incorporated into the assessment for those studies that evaluated treatment harms. 8 Two reviewers assessed risk of bias and consensus was reached by discussion or involvement of a third team member. Appendix D presents the risk of bias form used to evaluate studies.
The following risk of bias criteria were evaluated for all included study designs (RCTs, nonrandomized trials, and observational studies including case series):
• Extent to which valid primary outcomes were described. We considered both an explicit and implicit description as adequate, and assessed this only for the stated primary outcomes.
• Differential loss to followup between the compared groups or overall high loss to followup.
• Selective outcome reporting.
• Data quality (i.e., consistency of measurements across outcome assessors and consistency in outcome definitions across data sources; the latter point pertained only to retrospective cohorts).
• Adequacy of sample size.
• Compliance with treatment regimen.
• Selected criteria from the McHarm checklist for studies assessing treatment harms (definition of harms, mode of harms collection, and training/background of personnel collecting harms data).
The following criteria were assessed for all included study designs aside from case series:
• Similarity of groups in terms of baseline characteristics and prognostic factors
• Similarity of groups in terms of administration of primary tocolytic regimen to control acute episodes of preterm labor
• Intention-to-treat analysis
• Differential level of care between the compared groups
Blinding of patients, health care providers, and outcome assessors to treatment allocation and maternal contractions was assessed only for experimental designs (i.e., RCTs and nonrandomized trials). Based on the outcomes of interest, the outcome assessor was assumed to be the same as the health care provider. Two criteria, which pertained exclusively to RCTs, were generation and concealment of the allocation sequence. Two additional criteria were applied only to observational studies (excluding case series) and nonrandomized controlled trials. These included an assessment of whether the same population was used to sample intervention and comparison groups and methods used to control for confounders.
We evaluated intention-to-treat by examining both loss to followup/discontinuation of treatment and unintended crossover to opposite intervention group(s). Loss to followup was assessed either by what was reported in the study or, if not clearly reported, by comparing the number of participants who entered the study with the number of participants reported in outcome table(s). Unlike randomized controlled trials, for which numbers randomized are reported, the reported sample size of nonrandomized studies could be a post hoc determination depending upon the number of participants left for analysis. Therefore, for nonrandomized studies, comparing the number of study participants with the number of participants analyzed as reported in tables may not truly reflect those who were lost to followup or dropped out. For such study designs, assessment of intention-to-treat analysis required that the study report the number of participants who met inclusion/exclusion criteria.
For each relevant outcome in a study, we provided an overall risk of bias rating, designated as high, medium, or low (Table 1). We made these summary ratings within a study design. In order to be classified as high risk of bias, a study must have demonstrated some apparent and major flaw (within that study design category) that would invalidate results.

Table 1. Overall risk of bias ratings
Low risk of bias. These studies have the least bias, and results are considered valid. Studies that adhere mostly to the commonly held concepts of high quality including the following: a formal randomized controlled design; clear description of the population, setting, interventions, and comparison groups; appropriate measurement of outcomes; appropriate statistical and analytic methods and reporting; no reporting errors; low dropout rate; and clear reporting of dropouts.
Medium risk of bias. These studies are susceptible to some bias, but it is not sufficient to invalidate the results. They do not meet all the criteria required for a rating of good quality because they have some deficiencies, but no flaw is likely to cause major bias. Studies may be missing information, making it difficult to assess limitations and potential problems.
High risk of bias. These studies have significant flaws that imply biases of various types that may invalidate the results. They have serious errors in design, analysis, or reporting; large amounts of missing information; or discrepancies in reporting.

Grading the Body of Evidence
The strength of a body of evidence was graded based on the following four domains as per previously published guidance: overall risk of bias by outcome, consistency, directness, and precision. 9 Optional domains such as dose-response association and existence of confounders were considered not relevant to this comparative effectiveness review. Publication bias was also not considered an important concern because we searched the grey literature, reviewed scientific information packets from industry, and had many experts in this field participate as Key Informants, Technical Expert panelists, and peer reviewers. No concerns about additional unpublished studies were raised. Furthermore, as we had few studies per outcome, publication bias could not be statistically investigated. 43 In consultation with the TEP, the review team chose the following outcomes for grading: incidence of delivery at various gestational ages (<28 weeks, <32 weeks, <34 weeks, <37 weeks); mean prolongation of pregnancy; bronchopulmonary dysplasia; significant intraventricular hemorrhage (grade III/IV); neonatal death and/or death within initial hospitalization; and maternal withdrawal-AE. These outcomes were chosen based on importance to patients and clinicians. Each domain was graded by two reviewers, and consensus was reached by discussion.
We used four domains to grade outcomes: overall risk of bias, consistency, directness, and precision. For the body of evidence from observational studies, an initial grade of "low" could be upgraded across the domains where possible. We took care not to double count the inherent limitations of observational studies, so we did not factor in study design when assessing risk of bias. The overall risk of bias of an observational study, therefore, could potentially be "low." We took into account the inherent limitations of observational study designs when we graded the strength of evidence.

Applicability
We considered several factors to assess the applicability of the body of evidence. Population factors included breadth of inclusion/exclusion criteria, exclusion rate, patient demographics, and attrition rate. Intervention factors included dosing and treatment schedules, cointerventions, level of care, pump training, and dose of comparative agent. Outcomes were judged based on clinical utility, definition of harms, and timing of measurement. Geographic and clinical settings were also assessed.
One reviewer summarized the applicability of the body of evidence using the determinants of PICOTS (population, intervention, comparison, outcome, timing, and setting) and a clinical expert provided verification (see Appendix D for applicability form). Important determinants of applicability (population, intervention, and comparator) are presented by outcome for the available evidence. However, this information was not presented if the strength of evidence for an outcome was graded as insufficient (i.e., absent or inconclusive evidence).

Data Synthesis and Analysis
We used a random effects model, following a DerSimonian and Laird approach, to meta-analyze study estimates if they met the following criteria of clinical and methodological homogeneity: (1) same study design; (2) no important differences in the following factors: demographic and obstetrical characteristics; level of care; intervention; comparator type, dose, and frequency of administration; definition of outcome; timing; and clinical setting; and (3) similar risk of bias ratings. We compared SQ terbutaline pump with no treatment, saline infusion, or another tocolytic. If observational studies presented adjusted odds ratios (ORs), then we extracted and used these values in analyses. Otherwise, ORs and 95 percent confidence intervals were calculated for relevant outcomes in each included study. If a study group had no events, we added 0.5 to both event and nonevent cells. An OR of less than one indicates a smaller event rate in the SQ terbutaline pump group. Exact central confidence intervals were calculated for incidence rates presented in case series. These estimates were not meta-analyzed because only single studies were available by outcome. Statistical heterogeneity was assessed using Cochran's Q (α=0.10), and the I² statistic was calculated to quantify the magnitude of heterogeneity. All analyses were performed using Comprehensive Meta Analysis version 2.2.046 or version 2.2.055 (New Jersey, USA). [18][19][20]
Studies that were exclusively of women with singletons were not pooled with studies exclusively of women with multiple gestation to avoid the unit of analysis error due to the cluster effect. Clustering may arise in studies of women with multiple gestation because the unit of randomization or allocation is the mother rather than the infant. Also, pooling was not carried out when we could not rule out the probability that participants were double counted (i.e., use of the same participants and their outcomes data in different studies). A qualitative analysis was conducted on those studies that could not be synthesized quantitatively.
We considered observational studies for meta-analysis only if the reports made it clear that they were similar with respect to major confounding factors (e.g., age, race, comorbidities, history of preterm birth, cervical length, cervical dilation, and fetal fibronectin). Although some studies matched for one or more variables (e.g., by gestational age), in no case was it apparent that there was equivalency in all or even most of these confounders. Therefore, observational studies were not pooled for any of the key questions, even if they were similar with respect to the PICOTS domains.
We needed a minimum of six studies to explore statistical heterogeneity in effect estimates through meta-regression. Since we could pool only a small number of studies, meta-regression was not possible.
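The calculations described in this section can be illustrated with a minimal Python sketch. It is not the Comprehensive Meta Analysis implementation. The function names, the choice to add 0.5 to all four cells of a table that contains a zero, and the 2x2 counts at the end are illustrative assumptions and are not data from the included studies. The sketch reports Cochran's Q and I² but does not compute the p value for Q.

```python
import math


def log_odds_ratio(events_trt, n_trt, events_ctl, n_ctl):
    """Return (log OR, variance of log OR) for one 2x2 table.

    Assumption: if any cell is zero, 0.5 is added to all four cells
    (a common continuity correction; the exact rule used by the
    review software is not specified here).
    """
    a, b = events_trt, n_trt - events_trt
    c, d = events_ctl, n_ctl - events_ctl
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def dersimonian_laird(effects, variances):
    """Pool effects (e.g., log ORs) with a DerSimonian-Laird random effects model."""
    k = len(effects)
    w = [1 / v for v in variances]                      # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0     # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I-squared, percent
    w_star = [1 / (v + tau2) for v in variances]        # random effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "pooled": pooled,
        "se": se,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "q": q,
        "df": df,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical counts: (events in pump group, n pump, events in comparator, n comparator)
hypothetical_tables = [(12, 100, 20, 100), (0, 50, 3, 50), (8, 80, 15, 85)]
pairs = [log_odds_ratio(*t) for t in hypothetical_tables]
result = dersimonian_laird([p[0] for p in pairs], [p[1] for p in pairs])
print("Pooled OR: %.2f (95%% CI: %.2f, %.2f); Q = %.2f, I2 = %.1f%%" % (
    math.exp(result["pooled"]),
    math.exp(result["ci_low"]),
    math.exp(result["ci_high"]),
    result["q"],
    result["i2"],
))
```

Pooling is done on the log OR scale and the result is exponentiated, so a pooled OR below one corresponds to a smaller event rate in the first (pump) group, matching the convention stated above.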

Literature Search
The PRISMA diagram in Figure 2 depicts the flow of retrieved records through the phases of screening and inclusion. The titles and/or abstracts of 427 citations were screened at Level 1. These citations were identified from database searching, reference lists, grey literature, and Technical Expert Panel nomination. We did not identify any relevant data in the Scientific Information Packets submitted by pharmaceutical industries.
At full-text screening, 212 records were reviewed. Most records (n=197) were excluded with reasons listed in the PRISMA diagram. Ultimately, 14 unique records comprised the evidence base to answer the key questions. One record was an informative companion article for the study by Allbert et al. (1994). 20,44 A list of excluded studies is provided in Appendix E.

General Study Characteristics
Most studies were observational in design and all were from the United States. Participants were recruited from single center study sites (n=9) 10-14,20-23 or from a national proprietary database run by Matria (now called Alere) Healthcare, which provides an outpatient perinatal program consisting of 24-hour nursing and pharmacy support, home uterine activity monitoring, individualized education, and provision of tocolytic therapy to women with preterm labor (n=5). [15][16][17][18][19] The comparison groups were placebo (saline pump), 10,11 no treatment, 13 oral terbutaline, 11,12,20,21 oral nifedipine, 15-17 and oral tocolytics. 14,18,19

Population
The definition of labor was unclear in 36 percent of the included studies. The remaining studies included women with persistent contractions and cervical change.
All studies included women who had at least one episode of preterm labor that was arrested with primary tocolytic treatment (often with parenteral magnesium sulfate) with subsequent placement on a maintenance tocolytic regimen. Nine of 14 studies reported persistent contractions or contractions > 4 per hour accompanied by cervical changes as their definition of labor (Appendix F, Table F1). In several studies, only women with two or more episodes of preterm labor during the same pregnancy (i.e., recurrent preterm labor) were eligible for inclusion. [13][14][15][16][17][18][19][20]23 Several studies were conducted exclusively in women with singleton gestation, 10,12,13,15,17,18 although a few studies evaluated women with twins only. 16,19 Some studies may have included women at less than 24 weeks' gestational age (an inclusion criterion for the review was 24 weeks to 36 weeks plus 6 days), but data for these participants could not be separated.

Double-counting of Outcomes Data
Five studies originated in the Matria database, and not all reported geographic region and/or years over which participants were recruited. Therefore, the question of overlap in participants across these studies was an important concern of reviewers. Through the Scientific Resource Center, we requested this missing information from Matria (now called Alere) Healthcare but did not receive a response. Hence, where appropriate, we report this risk of double-counting of participants.
Participants may have been double-counted among the following sets of Matria-based studies:
• For one set of three studies, there is a possibility that the SQ terbutaline pump sample of participants overlapped in all three studies.
• de la Torre et al. and Lam et al. (2001) selected women with twin gestation and used oral nifedipine or oral tocolytics as comparators, respectively. 16,19 Double-counting of participants in the comparator groups will be minimal because 92.3 percent of participants in the oral tocolytic group of the study by Lam et al. (2001) were administered oral terbutaline. 19 However, there is still a possibility that participants in the SQ terbutaline pump groups of these two studies overlapped.

Risk of Bias Assessment
Figures 3, 4, 5, and 6 present risk of bias charts for cohorts, case series, nonrandomized trials, and randomized controlled trials (RCTs), respectively. The bars in black represent studies with risk of bias, and the bars in grey represent studies without risk of bias for the corresponding criteria. Studies that were rated as unclear are represented by white bars. Table F2 in Appendix F presents the full text question that was posed for the criteria listed in the charts. The overall ratings for risk of bias of individual studies are presented in the evidence table (Table F1, Appendix F). Table F3 in Appendix F presents detailed risk of bias assessments for each study.
We rated studies as high risk of bias if we could identify at least one major flaw with the potential to significantly bias results. If we could not assess several factors due to incomplete information, but there was no major flaw, then we rated studies as medium risk of bias. We provided a rating of low risk of bias only if there were no identifiable flaws and there was sufficient information to evaluate most criteria.
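The rating rule above can be sketched as a small Python function. This is only an illustration of the decision logic: the function name, the argument names, and the cutoff for how many unclear criteria still allow a low rating are hypothetical, because the text does not quantify "several" or "most." Treating a study with non-major flaws as medium risk is likewise an assumption.

```python
def rate_risk_of_bias(major_flaws, other_flaws, unclear_criteria, max_unclear_for_low=2):
    """Return 'high', 'medium', or 'low' following the rule described in the text.

    max_unclear_for_low is a hypothetical cutoff standing in for
    "sufficient information to evaluate most criteria".
    """
    if major_flaws >= 1:
        # At least one major flaw with the potential to significantly bias results.
        return "high"
    if other_flaws == 0 and unclear_criteria <= max_unclear_for_low:
        # No identifiable flaws and enough information to judge most criteria.
        return "low"
    # No major flaw, but several criteria could not be assessed
    # (or, by assumption, only non-major flaws were found).
    return "medium"


# Hypothetical examples
print(rate_risk_of_bias(major_flaws=1, other_flaws=0, unclear_criteria=0))
print(rate_risk_of_bias(major_flaws=0, other_flaws=0, unclear_criteria=5))
print(rate_risk_of_bias(major_flaws=0, other_flaws=0, unclear_criteria=1))
```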

Cohort Studies (n=9)
We rated five cohort studies as high risk of bias [13][14][15]18,21 and four as medium risk of bias (Figure 3). 16,17,19,20 The cohort studies with high risk of bias all had imbalances in baseline characteristics or prognostic factors. We rated the remaining cohort studies as medium risk of bias because we could not assess several criteria due to incomplete reporting.

Case Series (n=2)
We rated both case series as medium risk of bias (Figure 4). 22,23 Although we could not identify any major methodological flaws, neither study provided clear definitions for the pump-related harm outcomes. Several criteria, such as compliance, adequacy of sample size, and selective outcome reporting, were unclear.

Nonrandomized Comparative Trials (n=1)
We rated the single nonrandomized trial as having high risk of bias for the outcomes birthweight and gestational age at delivery, and medium risk of bias for the outcome maternal hyperglycemia. Allocation to intervention or comparator groups was based on primary tocolytic treatment. Participants who received < 24 hours of primary tocolysis were placed on oral terbutaline and participants who received > 24 hours of primary tocolysis, multiple courses of tocolysis, or multiple agents, were placed on terbutaline pump. This allocation scheme may have created prognostic differences among groups, which may have had an impact on the preterm birth outcomes (i.e., birth weight and gestational age at delivery). However, there is no clear indication that this would impact the outcome of maternal hyperglycemia.

RCTs (n=3, Randomized Arms)
Figure 6 presents risk of bias assessments for RCTs. 10,11 One RCT had two comparators (placebo and oral terbutaline), and we assessed each randomized arm separately. 11 We rated both arms as high risk of bias because the study was likely underpowered, blinding was ineffective, and the participants represented a select group of patients because more than 90 percent of eligible participants declined to participate. 11 We rated the second RCT as low risk of bias because randomization was carried out properly and patients and health care providers were blinded. 10

Sources of Funding
Most studies did not report sources of funding (Table 4). One RCT 10 and one case series 22

Key Question 1. Neonatal Health Outcomes
In women with arrested preterm labor, does treatment with a SQ infusion of terbutaline delivered by a pump, in comparison with placebo, conservative treatment or other interventions, improve neonatal health outcomes, including bronchopulmonary dysplasia, neonatal death, death within initial hospitalization, significant intraventricular hemorrhage (grade III/IV), necrotizing enterocolitis, periventricular leukomalacia, retinopathy of prematurity, seizures, sepsis, and stillbirth for the following subgroups:
a. Women <28 weeks, 0 days of gestation (extremely preterm)?
b. Women between 28 weeks, 0 days of gestation and 31 weeks, 6 days of gestation (very preterm)?
c. Women between 32 weeks, 0 days of gestation and 33 weeks, 6 days of gestation (preterm)?
d. Women between 34 weeks, 0 days of gestation and 36 weeks, 6 days of gestation (later preterm)?
e. Multiple gestation?
f. Racial or ethnic subgroups?
g. Women with previous preterm birth?
h. Women with history of preeclampsia?
i. Women with recurrent preterm labor (RPTL) and women without RPTL?

Key Points
• Strength of evidence was graded as insufficient for the outcomes of bronchopulmonary dysplasia, death within initial hospitalization, and significant intraventricular hemorrhage (grade III/IV) and for all populations of interest.
• For neonatal death, strength of evidence favoring SQ terbutaline pump over oral tocolytics is low for women with twin gestation and RPTL. For other populations, the evidence was graded as insufficient.
• Underpowered studies demonstrated indeterminate results for the outcomes of necrotizing enterocolitis, retinopathy of prematurity, sepsis, and stillbirth (i.e., type II error cannot be excluded).
• No data were available for periventricular leukomalacia and seizures.
• Data were unavailable for subgroups a-d and for subgroups f-h.

Detailed Analysis
Table F4 in Appendix F presents data for Key Question 1. None of the included studies reported data on bronchopulmonary dysplasia (strength of evidence is insufficient), death within initial hospitalization (strength of evidence is insufficient), periventricular leukomalacia, or seizures. Data could not be separated for women of specific gestational ages (subgroups a-d), racial or ethnic subgroups (subgroup f), women with previous preterm birth (subgroup g), and women with history of preeclampsia (subgroup h).
Below we report results by outcome, grade strength of evidence (for the following prespecified outcomes that were reported in studies: significant intraventricular hemorrhage and neonatal death), and summarize determinants of applicability where relevant. Results from Matria-based studies pertained to women who were inducted into a U.S. national database and who received specialized services from an outpatient perinatal program. We graded the strength of evidence for the specific populations of interest, as indicated in the key question. We also graded evidence from two RCTs that pertained to a nonspecific population of women with preterm labor; one RCT was in women with singleton gestation 10 and the other RCT was in women with either single or twin gestation (effect estimates were not presented separately by gestation).
Summary tables are presented if more than one study was available for an outcome, otherwise all information has been summarized in the text.

Neonatal Death
Neonatal death was reported in three studies. Heterogeneity in study design, patient population, and comparator groups precluded evidence synthesis (Table 5). All studies reported either no events or only a few events through followup to delivery.
In a retrospective cohort of women with singleton gestation and RPTL, no neonatal deaths were reported among the SQ terbutaline pump group or oral nifedipine comparator group. 17 Similarly, no neonatal deaths occurred in an RCT of women with singleton or twin gestation who received SQ terbutaline pump or placebo. 11 However, in a retrospective cohort of women with twin gestation and RPTL from the Matria database, Lam et al. demonstrated a statistically significant difference between SQ terbutaline pump and oral tocolytics, with fewer neonatal deaths occurring in the SQ terbutaline pump group (odds ratio [OR] = 0.09, 95% confidence interval [CI]: 0.01, 0.70). 19 The overall strength of evidence in favor of SQ terbutaline pump for neonatal death was graded as low for twin gestation (subgroup e) and RPTL (subgroup i), based on the study by Lam et al. (Table 6). 19 Strength of evidence for all other populations was insufficient. Strength of evidence for neonatal death for a nonspecific preterm labor population described in an RCT is insufficient (Table 7).

Significant Intraventricular Hemorrhage (Grade III/IV)
Two studies reported data on significant intraventricular hemorrhage (grade III/IV) (Table 8). These studies were not synthesized because of heterogeneous study designs, patient populations, and comparators. Both studies were underpowered to detect a difference due to sparse event rates and small sample sizes.
Morrison et al. compared SQ terbutaline pump with no treatment in a prospective cohort of women with singleton gestation and RPTL. 13 This study reported a nonsignificant difference (OR = 0.30, 95% CI: 0.02, 5.85). 13 In the second study, which did not pertain to any of the populations of interest, women with singleton gestation were randomly allocated to receive either SQ terbutaline pump or placebo; significant intraventricular hemorrhage was not observed in either group. 10 We graded the overall strength of evidence as insufficient for women with RPTL (subgroup i) based on the prospective cohort study (Table 9). Strength of evidence for all other populations of interest was insufficient because no studies were available. Based on the RCT in a nonspecific preterm labor population, strength of evidence is insufficient (Table 10). 10

Necrotizing Enterocolitis
Necrotizing enterocolitis was reported in a single prospective cohort that compared the SQ terbutaline pump with no treatment in women with singleton gestation and RPTL. 13 One case was reported in the no treatment comparator group, which resulted in a nonsignificant effect estimate (OR=0.96, 95% CI: 0.04, 24.74). 13 Given that there was only one event among a sample size of 60 participants, this study was clearly underpowered to detect a difference in this outcome. We rated this study as high risk of bias because intervention and comparator groups were imbalanced in risk factors for preterm birth, primary tocolytic therapy, and level of care.

Retinopathy of Prematurity and Sepsis
Retinopathy of prematurity and sepsis were reported in a single RCT of women with singleton or twin gestation who presented to a university hospital. 11 This study did not address any of the subgroups of interest. One infant in the SQ terbutaline pump group developed retinopathy of prematurity, and one infant in the oral terbutaline group developed sepsis. 11 Although both results were statistically nonsignificant, this study was underpowered to detect differences in either outcome. This study was rated as high risk of bias because of selection bias, limitations in study power, and absence of blinded outcome assessment.

Stillbirth
Three retrospective cohort studies reported data on stillbirth (Table 11). [17][18][19] All three studies indicated statistically nonsignificant differences in the occurrence of stillbirth between the SQ terbutaline pump group and oral nifedipine or oral tocolytic comparator groups. Two of these studies were exclusively in women with singleton gestation 17,18 and one was restricted to twins. 19 All studies were underpowered to detect a difference in this outcome because of low event rates (total number of events ranged from one to seven). Furthermore, there is potential for participant overlap among the SQ terbutaline pump groups of the two singleton studies. These studies could not be meta-analyzed because sample populations and comparators were diverse and differences in confounders could not be excluded.

Key Question 2. Other Surrogate Outcomes
In women with arrested preterm labor, does treatment with a SQ infusion of terbutaline delivered by a pump, in comparison with placebo, conservative treatment or other interventions improve other surrogate outcomes, including mean gestational age at delivery, incidence of delivery at various gestational ages (<28 weeks, < 32 weeks, < 34 weeks, < 37 weeks), mean prolongation of pregnancy (days), birth weight, ratio of birth weight/gestational age at delivery, mean pregnancy prolongation index, need for assisted ventilation, need for oxygen per nasal cannula, and neonatal intensive care unit (NICU) admission, for the following subgroups:
a. Women <28 weeks, 0 days of gestation (extremely preterm)?
b. Women between 28 weeks, 0 days of gestation and 31 weeks, 6 days of gestation (very preterm)?
c. Women between 32 weeks, 0 days of gestation and 33 weeks, 6 days of gestation (preterm)?
d. Women between 34 weeks, 0 days of gestation and 36 weeks, 6 days of gestation (later preterm)?
e. Multiple gestation?
f. Racial or ethnic subgroups?
g. Women with previous preterm birth?
h. Women with history of preeclampsia?
i. Women with RPTL and women without RPTL?

Key Points
No data were available for the following outcomes: incidence of delivery < 28 weeks (strength of evidence is insufficient), need for oxygen per nasal cannula, or ratio of birth weight/gestational age at delivery.

Multiple Gestation
Two cohorts of medium risk of bias from the Matria database reported statistically significant differences in favor of the SQ terbutaline pump, compared with oral tocolytics. The risk of double-counting of participants could not be ruled out.

RPTL
Six cohorts of medium to high risk of bias, mostly from the Matria database, reported statistically significant differences in favor of SQ terbutaline pump compared with oral tocolytics or no treatment. Again, there is risk of double-counting of participants across these studies.

Overall Evidence
Two RCTs and two observational studies, which did not pertain to any population of interest, reported statistically nonsignificant differences between SQ terbutaline pump and placebo or oral tocolytics. This evidence contrasted with results from larger cohort studies, which demonstrated consistent benefit. The RCTs and other observational studies, which showed nonsignificant differences, may have been underpowered.

< 32 weeks
Strength of evidence favoring SQ terbutaline pump compared with oral tocolytics or no treatment for women with twin gestation or RPTL is low. This evidence came mostly from Matria-based studies.

< 34 weeks
Strength of evidence is insufficient for all populations of interest. One RCT of medium risk of bias, which did not pertain to any population of interest, reported a statistically nonsignificant difference compared with placebo, although type II error cannot be excluded.

< 37 weeks
Strength of evidence favoring SQ terbutaline pump compared with oral tocolytics or no treatment is insufficient or low for women with RPTL. This evidence came mostly from Matria-based studies.

Prolongation of Pregnancy
Strength of evidence favoring the SQ terbutaline pump compared with oral tocolytics or no treatment for women with twin gestation or RPTL was graded as insufficient or low for mean prolongation of pregnancy. This evidence came mostly from Matria-based studies.
Two retrospective cohorts of medium and high risk of bias, from the Matria database, reported pregnancy prolongation > 7 days in women with RPTL. One cohort reported a statistically significant difference in favor of the SQ terbutaline pump, compared with oral tocolytics. The other reported a nonsignificant difference, although type II error cannot be excluded.
Five retrospective cohorts of medium to high risk of bias, from the Matria database, reported pregnancy prolongation > 14 days in women with RPTL and single or twin gestation. All reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics. Overlap in the study sample between these studies cannot be ruled out.

Multiple Gestation
Two cohorts of medium risk of bias from the Matria database reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics. Overlap in the study sample between the two studies cannot be ruled out.

RPTL
Five of six cohorts, mostly from the Matria database and of medium to high risk of bias, reported statistically significant differences in favor of SQ terbutaline pump compared with oral tocolytics or no treatment. Overlap in the study sample between the Matria-based studies cannot be ruled out.

Overall evidence
Two RCTs, which did not pertain to any population of interest, reported statistically nonsignificant differences between SQ terbutaline pump and placebo. This result is indeterminate because of possible type II error. The RCT evidence contrasted with results from larger cohort studies, which demonstrated consistent benefit.

Multiple Gestation
Two cohorts of medium risk of bias from the Matria database reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics. Overlap in the study sample between the two studies cannot be ruled out.

RPTL
Six cohorts, mostly from the Matria database and of medium to high risk of bias, reported statistically significant differences in favor of SQ terbutaline pump compared with oral tocolytics or no treatment. Study sample may have overlapped among the Matria-based studies.

Multiple Gestation
Two cohorts of medium risk of bias from the Matria database reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics. Overlap in the study sample between the two studies cannot be ruled out.

RPTL
Three of four cohorts of medium to high risk of bias, from the Matria database, reported statistically significant differences in favor of SQ terbutaline pump compared with oral tocolytics. The study sample may have overlapped.

Pregnancy Prolongation Index
Two cohorts of medium and high risk of bias in women with RPTL reported statistically significant differences in favor of SQ terbutaline pump, compared with oral terbutaline or no treatment.

Need for Assisted Ventilation
One retrospective cohort of high risk of bias in women with singleton gestation and RPTL from the Matria database reported a nonsignificant difference in need for ventilator among infants with NICU admission.

Multiple Gestation
Two cohorts of medium risk of bias from the Matria database reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics. Overlap in the study sample between the two studies cannot be ruled out.

RPTL
Six cohorts of medium to high risk of bias, mostly from the Matria database, reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics or no treatment. Participant overlap among Matria-based studies cannot be ruled out.

Multiple Gestation
One retrospective cohort of medium risk of bias, in women with twin gestation from the Matria database, reported a statistically significant difference in favor of SQ terbutaline pump compared with oral tocolytics.

RPTL
Four retrospective cohorts, mostly from the Matria database and primarily of high risk of bias, reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics or no treatment. Participant overlap among Matria-based studies cannot be ruled out.

Detailed Analysis
Tables F5 to F9 in Appendix F present data extracted for Key Question 2. No data were available for incidence of delivery < 28 weeks (strength of evidence is insufficient), need for oxygen per nasal cannula, or ratio of birthweight/gestational age at delivery. Information was unavailable for women of specific gestational ages (subgroups a-d), racial or ethnic subgroups (subgroup f), women with previous preterm birth (subgroup g), or women with history of preeclampsia (subgroup h).
Results are presented below by outcome, along with strength of evidence grades for certain prespecified outcomes (i.e., incidence of delivery at various gestational ages and mean prolongation of pregnancy) and determinants of applicability. We graded the strength of evidence for the specific populations of interest, as indicated in the key question. We also graded evidence from two RCTs that pertained to a nonspecific population of women with preterm labor; one RCT was in women with singleton gestation 10 and the other RCT was in women with either single or twin gestation (effect estimates were not presented separately by gestation). 11 If a single study was available on an outcome, then all information has been summarized in the text. Otherwise, we have summarized information in tables for each outcome according to the populations of interest. When we had information that did not pertain to any of the specific populations, we also summarized the population-specific and nonspecific data (which we have termed overall evidence) in tables and/or forest plots. We have presented forest plots to display the entire body of evidence for outcomes that had data from several studies.
The results from Matria-based studies pertained to women who were inducted into a U.S. national database and who received specialized services from an outpatient perinatal program. There is risk of double-counting of participants across some of the Matria-based studies.

Mean Gestational Age at Delivery
Table F5 in Appendix F presents study-level data for mean gestational age at delivery. Eleven heterogeneous studies reported gestational age at delivery. Two were RCTs, 10,11 one was a nonrandomized trial, 12 two were prospective cohorts, 13,14 and the remaining were retrospective cohorts. [15][16][17][18][19]21 Comparator groups included placebo, 10,11 no treatment, 13 and various oral tocolytic agents. 12,[14][15][16][17][18][19]21 Data were available for women with twin gestation (subgroup e) 16,19 and women with RPTL (subgroup i). 13,[15][16][17][18][19] The other studies did not explicitly address any of the populations of interest. [10][11][12]14,21

Subgroup: Multiple Gestation
Two retrospective cohorts that used the Matria database were exclusively in women with twin gestation (Table 12). 16,19 Both studies demonstrated statistically significant differences in gestational age at delivery, with greater mean gestational age among women who received SQ terbutaline pump (difference in means = 0.70 weeks, 95% CI: 0.43 weeks, 0.97 weeks; and 0.70 weeks, 95% CI: 0.48 weeks, 0.92 weeks). 16,19 Both studies were rated as medium risk of bias because several criteria, such as similarity in baseline characteristics and prognostic factors, could not be assessed due to incomplete reporting.

Subgroup: RPTL
One prospective cohort 13 and five retrospective cohorts specified RPTL as an inclusion criterion (Table 13). [15][16][17][18][19] Two of these studies were also in women with twin gestation, as described above. 16,19 The remaining four studies were in women with singleton gestation.
The prospective cohort reported a statistically significant difference in gestational age at delivery between the SQ terbutaline pump and no treatment, in favor of the pump (difference in means = 3.40 weeks, 95% CI: 1.80 weeks, 5.00 weeks). 13 However, this study was rated as high risk of bias because the groups were imbalanced in preterm birth risk factors, primary tocolytic therapy, and level of care. This study included women with singleton gestation only, the majority of whom were of African American origin.
All retrospective cohorts used the Matria database and reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics (difference in means range = 0.70-0.90 weeks; 95% CI lower limits range: 0.28-0.48 weeks, upper limits range: 0.92-1.52 weeks). [15][16][17][18][19] Two studies were rated as high risk of bias because of group imbalances 15,18 and three were rated as medium risk of bias because the information presented in the reports was insufficient to assess several criteria, such as group comparability. 16,17,19

Table excerpt (columns: study design and number of studies; population; comparator(s); risk of bias; overall findings): Retrospective cohort (2); women with twin gestation and RPTL from the Matria database (n=656,)

Overall Evidence
Irrespective of patient populations and comparators, a total of 11 studies contributed evidence to the outcome of gestational age at delivery (Table 14). When compared with placebo, the RCT evidence for SQ terbutaline pump was indeterminate, given the small sample size (Figure 7). 10,11 Five larger observational cohort studies that used the Matria database and that were of medium to high risk of bias showed consistent benefit with the pump in comparison with other tocolytics. 15-19

Study Design (Number of Studies) Population Comparator(s) Risk of Bias Overall Findings
Prospective cohort (1)

Incidence of Delivery at Various Gestational Ages
Study-level data for incidence of delivery at various gestational ages are presented in Table F6, Appendix F.

Incidence of Delivery < 32 Weeks' Gestation
Six cohort studies reported incidence of delivery at < 32 weeks' gestation. Data were available for women with multiple gestation 16,19 and women with RPTL. 13,15-19

Subgroup: Multiple Gestation
Two retrospective cohorts included women with twin gestation only (Table 15). 16,19 Both studies reported a statistically significant difference in the incidence of delivery < 32 weeks, with fewer cases in the SQ terbutaline pump group compared with oral tocolytics (OR=0.47, 95% CI: 0.33, 0.68 and OR=0.52, 95% CI: 0.35, 0.76). 16,19 We rated both studies as medium risk of bias because incomplete reporting precluded assessment of several criteria, such as similarity in baseline characteristics and prognostic factors. Based on these two studies, we graded the strength of evidence favoring the SQ terbutaline pump compared with oral tocolytics as low, for the population of women with twin gestation (Table 20). This evidence pertained to women with twin gestation and RPTL from the Matria database.
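As a reading aid for the odds ratios reported above, the following minimal sketch checks whether each reported 95% CI is symmetric around its point estimate on the log scale, which is what a Wald-type interval would produce. The source does not state how the intervals were computed, so the Wald assumption and the function names are illustrative only.

```python
import math

def log_scale_midpoint(lower, upper):
    """Geometric mean of the CI bounds, i.e. the midpoint on the log scale."""
    return math.sqrt(lower * upper)

def se_log_or(lower, upper, z_crit=1.96):
    """Standard error of log(OR) implied by a 95% CI (assumes a Wald interval)."""
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)

# Odds ratios and CIs for delivery < 32 weeks in the two twin-gestation cohorts
reported = [
    (0.47, 0.33, 0.68),
    (0.52, 0.35, 0.76),
]

for odds_ratio, lower, upper in reported:
    midpoint = log_scale_midpoint(lower, upper)
    print(f"OR={odds_ratio:.2f}  log-scale midpoint={midpoint:.2f}  "
          f"SE(log OR)={se_log_or(lower, upper):.3f}")
```

Under this assumption, the log-scale midpoint of each interval rounds to the reported odds ratio, so the reported figures are internally consistent.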

Subgroup: RPTL
The entire body of evidence for this outcome pertained to women with RPTL (Table 16 and Figure 8). Four studies were in women with singleton gestation, 13,15,17,18 and two studies, which are described above, were in women with twin gestation. 16,19 These studies all found statistically significant differences in favor of the SQ terbutaline pump, compared with either no treatment or oral tocolytics. 13,15-19 Three studies were rated as high risk of bias due to group imbalances, 13,15,18 and three studies were rated as medium risk of bias due to incomplete reporting, which precluded an assessment of group comparability. 16,17,19
Strength of evidence for different comparators and patient populations (i.e., singletons and twins) favoring the SQ terbutaline pump is low for women with RPTL (Table 17). Aside from the comparison against no treatment in women with singletons, the evidence for all other comparators and populations pertained to women from the Matria database. The study with the no treatment comparison group included women who were mostly of African American origin. 13

Incidence of Delivery < 34 Weeks' Gestation
Strength of evidence was graded insufficient for all populations of interest. Only a single RCT, which compared the SQ terbutaline pump with placebo in women with singleton gestation, reported a nonsignificant difference in the incidence of delivery < 34 weeks (OR = 0.95, 95% CI: 0.32, 2.87). 10 Based on the small sample size, the possibility of type II error cannot be excluded. The strength of evidence for this nonspecific preterm labor population is insufficient (Table 18).

Study Design (Number of Studies) Population Comparator(s) Risk of Bias Overall Findings
Retrospective cohort (2) Women with twin gestation and RPTL from the Matria database (n=656,)

Incidence of Delivery < 37 Weeks' Gestation
Six studies reported incidence of delivery < 37 weeks' gestation. Population-specific data were available only for women with RPTL. 13,15,17,18,20 One RCT, which did not pertain to any specific population of interest, randomized women with singleton gestation to SQ terbutaline pump or placebo and reported a nonsignificant difference (OR=1.57, 95% CI: 0.49, 5.02) (Figure 9). 10

Four retrospective cohorts and one prospective cohort reported incidence of delivery < 37 weeks in women with RPTL (Table 19). 13,15,17,18,20 Aside from one study, all included women with singleton gestation only. 13,15,17,18 One study likely consisted of women with single and multiple gestation. 20 Four of the five studies reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics or no treatment (OR range = 0.04-0.72, 95% CI range: 0.01-0.58, 0.23-0.98). 13,15,18,20 For the population of women with RPTL, we graded the strength of evidence favoring SQ terbutaline pump compared with various comparators as insufficient or low (Table 20). The majority of this evidence was derived from the Matria database. 15,17,18 The study with the no treatment comparator group included women who were mostly of African American origin 13 and the study with oral terbutaline as a comparator group likely included women with single and multiple gestation, the majority of whom were classified as "nonwhite." 20 The strength of evidence for a nonspecific preterm labor population from the RCT is insufficient (

Prolongation of Pregnancy
Studies that reported prolongation of pregnancy are presented in Table F7, Appendix F. This outcome was reported either as a continuous variable (i.e., mean prolongation of pregnancy) or as a dichotomous variable (i.e., pregnancy prolongation > 7 days or > 14 days). In most studies, the prolongation of pregnancy interval was defined. 10,13,[15][16][17]19

Mean Prolongation of Pregnancy
Seven studies reported mean prolongation of pregnancy. Population-specific data were available for women with multiple gestation and women with RPTL. One observational study included women with twin gestation only 16 and five observational studies pertained to women with RPTL. 13,15-18 Two RCTs did not pertain to any populations of interest. 10,11

One retrospective cohort compared SQ terbutaline pump with oral nifedipine in women with twin gestation and RPTL from the Matria database. 16 Prolongation of pregnancy was measured from episode of RPTL to delivery. A statistically significant difference was observed in favor of the SQ terbutaline pump (difference in means in days: 7.20, 95% CI: 4.10, 10.30). We rated this study as medium risk of bias because there was insufficient information to assess several criteria, such as comparability of groups in baseline characteristics and prognostic factors. The strength of evidence in favor of the SQ terbutaline pump compared with oral nifedipine for this specific population is low, based on this single study (Table 24).

Subgroup: RPTL
Five studies pertained to women with RPTL, including the one study in twins described above (Table 22 and Figure 10). 13,15-18 Statistically significant differences were reported by all studies, compared with either oral tocolytics or no treatment (difference in means in days ranged from 5.50-25.30, 95% CI range: 0.79-16.77, 8.72-33.83). 13,15-18 Three studies were rated as high risk of bias because groups were imbalanced in baseline characteristics and/or prognostic factors. 13,15,18 Two studies were rated as medium risk of bias because information presented in the report was insufficient to assess several criteria, such as group comparability. 16,17
Strength of evidence favoring SQ terbutaline pump against various comparators for the populations of women with RPTL is insufficient or low (Table 24). The majority of evidence came from the Matria-based studies. 15-18 In the study with the no treatment comparison group, most women were of African American origin. 13 Strength of evidence for nonspecific preterm labor populations from RCTs is insufficient (Table 25).

Overall Evidence
Table 23 and Figure 10 present data for all studies that contributed to this outcome, regardless of study design and comparators. Evidence from RCTs showed indeterminate results (pooled mean difference=0.63, 95% CI: -9.6, 10.9) in contrast with evidence from observational studies of medium to high risk of bias, which showed consistent benefit. Plausible explanations include differences in study power (RCTs were underpowered compared with observational studies) and inherent risk of bias of observational study designs.
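The contrast in precision between the RCT and observational evidence can be made concrete by back-calculating the standard error implied by each reported interval. The sketch below assumes the 95% CIs are symmetric normal-approximation intervals (the source does not state the method); the function names are illustrative. It uses the pooled RCT estimate above and the twin-gestation cohort estimate reported earlier in this section (7.20 days, 95% CI: 4.10, 10.30).

```python
def se_from_ci(lower, upper, z_crit=1.96):
    """Standard error implied by a symmetric 95% CI (normal approximation assumed)."""
    return (upper - lower) / (2 * z_crit)

def z_statistic(estimate, lower, upper, z_crit=1.96):
    """Estimate divided by its implied standard error."""
    return estimate / se_from_ci(lower, upper, z_crit)

# Pooled RCT mean difference in days: 0.63 (95% CI: -9.6, 10.9)
rct_se = se_from_ci(-9.6, 10.9)
rct_z = z_statistic(0.63, -9.6, 10.9)

# Twin-gestation retrospective cohort: 7.20 days (95% CI: 4.10, 10.30)
cohort_se = se_from_ci(4.10, 10.30)
cohort_z = z_statistic(7.20, 4.10, 10.30)

print(f"RCT pooled: SE={rct_se:.2f} days, z={rct_z:.2f}")
print(f"Cohort:     SE={cohort_se:.2f} days, z={cohort_z:.2f}")
```

Under these assumptions the pooled RCT estimate has an implied standard error roughly three times that of the single cohort, which is consistent with the explanation that the RCTs were underpowered.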

Pregnancy Prolongation > 7 Days
Two observational studies reported pregnancy prolongation > 7 days (Table 26). 15,17 Both included women with singleton gestation and RPTL from the Matria database and had oral nifedipine as a comparator. Flick et al. found that significantly more participants in the SQ terbutaline pump group had pregnancy prolonged for more than 7 days compared with the group that received oral nifedipine (OR=7.84, 95% CI: 3.59, 17.12; total number of events/sample size=1281/1366). We rated this study as high risk of bias because of group imbalances. Fleming et al., however, reported a nonsignificant difference between SQ terbutaline pump and oral nifedipine (OR=2.53, 95% CI: 0.87, 7.38; total number of events/sample size = 267/284). 17

Pregnancy Prolongation > 14 Days
Five retrospective cohorts reported pregnancy prolongation > 14 days. All of these studies used the Matria database and were exclusively in women with RPTL.

We rated these studies as medium risk of bias because there was insufficient information to assess several criteria, such as group comparability.

Risk of Bias Overall Findings
Retrospective cohort (2) Women with singleton gestation and RPTL from the Matria database

Subgroup: RPTL
All studies were in women with RPTL, of either single or twin gestation (the studies in women with twin gestation have been described above) (Table 28 and Figure 11). Consistent, statistically significant differences in favor of SQ terbutaline pump compared with oral tocolytics were found across all studies (OR range=1.93-3.47, 95% CI range: 0.87-2.34, 2.65-5.15). 15-18 Two studies were rated as high risk of bias because of differences in baseline characteristics/prognostic factors among groups 15,18 and three were rated as medium risk of bias because there was insufficient information to assess several criteria, such as group comparability.

Birth Weight
Birth weight was reported as either a continuous or dichotomous variable (i.e., incidence of low birth weight and very low birth weight) in seven observational studies, one nonrandomized trial, and two RCTs. Observational studies reported birthweight for women with twin gestation 16,19 and RPTL. 13,[15][16][17][18][19][20] Four studies did not pertain to any specific population of interest. 10-12,21

Mean Birth Weight
Study-level data are presented in Table F8 in Appendix F.

Subgroup: Multiple Gestation
As shown in Table 29, two retrospective cohort studies that compared the SQ terbutaline pump with oral tocolytics in women with twin gestation and RPTL from the Matria database reported statistically significantly higher birth weights among infants in the SQ terbutaline pump group (mean differences in grams = 163, 95% CI: 102, 224 and 136, 95% CI: 83, 189). 16,19 Both studies were rated as medium risk of bias because several criteria, such as similarity in baseline characteristics and prognostic factors, could not be assessed due to incomplete reporting.

Subgroup: RPTL
Table 30 presents information on studies that reported birth weight in women with RPTL. Two of these studies were in women with twin gestation, as described above. 16,19 Aside from one study that reported a nonsignificant result among a study population that likely consisted of women with single and multiple gestation, 20 all demonstrated statistically significant differences in favor of SQ terbutaline pump, compared with either oral tocolytics or no treatment (range of mean difference in grams was 136-721, 95% CI range: 83-355, 189-1087). 13,16-19 Two studies in this body of evidence were rated as high risk of bias because of apparent differences in groups 13,18 and the remaining four studies were rated as medium risk of bias because missing information prevented adequate assessment of potential limitations. 16,17,19,20 The majority of this evidence came from the Matria database. 16-19 The study with the no treatment comparator group included women who were mostly of African American origin 13 and the study with oral terbutaline as a comparator group likely included women with single and multiple gestation, the majority of whom were classified as "nonwhite."

Figure 12 presents the entire body of evidence for mean difference in birth weight, regardless of subgroups. The study by Lindenbaum et al. reported discrepant results between table and text; therefore, the study estimate was deemed unreliable. 12 The pooled evidence from RCTs was inconclusive (difference in means = 121.75, 95% CI: -183.55, 427.05), but type II error cannot be ruled out (Figure 12). In contrast, evidence from larger observational studies of medium to high risk of bias showed statistically significantly higher birth weights for women receiving the SQ terbutaline pump compared with oral tocolytics.

Study Design (Number of Studies) Population Comparator(s) Risk of Bias Overall Findings
Prospective cohort (1)

Incidence of Low Birth Weight
Tables 32 and 33 and Figure 13 below present studies that reported low birth weight, which was defined as < 2,500 g. All studies were observational in design.

Subgroup: Multiple Gestation
Two studies were in women with twin gestation and RPTL from the Matria database (Table 32). Both reported statistically significant results in favor of SQ terbutaline pump compared with oral tocolytics (OR = 0.57, 95% CI: 0.44, 0.73; total number of events/number of infants = 975/1312 and OR = 0.64, 95% CI: 0.51, 0.80; total number of events/number of infants = 926/1393). 16,19 These studies were rated as medium risk of bias because there was insufficient information to assess potential limitations.

Study Design (Number of Studies) Population Comparator(s) Risk of Bias Overall Findings
Retrospective cohort (

Subgroup: RPTL
All studies pertained to women with RPTL (Table 33). Five of these studies were from the Matria database, so there may have been overlap in participant data. 15-19 Statistically significant differences were found across all studies of medium to high risk of bias in favor of the SQ terbutaline pump, compared with oral tocolytics or no treatment (OR range = 0.24-0.64, 95% CI range: 0.06-0.51, 0.62-0.96). The majority of women in the prospective cohort with the no treatment comparator group were of African American origin.

Incidence of Very Low Birth Weight
Tables 34 and 35 and Figure 14 below present studies that reported incidence of very low birthweight, which was defined as < 1,500 g.

Subgroup: Multiple Gestation
Two of these studies were in women with twin gestation and RPTL from the Matria database (Table 34). Both reported statistically significant results in favor of SQ terbutaline pump compared with oral tocolytics (OR = 0.40, 95% CI: 0.26, 0.60; total number of events/number of infants = 156/1312 and OR=0.46, 95% CI: 0.29, 0.73; total number of events/sample size = 88/1393) (Table 32). 16,19  These studies were rated as medium risk of bias because there was insufficient information to assess potential limitations.

Subgroup: RPTL
All studies pertained to women with RPTL (Table 35). 16-19 Aside from the study by Fleming et al., which showed a nonsignificant result, 17 statistically significant differences were found across all studies of medium to high risk of bias in favor of SQ terbutaline pump, compared with oral tocolytics (OR range = 0.22-0.46, 95% CI range: 0.07-0.29, 0.60-0.73). 16,18,19

All subsequent outcomes are presented at the study level in Table F9 in Appendix F.

Mean Pregnancy Prolongation Index
Two observational studies defined pregnancy prolongation index as the ratio of the number of days from RPTL to delivery divided by the number of days to 37 weeks' gestation (i.e., the desired prolongation) (Table 36). 13,20 Both studies were in women with RPTL and showed statistically significant differences in favor of the SQ terbutaline pump, compared with oral terbutaline or no treatment (mean difference = 0.41, 95% CI: 0.26, 0.56; and 0.14, 95% CI: 0.02-0.26). 13,20 One study was rated as high risk of bias because groups were clearly imbalanced in risk factors for preterm birth, primary tocolytic therapy, and level of care. 13 The other study was rated as medium risk of bias because several criteria were rated as unclear due to incomplete reporting. 20 One study pertained to women with singleton gestation, the majority of whom were of African American origin 13 and the other study likely included women with single and multiple gestation, the majority of whom were classified as "nonwhite."
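The index definition above can be written out as a short calculation. This is a minimal sketch: it assumes the denominator is the number of days from the RPTL episode to 37 completed weeks (259 days of gestation), and the example values are hypothetical rather than taken from either study.

```python
TERM_THRESHOLD_DAYS = 37 * 7  # 37 completed weeks of gestation, in days

def days_to_37_weeks(gestational_age_days):
    """Desired prolongation: days remaining from the RPTL episode to 37 weeks."""
    return TERM_THRESHOLD_DAYS - gestational_age_days

def prolongation_index(days_rptl_to_delivery, days_rptl_to_37_weeks):
    """Achieved prolongation divided by desired prolongation."""
    if days_rptl_to_37_weeks <= 0:
        raise ValueError("RPTL episode must occur before 37 weeks' gestation")
    return days_rptl_to_delivery / days_rptl_to_37_weeks

# Hypothetical example: RPTL at 30 weeks, delivery 35 days later
desired = days_to_37_weeks(30 * 7)
example_index = prolongation_index(35, desired)
print(f"desired={desired} days, index={example_index:.2f}")
```

An index of 1.0 would mean the pregnancy was prolonged exactly to 37 weeks, so the reported mean differences of 0.41 and 0.14 are differences on this proportion scale.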

Need for Assisted Ventilation
One retrospective cohort that compared the SQ terbutaline pump with oral tocolytics in women with singleton gestation and RPTL from the Matria database reported the requirement for a ventilator among infants with NICU admission. 18 This study reported a nonsignificant difference (OR = 0.91, 95% CI: 0.62, 1.33; total number of events/sample size = 141/558). We rated this study as high risk of bias because there were apparent differences among groups in baseline characteristics and prognostic factors.

Incidence of NICU Admission
Seven studies reported incidence of NICU admission. Six observational studies, mostly from the Matria database, reported data for women with twin gestation and RPTL (Tables 37 and 38). 13,15-19 In addition, one underpowered RCT of low risk of bias showed no difference between SQ terbutaline pump and placebo in women with singletons (Figure 15) (total number of events/sample size = 23/51). 10

Two studies were in women with twin gestation and RPTL from the Matria database (Table 37). Both reported statistically significant results in favor of SQ terbutaline pump compared with oral tocolytics (OR = 0.72, 95% CI: 0.58, 0.91; total number of events/number of infants = 655/1312 and OR = 0.51, 95% CI: 0.41, 0.63; total number of events/sample size = 650/1393) (Table 38). 16,19 These studies were rated as medium risk of bias because there was insufficient information to assess potential limitations.

Subgroup: RPTL
All observational studies, including the two studies in women with twin gestation described above, pertained to women with RPTL (Table 38). 13,[15][16][17][18][19] These studies were of medium to high risk of bias and five studies used the Matria database. 15-19

Overall, a consistent and statistically significant benefit associated with the pump was noted across the observational studies (OR range: 0.28-0.72, 95% CI range: 0.08-0.58, 0.63-0.97).

NICU Mean Length of Stay
Five studies reported data on NICU mean length of stay. Four observational studies, primarily from the Matria database, were available in women with RPTL; 13,15,18,19 one of these studies was additionally in women with twin gestation (Table 39 and Figure 16). These studies were mostly of high risk of bias. All reported statistically significant differences in favor of SQ terbutaline pump, compared with oral tocolytics or no treatment (range of mean difference in days: -3.50 to -17.90, 95% CI range: -5.26 to -32.88, -1.74 to -3.54). One RCT, which did not specifically pertain to any of the populations of interest, was conducted in women with single and twin gestation. 11 This study reported statistically nonsignificant differences for the SQ terbutaline pump compared with placebo or oral terbutaline. 11 A plausible explanation for the discrepant results between the RCT and observational evidence is inadequate study power.

Key Question 3. Maternal Harms
In women with arrested preterm labor, does treatment with an SQ infusion of terbutaline delivered by a pump, in comparison with placebo, conservative treatment, or other interventions, increase the maternal harms of arrhythmia, heart failure, hyperglycemia, hypokalemia, maternal mortality, myocardial infarction, pulmonary edema, or refractory hypotension, or result in an increased rate of maternal discontinuation of therapy or maternal withdrawal due to adverse effects (Withdrawal-AE)?

Key Points
• Strength of evidence is insufficient for Withdrawal-AE.
• Tachycardia/nervousness was significantly higher among women who received the SQ terbutaline pump in comparison with no treatment in a prospective cohort of women with singleton gestation and RPTL, although the point estimate was unreliable.
• Underpowered studies demonstrated indeterminate results for the outcomes of mortality, pulmonary edema, and therapy discontinuation (i.e., type II error cannot be excluded).
• Two studies demonstrated nonsignificant differences between the SQ terbutaline pump and oral terbutaline in the incidence of gestational diabetes, though type II error cannot be excluded.
• No data were available for the following outcomes: heart failure, hypokalemia, myocardial infarction, or refractory hypotension.
• FDA postmarketing surveillance has identified at least three maternal deaths and three cases of cardiovascular adverse events associated with the use of SQ terbutaline delivered by pump.

Table F10 in Appendix F presents data for Key Question 3. None of the included studies reported data on heart failure, hypokalemia, myocardial infarction, refractory hypotension, or Withdrawal-AE (strength of evidence is insufficient). The evidence and determinants of applicability are presented below by outcome. Summary tables are presented if more than one study was available for an outcome; otherwise, all information has been summarized in the text.

Arrhythmia
In a prospective cohort study of women with singleton gestation and RPTL, Morrison et al. reported three cases of tachycardia/nervousness in women receiving the SQ terbutaline pump compared with no cases in the control group (OR=25.48, 95% CI: 1.23, 526.64). 13
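An odds ratio cannot be computed directly when one cell of the 2×2 table is zero, as happens when no events occur in the control group. The sketch below shows one common workaround, adding 0.5 to every cell (the Haldane-Anscombe correction) and using a Wald interval on the log scale. The source does not state which method was used, and the group sizes in the example are hypothetical, not those of the study.

```python
import math

def odds_ratio_with_correction(events_a, n_a, events_b, n_b,
                               correction=0.5, z_crit=1.96):
    """Odds ratio and Wald 95% CI; adds `correction` to all cells if any cell is zero."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z_crit * se_log)
    upper = math.exp(math.log(odds_ratio) + z_crit * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts: 3 events among 20 treated women, 0 among 20 controls
or_est, ci_lower, ci_upper = odds_ratio_with_correction(3, 20, 0, 20)
print(f"OR={or_est:.2f}, 95% CI: {ci_lower:.2f}, {ci_upper:.2f}")
```

With so few events, the corrected estimate depends heavily on the correction itself and the interval spans several orders of magnitude, which illustrates why a point estimate of this kind is considered unreliable.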

We rated this study as high risk of bias because groups were imbalanced in risk factors for preterm birth, primary tocolytic therapy, and level of care. This evidence pertained to women with singleton gestation and RPTL, the majority of whom were of African American origin. However, given that the outcome was not restricted to arrhythmia specifically (i.e., nervousness was included), applicability is limited.

Hyperglycemia
Two studies reported data on gestational diabetes, diagnosed by 3-hour glucose tolerance test (GTT) (Table 40). Studies were not pooled because of heterogeneity in study designs and patient populations. Type II error cannot be excluded for the available evidence because these studies may be underpowered.
In a retrospective cohort, Regenstein et al. found a higher percentage of gestational diabetes among women in the SQ terbutaline pump group compared with the oral terbutaline group, but the difference was statistically nonsignificant (OR=1.94, 95% CI: 0.49, 7.65; total number of events/sample size=10/65). 21 This study included women with single or multiple gestations, the majority of whom were Caucasian.
Lindenbaum et al. conducted a nonrandomized trial in women with singleton gestation from a university hospital and found a lower percentage of gestational diabetes among women in the SQ terbutaline pump group compared with the oral terbutaline group. This result was also statistically nonsignificant (OR=0.46, 95% CI: 0.09, 2.40; total number of events/sample size=8/91).

Mortality
Two retrospective cohort studies investigated maternal mortality (Table 41). Lam et al. (2003) studied women with singleton gestation and RPTL 18 and Lam et al. (2001) studied women with twin gestation and RPTL. 19 Both studies used oral tocolytics as comparators. No maternal deaths were reported in either study.
Lam et al. (2003) and Lam et al. (2001) also reported data on pulmonary edema (Table 42). Both studies were likely underpowered to detect a difference due to low event rates.

Study Design (Number of Studies) Population Comparator(s) Risk of Bias Overall Findings
Retrospective cohort (2) Women with singleton gestation and RPTL from the Matria database

Therapy Discontinuation
One prospective cohort and one RCT investigated maternal discontinuation of therapy (Table 43). 10,13 Morrison et al. reported no discontinuation of therapy in a prospective cohort of women with singleton gestation and RPTL. 13 Guinn et al. reported a higher percentage of treatment discontinuation among women with singleton gestation randomized to the SQ terbutaline pump group compared with placebo, but this difference was statistically nonsignificant (OR=1.79, 95% CI: 0.58, 5.52; total number of events/sample size=20/52). 10 The RCT was likely underpowered to detect a difference for this outcome.

Table 43. Summary table for maternal discontinuation of therapy

Study Design (Number of Studies) | Population | Comparator(s) | Risk of Bias | Overall Findings
Prospective cohort (1) | Women with singleton gestation and RPTL (n=60) | No treatment 13 | High | No women discontinued treatment in the prospective cohort. In the RCT, discontinuation was higher in the SQ terbutaline pump group compared with placebo, but the difference was statistically nonsignificant (OR=1.79, 95% CI: 0.58, 5.52). Type II error cannot be excluded.
RCT (1) | Women with singleton gestation from Birmingham Hospital (n=52) | Placebo 10 | Low |
CI = confidence interval; OR = odds ratio; RCT = randomized controlled trial; RPTL = recurrent preterm labor; SQ = subcutaneous

Until 2009, 16 maternal deaths and 12 cases of maternal cardiovascular events (hypertension, myocardial infarction, tachycardia, arrhythmias, and pulmonary edema) in association with terbutaline tocolysis were reported to the FDA. Of these, at least three maternal deaths and three cardiovascular adverse events were clearly reported to be in association with the use of the SQ terbutaline pump. 24

Key Question 4. Neonatal Harms
In women with arrested preterm labor, does treatment with an SQ infusion of terbutaline delivered by a pump, in comparison with placebo, conservative treatment, or other interventions increase the neonatal harms of hypoglycemia, hypocalcemia, and ileus?

Key Points
• One case of hypoglycemia was reported in the placebo group of an underpowered RCT.
• No data were available for the outcomes of hypocalcemia and ileus.

Table F11 in Appendix F presents data for Key Question 4. No information was available for hypocalcemia or ileus. Hypoglycemia was reported by Wenstrom et al. in an RCT of women with single or twin gestation recruited from a university hospital in the United States. 11 This study compared the SQ terbutaline pump with placebo and oral terbutaline. One case of hypoglycemia was observed in the placebo group. No cases were reported among women who received the SQ terbutaline pump or oral terbutaline (OR=0.25, 95% CI: 0.01, 6.53 for SQ terbutaline pump versus placebo). The occurrence of a single hypoglycemic event among a sample size of 42 participants indicates that this study was underpowered to detect a difference in this outcome. Furthermore, we rated this study as high risk of bias because of selection bias, limitations in study power, and absence of blinding.

Key Question 5. Level of Activity and Level of Care
Can the differences in the outcomes above be partially explained by differences in level of care (e.g., frequency of followup, nurse visits, concomitant treatment, etc.) and level of activity (e.g., other children in the home, marital/support status, working status, bed rest, etc.) between the terbutaline pump group and the comparator group?

Key Points
• Few studies reported the level of maternal activity and the level of maternal care as study-level covariates, precluding meta-regression on outcomes.
• Qualitative analysis revealed no apparent trends between level of activity or level of care and the outcomes specified in Key Questions 1-4.

Detailed Analysis
Ratings for level of maternal activity and level of maternal care are provided in Appendix F, Tables F13 and F15. These tables provide overall ratings for each study, as well as ratings for the individual variables that comprise the level of activity and level of care variables.

Level of Maternal Activity
Level of maternal activity could not be rated for most studies due to insufficient information (i.e., incomplete reporting of marital status, working status, caring for other children, social support, bed rest, and restriction of maternal activities). 10,12,[14][15][16][17][18][19]21 The participants in other studies were rated as having a low level of maternal activity, primarily because they were advised to remain at bed rest. 11,13,20,22,23

Level of activity in these studies did not vary by treatment group, so these ratings represented study-level covariates.

Level of Maternal Care
Level of maternal care could not be rated for three studies due to insufficient information (i.e., incomplete reporting of nursing assessments, home uterine activity monitoring, home visits, education about preterm labor, telephone support, restriction of maternal activities, and other cointerventions). 11,12,21 Table 44 below summarizes ratings for all other studies. In two studies, level of care was found to vary among the SQ terbutaline pump and comparator groups and, therefore, in these cases it was not a study-level covariate. 13,20

Evidence Synthesis
A minimum of 12 studies were needed to explore the effect of level of activity or level of care on the outcomes specified in Key Questions 1-4 through meta-regression (number of studies needed is equal to 6×(n-1) for n levels for a categorical covariate). Studies were to be considered for meta-regression only if all of the following criteria were satisfied: there was no within-study confounding by level of activity or level of care, studies were not rated as unclear, and studies were similar enough with respect to patient population, intervention, and comparator.
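The study-count rule stated above can be expressed as a short check. This is a sketch under one stated assumption: that the covariate has three levels (for example low, moderate, and high), which is the number of levels that yields the stated minimum of 12. The function names are illustrative.

```python
def min_studies_for_meta_regression(n_levels):
    """Minimum number of studies for a categorical covariate: 6 x (n - 1)."""
    if n_levels < 2:
        raise ValueError("a categorical covariate needs at least two levels")
    return 6 * (n_levels - 1)

def meta_regression_feasible(n_eligible_studies, n_levels):
    """True when enough eligible studies are available under the rule."""
    return n_eligible_studies >= min_studies_for_meta_regression(n_levels)

# Assumed three rating levels (low, moderate, high)
required = min_studies_for_meta_regression(3)

# Eligible study counts reported in this section
activity_feasible = meta_regression_feasible(5, 3)  # five ratable studies
care_feasible = meta_regression_feasible(11 - 2, 3)  # 11 ratable, 2 confounded

print(required, activity_feasible, care_feasible)
```

Under this assumption, neither covariate reaches the threshold, which matches the decision to examine the studies qualitatively instead.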
A meta-regression could not be conducted for level of activity because sufficient information was available to rate only five studies. Furthermore, it was impossible to assess the impact of level of activity on effect estimates even in a qualitative manner because all five studies were rated as "low." Similarly, a meta-regression could not be conducted for level of care because only 11 studies were ratable and, of these, 2 were confounded by level of care. Of the remaining nine studies, three were rated as moderate 10,14,19,22 and six were rated as high. 15-18,23 These studies were qualitatively examined to explore trends in effect estimates by level of care. No trends were apparent between level of care and any of the outcomes in Key Questions 1-4.

Key Question 6. Incidence of Pump Failure
What is the incidence of failure of the pump device used for terbutaline infusion, including missed doses, dislodgment, and overdose?

Key Points
• Based on evidence from a case series, the incidence of dislodgement and pump malfunction were 2 percent (exact central CI, 0.5%, 10%).
• An underpowered RCT demonstrated indeterminate results for the outcomes of local pain and local skin irritation.
• No data were available for the outcomes of missed doses or overdose.

Table F16 in Appendix F presents data for Key Question 6. None of the studies reported data on the incidence of missed doses or overdose. At least one study reported data on dislodgment and other pump-related outcomes, including infusion site infection, local pain, local skin irritation, and pump malfunction/mechanical failures and complications. The following pump manufacturers and models were reported in the studies: Adkins et al.

This study also reported one case of local skin irritation in the SQ terbutaline pump group. Results for both local pain and local skin irritation were statistically nonsignificant when compared with either placebo or oral terbutaline (local pain: OR=0.77, 95% CI: 0.09, 6.45 for placebo and OR=5.74, 95% CI: 0.25, 130.38 for oral terbutaline; local skin irritation: OR=2.59, 95% CI: 0.10, 69.34 for placebo and OR=3.21, 95% CI: 0.12, 85.21 for oral terbutaline). However, given the sparse event rates, this RCT was likely underpowered to detect differences in either outcome.

13-20,23 In other studies, RPTL was not mentioned as an inclusion criterion, so it is unclear whether these populations consisted of women with single or multiple preterm labor episodes. 10-12,21,22 The majority of evidence pertained to women with RPTL and singleton gestation. A couple of studies included women exclusively with RPTL and twin gestations; these participants represent a particularly high-risk, specialized group of patients. 16,19 Very little is known about the study populations' demographic and clinical characteristics. Furthermore, the possibility that participants represent a select group of individuals cannot be entirely ruled out for a large proportion of the evidence base due to poor reporting of exclusion rates and sampling methodology.

Overall Applicability for Body of Evidence
Nine of 14 studies (64 percent) included women judged to be in labor on account of persistent contractions and cervical change. The definition of labor was unclear in other studies. Among the evidence that suggested that the pump was efficacious, 50 percent reported cervical change and contractions as part of the definition of labor while 50 percent did not report how labor was defined.

Demographic characteristics
Several studies (n=9) took place at single centers in the United States with limited demographic information. [10][11][12][13][14][20][21][22][23] Although age was reported in most of the single-center studies and race in some, there was little information on measures of socioeconomic status. Other studies included patients from a U.S.-based national database run by Matria Healthcare (now called Alere). These studies reported information on age and marital status but, as with the single-center studies, complete demographic information was lacking.

Exclusion rate
The overall impact on applicability due to participant exclusion is unknown for much of the evidence because many studies did not report an exclusion rate. In one RCT, more than 90 percent of the eligible population declined to participate. 11

Run-in period (attrition before randomization)
In the two RCTs included in the review, no issues pertaining to run-in period were identified.

Table 45. Applicability assessment by the PICOTS domains (continued)

Intervention

Dose and duration
The dose and duration of the SQ terbutaline pump were generally typical of those used in clinical practice. However, some studies failed to provide adequate information regarding bolus and basal doses to allow assessment.

Level of care and training on pump administration
The level of care and training provided on pump administration were also deemed to be typical in most studies but, again, this information was not reported in some instances. In several studies, patients received specialized outpatient support, which may not be typical of practice.

Cointerventions
Cointerventions with the potential to affect outcomes were considered to be bed rest, restriction of maternal activities, and administration of betamethasone. Corticosteroid use was reported in only one study, and details about bed rest and restriction of maternal activities were rarely reported.

Overall conclusions
No major issues were identified with respect to the intervention, although there were gaps in reporting. Very few details were reported on cointerventions that could modify the effectiveness of therapy.

Comparison

Dose/schedule and whether comparator is best available alternative
Several types of comparison groups were used in the studies. No issues were identified in the studies with an active treatment comparison group that would limit applicability.

Overall conclusions
No serious limitations to applicability due to comparators were identified.

Outcomes

Clinical benefits (versus surrogate)
At least one clinical outcome was reported in most studies (i.e., neonatal outcomes of NEC, IVH, retinopathy of prematurity, sepsis, stillbirth, death; neonatal harm of hypoglycemia; and maternal harms of pulmonary edema, arrhythmia, hyperglycemia, death, and discontinuation of therapy). A few studies only reported surrogate outcomes (i.e., gestational age at delivery, birth weight, prolongation of pregnancy, or NICU admission). None of the studies reported any long-term outcomes such as childhood development, neurobehavioral testing, lung function, or vision.

Individual harms and how defined
At least one neonatal or maternal harm outcome was reported in several studies. Very few studies reported outcomes related to the pump.

Overall conclusions
Surrogate outcomes are the most commonly reported in this literature. Data on clinical outcomes and neonatal/maternal harms, including pump-related outcomes, are sparse. Several important clinical outcomes have not been reported. Assessment of long-term outcomes is also absent.

Timing of Followup

Timing of followup
In all studies, outcomes were assessed up to the point of delivery.

Overall conclusions
The absence of followup beyond delivery is a major limitation because important long-term outcomes were not evaluated.

Setting

All studies took place in the United States. Most studies were conducted at single study centers, and the remaining studies used a national database of women who were referred to an outpatient perinatal program. Most studies that took place at single study centers were at teaching hospitals, although one study took place at a private urban obstetrics and gynecology group practice.

Clinical setting (level of care and population)
Women recruited into the national database received services from a specialized perinatal program that consisted of 24-hour nursing and pharmacy support, home uterine activity monitoring, individualized education, and provision of tocolytic therapy, including the SQ terbutaline pump. All women in these studies had RPTL and were either exclusively of singleton or twin gestation. Details of the clinical setting in the single-center studies were reported inconsistently. Some studies reported the provision of patient education, telephone support, home visits, and/or home uterine activity monitoring.

Overall conclusions
All studies were from the United States, and participants were recruited either from a national database (Matria) or from single center sites. Women from the Matria database generally received a high level of care from an outpatient perinatal program. However, the distribution of regions from which patients were recruited into the national database is unknown and information about the standards followed by the individual practice sites that provided obstetrical care was not reported. Similarly, for those studies that took place at single center sites, the standards of care followed at these sites are unclear.

IVH = intraventricular hemorrhage; NEC = necrotizing enterocolitis; NICU = neonatal intensive care unit; PICOTS = population, intervention, comparison, outcome, timing, setting; RCT = randomized controlled trial; RPTL = recurrent preterm labor; SQ = subcutaneous

Discussion
The rate of preterm birth in North America is high, at 12.3 percent. 1 These births will contribute to the health care burden, including increased short- and long-term neonatal morbidity. An effective and safe intervention to delay or prevent preterm birth would be a welcome addition to the maternity care provider's armamentarium.
To clarify the evidence on the efficacy and safety of subcutaneous terbutaline (SQ terbutaline) infusion by pump for the prevention of preterm birth, the Agency for Healthcare Research and Quality requested an evidence report answering six distinct questions. We applied rigorous selection criteria and assessed risk of bias of each study. This evidence report outlines a comprehensive review of all the available research.
In this final chapter, we first review the limitations of included studies, then the major findings pertaining to each key question and the strength of the evidence for the prespecified outcomes of incidence of delivery at various gestational ages; mean prolongation of pregnancy; bronchopulmonary dysplasia; significant intraventricular hemorrhage (grade III/IV); neonatal death and/or death within initial hospitalization; and maternal withdrawal due to adverse effects (Withdrawal-AE). We graded the strength of evidence based on the domains of overall risk of bias, consistency, directness, and precision. We then present our conclusions, make recommendations for future research, and offer clinical and public health perspectives.

Limitations of Included Studies
Studies contributing evidence were either absent or sparse for most outcomes. Although evidence from randomized controlled trials (RCTs) pertained to women with preterm labor, the specific populations of investigational interest were not distinguished. Furthermore, the two trials were clearly underpowered for outcomes of benefit and harms. Evidence pertaining to specific populations of women with preterm labor originated in observational studies of medium to high risk of bias. Across several studies, we could not rule out that participants were double-counted because a common database was used. Baseline clinical and socioeconomic characteristics with important prognostic implications were not reported across all studies. For example, no studies presented data on concomitant medications, body mass index, history of preeclampsia, cervical position, cervical consistency, cervical station, Bishop score, or fetal fibronectin. Cointerventions, such as administration of corticosteroids, were rarely described. None of the included studies assessed long-term childhood outcomes, such as childhood development, neurobehavioral testing, long-term lung function, and long-term vision.
In completing this review, we undertook an extensive grey literature search. Further, we requested relevant scientific information from the industry, Matria (now called Alere) Healthcare, and had many experts in the field participate in the review process. Despite this thorough process, the number of identified studies was very small; we had too few studies per outcome to perform statistical assessment of publication bias. We believe that all relevant data regarding the use of subcutaneous terbutaline for the prevention of preterm labor are captured in this review. Any exaggerated positive findings are more likely due to the medium to high risk of bias detected in observational studies than to publication bias.

Key Question 1. Neonatal Health Outcomes
Information regarding neonatal health outcomes is derived from a few underpowered studies that examine the effect of SQ terbutaline infusion for the prevention of preterm birth on the key predictors of long-term health sequelae for offspring. The outcomes assessed in this review include those neonatal conditions that are generally accepted to be associated with mortality or impaired function later in life. Studies were either absent or underpowered for outcomes such as bronchopulmonary dysplasia, intraventricular hemorrhage, necrotizing enterocolitis, retinopathy of prematurity, sepsis, stillbirth, periventricular leukomalacia, and seizures, thereby limiting the utility of the data. Strength of evidence is insufficient for bronchopulmonary dysplasia and significant intraventricular hemorrhage.
Neonatal death was assessed using two different outcomes: classic neonatal death (i.e., death within the first 28 days of life) and death within initial hospitalization. Death within initial hospitalization is important given that the topic of interest is preterm birth; the risk of morbidity and mortality in this population may extend beyond the first 28 days of life. However, no studies examined this variable, and the strength of evidence for this outcome is therefore insufficient. For neonatal death, strength of evidence favoring the SQ terbutaline pump over maintenance oral tocolytic therapy (92.3% received oral terbutaline) is low for women with recurrent preterm labor (RPTL) and twin gestation based on a single study from the Matria database (odds ratio [OR] = 0.09, 95% confidence interval [CI]: 0.01, 0.70). While this result is striking given the insufficient findings on the other neonatal health outcomes reported above, it stems from the largest of the studies contributing data on neonatal health outcomes, with over 700 patients. As such, it is the only outcome that appears to be adequately powered to reach statistical significance. For other populations of pregnant women with arrested preterm labor, the evidence was graded insufficient.

Key Question 2. Other Surrogate Outcomes
Surrogate outcomes are commonly used in maintenance tocolytic trials to assess efficacy. For many of these outcomes we could not assess data for important populations (as it was not reported), including delivery <28 weeks, specific gestational ages, racial or ethnic subgroups, women with previous preterm birth, or women with a history of preeclampsia.
A common outcome used in tocolytic trials of maintenance therapy is the incidence of delivery at various gestational ages. We chose to group gestational age at delivery according to commonly accepted categories of <28 weeks, <32 weeks, <34 weeks and <37 weeks, which correlate with improvements in clinical outcomes. Under this Key Question, we graded the strength of evidence for incidence of delivery at each gestational age cut-point and mean prolongation of pregnancy for prespecified populations. The strength of evidence for incidence of delivery at <28 and <34 weeks is insufficient. However, for the other outcomes strength of evidence favoring pump over oral tocolytics or no treatment is generally low for women with twin gestation and/or RPTL.
Mean birth weight significantly increased with the SQ terbutaline pump compared with oral tocolytics or no treatment in women with twin gestation and/or RPTL. This evidence largely originated from observational studies that used the same Matria database, so participants may have been double-counted across these studies. Two RCTs, which did not pertain to any specific population of interest, reported statistically nonsignificant differences between SQ terbutaline pump and placebo. However, this result is inconclusive because of the possibility of type II error. The RCT evidence contrasted with results from larger cohort studies, which demonstrated consistent benefit.
The final group of surrogate outcomes that were assessed involved need for specialized neonatal care. For women with twin gestation and/or RPTL, evidence from observational studies, mostly from the Matria database, showed lower incidence of neonatal intensive care unit (NICU) admission and shorter duration of stay for infants whose mothers used SQ terbutaline pump (incidence of NICU admission: OR range 0.28-0.72, 95% CI range: 0.08-0.58, 0.63-0.97 and NICU mean length of stay: range of mean difference in days: ). One retrospective cohort in women with RPTL showed a nonsignificant decrease in the need for assisted ventilation among infants of the SQ terbutaline pump group. 18 No data were available on need for oxygen per nasal cannula.

Key Question 3. Maternal Harms
The data available on incidence of maternal harms are sparse. One small prospective cohort showed a significant increase in tachycardia/nervousness among women using SQ terbutaline pump (OR = 25.48, 95% CI: 1.23, 526.64). 13 In one RCT of terbutaline infusion versus placebo, 45.8 percent of patients discontinued the terbutaline infusion compared to 32 percent of patients who discontinued placebo treatment. 10 Results for the outcomes of maternal mortality, pulmonary edema, maternal hyperglycemia, and therapy discontinuation were inconclusive because studies were not adequately powered to detect these rare findings. No data were available for several recognized adverse outcomes, including hypokalemia, refractory hypotension, heart failure, myocardial infarction, and study withdrawal due to adverse effects (strength of evidence is insufficient).
The available data do not indicate the reasons for discontinuation of therapy (e.g., inconvenience versus nuisance side effects versus major complications).
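The odds ratios and confidence intervals quoted for maternal harms can be computed from a 2x2 table of events and non-events. The following is a minimal sketch, assuming the log (Woolf) method with a 0.5 continuity correction when any cell is zero; the report does not state which method was used. The counts (11 of 24 versus 8 of 25) are hypothetical values chosen only to be consistent with the reported 45.8 and 32 percent discontinuation rates, and the function name is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and Woolf (log-based) confidence interval for a 2x2 table.

    a = events in the treated group,    b = non-events in the treated group
    c = events in the comparison group, d = non-events in the comparison group
    """
    # Assumption: add 0.5 to every cell when any cell is zero, so that the
    # odds ratio and its standard error remain finite.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts (not taken from the trial): 11 of 24 women discontinued
# the terbutaline infusion and 8 of 25 discontinued placebo.
or_, lo, hi = odds_ratio_ci(11, 13, 8, 17)
print(f"OR = {or_:.2f}, 95% CI: {lo:.2f}, {hi:.2f}")
```

With groups this small, the interval is wide and spans 1, which is the pattern the text describes as inconclusive.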
The Food and Drug Administration (FDA) has issued new warnings against the use of terbutaline in pregnant women for prevention or prolonged treatment (beyond 48 to 72 hours) of preterm labor. 24 Based on postmarketing reports of maternal deaths and serious cardiovascular adverse events associated with the obstetrical use of terbutaline, the FDA is requiring that a Boxed Warning and Contraindication be placed on injectable and oral terbutaline drug labels. Between 1976 and 2009, 16 maternal deaths were reported; at least three of these cases were clearly reported to be in association with the administration of the SQ terbutaline pump. Between 1998 and 2009, 12 maternal cases of serious cardiovascular events were reported, including arrhythmias, myocardial infarction, pulmonary edema, hypertension, and tachycardia; at least three of these cases were clearly reported to be in association with the administration of the SQ terbutaline pump. 24 Although it merits transparent disclosure in the form of a warning, evidence emerging from case reports is noncomparative and is usually regarded as a hypothesis-generating signal rather than a hypothesis-testing confirmation. 25 Furthermore, case reports are useful in identifying rare and unexpected adverse events; the rarer the adverse event, the stronger the effect size, and the magnitude of effect size is an important criterion that increases our confidence in an estimate. 9 However, adverse events such as death, hypertension, tachycardia, arrhythmias, and pulmonary edema that were reported with the use of terbutaline are not so unexpected in any adult population; pregnant women may experience these adverse events in the absence of terbutaline therapy due to other reasons.

Key Question 4. Neonatal Harms
Neonatal harms data were also very sparse. In one small RCT, only one case of hypoglycemia was identified in an infant whose mother received placebo infusion. 11 Given such a small event rate, the utility of this information is limited by insufficient power. No data were available for the incidence of neonatal hypocalcemia or ileus.

Key Question 5. Level of Activity and Level of Care
Differences in maternal activity and level of care could potentially explain differences in outcomes. Level of activity was rated as low, normal, or high based on a composite assessment of the following variables: marital status, working status, caring for other children in the home, available social support, bed rest, and restriction of maternal activities. Level of care was rated as low, moderate, or high based on the following variables: nursing assessments, home uterine activity monitoring, home visits, education about preterm labor, telephone support, restriction of maternal activities, and other cointerventions. Unfortunately, few studies reported these as study-level covariates, which precluded statistical assessment of heterogeneity by meta-regression. Furthermore, a qualitative assessment of heterogeneity revealed no apparent trends.

Key Question 6. Incidence of Pump Failure
SQ terbutaline is administered by a mechanical pump, and, therefore, it is important to consider possible technology-related issues. Although the Key Question only specified missed doses, dislodgment, and overdose, we investigated a wider range of pump-related problems, including pump malfunction and local pain or skin irritation. No study reported on outcomes of missed doses or overdose. One case series reported a 2 percent incidence of dislodgement of the SQ catheter. 22 The same series reported a 2 percent incidence of unspecified pump malfunction. One small RCT reported the side effects of local pain and skin irritation, which were present in less than 20 percent of patients and not statistically different in patients receiving terbutaline infusion compared to a placebo pump. 11 No infusion site infections were reported in another case series. 23
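An incidence estimated from a single small case series carries wide uncertainty, which an exact central (Clopper-Pearson) confidence interval makes explicit. The sketch below is one way to compute such an interval using only the Python standard library. The counts (1 event among 50 patients, which gives 2 percent) are hypothetical and are not taken from the case series, whose sample size is not stated here; the function names are illustrative.

```python
from math import comb


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def _solve_increasing(func, target, iters=200):
    """Bisection on [0, 1] for a function that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_central_ci(events, n, level=0.95):
    """Clopper-Pearson interval: each tail has probability (1 - level) / 2."""
    alpha = 1 - level
    # Lower limit: the p at which P(X >= events) equals alpha / 2.
    lower = 0.0 if events == 0 else _solve_increasing(
        lambda p: binom_tail_ge(events, n, p), alpha / 2)
    # Upper limit: the p at which P(X <= events) equals alpha / 2,
    # i.e. P(X >= events + 1) equals 1 - alpha / 2.
    upper = 1.0 if events == n else _solve_increasing(
        lambda p: binom_tail_ge(events + 1, n, p), 1 - alpha / 2)
    return lower, upper


# Hypothetical counts (assumption): 1 dislodgement among 50 patients.
low, high = exact_central_ci(1, 50)
print(f"2 percent incidence, 95% exact CI: {low:.2%}, {high:.2%}")
```

The interval is asymmetric around the point estimate, which is typical when the observed proportion is close to zero.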

Although these studies do not suggest that pump-related complications are significant, adverse events related to the pump device should be documented in future studies.

Conclusions
The available evidence for the SQ terbutaline pump as maintenance tocolytic therapy in women with arrested preterm labor pertained to only two of the specific populations of interest: women primarily with singleton gestation and RPTL or those with twin gestation and RPTL. This evidence base came entirely from observational studies, and nearly half of these studies (45 percent) originated from a single proprietary database. The available RCT evidence did not apply to any of the specific preterm populations described in Key Questions 1 and 2, but rather included nonspecific populations of women with preterm labor.
For neonatal death, the strength of evidence favoring SQ terbutaline pump therapy compared with oral tocolytics is low for women with twin gestation and RPTL (OR=0.09, 95% CI: 0.01, 0.70). Strength of evidence favoring the terbutaline pump compared to oral tocolytics or no treatment is also low for the surrogate outcomes of pregnancy prolongation in women with twin gestation and/or RPTL. Insufficient evidence addressed bronchopulmonary dysplasia, death within initial hospitalization, significant intraventricular hemorrhage, and maternal withdrawal due to adverse events. The strength of evidence for nonspecific populations of women with preterm labor described in RCTs is insufficient.
Scant and underpowered evidence demonstrated inconclusive results for all other neonatal health outcomes, neonatal harms, maternal harms, and pump-related outcomes. Observational studies of medium to high risk of bias, with potential for participant double-counting, showed the benefit of the SQ terbutaline pump compared with oral tocolytics for other surrogate outcomes, such as birth weight and NICU admission.
FDA postmarketing surveillance has detected maternal deaths and maternal cardiovascular events in association with terbutaline tocolysis in general, and pump therapy in particular. However, causal association cannot be established with this evidence.
In conclusion, although evidence suggests that pump therapy is beneficial as maintenance tocolysis, our confidence in the validity and reproducibility of this evidence is low. While postmarketing surveillance has detected cases of serious harms, the safety of the therapy remains unclear.

Comparison of Results With Other Systematic Reviews
In agreement with the review by Nanda et al., we found that the available RCT evidence showed nonsignificant differences between the SQ terbutaline pump and placebo or oral terbutaline for several outcomes. 4 Nanda et al. concluded, "Terbutaline pump maintenance therapy has not been shown to decrease the risk of preterm birth by prolonging pregnancy" (p.2). The review also commented on the lack of information regarding safety and advocated for further study. We agree with these conclusions, but would also emphasize that the RCT evidence was likely prone to type II error.
The Hayes group conducted a systematic review of the SQ terbutaline pump for maintenance therapy and, in contrast to the review by Nanda et al., included both observational studies and RCTs. This review found that the available RCT and observational evidence was conflicting, and our review came to a similar conclusion: RCT evidence did not demonstrate benefit of the SQ terbutaline pump, although cohort studies of limited methodological validity demonstrated statistically significant effects in favor of the pump for several outcomes. Our review included some additional studies that were not part of the Hayes review because we performed a more recent search, we did not have a lower cutoff year, and we included case series to assess pump-related outcomes. Furthermore, the Hayes review did not specifically investigate different populations, the effect of confounding by level of maternal activity or level of maternal care, or pump-related outcomes. Our review examined these additional factors but found only limited data to address them. We also graded the strength of evidence from the body of observational studies as mostly insufficient or low for women with twin gestation and/or RPTL.

Applicability
Below we summarize characteristics of applicability based on the domains of population, intervention, comparator, outcome, and setting. The following factors should be considered by maternity care providers and policymakers when entertaining the option of recommending the SQ terbutaline pump to women with preterm labor.
Nine of 14 studies (64 percent) included women judged to be in labor on account of persistent contractions and cervical change. The definition of labor was unclear in other studies. Among the evidence that suggested that the pump was efficacious, 50 percent reported cervical change and contractions as part of the definition of labor, and 50 percent did not report how labor was defined.
The majority of evidence included women with RPTL (i.e., women who were treated with first-line tocolytic therapy for 48 hours, had cessation of symptoms, and then presented with a second episode) and singleton gestation. Some evidence pertained additionally to women with twin gestation and RPTL, which is a high-risk, specialized group of patients.
Several studies included patients from a national proprietary database run by Matria (now called Alere) Healthcare, which provides an outpatient perinatal program consisting of 24-hour nursing and pharmacy support, home uterine activity monitoring, individualized education, and provision of tocolytic therapy to women with preterm labor. These women generally received a high level of care based on nursing assessments, home uterine activity monitoring, home visits, education about preterm labor, telephone support, restriction of maternal activities, and other cointerventions. The distribution of regions from which patients were recruited into the national database is unknown. Further, it is impossible to make any judgments about the standards followed by the individual practice sites that were providing obstetrical care to the women in the database.
In general, the dose and duration of SQ terbutaline pump therapy were typical of those used in clinical practice, although some studies did not provide adequate information regarding basal and bolus doses to allow assessment. Level of care and training provided to patients on pump were also typical in most studies, although this information was also somewhat limited. In several studies, patients received specialized outpatient support in the form of nursing/pharmacy support, monitoring and contact with physicians; this level of care may not be typical of practice. Investigators typically did not report information on cointerventions, such as bed rest, restriction of maternal activities, and administration of corticosteroids.
Multiple comparison groups were used, including no treatment, placebo, and oral tocolytics. Outcomes most commonly reported in the literature were surrogates, such as gestational age at birth, prolongation of pregnancy, and birth weight. Data on clinical outcomes and neonatal/maternal harms, including pump-related outcomes, are sparse. Several important clinical outcomes have not been investigated. These include short-term outcomes of neonatal death within initial hospitalization, intraventricular hemorrhage, and bronchopulmonary dysplasia and long-term outcomes, such as developmental and neurobehavioral testing.
No long-term outcomes of the SQ terbutaline pump for maintenance tocolysis have been assessed. This absence of followup beyond delivery is a major limitation of the available evidence.

Future Research
Although cohort studies have provided a glimpse of the potential for SQ terbutaline pump to improve short-term neonatal outcomes for fetuses at risk for preterm birth, several important questions remain unanswered. Most importantly, it remains to be seen whether SQ terbutaline pump therapy alters long-term development or systemic impairment of offspring, and neonatal/maternal morbidity and mortality. The limitations of the available data must also be recognized. Most of the cohort studies were at medium to high risk of bias. In addition, several of the cohort studies investigated participants from a single proprietary database (Matria), which raises concerns regarding double-counting of patients and common biases. Therefore, results showing effectiveness should be interpreted with caution, especially in light of the recent FDA warnings.
Information is lacking on the effectiveness and safety of the SQ terbutaline pump as a maintenance tocolytic treatment in specific populations, including women who deliver at specific gestational ages, women of different racial or ethnic backgrounds, and women with previous preterm birth or preeclampsia. Future studies, whether observational or experimental in design, should focus on garnering evidence for these specific populations.
Below we provide some specific recommendations for the conduct of RCTs and observational studies to further elucidate the potential benefits and harms of SQ terbutaline pump for maintenance tocolysis.

Randomized Trials
We recommend that an adequately powered, pragmatic randomized controlled trial that assesses the SQ terbutaline pump as a maintenance tocolytic be conducted. A pragmatic RCT is designed to have broad applicability so that the results can guide decisions about practice. 26 Conducting such RCTs to assess the efficacy of tocolytics in general is notoriously difficult. A definitive trial in this domain must include a focus on accurate diagnosis of preterm labor (perhaps combining stringent clinical criteria with factors such as positive fetal fibronectin and shortened transvaginal cervical length). Emphasis must also be placed on securing funding and maintaining followup for an appropriate duration of time to allow assessment of long-term childhood outcomes, including neurobehavioral testing and developmental assessment.
Such a trial should be placebo controlled and include blinding of study participants, care providers, and study personnel. Consideration should be given to employing multiple treatment arms in order to evaluate the pump against other tocolytic agents and conservative management. Furthermore, the level of care provided to participants (i.e., nursing assessments, home uterine monitoring, education, telephone support, and restriction of activities) should be practical, feasible, and likely to be adopted in routine practice. Important cointerventions, such as administration of corticosteroids, should be reported. A full accounting of the number of women approached but not enrolled should be included to allow users to assess the impact of respondent bias. The analysis should be "intent to treat," where all participants assigned by randomization to each group are included in the primary comparisons, regardless of whether the assigned medication was received. Outcomes to be examined should go beyond those of prolongation of pregnancy and birthweight to hard clinical endpoints of neonatal morbidity, such as bronchopulmonary dysplasia, necrotizing enterocolitis, significant intraventricular hemorrhage (grade III/IV), retinopathy of prematurity, sepsis, stillbirth, and neonatal death. Lastly, there should be long-term followup to assess subsequent childhood outcomes. Pharmacodynamic and pharmacokinetic outcome measures can additionally be studied to understand inter-individual differences in effectiveness and toxicity and avoidance of β-agonist related tachyphylaxis.

Observational Studies
Although the RCT is the ideal study design for evaluating the efficacy of interventions, at times it may not be considered feasible for a number of reasons, such as a prohibitive sample size requirement and ethical considerations. We realize that collecting RCT evidence on clinically important outcomes may not be possible because a large number of patients will need to be recruited to detect rare events, such as maternal deaths. Therefore, we additionally propose:
• Well-designed, well-powered cohort studies examining clinical outcomes. These studies should include a representative inception cohort of all patients with arrested preterm labor. Since observational studies are susceptible to the effects of confounding, future observational studies should measure, report, and adjust for potential confounders such as fetal fibronectin, cervical length/dilation, cerclage, maternal characteristics (e.g., age, race), level of care and activity, and concomitant medications; propensity scores based on these variables may be considered. Other considerations about power, multiple comparison groups, level of care, reporting of cointerventions, and long-term followup are the same as for RCTs.
• Record linkage studies in which mothers' prenatal records and infants' NICU and childhood developmental electronic health records are linked. These may be a more practical research proposition for the near future, given improvements in the quality and accessibility of electronic patient records. NICU registries in which prenatal data of mothers are available can be a very valuable source. However, such linkage-based studies may also be affected by the biases common to cohort study designs, especially confounding by unmeasured or unrecorded variables with important prognostic implications.
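The point about prohibitive sample sizes for rare events can be illustrated with a standard two-proportion sample size calculation. The sketch below uses the usual normal-approximation formula for comparing two independent proportions with equal group sizes. The event rates (1 versus 2 per 1,000 women) are hypothetical values chosen for illustration and are not estimates from this review; the function name is also illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect p1 versus p2.

    Normal-approximation formula for a two-sided test of two independent
    proportions with equal allocation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical rare-event rates (assumption): 1 versus 2 events per 1,000.
print(n_per_group(0.001, 0.002))
# A common outcome with a large difference needs far fewer participants.
print(n_per_group(0.32, 0.458))
```

Under these assumed rates, the required enrollment per group runs into the tens of thousands, which is why cohort and record linkage designs are proposed for rare harms.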

Implications
Given the sparse evidence favoring SQ terbutaline pump therapy over other tocolytics or no treatment, we have low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate. 9 Most of the available data are surrogate outcomes of preterm labor. Although many decisions regarding the SQ terbutaline pump are currently made on the assumption that short-term outcomes (for example, a heavier neonate or an infant born beyond 32 weeks) will correlate well with improved long-term outcomes, rigorous scientific evaluation is needed to confirm whether such factors do, in fact, lead to better outcomes in this population. As with any intervention, the benefits of providing treatment at varying gestational ages should outweigh the risks associated with the intervention. Given the sparse epidemiological and trial evidence available on maternal and neonatal harms and the recent FDA warning against the use of terbutaline for tocolysis based on case reports of maternal deaths and serious cardiovascular events, further discussion among policymakers and health care providers is urgently needed to determine if the risks and costs of SQ terbutaline infusion by pump are justified in this vulnerable population. Therefore, this systematic review calls into question the evidence base for the current practice of using terbutaline pump as a maintenance tocolytic agent. Decisionmakers and policymakers should take into consideration the limitations of the available data when formulating recommendations.

Glossary
Applicability: The relevance of the evidence base to an external population.
Bias: A systematic error, arising from participant selection or outcome measurement, that produces an erroneous effect estimate.
Preterm birth: Delivery before completion of the 37th week of gestation.

Strength of evidence: A grading that reflects a global assessment of the evidence base. Strength of evidence may be designated as insufficient, low, moderate, or high based on the domains of study risk of bias, consistency, directness, and precision.
Tocolytic: An agent that inhibits labor by slowing or halting uterine contractions.
Has the study assessed at least one of the following outcomes?
Other Health Outcomes: gestational age at delivery, incidence of delivery at <28 weeks, <34 weeks and <37 weeks gestational age, prolongation of pregnancy, birthweight, need for assisted ventilation, need for oxygen per nasal cannula, NICU admission
Maternal Harms: pulmonary edema, heart failure, arrhythmia, myocardial infarction, refractory hypotension, hypokalemia, hyperglycemia, maternal withdrawal due to adverse effects, maternal discontinuation of therapy
If a combination of pump-related outcomes and maternal/neonatal outcomes was chosen:
a. To be included in the review, either condition (1) and/or (2) below must be met:
(1) For outcomes related to pump failure, incidence data (versus prevalence data) must be available
(2) For neonatal or other outcomes, maternal harms or neonatal harms, the study must:
• include at least one comparison group receiving placebo, standard treatment or another intervention AND
• be a controlled trial (randomized or non-randomized), a prospective or retrospective cohort study, a case-control study or a cross-sectional study AND
• allow for an evaluation of the effectiveness or harms of subcutaneous terbutaline by infusion pump as the sole maintenance tocolytic therapy (note: study designs which are (treatment X + terbutaline pump vs. X alone) or (X + terbutaline pump vs. treatment X + treatment Y) are to be included. Study designs that are (terbutaline pump + treatment X vs. terbutaline pump alone or in conjunction with treatment Y) are to be excluded, unless there is incident pump failure data, as above)
Is condition (1) and/or (2)

5. If this is an experimental study, were healthcare providers blinded to the frequency and intensity of maternal contractions? (Select all that apply)
• At initiation of maintenance therapy with the subcutaneous terbutaline pump (at treatment allocation)
• During maintenance therapy with the subcutaneous terbutaline pump
• When assessing treatment outcomes (of interest to this review)
• Health care providers were at no point blinded to the frequency and intensity of maternal contractions
• Unclear (data not reported)
• N/A (not an experimental study)

6. If this is an experimental study, was the outcome assessor blinded to treatment allocation?
The comparability of groups cannot be assessed for certain because information on all relevant factors has not been presented (e.g., prognostic factors, such as cervical length and fetal fibronectin). However, randomization was carried out properly and patients/health care providers were blinded to treatment allocation, which will limit selection and detection biases.

High (birth weight and gestational age at delivery)
Medium (maternal hyperglycemia)
Primary flaw in this study is the difference in groups with respect to severity/prognosis (i.e., groups were divided based on length of primary tocolytic treatment). Also, comparability of groups cannot be assessed due to missing information.

The potential difference in severity/prognosis among treatment and comparison groups should not impact the outcome of maternal hyperglycemia. However, issues pertaining to missing information still remain.

Major flaw is that the subcutaneous pump group had RPTL and comparison group did not. Therefore, the intervention group may have had a more serious condition. Also, there is missing information, which makes it difficult to assess other potential limitations.

Medium
There is considerable missing information, which makes it difficult to assess the comparability of groups. There is some indication that there are baseline differences (i.e., in age and marital status) and data on many other important factors have not been reported (e.g., cervical length, race, SES). However, there are no major flaws that can be singled out as invalidating the results.

Medium
There is a large amount of missing information, which makes it difficult to assess the comparability of groups and other potential limitations. But there are no major flaws that can be identified that would invalidate the results.
There is a lot of missing information, which makes it difficult to assess comparability among groups and whether groups were derived from the same population. There is a possibility that groups received a different level of care, since only the subcutaneous terbutaline group has been specified as receiving home nursing care. However, it is unclear if this factor alone would be sufficient to impact the results to a large extent.

Although the harm outcome of maternal hyperglycemia was defined and collected actively, the primary flaw with this study is that groups were not similar in baseline characteristics (i.e., in race and family history of gestational diabetes). Also, since no methods were used to control for confounders, there is a high likelihood that groups may differ in other baseline characteristics and prognostic factors, which have not been reported. There is also a lot of missing information which makes it difficult to assess the comparability of groups (e.g., primary tocolytic, loss to followup, differential level of care, compliance).

Uterine contractions > four per hour and progressive cervical change.

I: SQ terbutaline (51): NR
Medium
There is missing information, which makes it difficult to assess some quality items. However, there was no high loss to followup and subjects were representative of source population. Adequacy of sample size is unclear (n=51), although it is larger than the previous case series of nine subjects.

I: SQ terbutaline (9): NR
Medium
There is a lot of missing information, which makes it difficult to assess potential for selection bias (e.g., were the nine subjects in the study the entire sample, or were these the number left over after losses to followup?). Also, harm outcomes have not been defined. However, the study does not have any obvious major flaws, which would invalidate the results.

SC = subcutaneous; IV = intravenous; NR = not reported; PTL = preterm labor; SD = standard deviation; RPTL = recurrent preterm labor; RCT = randomized controlled trial; SQ = subcutaneous
* Either at preterm labor (indicated by P) or at start of subcutaneous terbutaline therapy (indicated by T). If study population stated RPTL as an inclusion criterion, then this is the gestational age at the episode of RPTL.
† Received by entire study population, unless specified otherwise.
‡ Data from a third treatment arm, which consisted of a control group without preterm labor, has not been presented.

Table F2. Full-text question posed for criteria listed in risk of bias charts

Risk of Bias Chart: Full Question
Baseline characteristics/prognostic factors:
If groups were similar in baseline characteristics and prognostic factors.
If groups were similar in primary tocolytic therapy.
If intention-to-treat analysis was conducted.
If sample size was adequate.
If there was reliability among multiple outcome assessors (likely that there were multiple outcome assessors, since women were from the Matria database. But reliability among assessors cannot be assessed).
If compliance with study protocol was adequate.

Outcomes:
(1) Prolongation of pregnancy (2) Gestational age at delivery (3) Birth weight (4) NICU admission

MEDIUM:
There is a lot of missing information, which makes it difficult to assess comparability of groups (in terms of baseline characteristics and prognostic factors, primary tocolytic therapy, and compliance). However, it is difficult to say with certainty that any limitation would invalidate the results.

Primary outcome of gestational age < 35 weeks has not been adequately specified (i.e., method for determining gestational age not described).

No differential level of care.
No indication of selective outcome reporting.
Comparison group drawn from same population as treatment group.
If groups were similar in baseline characteristics and prognostic factors.
If groups were similar in primary tocolytic therapy.
If intention-to-treat analysis was conducted.
If there was differential or high loss to followup.
If sample size was adequate.
If there was reliability in outcome assessors (likely that there were multiple outcome assessors, since the Matria database was used. But reliability among assessors cannot be determined).
If compliance with study protocol was adequate.
If appropriate methods were used to control for important confounders.
If subjects were representative of source population.

MEDIUM:
There is considerable missing information, which makes it difficult to assess the comparability of groups. There is some indication that there are baseline differences (i.e., in age and marital status) and data on many other important factors have not been reported (e.g., cervical length, race, SES). However, there are no major flaws that can be singled out as invalidating the results.

Groups differ in baseline characteristics and prognostic factors (in particular: smoking status and previous preterm delivery).

Intention-to-treat analysis not done (losses to followup excluded).
Methods were not sufficient to control for confounders (only matched by gestational age at delivery).
No differential level of care between groups.
Comparison group drawn from the same sample population as treatment group.
Measured harms with standardized definitions (maternal pulmonary edema and maternal death).
Mode of harms collection not explicitly specified as active. However, this is not very relevant for outcomes of pulmonary edema and maternal death.
Report does not explicitly specify who collected harms data. However, it is reasonable to assume that pulmonary edema would be assessed by qualified healthcare professionals.
If groups were similar in primary tocolytic therapy.
If there was differential or high loss to followup (losses to followup were excluded).
If sample size was adequate.
If there was selective outcome reporting.
If there was reliability in outcome assessors (data was from Matria database, so likely that there were multiple outcome assessors. But reliability among assessors cannot be determined).
If compliance with study protocol was adequate.
If subjects were representative of source population.

Outcomes:
(1) Pregnancy prolongation (2) Gestational age at delivery

HIGH:
Primary flaw is that groups were not similar at baseline (differed in smoking status and previous PTD). Also, missing data makes it difficult to assess several other potential limitations.

No high or differential loss to followup.

No differential level of care among groups.
Comparison group drawn from same population as treatment group.
Measured harms with standard definitions (maternal pulmonary edema and maternal deaths).
Mode of harms collection not explicitly specified as active. However, this is not very relevant for harms of maternal pulmonary edema and maternal death.
Report does not explicitly specify who collected harms data. However, it can be assumed that pulmonary edema and death would be assessed by qualified personnel.
Subjects were representative of source population.
If groups were similar in baseline characteristics and prognostic factors.
If groups were similar in primary tocolytic therapy.
If an intention-to-treat analysis was conducted.
If sample size was adequate.
If there was selective outcome reporting.
If there was reliability among multiple outcome assessors (data from Matria database, so likely there were multiple outcome assessors, but reliability among assessors cannot be determined).
If compliance with study protocol was adequate.
If appropriate methods were used to control for confounders (matched by gestational age at hospitalization for recurrent preterm labor).

MEDIUM:
There is a large amount of missing information, which makes it difficult to assess the comparability of groups and other potential limitations. But there are no major flaws that can be identified that would invalidate the results.

Studies that reported pump-related outcomes (Key Question 6)
CI = confidence interval; GA = gestational age; N/A = not applicable; NR = not reported; OR = odds ratio; RCT = randomized controlled trial; SQ = subcutaneous
* Either at preterm labor (indicated by P) or at start of subcutaneous terbutaline therapy (indicated by T). If study population stated RPTL as an inclusion criterion, then this is the gestational age at the episode of RPTL.