Identification of pregnancies and their outcomes in healthcare claims data, 2008–2019: An algorithm

  • Elizabeth C. Ailes,

    Contributed equally to this work with: Elizabeth C. Ailes, Weiming Zhu

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Validation, Writing – original draft, Writing – review & editing

    EAiles@cdc.gov

    Affiliation National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

  • Weiming Zhu,

    Contributed equally to this work with: Elizabeth C. Ailes, Weiming Zhu

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Writing – review & editing

    Affiliation National Center for HIV, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

  • Elizabeth A. Clark,

    Roles Methodology, Writing – review & editing

    Affiliation Emory University School of Medicine, Department of Gynecology and Obstetrics, Atlanta, Georgia, United States of America

  • Ya-lin A. Huang,

    Roles Conceptualization, Methodology, Writing – review & editing

    Affiliation National Center for HIV, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

  • Margaret A. Lampe,

    Roles Methodology, Writing – review & editing

    Affiliation National Center for HIV, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

  • Athena P. Kourtis,

    Roles Methodology, Writing – review & editing

    Affiliation National Center for HIV, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

  • Jennita Reefhuis,

    Roles Conceptualization, Methodology, Writing – review & editing

    Affiliation National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

  • Karen W. Hoover

    Roles Conceptualization, Methodology, Project administration, Writing – review & editing

    Affiliation National Center for HIV, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America

Abstract

Pregnancy is a condition of broad interest across many medical and health services research domains, but one not easily identified in healthcare claims data. Our objective was to establish an algorithm to identify pregnant women and their pregnancies in claims data. We identified pregnancy-related diagnosis, procedure, and diagnosis-related group codes, accounting for the transition to International Statistical Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) diagnosis and procedure codes, in health encounter reporting on 10/1/2015. We selected women in Merative MarketScan commercial databases aged 15–49 years with pregnancy-related claims, and their infants, during 2008–2019. Pregnancies, pregnancy outcomes, and gestational ages were assigned using the constellation of service dates, code types, pregnancy outcomes, and linkage to infant records. We describe pregnancy outcomes and gestational ages, as well as maternal age, census region, and health plan type. In a sensitivity analysis, we compared our algorithm-assigned date of last menstrual period (LMP) to fertility procedure-based LMP (date of procedure − 14 days) among women with embryo transfer or insemination procedures. Among 5,812,699 identified pregnancies, most (77.9%) were livebirths, followed by spontaneous abortions (16.2%); 3,274,353 (72.2%) livebirths could be linked to infants. Most pregnancies were among women 25–34 years (59.1%), living in the South (39.1%) and Midwest (22.4%), with large employer-sponsored insurance (52.0%). Outcome distributions were similar across ICD-9 and ICD-10 eras, with some variation in gestational age distribution observed. Sensitivity analyses supported our algorithm’s framework; algorithm- and fertility procedure-derived LMP estimates were within a week of each other (mean difference: -4 days [IQR: -13 to 6 days]; n = 107,870).
We have developed an algorithm to identify pregnancies, their gestational age, and outcomes, across ICD-9 and ICD-10 eras using administrative data. This algorithm may be useful to reproductive health researchers investigating a broad range of pregnancy and infant outcomes.

Introduction

Pregnancy is a condition of broad interest across many medical and health services research domains, but one not easily identified in healthcare claims data as there is typically no marker for pregnancy status, outcome, gestational age, or delivery date. As many pregnancy outcomes (e.g., stillbirth) are rare, large health claims databases (e.g., Merative® MarketScan®, Centers for Medicare and Medicaid Services) represent the most feasible data source to study rare exposures and outcomes. While many algorithms to identify pregnancies in insurance claims data exist [1–8], few use the International Statistical Classification of Diseases, Tenth Revision, Clinical Modification and Procedure Coding Systems (ICD-10-CM/PCS) codes implemented on October 1, 2015 [2, 8]. Algorithms using earlier ICD-9 codes cannot easily be translated to ICD-10 codes. Additionally, ICD-9 algorithms often rely upon “completed weeks of gestation” diagnosis codes (765.XX), predominantly assigned to infant rather than maternal records, necessitating more complex algorithms to estimate gestational age. “Week of gestation” diagnosis codes in the ICD-10 schema (Z3A.XX), assigned to medical visits throughout pregnancy, are likely to allow for improved pregnancy identification, outcome determination, and assessment of gestational age.

We previously developed an algorithm to identify pregnancies in insurance claims data based largely on ICD-9 codes [1]. Herein we describe an updated algorithm that includes ICD-10 codes, as well as additional algorithm enhancements, such as linkage to infant records and a revised hierarchy of pregnancy outcomes. We describe the characteristics of the pregnancy cohort identified in MarketScan Commercial claims data from 2008–2019 using our new comprehensive algorithm to identify pregnancies, their gestational ages, and outcomes across both ICD-9 and ICD-10 coding systems.

Materials and methods

Data source

We analyzed 2008–2019 data from the MarketScan Commercial Database. These data include linkable patient-level medical claims from inpatient hospitalizations, outpatient medical visits, and prescriptions dispensed by outpatient pharmacies for a convenience sample of individuals with large employer-sponsored and smaller private health insurance plans (hereafter, “commercial insurance”). Detailed annual enrollment information, including sex and age of the enrollee, type of insurance, information on whether prescription drug information was captured in the data, and duration of enrollment in each calendar year is available. Year, but not date, of birth is available for all enrollees. When covered under the same insurance, family members (e.g., mothers and infants) are linkable. In this analysis, we included women aged 15–49 years with pregnancy-related claims from January 1, 2008 to December 31, 2019 and infants with years of birth from 2008 to 2019, as these were the most recent data available at the time of analysis.

Pregnancy algorithm

Identification of pregnancy-related codes.

We used a multi-stage process to identify relevant pregnancy and gestational age codes. First, we identified all diagnosis (ICD-9-CM, ICD-10-CM), procedure (ICD-9-PCS, ICD-10-PCS, current procedural terminology [CPT], and healthcare common procedure coding system [HCPCS]), or diagnosis-related group (DRG) codes related to pregnancy. Second, we bi-directionally mapped all pregnancy-related diagnosis and procedure codes from ICD-9 to ICD-10, as well as from ICD-10 to ICD-9, using the general equivalence mapping tools developed by the Centers for Medicare and Medicaid Services [9]. All pairs from the two crosswalks were de-duplicated. Third, because our algorithm focused on codes indicating a specific pregnancy outcome or gestational age, we focused our broad list on: 1) ICD-9-CM diagnosis, ICD-9, CPT, HCPCS procedure, or DRG codes used in the previous algorithm by Ailes et al. [1], and 2) codes with descriptions that included pregnancy outcome-related terms or phrases (S1 File). Our final code set used for pregnancy identification in maternal claims included maternal codes indicative of a pregnancy outcome, infant birth hospitalization/delivery codes, and specific “weeks of gestation” codes (ICD-9-CM 765.2X codes, ICD-10-CM Z3A.XX codes; full code list provided in S2 File).
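The bi-directional mapping and de-duplication step can be sketched as follows. This is a minimal illustration, not the authors' SAS implementation: the function name `merge_crosswalks` and the placeholder code strings are hypothetical, and real general equivalence mapping files carry additional flags that are ignored here.

```python
def merge_crosswalks(forward_pairs, backward_pairs):
    """Combine two crosswalks into one de-duplicated list of (icd9, icd10) pairs.

    forward_pairs:  iterable of (icd9, icd10) from the ICD-9 -> ICD-10 mapping
    backward_pairs: iterable of (icd10, icd9) from the ICD-10 -> ICD-9 mapping
    """
    pairs = set(forward_pairs)
    # Flip the backward mapping so both sources share the (icd9, icd10) order.
    pairs.update((icd9, icd10) for icd10, icd9 in backward_pairs)
    return sorted(pairs)


# Placeholder codes for illustration only (not real ICD codes).
forward = [("ICD9_A", "ICD10_A"), ("ICD9_B", "ICD10_B")]
backward = [("ICD10_A", "ICD9_A"), ("ICD10_C", "ICD9_B")]
merged = merge_crosswalks(forward, backward)
```

A pair found in both directions is kept once, while a pair found in only one direction is still retained, which is the reason for mapping both ways.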

We developed additional code sets for use in algorithm verification steps (S2 File). These included codes more broadly indicative of a preterm delivery or prolonged pregnancy, even though a pregnancy outcome could not be assigned to these codes; ectopic pregnancy procedure or methotrexate prescription codes; and embryo transfer or insemination codes. All codes and their descriptions were independently reviewed by two co-authors with expertise in obstetrics to determine which pregnancy outcome (live birth, stillbirth, multiple birth of live birth and stillbirth, spontaneous abortion, induced abortion, ectopic pregnancy, or unknown outcome type) and/or gestational age (in weeks) could be assigned or updated from those assigned by Ailes et al. [1]. Discrepancies were resolved via discussion.

Pregnancy identification.

Annual enrollment files were used to identify women aged 15–49 years and infants (defined as enrollees with birth year equal to enrollment year) during 2008–2019. Among these women, we extracted all inpatient, outpatient, and facility header claims with ≥1 pregnancy- or delivery-related diagnosis, procedure, and/or DRG code from 2007–2019 (2007 was included to more accurately estimate pregnancies ending in 2008). Infant codes recorded on “maternal” claims were captured. Pregnancy identification was based primarily on maternal claims, though infant claims were used during outcome and gestational age verification. In separate datasets, we extracted claims with codes used in outcome and gestational age refinement steps or sensitivity analyses for these same women and infants.

For each woman, we extracted and de-duplicated claims into pregnancy-related ‘records’: distinct combinations of service dates and pregnancy-related codes (diagnosis, procedure, or DRG). We assigned the algorithm-estimated pregnancy outcome and/or gestational age, with the service date serving as proxy for the end of pregnancy/delivery date. Date of last menstrual period (LMP) was assigned by multiplying the algorithm-estimated gestational age by 7 (days per week) and subtracting that number of days from the service date. We also retained the code type (diagnosis, procedure, or DRG) and code version (ICD-9, ICD-10).
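The LMP assignment described above is simple date arithmetic. A minimal sketch, assuming gestational age is held as a whole number of weeks (the function name `estimate_lmp` is illustrative and does not come from the published SAS code):

```python
from datetime import date, timedelta


def estimate_lmp(service_date, gestational_age_weeks):
    """Return the inferred LMP: service date minus gestational age in days."""
    return service_date - timedelta(days=gestational_age_weeks * 7)


# A record dated 15 March 2016 with an estimated gestational age of 39 weeks.
lmp = estimate_lmp(date(2016, 3, 15), 39)
print(lmp)  # 2015-06-16
```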

To identify records likely belonging to the same pregnancy, we required ≥120 days from any live birth record (including a live birth and stillbirth), and ≥42 days from the end of all other outcomes to the service date and inferred LMP of the subsequent pregnancy. These gaps were chosen because: 1) they were physiologically plausible and 2) exploratory analyses showed that they appeared to differentiate two obvious spikes in the distribution of days between the end of one pregnancy and the estimated LMP of the subsequent pregnancy. In the rare instances when a maternal pregnancy record was not assigned a gestational age estimate (n = 368,234 / 40,724,108, 0.90% of records), for the sake of grouping claims into pregnancies only, we assigned a temporary gestational age of 20 weeks (for records associated with live birth or stillbirth outcomes) or 6 weeks (for other pregnancy outcomes). For each pregnancy episode, we identified a pregnancy series (e.g., first, second, third) to account for multiple pregnancies in the same woman; minimum and maximum date of service and estimated LMPs for each pregnancy; and indicator variables accounting for type of codes and pregnancy outcomes (e.g., live birth diagnosis code, spontaneous abortion DRG code, etc.).
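One way to read the grouping rule is sketched below for a single woman's records. This is a simplified interpretation rather than the published implementation: the record layout, the outcome labels, the choice to apply the 120-day gap whenever the current episode contains any live birth or stillbirth record, and the choice to measure the gap from the episode's latest service date are all assumptions made for illustration.

```python
from datetime import date, timedelta

LIVE_OR_STILL = {"LB", "SB"}  # assumed labels for live birth / stillbirth records


def inferred_lmp(rec):
    """Service date minus gestational age, with the temporary defaults if missing."""
    ga = rec["ga_weeks"]
    if ga is None:
        ga = 20 if rec["outcome"] in LIVE_OR_STILL else 6
    return rec["service_date"] - timedelta(weeks=ga)


def starts_new_pregnancy(episode, rec):
    """True if rec's service date and inferred LMP both clear the required gap."""
    last_end = max(r["service_date"] for r in episode)
    gap = 120 if any(r["outcome"] in LIVE_OR_STILL for r in episode) else 42
    threshold = last_end + timedelta(days=gap)
    return rec["service_date"] >= threshold and inferred_lmp(rec) >= threshold


def group_into_episodes(records):
    """Group one woman's pregnancy-related records into pregnancy episodes."""
    episodes = []
    for rec in sorted(records, key=lambda r: r["service_date"]):
        if episodes and not starts_new_pregnancy(episodes[-1], rec):
            episodes[-1].append(rec)
        else:
            episodes.append([rec])
    return episodes


records = [
    {"service_date": date(2012, 1, 10), "outcome": "SAB", "ga_weeks": 8},
    {"service_date": date(2012, 1, 20), "outcome": "SAB", "ga_weeks": None},
    {"service_date": date(2013, 3, 1), "outcome": "LB", "ga_weeks": 39},
]
episodes = group_into_episodes(records)
```

In this toy example the two January 2012 records fall into one episode, and the 2013 live birth record, whose inferred LMP is well past the 42-day gap, opens a second episode.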

Pregnancy outcome and gestational age verification.

To aid in pregnancy outcome estimation, we attempted to link infants born from 2008–2019 to all women with a pregnancy-related record, regardless of initial pregnancy outcome(s), based upon the unique family identifier. Among infants with infant birth hospitalization/delivery codes or preterm/prolonged gestation codes during the year of birth, we required the service date of the earliest infant claim to occur between 7 days before the minimum, and 30 days after the maximum, pregnancy end dates of a linked maternal pregnancy episode, similar to MacDonald et al. [7]. Among infants with 2008–2019 years of birth, but no pregnancy or infant birth hospitalization code that matched the pregnancy episode, we required the year of birth to match the delivery year on the maternal record. If one infant linked to multiple pregnancies, we selected the earliest pregnancy with a live birth code.
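The date-window requirement for linking an infant to a pregnancy episode can be expressed as a single comparison. In this minimal sketch the window bounds are treated as inclusive, which is an assumption, and the function name is illustrative:

```python
from datetime import date, timedelta


def infant_claim_in_window(first_infant_claim, min_end, max_end):
    """True if the earliest infant claim falls from 7 days before the minimum
    to 30 days after the maximum pregnancy end date of the episode."""
    earliest = min_end - timedelta(days=7)
    latest = max_end + timedelta(days=30)
    return earliest <= first_infant_claim <= latest


# Episode with candidate end dates spanning 1-3 May 2014.
linked = infant_claim_in_window(date(2014, 5, 20), date(2014, 5, 1), date(2014, 5, 3))
not_linked = infant_claim_in_window(date(2014, 7, 1), date(2014, 5, 1), date(2014, 5, 3))
```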

We used a hierarchy of outcomes based on code type, pregnancy outcome type, and linkage of a pregnancy episode to an infant to assign our initial pregnancy outcome (Table 1). We identified the earliest pregnancy record of the hierarchy-assigned initial outcome type. This record served as our initial best estimate of pregnancy outcome, end date, gestational age, and LMP. Our hierarchy was revised from our previous analysis [1] and chosen based on exploratory analyses and assumptions, primarily that: 1) identification of a linked infant record is strong evidence that the pregnancy ended in a live birth; 2) diagnosis codes were some of the strongest evidence of a particular pregnancy outcome; 3) pregnancies that included both a spontaneous and induced abortion code were likely to be spontaneous abortions; and, 4) DRG codes for spontaneous abortion, induced abortion, and ectopic pregnancy were stronger evidence than procedure codes; however, DRG codes for live birth were non-specific unless they were the sole pregnancy-related code present. Of note, no DRG code is specific to stillbirth.

Table 1. Pregnancy algorithm hierarchy of initial pregnancy outcome, based on outcome type and code type present (Diagnosis, Procedure, or Diagnosis-Related Group [DRG]).

https://doi.org/10.1371/journal.pone.0284893.t001

Additional refinements were made (see decision algorithms in S3 File), to adjust the final pregnancy outcome for a small proportion of pregnancies (1.3%, 84,679/6,520,768, S1 Table). These included recoding some pregnancies initially identified as induced abortions to ectopic pregnancies; requiring ectopic pregnancies to have a proximate ectopic pregnancy procedure or methotrexate prescription, similar to the methods of Hoover et al. [10] and Sarayani et al. [8]; and modifying pregnancies initially coded as stillbirths to other outcomes, based on available gestational age estimates and co-occurring spontaneous abortion, induced abortion, or live birth records.

After finalizing pregnancy outcomes, we made additional refinements to gestational age estimates, as described in more detail in S3 File. These modifications were based on the last ‘direct’ gestational age code estimates (e.g., ICD-10-CM Z3A.## or ICD-9-CM: 765.XX codes, S2 File) available for the pregnancy episode, as well as the presence of codes indicating a preterm or prolonged pregnancy either on the maternal or linked infant record. The LMP estimate from the selected direct gestational age claim, when available, was used as the final LMP estimate and was used to re-calculate the gestational age at the end of pregnancy by subtracting the final LMP from the pregnancy end date and dividing that total by 7 (days per week).
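The recalculation from a direct gestational age code can be sketched as below. The example visit, the use of whole completed weeks (integer division), and the function names are assumptions for illustration; the text does not state how fractional weeks were handled.

```python
from datetime import date, timedelta


def lmp_from_direct_code(visit_date, coded_weeks):
    """LMP implied by a 'weeks of gestation' code recorded on a visit."""
    return visit_date - timedelta(weeks=coded_weeks)


def gestational_age_at_end(final_lmp, pregnancy_end):
    """Completed weeks between the final LMP and the pregnancy end date."""
    return (pregnancy_end - final_lmp).days // 7


# Hypothetical pregnancy: a 20-week code on 10 Jan 2017, delivery on 30 May 2017.
final_lmp = lmp_from_direct_code(date(2017, 1, 10), 20)
ga_weeks = gestational_age_at_end(final_lmp, date(2017, 5, 30))
print(final_lmp, ga_weeks)  # 2016-08-23 40
```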

A small number of pregnancies with missing outcome (5.7%, S1 Table) remained at the end of this process. Pregnancy outcome could have been missing due to lapses in insurance enrollment at the end of pregnancy or because pregnancies were ongoing as of December 31, 2019 (the most recent data available at the time of the analysis). To address this, we assumed all pregnancies with missing outcomes were live births at 39 weeks gestation and estimated their date of delivery as their maximum pregnancy episode LMP + (39 weeks x 7 days/week). If this date was after December 31, 2019, we considered their pregnancy outcome to be right censored and unobservable. If the estimated pregnancy end date occurred before December 31, 2019, but after a woman’s last month and year of insurance enrollment (as identified using the annual enrollment files), we also considered the pregnancy outcome to be unobservable and excluded these pregnancies from analysis. A simplified schematic of the pregnancy algorithm steps is shown in Fig 1.
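The handling of missing outcomes can be sketched as a small classification function. Representing the last month of enrollment as the last day of that month, the status labels, and the function name are assumptions; the third status simply marks pregnancies whose projected end date fell within both the study period and enrollment, and this sketch takes no position on how those were treated.

```python
from datetime import date, timedelta

STUDY_END = date(2019, 12, 31)  # last date of available data


def classify_missing_outcome(max_lmp, last_enrolled_day):
    """Project a 39-week delivery date and flag whether it was observable."""
    projected_end = max_lmp + timedelta(weeks=39)
    if projected_end > STUDY_END:
        return projected_end, "right_censored"
    if projected_end > last_enrolled_day:
        return projected_end, "unobservable_after_disenrollment"
    return projected_end, "within_observation"


censored = classify_missing_outcome(date(2019, 6, 1), date(2019, 12, 31))
disenrolled = classify_missing_outcome(date(2018, 1, 1), date(2018, 8, 31))
observed = classify_missing_outcome(date(2018, 1, 1), date(2018, 12, 31))
```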

Fig 1. Schematic of select pregnancy algorithm implementation steps.

Abbreviations: ECT = ectopic pregnancy, IAB = induced abortion, LB = live birth, LMP = last menstrual period, SAB = spontaneous abortion, SB = stillbirth. a Select verification steps for pregnancy outcome type and/or gestational age. b Represent the minimum and maximum service dates associated with claims that have a pregnancy outcome diagnosis, procedure, and/or DRG code. See supporting files for more information.

https://doi.org/10.1371/journal.pone.0284893.g001

Analyses

We stratified characteristics of women and their pregnancies in our cohort by ICD “era” (1/1/2008-9/30/2015 deliveries for ICD-9 compared to 10/1/2015-12/31/2019 for ICD-10 deliveries). Overall and for each stratum, we estimated the total and average number of pregnancies per woman, average number of pregnancy- or gestational age-related records per pregnancy, and the distribution of pregnancies by outcome, delivery year, gestational age, maternal age at delivery, U.S. Census region, type of insurance, and continuous enrollment (i.e., at least one day of enrollment per month with no more than a two month gap in enrollment) before/during pregnancy. While the MarketScan Commercial data represent a convenience sample of persons with commercial insurance, we applied weights to generate national estimates among women with commercial insurance, derived from the American Community Survey [11]. To better understand the potential impact of missing prenatal exposures on studies of infant outcomes, we compared the aforementioned characteristics between pregnancies that could be linked (vs. not) to infant records. We also described the proportion of infant linked live births with any pregnancy-related claim available.

Lastly, we conducted two sensitivity analyses. To assess the potential impact of the specific weeks of gestation codes on our gestational age estimation, we removed these codes and calculated the gestational age using the remaining information available. We also conducted a sensitivity analysis, similar to the study of Bird et al. [12], among the small subset of pregnancies with embryo transfer or insemination codes. The service date of these procedures approximates the date of conception, which typically occurs 14 days after LMP. Because women might have had multiple unsuccessful fertility procedures before a successful one, and to allow for some inaccuracy in our LMP estimates, we identified the last fertility procedure that occurred from 56 days before the pregnancy’s estimated LMP through the end of pregnancy. We compared the LMP based on fertility procedure date (fertility procedure service date − 14 days) to the LMP based on our final algorithm.
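The fertility procedure comparison can be sketched as follows. The sign convention (algorithm LMP minus procedure-based LMP), the inclusive window bounds, and the function name are assumptions for illustration; the dates are invented.

```python
from datetime import date, timedelta


def fertility_lmp_difference(algorithm_lmp, pregnancy_end, procedure_dates):
    """Days between the algorithm LMP and the procedure-based LMP.

    Uses the last procedure from 56 days before the algorithm LMP through
    the end of pregnancy; returns None if no procedure falls in that window.
    """
    window_start = algorithm_lmp - timedelta(days=56)
    eligible = [d for d in procedure_dates if window_start <= d <= pregnancy_end]
    if not eligible:
        return None
    procedure_lmp = max(eligible) - timedelta(days=14)
    return (algorithm_lmp - procedure_lmp).days


diff = fertility_lmp_difference(
    algorithm_lmp=date(2018, 1, 1),
    pregnancy_end=date(2018, 10, 1),
    procedure_dates=[date(2017, 9, 1), date(2017, 12, 10), date(2018, 1, 12)],
)
print(diff)  # 3
```

Here the September 2017 procedure falls outside the window, the January 2018 procedure is the last eligible one, and the two LMP estimates differ by 3 days.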

MarketScan data are collected as part of billing for routine patient care and deidentified before access is granted to researchers; therefore, no Institutional Review Board review was needed. All analyses were conducted using SAS v9.4 (Cary, NC; SAS code available in S1 Data). Because the large sample size meant that very small differences between groups could be statistically significant, we chose not to conduct statistical testing, but rather considered any differences of ≥5% between groups to be notable.

Results

Among the 49,998,987 women aged 15–49 years during 2008–2019 in the MarketScan Commercial data, we identified 40,724,108 unique pregnancy-related records in 5,158,773 (10.3%) women (Fig 2). Collapsing records into pregnancy episodes resulted in a total of 6,520,768 possible pregnancies. During outcome verification, we also identified 4,364,489 infants born during 2008–2019, of which 3,274,353 were linked to potential pregnancies. We found consistency in most recorded pregnancy outcomes within pregnancy episodes, with the exception that stillbirth claims were rarely (<10%) the only type of pregnancy outcome in a pregnancy episode (S2 Table), and often occurred in combination with live birth or spontaneous abortion records. Among the 57,277 pregnancies we initially coded as stillbirths, 34,467 (60.2%) remained after verification (S1 Table). Among 85,410 pregnancies initially coded as ectopic pregnancies, 42,275 (49.5%) were verified using ectopic procedure or methotrexate prescription codes (S1 Table). At the end of the verification process, 708,069 pregnancies had a missing pregnancy outcome or were outside our delivery years or ages of interest and were excluded (Fig 2). A total of 5,812,699 pregnancies to 4,671,524 women were included in our final cohort (Fig 2).

Fig 2. Identification of cohort of pregnant women, MarketScan commercial data, 2008–2019.

a Extracted and then deduplicated inpatient, outpatient, and facility header claims with codes from S2 into pregnancy-related ‘records’: distinct combinations of service dates and pregnancy-related codes (diagnosis, procedure, or diagnosis related group). b Records likely belonging to the same pregnancy were grouped together into pregnancy episodes. c Because women could have more than one pregnancy, the sum of these categories will be greater than the total number of women.

https://doi.org/10.1371/journal.pone.0284893.g002

Most (n = 4,401,015, 75.7%) pregnancies in the final cohort ended in the ICD-9 era between January 1, 2008 and September 30, 2015 (Table 2). The mean number of pregnancies per woman was 1.2 across all years (1.2 in the ICD-9 era, 1.1 in the ICD-10 era). There were an average of 6.5 (interquartile range [IQR]: 3–8) pregnancy-related records per pregnancy across the entire cohort, though pregnancies delivered in the ICD-10 era had more records (mean: 10.6, IQR: 6–14) than those in the ICD-9 era (mean: 5.2, IQR: 3–6). Over 85% of pregnancies from the ICD-10 era had at least one direct gestational age code, while many fewer (1.3%) did during the ICD-9 era.

Table 2. Characteristics of pregnancy cohort overall and by International Statistical Classification of Diseases, Clinical Modification and Procedure Coding Systems (ICD) era, MarketScan commercial data, 2008–2019 (N = 5,812,699 pregnancies).

https://doi.org/10.1371/journal.pone.0284893.t002

Of the 5,812,699 identified pregnancies, the majority (77.9%) were livebirths, followed by spontaneous abortion (16.2%); outcome distributions were similar between ICD-9 and ICD-10 eras (Table 2). While the unweighted annual number of identified pregnancies was smaller in later years because of changes in the MarketScan data contributors since 2015, weighted estimates remained stable. A higher proportion of ICD-9 era pregnancies (71.2%) were estimated to end at term (39–41 weeks) compared to ICD-10 pregnancies (52.2%); fewer early term (37–38 weeks) and fewer post-term (≥42 weeks) pregnancies were estimated in the ICD-9 era compared to the ICD-10 era (0.03% vs. 16.9% and 1.3% vs. 3.5%, respectively). Among live births, weighted frequencies of preterm birth were lower in the ICD-9 era than the ICD-10 era (6.7% vs. 8.8%; S3 Table), as were post-term (≥42 weeks) birth estimates (1.7% vs. 4.8%). Overall, the majority of pregnant women with commercial insurance were 25–34 years of age (59.1%), were South (39.1%) or Midwest (22.2%) residents, had large employer-sponsored insurance (52.0%), and had continuous enrollment during pregnancy (74.9%).

In the final cohort, 4,533,630 pregnancies ended in a live birth or livebirth and stillbirth, of which 3,274,353 (72.2%) were linked to infant records in the database (Fig 2, Table 3). Among live births linked to an infant, infant birth hospitalization codes were found on the infant record more frequently than on the linked maternal record (86.2% vs. 33.7%). Delivery year, type of insurance, and proportion with continuous enrollment were similar between women with live birth pregnancies that linked to an infant compared to those that did not (Table 3). However, live birth pregnancies that did not link to an infant were more often estimated to end at term (88.9% vs. 83.7%). Additionally, women with pregnancies that did not link to an infant were more often younger (15–24 years at delivery) than those with a linked infant (34.5% vs. 5.9%), more likely to reside in the South (45.5% vs. 37.9%) and less likely to reside in the Northeast (12.9% vs. 17.6%), and more likely to be a child of the primary insurance holder (30.7% vs. 0.9%).

Table 3. Comparison of live birth pregnancies that linked and did not link to an infant record, MarketScan commercial data, 2008–2019 (N = 4,533,630).

https://doi.org/10.1371/journal.pone.0284893.t003

In our first sensitivity analysis, we noted that use of direct gestational age codes resulted in shifts in the final gestational age categories, particularly for pregnancies in the ICD-10 era and for pregnancies that would have been considered as ‘term’ based on other available codes (S4 Table). In our second sensitivity analysis of women who had an embryo transfer or insemination procedure code, we identified 107,870 pregnancies with the procedure occurring from 56 days before LMP through the end of pregnancy. On average, algorithm- and fertility procedure-derived LMP estimates were within a week of each other (mean difference: -4 days, median: -2 days [IQR: -13 to 6 days], S5 Table), with estimates closer among pregnancies estimated to end in a live birth (mean: 1 day, n = 78,283; distribution in S1 Fig) than those estimated to end in a non-live birth (mean: -17 days; n = 29,587).

Discussion

We identified 5.8 million pregnancies during 2008–2019 in MarketScan Commercial data, but our algorithm is applicable to any administrative or electronic health record data with service date and diagnosis, procedure, or DRG codes, across both ICD-9 and ICD-10 coding schemas. It builds upon previous algorithms [1, 3, 6–8, 10, 12–18] to include components proposed separately but rarely combined in one algorithm (e.g., linkage to infants, verification of ectopic pregnancies, ICD-10 and ICD-9 codes). Furthermore, our algorithm carefully assigned pregnancy outcome when multiple outcome codes were present.

While we were unable to externally validate our algorithm, comparisons of our weighted estimates to national data and sensitivity analyses support our algorithm’s framework. Our observed stillbirth prevalence (8.0 per 1,000 live births) fell within estimates using fetal death certificates alone (5.9 per 1,000 live births [19]) and in combination with stillbirth surveillance (10.0 per 1,000 live births plus stillbirths [20]). Our sensitivity analysis among women with fertility procedures showed good agreement with algorithm estimates, providing further reassurance of our algorithm’s accuracy.

Despite inclusion of specific weeks of gestation and broader ‘preterm’ codes, our weighted ICD-9 era preterm birth rate (6.7%) was lower than contemporaneous national estimates based on obstetric estimates (9.7% [21]; S3 Table), but our ICD-10 era rate was closer (8.8% vs. 10.0%). Notably, our analysis did not include women with Medicaid insurance, who may experience an increased frequency of preterm births [22], and national estimates based on LMP (similar to our method) tend to have higher post-term (and preterm) estimates [23].

By linking to infant records, we internally validated 72% of live births; the remaining may not have linked because infants were on other insurance plans (e.g., their fathers’) and younger mothers tended to be on their parents’ insurance, which typically does not cover the resulting grandchildren. Identification of non-live birth outcomes was more challenging; efforts to improve the coding accuracy of these outcomes in clinical practice could help. Requiring proximate relevant procedures or prescriptions improved the specificity of our ectopic pregnancy ascertainment. We prioritized assignment of spontaneous abortion outcomes over induced abortions if both were present, in contrast to previous algorithms [1, 7, 8], as spontaneous abortions can be treated with procedures also used for induced abortions, and pregnancies among women with fertility procedure codes had both outcome types yet induced abortions are likely less common in this group. However, these decisions might have overestimated the number of spontaneous abortions.

Though relatively rare, identification of pregnancies ending in stillbirth posed challenges. Vital statistics data show a bimodal distribution of the gestational ages of stillbirths (at 20 and 39 weeks), rendering our experts unable to assign one gestational age estimate to some common stillbirth codes. Pregnancies with a stillbirth code also overwhelmingly (91%) had other outcome codes. By examining the timing and distribution of other outcome and gestational age codes in these pregnancies, and making subsequent adjustments to the final pregnancy outcome, our stillbirth prevalence was in the range of published estimates [19, 20].

Overall, our large sample size allowed for identification of rare pregnancy outcomes. Inclusion of non-live birth outcomes was a critical component of our algorithm, as restricting to live births can lead to selection bias in epidemiologic studies [24–26]. Our comparison of ICD-9 to ICD-10 eras suggests that more detailed gestational age estimation is possible in the latter time period.

Limitations of our approach include that billing data are not designed for scientific investigations and their use may result in misclassification of some pregnancy outcomes or gestational ages because of billing errors, “rule-out” diagnoses, and other factors. While use of both ICD-9 and ICD-10 coding schema was a strength of our approach, and we and others [8] have explored the impact of these changes, use of both coding systems could have led to coding errors. Additionally, we made assumptions about the gestational age at which many pregnancy outcomes occurred and were unable to verify our algorithm estimates against medical or birth records, though our sensitivity analysis and comparison to national estimates provide confidence in our algorithm. Lastly, we lacked information on healthcare experiences not covered by insurance.

These limitations notwithstanding, our algorithm represents a methodological advance in use of information from administrative data and could be useful for researchers, public health practitioners, health systems, third-party payers, and others to answer questions about pregnant women and maternal and infant outcomes, including rare outcomes. Internally validated algorithms like ours can have broad applications to clinical research. Additionally, use of standardized pregnancy algorithms will facilitate comparisons across different studies.

Supporting information

S1 File. List of pregnancy outcome-related terms or phrases-search terms.

https://doi.org/10.1371/journal.pone.0284893.s001

(DOCX)

S2 File. List of diagnosis, procedure, and diagnosis-related group codes used in algorithm.

https://doi.org/10.1371/journal.pone.0284893.s002

(XLSX)

S1 Table. Initial pregnancy outcome compared to final pregnancy outcome, MarketScan 2008–2019 (N = 6,520,768 potential pregnancies).

https://doi.org/10.1371/journal.pone.0284893.s005

(DOCX)

S2 Table. Frequency of single pregnancy outcome type in pregnancy episodes, by final pregnancy outcome type, MarketScan 2008–2019 (N = 5,812,699 pregnancies).

https://doi.org/10.1371/journal.pone.0284893.s006

(DOCX)

S3 Table. Gestational age distribution of pregnancies estimated to end in a live birth (weighted total: 19,190,432 pregnancies) compared to National Vital Statistics System (NVSS) estimates.

https://doi.org/10.1371/journal.pone.0284893.s007

(DOCX)

S4 Table. Sensitivity analysis comparing estimated gestational age without use of direct gestational age codes to the final algorithm estimated gestational age using direct gestational age codes, by gestational age group and International Statistical Classification of Diseases, Clinical Modification and Procedure Coding System (ICD) era.

https://doi.org/10.1371/journal.pone.0284893.s008

(DOCX)

S5 Table. Sensitivity analysis of difference between algorithm-estimated last menstrual period (LMP) and fertility (embryo transfer or insemination) procedure-based LMP estimate, by pregnancy outcome type and ICD era, among pregnancies with co-occurring assisted reproductive procedures (n = 107,870 pregnancies).

https://doi.org/10.1371/journal.pone.0284893.s009

(DOCX)

S1 Fig. Sensitivity analysis: Distribution of difference between algorithm-estimated last menstrual period (LMP) and fertility (embryo transfer or insemination) procedure-based LMP estimate, among pregnancies estimated to end in a live birth with co-occurring assisted reproductive procedures (n = 73,241 pregnancies).

https://doi.org/10.1371/journal.pone.0284893.s010

(DOCX)

Acknowledgments

Disclaimer: The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

References

  1. Ailes EC, Simeone RM, Dawson AL, Petersen EE, Gilboa SM. Using insurance claims data to identify and estimate critical periods in pregnancy: An application to antidepressants. Birth Defects Res A Clin Mol Teratol. 2016;106(11):927–34. Epub 2016/11/29. pmid:27891779; PubMed Central PMCID: PMC5225464.
  2. Blotiere PO, Weill A, Dalichampt M, Billionnet C, Mezzarobba M, Raguideau F, et al. Development of an algorithm to identify pregnancy episodes and related outcomes in health care claims databases: An application to antiepileptic drug use in 4.9 million pregnant women in France. Pharmacoepidemiol Drug Saf. 2018;27(7):763–70. Epub 2018/05/16. pmid:29763992; PubMed Central PMCID: PMC6055607.
  3. Korelitz JJ, McNally DL, Masters MN, Li SX, Xu Y, Rivkees SA. Prevalence of thyrotoxicosis, antithyroid medication use, and complications among pregnant women in the United States. Thyroid. 2013;23(6):758–65. Epub 2012/12/01. pmid:23194469; PubMed Central PMCID: PMC3675839.
  4. Kuklina EV, Whiteman MK, Hillis SD, Jamieson DJ, Meikle SF, Posner SF, et al. An enhanced method for identifying obstetric deliveries: implications for estimating maternal morbidity. Matern Child Health J. 2008;12(4):469–77. Epub 2007/08/11. pmid:17690963.
  5. Maric I, Winn VD, Borisenko E, Weber KA, Wong RJ, Aziz N, et al. Data-driven queries between medications and spontaneous preterm birth among 2.5 million pregnancies. Birth Defects Res. 2019;111(16):1145–53. Epub 2019/08/23. pmid:31433567.
  6. Matcho A, Ryan P, Fife D, Gifkins D, Knoll C, Friedman A. Inferring pregnancy episodes and outcomes within a network of observational databases. PLoS One. 2018;13(2):e0192033. Epub 2018/02/02. pmid:29389968; PubMed Central PMCID: PMC5794136.
  7. MacDonald SC, Cohen JM, Panchaud A, McElrath TF, Huybrechts KF, Hernandez-Diaz S. Identifying pregnancies in insurance claims data: Methods and application to retinoid teratogenic surveillance. Pharmacoepidemiol Drug Saf. 2019;28(9):1211–21. Epub 2019/07/23. pmid:31328328; PubMed Central PMCID: PMC6830505.
  8. Sarayani A, Wang X, Thai TN, Albogami Y, Jeon N, Winterstein AG. Impact of the Transition from ICD-9-CM to ICD-10-CM on the Identification of Pregnancy Episodes in US Health Insurance Claims Data. Clin Epidemiol. 2020;12:1129–38. Epub 2020/10/30. pmid:33116906; PubMed Central PMCID: PMC7571578.
  9. Center for Medicare and Medicaid Services. General Equivalence Mappings for ICD-10-CM and ICD-10-PCS [11/20/2021]. Available from: https://www.nber.org/research/data/icd-9-cm-and-icd-10-cm-and-icd-10-pcs-crosswalk-or-general-equivalence-mappings.
  10. Hoover KW, Tao G, Kent CK. Trends in the diagnosis and treatment of ectopic pregnancy in the United States. Obstet Gynecol. 2010;115(3):495–502. Epub 2010/02/24. pmid:20177279.
  11. IBM MarketScan Research Databases User Guide: Commercial Insurance Weights (Data year 2019 edition).
  12. Bird ST, Toh S, Sahin L, Andrade SE, Gelperin K, Taylor L, et al. Misclassification in Assessment of First Trimester In-utero Exposure to Drugs Used Proximally to Conception: the Example of Letrozole Utilization for Infertility Treatment. Am J Epidemiol. 2019;188(2):418–25. Epub 2018/10/16. pmid:30321259.
  13. Andrade SE, Scott PE, Davis RL, Li DK, Getahun D, Cheetham TC, et al. Validity of health plan and birth certificate data for pregnancy research. Pharmacoepidemiol Drug Saf. 2013;22(1):7–15. Epub 2012/07/04. pmid:22753079; PubMed Central PMCID: PMC3492503.
  14. Devine S, West S, Andrews E, Tennis P, Hammad TA, Eaton S, et al. The identification of pregnancies within the general practice research database. Pharmacoepidemiol Drug Saf. 2010;19(1):45–50. Epub 2009/10/14. pmid:19823973.
  15. Hornbrook MC, Whitlock EP, Berg CJ, Callaghan WM, Bachman DJ, Gold R, et al. Development of an algorithm to identify pregnancy episodes in an integrated health care delivery system. Health Serv Res. 2007;42(2):908–27. Epub 2007/03/17. pmid:17362224; PubMed Central PMCID: PMC1955367.
  16. Li Q, Andrade SE, Cooper WO, Davis RL, Dublin S, Hammad TA, et al. Validation of an algorithm to estimate gestational age in electronic health plan databases. Pharmacoepidemiol Drug Saf. 2013;22(5):524–32. Epub 2013/01/22. pmid:23335117; PubMed Central PMCID: PMC3644383.
  17. Margulis AV, Setoguchi S, Mittleman MA, Glynn RJ, Dormuth CR, Hernández-Díaz S. Algorithms to estimate the beginning of pregnancy in administrative databases. Pharmacoepidemiology and Drug Safety. 2013;22(1):16–24. pmid:22550030.
  18. Naleway AL, Gold R, Kurosky S, Riedlinger K, Henninger ML, Nordin JD, et al. Identifying pregnancy episodes, outcomes, and mother–infant pairs in the Vaccine Safety Datalink. Vaccine. 2013;31(27):2898–903. pmid:23639917.
  19. Hoyert DL, Gregory EC. Cause-of-death Data From the Fetal Death File, 2015–2017. Natl Vital Stat Rep. 2020;69(4):1–19. pmid:32510316.
  20. Duke WA C; Evans S; Atkinson M; Ailes E.C. Using a Birth Defects Surveillance Program to Enhance Existing Surveillance of Stillbirth. Journal of Registry Management. 2022;49(1):17–22.
  21. Martin JA, Hamilton BE, Osterman MJK, Driscoll AK. Births: Final Data for 2019. Natl Vital Stat Rep. 2021;70(2):1–51. https://dx.doi.org/10.15620/cdc:100472. pmid:33814033.
  22. Markus AR KS, Garro N, Gerstein M, Pellegrini C. Examining the association between Medicaid coverage and preterm births using 2010–2013 National Vital Statistics Birth Data. Journal of Children and Poverty. 2017;23(1):79–94.
  23. Martin JA, Osterman MJ, Kirmeyer SE, Gregory EC. Measuring Gestational Age in Vital Statistics Data: Transitioning to the Obstetric Estimate. Natl Vital Stat Rep. 2015;64(5):1–20. pmid:26047089.
  24. Suarez EA, Boggess K, Engel SM, Sturmer T, Lund JL, Jonsson Funk M. Ondansetron use in early pregnancy and the risk of miscarriage. Pharmacoepidemiol Drug Saf. 2021;30(2):103–13. Epub 2020/10/02. pmid:33000871.
  25. Suarez EA, Boggess K, Engel SM, Sturmer T, Lund JL, Funk MJ. Ondansetron use in early pregnancy and the risk of late pregnancy outcomes. Pharmacoepidemiol Drug Saf. 2021;30(2):114–25. Epub 2020/10/18. pmid:33067868.
  26. Snowden JM, Reavis KM, Odden MC. Conceiving of Questions Before Delivering Analyses: Relevant Question Formulation in Reproductive and Perinatal Epidemiology. Epidemiology. 2020;31(5):644–8. Epub 2020/06/06. pmid:32501813.