Linking Data for Mothers and Babies in De-Identified Electronic Health Data

  • Katie Harron,

    Affiliation Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, 15-17 Tavistock Place, London, United Kingdom

  • Ruth Gilbert,

    Affiliation Institute of Child Health, University College London, 30 Guilford Street, London, United Kingdom

  • David Cromwell,

    Affiliation Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, 15-17 Tavistock Place, London, United Kingdom

  • Jan van der Meulen

    Affiliation Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, 15-17 Tavistock Place, London, United Kingdom



Linkage of longitudinal administrative data for mothers and babies supports research and service evaluation in several populations around the world. We established a linked mother-baby cohort using pseudonymised, population-level data for England.

Design and Setting

Retrospective linkage study using electronic hospital records of mothers and babies admitted to NHS hospitals in England, captured in Hospital Episode Statistics between April 2001 and March 2013.


Of 672,955 baby records in 2012/13, 280,237 (42%) linked deterministically to a maternal record using hospital, GP practice, maternal age, birthweight, gestation, birth order and sex. A further 380,164 (56%) records linked using probabilistic methods incorporating additional variables that could differ between mother/baby records (admission dates, ethnicity, 3/4-character postcode district) or that include missing values (delivery variables). The false-match rate was estimated at 0.15% using synthetic data. Data quality improved over time: for 2001/02, 91% of baby records were linked (holding the estimated false-match rate at 0.15%). The linked cohort was representative of national distributions of gender, gestation, birth weight and maternal age, and captured approximately 97% of births in England.


Probabilistic linkage of maternal and baby healthcare characteristics offers an efficient way to enrich maternity data, improve data quality, and create longitudinal cohorts for research and service evaluation. This approach could be extended to linkage of other datasets that have non-disclosive characteristics in common.


Linkage of administrative or electronic health records for mothers and babies has the potential to provide a population-level resource to support research and service evaluation. Such linked data are increasingly used in populations around the world, including Scotland, Canada, Australia, the US and the Netherlands amongst others.[1–4] Linkage of primary care electronic health records for mothers and babies has been attempted for small populations in England [5, 6] and linkage of prospective maternity and children’s health services datasets on a larger scale is being developed by the Health and Social Care Information Centre (HSCIC).[7] However, routine linkage of maternal and baby records in existing administrative hospital data does not currently exist in England.

The large sample size and representativeness of linked administrative hospital data offer a cost-effective alternative to traditional cohort studies for studying childhood outcomes, providing valuable information on maternal morbidity prior to birth and maternal risk-factors for adverse birth outcomes.[8–12] Whilst existing birth cohort studies across Europe and the US have provided important information on short- and long-term outcomes,[13] they are associated with major costs, are subject to limited numbers for assessing rare conditions, suffer from selectivity in follow-up, and find it difficult to recruit due to increasing participant burden.[14] These limitations provide an imperative for finding alternative approaches using existing data sources. Linked data on a population-level could be used instead for service evaluation and to answer a range of research questions relating to the relationship between pre- and postnatal maternal risk-factors, adverse birth outcomes and healthcare use throughout the course of childhood.[15]

Barriers to linkage between maternal and baby healthcare records in England include uncertainty about the quality of data that are collected primarily for administrative purposes, and the availability and completeness of personal identifiers required for linkage.[15–18] We developed methods for establishing a mother-baby cohort using linkage of a standard, pseudonymised extract of administrative hospital data for England. Our objective was to evaluate the success of mother-baby linkage in the absence of direct personal identifiers, using non-disclosive clinical variables (e.g. dates and delivery information) and demographic variables (e.g. ethnicity and GP practice code). We provide generalisable methods and guidance for combining information on individuals with healthcare characteristics in common.

Materials and Methods

Ethics statement

The study is exempt from UK NREC approval because it involved the analysis of an existing dataset of anonymous data for service evaluation. Approvals for the use of HES data were obtained as part of the standard Hospital Episode Statistics approval process. Hospital Episode Statistics were made available by NHS Digital.

The steps required to create a linked cohort are outlined in the following sections: 1) Understanding the data source; 2) Identifying birth and delivery records; 3) Data preparation; 4) Data linkage; 5) Internal and external validity.

1) Understanding the data source.

Data were extracted from Hospital Episode Statistics (HES). HES is an administrative database holding detailed information for all admissions to NHS hospitals in England, and has been collected since 1989, primarily for financial purposes. HES data are made available to researchers with appropriate permissions, in a pseudonymised form (i.e. without personal identifiers) from the HSCIC. Data are divided into financial years, and structured as ‘episodes’ of care, within which a patient is under the care of one consultant. Each admission may comprise multiple episodes; episodes relating to the same individual are assigned the same pseudonymous ID (HESID) by the data provider, allowing researchers to track patient admissions over time without accessing any personal identifiers. HESID is assigned using a deterministic rule-based algorithm based on NHS number, local patient identifier, sex, date of birth and postcode.[19]

HES records contain clinical diagnoses (coded using the International Statistical Classification of Diseases and Related Health Problems 10th Revision: ICD-10[20]), procedures (coded using the Office of Population Censuses and Surveys Classification of Surgical Operations and Procedures 4th revision: OPCS), and Healthcare Resource Groups (HRGs). Standard HES extracts also include sex, month and year of birth and ethnic category (UK census 18 categories). Geographical information includes organisational code (NHS Trust or Primary Care Trust), registered GP practice code, residential postcode district (first 3–4 postcode characters; each postcode district contains an average of 9500 UK households) and Index of Multiple Deprivation (IMD, derived from postcode).[21–23]

In addition to the main HES record, delivery episodes for mothers and birth episodes for babies include additional fields called the ‘baby’ (or ‘maternity’) tail.[21] The baby tail contains information on delivery, including gestational age, birth weight and mode of delivery. For multiple births, each delivery record can hold up to 9 baby tails (up to 6 prior to 2002). The baby tail should contain the same information on both maternal and baby records, but is sometimes incomplete.[15] There is no routine linkage of maternal and baby records within HES: the maternal NHS number is not available on the baby record or vice versa.

2) Identifying delivery and birth records.

All records relating to birth and delivery episodes for babies and mothers between April 2001 and March 2013 were extracted from HES. Birth episodes can be identified in a number of ways within HES.[15, 18] For this study, maternal (delivery) records were identified by the presence of ICD-10 codes Z37-Z38 (outcome of delivery, liveborn infant), OPCS codes R14-R27 (delivery procedures), or two or more valid baby tail fields (excluding numpreg, numbaby, neocare and well_baby). Baby (birth) records were identified by the presence of ICD-10 codes Z37-Z38, HRG codes N01-N05 (neonates) or HES fields relating to episode type, method of admission, age at start of episode and level of neonatal care. Ectopic pregnancies, terminations and duplicate episodes were excluded. Full descriptions of the code lists are provided in Tables A and B in S1 Appendix.

Maternal records with the same HESID but with episodes <169 days (24 weeks) apart were treated as duplicates (except for multiple births). Baby records with the same HESID, episode start date, start age, postcode district, birth order and birth weight were treated as duplicates. Information on duplicate records was combined and a single record retained. Where multiple births were recorded on the same maternal record, separate delivery records were created for each birth to facilitate linkage with distinct birth records.
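As a rough illustration of the 169-day rule for maternal records, the following Python sketch collapses the delivery episode dates recorded under a single HESID into distinct deliveries. The function name, and the choice to measure the gap from the last retained episode, are assumptions made for illustration; the exception for multiple births is not handled here.

```python
from datetime import date
from typing import List


def distinct_deliveries(episode_dates: List[date], min_gap_days: int = 169) -> List[date]:
    """Keep one date per delivery; episodes closer than min_gap_days are duplicates."""
    kept: List[date] = []
    for current in sorted(episode_dates):
        # Assumption: the gap is measured from the last retained delivery date.
        if not kept or (current - kept[-1]).days >= min_gap_days:
            kept.append(current)
    return kept


# Hypothetical episode dates for one mother (not taken from HES).
example = [date(2012, 1, 3), date(2012, 1, 1), date(2012, 12, 1)]
deliveries = distinct_deliveries(example)
```

In this example the first two episodes are two days apart and are treated as one delivery, while the December episode is retained as a second delivery.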

The algorithm for assigning HESIDs to multiple episodes of care for the same individual can introduce errors.[24] Where the same HESID was assigned to multiple individuals, records were dropped, as it was not possible to identify the correct record. Where the same individual was assigned multiple HESIDs (i.e. multiple HESIDs for the same episode start date, age, hospital, GP practice, ethnicity, month-year of birth and baby tail fields), records were treated as duplicates and a log was kept of the relevant HESIDs (to facilitate linkage with subsequent episodes of care).

Birth outcomes were identified from clinical information in either baby or maternal records. Multiple births were identified by ICD-10 codes (Z372-Z377 or Z383-Z388), or HES fields ‘birordr’ (birth order) and ‘numbaby’ (number of babies). Preterm births were identified using ‘gestat’ (gestational age) or ICD-10 codes P072, O60 or P590. Still births were identified using three categories of codes: ‘dismeth’ (discharge method), ‘birstat’ (birth status) and ICD-10 diagnosis (see Table C in S1 Appendix for the full code list description).

3) Data preparation.

Since postcode is not always completed for birth episodes in HES, postcode was imputed using subsequent episode records up to one year after the birth episode. Where ‘sexbaby’ was not completed for baby records, ‘sex’ as recorded on the main HES record was used.

We hypothesised that any coding errors could occur simultaneously in corresponding maternal and baby records (i.e. if data were input to both records through the same system), and so only minimum data cleaning was applied prior to linkage.[25] Generic values for “not known” or “not applicable” were set to missing. All string variables were trimmed to remove blank characters, and instances of “&”, “-”, etc. were removed (see Table A in S2 Appendix).
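The minimal cleaning described above could look something like the following Python sketch. The set of generic "missing" codes is a placeholder assumed for illustration (the actual values are listed in Table A in S2 Appendix), and `clean_value` is a hypothetical helper name.

```python
from typing import Optional

# Assumed placeholder codes, for illustration only.
GENERIC_MISSING = {"", "NOT KNOWN", "NOT APPLICABLE", "UNKNOWN"}


def clean_value(raw: Optional[str]) -> Optional[str]:
    """Trim blanks, strip '&' and '-', and map generic values to None (missing)."""
    if raw is None:
        return None
    value = raw.strip().replace("&", "").replace("-", "").strip()
    if value.upper() in GENERIC_MISSING:
        return None
    return value
```

Deliberately, nothing else is corrected: if the same error was entered on both the maternal and the baby record, leaving it in place preserves the agreement between them.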

4) Data linkage.

Data linkage methods. There are two main approaches for linking data: deterministic and probabilistic. Deterministic linkage (or rule-based matching) typically requires exact or approximate agreement on a set of common identifiers (e.g. sex, postcode and date of birth). Exact deterministic matching generally achieves few false-matches (where records belonging to different individuals are linked), as it is unlikely that two individuals share the same set of identifiers. However, requiring exact agreement on identifiers can result in low match rates, as any errors or missing values can prevent a match (resulting in missed-matches, where records belonging to the same individual remain unlinked). Deterministic methods can also incorporate approximate matching, e.g. on month and year of birth, phonetic codes or string comparators for names, or dates within a particular timeframe.

Probabilistic linkage is based on deriving a match weight that represents the likelihood of records belonging to the same individual, given the agreement or disagreement on a set of common identifiers.[26] This approach accounts for the discriminative value of each identifier, i.e. agreement on postcode district would contribute more evidence of a match than agreement on sex. Calculation of the match weight depends on the estimation of two conditional probabilities:

  • M-probability: the probability that an identifier agrees given records belong to the same individual
  • U-probability: the probability that an identifier agrees given records belong to different individuals

The u-probability can be approximated by the probability of chance agreement. For example, the probability of chance agreement on sex is ½. The probability of chance agreement on month of birth is 1/12, and so on. M-probabilities reflect the error rate in a particular identifier (the m-probability is one minus the error rate), and are typically estimated during the linkage process, and updated as more links are made. For example, if sex was miscoded in 5% of record pairs, the m-probability would be 0.95. Frequency-based weights can also be derived, which allow agreement on rarer values to contribute a higher weight.

The overall match weight is derived by calculating log2(m/u) for each identifier that agrees, with a corresponding negative weight (conventionally log2[(1−m)/(1−u)]) for each identifier that disagrees, and summing across all identifiers. Record pairs with agreement on multiple identifiers will have large positive match weights; record pairs with disagreement on most identifiers will have negative match weights.
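To make the calculation concrete, the Python sketch below sums per-identifier weights for one record pair, using the examples from the text (m = 0.95 and u = ½ for sex; u = 1/12 for month of birth). The m-probability of 0.95 for month of birth, the disagreement term log2((1−m)/(1−u)) and the rule that missing comparisons contribute nothing are assumptions based on common practice in probabilistic linkage, not details confirmed for this study.

```python
import math
from typing import Dict, Optional


def match_weight(agreement: Dict[str, Optional[bool]],
                 m: Dict[str, float],
                 u: Dict[str, float]) -> float:
    """Sum log2 agreement/disagreement weights over all compared identifiers."""
    total = 0.0
    for field, agrees in agreement.items():
        if agrees is None:
            # Assumption: a missing comparison adds no evidence either way.
            continue
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


m = {"sex": 0.95, "birth_month": 0.95}   # birth_month value is illustrative
u = {"sex": 1 / 2, "birth_month": 1 / 12}

both_agree = match_weight({"sex": True, "birth_month": True}, m, u)
both_disagree = match_weight({"sex": False, "birth_month": False}, m, u)
sex_only = match_weight({"sex": True, "birth_month": None}, m, u)
```

Agreement on month of birth contributes more than agreement on sex because chance agreement on month is less likely, which is the sense in which the weight reflects each identifier's discriminative value.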

Probabilistic linkage requires cut-off weights to be chosen for classifying record pairs as links or non-links, consequently determining the rates of missed-matches and false-matches. Typically, two thresholds are chosen, and record pairs with weights falling between the thresholds are subjected to further manual review. However, manual review processes can be both subjective and prohibitively time-consuming for large datasets, and often depend on having access to detailed identifying information. An alternative method is to set an optimal error rate (e.g. maximum allowed false-match rate) and to evaluate error rates for a range of threshold values. Estimation of linkage error rates at each potential threshold requires that the true match status is known, and is typically performed using a subset of gold-standard data (e.g. manual review of a sample of records) or by generating synthetic data with similar characteristics to the original data.[27]
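The threshold-selection idea can be sketched in Python as follows, assuming a list of scored record pairs for which the true match status is known (for example from synthetic data). Here the false-match rate is taken to be the proportion of accepted links that are false, and the chosen threshold is the lowest candidate that stays within the allowed rate; both are assumptions for illustration, as are the function names and the toy data.

```python
from typing import List, Optional, Sequence, Tuple

ScoredPair = Tuple[float, bool]  # (match weight, is a true match)


def error_rates(pairs: Sequence[ScoredPair], threshold: float) -> Tuple[float, float]:
    """Return (false-match rate, sensitivity) when linking pairs at or above threshold."""
    links = [(w, truth) for w, truth in pairs if w >= threshold]
    false_links = sum(1 for _, truth in links if not truth)
    true_links = len(links) - false_links
    all_true = sum(1 for _, truth in pairs if truth)
    fmr = false_links / len(links) if links else 0.0
    sensitivity = true_links / all_true if all_true else 0.0
    return fmr, sensitivity


def choose_threshold(pairs: Sequence[ScoredPair],
                     candidates: List[float],
                     max_fmr: float) -> Optional[float]:
    """Lowest candidate threshold whose false-match rate does not exceed max_fmr."""
    for threshold in sorted(candidates):
        fmr, _ = error_rates(pairs, threshold)
        if fmr <= max_fmr:
            return threshold
    return None


# Toy scored pairs with known truth (hypothetical values).
toy = [(25, True), (22, True), (18, False), (15, True), (10, False), (5, False)]
chosen = choose_threshold(toy, [5, 10, 15, 20], max_fmr=0.0015)
```

Choosing the lowest acceptable threshold keeps as many true links as possible while respecting the error limit.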

Data linkage methods in this study. In this study, we first used exact deterministic linkage to bring together maternal and baby records, and supplemented this approach with probabilistic linkage of the remaining unlinked records (Table 1). To reduce the number of comparison pairs, an initial blocking strategy was employed: mother and baby records were only considered as possible matches if they had been admitted to the same hospital, and record pairs with implausible dates were not considered (baby discharged prior to the mother’s admission or mother discharged prior to the baby’s admission). This blocking strategy was subsequently relaxed to capture mothers and babies in different hospitals or where episode dates differed.
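A minimal Python sketch of the initial blocking rule is given below. The `Episode` structure and its field names are hypothetical and are not HES field names; only the same-hospital and date-plausibility conditions come from the description above.

```python
from datetime import date
from typing import NamedTuple


class Episode(NamedTuple):
    hospital: str
    admitted: date
    discharged: date


def is_candidate_pair(mother: Episode, baby: Episode) -> bool:
    """Initial blocking: same hospital and no implausible ordering of dates."""
    if mother.hospital != baby.hospital:
        return False
    if baby.discharged < mother.admitted:
        return False  # baby discharged before the mother was admitted
    if mother.discharged < baby.admitted:
        return False  # mother discharged before the baby was admitted
    return True


mother = Episode("H1", date(2012, 5, 1), date(2012, 5, 4))
baby_ok = Episode("H1", date(2012, 5, 2), date(2012, 5, 4))
baby_other_hospital = Episode("H2", date(2012, 5, 2), date(2012, 5, 4))
baby_too_late = Episode("H1", date(2012, 5, 10), date(2012, 5, 12))
```

Relaxing the blocking strategy, as was done in a later pass, would amount to dropping or loosening one of these conditions.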

Table 1. Completeness of potential linkage variables in maternal and baby HES records for 2012/13.

Deterministic links were identified as records agreeing exactly on GP practice, maternal age, birth weight, gestation, birth order and sex. Our approach allowed for missing values, as long as at least three of the agreeing variables were complete, and there were no disagreeing values on any variable.
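The deterministic rule can be expressed as a short Python sketch. The dictionary keys are hypothetical stand-ins for the corresponding HES fields, missing values are represented as `None`, and the pair is assumed to have already passed the blocking step.

```python
from typing import Any, Dict

# Hypothetical field names standing in for the six deterministic variables.
DETERMINISTIC_FIELDS = (
    "gp_practice", "maternal_age", "birth_weight",
    "gestation", "birth_order", "sex",
)


def is_deterministic_link(mother: Dict[str, Any], baby: Dict[str, Any],
                          min_complete: int = 3) -> bool:
    """Exact agreement on all complete fields, with at least min_complete agreeing."""
    agreeing = 0
    for field in DETERMINISTIC_FIELDS:
        a, b = mother.get(field), baby.get(field)
        if a is None or b is None:
            continue  # missing on either side: neither agreement nor disagreement
        if a != b:
            return False  # any disagreement rules out a deterministic link
        agreeing += 1
    return agreeing >= min_complete


mother = {"gp_practice": "G123", "maternal_age": 29, "birth_weight": 3400,
          "gestation": 40, "birth_order": 1, "sex": "F"}
baby_partial = {"gp_practice": "G123", "maternal_age": 29, "birth_weight": 3400}
baby_conflict = dict(mother, sex="M")
baby_sparse = {"gp_practice": "G123", "sex": "F"}
```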

For our probabilistic approach, we used frequency-based match weights since the probability of agreement on a particular variable may vary according to the value of that variable. Frequency weights were derived for each value of gestational age, delivery place (intended), status of person conducting delivery, postcode district (first letter) and ethnic category. For example, this allowed for a higher chance of agreement on a gestational age of 40 weeks (a common value) than 26 weeks (a rare value). Episode start and end dates for mothers and babies could also genuinely be different, e.g. if the mother was admitted the day before delivery. Therefore, for dates, match weights were calculated depending on the difference in the number of days (0, 1, 2, 3, 4, 5, 6, 7, 8–14 and 14+ days). Records with dates a small number of days apart would therefore have higher match weights than those that were more than a week apart.
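The two ideas in this paragraph can be sketched in Python as follows. Using the relative frequency of each value as its value-specific u-probability is a common convention assumed here, not a detail confirmed by the text, and the label ">14" is one reading of the "14+" category. The function names and the toy gestational ages are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional


def value_frequencies(values: Iterable[Optional[Hashable]]) -> Dict[Hashable, float]:
    """Relative frequency of each non-missing value, used as a value-specific u."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def date_gap_category(days_apart: int) -> str:
    """Bin the absolute difference between two dates into the categories in the text."""
    gap = abs(days_apart)
    if gap <= 7:
        return str(gap)
    if gap <= 14:
        return "8-14"
    return ">14"


# Toy gestational ages in weeks: 40 is common, 26 is rare.
u_gestation = value_frequencies([40] * 6 + [39] * 3 + [26])
```

Because the u-probability for 26 weeks is smaller than for 40 weeks, agreement on 26 weeks yields a larger log2(m/u) weight, and each date-gap category can be given its own weight in the same way.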

In our study, initial estimates for the m-probabilities were obtained using the deterministically-linked records. U-probabilities were obtained from pairwise comparisons of a random sample of 5000 unlinked records (i.e. 25,000,000 comparisons). Estimates were then iteratively updated using the probabilistically-linked records according to the following steps:

  1. Initial match weights were assigned to all record pairs
  2. Record pairs were manually reviewed and non-links were removed
  3. M-probabilities were re-estimated based on the remaining record pairs
  4. New match weights were assigned to all record pairs
  5. The process was repeated until match weights stabilised (three iterations in this study)
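The iterative steps above can be sketched in Python under a simplifying assumption: the review step that removes non-links is replaced here by a fixed weight threshold, which is a stand-in and not the procedure used in the study. Each record pair is represented as a dictionary of agreement indicators (`True`, `False`, or `None` for missing); all names and values are hypothetical.

```python
import math
from typing import Dict, List, Optional

Pair = Dict[str, Optional[bool]]


def weight(pair: Pair, m: Dict[str, float], u: Dict[str, float]) -> float:
    """Summed log2 match weight for one pair; missing comparisons are skipped."""
    total = 0.0
    for field, agrees in pair.items():
        if agrees is None:
            continue
        ratio = m[field] / u[field] if agrees else (1 - m[field]) / (1 - u[field])
        total += math.log2(ratio)
    return total


def estimate_m(pairs: List[Pair], fields: List[str]) -> Dict[str, float]:
    """Proportion of retained pairs agreeing on each field, clamped so logs stay finite."""
    m = {}
    for field in fields:
        known = [p[field] for p in pairs if p.get(field) is not None]
        share = sum(1 for a in known if a) / len(known) if known else 0.5
        m[field] = min(max(share, 0.001), 0.999)
    return m


def iterate_m(pairs: List[Pair], m: Dict[str, float], u: Dict[str, float],
              threshold: float, max_iter: int = 10, tol: float = 1e-6) -> Dict[str, float]:
    """Repeat: weight pairs, drop presumed non-links, re-estimate m, until stable."""
    for _ in range(max_iter):
        kept = [p for p in pairs if weight(p, m, u) >= threshold]
        if not kept:
            break
        new_m = estimate_m(kept, list(u))
        stable = all(abs(new_m[f] - m[f]) < tol for f in u)
        m = new_m
        if stable:
            break
    return m


toy_pairs = ([{"sex": True, "gp": True}] * 8 + [{"sex": False, "gp": True}] * 2
             + [{"sex": True, "gp": False}] * 5 + [{"sex": False, "gp": False}] * 5)
final_m = iterate_m(toy_pairs, m={"sex": 0.9, "gp": 0.9},
                    u={"sex": 0.5, "gp": 0.01}, threshold=0.0)
```

In this toy example the estimates settle after two passes, once the set of retained pairs stops changing.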

There is no gold-standard for linkage of maternal and baby records in HES, and even if it were possible to access personal identifiers, maternal and baby HES records do not share a unique identifier. Therefore, linkage quality was evaluated by testing the algorithm and estimating the match rate and false-match rate on synthetic data. Full details of the synthetic data approach are provided in S3 Appendix; in brief, 100 synthetic datasets with similar identifier error rates and missing values to HES were created, where the true match-status was known; after applying the linkage algorithm to each synthetic dataset, false-match rates were estimated.

Completeness of HES fields is known to have improved over time.[15] Therefore, we compared linkage rates for 2001/02 and 2012/13. In order to evaluate the relative contribution to linkage success of more sensitive linkage variables (postcode district and GP practice), we repeated the linkage process excluding these variables.

5) Internal and external validity.

We first compared values of gestational age and birth weight with published reference values, and set values falling more than 3 standard deviations from the average to missing.[28] We then assessed internal validity of maternal and baby records by checking consistency of three rare but important birth outcomes (still births, multiple births and preterm births). Linked maternal-baby records that were discordant on these outcomes were resolved using corroborating information in HES (additional ICD-10 diagnosis codes, the presence of multiple maternity tails, or subsequent admission records).
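The 3 standard deviation rule can be written as a small Python helper. The reference mean and standard deviation used in the example are made-up illustrative numbers, not the published reference values cited in the text, and the function name is hypothetical.

```python
from typing import Optional


def outlier_to_missing(value: Optional[float], ref_mean: float, ref_sd: float,
                       n_sd: float = 3.0) -> Optional[float]:
    """Return None (missing) if value lies more than n_sd SDs from the reference mean."""
    if value is None:
        return None
    if abs(value - ref_mean) > n_sd * ref_sd:
        return None
    return value


# Illustrative reference for birth weight (grams) at a given gestational week.
REF_MEAN, REF_SD = 3500.0, 450.0
plausible = outlier_to_missing(3400.0, REF_MEAN, REF_SD)
implausible = outlier_to_missing(350.0, REF_MEAN, REF_SD)
```

A weight recorded as 350 would be set to missing, which is the kind of value that arises when a field is truncated or recorded in the wrong units.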

The representativeness of the linked cohort was evaluated by comparing distributions of key birth characteristics and outcomes with national published data (compiled on birth registrations) from the Office for National Statistics (ONS). Differences were identified using chi-squared tests for categorical data, t-tests for normal data and the Mann-Whitney U test for skewed data.


1) Understanding the data source and 2) Identifying delivery and birth records

The number of records in the baby extract rose from 553,094 in 2001/02 to 672,955 in 2012/13. Fig 1 describes the cohort extraction in detail for 2012/13. HESID assignment errors occurred in <0.01% of maternal records and in up to 0.8% of baby records (Table B in S2 Appendix shows numbers for each year). Completeness of most linkage variables increased over time (Fig A in S2 Appendix).

Fig 1. Extract flow-diagram for delivery and birth episodes captured in HES for 2012/13.

3) Data preparation and 4) Data linkage

Assessment of completeness of clinical and demographic information common to both baby and maternal records showed that fields were generally better completed for maternity records than baby records (Table 1). Missing postcode was a particular problem for baby records. A number of common variables were not considered for linkage due to missing or non-informative values. For example, in maternity records, the Well baby flag always contained the value “N” and the neonatal level of care was almost always “Not applicable”. In baby records, antenatal days of stay was always 0.

For the 2012/13 cohort, 280,237/672,955 baby records (42%) were deterministically linked to a mother. In the deterministically-linked records, agreement on other baby tail variables (those not used in the deterministic linkage) was high (Table 1).

For probabilistic linkage, final match weights for each linkage variable are provided in Table 2. To choose a threshold, estimates of sensitivity and specificity were derived for combined match weights between 5 and 30, averaged over the 100 synthetic datasets (Fig 2). A threshold of 20 was chosen, for which the false-match rate was estimated as 0.15% in the synthetic data. Probabilistic linkage with this threshold resulted in linkage of a further 380,164 baby records (56%). A total of 660,401 baby records (98%) were therefore linked using deterministic and probabilistic linkage combined.
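As a quick arithmetic check, the counts reported for 2012/13 can be combined in a few lines of Python to confirm the totals and rounded percentages quoted above. The variable names are illustrative; the figures are those given in the text.

```python
total_baby_records = 672_955
deterministic_links = 280_237
probabilistic_links = 380_164

# Combined links from both stages.
total_links = deterministic_links + probabilistic_links

# Percentages of all baby records, rounded to whole numbers as in the text.
pct_deterministic = round(100 * deterministic_links / total_baby_records)
pct_probabilistic = round(100 * probabilistic_links / total_baby_records)
pct_linked = round(100 * total_links / total_baby_records)
```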

Fig 2. Estimated false-match rate and sensitivity for a range of threshold weights, based on synthetic data.

Accuracy of linkage variables improved over time: more of the baby records with complete values for all deterministic linkage variables matched exactly to a maternal record in 2012/13 than in 2001/02 (78% versus 73%, Table 3). This implies that in 2001/02, at least one variable contained an error in 27% of records, compared with 22% in 2012/13. Linkage of data from 2001/02 had slightly inferior results: 91% of records could be linked whilst retaining the same estimated false-match rate of 0.15%. The highest match rate that could be achieved was 94%, with an associated estimated false-match rate of 1.2%.

Table 3. Probability of achieving a deterministic link according to completeness of baby records.

The final row shows an increase in accuracy of variables over time: in 2001/02, deterministic links were found for 73.0% of baby records with complete values on all linkage variables compared with 77.5% in 2012/13.

Some variables were more important than others for linkage (Fig 3). Variables contributing most to the probabilistic linkage (i.e. having highest match weights) were GP practice, postcode district and estimated delivery date (Fig 3, Table 2). Excluding these variables from the linkage process had a detrimental effect: only 80% of baby records could be linked whilst accepting an estimated false-match rate of 0.15%; the highest match rate that could be achieved whilst excluding these variables was 94%, with a corresponding estimated false-match rate of 7%.

Fig 3. Contribution of each linking variable to overall match weight.

Agreement = positive contribution (solid line), disagreement = negative contribution (dashed line). The higher the value, the more information the linkage variable provides.

It was not possible to link the remaining 2% of baby records, even through manual review of available data. Inspection of the unlinked baby records identified that the main reason for a lack of linkage was missing values: 7452/12,654 (59%) of unlinked records had no baby tail fields compared with 62,157/660,401 (9%) of linked records. Another possible explanation for unlinked baby records is that the mother’s record was not present in the maternal extract (e.g. due to home births where the baby was subsequently admitted but the mother was not). In addition to having more missing values, unlinked records were more likely to be still births, lower gestational age, lower birth weights, younger maternal age, more deprived and non-white but less likely to be multiple births, caesarean sections, or from pregnancies where the first antenatal assessment was before 20 weeks (Table 4).

Table 4. Comparison of linked and unlinked baby record characteristics for 2012/13.

Missing values are excluded from all categories.

5) Internal and external validity

Inspection of the linked cohort in terms of gestational age and birthweight distributions identified coding issues specific to individual hospital providers. For example, one hospital coded gestational age in days rather than weeks (e.g. 280 days rather than 40 weeks). As gestational age was truncated at two digits, this meant that the majority of babies within this hospital appeared to be born preterm with unfeasibly large birthweight (Fig 4). Similarly, birthweight was truncated at 2 or 3 digits for a small number of records, indicating weights recorded as kilograms rather than grams.

Fig 4. Distribution of birth weight by week of gestation in baby records.

Vertical lines show 3 standard deviations from the average; values above the upper limit are likely to have been miscoded as days (rather than weeks) of gestation, truncated to 2 digits.

Internal validity

Preterm birth.

Gestational age was available for 546,083/660,401 (83%) linked baby records and 567,699/660,401 (86%) linked maternal records for 2012/13. Completeness of gestational age increased to 92% when using information from either record. Gestational age was discordant on 4% of linked baby-maternal records, and the majority of these (71%) differed by 1 or 2 weeks only. For discordant records, the value in the maternal record seemed to be more accurate (based on birth weight for gestational age). Only 4 records had an ICD-10 code for preterm birth but a gestational age >37 weeks.

Multiple births.

Multiple birth status was discordant in 1860/660,401 (0.3%) record pairs, suggesting missing or inaccurate records. For 617 pairs, there was evidence of a multiple birth in the maternal record but not in the baby record. For 1243 pairs, there was evidence of multiple birth in the baby record, but only one maternal record.

Still births.

Still birth status was discordant in 1232/660,401 (0.2%) record pairs. For 1165 pairs, still birth was recorded in the maternal record but not the baby record. For 67 pairs, still birth was recorded in the baby record but not the maternal record. Discordant records were resolved by checking the baby’s length of stay: if length of stay was >1 day, still births were reclassified as live births. The majority of these errors were related to multiple births: maternal records with ICD-10 code Z373 (Twins, one liveborn and one stillborn) or birth status in the maternity record baby tail.

External validity

The linked birth cohort captured 660,401 births for 2012/13 (equating to 97% of total births in English hospitals according to the ONS) and was representative of national data in terms of distributions of gender (51.3% males in both data sources), gestational age, birth weight and maternal age (Fig 5). Although babies with adverse outcomes (lower gestational age, lower birth weight etc) were less likely to be linked, the absolute number of records that failed to link from these groups was low. Overall, there were no differences in any of the key characteristics or birth outcomes between ONS data and the linked cohort: still birth rates were 0.49% (ONS) and 0.54% (linked cohort); multiple birth rates were 3.17% (ONS) and 3.09% (linked cohort); preterm birth rates were 7.09% (ONS) and 7.29% (linked cohort).

Fig 5. Representativeness of linked HES cohort in terms of maternal age, birth weight and gestational age.

Dark shade = HES, light shade = Office for National Statistics.


Main findings

Our study demonstrates the feasibility of linking maternal and baby healthcare characteristics using a range of clinical and demographic variables captured in pseudonymised hospital data. We demonstrate a linkage approach that can be used to enhance information held on electronic health records but that does not require the release of any personal identifiers and therefore preserves existing levels of confidentiality within the data. Triangulating outcomes recorded in different hospital records can help improve data quality. Our methods are generalisable to linkage of administrative data in other contexts, where all available information can be combined into “indirect” identifiers for linkage.[29]

The main limitation of linking administrative or electronic healthcare data is the imperfect nature of data collected for reasons other than research.[30] Compared with data collection in a busy healthcare environment, research studies often have more capacity for quality control: for example, a birth cohort study or rolling survey is likely to be more complete due to more opportunities for validation, and a greater level of importance given to the accuracy of variables collected.[30] Furthermore, discordance within and between maternal and baby records in our study indicates that there remains uncertainty in coding of some conditions or events. However, our study also demonstrates that linkage can be used to generate high quality data, through triangulating outcomes coded in different hospital records, and improving ascertainment of outcomes by combining information from different sources.

We also demonstrate that validation of data quality using external sources (such as national birth registration data from ONS) can support the use of these data for specific purposes but also helps to highlight where limitations in the data lie. For example in this linked dataset, there remained some uncertainty about coding of still births within multiple birth pregnancies. The implications of any uncertainty or inconsistencies in coding or potential selection bias need to be carefully considered in light of the proposed use for the data. Quality of linked data should be carefully reported, e.g. by comparing characteristics of linked and unlinked records to identify potential sources of bias, so that researchers and policy makers can assess the relevance of the resulting data for their purposes.[30–32]

Errors occurring during linkage (missed-matches and false matches) can result in substantially biased results: false-matches can bias associations towards the null and missed-matches can lead to selection bias.[31, 33] Our evaluation of linkage quality supports evidence from other studies showing differing data quality between subgroups, as more babies at extremes of birth weight and gestational age remained unlinked.[33] However, probabilistic linkage produced a large sample of linked records (660,401: 97% of babies born in 2012/13) and comparisons with published data indicated that the linked data were nationally representative in terms of key birth characteristics and outcomes. Although there is no gold-standard for evaluating linkage quality for HES, and it was not possible to access personal identifiers to perform detailed manual review, synthetic data provide a convenient method for estimating false-match rates. The estimated false-match rate of 0.15% was unlikely to introduce any substantial bias into the linked data. Where this is not the case, statistical methods such as imputation can be considered to account for bias due to linkage error.[31, 34, 35]

In exploiting individual-level data for public benefit, data providers and data users have a responsibility both to ensure that confidential information is protected, and that the data are as accurate as possible. There is a growing body of literature on data confidentiality, some of which argues that individual-level data can never be truly anonymous, depending on external information available to individuals accessing that data.[36] However, there are a number of safeguards in place to protect against inadvertent misuse of data, and restrict the ability of any individual to purposefully behave in a way that jeopardizes data security. Firstly, researchers have a responsibility to use data for bona fide purposes only, and there are legal sanctions where data are used inappropriately or without due care. Data access approval processes require that researchers be regularly trained in information governance, to avoid any accidental data breaches. Secondly, secure physical locations (known as safe havens or safe pods) have been established for the processing and linkage of personal data, and are characterised by strict access arrangements, secure data transfer processes, restricted network and/or internet access, and tight disclosure control procedures.[36] Whilst using direct personal identifiers for this study could have helped achieve the highest level of accuracy in the resulting data, restricting the release of personal identifiers provides further protection against outsiders with malicious intent. In the context of current information governance and data protection regulations in the UK, researchers can very rarely access personal identifiers and more innovative linkage methods, such as those used in our study, are required.

Although our study only combined information for mothers and babies relating to the same admission (the delivery / birth episode), the longitudinal nature of HES allows admissions for the same individual to be linked over time. This means that in addition to enriching maternity data, this linkage provides an opportunity for evaluating how pre- and postnatal maternal medical histories (that are solely captured in maternal records) influence infant and childhood outcomes.[37] Such data are particularly useful for investigating the effect of exposures during pregnancy on outcomes throughout childhood, and could be enhanced further through linkage to different sources of data such as primary care and education. Linkage of retrospective electronic healthcare data can be useful for resolving data quality issues, and could be used to supplement evidence from cohort studies and prospective data collection such as the HSCIC maternity and children’s dataset. Ultimately, these data will improve our understanding of maternal risk factors for childhood outcomes, e.g. for assessing the effects of prenatal exposure to drugs or maternal mental health.[38, 39] Given appropriate safeguards, linked maternal-baby data could be made available as a resource for service evaluation and research, to complement linkage of prospective maternity and child health datasets in the UK.
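As a minimal sketch of how a mother-baby link could be combined with longitudinal maternal records, the example below attaches each mother's earlier admissions to her baby. The record layout, field names (`mother_id`, `admitted`, `diagnosis`), identifiers and dates are hypothetical and are not HES field names or study data; the diagnosis values are ICD-10 codes used purely for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical maternal admissions, keyed by a pseudonymised mother ID.
maternal_admissions = [
    {"mother_id": "M1", "admitted": date(2011, 3, 2), "diagnosis": "F32"},
    {"mother_id": "M1", "admitted": date(2012, 6, 10), "diagnosis": "O80"},
    {"mother_id": "M2", "admitted": date(2012, 7, 1), "diagnosis": "O80"},
]

# Hypothetical output of mother-baby linkage: baby ID -> (mother ID, date of birth).
mother_baby_links = {
    "B1": ("M1", date(2012, 6, 10)),
    "B2": ("M2", date(2012, 7, 1)),
}


def prior_maternal_history(admissions, links):
    """Return, for each baby, the mother's admissions before the birth date."""
    by_mother = defaultdict(list)
    for adm in admissions:
        by_mother[adm["mother_id"]].append(adm)
    history = {}
    for baby_id, (mother_id, birth_date) in links.items():
        history[baby_id] = sorted(
            (a for a in by_mother[mother_id] if a["admitted"] < birth_date),
            key=lambda a: a["admitted"],
        )
    return history


history = prior_maternal_history(maternal_admissions, mother_baby_links)
```

The same grouping could be applied to admissions after the birth date to build postnatal maternal histories, or to the baby's own later admissions to follow childhood outcomes.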


Conclusions

Probabilistic linkage of maternal and baby healthcare characteristics offers an efficient way to enrich maternity data, improve data quality, and create longitudinal cohorts for research and service evaluation, without the use of direct patient identifiers. Combining information from multiple sources can help to address data quality issues in electronic health data, and the approaches described here could be extended to other administrative data sources. Linked maternal-baby hospital records in England provide a nationally representative resource for service evaluation and research on the impact of maternal risk factors and interventions on outcomes in childhood.


Acknowledgments

The authors would like to thank Astrid Guttmann and Harvey Goldstein for their input to this work. The authors would also like to acknowledge the input of Hannah Knight and Ipek Gurol from the Royal College of Obstetricians and Gynaecologists, and to thank Lynn Copley for facilitating data management at the Royal College of Surgeons and Sheila Bird for her support.

Data availability statement: Authors do not have permission to share patient-level HES data. HES data are available from the NHS Digital Data Access Advisory Group for researchers who meet the criteria for access to confidential data.

Author Contributions

  1. Conceptualization: KH RG DC JvM.
  2. Formal analysis: KH.
  3. Methodology: KH.
  4. Supervision: RG JvM.
  5. Validation: KH.
  6. Visualization: KH.
  7. Writing – original draft: KH.
  8. Writing – review & editing: KH DC RG JvM.


References

  1. Vigod SN, Gomes T, Wilton AS, Taylor VH, Ray JG. Antipsychotic drug use in pregnancy: high dimensional, propensity matched, population based cohort study. BMJ. 2015;350:h2298. pmid:25972273
  2. Riordan DV, Morris C, Hattie J, Stark C. Family size and perinatal circumstances, as mental health risk factors in a Scottish birth cohort. Soc Psychiatry Psychiatr Epidemiol. 2012;47(6):975–83. pmid:21667190
  3. Ford JB, Roberts CL, Taylor LK. Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatr Perinat Epidemiol. 2006;20(4):329–37. pmid:16879505
  4. Kamphuis E, Koullali B, Hof M, de Groot C, Kazemier B, Mol BW, et al. Fetal gender of the first born and the recurrent risk of spontaneous preterm birth. Am J Obstet Gynecol. 2015;212(1):S386.
  5. Howard LM, Goss C, Leese M, Appleby L, Thornicroft G. The psychosocial outcome of pregnancy in women with psychotic disorders. Schizophr Res. 2004;71(1):49–60. pmid:15374572
  6. Meeraus WH, Petersen I, Gilbert R. Association between antibiotic prescribing in pregnancy and cerebral palsy or epilepsy in children born at term: A cohort study using The Health Improvement Network. PLoS ONE. 2015;10(3):e0122034. pmid:25807115
  7. Health and Social Care Information Centre. Maternity Services Data Set (MSDS) Data Model v1.5 2015 [30/10/15]. Available from:
  8. Johnson KE, Beaton SJ, Andrade SE, Cheetham TC, Scott PE, Hammad TA, et al. Methods of linking mothers and infants using health plan data for studies of pregnancy outcomes. Pharmacoepidemiol Drug Saf. 2013;22(7):776–82. pmid:23596095
  9. Bird TM, Bronstein JM, Hall RW, Lowery CL, Nugent R, Mays GP. Late preterm infants: birth outcomes and health care utilization in the first year. Pediatrics. 2010;126(2):e311–e9. pmid:20603259
  10. MacKay DF, Smith GCS, Dobbie R, Pell JP. Gestational age at delivery and Special Educational Need: retrospective cohort study of 407,503 schoolchildren. PLoS Med. 2010;7(6):e1000289. pmid:20543995
  11. Lain SJ, Nassar N, Bowen JR, Roberts CL. Risk factors and costs of hospital admissions in first year of life: a population-based study. J Pediatr. 2013;163(4):1014–9. pmid:23769505
  12. Oliver-Williams C, Fleming M, Wood AM, Smith GCS. Previous miscarriage and the subsequent risk of preterm birth in Scotland, 1980–2008: a historical cohort study. BJOG. 2015;online first. pmid:25626593
  13. Adams-Chapman I, Hansen NI, Shankaran S, Bell EF, Boghossian NS, Murray JC, et al. Ten-year review of major birth defects in VLBW infants. Pediatrics. 2013;132(1):49–61. pmid:23733791
  14. Pearson H. Massive study to follow 80,000 British babies cancelled. Nature. 2015;(526):620–1.
  15. Murray J, Saxena S, Modi N, Majeed A, Aylin P, Bottle A, et al. Quality of routine hospital birth records and the feasibility of their use for creating birth cohorts. Journal of Public Health. 2012. pmid:22967908
  16. Dattani N, Datta-Nemdharry P, Macfarlane A. Linking maternity data for England 2007: methods and data quality. Health Stat Q. 2012;53:4–21.
  17. Benchimol EI, Langan S, Guttmann A. Call to RECORD: the need for complete reporting of research using routinely collected health data. J Clin Epidemiol. 2013;66(7):703–5. pmid:23186992
  18. Knight HE, Gurol-Urganci I, Mahmood TA, Templeton A, Richmond D, van der Meulen JH, et al. Evaluating maternity care using national administrative health datasets: How are statistics affected by the quality of data on method of delivery? BMC Health Serv Res. 2013;13(1):200.
  19. Health and Social Care Information Centre. Methodology for creation of the HES Patient ID (HESID). 2014.
  20. ICD. International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM). 2011 28/10/13. Report No.
  21. Health and Social Care Information Centre. HES Data Dictionary: Admitted Patient Care V2.0 2015 [Accessed 21/10/2015]. Available from:
  22. Health and Social Care Information Centre. Hospital Episode Statistics: Outpatient Data Dictionary 2010 [Accessed 26/05/2015]. Available from:
  23. Health and Social Care Information Centre. Hospital Episode Statistics: A&E Data Dictionary 2009 [26/05/2015]. Available from:
  24. Hagger-Johnson G, Harron K, Gonzalez-Izquierdo A, Cortina-Borja M, Dattani N, Muller-Pebody B, et al. Identifying false matches in anonymised hospital administrative data without patient identifiers. Health Serv Res. 2014;(online first).
  25. Randall SM, Ferrante AM, Boyd JH, Semmens JB. The effect of data cleaning on record linkage quality. BMC Med Res Methodol. 2013;13(64). pmid:23739011
  26. Blakely T, Salmond C. Probabilistic record linkage and a method to calculate the positive predictive value. Int J Epidemiol. 2002;31(6):1246–52. pmid:12540730
  27. Winglee M, Valliant R, Scheuren F. A case study in record linkage. Surv Methodol. 2005;31(1):3–11.
  28. Cole TJ, Statnikov Y, Santhakumaran S, Pan H, Modi N, on behalf of the Neonatal Data Analysis Unit and the Preterm Growth Investigator Group. Birth weight and longitudinal growth in infants born below 32 weeks’ gestation: a UK population study. Arch Dis Child Fetal Neonatal Ed. 2014;99(1):F34–F40. pmid:23934365
  29. Hammill B, Hernandez A, Peterson E, Fonarow G, Schulman K, Curtis L. Linking inpatient clinical registry data to Medicare claims data using indirect identifiers. Am Heart J. 2009;157(6):995. pmid:19464409
  30. Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Med. 2015;12(10):e1001885. pmid:26440803
  31. Harron K, Wade A, Gilbert R, Muller-Pebody B, Goldstein H. Evaluating bias due to data linkage error in electronic healthcare records. BMC Med Res Methodol. 2014;14(1):36.
  32. Harron K, Wade A, Muller-Pebody B, Goldstein H, Gilbert R. Opening the black box of record linkage. J Epidemiol Commun H. 2012;66(12):1198. pmid:22705654
  33. Bohensky M. Chapter 4: Bias in data linkage studies. In: Harron K, Dibben C, Goldstein H, editors. Methodological Developments in Data Linkage. London: Wiley; 2015.
  34. Goldstein H, Harron K, Wade A. The analysis of record-linked data using multiple imputation with data value priors. Stat Med. 2012;31(28):3481–93. pmid:22807145
  35. Harron K, Dibben C, Goldstein H. Methodological developments in data linkage: Wiley; 2015.
  36. Dibben C, Elliot M, Gowans H, Lightfoot D. Chapter 3: The data linkage environment. In: Harron K, Dibben C, Goldstein H, editors. Methodological Developments in Data Linkage. London: Wiley; 2015.
  37. Lain SJ, Hadfield RM, Raynes-Greenow CH, Ford JB, Mealing NM, Algert CS, et al. Quality of data in perinatal population health databases: a systematic review. Med Care. 2012;50(4):e7–e20. pmid:21617569
  38. Chan E, Quigley MA. School performance at age 7 years in late preterm and early term birth: a cohort study. Arch Dis Child Fetal Neonatal Ed. 2014;99(6):F451–F7. pmid:24966128
  39. Sutter-Dallay A, Bales M, Pambrun E, Glangeaud-Freudenthal N, Wisner K, Verdoux H. Impact of prenatal exposure to psychotropic drugs on neonatal outcome in infants of mothers with serious psychiatric illnesses. J Clin Psychiat. 2015.