Computational framework for targeted high-coverage sequencing based NIPT

Non-invasive prenatal testing (NIPT) enables accurate detection of fetal chromosomal trisomies. The majority of existing computational methods for sequencing-based NIPT analyses rely on low-coverage whole-genome sequencing (WGS) data and are not applicable to targeted high-coverage sequencing data from cell-free DNA samples. Here, we present a novel computational framework for targeted high-coverage sequencing-based NIPT analysis. The developed methods use a hidden Markov model (HMM)-based approach in conjunction with supplemental machine learning methods, such as decision tree (DT) and support vector machine (SVM), to detect fetal trisomy and the parental origin of additional fetal chromosomes. These methods were tested with simulated datasets covering a wide range of biologically relevant scenarios with various chromosomal quantities, parental origins of extra chromosomes, fetal DNA fractions and sequencing read depths. Consequently, we determined the functional feasibility and limitations of each proposed approach and demonstrated that the read count-based HMM achieved the best overall classification accuracy of 0.89 for detecting fetal euploidies and trisomies. Furthermore, we show that by using the DT and SVM methods on the HMM state classification results, it was possible to increase the final trisomy classification accuracy to 0.98 and 0.99, respectively. We demonstrated that read count and allelic ratio-based models can achieve a high accuracy (up to 0.98) for detecting fetal trisomy even if the fetal fraction is as low as 2%. Existing methods require a fetal fraction of at least 4%, which can be an issue in the case of early gestational age (<10 weeks) or elevated maternal body mass index (>35 kg/m²).
More accurate detection can be achieved at higher sequencing depth using the HMM in conjunction with supplemental methods, which significantly improve trisomy detection, especially in borderline scenarios (e.g., very low fetal fraction), and could enable NIPT to be performed even earlier than 10 weeks of pregnancy.

Compared to the WGS-based methods, targeted approaches require less cfDNA and make it possible to study more samples in parallel, making them a cost-efficient alternative. A few already available targeted solutions rely on sequencing single nucleotide polymorphisms (SNPs). In these cases, allelic information from sequencing read counts can be used to calculate allelic ratios obtained from heterozygous SNPs and also serve as an extra source of information for inferring fetal aneuploidies (17). For example, NATUS software, developed by Natera, Inc., considers parental genotypes and crossover frequency data to calculate the expected allele

In addition, we generated allele counts for each SNP according to the mean sequencing coverage and FF of the dataset. One might assume that all reads in a given region would follow a Poisson distribution with a mean proportional to the copy number of the region. However, due to various technical biases, the process is over-dispersed, and the simulation therefore used a negative binomial distribution with a variance-to-mean ratio of 3 (22).

Allelic ratio calculation
Based on the simulated data, we calculated the allelic ratio for every "informative" SNP. Only SNPs that were heterozygous in the mother and/or fetus were considered informative. If both alleles have an equal likelihood of occurrence (MAF = 0.5), on average 75% of SNPs were informative in the case of maternally originated trisomy, and the proportion of informative SNPs was even higher in the case of paternally originated trisomy, as both paternal alleles contributed to heterozygosity independently. The allelic ratio was defined as the number of sequencing reads carrying the major allele of a variant divided by the number of sequencing reads carrying the minor allele.
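
The over-dispersed count simulation and the allelic ratio defined above can be sketched in a few lines. This is a minimal illustration, not the simulation code used in the study: the function names, the example coverage of 1,000 reads, the 10% FF, the independent sampling of the two alleles, and the rounding of the dispersion parameter to an integer are all assumptions made here.

```python
import math
import random


def sample_read_count(mean, vmr, rng):
    """Draw one negative binomial count with the given mean and
    variance-to-mean ratio (vmr > 1).

    Assumption: the dispersion parameter r = mean / (vmr - 1) is rounded to
    an integer so the count can be drawn as a sum of r geometric variables.
    """
    if vmr <= 1:
        raise ValueError("vmr must exceed 1 for a negative binomial")
    p = 1.0 / vmr                          # success probability
    r = max(1, round(mean / (vmr - 1.0)))  # number of successes
    log_q = math.log(1.0 - p)
    # Each term is the number of failures before one success (inverse CDF).
    return sum(int(math.log(1.0 - rng.random()) / log_q) for _ in range(r))


def allelic_ratio(count_a, count_b):
    """Reads carrying the major allele divided by reads carrying the minor one."""
    major, minor = max(count_a, count_b), min(count_a, count_b)
    # Assumption: a SNP without minor-allele reads gets an infinite ratio.
    return math.inf if minor == 0 else major / minor


rng = random.Random(0)
coverage, ff = 1000, 0.10  # hypothetical mean coverage and fetal fraction
# Euploid locus, mother AA and fetus AB: allele B is carried only by the
# paternal half of the fetal cfDNA, i.e. an expected share of ff / 2.
reads_a = sample_read_count(coverage * (1 - ff / 2), 3, rng)
reads_b = sample_read_count(coverage * ff / 2, 3, rng)
print(reads_a, reads_b, allelic_ratio(reads_a, reads_b))
```

With these expected shares, the ratio for a SNP that is heterozygous only in the fetus is roughly (2 - FF) / FF, which is far above 1, whereas a SNP that is heterozygous in the mother gives a ratio close to 1.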

Fetal fraction calculation
FF is the proportion of fetal cfDNA in total cfDNA. We estimated the FF of a cfDNA sample using the allelic counts of the sample's reference chromosome. First, we selected the informative SNPs on the reference chromosome where the mother was homozygous and the fetus was heterozygous (allelic ratio > 2.5). In this subset, the major allele count was the sum of the maternal allele counts and 1/2 of the fetal allele count. The minor allele count was proportional to 1/2 of the fetal allele count. The FF was calculated as the median value of the ratios between 2 × minor allele counts and the sum of major and minor allele counts.
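
The estimate described in this paragraph can be written as a short function. The sketch below follows the stated filter (allelic ratio > 2.5) and the median of 2 × minor / (major + minor); the function name, the input layout, and the handling of SNPs without minor-allele reads are assumptions made for illustration.

```python
from statistics import median


def estimate_fetal_fraction(allele_counts, min_ratio=2.5):
    """Estimate FF from (major, minor) read counts of reference-chromosome SNPs.

    Keeps SNPs whose allelic ratio exceeds min_ratio (mother homozygous,
    fetus heterozygous) and returns the median of 2 * minor / (major + minor).
    """
    per_snp = []
    for major, minor in allele_counts:
        if minor == 0:
            continue  # assumption: no minor-allele reads means not informative
        if major / minor > min_ratio:
            per_snp.append(2 * minor / (major + minor))
    if not per_snp:
        raise ValueError("no SNP passed the allelic ratio filter")
    return median(per_snp)


# Hypothetical counts: three fetus-only heterozygous SNPs and one SNP that is
# heterozygous in the mother (ratio close to 1, filtered out).
counts = [(950, 50), (900, 100), (960, 40), (510, 490)]
print(estimate_fetal_fraction(counts))
```

The factor of 2 reflects that only one of the two fetal alleles carries the minor allele at these SNPs, and the median limits the influence of individual noisy SNPs.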
The FF of a sample was calculated using the following formula:

$$\mathrm{FF} = \operatorname{median}\left(\frac{2 \times \text{minor allele count}}{\text{major allele count} + \text{minor allele count}}\right)$$

read counts (Fig A in S1 Fig), (2) allelic ratios (Fig B in S1 Fig), and (3) the combination of both read counts and allelic ratios (Fig B in S1 Fig). Second, we estimated the parameters for the models empirically using a simulated training dataset. Finally, we used the Viterbi algorithm to find the most likely underlying fetal condition behind each SNP.

(Table in S6 Table). The possible outcome states of the model are "euploidy", "trisomy", and "paternal trisomy". Although the "trisomy" condition includes loci typical to both maternally and paternally originated trisomy, here we associated "trisomy" with maternally originated trisomy to avoid over-estimation of paternally originated trisomy.

Results and Discussion
We developed three novel HMM-based statistical methods to detect fetal chromosomal trisomies from targeted sequencing assays. In addition to a naïve HMM-based frequentist approach for trisomy detection, we applied two machine learning (ML) methods to infer fetal trisomy. While considering a wide range of biologically and technically motivated conditions, we simulated datasets mimicking cfDNA sequencing assays and used these data to perform a comprehensive evaluation of our proposed computational methods (Fig 1).

Novel HMM-based methods for trisomy detection
By considering the sequencing read counts (RC) of targeted loci, allelic ratios (AR) of targeted SNPs, or both (RCAR), the developed HMM models were used to classify consecutive target loci on a studied chromosome into pre-defined underlying states. In the 2-state RC model, these states represented fetal euploidy and trisomy (Fig A in S1 Fig). In the 7-state AR and RCAR models, the states can occur with fetal euploidy or with maternally or paternally originated trisomy (Fig B in S1 Fig). Consequently, the proportion of loci classified into these distinct states can be used to estimate the fetal condition of each studied chromosome (see "Fetal condition estimation" in Methods). Although such naïve classification works relatively well in scenarios with high sequencing read depth (RD) and fetal fraction (FF), the proportions of loci classified into the underlying states can be similar, and thus difficult to distinguish unambiguously, when RD and FF are low (Fig 2).
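
To make the locus-level classification concrete, the following sketch decodes a 2-state read count HMM with the Viterbi algorithm and then calls the chromosome by majority vote over loci. It is an illustration under stated assumptions, not the published model: the negative binomial emissions with a variance-to-mean ratio of 3, the trisomy mean of coverage × (1 + FF/2), the transition probability of 0.99 for staying in a state, the equal start probabilities, and the 0.5 majority threshold are all chosen here for the example, whereas the study estimated its parameters empirically from training data.

```python
import math


def nb_logpmf(k, mean, vmr):
    """Log-probability of count k under a negative binomial distribution
    parameterised by its mean and variance-to-mean ratio (vmr > 1)."""
    p = 1.0 / vmr
    r = mean / (vmr - 1.0)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for the observations."""
    score = {s: log_start[s] + log_emit(s, obs[0]) for s in states}
    backpointers = []
    for x in obs[1:]:
        prev, score, pointer = score, {}, {}
        for s in states:
            best = max(states, key=lambda q: prev[q] + log_trans[q][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit(s, x)
            pointer[s] = best
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


def call_chromosome(path, state="trisomy"):
    """Majority vote over loci; returns the call and the state proportion."""
    share = path.count(state) / len(path)
    return (state if share > 0.5 else "euploidy"), share


# Assumed parameters for illustration only.
coverage, ff, vmr, stay = 1000, 0.10, 3.0, 0.99
means = {"euploidy": coverage, "trisomy": coverage * (1 + ff / 2)}
states = list(means)
log_start = {s: math.log(0.5) for s in states}
log_trans = {a: {b: math.log(stay if a == b else 1 - stay) for b in states}
             for a in states}


def log_emit(state, count):
    return nb_logpmf(count, means[state], vmr)


read_counts = [1100, 1080, 1120, 1090, 1110, 1100]  # hypothetical loci
print(call_chromosome(viterbi(read_counts, states, log_start, log_trans, log_emit)))
```

Because the expected shift between the two states is only FF/2 of the coverage, single loci are weakly informative, and the sticky transitions are what let evidence accumulate along the chromosome. At low RD and FF the state proportions drift toward each other, which is the ambiguity described above.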
Therefore, the precise calculation of FF is also crucial for controlling the precision and uncertainty of fetal trisomy detection and sequencing-based NIPT. Notably, for the RC model and autosomal chromosomes, no information is available to infer the FF of the studied sample, so sample-specific optimal model parameters cannot be selected. One possible solution to overcome this challenge is to use the expected median FF of 10% (23). In the case of the AR and RCAR models, we used informative polymorphic SNPs with heterozygous alleles in the mother and/or fetus to infer the sample-specific FF (Fig in S2 Fig), similarly to previous studies (24-26). Additionally, in the case of the AR and RCAR models, allelic count data at informative SNPs can be used to calculate allelic ratios, distinguishing maternally and paternally originated trisomies (see "Allelic ratio calculation" in Methods) according to their distinct allelic patterns (Table in S6 Table). On the other hand, these models only consider informative targeted SNPs that are polymorphic in a given sample, which reduces the total number of analyzed SNPs by at least 25% and therefore somewhat decreases the detection accuracy (data not shown).
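
The share of informative SNPs behind this reduction can be checked by enumerating genotypes. The sketch below assumes two equally frequent alleles (MAF = 0.5), independent parental genotypes, and, for trisomy, that the fetus inherits both homologs of the contributing parent without recombination; these modelling choices and the function names are assumptions made here, not details taken from the study.

```python
from fractions import Fraction
from itertools import product

ALLELES = "AB"  # two alleles with equal frequency (MAF = 0.5)


def is_het(genotype):
    """True if the genotype carries more than one distinct allele."""
    return len(set(genotype)) > 1


def informative_fraction(origin):
    """Share of SNPs heterozygous in mother and/or fetus.

    origin: "euploid", "maternal" (extra maternal copy) or "paternal".
    """
    hits = total = 0
    for mother in product(ALLELES, repeat=2):
        for father in product(ALLELES, repeat=2):
            # Which allele each parent transmits when it passes on one copy.
            for m_pick, f_pick in product(range(2), repeat=2):
                if origin == "maternal":
                    fetus = mother + (father[f_pick],)
                elif origin == "paternal":
                    fetus = (mother[m_pick],) + father
                else:
                    fetus = (mother[m_pick], father[f_pick])
                total += 1
                hits += is_het(mother) or is_het(fetus)
    return Fraction(hits, total)


for origin in ("euploid", "maternal", "paternal"):
    print(origin, informative_fraction(origin))
```

Under these assumptions the enumeration gives 3/4 for the euploid and maternal cases and 7/8 for the paternal case, in line with the 75% figure quoted in Methods and with the higher proportion reported for paternally originated trisomy.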

Supplemental methods for trisomy detection
Because the previously described HMM-based models did not unambiguously infer the underlying fetal condition in some scenarios, such as paternally originated trisomy (Fig 2), we developed two additional "supplemental" machine learning (ML)-based methods to improve the sample classification accuracy. The supplemental methods, which take HMM-classified state proportions as input, significantly improved the sample classification, especially when the proportion of loci inferred into one or the other HMM state was not an obvious majority and where the frequentist approach, therefore, did not work (

shortcoming is due to the fixed FF parameter rather than the properties of the DT (Fig 3, Fig in S3 Fig).

was ≥ 6% and RD was higher than 10,000 (Fig 4). In contrast to the DT and the SVM methods, it was unable to detect paternally originated trisomy in a given range of FF and RD (Fig in S4 Fig).
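
The idea of a supplemental classifier that takes HMM state proportions as input can be illustrated with the simplest possible decision tree, a single learned split. This is a deliberately reduced sketch: the DT and SVM used in the study are not specified in this section, so the single feature, the training values, and the function names below are hypothetical.

```python
def fit_stump(proportions, is_trisomy):
    """Learn the cut-off on the trisomy-state proportion that classifies the
    training samples best (a decision tree with a single split)."""
    values = sorted(set(proportions))
    if len(values) < 2:
        raise ValueError("need at least two distinct proportions")
    best_cut, best_correct = None, -1
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2  # candidate split between neighbouring values
        correct = sum((x > cut) == y for x, y in zip(proportions, is_trisomy))
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


def predict(cut, proportion):
    """Classify one sample from its trisomy-state proportion."""
    return "trisomy" if proportion > cut else "euploidy"


# Hypothetical training data: share of loci the HMM assigned to "trisomy".
train_x = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60]
train_y = [False, False, False, True, True, True]
cut = fit_stump(train_x, train_y)
print(cut, predict(cut, 0.40))
```

In this toy data the learned cut-off falls below 0.5, so a sample with 40% of loci in the trisomy state is called trisomy even though a plain majority vote would call it euploid. That is the kind of borderline case in which a trained classifier can outperform the frequentist rule.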
Unlike the read count data, allelic ratio information was used to estimate the FF of a sample using specific allelic patterns (Table in S6 Table). In addition, allelic ratio data were used to separate maternally and paternally originated trisomies. As for the HMM, the inability to detect paternally originated trisomy can be explained by the overlapping emission distributions of the allelic ratios of maternally and paternally originated trisomies.

In general, the supplementary methods increased the detection accuracy for the AR model significantly (Table 2), especially in the case of paternally originated trisomy (Table 1

In the case of maternally originated trisomy, all three methods had similar characteristics, as the detection accuracy was positively correlated with both sequencing RD and FF (Fig 4). The read count had a stronger impact on the AR model, whereas the RC model was mostly affected by FF. The DT had a slight fetal trisomy detection improvement