Fig 1.
ForestQC takes a raw variant call set in VCF format as input and calculates statistics for each variant, including MAF, mean depth, and mean genotyping quality. In the filtering step, it separates the variant call set into high-quality, low-quality, and undetermined variants by applying hard filters, such as Mendelian error rate and genotype missing rate. In the classification step, the high-quality and low-quality variants are used to train a random forest model, which is then applied to assign labels to the undetermined variants. Undetermined variants predicted to be high-quality are combined with the high-quality variants from the filtering step to form the final set of high-quality variants. The same procedure yields the final set of low-quality variants.
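The filter, classify, and merge flow described in this caption can be sketched as follows. This is a minimal illustration, not ForestQC's code: the threshold values, the two statistics used, and the names `hard_filter`, `run_qc`, and `nearest_centroid` are all hypothetical, and a nearest-centroid rule stands in for the random forest so that the example needs only the standard library.

```python
from typing import Callable, Dict, List, Tuple

Variant = Dict[str, float]  # per-variant statistics

# Hypothetical thresholds for illustration only (not ForestQC's values).
GOOD = {"me_rate": 0.01, "missing_rate": 0.02}  # at or below all of these: high-quality
BAD = {"me_rate": 0.05, "missing_rate": 0.10}   # above any of these: low-quality


def hard_filter(v: Variant) -> str:
    """Filtering step: label one variant 'high', 'low', or 'undetermined'."""
    if any(v[k] > BAD[k] for k in BAD):
        return "low"
    if all(v[k] <= GOOD[k] for k in GOOD):
        return "high"
    return "undetermined"


def nearest_centroid(high: List[Variant], low: List[Variant],
                     undetermined: List[Variant]) -> List[str]:
    """Stand-in for the random forest: pick the closer class centroid.

    Assumes both training groups are non-empty.
    """
    keys = sorted(GOOD)

    def centroid(group: List[Variant]) -> List[float]:
        return [sum(v[k] for v in group) / len(group) for k in keys]

    def dist(v: Variant, c: List[float]) -> float:
        return sum((v[k] - ci) ** 2 for k, ci in zip(keys, c))

    c_high, c_low = centroid(high), centroid(low)
    return ["high" if dist(v, c_high) <= dist(v, c_low) else "low"
            for v in undetermined]


def run_qc(variants: List[Variant],
           train_and_predict: Callable[..., List[str]]
           ) -> Tuple[List[Variant], List[Variant]]:
    """Filter, classify the undetermined variants, then merge the sets."""
    groups: Dict[str, List[Variant]] = {"high": [], "low": [], "undetermined": []}
    for v in variants:
        groups[hard_filter(v)].append(v)
    # Classification step: train on high/low, predict the undetermined set.
    labels = train_and_predict(groups["high"], groups["low"],
                               groups["undetermined"])
    final_high, final_low = list(groups["high"]), list(groups["low"])
    for v, label in zip(groups["undetermined"], labels):
        (final_high if label == "high" else final_low).append(v)
    return final_high, final_low


# Made-up statistics for four variants.
variants = [
    {"me_rate": 0.00, "missing_rate": 0.01},  # passes every filter
    {"me_rate": 0.06, "missing_rate": 0.12},  # fails the filters
    {"me_rate": 0.02, "missing_rate": 0.03},  # undetermined
    {"me_rate": 0.04, "missing_rate": 0.09},  # undetermined
]
final_high, final_low = run_qc(variants, nearest_centroid)
print(len(final_high), len(final_low))
```

Passing the classifier in as a callable keeps the filtering and merging logic independent of the model, so a real random forest could be substituted without touching `run_qc`.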
Table 1.
Performance of eight different machine learning algorithms.
Fig 2.
Overall quality of high-quality variants in the BP dataset detected by four different methods.
(a) The ME rate, (b) the genotype discordance rate, and (c) the missing rate of high-quality SNVs. (d) The ME rate and (e) the missing rate of high-quality indels. Data are represented as the mean ± SEM.
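For reference, a mean ± SEM summary can be computed as in the sketch below. It assumes the SEM is the sample standard deviation divided by the square root of the number of observations; the caption does not state which units the mean is taken over, and the helper name `mean_sem` and the input values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, SEM), with SEM = sample standard deviation / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two values")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


# Made-up rates in percent, for illustration only.
mean, sem = mean_sem([1.0, 2.0, 3.0, 4.0])
print(f"{mean:.2f} ± {sem:.2f}")
```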
Table 2.
Variant-level quality metrics of high-quality variants in the BP dataset processed by different methods.
Fig 3.
Sample-level quality metrics of high-quality variants in the BP dataset identified by four different methods.
(a) Ti/Tv ratio of SNVs not found in dbSNP. (b) The total number of SNVs. (c) The number of SNVs not found in dbSNP. (d) The total number of indels. dbSNP version 150 was used.
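The Ti/Tv ratio in panel (a) is the number of transitions (A↔G, C↔T) divided by the number of transversions (all other single-base substitutions). A minimal sketch, assuming biallelic SNVs given as uppercase (ref, alt) pairs from which variants present in dbSNP have already been removed; the function name `ti_tv_ratio` and the example data are hypothetical:

```python
from typing import Iterable, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Return the transition/transversion ratio for (ref, alt) pairs."""
    pairs = list(snvs)
    ti = sum(1 for pair in pairs if pair in TRANSITIONS)
    tv = len(pairs) - ti  # every non-transition SNV is a transversion
    if tv == 0:
        raise ValueError("no transversions: Ti/Tv is undefined")
    return ti / tv


# Made-up SNVs for one sample: four transitions and two transversions.
novel_snvs = [("A", "G"), ("C", "T"), ("G", "A"),
              ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(novel_snvs))
```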
Fig 4.
Overall quality of high-quality variants in the PSP dataset detected by four different methods.
(a) The missing rate and (b) the genotype discordance rate of high-quality SNVs. (c) The missing rate of high-quality indels. Data are represented as the mean ± SEM.
Fig 5.
Sample-level quality metrics of high-quality variants in the PSP dataset identified by four different methods.
(a) Ti/Tv ratio of SNVs not found in dbSNP. (b) The total number of SNVs. (c) The number of SNVs not found in dbSNP. (d) The total number of indels. dbSNP version 150 was used.
Table 3.
Variant-level quality metrics of high-quality variants in the PSP dataset processed by four different methods.