SVCurator: A Crowdsourcing app to visualize evidence of structural variants for the human genome

A high quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is yet to be defined. In this study, we manually curated 1235 SVs which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator is a Python Flask-based web platform that displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. The crowdsourced results were highly concordant, with 37 out of the 61 curators having at least 78% concordance with a set of ‘expert’ curators, where there was 93% concordance amongst ‘expert’ curators. This produced high confidence labels for 935 events. When compared to the heuristic-based draft benchmark SV callset from GIAB, the SVCurator crowdsourced labels were 94.5% concordant with the benchmark set. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.

Background
Structural variants (SVs) are typically defined as DNA variants ≥ 50 base pairs (bp) and include insertions, deletions, duplications, and inversions 1. SVs have been linked to a number of human diseases 2. Recent next-generation sequencing technologies and analysis algorithms have substantially improved the discovery of SVs. However, identifying SVs with high confidence remains a challenge, as evidenced by inconsistent predictions of SVs across different methods 3. Several groups, including Greenside et al. 4, have demonstrated that crowdsourcing applications can be effective for generating labeled data for putative SVs. Recently, SV-Plaudit was used to evaluate 1350 SVs (97% deletions), and allowed participants to evaluate candidate SVs using samplot, which displays images representing short and long read sequencing technologies 5,6. The web-based platform, Plotcritic, renders samplot images and provides users with an interface to evaluate putative SVs 5.
In the current study, we generated a list of SVs with SV type, size, and genotype labels, which can ultimately be used to train machine learning models to characterize properties of a benchmark genome. These data were generated via a Python Flask-based web application (app), SVCurator, that allows users to evaluate large indels and SVs from a single human genome, that of the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. The platform allows users to inspect and classify large indels and SVs by providing a variety of IGV and svviz2 images from short, long, and linked read sequencing data for putative SVs randomly sampled from candidate calls. These candidate calls were generated from over 30 variant callers using data produced from five different sequencing technologies. To evaluate the accuracy of curations, we discuss the levels of concordance with heuristic-based labels assigned to events within the GIAB v0.6 sequence-resolved SV calls for HG002.

Results
SVCurator platform overview
SVCurator is a Python Flask-based web platform (Fig 1) we developed to evaluate putative large indels ≥20bp and SVs from the union of callsets from diverse technologies and calling methods for the Genome in a Bottle (GIAB) Ashkenazi Jewish Trio son (HG002/NA24385) [ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenaziTrio/analysis/NIST_UnionSVs_12122017/SVmerge121217/]. Curators evaluated 1295 SV calls (579 deletions and 716 insertions) that were randomly selected from a pool of candidate variants binned by size (Fig 2). For each SV, SVCurator displays a number of images developed and recommended by experts from the GIAB consortium. Extensive data was generated from short, long, and linked-read whole genome sequencing technologies by the GIAB consortium. These data include Illumina 250bp paired end sequencing, Illumina 150bp paired end sequencing, Illumina 6kb mate-pair, haplotype-partitioned PacBio and haplotype-partitioned 10x Genomics (Supplementary Fig 1) 7. Svviz2 8 was used to generate images of reads from each dataset aligned to the reference or alternate alleles. Svviz2 was also used to generate dotplots to visualize repetitive regions in the reference and alternate haplotypes and alignments of individual reads to the haplotypes. Images of Illumina 250x250bp paired end sequencing, haplotype-partitioned PacBio and haplotype-partitioned 10x Genomics in Integrative Genomics Viewer (IGV) were also included 7.
Participants were asked to evaluate each call and determine whether an SV exists at each site within 20% of the called size of the variant, assign a label describing the variant genotype ["Homozygous Reference", "Heterozygous Variant", "Homozygous Variant", "Complex or difficult"] and a confidence score for the variant genotype (GT) assigned.
Concordance amongst curators in evaluation of SVCurator events
Curators were recruited from the GIAB analysis team and the genomics community through GIAB email lists and a GIAB Twitter account announcement. 136 participants registered to use the app, 61 of whom evaluated events. Of the 1295 events, 1290 events were curated at least 3 times (Supplementary Fig 2 [general distribution]). The average time to curate each event was 47.31 seconds (Supplementary Fig 3). To select curator responses for label evaluation, labels assigned by each curator were compared to labels assigned by a set of seven 'expert' curators from the GIAB Analysis Team who had experience curating SVs. The expert consensus label was assigned to each event by simple voting (i.e., the label assigned by the most 'expert' curators). The percent concordance for an event was defined as the number of 'expert' curators who agreed on the consensus label divided by the total number of expert curators who evaluated the event. On average, the 'expert' curators were 93% concordant on the labels assigned to each event. Each 'expert' was also assigned a percent concordance score based on the level of concordance between their assigned label and the consensus label from the remaining experts.
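The simple-voting consensus and the leave-one-out concordance score described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the nested-dictionary input layout, and the handling of ties (tied events are skipped) are assumptions, since the text does not specify them.

```python
from collections import Counter


def consensus(labels):
    """Simple voting: return (label, percent_concordance, n_agreeing).

    The label is None when the two most common labels are tied
    (tie handling is an assumption; the text does not describe it).
    """
    if not labels:
        return None, 0.0, 0
    counts = Counter(labels).most_common()
    top_label, top_n = counts[0]
    if len(counts) > 1 and counts[1][1] == top_n:
        top_label = None
    return top_label, 100.0 * top_n / len(labels), top_n


def leave_one_out_concordance(votes, expert):
    """Percent of an expert's labels that match the consensus of the other experts.

    votes: {event_id: {expert_name: label}} (hypothetical layout).
    """
    matched = total = 0
    for by_expert in votes.values():
        if expert not in by_expert:
            continue
        others = [lab for name, lab in by_expert.items() if name != expert]
        label, _, _ = consensus(others)
        if label is None:  # no usable consensus among the remaining experts
            continue
        total += 1
        matched += by_expert[expert] == label
    return 100.0 * matched / total if total else float("nan")
```

Scoring each expert against the consensus of the remaining experts keeps an expert's own vote from inflating their score.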
Labels and concordance between 'experts' (percent and number of 'expert' curators that agreed on the final label) were found for all events. The concordance of each expert with the consensus expert label ranged from 77.7% to 100%. 541 events had at least 68% concordance with a consensus between at least 4 expert curators. All seven 'expert' curators agreed on the assigned label for 407 events. Overall, deletions averaged 86.36% concordance among 'experts' and insertions averaged 79.69% concordance. There were 298 deletions and 243 insertions where 'expert' curators had at least 68% concordance on the assigned label with 3 or more 'expert' curators who agreed on the assigned label.
There were 20 curators (including 5 'expert' curators) who evaluated more than 648 SVCurator events. Of these, on average 670 events per curator were available for further analysis after filtering responses where participants were unsure about an event existing at a particular site or assigned a low genotype confidence score [Genotype confidence score = 0]. These curators had on average 86.92% concordance with 'expert' consensus labels (Fig 3). Because many curators were anonymous, we screened curators based on their concordance with the 'expert' consensus label for the 541 events. To select the responses that would be used to determine the final labels, curators were binned into two threshold groups of "top curators": 26 (out of 61) curators above Threshold 1 (90.9% or greater concordance, at least as concordant as the expert with the second lowest concordance), and 37 curators above Threshold 2 (77.7% or greater concordance, at least as concordant as the expert with the lowest concordance; see Supplementary Table 1). We filtered 133 out of 1295 sites because the consensus label of curators above Threshold 1 was different from the consensus label of curators above Threshold 2. 1162 events (527 deletions and 635 insertions) were retained (Supplementary Figure 4).
The responses from Threshold 1 and Threshold 2 top curators were highly concordant within each group ( Fig 4). Threshold 1 top curators were more concordant than Threshold 2 top curators, particularly for insertions, but fewer Threshold 1 curators agreed on the assigned label. Complex events had the lowest levels of concordance for top curators within both groups with a mean concordance of 64% and 47% within top curators above Threshold 1 and 2, respectively.
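The two-threshold screening and the removal of events whose consensus differs between the two groups can be sketched as below. The threshold values come from the text; the function names, input layout, and the treatment of ties or of events with no votes from a group (both dropped here) are assumptions made for illustration.

```python
from collections import Counter

THRESHOLD_1 = 90.9  # concordance of the expert with the second lowest score
THRESHOLD_2 = 77.7  # concordance of the expert with the lowest score


def top_curators(concordance_by_curator, threshold):
    """Curators whose percent concordance with experts meets the threshold."""
    return {c for c, pct in concordance_by_curator.items() if pct >= threshold}


def group_consensus(event_votes, group):
    """Most common label among the group's votes; None if no votes or a tie."""
    labels = [lab for cur, lab in event_votes.items() if cur in group]
    if not labels:
        return None
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def retained_events(votes, concordance_by_curator):
    """Keep events where both threshold groups reach the same consensus label.

    votes: {event_id: {curator: label}} (hypothetical layout).
    """
    group_1 = top_curators(concordance_by_curator, THRESHOLD_1)
    group_2 = top_curators(concordance_by_curator, THRESHOLD_2)
    kept = {}
    for event, event_votes in votes.items():
        label_1 = group_consensus(event_votes, group_1)
        label_2 = group_consensus(event_votes, group_2)
        if label_1 is not None and label_1 == label_2:
            kept[event] = label_1
    return kept
```

Because every curator above Threshold 1 is also above Threshold 2, a disagreement between the two consensus labels means the additional Threshold 2 curators outvoted the stricter group.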

Label Evaluation
To evaluate the reliability of the top curators' labels for the 527 deletions and 635 insertions, they were compared to the GIAB v0.6 sequence-resolved SV calls and benchmark regions for the Ashkenazi Jewish Trio son. 698 curated sites were inside the v0.6 benchmark regions, and the labels assigned by the top curators were 94.5% concordant with the v0.6 genotype labels (Fig 5). The focus of the v0.6 sequence-resolved SV calls was on variants greater than 50bp in size, but we included the filtered v0.6 calls 20 to 49 bp in size in this comparison to help evaluate the reliability of top curators' labels in this size range. 10 of the 29 events discordant between the curators and v0.6 were 20 to 49 bp, and all but one of these appeared to be accurately labeled by curators or could be labeled in multiple ways. For instance, the event could be complex or could contain two or more insertions of different sizes at the same locus. The v0.6 benchmark regions were designed to exclude complex events (i.e., regions with two or more SVs within 1000bp). 11 of 29 discordant events were labeled as complex variants by the top curators (2 of which were also 20 to 49bp in size). Fig 9 includes two examples of these events that were difficult for the curators to evaluate, as shown by 50% or less concordance amongst curators. Upon further curation, all but one or two of these appeared to be true complex variants. Of the remaining 10 discordant events, most appeared to be correctly classified by top curators. However, 2 events were classified as homozygous reference by curators even though another variant was in the same tandem repeat outside the IGV view displayed to curators. This difficulty in accurately classifying complex events in tandem repeat regions highlights the importance of expanding the view to display the entire tandem repeat region for variants overlapping them.
Many of the differences between v0.6 and top curators were related to challenges in translating the v0.6 benchmark calls and regions into labels for the curated events. For example, because v0.6 focused on variants >49bp, v0.6 labels were often different if curators labeled a complex variant in which part of the variant was <50bp. There were also cases where multiple nearby variants could be combined into a single variant or separated into multiple variants. Figure 6 summarizes characteristics of the calls discordant between v0.6 and top curators. To assign final crowdsourced labels, a random sample of events was manually inspected. Events that were assigned labels with less than 50% concordance amongst all top curators (84 events) were not included as final labels. Upon manual inspection of 44 sites with only 50-60% concordance amongst all top curators, it was found that 61% of the events were assigned the correct label. Many of the incorrectly labeled events were not correctly classified as complex variants. Upon manual inspection of 28 sites with 60-70% concordance amongst all top curators, it was found that 85% of the events were assigned the correct label. Therefore, only events that were assigned labels with greater than 60% concordance amongst all top curators and for which at least 3 top curators agreed on the label were included in the final labeled callset. These sites included 496 insertions and 439 deletions with 94% of the events receiving labels of Homozygous Reference, Heterozygous Variant, or Homozygous Variant (Fig 7). We also used svviz2 to evaluate the curators' final labeled callset, including variants outside the v0.6 benchmark regions. The svviz2 genotypes from at least 2 technologies supported the crowdsourced label for 811 out of 879 labels (Fig 8). There were also 58 events where only 1 technology supported the crowdsourced label; PacBio supported the majority of these events (26 events, mostly in tandem repeats) followed by Illumina Mate Pair (18 events).
These results further support the accuracy of the crowdsourced labeled events, including those outside the v0.6 benchmark regions.
Figure 8. svviz2 genotypes support SVCurator crowdsourced labels. A) A summary of the number of technologies whose svviz2 genotypes support the SVCurator genotype label. 92.2% of the events were supported by at least 2 technologies. B) A count of the number of genotypes from each technology that match the SVCurator crowdsourced labels. C) A summary of the number of technologies that had genotype scores supporting the crowdsourced label as summarized based on label and variant type; and, D) by size of the event.
Schematic summarizing how SVCurator responses were processed to determine the final label for each event. A) Data Collection and Data Cleaning: Curators evaluated the 1295 events within SVCurator. After removing events that received a low confidence score for genotype assigned and an 'unsure' response for whether an event exists at a particular site, 1273 events remained for analysis. B) Screen Curator Responses: To determine the curator responses that were used to find final labels for the SVCurator events, first consensus labels assigned by 7 'expert' curators were determined. These 7 'expert' curators were members of the Genome in a Bottle (GIAB) analysis team. Of the 1273 events, 541 were assigned a consensus label by the 'expert' curators, where each event had 68% or greater concordance on the assigned label and 4 or more experts that agreed on the assigned label. Using a leave-one-out strategy, a percent concordance score was found for each 'expert' curator, and the two lowest percent concordance scores (90.9% and 77.7%) were used as thresholds for screening top curators. To find the top curators, labels assigned by each curator were compared to the 541 events and percent concordance with experts was found for each curator. Curators that had 90.9% or greater concordance and 77.7% or greater concordance were considered top curators and their responses were placed in two threshold groups. The responses for these curators were used to find final labels for the SVCurator events. C) Determine crowdsourced labeled data: There were 935 events that were assigned final labels by top curators. These events had at least 66.7% concordance amongst top curators and at least 3 top curators that agreed on the final label assigned.
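The rule used to accept a final crowdsourced label (greater than 60% concordance amongst top curators and at least 3 top curators agreeing) can be expressed as a short function. This is an illustrative sketch: the function name and parameters are hypothetical, and the cutoffs are left as arguments because the text gives the concordance cutoff both as "greater than 60%" and as "at least 66.7%".

```python
from collections import Counter


def final_label(top_curator_labels, min_concordance=60.0, min_agreeing=3):
    """Return the majority label if it passes both cutoffs, otherwise None.

    top_curator_labels: labels assigned to one event by top curators.
    min_concordance: percent concordance that must be exceeded (strictly).
    min_agreeing: minimum number of top curators who chose the label.
    """
    if not top_curator_labels:
        return None
    label, n_agree = Counter(top_curator_labels).most_common(1)[0]
    percent = 100.0 * n_agree / len(top_curator_labels)
    if percent > min_concordance and n_agree >= min_agreeing:
        return label
    return None
```

For example, three "Heterozygous Variant" votes out of four pass both cutoffs, whereas two out of three fail the minimum count even though the concordance is above 60%.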

Discussion
SVCurator is a crowdsourcing tool that incorporates read-aligned images from multiple short, long and linked read sequencing datasets into an SV visualization tool that allows users to evaluate SV calls. SVCurator uniquely enables curators to evaluate multiple sources of evidence for each call in one app interface. We displayed svviz2 images of reads from 3 different Illumina sequencing methods, haplotype-partitioned PacBio, and haplotype-partitioned 10x Genomics aligned to reference and alternate alleles, as well as dotplots to visualize repetitive regions. The app also includes IGV images for comparison that display Illumina 250bp, PacBio and 10x Genomics reads. Both the IGV and svviz2 read-aligned images include indicators of repetitive regions. Curators were also able to evaluate haplotype-partitioned PacBio and 10x Genomics data. These features allowed participants to more easily evaluate deletions and insertions, including repetitive regions and complex events.
The results of this study suggest that a group of participants can accurately curate SV calls by evaluating a variety of static images from multiple sequencing technologies. In general, simple heterozygous and homozygous variants and homozygous reference regions were accurately labeled, but complex variants were more challenging. To provide additional support for these assigned labels, future work might include determining the Mendelian consistency for each event and completing PCR validation for a select group of events. A number of events with lower concordance scores were complex events that were assigned another label, and were often located in repetitive regions of the genome. Curators may not have taken into account the evidence within images that suggests a complex event. Crowdsourcing studies specifically focused on complex events could be conducted in the future to better characterize them. This would involve asking the participants to provide feedback on the way tutorials should be structured to facilitate the analysis of complex events.
The crowdsourced labels derived from this study will be useful training datasets for machine learning studies that evaluate SVs, and could be used as a resource to improve SV calling methods. The calls could also be used as a resource to help members of the clinical genomics community improve their evaluation of SVs. Crowdsourcing could also yield more reliable resources that could improve clinical interpretations of SVs, as many of the guidelines are qualitative 9. Finally, this study demonstrates that crowdsourcing is a useful strategy for evaluating SV calls, and that its results may be useful in improving SV tools and analysis approaches in multiple domains.

Participant Recruitment
Participants were recruited from the Genome in a Bottle Analysis Team (https://groups.google.com/forum/#!forum/giab-analysis-team) and from the genomics community via the @GenomeinaBottle Twitter account. SVCurator was made available to the public for one month to allow participants to evaluate the events within the app. An incentive of co-authorship on the current publication was offered for participants who curated at least half of the events (648 or more events).

SVCurator App Interface
SVCurator (www.svcurator.com) is a Python Flask-based app (Fig 1) and uses SQLite3 as a database management system. User login was implemented using Google OAuth 2.0. The SVCurator app was deployed using PythonAnywhere [www.pythonanywhere.com].
App Images: The interface consisted of four thumbnail images for each event and a set of four questions. The four thumbnail images consisted of the following: IGV image, svviz2 PacBio haplotype-partitioned read-aligned image, svviz2 10X Genomics haplotype-partitioned read-aligned image, svviz2 dotplot image representing reference versus alternate allele. A lightbox contained additional images to describe each event, and included the following: svviz2 read-aligned image for haplotype and non-haplotype-partitioned PacBio data, 10X Genomics haplotype-partitioned data, Illumina 6kb Mate Pair data, Illumina HiSeq 250bp read length data, Illumina HiSeq 300x read depth data; svviz2 dotplots: represent reads with highest mapping quality versus reference and alternate allele, reference allele versus alternate allele, reference allele versus reference allele, and alternate allele versus alternate allele. Images included in the lightbox allowed curators to zoom in on sections of the SV call that required closer evaluation. Each curator evaluated the same events for the first 43 events, and events 44-1295 were randomized for each user.
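The event ordering described above (a shared block of 43 events followed by a per-user random order for the rest) could be produced along the following lines. This is a sketch under assumptions: the function name is hypothetical, and seeding the shuffle with the user identifier so that each user sees a stable order across sessions is a design choice made here, not something stated in the text.

```python
import random

N_SHARED = 43  # events shown in the same order to every curator


def event_order(event_ids, user_id, n_shared=N_SHARED):
    """Return the event order for one user.

    The first n_shared events keep their original order; the remaining
    events are shuffled with a generator seeded by the user id (assumption),
    so repeated calls for the same user give the same order.
    """
    shared = list(event_ids[:n_shared])
    rest = list(event_ids[n_shared:])
    random.Random(str(user_id)).shuffle(rest)
    return shared + rest
```

Using a dedicated `random.Random` instance avoids changing the state of the module-level generator used elsewhere in an application.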
Questions: Participants were given the structural variant call: unique ID, size, chromosome number, start and end coordinates. For each event, curators evaluated the putative SV type, determined whether an event exists within 20% of the size of the variant, and assigned the genotype for each event. The questions included in SVCurator were designed to describe the size accuracy and genotype of each SV call. Members of the GIAB community helped structure and finalize the questions included in the app. Curators described each event by responding to the following questions: Comment Box: included to give curators the opportunity to add additional comments to describe each event or report any user interface issues (i.e., images that may have not rendered properly). Responses were collected over the course of one month after the app was made publicly available. Participants were also provided with a tutorial that describes general guidelines for analyzing SV calls (https://lesleymaraina.github.io/svcurator_tutorial_2/).
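As an illustration of how curator responses to these questions might be stored in SQLite3, the sketch below uses Python's standard `sqlite3` module. The table name, column names, and the choice to keep only a curator's latest response per event are all hypothetical; the actual SVCurator schema is not described in the text. The genotype options are the four listed in the Results.

```python
import sqlite3

GENOTYPES = (
    "Homozygous Reference",
    "Heterozygous Variant",
    "Homozygous Variant",
    "Complex or difficult",
)


def init_db(conn):
    """Create a hypothetical responses table (one row per curator and event)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS responses (
               id INTEGER PRIMARY KEY,
               curator TEXT NOT NULL,
               event_id TEXT NOT NULL,
               sv_type TEXT,
               size_within_20pct TEXT,
               genotype TEXT,
               gt_confidence INTEGER,
               comment TEXT,
               UNIQUE (curator, event_id)
           )"""
    )


def save_response(conn, curator, event_id, sv_type, size_within_20pct,
                  genotype, gt_confidence, comment=""):
    """Insert a response, replacing any earlier one by the same curator."""
    if genotype not in GENOTYPES:
        raise ValueError(f"unknown genotype label: {genotype!r}")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO responses "
            "(curator, event_id, sv_type, size_within_20pct, genotype, "
            "gt_confidence, comment) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (curator, event_id, sv_type, size_within_20pct, genotype,
             gt_confidence, comment),
        )
```

A Flask view handling the question form could call `save_response` with the submitted values; an in-memory database (`sqlite3.connect(":memory:")`) is enough to try the functions out.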