Abstract
The nature and effect of mutations are of fundamental importance to the evolutionary process. The generation of mutations with mutagens has also played important roles in genetics. Applications of mutagens include dissecting the genetic basis of trait variation, inducing desirable traits in crops, and understanding the nature of genetic load. Previous studies of sodium azide-induced mutations have reported single nucleotide variants (SNVs) found in individual genes. To characterize the nature of mutations induced by sodium azide, we analyze whole-genome sequencing (WGS) of 11 barley lines derived from sodium azide mutagenesis, where all lines were selected for diminution of plant fitness owing to induced mutations. We contrast observed mutagen-induced variants with those found in standing variation in WGS of 13 barley landraces. Here, we report indels that are two orders of magnitude more abundant than expected based on nominal mutation rates. We found induced SNVs are very specific, with C → T changes occurring in a context followed by another C on the same strand (or the reverse complement). The codons most affected by the mutagen include the sodium azide-specific CC motif (or the reverse complement), resulting in a handful of amino acid changes and few stop codons. The specific nature of induced mutations suggests that mutagens could be chosen based on experimental goals. Sodium azide would not be ideal for gene knockouts but will create many missense mutations with more subtle effects on protein function.
Author summary
Sodium azide is frequently used as a mutagen for experimental studies in plants. It induces primarily C → T changes. We find that these most often occur when a cytosine (C) is followed by another cytosine. In coding sequence, this results in a limited yet distinct set of amino acid alterations that differ from natural variants in barley. Notably, harmful mutations prevalent in sodium azide-treated samples include glycine to aspartic acid and proline to serine, while untreated landraces exhibit distinct sets of putatively harmful changes. The mutated lines demonstrate an average yield reduction of 37.7% due to induced single nucleotide variants (SNVs) and insertions/deletions (indels), which disrupt numerous coding variants. Although detecting putative deleterious mutations is straightforward, discerning the individual impact of these variants remains complex. However, advancements in understanding the context of these mutations may facilitate the training of machine-learning models to predict and rank their effects, ultimately benefiting plant and animal breeding and research on complex human diseases. The findings suggest that the occurrence of specific mutations induced by chemical mutagens can be anticipated, enhancing our comprehension of mutagen-induced genetic diversity and its phenotypic implications.
Citation: Liu C, Frascarelli G, Stec AO, Heinen S, Lei L, Wyant SR, et al. (2025) Sodium azide mutagenesis induces a unique pattern of mutations. PLoS Genet 21(6): e1011634. https://doi.org/10.1371/journal.pgen.1011634
Editor: Angela Hancock, Max Planck Institute for Plant Breeding Research: Max-Planck-Institut fur Pflanzenzuchtungsforschung, GERMANY
Received: May 14, 2024; Accepted: February 22, 2025; Published: June 3, 2025
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: SRA numbers: WGS Morex and Morex treated with sodium azide BioProject PRJNA849997. WGS barley landraces BioProject PRJNA674330. ONT of Morex and Morex treated with sodium azide BioProject PRJNA967725. Github: https://github.com/MorrellLAB/Barley_Mutated DRUM: https://doi.org/10.13020/sewd-qq35.
Funding: This study was supported by a University of Minnesota Informatics Institute MnDRIVE Graduate Assistantship award to Chaochih Liu, the National Science Foundation (IOS-1339393 to PLM, JCF, and KPS), US Department of Agriculture Biotechnology Risk Assessment Research Grants Program (USDA BRAG 2023-33522-41008 to PLM), and the Minnesota Agricultural Experiment Station fund (MIN-13-122 to PLM). Syngenta Crop Protection Inc. covered the sequencing costs for a portion of our sample. The funders had no role in study design, data collection and analysis, decision to publish, or manuscript preparation.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Mutagens can quickly generate novel, heritable genetic variation for identifying gene function when naturally occurring variation will not suffice. Sodium azide (NaN3), ethyl methanesulfonate (EMS), and fast neutron (FN) radiation have been the most commonly used mutagens in plants. Sodium azide and EMS are chemical mutagens expected to cause point mutations [1]. FN is known for generating many large structural variants and small indels (especially deletions) that create frameshifts that disrupt gene function [2,3]. Mutagenizing agents have been used in many species to create knockouts or knockdowns of individual genes to understand gene function [4]. Despite decades of active use [5,6] and growing interest in the applications of mutagenesis [7,8], the characterization of the nature of the mutations generated has been limited [9]. More generally, the number of studies examining the effects of mutagens at the nucleotide sequence level is limited [1,3,10,11].
The nature and context in which mutations occur are essential determinants of their likely functional impact [12,13]. Recently reported efforts to employ sodium azide mutagenesis on a massive scale (500,000 mutated lines) clarify the need for an improved understanding of the number and types of mutations likely to be generated [7].
We can characterize induced mutations by comparing newly generated (de novo) mutations to variants occurring naturally in untreated lines. However, there are essential considerations to minimize false positives when distinguishing between induced mutations and existing variants. One challenge with identifying induced de novo changes is that the mutations are always a mixture of induced de novo and spontaneously occurring mutations [13]. The experimental design necessary to study induced mutations in plant populations requires multiple generations of seed bulking before the mutagenesis treatment and several generations in which new mutations are made homozygous in inbred lines; thus, multiple generations in which new spontaneous variants can arise [3]. Mutagenesis experiments often use multiple generations of self-fertilization to ensure that mutations are meiotically heritable [14]. Mutagen dosage also affects the nature of observable mutations, with an effective dosage of mutagens defined by the LD50, or lethal dose for 50% of the treated sample, which results in the death of half the individuals. Of course, this creates significant attrition due to lethal mutations, resulting in some mutations or combinations that cannot be observed directly. Observable mutations are less damaging, allowing the plant to survive and reproduce.
New mutations, whether induced or naturally occurring, are more likely to be harmful than segregating variants subject to generations of natural (purifying) selection. More potentially harmful changes are observed among newer mutations in very deep resequencing panels [15] and induced mutations [3]. Newer mutations tend to include more changes of large effect, including nonsynonymous variants, changes in start and stop codons, intron splicing variants, and frameshifts [16,17]. Comparative approaches using phylogenetic constraint have been used for decades to predict which mutations are more likely to harm the organism [18,19]. More recent studies of induced mutations in Arabidopsis thaliana suggest that these approaches accurately identify phenotype-changing variants and are more likely to have a major effect on organismal fitness [20]. Thus, predicting the harmfulness of mutations has the potential to identify the nature of variants most likely to impact organismal fitness or yield in crops [21–23].
In barley, sodium azide primarily generates cytosine-to-thymine changes [24]. More recent studies have determined that mutagens create unique suites of mutations and that these mutations typically occur in a specific, immediate nucleotide context [1,13]. This observation is important because the nature of mutations, particularly when they occur predominantly in the presence of flanking motifs, can determine their relative effect. For example, C → T transitions, particularly in specific local sequence contexts, may limit the potential for mutagens to generate premature stop codons, even when the reverse complement of a mutation motif is considered. Olsen et al. 1993 [24] identified sodium azide mutations in the barley Ant18 gene, finding that A.T → G.C base-pair transitions were most frequently generated. However, resequencing this single barley gene does not capture the effects of sodium azide at the whole-genome scale.
In the present study, we examine sodium azide-induced mutations in 11 mutagenized lines of Morex, the barley variety used as the primary reference genome [25]. We also generated whole-genome sequence for a sample of the Morex seed stock used for mutagenesis. In addition to single nucleotide changes, we find evidence of sequence insertions and deletions (indels) private to each mutagenized line. Mutations identified in sodium azide-treated lines were compared to variants in 13 barley landrace samples subject to whole-genome sequencing. To understand the nature of mutations induced by sodium azide, we address the following questions: 1) What is the nature of the variants induced by the mutagen? More specifically, does sodium azide tend to generate SNPs, indels, or other types of structural variants? 2) What is the observed mutation rate in sodium azide-treated samples in barley? 3) Do induced variants differ in type or predicted effect from variants occurring in untreated individuals? and 4) Is a greater number of harmful mutations associated with a reduction in yield in barley?
Results & Discussion
Identifying sodium azide-induced mutations
The data collected and analyzed in the present study include three distinct datasets. First, we used multiple resequencing datasets to identify differences between our Morex samples and the Morex reference genome. These included new Illumina paired-end data with 10X Genomics linked reads, as well as both new and previously published Oxford Nanopore Technologies (ONT) and published Pacific Biosciences (PacBio) sequence [25] (Tables 1 and S1). A second dataset included lines treated with sodium azide that were resequenced with Illumina paired-end data; for a subset of these lines we also generated linked reads and ONT DNA sequencing data (S1 Table). We used these data to identify single nucleotide and structural variant differences between our Morex line and the published Morex_v3 reference genome [25]. We then identified mutations in lines subject to mutagenesis. Third, to contrast variants in mutagen-treated lines with naturally occurring variants, we used Illumina paired-end data to identify spontaneous variants in 13 barley landraces.
Variants in our Morex sample relative to the Morex reference genome.
A major challenge in isolating de novo variants induced by the mutagen treatment is the need to distinguish between variants present in the mutagenized seed stock (i.e., Morex) and those that arise during mutagenesis. Heterogeneity, genetic variation within an inbred cultivar or variety, can contribute large numbers of variants [26] that are not due to the mutagen. Experimental contamination through unintended hybridization [3,27] can also contribute to large numbers of variants (S1 Fig). Mutagen-induced variants are expected to be relatively rare, requiring filtering of the variants that account for: (1) differences between the parental line used for mutagenesis and the reference genome, (2) uncallable regions (see Materials & Methods) that include genomic regions with unknown nucleotide state in the reference, or the regions where sequence reads do not align uniquely, and (3) heterogeneity among lines (S2 and S3 Figs). We identified callable portions of the genome that were 820 Mb and 817 Mb (the latter excludes low complexity sequence) relative to the 4.2 Gb genome size for Morex_v3 [25]. Callable regions capture 88% of high-confidence (HC) genes (31,625 HC genes out of 35,827 total HC genes) in the Morex_v3 reference genome. No individuals in our experiment show an excess of variants, runs of variants, or high heterozygosity consistent with recent hybridization [27].
Differences between the Morex_v3 reference genome and our Morex sample are mutations that have arisen in individual seed stocks, errors in the reference assembly, or errors in variant calling. To address these issues, we generated 10X Genomics-based resequencing with 46x and an ONT data set with 5x average mapped read coverage (S1 Table). We identified 52,596 SNPs, 8,203 1-bp indels (insertions and deletions), 3,182 indels ranging in size from 2-204 bp, and 53 deletions ranging from 41 bp - 60 Kbp that survived rigorous filtering (S2 and S3 Figs). Filtered ONT variant calls include 68 insertions and 42 deletions. Using published ONT (at 85x coverage) and PacBio data (at 27x coverage) sampled from 100 seedlings [25] (Table 1) that were incorporated in the Morex_v3 assembly, we detected an additional 178 insertions and 93 deletions that were excluded from callable regions.
de novo variants in mutagenized barley lines.
Reference-based read mapping, variant calling in GATK [17,28,29], and filtering of the 11 M5 mutagenized lines identified 23,339 SNVs (Single Nucleotide Variants) with an observed transition to transversion ratio (Ts/Tv) = 5.24. This compares to a ratio of Ts/Tv = 1.7 previously reported based on resequencing of naturally occurring variants in barley [17]. Among mutagenized lines, we also identified 5,376 smaller indels ranging in size from 1–296 bp, for a total of 28,715 variants potentially induced by the mutagen (Fig 1A). Here, we use SNV to identify variants generated during mutagenesis and SNP (Single Nucleotide Polymorphism) to identify variants segregating in the non-mutagenized population. Each variant is private to a mutagenized plant (i.e., occurs in only a single individual). There was an average of 2,122 SNVs and 489 indels per mutated sample, with more indels identified in 10X Genomics samples utilizing linked reads (Fig 1B and Table 2). Those numbers are much higher (17x and 139x) than estimates of the average number of mutations that would arise spontaneously without a single generation of sodium azide treatment (see Materials & Methods). Using the mean nucleotide substitution rate estimate of 6.5 × 10⁻⁹ base substitutions per site per generation from [30] and accounting for our experimental design, we expect ~124 SNVs per individual in the 4.2 Gbp genome. For indels, we expect ~3.5 indels per individual to arise without the mutagen treatment based on an average indel mutation rate of 0.45 × 10⁻⁹ for 1–3 bp indels and 0.5 × 10⁻⁹ for >3 bp indels [30].
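The fold increases quoted above are ratios of the observed per-line counts to the expected spontaneous counts. The short sketch below only reproduces that arithmetic from the numbers stated in the text; the helper name `fold_increase` is hypothetical, and the expected counts are taken as given rather than re-derived from the mutation-rate equations in Materials & Methods.

```python
# Values quoted in the text above; nothing here is re-derived from raw data.
OBSERVED_SNVS_PER_LINE = 2122
OBSERVED_INDELS_PER_LINE = 489
EXPECTED_SNVS_PER_LINE = 124      # spontaneous expectation stated in the text
EXPECTED_INDELS_PER_LINE = 3.5    # spontaneous expectation stated in the text


def fold_increase(observed: float, expected: float) -> float:
    """Ratio of observed to expected counts (hypothetical helper)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


snv_fold = fold_increase(OBSERVED_SNVS_PER_LINE, EXPECTED_SNVS_PER_LINE)
indel_fold = fold_increase(OBSERVED_INDELS_PER_LINE, EXPECTED_INDELS_PER_LINE)
print(f"SNV fold increase:   {snv_fold:.1f}")
print(f"indel fold increase: {indel_fold:.1f}")
```

Both ratios depend directly on the genome span over which the expectation is computed, so the same observed count yields a different fold change when the expectation is restricted to callable regions.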
The ONT sequence and 10X Genomics linked reads of the same three mutagenized lines improved the detection of larger structural variants (SVs). No inversions passed the filtering criteria, and a set of high-quality duplications could not be identified; thus, structural variant calling focused on insertions and deletions. The SVs detected in the 10X Genomics data set included 52 deletions ranging from 41 bp to 60 Kbp (S4 Fig). ONT reads for three samples (S1 Table) were generated to validate the larger structural variants identified in the 10X Genomics data. The average ONT mapped read coverage was 3.4x (S1 Table). A total of 86 insertions (36–4,786 bp) and 26 deletions (36–300 bp) were called by Sniffles2; 8 insertions (21–172 bp) and 6 deletions (16–116 bp) were called by cuteSV (S4 and S5 Figs). These ONT calls provided direct sequence read-based confirmation of two larger deletion calls that were also called based on the 10X Genomics data set.
Variants in untreated barley landraces.
For comparison, we generated whole-genome resequencing of 13 barley landraces [31]. Average coverage ranged from 41 - 93x (S1 Table), and after variant calling and quality filtering, we identified a total of 6.7 million SNPs with Ts/Tv = 1.74 and 849,618 indels ranging in size from 1 - 388 bp. Out of the 6,746,637 SNPs, 2,277,853 SNPs were categorized as rare (i.e., non-reference allele count of two or less, the allele was identified in one or two genotypes) with Ts/Tv = 1.71 and 4,468,784 were common (i.e., non-reference allele count of three or higher) with Ts/Tv = 1.76 (Fig 1A). Rare variants were compared to de novo variants in the treated lines because they have experienced fewer generations of selection, and their mutational spectrum is likely more similar to that in treated lines.
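As a minimal illustration of the two summaries used in this comparison, the sketch below computes a Ts/Tv ratio from a list of (reference, alternate) base pairs and assigns a variant to the rare or common class using the non-reference allele count threshold stated above. The function names are hypothetical and the example substitutions are made up; this is not the pipeline used in the study.

```python
PURINES = {"A", "G"}
RARE_MAX_ALLELE_COUNT = 2  # threshold stated in the text: count of two or less


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return ref != alt and ((ref in PURINES) == (alt in PURINES))


def ts_tv_ratio(substitutions) -> float:
    """Transition/transversion ratio for (ref, alt) pairs."""
    ts = sum(1 for ref, alt in substitutions if is_transition(ref, alt))
    tv = len(substitutions) - ts
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


def frequency_class(non_ref_allele_count: int) -> str:
    """Label a variant 'rare' or 'common' by non-reference allele count."""
    return "rare" if non_ref_allele_count <= RARE_MAX_ALLELE_COUNT else "common"


# Made-up example: two transitions and one transversion.
example = [("C", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(example))
print(frequency_class(2), frequency_class(3))
```

The rare and common SNP counts reported above (2,277,853 and 4,468,784) sum to the 6,746,637 total, which is a quick consistency check on the partition.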
Comparison of mutagenized versus untreated samples
SNPs in untreated samples are primarily transitions, particularly C → T* (Fig 2), where the notation C → T* includes the reverse complement G → A. Partitioning variants in the untreated samples into “rare” or “common” had a limited effect on the proportion of variants among each class, with a slight skew of rare variants to more C → T* transitions and fewer A → G*. Variants in sodium azide-treated lines were dominated by C → T* transitions, which comprised 79.1% of all SNVs in the mutagenized lines (Fig 2).
Each bin includes the reverse complement. For example, the C → T* bin also includes G → A changes.
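The strand-symmetric binning described above can be sketched as follows: any substitution whose reference base is G or T is replaced by its reverse complement, so that every change is expressed with a C or A reference base. The function name and the ASCII label format (`C>T*`) are assumptions made for illustration, and the example calls are invented.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref: str, alt: str) -> str:
    """Collapse a substitution and its reverse complement into one bin."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in "GT":
        # Express the change on the opposite strand so ref is C or A.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}*"


# Invented example calls: G>A falls in the same bin as C>T.
calls = [("C", "T"), ("G", "A"), ("T", "C"), ("A", "T")]
spectrum = Counter(substitution_class(r, a) for r, a in calls)
for label, count in sorted(spectrum.items()):
    print(label, count / len(calls))
```

Binning this way yields six classes in total, which is why a single C → T* percentage can summarize changes observed on either strand.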
Single base pair changes dominate insertion and deletion variants, particularly in sodium-azide treated lines, constituting 28.3% of insertions and 36.4% of deletions (Fig 3). The pattern of indels in treated lines is similar to that in rare and common variants in standing variation. Notable differences include more 1- and 2-bp deletions and 1-bp insertions in treated lines (Fig 3). The 53 larger deletions (41 bp - 60 Kbp) called in the 10X Genomics data set represent roughly 1.7% of all 3,135 deletions in the mutated lines. Linked and long read sequencing was not possible for the landrace lines, precluding a direct comparison.
Insertions are shown as positive values and deletions are shown as negative values. Only variants with lengths <20 bp are shown here.
In mutated samples, 9.7% of SNVs and 8.9% of indels occur in genic regions (S6 Fig). The percentage of variants in genic regions is lower in untreated lines for rare and common variants. Rare variants had 6.5% SNPs and 1.7% indels occurring in genic regions and are more directly comparable with mutated sample SNVs and indels. Variant Effect Predictor (VeP) [32] identified most sodium azide-induced variants as occurring in intergenic regions or genomic regions up or downstream of genes. A larger proportion of variants were found in intergenic regions among induced variants than in untreated lines. Fewer sodium azide-induced SNVs and indels were adjacent to genes (S7 and S8 Figs). Sodium azide-treated lines have a slightly higher proportion of missense variants (3.86%) than untreated lines (2.19% Rare, 2.32% Common), but this effect is small. Slight increases in the proportion of start-stop-related changes (0.2% Mutated, 0.09% Rare, and 0.07% Common) and splice donor and acceptor sites are also observed (0.09% Mutated, 0.01% Rare, and 0.01% Common). However, these variants are considered the most potentially damaging based on VeP categorization (S7 and S8 Figs). Larger deletion variants (41 bp - 60 Kbp) detected among the three lines with linked reads disrupt a genic region in 11.3% of cases (6 out of 53 total).
Harmful mutations based on phylogenetic constraint
On average, sodium azide-treated lines include 78.6 nonsynonymous SNVs per sample, with 865 nonsynonymous SNVs identified among the 11 mutated lines (S9 Fig). Estimates of putative variant effects based on phylogenetic constraints [20] were used to identify potentially damaging nonsynonymous variants among primary transcripts in the barley genome. This analysis includes missense variants (a change in amino acid), start lost, stop gained, and stop lost variants based on the Sequence Ontology definition of nonsynonymous changes [33]. For the 11 mutated lines, 611 nonsynonymous mutations in primary transcripts were tested for phylogenetic constraint relative to 72 other angiosperms to identify potentially universally harmful mutations. Among the successfully annotated mutations, 155 (35.9%) were annotated as “harmful” (i.e., putatively deleterious), while the remaining 277 (64.1%) were identified as “tolerated.” This value compares to 9,716 “rare” nonsynonymous variants tested in 13 barley landraces, where 1,633 (13.2%) are identified as “harmful” and 10,693 (86.8%) as tolerated. For “common” nonsynonymous variants, 14,537 were tested, where 23,506 (7%) are “harmful,” and 311,121 (93%) are tolerated.
An average of 14.1 (± 6.46) deleterious SNVs were identified per mutagenized sample (S10 Fig). Given that these changes at conserved coding positions are frequently phenotype-changing [20], this suggests roughly 14 disrupted SNVs per individual treated line. The ratio of nonsynonymous SNVs (nSNV) to synonymous SNVs (sSNV) in mutated lines is 1.8:1. In comparison, the ratio of nSNVs:sSNVs in rare and common SNP categories is 1.36:1 and 1.22:1, respectively. The proportion of nSNVs inferred to be deleterious was 17.9% in treated lines versus nSNPs at 3.2% in rare and 4.1% in common categories. To standardize results among samples, we identified the number of harmful mutations per codon in 10 Mbp windows. The proportion of dSNPs per codon was lower near the centromeres for rare and common SNPs in the landraces (S11-S13 Figs).
The context of variants induced by sodium azide
Biochemical interactions between mutagenic compounds and DNA produce SNVs in specific nucleotide contexts [13]. We used the program Mutation Motif [9] for all SNVs to examine this effect in sodium azide-treated barley lines. The predominant mutation types in both treated and untreated lines are C → T* changes. In sodium azide-treated lines, the cytosine that changes to thymine is frequently followed by another cytosine, creating a CC context of mutation on the forward strand (Fig 4). To our knowledge, there have not been previous studies on the preferential context of sodium azide mutations. There are highly significant differences between sodium azide-induced variants and variants spontaneously originating in the genome (S2 Table). In untreated lines, the mutated cytosine is generally followed by guanine at the +1 or +2 site (downstream) from the C, thus resulting in a CG or potentially CGG context in which mutations occur. In the complete 4.2 Gb Morex_v3 genome assembly, the CC, CG (the two bp motif for CpG changes), and CGG motifs occur ~228 million, ~154 million, and ~40 million times, respectively. In the 820 Mb region in which unique single nucleotide variants could be called, CC occurs ~48 million times, CG occurs ~35 million times, and CGG motifs occur ~9 million times. This suggests that, on average, a single generation of sodium azide treatment resulted in the mutation of 0.013% of CC sites at which unique mutations could be detected. The CC and CG motifs constitute 5.8% and 4.3% of all two nucleotide combinations in the 820 Mb callable regions, and the CGG motif constitutes 1.1% of three nucleotide combinations. In contrast, AA and TT motifs are the most frequent two nucleotide motifs, making up 8.1%.
Position 0 indicates where the C → T change occurred. The relative height of the letters indicates their relative entropy (RE), with a higher RE indicating a position has a greater influence on the mutation. Upright letters indicate overrepresented bases, whereas upside-down letters indicate underrepresented bases at positions neighboring position 0. The null expectation (RE of zero) is based on randomly sampling a nearby location with the same starting base (e.g., for a C → T mutation, a random position with a C is selected).
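To make the notion of a +1 context concrete, the sketch below reads the base that follows a mutated cytosine on the strand where the change is C → T, and counts overlapping motif occurrences in a sequence. It is not the Mutation Motif program; the function names are hypothetical, the toy sequence is invented, and counting motifs on the forward strand only is an assumption, since the text does not state how strands were handled in the genome-wide motif counts.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def downstream_base(seq: str, pos: int):
    """Base at +1 from a mutated C, read on the strand where the change is C->T.

    For a forward-strand G (a G->A call), the mutated C lies on the reverse
    strand, so its +1 neighbor is the complement of the forward base at pos-1.
    Returns None if the reference base is not C/G or the neighbor is missing.
    """
    ref = seq[pos]
    if ref == "C":
        return seq[pos + 1] if pos + 1 < len(seq) else None
    if ref == "G":
        return COMPLEMENT[seq[pos - 1]] if pos > 0 else None
    return None


def count_motif(seq: str, motif: str) -> int:
    """Count overlapping occurrences of motif on the given strand."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


toy = "ACCGGT"  # invented sequence
print(downstream_base(toy, 1))   # C at index 1 is followed by C
print(downstream_base(toy, 4))   # G at index 4: reverse-strand C followed by C
print(count_motif(toy, "CC") / (len(toy) - 1))  # CC share of dinucleotides
```

Applied to a G → A call, the function reports a CC context whenever the forward strand reads GG at the mutated site, which is the reverse-complement case described in the text.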
Amino acid changes identified in sodium azide-treated lines are dominated by those that include the CC or (reverse complement) GG motif (S14 Fig). Glycine to aspartic acid, proline to serine, and alanine to threonine are the three most abundant amino acid changes in SNVs identified as harmful (i.e., deleterious) (Table 3 and S15 Fig). The top four changes in tolerated SNVs (tSNVs) in mutagenized samples are similar to those annotated as harmful; tolerated amino acid changes include alanine to threonine, alanine to valine, glycine to aspartic acid, and proline to serine (Table 3). This contrasts with amino acid changes induced by rare and common variants in standing variation, where transitions associated with CpG are more abundant. For rare and common dSNPs, alanine-to-threonine and alanine-to-valine changes appear at the highest frequencies. The arginine-to-cysteine amino acid change had the third highest frequency in the common dSNPs class and frequently annotates as deleterious.
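One way to see why a CC-specific mutagen yields a narrow set of amino acid changes is to enumerate every codon in the standard genetic code and apply only the changes the motif allows. The sketch below is a simplification made for illustration, not the analysis performed in the study: it considers only contexts that lie entirely within a single codon (C → T where the next base in the codon is C, and G → A where the previous base in the codon is G), so changes whose context spans a codon boundary are not captured. Alanine to threonine, for example, needs a G immediately upstream of a GCN codon.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered T, C, A, G at each position.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def cc_context_changes() -> Counter:
    """Tally (old_aa, new_aa) pairs for within-codon CC/GG-context changes."""
    changes = Counter()
    for codon, aa in CODON_TABLE.items():
        for i in range(3):
            if i < 2 and codon[i] == "C" and codon[i + 1] == "C":
                # C->T where the mutated C is followed by C.
                mutant = codon[:i] + "T" + codon[i + 1:]
            elif i > 0 and codon[i] == "G" and codon[i - 1] == "G":
                # Reverse complement: G->A where the mutated G follows a G.
                mutant = codon[:i] + "A" + codon[i + 1:]
            else:
                continue
            changes[(aa, CODON_TABLE[mutant])] += 1
    return changes


changes = cc_context_changes()
for (old, new), n in sorted(changes.items()):
    kind = "synonymous" if old == new else ("stop gained" if new == "*" else "missense")
    print(f"{old}->{new}: {n} ({kind})")
```

Under these assumptions, proline to serine and glycine to aspartic acid both appear among the possible changes, and the only within-codon route to a premature stop is TGG to TGA, consistent with the observation that sodium azide produces few stop codons.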
Putatively harmful SNVs and phenotypic variation
A total of 25 mutagenized barley lines self-fertilized for five to seven generations (M5:7) were used for yield testing at one location in the first year (St. Paul, MN) and three locations in Minnesota (Crookston, Lamberton, and St. Paul) in years 2 and 3. Yield testing was performed in the presence of 5–8 check lines (see Materials and Methods) and the original Morex line untreated with the mutagen. Data for heading days after planting (DAP) and plant height were also collected. After spatial adjustment for variation across plots, the average yield for each line was calculated for all years combined. As expected, most mutagenized lines had lower grain yields than the Morex W2017 parental line, with six mutagenized lines yielding roughly the same or slightly higher than the parental line (Fig 5). M29 is among the three lowest-yielding lines and is the only mutagenized line with a visibly distinct phenotype, described as onion-like, short-stature, and very compact (S16 Fig). Mutagenized lines tend to have diminished yield relative to Morex, though some line-by-year combinations slightly exceeded the yield in Morex and some checks (S17 Fig). The average diminution in yield across years and lines for the mutagenized lines was 32.8% relative to the Morex W2017 parent. The heading DAP in mutagenized lines increased by 6.2%, and height was reduced by 15% compared to the Morex W2017 parent line. To determine if observed damaging mutations impacted yield, we compared the relative order of yield to the number of damaging mutations per line. We found a slightly negative but nonsignificant correlation of -0.28 (P = 0.4) between the number of harmful variants and yield (S3 Table). Most mutagenized lines had lower variance across replicates than the check lines. 
This is likely due to the experimental design with seeds originating from plants that can be traced through single-seed descent following the mutagen treatment, whereas check lines derive from more heterogeneous seed stocks.
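A comparison of the relative order of yield with the number of damaging mutations per line can be expressed as a rank correlation. The sketch below implements a Spearman rank correlation in plain Python as one plausible reading of that comparison; the text does not name the statistic used, so this choice is an assumption, and the per-line values are invented placeholders rather than data from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented placeholder values, NOT data from the study.
harmful_counts = [8, 12, 15, 21, 9]
relative_yield = [0.95, 0.70, 0.80, 0.55, 0.60]
print(round(spearman(harmful_counts, relative_yield), 3))
```

With only 11 sequenced lines, a rank-based statistic has limited power, which is in keeping with the nonsignificant result reported above.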
The box plots are sorted by the median for each line, and the bars in the box plot indicate the mean. Sodium azide-treated lines are represented by red outlines. Red shaded boxes indicate mutated lines that were sequenced in this study.
Conclusions
Sodium azide is widely used as a mutagen in experimental plant populations. It has been frequently used for inducing variants in barley [24,34], including recent reports of extremely large-scale experiments involving the characterization of hundreds of thousands of individual plants [7]. Other mutagens used historically have included X-rays, neutrons, ethylene imine, sulfonates, other chemical mutagens, and various combinations of mutagens [detailed in 35]. However, most studies have focused on the phenotypic effects of mutagenesis [4,36,37] or changes induced at individual genes [24]. The genome-level effects of the mutagen have rarely been examined.
Consistent with a prior single gene resequencing study, we find that C → T* transitions dominate induced mutations [7,24,38] (Fig 2). A similar pattern of nucleotide change has been found for ethyl methanesulfonate (EMS) induced mutation [10,39]. N-ethyl-N-nitrosourea (ENU) induced mutations in mice produce primarily A → G* or A → T* mutations [13]. In contrast, C → T* transitions predominate among the variants observed after fast neutron treatment, but with less enrichment relative to other SNVs [3,10].
Sodium azide appears to generate primarily single nucleotide variants. We identify an average of 2,122 SNVs per mutagenized line. This is an ~88-fold increase in SNVs compared to expectations in the absence of a single-generation sodium azide treatment (see Materials and Methods for equations 1–6). Induced indels of all sizes are less abundant (Figs 1 and 3) but occur at a ~130–140-fold higher rate than nominal mutation rates.
The observation of higher indel rates derives from the comparison of data from multiple sequencing platforms, including linked-reads and long-read sequencing, with an average of 489 indels per line in mutagenized M5 lines. The mutations present in M5 lines are necessarily a mixture of induced mutations and mutations that arose spontaneously during line maintenance [13]. However, based on mutation accumulation resequencing studies in Arabidopsis thaliana, the 1–3 bp mutation rate was estimated as an average of 0.45 × 10⁻⁹ indels per site per generation, and the large deletions (>3 bp) mutation rate was estimated as 0.5 × 10⁻⁹ [30]. In the 820 Mb portion of the barley genome, where variants can be called unambiguously, we expect ~3.5 indels per individual to arise naturally without mutagenesis treatment over the course of the experiment (see Methods for equations 1–6). In the 4.2 Gbp barley genome, we would expect ~18 indels per individual (Fig 1B). This is a ~130–140-fold increase in the indel mutation rate of treated lines to an average of 5.13 × 10⁻⁷ indels per site per generation.
Most sodium azide-induced mutations occur in a specific nucleotide sequence context, as C → T* changes in a CC mutation motif (Fig 4) or the reverse complement. This results in a relatively small number of amino acid changes that predominate among induced mutations. Li et al. 2017 [10] describe similar results for mutations found after EMS or EMS/ENU treatment, where the most common changes within codons produce only a fraction of all possible amino acid changes. In a similar manner, sodium azide-induced amino acid changes are very distinct from most amino acid changes segregating in barley. The amino acid changes that annotate as harmful and predominate in the mutagenized samples are glycine to aspartic acid, proline to serine, and alanine to threonine. In comparison, the top three amino acid changes annotated as harmful and segregating in the untreated barley landraces include alanine to threonine, alanine to valine, and arginine to cysteine. For projects seeking to induce novel changes, for example, in disease resistance genes or genes associated with stress tolerance, sodium azide will induce many coding changes that are rarely observed among standing variation.
Resequencing of individual genes identified many sodium azide-induced SNVs in barley [24]. Induced indels and SVs were not previously reported but would be difficult to identify with Sanger sequencing. Indeed, sequencing technology continues to present a limitation. Many of the SVs identified here were identified by linked reads (in two cases verified by ONT long reads) but could not be identified by Illumina paired-end reads alone. Regarding relative effect, indels and SVs identified with linked reads and verified with ONT result in six disruptive mutations that either induce a frameshift or eliminate a portion of a coding gene. This results in an average of two structural disruptions of genes per individual instead of an average of 33.4 per individual due to 1–3 bp nucleotide sequence-level changes.
Our mutated lines average a 37.7% reduction in yield relative to their non-mutated parental Morex line. This reduction in yield can be attributed to an average of 14.1 induced SNVs and 37.3 indels per line. The typical line has an average of 249.1 disruptions of coding variants, including SNVs and indels (S6 Fig). The approach used in this study identified a finite number of deleterious (i.e., harmful) mutations induced by sodium azide. It was successful at creating lines that had lower yield than the untreated Morex parent line in the experiment. The reduction in fitness (using yield as a proxy for fitness) following the mutagen treatment was expected, given that most new amino-acid-changing mutations that impact fitness will be deleterious [16,40].
In practical applications, deleterious variants are relatively easy to detect, which makes it possible to select against them or eliminate them via targeted replacement of individual variants [22,23,41,42]. However, it is still challenging to identify the effects of individual deleterious variants. With lines that have reduced yield and a better understanding of the nature of changes generated by sodium azide and the sequence contexts in which they occur, there is the possibility of training machine-learning models to predict which variants contribute to harmful phenotypic change [43,44]. Then, it will be possible to rank the expected effect size of each harmful variant and combine the predictions with existing genomic prediction approaches to benefit plant and animal breeding programs [45] and the study of complex human diseases. All of these applications require a means to identify the most harmful variants.
Materials and methods
Plant materials and mutagenesis
Barley from a Morex seed stock was treated with sodium azide following the protocol in [46]. Morex is a 6-row malting barley variety used as the primary reference genome [25]. Morex was chosen for these experiments to facilitate the identification and isolation of de novo variants because the reference makes it easier to distinguish between variants that are different between our parent sample and the reference genome versus variants that arose from the mutagen. The Morex line in this experiment traces back to the parent seed stock used to generate the Steptoe x Morex doubled haploid barley mapping population [47]. To generate sufficient Morex seeds for sodium azide mutagenesis, 120 seeds were planted from a single Morex plant (S1 Fig). Next, 200 seeds from the resulting bulk of seeds were planted; this was repeated one more time. A portion of the seeds was treated with sodium azide (1 mM NaN3) following the [46] protocol; the remaining portion of untreated Morex seeds was planted for another round of seed bulking and then planted to collect leaf tissue for sequencing (S1 Fig). After mutagenesis, the resulting mutagenized seeds were grown to maturity and harvested, forming the M1 generation. These individuals then underwent single-seed descent until M5.
Estimates for the expected number of spontaneous mutations occurring without mutagen treatment were calculated using the experimental design for seven generations of self-fertilization (S1 Fig) and rate estimates from [30]. Per-site, per-generation mutation rates for SNPs and for 1–3 bp indels were taken from [30]. For the 820 Mbp (820,594,305 bp) callable regions, a diploid genome size of 1,641,188,610 bp was used. For estimates of the 4.2 Gbp (4,225,605,719 bp) genome size [25], a diploid genome size of 8,451,211,438 bp was used. Each generation, new mutations appear in the heterozygous state, and the number of new heterozygous mutations (Nhet) is given by
The experimental design involved multiple generations of selfing, meaning de novo mutations from previous generations are being lost or fixed over time. In each generation, heterozygous mutations are inherited (Ihet) and are represented by
where x is the current generation. Similarly, each generation homozygous mutations are inherited (Ihom) and are represented by
The number of heterozygous and homozygous spontaneous mutations accumulated by generation seven in our experimental design is given by
and is used as our estimated number of spontaneous mutations that would have been present without mutagen treatment.
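The accumulation logic described above can be sketched numerically. The code below is not the set of equations used in this study; it is a minimal illustration that assumes standard Mendelian segregation under selfing with no selection: each generation, half of the inherited heterozygous mutations remain heterozygous, one quarter become homozygous for the new allele, and one quarter are lost, while a constant number of new heterozygous mutations (rate × diploid genome size) is added. The function name and arguments are hypothetical.

```python
def expected_spontaneous(mu: float, diploid_sites: float, generations: int):
    """Expected heterozygous and homozygous spontaneous mutations after
    `generations` rounds of selfing (illustrative model, see lead-in)."""
    n_new = mu * diploid_sites   # new heterozygous mutations per generation
    het = 0.0                    # heterozygous mutations carried by the line
    hom = 0.0                    # mutations fixed as homozygous
    for _ in range(generations):
        hom = hom + 0.25 * het   # a quarter of inherited hets become fixed
        het = 0.5 * het + n_new  # half stay heterozygous, plus new mutations
    return het, hom

if __name__ == "__main__":
    # 1-3 bp indel rate from [30] and the diploid callable genome size.
    het, hom = expected_spontaneous(0.45e-9, 1_641_188_610, 7)
    print(het, hom, het + hom)
```

Under these assumptions, the sketch gives about 3.3 spontaneous indels in the callable regions after seven generations, the same order as the ~3.5 reported in the Discussion; the difference presumably reflects details of the original equations that are not reproduced here.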
Phenotypic data collection
Twenty-five M5:7 mutated lines and the F1:2 W2017 Morex parent line were evaluated in yield trials with two replicates at one location (St. Paul, MN) in 2020 and three locations in Minnesota (Crookston, Lamberton, and St. Paul) in 2021 and 2022. Lines were grown in a randomized complete block design. Phenotypic data on grain yield, heading days after planting (DAP), height, and lodging were collected. Check varieties were used to adjust for spatial variation across trial plots for traits with a continuous scale (yield, heading DAP, and height). Spatial adjustments were performed using the R package mvngGrAd [48]. Check lines were chosen because of well-quantified field performance and included Conlon, FEG141–20, Lacey, ND20448, ND26104, ND_Genesis, Pinnacle, and Rasmusson.
Whole-genome short-read and long-read sequencing
We generated whole-genome sequencing in 25 barley (Hordeum vulgare ssp. vulgare) accessions: the parent of the mutagenized lines (W2017 Morex), 11 mutagenized lines, and 13 barley landraces [31] for comparative analyses (Tables 1 and S1). High molecular weight genomic DNA was extracted from 4-6 week-old leaf tissue collected on ice using the Cytiva Nucleon PhytoPure kit for the mutagenized lines. We sequenced three of the 11 M5 mutagenized lines (M01, M20, and M29) and the W2017 Morex line using 10X Genomics linked read library preparation followed by Illumina NovaSeq 6000 sequencing with 150-bp paired-end technology to a target depth of 40x. For the remaining eight M5 mutagenized lines (M02, M11, M14, M28, M35, M36, M39, and M41), libraries were prepared using Illumina DNA Prep followed by Illumina NovaSeq 6000 sequencing with 150-bp paired-end technology to a target depth of 16x. Sequences for the 13 landraces were generated using Illumina TruSeq DNA Nano Prep followed by Illumina NovaSeq 6000 sequencing with 150-bp paired-end technology to a target depth of 40x.
We used Oxford Nanopore Technologies (ONT) to sequence W2017 Morex and three of the 11 M5 mutagenized lines (M01, M20, and M29, see Tables 1 and S1). This data was collected to provide read-level confirmation of SVs indicated by the Illumina short-read resequencing. For sample M01, following the ONT protocol, a high molecular weight gDNA extraction was performed using the Qiagen Genomic-tip kit (10262) with Carlson Lysis buffer (10450002–1). High molecular weight gDNA extractions for samples M20 and M29 were generated using the NucleoBond HMW DNA kit (740160.20) from Takara Bio USA. Size selection was performed using Circulomics SRE buffer, and DNA was quantified using the Qubit assay. The libraries were prepared with 400 ng of gDNA using the Rapid Sequencing Kit (SQK-RAD004) following the protocol version RSE_9046_v1_revT_14Aug2019. The library was primed using the flow cell priming kit (EXP-FLP002), then 400 ng of the library was loaded onto an R9.4.1 flow cell (FLO-MIN106D). M01 was run for 72 hours on a MinION Mk1C (MIN-101C). We found the flow cells were no longer collecting new data after 24 hours and modified the run for the remaining two samples. For M20 and M29, 400 ng of the library was loaded onto an R9.4.1 flow cell, run for 12 hours, and then paused. At this point, the pores were cleared using the flow cell wash kit (EXP-WSH004), then 400 ng of additional library was loaded and run for another 12 hours before the run was paused. Again, pores were cleared with the flow cell wash kit, then 400 ng of an additional library was loaded. The flow cell was run for an additional 24–30 hours. Three reactions were run for a single flow cell; a single sample with two washes ran for a total of 48–56 hours. This approach produced the highest data output for our samples. This process was repeated until each mutagenized line was sequenced to a target depth of 2-3x (S1 Table). 
Basecalling was performed using Guppy v5.0.12+eb1a981 (for all runs except one) and Guppy v5.0.17+99baa5b (for M01 run #3) with the default settings on the MinION Mk1C.
Read mapping and variant calling
Read alignment and variant calling for the eight mutagenized and 13 landrace WGS lines were processed using the sequence_handling workflow (https://github.com/MorrellLAB/sequence_handling), which integrates publicly available software into a series of bash scripts [49]. The configuration files, which identify software versions and parameters, and scripts are available in the GitHub repositories https://github.com/MorrellLAB/hybrid_barley and https://github.com/MorrellLAB/Barley_Mutated. Reads were aligned against the third version of the barley Morex reference genome (Morex_v3) [25] with parameters adjusted to account for the level of nucleotide diversity in barley. Variants were called as part of a larger set of samples and followed the Genome Analysis Toolkit (GATK) best practices recommendations [28,29]. SNPs underwent GATK VariantRecalibrator with the following as input: filtered variants and SNPs from genotyping assays, which include 2,975 BOPA SNPs [50], 7,541 9K SNPs [51], and 41,813 50K SNPs [52]. Only polymorphic and biallelic SNPs were included. Additional SNP filtering criteria include allele balance [53,54] deviation of 0.1, proportion heterozygous genotypes at a site > 0.1, per sample minimum DP < 5, per sample maximum DP > 158, proportion missing genotypes at a site > 0.30, QUAL < 30, and GQ < 9. SNPs identified in a barley Sanger resequencing dataset [55,56] were used for validation. Indels were filtered following GATK’s Best Practices Guidelines for hard filtering since we did not have enough truth and training datasets to run indels through VariantRecalibrator. All filtering criteria are detailed in the scripts available in the GitHub repository https://github.com/MorrellLAB/Barley_Mutated.
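The SNP filtering thresholds listed above can be summarized in a short sketch. This is not the sequence_handling or GATK implementation; the actual filters are in the cited GitHub repositories. The data structure and function names are hypothetical, and several details are assumptions: allele balance is measured as deviation from 0.5 in heterozygous genotypes, genotypes failing depth or GQ thresholds are treated as missing, and the heterozygous proportion is computed among retained genotypes.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_DP, MAX_DP, MIN_GQ = 5, 158, 9          # per-sample thresholds from the text
MIN_QUAL, MAX_HET, MAX_MISSING = 30, 0.1, 0.30
MAX_AB_DEV = 0.1                             # allele balance deviation

@dataclass
class Genotype:
    depth: int
    gq: int
    is_het: bool
    alt_fraction: Optional[float] = None     # fraction of reads with the alt allele

def genotype_passes(g: Genotype) -> bool:
    """Per-sample checks: depth, genotype quality, allele balance."""
    if g.depth < MIN_DP or g.depth > MAX_DP or g.gq < MIN_GQ:
        return False
    if g.is_het and g.alt_fraction is not None:
        return abs(g.alt_fraction - 0.5) <= MAX_AB_DEV
    return True

def site_passes(qual: float, genotypes: List[Optional[Genotype]]) -> bool:
    """Site-level checks; None marks a missing genotype."""
    if qual < MIN_QUAL or not genotypes:
        return False
    kept = [g for g in genotypes if g is not None and genotype_passes(g)]
    if (len(genotypes) - len(kept)) / len(genotypes) > MAX_MISSING or not kept:
        return False
    return sum(g.is_het for g in kept) / len(kept) <= MAX_HET
```

Treating failed genotypes as missing before computing site-level proportions is one common convention; other orderings would change which sites are retained near the thresholds.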
For the four 10X Genomics samples (W2017 Morex and three mutagenized lines, see S1 Table), reads were aligned to the Morex v3 reference genome, and variants were called with the 10X Genomics software, Long Ranger v2.2.2. The Long Ranger pipeline processes the Chromium-prepared sequencing samples. Variants were filtered based on filters generated by Long Ranger, which include: 10X_QUAL_FILTER, 10X_ALLELE_FRACTION_FILTER, 10X_PHASING_INCONSISTENT, 10X_HOMOPOLYMER_UNPHASED_INSERTION, 10X_RESCUED_MOLECULE_HIGH_DIVERSITY, and LOWQ. SNPs and 1-bp indels were filtered to sites with per sample DP between 5 and 78 and an allele balance deviation of +/- 0.2 (from the expected 0.5) for heterozygous genotypes.
For the ONT data of W2017 Morex, M01, M20, and M29, read quality and summary statistics were generated with NanoPlot v1.38.1 and pycoQC [57]. Adapters were trimmed with Porechop v0.2.4 [58]. Reads were then aligned to the barley Morex v3 reference genome using Minimap2 v2.17 [59] with parameters recommended for ONT sequence reads. The resulting SAM files were then realigned using a modified version of the Vulcan pipeline [60] (customized version, https://gitlab.com/ChaochihL/vulcan), which utilizes NGMLR [61] for read realignment, converted to BAM format, and sorted using Samtools v1.9 [62]. Structural variants were then called using Sniffles v2.0.3 [61,63].
We also used publicly available Morex data sequenced with PacBio CCS reads (BioProject PRJEB40587, ERR numbers ERR4659245-ERR4659249) [25]. We used HiFiAdapterFilt [64] to filter adapters. Reads were aligned using Minimap2 v2.17 [59] with parameters recommended for PacBio sequence reads. Similar to the ONT data, the resulting SAM files were realigned using a modified version of the Vulcan pipeline [60] (customized version, https://gitlab.com/ChaochihL/vulcan), converted to BAM format, and sorted using Samtools v1.9 [62]. Structural variants were called using Sniffles v2.0.3 [61,63].
For all datasets in this study, mapped coverage was estimated using Mosdepth v0.3.1 [65]. All mapping parameters and filtering criteria are detailed in scripts available in the GitHub repository (https://github.com/MorrellLAB/Barley_Mutated).
Identifying de novo variants
Part 1: Finding differences between Morex samples and Morex reference genome.
To identify de novo variants induced by the mutagen, we generated a list of regions where variants were called in Morex-sample2 (W2017 Morex); these are differences between the Morex parent in this study and the Morex reference genome. Variants called in the Morex ONT and PacBio data from [25] were also counted as differences from the reference; these are potentially due to heterogeneity (variation among individuals in the Morex variety). To minimize spurious SV calls in difficult-to-call regions for the 10X Genomics, ONT, and PacBio data, we filtered out variants that overlap with uncallable regions, which include annotated repeats, stretches of N’s in the reference genome sequence, and “high copy” regions (i.e., regions where plastids, rDNA repeats, and centromere repeats align). For the ONT and 10X Genomics data, SVs that overlap low complexity regions (defined as regions containing low-copy sequence) were also filtered out because they can be non-biological artifactual sequences that result in unmapped sequences or sequences mapped to multiple locations. Low complexity regions were generated using BBMask from BBTools (BBMap – Bushnell B. – sourceforge.net/projects/bbmap/) with entropy set at 0.7, which was determined through data exploration to capture a majority of low complexity sequences (scripts available at https://github.com/MorrellLAB/morex_reference/tree/master/morex_v3). For the ONT and PacBio data, SVs were filtered out if they had fewer than five supporting reads. For Morex-sample2, SNPs in 100 bp windows with >2% diversity were filtered out. Such high diversity windows are unlikely when aligning Morex-sample2 to the Morex reference genome. SVs were then visually inspected in IGV or scored in SV-plaudit (in the case of the 10X Genomics larger deletions), and filtering criteria were tuned if necessary (summarized in S2 Fig).
The filtered SVs form a high-confidence set of regions where de novo variants should not be called, either because the regions are difficult to align or because they are heterogeneous among Morex individuals.
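The 100 bp window filter applied to Morex-sample2 SNPs can be illustrated as follows. This is a minimal sketch under stated assumptions, not the script used in this study: windows are fixed and non-overlapping, positions are 1-based on a single chromosome, and "diversity" is taken as the number of SNP positions divided by the window length. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set

def high_diversity_windows(snp_positions: Iterable[int],
                           window: int = 100,
                           max_diversity: float = 0.02) -> Set[int]:
    """Return 0-based indices of windows whose SNP density exceeds the cutoff."""
    counts = Counter((pos - 1) // window for pos in set(snp_positions))
    return {w for w, n in counts.items() if n / window > max_diversity}

def filter_snps(snp_positions: Iterable[int],
                window: int = 100,
                max_diversity: float = 0.02) -> List[int]:
    """Drop SNPs that fall in high-diversity windows."""
    positions = sorted(set(snp_positions))
    flagged = high_diversity_windows(positions, window, max_diversity)
    return [p for p in positions if (p - 1) // window not in flagged]

if __name__ == "__main__":
    # Three SNPs in the first 100 bp window (3% > 2%) are removed.
    print(filter_snps([5, 20, 77, 150, 1001]))
```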
Part 2: de novo filtering mutated individuals.
For the 10X Genomics and ONT sequenced mutagenized lines (M01, M20, M29), SVs that overlap uncallable regions or low complexity regions were filtered out (S3 Fig). To benefit from using the strengths of distinct SV callers for ONT data, we utilize Sniffles2 [60] and cuteSV [66]. SVs called by cuteSV and Sniffles2 in the ONT data were filtered similarly, except in the Sniffles2 calls, we required at least five supporting reads. For all sequenced mutagenized lines (10X Genomics, ONT, and Illumina WGS), variants that overlap the “differences from reference” regions were filtered out (summarized in S2 Fig). SNPs identified in the mutagenized samples that also appear in the BOPA (Barley Oligo Pooled Assay 1 and 2) on the Illumina Golden Gate genotyping platform [50], Barley 9K Illumina Infinium iSelect Custom Genotyping BeadChip [67], and Barley 50K iSelect SNP array [52] panels were also excluded. Variants were filtered to those private to individual mutagenized samples, meaning the variant only exists in one of the mutagenized samples at a genomic position. This is based on the expectation that variants induced by the mutagen are new (arose after the mutagen treatment was applied) and unique to each individual. So, variants identified in the mutagenized lines that also exist in the Morex samples are likely due to heterogeneity in Morex and were not generated by the mutagen treatment. Again, variants were visually inspected in IGV [68] or igv-reports (https://github.com/igvteam/igv-reports), and filtering criteria were tuned if necessary.
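The "private variant" criterion described above amounts to a set operation across samples. The sketch below is illustrative only: it assumes variants are keyed by (chromosome, position), that an exclusion set holds positions from the "differences from reference" regions and SNP array panels, and that all names are hypothetical rather than taken from the study's scripts.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position)

def private_variants(calls_by_sample: Dict[str, Set[Site]],
                     excluded: FrozenSet[Site] = frozenset()) -> Dict[str, Set[Site]]:
    """Keep sites seen in exactly one mutagenized sample and not excluded."""
    counts = Counter(site for sites in calls_by_sample.values() for site in sites)
    return {
        sample: {s for s in sites if counts[s] == 1 and s not in excluded}
        for sample, sites in calls_by_sample.items()
    }

if __name__ == "__main__":
    calls = {
        "M01": {("chr1H", 100), ("chr1H", 500)},
        "M20": {("chr1H", 100), ("chr2H", 42)},
        "M29": {("chr3H", 7)},
    }
    # chr1H:100 is shared by two lines; chr3H:7 is in the exclusion set.
    print(private_variants(calls, excluded=frozenset({("chr3H", 7)})))
```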
An image scoring approach was used to verify the larger deletions in the 10X Genomics three mutagenized samples. Images of the SVs were created with the SamPlot software [69]. Following the pipeline implemented in SV-Plaudit [70], the SV images were stored in the Amazon Web Services cloud storage and scored by multiple investigators through the PlotCritic website (https://github.com/jbelyeu/PlotCritic) based on the following criteria: coverage, insert size, and linked/split read evidence. This produced a set of scored deletions where a majority of scorers confirmed read evidence for the variant. This confirmed variant set was then verified using the ONT-sequenced samples (M01, M20, and M29).
Nucleotide composition of variants
Mutation Motif [9] was used to identify the most frequent sequence motifs that affect SNPs in the mutated, rare, and common variant classes. This program performs comparative statistical analyses of neighborhoods (5 bp windows) centered around each SNP to identify the frequency of sequence motifs and the influence of neighboring bases for each SNP. The neighborhood of each SNP (e.g., C → T mutation) is compared to the neighborhood of a reference occurrence of the same nucleotide (e.g., C) randomly sampled from within +/- 100 bp of the mutation. Comparing the motifs associated with mutated, rare, and common classes of SNPs allowed us to identify specific motifs that sodium azide may preferentially target. The number of 2- and 3- bp motifs in the 4.2 Gb reference genome and 820 Mb callable regions was calculated using the EMBOSS [71] compseq tool.
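The neighborhood comparison described above can be illustrated with a small sketch. This is not the Mutation Motif API; it only mimics the sampling idea under stated assumptions: a 5 bp neighborhood is the mutated base plus two flanking bases on each side, and the control is the neighborhood of another occurrence of the same base chosen at random within ±100 bp. Function names and the 0-based coordinate convention are hypothetical.

```python
import random
from typing import Optional

def neighborhood(seq: str, pos: int, flank: int = 2) -> Optional[str]:
    """Return the window of 2*flank+1 bases centered on pos (0-based)."""
    if pos - flank < 0 or pos + flank >= len(seq):
        return None
    return seq[pos - flank:pos + flank + 1]

def sample_control(seq: str, pos: int, flank: int = 2, max_dist: int = 100,
                   rng: Optional[random.Random] = None) -> Optional[str]:
    """Neighborhood of a random occurrence of the same base near pos."""
    rng = rng or random.Random()
    start = max(flank, pos - max_dist)
    stop = min(len(seq) - flank, pos + max_dist + 1)
    candidates = [i for i in range(start, stop) if i != pos and seq[i] == seq[pos]]
    if not candidates:
        return None
    return neighborhood(seq, rng.choice(candidates), flank)

if __name__ == "__main__":
    seq = "AACCGTTACGGA"
    print(neighborhood(seq, 2))                      # window around the mutated C
    print(sample_control(seq, 2, rng=random.Random(0)))
```

Comparing base frequencies at each flanking position between the mutated and control neighborhoods is what allows position-specific effects, such as a C at the +1 position, to be detected.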
Deleterious predictions
Ensembl Variant Effect Predictor (VeP) [32] was used to determine the predicted effect of each variant in the filtered VCF file, which includes SNPs, insertions, and deletions. Identification of nonsynonymous variants (includes missense, start lost, stop gained, and stop lost variants) used gene models provided by [25]. Nonsynonymous variants for the mutated samples and landraces were extracted from the VeP reports and assessed using BAD_Mutations [17,20], which includes a likelihood ratio test [72] that compares codon conservation across Angiosperm species to determine if a base substitution is likely to be deleterious. We ran the BAD_Mutations pipeline with a set of 72 Angiosperm species genome sequences that are available through Phytozome v13 (https://phytozome-next.jgi.doe.gov/, last accessed November 29, 2021) and Ensembl Plants (http://plants.ensembl.org/, last accessed November 29, 2021). We ran BAD_Mutations using 35,827 primary transcripts. A SNP was annotated as deleterious if the P-value for the test was < 0.05 with a multiple tests correction based on the number of tested codons, minimum of 10 sequences, maximum constraint of 1, and if the alternate or reference allele was not seen in any of the other species. Our thresholds for the three data groups were 8.1E-5 (611 codons tested) for the mutated samples, 5.1E-6 (9,716 codons tested) for the rare variants, and 3.4E-6 (14,537 codons tested) for the common variants. SNPs that failed this set of criteria were annotated as tolerated.
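The three significance thresholds above are consistent with a Bonferroni-style correction of 0.05 divided by the number of codons tested. The sketch below reproduces that arithmetic; it assumes this simple division is the correction applied, and the function name is illustrative rather than part of BAD_Mutations.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

if __name__ == "__main__":
    codons_tested = {"mutated": 611, "rare": 9_716, "common": 14_537}
    for group, n in codons_tested.items():
        print(group, f"{bonferroni_threshold(0.05, n):.3e}")
```

Truncated to two significant figures, the computed values agree with the reported thresholds of 8.1E-5, 5.1E-6, and 3.4E-6.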
Supporting information
S1 Fig. Experimental design for generating the sodium azide treated lines.
Gx is the generation used for spontaneous mutation estimates (see Methods).
https://doi.org/10.1371/journal.pgen.1011634.s001
(TIFF)
S2 Fig. Flowchart of part 1 of variant filtering to identify differences between the Morex parent of the mutagenized lines and the Morex reference genome.
These are regions where induced mutations should not be called as they are more likely to be variation among Morex individuals.
https://doi.org/10.1371/journal.pgen.1011634.s002
(TIFF)
S3 Fig. Flowchart of part 2 of variant filtering to identify de novo variants induced by sodium azide.
Larger SVs were visually evaluated with a similar approach as in part 1 of the filtering (S2 Fig).
https://doi.org/10.1371/journal.pgen.1011634.s003
(TIFF)
S4 Fig. The distribution of larger deletion sizes in three mutagenized lines (M01, M20, and M29) as called by 10X Genomics Longranger, Sniffles2 (ONT), and cuteSV (ONT).
https://doi.org/10.1371/journal.pgen.1011634.s004
(TIFF)
S5 Fig. The distribution of larger insertion sizes in three mutagenized lines (M01, M20, and M29) as called by Sniffles2 (ONT) and cuteSV (ONT).
Larger insertions were not called in the 10X Genomics dataset.
https://doi.org/10.1371/journal.pgen.1011634.s005
(TIFF)
S6 Fig.
A) A summary of the percentage and number of SNPs and indels in genic regions that can potentially disrupt genes for sodium azide-treated samples and rare vs. common categories. B) Per sample breakdowns of the number and percentage of SNVs and indels that are in genic regions and can potentially disrupt genes.
https://doi.org/10.1371/journal.pgen.1011634.s006
(TIFF)
S7 Fig. The functional effects of mutagenized, untreated rare, and untreated common SNVs/SNPs as annotated by VeP. Bars are labeled with each consequence type’s percentage and number of SNVs/SNPs.
Boxes on the left side indicate the impact classification of the consequence type.
https://doi.org/10.1371/journal.pgen.1011634.s007
(TIFF)
S8 Fig. The functional effects of mutagenized, untreated rare, and untreated common indels as annotated by VeP. Bars are labeled with the percentage and number of indels in each consequence type.
Boxes on the left side indicate the impact classification of the consequence type.
https://doi.org/10.1371/journal.pgen.1011634.s008
(TIFF)
S9 Fig. The number of nonsynonymous and synonymous SNVs in each mutated sample.
https://doi.org/10.1371/journal.pgen.1011634.s009
(TIFF)
S10 Fig. The number of nonsynonymous SNVs in each mutated sample partitioned into “Deleterious” and “Tolerated”.
https://doi.org/10.1371/journal.pgen.1011634.s010
(TIFF)
S11 Fig. The number of nonsynonymous SNPs per covered codon in 10 Mb windows in mutated samples across the barley genome.
Nonsynonymous SNPs are separated into “Deleterious” vs. “Tolerated” and are plotted separately. The vertical grey line indicates the centromeric region.
https://doi.org/10.1371/journal.pgen.1011634.s011
(TIFF)
S12 Fig. The number of nonsynonymous SNPs per covered codon in 10 Mb windows categorized as “rare” in the landrace samples across the barley genome.
Nonsynonymous SNPs are separated into “Deleterious” vs. “Tolerated” and are plotted separately. The vertical grey line indicates the centromeric region.
https://doi.org/10.1371/journal.pgen.1011634.s012
(TIFF)
S13 Fig. The number of nonsynonymous SNPs per covered codon in 10 Mb windows categorized as “common” in the landrace samples across the barley genome.
Nonsynonymous SNPs are separated into “Deleterious” vs. “Tolerated” and are plotted separately. The vertical grey line indicates the centromeric region.
https://doi.org/10.1371/journal.pgen.1011634.s013
(TIFF)
S14 Fig. Representation of the CC and GG motif mutations that generate the most abundant amino acid changes identified as harmful in the SNVs.
Arrows show the within-codon changes; line thickness and the adjacent numbers indicate the corresponding Grantham score. A thicker line indicates a greater evolutionary distance between two amino acids. Red letters represent the mutations induced by the sodium azide-associated motifs CC or the reverse complement GG. Colors represent the primary properties of each amino acid, including polarity and acidity.
https://doi.org/10.1371/journal.pgen.1011634.s014
(TIFF)
S15 Fig. Frequency of amino acid changes for SNVs that annotate as tolerated versus deleterious in mutated lines.
https://doi.org/10.1371/journal.pgen.1011634.s015
(TIFF)
S16 Fig. A photograph of the mutagenized line M28 next to M29, which has a distinct onion-like, compact, atypical barley phenotype.
Both lines have been self-fertilized for four generations, and photos were taken five weeks after planting.
https://doi.org/10.1371/journal.pgen.1011634.s016
(TIFF)
S17 Fig. Grain yield for 25 mutated lines, the Morex W2017 parent, and eight check lines for three locations and three years.
The box plots are sorted by the median for each line, and the bars in the box plot indicate the mean. Red outlines represent sodium azide-treated lines. Red shaded boxes indicate mutated lines that were sequenced in this study.
https://doi.org/10.1371/journal.pgen.1011634.s017
(TIFF)
S1 Table. Detailed summary of samples sequenced in this study, which samples were treated with sodium azide, the library preparation and sequencing technology, and mapped average coverage.
https://doi.org/10.1371/journal.pgen.1011634.s018
(CSV)
S2 Table. Log-linear analysis performed by Mutation Motif for C → T variants induced by sodium azide compared to those originating spontaneously in the reference sequence.
Position is relative to the mutated base. Deviance is a likelihood ratio from the log-linear model. Degrees-of-freedom (df) and P-values are from the chi-squared distribution.
https://doi.org/10.1371/journal.pgen.1011634.s019
(XLSX)
S3 Table. The Pearson or Spearman correlation between the number of SNVs and the phenotypes across each functional class of variants.
https://doi.org/10.1371/journal.pgen.1011634.s020
(CSV)
Acknowledgments
The authors thank Ron Okagaki for help with the initial screening of the mutated lines; Lucie Lu for submitting the raw sequencing data to NCBI SRA; Elaine Lee for processing the 10X Genomics datasets using Longranger; Nadia Janis for scoring images of the deletions; Malik Samuel, Emily Vonderharr, Samuel Hamann, and Mackenzie Linane for help with growing out the first couple of generations and processing the seed; Erica Sun for making digital sketches of the barley plants, and Max Okagaki for help with S14 Fig. This research was carried out with software and hardware support provided by the Minnesota Supercomputing Institute (MSI) at the University of Minnesota. Syngenta Crop Protection Inc. provided DNA samples.
References
- 1. Henry IM, Nagalakshmi U, Lieberman MC, Ngo KJ, Krasileva KV, Vasquez-Gross H, et al. Efficient genome-wide detection and cataloging of EMS-induced mutations using exome capture and next-generation sequencing. Plant Cell. 2014;26(4):1382–97. pmid:24728647
- 2. Bolon Y-T, Stec AO, Michno J-M, Roessler J, Bhaskar PB, Ries L, et al. Genome resilience and prevalence of segmental duplications following fast neutron irradiation of soybean. Genetics. 2014;198(3):967–81. pmid:25213171
- 3. Wyant SR, Rodriguez MF, Carter CK, Parrott WA, Jackson SA, Stupar RM, et al. Fast neutron mutagenesis in soybean enriches for small indels and creates frameshift mutations. G3 (Bethesda). 2022;12(2):jkab431. pmid:35100358
- 4. Schneeberger K. Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nat Rev Genet. 2014;15(10):662–76. pmid:25139187
- 5. Kleinhofs A, Owais WM, Nilan RA. Azide. Mutat Res. 1978;55(3–4):165–95. pmid:107442
- 6. Owais WM, Kleinhofs A. Metabolic activation of the mutagen azide in biological systems. Mutat Res. 1988;197(2):313–23. pmid:3123923
- 7. Knudsen S, Wendt T, Dockter C, Thomsen HC, Rasmussen M, Egevang Jørgensen M, et al. FIND-IT: Accelerated trait development for a green evolution. Sci Adv. 2022;8(34):eabq2266. pmid:36001660
- 8. Food and Agriculture Organization of the United Nations. Manual on mutation breeding third edition. Rome, Italy: Food & Agriculture Org. 2018.
- 9. Zhu Y, Neeman T, Yap VB, Huttley GA. Statistical Methods for Identifying Sequence Motifs Affecting Point Mutations. Genetics. 2017;205(2):843–56. pmid:27974498
- 10. Li G, Jain R, Chern M, Pham NT, Martin JA, Wei T, et al. The sequences of 1504 Mutants in the Model Rice Variety Kitaake Facilitate Rapid Functional Genomic Studies. Plant Cell. 2017;29(6):1218–31. pmid:28576844
- 11. Belfield EJ, Gan X, Mithani A, Brown C, Jiang C, Franklin K, et al. Genome-wide analysis of mutations in mutant lineages selected following fast-neutron irradiation mutagenesis of Arabidopsis thaliana. Genome Res. 2012;22(7):1306–15. pmid:22499668
- 12. Morton BR, Bi IV, McMullen MD, Gaut BS. Variation in mutation dynamics across the maize genome as a function of regional and flanking base composition. Genetics. 2006;172(1):569–77. pmid:16219784
- 13. Zhu Y, Ong CS, Huttley GA. Machine Learning Techniques for Classifying the Mutagenic Origins of Point Mutations. Genetics. 2020;215(1):25–40. pmid:32193188
- 14. Food and Agriculture Organization of the United Nations. Manual on Mutation Breeding Third Edition. Food & Agriculture Org. 2018. 319 p.
- 15. Schaibley VM, Zawistowski M, Wegmann D, Ehm MG, Nelson MR, St Jean PL, et al. The influence of genomic context on mutation patterns in the human genome inferred from rare variants. Genome Res. 2013;23(12):1974–84. pmid:23990608
- 16. Eyre-Walker A, Keightley PD. The distribution of fitness effects of new mutations. Nat Rev Genet. 2007;8(8):610–8. pmid:17637733
- 17. Kono TJY, Fu F, Mohammadi M, Hoffman PJ, Liu C, Stupar RM, et al. The Role of Deleterious Substitutions in Crop Genomes. Mol Biol Evol. 2016;33(9):2307–17. pmid:27301592
- 18. Sunyaev S, Ramensky V, Koch I, Lathe W 3rd, Kondrashov AS, Bork P. Prediction of deleterious human alleles. Hum Mol Genet. 2001;10(6):591–7. pmid:11230178
- 19. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31(13):3812–4. pmid:12824425
- 20. Kono TJY, Lei L, Shih C-H, Hoffman PJ, Morrell PL, Fay JC. Comparative Genomics Approaches Accurately Predict Deleterious Variants in Plants. G3 (Bethesda). 2018;8(10):3321–9. pmid:30139765
- 21. Lu J, Tang T, Tang H, Huang J, Shi S, Wu C-I. The accumulation of deleterious mutations in rice genomes: a hypothesis on the cost of domestication. Trends Genet. 2006;22(3):126–31. pmid:16443304
- 22. Morrell PL, Buckler ES, Ross-Ibarra J. Crop genomics: advances and applications. Nat Rev Genet. 2011;13(2):85–96. pmid:22207165
- 23. Moyers BT, Morrell PL, McKay JK. Genetic Costs of Domestication and Improvement. J Hered. 2018;109(2):103–16. pmid:28992310
- 24. Olsen O, Wang X, von Wettstein D. Sodium azide mutagenesis: preferential generation of A.T-->G.C transitions in the barley Ant18 gene. Proc Natl Acad Sci U S A. 1993;90(17):8043–7. pmid:8367460
- 25. Mascher M, Wicker T, Jenkins J, Plott C, Lux T, Koh CS, et al. Long-read sequence assembly: a technical evaluation in barley. Plant Cell. 2021;33(6):1888–906. pmid:33710295
- 26. Haun WJ, Hyten DL, Xu WW, Gerhardt DJ, Albert TJ, Richmond T, et al. The composition and origins of genomic variation among individuals of the soybean reference cultivar Williams 82. Plant Physiol. 2011;155(2):645–55. pmid:21115807
- 27. Michno J-M, Stupar RM. The importance of genotype identity, genetic heterogeneity, and bioinformatic handling for properly assessing genomic variation in transgenic plants. BMC Biotechnol. 2018;18(1):38. pmid:29859067
- 28. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–8. pmid:21478889
- 29. Van der Auwera GA, O’Connor BD. Genomics in the cloud: using Docker, GATK, and WDL in Terra. O’Reilly Media. 2020.
- 30. Ossowski S, Schneeberger K, Lucas-Lledó JI, Warthmann N, Clark RM, Shaw RG, et al. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science. 2010;327(5961):92–4. pmid:20044577
- 31. Sommer L, Spiller M, Stiewe G, Pillen K, Reif JC, Schulthess AW. Proof of concept to unmask the breeding value of genetic resources of barley (Hordeum vulgare) with a hybrid strategy. Plant Breeding. 2019;139(3):536–49.
- 32. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17(1):122. pmid:27268795
- 33. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, et al. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6(5):R44. pmid:15892872
- 34. Talamè V, Bovina R, Sanguineti MC, Tuberosa R, Lundqvist U, Salvi S. TILLMore, a resource for the discovery of chemically induced mutants in barley. Plant Biotechnol J. 2008;6(5):477–85. pmid:18422888
- 35. Lundqvist U, Lundqvist A. Mutagen specificity in barley for 1580 eceriferum mutants localized to 79 loci. Hereditas. 2008;108(1):1–12.
- 36. Hansson M, Komatsuda T, Stein N, Muehlbauer GJ. Molecular mapping and cloning of genes and QTLs. The Barley Genome. Springer. 2018. p. 139–54.
- 37. Hansson M, Youssef HM, Zakhrabekova S, Stuart D, Svensson JT, Dockter C, et al. A guide to barley mutants. Hereditas. 2024;161(1):11. pmid:38454479
- 38. Monson-Miller J, Sanchez-Mendez DC, Fass J, Henry IM, Tai TH, Comai L. Reference genome-independent assessment of mutation density using restriction enzyme-phased sequencing. BMC Genomics. 2012;13:72. pmid:22333298
- 39. Greene EA, Codomo CA, Taylor NE, Henikoff JG, Till BJ, Reynolds SH, et al. Spectrum of chemically induced mutations from a large-scale reverse-genetic screen in Arabidopsis. Genetics. 2003;164(2):731–40. pmid:12807792
- 40. Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 2008;4(5):e1000083. pmid:18516229
- 41. Johnsson M, Gaynor RC, Jenko J, Gorjanc G, de Koning D-J, Hickey JM. Removal of alleles by genome editing (RAGE) against deleterious load. Genet Sel Evol. 2019;51(1):14. pmid:30995904
- 42. Smith KP, Thomas W, Gutierrez L, Bull H. Genomics-based barley breeding. The Barley Genome. Springer. 2018. p. 287–315.
- 43. Plekhanova E, Nuzhdin SV, Utkin LV, Samsonova MG. Prediction of deleterious mutations in coding regions of mammals with transfer learning. Evol Appl. 2018;12(1):18–28. pmid:30622632
- 44. Benegas G, Batra SS, Song YS. DNA language models are powerful predictors of genome-wide variant effects. Proc Natl Acad Sci U S A. 2023;120(44):e2311219120. pmid:37883436
- 45. Wallace JG, Rodgers-Melnick E, Buckler ES. On the Road to Breeding 4.0: Unraveling the Good, the Bad, and the Boring of Crop Quantitative Genomics. Annu Rev Genet. 2018;52:421–44. pmid:30285496
- 46. Döring H-P, Lin J, Uhrig H, Salamini F. Clonal analysis of the development of the barley (Hordeum vulgare L.) leaf using periclinal chlorophyll chimeras. Planta. 1999;207(3):335–42.
- 47. Kleinhofs A, Kilian A, Saghai Maroof MA, Biyashev RM, Hayes P, Chen FQ, et al. A molecular, isozyme and morphological map of the barley (Hordeum vulgare) genome. Theor Appl Genet. 1993;86(6):705–12. pmid:24193780
- 48. Frank T. R package mvngGrAd: moving grid adjustment in plant breeding field trials. 2015.
- 49. Liu C, Hoffman PJ, Wyant SR, Dittmar EL, Takebasahi N, Hamann S, et al. MorrellLAB/sequence_handling: Release v3.0: SNP calling with GATK 4.1 and Slurm compatibility. 2022.
- 50. Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N, Ramsay L, et al. Development and implementation of high-throughput SNP genotyping in barley. BMC Genomics. 2009;10:582. pmid:19961604
- 51. Comadran J, Ramsay L, MacKenzie K, Hayes P, Close TJ, Muehlbauer G, et al. Patterns of polymorphism and linkage disequilibrium in cultivated barley. Theor Appl Genet. 2011;122(3):523–31. pmid:21076812
- 52. Bayer MM, Rapazote-Flores P, Ganal M, Hedley PE, Macaulay M, Plieske J, et al. Development and Evaluation of a Barley 50k iSelect SNP Array. Front Plant Sci. 2017;8:1792. pmid:29089957
- 53. Pedersen BS, Brown JM, Dashnow H, Wallace AD, Velinder M, Tristani-Firouzi M, et al. Effective variant filtering and expected candidate variant yield in studies of rare human disease. NPJ Genom Med. 2021;6(1):60. pmid:34267211
- 54. Muyas F, Bosio M, Puig A, Susak H, Domènech L, Escaramis G, et al. Allele balance bias identifies systematic genotyping errors and false disease associations. Hum Mutat. 2019;40(1):115–26. pmid:30353964
- 55. Morrell PL, Toleno DM, Lundy KE, Clegg MT. Estimating the contribution of mutation, recombination and gene conversion in the generation of haplotypic diversity. Genetics. 2006;173(3):1705–23. pmid:16624913
- 56. Morrell PL, Gonzales AM, Meyer KKT, Clegg MT. Resequencing data indicate a modest effect of domestication on diversity in barley: a cultigen with multiple origins. J Hered. 2014;105(2):253–64. pmid:24336926
- 57. Leger A, Leonardi T. pycoQC, interactive quality control for Oxford Nanopore Sequencing. JOSS. 2019;4(34):1236.
- 58. Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom. 2017;3(10):e000132. pmid:29177090
- 59. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100. pmid:29750242
- 60. Fu Y, Mahmoud M, Muraliraman VV, Sedlazeck FJ, Treangen TJ. Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment. Gigascience. 2021;10(9):giab063. pmid:34561697
- 61. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6):461–8. pmid:29713083
- 62. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. pmid:19505943
- 63. Smolka M, Paulin LF, Grochowski CM, Mahmoud M, Behera S, Gandhi M, et al. Comprehensive structural variant detection: from mosaic to population-level. BioRxiv. 2022.
- 64. Sim SB, Corpuz RL, Simmonds TJ, Geib SM. HiFiAdapterFilt, a memory efficient read processing pipeline, prevents occurrence of adapter sequence in PacBio HiFi reads and their negative impacts on genome assembly. BMC Genomics. 2022;23(1):157. pmid:35193521
- 65. Pedersen BS, Quinlan AR. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics. 2018;34(5):867–8. pmid:29096012
- 66. Jiang T, Liu Y, Jiang Y, Li J, Gao Y, Cui Z, et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 2020;21(1):189. pmid:32746918
- 67. Comadran J, Kilian B, Russell J, Ramsay L, Stein N, Ganal M, et al. Natural variation in a homolog of Antirrhinum CENTRORADIALIS contributed to spring growth habit and environmental adaptation in cultivated barley. Nat Genet. 2012;44(12):1388–92. pmid:23160098
- 68. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92. pmid:22517427
- 69. Belyeu JR, Chowdhury M, Brown J, Pedersen BS, Cormier MJ, Quinlan AR, et al. Samplot: a platform for structural variant visual validation and automated filtering. Genome Biol. 2021;22(1):161. pmid:34034781
- 70. Belyeu JR, Nicholas TJ, Pedersen BS, Sasani TA, Havrilla JM, Kravitz SN, et al. SV-plaudit: A cloud-based framework for manually curating thousands of structural variants. Gigascience. 2018;7(7):giy064. pmid:29860504
- 71. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16(6):276–7. pmid:10827456
- 72. Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res. 2009;19(9):1553–61. pmid:19602639