Misidentification by farmers of the crop varieties they grow: Lessons from DNA fingerprinting of wheat in Ethiopia

Accurate identification of the crop varieties grown by farmers is crucial for crop management, food security, and varietal development and dissemination, among other purposes. One may expect varietal identification to be more challenging in developing countries, where literacy and education are limited and informal seed systems and seed recycling are common. This paper evaluates the extent to which smallholder farmers in Ethiopia misidentify their wheat varieties and explores the associated factors and their implications. The study uses data from a nationally representative household survey of wheat growers and DNA fingerprinting of seed samples from 3,884 wheat plots in the major wheat growing zones of Ethiopia. Only 28–34% of the farmers correctly identified their wheat varieties. Correct identification was positively associated with farmer education and seed purchases from trusted sources (cooperatives or known farmers) and negatively associated with seed recycling. Farmers’ varietal identification is therefore problematic and leads to erroneous results in adoption and impact assessments. DNA fingerprinting can enhance varietal identification but remains mute on contextual and explanatory factors. Thus, combining household survey and DNA fingerprinting approaches is needed to produce reliable varietal adoption and impact assessments and to generate useful knowledge to inform policy recommendations related to varietal replacement and seed systems development.


Introduction
Crop varietal identification in farmers' fields has long relied on eliciting farmers' own identification of their varieties in household surveys. Ethiopia is no exception, and several adoption and impact assessments of improved crop varieties have been conducted there relying on farmers' elicitation [1][2][3][4]. Farmers might report their variety with a specific name, possibly corresponding to a name listed in the national variety registry or to a locally adapted name. In some cases, farmers might not report a specific name but use generic terms such as 'improved' or 'local', or refer to the presumed origin. These farmer-reported names are typically taken at face value and subsequently used to estimate adoption rates and associated impacts. Of late, the reliability of varietal information collected through farmers' elicitation has been increasingly questioned with the advent of DNA fingerprinting [5][6][7][8][9][10][11][12]. Farmers' misidentification of their crop varieties is an example of measurement error. Survey measurement error arising from such farmer recall can be thought of as simple misreporting that is correctable through improved measurement [13], as is now possible with DNA fingerprinting. Correcting such varietal identification error is important for varietal development and dissemination, including replacing old varieties with better performing new ones [14] and documenting the adoption of improved crop varieties.
Varietal misidentification might also reflect misperceptions that materially affect the respondents' decisions under study [13]. Crop varieties respond differently to crop management (e.g. fertilization, timeliness), with important productivity and profitability considerations. Misidentification may thus result in production inefficiencies. For instance, farmers may apply fertilizer to what they think are improved varieties but are in fact unimproved; or, alternatively, may fail to apply fertilizer and other agronomic practices to what they think are local varieties but are in fact improved. Farmers' technology adoption decisions can indeed exhibit strong input complementarity for some inputs [15]. Food security considerations at household and national levels are of particular concern in relation to tracking and replacing specific varieties susceptible to diseases and insect pests. Smallholder farmers in Ethiopia rely on wheat for subsistence and livelihood security. Farmers can replace stress-susceptible varieties with more tolerant or resistant ones, as in the case of wheat rust epidemics [16]. Knowing the prevalence of susceptible wheat varieties is also key to helping policy makers target limited stocks of fungicide during wheat rust alerts in countries such as Ethiopia [17], and thereby to the effectiveness of policy interventions [15].
Misidentification may also have direct economic impacts on how markets function. For example, grain marketing and processing may be variety-dependent, as some varieties possess special traits preferred by consumers or processing industries (e.g. bread vs durum wheat varieties in Ethiopia). Misidentification may originate along the seed supply chain, including mixing with other varieties during seed production, processing and marketing [18], and information bottlenecks in seed delivery and extension [19].
One may expect varietal identification to be more challenging in developing countries with limited literacy and education, informal seed systems, and saving and recycling of seed [2][20][21][22]. In Ethiopia wheat seed is widely recycled and varieties are not proprietary, which reduces incentives for seed sellers to engage in possible fraud. Moreover, seeds from informal markets or sources could be heterogeneous and carry diverse local names. Official variety names may also be complicated for illiterate farmers (e.g. scientific names, numbers or names with locally unknown meanings), and farmers may rebrand varieties with easier local names, e.g. names that describe varietal features or are more easily remembered [23].
The contribution of this paper is two-fold. First, we establish the extent of crop varietal misidentification based on nationally representative data from a developing country. We do so using the case of wheat in Ethiopia and contrasting estimates of varietal identification using conventional household surveys based on farmers' elicitation with results from DNA fingerprinting (DNA FP) methods. Second, we identify some of the factors influencing farmers' ability to correctly identify the crop varieties they grow. The findings have various implications, particularly for breeding programs, varietal replacement and crop protection strategies in developing countries in general, and wheat in Ethiopia in particular. Importantly, a better understanding of these two facets provides a foundation for subsequent explorations of the economic implications of seed misidentification.

Reference library
In varietal identification using DNA FP, building a comprehensive reference library is key. From the 1960s to 2016, a total of 133 improved wheat varieties were released in Ethiopia. Of these 133 released varieties, breeder seeds of 111 varieties were collected from the research centres that developed and released them. Breeder seeds for the remaining 22 varieties could not be collected because of their age and the inability to trace viable sources. In the survey, only 1.2% of the farmers reported growing any of these 22 varieties (eight of them were reported), a proportion too small to affect the overall analysis results. For the collected breeder seeds, DNA was extracted from each variety at the Holeta National Agricultural Biotechnology Research Center (NABRC), Ethiopia. Carefully tagged, the DNA extracted from these breeder seeds was sent to Diversity Arrays Technology (DArT) in Australia to build a reference library for sequencing and identification of wheat samples collected from Ethiopia.

Sampling, data collection and generation procedure
Data used in this study for variety identification were collected from two sources: (1) DNA FP of grain samples collected from randomly selected wheat plots, and (2) survey data from the households operating the wheat plots from which the grain samples were collected. Ethiopia's Central Statistical Agency (CSA) led the data collection with technical support from the Ethiopian Institute of Agricultural Research (EIAR). Although there is no permanent internal review board at CSA, before the survey was conducted the survey instruments were assessed for compliance with ethical standards by a team of experts from the Agriculture, Natural Resources and Environmental Statistics Directorate (ANRESD) at CSA. The survey data were collected by experienced and well-trained enumerators hired by CSA who spoke the local languages fluently. Before each interview with a sample household head (or respondent), the enumerators explained the purpose of the study and the anonymity of all information provided. The enumerators then asked for the respondent's consent to continue with the interview. All sample households that responded to the survey went through this procedure and gave their full consent orally. CSA has long experience in collecting crop production data through its Agricultural Sample Survey (AgSS), for which it uses Enumeration Areas (EAs) as the minimum sampling unit. Considering the four major regional states (Oromia, Amhara, SNNPR, and Tigray), which together produce over 95% of Ethiopia's wheat, and the major wheat-producing administrative zones within them, CSA randomly selected 420 EAs for wheat data collection (Table 1).
In each randomly selected EA, a maximum of 10 wheat plots were in turn randomly selected from the wheat plots available during the 2016/17 main cropping season. It is worth noting that the number of wheat samples in EAs with lower wheat potential could be smaller because there were too few wheat plots within the specific EA. Within each randomly selected wheat plot, wheat grain samples were collected from a randomly identified 4m-by-4m quadrat. Plot owners were also interviewed to obtain household and farm level data, including the names of the wheat varieties grown on the sample plots. The survey data and the collected grain samples were transferred to Holeta NABRC for data entry and DNA extraction, respectively. To accurately trace the survey and the associated DNA FP data, both the survey questionnaire and the grain bag for each wheat plot were tagged with the same unique barcode. Extra copies of the unique barcodes were kept in the grain bags for use during DNA extraction and when shipping the DNA samples to DArT for DNA fingerprinting.
The crop-cut data for yield estimation were taken from a 4m-by-4m sub-plot randomly selected within a specific wheat plot. The wheat plots were also randomly selected from the existing list of wheat plots in a pre-selected EA. Grain obtained from the 16 m² sub-plot area was dried to at least 12.5% moisture content before the final weight measurement was taken. After measurement, 250 g of the harvested and dried grain was taken to Holeta NABRC for DNA extraction. Part of the grain from each sample was ground and DNA was extracted following a standard Zymo kit protocol at NABRC. The extracted DNA samples were shipped to Diversity Arrays Technology (DArT) in Australia for DNA fingerprinting/sequencing following the DArTseq method. For genotyping by sequencing with DArTseq, a combination of DArT complexity reduction methods and next generation sequencing platforms was used [24,25]. The remaining grain in each bag was kept in a cold room as backup and for future use. Farmers were also interviewed using a survey instrument to elicit the name of the specific variety and the agronomic practices used on each of the plots from which crop-cuts were taken. The survey data and the DNA sequencing results obtained from DArT were merged for analysis using the unique identification barcodes on both.
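The barcode-based merge of survey and DNA FP records described above can be sketched as a simple keyed join. This is a minimal illustration, not the study's actual data pipeline: the field names (`barcode`, `farmer_variety`, `dna_variety`), the barcode values, and the example records are all hypothetical.

```python
def merge_by_barcode(survey_rows, dna_rows):
    """Join survey and DNA FP records on their shared plot barcode.

    Survey rows without a DNA result are skipped (an assumed rule,
    mirroring samples dropped before genotyping).
    """
    dna_by_barcode = {row["barcode"]: row for row in dna_rows}
    merged = []
    for row in survey_rows:
        dna = dna_by_barcode.get(row["barcode"])
        if dna is None:
            continue  # no usable DNA result for this plot
        merged.append({**row, "dna_variety": dna["dna_variety"]})
    return merged


# Hypothetical records for three plots; one has no DNA result.
survey = [
    {"barcode": "ET0001", "farmer_variety": "Kakaba"},
    {"barcode": "ET0002", "farmer_variety": "local"},
    {"barcode": "ET0003", "farmer_variety": "Kubsa"},
]
dna = [
    {"barcode": "ET0001", "dna_variety": "Kubsa"},
    {"barcode": "ET0003", "dna_variety": "Kubsa"},
]
merged = merge_by_barcode(survey, dna)
```

Keying on the barcode rather than on household or plot names avoids ambiguity when several plots belong to the same household.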

Survey and DNA sequenced data
During the 2016/17 data collection, 3,884 sample wheat plots were surveyed and grain samples were collected for DNA extraction and fingerprinting. A total of 3,771 wheat samples were genotyped against the 111 unique varieties in the reference library, with 123 (3%) samples dropped for various reasons, including DNA quality. Through DNA FP, 3,543 (94%) of the samples were identified with specific wheat variety names against the existing reference library (Table 2). The remaining 228 samples (6%) were left unclassified by DNA FP, i.e. these samples do not map to any identified variety in the reference library. This could be the case for local varieties or (old) improved varieties not included in the library, for which farmer identification could potentially be correct. It could also include impure varieties and cases that farmers nonetheless misidentified. Therefore, we added the 6% to the correctly matched set to provide the potential upper bound of correctly identified varieties, i.e., 28-34%.
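The shares and the 28-34% range reported above follow from simple arithmetic on the sample counts. The sketch below reproduces that arithmetic using the counts given in the text; the helper name `percent` is illustrative, and the 28% match share is taken as given from the paper rather than recomputed.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


genotyped = 3771     # samples genotyped against the reference library
identified = 3543    # matched to a specific variety in the library
unclassified = 228   # no match to any variety in the library

identified_pct = percent(identified, genotyped)      # share identified
unclassified_pct = percent(unclassified, genotyped)  # share unclassified

# Lower bound: exact matches only (28%, as reported in the paper).
# Upper bound: additionally treat every unclassified sample as
# potentially correctly identified by the farmer.
lower_bound_pct = 28
upper_bound_pct = lower_bound_pct + unclassified_pct
```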

Empirical approaches
A binary logit model was used to assess the factors explaining the mismatch between the variety names reported by farmers and those identified by DNA sequencing. Taking the varietal names identified by DNA sequencing as the reference, a value of 1 is assigned to cases where farmers accurately identified the varieties they grew and 0 otherwise. Household characteristics, the seed sources from which the initial seeds of the varieties were obtained, and plot characteristics are considered in the regression analysis. To further explore the mismatch between varietal names, we sub-categorized the mismatched samples into four: (1) those that did not match but were reported by farmers with a specific name in the national variety registry, (2) those reported by farmers with a specific name that is not in the registry, (3) those reported with the generic name 'Improved variety but specific name not known', and (4) those reported as 'Local variety but specific name not known'. Taking samples with an exact match as the reference, we applied a multinomial logit model to assess the variables explaining the likelihood that farmers' varietal identification falls in any one of these mismatch categories. With J possible categories under which a farmer-reported varietal name could fall, a multinomial logit model is specified as:
$$P(y = j \mid X) = \frac{\exp(X\beta_j)}{\exp(X\beta_j) + \sum_{m \neq j} \exp(X\beta_m)}, \quad \text{where } j = 0, 1, 2, \ldots, J \text{ and } m \neq j \qquad (1)$$

PLOS ONE
Where P(y = j|X) is the probability that farmers' elicitation matches the DNA FP result or falls in one of the four mismatch cases stated above, X is a vector of covariates affecting farmers' varietal knowledge, and β is a vector of parameters to be estimated. When reporting the wheat grown, in 50% of the cases farmers' elicitation picked specific varietal names in the national registry (Table 2). However, 309 farmers reported a variety under the name of a recently released one when the variety, as identified by DNA FP, was actually older than the one reported. Assuming that recently released varieties have better rust resistance and yield performance, such a report is 'false' but 'positive', as it picked the name of a recently released variety when the actual variety was an old release. On the other hand, 436 farmers reported a variety under the name of an old variety when the variety actually grown was a recent release; that is a 'false' and 'negative' report, as it picked an old variety name when the variety was actually a recent release. Such a mix-up of variety names in farmers' elicitation, and in particular the mixing of names of varieties of different ages, has consequences for the varietal replacement strategies farmers follow. To explore this, we used a multinomial logit estimation with accurately identified varieties as the reference.
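The category probabilities of a multinomial logit model can be computed as in the sketch below. It assumes the common normalisation in which the reference category (exact match) has all coefficients fixed at zero; the covariates and coefficient values are invented for illustration and are not estimates from this study.

```python
import math


def mnl_probabilities(x, betas):
    """Return P(y = j | x) for every category j.

    x     : list of covariate values (including a leading 1 for the intercept)
    betas : one coefficient list per category; betas[0] is the reference
            category and is assumed to be all zeros.
    """
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: intercept plus years of schooling, five categories
# (exact match plus the four mismatch categories).
x = [1.0, 8.0]
betas = [
    [0.0, 0.0],    # reference: exact match
    [0.5, -0.05],  # mismatch, registered name
    [0.2, -0.10],  # mismatch, unregistered name
    [0.1, -0.08],  # generic 'improved'
    [0.3, -0.12],  # generic 'local'
]
probs = mnl_probabilities(x, betas)
```

In practice the coefficients would be estimated by maximum likelihood with a statistics package; the function above only evaluates the probabilities for given coefficients.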
Accurate varietal identification is also important for estimating the area-weighted average age of the varieties grown, an indicator of varietal replacement. To obtain this estimate for wheat varieties grown in Ethiopia, we used the following equation from [26]:
$$WA_t = \sum_i p_{it} R_{it} \qquad (2)$$
where $WA_t$ is the area-weighted average age of varieties in year t, $p_{it}$ is the proportion of area sown to variety i in year t, and $R_{it}$ is the number of years (at time t) since the release of variety i.
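As a worked illustration of the area-weighted average age, the sketch below sums each variety's area share multiplied by its age since release. The release years for Kakaba (2010) and Kubsa (1994) are those mentioned in the paper; the areas and the reference year of 2016 are assumptions made only for this example.

```python
def weighted_average_age(areas_ha, release_years, year):
    """Area-weighted average varietal age: sum over i of p_it * R_it."""
    total_area = sum(areas_ha.values())
    return sum(
        (area / total_area) * (year - release_years[variety])
        for variety, area in areas_ha.items()
    )


# Hypothetical areas (ha); release years as stated in the paper.
areas = {"Kakaba": 60.0, "Kubsa": 40.0}
released = {"Kakaba": 2010, "Kubsa": 1994}

# Ages in 2016 are 6 and 22 years, so the weighted average is
# 0.6 * 6 + 0.4 * 22, i.e. about 12.4 years.
wa = weighted_average_age(areas, released, 2016)
```

Because the weights are area shares, a widely grown old variety such as Kubsa pulls the average age up even when most newly adopted varieties are recent.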

Variety-level correspondence between farmer reporting and DNA fingerprinting
DNA FP identified 45 improved wheat varieties, 38 of which are bread and 7 durum wheat. Table 3 shows that, among the most popularly grown varieties, there is no single variety that was 100% correctly identified by the farmers growing it. Wheat variety identification by farmers therefore remains an undeniable challenge in the Ethiopian wheat system. Based on the varieties identified by DNA FP, 47.3% of the wheat area was covered by varieties released since 2006 (based on a sample of 1,115.7 ha of wheat). Those released before 1996, i.e., above 20 years old [...]. As shown in Table 4, about 50% of the sample wheat area was covered by three popular varieties (Kakaba, Kubsa and Danda'a). During the 2010/11 main production season, a serious yellow rust epidemic devastated wheat production in most of the wheat growing regions of the country. Although the government has made considerable efforts in seed production and dissemination to replace susceptible varieties with resistant ones (Kakaba, Danda'a and Digalu), the continued wide presence of Kubsa in farmers' fields after this epidemic suggests that farmers still preferred Kubsa for good reasons, including its higher and stable yield in normal seasons.

Explaining varietal identification
As indicated above, there was only a 28% match between farmer-reported and official names. Coding observations with matched varietal identification as 1 and those not matched as 0, a binary logit model was estimated to assess the household, farm and seed source characteristics explaining the likelihood that a farmer uses the same name for the wheat variety (s)he grows as in the official registry. Estimation results in Table 5 show that model farmers and those with better education are more likely to report the varieties they grow accurately. Moreover, farmers are less likely to report varieties under their original names the longer a variety has been recycled and used. In addition, very old wheat varieties are relatively less likely to be accurately identified by farmers. Farmers obtained the initial seeds of the wheat varieties they reported from diverse sources: cooperatives (23%), seed companies (5%), known farmers (26%), markets (17%) and other sources (30%). Compared to seeds initially purchased from cooperatives, wheat seeds obtained from unidentified sources are less likely to be correctly identified by farmers. Labelling seeds in formal seed markets could help farmers know varietal names for identification. Varieties grown on larger plots are more likely to be identified by farmers. Farmers who grow wheat on larger plots tend to be commercial [21] and are likely to pay due attention to the identity of the varieties they grow. There is regional variation in the level of varietal identification. Compared to farmers in Oromia Region, farmers in Tigray, Amhara, and SNNP Regions have a lower chance of identifying varietal names as stated in the national registry. This could be because the most popular wheat varieties identified in this study were given Oromo language names when released and may have taken locally adapted names when disseminated to other regions.
Varieties not accurately and uniquely identified by farmers were given different types of names. Some farmers reported other existing improved variety names, others reported generically 'improved' or 'local', and others reported a specific but unregistered name. Using these four mismatch categories, and clustering the standard errors at Enumeration Area level, we estimated a multinomial logit regression to identify the key factors explaining the likelihood that farmers reported these mismatching names (Table 6). Results show that, in most cases, respondents with better education are less likely to pick non-matching variety names. The longer a variety was recycled and used by a farmer, the higher the likelihood that the farmer misreports the variety name. The likelihood of reporting these varieties with non-registered, locally adapted names is even higher. A similar pattern is observed for varieties released a relatively long time ago. Seed sources show mixed results in explaining why a given variety was reported by farmers with a name not matching the name in the registry. Compared to seeds purchased from cooperatives, seeds obtained from markets (i.e., from seed traders and unknown farmers in the market) are more likely to be reported as 'local' or with a locally adapted name. The likelihood that a given variety is reported as 'improved' or 'local' is relatively lower when the variety is grown on larger wheat plots. Larger wheat areas are likely to result in more wheat surplus and more wheat marketed at the household level, and farmers are then more likely to pay due attention to variety names. As indicated earlier, compared to farmers from Oromia Region, farmers from the other three Regions (Tigray, Amhara, and SNNP) were on average more likely to misidentify varieties and report them with names not in the registry, or with generic names such as 'improved' or 'local'.

Bias in reporting
When variety names are not actually known by farmers, or varieties were sold to farmers under names different from those in the registry, farmers may report old varieties with the names of recently released ones. This could be more likely in marketing, as recently released varieties might have better traits that old varieties lack. The contrary case is also possible when farmers value the traits of an old variety and new varieties are marketed under the name of the old one. As recently released wheat varieties are usually better than the old ones in many features, including disease resistance and yield performance [28], farmers' misidentification is 'False-Negative' when recently released varieties are misidentified and reported with names of old varieties, and 'False-Positive' when old varieties are misidentified and reported with names of recently released varieties, e.g. a farmer reporting a given variety as Kakaba (released in 2010) when DNA FP identified it as Kubsa (released in 1994). As misidentification increases with the age of varieties since release (Table 5), we applied inverse probability weights (IPW) to varietal age and ran a multinomial logit model to identify the factors explaining these two biases in misidentified wheat variety names (Table 7). Accordingly, the likelihood of reporting 'False-Positive' increases with the age of respondents and the number of years since the specific variety was released. The likelihood that farmers confuse the names of old varieties with the names of relatively younger varieties increases with the number of years since the varieties were released. Compared to seeds obtained from cooperatives, farmers are more likely to report old varieties obtained from seed companies using names of recently released varieties, and vice versa.
Accounting for all factors included in this estimation, farmers who grew wheat on relatively larger plots were less likely to report the recently released varieties they grew using names of relatively older varieties. The implication of 'False-Negative' reporting is that farmers may tend to replace the recently released varieties at hand with something else, thinking that what they have is not a recent release. Conversely, 'False-Positive' reporting implies that farmers might hesitate to change the old variety at hand, thinking that it is a recently released variety. Such misinformation is more serious when a farmer grows rust susceptible varieties believing them to be resistant.
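The two reporting biases can be expressed as a comparison of release years between the farmer-reported variety and the variety identified by DNA FP. The sketch below is illustrative only: the function name and the label for equal release years are assumptions, and it applies only to plots where the two names already differ.

```python
def reporting_bias(reported_release_year, actual_release_year):
    """Classify a misidentified plot by the direction of the age error.

    'false-positive': an old variety reported under a more recent name.
    'false-negative': a recent variety reported under an older name.
    """
    if reported_release_year > actual_release_year:
        return "false-positive"
    if reported_release_year < actual_release_year:
        return "false-negative"
    return "same release year"  # names differ but varietal age does not


# Example from the text: reported as Kakaba (2010), identified as Kubsa (1994).
example = reporting_bias(reported_release_year=2010, actual_release_year=1994)
```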

Estimating improved wheat variety adoption
Using the conventional household survey method (and without considering the number of seasons wheat seed was recycled), our data suggest an improved wheat varietal adoption rate of 50-67% by number of wheat plots and 61-73% by wheat area. The lower bound of each range considers only registered varietal names and the upper bound includes generic names. Interestingly, 93% of the varieties that farmers reported as "name unknown but local" were identified by DNA FP as improved varieties (Table 3). DNA FP takes the adoption estimate to 95% in terms of wheat area (Table 8). Still, some caveats are in order, not least because some of the identified improved wheat varieties were released 30-40 years ago, and their formal seed multiplication stopped some 20-30 years ago.
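The difference between adoption measured by number of plots and by area can be illustrated with a small calculation. The plot records below are hypothetical and the function name is illustrative; only the two definitions of the adoption rate come from the text.

```python
def adoption_rates(plots):
    """Return (share of plots, share of area) under improved varieties.

    plots: list of (is_improved, area_ha) tuples, one per wheat plot.
    """
    n_improved = sum(1 for improved, _ in plots if improved)
    area_improved = sum(area for improved, area in plots if improved)
    total_area = sum(area for _, area in plots)
    return n_improved / len(plots), area_improved / total_area


# Hypothetical sample: improved varieties occupy the larger plots, so the
# area-based rate (0.75) exceeds the plot-based rate (2 of 3 plots).
sample_plots = [(True, 0.5), (False, 0.25), (True, 0.25)]
by_plots, by_area = adoption_rates(sample_plots)
```

The same function can be applied twice, once with `is_improved` taken from farmer reports and once from DNA FP, to compare the two adoption estimates on identical plots.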
Our analysis thereby highlights that neither the farmers' reports nor the DNA FP results provide a complete picture of the adoption status of improved wheat varieties. Instead, they need to complement each other. While household surveys can shed light on the history of the varieties grown, the duration of varietal use and contextual factors, only DNA FP accurately identifies the variety. The misidentification by farmers is unlikely to be intentional and may reflect that farmers have incomplete or incorrect information about the varieties they are growing, including the actual names assigned to the specific varieties by breeders when released.

Although it is advisable to use these two approaches as complements in varietal identification as well as in adoption and impact assessment, the costs involved in data collection and DNA sequencing should not be underestimated. Advanced methods, including hand-held portable devices, could help identify crop varieties in the field or from grain samples with acceptable levels of accuracy and at reduced cost [29]. In general, it is always worth considering what level of accuracy is expected (or what level of error is allowed) in estimating the outcome variables of interest. Here, the judgement of the researcher is essential in choosing the data collection method(s) for variety identification and in assessing the level of accuracy each method can provide at the data analysis stage.

Linking varieties to their germplasm sources
Only 50.4% of the farmer-reported varieties had specific names appearing in the variety registry; the other half included generic names or names that are not registered (Table 8). On the other hand, DNA FP identified 94% of the samples collected, failing to identify only 6%. Focusing on the varieties identified by a specific name both by farmers and by DNA FP, the distribution of these sub-samples in terms of their germplasm source is more or less the same. For instance, of the varieties identified by farmers with their specific registered names, 92% were derived from International Maize and Wheat Improvement Center (CIMMYT) germplasm, although it should be remembered that only half of the farmers were able to report such names (Table 9). Similarly, of the samples identified by DNA fingerprinting, 91% were linked to CIMMYT germplasm. Accounting for the 6% that were not classified by DNA sequencing, CIMMYT's share in national wheat varietal use was 86%. These results show that over 94.5% of all varieties identified by DNA FP were derived from the CGIAR, specifically CIMMYT and ICARDA (International Center for Agricultural Research in the Dry Areas), while those that came directly from the national research system accounted for 3.4%.
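The step from 91% of identified samples to 86% of all samples is a rescaling by the share of samples that DNA FP could identify. A minimal sketch of that arithmetic follows, using the rounded percentages quoted in the text; the function name is illustrative.

```python
def share_of_all_samples(share_among_identified, identified_share):
    """Rescale a share among identified samples to a share of all samples.

    Unclassified samples are counted in the denominator but are assumed
    not to contribute to the numerator.
    """
    return share_among_identified * identified_share


# 91% of identified samples trace to CIMMYT germplasm; 94% were identified.
cimmyt_share = share_of_all_samples(0.91, 0.94)
cimmyt_pct = round(100 * cimmyt_share)  # 0.91 * 0.94 = 0.8554, about 86%
```

This treats the unclassified 6% as non-CIMMYT, so 86% is a conservative figure if some unclassified samples in fact derive from CIMMYT germplasm.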

Discussions
Understanding crop varietal use is useful for informing policy decisions and agricultural development. For instance, crop varieties differ in their disease and pest tolerance and other traits. Some policies aim to replace old varieties with new and better performing ones; some market development calls for specific crop varieties with special traits. In responding to these and similar demands, accurate varietal identification at the plot level and aggregation of varietal distributions at regional, national or agroecological levels are crucial. The method commonly used for collecting varietal level data is the farm household survey, in which farmers report the specific varieties they grew during a given season. However, varietal names collected using this approach have their own challenges, including the fact that most farmers report varieties under locally adapted names that usually differ from the names these varieties had when registered and/or released. Even for varieties reported with names known in the variety registry, there is no guarantee that these varieties are indeed what they are reported to be. Farmers also do not necessarily know or give names to every crop variety they grow. Under such circumstances, DNA FP with a robust reference library, built from breeders' seed collected from the institution(s) that registered the individual varieties, is indispensable. This paper shows the disparity between estimates of improved wheat variety adoption generated from farmers' reporting and from DNA FP. DNA FP identified 94% of the samples as improved, whereas the corresponding household survey estimate was 50-67% (depending on how we define what farmers reported as improved). There are issues with both approaches to varietal identification considered in this study.
First, unless supported by further inquiry through surveys, it is difficult to know how long the varieties identified by DNA FP have actually been used in the production system. This is particularly important where improved seeds lose their yield potential through mixture in the field when recycled for many years and need to be replaced with fresh seed for better performance. Second, as shown in this study, of the 1,957 wheat samples reported by farmers with names in the variety registry (i.e., 50% of the total sample), only half (i.e., 989 samples) matched the variety names identified by DNA FP. This implies a sizeable mismatch between the varietal names used by farmers and those in the national registry. Such mixing up of improved wheat variety names by farmers makes variety specific results obtained from farmers' elicitation dubious.
The self-pollinated nature of wheat allows relatively better purity over a certain period of time compared to cross-pollinated crops. Theoretically, one might thus expect farmers to know and report the actual names of improved wheat varieties. Instead, farmers' misidentification of the wheat varieties they grow was relatively large (i.e., 66-72%). Several factors associated with the misidentification were identified in this study, including the education level of farmers, the number of years of seed recycling, whether a farmer is relatively specialized in wheat production, and whether the seed was purchased from formal seed sources. The results imply that any study that identifies varieties from household survey data based on farmers' elicitation needs to double-check how accurate the varietal names are and to make further inquiries to enhance accuracy. If not, decisions based on studies that relied only on farmers' elicitation could lead to undesirable outcomes. In this regard, further studies could help trace the sources of deviation between the names farmers use to identify a variety and what these varieties actually are by considering the different nodes of the seed development and distribution channels. In addition, to complement surveys conducted to document variety specific data, it is essential to look for advanced methods and technologies that could help identify crop varieties in the field or from grain samples with acceptable levels of accuracy.

Conclusions
Using DNA FP and farmers' elicitation approaches to varietal identification, this paper assessed the level of misidentification by smallholder farmers of the wheat varieties they grew and identified key variables explaining the misidentification. The data showed a clear disparity between varietal identification based on farmers' elicitation and on DNA FP, with important implications for the estimated level of improved wheat variety adoption. DNA FP identified 94% of wheat samples in Ethiopia in 2016/17 as improved, whereas the household survey estimated this at 50-67%. Only 28-34% of varieties were correctly identified, i.e. a misidentification of some 70% in which farmer-reported variety names did not match those identified by DNA FP. Level of education, source of seed, level of seed recycling, age of the variety since release, and number and size of wheat plots determine the ability of farmers to correctly identify the wheat varieties they grow in relation to variety names in the national registry. Our findings imply that variety specific adoption and impact assessments based solely on farmer-reported variety names are highly dubious. Thus, regardless of the associated costs, we recommend DNA FP as a reliable method of varietal identification. Still, DNA FP remains mute on contextual and explanatory factors and is thus best enriched with complementary, targeted household survey data. In combination, they can provide sound and reliable varietal adoption and impact assessments, and generate useful knowledge to inform policy recommendations related to varietal replacement and seed systems development.