The use of the hypervariable P8 region of trnL(UAA) intron for identification of orchid species: Evidence from restriction site polymorphism analysis

The P8 stem-loop region of the trnL intron is known to be hypervariable in size, with multiple repeat motifs that create difficulties in alignment, and is therefore routinely excluded from phylogenetic as well as barcode analyses. This region was investigated for species discrimination in 98 taxa of orchids belonging to the tribe Vandeae using in silico mapping of restriction site polymorphism. The length of the P8 regions varied from 200 nucleotides in Aerides rosea to 669 nucleotides in Dendrophylax sallei. Forty-two taxa had unique lengths, while as many as eight shared a common length of 521 nucleotides. Of the 35 restriction endonucleases with recognition sites in the P8 regions, three, viz., AgsI, ApoI and TspDTI, had recognition sites across all 98 taxa studied. When their restriction data were combined, 92 taxa could be discriminated, leaving three taxon pairs unresolved. However, Acampe papillosa and Aeranthes arachnites, despite having the same restriction sites, differed in their P8 lengths. This is the first report of a thorough investigation of the P8 region of the trnL intron in search of species-specific restriction sites, and hence of its potential use as a plant DNA barcode.


Introduction
For the past few decades there has been a search for a short DNA segment that can be used as a universal marker, popularly termed a DNA barcode, for identification of the faunal and floral species inhabiting this planet. For animal identification the mitochondrial gene cytochrome c oxidase subunit 1 (COX1) has proved successful [1]; however, it remains difficult to fix a universal barcode for plants because of their more complex genetic background. Various nuclear and plastid coding and non-coding loci, viz., ITS1, ITS2, accD, matK, ndhJ, rpoB, rpoC1, ycf5, atpF-H, psbK-I, rbcL, rbcLa, trnH-psbA, trnL-F, etc., have been tested either singly or in combinations of two or more loci [2][3][4].
The chloroplast trnL-F non-coding region includes the trnL(UAA) intron, ranging from 350 to 600 base pairs (bp), and the intergenic spacer between the trnL(UAA) 3′ exon and the trnF(GAA) gene [5,6]. The trnL(UAA) intron interrupts the anticodon loop of the tRNA Leu, which is encoded in the large single copy region of the plastid genome. In chloroplast DNA, the trnL intron is the only group I intron; it has a conserved secondary structure [7,8] with alternating conserved and variable regions [9]. Group I introns are capable of catalyzing their own splicing from the flanking exons. The secondary structure of the trnL intron contains regions of complementary sequence that form nine stem-loop structures (P1-P9) [10]. Within these stems there are four regions (P, Q, R and S) whose primary sequences are conserved among all group I introns [11]; together they are known as the catalytic core [12]. The P8 stem-loop region of the trnL intron is known to be hypervariable in size, with multiple repeat motifs [12,13] that create difficulties in alignment. Therefore, this region is routinely excluded from phylogenetic as well as barcode analyses [14][15][16][17].
Orchidaceae comprises 850 genera and 20,000 species arranged in five subfamilies, 22 tribes and 70 sub-tribes [18]. The tribe Vandeae consists of five sub-tribes and 139 genera with 2600 species of monopodial epiphytes distributed in tropical America, tropical and southern Africa, tropical and sub-tropical Asia, eastern Australia and Tasmania, and much of the tropical Pacific south to New Zealand and east to Tahiti [19]. Considering hypervariability to be an intrinsic property of the P8 region of the trnL(UAA) intron, the present investigation was undertaken to find out whether the locus has any potential for discriminating closely related species of the tribe Vandeae, based on in silico restriction site polymorphism analysis.

Gene sequences
One hundred and twenty-five sequences of the tRNA Leu (trnL) gene for orchids belonging to the tribe Vandeae were downloaded from GenBank. These included ten sequences (GU185926, GU185928, GU185931, GU185932, GU185933, GU185934, GU185935, GU185938, GU185939 and GU185940) generated and submitted by the first author. The methods followed for DNA extraction and PCR amplification were those given by Kishor and Sunitibala [20]. Sequencing was done using a 3730 DNA Analyzer (Applied Biosystems, Warrington, UK) available at the DNA Sequencing Facility, University of Delhi, South Campus. The primer pair 'c' (3′CGAAATCGGTAGACGCTACG5′) and 'd' (3′GGGGATAGAGGGACTTGAAC5′) [5] was used for amplification as well as sequencing.

In silico analysis of restriction site polymorphism for species discrimination
The trnL intron borders were delimited by identifying the binding sites of the primers 'c' and 'd'. We looked for the P8 region by delimiting the borders following the method of Borsch et al. [13] and by considering the secondary structures of trnL of Campylopus flexuous [21] and Nymphaea odorata [13]. The correctness of the borders was again verified by considering the secondary structures of the P8 regions drawn for Aerides odorata, A. sukauensis and A. krabiensis [16].
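The primer-based delimitation step can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function name region_between_primers, the exact-match search and the toy sequence are assumptions made for illustration, and the primer strings are assumed here to be read in the 5′ to 3′ direction. Delimiting the P8 region itself still requires the secondary-structure comparison described above.

```python
# Primer sequences as quoted in the text; 5'->3' orientation is an assumption.
PRIMER_C = "CGAAATCGGTAGACGCTACG"
PRIMER_D = "GGGGATAGAGGGACTTGAAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def region_between_primers(seq, forward=PRIMER_C, reverse=PRIMER_D):
    """Return the sequence lying between the two primer binding sites.

    Only exact matches are considered (hypothetical simplification).
    Returns None if either binding site is not found.
    """
    seq = seq.upper()
    start = seq.find(forward)
    if start == -1:
        return None
    inner_start = start + len(forward)
    # The reverse primer anneals to the opposite strand, so search for
    # its reverse complement on the strand given in `seq`.
    end = seq.find(reverse_complement(reverse), inner_start)
    if end == -1:
        return None
    return seq[inner_start:end]


# Toy sequence built for illustration only (not a real accession).
toy = "TT" + PRIMER_C + "ACGTACGT" + reverse_complement(PRIMER_D) + "GG"
print(region_between_primers(toy))
```

Real accessions may carry mismatches or truncated primer sites, in which case an approximate search would be needed instead of str.find.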
In silico restriction mapping was done using the online software RestrictionMapper Version 3 (http://restrictionmapper.org/). All commercially available restriction enzymes were selected, and the base pair position at which each enzyme cut the trnL P8 region was determined for all the taxa. The program was set to allow the maximum number of cuts, with a minimum site length of 5 nucleotides. As many as 35 restriction endonucleases were found to have recognition sites in the P8 regions; however, only four, viz., AgsI, ApoI, TspDTI and VspI (five- and six-base cutters), were selected for the analysis because they had recognition sites in the region across all the taxa (VspI in all but one).
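The mapping step can be sketched in a few lines of Python. This is not RestrictionMapper itself: the recognition sequences in ENZYMES are stated from memory as assumptions and should be confirmed against REBASE, the function names are hypothetical, the toy sequence is invented, and the positions reported are recognition-site starts rather than cut positions.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Recognition sequences: ASSUMED values, to be checked against REBASE.
ENZYMES = {"AgsI": "TTSAA", "ApoI": "RAATTY", "TspDTI": "ATGAA", "VspI": "ATTAAT"}


def reverse_complement(site):
    """Reverse complement of a recognition sequence with IUPAC codes."""
    return site.translate(COMPLEMENT)[::-1]


def site_positions(seq, site):
    """Sorted 1-based start positions of `site` on either strand of `seq`."""
    seq = seq.upper()
    hits = set()
    # A set removes the duplicate pattern when the site is palindromic.
    for pattern in {site, reverse_complement(site)}:
        # The lookahead lets overlapping occurrences be reported as well.
        regex = "(?=" + "".join(IUPAC[base] for base in pattern) + ")"
        hits.update(m.start() + 1 for m in re.finditer(regex, seq))
    return sorted(hits)


def restriction_map(seq, enzymes=ENZYMES):
    """Map every enzyme name to the list of its site positions in `seq`."""
    return {name: site_positions(seq, site) for name, site in enzymes.items()}


# Invented toy sequence, not a real P8 region.
toy = "GGAAATTCATTAATCCTTGAATTCATGG"
print(restriction_map(toy))
```

Both strands are searched because a non-palindromic site such as the one assumed for TspDTI can occur in either orientation.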
The in silico mapped restriction sites produced by the four enzymes were scored as present (1) or absent (0). Further analyses were done using the computer package NTSYS [22]. A similarity matrix was generated using the Qualitative program in the NTSYS 2.20e package. Cluster analysis was performed on this matrix by the unweighted pair group method with arithmetic averages (UPGMA) using the SAHN program, and the dendrograms were generated with the TREE program. Analysis was done for each enzyme as well as for combinations of two, three and four of them.
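The scoring and clustering steps can be sketched in Python as follows. This does not reproduce NTSYS: the simple matching coefficient is an assumed choice (the coefficient used in the Qualitative program is not stated here), the function names are hypothetical, and the four taxa with their site positions are invented for illustration.

```python
from itertools import combinations


def score_matrix(site_maps):
    """Convert {taxon: set of (enzyme, position)} into 1/0 character rows."""
    columns = sorted(set().union(*site_maps.values()))
    return {t: [1 if c in sites else 0 for c in columns]
            for t, sites in site_maps.items()}


def simple_matching(a, b):
    """Fraction of characters scored identically in two taxa (assumed coefficient)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def identical_profiles(rows):
    """Groups of taxa whose presence/absence rows are identical."""
    groups = {}
    for taxon, row in rows.items():
        groups.setdefault(tuple(row), []).append(taxon)
    return [g for g in groups.values() if len(g) > 1]


def upgma(rows):
    """Average-linkage clustering; returns the tree as nested tuples."""
    clusters = {(t,): t for t in rows}  # tuple of members -> subtree

    def avg_sim(c1, c2):
        # UPGMA: mean similarity over all pairs of original taxa.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(simple_matching(rows[a], rows[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        merged = (clusters.pop(c1), clusters.pop(c2))
        clusters[c1 + c2] = merged
    return next(iter(clusters.values()))


# Invented example data: site labels are (enzyme, position) pairs.
site_maps = {
    "taxon_A": {("ApoI", 3), ("ApoI", 19), ("AgsI", 17)},
    "taxon_B": {("ApoI", 3), ("ApoI", 19), ("AgsI", 17)},
    "taxon_C": {("ApoI", 3), ("AgsI", 40)},
    "taxon_D": {("ApoI", 60), ("AgsI", 75)},
}
rows = score_matrix(site_maps)
print(identical_profiles(rows))
print(upgma(rows))
```

Taxa that end up in the same group returned by identical_profiles correspond to a similarity of 1.00 in the dendrogram, that is, to taxa that the chosen enzymes cannot discriminate.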

Length variation of trnL P8 region
Many of the downloaded trnL sequences for the 125 accessions were incomplete. After establishing the borders, only 98 taxa with a complete sequence of the P8 region could be selected for the analysis. The length of the P8 regions varied from 200 nucleotides (Aerides rosea) to 669 nucleotides (Dendrophylax sallei) (Fig 1, Table 1). Forty-two taxa had unique lengths, while as many as eight shared a common length of 521 nucleotides (Aeranthes arachnites, Aeranthes grandiflora, Angraecum conchiferum, Angraecum dives, Angraecum leonis, Bonniera appendiculata, Phalaenopsis amboinensis and Phalaenopsis modesta). Within the genus Phalaenopsis, which is represented by 23 species in the present study, the length of the P8 regions ranged from 494 (P. tetraspis) to 538 nucleotides (P. lindenii), while in eight taxa of Vanda it ranged from 471 (V. testacea) to 562 nucleotides (V. griffithii).
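The tally of unique and shared lengths can be illustrated with a short Python sketch. Only a handful of the 98 taxa are included, using lengths quoted in the text; the function name group_by_length is hypothetical and the full data set is in Table 1.

```python
from collections import defaultdict

# P8 lengths (nucleotides) quoted in the text for a subset of the taxa.
p8_lengths = {
    "Aerides rosea": 200,
    "Dendrophylax sallei": 669,
    "Aeranthes arachnites": 521,
    "Angraecum dives": 521,
    "Phalaenopsis amboinensis": 521,
    "Phalaenopsis tetraspis": 494,
    "Phalaenopsis lindenii": 538,
}


def group_by_length(lengths):
    """Map each P8 length to the list of taxa that have it."""
    groups = defaultdict(list)
    for taxon, n in lengths.items():
        groups[n].append(taxon)
    return dict(groups)


groups = group_by_length(p8_lengths)
# Taxa whose length is shared with no other taxon in the data set.
unique = sorted(taxa[0] for taxa in groups.values() if len(taxa) == 1)
# Lengths shared by two or more taxa.
shared = {n: sorted(taxa) for n, taxa in groups.items() if len(taxa) > 1}

print(min(p8_lengths.values()), max(p8_lengths.values()))
print(unique)
print(shared)
```

In this subset the three taxa of 521 nucleotides belong to three different genera, which illustrates why length alone cannot serve as the identifier.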

In silico analysis of restriction site polymorphism
The result of the in silico restriction site analysis is presented in Table 1. Of all the commercially available restriction endonucleases tried for in silico restriction mapping using the online software 'RestrictionMapper', 35 endonucleases (AflII, AccI, AgsI, ApoI, AsuII, [...]) had recognition sites in the P8 regions (Fig 2). Of these 35 restriction endonucleases, only three, viz., AgsI, ApoI and TspDTI, had recognition sites across all the 98 taxa studied, while VspI had sites in all except Angraecum dives. Hence, these four were selected for subsequent analysis of the restriction site polymorphism.

Species discrimination by single restriction endonuclease
The number of restriction sites recognized by AgsI ranged from one (Aerides krabiensis and A. multiflora) to 14 (Phalaenopsis inscriptiosinensis) (Table 1). From the dendrogram constructed by the UPGMA method, which showed similarity coefficients ranging from 0.91 to 1.00 (Fig 3), it was found that AgsI exhibited species-specific restriction sites in 80 taxa; however, it could not discriminate 18 taxa, which formed the taxon pairs Acampe papillosa and Aeranthes arachnites; Angraecum dives and Angraecum leonis; Phalaenopsis lamelligera and P. pantherina; P. lueddemanniana and P. pulchra; Phalaenopsis philippinensis and P. schilleriana; Phalaenopsis maculata and P. mariae; Phalaenopsis amboinensis and P. modesta; Phalaenopsis corningiana and P. sumatrana; Pomatocalpa diffusum and Vanda testacea. Each of these taxon pairs shared the same AgsI restriction sites. The number of restriction sites recognized by ApoI ranged from two (in six taxa) to ten (Phalaenopsis inscriptiosinensis) (Table 1). As evidenced by the UPGMA dendrogram (Fig 4), of the 98 taxa analysed ApoI digestion discriminated 77 species; however, it could not show species-specific restriction sites in 21 taxa. The following taxon pairs or groups shared the same restriction sites: Acampe ochracea, Acampe papillosa and Aeranthes arachnites; Angraecum calceolus and Jumellea maxillarioides; Angraecum conchiferum, Angraecum dives, Angraecum leonis and Bonniera appendiculata; Rhipidoglossum kamerunense and Rhipidoglossum subsimplex; Tridactyle crassifolia and Tridactyle filifolia; Phalaenopsis amboinensis and P. modesta; Phalaenopsis lamelligera and P. pantherina; Phalaenopsis corningiana and P. sumatrana; Phalaenopsis philippinensis and P. schilleriana. This dendrogram showed similarity coefficients ranging from 0.92 to 1.00.

Species discrimination using combined data of restriction endonucleases
Analysis of the in silico restriction mapping using the combined data of AgsI and ApoI resulted in discrimination of 86 taxa (Fig 7) as against their individual resolutions of 80 and 77 taxa respectively. Those which had exactly the same restriction sites included six taxon pairs, Acampe papillosa and Aeranthes arachnites; Angraecum dives and Angraecum leonis; Phalaenopsis lamelligera and P. pantherina; Phalaenopsis amboinensis and P. modesta; Phalaenopsis corningiana and P. sumatrana; Phalaenopsis philippinensis and P. schilleriana. The UPGMA dendrogram showed similarity coefficient ranging from 0.91 to 1.00.
When the ApoI, AgsI and TspDTI restriction data were combined, 92 taxa could be discriminated, while three taxon pairs remained unresolved: Acampe papillosa and Aeranthes arachnites; Phalaenopsis corningiana and P. sumatrana; Phalaenopsis philippinensis and P. schilleriana (Table 1, Fig 8). The UPGMA dendrogram showed similarity coefficients ranging from 0.92 to 1.00. Combining the data of all four enzymes, ApoI, AgsI, TspDTI and VspI, gave the same result as the three enzymes above, with the same three pairs left undiscriminated (Table 1, Fig 9). The UPGMA dendrogram again showed similarity coefficients ranging from 0.92 to 1.00.
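How combining enzymes raises the number of discriminated taxa can be shown with a small Python sketch. The taxa t1 to t4 and their site positions are invented, and the function names are hypothetical; a taxon is counted as discriminated when no other taxon shares its combined site profile.

```python
def combined_profile(site_maps, taxon, enzymes):
    """Tuple of sorted site positions, one entry per enzyme, for a taxon."""
    return tuple(tuple(sorted(site_maps[e][taxon])) for e in enzymes)


def count_discriminated(site_maps, taxa, enzymes):
    """Return (number of taxa with a unique profile, unresolved groups)."""
    seen = {}
    for t in taxa:
        seen.setdefault(combined_profile(site_maps, t, enzymes), []).append(t)
    unresolved = [g for g in seen.values() if len(g) > 1]
    return len(taxa) - sum(len(g) for g in unresolved), unresolved


# Invented site positions for four hypothetical taxa.
site_maps = {
    "AgsI":   {"t1": [10, 50], "t2": [10, 50], "t3": [10, 50], "t4": [22]},
    "ApoI":   {"t1": [5],      "t2": [5],      "t3": [30],     "t4": [30]},
    "TspDTI": {"t1": [70],     "t2": [90],     "t3": [70],     "t4": [70]},
}
taxa = ["t1", "t2", "t3", "t4"]

for enzymes in (["AgsI"], ["ApoI"], ["AgsI", "ApoI"], ["AgsI", "ApoI", "TspDTI"]):
    n, groups = count_discriminated(site_maps, taxa, enzymes)
    print(enzymes, n, groups)
```

The same bookkeeping underlies the counts reported above: six unresolved pairs leave 98 - 12 = 86 discriminated taxa, and three unresolved pairs leave 98 - 6 = 92.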

Discussion
The P8 stem-loop region is reported to be the most length-variable region of the trnL intron, owing to the presence of repeats of various sizes [21]. These repeats, together with the enormous length and sequence variation, have hampered alignment in most of the plant groups studied for phylogeny or barcoding, and this region is therefore often excluded from analysis of trnL intron sequences [14][15][16]. Because of this exclusion, results inferred from trnL intron sequences do not reflect all the information the intron naturally carries. By virtue of its hypervariability, the P8 region should contain information that can be analysed and interpreted for species-level identification or barcoding, if not for phylogenetic inference. As sequence alignment of the P8 regions is difficult, restriction analysis offers an alternative that does not depend on alignment. In silico restriction site polymorphism was preferred over conventional PCR-Restriction Fragment Length Polymorphism (PCR-RFLP) because a large number of trnL sequences had already been deposited in GenBank. Moreover, PCR-RFLP may not show the true picture, because two fragments of equal length originating from different parts of the sequence cannot be distinguished on a gel. PCR-RFLP is seldom used for plant species identification; however, a few investigators have employed it for identification of mangrove and mangrove associate species [23], fine roots of trees from the Alps [24], Cinnamomum spp. [25], upland grassland species from roots [26], Vasconcellea species [27] and Dendrobium orchids [28].

Sequence length analysis
The tribe Vandeae (Family: Orchidaceae) is very large, comprising 2600 species of monopodial epiphytes in 139 genera. Some previous investigators have already worked on the phylogeny of these orchids using trnL intron sequences [15][16][17]. The present investigation aimed to discriminate closely related species belonging to a genus, a subtribe or a tribe. When we started the investigation, 125 accessions of trnL intron sequences for taxa of the tribe Vandeae were available in the GenBank database, and they were retrieved for analysis. Upon further investigation, 98 of the 125 sequences were found to have complete P8 hypervariable regions and were thus considered for the final analysis. Sequence length variation of the P8 regions could discriminate 92 out of the 98 taxa of orchids being investigated. The length of the P8 regions across all the taxa varied from 200 (Aerides rosea) to 669 nucleotides (Dendrophylax sallei). The P8 region of Dendrophylax sallei was 30 nucleotides longer than that of Luisia curtisii (Orchidaceae), which was reported earlier to be the longest recorded angiosperm P8 (639 nucleotides) [16]. Our findings show that taxa sharing a common length did not necessarily belong to the same genus and that there could be great differences in P8 length within the same genus, which is in agreement with Kocyan et al. [16]. This difference in P8 lengths might be due to slipped-strand mispairing (SSM) resulting in high repetition of A motifs [21,29].

Restriction site polymorphism analysis
Our study represents the first in silico restriction analysis of the P8 region of trnL, undertaken in an attempt to use the genetic information present in it for species identification in a group of angiosperms. All the UPGMA dendrograms generated in this investigation show that species belonging to one genus cluster with those belonging to other genera; in other words, they are clustered haphazardly owing to the hypervariable nature of the P8 regions. This shows that the present approach might not be applicable to phylogenetic inference for the taxa investigated. However, since 95.9% of these taxa (94 of 98) could be discriminated based on restriction site polymorphisms and sequence length data, this technique might be adopted for rapid species identification and hence as a plant DNA barcode.
So far there have been few reports on the use of PCR-RFLP for identification of orchid species, except for certain Thai Dendrobium orchids using rDNA-ITS and cpDNA regions [28]. Some investigators have already shown that double digestion, or the use of two or more restriction endonucleases, generates more polymorphic fragments than a single enzyme in PCR-RFLP experiments. PCR-RFLP of the chloroplast trnS-psbC gene region using a combination of two enzymes, HaeIII and MspI, successfully assigned all 119 accessions of millet to seven species [30]. A total of 579 grass roots were assigned to ten species using PCR-RFLP of the trnL intron with one or two enzyme digests [26]. Again, 16 taxa out of 30 tree species from the Alps were identified using PCR-RFLP with four restriction endonucleases, TaqI, HinfI, RsaI and CfoI [24]. In our investigation, however, as many as 35 restriction endonucleases had recognition sites in the region. Combined data from analyses of the P8 regions with three restriction endonucleases, ApoI, AgsI and TspDTI, could discriminate 92 of the 98 taxa based on species-specific restriction sites. Hence, screening a large number of restriction endonucleases is advantageous for achieving a higher success rate in species discrimination of the plant specimens being investigated.
Aerides rosea, which had the shortest P8 length of 200 nucleotides, had recognition sites for at least eight enzymes, while Dendrophylax sallei, despite having the longest P8 length (669 nucleotides), had recognition sites for only 20 enzymes. Microcoelia stolzii, with a P8 length of 480 nucleotides, had the maximum number (25) of restriction endonucleases cutting the region. Considering the number of recognition sites per restriction endonuclease for an individual taxon, Phalaenopsis inscriptiosinensis, with a P8 length of 498 nucleotides, had as many as 14 AgsI, 10 ApoI, 10 TspDTI and 8 VspI recognition sites. Hence, our results showed that the longest P8 region had neither the maximum number of restriction endonucleases recognizing it nor the maximum number of recognition sites for an individual enzyme.
Twenty restriction endonucleases had recognition sites in the P8 regions of both Acampe papillosa and Aeranthes arachnites. These two taxa also exhibited the same number of restriction sites for all twenty enzymes. The UPGMA trees drawn for each of the four enzymes, as well as those for their combined data, revealed them to have 100% genetic similarity. Hence, the only information differentiating them as separate species is their P8 sequence lengths: 420 nucleotides for Acampe papillosa and 521 nucleotides for Aeranthes arachnites.
Phalaenopsis corningiana and P. sumatrana are treated as two separate species [31][32][33]. Our investigation revealed that both had the same P8 length (508 nucleotides) and identical sequences, and hence could not be distinguished from each other, as they had the same restriction sites for all 19 restriction enzymes. Tsai [34] also could not separate the two taxa based on information derived from nrITS, IGS and atpB-rbcL sequences. The only morphological differences between them are in the callus and the marking patterns on the petals. P. sumatrana is widely distributed across Myanmar, Thailand, Vietnam, Indonesia, Malaysia and the Philippines, whereas P. corningiana is restricted to Borneo and Sarawak. Phalaenopsis philippinensis and P. schilleriana are endemic to the Philippines and have distinct distinguishing morphological characters. Our analysis showed that both possessed identical P8 sequences and hence identical restriction endonucleases and restriction sites. Whether these two taxon pairs should be treated as natural hybrids or ecotypes may need evidence from other coding or non-coding sequences or from protein markers.
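The role of P8 length as a second character can be illustrated with a minimal Python sketch. The lengths are those quoted in the text; the restriction profile labels are placeholders that only encode the reported fact that the two members of each pair share identical restriction sites, and the function name unresolved_groups is hypothetical.

```python
# (restriction profile label, P8 length in nucleotides) per taxon.
# Profile labels are placeholders: equal labels mean identical restriction sites.
records = {
    "Acampe papillosa": ("profile_1", 420),
    "Aeranthes arachnites": ("profile_1", 521),
    "Phalaenopsis corningiana": ("profile_2", 508),
    "Phalaenopsis sumatrana": ("profile_2", 508),
}


def unresolved_groups(records, use_length):
    """Groups of taxa that share the same key (profile, optionally plus length)."""
    groups = {}
    for taxon, (profile, length) in records.items():
        key = (profile, length) if use_length else (profile,)
        groups.setdefault(key, []).append(taxon)
    return [g for g in groups.values() if len(g) > 1]


print(unresolved_groups(records, use_length=False))  # both pairs unresolved
print(unresolved_groups(records, use_length=True))   # only the Phalaenopsis pair

# Overall success rate reported in the text: 94 of 98 taxa.
percent = round(100 * 94 / 98, 1)
print(percent)
```

Adding length to the key separates Acampe papillosa from Aeranthes arachnites, which raises the count from 92 to 94 discriminated taxa, while the Phalaenopsis pair stays unresolved because both length and restriction sites are identical.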
It has been suggested that for a gene region to be practical as a DNA barcode it must fulfil three criteria: (i) contain significant species-level genetic variability and divergence, (ii) possess conserved flanking sites for developing universal PCR primers for wide taxonomic application, and (iii) have a short sequence length so as to facilitate current capabilities of DNA extraction and amplification [3]. First, our results show that the P8 regions of the 98 taxa contained a good amount of genetic variation, either in sequence length or in the restriction sites of the enzymes used. The taxa that could not be discriminated might require an understanding of their maternal origin in case of natural hybridization, since plastid DNA barcodes will fail for natural hybrids. Second, the primer pair (c and d) used to amplify the trnL intron is well conserved from bryophytes to angiosperms [35]. The present study also employed this primer pair for both PCR amplification and sequencing. Using these primers, the whole trnL region could be amplified and sequenced and an intact P8 region easily retrieved from it, which is an advantage. Third, the sequence lengths of the P8 regions observed in the present investigation ranged from 200 to 669 nucleotides, which is short enough for a DNA barcode. The only disadvantage of the P8 region lies in its inability to be aligned owing to hypervariability; hence it cannot be processed further using standardized phylogenetic or barcoding techniques.

Conclusions
A technique for molecular identification using sequence length variation and in silico restriction site polymorphism analysis of the trnL intron P8 sequence was developed and used to discriminate 94 out of 98 taxa of orchids to the species level. This technique may be tried for species-level discrimination across other angiosperm families. The four restriction endonucleases ApoI, AgsI, TspDTI and VspI could be used for further analysis of the trnL intron P8 sequences of other uninvestigated orchid taxa, either in silico or by PCR-RFLP. A plant DNA barcoding system using restriction site polymorphism of the trnL P8 region has not been suggested before, and this report indicates that the approach has good potential for establishing such a system.