Transcriptome Analysis Revealed Highly Expressed Genes Encoding Secondary Metabolite Pathways and Small Cysteine-Rich Proteins in the Sclerotium of Lignosus rhinocerotis

Lignosus rhinocerotis (Cooke) Ryvarden (tiger milk mushroom) has long been known for its nutritional and medicinal benefits among the local communities in Southeast Asia. However, the molecular and genetic basis of its medicinal and nutraceutical properties at transcriptional level have not been investigated. In this study, the transcriptome of L. rhinocerotis sclerotium, the part with medicinal value, was analyzed using high-throughput Illumina HiSeqTM platform with good sequencing quality and alignment results. A total of 3,673, 117, and 59,649 events of alternative splicing, novel transcripts, and SNP variation were found to enrich its current genome database. A large number of transcripts were expressed and involved in the processing of gene information and carbohydrate metabolism. A few highly expressed genes encoding the cysteine-rich cerato-platanin, hydrophobins, and sugar-binding lectins were identified and their possible roles in L. rhinocerotis were discussed. Genes encoding enzymes involved in the biosynthesis of glucans, six gene clusters encoding four terpene synthases and one each of non-ribosomal peptide synthetase and polyketide synthase, and 109 transcribed cytochrome P450 sequences were also identified in the transcriptome. The data from this study forms a valuable foundation for future research in the exploitation of this mushroom in pharmacological and industrial applications.


Introduction
The polypore Lignosus rhinocerotis (Cooke) Ryvarden, or generally known as tiger milk mushroom, is a highly valued medicinal mushroom of the Southeast Asia. It is mainly distributed in China, Thailand, Malaysia, Indonesia, Philippines, and Papua New Guinea [1][2][3]. This mushroom is traditionally being used as tonic, to maintain general health, for immune enhancement, or as a treatment regime for numerous ailments including cancer, asthma, and bronchitis. It is also used to treat discomfort caused by fright, fever, coughing, vomiting, and cuts. The mushroom's sclerotium is the main source of medicinal material. Besides being consumed in the form of decoction (usually with other herbs), several other preparation methods for medicinal purposes have been documented [4]. In the interior of state of Pahang, tiger milk mushroom was found to be administered in a betel quid. While in the state of Kelantan, where the mushroom is often given to mothers after childbirth, its sclerotium is pounded with raw rice, infused, and drunk [5,6]. A technique mimicking cold water extraction has been described by Chan [7] where the sclerotium is grated on a hard surface such as granite plate with some water and the resulting mixture is further diluted with water before consumed.
Since its successful cultivation, researchers have been actively validating the traditional claims by various scientific approaches. Several authors have reported studies on the chemical composition, nutritive value, and health benefits of L. rhinocerotis. While some have also demonstrated the presence of antiproliferative activity in aqueous (hot and cold) or methanol pressurized liquid extracts, and hot water-soluble polysaccharide-protein complex isolated from L. rhinocerotis sclerotium against a panel of human cancer cell lines as well as various types of leukemic cells including acute promyelocytic leukemia cells (HL-60), chronic myelogenous leukemia cells (K562), and human acute monocytic leukemia cells (THP-1), through apoptosis and/ or cell cycle arrest [8][9][10][11]. Some other biomedical properties of L. rhinocerotis sclerotium such as immunomodulatory, neurite outgrowth-stimulating, antinociceptive, and antimicrobial properties have also been reported [12][13][14][15][16][17].
The genome of L. rhinocerotis was reported previously by our group. It is particularly enriched with sesquiterpenoid biosynthesis genes and appears to encode the capabilities to produce 1,3-β-and 1,6-β-glucans as well as several bioactive proteins including lectins and fungal immunomodulatory proteins [18]. With the rapid advancement of high-throughput deep sequencing technologies such as RNA sequencing (RNA-seq), application of genome-wide expression profiling on this mushroom is now possible. RNA-seq or "Whole Transcriptome Shotgun Sequencing" is a rapid and cost-effective revolutionary tool that uses the capabilities of next-generation sequencing platform to generate large numbers of high-quality short reads for comprehensive transcriptomics data mining [19,20]. In order to shed light into the genes that are expressed in the sclerotium of L. rhinocerotis, transcriptome analysis was conducted and the data generated was mapped to its genome to further enhance gene structure annotation and novel transcripts and alternative splicing predictions [21,22]. This is the first survey on transcriptome characterization for this mushroom where assessment of transcriptome coverage, gene sequences, and functional annotation by bioinformatics analysis were included. And in view of the availability of transcriptome data, the secondary metabolism and polysaccharides biosynthesis of L. rhinocerotis were re-examined to gain better understanding. The structural and functional features of several highly expressed genes were also discussed.

Strain and culture condition
Freeze-dried sclerotial powder of L. rhinocerotis TM02 cultivar was a product of LiGNO™ Biotech Sdn. Bhd. (Selangor, Malaysia). A voucher specimen was deposited at Royal Botanic Gardens, Kew (London, UK) with the accession number K(M) 177812. The fungal culture was grown in sterile rice-based solid media at 28°C in the dark for three months. Samples from the developed sclerotia were harvested and freeze dried to a completely dry form. The product was authenticated by DNA fingerprinting [23].

RNA preparation and library sequencing
Total RNA was extracted from fresh sclerotial powder using TRIzol 1 Reagent (Invitrogen, California, USA) according to manufacturer's instructions. A sufficient amount of total RNA of not less than 10 μg with OD 260 /OD 280 values of 1.8 to 2.2 was used. The sample was purified and treated with DNase in order to remove protein and DNA contaminants. Sample for transcriptome analysis was prepared as per Illumina manufacturer's instructions where magnetic beads with oligo(dT) were used to isolate polyadenylated mRNA (poly(A) + RNA) from the total RNA. Fragmentation buffer consisting of divalent cations was added for interrupting mRNA to short fragments of 200 to 700 nucleotides in length. These short fragments were used as templates to synthesize the first-strand cDNA using random hexamer-primer. The second-strand cDNA was synthesized using buffer, dNTPs, RNaseH, and DNA polymerase I. The products were purified and resolved with QIAquick PCR Purification Kit (Qiagen, California, USA) and EB buffer for end reparation and tailing A, respectively. Purified cDNA fragments were connected with sequencing adapters and gel electrophoresed to select suitable fragments for PCR amplification. Agilent 2100 Bioanalyzer and Applied Biosystems 1 StepOnePlus™ Real-Time PCR System were used in quantification and qualification of the sample library for quality control. A paired-end cDNA library was constructed and sequenced using Illumina HiSeq™ 2000 at BGI-Shenzhen, China.

Data processing
To avoid negative impact on subsequent bioinformatics analysis, raw reads were filtered by the removal of reads with adapters, unknown nucleotides larger than 10%, and low quality reads where more than half of the bases' qualities are less than five (Q 5). Base composition and quality distribution were determined as quality control. The obtained clean reads were then mapped to the genome and genes sequences of L. rhinocerotis TM02 using Short Oligonucleotide Analysis Package 2 (SOAP2) [24] where up to five bases mismatches were allowed in the alignment. The alignment data was used to calculate distribution of reads on reference genes and mapping ratio.

Bioinformatics analysis
For alternative splicing analysis, junction sites were detected by TopHat with default parameters [25]. Junction sites which provide information on boundaries and combinations of different exons in a transcript of the same gene were used to distinguish the type of alternative splicing event including exon skipping, intron retention, alternative 5' splice site, and alternative 3' splice site. For novel transcripts prediction, gene models found in intergenic regions [200 base pairs (bp) away from upstream or downstream genes] were thought to be potential candidate. A member of the SOAP, SOAPsnp was used to detect single-nucleotide polymorphism (SNPs) [24]. SNPs were identified on the consensus sequence through the comparison with the reference genome of L. rhinocerotis TM02 strain [18]. Gene models were aligned to SwissProt, TrEMBL, and NCBI nr (BLASTP cut-off e-value 1e-5) and their functional classification was annotated by Gene Ontology (GO), Cluster of Orthologous Groups (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [26][27][28].

Expression annotation
Gene expression level was determined by the reads per kilobase per million reads (RPKM) method developed by Mortazavi et al. [29] to eliminate the influence of different gene length and sequencing discrepancy. For instance, the RPKM value of gene A was calculated using the formula 10 6 ÁC Ä ((NÁL)/10 3 ) where C is the number of reads that uniquely aligned to gene A, N is the total number of reads that uniquely aligned to all genes, and L is the base number in the coding sequence (CDS) of gene A.

Phylogenetic tree construction
Protein sequences of interest were aligned with CLUSTAL X [30] and the poorly alignable regions were removed by GBlocks Server at http://molevol.cmima.csic.es/castresana/Gblocks_ server.html [31]. PROTTEST was used to select the best fit empirical substitution model of protein evolution based on the Bayesian information criterion [32]. Maximum-likelihood trees were constructed using MEGA software version 6.0.6 [33].

Data availability
Raw Illumina sequencing data of L. rhinocerotis TM02 strain was submitted to NCBI Sequence Read Archive (SRA) at http://www.ncbi.nlm.nih.gov/Traces/sra with the accession number SRR1509475 under experiment SRX648275. The Whole Genome Shotgun project of L. rhinocerotis TM02 strain used in this paper is version AXZM01000000 as previously described [18].

Results and Discussion
RNA sequencing assessment RNA sample extracted from L. rhinocerotis sclerotium was subjected to high-throughput Illumina sequencing in order to gain an overview of the fungus transcriptome. Over 5 Gb of raw reads were obtained from each cDNA library. The sequencing assessment revealed a good sequencing quality and alignment results. In total, 52,933,332 reads with 90 bp in length were obtained after filtering the raw reads. The clean reads were mapped to the genome and genes sequences of L. rhinocerotis ( Table 1). The mapping results provide an overall assessment of the sequencing. A total of 65.4 and 46.1% clean reads were able to map to the L. rhinocerotis genome and gene respectively and about 68.0 and 74.0% of the respective mapped reads got the perfect match (without mismatch). Less than 0.5% mapped reads are multi-position match while majority are unique match where the reads were mapped to unique (only one) positions of the genome or gene.
The randomness of mRNA fragmentation during experimental setting was evaluated witd the random distribution of reads in reference genes. A poor randomness of fragmentation leads to generation of reads from specific regions of the original transcripts and affecting subsequent analysis. In this study, the distribution of reads was homogeneous with good randomness of fragmentation (Fig 1A), thus indicating good quality sequence data. Fig 1B shows the distributions of genes' coverage (the percentage of a gene covered by reads) in L. rhinocerotis. A total of 88.2% (9,475 of 10,742) of genes were covered by the mapped reads from the transcriptome dataset where 83.0% of the genes showed perfect coverage of 90 to 100%. About 97.0% of genes' coverage is higher than 50%. A total of 8,641 (91.2%) transcriptome gene models were reassigned upon refinement of gene structures and alternative splicing (AS) analysis.

AS, Novel transcripts prediction, and SNPs
AS is a universal, post-transcriptional event in eukaryotes to increase protein diversity and functional complexity [34]. The seven known modes of AS are exon skipping (ES), intron retention (IR), alternative 5' splice site (A5SS), alternative 3' splice site (A3SS), alternative first exon, alternative last exon, and mutually exclusive exon. The last three modes of AS that produce high false positive results with TopHat were not examined in this study. A total of 3,673 AS events involving 2,497 genes were identified. It is worth noting that some genes produced two or more AS events (Fig 2, S1 Table). A3SS is the major mode of AS in L. rhinocerotis which makes up 65.0% of the total events. While unlike other fungi such as Cordyceps militaris and Tuber melanosporum, IR was not detected in L. rhinocerotis [21,35]. Several genes presented all three modes of AS, such as those coding for oxidoreductase (GME10159_g), aldehyde dehydrogenase (GME4106_g), peroxiredoxin Q (GME1120_g), and cytochrome P450 (GME4821_g).
Novel transcripts can be found via high throughput sequencing to enrich the present database which may be incomplete. A novel transcript is not a known gene but its expressed depth is at least 2. A total of 117 novel transcripts with 244 exons were predicted in L. rhinocerotis and about 28.2% of them were longer than 500 bp (S2 Table). The novel transcripts were mostly found in scaffold 1, 10, 100 with expression level (reads number per bp, RNPB) ranged from minimum 2 to highest 164.67.
SNP variation of L. rhinocerotis was investigated with SOAPsnp based on the Bayes' theorem (the reverse probability model) to call consensus genotype by carefully considering the data quality, alignment, and recurring experimental errors [24,36]. A total of 59,649 SNP events which account an average SNP frequency of merely 0.2% (59,649/34,316,739) per nucleotide involving 712 scaffolds were identified (S3 Table). About 56.2% (33,492) of the SNPs fall within the CDS of 6,857 genes and 13,359 SNPs in the coding region of 5,102 genes are nonsynonymous. Of them, 63 (0.5%) and 33 (0.3%) are nonsense and "nonstop" mutations, respectively. "Nonstop" mutation occurs within a stop codon and leads to the continued and inappropriate translation of the mRNA into the 3'-untranslated region [37].
Genome-based functional annotation of L. rhinocerotis transcriptome Gene expression was calculated using RPKM. The longest transcript for a gene was used to calculate its expression level and coverage when there is more than one transcript.   gene expression distribution in L. rhinocerotis transcriptome. The genes were grouped into eight log-scaled bins according to their expressions and they are considered to be lowly (or highly) expressed if their RPKMs are below 1 (or above 100); 4.5% of the genes are lowly expressed while 14.4% were considered highly expressed with RPKMs above 100. Genome-based functional annotation revealed that 8,377 (88.4%) of the 9,475 genes covered by the reads from transcriptome dataset were significantly similar to known proteins in the NCBI nr database where more than 75.0% of the 8,377 genes were non-informative with unknown function and/or uncharacterized yet. For instance, "predicted", "unnamed", "putative", "expressed", and "hypothetical" proteins. The e-value distribution of the top hits in the NCBI nr database showed that 72.5% of the mapped sequences exhibited strong homology ( 1e-50), while the e-value distribution of the remaining 27.5% homologous sequences ranged between 1e-5 and 2e-50. S4 and S5 Tables list the top 100 highly expressed annotated genes and top 100 highly expressed genes with informative annotations, respectively. GME3505_g, GME1052_g, and GME275_g that encode for respective cerato-platanin (CP), fungal hydrophobin, and lectin are among the genes with notable expression values (5,842.62-65,892.65 RPKM). Their structural and functional features were further discussed in a later sub-section. Comparison against the SwissProt database yielded a smaller percentage of matches with 5,684 (60.0%) significantly similar BLAST hits and 47.6% of them exhibited strong homology ( 1e-50). TrEMBL database returned a comparable percentage of matched genes to NCBI nr, 8,383 (88.5%) hits with 6,071 (72.4%) of them exhibited strong homology ( 1e-50). Altogether, 8,390 genes were assigned with descriptions by comparison with NCBI nr, SwissProt, and TrEMBL databases (data not shown).
The transcriptomics data was correlated with L. rhinocerotis proteomic dataset from our recent work [38] by comparing the mRNA abundance (in RPKM) and protein abundance (in peptide counts per kilobase). A total of 378 identified non-redundant proteins with complete data for the pair of variables ("Distinct peptide" > 0, RPKM > 0) were compared (S1 Fig). In particular, the putative lectins and CP were found to be abundantly expressed in the proteome of L. rhinocerotis sclerotium, which correlated with the transcriptome analysis. Two serine proteases, which account for 11.1% of the total proteome are encoded by genes GME4347_g (10.5%) and GME8711_g (0.6%) which were expressed at considerable level with RPKM values of 2,693.41 and 1,173.79, respectively [39]. Interestingly, a cytotoxic F5 fraction which was identified to be serine protease encoded by GME4347_g was found to exhibit potent selective cytotoxicity against human breast adenocarcinoma (MCF7) cells with IC 50 value of 3.00 ± 1.01 μg/ml [38].
In general, the protein abundance dataset (n = 378) had a modest positive correlation of 0.4290 with mRNA expression (p < 0.05). The relationship is not in a perfect association where for many of the values of RPKM, it appears that protein abundance (peptides/kbp genes) may be either low or high. An adjusted squared Pearson correlation coefficient of 0.1819 suggested that about 20% of the variation in protein abundance can be explained by mRNA expression levels (S1 Fig). Majority of the expressed proteins have RPKMs value of 100 to 1,000. The relatively poor correlation is not entirely surprising. Although transcriptome has always been perceived as a precursor for proteome, mRNA expression and corresponding protein do not necessarily follow linearity. This could be due to the differences in stability and turnover of mRNAs and proteins as well as various cellular regulatory processes such as transcriptional, posttranscriptional, translational, and protein degradation in controlling steady-state protein abundances [40]. Protein identification is further affected by the type of protein sample preparation methods used. For instance, the approach employed in our recent studies [38,39] which involved the elution of proteins from polyacrylamide gel electrophoresis is generally limited to proteins that are not too hydrophobic and within the molecular weight ranges of 6 to 202 kDa [41,42]. Nevertheless, the transcriptomics data in this study will complement the proteomic dataset in understanding the biology and pharmacological properties of this medicinal mushroom.
The genes covered by the reads from transcriptome dataset were mapped to KEGG to identify the biological pathways that are active in L. rhinocerotis. A total of 4,824 (50.9%) genes were assigned to the KEGG orthology (KO) system. About 1.7% (84) of them are lowly expressed with RPKMs below 1. Thus, only the moderately to highly expressed mapped genes (RPKMs above 1) were considered for downstream analysis. Fig 4 shows the representation of the KEGG pathways in first and second layers. The identified KEGG pathways provide a research platform for L. rhinocerotis metabolic pathways, particularly its secondary metabolism as described previously [18]. One gene could be annotated into more than one KO term. KEGG annotations of L. rhinocerotis transcriptome showed that carbohydrate and amino acid metabolisms were active in this mushroom. The annotated sequences for the genes involved in COG classifications were further identified. In total, 4,788 (50.5%) genes were assigned to COG. Only 90 (1.9%) of them are lowly expressed and was not considered in the analysis (Fig 5A). Some genes with multiple functions were classified into more than one COG category. The top ranked R category for "General function prediction only" with 1,704 genes suggested that these genes were not unambiguously assigned to a certain group. A high percentage of genes were from "Carbohydrate transport and metabolism", "Transcription", and "Replication, recombination and repair binding" while only a few were categorized to "RNA processing and modification", "Extracellular structures", and "Nuclear structure". KEGG pathways and COG classifications of L. rhinocerotis transcriptome revealed a large number of expressed genes involved in the processing of gene information (including transcription, translation, replication, and repair) and carbohydrate metabolism. This is comprehensible as some of the expressed high-abundance proteins such as lectins and serine proteases as previously reported are related to carbohydrate-binding activity and post-translational modification, protein turnover, as well as acting as chaperones, respectively [39].
A total of 2,705 (28.6%) genes were categorized into 33 GO functional groups under three main categories of "Cellular component", "Biological process", and "Molecular function" ( Fig  5B). The latter occupied the largest proportion (37.0%). A high percentage of genes were from "catalytic", "binding", and "metabolic process" while only a few were categorized to "extracellular region part", "cell killing", and "reproduction". These findings are almost similar to Agrocybe aegerita and Lentinula edodes [43,44]. Nevertheless, further studies are still needed in order to gain a better insight of the metabolic pathways underlying the sclerotium of L. rhinocerotis. About 0.7% (18) of the mapped genes are lowly expressed with RPKMs lower than 1 and were not considered in this analysis.
Secondary metabolite gene clusters from L. rhinocerotis transcriptome Some fungal secondary metabolites are known to have interesting bioactivity including antibacterial, antifungal, antiviral, and antitumor [45]. The genes involved in secondary metabolism of L. rhinocerotis were previously described [18]. In this study, the clustering of genes encoding secondary metabolites was re-examined using the transcriptome resource. Lowly expressed genes with RPKMs lower than 1 were not considered for analysis. A minimum of six gene clusters encoding four terpene synthases, and one each of non-ribosomal peptide synthetase, and polyketide synthase were identified (Table 2). These enzymes are crucial for the biosynthesis of terpenes, peptides such as cyclic peptide antibiotics, and polyketides, respectively. Terpenoids are known to have beneficial pharmaceutical properties. Some renowned fungal terpenoids include pleuromutilin antibiotics, anticancer active illudins, and the ganoderic acids from Ganoderma lucidum [46][47][48]. Present transcriptome study reveals the secondary metabolite pathways that are active in the L. rhinocerotis sclerotium. This finding will be a useful guide to establish the connection between genes and molecules for compound isolation work where at current, we are attempting to heterologously express these sclerotium-expressed secondary metabolite genes and to further characterise the gene products.
Cytochrome P450 (CYP) superfamily is a diverse group of enzymes involved in various physiological processes [49]. It was found that L. rhinocerotis genome had a total of 136 CYP sequences which can be classified into 37 families [18]. However, 22 of them were found to be not transcribed and five were lowly expressed (RPKMs below 1). The remaining 109 CYP sequences were classified into 32 families and ranged from 1.08 to 936.99 RPKM (S6 Table). CYP5144 family had the most number of genes (27 genes). This is followed by CYP5150 (13 genes) and CYP5037 (11 genes) families. These three CYP families are also highly enriched in other basidiomycete fungi including Phanerochaete chrysosporium, Phanerochaete carnosa, Agaricus bisporus, Postia placenta, Ganoderma sp., and Serpula lacrymans. Of all, CYP5144 family forms the largest P450 contingent family [50]. This family is associated with the metabolism of fungal steroids and xenobiotics [51,52]. Aspergillus sp. and Fusarium verticillioides which also have the largest numbers of CYP5144 family are known prolific producers of various secondary metabolites [53].

Insights into the highly-expressed genes in L. rhinocerotis sclerotium
The genes encoding several small secreted cysteine-rich proteins (CPs and hydrophobins) and lectins were among the highest expressed genes in the sclerotium of L. rhinocerotis. Several fungal proteins belong to these families have been shown to exhibit desirable bioactivities in previous studies. We performed additional bioinformatics and phylogenetic analyses with comparison to characterised genes/proteins to provide further insights into the functions and taxonomic distribution of these highly-expressed genes in L. rhinocerotis sclerotium.

Cerato-platanins (CPs)
The high expression of several genes in the CP domain containing family correlated well with our proteomic data [38,39] (Table 3). Among the highly expressed genes, GME3505_g with 65,892.65 RPKM was identified from shotgun proteomics with 2.26% abundance [38]. The CP family conserved domain spans from 20 to 136 residues of the protein. In addition to GME3505_g, a putative protein encoded by GME9467_g with 2,042.84 RPKM and 0.06% protein abundance was also found to carry a CP domain from 21 to 139 residues ( Table 3). The CP domain was found to be homologous to an immunomodulatory protein of Trametes versicolor FP-101664 SS1 (XP_008036158.1) with 79.0% identity [54]. Multiple sequence alignment of GME3505_g and GME9467_g with their close homologs and several characterised CP family members revealed the four conserved cysteine residues and two conserved tryptophan residues which are hydrophobic (Fig 6). The two disulphide bonds characteristic of CPs were presumably formed between the four conserved cysteine residues of GME3505_g at positions 20, 58, 61, and 115 (Fig 6). For GME9467_g, the disulphide bonds were formed at Cys-21-Cys-59 and Cys-62-Cys-118. Among the aligned CP domain-containing proteins, 14 of them (inclusive of GME3505_g and GME9467_g) contain the CSD amino acid signature sequence while both AspF13 antigen isolated from Aspergillus fumigatus and the phytotoxic CP from ascomycete Ceratocystis fimbriata f. sp. platani contain the CSN signature [55]. CP family proteins have been reported to exhibit various biological activities and their biological roles remain unclear. Some CPs were found to act as allergens while others are elicitors and phytotoxins [56]. Baccelli [57] proposed that the fungal specific CP family proteins exhibit expansin-like activity during fungus-plant interactions and may play multiple biological roles associated to the fungal growth and development, parasitism, cell wall morphogenesis, immune response, and chemotaxis [58]. Phylogenetic analysis of close homologs of CP family members along with previously reported CPs shows that they can be grouped in two major clades corresponding to their origins from basidiomycetes or ascomycetes (Fig 7). The ascomycetes clade was further divided into two groups. As expected, both CP domain-containing proteins from L. rhinocerotis clustered together with the rest of basidiomycete sequences including CPL2 and the novel immunomodulatory protein Aca1 from Heterobasidion annosum and Taiwanofungus camphoratus (synonym to Antrodia camphorata) respectively but did not cluster with SnodProt1 (from Phaeosphaeria nodorum, synonym: Parastagonospora nodorum) [59] and CP [60] that have virulence implications as well as the allergenic AspF13 [61]. GME9467_g was found to cluster with the CP domain-containing sequences of Trametes sp. and Dichomitus squalens that corresponded to their taxonomic hierarchy, which was consistent with our previous phylogeny study [18]. This suggests that GME9467_g homologs are conserved among the Polyporaceae family. On the other hand, GME3505_g which shares 68.3% identity and 75.6% similarity to GME9467_g was found to be relatively distant away from its sub-group.
Among the homologs that clade together with the L. rhinocerotis proteins, CPL2 was reported to act as an elicitor of defence responses and/or effector of resistance on host and non-host plants [62] while Aca1 possessed immunomodulatory activity by exhibiting TLR2-dependent NF-κB activation and M1 polarization within murine macrophages [63]. Considering this, the putative CP-encoded genes of L. rhinocerotis may possess similar biological activities, which may contribute to the medicinal properties of this mushroom.

Fungal hydrophobins
CP was found to be closely related to hydrophobins although it does not contain the unique eight-cysteine pattern characteristic of the latter. Both protein families possess variable degrees of hydrophobicity and accumulates abundantly on the fungal cell surface [60]. The small-tomedium sized hydrophobins (75-120 amino acids) are found in many saprophytic and/or pathogenic fungi [64]. They play prominent roles in fungal morphogenesis, pathogenicity, and host specificity [65]. Putative fungal hydrophobins in L. rhinocerotis sclerotium are encoded by GME1052_g, GME1560_g, GME10149_g, GME2954_g, and GME6443_g; all with RPKMs above 1,000 ( Table 3). The annotated hydrophobins may have similar roles. These genes carry either a hydrophobin (pfam01185) or HYDRO (smart00075) domain. Both domains are members of the hydrophobin superfamily (cl02451). High hydrophobicity of the gene products was found to be incompatible with our previous approaches for protein identification where their abundances were not detected [38,39]. A different method to isolate hydrophobic proteins should be used in the future to identify these proteins. Multiple sequence alignment of the genes to other relatively close amino acid sequences and a few class I hydrophobin family members yielded a mean distance of 1.2137 with overall disparity index of 0.1663 (Fig 8). The gene sequences within the hydrophobin domain region were found to contain four to eight conserved cysteine residues. However, only protein primary structures of GME1052_g and GME10149_g show the C-CC-C-C-CC-C unique pattern of eight cysteine residues forming four conserved disulphide bonds. Nevertheless, sequence conservation between the genes is relatively low. The hydrophobin domain regions of GME1560_g, GME2954_g, and GME6443_g contain a tryptophan residue at position 58 instead of the consensus cysteine. GME2954_g and GME6443_g further show partial domain. Thus, their functionality remains questionable.
Phylogenetic analysis of several fungi from the phylum Basidiomycota shows the five hydrophobin domain-containing gene sequences in L. rhinocerotis are sequentially diverse and their corresponding homologs are distributed across various fungal taxa (Fig 9). Multiple-site indels and variations are likely to account for the differences among the sequences. The two full sequences GME10149_g and GME1052_g show a close genetic linkage to the respective novel 8 kDa HGFI (isolated from Grifola frondosa) which forms rodlets in compressed monolayers [66] and HYD1-like protein isolated from Tricholoma fulvum. The hyd1 gene is developmentally regulated in the ectomycorrhizal fungus [67]. Due to the amphipathic nature of these hydrophobins, they have great potential for various biotechnological and medical applications ranging from medical and technical coatings to the production of proteinaceous glue and cosmetics [68].

Lectins
In the genome of L. rhinocerotis, lectins are encoded by highly expressed genes such as GME275_g, GME273_g, and GME272_g (1,198.14-5,842.62 RPKM and some minor ones like GME271_g, GME270_g, GME2199_g, GME2200_g, GME269_g, and GME266_g with RPKMs from 0.76 to 201.79 (Table 3, S5 Table). GME275_g, GME272_g, and GME273_g with protein abundances from 0.29 to 14.96% contain a Jacalin-like plant lectin domain with sugar (chemical) binding site [38]. These Jacalin-like lectins are mostly found in plants and may occur in various oligomerization states. Fig 10 shows the sequence alignment of the genes' sugar binding site to top listed sequences from public databases. They own a β-prism topology with circularly permuted three-fold repeat of a structural motif and bind mono-or oligosaccharides with high specificity [69]. Unlike plant lectins, the sugar binding site of mushroom lectins show diverse structures, sequences, and carbohydrate recognition properties [70], as supported by the findings from this study. In fact, only a few consensus residues were identified from GME273_g and GME272_g with considerable mutual homology upon sequence alignment to top listed sequences from public databases. On the other hand, GME275_g showed unique sequence (Fig 10).
As these putative lectins encoded by GME275_g, GME272_g, and GME273_g show 36 to 41% identity to Jacalin-type lectin from G. frondosa (BAE43847.1) [71], they may belong to  Group-1 lectins, also known as galectin-like lectins, which are characterized by the presence of a galactose binding lectin domain but with large sequence diversity among the members [70]. Prototype galectin fold consists of a β-sandwich formed by two parallel six stranded antiparallel β-sheets with either one of them forming a concave surface for the binding of β-galactosides [72]. Therefore, it is likely that lectins from L. rhinocerotis are all-β proteins and may have similar fold to Jacalin and/or galectin. Further investigations like X-ray crystallography, nuclear magnetic resonance spectroscopy, and electron microscopy are required in order to determine their exact structures. Several fungal lectins have been shown to possess potential pharmacological properties such as mitogenic, immunoenhancing, antiproliferative, antitumor, vasorelaxing, and hypotensive activities [73,74]. For instance, lectin from G. frondosa fruiting bodies, a close homolog of L. rhinocerotis lectins, displayed antiproliferative potential against HeLa cells at 25 μg/ml by cross-linking these cell surface glycoconjugates or through immunomodulatory effects [75,76]. Whether these putative lectins in L. rhinocerotis TM02 possess similar pharmacological roles awaits further elucidation.
Fungal lectins may also be involved in various physiological functions related to fungal growth, development, dormancy, and morphogenesis by mobilizing sugars (or carbohydrates) via their specific binding sites and also play a part in molecular recognition during early stage of mycorrhization and fungal morphological changes consequent on parasitic infection [77]. They may as well serve as defence proteins like in plants against predators as they may possess toxic activities such as insecticidal, vermicidal or antiviral [72]. For instance, mushroom galectin Coprinopsis cinerea CGL2 was found to play a role in the defence against nematode Caenorhabditis elegans via binding to specific glycoconjugate of the host [78]. The various putative lectins encoded by L. rhinocerotis genome may play similar roles. In particular, they may act as passive-defence proteins to accumulate nitrogen and/or carbohydrate reserves considering the life cycle of this mushroom.

Conclusions
Insights into the transcriptome of L. rhinocerotis is valuable for inferring the functional elements of its genome and for further understanding of the medicinal properties of this mushroom. This is the first transcriptome re-sequencing analysis of L. rhinocerotis using highthroughput Illumina HiSeq™ platform. The transcriptome analysis revealed the expression of several secondary metabolite biosynthetic pathways (in particular for terpene biosynthesis) and putative genes involved in glucans biosynthesis in the sclerotium of L. rhinocerotis. Interestingly, several genes encoding the cysteine-rich CPs, hydrophobins, and sugar-binding lectins were among the most highly expressed genes in L. rhinocerotis sclerotium. The homologs of some of the CPs and lectins in L. rhinocerotis have been previously implicated to exhibit various bioactivities relevant to human diseases. Data from this study provides a valuable resource for future research and exploitation of this mushroom. Specifically, this study identified several candidate proteins and secondary metabolite pathways that warrant further investigations in terms of their bioactivities. Recombinant expression of some of these secondary metabolite pathways and cysteine-rich proteins for further characterisation is currently underway.