Comparison of Metatranscriptomic Samples Based on k-Tuple Frequencies

Background The comparison of samples, or beta diversity, is one of the essential problems in ecological studies. Next generation sequencing (NGS) technologies make it possible to obtain large amounts of metagenomic and metatranscriptomic short read sequences across many microbial communities. De novo assembly of the short reads can be especially challenging because the number of genomes and their sequences are generally unknown and the coverage of each genome can be very low, where the traditional alignment-based sequence comparison methods cannot be used. Alignment-free approaches based on k-tuple frequencies, on the other hand, have yielded promising results for the comparison of metagenomic samples. However, it is not known if these approaches can be used for the comparison of metatranscriptome datasets and which dissimilarity measures perform the best. Results We applied several beta diversity measures based on k-tuple frequencies to real metatranscriptomic datasets from pyrosequencing 454 and Illumina sequencing platforms to evaluate their effectiveness for the clustering of metatranscriptomic samples, including three dissimilarity measures, one dissimilarity measure in CVTree, one relative entropy based measure S2 and three classical distances. Results showed that the measure can achieve superior performance on clustering metatranscriptomic samples into different groups under different sequencing depths for both 454 and Illumina datasets, recovering environmental gradients affecting microbial samples, classifying coexisting metagenomic and metatranscriptomic datasets, and being robust to sequencing errors. We also investigated the effects of tuple size and order of the background Markov model. A software pipeline to implement all the steps of analysis is built and is available at http://code.google.com/p/d2-tools/. Conclusions The k-tuple based sequence signature measures can effectively reveal major groups and gradient variation among metatranscriptomic samples from NGS reads. The dissimilarity measure performs well in all application scenarios and its performance is robust with respect to tuple size and order of the Markov model.


Introduction
The comparison of microbial communities is crucial for understanding how environment factors affect the composition and the function of the communities [1]. For the comparison of microbial communities, effective similarity/dissimilarity measures between the communities are urgently needed. Alignment-based sequence comparison methods such as the Smith-Waterman algorithm [2] and BLAST [3] have been the dominant approaches for the comparison of individual genomes when their genome sequences are available. However, microbial communities are usually complex and contain mixtures of hundreds to thousands of genomes with unknown genomic sequences. With the emergency of next generation sequencing (NGS) technologies, whole metagenome/metatranscriptome shotgun sequencing is becoming a new powerful approach to investigate complex microbial samples [4][5][6][7][8][9][10][11][12][13]. In metagenomic studies, the nucleotide DNA sequences are sampled and the composition of different organisms within the communities can be studied. On the other hand, in metatranscriptome studies, RNA sequences are sampled from the communities and the expression levels of various RNA molecules can be estimated. For both metagenomic and metatranscriptomic data, alignment-based approaches for the comparison of communities may not be applicable because the reads can be sampled from different parts of the genomes or RNAs of the various organisms. Alignment-based approaches are highly dependent on reliable reference sequences of known gene and/or pathway databases. However, most environmental microbial communities contain uncultured microorganisms, which affect the accuracy and completeness of alignment-based approaches. Therefore, alignment-free approaches provide attractive alternatives.
The k-tuple (k-word, k-gram) sequence signature of a community is defined as the frequency of k-tuples among the reads from the community. Previous studies have shown that k-tuple frequencies are similar across different regions of the same genome, but differ between genomes [14], which offers the theoretical basis to measure the dissimilarity between microbial communities with ktuple method. The sequencing result of one microbial community is represented by a frequency vector of k-tuples, dependent only on the sequencing data, free from sequence alignments and reference genome information. Therefore, k-tuple based methods could potentially be very useful for the comparison of short reads data from NGS of microbial communities.
With the k-tuple sequence signature, each NGS data from a genome is represented by the k-tuple frequency vector whose elements are the number of occurrences of every k-tuple. With the k-tuple vectors, measurements of the dissimilarity between two genomic sequences were studied from different points of views, goals and applications. For two genomic sequences, the uncentered inner product of the two sequence signatures, D 2 , was first proposed to measure the distance of two k-tuple vectors [15], and then it was widely used in sequence database searches [16] and clustering of expressed sequence tags (ESTs) [17]. Various dissimilarity measures based on D 2 were extensively studied with different normalization, centralization and background models, including D Z 2 [18], D S 2 [19],D Ã 2 [20], S2 [21,22] and a measure in CVTree [23] (called Hao in this paper). However, the above measures were developed for evaluating the distance between two long sequences (such as genomic sequences). For better performance applying to NGS data, D S 2 and D Ã 2 were further normalized to d S 2 and d Ã 2 [24], respectively, with a range from 0 to 1, in order to reduce the effects of sequence length and different nucleotide probabilities.
The k-tuple sequence signatures have been applied to compare microbial communities based on metagenomic data [25]; study the evolutionary relationships among genomic sequences [26]; identify horizontal gene transfer among different genomes [27]; and bin genomic fragments from metagenomic samples [4,28]. For the comparison of microbial communities based on NGS metagenomic datasets, Rohwer et al. [29] analyzed the ability of di-, triand tetra-nucleotide signatures to explain the variance between biomes and identified metagenomes with anomalous content. HabiSign [7] utilized patterns of tetra-nucleotide usage to cluster metagenomes at biome, phenotypic and species levels. Furthermore, Jiang et al. [25] applied 13 k-tuple based dissimilarity measures to compare metagenomic samples by clustering them into different groups as well as recovering environmental gradients affecting microbial samples, aiming to evaluate the performances of these measures on comparing metagenomic samples. They found that k-tuple sequence signatures can successfully reveal major group and gradient relationships among metagenomic samples from NGS reads without alignment to reference databases and the d S 2 dissimilarity measure outperforms others in all application scenarios.
With the development of experimental and sequencing techniques, many metatranscriptomic datasets have been generated to explore the expressed genes and their abundances in microbial communities [5,6,9,11,13]. Metatranscriptomic data offers more elaborate details about what genes are expressed and their expression levels as well as the functions of the microbial communities and thus, it is crucial to compare metatranscriptomes among microbial environments. As far as we know, no alignmentfree approaches have been used to compare metatranscriptomic data. Although our previous researches verified the effectiveness of alignment-free approaches on distinguishing the metagenomic datasets [25], it is built on the theoretical basis obtained by the previous studies that k-tuple frequencies are similar across different regions of the same genome, but differ between genomes [14]. When the target switches from DNA to RNA, the quantity and the structure of sequences are significantly changed. At the same time, the different characteristics of RNA from DNA, such as degradation, stability, easiness to be broken and alternative splicing, etc., bring different preferences and bias distributions to the sequencing. When the expression abundance information is imported and the sequences of intron and inter-genic regions are taken out, whether the alignment-free approaches are valid to distinguish the metatranscriptomic datasets is a critical question for their further applications to the metatranscriptomic datasets. Therefore, in this paper, we applied 16 k-tuple sequence signature measures to 99 metatranscriptomic and 16 metagenomic datasets from 13 communities/projects, among which 92 datasets from 12 communities were generated by the pyrosequencing 454 platform and 7 datasets from 1 community were generated by the Illumina Genome Analyzer IIx platform. The processing follows the same steps with our previous work [25]: counting k-tuple vectors of each dataset, calculating signature measures between dataset pair and then clustering according to the dissimilarity matrix. We conducted a series of computational experiments to study the effectiveness of the 16 ktuple based sequence signature measures in clustering metatranscriptomic or mixture of metagenomic and metatranscriptomic datasets, identifying gradient relationships of microbial community samples, clustering ability when sequencing depth is low and the effect of sequencing errors on their performance. We also investigated the effects of various tuple sizes and the order of Markov model for the background genome sequences. We also developed a software pipeline to implement the processing procedures, which is more efficient in calculating, more comprehensive in function and more convenient to use compared to d2Meta for calculating the three d2-type measures in previous work [25] for analyzing metagenomic datasets.

Dissimilarity Measures based on k-tuple Sequence Signature
The sequence signature of a NGS data set counts the number of k-tuple occurrences within the reads. This representation makes the direct comparison of two sequence datasets, for example, two metatranscriptomic sequencing datasets, feasible. The comparison is free from alignment of the reads to reference sequences, which are often incomplete or unavailable. Therefore, in our paper, the sequence signature represented by k-tuple frequency is applied to compare metatranscriptomic datasets.
Without alignment to genome/transcriptome, the information of the reads' strand direction cannot be obtained. Hence, we take both a read and its complement into consideration when counting k-tuple frequencies. For metagenomic or metatranscriptomic sequencing data, with four possible alphabet S~fA, C, G, Tg, there are 4 k possible tuples of length k in all reads. The numbers of occurrences in the whole sample for the k-tuples form the k-tuple frequency vector of the sample.
The k-tuple frequency vector represents the absolute number of occurrences of each k-tuple. For the distance between two datasets, proper normalization or centralization of the sequence signature is required. Therefore, there are various dissimilarity measures with different normalizations. In this paper, we studied various dissimilarity measures based on k-tuple frequency vectors, including three d 2 Àtype (d 2 ,d S 2 ,d Ã 2 ) [24] dissimilarity measures, one dissimilarity measure in CVTree (called Hao in the paper) [23], one relative entropy based measure S2 [22] and three classical l p Ànorm distances, Manhattan (l 1 Ànorm), Euclidean (l 2 Ànorm) and Chebyshev (l ? Ànorm). The detail of these dissimilarity measures can be found in the corresponding reference papers. For completeness, we briefly summarize the calculation for each measure as follows.
Let c X~( c X ,1 ,c X ,2 , Á Á Á ,c X ,4 k ) and c Y~( c Y ,1 ,c Y ,2 , Á Á Á ,c Y ,4 k ) be the k-tuple frequency vectors from two NGS sequencing datasets X and Y , respectively. Let n S~P 4 k i~1 c s,i , S~X ,Y be the sum of the counts of all k-tuples.
The d 2 dissimilarity measure [24] is based on the D 2 statistic [15] and is defined as: The d S 2 and d Ã 2 dissimilarity measures [24] are based on the D S 2 and D Ã 2 statistics [30] and they are defined as: and For formulas (2) and (3),c c X ,i~cX ,i {n X p X ,i and c c Y ,i~cY ,i {n Y p Y ,i are centralizations, where p .,i is the probability of the i-th k-tuple under a probability model (Markov model of order = 0-3) for the sequences. The ranges of d S 2 and d Ã 2 are between 0 and 1.
The Manhattan (Ma), Euclidean (Eu) and Chebyshev (Ch) dissimilarity measures are classical l p Ànorm distances and are defined as: Ch For formulas 4 ð Þ, 5 ð Þ and 6 ð Þ, f X~c X n X and f Y~c Y n Y

:
The Hao dissimilarity measure is from [23] by Bailin Hao's group, so it is called Hao in our paper to simplify the notation. Hao is defined as follows: where E f .,i jM k{2 ½ is the expectation of f .,i under the (k22)-th order Markov chain.
The S2 dissimilarity measure is a relative entropy based dissimilarity measure developed by Tianming Wang's group [22]. S2 is defined as: For formula (8), Q X ,i~fX ,i p X ,i and Q Y ,i~fY ,i p Y ,i , where, p X ,i and p Y ,i are the probability of the i-th k-tuple under an r-th order Markov chain for X and Y , respectively.
The construction of Markov model [25]. Markov models of different orders were used to model the background sequences for the d S 2 , d Ã 2 , S2 and Hao dissimilarity measures. Under Markov model M r of order r, the probability of a k-tuple w~w 1 w 2 . . . w k , namely the expected frequency, can be computed as: where p(w j ) is the probability of w j estimated by the ratio of the number of occurrences of w j over all the number of nucleotides. The value of p(w 1 w 2 . . . w r ) is estimated by the ratio of the number of occurrences of w 1 w 2 . . . w r over all the number of r-tuple occurrences. The value of p(w jzr jw j w jz1 . . . w jzr{1 ) is estimated by the fraction of occurrences of w jzr conditional on the previous occurrences of w j w jz1 Á Á Á w jzr{1 :

Evaluation Metrics and Implementing Codes
Symmetric difference was originally defined to compare two sets. Robinson used symmetric difference as a criterion to evaluate the consistence between two trees [31]. For two trees with the same leaf nodes l 1 ,l 2 , Á Á Á ,l n , let A and B be the sets of nodes including the internal nodes of the two trees, respectively. Each node is denoted as a subset of their clustered leaves, that is, n p~( l i1 ,l i2 , Á Á Á ,l ip ). For tree nodes sets A~fn A1 ,n A2 , Á Á Á ,n An{1 g and B~fn B1 ,n B2 , Á Á Á ,n Bn{1 g, the symmetric difference is the number of nodes in ADB~(A\B 0 )|(B\A 0 ), the union of nodes that are present in one tree but not in the other one, where A 0 is the complement of set A. Actually, the symmetric difference is the number of nodes that are in one tree and not in the other one. Compared with Parsimony score [32,33], symmetric difference takes the order of hierarchical clustering into consideration, which offers more meticulous comparison. The symmetric difference does not use branch length information, only the tree topologies. Hence, symmetric difference is applied to evaluate the consistence between the reference tree and the clustering tree in our study. Symmetric difference is calculated with Treedist from Phylip (http://evolution.genetics.washington.edu/phylip.html).
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) [34] is used for hierarchical clustering based on dissimilarity matrix. Firstly, the dissimilarity between any two clusters A and B is calculated as the average of all dissimilarities between pairs of objects x in A and y in B, written as: 1 where d(x,y) is the dissimilarity between x and y. Then, at each step, the nearest two clusters are combined into a higher-level cluster. UPGMA is implemented with the function 'upgma' from the 'phangorn' toolbox of R. PCoA (Principal Coordinates Analysis) [35] is also known as classical multidimensional scaling. If a dissimilarity matrix is denoted as D~d ij À Á n|n , the objective of PCoA is to find X 1 ,X 2 , Á Á Á ,X n , where X i is a vector in an N-dimensional Euclid space, N,n, to optimize the function min X1,X2,ÁÁÁ,Xn The results of PCoA are a set of eigenvalues and eigenvectors. The corresponding eigenvector of the largest eigenvalue is the first principal coordinate. The Goodness Of Fit (GOF) of PCoA reflects the accuracy that the coordinates approximate the distance matrix. The PCoA is implemented with the function 'pcoa' from the R 'ape' toolbox. Spearman's Ranking Correlation Coefficient (SRCC) assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect SRCC of +1 or 21 occurs when each of the variables is a perfect monotone function of the other. In our study, SRCC is applied to evaluate the relationship between the gradient variable and the first principal coordinate of different measures. The SRCC is calculated by the function 'cor' from the R toolbox 'stats' of R.
The implementation software pipeline. We developed d2Tools with Python and R to count the k-tuple vectors, calculate probabilities of k-tuples under various orders of the Markov models and calculate the dissimilarity matrices under various dissimilarity measures d 2 ,d S 2 ,d Ã 2 , Hao, S2, Ma, Eu and Ch. The tool package can be downloaded from http://code.google.com/p/d2tools/. Compared with d2Meta, the tool to implement the same processing steps for metagenomic comparison [25], d2Tools has the following improvements: d2Meta only computes three d2-type (d 2 ,d S 2 ,d Ã 2 ) dissimilarity measures. For d2Tools, besides the three d2-type measures, Hao, S2, Euclidean, Mahattan, and Chebyshev distance measures are also included.
d2Meta only computes the probabilities of 0-order Markov model for d2-type measures, while d2Tools computes the probability and dissimilarity matrices on 0to 3-order Markov model for these measures.
d2Meta accepts only the reads files as input and d2Tools accepts the reads files or the generated frequency and probability files by d2Tools as inputs. Therefore, users only need to keep the frequency of k-tuple files for future analysis. In addition d2Tools recognizes the input file format with the suffix name automatically.
We tested the d2Tools on the dataset including four samples (each sample is about 200 MB in size) in fasta format. There are a total of 2,830,286 reads [mean = 707,5716164,498(SD) per sample] and the read length is 1646102(SD) nt. It takes about 4 hours and 1.45 GB memory to finish the pipeline for 2-10 tuple sizes and 4 samples serially for all measures. The tuple sizes can be implemented simultaneously with simple shell command, which will speed up the running time of the pipeline. The running time of calculating the k-tuple frequency vector depends on the size of input datasets and the tuple size. The running memory mainly depends on the tuple size.

Real Metatranscriptomic and Metagenomic Datasets
We downloaded 99 metatranscriptomic and 16 metagenomic datasets from 13 communities/projects in CAMERA (http:// camera.calit2.net/) and NCBI SRA (http://www.ncbi.nlm.nih. gov/sra) databases. Ninety two metatranscriptomic and 16 metagenomic datasets from 12 communities were generated by the pyrosequencing 454 platform and 7 datasets from one community were generated by the Illumina Genome Analyzer IIx. There are a total of 16,453,499 reads with length 198692 nt in the 92 metatranscriptomic datasets and 6,031,133 reads with length 4436212 nt in the 16 metagenomic datasets from the 454 sequencing platform. There are 11,447,063 reads with 76 nt in the 7 metatranscriptomic datasets from the Illumina platform. In the previous study of metagenomic comparison [25], 39 mammalian guts samples from 33 species (including omnivore, herbivore and carnivore samples) were analyzed the grouping ability of the sequence signature measures and 56 global ocean samples from 1 project were applied for grouping and gradient performance analysis. All the data in the previous study were metagenomic datasets generated with the pyrosequencing 454 platform. In our study, 92 metatranscriptomic ocean samples from 12 projects covering global ocean with different geographical locations, collection times, seasons and depths were applied to detect grouping and gradient performance for datasets from the pyrosequencing 454 platform. We also analyze the grouping characteristics when metagenomic and metatranscriptomic datasets co-exist using eight metagenomic and eight metatranscriptomic datasets. Furthermore, in order to explore the performance of sequence signature measures for different sequencing platforms, seven metatranscriptomic datasets of cecum and colon from the Illumina platform were analyzed.
The brief summaries of the datasets of the 13 projects are shown in Table 1 and the locations of 11 marine communities are marked in Figure 1.
Datasets in Experiment 1. All the 92 metatranscriptomic datasets from the pyrosequencing 454 platform in Table 1 were analyzed with the various dissimilarity measures. The objective is to evaluate their performance of grouping different samples/ communities. First, 19 metatranscriptomic datasets from 4 different geographical marine locations (Dataset 2,4,7,11 in Table 1) were studied. There are 6,633,683 total reads [mean = 349,1416258,620(SD) per sample]. The read length is 180694(SD) nt. These four communities are located on subtropical north Pacific (Hawaii), north Atlantic (West English Channel), foot of north Atlantic (Sapelo Island) and East Pacific ocean (Gulf of California). They are geographically separated distinctively and sampled in May or August, from the surface or deep sea hydrothermal plumes. Therefore, the reference cluster of their grouping is clear. With k-tuple vectors, different measures are applied to calculate the dissimilarity between sample-pairs. Based  Table 1, were collected from different locations with two research cruises in the Equatorial North Atlantic ocean and South Pacific Subtropical gyre. The locations of the other 11 communities are marked on the above map (using the DatasetID from on the dissimilarity matrix, the different samples are hierarchically clustered using UPGMA [34]. The symmetric difference scores between the derived clusters and the reference cluster are calculated. Second, with obtained optimal k, Markov model and dissimilarity measures yielding the lowest symmetric difference, the entire 92 metatranscriptomic datasets from 12 communities/ projects are clustered to see the performance of corresponding measures.
To evaluate the effect of sequencing depth on the dissimilarity between metatranscriptomic sample-pairs, the 19 metatranscriptomic datasets of 4 communities were sampled with different rates. For each sample, 10%, 1% and 0.1% of reads are sampled randomly for 100 times, and the averaged symmetric differences between the clustering results of the sampled reads and the reference cluster are calculated to assess the effect of sequencing depth under different dissimilarity measures.
Datasets in Experiment 2. The objective of Experiment 2 is to evaluate the performance of these measures in recovering the gradient relationships among the samples, that is, the change of gene expression along a gradient such as ocean depth, temperature, or pH level. Eight metatranscriptomic data sets from 25 m, 75 m, 125 m and 500 m depth (Dataset 12 in Table 1, two samples for each depth) of North Pacific Subtropical Gyre (NPSG) in ALOHA stations are used. Except for the depth, other collecting conditions are the same. Therefore, they are depthgradient metatranscriptomic datasets. There are a total of 451,482 reads [mean = 56,43567,701(SD) per sample]. The read length is 122617(SD) nt. Based on the dissimilarity matrices under different settings and measures, PCoA is carried out to find the principal coordinate. Then the SRCC is calculated between the first principal coordinate and the pre-determined depth-gradient axis. A higher SRCC indicates better performance in identifying the gradient among the metatranscriptomic samples.
For these datasets, we also randomly sample them with 10%, 1% and 0.1% rates for 100 times. The PCoA and SRCC are also calculated to evaluate the measures' ability to identify the gradient datasets under different sequencing depth.
Datasets in Experiment 3. The objective of Experiment 3 is to evaluate the ability of the dissimilarity measures to separate metagenomic from metatranscriptomic datasets. Eight metatranscriptomic and eight corresponding metagenomic datasets of two communities, respectively (Datasets 11 and 12 in Table 1 To evaluate the effect of low sequencing depth, we sample the 16 metagenomic and metatranscriptomic datasets with 10%, 1% and 0.1% rates for 100 times to see the corresponding clustering results. Datasets in Experiment 4. Currently, most metatranscriptomic datasets are produced by the pyrosequencing 454 platform. With the decrease of Illumina sequencing cost, some researchers began to sequence metatranscriptomic datasets with the Illumina platform. The objective of Experiment 4 is to evaluate the ability of the dissimilarity measures to separate metatranscriptomic communities from Illumina datasets. The datasets (Datasets 13 in Table 1) in this experiment are the metatranscriptomic samples from mouse cecum and colon [36], generated by the Genome Analyzer IIx (GaIIx) platform. There are 12 datasets from 7 samples with two RNA extraction protocols, Qiagen-based protocol and Invitrogen protocol. In order to obtain clear clustering ground truth, we extracted 7 samples produced with the Qiagen protocol. They are NOD501CecQN, NOD501-ColQN, NOD502CecQN, NOD502ColQN, NOD503CecQN, NOD504CecQN and NOD504ColQN, where the digital numbers, such as '501' are the mouse ID, 'Cec' means cecum and 'Col' means colon, QN means Qiagen-based protocol. In the seven metatranscriptomic datasets, there are 11,447,063 [mean = 1,635,2946333,882(SD) per sample] and the read length is 76 nt.
To evaluate the effect of low sequencing depth, we sample the seven metagenomic datasets with 1%, 0.1% and 0.01% rates for 100 times to see the corresponding clustering results.
Datasets in Experiment 5. The objective of this experiment is to study the effect of sequencing errors on the performance of these measures. The same 19 metatranscriptomic datasets from 4 different geographical marine locations (Datasets 2,4,7, and 11 in Table 1) in Experiment 1 were used in this experiment. The 19 datasets were considered as complete correct sequencing data. According to the characteristics of pyrosequencing 454, 1% indel and 0.1% substitution errors are imported to the datasets with FlowSim [37], a simulation software, and output simulated reads with pre-setting error rate and error models. Compared with the clustering results of the original datasets, the effects of sequencing errors on the performance of these dissimilarity measures are studied.

Results
The real data are used to analyze the effectiveness of k-tuple based sequence signature measures for the comparison of microbial community samples. We studied the performance of d 2 ,d S 2 ,d Ã 2 , Hao, S2, Ma, Eu and Ch dissimilarity measures.

Experiment 1: The Performance of Different Dissimilarity Measures using Sequence Signatures for Clustering Global Ocean Metatranscriptomic Datasets
Ninety-Two metatranscriptomic data collected from global ocean by twelve different projects were sequenced with the pyrosequencing 454 platform and were downloaded from CAMERA and NCBI SRA. The data are from different geographic locations including Hawaiian Ocean, Mexican Gulf, California Gulf, Norwegian Fjord, North Atlantic ocean, South Pacific ocean, Western English Channel and Eastern Equatorial Atlantic Ocean mixed with Amazon river plume. The descriptions and datasets ID can be found in Table 1.
First, 19 metatranscriptomic samples of 4 communities that have clear grouping relationships were extracted. These four communities are geologically separated distinctively, located on subtropical north Pacific (Hawaii), north Atlantic (West English Channel), foot of north Atlantic (Sapelo Island) and East Pacific ocean (Gulf of California). So the 19 samples can be clustered into 4 groups distinctively. Furthermore, Dataset 4 (Georgia_May) contains 2 control samples, 2 PUT-amended samples and 2 SPDamended samples. Therefore, within the same community, the samples from the same condition should merge first. The reference clustering tree for the 19 samples from the 4 communities is shown in Figure 2, without distance information included. Based on the ktuple frequency vectors and the 16 dissimilarity measures, the dissimilarity between each sample-pair is obtained. With the dissimilarity matrix, hierarchical clustering is implemented with UPGMA. The symmetric differences between the clustering results and reference cluster under various measures, tuple size and order of Markov model are calculated as shown in Table 2.
In Table 2, the optimal symmetric difference score between the reference tree and clustering results is 12. Almost all the dissimilarity measures except S2, Eu and Ch can achieve this optimum score with appropriate choices of tuple size k. However, the range of the values k that yields the optimal results is different. For d 2 the optimal score is obtained for k = 8, 9, 10. For d S 2 , the optimal results are obtained when k is between 6-9 with the order of the Markov chain not significantly affecting the results. For d Ã 2 , the optimal results are obtained for k between 6-10 for all orders of the Markov chain. Thus, the results are robust with respect to the length of k-tuples and the order of Markov chain. We observed that big ranges of settings under d S 2 and d Ã 2 measures and certain settings under Hao yield the optimal value. For d S 2 and d Ã 2 , the optimal clustering trees are all obtained from k §6 for 0-3rd orders of Markov model. The order of background Markov Model has less effect on the clustering results. One of the optimal clustering trees with k = 6 and 0-th order Markov model under d S 2 is shown in Figure 3. Except for the sub-classes of two control samples from the Georgia_May data, all the basic groups of four communities are clustered well. It is also observed that the communities with close geographical locations are clustered first. Because the latitudes of Hawaii and California are very close, they are clustered first, and then the second closest Georgia May channel is merged. The farthest West English Channel joins at last. This clustering order reflects that the communities with similar geographical conditions are more similar with respect to their gene expression levels, which also fit the biological intuition.
To evaluate the effect of sequence depth on the performance of the different dissimilarity measures, we randomly sample 10%, 1% and 0.1% of the original reads from the 19 metatranscriptomic datasets. The read numbers are shown in Table S1 in Supplement S1. For 0.1% sampling rate, the minimum read number of the samples is only 86. We repeat the sampling experiments 100 times. The average symmetric difference scores between the clustering and the reference cluster with different tuple size k and dissimilarity measures are shown in Figure 4 and the detail scores   Tables S2, S3 and S4 in Supplement S1. For 10% sampling rate, the optimal score is 12,the same as that for the complete dataset. It shows that even under one-tenth sequencing depth, the d S 2 can still obtain the same satisfactory results as with the complete dataset. The performance of d 2 and d Ã 2 deteriorates compared with their performance with complete data, which indicates that their performances are significantly affected by the sequencing depth. However, for d Ã 2 , the clustering results are affected by the order of Markov model. The scores close to optimal can only be obtained under the 2 nd order and 3 rd order Markov model. For 1% sampling rate, the optimal average score is increased to about 14, and all measures cannot achieve the results as good as that using the complete data. When sampling rate drops to 0.1%, the d S 2 can still cluster the four basic communities correctly, but cannot distinguish the subclasses in the Georgia May community well, and the resulting clustering tree is shown in Figure 5. The performance of Hao deteriorates significantly when the tuple size increases. This trend is more obvious with the decrease of sampling rate, that is, when the sequencing depth is low. It might be caused by Hao's attributions of the high number of parameters that need to be estimated to fit a Markov model of order k22. Ma can achieve pretty good symmetric difference scores when the size k are between 8 to 10 at all sampling rates, which may attribute to its summing up the difference between two communities for all the 4 k k-tuples, which can reduce the bias from inefficient coverage when the sequencing depth is low. Other measures, such as Ch, Eu, and S2, do not perform well in all cases. It can be seen from Figure 4 that d S 2 outperformed all the other measures under most situations.
For the outstanding performance of d S 2 in all sequencing depth, d S 2 is applied to cluster all the 92 marine metatranscriptomic datasets from 12 projects with k = 6, and the simplest 0-th order Markov model. The clustering result is shown in Figure 6. Most samples from the same community are clustered together, except the Dataset 10 SWGE, which were collected from different locations of Equatorial North Atlantic Ocean and South Pacific Subtropical Gyre. Therefore, it is reasonable to see their samples clustered with different communities. Moreover, Dataset 9 (Hawaii_depth) and Dataset 12 (NPSG) were actually collected from the same location, ALOHA site, but using different preprocessing procedures before sequencing. The two datasets are merged first according to four different collecting depths, and then clustered together. This result validates the effectiveness and accuracy of d S 2 in samples clustering. It also reflects that the d S 2 is robust to different sequencing platforms for dissimilarity measurements. We also observe that the clustering have obvious geographical tendency. The samples collected in Hawaii or close to Hawaii are clustered first. The samples from Norway and West English Channel, which are north latitudes in geographical location, are closely clustered. Most control samples and amended samples collected from the same locations are merged first respectively. For example, Mexico E1 and E2; Mexico C1 and C2 have a very clear hierarchical clustering order, and also the control samples and amended samples from Hawaii_Aug_DSM community. For Georgia community, except for the control samples, the SPD and PUT amended samples are clustered first respectively. For EasternEquaAtlan_Amazon community, the samples are merged into two main clusters which are very close to each other, which can be explained by the fact that the samples were collected during an oceanographic research cruise across the Amazon River plume to the eastern Equatorial Atlantic Ocean. The gene expression levels in microbial communities may be controlled by a gradient such as ocean depth, temperature, and pH levels. In order to see the performance of the different dissimilarity measures in recovering the gradient relationships of the microbial communities, we used eight metatranscriptomic samples from 25 m, 75 m, 125 m and 500 m depth (two samples for each depth) of North Pacific Subtropical Gyre (NPSG) in ALOHA stations (datasets 12 on Table 1), which were sequenced with the pyrosequencing 454 platform. Except for the collection depth, other factors were the same for the eight communities.
Based on the k-tuple frequency vectors, the 16 dissimilarity measures are applied to calculate the dissimilarity between any pair of the eight samples. PCoA is then applied to assign for each sample a location in a low-dimensional space based on the dissimilarity matrix. The GOF value is the proportion of variance explained by the first principal coordinate and indicates the reliability of using the first principal coordinate to represent the data. Since the major difference of the eight samples of interest is collection depth, we expect that the principal coordinate can explain most of the differences of the samples and thus the GOF value is relatively high. The GOF of the first principal coordinates under each measure was shown in Table 3. From the table, we can see that, except for S2 and Ma, the GOF by the first principal coordinate under most measures is higher than 0.70, indicating that the first principal coordinate can represent the data reasonably well. For a good dissimilarity measure, we expect that the first principal coordinate is highly associated with the  Table 4. The highest SRCC 0.9759 is obtained under the d S 2 measure when k = 10 and 2 nd order Markov model indicating very high correlation between the first principal coordinate and the collection depth. Note that the corresponding GOF by the first principal coordinate is 0.77 indicating that the first principal coordinate represents the sample data well.
The two-dimension PCoA ordinates plot and the corresponding clustering results based on the dissimilarity matrix using the d S measure. This tendency is consistent with the observation in Experiment 1. For almost all other measures, the highest SRCC is 0.78, which means these measures can identify the gradient variance to some extent. For d 2 , the performance is good when k is at least 8. The performance of Hao is reasonably good for k between 3 and 9, but deteriorates rapidly when k = 10. The relative performance of Hao with respect to tuple size k is consistent with that in Experiment 1. Similar to the results in Experiment 1, the performance of Eu and Ch is poor, while the performance of Ma is reasonable in recovering the gradient relationship between samples.
To see the effect of sequencing depth on the performance of the various dissimilarity measures in recovering gradient relationships of the microbial communities, we sample the eight metatranscriptomic datasets from four depths with 10%, 1% and 0.1% rates. The read numbers are shown in Table S5 in Supplement S1. At 0.1% sampling rate, the minimum read number of the samples is only 43. For each sampling rate, the random sampling is repeated 100 times, and the average GOF values by the first principal coordinate at each sampling rate are shown in Table S6, S7, and S8 in Supplement S1. From Table S6, except for the dissimilarity measures S2 and Ma and for large tuple size of k = 10, the GOF values are all above 0.5. The average SRCCs are shown in Table S9 in Supplement S1. For d S 2 , with 74% GOF, the optimal SRCC is 0.98, the same as that with complete data, which means d S 2 still maintains good performance using 10% of the reads. The other dissimilarity measures also yield similar performance using 10% of the data as with complete data, but do not perform better than d S 2 . At 1% and 0.1% sampling rates, most GOF values are much smaller than that obtained with the complete data. With the increase of tuple size and the order of Markov model, the GOF values decrease dramatically. So the first principal coordinate does not explain the differences among the communities well. Thus, the SRCC analysis between the principal coordinate and the collection depth is not highly meaningful.  applied to measure the beta diversity between these datasets. The clustering results based on all the measures show obvious tendency that the metagenomic and metatranscriptomic datasets are clustered into two separate groups. However, the measures Ch, Eu and Hao for large tuple size k cannot completely distinguish the two types of data with some metatranscriptomic and metagenomic datasets mixed together, which are shown in the Figure S1 in Supplement S1. The observation of poor performance of Hao for large tuple size validates the analysis in Experiments 1 and 2. Except for k = 2 with 1 st order Markov model for d S 2 and d Ã 2 , and small tuple size for d 2 and Ma, all the other settings under d 2 , d S 2 , d Ã 2 and Ma can distinguish the metagenomic and metatranscriptomic datasets indicating metagenomic and metatranscriptomic data have different sequence signature information. However, the hierarchical clustering of sub-classes within either the metagenomic or the metatranscriptomic samples shows the performance differences of these measures. As shown in Figure 8   and metagenomic datasets, but show performance difference in subclass details.
The eight metatranscriptomic and eight metagenomic datasets from four depths are also randomly sampled with 10%, 1% and 0.1% rates to evaluate the performance of the different dissimilarity measures when the sequencing depth is low. The read numbers are shown in Table S10 in Supplement S1. For 0.1% sampling rate, the minimum read number of the samples is only 43. The sampling rate affects the sub-classes within the metagenomic and metatranscriptomic datasets. However, the basis grouping of metagenomic and metatranscriptomic data can still be well distinguished even under the 0.1% sampling rate. With 1% sampling rate, d S 2 still yields the same satisfactory results as that with complete data when k = 6, as shown on figure 8(C). While for Hao, the performance drops dramatically with low sequencing depth, even the tuple size is medium (k = 6), shown on figure 8(D). When the sampling rate is 0.1%, even the d S 2 cannot achieve as good results as using the complete data. The clustering results are also sensitive to tuple size, as shown on figure 8(EF). For d S 2 , when k = 4, the clustering in metatranscriptomic dataset are still correct, while when the tuple size increases from 4 to 6, the clustering results become worse.
We also analyze eight metagenomic and eight metatranscriptomic samples from the Western English Channel sequenced with pyrosequencing 454 platform [9] (Dataset11 in table 1). We investigate how well the dissimilarity measures cluster these samples. Similarly, the metagenomic and metatranscriptomic datasets are clearly clustered into two groups further validating that the sequence signatures are different for metagenomic and metatranscriptomic datasets. As shown in Figure 9, the clustering of metatranscriptomic data from marine presents a clear pattern, where 4 am of Aug.(dawning), 4 pm of April(dusk), 3 pm of Jan.(dusk) and 10 pm of Aug.(dusk); 10 am of Aug. and 4 pm of Aug.(both daytime) are clustered together, while 10 pm of April and 7 pm of Jan. (both dark) are cluster together. This means that photosynthesis might play an important role for the gene expression of microbes. Therefore, d S 2 can reveal the differential expression pattern among the metatranscriptomic datasets. At the same time, the metagenomic data from the same location are clustered with a season pattern, where the data collected in Aug. are together and those from Jan. and April are together.  Seven metatranscriptomic datasets collected from cecum and colon tissues of four mice were sequenced with the Ilumina Genome Analyzer IIx (GaIIx) platform and they were downloaded from NCBI SRA. The description can be found in Dataset13 of Table 1.
According to the two tissues the datasets coming from, the seven samples can be clustered into two distinct groups, as shown in Figure 10, without distance information included. Based on the ktuple frequency vectors and the 16 dissimilarity measures, the dissimilarity between each sample-pair is obtained. With the dissimilarity matrix, hierarchical clustering is implemented with UPGMA. The symmetric differences between the clustering results and reference cluster under various measures, tuple size and order of Markov models are calculated as shown in Table 5.
In Table 5, the optimal symmetric difference score between the reference tree and clustering results is 5. d S 2 , S2, Hao and and Ma can achieve this optimum score with appropriate choices of tuple size k. However, the range of the values k that yields the optimal results is different. For d S 2 , the optimal score is obtained for k = 4 to 10 for the 0-th order Markov chain and for relatively narrow ranges of k with higher orders of Markov chain. For S2, the optimal results are obtained when k is between 6-8 with the order of the Markov chain playing no roles. For Hao and Ma, the optimal results are obtained for k between 6-10. One of the optimal clustering trees with k = 4 and 0-th order Markov model under d S 2 is shown in Figure 11. The four samples from cecum are clustered together, and two of the colon samples are clustered together, while the remaining colon sample is finally merged. The results show that the sequence signature measures are valid for sequencing data from the Illumina platform and d S 2 still keeps the outstanding performance for all orders of Markov chains.
To evaluate the effect of sequence depth on the measures' performance with Illumina datasets, we randomly sample 1%, 0.1% and 0.01% of the original reads from the seven metatranscriptomic datasets. The reason we sampled with 10% sampling rate for 454, but not for Illumina, is due to the high sequencing depth of Illumina data. The read numbers are shown in Table S11 in Supplement S1. For 0.01% sampling rate, the minimum read number of the samples is only 115. We repeat the sampling experiments 100 times. The average symmetric difference scores between the clustering and the reference cluster with different tuple size k and dissimilarity measures are shown in Figure 12 and the detail scores are given in Tables S12, S13 and S14 in Supplement S1. For 1% and 0.1% sampling rate, the d S 2 can still obtain the same optimal results under the same range of k value as the complete dataset. However, for d Ã 2 and Hao, althougth they can still obtain the optimal results, the range of optimal k value become smaller compared with the complete datasets. The results show that compared with other measures, d S 2 is robust to sequencing depth. The performance of Ma deteriorates slightly with the decrease of sampling rate, which can obtain the optimal clustering results with slightly narrower range of tuple size k. However, the performances of other measures, including Hao, Ma and S2, become worse when coverage decreases, and except the Ma, other measures cannot obtain the optimal clustering results, as shown on Table S14 in Supplement S1. In this study, d S 2 shows its outstanding performance for datasets from the Illumina sequencing platform.

Experiment 5: The Effects of Sequencing Errors on the Performance of Different Dissimilarity Measures
The 19 metatranscriptomic samples of 4 communities having distinct geographic locations used in Experiment 1 are used to study the effect of sequencing errors on the performance of sequencing signature measures. According to sequencing error features of the 454 platform [37][38][39], we added 1% indel errors and 0.1% substitution errors to the original sequencing data. The simulated reads set with the preset error rates are generated with a pyrosequencing 454 simulator, FlowSim [37]. The same clustering procedures as in Experiment 1 are implemented. The symmetric differences and the comparison with the results based on the original data are shown in Table 6. The clustering results do not change much under 1% indel and 0.1% substitution error rates.  values are changed for each measure compared with the results based on the original data. Ma keeps the same symmetric differences as the original data. Many more of the symmetric differences, 3-7 among 9 values, for d 2 , S2, Hao, Eu and Ch, are changed.
If a measure can obtain optimal clustering results based on the original data, it keeps the performance on imported-error data. If the measure cannot obtain the optimal results with the original data, the measure will not achieve the best performance either with added errors. Therefore, the relative performances of the

Discussion and Conclusions
In this study, we evaluated the performance of k-tuple based sequence signature methods to compare microbial metatranscriptomic samples with NGS reads data. We studied three d 2 Àtype dissimilarity measures, one dissimilarity measure in CVTree, one relative entropy based measure and three classical l p Ànorm distances. These dissimilarity measures, together with tuple size k = 2-10 and different Markov background models, were compared on the basis of five experiments of real metatranscriptomic datasets from global marine communities, with the objectives to explore their performance on clustering metatranscriptomic sequencing data from different communities generated by pryosequencing 454 and Illumina platforms, identifying gradient variance of metatranscriptomic datasets, clustering characteristics when metagenomic and metatranscriptomic datasets co-exists and robustness under sequencing errors.
For geographically well separated communities, all the measures can classify the big groups correctly. With the complete data, for certain range of tuple size k, d 2 , d S 2 , d Ã 2 , Hao and Ma can classify the subgroups and obtain the closest clustering results from the reference cluster. When sequencing depth is low, only d S 2 still keep the outstanding performance and other measures are more sensitive to sequencing depth. Even for the 92 samples from 12 communities, most measures can cluster major groups correctly and d S 2 can merge the communities according to similar geographical locations. The k-tuple dissimilarity measures can reflect the gradient tendency, and d S 2 can obtain the highest correlation coefficient between the first principal coordinate and  the gradient. When metagenomic and metatranscriptomic datasets co-exist, all the measures can cluster the samples according to the data type. We also evaluate the effects of tuple size and background model. In this study, tuple size k = 2-10, and the performance of different dissimilarity measures varies with different tuple size. The order of background Markov model does not affect the performance significantly.
Our results indicate that d S 2 performs satisfactorily for grouping microbial communities, identifying their gradient relationships and separating metagenomic and metatranscriptomic communities. The d S 2 dissimilarity measure performs similarly in some scenarios or outperforms other dissimilarity measures in many other scenarios and its performance is not highly sensitive to tuple size, which makes it easier to apply to real data. It is a powerful approach for metatranscriptomic sample comparison based on NGS shotgun reads. For d 2 , relationship between the sequences in both samples plays less effects than the variation of the tuple occurrences within one sample, which lead to its relative poor performance. Hao's attributions of the high number of parameters that need to be estimated to fit a Markov model of order k22 leads to the poor performance under low sequencing depth. Ch considers the maximum difference between the tuple frequencies for the samples only and does not make full use of the information from all the tuples. On the other hand, Ma sums up the difference between two communities for all the 4 k k-tuples, which can reduce the bias from low coverage when sequencing depth is low. The normalization of the tuple counts by their corresponding expectations plays an important role in the superior performance of d S 2 and d Ã 2 . The performance of different dissimilarity measures varies with the tuple size. We show that d S 2 and d Ã 2 can achieve reasonable clustering results for metatranscriptomic datasets. In particular, the d S 2 dissimilarity measure outperforms others in most scenarios and its performance is not highly sensitive to the tuple size. Thus, it is a powerful approach for metatranscriptomic sample comparison based on NGS shotgun reads. The dissimilarity measure d 2 performs reasonably well when the tuple size is relatively high. However, it does not perform well when the tuple size is low. These observations are consistent with the results for the comparison of metagenomic datasets. The Hao dissimilarity measure performs reasonably well when sequence depth is high and the tuple size is relatively low. One explanation is that it compares the numbers of occurrences of k-tuples with their corresponding expectations based on the k22 order of Markov chain, which may not be accurate especially when the sequence depth is low and the tuple size is high. Ch considers the maximum difference between the tuple frequencies of the samples only and does not make full use of the information from all the tuples. On the other hand, Ma sums up the differences of all the k-tuple frequencies between two communities, which can reduce the bias from inefficient coverage when the sequencing depth is low. The normalization of the tuple counts by their expectations plays an important role in the superior performance of d S 2 and d Ã 2 . The study design in this paper is similar to that in Jiang et al. [25]. The objective of this study is to see whether the conclusions about the relative performance of alignment-free methods for metagenomic comparison observed in Jiang et al. [25]are also true for the comparison of metatranscriptomes. This conclusion is not obvious due to the different characteristics of metatranscriptomic from metagenomic data. Previous study of the effectiveness of alignment-free approaches on metagenomic datasets is built on the theoretical basis [14] that k-tuple frequencies are similar across different regions of the same genome, but differ between genomes. However, in metatranscriptomic data, the genes within a genome can have different expression levels and the intron and intergenetic region sequences are removed, while in metagenomic data, all the genomic regions are the same. At the same time, RNA the different characteristics from DNA, such as degradation, stability, easiness to be broken and alternative splicing, etc., which bring the different preferences and bias distributions to the sequencing process. Therefore, under the situation that: the expression abundance information is imported, the sequences of intron and inter-genic regions are taken out, and different sequencing preference and bias are introduced, whether the alignment-free approaches are valid is a critical question for their further applications to the metatranscriptomic datasets.
Similar conclusions about the relative performance of the different dissimilarity measures on clustering microbial metagenomic and/or metatranscriptomic communities were obtained. Our conclusion about the applicability of alignment-free statistics, in particular d S 2 , for the comparison of metatranscriptomic samples is of biological significance. In addition, we showed that metagenomic and metatranscriptomic samples can be separated by using alignment-free statistics as shown in Figure 9, which cannot be achieved by studying metagenomic or metatranscriptomic dataset alone. Since the sequencing data, sequencing platform, and the microbial communities from the two studies were widely different, we believe that d S 2 can be used to effectively cluster both metagenomic and metatranscriptomic communities. Therefore, k-tuple based sequence signature methods such as d S 2 and d Ã 2 are simple and computationally efficient for the comparison of metatranscriptomic data, and they provide symmetric difference scores as a function of tuple size k for different dissimilarity measures based on 100 random samplings of 1%, 0.1% and 0.01% sampling rates, respectively. The lower the score is, the closer the clustering results and reference tree is. It is clear that d 2 s shows best performance under most of the conditions. doi:10.1371/journal.pone.0084348.g012 attractive alternative approaches for microbial community comparison.

Supporting Information
Supplement S1 Figure S1. The clustering results of both metagenomic and metatranscriptomic datasets in Experiment 3 based on the dissimilarity measures Ch, Hao and Eu. Table S1. Sampling reads number for Experiment 1. Table S2. The symmetric difference between the reference and clustering trees for Experiment 1 (10% sampling).