Biogeochemical Typing of Paddy Field by a Data-Driven Approach Revealing Sub-Systems within a Complex Environment - A Pipeline to Filtrate, Organize and Frame Massive Dataset from Multi-Omics Analyses

We propose the technique of biogeochemical typing (BGC typing) as a novel methodology to set forth the sub-systems of organismal communities associated to the correlated chemical profiles working within a larger complex environment. Given the intricate characteristic of both organismal and chemical consortia inherent to the nature, many environmental studies employ the holistic approach of multi-omics analyses undermining as much information as possible. Due to the massive amount of data produced applying multi-omics analyses, the results are hard to visualize and to process. The BGC typing analysis is a pipeline built using integrative statistical analysis that can treat such huge datasets filtering, organizing and framing the information based on the strength of the various mutual trends of the organismal and chemical fluctuations occurring simultaneously in the environment. To test our technique of BGC typing, we choose a rich environment abounding in chemical nutrients and organismal diversity: the surficial freshwater from Japanese paddy fields and surrounding waters. To identify the community consortia profile we employed metagenomics as high throughput sequencing (HTS) for the fragments amplified from Archaea rRNA, universal 16S rRNA and 18S rRNA; to assess the elemental content we employed ionomics by inductively coupled plasma optical emission spectroscopy (ICP-OES); and for the organic chemical profile, metabolomics employing both Fourier transformed infrared (FT-IR) spectroscopy and proton nuclear magnetic resonance (1H-NMR) all these analyses comprised our multi-omics dataset. The similar trends between the community consortia against the chemical profiles were connected through correlation. The result was then filtered, organized and framed according to correlation strengths and peculiarities. The output gave us four BGC types displaying uniqueness in community and chemical distribution, diversity and richness. We conclude therefore that the BGC typing is a successful technique for elucidating the sub-systems of organismal communities with associated chemical profiles in complex ecosystems.


Introduction
Unravelling trends that rule complex aquatic environments is a puzzling task due to the myriad of possibilities of interactions presented between and within the hosted organismal consortia with organic and inorganic compounds.
As explaining the totality of the interactions is a goal hard to achieve, if not plainly impossible considering the never ending development on science, therefore, here we intend to frame subsystems co-existing within a larger system using a data-driven approach [1].To comprehend such interactions we gave rise to the biogeochemical typing (BGC typing), a flexible tool to bring forth and individualize a subset of structures underlying in the studied environment based on the correlation between the community and chemical profiles analysed.
The BGC typing analysis is a pipeline built using integrative statistical analysis and can treat massive datasets as used here produced by multi-omics analysis [2] which would otherwise be hard to visualize [3] and process [4].It filters, organizes and frames the data based on the strength of the mutual trends working within the environment.
The multi-omics analyses here was composed by metagenomics which gave the community consortia profile, ionomics showing the elemental content a metabolomics for the organic chemical profile.Here, we regard metagenomics as applying solely to the characterization of small-subunit ribosomal RNA.Therefore, the multi-omics analysis provided who is there and what is there as explained as following.
In this study we researched on the aquatic environment of three distinct paddy fields and surrounding water located in Saitama Prefecture, Japan.The paddy field is the source of one of the most important staple foods in the world and a rich environment comparable to a natural wetland: more than merely the ability to sustain crops, it harbours an intricate net of life, and it is able to support even higher-trophic level organisms such as fish [5].
In a complex environment, one can find thousands of different organisms thriving.To answer who is there, we performed the metagenomics to identify the organismal consortia.The identification was expressed as operational taxonomic units (OTUs) [6] retrieved by high throughput sequencing (HTS) the polymerase chain reaction (PCR) products of Archaea-specific and universal 16S (Archaeal genes excluded) and universal 18S small-subunit ribosomal RNA primers.
To assess what is there we joined the pieces of information from ionomics and metabolomics.
The ionomics is the elemental analysis evaluating its variation over a set of samples in an approach as the one applied to plant assay [7].The ionomic analysis was assessed by the use of inductively coupled plasma optical emission spectroscopy (ICP-OES).
For the metabolomics we used two techniques, the attenuated total reflectance Fourier transformed infrared (FT-IR) and the proton nuclear magnetic resonance ( 1 H-NMR).
The FT-IR is a technique easy to be employed by request little preparation to the sample and give us information about its organic chemical profile regarding the rotational-vibrational frequency from the chemical bonds present in the molecules being a useful tool in metabolomics [8].
The 1 H-NMR has been proved for long to be also a powerful tool in metabolomics [9,10], assessing the information related to the structure from the molecules in our sample that contain hydrogen as the large majority of organic compounds.
Such multi-omics dataset was split in two groups of data: one derived from metagenomics aggregating the OTUs from Archaea, 16S rRNA and 18S rRNA forming our organismal community matrix (matrix community) and the second group of data formed by the ionomic information acquired by ICP-OES joint to the metabolomic information represented by the integration of FT-IR and 1 H-NMR spectra, thus being fused to one matrix of chemical profile (matrix chemicals).
As in our method we are not able to differentiate whether the cells were dead or alive in the exact time of sampling and some organisms may feed on dead cells it was assumed that all matter including the cells constituents took part of the environmental condition, therefore, regarded on the chemical profile.
The integrated statistical analysis is the tool to connect and frame the various trends acting underneath the broader complexity of the totality of the environment.Our group has being developing statistical tools to grasp the explainable features present on diverse environments [11,12].
Here, we filter, organize and frame the data applying the pipeline of the BGC typing to expose the links amongst the organismal community and the chemical profile.It optimizes the set of information retrieved by filtering the data according to the strength of the correlation and individualizes sub-systems of organismal consortia along its chemical features framing our BGC types.Each BGC type thus comprises a small universe statistically isolated working within the environment, helping the understanding of the whole system.
The BGC typing pipeline is based on integrative statistical analysis: namely, Spearman correlation [13,14], least-squares structuring [15] and k-means clustering [16,17].The set formed by the groups of organisms and the chemical profiles associated by this pipeline we call the BGC types which are meaningful a priori only in the specific study, nevertheless we expect to find similar BGC types spread on similar environments under similar conditions and analyses.The BGC typing then would improve and develop itself as more and more studies are done following this approach.
The description of four singular BGC types found in this study shows that we successfully established the technique of BGC typing as a tool to characterize sub-systems composed by the community distributions and structures associated to chemical profiles on a complex environment such the Japanese paddy field and surrounding waters (Fig. 1).

Sampling
We designed the sampling method to encompass what we regarded as three unities of sampling sites which comprised three samples from the water of a chosen paddy field plus its collector stream.We added two sampling points from the river that boundaries the paddy fields, the Ara river -one sampling was taken from the river right before it meets the paddy field area and another right after such paddy area ends.
The sampling site lies on the plains of the Hiki District of Saitama Prefecture (Japan) over a large agricultural area following the course of the Ara River for approximately 13 km.Samples were collected on August 23, 2011, a few weeks prior to harvest; the paddy fields had been flooded all summer to raise the crop (rice).Using sterile tubes, four 50 mL aliquots of water were collected per sample from each sampling point.The points were located in three individual paddy fields, their respective collector streams, and the Ara River itself, for a total of 14 sampling points (Fig. 2).Specifically, there were two samples from the Ara River, one upstream of the paddy field areas and the other downstream of the paddy fields (ara1 -36u29320N 139u30980E; ara2 -35u569540N 139u329410E); three samples from different points of paddy field 1 (p1f1-p1f3 -36u29230N 139u299510E) and its collector stream (p1s -36u29250N 139u299460E); three samples from different points of paddy field 2 (p2f1-p2f3 -35u599370N 139u30950E) and its collector stream (p2s -35u599370N 139u30950E); three samples from different points of paddy field 3 (p3f1-p3f3 -35u589290N 139u309330E) and its collector stream (p3s -35u589290N 139u309320E).The samples were stored at 4uC in a cooler box and returned to the laboratory immediately, where they were stored at -80uC.Before each analysis, a 60 h freezedrying pre-processing step was performed.All freeze-dried samples were weighed.One 50-mL aliquot was used for DNA extraction and community analysis, another aliquot for ICP-OES, another aliquot for FT-IR, and another one for 1 H-NMR.

HTS of the PCR products from the ribosomal RNA gene from environmental samples
High throughput sequencing is a powerful tool to identify the constituent organisms in environmental studies [18].
The DNA was extracted from the freeze-dried samples by using the Power Soil DNA extraction kit (MoBio, CA, USA); the concentration of extracted nucleic acids was measured using a CLUBIO Micro Spectrophotometer.
We amplified by PCR the small sub-unit ribosomal RNA sequences from the extracted DNA.Fragments from 16S rRNA, 18S rRNA, and Archaeal rRNA were separately amplified.The hipervariable regions from V1 to V3 for the 16S rRNA were amplified using modified Ba27F (59-AGAGTTT-GATCCTGGCTCAG-39) as the forward primer [19] and PRUN518 (59-ATTACCGCGGCTGCTGG-39) as the reverse primer [20].The hipervariable regions from V1 to V3 for the 18S rRNA were amplified using Euk1A (59-CTGGTTGATCCTGC-CAG-39) as the forward primer and Euk516R (59-ACGGGGG-GACCAGACTTGCCCTCC-39) as the reverse primer [21].The hipervariable regions from V4 to V6 for the Archaeal rRNA were amplified using the 16S Archaea-specific rRNA pair of S-D-Arch-0519-a-S-15 (59-CAGCMGCCGCGGTAA-39) as the forward primer and S-D-Arch-1041-a-A-18 (59-GGCCATG-CACCWCCTCTC-39) as the reverse primer [22].The PCR products were subjected to agarose gel electrophoresis.Correctly sized fragments were retrieved from the gel and DNA was extracted using the Wizard SV Gel and PCR Clean-Up System (Promega, WI, USA).The final DNA concentration was measured using Invitrogen Quant-iT PicoGreen sDNA Reagent and Kits (Invitrogen, CA, USA).Correct dilutions were performed using Milli-Q water.The sequencing library for HTS was prepared using the GS Junior Titanium emPCR kit (Lib-L) (454 Life Sciences, CT, USA) by following the provided protocol.The library was read by a GS Junior sequencer following standard operating procedures.The obtained reads from each GS Junior run were treated using QIIME software [23].We followed the ''454 Overview Tutorial: de novo OTU picking and diversity analyses using 454 data'' (http://qiime.org/tutorials/tutorial.html) using default settings, with the following exceptions: de novo chimera detection and Trie pre-filtering in the OTU picking step [24]; uclust_ref as the clustering method [25]; SILVA 108 of the SILVA rRNA database as a sequence reference [26].The aligned sequences were assigned against the SILVA rRNA database.OTUs represented by a single read over all sampling points were filtered out to decrease computational demand, since our correlation method used would not be able to show any trend for a sequence read only once.OTUs generated by one set of primers (e.g., Archaea) that were aligned to other domains (e.g., Eukaryota) were also filtered out to prevent overrepresentation of organisms.To compare quantified information of each OTU amongst the sampling sites, the number of reads for each OTU was divided by the total number of reads for its sampling point in order to find the relative abundance for each OTU.This step was separately performed to the three domains studied.The resulting tables of OTUs against sampling points for Archaea rRNA, 16S rRNA, and 18S rRNA were fused to one matrix (matrix community).Although the aim of this study is not building an ultimate phylogenetic tree, neither this dataset allows such task, an emulation of a phylogenetic tree was built in QIIME [23,27] and exported to R to plot a more visually informative tree [28].Observing the resultant tree with mixed Bacteria and Archaea domains, we opted to reassign the 16S rRNA and Archaeal OTUs according to data from the Ribosomal Database Project (RDP) [29].The trees were built once again after the BGC typing for comparison and the 16S rRNA, 18S rRNA, and Archaeal OTUs appeared fairly distinguished.

Inductively Coupled Plasma Optical Emission Spectrometry (ICP-OES)
For the preparation to the analysis, we suspended in Milli-Q water each freeze-dried sample to recreate the same 50 mL initial volume.Three dilutions were prepared-1:1, 1:10, and 1:100-for 6-mL aliquots in 15-mL plastic tubes.For the dilutions, elemental analysis was performed using SII model SPS 5510 CCD simultaneous ICP-OES (SII NanoTechnology Inc., Chiba, Japan) equipped with an SPS-3 auto-sampler (SII NanoTechnology Inc.) and using ICP Expert software (SII NanoTechnology Inc.).We used the Multi-Element Calibration Standards 3, 4, and 5 acquired from PerkinElmer (PerkinElmer Japan Co., Ltd., Yokohama, Japan) to calibrate the machine.A concentration of 1 mg L 21 of each standard dilution was used for this step.We quantified 27 chemical elements: Al, B, Ba, Be, Ca, Cd, Cr, Cs, Cu, Fe, Hg, K, Li, Mg, Mn, Na, Ni, P, Pb, Rb, S, Sb, Se, Si, Sn, Sr and Zn.Three emission wavelengths of each element were chosen to satisfy both the achievement of maximum intensities and the elimination or minimization of the interference effect for discrimination of each element in the samples.The ICP-OES operating conditions were as follows: power 1.2 kW, plasma gas flow 15 L min 21 , auxiliary gas flow 1.5 L min 21 , nebulizer gas flow 0.75 L min 21 , and peristaltic pump speed 15 rpm.From the obtained data, a matrix was built using the concentration in ppm The rectangle indicates where the samples were taken.Magnified area is schematic map for the sampling site.At right side, schematic figure: Ara River was sampled in two points (ara1 and ara2) as well as paddy fields located within the area between these two river sampling points.Three independent paddy fields (paddy field 1, 2, 3) were selected and three samples were taken from each paddy field (p1f1-3, p2f1-3 and p3f1-3).These paddy fields were connected with Ara River through independent collector streams which were also sampled (p1s, p2s, p3s), respectively for each paddy field.for each element against the sampling points from the average result of the optimal dilution with the optimal wavelengths.

Attenuated Total Reflectance Fourier Transformed Infrared (FT-IR) Spectroscopy
The freeze-dried samples were pressed directly on the crystal of Nicolet 6700 FT-IR spectrometer using the ATR smart iTR accessory with a high-pressure clamp (Thermo Scientific) to measure the absorbance from 4,000 to 650 cm 21 at a resolution of 8 cm 21 .The peaks were annotated according to the absorbance wavelength.Distinguishable peaks consisting of regions with overlapping chemical bond signals were annotated by more than one chemical bond (i.e., as many as needed).The region of interest (ROI) was integrated for each assigned peak using an interval with no observed overlap.From 1,200 to 849 cm 21 the signals for the chemical bonds were permitted to overlap, such that the peaks were assigned to more than one chemical bond candidate.The region over the interval from 847 to 650 cm 21 was not used for annotation due to visually present but indistinguishable peaks.The spectra were integrated using Thermo OMIC software (USA).A matrix was built by assigning the sum of the integration values to a distinct part for each assigned peak as its relative amount.
All spectra were recorded at 298 K on a Bruker DRU-700 spectrometer (Germany) equipped with a 1 H inverse cryogenic probe with triple-axis gradients operating at 700.15 MHz.
The 1 H-NMR spectra were recorded at 32,768 points over 256 scans using the Watergate pulse sequence [30].The J-resolved spectra were recorded in 32 scans per f1 increment with a total of 32 complex f1 and 16,384 complex f2 points.
The spectra were manually phased and calibrated in the Bruker Top Spin program.Integration of spectra was performed in advanced bucketing mode in Bruker AMIX 3.5 software on manually picked peaks using bucket widths equal to 0.02 ppm to find the integration value for each peak.Two broad peaks from 3.48 to 0.78 ppm were integrated by the sum of the integration for each bucket with no visible sharp peak for the region.
The peaks of the J-resolved spectra were assigned according to the Birmingham Metabolite Library [31].The peaks for 1 H-NMR spectra were assigned with the help of SpinAssign [32,33,34].Both assigned tables were transposed to the 1 H-NMR bucketed table (table for the integrated peaks) based on the chemical shifts.Broad peaks were assigned to proteins, whose presence was supported by positive Bradford protein assay results.1D-STOCSY also was performed to the annotated bins correlated to the BGC types [35].A column for the broad peak was added to the 1 H-NMR assigned table to complete the 1 H-NMR matrix.

HTS
HTS generated 100,641 sequences for the small-subunit Archaea rRNA primers, 302,974 sequences for the small-subunit 16S rRNA general primers, and 314,632 sequences for the smallsubunit 18S rRNA general primers.The sequences repeated were collapsed and then assigned with similarity $97%, generating 5,688 OTUs for Archaea rRNA, 24,633 OTUs for 16S rRNA (no Archaeal OTU was found among universal 16S rRNA amplicons), and 9,355 OTUs for 18S rRNA.The rarefaction analysis showed a good coverage for the organismal community (Table S1, S2, S3, Fig. S1, S2, S3).
After filtering out the reads to OTUs appearing only once over the 14 sampling points, the respective totals for retrieved OTUs were 4,019, 14,942, and 6,391.Performing BGC typing as correlation filtering process, we retrieved 854 OTUs for Archaea rRNA, 1,743 OTUs for 16S rRNA and 815 OTUs for 18S rRNA.For each BGC type, we used QIIME to build an emulation of a phylogenetic tree and software R to plot and compare how the sequences were separated amongst the products retrieved from Archaea rRNA, 16S rRNA and 18S rRNA.The result displayed a fairly distinguished distribution of the OTUs (Fig. S4, S5, S6, S7).

ICP-OES
We tested for 27 elements and detected 15 in our samples.Aluminium was detected in all samples with a lower concentration in paddy field waters than in the river.It is known that pH affects the concentration of elements; however, no clear trend between elemental concentration and pH was observed in the current study (Table S4).

FT-IR
We retrieved 10 distinguishable peaks and annotated accordingly [36].The peaks annotated to NH 2 and N-C = O follow the same pattern, suggesting that they are intrinsically linked as moieties of amino acids (Fig. 3).These peaks displayed the highest intensities at the two sampling points from the Ara River (ara1; ara2) and the one from the collector stream for paddy field 3 (p3s).
Also, we annotated on the spectra peaks as C-H and as O-H, suggesting organic energy available since molecules with aliphatic and alcohol bonds are prone to be oxidized by many organisms (Table S5).

H-NMR
e integrated 161 individual peaks for 1 H-NMR plus an integration for the two broad peaks, for a total of 162 1 H-NMR variables.The soluble organic compounds detected by this technique exhibited a clear pattern of samples from the Ara River having poorer concentrations (Fig. 3).Assignment of all peaks on the 1 H-NMR spectra was difficult due to low concentrations of extractable compounds and to limited sensitivity insufficient to extend analysis to 1 H- 13 C correlation experiments.However, Jresolved 1 H-NMR analysis allowed us to annotate a total of 60 buckets (Fig. S8).
These annotations constituted molecular residue information from the chemical compounds, amino acids, or organic products present in our samples.A fraction of the peaks correlated to the BGC types was evaluated by one-dimensional statistical total correlation spectroscopy (1D-STOCSY) to verify the degree of support for the annotation [37].The tallest peak in the spectra was assigned to lactate with good support from 1D-STOCSY plot and direct comparison against the spectrum for the pure compound provided on the Bruker AMIX database (Table S6, Fig. S9, S10).
The two broad peaks observed in the spectra had the sum of their approximated area integrated discounting the buckets with sharp peaks.These broad peaks possessed characteristic patterns of proteins [38,39].The Bradford assay [40] was performed on the samples with the largest broad peaks for each paddy field -p1f3, p2f3, and p3f2-with respective results of 18.1 (sd = 1.2), 29.8 (0.4), and 43.0 (0.2) mg mL 21 of protein.Additionally, 1D-STOCSY showed a correlation between protein concentrations against the large broad peaks in the 1 H-NMR spectra, suggesting that soluble protein produced these peaks (Fig. S11).

Biogeochemical typing (BGC Typing)
We constructed two independent matrices arranged by our sampling points, the matrix community and the matrix chemicals.We characterized four different groups composed by the correlated organismal and chemical profiles.We propose this statistical treatment as BGC typing.
Once the dataset is built, in this case using multi-omics approach, we applied the statistical process of BGC typing which can be divided into three main steps: 1) Filtration, 2) Organization and 3) Description.
1) Filtration.The matrices formed by the OTUs retrieved from Archaea rRNA, 16S rRNA and 18S rRNA sequencing were fused in one table as being the matrix community.The matrices from ICP-OES, FT-IR, and 1 H-NMR were fused into a single matrix termed chemicals.Using R software [41], Spearman correlation was performed between the community and chemicals matrices.The resulting correlation matrix was then imported to Microsoft Excel to extract all OTUs with coefficients equal to or higher than |0.70|.
2) Organization.To evaluate the appropriate number of BGC types representing the sub-systems, we used the least square structuring testing up to 15 groups [42,43].The curve inflexion indicated four as the optimal number of BGC types (Fig. 4) [15].
The correlation matrix was then divided within the principal component analysis (PCA) into four BGC types by the k-means clustering method [16] and plotted [44].The BGC types were distributed in a cross-like fashion: two oriented horizontally according to the axis of principal component 1 (PC1) and two oriented vertically according to the axis of principal component 2 (PC2) (Fig. 5).
A table of the OTUs from each BGC type was built.For the community distribution analysis, we plotted the sum of all relative abundances along the sampling points for each BGC type (Fig. 6).To analyse the community structure, the OTUs were collapsed to the class level or beyond according to the next divergence on the taxon presented.The resulting matrices were used to find the percentage of abundance for the groups of organisms in each BGC type.To visualize the differences among chemical profiles presented by the different BGC types, we built a table and plotted (Fig. 7).
A table of the chemical profile for each BGC type was built by using the average correlation index for all the chemical variables for each chemical variable from the BGC type and the loadings for PC1 or PC2 according to the one that better explains the BGC type.In order to facilitate the comparison, the tables were scaled by unity of variance without centring (Fig. 8).
The PC which better explains each BGC type had the loadings compared against the average correlations from each chemical variable using the standard deviation of a population (STDEVP).Chemical variables with STDEVP of 0.20 and over were extracted to build the table of the chemical profile for each BGC type (Table 1).
3) Description.According the structure presented by each BGC type regarding the organismal community and chemical profiles along the sampling points, we analysed the features that individualize them and searched for the literature that could bring sense to the sub-systems exposed.

Organismal community distribution
Organisms of BGC type I had a scattered distribution over the sampling points, with a tendency to be less present in the lotic waters ara1, p2s, p3s and ara2 with exception of p1s.BGC type II presented a scattered distribution over the paddy field and collector streams.BGC types I and II appeared to be minimally present in the Ara River (ara1, ara2).
Organisms in BGC type III were distributed predominantly on the lotic waters and had minimum presence in the lentic waters on the paddy fields.The organisms for BGC type IV were distributed predominantly on the lentic waters from the paddy fields.Conversely, BGC type IV had a minimal distribution on the river and collector streams, the lotic waters.When we summed the three domains for each paddy field data, a gradual increase was observed from paddy 1 to paddy 3, although none of the paddies was directly connected.The organismal community distribution over the sampling points can be visualized in Fig. 6.

Organismal community structures
The retrieved Archaeal community profile was dominated by the phylum Euryarchaeota, mostly composed of the classes Methanomicrobia, Thermoplasmata and Methanobacteria.In addition, Euryarchaeota exhibited the largest abundance in three out of the four BGC types, being the dominant phylum on the paddy floodwaters and the collection streams for BGC types I, II and IV.The mentioned classes tend to be directly involved in methane production [45,46,47].BGC type III, which was predominantly distributed over the lotic waters of the Ara River and the collector streams, had the phylum Crenarchaeota, class Thermoprotei as the most abundant kind of Archaea.Crenarchaeota is suggested to take part in primary production by active involvement in the ammonia-oxidizing process of the nitrogen cycle and in autotrophic carbon assimilation [48,49,50].This phylum has also been suggested to be dominant in fresh-water systems [51], which is in accordance with our findings for the distribution of BGC type III.However, such a pattern was clearly not mirrored in the paddy floodwaters, where the profile resembled that of soil [52] or a suboxic freshwater pond [53].The BGC types distributed over the paddy fields and collector streams encompassed methanogenic candidates along with organisms that thrive in a wide range of aerobic conditions.Some of the associations might be possible by the formation of anoxic microzones in aggregates [54], although Archaea methanogens can survive in aerobic environments [55] (Fig. S12, S13, S14, S15).The community identified by amplifying the 16S rRNA showed the ubiquitous class Actinobacteria to present the largest abundance (,38%) and indeed as one of the four most abundant classes across all BGC types.It prevailed in abundance even over the ,26% of the higher-taxon Proteobacteria phylum.Actinobacteria possess versatile metabolism, exemplified by their ability to decompose lignocellulose [56,57] or to produce bioactive metabolites as antibiotics.The high abundance of Actinobacteria in BGC types II and IV in particular mirrored studies of both soil and aquatic ecosystems where this phylum often appears as an important component of the community alongside Proteobacteria [58,59,60].Besides Actinobacteria, BGC types I and III shared Proteobacteria, Flavobacteria and Sphingobacteria as significant elements of the community.The bacteria detected in this study were largely from aerobic or facultative anaerobic taxa, suggesting aerobic or a micro-aerobic conditions (Fig. S16,S17, S18, S19).
The community identified amplifying the 18S rRNA, the Eukaryota fraction was dominated by chemoheterotrophic phyla such as Metazoa (with the classes Gastrotricha and Rotifera being first-and second-largest, respectively) and Alveolata (with the classes Ciliophora and Apicomplexa being first-and second-largest, respectively) that may serve as links in energy transfer from lower to higher trophic levels.The BGC types I, II and IV that were distributed over the paddy fields all shared this pattern.Jurgens and Gude [61] suggested that the presence of protozoans and metazoans might shape the characteristics of the bacterial community, not just regarding structure and diversity, but also triggering phenotype changes in response to predation.Similar to our findings for the Archaea community, BGC type III showed a different structure from the other BGC types, presenting the phylum Viridiplantae as the most abundant.Viridiplantae is a taxon formed by a wide range of terrestrial and aquatic primary producers that rely on chloroplasts to perform photosynthesis, which plants are part of this clade [62] (Fig. S20, S21, S22, S23).

Chemical profiles and organismal community structures
The differences in the community distributions for BGC types I and II were not entirely clear.However, the differences in the chemical profiles revealed that all four BGC types are indeed unique.On the PCA score plot, the k-clustering method segregated four clusters as the BGC types in a cross-like fashion.This structure translates how each BGC type has its chemical profile distinguished from the others or, alternatively, its own environmental condition.In accordance with the results, BGC types III and IV were more distinct from each other than BGC types I and II were from each other, as the former were split along the PC1 axis and the latter along the PC2 axis as seen in Fig. 5.So that, BGC types III and IV had the average correlations for the chemical variables compared against PC1 while BGC I and II had the average correlations for the chemical variables compared against PC2 as seen in Fig. 8.
The chemical variables with positive correlations were extracted and assigned to the related BGC type.We listed only variables with positive correlations for each BGC type, since positivity implies that the organisms living under those specific environmental conditions are at least tolerant to those conditions.A negative correlation would not necessarily imply suppression arising from either side, biological or chemical, since the samples are from an open system with many variables that were not measured, such as geographical configuration or even weather conditions; however, it does imply the absence of a given chemical variable in the presence of a given organism, and vice-versa.
BGC type I was statistically independent from the elements tested and from the variation on the FT-IR profile.However, it presented a positive correlation to buckets on the aromatic side of the 1 H-NMR spectra, including those annotated as phenylalanine, some annotated as other amino acids, and, most remarkably, to the largest peak on the 1 H-NMR spectra, which was assigned as lactate.The presence of lactate suggests anoxic conditions where this molecule could be derived from pyruvate fermentation [63] as an adaptation of Metazoan organisms to thrive under suboxic conditions [64].Despite a suggestion of anoxic conditions for this BGC type, we found that 20% of the 16S rRNA community were from chloroplasts.Therefore, we hypothesize that anoxic microzones might buffer the chloroplast-produced oxygen [65].Another possibility is that the chloroplasts might even simply constitute debris, just dead organic material serving as a substrate to other organisms [66].
BGC type II presented the highest negative correlation to the buckets assigned to lactate on 1 H-NMR.Since BGC types I and II inhabit the same set of samples to different degrees of colonization, this finding may suggest either competitive interactions between them or chemical heterogeneity in the occurrence of microzones displaying different chemical compositions, and hence a resulting difference in the kinds of life supported.
Organisms from BGC type III showed a positive correlation with three elements tested using ICP-OES and with two nonannotated buckets in the aromatic region of the 1 H-NMR.One of the elements was manganese and it is a key element to, among other biological functions, photosynthesis [67,68].The other two elements were silica and calcium that can be linked to, but not restricted to, silification and calcification: both are processes related to increasing physical resistance in organisms as cytoskeletons and shells, features which are important in many groups within algae and microalgae, for example [69].Indeed, for this BGC type, the phylum Alveolata within Eukaryota had the class Dinophyceae as its dominant class, instead of Ciliophora, which dominated the same phylum in the other BGC types.BGC type III encompassed the organisms thriving in the environments poorest in the organic compounds detected by 1 H-NMR and FT-IR.The phyla Viridiplantae and Crenarchaeota would be the primary sources of organic carbon, with the latter also potentially serving as a nitrogen source.As the waters from paddy fields can drain into the river but the opposite is not possible, BGC type III could not act as a source of organic matter for the other BGC types.The richness in Crenarchaeota despite poor organic load is consistent with the findings of Ochsenreiter et al. [70], where the phylum was found throughout many kinds of environmental soil and freshwater.Despite poor organic matter content, BGC type III had the highest abundance of the three domains compared to all other BGC types, suggesting efficient cycling of photosynthesized compounds.
BGC type IV presented the group of organisms statistically correlated to more chemical variables than any of the previous ones.There are correlations to five elements within ICP-OESbarium, copper, iron, zinc and phosphorous.Sanchez-Moral et al. [71] suggested that barium precipitation can be bio-induced in pure cultures of Actinobacteria which was the most abundant class of bacteria present in this BGC type.Copper, iron and zinc are some of the trace elements essential for enzymatic activity in methanogenic systems, and deficiencies are suggested to diminish such activity [72].Phosphorous is abundant in several metabolic pathways, being involved in structural biomolecules and the energy currency ATP.BGC type IV was also correlated to those bands of absorbance in FT-IR assigned to be C-H and O-H bonds: the former may suggest long carbon chains from lipids, and the latter sugar-or alcohol-related molecules.Both of them are intrinsically linked to high levels of chemical energy, either implicated in catabolism or anabolism.Within 1 H-NMR, BGC type IV was correlated to a vast part of the spectra, suggesting a chemically rich organic environment, with relationships to many buckets assigned to amino acids and to the broad peaks assigned as protein.This would suggest nitrogen recycling, meaning the group presents an organismal community that may be both source and consumer of the organic compound assigned as protein.

Conclusions
After the evaluation done for each BGC type, we can classify them roughly to the remarkable features that individualize them.According to the findings, BGC type I would represent the subsystem represented by the anoxic/near anoxic microzones by presenting lactate as an important energy source intermediate (aka the ''BGC type Anoxic'').The BGC type II would be the aerophilic counter part of BGC type I, eventually presenting a relation energy transfer between them (aka the ''BGC type Counter Part'').The BGC type III would represent a sub-system of photosynthetic organisms and the related chemical profile (aka the ''BGC type Photosynthetic'').The BGC type IV would represent the most active in the cycling of organic compounds (aka the ''BGC type Glutton'').
Hence, we successfully established the BGC typing analysis pipeline technique and applied to the environment of Japanese paddy fields bringing forth four unique subset of organismal and chemical assembles.The technique is flexible and can accept any biogeochemical or omics measurements, since it operates on numerical tables of values (e.g., intensity, concentration, number of reads, etc.) and can contribute to further insights into biogeochemical cycles in other environments.
This holistic technique will broaden the understanding of ''hidden'' sub-systems working within the totality of the environments.

Ethics Statement
There is no specific permission required for all of following sampling points as they are public places.Also the field does not host endangered or protected species.The exact location for the sampling points from Ara River (ara1, ara2) are 36u29320N 139u30980E and 35u560540N 139u329410E respectively, from paddy field 1 (p1f1-p1f3) are: 36u29230N 139u299510E, from paddy field 2 (p2f1-p2f3) are: 35u599370N 139u30950E, from paddy field 3 (p3f1-p3f3) are: 35u589290N 139u309330E, from collector stream from paddy field 1 (p1s) is: 36u29250N 139u299460E, from collector stream from paddy field 2 (p2s) is: 35u599370N 139u30950E and from collector stream from paddy field 3 (p3s) is: 35u589290N 139u309320E.
The whole of the sequences retrieved in our study are available in the DDBJ Sequenced Read Archive under accession number DRA002437.Table S6 1 H-NMR along sampling points.Rows: sampling points.Columns: bins for chemical shifts (ppm) with annotation where applicable and integration for the broad peaks annotated as protein.

Figure 1 .
Figure 1.Schematic representation for the Biogeochemical Typing (BGC typing).Yellow box: steps for collection and pre-processing the samples.Orange box: steps for data acquisition and formatting for BGC typing.Red box: steps for BGC typing as the integrated statistical analyses.doi:10.1371/journal.pone.0110723.g001

Figure 2 .
Figure 2. Schematic representation for the sampling location.Shadow map showing Kanto region (Japan).Grey area is Saitama prefecture.The rectangle indicates where the samples were taken.Magnified area is schematic map for the sampling site.At right side, schematic figure: Ara River was sampled in two points (ara1 and ara2) as well as paddy fields located within the area between these two river sampling points.Three independent paddy fields (paddy field 1, 2, 3) were selected and three samples were taken from each paddy field (p1f1-3, p2f1-3 and p3f1-3).These paddy fields were connected with Ara River through independent collector streams which were also sampled (p1s, p2s, p3s), respectively for each paddy field.Blue arrows indicate flow direction of Ara river and light brown arrows indicate each sampling points.Gaps in Ara River indicate bridges.doi:10.1371/journal.pone.0110723.g002 Figure 2. Schematic representation for the sampling location.Shadow map showing Kanto region (Japan).Grey area is Saitama prefecture.The rectangle indicates where the samples were taken.Magnified area is schematic map for the sampling site.At right side, schematic figure: Ara River was sampled in two points (ara1 and ara2) as well as paddy fields located within the area between these two river sampling points.Three independent paddy fields (paddy field 1, 2, 3) were selected and three samples were taken from each paddy field (p1f1-3, p2f1-3 and p3f1-3).These paddy fields were connected with Ara River through independent collector streams which were also sampled (p1s, p2s, p3s), respectively for each paddy field.Blue arrows indicate flow direction of Ara river and light brown arrows indicate each sampling points.Gaps in Ara River indicate bridges.doi:10.1371/journal.pone.0110723.g002

Figure 4 .Figure 5 .Figure 6 .
Figure 4. Finding the optimal number of BGC types.X-axis: number of clusters.Y-axis: sum of squared distances from each variable to the centroid within the BGC type.doi:10.1371/journal.pone.0110723.g004

Figure 7 .Figure 8 .
Figure 7. Heatmap for the Spearman correlations of extracted OTUs against chemical profiles.X-axis, in order: chemical elements (ICP-OES), wavelength number (FT-IR), chemical shifts (H-NMR).Y-axis: OTUs arranged from BGC type I (upper) to IV (bottom).BGC types are indicated on the left.The meaning of the colours is indicated in the legend at the bottom.doi:10.1371/journal.pone.0110723.g007

Figure
Figure S1 Rarefaction curve to total Archaea OTUs.Rarefaction curve for Archaea rRNA OTUs for all samples.Xaxis: number of readings.Y-axis: number of species (log).(PDF) Figure S2 Rarefaction curve to toatal 16S rRNA OTUs.Rarefaction curve for 16S rRNA OTUs for all samples.X-axis: number of readings.Y-axis: number of species (log).(PDF) Figure S3 Rarefaction curve to total 18S rRNA OTUs.Rarefaction curve for 18S rRNA OTUs for all samples.X-axis: number of readings.Y-axis: number of species (log).(PDF) Figure S4 Emulation of a phylogenetic tree for BGC type I. Abundance range indicated by the size of the symbol indicated at upper right.Symbols indicative of the domain indicated at middle right.Colour codes for the most abundant classes indicated at lower right.The number of symbols for each branch is related to the number of sampling points within the OTUs present.(PDF) Figure S5 Emulation of a phylogenetic tree for BGC type II.Abundance range indicated by the size of the symbol indicated at upper right.Symbols indicative of the domain indicated at middle right.Colour codes for the most abundant classes indicated at lower right.The number of symbols for each branch is related to the number of sampling points within the OTUs present.(PDF)