Transitions at CpG Dinucleotides, Geographic Clustering of TP53 Mutations and Food Availability Patterns in Colorectal Cancer

Background Colorectal cancer is mainly attributed to diet, but the role exerted by foods remains unclear because the factors involved are extremely complex. Geography substantially influences foods. Correlations between international variation in colorectal cancer-associated mutation patterns and food availabilities could highlight the influence of foods on colorectal mutagenesis.
Methodology To test this hypothesis, we applied techniques based on hierarchical clustering, feature extraction and selection, and statistical pattern recognition to the analysis of 2,572 colorectal cancer-associated TP53 mutations from 12 countries/geographic areas. For food availabilities, we relied on data extracted from the Food Balance Sheets of the Food and Agriculture Organization of the United Nations. Dendrograms for mutation sites, mutation types and food patterns were constructed through Ward's hierarchical clustering algorithm and their stability was assessed by evaluating silhouette values. Feature selection used entropy-based measures for similarity between clusterings, combined with principal component analysis by exhaustive and heuristic approaches.
Conclusion/Significance Mutations clustered in two major geographic groups, one including only Western countries, the other Asia and parts of Europe. This was determined by variation in the frequency of transitions at CpGs, the most common mutation type. Higher frequencies of transitions at CpGs in the cluster that included only Western countries mainly reflected higher frequencies of mutations at CpG codons 175, 248 and 273, the three major TP53 hotspots. Pearson's correlation scores, computed between the principal components of the datamatrices for mutation types, food availability and mutation sites, demonstrated statistically significant correlations between transitions at CpGs and both mutation sites and availabilities of meat, milk, sweeteners and animal fats, the energy-dense foods at the basis of “Western” diets. This is best explained by differential exposure to nitrosative DNA damage due to foods that promote metabolic stress and chronic inflammation.


Introduction
The TP53 gene (OMIM no. 191117), which encodes a tumor-suppressor protein that drives multiple cellular responses to stress, including cell-cycle arrest, DNA repair, apoptosis, metabolism and autophagy, is frequently mutated in cancer [1,2,3,4,5,6]. TP53 mutations are mostly missense and cluster in exons 5-8, the evolutionarily conserved region of the DNA-binding domain that contains >90% of the known mutations and all mutation hotspots at CpG dinucleotides [7,8,9,10,11]. Laboratory models and data from tumors with established environmental risk factors show that TP53 mutation patterns reflect primary mutagenic signatures of DNA damage by carcinogens, vulnerability of nucleotide positions in DNA secondary structure, efficiency of repair processing, and selection for loss of trans-activation properties [10,11,12,13,14,15,16].
Colorectal cancer (CRC), one of the most common malignancies worldwide, is mainly attributed to dietary risk factors [17,18,19,20,21,22,23,24]. TP53 mutations are found in 50-60% of all CRCs and are thought to originate in precancerous lesions, where aberrantly proliferating colonocyte progenitors are directly exposed to dietary residue [25,26]. Nevertheless, the TP53 mutation pattern typical of CRC cannot be easily correlated with diet, because it is characterized by a striking preponderance of G:C>A:T transitions [9,13,16]. These are the most frequent base substitutions induced by reactive oxygen species, byproducts of normal aerobic metabolism generated at high levels in all inflammatory processes and after exposure to a wide variety of carcinogens and toxicants [27,28,29,30,31,32,33,34]. Furthermore, CRC development appears to depend on whole-life nutrition pattern [23], and TP53 mutations may occur years before CRC diagnosis [25,35]. Thus the time-frame for the estimation of diet may not fully capture the period relevant for mutagenesis and carcinogenesis. This is complicated by the relatively limited variation in dietary habits within single populations, by biases in reporting and recording dietary intakes and by the problematic assessment of exposures to food-borne carcinogens and toxicants, both natural and generated in food production, processing, preservation, and preparation [17,23,36,37,38,39,40,41,42]. Adding to the complexity, intestinal mutagenesis may be modified by nutrient/nutrient, nutrient/microflora, nutrient/cell metabolism, nutrient/gene and nutrient/DNA repair interactions, and affected by epigenetic modifications, transit time of dietary residue, inflammatory and endocrine responses, body mass and energy consumption through physical activity [23,40,43,44,45,46,47,48].
Geography strongly affects the ecological, cultural and economic factors that determine food systems and diets. CRCs from patients embedded in geographically diverse populations and cultures reflect substantially different dietary exposures, extended over the whole-life course and unbiased by estimation errors [17,21,23]. Thus food-related mutational signatures could be highlighted through the analysis of geographic variation in CRC-associated TP53 mutations. To test this hypothesis, we analyzed 2,572 TP53 mutations associated with primary CRCs from 12 countries or geographic areas. The mutations (Database S1) were extracted from the TP53 database of the International Agency for Research on Cancer (IARC) (R10 update, July 2005, http://www-p53.iarc.fr/Somatic.html), with the addition of an Iranian series [11,49]. To investigate correlations between geographic clustering of TP53 mutations and foods, we relied on the food balance sheets (FBS) of the Food and Agriculture Organization of the United Nations (FAO, http://faostat.fao.org/site/368/DesktopDefault.aspx?PageID=368), which provide unique comprehensive pictures of the patterns of national food supply, useful for international comparisons [50,51,52]. Food availability patterns (FPs) for the countries/geographic areas in the TP53 database were derived from the mean per caput supplies, in percent of the total caloric value, of each major food group available for human consumption during the reference year 1990 [17] (Dataset S1). The datamatrices generated for mutation sites (MS), mutation types (MT) and FP (Datamatrices S1) were investigated for geographic variation by hierarchical clustering (HC). Factors underlying HC were defined by feature analysis (FA) through principal component analysis (PCA). Pearson's correlation scores were computed between the principal components of the mutation type, food availability pattern and mutation site datamatrices.
These analyses demonstrated significant correlations between transitions at CpGs and both mutation sites and availabilities of meat, milk, sweeteners and animal fats. Our results are best explained by differential exposure to nitrosative DNA damage due to the consumption of energy-dense foods that promote metabolic stress and chronic low-grade inflammation.

The MS and MT trees showed similar structures, each with two major geographic clusters, one including only Western countries (I-MS, I-MT), the other Asia and parts of Europe (II-MS, II-MT). The main difference consisted in the position of West and Central Europe in II-MS and I-MT, respectively. Stability of clusters was assessed by silhouette values. Silhouette plots for different thresholds, applied to each dendrogram, were compared to assess the reliability of the clustering solutions. In both cases the tree structure showed two stable clusters. The low silhouette value of MT was related to the poor stability of the "Spain" branch, attributable to either I-MT or II-MT. The MS and MT tree structures were correlated by two-tailed Mantel test (r = 0.581, P = 0.001) (Figure 2).

Geographic variation in mutation site and type
By multivariate FA we next investigated the factors that determined clustering for MS (i.e., codons) and MT (i.e., mutation types), using heuristic and exhaustive approaches, respectively. Feature selection aimed at identifying the minimum subset of features necessary to generate the clustering structure obtained using all the features. Sequential forward feature selection with two different rankings, based respectively on the number of mutations recorded for each codon (feature) and on the PC coefficients of each feature, was used to analyze the MS datamatrix by the heuristic approach (Figure 3).
Stable MS clustering was obtained with 23 weight-ranked or 22 PCA-ranked codons, in both cases including the five TP53 mutation hotspots (i.e., CpG codons 175, 245, 248, 273, 282) [9,13,16], out of 173 mutated codons in the datamatrix. The variance contributed by the PCs of the MS datamatrix and their eigenvalues are shown in panels A and D of Figure 4, respectively. Total MS variance was explained by 11 components. Four components contributed 80% of the variance, and the first component, which accounted for 31%, had highly significant PC coefficients for the features corresponding to the five CpG hotspots, as detailed in File S1 and in Figure S1, panel A.
Exhaustive multivariate FA of the MT datamatrix is reported in Tables 1 and 2. In decreasing order, the most relevant features were G:C>A:T at CpGs, followed by A:T>C:G, G:C>A:T and G:C>C:G. The variance contributed by each PC of the MT datamatrix and their eigenvalues are shown in panels B and E of Figure 4, respectively. Total MT variance was explained by 4 components, the first of which accounted for 65%, and, as detailed in File S1 and in Figure S1, panel D, the highest PC feature loading among the 8 mutation types corresponded to transitions at CpGs. Other mutations, including transitions at non-CpGs, were associated with minor fractions of variance.
The frequency box-plots of the mutations at the 19 codons with highest weights and highest PCA variance coefficients in Figure 5, panel A, showed higher mutation frequencies at the three major hotspot codons 175, 248, and, particularly, 273, in I-MS versus II-MS. This reflected higher frequencies of transitions at CpGs in I-MT (range: 46.1-61.2%) versus II-MT (range: 41.2-43.3%) in the frequency box-plots of the 8 mutation types in Figure 5, panel B. These most relevant features were used to geographically visualize MS and MT variation (Figure 6, panels A-B). Highlighted groupings of countries/geographic areas were similar to the MS and MT clusters in Figure 1, obtained by HC using all the features. Overall, these results indicate that in CRC TP53 transition mutagenesis at CpGs is modulated by geography-related factors. This might reflect differences in exposure(s) to specific food-associated mutagenic process(es) [53].

Geographic variation in food supply patterns
To address this issue, we analyzed the FP datamatrix by HC and FA through PCA. HC for FP was based on the mean per caput supply values, in percent of total available calories, of each major food group in the relevant countries/geographic areas during the reference year 1990 [17]. HC yielded two major clusters, I-FP, with Western countries and Japan, and II-FP, with South and East Asia plus Iran (Figure 1, panels E-F). The clustering of Japan in I-FP had a low silhouette value and contrasted with the previous assignments of Japan to clusters II-MS and II-MT. To verify Japan's assignment, we generated all the possible subsets of the 13 FP features (food groups), i.e., 8,192 subsets. HC trees, cut to obtain two clusters, were then generated based on each of these subsets. Dendrograms were classified as A or B when Japan clustered in II-FP or I-FP, respectively, and as C when different from A and B. Overall, 2,405 clusterings, classified as A, assigned Japan to cluster II-FP with Iran and South and East Asia; 4,178, classified as B, assigned Japan to cluster I-FP with Western countries; and 1,609 were classified as C, being different from A and B. The histograms in Figure S2, panels A-B, which visualize the number of times that each of the 13 features was present when type A or B clusterings, respectively, were obtained, readily show that the feature cereals was almost always absent in type A clusterings and almost always present in type B clusterings. Thus Japan joined I-FP only because of the low availability of cereals. Tables 3 and 4 show the results of exhaustive FA of the FP datamatrix. In decreasing order, the most relevant features were cereals, milk, and meat. PCA showed that total FP variance was explained by 3 components, the first of which accounted for a major fraction of 87.3% (Figure 4, panels C and F).
The variance of this component, which, in loading order, included the features cereals, meat, milk, sweeteners and animal fats (File S1 and Figure S1, panel G), explained the tree structure, determined by lower cereals and higher meat, milk, sweeteners and animal fats in I-FP relative to II-FP, as shown in panels C and F of Figure 4.

Correlations between mutation pattern and food supply pattern
The data from the MS, MT and FP datamatrices were projected on the 1-dimensional space spanned by their respective PCs. Pairwise Pearson correlations were then computed for the three datamatrices in all the projected spaces. Tables 5 to 7 show the correlation scores, and the corresponding P-values, obtained for the first 3 PCs of each datamatrix, which, except for MS, accounted for most of the variance. Pearson's correlation between the PCs for MT and for FP (Table 5) showed that the first PC for MT was correlated with the first PC for FP, with r = -0.60 (P = 0.039). Availabilities of meat, milk, sweeteners and animal fats were directly correlated with transitions at CpGs, and availability of cereals with transitions at non-CpGs (File S1 and Figure S1, panels D and G). As detailed in File S1 and in Figure S1, other less important correlations involved second and third PCs that accounted for minor fractions of variance. With the same analysis, the first PCs for MS and for MT were again strongly correlated, with r = -0.87 (P = 0.0002, Table 6), which supported Mantel's test results (Figure 2). However, in spite of the correlation between MT and FP, there were no significant correlations between the PCs of MS and FP (Table 7).
Scatter plots with superimposed linear regression showing the global trend of correlations were built for the countries/geographic areas as projected on the 2-dimensional spaces spanned by the first PCs of MS and MT (Figure 7) and of MT and FP (Figure 8). As shown in Figure 7, Italy, Iran, South and East Asia and West and Central Europe had relatively lower frequencies of mutations at CpG hotspot codons, compensated by higher frequencies of mutations at all other sites (see also box-plots in Figure 2). Mutation frequencies at CpG hotspots increased in other countries, with highest frequencies in Australia and UK. As shown in Figure 8, transitions at CpGs correlated with countries/geographic areas characterized by higher availabilities of energy-dense, Western-style foods, while South and East Asia, Iran, Japan and, to a lesser extent, Italy, where cereals were higher and meat, milk, sweeteners and animal fats lower, had lower frequencies of such mutations.
Overall, variation in the frequency of transitions at CpGs reflected variation in the availabilities of the energy-dense foods that form the basis of "Western-style" diets and that are linked to overweight and obesity [18,20,21,22,23]. Transitions at non-CpGs balanced decreases in transitions at CpGs in the countries/geographic areas where cereals compensated for lower availabilities of such foods.

Discussion
Several studies have addressed the issue of CpG transition mutagenesis in cancer, with particular regard to TP53 mutations in CRC. Because exonic CpGs are constitutively hypermethylated, C to T mutations at coding CpGs in TP53 should be scored as direct transitions from hypermutable 5-methylcytosine to thymine [54,55,56,57,58,59]. Dietary folate is a defined environmental determinant of genomic methylation [23,60,61]. Laboratory models and data on CRCs in patients carrying a germline methylenetetrahydrofolate reductase (MTHFR) gene variant that results in reduced plasma and serum folate suggest that low folate, by inducing global hypomethylation, may decrease TP53 transition mutagenesis at CpGs [62,63,64]. Folate-rich foods include fresh vegetables, pulses (legumes) and relatively unprocessed cereals [65,66]. Little is known about DNA methylation variation among individuals and populations [67,68]. We did not find any correlation between availability of vegetables or pulses and TP53 mutation pattern, while cereals, relatively unprocessed in most Asian countries [69,70], inversely correlated with transitions at CpGs. Thus folate availability may not account for our results. This conclusion agrees with studies showing that, in the absence of interacting genetic effects, folate alone does not influence TP53 mutation patterns in CRC (although it may affect TP53 protein expression) [44,71,72].
NO is produced at mutagenic concentrations by inducible NO synthase (iNOS), the widespread enzyme isoform upregulated by inflammatory cytokines [76,82,87]. It has already been suggested that the excess of TP53 transitions at CpGs found in cancers arising on a chronic inflammatory background, such as CRC in ulcerative colitis and bladder cancer associated with schistosomiasis, results from nitrosative stress [74,88]. Moreover, transitions at CpGs are strongly related to iNOS expression in both CRC and adenocarcinoma of Barrett's esophagus [89,90]. Arginine, the substrate for NO synthesis and a potential CRC-related dietary factor [87,91,92,93], is contained in a variety of protein-rich foods of animal and vegetable origin [65,66] and may not per se explain why variation in the frequency of transitions at CpGs correlated with variation in the availabilities of meat, milk, sweeteners and animal fats. However, it is known that these energy-dense foods promote a pro-inflammatory milieu that increases iNOS expression and NO production [23,78,94,95,96,97,98,99,100,101]. In addition, red meat is a major exogenous source of nitrogen compounds and haem, which contribute to N-nitrosation in the intestinal environment [23,102,103,104,105,106,107,108]. Such considerations are supported by the fact that our data point to a key role of the ubiquitously methylated major TP53 hotspot codons 175, 248 and 273 in geographic clustering. In fact, the vast majority of the mutations at these 3 codons reported in human cancer are compatible with nitrosative deamination [9,11,32,54,74,109]. Moreover, transitions at codon 248 were experimentally induced with an NO-releasing compound [110], while mutations at codon 273 were found to be strongly associated with diets high in red meat and fat [44].
In conclusion, we recognize the difficulties inherent in interpreting causes and mechanisms responsible for CRC-associated TP53 mutations, which are the end result of complex cascades of events. It is important to keep in mind the limitations of our analyses, based on a single, albeit large, database of mutations. Furthermore, FAO FBS, the only standardized comprehensive food data available for international comparisons, only approximate food supply patterns. Nevertheless, geographic variation in CRC-associated TP53 mutation patterns appears to be due to transitions at CpGs and mainly related to differential mutation frequencies at the major TP53 hotspots. This could be explained by differential exposure to nitrosative DNA damage, linked to the consumption of foods promoting metabolic stress and chronic low-grade inflammation.
Mutations in adenomas, metastatic CRC and cell lines were excluded, as their spectrum could differ from that of primary CRC [111]. Analyses were based on 2,542 mutations in coding regions for MS, and on 2,572 (i.e., all) mutations for MT (Database S1). Mutations were grouped according to country or geographic area, the latter including geographically and ethnically related countries with low mutation numbers. The FP dataset (Dataset S1) was extracted from the FAO FBS [50,51] compiled for the reference year 1990 (http://faostat.fao.org/site/368/DesktopDefault.aspx?PageID=368), as used in reference [17]. Year selection tended to exclude the most recent and current international variations in food availabilities and nutrition, as CRC develops over several years and is mostly diagnosed in patients aged 65 years or older [112], while the IARC TP53 database has compiled mutations since 1989 [11]. The FP dataset included the following major food groups: animal fats, animal products, cereals, fish/seafood, fruit, meat, milk, oilcrops, pulses (legumes), starchy roots, sweeteners, vegetable oils and vegetables. For the purpose of this study alcohol was excluded, because much of the data on average availability of alcoholic drinks is not informative and is potentially confounding, due to large interindividual variability [23]. Spices and stimulants, which account for low percentages of the total available daily energy supply, were also excluded. Statistical analyses were therefore conducted using the estimated percent (%) contribution of each considered food group to mean per caput daily energy availability [17]. Weighted average availabilities were calculated for geographic areas by adjusting for the 1990 population size of each included country. The MS, MT and FP datamatrices were normalized by converting absolute numbers into frequencies (Datamatrices S1).
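The two data-preparation steps described above (converting absolute counts into frequencies, and population-weighting the availabilities of the countries that make up a geographic area) can be sketched as follows. This is a minimal illustration in Python rather than the Matlab code used in the study; the function names and the toy numbers in the usage note are hypothetical and are not taken from the datasets.

```python
def to_frequencies(counts):
    """Convert absolute counts (e.g., mutations per type) into frequencies summing to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def population_weighted_mean(values, populations):
    """Average a per-country value over a geographic area, weighting by population.

    `values` maps country -> value (e.g., % of calories from one food group);
    `populations` maps country -> population size for the reference year.
    """
    total_population = sum(populations[country] for country in values)
    weighted_sum = sum(values[country] * populations[country] for country in values)
    return weighted_sum / total_population
```

For example, with made-up inputs, `to_frequencies({"x": 1, "y": 3})` gives frequencies of 0.25 and 0.75, and `population_weighted_mean({"A": 10.0, "B": 20.0}, {"A": 1, "B": 3})` gives 17.5, because country B carries three times the weight of country A.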
All standard techniques, including hierarchical clustering (HC), principal components analysis (PCA), Pearson correlation and linear regression, were used in their implementations from Matlab (2007b, The Mathworks and Matlab Statistics Toolbox).

Statistical Pattern Recognition
Statistical pattern recognition allowed the integrated analysis of the MS, MT, and FP datamatrices to investigate relations between TP53 mutation sites, TP53 mutation types, and food supply patterns. The first analytical step consisted in clustering the 12 analyzed countries/geographic areas by HC with respect to the data contained in the MS, MT and FP datamatrices. The stability of the obtained clusterings was assessed using the silhouette values. The similarities between the obtained clustering solutions, represented by dendrograms, were assessed using an entropy-based similarity measure. Feature analysis and selection, which is the process of studying the contribution of single features, or subsets of features, to dataset properties, was the next relevant processing step. Exact feature analysis can be performed by testing the ability of each single subset of features to maintain a chosen property. In practice, this is feasible only when the number of features is low. In comparing the MT and FP datasets, because of the relatively low number of features, such exhaustive analysis could be carried out. With regard to the MS dataset, the number of possible subsets of features was too high, and therefore a heuristic approach, i.e., sequential forward selection, was used to select feature subsets. The principal components and the relative weights of the features were used as ranking criteria. Results were visualized on geographic maps with the relevant areas colored according to the most relevant features. Finally, multivariate correlations between the datasets were computed exploiting their PC projections. All these analytical steps are detailed below.

Table 2. Best-case exhaustive feature analysis of the mutation types datamatrix.

Table 5. Pearson's correlation scores between the PCs of mutation types and food patterns.

Table 6. Pearson's correlation scores between the PCs of mutation types and mutation sites.

Hierarchical clustering
Distance matrices for MS, MT and FP were computed by pairwise comparison between TP53 countries/geographic areas using the squared Euclidean distance. Dendrograms were constructed through Ward's hierarchical clustering algorithm [113]. Stability of clusters was assessed by evaluating the silhouette values [114], which measure how close each point in one cluster is to the points in the neighboring clusters. This measure ranges from +1, indicating points very distant from neighboring clusters, through 0, indicating points not distinctly in one cluster or another, to -1, indicating points probably assigned to the wrong cluster.
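To make the silhouette measure concrete, the sketch below computes one silhouette value per point from a given cluster labeling, using the squared Euclidean distance mentioned above. It is an illustrative Python re-implementation of the standard definition, not the Matlab routine used in the study; the function names and the convention of returning 0 for singleton clusters are assumptions.

```python
def sq_euclidean(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_values(points, labels, dist=sq_euclidean):
    """Return one silhouette value in [-1, +1] per point for a cluster labeling."""
    values = []
    for i, point in enumerate(points):
        # Collect distances from point i to all other points, grouped by cluster.
        by_cluster = {}
        for j, other in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(point, other))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            values.append(0.0)  # assumed convention: singleton or single-cluster case
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values
```

With two well-separated groups the values approach +1, while a point placed in a distant cluster receives a negative value, which is the pattern used above to flag unstable branches.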
Matrices for MS, MT and FP were tested for correlation by Mantel's test [115]. The program Mantel version 3.1 was used to estimate Pearson correlation coefficients. Significance was assessed by 10,000 random permutations.
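The permutation logic of a Mantel test can be sketched as below: the Pearson correlation between the off-diagonal entries of two distance matrices is compared with the correlations obtained after randomly relabeling the objects of one matrix. This is a minimal Python illustration, not the Mantel 3.1 program cited above; the two-tailed counting rule, the `(hits + 1) / (permutations + 1)` estimate and all function names are assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def upper_triangle(matrix, order=None):
    """Flatten the strict upper triangle, optionally after relabeling the objects."""
    n = len(matrix)
    if order is None:
        order = list(range(n))
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(d1, d2, permutations=10000, seed=0):
    """Return (r, p) for two symmetric distance matrices over the same objects."""
    rng = random.Random(seed)
    x = upper_triangle(d1)
    r_obs = pearson(x, upper_triangle(d2))
    order = list(range(len(d2)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # permute rows and columns of d2 together
        r_perm = pearson(x, upper_triangle(d2, order))
        if abs(r_perm) >= abs(r_obs) - 1e-12:  # two-tailed comparison
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)
```

Rows and columns are permuted jointly because the entries of a distance matrix are not independent observations, which is why an ordinary correlation test on the flattened matrices would not be appropriate.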

Feature selection
Feature selection involved the use of a similarity measure between hierarchical clusterings, visualized as dendrograms, respectively built on the entire feature set and on the feature subset(s) to be tested. The higher the similarity, the higher the rank of the chosen feature subset. The entropy-based similarity measure used is defined below.
Two clusterings are identical if there is one-to-one correspondence between their clusters. The more a cluster of one clustering is filled with objects from different clusters of the other clustering (disorder), the less is the concordance between clusterings. All the information needed to summarize this phenomenon is the corresponding confusion matrix. Given two clusterings, $A$ and $B$, where $A$ is made of $n$ clusters and $B$ of $m$ clusters, the confusion matrix $M$ between $A$ and $B$ is an $n \times m$ matrix, in which the entry $(i, j)$ reports the number of objects in the cluster $i$ of $A$ falling into the cluster $j$ of $B$. Entropy is the obvious tool to measure such disorder. If $R_i$ is the $i$-th row of $M$ and $C_j$ is the $j$-th column of $M$, then $H(R_i)$ measures the disorder of the $i$-th cluster of $A$ with respect to $B$, and $H(C_j)$ measures the disorder of the $j$-th cluster of $B$ with respect to $A$.
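The building blocks described above, the confusion matrix $M$ and the entropies $H(R_i)$ and $H(C_j)$, can be sketched in a few lines of Python. The block stops at these ingredients and does not attempt to reproduce the final similarity measure; the function names and the use of base-2 logarithms are assumptions.

```python
import math


def confusion_matrix(labels_a, labels_b):
    """M[i][j] = number of objects in cluster i of A that fall into cluster j of B."""
    clusters_a = sorted(set(labels_a))
    clusters_b = sorted(set(labels_b))
    row = {c: i for i, c in enumerate(clusters_a)}
    col = {c: j for j, c in enumerate(clusters_b)}
    m = [[0] * len(clusters_b) for _ in clusters_a]
    for a, b in zip(labels_a, labels_b):
        m[row[a]][col[b]] += 1
    return m


def entropy(counts):
    """Shannon entropy (in bits, an assumed unit) of a vector of counts."""
    total = sum(counts)
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def row_entropies(m):
    """H(R_i): disorder of each cluster of A with respect to B."""
    return [entropy(r) for r in m]


def column_entropies(m):
    """H(C_j): disorder of each cluster of B with respect to A."""
    return [entropy(list(c)) for c in zip(*m)]
```

Two identical clusterings give a diagonal confusion matrix with all entropies equal to 0, whereas two clusterings that split every cluster evenly across the other give the maximum entropy for every row and column.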
A way to compute the similarity between $B$ and $A$ is the mean entropy of the clusters of $B$ versus $A$, where the a priori probability of a cluster $X$, $p(X)$, can be approximated as $p(X) = (\text{number of objects in } X) / (\text{total number of objects})$. This gives the formula for $S(M)$, expressing the similarity of $B$ versus $A$, while the similarity of $A$ versus $B$ can be obtained with the analogous formula on $C_j$, which turns out to be $S(M^T)$. The measure of similarity between clusterings lies in the trade-off between $S(M)$ and $S(M^T)$. We define the final similarity measure $S_\alpha$, where $\alpha$, $0 \le \alpha \le 1$, can be used to set the acceptable level of 'subclusteringness' of $B$ with respect to $A$. When $\alpha = 0$, no importance is given to the fragmentation level of the clusters in $B$. When $\alpha = 1$, only exact matching between $A$ and $B$ will produce a maximum for $S_\alpha$. Based on this similarity measure between clusterings, useful comparisons between dendrograms can be easily performed. Given a solution obtained from a dendrogram (the target solution), it is possible to assess how closely that solution can be approximated by another dendrogram.
Given a dendrogram $D$, let $\delta(D)$ be the clustering solution obtained by applying a cutting threshold $\delta$ to $D$. We define a complete threshold set for a dendrogram as any minimal set of threshold values that, when applied, yields all the possible clusterings for the dendrogram. We indicate any such set for a dendrogram $D$ by $\Delta(D)$. It can be easily shown that [...]. Given a dendrogram $D_0$, a target solution $T$ can be derived by applying a cutting threshold. The similarity between $D_0$ and another dendrogram, $D$, can be approximated using the dendrogram similarity procedure.
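One way to picture a complete threshold set is through the merge heights of a dendrogram: cutting anywhere between two consecutive distinct heights yields the same clustering, so one threshold per gap is enough. The Python sketch below follows that reading. It is an interpretation rather than the procedure used in the study; it assumes a monotone dendrogram (merge heights never decrease up the tree, as with Ward's method), and the function names are hypothetical.

```python
def complete_threshold_set(merge_heights):
    """One cutting threshold per distinct clustering obtainable from the dendrogram.

    `merge_heights` lists the height of each of the n - 1 merges of a dendrogram
    over n objects (assumed monotone).
    """
    heights = sorted(set(merge_heights))
    if not heights:
        return [0.0]
    thresholds = [heights[0] - 1.0]  # below the first merge: all objects separate
    thresholds += [(lo + hi) / 2.0 for lo, hi in zip(heights, heights[1:])]
    thresholds.append(heights[-1] + 1.0)  # above the last merge: a single cluster
    return thresholds


def n_clusters_at(merge_heights, n_objects, threshold):
    """Number of clusters left when every merge at or below `threshold` is applied."""
    return n_objects - sum(1 for h in merge_heights if h <= threshold)
```

For merge heights `[1.0, 2.0, 2.0, 5.0]` over five objects, the sketch returns four thresholds that give 5, 4, 2 and 1 clusters; the tied merges at height 2.0 make a three-cluster solution unreachable by cutting.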

Exhaustive approach to feature selection for the MT and FP datamatrices
Feature analysis studies the properties of single features or subsets of features of the analyzed data. Exact feature analysis can be performed by testing the properties of each possible subset of features. In this study, the property of interest was the ability to maintain the groups obtained in the clustering analysis phase. This exhaustive approach was successfully performed on the MT and FP datamatrices.
Given a set of features $F$ and a scoring function $f: 2^F \to \mathbb{R}$, the exhaustive feature analysis approach consists in computing $f(A)$, $\forall A \in 2^F$. We performed this analysis using the features of the MT and FP datamatrices in turn for $F$ and $\mathrm{dendrogramSimilarity}(T, D(A), 1)$ for $f(A)$, where $T$ is the solution obtained in the clustering analysis phase of the data and $D(A)$ is the dendrogram built using the feature subset $A \subseteq F$. The results of exhaustive feature analysis are reported in Tables 1 and 2 for the MT datamatrix and in Tables 3 and 4 for the FP datamatrix. In Tables 1 and 3 the entry $(i, j)$ reports the worst score obtained using $A = \{x_i\} \cup B$, where $x_i \in F$ and $|B| = i$. In Tables 2 and 4 the $i$-th entry reports the set of features $C$ such that $f(\{i, j\}) = 1$, $\forall j \in C$.
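The exhaustive approach amounts to enumerating the power set $2^F$ and scoring every subset. The Python sketch below shows that enumeration with a pluggable scoring function; the `score` callback stands in for the dendrogram similarity score, which is not implemented here, and all names are hypothetical.

```python
from itertools import combinations


def all_subsets(features, include_empty=False):
    """Yield every subset of `features` as a frozenset (2**len(features) in total)."""
    start = 0 if include_empty else 1
    for size in range(start, len(features) + 1):
        for subset in combinations(features, size):
            yield frozenset(subset)


def exhaustive_feature_analysis(features, score):
    """Score every non-empty feature subset with the caller-supplied `score`."""
    return {subset: score(subset) for subset in all_subsets(features)}


def minimal_perfect_subsets(scores, perfect=1.0):
    """Smallest subsets whose score reaches `perfect` (empty list if none do)."""
    winners = [s for s, value in scores.items() if value >= perfect]
    if not winners:
        return []
    smallest = min(len(s) for s in winners)
    return [s for s in winners if len(s) == smallest]
```

The count grows as $2^{|F|}$: 13 features give 8,192 subsets when the empty set is included, which is still tractable, whereas the 173 codons of the MS datamatrix are far beyond exhaustive enumeration.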

Heuristic approach to feature selection for the MS datamatrix
Because the number of MS feature subsets is $2^{173}$, we used the sequential forward selection approach for MS feature selection. A filter method was used.
Given a feature set $F$ and a scoring function $f: x \to \mathbb{R}$, $x \in F$, a ranking of features can be obtained by computing and sorting $\{f(x), \forall x \in F\}$. Let this ordered set be $x_1, \ldots, x_n$. Instead of producing all possible subsets of $F$, we produce the sets $S_1, \ldots, S_n$ such that $S_i = \{x_1, \ldots, x_i\}$, $i = 1, \ldots, n$. Substituting $2^F$ with $\{S_1, \ldots, S_n\}$ in the exhaustive approach completes the definition of the heuristic approach.
We used this method with two different rankings, based respectively on the number of mutations recorded for each codon (feature) and on the sum of the first 11 principal component (PC) coefficients of each feature (where 11 was the number of PCs contributing 100% of the data variance). Panels A and B of Figure 3 report the $f$ values obtained for the two different ranking functions. Panel C of the same Figure compares the best feature sets (minimal stable subsets giving $f = 1$).
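The heuristic can be sketched as follows: rank the features once, build the nested prefixes of that ranking, and score only those $n$ subsets instead of all $2^n$. The Python below is an illustration under assumptions: `score` stands in for the dendrogram similarity score, ties in the ranking are broken by feature name, and "stable" is read as "the score stays at 1 for every larger prefix as well".

```python
def ranked_features(weights):
    """Features sorted by decreasing weight (ties broken by name, an assumption)."""
    return sorted(weights, key=lambda name: (-weights[name], name))


def nested_subsets(ranking):
    """Prefixes of the ranking: the first feature, the first two, and so on."""
    return [ranking[:k] for k in range(1, len(ranking) + 1)]


def smallest_stable_prefix(ranking, score, perfect=1.0):
    """Shortest prefix such that it and every longer prefix score `perfect`.

    Returns None when the full feature set itself does not reach `perfect`.
    """
    scores = [score(subset) for subset in nested_subsets(ranking)]
    k = len(scores)
    while k > 0 and scores[k - 1] >= perfect:
        k -= 1  # walk back through the trailing run of perfect scores
    return ranking[:k + 1] if k < len(scores) else None
```

With $n$ features this needs only $n$ evaluations of the scoring function, at the cost of never testing subsets that skip a highly ranked feature.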

Geographic visualization of feature relevance
Geographic visualizations of the most relevant features of the MS, MT and FP datamatrices were obtained by summing, respectively, feature frequencies (for MS and MT) and per caput supply of each food group expressed as % of the total available calories, as detailed above [50,17] (for FP). Resulting values were mapped onto a yellow-to-red color range and displayed on the geographic profiles of the countries and geographic areas contributing to the TP53 mutation database.
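A yellow-to-red mapping of the kind described above can be sketched as a linear interpolation of the green channel. The Python below is only an illustration; the exact color scale used for the figures is not specified in the text, and the function name, the clamping behavior and the hexadecimal output format are assumptions.

```python
def yellow_to_red(value, vmin, vmax):
    """Map `value` in [vmin, vmax] to a hex color from yellow (low) to red (high)."""
    if vmax == vmin:
        t = 0.0  # degenerate range: everything maps to yellow
    else:
        t = min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))  # clamp to [0, 1]
    green = round(255 * (1.0 - t))  # red stays at 255, blue at 0
    return "#ff{:02x}00".format(green)
```

Applied to the summed feature values of each country or geographic area, with `vmin` and `vmax` taken from the observed range, this gives one fill color per area.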

Correlation analyses
To perform a multivariate correlation analysis between the PCs of MS, MT and FP, we exploited their projections on the respective PCs. Both the Scree and the Kaiser [116] tests provided clear support for extracting the first 11 components for MS. When applied to MT, these tests supported the extraction of 4 and 3 PCs, respectively, because the eigenvalue of the fourth PC was near the lower limit value (i.e., 0.9). For FP the Scree and Kaiser tests indicated 3 and 4 PCs, respectively. Pairwise Pearson correlations were then computed between the PCs in all the projected spaces.
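The projection-then-correlation step can be sketched for the first PC only: each datamatrix (rows are countries/geographic areas, columns are features) is centered, projected on its leading principal axis, and the resulting scores are correlated across datamatrices. The Python below uses power iteration on the covariance matrix as a dependency-free stand-in for a full PCA; it is not the Matlab implementation used in the study, and the function names, start vector and iteration count are assumptions.

```python
def center_columns(rows):
    """Subtract the column mean from every entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def first_pc(rows, iterations=500):
    """Leading eigenvector of the sample covariance matrix (power iteration)."""
    x = center_columns(rows)
    n, p = len(x), len(x[0])
    cov = [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(p)] for i in range(p)]
    v = [1.0 + 0.1 * i for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:
            break
        v = [c / norm for c in w]
    return v


def project(rows, axis):
    """Scores of the centered rows on a given axis."""
    return [sum(a * b for a, b in zip(row, axis)) for row in center_columns(rows)]


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

Because the orientation of a principal axis is arbitrary, the sign of a correlation between PC scores carries meaning only when read together with the feature loadings of the two components.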

Supporting Information
File S1 Coefficient loadings of the three most relevant PCs of mutation sites (MS), mutation types (MT) and food patterns (FP) and Pearson's correlation scores computed between the PCs of MS, MT and FP. Found at: doi:10.1371/journal.pone.0006824.s001 (0.04 MB DOC)

Figure S1 Coefficient loadings of the first three PCs of the mutation sites, mutation types and food patterns datamatrices. Coefficient loadings of the three most relevant principal components (PCs) of the mutation sites (MS, A-C), mutation types (MT, D-F) and food availability patterns (FP, G-I) datamatrices are projected on their 1-dimensional space (see File S1 for discussion). Found at: doi:10.1371/journal.pone.0006824.s002 (0.71 MB TIF)

Figure S2 Assignment of Japan to clusters I or II in cluster analysis for food availability patterns. The food category "cereals" determined the clustering of Japan with Western countries for food availability patterns. Histograms visualize the number of times that each of the 13 features was present in the 2,405 clusterings classified as type A, i.e., where Japan joined Iran and South and East Asia in cluster II-FP (A), or in the 4,178 clusterings classified as type B, i.e., where Japan joined Western countries in cluster I-FP (B). It is readily evident that feature 3 (cereals) was almost always absent in type A clusterings and almost always present in type B clusterings. This reflects the estimated low mean per caput supply of cereals available for human consumption in Japan, compared to the countries/geographic areas in the II-FP cluster (i.e., Iran and South and East Asia). Features 1 to 13 represent the following food categories: 1, animal fats; 2, animal products; 3, cereals; 4, fish/seafood; 5, fruit; 6, meat; 7, milk; 8, oilcrops; 9, pulses (legumes); 10, starchy roots; 11, sweeteners; 12, vegetable oils; 13, vegetables.
Datamatrices S1 Normalized datamatrix for TP53 mutation sites (MS) and mutation types (MT) assigned to 12 countries/geographic areas and normalized datamatrix for the estimated food availability patterns (FP) of the 12 countries/geographic areas in the TP53 database. Data from 2,572 TP53 exons 5-8 mutations associated with primary CRCs were retrieved from the TP53