Fig 1.
Formulation and implementation of preprocessing decisions.
(A) Two types of gene mapping methods (GM1 and GM2) are compared. (B) Two types of thresholding approaches (global and local) are compared. (C) Formulation of three combinations of number of states (Global T1, Local T1, and Local T2). (D) Decisions about the order in which thresholding and gene mapping are performed. For Order 1, gene expression is first converted to reaction activity, and the reaction activity is then thresholded; for Order 2, gene expression is thresholded first and then converted to reaction activity.
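The two orders in panel D can be illustrated with a minimal sketch. It assumes a common convention for evaluating gene-protein-reaction rules (AND as minimum, OR as maximum), which may not match GM1 or GM2 exactly. The rule, the expression values, the thresholds, and the function names `evaluate_gpr`, `order1`, and `order2` are all hypothetical and chosen only for illustration.

```python
def evaluate_gpr(rule, values):
    """Evaluate a nested rule such as ("or", ("and", "g1", "g2"), "g3").

    Assumed convention: AND takes the minimum, OR takes the maximum.
    Works on numeric expression values and on binary (True/False) states.
    """
    if isinstance(rule, str):
        return values[rule]
    op, *operands = rule
    scores = [evaluate_gpr(r, values) for r in operands]
    return min(scores) if op == "and" else max(scores)


def order1(rule, expression, reaction_threshold):
    # Order 1: gene expression -> reaction activity -> thresholding.
    return evaluate_gpr(rule, expression) >= reaction_threshold


def order2(rule, expression, gene_thresholds):
    # Order 2: thresholding of gene expression -> reaction activity.
    states = {g: expression[g] >= gene_thresholds[g] for g in expression}
    return evaluate_gpr(rule, states)


# Hypothetical reaction catalysed by (g1 AND g2) OR g3.
rule = ("or", ("and", "g1", "g2"), "g3")
expression = {"g1": 8.0, "g2": 2.0, "g3": 4.0}

active_order1 = order1(rule, expression, reaction_threshold=5.0)
active_order2 = order2(rule, expression, {"g1": 5.0, "g2": 1.0, "g3": 5.0})
```

In this toy case the reaction is inactive under Order 1 and active under Order 2, because Order 2 applies gene-specific thresholds before the rule is evaluated. With a single threshold shared by all genes, the min/max convention assumed here would give the same answer in both orders.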
Table 1.
Decisions involved in transcriptomic data preprocessing.
Fig 2.
Preprocessing decisions affect the definition of active reactions sets.
(A) Twenty different combinations of preprocessing decisions led to a large diversity in the number of reactions considered active. (B) The first three principal components (PCs) explain most of the variance in the number of active reactions in a GEM. (C) Thresholding contributes the most to the first PC; more specifically, the main contributor is the thresholding approach (i.e., local or global). (D, E, and F) The influence of thresholding parameter selection is clear in the first PC (F), while the networks are less influenced by the gene mapping method (E) and the order of preprocessing steps (D).
Fig 3.
Influence of preprocessing decisions on capturing tissue similarities.
Visual representation using a Principal Coordinates Analysis of the similarity between tissues grouped by organ system for each preprocessing decision (numbers in legends are the mean Euclidean distance of the tissues belonging to each group; F–Female reproductive group, G–Gastrointestinal group, and L–Lymphatic group).
Fig 4.
Preprocessing decisions influence the significance of tissue grouping at organ-system level.
We compared the mean Euclidean distance observed between tissues belonging to the same organ system to the mean Euclidean distance for 10,000 randomly selected groups with the same number of tissues. The significance of the grouping (P-value) is computed, for each organ system, as the proportion of random groups whose distance is lower than the observed distance.
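The randomization test described in this legend can be sketched as follows. The sketch assumes that the "mean Euclidean distance" of a group is the mean over all pairs of tissues in the group, and that random groups are drawn without replacement from all tissues; the function names, the tissue labels, and the coordinates are hypothetical.

```python
import itertools
import math
import random


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of points (needs >= 2 points)."""
    pairs = list(itertools.combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def grouping_p_value(coords, group, n_random=10000, seed=0):
    """Proportion of random groups whose mean distance is lower than observed.

    coords: dict mapping tissue name -> coordinate tuple.
    group:  list of tissue names forming one organ system.
    """
    rng = random.Random(seed)
    observed = mean_pairwise_distance([coords[t] for t in group])
    tissues = list(coords)
    lower = 0
    for _ in range(n_random):
        sample = rng.sample(tissues, len(group))  # same group size, no repeats
        if mean_pairwise_distance([coords[t] for t in sample]) < observed:
            lower += 1
    return lower / n_random


# Hypothetical 2-D coordinates: a, b, c form a tight cluster.
coords = {
    "a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.1),
    "d": (5.0, 5.0), "e": (-5.0, 4.0), "f": (6.0, -3.0), "g": (-4.0, -6.0),
}
p_value = grouping_p_value(coords, ["a", "b", "c"], n_random=1000)
```

A small P-value indicates that the tissues of the group lie closer together than randomly chosen groups of the same size.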
Fig 5.
Comparison of active reaction lists obtained using different thresholding methods with a manually curated resource.
Tissues where pathways are known to be active are bolded and colored on the y-axis and colored on the binary heatmap. (A) All thresholding methods accurately capture heme synthesis in bone marrow but not in liver. Global50 enriches the pathway in many other tissues, although the pathway is known to occur only in bone marrow and liver. (B) Androgen and estrogen synthesis and metabolism is known to occur in brain, adrenal gland, skin, adipose tissue, bone marrow, skeletal muscle, and smooth muscle. Global75 enriches the pathway in a higher number of tissues, and global50 and local25 fail to enrich it in a lower number of tissues, compared to the other methods. (C) The citric acid cycle is known to occur in all tissues, but local25 and local25-90 do not enrich it in many of the tissues, unlike the other methods. Together, these results suggest that local25-75 performs better than the other thresholding methods when all pathway types are considered together (Fig F in S1 Text).