Fig 1.
An isoform transcript network based on protein domain-domain interactions.
(A) The subnetwork shows the domain-domain interactions among transcripts from four human genes: CD79B, CD79A, LCK and SYK. The nodes represent isoform transcripts, grouped and annotated by gene name, and the edges represent domain-domain interactions between two transcripts. Each edge is annotated with the interacting domains in the two transcripts. (B) RefSeq transcript annotations of CD79A and CD79B are shown with Pfam domains marked in color. The Pfam domains were detected with the Pfam-Scan software. Note that, for modeling simplicity, self-interactions are not assumed, so no interaction is included between transcripts NM_001039933 and NM_000626 of gene CD79B. For better visualization, only the interactions that coincide with protein-protein interactions (PPIs) are shown in the figure.
Table 1.
Notations.
Table 2.
Network characteristics.
Fig 2.
Transcript interaction neighborhood.
In this toy example, transcript Tik has four neighbor transcripts {Tg1 a, Tg2 b, Tg3 c, Tg4 d}, which are transcripts from genes g1, g2, g3 and g4, respectively. The neighborhood expression ϕik of Tik is calculated as the average of its neighbor transcripts' expressions, further normalized by transcript length. In the figure, this is represented as the vector product between π and S*,(i,k), normalized by the number of neighbors ∑S*,(i,k) and the transcript length lik.
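The calculation described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `neighborhood_expression`, the representation of π as a plain list, the 0/1 encoding of the column S*,(i,k), and the convention of returning 0 for a transcript without neighbors are all hypothetical choices made for the example.

```python
def neighborhood_expression(pi, s_col, length):
    """Average neighbor expression, normalized by transcript length.

    pi     -- expression values of all transcripts (assumed ordering)
    s_col  -- 0/1 indicator column S_{*,(i,k)} marking the neighbors of T_ik
    length -- transcript length l_ik
    """
    if len(pi) != len(s_col):
        raise ValueError("pi and s_col must have the same length")
    n_neighbors = sum(s_col)
    if n_neighbors == 0:
        # Assumption for this sketch: an isolated transcript gets 0.
        return 0.0
    # Vector product between pi and the indicator column.
    dot = sum(p * s for p, s in zip(pi, s_col))
    # Normalize by the number of neighbors and by the transcript length.
    return dot / (n_neighbors * length)


# Toy example: four neighbors among five transcripts, l_ik = 10.
phi = neighborhood_expression([2.0, 4.0, 6.0, 8.0, 100.0], [1, 1, 1, 1, 0], 10)
```

Only the four flagged transcripts contribute to the sum; the fifth entry of `pi` is ignored because its indicator is 0.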
Table 3.
Summary of patient samples in TCGA datasets.
The samples are classified by cutoffs on survival and relapse time based on the available clinical information in each dataset.
Fig 3.
Correlation between transcript co-expression and protein domain-domain interaction in TCGA datasets.
For each dataset, the correlation coefficient between the expressions of each pair of transcripts, as estimated by Cufflinks, is first calculated across all patient samples. The correlation coefficients are then sorted from largest to smallest and grouped into bins of 1000 pairs each. The x-axis is the bin index, with a lower index indicating larger correlation coefficients. The y-axis is the number of transcript pairs, among the 1000 pairs in each bin, that coincide with a protein domain-domain interaction between the two transcripts. The red line is a smoothed curve fitted to the data by locally weighted linear least-squares regression (LOWESS). The p-value is reported by a chi-square test. (A) Co-expressions are calculated based on the small gene list. (B) Co-expressions are calculated based on the large gene list. In both (A) and (B), the left column shows the plots based on the connected transcript pairs in the transcript network, and the right column shows the plots based on the transcript pairs with distance up to 2 in the network.
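The sorting and binning step behind the y-axis can be sketched as follows. This is an illustrative sketch only: the function name `ddi_counts_per_bin`, the tuple layout of `pairs`, and the use of `frozenset` to represent unordered interaction pairs are assumptions made for the example, and the final bin may hold fewer pairs than `bin_size` when the total is not a multiple of it.

```python
def ddi_counts_per_bin(pairs, ddi_pairs, bin_size=1000):
    """Count interacting transcript pairs in each correlation-ranked bin.

    pairs     -- list of (transcript_a, transcript_b, correlation)
    ddi_pairs -- set of frozenset({a, b}) for pairs with a domain-domain
                 interaction (unordered, so (a, b) matches (b, a))
    """
    # Sort from largest to smallest correlation coefficient.
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    counts = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        counts.append(
            sum(1 for a, b, _ in chunk if frozenset((a, b)) in ddi_pairs)
        )
    return counts


# Toy example with a bin size of 2 instead of 1000.
toy_pairs = [("t1", "t2", 0.9), ("t1", "t3", 0.1),
             ("t2", "t3", 0.8), ("t3", "t4", 0.5)]
toy_ddi = {frozenset(("t1", "t2")), frozenset(("t4", "t3"))}
toy_counts = ddi_counts_per_bin(toy_pairs, toy_ddi, bin_size=2)
```

Plotting `toy_counts` against the bin index gives the kind of curve to which the smoothed line is fitted.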
Fig 4.
Correlation between estimated transcript expressions and ground truth in simulation.
In (A) and (B), the x-axis is labeled with the compared methods and the different λ parameters of Net-RSTQ. The bar plots show the results of running Net-RSTQ with 100 randomized networks. In (C) and (D), the x-axis is the percentage of edges removed from the networks. The plots show the results of running Net-RSTQ with the incomplete networks. (A) and (C) report the results for the 109 transcripts that are isoforms of the same gene with different domain-domain interactions. (B) and (D) report the results for the 712 isoforms in genes with multiple isoforms.
Fig 5.
Validation by comparison with qRT-PCR results.
(A) The scatter plots compare the relative proportion of each pair of isoforms of each gene reported by the computational methods (Net-RSTQ, base EM, Cufflinks, and RSEM) with that reported by the qRT-PCR experiments. The proportions of the two compared isoforms in a pair are normalized to sum to 1. The x-axis and y-axis are the relative proportion of one of the two isoforms (the other is 1 minus this proportion) reported by qRT-PCR and by the computational methods, respectively. Scatter points closer to the diagonal line indicate that a computational method's estimates better match the qRT-PCR results. The unshaded gradient around the diagonal line shows the regions with differences less than 0.1, 0.15, 0.2 and 0.25, within which the estimations are more similar to the qRT-PCR results. (B)-(D) The scatter plots for each individual dataset. (E) The table shows the percentage of predictions by each method within the unshaded regions and the overall root mean square error (RMSE) of the predictions by each method compared to the qRT-PCR results.
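The quantities summarized in panel (E) can be illustrated with a short Python sketch. It is written under assumptions rather than taken from the paper: the helper names `pair_proportion`, `rmse` and `fraction_within` are hypothetical, and the "difference" is taken here to be the absolute difference between a method's proportion and the qRT-PCR proportion for the same isoform.

```python
import math


def pair_proportion(a, b):
    """Relative proportion of isoform a in the pair (a, b); sums to 1 with b's."""
    total = a + b
    if total == 0:
        raise ValueError("both isoform abundances are zero")
    return a / total


def rmse(predicted, reference):
    """Root mean square error between two equally long lists of proportions."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def fraction_within(predicted, reference, threshold):
    """Fraction of predictions whose absolute difference is below threshold."""
    hits = sum(1 for p, r in zip(predicted, reference) if abs(p - r) < threshold)
    return hits / len(predicted)


# Toy example: two isoform pairs, one matched exactly and one off by 0.4.
method = [pair_proportion(1.0, 1.0), pair_proportion(7.0, 3.0)]
qpcr = [0.5, 0.3]
error = rmse(method, qpcr)
share = fraction_within(method, qpcr, 0.1)
```

Calling `fraction_within` once per threshold (0.1, 0.15, 0.2, 0.25) would give one row of percentages per method.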
Table 4.
Classification performance of estimated transcript expressions and gene expression on the small cancer gene list.
The mean AUC scores of classifying patients by estimated transcript (gene) expression in four-fold cross-validation are reported for each dataset. The best AUCs across the five models using isoforms as features are shown in bold.
Table 5.
Classification performance of estimated transcript expressions and gene expression on the large cancer gene list.
The mean AUC scores of classifying patients by estimated transcript (gene) expression in four-fold cross-validation are reported for each dataset. The best AUCs across the five models are shown in bold.
Fig 6.
Statistical analysis with randomized networks.
Comparison of the classification results obtained with the randomized networks and with the true network. The λ parameter was fixed at 0.1 in all the experiments. The blue star and the red star represent the results with the real network and without a network (base EM), respectively. The boxplot shows the results with the randomized networks.
Fig 7.
The plots show the CPU time (Intel Xeon E5-1620 at 3.70 GHz) for running the Net-RSTQ algorithm on three networks: the small transcript network, the large transcript network, and an artificial huge network of 10000 transcripts.