Table 1.
Properties of assays measuring 3D genome architecture generated by 4DN and included in this study.
Fig 1.
Available contact maps from 4DN.
Each panel displays a contact map for chromosome 19 from one assay and one biosample, with normalized log counts shown as color. Only cell types and assays with at least two experiments each were included. Each of the 41 non-missing contact maps has a colored border indicating whether it was used for training (green), validation (orange), or testing (red).
Fig 2.
Each dimension of the 4D input is encoded using latent factors, and the concatenated factors are processed by a multi-layer perceptron. The figure shows an example of predicting the contact between positions 2 and 5 in DNA SPRITE H1-hESC. The final output of the model is the predicted normalized contact count.
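The architecture described above can be sketched as a forward pass. This is a minimal illustration, not the authors' implementation: the factor sizes, the single hidden layer, the ReLU activation, and the choice to share one factor table between the two positions are all assumptions, and the names `make_factors`, `dense`, and `predict_contact` are hypothetical.

```python
import random

def make_factors(n_items, n_dims, rng):
    """Randomly initialised latent factors: one vector per item."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_dims)] for _ in range(n_items)]

def dense(x, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def predict_contact(assay, cell, pos_i, pos_j, params):
    """Concatenate the four latent factors and pass them through an MLP."""
    x = (params["assay"][assay] + params["cell"][cell]
         + params["pos"][pos_i] + params["pos"][pos_j])
    hidden = relu(dense(x, params["w1"], params["b1"]))
    return dense(hidden, params["w2"], params["b2"])[0]

rng = random.Random(0)
n_assay, n_cell, n_pos = 3, 4, 10               # hypothetical data set sizes
d_assay, d_cell, d_pos, n_hidden = 2, 2, 3, 8   # hypothetical factor and layer sizes
n_in = d_assay + d_cell + 2 * d_pos
params = {
    "assay": make_factors(n_assay, d_assay, rng),
    "cell": make_factors(n_cell, d_cell, rng),
    "pos": make_factors(n_pos, d_pos, rng),     # assumed shared by both positions
    "w1": make_factors(n_hidden, n_in, rng),
    "b1": [0.0] * n_hidden,
    "w2": make_factors(1, n_hidden, rng),
    "b2": [0.0],
}

# Predicted normalized contact count between positions 2 and 5
# for one (assay, cell type) pair, using untrained weights.
y = predict_contact(assay=0, cell=1, pos_i=2, pos_j=5, params=params)
```

In training, the factor tables and the MLP weights would be fit jointly by minimizing the MSE against observed contacts; that step is omitted here.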
Fig 3.
Sphinx outperforms the mean-model baseline.
(a) Sphinx was trained using 500 randomly selected settings of seven different hyperparameters. Each panel plots the validation set MSE (x-axis) for various values of a particular hyperparameter (y-axis). MSEs are shown as horizontal lines for the three baseline methods: cross-mean (red line), same cell type (magenta), and same assay type (green). All analyses are based on data from chromosome 19. (b) Each panel plots the test set MSE achieved by Sphinx (x-axis) versus the cross-mean baseline (y-axis). Each point corresponds to one of the 14 assay/cell type combinations in the test set. (c) A comparison of the number of training experiments that shared assay (blue), cell type (orange), or either assay or cell type (green) versus the MSE of the test set example. Each point is one test set example. The x-axis is jittered for visibility. Least squares lines are shown along with their associated R² values.
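The mean baselines and the MSE comparison can be sketched as follows. The caption does not define the baselines precisely, so this is an assumption: the same-assay and same-cell-type baselines are taken to be element-wise means over the matching training maps, and the cross-mean is taken to be the average of those two. The helper names and the toy keys `assay_B` and `cell_B` are hypothetical.

```python
def mse(pred, truth):
    """Mean squared error between two equally sized flattened maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def mean_baseline(train, keep):
    """Element-wise mean of the training maps whose (assay, cell) key passes `keep`."""
    chosen = [m for key, m in train.items() if keep(key)]
    if not chosen:
        raise ValueError("no training map matches this baseline")
    return [sum(vals) / len(chosen) for vals in zip(*chosen)]

# Toy flattened contact maps keyed by (assay, cell type).
train = {
    ("DNA SPRITE", "cell_B"): [5.0, 0.0, 1.0],
    ("assay_B", "H1-hESC"): [1.0, 2.0, 3.0],
    ("assay_B", "cell_B"): [3.0, 4.0, 5.0],
}
target_key, target_map = ("DNA SPRITE", "H1-hESC"), [4.0, 1.0, 2.0]
assay, cell = target_key

same_assay = mean_baseline(train, lambda k: k[0] == assay)
same_cell = mean_baseline(train, lambda k: k[1] == cell)
# Assumed definition of the cross-mean: average of the two baselines above.
cross = [(a + c) / 2 for a, c in zip(same_assay, same_cell)]

for name, pred in [("same assay", same_assay), ("same cell type", same_cell), ("cross-mean", cross)]:
    print(name, mse(pred, target_map))
```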
Table 2.
Biosamples included in the data set.
Fig 4.
Imputation enables visualization of unobserved contact decay profiles.
Contact decay profiles for training (green outline), validation (orange outline), test (red outline) and unobserved data (purple outline) are shown. The inset shows an example of one contact decay profile with the distance from the diagonal (X-axis) plotted against the intensity (Y-axis), which is the mean value for all normalized contacts at that distance from the diagonal. Axes are consistent across all panels.
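The contact decay profile defined above (mean normalized contact at each distance from the diagonal) can be computed as in this minimal sketch. The function name and the toy matrix are illustrative; the matrix is assumed to be square and fully observed.

```python
def contact_decay_profile(matrix):
    """Mean contact value at each distance d from the main diagonal."""
    n = len(matrix)
    profile = []
    for d in range(n):
        # All entries exactly d bins away from the diagonal (upper triangle).
        diagonal = [matrix[i][i + d] for i in range(n - d)]
        profile.append(sum(diagonal) / len(diagonal))
    return profile

toy = [
    [4.0, 2.0, 1.0],
    [2.0, 4.0, 2.0],
    [1.0, 2.0, 4.0],
]
profile = contact_decay_profile(toy)  # [4.0, 2.0, 1.0]
```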
Fig 5.
Imputation enables visualization of unobserved eigenvector profiles.
Eigenvector profiles for training (green outline), validation (orange outline), test (red outline) and unobserved data (purple outline) are shown. The inset shows an example of the eigenvector profile with the position (X-axis) plotted against the eigenvector value (Y-axis). Axes are consistent across all panels.
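The caption does not state how the eigenvector profile is derived. One common approach, assumed here, takes the leading eigenvector of a symmetric matrix computed from the contact map (often a correlation matrix of distance-normalized contacts). The sketch below covers only the eigenvector step, using power iteration on a toy correlation-like matrix; the function name is hypothetical.

```python
import math
import random

def leading_eigenvector(matrix, n_iter=200, seed=0):
    """Power iteration: returns (eigenvalue, unit eigenvector) of largest magnitude."""
    rng = random.Random(seed)
    n = len(matrix)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(a * x for a, x in zip(matrix[i], v)) for i in range(n))
    return eigenvalue, v

# Toy matrix: bins 0 and 1 correlate with each other and anti-correlate with bin 2.
toy = [
    [1.0, 0.8, -0.6],
    [0.8, 1.0, -0.5],
    [-0.6, -0.5, 1.0],
]
eigenvalue, profile = leading_eigenvector(toy)
# The overall sign of an eigenvector is arbitrary; only the relative signs carry meaning.
```

Bins whose entries share a sign in `profile` would be grouped together; here bins 0 and 1 fall on one side and bin 2 on the other.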
Fig 6.
Imputation enables visualization of unobserved insulation score profiles.
Insulation score profiles for training (green outline), validation (orange outline), test (red outline) and unobserved data (purple outline) are shown. The inset shows an example of the insulation score profile with the position (X-axis) plotted against the insulation (Y-axis). Axes are consistent across all panels.
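The caption does not define the insulation score. A minimal sketch under the common sliding-window definition is given below: for each bin boundary, average the contacts between the `window` bins upstream and the `window` bins downstream, then take the log2 ratio to the mean over all boundaries. The window size, the normalization, and the function name are assumptions, and the sketch requires strictly positive contact values.

```python
import math

def insulation_scores(matrix, window):
    """Log2 of the mean contact crossing each boundary, relative to the mean over boundaries."""
    n = len(matrix)
    raw = []
    # Boundary b sits between bins b-1 and b; skip edges where the window does not fit.
    for b in range(window, n - window + 1):
        square = [matrix[i][j]
                  for i in range(b - window, b)
                  for j in range(b, b + window)]
        raw.append(sum(square) / len(square))
    mean_raw = sum(raw) / len(raw)
    return [math.log2(r / mean_raw) for r in raw]

# Toy map with two 3-bin domains: strong contacts within a domain, weak across.
toy = [[4.0 if (i < 3) == (j < 3) else 1.0 for j in range(6)] for i in range(6)]
scores = insulation_scores(toy, window=2)
# The minimum falls at the boundary between bins 2 and 3, where the domains meet.
```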
Fig 7.
Comparisons of cell types and assay types based on contact decay, eigenvectors, and insulation scores.
In each panel, the Pearson correlation is calculated between contact decay profiles (left column), eigenvectors (middle column), and insulation scores (right column) for each of the cell types with shared assays (top row) and assays with shared cell types (bottom row). The lower triangle shows the correlation after missing contact maps have been imputed, and the upper triangle shows the correlation with no imputation. Blue squares indicate comparisons that were not possible due to unobserved experiments. The dendrogram is computed using the imputed values.
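The pairwise comparison described above reduces to a Pearson correlation between two equal-length profiles. The sketch below builds a small correlation matrix from toy profiles; the profile values and the names `cell_A`, `cell_B`, and `cell_C` are hypothetical, and the dendrogram step is omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(profiles):
    """All-pairs Pearson correlations, with rows and columns in sorted name order."""
    names = sorted(profiles)
    return names, [[pearson(profiles[a], profiles[b]) for b in names] for a in names]

profiles = {
    "cell_A": [1.0, 2.0, 3.0, 4.0],
    "cell_B": [2.0, 4.0, 6.0, 8.0],
    "cell_C": [4.0, 3.0, 2.0, 1.0],
}
names, corr = correlation_matrix(profiles)
```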
Fig 8.
On sparser data, Sphinx outperforms the baseline on the validation data but not on the test data.
(A) We conducted a hyperparameter search using the standard Sphinx model for chromosome 1, with around 150 hyperparameter settings tested. (B) Raw MSE, contact decay profile MSE, eigenvector MSE, and insulation score MSE are compared between Sphinx and the cross-mean model, as in Fig 3. (C) The training curve, validation curve, and baselines for the best performing model on chromosome 1. The training curve is a rolling average of 10 batches. (D) A histogram of the fraction of nonzero elements in all data samples in chromosome 1.
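Two quantities in this caption are simple to state in code: the rolling average used to smooth the training curve, and the fraction of nonzero elements per sample. The sketch below is illustrative only. In particular, averaging over fewer than `window` batches at the start of the curve is an assumption, as are the function names.

```python
def rolling_mean(values, window=10):
    """Trailing rolling average; early entries average over the batches seen so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the batch that left the window
        out.append(running / min(i + 1, window))
    return out

def nonzero_fraction(matrix):
    """Fraction of entries in a contact map that are not zero."""
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v != 0) / len(cells)

smoothed = rolling_mean([1.0, 2.0, 3.0, 4.0], window=2)   # [1.0, 1.5, 2.5, 3.5]
density = nonzero_fraction([[0.0, 1.0], [2.0, 0.0]])       # 0.5
```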
Fig 9.
Convolutional Sphinx architecture does not improve performance over standard architecture.
(a) Cell type and assay factors are input into the hidden layers as in Fig 2. Position factors are first convolved in a window of neighboring factors. The window size is shown as 1 in the figure as an example. The result of the convolution is input into the hidden layers. (b) A comparison of the raw, contact decay profile, eigenvector, and insulation score MSEs. The black line indicates y = x. (c) The training curve (rolling average over 10 batches) and validation curve (plotted per epoch) are shown and compared to baselines.
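The convolution over neighboring position factors can be sketched as a weighted sum across a window centered on the target position. The details here are assumptions rather than the authors' layer: one scalar kernel weight per offset, zero padding at the chromosome ends, and a fixed (not learned) kernel. The function name is hypothetical.

```python
def convolve_position(factors, p, window, kernel):
    """Weighted sum of the factors at positions p-window .. p+window (zero-padded at the ends)."""
    dim = len(factors[0])
    out = [0.0] * dim
    for k, offset in enumerate(range(-window, window + 1)):
        q = p + offset
        if 0 <= q < len(factors):          # positions off the chromosome contribute nothing
            for d in range(dim):
                out[d] += kernel[k] * factors[q][d]
    return out

# Toy position factors (3 positions, 2 dimensions) and a window of 1,
# so each output mixes a position with its two immediate neighbors.
factors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
kernel = [0.25, 0.5, 0.25]
middle = convolve_position(factors, p=1, window=1, kernel=kernel)  # [0.75, 1.0]
edge = convolve_position(factors, p=0, window=1, kernel=kernel)    # [0.5, 0.25]
```

The convolved vector would then replace the raw position factor in the concatenated input to the hidden layers.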
Fig 10.
Sphinx imputes contact maps for unobserved assay-cell type combinations.
The plot is similar to Fig 1, except that the unobserved contact maps (purple outline) are imputed by Sphinx. Each panel is a contact map for chromosome 1, with normalized log counts shown as color.