Mesoscopic Model and Free Energy Landscape for Protein-DNA Binding Sites: Analysis of Cyanobacterial Promoters

The identification of protein binding sites in promoter sequences is a key problem to understand and control regulation in biochemistry and biotechnological processes. We use a computational method to analyze promoters from a given genome. Our approach is based on a physical model at the mesoscopic level of protein-DNA interaction based on the influence of DNA local conformation on the dynamics of a general particle along the chain. Following the proposed model, the joined dynamics of the protein particle and the DNA portion of interest, only characterized by its base pair sequence, is simulated. The simulation output is analyzed by generating and analyzing the Free Energy Landscape of the system. In order to prove the capacity of prediction of our computational method we have analyzed nine promoters of Anabaena PCC 7120. We are able to identify the transcription starting site of each of the promoters as the most populated macrostate in the dynamics. The developed procedure allows also to characterize promoter macrostates in terms of thermo-statistical magnitudes (free energy and entropy), with valuable biological implications. Our results agree with independent previous experimental results. Thus, our methods appear as a powerful complementary tool for identifying protein binding sites in promoter sequences.


Materials and Methods (extended)
We extend the Materials and Methods section by writing explicitly the Langevin equations of motion and by explaining in a detailed way the analysis algorithm we use.

Langevin equations of motion
The model is simulated by integrating numerically the Langevin equation for both the chain base pairs and the particle. The explicit Langevin equations for the N base pairs are m ∂ 2 y i ∂t 2 + mη where η is the damping and ξ i (t) a white thermal noise (and so ξ i (t) = 0 and ξ i (t)ξ k (t ) = 2mηk B T δ ik δ(t − t )). The particle follows so analogously η p stands for the damping and ξ p (t) for the thermal noise: ξ p (t) = 0 and ξ p (t)ξ p (t ) = 2m p η p k B T δ(t − t ).

Table of parameters
The parameters of the system are the following: • Base-pair parameters: m = 300 Da, η = 5 ps −1 .

Conformational Markov Network
The Conformational Markov Network (CMN) appears as a useful coarse-grained representation of large stochastic trajectories. This picture is obtained by discretizing the conformational space explored by the system and considering the dynamical jumps between the discretized configurations along the simulation. In this sense, the nodes of the complex network are defined by the discretized states, while the links account for the observed transitions between them. The arising network is thus a weighted and directed graph. In our case, the conformational space is defined by the five first principal components, in order to reduce the number of degrees of freedom, keeping indeed the essential features of our system. We divide each of the principal component into 20 cells of equal volume, while the particle's trajectory is divided into N bins, coinciding with the N possible base pairs the particle can occupy. Our discretized conformational space is thus made up of N × 20 5 possible states, which may be or not occupied within the stochastic trajectory. We assign each node a weight P i accounting for the fraction of trajectory that the system has visited within the trajectory. The normalization condition i P i = 1 holds. Secondly, the value P ij is assigned to each directional link accounting for the dynamical jumps from node j to i. Self-loops can exist, and thus P ii = 0. Finally the normalization condition i P ij = 1 is forced. According to this, the CMN is totally defined by the occupancy vector P = {P i } and the transition matrixS = {P ij }. The matrixS is the transition probability of the Markov chain defined by: where v(t) it the probability distribution at time t. If the trajectory is long enough,S is ergodic and time invariant, vector P coincides with the stationary distribution associated with the Markov chain P =SP. Moreover, the detailed balance condition must hold:

Stochastic Steepest Descent
Once we have translated de molecular dynamics trajectories onto a CMN, we apply the stochastic steepest descent (SSD) algorithm in order to split it into its basins of attraction in an efficient way, obtaining in turn useful thermo-statistical information about the system.The SSD algorithm is inspired in the deterministic steepest descent algorithm used to find minima in a multidimensional surface. We define the assisting vector W = {w i }, where i labels the nodes. The steps of the SSD algorithm are the following: 1. We start with W = 0.
2. Select randomly a node l with w l = 0 and write an auxiliary list of nodes adding l as first entry.
3. Select within the neighbors of l the node m that follows the maximum probability flux, this is P ml = max{P jl,∀j =l }. Check which of the following conditions is fulfilled: (a) If P ml > P lm and w m = 0, add m to the list and go back to 3. using m instead of l.
(b) If P ml > P lm and w m = 0 write the labels of all the nodes in the list as w j = w m . Go back to step 3.
(c) If P ml ≤ P lm remove link l → m from the graph. Return to point 3.
This process ends when every node in the CMN has been labelled, this is w i = 0∀i. Then, the whole conformational space has been characterized and every node is connected with its local minima in the FEL. All nodes with the same label belong to the same basin in this FEL and therefore we can associate them with the same conformational state.
Given the basin partition, a new CMN network can be built, taken the basins themselves as new nodes. The occupation probabilities will now be defined as P α = i∈α P i , while the transition probabilities P βα = i∈α j∈β P ji P i / i∈α P i . From this definitions rate constants can be easily calculated as k αβ = P βα /∆t, while the relative free energy of basin α with respect to basin β is simply ∆F α = −k B T log(P α /P β ).

Free Energy dendrogram
The free energy dendrogram is a hierarchical representation of the FEL of the system that can be directly built from the basin structure. Taking F/k B T as previously defined as control parameter. We can reconstruct now the CMN by increasing gradually its value from an initial cut-off value. At each step, this cut-off is increased and new nodes emerge together with their links. This reconstruction provides a hierarchical picture of the nodes together with the way they are connected with each other. The graphical representation of this picture is the free energy dendrogram (also called disconnectivity graph), where each basin is represented in terms of its free energy and the barriers between basins represent the hierarchical relationship between each basin.

Definition of macrostates and non specific states
Considering the hierarchical free energy representation of the basin CMN we apply a new clustering procedure in order to define the macrostates of our system. This procedure is accomplished in order to provide a physical meaning to the states according to the purpose of our model. Typically, in our basin structure, we observe groups of basins which represent very similar physical states separated by small free energy barriers. It is plausible to think that such basins will be merged into a single state within short transition times.
The macrostates of the system are built according to this characteristic by clustering basins separated by barriers lower than 1.5k B T . The system is able to jump thermally between these basins within short times, so we consider they represent the same physical state. The occupancies and transition probabilities of this clustered version of the basin CMN are constructed in an straight forward way.
The basin CMN structure shows another additional feature that allows us to distinguish between specific and non-specific states. Typically, up to the 90% of the network weight is distributed by less than the 1% of the basins. We name this large group of low populated basins as non-specific states. They represent short-lived transitionary states where the particle goes over the sequence searching for an stable binding site. The accumulated weight of these basins π N S is considered to be that of the non-specific states and is used as reference value to calculate the free energy difference between an specific state i defined by its occupancy π i and the non specific state ∆F i /k B T = log π i /π N S .

Supplementary figures
We include the free energy dendrograms for the six remaining promoters not shown in the main text.