On the Characterization and Software Implementation of General Protein Lattice Models

Abstract models of proteins have been widely used as a practical means to computationally investigate general properties of the system. In lattice models any sterically feasible conformation is represented as a self-avoiding walk on a lattice, and residue types are limited in number. So far, only two- or three-dimensional lattices have been used. The inspection of the neighborhood of alpha carbons in the core of real proteins reveals that also lattices with higher coordination numbers, possibly in higher dimensional spaces, can be adopted. In this paper, a new general parametric lattice model for simplified protein conformations is proposed and investigated. It is shown how the supporting software can be consistently designed to let algorithms that operate on protein structures be implemented in a lattice-agnostic way. The necessary theoretical foundations are developed and organically presented, pinpointing the role of the concept of main directions in lattice-agnostic model handling. Subsequently, the model features across dimensions and lattice types are explored in tests performed on benchmark protein sequences, using a Python implementation. Simulations give insights on the use of square and triangular lattices in a range of dimensions. The trend of potential minimum for sequences of different lengths, varying the lattice dimension, is uncovered. Moreover, an extensive quantitative characterization of the usage of the so-called “move types” is reported for the first time. The proposed general framework for the development of lattice models is simple yet complete, and an object-oriented architecture can be proficiently employed for the supporting software, by designing ad-hoc classes. The proposed framework represents a new general viewpoint that potentially subsumes a number of solutions previously studied. The adoption of the described model pushes to look at protein structure issues from a more general and essential perspective, making computational investigations over simplified models more straightforward as well.


Introduction
In the wide assortment of reduced models of proteins [1], minimalist representations have been proposed and widely used in the last decades as a practical means to computationally investigate general properties of these polymers [2,3]. In pursuing simplification, the number of possible conformations can be reduced by imposing residues to be located only on vertices of a given lattice. According to this vision, in the so-called lattice models, any valid (i.e., sterically feasible) conformation is represented as a self-avoiding walk on the lattice [4], with adjacent residues placed on adjacent vertices. Moreover, the number of residue types can be drastically restricted, e.g. just setting apart ''hydrophobic'' and ''polar'' ones, as it happens in the HP model, which can be regarded as the paradigmatic example [5] of protein lattice models. The HP model was originally proposed on a square two-dimensional lattice and it has also been exploited to understand the behavior of specific real proteins, such as chaperonins [6,7].
Generally speaking, lattice models have found application in multiple aspects of theoretical investigations on proteins: exploration of the conformation space [8], analysis of folding pathways [9], dynamics and thermodynamics of the folding process [10], evolutionary issues and origin of long-range interactions [11], strategies to enforce protein stability [12]. Although the large majority of lattice models do not encompass a representation of side chains, they are able to accommodate also this possibility and in some studies a single residue is modeled with a lattice point for the alpha carbon along with an adjacent point for the whole side chain [13,14]. Recently, lattice models have been used to characterize the placement of termini in native protein structures [15], and they have found application also in the investigation of RNA folding [16]. For many applications, the accuracy gap between all-atom models and lattice models is indeed significant, and attempts to bridge it have been proposed by projecting the former onto the latter via optimization methods [17,18].
Despite their simplicity, lattice models actually show protein-like features [13], indicating that they incorporate the fundamental physical principles of proteins.

General Lattice Models: How and Why
In the context of the present discussion, the primary structure of a protein can be represented with a string s whose characters are taken out of a standard alphabet S that encodes the possible monomer types. E.g., in the classical HP model we have S~f'H','P'g. The symbol s i (or s½i) indicates the type (character) of the i-th residue.
For the sake of clearness and precision, a lattice L is defined as a subset of R N , containing vertices (points) that are orderly placed. In the most regular case, it can be expressed as The set of vectors fu 0 , . . . u N{1 g is a basis for R N that, respect to the standard basis, can be represented by the basis matrix B L~u0 . . . u N{1 ð Þ . Any lattice vertex is identified through its integer coordinates a i ,i~0 . . . N{1. Moreover, it is sometimes useful to explicitly consider also edges in the lattices, i.e., the unordered couples of vertices that represents ''connections''. In a large number of practical applications, a reasonable way to specify edges makes use of a given threshold for the distance of vertices that possibly define an edge: using the ordinary euclidean norm : j j 2 . The value of ffiffiffi k p is generally set to the minimum yielding E L =1. Regardless of the way edges are specified, two distinct vertices in an edge are said to be adjacent.
Protein lattice models have been proposed so far on different lattice types, in two and three dimensions. An extensive formal treatment of lattices is beyond the scope of this discussion: further details can be found e.g. in the book by Conway and Sloane [19]. In this paper, a generalization of conformation studies by employing parametrical regular lattices is proposed, along with indications on how software support can be provided to this aim. In this perspective, the lattice type and dimension can be simply regarded as two parameters in the characterization of the search space. This approach is motivated by the observation that, in a lattice model, the neighborhood of each residue placed on a given vertex corresponds to the set of adjacent vertices, which are necessarily limited in number. More formally, in a lattice the coordination number (here indicated with z) is the number of adjacent vertices for any single vertex (i.e., the cardinality of the so-called vertex neighborhood), and it depends both on the lattice type and dimension. Thus, we can accommodate as many other residues around any given residue as the coordination number of the used lattice. A viable way to increase this number is just scaling up the lattice dimension. This has been currently done passing from 2D to 3D models, but the process can be further extended. A caveat is mandatory: High dimensional models, although possibly actractive for several aspects, may lead to conformations with an unnatural surface/volume ratio, hampering their employment in the study of solvent interactions, or of multichain systems.
In the core of real proteins, residues are densely packed and alpha carbons are close to one another. We can quantitatively characterize such an accommodation, and subsequently we can to check whether one specific lattice type, with its own coordination number, is able to correctly describe the neighborhood of an alpha-carbon (C a ). To this extent, the radial distribution of C a s can be investigated. By # n (r) we indicate the number of other C a s included in a sphere of radius r centered on a given C a . Such an investigation is significant for residues in the central part of a globular protein, for r values corresponding to a sphere completely embedded in the molecule. The symbol S# n (r)T 8 denotes the average # n (r) calculated for C a s around the molecule centroid, within an upper bound r Ã (e.g. r Ã can be taken as 1=2 of the molecule gyration radius).
Each protein structure has its own actual shape of S# n (r)T 8 , but similar patterns for low values of r can be found even across different structural classes of globular proteins. In the first chart of Figure 1, four sample molecules from different classes (all-a, all-b, azb, a=b), identified in PDB as 1MBN, 1GOF, 1BQB, 2FSU are considered. The shapes of their S# n (r)T 8 are very similar up to 7 Å , and then slightly diverge. Choosing other proteins, the specific shapes may be different, but the overall trends are analogous. A simple analysis of the reported curves can suggest how many ''neighbors'' a C a may have, depending on the defined neighborhood is defined. Usually, in studies on the interactions among residues, it is assumed that the maximum distance for interacting residues is 8 Å (in any case, less than 10 Å ). In fact, it has been shown that this limit is sufficient to characterize the hydrophobic behavior of amino acid residues and to accommodate both the local and non-local interactions [20]. In other specific studies on contact potentials, cutoff values of 4, 6, 6:4, 6:5, or 7 Å have been employed [21]. According with the cited results, here we assume that the neighborhood of a C a cannot reasonably extend beyond 9 Å .
In the left chart of Figure 1, the yellow vertical belt indicates the range of possible values for the upper bound to be used in the definition of a C a neighborhood. Thus, any lattice model that is expected to correctly represent such a neighborhood should have a coordination number not lower than 3-4 and not higher than 19-20, as shown by the shaded horizontal belt that visualize such a rough interval, hereafter called admissible range. Now it is possible to point out the regular lattice types whose coordination number is included in the admissible range.
The most popular lattices employed for simplified protein models since the first works in this area [2,5] are the square lattice in two dimensions, and the three-dimensional cubic lattice, with coordination numbers 4 and 6 respectively. An unpleasant feature of this kind of lattices, known as parity constraint, is that no two residues can be placed on adjacent vertices if they are separated along the backbone by an odd number of residues. This limitation has pushed towards the exploration of different lattice types [22,23]. Other significant choices are the planar triangular lattice, and the FCC (Face Centered Cubic) lattice (which can be considered as a thee-dimensional generalization of the former), with coordination numbers 6 and 12 respectively. Finally, also the two-dimensional honeycomb lattice, called also hexagonal, with coordination number 3 (thus at the very lower boundary of the admissible range), has been explored [24]; this one is not a Bravais lattice, and the directions to reach neighbor vertices depend on the specific vertex we are placed on.
For all the lattice types usually employed so far for simplified protein models, the coordination number is located in the lower part of the admissible range. From this standpoint, the FCC lattice represents the most appropriate choice to model the C a neighborhood. In order to keep the basic structure of a square lattice, to overcome the parity constraint, and to increase the number of neighbors, studies on extended cubic lattices have been proposed [14] (obtained choosing the standard basis in Eq. (1), and k~N in Eq. (2)). Other exploration possibilities may easily come from the adoption of lattices in generic spatial dimensions, even beyond the classical planar and three-dimensional cases. In this work, a uniform handling of residue placement upon lattice vertices is pursued, and thus the following requirements are stated: i) all edges must have the same length, ii) all vertices must have the same coordination number, and iii) all neighbor points must be reached from any vertex by always moving in directions taken from the same given set. This last property does not hold for the planar honeycomb lattice, which cannot be generalized according to a straightforward, linear course. For this reason and taking into consideration the previous properties, the proposed generalization stems from square and triangular lattices. This does not represent a loss of generality, because the honeycomb model can be plainly obtained within the planar triangular lattice by imposing an angle of 120 degrees between any couple of subsequent bonds, and this constraint can also be added in the case of triangular lattices in higher dimensions.
The generalization of the square lattice to N dimensions is trivial, and the dependence of the coordination number from the dimension can be expressed as z % (N)~2N. This means that, in the square case, even going up to N~10, the lattice remains in the admissible range. The generalization of the triangular lattice to N dimensions is less self-evident. In our treatment, the corresponding coordination number can be expressed as z D (N)~N 2 zN, and the maximum N to fall in the admissible range is 4. These discussed limits for the coordination numbers can be directly visualized in the right chart of Figure 1.
In this work it has been developed a computational framework to deal with the lattice type and dimension in a seamless way, so that as a protein model is instantiated in the first place, the lattice specifications must be provided at the same time. One challenging issue is to organize the framework so that any algorithm that operates on the model may abstract as much as possible from the actual lattice details.
In the subsequent ''Methods'' section the theoretical tools to support generic lattice models in any dimension are presented, and it is discussed how algorithms can be accordingly developed, possibly within a neat object-oriented scheme.
In the ''Results and discussion'' section, by employing two classical optimization methods on benchmark sequences, the main features of the proposed models across lattice types and dimensions are uncovered, as well as the characterization of possible simple modifications to the conformations (typically used in Monte Carlo approaches).

Methods
The generalization of lattice models requires some basic notions that will be illustrated in the following, starting from the 2D case. In the first place, it is crucial understanding how to structurally characterize the lattice model, given its type and dimensions as model parameters. Later, lattice-agnostic functions to deal with the model must be developed, as well as methods alike to manipulate conformations. The computation of the chosen potential function can thus rely on such generic functions, and possibly more efficient versions for specific values of the model parameters can be developed.

Definitions
According to the desired regularity of the employed lattices, the discussion is focused on square and triangular lattices. All the edges are imposed to have the same length that, with no loss of generality, is assumed to be 1. In general, different bases can be chosen for the same lattice. The basis vectors are often referred to as primitive vectors. In Figure 2 the two-dimensional square and triangular lattices are represented, along with the coordinates of vertices in the chosen basis (which, for the square lattice, is the ordinary standard one). From a computational standpoint, it is not always convenient to resort to the standard cartesian coordinates to locate vertices. On the contrary, the employment of the definition in Eq. 1 let us use just integer values to this purpose.
A possible proper basis for the two-dimensional triangle lattice is shown in the second chart of Figure 2, whose corresponding basis matrix (referred to the standard basis) is In the general case, the set of main directions D main contains the (unit) vectors that, summed to the position vector of a given vertex, produce the position vectors of all the neighbor vertices. As previously recalled, D main k kcorresponds to the coordination number of the specific lattice. Within computations, the availability of D main values let us check adjacency of two vertices in a straightforward way.
For the 2D square lattice (see Figure 2), D main%2~f +u 0 ,+u 1 g. For the 2D triangular lattice, some additional considerations are needed. It is worth underlining that the position vector w 2 ¼ D {u 0 zu 1 corresponds to a neighbor of the origin, and that in this case D mainD2~f +u 0 ,+u 1 ,+w 2 g. In both cases, the set of main directions can be generated by the repeated application of a rotation to u 0 . Working in the triangular lattice with the chosen basis, such a counterclockwise p=3 rotation is expressed by 3 D u 0 , and so on. Thus, an alternative formulation for the set of main directions of the 2D triangular lattice is the following: Scaling up to 3D and beyond. The extension of the square lattice up to three or more dimensions is straightforward, as the chosen basis can trivially be the standard one. The main directions correspond to + the basis vectors, so in dimension N and the coordination number will vary as z % (N)~D main%N k k2N Again, the triangular case deserves more attention. In 3D, the 2D basis vectors u 0 and u 1 can be kept, adding 0 as their last coordinate; the third basis vector u 2 can be chosen imposing its unit length, and the value p=3 to angles d u 0 u 1 u 0 u 1~d u 1 u 2 u 1 u 2~d u 2 u 0 u 2 u 0 , i.e., u 0 : u 1~u1 : u 2~u2 : u 0~1 =2. In compliance with this choice, The lattice generated by such a basis B D3 is the well-known FCC (Face Centered Cubic). Any further extension up to higher dimensions can be dealt with in the same way, iteratively building the basis B Djz1 starting from the known B Dj and imposing the same constraints on length and angles: and 2 u N{1 : u k~1 Vk[½0,N{2 In the chosen B D3 for FCC, any possible vertex adjacent to the origin can be either Accordingly, the coordination number depends on N as z D (N)~D mainDN k k2NzN(N{1)~N 2 zN, as already shown in the right section of Figure 1.
Equations (5), and (8) can be plainly used in dedicated software libraries to compute the main directions for any specified lattice. Such a set of vectors (as well as the basis matrix) can be subsequently referred and generically exploited in algorithms of any type, that operate on the lattice model, regardless of the actual dimension. For the sake of clarity, the set D main can be sorted e.g. according to a lexicographic order, and kept in a list L D , so that L D ½i refers to the i-th main direction.

Handling Conformations
By convention, the initial residue of a conformation selfavoiding walk can always be placed on the lattice origin. The most natural way to represent a conformation is by means of an ordered list of l vertex coordinates in the chosen basis. Such a list is referred to as absolute encoding V ABS , and v i (or V ABS ½i) indicates its i th element, i[½0,l{1. An alternative way is the differential encoding V DIF ½i, i.e., a sequence of l{1 steps (i.e., vectors in D main ) such that V DIF ½i~V ABS ½iz1{V ABS ½i.
Often, it can be useful checking whether a list of lattice vertices corresponds to a self-avoiding walk. It is required that both i) any vertex (but the first) is just one step away from the previous one, and ii) no clash is present, i.e., no position vector occurs more than once in the list. The self-avoidance check can be trivially implemented in a completely general way, given the set of main directions, because the first condition imposes that every element in V DIF must be a vector out of D main , while the second condition asks just for equality checks between integer vectors of the same dimension (in V ABS ). The computational complexity of the selfavoidance check procedure, intrinsically quadratic in the list length, can be made linear by using a hashtable to keep information on the occupancy state of any vertex in the lattice portion where the conformation is located [16].
In the literature, by pursuing the specification of a conformation regardless of its orientation respect to the origin, the relative encoding V REL has been introduced for 2D and 3D lattices. It corresponds to an ordered list of l{2 relative move types, and each relative move type specifies how to actually perform the next step, given the previous ones. In 2D, it is additionally required to fix the first move, and usually by convention also the second residue is constrained to be placed on a given vertex. The move type corresponds to the specific planar rotation R to transform a given differential move into the next. Schiemann et al. [25] explicitly discuss the case of relative moves for the 3D cubic lattice. In higher dimensions and with triangular lattices, this kind of conformation representation is not so straightforward and intuitive as in the basic cases, and for this reason it is not addressed here.
The most elementary operations to handle a configuration correspond to rigid-body transformations. Working with lattices, both translations and rotations must carry any source vertex exactly onto a destination vertex. To this aim, in any dimensions, permitted translations are expressed by a vector in Z n , and the absolute encoding of the translated conformation can be plainly obtained by adding such vector to each element of V ABS . On the contrary, V DIF is translation-insensitive.
Rotations deserve more attention, because even if they are intuitive and familiar transformations on the plane or in 3D, they must be adequately generalized in higher spaces. A popular way to extend the ordinary concept of rigid rotation to the n-dimensional case makes use of the so-called linear hypothesis, i.e., the assumption that in R n , the rigid rotation is performed about an (n{2)dimensional subspace. The problem is treated in a concise yet clear way by Mortari [26]. According to this assumption, it is possible to calculate rotation matrices R fijg that transform one ''source'' main direction L D ½i into another ''destination'' L D ½j, about a subspace normal to both. In general, R fijg not necessarily maps L onto L, although for square lattices this happens in any dimension.
Examples of 2D conformations on square and triangular lattices are depicted in Figure 3. Only H and P residue types are considered in the models, and the classic HP potential has been used for the reported values. Moreover, examples of conformations of the same sequence on 3D lattices are reported in Figure 4.
Distance on the lattice and computation of potential. In protein lattice models the notion of potential is usually related to some kind of contact, i.e., placement on adjacent vertices of residues that are not adjacent on the primary structure [27]. The adjacency of two lattice vertices, whose coordinates are expressed according to the basis matrix B, can be checked simply by inspecting if a{b[D main . In any case, the computation of potential requires the employment of a distance function d over the lattice. Considering the kind of lattices addressed in this work, for any d and for any couple of adjacent vertices a, it holds d(a,b)~1. The euclidean distance can be plainly computed as d 2 (a,b)~B(a{b) k k 2 . In practical cases, often it may be more convenient adopting a notion of distance that is simpler to deal with and quicker to compute than d 2 . For example, the ''hop distance'' d HOP (a,b) can be defined as the minimum number of hops across adjacent lattice vertices to go from a to b. For square lattices in any dimension d HOP% ( : ) corresponds to the classical ''Manhattan distance'' d 1 , and thus it can be computed very quickly.
On triangular lattices, the computation of d HOPD ( : ) is quick in the two-dimensional case. Indicating with d the value a{b, and with d½0,d½1 its coordinates in the basis indicated in Figure 1, the hop distance is the sum of the two smaller values among fDd½0D,Dd½1D,Dd½0zd½1Dg. Conversely, in higher dimensions it may need a very significant amount of operations. In general, the basic idea to compute it can be sketched as follows: The value of the Manhattan distance d 1 (a,b) depends on the basis chosen to express the coordinates for a and b; thus it must be found the specific basisB B, picking N linearly independent unit vectors out of D mainDN , such that, indicating withã a andb b the coordinates of a and b according toB B, the corresponding d 1 (ã a,b b) would be minimal.
In practice, to limit the computational effort, it might be sensible to have recourse to an approximate evaluation of d HOPD that is indeed exact for the lower values (typically for d HOP ƒ3) i.e., those that are involved in the definition of simplified potentials.
A fundamental component of a lattice model is the corresponding potential function, which associates a conformation with a measure of its energy. It typically contains contributions from each possible couple of residues [28], except those corresponding to adjacent residues along the backbone (in fact, their relative position is not allowed to change, and such contribution would always be constant). It is usually expressed in a form like the following: The factor D(d(v i ,v j )) takes into account the inter-residue distance. As only neighboring residues give a significant contribution, D=0 only for low values of d. In the classic HP potential, D( : )~1 only for residues in contact (i.e., with d~1). The coefficient C si,sj depends on the types of residues i and j. In the HP model potential, C H,H~{ 1, and it is 0 in all the other cases, i.e., it is the number of HH contacts times (21). If all the different amino acids are explicitly considered in the alphabet S, a popular choice for such coefficients is the one defined for the MJ potential [27,29].

Lattice-agnostic Implementation of Lattice Models and Algorithms
Following an object-oriented approach in developing software libraries to handle protein lattice models, the structural and behavioral details of the model can be caught by dedicated classes, as depicted in Figure 5. A class can be dedicated to lattice models in general (indicated as Latticemodel), and a derived class Latticeprot to hold features typical of protein lattice models.
The structure specification is given by both the lattice type (either ''square'' or ''triangle'') and the dimension N. Moreover, a string with characters out of a known alphabet can be used to describe the primary structure. This information must be provided to the class constructor, so that all the supporting structural data (like the basis matrix B, kept in the attribute basis, and D main , in the list main_dirs), could be computed at initialization time. Moreover, the conformation V ABS can be represented as a list pos_abs of position vectors of dimension N each, to be allocated by the constructor. The initial configuration is set up by the class constructor by choosing, for the position vectors in pos_abs, values that correspond to a self-avoiding walk, i.e. which satisfy the conditions previously mentioned. Multiple different build-up procedures can be provided to this aim, and the one to be used can be specified through a constructor parameter. In our implementation, the default choice is a random walk.
The model behavior can be encoded in methods. Among them it is worth recalling the hop distance (the d HOP ( : ) discussed before, possibly specified as class method) and the potential (to be computed upon the configuration, thus specified as instance method). It is particularly important underlining that in principle all the methods must be implemented in a lattice-agnostic way, i.e., without exploiting properties that hold only for specific lattice types or dimensions. This can be done by leveraging the structural attributes set up at the class instance initialization. Algorithm 1 ( Table 1) shows a clear example: a Python simplistic method to obtain the HP potential. Calculations are carried out on position vectors, and so the dimension does not require to be explicitly considered. Here, the Numpy library [30] is used and imported as np; position vectors are implemented via Numpy arrays; the function np.dot() performs the standard vector inner product. The code can also abstract the lattice type: At line 11 (commented out), the check on the distance of two H vertices is performed just applying the definition of euclidean distance on the coordinates values, that unfortunately holds only for square lattices. Instead, such a check could be carried out as specified in line 12, by exploiting the current main directions. Of course, several kinds of improvements can be applied to improve the effectiveness of the function implementation.
For performance reasons, in special cases (i.e., for particular types and/or dimensions) it could be sensible to substitute the generic definition of a method with one specific efficient implementation that applies only in such cases. To this aim, depending on the programming language, different coding solutions can be found. In Python and other scripting languages, it could be up to the class constructor at initialization time to bind the generic method name to an additional method that holds the specific implementation.
A general plot() method in the Latticeprot class may represent a convenient way to visualize a conformation. Its implementation must inspect the structural model parameters in the class instance, and accordingly draw the graphical representation (typical examples of outcomes of this method are depicted in Figures 3  and 4). In case of lattices in dimension greater than 3, a representation of the projection of the conformation onto the 3D space may be used, as shown in Figure 6: only three axes amongst all are considered, possibly after a rotation of the whole structure.
The application of different folding algorithms to a given protein model can be directly supported in the Latticeprot class by recurring to the behavioral strategy pattern (see [31], page 315). The actual algorithm can be implemented in a separate dedicated ''Optimizer'' class whose instances hold all the required optimization parameters, to be provided at the object initialization. Each specific optimizer can be associated to a single protein model instance at a time, and the optimizer must be coded abstracting from the lattice characteristics: Simply, it just has to get references to the basic data structures and methods of the object it is linked to. The folding process can then be triggered by invoking a standard method on the protein model, namely foldit(), and this will (possibly) in turn call the corresponding procedure within the currently bound optimizer (if any). This approach allows us seamlessly apply different optimizators to the same protein model at different times, as well as use exactly the same optimization method on different models.
General moves on general lattices. Monte Carlo sampling methods have been extensively employed upon heteropolymer lattice models. Such approaches modify a conformation by applying moves out of a well-defined move set. Each particular move transforms a conformation into another admissible one [2,[32][33][34], which cannot be superposed to the former (symmetries are not considered as moves). It has been widely experimented that the effectiveness of a specific Monte Carlo approach strictly depends on the used move set [35]. One of the earliest move sets was  proposed by Verdier and Stockmayer [36], and it contains only two moves that operate upon a single residue, namely the single residue end move and the corner move. Such approach was further developed by Hilhorst and Deutch [37], who also introduced the two-residue crankshaft move. These three move types have been used together e.g. by Gurler et al. [38]. Following Thachuk et al. [39], here they are collectively called VSHD set. Dealing with multidimensional models, VSHD moves have to be accordingly generalized.
The single residue end move can be described in a general way as the placement of an end residue onto a free vertex that is one step away from its adjacent residue, and its implementation is trivial.
The corner move (also known as kink-jump move) can be reformulated as the placement of a target residue onto a free vertex that is one step away from both its two adjacent residues. This definition can be used (and coherently implemented) for any type of lattice in any dimension.
Similarly, although it was originally conceived for square/cubic lattices, the definition of the two-residue crankshaft move can be generalized as follows: Given four successive residues r p , t 0 , t 1 , and r s on vertices v p , v 0 , v 1 , and v s , place t 0 and t 1 in two free vertices v 0 ' ( =v 0 ) and v 1 ' ( =v 1 ) such that all the pairs (v p , ), ( , ), and ( ,v s ) contain adjacent vertices.
The generalized VSHD moves just described can be viewed as local ones, as their application affects only residues in the very neighborhood of the target residue(s). On the contrary, the use of global moves (sometimes called also long-range moves [40]) generates more distant conformations. For the sake of completeness, it must be recalled that in specific studies some authors [41] have proposed moves based on breaking and reforming the chain. As they are applied for special purposes, the discussion will be focused on the generalization of global moves that keep the chain connected.
The slithering snake move (a.k.a. reptation, see [32], sect. 4.7.2) looks first for a free position adjacent to a target end, and such terminus is moved onto it. The other residues are all shifted by one position along the backbone path. This move, although it keeps most of the overall position arrangement, may completely disrupt the adjacency pattern of residues of different types, thus deeply impacting on the global potential. This definition can be taken as valid for any lattice type and dimension.
The pivot move (a.k.a. wiggle, see [32], sect. 4.7.2 or [2]) represents an effective way to drive a potentially radical change in the conformation, and it has been used in plenty of simulation works on bi-and three-dimensional lattices [2,34,40]. In practice, one residue r pivot on vertex v pivot is chosen as ''pivot'', and one   Figure 6. 3D projections of a 5D configuration for sequence HI 9/2, in a triangular lattice. The corresponding potential is 231. Both images come from an orthogonal projection and in the second, to uncover the superposed vertices, a preliminary slight rotation in the 5D space has been applied.
branch departing from it, either the forward or the backward one, is wholly rotated around v pivot to a new position. Let L rB and L vB respectively indicate the lists of residues and absolute positions of the chosen branch, so that L rB ½0 and L vB ½0 would correspond to the residue/vertex adjacent to r pivot and v pivot , and so on. The rotation can be specified by indicating what lattice vertex v f , both free and adjacent to r pivot , L rB ½0 should be moved to. For the implementation of the pivot move in the general case, multidimensional rotations are required. The specific rotation matrix R to apply can be computed from vectors L vB ½0{v pivot and v f {v pivot .
Even if v f is properly chosen, the corresponding pivot move could be unfeasible because R would map some residues onto positions that are either already occupied by the opposite branch (i.e. a clash occurs), or do not actually correspond to lattice vertices (in this case the elements of R are not all integers). Anyway, the application of a pivot move in any case requires a significant computational effort [42], hampering its extensive employment in Monte Carlo methods. Local moves are an effective means to explore the neighborhood of a given conformation, but in optimization problems like Figure 7. Examples of a pull move in 2D square (drawings on the left) and triangular (drawings on the right) lattices. The notation to indicate the involved elements is described in the text. The target residue is depicted in red, and the residues that undergo a displacement are colored. In the resulting configurations, the chain portion affected by the move is indicated in blue. doi:10.1371/journal.pone.0059504.g007 protein folding it is very difficult to edge away from local minima by their application in a random way. On the contrary, the pivot move may provide very significant conformation changes, but the more compact the conformation is, the more often the move comes to be unfeasible. A good tradeoff has been found with pull moves, originally introduced for square lattices [43], but also generalized to honeycomb [23] and 3D triangular lattices [44]. They exhibit the semi-local property, i.e., although in the worst case a  pull move may relocate a large number of elements, on average only a few residues are involved (as witnessed also in experiments on FCC lattices by Jiang et al. [45]). The description of the generalized pull move can make use of a notation similar to that already used for the pivot move. Figure 7 shows the application of a pull move on 2D square and triangular lattices and, despite its simpleness, can be useful to identify the involved elements.
One residue r target on vertex v target is initially selected, as well as one branch departing from it, either the forward or the backward one (it is the branch to be ''pulled''). L rB and L vB indicate respectively the lists of residues and absolute positions of the chosen branch, so that L rB ½0 and L vB ½0 would correspond to the residue/vertex adjacent to r target and v target , and so on. The symbol p steps corresponds to the lowest number of steps the chain can be pulled in the given lattice type; it is topology-dependent, and its value is 2 for square lattices, and 1 for triangular ones.
The move can be specified by indicating what particular lattice vertex v f , that is free, adjacent to r target , and p steps away from L rB ½0, L rB ½0 should be moved to. For the move to be feasible, there must exist a path from L rB ½0 to v f with p steps {1 free internal vertices. Once L rB ½0 has been placed onto v f , the branch must be pulled. Starting from L rB ½1 until the current residue does not need to be moved (or the list is finished), one residue L rB ½i at a time is inspected and, in case it is not already adjacent to the previous one in the branch (i.e., L rB ½i{1), it is ''pulled'' back p steps to become adjacent. Other graphical explanations of particular   instances of this procedure are reported in previous works (among the others, [16,39,43,44]).
From a software architecture standpoint, moves can correspond to methods of the Latticemodel class, and they operate on the configuration of any class instance. Also in this case, their code can be written in a lattice-agnostic way because the previous generalized definitions of moves are expressed in terms of adjacency and distances that, as already discussed, can be checked/calculated taking the lattice specifications as parameters.
In Figure 4 only one method to perform a move is shown (namely, move_pullmove()). Moreover, at the class level it is convenient to envisage general move management methods, e.g., make_ran-dom_move(), along with setter/getter methods for parameters that determine the management strategies.

Software
The described approach in dealing with general lattice models has been applied in an actual implementation in Python. Such a language has been chosen because it allows a quick and handy manipulation of the code, thus encouraging experimentations.
In investigations, performance was not the prime interest, basically because the main focus was on generic lattice modeling issues, and not on new efficient optimization methods. In any case, in the last years it has been shown that Python is also suitablefor challenging scientific computing [46], and for Bioinformatics problems as well [47]. The use of the numerical library Numpy [30], mainly for managing the representations of lattice vertices and configurations via arrays (not primitively supported by Python), let us obtain a reasonable performance level. Performance improvements can be obtained with more efficient  compiled languages. The described software organization is suitable to be easily coded in any object-oriented language, such as C++.
All the tests presented in this paper have been run on an ordinary Mac notebook, with Mac OS ver. 10.6.8 on an Intel Core 2 Duo 2.4 GHz (but all the implementations are sequential ones), 2Gb RAM. Python ver. 2.6 and Numpy 1.6.2 have been used.

Results and Discussion
The proposed general model has been defined first, and then its software implementation has been designed according to an object-oriented organization. Moreover, it has been stressed the need of a lattice-agnostic approach in shaping algorithms that operate on model instances. The primary result can be identified in the availability of a handy tool to explore the conformation space of lattice proteins, where the lattice model can be easily specified in a parametric way. At this point, it is crucial to better understand how the lattice type and the dimension affect the outcomes of the employed algorithms (also in terms of runtime), and how our tool can be used to get insights on specific models, or on comparisons among different ones. Some direct experiences with different kinds of optimization techniques are reported, in order to understand how the generalized algorithms, previously known in the literature only for basic cases, actually behave across the different models. Again, the main interest is not in developing new and better folding algorithms, but instead in investigating benefits and limitations of the whole tool as a means to explore the configuration space.
Two different classical approaches for the implementation of two different SpecificOptimizers (developed according to the pattern in Figure 5) have been considered: chain growth and simulated annealing. The former builds up the solution step by step, while the latter applies successive modifications to an initial conformation. In both cases, the classic HP potential has been chosen. The advantage of using these different algorithms in the present discussion is twofold: on one hand the main characteristics of the models can be uncovered, and on the other hand the flexibility of the model software implementation can be practically assessed. Details on the benchmark sequences used in the experiments are reported in Table 2.

Insights from Build-up Methods
Chain-growth algorithms represent a group of simple yet significant methods for protein folding in a lattice model [48]. Regardless of their specific details, all of them progressively build up the complete conformation by subsequent additions of chain chunks, and each chunk is placed in a way to minimize the potential of the protein aggregate built so far. At each step, an exhaustive exploration of possible placements is done for the next m residues (so m is named the look-ahead parameter), and the first p of them, pƒm, are used to build the next aggregate. Although alternative stochastic choices might be sensible [48], in the implementation presented here (called CgOptimizer) the target chunk conformation is always selected out of those that yield the current minimal potential, and this last choice takes into account only the chunks placed more closely to the current aggregate. According to the experiments, this strategy provides nearly-optimum conformations in a limited number of runs. Beyond the case of the chosen algorithm, the chain growth principle has been successfully applied also in other popular swarm intelligence methods to find ground conformations, e.g., in ant colony optimization (ACO), but only in two-and three-dimensional cases [40,49,50]. Moreover PERM [51], one of the most effective folding algorithms in the HP model, can be regarded as a particular form of Monte Carlo chaingrowth.
It is worth noticing that the implementation of chain growth algorithms can be easily made lattice-agnostic by generically referring to the elements in main_dirs for the construction of the possible chunks to be added. Moreover, the correct definition of the potential to be used is directly available through the ordinary potential() method provided by the class Latticeprot.
Although CgOptimizer does not guarantee to find the optimal solution, it can obtain a decent lower bound just by running it a few times (at least for ordinary sequences). This procedure can be applied over different protein sequences, on different lattice types, and across multiple dimensions (reasonably only within the admissible range as defined in the first section). Results of this kind are shown in Figure 8. The benchmark sequences are the ten so-called ''Harvard Instances'' (HI), widely used for testing of folding algorithms since 1995 [52]. In the experiment, each test consists of ten optimization runs, executed on each sequence on both lattice types, on dimensions from 2 up to 10 for square lattices, and from 2 up to 5 for triangular lattices. Both the m and p parameters have been set to 3. In the charts of Figure 7 the minimum values for the potential are shown, and each line connects the values for the same sequence at increasing dimensions.
The most evident feature is the decrease of the minimum potential for increasing dimensions, as sensibly expected. Keeping the same sequence, such a decrease becomes less pronounced at higher dimensions, and its trend looks to come to a sort of saturation. This phenomenon is not very manifest in the triangular case, because the lack of the parity constraint allows better  accommodations of H residues, moving forward the beginning of saturation. It is obvious that the theoretical trends in Figure 8 must be nonincreasing monotonic. Somehow surprisingly, in the square case for several sequences CgOptimizer has not been able to find better (or even equal) solutions than those found for lower dimensions. About this point, it must be noticed that i) on each test, because of long durations in the most demanding cases, the number of runs is necessarily limited and likely not sufficient to effectively sample the conformation space in higher dimensions, and ii) the look-ahead parameter m has been set to 3 to avoid excessive runtimes at high dimensions, but this hampers an effective handling of long ''all-P'' subsequences.
The trend of minimum potential versus dimension deserves additional discussion. Beyond the simple experiment described so far, in order to empirically investigate the dependence on sequence length and characteristics, the chain growth optimization has been applied to multiple, heterogeneous sequences and the results are summarized in Figure 9. Usually longer sequences hit deeper minima and, the lower the minimum, the higher the dimension the trend saturation tends to occur. Actually, important factors are the number of H residues and the H/P ratio. The last three sequences in Figure 9 have different lengths but the same number of Hs, and the lower curve corresponds to the one with the highest H/P ratio. The analysis of runtime ( Figure 10) is not particularly surprising, because the curves correspond to the temporal computational complexity for the algorithm. Specifically, it is O( l p : (z{1) m ), with l as the sequence length, m and p the chain growth parameters, and z the lattice coordination number which in turn depends linearly on the dimension in the square case, and quadratically in the triangular one. To make Figure 10 more comprehensible, logarithmic scales have been used in the charts. A more objective comparison between results obtained for square and triangular lattices can be carried out focusing on the coordination number of the models. In fact, it would be interesting to check how in the general case models behave as the neighborhood varies. This can be done looking at Figure 11, which basically shows the same data of the previous charts organized in two single diagrams according to the corresponding values of the coordination number. Regarding the minimum potential, it becomes evident that the triangular lattice, due to looser topological constraints, is able to accommodate more hydrophobic residues around an H vertex respect to a square lattice with the same neighborhood value. Moreover, for the length of HI sequences, the potential looks to saturate within the admissible range in the square case, but not for triangular lattices. Regarding runtime (right chart in Figure 11), their values are very similar for both lattice types, and their trend over dimension is almost the same. For our implementation, triangular lattices are dealt with slightly more efficiently than the corresponding square ones. Finally, no significant difference in behavior can be noticed across the HI sequences. In the next tests, just HI 4 and the first half of H 9 (indicated as HI 9/2) are taken out of them as representative samples.

Insights on Manipulation of Conformations
The examples seen so far do not consider the modification of conformations. To test also this aspect, an algorithm that makes use of moves must be taken as reference. Several Monte Carlo procedures have been developed for folding in the HP model, such as e.g. Evolutionary Monte Carlo [33], Monte Carlo Replica Exchange [39], and genetic algorithms [44]. Quantum annealing has been used for this purpose as well [53]. Moreover, some of the most effective folding algorithms recently designed employ moves in local optimizations within a larger hybrid approach [54].
One of the simplest Monte Carlo methods to find reasonable potential minima is the simulated annealing heuristic. Although its original formulation has shown to be scarcely effective in the exploration of protein conformation space (an early significant example for specific protein models is reported by Brower et al. [55]), particular versions have been employed also recently with good results on classical HP models on 2D square lattices [56] and 3D cubic lattices as well [57]. A simple implementation of simulated annealing will be taken as reference, to show issues in using move sets across different dimensions and lattice types.
In simulated annealing, the basic run (or session) is controlled along with the decreasing of a ''temperature'' parameter T from a value T 0 down to a final T f . Starting from an initial conformation, at each temperature value a given number of steps n sT is performed. Each step consists of the application of a move to obtain a neighbor conformation, which may represent a better solution and may be possibly accepted as the current one. The acceptance is decided in the so-called Metropolis check, that depends on both the found potential difference DU~U next {U current and the current value of T: If DUv0 it is always accepted, otherwise it is accepted only with a probability given by the ''Boltzmann'' factor e { DU k B T (usually k B is taken as 1). The session outcome is the best conformation encountered throughout the whole procedure.
In addition to this general scheme, the used implementation in a class named SaSimpleOptimizer is characterized by the following details: i) an exponentially decreasing cooling schedule is chosen (i.e., T jz1~a T j with 0vav1); ii) as long as a session has been able to improve the solution (i.e., it has been fruitful), a successive one is performed, starting with the previous best conformation; iii) the initial conformation for the first session is a simple linear accommodation of the residues; iv) during the first half of the temperature range for the first session, referring to the generic formulation in eq. 9 the used potential holds C si,sj exactly as in the HP potential, and D(d(v i ,v j ))~1=d 2 (v i ,v j ) 2 . This choice is aimed at getting faster towards a globular conformation; v) no heuristic has been used in selecting the specific next move to apply (as it often happens in the most effective variants of simulated annealing). It has been shown in the literature [54] that properly modified fitness functions, as in iv) above, can dramatically improve the effectiveness of local searches.
The implementation of simulated annealing, as well as any other Monte Carlo algorithm that makes use of configuration modifications, is automatically lattice-agnostic because, in performing moves, it simply exploits methods that have been already written in a generic way as integral members of the class Latticemodel (see Figure 5). Multiple configurations for the same molecule can be kept in different copies of pos_abs.
The presented tests have been performed on a benchmark that include the sequences used for Figure 8, characterized by different lengths. For triangular lattices the last two sequences have been excluded, because of the excessively long runtime. For the same reason, each test on each sequence has been run only ten times, for both lattice types.
The values used for the simulated annealing parameters are the following, using the notation introduced above: T 0~1 2, T f~0 :2, a~0:8, n sT~8 0, and k B~1 . They have been chosen to produce a good solution in the basic case of a 2D square lattice in no more than two sessions. As the configuration space gets wider, reasonably an increasingly larger number of restarts should be needed. In this context, across different dimensions and lattice types, the number of fruitful sessions can give us a rough idea of the effort required by the Monte Carlo procedure (relative to the basic case) to single out a solution.
The used move set contain all the types discussed in this paper, implemented according to their general definitions. The random attempts to use move types obey a distribution with the following values of try rate t for all the move types: Slithering snake t~5%, Pivot t~10%, End t~5%, Kinkjump t~15%, Crankshaft t~15%, Pull t~50%. In particular, the value for the try rate of the pull move has been selected following the discussion reported by Thachuk et al. [39] (they use the symbol r to indicate it).
A first point around the test outcomes is the evaluation of the average number of fruitful sessions SN Sf T; such results are reported in the charts of Figure 12. Informally, the value of this metric should grow as the sampling space widens out, and this may happen either increasing the sequence length, or the lattice coordination number z. Figure 12 clearly shows that, regardless of dimension and lattice type, the longer the sequence, the larger SN Sf T, in accordance with our observation. Moreover, keeping the same sequence and increasing the dimension, SN Sf T increases as well. Also for Comparing square and triangular lattices, the number of required restarts is similar when the lattice coordination number is the same.
Although SN Sf T can quantify the effort in locating the solution, this metric does not reflect the overall computational weight of the procedure, because a single session usually asks for more operations as the sequence length and the dimension increases. An account of the average runtime for square and triangular lattices is presented in the charts of Figure 13. Although simulated annealing scales sufficiently well with dimension, its application beyond the admissible range (as defined in the first section) and with longer sequences becomes unpractical.
In order to assess the ability of our Monte Carlo procedure to find good optimal solutions, minimum potentials obtained by simulated annealing (see Figure 14) can be compared with those found by chain growth, that may be kept as decent approximations of the actual minima. Just after a quick look at the first chart of Figure 14 and Figure 9, it becomes evident at higher dimensions that the tests with simulated annealing provided worse approximations, and also the curves for each sequence in the square lattices are not monotonic decreasing. This aspect can be quantified by means of the relative variation of the value from simulated annealing respect to the corresponding one from chain growth, calculated as (P SA {P CG )=P CG and reported in Tables 3  and 4. Somehow unexpectedly, for short sequences at the lower dimensions, simulated annealing behaves better than chain growth (upper left corner of the tables). On the contrary, as the sampling space gets wider (lower right corner), the solutions are in percent less and less accurate. Simulated annealing reveals its weakness in locating optima in these conditions, at least using the same parameters that turned to be effective for more limited explorations. In any case, a better effectiveness is experienced on triangular lattices.
The facet that likely deserves more attention is the characterization of the usage of moves. In particular, for the different cases it is relevant understanding whether a random application of a move of a given type exhibits the same success probability, and how the moves that lead to an improvement of the solution are distributed among all the move types. To this aim, the following metrics can be used: hit rate h i.e., the number of move applications that could be actually done over the total number of tried random move applications improvement fraction si.e., the portion of all effective move applications that belongs to a given move type (an effective move is one that determined an improvement of the current solution).
The values of h and s for our tests are reported in Tables 5 and  6. The slithering snake move in practice can always be feasibly applied for dimensions higher than 3, and its application is quite effective, considering that it is chosen only in the 5% of the move tries and that it holds a s[½1,10. This move type is more effective at lower dimensions, and for square lattices. The pivot move increases its h as the dimension increases, but usually decreases towards longer sequences. Its effectiveness varies considerably, but most of the times svt~10%.
The application of the end move is feasible almost all the times, but shows a very poor effectiveness.
The kinkjump and crankshaft moves increase their h along with the lattice dimension. Crankshaft shows a better improvement fraction than kinkjump at higher dimensions.
Finally, the pull move (with t~50%) turns to be the most favorable type. It is very effective, and h hits 100% in higher dimensions. The value of s in one case reaches 81% (substantially larger than t), but usually decreases as dimension increases. In general, in triangular lattices s shows lower variations.

Conclusions
So far, studies on simplified protein structures based on lattice models have directly targeted only a few specific lattice models. This paper has shown that it is possible to exploit a parametrical description of the underlying lattice, specifying its type and dimension. As a consequence, investigations can be conducted across different lattices simply by varying the proper parameters. Observing the neighborhood of alpha carbons in the core of real proteins, it can be easily noted that it is sensible making use of lattices whose dimension is larger than three. The proposed parametric models can automatically accommodate this possibility.
The theoretical tools for the proposed lattice models have been developed, shown and organized in the first part of the paper, indicating also how the basic concepts can be exploited in developing the supporting software. To this aim, the classical HP model has been considered. Moreover, an object-oriented architecture has been proposed for generic protein lattice models in any dimension, and a seamless hooking scheme for different optimizers has been suggested and implemented. The systematic adoption of such a software framework would make more meaningful and fair the comparison among different optimization methods.
The adoption of parametric models asks for lattice-agnostic implementations of all the algorithms that operate upon them. This issue has been explicitly addressed for the most important cases (e.g. for the calculation of the HP potential), suggesting to embed the basic functionalities directly in the protein model class. It has been necessary to provide general definitions for the move types proposed so far in the literature to modify protein configurations in Monte Carlo methods.
Experimentations with the proposed models have been carried out by means of a Python implementation, using sequences out of well established benchmarks widely studied in the literature. To point out the main features of the protein models varying lattice type and dimension, two simple yet significant classic optimization methods have been taken as references: Chain growth, and simulation annealing. It has been possible to experimentally uncover the trends of the minimum HP potential with increasing dimension for sequences of different lengths. Moreover, for the first time it has been shown a quantitative characterization of the employment and the effectiveness of move types usage in different lattice types and dimensions.
The shown theoretical and empirical results can be regarded as first steps in the investigation of multidimensional protein models and related algorithms. Some intrinsic geometrical features may restrict the application field of high-dimensional models. E.g., we can notice that in the N-dimensional continous space, the hypersphere surface/volume ratio is N=R, which can rapidly become very different from the ''natural'' 3=R value as N increases. Moreover, it is unclear whether multidimensional models may provide hints on the dynamic properties of protein systems. Further research is needed to ascertain the possible limitations of the general models.
It must be underlined that the simple optimizers used in the paper do not guarantee to find exact minima; for traditional models, the works by Backofen et al. [58,59], based on constraint programming and a specific form of memo-ization, represent the reference approach but it reasonably would require an overall generalization in order to be applied to every possible case. Another example of non-trivial generalization to be carried out is the formulation of effective bounding functions in branch and bound algorithms for HP models [60,61].
The proposed framework can be considered a novel general viewpoint in the study of protein lattice models, that subsumes several specific solutions proposed so far. The need to look at problems from a more general perspective may stimulate the reasoning on simplified models to better catch the basic characteristics of protein conformations and their manipulation.