Reduction of response time by data placement reflecting co-occurrence structures in structured overlay networks

We propose a method to accelerate the response of structured overlay networks by reducing the number of hops required to answer multi-queries. The proposed method copies data items into the redundant storage space of other storages, thereby achieving a data placement that reflects the co-occurrence structures in the structured overlay network. We formulate the optimization of the data placement in the limited redundant space of the storages as an integer programming problem, and we also propose a greedy approach to solve it. Simulations confirm that the proposed method reduces the average number of hops required to answer multi-queries by up to about 30% in our simulation settings. The reduction rate of the average number of hops depends on the level of co-occurrence. We further evaluate how much the greedy approach reduces the computation time needed to solve the optimization problem, and we confirm that the proposed method does not affect the load balancing of structured overlay networks.


Introduction
In recent years, the utilization of big data has attracted attention, and various services that handle huge amounts of data collected from devices such as sensors are expected. Several technologies, such as inexpensive sensors, support these big data services. Among them, storage systems for storing and managing the huge amount of collected data are one of the key technologies.
Storage systems based on structured overlay networks are expected to solve the scalability problem in big data management [1][2][3][4][5][6][7]. In such a storage system, an overlay network is formed by a large number of storages on an underlay network (typically the Internet), and data items are distributed over these storages. Because the system is decentralized, there is no central server that indexes the data items stored on each storage, so a user cannot find a data item by querying a central server. Instead, when a storage receives a query from a user, the system searches for the storage that holds the requested data item by repeatedly forwarding the query to a neighbor node.

Structured overlay networks
In structured overlay networks, an overlay network with a specific topology (e.g., a ring topology) is constructed on an underlay network by connecting storages to each other with virtual links. The data items are distributed among many storages, and each storage operates in a decentralized manner, so the system can handle a huge number of data items. In Chord, a representative protocol of structured overlay networks, the storages form a ring network by maintaining successor lists (see Fig 1). In addition, by maintaining finger tables, storages that are far apart in the ring network are connected by virtual links in order to reduce the number of hops to a data item. The ID space of data in Chord is one-dimensional, and each storage in the ring network stores the data items in a part of the ID space. Since data IDs are assigned using a hash function, the load of the storages is balanced. Some protocols support range-queries, in which the required data items are specified by a range of data IDs [5, 7, 9-11]. Chord# extends Chord for range-queries by replacing the hash function with a key-order preserving function. It constructs a ring network and a one-dimensional ID space, as Chord does. Since the load balancing effect of hashing is lost in Chord#, it adopts the load balancing technique proposed by D. Karger et al. [12]. When range-queries are effective, data items with adjacent IDs tend to be required together. In other words, data items with adjacent IDs have high co-occurrence. In order to support range-queries, Chord# places data items with adjacent IDs at the same storage (or at adjacent storages). This means that the co-occurrence structure is expressed as a one-dimensional ID space and is thereby reflected in the placement of data items.
The number of hops required to answer a query is one of the most important performance metrics in structured overlay networks. Each storage operates in a decentralized manner, and there is no central server that indexes the data items stored on each storage. Required data items are therefore found by repeatedly forwarding a query to a neighbor node.
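To make the role of finger tables in the hop count concrete, the following is a minimal sketch of ring routing in the style of Chord. It is a simplification, not the full Chord lookup procedure: each node forwards to the farthest finger that does not overshoot the responsible node. The 6-bit ID space, the node IDs, and the function names are assumptions chosen for illustration.

```python
from bisect import bisect_left

M = 6            # bits in the ID space (assumed for illustration)
RING = 2 ** M    # IDs live in 0 .. RING - 1


def successor(nodes, key):
    """Return the first node ID at or after key, wrapping around the ring."""
    i = bisect_left(nodes, key % RING)
    return nodes[i % len(nodes)]


def clockwise(a, b):
    """Clockwise distance from ID a to ID b."""
    return (b - a) % RING


def fingers(nodes, n):
    """Finger i of node n points to successor(n + 2**i)."""
    return [successor(nodes, n + 2 ** i) for i in range(M)]


def lookup_hops(nodes, start, key):
    """Count forwards from node `start` to the node responsible for `key`.

    `nodes` must be a sorted list of node IDs that contains `start`.
    """
    target = successor(nodes, key)
    node, hops = start, 0
    while node != target:
        remaining = clockwise(node, target)
        # Forward to the farthest finger that does not overshoot the target.
        node = max(
            (f for f in fingers(nodes, node)
             if 0 < clockwise(node, f) <= remaining),
            key=lambda f: clockwise(node, f),
        )
        hops += 1
    return hops


# Hypothetical ring of 10 storages.
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(lookup_hops(nodes, 8, 54))  # 3 (8 -> 42 -> 51 -> 56)
```

Without the finger tables, the same query would have to travel along successors only, which takes a number of hops proportional to the number of storages between the start and the responsible node.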

A data copy method reflecting co-occurrence structures
In this section, we explain the proposed method, which reduces the response time to multi-queries by reflecting the co-occurrence structures of all stored data items in the data placement of structured overlay networks. The proposed method places copies of data items at storages that do not hold the originals, thereby increasing the probability that the data items in a multi-query are obtained at the same storage. We formulate the placement of the copied data items in the limited space of the storages as an integer programming problem, and we propose a greedy algorithm to solve it.

Co-occurrence structures of data items
Among the data items stored on structured overlay networks, there are combinations of data items that tend to be required at the same time, and we call these combinations co-occurrence structures. These co-occurrence structures can affect the performance of structured overlay networks. Fig 2 shows an example of co-occurrence structures. In Fig 2, we consider the case where weather information at 7:00, 11:00, and 14:00 in Tokyo, Osaka, and Niigata is stored on a structured overlay network. Let us focus on the combinations of data items surrounded by 3 rectangles: 1) the data items in the red dotted line rectangle can be regarded as data items on the weather of the whole day in Tokyo, 2) the data items in the green double line rectangle can be regarded as data items on the weather at 7:00 in all prefectures in Japan, and 3) the data items in the blue solid line rectangle can be regarded as data items on the weather in the morning on Japan's Pacific coast. In this example, we assume that the data items in each of these combinations tend to be required at the same time.
If the data items in these combinations are placed at the same storage, the number of hops required to answer a query can be reduced. When these tendencies can be anticipated as prior knowledge (the tendencies of cases 1) and 2) are easy to anticipate), we can utilize the conventional methods that support range-queries. However, it is difficult to anticipate all the co-occurrence structures of all data items, since the number of possible combinations of data items is enormous. Moreover, the structures can change dynamically depending on user behavior.

Reflection of co-occurrence structures to a data placement by data copy
In the proposed method, a data placement reflecting the co-occurrence structures is achieved by copying data items and storing the copies in storages other than the storage that holds the original data item. If the data items that tend to be required at the same time are placed at the same storage, the probability of getting all required data items at once increases, thereby reducing the number of hops required to answer a query.
The proposed method behaves as an extension of conventional routing methods (including Chord, Chord#, Mercury, etc.). The functions added by the proposed method are as follows:
1. Each storage copies, from other storages, data items that tend to be required at the same time as the data items stored in its own storage (see Fig 3a and 3b). How the data items to be copied are selected will be described later.
2. When a multi-query is received, queries for every single data item required by the multi-query are generated (see Fig 3c).
3. When a storage receives queries generated from a multi-query, the storage sends the data items to the user if it has data items required by the multi-query (see Fig 3d).
Most conventional methods do not conflict with the proposed method, since the original data items are left in place. If a data item stored in a storage were moved rather than copied, the routing to the data item in the conventional method would have to be changed in order to maintain reachability to the data item. The proposed method only adds the above functions and does not intervene in the routing of the conventional method. By using copies of data items, the co-occurrence structure of the data items can be reflected in the data placement while maintaining reachability to the data items.
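Functions 2 and 3 above can be sketched as follows. This is only an illustration of the idea: a multi-query is modeled as a set of data IDs, and the function name and argument names are hypothetical.

```python
def answer_locally(multi_query, originals, copies):
    """Split a multi-query into single-item queries and serve what is held.

    Returns (answered, forwarded): the data IDs this storage can send to the
    user right away, and the data IDs whose single queries must still be
    routed by the underlying protocol (e.g., Chord#).
    """
    held = set(originals) | set(copies)
    single_queries = sorted(multi_query)   # one query per required data item
    answered = [d for d in single_queries if d in held]
    forwarded = [d for d in single_queries if d not in held]
    return answered, forwarded


# A storage holding original A and a copy of B receives {A, B, C}.
print(answer_locally({"A", "B", "C"}, originals={"A"}, copies={"B"}))
# (['A', 'B'], ['C'])
```

Because the originals stay where the underlying protocol expects them, the forwarded queries are routed exactly as they would be without the copies.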
There is a trade-off between the reduction of the number of hops and the storage consumption due to data copy in the proposed method. Needless to say, the proposed method consumes extra storage space to store the copies. If the storage space were unlimited, all data items on the structured overlay network could be stored at a single storage and queries for every data item could be answered immediately (i.e., 0 hops). Conversely, when no extra storage space can be used, the proposed method reduces to the conventional method (i.e., the number of hops is not reduced).
Generally, when data items are stored in a storage, the storage space is not fully utilized, so a redundant space exists (see Fig 4). This redundant storage space can be utilized as a space for storing copies of data items in the proposed method. As we mentioned above, copied data items do not affect the reachability of data items, even if they are deleted without any negotiation. When a new data item to be stored appears, we can simply delete copied data items and store the new data item in the freed space. Therefore, the copied data items do not reduce the potential capacity of the storage. Even if there is no redundant space to store the copied data items, the proposed method never increases the number of hops over that of the conventional method.
In the proposed method, the queries received by each storage are recorded in a log, and data items are copied when the number of queries in the log reaches n. The data items to be copied are specified by solving an optimization problem (the optimization problem and the algorithm that solves it will be discussed in Section 3.3). As we mentioned in Section 2, structured overlay networks operate in a decentralized manner. So as not to violate this decentralization, the data items to be copied are specified by each storage without global information. A storage uses only the log of queries in its own storage to specify the data items to be copied. The log of queries in a storage is cleared after the copy process, and the storage repeats the copy process every time the number of queries in the log reaches n. By updating the data items to be copied in this way, we can reflect the co-occurrence structures of data items in the data placement even if the co-occurrence structures change dynamically.
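The log-driven copy cycle described above can be sketched as a small class. The class, its attribute names, and the `most_frequent` selector are hypothetical: the selector is only a stand-in so that the sketch runs, not the optimization discussed in Section 3.3, and the sketch tracks data IDs only rather than fetching the data from other storages.

```python
from collections import Counter


def most_frequent(c, originals, log):
    """Stand-in selector: the c most requested items not held as originals."""
    counts = Counter(item for q in log for item in q if item not in originals)
    return {item for item, _ in counts.most_common(c)}


class Storage:
    """Keeps a local query log and refreshes its copies every n queries."""

    def __init__(self, originals, n, c, select):
        self.originals = set(originals)
        self.copies = set()     # IDs of the data items currently copied
        self.log = []           # multi-queries seen since the last copy process
        self.n = n              # log length that triggers the copy process
        self.c = c              # number of items the redundant space can hold
        self.select = select    # callable(c, originals, log) -> set of IDs

    def receive(self, multi_query):
        self.log.append(frozenset(multi_query))
        if len(self.log) >= self.n:
            # Only the local log is used; no global information is needed.
            self.copies = set(self.select(self.c, self.originals, self.log))
            self.log.clear()


storage = Storage({"A"}, n=3, c=1, select=most_frequent)
for q in ({"A", "B"}, {"A", "B"}, {"A", "C"}):
    storage.receive(q)
print(storage.copies)  # {'B'}
```

Replacing the whole copy set at each cycle is what lets the placement follow co-occurrence structures that change over time.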

An optimization problem of data items to be copied
In the proposed method, how the data items to be copied are specified is the most important issue. As we mentioned in Section 3.2, a storage has to select which data items to copy from other storages to its own storage, since the storage space is limited. In this section, we discuss the optimization problem of the data items to be copied when the number of data items that can be copied to the redundant storage space is limited to c.
One naive approach is to use the Least Recently Used (LRU) strategy that is commonly used for caching. In this approach, each storage copies the c most recently requested data items that it does not already store. Unfortunately, the LRU approach does not achieve good performance (simulation results are shown in Section 4).
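A minimal sketch of this LRU-style baseline is shown below, assuming the log is a list of multi-queries in arrival order. The function name and the order in which items inside one multi-query are visited (sorted by ID) are assumptions made for illustration.

```python
def lru_copy_set(c, stored, log):
    """Pick the c most recently requested items that are not stored locally."""
    chosen = []
    for query in reversed(log):        # newest multi-query first
        for item in sorted(query):     # assumed order within one multi-query
            if len(chosen) >= c:
                return set(chosen)
            if item not in stored and item not in chosen:
                chosen.append(item)
    return set(chosen)


log = [{"A", "B"}, {"A", "B", "C"}, {"D", "E"}]
print(sorted(lru_copy_set(2, {"A"}, log)))  # ['D', 'E']
```

In this small log the baseline fills the space with D and E because they were requested last, even though B occurs in more multi-queries together with the stored item A.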
In the proposed method, we treat the optimization problem as the problem of maximizing the number of multi-queries in a log for which all constituent queries can be answered by a single storage. We show an example of the maximization in Fig 5. Suppose that a structured overlay network stores 5 data items with IDs A to E, and that a storage originally stores A. When the storage has redundant space for 2 data items, there are $\binom{4}{2} = 6$ choices for the combination of data items to be copied, since 4 data items remain when A is excluded; they are listed in the middle of Fig 5. If data items B and C are copied, the storage can answer 3 multi-queries, {A, B} × 2 and {A, B, C} × 1, in the log shown at the top of Fig 5. The number of multi-queries that can be answered when any other combination of data items is copied is lower than 3. Therefore, the optimal combination of data items to be copied is B and C.
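The enumeration in this example can be reproduced with a few lines of code. The sketch below tries every combination of c candidate items and counts the fully answerable multi-queries. The log used here is hypothetical (the actual log of Fig 5 is not reproduced), chosen so that {A, B} appears twice and {A, B, C} once.

```python
from itertools import combinations


def best_copy_set(stored, all_items, log, c):
    """Exhaustively pick the c items whose copies answer the most multi-queries."""
    candidates = sorted(set(all_items) - set(stored))
    c = min(c, len(candidates))
    best, best_count = set(), -1
    for combo in combinations(candidates, c):
        held = set(stored) | set(combo)
        # A multi-query counts only if every item in it is held locally.
        count = sum(1 for q in log if set(q) <= held)
        if count > best_count:
            best, best_count = set(combo), count
    return best, best_count


log = [{"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"D", "E"}]  # hypothetical log
items, count = best_copy_set({"A"}, "ABCDE", log, c=2)
print(sorted(items), count)  # ['B', 'C'] 3
```

Exhaustive enumeration is only practical for tiny instances like this one, since the number of combinations grows rapidly with the number of data items.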
This optimization problem can be formulated as the following integer programming problem.
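As a sketch of how problem (1) can be written from the symbol definitions and constraint descriptions given in this section, one consistent form is shown below. The exact shape of the second constraint, the index bound $m$, and the reading that $b_1, \dots, b_m$ are the candidate data items not originally held by the storage are assumptions.

```latex
\begin{equation*}
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{n} x_i \\
\text{subject to} \quad & \sum_{j=1}^{m} y_j \le c, \\
& x_i \, d_{ij} \le y_j \qquad (i = 1, \dots, n;\ j = 1, \dots, m), \\
& x_i \in \{0, 1\}, \quad y_j \in \{0, 1\}.
\end{aligned}
\end{equation*}
```

Under this reading, $x_i$ can be 1 only if $y_j = 1$ for every data item $b_j$ with $d_{ij} = 1$, and at most $c$ data items are copied.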
c: Size of a redundant storage space.
$d_{ij} \in \{0, 1\}$: An indicator that is 1 when $b_j \in a_i$, and 0 otherwise.
$x_i \in \{0, 1\}$: An indicator that is 1 when the $i$th multi-query in α is answered, and 0 otherwise.
$y_j \in \{0, 1\}$: An indicator that is 1 when data item $b_j$ is stored, and 0 otherwise.
By solving (1), a storage can obtain the optimal combination of data items to be copied. The objective function $\sum_{i=1}^{n} x_i$ is the number of multi-queries all of whose data items can be answered by the storage. $x_i$ and $y_j$ are the decision variables: $x_i$ indicates whether the $i$th multi-query is answered by the storage, and $y_j$ indicates whether the $j$th data item is stored. The relationship between $x_i$ and $y_j$ is described through $d_{ij}$ in the second inequality constraint, which states that all data items in a multi-query must be stored if the multi-query is to be answered by the storage. The first inequality constraint represents the space constraint of the redundant storage space.
We propose a greedy algorithm for solving (1), since integer programming is NP-hard in general. In the greedy algorithm, the efficiency of each multi-query in a log is calculated, and all data items in the multi-query with the maximum efficiency are copied; this is repeated. The pseudo code for the algorithm is shown in Algorithm 1. The input parameters of the algorithm are the size c of the redundant space, a set α of originally stored data items, and a set γ of multi-queries in a log. The output of the algorithm is a set S of data items to be copied into the redundant space. First, the data items in the redundant space are cleared (Line 1). Counter $l_i$ counts the number of queries that can be answered by the storage when the data items in $a_i$ are copied (Lines 4 to 7). Then, the efficiency $e_{a_i}$, which is $l_i$ per data item of $a_i$, is calculated (Line 8). The set of data items in the $a_i$ that maximizes the efficiency $e_{a_i}$ is added to the set S of data items to be copied into the redundant space (Lines 9 to 10).
The above process (Lines 3 to 10) is repeated until the redundant space is full.
Algorithm 1: A greedy algorithm for (1)
Input: c, α, γ
Output: The optimal data items to be copied S
1 S ← ∅
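The greedy procedure described in this section can be sketched in runnable form as follows. This is one interpretation of the prose description, not a transcription of Algorithm 1: here the counter for a multi-query is taken as the number of logged multi-queries that become newly answerable, the efficiency divides by the number of items that still have to be copied, and candidates that no longer fit into the remaining space are skipped. The names `stored` and `log` stand for the originally stored data items and the logged multi-queries.

```python
def greedy_copy_set(c, stored, log):
    """Greedily choose up to c data items to copy, by efficiency per item."""
    copies = set()                                    # S in Algorithm 1
    while len(copies) < c:
        held = set(stored) | copies
        already = sum(1 for q in log if q <= held)
        best_items, best_eff = None, 0.0
        for q in log:
            new_items = q - held                      # items still to be copied
            if not new_items or len(copies) + len(new_items) > c:
                continue
            # Multi-queries that become answerable if q's items are copied.
            gained = sum(1 for r in log if r <= held | new_items) - already
            eff = gained / len(new_items)             # efficiency per copied item
            if eff > best_eff:
                best_items, best_eff = new_items, eff
        if best_items is None:                        # nothing fits any more
            break
        copies |= best_items
    return copies


# Hypothetical log: a storage holding A with room for 2 copies.
log = [frozenset("AB"), frozenset("AB"), frozenset("ABC"), frozenset("DE")]
print(sorted(greedy_copy_set(2, {"A"}, log)))  # ['B', 'C']
```

Each pass scans the log once per candidate, so the cost grows polynomially with the log length instead of with the number of possible item combinations.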

Evaluation
In order to evaluate the performance of the proposed method, we simulated a Chord#-based method extended by the proposed method. We confirm the effect of the proposed method on the reduction of the number of hops required to answer a query. Additionally, we compare the computation time needed to solve (1) exactly as an integer programming problem with that of the greedy algorithm.

Simulation settings
In the simulation, we constructed a structured overlay network consisting of 100 storages and generated 10000 multi-queries. The size of the redundant space for storing copied data items is 30 data items per storage. We compare the following 4 protocols:
• Chord#: the original Chord# protocol with a one-dimensional ID space. A load balancing technique [12] is adopted.
• Chord# with LRU: Chord# is extended by the proposed method with the LRU strategy. Each storage copies data items that are chosen by the LRU strategy.
• Chord# with IP: Chord# is extended by the proposed method with an exact solution of the Integer Programming (IP) problem. Each storage specifies the data items to be copied by directly solving integer programming problem (1).
• Chord# with greedy IP: Chord# is extended by the proposed method with an approximate solution of the IP problem. Each storage specifies the data items to be copied by the greedy approach described in Section 3.3.
The optimization of data placement is performed every 1000 queries. The number of hops required to answer a query, the processing times, and the load distribution of storages are compared among the protocols once the system reaches a stationary state. As we mentioned above, the proposed method requires extra storage space in order to copy data items from the other storages. Hence, it is difficult to compare the proposed method fairly with the original Chord#. For a fair comparison, we also compare the proposed method with Chord# with LRU, which is a naive approach to utilizing a redundant storage space. In the simulations below, the parameters listed in Table 1 are used as default parameters.
As for the query generation model expressing the co-occurrence structures of data items in the simulations, we assume that the co-occurrence structures can be expressed as the two-dimensional torus space shown in Fig 6. We assume that an ID on a one-dimensional ID space is assigned to each data item stored on the structured overlay network. We map the data items, whose IDs are on a one-dimensional ID space, onto the two-dimensional torus space as shown in Fig 6. It is assumed that the data items surrounded by an arbitrary rectangle are required as a multi-query. Thereby, data items that are close in the two-dimensional torus space can have high co-occurrence. Though the co-occurrence structure in the horizontal direction can be expressed by the ID space, the structure in the vertical direction cannot. The rectangle that determines the data items requested as a multi-query is given by 4 parameters: the vertical and horizontal lengths, and the x and y coordinates of the lower left corner of the rectangle. The coordinates of the lower left corner of the rectangle are assumed to follow a Zipf distribution with the point of ID 0 as the origin. The vertical and horizontal lengths of the rectangles are also each assumed to follow a Zipf distribution. The shape parameter of each Zipf distribution is set to 1.4. Power laws and the Pareto principle are frequently observed in network measurements [13]. A Zipf distribution is one of the most common distributions that exhibit these two fundamental characteristics. It is well known that the popularity of files in peer-to-peer file sharing follows a Zipf distribution [14]. By setting the shape parameters to 1.4, the Zipf distributions in our simulation exhibit the Pareto principle: the occurrence probability of the top 20% of the items is almost 80%.
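The query generation model can be sketched as follows. Several details are assumptions made for illustration and are not taken from Fig 6 or Table 1: the row-major mapping `ID = y * width + x`, the truncation of each Zipf distribution to the torus dimensions, and the 10 × 10 torus in the usage example.

```python
import random


def zipf_sample(rng, n, s=1.4):
    """Draw k from 1..n with probability proportional to 1 / k**s."""
    ranks = range(1, n + 1)
    weights = [1.0 / k ** s for k in ranks]
    return rng.choices(ranks, weights=weights, k=1)[0]


def rectangle_ids(x0, y0, w, h, width, height):
    """IDs covered by a w-by-h rectangle at (x0, y0), wrapping on the torus."""
    return {
        ((y0 + dy) % height) * width + (x0 + dx) % width   # assumed row-major IDs
        for dy in range(h)
        for dx in range(w)
    }


def generate_multi_query(rng, width, height, s=1.4):
    """Sample corner and side lengths from Zipf distributions, return the IDs."""
    x0 = zipf_sample(rng, width, s) - 1     # corner measured from ID 0
    y0 = zipf_sample(rng, height, s) - 1
    w = zipf_sample(rng, width, s)
    h = zipf_sample(rng, height, s)
    return rectangle_ids(x0, y0, w, h, width, height)


rng = random.Random(0)
print(sorted(rectangle_ids(9, 0, 2, 2, 10, 10)))   # [0, 9, 10, 19]
print(sorted(generate_multi_query(rng, 10, 10)))
```

With the row-major mapping, items that are horizontally adjacent in a rectangle have consecutive IDs, while vertically adjacent items are `width` IDs apart, which is why a one-dimensional ID space alone cannot keep them together.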

Reduction of the number of hops required to answer a multi-query
In order to verify that the proposed method reduces the number of hops required to answer a multi-query, the average number of hops required to answer a multi-query was measured while changing the number of storages constituting the structured overlay network from 10 to 100. Parameters other than the number of storages are the default settings shown in Table 1. The average number of hops is defined as the average number of hops required to reach all the data items included in a multi-query. For the 4 protocols, Chord#, Chord# with LRU, Chord# with IP, and Chord# with greedy IP, the results for the average number of hops are shown in Fig 7. The horizontal axis represents the number of storages constituting the structured overlay network, and the vertical axis represents the average number of hops required to answer a multi-query. According to Fig 7, the average number of hops is reduced compared with the original Chord# for any number of storages when Chord# is extended by the proposed method. In particular, Chord# with greedy IP reduces the average number of hops by up to about 30% relative to Chord#. Moreover, Chord# with IP and Chord# with greedy IP reduce the average number of hops compared with Chord# with LRU. This is because LRU simply copies the latest data items in a log and does not optimize the data items to be copied by taking the combinations of data items in multi-queries into consideration. Apart from that, it is noteworthy that Chord# with greedy IP achieves almost the same performance as Chord# with IP even though the solution of the integer programming problem in Chord# with greedy IP is an approximate one. Some of the results for Chord# with greedy IP are even slightly lower than those for Chord# with IP. Since the optimization problem is solved using a log of past queries, the results depend on future queries, which are random, even if the solution is optimal for the log.
According to the above results, it was confirmed that the average number of hops can be greatly reduced by optimizing data placement with the integer programming formulation rather than with a simple approach such as LRU.
Needless to say, the effectiveness of our method depends on the level of co-occurrence. The reduction rate of the average number of hops changes depending on the shape parameter of the Zipf distribution in the model of the co-occurrence structure. If there is no co-occurrence structure, the performance of our method will be almost the same as that of the LRU approach.

Effect of redundant storage spaces on load balancing
In the proposed method, since a data item is copied to other storages that do not have the original data item, the storage that answers a query can change, which changes the load of the storages. Because of the copies, the storage holding the original data item no longer has to answer every query for it. Load balancing across storages is important in structured overlay networks, but the load of each storage may be changed by the proposed method.
In order to verify the effect of the proposed method on the load balancing of storages, the distribution of loads on storages was derived through a simulation. The default settings shown in Table 1 are used in the simulation. The cumulative distribution functions of loads in the original Chord# and in Chord# with greedy IP are shown in Fig 8. The horizontal axis represents the number of storages, and the vertical axis represents the ratio of the cumulative load of the storages to the total load. Here, the load of each storage is defined as the number of queries that are answered by the storage. Chord#, the base protocol of the simulation, balances the load so that the number of data items stored in each storage is evenly distributed. Note that the number of queries is not evenly distributed even though the number of stored data items in each storage is. According to Fig 8, the distribution of load in Chord# with greedy IP is almost the same as that of Chord#, so it can be confirmed that the change in load due to the proposed method is negligible.
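The curve plotted in Fig 8 can be computed from per-storage query counts as sketched below. The ordering of the storages along the horizontal axis is not stated in the text, so sorting from the most loaded storage downward is an assumption, as are the function name and the sample loads.

```python
from itertools import accumulate


def cumulative_load_ratio(loads, descending=True):
    """Cumulative share of the total load after adding storages one by one."""
    total = sum(loads)
    if total == 0:
        return [0.0 for _ in loads]
    ordered = sorted(loads, reverse=descending)   # assumed ordering of storages
    return [running / total for running in accumulate(ordered)]


# Hypothetical loads: number of queries answered by each of 4 storages.
print(cumulative_load_ratio([1, 3, 4, 2]))  # [0.4, 0.7, 0.9, 1.0]
```

Two protocols with similar load balancing produce nearly overlapping curves, which is the comparison made between Chord# and Chord# with greedy IP.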

Reduction of the computation time by the greedy approach
In order to verify how much the greedy approach reduces the computation time, the computation times needed to solve (1) in Chord# with IP and in Chord# with greedy IP were measured when the number of stored data items was 10000 and 1000000. The results are shown in Fig 9. According to Fig 9, the computation time of Chord# with greedy IP is shorter than that of Chord# with IP in both cases. The greedy approach is about 3 to 10 times faster than the approach in which the exact solution of the integer programming problem is calculated.

Related works
A wide variety of protocols have been proposed in research on structured overlay networks. In particular, many studies have focused on the number of hops needed to find the required data items and on load balancing across storages [4, 5, 7, 9-12, 15, 16]. Various overlay network topologies have been tried in order to reduce the number of hops, and the assignment of data items to storages has been optimized to balance the load of each storage. In Chord [4], proposed by I. Stoica et al., the storages of an overlay network form a ring network. A one-dimensional ID space is assigned to the storages on the ring network. Because IDs are hashed, the load is balanced across storages, but range-queries are not supported.
Many protocols have been derived from Chord since it has a simple network structure. Chord# [7] constructs a ring network and a one-dimensional ID space, similar to Chord. Chord# supports range-queries, since the values of data items are used directly as IDs without hashing. The algorithm proposed by D. Karger et al. is introduced in Chord# as a load balancing technique. Mercury [1] constructs multiple ring networks, called hubs, to support multi-attribute queries. In Mercury, near-uniform load balancing is achieved by an algorithm based on random sampling. Y. Gu et al. proposed an algorithm that supports range-queries on a distributed tree [17].
For overlay networks other than structured overlay networks, there are also several studies focusing on the co-occurrence structure of data items. In the literature on semantic overlay networks [18][19][20][21][22], data items are clustered and stored depending on their contents. Generally, these approaches are difficult to apply to big data, since data items are indexed with metadata.

Conclusion and future works
We proposed a method to accelerate the response of structured overlay networks by reducing the number of hops required to answer multi-queries. By copying data items from other storages to a redundant storage space, the proposed method reflects the co-occurrence structures of all stored data items in the data placement of structured overlay networks. We formulated the optimization problem that specifies the data items to be copied as an integer programming problem, and proposed a greedy approach to solve it. Through simulations, we confirmed that the proposed method can reduce the average number of hops required to answer multi-queries by up to about 30%. Further, the greedy approach can reduce the computation time of the optimization problem to as little as about one tenth of that of the exact solution. We also confirmed that the proposed method does not affect load balancing.
In future work, we will evaluate the proposed method in realistic situations. In order to evaluate the scalability of the proposed method, the number of storages should be larger. The query generation model used in the simulations should be replaced by a log of queries from real networks. Moreover, it should be confirmed that protocols of structured overlay networks other than Chord# can be extended by the proposed method without any problems.