Cost-Efficient and Multi-Functional Secure Aggregation in Large-Scale Distributed Applications

Secure aggregation is an essential component of modern distributed applications and data mining platforms. Aggregated statistical results are typically used to construct a data cube for data analysis at multiple abstraction levels in data warehouse platforms. Generating different types of statistical results efficiently at the same time (referred to as multi-functional support) is a fundamental requirement in practice. However, most existing schemes support only a very limited number of statistics. Securely obtaining typical statistical results simultaneously in a distributed system, without recovering the original data, is still an open problem. In this paper, we present SEDAR, a SEcure Data Aggregation scheme under the Range segmentation model. The range segmentation model is proposed to reduce the communication cost by capturing the data characteristics, with a different aggregation strategy for each range. For raw data in the dominant range, SEDAR encodes them into well-defined vectors that preserve both values and order, and thus provides the basis for multi-functional aggregation. A homomorphic encryption scheme is used to achieve data privacy. We also present two enhanced versions: a Random-based SEDAR (REDAR) and a Compression-based SEDAR (CEDAR). Both can significantly reduce the communication cost, at the cost of lower security and lower accuracy, respectively. Experimental evaluations, based on six real-data scenarios, show that all three schemes have excellent performance in cost and accuracy.


Introduction
Enormous amounts of rich, diverse information, often called big data, are constantly generated in modern large distributed systems. Such large-scale data sources create exciting opportunities for service quality monitoring, novelty discovery, attack detection, and more.
However, directly transmitting all the data to a single node and processing it with centralized algorithms is impractical. Distributed aggregation is an efficient way to minimize the consumption of energy and bandwidth.
Typical distributed application scenarios can easily be found at big Internet firms, such as Google and Bing. The click log data of these service providers is distributed on thousands of servers around the world, often growing by megabytes per minute. In these distributed big data scenarios, 90% of regular analytics jobs issue queries against different types of aggregated values, instead of requiring the raw log records. To generate these aggregated results efficiently, performance metrics are computed locally from the log data, and then distributed aggregation is applied [1].
As another example, consider WSN application scenarios. In these applications, nodes are often equipped with a battery as the energy unit, which means the energy capacity is limited. Meanwhile, such WSNs are envisioned to be spread out over a large geographical area with a huge number of nodes, so replacing batteries is impractical. How to save the overall energy resources and extend the lifetime of the network is essential, and distributed aggregation is a popular research topic in this area [2].
Enabling multi-functional support is a fundamental requirement in practice. Here, multi-functional support means providing as many statistical results as possible. Typical aggregation functions include count, summation, mean, median, maximum, minimum, variance, mode, etc. These statistical results are typically used to construct a data cube for data analysis at multiple abstraction levels in data warehouse platforms [3]. To improve the performance of data mining, it is a basic requirement to preserve data features (i.e., statistics) as much as possible in data cubes, which means that multi-functional support is necessary for the corresponding distributed aggregation schemes. System-wide properties generated by data aggregation can also be used as input parameters for other distributed applications and algorithms, or used directly for decision making. For example, setting the fan-out of a gossip protocol [4] in peer-to-peer applications, or achieving load balancing in content delivery networks [5], requires aggregation results as parameters or inputs.
Several distributed aggregation schemes [6][7][8] have been put forward. Security is a basic requirement for most applications, and several secure distributed aggregation schemes [2][9][10][11][12][13] likewise exist. However, most of them can only provide a very limited set of statistics, and even combining several existing schemes cannot satisfy the requirement. In fact, efficiently obtaining global statistical results, such as median and mode, in a distributed manner is still a challenge even without considering security [3].
In RCDA [14], a homomorphic encryption algorithm is used to provide end-to-end confidentiality, and all sensing data are simply concatenated, without any compression, to make every reading recoverable. Although the scheme can support arbitrary aggregation functions, the communication cost is too heavy for large-scale networks. Based on RCDA, EERCDA [13] uses a differential data transfer method to reduce the communication cost, in which difference data rather than raw data are transmitted from the sensor node to the cluster head. However, the total transmission overhead is still too heavy most of the time.
To the best of our knowledge, securely obtaining typical statistical results simultaneously in a distributed system, without recovering the original data, is still an open problem.
In this paper, we study the problem of multi-functional secure distributed aggregation, in which all the aggregation functions mentioned above can be obtained securely in a single aggregation query. We propose three complementary schemes for this problem. We first present SEDAR, a SEcure Data Aggregation scheme under the Range segmentation model, and then propose two enhanced versions, REDAR (Random-based SEDAR) and CEDAR (Compression-based SEDAR).
To reduce the communication cost by capturing the data characteristics, a range segmentation model is adopted in the proposed schemes, with a different aggregation strategy for each range. Raw data in the dominant range are encoded at each node into well-defined vectors that preserve both the order-related and the value-related information during distributed aggregation, and thus different types of statistics can be obtained simultaneously without recovering the original data. The vectors are encrypted by a homomorphic scheme, and the encrypted vectors are aggregated directly in the cipher domain at intermediate nodes, so concealment is also achieved. Raw data in the other ranges are encrypted by a traditional asymmetric encryption scheme and transmitted without in-network aggregation.
The major contributions of this paper are summarised as follows.
• We propose a novel and practical scheme, called SEDAR, in which all common statistical results can be securely and efficiently obtained without recovering the original data.
• We also present two enhanced versions, namely REDAR and CEDAR, which further reduce the communication cost at the cost of lower security and lower accuracy, respectively.
• We implement all three schemes and extensively evaluate their performance. Evaluation results, based on six real-data scenarios, show that all of them have excellent performance in cost and accuracy.
The remainder of this paper is structured as follows. Section 2 describes terminologies and additional background knowledge. Sections 3, 4 and 5 introduce SEDAR, REDAR and CEDAR. Sections 6 and 7 present the performance analysis and evaluation results. Section 8 briefly examines the related work. Section 9 provides a summary.

Preliminaries
In this section, we first give a range segmentation model and a network model, and then present the problem definition. We also introduce a homomorphic encryption scheme.

Range Segmentation Model
An illustration of the range segmentation model is given in Fig 1; the terminologies used in this model are defined as follows.
Definition 1 (R_m, measurement range) The measurement range is the range over which the measurement instruments are calibrated. Convincing and reliable results of a given instrument appear only within its measurement range, i.e., R_m = [X_LM, X_UM], s.t. R_m ⊆ R.
Definition 2 (R_e, effective range, operation range, valid range) The effective range is the set of allowed values for a variable in a concrete application. It is a subset of R_m, i.e., R_e = [X_LE, X_UE], s.t. R_e ⊆ R_m.
Definition 3 (R_d, dominant range, dominant area, advantaged region, main range) The dominant range is a subset of R_e whose probability is significantly greater than that of the rest of R_e.
Definition 4 (R_b, border region, boundary region, margin area) The border region is the set of allowed values outside the dominant range. It is a subset of R_e, i.e., R_b = R_e − R_d.

Network Model

The network is modeled as a connected graph G = (V, E), with |V| vertices and |E| links. Each vertex represents a network node and each link represents a communication channel. A node is a logical concept; for example, in a global-scale distributed system, each data center can be regarded as a node.
The sink node S ∈ V, which has powerful computing and storage capacity, is a trusted node. S is also known as the query server. The remaining nodes C ⊂ V are either reliable or unreliable, and each node has exactly one parent node. |C| = N. X = {x_1, x_2, . . ., x_N} is the raw data generated at these nodes. A set of nodes A ⊂ C is selected as aggregator nodes. The aggregator nodes also act as cluster heads, and the other nodes (C − A) are cluster members. Each cluster member joins an appropriate cluster according to a certain criterion, such as signal strength in wireless networks or delay in wired networks.

Problem Definition
Definition 5 (Data aggregation) Given a dataset X = {x_1, x_2, . . ., x_N}, an aggregation function set F = {f, h, . . .}, and an aggregation result set Y = {y_1, y_2, . . .}, a data aggregation is defined as Y = F(X), s.t. |Y| ≪ |X|.
Definition 6 (Distributed aggregation) Given a network G, let X be the raw data generated at the nodes, and divide X into several subsets {X_1, . . ., X_M}. An in-network data aggregation is defined as y = F(X) = f(h(X_1), . . ., h(X_M)). Each subset can be further divided, and this definition is still satisfied.
Data aggregation of each subset is accomplished at aggregator nodes, and the final data aggregation is executed at the query server.
Definition 7 (MFSDA, Multi-functional Secure Distributed Aggregation) An MFSDA is a distributed aggregation that provides both data confidentiality and multi-functional support. Multi-functional means that several statistical results can be obtained efficiently in the same query, including at least count, summation, average, median, maximum, minimum, variance and standard deviation.

Homomorphic Encryption Scheme
Traditional encryption technology is not suitable for secure distributed aggregation. It provides concealment but does not support ciphertext operations, so the intermediate aggregators have to decrypt the received data before aggregation, and the aggregated results must be re-encrypted before sending. Frequent encryption and decryption at intermediate nodes increases the computing cost and the energy consumption. Key management is also difficult: each intermediate node has to maintain the private key for decryption, which increases the risk of leaks.
To reduce the computing cost and enhance security, a homomorphic encryption scheme is used in the proposed schemes. It is derived from homomorphism in abstract algebra. By using a homomorphism, operations in one algebraic system (plaintext) can be mapped into operations in another algebraic system (ciphertext), which means data aggregation can be performed on ciphertext directly, and only the sink node needs to store the private key for decryption. Homomorphic encryption schemes are either partially or fully homomorphic. In theory, with fully homomorphic encryption all these statistics could be computed easily. However, fully homomorphic encryption, while revolutionary, is not yet practical. Practitioners therefore rely on existing partially homomorphic encryption, which is constructed from traditional encryption schemes and has been widely used in multiparty computation, electronic voting, non-interactive verifiable secret sharing, e-auctions, and others [15,16].
The homomorphic encryption used in the proposed schemes is partially homomorphic, allowing homomorphic computation of a single operation (addition). To further reduce the key size at high security levels and to save computation cost, an elliptic-curve ElGamal encryption scheme (EC-EG) [2,14] is used here. EC-EG is also an asymmetric homomorphic encryption scheme, so key management is easy. It consists of four parts: Setup, KeyGen, Encryption (HEnc) and Decryption (HDec). Its ciphertexts are points on an elliptic curve over a finite field, and ⊕ is the point addition on elliptic curves.
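The additive homomorphism can be illustrated with a toy exponential-ElGamal scheme. This is only a sketch under strong simplifying assumptions: the paper's EC-EG works over an elliptic-curve group, while here a small multiplicative group mod p stands in for it, all parameters are illustrative, and the final discrete logarithm is brute-forced, which is feasible only because the aggregated sums stay small.

```python
import random

# Toy additively homomorphic (exponential) ElGamal over Z_p*.
# Illustrative stand-in for the paper's EC-EG; not secure parameters.
P = 467          # small prime (toy parameter)
G = 2            # group generator (toy parameter)

def keygen():
    x = random.randrange(2, P - 1)            # private key
    return x, pow(G, x, P)                    # (KPriSH, KPubSH)

def henc(m, h):
    r = random.randrange(2, P - 1)            # fresh randomness per encryption
    return (pow(G, r, P), (pow(G, m, P) * pow(h, r, P)) % P)

def hadd(c1, c2):
    # the paper's aggregation operator: componentwise group operation
    return ((c1[0] * c2[0]) % P, (c1[1] * c2[1]) % P)

def hdec(c, x, max_m=100):
    gm = (c[1] * pow(c[0], P - 1 - x, P)) % P  # recover g^m = c2 / c1^x
    for m in range(max_m + 1):                 # brute-force small dlog
        if pow(G, m, P) == gm:
            return m
    raise ValueError("message out of range")

priv, pub = keygen()
c = hadd(henc(3, pub), henc(4, pub))           # aggregate in cipher domain
assert hdec(c, priv) == 7                      # homomorphism: 3 + 4
```

Only the sink holds `priv`; intermediate nodes need nothing beyond `hadd`, which mirrors why aggregators in SEDAR never decrypt.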

SEDAR
In this section, we introduce SEDAR to solve the multi-functional secure distributed aggregation problem. We first give a brief overview of SEDAR. Then, a detailed version is presented. Finally, a concrete example is given.

Overview
In the proposed scheme, aggregation is performed in the cipher domain. Both the sub-aggregation results and the final aggregation result are encrypted vectors, which can be decrypted using the private key owned by the server. All statistics are calculated directly from the final aggregated vector at the server, which greatly simplifies the computation.
As shown in Fig 2, there are five steps in the proposed scheme: mapping, encoding, encryption, aggregation, and decryption. The first three are executed on each node independently, while the last two are executed at aggregation nodes and the server respectively. Mapping and encoding are used to enable multi-function supporting, while the other three are used to achieve data confidentiality.
There are also six kinds of data, one for each stage: raw data x, mapped data y, encoded data ṽ, encrypted data c̃, the aggregation of encrypted data C̃, and the aggregation of encoded data Ṽ.
Raw data x_k is the original data gathered at node k, which belongs to a subset of the real domain. This domain is split into several partitions based on the range segmentation model, and a different strategy is chosen for each range.
The lower bound of R_d is defined as X_L = max(X_LE, m̂ − βδ̂), and the upper one as X_U = min(X_UE, m̂ + βδ̂), where m̂ and δ̂ are the mean and standard deviation estimated from historical or empirical data, and β is a factor, β ∈ [1.8, 3]; β = 2 satisfies most applications. The bounds of R_e and R_m are determined by the application itself.
Effective data in the dominant range (x_k ∈ R_d) are transformed into mapped data y_k using the mapping function. y_k belongs to a subset of the natural numbers, i.e., y_k ∈ (0, L], where L = ⌈(X_U − X_L)/a⌉ and a is the accuracy requirement of x_k. The conversion between x_k and y_k is achieved by the mapping function and its inverse, i.e., f_m and f_m^(−1). To be value-preserving and order-preserving, f_m and f_m^(−1) should be monotonic functions. The mapping function can be linear or nonlinear, and either piecewise or non-piecewise.
Effective data outside the dominant range (x_k ∈ R_b) are encrypted by a traditional asymmetric encryption scheme and transmitted without in-network aggregation. As P(R_e) − P(R_d) ≪ 1, serial transmission of these data will not increase the transmission overhead significantly. Abnormal data outside R_e are also reported to the server without aggregation.
Encoded data ṽ_k is a vector, ṽ_k ∈ {0, 1}^L, where L is the number of elements. The (y_k)-th element of ṽ_k is 1, and all other elements are 0. The conversion between y_k and ṽ_k is achieved by the encoding function and its inverse, i.e., ṽ_k = f_e(y_k) and y_k = f_e^(−1)(ṽ_k). For example, the encoding function can be programmed as two instructions, i.e., ṽ_k = zeros(1, L) and ṽ_k(y_k) = 1, while its inverse can be achieved by y_k = find(ṽ_k > 0).
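The mapping and encoding steps can be sketched as follows, using the R_d = (30, 34], a = 1 setting from the concrete example later in the paper; the specific linear form of f_m (a ceiling of the scaled offset) is our assumption, since any monotonic mapping would do.

```python
import math

# Sketch of mapping f_m and one-hot encoding f_e, with assumed linear f_m.
X_L, X_U, a = 30.0, 34.0, 1.0
L = math.ceil((X_U - X_L) / a)           # vector length

def f_m(x):                              # mapping: R_d -> (0, L]
    return math.ceil((x - X_L) / a)

def f_m_inv(y):                          # inverse mapping (assumed form)
    return X_L + y * a

def f_e(y):                              # encoding: one-hot vector of length L
    v = [0] * L
    v[y - 1] = 1                         # the y-th element is 1 (1-indexed)
    return v

def f_e_inv(v):                          # position of the non-zero element
    return v.index(1) + 1

y = f_m(32)                              # node 1 in the concrete example
assert y == 2
assert f_e(y) == [0, 1, 0, 0]
assert f_m_inv(f_e_inv(f_e(y))) == 32    # round trip preserves the value
```

Monotonicity of `f_m` is what keeps the order of positions in the vector consistent with the order of the raw values.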
Each node k encrypts its ṽ_k into c̃_k using the homomorphic encryption scheme, and sends c̃_k to its parent. As all encrypted data c̃_k are generated with the same public key, they can be aggregated directly in ciphertext: C̃ = ⊕ c̃_k. By the homomorphic property, the final aggregation result Ṽ can be obtained by decrypting C̃, and then the typical statistical results can be obtained from Ṽ.

Detail of SEDAR
SEDAR consists of three procedures: Setup, Operations on Clients, and Operations on Server.
Setup. The Setup procedure performs network initialization, encryption initialization, and parameter distribution. The boundary definitions (i.e., R_e, R_d, and R_b) and the accuracy requirement (i.e., a) are distributed to each node, including clients and the server. The parameters Itm_size, B_msize, B_csize, and B_num are also public information.
There are two encryption schemes used in this paper: a traditional one and a homomorphic one. Both of them are public key cryptogram schemes.
The traditional encryption scheme is used for R b , where the key pair is {KPriS, KPubS}, the encryption function is C = Enc(msg, KPubS), and the decryption function is msg = Dec(C, KPriS).
The homomorphic encryption scheme is used for R d , where the key pair is {KPriSH, KPubSH}, the encryption function is C = HEnc(msg, KPubSH), and the decryption function is msg = HDec(C, KPriSH).
Both KPubS and KPubSH are public information, while the private keys (i.e., KPriS and KPriSH) must be kept private, held only by the server.
Operations on Clients. It consists of three parts: local data processing, received data processing and data transmission.
As shown in algorithm 1, in local data processing, each node gathers the raw data x_i and processes it according to the range definitions. The traditional encryption scheme is used for data belonging to R_b, and the encrypted data are added into bSet_i. For data in R_d, the mapping function f_m and the encoding function f_e are applied before the homomorphic encryption, and the encrypted data are added into hSet_i. For data outside the valid range R_e, the node ID is added into the alarm set aSet_i after being encrypted.

Algorithm 1 Operations on Client (Part I)
9:  for j ← 0 to B_num − 1 do
10:    start ← max(1, |ṽ_i| − (j + 1)·B_msize + 1)
11:    end ← |ṽ_i| − j·B_msize
12:    m ← ṽ_i[start : end]
13:    c ← HEnc(m, KPubSH)
14:    C_hi ← [c, C_hi]
15: end for

As shown in algorithm 2, the received data processing only exists at cluster heads (i.e., CHs). Each CH uses it to handle the packets received from its children (i.e., CMs). Items in each packet are classified into three sets, i.e., bSet_i, hSet_i, and aSet_i. All items of hSet_i are aggregated directly in the cipher domain, and the aggregation result is C_hi.
In the data transmission, all processing results C_hi, bSet_i, and aSet_i are sent to the node's parent. Each element in ṽ_k (i.e., ṽ_k(i)) or Ṽ is allocated the same size (denoted Itm_size, or |ṽ_k(i)|), which is influenced by N and the distribution of x: Itm_size ∈ (⌈log(N/L)⌉, ⌈log N⌉).
The maximum size (denoted P_size) that the homomorphic encryption function can operate on in one call is always larger than Itm_size. To reduce the total ciphertext size and the computation cost, several adjacent vector elements can be encrypted together. For example, in Fig 2, each element is allocated three bits, and every two elements are encrypted together.
The maximum number of vector elements that can be encrypted in one call is ⌊P_size/Itm_size⌋, and the actual plaintext size for the encryption function is B_msize = ⌊P_size/Itm_size⌋ · Itm_size. The corresponding ciphertext size is denoted B_csize. When |ṽ_k| is much larger than B_msize, the homomorphic encryption function needs to be repeated B_num = ⌈|ṽ_k|/B_msize⌉ times to finish the encryption of ṽ_k. The decryption function also needs to be repeated B_num times.
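The block-size arithmetic above can be checked with a short sketch; the concrete values for P_size, Itm_size and L below are illustrative assumptions, not parameters from the paper.

```python
import math

# Sketch of the block-packing arithmetic: how many vector elements fit
# into one encryption call, and how many calls a full vector needs.
P_size = 160        # bits per encryption call (assumed)
Itm_size = 3        # bits allocated per vector element (assumed)
L = 400             # number of vector elements (assumed)

elems_per_block = P_size // Itm_size           # floor(P_size / Itm_size)
B_msize = elems_per_block * Itm_size           # plaintext bits per block
v_bits = L * Itm_size                          # |v~_k| in bits
B_num = math.ceil(v_bits / B_msize)            # encryption calls needed

assert elems_per_block == 53
assert B_msize == 159
assert B_num == 8                              # ceil(1200 / 159)
```

Packing ⌊P_size/Itm_size⌋ elements per call is what amortizes the per-ciphertext overhead across many vector positions.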

Algorithm 2 Operations on Client (Part II)
1: procedure RECEIVED DATA PROCESSING
2:   if current node is a CH then
3:     for all received Packet_k do
         ⋮
4:     end for
9:     for j ← 1 to B_num do
10:      ct_1 is initialized as the infinity point of E
11:      for all C_hk ∈ hSet do
           ⋮
16:      end for
17:  end if
18: end procedure

Operations on Server. Operations on the server consist of five parts: data receiving and retrieving, boundary range data processing, alarm data processing, dominant range data processing, and obtaining the statistical results. The first three are contained in algorithm 3, while the others are contained in algorithms 4 and 5.
Algorithm 3 Operations on Server (Part I)

In the data receiving and retrieving, the server receives all packets from its children. All its children are CHs, and each packet has the form {C_hi, bSet_i, aSet_i}. Items in each packet are classified into three sets, i.e., bSet, aSet and hSet.
Items in bSet are boundary range data, all of them encrypted with the traditional scheme. The decrypted data are added into mSet.
Items in aSet are the IDs of clients whose data are outside the valid range. Those nodes are treated as potentially abnormal nodes and may need further analysis.
Items in hSet are dominant range data; all of them are homomorphically encrypted, so all items in this set can be aggregated directly in the cipher domain. After decrypting the aggregated ciphertext using KPriSH, we obtain an aggregation result of the data in the dominant range, in vector form, i.e., Ṽ = {n_1, n_2, . . ., n_L}.

Algorithm 4 Operations on Server (Part II)

1: procedure OPERATIONS ON SERVER ⊳ dominant range data processing
2: for j ← 1 to B_num do
3:   ct_1 is initialized as the infinity point of E
4:   for all C_i ∈ hSet do
     ⋮

Finally, each statistical result can be obtained directly from Ṽ and mSet by algorithm 5.

Properties of SEDAR
Multi-function. On the one hand, the value-related information needs to be preserved in the transformation for the summation-based statistics. y_k can be recovered from Ṽ(i), and x_k can be recovered from y_k. Ṽ(i) itself represents how many values x_k = f_m^(−1)(i) occur in the raw data. So the value-related information is maintained in Ṽ.
On the other hand, the order-related information needs to be preserved in the transformation for the comparison-based statistics. Assume that Ṽ(i) ≠ 0, Ṽ(j) ≠ 0, and i > j. We recover the raw data as x_i = f_m^(−1)(i) and x_j = f_m^(−1)(j), and can then use the monotonicity of f_m to judge which one is larger. So the order-related information is maintained in the vector Ṽ.
Therefore, both the summation-based statistics and the comparison-based statistics can be calculated in the proposed scheme.
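The argument above can be made concrete with a sketch of reading statistics off the decrypted aggregate vector Ṽ = {n_1, . . ., n_L}, where n_i counts how many raw values mapped to i. The histogram values, the bounds, and the lower-median convention for even counts are all illustrative assumptions; the paper's algorithm 5 may differ in detail.

```python
# Sketch: statistics from an aggregated histogram vector V (assumed data).
X_L, a = 30.0, 1.0
V = [1, 3, 2, 2]                       # n_i for i = 1..L (assumed histogram)

def val(i):                            # inverse mapping x = X_L + i*a (assumed)
    return X_L + i * a

cnt = sum(V)                                                   # count
s = sum(n * val(i + 1) for i, n in enumerate(V))               # summation
mean = s / cnt
mn = val(min(i + 1 for i, n in enumerate(V) if n > 0))         # minimum
mx = val(max(i + 1 for i, n in enumerate(V) if n > 0))         # maximum
mode = val(max(range(len(V)), key=lambda i: V[i]) + 1)
var = sum(n * (val(i + 1) - mean) ** 2 for i, n in enumerate(V)) / cnt

def median(V):
    # walk the histogram to the lower median (convention for even counts)
    half, acc = (cnt + 1) // 2, 0
    for i, n in enumerate(V):
        acc += n
        if acc >= half:
            return val(i + 1)

assert cnt == 8 and mn == 31.0 and mx == 34.0 and mode == 32.0
assert median(V) == 32.0
```

Every statistic is a single pass over L counters, which is why the server can answer all of them from one aggregation query.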
Data Privacy. The adversary cannot infer the true position of the non-zero element from an encrypted vector, and thus cannot recover x_k from it. There is a random function in the homomorphic encryption scheme: even if two elements have the same value, their ciphertexts differ. For example, at the leaf nodes, each encoded vector contains only one non-zero element, and all other elements are zero. In the plaintext domain, it would be easy to infer the corresponding value x_k by locating the non-zero element and applying f_e^(−1) and then f_m^(−1). In the ciphertext domain, however, all encrypted elements differ from each other, and even encrypted zero elements differ from each other. As a result, inferring the position of the non-zero element from an encrypted vector is difficult.
Algorithm 5 Operations on Server (Part III)

A Concrete Example for SEDAR
Assume R_e = (20, 40], R_d = (30, 34], and the accuracy requirement is a = 1. As shown in Table 1, there are 10 nodes in the given network. The raw data of each node are listed in the 2nd column; the 3rd and 4th columns show the data classification and processing results.
The raw data of node 2 and node 8 are outside the valid data range R_e = (20, 40]. Both are regarded as illegal data and discarded, and their IDs are added into the alarm set aSet.
The raw data of node 5 and node 10 are in the boundary range R_b = R_e − R_d = (20, 30] ∪ (34, 40]. Both are added into the boundary set bSet.
The other raw data are in the dominant range R_d. Each of them is transformed into y_k (y_k ∈ (0, 4]) by the mapping function. Each valid mapped value y_k is then encoded into a vector ṽ_k of length L: the y_k-th element is 1, while all remaining elements are set to 0. For example, at node 1 the raw data is x_1 = 32; the mapped data y_1 = 2 is obtained in the mapping step, and in the encoding step the 2nd (y_1-th) element of the vector is set to 1, while the other elements are 0, i.e., ṽ_1 = (0 1 0 0).
Each vector is encrypted by the homomorphic encryption scheme, and in-network aggregation is performed directly in the cipher domain.
Elements of aSet and bSet are encrypted by the traditional encryption scheme and relayed to the server without in-network aggregation.
By the homomorphic property, the aggregation of vectors in the ciphertext domain is equivalent to that in plaintext. Therefore, the server can obtain the final aggregation result Ṽ = Σ ṽ_k by decrypting the received data. Encrypted data in aSet and bSet can also be decrypted by the server. Each statistic can then be calculated from the final data using algorithm 5.
Note that CNT is 8 instead of 10; this is because two nodes reported data outside the operation range, which may be caused by node failure or other reasons. That is to say, the computation of the final statistics automatically adapts to network dynamics.

REDAR
In SEDAR, most elements of the encoded data ṽ near the leaf nodes are zero, which means it contains redundant information. Directly transmitting such low-information data as a full vector is too expensive. The encrypted zeros are only used to hide the exact position of the encrypted non-zero element, so encrypting all zeros is not necessary, especially when L is large.
In this section, we propose REDAR, which can significantly reduce the communication cost at the cost of lower security at the leaf nodes. In REDAR, all non-zero elements and a small number of randomly selected zero elements of each leaf node's vector are encrypted.

Random Encryption
Randomly selected zero elements are used to reduce the packet size, as well as to provide security for the non-zero elements.
First, ṽ is split into several segments. Then, all non-zero elements and a small number of randomly chosen zero elements are encrypted.
For example, in Fig 3-1, there are 27 elements in ṽ, each element contains 3 bits, and each cipher element is encrypted from 2 elements; in this case, the leftmost cipher element contains only 1 element. ṽ is split into 3 segments: each of the two right segments has at most 5 cipher elements, and the last segment has at most 4 cipher elements.
For each segment whose elements are all zero, a random number r is generated between 1 and n_s, where n_s is the number of elements in the segment. Then the r rightmost elements of the segment are encrypted using the homomorphic encryption scheme. For example, in Fig 3-1, all elements of the leftmost and the rightmost segments are zero, i.e., n_s = 4 and n_s = 5 respectively; thus one cipher element is obtained in the leftmost segment and two cipher elements in the rightmost segment.
For a segment containing a non-zero element, a random number r is generated between 0 and n_s − p_y, where p_y is the position of the non-zero element in the segment with respect to the right end. Then the r + p_y rightmost elements of the segment are encrypted. For example, in Fig 3-1, the 2nd segment contains a non-zero element, and p_y = 2. Because r = 0 is returned by the random function, only the p_y + r = 2 rightmost elements are encrypted.
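The two selection rules above can be sketched as follows. The segment layout and the one-non-zero-element-per-leaf-vector assumption follow the paper; the function name and representation are ours.

```python
import random

# Sketch of REDAR's per-segment selection: how many rightmost elements
# of a segment get encrypted.
def elems_to_encrypt(segment):
    n_s = len(segment)
    nz = [i for i, e in enumerate(segment) if e != 0]
    if not nz:
        # all-zero segment: encrypt r in [1, n_s] rightmost elements
        return random.randint(1, n_s)
    p_y = n_s - nz[0]                  # non-zero element's distance from right
    # encrypt r + p_y rightmost elements, r in [0, n_s - p_y]
    return p_y + random.randint(0, n_s - p_y)

random.seed(7)
seg = [0, 0, 1, 0, 0]                  # p_y = 3
k = elems_to_encrypt(seg)
assert 3 <= k <= 5                     # always covers the non-zero element
assert 1 <= elems_to_encrypt([0, 0, 0, 0]) <= 4
```

The invariant is that the encrypted run always reaches at least as far left as the non-zero element, so no data is ever dropped, only padding.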

Packing and Unpacking
The packing step constructs the packet from the random encryption result. As shown in Fig 3-2, the encrypted data in each segment are packed together in the original order, and a delimiter is added between adjacent segments. Because each cipher element has the same size and only a run of elements at one end of each segment is encrypted, the original encrypted vector can be reconstructed in the unpacking step.
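A minimal sketch of this pack/unpack round trip, assuming fixed-size cipher elements and a delimiter byte between segments (both our assumptions; a real implementation would need length prefixes or escaping, since ciphertext bytes may collide with the delimiter):

```python
# Sketch of Fig 3-2's packing: fixed-size cipher elements per segment,
# joined with a delimiter so segment boundaries survive transmission.
DELIM = b"|"
CSIZE = 4                                  # bytes per cipher element (assumed)

def pack(segments):
    # each segment is a list of fixed-size cipher elements (bytes)
    return DELIM.join(b"".join(seg) for seg in segments)

def unpack(packet):
    segments = []
    for raw in packet.split(DELIM):
        segments.append([raw[i:i + CSIZE] for i in range(0, len(raw), CSIZE)])
    return segments

segs = [[b"aaaa"], [b"bbbb", b"cccc"], [b"dddd", b"eeee"]]
assert unpack(pack(segs)) == segs          # lossless round trip
```

The fixed element size is what lets the receiver re-split each segment without any per-element headers.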

Secure Data Aggregation
All received packets are unpacked to get the c̃′_k sets, which are then aligned together with the locally generated encrypted data. Finally, data aggregation is performed directly in the cipher domain, column by column, and the aggregation results of all segments are packed and sent to the node's parent.
For example, in Fig 3-3, the first line is the encrypted data generated locally, and the 2nd and 3rd lines are received from the node's children; all of them are the 2nd segment of each vector. Segments received from different children may have different numbers of elements, and all of them are aligned to the right side. Because c_17 does not exist in the 1st child, we ignore it and just aggregate the other two elements, i.e., c′_7 = c_7 ⊕ c_27. Among these three segments, the maximum number of elements is 3 (the 3rd line in Fig 3-3), so the number of elements in the aggregation result of this segment is also 3.

Correctness and Security
Since the random selection only applies to zero elements, all non-zero elements in each vector are encrypted and aggregated into the final result. No raw data are lost in REDAR, so the final results are the same as those in SEDAR.
In REDAR, the communication cost is reduced because far fewer zero elements are encrypted and contained in the packet. However, reducing the number of encrypted elements helps the adversary guess the true position of the non-zero elements, and an inappropriate distribution of the encrypted data also raises the adversary's success probability. We therefore need to design the random function carefully and make sure that sufficient encrypted data are retained after random selection. The more encrypted elements there are, the lower the probability of a successful guess. After aggregation at an intermediate node, the packet contains more than one encrypted non-zero element, so the adversary's success probability decreases significantly.
More specifically, at a leaf node i, where there is only one non-zero element in the vector, if n_i encrypted elements exist in the final packet, the adversary's success probability is 1/n_i. At a cluster head, assuming k nodes aggregate together, the probability reduces to 1/n^k, where n = Σ_j max_i(n_ij) and n_ij is the number of elements in the j-th segment of node i. As n and k increase along the aggregation tree, the adversary's success probability decreases rapidly.
For example, when n ≥ 25 and k ≥ 4, the success probability is no larger than 2.56 × 10^−6. In cluster-based networks, the number of cluster members per cluster is often larger than 4, which means that when n ≥ 25, except at the leaf nodes, no encrypted data can be guessed successfully with probability larger than 2.56 × 10^−6. When k = 6 and n = 35, the success probability decreases to 5.44 × 10^−10.

CEDAR
In this section, we present CEDAR. CEDAR and REDAR are complementary schemes. CEDAR is used before encoding, while REDAR is used after encoding.
In SEDAR, the total communication cost is mainly determined by the size and accuracy of R_d, i.e., L = |R_d|/a. L is sometimes large, so the total communication cost can still be heavy. To reduce the communication cost, a compression step is introduced in CEDAR. As shown in Fig 4, the mapped data y are compressed from a larger space of size L into a smaller space of size L′. The encoding step executes on the compressed data z, which decreases the vector length from L to L′. The compression function can be linear or non-linear; due to limited space, we only illustrate the linear one.
A linear compression function f_c compresses y into z, i.e., z = f_c(y) = f_c(f_m(x)) = ⌈y/c⌉ = ⌈f_m(x)/c⌉, where c is the compression factor and c > 1. The encoding step is based on z instead of y, i.e., ṽ = f_e(z) = f_e(f_c(y)). So the total communication cost is reduced from L to ⌈L/c⌉. One can recover ŷ as an estimate of y by applying the decompression function to z.
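A minimal sketch of the linear compression step follows (Python). The reconstruction ŷ = c·z is our assumed decompression function; the paper only states that ŷ estimates y, so the exact form may differ:

```python
import math

def compress(y: int, c: int) -> int:
    """Linear compression f_c: z = ceil(y / c), with factor c > 1."""
    assert c > 1
    return math.ceil(y / c)

def decompress(z: int, c: int) -> int:
    """Assumed estimate y_hat of y from z; the error y_hat - y
    is between 0 and c - 1 by the ceiling in compress()."""
    return z * c

y = 103
z = compress(y, c=5)        # index shrinks: ceil(103 / 5) = 21
y_hat = decompress(z, c=5)  # 105, within c - 1 = 4 of the true y
```

Because every index shrinks by roughly a factor of c, the encoded vector length drops from L to ⌈L/c⌉ at the price of a bounded quantization error.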

Performance Analysis
In this section, we analyse the communication and computation performance of the proposed schemes. The performance criteria include whether a bottleneck exists and whether load balance is achieved.

Communication performance
First, we analyse the maximum packet size to judge whether a bottleneck exists. Then we analyse the distribution of packet sizes to judge whether load balance is achieved. A thorough analysis shows that no bottleneck exists and load balance is achieved. The detailed analysis goes as follows. Data in R_d are encrypted by a homomorphic scheme, so encrypted data can be aggregated directly in the cipher domain at the intermediate nodes, and thus the total length does not change. Data in R_b use a traditional encryption scheme. Without the private key, the intermediate nodes have to concatenate the encrypted data and relay them. The length therefore increases: the minimum packet sizes appear at the leaf nodes of the aggregation tree, and the maximum packet size appears in the vicinity of the server node. Now, let's analyse the packet length for R_d and R_b respectively. For the sake of simplicity, the data length analysis is based on plain text.
The communication cost for R_d is determined by the number of elements in the vector and the data length of each element. The former is determined by the range length of R_d and the accuracy requirement a. The latter is determined by the largest number of samples falling on the same point; the worst case is that all N × P(R_d) samples in R_d fall on the same position. In practice, the probability of the worst case can be ignored. Now, let's consider the communication cost for R_b. Since the data in R_b are not aggregated at the intermediate nodes, the total data length reaches its maximum in the vicinity of the server node. The maximum value is determined by the total number of samples N, the probability P(R_b) that a sample falls in the region R_b, and the transmission overhead of a single sample. In practice, we can assume that the probability of abnormal data is much less than that of normal data, which means the number of elements outside R_e can be ignored. So P(R_b) = P(R_e \ R_d) = P(R_e) − P(R_d) ≈ 1 − P(R_d). In order to judge whether load balance is achieved, let's first analyse the distribution of the whole network traffic.
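To make the accounting concrete, the sketch below (Python) evaluates one plausible instantiation of these cost expressions; the counter widths and log base are our assumptions, not the paper's exact constants:

```python
import math

def cost_dominant(rd_len: float, a: float, n: int, p_rd: float) -> int:
    """Worst-case bits for the R_d vector: ceil(|R_d| / a) slots,
    each wide enough to count all N * P(R_d) samples falling on
    one position (hypothetical instantiation)."""
    slots = math.ceil(rd_len / a)
    counter_bits = math.ceil(math.log2(n * p_rd + 1))
    return slots * counter_bits

def cost_boundary(re_len: float, rd_len: float, a: float,
                  n: int, p_rb: float) -> int:
    """Bits cascaded near the server for R_b: about N * P(R_b)
    samples, each of length log((|R_e| - |R_d|) / a)."""
    per_sample = math.ceil(math.log2((re_len - rd_len) / a))
    return math.ceil(n * p_rb) * per_sample
```

The two functions show why a small P(R_b) matters: the R_d cost is a constant once the range is fixed, while the R_b cost grows with N × P(R_b).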
The minimum packet sizes appear at the leaf nodes of the aggregation tree. At a leaf node, when x ∈ R_b, no encoding step is used, so its data length is ⌈log((|R_e| − |R_d|)/a)⌉. When x ∈ R_d, the encoded data length is Cost_{R_d}. The former (denote it as C_0) is much smaller than the latter. However, C_0 occurs only in a very small number of leaf nodes. Assume each cluster contains m leaves. The probability that C_0 appears simultaneously in all m nodes is (1 − P(R_d))^m. For example, under a normal distribution with β = 2, when m = 3, (1 − P(R_d))^m = 9.48 × 10^−5. This probability is small, which means that even if a small number of leaf nodes have a packet size of C_0, their parent's packet size is at least Cost_{R_d}. Therefore, C_0 is not representative, and the representative minimum value should be taken as Cost_min = Cost_{R_d}.
In the whole network, the minimum packets appear at the leaf nodes and the maximum packet appears in the vicinity of the server. On the path from a leaf node to the root node, the packet size increases from the minimum to the maximum value. Because P(R_d) ≫ P(R_b), the growth rate of the packet size is small, and the average packet size Cost ≈ Cost_{R_d}.
Now consider the maximum packet size. On the one hand, once R_d is determined, Cost_{R_d} can be treated as a constant; if Cost_{R_d} is too large, CEDAR can be used, so Cost_{R_d} is controllable. On the other hand, by the definition of R_d, P(R_d) ≈ 1, so the cost contributed by R_b is small enough. As a result, the maximum packet size can be regarded as a controllable constant, and there is no bottleneck in the network.
As the differences among Cost_min, Cost_max and the average Cost are small, we can conclude that load balance is achieved.

Computation performance
For computation performance, we likewise analyse the maximum computation cost to judge whether a bottleneck exists, and the distribution of computation cost to judge whether load balance is achieved. We find that no bottleneck exists and load balance is achieved. The detailed analysis goes as follows.
Each datum is either homomorphically encrypted after encoding, or encrypted directly by the traditional scheme, according to the range it belongs to. Let's denote the computation cost of the former as C_11 and that of the latter as C_21.
For data inside R_d, mapping and encoding are required before homomorphic encryption. Both cost much less than encryption and decryption and thus can be ignored. For the homomorphically encrypted data, the intermediate nodes do not decrypt it but aggregate it directly in the cipher domain. Assuming a single cipher-domain addition costs C_13, the total aggregation cost is C_13(N × P(R_d) − 1), since N × P(R_d) − 1 aggregation operations are needed for the N × P(R_d) elements in R_d.
The final aggregated result is decrypted at the server, and the decryption cost is C_12. Each encrypted datum in the boundary range is also decrypted at the server, with a decryption cost of C_22.
Now consider the average computational cost C_avg. In instances of homomorphic encryption, the encryption and decryption costs are often much larger than the cipher-domain aggregation cost. For example, in the ECC-based version, the main operation of encryption and decryption is scalar multiplication, while the main operation of cipher-domain aggregation is point addition. The former is far more expensive than the latter, so C_13 can also be ignored, and C_avg ≈ C_11 P(R_d) + (C_21 + C_22)(1 − P(R_d)).
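The simplified model can be evaluated directly. The sketch below (Python) computes C_avg once C_13 is dropped; the numeric costs are illustrative, with only the 3907.46 ms TinyECC encryption time taken from the evaluation later in the paper:

```python
def avg_computation_cost(c11: float, c21: float, c22: float,
                         p_rd: float) -> float:
    """C_avg ~= C11 * P(R_d) + (C21 + C22) * (1 - P(R_d)):
    homomorphic path with probability P(R_d), traditional
    encryption plus server decryption otherwise."""
    assert 0.0 <= p_rd <= 1.0
    return c11 * p_rd + (c21 + c22) * (1.0 - p_rd)

# Illustrative numbers (ms); only c11 comes from TinyECC on MICAz.
cost = avg_computation_cost(c11=3907.46, c21=50.0, c22=50.0, p_rd=0.97)
```

Since P(R_d) ≈ 1, C_avg is dominated by the homomorphic encryption term C_11, which is consistent with the bottleneck analysis below.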
There are two main computationally expensive operations, encryption and decryption, and they do not occur in the same node. Each client performs only one type of encryption operation, homomorphic or traditional. Both types of decryption occur at the server.
Since each client chooses only one of the two encryption mechanisms, each datum is encrypted only once, so the client-side computation cost is C_11 or C_21. In general, the encoded data is longer than the raw data, so C_11 > C_21, and the maximum computational cost of a client is C_11. Both types of decryption occur at the server, with a corresponding overhead of C_12 + N × P(R_b)C_22. Because N × P(R_b) is usually small and the server node has large computational power, the decryption workload is not a difficult task. Therefore, no computational bottleneck exists.
Each client encrypts only once, and the aggregation operations at each intermediate node are not compute-intensive, so the proposed scheme also achieves load balance in computation.

Statistics functions supported
In this section, we compare the proposed schemes with other data aggregation schemes in terms of supported statistics functions and encoding method. Table 2 shows the comparison result. All of them are distributed aggregation schemes, which means that intermediate nodes generate partial aggregation results from their received data.

Data sets description
The evaluation is based on six datasets gathered from different types of sensors. All of them are obtained from the TAO (Tropical Atmosphere Ocean) project. TAO is a project of NOAA (National Oceanic and Atmospheric Administration) that aims to enable real-time collection of high-quality oceanographic and surface meteorological data for monitoring, forecasting, and understanding of climate swings associated with El Niño and La Niña. Table 3 gives a general description of each dataset. Rh0n156e_hr is a dataset of relative humidity. Bp0n156e_hr is sea level pressure. W0n156e_hr is wind direction. Sst0n147e_hr and sst0n156e_hr are two datasets of sea surface temperature. Rad0n156e_hr is shortwave radiation. The 2nd column is the sample size of each dataset. The 3rd and 4th columns are skewness and kurtosis, respectively. The 5th and 6th columns are the mean and standard deviation estimated from the history record.

Effectiveness of Range Segmentation Model
In the proposed schemes, a range segmentation model is introduced to reduce the encoded vector length and thus the total communication cost, as long as P(R_b) is small enough. To achieve this, we should choose the dominant range carefully and make sure that the number of samples outside this range is small enough. Now, let's verify whether the boundary setting of the dominant range R_d is effective.
As described above, the lower bound X_L and the upper bound X_U of the dominant range R_d are determined by X_L = max(X_LE, m − βd) and X_U = min(X_UE, m + βd). X_LE and X_UE are constants defined in the TAO project; m and d are the mean and standard deviation estimated from history data, which can also be regarded as constants. Different dominant ranges can be generated by different values of β.
The proportions of data outside R_d, i.e. P(R_b), under different parameters are listed in Table 4. As β increases, P(R_b) drops significantly. For example, when β = 2.2, P(R_b) is no larger than 3.5% in all six datasets, and in two of them it even reduces to zero.
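The β-based range construction is easy to reproduce. The sketch below (Python, standard library only) derives R_d from history samples and measures the empirical P(R_b), mirroring X_L = max(X_LE, m − βd) and X_U = min(X_UE, m + βd); the unbounded defaults for X_LE and X_UE are our simplification:

```python
import statistics

def dominant_range(samples, beta, x_le=float("-inf"), x_ue=float("inf")):
    """R_d bounds: clamp m -/+ beta*d to the TAO constants
    X_LE, X_UE (defaults unbounded for illustration)."""
    m = statistics.fmean(samples)
    d = statistics.stdev(samples)
    return max(x_le, m - beta * d), min(x_ue, m + beta * d)

def p_boundary(samples, beta):
    """Empirical P(R_b): fraction of samples outside R_d."""
    lo, hi = dominant_range(samples, beta)
    return sum(1 for x in samples if x < lo or x > hi) / len(samples)
```

Sweeping β over candidate values and reading off p_boundary reproduces the kind of trade-off reported in Table 4.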
However, this does not mean that the larger β is, the better the communication performance. As β increases, the encoded vector length also increases, and the communication cost for R_d grows. We need to strike a balance between the communication costs for R_d and R_b. Fig 5 shows the relationship between the maximum packet size and the dominant range R_d setting, with a network size of 1000. We can easily see that as β increases, the maximum packet size decreases significantly, and when β > 1.8 the decrease becomes moderate. Let's analyse the reason. As β increases, the communication cost for the boundary range R_b decreases significantly. More specifically, the communication overhead saved in R_b is much larger than that added in R_d, so the total packet size still decreases significantly. When β reaches a certain value, the maximum packet size reaches its minimum, e.g., β = 1.6 for sst0n156e_hr and β = 2 for w0n156e. In some cases, when β is larger than its optimal value, P(R_b) is small enough that the communication cost for R_b can be ignored, while the cost for R_d may keep increasing as long as the bound of R_d is still within R_e, so the maximum packet size may increase mildly. When β is between 1.8 and 3, the change of the maximum packet size in each dataset is relatively small, which means that any β ∈ [1.8, 3] meets the basic requirements without significantly reducing the communication performance. This feature is very useful, as it makes the R_d setting easy.

Communication Cost of SEDAR in Different R d Setting
Although it is difficult to achieve optimal performance by setting an accurate dominant range in advance, choosing an arbitrary β ∈ [1.8, 3] still yields suboptimal performance very close to the optimal one. For example, in the following evaluations, we directly set β = 2 for all datasets and still obtain good results.

Comparison with RCDA and EERCDA
In this section, we compare SEDAR with RCDA and EERCDA. Both of them support multi-functional secure data aggregation, and a homomorphic encryption scheme is used in all of them. The main difference lies in the encoding function. Each client encrypts the collected and encoded data. Aggregation is performed directly on the cipher text at each intermediate aggregator, and decryption is performed at the server. The computation costs of encryption and decryption are near-linearly related to the encoded data length. Due to limited space, we only consider the comparison on communication cost. In these evaluations, β = 2. The comparison results are shown in Fig 6 (θ is used to characterize the intensity of data fluctuation in a given application for EERCDA). According to Fig 6, the proposed scheme is clearly superior to RCDA and EERCDA, and due to the slow growth of communication cost as N increases, it can be applied to large-scale networks.
In w0n156e_hr, bp0n156e_hr and rad0n156e_hr, when N is small, RCDA and EERCDA are better than SEDAR. In these datasets, when β = 2, the dominant range is somewhat large, so when the network size N is small, the communication cost is larger than that of RCDA and EERCDA. However, once N increases to a certain value, the advantage of our scheme becomes very obvious. In the other three datasets, the dominant range is small and most samples fall inside it when β = 2, so the proposed scheme has an absolute advantage even in small networks.
In sst0n147e_hr, w0n156e_hr and rad0n156e_hr, the average and maximum communication costs are almost identical, because most elements are in R_d. In contrast, the average and maximum communication costs differ in the other three datasets. Table 5 compares the end-to-end aggregation time. Due to limited space, we only compare SEDAR with RCDA. The evaluation is built on MICAz, which has a low-power 8-bit ATmega128L microcontroller and an IEEE 802.15.4-compliant CC2420 transceiver. The clock frequency of the ATmega128L is 8 MHz. The claimed data rate of the CC2420 is 250 kbps, but Meulenaer et al. [22] measured an effective transmission rate of 121 kbps, far below the claimed rate. In the energy models used in this paper, we use 121 kbps as the data rate for evaluating communication delay. For the computational cost evaluation, we implement the proposed scheme based on TinyECC [23]. According to its evaluation results on MICAz, the execution time for encryption is 3907.46 ms, which is similar to the one used in RCDA [14]. According to RCDA, MICAz needs 73.71 ms to aggregate two data items in the cipher domain. The comparison results show that the end-to-end aggregation time of SEDAR is much smaller than that of RCDA, and this advantage grows with the network size.
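The delay figures in Table 5 can be approximated from the quoted constants. The sketch below (Python) uses the measured 121 kbps rate and RCDA's 73.71 ms cipher-domain addition time; the per-hop model itself is our simplification, not the paper's exact accounting:

```python
def tx_delay_ms(packet_bits: int, rate_kbps: float = 121.0) -> float:
    """Transmission delay: bits / (kbit/s) gives milliseconds."""
    return packet_bits / rate_kbps

def per_hop_time_ms(packet_bits: int, n_children: int,
                    add_ms: float = 73.71) -> float:
    """One aggregation hop: forward the packet and perform
    n_children - 1 cipher-domain additions."""
    assert n_children >= 1
    return tx_delay_ms(packet_bits) + (n_children - 1) * add_ms
```

Multiplying the per-hop time by the tree depth gives a rough end-to-end estimate, and it already shows why SEDAR's shorter packets translate into much smaller aggregation times than RCDA's concatenated ones.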

Cost and Accuracy Evaluation for CEDAR
We measure the performance of CEDAR in this section. Figs 7 and 8 show the communication overhead and accuracy under different compression factors c.
According to Fig 7, as c increases, the average and maximum communication costs of each dataset are reduced to some extent. The reduction trend is not entirely consistent across datasets. When the compression factor is large, the communication cost curve flattens, which means the compression effect diminishes. According to Fig 8, as c increases, the error rates in each dataset increase. So it is necessary to strike a balance between the communication cost and the error rate.
The growth trend of the error rate is related to the initial communication cost, i.e., the corresponding communication cost in SEDAR. The error rates grow more slowly in cases with a much larger initial communication cost. For example, as shown in Fig 8, the initial communication cost of rad0n156e_hr is relatively large, and when c = 5 the error rate is still bounded by ±0.015. In sst0n147e_hr, w0n156e_hr and rh0n156e_hr, where the initial communication cost is small, the bound is close to or more than ±0.03 when c = 5. In particular, the error bound of rh0n156e_hr is greater than 0.4 when c = 4. In fact, according to Fig 7, the initial communication cost of rh0n156e_hr is the smallest, and its reduction tendency is not obvious.
Hence, we can choose a large compression factor for cases with a large initial communication cost, while for cases with a small initial communication cost we should choose a small one, or even forgo CEDAR.

Distributed Aggregation
Distributed aggregation is a traditional research topic in the database community. Kuhn and Oshman [6] studied the complexity of computing count and minimum in synchronous directed networks. Hobbs et al. [7] presented a distributed protocol to compute maximum and average under the SINR model. Cormode and Yi [8] focused on tracking the value of an aggregation function over a distributed monitoring area. Cheng et al. [24] and Li and Cheng [25] considered the approximate aggregation problem and presented (ε, δ)-approximate schemes based on Bernoulli sampling. Xie and Wang [26] and Shen et al. [27] studied network construction and message routing algorithms for data aggregation.

Secure Distributed Aggregation
Several secure distributed aggregation schemes have been proposed. Most of them focus on security itself, and only a very limited number of aggregation functions are supported. Considine et al. [9] and Roy et al. [11] proposed secure distributed aggregation schemes for duplicate-sensitive aggregation based on synopsis generation functions. Li et al. [17] and Yang et al. [18] proposed slice-mix based schemes for additive aggregation functions, which guarantee data privacy through a data "slicing and assembling" technique. Castelluccia et al. [10] and Lu et al. [19] proposed secure distributed aggregation schemes based on homomorphic encryption, which also only support summation-based statistical functions such as CNT and SUM. Agrawal et al. [28] presented an order-preserving encryption scheme; Ertaul et al. [20] and Samanthula et al. [21] applied it to secure distributed aggregation to obtain comparison-based statistics such as MAX and MIN. However, summation-based statistics are not supported in these schemes. Chien-Ming et al. [14] and Jose et al. [13] adopted encoding steps before encryption to achieve arbitrary aggregation functions. However, their encoding steps simply concatenate all sensing data without any information compression, and the communication cost is too heavy to extend to large-scale networks. Enabling operations in the cipher domain is also an important topic in cloud computing [29][30][31]. In addition to traditional encryption schemes, data privacy can be achieved by steganography [32, 33]. Besides data privacy, data authentication is also necessary. Ren et al. [34] proposed an efficient mutually verifiable provable data possession scheme. Guo et al. [35] designed a lightweight and tolerant authentication scheme to guarantee data security.

Conclusions
In this paper, we have studied the problem of multi-functional secure distributed aggregation and proposed three complementary schemes (SEDAR, REDAR and CEDAR) to solve it. The first obtains accurate aggregation results. The other two significantly reduce communication cost at the price of lower security and lower accuracy, respectively. Extensive analysis and experiments, based on six different scenes of real data, have shown that all of them achieve excellent performance.