Correction
15 Sep 2022: Kim YK, Kim HJ, Lee H, Chang JW (2022) Correction: Privacy-preserving parallel kNN classification algorithm using index-based filtering in cloud computing. PLOS ONE 17(9): e0274981. https://doi.org/10.1371/journal.pone.0274981
Abstract
With the development of cloud computing, interest in database outsourcing has recently increased. In cloud computing, it is necessary to protect the sensitive information of data owners and authorized users. For this, data mining techniques over encrypted data have been studied to protect the original database, user queries and data access patterns. The typical data mining technique is kNN classification which is widely used for data analysis and artificial intelligence. However, existing works do not provide a sufficient level of efficiency for a large amount of encrypted data. To solve this problem, in this paper, we propose a privacy-preserving parallel kNN classification algorithm. To reduce the computation cost for encryption, we propose an improved secure protocol by using an encrypted random value pool. To reduce the query processing time, we not only design a parallel algorithm, but also adopt a garbled circuit. In addition, the security analysis of the proposed algorithm is performed to prove its data protection, query protection, and access pattern protection. Through our performance evaluation, the proposed algorithm shows about 2∼25 times better performance compared with existing algorithms.
Citation: Kim Y-K, Kim H-J, Lee H, Chang J-W (2022) Privacy-preserving parallel kNN classification algorithm using index-based filtering in cloud computing. PLoS ONE 17(5): e0267908. https://doi.org/10.1371/journal.pone.0267908
Editor: Hua Wang, Victoria University, AUSTRALIA
Received: December 7, 2021; Accepted: April 18, 2022; Published: May 5, 2022
Copyright: © 2022 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The real data used are available in the following URL: (http://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29). It contains 28,056 instances with 6 attributes. The data belongs to a third party. The authors did not have any special access privileges to the data.
Funding: This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2019R1I1A3A01058375). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: No authors have competing interests.
1 Introduction
With the growing popularity of cloud computing, there has been growing interest in outsourcing databases. Cloud computing provides a service that allows internet-connected users to use virtual computing resources such as storage, computation, and network. Thus, a cloud service provider can maintain computing resources rapidly and flexibly. A data owner can reduce efforts to purchase, install, and expand computing systems, and mitigate the constraints of physical space. Cloud computing is attracting a lot of attention from individuals and companies because it can reduce the cost of system maintenance and data management, and can utilize computing resources needed without expertise. Meanwhile, we should consider three requirements in an outsourced database. First, it is necessary to protect the database because the database contains sensitive information of the data owner [1, 2]. Second, the query and the query result should not be exposed because personal information related to user preference may be uncovered. Third, data access patterns should be protected because the cloud provider is able to infer private information from the data access pattern.
Therefore, Data Mining over Encrypted Data (DMED) has been studied to protect the original database, user queries and data access patterns. Early studies replace plaintexts with substituted data and outsource them to a cloud [3–7]. However, these early studies cannot completely protect data and queries because they are vulnerable to various attacks such as chosen-plaintext attacks. To solve this problem, recent studies encrypt the database and outsource the encrypted database to the cloud [8–15]. Before a data owner outsources his/her database to a cloud service provider (cloud provider), he/she encrypts the database. The cloud provider processes the query received from an authorized user. The cloud provider can perform data management and system maintenance instead of the data owner. The authorized user can directly request the desired results from the cloud provider. The process of query processing over the outsourced database is shown in Fig 1.
Among DMED, the kNN classification algorithm is widely used for three reasons. First, the kNN classification algorithm has a relatively higher accuracy than other classification algorithms. Second, with the addition of more data, the kNN classification algorithm constantly evolves and is capable of quickly adapting to the changes in input dataset. Finally, the kNN classification algorithm gives a user a flexibility to choose a distance measure metric. Therefore, the kNN classification algorithm is used for various applications such as pattern analysis, image analysis, and user analysis [16].
Samanthula et al.’s work [16] and Kim et al.’s work [17] proposed kNN classification algorithms based on homomorphic encryption which can support various operations without decryption. Recent studies can also support data privacy, query privacy and hiding data access patterns. However, while processing the kNN classification algorithm, the recent works require high computation cost because they need to add random noise data to prevent exposure of the original data. Moreover, they require a large amount of processing time for kNN classification over the encrypted database. To the best of our knowledge, there is no existing parallel kNN classification algorithm which is suitable for processing a large amount of encrypted data.
The motivation of this paper is as follows. First, the existing algorithms suffer from high computational cost because they use an encrypted binary array to perform comparison operations. Therefore, we aim at reducing the computational cost by proposing a secure comparison protocol based on Yao’s garbled circuit. Second, the existing algorithms require a high data encryption cost. To deal with this problem, we propose an improved secure protocol that uses an encrypted random value pool. Finally, to the best of our knowledge, there is no existing parallel kNN classification algorithm. We aim at designing a parallel kNN classification algorithm for processing a large amount of encrypted data.
The contributions of this paper are as follows.
- Supporting privacy preservation: By processing queries using homomorphic encryption without data decryption, we can protect the confidentiality of both data and user’s queries while hiding data access patterns from an attacker.
- Reducing computation cost: By using the improved secure protocol based on an encrypted random value pool, we can reduce the high computation cost of the random value generation for the data encryption.
- Improving the performance of kNN classification: By proposing a new parallel kNN classification algorithm, we can reduce the amount of processing time for kNN classification.
The rest of this paper is as follows. In Section 2, we introduce the existing works on kNN classification algorithms over the encrypted database. In Section 3, we describe the overall system architecture and propose secure protocols for the proposed parallel kNN classification algorithm. In Section 4, we propose a parallel kNN classification algorithm that preserves both data and query privacy on the cloud. In Section 5, we provide the security proof of our kNN classification algorithm. In Section 6, we perform a performance analysis of the proposed algorithm. In Section 7, we describe the impact of the proposed parallel classification algorithm as a discussion. Finally, in Section 8, we conclude our paper with the future work.
2 Background and related work
2.1) Background
2.1.1 Paillier cryptosystem.
The Paillier cryptosystem is a probabilistic asymmetric algorithm for public-key cryptography [18]. In the Paillier cryptosystem, the encryption key pk is given as (N, g), where N is the product of two large prime numbers p and q, and g is a random integer in Z_{N^2}. Here, Z_{N^2} denotes the integer domain ranging from 0 to N^2 − 1. Meanwhile, the decryption key sk is given as (p, q). The Paillier cryptosystem has the following characteristics. First, the Paillier cryptosystem supports homomorphic addition and multiplication by a plaintext. Assume that the encryption function of the Paillier cryptosystem is E(·) and its decryption function is D(·). For two encrypted data E(a) and E(b), the product E(a) × E(b) is equal to E(a+b), which is the encrypted value of the plaintext a+b, as shown in Eq (1).
$$E(a + b) = E(a) \times E(b) \bmod N^{2} \qquad (1)$$
For two plaintexts a and b, the bth power of the encrypted data E(a), i.e., E(a)^b, is equal to E(a × b), which is the encrypted value of the plaintext a × b, as shown in Eq (2).
$$E(a \times b) = E(a)^{b} \bmod N^{2} \qquad (2)$$
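The two homomorphic properties in Eqs (1) and (2) can be checked numerically with a toy implementation. The following is a minimal sketch, not the authors' implementation: it assumes the common simplification g = N + 1, uses two small primes that offer no security, and derives the decryption values (lam, mu) from p and q. The function names `keygen`, `encrypt`, and `decrypt` are illustrative.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                  # assumed simplification: g = N + 1
    mu = pow(lam, -1, n)       # modular inverse of lam (Python 3.8+)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    """E(m) = g^m * r^N mod N^2 for a random r coprime to N."""
    n, g = pk
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    """D(c) = L(c^lam mod N^2) * mu mod N, with L(x) = (x - 1) / N."""
    n, _ = pk
    lam, mu = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

pk, sk = keygen()
n2 = pk[0] ** 2
ea, eb = encrypt(pk, 123), encrypt(pk, 456)
print(decrypt(pk, sk, (ea * eb) % n2))    # Eq (1): 123 + 456 = 579
print(decrypt(pk, sk, pow(ea, 456, n2)))  # Eq (2): 123 * 456 = 56088
```

Because encryption is probabilistic, two encryptions of the same plaintext differ, yet both decrypt to the same value.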
Second, the Paillier cryptosystem supports semantic security, where only negligible information about the plaintext can be feasibly extracted from the ciphertext. Specifically, any probabilistic polynomial-time algorithm (PPTA) that is given the ciphertext of a certain message m and its length cannot determine any partial information on the message with a probability non-negligibly higher than all other PPTAs that only have access to the message length [19]. This concept is the computational-complexity analogue of Shannon’s concept of perfect secrecy. Perfect secrecy means that the ciphertext reveals no information at all about the plaintext, whereas semantic security implies that any information revealed cannot be feasibly extracted.
2.1.2 Attack model.
In the outsourcing database environment, two attack models can be considered: a semi-honest attack model and a malicious attack model [20]. In the semi-honest (or honest-but-curious) attack model, the cloud performs its own protocol honestly, but attempts to obtain sensitive data about the data owner and the authorized user during the protocol execution. To prevent a semi-honest attack, sensitive data must always be protected. A malicious attack model attempts to acquire sensitive data by deviating from a given secure protocol. Because a secure protocol can be contaminated by a malicious attack, it is difficult to recover the secure protocol. To protect sensitive data against the malicious attack model, a defender focuses on detecting attacks and recovering the damaged secure protocol. Since we aim at protecting sensitive data in cloud computing, we design our algorithm based on the semi-honest attack model. A secure protocol for the semi-honest attack model is defined as follows [17].
Definition 1. Assume that ai is the input data of cloud Ci, Πi(π) is the execution image of Ci for the protocol π, and bi is the result data of Ci executing the protocol π. If the simulated execution image ΠSi(π) is computationally indistinguishable from Πi(π), the protocol π is said to be a secure protocol for the semi-honest attack model.
In Definition 1, the execution image generally includes the input data and output data of the protocol. The security of the protocol under the semi-honest attack model can be verified by showing that the protocol’s execution image does not expose the cloud’s data.
2.2) Related work
2.2.1 B. Yao et al.’s work.
B. Yao et al. proposed a secure kNN classification algorithm [21] based on a partition-based secure Voronoi diagram (SVD) [22]. The SVD relies on any standard encryption scheme E, such as public-key encryption RSA or symmetric-key encryption AES, rather than on a new encryption scheme. Because the SVD is as secure as E for any standard security model in which E is proven secure, the SVD is indistinguishable under either chosen-plaintext or chosen-ciphertext attacks. To process secure kNN classification queries, the algorithm retrieves the relevant encrypted partition instead of finding the encrypted exact k-nearest neighbors. However, most of the computations are performed locally by the end-user while processing the kNN classification query. As a result, the algorithm conflicts with the purpose of outsourcing the DBMS functionalities to the cloud. Furthermore, the algorithm leaks data access patterns to the cloud, such as the partition ID corresponding to a user query.
2.2.2 B. K. Samanthula et al.’s work.
B. K. Samanthula et al. proposed a secure kNN classification algorithm, denoted by PPkNN, over encrypted data in the cloud [16]. PPkNN can protect the confidentiality of the data, the user’s input query, and data access patterns. PPkNN mainly consists of two stages: the secure retrieval of k-nearest neighbors and the secure computation of the majority class. In the secure retrieval of k-nearest neighbors, a query user initially sends his query q (in encrypted form) to C1. Then, C1 and C2 engage in a set of sub-protocols to securely retrieve the class labels corresponding to the k-nearest neighbors of the input query q. At the end of this step, the encrypted class labels of the k-nearest neighbors are known only to C1. In the secure computation of the majority class, C1 and C2 jointly compute the class label with majority voting among the k-nearest neighbors of q. At the end of this step, only the query user knows the class label corresponding to the input query record q. However, PPkNN requires a very high computation cost for hiding data access patterns.
2.2.3 H. Kim et al.’s work.
H. Kim et al. proposed a secure kNN classification algorithm which uses both the Paillier cryptosystem and an encrypted kd-tree index [17]. The Paillier cryptosystem is a homomorphic encryption scheme which is indistinguishable under either chosen-plaintext or chosen-ciphertext attacks, so that the cloud can process the kNN classification queries without decrypting any data or a user’s query. Before outsourcing data to the cloud, a data owner builds a kd-tree index and encrypts both the original database and the leaf nodes of the kd-tree index. Therefore, the algorithm can protect the data, the query and the data access pattern. By using the encrypted kd-tree index, the algorithm can reduce the amount of query processing time. However, because the algorithm must generate encrypted random values for privacy preservation, it requires a high computation cost.
2.2.4 W. Wu et al.’s work.
W. Wu et al. proposed a privacy-preserving kNN classification algorithm over an encrypted database in outsourced cloud environments [23]. The algorithm newly generates unique classification label keys for each user through a secure three-party protocol. The keys are used to re-encrypt the labels into new ciphertexts that can only be decrypted by the corresponding user. The algorithm hides the data access patterns from a federated cloud server which performs the process of kNN classification by using two non-colluding clouds. However, the algorithm conflicts with the purpose of outsourcing the DBMS functionalities to the cloud because both the data owner and authorized users must participate in the process of label re-encryption.
2.2.5 Y. Tan et al.’s work.
Y. Tan et al. proposed a lightweight edge-based privacy-preserving kNN classification algorithm over a hybrid encrypted cloud database [24]. A data owner can upload his/her database to the cloud server, and an authorized user can send a query to the cloud server to execute kNN queries. The algorithm is performed against the semi-honest attack model. After the query is sent, the authorized user does not need to participate in the kNN classification. They also proposed a secure distance protocol in which the cloud servers cannot derive any private information from the authorized user. Compared with the SIP protocol in the state-of-the-art PPKC algorithm [16], the proposed secure distance protocol has less corrupted computation.
2.2.6 J. Du and F. Bian’s work.
J. Du and F. Bian proposed a non-interactive and efficient privacy-preserving kNN classification algorithm [25]. The algorithm is performed against the semi-honest attack model. To achieve privacy preservation, the algorithm encrypts all outsourced data and users’ query records by using two encryption schemes: order preserving encryption [26] and the Paillier cryptosystem [16]. To hide the data access pattern, the information in the cloud server is always maintained in ciphertext format. In terms of classification accuracy, the algorithm is proven to be very close to one using both plaintext data and the non-interactive encrypted data query scheme.
Table 1 shows the comparison of the existing studies. We explain their comparison with respect to three major factors. First, B. K. Samanthula et al.’s work [16], H. Kim et al.’s work [17], W. Wu et al.’s work [23] and Y. Tan et al.’s work [24] support hiding access pattern, while B. Yao et al.’s work [21] and J. Du and F. Bian’s work [25] do not support it. Second, W. Wu et al.’s work and Y. Tan et al.’s work require low computation overhead while B. K. Samanthula et al.’s work and H. Kim et al.’s work need high computation overhead. Finally, B. Yao et al.’s work, B. K. Samanthula et al.’s work, H. Kim et al.’s work and W. Wu et al.’s work have low risk in terms of security, while Y. Tan et al.’s work and J. Du and F. Bian’s work have high risk in terms of security.
3 Overall system architecture and secure protocols
3.1) System architecture
In the outsourcing database environment, two attack models can be considered: a malicious attack model and a semi-honest attack model [20]. In a malicious attack model, the cloud can deviate from the protocol procedure. A protocol against the malicious attack model is inefficient because it requires an exceedingly high cost. In the semi-honest attack model, the cloud correctly follows the given protocol, but tries to acquire the sensitive information of both the data owner and the query issuer. By contrast, a protocol against the semi-honest attack model is practical because the cloud has a higher level of authority than outside attackers. Therefore, following earlier work [16, 17], we also adopt the semi-honest attack model. Table 2 shows a list of notations used in this paper. Our system architecture supports secure protocols between clouds by performing Secure Multiparty Computation (SMC). SMC is based on multi-party data processing in which several entities cooperate to perform calculations for deriving specific results. For this, the following factors must be satisfied to achieve the result of secure protocols while avoiding data leakage.
3.1.1 Input privacy.
No information about private data held by multiple parties can be inferred from the messages sent during the protocol execution. The only information that can be inferred about private data is whatever could be inferred from seeing the output of the function alone.
3.1.2 Correctness.
Any proper subset of adversarial colluding parties that is willing to share information or deviate from the instructions during the protocol execution should not be able to force honest parties to output an incorrect result. This correctness goal comes in two categories: either the honest parties are guaranteed to compute the correct output (a robust SMC protocol), or the honest parties abort if they find an error (an SMC protocol with abort).
Fig 2 shows the overall system architecture. The data owner holds the original database T consisting of n records ti (1 ≤ i ≤ n). Each record ti includes m attributes (or columns) and one label. Here, we denote the jth attribute of the ith record by ti,j (1 ≤ i ≤ n, 1 ≤ j ≤ m + 1). First, the data owner partitions the original data by using the kd-tree index. Assuming that the level of the constructed kd-tree is h, the total number of leaf nodes is 2^{h−1}. In each leaf node, every attribute stores its region information, i.e., a lower bound lbz,j and an upper bound ubz,j, where 1 ≤ z ≤ 2^{h−1} and 1 ≤ j ≤ m. Second, the data owner generates an encryption public key (pk) and a decryption secret key (sk) based on the Paillier cryptosystem [18]. Third, the data owner encrypts the database with the Paillier cryptosystem to protect the original data. Because the unit of the encryption is the attribute of each record, E(ti,j) (1 ≤ i ≤ n, 1 ≤ j ≤ m + 1) is generated. Finally, the leaf nodes of the constructed kd-tree are encrypted because the data owner needs to protect the data access pattern. Because the unit of the encryption is the attribute of each leaf node, E(lbz,j) and E(ubz,j) are generated (1 ≤ z ≤ 2^{h−1}, 1 ≤ j ≤ m).
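The partitioning step performed by the data owner can be sketched in plaintext as follows. This is an illustration only: it assumes a median split that cycles through the attributes and at least 2^{h−1} records, neither of which is specified in the text, and the name `build_leaves` is hypothetical. It returns the 2^{h−1} leaf regions with their per-attribute lower and upper bounds, which the data owner would then encrypt.

```python
def build_leaves(points, h, lb=None, ub=None, depth=0):
    """Split points into 2**(h-1) leaf regions, each with bounds lb[j], ub[j]."""
    if lb is None:  # root region: bounding box of all points
        m = len(points[0])
        lb = [min(p[j] for p in points) for j in range(m)]
        ub = [max(p[j] for p in points) for j in range(m)]
    if depth == h - 1:  # a tree of level h has its leaves at depth h-1
        return [{"lb": lb, "ub": ub, "points": points}]
    j = depth % len(lb)                       # attribute used for this split
    pts = sorted(points, key=lambda p: p[j])
    mid = len(pts) // 2
    split = pts[mid][j]                       # median value on attribute j
    left_ub = ub[:j] + [split] + ub[j + 1:]
    right_lb = lb[:j] + [split] + lb[j + 1:]
    return (build_leaves(pts[:mid], h, lb, left_ub, depth + 1)
            + build_leaves(pts[mid:], h, right_lb, ub, depth + 1))

records = [(1, 2), (3, 4), (5, 1), (7, 8), (2, 6), (6, 3), (8, 5), (4, 7)]
for leaf in build_leaves(records, h=3):
    print(leaf["lb"], leaf["ub"], leaf["points"])
```

With h = 3 the sketch yields four leaves; only the bounds of each leaf, not its position in the tree, are needed by the later range checks.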
3.2) Secure protocols
3.2.1 Encrypted random value pool.
To support data privacy in a cloud computing environment, the existing works [16, 17] prevent CB from extracting meaningful information (Fig 2) while executing a secure protocol by using the Paillier cryptosystem. However, they require a high computation cost because the secure protocol generates an encrypted random value for protecting the original data. Therefore, we propose an encrypted random value pool to reduce the computation cost for encryption. Before CA processes a query (Fig 2), we generate random plaintexts from ZN and store their encryptions in an encrypted random value pool. While processing a query in CA, a random ciphertext is selected from the encrypted random value pool whenever a secure protocol is called. Therefore, while processing a secure protocol, CA not only prevents CB from extracting meaningful information, but also reduces the cost of generating encrypted random values. Table 3 shows a comparison of the number of data encryptions for each secure protocol in our work and existing works [16, 17]. The Secure Multiplication protocol of the existing works requires three times as many encryptions as our work. The Secure Compare protocol used in B. K. Samanthula et al.’s work requires log_2 D times as many encryptions as our work, while the one used in H. Kim et al.’s work requires three times as many encryptions as our work.
3.2.2 Secure multiplication protocol using an encrypted random value pool.
We propose a Secure Multiplication protocol using an Encrypted random value pool (SME protocol) which multiplies two encrypted values E(α) and E(β). Algorithm 1 shows the SME protocol. First, when two encrypted values E(α) and E(β) are given as inputs, CA selects two random values E(ra) and E(rb) from the encrypted random value pool (line 1). Second, CA calculates E(α + ra) and E(β + rb) by using Eq (1), then sends them to CB (line 2∼3). Third, CB decrypts E(α + ra) and E(β + rb) by using the secret key and calculates the product of the two plaintexts α + ra and β + rb (line 4). Fourth, CB encrypts (α + ra) × (β + rb) and sends it to CA (line 5). Finally, CA obtains E(α × β) by removing α × rb, β × ra and ra × rb from the received value, where ‘N−x’ in the ZN domain is the same as ‘−x’ (line 6).
Algorithm 1 SME Protocol
Input: E(α), E(β)
Output: E(α × β)
CA:
1: Pick random value E(ra) and E(rb) in the encrypted random value pool
2: E(α′)←E(α) × E(ra);E(β′)←E(β) × E(rb)
3: Send E(α′), E(β′) to CB
CB:
4: h ← D(E(α′)) × D(E(β′)) mod N // h = α × β + α × rb + β × ra + ra × rb
5: Send E(h) to CA
CA:
6: E(α × β) ← E(h) × E(α)^{N−rb} × E(β)^{N−ra} × E(ra × rb)^{N−1}
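The SME protocol can be simulated end to end in a single process. The sketch below is illustrative rather than the authors' code: it assumes a toy Paillier instance with g = N + 1 and small, insecure primes, assumes that the pool stores each random plaintext together with its ciphertext so that CA can remove the cross terms, and computes E(−ra × rb) as E(ra)^{N−rb} via Eq (2) instead of a fresh encryption. The names `enc`, `dec`, `POOL`, and `sme` are hypothetical.

```python
import math
import random

# Toy Paillier instance (insecure parameters, illustration only).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # valid because g = N + 1 is assumed

def enc(m):
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2

def dec(c):
    return ((pow(c, LAM, N2) - 1) // N * MU) % N

# Encrypted random value pool, filled once before any query is processed.
# Assumption: each entry keeps the plaintext r next to E(r).
POOL = []
for _ in range(64):
    r = random.randrange(1, N)
    POOL.append((r, enc(r)))

def sme(e_alpha, e_beta):
    # CA: mask both inputs with pooled values, no new encryption needed.
    (ra, e_ra), (rb, e_rb) = random.sample(POOL, 2)
    e_a1 = (e_alpha * e_ra) % N2          # E(alpha + ra), Eq (1)
    e_b1 = (e_beta * e_rb) % N2           # E(beta + rb), Eq (1)
    # CB: decrypt, multiply the masked plaintexts, re-encrypt.
    h = (dec(e_a1) * dec(e_b1)) % N
    e_h = enc(h)
    # CA: strip alpha*rb, beta*ra and ra*rb; exponent N - x acts as -x.
    out = (e_h * pow(e_alpha, N - rb, N2)) % N2
    out = (out * pow(e_beta, N - ra, N2)) % N2
    out = (out * pow(e_ra, N - rb, N2)) % N2
    return out

print(dec(sme(enc(12), enc(34))))  # 12 * 34 = 408
```

In this sketch the only encryption performed during the call is the one by CB; the masking values on the CA side come from the precomputed pool.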
3.2.3 Garbled secure compare protocol using encrypted random value pool.
We propose the Garbled Secure Compare protocol using an Encrypted random value pool (GSCE protocol), which is performed by using a garbled circuit consisting of two ADD gates and one CMP gate [27]. Assume that E(u) and E(v) are the ciphertexts of two plaintexts u and v. When E(u) and E(v) are given to CA, the GSCE protocol returns E(1) if u ≤ v is satisfied; otherwise it returns E(0). Algorithm 2 shows the GSCE protocol. First, CA selects two random values E(ru) and E(rv) from the encrypted random value pool (line 1). Second, CA calculates E(m1) = E(u)^2 × E(ru) and E(m2) = E(v)^2 × E(1) × E(rv), so that m1 = 2u + ru and m2 = 2v + 1 + rv (line 1∼2). Third, CA randomly selects one of two random functions, i.e., F0 and F1. The selected random function is not disclosed to CB. If CA selects F0, CA sends an encrypted ordered pair <E(m2), E(m1)> to CB. If CA selects F1, CA sends an encrypted ordered pair <E(m1), E(m2)> to CB (line 3∼7). Fourth, CB decrypts the data received from CA (line 8∼11). When CA selects F0, CB acquires an ordered pair <m2, m1>; otherwise CB acquires an ordered pair <m1, m2>. Fifth, CA creates a garbled circuit consisting of two ADD gates and one CMP gate. If F0 is selected, −rv and −ru are transferred to the first ADD gate and the second ADD gate, respectively. Otherwise, −ru and −rv are transferred to the first and the second ADD gates, respectively (line 12∼16). Sixth, CB transfers the first data to the first ADD gate, and the second data to the second ADD gate. Therefore, when F0 is selected, CB transfers m2 and m1 to the first and the second ADD gates, respectively. Otherwise, m1 and m2 are transferred to the first and the second ADD gates, respectively (line 17∼20). Seventh, the first ADD gate adds two input values: −rv and m2 for F0, and −ru and m1 for F1. The result of the first ADD gate (result1) is transferred to the CMP gate (line 21∼24). Eighth, the second ADD gate adds two input values: −ru and m1 for F0, and −rv and m2 for F1. The result of the second ADD gate (result2) is transferred to the CMP gate (line 25∼28). Due to the characteristics of the garbled circuit, no information is exposed in the ADD gates. Ninth, the CMP gate returns α = 1 if result1 ≤ result2, and α = 0 otherwise (line 29∼30). Finally, the result α can be checked on the CB side, and CB transmits E(α) to CA (line 31). Because CB does not know whether F0 or F1 is selected by CA, CB cannot determine the result of the comparison of E(u) and E(v). When F0 is selected, CA inverts E(α) through the SBN protocol [11] and returns E(α) (line 32∼34). Here, CA cannot obtain the actual value of α due to the characteristics of the Paillier cryptosystem.
Algorithm 2 GSCE Protocol
Input: E(u), E(v)
Output: E(1) when u ≤ v, E(0) otherwise
CA:
01: Pick random values E(ru) and E(rv) from the encrypted random value pool
02: E(m1)←E(u)^2 × E(ru); E(m2)←E(v)^2 × E(1) × E(rv)
03: Randomly choose F0 or F1
04: If F0: u > v is chosen, then
05: Send <E(m2), E(m1)> to CB
06: else
07: Send <E(m1), E(m2)> to CB
CB:
08: If F0: u > v is chosen, then
09: Obtain <m2, m1> by decrypting <E(m2), E(m1)>
10: else
11: Obtain <m1, m2> by decrypting <E(m1), E(m2)>
CA:
12: Generate garbled circuit
13: If F0: u > v is chosen, then
14: Put −rv and −ru into 1st and 2nd ADD gates
15: else
16: Put −ru and −rv into 1st and 2nd ADD gates
CB:
17: If F0: u > v is chosen, then
18: Put m2 and m1 into 1st and 2nd ADD gates
19: else
20: Put m1 and m2 into 1st and 2nd ADD gates
1st ADD Gate:
21: If F0: u > v is chosen, then
22: result1 ← −rv + m2
23: else
24: result1 ← −ru + m1
2nd ADD Gate:
25: If F0: u > v is chosen, then
26: result2 ← −ru + m1
27: else
28: result2 ← −rv + m2
CMP Gate:
29: If result1 ≤ result2, then output α = 1 to CB
30: else output α = 0 to CB
CB:
31: E(α)← encrypt α
CA:
32: If F0: u > v is chosen, then
33: E(α)←SBN(E(α))
34: Return E(α)
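The arithmetic behind the GSCE protocol can be checked with a plaintext simulation. The sketch below leaves out both the Paillier encryption and the garbled circuit, so it demonstrates only why the masking, the random swap, and the final bit inversion yield 1 exactly when u ≤ v; it is not a secure implementation. It follows the CMP convention stated in the prose (α = 1 if result1 ≤ result2), and the name `gsce_plain` and the range of the random masks are assumptions.

```python
import random

def gsce_plain(u, v, rng=random):
    """Plaintext walk-through of the GSCE steps; returns 1 if u <= v else 0."""
    ru, rv = rng.randrange(1, 10**6), rng.randrange(1, 10**6)
    m1 = 2 * u + ru        # plaintext of E(u)^2 * E(ru)
    m2 = 2 * v + 1 + rv    # plaintext of E(v)^2 * E(1) * E(rv)

    f0 = rng.random() < 0.5  # CA's secret choice between F0 and F1
    # CB decrypts the pair in the order CA sent it and feeds the ADD gates.
    first, second = (m2, m1) if f0 else (m1, m2)
    # CA feeds the matching negated masks to the ADD gates.
    neg1, neg2 = (-rv, -ru) if f0 else (-ru, -rv)

    result1 = neg1 + first    # 1st ADD gate: 2v + 1 under F0, 2u under F1
    result2 = neg2 + second   # 2nd ADD gate: 2u under F0, 2v + 1 under F1
    alpha = 1 if result1 <= result2 else 0   # CMP gate

    if f0:
        alpha = 1 - alpha     # SBN: CA inverts the bit when F0 was chosen
    return alpha

print(gsce_plain(3, 7), gsce_plain(7, 3), gsce_plain(5, 5))
```

Doubling u and v and adding 1 on the v side means the two unmasked values can never be equal, so the comparison is strict in one direction and the inverted bit under F0 is always the correct answer.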
4 Privacy-preserving parallel kNN classification algorithm using index filtering
The proposed parallel kNN classification algorithm can support the protection of data, query, and data access pattern in a cloud computing environment. For this, the proposed privacy-preserving parallel kNN classification algorithm is composed of four phases: secure index search, k-nearest neighbors search, kNN verification, and kNN classification, as shown in Fig 3.
4.1) Secure index search phase
In the secure index search phase, the proposed algorithm determines the leaf node which includes the given query in the encrypted kd-tree. The procedure of the secure index search is shown in Algorithm 3. First, CA divides the leaf nodes into partitions of t nodes each and allocates one partition to each thread (line 1). Here, t is calculated by dividing the number of leaf nodes by the number of threads. Second, by using the GSRO protocol, the algorithm finds which leaf node includes the query in each thread. If a node includes the query, the GSRO protocol returns E(1); otherwise the protocol returns E(0). The result of the GSRO protocol is stored in an array E(α). The algorithm randomly reorders the members of the array E(α) and transfers the reordered array E(α′) to CB (line 2∼7). Third, CB decrypts the array E(α′) and makes groups by allocating the decrypted members uniformly based on the number of 1s. If a node has the decrypted value of 1, it becomes the seed of a group. CB sends the groups to CA (line 8∼15). Finally, CA extracts all the encrypted data in the node corresponding to E(1). If a node has E(1), the algorithm can safely extract the data of the node because the node includes the query. Otherwise, the algorithm can remove the data of the node because it does not include the query (line 16∼30).
Algorithm 3 Secure Index Search
Input:
Output:
CA:
01: t = NumNode/NumThread
02: Run thread
03: for 1 ≤ i ≤ NumThread
04: for t × (i − 1)≤j ≤ t × i
05: E(αj) = GSRO(E(q), E(q), E(rangej.lb), E(rangej.ub))
06: Terminate thread
07: E(α′) = ∏(E(α)); Send E(α′) to CB
CB:
08: α′ = D(E(α′))
09: c = the number of 1s in α′
10: Create c number of node groups
11: for each node group
12: assign a node with α′ = 1
13: assign (numnode/c) − 1 nodes with α′ = 0
14: shuffle the sequence of nodes
15: Send node group to CA
CA:
16: cnt = 0
17: for each node group
18: permute node IDs using Π^{−1}
19: t = F/NumThread
20: Run thread
21: for 1 ≤ i ≤ NumThread
22: for each node group
23: for t × (i − 1)≤s ≤ t × i
24: for 1 ≤ z ≤ num //num is # of nodes in the selected group
25: for 1 ≤ j ≤ m + 1
26: for 1 ≤ j ≤ m + 1
27:
28: cnt ← cnt + 1
29: Terminate thread
30: return E(cand)
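The grouping step performed by CB (lines 08∼15 of Algorithm 3) can be sketched in plaintext as follows. This is an illustration under stated assumptions: the input is the already decrypted and permuted array α′, at least one entry equals 1, and the number of nodes is divisible by the number of 1s. The function name `make_node_groups` is hypothetical.

```python
import random

def make_node_groups(alpha, rng=random):
    """Build one group per node with alpha' = 1, padded with alpha' = 0 nodes."""
    ones = [i for i, a in enumerate(alpha) if a == 1]
    zeros = [i for i, a in enumerate(alpha) if a == 0]
    if not ones:
        raise ValueError("no leaf node contains the query")
    rng.shuffle(zeros)
    c = len(ones)                    # number of groups
    per_group = len(alpha) // c - 1  # (numnode / c) - 1 zero nodes per group
    groups = []
    for g, seed in enumerate(ones):
        group = [seed] + zeros[g * per_group:(g + 1) * per_group]
        rng.shuffle(group)           # hide the position of the seed node
        groups.append(group)
    return groups

# Positions refer to the permuted node order that CB sees, not real node IDs.
print(make_node_groups([0, 1, 0, 0, 1, 0, 0, 0], random.Random(0)))
```

Each group contains exactly one node that includes the query, so CA can later combine the nodes of a group homomorphically without learning which member was the seed.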
4.2) k-Nearest neighbors search phase
In the k-nearest neighbors search phase, our algorithm finds the k nearest points among the encrypted candidates which are extracted from the index search phase. The procedure of the k-nearest neighbors search is shown in Algorithm 4. First, CA calculates the squared Euclidean distance set E(di) (1 ≤ i ≤ cnt, where cnt is the number of candidates) between the query and the encrypted candidates through the ESSED protocol [17] in a parallel way (line 1∼6). Second, CA finds the minimum value E(dmin) among E(di) (1 ≤ i ≤ cnt) through the SMSn protocol [16] (line 8∼10). Additionally, CA calculates the difference between E(dmin) and E(di) (1 ≤ i ≤ cnt) by using E(dmin) × E(di)^{N−1}, and stores the results into an array E(τi) (1 ≤ i ≤ cnt). CA randomizes each E(τi) by raising it to the power of a random integer, makes E(βi) (1 ≤ i ≤ cnt) by applying a shuffling function π to the randomized values, and sends E(βi) to CB (line 11∼18). Therefore, the original distances and data access patterns are protected from CB. Third, if the ith decrypted value of E(βi) (1 ≤ i ≤ cnt) is 0, CB sets the ith value of a temporary array E(Ui) (1 ≤ i ≤ cnt) to E(1). Otherwise, CB sets the ith value of E(Ui) to E(0). CB sends E(Ui) (1 ≤ i ≤ cnt) to CA (line 19∼22). Fourth, CA makes E(Vi) (1 ≤ i ≤ cnt) by applying a deshuffling function π^{−1} to E(Ui) (1 ≤ i ≤ cnt). CA performs the SM protocol between E(Vi) (1 ≤ i ≤ cnt) and E(candi,j) (1 ≤ i ≤ cnt and 1 ≤ j ≤ m + 1, where m is the data dimension) and stores the results of the SM protocol in a temporary array. Next, CA calculates Eq 3 by using Eq 1 (line 23∼31). Fifth, if the algorithm has not yet found k nearest neighbors, CA updates E(di) (1 ≤ i ≤ cnt) by calculating Eq 4 in a parallel way, where E(max) is the maximum value of the data domain (line 32∼38). If E(Vi) equals E(1), E(di) corresponding to E(Vi) is updated to E(max) through Eq 4. Otherwise, E(di) corresponding to E(Vi) is maintained. Finally, CA terminates the k-nearest neighbors search phase if k nearest neighbors are found (line 39).
(3)
$$E(d_i) = \mathrm{SM}(E(V_i), E(max)) \times \mathrm{SM}(\mathrm{SBN}(E(V_i)), E(d_i)) \tag{4}$$
Algorithm 4 k-nearest neighbor search phase
Input: E(q), E(cand), k
Output: E(t′) // candidate kNN result
CA:
01: Run thread
02: t = NumNode/NumThread
03: for 1 ≤ i ≤ NumThread
04: for t × (i − 1)≤j ≤ t × i
05: E(dj) = ESSED(E(q), E(candj))
06: Terminate thread
07: for 1 ≤ s ≤ k
08: Run thread
09: E(dmin) = SMSn(E(d1), …, E(dcnt))
10: Terminate thread
11: Run thread
12: t = cnt/NumThread
13: for 1 ≤ i ≤ NumThread
14: for t × (i − 1)≤j ≤ t × i
15: E(τj) = E(dmin) × E(dj)^(N−1)
16:
17: Terminate thread
18: E(β)←π(τj); Send E(β) to CB
CB:
19: for 1 ≤ i ≤ cnt
20: If D(E(βi)) = 0, then E(Ui)←E(1)
21: Else E(Ui)←E(0)
22: Send E(U) to CA
CA:
23: E(V)←π^(−1)(E(U))
24: Run thread
25: t = cnt/NumThread
26: for 1 ≤ u ≤ NumThread
27: for t × (u − 1)≤i ≤ t × u
28: for 1 ≤ j ≤ m + 1
29:
30:
31: Terminate thread
32: Run thread
33: t = cnt/NumThread
34: for 1 ≤ i ≤ NumThread
35: for t × (i − 1)≤j ≤ t × i
36: If s < k then,
37: E(dj) = SM(E(Vj), E(max)) × SM(SBN(E(Vj)), E(dj))
38: Terminate thread
39: return E(t′)
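As an illustration only, the selection loop of Algorithm 4 can be sketched on plaintext values. The sketch below is not the secure protocol: it replaces ESSED, SMSn, SM and SBN with ordinary arithmetic and omits the random exponent and the shuffling function π, which only hide values from CB. The function name `knn_search_plain`, the tuple layout (m coordinates followed by a label) and the tie-breaking rule (the first minimum is selected) are assumptions made for this example.

```python
def knn_search_plain(query, candidates, k, max_dist):
    """Plaintext analogue of the selection loop in Algorithm 4 (illustrative)."""
    # Lines 1-6: squared Euclidean distance over the m query coordinates.
    # zip() stops at len(query), so a trailing label column is ignored here.
    dists = [sum((q - c) ** 2 for q, c in zip(query, cand)) for cand in candidates]
    width = len(candidates[0])  # m + 1 columns: coordinates plus label
    result = []
    for _ in range(k):
        d_min = min(dists)                # line 9: minimum distance
        tau = [d_min - d for d in dists]  # line 15: zero only at a minimum
        # Lines 19-23: indicator vector V (assumption: first zero wins on ties).
        first = tau.index(0)
        v = [1 if i == first else 0 for i in range(len(tau))]
        # Lines 24-31: pick the candidate as a sum of products, without branching.
        chosen = [sum(v[i] * candidates[i][j] for i in range(len(candidates)))
                  for j in range(width)]
        result.append(chosen)
        # Line 37: d_j = V_j * max + (1 - V_j) * d_j excludes the selected point.
        dists = [v[i] * max_dist + (1 - v[i]) * dists[i] for i in range(len(dists))]
    return result
```

For example, with the query (0, 0) and four labelled candidates, `knn_search_plain((0, 0), [(5, 5, 1), (1, 1, 0), (2, 0, 1), (9, 9, 0)], 2, 10**6)` selects (1, 1, 0) first and (2, 0, 1) second, because the distance of the first selection is overwritten with the maximum value before the next round.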
4.3) k-Nearest neighbors verification phase
In the k-nearest neighbors verification phase, the algorithm verifies whether the distance between a node and the query (E(q) = <E(q1), E(q2), …, E(qm)>, where m is the data dimension) is shorter than the distance E(distk) between the query and the kth nearest neighbor. The procedure of the k-nearest neighbors verification phase is shown in Algorithm 5. First, CA calculates E(distk) between E(q) and the kth nearest neighbor using the ESSED protocol (line 1). Second, the algorithm performs the GSCE protocol between E(qj) and the lower bound of nodez, E(nodez.lbj) (1 ≤ z ≤ numnode), for each dimension j (1 ≤ j ≤ m), and stores the result into E(ψ1,j). If E(qj) is less than or equal to E(nodez.lbj), E(ψ1,j) is E(1). Then, the algorithm performs the GSCE protocol between E(qj) and the upper bound of nodez, E(nodez.ubj) (1 ≤ z ≤ numnode), for each dimension j, and stores the result into E(ψ2,j) (line 2∼5). If E(qj) is less than or equal to E(nodez.ubj), E(ψ2,j) is E(1). Third, the algorithm performs the SBXOR protocol [16] between E(ψ1,j) and E(ψ2,j), and stores the result into E(ψ3,j) (line 6). Fourth, the algorithm calculates the shortest point of nodez (1 ≤ z ≤ numnode), E(spz) = <E(spz,1), E(spz,2), …, E(spz,m)>, where m is the data dimension, by using Eqs 5 and 6 (line 7∼10).
(5)
(6)
Fifth, CA calculates the squared Euclidean distance between E(q) and E(spz) (1 ≤ z ≤ numnode) through the ESSED protocol and stores the result as the shortest distance of nodez, E(spdistz) (line 11). In addition, CA updates E(spdistz) by using Eq 7 (line 12∼13). E(αz) in Eq 7 is the result of the GSRO protocol in the secure index search phase. Because Eq 7 sets the shortest distance of a node that was already searched in the previous phase to E(max), this update avoids searching that node again.
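$$E(spdist_z) = \mathrm{SM}(E(\alpha_z), E(max)) \times \mathrm{SM}(\mathrm{SBN}(E(\alpha_z)), E(spdist_z)) \tag{7}$$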
Sixth, CA performs the GSCE protocol between E(spdistz) and E(distk), and stores the result into E(αz) (line 14). If E(spdistz) is less than E(distk), nodez needs additional searching. Finally, by performing lines 9∼33 of the secure index search phase, CA extracts the encrypted data belonging to nodez and adds them to E(t′). In addition, CA obtains the kNN result array, E(resulti) (1 ≤ i ≤ k), by performing the k-nearest neighbors search phase (line 15∼17). CA stores the labels of the k-nearest neighbors into a label array (line 18∼19).
Algorithm 5 k-nearest neighbors verification phase
Input: E(q), E(node), E(t′), k
Output: result
CA:
01:
02: for 1 ≤ z ≤ numnode
03: for 1 ≤ j ≤ m
04: E(ψ1) = GSCMP(E(qj), E(nodez.lbj))
05: E(ψ2) = GSCMP(E(qj), E(nodez.ubj))
06: E(ψ3) = SBXOR(E(ψ1), E(ψ2))
07: E(temp) = SM(E(ψ1), E(nodez.lbj))
08: E(temp)←E(temp) × SM(SBN(E(ψ1)), E(nodez.ubj))
09: E(temp) = SM(E(temp), SBN(E(ψ3)))
10: E(spz,j) = E(temp) × SM(E(ψ3), E(qj))
11: E(spdistz) = ESSED(E(q), E(spz))
12: E(temp) = SM(E(αz), E(max))
13: E(spdistz) = E(temp) × SM(SBN(E(αz)), E(spdistz))
14: E(αz)←GSCMP(E(spdistz), E(distk))
15: E(t″)← perform lines 7∼36 of the secure index search phase (Algorithm 3)
16: E(t′)← append E(t″) to E(t′)
17: result ← perform the k-nearest neighbors search phase (Algorithm 4)
18: for 1 ≤ i ≤ k
19:
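The shortest-point computation in lines 4∼10 of Algorithm 5 can be illustrated on plaintext values. The sketch below is a reading of those lines under stated assumptions, not the secure protocol itself: the comparison, SBXOR, SBN and SM protocols are replaced by ordinary integer operations, lb ≤ ub is assumed for every dimension, and the indicator ψ3 (rather than ψ1) is assumed to gate the query coordinate in the last step. The function names are hypothetical.

```python
def shortest_point_coord(q, lb, ub):
    """One coordinate of the point of a node's bounding box closest to q."""
    psi1 = 1 if q <= lb else 0          # line 4: q is at or below the lower bound
    psi2 = 1 if q <= ub else 0          # line 5: q is at or below the upper bound
    psi3 = psi1 ^ psi2                  # line 6: 1 only when lb < q <= ub
    temp = psi1 * lb + (1 - psi1) * ub  # lines 7-8: nearer bound when q is outside
    temp = temp * (1 - psi3)            # line 9: drop the bound when q is inside
    return temp + psi3 * q              # line 10: keep q itself when it is inside


def shortest_dist(query, lbs, ubs):
    """Squared Euclidean distance from the query to the node's bounding box."""
    sp = [shortest_point_coord(q, lb, ub) for q, lb, ub in zip(query, lbs, ubs)]
    return sum((q - s) ** 2 for q, s in zip(query, sp))
```

Under these assumptions each coordinate equals the query value clamped to the interval [lb, ub], so a query inside the node has a shortest distance of 0 and the node is always searched.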
4.4) k-Nearest neighbors classification phase
In the kNN classification phase, the algorithm extracts the most frequent label from the labels of the k-nearest neighbors. The procedure of the kNN classification phase is shown in Algorithm 6. CA and CB calculate the frequency of each label by using the secure frequency protocol [17] (line 1). The label with the highest frequency is selected (line 2). CA adds a random integer rq to the selected label and stores the result into a temporary variable E(λq) (line 3). CA sends E(λq) to CB and rq to AU (line 4). CB decrypts E(λq) and sends the result to AU (line 5∼6). AU obtains the final result by combining the values received from CA and CB (line 7∼8).
Algorithm 6 kNN classification
Input:
Output: E(Lq)
CA and CB:
01: <E(f(L1)), …, E(f(Lw))> = SF(Δ, Δ′), where
02: (f(max), E(Lq)) = SXSw(<E(f(L1)), …, E(f(Lw)) >, < E(L1), …, E(Lw)>)
CA:
03: E(λq) = E(cq) × E(rq), where rq ∈ ZN
04: Send E(λq) to CB and rq to AU
CB:
05: Receive E(λq) from CA
06: Send λq ← D(E(λq)) to AU
AU:
07: Receive rq from CA and λq from CB
08:
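The masking used in lines 3∼8 of Algorithm 6 can be illustrated with plain modular arithmetic. This is a sketch under an assumption: line 08 is not legible in the listing, so we assume that AU removes the mask by subtracting rq modulo N. The function names and the toy modulus are hypothetical, and encryption is omitted so that only the arithmetic is shown.

```python
import secrets


def mask_label(label, n):
    """CA side (line 3): add a random r from Z_N to the selected label."""
    r = secrets.randbelow(n)
    return (label + r) % n, r


def unmask_label(lam, r, n):
    """AU side (assumed line 8): remove the mask received from CA."""
    return (lam - r) % n


# Toy modulus; in the paper N is the Paillier modulus.
N_TOY = 1009 * 1013
lam, r = mask_label(3, N_TOY)        # CB only ever sees lam after decryption
recovered = unmask_label(lam, r, N_TOY)
```

Because CB sees only the masked value and CA sees only ciphertexts, neither server learns the class label, while AU recovers it from the two shares.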
4.5) Example of kNN classification
Here, an example of the proposed secure kNN classification algorithm is described. Assume that the original data is indexed and encrypted by using the kd-tree, as shown in Fig 4. Each leaf node of the encrypted kd-tree contains four attributes: a node identifier (ID), an encrypted lower bound of the node, an encrypted upper bound of the node, and the encrypted data. Fig 5 shows how to extract the data in a selected node through the secure index search phase. First, CA sends the node identifier (ID), the encrypted lower bound, the encrypted upper bound, and the encrypted query to all the threads. In each thread, the algorithm performs the GSRO protocol to determine whether a node includes the query. If a node includes the query, the GSRO protocol returns E(1). Otherwise, it returns E(0). Second, the algorithm performs the RSM protocol by multiplying the encrypted data in each node (E(nodez.data)) by the result of the GSRO protocol. As a result, E(nodez.data) is returned only if the result of the GSRO protocol is E(1). Finally, the algorithm can safely obtain the encrypted data by merging the results of the RSM protocol. Fig 6 shows how to obtain kNN candidates through the k-nearest neighbors search phase. First, the algorithm selects the encrypted data which has the minimum distance from the query by using the GSMINn protocol. In Fig 6, E(d3) is selected as the 1NN because the distance of d3 is the minimum. Second, the algorithm sets the distance of the selected data to the maximum value in order to exclude it from later rounds. Therefore, the distance of E(d3) is set to E(MAX). Finally, the algorithm is repeated until the kth nearest data is selected. In the same way, E(d4) and E(d2) are selected as the 2NN and 3NN, respectively. As a result, the algorithm can safely select the k nearest neighbors. Figs 7 and 8 show examples of the index search and the k-nearest neighbor search in the kNN verification phase, respectively. In each thread, the algorithm calculates the shortest distance E(spdistz) between the query and a leaf node (nodez), and compares E(spdistz) with E(distk). If E(spdistz) is smaller than E(distk), the data in nodez is extracted. In Fig 7, because E(spdist2), i.e., E(1), is smaller than E(distk), i.e., E(5), node2 is selected. Fig 8 shows how to obtain the final kNN. The algorithm merges the kNN candidates and obtains the final k-nearest neighbors. In the kNN classification phase, the algorithm calculates the frequency of the labels in E(L′). Because the frequency of E(L1) is the highest among the k-nearest neighbors, E(L1) is selected as the final result, as shown in Fig 9.
5 Random value pool’s security proof
5.1) Security proof of the secure protocols
In this section, we describe the security proof of the SME and the GSCE protocols proposed in Section 3. To prove that the proposed protocols are secure under the semi-honest model, we show that the simulated images of the proposed protocols are computationally indistinguishable from their actual execution images.
Security proof of the SME protocol: We describe the security proof of the SME protocol by analyzing the security of the execution images of CA and CB. First, the execution image on the CB side is shown in Eq 8. It consists of the two encrypted values received from CA (line 1∼2 of Algorithm 1), the two plaintexts obtained through the decryption of those encrypted values, and α, the result which is calculated by the SME protocol from the two plaintexts on the CB side.
(8)
For example, assume a simulated execution image of the SME protocol on the CB side with the same structure. It consists of two non-deterministic numbers selected in Z_{N^2}, two indistinguishable numbers to which values from the random value pool have been added, and the result of the SME protocol calculated from the latter two numbers on the CB side. Because the SME protocol is implemented based on the Paillier cryptosystem, it supports semantic security. Therefore, the two encrypted values in the execution image are computationally indistinguishable from the two non-deterministic numbers in the simulated image. The result of the protocol is likewise indistinguishable from its simulated counterpart because it is calculated by multiplying two indistinguishable numbers. Therefore, the execution image of CB is computationally indistinguishable from its simulated image. Because CB can check only the result (e.g., α) of the multiplication between the non-deterministic numbers, CB cannot obtain the original data while performing the SME protocol. Meanwhile, the execution image of CA contains E(α), which CA receives from CB and which can be regarded as the result of the SME protocol. Suppose that the simulated image of CA contains E(s4), where E(s4) is randomly generated from Z_{N^2}. Then E(α) is computationally indistinguishable from E(s4). According to the above analyses, there is no information leakage at either the CA side or the CB side. Therefore, we can conclude that the proposed SME protocol is secure under the semi-honest adversarial model.
Security proof of the GSCE protocol: We describe the security proof of the GSCE protocol by analyzing the security of the execution images of the CA side and the CB side. First, the execution image on the CB side is shown in Eq 9. It consists of the two encrypted values received from CA (line 1∼2 of Algorithm 2), the two plaintexts obtained through the decryption of those encrypted values, and β, the result which is calculated by the GSCE protocol from the two plaintexts on the CB side.
(9)
For example, assume a simulated execution image of the GSCE protocol on the CB side with the same structure. It consists of two non-deterministic numbers selected in Z_{N^2}, two indistinguishable numbers selected from the random value pool, and the result of the GSCE protocol calculated from the latter two numbers on the CB side. Because the GSCE protocol is implemented based on the Paillier cryptosystem, it supports semantic security. Therefore, the two encrypted values in the execution image are computationally indistinguishable from the two non-deterministic numbers in the simulated image. The result of the protocol is likewise indistinguishable from its simulated counterpart because it is calculated by comparing two indistinguishable numbers. Therefore, the execution image of CB is computationally indistinguishable from its simulated image. Because CB can check only the result (e.g., β) of the comparison between the non-deterministic numbers, CB cannot obtain the original data while performing the GSCE protocol. Meanwhile, the execution image of CA contains E(β), which CA receives from CB and which can be regarded as the result of the GSCE protocol. Suppose that the simulated image of CA contains E(s4), where E(s4) is randomly generated from Z_{N^2}. Then E(β) is computationally indistinguishable from E(s4). According to the above analyses, there is no information leakage at either the CA side or the CB side. Therefore, we can conclude that the proposed GSCE protocol is secure under the semi-honest adversarial model.
5.2) Security proof of the proposed kNN classification algorithm
We prove that the proposed kNN classification algorithm on the encrypted database is safe under the semi-honest attack model. The proposed kNN classification algorithm on the encrypted database consists of a secure index search phase (Algorithm 3), a kNN search phase (Algorithm 4), a kNN verification phase (Algorithm 5), and a kNN classification phase (Algorithm 6). To show that the proposed secure kNN classification algorithm is safe under the semi-honest attack model, a security analysis is performed for each execution phase. First, because the secure index search phase is composed of the GSRO protocol [17], which has been proven to be safe, Algorithm 3 is safe under the semi-honest attack model by the composition theorem [17]. Second, the kNN search phase is safe on the CA side, because CA performs the ESSED, SMINn and SM protocols, which have been proven to be safe in previous studies [16, 17]. Even though CB decrypts the data received from CA in the kNN search phase, CB cannot extract the original data. This is because the data received from CA has been modified by raising the original data to the power of a random integer and applying a shuffling function. Therefore, according to the composition theorem, Algorithm 4 is safe under the semi-honest attack model. Third, the images generated by the kNN verification phase are the same as those generated by Algorithms 3 and 4. Therefore, the kNN verification phase (Algorithm 5) is safe under the semi-honest attack model. Lastly, the kNN classification phase (Algorithm 6) is safe under the semi-honest attack model because it has been proven safe in previous work [16, 17]. As a result, all the phases of the proposed secure kNN classification algorithm are safe under the semi-honest attack model.
6 Performance analysis
Because there is no existing privacy-preserving parallel kNN classification algorithm, we compare our privacy-preserving parallel kNN classification algorithm with parallel extensions of existing works. That is, we make parallel SkNNC-M by extending B. K. Samanthula et al.’s work [16] in a naive way so that it can operate in a multi-core environment. We make parallel SkNNC-G by extending H. J. Kim et al.’s work [17] in the same way. For performance evaluation, the three algorithms were implemented in C++ on an Intel(R) Xeon(R) CPU E5–2630 v4 @ 2.20GHz with 64GB (16GB × 4) DDR3 UDIMM 1600MHz memory in a Linux Ubuntu 18.04.2 environment. We compare the three parallel algorithms in terms of query processing time by varying the number of data, k, the level of the kd-tree, the number of data dimensions, and the number of threads. We use both a synthetic dataset and a real dataset [28] for our experiments.
6.1) Performance analysis of kNN classification algorithm for synthetic dataset
Table 4 shows the parameters used in the performance evaluation for the synthetic dataset. For the synthetic dataset, we randomly generate 30,000 integer data items with 12 dimensions. The data domain ranges from 0 to 2^12. We ran an experiment to find the optimal level of the kd-tree (h). The performance of both SkNNC-G and the proposed algorithm is best when h is 7, so we set h to 7 in our experiments.
The performance of the kNN classification algorithms is evaluated for synthetic data. Fig 10 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of data. When n = 30k, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 215, 497, and 7,089 seconds, respectively. That is, the proposed algorithm shows 2.3 times better performance than parallel SkNNC-G and 32 times better performance than parallel SkNNC-M. This is because our secure protocols (SME and GSCE protocols) can reduce the number of data encryptions by selecting an encrypted value from the random value pool instead of generating it, as mentioned in Table 3. Fig 11 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to k. When k = 20, the proposed parallel algorithm, parallel SkNNC-G, and parallel SkNNC-M require 202, 487, and 4,658 seconds, respectively. That is, the proposed algorithm shows 2.4 times better performance than parallel SkNNC-G and 23 times better performance than parallel SkNNC-M. The reason is the same as mentioned in Fig 10.
Fig 12 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of data dimensions (m). When m = 6, the proposed parallel algorithm, parallel SkNNC-G, and parallel SkNNC-M require 57, 112, and 2,353 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 15 times better performance than parallel SkNNC-M. The reason is the same as mentioned for Fig 10. Fig 13 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of threads. When the number of threads = 1 (single-core), the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 443, 894, and 15,572 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 35 times better performance than parallel SkNNC-M. This is because our secure protocols (SME and GSCE protocols) can reduce the number of data encryptions by selecting an encrypted value from the random value pool instead of generating it. When the number of threads = 10, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 93, 203, and 2,350 seconds, respectively. That is, the proposed algorithm shows 2.1 times better performance than parallel SkNNC-G and 25 times better performance than parallel SkNNC-M. Because the threads perform secure protocols concurrently without interfering with each other, query processing time decreases as the number of threads increases (from 443 to 93 seconds for the proposed algorithm). As a result, our parallel algorithm shows better performance than the existing algorithms in a multi-core environment.
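The ratios quoted for Fig 13 can be recomputed directly from the reported query processing times. The short sketch below uses only the synthetic-dataset timings stated in the text for 1 and 10 threads; the dictionary layout and the function names are chosen for this example.

```python
# Query processing times in seconds, as reported for the synthetic dataset.
TIMES = {
    1: {"proposed": 443, "SkNNC-G": 894, "SkNNC-M": 15572},
    10: {"proposed": 93, "SkNNC-G": 203, "SkNNC-M": 2350},
}


def advantage(threads, baseline):
    """How many times faster the proposed algorithm is than a baseline."""
    return TIMES[threads][baseline] / TIMES[threads]["proposed"]


def parallel_speedup(algorithm):
    """Speedup of one algorithm when going from 1 thread to 10 threads."""
    return TIMES[1][algorithm] / TIMES[10][algorithm]


for name in ("SkNNC-G", "SkNNC-M"):
    print(name, round(advantage(1, name), 1), round(advantage(10, name), 1))
for name in TIMES[1]:
    print(name, round(parallel_speedup(name), 2))
```

With these inputs the proposed algorithm runs about 4.8 times faster on 10 threads than on 1 thread, and its advantage over parallel SkNNC-M at 10 threads is about 25 times, in line with the figures quoted above.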
6.2) Performance analysis of kNN classification algorithm for real dataset
Table 5 shows the parameters used in the performance evaluation for real data. For this, we used a chess dataset [28] generated by a chess endgame database for white king and rook against black king. The chess dataset aims to classify the optimal depth of win for white. With the real dataset, we do an experiment to find the optimal value of the level of kd-tree(h). It is shown that the performances of both SkNNC-G and the proposed algorithm are best when h is 7. So, we set h to 7 in our experiment.
Fig 14 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to k. When k = 20, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 425, 894, and 13,175 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 27 times better performance than parallel SkNNC-M. This is because our algorithm uses both the SME and GSCE protocols, which reduce the number of data encryptions by selecting an encrypted value from the random value pool. Fig 15 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of threads. When the number of threads = 1 (single-core), the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 1,106, 2,306, and 44,570 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 40 times better performance than parallel SkNNC-M. The reason is the same as mentioned for Fig 14. When the number of threads = 10, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 227, 487, and 6,639 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 29 times better performance than parallel SkNNC-M. Because the threads perform secure protocols concurrently without interfering with each other, query processing time decreases as the number of threads increases (from 1,106 to 227 seconds for the proposed algorithm).
6.3) Theoretical analysis of the proposed algorithm in terms of privacy
Assuming that an attacker does not have any information about the original data items, the attacker needs a tremendous amount of time to obtain the original plaintext from the Paillier cryptosystem by using a brute force attack. This means that it is infeasible to run an experiment that proves data protection, query protection and access pattern protection. Therefore, instead of an experimental analysis, we conduct a theoretical analysis of data privacy, query privacy and access pattern privacy to support the security analysis of the proposed algorithm. For this, we estimate the time it takes for the original data to be exposed and calculate the probability of access pattern leakage.
6.3.1 Theoretical analysis of data privacy.
In CA, an attacker only obtains the ciphertext of the data. Because the data is protected by the Paillier cryptosystem, the security is measured through the time complexity of a brute force attack that breaks the Paillier cryptosystem. Our Paillier cryptosystem uses a 512-bit encryption key. Assuming a CPU clock of 4GHz, the time required to decrypt the ciphertext by trying every key is shown in Eq (10).
(10)
It is practically impossible to break the Paillier cryptosystem by brute force because it takes about 4.2 × 10^146 years with a 512-bit key. This means that the proposed privacy-preserving kNN classification algorithm is secure in terms of data privacy even if the ciphertext is exposed. Fig 16 shows the time taken for a brute force attack in CA as the key size is changed. In CB, an attacker only obtains plaintext data to which a random number has been added. In the Paillier cryptosystem, because the range of the plaintext data is 0 ≤ m ≤ 2^512, the brute force attack time in CB is the same as that in CA.
6.3.2 Theoretical analysis of query privacy.
In CA, an attacker only obtains the ciphertext of the query. Because the query is protected by the Paillier cryptosystem, the security is measured through the time complexity of a brute force attack that breaks the Paillier cryptosystem. Since our Paillier cryptosystem uses a 512-bit encryption key, the time required to decrypt the ciphertext by trying every key is as shown in Eq (10), where the CPU clock is 4GHz. It is practically impossible to break the Paillier cryptosystem by brute force because it takes about 4.2 × 10^146 years with a 512-bit key. This means that the proposed privacy-preserving kNN classification algorithm is secure in terms of query privacy even if the ciphertext is exposed. The time taken for a brute force attack in CA is the same as that for data privacy in CA (Fig 16). In CB, query privacy is preserved because CB does not receive the query.
6.3.3 Theoretical analysis of access pattern privacy.
The access pattern means the sequence of accessing data items. In the proposed algorithm, this sequence consists of the leaf node access of the kd-tree and the data access in the leaf node. In CA, an attacker only obtains the ciphertext of a leaf node. Because all the leaf nodes have the same number of data items, an attacker cannot distinguish a leaf node by using the density of data items. If the kd-tree level is h, the number of leaf nodes is 2^(h−1). The probability that an attacker can distinguish a node (nodei) from the others, i.e., P(nodei), is 1/2^(h−1). Because nodei includes as many data items as the fanout, the probability that an attacker can distinguish a data item from the others in nodei, i.e., P(nodei.dataj), is 1/fanout. Therefore, the probability of data access pattern leakage (PAPL) is shown in Eq (11).
$$P_{APL} = P(node_i) \times P(node_i.data_j) = \frac{1}{2^{h-1} \times fanout} \tag{11}$$
PAPL is equal to the probability that an attacker distinguishes a specific data item from the others in the entire data items. Therefore, the proposed algorithm can preserve the access pattern privacy in CA. In CB, access pattern privacy is preserved because CB does not have any data item.
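As a small worked example, the leakage probability can be evaluated with exact fractions. The sketch assumes, as we read the description above, that P(nodei) = 1/2^(h−1) and P(nodei.dataj) = 1/fanout; the function name and the fanout of 500 are hypothetical values chosen for illustration.

```python
from fractions import Fraction


def access_pattern_leak_probability(h, fanout):
    """P_APL for a kd-tree of level h whose leaves each hold `fanout` items."""
    leaves = 2 ** (h - 1)           # number of leaf nodes
    p_node = Fraction(1, leaves)    # chance of singling out one leaf node
    p_item = Fraction(1, fanout)    # chance of singling out one item in it
    return p_node * p_item


# h = 7 as in the experiments; fanout = 500 is an illustrative value.
p = access_pattern_leak_probability(7, 500)
```

Here the tree has 64 leaf nodes, so the probability is 1/32,000, which is the chance of guessing one item among all 64 × 500 stored items.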
7 Discussion
7.1 Impact of hiding data access patterns
The data access pattern is one of the most important factors for privacy preservation. If an attacker knows the order or the frequency of data accesses, the attacker can infer the original data from these access patterns. Therefore, hiding data access patterns is as important as encrypting data. First, B. Yao et al.’s work [21] proposed a secure kNN classification algorithm using the Voronoi diagram [22]. However, the order of accessing the Voronoi diagram is distinguishable, and an attacker can partially infer the original data from the query. Second, J. Du and F. Bian’s work [25] proposed a kNN classification algorithm using an order-preserving index. However, the index access patterns are exposed because the order of accessing the index can be easily obtained from the query. This allows an attacker who has an index access pattern to easily infer the original data. Meanwhile, our algorithm uses the Paillier cryptosystem, which supports semantic security for data protection. As a result, all of the ciphertexts are indistinguishable and secure from frequency-based attacks. In addition, the kd-tree filtering technique used in our algorithm does not expose data access patterns, because our algorithm accesses only the encrypted leaf nodes of the kd-tree instead of traversing the index in a top-down manner. Therefore, our algorithm can hide the data access patterns.
7.2 Impact of parallel algorithm with garbled circuit
First, a garbled circuit is used for efficient processing of secure protocols. B. K. Samanthula et al.’s work [16] has high overhead because it uses a secure protocol based on the comparison of binary arrays. To overcome this problem, our secure protocols use a garbled circuit that performs a fast and secure comparison operation on ciphertexts. Second, the existing privacy-preserving classification algorithms do not use parallelism [16, 17, 25]. In contrast, we propose a parallel classification algorithm that adopts the garbled circuit. Our algorithm performs three phases in parallel: index searching, kNN searching and kNN verification. As shown in our performance evaluation, the performance of our parallel classification algorithm improves as the number of threads increases.
7.3 Impact of encrypted random value pool
In our secure system, we use two-party computation for the parallel kNN classification algorithm. Thus, we need to prevent CB from extracting meaningful information while executing secure protocols. For this, CA generates a random value r from ZN and encrypts r by using the Paillier cryptosystem. Then, CA adds the encrypted random value E(r) to the encrypted plaintext E(m) by computing E(m + r) = E(m) × E(r). Because m + r is independent of m, CB cannot obtain meaningful information by decryption. However, adding a random value to the ciphertext in the Paillier cryptosystem leads to performance degradation because both encryption and decryption require a higher computation cost than the other operations on encrypted data. In the Secure Multiplication protocol, both B. K. Samanthula et al.’s work and H. Kim et al.’s work require three encryptions: two encryptions for random values at CA and one encryption for the result of the multiplication at CB. Meanwhile, our algorithm requires only one encryption for the result of the multiplication at CB because it selects the encrypted random values from the random value pool without encrypting the random values at CA. In the Secure Compare protocol, B. K. Samanthula et al.’s work requires log2 D encryptions, where D is the data domain. H. Kim et al.’s work requires three encryptions: two encryptions for random values at CA and one encryption for the result of the comparison between two values at CB. Meanwhile, our algorithm requires only one encryption for the result of the comparison at CB by using the random value pool. Therefore, our algorithm can reduce the computation cost for encryption by using the encrypted random value pool.
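The effect of the random value pool can be illustrated with a toy Paillier implementation. The sketch below is for illustration only and is not the implementation evaluated in this paper: the primes are far too small to be secure, the generator is fixed to g = N + 1, and the class name `RandomValuePool`, the pool size, and the way entries are drawn are assumptions. It shows that masking E(m) with a pooled E(r) needs one modular multiplication instead of a fresh encryption.

```python
import math
import secrets

# Toy parameters (assumption): real deployments use primes of hundreds of bits.
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                               # inverse of LAM modulo N


def encrypt(m):
    """Paillier encryption with g = N + 1: (1 + m*N) * r^N mod N^2."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Paillier decryption: L(c^LAM mod N^2) * MU mod N, with L(x) = (x - 1) / N."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


class RandomValuePool:
    """Pairs (r, E(r)) encrypted ahead of time, outside query processing."""

    def __init__(self, size):
        self.entries = []
        for _ in range(size):
            r = secrets.randbelow(N)
            self.entries.append((r, encrypt(r)))

    def draw(self):
        return secrets.choice(self.entries)


def mask(c_m, pool):
    """E(m + r) = E(m) * E(r) mod N^2, with E(r) taken from the pool."""
    r, c_r = pool.draw()
    return c_m * c_r % N2, r


pool = RandomValuePool(8)
masked, r = mask(encrypt(42), pool)
```

Decrypting `masked` yields (42 + r) mod N, so the party holding the secret key sees only a masked value, and subtracting r modulo N recovers 42. How large the pool should be and how often it must be refreshed are design questions that this sketch does not address.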
7.4 Practical example of proposed kNN classification
The proposed secure kNN classification algorithm can be used in various fields. For example, first, it can be used to diagnose a disease by classifying the patterns of the patient’s symptoms [29]. Because the existing disease diagnosis system depends on only the doctor’s knowledge and experience, it may cause damage to patients due to misdiagnosis. Therefore, kNN classification algorithms can help doctors classify the pattern of the patient’s symptoms so as to diagnose what kind of disease it is. However, because patients’ information contains sensitive data, such as past medical history, family history and allergies, the proposed privacy-preserving kNN classification algorithm can be used to protect the sensitive data of patients. Second, the proposed privacy-preserving kNN classification algorithm can be used to solve the problem of insurance coverage recommendation where insurance companies provide the most suitable coverage for customers [30]. The insurance coverage recommendation classifies customers’ grades based on various customers’ information, such as movement patterns and lifestyles. To perform the classification of customers’ grades, the proposed privacy-preserving kNN classification algorithm can be used to protect the personal information of customers.
8 Conclusion
In this paper, we proposed a parallel kNN classification algorithm over encrypted data to preserve data privacy, query privacy, and access pattern privacy in cloud computing. To reduce the computation cost for encryption, we proposed two secure protocols, SME and GSCE, which support secure multi-party computation by using an encrypted random value pool. To reduce the query processing time, we not only designed a parallel algorithm, but also adopted a garbled circuit. In addition, we proved that our algorithm over the encrypted database is safe under the semi-honest attack model. Through our performance evaluation, our algorithm showed about 2∼25 times better performance compared with the existing algorithms. For future work, we plan to apply our parallel query processing algorithm to secure k-Means clustering.
References
- 1. Ge YF, Yu WJ, Cao J, Wang H, Zhan ZH, Zhang Y, et al. Distributed memetic algorithm for outsourced database fragmentation. IEEE Transactions on Cybernetics. 2020;51(10):4808–4821.
- 2. Ge YF, Orlowska M, Cao J, Wang H, Zhang Y. MDDE: multitasking distributed differential evolution for privacy-preserving database fragmentation. The VLDB Journal. 2022;1–19.
- 3. Brian H, Brunschwiler T, Dill H, Christ H, Falsafi B, Fischer M, et al. Cloud computing. Communications of the ACM. 2008 July;51(7):9–11.
- 4. Josep AD, Katz R, Konwinski A, Gunho LEE, Patterson D, Rabkin A. A view of cloud computing. Communications of the ACM. 2010;53(4):50–58.
- 5. Xiong L, Chitti S, Liu L. Topk queries across multiple private databases. In 25th IEEE International Conference on Distributed Computing Systems. 2005 June;145–154.
- 6. Gutscher A. Coordinate transformation-a solution for the privacy problem of location based services?. In Proceedings 20th IEEE International Parallel Distributed Processing Symposium. 2006 April;7.
- 7. Hassanat AB. Two-point-based binary search trees for accelerating big data classification using KNN. PloS one. 2018;13(11):e0207772. pmid:30475862
- 8. Wong WK, Cheung DWL, Kao B, Mamoulis N. Secure knn computation on encrypted databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. 2009;139–152.
- 9. Hu H, Xu J, Ren C, Choi B. Processing private queries over untrusted data cloud through privacy homomorphism. In 2011 IEEE 27th International Conference on Data Engineering. 2011;601–612.
- 10. Wang B, Hou Y, Li M, Wang H, Li H. Maple: scalable multi-dimensional range search over encrypted cloud data with tree-based index. In Proceedings of the 9th ACM symposium on Information, computer and communications security. 2014 June;111–122.
- 11. Elmehdwi Y, Samanthula BK, Jiang W. Secure k-nearest neighbor query over encrypted data in outsourced environments. In 2014 IEEE 30th International Conference on Data Engineering. 2014 March;664–675.
- 12. Jiang ZL, Guo N, Jin Y, Lv J, Wu Y, Liu Z, et al. Efficient two-party privacy-preserving collaborative k-means clustering protocol supporting both storage and computation outsourcing. Information Sciences. 2020;518:168–180.
- 13. Alabdulatif A, Khalil I, Yi X. Towards secure big data analytic for cloud-enabled applications with fully homomorphic encryption. Journal of Parallel and Distributed Computing, 2020;137:192–204.
- 14. Pang H, Wang B. Privacy-preserving association rule mining using homomorphic encryption in a multikey environment. IEEE Systems Journal. 2020;15(2):3131–3141.
- 15. Wu W, Liu J, Wang H, Hao J, Xian M. Secure and efficient outsourced k-means clustering using fully homomorphic encryption with ciphertext packing technique. IEEE Transactions on Knowledge and Data Engineering. 2020;33(10):3424–3437.
- 16. Samanthula BK, Elmehdwi Y, Jiang W. K-nearest neighbor classification over semantically secure encrypted relational data. IEEE transactions on Knowledge and data engineering. 2014;27(5):1261–1273.
- 17.
Kim HJ, Kim HI, Chang JW. A Privacy-Preserving kNN Classification Algorithm Using Yao’s Garbled Circuit on Cloud Computing. In 2017 IEEE 10th International Conference on Cloud Computing (CLOUD). 2017 June;766–769.
- 18.
Paillier P. Public-key cryptosystems based on composite degree residuosity classes. In International conference on the theory and applications of cryptographic techniques. 1999 May; 223–238.
- 19. Goldwasser S, Micali S. Probabilistic encryption & how to play mental poker keeping secret all partial information. In Providing sound foundations for cryptography: on the work of Shafi Goldwasser and Silvio Micali. 2019; 173–201.
- 20.
Brickell J, Shmatikov V. Privacy-preserving graph algorithms in the semi-honest model. In International Conference on the Theory and Application of Cryptology and Information Security. 2005 December;236–252.
- 21.
Yao B, Li F, Xiao X. Secure nearest neighbor revisited. In 2013 IEEE 29th international conference on data engineering (ICDE). 2013 April; 733–744.
- 22. Erwig M. The graph Voronoi diagram with applications. Networks: An International Journal. 2000; 36(3):156–163.
- 23. Wu W, Liu J, Rong H, Wang H, Xian M. Efficient k-nearest neighbor classification over semantically secure hybrid encrypted cloud database. IEEE Access. 2018;6:41771–41784.
- 24. Tan Y, Wu W, Liu J, Wang H, Xian M. Lightweight edge based kNN privacy preserving classification scheme in cloud computing circumstance. Concurrency and Computation: Practice and Experience. 2020;32(19).
- 25. Du J, Bian F. A Privacy-Preserving and Efficient k-nearest neighbor query and classification scheme based on k-dimensional tree for outsourced data. IEEE Access. 2020;8:69333–69345.
- 26.
Boldyreva A, Chenette N, Lee Y, O’neill A. Order-preserving symmetric encryption. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. 2009 April; 224–241.
- 27.
Yao ACC. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science. 1986 October;162–167.
- 28.
Michael B. Chess (King-Rook vs. King) Data Set. The UCI KDD Archive. 1994 June; http://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29.
- 29.
Hashi EK, Zaman MSU, Hasan MR. An expert clinical decision support system to predict disease using classification techniques. In 2017 International conference on electrical, computer and communication engineering (ECCE). 2017 February;396–400.
- 30. Khalili-Damghani K, Abdi F, Abolmakarem S. Solving customer insurance coverage recommendation problem using a two-stage clustering-classification model. International Journal of Management Science and Engineering Management. 2019;14(1):9–19.