PPCD: Privacy-preserving clinical decision with cloud support

Hui Ma; Xuyang Guo; Yuan Ping; Baocang Wang; Yuehua Yang; Zhili Zhang; Jingxian Zhou

doi:10.1371/journal.pone.0217349

Abstract

With the prosperity of machine learning and cloud computing, meaningful information can be mined from massive electronic medical data to help physicians make proper disease diagnoses for patients. However, using patients’ medical data and disease information frequently raises privacy concerns. In this paper, based on the single-layer perceptron, we propose a scheme of privacy-preserving clinical decision with cloud support (PPCD), which securely conducts disease model training and prediction for the patient. Each party learns nothing about the other’s private information. In PPCD, a lightweight secure multiplication is presented and introduced to improve the model training. Security analysis and experimental results on real data confirm the high accuracy of disease prediction achieved by the proposed PPCD without the risk of privacy disclosure.

Citation: Ma H, Guo X, Ping Y, Wang B, Yang Y, Zhang Z, et al. (2019) PPCD: Privacy-preserving clinical decision with cloud support. PLoS ONE 14(5): e0217349. https://doi.org/10.1371/journal.pone.0217349

Editor: Lixiang Li, Beijing University of Posts and Telecommunications, CHINA

Received: December 9, 2018; Accepted: May 9, 2019; Published: May 29, 2019

Copyright: © 2019 Ma et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All the data sets are available from the UCI machine learning repository. URL:http://archive.ics.uci.edu/ml.

Funding: This work is supported by the National Key R&D Program of China under Grant no. 2017YFB0802000 to BW, the National Natural Science Foundation of China under Grant no. U1736111 to BW, the Plan For Scientific Innovation Talent of Henan Province under Grant no. 184100510012 to BW, the Program for Science & Technology Innovation Talents in Universities of He’nan Province under Grant No. 18HASTIT022 to YP, Key Technologies R&D Program of He’nan Province under Grant No. 182102210123 to YP, the Foundation of He’nan Educational Committee under Grant No. 18A520047 to YP, the Foundation for University Key Teacher of He’nan Province under Grant No. 2016GGJS-141 to YP, Key Technologies R&D Program of He’nan Province (192102210295 to HM), and Innovation Scientists and Technicians Troop Construction Projects of Henan Province. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

With the sharp growth of electronic data, machine learning has influenced people’s lifestyles by predicting human behavior and future trends [1], [2], [3]. To overcome the limitations of storage and computing resources, outsourcing expensive machine learning tasks to the Cloud has attracted increasing attention. For instance, the client’s data can be transmitted to the Cloud for model training, prediction, or both [4], [5], [6]. As a popular machine learning algorithm, the single-layer perceptron (SLP) is simple yet efficient and has been widely used in disease prediction [7], [8], [9]. It is more appropriate for real-time disease prediction than more complex techniques such as naïve Bayes [10], decision trees [2] and support vector machines (SVMs) [11], [12]. The clinical decision support system (CDSS), which uses various data mining techniques to help physicians make proper disease diagnoses and provide health services for patients, has received considerable attention [7], [13], [14], [15]. However, because of privacy concerns, users do not want to submit their medical data to an unauthorized institution [16], [17], [18]. At the same time, because the classifier is considered an asset of the medical service provider, there is a risk of exposing the prediction model to a third party, which could then use the model to make disease predictions for patients and thereby harm the medical service provider. Therefore, the confidentiality of both the medical data and the disease model is crucial for the CDSS. How to achieve secure disease prediction without compromising the accuracy of the result becomes a challenging issue.

To protect the privacy of patients’ medical data and the security of the prediction model, in this study we propose a privacy-preserving clinical decision scheme based on SLP with cloud support (PPCD). As shown in Fig 1, it includes two phases: SLP model training and disease prediction. In model training, diagnosed patients encrypt their symptom data and outsource them, together with the corresponding diagnosed disease, to the cloud. Meanwhile, the hospital generates random weights, which are then encrypted and sent to the cloud. After receiving both the encrypted medical data and the encrypted weights, the cloud trains the model through a few interactions with the hospital. The cloud selects an encrypted sample and executes the sign(·) function. If the returned value of sign(·) does not match the sample’s label, the cloud updates the weights; this is repeated until the convergence criterion is satisfied or all the disease cases are matched. When a patient wants to check his disease, he encrypts his symptom data and submits them to the hospital, which completes the analysis based on the disease model and sends back the encrypted diagnosis result and some medical advice.

Fig 1. Architecture of the proposed PPCD.

https://doi.org/10.1371/journal.pone.0217349.g001

To tackle the privacy concerns in clinical decision support systems, PPCD provides disease model training and disease risk prediction for the patient in a privacy-preserving way, so that the Cloud learns nothing about the patient’s medical information or the actual model. Specifically, the main contributions are as follows:

We propose PPCD, which provides privacy-preserving clinical decisions based on SLP with cloud support. It helps the doctor to predict disease while the medical data and the diagnosis result remain in encrypted form. Furthermore, the built disease diagnosis model is also protected as an asset of the hospital.
For privacy preservation in the model training phase, a specific lightweight secure multiplication (LSM) is presented. By employing LSM, PPCD securely finishes the inner product in the encrypted domain (ED) in one round.
We implement PPCD in Java to check its performance in ED. Experimental results from several medical data analyses confirm that PPCD achieves accuracies comparable to those of SLP in the plain domain (PD).

The remainder of this paper is organized as follows: The following section briefly introduces the preliminaries. Then, PPCD is proposed along with LSM. Next, the correctness and security analysis is detailed, followed by the performance evaluation. Related work and conclusions are given in the last two sections.

Preliminaries

In this section, brief overviews of the Paillier cryptosystem, SLP and secure multiplication (SM) are given. Table 1 summarizes the key notations.

Table 1. Summary of notations.

https://doi.org/10.1371/journal.pone.0217349.t001

Single-layer perceptron

Following [19], SLP learns a weight vector w which is multiplied with the input features to determine whether a sample belongs to one class or the other. We define an activation function sign(z) which takes the linear combination z of the input values x and the weights w as input. If z is greater than a defined threshold θ, we predict 1, and −1 otherwise. In order to simplify the notation, we define w₀ = −θ and x₀ = 1, so that z = w₀x₀ + w₁x₁ + ⋯ + w_n x_n = wᵀx (1), where sign(z) = 1 if z ≥ 0 and sign(z) = −1 otherwise.

For each training sample x_i, we calculate the output value and update w if the output differs from the target O_i. The value for updating the weights at each increment is calculated by the learning rule w_j = w_j + ηO_i x_i,j (2), where η is the learning rate (0 < η ≤ 1).

It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable. If a linear decision boundary can’t separate the two classes, a maximum number of passes should be set over the training dataset and/or a threshold for the number of tolerated misclassifications.
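The training procedure described above can be sketched in plain Python as follows. This is a minimal illustration in the plain domain, not the authors’ implementation: the function names `sign`, `predict` and `train_slp`, the toy data, and the zero initial weights are assumptions made for the example. On a misclassified sample the sketch applies the update w_j ← w_j + η·O_i·x_ij, with x_i0 = 1 carrying the threshold term.

```python
def sign(z):
    """Return 1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


def predict(w, x):
    """Classify x (with x[0] == 1 as the bias input) using weights w."""
    return sign(sum(wj * xj for wj, xj in zip(w, x)))


def train_slp(samples, labels, eta=1.0, iteration_max=100):
    """Train a single-layer perceptron; stop early once every sample matches."""
    w = [0.0] * len(samples[0])  # assumption: zero start instead of random weights
    for _ in range(iteration_max):
        mistakes = 0
        for x, o in zip(samples, labels):
            if predict(w, x) != o:
                # Learning rule: shift w by eta * O_i * x_i on a mismatch.
                w = [wj + eta * o * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # all training samples are classified correctly
            break
    return w


# Hypothetical, linearly separable toy data; the first feature is x0 = 1.
samples = [(1, 2, 1), (1, 3, 3), (1, -1, -2), (1, -2, -1)]
labels = [1, 1, -1, -1]
w = train_slp(samples, labels)
print(w, [predict(w, x) for x in samples])
```

The `iteration_max` cap plays the role of the maximum number of passes mentioned above for data that are not linearly separable.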

Paillier cryptosystem

The Paillier cryptosystem is an additively homomorphic cryptosystem [20]. It works as follows:

Key generation: Two large prime numbers p and q are randomly and independently chosen such that gcd(pq, (p − 1)(q − 1)) = 1, where |p| = |q|. Then, we compute n = pq and λ = lcm(p − 1, q − 1), and select a random integer g in Z*_{n²}. By setting L(u) = (u − 1)/n and μ = (L(g^λ mod n²))⁻¹ mod n, the public key (n, g) and the private key (λ, μ) are obtained.
Encryption: Let m be a message to be encrypted where 0 ≤ m < n. With a randomly selected r where 0 < r < n, the ciphertext is calculated by c = E(m) = g^m · r^n mod n².
Decryption: Let c be the ciphertext to decrypt, where c ∈ Z*_{n²}. The plaintext message is obtained by m = D(c) = L(c^λ mod n²) · μ mod n.

Being additively homomorphic, the cryptosystem satisfies the following identities: homomorphic addition of plaintexts, D(E(m₁, r₁) · E(m₂, r₂) mod n²) = (m₁ + m₂) mod n, and homomorphic multiplication of a plaintext by a constant, D(E(m₁, r₁)^k mod n²) = km₁ mod n.
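The scheme and its two homomorphic identities can be checked with a toy sketch. It is only an illustration under stated assumptions: the primes 1009 and 1013 are far too small for real use, g = n + 1 is one common valid choice of g, and the function names `keygen`, `encrypt` and `decrypt` are hypothetical.

```python
import math
import random


def L(u, n):
    """The function L(u) = (u - 1) / n used in key generation and decryption."""
    return (u - 1) // n


def keygen(p=1009, q=1013):
    """Toy key generation; real deployments need much larger primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1  # assumption: a common valid choice of g in Z*_{n^2}
    mu = pow(L(pow(g, lam, n * n), n), -1, n)  # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    """c = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    """m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n


pk, sk = keygen()
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 20), encrypt(pk, 22)
added = decrypt(pk, sk, c1 * c2 % n2)      # ciphertext product -> plaintext sum
scaled = decrypt(pk, sk, pow(c1, 3, n2))   # ciphertext power -> plaintext times 3
print(added, scaled)
```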

Secure multiplication.

Secure multiplication (SM) [21] supports multiplication in ED. Suppose Alice has two encrypted values E_pk(x) and E_pk(y), and Bob has the private key sk corresponding to the public key pk. The goal of SM is to compute E_pk(x · y) without leaking x and y. The SM protocol is described as follows:

Alice holds the ciphertexts E_pk(x) and E_pk(y), generates two random numbers r_x, r_y ∈ Z_N, and calculates x1 = E_pk(x) · E_pk(r_x) and y1 = E_pk(y) · E_pk(r_y). She sends x1 and y1 to Bob.
After receiving x1 and y1, Bob decrypts them with the private key sk to get H_x = D_sk(x1) and H_y = D_sk(y1), and computes H1 = H_x · H_y mod N. Finally, Bob encrypts H1 with pk as H = E_pk(H1) and sends H to Alice.
Alice first computes s1 = E_pk(x)^(N − r_y), s2 = E_pk(y)^(N − r_x), and s3 = E_pk(r_x · r_y)^(N − 1), then multiplies them as E_pk(x · y) = H · s1 · s2 · s3.
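A minimal sketch of this masking-and-unmasking round is given below, using a toy Paillier instance. Several things are assumptions for illustration only: the tiny primes, the choice g = n + 1, the helper names, and the fact that both roles run inside one function (in the protocol, only Bob holds sk). The unmasking terms are chosen so that the random cross terms x·r_y, y·r_x and r_x·r_y cancel.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier keys (assumption: g = n + 1, tiny primes for illustration)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return (n, n + 1), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def secure_multiply(pk, sk, cx, cy):
    """Return E(x * y) from E(x) and E(y) with one Alice -> Bob -> Alice round."""
    n, _ = pk
    n2 = n * n
    # Alice: additively mask both ciphertexts with random r_x, r_y.
    rx, ry = random.randrange(n), random.randrange(n)
    x1 = cx * encrypt(pk, rx) % n2
    y1 = cy * encrypt(pk, ry) % n2
    # Bob (the only holder of sk): decrypt the masked values and multiply them.
    hx, hy = decrypt(pk, sk, x1), decrypt(pk, sk, y1)
    H = encrypt(pk, hx * hy % n)
    # Alice: remove x*r_y, y*r_x and r_x*r_y homomorphically.
    s1 = pow(cx, n - ry, n2)
    s2 = pow(cy, n - rx, n2)
    s3 = pow(encrypt(pk, rx * ry % n), n - 1, n2)
    return H * s1 * s2 * s3 % n2


pk, sk = keygen()
c = secure_multiply(pk, sk, encrypt(pk, 6), encrypt(pk, 7))
print(decrypt(pk, sk, c))
```

Bob only ever sees x + r_x and y + r_y, which is why the masks must be fresh and uniformly random for every call.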

The proposed PPCD model

Model overview and requirements

Model overview.

To employ SLP for model training and disease prediction while protecting privacy, the proposed PPCD model involves four parties, which are described in Table 2. They collaboratively conduct SLP model training and disease prediction. The CS trains a disease prediction model based on the DP’s disease data. To check a disease, UP submits his symptom data to the hospital, which predicts the corresponding disease based on the trained model. Fig 1 depicts the detailed procedure.

Table 2. Description of the attended four parties.

https://doi.org/10.1371/journal.pone.0217349.t002

Privacy requirements.

In PPCD, DPs are trustworthy: they provide correct medical data to the Cloud server (CS). Meanwhile, CS and UP are honest-but-curious [22]. CS strictly follows the privacy-preserving SLP learning protocol, but tries to learn DP’s sensitive medical data and UP’s medical information whenever it can. UP is interested in the trained disease model. The hospital is honest. In addition, an outside adversary may eavesdrop on all data transferred in the system. Privacy preservation is therefore critical for successfully diagnosing the patient’s disease, and the security requirements of PPCD are listed as follows.

UP’s Privacy: In disease diagnosis, the sensitive symptom data of UP should not be leaked to other untrusted parties during transmission. Furthermore, the diagnosis result is confidential to the patient and must not be exposed to any other entity. That is, UP’s privacy should be preserved.
DP’s Privacy: Generally, DP holds some historical medical information, e.g., the diagnosed disease and the confirmed symptom data. This information is highly sensitive and must not be obtained by unauthorized entities. Otherwise, DP is unwilling to provide the historical disease data for model training due to privacy concerns.
Hospital’s Privacy: In PPCD, the hospital trains the disease model using the historical medical data with the help of the Cloud. As an asset of the hospital, the disease model must not be leaked to UP or other parties during disease diagnosis.

Design goal.

Based on the above scenarios and security requirements, the system should realize model training and disease diagnosis in a privacy-preserving and efficient way. The particular goals are as follows.

Privacy-preserving requirements: the success of clinical decision support hinges upon information security and privacy preservation. If the privacy requirements are not considered, the patient’s sensitive data and the disease model will be exposed to unauthorized parties. History patients then become unwilling to share their medical data with PPCD, the accuracy of the trained model cannot be ensured, and the quality of the diagnosis service suffers. Therefore, the system should protect the privacy of both history patients and undiagnosed patients.
Confidentiality and accuracy of the disease model should be achieved: the disease model is a valuable asset of the hospital, which may be reluctant to reveal information about it. At the same time, it is crucial that applying privacy preservation does not compromise the accuracy of the prediction model.

The Proposed PPCD Model

Privacy-preserving training.

This section shows how to construct PPCD, train the disease model and predict disease based on the model in a privacy-preserving way.

(1) System setting

Key generation: The Paillier key generation algorithm is run by the hospital to generate keys for both UP and the hospital. Given the security parameter k, the hospital randomly chooses two large prime numbers p and q which satisfy |q| = |p| = k, and generates the public key (n, g) and the corresponding private key (λ, μ), where n = pq and λ = lcm(p − 1, q − 1).

Data encryption: Raw medical data are encrypted and submitted to the Cloud for storage and model training. The Cloud stores the disease patterns, each of which represents a disease sample <x_i, O_i>, where x_i is an n-dimensional vector whose elements each represent a confirmed symptom, and O_i ∈ {−1, 1} is the associated desired output, where 1 represents suffering from the disease and −1 represents not. We suppose the medical data have been preprocessed so that their format is suitable for PPCD. In the system, the disease output is stored on the cloud server in plaintext because leaking the disease output does not damage patients’ privacy. The encrypted medical data are stored in the cloud as shown in Table 3.

Table 3. Medical data for the k-th disease.

https://doi.org/10.1371/journal.pone.0217349.t003

Meanwhile, the disease predicting model is sensitive data which should be encrypted. At the beginning of model training, the hospital generates a random weight w = (w₁, w₂, ⋯, w_n) and encrypts it, then sends ciphertext of the weight to the Cloud server.

(2) Lightweight secure multiplication protocol

SM can be used to calculate the inner product of two encrypted vectors. Given the encrypted vectors Cx_i = (Cx_i1, …, Cx_in) and Cw = (Cw_1, …, Cw_n), the encrypted inner product is calculated by running SM n times. To compute the inner product more efficiently, we propose, based on SM, a lightweight secure multiplication (LSM) protocol which obtains the inner product over ciphertexts in a single round. Considering two parties C1 and C2, LSM is detailed in Algorithm 1.

Algorithm 1:

Require: C1 has Cx_ij and Cw_j; C2 has sk

Step1: C1:

(1) Chooses 2n random numbers r_xij, r_wj ∈ Z_N
(2) Cr_xij ← E(r_xij)
(3) Cr_wj ← E(r_wj)

For each Cx_ij and Cw_j

(4) X_ij = Cx_ij · Cr_xij
(5) W_j = Cw_j · Cr_wj; Send X_ij, W_j to the C2

Step2: C2

(1) Receives X_ij, W_j from C1
(2)
(3)
(4)
(5) H = E_pk(h); sends H to C1

Step3: C1

(1) Receiving the H
(2)
(3)
(4)
(5)
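One way to realize a one-round encrypted inner product in the spirit of Algorithm 1 is sketched below. It is a hedged reconstruction that generalizes the SM masking idea to vectors, not a transcription of the authors’ intermediate formulas: the helper names, the toy primes, g = n + 1, and the exact form of the unmasking products are all assumptions. Both roles run in one function here, whereas in PPCD only C2 holds sk.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier keys (assumption: g = n + 1, tiny primes for illustration)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return (n, n + 1), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def lsm_inner_product(pk, sk, cxs, cws):
    """Return E(sum_j x_j * w_j) from element-wise ciphertexts in one round."""
    n, _ = pk
    n2 = n * n
    # Step 1 (C1): mask every element with its own random number.
    rxs = [random.randrange(n) for _ in cxs]
    rws = [random.randrange(n) for _ in cws]
    X = [cx * encrypt(pk, r) % n2 for cx, r in zip(cxs, rxs)]
    W = [cw * encrypt(pk, r) % n2 for cw, r in zip(cws, rws)]
    # Step 2 (C2, holder of sk): decrypt the masked values, sum the products.
    h = sum(decrypt(pk, sk, a) * decrypt(pk, sk, b) for a, b in zip(X, W)) % n
    H = encrypt(pk, h)
    # Step 3 (C1): strip the cross terms introduced by the masks.
    result = H
    for cx, cw, rx, rw in zip(cxs, cws, rxs, rws):
        result = result * pow(cx, n - rw, n2) * pow(cw, n - rx, n2) % n2
    cross = sum(rx * rw for rx, rw in zip(rxs, rws)) % n
    return result * pow(encrypt(pk, cross), n - 1, n2) % n2


pk, sk = keygen()
cxs = [encrypt(pk, v) for v in (1, 2, 3)]
cws = [encrypt(pk, v) for v in (4, 5, 6)]
R = lsm_inner_product(pk, sk, cxs, cws)
print(decrypt(pk, sk, R))
```

Compared with calling SM once per element, this variant sends all masked elements together and returns a single ciphertext H, so the two parties interact only once per sample.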

(3) Model training

In the system setting phase, DP encrypts its medical information <x_i, O_i> and outsources <Cx_i, O_i> to the Cloud. The Cloud collects the medical data <Cx_i, O_i, I_k>, where k denotes the k-th disease. To train the prediction model w_k of the k-th disease, the Cloud selects the disease samples labeled with I_k.

Privacy-preserving disease model training is described by Algorithm 2.

Algorithm 2: Privacy-Preserving Model Training Based on SLP

1: Input: n input samples <x_i, O_i, I_k>, 1 ≤ k ≤ m, iteration_max, learning rate η, sign function sign(·)

2: Output: prediction model w_k, 1 ≤ k ≤ m

3: DP: for 1 ≤ k ≤ m do

4: for 1 ≤ i ≤ n do

5: DP encrypts symptom data as <Cx_i, O_i, I_k> and submits it to the cloud

6: endfor

7: endfor

8: for 1 ≤ k ≤ m do

9: Hospital: chooses the initial weight w_k randomly.

10: for iteration = 1, 2, …, iteration_max

11: for 1 ≤ i ≤ n do

12: Hospital: encrypts w_k and uploads the ciphertext Cw to the cloud

13: Cloud: chooses a medical sample <Cx_i, O_i> and executes LSM to get R = E(Σ_j x_ij · w_j),

14: and sends R to the hospital

15: Hospital: decrypts R, calculates the sign function S = sign(D(R)) and sends S to the cloud.

16: Cloud: If S ≠ O_i and O_i = 1, exp = η

17: If S ≠ O_i and O_i = −1, exp = n − η

18: for j = 1, …, d

19: u_j = Cx_ij^exp

20: Cw_j = Cw_j ⋅ u_j

21: endfor

22: endfor

23: endfor

24: return w_k, 1 ≤ k ≤ m

Lines 3–7: DP encrypts the symptom data and submits <Cx_i, O_i, I_k> to the cloud.

Lines 8–12: The hospital randomly generates a weight vector in which not all elements are equal to 0, encrypts it with its own public key pk, and sends the weight ciphertext to the Cloud.

Lines 13–14: The Cloud chooses a disease sample <Cx_i, O_i> labeled with I_k and 2n random numbers r_xij, r_wj ∈ Z_N, then executes LSM to compute R, the encrypted inner product of x_i and w, where the cloud server acts as C1 and the hospital acts as C2. Finally, the Cloud sends R to the hospital.

Line 15: After receiving R, the hospital decrypts R with the private key sk, executes the sign(·) function as S = sign(D(R)), and sends S to the cloud.

Lines 16–20: The Cloud compares S with O_i. If S ≠ O_i and O_i = 1, let exp = η; if S ≠ O_i and O_i = −1, let exp = n − η. Next, the Cloud computes u_j = Cx_ij^exp and then updates Cw_j as Cw_j = Cw_j · u_j.

Line 24: If all disease samples are matched or the number of training iterations exceeds the convergence criterion, the hospital terminates the training and <w_k, I_k> is taken as the prediction model for D_k; otherwise, the procedure returns to lines 13–14 and repeats.

After getting the k-th disease model, the Cloud selects the samples of the next disease and repeats lines 8–24. After all medical samples are trained, the hospital can get prediction models for all diseases.
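The ciphertext-side weight update of lines 16–20 can be illustrated with the following sketch. It assumes a toy Paillier instance (tiny primes, g = n + 1), hypothetical helper names, and an integer learning rate η, since a Paillier exponent must be an integer; the function `update_weights` is an illustration, not the authors’ Java code. Raising Cx_ij to η adds η·x_ij to the encrypted weight, and raising it to n − η subtracts η·x_ij modulo n.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier keys (assumption: g = n + 1, tiny primes for illustration)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return (n, n + 1), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def update_weights(pk, cw, cx, s, o, eta=1):
    """Apply Cw_j <- Cw_j * Cx_ij^exp when the predicted sign s differs from label o."""
    n, _ = pk
    n2 = n * n
    if s == o:
        return list(cw)  # prediction matched the label, nothing to update
    exp = eta if o == 1 else n - eta  # n - eta encodes -eta modulo n
    return [cwj * pow(cxj, exp, n2) % n2 for cwj, cxj in zip(cw, cx)]


pk, sk = keygen()
cw = [encrypt(pk, v) for v in (3, 5)]   # encrypted weights w = (3, 5)
cx = [encrypt(pk, v) for v in (2, 4)]   # encrypted sample x = (2, 4)
up = update_weights(pk, cw, cx, s=-1, o=1)    # label 1 missed: w + x
down = update_weights(pk, cw, cx, s=1, o=-1)  # label -1 missed: w - x
print([decrypt(pk, sk, c) for c in up], [decrypt(pk, sk, c) for c in down])
```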

Disease prediction.

In this phase, we assume the prediction models have been trained and stored in the hospital. The hospital can predict whether a patient suffers from the k-th disease using the k-th disease model. When an undiagnosed patient submits his encrypted symptom information to the hospital, the prediction is executed as follows.

Step 1: When the ciphertext of the symptom information arrives, the hospital decrypts it and gets the plaintext symptom data x = (x₁, x₂, ⋯, x_n).
Step 2: Let s = 0. For each x_j and w_j, the hospital calculates s_j = x_j · w_j, then gets s = Σ_j s_j.
Step 3: Compute S = sign(s). If S ≥ 0, the patient suffers from the disease; otherwise not.
Step 4: The hospital encrypts the prediction result with UP’s public key and returns it to the patient.

Correctness & security analysis

In this section, we analyze the correctness and security of the proposed PPCD scheme. Notably, we focus on how PPCD preserves the privacy of the patient’s medical information and of the disease model.

(1) Correctness analysis of LSM

The correctness of LSM can be illustrated as follows:

In Step1: (3) (4)

In Step2: (5) (6) (7)

In the Step3: (8) (9) (10) (11)

From the above derivation, LSM can calculate the encrypted inner product in one round.
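The cancellation behind this one-round result can be written out as follows. This is a sketch under the assumption that LSM masks each element additively and unmasks in the same way as SM; the symbols follow Algorithm 1, but the exact intermediate expressions are a reconstruction rather than the authors’ equations.

```latex
\begin{align*}
X_{ij} &= E(x_{ij}) \cdot E(r_{x_{ij}}) = E(x_{ij} + r_{x_{ij}}), \qquad
W_j = E(w_j) \cdot E(r_{w_j}) = E(w_j + r_{w_j}) \\
h &= \sum_{j=1}^{n} (x_{ij} + r_{x_{ij}})(w_j + r_{w_j}) \\
  &= \sum_{j} x_{ij} w_j + \sum_{j} x_{ij} r_{w_j}
   + \sum_{j} w_j r_{x_{ij}} + \sum_{j} r_{x_{ij}} r_{w_j} \pmod{N} \\
E\Bigl(\sum_{j} x_{ij} w_j\Bigr)
  &= E(h) \cdot \prod_{j} E(x_{ij})^{N - r_{w_j}}
     \cdot \prod_{j} E(w_j)^{N - r_{x_{ij}}}
     \cdot E\Bigl(\sum_{j} r_{x_{ij}} r_{w_j}\Bigr)^{N-1}
\end{align*}
```

Each exponent of the form N − r contributes −r times the underlying plaintext modulo N, so the three mask-dependent sums in h are removed and only the inner product remains.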

(2) Correctness analysis of training model

The correctness of PPCD can be illustrated as follows: in Step 3, the hospital decrypts R with the private key sk and computes (12)

So s_i is consistent with that in Eq (1).

In Step 4, the Cloud updates Cw_j as Cw_j = Cw_j · u_j,

where u_j = Cx_ij^exp.

If S ≠ O_i and O_i = 1, exp = η (13)

Then u_j = E(ηx_i,j) and Cw_j = E(w_j + ηx_i,j) (14)

If S ≠ O_i and O_i = −1, exp = n − η (15)

Then u_j = E(−ηx_i,j) and Cw_j = E(w_j − ηx_i,j) (16)

Thus Cw_j is also consistent with that in Eq (2).

From the above calculation, PPCD trains a correct disease model in the cloud; that is, the accuracy of the prediction model is preserved.

(3) Security of patient’s medical data

To predict disease for patients, DP and UP encrypt medical information x_i = {x_i1, x_i2,…,x_ij} with the hospital’s public key PK_h and upload the ciphertext Cx_i = {Cx_i1, Cx_i2,…,Cx_ij} to the Cloud. During transmission, all the medical information is encrypted to prevent an outside attacker from eavesdropping. An adversary cannot decrypt the ciphertext without the hospital’s private key SK_h. The symptom data are encrypted by the Paillier cryptosystem, which is semantically secure against chosen-plaintext attacks. So the medical information stored in the Cloud is secure, since the Cloud cannot identify the corresponding contents or obtain the plaintext of the symptom data.

(4) Security of training disease model

During the training of the prediction model, all computations are done over ciphertexts. The encrypted inner product R is calculated by using LSM, in which each party learns nothing from the protocol. The initial model is generated by the hospital randomly and updated over ciphertexts during training, and the hospital’s SK_h is well protected. The update Cw_j = Cw_j · u_j = E(w_j + ηO_ix_i,j) can be computed easily over ciphertexts because of the additive homomorphism of Paillier. Even if the encrypted disease model is leaked to UP or the Cloud, they are not able to recover w_k without the private key SK_h.

(5) Security of predicting result

When a patient wants to identify his disease, he submits the ciphertext of his symptom data to the hospital. After finishing the disease prediction, the diagnosis result is encrypted with UP’s public key PK_up and returned to UP. If an attacker captures the prediction result, he cannot recover the corresponding contents without UP’s private key SK_up.

Performance evaluation

Complexity analysis

Computational complexity.

To analyze the complexity of the proposed PPCD, Table 4 lists the computational cost of each step. For simplicity, we use EXP to denote the time complexity of one exponentiation operation on a ciphertext in the Paillier cryptosystem. Similarly, the time complexities of one multiplication operation on ciphertexts and one modular inverse operation in the decryption algorithm are represented by MUL and DIV, respectively. In Step 1 of the disease learning phase, n exponentiations and n multiplications are required by the hospital to encrypt the initial weight. In Step 2, the Cloud uses (2n+3) exponentiations and (4n+7) multiplications, and the hospital executes 2n exponentiations and 4n multiplications to obtain R. In Step 3, one exponentiation and one modular inverse are consumed before getting S. In Step 4, to update the weight, the Cloud performs n exponentiations and n multiplications. Finally, (n−1) multiplications, one exponentiation and one modular inverse are executed to predict the disease risk. Then the encrypted diagnosis result is sent to UP.

Table 4. Summary of computational cost for x_i in PPCD.

https://doi.org/10.1371/journal.pone.0217349.t004

Communication complexity.

Assume there are N samples with n dimensions, and that the length of a ciphertext is p. In the proposed PPCD system, the encrypted symptom data are outsourced to the Cloud to train the classifier, which costs O(N(np+L)). In model training, the hospital transmits the encrypted initial weight, which requires O(np+L_IK). To compute R, the cost of transferring data is O(3np+2p+L_IK). In disease prediction, the hospital sends the encrypted prediction result to UP, which costs O(np+L_IK). The communication complexities of the proposed PPCD are detailed in Table 5.

Table 5. Summary of communication overhead in PPCD.

https://doi.org/10.1371/journal.pone.0217349.t005

Experimental results

To fairly evaluate the performance, the proposed PPCD is implemented by Java on Windows 7-X64. The Cloud is a computer with Intel Quad core 3.4GHz and 16GB available RAM, the hospital runs a machine with Intel Quad core 3.4GHz and 8GB available RAM, and the patient uses a laptop with Intel Dual core 2.0GHz and 8GB available RAM.

Data sets.

In the experiment, we use the Wisconsin breast cancer dataset (WBCD), the heart disease dataset (HDD) and the acute inflammations dataset (AID) from the UCI machine learning repository [23] to test the performance of SLP based on our PPCD scheme. Table 6 shows the statistical information of the employed three datasets.

Table 6. Description of the benchmark data sets.

https://doi.org/10.1371/journal.pone.0217349.t006

WBCD contains 683 instances, and each instance includes 9 attributes ranging from 1 to 10. In WBCD, each instance belongs to one of two possible classes: benign or malignant. HDD has 297 instances, and each instance consists of 13 attributes with two classes. Except for sex, trestbps, chol and thalach, the other 9 attributes range from 1 to 10. AID contains 120 instances, and each instance includes 6 attributes with two decisions, i.e., inflammation of urinary bladder (IUB) and nephritis of renal pelvis origin (NRPO). Except for the temperature, the other attributes are either 1 (Yes) or 0 (No).

In reality, the raw medical data may be decimal. However, the Paillier cryptosystem can only encrypt integers. To resolve this problem, the approximation and expansion (A&E) method is adopted. Following the suggestion of [12], we expand each piece of medical data by multiplying it by 10⁴ and rounding off the digits that remain after the decimal point. For instance, x_i,j is an integer lying in (Z_n ∼ −Z_n), each element of the weight w = (w₁, w₂, …, w_n) is in (Z_n ∼ −Z_n), and x_i,j and w_j are then encrypted using Paillier as follows. (17) (18) where Cx_i,j and Cw_j are the ciphertexts of x_i,j and w_j, respectively.
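A small sketch of such a fixed-point encoding is shown below. The names `SCALE`, `encode` and `decode`, the use of Python's `round`, the convention that residues above n/2 stand for negative numbers, and the example modulus are assumptions for illustration; the paper does not specify these details.

```python
SCALE = 10 ** 4  # expansion factor used in the A&E step


def encode(value, n):
    """Scale by 10^4, round to an integer, and map negatives into [0, n)."""
    return round(value * SCALE) % n


def decode(m, n, scale=SCALE):
    """Treat residues above n/2 as negative numbers and undo the scaling."""
    signed = m - n if m > n // 2 else m
    return signed / scale


# Hypothetical modulus: a product of two Mersenne primes, large enough that
# the scaled products below stay well inside (-n/2, n/2).
n = (2 ** 31 - 1) * (2 ** 61 - 1)

a = encode(0.25, n)   # 2500
b = encode(-1.5, n)   # n - 15000
print(decode(a, n), decode(b, n))

# A product of two encoded values carries the factor SCALE twice.
product = a * b % n
print(decode(product, n, SCALE ** 2))
```

Because every multiplication of two encoded values squares the scale factor, the plaintext space must leave enough headroom for the largest inner product that can occur during training.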

Results and analysis.

We run PPCD with a predefined iteration threshold of 100, and then use the trained classifier and three real data sets to evaluate its performance in terms of accuracy. For each data set, the ratio of training samples to testing samples is 7:3. Experimental results are detailed in Tables 7–10. For breast cancer, the overall accuracy achieved by SLP is 96.2% while PPCD reaches 95.6%. For heart disease, SLP obtains an overall accuracy of 94.6%, and PPCD has 93.9%. On AID, SLP gets an accuracy of 93.3% for IUB while PPCD achieves a comparable result of 92.5%. For NRPO in AID, the accuracy of SLP is 93.3% while PPCD gets 91.7%. Overall, PPCD reaches disease analysis results comparable to those of SLP.

Table 7. Accuracy comparisons of SLP in PD and PPCD in ED on WBCD.

https://doi.org/10.1371/journal.pone.0217349.t007

Table 8. Accuracy comparisons of SLP in PD and PPCD in ED on HDD.

https://doi.org/10.1371/journal.pone.0217349.t008

Table 9. Accuracy comparisons of SLP in PD and PPCD in ED for IUB of AID.

https://doi.org/10.1371/journal.pone.0217349.t009

Table 10. Accuracy comparisons of SLP in PD and PPCD in ED for NRPO of AID.

https://doi.org/10.1371/journal.pone.0217349.t010

In terms of efficiency, Table 11 gives the runtime comparisons of PPCD on the three data sets. For breast cancer, it takes 6.125s for history patients to encrypt all the symptoms. In the training phase, it takes 2993.1s for the Cloud to train the classifier. In the predicting phase, it takes 0.098s for the hospital to compute an undiagnosed patient’s disease risk (including 0.013s for UP to encrypt all the symptoms). For heart disease and AID, the time costs of data encryption, model training, and disease prediction decrease as the number of sample cases is reduced. For the sake of simplicity, multicore programming has not been adopted in the evaluation.

Table 11. Runtime comparisons of PPCD in ED and SLP in PD.

https://doi.org/10.1371/journal.pone.0217349.t011

Related work

Without sufficient storage, computation or knowledge of clinical decision making, clients frequently prefer outsourcing their data to the Cloud for model training and disease prediction. Ledley and Lusted [24] first proposed a clinical decision support system which can help physicians to solve diagnostic problems. Later, a large number of disease prediction systems based on various data mining techniques have been presented. For example, a fast disease prediction system based on SVM was proposed by [25] to predict the risk of progression of adolescent idiopathic scoliosis. Wang et al. [26] gave a risk assessment for individuals with a family history of pancreatic cancer using Bayesian classification. By introducing SVM, Huang et al. [27] designed a prediction model for breast cancer diagnosis, while Barakat et al. [28] focused on the diagnosis of diabetes mellitus. For heart disease analysis, Anooj et al. [29] tried to use specific fuzzy rules. Though various prediction models have been developed, they do not take the privacy protection of patients’ medical information into account, which will impede the further progress of CDSS.

To address this challenge, secure disease prediction schemes [1], [7], [8], [9], [11], [12], [14], which diagnose patients’ diseases without leaking the medical data or the prediction model, have been widely studied. Wang et al. [14] proposed the Healer framework based on somewhat homomorphic encryption. It uses a small sample size to facilitate secure rare variants analysis and obtains the final results by decrypting ciphertexts at a trusted party. A privacy-preserving CDSS based on naïve Bayesian classification was proposed by Liu et al. [5], which can help a clinician to diagnose the risk of patients’ disease in a privacy-preserving way. Wang et al. [9] proposed a secure SLP learning model for e-Healthcare, but it only protects the privacy of patients’ medical information; the disease model is not protected. In [11], Zhu et al. proposed an efficient and privacy-preserving medical pre-diagnosis framework using SVM, which protects sensitive personal health information from disclosure with lightweight multi-party random masking and polynomial aggregation techniques.

Recently, Tsung et al. [30] proposed a decentralized privacy-preserving healthcare predictive modeling framework on private Blockchain networks. It integrates privacy-preserving online machine learning with a private Blockchain network, applies transaction metadata to disseminate partial models, and designs a new proof-of-information algorithm to determine the order of the online learning process. Each participating site contributes to model parameter estimation without revealing any patient health information. Zhang et al. [1] proposed a secure disease prediction scheme based on matrices and SLP, which builds on new medical data encryption, disease learning, and disease prediction algorithms that utilize random matrices. Liu et al. [7] proposed a hybrid privacy-preserving clinical decision support system in fog–cloud computing, in which a fog server uses SLP to securely monitor patients’ health condition in real time. The newly detected abnormal symptoms can be further sent to the cloud server for high-accuracy prediction in a privacy-preserving way. Compared with some sophisticated machine learning algorithms such as naïve Bayesian, SVM, and deep learning classification, SLP is efficient and straightforward.

Conclusions

In this paper, we proposed a privacy-preserving disease prediction system based on SLP, which can help physicians make a proper diagnosis of disease and provide health services for patients anytime and anywhere in a privacy-preserving way. In PPCD, DP’s historical medical data are used to train SLP in ED, and the hospital uses the trained model to predict diseases for a UP. To ease the privacy concerns of DP, we adopt an additively homomorphic encryption scheme, which also offers simplicity and generality. The inevitable multiplications of SLP motivate us to introduce LSM into PPCD. As a result, users’ medical information and the trained model are kept secret from the cloud. Compared with SLP, the comparable results reached by PPCD suggest that sacrificing data precision to improve efficiency is feasible in practical use.

Although PPCD benefits privacy-preserving diagnosis, the balance between security and efficiency should be considered first. Therefore, how to optimize model training with mini-batches to improve efficiency, and how to effectively introduce other advanced machine learning methods into a privacy-preserving disease prediction system, are worthy of investigation.
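As a rough illustration of the mini-batch direction, the sketch below trains a plaintext single-layer perceptron with one averaged weight update per batch instead of one update per sample. It is only an assumption about how such training might look: it involves no encryption, it is not part of PPCD, and the function names, learning rate, and batch size are hypothetical. In an encrypted setting, fewer weight updates per epoch might reduce the number of interactive steps, but that remains to be investigated.

```python
def sign(v):
    """Sign activation, returning +1 or -1."""
    return 1 if v >= 0 else -1


def predict(w, x):
    """Perceptron output for feature vector x (first feature is a constant bias of 1)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)))


def train_minibatch(X, y, batch_size=2, lr=0.5, epochs=20):
    """Train a perceptron, applying one averaged update per mini-batch."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = 0
        for start in range(0, len(X), batch_size):
            batch_x = X[start:start + batch_size]
            batch_y = y[start:start + batch_size]
            delta = [0.0] * len(w)
            # Accumulate corrections from the misclassified samples in this batch.
            for x, t in zip(batch_x, batch_y):
                if predict(w, x) != t:
                    errors += 1
                    for j, xj in enumerate(x):
                        delta[j] += t * xj
            # Apply a single update for the whole batch.
            w = [wj + lr * dj / len(batch_x) for wj, dj in zip(w, delta)]
        if errors == 0:
            break  # every sample was classified correctly in this epoch
    return w


# Small linearly separable example; labels are +1 or -1.
X = [[1, 2, 1], [1, 3, 2], [1, -1, -2], [1, -2, -1]]
y = [1, 1, -1, -1]
w = train_minibatch(X, y)
```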

Acknowledgments

The authors would like to thank the Editor and the anonymous reviewers for their constructive comments that greatly improved the quality of this manuscript.

References

1. Zhang C, Zhu L, Xu C, and Lu R. PPDP: An efficient and privacy-preserving disease prediction scheme in cloud-based e-Healthcare system. Future Generation Computer Systems. 2018;79: 16–25.
2. Taigel F, Tueno AK, and Pibernik P. Privacy-preserving condition-based forecasting using machine learning. 2018. https://doi.org/10.1007/s11573-017-0889-x.
3. Phan N, Wang Y, Wu X, Dou D. Differential Privacy Preservation for Deep Auto-Encoders: An Application of Human Behavior Prediction. In Proc. Thirtieth Int. Conf. Artificial Intelligence processing. 2016; pp. 1309–1316.
4. Liu J, Juuti M, Lu Y, Asokan N. Oblivious neural network predictions via MiniONN transformations. In Proc. Twenty-fourth ACM Conf. Computer and Communications Security. 2017; pp. 619–631.
5. Li P, Li J, Huang Z, Li T, Gao CZ, Yiu SM, et al. Multi-key privacy-preserving deep learning in cloud computing. Future Generation Computer Systems. 2017;74: 76–85.
6. Gao CZ, Cheng Q, He P, Susilo W, Li J. Privacy-preserving naïve bayes classifiers secure against the substitution-then-comparison attack. Information Sciences. 2018;444:72–88.
7. Liu XM, Deng RH, Yang Y, Tran NH, and Zhong SP. Hybrid privacy-preserving clinical decision support system in fog–cloud computing. Future Generation Computer Systems. 2018;78(2): 825–837.
8. Zhang X, Chen X, Wang J, Zhan Z, and Li J. Verifiable privacy-preserving single-layer perceptron training scheme in cloud computing. 2018. Soft Computing [online]. https://doi.org/10.1007/s00500-018-32-33-7.
9. Wang GM, Lu RX, and Huang C. PSLP: Privacy-preserving Single-Layer Perceptron Learning for e-Healthcare. Proc. ICICS 10th Int. Conf. Information, Communication and Signal Processing. 2015; pp. 1–5.
10. Schurink C, Lucas P, Hoepelman I, and Bonten M. Computer-assisted decision support for the diagnosis and treatment of infectious diseases in intensive care units. The Lancet Infectious Diseases. 2005;5(5): 305–312. pmid:15854886
11. Zhu H, Liu X, Lu R, and Li H. Efficient and Privacy-Preserving Online Medical Pre-Diagnosis Framework Using Nonlinear SVM. IEEE Journal of Biomedical and Health Informatics. 2017;21(3): 838–850. pmid:28113828
12. Rahulamathavan Y, Veluru S, Phan RC, Chambers JA, Rajarajan M. Privacy-Preserving Clinical Decision Support System Using Gaussian Kernel-Based Classification. IEEE Journal of Biomedical and Health Informatics. 2014;18(1): 56–66. pmid:24403404
13. Musen MA, Shahar Y, Shortliffe EH. Clinical decision-support systems. Springer. Journal of Biomedical Informatics. 2014; pp. 698–736.
14. Wang S, Zhang Y, Dai W, Lauter K, Kim M, Tang Y, et al. HEALER: Homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS. Bioinformatics. 2016;32(2): 211–218.
15. Liu X, Lu R, Ma J, Chen L, and Qin B. Privacy-preserving Patient-Centric Clinical Decision Support System on Naïve Bayesian Classification. IEEE Journal of Biomedical and Health Informatics. 2016;20(2): 655–668. pmid:26960216
16. Jiang X, Zhao Y, Wang X, Malin B, Wang S, Ohno-Machado L, et al. A community assessment of privacy preserving techniques for human genomes. BMC Medical Informatics and Decision Making. 2014;14(S1): S1.
17. Zhao Y, Wang X, Jiang X, Ohno-Machado L, and Tang H. Choosing blindly but wisely: differentially private solicitation of DNA datasets for disease marker discovery. Journal of the American Medical Informatics Association. 2015;22(1): 100–108. pmid:25352565
18. Wang S, Mohammed N, and Chen R. Differentially private genome data dissemination through top-down specialization. BMC medical informatics and decision making. 2014;14(S1):S2.
19. Freund Y, and Schapire RE. Large margin classification using the perceptron algorithm. Mach. Learn. 1999;37(3): 277–296.
20. Paillier P. Public-key cryptosystems based on composite degree residuosity classes. Proc. Advances in Cryptology–EUROCRYPT ’99, Theory and Application of Cryptographic Techniques, Prague, Czech Republic, May 2–6, 1999; pp. 223–238.
21. Samanthula BK, Elmehdwi Y, and Jiang W. K-nearest neighbor classification over semantically secure encrypted relational data. arXiv preprint arXiv:1403.5001, 2014.
22. Vimercati SDCdi, Foresti S, Jajodia S, Paraboschi S, and Samarati P. Over-encryption: management of access control evolution on outsourced data. In Proc. 33rd Int. Conf. Very Large Data Bases. VLDB Endowment, 2007; pp. 123–134.
23. Lichman M. UCI machine learning repository. [cited 2018 Dec 8]. http://archive.ics.uci.edu/ml.
24. Ledley RS and Lusted LB. Reasoning foundations of medical diagnosis. Science. 1959;130(3366): 9–21. pmid:13668531
25. Ajemba P, Ramirez L, Durdle N, Hill D, and Raso V. A support vectors classifier approach to predicting the risk of progression of adolescent idiopathic scoliosis. IEEE Trans. Inform. Technol. Biomed. 2005;9(2):276–282.
26. Wang W, Chen S, Brune KA, Hruban RH, Parmigiani G, and Klein AP. PancPRO: risk assessment for individuals with a family history of pancreatic cancer. J. Clin. Oncol. 2007;25(11):1417–1422. pmid:17416862
27. Huang CL, Chen HC, Chen MC. Prediction model building and feature selection with support vector machines in breast cancer diagnosis. Expert Syst. Appl. 2008;34(1): 578–587.
28. Barakat MNH, and Bradley AP. Intelligible support vector machine for diagnosis of diabetes mellitus. IEEE Trans. Inform. Technol. Biomed. 2010;14(4): 1114–1120.
29. Anooj PK. Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules. J. King Saud Univ.–Comput. Inf. Sci. 2012;24(1): 27–40.
30. Kuo TT, and Ohno-Machado L. ModelChain: Decentralized privacy-preserving healthcare predictive modeling framework on private blockchain networks. 2018. https://arxiv.org/abs/1802.01746.
