PPCD: Privacy-preserving clinical decision with cloud support

With the rise of machine learning and cloud computing, meaningful information can be mined from massive electronic medical data, which helps physicians make proper disease diagnoses for patients. However, using the medical data and disease information of patients frequently raises privacy concerns. In this paper, based on the single-layer perceptron, we propose a scheme of privacy-preserving clinical decision with cloud support (PPCD), which securely conducts disease model training and prediction for the patient. Each party learns nothing about the other’s private information. In PPCD, a lightweight secure multiplication is presented and introduced to improve the model training. Security analysis and experimental results on real data confirm that the proposed PPCD achieves high disease prediction accuracy without the risk of privacy disclosure.


Introduction
With the sharp growth of electronic data, machine learning has influenced people's lifestyles by predicting human behavior and future trends [1], [2], [3]. To overcome the limitations of storage and computing resources, outsourcing expensive machine learning tasks to the Cloud has attracted increasing attention. For instance, the client's data can be transmitted to the Cloud for both model training and prediction [4], [5], [6]. As a popular machine learning algorithm, the single-layer perceptron (SLP) is simple yet efficient and has been widely used in disease prediction [7], [8], [9]. It is more appropriate for real-time disease prediction than more complex techniques such as naïve Bayesian classification [10], decision trees [2] and support vector machines (SVMs) [11], [12]. Clinical decision support systems (CDSS), which use various data mining techniques to help physicians make proper disease diagnoses and provide health services for patients, have received considerable attention [7], [13], [14], [15]. However, because of privacy concerns, users do not want to submit their medical data to an unauthorized institution [16], [17], [18]. At the same time, since the classifier is considered an asset of the medical service provider, there is a risk of exposing the prediction model to a third party, who could then use the model to make disease predictions for patients and thereby harm the medical service provider. Therefore, the confidentiality of both the medical data and the disease model is crucial for the CDSS. How to achieve secure disease prediction without compromising the accuracy of the result becomes a challenging issue.
To protect the privacy of patients' medical data and the security of the prediction model, in this study, we propose a privacy-preserving clinical decision scheme based on SLP with cloud support (PPCD). As shown in Fig 1, it includes two phases: SLP model training and disease prediction. In model training, diagnosed patients encrypt their symptom data and outsource them, together with the corresponding diagnosed disease, to the cloud. Meanwhile, the hospital generates random weights, which are then encrypted and sent to the cloud. After receiving both the encrypted medical data and the weights, the cloud trains the model through a few interactions with the hospital. The cloud selects an encrypted sample and executes the sign(.) function. If the returned value of sign(.) does not match the sample's label, the cloud updates the weights; this is repeated until the convergence criterion is satisfied or all the disease cases are matched. When a patient wants to check his disease, he encrypts his symptom data and submits them to the hospital, which completes the analysis based on the disease model and sends back the encrypted diagnosis result and some medical advice.
To tackle the privacy concerns in clinical decision support systems, PPCD provides disease model training and disease risk prediction for the patient in a privacy-preserving way, such that the Cloud learns nothing about the patient's medical information or the actual model. Specifically, the main contributions are:
1. The proposal of PPCD, which provides privacy-preserving clinical decisions based on SLP with cloud support. It helps the doctor to predict disease while the medical data and the diagnosis result remain in encrypted form. Furthermore, the built disease diagnosis model is also protected as an asset of the hospital.
2. For privacy preservation in the model training phase, a specific lightweight secure multiplication (LSM) is presented. By employing LSM, PPCD securely completes the inner product in the encrypted domain (ED) in one round.
3. We implement PPCD in Java to check its performance in ED. Experimental results on several medical data sets confirm that PPCD achieves accuracies comparable with SLP in the plain domain (PD).
The remainder of this paper is organized as follows. The following section briefly introduces the preliminaries. Then, PPCD is proposed along with LSM. The correctness and security analysis is then detailed, followed by the performance evaluation. Related work and conclusions are given in the last two sections, respectively.

Preliminaries
In this section, a brief overview of the Paillier cryptosystem, SLP and secure multiplication (SM) is given. Table 1 summarizes the key notations.

Single-layer perceptron
Following [19], SLP learns a weight vector w, which is multiplied with the input features to determine whether a sample belongs to one class or the other. We define an activation function sign(z) which takes the linear combination z of the input values x and the weights w as input. If the linear combination reaches a defined threshold θ, we predict 1, and -1 otherwise. In order to simplify the notation, we define $w_0 = -\theta$ and $x_0 = 1$, so that

$$z = \sum_{j=0}^{n} w_j x_j = w^{T} x, \qquad \mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & \text{otherwise} \end{cases} \qquad (1)$$

For each training sample $x_i$ with target output $O_i$, we calculate the output value and update w if the output is not the same as the target. The value for updating the weights at each increment is calculated by the learning rule

$$w_j = w_j + \eta\, O_i\, x_{i,j} \qquad (2)$$

where η is the learning rate ($0 < \eta \le 1$). It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable. If a linear decision boundary cannot separate the two classes, a maximum number of passes over the training dataset and/or a threshold for the number of tolerated misclassifications should be set.
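The learning rule described above can be illustrated with a minimal plain-domain sketch. It is not the authors' implementation: the function names `train_perceptron` and `predict`, the zero initial weights and the default of 100 passes are assumptions chosen for illustration only.

```python
def sign(z):
    """Activation: 1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


def train_perceptron(samples, labels, eta=1.0, max_passes=100):
    """Plain-domain SLP training (hypothetical helper).

    w[0] plays the role of -theta and is paired with the constant x_0 = 1.
    Training stops when a full pass produces no misclassification or when
    max_passes is reached (needed if the classes are not linearly separable).
    """
    n = len(samples[0])
    w = [0.0] * (n + 1)  # assumption: zero initial weights
    for _ in range(max_passes):
        errors = 0
        for x, o in zip(samples, labels):
            xa = [1.0] + list(x)  # prepend x_0 = 1
            z = sum(wj * xj for wj, xj in zip(w, xa))
            if sign(z) != o:
                # learning rule: w_j = w_j + eta * O_i * x_ij
                w = [wj + eta * o * xj for wj, xj in zip(w, xa)]
                errors += 1
        if errors == 0:
            break
    return w


def predict(w, x):
    """Classify one sample with trained weights w."""
    xa = [1.0] + list(x)
    return sign(sum(wj * xj for wj, xj in zip(w, xa)))


# Toy, linearly separable data (logical AND encoded with labels in {-1, 1}).
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
O = [-1, -1, -1, 1]
weights = train_perceptron(X, O)
```

Because the toy data are linearly separable, the loop terminates well before the pass limit; for non-separable data the limit is what ends training.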

Paillier cryptosystem
The Paillier cryptosystem is an additively homomorphic cryptosystem [20]. It works as follows:
1. Key generation: Two large prime numbers p and q are randomly and independently chosen such that gcd(pq, (p − 1)(q − 1)) = 1, where |p| = |q|. Then, we compute n = pq and λ = lcm(p − 1, q − 1), and select a random integer g in $\mathbb{Z}^{*}_{n^2}$. By setting $\mu = (L(g^{\lambda} \bmod n^2))^{-1} \bmod n$ with $L(x) = \frac{x-1}{n}$, the public key (n, g) and the private key (λ, μ) are obtained.
2. Encryption: Let m be a message to be encrypted, where $0 \le m < n$. With a randomly selected r, where 0 < r < n, the ciphertext is calculated by $c = E(m) = g^{m} \cdot r^{n} \bmod n^2$.
3. Decryption: Let c be the ciphertext to decrypt, where $c \in \mathbb{Z}^{*}_{n^2}$. The plaintext message is obtained by $m = D(c) = L(c^{\lambda} \bmod n^2) \cdot \mu \bmod n$.
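The three steps above can be sketched in Python with toy parameters. This is only an illustration, not the authors' Java implementation: the primes 1009 and 1013 are far too small for real use, g = n + 1 is a common simplification assumed here in place of a random g, and the names `keygen`, `encrypt` and `decrypt` are hypothetical.

```python
import math
import secrets


def keygen(p, q):
    """Return ((n, g), (lam, mu)) for primes p, q (toy sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # assumption: a standard valid choice of g
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)  # (L(g^lam mod n^2))^-1 mod n
    return (n, g), (lam, mu)


def encrypt(pk, m):
    """c = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    n, g = pk
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    """m = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) / n."""
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


pk, sk = keygen(1009, 1013)
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 20), encrypt(pk, 22)
c_sum = c1 * c2 % n2        # multiplying ciphertexts adds plaintexts
c_scaled = pow(c1, 3, n2)   # exponentiation multiplies by a constant
```

The last two lines show the additive homomorphism that the rest of the scheme relies on: ciphertext multiplication corresponds to plaintext addition, and raising a ciphertext to a constant corresponds to multiplying the plaintext by that constant.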
Secure multiplication. Secure multiplication (SM) [21] supports multiplication in ED. Suppose Alice has two encrypted values $E_{pk}(X)$ and $E_{pk}(Y)$, and Bob has the private key sk corresponding to the public key pk. The goal of SM is to compute $E_{pk}(X \cdot Y)$ without leaking X and Y to Alice. The SM protocol is described in [21].
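A widely used masking-based formulation of such a two-party SM protocol can be sketched as follows. It is an assumption that this matches the variant used in the paper: Alice blinds both ciphertexts with random values, Bob decrypts and multiplies the blinded values, and Alice removes the blinding terms homomorphically. The toy Paillier helpers and all function names are hypothetical, and the key size is illustrative only.

```python
import math
import secrets


def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    g = n + 1  # assumption: simplified generator
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def alice_mask(pk, ex, ey):
    """Alice blinds E(X), E(Y) with random r_x, r_y before sending to Bob."""
    n, _ = pk
    rx, ry = secrets.randbelow(n), secrets.randbelow(n)
    x_masked = ex * encrypt(pk, rx) % (n * n)  # E(X + r_x)
    y_masked = ey * encrypt(pk, ry) % (n * n)  # E(Y + r_y)
    return (x_masked, y_masked), (rx, ry)


def bob_multiply(pk, sk, x_masked, y_masked):
    """Bob decrypts the blinded values, multiplies them, re-encrypts."""
    n, _ = pk
    h = decrypt(pk, sk, x_masked) * decrypt(pk, sk, y_masked) % n
    return encrypt(pk, h)  # E((X + r_x)(Y + r_y))


def alice_unmask(pk, eh, ex, ey, rx, ry):
    """Alice removes X*r_y, Y*r_x and r_x*r_y to obtain E(X*Y)."""
    n, _ = pk
    n2 = n * n
    s = eh * pow(ex, n - ry, n2) % n2             # subtract X * r_y
    s = s * pow(ey, n - rx, n2) % n2              # subtract Y * r_x
    s = s * pow(encrypt(pk, rx * ry), n - 1, n2) % n2  # subtract r_x * r_y
    return s


def secure_multiply(pk, sk, ex, ey):
    """Run both roles locally; sk is only used inside Bob's step."""
    (xm, ym), (rx, ry) = alice_mask(pk, ex, ey)
    eh = bob_multiply(pk, sk, xm, ym)
    return alice_unmask(pk, eh, ex, ey, rx, ry)


pk, sk = keygen(1009, 1013)
e_prod = secure_multiply(pk, sk, encrypt(pk, 123), encrypt(pk, 45))
```

Computing an inner product of two n-dimensional encrypted vectors this way needs one such exchange per coordinate, which is the cost that motivates a lighter protocol.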

Model overview and requirements
Model overview. To employ SLP for model training and disease prediction while protecting privacy, the proposed PPCD model contains four parties, which are illustrated in Table 2. They collaboratively conduct SLP model training and disease prediction. The CS trains a disease prediction model based on the DPs' disease data. To check for a disease, a UP submits his symptom data to the hospital, which predicts the corresponding disease based on the trained model. Fig 1 depicts the detailed procedure.
Privacy requirements. In PPCD, DPs are trustworthy. They provide correct medical data to the Cloud server. Meanwhile, CS and UP are honest-but-curious [22]. CS strictly follows the privacy-preserving SLP learning protocol performed in the system, but it tries to learn DP's sensitive medical data and UP's medical information whenever it has the opportunity. UP is interested in the trained disease model. The hospital is honest. At the same time, an outside adversary is curious about all data transferred in the system and may eavesdrop on it. Privacy preservation is therefore critical for successfully diagnosing the patient's disease, and the security requirements of PPCD are listed as follows.
1. UP's privacy: During disease diagnosis, the sensitive symptom data of UP should not be leaked to other untrusted parties during transmission. Furthermore, the diagnosis result is confidential to the patient and must not be exposed to any other entity. That is, UP's privacy should be preserved.
2. DP's privacy: Generally, DP holds historical medical information, e.g., the diagnosed disease and the confirmed symptom data. This information is highly sensitive and must not be obtained by unauthorized entities. Otherwise, DP is unwilling to provide the historical disease data for model training due to privacy concerns.
3. Hospital's privacy: In PPCD, the hospital trains the disease model using the historical medical data with the help of the Cloud. As an asset of the hospital, the disease model must not be leaked to UP or other parties during disease diagnosis.
Design goal. Based on the above scenarios and security requirements, the system should realize model training and disease diagnosis in a privacy-preserving and efficient way. The particular goals are as follows.
1. Privacy-preserving requirements: the success of clinical decision support hinges upon information security and privacy preservation. If the model's privacy requirements are not considered, the patient's sensitive data and the disease model will be exposed to unauthorized parties. Patients with a medical history then become unwilling to share their medical data with PPCD, the accuracy of the trained model cannot be ensured, and the diagnosis service degrades. Therefore, the system should protect the privacy of both history patients and undiagnosed patients.
2. Confidentiality and accuracy of the disease model should be achieved: the disease model is a valuable asset of the hospital, which may be reluctant to reveal information about it. At the same time, it is crucial that applying privacy preservation does not compromise the accuracy of the prediction model.

The Proposed PPCD Model
Privacy-preserving training. This section shows how to construct PPCD, train the disease model and predict disease based on the model in a privacy-preserving way.
(1) System setting
Key generation: The Paillier encryption algorithm is run by the hospital to generate keys for both UP and the hospital. Given the security parameter k, two large prime numbers p and q are chosen randomly such that |q| = |p| = k. The hospital generates the public key (n, g) and the corresponding private key (λ, μ), where n = pq and λ = lcm(p − 1, q − 1).
Data encryption: Raw medical data $\vec{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$ are encrypted and submitted to the Cloud for storage and model training. The Cloud stores the disease patterns $\langle Cx_i, O_i \rangle_{i=1}^{m}$, where $x_i$ is an n-dimensional vector in which each element represents a confirmed symptom, and $O_i \in \{-1, 1\}$ is the associated desired output, where 1 represents suffering from the disease and -1 represents not suffering from it. We suppose the medical data have been preprocessed so that their format is suitable for PPCD. In the system, the disease output is stored in the cloud server in plaintext because leaking the disease output does not damage patients' privacy. The encrypted medical data are stored in the cloud as shown in Table 3.
Meanwhile, the disease prediction model is sensitive data which should be encrypted. At the beginning of model training, the hospital generates a random weight vector $w = (w_1, w_2, \ldots, w_n)$, encrypts it, and sends the ciphertext of the weights to the Cloud server.
(2) Lightweight secure multiplication protocol
SM can be used to calculate the inner product of two encrypted vectors. Given $\vec{Cx_i} = (Cx_{i,1}, \ldots, Cx_{i,n})$ and $\vec{Cw} = (Cw_1, \ldots, Cw_n)$, $E(\sum_{j=1}^{n} x_{i,j} \cdot w_j)$ is calculated by running SM n times. To compute the inner product of two encrypted vectors more efficiently, we propose, based on SM, a lightweight secure multiplication (LSM) protocol which obtains the inner product over ciphertexts in a single round. Considering two parties C1 and C2, LSM is detailed in Algorithm 1.

Table 3. Medical data for the k-th disease.
Medical sample | Medical data | Desired output

Algorithm 1 (LSM).
Step 1: C1
For each $Cx_{ij}$ and $Cw_j$:
(4) $X_{ij} = Cx_{ij} \cdot Cr_{x_{ij}}$
(5) $W_j = Cw_j \cdot Cr_{w_j}$; send $X_{ij}$, $W_j$ to C2
Step 2: C2
Step 3: C1
(1) Receiving the H

Line 24: If all the disease samples are matched or the training count exceeds the convergence criterion, the hospital terminates the training, and $\langle w_k, I_k \rangle$ is taken as the prediction model for $D_k$; otherwise return and repeat lines 13-14.
After obtaining the k-th disease model, the Cloud selects $\langle Cx_i, O_i \rangle_{i=1}^{m} \in D_{k+1}$ and repeats lines 8-24. After all medical samples have been used for training, the hospital obtains the prediction models $\langle Cw_k, I_k \rangle_{k=1}^{m}$ for all diseases.
Disease prediction. In this phase, we assume that the prediction models have been trained and stored in the hospital. The hospital can predict whether a patient suffers from the k-th disease using the k-th disease model. When an undiagnosed patient submits his encrypted symptom information to the hospital, the prediction is executed as follows.
Step 1: When the ciphertext of the symptom information arrives, the hospital decrypts it and obtains the plaintext symptom data $\vec{x}_i$.
Step 2: Let s = 0. For each $x_j$ and $w_j$, the hospital calculates $s_j = x_j \cdot w_j$ and then obtains $s = \sum_{j=1}^{n} s_j$.
Step 3: Compute S = sign(s). If S = 1 (that is, s ≥ 0), the patient suffers from the disease; otherwise the patient does not.
Step 4: The hospital encrypts the prediction result with UP's public key and returns it to the patient.
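The four prediction steps can be traced with a small sketch. It assumes toy Paillier keys (one pair for the hospital, one for UP), integer-valued symptoms and weights, and negative values represented as residues modulo n; the helper names, including `hospital_predict`, are hypothetical and not taken from the paper.

```python
import math
import secrets


def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    g = n + 1  # assumption: simplified generator
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def to_signed(m, n):
    """Map a residue in [0, n) back to a signed integer."""
    return m - n if m > n // 2 else m


def hospital_predict(hospital_pk, hospital_sk, up_pk, enc_symptoms, weights):
    n = hospital_pk[0]
    # Step 1: decrypt the symptom ciphertexts
    x = [to_signed(decrypt(hospital_pk, hospital_sk, c), n) for c in enc_symptoms]
    # Step 2: inner product of symptoms and model weights
    s = sum(xj * wj for xj, wj in zip(x, weights))
    # Step 3: sign decision
    result = 1 if s >= 0 else -1
    # Step 4: encrypt the result under UP's public key
    return encrypt(up_pk, result)


hospital_pk, hospital_sk = keygen(1009, 1013)  # toy primes, illustration only
up_pk, up_sk = keygen(1019, 1021)
model = [2, -3]                                # hypothetical trained weights

enc_a = [encrypt(hospital_pk, v) for v in (4, 1)]  # s = 8 - 3 = 5
enc_b = [encrypt(hospital_pk, v) for v in (1, 3)]  # s = 2 - 9 = -7
res_a = to_signed(decrypt(up_pk, up_sk, hospital_predict(hospital_pk, hospital_sk, up_pk, enc_a, model)), up_pk[0])
res_b = to_signed(decrypt(up_pk, up_sk, hospital_predict(hospital_pk, hospital_sk, up_pk, enc_b, model)), up_pk[0])
```

Only the hospital sees the plaintext symptoms and the weights in this flow, and only UP can decrypt the returned result.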

Correctness & security analysis
In this section, we analyze the correctness and security of the proposed PPCD scheme. In particular, we focus on how PPCD preserves the privacy of the patient's medical information and of the disease model.

(1) Correctness analysis of LSM
The correctness of LSM can be illustrated by following Step 1, Step 2 and Step 3 of the protocol in turn. From this derivation, LSM calculates $E(\sum_{i=1}^{n} x_i \cdot w_i)$ in one round.

(2) Correctness analysis of training model
The correctness of PPCD can be illustrated as follows. In Step 3, the hospital decrypts R with the private key sk and computes $s_i$, so $s_i$ is consistent with that in Eq (1). In Step 4, the Cloud updates $Cw_j$ as $Cw_j = Cw_j \cdot u_j$, so that $Cw_j = E(w_j + \eta O_i x_{i,j})$. Thus $Cw_j$ is also consistent with that in Eq (2). From the above calculation, PPCD trains the correct disease model in the cloud; that is, the accuracy of the prediction model is preserved.
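The weight update in Step 4 can be checked numerically over toy ciphertexts. The sketch assumes that $u_j$ is obtained by raising $Cx_{i,j}$ to the power $\eta O_i$, that η is an integer (for example after scaling), and that negative values are handled as residues modulo n; the Paillier helpers and `update_weight` are hypothetical names, not the authors' code.

```python
import math
import secrets


def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    g = n + 1  # assumption: simplified generator
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    n, g = pk
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def to_signed(m, n):
    return m - n if m > n // 2 else m


def update_weight(pk, cw_j, cx_ij, eta, o_i):
    """Return E(w_j + eta * O_i * x_ij) from E(w_j) and E(x_ij).

    The exponent is reduced modulo n so that a negative eta * O_i becomes
    a valid non-negative exponent.
    """
    n, _ = pk
    u_j = pow(cx_ij, (eta * o_i) % n, n * n)  # E(eta * O_i * x_ij)
    return cw_j * u_j % (n * n)


pk, sk = keygen(1009, 1013)  # toy primes, illustration only
n = pk[0]
cw = encrypt(pk, 7)          # E(w_j) with w_j = 7
cx = encrypt(pk, 3)          # E(x_ij) with x_ij = 3
up = update_weight(pk, cw, cx, eta=2, o_i=1)     # 7 + 2*3
down = update_weight(pk, cw, cx, eta=3, o_i=-1)  # 7 - 3*3
```

The Cloud performs this update without ever decrypting either operand, which is why the plaintext weights stay hidden from it.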

(3) Security of patient's medical data
To predict disease for patients, DP and UP encrypt the medical information $x_i = \{x_{i1}, x_{i2}, \ldots, x_{ij}\}$ with the hospital's public key $PK_h$ and upload the ciphertext $Cx_i = \{Cx_{i1}, Cx_{i2}, \ldots, Cx_{ij}\}$ to the Cloud. During transmission, all the medical information is encrypted to prevent an outside attacker from eavesdropping. An adversary cannot decrypt the ciphertext without the hospital's private key $SK_h$. The symptom data are encrypted with the Paillier cryptosystem, which is semantically secure against chosen-plaintext attacks. The medical information stored in the Cloud is therefore secure, since the Cloud cannot identify the corresponding contents or obtain the plaintext of the symptom data.

(4) Security of training disease model
During the training of the prediction model, all the computations are done over ciphertexts. $E(\sum_{j=1}^{n} x_{i,j} \cdot w_j)$ is calculated by using LSM, in which each party learns nothing from the protocol. The initial model is generated randomly by the hospital and updated over ciphertexts during training, and the hospital's $SK_h$ is well protected. $u_j = Cx_{i,j}^{\eta O_i}$ and $Cw_j = Cw_j \cdot u_j = E(w_j + \eta O_i x_{i,j})$ can be computed easily over ciphertexts because of the additive homomorphism of Paillier. Even if the encrypted disease model is leaked to UP or the Cloud, they are not able to recover $w_k$ without the private key $SK_h$.

(5) Security of predicting result
When a patient wants to identify his disease, he submits the ciphertext of the symptom data to the hospital. After the disease prediction is finished, the diagnosis result is encrypted with UP's public key $PK_{up}$ and returned to UP. If an attacker captures the prediction result, he cannot recover the corresponding contents without UP's private key $SK_{up}$.

Complexity analysis
Computational complexity. To analyze the complexity of the proposed PPCD, Table 4 lists the computational cost of each step. For simplicity, we use EXP to denote the time complexity of one exponentiation operation on a ciphertext in the Paillier cryptosystem. Similarly, the time complexities of one multiplication operation on a ciphertext and one modular inverse operation in the decryption algorithm are represented by MUL and DIV, respectively. In Step 1 of the disease learning phase, n exponentiations and n multiplications are required by the hospital, which encrypts the initial weights. In Step 2, the Cloud uses (2n+3) exponentiations and (4n+7) multiplications, and the hospital executes 2n exponentiations and 4n multiplications to obtain R. In Step 3, one exponentiation and one modular inverse are consumed before obtaining S. In Step 4, to update the weights, the Cloud performs n exponentiations and n multiplications. Finally, (n-1) multiplications, one exponentiation and one modular inverse are executed to predict the disease risk. Then the encrypted diagnosis result is sent to UP.
Communication complexity. Assume there are N samples with n dimensions, and the length of a ciphertext is p. In the proposed PPCD system, the encrypted symptom data are outsourced to the Cloud to train the classifier, which costs $O(N(np + L))$. In model training, the hospital transmits the encrypted initial weights, which requires $O(np + L_{IK})$. To compute R, the cost of transferring data is $O(3np + 2p + L_{IK})$. In disease prediction, the hospital sends the encrypted prediction result to UP, which costs $O(np + L_{IK})$. The communication complexities of the proposed PPCD are detailed in Table 5.

Experimental results
To evaluate the performance fairly, the proposed PPCD is implemented in Java on Windows 7-X64. The Cloud is a computer with an Intel quad-core 3.4GHz CPU and 16GB of available RAM, the hospital runs a machine with an Intel quad-core 3.4GHz CPU and 8GB of available RAM, and the patient uses a laptop with an Intel dual-core 2.0GHz CPU and 8GB of available RAM.
Data sets. In the experiment, we use the Wisconsin breast cancer dataset (WBCD), the heart disease dataset (HDD) and the acute inflammations dataset (AID) from the UCI machine learning repository [23] to test the performance of SLP based on our PPCD scheme. Table 6 shows the statistical information of the three datasets employed.
WBCD contains 683 instances, and each instance includes 9 attributes ranging from 1 to 10. In WBCD, each instance can be grouped into one of two possible classes: benign or malignant. HDD has 297 instances, and each instance consists of 13 attributes with two classes. Except for sex, trestbps, chol and thalach, the other 9 attributes range from 1 to 10. AID contains 120 instances, and each instance includes 6 attributes with two decisions, i.e., inflammation of urinary bladder (IUB) and nephritis of renal pelvis origin (NRPO). Except for the temperature, each of the other attributes is either 1 (Yes) or 0 (No).

Table 4. Summary of computational cost for $x_i$ in PPCD.
Phase | Step | Entity | Computational cost

Table 5. https://doi.org/10.1371/journal.pone.0217349.t005
Phase | Step | Communication overhead
Outsourcing DP's data | | $N(np + L)$
Disease learning | Step 1 | $np + L_{IK}$
Disease learning | Step 2 | $2np + 2p$
Disease learning | Step 4 | $np + L_{IK}$
Disease prediction | | $np + L_{IK}$

In reality, the raw medical data $x_{i,j} \in \vec{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$ may be decimal. However, the Paillier cryptosystem can only encrypt integers. To resolve this problem, the approximation and expansion (A&E) method is adopted. Following the suggestion of [12], we expand each piece of medical data by multiplying it by $10^4$ and rounding off all the values after the decimal point. For instance, $x_{ij}$ is an integer lying in ($Z_n^*$ − $Z_n$), the item of the weight $w = (w_1, w_2, \ldots, w_n)$ is in ($Z_n^*$ − $Z_n$), and $x_{i,j}$ is then encrypted using Paillier as follows.
$$Cx_{i,j} = \begin{cases} E(x_{i,j}), & x_{i,j} \ge 0 \\ E(n - |x_{i,j}|), & x_{i,j} < 0 \end{cases}$$
where $Cx_{i,j}$ and $Cw_j$ are the ciphertexts of $x_{i,j}$ and $w_j$, respectively, and $Cw_j$ is obtained from $w_j$ in the same way.
Results and analysis. We run PPCD with a predefined iteration threshold of 100, and then use the classifier and the three real data sets to evaluate the classifier's performance in terms of accuracy. For each data set, the ratio of training samples to testing samples is 7:3. Experimental results are detailed in Tables 7-10. For breast cancer, the overall accuracy achieved by SLP is 96.2%, while PPCD reaches 95.6%. For heart disease, SLP obtains an overall accuracy of 94.6%, and PPCD obtains 93.9%. On AID, SLP achieves an accuracy of 93.3% for IUB, while PPCD achieves a comparable 92.5%. For NRPO in AID, the accuracy of SLP is 93.3%, while PPCD reaches 91.7%. Overall, PPCD reaches disease analysis results comparable with those of SLP. In terms of efficiency, Table 11 gives the runtime of PPCD on the three data sets. For breast cancer, it takes 6.125s for the history patients to encrypt all the symptoms. In the training phase, it takes 2993.1s for the Cloud to train the classifier. In the prediction phase, it takes 0.098s for the hospital to compute an undiagnosed patient's disease risk (including 0.013s for UP to encrypt all the symptoms). For heart disease and AID, the time costs of data encryption, model training and disease prediction decrease as the number of sample cases is reduced. For the sake of simplicity, multicore programming was not adopted in the evaluation.
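The approximation and expansion encoding used before encryption in this section can be illustrated with a short sketch. It assumes a scaling factor of $10^4$, rounding to the nearest integer, and the representation of a negative value x as n − |x|; the toy modulus and the function names `encode` and `decode` are hypothetical.

```python
SCALE = 10 ** 4  # expansion factor from the text


def encode(x, n, scale=SCALE):
    """Map a decimal value to a Paillier plaintext in [0, n).

    Non-negative values map to themselves after scaling; a negative value
    v is represented as n - |v|.
    """
    v = round(x * scale)  # assumption: round to nearest integer
    if abs(v) >= n // 2:
        raise ValueError("scaled value does not fit the signed range of n")
    return v if v >= 0 else n - abs(v)


def decode(m, n, scale=SCALE):
    """Inverse of encode: residues above n/2 are read as negatives."""
    v = m - n if m > n // 2 else m
    return v / scale


n = 1009 * 1013            # toy modulus, far smaller than a real Paillier n
a = encode(1.2345, n)      # 12345
b = encode(-0.5, n)        # n - 5000
```

One consequence worth keeping in mind is that a product of two encoded values carries the scale factor twice, so an inner product over encoded data has to be divided by the squared scale when it is interpreted, and the modulus must be large enough to hold it.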

Related work
Without sufficient storage, computation or knowledge of clinical decision making, clients frequently prefer outsourcing their data to the Cloud for model training and disease prediction. Ledley and Lusted [24] first proposed a clinical decision support system which can help physicians to solve diagnostic problems. Later, a large number of disease prediction systems based on various data mining techniques were presented. For example, a fast disease prediction system based on SVM was proposed by [25] to predict the risk of progression of adolescent idiopathic scoliosis. Wang et al. [26] gave a risk assessment for individuals with a family history of pancreatic cancer using Bayesian classification. By introducing SVM, Huang et al. [27]  [29] tried to use specific fuzzy rules. Though various prediction models have been developed, the privacy protection of patients' medical information has not been taken into account, which will impede the further progress of CDSS.
To address this challenge, secure disease prediction schemes [1], [7], [8], [9], [11], [12], [14], which diagnose patients' diseases without leaking the medical data or the prediction model, have been widely studied. Wang et al. [14] proposed the Healer framework based on somewhat homomorphic encryption. It uses a small sample size to facilitate secure rare variants analysis and obtains the final results by decrypting ciphertexts at the trusted party. A privacy-preserving CDSS based on naïve Bayesian classification was proposed by Liu et al. [5], which can help a clinician to diagnose the risk of patients' diseases in a privacy-preserving way. Wang et al. [9] proposed a secure SLP learning model for e-Healthcare, but it only protects the privacy of patients' medical information; the disease model is not protected. In [11], Zhu et al. proposed an efficient and privacy-preserving medical pre-diagnosis framework using SVM, which protects sensitive personal health information from disclosure by means of lightweight multi-party random masking and polynomial aggregation techniques.
Recently, Tsung et al. [30] proposed a decentralized privacy-preserving healthcare predictive modeling framework on private Blockchain networks. In this framework, privacy-preserving online machine learning is integrated with a private Blockchain network, transaction metadata are used to disseminate partial models, and a new proof-of-information algorithm determines the order of the online learning process. Each participating site contributes to model parameter estimation without revealing any patient health information. Zhang et al. [1] proposed a secure disease prediction scheme based on matrices and SLP, which builds on new medical data encryption, disease learning and disease prediction algorithms that utilize random matrices. Liu et al. [7] proposed a hybrid privacy-preserving clinical decision support system in fog-cloud computing, in which a fog server uses SLP to securely monitor patients' health conditions in real time. The newly detected abnormal symptoms can be further sent to the cloud server for high-accuracy prediction in a privacy-preserving way. Compared with some sophisticated machine learning algorithms such as naïve Bayesian classification, SVM and deep learning classification, SLP is efficient and straightforward.

Conclusions
In this paper, we proposed a privacy-preserving disease prediction system based on SLP, which can help physicians make a proper diagnosis of disease and provide health services for patients anytime and anywhere in a privacy-preserving way. In PPCD, DP's historical medical data are used to train SLP in ED, and the hospital uses the trained model to predict diseases for a UP. To ease the privacy concerns of DP, we adopt an additively homomorphic encryption scheme, which also offers simplicity and generality. The inevitable multiplications in SLP motivated us to introduce LSM into PPCD. As a result, users' medical information and the trained model are kept secret from the cloud. The results reached by PPCD are comparable with those of SLP, which suggests that sacrificing data precision to improve efficiency is feasible in practical use. Although PPCD benefits privacy-preserving diagnosis, the balance between security and efficiency should be considered first. Therefore, how to optimize the model training using mini-batches for efficiency improvement, and how to introduce other advanced machine learning methods to build privacy-preserving disease prediction systems, are worthy of investigation.