Pindex: Private multi-linked index for encrypted document retrieval

Cryptographic cloud storage makes it possible to outsource sensitive and mission-critical data while making optimal use of the cloud storage infrastructure. As the volume of encrypted data outsourced to cloud storage grows, that data must be updated continuously. Attacks such as file injection have been reported to compromise user confidentiality by exploiting information leaked during updates. Dynamic schemes are therefore required to provide forward privacy guarantees: updates should not leak information to the untrusted server about previously issued queries. The challenge is to design an efficient searchable encryption scheme with dynamic updates and forward privacy guarantees. In this paper, a novel private multi-linked dynamic index for encrypted document retrieval, namely Pindex, is proposed. The multi-linked dynamic index is constructed using a probabilistic homomorphic encryption mechanism and secret orthogonal vectors. Full security proofs for correctness and forward privacy in the random oracle model are provided. Experiments on the real-world Enron dataset demonstrate that our construction is practical and efficient. The security and performance analysis of Pindex shows that the dynamic multi-linked index guarantees forward privacy without significant loss of efficiency.


Introduction
Cloud computing has revolutionized data storage by offering effortless storage for individuals, enterprises, governments and institutions. Data owners have flexible, on-demand access to data with simplified data management and significant cost benefits with increased performance. Data owners outsource sensitive data such as Electronic Health Records (EHR), citizens' personal information, emails, credit card data, government information and critical business data to the public cloud [1]. Despite the numerous benefits of cloud storage fuelled by high-speed networking technologies, and although cloud storage providers (CSPs) claim to adopt strong security measures, governments, organizations and businesses are slow to fully embrace public cloud storage due to privacy and security concerns. When data owners outsource sensitive data to the cloud, they lose control over their data, leading to a number of security issues [2]. Data owners encrypt their sensitive information to protect the privacy of data stored with the honest-but-curious cloud storage and to defend against unauthorized access. Encryption restricts searching, which is the only way to access the data. Unless it can easily be indexed, shared, retrieved and utilized, encrypted data serves no purpose. Searchable encryption enables a user to perform efficient keyword search while preserving privacy. Indexes are used to improve search performance by significantly reducing the search complexity. A number of searchable encryption schemes [3][4][5][6] have been proposed, but most of them provide a static mechanism, which hinders their application in practice.
Dynamic searchable encryption allows clients to dynamically update files or keywords without rebuilding the encrypted index. For any searchable encryption scheme to be practical, it must be secure and should allow efficient updates with optimal search time. A few reported works [7][8][9][10][11] support a dynamic index. Nevertheless, most of these solutions leak critical information, since the adversary can observe the interaction between the data user and the CSP and learn from it. This leakage can come from the search pattern, access pattern, update pattern, size pattern, file identifiers containing a specific keyword, trace and trapdoor linkability. In response to such leakages, forward privacy was introduced by Stefanov et al. [12]. Zhang et al. [13] presented a file injection attack on dynamic constructions that works by adding files to the database. There is a trade-off between security and practicality. In addition to data privacy, it is imperative that computational correctness be guaranteed by design, because a malicious server can return incorrect results due to hardware or software malfunction or to save resources. In order to fully utilize the services offered by the cloud server, there is a need for a provably secure dynamic searchable encryption scheme with efficient search, forward privacy and support for parallelism, which is a challenging problem.
From the above discussion, the significant research problems in designing protocols for privacy-preserving search over encrypted outsourced cloud data can be summarized as:
a) an efficient and secure index construction that improves search without reconstructing the index;
b) a privacy-preserving search over encrypted data with minimum allowable leakage of information to the CSP;
c) support for efficient dynamic updates to the encrypted index with minimum allowable leakage and without requiring a trusted platform at the CSP;
d) support for parallel execution of multi-keyword search and update operations;
e) a construction with minimum communication and computation overhead.
Motivated by the challenges summarized above, the main contributions and results of the proposed method, namely PINDEX, are:
1. Secure Dynamic Index: A novel private multi-linked dynamic index construction using probabilistic homomorphic encryption and a secret orthogonal vector as building blocks. Keywords or documents can be added or deleted without reconstructing the outsourced encrypted index, using the hash table that contains the sum of inner products of the rows that are linked by orthogonal vectors.
structure allows multi-keywords search and update operations to run independently over p processors.
4. The results obtained from the security analysis show that the proposed design is secure in the real-ideal security model, efficient with sub-linear time complexity $O(|F_q|)$, efficiently parallelized, and exhibits relatively better performance in the empirical analysis.
The rest of the paper is organized in the order of research methodology adopted which is similar to the standard works [5,8,9,12,14,15]. First, we define the security requirements of the protocol under design in the form of definitions such that it captures thoroughly the adversarial environment and adversary. Such definitions stated as a security model for the protocol to serve as the basis for further design is described in section 2. Second, we describe in section 3 the design of the cryptographic protocol adhering to the security model devised in section 2. Third, we present in section 4, a thorough theoretical correctness and security analysis using standard provable security attack or adversary model to investigate vulnerabilities in the design. Fourth in section 5, we support the theoretical correctness and efficiency with results obtained from empirical simulations. Lastly, we present in section 6 the related works, followed by discussion in section 7 and conclusion in section 8.

Model
This section describes the preliminaries and the security model, which captures the requirements of the protocol under design. The model captures the notions of security and adversary through precise definitions to be used later as the basis of design. Let $F = \{f_1, \ldots, f_n\}$ denote the file identifiers of the documents, given by $f_i = id(C_i)$. A collection of documents D is to be outsourced to the Cloud Service Provider (CSP). The data owner encrypts the document set D to obtain an encrypted document collection C. To enable privacy-preserving search over the encrypted document collection C, a secure encrypted index denoted by γ is built using BUILDINDEX. The encrypted document collection C and the secure encrypted index γ are outsourced to the CSP. An authorised Data User (DU) acquires from the Data Owner (DO) a search token $\tau_q$, generated using SRCHTOKEN for a given query keyword $w_q$. The search token $\tau_q$ is then sent to the CSP using a standard communication protocol. The CSP searches for the documents matching the query keyword using SEARCH over the encrypted index γ while preserving privacy. The SEARCH algorithm outputs the set of file identifiers $F_q$ of the documents that contain the query keyword. The DO updates the index using UPDATE whenever a document changes or files need to be added or deleted. The overall working mechanism of the search is shown in Fig 1. We use the notation $=$ to denote a deterministic assignment, $\leftarrow$ a probabilistically computed assignment and $\xleftarrow{\$}$ a uniformly sampled assignment.

Security definitions
The construction of the PINDEX scheme aims to securely perform computations at the CSP on the encrypted data while preserving privacy and is defined as in [9]:
Definition 1 (PINDEX). PINDEX is a dynamic searchable encryption scheme consisting of a tuple {KGen, $Enc_1$, $Dec_1$, $Enc_2$, $Dec_2$, BUILDINDEX, SRCHTOKEN, SEARCH, UPDTOKEN, UPDATE} of ten algorithms such that:
4. $c \leftarrow Enc_2(pk, m, r)$ is a CPA-secure probabilistic polynomial time additive homomorphic algorithm that outputs a ciphertext c when given as input a public key pk, a message m and randomness r.
5. $m = Dec_2(sk, c)$ is a deterministic polynomial time algorithm that outputs m when given as input a secret key sk and a ciphertext c.
6. $(\gamma, \Gamma) \leftarrow \mathrm{BUILDINDEX}(K, \delta, V, D, Enc_2)$ is a probabilistic polynomial time algorithm that outputs an encrypted index γ and a parameter hash table Γ when given as input a secret K, vector V, primitive $Enc_2$, document collection D and an index δ.
7. $\tau_q \leftarrow \mathrm{SRCHTOKEN}(pk, w_q, \Gamma)$ is a probabilistic polynomial time algorithm that outputs a search token $\tau_q$ when given as input the public key pk, a query keyword $w_q$ and a parameter hash table Γ.
8. $F_q \leftarrow \mathrm{SEARCH}(\gamma, \tau_q)$ is a deterministic polynomial time algorithm that outputs a set of identifiers $F_q \subseteq D$ when given as input an encrypted index γ and a search token $\tau_q$.
9. $\tau_u \leftarrow \mathrm{UPDTOKEN}(pk, f_i, u, w_u)$ is a probabilistic polynomial time algorithm. Given the key pk, an update keyword $w_u$ and a file $f_i$ as input, the algorithm outputs an update token $\tau_u$ for the update type $u \in \{add, delete\}$.
10. $\gamma' \leftarrow \mathrm{UPDATE}(\gamma, \tau_u)$ is a deterministic polynomial time algorithm that outputs an updated encrypted index $\gamma'$ given an encrypted index γ and an update token $\tau_u$.
A scheme as defined above is secure when it is designed based on the real/ideal security paradigm. The paradigm requires defining correctness and privacy. Correctness captures the notion that the polynomial time algorithms specified in Definition 1 produce correct outputs, and it is given as:
Definition 2 (Correctness). Let π be a dynamic searchable encryption scheme as given in Definition 1. Such π is said to be correct if $\forall k \in \mathbb{N}$, $\forall K : K \leftarrow \mathrm{KGen}(1^\lambda)$, $\forall (\delta, f)$, $\forall \gamma : \gamma \leftarrow \mathrm{BUILDINDEX}(K, \delta, V, D, Enc_2)$, $\forall c : c \leftarrow Enc_1(k, f)$ and for all update operations $\mathrm{UPDATE}(\gamma, \tau_u)$, where $\tau_u$ is the update token obtained by $\forall (f, \tau_u) : \tau_u \leftarrow \mathrm{UPDTOKEN}(pk, f_i, w)$ and all $u \in \{add, delete\}$, $\forall (w, \tau_q) : \tau_q \leftarrow \mathrm{SRCHTOKEN}(pk, w, \Gamma)$, $\forall F_q : F_q \leftarrow \mathrm{SEARCH}(\gamma, \tau_q)$, the plaintexts $f_w = \{Dec_1(k, c_i) : i \in F_q\}$ are all plaintexts in f containing keyword w.
Most dynamic searchable encryption schemes leak information which could be exploited to launch attacks, such as file injection attacks, to recover the encryption key. For example, the leakage of the file identifiers returned for a given search or update query cannot be prevented. The leakage function captures the notion of allowable leakage in a dynamic searchable encryption scheme and is defined as in [9]. The definition of privacy or security captures the notion that the view of each party can be separately simulated given the leakage function, and it is defined as:
Definition 4 (Security). Let π be a dynamic searchable encryption scheme as defined in Definition 1. Let A be a stateful adversary, Sim be a stateful simulator and $(L_1, L_2)$ be stateful leakage algorithms in the following experiments:
1. $\mathrm{Real}_A(\lambda)$: The challenger C runs $K \leftarrow \mathrm{KGen}$. The adversary outputs $(\delta, f) \leftarrow A$ and receives $(\gamma, \Gamma) \leftarrow \mathrm{BUILDINDEX}(K, \delta, V, D, Enc_2)$ from C. The adversary then makes a polynomial number of adaptive queries $q \in \{w, f_i\}$. A obtains from C a search token $\tau_q \leftarrow \mathrm{SRCHTOKEN}(pk, w, \Gamma)$ whenever it wants to perform a search query q = w. If A wants to make an update query $q = f_i$, it receives an update token $\tau_u \leftarrow \mathrm{UPDTOKEN}(pk, f_i, w_u)$ from C. A returns an output bit b at the end of the experiment.

2. $\mathrm{Ideal}_{A,Sim}(\lambda)$: Sim outputs (γ, c) to A when given (δ, f) and $L_1(\delta, f)$. A can then make a polynomial number of adaptive queries $q \in \{w, f_i\}$. When A makes a search query q = w, the simulator is given $L_2(\delta, f, w, t)$, and when q is an update query, Sim is given a new $L_2(\delta, f, w, t)$ including the adaptive query history of A. In each case Sim returns an appropriate token τ to A. A then returns an output bit b at the end of the experiment.
The scheme π is said to be $(L_1, L_2)$-secure against adaptive dynamic chosen-keyword attacks if for all PPT adversaries A, there exists a PPT simulator Sim such that the two experiments are distinguishable with at most negligible advantage, that is, $|\Pr[\mathrm{Real}_A(\lambda) = 1] - \Pr[\mathrm{Ideal}_{A,Sim}(\lambda) = 1]| \le \mathrm{negl}(\lambda)$.
A dynamic searchable scheme is considered secure if no information is revealed during the execution of all its operations. However, designing schemes which do not leak any information is a challenge. For example, when a search or update is performed for the keyword w, the CSP can correlate the file identifiers with its query and learn about the outsourced data. Therefore, there is a need for stronger security guarantees for dynamic searchable schemes.
Definition 5 (Forward private). A dynamic searchable encryption scheme which is $(L_1, L_2)$-adaptively secure is said to be forward private if its update leakage $L_{upd}$ for an operation op can be expressed as $L'(op, \{(f_i, \mu_i)\})$, where $L_{upd}$ is the update leakage function, $\{(f_i, \mu_i)\}$ is the set of pairs of updated documents and their numbers of updated keywords, $f_i$ is the document updated, $\mu_i$ is the number of keywords updated on the respective $f_i$, and $L'$ is a stateless function. Definition 5 states that the server should be able to learn only the file identifier(s), the size of the file(s) and the number of keywords contained in the file(s) after observing searches before and after a dynamic update operation $op \in \{\mathrm{ADD}, \mathrm{DELETE}\}$ is performed over the encrypted index.

Homomorphic and orthogonality
The design of dynamic searchable schemes satisfying definitions 4 and 5 more often requires the use of encryption mechanisms called homomorphic cryptosystems as a building block.
Definition 6 (Homomorphic). A public key encryption scheme E = (KGen, Enc, Dec, M, C) is said to be homomorphic if it holds that $Dec(sk, c_1 \otimes c_2) = m_1 \oplus m_2$ for all $m_1, m_2 \in M$ and $c_1, c_2 \in C$ with $m_1 = Dec(sk, c_1)$ and $m_2 = Dec(sk, c_2)$, such that every c obtained from Enc(pk, m) belongs to C.
The proposed PINDEX uses the Paillier cryptosystem, which exhibits the homomorphic addition property, to instantiate the encryption. The group operation $\otimes$ for the Paillier cryptosystem has the homomorphic property $Dec(sk, c_1 \otimes c_2) = m_1 + m_2$ [16]. Let $V = (v_1, v_2, \ldots, v_{|W|})$ be a matrix of mutually orthogonal vectors, where $v_i$ is the i-th row vector and |W| is the number of keywords in the keyword collection W. A square $n \times n$ matrix V with elements ±1 that satisfies $V V^T = n I_n$ is called a Hadamard matrix of order n [17,18]. Such matrices make the cross-correlation values zero, and this property is used in the design of the proposed search and update mechanism of PINDEX.
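The orthogonality property can be checked numerically. The following is a minimal sketch, assuming the Sylvester doubling construction as one convenient way to obtain a Hadamard matrix (the text above does not say which construction PINDEX uses); the function names are illustrative only.

```python
def sylvester_hadamard(order):
    """Return a Hadamard matrix of the given order (a power of two) as rows of +1/-1."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def inner(u, v):
    """Plain integer inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


V = sylvester_hadamard(8)
# Gram matrix V * V^T: n on the diagonal, 0 everywhere else.
gram = [[inner(u, v) for v in V] for u in V]
```

Every pair of distinct rows has inner product 0 and every row has inner product 8 with itself, which is the relation $V V^T = n I_n$ for n = 8.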

Construction
The cloud data storage service involves three entities namely, the Data Owner (DO), Cloud Server or the Cloud Service Provider (CSP) and the Data User (DU). The index is constructed using a novel multi-linked hash map structure. The index construction based on the security model presented in section 2 is given below:

Multi-linked hash map construction
The two-dimensional matrix $M_{mn}$ has its elements stored in a multi-linked hash map δ, where each element $\{m_{ij} : m_{ij} = [next, prev],\ m_{ij} \in M_{mn},\ 1 \le i \le m,\ 1 \le j \le n\}$ has a link to the next non-empty element in the respective row i and column j. The row links are implemented by orthogonal vectors and the respective column links by pointers, as shown in Fig 2. The hash table value δ(i) holds the orthogonally linked row values for all $1 \le i \le m$:
$$\delta(i) = \sum_{j=1}^{n} v_j \cdot m_{ij} \qquad (3)$$
Eq 3 shows that each value is a sum of products of the orthogonal vector $v_j$ and a pointer to the next non-empty value in column j. Each $v_j$ represents a column and all $v_j$ are mutually orthogonal:
$$\langle v_x, v_y \rangle = 0 \quad \text{for all } x \ne y \qquad (4)$$
Hence, assuming y such that y = j,
$$\langle \delta(i), v_y \rangle = m_{ij} \, \langle v_j, v_j \rangle \qquad (5)$$
where $m_{ij}$ by definition is a pointer to the next non-empty element in column j and $\langle \cdot, \cdot \rangle$ denotes the vector inner product. Therefore, all the non-zero elements in column j can be retrieved by applying this step recursively, following each recovered pointer until an empty pointer is reached (Eq 6).
The hash table value δ can be used as an index for retrieving documents containing a given keyword. Let the rows and columns of the matrix M represent documents and keywords respectively. Each element in a column j is a pointer to the next document containing the keyword $w_j$. Each row i is encoded as an orthogonal sum of products of the keyword vector $v_j$ and the next document $f_i$ containing keyword $w_j$, and is stored in a keyword multi-linked hash map as $\delta(f_i)$. Each element $m_{ij} = [f_{next}, f_{prev}]$ represents the next and previous files of $f_i$ containing the keyword $w_j$, and an empty value $m_{ij} = \emptyset$ represents that $w_j \notin f_i$. Therefore, using Eq 6, the set of all documents $F_q \subseteq F$ containing the query keyword $w_q$ can be retrieved in sub-linear time, since for all practical queries $|F_q| \ll |F|$. The multi-linked hash map data structure is designed such that it supports the addition of new documents, which involves encoding a row, adding the encoded value and updating either the last or the first row pointers accordingly. This design improves search to sub-linear time and supports parallel execution of searches and updates.
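The row encoding and the recursive retrieval described above can be sketched on plaintext data. This is an illustrative simplification, not the PINDEX index itself: it stores only the `next` pointer of each element (no `prev`), uses no encryption, encodes the empty pointer as 0, reserves row 0 as the header row, and assumes integer document identifiers starting at 1. All names are hypothetical.

```python
def hadamard(order):
    """Sylvester construction; order must be a power of two."""
    h = [[1]]
    while len(h) < order:
        h = [r + r for r in h] + [r + [-x for x in r] for r in h]
    return h


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_delta(docs, keywords, V):
    """Encode each row as the sum over keywords of v_j * (next document id)."""
    dim = len(V[0])
    rows = [0] + sorted(docs)            # row 0 is the header row
    delta = {}
    for pos, i in enumerate(rows):
        vec = [0] * dim
        for j, w in enumerate(keywords):
            if i != 0 and w not in docs[i]:
                continue                  # empty element: w_j not in f_i
            # Next document after row i that contains w (0 means no such document).
            nxt = next((d for d in rows[pos + 1:] if w in docs[d]), 0)
            vec = [a + b * nxt for a, b in zip(vec, V[j])]
        delta[i] = vec
    return delta


def search(delta, v_q):
    """Follow the column of keyword vector v_q from the header until the pointer is 0."""
    dim = len(v_q)                        # <v_q, v_q> = dim for +1/-1 vectors
    found = []
    f = inner(delta[0], v_q) // dim
    while f != 0:
        found.append(f)
        f = inner(delta[f], v_q) // dim
    return found


keywords = ["cloud", "index", "privacy"]
docs = {1: {"cloud", "index"}, 2: {"index"}, 3: {"cloud"}}
V = hadamard(4)
delta = build_delta(docs, keywords, V)
cloud_docs = search(delta, V[0])
index_docs = search(delta, V[1])
privacy_docs = search(delta, V[2])
```

The traversal touches one row per matching document, so its cost grows with $|F_q|$ rather than with the total number of documents.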

Algorithm $K \leftarrow \mathrm{KGen}(1^\lambda)$
It is a probabilistic polynomial time algorithm which generates the secret parameters used for building index and encryption. The PINDEX algorithm uses Paillier Cryptosystem for building the index and generation of search and update tokens. The key (pk, sk) generation setup is described in [18] where pk denotes the public key and sk denotes the private key of Paillier cryptosystem. The KGen also generates a secret key k to be used by Enc 1 for encrypting the document collection D. The design choice for Enc 1 is the CPA-secure AES while the design choice for Enc 2 is the CPA-secure Paillier cryptosystem which also exhibits additive homomorphic properties.

Algorithm $c \leftarrow Enc_2(K, m, r)$
It is a probabilistic algorithm: given a message $m \in P$ and a public key pk = (n, g), $Enc_2(pk, m, r)$ returns the ciphertext $c = g^m r^n \bmod n^2$, where r is a random value chosen such that $r \in \mathbb{Z}_n^*$, i.e. $\gcd(r, n) = 1$.

Algorithm m = Dec 2 (sk, c)
It is a deterministic algorithm: given a ciphertext $c \in C$ and a private key sk = (p, q, λ), $Dec_2(sk, c)$ returns the message m.
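A toy version of the Paillier primitive illustrates the encryption formula and the additive homomorphism. This is a sketch under stated assumptions: the primes are tiny and insecure, g is fixed to n + 1 (a common simplification), and the private key is stored as (λ, μ) rather than the (p, q, λ) layout named above. The function names are hypothetical.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two small primes (NOT secure sizes)."""
    n = p * q
    n2 = n * n
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^{-1} mod n, with L(x) = (x - 1) / n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def random_unit(n):
    """Sample r with gcd(r, n) = 1."""
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            return r


def enc2(pk, m, r):
    """c = g^m * r^n mod n^2"""
    n, g = pk
    n2 = n * n
    return pow(g, m, n2) * pow(r, n, n2) % n2


def dec2(pk, sk, c):
    """m = L(c^lam mod n^2) * mu mod n"""
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


pk, sk = keygen(293, 433)
n2 = pk[0] ** 2
c1 = enc2(pk, 20, random_unit(pk[0]))
c2 = enc2(pk, 22, random_unit(pk[0]))
# Multiplying ciphertexts modulo n^2 adds the underlying plaintexts.
c_sum = c1 * c2 % n2
m_sum = dec2(pk, sk, c_sum)
```

Because each call draws fresh randomness r, encrypting the same message twice yields different ciphertexts with overwhelming probability, which is what makes the scheme probabilistic.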

Algorithm $(\gamma, \Gamma) \leftarrow \mathrm{BuildIndex}(K, \delta, V, D, Enc_2)$
The BUILDINDEX algorithm, run by the data owner, takes five inputs: the document set $F = \{f_1, f_2, \ldots, f_{|D|}\}$, the keyword set $W = \{w_1, w_2, \ldots, w_{|W|}\}$, a pre-built multi-linked hash map δ for F as described in section 3.1, the mutually orthogonal vectors V and the key K. The algorithm outputs the multi-linked hash map index γ and the keyword parameter hash table Γ. The algorithm computes:
3. Send γ to the CSP; Γ is kept secret at the DO.
The pictorial representation of the index structure is given in Fig 3. The encrypted index (γ, Γ) is designed such that the links can be traversed vertically through the pointers and horizontally through the orthogonality property without leaking critical information.

Algorithm $\tau_q \leftarrow \mathrm{SrchToken}(w_q, pk, V)$
The SRCHTOKEN algorithm is computed by the Data Owner. It takes a query keyword $w_q \in W$, the key pk and the set of mutually orthogonal vectors V as input and outputs a search token $\tau_q$. The algorithm is given below:
1. Get the keyword secret $(v, r) \leftarrow \Gamma(w_q)$ from the keyword parameter hash table Γ and compute $r^{-1} \in \mathbb{Z}_n$ such that $r r^{-1} \equiv 1 \bmod n^2$.
2. Compute $s^{-1} \leftarrow Enc_2(pk, -w_q, r^{-1})$ and return to the DU the search token $\tau_q$ computed as
$$\tau_q \leftarrow v \cdot s^{-1} + v' \cdot r'$$
where $r' \xleftarrow{\$} \mathbb{Z}_n$ and $v' \xleftarrow{\$} V_r$.
The trapdoor $\tau_q$ is designed to contain the encrypted keyword parameters embedded with the corresponding orthogonal vector v. A fresh random term $v' \cdot r'$ is added at each instance so that the token appears indistinguishable even when the same query is repeated.

Algorithm $F_q \leftarrow \mathrm{Search}(\gamma, \tau_q)$
The SEARCH algorithm is computed at the CSP. It takes as input the encrypted index γ, the search token $\tau_q$ and the key pk, and outputs $F_q = \{f_1, f_2, \ldots, f_n\}$, the set of identifiers of the documents containing $w_q$ when $w_q \in W$, and $\bot$ otherwise, where $\bot$ denotes null. The search algorithm is given below:
1. Compute $\langle \gamma(0), \tau_q \rangle$ to obtain the first file pointer $f_q = f_1$ of a document containing the keyword $w_q$.
2. If $f_q = \emptyset$ then there are no documents matching the query keyword $w_q$ or the search token is invalid. The algorithm terminates, returning $\bot$.
3. Otherwise, obtain all $f_q$ by recursively computing $f_q \leftarrow \langle \gamma(f_q), \tau_q \rangle$, with base condition $f_q = f_1$, until $f_q = \bot$.
Thus, the design of the search algorithm is simple and depends on the design of the search token. Since the search token is created with the orthogonally embedded encrypted keyword parameters $(v \cdot s^{-1})$ and the index contains (v, s), an inner product of the two retrieves the corresponding document identifiers. The proof of correctness is presented in section 4.
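The interplay between the index, the token and the inner product can be illustrated end to end. The sketch below is a simplified reading of the construction, not the authors' implementation: it keeps only forward pointers, encodes $\bot$ as 0, maps each keyword to a small integer code (an assumption, stored alongside (v, r) in Γ), uses toy Paillier parameters with g = n + 1, splits one Hadamard matrix into keyword vectors `V_w` and masking vectors `V_r`, and performs all arithmetic modulo $n^2$. All function names are hypothetical.

```python
import math
import random


def hadamard(order):
    h = [[1]]
    while len(h) < order:
        h = [r + r for r in h] + [r + [-x for x in r] for r in h]
    return h


def random_unit(n):
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            return r


def enc2(pk, m, r):
    """Paillier encryption c = g^m * r^n mod n^2 (negative m gives the inverse power)."""
    n, g = pk
    n2 = n * n
    return pow(g, m, n2) * pow(r, n, n2) % n2


def build_index(pk, docs, keywords, V_w):
    n2 = pk[0] ** 2
    # Gamma(w) = (v, r, code); the integer code is an assumption of this sketch.
    Gamma = {w: (V_w[j], random_unit(pk[0]), j + 1) for j, w in enumerate(keywords)}
    rows = [0] + sorted(docs)                     # row 0 is the header gamma(0)
    gamma = {}
    for pos, f in enumerate(rows):
        vec = [0] * len(V_w[0])
        for w in keywords:
            if f != 0 and w not in docs[f]:
                continue
            nxt = next((d for d in rows[pos + 1:] if w in docs[d]), 0)
            v, r, code = Gamma[w]
            s = enc2(pk, code, r)
            vec = [(a + b * s * nxt) % n2 for a, b in zip(vec, v)]
        gamma[f] = vec
    return gamma, Gamma


def srch_token(pk, w, Gamma, V_r):
    n = pk[0]
    n2 = n * n
    v, r, code = Gamma[w]
    s_inv = enc2(pk, -code, pow(r, -1, n2))       # equals s^{-1} mod n^2
    v_mask, r_mask = random.choice(V_r), random.randrange(1, n)
    return [(a * s_inv + b * r_mask) % n2 for a, b in zip(v, v_mask)]


def search(pk, gamma, token):
    n2 = pk[0] ** 2
    dim = len(token)
    found = []
    f = sum(a * b for a, b in zip(gamma[0], token)) % n2 // dim
    while f != 0:                                 # 0 plays the role of the null pointer
        found.append(f)
        f = sum(a * b for a, b in zip(gamma[f], token)) % n2 // dim
    return found


n = 293 * 433                                     # toy modulus, far too small for real use
pk = (n, n + 1)
H = hadamard(8)
V_w, V_r = H[:4], H[4:]
keywords = ["cloud", "index", "privacy"]
docs = {1: {"cloud", "index"}, 2: {"index"}, 3: {"cloud"}}
gamma, Gamma = build_index(pk, docs, keywords, V_w)
cloud_hits = search(pk, gamma, srch_token(pk, "cloud", Gamma, V_r))
index_hits = search(pk, gamma, srch_token(pk, "index", Gamma, V_r))
privacy_hits = search(pk, gamma, srch_token(pk, "privacy", Gamma, V_r))
```

In this sketch the inner product cancels every term except the one for the queried keyword, where $s \cdot s^{-1} \equiv 1 \bmod n^2$ leaves the pointer scaled by the vector length; the masking term vanishes because `V_r` is orthogonal to `V_w`. Two tokens for the same keyword generally differ because of the random mask.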

Algorithm $\tau_u \leftarrow \mathrm{UpdToken}(pk, f, u, w)$
The UPDTOKEN probabilistic algorithm is computed by the DO. The algorithm returns an update token to the CSP for securely updating the index. We define a header node to be the node pointed to by γ(0), or the row with i = 1 in δ, and a terminal node to be the last file containing the keyword.
a. Get $(v, r) \leftarrow \Gamma(w)$ and compute $s \leftarrow Enc_2(pk, w, r)$.
b. When f is the header node, compute the corresponding update token.
c. When f is the terminal node, compute the corresponding update token.
3. When the update operation is ADDKEYWORD $w \in W$ to document f, then compute:
c. Add f as the header node.

Algorithm $\gamma' \leftarrow \mathrm{Update}(\gamma, \tau_u)$
This algorithm takes an index γ, an update token $\tau_u = (\tau_1, \tau_2)$ and the public key pk as input and updates the encrypted index. The updated index $\gamma'$ is computed by the UPDATE algorithm as follows:
1. When the update operation $u \in \{\mathrm{DELETEKEYWORD}, \mathrm{DELETEFILE}\}$, for all $k \in \tau_1$.
2. When the update operation $u \in \{\mathrm{ADDKEYWORD}, \mathrm{ADDFILE}\}$.
The rationale behind the design of UPDATE described above and depicted in Fig 4 is to exploit the orthogonality of V and the homomorphic property of s used in creating the index (γ, Γ). This enables addition and deletion over the encrypted index such that the tokens are indistinguishable, as guaranteed by CPA-security.
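The pointer adjustments behind add and delete can be sketched on an unencrypted chain for a single keyword. The sketch assumes that the update adds $\tau_1$ to and removes $\tau_2$ from the row of $f_{prev}$, with $\tau_1$ carrying the pointer to install and $\tau_2$ the pointer to retire; adjusting the row of f itself is an additional assumption of this example. The ciphertext factor s is omitted, the header is row 0 and the null pointer is encoded as 0. All names are hypothetical.

```python
def hadamard(order):
    h = [[1]]
    while len(h) < order:
        h = [r + r for r in h] + [r + [-x for x in r] for r in h]
    return h


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def scaled(v, k):
    return [k * x for x in v]


def chain(delta, v):
    """List the documents linked for keyword vector v, starting at the header row 0."""
    dim, out = len(v), []
    f = inner(delta[0], v) // dim
    while f != 0:
        out.append(f)
        f = inner(delta[f], v) // dim
    return out


def apply_update(delta, f_prev, tau1, tau2):
    """Row of f_prev <- row + tau1 - tau2."""
    delta[f_prev] = [a + t1 - t2 for a, t1, t2 in zip(delta[f_prev], tau1, tau2)]


def delete_keyword(delta, v, f, f_prev, f_next):
    # tau1 = v * f_next (new target of f_prev), tau2 = v * f (old target of f_prev)
    apply_update(delta, f_prev, scaled(v, f_next), scaled(v, f))
    # Assumption of this sketch: also clear f's own pointer for this keyword.
    delta[f] = [a - b for a, b in zip(delta[f], scaled(v, f_next))]


def add_keyword(delta, v, f, f_prev, f_next):
    # tau1 = v * f (new target of f_prev), tau2 = v * f_next (old target of f_prev)
    apply_update(delta, f_prev, scaled(v, f), scaled(v, f_next))
    # Assumption of this sketch: f now points to f_next for this keyword.
    delta[f] = [a + b for a, b in zip(delta[f], scaled(v, f_next))]


v = hadamard(4)[1]
# Chain for one keyword: header -> 1 -> 2 -> 3 -> null
delta = {0: scaled(v, 1), 1: scaled(v, 2), 2: scaled(v, 3), 3: scaled(v, 0)}
before = chain(delta, v)
delete_keyword(delta, v, f=2, f_prev=1, f_next=3)      # intermediate node
after_delete = chain(delta, v)
add_keyword(delta, v, f=2, f_prev=1, f_next=3)         # re-insert between 1 and 3
after_add = chain(delta, v)
delete_keyword(delta, v, f=1, f_prev=0, f_next=2)      # header node, f_prev is row 0
after_header_delete = chain(delta, v)
```

Because each update only adds and subtracts vectors in one or two rows, rows belonging to other keywords are untouched thanks to orthogonality.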

Security analysis
The privacy guarantees and correctness of the proposed design, PINDEX, with respect to the security model given in section 2, and particularly Definition 4, are analysed in this section. The first formal security model, given by Bellare et al. [19], is a game-based definition in which the adversary is allowed to interact with a set of oracles (encryption and decryption). The definition models communicating parties in a network where the adversary's goal is to distinguish between a correct shared secret and a random value for the given challenge. A similar approach is used to prove the indistinguishability of the allowed information leakage of the protocol PINDEX. Firstly, the correctness of the proposed protocol is proved independently. Secondly, the privacy and correctness with respect to the security model are proved. SEARCH handles only two cases during its operation: $w_q \in W$ and $w_q \notin W$. To prove correctness in each case, the proof follows two lemmas:
Lemma 1. When given a search token $\tau_{w_q}$ for a keyword $w_q \notin W$ and the index γ, the SEARCH algorithm returns $\bot$ with probability 1.
Proof. In the concise representation of the SEARCH algorithm for the above case, $\bot$ is returned whenever $f_l = 0$. Therefore, whenever a query keyword $w_q \notin W$ is searched over the document set, the SEARCH algorithm returns $\bot$ with probability 1.

Lemma 2. When given a search token $\tau_{w_q}$ for a keyword $w_q \in W$ and the index γ, the SEARCH algorithm returns the set of file identifiers $F_q$ with probability 1.
Proof. Similar to Lemma 1, the SEARCH algorithm can be generally represented for $w_q \in W$ as the product $(v_i \cdot Enc_2(pk, w_i, r_i) \cdot f_{z_i} + v_r \cdot r) \cdot (v_q \cdot Enc_2(pk, -w_q, r_q^{-1}) + v_r \cdot r)$, in which, for the matching keyword, $g^{w_i - w_q} = g^0 = 1$ and $r_i^n \cdot (r_q^n)^{-1} = 1$. Thus, a file identifier $f_i$ is obtained by the SEARCH algorithm. Therefore, when the computation is recursively repeated until $f_l = \bot$, the algorithm returns the set of file identifiers $F_q = \{f_i\}_{i = 1, \ldots, |w_q \in W|} = \{f_1, \ldots, f_n\}$ with probability 1.

Lemma 3. The algorithm UPDATE provides correct dynamic index update for a given update token τ u .
Proof. The UPDATE algorithm updates the encrypted index using an update token generated for both the cases addition or deletion of files/keywords.
Case DELETEKEYWORD: The process of deleting a keyword is as follows:
1. When f is the header node file pointer, UPDTOKEN computes $\tau_1 \leftarrow v \cdot s \cdot f_{next}$, $\tau_2 \leftarrow v \cdot s \cdot f$ and returns $\tau_u \leftarrow (\tau_1, \tau_2, f_1 = 0, f_2 = \bot)$.
2. When f is the terminal node, compute $\tau_1 = v \cdot s \cdot \bot$, $\tau_2 = v \cdot s \cdot f$ and return $\tau_u$.
3. When f is neither a header nor a terminal node pointer, determine $f_{next}$ and $f_{prev}$, compute $\tau_1 \leftarrow v \cdot s \cdot f_{next}$, $\tau_2 \leftarrow v \cdot s \cdot f$ and return $\tau_u$.
Eqs 10, 11 and 12 demonstrate how the file pointers are adjusted when the keyword $w_u$ has to be added or removed. The old pointers are removed and new pointers are set from or to the appropriate previous or next file, depending on the case.
Case ADDKEYWORD: The process of adding a keyword can be represented as follows:
1. When f is the first node, compute $\tau_1 \leftarrow v \cdot s \cdot f$, $\tau_2 \leftarrow v \cdot s \cdot f_{next}$ and return $\tau_u \leftarrow (\tau_1, \tau_2, f_1 = 0, f_2 = f)$.
2. When f is the terminal node, compute $\tau_1 \leftarrow v \cdot s \cdot f$, $\tau_2 \leftarrow v \cdot s \cdot \bot$ and return $\tau_u \leftarrow (\tau_1, \tau_2, f_1 = f_{prev}, f_2 = \bot)$.
3. When f is neither a header nor a terminal node pointer, determine $f_{next}$ and $f_{prev}$, compute $\tau_1 \leftarrow v \cdot s \cdot f$, $\tau_2 \leftarrow v \cdot s \cdot f_{next}$ and return $\tau_u$.
Eqs 13, 14 and 15 show how the file pointers are modified while adding the keyword $w_u$. File addition (ADDFILE) and deletion (DELETEFILE) follow a similar modification of file pointers. Hence, the dynamic index update is performed correctly.
To prove the indistinguishability between the real game and the ideal game for any PPT distinguisher Dist, we construct a PPT simulator Sim which uses the leakage functions $L_1$ and $L_2$ to simulate the functionality of the algorithms described in Definition 1 correctly and indistinguishably.
Theorem 1. Given Definition 4 (CKA-2 security), the PINDEX scheme given above is $(L_1, L_2)$-secure in the random oracle model, where $L_1$ leaks the number of keywords, the number of documents, the identifiers of the documents and the size of each document, and $L_2$ leaks the search pattern and the access pattern.
Proof. Let Sim be the simulator which interacts with an adversary A in an execution of $\mathrm{Ideal}_{A,Sim}(k)$ as in Definition 4. We construct (γ, c) given the leakage $L_1(\delta, f)$ as follows:
c. Compute $\gamma(0) \leftarrow \sum_{j=1}^{m} v_j \cdot Enc_2(pk, j, r) \cdot f_x$ such that $S_{ij} \ne 0$ and $x = \sup(X)$, where $X = \{x : S_{yj} \ne 0 \wedge 1 < y \le (n + 1)\}$, $v_j \xleftarrow{\$} V_w$, $v_r \xleftarrow{\$} V_r$ and r is obtained from the random oracle H given j. Store $\delta(j) \leftarrow (v_j, r)$.
d. Compute $\gamma(f_i) \leftarrow \sum_{j=1}^{m} v_j \cdot Enc_2(pk, j, r) \cdot f_x$ for all $1 \le i \le n + 1$ such that $S_{ij} \ne 0$; if $\sup(X) \ne \emptyset$ then $x = \sup(X)$, otherwise $x = \bot$, where $X = \{x : S_{yj} \ne 0 \wedge i < y \le (n + 1)\}$ and $(v_j, r) \leftarrow \delta(j)$.
3. Sim then gives adversary A the encrypted index γ.
4. When Sim receives an adaptive query q for keyword $w_j$, it computes $\tau_q \leftarrow v_j \cdot Enc_2(pk, -j, r^{-1}) + v_r \cdot r'$ and returns $\tau_q$ to the adversary A. A may run the SEARCH function on the obtained token $\tau_q$, and it will obtain exactly the same file identifiers as in the leakage function $L_2$ every time, although the search tokens for the same keyword $w_j$ differ. A token repeats only after $|V_r|$ queries have been made for the same keyword $w_j$.
5. When Sim receives an adaptive update query $q = u = f_i$ and u is an add, the simulator uses $Sim_{Enc_1}$ to simulate $c_i$, assuming that the leakage function $L_1$ reveals the identifier, and performs:
a. Updates the state by adding a row to the state matrix S, where q = n + 2 and each entry $S_{qj} \in \{0, 1\}$ is determined by flipping a coin.
e. A can then use $\tau_u$ with the UPDATE function and obtain the updated encrypted index $\gamma'$.
6. Similarly, Sim can simulate operation ADD and DELETE for keywords and file.
The results returned by Sim to A's random oracle queries are consistent. The keys used in the tokens and the keys used in the construction of the index γ are indistinguishable, since the output of ($Enc_1$, $Enc_2$) is by definition indistinguishable from random. Hence, the CPA-security of ($Enc_1$, $Enc_2$) guarantees that A cannot distinguish between real and simulated encryptions of the files and the index.

Performance analysis
This section presents the performance analysis of the proposed design for its efficiency. PINDEX is implemented in Java and the experimental results are evaluated on a Windows server with Intel Xeon Processor running at 2.30 GHz and 16 GB RAM. The experiment uses real world Enron email dataset [20]. The performance of PINDEX is evaluated and compared with MRSE [21]. The experimental results are shown in Fig 5.

Multi-link index
Building the multi-linked index is a one-time process carried out by the data owner. The multi-linked dynamic searchable index γ is constructed in two steps: first, encrypting each keyword w that belongs to document f using the Paillier cryptosystem, and second, computing the encrypted index γ, which is a multi-link between the document f and the keywords $w_i \in W$. Therefore, the time cost of index generation is proportional to the number of files and the number of keywords contained in each file. Fig 5a shows that, for a varying number of documents, building the index with the proposed system takes less time than with MRSE. This is due to the use of relatively small matrix dimensions in the design and of computationally less intensive operations such as scalar multiplications (|W| operations) and vector additions (|W| + 1 operations).

Search token generation
The search token $\tau_q$ is constructed for the keywords in the query. The process involves retrieving the pair (v, r) from the parameter hash table Γ and encrypting the query keyword $w_q$. The time cost of search token generation depends on the number of query keywords. Fig 5b shows the time cost of generating the search token, on a logarithmic scale, for document collection size n = |D| = 1000 and a varying number of query keywords. PINDEX performs better than MRSE because our scheme uses less computationally intensive operations and the number of operations involved in building a trapdoor is minimal. The operations include two scalar multiplications, one vector addition for indistinguishability and one lookup.

Search
The search operation executed by the cloud server consists of multiplying the search token $\tau_q$ with the encrypted index γ. If $w_q \in W$, the first pointer is computed and the multi-linked index is traversed recursively until $f_q = \bot$. For a multi-keyword search with $|w_q|$ keywords in a query q, run in parallel on p processors and returning $|F_q|$ files, the time complexity is $O(|F_q| / p)$. In MRSE, similarity scores are computed for the whole document set and therefore the time complexity is $O(N)$. The time cost for a varying number of documents with m = |W| = 1000 is shown in Fig 5c and 5d. The search time of PINDEX is relatively better than that of MRSE. Also, as the number of keywords per query increases, the corresponding increase in query time is sub-linear for the proposed design. This is due to the small number of operations in the search design, which involves just one vector inner product per step.
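The independence of per-keyword traversals, which underlies the $O(|F_q| / p)$ bound, can be sketched with a worker pool. This is an illustration on a plaintext chain structure (no encryption, header row 0, null pointer encoded as 0), and it uses Python threads only to show the task decomposition; it is not a claim about how the Java implementation parallelizes or about actual speed-up. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def hadamard(order):
    h = [[1]]
    while len(h) < order:
        h = [r + r for r in h] + [r + [-x for x in r] for r in h]
    return h


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def row(dim, *terms):
    """Build an index row as the sum of pointer * keyword_vector over the given terms."""
    vec = [0] * dim
    for pointer, v in terms:
        vec = [a + pointer * b for a, b in zip(vec, v)]
    return vec


def chain(delta, v):
    """Traverse the linked documents of one keyword vector."""
    dim, out = len(v), []
    f = inner(delta[0], v) // dim
    while f != 0:
        out.append(f)
        f = inner(delta[f], v) // dim
    return out


def multi_keyword_search(delta, vectors, p):
    """Run one independent traversal per query keyword on up to p workers."""
    with ThreadPoolExecutor(max_workers=p) as pool:
        return list(pool.map(lambda v: chain(delta, v), vectors))


V = hadamard(4)
# Keyword 0 links documents 1 -> 3, keyword 1 links documents 1 -> 2.
delta = {
    0: row(4, (1, V[0]), (1, V[1])),
    1: row(4, (3, V[0]), (2, V[1])),
    2: row(4, (0, V[1])),
    3: row(4, (0, V[0])),
}
results = multi_keyword_search(delta, [V[0], V[1]], p=2)
```

Each traversal reads the index without modifying it, so the per-keyword tasks need no coordination beyond collecting their results.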

Update
The update (add and delete) of both keywords and documents involves adding the update token τ_1 and deleting τ_2 from f_prev. The worst case for deleting a file occurs when the number of f_prev entries equals the number of keywords contained in the file to be deleted. Update times are not compared with MRSE, which does not support a dynamic index and must rebuild the index for every update; the results for PINDEX are shown in Fig 5e and 5f. The time cost of adding or deleting a keyword is O(1), and that of adding or deleting a file is O(k/p). A dynamic update consists of building the update trapdoor and executing the update algorithm. The time to build the update trapdoor is small for the same reasons as for the search trapdoor, and the update algorithm involves only vector additions and deletions, as given in Section 3.
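The stated update costs can be illustrated with a plaintext doubly linked chain per keyword. This is only a sketch under assumptions: the `Chain` class, its `prev`/`next` maps (standing in loosely for the role of f_prev) and the helper names are hypothetical, and no encryption or token handling is modelled.

```python
class Chain:
    """Doubly linked chain of file ids for one keyword (plaintext stand-in)."""

    def __init__(self):
        self.head = None
        self.prev = {}
        self.next = {}

    def add(self, fid):
        # O(1): link the new file in front of the current head.
        self.prev[fid] = None
        self.next[fid] = self.head
        if self.head is not None:
            self.prev[self.head] = fid
        self.head = fid

    def delete(self, fid):
        # O(1): splice fid out using its stored neighbours.
        p = self.prev.pop(fid)
        n = self.next.pop(fid)
        if p is None:
            self.head = n
        else:
            self.next[p] = n
        if n is not None:
            self.prev[n] = p

    def files(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur)
            cur = self.next[cur]
        return out

def add_file(index, fid, keywords):
    """k constant-time insertions, one per keyword in the file."""
    for w in keywords:
        index.setdefault(w, Chain()).add(fid)

def delete_file(index, fid, keywords):
    """k constant-time unlinks; the k chains are independent of one another."""
    for w in keywords:
        index[w].delete(fid)

index = {}
add_file(index, "f1", ["budget", "merger"])
add_file(index, "f2", ["budget"])
delete_file(index, "f1", ["budget", "merger"])
print(index["budget"].files())       # ['f2']
```

Each keyword-level operation touches a constant number of pointers, and a file with k keywords touches k independent chains, which is why the per-file work can be divided among p processors.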

Related works
The confidentiality of sensitive data such as Electronic Health Records (EHR) outsourced to a CSP is protected from external and internal attacks by deploying a searchable encryption scheme specially designed to support operations such as search and update over encrypted data [22].

Searchable encryption
Searchable encryption schemes are primarily classified into Searchable Symmetric Encryption and Public Key Encryption with Keyword Search [2], depending on the type of encryption primitive used and the notion of provable security. Searchable encryption systems are further classified by the type of search operation into keyword-based and semantic-based systems [23]. Keyword-based searchable encryption systems support search either through specially designed secure encrypted indexes or through sequential scanning of the encrypted documents [4]. Keyword-based systems perform exact keyword matching, while semantic keyword search also matches the closest query keywords. The proposed design is a keyword-based searchable encryption scheme that uses a secure encrypted index. The first practical solution to symmetric searchable encryption was proposed by Song et al. [3]. It was followed by a number of searchable encryption schemes, notably with secure index construction, improved efficiency, and security definitions and formalizations [4,5,8,9,24-26].

Secure index
To improve performance and decrease search complexity, clients build keyword-based indexes and outsource them to the cloud. Searchable encryption schemes in the literature use a forward, inverted or tree-based index [4,24,27,28]. However, index-based schemes can be made scalable only by rebuilding the index or by using expensive techniques [5]. The index contains information such as document identifiers, document length and the number of keywords in each file. The adversary should not be able to determine the query keyword using statistical analysis. Therefore, the index must not leak information that could help an honest-but-curious CSP to link tokens, keywords and document identifiers. The indexes and the tokens, or trapdoors, are inherently linked. Deterministic trapdoors are known to leak information, so probabilistic trapdoors are used. Goh [4] uses a forward index built from one Bloom filter per document. The search complexity is proportional to the number of files, and Bloom filters have an inherent problem of false positives. Chang and Mitzenmacher [24] use a prebuilt dictionary of distinct keywords to build the forward index. The search complexity is again proportional to the number of files. The inverted index, or per-keyword index, was introduced in [5]; it reduced the search time to sub-linear, giving optimal search time. A scheme based on an inverted index using hash tables was proposed in [26].

Dynamic searchable encryption
Dynamism is a main requirement for any searchable encryption scheme to be practical. The first dynamic searchable encryption scheme was proposed by Kamara et al. [8], using an inverted index with dynamic addition and deletion of documents. The update complexity is linear in the number of updated document/keyword pairs, but the update leaks some information about the documents. Kamara and Papamanthou [9] proposed a red-black tree index based on an encrypted linked-list multi-map. A new keyword is inserted as a new entry in the appropriate list, and deletion uses a dual representation of the index, but updates leak information. Cash et al. [10] proposed a dynamic index that uses a separate update database to add tuples and a revocation list to record deleted ones. However, the update is inefficient and leaks significant information. Naveed et al. [11] presented a dynamic searchable encryption scheme with blind storage. Instead of encrypting the index, the stored blocks are scattered using hashing. This scheme leaks information when a keyword is added. Other dynamic searchable encryption schemes include [7,14,29-31].

Forward privacy
There is a trade-off between practicality and security, as most dynamic searchable encryption schemes leak a significant amount of information about the documents. An attacker can leverage this leakage to reveal information about the client's queries. Deterministic encryption schemes leak information about size, search and access patterns through the repetition of the queried keyword. The file-injection attack on dynamic searchable encryption by Zhang et al. [13] drew much attention to forward-private schemes. By injecting a few carefully selected files, the adversary can recover keywords queried in the past from the information leaked by the search token. Thus, prior search tokens need to be invalidated to achieve forward security. A stronger property called forward privacy, together with an informal definition, was introduced by Stefanov et al. [12] based on Oblivious RAM. However, the scheme incurs a search cost of O(|F_q| log^3 N) and an update cost of O(k log^2 N). Existing schemes that achieve forward privacy are generally based on trapdoor permutations or pseudorandom functions. Bost proposed the first formal definition of forward security and a trapdoor-permutation-based scheme, Sophos [32]. Diana [15], a constrained-PRF-based scheme, was proposed later. A dual dictionary, Dual [33], was proposed that uses a forward and an inverted index simultaneously and achieves forward privacy using fresh keys. To realize forward security, a symmetric puncturable encryption scheme, Janus++ [34], was proposed. Etemad et al. [35] achieve forward privacy by replacing the keys revealed to the server. Sun et al. proposed Aura [36], a non-interactive DSSE scheme using Bloom filters and a multi-puncturable PRF. Wei et al. [37] proposed an index structure with a keyed block chain. Khons [38] uses an inverted index and achieves weak forward security by exploiting a hidden pointer technique.

Discussion
A comparison of our work with existing dynamic schemes is given in Table 1. Our work shows a relative improvement in client and server storage. The client storage is O(n), as the client needs to store δ, the sum of the inner products of the orthogonal vectors for the documents, and Γ, the keyword parameter hash table. The server storage is O(n), as the server stores the encrypted multi-linked index γ that contains the number of documents and the keywords w_i ∈ W. Updates are performed efficiently without rebuilding the index. The search and update tokens are indistinguishable from those of previous queries. The proposed multi-linked index is parallelizable and provides relatively efficient search and update, with costs of O(|F_q|/p) and O(k/p) respectively. The efficient search time is achieved by exploiting the vertical pointers and the orthogonal horizontal linking of the proposed multi-linked encrypted index structure, which enables retrieval of the encrypted keyword index with a single vector inner product. It can be observed from the SRCHTOKEN algorithm that, for each search, the generated token consists of the orthogonal sum of (v_i · s_i^{-1}) and (v_r · r), in which (v_r, r) is chosen uniformly at random for each query so that tokens are indistinguishable even for the same query. Similarly, the UPDTOKEN algorithm generates tokens that are indistinguishable from one another. This token generation strategy makes the tokens, file identifiers and keywords unlinkable, which in turn provides forward privacy as in Definition 5. Moreover, searches for multiple tokens can be performed in parallel on a shared index. This lets the proposed scheme exploit the available parallel processors and improves the execution speed of search operations.
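The algebra behind the orthogonal-sum token can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SRCHTOKEN algorithm: the Hadamard rows used as orthogonal vectors, the per-keyword secrets, the reserved noise vector and all names are hypothetical, the Paillier layer is omitted, and the example demonstrates only the arithmetic, not any security property.

```python
from fractions import Fraction
import secrets

# Rows of a 4x4 Hadamard matrix: mutually orthogonal, each with squared norm 4.
H = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]
NORM2 = 4
KEYWORD_VECS = {"budget": H[0], "merger": H[1], "audit": H[2]}   # v_i (assumed)
NOISE_VEC = H[3]        # reserved v_r, orthogonal to every keyword vector
SECRETS = {"budget": Fraction(3), "merger": Fraction(5), "audit": Fraction(7)}  # s_i

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_index(values):
    """gamma = sum_i s_i * x_i * v_i, where x_i is the value stored for keyword i."""
    gamma = [Fraction(0)] * 4
    for w, x in values.items():
        gamma = [g + SECRETS[w] * x * v for g, v in zip(gamma, KEYWORD_VECS[w])]
    return gamma

def search_token(w):
    """token = s_i^{-1} * v_i / |v_i|^2 + r * v_r with a fresh random r per query."""
    r = secrets.randbelow(10**6) + 1
    scale = 1 / (SECRETS[w] * NORM2)
    return [scale * v + r * n for v, n in zip(KEYWORD_VECS[w], NOISE_VEC)]

gamma = build_index({"budget": 11, "merger": 22, "audit": 33})
print(dot(gamma, search_token("merger")))   # 22, whatever r was drawn
```

The random component r · v_r changes the token on every call, yet it vanishes in the inner product because v_r is orthogonal to every vector used in the index. Token generation here costs one lookup, two scalar multiplications and one vector addition, matching the operation count reported for the search token.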

Conclusion
This paper proposes a novel private multi-linked dynamic index construction for efficient retrieval of encrypted documents with forward privacy guarantees. The multi-linked index is constructed using probabilistic homomorphic encryption and secret orthogonal vectors. Experimental evaluations on the Enron dataset show that our construction achieves optimal search and update cost. PINDEX supports parallelism through the multi-link structure. Thus, the server can distribute the load across its p available processors, with search and update costs of O(|F_q|/p) and O(k/p) respectively. Forward privacy is achieved using probabilistic algorithms for search and update token generation. Further, the client and server storage is reduced to O(n). The security analysis of PINDEX in the random oracle model proves the privacy and correctness of the search and update operations. PINDEX can be used on critical cloud infrastructure for secure and efficient encrypted document retrieval with forward privacy guarantees. As future work, PINDEX can be adapted to provide verifiability and backward privacy.

Notation used in Table 1: n, total number of files; m, total number of keywords; |F_q|, number of files containing multiple keywords; k, number of unique keywords in a file; p, number of processors; N, number of (keyword, file ID) mappings; n_ad and n_d, number of times a keyword has been affected by file deletions since the beginning and since the last search for the same keyword, respectively (n_ad ≥ n_d). https://doi.org/10.1371/journal.pone.0256223.t001