LiPISC: A Lightweight and Flexible Method for Privacy-Aware Intersection Set Computation

Privacy-aware intersection set computation (PISC) can be modeled as secure multi-party computation. The basic idea is to compute the intersection of input sets without leaking privacy. Furthermore, PISC should be sufficiently flexible to recommend approximate intersection items. In this paper, we reveal two previously unpublished attacks against PISC, which can be used to reveal and link one input set to another input set, resulting in privacy leakage. We coin these as Set Linkage Attack and Set Reveal Attack. We then present a lightweight and flexible PISC scheme (LiPISC) and prove its security (including against Set Linkage Attack and Set Reveal Attack).


Introduction
Online social networks (including enterprise social networks such as Yammer) and e-commerce websites are increasingly popular with both individual and organizational users. In these applications, recommendations are a frequently used mechanism by which users and service providers suggest potential friends or commodities of interest to other users. For example, in online social networks such as Tencent QQ or Facebook, service providers recommend potential friends to a user by relying on the intersection set of the friends of that user's friends. More specifically, if B and C are A's friends, then the members of the intersection set of B's friends and C's friends may be A's potential friends. The service provider needs to conduct regular intersection set computation (ISC) and then obtain the set D by excluding those who are already A's friends. Users in set D are then recommended to A as potential friends. Similarly, services or other contacts (e.g., prospective partners in dating apps) are recommended using this (recommendation computation) approach.
Recommendation computation can be generalized as the calculation of an intersection function, where the input is two or more sets and the output is the intersection set of the input sets. We remark that the entity that computes the intersection function (e.g., an online social network service provider) may not always be trustworthy. Hence, the privacy of users may be leaked (or compromised) during the computations. Therefore, privacy-aware computation of intersection sets has attracted the attention of security and privacy researchers (see [1][2][3][4][5][6][7]), and is key to the widespread adoption of online social networks and e-commerce websites.
Most existing studies address specific privacy protection problems in the context of a particular application. Relatively few studies formalize the privacy-aware computation of intersection sets as an abstract problem (i.e., privacy-aware intersection set computation (PISC)) or focus on designing schemes that are lightweight and flexible. This is the gap that we seek to address in this paper. We also reveal two previously unpublished security attacks against PISC, namely: set reveal attack and set linkage attack.
In this paper, we propose a lightweight and flexible PISC (hereafter referred to as LiPISC) scheme, which satisfies the following properties:
1. A basic requirement for privacy protection is for intersection computation to be conducted over input sets with concealed members instead of plain members. This is more difficult than traditional ISC.
2. An enhanced requirement for privacy protection is unlinkability. In other words, computation entities cannot infer the existence of the same items from the input sets received in two computation operations.
3. A basic requirement for flexibility is the capability to compute a rough set. That is, PISC can return an approximate intersection set even when no exactly matching common members exist in the input sets.
4. Lightweight and scalable for large-scale applications.
5. Security against set reveal attack, set linkage attack, and other common attacks.
We then prove the security of LiPISC. The remainder of this paper is organized as follows. Section 2 describes related work. In Section 3, we present the basic assumption and models used in the paper. Section 4 details our proposed LiPISC scheme and the security proof. Finally, Section 5 concludes this paper.

Related Work
Ensuring the privacy of recommendations in social networks is a current research focus (see [8][9][10][11][12][13][14][15]). Wicker and Schrader [8] surveyed the philosophical, legal, moral, and epistemological literature on privacy in information networks, and introduced privacy-aware design principles. In the same year, Pentafronimos, Karantjias and Polemi [9] identified privacy requirements in collaborative workspaces, and suggested a number of guidelines for privacy-aware identity and access management systems. Attempting to fulfill the privacy requirements in online social networks, Akcora, Carminati and Ferrari [10] proposed a privacy risk measure to provide users with a way of estimating risk (via the use of a Facebook prototype application). In 2013, Shoaran, Thomo and Weber-Jahnke [11] studied privacy issues in big data. They used graphs to model big data and proposed privacy-aware release of graph summarization using zero-knowledge privacy. Similarly, Vidyalakshmi et al [12] proposed a privacy-aware information dispersal method in social networks, where they employed a supervised learning model to assist users in spotting unintended audiences for a post. A year later in 2014, Li [13] posited that users should be able to indicate different comfort levels for sharing their friendships in social networks, and that this can be done by limiting the number of their friends returned in response to queries through the friend search engine. The same work also proposed a new attack model that may infer more friendship information based on the friendships detected in the query results, and defined a model to safely display friends of individual users in response to queries. However, these studies only focused on specific applications, rather than seeking to solve the underlying challenge using a generalized approach.
Privacy protection of recommendations in e-commerce systems has also been the subject of extensive research. For example, Bunea et al [16] presented a privacy-aware collaborative filtering recommender framework, where they used a probabilistic matrix factorization technique to mitigate sparsity and a dynamic privacy inference model to ensure privacy. In the dynamically personalized recommendation algorithm of Tang and Zhou [17], information in ratings and profile contents was used, without a strong focus on privacy protection. He, Ren and Zhang [18] presented an approach to increase the conversion rate from browsing to buying. Their method uses a behavior-based inference of a customer's propensity to purchase from a product category. Celdran et al [19] proposed a middleware to provide users with custom context-aware recommendations. However, these studies neither focused on privacy protection nor presented a more general method that can be deployed in a wide range of applications. We also observe that there is no suggestion that PISC can be modeled as secure multi-party computation.
The PISC problem has been addressed using different approaches in the literature (see [1], [3][4][5][6][7]). For example, Shao, Yang and Yu [1] used searchable encryption to achieve private set intersection, in which public key encryption with multiple keywords search is used as the basic tool. Zhao and Luo proposed a two-party private set intersection protocol based on a negative database [3]. The negative database is a new technique for preserving privacy, and it stores information in the complementary set of a traditional database. The security foundation of this technique is that reversing the negative database to recover the corresponding database is NP-hard. In the study of how to extract common sensitive information from encrypted sets, Liu et al [4] argued that existing methods are not suitable for cloud deployment. Hence, they designed the Encrypted Set Intersection Protocol (ESIP), which allows the server and users to perform collaborative operations to obtain the correct set intersection in a privacy-preserving manner. In an attempt to solve privacy-preserving intersection of regular languages instead of finite sets, Guanciale, Gurov and Laud [5] proposed an approach based on minimal deterministic finite automata. Wang, Zhu and Luo [6] proposed a scheme that allows any entity to publicly verify the correctness of a set intersection query, without requiring any secret key. More recently in 2015, Thapa et al [7] used asymmetric social proximity to design different private matching protocols that provide different privacy levels. We observe that the literature rarely requires a PISC solution to be both lightweight and flexible. In the big data era, being lightweight and flexible are two critical properties for PISC [20,21].

Network Model
There are three entities in the intersection set computation, namely: an entity A, an entity B, and a computation server C. Upon receiving the sets from A and B, C will compute and return the intersection sets to A and B. For privacy protection, A and B conduct their respective computations to conceal the original set prior to sending to C. C conducts the computation of PISC over the concealed sets and returns information on the intersection of original sets.
We denote the set from A to C, the set from B to C, and the intersection set from C to A and B as $Set_A$, $Set_B$, and $Set_C$, respectively. Thus, the basic logic flow in the network model can be simplified as follows:
Msg1) $A \rightarrow C : Set_A$
Msg2) $B \rightarrow C : Set_B$
Msg3) $C \rightarrow A, B : Set_C$
where $x \rightarrow y : z$ denotes a message $z$ being sent from entity $x$ to entity $y$.
It is worth noting that $Set_A$ and $Set_B$ are concealed sets of the original sets, and $Set_C$ can be mapped back to the intersection of the original sets only at A and at B upon receipt.

Attack Model
We assume that the communication channels between entities and the server are protected by standard security mechanisms (e.g., encryption and integrity protection) at the link layer (e.g., IEEE 802.11i and CDMA); thus, attackers who can sniff packets are beyond the scope of this paper. In this paper, we focus on adversaries that cause privacy leakage at the server side. In this context, we point out the following potential attacks, and we denote the adversary as Adv.
Definition Set Reveal Attack ($Adv_{sra}$). The adversary reveals the privacy of entity A and entity B from observing $Set_A$ and $Set_B$. Roughly speaking, $H(x \mid Set_A, Set_B) < H(x)$, where $H$ is the entropy function and $x$ is any information about A or B.
In other words, entity A and entity B present $Set_A$ and $Set_B$ to server C for the discovery of the intersection members. As server C is not entirely trustworthy, the presented sets should not reveal the privacy of A or B. Thus, $Set_A$ and $Set_B$ must be transformed from the original sets into concealed sets before the intersection computation. That is, $Set_A$ and $Set_B$ as presented to server C must not leak the privacy of entity A and entity B, respectively. Following the traditional reductionist approach [22][23][24], the adversary we consider in this paper has an upper bound on its computational capabilities (i.e., it is a probabilistic polynomial-time Turing machine, PPTM).
$Adv_{sra}$ captures the basic security notion, and security against it can be achieved by transformations that guarantee computational secrecy. However, we observe that if the sets presented to untrustworthy servers are "linkable", this may also compromise the privacy of entity A (or B). This is because the adversary at the untrustworthy server can discover that entity A (or B) has exhibited the same behavior more frequently than others; the interests or preferences of entity A (or B) can then be inferred by the adversary (e.g., purchasing certain merchandise such as milk powder more frequently than others). We coin such an attack against PISC as a Set Linkage Attack.
Definition Set Linkage Attack ($Adv_{sla}$). The adversary can successfully guess that at least one item in $Set_A$ (or $Set_B$) in one round is the same as an item in a previous round. Roughly speaking, $\Pr\{\text{Find } x : x \in Set_A, x \in Set'_A\}$ is non-negligible, where $Set_A$ and $Set'_A$ are in two distinct rounds. For example, after the adversary observes $Set_A$ in one round and $Set'_A$ in another round, the adversary can link one member in $Set_A$ to another member in $Set'_A$. That is, the adversary can successfully guess that the same member exists in both $Set_A$ and $Set'_A$, which can compromise the privacy of that particular entity in some situations.
A more relaxed notion is to assume that server C is semi-trustworthy (also known as an honest-but-curious adversary). That is, the computation on server C for intersection set computation is (mathematically) trustworthy (i.e., correctly computed), but the adversary on server C may seek to infer the privacy of entity A and entity B by observing $Set_A$ and $Set_B$. Here, the computation functionality for the intersection set has to be trustworthy as a prerequisite; otherwise, we will not be able to achieve privacy-aware intersection set computation.
Thus, the privacy requirements are stated as follows:
Definition $Pri^{Adv_{sra}, Adv_{sla}}_{PISC}$. In the presence of $Adv_{sra}$ and $Adv_{sla}$ at a semi-trustworthy server C with a computational bound, the intersection set of $Set_A$ and $Set_B$ uploaded by entity A and entity B can be correctly computed without compromising the privacy of entity A and entity B. (Note that the intersection set is for the original sets, rather than the concealed sets.)

Design Goals
The design of PISC should also consider flexibility, as $Set_A$ and $Set_B$ may not have an exact intersection set. In other words, it should be possible to compute a rough intersection set even when members of $Set_A$ and $Set_B$ are only approximately, rather than exactly, equal. This flexibility is particularly helpful in social network and e-commerce applications, as an approximate intersection set is adequate for the required functionalities (e.g., identifying similar goods that may be of interest to an e-commerce user). Flexibility is also necessary when an exact intersection set for the uploaded sets does not exist.
Achieving flexible PISC in a lightweight manner is also important, particularly due to the scale of data involved and the real-time nature of such computations. The cost of a single PISC computation significantly influences the scalability of the proposed method.
Therefore, the design goals are lightweight and flexible PISC.

Abstract Model
LiPISC consists of the following functions:
1) $f_{conceal}$. It takes as input the original set, $Set_{AO}$, and outputs a concealed set, $Set_{AC}$. That is, $f_{conceal}: Set_{AO} \Rightarrow Set_{AC}$, where $Set_A \Rightarrow Set_B$ means that $\forall x \in Set_A$, compute $y = f_{conceal}(x)$ and include $y$ in $Set_B$.
2) $f_{intersect}$. It takes as input two sets, $Set_{AC}$ and $Set_{BC}$, and outputs a concealed intersection set, $Set_{IC}$. That is, $f_{intersect}: (Set_{AC}, Set_{BC}) \Rightarrow Set_{IC}$.
3) $f_{reveal}$. It takes as input a set, $Set_{IC}$, and outputs an original intersection set, $Set_{IO}$. That is, $f_{reveal}: Set_{IC} \Rightarrow Set_{IO}$.
Therefore, in LiPISC, $f_{conceal}$ should defend against $Adv_{sra}$ and $Adv_{sla}$.
Proof $f_{conceal}$ is the only transformation of the original set before the observation of the adversary. After this transformation, the adversary at C can observe the transformation result of $f_{conceal}$. Thus, $f_{conceal}$ should defend against $Adv_{sra}$ and $Adv_{sla}$. □
$f_{conceal}$ should also make it possible to find the corresponding $f_{intersect}$ and $f_{reveal}$. That is, the corresponding $f_{intersect}$ and $f_{reveal}$ should be easy to find and should perform efficiently after the transformation of $f_{conceal}$. Thus, $f_{conceal}$ is the most important of the three functions. In addition, all three functions must be lightweight in terms of computation. Therefore, finding a proper $f_{conceal}$ has the highest priority in the design.
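The data flow between the three functions can be sketched as follows. This is only an illustrative composition, not the authors' implementation: the helper names `run_pisc`, `placeholder_conceal`, and `exact_intersect` are hypothetical, the reveal step is assumed to be a lookup in a table that each entity keeps locally, and the placeholder concealment (string reversal) is deliberately insecure and stands in for whichever $f_{conceal}$ is eventually chosen.

```python
def run_pisc(set_ao, set_bo, f_conceal, f_intersect):
    """Compose f_conceal, f_intersect and f_reveal for two entities A and B."""
    # Entities A and B conceal their members locally and remember the mapping
    # (assumption: f_reveal is realised as a lookup in this local table).
    table_a = {f_conceal(x): x for x in set_ao}
    table_b = {f_conceal(x): x for x in set_bo}

    # Server C only ever receives the concealed members.
    set_ic = f_intersect(set(table_a), set(table_b))

    # f_reveal at A and at B: map the concealed result back to original members.
    set_io_at_a = {table_a[y] for y in set_ic if y in table_a}
    set_io_at_b = {table_b[y] for y in set_ic if y in table_b}
    return set_io_at_a, set_io_at_b


def placeholder_conceal(x):
    """Hypothetical stand-in for f_conceal; string reversal is NOT secure."""
    return x[::-1]


def exact_intersect(set_ac, set_bc):
    """f_intersect as a plain set intersection over concealed members."""
    return set_ac & set_bc


result_a, result_b = run_pisc(
    {"ann", "ben"}, {"ben", "cyd"}, placeholder_conceal, exact_intersect
)
print(result_a, result_b)
```

Keeping the three functions as separate parameters mirrors the point made above: once a suitable $f_{conceal}$ is fixed, the matching $f_{intersect}$ and $f_{reveal}$ follow from it.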

Basic Construction
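The basic construction is described as follows:
1) Entity A:
1.1) $f_{conceal}: Set_{AO} \Rightarrow Set_{AC}$. Let $f_{conceal} = Hash(\cdot): \{0,1\}^m \rightarrow \{0,1\}^n$, $m, n \in \mathbb{N}$. That is, $Set_{AC} \Leftarrow Hash(Set_{AO})$.
2) Entity B:
2.1) $f_{conceal}: Set_{BO} \Rightarrow Set_{BC}$, computed in the same way as step 1.1).
3) Server C: computes $f_{intersect}$ over the received $Set_{AC}$ and $Set_{BC}$ by comparing their concealed members, and returns the result to A and B.
Remarks
1) $f_{conceal}$ is pre-deployed or negotiated in advance at A and B, or distributed by C instantly and publicly. Even though $f_{conceal}$ is publicly known, $Adv_{sra}$ is still defended against due to the cryptographic properties of Hash(·).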
2) Note that Hash(·) could be any function that meets the dedicated requirements, although a computationally efficient cryptographic hash is usually preferred. Next, we define the one-way and collision resistance requirements.
Proof $f_{conceal}$ should be collision resistant, which is required not for security but for the soundness of the resulting intersection set returned from C. If so, the probability that $y_1 \in Set_{AC}$ and $y_2 \in Set_{BC}$ are equal while $x_1 \in Set_{AO}$ and $x_2 \in Set_{BO}$ are not equal will be negligible. □
Note that $f_{conceal}$ need not be second pre-image resistant, because server C knows no pre-image of any member of $Set_{AC}$ or $Set_{BC}$.
Proof Roughly $2^{m-n}$ distinct values of $x$ map to the same $y \in Set_{AC}$. Thus, $y$ can be the same even when $x$ is different. The false probability of the basic construction is $1/2^{m-n}$. □
3) To further improve efficiency, only the last $k$ bits ($k < |Hash(\cdot)| = n$) of Hash(·) may be used during comparison, but this may introduce extra false members into the intersection set with false probability $1/2^{m-k}$.
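The basic construction and Remark 3 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: SHA-256 is assumed as Hash(·), $f_{reveal}$ is assumed to be a lookup in a table kept locally by each entity, truncation is done in whole bytes instead of bits for brevity, and the function names and sample members are hypothetical.

```python
import hashlib


def conceal(original_set, k_bytes=None):
    """f_conceal: map each member x to Hash(x), optionally truncated.

    Returns a dict {concealed value: original member}. Only the keys are
    sent to server C; the dict itself stays with the entity.
    """
    table = {}
    for x in original_set:
        digest = hashlib.sha256(x.encode("utf-8")).digest()
        if k_bytes is not None:
            digest = digest[-k_bytes:]  # keep only the trailing bytes
        table[digest] = x
    return table


def intersect(set_ac, set_bc):
    """f_intersect at server C: plain comparison of concealed members."""
    return set(set_ac) & set(set_bc)


def reveal(set_ic, table):
    """f_reveal at an entity: look concealed members up in the local table."""
    return {table[y] for y in set_ic if y in table}


set_ao = {"alice", "bob", "carol"}
set_bo = {"bob", "carol", "dave"}

table_a, table_b = conceal(set_ao), conceal(set_bo)
set_ic = intersect(table_a.keys(), table_b.keys())
set_io = reveal(set_ic, table_a)
print(sorted(set_io))

# Remark 3: comparing only the trailing bytes is cheaper, but the concealed
# intersection may then contain extra (false) members.
loose_ic = intersect(
    conceal(set_ao, k_bytes=2).keys(), conceal(set_bo, k_bytes=2).keys()
)
```

With truncation, every true common member still appears in the concealed intersection; the trade-off is only that unrelated members may collide on the shortened value.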
The privacy strength of the basic construction is as follows:
Proposition 4.5 The basic construction can defend against $Adv_{sra}$ but not $Adv_{sla}$.
Proof As $f_{conceal}$ is one-way, the adversary can reveal neither $Set_{AO}$ from $Set_{AC}$ nor $Set_{BO}$ from $Set_{BC}$. Thus, $Adv_{sra}$ is defended against. However, the same $y \in Set_{AC}$ ($Set_{BC}$) in different rounds can be linked to the same $x \in Set_{AO}$ ($Set_{BO}$). Thus, $Adv_{sla}$ is not defended against. □
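The linkage weakness stated in Proposition 4.5 can be illustrated with a short sketch. It assumes SHA-256 as Hash(·) and uses hypothetical purchase items; the point is only that a deterministic concealment yields the same concealed value for the same item in every round.

```python
import hashlib


def conceal(original_set):
    """Deterministic f_conceal of the basic construction (assumed SHA-256)."""
    return {hashlib.sha256(x.encode("utf-8")).hexdigest() for x in original_set}


# The same entity uploads concealed sets in two distinct rounds.
round_1 = conceal({"milk powder", "diapers"})
round_2 = conceal({"milk powder", "coffee"})

# Server C cannot invert the hash, but it can still see that one concealed
# member repeats across the two rounds.
linked = round_1 & round_2
print(len(linked))
```

The server never learns which item repeats, yet it learns that something does, which is exactly the information $Adv_{sla}$ targets.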

An Enhanced Basic Construction
To defend against $Adv_{sla}$, we propose the following enhancement, which adds a random number to be used only once (i.e., a nonce) in $f_{conceal}$ so that the concealed members are unlinkable when the adversary observes $Set_{AC}$ (or $Set_{BC}$). More specifically, the enhancement is at steps 1.1) and 2.1), each with the addition of the nonce. The enhanced steps 1.1) and 2.1) are as follows:
1.1) $f_{conceal}: Set_{AO} \Rightarrow Set_{AC}$. Let $f_{conceal} = Hash(\cdot)$. That is, $Set_{AC} \Leftarrow Hash(nonce_A \| Set_{AO})$, where Hash(·) could be a cryptographic hash function and $nonce_A$ is a number used once. In other words, $\forall x \in Set_{AO}$, $y \Leftarrow Hash(nonce_A \| x)$, $y \in Set_{AC}$.
2.1) Analogously, $\forall x \in Set_{BO}$, $y \Leftarrow Hash(nonce_B \| x)$, $y \in Set_{BC}$.
Remarks
1) $nonce_A$ and $nonce_B$ could be timestamps, in which case A and B need to be synchronized in advance.
2) $nonce_A$ and $nonce_B$ could be a value from a pseudorandom number generator, which is generated from a shared secret key (seed) and synchronized at A and B. That is, $nonce = PRNG(k)$, where $k$ is a shared secret key between A and B (e.g., generated using a key establishment protocol [25]) and PRNG(·) is a pseudorandom number generator.
3) $nonce_A$ and $nonce_B$ could be a synchronized counter.
Proposition 4.6 The enhanced basic construction can defend against $Adv_{sra}$ and $Adv_{sla}$.
Proof The enhanced basic construction inherits its defense against $Adv_{sra}$ from the basic construction. The discussion, thus, concentrates only on $Adv_{sla}$. As $nonce_A$ and $nonce_B$ vary in each round, the same $y \in Set_{AC}$ ($Set_{BC}$) in different rounds usually cannot be identified by the adversary due to the collision resistance property of $f_{conceal}$. Thus, the linkage cannot be drawn and $Adv_{sla}$ is defended against. □
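A minimal sketch of the enhanced basic construction is given below. Several choices are assumptions made for illustration only: SHA-256 stands in for Hash(·), the nonce is derived from a shared key and a synchronized round counter using HMAC-SHA256 as a stand-in for PRNG(k), A and B use the same nonce value within a round so that common members still match, and the key, items, and function names are hypothetical.

```python
import hashlib
import hmac


def derive_nonce(shared_key: bytes, round_counter: int) -> bytes:
    """Per-round nonce from a shared key and a synchronized counter
    (HMAC-SHA256 is an assumed stand-in for PRNG(k))."""
    message = round_counter.to_bytes(8, "big")
    return hmac.new(shared_key, message, hashlib.sha256).digest()


def conceal(original_set, nonce: bytes):
    """Enhanced f_conceal: y = Hash(nonce || x); the table stays local."""
    return {
        hashlib.sha256(nonce + x.encode("utf-8")).digest(): x
        for x in original_set
    }


shared_key = b"hypothetical key shared by A and B"
set_ao = {"milk powder", "diapers"}
set_bo = {"milk powder", "coffee"}

# Round 1: both entities use the same nonce, so common members still match.
nonce_1 = derive_nonce(shared_key, 1)
a1, b1 = conceal(set_ao, nonce_1), conceal(set_bo, nonce_1)
common_1 = {a1[y] for y in a1.keys() & b1.keys()}
print(common_1)

# Round 2: a fresh nonce changes every concealed value, so server C finds
# no repeated member between A's uploads in round 1 and round 2.
nonce_2 = derive_nonce(shared_key, 2)
a2 = conceal(set_ao, nonce_2)
repeated = a1.keys() & a2.keys()
print(len(repeated))
```

A timestamp or a plain counter, as in Remarks 1 and 3, could replace `derive_nonce` without changing the rest of the sketch.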

Advanced Construction with Rough Intersection for Flexibility
In a real-world deployment, such as online social networks and e-commerce websites, entity A and entity B are unlikely to have the same number of members, or exactly the same members, in the submitted sets. Although an exact intersection of $Set_{AO}$ and $Set_{BO}$ does not exist, server C is still required to return an approximate intersection of $Set_{AO}$ and $Set_{BO}$. In this scenario, server C has to recommend the closest approximate intersection of $Set_{AO}$ and $Set_{BO}$ from $Set_{AC}$ and $Set_{BC}$. The basic construction and the enhanced basic construction are not able to address such a situation. Thus, we propose an advanced method that can satisfy this requirement.
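The improvements are as follows: in $f_{conceal}$, Hash(·) is replaced by $G(\cdot): g^x \bmod p$, where $p$ is a sufficiently large prime, and in $f_{intersect}$, the comparison is changed into the computation of $(g^{x_1})/(g^{x_2}) \approx g^{\delta}$. More specifically, the method is described as follows:
1) Entity A:
1.1) $f_{conceal}: Set_{AO} \Rightarrow Set_{AC}$. Let $f_{conceal} = g^x \bmod p$, where $x \in Set_{AO}$ and $p$ is a large prime. That is, $Set_{AC} \Leftarrow G(Set_{AO})$, where $G(Set_*)$ means to compute $f_{conceal}(x)$ for $\forall x \in Set_*$. In other words, $Set_{AC} = \{y \mid \forall x \in Set_{AO}, y = g^x \bmod p\}$.
2) Entity B:
2.1) $f_{conceal}: Set_{BO} \Rightarrow Set_{BC}$, computed in the same way as step 1.1).
3) Server C: for members $Set_{AC}[i]$ and $Set_{BC}[j]$, computes $Set_{AC}[i]/Set_{BC}[j]$ and compares the result with $g^{\delta} \bmod p$ to select the approximate intersection, which is returned to A and B.
Remarks
1) To reduce the computation overhead at entity A and entity B, in computing $G(\cdot)$, A and B need only use the last $k$ bits of $x$ and $y$, $\forall x \in Set_{AO}$ and $\forall y \in Set_{BO}$, where $\log_2 x = \{0,1\}^m = \{0,1\}^{m-k} \| \{0,1\}^k$ and $\log_2 y = \{0,1\}^n = \{0,1\}^{n-k} \| \{0,1\}^k$. This will not compromise the soundness of PISC if $k$ is properly chosen. We state the selection method in the following proposition.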
Proposition 4.7 If the absolute gap value between $x \in Set_{AO}$ and $y \in Set_{BO}$ is $\|x - y\|$, we let $k = \log_2 \log_g \|x - y\|$.
Proof $x \in Set_{AO}$, $g^x \bmod p \in Set_{AC}$. $y \in Set_{BO}$, $g^y \bmod p \in Set_{BC}$. $g^x / g^y$ can be computed via $g^{m+(x-m)} / g^{m+(y-m)}$, where $\log_g (x - m) < 2^k$ and $\log_g (y - m) < 2^k$. Thus, computing with only the last $k$ bits of $x$ and $y$ will not result in any difference for $g^x / g^y$ at server C. □
2) $\delta$ is a system parameter, which measures the approximation strength of the intersection set. More specifically, we state it formally in the following proposition.
Proposition 4.8 If the absolute gap value between $x \in Set_{AO}$ and $y \in Set_{BO}$ is $\|x - y\|$, the intersection set produced by PISC will have $\|x - y\| \le \delta$.
Proof $x \in Set_{AO}$, $g^x \bmod p \in Set_{AC}$. $y \in Set_{BO}$, $g^y \bmod p \in Set_{BC}$. Server C selects the intersection set by $g^x / g^y < g^{\delta}$; thus $\|x - y\| \le \delta + k\varphi(p) = \delta + k(p - 1)$, where $\varphi(\cdot)$ is Euler's totient function and $k \in \mathbb{N}$. Since $p - 1 \gg \delta$, $x \le p - 1$, and $y \le p - 1$, it follows that $\|x - y\| \le \delta$. □
From the above two propositions, we can choose the trailing $k = \log_2 \log_g \|x - y\| = \log_2 \log_g \delta$ bits of $\log_2 x$, $\forall x \in Set_{AO}$, or of $\log_2 y$, $\forall y \in Set_{BO}$, to compute $Set_{AC}$ and $Set_{BC}$, respectively.
3) The computation overhead at server C is one modular exponentiation (for computing $g^{\delta} \bmod p$) and one division (for computing $Set_{AC}[i]/Set_{BC}[j]$). The computation overhead at an entity is one modular exponentiation (for computing $g^k \bmod p$, where $k$ is at most $\log_2 \log_g \delta$ bits).
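The advanced construction can be sketched as follows. This is one possible reading rather than the authors' implementation: the comparison against $g^{\delta}$ is realised here as a membership test of the ratio $Set_{AC}[i]/Set_{BC}[j]$ in the set $\{g^d \bmod p : |d| \le \delta\}$, the division is a multiplication by a modular inverse, and the modulus, base, threshold, and sample members are toy values chosen only for illustration (a real deployment would need vetted group parameters).

```python
P = 2**127 - 1  # toy modulus (a Mersenne prime); hypothetical choice
G = 3           # toy base; hypothetical choice
DELTA = 2       # system parameter delta: largest gap still counted as a match


def conceal(original_set):
    """f_conceal: y = g^x mod p; the table {y: x} stays with the entity."""
    return {pow(G, x, P): x for x in original_set}


def rough_intersect(set_ac, set_bc, delta=DELTA):
    """f_intersect at server C, working on concealed members only."""
    # Assumed realisation of the comparison: the ratio must equal g^d mod p
    # for some gap d with |d| <= delta.
    allowed = {pow(G, d, P) for d in range(-delta, delta + 1)}
    pairs = []
    for ya in set_ac:
        for yb in set_bc:
            ratio = (ya * pow(yb, -1, P)) % P  # ya / yb in the group mod p
            if ratio in allowed:
                pairs.append((ya, yb))
    return pairs


set_ao = {1000010, 1000025, 1000040}
set_bo = {1000012, 1000031, 1000060}

table_a, table_b = conceal(set_ao), conceal(set_bo)
pairs = rough_intersect(table_a.keys(), table_b.keys())

# f_reveal at A and B: map concealed matches back through the local tables.
revealed = {(table_a[ya], table_b[yb]) for ya, yb in pairs}
print(revealed)
```

In this example no member is shared exactly, yet the pair whose gap is within $\delta$ is still returned, which is the flexibility the advanced construction aims for. Note that `pow` with a negative exponent and a modulus requires Python 3.8 or later.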
Proposition 4.9 The advanced construction can defend against $Adv_{sra}$, but not $Adv_{sla}$.
Proof As $f_{conceal}$ is one-way due to the underlying discrete logarithm problem, $Adv_{sra}$ is defended against. As $f_{conceal}$ is a deterministic function, the same image will link to the same pre-image; thus, $Adv_{sla}$ cannot be defended against. □
Proposition 4.10 In the advanced construction, $f_{conceal}$ is collision resistant.
Proof The aim is to prove that it is difficult to find $x \ne y$ such that $g^x \equiv g^y \pmod p$. That is, it is difficult to find $x \ne y$ such that $g^{x-y} \bmod p = 1$. If $g^{x-y} \bmod p = 1$, then $(x - y) \bmod (p - 1) = 0$ by Fermat's theorem (taking $g$ to be a generator of the multiplicative group modulo $p$). As $x - y < \delta$ and $p$ is a large prime, it is computationally infeasible to have $(x - y) \bmod (p - 1) = 0$. □
Proposition 4.11 The advanced construction is sound.
Proof This follows from the previous propositions: $f_{conceal}$ is one-way and collision resistant, and the returned intersection sets indeed contain the approximate members. Thus, the advanced construction is sound. □
Finally, to defend against $Adv_{sla}$, we propose an enhancement of the advanced construction that adds a nonce in $f_{conceal}$. More specifically, the enhancement is at steps 1.1) and 2.1), each multiplying by the nonce:
1.1) $f_{conceal} = nonce \cdot g^x \bmod p$, where $x \in Set_{AO}$ and $p$ is a large prime. That is, $Set_{AC} \Leftarrow G(Set_{AO})$, where $G(Set_*)$ means to compute $f_{conceal}(x)$ for $\forall x \in Set_*$. In other words, $Set_{AC} = \{y \mid \forall x \in Set_{AO}, y = nonce \cdot g^x \bmod p\}$.
The discussion on the nonce is similar to the remarks in the previous section. The security proof after the inclusion of the nonce is similar to that of Proposition 4.6. We have now reached the final proposed version of PISC, which is the enhancement of the advanced construction. This incremental presentation is intended to aid understanding.
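The nonce-enhanced advanced construction can be sketched in the same style. The assumptions are again illustrative only: A and B are assumed to share the same per-round nonce (so that it cancels in the ratio computed by server C), the nonce is derived by hashing a shared key with a round counter as a stand-in for PRNG(k), the ratio test against $\{g^d \bmod p : |d| \le \delta\}$ is one possible realisation of the comparison, and all parameters, names, and sample members are hypothetical toy values.

```python
import hashlib

P = 2**127 - 1  # toy modulus (a Mersenne prime); hypothetical choice
G = 3           # toy base; hypothetical choice
DELTA = 2       # system parameter delta


def derive_nonce(shared_key: bytes, round_counter: int) -> int:
    """Per-round nonce in [1, P-1], assumed to be shared by A and B."""
    data = shared_key + round_counter.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % (P - 1) + 1


def conceal(original_set, nonce: int):
    """Enhanced f_conceal: y = nonce * g^x mod p; the table stays local."""
    return {(nonce * pow(G, x, P)) % P: x for x in original_set}


def rough_intersect(set_ac, set_bc, delta=DELTA):
    """f_intersect at server C; a shared nonce cancels in the ratio ya / yb."""
    allowed = {pow(G, d, P) for d in range(-delta, delta + 1)}
    return [
        (ya, yb)
        for ya in set_ac
        for yb in set_bc
        if (ya * pow(yb, -1, P)) % P in allowed
    ]


shared_key = b"hypothetical key shared by A and B"
set_ao = {1000010, 1000025}
set_bo = {1000012, 1000060}

# Round 1: rough matching still works on the nonce-multiplied values.
n1 = derive_nonce(shared_key, 1)
a1, b1 = conceal(set_ao, n1), conceal(set_bo, n1)
revealed = {(a1[ya], b1[yb]) for ya, yb in rough_intersect(a1.keys(), b1.keys())}
print(revealed)

# Round 2: with a fresh nonce, A's concealed members differ from round 1.
n2 = derive_nonce(shared_key, 2)
a2 = conceal(set_ao, n2)
repeated = a1.keys() & a2.keys()
print(len(repeated))
```

The sketch only shows that individual concealed values change from round to round while matching within a round is preserved; it is not a substitute for the security argument referenced above.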

Potential Applications and Further Discussions
1) Example I. In social networks that recommend prospective friends, both users B and C are user A's friends; thus, the members of the intersection set of B's friends and C's friends may also be A's friends. The service provider needs to perform PISC on two inputs, B's friends and C's friends, and those inputs should be concealed for privacy protection.
2) Example II. For recommending goods of potential interest on an e-commerce website, B's purchase history and C's purchase history have overlapping items with A's purchase history (i.e., A, B, and C have similar purchasing habits and preferences); thus, the intersection set of B's purchases and C's purchases may also interest A. The service provider needs to perform PISC on two concealed inputs, B's purchase history and C's purchase history. Since we require the purchase histories of both B and C to be unlinkable, the service provider is unable to guess whether the same item exists in B's or C's purchase history.
3) In LiPISC, A and B have to establish a shared nonce, which may require synchronization at both A and B. Alternatively, the nonce can also be distributed by servers, which does not break the security of the scheme, as the servers will not know the nonce due to the one-wayness of $f_{conceal}$.
4) As stated in Section 3 (i.e., Problem Formulation), the adversary (trust) model in this paper only assumes an adversary at a central server. That is, entity A and entity B in PISC are trustworthy. Nonetheless, even if A (or B) were untrustworthy, A (or B) could only reveal the intersection set with B (or A) by random guessing due to the one-wayness of $f_{conceal}$. The probability of a successful guess is negligible due to the selection of $f_{conceal}$.
5) The gap in numeric distance represents the numeric deviation of input values, and the gap in Hamming distance represents the property deviation. The former gap can be computed from the latter if the major difference comes from the trailing bits, and the latter can also be estimated from the former. In this paper, we focus on the numeric distance, which is more general than the Hamming distance. In addition, the proposed method is suitable for both situations, as only rough intersection set computation is required here.
6) The proposed scheme outperforms related work in several aspects: it is lightweight, as only raw discrete logarithm computation is involved instead of cryptographic encryption; it can compute a rough intersection set, which is flexible in realistic settings; and it tackles the set linkage attack, which has not been carefully explored in the literature.

Conclusion
In this paper, we introduced two new attacks against PISC, which we coined as set reveal attack and set linkage attack. We then proposed a lightweight and flexible PISC scheme (LiPISC), which achieves both exact and rough (approximate) intersection set computation in a lightweight manner. We then proved the security of LiPISC.
Future work includes deploying the proposed scheme in a real-world application, such as an e-commerce website, with the aim of refining and validating the scheme.

Author Contributions
Conceived and designed the experiments: WR KKRC. Performed the experiments: WR SH. Analyzed the data: WR. Contributed reagents/materials/analysis tools: WR YR. Wrote the paper: WR YR KKRC.