A Combinatorial Model of Malware Diffusion via Bluetooth Connections

We outline here the mathematical expression of a diffusion model for cellphone malware transmitted through Bluetooth channels. In particular, we provide the deterministic formula underlying the proposed infection model, in its two equivalent expressions: recursive (simple but computationally heavy) and closed form (more complex but efficiently computable).


Introduction
The spreading of malware, i.e., malicious self-replicating code, has rapidly grown in the last few years, becoming a substantial threat to wireless devices, and mobile (smart)phones nowadays represent the most attractive present and future target. Papers studying the problem from both theoretical and technical points of view have appeared in the literature since 2005 [1][2][3][4][5][6][7][8][9], and a number of different approaches to modeling virus diffusion are already available to the community. With the present work we contribute to this topic by proposing a more accurate model for the spread of malware through the Bluetooth channel, providing two equivalent deterministic formulations of the described solution, one recursive and one combinatorial.

The Model
The dynamics of the proposed model are the following: at a certain time $t$, a number $I$ of infected mobiles $b_1, \ldots, b_I$ come in contact with a number $S$ of clean (non-infected) cellphones $w_1, \ldots, w_S$; hereafter we will denote this configuration as $(I,S)$.
All $S+I$ telephones are in the Bluetooth transmission range of each other and they all have their Bluetooth device on. Each infected mobile tries to establish a connection with another device, clearly not knowing whether it is trying to pair to a clean or to an infected phone. All these connections are established instantaneously at time $t$. However, for the sake of simplicity we assume that the infected mobiles establish connections following a given sequence, starting from $b_1$ down to $b_I$. In other words, $b_1$ is the first to try to establish a connection, $b_I$ is the last one. Moreover, each connection is chosen uniformly at random among all available choices. Connections between infected and clean mobiles deterministically result in infection transmission: when a clean mobile gets paired to an infected one, it becomes infected. All these events occur in the time interval $[t, t+\Delta t]$, where $\Delta t$ is the minimal time allowing all infected mobiles to establish a connection and possibly transmit the virus: in practice, it may be considered of the order of a few tens of seconds. We assume that in this time interval clean cellphones do not try to establish any connections, e.g., for non-malware purposes. We also assume that in this time interval no other mobile enters the Bluetooth transmission range of the $S+I$ mobiles and that, when a connection between two mobiles is established, the two mobiles remain connected for the whole time interval. Basically, we are assuming that the initial configuration $(I,S)$ is given and that it does not change in the time interval $[t, t+\Delta t]$. Note that, given the definition of $\Delta t$, new infections do not result in configuration changes in the time interval $[t, t+\Delta t]$.
All the aforementioned assumptions are reasonably realistic, due to the very short time-scale considered.
The task here is to discover the probability that, in this situation, a given clean mobile gets paired to an infected one, and thus it becomes itself infected.
Summarizing, the setup and the constraints of the model are the following:

Setup
$I$ infected mobiles $b_1, \ldots, b_I$ and $S$ clean mobiles $w_1, \ldots, w_S$ are in a room (i.e., in the Bluetooth transmission range of each other).

Dynamics
Starting from $b_1$ down to $b_I$, each infected mobile tries to connect with a yet unconnected device, regardless of whether it is infected or not.

Constraint #1
Since the connection channel is Bluetooth, once a connection between two mobiles is established, these two devices become unavailable to further connection, or, in other words, each device can have at most one connection to another cellphone.

Constraint #2
For each $t = 1, \ldots, I$, when it is $b_t$'s turn to choose, $b_t$ must connect to one of the still available devices, if any.
Let us consider the generic configuration $(I,S)$ with $I$ unpaired infected mobiles $b_1, \ldots, b_I$ and $S$ unpaired clean mobiles $w_1, \ldots, w_S$. According to the setup, the first mobile establishing a connection is $b_1$. In Fig. 1 a possible evolution is displayed starting from an initial configuration with $I = 7$ infected and $S = 5$ clean mobiles, together with an explanatory description of the occurring dynamics.
Due to the described dynamics, all the infected mobiles succeed in pairing, with the exception of at most one $b_z$, which can remain unpaired if there are no more available mobiles. This case can only happen when there are more infected mobiles than clean ones, their sum is odd and all the clean mobiles get paired:

$$I > S, \qquad I + S \ \text{odd}, \qquad I = S + 2j + 1,$$

where $j$ is the number of pairings between two infected mobiles. Hence, the last choosing infected mobile $b_z$ cannot find any available device to pair to. In what follows, we will refer to this case as the case $-$; an example of this situation in the initial configuration $(7,2)$ is shown in Fig. 2. The model is completely described by computing the probability $P(I,S)$ that a certain clean mobile, for instance $w_1$, gets infected in the time interval $[t, t+\Delta t]$. Although $P(I,S)$ could be stochastically approximated by running repeated simulations, in the following Sections we will derive two equivalent exact (deterministic) formulae for $P(I,S)$ in the aforementioned setup. The former is a simple recursive expression, which follows straightforwardly from the model dynamics, while the latter is its corresponding closed form (thus with no recursion involved), which has a more complex expression and relies heavily on combinatorics. Besides their different mathematical nature, the two formulae also behave differently from a computational point of view, as discussed in a dedicated Section.
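The stochastic approximation mentioned above can be illustrated with a minimal Monte Carlo sketch of the pairing dynamics. The function names, the device labels, the number of trials and the seed are all assumptions made for illustration; they do not come from the original implementation.

```python
import random

def simulate_once(I, S, rng):
    """Run one round of the pairing dynamics; return True if w_1 gets paired."""
    # Hypothetical labels: ("b", k) is an infected mobile, ("w", k) a clean one.
    free = [("b", k) for k in range(1, I + 1)] + [("w", k) for k in range(1, S + 1)]
    for k in range(1, I + 1):
        chooser = ("b", k)
        if chooser not in free:
            continue  # b_k was already chosen by an earlier infected mobile
        options = [d for d in free if d != chooser]
        if not options:
            break  # the "-" case: no device is left to pair with
        target = rng.choice(options)  # uniform choice among available devices
        if target == ("w", 1):
            return True
        free.remove(chooser)
        free.remove(target)
    return False

def estimate_infection_probability(I, S, trials=20000, seed=0):
    """Fraction of simulated rounds in which w_1 ends up infected."""
    rng = random.Random(seed)
    hits = sum(simulate_once(I, S, rng) for _ in range(trials))
    return hits / trials

print(estimate_infection_probability(7, 5))
```

The estimate fluctuates with the seed and the number of trials, which is why exact formulae are preferable when available.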

The Recursive Formula
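Recursively, the probability $P(I,S)$ that a given susceptible mobile $w_t$ gets infected starting from a given initial configuration $(I,S)$ can be written as follows:

$$P(I,S) = \frac{1}{I+S-1} + \frac{S-1}{I+S-1}\,P(I-1,S-1) + \frac{I-1}{I+S-1}\,P(I-2,S), \qquad (1)$$

where the trivial conditions $P(0,S) = 0$, $P(I,0) = 0$ and $P(1,S) = 1/S$ initialize the recursion, thus covering all possible cases. Since all clean mobiles share the same probability $P(I,S)$ of getting infected, without loss of generality we may assume $w_t = w_1$. When $b_1$ chooses among the $I+S-1$ other devices, three cases can occur:
1. $b_1$ establishes a pairing with $w_1$. This event occurs with probability $\frac{1}{I+S-1}$, and $w_1$ gets infected.
2. $b_1$ establishes a pairing with one of the other $S-1$ clean mobiles. This event occurs with probability $\frac{S-1}{I+S-1}$, and $w_1$ does not get infected by $b_1$. However, $w_1$ may be infected later by the remaining $I-1$ available infected phones (with only $S-1$ clean mobiles still available, because one clean mobile has been infected by $b_1$), thus falling back to a $(I-1,S-1)$ configuration.
3. $b_1$ establishes a pairing with one of the other $I-1$ unpaired infected mobiles $b_2, \ldots, b_I$. This event occurs with probability $\frac{I-1}{I+S-1}$, and of course $w_1$ does not get infected by $b_1$. However, similarly to the previous situation, $w_1$ may be infected later by the remaining $I-2$ unpaired infected phones, thus falling back to a $(I-2,S)$ configuration.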
A worked-out example illustrating the construction of Eq. 1 is shown in Fig. 3. The formula in Eq. 1 for $P(I,S)$ is a second-order recursive equation with non-constant coefficients, for which no general method is known to derive the corresponding non-recursive (closed) expression. Moreover, as detailed in a later Section, calculating $P(I,S)$ by using Eq. 1 is computationally heavy. However, we will obtain the equivalent time-saving closed form solution in the next Section using combinatorial arguments.
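As an illustration, the recursion can be transcribed almost literally into code. The sketch below assumes the three-term form of the recursion described in this Section, with the initial conditions $P(0,S) = 0$, $P(I,0) = 0$ and $P(1,S) = 1/S$. Exact rational arithmetic and memoization (`lru_cache`) are choices made here for convenience; they are not claimed to match the implementation whose timings are reported later, and the function name is hypothetical.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def p_recursive(I, S):
    """Exact P(I, S) from the recursion, returned as a Fraction."""
    if I == 0 or S == 0:
        return Fraction(0)
    if I == 1:
        return Fraction(1, S)
    n = I + S - 1  # number of devices b_1 can choose from
    return (Fraction(1, n)                                  # b_1 pairs with w_1
            + Fraction(S - 1, n) * p_recursive(I - 1, S - 1)  # b_1 pairs with another clean mobile
            + Fraction(I - 1, n) * p_recursive(I - 2, S))     # b_1 pairs with an infected mobile

print(p_recursive(2, 2))  # 2/3
print(p_recursive(3, 2))  # 5/8
```

Both printed values can be checked by hand by listing the few possible choices of $b_1$ and of the following infected mobiles.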

The Combinatorial Formula
To construct the explicit formula equivalent to Eq. 1, we need a few combinatorial considerations. The key observation is that we can count all wirings (lists of pairings) that can occur at the end of the pairing process. Clearly, the order in which the connections between the mobiles are set up heavily influences the probability that a given wiring occurs: in particular, this probability depends on the number $j$ of pairings between infected mobiles (bb-pairings, for short). As background material, we recall some definitions and results from combinatorics in the box in Fig. 4, together with the two following functions: the Heaviside step function $H(x)$, equal to 1 for $x \geq 0$ and to 0 for $x < 0$, and the Kronecker delta function $\delta_{x,y}$, equal to 1 for $x = y$ and to 0 otherwise. These functions allow an indicator function defined by cases to be rewritten in an equivalent formulation without conditional expressions, possibly together with mod, the Euclidean remainder function, so that $x \bmod 2$ is zero for even $x$ and one for odd $x$.

Figure 3. The general case of $P(I,S)$ in Eq. 1. If $b_1$ pairs to $w_1$, then $w_1$ gets infected and we are done. In blue, the case when $b_1$ pairs to one of the remaining $I-1$ infected mobiles $b_t$, with probability $\frac{I-1}{I+S-1}$; then $b_1$ and $b_t$ become unavailable for pairing with the following choosing mobile $b_2$, and we move to the case of computing the probability that $w_1$ gets infected when there are $I-2$ unlinked infected mobiles and $S$ clean ones, i.e., $P(I-2,S)$. Finally, in orange, the case when $b_1$ pairs to one of the other $S-1$ clean mobiles $w_t$ (with $w_t \neq w_1$), with probability $\frac{S-1}{I+S-1}$; then $b_1$ and $w_t$ become unavailable for pairing with the following choosing mobile $b_2$, and we move to the case of computing the probability that $w_1$ gets infected when there are $I-1$ unlinked infected mobiles and $S-1$ unlinked clean ones, i.e., $P(I-1,S-1)$.
Suppose now we start from an initial configuration $(I,S)$. In terms of the quantities introduced in the Lemmas of this Section, the (non-recursive) closed form expression equivalent to Eq. 1 for the probability $P(I,S)$ that a given susceptible mobile $w_t$ gets infected in a given initial configuration $(I,S)$ is Eq. 2. Eq. 2 is rooted in the following counting argument (Eq. 3): the probability that a given clean mobile $w_t$ gets infected is the sum, over all admissible values of $j$, of all possible wirings with $j$ bb-pairings weighted by the probability that a wiring with exactly $j$ bb-pairings occurs. The sum starts from $L(I,S)$, the minimum number of bb-pairings that can be established in an initial configuration $(I,S)$. The rationale for summing over the number of bb-pairings to compute $P(I,S)$ is that the probability that $w_t$ gets infected depends on the number of available infected mobiles that will pair with clean mobiles, which is exactly the number of infected mobiles that are not already paired to another infected mobile, i.e., that are not involved in a bb-pairing.
In particular, the three terms between brackets in Eq. 2 match respectively the three factors in Eq. 3, while the term between double brackets ($[[\ ]]$, to enhance readability) corresponds to $N_{-}(I,S,h)$.
In what follows we will show that the expansion of the right-hand member of Eq. 3 coincides with Eq. 2. The expansions of all terms will be carried out first by separately considering all occurring cases, and then by providing a unique closed form formula (without conditional expressions) using the Heaviside step and the Kronecker delta functions.

Lemma 1
Given an initial configuration $(I,S)$, the minimum number $L(I,S)$ of bb-pairings in a wiring is the following:

$$L(I,S) = \begin{cases} 0 & \text{if } I \leq S, \\ \left\lfloor \frac{I-S}{2} \right\rfloor & \text{if } I > S, \end{cases}$$

while the maximum number is $\left\lfloor \frac{I}{2} \right\rfloor$.
In fact, while when $I \leq S$ it is possible not to have any bb-pairing, when $I > S$ there cannot be fewer than $\frac{I-S}{2}$ or $\frac{I-S-1}{2}$ of them, respectively when $I-S$ is even or odd. This is due to Constraint #2, which imposes that an infected mobile $b_t$ must connect to another device whenever one is available, when it is its turn to choose.
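The admissible range of $j$ described in Lemma 1 is easy to compute. The following is a minimal sketch; the function name is hypothetical.

```python
def bb_pairing_bounds(I, S):
    """Return (minimum, maximum) number of bb-pairings for configuration (I, S)."""
    # With I <= S every infected mobile can pair with a clean one.
    # With I > S the surplus infected mobiles must pair among themselves.
    lower = 0 if I <= S else (I - S) // 2
    upper = I // 2
    return lower, upper

print(bb_pairing_bounds(7, 2))  # (2, 3)
print(bb_pairing_bounds(7, 5))  # (1, 3)
```

For the configuration $(7,2)$ the lower bound 2 corresponds to the $-$ case: two clean mobiles get paired, four infected mobiles form two bb-pairings, and one infected mobile remains unpaired.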

Lemma 2
Given a $(I,S)$ configuration, let $P(I,S,j)$ denote the probability that a wiring with exactly $j \geq 0$ bb-pairings between two infected mobiles occurs. When there are $j$ bb-pairings in the admissible range, all possible wirings depend on the choice of $j$ infected devices $b$ and $I-2j$ clean devices $w$, i.e., $I-j$ elements from the original set of $I+S$. The first element has probability 1
The idea is that all the $I-h$ infected mobiles $b_{h+1}, \ldots, b_I$ must be part of a bb-pairing, so they must be connected to one of the $b_1, \ldots, b_{h-1}$. Once they have been chosen, the remaining $j-(I-h)$ bb-pairings must be selected among the mobiles $b_1, \ldots, b_{h-1}$ that are yet unpaired. Both considerations can be expressed in terms of combinations using the definitions and the properties of Fig. 4.
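The statement underlying Lemma 2, namely that the probability of a wiring depends only on how many pairings it contains, can be checked by brute force on small configurations. The sketch below enumerates every outcome of the sequential pairing process with its exact probability. The product form in `wiring_probability` is our reading of the dynamics (the $k$-th mobile that chooses sees $I+S-2k+1$ available devices) and is not a transcription of the formula in the lemma; all names are hypothetical. Outside the $-$ case a wiring with $j$ bb-pairings contains $I-j$ pairings.

```python
from fractions import Fraction

def enumerate_wirings(I, S):
    """Yield (pairs, probability) for every possible outcome of the pairing process."""
    def step(k, free, pairs, prob):
        if k > I:
            yield pairs, prob
            return
        chooser = ("b", k)
        options = [d for d in free if d != chooser]
        if chooser not in free or not options:
            # b_k is already paired, or nothing is left to choose from
            yield from step(k + 1, free, pairs, prob)
            return
        for target in options:
            remaining = [d for d in options if d != target]
            yield from step(k + 1, remaining, pairs + [(chooser, target)],
                            prob / len(options))

    devices = [("b", k) for k in range(1, I + 1)] + [("w", k) for k in range(1, S + 1)]
    yield from step(1, devices, [], Fraction(1))

def count_bb(pairs):
    """Number of pairings whose chosen device is an infected mobile."""
    return sum(1 for _, target in pairs if target[0] == "b")

def wiring_probability(I, S, n_pairs):
    """Probability of any single wiring made of n_pairs pairings (our reading)."""
    prob = Fraction(1)
    for k in range(1, n_pairs + 1):
        prob /= (I + S - 2 * k + 1)
    return prob

for pairs, prob in enumerate_wirings(3, 2):
    print(count_bb(pairs), len(pairs), prob)
```

The enumeration grows very quickly with $I$ and $S$, so it is only meant as a sanity check.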

Lemma 4
In the $(I,S)$ configuration, consider the number of all possible ways to select $j$ bb-pairings. Apart from the $-$ case, selecting $j$ bb-pairings is equivalent to consecutively choosing $j$ unordered pairs $b_r|b_s$ from the original set of $I$ infected mobiles. The first pair can be chosen in $|C(I,2)|$ ways, the second pair in $|C(I-2,2)|$ ways, and so on. The division by $|P(j)|$ is motivated by the fact that the particular ordering in which the $j$ pairs are chosen is irrelevant: the list $b_1|b_2, b_3|b_4, b_5|b_6$ is indistinguishable from the list $b_5|b_6, b_1|b_2, b_3|b_4$. The number of these different orderings is precisely $|P(j)|$, by definition of permutations. In the $-$ case, if $j = 0$ there is only one way to choose 0 bb-pairings, while if $j = 1$ the unpaired infected mobile can only be $b_I$, so from $|C(I,2)|$ we have to subtract the case where the only bb-pairing involves $b_I$, which is impossible. Finally, in the $-$ case with $j \geq 2$ the unpaired infected mobile can be any $b_h$ with $S+1 \leq h \leq I$, and the total number of cases (which coincides with the number of cases where $w_t$ is selected, since all the clean mobiles are connected in these situations) is the sum of all cases with $h = S+1, \ldots, I$.
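Outside the $-$ case, the counting argument of Lemma 4 can be written down directly and compared with an explicit enumeration. This is a minimal sketch with hypothetical function names; it requires Python 3.8 or later for `math.comb`.

```python
from itertools import combinations
from math import comb, factorial

def count_bb_selections(I, j):
    """Ways to pick j disjoint unordered pairs among I infected mobiles (outside the "-" case)."""
    ordered = 1
    for i in range(j):
        ordered *= comb(I - 2 * i, 2)  # choose the (i+1)-th pair among those left
    return ordered // factorial(j)     # the order of the j pairs is irrelevant

def count_bb_selections_brute(I, j):
    """Same count by explicit enumeration; only practical for small I."""
    all_pairs = list(combinations(range(1, I + 1), 2))
    count = 0
    for chosen in combinations(all_pairs, j):
        used = [b for pair in chosen for b in pair]
        if len(used) == len(set(used)):  # the pairs must not share a mobile
            count += 1
    return count

print(count_bb_selections(6, 3))        # 15
print(count_bb_selections_brute(6, 3))  # 15
```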

Lemma 5
In the $(I,S)$ configuration with $j$ bb-pairings, let $N(I,S,j,w_t)$ denote the number of all possible cases when a particular $w_t$ is chosen. Its expression follows immediately from the cardinality equations in Fig. 4, in particular from the fact that among all combinations of $M$ objects in groups of $T$ elements, a particular element is selected exactly $|C(M-1,T-1)|$ times. When $I$ is even and $j = \frac{I}{2}$ we follow the convention $\binom{A}{B} = 0$ for $A > 0$, $B < 0$. In the $-$ case, since all the non-infected mobiles are selected, the possible ways to select them are exactly their permutations. This completes the expansion of Eq. 3 into Eq. 2. Equivalence between the recursive and the closed formula can be proven by showing that Eq. 2 satisfies the recursive relations of Eq. 1. The analytical proof of the equivalence involves working out a large number of cumbersome identities of binomial coefficients and factorials: in the last Section, we will briefly sketch the proof in the simple case $I = S \in 2\mathbb{Z}$. Numerically, the differences between the two formulae are below machine precision for $1 \leq I,S \leq 50$.
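A numerical comparison of this kind can be sketched on a restricted class of configurations. The code below is not a transcription of Eq. 2: it implements only the counting argument of Eq. 3 for configurations in which the $-$ case cannot occur ($I \leq S$, or $I+S$ even), using our reading of the three factors (probability of one wiring, number of bb-selections, number of assignments that reach $w_1$). All function names are hypothetical, and `math.comb` and `math.perm` require Python 3.8 or later.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial, perm

@lru_cache(maxsize=None)
def p_recursive(I, S):
    """Exact P(I, S) from the three-term recursion."""
    if I == 0 or S == 0:
        return Fraction(0)
    if I == 1:
        return Fraction(1, S)
    n = I + S - 1
    return (Fraction(1, n)
            + Fraction(S - 1, n) * p_recursive(I - 1, S - 1)
            + Fraction(I - 1, n) * p_recursive(I - 2, S))

def wiring_terms(I, S):
    """Yield (probability of one wiring, number of bb-selections, m) for each admissible j."""
    if I > S and (I + S) % 2 == 1:
        raise ValueError('the "-" case needs the extra terms of the full closed form')
    lower = 0 if I <= S else (I - S) // 2
    for j in range(lower, I // 2 + 1):
        m = I - 2 * j                  # infected mobiles that pair with clean ones
        prob = Fraction(1)
        for k in range(1, I - j + 1):  # one factor per established pairing
            prob /= (I + S - 2 * k + 1)
        n_bb = factorial(I) // (factorial(m) * 2**j * factorial(j))
        yield prob, n_bb, m

def p_counting(I, S):
    """P(I, S) as a sum over j, counting only the wirings in which w_1 is paired."""
    return sum((prob * n_bb * comb(S - 1, m - 1) * factorial(m)
                for prob, n_bb, m in wiring_terms(I, S) if m >= 1), Fraction(0))

def total_probability(I, S):
    """Sum over j of all wirings weighted by their probability; should equal one."""
    return sum((prob * n_bb * perm(S, m) for prob, n_bb, m in wiring_terms(I, S)),
               Fraction(0))

print(p_counting(4, 4), p_recursive(4, 4), total_probability(4, 4))
```

With exact rational arithmetic the two values agree exactly rather than up to machine precision, and the total probability is exactly one.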
We conclude the Section with the observation that the total number of cases, weighted by their corresponding probabilities and summed over all admissible values of $j$, adds up correctly to one.

Lemma 6
In the $(I,S)$ configuration with $j$ bb-pairings, let $N_w(I,S,j)$ denote the number of all possible ways to select the remaining clean mobiles for pairing. Apart from the $-$ case, when there are $j$ bb-pairings, $I-2j$ infected mobiles remain to be connected with $I-2j$ clean devices. This is equivalent to computing the number of possible sets of $I-2j$ elements from an initial set of $S$ clean mobiles: since here the ordering matters, this is the definition of dispositions (see Fig. 4) of $I-2j$ elements from an original set of $S$, whose number is $\frac{S!}{(S-I+2j)!}$.
Note that, since in the $-$ case all the clean mobiles are selected, the two quantities $N_w(I,S,j)$ and $N(I,S,j,w_t)$ coincide.
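Outside the $-$ case, the number of dispositions used in Lemma 6 can be computed with `math.perm` (Python 3.8 or later) and checked against an explicit enumeration. This is a small sketch with hypothetical function names.

```python
from itertools import permutations
from math import perm

def count_clean_assignments(I, S, j):
    """Ordered selections of I - 2j clean mobiles out of S (outside the "-" case)."""
    return perm(S, I - 2 * j)

def count_clean_assignments_brute(I, S, j):
    """Same count by listing every ordered selection; only practical for small S."""
    return sum(1 for _ in permutations(range(S), I - 2 * j))

print(count_clean_assignments(4, 5, 1))        # 20
print(count_clean_assignments_brute(4, 5, 1))  # 20
```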

Analytical and Computational Notes
Although $P(I,S)$ is defined only for positive integer values of $I$ and $S$, it is possible to sketch the shape of the function graphically by linear interpolation on the non-integer real values. In Fig. 5 we show both the three-dimensional surface of $P(I,S)$ and its corresponding contour plot for values of $I$ and $S$ ranging between 1 and 100. Asymptotically, the function $P(I,S)$ converges to the following limits:

$$\lim_{S \to \infty} P(I,S) = 0, \qquad \lim_{I \to \infty} P(I,S) = 1, \qquad \lim_{I = S \to \infty} P(I,S) = \frac{1}{2}. \qquad (4)$$

Graphical examples of the behaviour stated in Eq. 4 are provided in Fig. 6, where a few curves of $P(I,S)$ are plotted when one of the two parameters is kept constant (and equal to 10, 50, 100) and the other ranges between 0 and 100, together with the curve corresponding to $P(I,S)$ for $1 \leq I = S \leq 100$. When one of the two parameters is equal to a constant $T$, the smaller $T$ is, the faster $P(I,S)$ converges to the limits in Eq. 4.
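The asymptotic behaviour can be explored numerically by tabulating the recursion bottom-up, which avoids deep recursion for large arguments. The sketch below assumes the three-term recursion with initial conditions $P(0,S) = 0$, $P(I,0) = 0$ and $P(1,S) = 1/S$; the function name and the table size are arbitrary choices made for illustration.

```python
def p_table(max_I, max_S):
    """Tabulate P(I, S) for 0 <= I <= max_I and 0 <= S <= max_S, bottom-up."""
    # Row I = 0 and column S = 0 stay at 0.0, matching the initial conditions.
    P = [[0.0] * (max_S + 1) for _ in range(max_I + 1)]
    for I in range(1, max_I + 1):
        for S in range(1, max_S + 1):
            if I == 1:
                P[I][S] = 1.0 / S
            else:
                n = I + S - 1
                P[I][S] = (1 + (S - 1) * P[I - 1][S - 1] + (I - 1) * P[I - 2][S]) / n
    return P

P = p_table(400, 400)
print(P[10][10], P[25][25])   # diagonal values, approaching 1/2
print(P[10][400], P[400][10]) # many clean mobiles vs. many infected mobiles
```

The diagonal values can be compared with those quoted for Fig. 6, where $P(10,10)$ and $P(25,25)$ are reported as approximately 0.52 and 0.51.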
Apart from its intrinsic theoretical relevance, the non-recursive closed formula is essential to numerically compute $P(I,S)$. In fact, the computational cost differs notably between the recursive formula of Eq. 1 and its closed form counterpart of Eq. 2: namely, the explicit formula is much faster, as shown by the values reported in Table 1 and the curves plotted in Fig. 7. For the recursive formula the computing time shows an exponentially growing trend for increasing values of $I$ and $S$, while for the non-recursive formula the computing time is very small and grows minimally for $I$ and $S$ ranging between 0 and 100. Specifically, the average time over 10 values using a Python implementation of the non-recursive formula on a 24 core Intel Xeon E5649 CPU 2.53GHz Linux workstation with 47 GB RAM is 11 milliseconds for $I = S = 5$ and 60 milliseconds for $I = S = 10$, with very limited standard deviation. On the same hardware, a Python implementation of the recursive formula took about 12 milliseconds for $P(5,5)$, 2.4 seconds for $P(30,30)$, 6 minutes for $P(40,40)$ and more than 9 hours for $P(50,50)$, which was the largest tested value.
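The exponential trend of the recursive formula can be made plausible by counting function calls instead of measuring time. The sketch below assumes that the recursive implementation evaluates the three-term recursion directly, without caching intermediate results; this is an assumption, since the original code is not shown. It also counts how many distinct configurations are actually involved, which is the work that would remain if results were cached. Function names are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def naive_call_count(I, S):
    """Number of calls an uncached evaluation of the recursion would make for P(I, S)."""
    # The counter itself is cached so that it can be evaluated quickly;
    # caching does not change the value being counted.
    if I <= 1 or S == 0:
        return 1
    return 1 + naive_call_count(I - 1, S - 1) + naive_call_count(I - 2, S)

def distinct_states(I, S):
    """Number of distinct (I, S) configurations reachable from the initial one."""
    seen = set()
    stack = [(I, S)]
    while stack:
        i, s = stack.pop()
        if (i, s) in seen:
            continue
        seen.add((i, s))
        if i > 1 and s > 0:
            stack.append((i - 1, s - 1))
            stack.append((i - 2, s))
    return len(seen)

for n in (5, 10, 30, 50):
    print(n, naive_call_count(n, n), distinct_states(n, n))
```

Under this assumption the number of calls grows roughly like the Fibonacci numbers, while the number of distinct configurations grows only quadratically.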

Proof of Equivalence in the Case $I = S \in 2\mathbb{Z}$
In this Section we show the kind of arguments involved in proving the equivalence between Eq. 1 and Eq. 2 by outlining the main steps of the proof in a simple case, i.e., when there are as many infected as clean mobiles, and their number is even. Clearly, the general case is computationally far more complex, but it uses the same ideas.
Proving the equivalence between the recursive and the combinatorial formula requires substituting the explicit expression for $P(I,S)$ of Eq. 2 in its three occurrences in Eq. 1. We are assuming $I = S = 2x \in 2\mathbb{Z}$, thus in this case the identity we need to prove reads as follows:

$$P(2x,2x) = \frac{1}{4x-1} + \frac{2x-1}{4x-1}\,P(2x-1,2x-1) + \frac{2x-1}{4x-1}\,P(2x-2,2x),$$

or, equivalently, after multiplying both sides by $4x-1$:

$$(4x-1)\,P(2x,2x) = 1 + (2x-1)\left[P(2x-1,2x-1) + P(2x-2,2x)\right].$$

In the expression that Eq. 2 gives for $P(I,S)$ in this case, the upper bound of the sum is $x-1$, since the right-hand member vanishes for $j = x$, and the product symbols are eliminated by using the factorial and double factorial notations. The expansions for $P(I-1,S-1)$ and $P(I-2,S)$ are obtained analogously.

Figure 6. In blue, we show three curves of $P(I,S)$ for constant $I$ ($I = 10$ solid line, $I = 50$ dashed line and $I = 100$ dotted line) and $S$ ranging from 0 to 100. All three curves approach the asymptotic value 0 for increasing $S$, more rapidly for smaller values of $I$. In black, we show the symmetric cases obtained keeping $S$ constant ($S = 10$ solid line, $S = 50$ dashed line and $S = 100$ dotted line) and letting $I$ range from 0 to 100. Again, all three curves approach the asymptotic value 1 for increasing $I$, more rapidly for smaller values of $S$. The sawtooth shape of the curve $P(I,10)$ for $I \geq 30$ is due to the effect of the $-$ case, which induces abrupt differences in $P(I,S)$ for consecutive values of $I$ (changing from even to odd). Finally, the dotted-dashed red line shows the curve of $P(I,S)$ for $I = S$ ranging between 0 and 100: in this case, the curve gets very close to its asymptotic value 0.5 even for small values of $I = S$; for instance, $P(10,10) \approx 0.52$ and $P(25,25) \approx 0.51$. doi:10.1371/journal.pone.0059468.g006

Table 1. Computing times were measured for $I = S = 5, \ldots, 100$, and only the closed formula was used for $I,S > 50$ (due to the excessively long runtimes: e.g., computing $P(50,50)$ by the recursive formula took more than 9 hours). Mean, maximum (Max) and minimum (Min) values for 10 replicates of each experiment are reported.
All simulations were run on a 24 core Intel Xeon E5649 CPU 2.53GHz workstation with 47 GB RAM, Linux 2.6.32 (Red Hat 4.4.6), with software written in Python 2.6.6. doi:10.1371/journal.pone.0059468.t001