Speeding Up Non-Parametric Bootstrap Computations for Statistics Based on Sample Moments in Small/Moderate Sample Size Applications

In this paper we propose a vectorized implementation of the non-parametric bootstrap for statistics based on sample moments. We adopt the multinomial sampling formulation of the non-parametric bootstrap, and compute bootstrap replications of sample moment statistics by weighting the observed data according to multinomial counts, instead of evaluating the statistic on a resampled version of the observed data. Under this formulation we can generate a matrix of bootstrap weights and compute the entire vector of bootstrap replications with a few matrix multiplications. Vectorization is particularly important for matrix-oriented programming languages such as R, where matrix/vector calculations tend to be faster than scalar operations implemented in a loop. We illustrate the application of the vectorized implementation on real and simulated data sets, bootstrapping Pearson's sample correlation coefficient, and compare its performance against two state-of-the-art R implementations of the non-parametric bootstrap, as well as a straightforward implementation based on a for loop. Our investigations spanned varying sample sizes and numbers of bootstrap replications. The vectorized bootstrap compared favorably against the state-of-the-art implementations in all cases tested, and was considerably faster for small/moderate sample sizes. The same held in the comparison with the straightforward implementation, except for large sample sizes, where the vectorized bootstrap was slightly slower owing to the additional time spent generating the weight matrices via multinomial sampling.
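The multinomial-weighting idea described above can be sketched in a few lines of R. The following is a minimal illustration, not the paper's actual code: the function name vect.boot.cor and its interface are our own, and the bootstrap replications of Pearson's correlation are computed from the (biased) plug-in sample moments.

```r
# Vectorized non-parametric bootstrap of Pearson's correlation coefficient.
# Hypothetical sketch of the multinomial-weighting formulation.
vect.boot.cor <- function(x, y, B) {
  n <- length(x)
  # n x B matrix of multinomial counts; each column sums to n and plays
  # the role of the resampling weights for one bootstrap replication
  W <- rmultinom(B, size = n, prob = rep(1/n, n))
  # All B replications of each sample moment via one matrix multiplication
  mx  <- crossprod(W, x)     / n   # weighted means of x (B x 1)
  my  <- crossprod(W, y)     / n   # weighted means of y
  mxy <- crossprod(W, x * y) / n   # weighted means of x*y
  mxx <- crossprod(W, x^2)   / n   # weighted means of x^2
  myy <- crossprod(W, y^2)   / n   # weighted means of y^2
  # Pearson correlation assembled from the plug-in moments
  as.vector((mxy - mx * my) / sqrt((mxx - mx^2) * (myy - my^2)))
}

set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)
rstar <- vect.boot.cor(x, y, B = 1000)
```

Note that no loop over the B replications is needed: the single rmultinom call generates all weight vectors at once, and each crossprod computes all B weighted moments in one matrix operation.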


S2 Text. Comparison of single thread versus parallel implementations
As with any other bootstrap technique, our vectorized implementation can potentially benefit from parallelization, by splitting and distributing the total number of bootstrap calculations across multiple processors/cores. Using the standard capabilities provided by the parallel R package, we implemented parallel versions of the "for loop" and "vectorized" bootstrap functions for bootstrapping Pearson's correlation coefficient, and compared the running times of the single-threaded versus parallel implementations on all the real and simulated data examples presented in the main text.
In this benchmark study we evaluated parallel versions based on both the parLapply and mclapply alternatives to the lapply function in the parallel R package. Since forking is not available on the Windows operating system (where the mclapply function falls back to single-threaded computation), we benchmarked the results on two distinct platforms: an Intel Core i7-3610QM (2.3 GHz), 24 GB RAM, Windows 7 Enterprise (64-bit) machine, and an Intel Core i7-950 (3.07 GHz), 24 GB RAM, Xubuntu 14.04 (64-bit) machine. For the parallel implementations employing the parLapply function, we used PSOCK clusters generated by calling makeCluster(specs, type = "PSOCK"), with the specs argument set to 4, since both the Windows and Xubuntu machines have quad-core processors.
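The two parallelization routes can be set up as follows. This is a hedged sketch, not the benchmarked code: the boot.chunk helper, the simulated data, and the chunk sizes are our own illustrative choices; only the parallel package calls (makeCluster, parLapply, mclapply) are as described above.

```r
library(parallel)

# Illustrative helper: run b "for loop"-style bootstrap replications of
# Pearson's correlation on one worker (not the paper's actual function).
boot.chunk <- function(b, x, y) {
  n <- length(x)
  replicate(b, {
    idx <- sample.int(n, n, replace = TRUE)
    cor(x[idx], y[idx])
  })
}

set.seed(123)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
chunks <- rep(250, 4)  # B = 1,000 replications split across 4 workers

# PSOCK cluster: works on both Windows and Unix-like systems
cl <- makeCluster(4, type = "PSOCK")
res.psock <- unlist(parLapply(cl, chunks, boot.chunk, x = x, y = y))
stopCluster(cl)

# Fork-based alternative (Unix-like systems only)
res.fork <- unlist(mclapply(chunks, boot.chunk, x = x, y = y, mc.cores = 4))
```

The PSOCK route must serialize the function and its arguments to each worker process, while the fork-based route shares the parent's memory; this start-up and communication overhead is one reason the relative timings of the two alternatives depend on B and on the platform.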
Figures 1 to 5 present the results. In all figures, panel a presents the results for the Windows platform; there, the timings for the single-threaded (full lines) and mclapply parallel implementations (dotted lines) are very close, since forking is not available on Windows and the mclapply function falls back to single-threaded computation. Panel b presents the results for the Xubuntu platform. Overall, we observe the following results:
1. Inspection of Figures 3, 4, and 5 shows that, similarly to the single-threaded results in the main text, the vectorized parallel implementations tended to outperform the "for loop" parallel implementations for small/moderate sample sizes, but tended to be slower for larger sample sizes.
2. Comparison of the brown curves in Figures 1, 2, 4, and 5 shows that the "for loop" parallel implementations tended to be faster than the "for loop" single-threaded computations for all sample sizes tested, with the gains being larger for larger sample sizes and numbers of bootstrap replications. (Note, nonetheless, that in Figure 3 (B = 10,000) the parLapply implementation is slower than single-threaded computing, although the mclapply parallel implementation is faster.)
3. Comparison of the blue curves in all figures shows that the vectorized parallel implementations tended to be faster than the vectorized single-threaded computations as the sample size and number of bootstrap replications increased (Figures 1, 4, and 5 show that the gap between the single-threaded and parallel computations tends to increase as we increase B). However, for small sample sizes the single-threaded computations were sometimes slightly faster (compare the full and dashed blue lines in Figure 2, and the full and dashed blue lines at small N values in Figure 4).
4. When B = 10,000 (Figure 3), comparison of the parLapply and mclapply implementations (for both the "for loop" and vectorized versions) showed that parLapply is slower than the single-threaded computation, whereas mclapply is faster. For larger numbers of bootstrap replications (Figures 4 and 5), however, both parallel implementations tend to be faster than the single-threaded computation, with parLapply becoming more efficient than mclapply as B increases.
Interestingly, these benchmarking results suggest that parallelization should not always be preferred to single-threaded computation: sometimes the time spent distributing tasks and gathering results across multiple cores can exceed the time needed for the single-threaded computation. We point out, nonetheless, that benchmarking results for parallel implementations are highly dependent on hardware specifications.