A Network Approach to Analyzing Highly Recombinant Malaria Parasite Genes
Choosing a noise threshold requires balancing two competing requirements for correctly identifying network communities: minimizing the number of incorrectly placed links while retaining enough correctly placed links to satisfy the network connectivity requirements of the community detection method. (A) The probability that two unrelated sequences share a block decreases as block length increases, as modeled in S1. Each HVR's length and composition are taken into account separately (colored lines). Choosing a tolerance for false positives (grey line) specifies a minimum retained block length; because block lengths are integers, the point at which a curve crosses the tolerance is rounded up to the next integer (squares). Curves are plotted for HVRs 3 and 5, for which we would select thresholds of five and seven, respectively. Curves for all nine HVRs are shown in Figure S6A. (B) With a threshold of 6 for HVR 1, the histogram of HVR 1 block lengths shows that the vast majority of blocks fall below the threshold (white bars) and that the retained blocks are widely distributed (green bars, inset). (C) Networks fragment as the block length threshold is increased and more links are discarded. The relationship between the size of the largest component and the block length threshold is shown for the least-connected (HVR 3) and most-connected (HVR 7) networks. Some thresholds allow too many false positives, as described in panel A (grey lines), whereas others fragment the network too much for reliable community detection (shaded region). Points plotted in color above the shaded region are both sufficiently error-free and sufficiently well connected for network communities to be inferred reliably. For HVRs 2–4, even the most permissive false positive threshold results in a network that is too fragmented for community detection (red circle). Curves for all nine HVRs are shown in Figure S6B.
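The two selection steps in panels A and C can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy false positive curve (a geometric decay standing in for the model in S1), the toy link list, and the rule that a link is retained when its shared block length is at least the threshold are all assumptions made for illustration.

```python
from collections import Counter


def min_block_length(false_positive_prob, tolerance, max_length=50):
    """Return the smallest integer block length whose false positive
    probability is at or below the tolerance, or None if none qualifies.

    `false_positive_prob` is a hypothetical callable; the real curve
    depends on each HVR's length and composition.
    """
    for length in range(1, max_length + 1):
        if false_positive_prob(length) <= tolerance:
            return length
    return None


def largest_component_fraction(n_nodes, links, threshold):
    """Fraction of nodes in the largest connected component after
    discarding links whose block length is below the threshold.

    `links` is an iterable of (node_u, node_v, block_length) tuples.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, block_length in links:
        if block_length >= threshold:  # assumed retention rule
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v

    sizes = Counter(find(x) for x in range(n_nodes))
    return max(sizes.values()) / n_nodes


# Toy stand-in for the panel A curve: NOT the model described in S1.
toy_curve = lambda length: 0.5 ** length

# Toy network: five sequences joined in a chain by blocks of varying length.
toy_links = [(0, 1, 7), (1, 2, 4), (2, 3, 6), (3, 4, 3)]

threshold = min_block_length(toy_curve, tolerance=0.05)
fractions = {t: largest_component_fraction(5, toy_links, t) for t in (4, 6, 8)}
```

In this toy setting, raising the threshold from 4 to 8 shrinks the largest component from four of the five nodes to a single node, which mirrors the fragmentation trend described for panel C. A real analysis would substitute the per-HVR curves and the observed shared-block lengths.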