RecPD: A Recombination-aware measure of phylogenetic diversity

A critical step in studying biological features (e.g., genetic variants, gene families, metabolic capabilities, or taxa) is assessing their diversity and distribution among a sample of individuals. Accurate assessments of these patterns are essential for linking features to traits or outcomes of interest and understanding their functional impact. Consequently, it is crucial that the measures employed for quantifying feature diversity perform robustly under any evolutionary scenario. However, the standard measures used for quantifying and comparing the distribution of features, such as prevalence, phylogenetic diversity, and related approaches, either do not take evolutionary history into consideration or assume strictly vertical patterns of inheritance. As a result, these approaches cannot accurately assess diversity for features that have undergone recombination or horizontal transfer. To address this issue, we have devised RecPD, a novel recombination-aware phylogenetic-diversity statistic for measuring the distribution and diversity of features under all evolutionary scenarios. RecPD utilizes ancestral-state reconstruction to map the presence/absence of features onto ancestral nodes in a species tree, and then identifies potential recombination events in the evolutionary history of the feature. We also derive several related measures from RecPD that can be used to assess and quantify evolutionary dynamics and the correlation of feature evolutionary histories. We used simulation studies to show that RecPD reliably reconstructs feature evolutionary histories under diverse recombination and loss scenarios. We then applied RecPD in two diverse real-world scenarios, a preliminary study of type III effector protein families secreted by the plant pathogenic bacterium Pseudomonas syringae and growth phenotypes of the Pseudomonas genus, and demonstrated that prevalence is an inadequate measure that obscures the potential impact of recombination.
We believe RecPD will have broad utility for revealing and quantifying complex evolutionary processes for features at any biological level.
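
To make the contrast between prevalence and phylogenetic diversity concrete, here is a minimal sketch on a toy four-tip tree. It is not the RecPD algorithm: it only computes prevalence and a rooted variant of Faith's phylogenetic diversity, expressed as a fraction of total tree length. The tree, tip names, branch lengths, and function names are all hypothetical and chosen purely for illustration.

```python
# Toy rooted tree, stored as child -> (parent, branch length).
# Hypothetical topology: ((t1, t2)n1, (t3, t4)n2)root
TREE = {
    "t1": ("n1", 1.0),
    "t2": ("n1", 1.0),
    "t3": ("n2", 1.0),
    "t4": ("n2", 1.0),
    "n1": ("root", 2.0),
    "n2": ("root", 2.0),
}
TIPS = ["t1", "t2", "t3", "t4"]


def prevalence(carriers):
    """Fraction of tips that carry the feature; ignores the tree entirely."""
    return len(set(carriers)) / len(TIPS)


def faith_pd(carriers):
    """Rooted Faith's PD: summed length of all branches on the paths from
    the carrier tips to the root, as a fraction of total tree length."""
    used = set()
    for tip in carriers:
        node = tip
        while node in TREE:  # walk up until the root (which has no parent entry)
            used.add(node)   # each branch is identified by its child node
            node = TREE[node][0]
    total = sum(length for _, length in TREE.values())
    return sum(TREE[node][1] for node in used) / total


clustered = ["t1", "t2"]  # feature confined to one clade
dispersed = ["t1", "t3"]  # feature scattered across both clades

print(prevalence(clustered), prevalence(dispersed))
print(faith_pd(clustered), faith_pd(dispersed))
```

Both features have the same prevalence, yet the dispersed one spans more of the tree. Phylogenetic diversity alone cannot say whether that dispersed pattern arose from vertical inheritance with losses or from horizontal transfer, which is the gap a recombination-aware measure is meant to address.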

Moreover, the authors introduce RecPD as an "ecological" diversity measure, but the discussed use cases (both on simulated and real data) are not designed that way: rather than comparing samples of entire communities (e.g. of closely related strains), the authors compare lineage genomes (or traits) directly, as would be done in a comparative genomics study, basically using PD and RecPD as summary statistics on the trait. The authors address this by introducing RecPD as a measure on "features" that can be many things, but it remains unclear how RecPD would be used as an "ecological" diversity measure in the stricter sense in practice.

RESPONSE:
The reviewer is correct that we were not as careful in our description of diversity metrics as we should have been. While many of the metrics we described in the introduction are most widely used in ecology, they are by no means limited to these applications. We have expanded our definitions of alpha and beta diversity and removed reference to ecology to better focus on their more general applicability. We hope that the new structure will be clearer to readers not familiar with the field.
Admittedly, while the P. syringae results give a good general example of RecPD's usefulness, the manuscript does not leave me with a clear idea of how RecPD would be used in practice (and where it would not be appropriate). I believe that the authors should at least outline possible limitations with regard to data types and requirements: can RecPD handle imperfect phylogenies or missing data? How does the method scale computationally to larger problems?

RESPONSE:
We have provided a second example to illustrate the application of RecPD, in this case to growth phenotypes of the Pseudomonas genus. We hope the breadth and contrast of these two examples, 1) effector families across strains and 2) phenotypic diversity across species, will give readers a better appreciation for the potential utility of RecPD. We also agree with the reviewer that these analyses will face certain limitations, particularly regarding the availability and quality of datasets, and have incorporated these considerations explicitly into our revised discussion. We have also added the time taken to run RecPD on our simulated and real-world examples, and we foresee future improvements that will increase computational efficiency.
RecPD and derived measures do not adjust for differential abundance of entities (taxa) carrying 'features'. As described in the text, RecPD is an adjusted richness measure, but does not account for the frequency with which each trait/feature is observed in a community. I believe this is a relevant limitation of the method that should at least be addressed in the text, also to put RecPD in the context of existing measures (see previous point).

RESPONSE:
The reviewer is correct that RecPD does not account for differential abundances, so care must be taken when comparing multiple features that have different abundances in the study population. We explicitly point this out in the revised discussion.
Related to this, a more formal description of RecPD would be desirable. While the text and Fig 1 lay out the concept very well, it would be good to have more exact mathematical formulations, probably in the Methods section.

RESPONSE:
We have provided a more formal description of RecPD as the first subsection of the Methods, along with an additional table describing all the measures we have devised, as a potentially useful reference for the reader.