In 2018, we published in Nature a revised version of the phylogenetic bootstrap, a method proposed by Joseph Felsenstein more than 30 years ago. This method, based on resampling and replication, is used extensively to assess the robustness of phylogenetic inferences. Its usefulness, simplicity and interpretability have made it extremely popular in evolutionary studies, to the point that it is generally required for the publication of phylogenies. Felsenstein’s article has been cited more than 35,000 times and ranks among the 100 most cited scientific papers of all time. In 2017 alone, it was cited more than 2,000 times.
However, it is commonly acknowledged that Felsenstein’s bootstrap is not well suited to large datasets containing hundreds or thousands of taxa, which are now common thanks to high-throughput sequencing technologies. While such datasets generally contain a lot of phylogenetic information, Felsenstein’s bootstrap proportions (FBP) tend to be low, especially when the tree is inferred from a single gene or only a few genes. This degradation stems from the core methodology of Felsenstein’s bootstrap: a bootstrap branch must exactly match a branch of the original tree estimate to count toward the bootstrap support of that branch. A difference of just one taxon is enough for the branch to be counted as absent, even though it is nearly identical to the original branch. The standard workaround is to remove “rogue” (phylogenetically unstable) taxa and rerun the analysis, but this is statistically questionable and computationally expensive. Moreover, in large trees, inferred branches are likely to contain errors, and a large fraction of taxa may be unstable.
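The exact-match rule can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each tree is given as a list of bipartitions and that each bipartition is written as the set of taxa on one side of the branch. The function names (`canonical`, `fbp`) and the toy data are hypothetical and are not taken from any published tool.

```python
from typing import FrozenSet, Iterable, List, Set


def canonical(side: Iterable[str], taxa: Set[str]) -> FrozenSet[str]:
    """Return one fixed side of the bipartition side | (taxa - side).

    A branch can be described by either of its two sides, so we always keep
    the side that does NOT contain the alphabetically smallest taxon.
    """
    side = frozenset(side)
    reference_taxon = min(taxa)
    return frozenset(taxa) - side if reference_taxon in side else side


def fbp(branch: Iterable[str],
        bootstrap_trees: List[List[Set[str]]],
        taxa: Set[str]) -> float:
    """Fraction of bootstrap trees that contain exactly the given branch."""
    target = canonical(branch, taxa)
    hits = 0
    for tree in bootstrap_trees:
        branches = {canonical(b, taxa) for b in tree}
        if target in branches:  # binary presence/absence test
            hits += 1
    return hits / len(bootstrap_trees)


# Toy data (hypothetical): six taxa, four bootstrap trees,
# each tree listed by its three internal branches.
taxa = {"A", "B", "C", "D", "E", "F"}
reference_branch = {"A", "B", "C"}
bootstrap_trees = [
    [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}],
    [{"A", "B"}, {"A", "B", "D"}, {"E", "F"}],  # C and D swapped
    [{"A", "B"}, {"A", "B", "C"}, {"D", "E"}],
    [{"B", "C"}, {"A", "B", "C"}, {"E", "F"}],
]

support = fbp(reference_branch, bootstrap_trees, taxa)
print(support)  # 0.75
```

In this toy example the second bootstrap tree contains no exact copy of the reference branch, although it has branches that differ from it by a single taxon. It therefore contributes nothing to the support.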
We proposed (Lemoine et al. Nature 2018) a new version of the phylogenetic bootstrap, in which the presence of original branches in bootstrap trees is measured using a gradual “transfer” distance, as opposed to the binary presence/absence index of the original version. This distance is normalized to the [0, 1] range and averaged over all bootstrap trees. We thus obtain the “transfer bootstrap expectation” (TBE), which replaces the branch presence frequency of FBP (i.e. the expectation of a 0/1 function) with the expectation of a nearly continuous function. By construction, TBE supports are never lower than FBP supports, and the difference is substantial for deep branches.
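The computation can be sketched in Python as follows. This is a minimal illustration, not the implementation distributed with our tools. It assumes that bipartitions are written as the set of taxa on one side of the branch, and that the transfer distance is normalized by p − 1, where p is the size of the smaller side of the reference branch. That normalization is our reading of the definition in Lemoine et al. (2018) and is not spelled out in this text. All function names and data are hypothetical.

```python
from typing import Iterable, List, Set


def transfer_distance(a: Iterable[str], b: Iterable[str], taxa: Set[str]) -> int:
    """Minimum number of taxa to move so that two bipartitions coincide."""
    a, b = frozenset(a), frozenset(b)
    disagreements = len(a ^ b)  # taxa placed differently by the two sides
    # The other side of b describes the same branch, so take the smaller value.
    return min(disagreements, len(taxa) - disagreements)


def tbe(branch: Iterable[str],
        bootstrap_trees: List[List[Set[str]]],
        taxa: Set[str]) -> float:
    """Transfer bootstrap expectation of one reference branch (sketch)."""
    branch = frozenset(branch)
    p = min(len(branch), len(taxa) - len(branch))  # size of the smaller side
    if p < 2:
        raise ValueError("support is only defined for internal branches")
    total = 0
    for tree in bootstrap_trees:
        closest = min(
            (transfer_distance(branch, b, taxa) for b in tree),
            default=p - 1,
        )
        # Terminal branches, not listed in the input, are always at
        # distance p - 1, so the distance is capped at that value.
        total += min(closest, p - 1)
    return 1 - total / (len(bootstrap_trees) * (p - 1))


# Toy data (hypothetical): six taxa, four bootstrap trees,
# each tree listed by its three internal branches.
taxa = {"A", "B", "C", "D", "E", "F"}
reference_branch = {"A", "B", "C"}
bootstrap_trees = [
    [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}],
    [{"A", "B"}, {"A", "B", "D"}, {"E", "F"}],  # C and D swapped
    [{"A", "B"}, {"A", "B", "C"}, {"D", "E"}],
    [{"B", "C"}, {"A", "B", "C"}, {"E", "F"}],
]

support = tbe(reference_branch, bootstrap_trees, taxa)
print(support)  # 0.875
```

Here the second bootstrap tree lacks the exact reference branch but contains a branch one transfer away from it. It therefore contributes 1 − 1/2 = 0.5 to the average rather than 0, and the support is 0.875.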
We studied the statistical basis of this approach in (Davila Felipe et al. Journal of Mathematical Biology 2019). In this paper, we demonstrated mathematically that in the absence of signal (i.e. when the original tree is drawn at random), the branch supports converge in probability to 0 as the number n of tree tips tends to infinity. Moreover, we fully characterized the distribution of the transfer support in the absence of signal on caterpillar trees. This characterization indicates that the convergence is fast and that, even when n is small, moderate levels of branch support cannot appear by chance.
TBE computation and other phylogenetic tools are available from http://booster.c3bi.pasteur.fr . We are continuing to work on this subject, both to elucidate the mathematical basis of the transfer distance and to use it in other contexts (e.g. tree comparison, phylogenomics).