Species Tree Inference

Introduction

Between 2003 and 2026 I authored or coauthored 14 papers on species tree inference. Estimating the phylogenetic relationships among species is a central problem in evolutionary biology, but it is complicated because different genomic regions may have different genealogical histories owing to incomplete lineage sorting (ILS), cross-species gene flow, and recombination. The multispecies coalescent (MSC) model provides a natural framework for accommodating this genealogical heterogeneity and for inferring species trees from multi-locus genomic data.

My work on species tree inference began with a 2003 paper with Ziheng Yang developing a Bayesian method for estimating species divergence times and ancestral population sizes under the MSC [1]. We subsequently authored a review of methods for phylogenetic inference using whole genomes [2]. Postdoc Adam Leache and I then conducted a systematic comparison of species tree estimation methods using simulations, demonstrating the superior accuracy of full-likelihood Bayesian methods under the MSC when gene tree discordance is high [3]. Later, a collaboration with Leache, Harris, and Yang characterized the effects of gene flow on species tree estimation [4]. In 2017, Ziheng Yang and I developed efficient MCMC algorithms (SPR and node-slider proposals) for Bayesian species tree inference that substantially improved upon earlier approaches [5].

The BPP program, developed collaboratively with Tomas Flouri and Ziheng Yang, has been a central vehicle for implementing these methods. Flouri et al. (2018) provided a practical guide to species tree estimation with BPP [6]. The MSC framework was subsequently extended to accommodate introgression [7], and we examined how gene flow impacts species tree estimation [8]. Two book chapters provide comprehensive reviews and tutorials [9,10]. More recent work has extended BPP to relaxed clocks [11] and continuous migration models [12], and has examined the effects of recombination on phylogenetic inference [13] and on Bayesian inference of gene flow [14].

The papers


[1] B. Rannala and Z. Yang. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164: 1645-1656.

This paper developed the foundational Bayesian MCMC method for simultaneous estimation of species divergence times and ancestral effective population sizes using DNA sequence data from multiple loci under the MSC model. The method integrates over uncertain gene trees and branch lengths at each locus. We applied it to noncoding DNA sequences from humans and the great apes, estimating the population size of the human-chimpanzee common ancestor to be approximately 20,000. This paper laid the groundwork for much of the subsequent development of species tree inference methods under the MSC.


[2] B. Rannala and Z. Yang. 2008. Phylogenetic inference using whole genomes. Annual Review of Genomics and Human Genetics 9: 217-231.

This review article critically examines methods for analyzing genomic datasets from multiple loci, including concatenation, separate gene-by-gene analyses, and statistical models that accommodate heterogeneity in different aspects of the evolutionary process among data partitions. We discuss factors that may cause the gene tree to differ from the species tree, as well as strategies for estimating species phylogenies in the presence of gene tree conflicts. The review highlights the computational and statistical challenges posed by genomic datasets.


[3] A.D. Leache and B. Rannala. 2011. The accuracy of species tree estimation under simulation: a comparison of methods. Systematic Biology 60: 126-137.

This paper compared the performance of several species tree estimation methods using computer simulation, varying tree shape, tree length, and population size parameters. When the probability of discordance between gene trees and the species tree is high (small divergence times and/or large population sizes), Bayesian species tree inference using the MSC (as implemented in BEST) outperformed other methods. Performance of all methods improved with increased tree length (reflecting decreased deep coalescence and more accurate gene tree estimates) and decreased population sizes. Ten loci were adequate for estimating the correct species tree when deep coalescence was limited, but 100 loci were needed under difficult demographic scenarios.
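
As a rough illustration of why small divergence times and large population sizes raise discordance, consider the simplest case of three species with one sequence sampled per species. Under the MSC without gene flow, the standard result is that the gene tree topology mismatches the species tree with probability (2/3)exp(-T), where T is the internal branch length in coalescent units. The sketch below assumes the parameterization used in BPP, in which divergence times (tau) and population sizes (theta) are both measured in expected mutations per site, so that T = 2 * delta_tau / theta. The function names and parameter values are hypothetical and are not taken from the paper's simulation design.

```python
import math


def coalescent_units(delta_tau: float, theta: float) -> float:
    """Internal branch length in coalescent units.

    delta_tau: length of the internal species tree branch
               (expected mutations per site).
    theta:     population size parameter of the ancestral population
               on that branch (theta = 4 * N * mu).
    Two lineages coalesce at rate 2 / theta per unit of mutational time,
    so the branch spans 2 * delta_tau / theta coalescent units.
    """
    if theta <= 0 or delta_tau < 0:
        raise ValueError("theta must be positive and delta_tau non-negative")
    return 2.0 * delta_tau / theta


def discordance_probability(delta_tau: float, theta: float) -> float:
    """Probability that a three-species gene tree mismatches the species tree.

    Assumes one sequence per species, no gene flow, and a known gene tree
    (no phylogenetic estimation error).
    """
    t = coalescent_units(delta_tau, theta)
    return (2.0 / 3.0) * math.exp(-t)


# Hypothetical scenarios: (internal branch length, ancestral theta).
for delta_tau, theta in [(0.0005, 0.01), (0.005, 0.01), (0.005, 0.001)]:
    p = discordance_probability(delta_tau, theta)
    print(f"delta_tau={delta_tau}, theta={theta}: P(discordant)={p:.4f}")
```

Shortening the internal branch or increasing theta both reduce T and push the discordance probability toward its maximum of 2/3, the regime in which the summary above reports the largest differences among methods.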


[4] A.D. Leache, R.B. Harris, B. Rannala, and Z. Yang. 2014. The influence of gene flow on species tree estimation: a simulation study. Systematic Biology 63: 17-30.

This paper used simulations to quantify the impacts of gene flow on species tree inference under various migration models, including the isolation-migration model, the n-island model, gene flow between non-sister species, and allelic introgression. Even when the species tree topology is estimated with high accuracy, gene flow can cause overestimates of population sizes (species tree dilation) and underestimates of divergence times (species tree compression). These results highlight the need for careful sampling design in phylogeographic and species delimitation studies, as gene flow, introgression, or incorrect sample assignments can bias both the species tree topology and parameter estimates.


[5] B. Rannala and Z. Yang. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Systematic Biology 66: 823-842.

This paper developed two efficient MCMC proposals for species tree inference under the MSC: one based on subtree pruning and regrafting (SPR) and another based on a node-slider algorithm. Both proposals alter the species tree while simultaneously modifying gene trees at multiple loci to avoid conflicts with the newly proposed species tree. The method showed excellent performance, inferring the correct species tree with near certainty when 10 loci were included. Analysis of rattlesnake data revealed drastically different evolutionary dynamics between nuclear and mitochondrial loci, even though they supported largely consistent species trees.


[6] T. Flouri, X. Jiao, B. Rannala, and Z. Yang. 2018. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Molecular Biology and Evolution 35: 2585-2593.

This paper provided a practical guide to using BPP for species tree estimation. BPP includes a full-likelihood implementation of the MSC, using transmodel MCMC to calculate posterior probabilities of different species trees. The program accommodates heterogeneity of gene trees among loci and gene tree uncertainties due to limited phylogenetic information at each locus. The paper shows how to use both BPP 3.4 and BPP 4.0 for species tree inference with genomic sequence data.


[7] T. Flouri, X. Jiao, B. Rannala, and Z. Yang. 2020. A Bayesian implementation of the multispecies coalescent model with introgression for phylogenomic analysis. Molecular Biology and Evolution 37: 1211-1223.

This paper implemented the multispecies-coalescent-with-introgression (MSci) model in BPP, extending the MSC to incorporate introgression. The MSci model accommodates both deep coalescence and introgression, providing a natural framework for inference using genomic sequence data. Simulations confirmed good statistical properties, although hundreds or thousands of loci are typically needed to estimate introgression probabilities reliably. Reanalysis of purple cone spruce data confirmed the hypothesis of homoploid hybrid speciation. Introgression probabilities estimated for Anopheles gambiae mosquito species varied considerably across the genome, likely driven by differential selection against introgressed alleles.


[8] X. Jiao, T. Flouri, B. Rannala, and Z. Yang. 2020. The impact of cross-species gene flow on species tree estimation. Systematic Biology 69: 830-847.

This paper examined the levels of gene flow needed to mislead species tree estimation with three species under both episodic introgressive hybridization and continuous migration. The majority-vote method based on gene tree topologies was found to be more robust to gene flow than the UPGMA method based on coalescent times, and both were more robust than full-likelihood inference assuming an MSC model without gene flow. A small amount of gene flow per generation can cause drastic changes to the genetic history and mislead species tree methods, especially if species diverged through radiative speciation events. Analysis of Anopheles gambiae species data provided an example of extreme impact of gene flow on species phylogeny.
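
The idea of a majority-vote estimate for three species can be sketched in a few lines: tally the gene tree topologies across loci and take the most frequent one as the species tree estimate. The code below is a minimal illustration under the assumption that each locus has already been reduced to one of the three possible rooted topologies; the species labels, the function name, and the tie-handling rule are hypothetical and do not reproduce the implementation used in the paper.

```python
from collections import Counter

# The three possible rooted topologies for species A, B, and C.
TOPOLOGIES = ("((A,B),C)", "((A,C),B)", "((B,C),A)")


def majority_vote(gene_tree_topologies):
    """Return (estimate, counts) for a list of per-locus topologies.

    The estimate is the most frequent topology, or None when the two
    most frequent topologies are tied (an assumed convention).
    """
    counts = Counter(gene_tree_topologies)
    if not counts:
        raise ValueError("at least one gene tree is required")
    unknown = set(counts) - set(TOPOLOGIES)
    if unknown:
        raise ValueError(f"unrecognized topologies: {sorted(unknown)}")

    ranked = counts.most_common()
    best, best_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return None, counts
    return best, counts


# Hypothetical tallies from ten loci.
loci = ["((A,B),C)"] * 5 + ["((A,C),B)"] * 3 + ["((B,C),A)"] * 2
estimate, counts = majority_vote(loci)
print(estimate, dict(counts))
```

Under the MSC without gene flow, the two mismatching topologies are expected to be equally frequent and less common than the matching one, so the vote is consistent given accurate gene trees. Sufficient gene flow can make a mismatching topology the most common, which is one way an estimate of this kind can be misled.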


[9] B. Rannala, S.V. Edwards, A. Leache, and Z. Yang. 2020. The Multi-species Coalescent Model and Species Tree Inference. In C. Scornavacca, F. Delsuc, and N. Galtier (eds.), Phylogenetics in the Genomic Era, pp. 3.3:1-3.3:21.

This book chapter outlines the basic theory of the MSC and its important applications in analysis of genomic sequence data, describing the most widely used full-likelihood and heuristic methods of species tree estimation. The chapter discusses the framework for accommodating speciation events and coalescent processes within species, and how genealogical fluctuations across genomic regions serve as a source of information for estimating species divergence times, ancestral population sizes, and cross-species introgression. Several active areas of research and predicted future developments are discussed.


[10] T. Flouri, B. Rannala, and Z. Yang. 2020. A Tutorial on the Use of BPP for Species Tree Estimation and Species Delimitation. In C. Scornavacca, F. Delsuc, and N. Galtier (eds.), Phylogenetics in the Genomic Era, pp. 5.6:1-5.6:16.

This tutorial illustrates the use of BPP for species tree estimation and species delimitation, providing practical guidelines on running BPP on multicore systems. BPP can be used for estimation of population sizes and species divergence times, species tree estimation, species delimitation, and estimation of cross-species introgression intensity. The program can also simulate gene trees and sequence alignments under the MSC model with or without migration.


[11] T. Flouri, J. Huang, X. Jiao, P. Kapli, B. Rannala, and Z. Yang. 2022. Bayesian phylogenetic inference using relaxed-clocks and the multispecies coalescent. Molecular Biology and Evolution 39: msac161.

This paper extended BPP to incorporate more general substitution models and relaxed clocks within the MSC framework, allowing rates to vary among species. Simulations confirmed that the strict clock model is adequate for closely related species but accounting for clock violation is important for distant species. Valuable phylogenetic information exists in gene-tree branch lengths even when the molecular clock is seriously violated, and the relaxed-clock models in BPP are able to extract such information. The models are currently most effective for estimating population parameters when the species tree topology is fixed.


[12] T. Flouri, X. Jiao, J. Huang, B. Rannala, and Z. Yang. 2023. Efficient Bayesian inference under the multispecies coalescent with migration. Proceedings of the National Academy of Sciences 120: e2310708120.

This paper implemented the multispecies-coalescent-with-migration model in BPP for testing gene flow and estimating migration rates along with species divergence times and population sizes. Efficient MCMC algorithms were developed enabling analysis of genome-scale datasets with thousands of loci. The implementation of both introgression and migration models in the same program allows testing whether gene flow occurred continuously over time or in pulses. Analyses of Anopheles mosquito genomic data demonstrated the rich information in typical genomic datasets about the mode and rate of gene flow.


[13] B. Rannala. 2025. Recombination and phylogenetic inference. Evolutionary Journal of the Linnean Society 4: kzaf016.

This paper examines the effects of recombination on phylogenetic tree inference. Concatenation approaches that treat all loci as sharing one gene tree are compared with species tree methods that assume each locus has its own gene tree. Recombination is found to be more detrimental for concatenation methods and has relatively little impact on topology or divergence time estimates for species tree methods under the MSC. An important practical implication is that recombination detection may be unnecessary for species tree analysis, and that removing recombinant loci could actually introduce bias in parameter estimates.


[14] Y. Thawornwattana, B. Rannala, and Z. Yang. 2026. On the robustness of Bayesian inference of gene flow to intragenic recombination and natural selection. Molecular Biology and Evolution 43: msaf327.

This paper uses simulation to examine false positive rates in a Bayesian test of gene flow under the MSC, considering recombination, natural selection, species divergence timing, and whether gene flow involves sister or non-sister lineages. The test has very low false positive rates in most scenarios. Detection of gene flow between sister lineages may be prone to high false positive rates when very recent divergence is combined with very high recombination rates. The test is robust to various types of selection at low recombination rates, although prolonged balancing selection can produce false gene flow signals between sister lineages. Detection of gene flow between non-sister lineages remains robust across all recombination and divergence levels.