Species Delimitation
Introduction
Between 2010 and 2020 I coauthored nine papers on species delimitation: the process of determining which groups of individual organisms constitute different populations of a single species and which constitute different species. The problem goes back to the earliest days of taxonomy, and formalized procedures for describing new species are well developed and widely used. Traditional taxonomic methods are time-intensive, however, and problematic for some groups, particularly cryptic species. The topic is central to both systematic biology and conservation biology.
My work on species delimitation began with a collaboration with Ziheng Yang. In 2010, we developed the first Bayesian method for species delimitation using multilocus sequence data under the multispecies coalescent (MSC) model [1]. The MSC provides a natural framework for accommodating the gene tree discordance caused by incomplete lineage sorting and for generating posterior probabilities for different species delimitation models. The initial method required a user-specified guide tree. We subsequently improved the MCMC algorithms [2], and a collaboration with Chi Zhang, a visiting student in my group, and Ziheng Yang demonstrated that the guided method can be robust to errors in guide-tree inference [3]. We also developed an “unguided” method that simultaneously infers the species tree and the species delimitation [4]. I also wrote a review article discussing both the art (guided by experience, expert knowledge and intuition) and the science (guided by formal algorithms) of species delimitation in the context of modern genomic data [5].
We have since extended the approach in several directions. Ziheng Yang and I showed that the MSC framework can provide significant improvements over DNA barcoding for species identification [6]. With Adam Leaché, Tianqi Zhu, and Ziheng Yang, I addressed the concern that model-based methods might over-split populations into species [7]. That paper distinguished Bayesian model selection from parameter estimation and examined how a previously proposed heuristic metric, the genealogical divergence index (gdi), could be applied to BPP MCMC output as an empirical criterion for species delimitation. Two book chapters provide comprehensive reviews and tutorials [8,9].
The papers
[1] Z. Yang and B. Rannala. 2010. Bayesian species delimitation using multilocus sequence data. Proceedings of the National Academy of Sciences 107: 9264-9269.
This paper introduced the first Bayesian method for species delimitation using multilocus sequence data under the MSC model. The method generates posterior probabilities of species assignments while accounting for uncertainties due to unknown gene trees and the ancestral coalescent process. A user-specified guide tree is used for tractability to avoid integrating over all possible species delimitations. The method was examined using simulations and illustrated by analyzing data from rotifers, fence lizards, and human populations.
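The role of the guide tree in limiting the model space can be illustrated with a small sketch. It assumes that the candidate delimitations are exactly those obtained by collapsing internal nodes of a rooted binary guide tree, where collapsing a node merges every population below it into one species. The nested-tuple tree encoding and the function names are hypothetical and are not taken from BPP.

```python
from typing import List, Tuple, Union

# A guide tree is a population name (leaf) or a pair (left, right) of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]
Model = List[Tuple[str, ...]]  # one tuple of population names per species


def leaves(tree: Tree) -> List[str]:
    """Return the population names below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def delimitations(tree: Tree) -> List[Model]:
    """List every delimitation obtained by collapsing internal nodes.

    A collapsed node implies that all of its descendants are collapsed too,
    so each node is either one merged species or a split whose two sides
    are delimited independently.
    """
    merged: Model = [tuple(leaves(tree))]  # this node collapsed: one species
    if isinstance(tree, str):
        return [merged]
    left, right = tree
    models = [merged]
    # Node kept as a speciation event: combine the models of both subtrees.
    for left_model in delimitations(left):
        for right_model in delimitations(right):
            models.append(left_model + right_model)
    return models


# Hypothetical four-population guide tree.
guide_tree: Tree = (("A", "B"), ("C", "D"))
models = delimitations(guide_tree)
for model in models:
    print(model)
print(len(models), "candidate delimitation models")
```

Under this assumption the number of models obeys the recursion f(leaf) = 1 and f(node) = 1 + f(left) × f(right), which is far smaller than the number of ways to partition the populations without a tree.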
[2] B. Rannala and Z. Yang. 2013. Improved reversible jump algorithms for Bayesian species delimitation. Genetics 194: 245-253.
This paper described several modifications to the reversible-jump MCMC (rjMCMC) algorithms used for Bayesian species delimitation. The original method suffered from poor mixing of the Markov chain. We proposed a flexible prior that allows the user to specify the probability that each node on the guide tree represents a true speciation event, and introduced modifications to the rjMCMC algorithms that remove the constraint on the new species divergence time when splitting and alter the gene trees to remove incompatibilities. The new algorithms substantially improved mixing of the Markov chain for both simulated and empirical datasets.
[3] C. Zhang, B. Rannala, and Z. Yang. 2014. Bayesian species delimitation can be robust to guide-tree inference errors. Systematic Biology 63: 993-1004.
A potential concern with Bayesian species delimitation is its dependence on a user-specified guide tree. This paper examined the robustness of the method to errors in guide-tree inference using simulations and empirical data. The results demonstrated that the method can produce reliable species delimitation results even when the guide tree is incorrectly estimated, alleviating an important practical concern for users of the method.
[4] Z. Yang and B. Rannala. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Molecular Biology and Evolution 31: 3125-3135.
This paper developed a method for simultaneous Bayesian inference of species delimitation and species phylogeny, eliminating the need for a user-specified guide tree. The nearest-neighbor interchange algorithm was adapted to propose changes to the species tree, with the gene trees for multiple loci altered in the proposal to avoid conflicts. A simulation study using six populations and three true species showed that the method tends to be conservative, with high posterior probabilities being a confident indicator of species status. The power to delimit species increases with divergence times and the number of loci. Reanalyses of cavefish and coast horned lizard data revealed considerable phylogenetic uncertainty even though the data were informative about species delimitation.
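As a rough illustration of the kind of tree move involved, the sketch below enumerates the rooted topologies that are one nearest-neighbor interchange away from a given species tree. It covers topology only: it ignores node ages and makes no attempt at the accompanying gene-tree changes described in the paper. The nested-tuple tree encoding and the function names are hypothetical.

```python
from typing import List, Tuple, Union

# A rooted binary tree is a tip name or a pair (left, right) of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def nni_neighbors(tree: Tree) -> List[Tree]:
    """Return the rooted topologies one nearest-neighbor interchange away."""
    if isinstance(tree, str):
        return []
    left, right = tree
    neighbors: List[Tree] = []
    # Branch above `left`: swap one child of `left` with its sibling `right`.
    if not isinstance(left, str):
        a, b = left
        neighbors.append(((right, b), a))
        neighbors.append(((a, right), b))
    # Branch above `right`: swap one child of `right` with its sibling `left`.
    if not isinstance(right, str):
        c, d = right
        neighbors.append((c, (left, d)))
        neighbors.append((d, (c, left)))
    # Interchanges on deeper branches leave the other subtree unchanged.
    neighbors.extend((sub, right) for sub in nni_neighbors(left))
    neighbors.extend((left, sub) for sub in nni_neighbors(right))
    return neighbors


def newick(tree: Tree) -> str:
    """Format a tree as a Newick-style string without branch lengths."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return "(" + newick(left) + "," + newick(right) + ")"


# Hypothetical four-species tree.
species_tree: Tree = (("A", "B"), ("C", "D"))
for neighbor in nni_neighbors(species_tree):
    print(newick(neighbor))
```

Each internal branch contributes two neighbors, so a rooted binary tree with n tips has 2(n - 2) of them. A sampler would propose one such neighbor at a time rather than enumerate them all.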
[5] B. Rannala. 2015. The art and science of species delimitation. Current Zoology 61: 846-853.
This review article discusses model-based Bayesian methods for species delimitation developed using the MSC, as well as several approximate methods and their limitations. Explicit species delimitation models have the advantage of clarifying more precisely what is being delimited and what assumptions are being made. Moreover, the methods can be very powerful when applied to large multi-locus datasets and thus take full advantage of data generated using modern sequencing technologies.
[6] Z. Yang and B. Rannala. 2017. Bayesian species identification under the multispecies coalescent provides significant improvements to DNA barcoding analyses. Molecular Ecology 26: 3028-3036.
This paper demonstrated three features of the MSC method implemented in BPP for species identification. First, with one locus the MSC can accurately assign individuals to species without the arbitrarily determined distance thresholds that barcoding methods require. Second, BPP can identify cryptic species that may be misidentified as a single species within the barcoding library. Third, taxon rarity does not present problems for species assignments using BPP: accurate assignments can be achieved even when only one or a few loci are available. The results address concerns that MSC methods may have problems analyzing rare taxa.
[7] A.D. Leaché, T. Zhu, B. Rannala, and Z. Yang. 2019. The spectre of too many species. Systematic Biology 68: 168-181.
This paper addressed the important concern that Bayesian species delimitation methods may detect population splits rather than species divergences and may tend to over-split when many loci are analyzed. We provided mathematical justifications for this behavior and pointed out that the distinction between population and species splits in the protracted speciation model has no influence on the generation of gene trees and sequence data. We explored how the genealogical divergence index (gdi) might be employed as an empirical criterion for determining species status among allopatric populations. We distinguished between Bayesian model selection and parameter estimation approaches and suggested that model selection is useful for identifying sympatric cryptic species, while parameter estimation may be used to implement empirical criteria for allopatric populations.
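The parameter estimation approach can be illustrated with a short sketch of a gdi calculation. It assumes the commonly used definition gdi = 1 - exp(-2τ/θ), where τ is the divergence time and θ is the population size parameter of the population being assessed, both measured in expected substitutions per site. The cutoffs of 0.2 and 0.7 are the rule-of-thumb values proposed with the original heuristic and are treated here as adjustable defaults. The function names and the sample values are hypothetical; the numbers are invented for illustration and are not BPP output.

```python
import math
from statistics import mean
from typing import List, Sequence


def gdi(tau: float, theta: float) -> float:
    """Genealogical divergence index, assumed here to be 1 - exp(-2*tau/theta)."""
    if tau < 0 or theta <= 0:
        raise ValueError("tau must be non-negative and theta must be positive")
    return 1.0 - math.exp(-2.0 * tau / theta)


def classify(value: float, lower: float = 0.2, upper: float = 0.7) -> str:
    """Apply rule-of-thumb cutoffs (assumed defaults, not fixed constants)."""
    if value < lower:
        return "single species"
    if value > upper:
        return "distinct species"
    return "ambiguous"


def gdi_from_samples(taus: Sequence[float], thetas: Sequence[float]) -> List[float]:
    """Compute gdi for each paired posterior sample of tau and theta."""
    if len(taus) != len(thetas):
        raise ValueError("tau and theta samples must have the same length")
    return [gdi(tau, theta) for tau, theta in zip(taus, thetas)]


# Invented posterior samples, for illustration only.
tau_samples = [0.0010, 0.0012, 0.0009, 0.0011]
theta_samples = [0.0040, 0.0050, 0.0045, 0.0042]

values = gdi_from_samples(tau_samples, theta_samples)
print(f"posterior mean gdi = {mean(values):.3f} ({classify(mean(values))})")
```

Computing gdi per sample rather than from posterior means of τ and θ keeps the uncertainty in both parameters, so an interval for gdi can be read directly from the resulting values.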
[8] B. Rannala and Z. Yang. 2020. Species Delimitation. In C. Scornavacca, F. Delsuc, and N. Galtier (eds.), Phylogenetics in the Genomic Era, pp. 5.5:1-5.5:18.
This book chapter provides a comprehensive review of the history of molecular species delimitation leading up to the genomic era. We describe the most widely used computational methods for species delimitation using single- and multi-locus genomic data, discuss relative strengths and weaknesses of the approaches, and propose a new method for delimiting species based on empirical criteria.
[9] T. Flouri, B. Rannala, and Z. Yang. 2020. A Tutorial on the Use of BPP for Species Tree Estimation and Species Delimitation. In C. Scornavacca, F. Delsuc, and N. Galtier (eds.), Phylogenetics in the Genomic Era, pp. 5.6:1-5.6:16.
This tutorial chapter illustrates the use of BPP for species tree estimation and species delimitation, providing practical guidelines on running BPP on multicore systems. BPP is a Bayesian MCMC program for analyzing multilocus sequence data under the MSC model with and without introgression, and can be used for estimation of population size and species divergence times, species tree estimation, species delimitation, and estimation of cross-species introgression intensity.