Species Divergence Time Inference
Introduction
From 2003 to 2025 I authored or coauthored nine papers on estimating species divergence times from DNA sequence data. Divergence time estimation is one of the most important applications of molecular phylogenetics, providing a temporal framework for understanding the history of life. The idea that DNA sequences accumulate substitutions at a roughly constant rate over time (the molecular clock hypothesis, proposed by Zuckerkandl and Pauling in the early 1960s) opened the possibility of using molecular data to date speciation events. However, it was soon recognized that rates of molecular evolution vary among lineages, and accommodating this variation has been a major challenge for the field.
My work on divergence time estimation began with a 2003 paper with Ziheng Yang that developed a Bayesian method for simultaneous estimation of species divergence times and ancestral population sizes under the multispecies coalescent (MSC) model [1]. This was one of the earliest implementations of the MSC for analyzing multi-locus data. We subsequently developed methods for incorporating multiple fossil calibrations with flexible statistical distributions (“soft bounds”) rather than the fixed calibrations used previously [2], and extended the approach to allow rate variation among lineages under relaxed-clock models [3]. A book chapter with Ziheng Yang reviewed the history and methodology of molecular clock dating [4]. I also addressed several conceptual issues in Bayesian divergence time estimation, including the non-identifiability of time parameters without fossil calibrations and the potential for conflicts between fossil calibrations and sequence data to produce misleading results [5].
More recently, my collaborators and I have extended these methods in several directions. Tomas Flouri and colleagues extended the BPP program to incorporate relaxed clocks and more general substitution models within the MSC framework [6]. George Tiley and colleagues showed how cross-species gene flow can bias divergence time estimates when ignored, and demonstrated that the MSC with introgression model can accurately estimate divergence times even in the presence of gene flow [7]. Anna Nagel and colleagues developed an MSC model incorporating ancient DNA tip dates, demonstrating the importance of properly accounting for sample ages when analyzing ancient DNA [8]. Ziheng Yang and I also reviewed the theory and identifiability of birth-death process models used as priors on species divergence times [9].
The papers
[1] B. Rannala and Z. Yang. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164: 1645-1656.
This paper developed a Bayesian MCMC method for simultaneous estimation of species divergence times and current and ancestral effective population sizes using DNA sequence data from multiple loci. The method extracts information from conflicts among gene tree topologies and coalescent times to estimate ancestral population sizes under the MSC model. We applied the method to noncoding DNA sequences from humans and the great apes. With an informative prior for the human-chimpanzee divergence date, the population size of the common ancestor of the two species was estimated to be approximately 20,000, with a 95% credibility interval of 8,000 to 40,000.
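To illustrate why conflicts among gene tree topologies carry information about ancestral population size, the sketch below uses the standard three-species coalescent result: a gene tree disagrees with the species tree with probability (2/3)exp(-T), where T is the internal branch length in coalescent units. The function name and the parameterization (internal branch length tau in expected substitutions per site, theta = 4Nμ) are assumptions made for this example, not code from the paper.

```python
import math


def discordance_probability(tau_internal, theta):
    """Probability that a three-species gene tree conflicts with the species tree.

    tau_internal: length of the internal species-tree branch, in expected
                  substitutions per site (assumed parameterization).
    theta:        4 * N * mu for the ancestral population on that branch.
    """
    if tau_internal < 0 or theta <= 0:
        raise ValueError("tau_internal must be >= 0 and theta must be > 0")
    # 2 * tau / theta converts the branch length to coalescent units (2N generations).
    coalescent_units = 2.0 * tau_internal / theta
    # With probability exp(-T) the two lineages fail to coalesce on the internal
    # branch; all three topologies are then equally likely, and two are discordant.
    return (2.0 / 3.0) * math.exp(-coalescent_units)


# Larger ancestral populations (larger theta) produce more gene tree conflict
# for the same internal branch length.
for theta in (0.001, 0.005, 0.02):
    print(theta, discordance_probability(0.002, theta))
```

Because the mismatch probability depends on the ratio of branch length to theta, the observed frequency of conflicting gene trees constrains the ancestral population size once the divergence times are informed by other data or by the prior.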
[2] Z. Yang and B. Rannala. 2006. Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Molecular Biology and Evolution 23: 212-226.
This paper implemented a Bayesian MCMC algorithm for estimating species divergence times using heterogeneous data from multiple gene loci and multiple fossil calibration nodes. We proposed a new approach using “soft bounds” – arbitrary and flexible statistical distributions describing uncertainties in fossil dates where the probability that the true divergence time is outside the bounds is small but nonzero. Analyses of both real and simulated data demonstrated that soft bounds allow sequence data to correct poor calibrations (which hard bounds cannot), eliminate the need for unrealistically high upper bounds, and allow more reliable assessment of estimation errors.
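A soft-bounded calibration can be sketched as a density that is flat between the minimum and maximum ages and has thin tails outside them. The construction below is one possible form, assumed here for illustration: a power-law left tail and an exponential right tail, each holding 2.5% of the probability and scaled so the density is continuous at both bounds. The function name, the tail shapes, and the example ages are assumptions, not a transcription of the published implementation.

```python
import math


def soft_bound_density(t, t_lower, t_upper, tail=0.025):
    """Density at age t of a calibration with soft bounds (t_lower, t_upper).

    The interval between the bounds holds probability 1 - 2 * tail;
    each tail outside the bounds holds probability `tail`.
    """
    if not (0.0 < t_lower < t_upper) or not (0.0 < tail < 0.5):
        raise ValueError("need 0 < t_lower < t_upper and 0 < tail < 0.5")
    if t <= 0.0:
        return 0.0
    # Height of the flat part of the density between the two bounds.
    height = (1.0 - 2.0 * tail) / (t_upper - t_lower)
    if t < t_lower:
        # Power-law left tail; theta1 is chosen so the density equals
        # `height` at t_lower and integrates to `tail` over (0, t_lower).
        theta1 = height * t_lower / tail
        return tail * (theta1 / t_lower) * (t / t_lower) ** (theta1 - 1.0)
    if t <= t_upper:
        return height
    # Exponential right tail; theta2 is chosen so the density equals
    # `height` at t_upper and integrates to `tail` over (t_upper, infinity).
    theta2 = height / tail
    return tail * theta2 * math.exp(-theta2 * (t - t_upper))


# Hypothetical calibration: divergence between 6 and 8 time units ago.
for age in (5.5, 6.0, 7.0, 8.0, 8.5):
    print(age, soft_bound_density(age, 6.0, 8.0))
```

Ages outside the bounds receive small but nonzero density, so strong sequence evidence can pull the posterior past a poorly chosen bound, which a hard bound (zero density outside) would forbid.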
[3] B. Rannala and Z. Yang. 2007. Inferring speciation times under an episodic molecular clock. Systematic Biology 56: 453-466.
This paper extended our Bayesian divergence time estimation algorithm to allow variable evolutionary rates among lineages under relaxed molecular clock models. We implemented two models: one with autocorrelated rates among adjacent lineages based on a geometric Brownian motion model of rate drift, and one with independent rates among lineages specified by a log-normal distribution. We developed an infinite-sites theory predicting that when the amount of sequence data approaches infinity, the width of the posterior credibility interval is determined by uncertainties in fossil calibrations. Simulations confirmed that posterior time estimates typically involve considerable uncertainties even with infinite sequence data, underscoring the critical importance of fossil calibration reliability.
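The difference between the two relaxed-clock models can be sketched by drawing rates along a single lineage. This is a simplified illustration: the function names are hypothetical, the elapsed time between a branch and its ancestor is taken to be the branch duration, and the log-normal is parameterized so that the expected rate equals the mean (or the parent rate). These choices are assumptions for the example and may differ in detail from the published models.

```python
import math
import random


def independent_rate(mean_rate, sigma2, rng):
    """Draw a branch rate from a log-normal distribution with expectation mean_rate."""
    return math.exp(rng.gauss(math.log(mean_rate) - sigma2 / 2.0, math.sqrt(sigma2)))


def autocorrelated_rate(parent_rate, elapsed_time, sigma2, rng):
    """Draw a descendant rate under geometric Brownian motion.

    The log rate drifts from the parent's log rate with variance
    sigma2 * elapsed_time; the expected rate equals parent_rate.
    """
    variance = sigma2 * elapsed_time
    return math.exp(rng.gauss(math.log(parent_rate) - variance / 2.0, math.sqrt(variance)))


def rates_along_lineage(root_rate, branch_durations, sigma2, rng, autocorrelated):
    """Return one rate per branch along a root-to-tip path."""
    rates = []
    current = root_rate
    for duration in branch_durations:
        if autocorrelated:
            # Each rate is centred on the rate of the branch above it.
            current = autocorrelated_rate(current, duration, sigma2, rng)
            rates.append(current)
        else:
            # Each rate is centred on the overall mean, ignoring ancestry.
            rates.append(independent_rate(root_rate, sigma2, rng))
    return rates


rng = random.Random(1)
durations = [0.5, 0.5, 1.0, 2.0]
print(rates_along_lineage(1.0, durations, 0.2, rng, autocorrelated=True))
print(rates_along_lineage(1.0, durations, 0.2, rng, autocorrelated=False))
```

Under the autocorrelated model, rates on short adjacent branches stay close to each other; under the independent model, neighbouring branches are no more similar than distant ones.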
[4] Z. Yang and B. Rannala. 2013. Molecular clock dating. In J.B. Losos (ed.), The Princeton Guide to Evolution. Princeton University Press.
This book chapter reviews the history of the molecular clock, its impact on molecular evolution, and the controversies surrounding mechanisms of evolutionary rate variation and the application of the clock to date species divergences. We review current molecular clock dating methods, including maximum likelihood and Bayesian methods, with an emphasis on relaxing the clock and on incorporating uncertainties into fossil calibrations.
[5] B. Rannala. 2016. Conceptual issues in Bayesian divergence time estimation. Philosophical Transactions of the Royal Society B 371: 20150134.
This paper examines several conceptual issues in Bayesian divergence time estimation. Divergence time parameters are not identifiable unless both fossil calibrations and sequence data are available. I show that commonly used marginal priors on divergence times derived from fossil calibrations may conflict with node order on the phylogenetic tree, causing a change in the effective prior. A topology-consistent prior that preserves the marginal priors is defined. I also demonstrate that conflicts between fossil calibrations and relative branch lengths can cause estimates of divergence times that are grossly incorrect yet have narrow posterior distributions, and recommend that overly narrow posteriors be carefully scrutinized.
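The identifiability problem can be seen in a minimal two-sequence example. Under the Jukes-Cantor (JC69) model, used here only as a convenient assumption, the likelihood depends on rate and time solely through their product, so sequence data alone cannot separate a fast rate over a short time from a slow rate over a long time. The function name and the numbers are hypothetical.

```python
import math


def jc69_log_likelihood(n_sites, n_diffs, rate, time):
    """Log-likelihood (up to an additive constant) of pairwise sequence data.

    Two sequences diverged `time` ago and evolve at `rate` substitutions
    per site per time unit, so their expected distance is 2 * rate * time.
    Both rate and time must be positive.
    """
    distance = 2.0 * rate * time
    # JC69 probability that a site differs between the two sequences.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))
    return n_diffs * math.log(p_diff) + (n_sites - n_diffs) * math.log(1.0 - p_diff)


# Doubling the rate while halving the time leaves the likelihood unchanged.
slow_old = jc69_log_likelihood(1000, 50, rate=0.001, time=10.0)
fast_young = jc69_log_likelihood(1000, 50, rate=0.002, time=5.0)
print(slow_old, fast_young)
```

A fossil calibration (or an informative prior on the rate) is what breaks this tie, which is why the posterior for divergence times stays sensitive to the calibrations however much sequence data is added.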
[6] T. Flouri, J. Huang, X. Jiao, P. Kapli, B. Rannala, and Z. Yang. 2022. Bayesian phylogenetic inference using relaxed-clocks and the multispecies coalescent. Molecular Biology and Evolution 39: msac161.
This paper extended the BPP program to incorporate more general substitution models and relaxed clocks within the MSC framework, allowing rate variation among species. The MSC-with-relaxed-clock model enables estimation of species divergence times and ancestral population sizes when the strict clock assumption is violated. Simulations confirmed that the strict clock model is adequate for closely related species, but accounting for clock violation is important for more distant species. The relaxed-clock models in BPP are able to extract phylogenetic information from gene-tree branch lengths even when the molecular clock assumption is seriously violated.
[7] G.P. Tiley, T. Flouri, X. Jiao, J.W. Poelstra, B. Xu, T. Zhu, B. Rannala, A.D. Yoder, and Z. Yang. 2023. Estimation of species divergence times in presence of cross-species gene flow. Systematic Biology 72: syad015.
This paper used simulations to demonstrate that even small amounts of cross-species introgression can bias divergence time estimates when gene flow is ignored in the analysis. The MSC with introgression (MSci) model is capable of accurately estimating both divergence times and ancestral effective population sizes, even when only a single diploid individual per species is sampled. We characterized biases under three scenarios: introgression between sister species, between non-sister species, and from an unsampled outgroup lineage. Simulations under the isolation-with-migration model showed that the MSci model assuming episodic gene flow accurately estimated divergence times despite high levels of continuous gene flow. Empirical analyses of baobab and Jaltomata datasets confirmed these findings.
[8] A.A. Nagel, T. Flouri, Z. Yang, and B. Rannala. 2024. Bayesian Inference Under the Multispecies Coalescent with Ancient DNA Sequences. Systematic Biology 73: syae047.
This paper developed an MSC model with tip dates for analyzing ancient DNA (aDNA) and implemented it in BPP. If aDNA samples are sufficiently old, expected branch lengths are reduced relative to contemporary samples, which must be accounted for by incorporating sample ages. The method performed well for biologically realistic scenarios, estimating calibrated divergence times and mutation rates precisely. Simulations suggested that estimation precision is best improved by sampling many loci and more ancient samples. Incorrectly treating ancient samples as contemporary – a common practice – led to large systematic biases in divergence time estimates. We demonstrated the method’s utility by analyzing genomic datasets of mammoths and elephants.
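The effect of sample age on branch lengths can be sketched with a toy calculation. It assumes a strict clock with a known rate and ignores coalescent variation and population size entirely, so it is far simpler than the MSC model in the paper; the function names and numbers are hypothetical.

```python
def expected_distance(divergence_time, sample_age, rate):
    """Expected substitutions per site between a modern and an ancient sample.

    The modern lineage accumulates changes for the full divergence_time;
    the ancient lineage stops accumulating changes at sample_age.
    """
    if not 0.0 <= sample_age <= divergence_time:
        raise ValueError("sample_age must lie between 0 and divergence_time")
    return rate * divergence_time + rate * (divergence_time - sample_age)


def naive_divergence_time(distance, rate):
    """Estimate that wrongly treats both samples as contemporary."""
    return distance / (2.0 * rate)


def tip_dated_divergence_time(distance, sample_age, rate):
    """Estimate that accounts for the age of the ancient sample."""
    return (distance / rate + sample_age) / 2.0


# Hypothetical values: divergence 500,000 years ago, sample 100,000 years old.
true_time, age, mu = 500_000.0, 100_000.0, 1e-8
d = expected_distance(true_time, age, mu)
print(naive_divergence_time(d, mu))
print(tip_dated_divergence_time(d, age, mu))
```

In this toy setting, ignoring the sample age shifts the estimate by half the age of the sample, so the error grows with the age of the sample relative to the divergence being dated.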
[9] B. Rannala and Z. Yang. 2025. Reading tree leaves: inferring speciation and extinction processes using phylogenies. Philosophical Transactions of the Royal Society B 380: 20230309.
This paper reviews the probability theory underpinning the generalized birth-death process (GBDP) as a model of cladogenesis, which is widely used as a prior for species divergence times. The GBDP allows speciation and extinction rates to be arbitrary functions of time. We review recent findings concerning identifiability: the GBDP with arbitrary continuous rate functions is non-identifiable from lineage-through-time data, meaning the parameters cannot be estimated even with infinitely large phylogenies. However, a restricted class with piecewise-constant rates has been shown to be identifiable. We illustrate these findings through examples and discuss implications for biologists interested in inferring the past tempo and mode of evolution.
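As a concrete picture of the piecewise-constant case, the sketch below simulates the number of lineages forward in time when speciation and extinction rates change only at fixed epoch boundaries. It tracks the full process (including lineages that later go extinct), not the reconstructed tree or its lineage-through-time curve, and the function name and rate values are hypothetical choices for illustration.

```python
import random


def simulate_lineage_count(epoch_ends, birth_rates, death_rates, rng):
    """Simulate a birth-death process with piecewise-constant rates.

    epoch_ends:  increasing forward times at which each epoch ends.
    birth_rates: per-lineage speciation rate in each epoch.
    death_rates: per-lineage extinction rate in each epoch.
    Returns the number of lineages alive at the end of the last epoch,
    starting from a single lineage at time 0.
    """
    n = 1
    t = 0.0
    for end, birth, death in zip(epoch_ends, birth_rates, death_rates):
        while n > 0:
            total_rate = n * (birth + death)
            if total_rate == 0.0:
                break  # nothing can happen in this epoch
            wait = rng.expovariate(total_rate)
            if t + wait >= end:
                # The next event would fall after the epoch boundary; by the
                # memoryless property we can discard it and redraw with the
                # next epoch's rates.
                break
            t += wait
            if rng.random() < birth / (birth + death):
                n += 1  # speciation
            else:
                n -= 1  # extinction
        t = end
    return n


rng = random.Random(7)
# Two epochs: rapid early diversification, then slower growth with more extinction.
counts = [simulate_lineage_count([1.0, 3.0], [2.0, 0.5], [0.2, 0.4], rng) for _ in range(5)]
print(counts)
```

Each epoch contributes one speciation rate and one extinction rate, so the model has finitely many parameters, in contrast to the arbitrary continuous rate functions of the general GBDP.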