Systematics, Phytogeography, and Evolution
Section of Evolution and Ecology, University of California, Davis, California 95616 USA
Received for publication August 3, 2000. Accepted for publication February 13, 2001.
| ABSTRACT |
|---|
140–190 mya (Early Jurassic–earliest Cretaceous). Approximate 95% confidence intervals on ages are wider for rbcL than 18S, ranging up to 160 my for phylogenetic uncertainty, 90 my for substitutional noise, and 70 my for lineage effects. These intervals overlap the oldest occurrences of angiosperms in the fossil record, as well as some estimates from previous molecular studies.
Key Words: angiosperms; confidence intervals; fossil record; molecular clock; rbcL; 18S rDNA
| INTRODUCTION |
|---|
In this paper, we address the possibility that some of the apparent conflict between molecular and fossil estimates may stem from insufficient attention to sources of error and assessment of confidence limits on age estimates based on molecular data. Because of the potential importance of deviations from true global rate constancy, we consider a much larger sample of taxa than previous age studies. First, we present experiments with data from two genes that have been widely studied for this and related problems, the chloroplast gene rbcL and 18S nuclear rDNA (ribosomal DNA), which suggest that errors in tree topology and variation in rates among lineages can lead to erroneous age estimates. Second, we attempt to obtain a more reliable assessment of the confidence interval on molecular age estimates based on rbcL and 18S data, which allows us to quantify several potential sources of error in these estimates.
Previous estimates
Until the 1960s, it was widely assumed that angiosperms originated long before their first unquestioned fossil record in the mid-Early Cretaceous, based on assignment of Cretaceous fossils (mostly leaves) to diverse and advanced extant taxa (Axelrod, 1952, 1970). However, more recent studies of fossil pollen, leaves, flowers, and fruits have indicated that Early Cretaceous angiosperms were far less advanced than previously believed and have painted a coherent picture of rapid morphological diversification, which in its specifics agrees with views on angiosperm evolution based on modern plants (Doyle, 1969, 1978; Muller, 1970, 1981; Doyle and Hickey, 1976; Friis and Crepet, 1987; Doyle and Donoghue, 1993; Friis, Pedersen, and Crane, 1994; Crane, Friis, and Pedersen, 1995). At present, the oldest definite angiosperm fossils are pollen grains of Valanginian or Hauterivian age, ~130 mya (million years ago) (Trevisan, 1988; Hughes, 1994; Brenner, 1996); a supposed Jurassic record (Sun et al., 1998) has been redated as Early Cretaceous (Swisher et al., 1999). These data suggest that angiosperms may have originated barely before their first fossil records, although they do not rule out the existence of older angiosperms that were rare and plesiomorphic.
The application of phylogenetic thinking to living and fossil seed plants has also affected this discussion. Any extant group has two ages: the age at which its stem lineage branched from the line leading to its extant sister group and the age of the most recent common ancestor of all its living members or the crown group (Hennig, 1965; Jefferies, 1979). Following Doyle and Donoghue (1993), we restrict the term "angiosperms" to the crown group; this is the age addressed by molecular studies. Most phylogenetic analyses based on morphology have indicated that the sister group of angiosperms is Gnetales, Gnetales plus Bennettitales, or Caytonia (Crane, 1985; Doyle and Donoghue, 1986; Loconte and Stevenson, 1990; Rothwell and Serbet, 1994; Doyle, 1996). Since all these taxa are known back to the Late Triassic, these results imply that the angiosperm stem lineage is also this old. However, the crown group could be much younger, especially considering the many apomorphies that distinguish angiosperms from other seed plants and the plesiomorphic nature of Early Cretaceous fossils. Molecular analyses have generally refuted the relationship of angiosperms and Gnetales, and several indicate that angiosperms and extant gymnosperms are sister groups, pushing the angiosperm stem lineage back to the mid-Carboniferous (Goremykin et al., 1996; Chaw et al., 1997, 2000; Hansen et al., 1999; Qiu et al., 1999; Winter et al., 1999; Bowe, Coat, and dePamphilis, 2000; Donoghue and Doyle, 2000). However, this does not rule out a relationship of angiosperms with Mesozoic groups such as Bennettitales or Caytonia, and it does not relate directly to the age of the crown group.
The first molecular studies gave far older ages for the angiosperms than their oldest fossil records. Ramshaw et al. (1972) obtained an estimate of 350–420 mya (Late Silurian-Mississippian) based on amino acid sequences of cytochrome c, calibrated with the bird–mammal split. Using nonsynonymous substitutions in the nuclear gene gapC, calibrated with the animal fossil record and the presumed divergence of plants, animals, and fungi at 1000 mya, Martin, Gierl, and Saedler (1989) dated the split between monocots (two grasses) and dicots (Magnolia and six eudicots) as 319 mya (mid-Carboniferous). This is more than twice the age of the oldest fossils; at that time, the most advanced known seed plants were "seed ferns" more plesiomorphic than all living seed plants, to say nothing of angiosperms. Martin, Gierl, and Saedler (1989) dismissed the concept of a Cretaceous origin as based on negative evidence and suggested that their results favored the views of Axelrod (1952, 1970). However, Crane et al. (1989) argued that the conflict with the fossil record is not so easy to explain away. In particular, Martin, Gierl, and Saedler dated the common ancestor of eudicots as 276 mya (Permian), but eudicots (a strongly supported monophyletic group: Chase et al., 1993; Soltis et al., 1998; Qiu et al., 1999; Soltis, Soltis, and Chase, 1999) are united by tricolpate pollen, which has a dense fossil record, beginning in the late Barremian (120 mya: Doyle, 1992; Hughes, 1994) and becoming ubiquitous in the Albian (110 mya). Furthermore, Albian eudicots represent lines near the base of this clade (Doyle, 1998b; Magallón, Crane, and Herendeen, 1999).
Subsequent studies made the improvement of calibrating dates with other land plants. Some have given more recent ages, though still pre-Cretaceous. Wolfe et al. (1989) dated the angiosperms as 200 mya (Early Jurassic), using rRNA (ribosomal RNA) sequences, several chloroplast genes, and two calibrations: the divergence of three grasses at 60 mya and the split of liverworts from other land plants at 400 mya (Early Devonian), which is probably 50 my (million years) too recent (vascular plant megafossils extend back to the Middle Silurian and land plant spores to the Middle Ordovician: Kenrick and Crane, 1997). For rRNA, they also had a cycad sequence; this diverged from angiosperms at 340 mya (Mississippian), which is consistent with fossil data. Laroche, Li, and Bousquet (1995) also dated angiosperms at 200 mya, based on nonsynonymous substitutions in several mitochondrial genes, calibrated with grasses and legumes. However, other studies with improved calibrations have given older ages. Martin et al. (1993) added a liverwort and a conifer and used nonsynonymous substitutions in both gapC and rbcL; assuming that liverworts diverged at 450 mya (Late Ordovician) and conifers at 330 mya (Late Mississippian), they dated the monocot–dicot split as 300 mya (Late Pennsylvanian). In a study of chloroplast transfer RNAs, calibrated with divergence of a liverwort and two grasses, Brandl, Mann, and Sprinzl (1992) also obtained a 300 mya age for angiosperms.
The youngest estimate so far was obtained by Goremykin, Hansmann, and Martin (1997), based on protein sequences of 58 genes from six completely sequenced chloroplast genomes (Porphyra, Marchantia, Pinus, Nicotiana, Oryza, Zea). Assuming that Marchantia diverged at 450 mya, these authors dated the angiosperms as 160 mya (Late Jurassic) and the split between Pinus and angiosperms as 348 mya (Early Carboniferous), which they noted is more congruent with fossil evidence than their earlier results (Martin, Gierl, and Saedler, 1989; Martin et al., 1993). However, they found strong lineage-specific rate variation in the two grass genomes and therefore calculated the angiosperm age from the root node to Nicotiana only. Thus, although their analysis used an unprecedented number of genes, their dates were based on a very small number of taxa.
Sanderson (1997) used an experimental method (NPRS) for reconstructing ages in the absence of a molecular clock, which smooths local variations in rates by an optimization algorithm. Based on 36 land plant rbcL sequences and a land plant calibration of 450 mya, he obtained an estimate of 165 mya (Middle Jurassic). Using the same rbcL data set, Thorne, Kishino, and Painter (1998, fig. 3) used a model-based Bayesian approach to calculate that the angiosperm root node is 51% as old as the most recent common ancestor of vascular plants (i.e., ~200 mya, Early Jurassic). Both methods assume an autocorrelation in rates of molecular evolution across the tree, the presence or magnitude of which has yet to be determined.
Sources of error in estimating divergence times
These dates are in considerable conflict with each other and with the fossil record. Some of this conflict can be attributed to biases in the data or the statistical estimation methods used, but much of it is probably due to stochastic and deterministic aspects of the molecular evolutionary process itself, especially rate variation across lineages, or "lineage effects" (Britten, 1986; Gillespie, 1991; Gaut, Muse, and Clegg, 1993; Avise, 1994; Clegg et al., 1994; Nickrent and Starr, 1994; Li, 1997; Yang and Nielsen, 1998). Even with a stochastically constant rate, substitutional noise imposes an absolute lower bound on errors in age estimates (Kumar, Tamura, and Nei, 1993; Hillis, Mable, and Moritz, 1996). Variation in rate across sites causes sequence divergences to be estimated incorrectly, most severely at high rates (Gillespie, 1986; Yang, 1996) and high rate variability (Kelly and Rice, 1996; Miyamoto and Fitch, 1996; Yang, 1996). Still other errors relate to the underlying phylogenetic context for molecular divergence, including incorrect phylogenies and calibrations that associate fossil ages with the wrong nodes of a tree.
Several of the angiosperm studies reported the error rate in estimation of branch lengths due to substitutional noise (e.g., Goremykin, Hansmann, and Martin, 1997), but only Martin, Gierl, and Saedler (1989), Martin et al. (1993), and Sanderson (1997) used it to assess the corresponding errors in age estimates. Several studies tested for lineage effects, but only Wolfe et al. (1989) assessed the error component due to these. Wolfe et al. (1989), Brandl, Mann, and Sprinzl (1992), Laroche, Li, and Bousquet (1995), and Goremykin, Hansmann, and Martin (1997) considered calibration error (although the last authors, concluding that substitutional noise was relatively low, subsumed it in the calibration error). None of these studies considered between-site sequence rate heterogeneity or choice of the tree used in deriving age estimates. The ideal tree, of course, would be the true tree. Most studies have used trees derived from phylogenetic analysis of each gene under study, but many of these are clearly incorrect as species trees, since they differ from each other.
In order to evaluate these results, we undertook our own analyses of rbcL and 18S data, designed to probe the various sources of error, reasons why estimates have varied so much, and ways to obtain better estimates. Our taxon sampling (modified from Sanderson, 1997) was designed to span critical nodes, provide an adequate sample of extant outgroups, and allow comparisons with previous studies and fossil evidence on the ages of nodes. First, we present a series of analyses that illustrate the effect of various factors on point estimates of the age of angiosperms: variations in tree topology, models for nucleotide substitution (with and without rate variation), sampling of taxa with different rates of evolution (lineage effects), and use of first and second vs. third codon positions (an approximation of nonsynonymous vs. synonymous substitutions). Second, we present a series of resampling experiments designed to provide a statistical estimate of the relative magnitude of errors due to these factors.
| MATERIALS AND METHODS |
|---|
1842 bp, excluding poorly aligned segments; Chaw et al., 1997
Taxa sampled
The 37 taxa in our data set comprise 22 angiosperms, 9 other seed plants, 5 other land plants, and Chara, one of the most closely related green algae, to root land plants (Mishler et al., 1994).
To span the root node of extant angiosperms, we included a variety of "magnoliid" taxa, based on current understanding of angiosperm relationships. Analyses of atpB (Savolainen et al., 2000), phytochrome genes (Mathews and Donoghue, 1999), a combined 18S, rbcL, and atpB data set (Soltis et al., 1998; Soltis, Soltis, and Chase, 1999), and five-gene data sets including mitochondrial genes (Parkinson, Adams, and Palmer, 1999; Qiu et al., 1999) indicate that Amborella is the sister group of all other angiosperms, followed by Nymphaeales and then a clade consisting of Austrobaileya, Trimeniaceae, and Illiciales, in agreement with earlier analyses that placed Nymphaeales at the base of angiosperms (Hamby and Zimmer, 1992; Doyle, Donoghue, and Zimmer, 1994; Goremykin et al., 1996). Other analyses link Amborella with Nymphaeales or reverse these two taxa (Barkman et al., 2000; Graham and Olmstead, 2000; Qiu et al., 2000), but these lines are still basal to other angiosperms. We represented these basal lines with Amborella, Nymphaea, and Austrobaileya, and other magnoliid clades (APG, 1998; Qiu et al., 1999; Soltis, Soltis, and Chase, 1999) with Magnolia (Magnoliales), Calycanthus and either Persea or Sassafras (Laurales), Drimys (Winteraceae), Saururus (Piperales), and Chloranthus (Chloranthaceae). We did not include Ceratophyllum, which is sister to all other angiosperms in trees based on rbcL (Chase et al., 1993), because it is never basal in analyses of other genes. If we had included Ceratophyllum, it would be unclear to what extent our conclusions were a function of this anomalous rooting, without performing additional experiments with topological constraints.
For other seed plants, we included the three genera of Gnetales, Ginkgo, and Cycas and Zamia, the latter representing the basal split in Cycadales. Pinaceae (plus Gnetales in some studies) are the sister group of other conifers in molecular analyses (Chaw et al., 1997, 2000; Stefanovic et al., 1998; Qiu et al., 1999; Bowe, Coat, and dePamphilis, 2000); to span the basal conifer node, we used Picea (Pinaceae), Podocarpus (Podocarpaceae), and Taxus (Taxaceae). In ferns, Osmunda represents Osmundaceae, the probable sister group of other Filicales (Pryer, Smith, and Skog, 1995), exemplified by Asplenium. Marchantia represents liverworts, which morphological and some molecular analyses identify as the sister group of other land plants (Mishler et al., 1994; Qiu et al., 1998). Although other molecular analyses place anthocerotes in this position (Nickrent et al., 2000), this should not be critical for our purposes, since Marchantia is the only bryophytic group in our data set, and at worst Marchantia represents a clade that diverged just one node above the base of land plants.
For 30 species, sequences were available for both genes. For the seven other taxa, we used a different exemplar of the same family for the two genes (18S/rbcL): Nageia/Afrocarpus (Podocarpaceae); Sassafras/Persea (Lauraceae); Calla/Spathiphyllum (Araceae); Veitchia/Drymophloeus (Palmae); Buxus/Pachysandra (Buxaceae); Arctostaphylos/Enkianthus (Ericaceae); Brunfelsia/Nicotiana (Solanaceae). This procedure may introduce some error because of changes in rate of evolution within families, but presumably these tend to be smaller than changes between families.
Trees
Because one of our goals was to clarify the effect of tree topology on age estimates, we examined a series of eight "standard" trees. Three of these were found by normal parsimony analysis of rbcL and 18S; the other five, intended to represent a range of current hypotheses on seed-plant phylogeny, were obtained by imposing topological constraints during parsimony analysis of rbcL, 18S, or the two data sets combined. Some of these constraints are not directly relevant to seed-plant relationships but were needed to correct anomalies elsewhere in the tree (e.g., in rooting of vascular plants or of angiosperms). These constraints and the reasoning behind their selection are described at the point where each tree is first discussed in the Results section. For these analyses, we used PAUP 3.1 (Swofford, 1991) to find most parsimonious trees, with 100 replicates using stepwise random addition of taxa, MULPARS (multiple most parsimonious trees), TBR (tree bisection-reconnection) branch swapping, and holding one tree at each step. For several subsequent analyses we used one of these trees, designated the "gnetifer" tree, in which Gnetales are the sister group of conifers and angiosperms are the sister group of other seed plants, as indicated by 18S data (Chaw et al., 1997, 2000; Bowe, Coat, and dePamphilis, 2000). Recent multigene analyses (Qiu et al., 1999; Bowe, Coat, and dePamphilis, 2000; Chaw et al., 2000) have produced somewhat different "gnepine" trees in which Gnetales are nested within now-paraphyletic conifers, linked with Pinaceae, but the gnetifer tree is more consistent with loss of the inverted repeat in the chloroplast genome of conifers but not Gnetales (Raubeson and Jansen, 1992b). For comparisons with trees of Martin, Gierl, and Saedler (1989) and Martin et al. (1993), we also examined trees including only three angiosperms comparable to those in their study, plus three other subsets of angiosperm taxa, designed to address problems of variation in rates of evolution.
Preliminary hypothesis testing
Prior to estimating ages, we undertook a round of hypothesis testing to infer the tempo and mode of evolution of these genes. We used ML (maximum likelihood) methods (Swofford et al., 1996; Huelsenbeck and Rannala, 1997) for estimation of evolutionary parameters and hypothesis testing. Several models of nucleotide substitution were examined, differing in complexity and number of parameters. The F81 ("Felsenstein 1981"), HKY85 ("Hasegawa-Kishino-Yano 1985"), and GTR (general time-reversible) models estimate one, two, and six parameters in the rate matrix, respectively (Swofford et al., 1996). Site-to-site rate variation was implemented using a gamma distribution of rates (denoted by adding "+ Γ" to the acronyms above, and referred to as "gamma" in the following discussion). The shape parameter of the gamma distribution is estimated from the data using a four-category discrete approximation. In the absence of rate constancy across lineages, there are also 2N − 2 branch length parameters to be estimated, where N is the number of taxa. Any of these models can have the additional assumption of rate constancy across lineages (molecular clock). This reduces the number of parameters associated with the tree to N − 2 internal node times (plus one overall rate). Clock models will be denoted by adding the suffix "+ cl" to the model's acronym. Unless otherwise noted, all ML analyses used PAUP* 4.0 (Swofford, 2000). In general, estimation of model parameters (other than branch lengths) is fairly insensitive to topology (Yang, Goldman, and Friday, 1995). Therefore, preliminary analyses were run only on the gnetifer tree.
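To make the rate-matrix parameterization concrete, the following is a minimal sketch of an HKY85 instantaneous rate matrix. The base frequencies and the values of the transition/transversion ratio and overall rate are made-up illustrative numbers, not estimates from the rbcL or 18S data, and the function name `hky85_rate_matrix` is hypothetical; in the analyses described here the estimation itself is done by PAUP*.

```python
# Minimal sketch of an HKY85 rate matrix (illustrative values only).
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(x, y):
    """True if x<->y is a transition (purine<->purine or pyrimidine<->pyrimidine)."""
    return (x in PURINES) == (y in PURINES)

def hky85_rate_matrix(freqs, kappa, mu=1.0):
    """Return Q as a dict of dicts; freqs maps base -> equilibrium frequency.

    The off-diagonal rate from x to y is mu * kappa * freqs[y] for transitions
    and mu * freqs[y] for transversions; each diagonal makes its row sum to zero.
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = mu * freqs[y] * (kappa if is_transition(x, y) else 1.0)
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q

# Hypothetical base frequencies and transition/transversion rate ratio.
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = hky85_rate_matrix(freqs, kappa=2.0)
```

Setting `kappa` to 1 removes the transition/transversion distinction, leaving a matrix that depends only on base frequencies and the overall rate.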
Likelihood ratio tests of one substitution model against a more complex alternative were used to test for goodness of fit of the model to the data (Huelsenbeck and Rannala, 1997), using the gnetifer tree. Degrees of freedom for the test are equal to the difference in the number of free parameters between the models. Models with and without rate variation across sites were tested against each other by assuming that both have gamma-distributed rates, but in one the shape parameter was left free, whereas in the other it was set to correspond to a constant rate across sites (by setting the shape parameter to infinity: Swofford et al., 1996). A complete battery of tests was run both with and without the assumption of a molecular clock.
Four "data partitions" were constructed a priori, consisting of (1) the entire 18S gene, (2) the entire rbcL gene, (3) the first and second codon positions of rbcL, and (4) the third positions of rbcL. Differences in the mode of molecular evolution were examined in pairs of these partitions using a likelihood ratio test. For each test, the null hypothesis was that the two partitions evolved together according to the same model with one set of rate parameters. The alternative hypothesis was that each evolved according to a separate model with two different sets of rate parameters. Likelihood ratio tests were performed on each of the standard trees. On a given tree the log likelihood of the null hypothesis can be calculated directly in PAUP*. For the alternative, it is necessary to exclude one partition and calculate the log likelihood of the other partition, then do the reverse, and sum the two log likelihoods to find the overall likelihood of the alternative model. This is not the same as a "partition homogeneity test" (or ILD, incongruence length difference: Farris et al., 1995), which tests whether the phylogenetic signal is homogeneous across positions. Joint tests of more than two partitions at a time are possible, but high heterogeneity in the pairwise tests immediately indicated it was unnecessary (see Results). The HKY85 + Γ substitution model was assumed in all tests, based on results from tests on the substitution model described above. The degrees of freedom are calculated as follows. For the model associated with one partition, there are two rate parameters, µ and κ, associated with the substitution matrix (Swofford et al., 1996), one shape parameter associated with the gamma-distributed rate variation, plus 2N − 2 = 35 branch length parameters, for a total of 38 parameters. If the genes were allowed to evolve according to separate models, the joint model would have 76 parameters. The null model, that two partitions combined are evolving according to a common model, has 38 parameters again, so the df are 76 − 38 = 38.
A likelihood ratio test was used to determine whether rates were constant across lineages (Felsenstein, 1988). The null model was HKY85 + Γ + cl with the alternative being HKY85 + Γ. The number of degrees of freedom in the likelihood ratio test is N − 2 if the tree is fully resolved, where N is the number of taxa (Felsenstein, 1993). The test was performed separately for the four data partitions, on all eight of the standard trees, for a total of 32 tests. Critical values for all likelihood ratio tests were obtained under the assumption that 2 log (LR) is distributed approximately as χ².
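The clock test can be sketched in a few lines. The example below is an illustration only: the log likelihoods are hypothetical numbers, not values from this study, the function names are invented for the sketch, and the chi-square tail probability is computed with a standard recurrence for the incomplete gamma function rather than with whatever tables or software the authors used.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def clock_lrt(lnl_clock, lnl_noclock, n_taxa):
    """Likelihood ratio test of the clock model (null) against the non-clock model."""
    stat = 2.0 * (lnl_noclock - lnl_clock)    # twice the log likelihood difference
    df = n_taxa - 2                           # N - 2 for a fully resolved tree
    return stat, df, chi2_sf(stat, df)

# Hypothetical log likelihoods for a 37-taxon tree (not values from this study).
stat, df, p = clock_lrt(lnl_clock=-15250.0, lnl_noclock=-15200.0, n_taxa=37)
```

With 37 taxa the test has 35 degrees of freedom, so a statistic of 100 as in this made-up case would reject rate constancy decisively.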
Point estimates of angiosperm age
The crown-group age of angiosperms was estimated by ML with PAUP*, assuming substitution models that include a molecular clock. Such analyses yield a tree that we call a "chronogram," in which branch lengths are proportional to time. Absolute ages are then assigned to individual nodes by calibrating some node in the tree. We calculated ages relative to the most recent common ancestor of land plants, to which we assigned an age of 450 mya (Late Ordovician), soon after the first appearance of land plant meiospores in the fossil record (Middle Ordovician). This is the same calibration used by other authors (e.g., Goremykin, Hansmann, and Martin, 1997). Such a fixed calibration should be distinguished from minimum or maximum age constraints on nodes, as used by Sanderson (1997); experiments with such constraints (Doyle, Magallón, and Sanderson, 2000) will be described elsewhere. Absolute ages for the geological time scale are based on Palmer (1983).
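As a worked illustration of a fixed calibration, the sketch below rescales node heights from a clock tree so that one node takes an assigned age. The node heights are hypothetical values chosen only to make the arithmetic easy to follow, and `calibrate_ages` is an invented helper name, not part of PAUP* or r8s.

```python
def calibrate_ages(relative_heights, calibration_node, calibration_age):
    """Scale clock-tree node heights to absolute ages (mya).

    relative_heights maps node name -> height above the tips in arbitrary
    clock units; the calibration node is assigned calibration_age and every
    other node is scaled by the same factor.
    """
    scale = calibration_age / relative_heights[calibration_node]
    return {node: height * scale for node, height in relative_heights.items()}

# Hypothetical node heights, chosen only for illustration.
heights = {"land_plants": 0.30, "seed_plants": 0.20, "angiosperms": 0.10}
ages = calibrate_ages(heights, "land_plants", 450.0)
```

Because every age is a fixed multiple of the calibration age, any error in the 450 mya figure propagates proportionally to all other nodes.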
Sensitivity analysis I: effects of gene, codon partition, model, and tree
To explore the sensitivity of age estimates to various factors, we first obtained such estimates under a wide range of specific conditions: different substitution models, genes, and codon partitions, and the set of eight standard trees. The effect of phylogenetic uncertainty, construed more broadly, is considered in the second set of analyses.
Sensitivity analysis II: effects of phylogenetic uncertainty, substitutional noise, and lineage effects
The factors described above entail finite and small numbers of alternatives, but other variables affecting age estimates entail a very large number of alternatives. Such factors include the phylogeny itself, which in reality must include many more possible alternatives than the eight treated here. Phylogenetic uncertainty has several sources, including substitutional noise (sampling from a finite number of stochastically evolving characters), which is often studied by bootstrapping (Felsenstein, 1985), and long-branch attraction, which is more difficult to detect (Felsenstein, 1978; Sanderson et al., 2000). Even if the phylogeny is essentially certain, substitutional noise introduces errors into age estimates on the tree, because of fluctuations in the numbers of substitutions occurring in a given interval of time. Finally, differences in rate between lineages may cause variation in age estimates.
To estimate the magnitude of error in age estimates due to phylogenetic uncertainty, we examined confidence sets of phylogenies (Sanderson, 1989; Sanderson and Wojciechowski, 1996; Baldwin and Sanderson, 1998) derived from the two genes. For each gene, one tree from each of 100 bootstrap replicates using parsimony (simple taxon addition sequence, MULPARS, TBR branch swapping, holding one tree at each step) was saved to a treefile (some replicates produced more than one most parsimonious tree). Maximum likelihood age estimation was then implemented on all of these trees using the original (unbootstrapped) data, the HKY85 + Γ + cl substitution model, and calibration procedures described above under point estimates. The resulting chronograms were written to a treefile, which was in turn parsed by the program "r8s," which was used to calibrate node ages using the land plant calibration and to summarize the results across all the trees. This program is available from MJS at http://loco.ucdavis.edu/r8s/r8s.html.
The procedure just described estimates the effect of character sampling on topology. To estimate the magnitude of error from substitutional noise independent of topology, we fixed the tree and bootstrapped the characters repeatedly, estimating the age of angiosperms for each bootstrap replicate. Bootstrap data matrices were generated using the SEQBOOT program in PHYLIP (Felsenstein, 1993), but instead of being used to generate trees, these matrices were used to estimate the age of the angiosperm node on the gnetifer tree. This was accomplished by placing all 100 randomized matrices in a batch file and translating them to NEXUS format, with each data block followed by PAUP* commands directing PAUP* to perform ML estimation on the gnetifer tree. To test whether the estimates obtained are sensitive to tree topology, we performed the same analysis on one of the trees most different from the gnetifer tree, the most parsimonious rbcL tree with Oryza basal in angiosperms (Fig. 2).
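The character bootstrap on a fixed tree can be sketched as follows. This is a toy illustration under stated assumptions: the three short sequences are invented, `bootstrap_columns` only mimics the site resampling that SEQBOOT performs, and the percentile interval is one simple way to summarize replicate ages, not necessarily the summary used in this study. The ML age estimation on each replicate, done here with PAUP*, is not reproduced.

```python
import math
import random

def bootstrap_columns(alignment, rng):
    """Resample alignment columns (sites) with replacement.

    alignment maps taxon -> sequence string; all sequences share one length.
    """
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in alignment.items()}

def percentile_interval(values, level=0.95):
    """Simple percentile interval from a list of replicate estimates."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lo = ordered[math.floor(tail * (len(ordered) - 1))]
    hi = ordered[math.ceil((1.0 - tail) * (len(ordered) - 1))]
    return lo, hi

# Invented toy alignment; real matrices would hold the rbcL or 18S sites.
rng = random.Random(0)
toy = {"Amborella": "ACGTACGTAC", "Oryza": "ACGTTCGTAA", "Marchantia": "ATGTACGAAC"}
replicate = bootstrap_columns(toy, rng)
```

Each replicate keeps the taxa and the tree fixed and changes only which sites are sampled, so the spread of the resulting ages reflects substitutional noise rather than topological uncertainty.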
| RESULTS AND DISCUSSION |
|---|
model was selected as a reasonable compromise among competing issues of bias, error variance, and running time (Zharkikh, 1994
model were extremely significant across all eight of the standard trees (Table 2), whether or not a molecular clock was assumed. Codon positions within rbcL were even more heterogeneous. Clearly, the tempo and mode of evolution differ among these data partitions, and for this reason we performed separate age estimations on the different partitions.
, of the likelihood ratio statistic indicates the amount of departure from rate constancy, but
values can only be compared within partitions. Generally, the trees that are most clocklike of the eight correspond to most parsimonious trees. For rbcL, the most clocklike tree for either codon partition is one of the most parsimonious trees derived from the rbcL data, namely the tree (almost surely incorrect) with Oryza basal in angiosperms. For 18S, the most clocklike tree is the most parsimonious tree derived from the 18S data. Reasons for this effect are suggested in the discussion of individual trees.
model, with and without gamma, and for first and second vs. third positions in rbcL, are presented in Table 4. Ages of other nodes of interest (especially seed plants, Gnetales, and eudicots) are given in the text or can be obtained from the chronograms.
20–30 my, than those estimated without gamma. The same effect is also seen in ages for eudicots, but its magnitude is less for older groups, such as seed plants. Because use of gamma is theoretically preferable, this suggests that previous studies systematically overestimated the age of angiosperms. To gain insight into these results, we examine estimates from the standard trees in more detail. First we present results for rbcL, then for 18S. Although there are significant effects due to codon position in rbcL, for purposes of discussing lineage effects, topology, and their interaction, we first discuss ages based on all codon positions.
Two of the 12 most parsimonious trees derived from the rbcL data set are shown as chronograms in Figs. 2 and 3. In both trees, the rooting of seed plants agrees with that found in other analyses of rbcL (Albert et al., 1994), although not with analyses of morphology and other genes, in that Gnetales are the sister group of other seed plants. However, they differ radically in the rooting of the angiosperms, and this shows the potentially major effect of erroneous tree topologies on age estimates.
In Fig. 2 ("rbcL.MP.Oryza" in Tables 2–4), angiosperms are rooted among monocots, with Oryza (representing grasses) the sister group of all other angiosperms. This tree implies that the age of angiosperms is 224 mya without gamma, 214 mya with gamma (both Late Triassic). This rooting conflicts sharply with trees based on larger rbcL data sets, to say nothing of other molecular analyses and conventional views of angiosperm evolution, which nest grasses within monocots and monocots within angiosperms (e.g., Chase et al., 1993; Soltis, Soltis, and Chase, 1999). The magnoliid groups, usually thought to form a basal paraphyletic grade, instead form a clade nested well within the angiosperms.
In Fig. 3 ("rbcL.MP.Ambo" in Tables 2–4), angiosperms are rooted among magnoliids, with Amborella branching first, followed by Nymphaea and then Austrobaileya. This rooting agrees with the multigene analyses of Mathews and Donoghue (1999), Parkinson, Adams, and Palmer (1999), Qiu et al. (1999), and Soltis, Soltis, and Chase (1999). In this case, the estimated age of angiosperms is much younger: 143 mya without gamma (earliest Cretaceous) and 124 mya with gamma, actually younger than the oldest undisputed fossil angiosperms (Valanginian-Hauterivian, ~130 mya: Trevisan, 1988; Hughes, 1994; Brenner, 1996). Considering the very short branch between Amborella and Nymphaea, trees in which these two lines form a clade (Barkman et al., 2000; Graham and Olmstead, 2000; Qiu et al., 2000) would presumably give similar dates.
To evaluate the impact of these topological variations (some of which must be incorrect), we will use the tree in Fig. 4 ("rbcL.mincon" in Tables 2–4), one of 12 trees found by analyzing the rbcL data set with two constraints designed to bring outgroup relationships more in line with other data, forcing Lycopodium to the base of vascular plants and conifers into a clade (although some analyses have nested Gnetales in conifers, they have not done so for Ginkgo, cycads, or angiosperms). These trees are only three steps longer than the shortest trees (2707 rather than 2704). In Fig. 4 Amborella is basal in angiosperms (though Oryza is basal in other trees); other relationships are generally consistent with analyses of more taxa. Since this tree is almost as parsimonious as the shortest trees, consistent with other rbcL analyses of seed-plant phylogeny, and consistent with other data on the rooting of angiosperms, we will use it as a basis for discussion of the effect of various factors on age estimates derived from this gene. Henceforth all ages cited are based on gamma (see Table 4 for ages without gamma).
Other anomalously young ages are seen within angiosperms. The Nelumbo-Platanus clade (Proteales; APG, 1998) is dated as 48 mya (Eocene), but both lines are known from the Albian, 100–110 mya (platanoid leaves and inflorescences, Nelumbites leaves and flowers: Friis, Crane, and Pedersen, 1988; Crane et al., 1993; Upchurch, Crane, and Drinnan, 1994). The Fagus-Carya clade (Fagales) is dated as 39 mya, but the line leading to Carya, represented by Normapolles pollen and associated flowers (Friis, 1983; Sims et al., 1999), extends back to the Cenomanian (95 mya). However, not all dates within angiosperms are too young: palms and grasses (commelinoids) diverge at 89 mya, and the oldest palm fossils are 85 mya (Herendeen and Crane, 1995). The Calycanthus-Lauraceae clade (Laurales) is dated as 89 mya; fossils related to both groups extend back to the Albian, 100–110 mya (Drinnan et al., 1990; Friis et al., 1994).
These results are clearly related to inequality of rates, that is, the fact that the data are not clocklike, as already indicated by likelihood ratio tests (Table 3). This is illustrated by Fig. 5, the tree in Fig. 4 plotted as a phylogram, so that branch lengths are proportional to the amount of molecular evolution. Within the angiosperms, some branches are long, notably Oryza, Pisum, and Solanaceae (represented in this data set by Nicotiana), all herbaceous groups. As noted above, this effect was recognized with rbcL by Bousquet et al. (1992), Gaut et al. (1992), and Eyre-Walker and Gaut (1997), who suggested that the rate variation was related to habit and/or generation time. In the absence of a model of rate evolution (such as Thorne, Kishino, and Painter, 1998), it cannot be said whether evolution sped up in grasses (for example) or slowed down four times, in Saururus and the three monocot lines attached below them, but a parsimony argument would favor the former scenario. On the other hand, branches such as Platanus, Nelumbo, Fagus, and Carya are relatively short, which may explain the anomalously young ages obtained for Proteales and Fagales (because the likelihood method tends to equalize absolute substitution rates by "pulling" short branches toward the present). If these short branches are the result of slowing of molecular evolution, Platanus and Nelumbo may be "living fossils" in molecular as well as morphological terms, as suggested for Winteraceae by Suh et al. (1993).
These observations suggest that previous estimates of the age of angiosperms may have been biased by preferential sampling of herbaceous angiosperm lineages with accelerated rates of molecular evolution, such as Oryza, Pisum, and Nicotiana. To evaluate this effect, we calculated ages on the tree in Fig. 4 after removing all angiosperms except these three genera. Using just these taxa nearly doubles the inferred age of angiosperms, from 139 to 253 mya (Late Permian). Conversely, removing these three taxa lowers the age of angiosperms to 122 mya (Barremian).
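The direction of this sampling effect can be illustrated with a deliberately simplified sketch. It is not the likelihood procedure used in this study: it assumes a crude strict clock in which the age of a node is the mean path length from that node to the sampled tips, divided by a single rate calibrated elsewhere. All taxon labels, path lengths, and the rate below are hypothetical values chosen only to show why retaining only long-branch taxa inflates the inferred age and why removing them deflates it.

```python
from statistics import mean

# Hypothetical path lengths (substitutions per site) from the angiosperm
# crown node to each sampled tip. Invented for illustration only.
CROWN_TO_TIP = {
    "fast_herb_A": 0.090,
    "fast_herb_B": 0.080,
    "slow_woody_A": 0.030,
    "slow_woody_B": 0.040,
}

# Hypothetical rate (substitutions per site per my), imagined as having been
# calibrated on outgroup lineages.
OUTGROUP_RATE = 0.0004


def clock_age(path_lengths, rate):
    """Crude strict-clock age: mean node-to-tip path length divided by rate."""
    return mean(path_lengths) / rate


def age_for(taxa):
    """Age estimate (my) for the crown node using only the named taxa."""
    return clock_age([CROWN_TO_TIP[t] for t in taxa], OUTGROUP_RATE)


all_taxa = list(CROWN_TO_TIP)
fast_only = [t for t in all_taxa if t.startswith("fast")]
slow_only = [t for t in all_taxa if t.startswith("slow")]

if __name__ == "__main__":
    for label, taxa in [("all taxa", all_taxa),
                        ("fast only", fast_only),
                        ("slow only", slow_only)]:
        print(f"{label:10s} {age_for(taxa):6.1f} my")
```

With these invented numbers the fast-only sample gives the oldest age and the slow-only sample the youngest, mirroring the qualitative pattern reported above; the magnitudes have no bearing on the real data.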
Since branch lengths in Fig. 5 are especially variable in monocots and eudicots, it might be suggested that better age estimates could be obtained by considering only more basal lines, on the assumption that these may provide better evidence on original evolutionary rates. Following this reasoning, we removed the clade consisting of Saururus, monocots, and eudicots from the tree in Fig. 4. The resulting age is 98 mya (late Albian), more than 30 my younger than the first fossil records of the angiosperm crown group. Removing all angiosperms except Amborella, Nymphaea, and Austrobaileya, representing the first three branches in this analysis and others (Mathews and Donoghue, 1999; Parkinson, Adams, and Palmer, 1999; Qiu et al., 1999; Soltis, Soltis, and Chase, 1999), gives an even younger age, 85 mya (Santonian). This implies that rates in these basal lines were actually slower than the average rate in the outgroups, as well as in other angiosperms, as noted for Winteraceae by Suh et al. (1993). This could be due to (1) deceleration on the angiosperm stem lineage, (2) parallel deceleration in the basal lines from higher rates during their initial radiation, and/or (3) acceleration in other lines. Establishing which of these scenarios is correct will be crucial for more accurate estimates of the age of angiosperms.
Other experiments were designed to assess the effect of uncertainties in seed-plant relationships, prompted by the fact that the arrangement based on rbcL conflicts with other analyses. Since the true tree is unknown, we used three trees with relevant taxa forced into arrangements found in other recent analyses, generated by analyzing the combined rbcL and 18S data sets with topological constraints ("anthophyte," "gnetifer," and "gnepine" in Tables 2–4).
The anthophyte tree (Fig. 6) is consistent with the morphological hypothesis that Gnetales are the closest living relatives of angiosperms (Crane, 1985; Doyle and Donoghue, 1986; Loconte and Stevenson, 1990; Rothwell and Serbet, 1994; Doyle, 1996). This is one of two trees found after forcing Lycopodium to the base of vascular plants, Gnetales and angiosperms into a clade, and Amborella to the base of the angiosperms (otherwise Solanaceae are basal). In Fig. 6, the base of the seed plants is a trichotomy, because the length of the branch subtending the clade of Gnetales plus angiosperms is zero for rbcL. This same trichotomy was observed in constrained anthophyte trees for the plastid genes psaA and psbB (Sanderson et al., 2000). Thus there is no support for the anthophyte hypothesis in these genes. This change in topology has surprisingly little effect on the angiosperm age: it actually increases slightly from that based on the constrained rbcL analysis (Fig. 4), from 139 to 143 mya, near the beginning of the Early Cretaceous. It has more effect on the age of Gnetales, which decreases from 218 to 198 mya, as might be expected, since Gnetales are nested within seed plants, rather than basal.
The dates in Table 4 based on different codon positions in rbcL give insight into earlier studies that analyzed protein sequences or nonsynonymous substitutions (Martin, Gierl, and Saedler, 1989; Martin et al., 1993; Laroche, Li, and Bousquet, 1995), which can be approximated by analyzing first and second codon positions when the gene is highly conserved at the amino acid level. Martin et al. (1993) justified their approach by arguing that rbcL is "saturated" with synonymous substitutions at the level of seed plants; their age for angiosperms (300 mya, mid-Pennsylvanian) was much older than our estimates based on all positions. We investigated this factor on the gnetifer tree (Fig. 7). When dates are calculated based on first and second positions, the age of the angiosperms increases dramatically, from 141 to 211 mya (Late Triassic). When Oryza, Pisum, and Nicotiana are used as the only angiosperms (Fig. 9), the age increases still more, to 281 mya (Early Permian). In contrast to the pattern noted above, use of gamma increases these ages rather than decreasing them, but only slightly (e.g., from 273 to 281 mya in the last case, still Early Permian). These observations help explain the 300 mya date found by Martin et al. (1993), since their analysis was based largely on herbaceous taxa. On the other hand, when only third positions are analyzed, the age of the angiosperms decreases to 88 mya (early Late Cretaceous), much younger than the oldest records of the group. In this case, use of gamma decreases the inferred age (from 121 mya without gamma). Overall, age estimates based on third positions are more sensitive to model choice than estimates based on first and second positions (Tables 1 and 4). This is expected if saturation is a problem, because "corrections" for saturation are model dependent and most likely to give variable results at high levels of sequence divergence.
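The model dependence of saturation corrections can be sketched with pairwise distances, a simpler setting than the likelihood analyses used here. The example compares the standard Jukes-Cantor correction with its gamma-rates counterpart for increasing observed proportions of differing sites. The shape parameter (alpha = 0.5) and the observed proportions are hypothetical, and the sketch is not meant to reproduce the direction or size of the gamma effect on the ages in Table 4; it only shows that two corrections that nearly agree at low divergence can disagree strongly as sequences approach saturation.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc69_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates among sites (shape alpha)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1 - 4 * p / 3) ** (-1 / alpha) - 1)


def model_ratio(p, alpha):
    """Ratio of the gamma-corrected to the equal-rates corrected distance."""
    return jc69_gamma_distance(p, alpha) / jc69_distance(p)


ALPHA = 0.5  # hypothetical shape parameter, not estimated from rbcL or 18S

if __name__ == "__main__":
    # Hypothetical observed divergences, from slowly to rapidly evolving sites
    for p in (0.05, 0.20, 0.35, 0.50):
        print(f"p={p:.2f}  JC={jc69_distance(p):.3f}  "
              f"JC+gamma={jc69_gamma_distance(p, ALPHA):.3f}  "
              f"ratio={model_ratio(p, ALPHA):.2f}")
```

In this toy setting the ratio between the two corrected distances grows steadily with observed divergence, which is the sense in which rapidly evolving third positions would be expected to respond more strongly to model choice than first and second positions.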