a Department of Botany, Bessey Hall, Iowa State University, Ames, Iowa 50011
| ABSTRACT |
|---|
Key Words: alcohol dehydrogenase; Gossypium; molecular phylogenetics; noncoding chloroplast DNA; polyploidy
| INTRODUCTION |
|---|
One of the often-cited advantages of molecular data for phylogenetic reconstruction is the almost infinite number of characters that can be sampled. Yet, for plant groups where radiations have been relatively recent it may be extraordinarily difficult to generate sufficient phylogenetic signal because of the relatively slow accumulation of mutations, even in "rapidly evolving" noncoding DNA. The literature is replete with cladograms derived from molecular data that are well resolved internally, but that contain unresolved terminal clades of presumably closely related species (e.g., Hodges and Arnold, 1994; Bayer, Hufford, and Soltis, 1996; Soltis et al., 1996; Panero and Jansen, 1997; Sang, Crawford, and Stuessy, 1997). This phenomenon is the focus of the present paper. Specifically, we wished to address the issue of phylogenetic resolution within recent radiations by asking the following questions: (1) are mutation rates sufficiently high in noncoding cpDNA to provide phylogenetic resolution within a group of woody perennials that may be only 0.5–2 million years old? (2) do mutation rates vary among cpDNA noncoding regions, and if so, which exhibits the highest mutation rate? (3) can strictly orthologous low-copy nuclear-encoded genes be isolated, and if so, do they exhibit a higher mutation rate than noncoding cpDNA? (4) what are the relative strengths and weaknesses of the various types of molecular data for evaluating the phylogenetic relationships of recently radiated groups? As a model system for examining these questions we chose the tetraploid species of Gossypium L.
Gossypium includes ~50 species (Fryxell, 1992; Wendel, 1995; Wendel, Brubaker, and Seelanan, in press), of which the majority are diploid (2n = 2x = 26) and five are allotetraploids (2n = 4x = 52). Previous studies have resulted in the phylogenetic hypothesis shown in Fig. 1. The allotetraploid species appear to be a monophyletic assemblage derived from a single polyploidization event ~0.5–2 million years ago (Wendel, 1989; Wendel and Albert, 1992; Seelanan, Schnabel, and Wendel, 1997), and despite extensive efforts directed at understanding relationships among tetraploid cottons, only weak resolution has been obtained (Endrizzi, Turcotte, and Kohel, 1985; Wendel, 1989; DeJoode and Wendel, 1992; Wendel and Albert, 1992; Reinisch et al., 1994; Cronn et al., 1996; Wendel, Schnabel, and Seelanan, 1995a, b; Seelanan, Schnabel, and Wendel, 1997). In addition to cpDNA and rDNA restriction site data, sequences from the nuclear ribosomal ITS regions are available for all tetraploid species (Wendel, Schnabel, and Seelanan, 1995a, b; Seelanan, Schnabel, and Wendel, 1997) and ndhF data are available for two of the five species (Seelanan, Schnabel, and Wendel, 1997). Given voluminous data yet little phylogenetic resolution, tetraploid Gossypium provides a test case for evaluating the utility of a variety of putatively quickly evolving molecular sequences for resolving the phylogeny of a recent radiation. To this end we sequenced seven cpDNA noncoding regions in each of the five tetraploid species and a representative of the diploid maternal (chloroplast donor; Wendel, 1989) lineage, G. arboreum L. In addition, we isolated and sequenced a region of a pair of homoeologous nuclear-encoded alcohol dehydrogenase (Adh) genes for these same taxa, as well as a representative of the paternal lineage, G. raimondii Ulbrich, and an additional outgroup, G. robinsonii F. Mueller.
| MATERIALS AND METHODS |
|---|
Nuclear-encoded alcohol dehydrogenase loci
Alcohol dehydrogenase (Adh, E.C. number 1.1.1.1) is a metabolic enzyme responsible for the interconversion of ethanol and acetaldehyde, primarily in response to hypoxic conditions (Freeling and Bennett, 1985). In cotton, as in most plants, Adh exists as a nuclear-encoded small gene family (Millar and Dennis, 1996; Small and Wendel, unpublished data). Gene structure of Adh in Gossypium is generally conserved relative to other plant species studied (Fig. 3; Millar and Dennis, 1996; Small and Wendel, unpublished data). Because the Gossypium species under consideration are allotetraploids (containing A and D subgenomes; see above) each nuclear-encoded locus present in diploid species is present in two copies (homoeologues) in the tetraploid species, one per subgenome. We have PCR-amplified, cloned, and sequenced the majority of a pair of homoeologous Adh genes from tetraploid Gossypium as well as the orthologues from diploid Gossypium representing the parents of the allopolyploid.
Adh sequences have been used previously in a number of phylogenetic and molecular evolutionary studies in plants (Gaut and Clegg, 1991, 1993; Goloubinoff, Pääbo, and Wilson, 1993; Hanfstingl et al., 1994; Gaut et al., 1996; Innan et al., 1996; Miyashita, Innan, and Terauchi, 1996; Morton, Gaut, and Clegg, 1996; Sang, Donoghue, and Zhang, 1997).
Amplification, cloning, and sequencing
cpDNA regions
PCR amplifications were performed in 50-µL reactions consisting of 1 unit Taq polymerase (Promega, Madison, Wisconsin), 1X buffer (Promega), 200 µmol/L each deoxy-nucleotide triphosphate, 1.5 mmol/L MgCl2, 10–20 pmol of each primer, and 8–12 ng of template genomic DNA. Amplifications were carried out using the parameters described in Table 3 in an MJ Research PTC-100 thermal cycler (Watertown, Massachusetts). Amplifications were preceded by a "hotstart" consisting of 2 min at 94°C followed by 5 min at 80°C during which time the Taq polymerase was added to the reactions. A negative control reaction (no template DNA) was included for each set of amplifications to monitor for the possibility of contamination. All PCR primers were either obtained from other researchers or were synthesized by Integrated DNA Technologies (Coralville, Iowa). Amplification products were visualized by agarose gel electrophoresis, concentrated using Microcon-100 centrifugation separators (Amicon, Beverly, Massachusetts), and quantified fluorometrically. PCR products were either sequenced directly (rpl16 intron, trnL-trnF spacer, rpoC1 intron, ndhA intron) or cloned into pGEM-T (Promega) and sequenced (atpB-rbcL spacer, trnT-trnL spacer, accD-psaI spacer). For the cloning approach, purified PCR products were ligated into pGEM-T according to the manufacturer's instructions. Competent Top10 F' (Invitrogen, San Diego, California) cells were transformed via electroporation and the resulting colonies were screened for plasmids with inserts by PCR using the original amplification primers. Plasmids were isolated from a single recombinant colony using an alkaline lysis/PEG precipitation protocol (Sambrook, Fritsch, and Maniatis, 1989). Cloning was performed only when PCR-amplification resulted in insufficient template for automated sequencing or when difficulties were encountered in using the amplification primers as sequencing primers.
All sequencing was performed using amplification, internal, and/or vector specific primers (Table 2) at the Iowa State University DNA Sequencing and Synthesis Facility.
Analyses
Characterization of each region and sequence comparisons were facilitated by the software programs MacClade 3.05 (Sinauer, Sunderland, Massachusetts), PAUP 3.1.1 (Swofford, 1993) and MEGA 1.0 (Kumar, Tamura, and Nei, 1993). Analyses were conducted both on individual and combined data sets as follows. Individual cpDNA region data sets were analyzed separately (when warranted by the existence of sufficient variation) and then as a combined cpDNA data set. Adh sequences were analyzed in three separate ways: individual sequences as terminal "taxa," by subgenome, and by combining Adh homoeologue sequences for tetraploid taxa for an Adh "total evidence" analysis. For each data set a g1 statistic (Hillis and Huelsenbeck, 1992; Hillis, Allard, and Miyamoto, 1993) was calculated using PAUP 3.1.1 to determine whether or not significant phylogenetic structure existed within the data set. For phylogenetic analyses, exhaustive searches for most-parsimonious trees were conducted with uninformative characters excluded. Due to the larger number of sequences included in the initial Adh analysis (each allotetraploid represented by two distinct sequences), the Branch and Bound algorithm was employed to search for maximally parsimonious trees. Relative levels of support for clades present in the most-parsimonious trees were assessed by calculating decay values, the number of extra steps required to collapse the clade (Bremer, 1988). For all phylogenetic analyses the tree lengths and consistency indices reported do not include autapomorphic characters. Rate variation among sequences was assessed using the 1D and 2D relative rate tests of Tajima (1993) as implemented in the program Tajima93 (T. Seelanan, unpublished software).
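The 1D relative rate test referred to above can be illustrated with a short sketch. This is not the Tajima93 program used in the study; it is a minimal illustration of the 1D test of Tajima (1993), assuming three pre-aligned sequences (two ingroup sequences and an outgroup) and skipping any column containing a gap or ambiguity code. The function name, the toy sequences, and the 5% critical value constant are illustrative choices.

```python
def tajima_1d(seq1, seq2, outgroup):
    """Sketch of the Tajima (1993) 1D relative rate test.

    m1 counts sites where seq1 alone differs (seq2 matches the outgroup);
    m2 counts sites where seq2 alone differs (seq1 matches the outgroup).
    Under equal rates, (m1 - m2)^2 / (m1 + m2) is approximately
    chi-square distributed with 1 degree of freedom.
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    m1 = m2 = 0
    for a, b, o in zip(seq1.upper(), seq2.upper(), outgroup.upper()):
        if not {a, b, o} <= valid:
            continue  # skip gaps and ambiguity codes
        if a != b and b == o:
            m1 += 1  # change unique to sequence 1
        elif a != b and a == o:
            m2 += 1  # change unique to sequence 2
    total = m1 + m2
    chi2 = (m1 - m2) ** 2 / total if total else 0.0
    return m1, m2, chi2


# Chi-square critical value for 1 degree of freedom at P = 0.05.
CHI2_CRIT_05 = 3.841

# Toy alignment (hypothetical data, not from the study).
m1, m2, chi2 = tajima_1d("ACGTACGTAC", "ACGTACGTTT", "ACGTACGTAC")
print(m1, m2, chi2, chi2 > CHI2_CRIT_05)
```

With real data, seq1 and seq2 would be, for example, the A- and D-subgenome AdhC sequences of one species, and the outgroup a sequence such as that of G. robinsonii.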
| RESULTS |
|---|
Overall, 7369 characters (nucleotides) were sampled, yielding 52 variable positions (0.71%) and four potentially phylogenetically informative nucleotide substitutions (0.05%). In addition to nucleotide substitutions, we observed 15 length mutations (indels), of which four were potentially phylogenetically informative.
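The percentages quoted above follow directly from the reported counts. A minimal sketch of the arithmetic, using only the numbers given in the text (the variable names are illustrative):

```python
# Counts reported in the text for the combined cpDNA data.
total_sites = 7369
variable_sites = 52
informative_substitutions = 4


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pct_variable = percent(variable_sites, total_sites)
pct_informative = percent(informative_substitutions, total_sites)

print(f"variable: {pct_variable:.2f}%")        # 0.71%
print(f"informative: {pct_informative:.2f}%")  # 0.05%
```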
Phylogenetic analyses of cpDNA sequences
Potentially phylogenetically informative characters were found in only two of the seven regions: the trnT-trnL spacer (four characters) and the rpl16 intron (four characters) (see Table 4). Exhaustive searches of all possible trees were performed for each of these data sets using PAUP v. 3.1.1 (Swofford, 1993). The g1 statistics were -1.57 and -0.23 for the trnT-trnL and the rpl16 intron, respectively. For the number of taxa and characters in these data sets, only the trnT-trnL spacer data set is significantly more structured than random (P < 0.01; Hillis and Huelsenbeck, 1992). The single most-parsimonious tree resulting from analysis of the trnT-trnL data set is shown in Fig. 4 (length = 4; consistency index [CI] = 1.0; retention index [RI] = 1.0). When all cpDNA data were combined into a single data set, a g1 statistic of -1.08 was obtained which is significantly more structured than random (P < 0.01). Two equally most-parsimonious trees (length = 11; CI = 0.727; RI = 0.625) were found in an exhaustive search; the topology of the strict consensus tree was identical to Fig. 4. The two shortest trees differed only in the placement of G. hirsutum, which was resolved either as sister to a G. barbadense + G. darwinii clade or as part of an unresolved polytomy as in the strict consensus tree.
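The consistency and retention indices quoted above are simple ratios of tree lengths. A minimal sketch for the combined cpDNA data set, assuming the eight informative characters are all binary (so the minimum possible length m is 8) and taking g = 16 as a hypothetical maximum length back-calculated from the reported RI rather than a value given in the text:

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s: minimum possible steps over observed tree length."""
    return min_steps / observed_steps


def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m), where g is the maximum possible length."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


m = 8    # assumed: eight binary informative characters, one step each
s = 11   # reported length of the most-parsimonious trees
g = 16   # hypothetical: back-calculated from the reported RI of 0.625

ci = consistency_index(m, s)
ri = retention_index(m, s, g)
print(f"CI = {ci:.3f}, RI = {ri:.3f}")  # CI = 0.727, RI = 0.625
```

For the trnT-trnL data set alone, m = s = 4, so CI = 1.0 regardless of g.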
All AdhC sequences maintain the expected 5' GT... and ...AG 3' intron boundary sequences with the exception of a G to A transition of the first nucleotide of intron 6 of the D-subgenomes of G. hirsutum and G. tomentosum, and an A to G transition at the 3' end of intron 3. All sequences also maintain exon integrity (presence, length, reading frame) with the following exceptions. A 67-bp deletion in the A-subgenome sequences of G. barbadense and G. darwinii begins seven nucleotides from the 3' end of exon 4 and ends in the middle of intron 4. A large (182 bp) deletion in the G. arboreum sequence results in partial loss of introns 5 and 6, and all of exon 6. Finally, a G to A transition in exon 2 of the G. arboreum sequence results in the conversion of a tryptophan-encoding codon (TGG) to a stop codon (TAG). The relevance of the foregoing observations to AdhC expression was not explored.
Sequence characteristics for AdhC are summarized in Table 5 and are discussed below. The total aligned length of the data matrix is 1667 bp; this includes 798 bp of exon sequence and 869 bp of intron sequence. With the exception of the sequence from G. arboreum, the absolute sequence lengths ranged from 1579 bp to 1655 bp. GC content varied little between the A- and D-(sub)genomes, but varied greatly between exons (45.4–46.2%) and introns (30.1–32.0%). Among sequences from tetraploid taxa, transition:transversion ratios (Ts:Tv) varied between genomes, and especially between introns and exons. In the A-(sub)genome the Ts:Tv was ~4.2:1, whereas in the D-(sub)genome the Ts:Tv was ~3.6:1 (Table 5). The differences between intron and exon Ts:Tv are more dramatic, ranging from 7–8:1 in exons to 1.6–3.3:1 in introns. Table 5 also reveals a marked disparity in the number of nucleotide substitutions in the two subgenomes; the numbers of nucleotide differences between all pairs of sequences are shown in Table 6. The D-subgenome sequences have experienced ~1.5 times as many nucleotide substitutions and yield almost three times as many potentially phylogenetically informative characters. This disparity is also reflected in the relative rate tests (Tajima, 1993), as summarized in Table 6. These tests indicate that, in all comparisons, AdhC genes from the D-(sub)genomes are accumulating substitutions at a rate that is significantly faster than are their orthologues/homoeologues in the A-(sub)genomes.
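Transition:transversion ratios of the kind reported above are obtained by classifying each pairwise difference. A minimal sketch, assuming two pre-aligned sequences and ignoring any column with a gap or ambiguity code; the function name and toy sequences are hypothetical and are not taken from the study's data:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y:
            continue
        if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
            continue  # skip gaps and ambiguity codes
        same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        if same_class:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv


# Toy example (hypothetical sequences).
ts, tv = count_ts_tv("AACCGGTT", "GACTGTTA")
print(ts, tv)  # 2 2
```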
For the data set in which each sequence was treated as a terminal the g1 statistic estimated from 10 000 random trees was -0.49, which indicates that the data are significantly more structured than random (P < 0.01). Phylogenetic analysis of this data set resulted in a single most-parsimonious tree (length = 97, CI = 0.93, RI = 0.98), which is shown in Fig. 5. The tree is completely resolved and divided into two primary clades: one including the D-genome diploid and D-subgenome of the allotetraploids and the second including the A-genome diploid and the A-subgenomes of the allotetraploids. Within each (sub)genomic clade the resolution is complete and the topology is identical between clades.
Finally, the data for both homoeologues were combined for each taxon for an Adh "total evidence" analysis. For outgroup comparison, the G. raimondii and G. arboreum sequences were combined to make a "diploid progenitor" sequence and the G. robinsonii sequence was duplicated. This data set had a g1 statistic of -1.39, significantly more structured than random at the P = 0.01 level. An exhaustive search found a single most-parsimonious tree (Fig. 6) with length = 43, CI = 0.91, and RI = 0.91. The tree is fully resolved and well supported, as indicated by high decay values and branch lengths.
| DISCUSSION |
|---|
Relationships hypothesized by these data additionally confirm predictions based on other sources of evidence. For example, the basal position of G. mustelinum predicts that it should be genetically equidistant from all other tetraploid species (Wendel, Rowley, and Stewart, 1994). This is borne out not only by the allozyme data presented by Wendel, Rowley, and Stewart (1994), but also by the AdhC sequence data reported in this paper; in the combined analysis (Fig. 6) there are 34, 35, 28, and 32 character-state changes between G. mustelinum and G. hirsutum, G. tomentosum, G. barbadense and G. darwinii, respectively (mean divergence from G. mustelinum = 1.0%). The Adh data also support the conclusion that G. barbadense and G. darwinii diverged more recently from each other than did G. hirsutum and G. tomentosum: while the branches leading to these two clades have similar lengths (10 vs. 12 steps), the number of autapomorphies each lineage has accumulated differ dramatically (9 and 10, respectively, in G. hirsutum and G. tomentosum vs. 1 and 5, respectively, in G. barbadense and G. darwinii).
Molecular evolution of noncoding cpDNA
The impetus for the experiments described here was to explore the phylogenetic utility of various sequences rather than to provide an in-depth analysis of patterns of molecular evolution. Nonetheless, some observations are prompted by our data. First, it has been recognized that cpDNA accumulates nucleotide substitutions more slowly than does plant nuclear DNA (Wolfe, Li, and Sharp, 1987; Wolfe, Sharp, and Li, 1989). As summarized in Tables 4 and 6, this rate difference is clearly evident in our data. In fact, the cpDNA data are astounding in their lack of informativeness, with a total of only eight phylogenetically informative characters observed among over seven thousand nucleotides surveyed. As a result of so little variation, the cpDNA provide only limited phylogenetic power.
In addition to the overall paucity of genetic variation, certain patterns observed previously are also noted here. First, the finding of Morton (1995) that transversions are more prevalent at positions flanked by A/T is supported by our data qualitatively, but sufficient data do not exist to statistically test this association. Also, previous observations that indels occur almost as frequently as nucleotide substitutions in noncoding cpDNA (Golenberg et al., 1993; Gielly and Taberlet, 1994b) are not supported by our data (Table 4). Rather, we detected over three times as many substitutions as indels in sequences from the allopolyploids (52 vs. 15, Table 4). Patterns of substitutions and indels vary between regions and in no case does the number of indels equal the number of substitutions. Of the indels that occur, two primary types are observed: insertion/deletion of a multinucleotide stretch of unique sequence or insertion/deletion of one or a few nucleotides within a polynucleotide tract (particularly polyA/T). The former type of indel is generally easily aligned and, if cladistically informative, is usually nonhomoplasious. In our cpDNA data there were 12 such indels, of which three were phylogenetically informative and none were homoplasious. The latter type of indel (three in our data), however, appears evolutionarily labile and probably originates via slipped-strand mispairing during replication (Levinson and Gutman, 1987). These types of indels often provide homoplasious characters. For example, the single homoplasious indel character in our cpDNA data set is a deletion of a single T in a string of ten in the rpl16 intron, which is shared by G. hirsutum and G. barbadense.
Molecular evolution of Adh
Patterns of molecular evolution among the AdhC sequences will be discussed in the context of a full presentation of the evolution of the Adh gene family in Gossypium. Certain features of the data, however, are especially relevant here. In particular, the disparity of substitution rates between AdhC sequences of the A- and D-subgenomes is striking, consistent, and statistically significant (see Table 6). Relative rate differences may be attributed to a number of evolutionary or population genetic phenomena, including background mutational processes, generation time, lineage effects, selection, drift, and rates of recombination (Bosquet et al., 1992; Gaut et al., 1992; Gaut, Muse, and Clegg, 1993; Clegg et al., 1994; Eyre-Walker and Gaut, 1997). Because both of the two AdhC homoeologues exist within the same nuclear genome, however, background mutational and population genetic phenomena should affect them equally and can therefore be ruled out as having a significant effect. Selection is one (but not the only) process that can potentially differentially affect genes in the same nucleus. Either differing levels of purifying selection on the subgenome sequences or positive (diversifying or directional) selection on the D-subgenome sequences could account for the observed rate differences. There is an almost fivefold elevation of nucleotide substitution rates in exons of the D-subgenome relative to the A-subgenome (K = 0.014 vs. 0.003, respectively; Table 5), despite the fact that intron nucleotide substitution rates are actually slightly higher in the A-subgenome sequences (K = 0.009 vs. 0.008; Table 5). Secondly, within exon sequences the synonymous nucleotide substitution rate (Ks) is over twice as high in the D-subgenome relative to the A-subgenome (Ks = 0.019 vs. 0.008; Table 5), but the nonsynonymous nucleotide substitution rate (Ka) is over six times higher (Ka = 0.013 vs. 0.002; Table 5). 
Finally, overall AdhC nucleotide substitution rates in the A-subgenome sequences are higher in the introns than in the exons (K = 0.009 vs. 0.003, respectively; Table 5) as predicted by neutral theory (Kimura, 1983); yet, in the D-subgenome sequences the nucleotide substitution rate is approximately twice as high in exons as in the introns (K = 0.014 vs. 0.008 respectively; Table 5). These data collectively suggest that selective forces may differ between homoeologues.
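The fold differences cited in this section can be recomputed from the Table 5 values quoted in the text. The sketch below does only that arithmetic; the dictionary layout and names are illustrative, and the Ka/Ks ratio is added as a commonly used summary that the text itself does not report.

```python
# Substitution rates quoted in the text from Table 5.
rates = {
    "A": {"exon_K": 0.003, "intron_K": 0.009, "Ks": 0.008, "Ka": 0.002},
    "D": {"exon_K": 0.014, "intron_K": 0.008, "Ks": 0.019, "Ka": 0.013},
}

# D-subgenome relative to A-subgenome.
exon_fold = rates["D"]["exon_K"] / rates["A"]["exon_K"]  # "almost fivefold"
ks_fold = rates["D"]["Ks"] / rates["A"]["Ks"]            # "over twice"
ka_fold = rates["D"]["Ka"] / rates["A"]["Ka"]            # "over six times"

# Ka/Ks per subgenome (not reported in the text; derived here).
ka_ks = {sub: r["Ka"] / r["Ks"] for sub, r in rates.items()}

print(f"exon K fold: {exon_fold:.2f}")
print(f"Ks fold: {ks_fold:.3f}, Ka fold: {ka_fold:.1f}")
print({sub: round(v, 2) for sub, v in ka_ks.items()})
```

Both derived Ka/Ks values are below 1, with the D-subgenome value the larger of the two, which is in keeping with the suggestion that selective forces may differ between homoeologues.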
Relative phylogenetic utilities of molecular data
The phylogenetic conclusions described above are based almost exclusively on the wealth of data provided by the AdhC sequences, despite the volume of cpDNA data generated for identical taxa. In addition to the data presented in this paper, there exist for allotetraploid Gossypium comparable molecular data sets for cpDNA restriction sites (Wendel, 1989; DeJoode and Wendel, 1992; Wendel and Albert, 1992), and ITS sequences (Wendel, Schnabel, and Seelanan, 1995a, b; Seelanan, Schnabel, and Wendel, 1997). Figure 7 presents a comparison of the percentage of phylogenetically informative characters for these data sets. The cpDNA data consistently exhibit lower levels of informative characters than do the nuclear-encoded loci, as expected (Wolfe, Li, and Sharp, 1987; Wolfe, Sharp, and Li, 1989; Eyre-Walker and Gaut, 1997). The percentage of phylogenetically informative characters in the cpDNA data sets varied from 0 to 0.34%, and several of the cpDNA noncoding regions yielded no informative characters. The three cpDNA data sets that did contain informative characters (rpl16 intron, trnT-trnL spacer, and cpDNA restriction sites) exhibited similar levels of informativeness both in terms of percentages (0.29–0.34%) and absolute numbers of informative characters (3–4).
Advantages and limitations of nuclear-encoded genes for phylogenetic analysis
Relative rates
It has long been recognized that nuclear-encoded sequences evolve at a faster rate than plastid-encoded sequences (e.g., Wolfe, Li, and Sharp, 1987; Wolfe, Sharp and Li, 1989; Eyre-Walker and Gaut, 1997). Despite this, in the search for the most phylogenetic information per unit of effort, nuclear-encoded sequences have been relatively ignored, with the exception of the widely used rDNA regions. The data presented here show clearly that cpDNA noncoding sequences may not be able to provide sufficient characters for robust resolution among closely related taxa, even if sampled ad infinitum. We sampled over 6 kb of cpDNA noncoding sequence (~10% of all unique cpDNA noncoding sequences) and yet obtained incomplete and poorly supported phylogenetic resolution. In addition, over 1000 cpDNA restriction sites were previously sampled (Wendel, 1989; DeJoode and Wendel, 1992), again with incomplete resolution. In contrast, sequences from a 1.6-kb nuclear-encoded AdhC gene provided complete and robust resolution among these closely related taxa. This difference in phylogenetic utility reflects simply the greatly accelerated rates of nucleotide substitution in the nuclear genome relative to the plastome, as illustrated in Fig. 7. The mean number of substitutions per site (K) in the combined cpDNA sequence data set was K = 0.002, while in the AdhC data sets K = 0.006 in the A-(sub)genome and K = 0.011 in the D-(sub)genome, a three- to sixfold difference in nucleotide substitution rates. Extrapolation of these data allows the following observation. Given a total of four informative nucleotide substitutions out of a total of 6438 bp of noncoding cpDNA sequenced, and 25 informative nucleotide substitutions in the AdhC sequences, and assuming that levels of informative characters are constant across the chloroplast genome, over 40 kb of noncoding cpDNA would have to be sequenced to obtain an equivalent number of informative nucleotide substitutions as found in the AdhC sequences.
This represents 62% (40 238 bp/64 437 bp) of the unique noncoding complement of the tobacco chloroplast genome (K. Wolfe, University of Dublin, Trinity College, Ireland, personal communication).
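The extrapolation can be reproduced in a few lines. The sketch below uses only the counts given in the text and keeps the text's stated assumption that informative characters occur at a constant density across the chloroplast genome; the variable names are illustrative.

```python
import math

# Values quoted in the text.
cp_informative = 4             # informative substitutions in noncoding cpDNA
cp_bp_sequenced = 6438         # bp of noncoding cpDNA sequenced
adh_informative = 25           # informative substitutions in AdhC
tobacco_noncoding_bp = 64437   # unique noncoding complement, tobacco plastome

# bp of noncoding cpDNA needed to match AdhC, assuming a constant
# density of informative substitutions across the chloroplast genome.
bp_needed = adh_informative * cp_bp_sequenced / cp_informative
bp_needed_whole = math.ceil(bp_needed)

share = 100.0 * bp_needed_whole / tobacco_noncoding_bp

print(bp_needed)        # 40237.5
print(bp_needed_whole)  # 40238
print(f"{share:.0f}%")  # 62%
```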
Patterns of mutation
In addition to levels of divergence, issues of alignability are important in selecting a genic or noncoding region for phylogenetic studies. While noncoding sequences generally accumulate nucleotide substitutions at a higher rate than coding sequences, they also appear to accumulate indels at a faster rate, occasionally equaling the rate of nucleotide substitutions (Golenberg et al., 1993; Gielly and Taberlet, 1994b). Because coding regions are constrained to maintain frame, indels occur less frequently, and when they do, they occur in multiples of three (i.e., a codon). Sequence alignment for genic regions, therefore, is usually straightforward, thereby making assessment of positional homology unambiguous. Noncoding regions, on the other hand, experience indel mutations of all lengths and at high frequency, making sequence alignment more problematic in many cases, particularly as more distantly related taxa are included (e.g., Golenberg et al., 1993; Downie, Katz-Downie, and Cho, 1996; Savolainen, Spichiger, and Manen, 1997). Additional confounding factors in assessing homology of mutations include the duplication/deletion of short repeats (or individual nucleotides in a run) via slipped-strand mispairing (Levinson and Gutman, 1987; Golenberg et al., 1993; Cummings, King, and Kellogg, 1994); the potential multiple origin of small inversions that occur in the loop of stem-loop secondary structures (Kelchner and Wendel, 1996); the higher potential for homoplasy due to a functionally reduced number of character states (due to the high AT content of noncoding cpDNA regions), and biased nucleotide substitutions in AT-rich regions (Morton, 1995). The use of coding regions can circumvent these difficulties, but at the cost of reduced levels of variation, at least in cpDNA genes. Nuclear-encoded genes, however, may offer the higher levels of variation desired, with the ease of alignment afforded by coding sequences.
Sequencing vs. restriction site data
Jansen, Wee, and Millie (1998) have analyzed both the relative utility (in terms of number of characters) and the relative reliability (in terms of CI and RI) of gene sequencing and restriction site studies of cpDNA. They suggest that, for intrageneric comparisons, cpDNA restriction site data are preferable, both because of the greater number of informative characters and because they report that restriction site data are, in general, less homoplasious than sequence data. Their analyses, however, did not address the lower end of the divergence spectrum (as in our study), where analysis of over 1000 cpDNA restriction sites still provided only limited resolution. cpDNA restriction site data are relatively free from problems associated with sequence data such as alignability. Comparison of mapped restriction sites is straightforward (assuming low levels of rearrangement), but becomes more difficult as taxonomic distance increases (Olmstead and Palmer, 1994; Jansen, Wee, and Millie, 1998). Restriction site studies, however, require large amounts of clean DNA and hence, are contraindicated in situations where availability of material is limiting.
Coalescence and intraspecific variation
Intraspecific genetic variation (i.e., allelic variation) is often observed when more than one accession of a species is sampled for molecular phylogenetic analysis. Two types of variation may be observed and their impacts on phylogenetic reconstruction are profoundly different. First, alleles within species may all be derived from a single ancestral allele present in the species, i.e., alleles coalesce within species. In this case, all intraspecific variation will be autapomorphic and therefore irrelevant for parsimony analysis. On the other hand, allelic variation may transcend species boundaries and therefore gene trees may not be equivalent to species trees simply because alleles may be older than species and multiple alleles can be maintained within a lineage (Pamilo and Nei, 1988; Hudson, 1990; Maddison, 1995; Clegg, 1997; Wendel and Doyle, 1998). The probability of concordance between a species tree and a gene tree is dependent on the time (in generations) between speciation events (the greater the number of generations, the higher the probability of recovering the species tree) and population genetic factors such as effective population size and selection. Although phylogenetic analyses of nuclear-encoded genes that have sampled multiple alleles are rare (see Huttley et al., 1997; Clegg, 1997, and references therein), incomplete coalescence has been observed (Buckler and Holtsford, 1996a, b; Gaut and Clegg, 1993; Goloubinoff, Pääbo, and Wilson, 1993; Hanson et al., 1996). Problems of noncoalescence are expected to be most prevalent in species where population genetic parameters promote the maintenance of multiple alleles, for example, large population size, high migration, and outbreeding (Pamilo and Nei, 1988; Hudson, 1990; Maddison, 1995). Populations of Gossypium species are primarily small, isolated, and inbred. 
These observations, in concert with the concordance of the phylogenies estimated from the separate homoeologues and the congruence with previous analyses, suggest to us that lack of coalescence is not an issue for this locus for these taxa. Current studies are underway to assess intraspecific polymorphism and to explicitly test whether or not Adh loci coalesce within closely related Gossypium species.
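The dependence of gene tree and species tree concordance on time between speciation events and on effective population size can be made concrete for the simplest case. The sketch below uses the standard three-species neutral coalescent result, P = 1 - (2/3)exp(-t/2N), for a diploid nuclear locus; it is offered as an illustration of the general argument rather than an analysis performed in this study, and the values of t and N are hypothetical.

```python
import math


def prob_concordance(t_generations, effective_size):
    """Probability that a gene tree matches a three-species tree.

    t_generations: generations between the two speciation events.
    effective_size: diploid effective population size (N) of the
    ancestral lineage during that interval. Assumes neutrality.
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-t_generations / (2.0 * effective_size))


# Hypothetical values: a short internode in a large versus a small population.
for n in (100_000, 1_000):
    p = prob_concordance(10_000, n)
    print(f"N = {n}: P(concordance) = {p:.3f}")
```

Under these assumptions a small effective population size, as suggested above for Gossypium, raises the probability of concordance for a given internode length.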
Concerted evolution
Multigene families are often subject to concerted evolution (Arnheim, 1983; Nagylaki, 1984; Walsh, 1987; Sanderson and Doyle, 1992; Elder and Turner, 1995). The ITS regions of nuclear rDNA became widely used as a source of sequence data after it became apparent that concerted evolution homogenizes sequences so that an entire array of tandemly repeated rDNA cistrons evolves as a single "locus" (Arnheim, 1983; Hillis and Dixon, 1991; Elder and Turner, 1995). Exceptions to the apparent rule of intraspecific and intraindividual sequence homogeneity are being discovered with increasing frequency, however, and the implications of these findings can be profound for phylogenetic reconstruction. Three observations that bear on the use of ITS are: (1) paralogous loci are not necessarily homogenized by concerted evolution (e.g., Suh et al., 1993); (2) in polyploids, interlocus concerted evolution may serve to homogenize homoeologous rDNA loci so that only a single parental type is retained, and this may occur differentially toward either parental type in different descendant lineages (Wendel, Schnabel, and Seelanan, 1995b; but see Waters and Schaal, 1996); and (3) rDNA pseudogenes may persist within the genome and may be preferentially sampled by PCR (Buckler and Holtsford, 1996a, b; Buckler, Ippolito, and Holtsford, 1997; Seelanan and Wendel, unpublished data). All three of the above phenomena may give rise to incongruence between the gene tree and the organismal tree, despite a well-resolved and robustly supported gene tree.
While interlocus gene conversion and recombination have been observed for low-copy nuclear-encoded gene families in plants (e.g., actins, Moniz de Sá and Drouin, 1996; heat-shock proteins, Waters, 1995; rbcS, Meagher, Berry-Lowe, and Rice, 1989; glutamine synthetase, Walker et al., 1995) the frequency of these events may depend on sequence conservation between paralogues (e.g., Walsh, 1987). Clearly, gene families that retain a large number of loci with strong sequence homologies are more likely to undergo interlocus concerted evolution and/or recombination than are smaller, more divergent gene families.
In our Southern hybridization experiments we used an AdhC-specific probe under high stringency conditions (65°C, 0.1 x SSC/0.5% SDS wash) and detected a single hybridizing band with multiple enzyme digestions for diploid taxa (data not shown) with the exception of G. raimondii (which showed a multibanded digestion pattern), and two hybridizing bands in the tetraploids. These Southern hybridization data, the recovery of two identical, paralogous gene trees, the genetic mapping data, and the high degree of sequence divergence between Gossypium Adh loci (16–25% in exons; introns are unalignable; Small and Wendel, unpublished data) provide strong evidence that homoeologues were sampled in the allotetraploids and that these sequences have been free from interlocus concerted evolution.
Conclusions
For phylogenetic analysis to accurately reconstruct organismal history (i.e., the species tree), orthologous sequences need to be compared (Wendel and Doyle, 1998). For this reason, among others, plant molecular systematists have relied primarily on cpDNA data because the chloroplast genome is nonrecombinant, generally uniparentally inherited, and "single copy." Because nuclear-encoded genes usually exist in gene families, each member of which exists in a minimum of two copies (in diploids), and because these multiple copies may experience recombination and gene conversion, demonstration of orthology is more complex. Methods for establishing orthology (whether explicitly stated or implied) vary considerably and include criteria such as overall sequence similarity; monophyly and systematic content, i.e., reconstruction of the expected phylogeny (Gaut et al., 1996); tissue specificity (Doyle, 1991); Southern hybridization data (Matthews and Sharrock, 1996); and, most convincingly, comparative genetic mapping data (Zhu et al., 1995; Cronn and Wendel, in press; this paper). Mapping data are not always available or readily obtainable, but inferences of orthology may be facilitated with only a modest investment of effort by Southern hybridization experiments conducted using locus-specific probes and multiple enzyme digestions.
By isolating and analyzing orthologous nuclear genes and a number of different cpDNA regions, we have shown that mutation rates in noncoding cpDNA do not appear high enough to provide sufficient phylogenetic information to resolve relationships of this recently radiated group of tetraploid cottons, despite sequencing over 6 kb of noncoding cpDNA. Consequently, it is difficult to draw conclusions regarding the relative utility of the various cpDNA noncoding regions used. It is clear, however, that levels of divergence vary among noncoding cpDNA sequences (as pointed out for cpDNA introns by Downie, Katz-Downie, and Cho, 1996) and our analyses tentatively identify the rpl16 intron and the trnT-trnL intergenic spacer as among the fastest evolving cpDNA regions (Table 4); this agrees with Downie, Katz-Downie, and Cho (1996), who suggested that rpl16 should be the fastest evolving cpDNA intron.
As an alternative source of phylogenetic evidence, orthologous, low-copy, nuclear-encoded loci, such as AdhC in Gossypium, may be isolated and may exhibit mutation rates up to six times higher than cpDNA noncoding sequences (Fig. 7). The use of nuclear-encoded genes for phylogeny reconstruction has both advantages and limitations. Primary among the advantages are the higher mutation rates and the ability to analyze large regions of sequence with interspersed coding and noncoding regions. The limitations, however, need to be considered. Demonstration of orthology among sequences is imperative and requires additional experimental effort. In addition, cognizance of issues such as coalescence and concerted evolution is required even when strict orthologues are recovered. Our study provides reason for both encouragement and caution in the continuing quest for additional and more informative tools for phylogenetic analysis in plants.
| FOOTNOTES |
|---|
2 Author for correspondence (jfw@iastate.edu).
| REFERENCES |
|---|
Baum, D. A., R. Small, and J. F. Wendel. 1998. Biogeography and floral evolution of Baobabs (Adansonia, Bombacaceae) as inferred from multiple data sets. Systematic Biology 47: 181–207.
Bayer, R. J., L. Hufford, and D. E. Soltis. 1996. Phylogenetic relationships in Sarraceniaceae based on rbcL and ITS sequences. Systematic Botany 21: 121–134.
Böhle, U.-R., H. Hilger, R. Cerff, and W. F. Martin. 1994. Non-coding chloroplast DNA for plant molecular systematics at the infrageneric level. In B. Schierwater, B. Streit, G. P. Wagner, and R. DeSalle [eds.], Molecular ecology and evolution: approaches and applications, 391–403. Birkhäuser Verlag, Basel.
Böhle, U.-R., H. Hilger, and W. F. Martin. 1997. Island colonization and evolution of the insular woody habit in Echium L. (Boraginaceae). Proceedings of the National Academy of Sciences, USA 93: 11740–11745.
Bousquet, J., S. H. Strauss, A. H. Doerksen, and R. A. Price. 1992. Extensive variation in evolutionary rate of rbcL gene sequences among seed plants. Proceedings of the National Academy of Sciences, USA 89: 7844–7848.
Bremer, K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42: 795–803.
Buckler, E. S., and T. P. Holtsford. 1996a. Zea systematics: ribosomal ITS evidence. Molecular Biology and Evolution 13: 612–622.
Buckler, E. S., and T. P. Holtsford. 1996b. Zea ribosomal repeat evolution and substitution patterns. Molecular Biology and Evolution 13: 623–632.
Buckler, E. S., A. Ippolito, and T. P. Holtsford. 1997. The evolution of ribosomal DNA: divergent paralogs and phylogenetic implications. Genetics 145: 821–832.
Chase, M. W., et al. 1993. Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical Garden 80: 528–580.
Clegg, M. T. 1997. Plant genetic diversity and the struggle to measure selection. Journal of Heredity 88: 1–7.
Clegg, M. T., B. S. Gaut, G. H. Learn, Jr., and B. R. Morton. 1994. Rates and patterns of chloroplast DNA evolution. Proceedings of the National Academy of Sciences, USA 91: 6795–6801.
Cronn, R. C., X. Zhao, A. H. Paterson, and J. F. Wendel. 1996. Polymorphism and concerted evolution in a tandemly repeated gene family: 5S ribosomal DNA in diploid and allopolyploid cottons. Journal of Molecular Evolution 42: 685–705.
Cronn, R. C., and J. F. Wendel. In press. Simple methods for isolating homoeologous loci from allopolyploid genomes. Genome.
Cummings, M. P., L. M. King, and E. A. Kellogg. 1994. Slipped-strand mispairing in a plastid gene: rpoC2 in grasses (Poaceae). Molecular Biology and Evolution 11: 1–8.
DeJoode, D. R., and J. F. Wendel. 1992. Genetic diversity and origin of the Hawaiian islands cotton, Gossypium tomentosum. American Journal of Botany 79: 1311–1319.
Demesure, B., B. Comps, and R. J. Petit. 1996. Chloroplast DNA phylogeography of the common beech (Fagus sylvatica L.) in Europe. Evolution 50: 2515–2520.
Dickie, S. L. 1996. Phylogeny and evolution in the Subfamily Opuntioideae (Cactaceae): insights from rpl16 intron sequence variation. Master's thesis, Iowa State University, Ames, IA.
Don, R. H., P. T. Cox, B. J. Wainwright, K. Baker, and J. S. Mattick. 1991. 'Touchdown' PCR to circumvent spurious priming during gene amplification. Nucleic Acids Research 19: 4008.
Downie, S. R., D. S. Katz-Downie, and K.-J. Cho. 1996. Phylogenetic analysis of Apiaceae subfamily Apioideae using nucleotide sequences from the chloroplast rpoC1 intron. Molecular Phylogenetics and Evolution 6: 1–18.
Doyle, J. J. 1991. Evolution of higher-plant glutamine synthetase genes: tissue specificity as a criterion for predicting orthology. Molecular Biology and Evolution 8: 366–377.
Elder, J. F., Jr., and B. J. Turner. 1995. Concerted evolution of repetitive DNA sequences in eukaryotes. Quarterly Review of Biology 70: 297–320.
Endrizzi, J. E., E. L. Turcotte, and R. J. Kohel. 1985. Genetics, cytology, and evolution of Gossypium. Advances in Genetics 23: 271–375.
Eyre-Walker, A., and B. S. Gaut. 1997. Correlated rates of synonymous site evolution across plant genomes. Molecular Biology and Evolution 14: 455–460.
Freeling, M., and D. C. Bennett. 1985. Maize Adh1. Annual Review of Genetics 19: 297–323.
Fryxell, P. A. 1992. A revised taxonomic interpretation of Gossypium L. (Malvaceae). Rheedea 2: 108–165.
Gaut, B. S., and M. T. Clegg. 1991. Molecular evolution of alcohol dehydrogenase 1 in members of the grass family. Proceedings of the National Academy of Sciences, USA 88: 2060–2064.
Gaut, B. S., and M. T. Clegg. 1993. Molecular evolution of the Adh1 locus in the genus Zea. Proceedings of the National Academy of Sciences, USA 90: 5095–5099.
Gaut, B. S., B. R. Morton, B. C. McCaig, and M. T. Clegg. 1996. Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proceedings of the National Academy of Sciences, USA 93: 10274–10279.
Gaut, B. S., S. V. Muse, W. D. Clark, and M. T. Clegg. 1992. Relative rates of nucleotide substitution at the rbcL locus of monocotyledonous plants. Journal of Molecular Evolution 35: 292–303.
Gaut, B. S., S. V. Muse, and M. T. Clegg. 1993. Relative rates of nucleotide substitution in the chloroplast genome. Molecular Phylogenetics and Evolution 2: 89–96.
Gielly, L., and P. Taberlet. 1994a. Chloroplast DNA polymorphism at the intrageneric level and plant phylogenies. Comptes Rendus des Seances, Academie des Sciences (Paris); Serie III Sciences de la vie/Life sciences 317: 685–692.
Gielly, L., and P. Taberlet. 1994b. The use of chloroplast DNA to resolve plant phylogenies: noncoding versus rbcL sequences. Molecular Biology and Evolution 11: 769–777.
Gielly, L., and P. Taberlet. 1996. A phylogeny of the European gentians inferred from chloroplast trnL (UAA) intron sequences. Botanical Journal of the Linnean Society 120: 57–75.