Systematics |
Department of Botany, 437 Hesler Biology, University of Tennessee, Knoxville, Tennessee 37996 USA
Received for publication January 13, 2004. Accepted for publication September 2, 2004.
| ABSTRACT |
|---|
Key Words: angiosperms cpDNA intergenic spacers introns molecular systematics noncoding chloroplast DNA phylogeny seed plants
| INTRODUCTION |
|---|
It has been clearly shown that the phylogenetic utility of different noncoding cpDNA regions within a given taxonomic group can vary tremendously (Sang et al., 1997; Small et al., 1998; Xu et al., 2000; Hartmann et al., 2002; Mast and Givnish, 2002; Cronn et al., 2002; Hamilton et al., 2003; Perret et al., 2003; Sakai et al., 2003), but choosing an appropriate cpDNA region for phylogenetic investigation is often difficult because of the paucity of information about the relative tempo of evolution among different noncoding cpDNA regions. Gielly and Taberlet (1994, p. 774) wrote: "it is not easy, for many reasons, to establish a rule for the choice of a particular region of the chloroplast genome for resolving phylogenies." While many authors have compared relative rates of evolution among a few noncoding regions (Sang et al., 1997; Small et al., 1998; Wang et al., 1999; Kusumi et al., 2000; Xu et al., 2000; Soltis et al., 2001; Cronn et al., 2002; Mast and Givnish, 2002; Hamilton et al., 2003; Perret et al., 2003; Sakai et al., 2003; Yamane et al., 2003), these studies are all of a relatively narrow phylogenetic context and there is no consensus as to variability in evolutionary rates among noncoding cpDNA regions across a broad phylogenetic range. To our knowledge, the only work that has attempted to compare levels of variation among several different noncoding cpDNA regions across a wide range of lineages is Aoki et al. (2003). However, their results are equivocal because of insufficient data. Therefore, for most investigators, choosing the appropriate region for phylogenetic investigation at a particular taxonomic level is often guesswork.
We present a comparison of 21 noncoding cpDNA regions sampled across all of the major lineages of phanerogams sensu APG II (2003) (Fig. 2). Sequence divergence and, more importantly, the amount of information offered to phylogenetic investigations by the various noncoding cpDNA regions are compared across lineages to assess the phylogenetic utility of each. In this investigation, we determine whether there is any predictable rate heterogeneity among different noncoding chloroplast regions that have been employed in the field of molecular systematics. We also discuss the often used noncoding cpDNA regions and present a general protocol for selecting potential noncoding cpDNA regions useful to systematic investigations.
| MATERIALS AND METHODS |
|---|
The atpB-rbcL spacer, perhaps one of the first intergenic spacers to be widely used, was excluded from our analysis because it is apparently of little infrageneric phylogenetic utility. It has consistently provided fewer variable characters compared to the entire trnK intron (Azuma et al., 2001), trnH-psbA (Azuma et al., 2001; Schönenberger and Conti, 2003; Hamilton et al., 2003), 5'rpS12-rpL20 (Hamilton et al., 2003), rpL16 (Renner, 1999; Schönenberger and Conti, 2003), rpS16 (Schönenberger and Conti, 2003), or trnL-trnL-trnF (Mayer et al., 2003).
Another well-characterized region found in the literature but excluded from this study is the rpoC1 intron. The rpoC1 intron was excluded here because it was shown to be less informative in cotton (Gossypium) than atpB-rbcL, trnL-trnF, ndhA, and rpL16 (Small et al., 1998) and it yielded fewer characters than rpL16, rpS16, and matK in a study of the Apiaceae subfamily Apioideae (Downie et al., 2001). Although this region appears to show appropriate levels of variation for studies above the family level, it was noted as being "largely inappropriate to infer phylogeny among closely related taxa" (Downie et al., 1996, p. 14).
For the sake of clarity, we wish to point out that it is important to use specific terminology to describe a region of interest. For example, authors have used "trnL-trnF" to mean either the trnL intron plus trnL-trnF spacer or just the trnL-trnF spacer. To be precise we will use, for example, "trnL-trnF" to indicate the intergenic spacer alone, but "trnL-trnL-trnF" to indicate the intron plus the intergenic spacer. In addition, because there are multiple tRNA genes in the chloroplast genome that encode tRNAs for the same amino acid, it is desirable to denote the specific tRNA gene by the addition of the anti-codon as a superscript. For example, one of the regions we found to be highly variable is the trnSGCU-trnGUUC intergenic spacer, which is different from the trnSUGA-trnGGCC intergenic spacer that lies within the trnSUGA-trnfMCAU region (Fig. 3).
Molecular techniques
Because the genes surrounding noncoding regions are highly conserved across seed plants (and especially within angiosperms), many polymerase chain reaction (PCR) primers for amplification and sequencing could be used across the diverse taxonomic groups of this study. Nearly all of the primer regions used here were published in other studies. However, alignment of GenBank sequences from a wide array of phanerogam lineages was used to determine the universality of the previously published primers, modify problematic primers, and aid in the construction of new primers. In some cases, we designed new primers for regions not previously surveyed, or to help sequence through difficult regions (e.g., polynucleotide runs). Unless otherwise noted, all of the primers listed below and in Fig. 3 were successfully used for both amplification and sequencing reactions in all taxonomic groups.
DNA was extracted from leaf tissue using either the DNeasy Plant Mini Kit (Qiagen, Valencia, California, USA) or the CTAB method (Doyle and Doyle, 1987). PCR was performed using either Eppendorf or MJ Research thermal cyclers in 20-50 µL volumes with the following reaction components: 1 µL template DNA (10-100 ng), 1X buffer (PanVera/TaKaRa, Madison, Wisconsin, USA or Promega, Madison, Wisconsin, USA), 200 µmol/L each dNTP, 3.0 mmol/L MgCl2, 0.1 µmol/L each primer, and 1.25 units Taq (PanVera/TaKaRa or Promega). Some reactions included bovine serum albumin with a final concentration of 0.2 µg/µL to improve amplification of difficult templates. In a few cases, 10 µmol/L tetramethyl ammonium chloride (TMACl) was included in the PCR solution because it is reported to reduce problems associated with long polynucleotide runs (Oxelman et al., 1997). However, we did not perform a comparative study to determine whether or not its presence actually improved our sequences. PCR amplification protocols and reaction conditions were continuously optimized throughout this investigation for all regions across all lineages. Material and methodological information and primer sequences specific to each of the different noncoding cpDNA regions are described below. All primer sequences are written in standard 5' to 3' orientation and their relative positions and orientations are illustrated in Fig. 3. A key to the shorthand for the following PCR parameters is as follows: initial denaturing step (temperature, time); number of repetitions of the amplification cycle [#x (denaturing temperature, time; primer annealing temperature, time; chain extension temperature, time)]; final extension step (temperature, time). All reactions ended with a final 4°C hold step.
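The shorthand key above maps onto a small data structure. The following is a minimal sketch and not part of the original methods: the names `Step`, `PcrProgram`, `parse_program`, and `total_seconds` are hypothetical, annealing temperature ranges (for example 50-56°C) are deliberately rejected rather than guessed, ramp rates are ignored, and the time estimate excludes ramping between temperatures and the final 4°C hold.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    temp_c: float    # step temperature in degrees Celsius
    seconds: float   # hold time converted to seconds


@dataclass
class PcrProgram:
    initial: Step             # initial denaturing step
    cycles: int               # number of repetitions of the cycle
    cycle_steps: List[Step]   # two or three steps inside the parentheses
    final: Optional[Step]     # final extension, absent in some programs


_SECONDS_PER_UNIT = {"s": 1.0, "min": 60.0}
_STEP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*°C,\s*(\d+(?:\.\d+)?)\s*(min|s)\b")
_PROGRAM_RE = re.compile(r"(.+?);\s*(\d+)x\s*\((.+)\)(?:;\s*(.+))?")


def parse_step(text: str) -> Step:
    """Parse 'temperature, time'; trailing notes such as ramp rates are ignored."""
    match = _STEP_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse step: {text!r}")
    temp, amount, unit = match.groups()
    return Step(float(temp), float(amount) * _SECONDS_PER_UNIT[unit])


def parse_program(text: str) -> PcrProgram:
    """Parse 'initial; Nx (step; step[; step])[; final]'."""
    match = _PROGRAM_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"cannot parse program: {text!r}")
    initial, cycles, body, final = match.groups()
    return PcrProgram(
        initial=parse_step(initial),
        cycles=int(cycles),
        cycle_steps=[parse_step(part) for part in body.split(";")],
        final=parse_step(final) if final else None,
    )


def total_seconds(program: PcrProgram) -> float:
    """Sum of programmed hold times only (no ramping, no 4°C hold)."""
    per_cycle = sum(step.seconds for step in program.cycle_steps)
    tail = program.final.seconds if program.final else 0.0
    return program.initial.seconds + program.cycles * per_cycle + tail


program = parse_program(
    "80°C, 5 min; 30x (94°C, 30 s; 55°C, 30 s; 72°C, 2 min); 72°C, 5 min"
)
print(total_seconds(program) / 60, "minutes of programmed hold time")
```

Because the cycle is stored as a list, the same structure covers two-step programs in which annealing and extension share one temperature.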
PCR products were purified prior to sequencing with either the QIAquick PCR Purification Kit (Qiagen, Valencia, California, USA) or ExoSAP-IT (USB, Cleveland, Ohio, USA). All DNA sequencing was performed with the ABI Prism BigDye Terminator Cycle Sequencing Ready Reaction Kit, v. 2.0 or 3.1 (Perkin-Elmer/Applied Biosystems, Foster City, California, USA), using the thermal cycle parameters 80°C, 5 min; 30x (96°C, 10 s; 50°C, 5 s; 60°C, 4 min). The products were electrophoresed and detected on an ABI Prism 3100 automated sequencer (University of Tennessee Molecular Biology Resource Facility). All sequences have been deposited in GenBank, and accession numbers are provided in Table 1.
trnHGUG-psbA
The PCR parameters for this region were 80°C, 5 min; 35x (94°C, 30 s; 50-56°C, 30 s; 72°C, 1 min); 72°C, 10 min with primers trnHGUG (CGC GCA TGG TGG ATT CAC AAT CC) (Tate and Simpson, 2003) and psbA (GTT ATG CAT GAA CGT AAT GCT C) (Sang et al., 1997). This region amplified and sequenced easily for all lineages. Because the average length of this region is relatively short (about 500 bp), only the trnH primer was used in sequencing in most cases.
psbA-3'trnKUUU-[matK]-5'trnKUUU
These regions were the most problematic of any in this investigation. A variety of previously published and newly designed primers were required to amplify and sequence these regions, and very few completely universal primers were identified. We included only the noncoding portions of this region: psbA-3'trnK spacer, 3'trnK-matK intron, and matK-5'trnK intron. The matK gene was excluded primarily because it is a coding region, but also because of the inefficiency in designing the many primers that would be necessary to obtain this region for all lineages. In many cases, after amplifying the entire trnK-matK-trnK fragment, we were unable to sequence the PCR product with either the amplification or internal primers. However, if the region was PCR amplified in smaller sections using internal primers we were able to sequence these amplicons using the same primers that had previously failed. This phenomenon was observed independently in the laboratories of both E. E. Schilling and R. L. Small, as well as by J. Panero (University of Texas, personal communication) and R. Rapp (Iowa State University, personal communication) who suggested that dimethylsulfoxide might help during sequencing. Different primer combinations were often required for different taxa. The gymnosperm lineage is not represented in this data set because gymnosperm-specific primers were not obtained (Kusumi et al., 2000). The primers used in this study include: psbA5'R (AAC CAT CCA ATG TAA AGA CGG TTT), ALS-11F (ATC TTT CGC ATT ATT ATA G) (M. Nepokroeff, University of South Dakota, personal communication), matKAR (CTG TTG ATA CAT TCG A) (Kazempour Osaloo et al., 1999), matKM (TCG ACT TTC TGG GCT ATC) (Tate and Simpson, 2003), matK1 (AAC TAG TCG GAT GGA GTA G) (Johnson and Soltis, 1994), matK5 (TGT CAT AAC CTG CAT TTT CC) (Panero and Crozier, 2003), matK5'R (GCA TAA ATA TAY TCC YGA AAR ATA AGT GG), matK6 (TGG GTT GCT AAC TCA ATG G) (Johnson and Soltis, 1994), matK8F (TCG ACT TTC TTG TGC TAG AAC TTT) (Steele and Vilgalys, 1994), matK5PSIF (CTA TGG CTC CAA TTC TGG T), matK5PSIR (CCG CAT CAG GCA CTA ATC TA).
Hibiscus and Minuartia protocol: Amplification of the matK-5'trnK spacer used the matK6 and matK5'R primers with the PCR parameters 80°C, 5 min; 35x (95°C, 1 min; 50°C, 1 min with a ramp of 0.3°C/s; 65°C, 5 min); 65°C, 5 min. This spacer was sequenced with the matK6 primer. The psbA-trnK-matK spacers were amplified using the matKM (Hibiscus) or ALS-11F (Minuartia) and psbA5'R primers using the parameters 80°C, 5 min; 30x (94°C, 30 s; 50°C, 30 s; 72°C, 2 min); 72°C, 5 min. This region was sequenced using the psbA5'R primer.
Magnolia, Prunus, and Gratiola protocol: Amplification of the matK-5'trnK spacer used the matK6 and matK5 primers with the parameters 80°C, 5 min; 30-35x (94°C, 1 min; 50°C, 1 min; 72°C, 1.5 min); 72°C, 5 min. Amplification of the psbA-3'trnK-matK spacers was done using the matK8F and psbA5'R primers with the same PCR protocol.
Trillium-Pseudotrillium protocol: Amplification of the matK-5'trnK spacer used the matK6 and matKAR primers with the parameters 80°C, 5 min; 30-35x (94°C, 1 min; 50°C, 1 min; 72°C, 2 min); 72°C, 5 min. Amplification of the psbA-3'trnK-matK spacers used the matK8F and psbA5'R primers with the same PCR parameters. Because of two poly-A/T runs, matK5PSIF and matK5PSIR were used for internal sequencing.
Solanum, Carphephorus-Trilisa, Eupatorium protocol: Amplification of the matK-5'trnK spacer used the matK6 and matK5 primers with the parameters 80°C, 5 min; 35x (95°C, 1 min; 50°C, 1 min; 65°C, 5 min); 65°C, 5 min. Both primers were also used for sequencing reactions. The psbA-3'trnK-matK spacers were amplified with the psbA5'R and ALS-11F for Solanum americanum and S. physalifolium, matKM for S. ptychanthum, and matK8F for Eupatorium and Carphephorus-Trilisa with the above parameters. All were sequenced using only the psbA5'R primer.
rpS16
This region was amplified using the parameters 80°C, 5 min; 35x (94°C, 30 s; 50-55°C, 30 s; 72°C, 1 min); 72°C, 5 min, with primers rpS16F (AAA CGA TGT GGT ARA AAG CAA C) and rpS16R (AAC ATC WAT TGC AAS GAT TCG ATA), which are modified from Oxelman et al. (1997). Both primers were also used in sequencing reactions. This region amplified and sequenced easily for all angiosperm taxa and two of the three gymnosperm representatives with minimal troubleshooting. Despite trying several different PCR programs, annealing temperatures, and MgCl2 concentrations, we were unable to amplify this region for Cryptomeria japonica.
trnSGCU-trnGUUC-trnGUUC
For this region, three different protocols were used and in most cases the trnS-trnG spacer and the trnG intron were amplified as one fragment. For most taxa protocol 1 was successful. Both protocols 1 and 2 used the primers trnSGCU (AGA TAG GGA TTC GAA CCC TCG GT) and 3'trnGUUC (GTA GCG GGA ATC GAA CCC GCA TC). Additional primers 5'trnG2G (GCG GGT ATA GTT TAG TGG TAA AA) (toward trnG) and 5'trnG2S (TTT TAC CAC TAA ACT ATA CCC GC) (toward trnS) were sometimes used to amplify only the trnG intron, and for sequencing longer fragments and templates with a difficult poly-A repeat.
Protocol 1: This is a two-step PCR protocol with primer annealing and chain extension occurring at the same temperature, using the parameters 80°C, 5 min; 30x (95°C, 1 min; 66°C, 4 min); 66°C, 10 min. A final MgCl2 concentration of 1.5 mmol/L (rather than 3.0 mmol/L) was used.
Protocol 2: This protocol was used when amplification with protocol 1 was problematic. The parameters are 80°C, 5 min; 35x (95°C, 1 min; 50°C, 1 min with a ramp of 0.3°C/s; 65°C, 5 min); 65°C, 10 min. This protocol always coamplifies the trnSUGA and trnGGGC part of the trnSUGA-trnfMCAU spacer. It yields two equal-intensity but well-separated bands in a test gel, the larger of which was always the target trnSGCU-trnGUUC. The desired fragment was excised from the gel and cleaned with a QIAquick Gel Extraction Kit. Because of the sequence similarity of these two different trnS and trnG genes, primer design was difficult and the protocols needed to be very specific to amplify only the correct region.
Protocol 3: Independent inversions in monocots (Hiratsuka et al., 1989) and Asteraceae (Jansen and Palmer, 1987) interrupt the trnSUGA-trnGGGC spacer preventing amplification. However, using the 3'trnG and 5'trnG2G primers, we successfully amplified and sequenced the trnG intron for Trillium-Pseudotrillium, Carphephorus-Trilisa, and Eupatorium. The amplification parameters for the trnG intron are 80°C, 5 min; 35x (95°C, 1 min; 50°C, 1 min with a ramp of 0.3°C/s; 65°C, 5 min); 65°C, 5 min.
rpoB-trnCGCA
This region amplified easily for most angiosperm taxa using primers trnCGCAR (CAC CCR GAT TYG AAC TGG GG) and rpoB (CKA CAA AAY CCY TCR AAT TG), modified from Ohsako and Ohnishi (2000). The PCR parameters for this region are 80°C, 5 min; 30-35x (96°C, 1 min; 50-57°C, 2 min; 72°C, 3 min); 72°C, 5 min. For unknown reasons, we were unable to amplify this region for Taxodium, Glyptostrobus, or Cryptomeria.
trnCGCA-ycf6-psbM-trnDGUC
Two different, but equally successful, protocols were used to amplify this region. For Gratiola, Hibiscus, Magnolia, Minuartia, Prunus, and Taxodium, we amplified the entire approximately 3-kb trnC to trnD fragment. For Carphephorus-Trilisa, Eupatorium, Solanum, and Trillium-Pseudotrillium, we amplified the fragments trnC-psbM and ycf6-trnD. Both protocols used the same PCR parameters, which were 80°C, 5 min; 35x (94°C, 1 min; 50-55°C, 1 min; 72°C, 3.5 min); 72°C, 5 min. PCR and sequencing primers included trnCGCAF (CCA GTT CRA ATC YGG GTG) (modified from Demesure et al., 1995), ycf6R (GCC CAA GCR AGA CTT ACT ATA TCC AT), ycf6F (ATG GAT ATA GTA AGT CTY GCT TGG GC), psbMR (ATG GAA GTA AAT ATT CTY GCA TTT ATT GCT), psbMF (AGC AAT AAA TGC RAG AAT ATT TAC TTC CAT), Taxodium-psbMF2 (CTT TTG TTC GGG TGA GAA AGG), and trnDGUCR (GGG ATT GTA GYT CAA TTG GT) (modified from Demesure et al., 1995). This region required only moderate troubleshooting. After trying several different PCR modifications, we were unable to obtain the psbM-trnD segment for Carphephorus-Trilisa. In nearly all surveyed lineages, a poly-A/T run exists between psbM and trnD, but created sequencing difficulties in only a few cases.
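Several of the internal primers listed for this region come in forward/reverse pairs that anneal to the same site on opposite strands. As a hedged illustration (the helper names `reverse_complement` and `are_reverse_complements` are hypothetical, and the check is a plain string comparison using the standard IUPAC ambiguity codes rather than any thermodynamic model), the pairing can be verified directly from the sequences as printed:

```python
# Complement table for the IUPAC nucleotide codes: besides A/C/G/T,
# R (A or G) pairs with Y (C or T), K (G or T) with M (A or C),
# B with V, D with H, and S, W, and N are their own complements.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(primer: str) -> str:
    """Return the reverse complement of a primer written 5' to 3'.

    Spaces between codon-sized groups, as used in the text, are removed.
    """
    sequence = primer.replace(" ", "").upper()
    return sequence.translate(_COMPLEMENT)[::-1]


def are_reverse_complements(primer_a: str, primer_b: str) -> bool:
    """True if the two primers cover exactly the same site on opposite strands."""
    return reverse_complement(primer_a) == primer_b.replace(" ", "").upper()


# Primer sequences copied from the text above.
ycf6R = "GCC CAA GCR AGA CTT ACT ATA TCC AT"
ycf6F = "ATG GAT ATA GTA AGT CTY GCT TGG GC"
psbMR = "ATG GAA GTA AAT ATT CTY GCA TTT ATT GCT"
psbMF = "AGC AAT AAA TGC RAG AAT ATT TAC TTC CAT"

print(are_reverse_complements(ycf6R, ycf6F))
print(are_reverse_complements(psbMR, psbMF))
```

The same check applies to the 5'trnG2G and 5'trnG2S pair described for the trnS-trnG-trnG region.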
trnDGUC-trnTGGU
This spacer amplified easily for most taxa using Demesure et al. (1995) primers trnDGUCF (ACC AAT TGA ACT ACA ATC CC) and trnTGGU (CTA CCA CTG AGT TAA AAG GG). The PCR parameters for this region are 80°C, 5 min; 30x (94°C, 45 s; 52-58°C, 30 s; 72°C, 1 min); 72°C, 5 min. Internal sequencing primers trnEUUC (AGG ACA TCT CTC TTT CAA GGA G) and trnYGUA (CCG AGC TGG ATT TGA ACC A) were created because of poly-A/T repeats that were difficult to sequence and the atypically large size of the region in a few taxa. A large inversion in the Asteraceae, excluding the Barnadesieae (Jansen and Palmer, 1987), interrupts the trnD-trnT spacer precluding its use. This region also appears to be absent in the Pinus chloroplast genome (Wakasugi et al., 1994), which may explain why we were unable to amplify this region for Taxodium, Glyptostrobus, or Cryptomeria.
trnSUGA-trnfMCAU
The amplification parameters for this region are 80°C, 5 min; 30x (94°C, 30 s; 55°C, 30 s; 72°C, 2 min); 72°C, 5 min, using Demesure et al. (1995) primers trnSUGA (GAG AGA GAG GGA TTC GAA CC) and trnfMCAU (CAT AAC CTT GAG GTC ACG GG). This region amplified and sequenced easily for most taxa with minimal troubleshooting.
As explained in the trnSGCU-trnGUUC-trnGUUC region above, trnGGCC occurs between trnSUGA-trnfMCAU. Because there is so little difference between the sequences of these trnS and trnG genes, the two independent trnS-trnG regions will coamplify under certain amplification parameters. However, a seemingly counterintuitive advantage to such sequence similarity is that primer 3'trnGUUC (and possibly primers 5'trnG2G and 5'trnG2S) can be used as an internal sequencing primer for the trnSUGA-trnfMCAU region.
trnSGGA-rpS4-trnTUGU-trnLUAA-trnLUAA-trnFGAA
Because of an initial lack of communication, we PCR amplified several of the taxa using different primer combinations, all of which worked well. However, for all of the lineages of angiosperm taxa, this region was easily amplified in two fragments. The first, trnS-5'trnL, was amplified using primers trnSGGA (TTA CCG AGG GTT CGA ATC CCT C) and 5'trnLUAAR (TabB) (TCT ACC GAT TTC GCC ATA TC) (Taberlet et al., 1991) with the parameters 96°C, 5 min; 35x (96°C, 1 min; 50-55°C, 2 min; 72°C, 2.5 min); 72°C, 5 min. The second fragment, trnL5'-trnF, was amplified using primers trnL5'UAAF (TabC) (CGA AAT CGG TAG ACG CTA CG) (Taberlet et al., 1991) and trnFGAA (TabF) (ATT TGA ACT GGT GAC ACG AG) (Taberlet et al., 1991) with the parameters 80°C, 5 min; 35x (94°C, 1 min; 50°C, 1 min; 72°C, 2 min); 72°C, 5 min. Several internal sequencing primers were used and included rpS4R2 (CTG TNA GWC CRT AAT GAA AAC G), trnTUGUR (AGG TTA GAG CAT CGC ATT TG), trnTUGUF (TabA) (CAT TAC AAA TGC GAT GCT CT) (Taberlet et al., 1991), trnTUGU2F (CAA ATG CGA TGC TCT AAC CT) (trnA2 of Cronn et al., 2002), 3'trnLUAAR (TabD) (GGG GAT AGA GGG ACT TGA AC) (Taberlet et al., 1991), and 3'trnLUAAF (TabE) (GGT TCA AGT CCC TCT ATC CC) (Taberlet et al., 1991).
5'rpS12-rpL20
This region amplified and sequenced easily for almost all taxa using primers 5'rpS12 (ATT AGA AAN RCA AGA CAG CCA AT) and rpL20 (CGY YAY CGA GCT ATA TAT CC), both modified from Hamilton (1999a). Amplification parameters were 96°C, 5 min; 35x (96°C, 1 min; 50-55°C, 1 min; 72°C, 1 min); 72°C, 5 min. Although amplification of this region was successful for Trillium ovatum, sequencing reactions using either primer failed repeatedly, even for several different accessions of this species.
psbB-psbH
This region amplified and sequenced easily for all taxa using primers psbB (TCC AAA AAN KKG GAG ATC CAA C) and psbH (TCA AYR GTY TGT GTA GCC AT), both modified from Hamilton (1999a). Amplification parameters were 80°C, 5 min; 35x (94°C, 30 s; 57-60°C, 30 s; 72°C, 1 min); 72°C, 5 min.
rpL16
This region amplified and sequenced easily for all taxa with minimal troubleshooting using primers rpL16F71 (GCT ATG CTT AGT GTG TGA CTC GTT G) and rpL16R1516 (CCC TTC ATT CTT CCT CTA TGT TG) (Small et al., 1998). Amplification parameters were 80°C, 5 min; 35x (95°C, 1 min; 50°C, 1 min with a ramp of 0.3°C/s; 65°C, 5 min); 65°C, 4 min.
cpDNA compilation and analysis
Sequencher 3.0 (Gene Codes Corp., 1998) was used to compile contiguous sequences (contigs) of each accession from electropherograms generated on the automated sequencer. Positions of coding and noncoding (gene, exon, and intron) borders were determined by comparison with either Arabidopsis (NC_000932), Lotus (NC_001874), or Nicotiana (NC_002694) entire cpDNA sequences in GenBank. Terminal coding regions and, in a few rare cases, unreadable ends of the PCR amplicons were excluded from the contigs. Small coding regions within some of the noncoding regions (e.g., trnEUUC and trnYGUA within the trnDGUC-trnTGGU spacer) were not excluded from the contigs. Sequences of each of the three-species groups were aligned using Clustal X (Thompson et al., 2001) and manually corrected using MacClade v. 4.0 to produce an alignment with the fewest number of changes (indels or nucleotide substitutions). All polymorphic sites found in the three-species groups were rechecked against the original electropherograms. Alignments are available upon request from J. Shaw, E. B. Lickey, or R. L. Small.
The number of nucleotide substitutions, indels, and inversions (hereafter referred to collectively as Potentially Informative Characters or PICs) between the two ingroup species and between either ingroup species and the outgroup species were tallied for each noncoding cpDNA region in each of the lineages. Because indels have been shown to be prevalent and often phylogenetically informative (Golenberg et al., 1993; Morton and Clegg, 1993; Gielly and Taberlet, 1994), they were scored in this study, as were inversions. Indels, any nucleotide substitutions within the indels, and inversions were scored as independent, single characters. We then estimated the proportion of observed mutational events for each noncoding cpDNA region using a modified version of the formula used in O'Donnell (1992) and Gielly and Taberlet (1994). The proportion of mutational events (or % variability) = [(NS + ID + IV) / L] × 100, where NS = the number of nucleotide substitutions, ID = the number of indels, IV = the number of inversions, and L = the total sequence length.
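The two quantities defined here, the PIC count and the percent variability, are simple to compute once the events have been tallied. The sketch below is illustrative only: the function names are hypothetical and the counts in the usage example are invented for demonstration, not taken from Table 2.

```python
def pic_count(substitutions: int, indels: int, inversions: int) -> int:
    """Potentially Informative Characters: NS + ID + IV.

    Each indel and each inversion counts as one character, whatever its length.
    """
    return substitutions + indels + inversions


def percent_variability(
    substitutions: int, indels: int, inversions: int, length_bp: int
) -> float:
    """Proportion of mutational events: [(NS + ID + IV) / L] x 100."""
    if length_bp <= 0:
        raise ValueError("sequence length must be positive")
    return 100.0 * pic_count(substitutions, indels, inversions) / length_bp


# Invented example: 10 substitutions, 3 indels, 1 inversion in a 700-bp region.
print(pic_count(10, 3, 1))
print(percent_variability(10, 3, 1, 700))
```

Keeping the two functions separate reflects that a short region can score high on percent variability while still contributing few characters in total.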
Assessment of a correlation between variability and length
To assess whether or not the length of the different noncoding cpDNA regions accounts for the number of PICs observed within a particular region, we used a simple regression analysis. Because of the variation in phylogenetic distance between species in the different lineages we could not combine all lineages in a single regression. Instead, we performed 10 separate regressions (one per lineage) and calculated r² for each to determine how much of the variation seen in the PIC values is explained by the length of the region.
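A per-lineage regression of this kind can be sketched as follows. This is an assumption-laden illustration rather than the authors' analysis: the function names are hypothetical, the coefficient of determination is computed with the textbook formula for simple linear regression, and the lineage data shown are invented placeholders.

```python
from typing import Dict, List, Sequence, Tuple


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Coefficient of determination for the least-squares line of y on x.

    For simple linear regression this equals Sxy^2 / (Sxx * Syy).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return sxy ** 2 / (sxx * syy)


def r_squared_by_lineage(
    data: Dict[str, List[Tuple[float, float]]]
) -> Dict[str, float]:
    """One regression per lineage; each value is a list of (length_bp, PICs)."""
    return {
        lineage: r_squared([p[0] for p in points], [p[1] for p in points])
        for lineage, points in data.items()
    }


# Invented placeholder values: (region length in bp, PIC count).
example = {
    "lineage A": [(300, 4), (500, 9), (750, 8), (1200, 20)],
    "lineage B": [(280, 2), (520, 3), (800, 12), (1100, 10)],
}
for lineage, value in r_squared_by_lineage(example).items():
    print(lineage, round(value, 2))
```

Running the regressions separately, as described in the text, avoids mixing lineages whose species pairs differ in phylogenetic distance.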
Cost/benefit analysis of coamplifiable noncoding cpDNA regions
In the above analyses, each noncoding region was treated individually. However, several adjacent, shorter, noncoding cpDNA regions may be coamplified as a single contiguous unit. We surveyed several cpDNA region combinations to assess the potential phylogenetic utility of coamplifiable regions from a cost/benefit perspective. For example, the trnL intron and trnL-trnF spacer are often coamplified, and most of the time these two regions are sequenced with the same two primers that were used in PCR (TabC and TabF). From a cost/benefit perspective, it is beneficial to amplify and sequence both of these regions together instead of separately, because doing so maximizes the number of characters obtained per two sequencing reactions. Our sequencing reactions always yielded easily readable sequence data of 800 bp from a single-primer sequencing reaction. We therefore limited what we categorize as "coamplifiable" regions to those whose average total length is less than approximately 1500 bp and that can be sequenced entirely with two sequencing reactions. These coamplifiable regions include psbA-3'trnK-matK, trnS-trnG-trnG, trnC-ycf6-psbM, ycf6-psbM-trnD, rpS4-trnT-trnL, and trnL-trnL-trnF.
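The selection rule described above can be written as a small filter. This is a hedged sketch: the 1500-bp ceiling and the two sequencing reactions come from the text, while the function names and the lengths and PIC counts in the example are hypothetical.

```python
from typing import Sequence

MAX_TOTAL_BP = 1500        # approximate ceiling stated in the text
SEQUENCING_REACTIONS = 2   # one reaction from each end of the amplicon


def is_coamplifiable(
    region_lengths_bp: Sequence[int], max_total_bp: int = MAX_TOTAL_BP
) -> bool:
    """True if adjacent regions together stay under the length ceiling."""
    return sum(region_lengths_bp) < max_total_bp


def pics_per_reaction(
    region_pics: Sequence[int], reactions: int = SEQUENCING_REACTIONS
) -> float:
    """Benefit measure: characters obtained per sequencing reaction."""
    if reactions <= 0:
        raise ValueError("at least one sequencing reaction is required")
    return sum(region_pics) / reactions


# Invented example: an intron of 500 bp with 6 PICs next to a
# spacer of 400 bp with 8 PICs, sequenced together from both ends.
lengths = [500, 400]
pics = [6, 8]
if is_coamplifiable(lengths):
    print(pics_per_reaction(pics), "PICs per sequencing reaction")
```

Sequencing the same two regions separately would need more reactions for the same characters, which is the cost side of the comparison.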
Assessment of the predictive value of a three-species sample study
Our inferences from these data rely on the assumption that a sample of three species is predictive of the overall levels of variation that will be found in an entire data set. To test the predictive power of a three-species survey we compared the number of PICs among the three species with the respective complete data sets of 18 taxa of Prunus sect. Prunocerasus (Shaw and Small, 2004) and nine taxa of Hibiscus sect. Furcaria (R. L. Small et al., unpublished data), each with a single outgroup. The comparison of the Prunus data sets was made with introns trnL, trnG, rpS16, and rpL16 and intergenic spacers trnL-trnF, trnH-psbA, and trnS-trnG, and the comparison of the Hibiscus data sets was made with introns rpS16, rpL16, and trnG and intergenic spacers trnD-trnT, rpoB-trnC, trnH-psbA, and trnS-trnG. Regression lines were calculated and their slopes were compared on a scatterplot for each data set comparison.
| RESULTS |
|---|
We did not apply statistical analyses to these data because of potentially different rates of evolution among the different lineages, the incongruent phylogenetic distances between the species in each lineage, and the exclusion of some regions because of structural rearrangement of the cpDNA molecule or PCR amplification or sequencing difficulties. Thus, the following discussion is based on our qualitative interpretation of the results, which are compiled in Table 2 and simplified in Fig. 4.
| DISCUSSION |
|---|
The initial goal of this investigation was to provide a comparison of noncoding cpDNA regions to see if there are any that reliably yield a greater number of variable characters (PICs) at low taxonomic levels, and thus would be of greater value to systematic studies than the often used trnL-trnL-trnF or trnK-matK-trnK regions. To do so we used three-species surveys representing most of the major phylogenetic lineages of phanerogams (sensu APG II, 2003). To test the predictive power of a three-species survey we compared the surveys of seven regions in Prunus and eight regions in Hibiscus with their respective complete data sets (Fig. 8). Figure 8 shows that as the number of PICs in a survey of three species increases, so will the actual number of variable characters in a complete data set generated from those regions. Therefore, a survey of three species is highly predictive of the amount of information that a noncoding cpDNA region might offer to a phylogenetic investigation and is an effective means of comparison between different noncoding cpDNA regions.
Most investigators, when comparing different DNA regions, have used either of two metrics that are not wholly separate. One tallies the number of variable characters including nucleotide substitutions, indels, and inversions (PICs), while the other calculates the percent variability, or percent divergence of a region, by dividing the total number of variable characters by the total length of the region. It is necessary to emphasize that, from the viewpoint of systematists, the total number of variable characters offered by a region is more important than the percent variability. A highly variable but extremely short region may not provide a sufficient number of variable characters with which to generate a resolved phylogeny. As systematists, we are interested in obtaining the greatest number of variable characters per sequencing reaction, arguably the costliest portion of sequence acquisition, where current techniques and equipment allow for 600-800 bp of easily readable nucleotides per reaction. Therefore, it would be ideal to use cpDNA regions that combine high variability in fragments of approximately 700-1500 bp that can easily be sequenced with one or two primers, ideally the original amplification primers. To show that the number of PICs offered to systematic studies is not due solely to total length of a region, we regressed PICs on length of the region for all of the regions surveyed in this study (Fig. 6). It is apparent that while the length of the region accounts for some proportion of the PIC value, there is a large amount of unaccountable variation in this trend. Within Prunus, for example, regions that are between 261 and 307 bp contain between 2 and 14 PICs, while regions that are from 709 to 783 bp contain between 2 and 34 PICs, with the largest region not accounting for the greatest PIC value.
Our results clearly show that a disparity exists in the information offered to phylogenetic investigations by different noncoding cpDNA regions. Additionally, we show that the most widely used noncoding cpDNA regions in infrageneric systematic investigations, namely the trnL-trnL-trnF and trnK-matK intron regions, consistently provide fewer PICs than several other choices, such as trnS-trnG-trnG, trnC-ycf6-psbM, trnD-trnT, trnT-trnL, and rpoB-trnC.
Discussion of each of the regions
Below is a summary of each of the 21 different noncoding cpDNA regions that we have surveyed in this study including a brief history of their utility in previous studies and an assessment of their utility based on the results of this study. Because there is no intuitively straightforward way to rank each of the regions, we have divided the regions into three tiers based on their overall qualitative usefulness (Fig. 5). Tier 1 contains five regions that on average consistently provide the greatest number of PICs across all phylogenetic lineages. Tier 2 includes the next five regions that may provide some useful information, but they may be less than optimal in providing the number of characters needed for a well-resolved phylogenetic study. Tier 3 comprises those regions that consistently provide the fewest PICs across all lineages and are therefore not recommended for low-level studies because better noncoding cpDNA choices exist. Ranking these regions in three tiers offers information relevant to studies focused on very low taxonomic levels where researchers might opt to choose one or more regions that likely contain the highest number of PICs. In addition, this ranking scheme is also useful in providing information to researchers who may wish to couple quickly evolving regions with more slowly evolving Tier 2 or Tier 3 regions, which might allow for resolution within the clade of interest in addition to confident alignment with an outgroup (Asmussen and Chase, 2001).
trnHGUG-psbA (Tier 3)
Inquiry into the trnH-psbA intergenic spacer began with Aldrich et al. (1988) who showed that indels were prevalent in this region, even between closely related species. An early study that showed this region to be of value to systematics is Sang et al. (1997) who noted that it was highly variable compared to matK and trnL-trnF. The utility of trnH-psbA was also shown by Hamilton (1999b) who used it for an intraspecific study within Corythophora (Lecythidaceae). Subsequent to these two studies, several investigators have used this region to study closely related genera and species (Azuma et al., 1999; Chandler et al., 2001; Mast and Givnish, 2002; Fukuda et al., 2003; Miller et al., 2003; Tate and Simpson, 2003). It has also been used in an intraspecific investigation (Holdregger and Abbott, 2003). At higher levels, trnH-psbA has proven to be largely unalignable (Laurales: Renner, 1999; Saxifragaceae: Soltis et al., 2001; Lecythidaceae: Hamilton et al., 2003). In a study of the relative rates of nucleotide and indel evolution, Hamilton et al. (2003) showed trnH-psbA to be more divergent, based on percent variability, than trnS-trnG, psbB-psbH, atpB-rbcL, trnL-trnF, and 5'rpS12-rpL20. Although studies have shown that trnH-psbA contains a very high percentage of variable characters (Azuma et al., 2001; Hamilton et al., 2003), this spacer is usually coupled with other regions because it is comparatively short and may not yield enough characters with which to build a well-resolved phylogeny.
The average length of trnH-psbA is 465 bp, and it ranges from 198 to 1077 bp. Based on our data, and data of the previous workers listed above, the 1077-bp length found in Trillium-Pseudotrillium is atypical. Although this spacer is the second-most variable on a percent basis, we include it in Tier 3 because its relatively short length provides few overall characters. However, it amplified and sequenced easily across all lineages and can be sequenced with only one primer in most taxa. It is also worth noting that the ends of this spacer, roughly 75 bp from either gene, are relatively conserved compared to the middle portion of this spacer, which is highly indel prone (Aldrich et al., 1988), and contains several poly-A/T runs. Most of the numerous observed indels were relatively short, but a 132-bp indel was observed among the Hibiscus accessions. Among more distantly related taxa, this indel-prone middle region may generate a relatively high amount of homoplasy due to apparent indel "hot spots" with numerous, repeating, and overlapping indels.
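The trade-off between percent variability and region length can be made concrete with a short arithmetic sketch. It assumes that the expected number of variable characters is simply length multiplied by the fraction of variable sites. Only the 465-bp average length comes from the text above; the variability fractions and the 1200-bp comparison region are hypothetical values chosen for illustration.

```python
def expected_variable_sites(length_bp, fraction_variable):
    """Approximate count of variable characters as length x fraction of variable sites."""
    return length_bp * fraction_variable


# 465 bp is the average trnH-psbA length reported above;
# the 5% and 3% variability figures are hypothetical.
short_variable_region = expected_variable_sites(465, 0.05)

# A hypothetical longer region (1200 bp) with lower percent variability.
long_conserved_region = expected_variable_sites(1200, 0.03)

# The longer region yields more characters despite being less variable per site.
longer_region_wins = long_conserved_region > short_variable_region
```

Under these assumed values the longer, less variable region still supplies more characters in total, which is the reasoning behind placing a short but highly variable spacer in a lower tier.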
psbA-3'trnK(UUU)-[matK]-5'trnK(UUU) (Tier 3 + Tier 3 + Tier 3)
The matK gene region (trnK-matK-trnK) or some portion of it was first employed in intrafamilial phylogenetic studies by Steele and Vilgalys (1994) and Johnson and Soltis (1994). Since then, this region has been a primary tool in phylogenetic investigations below the family level, but it has also been suggested as an effective tool above the familial level (Hilu and Liang, 1997; Hilu et al., 2003). The frequency of infrageneric phylogenetic use of this region is second only to trnL-trnL-trnF, representing 22% vs. 55%, respectively, of studies in 2002 (Fig. 1). Several studies have used the entire trnK-matK-trnK region (e.g., Johnson and Soltis, 1994; Sang et al., 1997; Hardig et al., 2000; Miller and Bayer, 2001), while most have carved out various portions depending on variable primer success and availability. Additionally, some investigators have used the intergenic spacer between psbA and 3'trnK (Winkworth et al., 2002; Pedersen and Hadenäs, 2003). In some studies the 3'trnK intron to some 3' portion of matK was used (Wang et al., 1999; Schultheis, 2001; Winkworth et al., 2002; Hufford et al., 2003; Salazar et al., 2003). Others have used some 5' portion of matK to 5'trnK (Plunkett et al., 1996; Ohsako and Ohnishi, 2000, 2001; Chandler et al., 2001), and still others have used part of the matK gene only (Kajita et al., 1998; Bayer et al., 2002; Cuénoud et al., 2002; Ge et al., 2002; Samuel et al., 2003). In many of the above-mentioned investigations, several sequencing primers were required in addition to the PCR primers to piece together sequences for the entire desired region. Also, truly universal primers cannot be designed because of the variability of the gene across broad phylogenetic lineages, and often primers have to be made that are specific to different groups (e.g., Wang et al., 1999; Hardig et al., 2000; Hu et al., 2000; Miller and Bayer, 2001; Mort et al., 2001; Pridgeon et al., 2001; Bayer et al., 2002; Hilu et al., 2003). Therefore, in terms of cost, the matK region is relatively expensive because it often involves several sequencing reactions from multiple unique primers. Although matK is putatively the most variable coding region found within cpDNA (Neuhaus and Link, 1987; Olmstead and Palmer, 1994), it was excluded from this study primarily because it is a coding region and not part of our focus. Furthermore, the gene's large size would require the development of several internal sequencing primers, and with few strategically placed conserved regions, the number of primers for specific lineages becomes too cumbersome for the scope of this investigation. Therefore, we only included both ends of the trnK intron in addition to the psbA-3'trnK