In the preceding section, we examined several different classes of DNA families with members that carry out a variety of tasks necessary for the survival of the organism. This section surveys a final major class of DNA families whose members, in and of themselves, do not function for the benefit of the animal in which they lie. This class can be subdivided further into individual families that are actively involved in their own dispersion (the so-called selfish genes) and those that consist of very simple sequences that appear to arise de novo at each genomic location. The selfish gene group can be further divided, somewhat arbitrarily, into subclasses based on copy number in the genome. Each of the resulting subclasses of repetitive DNA families will be discussed in the subsections that follow.
Retroviruses are RNA-containing viruses that can convert their RNA genome into circular DNA molecules through a viral-associated reverse transcriptase that becomes activated upon cell infection. The resultant DNA "provirus" can integrate itself into a relatively random site in the host genome. The genetic information present in the retroviral genome is retained within the integrated provirus, and under certain conditions, the provirus can be activated to produce new RNA genomes along with the associated proteins, including reverse transcriptase, that can come together to form new virus particles that are ultimately released from the cell surface by exocytosis. However, in many cases, stably integrated retroviral elements appear not to be active.
Once it has become integrated into a chromosome, the provirus will become replicated with every round of host replication irrespective of whether the provirus itself is active or silent. Furthermore, proviruses that integrate into the germ line through the sperm or egg genome will segregate along with their host chromosome into the progeny of the host animal and into subsequent generations of animals as well. In certain hybrid mouse strains, new proviral integrations into the germ line can be observed to occur at abnormally high frequencies (Jenkins and Copeland, 1985).
All strains of mice as well as all other mammals have endogenous proviral elements. These elements can be classified and subclassified according to the type of retrovirus from which they derived (ecotropic, MMTV, xenotropic, and others). Ecotropic elements are generally present at 0-10 copies (Jenkins et al., 1982), MMTVs are present at 4-12 copies (Kozak et al., 1987), and non-ecotropic elements are present at 40-60 copies (Frankel et al., 1990). Loss and acquisition of new proviral sequences is an ongoing process and, as a consequence, the genomic distribution of these elements is highly polymorphic. Thus, these elements can be very useful as genetic markers as discussed in Section 8.2.4.
In addition to the DNA families clearly related to known retroviruses, there are a number of additional families that are retroviral-like in structure but are not clearly related to any known virus strain in existence today. The Intracisternal A Particle (IAP) DNA family is defined by homology to RNA sequences that are actually present within non-functional retroviral-like particles found in the cytoplasm of some types of mouse cells. The IAP family is present in ~1000 copies (Lueders and Kuff, 1977), but very few of these copies can actually produce transcripts. Another retroviral-like DNA family is called VL30, which stands for viral-like 30S particles (Carter et al., 1986); there are approximately 200 copies of this element in the mouse genome (Courtney et al., 1982; Keshet and Itin, 1982). There is no reason to expect that additional retroviral-like families will not be uncovered through further genomic studies.
It is of evolutionary interest to ask the question: from where do retroviruses come? Retroviruses cannot propagate in the absence of cells, but cells can propagate in the absence of retroviruses. Thus, it seems extremely likely that retroviruses are derived from sequences that were originally present in the cell genome. The first retrovirus must have been able to free itself from the confines of the cell nucleus through an association with a small number of proteins that allowed it to coat, and thus protect, itself from the harsh extracellular environment. Of course, the protein most critical to the propagation of the retrovirus is the enzyme that allows it to reproduce: RNA-dependent DNA polymerase, commonly referred to as reverse transcriptase. But where did this enzyme come from? Reverse transcriptase catalyzes the production of single-stranded complementary DNA molecules from an RNA template. This enzymatic activity does not appear to be required for any normal cellular process known in mammals! How could such an activity without any apparent benefit to the host organism arise de novo in a normal cell? One possible answer is that reverse transcriptase did not evolve for the benefit of the organism itself but, rather, for the benefit of selfish DNA elements within the genome that utilize the enzyme to propagate themselves within the confines of the genome as described in the next section.
The mouse genome contains three independent families of dispersed repetitive DNA elements called B1, B2, and LINE-1 (or L1) that are each present at more than 80,000 chromosomal sites (Hasties, 1989). The general name coined for genomic elements of this type that disperse themselves through the genome by means of an RNA intermediate is retroposon. Of the three major retroposon families, it is only L1 that appears to be derived from a full-fledged selfish DNA sequence with a self-encoded reverse transcriptase. The mouse L1 DNA family is very old and homologous repetitive families have been found in a wide variety of organisms including protists and plants (Martin, 1991). Thus, LINE-related elements, or others of a similar nature, are likely to have been the source material that gave rise to retroviruses.
Full-length L1 elements have a length of 7 kb; however, the vast majority (>90%) of the ~100,000 L1 elements are truncated, with lengths that vary down to 500 bp (Martin, 1991). Moreover, of the ~10,000 full-length L1 elements, only a few retain a completely functional reverse transcriptase gene that has not been inactivated by mutation. Thus, only a very small fraction of the L1 family members retain "transposition competence," and it is these that are responsible for dispersing new elements into the genome.
Dispersion to new positions in the germ-line genome presumably begins with the transcription of competent L1 elements in spermatogenic or oogenic cells. The reverse transcriptase coding region on the L1 transcript is translated into enzyme that preferentially associates with and utilizes the transcript that it came from as a template to produce L1 cDNA sequences (Martin, 1991). For reasons that are unclear, it seems that the reverse transcriptase usually stops before a full-length copy is finished. These incomplete cDNA molecules are, nevertheless, capable of forming a second strand and integrating into the genome as truncated L1 elements that are forever dormant.
The L1 family appears to evolve by repeated episodic amplifications from one or a few progenitor elements, followed by the slow degradation of most new integrants by genetic drift into random sequence. Thus, at any point in time, a large fraction of the cross-hybridizing L1 elements in any one genome will be more similar to each other than to L1 elements in other species. In a sense, episodic amplification followed by general degradation is another mechanism of concerted evolution.
A large percentage of the mouse L1 elements share two EcoRI restriction sites located at a 1.3 kb distance from each other near the 3' end of the full-length sequence. With its very high copy number, this 1.3 kb fragment is readily observed in and, in fact, diagnostic of EcoRI digests of total mouse genomic DNA that has been separated by agarose gel electrophoresis and subjected to staining by ethidium bromide. This high copy number EcoRI fragment was originally given other names, including MIF-1 and 1.3RI, before it was realized to be simply a portion of L1.
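The fragment arithmetic behind such a digest can be illustrated with a minimal Python sketch. It assumes a linear DNA molecule, complete digestion, and the standard EcoRI recognition sequence GAATTC (cut after the G). The toy sequence is hypothetical filler built only so that two sites lie 1.3 kb apart; it is not real L1 sequence.

```python
import re

ECORI_SITE = "GAATTC"  # EcoRI recognition sequence; the enzyme cuts after the G

def ecori_fragment_lengths(seq: str) -> list[int]:
    """Fragment lengths from a complete EcoRI digest of a linear sequence."""
    seq = seq.upper()
    # The cut falls one base into each recognition site (G^AATTC).
    cuts = [m.start() + 1 for m in re.finditer(ECORI_SITE, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

# Hypothetical toy molecule: placeholder bases with two EcoRI sites 1,300 bp apart.
toy = "C" * 200 + ECORI_SITE + "C" * (1300 - len(ECORI_SITE)) + ECORI_SITE + "C" * 100
fragments = ecori_fragment_lengths(toy)
print(fragments)
```

In a real genomic digest, a fragment of this size released from tens of thousands of L1 copies would accumulate at a single gel position, which is why it stands out against the smear of unique-sequence fragments.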
The two other major families of highly repetitive elements in the mouse, B1 and B2, are both of the SINE type, with relatively short repeat units of ~140 bp and ~190 bp in length, respectively. The significance of this short repeat length is that it does not provide sufficient capacity for these elements to actually encode their own reverse transcriptase. Nevertheless, SINE elements are able to disperse themselves through the genome, just like LINE elements, by means of an RNA intermediate that undergoes reverse transcription. Clearly, SINEs are dependent on the availability of reverse transcriptase produced elsewhere, perhaps from L1 transcripts or endogenous retroviruses.
All SINE elements, in the mouse genome and elsewhere, appear to have evolved out of small cellular RNA species, most often tRNAs but also (in the case of mice and humans) the 7SL cytoplasmic RNA, which is one of the components of the signal recognition particle (SRP) essential for protein translocation across the endoplasmic reticulum (Okada, 1991). Unlike the LINE families, however, SINE families present in the genomes of different organisms appear, for the most part, to have independent origins. The defining event in the evolution of a functional cellular RNA into an altered-function self-replicating SINE element is the accumulation of nucleotide changes in the 3' region that lead to self-complementarity with the propensity to form hairpin loops. The open end of the hairpin loop can be recognized by reverse transcriptase as a primer for strand elongation. Since hairpin loop formation of this type is likely to be very rare among normal cellular RNAs, the SINE transcripts in a cell will be utilized preferentially as templates for the production of cDNA molecules that are able (somehow) to integrate into the genome at random sites. Like the L1 family, the B1 and B2 families appear to be evolving by episodic amplification followed by sequence degradation.
The B1 element is repeated ~150,000 times, and the B2 element is repeated ~90,000 times (Hasties, 1989); together these elements alone account for ~1.3% of the material in the Mus musculus genome. The B1 element is derived from a portion of the 7SL RNA gene whereas the B2 element appears to be derived (in a complicated fashion) from a tRNA(Lys) gene. Human beings have just one family of SINE elements, referred to as the "Alu family," which is present in 500,000 copies and is also derived from 7SL RNA, although in an independent fashion from the mouse B1 element. Interestingly, the mouse genome does contain about 10,000 copies of a retroposon family that is closely related to the human Alu family (Hasties, 1989).
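The ~1.3% figure can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: the copy numbers and repeat-unit lengths are the approximate values quoted in the text, while the haploid genome size of 3 x 10^9 bp is an assumption that is not stated in this section.

```python
GENOME_BP = 3.0e9  # ASSUMPTION: approximate haploid mouse genome size in bp

# family name -> (approximate copy number, approximate repeat-unit length in bp)
SINE_FAMILIES = {
    "B1": (150_000, 140),
    "B2": (90_000, 190),
}

def family_bp(copies: int, unit_bp: int) -> int:
    """Total base pairs contributed by one family, ignoring truncated copies."""
    return copies * unit_bp

total_sine_bp = sum(family_bp(c, u) for c, u in SINE_FAMILIES.values())
sine_fraction = total_sine_bp / GENOME_BP

print(f"B1 + B2 total: {total_sine_bp / 1e6:.1f} Mb")
print(f"fraction of genome: {sine_fraction:.2%}")
```

Under these assumptions the two families contribute about 38 Mb, or roughly 1.3% of the genome.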
A number of other independent SINE families have been identified in the mouse genome, but none are present in more than 10,000 copies. One such family of 80 bp tRNA-derived elements called ID was originally found in the rat genome at a copy number of 200,000; in the mouse genome, there are only 10,000 ID copies. In addition, there are probably other minor SINE families in the mouse genome that have yet to be well characterized.
The total mass of SINE and LINE elements probably accounts for less than 15% of the mouse genome. With the efficient means of self-dispersion that these elements employ, one can ask why they haven't amplified themselves to even higher levels. The answer is almost certainly that if the amount of selfish DNA in a genome goes above a certain critical level, it will cause the host organism to be less fit and, thus, less likely to pass its selfish DNA load on to future generations. The existence of a critical ceiling means that the various SINE and LINE families are in direct competition for a limited amount of genomic real estate.
If one assumes that each of the major highly repetitive families (B1, B2, and L1) is dispersed at random, it is a simple matter to calculate that, on average, a member of each family will be present in every 20-30 kb of DNA. In fact, if one screens a complete genomic library with 15 kb inserts in bacteriophage lambda with a probe for each family, 80% of the clones are found to contain B1 elements, 50% are found to contain B2 elements, and 20% are found to contain the central portion of the full-length L1 element (Hasties, 1989). However, if one analyzes individual clones for the presence of both SINE and LINE elements, there is a significant negative correlation. In other words, SINE and LINE elements appear to prefer different genomic domains.
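The 20-30 kb spacing follows directly from dividing genome size by copy number. The sketch below makes that calculation explicit and also reports the mean number of elements expected per 15 kb insert under the same random-dispersion assumption. The copy numbers are the approximate values given in the text; the genome size of 3 x 10^9 bp is an assumption not stated in this section. The model is only a baseline: it ignores element length, clustering, and the fact that the L1 probe described above detects only the central portion of full-length elements.

```python
GENOME_BP = 3.0e9   # ASSUMPTION: approximate haploid mouse genome size in bp
INSERT_BP = 15_000  # insert size of the lambda library described in the text

# Approximate copy numbers quoted in the text.
COPIES = {"B1": 150_000, "B2": 90_000, "L1": 100_000}

def mean_spacing_bp(copies: int, genome_bp: float = GENOME_BP) -> float:
    """Average distance between elements if they are spread evenly."""
    return genome_bp / copies

def mean_elements_per_insert(copies: int, insert_bp: int = INSERT_BP,
                             genome_bp: float = GENOME_BP) -> float:
    """Expected number of elements in one insert under random dispersion."""
    return copies * insert_bp / genome_bp

for name, copies in COPIES.items():
    spacing_kb = mean_spacing_bp(copies) / 1000
    per_insert = mean_elements_per_insert(copies)
    print(f"{name}: one element per {spacing_kb:.0f} kb, "
          f"{per_insert:.2f} elements per 15 kb insert")
```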
To better understand the non-random distribution of the three major mouse repetitive families and to investigate possible correlations with chromosome structure, the karyotypic distribution of each family was investigated by fluorescence in situ hybridization or FISH (Boyle et al., 1990 and Section 10.2). Incredibly, the distribution of the LINE-1 elements corresponds almost precisely with the distribution of the Giemsa-stained dark G bands. In contrast, both SINE element families co-localize to the lightly stained chromosomal regions located between G bands (R bands). When the same type of experiment was performed on human karyotypes, the same result was obtained: the human SINE element Alu was found in R bands, whereas human LINE sequences co-localized with G bands (Korenberg and Rykowski, 1988). Since essentially all of the SINE and LINE elements integrated into the mouse and human genomes subsequent to their divergence from a common mammalian ancestor, the implication is that preferential integration into different chromosomal domains is a property of each element class.
The correlation between G and R bands and LINE and SINE distribution, respectively, is not perfect. Some chromosomal regions are observed to have an overabundance or underabundance of the associated sequences, and a small but significant fraction of elements are located outside the "correct" regions. One consistent exception to the general correlation is that centromeric heterochromatic regions, which normally stain brightly with Giemsa, do not have any detectable LINE elements in mice or humans. However, this exception can be easily understood in terms of the special structural role played by centromeric satellite DNA in chromosome segregation: any integration of a LINE element would disrupt this special DNA and its function, and this would be selected against evolutionarily.
As a final note, it should be mentioned that although the SINE and LINE elements have amplified themselves for selfish purposes, they have, in turn, had a profound impact on whole genome evolution. In particular, homologous elements located at nearby locations can, and will, act to catalyze unequal but "homologous" crossovers that result in the duplication of single copy genes located in between and the initiation of gene cluster formation as illustrated in Figure 5.5.
With large-scale sequencing and hybridization analyses of mammalian genomes came the frequent observation of tandem repeats of DNA sequences, without any apparent function, scattered throughout the genome. The repeating unit can be as short as two nucleotides (CACACACA etc.), or as long as 20 kb. The number of tandem repeats can also vary from as few as two to as many as several hundred. The mechanism by which tandem repeat loci originate may be different for loci having very short repeat units as compared to those with longer repeat units. Tandem repeats of short di- or trinucleotides can originate through random changes in non-functional sequences. In contrast, the initial duplication of larger repeat units is likely to be a consequence of unequal crossing over. Once two or more copies of a repeat unit (whether long or short) exist in tandem, unequal pairing followed by crossing over can lead to an increase in the number of repeat units in subsequent generations (see Figure 8.4). Whether stochastic mechanisms alone can account for the rich variety of tandem repeat loci that exist in the genome or whether other selective forces are at play is not clear at the present time. In any case, tandem repeat loci continue to be highly susceptible to unequal crossovers and, as a result, they tend to be highly polymorphic in terms of overall locus size.
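The effect of unequal crossing over on repeat number can be sketched with a deliberately simple model. It assumes two homologous tandem arrays that pair out of register and undergo a single reciprocal exchange at repeat-unit boundaries; the function name and its parameters are hypothetical and serve only as an illustration.

```python
def unequal_crossover(a_units: int, b_units: int, i: int, j: int) -> tuple[int, int]:
    """Repeat counts of the two reciprocal products of one crossover.

    Array A (a_units repeats) is broken after its i-th unit and array B
    (b_units repeats) after its j-th unit. One product joins the left part
    of A to the right part of B; the other joins the left part of B to the
    right part of A.
    """
    if not (0 <= i <= a_units and 0 <= j <= b_units):
        raise ValueError("breakpoints must lie within the arrays")
    product_1 = i + (b_units - j)
    product_2 = j + (a_units - i)
    return product_1, product_2

# Two 10-unit arrays paired two units out of register.
expanded, contracted = unequal_crossover(10, 10, i=6, j=4)
print(expanded, contracted)
```

In this model the total number of repeat units is conserved, so every expansion in one product is matched by a contraction in the other. When pairing is in register (i equal to j), the two array sizes are unchanged.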
Tandem repeat loci are classified according to both the size of the individual repeat unit and the length of the whole repeat cluster. The smallest and simplest, with repeat units of 1-4 bases and locus sizes of less than 100 bp, are called microsatellites. The use of microsatellites as genetic markers has revolutionized the entire field of mammalian genetics (see Section 8.3.6). Next come the minisatellites, with repeat units of 10 to 40 bp and locus sizes that vary from several hundred base pairs to several kilobases (see Section 8.2.3). Tandem repeat loci of other sizes do not appear to be as common, but a great variety are scattered throughout the genome. The term midisatellite has been proposed for loci containing 40 bp repeat units that extend over distances of 250 to 500 kb, and macrosatellite has been proposed as the term to describe loci with large repeat units of 3 to 20 kb present in clusters that extend over 800 kb (Giacalone et al., 1992). However, the use of arbitrary size boundaries to "define" these other types of loci is probably not meaningful since it appears that, in reality, no such boundaries exist in the potential for tandem repeat loci to form in the mouse and other mammalian genomes.