First published online April 20, 2007
Journal of Experimental Biology 210, 1526-1547 (2007)
Published by The Company of Biologists 2007
doi: 10.1242/jeb.005017
Review Article |
A new paradigm for developmental biology
ARC Centre for Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, St Lucia QLD 4072, Australia
e-mail: j.mattick{at}imb.uq.edu.au
Accepted 19 February 2007
| Summary |
|---|
|
|
|---|
Key words: non-coding RNA, intron, regulation
| Introduction |
|---|
|
|
|---|
How is this feat achieved, and where is this information embedded? In the
only well-studied case, the nematode worm Caenorhabditis elegans, it
is known that developmental ontogeny is precise and invariant, with each cell
in the adult being the result of a spatially and temporally ordered
progression of cell division, selected apoptosis (programmed cell death) and,
ultimately, differentiation into nerve, muscle, gut, germ and other
specialized cells (Ambros,
2001
; Sternberg and Felix,
1997
). Similar processes are observed in the development of
insects and mammals (Baehrecke,
2002
; McCarthy,
2003
), for example in the apoptosis that sculpts the eye ommatidia
in the former (Clark et al.,
2002
) and separates the digits of the fore- and hindlimbs in the
latter (Zuzarte-Luis and Hurle,
2005
). Thus, it is likely that the ontogeny of higher animals,
while vastly more complex and likely to be subject to individual (genomic)
variation, is also precisely programmed
(Clarke and Tickle, 1999
).
Indeed, the almost exact identity of monozygotic twins in their physical
characteristics and idiosyncrasies, as well as a high degree of concordance in
their psychological characteristics (independent of environment), is clear
testimony to the precision and reproducibility of the genetic instructions
they share.
The genetic programming of development is usually considered to be directed
by proteins involved in morphogenetic signalling and various aspects of gene
regulation. These include homeodomain-containing proteins, chromatin-modifying
proteins, and transcription factors acting on cis-regulatory
elements, informed by those involved in cell surface receptor and signal
transduction systems. Together they form elaborate modular regulatory networks
(Arnone and Davidson, 1997
;
Bantignies and Cavalli, 2006
;
Levine and Davidson, 2005
;
Levine and Tjian, 2003
),
notwithstanding the recent discovery of microRNAs (see below), which are
regarded as an interesting extension of the current paradigm
(Davidson, 2006
) rather than
the vanguard of another entire layer of regulation. This protein-centric
perspective underpins most conceptions of the control of development, as
exemplified by elegant studies on sea urchin embryogenesis and fruitfly
development (Ben-Tabou de-Leon and
Davidson, 2006
; Davidson,
2006
; Levine and Davidson,
2005
; Stathopoulos and
Levine, 2005
). On the other hand, many proteins are shared
throughout the metazoa (Duboule and
Wilkins, 1998
). Moreover, the genomes of C. elegans
(Stein et al., 2003
), which
has only about 1000 cells, and sea urchins
(Sodergren et al., 2006
) have
essentially the same number of annotated protein-coding genes as those of
vertebrates, including humans (Aparicio et
al., 2002
; International
Human Genome Sequencing Consortium, 2004a
;
International Human Genome Sequencing
Consortium, 2004b
; Goodstadt
and Ponting, 2006
; Taft et
al., 2007
).
All of these observations suggest that significant amounts of relevant
information must lie beyond protein-coding sequences, presumably in expanded
regulatory regions that control the expression of these proteins
(Kleinjan and van Heyningen,
2005
; Taft et al.,
2007
). It also seems likely, although firm conclusions are limited
by the poor cDNA library coverage in many species, that the proteome is
expanded in more developmentally complex species by the increased use of
alternative splicing (Graveley,
2001
; Smith and Valcarcel,
2000
; Stamm et al.,
2005
). This in turn, however, mandates an increase in regulation,
assuming that cell- or tissue-specific alternative splicing is not random.
Thus, evolutionary innovation and phenotypic divergence are achieved not only by
variations in the structure and function of proteins but also, and probably
more so, by variations in the regulatory circuitry that controls their deployment
(Davidson, 2006
;
Duboule and Wilkins, 1998
;
Jacob, 1977
;
Zuckerkandl and Cavalli,
2007
).
| Analogue components and digital information transfer in complex systems |
|---|
|
|
|---|
In addition to sophisticated operational controls, complex entities
(whether aircraft or organisms) require extensive and detailed design plans
for their construction, information about which has also to be stored in the
system, along with the specifications of the components themselves. Random
changes to assembly plans may have more subtle effects than those that alter
component structure (particularly those that compromise component function),
creating design variations that often have less severe consequences, although
there will be exceptions in both directions. In biology these changes will
therefore often result in minor defects, quantitative trait variation or
alterations in disease susceptibility. Altered regulatory information has been
shown to underlie such variation in a number of cases where it has been
possible to map the causative nucleotide changes to completion in
well-structured pedigrees (Clark et al.,
2006
; Clop et al.,
2006
; Ishii et al.,
2006
; Smit et al.,
2003
; Van Laere et al.,
2003
).
While it has long been recognized that genetic information is encoded
digitally in DNA, it has also been widely assumed that the cellular outputs of
this information, expressed via the intermediate of messenger RNA
(mRNA), are almost exclusively analogue components. That is, it has been
assumed that most genes are synonymous with proteins and that most genetic
information is transacted by proteins. This is essentially true for the
prokaryotes, whose genomes comprise densely packed protein-coding sequences,
although these genomes clearly also encode a limited number of small
regulatory RNAs that function in part by sequence-specific interactions with
other RNAs and DNA (Gottesman,
2005
; Mattick and Makunin,
2006
; Vogel and Sharma,
2005
; Winkler,
2005
). The situation is similar in unicellular eukaryotes such as
the yeasts Saccharomyces cerevisiae
(David et al., 2006
;
Olivas et al., 1997
) and
Schizosaccharomyces pombe
(Watanabe et al., 2002
).
Interestingly, although of similar complexity, the former has more
protein-coding sequences than the latter, whereas the latter has many more
introns (Goffeau et al.,
1996
; Wood et al.,
2002
) and a more elaborate RNA signalling infrastructure, which
includes the basic components of the RNA interference (RNAi) pathway
(Martienssen et al., 2005
).
This suggests that there may be some trade-off between protein- and RNA-based
forms of gene regulation in simple eukaryotes. In any case, at first
approximation it is reasonable to say that micro-organisms, particularly the
prokaryotes, are in fact largely analogue devices (the `bicycles' of biology)
and that proteins not only comprise the primary structural and catalytic
components of these cells but are also the main agents by which they are
regulated.
For the past 50 years it has been assumed that the same applies in more
complex organisms, i.e. that regulation, particularly developmental
regulation, is also largely analogue (protein-based) in multicellular
organisms (Davidson, 2006
),
despite the fact that genome sequence analysis has shown that the numbers of
protein-coding genes do not scale strongly or consistently with morphological
complexity (Taft et al.,
2007
) (Fig. 1).
This apparently quite reasonable assumption (at least initially) led logically
to two subsidiary assumptions: (i) that the increased regulatory
sophistication of more complex organisms is achieved through combinatoric
interactions of regulatory proteins intersecting with more complex regulatory
sequences in promoters and untranslated regions of mRNAs (etc.)
(Buchler et al., 2003
;
Levine and Tjian, 2003
); and
(ii) that the vast amounts of non-protein-coding sequences in more complex
organisms are, apart from a limited amount of cis-acting regulatory
sequences, evolutionary debris. The latter view has been reinforced by the
fact that many of these non-coding sequences are derived from transposons (DNA
sequences that can move within the genome to new positions), themselves widely
assumed to be non-functional, selfish DNA
(Doolittle and Sapienza, 1980
;
Orgel and Crick, 1980
) and to
be evolving `neutrally' (Waterston et
al., 2002
). These assumptions have remained largely unquestioned
for many years and have become articles of faith, but they are not necessarily
correct.
| Non-linear scaling of regulatory information in integrated systems |
|---|
|
|
|---|
It should be noted that the power and precision of digital communication
and control systems has only been broadly established in the human
intellectual and technological experience during the past 20-30 years,
well after the central tenets of molecular biology were developed and after
introns had been discovered. The latter was undoubtedly the biggest surprise
(Williamson, 1977
), and its
misinterpretation possibly the biggest mistake, in the history of molecular
biology. Introns are transcribed, but because they do not encode proteins, and
because it was inconceivable that so much non-coding RNA could be functional,
especially in an unexpected way, it was immediately and almost universally
assumed that introns are non-functional and that the intronic RNA is degraded
(rather than further processed) after splicing. The presence of introns in
eukaryotic genomes was then rationalized as the residue of the early assembly
of genes that had not yet been removed and that had utility in the evolution
of proteins by facilitating domain shuffling and alternative splicing
(Crick, 1979
;
Gilbert, 1978
;
Padgett et al., 1986
).
Interestingly, while it has been widely appreciated for many years that DNA
itself is a digital storage medium, it was not generally considered that some
of its outputs may themselves be digital signals, communicated via
RNA.
Analysis of prokaryotic genomes has shown that, as predicted
(Croft et al., 2003
), the
numbers of genes encoding regulatory proteins scale almost quadratically with
gene number or genome size (Croft et al.,
2003
; Gagen and Mattick,
2005
; Mattick,
2004
; Mattick and Gagen,
2005
; van Nimwegen,
2003
). In addition, extrapolation of these relationships shows that
the point where the number of new regulatory genes is predicted to exceed the
number of new (non-regulatory) functional genes is close to the observed upper
size limit of bacterial genomes (Gagen
and Mattick, 2004
; Gagen and
Mattick, 2005
). This implies (although it does not prove) that bacteria
have reached a complexity ceiling imposed by the accelerating cost of
protein-based regulation, possibly early in evolution. It also implies (i)
that the more complex eukaryotes must have solved the problem some other way,
most likely by the co-option of RNA as a sequence-specific regulatory molecule
[microRNAs (miRNAs) being a good example] and, more subtly, (ii) that the
combinatorics of regulatory factors per se cannot be used to enlarge
the regulatory space to get past this ceiling, as there is no a
priori reason to expect that prokaryotes could not have easily evolved
more complex promoters and recruited additional transcription factors, etc.
This in turn suggests that the complex gene regulatory regimes in the higher
organisms may operate through multiple layers of regulation and regulatory
decisions, rather than multiple (combinatoric) inputs at any given point.
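The scaling argument above can be made concrete with a deliberately simplified sketch. It assumes an exactly quadratic relationship, R = a * N**2, between the number of regulatory genes R and the total gene number N (the text says only "almost quadratically"), and the coefficient `a` used below is a hypothetical value chosen for round numbers, not a figure fitted in the cited studies. Under that assumption, the fraction of each newly added gene that must be regulatory is dR/dN = 2aN, so new regulatory genes outnumber new non-regulatory genes once 2aN exceeds 1/2, that is, beyond N = 1/(4a).

```python
# Toy model only: exact quadratic scaling and the value of `a` are assumptions
# made for illustration, not parameters reported in the cited studies.

def regulatory_genes(total_genes: float, a: float) -> float:
    """Number of regulatory genes if they scale quadratically: R = a * N**2."""
    return a * total_genes ** 2


def marginal_regulatory_fraction(total_genes: float, a: float) -> float:
    """Share of each newly added gene that must be regulatory: dR/dN = 2 * a * N."""
    return 2.0 * a * total_genes


def ceiling_gene_number(a: float) -> float:
    """Gene number at which new regulators equal new non-regulatory genes.

    Solving dR/dN = 1/2 gives 2 * a * N = 1/2, hence N = 1 / (4 * a).
    """
    return 1.0 / (4.0 * a)


if __name__ == "__main__":
    a = 2.5e-5  # hypothetical coefficient
    n_star = ceiling_gene_number(a)
    print(f"Predicted ceiling: about {n_star:.0f} genes")
    print(f"Regulators at the ceiling: about {regulatory_genes(n_star, a):.0f}")
```

In this toy model the ceiling depends only on `a`: a smaller regulatory overhead per gene pair pushes the ceiling to larger genomes, but any quadratic cost eventually overtakes linear growth in gene number.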
In any case, and consistent with the non-linear scaling of regulatory
information, there is a strong relationship between the extent of
non-protein-coding DNA sequences in the genomes of higher organisms and their
relative complexity. Indeed this appears to be the only consistent
relationship between genome information content and complexity
(Taft et al., 2007
)
(Fig. 1). These
non-protein-coding sequences occupy almost 99% of the human genome
(Frith et al., 2005
), and it
has been inconceivable to many that they might all be functional as
cis-acting regulatory elements (although these have clearly expanded
in complex organisms). Again this view is implicitly predicated on the
assumption that most genetic information is transacted by proteins.
| The major output of metazoan genomes is non-coding RNA |
|---|
|
|
|---|
[...] 40% of the human genome) or as intergenic or
antisense transcripts (Frith et al.,
2005 [...]
There are literally tens of thousands of long non-coding RNAs (ncRNAs) that
have been identified in mammals (Carninci
et al., 2005
; Kampa et al.,
2004
; Okazaki et al.,
2002
), including many antisense transcripts
(Alfano et al., 2005
;
Cocquet et al., 2005
;
Katayama et al., 2005
;
Korneev and O'Shea, 2005
;
Pandorf et al., 2006
;
Reis et al., 2004
;
Tufarelli et al., 2003
;
Werner, 2005
;
Werner and Berdal, 2005
) and
large numbers of smaller RNAs such as miRNAs
(Berezikov et al., 2006a
;
Berezikov et al., 2006b
) and
piRNAs (Aravin et al., 2006
;
Girard et al., 2006
;
Lau et al., 2006
). Many of
these ncRNAs are expressed in a cell- or tissue-specific manner, suggesting
that they are developmentally regulated. Characterized long ncRNAs include
H19 (Barsyte-Lovejoy et al.,
2006
; Brannan et al.,
1990
; Wrana,
1994
), 7H4 (Velleca
et al., 1994
), bic
(Tam et al., 1997
),
NTT (Liu et al.,
1997
), BORG (Takeda
et al., 1998
), Xist
(Brockdorff, 1998
),
Tsix (Lee et al.,
1999
), DD3
(Bussemakers et al., 1999
),
Msx1 (Blin-Wakkach et al.,
2001
), Air (Sleutels
et al., 2002
), MALAT-1
(Ji et al., 2003
),
adapt33 (Wang et al.,
2003
), SCA8
(Mutsuddi et al., 2004
),
MIAT (Ishii et al.,
2006
), CTN (Prasanth
et al., 2005
), NFAT
(Willingham et al., 2005
),
PRINS (Sonkoly et al.,
2005
), TUG1 (Young
et al., 2005
), PINC
(Ginger et al., 2006
),
SAF (Yan et al.,
2005
), Evf-2 (Feng
et al., 2006
), HSR1
(Shamovsky et al., 2006
) and
HAR1 (Pollard et al.,
2006
), most of which have been associated with specific cellular
or developmental functions and/or disease. However, most of the ncRNAs
discovered in genome-wide transcriptomic analyses or expressed from particular
genomic regions have not been studied in any detail, although high-throughput
cell-based and other screening strategies are beginning to be deployed to
ascertain their function (Mattick,
2005
; Reis et al.,
2004
; Willingham et al.,
2005
). Moreover, the documented numbers of these RNAs are
conservative estimates: more are being regularly discovered as genomic
analyses of one sort or another delve deeper into the transcriptome. Recent
evidence suggests that deep sequencing has not remotely exhausted the
repertoire of either long ncRNAs (Carninci
et al., 2005
) or short ncRNAs
(Berezikov et al., 2006a
;
Berezikov et al., 2006b
;
Cummins et al., 2006
;
Ruby et al., 2006
) and that
there may be hundreds of thousands of small RNAs expressed in humans (T. R.
Gingeras, personal communication; L. Croft, R. J. Taft and J.S.M., unpublished
data).
These observations confront and very largely contradict the traditional
protein-centric view of genetic information and genome organization
(Mattick and Makunin, 2006
).
Either the bulk of the transcriptional output from the human genome and those
of other complex organisms is random `noise' (or, in the case of introns, the
residue of evolutionary baggage retained and accumulated within genes, as
widely assumed) or this transcription comprises a massive but hitherto hidden
layer of expression of systemic genetic information that is transacted by RNA
(Mattick, 1994
;
Mattick, 2001
;
Mattick, 2003
;
Mattick, 2004
). The former
has been described as a rather nihilistic view
(Werner, 2005
), but is one
that is comfortable for the prevailing orthodoxy. On the other hand, the
latter is strongly supported by the observations that: (i) all well-studied
loci in insects and mammals express a large number of non-protein-coding
transcripts (e.g. Ashe et al.,
1997
; Bae et al.,
2002
; Holmes et al.,
2003
; Jones and Flavell,
2005
; Lemons and McGinnis,
2006
; Lipshitz et al.,
1987
; Sanchez-Herrero and
Akam, 1989
; Sessa et al.,
2007
); (ii) many of the experimentally detected ncRNAs are
differentially expressed (Carninci et al.,
2005
; Cheng et al.,
2005
; Ravasi et al.,
2006
), apparently under the control of common transcription
factors (Barsyte-Lovejoy et al.,
2006
; Cawley et al.,
2004
); (iii) at least some have specific subcellular locations
(Ginger et al., 2006
;
Prasanth et al., 2005
); and
(iv) at least some have been shown to be functional
(Brannan et al., 1990
;
Brockdorff, 1998
;
Feng et al., 2006
;
Ginger et al., 2006
;
Prasanth et al., 2005
;
Velleca et al., 1994
;
Willingham et al., 2005
;
Wrana, 1994
;
Young et al., 2005
).
Microarray analyses have shown that large numbers of ncRNAs are dynamically
regulated during the differentiation of embryonal stem cells, myoblasts,
neuronal cells and the gonadal ridge, as well as during T-cell and macrophage
activation (M. E. Dinger, K. C. Pang, I. Qureshi, M. Crowe, A. C. Perkins, S.
M. Grimmond, D. A. Hume, P. A. Koopman, G. E. O Muscat, S. Bruce, M. F. Mehler
and J.S.M., manuscript in preparation) and in cancer
(Lu et al., 2005
;
Reis et al., 2004
). In
addition, in situ hybridization analyses are revealing large numbers
of ncRNAs that are expressed in particular regions of the brain and in
particular subcellular locations (T. R. Mercer, M. E. Dinger, S. Sunkin, M. F.
Mehler and J.S.M., in preparation). Many of these ncRNAs are antisense or
intronic to genes encoding proteins important in neural development, function
and disease. It is also now evident that many of the complex genetic phenomena
in complex organisms, including transcriptional and post-transcriptional gene
silencing (Cogoni and Macino,
2000
; Matzke et al.,
2001
; Zamore and Haley,
2005
), imprinting (Kelley and
Kuroda, 2000
; Morison et al.,
2005
; Nikaido et al.,
2003
) and probably also transvection
(Mattick and Gagen, 2001
) and
transinduction (Ashe et al.,
1997
), are linked to RNA signalling
(Mattick, 2003
;
Mattick and Gagen, 2001
).
| Digital-analogue conversion of RNA signals |
|---|
|
|
|---|
However, the sequence-specific interaction of a regulatory RNA with its
target is relatively sterile unless this interaction can be converted into a
meaningful analogue action. At its simplest level, this may comprise antisense
binding to block another interaction, and this primitive mechanism seems to be
a common feature of regulatory RNAs in prokaryotes. However, a more
sophisticated strategy is to embed secondary signals either in the RNA itself
or in the structure of the resulting RNA:RNA or RNA:DNA complex, to recruit
different types of complexes, which then undertake the type of analogue action
required upon receipt of the signal. Good examples are (i) the complexes of
RNA-modifying enzymes that act at a site adjacent to and determined by the
position of the sense:antisense interaction between small nucleolar RNAs
(snoRNAs) and their targets (Bachellerie
et al., 2002
; Meier,
2005
), and (ii) the RNA-induced silencing (RISC) complexes that
act on RNAs bound to small interfering RNAs (siRNAs) and miRNAs
(Tang, 2005
). Thus, there are
two components to RNA signals: a sequence-specific interaction with the
intended target(s) and a secondary or tertiary structural component that acts
as a transducer to recruit generic infrastructural proteins to impart
different types of actions. Indeed, this two-stage principle also applies to
other classes of functional RNAs including snRNAs and tRNAs, which recognize
splice junctions in pre-mRNAs or codons in mRNAs and recruit the spliceosome
or ribosome, respectively. That is, RNAs function as adaptors, with a target
sequence-specific address code and separate structural motifs that specify the
type of consequent function and bind the appropriate proteins.
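The adaptor idea described above, a sequence-specific address paired with a separate signal that selects the consequent action, can be sketched as a small data structure. Everything here is illustrative: the class name `RegulatoryRNA`, the `effector` label and the toy sequences are hypothetical, and the matching rule is perfect Watson-Crick complementarity only, whereas real RNA pairing also tolerates G:U wobble pairs and mismatches.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Watson-Crick complements for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


@dataclass(frozen=True)
class RegulatoryRNA:
    """Hypothetical two-part signal: an address plus an effector label."""
    name: str
    address: str   # guide sequence, 5'->3', that base-pairs with the target
    effector: str  # stand-in for the protein complex recruited by the RNA's structure


def find_target_sites(guide: RegulatoryRNA, target: str) -> List[int]:
    """0-based start positions in `target` that are perfectly complementary to the address."""
    site = reverse_complement(guide.address)
    hits = []
    start = target.find(site)
    while start != -1:
        hits.append(start)
        start = target.find(site, start + 1)
    return hits


def transduce(guide: RegulatoryRNA, target: str) -> List[Tuple[int, str]]:
    """Pair each addressed site with the action the guide's effector would impart."""
    return [(pos, guide.effector) for pos in find_target_sites(guide, target)]


if __name__ == "__main__":
    guide = RegulatoryRNA(name="toy-guide", address="UAGCUU", effector="silencing")
    print(transduce(guide, "GGAAGCUACC"))
```

Keeping the address and the effector as separate fields mirrors the point made in the text: the same generic protein machinery can be redirected to new targets simply by changing the address sequence.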
Such considerations suggest that a receptive infrastructure for RNA
signalling must have co-evolved with the RNA signals themselves and become
progressively more sophisticated as RNA regulatory and transport networks
gained currency during the evolution of the eukaryotes. Examples include the
proteins of the argonaute family and others associated with RNA interference
(Carmell et al., 2002
), and
those containing RRM domains, KH domains, SR domains, SET domains,
pumilio-homology domains and double-stranded RNA-binding domains, which occur
in a wide range of developmental regulators with global functions
(Anantharaman et al., 2002
;
Bernstein and Allis, 2005
;
Saunders and Barber, 2003
;
Wang et al., 2002
). Indeed
many of the so-called nucleic acid binding proteins and chromatin-binding
proteins whose target specificity is uncertain or unknown may in fact
recognize different types of RNA signals. This possibility is supported by
evidence suggesting that regulatory proteins containing C2H2 zinc fingers
(Shi and Berg, 1995
), Y-boxes
(Ladomery, 1997
),
chromodomains (Akhtar et al.,
2000
; Bernstein and Allis,
2005
), tudor domains
(Maurer-Stroh et al., 2003
)
and SET domains (Krajewski et al.,
2005
), and others such as DNA methyl transferases
(Jeffery and Nakielny, 2004
),
may recognize such RNA signals in one form or another.
| The origin and evolution of RNA-based regulatory networks in complex organisms |
|---|
|
|
|---|
The subsequent evolution of the spliceosome occurred by the devolution of
the originally cis-acting catalytic sequences within introns to
trans-acting generic co-factors (spliceosomal RNAs) and the
recruitment of ancillary proteins. This reduced the internal sequence
constraints on the introns, allowing them more freedom to evolve and flexibly
explore new functional space (as RNA molecules). It also made their excision
from primary transcripts more efficient, perversely providing them with even
greater facility to expand and invade other genes
(Mattick, 1994
). As these RNA
networks began to be established, proteins capable of recognizing subsets of
signals in these networks would have been selected for, increasing the
sophistication of the system. Moreover, it would be expected that increasing
numbers of genes would have evolved solely to express RNA as higher-order
regulators in this increasingly complex system. This will have occurred at
least in part by gene duplication followed by loss of protein-coding capacity,
as appears to have happened in Xist (the ncRNA controlling X
chromosome inactivation in female mammals)
(Duret et al., 2006
) and in
many of the non-protein-coding genes that encode snoRNAs or miRNAs in their
introns (Cavaille et al.,
2001
; Mattick and Makunin,
2005
; Rodriguez et al.,
2004
; Tycowski et al.,
1996
; Ying and Lin,
2005
). Interestingly, many ncRNAs are alternatively spliced
(Cocquet et al., 2005
;
Pang et al., 2005
),
suggesting that there is an operational distinction between RNA sourced from
exons and introns. The other major source of functional RNAs has almost
certainly been various other types of mobile (transposable) elements, many of
which are derived from small RNAs and have been a potent force in genome
evolution and genetic innovation (Brosius,
1999
; Brosius,
2005
; Waterston et al.,
2002
).
| The extent of the genome under evolutionary selection |
|---|
|
|
|---|
[...] 46% in humans) is composed of transposon-derived sequences
(Lander et al., 2001 [...]
5% of the human genome is
under purifying selection in mammals
(Waterston et al., 2002 [...]
This is in direct contradiction to the suggestion that much of the
genome-wide transcription, which is developmentally regulated, is functional.
However, it is questionable whether the ARs (ancient transposon-derived repeats) that are used as yardsticks for
these estimations are really evolving neutrally. First, if ARs have no
functional relevance to the organism, they would be expected to evolve freely
and eventually to either acquire function or be deleted (M. Pheasant and
J.S.M., manuscript submitted for publication), as appears to have occurred
with a large fraction of ARs (Waterston
et al., 2002
). That is, the more ancient the extant sequence, the
more likely it is to have acquired function. Second, in agreement with this
logic, there are increasing numbers of transposon-derived sequences of all
classes, both ancient and modern, including lineage-specific repeats such as
Alu elements that have been shown to have undergone functional
exaptation as gene promoters, regulatory elements, exons and microRNA
precursors (Bejerano et al.,
2006
; Britten,
2006
; Brosius,
1999
; Dagan et al.,
2004
; Ferrigno et al.,
2001
; Hasler and Strub,
2006
; Krull et al.,
2005
; Landry et al.,
2001
; Lev-Maor et al.,
2003
; Lippman et al.,
2004
; Matlik et al.,
2006
; Nigumann et al.,
2002
; Smalheiser and Torvik,
2005
; Smalheiser and Torvik,
2006
; Volff,
2006
; Zhou et al.,
2002
).
These observations throw increasing doubt on the widespread assumption that
such sequences are mostly parasitic, and remain as inert genomic passengers.
Transposable elements have also been found to underlie the birth of new genes
and regulatory networks (Brandt et al.,
2005
; Cordaux et al.,
2006
; Landry et al.,
2001
; Zhou et al.,
2002
) and to influence early development
(Peaston et al., 2004
) and
phenotypic variation (Whitelaw and
Martin, 2001
). It is also possible to identify AR sequences that
are clearly conserved, some of which are very ancient
(Nishihara et al., 2006
),
such as recently discovered classes of ARs in humans sharing common ancestors
with those in marsupials (Kamal et al.,
2006
) and fish (Ogiwara et
al., 2002
; Xie et al.,
2006
), including an example of the slowest evolving regions of the
human genome (Bejerano et al.,
2006
). Moreover, some major classes of ARs show variable rates of
sequence conservation within them. One example is the class of so-called
`mammalian interspersed repeats' (MIRs), of which there are
300 000
copies in the human genome (Smit and
Riggs, 1995
). These MIRs date back
130 million years and are
tRNA-derived SINEs (short interspersed elements) with a consensus length of
260 nt including a 70 nt central region and a 15-25 nt more highly
conserved core (Silva et al.,
2003
; Smit and Riggs,
1995
). The fact that hundreds of thousands of such elements have
an internal sequence that is conserved more highly than the rest of the
element is prima facie evidence that this class of ARs (or at least
the conserved core within them) is not neutrally evolving and is likely under
selection, presumably for function and possibly as regulatory RNAs.
It is also clear that there are widely different rates of evolution of
different types of genomic sequences, particularly of gene regulatory
sequences, some of which are extraordinarily highly conserved blocks
(Bejerano et al., 2004
), while
many others cover extended genomic regions and exhibit rapid turnover
(Fisher et al., 2006
;
Frith et al., 2006
;
Smith et al., 2004
;
Taylor et al., 2006
). The
latter includes the remarkable functional conservation of regulatory sequences
controlling ret gene expression in zebrafish and humans, although
there is little recognizable primary sequence conservation
(Fisher et al., 2006
). The
cis-regulatory elements of the HoxA cluster have also been shown to
undergo accelerated evolution, presumably under positive selection during the
origin of amniotes and mammals (Wagner et
al., 2004
). Moreover, it is evident that phenotypic
diversification may be due as much, if not more, to changes in regulatory
architecture than to the protein components
(Duboule and Wilkins, 1998
;
Levine and Tjian, 2003
;
Mattick and Gagen, 2001
).
Indeed, regulatory sequences often exhibit considerable evolutionary
plasticity (depending on the number of their interacting targets; see below)
and relatively low conservation (Pang et
al., 2006
) compared with proteins whose evolutionary flexibility
is limited by both analogue structure-function relationships and
multitasking, i.e. the differential use of the same components in multiple
contexts (Duboule and Wilkins,
1998
; Mattick and Gagen,
2001
).
There are also other regions of the genome under evolutionary constraints
that are not evident at the primary sequence level, including shuffled
cis-regulatory elements (Sanges
et al., 2006
), gene deserts
(Ovcharenko et al., 2005
),
transposon-free regions (Simons et al.,
2006
), chromatin domains
(Bernstein et al., 2005
;
Bernstein, B. E. et al.,
2006
), regions under indel-purifying selection
(Lunter et al., 2006
), the
distances between ultra-conserved elements
(Sun et al., 2006
) and
regions predicted to contain common RNA secondary or tertiary structures
(Lescoute et al., 2005
;
Washietl et al., 2005
). Thus,
the proportion of functionally meaningful DNA in the human genome is
substantially greater than estimated from sequence conservation alone
(Smith et al., 2004
).
Different rates of evolution also occur within and between different
classes of functional gene products, both RNAs and proteins. While most
protein-coding sequences are highly constrained and hence highly conserved,
some are much more flexible and others have diverged under positive selection
(Bustamante et al., 2005
). The
estimated 5% of the human genome that is conserved with mouse does not include
35% of annotated protein-coding sequences and 17% of RefSeq annotated genes
(M. Pheasant and J.S.M., manuscript submitted for publication). Many miRNAs
are highly conserved (Pang et al.,
2006
) but many are not, being lineage- or even species-specific
(Berezikov et al., 2006a
;
Berezikov et al., 2006b
).
There are also thousands of recently discovered small RNAs (piRNAs) expressed
in testis that are not conserved between rodents and humans, although similar
RNAs are produced from syntenically orthologous loci
(Aravin et al., 2006
;
Girard et al., 2006
;
Lau et al., 2006
). SnoRNAs
have very divergent sequences and many are identifiable only by the loose
consensus and positioning of the C/D (RUGAUGA/CUGA)
(Shanab and Maxwell, 1992
) or
H(ANANNA)/ACA boxes (Meier,
2005
). It is also clear that many longer functional
non-protein-coding RNAs (ncRNAs), such as the Xist and Tsix
transcripts involved in X-chromosome dosage compensation, are evolving quickly
(Chureau et al., 2002
;
Migeon et al., 2001
;
Nesterova et al., 2001
;
Pang et al., 2006
). In other
cases, there is evidence of recent positive selection in ncRNAs, such as the
HAR1 transcript expressed in particular regions of the brain
(Pollard et al., 2006
). While
functionally validated RNAs do not presently add up to a large fraction of the
genome, they do illustrate that lack of conservation does not necessarily
equate to lack of function (Pang et al.,
2006
; Smith et al.,
2004
). They also point to the likelihood that many functional
transcripts, particularly regulatory ncRNAs, are not highly conserved over
significant evolutionary distances.
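To illustrate how loose the snoRNA box consensus quoted above is, the following sketch scans an RNA string for the C box (RUGAUGA), D box (CUGA), H box (ANANNA) and ACA motifs, expanding the IUPAC codes R (A or G) and N (any base) into regular expressions. The function names and the toy sequence are hypothetical. A hit is only a candidate: as the text notes, identification also depends on the positioning of the boxes, which this sketch does not check.

```python
import re
from typing import Dict, List

# Only the IUPAC symbols needed for the consensus motifs quoted in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "R": "[AG]", "N": "[ACGU]"}

BOXES = {"C": "RUGAUGA", "D": "CUGA", "H": "ANANNA", "ACA": "ACA"}


def consensus_to_regex(consensus: str):
    """Compile a consensus into a lookahead regex so overlapping hits are reported."""
    body = "".join(IUPAC[base] for base in consensus)
    return re.compile("(?=(" + body + "))")


def find_box(consensus: str, rna: str) -> List[int]:
    """0-based start positions of every match of `consensus` in `rna`."""
    return [m.start() for m in consensus_to_regex(consensus).finditer(rna)]


def scan_boxes(rna: str) -> Dict[str, List[int]]:
    """Report candidate positions for each box type in a single RNA sequence."""
    return {name: find_box(motif, rna) for name, motif in BOXES.items()}


if __name__ == "__main__":
    toy = "GGAUGAUGACCUUUCUGAGG"  # made-up sequence, not a real snoRNA
    print(scan_boxes(toy))
```

Because motifs as short as CUGA or ACA occur frequently by chance, a scan of this kind would return many false positives on genomic sequence, which is consistent with the point that such RNAs are hard to recognize by primary sequence alone.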
Most of the mammalian genome appears to be evolving more quickly than
protein-coding sequences, and at a (regionally adjusted) rate similar to
ancient transposon-derived sequences. However, this is evidence simply that
the majority of the genome is under similar average selection pressures (M.
Pheasant and J.S.M., manuscript submitted for publication), rather than
being non-functional and evolving neutrally, although the latter is the
favored explanation (Waterston et al.,
2002
), being consistent with the orthodox view. Moreover, it has
been known for some time that the nucleotide substitution frequency varies
across the genome. This has often been interpreted as the result of regional
variation in the background mutation or fixation (related to recombination)
frequencies, rather than selection, as it was (again) inconceivable that the
vast intronic and intergenic sequences could be under selection, since that in
turn would impute function. Variation in substitution frequencies beyond that
which might be expected from random events is also observed at close range
within genomic regions, and the data are more consistent with the genome
comprising different types of genetic information that are evolving at
different rates under different selection pressures and different
structure-function constraints (M. Pheasant and J.S.M., manuscript
submitted for publication).
| Functional constraints on the evolution of regulatory RNAs |
|---|
|
|
|---|