
First published online April 20, 2007
Journal of Experimental Biology 210, 1526-1547 (2007)
Published by The Company of Biologists 2007
doi: 10.1242/jeb.005017

Review Article

A new paradigm for developmental biology

John S. Mattick

ARC Centre for Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, St Lucia QLD 4072, Australia

e-mail: j.mattick{at}imb.uq.edu.au

Accepted 19 February 2007


    Summary
It is usually thought that the development of complex organisms is controlled by protein regulatory factors and morphogenetic signals exchanged between cells and differentiating tissues during ontogeny. However, it is now evident that the majority of all animal genomes is transcribed, apparently in a developmentally regulated manner, suggesting that these genomes largely encode RNA machines and that there may be a vast hidden layer of RNA regulatory transactions in the background. I propose that the epigenetic trajectories of differentiation and development are primarily programmed by feed-forward RNA regulatory networks and that most of the information required for multicellular development is embedded in these networks, with cell–cell signalling required to provide important positional information and to correct stochastic errors in the endogenous RNA-directed program.

Key words: non-coding RNA, intron, regulation


    Introduction
The developmental ontogeny of a human from an embryo to a fully formed adult involves the construction of an organism of approximately 100 trillion cells, with an extremely precise architecture and many differentiated tissues. These include intricately sculpted bones, organs and muscles, such as the dozens of fine muscles in the face (Gray, 1918), as well as a brain that evolves in situ in response to experience (Edelman, 1993). This is an extraordinary feat of genetic programming, which, in all likelihood, requires enormous amounts of information. This information directs not just a human developmental program, or that of another species, but the idiosyncrasies of the particular program that was inherited by the individual from their parents and their ancestors, as exemplified by the shape of our nose, mouth and ears and other identifying familial features.

How is this feat achieved, and where is this information embedded? In the only well-studied case, the nematode worm Caenorhabditis elegans, it is known that developmental ontogeny is precise and invariant, with each cell in the adult being the result of a spatially and temporally ordered progression of cell division, selected apoptosis (programmed cell death) and, ultimately, differentiation into nerve, muscle, gut, germ and other specialized cells (Ambros, 2001; Sternberg and Felix, 1997). Similar processes are observed in the development of insects and mammals (Baehrecke, 2002; McCarthy, 2003), for example in the apoptosis that sculpts the eye ommatidia in the former (Clark et al., 2002) and separates the digits of the fore- and hindlimbs in the latter (Zuzarte-Luis and Hurle, 2005). Thus, it is likely that the ontogeny of higher animals, while vastly more complex and likely to be subject to individual (genomic) variation, is also precisely programmed (Clarke and Tickle, 1999). Indeed, the almost exact identity of monozygotic twins in their physical characteristics and idiosyncrasies, as well as a high degree of concordance in their psychological characteristics (independent of environment), is clear testimony to the precision and reproducibility of the genetic instructions they share.

The genetic programming of development is usually considered to be directed by proteins involved in morphogenetic signalling and various aspects of gene regulation. These include homeodomain-containing proteins, chromatin-modifying proteins, and transcription factors acting on cis-regulatory elements, informed by those involved in cell surface receptor and signal transduction systems. Together they form elaborate modular regulatory networks (Arnone and Davidson, 1997; Bantignies and Cavalli, 2006; Levine and Davidson, 2005; Levine and Tjian, 2003), notwithstanding the recent discovery of microRNAs (see below), which are regarded as an interesting extension of the current paradigm (Davidson, 2006) rather than the vanguard of another entire layer of regulation. This protein-centric perspective underpins most conceptions of the control of development, as exemplified by elegant studies on sea urchin embryogenesis and fruitfly development (Ben-Tabou de-Leon and Davidson, 2006; Davidson, 2006; Levine and Davidson, 2005; Stathopoulos and Levine, 2005). On the other hand, many proteins are shared throughout the metazoa (Duboule and Wilkins, 1998). Moreover, the genomes of C. elegans (Stein et al., 2003), which has only about 1000 cells, and sea urchins (Sodergren et al., 2006) have essentially the same number of annotated protein-coding genes as those of vertebrates, including humans (Aparicio et al., 2002; International Human Genome Sequencing Consortium, 2004a; International Human Genome Sequencing Consortium, 2004b; Goodstadt and Ponting, 2006; Taft et al., 2007).

All of these observations suggest that significant amounts of relevant information must lie beyond protein-coding sequences, presumably in expanded regulatory regions that control the expression of these proteins (Kleinjan and van Heyningen, 2005; Taft et al., 2007). It also seems likely, although firm conclusions are limited by the poor cDNA library coverage in many species, that the proteome is expanded in more developmentally complex species by the increased use of alternative splicing (Graveley, 2001; Smith and Valcarcel, 2000; Stamm et al., 2005). This in turn, however, mandates an increase in regulation, assuming that cell- or tissue-specific alternative splicing is not random. Thus evolutionary innovation and phenotypic divergence are achieved not only by variations in the structure and function of proteins, but also, and probably more so, by those in the regulatory circuitry that controls their deployment (Davidson, 2006; Duboule and Wilkins, 1998; Jacob, 1977; Zuckerkandl and Cavalli, 2007).


    Analogue components and digital information transfer in complex systems
Proteins are extraordinarily versatile macromolecules that perform the vast bulk of the catalytic, structural and (to a greater or lesser extent; see below) regulatory functions in biology. As such, proteins (and their derived products such as carbohydrates, lipids and infrastructural RNAs) may be thought of as the analogue components of cells, in the same way that windows, chairs, wheels, gears, sensors and signalling systems comprise the analogue components of bicycles and aircraft. Damage to components usually has severe consequences for the function of the system and is therefore likely to be very evident, although there will be exceptions.

In addition to sophisticated operational controls, complex entities (whether aircraft or organisms) require extensive and detailed design plans for their construction, information about which has also to be stored in the system, along with the specifications of the components themselves. Random changes to assembly plans may have more subtle effects than those that alter component structure (particularly those that compromise component function), creating design variations that often have less severe consequences, although there will be exceptions in both directions. In biology these changes will therefore often result in minor defects, quantitative trait variation or alterations in disease susceptibility. Altered regulatory information has been shown to underlie such variation in a number of cases where it has been possible to map the causative nucleotide changes to completion in well-structured pedigrees (Clark et al., 2006; Clop et al., 2006; Ishii et al., 2006; Smit et al., 2003; Van Laere et al., 2003).

While it has long been recognized that genetic information is encoded digitally in DNA, it has also been widely assumed that the cellular outputs of this information, expressed via the intermediate of messenger RNA (mRNA), are almost exclusively analogue components. That is, it has been assumed that most genes are synonymous with proteins and that most genetic information is transacted by proteins. This is essentially true for the prokaryotes, whose genomes comprise densely packed protein-coding sequences, although these genomes clearly also encode a limited number of small regulatory RNAs that function in part by sequence-specific interactions with other RNAs and DNA (Gottesman, 2005; Mattick and Makunin, 2006; Vogel and Sharma, 2005; Winkler, 2005). The situation is similar in unicellular eukaryotes such as the yeasts Saccharomyces cerevisiae (David et al., 2006; Olivas et al., 1997) and Schizosaccharomyces pombe (Watanabe et al., 2002). Interestingly, although of similar complexity, the former has more protein-coding sequences than the latter, whereas the latter has many more introns (Goffeau et al., 1996; Wood et al., 2002) and a more elaborate RNA signalling infrastructure, which includes the basic components of the RNA interference (RNAi) pathway (Martienssen et al., 2005). This suggests that there may be some trade-off between protein- and RNA-based forms of gene regulation in simple eukaryotes. In any case, at first approximation it is reasonable to say that micro-organisms, particularly the prokaryotes, are in fact largely analogue devices (the `bicycles' of biology) and that proteins not only comprise the primary structural and catalytic components of these cells but are also the main agents by which they are regulated.

For the past 50 years it has been assumed that the same applies in more complex organisms, i.e. that regulation, particularly developmental regulation, is also largely analogue (protein-based) in multicellular organisms (Davidson, 2006), despite the fact that genome sequence analysis has shown that the numbers of protein-coding genes do not scale strongly or consistently with morphological complexity (Taft et al., 2007) (Fig. 1). This apparently quite reasonable assumption (at least initially) led logically to two subsidiary assumptions: (i) that the increased regulatory sophistication of more complex organisms is achieved through combinatoric interactions of regulatory proteins intersecting with more complex regulatory sequences in promoters and untranslated regions of mRNAs (etc.) (Buchler et al., 2003; Levine and Tjian, 2003); and (ii) that the vast amounts of non-protein-coding sequences in more complex organisms are, apart from a limited amount of cis-acting regulatory sequences, evolutionary debris. The latter view has been reinforced by the fact that many of these non-coding sequences are derived from transposons (DNA sequences that can move within the genome to new positions), themselves widely assumed to be non-functional, selfish DNA (Doolittle and Sapienza, 1980; Orgel and Crick, 1980) and to be evolving `neutrally' (Waterston et al., 2002). These assumptions have remained largely unquestioned for many years and have become articles of faith, but they are not necessarily correct.


Fig. 1. The fraction of non-protein-coding DNA and megabases of protein coding sequence (CDS) per haploid genome in different species. (A) The ratio of the total bases of non-protein-coding to the total bases of genomic DNA per sequenced genome across phyla (i.e. the fraction of non-protein-coding DNA). The four largest prokaryote genomes and two well-known bacterial species are depicted in black. Single-celled organisms are shown in gray, organisms known to be both single and multicellular depending on lifecycle are light blue, basal multicellular organisms are blue, plants are green, nematodes are purple, arthropods are orange, ascidians are yellow, and vertebrates are red. Species names are listed below B. (B) The amount (in megabases) of CDS per genome for species ranked by fraction of non-protein-coding DNA. Figure adapted from Taft et al. (2007) with permission from BioEssays.

 


    Non-linear scaling of regulatory information in integrated systems
In earlier papers it was shown that the requirement for endogenous communication and regulatory information in integrated complex systems, whether cells or computers, scales faster than linearly with function and thus must hit a limit (Gagen and Mattick, 2005; Mattick and Gagen, 2005). This limit can only be relaxed and raised by changing the physical basis and efficiency of the control architecture. In other domains, this limit has been raised by superimposition of digital communication and control systems, using symbolic or sequence-specific strings to store and transmit information within the system. This allows both higher information density and improved transmission accuracy, the latter to overcome the problem of amplified noise (unintended crosstalk) inherent in analogue computation, thereby achieving higher operational sophistication and complexity (see e.g. Collen, 1994). Good examples are the transition from analogue to digital computing (Weinstein and Keim, 1965) and the evolution of aircraft from purely mechanical devices to modern passenger or military jets, wherein a large proportion of the information and cost is entailed in the computing and software systems, including hundreds of kilometers of optical fiber (Csete and Doyle, 2002). Imagine what a bicycle engineer, or even an aeronautical engineer, might have made of the latter when unexpectedly confronted with it, a situation akin to the discovery of introns in the late 1970s (see below).

It should be noted that the power and precision of digital communication and control systems has only been broadly established in the human intellectual and technological experience during the past 20–30 years, well after the central tenets of molecular biology were developed and after introns had been discovered. The latter was undoubtedly the biggest surprise (Williamson, 1977), and its misinterpretation possibly the biggest mistake, in the history of molecular biology. Although introns are transcribed, since they did not encode proteins and it was inconceivable that so much non-coding RNA could be functional, especially in an unexpected way, it was immediately and almost universally assumed that introns are non-functional and that the intronic RNA is degraded (rather than further processed) after splicing. The presence of introns in eukaryotic genomes was then rationalized as the residue of the early assembly of genes that had not yet been removed and that had utility in the evolution of proteins by facilitating domain shuffling and alternative splicing (Crick, 1979; Gilbert, 1978; Padgett et al., 1986). Interestingly, while it has been widely appreciated for many years that DNA itself is a digital storage medium, it was not generally considered that some of its outputs may themselves be digital signals, communicated via RNA¹.

Analysis of prokaryotic genomes has shown that, as predicted (Croft et al., 2003), the numbers of genes encoding regulatory proteins scale almost quadratically with gene number or genome size (Croft et al., 2003; Gagen and Mattick, 2005; Mattick, 2004; Mattick and Gagen, 2005; van Nimwegen, 2003). In addition, extrapolation of these relationships shows that the point where the number of new regulatory genes is predicted to exceed the number of new (non-regulatory) functional genes is close to the observed upper size limit of bacterial genomes (Gagen and Mattick, 2004; Gagen and Mattick, 2005). This implies (albeit does not prove) that bacteria have reached a complexity ceiling imposed by the accelerating cost of protein-based regulation, possibly early in evolution. It also implies (i) that the more complex eukaryotes must have solved the problem some other way, most likely by the co-option of RNA as a sequence-specific regulatory molecule [microRNAs (miRNAs) being a good example] and, more subtly, (ii) that the combinatorics of regulatory factors per se cannot be used to enlarge the regulatory space to get past this ceiling, as there is no a priori reason to expect that prokaryotes could not have easily evolved more complex promoters and recruited additional transcription factors, etc. This in turn suggests that the complex gene regulatory regimes in the higher organisms may operate through multiple layers of regulation and regulatory decisions, rather than multiple (combinatoric) inputs at any given point.
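
The ceiling argument can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that the number of regulatory genes is exactly quadratic in total gene number, R = c·N², whereas the cited analyses report only that the scaling is almost quadratic. The constant c and the function names are hypothetical, so the resulting gene number carries no empirical meaning; the point is only that a crossover exists.

```python
# Toy model (assumption, not taken from the cited analyses): regulatory
# genes R grow exactly quadratically with total gene number N,
# R = c * N**2. The constant c is hypothetical.

def regulatory_genes(n_genes: float, c: float) -> float:
    """Number of regulatory genes under the assumed quadratic law."""
    return c * n_genes ** 2


def ceiling_gene_number(c: float) -> float:
    """Gene number at which new regulatory genes start to outnumber
    new non-regulatory genes.

    Each added gene contributes dR/dN = 2*c*N regulatory genes and
    1 - 2*c*N non-regulatory genes; the two are equal when 2*c*N = 1/2,
    i.e. N = 1 / (4*c).
    """
    return 1.0 / (4.0 * c)


if __name__ == "__main__":
    c = 2.5e-5  # hypothetical constant, chosen only for illustration
    n_star = ceiling_gene_number(c)
    fraction = regulatory_genes(n_star, c) / n_star
    print(f"crossover at N = {n_star:.0f} genes")
    print(f"regulatory fraction at crossover = {fraction:.2f}")
```

Under this assumed law the regulatory fraction at the crossover is always one quarter, whatever the value of c, because R/N = c·N = 1/4 when N = 1/(4c).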

In any case, and consistent with the non-linear scaling of regulatory information, there is a strong relationship between the extent of non-protein-coding DNA sequences in the genomes of higher organisms and their relative complexity. Indeed this appears to be the only consistent relationship between genome information content and complexity (Taft et al., 2007) (Fig. 1). These non-protein-coding sequences occupy almost 99% of the human genome (Frith et al., 2005), and it has been inconceivable to many that they might all be functional as cis-acting regulatory elements (although these have clearly expanded in complex organisms). Again this view is implicitly predicated on the assumption that most genetic information is transacted by proteins.


    The major output of metazoan genomes is non-coding RNA
In apparent opposition to the above assumption, it is now evident that most of the non-protein-coding sequences in genomes are in fact expressed (i.e. transcribed), either as introns in the primary transcripts of protein-coding genes (which occupy ~40% of the human genome) or as intergenic or antisense transcripts (Frith et al., 2005; Mattick and Makunin, 2006). Indeed it appears that the vast majority of all genomes, from yeast to insects and mammals (wherein most studies have been done), are transcribed, much on both strands (Carninci et al., 2005; Cheng et al., 2005; David et al., 2006; Manak et al., 2006). Both cDNA (Carninci et al., 2005; Katayama et al., 2005; Okazaki et al., 2002) and genome tiling array studies (Cheng et al., 2005; Kampa et al., 2004; Kapranov et al., 2002; Kapranov et al., 2005) of the transcriptome have revealed an extraordinarily complex landscape of interleaved and overlapping transcripts, with distal exons, elaborate splicing patterns and alternative polyadenylation sites, many of which appear to have no protein-coding capacity (Mattick and Makunin, 2006). The most recent data show that at least 85% of the Drosophila genome (Manak et al., 2006), 70% of the mouse genome (Carninci et al., 2005) and 93% of the ENCODE regions of the human genome (The ENCODE Project Consortium, manuscript submitted for publication) have experimentally documented transcripts. Moreover, there also appears to be a large and mostly distinct population of non-polyadenylated transcripts located in the nucleus and the cytoplasm, whose existence (despite indications from some very early studies) was not appreciated because of the widespread use of oligo dT to purify mRNA and to construct cDNA libraries (Cheng et al., 2005).

There are literally tens of thousands of long non-coding RNAs (ncRNAs) that have been identified in mammals (Carninci et al., 2005; Kampa et al., 2004; Okazaki et al., 2002), including many antisense transcripts (Alfano et al., 2005; Cocquet et al., 2005; Katayama et al., 2005; Korneev and O'Shea, 2005; Pandorf et al., 2006; Reis et al., 2004; Tufarelli et al., 2003; Werner, 2005; Werner and Berdal, 2005) and large numbers of smaller RNAs such as miRNAs (Berezikov et al., 2006a; Berezikov et al., 2006b) and piRNAs (Aravin et al., 2006; Girard et al., 2006; Lau et al., 2006). Many of these ncRNAs are expressed in a cell- or tissue-specific manner, suggesting that they are developmentally regulated. Characterized long ncRNAs include H19 (Barsyte-Lovejoy et al., 2006; Brannan et al., 1990; Wrana, 1994), 7H4 (Velleca et al., 1994), bic (Tam et al., 1997), NTT (Liu et al., 1997), BORG (Takeda et al., 1998), Xist (Brockdorff, 1998), Tsix (Lee et al., 1999), DD3 (Bussemakers et al., 1999), Msx1 (Blin-Wakkach et al., 2001), Air (Sleutels et al., 2002), MALAT-1 (Ji et al., 2003), adapt33 (Wang et al., 2003), SCA8 (Mutsuddi et al., 2004), MIAT (Ishii et al., 2006), CTN (Prasanth et al., 2005), NFAT (Willingham et al., 2005), PRINS (Sonkoly et al., 2005), TUG1 (Young et al., 2005), PINC (Ginger et al., 2006), SAF (Yan et al., 2005), Evf-2 (Feng et al., 2006), HSR1 (Shamovsky et al., 2006) and HAR1 (Pollard et al., 2006), most of which have been associated with specific cellular or developmental functions and/or disease. However, most of the ncRNAs discovered in genome-wide transcriptomic analyses or expressed from particular genomic regions have not been studied in any detail, although high-throughput cell-based and other screening strategies are beginning to be deployed to ascertain their function (Mattick, 2005; Reis et al., 2004; Willingham et al., 2005). Moreover, the documented numbers of these RNAs are conservative estimates: more are being regularly discovered as genomic analyses of one sort or another delve deeper into the transcriptome. Recent evidence suggests that deep sequencing has not remotely exhausted the repertoire of either long ncRNAs (Carninci et al., 2005) or short ncRNAs (Berezikov et al., 2006a; Berezikov et al., 2006b; Cummins et al., 2006; Ruby et al., 2006) and that there may be hundreds of thousands of small RNAs expressed in humans (T. R. Gingeras, personal communication; L. Croft, R. J. Taft and J.S.M., unpublished data).

These observations confront and very largely contradict the traditional protein-centric view of genetic information and genome organization (Mattick and Makunin, 2006). Either the bulk of the transcriptional output from the human genome and those of other complex organisms is random `noise' (or, in the case of introns, the residue of evolutionary baggage retained and accumulated within genes, as widely assumed) or this transcription comprises a massive but hitherto hidden layer of expression of systemic genetic information that is transacted by RNA (Mattick, 1994; Mattick, 2001; Mattick, 2003; Mattick, 2004). The former has been described as a rather nihilistic view (Werner, 2005), but is one that is comfortable for the prevailing orthodoxy. On the other hand, the latter is strongly supported by the observations that: (i) all well-studied loci in insects and mammals express a large number of non-protein-coding transcripts (e.g. Ashe et al., 1997; Bae et al., 2002; Holmes et al., 2003; Jones and Flavell, 2005; Lemons and McGinnis, 2006; Lipshitz et al., 1987; Sanchez-Herrero and Akam, 1989; Sessa et al., 2007); (ii) many of the experimentally detected ncRNAs are differentially expressed (Carninci et al., 2005; Cheng et al., 2005; Ravasi et al., 2006), apparently under the control of common transcription factors (Barsyte-Lovejoy et al., 2006; Cawley et al., 2004); (iii) at least some have specific subcellular locations (Ginger et al., 2006; Prasanth et al., 2005); and (iv) at least some have been shown to be functional (Brannan et al., 1990; Brockdorff, 1998; Feng et al., 2006; Ginger et al., 2006; Prasanth et al., 2005; Velleca et al., 1994; Willingham et al., 2005; Wrana, 1994; Young et al., 2005).

Microarray analyses have shown that large numbers of ncRNAs are dynamically regulated during the differentiation of embryonal stem cells, myoblasts, neuronal cells and the gonadal ridge, as well as during T-cell and macrophage activation (M. E. Dinger, K. C. Pang, I. Qureshi, M. Crowe, A. C. Perkins, S. M. Grimmond, D. A. Hume, P. A. Koopman, G. E. O Muscat, S. Bruce, M. F. Mehler and J.S.M., manuscript in preparation) and in cancer (Lu et al., 2005; Reis et al., 2004). In addition, in situ hybridization analyses are revealing large numbers of ncRNAs that are expressed in particular regions of the brain and in particular subcellular locations (T. R. Mercer, M. E. Dinger, S. Sunkin, M. F. Mehler and J.S.M., in preparation). Many of these ncRNAs are antisense or intronic to genes encoding proteins important in neural development, function and disease. It is also now evident that many of the complex genetic phenomena in complex organisms, including transcriptional and post-transcriptional gene silencing (Cogoni and Macino, 2000; Matzke et al., 2001; Zamore and Haley, 2005), imprinting (Kelley and Kuroda, 2000; Morison et al., 2005; Nikaido et al., 2003) and probably also transvection (Mattick and Gagen, 2001) and transinduction (Ashe et al., 1997), are linked to RNA signalling (Mattick, 2003; Mattick and Gagen, 2001).


    Digital–analogue conversion of RNA signals
A key advantage of RNA is its sequence specificity, in that it can direct a precise interaction with its target by base pairing, over short stretches of nucleotides, far more efficiently than can be achieved by proteins. This allows large numbers of regulatory controls to be encoded compactly in genomes, especially as those genomes come under pressure to contain exponentially greater amounts of regulatory information as complexity increases. These regulatory controls can also be flexibly altered and re-configured by evolution to achieve phenotypic variation without altering the underlying components of the system, a concept that is well established in engineering (Mattick and Gagen, 2001). A good case in point is that of miRNAs, some of which are widely distributed among species and highly conserved while others are species-specific (Berezikov et al., 2006a; Berezikov et al., 2006b), with two documented cases of mutations in miRNA target sites underpinning disease (Abelson et al., 2005) or quantitative trait variation (Clop et al., 2006). RNAs also intrinsically possess much more precise specificity of interactions with other RNAs and DNA than is usually possible by and between proteins, thus potentially improving the precision of the control system and minimizing noise from crosstalk, especially in complex regulatory networks. (The problem of noise was a primary limitation of analogue computers and a primary driving force in the transition to digital computing.) Thus it appears that evolution may have discovered the power of digital communication and control systems a billion years before we did (see below).

However, the sequence-specific interaction of a regulatory RNA with its target is relatively sterile unless this interaction can be converted into a meaningful analogue action. At its simplest level, this may comprise antisense binding to block another interaction, and this primitive mechanism seems to be a common feature of regulatory RNAs in prokaryotes. However, a more sophisticated strategy is to embed secondary signals either in the RNA itself or in the structure of the resulting RNA:RNA or RNA:DNA complex, to recruit different types of complexes, which then undertake the type of analogue action required upon receipt of the signal. Good examples are (i) the complexes of RNA-modifying enzymes that act at a site adjacent to and determined by the position of the sense:antisense interaction between small nucleolar RNAs (snoRNAs) and their targets (Bachellerie et al., 2002; Meier, 2005), and (ii) the RNA-induced silencing (RISC) complexes that act on RNAs bound to small interfering RNAs (siRNAs) and miRNAs (Tang, 2005). Thus, there are two components to RNA signals: a sequence-specific interaction with the intended target(s) and a secondary or tertiary structural component that acts as a transducer to recruit generic infrastructural proteins to impart different types of actions. Indeed, this two-stage principle also applies to other classes of functional RNAs including snRNAs and tRNAs, which recognize splice junctions in pre-mRNAs or codons in mRNAs and recruit the spliceosome or ribosome, respectively. That is, RNAs function as adaptors, with a target sequence-specific address code and separate structural motifs that specify the type of consequent function and bind the appropriate proteins.
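
The `address code' half of this two-component principle can be illustrated with a minimal sketch. It assumes perfect Watson-Crick complementarity only (no G:U wobble pairs, mismatches or secondary structure) and says nothing about the structural motifs that recruit proteins. The sequences and function names are invented for illustration and do not correspond to any real miRNA, snoRNA or target.

```python
# Watson-Crick pairs only (assumption: no G:U wobble, no mismatches).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Sequence that pairs antiparallel with `rna`, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(guide: str, target: str) -> list:
    """0-based start positions in `target` that pair perfectly with `guide`."""
    site = reverse_complement(guide)
    return [
        i
        for i in range(len(target) - len(site) + 1)
        if target[i:i + len(site)] == site
    ]


if __name__ == "__main__":
    guide = "UGAGGUAG"                  # hypothetical 8-nt guide RNA
    target = "AAACUACCUCAGGGCUACCUCA"   # hypothetical target with two sites
    print(find_target_sites(guide, target))  # prints [3, 14]
```

An eight-nucleotide match already distinguishes one site among 4^8 = 65,536 possible sequences of that length, which is the sense in which a short stretch of base pairing can act as a compact address.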

Such considerations suggest that a receptive infrastructure for RNA signalling must have co-evolved with the RNA signals themselves and become progressively more sophisticated as RNA regulatory and transport networks gained currency during the evolution of the eukaryotes. Examples include the proteins of the argonaute family and others associated with RNA interference (Carmell et al., 2002), and those containing RRM domains, KH domains, SR domains, SET domains, pumilio-homology domains and double-stranded RNA-binding domains, which occur in a wide range of developmental regulators with global functions (Anantharaman et al., 2002; Bernstein and Allis, 2005; Saunders and Barber, 2003; Wang et al., 2002). Indeed, many of the so-called nucleic acid binding proteins and chromatin-binding proteins whose target specificity is uncertain or unknown may in fact recognize different types of RNA signals. This possibility is supported by evidence suggesting that regulatory proteins containing C2H2 zinc fingers (Shi and Berg, 1995), Y-boxes (Ladomery, 1997), chromodomains (Akhtar et al., 2000; Bernstein and Allis, 2005), tudor domains (Maurer-Stroh et al., 2003) and SET domains (Krajewski et al., 2005), and others such as DNA methyl transferases (Jeffery and Nakielny, 2004), may recognize such RNA signals in one form or another.


    The origin and evolution of RNA-based regulatory networks in complex organisms
I suggest that the transition from a largely analogue protein-based regulatory control to digitally based RNA regulation was a fundamental rate-limiting step in the emergence of complex organisms (Mattick, 1994; Mattick and Gagen, 2001), together with other factors such as the level of atmospheric oxygen (Canfield et al., 2007). It follows that the RNA-based regulatory systems underpinning the ability to control more complex developmental trajectories must have been largely in place prior to the metazoan radiation and have been a critical factor enabling this evolutionary event (Mattick, 1994; Mattick, 2001; Mattick, 2004; Mattick and Gagen, 2001). Following the emergence of all modern animal phyla at that time, often referred to as the Cambrian explosion (Fig. 2), these new dynasties of multicellular organisms settled down to `battle it out' in evolutionary competition. This was achieved, firstly, by refining and introducing new adaptations to body plans to improve their competitiveness for survival and reproduction, and to enable the colonization of new ecological niches and new domains such as the land and the air. The latter presented new physical and physiological challenges, which required significant innovations in proteins as well as in the regulatory architecture controlling developmental ontogeny (Bejerano et al., 2004; Kleinjan and van Heyningen, 2005; Mattick and Gagen, 2001). Recent data indicate that many regulatory RNAs, such as miRNAs, emerged in the ancestors of the Bilateria (Hertel et al., 2006; Prochnik et al., 2007) and in major transitions of metazoan evolution, including the advent of the vertebrates and eutherian mammals (Hertel et al., 2006). Secondly, there would have been considerable evolutionary advantage, and therefore pressure, to enhance sensory and cognitive capacities to recognize and respond to opportunities and threats and to alter the environment in favour of better survival and reproduction. This led to the evolution of learning and memory, an even greater mechanistic challenge that almost certainly involved RNA editing as a means of dynamically intersecting the environment with otherwise hardwired genetic information, ultimately leading to the emergence of higher-order cognition (Mehler and Mattick, 2007).


Fig. 2. A simplified view of the biological history of the Earth.

 
Although RNA is an ancient molecule and may well have been the progenitor of both DNA and proteins (Gesteland et al., 2006), its evolution as a regulatory molecule with associated infrastructure and networks probably had its genesis in the invasion of eukaryotic protein-coding genes by mobile self-splicing group II introns (Cavalier-Smith, 1991; Cousineau et al., 2000; Lambowitz and Zimmerly, 2004; Mattick, 1994; Palmer and Logsdon, Jr, 1991). These sequences occur in prokaryotes (Ferat and Michel, 1993; Martinez-Abarca and Toro, 2000) but are restricted to non-protein-coding sequences by the intimate coupling between transcription and translation (Cavalier-Smith, 1991; Mattick, 1994), thereby restricting the target area for evolutionary experimentation. While RNA regulation occurs in prokaryotes, it is not well developed, just as there is little need for digital control systems in a bicycle. The need to find solutions to the accelerating problem of increasing regulatory sophistication required to underpin multicellular development – ultimately through the co-option of RNA as a compact signalling molecule and later connecting these signals to different types of actions through the co-evolution of different types of RNA binding and effector proteins – might have been felt by both prokaryotes and eukaryotes, but the latter may have had more opportunity to do so, especially given the compartmentalization of their cells. This latter feature probably arose due to the lifestyle of early eukaryotes as phagocytic cellular predators, such as amoebae or macrophages (Cavalier-Smith, 1991). Importantly, the separation of transcription from translation by the introduction of a nuclear membrane allowed introns to invade protein-coding sequences, as their negative effects could be minimized as long as they were (self) spliced out before export to the cytoplasm. In so doing, it also created the raw material for a new round of molecular evolution of RNA signals produced in parallel with protein-coding sequences (Mattick, 1994) (Fig. 2).

The subsequent evolution of the spliceosome occurred by the devolution of the originally cis-acting catalytic sequences within introns to trans-acting generic co-factors (spliceosomal RNAs) and the recruitment of ancillary proteins. This reduced the internal sequence constraints on the introns, allowing them more freedom to evolve and flexibly explore new functional space (as RNA molecules). It also made their excision from primary transcripts more efficient, perversely providing them with even greater facility to expand and invade other genes (Mattick, 1994). As these RNA networks began to be established, proteins capable of recognizing subsets of signals in these networks would have been selected for, increasing the sophistication of the system. Moreover, it would be expected that increasing numbers of genes would have evolved solely to express RNA as higher-order regulators in this increasingly complex system. This will have occurred at least in part by gene duplication followed by loss of protein-coding capacity, as appears to have happened in Xist (the ncRNA controlling X chromosome inactivation in female mammals) (Duret et al., 2006) and in many of the non-protein-coding genes that encode snoRNAs or miRNAs in their introns (Cavaille et al., 2001; Mattick and Makunin, 2005; Rodriguez et al., 2004; Tycowski et al., 1996; Ying and Lin, 2005). Interestingly, many ncRNAs are alternatively spliced (Cocquet et al., 2005; Pang et al., 2005), suggesting that there is an operational distinction between RNA sourced from exons and introns. The other major source of functional RNAs has almost certainly been various other types of mobile (transposable) elements, many of which are derived from small RNAs and have been a potent force in genome evolution and genetic innovation (Brosius, 1999; Brosius, 2005; Waterston et al., 2002).


    The extent of the genome under evolutionary selection
This raises the question of the composition, rate of evolution and functionality of the genome as a whole, especially as it is now known that most of the genome is transcribed. A large percentage of the mammalian genome (~46% in humans) is composed of transposon-derived sequences (Lander et al., 2001; Waterston et al., 2002), often pejoratively referred to as repeats, and assumed to be non-functional and therefore evolving `neutrally' (Waterston et al., 2002). The same assumption has often been made about introns, although it is now evident that there are significant amounts of conserved sequences within them (Dermitzakis et al., 2003; Hare and Palumbi, 2003; Sironi et al., 2005), presumably reflecting either functional RNA products or important cis-acting regulatory sequences. In any case, on the assumption that ancient repeats (ARs) can be used as a yardstick of the background neutral evolutionary rate, it has been estimated that ~5% of the human genome is under purifying selection in mammals (Waterston et al., 2002), and therefore functional, with the remainder largely considered to comprise genetically inert, neutrally evolving evolutionary debris.

This is in direct contradiction to the suggestion that much of the genome-wide transcription, which is developmentally regulated, is functional. However, it is questionable whether the ARs that are used as yardsticks for these estimations are really evolving neutrally. First, if ARs have no functional relevance to the organism, they would be expected to evolve freely and eventually to either acquire function or be deleted (M. Pheasant and J.S.M., manuscript submitted for publication), as appears to have occurred with a large fraction of ARs (Waterston et al., 2002). That is, the more ancient the extant sequence, the more likely it is to have acquired function. Second, in agreement with this logic, there are increasing numbers of transposon-derived sequences of all classes, both ancient and modern, including lineage-specific repeats such as Alu elements, that have been shown to have undergone functional exaptation as gene promoters, regulatory elements, exons and microRNA precursors (Bejerano et al., 2006; Britten, 2006; Brosius, 1999; Dagan et al., 2004; Ferrigno et al., 2001; Hasler and Strub, 2006; Krull et al., 2005; Landry et al., 2001; Lev-Maor et al., 2003; Lippman et al., 2004; Matlik et al., 2006; Nigumann et al., 2002; Smalheiser and Torvik, 2005; Smalheiser and Torvik, 2006; Volff, 2006; Zhou et al., 2002).

These observations throw increasing doubt on the widespread assumption that such sequences are mostly parasitic and remain as inert genomic passengers. Transposable elements have also been found to underlie the birth of new genes and regulatory networks (Brandt et al., 2005; Cordaux et al., 2006; Landry et al., 2001; Zhou et al., 2002) and to influence early development (Peaston et al., 2004) and phenotypic variation (Whitelaw and Martin, 2001). It is also possible to identify AR sequences that are clearly conserved, some of which are very ancient (Nishihara et al., 2006), such as recently discovered classes of ARs in humans sharing common ancestors with those in marsupials (Kamal et al., 2006) and fish (Ogiwara et al., 2002; Xie et al., 2006), including an example of the slowest evolving regions of the human genome (Bejerano et al., 2006). Moreover, some major classes of ARs show variable rates of sequence conservation within them. One example is the class of so-called `mammalian interspersed repeats' (MIRs), of which there are ~300 000 copies in the human genome (Smit and Riggs, 1995). These MIRs date back ~130 million years and are tRNA-derived SINEs (short interspersed elements) with a consensus length of ~260 nt, including a 70 nt central region and a 15–25 nt more highly conserved core (Silva et al., 2003; Smit and Riggs, 1995). The fact that hundreds of thousands of such elements have an internal sequence that is conserved more highly than the rest of the element is prima facie evidence that this class of ARs (or at least the conserved core within them) is not neutrally evolving and is likely under selection, presumably for function and possibly as regulatory RNAs.

It is also clear that there are widely different rates of evolution of different types of genomic sequences, particularly of gene regulatory sequences, some of which are extraordinarily highly conserved blocks (Bejerano et al., 2004), while many others cover extended genomic regions and exhibit rapid turnover (Fisher et al., 2006; Frith et al., 2006; Smith et al., 2004; Taylor et al., 2006). The latter includes the remarkable functional conservation of regulatory sequences controlling ret gene expression in zebrafish and humans, although there is little recognizable primary sequence conservation (Fisher et al., 2006). The cis-regulatory elements of the HoxA cluster have also been shown to undergo accelerated evolution, presumably under positive selection during the origin of amniotes and mammals (Wagner et al., 2004). Moreover, it is evident that phenotypic diversification may be due as much, if not more, to changes in regulatory architecture than to the protein components (Duboule and Wilkins, 1998; Levine and Tjian, 2003; Mattick and Gagen, 2001). Indeed, regulatory sequences often exhibit considerable evolutionary plasticity (depending on the number of their interacting targets; see below) and relatively low conservation (Pang et al., 2006) compared with proteins whose evolutionary flexibility is limited by both analogue structure–function relationships and multitasking, i.e. the differential use of the same components in multiple contexts (Duboule and Wilkins, 1998; Mattick and Gagen, 2001).

There are also other regions of the genome under evolutionary constraints that are not evident at the primary sequence level, including shuffled cis-regulatory elements (Sanges et al., 2006), gene deserts (Ovcharenko et al., 2005), transposon-free regions (Simons et al., 2006), chromatin domains (Bernstein et al., 2005; Bernstein, B. E. et al., 2006), regions under indel-purifying selection (Lunter et al., 2006), the distances between ultra-conserved elements (Sun et al., 2006) and regions predicted to contain common RNA secondary or tertiary structures (Lescoute et al., 2005; Washietl et al., 2005). Thus, the proportion of functionally meaningful DNA in the human genome is substantially greater than estimated from sequence conservation alone (Smith et al., 2004).

Different rates of evolution also occur within and between different classes of functional gene products, both RNAs and proteins. While most protein-coding sequences are highly constrained and hence highly conserved, some are much more flexible and others have diverged under positive selection (Bustamante et al., 2005). The estimated 5% of the human genome that is conserved with mouse does not include 35% of annotated protein-coding sequences and 17% of RefSeq annotated genes (M. Pheasant and J.S.M., manuscript submitted for publication). Many miRNAs are highly conserved (Pang et al., 2006) but many are not, being lineage- or even species-specific (Berezikov et al., 2006a; Berezikov et al., 2006b). There are also thousands of recently discovered small RNAs (piRNAs) expressed in testis that are not conserved between rodents and humans, although similar RNAs are produced from syntenically orthologous loci (Aravin et al., 2006; Girard et al., 2006; Lau et al., 2006). SnoRNAs have very divergent sequences and many are identifiable only by the loose consensus and positioning of the C/D (RUGAUGA/CUGA) (Shanab and Maxwell, 1992) or H (ANANNA)/ACA boxes (Meier, 2005). It is also clear that many longer functional non-protein-coding RNAs (ncRNAs), such as the Xist and Tsix transcripts involved in X-chromosome dosage compensation, are evolving quickly (Chureau et al., 2002; Migeon et al., 2001; Nesterova et al., 2001; Pang et al., 2006). In other cases, there is evidence of recent positive selection in ncRNAs, such as the HAR1 transcript expressed in particular regions of the brain (Pollard et al., 2006). While functionally validated RNAs do not presently add up to a large fraction of the genome, they do illustrate that lack of conservation does not necessarily equate to lack of function (Pang et al., 2006; Smith et al., 2004). They also point to the likelihood that many functional transcripts, particularly regulatory ncRNAs, are not highly conserved over significant evolutionary distances.
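
To make concrete how loose the snoRNA box consensus quoted above is, the following minimal Python sketch scans an RNA sequence for the four motifs. It assumes the standard IUPAC reading of the consensus letters (R = A or G, N = any nucleotide); the function names and the toy sequence are hypothetical. Motifs this short occur by chance in almost any transcript, so a hit list of this kind is not a snoRNA predictor, which is why the positioning of the boxes matters as well.

```python
import re

# Illustrative sketch only. IUPAC codes are read as R = A/G, N = any base;
# names and the example sequence are hypothetical, not from the article.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "R": "[AG]", "N": "[ACGU]"}

# Consensus box motifs as quoted in the text.
BOXES = {"C": "RUGAUGA", "D": "CUGA", "H": "ANANNA", "ACA": "ACA"}


def motif_to_regex(motif):
    """Translate a consensus motif into a regular-expression string."""
    return "".join(IUPAC[code] for code in motif)


def find_boxes(rna):
    """Return {box name: [0-based start positions]} for every motif hit."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    hits = {}
    for name, motif in BOXES.items():
        # A lookahead lets overlapping occurrences be reported too.
        pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
        hits[name] = [m.start() for m in pattern.finditer(rna)]
    return hits


if __name__ == "__main__":
    toy = "GGAUGAUGACCCCCUGAGG"  # toy sequence with one C box and one D box
    print(find_boxes(toy))
```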

Most of the mammalian genome appears to be evolving more quickly than protein-coding sequences, and at a (regionally adjusted) rate similar to ancient transposon-derived sequences. However, this is evidence simply that the majority of the genome is under similar average selection pressures (M. Pheasant and J.S.M., manuscript submitted for publication), rather than being non-functional and evolving neutrally, although the latter is the favoured explanation (Waterston et al., 2002), being consistent with the orthodox view. Moreover, it has been known for some time that the nucleotide substitution frequency varies across the genome. This has often been interpreted as the result of regional variation in the background mutation or fixation (related to recombination) frequencies, rather than selection, as it was (again) inconceivable that the vast intronic and intergenic sequences could be under selection, since that in turn would impute function. Variation in substitution frequencies beyond that which might be expected from random events is also observed at close range within genomic regions, and the data are more consistent with the genome comprising different types of genetic information that are evolving at different rates under different selection pressures and different structure–function constraints (M. Pheasant and J.S.M., manuscript submitted for publication).


    Functional constraints on the evolution of regulatory RNAs
Structure–function constraints are different for different types of molecules. As noted already, proteins are analogue components that have quite strict structural specifications. There are only so many ways to construct a wheel, a catalytic site, or an oxygen-binding pocket that is responsive to O2 and CO2 partial pressures, and it is hard to vary a successful design. On the other hand, sequence-specific regulatory signals like miRNAs are purely informational and only need to address the right targets; thus at first glance it seems a mystery why many of the known miRNAs have been so fiercely conserved – more so than most protein-coding sequences (Pang et al., 2006) – over 500 million years of evolution from worms to mammals. The exact sequence of these small RNAs does not seem to matter that much: it is easy to design them artificially against almost any sequence, and such siRNAs are now commonly employed as experimental tools (Chalk et al., 2005; Truss et al., 2005). So why have some been so frozen in evolution? The answer appears to be that those miRNAs that were first cloned are common central regulators that have multiple targets (John et al., 2004; Lewis et al., 2005; Lim et al., 2005), which makes co-variation almost impossible in evolutionary terms. If the odds of a miRNA and a target co-varying by compensatory mutations in the same generation are 10⁻⁵, the odds of co-variation of a miRNA with 20 targets are (10⁻⁵)²⁰ = 10⁻¹⁰⁰. Most miRNAs that have been subsequently identified through bioinformatics means have also invoked evolutionary conservation as a filter (Berezikov et al., 200