SUMMARY
Repeatability studies are gaining considerable interest among physiological ecologists, particularly in traits affected by high environmental/residual variance, such as whole-animal metabolic rate (MR). The original definition of repeatability, known as the intraclass correlation coefficient, is computed from the components of variance obtained in a one-way ANOVA on several individuals, each measured two or more times. An alternative estimation of repeatability, popular among physiological ecologists, is the Pearson product–moment correlation between two consecutive measurements. However, despite the more than 30 studies reporting repeatability of MR, so far there is no definitive synthesis indicating: (1) whether repeatability changes in different types of animals; (2) whether some kinds of metabolism are more repeatable than others; and most important, (3) whether metabolic rate is significantly repeatable. We performed a meta-analysis to address these questions, as well as to explore the historical trend in repeatability studies. Our results show that metabolic rate is significantly repeatable and its effect size is not statistically affected by any of the mentioned factors (i.e. repeatability of MR does not change in different species, type of metabolism, time between measurements, and number of individuals). The cumulative meta-analysis revealed that repeatability studies in MR have already reached an asymptotic effect size with no further change in either its magnitude or its variance (i.e. additional studies will not contribute significantly to the estimator). There was no evidence of strong publication bias.
 repeatability
 heritability
 meta-analysis
 energy metabolism
 intraclass correlation coefficient
 effect size
Introduction
In evolutionary terms, traits are attributes of organisms that in some way reflect biological performance, and which ultimately have an impact on fitness (Arnold, 1983). The operational definition of a trait involves an instrument and a measurement scale. A crucial concept in this definition is repeatability: the time-consistency of a trait, which is measured by the intraclass correlation coefficient τ = σ^{2}_{A}/(σ^{2}_{A} + σ^{2}_{e}), where σ^{2}_{A} is the between-individual component of variance and σ^{2}_{e} is the residual variance component when multiple measurements are performed in the same individuals (i.e. within-individual variation). Between-individual variance equals genetic variance (σ^{2}_{G}) + general environmental variance (σ^{2}_{E}). In turn, phenotypic variance (σ^{2}_{P}) equals σ^{2}_{G} + σ^{2}_{E} + residual variance (σ^{2}_{e}). Then, τ can also be decomposed as τ = (σ^{2}_{G} + σ^{2}_{E})/σ^{2}_{P} (Lessels and Boag, 1987; Falconer and Mackay, 1997). Since this measurement includes genetic variance, repeatability appears not only to give insight into the time-consistency of a trait (i.e. its 'qualification' to be treated as a trait) but also to set the upper limit for heritability (the ratio between genetic variance and phenotypic variance) (Lessels and Boag, 1987; Falconer and Mackay, 1997; Roff, 1997; Lynch and Walsh, 1998; Dohm, 2002).
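The definition of τ above can be illustrated with a minimal sketch that estimates the two variance components from a balanced one-way ANOVA. The function name, the data layout (one row per individual) and the example values are assumptions made for illustration only; they are not taken from any of the reviewed studies.

```python
def intraclass_correlation(data):
    """Repeatability (tau) from a balanced one-way ANOVA.

    data: list of rows, one row per individual, every row holding the
    same number of repeated measurements (hypothetical layout).
    """
    n_ind = len(data)
    n_rep = len(data[0])
    grand_mean = sum(sum(row) for row in data) / (n_ind * n_rep)
    ind_means = [sum(row) / n_rep for row in data]

    # Mean squares among individuals and within individuals.
    ms_among = n_rep * sum((m - grand_mean) ** 2 for m in ind_means) / (n_ind - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(data, ind_means) for x in row
    ) / (n_ind * (n_rep - 1))

    # Variance components: sigma^2_e = MS_within,
    # sigma^2_A = (MS_among - MS_within) / n_rep.
    var_among = (ms_among - ms_within) / n_rep
    var_resid = ms_within
    return var_among / (var_among + var_resid)


# Made-up metabolic rates for four individuals measured three times each.
example = [
    [10.1, 9.8, 10.3],
    [12.0, 12.4, 11.9],
    [8.7, 9.1, 8.8],
    [11.2, 10.9, 11.5],
]
tau = intraclass_correlation(example)
print(f"tau = {tau:.3f}")
```

With unbalanced data the divisor for σ^{2}_{A} is no longer simply the number of repeats, so this sketch should only be read as the balanced special case.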
The ease of repeatability computations (and the difficulty of quantitative genetic studies) makes this quantity of great interest for organismal biologists interested in the evolutionary significance of traits, but especially important for physiological ecologists working with metabolic rate (MR) in whole animals. In such studies, the adaptive significance of MR is frequently quoted (e.g. McNab, 2002). However, MR is usually measured by flow-through respirometry, a technique that includes considerable residual variation (i.e. error variance). The most accurate modern flow-through respirometers have a minimum of 10–20% measurement error (Konarzewski et al., 2005), which also holds for isotopic methods for field metabolic rate (Speakman, 2004).
Logistically, the measurement of MR is not as straightforward as that of other kinds of traits such as morphology, life histories or even behavior. It requires the researcher to capture animals, move them into the laboratory, and usually to acclimate them for a number of days or weeks. In addition, depending on the desired metabolic variable, respirometric trials are usually combined with certain conditions imposed on the animals (e.g. cold, warm, treadmill running, noradrenaline injection, fasting period). In summary, because of the very nature of the technique, MR is a trait with high residual variance (see Konarzewski et al., 2005), and consequently repeatability of MR is important to ecological and evolutionary physiologists.
The repeatability of metabolic rate has been inferred from the intraclass correlation coefficient and multiple measurements. Also, a commonly used repeatability estimation is the Pearson product–moment correlation from two consecutive measurements (r_{P} = σ_{x,y}/(σ_{x}σ_{y}), where σ_{x,y} is the covariance between the first and the second measurements, x and y, and σ_{x}σ_{y} is the product of both standard deviations) (Lynch and Walsh, 1998). Although variance components and τ can also be computed from two measurements, for practical reasons authors have tended to use r_{P} when two measurements are available, and τ when multiple measurements are performed (see Table 1).
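As a minimal sketch of the second estimator, the following computes r_{P} directly from its definition as covariance over the product of standard deviations. The function name and the paired values are hypothetical and serve only to show the arithmetic.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between paired measurements."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample standard deviations (n - 1 denominators).
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov_xy / (sd_x * sd_y)


# Made-up first and second MR measurements on five individuals.
first = [10.1, 12.0, 8.7, 11.2, 9.5]
second = [9.8, 12.4, 9.1, 10.9, 9.9]
r_p = pearson_r(first, second)
print(f"r_P = {r_p:.3f}")
```

Unlike τ, r_{P} is insensitive to a shift in the mean between the two measurement sessions, which is one reason the two estimators need not agree on the same data.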
We performed a meta-analysis on repeatability of metabolic rate to answer the following questions:

Does repeatability change in different types of animals?

Are some kinds of metabolic rates more repeatable than others?

What is the effect of elapsed time between measurements on repeatability?

Does repeatability of metabolic rate from laboratory and wild populations differ?
Materials and methods
Literature survey
We searched the literature in order to find published studies that measured repeatability of MR in animals. We recorded the magnitude of the estimator, number of individuals measured and the time elapsed between measurements, in days. We found 30 studies, from which we extracted 47 estimators that were classified according to the organisms measured (we found studies on birds, mammals, insects, reptiles and fish) and according to the standard nomenclature for metabolic measurement: locomotory maximum metabolic rate (MMR_{LOC}), standard metabolic rate (SMR), field metabolic rate (FMR), resting metabolic rate (RMR), thermoregulatory maximum metabolic rate (MMR_{THERM}) and basal metabolic rate (BMR) (Table 1). For small mammals, studies were also classified as from laboratory or wild populations. After removing a few studies where body mass M_{b} was not controlled (N=5, see Table 1), the resulting effect size was almost identical, so we present results including those studies (see also Konarzewski et al., 2005).
Statistics
Conventional statistical methods were performed using Statistica 6.0 whereas the meta-analytical techniques were performed using MetaWin 2.1 (Rosenberg et al., 2000). We first computed effect sizes and their variances by applying Fisher's Z-transformation. Then, we computed the mean effect size for the sample, its 95% confidence intervals, bootstrapped confidence intervals and general heterogeneity by the Q-statistic, which is distributed as χ^{2} with N–1 degrees of freedom (d.f.) (Rosenberg et al., 2000). We specifically tested the categorical structure of the data: type of variable (six levels), type of organism (five levels), and type of population (only in rodents; two levels). Both type of variable and type of organism were considered random factors since they do not account for all possible levels, and population (lab/wild) was considered fixed. We decomposed the total heterogeneity (Q_{T}) into the heterogeneity explained by the model (Q_{M}) and error heterogeneity (Q_{E}) in a similar fashion to one-way analysis of variance (Rosenberg et al., 2000). These procedures allowed us to test whether (1) different kinds of MR have a significant effect on published repeatabilities, (2) different kinds of animals have a significant effect on published repeatabilities and (3) whether, in the case of small mammals, laboratory and wild populations differ in their repeatability estimation.
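The steps described above can be sketched as follows. This is not a reproduction of the MetaWin computations: it is a fixed-effects illustration only (no random factors, no bootstrapping), it assumes the usual sampling variance 1/(n–3) for a Fisher-transformed correlation, and all function names and example values are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher's Z-transformation of a correlation-type estimate."""
    return math.atanh(r)


def z_variance(n):
    """Assumed sampling variance of Fisher's Z for n individuals."""
    return 1.0 / (n - 3)


def weighted_mean(effects, variances):
    """Inverse-variance weighted mean effect size and its standard error."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return mean, math.sqrt(1.0 / sum(weights))


def q_statistic(effects, variances):
    """Heterogeneity Q: weighted squared deviations from the mean effect."""
    mean, _ = weighted_mean(effects, variances)
    return sum((e - mean) ** 2 / v for e, v in zip(effects, variances))


def partition_q(effects, variances, groups):
    """Split Q_T into Q_M (between groups) and Q_E (within groups)."""
    q_total = q_statistic(effects, variances)
    q_error = 0.0
    for label in set(groups):
        idx = [i for i, g in enumerate(groups) if g == label]
        q_error += q_statistic([effects[i] for i in idx],
                               [variances[i] for i in idx])
    return q_total, q_total - q_error, q_error


# Made-up repeatabilities, sample sizes and categories.
r_values = [0.55, 0.72, 0.40, 0.81, 0.63, 0.58]
n_values = [20, 35, 15, 50, 28, 40]
kinds = ["BMR", "BMR", "SMR", "MMR", "MMR", "SMR"]

z = [fisher_z(r) for r in r_values]
v = [z_variance(n) for n in n_values]
mean_z, se_z = weighted_mean(z, v)
ci = (mean_z - 1.96 * se_z, mean_z + 1.96 * se_z)
q_t, q_m, q_e = partition_q(z, v, kinds)
print(f"E = {mean_z:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Q_T = {q_t:.2f}, Q_M = {q_m:.2f}, Q_E = {q_e:.2f}")
```

In this sketch Q_{T} would be compared against a χ^{2} distribution with N–1 d.f., and Q_{M} against one with (number of groups – 1) d.f.; the P-value lookup is omitted to keep the example free of external libraries.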
In several cases we used more than one estimator from a single study, which could potentially violate the assumption of independence of meta-analyses (i.e. the within-study variance could be larger than the among-study variance due to methodological similarities) (Rosenberg et al., 2000). To test for such a possible effect, we performed a preliminary analysis with those studies that reported more than one estimator and tested whether they had a categorical effect on repeatability. This preliminary result showed that the 'study' effect was non-significant (Q_{M}=3.63; Q_{E}=9.99; P_{χ2}=0.60; P_{rand}=0.58). In addition, we performed a cumulative meta-analysis in order to assess the chronological trend in the effect sizes. This analysis permits determination of whether the present effect sizes were attained at some point in the past (further studies being essentially redundant) (Rosenberg et al., 2000). Finally, we assessed publication bias graphically by funnel plots, and also by fail-safe numbers. This last procedure computes the number of non-significant, unpublished or missing studies that would need to be added to a meta-analysis in order to change the results from significant to non-significant. Specifically, we applied the Rosenthal method (Rosenberg et al., 2000), which computes the number of additional studies with a mean effect size of zero needed to reduce the combined significance to an alpha level set equal to 0.05. Additionally, we computed the Orwin method (Rosenberg et al., 2000), which computes the number of additional studies needed to reduce an observed mean effect size to a minimum effect of 0.2.
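The fail-safe numbers and the cumulative analysis described above can be sketched with the textbook forms of the Rosenthal and Orwin formulas. Whether MetaWin uses exactly these variants (for example, weighted versus unweighted means) is an assumption here, and the function names and default arguments are hypothetical.

```python
def rosenthal_failsafe(z_scores, z_alpha=1.645):
    """Number of zero-effect studies needed to make the combined test
    non-significant; z_alpha = 1.645 assumes a one-tailed alpha of 0.05."""
    k = len(z_scores)
    return (sum(z_scores) ** 2) / (z_alpha ** 2) - k


def orwin_failsafe(mean_effect, k, min_effect=0.2, missing_effect=0.0):
    """Number of studies with mean effect `missing_effect` needed to pull
    the observed mean effect of k studies down to `min_effect`."""
    return k * (mean_effect - min_effect) / (min_effect - missing_effect)


def cumulative_means(effects, variances):
    """Running inverse-variance weighted mean, with studies supplied in
    chronological order."""
    running = []
    sum_w = 0.0
    sum_we = 0.0
    for e, v in zip(effects, variances):
        w = 1.0 / v
        sum_w += w
        sum_we += w * e
        running.append(sum_we / sum_w)
    return running


# Orwin's number for k = 47 estimators and the reported mean effect size.
n_orwin = orwin_failsafe(0.746, 47)
print(f"Orwin fail-safe number = {n_orwin:.1f}")
```

Under these assumptions, 47 estimators with a mean effect size of 0.746 give an Orwin number of about 128, in line with the value reported in the Results. The per-study standard normal deviates needed by `rosenthal_failsafe` are not given in the text, so no attempt is made to reproduce the Rosenthal value.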
Results
Although the reviewed studies were remarkably uneven regarding organism type (insects: 7 cases, 16.7%; mammals: 25 cases, 59.5%; birds: 7 cases, 16.7%; reptiles: 3 cases, 7.1%; see Table 1) and metabolic rate type (SMR: 7 cases, 16.7%; MMR: 20 cases, 47.6%; BMR/RMR: 14 cases, 33.3%; FMR: 1 case, 2.4%; see Table 1), the use of the Pearson product–moment correlation and the intraclass correlation appears balanced (Pearson product–moment correlation: 23 cases, 54.8%; intraclass correlation: 19 cases, 45.2%; see Table 1). Published repeatabilities were leftward biased (Fig. 1), but did not significantly depart from normality (d=0.124; P=0.2; Kolmogorov–Smirnov test), with mean=0.57 and s.d.=0.26 (Fig. 1). The funnel plot showed a non-significant correlation between sample size and the magnitude of the repeatability estimate (R=0.164; P=0.27; Fig. 2). The mean effect size (E) was significantly different from zero (E=0.746; bootstrap 95% CI: 0.626–0.873) and heterogeneity was non-significant (Q_{total}=32.71; d.f.=46; P=0.96), which suggests that the data do not have an underlying structure. Also, there was no effect of time between measurements on the effect size of repeatability (Q=33.5; d.f.=39; P=0.72). Although this result renders it unnecessary to test the effect of the categorical factors that we were interested in, we performed these tests and, as expected, found non-significant effects of all categories (data not shown). In other words, our meta-analysis does not provide enough evidence to reject the hypotheses that repeatability of MR does not change with type of animal, metabolic measurement, time between measurements and laboratory or wild populations. However, the fact that only one study on FMR was found does not permit any conclusion to be made regarding the potential differences between this variable and the other categories. The cumulative meta-analysis showed that the asymptotic effect size was approximately attained between cases 26 and 30 (Fig. 3, see also Table 1). Fail-safe numbers were high (Rosenthal=1442.7; Orwin=128.4) which, together with the funnel plot (Fig. 2), suggests that publication bias, if it existed, would have had a negligible effect on the magnitude of repeatability.
Discussion
To the best of our knowledge, this is the first synthetic analysis of the repeatability of whole-animal metabolism. The principal outcome of our meta-analysis is that whole-animal metabolism is significantly repeatable, with a magnitude that fluctuates between 0.60 and 0.80. Also, we found that repeatability, measured by either the Pearson product–moment correlation or the intraclass correlation coefficient, does not change. According to our results, repeatability is not affected by different types of animals, or by differences between laboratory and wild populations. However, the possibility that biologically meaningful effects existed but the analyses lacked sufficient statistical power to detect them is always present. Finally, the cumulative meta-analysis showed that repeatability studies have reached an asymptotic state, with further studies unlikely to improve either the precision or the magnitude of the estimator.
Are repeatability studies useful?
According to Falconer and Mackay [(Falconer and Mackay, 1997) p. 136], repeatability or the intraclass correlation coefficient (τ) has four main uses (in this order): (1) to show how much is to be gained by the repetition of measurements, (2) to set the upper limit of the ratios σ_{G}/σ_{P} or σ_{A}/σ_{P}, (3) to predict the future performance from past records, and (4) to shed light on the nature of the environmental variance. For evolutionary purposes, statements (2) and (4) are the most important and have attracted the interest of several organismal/evolutionary biologists over recent decades (Lessels and Boag, 1987; Bennett, 1987; Hayes and Jenkins, 1997) (see also references therein). However, the second statement, or the capacity of repeatability to estimate the upper bound of heritability, appears to be the most attractive application of repeatability since it allows assessment of the response to selection in a trait [as with the possibility of phenotypic correlations being good estimators of genetic correlations (see Cheverud, 1988)]. However, at least regarding physiological traits in animals, Hayes and Jenkins (Hayes and Jenkins, 1997) toned down this assertion, indicating that repeatability has some utility as a preliminary screening tool to determine whether some more detailed genetic analyses are warranted. These authors, and subsequent ones, pointed out that the capacity of repeatability to predict the upper bound of heritability is rather unrealistic in natural populations (Hayes and Jenkins, 1997; Dohm, 2002; Konarzewski et al., 2005). In fact, the reviewed literature permits qualitative evaluation of the predictive power of repeatability from studies on the same traits and organisms. For instance, Chappell et al. (Chappell et al., 1995) computed the (long-term) repeatability of thermoregulatory MMR in Belding's ground squirrels. According to their estimation (Table 1), the repeatability of MMR was non-significant and its magnitude was 0.38, which means that heritability of thermoregulatory MMR should not surpass ∼0.40. However, Nespolo et al. (Nespolo et al., 2005) computed a significant narrow-sense heritability of thermoregulatory MMR of 0.69 in the leaf-eared mouse. Similarly, in house mice, fairly high repeatabilities for BMR (over 0.70, see Table 1) have been reported (Hayes et al., 1992; Ksiazek et al., 2004), but this trait appears to exhibit very low (non-significant) additive genetic variance in the leaf-eared mouse (Nespolo et al., 2003a; Nespolo et al., 2005). Recently, using large sample sizes (see Table 1), a near-zero repeatability in BMR was reported in the deer mouse (Russell and Chappell, 2006), but a heritability of BMR equal to 0.40 was computed in the bank vole (Sadowska et al., 2005). Repeatability thus appears inconsistent in its capacity to predict the upper bound of heritability. The possibility remains, however, that different procedures, species or even manipulations could have yielded qualitatively different repeatability and/or heritability estimations. To date, we have found only two studies where both repeatability and narrow-sense heritability were computed for exactly the same metabolic rate, animal (bank voles) and experimental conditions (Labocha et al., 2004; Sadowska et al., 2005). These authors used a multi-generation quantitative genetic design in which several metabolic rates were measured in thousands of individuals. They reported a mean repeatability (across generations) of BMR of 0.50 (see Labocha et al., 2004; Sadowska et al., 2005) and a narrow-sense heritability of this trait of 0.40 (Sadowska et al., 2005); a repeatability of thermoregulatory MMR of 0.45 and a narrow-sense heritability of this trait of 0.43; and a repeatability of swim-induced MMR (a proxy of locomotory MMR) of 0.50 and a narrow-sense heritability of this trait of 0.40. Thus, in these cases repeatability was a good predictor of the upper bound of heritability since the former was consistently greater than the latter in all traits. These examples only confirm, however, that the conditions for repeatability to be the upper limit of heritability are fairly restrictive (Dohm, 2002).
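Two of the uses of τ discussed above lend themselves to a short numerical sketch. The first part checks the upper-bound relationship using the bank vole values quoted in the text. The second part uses the standard quantitative-genetic expression for the variance of the mean of n measurements relative to a single measurement, (1 + (n – 1)τ)/n, to illustrate the gain from repeating measurements. Function and variable names are hypothetical.

```python
# (repeatability, narrow-sense heritability) for bank voles, as quoted
# in the text.
bank_vole = {
    "BMR": (0.50, 0.40),
    "thermoregulatory MMR": (0.45, 0.43),
    "swim-induced MMR": (0.50, 0.40),
}


def respects_upper_bound(repeatability, heritability):
    """True when heritability does not exceed repeatability."""
    return heritability <= repeatability


def variance_ratio_of_mean(tau, n):
    """Phenotypic variance of the mean of n measurements, relative to the
    variance of a single measurement, for a trait with repeatability tau."""
    return (1.0 + (n - 1) * tau) / n


for trait, (rep, h2) in bank_vole.items():
    print(f"{trait}: tau = {rep}, h2 = {h2}, "
          f"bound respected = {respects_upper_bound(rep, h2)}")

# Gain from repetition for a trait with tau = 0.5 (an illustrative value).
for n in (1, 2, 4, 8):
    print(f"n = {n}: relative variance = {variance_ratio_of_mean(0.5, n):.3f}")
```

The relative variance approaches τ as n grows, so for a highly repeatable trait little is gained from additional measurements, whereas for a trait with low repeatability each extra measurement removes a larger share of the residual variance.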
Our results suggest that repeatability of MR is remarkably homogeneous. From this fact, together with our discussion about the operational definition of traits, environmental variance and the inherent uncertainty of instruments for MR measurements (see also Konarzewski et al., 2005), we support the conclusion of previous authors that the main contribution of repeatability studies is the determination of environmental variance in MR. Hence, the homogeneous results we found in MR repeatabilities would be a consequence of the homogeneity in the method for MR measurements. A corollary of this assertion is that probably, as the technique improves, energy metabolism will exhibit progressively higher repeatabilities (although the cumulative meta-analysis suggested that this has not happened so far). In biological terms, however, metabolic rate could be considered a repeatable trait.
On the error measurement in MR records
Given that MR is a consequence of an unmeasured variable known as energy metabolism, the error in its measurement, as discussed above, is inherent to the instrument used. Several authors have recognized this problem and some alternative methods have been proposed to determine energy metabolism with more precision. One of them is the calculation of latent variables in multivariate statistical analyses such as structural equation modeling (Hayes and Shonkwiler, 1996), where different measurable consequences of energy metabolism (e.g. oxygen consumption, CO_{2} production, heat production, food consumption) can be measured, and a 'latent' variable could be constructed from the resulting covariance structure. In a similar fashion, repeatability can be treated with factor analysis, a related statistical method that considers each repeated measurement of MR as an observable indicator of an underlying true factor or latent variable (Hayes and Jenkins, 1997). It is very surprising to find that few authors have applied such comprehensive quantitative approaches in further studies of MR repeatability. It would appear that physiological ecologists have avoided using these approaches to improve the precision of MR measurements, much as they have persisted in using mass-specific MR units despite the fact that many authors have shown how misleading it is to use them as a body mass standardization (see Hayes, 1996; Christians, 1999; Packard and Boardman, 1999; Hayes, 2001).
In summary, our analysis provides synthetic evidence to suggest that whole-animal metabolic rate is repeatable and calls for new directions in order to determine precisely the sources of this inter-individual variation in energy metabolism. We feel that organismal biologists have not fully recognized the wide possibilities of quantitative methods such as meta-analysis. Meta-analytic procedures could be applied not only to physiological traits but also to biomechanics, life history evolution, behavior and any field where sufficient published information has accumulated around specific questions or hypotheses.
ACKNOWLEDGEMENTS
We thank Project `Anillos' Bicentenario de Ciencia y Tecnologia, ACT38 and a PhD CONICYT Fellowship to Marcela Franco. We also thank two anonymous reviewers.
 © The Company of Biologists Limited 2007