## ABSTRACT

While mounting evidence indicates that a phylogenetically diverse group of animals detect Earth-strength magnetic fields, a magnetoreceptor has not been identified in any animal. One possible reason that identifying a magnetoreceptor has proven challenging is that, like many research fields, magnetoreception research lacks extensive independent replication. Independent replication is important because a subset of studies undoubtedly contain false positive results and, without replication, it is difficult to determine whether the outcome of an experiment is a false positive. Here, we report a reanalysis of a well-cited paper on honeybee magnetoreception demonstrating that the original paper represented a false positive finding caused by incorrect estimates of probability. We also point out how good experimental design practices could have revealed the error prior to publication. Hopefully, this reanalysis will serve as a reminder of the importance of good experimental design in order to reduce the likelihood of publishing false positive results.

## INTRODUCTION

Despite evidence that magnetoreception is widespread in the animal kingdom, a magnetoreceptor and accompanying underlying neural circuitry have yet to be identified in any animal (Shaw et al., 2015; Clites and Pierce, 2017; Nordmann et al., 2017). Many reasons have been cited for why magnetoreceptors have proven elusive, including the absence of large, obvious magnetoreceptive organs and the possibility that magnetoreceptors could exist anywhere within an animal (Johnsen and Lohmann, 2005; Shaw et al., 2015). However, the field has also likely been hindered by the inevitability that a subset of the published findings on magnetoreception are false positives. In addition to published examples of failures to replicate specific magnetoreception studies (Klotz et al., 1997; Hert et al., 2011; Landler et al., 2018), there is increasing evidence that false positives are a widespread problem in published research (Ioannidis, 2005; Collins and Tabak, 2014; Open Science Collaboration, 2015). While replication can eventually lead to the identification of false positives, poor experimental design and misunderstanding of statistical analyses facilitate the publication of false positives, which can then cause other researchers to waste resources (Simmons et al., 2011). Recently, we discovered that the conclusions of Kirschvink et al. (1997), a well-cited article on honeybee (*Apis mellifera carnica*, Pollman 1879) magnetoreception, were based on incorrect data analyses.

According to the Journal of Experimental Biology website (http://jeb.biologists.org/content/200/9/1363.article-info; accessed 5 September 2018), Kirschvink et al. (1997) has been cited in the scientific literature over 40 times, or about twice per year since it was published. The article is still being cited as evidence for magnetoreception in bees (e.g. Prato et al., 2013; Ferrari, 2014; Pereira-Bomfim et al., 2015; Liang et al., 2016; Lambinet et al., 2017; Kong et al., 2018). However, the number of citations understates the impact of Kirschvink et al. (1997). The article abstract on the Journal of Experimental Biology website has been accessed over 2200 times since 2001, including 193 times in the first 8 months of 2018, while the full-text PDF has been accessed over 3200 times since 2001, including 158 times in the first 8 months of 2018.

Unfortunately, the positive findings of Kirschvink et al. (1997) rely solely on incorrect estimates of probability. The authors trained bees to use a magnetic field to distinguish between a positive reward (sucrose) and a negative reward (electric shock). Once the bees learned to associate the magnetic field cue with a positive reward, the authors reduced the magnetic field strength and allowed the bees to learn to associate the weaker stimulus with the food reward. If the bees succeeded in learning the new association, the magnetic field strength was reduced again. This process was continued until the bees could no longer learn to associate the magnetic field and the positive stimulus, presumably because they could no longer detect the magnetic field.

The error the authors made was in their criteria for determining whether or not the bees had learned to associate the magnetic field with the positive stimulus. The bees were given approximately 80 trials to make either six consecutive correct decisions or at least seven of eight correct decisions (i.e. seven of eight or eight of eight). The authors stated that there was only a 1.6% chance and a 3.5% chance, respectively, that bees would achieve these levels of success randomly. However, the authors failed to consider that over the course of 80 trials the bees had up to 75 opportunities to reach one of the learning criteria. The actual probability of reaching a learning criterion over the course of 80 trials if the bees were choosing targets randomly was approximately 66.5%. We were able to produce similar results to Kirschvink et al. (1997) using a random number generator, thereby demonstrating the fundamental flaw in the experimental design. Hopefully this example will encourage other researchers to consider their experimental design carefully before embarking on experiments.
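As an illustrative aside, the single-attempt figures and the effect of repeated opportunities can be checked with a short exact calculation. The sketch below is in Python rather than the R and Matlab used for the analyses reported here. The function names are hypothetical, and the sliding-window reading of the two criteria (any six consecutive trials all correct, or any eight consecutive trials with at least seven correct) is an assumption based on the description above.

```python
from fractions import Fraction
from math import comb


def single_attempt_probabilities():
    """Chance of meeting each criterion in one fixed window of trials."""
    p_six_of_six = Fraction(1, 2) ** 6                        # 1/64
    p_seven_of_eight = Fraction(comb(8, 7) + comb(8, 8), 2 ** 8)  # 9/256
    return p_six_of_six, p_seven_of_eight


def meets_criterion(window):
    """window holds the most recent outcomes (1 = correct), at most 8 long."""
    six_in_a_row = len(window) >= 6 and all(window[-6:])
    seven_of_eight = len(window) == 8 and sum(window) >= 7
    return six_in_a_row or seven_of_eight


def prob_reach_within(n_trials, p_correct=0.5):
    """Exact chance that a random chooser meets a criterion within n_trials.

    Tracks the probability of every 'not yet successful' history, keyed by
    the last seven outcomes, and moves probability mass to `reached` as soon
    as a criterion is met.
    """
    alive = {(): 1.0}
    reached = 0.0
    for _ in range(n_trials):
        next_alive = {}
        for history, prob in alive.items():
            for outcome, p in ((1, p_correct), (0, 1.0 - p_correct)):
                window = (history + (outcome,))[-8:]
                if meets_criterion(window):
                    reached += prob * p
                else:
                    key = window[-7:]
                    next_alive[key] = next_alive.get(key, 0.0) + prob * p
        alive = next_alive
    return reached
```

Under these assumptions, `single_attempt_probabilities()` returns 1/64 and 9/256 (about 1.6% and 3.5%), and `prob_reach_within(80)` should land close to the approximately 66.5% obtained by simulation.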

## MATERIALS AND METHODS

The probability of bees reaching a criterion (six correct choices in a row or seven out of eight choices) was determined using a random number generator in R (https://www.R-project.org/). We used the function *rbinom* to create one-million 80-trial blocks of zeros and ones, then counted how many of the 80-trial blocks contained a sequence where the number one occurred in six of six trials or in seven of eight trials. We performed 10 replicates and found that randomly behaving ‘bees’ reached one of the established criteria (6 of 6 or 7 of 8) in 66.5±0.1% (mean±s.d.) of 80-trial blocks.
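A minimal sketch of this kind of simulation, written in Python rather than R, is given below. It is not the script used for the analysis: the function names, the default of 20,000 blocks (chosen for speed, far fewer than the one million used here) and the seed are all assumptions made for illustration. Python's `random` module also uses the Mersenne Twister generator.

```python
import random


def block_reaches_criterion(block):
    """True if any 6 consecutive trials are all 1, or any 8 consecutive hold >= 7 ones."""
    n = len(block)
    for end in range(6, n + 1):
        if sum(block[end - 6:end]) == 6:
            return True
    for end in range(8, n + 1):
        if sum(block[end - 8:end]) >= 7:
            return True
    return False


def estimate_chance_rate(n_blocks=20000, n_trials=80, seed=1):
    """Fraction of random 0/1 blocks that contain a run meeting either criterion."""
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    hits = 0
    for _ in range(n_blocks):
        block = [rng.getrandbits(1) for _ in range(n_trials)]
        hits += block_reaches_criterion(block)
    return hits / n_blocks
```

Calling `estimate_chance_rate()` gives one estimate; repeating it with different seeds and larger `n_blocks` would mirror the replicate structure described above.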

To confirm these results, we also determined the probability of bees reaching a criterion using numerical experiments carried out in Matlab R2017a (MathWorks, Natick, MA, USA). Eighty-trial blocks were modeled by arrays of 80 random numbers generated using the built-in functions *rand* and *randn*. The numbers generated by *rand* are uniformly distributed on the interval (0, 1). If the number was greater than 0.5, we set the value to one, otherwise we set the value to zero. Similarly, the numbers generated by *randn* are drawn from the standard normal distribution (mean of zero, standard deviation of one); we set the value to one if it was greater than zero, otherwise we set the value to zero. The arrays were then analyzed to determine whether ones occurred in six of six trials or in seven of eight trials within each 80-trial block.

All random numbers in the Matlab simulations were generated using the Mersenne Twister random number generator. This random number generator is widely used and is also the default random number generator in R. We used the default seed and a controlled seed for comparison. Using *rand* and performing 500,000 80-trial blocks, at least one of the learning criteria was reached in 66.3% of the blocks using the controlled seed and in 66.5% of the blocks using the default seed (Table S1). Using *randn* and performing 500,000 80-trial blocks, a criterion was reached in 66.5% of the blocks using the controlled seed and in 66.4% of the blocks using the default seed.

To reanalyze the data from Kirschvink et al. (1997), we determined the proportion of bees that reached one of the criteria at each magnetic field intensity by drawing a horizontal gridline through the appropriate data point in fig. 3 from Kirschvink et al. (1997) and identifying the *y*-intercept (Table 1; Fig. S1). Gridlines were drawn using Adobe Illustrator CS 5 (Adobe Systems Incorporated, San Jose, CA, USA). Kirschvink et al. (1997) performed two experiments: one experiment tested the ability of 15 bees to learn to associate a 10 Hz AC magnetic field with a sucrose reward, and the second experiment tested the ability of 11 bees to learn to associate a 60 Hz AC magnetic field with a sucrose reward. The proportions of successful bees as reported in Kirschvink et al. (1997) did not align exactly with the proportions that would be produced using 15 or 11 bees; therefore, our calculations for the number of successful bees were rounded to the nearest integer. For the experiment with 15 bees exposed to 10 Hz AC magnetic fields, the data point for 1300 μT fell almost exactly between two possible proportions, so the data were analyzed using both possible values for that particular data point.

Using an expected probability of 66.5%, we performed exact multinomial tests for both sets of data from Kirschvink et al. (1997) using the *XNomial* package in R (https://CRAN.R-project.org/package=XNomial). If 66.5% of bees randomly reach a learning criterion, 33.5% of bees would be expected to not reach a learning criterion for the first magnetic field stimulus. Of the 66.5% of bees that reached a criterion for the first magnetic field stimulus, 66.5% would be expected to reach a criterion for the second association, etc. It should be noted that in the original experiment by Kirschvink et al. (1997), bees were given approximately 80 attempts, and at least one trial was terminated early because a bee failed to return after 19 trials.
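The expected distribution described above, and the general idea of an exact multinomial test, can be sketched as follows. This is a Python illustration with hypothetical function names, not a reimplementation of *XNomial*: it ranks outcomes by their multinomial probability, which is only one of several conventions for an exact test, so its P values need not match those reported here.

```python
from math import factorial, prod


def expected_category_probs(p_chance, n_levels):
    """Chance that a bee's last success is at level 0 (none), 1, ..., n_levels."""
    probs = [(p_chance ** k) * (1 - p_chance) for k in range(n_levels)]
    probs.append(p_chance ** n_levels)  # passed every level tested
    return probs


def multinomial_pmf(counts, probs):
    """Probability of one exact table of counts under the given category probabilities."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p ** c for p, c in zip(probs, counts))


def compositions(total, parts):
    """Yield every way to split `total` bees across `parts` categories."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def exact_multinomial_p(observed, probs):
    """Sum the probabilities of all tables no more likely than the observed one."""
    p_obs = multinomial_pmf(observed, probs)
    cutoff = p_obs * (1 + 1e-9)  # small tolerance for floating-point ties
    return sum(
        pm
        for table in compositions(sum(observed), len(observed))
        if (pm := multinomial_pmf(table, probs)) <= cutoff
    )
```

With four field strengths, `expected_category_probs(0.665, 4)` gives five probabilities that sum to one, starting with 0.335 for bees that fail at the first stimulus.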

We also performed 15 simulations of individual bees using a random number generator in R. Mimicking the protocol in Kirschvink et al. (1997), we gave simulated bees up to 80 trials to make six of six or seven of eight correct choices; if a simulated bee reached one of the criteria, we concluded that the ‘bee’ had learned. The simulated bee then began a new set of trials and had up to an additional 80 trials to reach a criterion. This process continued until the simulated bee failed to reach one of the learning criteria.
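The protocol for a single simulated bee can be sketched as below. This is an illustrative Python version, not the R code used for the simulations; the function names and the seed are assumptions.

```python
import random


def trials_to_criterion(rng, max_trials=80):
    """Trial number at which a random chooser first meets a criterion, else None."""
    history = []
    for trial in range(1, max_trials + 1):
        history.append(rng.getrandbits(1))  # 1 = correct choice, 0 = incorrect
        if len(history) >= 6 and sum(history[-6:]) == 6:
            return trial
        if len(history) >= 8 and sum(history[-8:]) >= 7:
            return trial
    return None


def levels_passed(rng, max_trials=80):
    """Count consecutive trial sets in which the simulated bee 'learns' before failing."""
    passed = 0
    while trials_to_criterion(rng, max_trials) is not None:
        passed += 1  # criterion met: start a fresh set of up to max_trials
    return passed


def simulate_bees(n_bees=15, seed=1):
    """Number of levels passed by each of n_bees independent random choosers."""
    rng = random.Random(seed)
    return [levels_passed(rng) for _ in range(n_bees)]
```

Each entry returned by `simulate_bees()` is the number of successive sets a random chooser completed before its first failure, which corresponds to the weakest stimulus it would have been scored as detecting.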

Kirschvink et al. (1997) based their experimental design on that of Walker and Bitterman (1989). Walker and Bitterman (1989), however, used a DC magnetic field and, for most trials, allowed the bees 32 attempts to reach the learning criteria. We performed exact multinomial tests for the data from Walker and Bitterman (1989) using a predicted probability of success of 32.7%, which was determined using 10 random number simulations of one-million 32-trial blocks in R (32.7±0.04%; mean±s.d.).

In the Kirschvink et al. (1997) and Walker and Bitterman (1989) experiments, any bee that reached a criterion was tested again under a weaker magnetic field. Therefore, the data points for each magnetic field strength are not independent and represent pseudoreplication. For example, in their experiment with a 60 Hz AC magnetic field, Kirschvink et al. (1997) performed 25 trials with 11 bees; bees succeeded in reaching the learning criteria in 14 of those trials. To avoid the problem of non-independence, for the exact multinomial tests where we compared the original experimental data with the predicted random distribution, we only used the lowest field strength each bee was reported to detect. For example, using 11 bees and a 60 Hz AC magnetic field, Kirschvink et al. (1997) reported that 4 bees failed to reach a criterion, 7 succeeded when tested with a 1300 μT magnetic field, 4 succeeded at 430 μT, 3 succeeded at 130 μT and 0 succeeded at 43 μT (Table 1). For our analysis, we used the following values: 4 bees failed to reach a criterion, 3 bees did not succeed below 1300 μT, 1 bee did not succeed below 430 μT, 3 bees did not succeed below 130 μT and 0 bees succeeded at 43 μT (Table S2).
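The conversion from the reported cumulative successes to one independent category per bee is simple arithmetic, sketched here in Python with a hypothetical function name.

```python
def lowest_level_counts(n_bees, successes_by_level):
    """Assign each bee to the weakest field at which it succeeded.

    successes_by_level lists, from strongest to weakest field, how many bees
    reached a criterion at that field. The result has one extra leading entry
    for bees that never succeeded, and its last entry is the number that
    succeeded at the weakest field tested.
    """
    counts = [n_bees - successes_by_level[0]]  # failed at the first field
    for current, following in zip(successes_by_level, successes_by_level[1:]):
        counts.append(current - following)     # succeeded here, failed at the next
    counts.append(successes_by_level[-1])      # succeeded at the weakest field
    return counts


# 60 Hz example from the text: 11 bees; 7, 4, 3 and 0 successes
# at 1300, 430, 130 and 43 microtesla.
print(lowest_level_counts(11, [7, 4, 3, 0]))  # [4, 3, 1, 3, 0]
```

The counts always sum to the number of bees, so each animal contributes exactly one observation to the test.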

For our experiment with 15 simulated bees, some of the magnetic field categories had values of zero (e.g. all 5 simulated bees that reached a criterion at 43 μT also reached a criterion at 13 μT). Because exact multinomial tests cannot be performed with expected values of zero, we chose to use all available data points for our statistical analysis, even though the data points were not independent.

The recreated data from Kirschvink et al. (1997) and Walker and Bitterman (1989) with the results of the statistical analyses, as well as simulated data sets are available at https://figshare.com/s/87a60291069dddb35910.

## RESULTS AND DISCUSSION

Kirschvink et al. (1997) stated that the probability of bees reaching the learning criterion of six correct choices in a row was 1.6%, and the probability of making seven or eight correct choices out of eight attempts was 3.5%. However, because the bees were given multiple attempts to reach the criteria, the probability of reaching the criteria was much higher (Fig. 1A). In fact, the probability of a bee reaching the given criteria randomly was greater than 5% after only nine trials. We found that the data from the Kirschvink et al. (1997) experiment with bees exposed to 10 Hz AC magnetic fields were not significantly different from the expected outcome if the bees were choosing targets randomly, regardless of whether 10 or 11 bees succeeded at the first magnetic field strength (exact multinomial test: 10 bees, *P*=0.17; 11 bees, *P*=0.23; Fig. 1B, Table 1). Likewise, the data from the Kirschvink et al. (1997) experiment with bees exposed to 60 Hz AC magnetic fields were not significantly different from the expected outcome if the bees were choosing targets randomly (exact multinomial test: *P*=0.17).
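The early rise in the chance probability can be checked by brute force, because short blocks have few possible outcomes. The Python sketch below enumerates every equally likely sequence of correct and incorrect choices; the function names are hypothetical, and the criteria are read as sliding windows over the block.

```python
from itertools import product


def block_reaches_criterion(block):
    """True if any 6 consecutive trials are all 1, or any 8 consecutive hold >= 7 ones."""
    n = len(block)
    for end in range(6, n + 1):
        if sum(block[end - 6:end]) == 6:
            return True
    for end in range(8, n + 1):
        if sum(block[end - 8:end]) >= 7:
            return True
    return False


def exact_probability(n_trials):
    """Share of all 2**n_trials random-choice sequences that meet a criterion."""
    hits = sum(
        block_reaches_criterion(block)
        for block in product((0, 1), repeat=n_trials)
    )
    return hits / 2 ** n_trials


def first_trial_exceeding(threshold=0.05, max_trials=12):
    """Smallest number of trials at which the chance probability exceeds threshold."""
    for n in range(6, max_trials + 1):
        if exact_probability(n) > threshold:
            return n
    return None
```

By this enumeration the chance probability is 12/256 (about 4.7%) within eight trials and 31/512 (about 6.1%) within nine, so `first_trial_exceeding()` returns 9.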

The results of our random number simulation with 15 ‘bees’ were also not significantly different from the data of Kirschvink et al. (1997) (exact multinomial test: 10 bees, *P*=0.96; 11 bees, *P*=0.98). An example ‘bee’ from our simulation is shown alongside a recreation of fig. 2 from Kirschvink et al. (1997) for comparison (Fig. 2A,B). For simulated bees that reached a criterion, the average number of trials it took to reach the criterion was 28.6±15.4 (mean±s.d.). Some of our simulated bees showed a pattern of results that looked like they were learning, while others showed a pattern that looked like they were making random choices (Fig. 2C).

The reasonable conclusion to make from the results published in Kirschvink et al. (1997) is that the bees did not learn to associate either 10 Hz AC or 60 Hz AC magnetic fields with a positive reward. Kirschvink et al. (1997) should no longer be cited as evidence that bees can detect magnetic fields.

In addition to the incorrect estimate of probability, there were several other experimental design concerns in Kirschvink et al. (1997). First, because bees were removed from testing as soon as they failed to reach a criterion, each subsequent field strength was presented to a smaller number of bees. As a result, the proportion of bees that succeeded inevitably continued to decrease, thereby creating the appearance of a dose–response curve even if the success of a given bee occurred simply by random chance. All possible experimental outcomes, other than 0% success or 100% success, would have created the appearance that the response of the bees decreased as the magnetic field intensity decreased.
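The apparent dose–response curve follows directly from the removal rule, as the short Python sketch below illustrates. The function names are hypothetical, and the choice of four field strengths is only an example.

```python
def expected_proportion_by_level(p_chance, n_levels):
    """Expected share of the starting cohort passing each successive level by chance alone."""
    return [p_chance ** k for k in range(1, n_levels + 1)]


def proportions_from_counts(n_bees, passed_by_level):
    """Proportion of the starting cohort passing each level, as plotted in a threshold curve."""
    return [passed / n_bees for passed in passed_by_level]


# With a 66.5% chance of passing any one level, the expected curve falls
# at every step even though no stimulus is being detected.
curve = expected_proportion_by_level(0.665, 4)
```

Because only bees that passed one level are tested at the next, the counts fed to `proportions_from_counts` can never increase from one field strength to the next, whatever the bees are actually doing.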

A second problem with the experimental design was that although Kirschvink et al. (1997) did not know which target was the reward until after a given trial, they were aware of the outcome of a trial prior to the initiation of the next trial and prior to data analysis; therefore, the experiments were not performed blind. As a result of this design, the researchers stopped any given 80-trial block early if the expected outcome was observed, but allowed the experiment to continue if the expected outcome was not observed. The researchers should have continued the experiment through 80 trials and then examined whether or not the bees continued to perform above random chance once they had reached the learning criteria.

An additional problem with the experimental design was that there were no experimental controls to determine how bees behaved in the absence of a magnetic field. The use of experimental controls would have likely revealed the incorrect estimation of probability. Finally, because the magnetic field stimuli were not presented in a random order, the magnetic field effects and temporal effects were confounded. Proper randomization would have shown that the bees were randomly choosing targets rather than learning a discrimination task.

Kirschvink et al. (1997) was an important paper because it was a rare example of the independent replication of a previous experiment that demonstrated magnetoreception in bees (Walker and Bitterman, 1989; Vácha and Soukopová, 2004). There were two primary differences between the experimental designs described in Walker and Bitterman (1989) and Kirschvink et al. (1997). Walker and Bitterman (1989) only allowed the bees approximately 32 trials to reach one of the learning criteria. If bees were choosing randomly, a criterion would be expected to be reached in 33% of trials. The other difference was that Kirschvink et al. (1997) used an automated delivery system so that the reward was not made available until after the bees made a choice, whereas in Walker and Bitterman (1989), the bees could have potentially smelled the difference between the positive and negative reward. While Walker and Bitterman (1989) made the same incorrect estimates of probability, and had similar experimental design flaws, our reanalysis of the Walker and Bitterman (1989) data found that their results were significantly different from the predicted distribution if bees were choosing targets randomly (Table S2; exact multinomial test: *P*<0.000001). Based on the median performance of their bees, Walker and Bitterman (1989) stated that the threshold of magnetic field intensity detection was 260 nT; however, because of the above concerns regarding experimental design, this conclusion should be reconsidered. To our knowledge, this particular protocol has not been used in other studies of magnetoreception in bees.

Experimental design problems are not uncommon in biological research (Holman et al., 2015). Concerns about the reproducibility of results have also gained significant attention in recent years, particularly in medical research and in psychology research (Begley and Ellis, 2012; Open Science Collaboration, 2015; Baker, 2016; Johnson et al., 2017). False positives cannot be eliminated, but they can be reduced by proper experimental design including randomization, *a priori* determination of statistical analyses to be performed, blind data collection and independent evidence of an error before an outlier is discarded (Festing and Altman, 2002; van Wilgenburg and Elgar, 2013; Holman et al., 2015; Curtis et al., 2015). While the example we presented here is from the field of magnetoreception research, we hope it will serve as a valuable reminder for all experimental biologists to both carefully consider their own experimental design and critically evaluate the methods within published research studies.

## Acknowledgements

We thank P. Aldrich and two anonymous reviewers for providing valuable comments on the manuscript.

## FOOTNOTES

**Competing interests** The authors declare no competing or financial interests.

**Author contributions** Conceptualization: M.J.B., M.W.N.; Methodology: M.J.B., M.W.N.; Software: M.J.B., M.W.N.; Validation: M.W.N.; Formal analysis: M.J.B.; Data curation: M.J.B.; Writing - original draft: M.J.B.; Writing - review & editing: M.J.B., M.W.N.; Visualization: M.J.B., M.W.N.

**Funding** This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

**Data availability** Data are available from the figshare repository: https://figshare.com/s/87a60291069dddb35910

**Supplementary information** Supplementary information available online at http://jeb.biologists.org/lookup/doi/10.1242/jeb.185454.supplemental

- Received May 28, 2018.
- Accepted September 24, 2018.

- © 2018. Published by The Company of Biologists Ltd