Multi-Center Trial of a Standardized Battery of Tests of Mouse Behavior

Supplemental Information Regarding the Statistical Strategies in the Science Paper

Three sections are available:

1. Notes on Statistical Analysis.

2. Issues with results of the Rotarod Coordination test.

3. Comments on Multiple Comparison Strategies and Multiple Inferences: Response to a Reviewer.


Notes on data analysis

1. Most of the analyses were done with a complete factorial ANOVA of the various dependent measures in Table 1 and should be easily repeated by interested colleagues from the raw data available through the web site (a sketch of such an analysis is given after these notes). We analysed many other measures that we considered to be of less importance, but of course some people will care about certain of these more than the variables we chose to report. We welcome further analyses, knowing full well there are many undiscovered morsels in this large data set.

2. One exception to the above point was the analysis of the replication effect. All data collection in Albany and Portland was completed with two replications, whereas in Edmonton there were empty cells in the first two replications that were filled by 6 mice in a third, very small "replication." For purposes of analysing replication effects, these 6 mice were simply pooled with those in replication 2 (see the second sketch after these notes). Inspection of their values suggested they were not markedly deviant from mice in replication 2, although sample sizes were very small.

3. In the data set for the water escape task, three problems arose that persuaded us to depart from the usual analysis. First, the A/J strain was an extreme outlier that never learned and very often went the full 40 sec limit without escaping. The reason for this was clearly an exceptional degree of wall hugging or thigmotaxis. Consequently, our reported analysis of water escape learning was done after excluding A/J from the data set. Second, for all other mice there was a clear mean-variance correlation in the data, as expected for a latency measure. This was addressed with a square root transformation of the raw latencies. Third, the mice generally did not improve significantly after the third escape trial, so our analysis was based only on the first four trials. This reduced the sensitivity of the statistical tests by reducing the reliability of the latency measure. However, it made the results more closely comparable to other tests where behavior was not yet close to asymptotic. The third sketch after these notes illustrates these preprocessing steps.
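For item 1, the following is a minimal sketch of how such a complete factorial ANOVA could be rerun from the downloaded raw data. The file name and column names ("strain", "site", "sex", and the dependent measure "activity") are placeholders for illustration, not the actual variable names in the data files.

```python
# A minimal sketch of a complete factorial ANOVA on one dependent measure.
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("raw_data.csv")  # hypothetical file name

# Complete factorial model: all main effects and all interactions.
model = ols("activity ~ C(strain) * C(site) * C(sex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```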
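For item 2, a sketch of the pooling step, again with hypothetical file and column names ("site", "replication"):

```python
# Fold the six Edmonton mice from the small third "replication" into
# replication 2 before analysing replication effects.
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical file name
extra = (df["site"] == "Edmonton") & (df["replication"] == 3)
df.loc[extra, "replication"] = 2
```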
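For item 3, a sketch of the water escape preprocessing, assuming a hypothetical long-format file with one row per mouse per trial and columns "strain", "trial", and "latency" (seconds):

```python
# Exclude A/J, keep only the first four escape trials, and square-root
# transform the latencies before running the ANOVA.
import numpy as np
import pandas as pd

swim = pd.read_csv("water_escape.csv")  # hypothetical file name
swim = swim[swim["strain"] != "A/J"]             # non-learning outlier strain
swim = swim[swim["trial"] <= 4]                  # first four escape trials only
swim["sqrt_latency"] = np.sqrt(swim["latency"])  # stabilise the mean-variance relation
```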

Rotarod data

This was the task where the mice won! For the reasons detailed below, we believe that the rotarod data are essentially uninterpretable, and we therefore did not report them in the manuscript.

As the deadline for starting the experiment neared, we noticed that the surfaces of the rotarods were not identical. Even though the genotypes and conditions were randomized across rods in each site, we were concerned that this could distort the estimated performance of some groups. One of us (JC) made the last-minute and very ill-advised decision to cover all rotarod surfaces with 320 grit wet/dry sandpaper to achieve uniformity. This surface had worked well on other rotarods he had used in his laboratory. There was no time to pretest the rotarods before we launched into the experiment.

The accelerating rotarod task is designed to place increasing demand on the mouse until it is no longer able to stay on top and falls. The behavioral strategy it seeks to measure is a constant shifting of position as the rod rotates away from beneath the mouse. However, the combination of a relatively small-diameter rod with the sandpaper surface offered the mice a second (and superior!) behavioral strategy, which was to "flatten" themselves against the rod and essentially wrap themselves around it. On trials when a mouse adopted this flattened posture and grip, its latency to fall was dramatically elevated (e.g., from latencies of 5-20 seconds to latencies of 30-70 seconds). Thus, the raw data from the 10 trials of the experiment contained highly variable and essentially bimodal scores.

There were four reasons that this problem led to our decision to declare the data uninterpretable. First, some mice clearly learned to flatten, based on increasing numbers of flattened trials in the second five versus the first five trials. Ignoring the last five trials helped, but even then flattening occurred on 20% of the trials across all sites. Second, sites differed in the proportion of trials they declared as "flattening" (range: 11-29%). Third, strains clearly differed in the extent to which they engaged in the flattening strategy. For example, a large proportion of C57BL/6J mice engaged in this strategy across sites, while DBA/2J mice rarely did. Finally, with only five trials, the influence of very long latencies skewed the results. We tried deleting individual trials on which flattening had occurred, but both the site differences and the strain differences were still apparent. Indeed, attempting to cleanse the data in this way led to one cell in the ANOVA with no entry.

Along with the attempts just described to render the data more interpretable, we examined the patterns of strain, sex, site and shipping condition differences under the different levels of data cleansing. We reluctantly decided that we could not interpret the results of this test. The raw data are available for the interested peruser. Trials on which a "flattening" occurred are not indicated in the data files, but data tables containing this information can be obtained from JC.
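For readers who obtain the flattening tables, the following is a rough sketch of the kind of cleansing described above. The file and column names, including the "flattened" flag (which is not in the public data files), are hypothetical.

```python
# Drop the last five trials and any remaining "flattened" trials, then check
# whether any strain x site cell is left empty.
import pandas as pd

rod = pd.read_csv("rotarod.csv")   # hypothetical file name
rod = rod[rod["trial"] <= 5]       # ignore the last five trials
rod = rod[~rod["flattened"]]       # hypothetical boolean flag for flattened trials

# Cells with zero remaining trials show up as 0 in this table.
print(rod.groupby(["strain", "site"]).size().unstack(fill_value=0))
```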

 

 

Effect sizes reported in Table I of the MS: justification for not using multiple comparison adjustments

One reviewer suggests that we should adjust the P values reported in Table I for multiple comparisons. He or she further suggests a rather radical approach, which presumes that there were no prior predictions about the various factors in the study. Within the severe constraints of a Report to Science, we obviously did not have the space necessary to elucidate our statistical strategy in any depth. We do not believe that adjusting the alpha levels for multiple comparisons would communicate the results any more clearly than Table I already does. A statistically sophisticated reviewer, such as Reviewer #3, will immediately be able to see the effect of a multiple comparison adjustment him- or herself, using whatever assumptions seem personally appropriate. Even if this were done, the main findings of the MS would not be altered. (The reviewer states in the introductory paragraph, "...the authors conclusions seem undeniable.") What is really needed, and is freely acknowledged by us, is a larger experiment with greater statistical power to reveal higher-order interactions. We acknowledge this in the MS while discussing test reliability.

In our study, we were mainly interested in the overall pattern of results. The sheer number of mice required by the design meant that we had limited power to detect modest interactions. We did not want to obscure our ability to detect the overall pattern of results by using unduly stringent alpha levels. Furthermore, our study was in itself a test of replicability, and a test of replicability is far more relevant than adjusted alpha levels.

The most important consideration is the purpose of the study, as embodied in the original power calculation. We set N to yield adequate power to detect strain × lab interactions, and there were not very many such interactions to examine for the most important variables. We argue that, for six behavioral tests, we were mainly interested in about a dozen tests of interaction, and α = .01 seems quite reasonable to us in this context.
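For illustration only, the sketch below shows the kind of power calculation involved, approximating the strain × lab cells as groups in a one-way design. The effect size, power target, and number of cells are placeholder assumptions, not the values used in the original calculation.

```python
# Approximate total N needed for 80% power at alpha = .01, treating the
# strain x lab cells as groups in a one-way ANOVA (placeholder numbers).
from statsmodels.stats.power import FTestAnovaPower

n_total = FTestAnovaPower().solve_power(
    effect_size=0.25,  # assumed medium effect (Cohen's f)
    alpha=0.01,
    power=0.80,
    k_groups=24,       # e.g., 8 strains x 3 labs
)
print(round(n_total), "mice in total under these assumptions")
```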

Whether or not to use multiple comparison corrections at all for a study of this sort is a contentious issue among statisticians. Another alternative would be to use MANOVA. Indeed, some would argue that we should not use significance tests at all. We have reported effect sizes and simple probabilities, which strikes us as a good alternative. That is, we have presented the data both ways, in a concise and informative format, in Table I, leaving statistical interpretation up to the reader. Even if one is committed to the use of multiple comparison corrections, the proper Type I error probability in our study certainly should not be set at a level that protects against false rejection of the null for all 56 tests. This experiment is quite different from a genetic linkage study where there is real credibility in the null and most tests against markers are expected to be non-significant. In our study, to give a simple example, the strain differences were virtually certain to be significant by almost any criterion, so it is not appropriate to include them in a multiple comparison adjustment.
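For readers who do wish to adjust the Table I probabilities, one way to do so is sketched below. The p-values listed are arbitrary placeholders; substitute the values from the table and whichever correction method seems appropriate.

```python
# Apply a Holm adjustment to a set of p-values at alpha = .01.
from statsmodels.stats.multitest import multipletests

raw_p = [0.0003, 0.009, 0.012, 0.048, 0.21]  # placeholder values, not Table I entries
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.01, method="holm")
for p, q, r in zip(raw_p, adj_p, reject):
    print(f"raw p = {p:.4f}  adjusted p = {q:.4f}  significant at .01: {r}")
```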

