Sunday, February 22, 2009

Voodoo Meta-Analysis


In my previous post I ran some simulations that explored how various summary scores of cluster-wise correlation magnitude are affected by cluster size. I showed that the peak correlation and the two-stage correlation yield systematically higher correlation estimates than either the median or minimum correlation in a cluster. I also made a statement about the magnitude of the bias in such summary scores that was based on a misunderstanding of the "non-independence" error as described by Vul et al. in the "voodoo correlations" paper and in an in-press book chapter by Vul and Kanwisher. I will return to my simulations and argue that they are indeed still informative, but first I want to discuss just what is meant by the "non-independence" error in neuroimaging, as defined by Vul and colleagues.

Understanding the "Non-Independence Error"

There is a common category of errors that often crop up in functional neuroimaging studies in which an ROI is selected on the basis of one statistical test and then a second non-independent test is carried out on the same data. This type of non-independence is often discussed in elementary neuroimaging tutorials and forums and is well known and rather fiercely guarded against. It, moreover, involves two null hypothesis tests -- and therefore two statistical inferences, the second of which is biased to yield a result in favor of the experimenter's hypothesis. The category of errors referred to in Vul et al. subsumes, but is not limited to, this kind of "two hypothesis test" error.

Consider the case of a whole-brain correlation analysis. One has just carried out an analysis correlating some behavioral measure x with a measure of activity in every voxel in the brain. One has corrected for multiple comparisons and identified a number of "activated clusters" in the brain. So far so good. We have conducted one hypothesis test for each voxel in the brain. We are interested in finding where the significant clusters are located (if there are any at all) and we may also be interested in the magnitude of the correlations in those active clusters.

If we have corrected for multiple comparisons, then we may safely report the location of the clusters in x,y,z coordinates. What we may not do, according to Vul and colleagues, is report the magnitude of the correlation. Neither may we report the maximum of the cluster. Nor may we report the minimum of the cluster. We may not choose any voxel in the cluster randomly and report its value. Let me go further: we may not even substitute the threshold (r = 0.6, say) to serve as a lower bound for the correlation magnitude. To report the magnitude of the correlation of a selected cluster, or any derivative measure thereof, is to commit the "non-independence error". [I note only in passing that if neuroimaging studies only ever reported the lower bound of a correlation (i.e. the threshold), no studies would ever report correlations greater than ~ 0.7].

One reason we may not (according to Vul et al.) report the magnitude of a correlation is that correlation estimates selected on the basis of a threshold will on average be inflated relative to the "true" value. The reason for this is that above-threshold values are likely to have benefited from "favorable noise" and will therefore be biased upwards. The problem is akin to regression to the mean and is not specific to correlations or social neuroscience or even functional neuroimaging, per se. You can get an idea of the scope and generality of the concept in the recent chapter of Vul and Kanwisher -- which is an extended homily on the varieties of the "non-independence error" where you will, among other things, learn the virtue of not plotting your data:

“The most common, most simple, and most innocuous instance of non-independence occurs when researchers simply plot (rather than test) the signal change in a set of voxels that were selected based on that same signal change.” (pg 5)

Vul and Kanwisher are also critical of several authors for presenting bar plots indicating the pattern of means in a series of ROIs selected on the basis of a statistical contrast. We are told that such a presentation is "redundant" and "statistically guaranteed"
(pg 11). I'll give you another example (this one I thought up all on my own) of Vul's non-independence error: reporting a correlation in the text of the results section and then also, quite redundantly and non-informatively, reporting the correlation separately in a table or figure legend. You see the range.

Before I begin with my critique of the Vul et al. meta-analysis, I just want to make it clear that the hypothesis that correlations in whole-brain analyses will tend to be inflated is quite reasonable. The other part of their hypothesis, that correlations are massively -- rather than, say, negligibly -- inflated needs to be backed up empirically. It is this aspect of their study -- the empirical part -- that I find unsatisfying (perhaps that is an understatement).
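To make the distinction concrete: the size of the selection bias is something one can actually measure in a toy simulation. Here is a minimal sketch (Python with numpy; the "true" correlation, sample size, and threshold are made-up values for illustration, not numbers taken from any of the studies under discussion) of how one might ask: given a true correlation, how much larger does the estimate look once it has been forced over a significance threshold?

```python
import numpy as np

rng = np.random.default_rng(0)

true_rho = 0.5      # assumed population brain-behavior correlation
n_subjects = 18     # assumed sample size
threshold = 0.65    # assumed whole-brain significance cut-off for r
n_sims = 20000

cov = [[1.0, true_rho], [true_rho, 1.0]]
selected = []
for _ in range(n_sims):
    # one simulated "study": draw n subjects with the specified true correlation
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n_subjects).T
    r = np.corrcoef(x, y)[0, 1]
    if r > threshold:            # keep only the estimates that pass threshold
        selected.append(r)

print("true correlation:              ", true_rho)
print("mean of the selected estimates:", round(float(np.mean(selected)), 3))
```

The selected estimates are, of course, larger than the true value; the interesting question -- and the one the meta-analysis needed to answer empirically -- is by how much, under realistic conditions.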

Voodoo Meta-Analysis and the Non-Independence Error

We have just remarked that in their in-press chapter Vul and Kanwisher emphasize that the non-independence error is not limited to the reporting of biased statistics, but may also involve the mere presentation of data that has been selected in a non-independent manner.

“Authors that show such graphs must usually recognize that it would be inappropriate to draw explicit conclusions from statistical tests on these data (as these tests are less common), but the graphs are presented regardless." (pg 8.)

Where else might we find examples of such a misleading presentation of data? Sometimes it turns out that the "non-independence error" is lurking in your own backyard. Take for instance the meta-analysis presented in the "voodoo correlations" paper by Vul et al. (in press). The thesis of this paper is quite straightforward. First, the authors surmise that correlations observed in brain imaging studies of social neuroscience are "impossibly high". Second, because correlation magnitudes are intrinsically important, scientists must also provide accurate estimates of the correlation magnitude -- something that is not necessarily guaranteed by null hypothesis testing alone.

To explore the question further, Vul et al. searched the literature for social neuroscience papers that reported brain-behavior correlations, and then sent a series of survey questions to the authors of the selected papers. On the basis of the authors' responses and other unknown considerations, they classified the papers as either using (good) independent methods or using (suspect) non-independent methods.

What constitutes a "non-independent" analysis, you ask? Studies classified as non-independent were ones that selected significant clusters and reported the magnitude of these activations based on a summary score (usually the mean or maximum value of the cluster). Let me be absolutely clear here, because there has been some confusion on this point (I'm pointing at myself here). These studies did not perform two non-independent statistical tests. They performed one and only one correlation for every voxel in the brain. Because such analyses perform many correlations over the brain (tens of thousands), a correction for multiple comparisons is imposed, resulting in high statistical thresholds to achieve a nominal alpha value of 0.05. The key point is that in the Vul et al. meta-analysis, non-independent analyses are by and large synonymous with whole-brain analyses. That is a crucial element to the argument that follows, so take note of it.

An independent analysis, on the other hand, was defined as an analysis that used a region of interest (ROI) that was defined either anatomically, via an independent functional ROI or localizer, or through some combination of both. As a consequence, such independent analyses usually calculate just one or perhaps a handful of correlations, and therefore apply far more lenient statistical thresholds. For instance, in a large study with 37 subjects, such as the one by Matsuda and Komaki (2006, cited in Vul et al.), a correlation of 0.27 was declared statistically reliable at an alpha level of .05. A whole-brain analysis with the same number of subjects and a 0.001 alpha level would have required a correlation at least as great as 0.5 for a one-tailed test (in addition to whatever cluster extent threshold is applied). For a more typically sized 18-subject study, the correlation would have had to be as large as 0.67 (one-tailed) to reach significance.
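Those critical values are easy to verify from the t-distribution for a Pearson correlation. A quick back-of-the-envelope sketch (Python with scipy; the sample sizes and alpha levels are simply the ones mentioned above):

```python
from math import sqrt
from scipy.stats import t


def critical_r(n, alpha, tails=1):
    """Smallest correlation that reaches significance with n subjects
    at the given alpha level (one-tailed by default)."""
    df = n - 2
    t_crit = t.ppf(1 - alpha / tails, df)
    return t_crit / sqrt(df + t_crit ** 2)


print(critical_r(37, 0.05))    # ~0.27 : lenient ROI-style threshold, n = 37
print(critical_r(37, 0.001))   # ~0.49 : whole-brain style threshold, n = 37
print(critical_r(18, 0.001))   # ~0.68 : whole-brain style threshold, n = 18
```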

So What's Wrong with the Voodoo Meta-Analysis?

Let me count the ways.

Remember, the aim of the Vul et al. meta-analysis is to establish that non-independent methods produce massively -- not just marginally -- inflated correlations. The meta-analysis itself is fundamentally an empirical, rather than theoretical, endeavor. Let me remind you that studies classified as "non-independent" are all whole-brain analyses, and therefore involve corrections for multiple comparisons that necessitate a large correlation magnitude to achieve statistical significance. Those studies classified as independent do not impose such high thresholds. The upshot of this is that a whole-brain (non-independent) analysis will by definition never report a correlation less than about 0.5 (assuming a large 37 subject maximum sample). On the other hand, independent analyses, because of their greater sensitivity, will report correlations as low as 0.27 (assuming the same 37 subject maximum sample).

What does this tell us? The classification of papers into "non-independent" and "independent" groups was guaranteed to produce higher correlations on average for the former than for the latter group, irrespective of whatever genuine inflation of correlation magnitudes may exist in the former category.

The same result could have been produced with a random number simulation. Suppose I sample numbers randomly from the range -1 to 1. In a first run I sample a number, check to see if it's greater than 0.3, and if so store it in an array, continuing until I've got about 30 values. In a second run I sample numbers from the same underlying distribution, but I only accept a number greater than 0.6, stopping at about 26 values. I then plot a histogram, showing how the first group of numbers is shifted to the left (plotted in green) of the second group of numbers (plotted in red). Note that I'll have to draw more numbers in the latter case to collect them, but that's OK as I have an inexhaustible supply of random numbers to draw from. Compare this to the Vul et al. literature search, which found approximately equal numbers (30 and 26, respectively) of independent and non-independent analyses even though the relative frequencies of the two classes may be very different. But Pubmed, like a random number generator, is inexhaustible.
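If you want to see the histogram for yourself, the simulation takes only a few lines. A sketch (Python with numpy and matplotlib; the thresholds and counts are the ones used above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)


def sample_above(threshold, n_wanted):
    """Draw from Uniform(-1, 1), keeping only values above the threshold."""
    kept = []
    while len(kept) < n_wanted:
        x = rng.uniform(-1.0, 1.0)
        if x > threshold:
            kept.append(x)
    return np.array(kept)


lenient = sample_above(0.3, 30)   # the "independent" analog
strict = sample_above(0.6, 26)    # the "non-independent" analog

bins = np.arange(0.0, 1.05, 0.05)
plt.hist(lenient, bins=bins, color="green", alpha=0.6, label="threshold 0.3")
plt.hist(strict, bins=bins, color="red", alpha=0.6, label="threshold 0.6")
plt.xlabel("sampled value")
plt.ylabel("count")
plt.legend()
plt.show()
```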

There is a counterargument that Vul et al. might avail themselves of, however. They might argue that high thresholds and inflated correlations are inextricably linked. It is the high thresholds that lead to the inflated correlations in the first place. Unfortunately, the argument holds little water, as high thresholds would lead to high (significant) correlations even in the absence of any correlation "inflation", which happens to be consistent with the null hypothesis that the authors wish to reject (or persuade you to reject, as we shall see in the next section). Moreover, this argument, if seriously offered, would be a rather obvious example of "begging the question", a practice the authors strongly repudiate. Finally, the division of studies into the two groups is confounded by the differing sensitivities of the analyses, with non-independent studies sensitive only to larger magnitude correlations.


Voodoo Histogram


I would now like everyone to turn to page 14 of the "voodoo correlations" paper, where you may get acquainted with the most famous histogram of 2009, the Christmas-colored wonder showing the distribution of correlations among the "independent" and "non-independent" studies that entered the Vul et al. survey. What is the purpose of this histogram? Before we answer that question, let us return to the central theses of Vul et al. First, correlation magnitudes matter. And, second, non-independent analyses produce grossly inflated correlations.

What evidence, other than a priori reasoning, do they adduce in favor of the inflation hypothesis? Well, as a starter, do they provide summary statistics, e.g. the mean or median correlation in the two groups? No. Do they perform a statistical test comparing the two samples for a shift in central tendency using, for instance, a t-test or a non-parametric test of some kind? No. Do they carry out an analysis of the frequencies distributed over bins with a chi-square or equivalent statistical test? No. Finally, if correlation magnitudes matter, why do the authors make an exception to that rule in their own analysis, which fails to report an estimate of the difference in correlations between the two groups? After all, how are we to know how serious the error is, if there is one at all? Do we care if the bias in correlation magnitude is .001 or .05 or even .1? Probably not very much.
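Any one of these would have taken a few lines. A sketch of what such tests might look like (Python with scipy; the two arrays below are placeholders only, not the correlations actually reported in the Vul et al. appendix, which one would have to transcribe before drawing any conclusion):

```python
import numpy as np
from scipy import stats

# Placeholder values for illustration only -- NOT the correlations from the
# Vul et al. appendix. Substitute the transcribed values before interpreting.
independent_rs = np.array([0.25, 0.31, 0.42, 0.48, 0.55, 0.61, 0.66, 0.72])
nonindependent_rs = np.array([0.52, 0.58, 0.63, 0.69, 0.74, 0.81, 0.85, 0.88])

# A point estimate of the difference, since "correlation magnitudes matter".
print(np.median(nonindependent_rs) - np.median(independent_rs))

# Shift in central tendency: Welch t-test and a non-parametric alternative.
print(stats.ttest_ind(nonindependent_rs, independent_rs, equal_var=False))
print(stats.mannwhitneyu(nonindependent_rs, independent_rs,
                         alternative="two-sided"))

# Frequencies distributed over bins: chi-square on the binned counts.
bins = [0.0, 0.5, 0.7, 1.0]
counts = np.array([np.histogram(independent_rs, bins)[0],
                   np.histogram(nonindependent_rs, bins)[0]])
print(stats.chi2_contingency(counts))
```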

Now, the reason for the omission of any statistical test or summaries, I think, is that Vul et al., being virtuous abstainers from the "non-independence error", believed they could avoid its commission by eschewing a formal test -- and thereby insulate themselves against the charge of non-independence. Instead, they reasoned, "we'll just present a green/red colored histogram and let the human color perception system work its magic". (Sadly, since the authors used red and green squares in their histogram, color blind social neuroscientists are mystified as to what all the fuss is about). Sometimes, however, it is enough to plot one's data to be guilty of the "non-independence error".

Let me remind you of a passage from Vul and Kanwisher (in press) which contains more of the wit and wisdom of Ed Vul.

"Authors that show such [non-independent] graphs must usually recognize that it would be inappropriate to draw explicit conclusions from statistical tests on these data (as these tests are less common), but the graphs are presented regardless. Unfortunately, the non-independence of these graphs is usually not explicitly noted, and often not noticed, so the reader is often not warned that the graphs should carry little inferential weight." (Vul and Kanwisher, in press, pg. 8)

I think that quote is rather a nice summing up of the sad affair of the "voodoo histogram". The thing was based on non-independent data selection (due to the differing thresholds between the two groups and sundry other reasons described below) but was nevertheless used to persuade the reader of the correctness of the authors' main hypothesis. In the end, we do not know what to conclude from this meta-analysis, having been presented with no evidence in favor of the central hypotheses put forth by the authors. That the evidence was selected in a non-independent manner in the first place, due to the disparity in the statistical thresholds across groups, has a strange self-referential quality to it that reminds me of one of those Russellian paradoxes about barbers or Cretans and so on.

Cataloguing some of the Voodoo

The more I look at Vul and colleagues' meta-analysis, the more perfect little pearls of "non-independence" turn up in its soft tissue. In the following sections I am simply going to give you a taste.

1) Vul et al. classified studies as "independent" that selected voxels based on a functional localizer and then correlated a behavioral measure with data extracted from that ROI. The majority of such studies identified the ROI used for the secondary correlation analysis with a whole-brain t-test conducted at the group level in normalized space. It so happens that the magnitude of a t-statistic is influenced by both the difference of two sample means (or the difference between a sample mean and a constant) and the variance of the sample. Thus, ROIs identified in this manner will have taken advantage of favorable noise that ensures both large effects and small variance. As Lieberman et al. cleverly point out in their rebuttal to "voodoo correlations", low variance will inevitably lead to range restriction, a phenomenon that has the effect of artificially deflating correlations. Therefore, the studies labelled "independent" that used whole-brain t-tests to identify ROIs (the majority of such studies) were virtually guaranteed to produce reduced correlations, and therefore constitute another example of the "non-independence error" unwittingly perpetrated by the Vul et al. meta-analysis.
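Range restriction is easy to demonstrate for yourself. A generic sketch (Python with numpy; an assumed true correlation of 0.6 and an arbitrary restriction band -- this illustrates the range-restriction phenomenon in general, not the t-test selection step itself):

```python
import numpy as np

rng = np.random.default_rng(2)

true_rho = 0.6
cov = [[1.0, true_rho], [true_rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T

# Over the full range the sample correlation recovers the population value.
print("full range:      ", round(float(np.corrcoef(x, y)[0, 1]), 3))

# Restrict x to a narrow (low-variance) band, as happens when a selection
# step favors low-variance data, and the correlation is deflated.
band = np.abs(x) < 0.5
print("restricted range:", round(float(np.corrcoef(x[band], y[band])[0, 1]), 3))
```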

2) The meta-analysis fails to identify which studies reported the peak magnitude of a cluster and which studies reported mean correlations. Vul et al. repeatedly insist that "correlation magnitudes matter" and if this is the case it would be important to distinguish between those two sets. You may refer to my previous blog entry to see that average measures of correlation magnitude in a cluster hew towards the threshold, which is generally around .6 or .65 for a whole-brain analysis. On the other hand, "peak" values in a cluster are (by definition) not representative of the regional magnitude of the correlation estimate and, moreover, suffer from the problem of regression to the mean. It is very likely that virtually all the correlation estimates exceeding .8 come from studies that used the peak magnitude of the cluster. This is important to know! Remember, Vul et al.'s argument isn't to say that reporting only peak magnitudes is a bad practice, it's to say that reporting any summary measure in a selected cluster will result in massively inflated correlations. No evidence for that assertion is provided in the meta-analysis and critical information as to which summary measure was used for each study is not reported.

3) Localization of an ROI using an independent contrast is an imperfect process. Just as there is noise in the estimation of correlation magnitudes, so too is there noise in the estimation of the "true location" of a functional area. Thus, spatial variation in the locus of a functional ROI ensures that a subsequent estimate of correlation magnitude will be systematically biased downwards. It would take many repetitions of the same experiment in the same group of subjects to arrive at a sufficiently accurate estimate of the "true location" (insofar as such a thing exists) of a functional region to mitigate this spatial selection error.

4) Vul et al. do not consider the possibility that exploratory whole-brain correlation analyses are much more likely to find genuine large magnitude correlations than hypothesis-driven ROI analyses. Think about it. An ROI analysis is about confirming a specific hypothesis that an experimenter has about a brain-behavior relationship. It's a one-shot deal. If the researcher is wrong, he comes up empty. Whether a significant correlation is observed depends a lot on whether the scientist was right to look in that particular region in the first place. But remember, the brain is still a mysterious object and sometimes neuroscientists aren't quite sure where exactly to look. It is in these cases, generally, that they turn to whole-brain analyses. Consider the following. Suppose I have a hypothesis about the relationship between monetary greed and brain activity in the orbitofrontal cortex. I define an ROI and perform a correlation between brain activity and some measure of the behavior of interest (greed). The correlation turns out to be 0.6. Now, what is the probability that some other region in the brain has a higher correlation than the one I discovered? Well, since we know relatively little about the brain, I submit that the probability is very close to 1. If that's the case, then for every ROI analysis that discovered a correlation with magnitude r, a corresponding whole-brain analysis, due to its exploratory nature, is likely to find those other regions that correlate more strongly with the behavioral measure than the ROI that was chosen on the basis of imperfect knowledge. The bottom line is that exploratory analyses have the opportunity, because they are exhaustive, to uncover the big correlations, while targeted ROI analyses are fundamentally limited by the experimenter's knowledge of the brain and the constraint of looking only in one location.

There are two parties looking to find a buried treasure. One party has a hazy hunch that the treasure is located at spot X. The other party, which is bank-rolled by the royal family, is composed of thousands of men who fan out all over the area, searching for the treasure in every conceivable place, leaving no stone unturned. If the first party's hunch is wrong, they strike out. The second party, however, by performing an exhaustive search, always succeeds in finding the treasure -- provided it exists.

5) A few more things to chew on. Vul et al.'s study included five Science and Nature studies combined, which accounted for 10% of all the studies (which means that these "big two" journals were vastly over-represented). Of those 5 papers, 13 out of the 14 correlations included were in the non-independent group. Second, of the 135 correlations in the non-independent group, 22 came from a single study (study 11, see Vul et al. appendix). Of the 55 correlations that were greater than .7 in the non-independent group, a whopping 23% (13/55) came from this same study. The mean number of correlations from each study was 4.9 and the standard deviation was 4.4, meaning that 22 is nearly 4 standard deviations above the mean and is therefore an outlier by anyone's standard. Remember that correlations drawn from the same study are non-independent, and therefore including 22 correlations from a single study -- especially when that study contributed a disproportionate number of correlations greater than 0.7 -- is rather dubious. Indeed, to avoid a non-independence error in this case, Vul et al. should have chosen only one correlation from each study -- and have chosen that correlation randomly.

6) The variance of a correlation estimate is related to the sample size of the study. And yet Vul et al. fail to report the sample sizes of the 54 studies that entered their analysis. This is a serious omission, for obvious reasons that Vul et al. should have been attuned to. It is another potential commission of the non-independence error in Vul et al.
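To see why sample size matters so much, consider the approximate confidence interval for a correlation via the Fisher z transform. A sketch (Python with numpy; the sample sizes below are illustrative, not taken from the surveyed studies):

```python
import numpy as np


def r_confidence_interval(r, n, z=1.96):
    """Approximate 95% confidence interval for a Pearson correlation,
    using the Fisher z transform (standard error ~ 1/sqrt(n - 3))."""
    fz = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    return float(np.tanh(fz - z * se)), float(np.tanh(fz + z * se))


for n in (12, 18, 37):
    lo, hi = r_confidence_interval(0.7, n)
    print(f"n = {n:2d}: observed r = 0.70, approx. 95% CI ({lo:.2f}, {hi:.2f})")
```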

7) I'm getting tired so I will be brief on this last one. The selection criteria for the papers that entered the meta-analysis are poorly described. For instance, how did 5 Science and Nature studies get into the sample? Is that an accident? If not, what was the rationale for choosing all those high profile papers? A meta-analysis should either be exhaustive or otherwise take pains to achieve a representative sample -- and, if the latter, then it is incumbent on the authors to describe the selection criteria and methods in detail. For instance, were the persons who selected the papers blind to the hypothesis? And so on.


References

Vul, E., Harris, C., Winkielman, P., & Pashler, H. Voodoo Correlations in Social Neuroscience. Perspectives on Psychological Science. In press.

Vul, E., & Kanwisher, N. Begging the Question: The Non-Independence Error in fMRI Data Analysis. Book chapter. In press.

Lieberman, M., Berkman, E., & Wager, T. Correlations in Social Neuroscience Aren't Voodoo: A Reply to Vul et al. Perspectives on Psychological Science. In press.


Tuesday, February 17, 2009

Simulating Voodoo Correlations: How much voodoo, exactly, are we dealing with?

The recent article "Voodoo Correlations in Social Neuroscience" by Ed Vul and colleagues has gotten a lot of attention and has stimulated a great deal of discussion about statistical practices in functional neuroimaging. The main critique in the article by Vul involves a bias incurred when a correlation coefficient is re-computed by averaging over a cluster of active voxels that are selected from a whole-brain correlation analysis. Vul et al. correctly point out that the method will produce inflated estimates of the correlation magnitude. There have been several excellent replies to the original paper, including a detailed statistical rebuttal showing that the actual bias incurred by the two-stage correlation (henceforth: vul-correlation) is rather modest.

It occurred to me in thinking about this problem that the bias in the correlation magnitude should be related to the number of voxels included in the selected cluster. For instance, in the case of a 1 voxel cluster the bias is obviously zero since there is only one voxel to average over. How fast does this bias increase as a function of cluster volume in a typical fMRI data set with a typically complex spatial covariance structure? Consideration of the high correlation among voxels within a cluster led me to wonder about the true extent of bias in vul-correlations. For instance, in the most extreme case, where all voxels in a cluster are perfectly correlated, there is zero inflation due to averaging over voxels.

To explore these questions I ran some simulations with real world data. The data I used were from a study carried out on the old 4 Tesla magnet at UC Berkeley and consisted of a set of 27 spatially normalized and smoothed (7mm FWHM) contrasts in a verbal working memory experiment (delay period activation > baseline). The goal was to run many correlation analyses between the "real" contrast maps and a succession of randomly generated "behavioral scores". Thus, for each of 1000 iterations I sampled 27 values from a standard normal distribution to create a set of random behavioral scores. I then computed the voxel-wise correlation between each set of scores and the set of 27 contrast maps. I then thresholded the resulting correlation maps at 0.6 (p = 0.001) and clustered the above-threshold voxels using FSL's "cluster" command. This resulted in 1000 thresholded (and clustered) statistical maps representing the correlation between a set of "real" contrast maps and 1000 randomly generated "behavioral scores".
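For concreteness, here is a rough sketch of that pipeline, together with the cluster summary scores described in the next paragraph (Python with numpy). It is not the script I actually ran: the contrast data below are a random stand-in for the real 27 maps, and the clustering step, which I did with FSL's "cluster" command, is collapsed into a single set of above-threshold voxels for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_subjects = 27
r_threshold = 0.6       # roughly p = 0.001 for n = 27

# Stand-in for the real data: a (subjects x voxels) array of contrast values.
contrast_maps = rng.standard_normal((n_subjects, 50_000))


def voxelwise_correlation(maps, scores):
    """Pearson correlation between the behavioral scores and every voxel."""
    zm = (maps - maps.mean(0)) / maps.std(0)
    zs = (scores - scores.mean()) / scores.std()
    return zm.T @ zs / len(scores)


def cluster_summaries(maps, scores, r_map, cluster_voxels):
    """Min, median, and max correlation in a cluster, plus the two-stage
    vul-correlation: average over the cluster's voxels first (one value
    per subject), then re-correlate that average with the scores."""
    in_cluster = r_map[cluster_voxels]
    mean_signal = maps[:, cluster_voxels].mean(axis=1)
    vul_r = np.corrcoef(mean_signal, scores)[0, 1]
    return {"min": in_cluster.min(), "median": np.median(in_cluster),
            "max": in_cluster.max(), "vul": vul_r}


# One iteration of the 1000: random behavioral scores, voxel-wise correlation,
# thresholding, then summary scores for the surviving voxels (treated here as
# one big cluster; the real analysis broke them into clusters with FSL first).
scores = rng.standard_normal(n_subjects)
r_map = voxelwise_correlation(contrast_maps, scores)
cluster_voxels = np.flatnonzero(r_map > r_threshold)
if cluster_voxels.size:
    print(cluster_summaries(contrast_maps, scores, r_map, cluster_voxels))
```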

Next, I loaded each of the 1000 statistical volumes and computed, for each active cluster, the minimum, median, and maximum correlation in the cluster, as well as the two-stage vul-correlation. The vul-correlation was computed as follows: I extracted the matrix of values from the set of contrast maps for each cluster (rows = number of subjects, 27; columns = number of voxels in the cluster) and averaged across columns, yielding a new vector of 27 values. I then recomputed the correlation coefficient between this averaged vector and the original randomly generated "behavioral variable" (all 1000 of which had been saved in a text file). Then I plotted cluster volume in cubic centimeters against the median, maximum, and vul-correlations. Here's the result.





What you can see is that the vul-correlation rapidly increases as a function of cluster volume, reaching an asymptote at a correlation of about .73 and a cluster volume of roughly 2 cubic centimeters. You can see, however, that the maximum correlation, which is not a two-stage correlation, has almost exactly the same functional profile. The median correlation within a cluster also increases somewhat, but not as high or as rapidly as the vul- and maximum correlations.

To quantify the "bias" in the vul-correlation as a function of cluster size I plotted the difference between the vul-correlation and median correlation.




It is clear from this plot that the bias becomes maximal when the cluster size is approximately 3 cubic centimeters. That is, however, rather a large cluster by fMRI standards. For a 1 cubic centimeter cluster the bias is about .075 and for a 1/2 cubic centimeter cluster (approximately 20 3 x 3 x 3 mm voxels) the bias is about 0.06. I'm not sure whether that rises to the level of "voodoo". Perhaps voodoo of a Gilligan's Island variety. Minor voodoo, if you like.

Lastly, I examined the minimum correlation as a function of cluster size. Of course, the minimum correlation can never fall below the cluster threshold, which was .6. Thus, I thought that the minimum correlation might serve as a good lower bound for reporting correlation magnitudes. You can see from the plot below that for these random simulations, at least, the minimum correlation does not increase with cluster size. In fact, it tends to approach the correlation threshold, which is not surprising, as this is what would be expected in a noise distribution. This time I've plotted cluster volume on a log (base 2) scale for easier visualization of the trend.






So, what have I learned from this exercise? First, the amount of inflation incurred from a two-stage correlation (vul-correlation) increases as a function of cluster size. For smallish clusters (1/2 to 1 cubic centimeters) this bias is not that much, whereas for larger clusters the bias is as high as 0.1. Second, the maximum correlation has a nearly identical relation with cluster volume as does the vul-correlation. Finally, candidates for the reporting of cluster magnitudes could be the median or minimum correlations. The median correlation increases with cluster size, but not by much. The minimum correlation decreases with cluster size, but again not by much.

All in all, I think the problem identified by Vul et al. is a genuine one. Two-stage correlation estimates are inflated when compared to the median correlation within the cluster -- but not by that much. One reason for this is that the high thresholds required to achieve significance in whole-brain analyses yield voxels whose correlations don't have much room to go up. In addition, the constituent voxels of a cluster are already highly correlated, so the "truncation of the noise distribution" referred to by Vul et al. may be less than would be expected among truly independent voxels. So, perhaps, in the end the vul-correlation isn't so much a voodoo correlation as it is a vehicle for voodoo fame.