Friday, July 8, 2016

Faulty cluster-wise inference: Laughter and the 40,000 fMRI papers

Cluster Failure 

In a paper entitled "Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates", Eklund, Nichols, and Knutsson (2016) show that several popular statistical packages for fMRI analysis have a strong tendency to find activation where there is, well, none. This is like the old "dead salmon" study, but with a lot more rigor and a lot fewer laughs. At one point the authors estimate that 40,000 fMRI papers (a figure they have since revised) might be invalid, noting dryly that it is "not feasible to redo them". Would we want to redo them if it were feasible?



Imagine a World ...

Imagine a world where it were feasible to redo 40,000 fMRI studies in about six months, for around 50 million US dollars (a tiny fraction of what the studies originally cost). Four hundred MRI scanners, arrayed in hedge-like rows on an abandoned crop field in Kansas, operated round the clock by experienced MR technicians, and a huge pool of willing participants: old or young, consented, metal-free, and ready to perform a task of some gentle soul's devising (n-back, Stroop, or something more exotic). A cadre of crack E-Prime coders (making 15 bucks an hour) will work round the clock to create accurate facsimiles of some 40,000 psychological tasks.

With the help of an extremely generous gift from a major MRI manufacturer, and by being ultra-efficient, the six-month timeline seems genuinely realistic in this alternative universe. We just need to run about 220 studies per day for 180 days, which (assuming roughly 20 subjects per study) means scanning approximately 4400 subjects per day, or 11 subjects per scanner per day. Don't get me wrong, it's a lot of work, but it could be done (in the hypothetical universe that I have constructed for the very purpose).
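Here is the back-of-envelope arithmetic, for anyone who wants to check my work. The 20-subjects-per-study figure is a round number I made up for the purpose; everything else follows from it:

```r
# Hypothetical throughput for the Kansas scanner farm (all inputs are
# assumptions from the thought experiment, not real data).
n_studies          <- 40000
days               <- 180
scanners           <- 400
subjects_per_study <- 20    # made-up round number

studies_per_day  <- n_studies / days                      # ~222 studies/day
subjects_per_day <- studies_per_day * subjects_per_study  # ~4444 subjects/day
per_scanner      <- subjects_per_day / scanners           # ~11 subjects/scanner/day

round(c(studies = studies_per_day,
        subjects = subjects_per_day,
        per_scanner = per_scanner))
```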

The data would be analyzed on Google cloud-based servers (thanks to Google for their hypothetical support; Amazon passed) using an ultra-powerful 100,000-core computing cluster running AFNI's latest version (AFNI is really fast), including an updated version of the somewhat maligned 3dClustSim--which, we understand, has fixed the optimistic p-value bias. An automated report generator will check the results against the original finding and create a PDF addendum that will be linked to the original article's DOI and will show up in PubMed searches. If there is a replication failure, the paper will be automatically retracted, per agreement with a consortium of participating editors of neuroimaging-related journals, and replaced only with the addendum. Think of it: the sound of 20,000 fMRI papers being retracted. Some laughter, but a lot more crying. A fair amount of personal and scientific chaos is sure to ensue.

This can be done. Even with all the infrastructure in place (the 400 scanners, the compute servers, the MR technicians), it will be hard work. But it's no moon landing, it's not rocket science (as they say); it's more or less routine busywork--coding, scheduling, scanning, data crunching. Moreover, fifty million dollars is comparatively trivial, the cost of perhaps 50 small R01 grants from NIH. Remember, the original research cost billions of dollars (according to my R terminal, which prints out a large number in scientific notation: 4e+08 -- billions, right?).
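(For the pedants: here is roughly what that R session might have looked like. The $10,000-per-study figure is a number I invented on the spot, and yes, 4e+08 is 400 million, which I am cheerfully rounding up to "billions".)

```r
# Made-up cost estimate: 40,000 studies at an invented $10,000 apiece.
n_studies      <- 40000
cost_per_study <- 10000   # pure guesswork, for illustration only

n_studies * cost_per_study
#> [1] 4e+08
format(n_studies * cost_per_study, big.mark = ",", scientific = FALSE)
#> [1] "400,000,000"
```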

Would you fund my proposal? Would I fund my own proposal? Surely the modest effort to replicate 40,000 studies would be worth it, right? But it's not for me to decide. We need to ask the funding agencies, our peers, etc.

One critique we might face is, simply, that the work is not novel. Lack of novelty is the bête noire of many a hardened and unimaginative reviewer. Multiply this lack of novelty by 40,000 and we probably have a triage situation on our hands. No, definitely triaged. Indeed, R1 almost certainly writes as follows, in a farrago of Trumpian emptiness: "Lacks novelty, not feasible, over-ambitious, fails to make a case for impact, power analyses not provided. Investigator is strong, kudos".

While it is true that R1 skimmed the grant on a turbulent United Airlines flight and just went ahead and submitted his usual boilerplate, in this particular case I happen to agree with him (everything but the "over-ambitious" part -- we can do this!). Let me repeat: I broadly agree with R1's negative assessment. The thing is, something about my proposal doesn't sit right with me. Would it perchance be a gigantic waste of time?

A Gigantic Waste of Time

Let me tread carefully here, since I am an fMRI researcher; I believe fMRI is a powerful and useful tool, and that it has revolutionized neuroscience. With that said, I would not argue with anyone who said that the vast majority of fMRI-based science is not mission-critical to the immediate survival of the species.

Put another way: lives are probably not at stake because an fMRI study conducted in 2002 reported activation in the amygdala using an invalid significance cut-off. With fMRI, and the neuroscience that uses it, it's more about the long haul: accumulating evidence, inching one's way forward along a variety of complicated scientific paths, full of false starts, blind alleys, and the odd crawl-hole that takes us to a new place. And make no mistake: evidence from research using fMRI is accumulating, and rapidly, but rarely on the basis of a single study.

Pick 10 fMRI studies at random from a PubMed search. Ask yourself, seriously: does this study need to be redone and checked for validity by recreating the data on an abandoned crop field in Kansas with 400 simultaneously running MRI scanners? The answer in 9 out of 10 cases will be a curt "no". Even if you think the result is false or suspect, the answer is very likely still "no", because, you know, why bother?

Ask yourself a second question. If a paper were shown to have used a faulty p-value correction method, would you rather it stay in the record somewhere, or be summarily retracted? Before you answer, I want to discuss the value of eating your own dog food.

Eating One's Own Dog Food

In fMRI's adolescence (and PET's menopausal years), a curious thing started happening. Researchers began data-mining the published coordinate tables (so-called "Talairach coordinates") and doing ad hoc and, later, quantitative meta-analyses of the data. These "coordinate-based meta-analyses" became very popular, and as of today a PubMed search indicates that at least 273 of them have been published. Even I published one. The 40,000 fMRI studies that we might, in our most draconian moods, like to see retracted like so many wayward social-priming psychology experiments are the very stuff these meta-analyses are built from.

And have you seen Neurosynth? This is a marvel of the modern world: it takes in the worst of the fMRI trash (many of the 40,000 are in there, I promise) and churns out very valid-looking meta-analytic results. Neurosynth thrives on weak effects that can be pooled together to produce a reliable result. It's like a mortgage-backed security for neuroscientists, without the financial crash. The reality is that if you take away the 40,000, you deliver a severe blow to Neurosynth. These studies are the dog food, and Neurosynth is one of its most advanced and effective eaters. It devours the stuff by the hundred. Neurosynth has evolved to such an extent that it eats the dog food, the cat food, the hamster food, the reeking offal on the butcher shop floor--and, yes, even the caviar, if it's in there too.

Single fMRI Studies Rarely Matter, Except When They Do

Let me suggest that there is a kind of unwritten law in fMRI-based research: if a study matters, it will very likely be followed up on. It may not be replicated exactly, but it will be followed up on. A study that matters launches a new "mini-paradigm", and a lot of researchers investigate similar questions with similar methods. These mini-paradigms beget a lot of publications, and then a lot of meta-analyses, and then Neurosynth gobbles up their coordinate tables, paying no attention whatsoever to their method of p-value construction: cluster-based, false-discovery-rated, uncorrected--whatever. It's just dog food, and Neurosynth eats it.

A paper that matters is one that starts a new mini-paradigm. I define "matter" purely in the social sense of "generating research". The studies that matter are a tiny minority of the total worldwide fMRI research output. Let me be clear: a study could be rather poor as a scientific exemplar and yet still "matter". Likewise, a very good fMRI study could be published and then, sadly, die on the vine. It doesn't matter (for now, at least). Sometimes a study is even published in Science or Nature and immediately consigned to the sorry state of not mattering.

Now, if a study is not having an impact, if it doesn't matter in this limited sense--it's not generating research, nobody is using the paradigm--do we need to redo the work? Why? Suppose the original result were overturned by our Kansas MRI fleet. What of it?

Yes, someone will be embarrassed, but what else? Fortunately--or unfortunately, as the case may be--the vast majority of the 40,000 potentially invalid studies belong to this lot. It's simply not a matter of urgency to redo much of this work, and it might even be actively harmful if these studies were removed from the scientific record.

There is another category of papers, however: studies that generate a lot of interest and spawn much new research in one way or another, but are never followed up on with the same paradigm, let alone directly replicated. These are one-offs. You don't find them in meta-analyses; they are, as it were, paradigmatic loners with outsized influence. There are perhaps no more than 100 examples of this category in the history of fMRI and PET research. And let me tell you something. Forget our methodological difficulties with cluster-based p-values and false-positive rates and the idea of "redoing 40,000 studies". Forget all that. Because if we can't even muster the energy to replicate a handful of major "one-off" studies with hundreds of citations and a major influence on the field, how are we going to replicate the innumerable ignored and ignorable fMRI dregs that are cited twice a decade? The issue isn't p-values; it's the will to replicate, of which there is very little.

Quick, name three major replication attempts of high-impact fMRI studies that have not been "followed up on" in the above sense. I'll even throw you a bone: now you only have to name two.

Modest Goal for fMRI Research: Replicate 100 High-Impact (Non-Replicated) Studies

We don't need 400 scanners on a crop field in Kansas running round the clock. We could get by with five. I'm only asking for five scanners! (R1: "The budget is reasonable.") The goal is to replicate the top 100 most influential fMRI studies that have not been directly replicated or otherwise meta-analyzed to death. We do not need to replicate the "n-back" task just because somebody used a faulty p-threshold once in 1998. We're looking for special papers, with special influence, that have never been replicated. It doesn't matter what analysis package or p-value method they used.

The top 100 studies will be determined by a consensus process resulting in a rank-ordered list. We might use graph theory, hubs and all that. Power analyses will be painstakingly carried out for each of the 100 studies. We will still petition Google and Amazon for the 100,000-core server, even if it will frankly be overkill. We will not request any "retractions" of the original work in the case of a replication failure, and we will of course place the resulting fMRI data somewhere eminently accessible, like OpenFMRI.
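To give a flavor of what those per-study power analyses might look like (a cartoon version; a real fMRI power analysis would have to grapple with spatial multiple comparisons, not a single t-test), here is the sort of calculation we would start from in R, with a made-up effect size of d = 0.5:

```r
# Hypothetical starting point for one study's power analysis: the sample size
# needed to detect a within-subject contrast with an assumed effect size of
# d = 0.5, at a stringent alpha and 90% power. All numbers are illustrative.
library(pwr)

pwr.t.test(d = 0.5, sig.level = 0.001, power = 0.90,
           type = "one.sample", alternative = "two.sided")
```

For these invented inputs the required sample size comes out somewhere around 90 subjects, which is already a sobering number next to the sample sizes typical of early fMRI studies.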

R1: "The project is novel and innovative and seems quite feasible. This reviewer, for one, appreciates the detailed power analyses. The investigator is strong. I only have a few nitpicks (detailed below) but these are very minor. Overall, a very strong proposal."