Neuroimaging: Many Analysts, Differing Results

Kayt Sukel
August 4, 2020

For decades, both the research and medical communities have relied on neuroimaging tools like functional magnetic resonance imaging (fMRI) to give them a window into the living human brain. Such scans have provided unprecedented insights into the brain’s structure and function – and the field, as a whole, has used this technique to better understand how the brain gives rise to thoughts, emotions, and actions. But as neuroimaging technology has advanced, so have the different analysis tools and the number of ways one can evaluate the resulting data. Now, the results of a unique research project, the Neuroimaging Analysis, Replication, and Prediction Study (NARPS), suggest that different analyses can lead to strikingly different results from the same data set.

Reproducibility and Analysis

Russell Poldrack, a cognitive neuroscientist at Stanford University and a strong proponent of reproducibility in science (the ability to replicate published studies and get the same results), is no stranger to testing the limits of commonly used neuroscience techniques (see Making the Connectome Personal: One Brain, Many Scans). So, when a group of economists contacted him; his former student Tom Schonberg, now at Tel Aviv University; and Thomas Nichols at Oxford University about how to predict whether a particular neuroimaging study would be replicable, he was intrigued.

“There have been several large-scale replication analyses in the field of psychology, but we weren’t sure how to do that with neuroimaging, given the expense and other limitations,” he said. “We spent some time thinking about what we could do. Ultimately, we decided that, instead of trying to replicate a bunch of studies, we could take one dataset and see how reproducible the analysis outcomes were. The big question being, if you analyze data in different ways, would you get different results?”

Two years earlier, Brian Nosek, a psychologist at the University of Virginia, as well as co-founder and executive director of the Center for Open Science, an organization with a mission to promote open and reproducible research practices, had produced a similar study in the psychology realm. Nosek and colleagues had given an identical data set to 29 analysis teams, asking each to analyze the data and determine whether soccer referees are more likely to give red cards to darker-skinned soccer players. They discovered significant variation in how each team analyzed the data and, as a consequence, differences in what each concluded from that analysis.

“It was a massive undertaking, but we saw something very important in the results,” said Nosek. “It showed us that researchers’ analysis choices matter to what kind of results they find and report. Each researcher, each lab, may do things a little bit differently. I might have one analysis pipeline while my colleague has another. And there is now a good deal of evidence to suggest that, because of that, I may find something different than my colleague in the same exact set of data. That has huge implications for reproducibility.” An analysis pipeline is the chain of steps and decisions researchers make as they analyze the data; the chain can include which software packages they use, how they process raw data, how they group and compare scans, and what thresholds they set to determine whether a hypothesis is supported or not.
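To make the idea of a pipeline decision concrete, here is a minimal sketch in Python of how one such choice, the threshold, can change a yes/no answer even when the underlying numbers are identical. Everything in it is an assumption made for illustration: the z-scores are invented (they are not NARPS data), and the team names, the `region_is_active` function, and the `z_threshold` and `min_fraction` parameters are hypothetical rather than drawn from any team's actual method.

```python
# Hypothetical illustration only: invented numbers, not NARPS data or any team's method.

# Made-up z-scores for eight voxels in a single brain region of interest.
REGION_Z = [2.1, 2.8, 3.3, 1.9, 2.6, 3.0, 2.4, 2.9]

# Three hypothetical teams that share the data but differ in their decision rule.
TEAM_CHOICES = {
    "team_a": {"z_threshold": 2.3, "min_fraction": 0.5},
    "team_b": {"z_threshold": 3.1, "min_fraction": 0.5},
    "team_c": {"z_threshold": 2.3, "min_fraction": 0.9},
}


def region_is_active(z_scores, z_threshold, min_fraction):
    """Report 'active' when enough voxels exceed the chosen z threshold."""
    above = sum(1 for z in z_scores if z > z_threshold)
    return above / len(z_scores) >= min_fraction


def run_all(z_scores, choices):
    """Apply every team's decision rule to the same z-scores."""
    return {
        team: region_is_active(z_scores, **params)
        for team, params in choices.items()
    }


if __name__ == "__main__":
    for team, answer in run_all(REGION_Z, TEAM_CHOICES).items():
        print(team, "yes" if answer else "no")
```

In this toy case, the first team would report activity in the region while the other two would not, although all three started from the same eight numbers. Real fMRI pipelines involve many more decisions than this single rule.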

Many Analysts, One Neuroimaging Dataset

Poldrack and colleagues recruited 70 teams of scientists from around the world to analyze the same brain imaging data set of 108 study participants performing a well-known decision-making task called the mixed gambles task. The task requires participants to decide whether to accept or reject a particular gamble that will lead to gaining or losing money. It is a task that has been widely used to demonstrate that human beings, in general, are much more sensitive to potential losses than gains when making decisions.

“When we first started reaching out to different labs to participate, we particularly focused on the neuroeconomics community, because this is a task that they are familiar with,” said Poldrack. “Once they agreed to take part, we basically just gave them the data set, nine set hypotheses asking if they saw activation in a particular brain area, and a three-month deadline. We told them to analyze the data however they normally would in their labs to address those hypotheses.”

When Mauricio Delgado, a neuroscientist at Rutgers University, was asked to participate, he said he saw it as a great opportunity. Two of his lab members, Jamil Bhanji, a research faculty member, and Emily Brudner, a doctoral candidate, took on the analysis.

“I had some preconceived ideas that we would find what others had found before with this task,” said Brudner. “But, at the same time, one of the reasons we decided to take part is that there was a good chance that idea was wrong. We wanted to see how well we did.”

When the group sat down to discuss how to best analyze the data, they decided to keep things as simple as possible.

“We did it with the simplest model we could in order to address the project’s nine hypotheses,” said Bhanji. “If it had been our own study, we might have had more specific hypotheses we wanted to test, like how a person’s reaction time, for example, might influence his or her brain responses. In that case, we might have come up with a more complicated model to include that. But, ultimately, for the purposes of this study, it seemed best to keep it simple.”

Many Analysts, Different Results

When the 70 analysis groups turned in their findings, Poldrack and colleagues discovered significant variation within the results – especially in five of the nine hypotheses.

“We saw surprisingly good agreement across four of the nine hypotheses. In three of them, in fact, everyone said, ‘No, there’s no activity in this brain region,’ and in the other one, everyone said, ‘Yes, there is activity there,’ with [only] five research groups not being so sure,” said Poldrack. “Ultimately, when we looked at the pattern of results across these different research groups, there was a lot of variability in the nature of their answers.”

Poldrack said he was not surprised to see that. There are now so many steps in any analysis pipeline, he said, and each group handled the analysis just a little bit differently at those different decision points (see a sample pipeline for fMRI data).

“There are so many different software packages now, and different labs use different ones for all manner of different reasons. There are also different philosophies about how analyses should be done. All those little differences can add up,” he said. “But, when we looked closely at what people provided, we could see that the results under the hood were substantially more similar than what they concluded. There was something about going from the intermediate steps of analysis workflow to determining the right threshold to denote a final yes/no answer that changed things. It was really striking.”

Despite their variability, Delgado said that the study results reassured him that his lab’s analysis pipeline is solid. But the results also reinforced the idea that he and his team need to take care to be as specific as possible when they describe their analysis methods in future publications.

“There was actually a lot of consistency, especially when you looked at how confident different labs were in their different results,” he said. “But we also took away that we need to make sure we are reporting every single parameter used in our analyses so that if other labs want to test our hypothesis or use a different analysis pipeline, they can see the exact steps and decisions that got us to our result.”

The Future is Reproducible

While Poldrack acknowledges that some people may be uncomfortable with the study’s findings, he sees the results as an example of “science at its best.”

“Our goal isn’t to just break things or say that the way we do things is wrong. But we do want to be able to see our shortcomings and find ways to make our science even better,” he said. “It’s important to realize the limits on generalizability of any particular analysis pipeline. And that’s why it’s so important that we spend more time learning about the different analysis tools and practices that are out there in order to understand how different choices we make can change what we may find.”

He and his colleagues recommend that researchers not only carefully consider all those choices but also consider running multiple analyses before submitting results for publication. And, of course, every step taken should be carefully documented so other scientists can follow and replicate their process when they want to.

Nosek said he was impressed by the amount of work that went into this study, and hopes that it will encourage more scientific stakeholders to commit to more transparent, reproducible research, regardless of their field.

“These kinds of studies are very good for science,” he said. “The way science makes progress is by the very fact that it doesn’t trust itself. You need to test, to verify, and to replicate. This kind of approach, that highlights how we can do better, provides us with direction on how to make science better. Knowing that these issues are out there is hard, no doubt – but not knowing is much, much worse than knowing and finding a way to fix it. And finding and fixing our approach is how we continue to build and maintain trust in science. It’s the only way forward.”