Science Buffs is covering an upcoming Keystone Symposia live streaming event called “Rigor in Science: The Challenge of Reproducibility in Biomedical Research.” A panel discussion (filmed earlier this year) will be streamed for free on November 8th, 2017 and will be followed by a live Q&A where members of the public can pose questions. To aid the discussion, the next three posts on our blog will be about reproducibility in science, with an emphasis on biological science.
If Your Experiments Are Better, You Can Do More of Them: The Importance of Experimental Design for Reproducible Science
Last year, I tried for months to replicate an experiment from the scientific literature of my field. The paper told an exciting story in a high-profile and well-regarded journal, but I could never get the experiment to work—or at least, I could never produce the same results that were published. It was a tricky procedure that I had never done before and I just thought I was screwing it up. Graduate students aren’t known for their staggering competence with new protocols, after all, and I’ll be the first to admit that my inability to replicate the results could well have been my own fault.
But what if it wasn’t my fault? What if I did the experiment a hundred times, with a different set of variables each time, and I never got the same result as the published paper? Would I be a victim of the reproducibility crisis in science?
Reproducibility, or the ability to get the same results when an experiment is repeated, is a cornerstone of science. If I clicked my heels together, was magically transported to Kansas, and then called the results the “Dorothy Principle,” wouldn’t you want to see it at least twice more before you believed it? That’s true for most scientific conclusions, even minor ones. When scientists can’t replicate the results published in their field, it makes them nervous and a bit suspicious. That was certainly true for me.
There are a few possible reasons that an experiment won’t replicate. The first is that the experiment was badly designed, and the statistics used during analysis didn’t catch that the exciting result was probably due to chance. This is an issue of experimental design and proper use of statistical methods.
The second possible reason that an experiment didn’t replicate is that the experiment in question would always have the same results, but only in a very particular set of circumstances. This is a slightly different problem and raises a hotly-debated question: how much faith can we put in the conclusions of any one experiment?
The upcoming Keystone Symposia panel (described above) about reproducibility in science is addressing both of these issues—badly designed experiments and biological reproducibility—with a focus on the second issue. In this article we focus on badly designed experiments, and a follow-up post will discuss biological reproducibility.
I don’t know if the experiment that I failed to replicate last year was badly designed or not because it wasn’t described in much detail in the paper—a problem we’ll address more in the second post. Luckily, it wasn’t critical to replicate for my own work. The fact that I didn’t get the same results as another group didn’t have major consequences. But sometimes the inability of scientists to replicate the results of other scientists does have far-reaching effects.
Take the story of C. Glenn Begley, a researcher at Amgen. Begley attempted to replicate the results of 53 papers he identified as “landmark” publications in the field of cancer research. A staggering 47 of them could not be replicated by his group, making them useless for follow-up studies on drug development. If other researchers had designed drugs based on those original 53 papers, billions of research dollars could have been spent with no effective cancer drug in sight.
Begley’s work highlights a greater problem in science, one that has lately received a certain amount of negative press. Exciting, sometimes paradigm-shifting research gets published in a high-profile journal, and the press goes nuts for it. It’s only later that people start complaining that the research isn’t reproducible.
Dr. Michael Holtzman, Director of Pulmonary and Critical Care Medicine at Washington University School of Medicine, says it’s a well-known problem in his field.
“Even in the same lab with a different person it can be hard to reproduce biology,” said Holtzman. “But we have repeatedly had problems with taking a method out of the literature, trying to make it work in the lab. It’s challenging.”
Experiments that are badly designed or analyzed with sloppy statistics often fall into the category of experiments that are hard to replicate. The Keystone Symposia panel was moderated by Richard Harris, an NPR correspondent and science journalist who wrote a book called Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions. The panel discussed, among other things, new requirements from the National Institutes of Health (NIH) and scientific journals that are attempting to increase reproducibility of results by enforcing rigor in experimental design and analysis.
Journals, for example, are requiring the submitting authors to include more information about how they performed their experiments, as well as requiring them to upload much of the raw data they collect. Harris is optimistic that changes like this will increase people’s awareness of fundamental questions that lead to better, more reproducible experiments.
“Just thinking about these issues helps people ask those basic questions,” says Harris. “Have I really designed the experiment in the right way? Do I have enough animals? Am I doing it blinded correctly? For this sample size am I using the right statistics? Am I saying in advance what my hypothesis is, as opposed to running an experiment and then deciding after the fact what my hypothesis should have been?”
All of these questions, if properly addressed, lead to more thoughtful experiments. More thoughtful experiments are more likely to result in conclusions that will be replicated in other labs.
After a thoughtful experiment has been designed and executed, somebody has to crunch the numbers on the data. Much has been said about the issue of “p-hacking,” or manipulating the data or the analysis until the statistics produce the desired output. This is surprisingly easy to achieve with carefully “massaged” data, but it leads to conclusions that are very difficult to reproduce.
Similarly, a data set that is too small, or analyzed incorrectly, could lead you to believe that a result is exciting when it is simply a matter of chance. For example, if you designed an experiment with eight people (four of whom ate chocolate for a week and four of whom abstained) and then measured 20 different variables related to health, you would probably find at least one apparently positive health outcome in the chocolate-eaters by chance. This does not mean that chocolate is good for you. It simply means that your sample size was too small, and the number of things you measured too large, to glean any real information.
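To put a rough number on the chocolate example, here is a minimal Python sketch. It rests on assumptions that are not in the example itself: each of the 20 health variables is tested separately, the tests are independent of one another, chocolate truly has no effect, and a result counts as "exciting" when its p-value falls below the conventional cutoff of 0.05. The function names are made up for illustration.

```python
import random


def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    looks significant at level alpha when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests


def simulate(num_tests, alpha=0.05, trials=20000, seed=0):
    """Estimate the same probability by simulation.

    When there is no real effect, a p-value is uniformly distributed
    between 0 and 1, so each test is modeled as one uniform random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One imaginary study: does any of its tests dip below alpha?
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"analytic:  {chance_of_false_positive(20):.3f}")
    print(f"simulated: {simulate(20):.3f}")
```

Under these assumptions the analytic value is 1 - 0.95^20, or about 64 percent, so a study like this would turn up at least one spurious "finding" more often than not. Real health measurements are usually correlated with each other, so the true figure would differ, but the direction of the problem is the same.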
Harris says that all of these experimental design details should be reported and thought about deeply.
“It’s not like there’s an NIH police out there busting down the doors of labs and reporting on all of these things,” said Harris. “We’re just telling people that these are the expectations, these are the questions we want you to think about.”
Hopefully a well-designed experiment gives you the same results no matter how many times you do it. But sometimes the picture is more complicated than that. In our next post on reproducibility, we’ll discuss how a complex system can work against scientists (primarily in biology and biomedical fields) who aim to design reproducible experiments.
By Alison Gilchrist