Friday, August 3, 2012

7 Habits of the Open Scientist: #2 -- Reproducible Research

Note: this post is part of a series on habits of the open scientist.  Here I discuss the second habit, reproducible research.  The previous post was on open scientific publishing.

Reproducible research

Reproducibility is part of the definition of science: if the results of your experiment cannot be replicated by different people in a different location, then you're not doing science.  Far from being a mere philosophical concern, reproducible research has been a key issue in prominent controversies such as Climategate and the reliability of the research behind cancer clinical trials.

Especially disconcerting is the typical irreproducibility of scientific work involving computer code:

“Computational science is facing a credibility crisis: it’s impossible to verify most of the computational results presented at conferences and in papers today.” (LeVeque, Mitchell, Stodden, CiSE 2012)

Frankly, I used to find that I was often unable to reproduce my own computational results after a few months, because I had not maintained sufficiently detailed notes about my code and my computing environment.

The open scientist ensures that the entire research compendium -- including not only the paper but the data, source code, parameters, post-processing, and computing environment -- is made freely available, preferably in a way that facilitates its reuse by others.

I won't spend more time motivating reproducible research, since others have done that much better than I could.  Instead, let me focus on the relatively easy first steps you can take to make your research more reproducible.

The bare minimum: publish your code and data

If you wish to set an example of good reproducible computational research practices, I have good news for you: the bar is very low at the moment.  The reason why "it's impossible to verify most of the computational results" is that most researchers don't release their code and data.  The first step toward working reproducibly is simply to put the code and data used in your published research out in the open.

If you don't want to release your code to the public, please read about why you should and why you can.  Once you're convinced, go endorse the Science Code Manifesto.

Releasing your code and data can be as simple as posting a tarball on your website with a reference to the paper it pertains to.  Or you may wish to start putting all your code out in the open on Bitbucket or Github, like I do.  I don't claim that these are the best solutions possible, but they are a big step forward from keeping everything on your own hard drive.
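
For example, here is a minimal sketch (in Python, with made-up file names and an arbitrary archive name) of what posting such a tarball might look like in practice; it is an illustration, not a prescribed workflow:

    # A minimal sketch of bundling the code and data behind a paper into a
    # single tarball for posting on a website.  All file names here are
    # hypothetical placeholders.
    import tarfile

    files_to_release = [
        "README.txt",         # describes the contents and cites the paper
        "run_experiment.py",  # script that produces the published results
        "parameters.txt",     # parameter values used for the published runs
        "data/results.csv",   # raw output data
    ]

    with tarfile.open("mypaper_code_and_data.tar.gz", "w:gz") as archive:
        for name in files_to_release:
            archive.add(name)

The important part is not the packaging mechanism but the README: it should say which paper the archive accompanies and how to rerun the computations.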

When you release your code and data, it is important to use an appropriate license.  Victoria Stodden, a leader in the reproducible research movement, recommends the use of a permissive license like modified BSD for code and the Science Commons Database Protocol for data.  Together with the Creative Commons BY license for media (which I mentioned in my last post), these comprise the Reproducible Research Standard, a convenient amalgamation of licenses for open science.

Be sure to include a mention of reproducibility in your paper, along with links to the code and data.  If you release your work under the RRS, I suggest using this citation.

Real benefits

The open scientist may adopt reproducible research practices for philosophical reasons, but he soon finds that they bring more direct benefits.  Because he writes code and prepares data with the expectation that it will be seen by others, the open scientist finds it much easier for himself, students, and colleagues to build on past work.  New collaborations are formed when others discover his work through openly released code and data.  And (as in the case of this paper, for example) the code itself may be the main subject of publications in journals that have come to recognize the importance of scientific software.

Taking it further

Like free and open scientific publishing, reproducible research has become a very large movement, and only a book could hope to cover it all.  Here I've merely distilled some basic practical suggestions.

Openly releasing code and data is only the first step.  Open scientists may wish to go further and adopt tools that track code provenance and ensure a fully reproducible workflow.
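
To give a flavor of what such tools automate, here is a minimal, hypothetical Python sketch of the kind of provenance information a workflow tool might record alongside each computed result; dedicated tools do this automatically and far more thoroughly:

    # A hypothetical sketch of recording provenance for a single run: the
    # exact code version and computing environment, saved next to the results.
    import json
    import platform
    import subprocess
    import sys
    from datetime import datetime

    def record_provenance(output_file="provenance.json"):
        """Save the git commit, Python version, and platform used for this run."""
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip()
        info = {
            "timestamp": datetime.now().isoformat(),
            "git_commit": commit,
            "python_version": sys.version,
            "platform": platform.platform(),
            "command": " ".join(sys.argv),
        }
        with open(output_file, "w") as f:
            json.dump(info, f, indent=2)

    if __name__ == "__main__":
        record_provenance()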

5 comments:

  1. Great entry, really. I don't (yet) have anything to do with scientific research, but I am so happy to see there are so many resources for those who want to be Open Scientists. Thank you for the insight and the links.

    Many people have been talking about the non-free nature of GitHub and the like. Maybe free alternatives like Gitorious could be adopted, but that one in particular is quite limited in terms of features and community adoption. Let's see what happens in the future.

    1. Thanks, Juanlu. But what do you mean by the non-free nature of Github? Public repositories are free. They only charge for private repositories or organization accounts. Academics can get a free organization account (https://github.com/edu) and private repositories aren't what I'm advocating.

    2. GitHub as software, in the free-as-in-freedom sense. You know, some Stallman-ish concerns :P

  2. Great post! We've been working on making our work as fully reproducible/replicable as we can, also using Github to distribute both code and data (e.g., https://github.com/weecology/white-etal-2012-ecology/). We tend to use MIT instead of BSD since, as a wise man once told me, "University intellectual property departments don't understand open source, but they do know what MIT is."

    Also, you've got a couple of broken links: all of the Github-related links, the Bitbucket link, and the RRS citation link, at least.

    1. Thanks, Ethan. I believe all the links are fixed now. I recently started using MarsEdit and obviously still have more to learn.
