Tuesday, August 28, 2012

NodePy version 0.4 released

NodePy is a Python package for analyzing numerical integrators for initial value ODEs.  It's essentially a collection of all the kinds of analysis I've used in my time integrator research, gathered in a single object-oriented package.

If you have a new Runge-Kutta method and want to know all about it, NodePy can tell you most anything.  If you want to design new time integration methods, NodePy can help you.

Although I'm rather proud of it, it fills a very small niche in the world and I'm not aware of anyone using it outside of my group and close collaborators.  If you've used it, please let me know in the comments.

One of the thorniest issues in NodePy previously was that floating-point representations of method coefficients were sometimes insufficient, especially when studying very high order methods.  I've now updated NodePy to use SymPy Rationals (and radicals, etc.) wherever possible, allowing exact analysis of many properties.
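To illustrate why exact coefficients matter, here is a minimal sketch that checks the order conditions (up to order 4) of the classical four-stage Runge-Kutta method.  It is not NodePy code and does not use NodePy's API: it uses Python's standard-library `fractions.Fraction` as a stand-in for SymPy Rationals, and the helper name `order_condition_residuals` is made up for this example.

```python
from fractions import Fraction as F

# Butcher coefficients of the classical 4-stage, 4th-order Runge-Kutta method.
A = [[0,       0,       0, 0],
     [F(1, 2), 0,       0, 0],
     [0,       F(1, 2), 0, 0],
     [0,       0,       1, 0]]
b = [F(1, 6), F(1, 3), F(1, 3), F(1, 6)]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]


def order_condition_residuals(A, b):
    """Residuals of the eight order conditions up to order 4.

    Hypothetical helper for illustration; all residuals are zero
    exactly when the method has order at least 4.
    """
    c = [sum(row) for row in A]          # abscissae c = A * 1
    c2 = [ci ** 2 for ci in c]
    c3 = [ci ** 3 for ci in c]
    Ac = matvec(A, c)
    return [
        sum(b) - 1,                                      # order 1
        dot(b, c) - F(1, 2),                             # order 2
        dot(b, c2) - F(1, 3),                            # order 3
        dot(b, Ac) - F(1, 6),                            # order 3
        dot(b, c3) - F(1, 4),                            # order 4
        dot(b, [ci * x for ci, x in zip(c, Ac)]) - F(1, 8),
        dot(b, matvec(A, c2)) - F(1, 12),
        dot(b, matvec(A, Ac)) - F(1, 24),
    ]


# Exact arithmetic: every residual is exactly zero.
exact = order_condition_residuals(A, b)

# The same check with floating-point coefficients.
A_float = [[float(a) for a in row] for row in A]
b_float = [float(x) for x in b]
approx = order_condition_residuals(A_float, b_float)
```

With exact rationals, each entry of `exact` can be compared to zero directly.  With floats, the entries of `approx` are only small (on the order of rounding error) and are not guaranteed to be exactly zero, so the check needs a tolerance; that tolerance becomes harder to choose as the number of stages and the order grow.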

That and much more awaits in NodePy version 0.4, now available via pip.

Wednesday, August 22, 2012

7 Habits of the Open Scientist: #3 -- Pre-publication dissemination of research

Note: this post is part of a series on habits of the open scientist.  Here I discuss the third habit, pre-publication dissemination of research.  The previous post was on reproducible research.

A personal story

In 2003-2004, as a senior undergraduate, I got involved in research on strong stability preserving (SSP) Runge-Kutta (RK) methods.  I noticed a number of "numerical coincidences" -- certain numbers characterizing ostensibly different properties of RK methods always happened to be exactly the same.  I didn't yet have the necessary background to fully prove the conjectured connection, but after months of work, I finally succeeded in completing a partial solution to the problem, which I wrote up as my undergraduate thesis.  Before I could submit a manuscript for publication, I discovered that two other researchers had just published the full result.  Hence my manuscript was, of course, unpublishable.

Occasionally, situations like this are inevitable.  But those researchers had worked out and written up the result at least a year ahead of me -- before I even began the work in earnest.  If their work had been available to me at the outset, I could have devoted my time to unanswered questions.

Refereeing is slow; distribution is fast

In my field (applied math), it often takes more than 1 year for a submitted paper to be published.  This is because thoroughly refereeing a manuscript takes time, and I think that time is worthwhile.  In contrast, I can "publish" a new paper on the arXiv in just 48 hours, or on my professional website instantaneously.

Many readers may not wish to see my work until it has been refereed.  But for those working on similar problems, reading my work 1 year earlier can be very useful by pointing out promising new avenues or avoiding duplication of effort.

The open scientist distributes his publishable research openly before the formal refereeing and publishing process, by placing completed manuscripts on a preprint server like the arXiv.

If you're brave, you can also share your grant proposals openly.


The first time you do this, you may feel worried that someone else will 'steal' your preprint and publish it before you.  But posting it on the arXiv makes it public and stamps it with a date, so such theft would be obvious to everyone.  You may be worried that others will steal your ideas and immediately begin working on your next planned research question.  But if you're like me, the number of related research questions to pursue is essentially endless and you'd be fortunate if your efforts attract others to work on closely related topics (for the truly self-interested, note that it will increase your citation count).  Finally, in some fields or subfields, there is cultural resistance to making preprints public; you can see my take on the issue in this prior blog post.  But there are signs that posting preprints is gaining wider acceptance.

If everyone began to practice this, it would effectively transform the role of journals.  They would no longer be the primary distribution apparatus; their role would be that of filtering the already-published literature.  This would make a lot more science available a lot sooner, without sacrificing the usefulness of peer review. 

Next up: Habit #4 -- Open notebook science.

Friday, August 3, 2012

7 Habits of the Open Scientist: #2 -- Reproducible Research

Note: this post is part of a series on habits of the open scientist.  Here I discuss the second habit, reproducible research.  The previous post was on open scientific publishing.

Reproducible research

Reproducibility is part of the definition of science: if the results of your experiment cannot be replicated by different people in a different location, then you're not doing science.  Far from being a mere philosophical concern, reproducible research has been a key issue in prominent controversies like Climategate and cancer research clinical trials.

Especially disconcerting is the typical irreproducibility of scientific work involving computer code:

“Computational science is facing a credibility crisis: it’s impossible to verify most of the computational results presented at conferences and in papers today.” (LeVeque, Mitchell, Stodden, CiSE 2012)

Frankly, I used to find that I was often unable to reproduce my own computational results after a few months, because I had not maintained sufficiently detailed notes about my code and my computing environment.

The open scientist ensures that the entire research compendium -- including not only the paper but the data, source code, parameters, post-processing, and computing environment -- is made freely available, preferably in a way that facilitates its reuse by others.
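As one small, concrete piece of this, the computing environment and run parameters can be captured automatically alongside each result.  The following is a minimal sketch using only the Python standard library; the function name `environment_record` and the example parameter names are hypothetical, and a real compendium would also record the versions of the libraries and the revision of the code used.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def environment_record(parameters):
    """Collect basic facts about this run (hypothetical helper).

    `parameters` is whatever dictionary of settings produced the result.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "parameters": parameters,
    }


# Example parameters are made up for illustration.
record = environment_record({"method": "RK44", "dt": 0.01})

# Serialize to JSON so the record can be saved next to the output data.
record_json = json.dumps(record, indent=2, sort_keys=True)
```

Writing `record_json` to a file next to each set of results makes it much easier to answer, months later, which settings and which machine produced a given figure.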

I won't spend more time motivating reproducible research, since others have done that much better than I could.  Instead, let me focus on the relatively easy first steps you can take to make your research more reproducible.

The bare minimum: publish your code and data

If you wish to set an example of good reproducible computational research practices, I have good news for you: the bar is very low at the moment.  The reason why "it's impossible to verify most of the computational results" is that most researchers don't release their code and data.  The first step toward working reproducibly is simply to put the code and data that are used in your published research out in the open.

If you don't want to release your code to the public, please read about why you should and why you can.  Once you're convinced, go endorse the Science Code Manifesto.

Releasing your code and data can be as simple as posting a tarball on your website with a reference to the paper it pertains to.  Or you may wish to start putting all your code out in the open on Bitbucket or Github, like I do.  I don't claim that these are the best solutions possible, but they are a big step forward from keeping everything on your own hard drive.

When you release your code and data, it is important to use an appropriate license.  Victoria Stodden, a leader in the reproducible research movement, recommends the use of a permissive license like modified BSD for code and the Science Commons Database Protocol for data.  Together with the Creative Commons BY license for media (which I mentioned in my last post), these make up the Reproducible Research Standard, a convenient amalgamation of licenses for open science.

Be sure to include a mention of reproducibility in your paper, along with links to the code and data.  If you release your work under the RRS, I suggest using this citation.

Real benefits

The open scientist may adopt reproducible research practices for philosophical reasons, but he soon finds that they bring more direct benefits.  Because he writes code and prepares data with the expectation that it will be seen by others, the open scientist finds it much easier for himself, students, and colleagues to build on past work.  New collaborations are formed when others discover his work through openly released code and data.  And (as in the case of this paper, for example) the code itself may be the main subject of publications in journals that have come to recognize the importance of scientific software.

Taking it further

Like free and open scientific publishing, reproducible research has become a very large movement, and only a book could hope to cover it all.  Here I've merely distilled some basic practical suggestions.

Openly releasing code and data is only the first step.  Open scientists may wish to adopt tools that track code provenance and ensure a fully reproducible workflow, such as

Wednesday, August 1, 2012

7 Habits of the Open Scientist: #1 -- Open publishing

Note: this post is part of a series on habits of the open scientist.  Here I discuss the first habit, open scientific publishing.


Why you should publish openly

A hallmark of important scientific work is that it is reused, modified, and built upon by other scientists.  As a scientist, I spend a great deal of time and effort advertising my work to others so that they will read it and use it.  


By default, scientific works are subject to copyright law, which is intended to prevent reuse and modification.  To make matters worse, the copyrights are typically held by publishers who charge a fee just for access.  Copyright makes sense for musicians and popular authors, because they make a living by charging for access to their works.  But as a scientist, I don't get paid by those who read and use my work, nor do I seek to.  So copyright does not serve me, even from a purely self-interested perspective.


Stepping back from personal interest, I believe that academic scientists have a moral imperative to freely distribute their work, for two reasons.  First, in academia science is primarily funded by taxes.  Therefore, it has been 'purchased' by the public and cannot rightly be withheld from them.  Second, and more importantly, science is intended to benefit humanity.  If it is to do so, it must be shared and communicated.  That is why it has been said that "science must push copyright aside."


The open scientist proactively ensures that published research is freely and conveniently available to all.  Ideally, the open scientist releases research under a license like Creative Commons BY that explicitly allows use in derivative works as long as attribution is given.


How you can provide free, open access to your work

  • Green open access (self archiving): independently of publication in a journal, the author uploads a pre-print, post-print, or final published version of the article to an institutional server, preprint server, or personal webpage.  Anyone can download this version of the article for free.  The author pays nothing and the reader pays nothing.

  • Gold open access: The author pays the publishing journal a fee in order to have the article available for free on the publisher's website.  Author charges typically are in the range of hundreds to thousands of dollars. 

I have written elsewhere about the dangers of the gold open access model.  Suffice it to say that the gold open access approach severely limits which journals I can submit to and consumes my research funds, whereas the green model does not.  I post all my preprints on the arXiv and on my professional website before submission.  Where allowed, I post final versions as well.

Many journals still have restrictive policies that prevent green open access.  If you believe this to be the case for journals that you publish in, it's worth checking to be sure.  You can easily find this information in the Sherpa/Romeo database.  The number of publishers who still don't allow any kind of green open access is surprisingly small.  For instance, even the evil Elsevier typically allows archiving of pre- and post-prints.


Best practices: pushing copyright aside

If you are brave, you can even modify the journal's copyright transfer agreement to allow you to retain copyright and release your work under Creative Commons BY.  This is also (surprisingly) often accepted by publishers.  I haven't done this yet, but I plan to do so with all of my future papers, including those currently under review.

This first habit of the open scientist is essential but no longer revolutionary.  The open access movement has really picked up speed in the past year, with many petitions and initiatives by governments and funding agencies moving forward.

Next up: Habit #2 -- Reproducible research: open code, open data.