Git for researchers

In my previous job, as a grad student doing computational/biomedical research, I used Git to manage my projects.

For small projects, people usually treat CVS/SVN as checkpointing tools—tools to get you back to a known good state when you've screwed up. Git, however, provides a whole new vocabulary you can use to talk about creating, altering, composing, combining, splitting, undoing, and otherwise manipulating changes to code (commits). It helps you get stuff done faster every day, not just when you mess up.

Here are a couple of reflections and "lessons learned" on really using VCS to your advantage in a research environment, where some of the rules of thumb are a bit different from those in industry.

(They seem so stunningly obvious now that I've committed them to writing, but they seemed much less so when I first articulated them to myself.)

Retaining history, all of it. I have found git merge -s ours to be very handy. It produces a merge commit and merge topology, tying in the history of the other branch, but without applying any of the changes produced in that branch.

Typically, if a feature doesn't pan out, you delete the corresponding branch and destroy all evidence that you tried. But in exploratory or research contexts, the details of your failed experiments can be quite important. You might need to revisit some past state in order to perform further investigation. Or maybe you want to obtain some numbers for a paper or presentation.

Graphically: imagine you have a "successful" branch feature1 and a "failed" branch feature2 (the "Before" diagram below). You don't want to git branch -D feature2, since that could cause its history to be lost. If you instead git merge -s ours feature2, you get a topology where the states from both branches appear in your git log (the "After" diagram), but the state at the tip is the same as that at feature1.

Before:

* ddddddd (refs/heads/feature1)
* ccccccc
* bbbbbbb
| * 2222222 (refs/heads/feature2)
| * 1111111
|/
* aaaaaaa

After:

* eeeeeee (refs/heads/feature1) "Merge branch 'feature2'."
|\
* | ddddddd
* | ccccccc
* | bbbbbbb
| * 2222222 (refs/heads/feature2)
| * 1111111
|/
* aaaaaaa
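
Concretely, the commands look something like this (a minimal sketch; the branch names match the diagrams above):

git checkout feature1            # switch to the branch whose contents we keep
git merge -s ours feature2       # tie in feature2's history, discard its changes
git log --graph --oneline --all  # inspect the resulting topology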

This kind of setup makes tracking your progress super easy. My git log basically becomes the scaffolding for my research notebook. I have bare-bones notes like the following:

Commit 2222222: this change did not improve quality at all. Furthermore it runs much slower, probably because blah blah blah blah. See full output in /home/phil/logs/2222222.

The great thing is that now every result (whether a success or a failure) has, attached to it, a commit name: a pointer to the exact code that generated that result. If I hadn't had complete change history so easily available, I would have spent half of my time second-guessing results I'd already obtained.
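
Revisiting such a past state is then trivial (a sketch, reusing the commit name from the note above):

git checkout 2222222    # detached HEAD at the exact code that produced the result
# ...rerun the experiment, gather more numbers...
git checkout feature1   # return to the branch tip when done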

This application also demonstrates the strengths of DVCS versus CVCS. Research and software development do not happen in a clean linear way. There is lots of backtracking, and you can't always expect to work effectively with a VCS whose basic model is "one damn commit after another."

Summary: 90% of everything ends in failure. Keeping your failure history around (as well as your success history) is underemphasized.

Long-lived branches vs. refactoring. If you know what you're going to do in advance, then it's not called research. In my work, what I ended up writing on a day-to-day basis depended more on experimentation and testing than on planning and specs. Here's some sample code for illustrative purposes:

# (1)
def my_function(a, b):
    foo = random_sample()  # Random heuristic
    something(foo)
    ...

I want to find out how the following code stacks up against (1). Does it perform better? Is it faster?

# (2)
def my_function(a, b):
    foo = shortest_path(a, b)  # A better(?) heuristic
    something(foo)
    ...

In reality we might be evaluating alternative heuristics (as here), different numeric parameters, alternative algorithms, or an alternative data source (e.g., training vs. testing data).

Sometimes, when there are a number of alternatives, the right thing to do is to refactor to parameterize the code, for example,

# (3)
def my_function(a, b, heuristic='shortest_path'):
    if heuristic == 'random':
        foo = random_sample()
    elif heuristic == 'shortest_path':
        foo = shortest_path(a, b)
    else:
        foo = ...  # Additional logic...
    something(foo)
    ...

But every parameterization increases complexity. The new argument is something you have to think about every time you or someone else tries to read your code. Your function is longer, leaks more implementation details, and provides less abstraction. So you don't want to go down this route unless it's necessary. If one choice is a clear winner, and every invocation is going to pass the same argument, then the extra generality you introduced is a liability, not an asset. Doing that refactoring can be a lot of work without much reward.

So you want to run and evaluate the alternatives before refactoring. People who find themselves in this situation often write code like this:

# (4)
def my_function(a, b):
    foo = random_sample()
    ## Uncomment the next line if blah blah blah
    # foo = shortest_path(a, b)
    something(foo)
    ...

which is convenient to write, but setting all the switches by hand whenever you want to run it is rather error-prone, especially if the difference is more complicated than one line.

Branching saves the day by letting your tools manage what you were doing by hand in (4). You can compare alternatives like (1) and (2) above against each other if you keep them in parallel branches (granted, you can't select between the alternatives at runtime, but that may be OK). Maintenance is a breeze: with git merge it's easy to maintain multiple parallel source trees, differing by just that one line, for as long as you please. And because every state you test is committed, your results are 100% reproducible (if you were messing with your files by hand, then in order to reproduce a code state you would have to specify not only a commit name, but also which lines you had commented and uncommented).
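
For example, the parallel trees might be managed like this (a minimal sketch; the branch name is hypothetical, and master is assumed to hold alternative (1)):

git checkout -b shortest-path master   # branch off master, which uses random_sample()
# ...edit my_function to call shortest_path(a, b), then:
git commit -am "Try shortest_path heuristic"

# Later, after shared improvements land on master,
# carry the one-line difference forward:
git checkout shortest-path
git merge master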

After branching, you can mull it over and obtain data on all the alternatives. When you've made your decision, you either drop one implementation and end up with (1) or (2), or, if you need the generality, refactor so you can choose between the alternatives at runtime (3).

Summary: lightweight branches allow you to defer the work of refactoring rather than having to pay for it up front. They greatly improve the hackability of code by letting you try out many different alternatives reliably and without much hassle.

2 comments:

  1. Wow. This is a very thoughtful description of why version control is useful for research. Wish I had used git now. I often do the parameterize thing or the comment-out-code-alternatives thing.

    Though, I'm not sure what advantages git provides over svn. Should I really switch to git or mercurial?

    Also, sometimes, I'm not disciplined in my use of version control and I mess up. (Forget to sync before editing for example.)

    Please enlighten me!

  2. SVN can in principle support almost all of this workflow, but it's still more cumbersome to work with branches in SVN (better than it used to be, though). For example, in SVN, to see the full change history of a file that was merged, you have to look both on the trunk and in the corresponding branch directory. Not to mention, it's more complicated to create and keep track of branches in the first place than it is in git.

    Because of issues like this, people tend to not use the branching and merging features in SVN except for large and lengthy changes. So while you could use SVN, I think you would need quite a bit of discipline to get the advantages of a git workflow (e.g. being able to reproduce every state you've ever tested). I find git's basic model/workflow to be more elegant and better suited for this purpose.

    Regarding general VC habits: having distributed version control (git, hg, or bzr) solves a lot of the concurrency problems that tend to come up with CVS/SVN.

    For example, if you forget to sync before editing, it doesn't matter. You can commit as usual and as many times as you like to your local copy. When next you synchronize with the server, git incorporates the changes (usually 100% automatically). The local and remote changes are presented as two coherent and parallel lines of development that eventually get "sewn together" to form a state containing the changes from both lines. You don't have to worry about your local copy being sync'd up when you start work, it just gets fixed later.
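
    In commands, that workflow is roughly (a minimal sketch):

    git commit -am "Local work"   # commit locally as usual, even if out of date
    git pull                      # later: fetch the remote changes and merge them in
    # If both sides changed, git creates a merge commit sewing the two lines together.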

    More generally, git also provides a number of features that help you fix up other things after the fact (like moving/copying patches around), so that even if you make a mistake (e.g. making a change on the wrong branch) it's easy to get yourself out of a jam. I find that makes it a lot less intimidating than RCS/CVS/SVN.
