Language-Aware SCM

[article]
Summary:
Five or six years ago, the SCM arena was in a comfortable “status quo”, in which tools delivered only what developers expected (or even less) and innovation didn’t arrive at a quick pace.And then the DVCS arrived on the scene and turned the SCM world upside down.

But there’s still a lot to do to move acceptance of the DVCS beyond early adopters. What else can be done to help developers in their daily version control experience?

Diffing and merging

Diff and merge tools are two version control features that developers use on a daily basis, but which haven’t received a lot of attention in years. Yes, there are more and better colored interfaces today than ten years ago, but the underlying algorithms are all text based, so they can work in a language-agnostic way. Evolution seems to have skipped this field.

Today I’ll be talking about some of the trends in the next generation of diff and merge tools and how they can impact a developer’s day to day activities.

Refactoring is the key

Let me focus on one of the best practices widely adopted by developers, independent of their programming language or the industry they’re working in: refactoring. Refactoring code applies a simple set of rules and actions to it, transforming its format without altering its behavior. There are many reasons to refactor, with improving readability one of the most important.

What does refactoring have to do with version control? If we apply it at the file level (you know, a Java class lives in a file named after the class), ) it is quite clear that proper tracking of moves and renames is key to supporting simple things like renaming a class or locating it in a different module or package. The same applies for packages and directory trees . As simple as it sounds, good support for moved and renamed files and directories has been around for only a few years (unless you were using one of the high-end commercial SCMs), and was totally out of the scope of some of the widely used open source version control systems before the DVCS era.

But today I’m thinking about operations performed inside a single file: you move a private method down in your class (following what you learned on Clean Code, for instance), you diff it with the previous version of your file, and the typical diff tool identifies two separate changes: one block added and another block removed. There’s no automated way to tell it’s really the same code.

The following picture shows a similar example: a method has been moved up and detected as a “delete-add” pair of changes.

psmar11-1

What if the diff tool was able to find that it’s really a single block of code has been moved?

The result would be something like the following, where the developer is informed about the “move operation”:

2

The former implementation is based on a modified “Levenshtein distance” algorithm and, as such, is language agnostic, so it can only be considered as an initial step towards the goals I mentioned above.

Refactoring can be harder

It is not unusual for developers to move methods around (and perhaps modify them a bit, as well) in a way that even an advanced text based algorithm would have big trouble  figuring out the right matches. But as tough as it sounds for a “text based” diff system, it would be trivial for a “code based” diff: a tool that actually parsed the file, locating its methods, properties, members, etc.,  and then used them as “units of change” It would be straightforward to figure out if heavy modifications to a file were really just a shuffling of its code units.

The idea is not new -- in fact, it’s a typical question developers ask when they’re trained on a new version control system with better merge support: “hey, 

CMCrossroads is one of the growing communities of the TechWell network.

Featuring fresh, insightful stories, TechWell.com is the place to go for what is happening in software development and delivery.  Join the conversation now!