What are the advantages and disadvantages of Distributed vs Centralised SCM?

Robert Cowham asked on November 8, 2010 - 10:07am | Replies (13).

I'd be interested in feedback from others on this subject, particularly comparing and contrasting the use of distributed tools such as Git and Mercurial (and others) vs. their more centralised brethren, be they open source (Subversion) or commercial.

The requirements for developing something like Linux are certainly different to those for many commercial organisations IMO.

Thoughts?

13 Answers

baynes replied on November 10, 2010 - 5:26am.

Commercial organizations generally want to ensure that their source is safe in a centrally managed repository rather than scattered in unknown places. But they also have needs for distributed tools. When working with external partners (or even integrating code from separate internal organizations that keep their own repositories), the ability of distributed CM systems to pass a selected code branch from one location to another is valuable. Another case is field engineers who need to make changes to code on a customer site and don't have access to the central repository at the time. This is not a problem with a distributed CM system.
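
For illustration, here is a minimal sketch of passing a selected branch between disconnected sites using Git's bundle mechanism (the branch and file names here are hypothetical):

    # At the remote site: package everything on feature-x that main lacks
    git bundle create feature-x.bundle main..feature-x

    # Transfer feature-x.bundle by any means (email, USB stick, courier);
    # then, at the receiving site, check and fetch the branch from the file
    git bundle verify feature-x.bundle
    git fetch feature-x.bundle feature-x:feature-x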

So ideally they want the best of both distributed and centralized systems. With a bit of work to improve the ease of use of shared repositories and of pushing work from a private to a central repository, Git would be a very attractive system for commercial use.
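
As a rough sketch of that private-to-central flow in Git (the URL and branch name are made up):

    # Clone the central repository and work locally, even while offline
    git clone https://central.example.com/product.git
    cd product
    git checkout -b fix-1234
    git commit -am "Repair on-site configuration fault"   # recorded locally only

    # Once connectivity (and review) permit, publish to the central repo
    git push origin fix-1234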

Joe Farah replied on November 11, 2010 - 9:55am.

Here are my thoughts...

1. If your code is distributed across many sites/depots/servers/whatever, does that mean that your ability to do a full build is affected whenever one of them, or the network to it, is down?

2. With distributed code, how do you do a consistent backup of your entire product?

3. I like multiple site solutions, as long as there is no need to do partitioning and re-synchronization operations. This seems feasible only if the entire product exists at all sites, or at least at some subset of the sites.

4. When you distribute code across sites, it's probably more important that you have an at-rest encryption capability, because it's typically under less control. So if someone gets access to the files, they still don't have the goods.

5. Centralized repository with good remote access is nice, but it doesn't cut it when you start doing things like full product delta reports (vs your local workspace), full builds, etc. unless you have high speed connectivity all the way through.

These are all real concerns that have to be addressed by an SCM tool, and indeed by an ALM tool. I don't like tools that make me administer separate multiple site solutions for each component of the ALM solution (and usually not for all of them).

In CM+ here's what we do.

1. Centralized repository
2. Multiple Site option allows replication of all transactions at all sites in real time. (So each site looks and feels like a single site centralized repository.)
3. At-rest encryption option (applied at library creation time)
4. Ability to restrict certain files (or file types, products, etc.) to specific sites.
5. Access to file controlled by user roles/permissions, not by location of the file. So change sites and you have the same data and same permissions based on your user id.
6. Apply multiple site capability across all ALM functions, not just source code.
7. Use multiple site feature to provide warm-standby disaster recovery and live, up-to-date on-line backups.
8. Allow you to disconnect a site and have full read access and limited write access to the repository, so you can take it to the space station, on a flight, or out to sea.
9. Allow automatic recovery from network outages so that if you're connected to the network on the space station and you lose connectivity in some parts of your orbit, you are automatically resynched when you regain connectivity.
10. Allow remote access both through a native interface, with intelligent caching, or through a web interface.
11. Allow near-zero administration for the multiple site solution (CM+ MultiSite).
12. Ensure that schema changes (for your meta data), and process changes are automatically propagated across sites in near real time.
13. Provide an option for automatic propagation of user interface customization (by default, this can be site specific).
14. Ensure that all inter-site traffic is encrypted.
15. Use the multiple site framework to monitor synchronization for any potential problems.

For a commercial organization, centralization is important. I don't buy the suggestion that distributing data minimizes backup times, server delays, etc. If that is the case, you're using old "BIG-IT" technology, which is generally server-centric, instead of using smart clients.

I really don't think cloud computing should apply to CM/ALM, unless it's a pseudo-cloud (e.g., having IBM host your repository for you).

So those are my thoughts, along with how we have integrated these thoughts into the CM+ product.

jwschaeff replied on July 11, 2011 - 5:22pm.

My vote is Centralized. Distributed tools (i.e., Git) are not suitable version control systems for mature software companies.

Specific to Git: unless your company is willing to shoulder a significant training expense (time, errors, and personnel), I'd avoid it.

Git is counterproductive for commercial companies: it has wasted my developers' time, slowed projects, introduced errors in releases, and made it harder to find trained candidates who can come in and hit the ground running. GitHub revolves around a social website concept like Facebook... so if you're just looking for 'cool' coders out there and want to follow their work, sign up. But don't expect to get any real work done.

Marc Girod replied:

lincoln wrote: "My vote is Centralized. Distributed tools (i.e., Git) are not suitable version control systems for mature software companies."

But there are fewer and fewer of those.

Anyway, I disagree with your arguments as well as with your conclusion.

Commercial systems like UCM have wasted the time and efforts of developers for 10+ years. Commercial companies cannot afford to be held captive by vendors.

Anyway, companies raising money on the stock exchange only try to save money in the short term, so these considerations do not even affect them.
As for the security aspect, data is best secured by being replicated, not held centrally. Secretive, single-point-of-failure systems are simply doomed.

Marc

jptownsend replied:

lincoln,

I couldn't agree more. I downloaded Git yesterday. The only thing I can say is: what a piece of junk. It is absolutely not intuitive, follows no conventions, and it errored out and locked up when I tried to commit files to the repository.

I have said this before: when you get freeware, you get exactly what you paid for, which is nothing. At least tools like ClearCase, Version Manager, SourceSafe, etc. have a similar look and feel.

Regards,

Joe

Joe Farah replied on July 12, 2011 - 11:02am.

Marc,
While I agree that UCM and other semi-fixed process-based centralized tools have kept companies captive, certainly you must agree that Neuma's CM+, the most centralized CM/ALM tool, is quite the opposite.

It is priced low, typically under $1000/user (for ALL ALM functions, not just CM, and this includes its CM+MultiSite, the most advanced multiple site CM/ALM tool in the industry).

It allows full customization of Process, User Interface (menus, quick links, tool bars, dashboards, dialogs, etc.), Roles, Data Schema, etc., while providing a very extensive but lean (approx. 1+ line of code per configured item) customization capability, well beyond the likes of script-intensive tools. Customization is INTERACTIVE, FUN, EASY and FAST.

The underlying technology ensures longevity of the solution. The oldest CM/ALM libraries are more than 20 years old and still contain every bit of history.

To further ensure that corporations are not locked in, it has near-zero administration - that means you don't need a CM tool administrator, only someone who can occasionally help with upgrades (typically minutes) and do some customization tweaks.

Neuma is always willing to help with customization, usually at no cost, since customization is done at such a high level that it is easy to do, in most cases the same day, through an email exchange.

And because CM+MultiSite extends across the entire repository, it's not just the version control that benefits from it - every ALM function, AS WELL AS ANY NEW FUNCTION ADDED BY THE CUSTOMER.

The integration of all the ALM functions plus additional custom functions into one interface, with one point of minimal administration, fully custom role-based user interfaces, and easy customization leads to a dramatically advanced user experience.

Point-and-click, rapid data navigation.
All my important data accessible through a single dashboard.
Advanced, 35 year-old mature CM technology that reduces the complexity of the user experience while advancing the capabilities of CM beyond the industry norm.

So, in summary... Centralized CM/ALM, DONE RIGHT, is not just better than distributed and best-of-breed alternatives - it's essential to being truly competitive and having all the information at your fingertips for the best corporate, management, and developer decision making.

Joe

Joe Farah replied on July 12, 2011 - 11:38am.

Sorry Marc,
I usually mention my affiliation with Neuma, but didn't this time. I am clear about it at the end of my CM: THE NEXT GENERATION column in the CM Journal, though.

jwschaeff replied on July 12, 2011 - 7:20pm.

MarcGirod wrote: "But there are fewer and fewer of those."

True, mature software companies are becoming extinct.

Marc, I'm speaking from experience on my current team.

My developers hate the tool, yet our architect has dictated that we use it. The learning curve is steep, and we continually have something committed that shouldn't be. The concepts used (i.e., rebase, squashing commits, etc.) are foreign to structured SCM practices and fly in the face of stability.
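
For anyone who hasn't met these concepts, here is a minimal sketch of the squashing operation in question (the commit count is arbitrary):

    # Combine the last three local commits into a single commit
    git rebase -i HEAD~3
    # In the editor that opens, keep "pick" on the first commit and change
    # the later ones to "squash"; Git then rewrites those commits as one.
    # This rewriting of history is exactly what worries traditional SCM
    # practitioners when it happens to already-shared commits.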

Getting new developers who already have a Git background is difficult (try searching for Git expertise on Dice), so we have to set aside training time for them to grasp the huge paradigm shift.

Git is essentially counterproductive for projects that need to get version control in place quickly. I get whipped with the 'old technology' noodle each time we come across something that doesn't work (like getting hooks implemented in existing GitHub projects?), but the key here is in SCM practices: established means stable.

I agree with you on UCM: totally out of bounds with its heavyweight processes. I'm a Subversion fan and know that I'd be able to provide my team better support using it than using Git.

intland replied on July 28, 2011 - 6:42am.

I agree with you about the steep learning curve of Git and the lack of developers with Git experience in the market.

What makes Git an absolutely KILLER option are workflows! See this short chapter in the "Pro Git" book:

http://progit.org/book/ch5-1.html

We have a multitude of enterprise(!) customers adopting the Integrator Workflow (aka Integration Manager Workflow, or simply "pull request") VERY efficiently. You simply don't have this level of efficiency, security, and continuity with Subversion, or any other centralized approach.

Code contribution, forking, and merging are generally quite painful with SVN due to its "developers must have write access to the central repo, or a maintainer must manually merge patches sent by them" approach. With Git and Mercurial it is totally different. Developers work in their sandbox, and when they want to contribute their changes to the "blessed" repository (which contains the reference code), they send a so-called pull request to the maintainers. The maintainers then review the changes and merge them if they like, all with a single click in their browser.
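
On the command line, the integrator's side of this workflow looks roughly like the following (the fork URL and branch name are hypothetical):

    # Fetch a contributor's fork without giving them write access
    git remote add alice https://git.example.com/alice/project.git
    git fetch alice

    # Review what the pull request would bring in
    git log --stat main..alice/feature

    # If accepted, merge into the blessed repository and publish
    git checkout main
    git merge alice/feature
    git push origin main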

It works like a charm in industries where tons of code comes from external suppliers, such as automotive, embedded, or telco.

GitHub and BitBucket implement this workflow if hosting is an option for you.

codeBeamer implements this workflow if you want to host it for yourself, or when you need a single complete ALM solution.
Details: https://codebeamer.com/cb/wiki/33573

(Disclaimer: I work for Intland Software, the company who develops codeBeamer.)

Robert Cowham replied:

One of the issues that concerns me with distributed SCM is that of tracking what is happening. It obviously works great in the open source world it was designed for, where you don't necessarily care what others are doing, or how many of them there are, until they seek to contribute back (at which point the bottleneck is the human maintainers).

In a commercial world, you might wish to know a little more about who is working on what, and why Fred hasn't checked anything in for a week or two. Also, you may wish to know that a potential collision is coming up (two or more people/teams working on the same area) and take account of those things in your planning.

I was interested in the talk that originally started this thread - the organisation had started wrapping its own control layer on top of Git.

intland replied on July 28, 2011 - 9:23am.

Distributed SCM doesn't at all mean that the number of repositories "explodes" and the work of team members becomes untrackable.

Here is what we do: each team member has a fork repo (his own work area). Their activities are aggregated into so-called Activity Streams (activity timelines) that are visible in our wiki pages and are also readable as RSS feeds.

If a manager or tech lead is particularly interested in seeing the source code changes of individual developers, fork repositories can be subscribed to: you get an email on each change, with colored diffs, the task associated with the code change, and other details included.
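
codeBeamer wires up these subscriptions itself, but as a generic sketch, a similar effect could be approximated with a plain Git post-receive hook in the fork repository (the email address is made up):

    #!/bin/sh
    # post-receive hook: mail a summary of each pushed ref to the subscriber
    while read old new ref; do
        git log --stat "$old..$new" | mail -s "Changes on $ref" lead@example.com
    done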

Along with this, we use strict task management in our issue tracker, which allows associating requirements with tasks, tasks with subtasks, and tasks/subtasks with actual source code changes.

This makes the whole process easy to track, perfectly organized, traceable in both directions, yet flexible and fast.

We are dogfooding, so we use our product to develop the product:

http://www.intland.com/products/codebeamer

Brad Appleton replied:

I think we need to separate centralized vs. distributed SCM as concepts from some of the specific open-source implementations. Also, we should be clear on the difference between using a DVCS to achieve "full repository replication" (or mirroring) and using true distributed capability.

Among the main differences I saw when they started becoming popular just over a decade ago were:

  • Developer "workspaces" can be a first-class repository (complete with "private versioning"), which helps resolve some issues regarding backup/recovery of developer workspaces (and introduces some others)
  • There is a form of perceived "empowerment" that developers feel when their own workspace is a first-class repository. They don't have to have someone else's central hooks/triggers and procedures imposed on them (this is often a big "minus" to those trying to impose some semblance of centralized standardization or control). They don't necessarily have to "adhere" to such standards until they submit their change to the central repository. This is like the version-control equivalent of the design pattern that dictates "program to an interface, not to an implementation"; in this case it is more like "CM to an interface, not to an implementation"!
  • Centralized repositories that "collect" or "aggregate" other change sources have the ability to "(task) branch on demand" instead of "(task) branching in advance" -- meaning that as long as contributors have a way to "package" their changes as a single "commit" or change/patch set, they don't have to create a branch ahead of time just in case there might be parallel activity. Instead, the decision to branch (or not) can be deferred to the "last responsible moment" at the central repository: when the change is submitted, a branch is created only if necessary (see the sketch after this list).
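
A hypothetical sketch of that branch-on-demand decision at the central repository, in Git (the patch file and branch name are made up):

    # Try to land a submitted change set directly on the mainline
    git checkout main
    if git am --3way change-set.patch; then
        echo "Landed directly on main; no branch was needed"
    else
        git am --abort
        # Parallel activity got in the way: create the task branch only now
        git checkout -b task/change-set
        git am --3way change-set.patch   # resolve conflicts, then: git am --continue
    fi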

Well before Git came along, the main "reference" implementations of distributed VCS pretty much boiled down to (Gnu) Darcs and BitKeeper. Those might be better ones to look at as being more "conceptually complete" than Git, or even Hg or Bzr, etc.

I always liked David Wheeler's essay on the subject; then a few others came along, like the one from InfoQ.com, another essay by Eric Raymond, and another by Ben Sussman.

I blogged about a dozen or so good links back in 2008 at http://blog.bradapp.net/2008/02/distributed-version-control-systems.html -- I hope there are some better (or more up-to-date) pages available now.

 

James Hanley replied:

One issue with centralized VCS arises in a large corporate environment where engineers may not be collocated, or where none of the engineers are even located in the same area as the data center housing the CVCS. This can make checkouts, commits, etc., painfully slow. A distributed system allows the engineers to work without a connection to the VPN, or to merge against a localized repo at their dev center. This of course requires someone to synchronize or further merge upstream/downstream with the other repos housing the same development, but it can make the teams much more nimble.
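
As a rough sketch of that arrangement in Git (the host names are invented): keep a full mirror near each dev center and refresh it on a schedule.

    # At the dev center: maintain a full mirror of the central repository
    git clone --mirror https://central.example.com/product.git
    cd product.git
    git fetch --prune origin        # run periodically (e.g., from cron) to stay in sync

    # Engineers clone from the nearby mirror for fast day-to-day work,
    # while pushes still go upstream to the central repository
    git clone https://mirror.dev-center.example.com/product.git
    cd product
    git remote set-url --push origin https://central.example.com/product.git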
