In today's increasingly complex corporate environment, the
need to geographically distribute software development has become almost the
norm. This distribution can follow several patterns, including Multiple
Offices, Working from Home and Subcontract Organizations. These patterns share
some common SCM problems, and each also has its own unique problem areas.
In the sections that follow, please keep in mind that there
are two types of distribution being discussed: distributed development and
distributed repositories. The repositories consist not only of the Controlled
Item (CI) library, but also the Defect, Issue and Enhancement Tracking (DIET)
library.
Some Common Problems
The most common problem from an SCM perspective is just
keeping track of the CIs. In most cases this is source code, but it should also
include project and user documentation; third-party components and tool chains;
and test plans, test cases and test data. While it is also part of a
Product/Project Manager's responsibility, SCM needs to ensure that the format
and directory hierarchy is consistent across all contributors, regardless of
location.
Some Unique Problems
The most likely problem in this area is the corporate
culture. While it is common to assume this only applies to Subcontract
Organizations, independent offices and home-based workers may also diverge from
internally enforced standards. This divergence could be something as simple as not
adhering to coding standards or as diverse as using different tool chains. It
is not unusual to be unable to require Subcontract Organizations to use the
same CM tools as the parent corporation, so sharing CIs (and sometimes even
deciding what the CIs are) becomes a problem.
Two additional problems exist when "globalization" is a
factor - primary language and time zones. Even when all tools, processes and
procedures are the same, communications can become a major stumbling block -
either due to language or just because one side of a dialog may be asleep when
the other is awake. Even though Einsteinian physics states that true
simultaneity cannot exist when events are separated by distance, CM tools
require a common time base in order to determine the order in which changes
(revisions) should be applied to a repository. This is difficult to accomplish
when distributed repositories are in use.
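The time base problem can be illustrated with a minimal sketch. It assumes two hypothetical sites whose clocks disagree, and a hypothetical `CentralSequencer` that numbers revisions as they arrive; none of these names come from a real CM tool.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Revision:
    site: str          # hypothetical site name
    local_time: float  # seconds, read from the submitting site's own clock
    change: str


class CentralSequencer:
    """Hypothetical single time base: one increasing number per revision."""

    def __init__(self):
        self._next = count(1)
        self.log = []  # list of (sequence_number, Revision)

    def accept(self, revision):
        seq = next(self._next)
        self.log.append((seq, revision))
        return seq


# Arrival order at the server: site_a first, then site_b.
# Assumption: site_b's clock runs slow, so its local stamp looks earlier.
arrivals = [
    Revision("site_a", 1000.0, "fix parser"),
    Revision("site_b", 940.0, "rename parser module"),
]

sequencer = CentralSequencer()
for rev in arrivals:
    sequencer.accept(rev)

# Ordering by each site's own clock disagrees with the real arrival order.
by_local_clock = [r.change for r in sorted(arrivals, key=lambda r: r.local_time)]
# Ordering by the server-assigned number reproduces the arrival order.
by_server_order = [r.change for _, r in sorted(sequencer.log, key=lambda p: p[0])]
```

With a single repository the server can play the sequencer role. With distributed repositories there is no single place to assign the numbers, which is why the ordering becomes hard.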
Structuring for Distributed Development
Once it becomes necessary to transition into a distributed
development model, certain decisions need to be made. Some of these decisions
will be based on how the codebase is architected, others on how much control
SCM has over development tools, processes and procedures.
Centralized vs. Distributed vs. Mixed
Centralized Repository
If a single repository is used (Centralized model) then all
developers will have to use the same SCM tools and there will have to be secure
network access with reasonably good bandwidth. Depending on whether the tool
architecture is truly client-server in nature or not, time base synchronization
may become a problem between distributed users. Additionally, some tools are
better at minimizing bandwidth use than others.
The primary advantage in using a single repository is in
control. It is easier to administer, upgrade, backup, etc. a single repository
than it is to manage a distributed environment. With server-side triggers it is
also easier to control what goes into the repository, by whom and when. The
downside is that the bandwidth necessary to support this model increases
arithmetically with the number of users connecting remotely. I separate the
bandwidth requirements for external access from internal access due to the
difference in their natures. This is a case where the SCM tools being used are
critical to success. Fat-clients and high-bandwidth tools should be avoided.
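As an illustration of controlling what goes into the repository, by whom and when, here is a minimal sketch of the kind of check a server-side trigger might run. The `Commit` record, the policy tables and `check_commit` are all hypothetical; real tools each have their own trigger interface.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Commit:
    user: str
    branch: str
    paths: tuple
    when: datetime


# Hypothetical policy, maintained by SCM on the central server.
BRANCH_WRITERS = {
    "release_1_0": {"cm_admin"},
    "dev": {"alice", "bob", "cm_admin"},
}
FREEZE_WINDOWS = {"dev": (time(22, 0), time(23, 59, 59))}  # e.g. nightly build
FORBIDDEN_SUFFIXES = (".o", ".class", ".tmp")  # build products, not CIs


def check_commit(commit):
    """Return a list of rejection reasons; an empty list means accept."""
    reasons = []
    # By whom: only listed users may write to a branch.
    if commit.user not in BRANCH_WRITERS.get(commit.branch, set()):
        reasons.append(f"{commit.user} may not commit to {commit.branch}")
    # When: reject commits that land inside a freeze window.
    window = FREEZE_WINDOWS.get(commit.branch)
    if window and window[0] <= commit.when.time() <= window[1]:
        reasons.append(f"{commit.branch} is frozen at this time")
    # What: keep generated files out of the repository.
    bad = [p for p in commit.paths if p.endswith(FORBIDDEN_SUFFIXES)]
    if bad:
        reasons.append("generated files not allowed: " + ", ".join(bad))
    return reasons


accepted = check_commit(
    Commit("alice", "dev", ("src/main.c",), datetime(2024, 1, 15, 10, 30))
)
rejected = check_commit(
    Commit("bob", "release_1_0", ("build/main.o",), datetime(2024, 1, 15, 10, 30))
)
```

Because the check runs in one place, every contributor is held to the same rules regardless of location.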
Distributed Repository
If distributed repositories are used, and assuming the same
tools are used, then the administrative overhead is increased while "local
developer" access times go down. In fact, this decrease in local access times
is the primary reason most organizations make this transition. The biggest
problem with a distributed model from a developer's perspective is how to
handle collisions gracefully. The biggest problem from an SCM perspective is
how to keep the various repositories in synch.
Mixed (Both) Repositories
A mixed model is where there are multiple repositories. Some
of them are distributed and others are centralized, though the centralized ones
may be at geographically different locations. This has all of the overhead and
problems of both, but allows for a finer security granularity and access
optimization. Security is improved by not allowing distribution of certain
components and access is optimized in that only those projects that could
really benefit from distributed repositories use that model. A major issue here
is that the CM tools must support both models at the same time.
Picking the Model to Follow
All other things being equal, there are several
considerations to be addressed before deciding to pay the administrative penalty
for distributed CM. These include, but are not limited to:
- Pathological Connections
- Dependent vs. Independent Code
- Tool Chains Used
- Process / Procedures
- Build Methods
Pathological Connections
This occurs when changes to one piece of code affect other
pieces of code, often in different subsystems. It also occurs when separate
fixes to unique faults (bugs) cause the same file or files to be modified.
What this means is that it is difficult, if not impossible, for CM to be able
to separate code so that each geographical location is responsible for changing
only "its" code.
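One rough way to gauge how connected a codebase is would be to scan the change history for files that more than one location has had to modify. The sketch below assumes a hypothetical history of `(site, issue, files)` records exported from the VC tool; the record layout and names are illustrative only.

```python
from collections import defaultdict


def shared_files(changes):
    """Map each file touched by more than one site to the sites involved."""
    sites_by_file = defaultdict(set)
    for site, _issue, files in changes:
        for path in files:
            sites_by_file[path].add(site)
    return {
        path: sorted(sites)
        for path, sites in sites_by_file.items()
        if len(sites) > 1
    }


# Hypothetical history: two unrelated fixes that both had to touch util.c.
history = [
    ("site_a", "BUG-1", ["ui/form.c", "core/util.c"]),
    ("site_b", "BUG-2", ["engine/solver.c", "core/util.c"]),
    ("site_a", "BUG-3", ["ui/form.c"]),
]

overlap = shared_files(history)
```

A long list of shared files suggests that splitting the code by location will be difficult. An empty one is only weak evidence, since it reflects past changes rather than the design itself.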
The concept of CI Ownership, at least at an "organizational"
or "geographical" level, allows for distributed repositories to be established
where each location would be responsible for changes to a specific repository
set (repository mastership) and the other locations would access their copies
of these repositories in a read-only fashion. Pathological connectivity
effectively prevents this from happening. If you are in this situation, the
best you can hope for in a distributed repository model is that you can
concentrate your changes to the local repository and the SCM tools support both
local and remote changes (repository multi-mastership). Note that most SCM
tools do not support this model out of the box; therefore, strongly consider
whether you can afford a centralized repository model.
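The difference between mastership and multi-mastership can be sketched as a simple access rule. The repository names, site names and the `may_commit` helper below are hypothetical and do not describe any particular tool's replication feature.

```python
# Hypothetical assignment of each repository to the one site that masters it.
MASTER_SITE = {
    "ui": "site_a",
    "engine": "site_b",
}


def may_commit(site, repository, multi_master=False):
    """Single mastership: only the mastering site writes; others read.

    With multi_master=True every site may write, which assumes the tool
    can merge local and remote changes.
    """
    if multi_master:
        return True
    return MASTER_SITE[repository] == site


local_write = may_commit("site_a", "ui")
remote_write = may_commit("site_b", "ui")
remote_write_multi = may_commit("site_b", "ui", multi_master=True)
```

Under single mastership the rule is trivial to enforce. Under multi-mastership the rule disappears and the burden shifts to merging and synchronization.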
Dependent vs. Independent Code
Independent code does not depend on the presence of other source code to be
worked on, compiled, tested, etc. It is not uncommon for distributed development
teams to each be responsible for certain components or subsystems and to release
those independently to each other when "ready." Dependent code requires the
presence of others' source code just to be able to do your own development.
If you are lucky enough to have code that is independent in
nature, or that could easily be refactored to be so, then distributed
repositories are a possibility. Dependent code tends to make this more
difficult, but may be possible if others' code only needs to be available in a
read-only mode. If you are in a dependent code situation, though, you are
probably also in a pathologically connected situation. In that case, a
centralized repository model may be your best choice.
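To make the distinction concrete, the following sketch lists which components need source owned by another location. The component names, the ownership table and `cross_site_dependencies` are hypothetical, and the dependency data is assumed to come from your own build descriptions.

```python
def cross_site_dependencies(owner, depends_on):
    """Return, per component, the dependencies owned by a different site."""
    result = {}
    for component, deps in depends_on.items():
        foreign = sorted(d for d in deps if owner[d] != owner[component])
        if foreign:
            result[component] = foreign
    return result


# Hypothetical ownership and source-level dependencies.
owner = {"ui": "site_a", "reports": "site_a", "engine": "site_b"}
depends_on = {
    "ui": ["reports"],      # same site: no cross-site need
    "reports": ["engine"],  # needs site_b's source
    "engine": [],           # independent
}

foreign_needs = cross_site_dependencies(owner, depends_on)
```

Components that do not appear in the result are candidates for a distributed repository. Those that do appear need at least read-only access to another location's code.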
Tool Chains Used
Tool Chains are the tools required to reproducibly go from source code to a
deliverable product. They include, but are not limited to, compilers, linkers,
build managers, packagers and third-party components and libraries. In some
cases, even the operating system is part of the tool chain, especially if the
use of shared system libraries is involved in the final product. Editors and
the development IDE are also part of the tool chain. Notepad and vi do not have
much impact on the development process from an SCM perspective; however, IDEs
such as Eclipse and Visual Studio do. It is not uncommon to have SCM tools
integrated into various components of the tool chain, and those integrations,
as well as the CM tools themselves, must support the repository model.
One would think that the CM tools would shield the user from
having to worry about the repository model, but that is not always the case.
The plug-ins are often written to use fat-client technologies instead of
client-server ones where the tool supports both. Additionally, plug-ins rarely
support the full functionality of the native tools. In deciding whether to use
a centralized or distributed model, consider all of the methods that will be
used to access the repositories.
Process / Procedures
When development is distributed outside of a single
location, it becomes more difficult to enforce consistent processes and
procedures - and CM is all about consistency! Try to keep both as simple as
possible and document them well. Cookbooks, slide presentations and FAQs are
all good ideas. As development works out their globalization problems, be
prepared to evolve both CM Processes and CM Procedures while publishing updates
to the various documents along the way.
Make sure you are engaged with the development management during
the transition. It is not uncommon for remote developers to delay checking
their changes in, whereas it is often better to check in even more often than
they were used to. Try to have developers work on one issue at a time instead
of the typical "day's" worth of work (or "week's," etc.). Ask development
management to have each geographical area work on related code, or at least
related issues. This tends to maximize communications and minimize coding
conflicts with other areas.
Also keep in mind that it is rare for SCM team members to
actually be present at each location. Remote administration is the norm unless
you are dealing with satellite offices of some size or with Subcontract
organizations (where you would normally be prohibited from placing SCM personnel
anyway).
Build Methods
There are typically two types of builds going on: developer
and release (production) candidate. Developer builds occur very often and each
developer's codebase will be subtly different. They will each need access to a
standardized tool chain plus any other third-party and product intermediates
(libraries, JAR files, etc.) produced by other development groups. Sometimes
these additional components are made available via the Version Control (VC)
tool; other times reference copies are made available by CM. The primary need,
so far as developers are concerned, is that SCM adds minimal time to each build.
Release candidate (RC) builds are produced under the control of CM and are CIs
in and of themselves. They are rarely versioned, but they are normally preserved
for use by QA/QC and eventual release. They are produced relatively infrequently
and often have additional post-build steps appended such as automatic regression
testing, execution profiling and static analysis. When one of these builds fails
to execute, the failure is resolved by either the build team or by development,
whichever is appropriate based on the failure mechanism. Failures in the
post-build steps are normally addressed by development. The primary need, so far
as SCM is concerned, is that these builds are controlled and reproducible.
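One small aid to reproducibility is recording exactly which tool chain produced a build. The sketch below assumes a hypothetical manifest of tool names and versions and derives a fingerprint that can be stored with each RC build; the manifest contents are invented for illustration.

```python
import hashlib
import json


def fingerprint(manifest):
    """Return a stable SHA-256 digest of a tool chain manifest."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifests; tool names and versions are placeholders.
central_site = {"compiler": "1.2.3", "linker": "4.5", "packager": "0.9"}
remote_site = {"packager": "0.9", "linker": "4.5", "compiler": "1.2.3"}
drifted_site = {"compiler": "1.2.4", "linker": "4.5", "packager": "0.9"}

same_tools = fingerprint(central_site) == fingerprint(remote_site)
drift_detected = fingerprint(central_site) != fingerprint(drifted_site)
```

Comparing fingerprints across locations gives a quick check that a remote build used the same tool chain as the controlled one. It does not by itself guarantee identical output.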
Continuous integration builds are a third type of build that
shares some of the aspects of both development and RC builds. They are performed
reasonably soon after development checks in code changes so they happen often,
but they are produced using the CM controlled tool chain and build scripts so
they are controlled and reproducible as well. The primary purposes for these
builds are:
- To make components available for other developers to use going forward
- To ensure that the modified codebase builds cleanly
Note that under normal conditions they are not, or should
not be, used by QA/QC for testing.
Where, how and when each of these build types occurs is
affected by the VC tool choice, both in terms of its overall architecture and
whether the repositories are distributed or centralized. Access times to
controlled source can make a significant difference in overall build times.
Summary
If you are making the jump to distributed development and
have plenty of time to prepare, consider carefully if your current SCM tools
are sufficient to keep you out of trouble. If not, consider if you have the
time to make changes. If the answer to either is "no," then start out with a
centralized repository model and keep careful control over the ways the
developers access the repositories. Unless you are involved in a totally new
development effort, make the transition to distributed SCM tools first using a
centralized repository if at all possible. Plan up-front for distributed
repositories, then extend the SCM implementation to using distributed
repositories once everything is working and all of the controls are in place.
If you are involved in a totally new development effort and have the chance to
implement both the SCM tools and repository model up-front, do what seems best
- just make sure you have enough time to both do the initial implementation and
test it thoroughly before you allow developers to actually start using it for
production development.
We started out with having to go truly global with CVS. We
kept a central repository and put several server-side triggers in place to make
people behave (i.e., keep the developers from taking shortcuts or accidentally
working on the wrong branch). All of the production candidate builds were done
at the same location as the CVS repositories and made available to the remote
locations via VPN. It was slow, but it kept us in reasonable control.
We have since transitioned to another client-server tool
(AccuRev), but maintained the centralized repository model. In conjunction with
a change in the overall Build Management tool, we achieved a full order of
magnitude improvement in build times. We are currently working on paring down
what each remote developer has to download for day-to-day work, but the
centralized control and administration is still necessary. Future plans
currently call for being able to control remote builds, but keeping the
repository centralized - even though the tool supports repository replication.
As we found out, it is best to "Make haste slowly." Take
small steps in whatever you do and maintain control at all times. The
development itself can be globally distributed more easily than SCM, so long as
the tools you already have in place allow you to emplace server-side triggers
and keep revision timing under a single time base.
Ben Weatherall is currently based in Fort Worth,
Texas where he practices Practical CM on a daily basis using a
combination of CVS and custom tools to support a modified Agile-SCRUM
development methodology. He is a member of IEEE, ASEE (Association of
Software Engineering Excellence – The SEI’s Dallas based SPIN
Affiliate), NTLUG (North Texas Linux Users Group), and PLUG (Phoenix
Linux Users Group).