The Pitfalls and Benefits of GNU Make Parallelization

[article]
Ask Mr. Make

Many build processes run for hours with build managers commonly typing 'make' and going home for the night. GNU Make's solution to this problem is parallel execution, which is a simple command-line option that causes GNU Make to run jobs in parallel using the dependency in the Makefile to run in the correct order.

In practice, GNU Make parallel execution is severely limited by a number of problems stemming from the fact that almost all Makefiles are written with the implicit notion that they will run in series. Hardly any Makefile authors 'think parallel' when writing their Makefiles which cause hidden traps that either cause the build to fail with a fatal error when GNU Make is run in parallel-mode, or, worse, the build succeeds but builds incorrect binaries.

This article looks at parallel GNU Make and points out the pitfalls and how to work around them to get maximum parallelism.

The Basics: -j or --jobs
To start GNU Make in parallel mode it's enough to specify either the -j or --jobs option on the command-line.  The argument to the option is the maximum number of processes that GNU Make will run in parallel.

For example, typing make --jobs=4 will allow GNU Make to run up to four subprocesses in parallel. This would give a theoretical maximum speed up of four times cutting build time by a quarter.  The theoretical time is, however, severely limited by restrictions in the Makefile.  Calculating the maximum actual speed up is done using Amdahl's Law (see below).

One simple, but very annoying problem, found in parallel GNU Make is that since the jobs are no longer run serially (and the order depends on the timing of jobs) the output from GNU Make will be somewhat randomly permuted depending on the actual order of job execution.

Consider the following simple example:

.PHONY: all
all: t5 t4 t1
   @echo Making $@

t1: t3 t2
   touch $@

t2:
   cp t3 $@

t3:
   touch $@

t4:
   touch $@

t5:
   touch $@

It builds five targets t1, t2, t3, t4 and t5.  All are simply touched except for t2 which is copied from t3.

Running this through standard GNU Make without a parallel option gives the output:

touch t5
touch t4
touch t3
cp t3 t2
touch t1
Making all

The order of execution will be the same each time because GNU Make will follow the prerequisites depth-first and from left-to-right.  Note that the left-to-right execution (for example, in the all rule t5 is built before t4 which is built before t1) is a convention and is not guaranteed even in serial GNU Make.

Now if GNU Make is run in parallel mode (say make --jobs=16) GNU Make can run multiple jobs in parallel.  For example, it's clear that t5, t4 and t1 can be run at the same time since there are no dependencies between them.  Similarly, t3 and t2 do not depend on each other and hence can be run in parallel. 

So the output of a parallel GNU Make on the same Makefile might be:

touch t4
touch t5
touch t3
cp t3 t2
touch t1
Making all

or even

touch t3
cp t3 t2
touch t4
touch t1
touch t5
Making all

This makes any process that examines log files to check for build problems (such as diffing log files) difficult to handle.  (There's no easy solution to this in GNU Make; you'll have to live with it or look at a commercial solution that works around this problem, like Electric Cloud-and that's the last time that I'll mention the company I founded to deal with parallel build problems in this article).

Missing Dependencies
The Makefile example has an additional problem.  The author fell into the classic left-to-right trap when they wrote the Makefile and when run in parallel it's possible for the following to happen:

touch t5
touch t4
cp t3 t2
cp: cannot stat `t3': No such file or directory
make: *** [t2] Error 1

That's because when run in parallel the rule to build t2 can occur before the rule to build t3 and t2 needs t3 to have already been built.  This didn't happen in the serial case because of the left-to-right assumption: the rule to build t1 is t1: t3 t2 which implies that t3 will be built before t2.

But there's no actual dependency in the Makefile that states that t3 must be built before t2.  The fix is simple: just add t2: t3 to the Makefile.  This is a simple example of a real problem that plagues Makefiles when run in parallel: missing or implicit (left-to-right) dependencies.  If a Makefile breaks under parallel GNU Make with an error it's worth looking for missing dependencies straight away as they are very common.

Another way GNU Make can break when running in parallel is if multiple rules use the same temporary file.

The hidden temporary file problem
Consider this example Makefile:

TMP_FILE := /tmp/scratch_file

.PHONY: all
all: t

t: t1 t2
   cat t1 t2 ] $@

t1:
   echo Output from $@ ] $(TMP_FILE)
   cat $(TMP_FILE) ] $@

t2:
   echo Output from $@ ] $(TMP_FILE)
   cat $(TMP_FILE) ] $@

Run without a parallel option GNU Make produces the following output:

echo Output from t1 ] /tmp/scratch_file
cat /tmp/scratch_file ] t1
rm /tmp/scratch_file
echo Output from t2 ] /tmp/scratch_file
cat /tmp/scratch_file ] t2
rm /tmp/scratch_file
cat t1 t2 ] t

and the t file contains:

Output from t1
Output from t2

But run in parallel with GNU Make it's possible to see the following output:

echo Output from t1 ] /tmp/scratch_file
echo Output from t2 ] /tmp/scratch_file
cat /tmp/scratch_file ] t1
cat /tmp/scratch_file ] t2
cat t1 t2 ] t

and now t contains:

Output from t2
Output from t2

This has occurred because there's no dependency between t1 and t2 (neither requires the output of the other) and hence they can be run in parallel.  In the output above you can see that they are running in parallel and that the output from the two rules is interleaved.  Since the two echo statements run first the temporary file (which is being shared by the two rules) has the wrong value when cated to t1.  This results in the wrong value to t.

This may seem like a contrived example, but exactly the same thing happens in real Makefiles when run in parallel resulting either in broken builds, or, as in the case above, the 'wrong binary' being built.  For example, the yacc program produces temporary files called y.tab.c and y.tab.h.  If more than one yacc is run in the same directory those files could be used by the wrong process.   Other poorly designed (or, at least, not parallel aware) tools may have similar problems.

A simple solution in the Makefile above is to change the definition of TMP_FILE to TMP_FILE = /tmp/scratch_file.$@.  That way it's name will depend on the target being built and now a parallel run looks like:

echo Output from t1 ] /tmp/scratch_file.t1
echo Output from t2 ] /tmp/scratch_file.t2
cat /tmp/scratch_file.t1 ] t1
cat /tmp/scratch_file.t2 ] t2
cat t1 t2 ] t

A related problem occurs when multiple jobs in the Makefile write to a shared file.  Even if they never read the file (for example, they might write to a log file) locking of the file for write access can cause competing jobs to be stalled reducing the overall performance of the parallel build.   

Consider the following example Makefile that uses the lockfile command to lock a file and simulate write locking.  While the file is locked each job waits a number of seconds:

LOCK_FILE := lock.me

.PHONY: all
all: t1 t2
   @echo done.

t1:
   @lockfile $(LOCK_FILE)
   @sleep 10
   @rm -f $(LOCK_FILE)
   @echo Finished $@

t2:
   @lockfile $(LOCK_FILE)
   @sleep 20
   @rm -f $(LOCK_FILE)
   @echo Finished $@

Running this Makefile in a serial build takes about 30 seconds:

$ time make
Finished t1
Finished t2
done.
make  0.01s user 0.01s system 0% cpu 30.034 total

but it isn't any faster in parallel even though t1 and t2 should be able to run in parallel:

$ time make -j4
Finished t1
Finished t2
done.
make -j4  0.01s user 0.02s system 0% cpu 36.812 total

(It's actually slower because of the way in which lockfile detects lock availability).  Clearly, that's contrived, but you can imagine that write locking a file could cause similar delays in an otherwise parallel-friendly Makefile.

Related to the file locking problem is a danger concerning archive (ar) files.  If multiple ar files were to run simulataneously on the same archive file then the archive could end up being corrupted.  Locking around archive updates is necessary in a parallel build, or care needs to be taken that dependencies prevent multiple ar commands running on the same file at the same time.

One way to prevent parallel problems is to specify .NOTPARALLEL in a Makefile.  If this is seen then the entire make execution will be run in series in the -j or --jobs command line option will be ignored.  This is a very blunt tool because it affects an entire invocation of GNU Make, but it could be handy in a recursive Make situation with, for example, a third party Makefile that is not parallel-safe.

The right way to do recursive Make
GNU Make is smart enough to share parallelism across sub-Makes if the Makefile that recurses using $(MAKE) is careful about how it calls sub-Makes.   GNU Make has a message passing mechanism that works across most platforms and enables a single Make that starts sub-Makes to use all the available jobs specified with -j or --jobs by passing tokens across pipes between the make processes.

The only serious gotcha is that you must allow the sub-Makes themselves to run in parallel for this to work.  The classic recursive Make style that uses a shell for loop processes each sub-Make in turn which doesn't allow more than one sub-Make to run simultaneously.

For example, the following Makefile isn't parallel-aware:

SUBDIRS := foo bar baz

.PHONY: all
all:
    for d in $(SUBDIRS);
    do
        $(MAKE) --directory=$$d;
    done

When run in parallel-mode the all rule still walks through each sub-directory and waits for its $(MAKE) to complete.  Even though each of those individual sub-Makes will be able to run in parallel the overall Make does not.  And on a highly parallel machine that could mean a less than ideal speed up.  For example, if the Make in the bar directory is only capable of running 4 jobs at once then running on a sixteen processor machine won't make the build any faster.

The solution is to remove the for loop and replace with a single rule for each directory.  Each directory is considered to be phony since the directory itself doesn't actually get built:

SUBDIRS := foo bar baz

.PHONY: all $(SUBDIRS)
all: $(SUBDIRS)

$(SUBDIRS):
    $(MAKE) --directory=$@

Now each directory can run while the others are running and parallelism is maximized.

distcc
If your project uses gcc then it's possible to use the distcc to maximize the amount of parallelization in GNU Make.  Typically GNU Make is limited to running processes on a single machine.  On a dual processor machine specifying -j2, -j3 or even -j4 may get you a nice speed up, but beyond that the processors will be fully occupied.  Machines with larger numbers of processors are, in general, incredibly expensive.  On the other hand most work places have many single CPU machines lying around.  At night those machines are often idle.  distcc takes a single gcc invocation and farms it out to one of a cluster of machines.  This means that a project can be run with a very large -j number so that many jobs are running in parallel on many machines connected to the network.

Once installed, using distcc is simple: just change the compiler from gcc to distcc.  That's achieve by setting the CC GNU Make variable to distcc:

make -j32 CC=distcc

Full details of distcc are here: http://distcc.samba.org/

Amdahl's Law and the limits of parallelization
Finally, there are real limits on the amount of parallelization that are possible in a project.  Take a look at the following Makefile:

.PHONY: all
all: t
   @echo done

t: t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12
   @sleep 10
   @echo Made $@

t1:
   @sleep 11
   @echo Made $@

t2:
   @sleep 4
   @echo Made $@

t3: t5
   @sleep 7
   @echo Made $@

t4:
   @sleep 9
   @echo Made $@

t5: t8
   @sleep 10
   @echo Made $@

t6:
   @sleep 2
   @echo Made $@

t7:
   @sleep 12
   @echo Made $@

t8:
   @sleep 3
   @echo Made $@

t9: t10
   @sleep 4
   @echo Made $@

t10:
   @sleep 6
   @echo Made $@

t11: t12
   @sleep 1
   @echo Made $@

t12:
   @sleep 9
   @echo Made $@

When run in series it takes about eighty-eight seconds to complete:

$ time make                          14:52-84
Made t1
Made t2
Made t8
Made t5
Made t3
Made t4
Made t6
Made t7
Made t10
Made t9
Made t12
Made t11
Made t
done
make  0.04s user 0.03s system 0% cpu 1:28.68 total

What's the maximum speed up possible, assuming as many CPUs are available as desired?  Working through the Makefile by hand you'll see that t takes 10 seconds to build and every thing else has to be built before that. t1, t2, t4, t6, and t7 are all independent and the longest of them takes 12 seconds. t3 waits for t5 which needs t8: that chain takes a total of 20 seconds. t9 needs t10 for a total of ten seconds and t11 needs t12 for another ten.

So the longest serial part of this build is the sequence: t, t3, t5, t8 which takes a total of thirty seconds. This build can never go faster than thirty seconds (or 2.93x faster than the serial eighty-eight second time).  How many processors are needed to achieve that speed up?

In general the maximum speed up achievable is governed by Amdahl's Law: if F is the fraction of the build that cannot be parallelized and N is the number of available processors then the maximum speed up achievable is 1 / ( F + ( 1 - F ) / N ).

In the example 34 percent of the build can't be parallelized (the sequence identified above).  Applying Amdahl's law on shows:

# processorsMaximum speed up
21.49x
31.79x
41.98x
52.12x
62.22x
72.30x
82.37x
92.42x
102.46x
112.50x
122.53x


For this small build the maximum speed Amdahl's Law predicts a plateau starting around eight processors.  The actual plateau is further limited by the fact that there are only thirteen possible jobs in the actual build.

Looking at the structure of the build we can see that eight processors is the actual maximum because there are five jobs that can run in parallel without any dependencies: t1, t2, t4, t6, and t7.  Then there are three small chains of jobs that can each use one processor at a time: t3, t5 and t8; t9 and t10 and t11 and t12.  Building t can reuse one of the eight processors since they'll all be idle at that point.

More can be found on Amdahl's Law here: http://en.wikipedia.org/wiki/Amdahl's_law.

Conclusion
Making a parallel-build work isn't pretty, and going from a parallel-blind to parallel-aware build can be a difficult task, but the speed ups possible are well worth it.

About the author

CMCrossroads is a TechWell community.

Through conferences, training, consulting, and online resources, TechWell helps you develop and deliver great software every day.