[build2] LTO and parallelization with build2

Mon Aug 3 14:56:38 UTC 2020

On Mon, Aug 3, 2020 at 7:51 AM Boris Kolpackov <boris at codesynthesis.com> wrote:
>
> The bigger issue is potential memory usage (I've seen translation units
> that take over 1G to compile).

Yeah, very good point. This will especially be a problem on ARM.

> For -flto to work via jobserver, I believe build2 (and ninja) would
> need to implement the server proper, not just the client.

Yes, I believe that's right. I think the ninja devs in issue 1139 were
primarily trying to address a different problem: what happens when
ninja is invoked by an outer make, or when make is otherwise the
primary scheduler.

> And having been subscribed to make-alpha for the past decade, I can tell you
> that jobserver in GNU make has been a never-ending source of bugs,
> corner cases, and compatibility issues (see this post[1] for a primer).
> So I would like to avoid touching that can of worms if I can help it.

Yeah, it seems like implementing the make jobserver would be rather complex.

> Now on to how we could handle -flto in build2. It would actually
> be quite easy for the link rule to allocate more than one hardware
> thread (if available) in order to pass it on to the linker. There
> is no such support in the scheduler now but it should be pretty
> straightforward to add. With this idea then it's only a matter
> of rewriting -flto=auto or -flto=jobserver with -flto=N where
> N is the number of hardware threads allocated.

Yeah, that's kind of what I had in mind.
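
To make the rewriting step concrete, here is a minimal Python sketch of
what it might look like. This is not build2 code: rewrite_flto and
allocated_threads are hypothetical names, and it assumes the link
command line is available as a list of strings. Only -flto=auto and
-flto=jobserver are rewritten; a bare -flto or an explicit -flto=N is
left alone.

```python
def rewrite_flto(args, allocated_threads):
    """Return a copy of args with dynamic -flto modes replaced by -flto=N.

    args              -- link command line as a list of strings (assumption)
    allocated_threads -- hardware threads the scheduler handed to this link
                         step (hypothetical value, not a build2 API)
    """
    if allocated_threads < 1:
        raise ValueError("allocated_threads must be at least 1")

    # The two modes that would otherwise let GCC pick its own parallelism.
    dynamic_modes = {"-flto=auto", "-flto=jobserver"}

    return [
        f"-flto={allocated_threads}" if arg in dynamic_modes else arg
        for arg in args
    ]


if __name__ == "__main__":
    cmd = ["g++", "-O2", "-flto=auto", "-o", "app", "a.o", "b.o"]
    print(rewrite_flto(cmd, 4))
```

If the scheduler could only spare one thread, this would produce
-flto=1, i.e. the link would simply not run its LTO stages in parallel.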

> There are two potential problems with this:
>
> 1. If GCC does not use all the allocated threads, then they will be
>    wasted, which would be pretty bad.
>
>    Do you know if GCC will always utilize all the threads given? It
>    appears to be generating a Makefile that it then passes to make,
>    so probably it depends on what's in that Makefile.

The LTO WHOPR mode[1] is enabled when -flto is passed[2] and an LTO
partitioning algorithm is used[3]. The LGEN phase should be executed
in parallel already by build2 since it invokes the compiler in
parallel for the TUs. Then lto-wrapper forks and execs the two
remaining stages, each executed with the specified parallelism:

1. The WPA stage is partitioned[4], with the output of each partition
   written by a separate fork of the LTO process[5]. Currently, the WPA
   stage doesn't support the jobserver mode[6]. The default partitioning
   algorithm is balanced[7], with the number of partitions controlled by
   the lto-partitions parameter[7] (default 32[8]). It looks like
   lto-partitions should exceed the number of CPUs used for compilation.
2. The LTRANS stage also operates on each partition independently.
   With or without the jobserver, the parallel LTO mode generates a
   temporary Makefile[9]. Without the jobserver, make is invoked
   without --jobserver-fd args[10] and with a statically determined
   number of jobs[11].

So provided that there are more partitions than allocated CPU threads
(i.e. lto-partitions > n), both WPA and LTRANS stages of GCC LTO
should utilize all n threads from the scheduler.
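
As a rough back-of-the-envelope check of that condition, here is a
small Python sketch. It is only a model, not anything GCC or build2
does: usable_threads and idle_threads are hypothetical helpers, and it
assumes the actual partition count equals lto-partitions (the balanced
algorithm may well produce fewer partitions for a small program) and
that partitions take roughly equal time.

```python
DEFAULT_LTO_PARTITIONS = 32  # default of the lto-partitions parameter, per [8]


def usable_threads(allocated_threads, lto_partitions=DEFAULT_LTO_PARTITIONS):
    """Threads a partition-parallel stage can keep busy at once.

    Each partition is processed by one job, so at most one thread per
    partition can do useful work.
    """
    if allocated_threads < 1 or lto_partitions < 1:
        raise ValueError("both arguments must be at least 1")
    return min(allocated_threads, lto_partitions)


def idle_threads(allocated_threads, lto_partitions=DEFAULT_LTO_PARTITIONS):
    """Allocated threads that would sit unused under this model."""
    return allocated_threads - usable_threads(allocated_threads, lto_partitions)


if __name__ == "__main__":
    for n in (4, 8, 32, 64):
        print(n, usable_threads(n), idle_threads(n))
```

Under these assumptions, the static approach only wastes threads once
the scheduler allocates more of them than there are partitions, e.g. 64
threads against the default 32 partitions.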

> 2. Theoretically, via the jobserver, the linker can utilize additional
>    threads as they become available. In our case, the number of
>    allocated threads would be fixed at the linker start time.

Yeah, I had thought about this issue with the static thread number
approach. Without the build2 jobserver, both WPA and LTRANS stages are
limited to n threads. Only the LTRANS stage currently supports the use
of the jobserver, though. Thus, even if build2 implements a make
jobserver, the WPA stage will currently be limited to at most n
threads from the build2 scheduler when spawning the linker.

Hypothetically, the WPA stage might support the jobserver at some
point, so both stages could support dynamic thread allocation from a
build2 jobserver. But I think the above analysis at least indicates
that the static thread approach should be capable of fully utilizing
the threads assigned for linking. It would then be up to build2 to
make sure it's keeping the other threads busy with other tasks while
linking.

[1] https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html
[2] https://github.com/gcc-mirror/gcc/blob/d2ae6d5c053315c94143103eeae1d3cba005ad9d/gcc/lto-wrapper.c#L1521
[3] https://github.com/gcc-mirror/gcc/blob/d2ae6d5c053315c94143103eeae1d3cba005ad9d/gcc/lto-wrapper.c#L1564
[4] https://github.com/gcc-mirror/gcc/blob/d2ae6d5c053315c94143103eeae1d3cba005ad9d/gcc/lto-wrapper.c#L1747
[5] https://github.com/gcc-mirror/gcc/blob/d1961e648e0fedebd06e4ad786c1bfc536312ef7/gcc/lto/lto.c#L2398
[6] https://github.com/gcc-mirror/gcc/blob/d1961e648e0fedebd06e4ad786c1bfc536312ef7/gcc/lto/lto.c#L3154
[7] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
[8] https://github.com/gcc-mirror/gcc/blob/51e85e64e125803502fde94b9e22037c0ccaa8b2/gcc/params.def#L1097
[9] https://github.com/gcc-mirror/gcc/blob/d2ae6d5c053315c94143103eeae1d3cba005ad9d/gcc/lto-wrapper.c#L1877
[10] https://github.com/gcc-mirror/gcc/blob/d2ae6d5c053315c94143103eeae1d3cba005ad9d/gcc/lto-wrapper.c#L1949
[11] https://github.com/gcc-mirror/gcc/blob/d2ae6d5c053315c94143103eeae1d3cba005ad9d/gcc/lto-wrapper.c#L1968