[build2] LTO and parallelization with build2

Wed Aug 5 16:24:13 UTC 2020

On Wed, Aug 5, 2020 at 8:21 AM Boris Kolpackov <boris at codesynthesis.com> wrote:
>
> One idea came into my head: we could provide hooks in the scheduler
> to allow a build system module to interface with something that
> controls hardware thread allocation (we would also need some kind
> of a command line option for module pre-load since such a module
> wouldn't be loaded from buildfiles). One could then implement a module
> that provides the jobserver functionality either just the client or
> the server and either from scratch or by reusing GNU make. Just putting
> the idea out there.

That's an interesting idea indeed. It would require implementing the
jobserver somehow as you said (either custom or reusing GNU make), but
at least it would be somewhat isolated from the existing scheduler
then.

> Does the number of partititions somehow correlate with the number of
> TUs being linked?

I think for -flto-partition=balanced (the default) the number of
partitions is always statically determined by --param
lto-partitions=m, regardless of the number of TUs (by default, m=32).
For -flto-partition=1to1, m might be equivalent to the number of TUs.

> 1. We could supply too few hardware threads (e.g., because other
>    threads are still being used by build2 at the start of linking
>    but may become available during linking).

Yeah, this is the problem that would require integrating linker stages
2-3 with the build2 jobserver somehow.

> 2. We could supply too many hardware threads that the linker cannot
>    utilize but that build2 could have used for other tasks.

Based on my reading of how the partitioning works, as long as
lto-partitons=m > -flto=n, I think it should be possible for the
linker to fully utilize all given threads. This might require some
experimentation and verification, though.

> But it's probably a good enough first approximation.

Yeah, that's kind of what I was thinking.

With the current understanding of the partitioning, I was also
wondering if -flto=auto might be as likely to OOM as we previously
thought (i.e. with current build2 without changes).

build2 will already spawn as many linker threads (each currently
spawning 1 process when CFLAGS has -flto or -flto=1) as the scheduler
has available (call it n). Let's say those n threads are linking TU's
with cumulative size N_i bytes (or some other metric of code
complexity, like nodes in the tree representation), 0<i<n. Then
perhaps the total memory usage would be sum_i {N_i}

If CFLAGS is instead -flto=m or -flto=auto build2 (and children) will
spawn n*m threads, but if the m LTO threads are ~equally sized
partitions of the TUs (which balanced algorithm tries to do), their
size is e.g. M_ij = N_i / m now, and the memory usage would be
sum_{i=1}^n \sum_{j=1}^m {N_i / m} = sum_i {N_i}, which is the same.

This of course has some assumptions about the cumulative memory of the
m partitioned threads being the same as 1 thread doing the full link,
but maybe it's not far off. In any case, there are still concerns
about CPU starvation with n*m threads in the latter case, so it'd
probably be good to address regardless.