# [build2] LTO and parallelization with build2

Matthew Krupcale mkrupcale at matthewkrupcale.com
Wed Aug 5 16:24:13 UTC 2020

On Wed, Aug 5, 2020 at 8:21 AM Boris Kolpackov <boris at codesynthesis.com> wrote:
>
> One idea came into my head: we could provide hooks in the scheduler
> to allow a build system module to interface with something that
> controls hardware thread allocation (we would also need some kind
> of a command line option for module pre-load since such a module
> wouldn't be loaded from buildfiles). One could then implement a module
> that provides the jobserver functionality either just the client or
> the server and either from scratch or by reusing GNU make. Just putting
> the idea out there.

That's an interesting idea indeed. It would require implementing the
jobserver somehow as you said (either custom or reusing GNU make), but
at least it would be somewhat isolated from the existing scheduler
then.

> Does the number of partititions somehow correlate with the number of

I think for -flto-partition=balanced (the default) the number of
partitions is always statically determined by --param
lto-partitions=m, regardless of the number of TUs (by default, m=32).
For -flto-partition=1to1, m might be equivalent to the number of TUs.

> 1. We could supply too few hardware threads (e.g., because other
>    threads are still being used by build2 at the start of linking
>    but may become available during linking).

Yeah, this is the problem that would require integrating linker stages
2-3 with the build2 jobserver somehow.

> 2. We could supply too many hardware threads that the linker cannot
>    utilize but that build2 could have used for other tasks.

Based on my reading of how the partitioning works, as long as
lto-partitons=m > -flto=n, I think it should be possible for the
experimentation and verification, though.

> But it's probably a good enough first approximation.

Yeah, that's kind of what I was thinking.

With the current understanding of the partitioning, I was also
wondering if -flto=auto might be as likely to OOM as we previously
thought (i.e. with current build2 without changes).

spawning 1 process when CFLAGS has -flto or -flto=1) as the scheduler
has available (call it n). Let's say those n threads are linking TU's
with cumulative size N_i bytes (or some other metric of code
complexity, like nodes in the tree representation), 0<i<n. Then
perhaps the total memory usage would be sum_i {N_i}

If CFLAGS is instead -flto=m or -flto=auto build2 (and children) will
spawn n*m threads, but if the m LTO threads are ~equally sized
partitions of the TUs (which balanced algorithm tries to do), their
size is e.g. M_ij = N_i / m now, and the memory usage would be
sum_{i=1}^n \sum_{j=1}^m {N_i / m} = sum_i {N_i}, which is the same.

This of course has some assumptions about the cumulative memory of the