[build2] LTO and parallelization with build2

Tue Aug 11 03:18:09 UTC 2020

On Mon, Aug 10, 2020 at 10:48 AM Boris Kolpackov
<boris at codesynthesis.com> wrote:
>
> I wonder why Fedora doesn't default to that.

Probably because both the LTO[1] and Clang toolchain[2] changes are
new for F33. Both have already been supported in the distribution,
but these changes are (potentially) distribution-wide.

> If a project has a single final link stage (e.g., an executable), then
> the linker should be given all the available threads. It would be good
> to confirm at least this is the case.

After applying your scheduler fixes and setting m=0 in the
alloc_guard, this appears to work correctly.

> I think these are due to a silly bug (one of those "made sure everything
> is correct except the most trivial part") in my implementation of
> allocate()/deallocate() which is now fixed in master.

When I saw the original implementation, I wondered about those
particular lines, but I thought maybe my understanding of the
scheduler was wrong (i.e. perhaps active_ referred to threads that
are active internally to the scheduler, or something similar).

> Can you give it a try and see if you get a more sensible behavior?

Yes, I've pushed changes to the same branch rebased on top of your
scheduler fix, and things are working much better. Here are some
timings (3 trials) on my 4C4T system for linking the
executables/libraries in build2 (this is only the linking, compilation
is already done):

$ b /tmp/build2-test-build/
$ for i in $(seq 3); do
> find /tmp/build2-test-build/ -type f -executable -delete
> time b /tmp/build2-test-build/
> done

master -flto=auto
        trial 1       trial 2       trial 3       |  mean
real    6m33.626s     6m27.739s     6m29.363s     |  6m30.243s
user    22m55.659s    22m29.588s    22m56.146s    |  22m47.131s
sys     1m11.267s     1m10.142s     1m11.252s     |  1m10.887s

master -flto=1
        trial 1       trial 2       trial 3       |  mean
real    5m56.454s     5m53.458s     5m54.255s     |  5m54.722s
user    19m22.788s    19m20.051s    19m22.424s    |  19m21.754s
sys     1m2.472s      1m2.406s      1m3.169s      |  1m2.682s

lto-parallelization -flto=auto
        trial 1       trial 2       trial 3       |  mean
real    5m53.139s     5m46.499s     5m49.596s     |  5m49.745s
user    19m16.702s    19m15.868s    19m15.820s    |  19m16.130s
sys     1m1.647s      1m1.269s      1m1.649s      |  1m1.522s

So the lto-parallelization branch with -flto=auto reduces the mean
real/user time relative to the master branch with both -flto=auto
(-10%/-15%) and -flto=1 (-1.4%/-0.5%). The difference will probably be
greater on machines with more cores and with fewer objects that can be
linked in parallel.
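
For anyone who wants to re-check the percentages, here is a minimal
Python sketch that recomputes the per-configuration means and the
relative changes from the real/user timings above. The helper names
(to_seconds, mean_seconds, relative_change) are made up for this
illustration and are not part of build2; the input format is assumed to
be bash's `time` output ("XmY.ZZZs").

```python
import re

def to_seconds(value: str) -> float:
    """Convert a bash `time` value such as '6m33.626s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if m is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

def mean_seconds(trials) -> float:
    """Arithmetic mean of several `time` values, in seconds."""
    return sum(to_seconds(t) for t in trials) / len(trials)

def relative_change(new: float, old: float) -> float:
    """Change of `new` relative to `old`, in percent (negative = faster)."""
    return (new - old) / old * 100.0

# Timings copied from the three runs reported above.
real = {
    "master -flto=auto": ["6m33.626s", "6m27.739s", "6m29.363s"],
    "master -flto=1": ["5m56.454s", "5m53.458s", "5m54.255s"],
    "lto-parallelization -flto=auto": ["5m53.139s", "5m46.499s", "5m49.596s"],
}
user = {
    "master -flto=auto": ["22m55.659s", "22m29.588s", "22m56.146s"],
    "master -flto=1": ["19m22.788s", "19m20.051s", "19m22.424s"],
    "lto-parallelization -flto=auto": ["19m16.702s", "19m15.868s", "19m15.820s"],
}

branch = "lto-parallelization -flto=auto"
for baseline in ("master -flto=auto", "master -flto=1"):
    for label, table in (("real", real), ("user", user)):
        new = mean_seconds(table[branch])
        old = mean_seconds(table[baseline])
        print(f"{label} vs {baseline}: {relative_change(new, old):+.1f}%")
```

The comparison uses the mean of the three trials rather than any single
run, which matches the last column of the tables.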

Best,
Matthew

[1] https://fedoraproject.org/wiki/LTOByDefault
[2] https://fedoraproject.org/wiki/Changes/CompilerPolicy