[build2] LTO and parallelization with build2
mkrupcale at matthewkrupcale.com
Tue Aug 11 03:18:09 UTC 2020
On Mon, Aug 10, 2020 at 10:48 AM Boris Kolpackov
<boris at codesynthesis.com> wrote:
> I wonder why Fedora doesn't default to that.
Probably because both the LTO and Clang toolchain changes are
new for F33. Both have been supported in the distribution already,
but these changes are (potentially) distribution-wide.
> If a project has a single final link stage (e.g., an executable), then
> the linker should be given all the available threads. It would be good
> to confirm at least this is the case.
After applying your scheduler fixes and setting m=0 in the
alloc_guard, this appears to work correctly.
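
To illustrate the behavior being checked here (a lone link step gets every free thread, while concurrent link steps share them), here is a toy model in Python. It is only a sketch: the ThreadBudget class, its method signatures, and the convention that 0 means "as many as are free" are assumptions made for this example and are not build2's actual scheduler or alloc_guard API.

```python
import threading


class ThreadBudget:
    """Toy model of a shared thread budget (hypothetical; not build2's scheduler)."""

    def __init__(self, max_threads: int) -> None:
        self._lock = threading.Lock()
        self._free = max_threads

    def allocate(self, wanted: int = 0) -> int:
        """Reserve up to `wanted` threads; 0 means 'as many as are free'.

        Returns the number actually granted, which may be 0.
        """
        with self._lock:
            granted = self._free if wanted == 0 else min(wanted, self._free)
            self._free -= granted
            return granted

    def deallocate(self, count: int) -> None:
        """Return `count` previously granted threads to the budget."""
        with self._lock:
            self._free += count


budget = ThreadBudget(4)      # e.g. a machine with 4 hardware threads
solo = budget.allocate()      # a single final link takes everything
print(solo)                   # 4
budget.deallocate(solo)

first = budget.allocate(2)    # two links running side by side
second = budget.allocate()    # the second takes whatever is left
print(first, second)          # 2 2
budget.deallocate(first)
budget.deallocate(second)
```

The point of returning the granted count is that the caller can pass it straight to the linker as its job limit and hand back exactly that many threads afterwards.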
> I think these are due to a silly bug (one of those "made sure everything
> is correct except the most trivial part") in my implementation of
> allocate()/deallocate() which is now fixed in master.
When I saw the original implementation, I was wondering about those
particular lines, but I thought maybe my understanding of the
scheduler was wrong (i.e., perhaps active_ referred to threads that
are active internally to the scheduler, or something like that).
> Can you give it a try and see if you get a more sensible behavior?
Yes, I've pushed changes to the same branch rebased on top of your
scheduler fix, and things are working much better. Here are some
timings (3 trials) on my 4C4T system for linking the
executables/libraries in build2 (this is only the linking; compilation
is already done):
$ b /tmp/build2-test-build/
$ for i in $(seq 3); do
>   find /tmp/build2-test-build/ -type f -executable -delete
>   time b /tmp/build2-test-build/
> done

master, -flto=auto:

      trial 1      trial 2      trial 3    | mean
real  6m33.626s    6m27.739s    6m29.363s  | 6m30.243s
user  22m55.659s   22m29.588s   22m56.146s | 22m47.131s
sys   1m11.267s    1m10.142s    1m11.252s  | 1m10.887s

master, -flto=1:

      trial 1      trial 2      trial 3    | mean
real  5m56.454s    5m53.458s    5m54.255s  | 5m54.722s
user  19m22.788s   19m20.051s   19m22.424s | 19m21.754s
sys   1m2.472s     1m2.406s     1m3.169s   | 1m2.682s

lto-parallelization, -flto=auto:

      trial 1      trial 2      trial 3    | mean
real  5m53.139s    5m46.499s    5m49.596s  | 5m49.745s
user  19m16.702s   19m15.868s   19m15.820s | 19m16.130s
sys   1m1.647s     1m1.269s     1m1.649s   | 1m1.522s

So the lto-parallelization branch with -flto=auto reduces real/user
time relative to the master branch with both -flto=auto (-10%/-15%)
and -flto=1 (-1.4%/-0.5%). The difference will probably be greater on
machines with more cores and with fewer objects that can be linked in
parallel.
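
For reference, the percentages above can be reproduced from the mean timings with a few lines of Python. This sketch assumes the three timing blocks are, in order, master with -flto=auto, master with -flto=1, and lto-parallelization with -flto=auto (the ordering that matches the quoted percentages). The helper names are made up for this example.

```python
def to_seconds(t: str) -> float:
    """Convert a `time`-style duration such as '6m30.243s' to seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def rel_change(new: str, old: str) -> float:
    """Percentage change of `new` relative to `old` (negative means faster)."""
    return (to_seconds(new) / to_seconds(old) - 1.0) * 100.0


# Mean real/user times copied from the timings above
# (assumed block order: master auto, master 1, branch auto).
master_auto = {"real": "6m30.243s", "user": "22m47.131s"}
master_one = {"real": "5m54.722s", "user": "19m21.754s"}
branch_auto = {"real": "5m49.745s", "user": "19m16.130s"}

for label, base in (("master -flto=auto", master_auto),
                    ("master -flto=1", master_one)):
    real = rel_change(branch_auto["real"], base["real"])
    user = rel_change(branch_auto["user"], base["user"])
    print(f"vs {label}: real {real:+.1f}%, user {user:+.1f}%")
# vs master -flto=auto: real -10.4%, user -15.4%
# vs master -flto=1: real -1.4%, user -0.5%
```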