Opened 7 months ago

Last modified 5 months ago

#13739 infoneeded bug

very slow linking of executables with ld.bfd < 2.27

Reported by: duog Owned by:
Priority: high Milestone: 8.2.1
Component: Compiler Version: 8.2.1-rc2
Keywords: Cc: niteria, rwbarton, nh2
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #13541 Differential Rev(s):
Wiki Page:

Description

While building profiled executables with 8.2.1rc2 I've noticed the link times seem to have significantly regressed.

I don't have a minimal test case.

Testing on cabal head source tree

cabal --version
>cabal-install version 2.1.0.0
>compiled using version 2.1.0.0 of the Cabal library 
cabal new-configure --enable-profiling --enable-newer --with-ghc=/opt/ghc-8.2.1/bin/ghc
cabal build cabal-install

# hit Ctrl-C during linking
time cabal build cabal-install

gives real 1m54.833s user 1m52.936s sys 0m1.620s

doing the same with 8.0.2 links in less than a second

Attachments (5)

run-test.2 (890 bytes) - added by duog 7 months ago.
out (16.0 KB) - added by duog 7 months ago.
run-test (1017 bytes) - added by duog 7 months ago.
T13739-check-ld (1.0 KB) - added by duog 6 months ago.
Script to time ghc with different versions of ld
T13739-check-ld-output (1.9 KB) - added by duog 6 months ago.

Download all attachments as: .zip

Change History (35)

comment:1 Changed 7 months ago by bgamari

What cabal commit are you looking at?

I tried building cabal b3af711a492970718d31ffdc46d31110109ddc42 with GHC master (6f8c3ce4b1dac02acf93c351862d3b6c46815840) but was unable to reproduce this. Link times are long (real 0m5.947s user 0m0.028s sys 0m5.909s) but nothing like you report.

I'll try again with 8.2.1-rc2.

Last edited 7 months ago by bgamari (previous) (diff)

comment:2 Changed 7 months ago by bgamari

I tried 8.2.1-rc2 and it took a similar amount of time (real 0m6.241s user 0m0.013s sys 0m6.227s) to link.

However, then I tried 8.0.2 and indeed saw it was quite a bit faster (real 0m2.888s user 0m0.009s sys 0m2.878s). I still don't know you ended up seeing two minute link times though.

comment:3 Changed 7 months ago by duog

I am working on github/haskell/cabal commit 0b5f13d71e3282df03495e1db75cd8375b0879ab

I believe I have not been testing with rc2, but various releases from the rc2 branch obtained from https://launchpad.net/~hvr/+archive/ubuntu/ghc. I replicated this again with ghc-8.2.0.20170521.

I am running ubuntu 16.04 with all my packages upgraded.

comment:4 Changed 7 months ago by niteria

I'm getting 2 minute compile times when linking profiled GHC HEAD. Running with -v reveals that GHC is stuck on gcc doing linking

Here's the stuck command: https://phabricator.haskell.org/P151, it takes 135s to complete. When I remove -Wl,--gc-sections, it's only 11s.

It's part of this GHC invocation:

"inplace/bin/ghc-stage1" -o ghc/stage2/build/tmp/ghc-stage2 -hisuf p_hi -osuf  p_o -hcsuf p_hc -static -prof -eventlog  -O0 -H64m -Wall       -hide-all-packages -
i -ighc/. -ighc/stage2/build -Ighc/stage2/build -ighc/stage2/build/ghc/autogen -Ighc/stage2/build/ghc/autogen    -optP-DGHCI -optP-include -optPghc/stage2/build/ghc/autogen/cabal_macros.h -package-id base-4.10.0.0 -package-id array-0.5.1.2 -package-id bytestring-0.10.8.2 -package-id directory-1.3.0.2 -package-id process-1.6.0.0 -package-id filepath-1.4.1.2 -pack
age-id ghc-boot-8.3 -package-id ghc-8.3 -package-id unix-2.7.2.2 -package-id containers-0.5.10.2 -package-id deepseq-1.4.3.0 -package-id ghci-8.3 -package-id haskeline-0.7.4.0 -packa
ge-id time-1.8.0.1 -package-id transformers-0.5.2.0 -Wall -fno-warn-name-shadowing -XHaskell2010  -O -no-hs-main -threaded -no-user-package-db -rtsopts      -Wnoncanonical-monad-inst
ances  -odir ghc/stage2/build -hidir ghc/stage2/build -stubdir ghc/stage2/build  -split-sections   -static -prof -eventlog  -O0 -H64m -Wall       -hide-all-packages -i -ighc/. -ighc/
stage2/build -Ighc/stage2/build -ighc/stage2/build/ghc/autogen -Ighc/stage2/build/ghc/autogen    -optP-DGHCI -optP-include -optPghc/stage2/build/ghc/autogen/cabal_macros.h -package-id base-4.10.0.0 -package-id array-0.5.1.2 -package-id bytestring-0.10.8.2 -package-id directory-1.3.0.2 -package-id process-1.6.0.0 -package-id filepath-1.4.1.2 -package-id ghc-boot-
8.3 -package-id ghc-8.3 -package-id unix-2.7.2.2 -package-id containers-0.5.10.2 -package-id deepseq-1.4.3.0 -package-id ghci-8.3 -package-id haskeline-0.7.4.0 -package-id time-1.8.0
.1 -package-id transformers-0.5.2.0 -Wall -fno-warn-name-shadowing -XHaskell2010  -O -no-hs-main -threaded -no-user-package-db -rtsopts      -Wnoncanonical-monad-instances  ghc/stage
2/build/Main.p_o ghc/stage2/build/GHCi/UI.p_o ghc/stage2/build/GHCi/UI/Info.p_o ghc/stage2/build/GHCi/UI/Monad.p_o ghc/stage2/build/GHCi/UI/Tags.p_o ghc/stage2/build/hschooks.p_o -v -keep-tmp-files

My gcc version:

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@doug: Can you run the command that hangs with: cabal build cabal-install --ghc-options='-v' to check if it's an instance of the same problem?

If it's also stuck on gcc you can run it with cabal build cabal-install --ghc-options='-v -keep-tmp-files' and run the gcc command by itself. Then try removing -Wl,--gc-sections.

comment:5 Changed 7 months ago by niteria

Running the ld command with --print-gc-sections gives me 254934 lines. I don't know if that's big or small.

comment:6 Changed 7 months ago by duog

I just tried building a profiled executable with -fwhole-archive-hs-libs, which seems to only control the -Wl,--gc-sections flag, and it is much faster.

I'll replicate my previous setup and try this later today.

comment:7 Changed 7 months ago by niteria

Cc: niteria added

comment:8 Changed 7 months ago by duog

I have attached a script "run-test" and my output "out" that exhibits a large slowdown in linking a trivial profiled executable.

It times linking a file Main.hs with main = return () with ghc-8.0.2 and ghc-8.2.0.2017.0522.

It shows that the linking times has slowed considerably, and that -fwhole-archive-hs-libs speeds up linking by ~80%, but it's still 10x slower than ghc-8.0.2

the three linker invocations are:

# ghc -O0 --make -no-link -prof Main.hs &> /dev/null
# time (ghc -O0 --make -prof Main.hs -v) &>> out
gcc -fno-stack-protector -DTABLES_NEXT_TO_CODE '-Wl,--hash-size=31' -Wl,--reduce-memory-overheads -Wl,--no-as-needed -no-pie -Wl,--gc-sections Main.o -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/base-4.10.0.0 -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/integer-gmp-1.0.0.1 -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/ghc-prim-0.5.0.0 -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/rts /tmp/ghc5140_0/ghc_2.o /tmp/ghc5140_0/ghc_5.o -Wl,-u,base_GHCziInt_I16zh_con_info -Wl,-u,base_GHCziInt_I32zh_con_info -Wl,-u,base_GHCziInt_I64zh_con_info -Wl,-u,base_GHCziInt_I8zh_con_info -Wl,-u,base_GHCziPtr_FunPtr_con_info -Wl,-u,base_GHCziPtr_Ptr_con_info -Wl,-u,base_GHCziStable_StablePtr_con_info -Wl,-u,base_GHCziWord_W16zh_con_info -Wl,-u,base_GHCziWord_W32zh_con_info -Wl,-u,base_GHCziWord_W64zh_con_info -Wl,-u,base_GHCziWord_W8zh_con_info -Wl,-u,ghczmprim_GHCziTypes_Czh_con_info -Wl,-u,ghczmprim_GHCziTypes_Dzh_con_info -Wl,-u,ghczmprim_GHCziTypes_Fzh_con_info -Wl,-u,ghczmprim_GHCziTypes_Izh_con_info -Wl,-u,ghczmprim_GHCziTypes_Wzh_con_info -Wl,-u,ghczmprim_GHCziTuple_Z0T_closure -Wl,-u,ghczmprim_GHCziTypes_False_closure -Wl,-u,ghczmprim_GHCziTypes_True_closure -Wl,-u,base_GHCziPack_unpackCString_closure -Wl,-u,base_GHCziIOziException_stackOverflow_closure -Wl,-u,base_GHCziIOziException_heapOverflow_closure -Wl,-u,base_ControlziExceptionziBase_nonTermination_closure -Wl,-u,base_GHCziIOziException_blockedIndefinitelyOnMVar_closure -Wl,-u,base_GHCziIOziException_blockedIndefinitelyOnSTM_closure -Wl,-u,base_GHCziIOziException_allocationLimitExceeded_closure -Wl,-u,base_GHCziIOziException_cannotCompactFunction_closure -Wl,-u,base_GHCziIOziException_cannotCompactPinned_closure -Wl,-u,base_GHCziIOziException_cannotCompactMutable_closure -Wl,-u,base_ControlziExceptionziBase_nestedAtomically_closure -Wl,-u,base_GHCziEventziThread_blockedOnBadFD_closure -Wl,-u,base_GHCziWeak_runFinalizzerBatch_closure -Wl,-u,base_GHCziTopHandler_flushStdHandles_closure -Wl,-u,base_GHCziTopHandler_runIO_closure -Wl,-u,base_GHCziTopHandler_runNonIO_closure -Wl,-u,base_GHCziTopHandler_runMainIO_closure -Wl,-u,base_GHCziConcziIO_ensureIOManagerIsRunning_closure -Wl,-u,base_GHCziConcziIO_ioManagerCapabilitiesChanged_closure -Wl,-u,base_GHCziConcziSync_runSparks_closure -Wl,-u,base_GHCziConcziSignal_runHandlersPtr_closure -Wl,-u,hs_atomic_add8 -Wl,-u,hs_atomic_add16 -Wl,-u,hs_atomic_add32 -Wl,-u,hs_atomic_add64 -Wl,-u,hs_atomic_sub8 -Wl,-u,hs_atomic_sub16 -Wl,-u,hs_atomic_sub32 -Wl,-u,hs_atomic_sub64 -Wl,-u,hs_atomic_and8 -Wl,-u,hs_atomic_and16 -Wl,-u,hs_atomic_and32 -Wl,-u,hs_atomic_and64 -Wl,-u,hs_atomic_nand8 -Wl,-u,hs_atomic_nand16 -Wl,-u,hs_atomic_nand32 -Wl,-u,hs_atomic_nand64 -Wl,-u,hs_atomic_or8 -Wl,-u,hs_atomic_or16 -Wl,-u,hs_atomic_or32 -Wl,-u,hs_atomic_or64 -Wl,-u,hs_atomic_xor8 -Wl,-u,hs_atomic_xor16 -Wl,-u,hs_atomic_xor32 -Wl,-u,hs_atomic_xor64 -Wl,-u,hs_cmpxchg8 -Wl,-u,hs_cmpxchg16 -Wl,-u,hs_cmpxchg32 -Wl,-u,hs_cmpxchg64 -Wl,-u,hs_atomicread8 -Wl,-u,hs_atomicread16 -Wl,-u,hs_atomicread32 -Wl,-u,hs_atomicread64 -Wl,-u,hs_atomicwrite8 -Wl,-u,hs_atomicwrite16 -Wl,-u,hs_atomicwrite32 -Wl,-u,hs_atomicwrite64 -lHSbase-4.10.0.0_p -lHSinteger-gmp-1.0.0.1_p -lHSghc-prim-0.5.0.0_p -lHSrts_p -lCffi_p -lgmp -lm -lrt -ldl -lpthread
# ghc -O0 --make -no-link -prof -fwhole-archive-hs-libs Main.hs &> /dev/null
# time (ghc -O0 --make -prof -fwhole-archive-hs-libs Main.hs -v) &>> out
gcc -fno-stack-protector -DTABLES_NEXT_TO_CODE '-Wl,--hash-size=31' -Wl,--reduce-memory-overheads -Wl,--no-as-needed -o Main -no-pie Main.o -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/base-4.10.0.0 -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/integer-gmp-1.0.0.1 -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/ghc-prim-0.5.0.0 -L/opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/rts /tmp/ghc5241_0/ghc_2.o /tmp/ghc5241_0/ghc_5.o -Wl,-u,base_GHCziInt_I16zh_con_info -Wl,-u,base_GHCziInt_I32zh_con_info -Wl,-u,base_GHCziInt_I64zh_con_info -Wl,-u,base_GHCziInt_I8zh_con_info -Wl,-u,base_GHCziPtr_FunPtr_con_info -Wl,-u,base_GHCziPtr_Ptr_con_info -Wl,-u,base_GHCziStable_StablePtr_con_info -Wl,-u,base_GHCziWord_W16zh_con_info -Wl,-u,base_GHCziWord_W32zh_con_info -Wl,-u,base_GHCziWord_W64zh_con_info -Wl,-u,base_GHCziWord_W8zh_con_info -Wl,-u,ghczmprim_GHCziTypes_Czh_con_info -Wl,-u,ghczmprim_GHCziTypes_Dzh_con_info -Wl,-u,ghczmprim_GHCziTypes_Fzh_con_info -Wl,-u,ghczmprim_GHCziTypes_Izh_con_info -Wl,-u,ghczmprim_GHCziTypes_Wzh_con_info -Wl,-u,ghczmprim_GHCziTuple_Z0T_closure -Wl,-u,ghczmprim_GHCziTypes_False_closure -Wl,-u,ghczmprim_GHCziTypes_True_closure -Wl,-u,base_GHCziPack_unpackCString_closure -Wl,-u,base_GHCziIOziException_stackOverflow_closure -Wl,-u,base_GHCziIOziException_heapOverflow_closure -Wl,-u,base_ControlziExceptionziBase_nonTermination_closure -Wl,-u,base_GHCziIOziException_blockedIndefinitelyOnMVar_closure -Wl,-u,base_GHCziIOziException_blockedIndefinitelyOnSTM_closure -Wl,-u,base_GHCziIOziException_allocationLimitExceeded_closure -Wl,-u,base_GHCziIOziException_cannotCompactFunction_closure -Wl,-u,base_GHCziIOziException_cannotCompactPinned_closure -Wl,-u,base_GHCziIOziException_cannotCompactMutable_closure -Wl,-u,base_ControlziExceptionziBase_nestedAtomically_closure -Wl,-u,base_GHCziEventziThread_blockedOnBadFD_closure -Wl,-u,base_GHCziWeak_runFinalizzerBatch_closure -Wl,-u,base_GHCziTopHandler_flushStdHandles_closure -Wl,-u,base_GHCziTopHandler_runIO_closure -Wl,-u,base_GHCziTopHandler_runNonIO_closure -Wl,-u,base_GHCziTopHandler_runMainIO_closure -Wl,-u,base_GHCziConcziIO_ensureIOManagerIsRunning_closure -Wl,-u,base_GHCziConcziIO_ioManagerCapabilitiesChanged_closure -Wl,-u,base_GHCziConcziSync_runSparks_closure -Wl,-u,base_GHCziConcziSignal_runHandlersPtr_closure -Wl,-u,hs_atomic_add8 -Wl,-u,hs_atomic_add16 -Wl,-u,hs_atomic_add32 -Wl,-u,hs_atomic_add64 -Wl,-u,hs_atomic_sub8 -Wl,-u,hs_atomic_sub16 -Wl,-u,hs_atomic_sub32 -Wl,-u,hs_atomic_sub64 -Wl,-u,hs_atomic_and8 -Wl,-u,hs_atomic_and16 -Wl,-u,hs_atomic_and32 -Wl,-u,hs_atomic_and64 -Wl,-u,hs_atomic_nand8 -Wl,-u,hs_atomic_nand16 -Wl,-u,hs_atomic_nand32 -Wl,-u,hs_atomic_nand64 -Wl,-u,hs_atomic_or8 -Wl,-u,hs_atomic_or16 -Wl,-u,hs_atomic_or32 -Wl,-u,hs_atomic_or64 -Wl,-u,hs_atomic_xor8 -Wl,-u,hs_atomic_xor16 -Wl,-u,hs_atomic_xor32 -Wl,-u,hs_atomic_xor64 -Wl,-u,hs_cmpxchg8 -Wl,-u,hs_cmpxchg16 -Wl,-u,hs_cmpxchg32 -Wl,-u,hs_cmpxchg64 -Wl,-u,hs_atomicread8 -Wl,-u,hs_atomicread16 -Wl,-u,hs_atomicread32 -Wl,-u,hs_atomicread64 -Wl,-u,hs_atomicwrite8 -Wl,-u,hs_atomicwrite16 -Wl,-u,hs_atomicwrite32 -Wl,-u,hs_atomicwrite64 -Wl,--whole-archive -lHSbase-4.10.0.0_p -lHSinteger-gmp-1.0.0.1_p -lHSghc-prim-0.5.0.0_p -lHSrts_p -lCffi_p -Wl,--no-whole-archive -lgmp -lm -lrt -ldl -lpthread
# /opt/ghc/8.0.2/bin/ghc -O0 --make -no-link -prof Main.hs &> /dev/null
# time (/opt/ghc/8.0.2/bin/ghc -O0 --make -prof Main.hs -v) &>> out
/usr/bin/gcc -fno-stack-protector -DTABLES_NEXT_TO_CODE '-Wl,--hash-size=31' -Wl,--reduce-memory-overheads -Wl,--no-as-needed -no-pie -Wl,--gc-sections Main.o -L/opt/ghc/8.0.2/lib/ghc-8.0.2/base-4.9.1.0 -L/opt/ghc/8.0.2/lib/ghc-8.0.2/integer-gmp-1.0.0.1 -L/opt/ghc/8.0.2/lib/ghc-8.0.2/ghc-prim-0.5.0.0 -L/opt/ghc/8.0.2/lib/ghc-8.0.2/rts /tmp/ghc5284_0/ghc_2.o /tmp/ghc5284_0/ghc_5.o -Wl,-u,ghczmprim_GHCziTypes_Izh_static_info -Wl,-u,ghczmprim_GHCziTypes_Czh_static_info -Wl,-u,ghczmprim_GHCziTypes_Fzh_static_info -Wl,-u,ghczmprim_GHCziTypes_Dzh_static_info -Wl,-u,base_GHCziPtr_Ptr_static_info -Wl,-u,ghczmprim_GHCziTypes_Wzh_static_info -Wl,-u,base_GHCziInt_I8zh_static_info -Wl,-u,base_GHCziInt_I16zh_static_info -Wl,-u,base_GHCziInt_I32zh_static_info -Wl,-u,base_GHCziInt_I64zh_static_info -Wl,-u,base_GHCziWord_W8zh_static_info -Wl,-u,base_GHCziWord_W16zh_static_info -Wl,-u,base_GHCziWord_W32zh_static_info -Wl,-u,base_GHCziWord_W64zh_static_info -Wl,-u,base_GHCziStable_StablePtr_static_info -Wl,-u,ghczmprim_GHCziTypes_Izh_con_info -Wl,-u,ghczmprim_GHCziTypes_Czh_con_info -Wl,-u,ghczmprim_GHCziTypes_Fzh_con_info -Wl,-u,ghczmprim_GHCziTypes_Dzh_con_info -Wl,-u,base_GHCziPtr_Ptr_con_info -Wl,-u,base_GHCziPtr_FunPtr_con_info -Wl,-u,base_GHCziStable_StablePtr_con_info -Wl,-u,ghczmprim_GHCziTypes_False_closure -Wl,-u,ghczmprim_GHCziTypes_True_closure -Wl,-u,base_GHCziPack_unpackCString_closure -Wl,-u,base_GHCziIOziException_stackOverflow_closure -Wl,-u,base_GHCziIOziException_heapOverflow_closure -Wl,-u,base_ControlziExceptionziBase_nonTermination_closure -Wl,-u,base_GHCziIOziException_blockedIndefinitelyOnMVar_closure -Wl,-u,base_GHCziIOziException_blockedIndefinitelyOnSTM_closure -Wl,-u,base_GHCziIOziException_allocationLimitExceeded_closure -Wl,-u,base_ControlziExceptionziBase_nestedAtomically_closure -Wl,-u,base_GHCziEventziThread_blockedOnBadFD_closure -Wl,-u,base_GHCziWeak_runFinalizzerBatch_closure -Wl,-u,base_GHCziTopHandler_flushStdHandles_closure -Wl,-u,base_GHCziTopHandler_runIO_closure -Wl,-u,base_GHCziTopHandler_runNonIO_closure -Wl,-u,base_GHCziConcziIO_ensureIOManagerIsRunning_closure -Wl,-u,base_GHCziConcziIO_ioManagerCapabilitiesChanged_closure -Wl,-u,base_GHCziConcziSync_runSparks_closure -Wl,-u,base_GHCziConcziSignal_runHandlersPtr_closure -lHSbase-4.9.1.0_p -lHSinteger-gmp-1.0.0.1_p -lHSghc-prim-0.5.0.0_p -lHSrts_p -lCffi_p -lgmp -lm -lrt -ldl -lpthread

Note that -fwhole-archive-hs-libs does also add -Wl,--whole-archive to the linker invocation. running the first linker invocation from the log with -Wl,--gc-sections removed completes in 2s, about 2/3 the time as with -fwhole-archive-hs-libs.

Please disregard run-test.2, this was added erroneously as I updated the script.

Last edited 7 months ago by duog (previous) (diff)

Changed 7 months ago by duog

Attachment: run-test.2 added

Changed 7 months ago by duog

Attachment: out added

Changed 7 months ago by duog

Attachment: run-test added

comment:9 Changed 7 months ago by RyanGlScott

Thanks for the script, duog! It'll help me figure out what commit(s) introduced this regression.

I did some quick testing on my Ubuntu machine with gcc-4.8.4. As baselines, I tested GHC 8.0.2:

real    0m0.299s
user    0m0.256s
sys     0m0.028s

and GHC 8.2.1-rc2:

(Without -fwhole-archive-hs-libs)
real    0m12.946s
user    0m12.816s
sys     0m0.136s

(With -fwhole-archive-hs-libs)
real    0m2.676s
user    0m2.436s
sys     0m0.220s

I noticed a significant difference around commit b207b536ded40156f9adb168565ca78e1eef2c74 (Generalize kind of the (->) tycon). Before that commit, we have:

Commit efeaf9e436109cb35b491e08b5407c0598108186 (Bump nofib submodule)
------
(Without -fwhole-archive-hs-libs)
real    0m4.686s
user    0m4.600s
sys     0m0.068s

(With -fwhole-archive-hs-libs)
real    0m0.055s
user    0m0.040s
sys     0m0.008s

And after:

Commit b207b536ded40156f9adb168565ca78e1eef2c74 (Generalize kind of the (->) tycon)
------
(Without -fwhole-archive-hs-libs)
real    0m13.047s
user    0m12.948s
sys     0m0.104s

(With -fwhole-archive-hs-libs)
real    0m2.696s
user    0m2.480s
sys     0m0.196s

However, the times for commit efeaf9e436109cb35b491e08b5407c0598108186 are still not at parity with GHC 8.0.2, so an earlier commit must have contributed as well. I'll try to track down which one(s).

comment:10 Changed 7 months ago by RyanGlScott

Cc: rwbarton added

Commit 283acec1d7307fdbd8cd7b3f1d984a036366d6b4 (Make split sections by default work again) is what caused the other regression. Before that commit, I obtained these times with duog's script:

Commit 4d31880a0bfb3bf653c86f844fc8cf332c955547 (Fix comment (old filename '.lhs') in libraries/)
-----
(Without -fwhole-archive-hs-libs)
real	0m0.306s
user	0m0.260s
sys	0m0.028s

(With -fwhole-archive-hs-libs)
real	0m0.056s
user	0m0.036s
sys	0m0.012s

Notice that the -fwhole-archive-hs-libs times are pretty much identical to the GHC 8.0.2 times I found in comment:9.

After that commit, I obtained:

Commit 283acec1d7307fdbd8cd7b3f1d984a036366d6b4 (Make split sections by default work again)
-----
(Without -fwhole-archive-hs-libs)
real	0m4.656s
user	0m4.548s
sys	0m0.088s

(With -fwhole-archive-hs-libs)
real	0m0.056s
user	0m0.040s
sys	0m0.008s

Which is basically identical to the times I found pre-b207b536ded40156f9adb168565ca78e1eef2c74.

Last edited 7 months ago by RyanGlScott (previous) (diff)

comment:11 Changed 7 months ago by RyanGlScott

For reference, I decided to compare two builds of GHC 8.2.1-rc2 with and without reverting commit 283acec1d7307fdbd8cd7b3f1d984a036366d6b4, and the difference is astounding.

If you revert commit 283acec1d7307fdbd8cd7b3f1d984a036366d6b4, you get:

(Without -fwhole-archive-hs-libs)
real	0m0.277s
user	0m0.224s
sys	0m0.036s

(With -fwhole-archive-hs-libs)
real	0m5.447s
user	0m5.104s
sys	0m0.328s

And if you use plain 8.2.1-rc2 (without reverting commit 283acec1d7307fdbd8cd7b3f1d984a036366d6b4), you get:

(Without -fwhole-archive-hs-libs)
real	0m13.648s
user	0m12.716s
sys	0m0.156s

(With -fwhole-archive-hs-libs)
real	0m3.077s
user	0m2.500s
sys	0m0.216s

So that commit at least explains the differences without the use of the -fwhole-archive-hs-libs flag. However, it doesn't account for the jump in times _with_ the the -fwhole-archive-hs-libs flag. I'll try investigating that next.

Version 0, edited 7 months ago by RyanGlScott (next)

comment:12 Changed 7 months ago by RyanGlScott

Oops, please ignore the noise about -fwhole-archive-hs-libs above. I forgot that the commit which introduced -fwhole-archive-hs-libs (a6874e546294173c166859769dd8054887a6ded7) didn't come until after 283acec1d7307fdbd8cd7b3f1d984a036366d6b4 (Make split sections by default work again), so the times I got above were just from GHC erroring out. (I suppressed the stdout/stderr to avoid flooding my terminal, so I didn't notice this happening.)

Also, an observation that I meant to make earlier is that I think that b207b536ded40156f9adb168565ca78e1eef2c74 (Generalize kind of the (->) tycon), which I singled out in comment:9, is a red herring. I think commit 283acec1d7307fdbd8cd7b3f1d984a036366d6b4 (Make split sections by default work again) is the true culprit that causes a bad constant factor for each additional linked symbol, and commit b207b536ded40156f9adb168565ca78e1eef2c74 just happened to add a bunch of extra symbols. This is further evidenced by the fact that reverting 283acec1d7307fdbd8cd7b3f1d984a036366d6b4 alone brings the link time back down to the same as what I get with GHC 8.0.2.

comment:13 Changed 7 months ago by duog

That patch is a one-liner to a configure script that sets SplitSections=Yes by default while building ghc. Perhaps the problem is caused by how the libraries are linked with SplitSections=YES? If bgamari had SplitSections = NO in his .mk, then this would explain why he couldn't replicate the issue.

Last edited 7 months ago by duog (previous) (diff)

comment:14 Changed 7 months ago by duog

See https://phabricator.haskell.org/D1242#40188 Is it plausible that profiling libraries would have 100x more sections than non-profiling libraries?

Last edited 7 months ago by duog (previous) (diff)

comment:15 Changed 7 months ago by bgamari

Well, I certainly can't rule it out. Surely if this is the case it is avoidable, however. Can you run objdump on an affected object file?

comment:16 Changed 7 months ago by RyanGlScott

I tried objdump-ing object files produced by 8.0.2 and 8.2.1, and they look almost identical.

I have to wonder how much of this is due to GCC differences. For comparison, I just tried compiling a simple profiled executable on my machine with GCC 6.2.0, and the whole process only took about 1.5 seconds. (Compare that to the 13-second times that I experienced on my other machine with GCC 4.8.4.)

comment:17 in reply to:  16 Changed 7 months ago by duog

I have done some investigation and think I understand this now. Linking against libraries built with -split-sections is slow, and seems to be superlinear in the number of sections.

I have been testing on HEAD, using BuildFlavour = Validate. By default, profiling libraries are not built, and SplitSections = No.

Setting BUILD_PROF_LIBS = YES in build.mk and Linking the trivial program is only slightly slower than with 8.0.2 with and without -prof.

I have been using the following command to count sections in library files, this is counting the sections in the profiled base I built without SplitSections:

objdump -xw libraries/base/dist-install/build/libHSbase-4.10.0.0_p.a | wc -l
524008

This is counting the sections in the profiled base I got from launchpad:

objdump -xw /opt/ghc/8.2.1/lib/ghc-8.2.0.20170522/base-4.10.0.0/libHSbase-4.10.0.0_p.a | wc -l
1022947

Now setting SplitSections = YES, cleaning and building libraries

objdump -xw libraries/base/dist-install/build/libHSbase-4.10.0.0_p.a | wc -l
1031190

And the test case is as slow as it has been in my previous testing. I tried upgrading my gcc to 6.3.0, and this does not change the results. I haven't tried upgrading ld, I can't see how to do that.

It looks like the libraries from launchpad, both profiled and unprofiled, were built with SplitSections = Yes. The prof build flavour does not set SplitSections = No, unlike all of the other ones, so I suspect they came from a prof build.

I believe that the above explains what causes linking to be very slow, and why we have had difficulty reliably reproducing it. I don't know why many sections slows linking. My only idea comes from http://eli.thegreenplace.net/2013/07/09/library-order-in-static-linking:

Finally, if any of the objects in the library has been included in the link, the library is rescanned again - it's possible that symbols imported by the included object can be found in other objects within the same library.

That algorithm sounds quadratic in the worst case.

The following are the file sizes produced in various cases (83 has SplitSections libraries):

-rwxr-xr-x 1 doug fileservs  1507544 May 29 00:09 profiled_80
-rwxr-xr-x 1 doug fileservs  1511320 May 29 00:09 profiled_83
-rwxr-xr-x 1 doug fileservs 13138024 May 29 00:09 profiled_nogcsections_83
-rwxr-xr-x 1 doug fileservs  1040752 May 29 00:09 unprofiled_80
-rwxr-xr-x 1 doug fileservs   903616 May 29 00:09 unprofiled_83
-rwxr-xr-x 1 doug fileservs  7923416 May 29 00:09 unprofiled_nogcsections_83

and the times taken to produce them:

8.0 profiled command

real	0m0.183s
user	0m0.168s
sys	0m0.012s
8.2 profiled command

real	0m14.764s
user	0m14.616s
sys	0m0.144s
8.2 profiled command no gc-section s

real	0m2.138s
user	0m1.940s
sys	0m0.192s
8.0 unprofiled command

real	0m0.179s
user	0m0.140s
sys	0m0.036s
8.2 unprofiled command

real	0m1.749s
user	0m1.656s
sys	0m0.088s
8.2 unprofiled command no gc-sections

real	0m1.373s
user	0m1.220s
sys	0m0.148s

I think SplitSections should not be enabled in prof.mk (or any others), at least for libraries.

There is good reason to pass -Wl,--gc-sections to the linker always, since it lets you leverage libraries that have been compiled with -split-sections, but perhaps this is too dangerous while we don't understand exactly what causes the slowdown.

There should be better interface than -fwhole-archive-hs-libs to disable --gc-sections while doing a profiling build. One or many of:

  • Don't pass --gc-sections if profiling is enabled
  • * unless another flag is passed
  • A warning if -prof is passed without -fwhole-archive-hs-libs
  • * add another flag that only disables --gc-sections

comment:18 Changed 7 months ago by niteria

I wanted to compare gold to collect2, so I run linking of profiled ghc-stage2 under both with and without --gc-sections. The results:

no --gc-sections, collect2
real    0m9.456s
user    0m8.248s
sys     0m1.168s
161M    ghc/stage2/build/tmp/ghc-stage2

no --gc-sections, gold
real    0m3.122s
user    0m2.680s
sys     0m0.441s
161M    ghc/stage2/build/tmp/ghc-stage2

--gc-sections, collect2
real    2m14.460s
user    2m13.356s
sys     0m1.087s
129M    ghc/stage2/build/tmp/ghc-stage2

--gc-sections, gold
real    0m3.118s
user    0m2.691s
sys     0m0.411s
129M    ghc/stage2/build/tmp/ghc-stage2

Looks like gold doesn't even break a sweat.

comment:19 Changed 7 months ago by RyanGlScott

Really, you don't even need profiled code to notice a difference. On my machine with gcc-4.8.4/ld-2.24, I experience longer linking times with an ordinary, non-profiled main=return () executable:

$ time /opt/ghc/8.0.2/bin/ghc -fforce-recomp -O0 Bug.hs
[1 of 1] Compiling Main             ( Bug.hs, Bug.o )
Linking Bug ...

real	0m0.441s
user	0m0.360s
sys	0m0.060s
$ time /opt/ghc/8.2.1/bin/ghc -fforce-recomp -O0 Bug.hs
[1 of 1] Compiling Main             ( Bug.hs, Bug.o )
Linking Bug ...

real	0m1.616s
user	0m1.476s
sys	0m0.116s

On larger programs, the delay is more noticeable.

comment:20 Changed 6 months ago by RyanGlScott

Milestone: 8.2.1
Priority: normalhigh

Bumping the priority, as it would be nice to get some kind of resolution to this before GHC 8.2.1 is released.

Changed 6 months ago by duog

Attachment: T13739-check-ld added

Script to time ghc with different versions of ld

Changed 6 months ago by duog

Attachment: T13739-check-ld-output added

comment:21 Changed 6 months ago by duog

I've attached timings with 3 versions of ld obtained from http://ftp.gnu.org/gnu/binutils/.

These show that 2.27 fixes the issue we're running into. Still much slower than 8.0.2 though.

I would also note that I find the very large file sizes of 8.2.1-no-gc-sections relative to 8.0.2 in comment 17 weird. It looks, to my untrained eye, like many more .o files from the archives are getting pulled in.

comment:22 Changed 6 months ago by duog

Another note: -split-sections does not apply to shared libraries, and passing -dynamic to ghc shows no difference between 8.0.2 and 8.2.1. Profiled dynamic libraries do not appear to be packaged, so I have not tested the -dynamic -prof case, but I'd expect there to be no difference there either.

comment:23 Changed 6 months ago by duog

Summary: Very slow linking of profiled executablesvery slow linking of executables with ld.bfd < 2.27

This problem rears its head in an ugly fasion while configuring packages that have custom setups:

cabal unpack distributive-0.5.2
cd distributive-0.5.2
cabal sandbox init
cabal install --only-dependencies
time cabal configure

gives:

[1 of 1] Compiling Main             ( dist/setup/setup.hs, dist/setup/Main.o )
Linking ./dist/setup/setup ...
Configuring distributive-0.5.2...

real	0m21.580s
user	0m20.928s
sys	0m0.560s

while

time cabal configure --ghc-options=-dynamic

gives:

Resolving dependencies...
[1 of 1] Compiling Main             ( dist/setup/setup.hs, dist/setup/Main.o )
Linking ./dist/setup/setup ...
Configuring distributive-0.5.2...

real	0m1.830s
user	0m1.460s
sys	0m0.228s

comment:24 Changed 6 months ago by nh2

Cc: nh2 added

comment:25 Changed 6 months ago by nh2

I have filed an upstream GNU ld bug to ask whether they are aware what could have brought the 10x performance improvement that ld 2.27 brought over 2.26:

https://sourceware.org/bugzilla/show_bug.cgi?id=21677

comment:26 Changed 6 months ago by bgamari

Cc: nh2 removed
Status: newinfoneeded

I don't think there's much else that we can do about this. binutils 2.27 seems to fix the majority of the issue and ultimately the plan is to use gold more often anyways (see #13541).

comment:27 Changed 6 months ago by bgamari

Cc: nh2 added

comment:28 Changed 6 months ago by duog

I don't know whether anyone has investigated whether there is anything we are doing which is exasperating the problem. Executables produced by removing -Wl,--gc-sections from the linker invocation look like they're pulling in most of the archives. This does not occur with 8.0.2. I will see if it happens with gold once support it lands in the ghc-8.2 branch.

I would like to see the configure scripts disable split sectioned libraries if the linker is ld.bfd < 2.27, or at least warn that the user is in for an unpleasant experience.

EDIT: Executables produced by gold without -Wl,--gc-sections are indeed just as large as those produced by bfd

Last edited 6 months ago by duog (previous) (diff)

comment:29 Changed 6 months ago by duog

Also, would anyone be able to check whether this occurs for non ELF targets? I don't have suitable machines. The thread linked to in comment 25 suggests some candidates for the fix in 2.27, both of which are ELF specific.

Last edited 6 months ago by duog (previous) (diff)

comment:30 Changed 5 months ago by RyanGlScott

Note: See TracTickets for help on using tickets.