Opened 13 months ago

Last modified 4 weeks ago

#14741 new bug

High-memory usage during compilation using Template Haskell

Reported by: donatello Owned by: sighingnow
Priority: normal Milestone:
Component: Compiler Version: 8.2.2
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #16190 Differential Rev(s): Phab:D4384
Wiki Page:

Description (last modified by donatello)

When trying to embed some files into an executable using Template Haskell, I find that memory usage during compilation exceeds 4GB and often crashes my laptop. The files I am trying to embed are only about 25MB in size (totally 35MB in size).

I made a somewhat minimal example to demonstrate this problem. To embed the files, I am using the `file-embed` package (the issue persists when using the alternative `wai-app-static` package too). The code to demonstrate runs in Linux and is available here - https://github.com/donatello/file-embed-exp. To try it out, just clone the repository and run make (it uses the Haskell Stack tool and the Linux dd utility).

This appears to be an issue in GHC. Is there any way to mitigate the issue in the current version?

Related discussion: https://github.com/snoyberg/file-embed/issues/24

Change History (18)

comment:1 Changed 13 months ago by donatello

Description: modified (diff)

Fixed typo.

comment:2 Changed 13 months ago by donatello

Description: modified (diff)

Fix typo.

comment:3 Changed 13 months ago by mpickering

If you compile with -O0, does that make a difference?

Could you reduce the dependency footprint further to just rely on "cabal" and perhaps inline the specific parts of file-embed which you need?

comment:4 Changed 13 months ago by donatello

No, compiling with -O0 or -O2 has no effect. I see that embedding a 3MB file takes over 2.5GB of RAM!

I have updated the code to use only cabal and have managed to inline the specific parts of file-embed that I need (I am not very familiar with Template Haskell). The problem still persists. Now I am only trying to embed a 3MB file (created by the Makefile).

https://github.com/donatello/file-embed-exp

Pasting some relevant bits of code here:

EmbedFile.hs

{-# LANGUAGE TemplateHaskell #-}
module EmbedFile (embedFile) where

import qualified Data.ByteString            as B
import qualified Data.ByteString.Char8      as B8
import           Data.ByteString.Unsafe     (unsafePackAddressLen)
import           Language.Haskell.TH.Syntax (Exp (AppE, ListE, LitE, SigE, TupE, VarE),
                                             Lit (IntegerL, StringL, StringPrimL),
                                             Q, Quasi (qAddDependentFile),
                                             loc_filename, qLocation, runIO)
import           System.IO.Unsafe           (unsafePerformIO)

bsToExp :: B.ByteString -> Q Exp
bsToExp bs =
    return $ VarE 'unsafePerformIO
      `AppE` (VarE 'unsafePackAddressLen
      `AppE` LitE (IntegerL $ fromIntegral $ B8.length bs)
      `AppE` LitE (StringPrimL $ B.unpack bs))

embedFile :: FilePath -> Q Exp
embedFile fp =
    qAddDependentFile fp >>
    (runIO $ B.readFile fp) >>= bsToExp

Static.hs

{-# LANGUAGE TemplateHaskell #-}
module Static
    ( embedList
    ) where

import qualified Data.ByteString as B
import           System.IO       (FilePath)

import           EmbedFile       (embedFile)

embedList :: [(FilePath, B.ByteString)]
embedList = [("mypath", $(embedFile "build/3mb"))]

comment:5 Changed 13 months ago by mpickering

Thanks, good example.

I can compile it now with

ghc src/Main.hs -isrc 

Perhaps someone with a profiling tree already built can quickly run it on this program to see what is causing the allocations?

comment:6 Changed 13 months ago by sighingnow

Differential Rev(s): Phab:D4384
Owner: set to sighingnow
Type of failure: None/Unknown → Compile-time performance bug

After profiling, I found that the pprASCII function accounts for most of the memory consumption.

codeOutput                                    HscMain                           compiler\main\HscMain.hs:(1349,19)-(1350,67)       1863          2    0.1    0.0    84.6   94.3
 OutputAsm                                    CodeOutput                        compiler\main\CodeOutput.hs:(169,37)-(171,78)      1873          2    0.2    0.0    84.6   94.3
  NativeCodeGen                               CodeOutput                        compiler\main\CodeOutput.hs:171:18-78              1874          2    0.0    0.0    84.3   94.3
   cmmNativeGenStream                         AsmCodeGen                        compiler\nativeGen\AsmCodeGen.hs:(342,56)-(343,50) 1875          2    0.0    0.0    84.3   94.3
    cmmNativeGens                             AsmCodeGen                        compiler\nativeGen\AsmCodeGen.hs:(432,53)-(433,66) 1886         21    0.0    0.0    80.3   87.8
     pprNativeCode                            AsmCodeGen                        compiler\nativeGen\AsmCodeGen.hs:(530,37)-(531,65) 1891        109   43.9   22.3    51.5   30.3
      x86_pprNatCmmDecl_CmmData               X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:78:43-82             1901          0    0.1    0.0     7.5    8.0
       pprDataItem'                           X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:477:76-98            1910        124    0.0    0.0     0.0    0.0
        pprDataItem'_vcat                     X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:481:37-95            1911        124    0.0    0.0     0.0    0.0
       pprData_CmmString                      X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:152:36-90            1903          0    0.0    0.0     7.5    8.0
        pprASCII                              X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:199:28-62            1905          0    7.5    8.0     7.5    8.0
      x86_pprNatCmmDecl_CmmProc               X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:(81,43)-(113,26)     1965          0    0.0    0.0     0.1    0.0
       pprDataItem'                           X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:477:76-98            1968        122    0.0    0.0     0.0    0.0
        pprDataItem'_vcat                     X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:481:37-95            1969        122    0.0    0.0     0.0    0.0
     seqString                                AsmCodeGen                        compiler\nativeGen\AsmCodeGen.hs:505:33-95         1908        109    0.0    0.0     0.0    0.0
     x86_pprNatCmmDecl_CmmData                X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:78:43-82             1900         85    0.0    0.0    28.6   57.3
      pprData_CmmString                       X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:152:36-90            1902         66    0.0    0.0    28.6   57.3
       pprASCII                               X86.Ppr                           compiler\nativeGen\X86\Ppr.hs:199:28-62            1904         66   28.6   57.3    28.6   57.3

The embedded bytestring generates a large literal bytestring in the assembly code, represented as (CmmString [Word8]). The pprASCII function generates a list of SDoc values and then uses hcat to combine them.

I have optimized pprASCII in Phab:D4384. After this patch pprASCII still accounts for most of the memory allocation, but the total memory allocation drops substantially.

Before:

total time  =        2.43 secs   (2429 ticks @ 1000 us, 1 processor)
total alloc = 4,741,422,496 bytes  (excludes profiling overheads)

After:

total time  =        0.85 secs   (851 ticks @ 1000 us, 1 processor)
total alloc = 1,343,531,416 bytes  (excludes profiling overheads)
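
The shape of this change can be sketched outside GHC. The block below is only an illustration under stated assumptions: `Doc` is a toy stand-in for GHC's SDoc, the escaping rules in `escapeByte` are simplified, and the names `pprPerByte` and `pprDirect` are hypothetical. It contrasts building one document fragment per byte and joining the fragments afterwards with escaping into a single string and wrapping it once.

```haskell
import Data.Char (chr, isAscii, isPrint, ord)
import Data.Word (Word8)

-- Toy document type; a stand-in for GHC's SDoc, not the real thing.
data Doc = Empty | Text String | Beside Doc Doc

-- Join fragments left to right, as an hcat-like combinator would.
hcatDocs :: [Doc] -> Doc
hcatDocs = foldr Beside Empty

renderDoc :: Doc -> String
renderDoc Empty        = ""
renderDoc (Text s)     = s
renderDoc (Beside a b) = renderDoc a ++ renderDoc b

-- Number of Doc nodes; a crude proxy for intermediate allocation.
nodeCount :: Doc -> Int
nodeCount (Beside a b) = 1 + nodeCount a + nodeCount b
nodeCount _            = 1

-- Escape one byte for an assembler string literal (simplified rules).
escapeByte :: Word8 -> String
escapeByte w
  | c == '\\'              = "\\\\"
  | c == '"'               = "\\\""
  | isAscii c && isPrint c = [c]
  | otherwise              = '\\' : octal3 (fromIntegral w)
  where
    c = chr (fromIntegral w)

-- Three-digit octal rendering of a value in 0..255.
octal3 :: Int -> String
octal3 n = map digit [n `div` 64, (n `div` 8) `mod` 8, n `mod` 8]
  where
    digit d = chr (ord '0' + d)

-- Shape of the original approach: one fragment per byte, joined afterwards.
pprPerByte :: [Word8] -> Doc
pprPerByte = hcatDocs . map (Text . escapeByte)

-- Shape of the patched approach: escape into one string, wrap it once.
pprDirect :: [Word8] -> Doc
pprDirect bytes = Text (foldr (\w rest -> escapeByte w ++ rest) "" bytes)
```

Both functions render to the same text, but in this toy model `pprPerByte` creates a number of `Doc` nodes proportional to the input length, while `pprDirect` always creates exactly one.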

comment:7 Changed 13 months ago by sighingnow

After optimizing pprASCII, I found two other significant cost centres:

COST CENTRE                          MODULE     SRC                                               %time %alloc

pprASCII                             X86.Ppr    compiler\nativeGen\X86\Ppr.hs:181:28-92            47.8   56.2
tc_rn_src_decls                      TcRnDriver compiler\typecheck\TcRnDriver.hs:(494,4)-(556,7)    6.7   10.4
mapM_cgTopBinding                    StgCmm     compiler\codeGen\StgCmm.hs:90:43-84                 5.4   22.6

comment:8 Changed 13 months ago by sighingnow

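Further optimization of pprASCII requires changing CmmString [Word8] to CmmString ByteString.

I have tested the performance of unpacking a ByteString to [Word8]:

import           Control.DeepSeq (force)
import qualified Data.ByteString as B
import           Data.Word       (Word8)

-- Read the file strictly, then unpack it into a list of bytes.
embedFile :: FilePath -> IO [Word8]
embedFile fp = do
  print fp
  B.unpack <$> B.readFile fp

main :: IO ()
main = do
    -- force fully evaluates the list so its cost is charged to this cost centre.
    x' <- {-# SCC "forceRead" #-} (force <$> embedFile "3mb")
    print (length x')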

It generates the following prof result:

	   foldr-test.exe +RTS -p -RTS

	total time  =        0.07 secs   (74 ticks @ 1000 us, 1 processor)
	total alloc = 130,807,344 bytes  (excludes profiling overheads)

COST CENTRE MODULE SRC                     %time %alloc

forceRead   Main   foldr-test.hs:44:35-61   67.6  100.0
MAIN        MAIN   <built-in>               32.4    0.0

comment:9 Changed 13 months ago by Ben Gamari <ben@…>

In 2987b04/ghc:

Improve X86CodeGen's pprASCII.

The original implementation generates a list of SDoc then concatenates
them using `hcat`. For memory optimization, we can transform the given
literal string into an escaped string, then construct the SDoc directly.

This optimization will decrease the memory allocation when there are big
literal strings in Haskell code, see Trac #14741.

Signed-off-by: HE, Tao <sighingnow@gmail.com>

Reviewers: bgamari, mpickering, simonpj

Reviewed By: simonpj

Subscribers: simonpj, rwbarton, thomie, carter

GHC Trac Issues: #14741

Differential Revision: https://phabricator.haskell.org/D4384

comment:10 Changed 13 months ago by donatello

Thank you for the fix, it looks promising - but I am not sure if the problem is completely solved.

The profiling output says that total allocations were reduced from 4.7GB to 1.3GB, which is a 3.5x improvement. However, the goal in my initial program was to embed ~100MB of static data in my program, whereas the bug demonstrates the problem with a 3MB embedded string.

Is there any way I could get a built version of the ghc master for 64-bit x86 Linux (from a CI server perhaps), so I could try it out myself?

comment:11 Changed 13 months ago by mpickering

I'm not sure embedding a 100MB file into a program is really supported. What are you doing after you embed this file? Can't you just read the file when the program runs?

Maybe the easiest way would be to install HEAD from hvr's ppa - https://launchpad.net/~hvr/+archive/ubuntu/ghc. It is also easy with nixos if you are using that.

comment:12 Changed 13 months ago by donatello

I want to embed some static assets used by my program (which is also built as a static binary), into the binary itself to enable easy distribution/deployment - simply download and execute a single (binary) file. It is quite common in some other languages (e.g. https://github.com/elazarl/go-bindata-assetfs#readme).

Due to this issue, I am currently reading the static assets in at start, but I would prefer to build all the assets into the binary itself.

The PPA does not seem to have the most recent commits, so I will wait for it to be updated before I try this out.

comment:13 Changed 13 months ago by sighingnow

Embedding ~100MB of static data in Haskell code may consume around 40GB of memory. Currently in TH the StringPrimL is built from [Word8] rather than ByteString.

Unpacking a ~100MB bytestring to [Word8] and escaping it already consumes gigabytes of memory.

comment:14 Changed 13 months ago by simonpj

All the existing machinery for literals is oriented towards relatively short, human-readable literal strings. It's unsurprising that it chokes on 100MB.

But it seems like an absolutely legitimate request to me. Happy needs this too, in the form of its parsing tables; albeit they aren't so big.

There's even a wiki page about it: StaticData spun out of #5218.

This must be do-able, but it would need someone to lead on it.

comment:15 in reply to:  14 Changed 13 months ago by hsyl20

Replying to simonpj:

There's even a wiki page about it: StaticData spun out of #5218.

I have updated this/my proposal. Comments welcome! We should get Phab:D4217 merged to get started if we follow this plan.

comment:16 Changed 5 weeks ago by hsyl20

comment:17 Changed 4 weeks ago by hsyl20

comment:18 Changed 4 weeks ago by hsyl20

I have made a patch to add a helper to TH to create "bytes" primitives: https://gitlab.haskell.org/hsyl20/ghc/tree/hsyl20-T14741

Using it we can patch file-embed like this:

19c19
< module Data.FileEmbed
---
> module FileEmbed
49,53c49
< #if MIN_VERSION_template_haskell(2,5,0)
<     , Lit (StringL, StringPrimL, IntegerL)
< #else
<     , Lit (StringL, IntegerL)
< #endif
---
>     , Lit (..)
60a57
> import Language.Haskell.TH
65a63
> import qualified Data.ByteString.Internal as B
154c152,156
< #if MIN_VERSION_template_haskell(2, 8, 0)
---
> #if MIN_VERSION_template_haskell(2, 15, 0)
>       `AppE` LitE (bytesPrimL (
>                 let B.PS ptr off sz = bs
>                 in  mkBytes ptr (fromIntegral off) (fromIntegral sz))))
> #elif MIN_VERSION_template_haskell(2, 8, 0)
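
For reference, the core of the patched code path can be written as a standalone sketch. It assumes a template-haskell release that exports `mkBytes` and `bytesPrimL` from Language.Haskell.TH.Lib (the helper this patch introduces), and it uses `toForeignPtr` from Data.ByteString.Internal in place of matching on the `PS` constructor. The names `bsToExpBytes` and `embedFileBytes` are hypothetical and are not part of file-embed.

```haskell
{-# LANGUAGE TemplateHaskell #-}
import qualified Data.ByteString            as B
import qualified Data.ByteString.Internal   as BI
import           Data.ByteString.Unsafe     (unsafePackAddressLen)
import           Language.Haskell.TH.Lib    (bytesPrimL, mkBytes)
import           Language.Haskell.TH.Syntax (Exp (AppE, LitE, VarE), Lit (..),
                                             Q, addDependentFile, runIO)
import           System.IO.Unsafe           (unsafePerformIO)

-- Build an expression that reconstructs the ByteString at run time from a
-- bytes literal, without unpacking the data into a [Word8] first.
bsToExpBytes :: B.ByteString -> Exp
bsToExpBytes bs =
  VarE 'unsafePerformIO
    `AppE` (VarE 'unsafePackAddressLen
      `AppE` LitE (IntegerL (fromIntegral (B.length bs)))
      `AppE` LitE (bytesPrimL
                     (mkBytes ptr (fromIntegral off) (fromIntegral len))))
  where
    -- Underlying buffer, offset, and length of the strict ByteString.
    (ptr, off, len) = BI.toForeignPtr bs

-- Read a file at compile time and embed its contents as a bytes literal.
embedFileBytes :: FilePath -> Q Exp
embedFileBytes fp = do
  addDependentFile fp            -- recompile when the file changes
  bs <- runIO (B.readFile fp)
  return (bsToExpBytes bs)
```

As with the embedFile function earlier in this ticket, the splice has to live in a different module from the definition, for example `$(embedFileBytes "build/3mb")`.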

Using previous patches for #16198 and #16190, we get the following results when we embed a file of the given size:

  • V1: HEAD + patch for #16198
  • V2: V1 + patch for #16190 (default threshold set to 500K)
  • V3: V2 + this patch

Size    8.6.3     V1        V2       V3       Gain (V3 over V2)
128     2.650     2.331     2.346    2.291    +2%
3K      2.651     2.289     2.290    2.310    -1%
30K     2.590     2.353     2.307    2.299    +0%
100K    2.717     2.379     2.389    2.298    +4%
500K    3.621     2.814     2.331    2.315    +1%
1M      4.694     3.526     2.654    2.320    +12%
2M      6.784     4.668     2.650    2.350    +11%
3M      8.851     5.616     3.073    2.400    +22%
30M     63.181    34.318    8.517    3.390    +60%