Opened 6 years ago

Closed 2 years ago

Last modified 2 years ago

#2466 closed bug (fixed)

Execution time mis-accounting

Reported by: heatsink Owned by:
Priority: high Milestone: 7.4.1
Component: Profiling Version: 7.3
Keywords: Cc:
Operating System: Linux Architecture: x86
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

In the attached test case, profiling assigns a large fraction of run time to a single function. The fraction ranges from 19% to 40% across runs. Although I haven't independently verified the profile, I believe execution time is mis-attributed to this function.

The function in question is:

intersect :: Object -> Ray -> Maybe Intersection
intersect obj@(Object dat mem) = objmIntersect mem dat obj

where objmIntersect is a field selector. In -ddump-simpl output, the function consists of two case statements and a function call. The function it calls has its own SCC.

In profiling, this function is usually listed as the most time-consuming function in the program. Its callee is assigned a similar amount of time. Both functions are called the same number of times. The callee computes a ray-triangle intersection, which is presumably more time-consuming than two case statements and a function call...

COST CENTRE                    MODULE               %time %alloc

intersect                      Datatypes             28.5   11.3
triangle.intersect             Triangle              25.0    2.0
(other functions...)

I've reduced the test case from a somewhat larger ray tracer. I'm willing to help reduce it further if needed. The test case depends on the vector-space library.

Attachments (2)

htrace.tar.gz (8.8 KB) - added by heatsink 6 years ago.
Test case, a ray tracer evolved from Htrace
htrace-1.tar.gz (8.2 KB) - added by heatsink 2 years ago.
Updated to work with vector-space-0.7.3

Download all attachments as: .zip

Change History (14)

Changed 6 years ago by heatsink

Test case, a ray tracer evolved from Htrace

comment:1 Changed 6 years ago by igloo

  • Difficulty set to Unknown
  • Milestone set to 6.10 branch

Thanks for the report

comment:2 Changed 5 years ago by igloo

  • Milestone changed from 6.10 branch to 6.12 branch

comment:3 Changed 4 years ago by igloo

  • Milestone changed from 6.12 branch to 6.12.3

comment:4 Changed 4 years ago by igloo

  • Milestone changed from 6.12.3 to 6.14.1
  • Priority changed from normal to low

comment:5 Changed 3 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:6 Changed 3 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:7 Changed 3 years ago by simonmar

  • Milestone changed from 7.2.1 to 7.4.1
  • Owner set to simonmar
  • Priority changed from low to high
  • Type of failure set to None/Unknown

comment:8 Changed 2 years ago by marlowsd@…

commit 7bb0447df9a783c222c2a077e35e5013c7c68d91

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Thu Oct 27 13:47:27 2011 +0100

    Overhaul of infrastructure for profiling, coverage (HPC) and breakpoints
    
    User visible changes
    ====================
    
    Profilng
    --------
    
    Flags renamed (the old ones are still accepted for now):
    
      OLD            NEW
      ---------      ------------
      -auto-all      -fprof-auto
      -auto          -fprof-exported
      -caf-all       -fprof-cafs
    
    New flags:
    
      -fprof-auto              Annotates all bindings (not just top-level
                               ones) with SCCs
    
      -fprof-top               Annotates just top-level bindings with SCCs
    
      -fprof-exported          Annotates just exported bindings with SCCs
    
      -fprof-no-count-entries  Do not maintain entry counts when profiling
                               (can make profiled code go faster; useful with
                               heap profiling where entry counts are not used)
    
    Cost-centre stacks have a new semantics, which should in most cases
    result in more useful and intuitive profiles.  If you find this not to
    be the case, please let me know.  This is the area where I have been
    experimenting most, and the current solution is probably not the
    final version, however it does address all the outstanding bugs and
    seems to be better than GHC 7.2.
    
    Stack traces
    ------------
    
    +RTS -xc now gives more information.  If the exception originates from
    a CAF (as is common, because GHC tends to lift exceptions out to the
    top-level), then the RTS walks up the stack and reports the stack in
    the enclosing update frame(s).
    
    Result: +RTS -xc is much more useful now - but you still have to
    compile for profiling to get it.  I've played around a little with
    adding 'head []' to GHC itself, and +RTS -xc does pinpoint the problem
    quite accurately.
    
    I plan to add more facilities for stack tracing (e.g. in GHCi) in the
    future.
    
    Coverage (HPC)
    --------------
    
     * derived instances are now coloured yellow if they weren't used
     * likewise record field names
     * entry counts are more accurate (hpc --fun-entry-count)
     * tab width is now correct (markup was previously off in source with
       tabs)
    
    Internal changes
    ================
    
    In Core, the Note constructor has been replaced by
    
            Tick (Tickish b) (Expr b)
    
    which is used to represent all the kinds of source annotation we
    support: profiling SCCs, HPC ticks, and GHCi breakpoints.
    
    Depending on the properties of the Tickish, different transformations
    apply to Tick.  See CoreUtils.mkTick for details.
    
    Tickets
    =======
    
    This commit closes the following tickets, test cases to follow:
    
      - Close #2552: not a bug, but the behaviour is now more intuitive
        (test is T2552)
    
      - Close #680 (test is T680)
    
      - Close #1531 (test is result001)
    
      - Close #949 (test is T949)
    
      - Close #2466: test case has bitrotted (doesn't compile against current
        version of vector-space package)

 compiler/basicTypes/Id.lhs           |   17 +-
 compiler/basicTypes/MkId.lhs         |   28 +-
 compiler/basicTypes/Name.lhs         |   20 +-
 compiler/cmm/CmmParse.y              |    1 -
 compiler/codeGen/CgCallConv.hs       |   15 +-
 compiler/codeGen/CgClosure.lhs       |   31 +-
 compiler/codeGen/CgCon.lhs           |    6 +-
 compiler/codeGen/CgExpr.lhs          |    2 +-
 compiler/codeGen/CgHeapery.lhs       |    8 +-
 compiler/codeGen/CgProf.hs           |  195 +-----
 compiler/codeGen/StgCmmBind.hs       |   23 +-
 compiler/codeGen/StgCmmCon.hs        |    4 +-
 compiler/codeGen/StgCmmExpr.hs       |    2 +-
 compiler/codeGen/StgCmmHeap.hs       |    5 -
 compiler/codeGen/StgCmmProf.hs       |  177 +-----
 compiler/coreSyn/CoreArity.lhs       |   20 +-
 compiler/coreSyn/CoreFVs.lhs         |   14 +-
 compiler/coreSyn/CoreLint.lhs        |    8 +-
 compiler/coreSyn/CorePrep.lhs        |   56 +-
 compiler/coreSyn/CoreSubst.lhs       |   35 +-
 compiler/coreSyn/CoreSyn.lhs         |  123 +++-
 compiler/coreSyn/CoreTidy.lhs        |    9 +-
 compiler/coreSyn/CoreUnfold.lhs      |    6 +-
 compiler/coreSyn/CoreUtils.lhs       |  187 ++++--
 compiler/coreSyn/ExternalCore.lhs    |    2 +-
 compiler/coreSyn/MkExternalCore.lhs  |    3 +-
 compiler/coreSyn/PprCore.lhs         |   35 +-
 compiler/coreSyn/PprExternalCore.lhs |    2 +-
 compiler/coreSyn/TrieMap.lhs         |   25 +-
 compiler/deSugar/Coverage.lhs        |  556 +++++++++++------
 compiler/deSugar/Desugar.lhs         |   35 +-
 compiler/deSugar/DsArrows.lhs        |    5 +-
 compiler/deSugar/DsBinds.lhs         |  146 ++---
 compiler/deSugar/DsExpr.lhs          |   13 +-
 compiler/deSugar/DsGRHSs.lhs         |   11 +-
 compiler/deSugar/DsUtils.lhs         |   71 +--
 compiler/deSugar/Match.lhs           |    2 +-
 compiler/ghci/ByteCodeGen.lhs        |  127 ++---
 compiler/hsSyn/Convert.lhs           |    3 +-
 compiler/hsSyn/HsBinds.lhs           |   15 +-
 compiler/hsSyn/HsExpr.lhs            |   14 +-
 compiler/iface/BinIface.hs           |   64 +--
 compiler/iface/IfaceEnv.lhs          |   18 -
 compiler/iface/IfaceSyn.lhs          |   33 +-
 compiler/iface/MkIface.lhs           |   12 +-
 compiler/iface/TcIface.lhs           |   15 +-
 compiler/main/DynFlags.hs            |   54 +-
 compiler/main/HscMain.hs             |   17 +-
 compiler/main/TidyPgm.lhs            |    5 +-
 compiler/parser/Parser.y.pp          |    2 +-
 compiler/parser/RdrHsSyn.lhs         |    3 +-
 compiler/profiling/CostCentre.lhs    |  407 ++++--------
 compiler/profiling/SCCfinal.lhs      |  285 ++------
 compiler/simplCore/CSE.lhs           |    2 +-
 compiler/simplCore/FloatIn.lhs       |   11 +-
 compiler/simplCore/FloatOut.lhs      |   25 +-
 compiler/simplCore/LiberateCase.lhs  |    2 +-
 compiler/simplCore/OccurAnal.lhs     |   28 +-
 compiler/simplCore/SAT.lhs           |    4 +-
 compiler/simplCore/SetLevels.lhs     |   10 +-
 compiler/simplCore/SimplCore.lhs     |    2 +-
 compiler/simplCore/SimplEnv.lhs      |   33 +-
 compiler/simplCore/SimplUtils.lhs    |   21 +-
 compiler/simplCore/Simplify.lhs      |  122 +++-
 compiler/simplStg/SRT.lhs            |    2 +-
 compiler/simplStg/StgStats.lhs       |    2 +-
 compiler/specialise/Rules.lhs        |   19 +-
 compiler/specialise/SpecConstr.lhs   |   14 +-
 compiler/specialise/Specialise.lhs   |   16 +-
 compiler/stgSyn/CoreToStg.lhs        |   27 +-
 compiler/stgSyn/StgLint.lhs          |    2 +-
 compiler/stgSyn/StgSyn.lhs           |   13 +-
 compiler/stranal/DmdAnal.lhs         |    4 +-
 compiler/stranal/WorkWrap.lhs        |    6 +-
 compiler/stranal/WwLib.lhs           |   14 +-
 compiler/typecheck/TcBinds.lhs       |    3 +-
 compiler/vectorise/Vectorise/Exp.hs  |   15 +-
 compiler/vectorise/Vectorise/Vect.hs |    6 +-
 docs/users_guide/flags.xml           |   47 +-
 docs/users_guide/profiling.xml       |  560 +++++++++-------
 docs/users_guide/runtime_control.xml |   56 ++-
 includes/rts/prof/CCS.h              |  190 +++---
 includes/stg/MiscClosures.h          |    1 +
 rts/Apply.cmm                        |   25 +-
 rts/AutoApply.h                      |   11 +
 rts/Exception.cmm                    |    4 +-
 rts/PrimOps.cmm                      |    4 +
 rts/ProfHeap.c                       |    4 +-
 rts/Profiling.c                      | 1200 ++++++++++++++++++----------------
 rts/Profiling.h                      |    3 +-
 rts/Proftimer.c                      |    2 +
 rts/RaiseAsync.c                     |    2 +-
 rts/RtsFlags.c                       |    7 +-
 rts/StgMiscClosures.cmm              |   16 +-
 rts/Updates.cmm                      |    2 +-
 rts/sm/Storage.c                     |    6 +-
 utils/genapply/GenApply.hs           |   65 ++-
 97 files changed, 2750 insertions(+), 2830 deletions(-)

comment:9 Changed 2 years ago by simonmar

  • Resolution set to fixed
  • Status changed from new to closed

I think this is probably fixed, but I couldn't get the test case to compile. Please re-open if the problem still exists in 7.4.1.

Changed 2 years ago by heatsink

Updated to work with vector-space-0.7.3

comment:10 Changed 2 years ago by heatsink

  • Owner simonmar deleted
  • Resolution fixed deleted
  • Status changed from closed to new
  • Version changed from 6.8.2 to 7.3

The problem is still there. I've uploaded some files that work with the current vector-space package (and Cabal). I tested with the 20111118 nightly snapshot of GHC on Mac OS X.

The "intersect" function still has a high profile time. Good news is, its allocation is now reported as 0%, as expected since it doesn't contain any data constructor applications.

comment:11 Changed 2 years ago by simonmar

  • Resolution set to fixed
  • Status changed from new to closed

Thanks for updating the test case. I've looked into this today, and I think the profile is correct.

With HEAD I see this:

 CAF                     Main                    127           0    0.0    0.0    98.2   99.9
  icosahedron            Main                    256           0    0.0    0.0     0.0    0.0
  scene                  Main                    253           1    0.0    0.0     0.0    0.0
   icosahedron           Main                    254           2    0.0    0.0     0.0    0.0
    triangle             Triangle                255          40    0.0    0.0     0.0    0.0
  pgm                    Main                    251           0    0.0    0.0     0.0    0.0
  viewport               Main                    245           1    0.0    0.0     0.0    0.0
  main                   Main                    242           1    0.0    0.0    98.2   99.9
   render                Camera                  247           1    8.4   19.5    91.0   85.6
    raytrace             Scene                   252      102400   37.7   60.9    82.6   66.1
     lambertian          Shade                   260       37215    0.0    0.0     3.6    3.7
      sumForAllLights    Shade                   261       37215    3.6    2.3     3.6    3.7
       illumination      Shade                   262      111645    0.0    1.3     0.0    1.3
     shade               Datatypes               259       37215    0.0    0.0     0.0    0.0
     intersect           Datatypes               257     4096000   21.6    0.0    41.3    1.5
      triangle.intersect Triangle                258     4096000   19.8    1.5    19.8    1.5
    qtrApply             VectorMath              248           3    0.0    0.0     0.0    0.0
     negateQtr           VectorMath              250           3    0.0    0.0     0.0    0.0
     qtrComp             VectorMath              249           6    0.0    0.0     0.0    0.0
   pgm                   Main                    244           1    7.2   14.3     7.2   14.3

So we see the same number of calls to intersect as to the inner function triangle.intersect, but the time is split about 50/50 between them. The outer intersect consists of two cases and a function call; but I suspect the crucial factor is that the inner call is a call to an unknown function, so it has to go through the generic call machinery in the RTS. It might also be calling a PAP (partial application), which is more costly. If we do a low-level profile with perf (see Debugging/LowLevelProfiling/Perf), we see:

    11.71%          htrace  ./dist/build/htrace/htrace  [.] s33w_info
    10.94%          htrace  ./dist/build/htrace/htrace  [.] stg_upd_frame_info
     8.36%          htrace  ./dist/build/htrace/htrace  [.] pushCostCentre
     6.06%          htrace  ./dist/build/htrace/htrace  [.] stg_PAP_apply
     4.89%          htrace  ./dist/build/htrace/htrace  [.] 0x000000005f99c9
     4.10%          htrace  ./dist/build/htrace/htrace  [.] sCb_info
     3.82%          htrace  ./dist/build/htrace/htrace  [.] stg_ap_pp_fast
     3.76%          htrace  ./dist/build/htrace/htrace  [.] 0x0000000001f8c2
     3.36%          htrace  ./dist/build/htrace/htrace  [.] s33x_info
     3.12%          htrace  ./dist/build/htrace/htrace  [.] sJT_info
     2.00%          htrace  ./dist/build/htrace/htrace  [.] enterFunCurShorter
     1.98%          htrace  ./dist/build/htrace/htrace  [.] overwritingClosure

There is a lot of time being spent in stg_PAP_apply and stg_ap_pp_fast, these are the RTS functions for applying a partial application and applying an unknown function to two arguments respectively.

comment:12 Changed 2 years ago by heatsink

Sounds reasonable to me. Thank you for explaining the profile.

Note: See TracTickets for help on using tickets.