Opened 2 years ago

Closed 15 months ago

Last modified 14 months ago

#11158 closed task (fixed)

Combine exprIsTrivial and cpe_ExprIsTrivial

Reported by: simonpj Owned by:
Priority: highest Milestone: 8.2.1
Component: Compiler Version: 7.10.2
Keywords: Cc: ezyang, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #12757 Differential Rev(s): Phab:D2666
Wiki Page:

Description

Look at CoreUtils.exprIsTrivial and CorePrep.cpe_ExprIsTrivial. They are identical!

The latter has a comment saying

cpe_ExprIsTrivial :: CoreExpr -> Bool
-- Version that doesn't consider an scc annotation to be trivial.

but the code does not treat ticks any differently.

So the question is: are they supposed to be different? And if not, can we just combine them?

Peter W, Simon M: this is your territory. I don't understand tick stuff well enough.

Simon

Change History (38)

comment:1 Changed 2 years ago by simonpj

Milestone: 8.0.1
Owner: set to scpmw
Priority: normalhigh

I'll put it as high priority because it should be quick and easy

comment:2 Changed 2 years ago by scpmw

From a quick scan through the Git history, that particular comment has been out of date for quite some time, starting with Simon M's tick overhaul in 7bb0447, where

Note [SCCs are trivial]
~~~~~~~~~~~~~~~~~~~~~~~
We used not to treat (_scc_ "foo" x) as trivial, because it really
generates code, (and a heap object when it's a function arg) to
capture the cost centre.  However, the profiling system discounts the
allocation costs for such "boxing thunks" whereas the extra costs of
*not* inlining otherwise-trivial bindings can be high, and are hard to
discount.

got replaced by

Note [Tick trivial]
~~~~~~~~~~~~~~~~~~~
Ticks are not trivial.  If we treat "tick<n> x" as trivial, it will be
inlined inside lambdas and the entry count will be skewed, for
example.  Furthermore "scc<n> x" will turn into just "x" in mkTick.

Since then, exprIsTrivial has in fact been slightly more restrictive than cpe_exprIsTrivial concerning HPC ticks. When adding the exception for SourceNotes, I considered that an inconsistency.

My vote would be to combine them - would have done it myself if I had spotted it.

comment:3 Changed 2 years ago by bgamari

Owner: changed from scpmw to bgamari

I'll grab this.

comment:4 Changed 2 years ago by bgamari

Differential Rev(s): Phab:D1656
Priority: highnormal
Status: newpatch
Type: bugtask

Phab:1656 resolves this.

comment:5 Changed 2 years ago by bgamari

When quickly reviewing Phab:D1656 in response to some validation issues I noticed that these exprIsTrivial variants differ slightly in their treatment of literals.

While CorePrep's cpe_ExprIsTrivial always considers Lits to be trivial, CoreUtils's variant relies on Literal.litIsTrivial in this case. litIsTrivial deems all literals except MachStr and LitIntegers to be trivial.

This effectively means that cpeArg will no longer consider these literals to be trivial, resulting in eta expansion of some expressions containing them which are currently left untouched. This is evidently problematic.

comment:6 Changed 2 years ago by simonpj

Ben says that he got seg-faults when making cpe_ExprIsTrivial simply call litIsTrivial. That is deeply mysterious.

comment:7 Changed 2 years ago by bgamari

Indeed, these tests all fail with segmentation faults despite passing Core Lint with Phab:D1656,

TEST="literals parsed landmines T11430 T10313 comments parseTree annotations listcomps T8628 T8639_api T10508_api T7478 T1969 T9872d"

They all seem to share the fact that they use the GHC API.

comment:8 Changed 2 years ago by simonpj

Milestone: 8.0.18.2.1
Priority: normalhighest

I suppose is isn't urgent for 8.0, but I really would like to get to the bottom of these seg-faults. So I'll make it "highest" for 8.2.

Ben may find time to investigate, but perhaps someone else could? It's a very specific change so it should not be hard.

comment:9 Changed 18 months ago by mpickering

Keywords: newcomer added

comment:10 Changed 17 months ago by thomie

Owner: bgamari deleted

Unassigning, to make this ticket show up on the Newcomers page.

comment:11 Changed 15 months ago by simonpj

It would be really really good to know why the tiny difference between the functions causes a seg fault. Why doesn't Core Lint (on the output of CorePrep) catch the seg-fault?

How to find it? Since the failing tests use the GHC API, that suggests that the code for GHC itself gets messed up, which is a pain -- GHC is a very big test-case!

One idea: spit out a pprTrace whenever cpe_exprIsTrivial does something different to what exprIsTrivial would have. I bet it doesn't happen much.

comment:12 Changed 15 months ago by simonpj

Look at commit 11ff1df8a7c25485c9c7508d65bcb380e592010d, and talk to Edward Yang

comment:13 Changed 15 months ago by nomeata

Doesn’t the note “Note [Inlining in CorePrep]” explain it sufficiently well? The runtime does not support trivial thunks (no stg_ap_0_upd_info), so CorePrep must not produce them. (Well, it is not a complete story yet – why does it segfault instead of giving a linker error; why does it only occur when linked against GHC.) I’ll ponder it some more.

comment:14 Changed 15 months ago by nomeata

Cc: ezyang added
Keywords: newcomer removed
Status: patchinfoneeded

Ok, here is some insight.

The problem is related to trivial bindings – whether they are literals or nullary data cons. The latter are easy to produce, so here is what I did:

I first broke cpe_ExprIsLiteral, by adding this as a new first clause

cpe_ExprIsTrivial (Var v)         | isDataConWorkId v = False

Then I compiled the smallest possible program (with -dcore-lint -dstg-lint)

main = return ()

and it crashes. Here the output with +RTS -Dg:

cap 0: initialised
allocated 1 megablock(s) at 0x4200000000
allocated 1 megablock(s) at 0x4200100000
----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large     Live     Slop
       Blocks     Bytes          Objects                  
----------------------------------------------------------
    0       0         0       1        0        0     4096
    1       0         0       1        0        0     4096
----------------------------------------------------------
                                                0     8192
----------------------------------------------------------

cap 0: starting GC
GC (gen 1, using 1 thread(s))
Memory inventory:
  gen 0 blocks :    11 blocks (   0.0 MB)
  gen 1 blocks :     2 blocks (   0.0 MB)
  nursery      :   256 blocks (   1.0 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     0 blocks (   0.0 MB)
  GC free pool :     0 blocks (   0.0 MB)
  free         :   235 blocks (   0.9 MB)
  total        :   504 blocks (   2.0 MB)
cap 0: GC working
scavenging static objects
T11158: internal error: evacuate(static): strange closure type 38956
    (GHC version 8.1.20161012 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
Abgebrochen

So some pointer from the set of static objects is broken by this change.

We are in a slightly better position than above because we no longer need to use ghc-the-library to trigger the crash. But still, the problem could be anywhere in the libraries.

Here is the backtrace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007ffff70dd40a in __GI_abort () at abort.c:89
#2  0x00000000006d041a in rtsFatalInternalErrorFn (s=0x7525c0 "evacuate(static): strange closure type %d", ap=0x7fffffffdbf8) at rts/RtsMessages.c:182
#3  0x00000000006d004c in barf (s=0x7525c0 "evacuate(static): strange closure type %d") at rts/RtsMessages.c:46
#4  0x00000000006f2cc3 in evacuate (p=0x986158) at rts/sm/Evac.c:562
#5  0x0000000000717f18 in scavenge_static () at rts/sm/Scav.c:1764
#6  0x00000000007184d9 in scavenge_loop () at rts/sm/Scav.c:2088
#7  0x00000000006ee6a4 in scavenge_until_all_done () at rts/sm/GC.c:1001
#8  0x00000000006ed6cc in GarbageCollect (collect_gen=1, do_heap_census=rtsFalse, gc_type=0, cap=0x9d0d80 <MainCapability>) at rts/sm/GC.c:404
#9  0x00000000006dc686 in scheduleDoGC (pcap=0x7fffffffdf10, task=0x9e9ce0, force_major=rtsTrue) at rts/Schedule.c:1821
#10 0x00000000006dd1a5 in exitScheduler (wait_foreign=rtsFalse) at rts/Schedule.c:2636
#11 0x00000000006d09c2 in hs_exit_ (wait_foreign=rtsFalse) at rts/RtsStartup.c:326
#12 0x00000000006d0b7a in shutdownHaskellAndExit (n=0, fastExit=0) at rts/RtsStartup.c:483
#13 0x00000000006e5312 in hs_main (argc=1, argv=0x7fffffffe178, main_closure=0x97d670, rts_config=...) at rts/RtsMain.c:91
#14 0x0000000000404226 in main ()

But at this time of day, I will not dig in deeper. Edward, does this analysis provide any insight to you?

comment:15 Changed 15 months ago by simonpj

What C-- and STG code do you get for

f x = let y = Nothing in (y,y)

No optimistaion, so the y binding isn't floated, so you get a local nullary-datacon binding.

comment:16 Changed 15 months ago by nomeata

I get

T11158.f
  :: forall a1_aDG a2_aDF t_aDp.
     t_aDp -> (GHC.Base.Maybe a2_aDF, GHC.Base.Maybe a1_aDG)
[GblId, Arity=1, Caf=NoCafRefs, Unf=OtherCon []] =
    \r [x_sNH]
        let {
          sat_sNJ [Occ=Once] :: GHC.Base.Maybe a_aDG
          [LclId] =
              NO_CCS GHC.Base.Nothing! []; } in
        let {
          sat_sNI [Occ=Once] :: GHC.Base.Maybe a1_aDF
          [LclId] =
              NO_CCS GHC.Base.Nothing! [];
        } in  (,) [sat_sNI sat_sNJ];

and

==================== Output Cmm ====================
[section ""data" . T11158.f_closure" {
     T11158.f_closure:
         const T11158.f_info;
 },
 T11158.f_entry() //  [R2]
         { info_tbl: [(cNQ,
                       label: T11158.f_info
                       rep:HeapRep static { Fun {arity: 1 fun_type: ArgSpec 5} })]
           stack_info: arg_space: 8 updfr_space: Just 8
         }
     {offset
       cNQ:
           _sNG::P64 = R2;
           goto cNS;
       cNS:
           Hp = Hp + 24;
           if (Hp > HpLim) goto cNU; else goto cNT;
       cNU:
           HpAlloc = 24;
           goto cNR;
       cNR:
           R2 = _sNG::P64;
           R1 = T11158.f_closure;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       cNT:
           I64[Hp - 16] = (,)_con_info;
           P64[Hp - 8] = GHC.Base.Nothing_closure+1;
           P64[Hp] = GHC.Base.Nothing_closure+1;
           _cNP::P64 = Hp - 15;
           R1 = _cNP::P64;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }
 }]

which looks reasonable, doesn’t it?

But here is an interesting difference, which might be a clue. Before my patch:

==================== Output Cmm ====================
[section ""relreadonly" . SOb_srt" {
     SOb_srt:
         const GHC.Base.$fMonadIO_closure;
 }]

After my patch:

==================== Output Cmm ====================
[section ""relreadonly" . SNV_srt" {
     SNV_srt:
         const GHC.Base.Nothing_closure;
         const GHC.Base.$fMonadIO_closure;
         const sat_sNJ_closure;
 }]

Is it kosher to add a static constructor closure to the srt table?

The sat_sNJ_closure seen here is this thing, which comes from the () of main = return ():

[section ""data" . sat_sNJ_closure" {
     sat_sNJ_closure:
         const ()_static_info;
 }]
Last edited 15 months ago by nomeata (previous) (diff)

comment:17 Changed 15 months ago by simonpj

Cc: simonmar added

Yes, looks reasonable.

No point in having Nothing in an SRT. It should have a no-CAF info. I wonder why it doesn't.

The sat_sNJ_closure thing is strange. Does it come from a top level binding

sat_sNG = ()

or how does it arise?

I don't know if anything bad would happen if these two closures do end up in SRT; I don't see why they should. Adding simonmar to cc; he'll know for sure.

comment:18 Changed 15 months ago by nomeata

It arose out of main = return (). In CoreTidy, it is

main = return @ IO GHC.Base.$fMonadIO @ () GHC.Tuple.()

and CorePrep turns that (due to my change of cpe_ExprIsTrivial!) into

-- RHS size: {terms: 1, types: 0, coercions: 0}
sat_sNJ :: ()
[LclId]
sat_sNJ = GHC.Tuple.()

-- RHS size: {terms: 3, types: 2, coercions: 0}
T11158.main :: GHC.Types.IO ()
[GblId]
T11158.main =
  GHC.Base.return @ GHC.Types.IO GHC.Base.$fMonadIO @ () sat_sNJ

which yields this STG

sat_sNJ :: ()
[LclId] =
    NO_CCS ()! [];

T11158.main :: GHC.Types.IO ()
[GblId] =
    \u [] GHC.Base.return GHC.Base.$fMonadIO sat_sNJ;

Given that the crash happens in scavenge_static () I think it is likely that the SRT table is related to the issue.

I don’t see anywhere in cpeArg (which creates the float) code that would set CAF information on the floated argument. According to note Note [CafInfo and floating] it is safe to have top-level binds with HasCafRef even when they do not actually reference any CAFs.

(Trying to set NoCafInfo on all the extra floats I am generating here, and rebuilding the libraries. Will report back.)

comment:19 Changed 15 months ago by nomeata

(Trying to set NoCafInfo on all the extra floats I am generating here, and rebuilding the libraries. Will report back.)

That did the trick. No more crashes. I conclude that having a data constructor closure mentioned in the SRT is bad.

But it is not every occurrence like that causes problems: With the libraries built so that all new floats have NoCafInfo, and then building my test program to include the floated () closure or GHC.Base.Nothing_closure in the SRT does *not* crash. So there must be further conditions required for such an entry in the SRT causing a crash.

Maybe with this analysis Simon Marlow can without too much effort shed more light on the issue.

Last edited 15 months ago by nomeata (previous) (diff)

comment:20 Changed 15 months ago by simonmar

I would have thought it was safe (but inefficient) to use HasCafRef instead of NoCafInfo.

comment:21 Changed 15 months ago by nomeata

I would have thought it was safe (but inefficient) to use HasCafRef instead of NoCafInfo.

Yes, also the code obviously makes that assumption, but it does not seem to be true.

comment:22 Changed 15 months ago by simonpj

It's worth reading Note [CafInfo and floating] in CorePrep.

Here's what I don't understand:

  1. I think that local bindings (which may get floated) have by-default a MayHaveCafRefs flag. Worth checking.
  1. I don't understand why a top-level binding
    sat1 = Nothing
    
    (with a MayHaveCafRefs flag) gives rise to a Nothing_closure in the SRT of the binding mentioning sat, whereas
    sat2 = GHC.Tuple.()
    
    (also with a MayHaveCafRefs flag) gives rise to a sat2_closure entry in he SRT. Why are the two treated differently?
  1. In the binding
    main = return @ IO GHC.Base.$fMonadIO @ () GHC.Tuple.()
    
    I don't understand why we ANF-ise the GHC.Tuple.() argument. It's already atomic; no need to create a binding for it. Does the same happen for Nothing?

Simon, do you have any ideas there? E.g. maybe data constructor closures are in static space and can't get marked by the CAF traversal or something? It'd be good to document it.

  1. I don't understand why having those entries in the SRT would cause a crash.

Finally, although it woudl be great to understand these things, the Real Solution is #9718, which I have been longing to get done for literally years. I really don't think it's hard; we just need to be a bit careful about memory usage.

I'd rather fix #9718 than burn cycles doing very careful jiggery-pokery with CorePrep and floating. Fixing #9718 fixes the problem at source, rather than perpetuating it.

comment:23 Changed 15 months ago by simonmar

I'll take a closer look later, but there seem to be multiple silly things going on. This:

[section ""data" . sat_sNJ_closure" {
     sat_sNJ_closure:
         const ()_static_info;
 }]

Is an instance of the () constructor which is crazy because there only needs to be one instance of (), we don't need to create new instances of it.

It's also totally silly for Nothing to be in an SRT. Even when we're not optimising we ought to know that nullary constructors have NoCafRefs.

comment:24 Changed 15 months ago by nomeata

ARG! Where is my comment to Simon’s comment:22 gone? I replied to that! (Maybe got bitten by #11718 again.) So here is a second try:

Re SPJ’s point 3:

As I wrote in comment:14, the above happens when I change cpe_ExprIsLiteral to not consider data constructors as trivial:

cpe_ExprIsTrivial (Var v)         | isDataConWorkId v = False

Obviously this is not what we want, but it is the easiest way to trigger the mysterious crashes and understand more about the pipeline, in particular why is it not safe to have MayHaveCafRefs on such pointless bindings.

comment:25 Changed 15 months ago by simonpj

RE point 3: fair enough. I missed that. Thanks. But 1,2,4 are still odd.

comment:26 Changed 15 months ago by bgamari

So for what it's worth the segmentation faults alluded to in comment:6 are from testsuite tests using the GHC API. Namely the following tests fail if you make cpe_ExprIsTrivial use litIsTrivial,

make test TEST="T12156 landmines parsed par01 T7478 T11430 T10313 comments parseTree annotations listcomps T12529 T8628 T6145 T8639_api T10508_api T1969 T5631 T3064 T5030 T12227"

I quickly looked at one particular example (T8628) and found shockingly few clues while running the program in gdb. Judging by the fact that the value of the $rip register points to unmapped memory I believe that the mutator is jumping to some random location in the program's image. The C stack offers essentially no hints. The Haskell stack, on the other hand, suggests that the crash was while evaluating within the lexer.

Also, I have confirmed that allowing strings (but not integers) to be trivial in cpe_ExprIsTrivial is sufficient to prevent the crash.

comment:27 Changed 15 months ago by nomeata

Right. Note that you do not actually have to use the GHC API, it is sufficient to link against it:

import GHC
import System.Exit
main :: IO ()
main = putStrLn "Hi" >> exitSuccess >> GHC.runGhc undefined (return ())

(can you confirm that?)

Note that with my change, I could tickle the crash (assuming its the same crash) without linking against ghc, but the cause is still somewhere in base or ghc-prim.

comment:28 Changed 15 months ago by bgamari

Right. Note that you do not actually have to use the GHC API, it is sufficient to link against it

I actually haven't observed that. The example that you provided didn't reproduce the crash.

comment:29 Changed 15 months ago by bgamari

One thing I'm a bit confused by is the comment on CoreToStg.FreeVarsInfo,

INVARIANT: Any ImportBound Ids are HaveCafRef Ids Imported Ids without CAF refs are simply not put in the FreeVarsInfo for an expression.

Why is this limited to import-bound Ids? I would have thought that any non-thunk Id which has NoCafRefs would be safe to omit from the SRT, regardless of where it is defined. I must be missing something here.

comment:30 Changed 15 months ago by simonpj

Why is this limited to import-bound Ids? I would have thought that any non-thunk Id which has NoCafRefs would be safe to omit from the SRT.

Non-top-level Ids must be collected in FV info, becuase they become part of the dynamically allocated closure. They never go in an SRT.

For SRTs we are talking about top level Ids. For top level imported Ids that defintely don't have CAF-refs, we can neglect them entirely in this free var into.

I think it'd be OK to neglect local top-level Ids that have no CAF-refs too. I think that's what you are suggesting. I don't see any reason why not... feel free to try. Should not change the resulting STG or Cmm at all.

comment:31 Changed 15 months ago by bgamari

Yesterday I start from a clean slate and started by again turning cpe_ExprIsTrivial calls into calls to exprIsTrivial. Recall that this essentially marks STG string and integer literals as non-trivial, ensuring that they are not duplicated in applications like ....

After doing this I found that a DEBUG compiler build failed while building GHC.Types with an assertion failure in CoreToStg. Namely, the consistentCafInfo check failed as the IdInfo claimed NoCafRefs yet the binding itself was caffy. The binding in question is of the form,

  $tcMode1 :: TrName
  [GblId, Caf=NoCafRefs, Str=m1] =
      \u []
          case "Mode"# of sat_swgf { __DEFAULT -> TrNameS [sat_swgf]; };

The reason topStgBindHasCafRefs considers this to be caffy is because the RHS is an updatable StgRhsClosure.

It seems to me that the problem here is that we floated the string literal as an case analysis, meaning that we have turned what should be a StgRhsCon into a StgRhsClosure. I believe this is ultimately the same problem presented by #11312.

comment:32 Changed 15 months ago by Simon Peyton Jones <simonpj@…>

In 623b8e4/ghc:

Renaming and comments in CorePrep

In particular I renamed
  'triv' to 'arg'
  CpeTriv to CpeArg
in Note [CorePrep invariants], with knock on consequences.

This is groundwork for the fix to Trac #11158

comment:33 Changed 15 months ago by simonpj

Differential Rev(s): Phab:D1656Phab:D2666

Well found! That does look plausible.

Would you like to abandon Phab:D1656?

I've added your patch Phab:D2666.

Last edited 15 months ago by simonpj (previous) (diff)

comment:34 Changed 15 months ago by bgamari

Resolution: fixed
Status: infoneededclosed

comment:35 Changed 15 months ago by simonpj

Where's the commit?

comment:36 Changed 15 months ago by Ben Gamari <ben@…>

In 967dd5c9/ghc:

Merge cpe_ExprIsTrivial and exprIsTrivial

Strangely my previous attempts at resolving this all seemed to end in
perplexing segmentation faults in the GHC testsuite (including some
rather recent attempts). Somehow this attempt miraculously works.

However, there was one wrinkle that I still need to work out fully: we
need to consider Lits as non-trivial in cpeArg. Failure to do this means
that we would transform something like,

    $trModule = TrModule "HelloWorld"#

into

    $trModule = case "HelloWorld"# of x { __DEFAULT -> TrModule x }

Which then fails the consistentStgInfo check in CoreToStg for reasons
that I am still trying to work out.

Mark T12757 as fixed

Reviewers: simonmar, simonpj, austin

Reviewed By: simonpj

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2666

GHC Trac Issues: #11158

comment:37 Changed 15 months ago by simonpj

Which then fails the consistentStgInfo check in CoreToStg for reasons

that I am still trying to work out.

Here's why (I think):

  • As the commit message says, if we case-bind the literal string, something that was a static data structure (predicted as NoCafRefs by CorePrep becomes a thunk. This prediction is made by rhsIsStatic, called in CorePrep.
  • Since it is marked as NoCafRefs it won't appear in anyone's SRT
  • But when it is evaluated, the thunk will allocate its result in the heap, and update the static closure to point to it.
  • Later the garbage collector will collect that thunk; the only pointer to it is from a static closure that is in no SRT.
  • The space will be re-used
  • Later someone uses the thing again, but the pointer now points to garbage. Seg fault.

It's FATAL for something marked as NoCafRefs to turn into a thunk. Arrangig to panic on that would be good. It should not be just a warning, and I think it'd be worth the tiny perf hit to test it even in non-DEBUG code.

comment:38 Changed 14 months ago by bgamari

Sadly this fix didn't quite make it in to 8.0.2-rc1. This will be fixed in -rc2.

Last edited 14 months ago by bgamari (previous) (diff)
Note: See TracTickets for help on using tickets.