Opened 8 months ago

Closed 7 months ago

Last modified 7 months ago

#15038 closed bug (fixed)

Memory Corruption (strange closure type)

Reported by: andrewthad Owned by:
Priority: highest Milestone: 8.4.3
Component: Compiler Version: 8.4.1
Keywords: Cc: osa1, simonpj, bgamari
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: #9718 Differential Rev(s): Phab:D4680
Wiki Page:

Description

I've been encountering corruption of memory in a library of mine that uses some of GHC's more recent features (levity polymorphism and unboxed sums). The library is packed, and it's source code can be found on my github. But, more relevant to this issue is the minimal reproducible example I've put together here: https://github.com/andrewthad/corrupted-memory-example

It's still a bit larger than I'd like it to be, and if no one has any insights into this, I can keep whittling it down to make it more minimal. Here's how to run it:

git clone https://github.com/andrewthad/corrupted-memory-example
cd corrupted-memory-example
cabal new-build --enable-tests test
./dist-newstyle/build/corrupted-memory-example-0.1/build/test/test

This consistently crashes with:

test: internal error: evacuate(static): strange closure type 16648
    (GHC version 8.4.1 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Additional information: This issue can be reproduced using GHC 8.2.2, GHC 8.4.1, and GHC HEAD. This project cannot be built with GHC 8.0.2 and earlier since it uses unboxed sums. Building this project with either stack or cabal new-build and then running the test causes the same crash. However, building it with plain old GHC and then running it does not. This is really weird, since there aren't any GHC options in the cabal file that should affect this. In the original library, I have a bunch of tests that use the parsers byte and any in a variety of situations, and none of them every trigger this crash except for the ones where I stick a Parser inside of a Trie. This project does a lot of manual passing of state tokens, but I believe all of these uses to be correct (mostly because of the extensive tests in the original repo, but I've also spent a lot of time just looking at the code to figure out if this was something on my end).

Let me know if the example is too big. I can spend some more time shrinking it further, but I wanted to go ahead and get it up here.

Change History (45)

comment:1 Changed 8 months ago by osa1

Cc: osa1 added

comment:2 Changed 8 months ago by simonpj

Thanks for making a repro case. Yes, shrinking it is incredibly valuable.

Did you compile everything with -dcore-lint? Always worth doing that.

comment:3 Changed 8 months ago by simonpj

Type of failure: None/UnknownRuntime crash

comment:4 Changed 8 months ago by andrewthad

I’ll work on shrinking it further. And yes, I forgot to mention that core lint comes back fine and (on ghc Head) stg lint comes back fine as well. I’ve tried to consolidate the modules, but merging them all into a single module causes the problem to disappear.

comment:5 Changed 8 months ago by andrewthad

I have gotten it a little smaller. I've removed some functions from Packed.Bytes.Parser, and there are also no more levity polymorphic functions. Reminder about weird things:

  1. Problems only happen when using cabal or stack to build, not when using ghc on its own.
  2. Collapsing all modules into a single module (which affects inlining) makes the crash disappear.
  3. This happens with -O0 and -O2.

comment:6 Changed 8 months ago by andrewthad

Here's my rudimentary analysis of the error message.

internal error: evacuate(static): strange closure type 16648

If this were caused by user code, I think it could only be the result of an out-of-bounds call to writeArray, writeByteArray, readArray, or indexArray. An out of bounds function that reads from a byte array would not cause this. It would simply give the user a garbage value. Well, I guess if it were far enough out-of-bounds, the operating system would kill the process, but that doesn't happen here. The function writeByteArray is used a few times, but all of its uses look safe to me.

comment:7 Changed 8 months ago by osa1

I managed to reproduce this without stack or cabal. Steps:

  • Install dependencies: cabal install containers split ghc-prim primitive --with-ghc=ghc-stage2
  • Build the executable: ghc-stage2 common/Data/Trie/Naive.hs src/Packed/Bytes.hs src/Packed/Bytes/Parser.hs src/Packed/Bytes/Small.hs src/Packed/Bytes/Stream/ST.hs src/Packed/Bytes/Window.hs test/Main.hs test/Parser.hs -fforce-recomp -o main
  • Run ./main

It works fine with -O0, fails with -O1 and -O2, but with different "strange closure" types. -O1 prints 16661, -O2 prints 16665. Even more interestingly I get a different number if I also add -debug: 16621.

The object with "strange closure type" is a static one, and does not reside in heap. On my system it's at 0x967900.

Last edited 8 months ago by osa1 (previous) (diff)

comment:8 Changed 8 months ago by andrewthad

Thanks osa1 for reproducing this with just ghc and for the analysis. I'll try to poke around over the next few days to see if I can minimize failing case further.

comment:9 Changed 8 months ago by bgamari

Any update on this, andrewthad?

comment:10 Changed 8 months ago by andrewthad

I just spent some time making it a little more minimal. It's down to about 450 lines of code at this point, which is still pretty high. I've also consolidated the modules somewhat. I'll keep trying to get this smaller over the next few days.

comment:11 Changed 8 months ago by andrewthad

My additional changes have been pushed to the github repo. Also, here's my two cents about what I think is going on. The thing that causes the problem appears to be:

snmptrapdNaive :: Naive.Trie (Parser Word)
snmptrapdNaive = Naive.fromStringList
  [ ("STRING: ", P.any >>= \_ -> return 5)
  ]

If I use my parser language to simply build up a parser, and then I run the parser, everything works fine. However, sticking it inside of some other data structure seems like it causes problems (I'm not sure if it matters that the value is a CAF). Also, my parser type is a newtype wrapper around a function that uses unboxed sums. I'm also not sure if that matters.

comment:12 Changed 8 months ago by osa1

Thanks for simplifying the code.

I made an attempt at removing unboxed sums (and had to remove some unboxed tuples on the way) here, and that made the bug disappear. So this may be related with unboxed tuples/sums.

If I use my parser language to simply build up a parser, and then I run the parser

Can you elaborate on this? What do you mean by using your parser language to build up a parser? Can you share the code?

Last edited 8 months ago by osa1 (previous) (diff)

comment:13 Changed 8 months ago by andrewthad

The full original code is the packed repository (along with its test suite): https://github.com/andrewthad/packed.

The minimal example repo I put up is just packed but with everything that wasn't needed to make the error happen stripped out. Here is an example of a parser that I build in the original code that doesn't cause errors: https://github.com/andrewthad/packed/blob/master/test/Parser.hs#L112-L122. None of the parsers from the original test suite ever cause crashes excepted for the trie-based one where a Parser is nested inside of a Trie, which causes a crash whenever the GC is run. Let me know if there's a way that I can clarify this further.

comment:14 Changed 8 months ago by osa1

I spent some more time debugging this. Basically at one point the GC overwrites an info table, messing up the type field.

$ gdb ./Main -nh
Reading symbols from ./Main...done.
(gdb) start
Temporary breakpoint 1 at 0x40d29e
Starting program: /home/omer/haskell/corrupted-memory-example/Main
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Temporary breakpoint 1, 0x000000000040d29e in main ()

Now I can see that type of an info table is CONSTR_2_0:

(gdb) print get_itbl(0x9815f8)->type
$1 = 4

Now continue until it crashes:

(gdb) c
Continuing.
Main: internal error: evacuate(static): strange closure type 16580
    (GHC version 8.5.20180425 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Program received signal SIGABRT, Aborted.
0x00007ffff6e90428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

and at this point the same info table has wrong type field:

(gdb) print get_itbl(0x9815f8)->type
$2 = 16580

Adding a watchpoint to this location causes gdb to leak memory so I couldn't do it from the start. I realized that this happens when evacuate is called 143851nd time so I run the program until that call, then enable the watchpoint, and continue. First hit is

Watchpoint 3: get_itbl(0x9815f8)->type

Old value = 4
New value = 0
evacuate_static_object (link_field=0x9815f8, q=0x9815f0) at rts/sm/Evac.c:347
347             gct->static_objects = (StgClosure *)new_list_head;

(gdb) bt
#0  evacuate_static_object (link_field=0x9815f8, q=0x9815f0) at rts/sm/Evac.c:347
#1  0x00000000008e79e0 in evacuate (p=0x981608) at rts/sm/Evac.c:546
#2  0x0000000000910c29 in scavenge_static () at rts/sm/Scav.c:1831
#3  0x00000000009111f8 in scavenge_loop () at rts/sm/Scav.c:2185
#4  0x00000000008e34bb in scavenge_until_all_done () at rts/sm/GC.c:1092
#5  0x00000000008e2118 in GarbageCollect (collect_gen=1, do_heap_census=false, gc_type=0, cap=0xad4080 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:418
#6  0x00000000008d4dfb in scheduleDoGC (pcap=0x7fffffffd630, task=0xaebd00, force_major=false) at rts/Schedule.c:1799
#7  0x00000000008d4347 in schedule (initialCapability=0xad4080 <MainCapability>, task=0xaebd00) at rts/Schedule.c:545
#8  0x00000000008d57a1 in scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7fffffffd720) at rts/Schedule.c:2533
#9  0x00000000008d7997 in rts_evalLazyIO (cap=0x7fffffffd720, p=0x981710, ret=0x0) at rts/RtsAPI.c:530
#10 0x00000000008d808a in hs_main (argc=1, argv=0x7fffffffd918, main_closure=0x981710, rts_config=...) at rts/RtsMain.c:72
#11 0x000000000040d39a in main ()

Here it gets overwritten to 0. When I continue it gets overwritten one more time:

(gdb) c
Continuing.

Watchpoint 3: get_itbl(0x9815f8)->type

Old value = 0
New value = 16580
scavenge_static () at rts/sm/Scav.c:1789
1789        gct->scavenged_static_objects = flagged_p;

(gdb) bt
#0  scavenge_static () at rts/sm/Scav.c:1789
#1  0x00000000009111f8 in scavenge_loop () at rts/sm/Scav.c:2185
#2  0x00000000008e34bb in scavenge_until_all_done () at rts/sm/GC.c:1092
#3  0x00000000008e2118 in GarbageCollect (collect_gen=1, do_heap_census=false, gc_type=0, cap=0xad4080 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:418
#4  0x00000000008d4dfb in scheduleDoGC (pcap=0x7fffffffd630, task=0xaebd00, force_major=false) at rts/Schedule.c:1799
#5  0x00000000008d4347 in schedule (initialCapability=0xad4080 <MainCapability>, task=0xaebd00) at rts/Schedule.c:545
#6  0x00000000008d57a1 in scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7fffffffd720) at rts/Schedule.c:2533
#7  0x00000000008d7997 in rts_evalLazyIO (cap=0x7fffffffd720, p=0x981710, ret=0x0) at rts/RtsAPI.c:530
#8  0x00000000008d808a in hs_main (argc=1, argv=0x7fffffffd918, main_closure=0x981710, rts_config=...) at rts/RtsMain.c:72
#9  0x000000000040d39a in main ()

The value 16580 is what we see in the error message.

I don't know if this is a problem with GC, maybe it's something wrong with the code gen.

comment:15 Changed 8 months ago by osa1

Cc: simonpj bgamari added
Priority: normalhighest

Finally found the problem. In summary; in unarise we introduce absentError calls for unboxed sum slots that are not supposed to be used and absentError id (aBSENT_ERROR_ID) is marked as CAFFY. Now suppose we had a top-level definition that is not CAFFY but uses unboxed sums. TidyPgm thinks that it's not CAFFY. But in unarise we make it CAFFY by introducing absentError. This causes mismatch between the CAF info recorded in the binder id for this top-level definition and the actual RHS. The code that generates closure layout doesn't add a static_link field, but the code that generates the info table generates a SRT and SRT bitmap.

Some potential solutions:

  • Why is absentError marked as CAF? I think error ids are conservatively marked as MayHaveCafRefs but perhaps it's fine to not mark absentError as MayHaveCafRefs. (I implemented this and it fixed the bug)
  • Update CAFFY-ness of ids in unarise if we introduce absentError (haven't tried this yet)
  • Conservatively consider unboxed sum constructors as CAFFY in TidyPgm
  • Something else?

Any ideas? Ping @simonpj @bgamari

Last edited 8 months ago by osa1 (previous) (diff)

comment:16 Changed 8 months ago by bgamari

Good sleuthing!

However, I'm not sure about marking absentError as non-CAFFY. Afterall, judging by its RHS it looks as though it may carry a reference to at least one CAF (namely `unpackCString# "Oops! Entered absent arg"#). Have you looked at the STG that is produced for this?

comment:17 Changed 8 months ago by osa1

STG that is produced for this?

Do you mean STG for absentError or the function with wrong CAFFY-ness information? I haven't checked absentError but this is the STG for the function with wrong CAFFYness information:

lvl3_rctu
  :: forall s.
     Packed.Bytes.Parser.Maybe# (Packed.Bytes.Parser.Leftovers# s)
     -> GHC.Prim.State# s
     -> (# GHC.Prim.State# s,
           Packed.Bytes.Parser.Result# s GHC.Types.Word #)
[GblId,
 Arity=2,
 Caf=NoCafRefs,
 Str=<S,1*U><S,U>,
 Unf=OtherCon []] =
    [] \r [us_gcwc us_gcwd us_gcwe us_gcwf us_gcwg void_0E]
        case us_gcwc of tag_gcwh {
          __DEFAULT ->
              (#,,,,,,#) [1#
                          Control.Exception.Base.absentError
                          Control.Exception.Base.absentError
                          0##
                          0##
                          1#
                          Control.Exception.Base.absentError];
          2# ->
              case ># [us_gcwg 0#] of {
                __DEFAULT ->
                    case Packed.Bytes.Parser.nextNonEmpty us_gcwe GHC.Prim.void# of {
                      (#,,,,#) us_gcwi us_gcwj us_gcwk us_gcwl us_gcwm ->
                          case us_gcwi of tag_gcwn {
                            __DEFAULT ->
                                (#,,,,,,#) [1#
                                            Control.Exception.Base.absentError
                                            Control.Exception.Base.absentError
                                            0##
                                            0##
                                            1#
                                            Control.Exception.Base.absentError];
                            2# ->
                                case -# [us_gcwm 1#] of sat_scuk {
                                  __DEFAULT ->
                                      case +# [us_gcwl 1#] of sat_scuj {
                                        __DEFAULT ->
                                            (#,,,,,,#) [2#
                                                        us_gcwj
                                                        us_gcwk
                                                        sat_scuj
                                                        sat_scuk
                                                        2#
                                                        a1_rcbR];
                                      };
                                };
                          };
                    };
                1# ->
                    case -# [us_gcwg 1#] of sat_scur {
                      __DEFAULT ->
                          case +# [us_gcwf 1#] of sat_scuq {
                            __DEFAULT ->
                                (#,,,,,,#) [2# us_gcwd us_gcwe sat_scuq sat_scur 2# a1_rcbR];
                          };
                    };
              };
        };

It says Caf=NoCafRefs but has a bunch of absentError references.

Last edited 8 months ago by osa1 (previous) (diff)

comment:18 Changed 8 months ago by bgamari

The STG for absentError; we want to know specifically whether it is necessary to mark it as CAFFY.

comment:19 Changed 8 months ago by osa1

STG for absentError:

lvl_r3wv :: GHC.Prim.Addr#
[GblId, Caf=NoCafRefs, Unf=OtherCon []] =
    "Oops!  Entered absent arg "#;

Control.Exception.Base.absentError :: forall a. GHC.Prim.Addr# -> a
[GblId, Arity=1, Str=<B,U>x, Unf=OtherCon []] =
    [] \r [s_s3BA]
        let {
          sat_s3BB [Occ=Once] :: [GHC.Types.Char]
          [LclId] =
              [s_s3BA] \u [] GHC.CString.unpackCStringUtf8# s_s3BA;
        } in
          case
              GHC.CString.unpackAppendCString# lvl_r3wv sat_s3BB
          of
          sat_s3BC
          { __DEFAULT -> GHC.Err.errorWithoutStackTrace sat_s3BC;
          };

Apparently this is CAFFY (othewise we'd see Caf=NoCafRefs).

I also realized that unarise uses absentError wrong: absentError expects an Addr# argument (for a string that represents the "absent" argument) but unarise just uses absentError without an argument. So there's also a type error.

Applying absentError to an argument in unarise would mean let bindings (heap allocation) so perhaps we should use something else for unused pointer location in unboxed sums. (IIRC when we first used absentError it wasn't taking an argument)

If the id we'll use instead of absentError is not CAFFY then this problem will disappear. However I wonder if we could somehow implement a check that ensures we won't make things more caffy after TidyPgm in the future.

comment:20 in reply to:  19 Changed 8 months ago by bgamari

Replying to osa1:

STG for absentError:

...

Apparently this is CAFFY (othewise we'd see Caf=NoCafRefs).

Right, that's what I suspected.

I also realized that unarise uses absentError wrong: absentError expects an Addr# argument (for a string that represents the "absent" argument) but unarise just uses absentError without an argument. So there's also a type error.

Yikes! It seems like this is one case that the old STG lint type-checker likely could have caught. Oh well.

Applying absentError to an argument in unarise would mean let bindings (heap allocation) so perhaps we should use something else for unused pointer location in unboxed sums. (IIRC when we first used absentError it wasn't taking an argument)

If the id we'll use instead of absentError is not CAFFY then this problem will disappear.

Yes, this seems like the shortest path to solving this. Introduce a variant of absentError that throws a less helpful error message but requires no allocation and isn't caffy.

However I wonder if we could somehow implement a check that ensures we won't make things more caffy after TidyPgm in the future.

The whole interaction between CorePrep, CoreTidy, and caffyness is quite unfortunate. It would be better if CorePrep and CoreTidy were less coupled in this regard. I thought there was a ticket discussing this, but I'm having trouble finding it at the moment.

comment:21 Changed 8 months ago by bgamari

comment:22 Changed 8 months ago by simonpj

I also realized that unarise uses absentError wrong: absentError expects an Addr# argument (for a string that represents the "absent" argument) but unarise just uses absentError without an argument. So there's also a type error.

That is pretty terrible. I suggest making a separate ticket and fixing it asap.

comment:23 Changed 8 months ago by simonpj

The whole interaction between CorePrep, CoreTidy, and caffyness is quite unfortunate. It would be better if CorePrep and CoreTidy were less coupled in this regard. I thought there was a ticket discussing this, but I'm having trouble finding it at the moment.

It's #9718, and I wonder if we should devote some effort to nailing it? It's not that hard, and the status quo is a continuing source of fragility.

comment:24 Changed 8 months ago by simonpj

PS: good sleuthing!

comment:25 Changed 8 months ago by andrewthad

Ömer, thanks for the excellent sleuthing and the very clear explanation of what had gone awry. It sounds like the crash I have been experiencing will be fixed in GHC 8.6 (and that someone may resolve the longstanding #9718). What I don’t fully grasp is the set of circumstances under which this error manifests itself. Based on my reading of the explanation, it seems like this would happen any time there was a pattern match on an unboxed sum in which none of the cases referenced a CAF. But that doesn’t seem right, because that would mean that this error would happen all the time. The reason I am interested in understanding this is because I would like to know if there is a subset of programs in which unboxed sums can safely be used. Or, as a library author, is it currently impossible to use unboxed sums in such a way that the end user cannot induce the failure case?

comment:26 Changed 8 months ago by osa1

Andrew,

It sounds like the crash I have been experiencing will be fixed in GHC 8.6

That's correct. I'm implementing a fix right now, it should be merged in a few days.

What I don’t fully grasp is the set of circumstances under which this error manifests itself.

This error happens when:

  • You have a top-level definition that uses unboxed sums and is not CAFFY (i.e. does not reference to CAFs directly or transitively).

Note that GHC as a result of optimisations can introduce top-level definitions so predicting whether you'll have such definitions is impossible from the source code.

  • Unboxed sums used in this top-level definition have unused fields. This happens when you have a sum with type
(# Int# | Int #)

Here you need 3 slots: a tag, an Int# and an Int, because pointer and non-pointer fields can't overlap (GC concerns). Now here are two terms and what they're compiled to:

(# 1# | #) ===> (# 0#, 1#, UNUSED_POINTER #)
(# | 1 #)  ===> (# 1#, UNUSED_INT#, 1 #)

We don't want to leave these unused fields uninitialized (for various reasons, like safety and IIRC also has to do with Cmm code generation which used to panic when pointer field is left with uninitialized garbage), so we use something called absentError. absentError should not be CAFFY otherwise it would make everything transitively referring to it CAFFY (I think when we first implemented unboxed sums absentError was not CAFFY), and CAFFY objects need special treatment from code generator and garbage collector. This causes the problem because CAFFY-ness is calculated in an early stage than the pass that compiles unboxed sums to unboxed tuples, causing inconsistent CAFFY information.

In summary if at least one of these conditions hold you're safe:

  • You don't have UNUSED_POINTER fields in your unboxed sum types
  • The functions that use unboxed sums are CAFFY

I'd suggest updating to next GHC when it's released.

Simon,

It's #9718, and I wonder if we should devote some effort to nailing it? It's not that hard, and the status quo is a continuing source of fragility.

I can work on this.

That is pretty terrible. I suggest making a separate ticket and fixing it asap.

I'll do that now. The new function/primop we'll use instead of absentError won't be CAFFY so it'll fix this ticket.

Last edited 8 months ago by osa1 (previous) (diff)

comment:27 Changed 8 months ago by osa1

Differential Rev(s): Phab:D4652
Status: newpatch

Simon, I submitted a fix for the absent error problem, but did not make a new ticket because it also fixes this.

I'll continue with #9718.

comment:28 Changed 8 months ago by simonpj

I also realized that unarise uses absentError wrong: absentError expects an Addr# argument (for a string that represents the "absent" argument) but unarise just uses absentError without an argument. So there's also a type error.

I think your patch is to fix this problem. But I see nothing about giving it an argument of the right type ... I see stuff about not giving it an argument at all for some reason; and strange stable pointers.

If we are going to fix #9718, how much of this do we need? Perhaps very little?

At least, if there's

  1. something that is necessary regardless; and
  2. something that's a temporary sticking plaster to deal with the absence of #9718

then let's keep them separate so we can revert (2) when we get #9718.

comment:29 Changed 8 months ago by osa1

I responded in the diff.

If we are going to fix #9718, how much of this do we need? Perhaps very little?

I don't think fixing #9718 would fix this bug because this bug happens in unarise which happens much later than both TidyPgm and CorePrep. I think to avoid this bug we need a way to ensure we don't make things more CAFFY after CAFFYness information is calculated in TidyPgm. This includes all the Stg-to-Stg transforms (CSE, unarise, perhaps new passes added in the future).

comment:30 Changed 8 months ago by simonpj

I think to avoid this bug we need a way to ensure we don't make things more CAFFY after CAFFYness information is calculated in TidyPgm

The whole point of #9718 is to stop TidyPgm predicting CAF-hood. Instead compute CAF-hood just before code generation, after all transformations (including any on STG) have settled down.

comment:31 Changed 8 months ago by osa1

Fair enough. Sorry I'm confused because the title says "predicting what CorePrep will do" not "predicting what CodePrep and later stages (including code generator) do".

I think for efficiency reasons unboxed sums should really not make things CAFFY (similar to how unboxed tuples or unboxed literals do not make things CAFFY) so whatever we'll be using for the absent pointer values should not be considered as CAFFY in the code generator. Phab:D4652 fixes this so I think it's "something that is necessary regardless" and not a temporary workaround (e.g. would not be reverted after #9718).

As mentioned in the diff there's an alternative implementation that I think wouldn't need the business with stable pointers, while still fixing the problems with the type error and making things CAFFY.

comment:32 Changed 8 months ago by simonpj

I think for efficiency reasons unboxed sums should really not make things CAFFY

I rather agree. But the same is true for patError etc, not just absentError. I've opened a new ticket #15113.

So if that was fixed, again this patch (or part of it) becomes a patch we'd want to revert. But perhaps not all of it? I'm not even sure any more :-(.

comment:33 Changed 8 months ago by simonpj

Aha. Concerning other error-ids, like patErrorId, look at this, in MkCore:

mkRuntimeErrorId name
 = mkVanillaGlobalWithInfo name runtimeErrorTy bottoming_info
 where
    bottoming_info = vanillaIdInfo `setStrictnessInfo`    strict_sig
                                   `setArityInfo`         1
                        -- Make arity and strictness agree

        -- Do *not* mark them as NoCafRefs, because they can indeed have
        -- CAF refs.  For example, pAT_ERROR_ID calls GHC.Err.untangle,
        -- which has some CAFs
        -- In due course we may arrange that these error-y things are
        -- regarded by the GC as permanently live, in which case we
        -- can give them NoCaf info.  As it is, any function that calls
        -- any pc_bottoming_Id will itself have CafRefs, which bloats
        -- SRTs.

This came from a commit back in 2002!

commit 33ce2a14d1220bd3b9f00e8d461b8f9ef31ee411
Author: simonpj <unknown>
Date:   Fri Jun 28 14:06:54 2002 +0000

    [project @ 2002-06-28 14:06:52 by simonpj]
    -----------------------------------
   	Fix the CAF info field of error Ids
    	-----------------------------------

    A bizarre bug.   In MkId, we build the Id for various error-y
    Ids (like pAT_ERROR_ID) that we grab out of thin air in various
    places (like the desugarer).  They were marked as not referring
    to any CAFs, but this was a lie!  In fact, they refer to 'untangle'
    (see GHC.Err) and thence to a CAF.

    Result: GC crash under very obscure circumstances.  (Rob's optimistic
    evaluator tickled it.)

    Solution: give them more conservative IdInfo.

    Two other better solutions to think about:

    * Don't grab them out of thin air; instead get them from
      an interface file.

    * Treat them as always-live (requires mod to garbage collector)
      so they don't need to be mentioned in SRTs at all

So even if we fixed #15113, that would not help with patErrorId etc, becuase they are wired-in Ids, built with mkRuntimeErrorId, and hence (conservatively) marked MayHaveCafRefs.

What to do? I hate this special treatment of aBSENT_ERROR_ID etc. Let's instead:

  • Add Note [The CAF-ness of error Ids] explaining all this.
  • Make all the error ids (both aBSENT_ERROR_ID and mkRuntimeErrorId start with errorIdInfo
  • Define errorIdInfo = noCafInfo, pointing to Note [The CAF-ness of error Ids], which says that we can mark them no-CAF because (and only because) we add them the to the Stable Pointer Table; see Note [StableGcRoots] in RtsStartup.c.
  • Add all the wired-in error-ids to the list in RtsStartup.c. And make sure that the MkCore.errorIds list also points to the Note.

Now all the error Ids will be non-CAFy, we'll get smaller SRTs etc, which is good. And there will be no special treatment of the unboxed-sum business. (But you might want a Note in the unarise transform to explain that adding a reference to absent-error-id doesn't change CAFyness.)

comment:34 Changed 7 months ago by osa1

Differential Rev(s): Phab:D4652Phab:D4680

comment:35 Changed 7 months ago by osa1

Milestone: 8.6.18.4.3

It seems like we'll be making another 8.4 release (for #15068), we should probably include the fix for this as well.

comment:36 Changed 7 months ago by Ömer Sinan Ağacan <omeragacan@…>

In b2ff5dde/ghc:

Fix #15038

We introduce a new Id for unused pointer values in unboxed sums that is
not CAFFY. Because the Id is not CAFFY it doesn't make non-CAFFY
definitions CAFFY, fixing #15038.

To make sure anything referenced by the new id will be retained we get a
stable pointer to in on RTS startup.

Test Plan: Passes validate

Reviewers: simonmar, simonpj, hvr, bgamari, erikd

Reviewed By: simonmar

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #15038

Differential Revision: https://phabricator.haskell.org/D4680

comment:37 Changed 7 months ago by osa1

Status: patchmerge

comment:38 Changed 7 months ago by simonpj

Just to note that the patch in comment:36 should be reverted when we implement #9718. Or at least revisited to make the treatment of error-ids uniform with each other.

comment:39 Changed 7 months ago by simonpj

Also to note: Omer tried the approach in comment:33, but we found that binary sizes went up by 1%. Comment:33 means that we'd link the code for all error-ids into every binary even if they are not actually used; but it seems hard to believe that would increase code size by 1%.

Rather than pursue this mystery, we backed off and did the ad-hoc special case implemented in comment:36.

comment:40 Changed 7 months ago by andrewthad

Should I now be able to try building my original repository (packed) with GHC HEAD?

comment:41 Changed 7 months ago by osa1

That'd be great. I tried your reproducer (correupted-memory-example) but not the original package.

comment:42 Changed 7 months ago by andrewthad

I just confirmed that my original library no longer crashes when built with GHC head. If it is possible to release this fix with GHC 8.4.3, I would greatly appreciate it.

comment:43 Changed 7 months ago by Sergei Trofimovich <slyfox@…>

In 79bbb23/ghc:

rts: export new absentSumFieldError from base

commit b2ff5dde399cd012218578945ada1d9ff68daa35 "Fix #15038"
added new stable closure 'absentSumFieldError_closure' to
base package. This closure is used in rts package.

Unfortunately the symbol was not explicitly exported and build
failed on windows as:

```
"inplace/bin/ghc-stage1" -o ...hsc2hs.exe ...
rts/dist/build/libHSrts.a(RtsStartup.o): In function `hs_init_ghc':

rts/RtsStartup.c:272:0: error:
     undefined reference to `base_ControlziExceptionziBase_absentSumFieldError_closure'
    |
272 |     getStablePtr((StgPtr)absentSumFieldError_closure);
    | ^

```

This change adds 'absentSumFieldError_closure' to explicit export
into libHSbase.def.

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>

comment:44 Changed 7 months ago by bgamari

Resolution: fixed
Status: mergeclosed

comment:45 Changed 7 months ago by Ömer Sinan Ağacan <omeragacan@…>

In 11eed2f/ghc:

testsuite: Don't rely on find command in T15038

Test Plan: Validate

Reviewers: int-index, osa1

Reviewed By: osa1

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #15038

Differential Revision: https://phabricator.haskell.org/D4723
Note: See TracTickets for help on using tickets.