Opened 2 years ago

Last modified 6 months ago

#11317 new bug

Test prog003 fails with segfault on Windows (GHCi)

Reported by: rdragon
Owned by:
Priority: normal
Milestone:
Component: GHCi
Version: 7.11
Keywords: GC
Cc: Phyx-, mnislaih, bmjames, alexandersgreen, meteficha
Operating System: Windows
Architecture: Unknown/Multiple
Type of failure: GHCi crash
Test Case: prog003
Blocked By:
Blocking:
Related Tickets: #11234, #3408
Differential Rev(s):
Wiki Page:

Description

On my x86_64 Windows machine test prog003 fails with a segfault:

cd ./ghci/prog003 && ghciWayFlags=-static HC="/home/Rik/ghc/inplace/bin/ghc-stage2.exe" HC_OPTS="-dcore-lint -dcmm-lint -dno-debug-output -no-user-package-db -rtsopts -fno-warn-tabs -fno-warn-missed-specialisations -fno-ghci-history " "/home/Rik/ghc/inplace/bin/ghc-stage2.exe" -dcore-lint -dcmm-lint -dno-debug-output -no-user-package-db -rtsopts -fno-warn-tabs -fno-warn-missed-specialisations -fno-ghci-history  --interactive -v0 -ignore-dot-ghci +RTS -I0.1 -RTS    <prog003.script > prog003.run.stdout 2> prog003.run.stderr >> prog003.run.stdout 2>> prog003.run.stderr
Wrong exit code (expected 0 , actual 1 )
Stdout:
Run 1
a :: Int -> Int
168
Run 2
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 3
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 4
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 5
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 6
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 7
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 8
Segmentation fault/access violation in generated code

Stderr:

*** unexpected failure for prog003(ghci)

(I do not think it is an access violation error since when I remove line 65 and everything below line 70, the same stdout is generated.)

This looks related to ticket #11234.

The GHC version used was df6cb57b32d94b7f6f7c9a86207adfeee9712ed6, although I already noticed the problem with a version a few days older.

If someone could point me to information about how to install GDB suitable for debugging GHC on my Windows machine I will try to post the stack trace.

Change History (15)

comment:1 Changed 2 years ago by thomie

Cc: Phyx- added

comment:2 Changed 2 years ago by Phyx-

That's peculiar. When I ran the tests after #11234, it was working.

In any case, making a devel2 build should compile the RTS with symbols.

Instructions for debugging are here

I will try to reproduce, thanks for the report!

comment:3 Changed 2 years ago by Phyx-

Architecture: x86_64 (amd64) → Unknown/Multiple
Keywords: GC added
Test Case: prog003

This seems like it's related to #3408. The GC timeout -I is at fault.

The default timeout for Windows was changed to 5 seconds in https://github.com/ghc/ghc/blob/master/ghc/hschooks.c#L42 to avoid issues with the GC and haskeline on Windows.

Incidentally, I don't know why this was done there (which is rather hidden) rather than in RtsFlags.c. The help text for the RTS options is also wrong (it says the default is 0.3s for all platforms).

In any case, the test is setting a value so low that it's causing a segfault.

On a normal run the GC ends with

Memory inventory:
  gen 0 blocks :    16 blocks (   0.1 MB)
  gen 1 blocks :  5541 blocks (  21.6 MB)
  nursery      :  5379 blocks (  21.0 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     1 blocks (   0.0 MB)
  free         :  1411 blocks (   5.5 MB)
  total        : 12348 blocks (  48.2 MB)
       84f44: cap 0: all caps stopped for GC
       84f44: cap 0: finished GC
       84f44: exitHpc
       84f44: cap 0: shutting down

but when it segfaults

Memory inventory:
  gen 0 blocks :    51 blocks (   0.2 MB)
  gen 1 blocks :  5470 blocks (  21.4 MB)
  nursery      :  5202 blocks (  20.3 MB)
  retainer     :     0 blocks (   0.0 MB)
  arena blocks :     0 blocks (   0.0 MB)
  exec         :     1 blocks (   0.0 MB)
  free         :  3388 blocks (  13.2 MB)
  total        : 14112 blocks (  55.1 MB)
       84cd4: cap 0: all caps stopped for GC
       84cd4: cap 0: finished GC

So the GC does finish, but something happens between that point and exitHpc.

comment:4 Changed 18 months ago by Tamar Christina <tamar@…>

In b40e1b4c/ghc:

Fix incorrect calculated relocations on Windows x86_64

Summary:
See #12031 for analysis, but essentially what happens is:

To sum up the issue, this goes wrong because of how we initialize
the `.bss` section for Windows in the runtime linker.

The first issue is where we calculate the zero space for the section:

```
zspace = stgCallocBytes(1, bss_sz, "ocGetNames_PEi386(anonymous bss)");
sectab_i->PointerToRawData = ((UChar*)zspace) - ((UChar*)(oc->image));
```

Where
```
UInt32 PointerToRawData;
```

This means we're stuffing a `64-bit` value into a `32-bit` one. Also, `zspace`
can be larger than `oc->image`, in which case the difference overflows and
then gets truncated in the cast.

The address of a value in the `.bss` section is then calculated as:

```
addr = ((UChar*)(oc->image))
     + (sectabent->PointerToRawData
     + symtab_i->Value);
```

If it does truncate then this calculation won't be correct (which is what is happening).
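
The effect of that truncation can be sketched with plain integer arithmetic. The snippet below is an illustration in Python, not code from the commit: the addresses are made up, the helper names `store_offset_32` and `recompute_addr` are hypothetical, and the 32-bit field is modelled with a mask because Python integers do not wrap.

```python
# Illustrative model only: made-up addresses, hypothetical helper names.
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def store_offset_32(zspace: int, image: int) -> int:
    """Model storing (zspace - image) in a UInt32 PointerToRawData field."""
    return (zspace - image) & MASK32


def recompute_addr(image: int, pointer_to_raw_data: int, value: int) -> int:
    """Model addr = image + (PointerToRawData + Value) with 32-bit fields."""
    offset = (pointer_to_raw_data + value) & MASK32
    return (image + offset) & MASK64


image = 0x0000000010000000      # assumed load address of the object image
near = image + 0x4000           # zero space allocated close to the image
far = image + 0x200004000       # zero space allocated more than 4 GB away

for zspace in (near, far):
    ptr = store_offset_32(zspace, image)
    addr = recompute_addr(image, ptr, 0x10)
    # Print the intended address, the recomputed one, and whether they agree.
    print(hex(zspace + 0x10), hex(addr), addr == zspace + 0x10)
```

In this model the round trip is exact only when the offset plus the symbol value fits in 32 bits; in every other case the recomputed address points somewhere other than the zero space.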

We then later use the value of `addr` as the `S` (Symbol) value for the relocations:

```
S = (size_t) lookupSymbol_( (char*)symbol );
```

Now the majority of the relocations are `R_X86_64_PC32` etc.,
i.e. they are guaranteed to fit in a `32-bit` value.

The `R_X86_64_64` relocation introduced for these pseudo-relocations, so that they can use
the full `48-bit` addressing space, isn't as lucky.
Whether it works depends on whether the value is truncated or not, which is why it only sometimes fails.

`PointerToRawData` can't be changed because its size is fixed by the PE specification.

Instead, just like on the other platforms, we now use `section` on Windows as well.
This gives us a `start` parameter of type `void*`, which solves the issue.

This refactors the code to use `section.start` and fixes the issues.

Test Plan: ./validate and new test added T12031

Reviewers: RyanGlScott, erikd, bgamari, austin, simonmar

Reviewed By: simonmar

Subscribers: thomie, #ghc_windows_task_force

Differential Revision: https://phabricator.haskell.org/D2316

GHC Trac Issues: #12031, #11317

comment:5 Changed 18 months ago by Phyx-

Resolution: fixed
Status: new → closed

This commit seems to fix prog003 as well.

comment:6 Changed 12 months ago by bgamari

Resolution: fixed
Status: closed → new

This appears to be failing in the same manner on Windows again.

Last edited 12 months ago by bgamari

comment:7 Changed 12 months ago by Ben Gamari <ben@…>

In 24a4fe2/ghc:

testsuite: Mark prog003 as broken on Windows

Due to #11317.

comment:8 Changed 12 months ago by bgamari

This failure is actually quite flaky. It seems to fail most of the time, but occasionally doesn't.

Unfortunately gdb isn't of much help here:

GHCi, version 8.1.20161216: http://www.haskell.org/ghc/  :? for help
[New Thread 7512.0xb0c]
Warning: ignoring unrecognised input `prog003.script'
Prelude> :script prog003.script
[1 of 1] Compiling Main             ( shell.hs, interpreted )
Ok, modules loaded: Main.
Run 1
[1 of 4] Compiling D                ( D.hs, interpreted )
[2 of 4] Compiling C                ( C.hs, interpreted )
[3 of 4] Compiling B                ( B.hs, interpreted )
[4 of 4] Compiling A                ( A.hs, interpreted )
Ok, modules loaded: A, B, C, D.
a :: Int -> Int
168
Run 2
[1 of 4] Compiling D                ( D.hs, interpreted )
[2 of 4] Compiling C                ( C.hs, interpreted ) [D changed]
[3 of 4] Compiling B                ( B.hs, interpreted ) [D changed]
[4 of 4] Compiling A                ( A.hs, interpreted ) [B changed]
Ok, modules loaded: A, B, C, D.
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 3
Ok, modules loaded: A, B, C, D.
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 4
[2 of 4] Compiling C                ( C.hs, interpreted )
[3 of 4] Compiling B                ( B.hs, interpreted )
[4 of 4] Compiling A                ( A.hs, interpreted )
Ok, modules loaded: A, B, C, D (D.o).
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 5
[3 of 4] Compiling B                ( B.hs, interpreted )
[4 of 4] Compiling A                ( A.hs, interpreted )
Ok, modules loaded: A, B, C (C.o), D (D.o).
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 6
[4 of 4] Compiling A                ( A.hs, interpreted )
Ok, modules loaded: A, B (B.o), C (C.o), D (D.o).
(A.a,B.b,C.c,D.d)
  :: (Float -> Float, Float -> Float, Float -> Float, Float -> Float)
28.0
Run 7

Program received signal SIGSEGV, Segmentation fault.
0x000000000af36d2f in ?? ()
(gdb) bt
#0  0x000000000af36d2f in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

comment:9 Changed 12 months ago by Tamar Christina <tamar@…>

In a3704409/ghc:

Fix various issues with testsuite code on Windows

Summary:
Previously we would make direct calls to `diff` using `os.system`.
On Windows `os.system` is implemented using the standard
idiom `CreateProcess .. WaitForSingleObject ..`.

This again runs afoul of the `_exec` behaviour on Windows, so we ran
into some trouble where `diff` would sometimes return before it was done.

On tests which run multiple ways, such as `8086`, what happens is that
we think the diff is done and continue. The next way tries to set things
up again by removing any previous directory. This would then fail with
an error saying the directory can't be removed, which is true, because
the previous diff code/child is still running.

We shouldn't make any external calls to anything using `os.system`.
Instead just use `runCmd` which uses `timeout`. This also ensures that if
we hit the cygwin bug where diff or any other utility hangs, we kill it and
continue and not hang the entire test and leave hanging processes.
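
As an illustration of the idea (not the testsuite's actual `runCmd`, whose implementation and `timeout` helper are not shown here), the sketch below uses the timeout support in Python's `subprocess.run` as a stand-in. The function name `run_with_timeout` is hypothetical. It waits for the child to exit and kills it if it runs past the limit, so the caller never continues while the command is still running.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_s):
    """Run a command and wait for it; kill it if it exceeds timeout_s seconds.

    Returns the exit code, or None if the command was killed on timeout.
    """
    try:
        # subprocess.run kills the child and waits for it before re-raising
        # TimeoutExpired, so no process is left behind by this call.
        completed = subprocess.run(args, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# A command that finishes normally, and one that would hang without a limit.
quick = run_with_timeout([sys.executable, "-c", "pass"], 30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1)
print(quick, slow)
```

Note that this stand-in only kills the direct child; a utility that spawns its own children would need extra handling.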

Furthermore, we also:
Ignore error lines from `removeFile` from tools in the testsuite. This is a rather large
hammer to work around the fact that `hsc2hs` often tries to remove its own file too early.
When this is patched the workaround can be removed. See Trac #9775

We mark `prog003` as skip, since this test randomly fails and passes. It is disabled for stability,
but it is a genuine bug which we should find. It's something to do with interface files being
overwritten. See Trac #11317

When `rmtree` hits a readonly file, the `onerror` handler is raised afterwards but not
during the tree walk. It doesn't allow you to recover and continue as we thought.
Instead you have to explicitly start again. This is why, even though we
call `cleanup` before `os.mkdirs`, it would sometimes fail with an error that the
folder already exists. So we now do a second walk.
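
A minimal sketch of the "second walk" idea follows. It is not the commit's cleanup code: the names `make_tree_writable` and `remove_tree` are hypothetical, and instead of an `onerror` handler it simply clears the read-only bits and starts the removal again after a failure.

```python
import os
import shutil
import stat


def make_tree_writable(root):
    """Give the owner full access to root and everything below it."""
    os.chmod(root, os.stat(root).st_mode | stat.S_IRWXU)
    for dirpath, dirnames, filenames in os.walk(root):
        # Directories are fixed before os.walk descends into them.
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                os.chmod(path, os.stat(path).st_mode | stat.S_IRWXU)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                os.chmod(path, os.stat(path).st_mode | stat.S_IWUSR)


def remove_tree(root):
    """Remove root; if the first walk fails, fix permissions and walk again."""
    if not os.path.exists(root):
        return
    try:
        shutil.rmtree(root)
    except OSError:
        # The first walk stopped part-way, so explicitly start again.
        make_tree_writable(root)
        shutil.rmtree(root)
```

Calling `remove_tree` before recreating a test directory means a leftover read-only file no longer leaves the old folder in place.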

A new verbosity level (4) will strip the silent flags from `MAKE` invocations so you can actually
see what's going on.

Test Plan: ./validate on build bots.

Reviewers: bgamari, austin

Reviewed By: bgamari

Subscribers: mpickering, thomie, #ghc_windows_task_force

Differential Revision: https://phabricator.haskell.org/D2894

GHC Trac Issues: #12661, #11317, #9775

comment:10 Changed 6 months ago by mnislaih

Cc: mnislaih added

comment:11 Changed 6 months ago by bmjames

Cc: bmjames added

comment:12 Changed 6 months ago by alexandersgreen

Cc: alexandersgreen added

comment:13 Changed 6 months ago by meteficha

Cc: meteficha added

comment:14 Changed 6 months ago by mnislaih

We are seeing segmentation faults (on Windows only) quite regularly while trying to run things in ghci, or even just from lens TH splices while compiling code.

comment:15 Changed 6 months ago by Phyx-

Hi, do you have a small example where this occurs? prog003 is quite hard to debug due to the shell interactions. If you have a Haskell-only example, could you report it under a new issue? Thanks!
