Opened 3 years ago

Closed 3 years ago

#5003 closed bug (fixed)

test 4038 (ghci) fails on OS X 64 bit.

Reported by: gwright Owned by: gwright
Priority: high Milestone: 7.2.1
Component: GHCi Version: 7.0.2
Keywords: Cc: pho@…, william.knop.nospam@…, michal.terepeta@…
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime crash Difficulty:
Test Case: Blocked By:
Blocking: Related Tickets:

Description

Test 4038 (ghci) sometimes fails on OS X 64 bit. When the failure happens, the test crashes with a segfault or bus error:

gwright-macbook> inplace/bin/ghc-stage2 --interactive 4038.hs
GHCi, version 7.1.20110302: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( 4038.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
Bus error
gwright-macbook> 

Under gdb, I get

gwright-macbook> gdb
GNU gdb 6.3.50-20050815 (Apple version gdb-1510) (Wed Sep 22 02:45:02 UTC 2010)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin".
(gdb) source gdbinit 
(gdb) source gdb_ghc 
Reading symbols for shared libraries ..... done
(gdb) run
Starting program: /Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib/ghc-stage2 +RTS -V0 -i0 -RTS -B/Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib -pgmc /usr/bin/gcc-4.2 -pgma /usr/bin/gcc-4.2 -pgml /usr/bin/gcc-4.2 -pgmP "/usr/bin/gcc-4.2 -E -undef -traditional" --interactive 4038.hs
Reading symbols for shared libraries ++++. done
GHCi, version 7.0.1.20101221: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( 4038.hs, interpreted )
Ok, modules loaded: Main.
*Main> main

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x0000000104300fd8
[Switching to process 249]
0x0000000101f3d8a9 in stg_makeStablePtrzh ()
(gdb) 

The gdb_ghc file used to start gdb contained:

file /Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib/ghc-stage2
set args +RTS -V0 -i0 -RTS -B/Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib -pgmc /usr/bin/gcc-4.2 -pgma /usr/bin/gcc-4.2 -pgml /usr/bin/gcc-4.2 -pgmP "/usr/bin/gcc-4.2 -E -undef -traditional" --interactive 4038.hs

I had copied the failing 4038.hs file to the top level directory.

The address of the failure, 0x104300fd8, is in the block of memory mapped by the linker for HSghc-prim-0.2.0.0.o.

Attachments (1)

4038.hs.dpatch (119.9 KB) - added by gwright 3 years ago.

Download all attachments as: .zip

Change History (30)

comment:1 Changed 3 years ago by gwright

  • Owner set to gwright

comment:2 Changed 3 years ago by PHO

  • Cc pho@… added

comment:3 Changed 3 years ago by gwright

A more careful check of mmap'ed addresses shows that the failed access, 0x104300fd8, is just above one of the mmap'ed blocks. I'm still guessing its a relocation bug.

comment:4 Changed 3 years ago by altaic

  • Cc william.knop.nospam@… added

Not sure if this is useful, but while poking around I noticed that when ghc is built with -Rghc-timing -O0 -DDEBUG -debug the bad access happens in stg_ap_p_info rather than stg_makeStablePtrzh.

comment:5 follow-up: Changed 3 years ago by michalt

  • Cc michal.terepeta@… added

I've got similar problem on x86-64 Linux but only with ghc compiled code, not
ghci. For me 4038 simply segfaults when compiled. The interesting thing is
that if I change the test to

f :: Int -> IO Int
f 0 = return 42
f n = do
  putStrLn $ "first: " ++ show n
  f' <- wrap_f_io f
  putStrLn $ "second: " ++ show n
  f_io f' (n-1)

Then the result is:

 ~/develop/ghc/inplace/bin/ghc-stage2 --make -fforce-recomp 4038 -rtsopts -debug && ./4038
[1 of 1] Compiling Main             ( 4038.hs, 4038.o )
Linking 4038 ...
first: 1000
second: 1000
first: 999
second: 999
first: 998
second: 998
[...]
first: 501
second: 501
first: 500
second: 500
Segmentation fault

When running gdb on it

[...]
first: 501
second: 501
first: 500
second: 500

Program received signal SIGSEGV, Segmentation fault.
0x00000000006e5ce3 in stg_ap_p_info ()

So it segfaults also in stg_ap_p_info.

This seems a bit too similar to be unrelated, so maybe the issue is not limited
to OS X? Or am I just hitting different bug?

comment:6 Changed 3 years ago by altaic

Compiling and running your modified test on OS X (ghc-stage2 built with -Rghc-timing -O0 -DDEBUG -debug) fails similarly:

[...]
first: 501
second: 501

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00007fff5f3fe8a0
0x000000010031ac72 in stg_ap_p_info ()

Interestingly, the failure occurs more quickly in ghci:

[...]
first: 908
second: 908

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x00000001037ff720
[Switching to process 27370 thread 0x2003]
0x0000000101b7331a in stg_ap_p_info ()

Note that all the failures in stg_ap_p_info happen at the same instruction:

0x0000000101b7331a <stg_ap_p_info+890>:	mov    %rax,0x40(%rsp)

Using ghc-stage2 built with BuildFlavour = perf, the modified test fails somewhat differently.

ghci:

[...]
first: 296
second: 296

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x0000000103700fd0
[Switching to process 5685 thread 0x2003]
0x0000000101aad47e in stg_takeMVarzh ()

But similarly, the instruction is the first access of rsp:

0x0000000101aad47e <stg_takeMVarzh+6>:	mov    %rax,0x40(%rsp)

And using ghc-stage2 built with BuildFlavour = devel2...

ghci:

first: 1000
second: 1000

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
[Switching to process 21612 thread 0x2103]
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00000001037de4dd in ?? ()
#2  0x0000000104a1d9c8 in ?? ()
Previous frame inner to this frame (gdb could not unwind past this frame)
(gdb) frame 1   
#1  0x00000001037de4dd in ?? ()
(gdb) disas $rip $rip+1
Dump of assembler code from 0x1037de4dd to 0x1037de4de:
0x00000001037de4dd:	mov    0x50(%rsp),%rcx
End of assembler dump.
(gdb) p $rsp
$1 = (void *) 0x10397ce60

Finally, using ghc built with -Rghc-timing -O0 -DDEBUG, the original test fails as originally reported (note: callq accesses rsp).

ghci:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x0000000103900f88
[Switching to process 21633 thread 0x2103]
0x0000000101b5197b in stg_makeStablePtrzh ()
(gdb) disas $rip $rip+1
Dump of assembler code from 0x101b5197b to 0x101b5197c:
0x0000000101b5197b <stg_makeStablePtrzh+51>:	callq  0x101b34f80 <getStablePtr>
End of assembler dump.
(gdb) p $rsp
$1 = (void *) 0x103900f90

comment:7 Changed 3 years ago by altaic

In StgRun (in StgCRun.c), rsp gets set to the bogus address just before the error. I poked around in StgRun a bit and came across a correlation. Perhaps it's some sort of mmap bug in Linker.c.

In ./includes/rts/Constants.h, changing

#define RESERVED_C_STACK_BYTES (2048 * SIZEOF_LONG)

to

#define RESERVED_C_STACK_BYTES (4096 * SIZEOF_LONG)

causes the error to happen twice as soon, i.e.

[...]
first: 748
second: 748

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00007fff5f3f8a88
0x000000010031ab22 in stg_ap_p_info ()

rather than the original

[...]
first: 501
second: 501

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00007fff5f3fe8a0
0x000000010031ac72 in stg_ap_p_info ()

Likewise, changing it to

#define RESERVED_C_STACK_BYTES (1024 * SIZEOF_LONG)

causes the error to happen twice as late, i.e

[...]
first: 22
second: 22

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00007fff5f3ff6e8
0x000000010031ab22 in stg_ap_p_info ()

comment:8 in reply to: ↑ 5 ; follow-up: Changed 3 years ago by simonmar

  • Milestone set to 7.2.1
  • Priority changed from normal to high

Replying to michalt:

I've got similar problem on x86-64 Linux but only with ghc compiled code, not
ghci.

I tried and I can't reproduce the error here on x86-64/Linux (Ubuntu 10.04). I also ran the program with Valgrind and found no errors (reading uninitialized data is a common cause of things that fail on one platform but not another).

So I'm stumped, but we should treat this as a high priority. It could be related to #4303. Ian, could you try to repro on your Mac?

comment:9 in reply to: ↑ 8 Changed 3 years ago by michalt

Replying to simonmar:

Replying to michalt:

I've got similar problem on x86-64 Linux but only with ghc compiled code, not
ghci.

I tried and I can't reproduce the error here on x86-64/Linux (Ubuntu 10.04). I also ran the program with Valgrind and found no errors (reading uninitialized data is a common cause of things that fail on one platform but not another).

So I'm stumped, but we should treat this as a high priority. It could be related to #4303. Ian, could you try to repro on your Mac?

Just to be sure I've pulled a fresh copy of GHC and just run validate --- as
before test 4038 failed. FYI: I'm running Gentoo, with GCC 4.5.2,
bootstrapping using GHC 7.0.2.

comment:10 Changed 3 years ago by gwright

This is odd. On OS X 10.6.6 (64 bit), ghc-HEAD from today still only has 4038 failing under ghci.
Perhaps I'll try valgrind as well, to see if I can get a clue why we're running off the end of the stack.

comment:11 Changed 3 years ago by gwright

On OS X 64 bit, valgrind didn't show anything:

gwright-macbook> valgrind "inplace/bin/ghc-stage2 --interactive 4038.hs"
valgrind: inplace/bin/ghc-stage2 --interactive 4038.hs: No such file or directory
gwright-macbook> valgrind inplace/bin/ghc-stage2 --interactive 4038.hs
==24759== Memcheck, a memory error detector
==24759== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==24759== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==24759== Command: inplace/bin/ghc-stage2 --interactive 4038.hs
==24759== 
GHCi, version 7.1.20110320: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( 4038.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
Bus error
gwright-macbook> 

Poking around with gdb will be needed to find out when the stack (or stack pointer) is corrupted.

comment:12 Changed 3 years ago by gwright

  • Operating System changed from MacOS X to Unknown/Multiple

I added the print statements as described above by michalt, and on OS X 64 bit I consistently get a crash with ghci at the same point:

first:  695
second: 695
first:  694
second: 694
first:  693
second: 693
Bus error
gwright-macbook> 

The error always seems to happen just after second is 693.

Has anyone observed this error on a 32 bit platform? If not, the underlying bug is probably similar to #4970, in which a 32 bit value was copied to the lower half of a 64 bit location, without the upper half being zeroed or sign extended. Unfortunately, the original error may happen some time before the ultimate crash.

I guess the good news is that this doesn't seem to be an OS X linker bug.

comment:13 Changed 3 years ago by gwright

I've now seen the bug in the compiled 4038.hs. It's much easier to see what's going in this case.

What's happening is that we're simply blowing past the end of our process's C stack. I can see this in gdb. I've set a breakpoint at stg_makeStablePtrzh, since this is the function in which the memory error occurs in the compiled code. When the breakpoint is first encountered, the $rsp (the C stack pointer) is at 0x7fff5fbfb530. When the crash occurs, $rsp is pointing to 0x7fff5f3ffa70:

gwright-macbook> gdb
GNU gdb 6.3.50-20050815 (Apple version gdb-1515) (Sat Jan 15 08:33:48 UTC 2011)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin".
(gdb) source gdbinit 
(gdb) file ./4038
Reading symbols for shared libraries ... done
Reading symbols from /Users/gwright/tmp/ghc-HEAD/ghc/4038...done.
(gdb) break stg_makeStablePtrzh
Breakpoint 1 at 0x100344f78
(gdb) run
Starting program: /Users/gwright/tmp/ghc-HEAD/ghc/4038 
Reading symbols for shared libraries ++. done

Breakpoint 1, 0x0000000100344f78 in stg_makeStablePtrzh ()
(gdb) info reg
rax            0x100054cb0      4295314608
rbx            0x100b041a1      4306518433
rcx            0x1      1
rdx            0x1003c82f8      4298932984
rsi            0x100b00140      4306501952
rdi            0x228    552
rbp            0x100b05310      0x100b05310
rsp            0x7fff5fbfb530   0x7fff5fbfb530
r8             0x0      0
r9             0x19     25
r10            0x100b05628      4306523688
r11            0x100b05610      4306523664
r12            0x100b041a8      4306518440
r13            0x1003c9558      4298937688
r14            0x100b04169      4306518377
r15            0x100b050c0      4306522304
rip            0x100344f78      0x100344f78 <stg_makeStablePtrzh>
eflags         0x203    515
cs             0x27     39
ss             0x0      0
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) ignore 1 506
Will ignore next 506 crossings of breakpoint 1.
(gdb) c
Continuing.

Breakpoint 1, 0x0000000100344f78 in stg_makeStablePtrzh ()
(gdb) info reg
rax            0x10018e7c8      4296599496
rbx            0x10038d421      4298691617
rcx            0x10038d230      4298691120
rdx            0x100b249a0      4306651552
rsi            0x100b24931      4306651441
rdi            0x100346e10      4298403344
rbp            0x100b21340      0x100b21340
rsp            0x7fff5f3ffa70   0x7fff5f3ffa70
r8             0x1009fc0a0      4305436832
r9             0x1009fc0a4      4305436836
r10            0x1      1
r11            0x10038d421      4298691617
r12            0x100b24998      4306651544
r13            0x1003c9558      4298937688
r14            0x10038d421      4298691617
r15            0x100b210c0      4306636992
rip            0x100344f78      0x100344f78 <stg_makeStablePtrzh>
eflags         0x213    531
cs             0x27     39
ss             0x0      0
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) c
Continuing.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00007fff5f3ffa68
0x0000000100344fab in stg_makeStablePtrzh ()
(gdb) info reg
rax            0x0      0
rbx            0x10038d421      4298691617
rcx            0x1003d3ad0      4298980048
rdx            0x100b249a0      4306651552
rsi            0x100b24931      4306651441
rdi            0x10038d421      4298691617
rbp            0x100b21340      0x100b21340
rsp            0x7fff5f3ffa70   0x7fff5f3ffa70
r8             0x1009fc0a0      4305436832
r9             0x1009fc0a4      4305436836
r10            0x1      1
r11            0x10038d421      4298691617
r12            0x100b24998      4306651544
r13            0x1003c9558      4298937688
r14            0x10038d421      4298691617
r15            0x100b210c0      4306636992
rip            0x100344fab      0x100344fab <stg_makeStablePtrzh+51>
eflags         0x10287  66183
cs             0x27     39
ss             0x0      0
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) info break
Num Type           Disp Enb Address            What
1   breakpoint     keep y   0x0000000100344f78 <stg_makeStablePtrzh>
        breakpoint already hit 508 times
(gdb) 

The difference in $rsp between the first breakpoint and the crash is 8370880 (decimal). 8 megabytes is 8388608, just a bit bigger. Check the process limits:

gwright-macbook> ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 2560
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 266
virtual memory          (kbytes, -v) unlimited

The C stack is 8 megabytes, and we're simply running off the end.

The question now is what is pushing without popping?

comment:14 Changed 3 years ago by altaic

Oh that makes sense. On x86-64, RESERVED_C_STACK_BYTES = 2048 * sizeof(SIZEOF_LONG) = 2048 * 8 = 16384 = 16k. Dividing 8192/16 = 512, which is roughly the number of iterations the program runs before crashing. Except they're not loop iterations, they are nested C calls which each get a whopping 16k of stack. So wouldn't this be expected behavior?

comment:15 Changed 3 years ago by altaic

I'll hypothesize that the reason the test doesn't fail on x86 is that sizeof(SIZEOF_LONG) = 4, which means half as much of the stack is eaten by each call. So the 4038 test should also fail on a 32-bit machine when modified with main = f 2000 >>= print.

comment:16 Changed 3 years ago by gwright

More information. Great chunks of stack are being gobbled by StgRunIsImplementedInAssembler (in rts/StgCRun.c). I set a watchpoint on $rsp an ran it between two breakpoints of stg_makeStablePtrzh. Here's the interesting piece:

Watchpoint 2: $rsp

Old value = (void *) 0x7fff5f407b70
New value = (void *) 0x7fff5f407b68
0x0000000100330200 in StgRunIsImplementedInAssembler ()
1: x/i $rip  0x100330200 <StgRunIsImplementedInAssembler>:      sub    $0x4038,%rsp
Continuing.
Watchpoint 2: $rsp

Old value = (void *) 0x7fff5f407b68
New value = (void *) 0x7fff5f403b30
0x0000000100330207 in StgRunIsImplementedInAssembler ()
1: x/i $rip  0x100330207 <StgRunIsImplementedInAssembler+7>:    mov    %rsp,%rax
Continuing.

The stack pointer is being moved by a bit more than 16 kbytes by the sub instruction.

On x86_64, the code for StgRunIsImplementedInAssembler is

static void GNUC3_ATTRIBUTE(used)
StgRunIsImplementedInAssembler(void)
{
    __asm__ volatile (
	/*
	 * save callee-saves registers on behalf of the STG code.
	 */
	".globl " STG_RUN "\n"
	STG_RUN ":\n\t"
	"subq %0, %%rsp\n\t"
	"movq %%rsp, %%rax\n\t"
	"addq %0-48, %%rax\n\t"
        "movq %%rbx,0(%%rax)\n\t"
        "movq %%rbp,8(%%rax)\n\t"
        "movq %%r12,16(%%rax)\n\t"
        "movq %%r13,24(%%rax)\n\t"
        "movq %%r14,32(%%rax)\n\t"
        "movq %%r15,40(%%rax)\n\t"
	/*
	 * Set BaseReg
	 */
	"movq %%rsi,%%r13\n\t"
	/*
	 * grab the function argument from the stack, and jump to it.
	 */
        "movq %%rdi,%%rax\n\t"
        "jmp *%%rax\n\t"

	".globl " STG_RETURN "\n"
       	STG_RETURN ":\n\t"

	"movq %%rbx, %%rax\n\t"   /* Return value in R1  */

	/*
	 * restore callee-saves registers.  (Don't stomp on %%rax!)
	 */
	"movq %%rsp, %%rdx\n\t"
	"addq %0-48, %%rdx\n\t"
        "movq 0(%%rdx),%%rbx\n\t"	/* restore the registers saved above */
        "movq 8(%%rdx),%%rbp\n\t"
        "movq 16(%%rdx),%%r12\n\t"
        "movq 24(%%rdx),%%r13\n\t"
        "movq 32(%%rdx),%%r14\n\t"
        "movq 40(%%rdx),%%r15\n\t"
	"addq %0, %%rsp\n\t"
	"retq"

	: : "i"(RESERVED_C_STACK_BYTES+48+8 /*stack frame size*/));
    /* 
       HACK alert!

       The x86_64 ABI specifies that on a procedure call, %rsp is
       aligned on a 16-byte boundary + 8.  That is, the first
       argument on the stack after the return address will be
       16-byte aligned.  
       
       Which should be fine: RESERVED_C_STACK_BYTES+48 is a multiple
       of 16 bytes.  
       
       BUT... when we do a C-call from STG land, gcc likes to put the
       stack alignment adjustment in the prolog.  eg. if we're calling
       a function with arguments in regs, gcc will insert 'subq $8,%rsp'
       in the prolog, to keep %rsp aligned (the return address is 8
       bytes, remember).  The mangler throws away the prolog, so we
       lose the stack alignment.

       The hack is to add this extra 8 bytes to our %rsp adjustment
       here, so that throughout STG code, %rsp is 16-byte aligned,
       ready for a C-call.  

       A quick way to see if this is wrong is to compile this code:

          main = System.Exit.exitWith ExitSuccess

       And run it with +RTS -sstderr.  The stats code in the RTS, in
       particular statsPrintf(), relies on the stack alignment because
       it saves the %xmm regs on the stack, so it'll fall over if the
       stack isn't aligned, and calling exitWith from Haskell invokes
       shutdownHaskellAndExit using a C call.

       Future gcc releases will almost certainly break this hack...
    */
}

Well, we're warned that something might go wrong here.

When watching $rsp I saw the initial subq instruction but I never saw an addq instruction popping that big chunk of memory off the stack.

The next question is whether these large blocks should be pushed, in which case the underlying error is elsewhere, or if the value 0x4038 subtracted from $rsp is wrong.

comment:17 follow-up: Changed 3 years ago by altaic

The 0x4038 is right (although I'm not sure why it needs to be so large as described in StgRun?.c). It comes from RESERVED_C_STACK_BYTES + 48 + 8 = 16384 + 48 + 8 = 16440 = 0x4038.

comment:18 Changed 3 years ago by gwright

A brief review of gcc assembler syntax shows that 0x4038 bytes are really supposed to be pushed. The constraint line:

	: : "i"(RESERVED_C_STACK_BYTES+48+8 /*stack frame size*/));

puts the value (2048 * 8) + 48 + 8 = 0x4038 bytes into operand %0. So that's correct.

It doesn't seem like there is a bug here at all --- if we do recursive calls of C functions through the ffi, we will eventually run out of stack space. I suppose I knew but didn't really believe that 16 kB were allocated off the stack for each C call.
I'll ask if there is some subtlety here. I don't understand why the test passed earlier on OS X 64 bit, only failing for ghci. 1000 iterations should have almost 16 MB onto the stack, twice the allocated space. Why didn't that fail?

If the fix is to reduce the number of times f is called in the 4038.hs,
I will submit a patch to do that.

comment:19 in reply to: ↑ 17 ; follow-up: Changed 3 years ago by gwright

Replying to altaic:

The 0x4038 is right (although I'm not sure why it needs to be so large as described in StgRun?.c). It comes from RESERVED_C_STACK_BYTES + 48 + 8 = 16384 + 48 + 8 = 16440 = 0x4038.

Yes, we seem to be working on this at the same time.

The mystery is why this ever worked. Test 4038 only failed for ghci when I was doing tests to verify my linker patches. Was there simply more than 8 MB of valid memory in the stack area, or is something else going on?

comment:20 Changed 3 years ago by altaic

I'm certain I also saw 4038 pass the non-ghci tests, but now it fails when doing a manual run of the test. I'm thinking that there's some kind of optimization that somehow gets rid of the nested calls, and we're missing it now because we're using debug builds. Of course, that doesn't explain why the tests pass on linux x86-64.

comment:21 in reply to: ↑ 19 ; follow-up: Changed 3 years ago by altaic

Replying to gwright:

The mystery is why this ever worked.

I may have unraveled part of that. Unless I'm missing something, it seems the testsuite is reporting invalid passes. Running make TEST=4038, 6 tests pass and 1 fails. When I run the passing tests manually (just as they appear in the output of make TEST=4038) from the command line, every one of them segfaults.

comment:22 follow-up: Changed 3 years ago by simonpj

Great debugging work guys -- thank you for your care and persistence.

16kbytes per C call does seem a lot!

Does a stack overflow have to lead to a bus error? Couldn't it lead to "C stack overflow" or something?

Simon

comment:23 in reply to: ↑ 21 Changed 3 years ago by gwright

Replying to altaic:

Replying to gwright:

The mystery is why this ever worked.

I may have unraveled part of that. Unless I'm missing something, it seems the testsuite is reporting invalid passes. Running make TEST=4038, 6 tests pass and 1 fails. When I run the passing tests manually (just as they appear in the output of make TEST=4038) from the command line, every one of them segfaults.

I tried it again on OS X 10.6.6 this morning. Now every 4038 test fails:

OVERALL SUMMARY for test run started at Thu Mar 24 08:32:04 EDT 2011
    2710 total tests, which gave rise to
   10140 test cases, of which
       0 caused framework failures
   10133 were skipped

       0 expected passes
       0 expected failures
       0 unexpected passes
       7 unexpected failures

Unexpected failures:
   4038(normal,optc,hpc,optasm,ghci,threaded1,threaded2)

So it's still a mystery.

comment:24 in reply to: ↑ 22 Changed 3 years ago by gwright

Replying to simonpj:

Great debugging work guys -- thank you for your care and persistence.

16kbytes per C call does seem a lot!

Does a stack overflow have to lead to a bus error? Couldn't it lead to "C stack overflow" or something?

Simon

Plain old C just segfaults when you run off the end of the stack. An 8 MB block is allocated at the top of the virtual address space, and accesses below (top - 8 MB) are usually to unallocated memory, generating a segfault. (The .NET framework's fancier runtime that can check the stack pointer and throw a stack overflow exception.)

It's still strange that the test has sometimes worked, particularly when compiled rather than interpreted. I'm wondering if in addition to the stack space there's additional space allocated at high virtual addresses. Then when we run 4038 we access memory we shouldn't, but since we're not writing much to the stack, the test (and process) end before the resulting memory corruption causes a crash.

I'm scratching my head over this one, but I'm happy that this isn't
another OS X linker bug.

comment:25 Changed 3 years ago by simonmar

16KB is the area we allocate for spilling temporary values during execution of Haskell code. In the future we might be able to use the ordinary Haskell stack, but that's a long way off. It is 16KB because sometimes GHC encounters a particularly large basic block that needs a lot of temporary storage.

So it's still a mystery why sometimes the test doesn't fail. Perhaps the memory below the stack is allocated to another thread, and as Greg suggests we might get away with it because we're not writing to a part of the stack that the other thread is using.

comment:26 Changed 3 years ago by gwright

The attached patch reduces the number of nested calls in test 4038 to 300 and explains why. It is a bit interesting that when I used 400 calls I still had a segfault when running the ghci test. I guess ghci grabs more of the C stack for other purposes.

Changed 3 years ago by gwright

comment:27 Changed 3 years ago by gwright

  • Status changed from new to patch

comment:28 Changed 3 years ago by altaic

Replying to simonmar:

So it's still a mystery why sometimes the test doesn't fail. Perhaps the memory below the stack is allocated to another thread, and as Greg suggests we might get away with it because we're not writing to a part of the stack that the other thread is using.

That would explain why the tests pass consistently in my testsuite but fail when I run them from the shell. Should we open a new ticket for the testsuite script with the goal of being able to better detect bad memory accesses?

comment:29 Changed 3 years ago by igloo

  • Resolution set to fixed
  • Status changed from patch to closed

I've applied the testsuite patch, thanks!

I don't understand what it is the testsuite driver can do better, but if there is something then please do open a ticket.

Note: See TracTickets for help on using tickets.