Opened 3 years ago

Closed 18 months ago

#7684 closed bug (fixed)

cgrun071 segfaults

Reported by: tibbe Owned by:
Priority: normal Milestone: 7.8.1
Component: Compiler Version: 7.7
Keywords: Cc: leroux@…
Operating System: MacOS X Architecture: x86_64 (amd64)
Type of failure: Runtime crash Test Case: cgrun071
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

I ran into this working on something unrelated today:

=====> cgrun071(normal) 171 of 3449 [0, 0, 12]
cd ./codeGen/should_run && '/Users/tibbe/src/ghc/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -o cgrun071 cgrun071.hs    >cgrun071.comp.stderr 2>&1
cd ./codeGen/should_run && ./cgrun071    </dev/null >cgrun071.run.stdout 2>cgrun071.run.stderr
/bin/sh: line 1: 37450 Segmentation fault: 11  ./cgrun071 < /dev/null > cgrun071.run.stdout 2> cgrun071.run.stderr
Wrong exit code (expected 0 , actual 139 )
Stdout:

Stderr:

*** unexpected failure for cgrun071(normal)

Attachments (12)

0001-Fix-7684.-Mask-words-with-bit-size-for-the-fast-popu.patch (1.7 KB) - added by leroux 2 years ago.
test-cgrun071-llvm.txt (737 bytes) - added by leroux 2 years ago.
Test cgrun071 (WAY=llvm) summary
test-cgrun071-normal.txt (1.0 KB) - added by leroux 2 years ago.
Test cgrun071 (WAY=normal) summary
cgrun071.S (74.8 KB) - added by leroux 2 years ago.
cgrun071: -fasm -ddump-asm
cgrun071-rts-asm-diffs.txt (2.4 KB) - added by leroux 2 years ago.
cgrun071 object dump diffs for popcnt8 and popcnt16
0001-Make-argument-types-in-popcnt.c-match-declared-primo.patch (2.4 KB) - added by rwbarton 2 years ago.
untested!
popcnt.c (2.3 KB) - added by tibbe 18 months ago.
test.c (142 bytes) - added by tibbe 18 months ago.
popcnt42.s (6.9 KB) - added by tibbe 18 months ago.
test42.s (850 bytes) - added by tibbe 18 months ago.
popcnt49.s (6.5 KB) - added by tibbe 18 months ago.
test49.s (749 bytes) - added by tibbe 18 months ago.

Download all attachments as: .zip

Change History (72)

comment:1 Changed 3 years ago by simonmar

  • difficulty set to Unknown
  • Milestone set to 7.8.1

As discussed on IRC, this is very likely due to #7383.

comment:2 Changed 3 years ago by tibbe

  • Owner changed from tibbe to simonmar

Assigning to Simon M as he thinks this is a codegen problem.

comment:3 Changed 3 years ago by simonmar

  • Status changed from new to infoneeded

Please try the patch attached to #7383.

comment:4 Changed 3 years ago by igloo

  • Resolution set to fixed
  • Status changed from infoneeded to closed

Fixed by the #7383 patch.

comment:5 Changed 2 years ago by nfrisby

I still see the same error that Johan originally reported. I'm on OS X 10.7.5 with an Core i7.

ghc : ade1ae97ed52c493ec415c1601dace39b64071dd
testsuite : 743cab5865ae0b9820dadc33a692511e0e467b9b

I used validate --slow.

Both 085e8145f63c8f42d8bc19cd3cff52b8cd5b6455 and 229323898b0809047b19b79c181085430cce9850 occur in my git log output.

What info can I provide that would be more helpful?

comment:6 Changed 2 years ago by simonmar

Is this 64-bit? The commit log in the patch mentioned that it might not be quite right for 64 bits. I don't have time to look at it right now, but perhaps someone else could fix it using the 32 bit patch as an example of what needs doing.

comment:7 Changed 2 years ago by nfrisby

Yep, Core i7 is x86-64.

comment:8 Changed 2 years ago by goldfire

  • Owner simonmar deleted
  • Resolution fixed deleted
  • Status changed from closed to new

I also can report failure of this test on MacOS 10.7.5. Reopening the ticket...

comment:9 Changed 2 years ago by leroux

  • Owner set to leroux

I've run into this same problem on OS X 10.8.4 x86_64. I dug into cgrun071 and found that it's a problem with popcnt8. Whenever popcnt8 encounters very "large" numbers, it segaults.

map (\x -> (x, popcnt8 x)) $ iterate (+1000) 1

I ran this expression multiple times expecting it to segfault at the same point, but it did otherwise. Whenever it broke, I tested the integer and it would simply segfault.

comment:10 Changed 2 years ago by leroux

  • Status changed from new to patch
  • Test Case set to cgrun071

comment:11 Changed 2 years ago by leroux

My fix is to simply apply a mask of n-bits to the word being passed to the fast population count function.

After fixing the popcnt8 test, I found that popcnt16 also had the same problem, so I ended up applying the fix to 8, 16 and 32 bit word tests.

test_popCnt8 = test (popcntMask 8 popcnt8) (slowPopcnt . wordMask 8)
test_popCnt16 = test (popcntMask 16 popcnt16) (slowPopcnt . wordMask 16)
test_popCnt32 = test (popcntMask 32 popcnt32) (slowPopcnt . wordMask 32)

comment:12 Changed 2 years ago by leroux

  • Cc leroux@… added

Ignore the patch where it just masks the word that is being passed to the appropriate fast popCount functions. I figured that it'd be getting rid of the point of the test since it "defines" how the functions should behave.I'm now digging into how/what instructions are emitted by the popCount PrimOps.

Also, it'd be nice if someone to clear up why this patch was needed.

comment:13 Changed 2 years ago by leroux

  • Owner leroux deleted
  • Status changed from patch to new

comment:14 Changed 2 years ago by leroux

  • Owner set to leroux

comment:15 follow-up: Changed 2 years ago by simonmar

The native code generator is supposed to emit code to mask out the extra bits. If it isn't doing that, that's a bug. See commit 085e8145f63c8f42d8bc19cd3cff52b8cd5b6455

comment:16 Changed 2 years ago by leroux

I think this bit is causing problems on OS X because I ran into problems with only Word8 and Word16. https://github.com/ghc/ghc/blob/master/compiler/nativeGen/X86/CodeGen.hs#L1693-L1697

Changed 2 years ago by leroux

Test cgrun071 (WAY=llvm) summary

Changed 2 years ago by leroux

Test cgrun071 (WAY=normal) summary

comment:17 Changed 2 years ago by carter

I have a theory: the pop count code was only added after 7.6 release I believe, and may have never been tested on macs.

Xcode 4.6 and earlier provides a very very old gcc, gcc 4.2.

gcc4.2 doesn't support sse4 instructions: http://gcc.gnu.org/onlinedocs/gcc-4.2.4/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

comment:18 Changed 2 years ago by carter

someone who's got a ghc clang build working or using a different flavor of gcc newer than 4.2 on their mac should try and run this test case. (my theory could be completely wrong though)

comment:19 Changed 2 years ago by leroux

The problem has been pinpointed to the fact that apple-gcc42 does not support SSE4 instructions.

After carter suggested that it may be a problem with apple-gcc42, I rebuilt ghc with gcc48 and cgrun071 passed both the normal and llvm ways.

It seems that carter's theory was right in that gcc 4.2 does not have support for SSE4 instructions. In fact, support for SSE4 was added in with the release of gcc43. http://gcc.gnu.org/gcc-4.3/changes.html

It may be that SSE42 support isn't being detected properly by nativeGen. https://github.com/ghc/ghc/blob/master/compiler/nativeGen/X86/CodeGen.hs#L1686-L1689

sse4_2 <- sse4_2Enabled
if sse4_2 then ... else ...

While reproducing this, #8148 may get in the way. carter's helpful fix is to simply symlink to gcc-4.8 as gcc and, add it to your PATH, then build ghc as normal.

$ mkdir ~/bin
$ ln -s `which gcc-4.8` ~/bin/gcc
$ export PATH=~/bin:$PATH
$ # reload your shell and `gcc -v` should show 4.8 or whatever.

Should we switch the fallback to slowPopcnt for this particular case? https://github.com/ghc/testsuite/blob/master/tests/codeGen/should_run/cgrun071.hs#L41-L46

slowPopcnt :: Word -> Word
slowPopcnt x = count' (bitSize x) x 0
  where
    count' 0 _ !acc = acc
    count' n x acc  = count' (n-1) (x `shiftR` 1)
                      (acc + if x .&. 1 == 1 then 1 else 0)
Last edited 2 years ago by leroux (previous) (diff)

comment:20 Changed 2 years ago by tibbe

I'm confused. The test isn't compiled with -msse42 and thus the popcnt instruction should never be omitted (check the output assembly to make sure) so it seems more likely that the failure has to do with narrowing.

Changed 2 years ago by leroux

cgrun071: -fasm -ddump-asm

comment:21 Changed 2 years ago by tibbe

As you can see from the assembly the POPCNT instruction is never used and can't thus be the cause of the segfault. Besides, if I remember correctly incorrect instructions give an "illegal instruction" crash on OS X, not a segfault. The ghc-prim function is called

    call hs_popcnt16

so the issue is most likely to do with that call (e.g. narrowing).

comment:22 Changed 2 years ago by carter

ok, so I think we've narrowed it down to "the code generator version is correct, our fallback isnt"?

so @leroux's patch on the native code generator isn't relevant... its that the fall back is wrong?

is the test using the fallback version (written in C?) or the slowpopcount one in the test itself?

comment:23 Changed 2 years ago by tibbe

is the test using the fallback version (written in C?) or the slowpopcount one in the test itself?

The test is comparing the hs_popcnt* C implementation in ghc-prim with the slowPopcnt in the test itself. I don't see why slowPopcnt would segfault so probably we're generating the wrong call sequence to hs_popcnt* in the native code generator (I assume llvm isn't being used).

comment:24 Changed 2 years ago by tibbe

Maybe the confusion stems from that both the POPCNT implementation and the C fallback implementation have some supporting code in the native code generator, either to emit the instruction or to emit the C call?

Changed 2 years ago by leroux

cgrun071 object dump diffs for popcnt8 and popcnt16

comment:25 Changed 2 years ago by leroux

attachment:cgrun071-rts-asm-diffs.txt

Both cgrun071 binaries are built, dumped, and diff'd as follows.

$ ghc-stage2 -fforce-recomp -dcore-lint -dcmm-lint -dno-debug-output -no-user-package-db -rtsopts -ddump-to-file -ddump-asm -fno-ghci-history -o cgrun071 cgrun071.hs -O -fasm # TEST=cgrun071 WAY=normal EXTRA_HC_OPTS="-ddump-to-file -ddump-asm"
$ otool -tV cgrun071 > cgrun071.S # dump the object code

And then simply

$ diff -y cgrun071-gcc48-passed.S cgrun071-gcc42-failed.S

The left side of the diff is from ghc (built with gcc48, passed) and the right is built with gcc42 (failed, segfault).

Last edited 2 years ago by leroux (previous) (diff)

comment:26 Changed 2 years ago by leroux

The following is using ghc built with gcc42. Hopefully this will provide more insight into finding a fix.

$ '/Users/leroux/Dropbox/src/ghc/ghc-validate/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -dno-debug-output -no-user-package-db -rtsopts -debug -fno-ghci-history -ddump-to-file -ddump-asm -o cgrun071 cgrun071.hs
$ gdb cgrun071
(gdb) run
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
0x000000010027f74b in hs_popcnt8 ()

### The void address is referring to the value of $eax
(gdb) p $eax
$4 = void

### current instruction
(gdb) display/i $pc
1: x/i $pc  0x10027f74b <hs_popcnt8+11>:	movzbl (%rdi,%rax,1),%eax

(gdb) info registers
rax            0x1002ef380	4298044288
rbx            0x101150fc9	4313124809
rcx            0x101101300	4312797952
rdx            0x2	2
rsi            0x1002d7540	4297946432
rdi            0x98b7fa5e6c84f828	-7442204575751931864
rbp            0x7fff5fbfb030	0x7fff5fbfb030
rsp            0x7fff5fbfb030	0x7fff5fbfb030
r8             0x98b7fa5e6c84f828	-7442204575751931864
r9             0x1	1
r10            0x1	1
r11            0x101150a60	4313123424
r12            0x1011541d8	4313137624
r13            0x10035d918	4298496280
r14            0x10114c998	4313106840
r15            0x1011ed0c0	4313764032
rip            0x10027f74b	0x10027f74b <hs_popcnt8+11>
eflags         0x10202	66050
cs             0x2b	43
ss             0x0	0
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
_hs_popcnt8:
000000010027f740	pushq	%rbp
000000010027f741	movq	%rsp, %rbp
000000010027f744	leaq	_popcount_tab(%rip), %rax
000000010027f74b	movzbl	(%rdi,%rax), %eax         <------- 0x10027f74b
000000010027f74f	popq	%rbp
000000010027f750	ret
000000010027f751	nopl	(%rax)
000000010027f758	nopl	(%rax,%rax)

I'm not sure where to go from here, so I'll leave it to people more experienced than myself. If you want to continue giving pointers go right ahead, and I'll pick this back up.

Last edited 2 years ago by leroux (previous) (diff)

comment:27 Changed 2 years ago by leroux

  • Owner leroux deleted

comment:28 Changed 2 years ago by leroux

I've unassigned myself since... (from two comments ago)

I'm not sure where to go from here, so I'll leave it to people more experienced than myself. If you want to continue giving pointers go right ahead, and I'll pick this back up.

comment:29 in reply to: ↑ 15 Changed 2 years ago by rwbarton

I don't understand how the first 8 or so lines of gdb output are consistent with the rest, so I'm going to assume that gdb was temporarily confused. The info registers output looks much more plausible.

Replying to simonmar:

The native code generator is supposed to emit code to mask out the extra bits. If it isn't doing that, that's a bug. See commit 085e8145f63c8f42d8bc19cd3cff52b8cd5b6455

Going by leroux's info registers output, it sure looks like it isn't masking them out, since the argument register rdi is 0x98b7fa5e6c84f828.

From the assembly diff I see that gcc 4.8 outputs code to do the masking in the callee (movzbl %dil, %edi) where gcc 4.2 does not. Oddly the System V x86_64 ABI doesn't seem to specify whether arguments in registers that are smaller than the register size should be masked/sign-extended by the caller, but I gather that we are working under the assumption that they should be (which is consistent with the 386 ABI).

leroux, could you attach the entire cgrun071.S disassembly output? Or if it's very large, at least the disassembly of the function which calls hs_popcnt8?

comment:30 Changed 2 years ago by rwbarton

(According to this discussion with an editor of the System V x86_64 ABI, a callee must NOT assume that its arguments have been sign-/zero-extended to fill their registers. So the gcc 4.2 hs_popcnt8 is wrong, and the gcc 4.8 hs_popcnt8 is right. However, pointing the finger at a widely-installed compiler is not very helpful when we could presumably fix the issue by doing the zero-extension in ghc instead.)

comment:31 Changed 2 years ago by leroux

Here's the disassembly and the backtrace so that you can find the caller.

cgrun071 disassembly (Can't attach it here because it's ~20MB)

(gdb) bt
#0  0x000000010027f74b in hs_popcnt8 ()
#1  0x000000010000475a in c1gz_info ()
Previous frame inner to this frame (gdb could not unwind past this frame)

Hope that helps.

Also, I probably should have included this in my last comment.

(gdb) run
Starting program: /Users/leroux/Dropbox/src/ghc/ghc-validate/testsuite/tests/codeGen/should_run/cgrun071
Reading symbols for shared libraries +++.............................. done
OK

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
0x000000010027f74b in hs_popcnt8 ()

comment:32 Changed 2 years ago by carter

@leroux, could you paste that assembly as a Gist on github? Dropbox is 404ing for everyone, and makes it harder to have a persistent copy for everyone

comment:33 Changed 2 years ago by rwbarton

In the #7383 "promote arguments" patch, shouldn't the new line 2060 read

    let prom_args = map (maybePromoteCArg dflags W64 {- not W32 -}) args

But anyways, I can't see why the code generator would think the argument of the MO_PopCnt W8 operation is an 8-bit quantity—it comes from a Haskell Word#, and the type of the popCnt8# primop is Word# -> Word#. So we shouldn't expect the "promote arguments" patch to do anything here.

Can't we just fix this by changing the definition of hs_popcnt8 from

extern StgWord hs_popcnt8(StgWord8 x);
StgWord
hs_popcnt8(StgWord8 x)
{
  return popcount_tab[(unsigned char)x];
}

to

extern StgWord hs_popcnt8(StgWord x);
StgWord
hs_popcnt8(StgWord x)
{
  return popcount_tab[(unsigned char)x];
}

(and the same for hs_popcnt16, etc.)—then everybody will agree that the argument to hs_popcnt8 is a word rather than a single byte, and gcc 4.2 will be forced to output the same code as gcc 4.8.

Last edited 2 years ago by rwbarton (previous) (diff)

comment:34 Changed 2 years ago by tibbe

I don't want to fix anything without understanding what the problem is first. There's nothing wrong with

extern StgWord hs_popcnt8(StgWord8 x);

so why is the compiler doing the wrong thing?

Aside: if we can one day fix Word8 to actually use a Word8# behind the scene changing the signature of the C function would be wrong.

comment:35 Changed 2 years ago by rwbarton

Well, which compiler is doing the wrong thing?

According to the x86_64 ABI, gcc 4.2 is doing the wrong thing, because when it generates code for hs_popcnt8, it assumes that the high 7 bytes of the argument register %edi are 0. The caller is not supposed to have to guarantee that, and gcc 4.8 does the right thing by masking off the high 7 bytes inside hs_popcnt8.

ghc sees a primop PopCnt8Op of type Word# -> Word#, and given that, it is doing the right thing by taking the whole Word# value extracted from the argument to popcnt8 (in cgrun071) and putting it in %edi before calling hs_popcnt8. I don't think ghc looks at the C prototype of functions that implement primops, does it? It just prepares the arguments for the call in registers or on the stack according to their Cmm sizes, and does a call to the symbol of the right name.

Now, the combination of ghc passing a whole Word# and hs_popcnt8 treating the argument as an unsigned char works out according to the x86_64 ABI since the intent is to compute the popcnt of the lowest byte. It doesn't work out if hs_popcnt8 expects its argument to be zero-extended from 1 byte to 8 bytes like under gcc 4.2.

Changing the declared argument type of hs_popcnt8 to StgWord will make popcnt.c agree with the list of primops. Given that we can modify ghc but not gcc 4.2 this seems like the best solution for now.

If Word8# becomes a thing, we can change the signature of hs_popcnt8 back and also change the declared type of PopCnt8Op to Word8# -> Word#. As I understand things, the key is that those two declarations agree.

comment:36 Changed 2 years ago by leroux

Here's a copy of the disassembly; I've gisted it since it is too large to be uploaded here. https://gist.github.com/leroux/6469840

comment:37 Changed 2 years ago by rwbarton

I'm attaching the patch that I outlined above, but please note that it is completely untested aside from me building ghc-prim and manually inspecting the disassembly for sanity. I'm not on Mac OS, so I can't do the interesting test anyways.

comment:38 Changed 2 years ago by leroux

@rwbarton, your patch worked. (I thought I posted a comment about this yesterday, but I guess I didn't.)

comment:39 Changed 2 years ago by leroux

  • Status changed from new to patch

comment:40 follow-up: Changed 2 years ago by simonmar

The patch is very likely to be papering over the problem rather than fixing it. While the primop does take a Word# (because there's no Word8#), the MachOp MO_PopCnt really does take an 8-bit value when it is generated by the 8-bit version of the primop.

comment:41 in reply to: ↑ 40 Changed 2 years ago by rwbarton

Replying to simonmar:

The patch is very likely to be papering over the problem rather than fixing it. While the primop does take a Word# (because there's no Word8#), the MachOp MO_PopCnt really does take an 8-bit value when it is generated by the 8-bit version of the primop.

How so? Of course we want it to be true, but I don't see any place where GHC is aware of this.

Let's trace through the compilation of a PopCnt8Op primop versus a PopCnt16Op primop. The only place where these are treated differently is in emitPrimOp, which calls emitPopCntCall with an argument of W8 in the former and W16 in the latter case. emitPopCntCall only uses that argument as a field of the MO_PopCnt constructor. It doesn't do any kind of cast or assertion on the size of its argument. In the native code generator, the field of MO_PopCnt is only used to build the name of the C function to call (assuming no SSE). So, it's unsurprising to me that GHC does not, for example, narrow the argument to hs_popcnt8 to 8 bits, but narrow the argument to hs_popcnt16 to 16 bits.

Maybe you think some part of the above is a bug, and prefer to fix the problem there. I didn't want to guess where the bug was, so I proposed this patch to the C implementation to match the calling convention used by the code generator.

Last edited 2 years ago by rwbarton (previous) (diff)

comment:42 follow-ups: Changed 2 years ago by simonmar

Well, one bug is that in emitPopCntCall we don't cast the argument to the correct width. The -dcmm-lint should catch it, because the argument width isn't the same as the primop is expecting, but the Cmm linter doesn't check the arguments of prim calls (that's another bug).

comment:43 Changed 21 months ago by thoughtpolice

  • Status changed from patch to infoneeded

comment:44 in reply to: ↑ 42 Changed 18 months ago by tibbe

Replying to simonmar:

Well, one bug is that in emitPopCntCall we don't cast the argument to the correct width. The -dcmm-lint should catch it, because the argument width isn't the same as the primop is expecting, but the Cmm linter doesn't check the arguments of prim calls (that's another bug).

How would one cast the argument in the emitPopCntCall?

comment:45 Changed 18 months ago by simonmar

With a MO_UU_Conv MachOp.

comment:46 in reply to: ↑ 42 Changed 18 months ago by tibbe

Replying to simonmar:

Well, one bug is that in emitPopCntCall we don't cast the argument to the correct width.

I don't think it should.

The MO_PopCnt MachOp that's emitted by emitPopCntCall also takes the width as an argument. The reason we pass the width all the way down to the native and LLVM code generators is to avoid doing any narrowing, as we can emit a popcnt instruction that looks at just part of the word (e.g. popcnt %ax,%ax).

If we end up not using the the dedicated instruction, and we're using GCC 4.2 which doesn't do the narrowing correctly, then we'd have to do the narrowing ourselves before calling the C fallback.

I think we have two choices in working around the buggy behavior of GCC 4.2:

  1. Change the function prototype as rwbarton suggested.
  2. Do some narrowing in the native code generator before emitting the call to the C function.

comment:47 Changed 18 months ago by tibbe

I've attached the assembly output for popcnt.c (in ghc-prim) and a small C test program, test.c, for GCC 4.2 and GCC 4.9, respectively.

Changed 18 months ago by tibbe

Changed 18 months ago by tibbe

Changed 18 months ago by tibbe

Changed 18 months ago by tibbe

Changed 18 months ago by tibbe

Changed 18 months ago by tibbe

comment:48 Changed 18 months ago by tibbe

Now I'm confused. Both GCC 4.2 (apple-gcc42 from Homebrew) and GCC 4.9 create the correctly zero extended code for hs_popcnt8:

GCC 4.2

_hs_popcnt8:
LFB112:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        movzbl  %dil, %edi
        leaq    _popcount_tab(%rip), %rax
        movzbl  (%rdi,%rax), %eax
        leave
        ret

GCC 4.9:

_hs_popcnt8:
LFB110:
        leaq    _popcount_tab(%rip), %rax
        movzbl  %dil, %edi
        movzbl  (%rax,%rdi), %eax
        ret

Exact GCC versions:

$ gcc-4.2 --version
i686-apple-darwin11-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
...
$ gcc-4.9 --version
gcc-4.9 (GCC) 4.9.0 20140223 (experimental)
...

Perhaps this bug was fixed in some release of GCC 4.2?

I don't know how to install the GCC 4.2 that actually shipped with the old XCode without ruining my current XCode install.

Last edited 18 months ago by tibbe (previous) (diff)

comment:49 Changed 18 months ago by tibbe

I think I've figured it out. I've installed the CLI tools that came with XCode 4.6. Here's the gcc that's included:

$ gcc --version
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

So this is a LLVM-based GCC (i.e. clang). This GCC indeed produces incorrect code for popcnt.c:

_hs_popcnt8:
Leh_func_begin1:
        pushq   %rbp
Ltmp0:
        movq    %rsp, %rbp
Ltmp1:
        leaq    _popcount_tab(%rip), %rax
        movzbl  (%rdi,%rax), %eax
        popq    %rbp
        ret

Note how the argument is not zero extended when it's used to compute and address in _popcount_tab.

comment:50 Changed 18 months ago by carter

gcc-llvm I think is the apple version of dragonegg, which uses the proper GCC frontend, but does the code gen with llvm

comment:51 Changed 18 months ago by tibbe

I chatted a bit with rwbarton about the semantics of these primops. My intention is that all the primops work on all input sizes (much like the underlying assembly instruction) and the difference is just the number of bit we count. The code implements this behavior correctly, if it weren't for LLVM misbehaving. The reason I implemented the primops this way is that I wanted to avoid the narrowing.

This means that even if we had a Word8# type, the primops would still operate on a Word#. Does that make sense?

If so the right fix is to work around the LLVM bug in popcnt.c.

comment:52 Changed 18 months ago by Johan Tibell <johan.tibell@…>

In ad9bf96815cb5a9bb4acc51c99eff20be3e50da3/ghc-prim:

Make argument types in popcnt.c match declared primop types

On 64-bit Mac OS, gcc 4.2 (which comes with Xcode 4.6) generates code
that assumes that an argument that is smaller than the register
it is passed in has been sign- or zero-extended. But ghc thinks
the types of the PopCnt*Op primops are Word# -> Word#, so it passes
the entire argument word to the hs_popcnt* function as though it was
declared to have an argument of type StgWord. Segfaults ensue.

The easiest fix is to sidestep all this zero-extension business
by declaring the hs_popcnt* functions to take a whole StgWord (when their
argument would fit in a register), thereby matching the list of primops.

Fixes #7684.

comment:53 Changed 18 months ago by tibbe

I've applied rwbarton's patch, which means that we implement the semantics I described above, just like I originally intended (and just like the -msse4.2 version behaves).

comment:54 Changed 18 months ago by tibbe

  • Resolution set to fixed
  • Status changed from infoneeded to closed

comment:55 Changed 18 months ago by rwbarton

Any particular reason not to merge to 7.8?

comment:56 Changed 18 months ago by hvr

  • Status changed from closed to merge

I'll mark this for merge (if not 7.8.1 then it might make it into 7.8.2) and let Austin decide

comment:57 follow-up: Changed 18 months ago by simonmar

So I think this fix is ok - it avoids the question of whether the ABI requires the caller to sign/zero-extend or not. My suggestion to cast the argument was based on the type of hs_popcnt8 taking an HsInt8, but I see that wasn't what you intended.

FWIW I think gcc probably should be zero-extending in the caller, not the callee, for an 8-bit argument (i.e. gcc 4.2 is correct, 4.9 is wrong). This is what we assume in GHC. I couldn't see anything about this in the ABI spec, but there's a gcc bug open: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46942

comment:58 in reply to: ↑ 57 Changed 18 months ago by tibbe

Replying to simonmar:

FWIW I think gcc probably should be zero-extending in the caller, not the callee, for an 8-bit argument (i.e. gcc 4.2 is correct, 4.9 is wrong). This is what we assume in GHC. I couldn't see anything about this in the ABI spec, but there's a gcc bug open: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46942

According to the discussion with an editor of the System V x86_64 ABI rwbarton linked to above, it's incorrect for the callee to assume that the register has been zero extended. Accessing bits outside the specified argument size is intentionally undefined behavior (which means that the callee has to zero extend).

comment:59 Changed 18 months ago by simonmar

Ok, that means we can remove the extension behaviour from GHC then. I've created a ticket: #8922.

comment:60 Changed 18 months ago by thoughtpolice

  • Status changed from merge to closed

Merged in 7.8, thanks!

Note: See TracTickets for help on using tickets.