Opened 23 months ago

Closed 18 months ago

Last modified 17 months ago

#7192 closed bug (fixed)

Bug in -fregs-graph with -fnew-codegen

Reported by: simonmar Owned by: benl
Priority: highest Milestone: 7.8.1
Component: Compiler Version: 7.7
Keywords: Cc: benl, pho@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: #4258 Related Tickets:

Description

This is triggered by running the test dph-diophantine-opt with -fnew-codegen. It segfaults for me on x86_64/Linux. I verified that adding -fno-regs-graph fixes it, as does -fllvm.

This failure is blocking the new code generator switchover. I don't want to break DPH, although perhaps we could disable -fregs-graph temporarily until this bug is fixed.

Here's the bit of code that is miscompiled. I think there are probably several instances of a similar sequence in the module, because I saw the failure happening in several different places.

Cmm code:

sT8r_info:
    {offset
      c1cnV:
          _sT8r::I64 = R1;
          _sT85::I64 = R2;
          _sT82::I64 = R3;
          _sT8m::I64 = R4;
          _sT87::I64 = R5;
          _sT8a::I64 = R6;
          if (Sp - 96 < SpLim) goto c1co3; else goto c1cnZ;
      c1cnZ:
          Hp = Hp + 16;
          if (Hp > HpLim) goto c1co2; else goto c1coR;
      c1co2:
          HpAlloc = 16;
          goto c1co3;
      c1co3:
          R1 = _sT8r::I64;
          I64[Sp - 8] = _sT8a::I64;
          I64[Sp - 16] = _sT87::I64;
          I64[Sp - 24] = _sT8m::I64;
          I64[Sp - 32] = _sT82::I64;
          I64[Sp - 40] = _sT85::I64;
          Sp = Sp - 40;
          call (stg_gc_fun)() args: 48, res: 0, upd: 8;

asm code:

   0x0000000000438f38 <+0>:     mov    %rbx,%r10    # R1 saved in %r10
   0x0000000000438f3b <+3>:     mov    %rdi,0x48(%rsp)
   0x0000000000438f40 <+8>:     mov    %r8,0x40(%rsp)
   0x0000000000438f45 <+13>:    mov    %r9,%rbx     # %rbx clobbered
   0x0000000000438f48 <+16>:    lea    -0x60(%rbp),%rax
   0x0000000000438f4c <+20>:    cmp    %r15,%rax
   0x0000000000438f4f <+23>:    jb     0x438ff5 <sT8r_info+189>
   ...
   0x0000000000438ff5 <+189>:   mov    %rbx,-0x8(%rbp)
   0x0000000000438ff9 <+193>:   mov    %r8,%rax
   0x0000000000438ffc <+196>:   mov    %rax,-0x10(%rbp)
   0x0000000000439000 <+200>:   mov    %rdi,%rax
   0x0000000000439003 <+203>:   mov    %rax,-0x18(%rbp)
   0x0000000000439007 <+207>:   mov    %rsi,-0x20(%rbp)
   0x000000000043900b <+211>:   mov    %r14,-0x28(%rbp)
   0x000000000043900f <+215>:   add    $0xffffffffffffffd8,%rbp
   0x0000000000439013 <+219>:   jmpq   *-0x8(%r13)

The problem is that the contents of %rbx (aka R1) has been lost at the jump. It is supposed to contain the same value it had on entry to the proc.

BTW I made a change recently which might be relevant: the registers R1-R8 can now be used by the register allocator. See f857f0741515b9ebf186beb38fe64448de355817

The linear register allocator is generating much better code here:

    leaq -96(%rbp),%rax
    cmpq %r15,%rax
    jb .Lc12k1
        ...
    movq %r9,-8(%rbp)
    movq %r8,-16(%rbp)
    movq %rdi,-24(%rbp)
    movq %rsi,-32(%rbp)
    movq %r14,-40(%rbp)
    addq $-40,%rbp
    jmp *-8(%r13)

Change History (15)

comment:1 Changed 23 months ago by simonmar

  • Blocking 4258 added
  • Cc benl added
  • Version changed from 7.4.2 to 7.7

comment:2 Changed 22 months ago by marlowsd@…

commit 4f656e89e802f3238c5d6cf1cd9bc420cd7f42ff

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Tue Aug 28 16:10:48 2012 +0100

    disable -fregs-graph (#7192)

 compiler/main/DynFlags.hs |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

comment:3 Changed 22 months ago by benl

  • Owner set to benl

comment:4 Changed 20 months ago by benl

On x86-32 compiling with the NCG with and without -fregs-graph causes a stack overflow in the compiled program, but with -fllvm it works fine. Maybe the register liveness determinator is broken, because this is used by both the linear and graph coloring allocators.

comment:5 Changed 20 months ago by simonmar

@benl: are you saying the dph-diophantine-opt test is broken in master at the moment?

comment:6 Changed 20 months ago by benl

@simonmar: yes. It looks like the NCG (or something) is broken independently from the graph allocator.

When compiling with -fllvm I get the right answer:

desire:diophantine benl$ pwd
/Users/benl/devel/ghc/ghc-head-devel/testsuite/tests/dph/diophantine

desire:diophantine benl$ /Users/benl/devel/ghc/ghc-head-devel/inplace/bin/ghc-stage2 --version
The Glorious Glasgow Haskell Compilation System, version 7.7.20121112

desire:diophantine benl$ /Users/benl/devel/ghc/ghc-head-devel/inplace/bin/ghc-stage2 \
  -fforce-recomp -dcore-lint -dcmm-lint \
  --make -o dph-diophantine-copy-fast Main \
  -O -fno-enable-rewrite-rules -package dph-lifted-copy -fllvm

desire:diophantine benl$ ./dph-diophantine-copy-fast 
(1260,[2,2,1,1,0])
(1260,[2,2,1,1,0])
(1260,fromList<PArray> [2,2,1,1,0])

But with with NCG -fno-regs-graph it gives a different answer:

desire:diophantine benl$ /Users/benl/devel/ghc/ghc-head-devel/inplace/bin/ghc-stage2 \
  -fforce-recomp -dcore-lint -dcmm-lint \
  --make -o dph-diophantine-copy-fast Main \
  -O -fno-enable-rewrite-rules -package dph-lifted-copy -fno-regs-graph

desire:diophantine benl$ ./dph-diophantine-copy-fast 
dph-diophantine-copy-fast: Prelude.minimum: empty list

Compiling in different ways by typing make in that same directory sometimes causes a stack overflow instead of Prelude.minimum: empty list

I looked through the output assembly code but didn't find code code sequence in your initial report. My approach was to compile with -ddump-cmmz-sp -ddump-to-file then grep

comment:7 Changed 20 months ago by benl

... (continued from previous)... the cmm code for the sequence you had. For this I compiled the Haskell code with -Odph to get array fusion and -fregs-graph to turn the graph allocator back on.

desire:diophantine benl$ /Users/benl/devel/ghc/ghc-head-devel/inplace/bin/ghc-stage2 \
  -fforce-recomp -dcore-lint -dcmm-lint --make -o dph-diophantine-copy \
  Main -Odph -fregs-graph -package dph-lifted-copy \
  -ddump-cmmz-sp -ddump-asm -ddump-to-file

desire:diophantine benl$ ./dph-diophantine-copy
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

desire:diophantine benl$ grep -B 8 -A 16 "if (Sp - 96 < SpLim)" \
  DiophantineVect.dump-cmmz-sp > dump-entry.cmm

From this I get about 12 blocks of cmm code that look like the one in your report. I checked the corresponding asm code and didn't see any register allocation problems. A few of the proc entry blocks assign to %rbx, but the original value this register had on entry to the proc is restored before issuing jmp *-8(%r13), which I assume invokes the GC.

However, I do notice that some of the calls to stg_gc_fun in the cmm code have R1 arguments, and some don't.

c1cr6:
      _s12rI::P64 = R6;
      _s12rF::I64 = R5;
      _s12rU::I64 = R4;
      _s12rA::I64 = R3;
      _s12rD::I64 = R2;
      _s12rZ::P64 = R1;
      if (Sp - 96 < SpLim) goto c1crZ; else goto c1cs2;
      ...

c1crZ:
      R1 = _s12rZ::P64;
      I64[Sp - 40] = _s12rD::I64;
      I64[Sp - 32] = _s12rA::I64;
      I64[Sp - 24] = _s12rU::I64;
      I64[Sp - 16] = _s12rF::I64;
      P64[Sp - 8] = _s12rI::P64;
      Sp = Sp - 40;
      call (stg_gc_fun)() args: 48, res: 0, upd: 8;      *** no R1 argument here

But then:

offset
  c1eWQ:
      _s17H9::P64 = R1;
      if (Sp - 96 < SpLim) goto c1eXm; else goto c1eXl;
      ...
  c1eXm:
      R1 = _s17H9::P64;
      call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;      *** got an R1 here

If R1 needs to valid at *every* call to stg_gc_fun, then you need to pass it as an argument or the register liveness determinator will mark it as dead -- and no good will come from that.

        c1crZ:
            	movq %vI_s12rZ,%rbx
                    # born:    %r1
                    # r_dying: %vI_s12rZ
                    # w_dying: %r1                 **** R1 dies here 
                     
            	movq %vI_s12rD,-40(%rbp)
                    # r_dying: %vI_s12rD
                     
            	movq %vI_s12rA,-32(%rbp)
                    # r_dying: %vI_s12rA
                     
            	movq %vI_s12rU,-24(%rbp)
                    # r_dying: %vI_s12rU
                     
            	movq %vI_s12rF,-16(%rbp)
                    # r_dying: %vI_s12rF
                     
            	movq %vI_s12rI,-8(%rbp)
                    # r_dying: %vI_s12rI
                     
            	addq $-40,%rbp
                     
            	jmp *-8(%r13)

If the stg_gc_fun() thing is correct then can you tell me how to find the assembly sequence in your initial report? I can hack on it this week.

comment:8 Changed 20 months ago by benl

Yeah, if the liveness determinator can't see R1 being read in the block that calls stg_gc_fun, then it the allocator isn't obliged to preserve it's value across the jump. I think your original cmm code is malformed.

comment:9 follow-up: Changed 20 months ago by simonmar

Aha! You're absolutely right, it's a bug, sorry about that. I'm validating a fix right now. I don't seem to be able to reproduce the original problem, but I've definitely fixed the missing R1 dependency.

I still think we need to look at the graph-colouring allocator though, because I think it is interacting badly with the code generated by the new code generator. The code I've seen doesn't look great. I'm leaving it turned off for the time being, and I'll make a separate ticket.

If you could verify that you don't see the wrong answers any more after my patch, that would be great. Patch coming shortly...

comment:10 Changed 20 months ago by marlowsd@…

commit 4270d7e7485b124dd153399dfe3f571253dc0d1d

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Tue Nov 13 11:43:09 2012 +0000

    Fix the Slow calling convention (#7192)
    
    The Slow calling convention passes the closure in R1, but we were
    ignoring this and hoping it would work, which it often did.  However,
    this bug seems to have been the cause of #7192, because the
    graph-colouring allocator is more sensitive to having correct liveness
    information on jumps.

 compiler/cmm/CmmCallConv.hs      |    6 +++---
 compiler/cmm/CmmParse.y          |    2 +-
 compiler/cmm/MkGraph.hs          |   32 ++++++++------------------------
 compiler/codeGen/StgCmmBind.hs   |   21 +++++++++++----------
 compiler/codeGen/StgCmmExpr.hs   |    4 ++--
 compiler/codeGen/StgCmmHeap.hs   |   12 ++++--------
 compiler/codeGen/StgCmmLayout.hs |    2 +-
 7 files changed, 30 insertions(+), 49 deletions(-)

comment:11 Changed 19 months ago by PHO

  • Cc pho@… added

comment:12 Changed 18 months ago by simonmar

  • Resolution set to fixed
  • Status changed from new to closed

We think this is fixed after the patch above.

comment:13 in reply to: ↑ 9 ; follow-up: Changed 18 months ago by shelarcy

Replying to simonmar:

I still think we need to look at the graph-colouring allocator though, because I think it is interacting badly with the code generated by the new code generator. The code I've seen doesn't look great. I'm leaving it turned off for the time being, and I'll make a separate ticket.

Where is a separated ticket?

comment:14 in reply to: ↑ 13 ; follow-up: Changed 17 months ago by simonmar

Replying to shelarcy:

Where is a separated ticket?

#7679

comment:15 in reply to: ↑ 14 Changed 17 months ago by shelarcy

Replying to simonmar:

#7679

I see. Thank you.

Note: See TracTickets for help on using tickets.