Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#11487 closed bug (fixed)

stg_ap_pp_fast doesn't pass the argument in the arity=1 case

Reported by: rwbarton Owned by: gmainland
Priority: highest Milestone: 8.0.1
Component: Runtime System Version: 8.0.1-rc1
Keywords: Cc: simonmar, gmainland
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1864
Wiki Page:

Description (last modified by rwbarton)

The body of stg_ap_pp_fast looks like

            arity = TO_W_(StgFunInfoExtra_arity(%GET_FUN_INFO(R1)));
            ASSERT(arity > 0);
            if (arity == 1) {
                Sp_adj(-2);
                W_[Sp+WDS(1)] = R3;
                W_[Sp+WDS(0)] = stg_ap_p_info;
                R1 = R1 + 1;
                jump_SAVE_CCCS(%GET_ENTRY(UNTAG(R1)));
            }
            if (arity == 2) {
                Sp_adj(0);
                R1 = R1 + 2;
                jump %GET_ENTRY(UNTAG(R1)) [R1,R2,R3];
            } else {
                Sp_adj(-3);
                W_[Sp+WDS(2)] = R3;
                W_[Sp+WDS(1)] = R2;
                if (arity < 4) {
                  R1 = R1 + arity;
                }
                BUILD_PAP(2,2,stg_ap_pp_info,FUN);
            }

where

// Jump to target, saving CCCS and restoring it on return
#if defined(PROFILING)
#define jump_SAVE_CCCS(target)                  \
    Sp(-1) = CCCS;                              \
    Sp(-2) = stg_restore_cccs_info;             \
    Sp_adj(-2);                                 \
    jump (target) [R1]
#else
#define jump_SAVE_CCCS(target) jump (target) [R1]
#endif

So in the arity=1 case the jump amounts to

                jump (%GET_ENTRY(UNTAG(R1))) [R1]

R1 is the function to apply, but we don't pass its argument, R2!

Now, possibly by design, the calling convention of stg_ap_pp_fast is such that the first argument to apply is in R2, which is the same register that the function R1 will expect to find its argument in. So if nothing happens to disturb R2 (and possibly this is always the case with the NCG), then everything is fine. However, it's definitely not fine for the LLVM backend, which quite reasonably passes undef for the R2, R3, and R4 arguments when doing the jump. On arm/Android, LLVM decided to clobber r8 (the physical register for R2) in the body of stg_ap_ppp_fast (which has the same problems as stg_ap_pp_fast).

This caused a crash for the following program:

main = print (read "3"::Int)

when built with -debug.

This appears to have been broken by commit f9265dd369b9e269349930012c25e670248f2a09 which changed the argument list for jump_SAVE_CCCS from [*] to [R1].

Change History (25)

comment:1 Changed 3 years ago by rwbarton

Description: modified (diff)

comment:2 Changed 3 years ago by rwbarton

Description: modified (diff)

comment:3 Changed 3 years ago by simonpj

Cc: gmainland added

Sounds bad to me. Geoff, can you comment as the author of the commit Reid identifies?

comment:4 Changed 3 years ago by bgamari

Priority: highhighest

Seems like this should really be highest priority given this has the potential to cause crashes and is working by mere chance at the moment.

comment:5 Changed 3 years ago by rwbarton

The commit in question does produce the correct argument list for the arity=2 case. So this is probably just a matter of going through all the other cases and making sure the argument lists are right. genapply is a bit large though...

comment:6 Changed 3 years ago by gmainland

My recollection is that we were saving too many registers and therefore taking a notable performance hit on LLVM. Now we are apparently saving too few, but only when saving the current cost centre stack :) If nobody else gets to it, I will put a patch on Phab within a week.

comment:7 Changed 3 years ago by simonmar

Good catch! It looks wrong to me. You probably need to pass the list of argument registers to jump_SAVE_CCS and attach them to the final jump.

comment:8 Changed 3 years ago by simonpj

Owner: set to gmainland

comment:9 Changed 3 years ago by gmainland

Perhaps someone can volunteer to at least test? This appears to only manifest on ARM, and I don't have access to an ARM system.

Would validate --slow on ARM catch this bug? From the bug report, it seems that perhaps the answer is "no." That should be fixed.

comment:10 Changed 3 years ago by bgamari

I have an ARM box which I can test with.

comment:11 Changed 3 years ago by gmainland

Great, can you also get a test into the testsuite that will detect this bug?

comment:12 Changed 3 years ago by bgamari

Although sadly it seems I can't reproduce the issue on said machine.

comment:13 Changed 3 years ago by bgamari

I've proposed a testcase in Phab:D1851.

comment:14 Changed 3 years ago by gmainland

I guess it will be difficult for you to test any proposed fix then :)

Reid, can you help?

comment:15 Changed 3 years ago by rwbarton

I think the reason that I only saw this bug with -debug is that the debug version of stg_ap_ppp_fast has extra code to print logging information when some debug flags are activated. This created enough register pressure that LLVM decided to rearrange some registers including r8, the ARM register that holds R2. Because stg_ap_ppp_fast invokes the function it calls with the R2 argument undef, once r8 was clobbered, the invoked function saw the wrong value for R2, and a crash was inevitable.

So here is a plan for testing for this bug, and others like it.

  • Add a flag -fllvm-fill-undef-with-garbage, and pass it when running validate.
  • In Llvm.Types.ppLit, if this flag is enabled, output 0xbbbbbbbb... (of whatever size is needed) rather than undef.

Then all LLVM builds should catch issues like this one.

I can implement this if it sounds reasonable.

comment:16 Changed 3 years ago by gmainland

Sounds reasonable to me.

comment:17 Changed 3 years ago by rwbarton

Architecture: armUnknown/Multiple
Differential Rev(s): Phab:D1858
Operating System: LinuxUnknown/Multiple

comment:18 Changed 3 years ago by Ben Gamari <ben@…>

In d50609e/ghc:

Test for undef bugs in the LLVM backend when validating

In an attempt to catch bugs involving using undef values, replace
undef literals by values likely to cause crashes or test failures.
We do this only when validating since it is a deoptimization.

This depends on D1857 to catch such bugs in the RTS (such as #11487).

Test Plan:
Did a build with
```
BuildFlavour = quick-llvm
SRC_HC_OPTS_STAGE1 = -fllvm-fill-undef-with-garbage
```
The build crashed when running ghc-stage2, as expected.

Reviewers: austin, bgamari

Reviewed By: bgamari

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D1858

comment:19 Changed 3 years ago by gmainland

Differential Rev(s): Phab:D1858Phab:D1864

comment:20 Changed 3 years ago by Geoffrey Mainland <mainland@…>

In 6544f8de/ghc:

Properly track live registers when saving the CCCS.

Summary:
When saving the CCCS, we now correctly track the set of live registers and pass
them to the jump_SAVE_CCCS macro. This is now a variadic macro, but variadic
macros are supported by GCC since 3.0 and by all versions of clang, so this
should not be a problem.

Test Plan:
./validate with the following build options:

```
BuildFlavour = quick-llvm
SRC_HC_OPTS_STAGE1 = -fllvm-fill-undef-with-garbage
```

Reviewers: bgamari, simonmar, austin, rwbarton, simonpj

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D1864

GHC Trac Issues: #11487

comment:21 Changed 3 years ago by Geoffrey Mainland <mainland@…>

In 669cbef0/ghc:

Fix Trac issue #11487.

comment:22 Changed 3 years ago by gmainland

Resolution: fixed
Status: newclosed

comment:23 Changed 3 years ago by simonpj

Thank you Geoff!

comment:24 Changed 3 years ago by bgamari

Status: closedmerge
Note: See TracTickets for help on using tickets.