Opened 3 years ago

Last modified 2 years ago

#9481 new bug

Linker does not correctly resolve symbols in previously loaded objects

Reported by: edsko Owned by:
Priority: normal Milestone:
Component: Runtime System (Linker) Version: 7.8.2
Keywords: Cc: simonmar
Operating System: MacOS X Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Summary: Given two object files (created from C files) A and B, where B refers to symbols defined in A, if we load B before A, calling resolveObjs after each object file, symbol resolution goes wrong. Full test case attached.

Detailed description: Consider

// a.c
#include <stdio.h>

void defined_in_A() {
  printf("In A\n");
}
// b.c
#include <stdio.h>

void defined_in_A();

void defined_in_B() {
  printf("In B\n");
  defined_in_A();
  printf("In B\n");
}

and

{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import System.IO

foreign import ccall "defined_in_B" defined_in_B :: IO ()

main :: IO ()
main = defined_in_B

and we use the GHC API to load the object files a.o and b.o (corresponding to a.c and b.c), calling resolveObjs after loading each object, and then load Main and call Main.main, everything works fine (the attached test case contains both a.c, b.c and Main.hs as well as the test program that calls into the GHC API).

However, if we load b.o before a.o, things go wrong. When we load b.o and call resolveObjs then resolveObjs (quite reasonably) complains that

lookupSymbol failed in relocateSection (relocate external)
b.o: unknown symbol `_defined_in_A'

since we haven't loaded a.o yet. But when we then load a.o and call resolveObjs again, resolveObjs reports okay, but something goes wrong because the program subsequently segfaults.

A successful run

I took a closer look at what happens exactly. First, let's consider the test run where we load A before B, and everything works fine, and let's look at the precise code that we actually execute when Main.main does the foreign call to defined_in_B:

# lldb Linkerbug
Current executable set to 'Linkerbug' (x86_64).

(lldb) breakpoint set -n ffi_call
Breakpoint 1: where = Linkerbug`ffi_call + 29 at ffi64.c:421, address = 0x0000000102be2eed

(lldb) run
Process 92361 launched: 'Linkerbug' (x86_64)
Loading object "a.o"
  symbol resolution ok
Loading object "b.o"
  symbol resolution ok
Loading Haskell module "Main.hs"
  ok
Running "Main.main"
Process 92361 stopped
* thread #1: tid = 0x32048b, 0x0000000102be2eed Linkerbug`ffi_call(cif=0x000000010cc68500, fn=0x0000000104c6c210, rvalue=0x00007fff5fbff920, avalue=0x00007fff5fbff8c0) + 29 at ffi64.c:421, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000102be2eed Linkerbug`ffi_call(cif=0x000000010cc68500, fn=0x0000000104c6c210, rvalue=0x00007fff5fbff920, avalue=0x00007fff5fbff8c0) + 29 at ffi64.c:421

(lldb) disassemble -c 14 -s fn
   0x104c6c210:  pushq  %rbp
   0x104c6c211:  movq   %rsp, %rbp
   0x104c6c214:  pushq  %rbx
   0x104c6c215:  pushq  %rax
   0x104c6c216:  leaq   0x1d(%rip), %rbx
   0x104c6c21d:  movq   %rbx, %rdi
   0x104c6c220:  callq  0x104c6c3c8
   0x104c6c225:  xorl   %eax, %eax
   0x104c6c227:  callq  0x104c6b210
   0x104c6c22c:  movq   %rbx, %rdi
   0x104c6c22f:  addq   $0x8, %rsp
   0x104c6c233:  popq   %rbx
   0x104c6c234:  popq   %rbp
   0x104c6c235:  jmpq   0x104c6c3c8

This disassembly is the compiled code for B which, in order, does a call to printf, then to defined_in_A, and then back to printf (the last call is jmpq rather than callq: the C compiler did a tail call optimization). Let's make sure that this is actually true; the first call is a call to

(lldb) disassemble -c 2 -s 0x104c6c3c8
   0x104c6c3c8:  jmpq   *-0xe(%rip)
   0x104c6c3ce:  addb   %al, (%rax)

which is an indirect jump to

(lldb) memory read -f A -c 1 "0x104c6c3ce - 0xe"
0x104c6c3c0: 0x00007fff84fd5b9b libsystem_c.dylib`puts

to puts, okay, good. Seems an unnecessary level of indirection, but that's okay. The next call is to

(lldb) disassemble -c 5 -s 0x104c6b210
   0x104c6b210:  pushq  %rbp
   0x104c6b211:  movq   %rsp, %rbp
   0x104c6b214:  leaq   0x6(%rip), %rdi
   0x104c6b21b:  popq   %rbp
   0x104c6b21c:  jmpq   0x104c6b370

which is the compiled code for defined_in_A, which should be just a call to printf. Let's confirm:

(lldb) disassemble -c 2 -s 0x104c6b370
   0x104c6b370:  jmpq   *-0xe(%rip)
   0x104c6b376:  addb   %al, (%rax)

(lldb) memory read -f A -c 1 "0x104c6b376 - 0xe"
0x104c6b368: 0x00007fff84fd5b9b libsystem_c.dylib`puts

Ok, all good.

A failed run

Now let's check what happens when we load the objects in the opposite order. The code that we execute is

(lldb) disassemble -c 14 -s fn
   0x104c6b210:  pushq  %rbp
   0x104c6b211:  movq   %rsp, %rbp
   0x104c6b214:  pushq  %rbx
   0x104c6b215:  pushq  %rax
   0x104c6b216:  leaq   0x1d(%rip), %rbx
   0x104c6b21d:  movq   %rbx, %rdi
   0x104c6b220:  callq  0x104c6b3c8
   0x104c6b225:  xorl   %eax, %eax
   0x104c6b227:  callq  0x104c6c210
   0x104c6b22c:  movq   %rbx, %rdi
   0x104c6b22f:  addq   $0x8, %rsp
   0x104c6b233:  popq   %rbx
   0x104c6b234:  popq   %rbp
   0x104c6b235:  jmpq   0x104c6b556

This immediately looks suspicious: the address of the jmpq (the second -- tail -- call to printf) is different from the first. The first callq is okay:

(lldb) disassemble -c 2 -s 0x104c6b3c8
   0x104c6b3c8:  jmpq   *-0xe(%rip)
   0x104c6b3ce:  addb   %al, (%rax)

(lldb) memory read -f A -c 1 "0x104c6b3ce - 0xe"
0x104c6b3c0: 0x00007fff84fd5b9b libsystem_c.dylib`puts

and the call to defined_in_A, as well as the code for defined_in_A, are both ok:

(lldb) disassemble -c 5 -s 0x104c6c210
   0x104c6c210:  pushq  %rbp
   0x104c6c211:  movq   %rsp, %rbp
   0x104c6c214:  leaq   0x6(%rip), %rdi
   0x104c6c21b:  popq   %rbp
   0x104c6c21c:  jmpq   0x104c6c370

(lldb) disassemble -c 2 -s 0x104c6c370
   0x104c6c370:  jmpq   *-0xe(%rip)
   0x104c6c376:  addb   %al, (%rax)

(lldb) memory read -f A -c 1 "0x104c6c376 - 0xe"
0x104c6c368: 0x00007fff84fd5b9b libsystem_c.dylib`puts

However, that second call to printf (the jmpq) in defined_in_B is indeed wrong: it's jumping into nowhere:

(lldb) memory read -c 8 0x104c6b556
0x104c6b556: 00 00 00 00 00 00 00 00                          ........

Note that somewhat surprisingly is it not the resolution of the symbol from a.o that is wrong; that does get resolved okay once we load a.o; rather, it is the jump to puts which is wrong (and then only the one of them).

Attachments (5)

Linkerbug.hs (1.9 KB) - added by edsko 3 years ago.
The main test
a.c (64 bytes) - added by edsko 3 years ago.
The C source for object file A
b.c (124 bytes) - added by edsko 3 years ago.
The C source for object file B
Main.hs (172 bytes) - added by edsko 3 years ago.
The Haskell module that calls into B
Makefile (183 bytes) - added by edsko 3 years ago.

Download all attachments as: .zip

Change History (10)

Changed 3 years ago by edsko

Attachment: Linkerbug.hs added

The main test

Changed 3 years ago by edsko

Attachment: a.c added

The C source for object file A

Changed 3 years ago by edsko

Attachment: b.c added

The C source for object file B

Changed 3 years ago by edsko

Attachment: Main.hs added

The Haskell module that calls into B

Changed 3 years ago by edsko

Attachment: Makefile added

comment:1 Changed 3 years ago by rwbarton

I should be able to reproduce by downloading all the attachments and running make followed by Linkerbug right? In that case it seems to work for me (GHC 7.8.3, Linux x86_64):

rwbarton@morphism:/tmp/linkerbug$ make
ghc -c -O a.c
ghc -c -O b.c
ghc -package ghc Linkerbug
[1 of 1] Compiling Main             ( Linkerbug.hs, Linkerbug.o )
Linking Linkerbug ...
rwbarton@morphism:/tmp/linkerbug$ ./Linkerbug 
Loading object "b.o"
Linkerbug: b.o: unknown symbol `defined_in_A'
  symbol resolution failed
Loading object "a.o"
  symbol resolution ok
Loading Haskell module "Main.hs"
  ok
Running "Main.main"
In B
In A
In B
  ok

I assume you are on OS X?

comment:2 Changed 3 years ago by edsko

Yup, I'm on OSX. For me this gives:

# make
ghc -c -O a.c
ghc -c -O b.c
ghc -package ghc Linkerbug
[1 of 1] Compiling Main             ( Linkerbug.hs, Linkerbug.o )
Linking Linkerbug ...

# ./Linkerbug 
Loading object "b.o"
Linkerbug: 
lookupSymbol failed in relocateSection (relocate external)
b.o: unknown symbol `_defined_in_A'
  symbol resolution failed
Loading object "a.o"
  symbol resolution ok
Loading Haskell module "Main.hs"
  ok
Running "Main.main"
In B
In A
Segmentation fault: 11

(with GHC 7.4.2, 7.6.3, 7.8.2, 7.8.3; OSX 64 bit).

comment:3 Changed 3 years ago by rwbarton

Operating System: Unknown/MultipleMacOS X

comment:4 Changed 3 years ago by simonmar

Works for me on Linux x86_64... somewhat surprisingly, since after resolveObjs has already failed once it seems likely that things might be left in a partially-linked state, and repeating resolveObjs might not work. I think after resolveObjs has failed you probably shouldn't use the linker any more and should report the error and exit, which is what GHCi does. What is the use case for retrying resolveObjs after it has failed?

comment:5 Changed 2 years ago by ezyang

Component: CompilerRuntime System (Linker)
Note: See TracTickets for help on using tickets.