Opened 18 months ago

Last modified 5 weeks ago

#7602 new bug

Threaded RTS performing badly on recent OS X (10.8?)

Reported by: simonmar
Owned by: thoughtpolice
Priority: high
Milestone: 7.8.4
Component: Runtime System
Version:
Keywords: thread-local state, TLS clang
Cc: johan.tibell@…, chak@…, anton.nik@…, george.colpitts@…, simonmar
Operating System: MacOS X
Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug
Difficulty: Unknown
Test Case:
Blocked By: #7678
Blocking:
Related Tickets:


This ticket is to remind us about the following problem: OS X is now using llvm-gcc, and as a result GHC's garbage collector with -threaded is much slower than it should be (approx 30% slower overall runtime). Some results here:

This is because the GC code relies on having fast access to thread-local state. It uses one of two methods: either a register variable (gcc only) or __thread variables (which aren't supported on OS X). To make things work on OS X, we use calls to pthread_getspecific instead (see #5634), which is quite slow, even though it compiles to inline assembly.
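To make the difference between the access paths concrete, here is a minimal sketch (not the RTS code) that contrasts a __thread variable with the pthread_getspecific fallback. The names gc_thread, gct_fast, gct_key and the helper functions are hypothetical, and the register-variable method is left out because it only builds with gcc.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread GC state. */
typedef struct gc_thread { int thread_index; } gc_thread;

/* Method 1: compiler-supported TLS. Where the platform supports it well,
   each access is a plain load relative to a segment register. */
static __thread gc_thread *gct_fast;

/* Method 2: POSIX thread-specific data. Every access goes through a
   library call keyed by a pthread_key_t. */
static pthread_key_t gct_key;
static pthread_once_t gct_key_once = PTHREAD_ONCE_INIT;

static void gct_make_key(void) { pthread_key_create(&gct_key, NULL); }

void set_gct_fast(gc_thread *t) { gct_fast = t; }
gc_thread *get_gct_fast(void) { return gct_fast; }

int set_gct_slow(gc_thread *t) {
  pthread_once(&gct_key_once, gct_make_key);
  return pthread_setspecific(gct_key, t);  /* 0 on success */
}

gc_thread *get_gct_slow(void) {
  pthread_once(&gct_key_once, gct_make_key);
  return pthread_getspecific(gct_key);     /* NULL until set in this thread */
}
```

Both pairs of functions give the same result; the cost per access is what differs, and that cost is what this ticket is about.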

I don't recall which OS X / XCode versions are affected, maybe a Mac expert could fill in the details.

We have tried other fixes, such as passing around the thread-local state as extra arguments, but performance wasn't good. Ideally Apple will implement TLS in OS X at some point and we can start to use it.

A workaround is to install a real gcc (using homebrew?) and use that to compile GHC. Whoever builds the GHC distributions for OS X should probably do it that way, so everyone benefits.

Change History (46)

comment:1 Changed 18 months ago by tibbe

  • Cc johan.tibell@… added

comment:2 Changed 18 months ago by thoughtpolice

The situation is 'Okay' now because Clang/LLVM 3.2 and OS X (as of 10.7 possibly, but 10.8 for certain) support TLS.

As far as I know, we don't use register variables (which may never appear in LLVM) on x86_64 and instead opt for TLS, so it should be much easier to compile the RTS and compiler using only Clang/LLVM now.

We already have code in place in DriverPipeline to run Clang for the assembler and whatnot, so perhaps it wouldn't be that much change to get a Fully-LLVM built GHC on OS X.

I might have time to try this since I've been sorting out llvm 3.2 bugs anyway.

comment:3 Changed 18 months ago by simonmar

Does the support for TLS compile down to calls to pthread_getspecific/pthread_setspecific? If so, that won't help much - we could remove some hacks from the code, but it will still perform badly. Someone with a Mac will need to do some measurements to be sure.

comment:4 Changed 18 months ago by thoughtpolice

I don't think so, or at least it doesn't in my trivial case:

#include <stdio.h>
#include <stdlib.h>

__thread int foo;

int main(int ac, char* av[]) {
  if (ac < 2) foo = 10;
  else foo = atoi(av[1]);

  printf("foo = %d\n", foo);

  return 0;
}

On Mac OS X 10.8, with Clang 3.2, I can compile this with no special options. Disassembling, we see:

$ lldb ./a.out
(lldb) disassemble -m -n main 
a.out`main at tls.c:6
   6   	int main(int ac, char* av[]) {
   7   	  if (ac < 2) foo = 10;
a.out[0x100000eb0]:  pushq  %rbp
a.out[0x100000eb1]:  movq   %rsp, %rbp
a.out[0x100000eb4]:  subq   $48, %rsp
a.out[0x100000eb8]:  movl   $0, -4(%rbp)
a.out[0x100000ebf]:  movl   %edi, -8(%rbp)
a.out[0x100000ec2]:  movq   %rsi, -16(%rbp)
a.out`main + 22 at tls.c:7
   6   	int main(int ac, char* av[]) {
   7   	  if (ac < 2) foo = 10;
   8   	  else foo = atoi(av[1]);
a.out[0x100000ec6]:  cmpl   $2, -8(%rbp)
a.out[0x100000ecd]:  jge    0x100000ee7               ; main + 55 at tls.c:8
a.out[0x100000ed3]:  leaq   326(%rip), %rdi           ; foo
a.out[0x100000eda]:  callq  *(%rdi)
a.out[0x100000edc]:  movl   $10, (%rax)
a.out[0x100000ee2]:  jmpq   0x100000f05               ; main + 85 at tls.c:8
a.out`main + 55 at tls.c:8
   7   	  if (ac < 2) foo = 10;
   8   	  else foo = atoi(av[1]);
a.out[0x100000ee7]:  movq   -16(%rbp), %rax
a.out[0x100000eeb]:  movq   8(%rax), %rdi
a.out[0x100000eef]:  callq  0x100000f36               ; symbol stub for: atoi
a.out[0x100000ef4]:  leaq   293(%rip), %rdi           ; foo
a.out[0x100000efb]:  movl   %eax, -20(%rbp)
a.out[0x100000efe]:  callq  *(%rdi)
a.out[0x100000f00]:  movl   -20(%rbp), %ecx
a.out[0x100000f03]:  movl   %ecx, (%rax)
a.out[0x100000f05]:  leaq   92(%rip), %rdi            ; "foo = %d\n"
a.out`main + 92 at tls.c:10
   10  	  printf("foo = %d\n", foo);
a.out[0x100000f0c]:  movq   %rdi, -32(%rbp)
a.out[0x100000f10]:  leaq   265(%rip), %rdi           ; foo
a.out[0x100000f17]:  callq  *(%rdi)
a.out[0x100000f19]:  movl   (%rax), %esi
a.out[0x100000f1b]:  movq   -32(%rbp), %rdi
a.out[0x100000f1f]:  movb   $0, %al
a.out[0x100000f21]:  callq  0x100000f3c               ; symbol stub for: printf
a.out[0x100000f26]:  movl   $0, %esi
a.out`main + 123 at tls.c:12
   12  	  return 0;
   13  	}
a.out[0x100000f2b]:  movl   %eax, -36(%rbp)
a.out[0x100000f2e]:  movl   %esi, %eax
a.out[0x100000f30]:  addq   $48, %rsp
a.out[0x100000f34]:  popq   %rbp
a.out[0x100000f35]:  ret   

(lldb) ^D

In the original post, David says that we basically get code like:

    call getThreadLocalVar
    movq    (%rdi),%rdi #deref the key which is an index into the tls memory
    jmp <dynamic_linker_stub>
    movq    %gs:0x00000060(,%rdi,8),%rax #pthread_getspecific body

where the biggest penalty is the jump into dyld to do linking for the stub. This code does still exist in the latest implementation of Apple's libc:

(Look at the OPTIMIZE implementation.)

However, Clang on OS X seems to directly avoid this? I'm not sure why the offsets of leaq for foo seem to decrease for every access...

I attempted to look through the LLVM source code for specific notes about this, but the new TLS support is of course deeply ingrained in the new release, so it's hard to point out any one thing about this behavioral change.

I'll investigate this more over the next few days and look at disassembly outputs. We should be able to see pretty quickly whether this buys us anything at all.

We don't use TLS for x86, only register variables, correct? If so, then this still leaves 32bit OS X users up a creek a bit, but Apple and the community are largely moving away from this anyway, it seems.

comment:5 Changed 18 months ago by thoughtpolice

Ah, I misspoke in earnest without looking more deeply. It does look like the callq *(%rdi) is the indirection we're still hitting in some sense, although I'm not sure why it's different from Apple's implementation in libc. With -O2, this is minimized to a simple load and call, but at a glance it's still a significant overhead for the access.

comment:6 Changed 18 months ago by simonmar

Yeah, it looks like they've done some trickery so that instead of calling pthread_getspecific they make an indirect call to the contents of foo to get its per-thread location. This won't be fast enough for us, I'm sure.
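The indirect call can be pictured with a simplified model of what that disassembly suggests. This is an illustration only, not Apple's actual implementation: the tlv_descriptor layout, tlv_get_addr, foo_desc and the block size are all assumptions made for the sketch.

```c
#include <pthread.h>
#include <stdlib.h>

/* Simplified model of a per-variable descriptor. The first word is a
   function pointer, so "load the descriptor address, call through it"
   mirrors the leaq / callq *(%rdi) pair in the disassembly. */
typedef struct tlv_descriptor {
  void *(*thunk)(struct tlv_descriptor *self); /* resolves the address */
  pthread_key_t key;                           /* per-thread storage block */
  size_t offset;                               /* variable's offset in block */
} tlv_descriptor;

enum { TLV_BLOCK_SIZE = 64 };  /* arbitrary size for this sketch */

/* Returns the calling thread's copy of the variable, allocating the
   thread's block on first use. */
static void *tlv_get_addr(tlv_descriptor *d) {
  char *block = pthread_getspecific(d->key);
  if (block == NULL) {
    block = calloc(1, TLV_BLOCK_SIZE);
    if (block == NULL) abort();
    pthread_setspecific(d->key, block);
  }
  return block + d->offset;
}

/* Descriptor for a thread-local int called foo, as in the example. */
static tlv_descriptor foo_desc = { tlv_get_addr, 0, 0 };

int tlv_model_init(void) {
  return pthread_key_create(&foo_desc.key, free);
}

/* What every read or write of foo turns into: one indirect call. */
int *foo_addr(void) {
  return foo_desc.thunk(&foo_desc);
}
```

In this model `foo = 10` becomes `*foo_addr() = 10`, so each access pays for a call even when the block already exists.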

comment:7 Changed 18 months ago by chak

  • Cc chak@… added

Let me just add that llvm-gcc will disappear from Xcode (command line tools) soon — i.e., we will need to use clang.

comment:8 Changed 18 months ago by simonmar

I believe there is no problem with clang, we incorporated patches to make it work a while back. But please report any problems if you find them.

comment:9 Changed 17 months ago by thoughtpolice

Just as a note, I don't think GCC will help anymore; using GCC 4.7.2 on OSX 10.8, compiling my same 'foo' example earlier with gcc -O3, I get:

(lldb) disassemble -m -n main
a.out[0x100000f00]:  cmpl   $1, %edi
a.out[0x100000f03]:  pushq  %rbx
a.out[0x100000f04]:  jle    0x100000f3f               ; main + 63
a.out[0x100000f06]:  movq   8(%rsi), %rdi
a.out[0x100000f0a]:  callq  0x100000f54               ; symbol stub for: atoi
a.out[0x100000f0f]:  leaq   330(%rip), %rdi           ;
a.out[0x100000f16]:  movl   %eax, %ebx
a.out[0x100000f18]:  callq  0x100000f66               ; symbol stub for: __emutls_get_address
a.out[0x100000f1d]:  movl   %ebx, (%rax)
a.out[0x100000f1f]:  leaq   314(%rip), %rdi           ;
a.out[0x100000f26]:  callq  0x100000f66               ; symbol stub for: __emutls_get_address
a.out[0x100000f2b]:  leaq   114(%rip), %rdi           ; "foo = %d\n"
a.out[0x100000f32]:  movl   (%rax), %esi
a.out[0x100000f34]:  xorl   %eax, %eax
a.out[0x100000f36]:  callq  0x100000f60               ; symbol stub for: printf
a.out[0x100000f3b]:  xorl   %eax, %eax
a.out[0x100000f3d]:  popq   %rbx
a.out[0x100000f3e]:  ret    
a.out[0x100000f3f]:  leaq   282(%rip), %rdi           ;
a.out[0x100000f46]:  callq  0x100000f66               ; symbol stub for: __emutls_get_address
a.out[0x100000f4b]:  movl   $10, (%rax)
a.out[0x100000f51]:  jmp    0x100000f1f               ; main + 31

That is, we're still getting the callq for the TLS reference. I'm not sure what else to try.

So I think for now we're going to have to just bite the bullet on this one, and make sure the build is solid with Clang on modern OS X anyway. Maybe we can do something evil here later to recover the loss :/

David Peixotto's original patches made the RTS build with clang at first. I'll run a test against HEAD using Clang and see what I find.

comment:10 Changed 17 months ago by simonmar

gcc supports register variables, which we use instead of __thread (see rts/sm/GCTDecl.h). So we can use gcc until/unless they drop support for register variables.

TLS requires support from the OS, which is why neither gcc nor Clang/LLVM can support it on OS X.

comment:11 Changed 17 months ago by thoughtpolice

Ah, you're very correct. I shouldn't comment on tickets when it's absurdly late at night and I'm really tired...

Apple has advertised OS X as having TLS since Lion I believe. I think this really is TLS support. It's just slower than Linux - on Ubuntu 12.10, the example above moves values directly into %fs:0xfffffffffffffff0. GCC is just weird here because it's calling into libgcc first from my digging (which would maybe mean it's even slower. I haven't tested.)

Modern GCC is at least easily obtainable on OS X, so perhaps this isn't as bad and it's not worth doing something evil. We'll probably need to fix other things, so I'll still look into it.

comment:12 Changed 17 months ago by thoughtpolice

Just as an aside, I may have found a way to recover (most) of the performance relative to David's original post, even with Clang. It's by using a trick JavaScriptCore (WebKit) uses. I'm building now and will see what happens...

comment:13 Changed 17 months ago by thoughtpolice

Alright, I think my patch is almost working, but in the meantime I've verified with a small snippet the behavior I think we want. Simon, can you please tell me if this approach would be OK?

Essentially, there is a small set of predefined TLS keys in the OS X C library for various Apple-internal things. There are about 100 of these special keys. With them, it's possible to use very special inline variants of pthread_getspecific and pthread_setspecific that directly write into an offset block of the %gs register. Performance-wise, this should be very close to Linux's implementation.

One of the users of these keys on modern OS X and its libc is WebKit. pthread has a specific range of keys (5 to be exact) dedicated to WebKit. These are used in JavaScriptCore's FastMalloc allocator for performance-critical sections - likely for their GC! But only a single key is used by WebKit at all (__PTK_FRAMEWORK_JAVASCRIPTCORE_KEY0), and there are 0 references to it elsewhere that I can find on the internet.

You can see this here:

This defines the inline get/set routines for special TLS keys. If you scroll down a little you can see the JavaScriptCore keys (keys 90-94 to be exact.)

Now, look here:

And you can see there's a special stubbed out pthread_getspecific and pthread_setspecific routine for this exact purpose.

Therefore, I propose we steal one of the high TLS keys that are dedicated to WebKit's JS engine for the GC. Unfortunately, pthread_machdep.h is not installed by default in modern variants of XCode, so we must inline the definitions ourselves for the necessary architectures.

The following example demonstrates the use of these special keys:

#include <stdio.h>
#include <stdlib.h>

#include <pthread.h>

/** Snipped from pthread_machdep.h */

__inline__ static void *
_pthread_getspecific_direct(unsigned long slot)
{
  void* ret;
#if defined(__i386__) || defined(__x86_64__)
  __asm__("mov %%gs:%1, %0" : "=r" (ret) : "m" (*(void **)(slot * sizeof(void *))));
#else
#error "No definition of pthread_getspecific_direct!"
#endif
  return ret;
}

/* To be used with static constant keys only */
__inline__ static int
_pthread_setspecific_direct(unsigned long slot, void * val)
{
#if defined(__x86_64__)
  /* PIC is free and cannot be disabled, even with: gcc -mdynamic-no-pic ... */
  __asm__("movq %1,%%gs:%0" : "=m" (*(void **)(slot * sizeof(void *))) : "rn" (val));
#else
#error "No definition of pthread_setspecific_direct!"
#endif
  return 0;
}

/** End snippets */

/* Slot 94 matches the %gs:752 offset in the disassembly below (752 / 8),
   i.e. the last of the JavaScriptCore keys 90-94. */
static const pthread_key_t fooKey = 94;

#define GET_FOO() ((int)(_pthread_getspecific_direct(fooKey)))
#define SET_FOO(to) (_pthread_setspecific_direct(fooKey, to))

int main(int ac, char* av[]) {
  if (ac < 2) SET_FOO((void*)10);
  else SET_FOO((void*)atoi(av[1]));

  printf("foo = %d\n", GET_FOO());

  return 0;
}

This is pretty close to what the GC does now. And compiling:

$ clang -O3 tls2.c
$ lldb ./a.out
Current executable set to './a.out' (x86_64).
(lldb) disassemble -m -n main
a.out[0x100000ef0]:  pushq  %rbp
a.out[0x100000ef1]:  movq   %rsp, %rbp
a.out[0x100000ef4]:  cmpl   $1, %edi
a.out[0x100000ef7]:  jg     0x100000f08               ; main + 24
a.out[0x100000ef9]:  movq   $10, %gs:752
a.out[0x100000f06]:  jmp    0x100000f1d               ; main + 45
a.out[0x100000f08]:  movq   8(%rsi), %rdi
a.out[0x100000f0c]:  callq  0x100000f38               ; symbol stub for: atoi
a.out[0x100000f11]:  movslq %eax, %rax
a.out[0x100000f14]:  movq   %rax, %gs:752
a.out[0x100000f1d]:  movq   %gs:752, %rsi
a.out[0x100000f26]:  leaq   59(%rip), %rdi            ; "foo = %d\n"
a.out[0x100000f2d]:  xorb   %al, %al
a.out[0x100000f2f]:  callq  0x100000f3e               ; symbol stub for: printf
a.out[0x100000f34]:  xorl   %eax, %eax
a.out[0x100000f36]:  popq   %rbp
a.out[0x100000f37]:  ret    
(lldb) r
Process 67488 launched: './a.out' (x86_64)
foo = 10
Process 67488 exited with status = 0 (0x00000000) 
(lldb) ^D

This will probably only work on modern versions of XCode and OS X (10.8 etc.), partly because older libcs implement pthread_setspecific_direct differently: it just falls back to regular ol' pthread_setspecific, which means this could be very wrong on older machines. I'm not sure how much older, so if we had any 10.7 users who could try this that would be awesome. The build system will need modifications to check for that, and fall back to the much slower routines otherwise I suppose.

However, I think this fix is fairly principled and safe, even if it's a little sneaky. There's a chance WebKit could use all of its keys tomorrow, and then WebKit bindings would break in crazy ways, at the least. Not sure what to do about that, but I think the chance is small.

Simon, does this approach sound OK? I think it will recover the performance loss here and we can just go ahead and use Clang, which is the easiest for everybody I think. I will go ahead and set myself as owner and point this to 7.7, because I would like to see this fix in 7.8 if possible.

comment:14 Changed 17 months ago by thoughtpolice

  • Architecture changed from Unknown/Multiple to x86_64 (amd64)
  • Operating System changed from Unknown/Multiple to MacOS X
  • Owner set to thoughtpolice
  • Version changed from 7.6.1 to 7.7

comment:15 Changed 17 months ago by thoughtpolice

  • Blocked By 7678 added

comment:16 Changed 17 months ago by simonmar

Ok, so an inline pthread_getspecific is good, but it's still not ideal, because the compiler can't see the code inside the inline asm. Multiple references to the TLS variable should not cause repeated reads, but with the inline pthread_getspecific, they will.

However, this can be worked around in the code by loading the TLS variable into a local once at the start of the function. I wouldn't object to this, it's less invasive than passing around the TLS variable as a parameter everywhere. But we should peer at the asm before and after to make sure it's doing what we expect.
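A minimal sketch of that workaround, with pthread_getspecific standing in for the opaque inline-asm accessor. The names gc_thread, get_gct and the two counting functions are hypothetical and only illustrate the shape of the change.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-thread GC state. */
typedef struct gc_thread { long copied; } gc_thread;

static pthread_key_t gct_key;

int gct_init(gc_thread *t) {
  int rc = pthread_key_create(&gct_key, NULL);
  return rc != 0 ? rc : pthread_setspecific(gct_key, t);
}

/* The lookup is an external call, so the compiler generally cannot merge
   repeated lookups into one. */
static gc_thread *get_gct(void) { return pthread_getspecific(gct_key); }

/* One TLS lookup per loop iteration. */
void count_copied_naive(const long *sizes, size_t n) {
  for (size_t i = 0; i < n; i++)
    get_gct()->copied += sizes[i];
}

/* One TLS lookup per call: the pointer is loaded into a local up front. */
void count_copied_hoisted(const long *sizes, size_t n) {
  gc_thread *gct = get_gct();
  for (size_t i = 0; i < n; i++)
    gct->copied += sizes[i];
}
```

Both functions compute the same thing; comparing their generated assembly is one way to check that the hoisted version really performs a single lookup.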

Stealing the WebKit-reserved slot would probably work for now, until some library developer has the same idea, and then we have a bizarre bug waiting to happen. I suppose I slightly prefer to use gcc for now, since we know it works, has good performance, and doesn't have this infelicity.

comment:17 Changed 17 months ago by thoughtpolice

OK. Clang is kind of being an issue at the moment so my patch is on hold (see #7678.) Right now my change is localized in GCTDecl.h behind an #ifdef so it's nothing more than a configurable performance optimization on Darwin.

I think first I will just look at the raw differences between:

  • clang, with the slow, non-inline pthread calls.
  • clang, using this change to steal an inline TLS variable
  • gcc 4.7.2, using register variables.

(I don't care about llvm-gcc too much - the GC is already slow whether we use clang or llvm-gcc, and llvm-gcc will be removed anyway.)

Afterwards, we can look at what storing the TLS variable in a local will save us performance-wise, and decide on a default.

comment:18 Changed 13 months ago by chak

Any news here? Did you check out the Xcode 5 DP? It's now clang or nothing.

comment:19 follow-up: Changed 13 months ago by thoughtpolice

Sigh. I haven't seen XCode 5. The fact they EOL'd gcc finally isn't totally surprising.

The summary is that I have a patch for this and it should help the performance loss quite a bit, but the blocker here is #7678 - clang's preprocessor doesn't like the way we use RULES, among other things.

This is work-around-able, but probably not in a very nice way unfortunately.

comment:20 in reply to: ↑ 19 Changed 13 months ago by chak

Replying to thoughtpolice:

Our use of CPP was always a dirty hack and it was only a matter of time before it was going to bite us. Can't we just disable cpp in modules with rules?

comment:21 Changed 13 months ago by chak

Concerning TLS, clang actually supports a choice of different ways of handling TLS since clang 3.2. See under heading "Support for tls_model attribute" (the compiler option is "-ftls-model"). The various models are spec'ed in this document:

comment:22 Changed 13 months ago by thoughtpolice

Well, technically I think that under GCC's -traditional-cpp mode, clang SHOULD respect our code. The problem is I just don't think the "be insensitive to leading whitespace" rule is implemented. :(

I don't think turning one or the other off is an option. Base in particular has a few modules with both CPP and Rules, but if you disable one or the other, things will break. Either the preprocessor directives become a syntax error, or you disable the RULES and they are ignored - but the preprocessor would error on them anyway because it runs first.

There is another problem. We can change GHC and all of its dependent libraries to remove leading whitespace on lines that would be ambiguous to clang. Unfortunately, 3rd party libraries do this kind of formatting too, so this could break user programs for a very bizarre reason.

I think it should be possible to fix this in a transparent way, but it won't be pretty. Basically, we'll need to preemptively strip the leading whitespace from lines where the first non-whitespace character is #. Something like "s/^\s+#(.*)/#$1/" or whatever in regex-ese. In the meantime I think it should be possible to get clang building by reformatting a few of the libraries using RULES improperly, which should not be too much work.
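As a rough sketch of that rewrite applied to a single line, assuming the source has already been split into NUL-terminated lines (strip_ws_before_hash is a hypothetical helper, not something in GHC):

```c
#include <string.h>

/* If the first non-whitespace character of `line` is '#', shift the text
   left so that '#' lands in column 0. Other lines are left untouched.
   Returns 1 if the line was changed, 0 otherwise. */
int strip_ws_before_hash(char *line) {
  size_t i = 0;
  while (line[i] == ' ' || line[i] == '\t')
    i++;
  if (i == 0 || line[i] != '#')
    return 0;
  memmove(line, line + i, strlen(line + i) + 1);  /* include the NUL */
  return 1;
}
```

A line such as "   #-}" would become "#-}", while a line with a '#' later on (for example "  foo # bar") is not modified.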

Finally, the various TLS models make no difference on the example using TLS I posted above. I posit that TLS models are silently ignored on OS X, where Apple is free to do what they wish.

comment:23 Changed 13 months ago by simonmar

  • Milestone changed from _|_ to 7.8.1
  • Priority changed from normal to high

I'd be really happy if someone took ownership of this problem and drove it to a solution. @thoughtpolice?

As I understand it, currently the situation is that you have to install gcc. We can probably make it so that you don't have to install gcc to use GHC, but still need gcc to build GHC, because we don't have access to fast TLS support on OS X (or alternatively you can build GHC with clang, but the GC will be slow.) It seems certain privileged projects do get fast TLS support (WebKit), but it's not officially available for general use.

comment:24 Changed 13 months ago by thoughtpolice

Right, understood. I'll put this on my plate and try to tackle it tonight. Just getting clang to build and recovering the performance I think won't be too hard. The preprocessing hack can happen afterwards, and I can file a separate ticket for it.

comment:25 Changed 13 months ago by carter

I'm going to see if I can reach out to someone I know at Apple to see if the patched clang CPP stuff can be backported to the Xcode 5 clang.

It'd really be a shame if we have to bundle up our own GCC/Clang or require end users to build their own before they can use GHC on Mac!

comment:26 Changed 13 months ago by lelf

  • Cc anton.nik@… added

comment:27 Changed 10 months ago by thoughtpolice

  • Priority changed from high to highest

(NB: All of the Clang annoyingness is fixed now.)

I'm going to go ahead and bump this to highest priority, since it should really be fixed. I do have a patch that works, but it needs to be cleaned up. Luckily I have a better OS X build machine now.

comment:28 Changed 9 months ago by george.colpitts

  • Cc george.colpitts@… simonmar added

comment:29 Changed 9 months ago by ezyang

I must admit, I'm surprised that passing the thread-local state as a variable is not good performance-wise. At least within a function like evacuate, where everything ends up getting inlined into one giant function body, register pressure on x86_64 is low enough that I have seen (experimentally) that GCC is able to arrange for an extra parameter to not get spilled. Perhaps the situation is not so good across function call boundaries, but I don't see why it should necessarily be a problem.

comment:30 Changed 9 months ago by simonmar

If somebody makes a patch to pass around gct explicitly, and it doesn't degrade performance on Linux, then I wouldn't object to putting it in. It would be good to remove our reliance on thread-local state, at the expense of a bit of extra code cruft.
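For illustration, passing the state explicitly might look roughly like the following. This is a sketch under assumptions: gc_thread, account_object and evacuate_all are hypothetical names, not the RTS functions.

```c
#include <stddef.h>

/* Hypothetical per-thread GC state. */
typedef struct gc_thread { long copied; } gc_thread;

/* Leaf routine: receives the GC state as an ordinary argument. */
static void account_object(gc_thread *gct, long size) {
  gct->copied += size;
}

/* The caller threads the same pointer through every call, so no
   thread-local lookup happens anywhere on this path. */
void evacuate_all(gc_thread *gct, const long *sizes, size_t n) {
  for (size_t i = 0; i < n; i++)
    account_object(gct, sizes[i]);
}
```

The trade-off is the extra parameter on every function in the call chain, which is the "code cruft" cost, plus whatever register pressure that parameter adds across call boundaries.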

comment:31 Changed 7 months ago by simonpj

Highest priority is to test what the current situation is; but then we may or may not need to actually fix something.

comment:32 Changed 6 months ago by carter

What's the current status? Do I or someone else with a Mac need to run a benchmark of GHC HEAD built with clang vs GHC HEAD built with gcc?

Mind you, I only have a 2 core machine, so I don't know if that's enough to exercise the issues.

comment:33 follow-up: Changed 6 months ago by thoughtpolice

I investigated Mavericks and talked with Simon about this.

10.9 is problematic. It still emits indirect jumps via a call through %rdi when using __thread, so it can't perform a simple direct load/store to a 'fast' location. You can also use pthread_{get,set}specific, but these are indirect dynamic calls too (and probably perform even worse.) Furthermore I can't find the source code for pthread_getspecific & co, so it's impossible to check if my 10.8 fix still works OK.

We decided to:

  • Make the runtime use __thread on OS X for now. This cleans up the code, and future compiler/runtime versions could optimize their handling of TLS variables.
  • Document the fact OS X will suffer a performance regression with the -threaded garbage collector.

comment:34 in reply to: ↑ 33 Changed 6 months ago by Andrea

Replying to thoughtpolice:

Furthermore I can't find the source code for pthread_getspecific & co, so it's impossible to check if my 10.8 fix still works OK.

The most recent pthread_getspecific source released by Apple was part of Libc-825.40.1. You can find it here:

Following Libc-825.40.1 Apple released the source for Libc-997.1.1 but the pthread_getspecific code is missing (they forgot it:

I'm not an uber expert but TLS seems now to be working very similarly to how it works on other unixes and therefore it's unlikely that there has been any further change to pthread_getspecific. So maybe an inline function (like the _pthread_getspecific_direct in comment 13) can now be used with a key from pthread_key_create without stealing a predefined slot. Recommendations in comment 16 by simonmar would still apply, of course.

Last edited 6 months ago by Andrea

comment:35 Changed 5 months ago by Austin Seipp <austin@…>

In 28b031c506122e28e0230a562a4f6fd3d0256d0c/ghc:

Refactor GCTDecl.h, and mitigate #7602 a bit

This basically cleans a lot of GCTDecl up - I found it quite hard to
read and a bit confusing. The changes are mostly cosmetic: better
delineation between the alternative cases and light touchups, and tries
to make every branch as consistent as possible.

However, this patch does have one significant effect: it will ensure
that any LLVM-based compilers will use __thread if they support it.
Before, they would simply always use pthread_getspecific and
pthread_setspecific, which are almost surely even *more* inefficient.

The details are a bit too long and boring to go into here; see #7602.
After talking with Simon, we decided to play it safe - __thread can at
least be optimized by future clang releases even further on OS X if they
choose, and it's safer until we can investigate the pthread
implementation further on Mavericks.

For Linux, the story isn't so bleak if you use Clang (for whatever
reason) - Linux directly writes to `%fs` for __thread slots (while OS X
will perform a load followed by an indirect call.) So it should still be
fairly competitive, speed-wise.

Signed-off-by: Austin Seipp <>

comment:36 Changed 5 months ago by thoughtpolice

  • Version changed from 7.7 to 7.8.1-rc1

comment:37 Changed 5 months ago by thoughtpolice

  • Milestone changed from 7.8.1 to 7.8.2
  • Version 7.8.1-rc1 deleted

I'm punting this to 7.8.2.

comment:38 Changed 3 months ago by thoughtpolice

  • Milestone changed from 7.8.2 to 7.8.3

comment:39 Changed 3 months ago by George

  • Keywords thread-local state TLS clang added
  • Type of failure changed from None/Unknown to Runtime performance bug

comment:40 Changed 3 months ago by carter

there was some recent chatter on the llvm list about adding gcc style global named register support to clang!msg/llvm-dev/p7GAE_bbDoo/QgJme5LblBkJ

perhaps someone on the ghc team should chime in :)

comment:41 Changed 3 months ago by simonmar

It looks like they're lacking good use cases for register variables. I'm not on that list so it's difficult to jump into the middle of the conversation, perhaps someone else could point them to this ticket and/or our use of register variables in the GC?

comment:42 Changed 3 months ago by ezyang

I went ahead and posted mail; it hasn't shown up in the archives yet.

comment:43 Changed 3 months ago by thoughtpolice

I did not see Edward's reply. I instead posted my own summary for the LLVM people:

comment:44 Changed 3 months ago by thoughtpolice

For those who didn't follow the list:

  • Mark Seaborn, a Chromium/NaCl engineer, stated that they have this exact same problem in Native Client on 64bit OS X. In particular, they came up with the same solution I did essentially: inline the fast path of pthread_{get,set}specific. Their code is more robust than my patch however, as it's actually paranoid enough to check the machine code and bail otherwise:
I think this is probably the best approach at the moment

  • Renato Golin, the proposer of the initial change, updated me to say that their updated proposal does include register variables for general purpose registers (GPRs), but it's only one part of the overall set of changes, which will take more time (their particular motivation now seems to be as a principled approach to defining __builtin_return_address and friends at the moment). However, they understand that proposal doesn't accommodate our design, and would like us to give more feedback as they evolve the work to support true register variables.

So, in the long haul, I think things will get better here. In the meantime, I think this is a signal that we should go ahead and jump on the fast-path-inline change I spoke of in comment 13.
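The "check the machine code and bail otherwise" idea from the Native Client bullet could look roughly like this. It is a sketch under assumptions, not their code: the byte pattern is my own encoding of movq %gs:0x0(,%rdi,8), %rax followed by ret, which matches the slot layout seen in comment 13 (%gs:752 for slot 94), and looks_like_fast_getspecific is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Assumed x86_64 encoding of
     movq %gs:0x0(,%rdi,8), %rax
     ret
   i.e. "return slot number %rdi of the %gs-based TLS array". */
static const unsigned char fast_getspecific_code[] = {
  0x65, 0x48, 0x8b, 0x04, 0xfd, 0x00, 0x00, 0x00, 0x00,  /* movq */
  0xc3                                                   /* ret  */
};

/* Returns 1 if the first bytes at `code` match the expected fast path,
   0 if they differ or if fewer than the required bytes are available. */
int looks_like_fast_getspecific(const unsigned char *code, size_t len) {
  if (len < sizeof fast_getspecific_code)
    return 0;
  return memcmp(code, fast_getspecific_code,
                sizeof fast_getspecific_code) == 0;
}
```

A caller would presumably run this once at startup against the bytes of the system's pthread_getspecific and fall back to the ordinary pthread calls whenever the check fails.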

comment:45 Changed 2 months ago by thoughtpolice

  • Priority changed from highest to high

I have a patch incoming Real Soon for this. Yay!

comment:46 Changed 5 weeks ago by thoughtpolice

  • Milestone changed from 7.8.3 to 7.8.4

Moving to 7.8.4.
