Opened 7 months ago

Closed 6 months ago

#8256 closed task (fixed)

adding locality levels to prefetch# and friends

Reported by: carter Owned by:
Priority: normal Milestone: 7.8.1
Component: Compiler Version: 7.7
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets: #8252 ,#8107

Description

currently in HEAD / 7.7, the prefetch primop only does the equivalent of builtin_prefetch(ptr,0,3)
(0 here denoting a read prefetch, 3 denoting "high locality, keep this in cache for a while")

On modern hardware, we can do better! For many natural use case of prefetch in haskell, we have a streaming workload, so we want a prefetch to also hint that "once we've worked on a piece of memory, no need to keep it around for any period of time, don't pollute our memory with it"

the attached patch takes each prefetchBlah# operation (where blah=ByteArray?,MutableByteArray?, or Addr) and replaces it with prefetchBlah0# through prefetchBlah3#, and passes the integer information through into the code generators.

To make the engineering reasonable, I enriched MO_Prefetch_Data with an Int parameter (which must be a value between ranging 0-3). Theres probably a better way to model the locality paramter, but that maps directly to how its used in llvm code gen.

This patch does not include a test case yet. (interestingly, currently the test suite doesn't have any tests for the prefetch primops as yet!). So that needs to be added.

Also, theres no good reason for the prefetch ops to be LLVM only.

worst case we could just treat them as noop's and drop them. But at least for x86, should be easy to add the support (though if anyone wants to add support for the ppc and sparc stuff, or help me do that, that'd be awesome too!) . I"ll look into doing that in a few days.

theres probably some other things i'm overlooking.

anyways, i'll attach a preliminary patch for feedback now

Attachments (10)

patch-prefetch1.diff (7.3 KB) - added by carter 7 months ago.
initial prefetch patch
patch-prefetch-squash1.diff (15.1 KB) - added by carter 7 months ago.
updated diff / patch
0001-prefetch-patch-initial-work.patch (14.4 KB) - added by carter 7 months ago.
apparently i don't know how to generate patches, heres one!
patch-sept17-1am.patch (20.3 KB) - added by carter 7 months ago.
updated patch, sept 17, 1am
testsuite-prefetch-patch.patch (1.7 KB) - added by carter 7 months ago.
testsuite patch sept 17 4pm
validate sept 17th 5pm.txt (14.1 KB) - added by carter 7 months ago.
make test outcomes for my current patches
prefetchNative-MoreBugs.patch (19.2 KB) - added by carter 7 months ago.
patch as of sept 19
0001-prefetch-in-the-native-code-gen-plus-locality-levels.patch (19.2 KB) - added by carter 7 months ago.
updated patch, sept 19, 12:45am
0001-prefetch-in-the-native-code-gen-plus-locality-levels.2.patch (19.2 KB) - added by carter 7 months ago.
sept 19 , 2:16am patch
0001-adding-support-for-prefetch-with-locality-levels-0-3.patch (19.3 KB) - added by carter 7 months ago.
final patch for prefetch! sept 19 3:28pm

Download all attachments as: .zip

Change History (60)

Changed 7 months ago by carter

initial prefetch patch

comment:1 Changed 7 months ago by carter

oh and part of the patch eliminates some preexisting prefetch dead code that Reid Barton noticed

comment:2 follow-up: Changed 7 months ago by carter

huh, the patch accidentally includes an edit for a commit i dind't do, i'll figure out fixing that tomorrow

comment:3 Changed 7 months ago by carter

note that i view the number suffix approach in the API as a stop gap until tickets #8252 ,#8107 are addressed

comment:4 Changed 7 months ago by carter

relatedly: in support of this improvment, i'd like to add nontemporal reads and writes once geoffs updated simd lands

comment:5 in reply to: ↑ 2 Changed 7 months ago by rwbarton

Replying to carter:

huh, the patch accidentally includes an edit for a commit i dind't do, i'll figure out fixing that tomorrow

I think you are also missing a patch that introduces the new primops and the parameter to MO_Prefetch_Data. Could you squash it with the commit "some other fixups for adding prefetch i overlooked."?

Changed 7 months ago by carter

updated diff / patch

comment:6 Changed 7 months ago by carter

Oops! fixed.

theres a few more things (such as NOOP fallbacks on -fasm, plus perhaps actual code gen support on x86/x86_64 -fasm) but I"d like to get some initial feedback before I add that too!

comment:7 Changed 7 months ago by carter

might also want to add streaming and (un)aligned simd loads and stores, I'll see about getting geoff to fold that into his work

Changed 7 months ago by carter

apparently i don't know how to generate patches, heres one!

comment:8 Changed 7 months ago by carter

one thing i'm worried i'm overlooking, is how do I make sure cmm-lint knows about prefetch op types?

(or is that a non issue)

Last edited 7 months ago by carter (previous) (diff)

comment:9 Changed 7 months ago by carter

  • Status changed from new to patch

there more i want to do, but i'd like some feedback on this basic patch first (i should also cook up that test example)

comment:10 Changed 7 months ago by tibbe

Some comments on the patch:

  • It needs to work on all backends. It's fine if it's a no-op on some backends.
  • There needs to be a basic should_run test somewhere.
  • The primops need a better suffix that 0-3 I think. Do the levels have descriptive names whose abbreviations we could use?
  • The primops need docs.

comment:11 Changed 7 months ago by thoughtpolice

This patch looks to be mostly in pretty good shape, so I think it can go in with a little more work.

In regards to point 1 from @tibbe, IMO it's perfectly reasonable to just have the code generator emit an empty piece of code for the other backends. prefetch can't change the semantics, only the performance - so simply emitting a no-op in the other backends is a perfectly legitimate thing to do, and easy. I'm not aware of any hardware we support where prefetching isn't strictly an optimization.

As for names? I don't really mind that much. The usage of the 0-3 constants for locality argument comes from GCC, where __builtin_prefetch has a locality argument ranging from 0 to 3. There is no real mnemonic here: 3 means it should be left in cache if possible as it has high locality, 0 means it can be evicted after access (low locality,) and 1-2 are just "somewhere inbetween 0 and 3." I imagine LLVM inherited this convention from GCC, and I'm not sure what else we could sensibly call them.

That said, we do need a test, and the primops need docs. There are also a lot of typos and random comments laying around which would be nice to get cleaned up (mostly punctuation, stuff like that.)

Version 0, edited 7 months ago by thoughtpolice (next)

comment:12 Changed 7 months ago by rwbarton

Confusingly, GCC's locality levels 0, 1, 2, 3 correspond to Intel instructions prefetchnta, prefetcht2, prefetcht1, prefetcht0 respectively, but there's not much we can do about that (except document it clearly, I guess).

Changed 7 months ago by carter

updated patch, sept 17, 1am

comment:13 Changed 7 months ago by carter

I still need to add the test cases, and update the documentation

comment:14 Changed 7 months ago by carter

also i've a new patch that will fix the typoes, will add that + test + the amended documentation tomorrow

comment:15 Changed 7 months ago by carter

adding patch for testuite now

Changed 7 months ago by carter

testsuite patch sept 17 4pm

Changed 7 months ago by carter

make test outcomes for my current patches

Changed 7 months ago by carter

patch as of sept 19

comment:16 Changed 7 months ago by carter

the current patch is correct except that i'm not pretty printing the prefetch asm correctly.

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:25:0:
    suffix or operands invalid for `prefetchnta'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:48:0:
    suffix or operands invalid for `prefetchnta'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:70:0:
    suffix or operands invalid for `prefetchnta'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:93:0:
    no such instruction: `prefetch2 %rax'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:116:0:
    no such instruction: `prefetch2 %rax'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:138:0:
    no such instruction: `prefetch2 %rax'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:161:0:
    no such instruction: `prefetch1 %rax'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:184:0:
    no such instruction: `prefetch1 %rax'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:206:0:
    no such instruction: `prefetch1 %rax'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:229:0:
    no such instruction: `prefetch0 %rax'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc20629_0/ghc20629_2.s:252:0:
    no such instruction: `prefetch0 %rax'

comment:17 Changed 7 months ago by carter

I wrote some C code to see what the code gen in clang does for that asm

// void __builtin_prefetch (const void *addr, ...)


void prefetchnta(const void *addr){
    
     __builtin_prefetch(addr, 0,0);
}

void prefetch2(const void *addr){
    
     __builtin_prefetch(addr, 0,1);
}

void prefetch1(const void *addr){
    
     __builtin_prefetch(addr, 0,2);
}


void prefetch0(const void *addr){
    
     __builtin_prefetch(addr, 0,3);
}
}}


the asm generated is 

{{{

	.section	__TEXT,__text,regular,pure_instructions
	.globl	_prefetchnta
	.align	4, 0x90
_prefetchnta:                           ## @prefetchnta
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp2:
	.cfi_def_cfa_offset 16
Ltmp3:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp4:
	.cfi_def_cfa_register %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rdi
	prefetchnta	(%rdi)
	popq	%rbp
	ret
	.cfi_endproc

	.globl	_prefetch2
	.align	4, 0x90
_prefetch2:                             ## @prefetch2
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp7:
	.cfi_def_cfa_offset 16
Ltmp8:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp9:
	.cfi_def_cfa_register %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rdi
	prefetcht2	(%rdi)
	popq	%rbp
	ret
	.cfi_endproc

	.globl	_prefetch1
	.align	4, 0x90
_prefetch1:                             ## @prefetch1
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp12:
	.cfi_def_cfa_offset 16
Ltmp13:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp14:
	.cfi_def_cfa_register %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rdi
	prefetcht1	(%rdi)
	popq	%rbp
	ret
	.cfi_endproc

	.globl	_prefetch0
	.align	4, 0x90
_prefetch0:                             ## @prefetch0
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp17:
	.cfi_def_cfa_offset 16
Ltmp18:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp19:
	.cfi_def_cfa_register %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rdi
	prefetcht0	(%rdi)
	popq	%rbp
	ret
	.cfi_endproc


.subsections_via_symbols


}}}

comment:18 Changed 7 months ago by carter

so do i need to parenthesize the registers somehow? How do i express this in the pretty printing?

Changed 7 months ago by carter

updated patch, sept 19, 12:45am

comment:19 Changed 7 months ago by carter

it looks like I should use AddrMode? instead of general Operands on the code gen, though i'm a bit confused on that front.

comment:20 Changed 7 months ago by carter

using some suggestions form Shachaf and others, i fixed up how i was pretty printing, but i'm still getting errors from gcc

"inplace/bin/ghc-stage1" -optc-Ilibraries/haskeline/includes -optc-DUSE_GHC_ENCODINGS -optc-DTERMINFO -optc-I'/Users/carter/Desktop/repoScratcher/ghc/libraries/directory/include' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/libraries/unix/include' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/libraries/time/include' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/libraries/containers/include' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/libraries/bytestring/include' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/libraries/base/include' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/rts/dist/build' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/includes' -optc-I'/Users/carter/Desktop/repoScratcher/ghc/includes/dist-derivedconstants/header' -static  -H64m -O0 -fasm    -package-name haskeline-0.7.0.4 -hide-all-packages -i -ilibraries/haskeline/. -ilibraries/haskeline/dist-install/build -ilibraries/haskeline/dist-install/build/autogen -Ilibraries/haskeline/dist-install/build -Ilibraries/haskeline/dist-install/build/autogen -Ilibraries/haskeline/includes   -optP-DUSE_GHC_ENCODINGS -optP-DTERMINFO -optP-include -optPlibraries/haskeline/dist-install/build/autogen/cabal_macros.h -package base-4.7.0.0 -package bytestring-0.10.3.0 -package containers-0.5.3.1 -package directory-1.2.0.1 -package filepath-1.3.0.2 -package terminfo-0.3.2.5 -package transformers-0.3.0.0 -package unix-2.7.0.0 -Wall -XHaskell98 -XForeignFunctionInterface -XRank2Types -XFlexibleInstances -XTypeSynonymInstances -XFlexibleContexts -XExistentialQuantification -XScopedTypeVariables -XGeneralizedNewtypeDeriving -XMultiParamTypeClasses -XOverlappingInstances -XUndecidableInstances -XCPP -XDeriveDataTypeable -XPatternGuards -O -fasm  -no-user-package-db -rtsopts      -c libraries/haskeline/cbits/h_wcwidth.c -o libraries/haskeline/dist-install/build/cbits/h_wcwidth.o

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:93:0:
    no such instruction: `prefetch2 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:116:0:
    no such instruction: `prefetch2 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:138:0:
    no such instruction: `prefetch2 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:161:0:
    no such instruction: `prefetch1 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:184:0:
    no such instruction: `prefetch1 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:206:0:
    no such instruction: `prefetch1 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:229:0:
    no such instruction: `prefetch0 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:252:0:
    no such instruction: `prefetch0 (%rax)'

/var/folders/py/wgp_hj9d2rl3cx48yym_ynj00000gn/T/ghc14328_0/ghc14328_2.s:274:0:
    no such instruction: `prefetch0 (%rax)'

Changed 7 months ago by carter

sept 19 , 2:16am patch

comment:21 Changed 7 months ago by carter

huh, intel cant spell

prfetcht vs prefetch!

i'll test this change tomorrow

comment:22 Changed 7 months ago by carter

I had a funny problem with validate

== Start post-build package check
Timestamp 2013-09-19 17:26:28 UTC for /Users/carter/Desktop/repoScratcher/ghc/inplace/lib/package.conf.d/package.cache
Timestamp 2013-09-19 17:26:28 UTC for /Users/carter/Desktop/repoScratcher/ghc/inplace/lib/package.conf.d (same as cache)
using cache: /Users/carter/Desktop/repoScratcher/ghc/inplace/lib/package.conf.d/package.cache
== End post-build package check
rm -f bindist-list
make -r --no-print-directory -f ghc.mk bindist BINDIST=YES
"cp" -p ghc/stage1/build/tmp/ inplace/lib/bin/ghc-stage1
cp: ghc/stage1/build/tmp/ is a directory (not copied).
make[1]: *** [inplace/lib/bin/ghc-stage1] Error 1
make: *** [binary-dist-prep] Error 2

but the teste suite seems to have run fine when i subsequently did a make test

Unexpected results from:
TEST="T5313 static001 T1969 T4801"

OVERALL SUMMARY for test run started at Thu Sep 19 14:00:24 EDT 2013
    3778 total tests, which gave rise to
   14853 test cases, of which
   11416 were skipped

      28 had missing libraries
    3344 expected passes
      61 expected failures

       0 caused framework failures
       1 unexpected passes
       3 unexpected failures

Unexpected passes:
   driver  T5313 (normal)

Unexpected failures:
   driver         static001 [bad stderr] (normal)
   perf/compiler  T1969 [stat not good enough] (normal)
   perf/compiler  T4801 [stat too good] (normal)

I will upload the current ghc patch shortly. (same testsuite patch as before).

will just clean up the documentation a bit.

Note also that for now i've given all the prefetch primops a can_fail=True attribute, but it could be argued just as well that they should have no such attribute, or the has_side_effects=True attribute. I'm unsure whats the right story there.

comment:23 Changed 7 months ago by carter

its worth noting that the "right" benchmark for prefetch would be something like having two large arrays, one of which is indices into the other array, chosen randomly. the benchmark would be comparing a naive random lookup vs a prefetched one.

(the idea being that the test fails if prefetch isn't faster than naive).

i'll cook up this example some other time. but would be the only sane way to benchmark if prefetch is working or not! (because prefetch on any valid memory address will not change the naive perf correctness).

NB: the one way prefetching could interact with correctness in an unrelated to performance way, would be in relation to all those various side channel timing based security issues. proper mitigation would probably require ghc also supporting streaming reads and writes, in addition to prefetch.

Changed 7 months ago by carter

final patch for prefetch! sept 19 3:28pm

comment:24 Changed 7 months ago by carter

the only thing i think needs discussion may be the canfail / has effects attributes. (should it have them or not? its unclear to me).

comment:25 Changed 7 months ago by carter

reminder for myself

[16:18:07] <carter>	 tibbe: that said, i should explain prefetch*0 as being best for the streaming ephemeral case
[16:18:27] <carter>	 and prefetch*3 ops for "will be  reading/writing a bunch"
[16:18:36] <tibbe>	 right
[16:19:08] <carter>	 i'll comment that on the ticket first
[16:19:13] <carter>	 before i forget to do so this evening
[16:19:15] <carter>	 been a long day
[16:19:27] <tibbe>	 carter: maybe test all versions and make sure (prefetch bytearray) == bytearray
[16:19:34] <tibbe>	 :)
[16:19:38] <carter>	 ohh
[16:19:44] <carter>	 thats a good idea for a test
[16:19:52] <carter>	 for the pure ones

comment:26 Changed 7 months ago by carter

ok, to keep the attachment confusion from going crazy, the most current patches are here https://gist.github.com/cartazio/6643768

i'll also attach them to the ticket too if needed

comment:27 Changed 7 months ago by simonmar

Why do we have can_fail=True on the primops? Do they crash with an invalid address?

comment:28 Changed 7 months ago by simonmar

Ignore me, I just saw there was discussion of this on the mailing list.

comment:29 Changed 7 months ago by carter

well, there wasn't much discussion on the mailing list about whether can_fail is correct nor not!

but to recap the mailing list reasons I laid out:

  1. I think we're ok with inlining duplicating a prefetch operation
  1. I'm not sure if we want to be ok with floating out prefetches (doing a prefetch earlier than planned). This is the reason for doing the can_fail = True annotation.

comment:30 Changed 7 months ago by simonmar

Yes, I think can_fail is fine.

comment:31 Changed 7 months ago by carter

ok great! will this make it into 7.8?

comment:32 Changed 7 months ago by Austin Seipp <austin@…>

In fd74014079f14bd3ab50e328e52c44ef97d40e05/ghc:

Add support for prefetch with locality levels.

This patch adds support for several new primitive operations which
support using processor-specific instructions to help guide data and
cache locality decisions. We have levels ranging from [0..3]

For LLVM, we generate llvm.prefetch intrinsics at the proper locality
level (similar to GCC.)

For x86 we generate prefetch{NTA, t2, t1, t0} instructions. On SPARC and
PowerPC, the locality levels are ignored.

This closes #8256.

Authored-by: Carter Tazio Schonwald <carter.schonwald@gmail.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>

comment:33 Changed 7 months ago by Austin Seipp <austin@…>

In 14b14399f1b5189e9d18c9ddc9c6909c781af4af/testsuite:

Add tests for prefetch primops (#8256)

Authored-by: Carter Tazio Schonwald <carter.schonwald@gmail.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>

comment:34 Changed 7 months ago by thoughtpolice

  • Resolution set to fixed
  • Status changed from patch to closed

Merged, thanks!

comment:35 Changed 7 months ago by simonpj

Great stuff. Questions about the patch:

  • The primops are all marked can_fail = True. Could there be some words of documentation to explain why? I recall scratching my head over the can-fai-ness of primops before, so documenting the reasoning is a Good Thing.
  • Would it be possible to include in the Haddock comments in primops.txt.pp a URL where one can read more about the prefetch instructions and "locality levels"? I know that URLs go out of date, but a start would be useful. Perhaps with some keywords from that web page that might be a starting point for a subsequent search if the URL went stale.

Simon

comment:36 Changed 7 months ago by carter

Hey Simon, excellent questions

1) the motivation for the can_fail attribute is the following: anyone using prefetch (or at least using prefetch appropriately) are treating the CPU's memory bandwidth as a resource, and accordingly (and conservatively), can_fail prevents let floating/hoisting (if i've read the docs correctly), while still making it "cheap" to inline. It need not have the can_fail attribute, but since prefetches should/will only be used in pretty perf sensitive code, it seems like its a decent arbitrary choice that will help make "when" prefetch happens a bit easier to understand.

2) good points, i'll hack out a patch to docs later this week when i'm not buried with my consulting work

comment:37 Changed 7 months ago by carter

theres perhaps an argument for not having the can_fail = true attribute, but it seems like the "no float out" would help with reasoning about when prefetch happens

comment:38 Changed 7 months ago by simonpj

Thank you for (1). But when you fix (2) can you add those remarks to primops.txt.pp (as a comment, perhaps), rather than just telling me!

Simon

comment:39 Changed 6 months ago by carter

  • Owner carter deleted
  • Resolution fixed deleted
  • Status changed from closed to new

comment:40 Changed 6 months ago by carter

  • Owner set to carter

comment:41 Changed 6 months ago by carter

  • Status changed from new to patch

Heres the additional documentation patch, linked here https://github.com/cartazio/ghc/commit/d0f9dec9203c9203aa3dac5365e10b2b39ba0858

please let me know if that clarifies things sufficiently in the desired ways.

comment:42 Changed 6 months ago by simonpj

Thanks

Do we have any performance tests that show that the new primops actually deliver performance improvements?

I still don't understand this stuff about can_fail and hoisting/floating out. Here are the types for a couple of the new primops:

 prefetchByteArray2# :: ByteArray# -> Int# -> ByteArray# 
 prefetchMutableByteArray1# :: MutableByteArray# s -> Int# -> State# s -> State# s 

The latter returns a State#, which precisely sequences it between two other operations on that State#. Simialarly the former returns a ByteArray# which (I assume) should be used in preference to the earlier one. That is, you should use this pattern:

let a1 = prefetchByteArray2# a n in ...a1...
  -- Use a1 rather than a in the body of the 1et

If I have this right, this usage pattern should be thoroughly documented, since the type system does not enforce it.

Now in any loop involving prefetch, the State# or the ByteArray# is going to be carried along as a parameter to the recursive function, so there is no danger of it being floated out.

My gut feel is that we should drop this can_fail stuff until we have actual evidence that it is benefiting us. The whole can_fail thing is a bit of a hack, and I'd rather use it as little as possible. It's not a major deal though.

Simon

comment:43 Changed 6 months ago by carter

Simon, good points.

No i don't have a benchmark showing that adding the 0-3 levels is justified, though I could probably cook up one that shows a measurable (albeit small) difference when I have time. The prior LLVM only prefetch operation was equivalent to the prefetch*3# family of operations.

The 0-3 stuff also in part motivated as having a simple concrete thorn to motivate the "Static literal data" extension I think we need so as to properly support some other interesting primops in a somewhat type safe way, and prefetch is the simplest example, where exposing only 4 variants would still be useful. So assuming I work out that Static data extension right, We could have something like prefetchByteArray :: Static Int# -> ByteArray# -> Int# ->ByteArray# rather than the 0-3 variants for 7.10. So yes, I agree that the number suffix stuff is a bit of a hack, and its one that motivates some more systemic solutions going forward.

On the can_fail, explained that way, I think you're absolutely right (or at least more likely right than I).
I'll adjust that branch accordingly later today/tomorrow and link to the diff again.

comment:44 Changed 6 months ago by simonpj

We have plenty of thorns; see Geoff's recent stuff about SIMD-instruction primops!

comment:45 Changed 6 months ago by thoughtpolice

  • Resolution set to fixed
  • Status changed from patch to closed

The documentation patch is now merged, thanks!

comment:46 Changed 6 months ago by carter

  • Owner carter deleted
  • Resolution fixed deleted
  • Status changed from closed to new

urk, forgot to push the additional cleanup using the feedback simon had:

I'll push that commit onto my prefetch-docs branch again in a moment. https://github.com/cartazio/ghc/compare/prefetch-docs

comment:48 Changed 6 months ago by carter

  • Status changed from new to patch

comment:49 Changed 6 months ago by Austin Seipp <austin@…>

In 37ae422fc1131245705a686d1e3144b3f9e9aa81/ghc:

Update documentation concerning prefetch ops

Also remove can_fail=True since it's likely unnecessary upon discussion
(see #8256.)

Signed-off-by: Austin Seipp <austin@well-typed.com>

comment:50 Changed 6 months ago by thoughtpolice

  • Resolution set to fixed
  • Status changed from patch to closed

Merged, thanks!

Note: See TracTickets for help on using tickets.