Opened 6 years ago

Last modified 19 months ago

#2374 new bug

MutableByteArray# is slower than Addr#

Reported by: dolio Owned by:
Priority: lowest Milestone: 7.6.2
Component: Compiler (NCG) Version: 6.8.2
Keywords: performance array Cc:
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

I've filed this against the native code generator, even thought it still occurs in my experience with the via-C path. It is significantly more pronounced with the NCG path, and so, I imagine, is likely to be fixed by working on that. However, feel free to reclassify if appropriate.

When attempting to rewrite the fannkuch benchmark of the computer language benchmarks game to use more realistic code, I discovered that I am currently unable to do so. At least one piece of the puzzle is that the current entry uses Ptr and Addr# for arrays, while the current mutable array libraries (ST arrays and uvector) use MutableByteArray?# underneath, and it happens that reads/writes on the latter are slower than the former.

After some discussion on the glasgow-haskell-users list, I believe I've worked the kinks out of my (even more micro) benchmark programs, and will attach them here. They accept two numbers, k and n, and reverse an array of size-n k times. The larger fannkuch benchmark works on small arrays (reversing, shifting and copying), so using small arrays with a very large number of iterations is more representative of it. However, my benchmark doesn't show terribly much variation when one varies the parameters by corresponding factors (although there is a drop in performance when the size of the array nears the size of the CPU cache, naturally).

I've test machine is an AMD Athlon 64 3000+ running in 64-bit mode with 2 GB of ram. My operating system is Linux, kernel 2.6.24.

I've mainly done runs where n * k = 2.5 billion, as that takes approximately as much time as the MutableByteArray?# version of the fannkuch benchmark, around 11 seconds for 250 million iterations with an array size of 10. The Addr# version of the benchmark runs with the same parameters in around 7 seconds (which is faster than the actual fannkuch benchmark, but I digress).

I'll attach my two micro benchmarks. ByteArr?.hs is the MutableByteArray?# reversal benchmark. Ptr.hs is the Addr#/Ptr version. I may also attach full fannkuch implementations as another data point (the original is already Addr#/Ptr, and I've written a MutableByteArray?# implementation; I just need to do some proofreading of the latter to make sure it doesn't have any discrepancies).

I've also not done much investigation as to the source of the problem, besides looking at the core of various higher level programs to track things down this far. I'll attempt to delve further into the issue, investigating the C-- and assembly if my abilities allow, and update here as I find things.

Attachments (3)

Ptr.hs (1.1 KB) - added by dolio 6 years ago.
Slightly faster version of the Ptr benchmark for low array sizes (same on larger arrays).
ByteArr.hs (1.3 KB) - added by dolio 6 years ago.
Improved byte array benchmark
2374.zip (2.4 KB) - added by batterseapower 6 years ago.
Ticket #2374 benchmarking script and benchmark SATed/non-SATed versions

Download all attachments as: .zip

Change History (15)

Changed 6 years ago by dolio

Slightly faster version of the Ptr benchmark for low array sizes (same on larger arrays).

comment:1 Changed 6 years ago by igloo

  • Difficulty set to Unknown
  • Milestone set to 6.10 branch

Changed 6 years ago by dolio

Improved byte array benchmark

comment:2 Changed 6 years ago by dolio

I've attached a new version of the MutableByteArray?# benchmark I discovered while playing around with the static argument transform. It appears that, for me at least, performing SAT on the array argument of this benchmark causes GHC to produce significantly slower code via the NCG (via-C isn't much affected).

Using this version of the benchmark appears to put Addr# and MBA# at about the same speed (Addr# may still be slightly faster, but I can probably forgive 0.5 seconds on 2.5 billion array accesses). So, you may feel free to close this bug if you deem it appropriate.

comment:3 Changed 6 years ago by batterseapower

I get results exactly opposite to those dolio reports when trying to reproduce this. My times are:

-fvia-c:

Before SAT:
real	0m17.201s

After SAT:
real	0m10.498s

-fasm:
Before SAT:
real	0m14.942s

After SAT:
real	0m14.695s

So the NCG is unaffected by SATing the arr argument of reverse, while GCC /is/ affected but it improves the code rather than making it worse.

I don't really understand why our results are so different. I've attached the benchmarking script and source files I used in the hopes that someone who understands the NCG better can work out why we aren't benefiting from SAT while GCC is.

Changed 6 years ago by batterseapower

Ticket #2374 benchmarking script and benchmark SATed/non-SATed versions

comment:4 Changed 6 years ago by batterseapower

I should add, actually, that I got my results on a Mac OS X 10.5 system running i386 code on Xeons (an early 2008 Mac Pro). The installed GCC is v4.0.1 and GHC was HEAD version 6.9.20080831.

comment:5 Changed 5 years ago by igloo

  • Milestone changed from 6.10 branch to 6.12 branch

comment:6 Changed 4 years ago by igloo

  • Milestone changed from 6.12 branch to 6.12.3

comment:7 Changed 4 years ago by igloo

  • Milestone changed from 6.12.3 to 6.14.1
  • Priority changed from normal to low

comment:8 Changed 3 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:9 Changed 3 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:10 Changed 3 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:11 Changed 2 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1
  • Priority changed from low to lowest

comment:12 Changed 19 months ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2
Note: See TracTickets for help on using tickets.