Opened 3 years ago

Closed 2 years ago

#5282 closed bug (fixed)

Bizarre results from -P profiler on OS X

Reported by: bos
Owned by: simonmar
Priority: high
Milestone: 7.4.1
Component: Runtime System
Version: 7.0.3
Keywords:
Cc: dmp@…, jwlato@…
Operating System: Unknown/Multiple
Architecture: x86_64 (amd64)
Type of failure: Incorrect result at runtime
Difficulty:
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

I'm trying to use the profiler to understand the performance of an application on OS X. I have a smallish benchmark that prints a lot of lines of text to stdout, and the numbers reported when I use +RTS -P are very different from what I see with my eyes.

The benchmark itself is quite simple:

https://github.com/mailrank/text-format/blob/master/benchmarks/Simple.hs

When I run it on the command line, it produces simple enough output:

$ time ./Simple one 500000 +RTS -P >/dev/null
500000 iterations in 2.6 secs (191.7 thousand/sec)

real	0m2.626s
user	0m2.586s
sys	0m0.030s

The time measured within the app and the wall-clock time agree.

However, if I look at the Simple.prof file left behind afterwards, its numbers seem bizarre:

        Mon Jun 27 11:35 2011 Time and Allocation Profiling Report  (Final)

           Simple +RTS -P -RTS one 500000

        total time  =        0.28 secs   (14 ticks @ 20 ms)
        total alloc = 2,200,395,720 bytes  (excludes profiling overheads)

It reports a time of just 0.28 seconds, and the number of ticks it counts agrees with this arithmetically (14 ticks × 20 ms = 0.28 s).

I have other programs that suffer from the same discrepancy between the wall-clock measurement and what the profiler reports, to the point that I don't feel I can trust the profiler at all here. I'm at a loss for what to do about this, though.

Change History (20)

comment:1 Changed 3 years ago by simonmar

  • Milestone set to 7.2.1
  • Owner set to simonmar

comment:2 Changed 3 years ago by bos

Just to ensure that I wasn't insane, I installed Linux under VMware on the same machine. Building and running a program inside the Linux VM gives correct time measurements, while the Mac builds get them wrong.

Could this possibly be related to bug 4970, which was fixed recently?

comment:3 Changed 3 years ago by bos

Let's see if I can hyperlink bug #4970 properly ;-(

comment:4 Changed 3 years ago by simonmar

Unassigning myself since it's likely Mac-specific.

This commit might be responsible: 5e195173cd1921e838eb977f649ef178dff00415. However, be careful about just reverting it, because it did fix a bug (timing with getrusage was far too inaccurate).
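
For context, a minimal standalone sketch of the two ways of reading process CPU time at issue in that commit (plain POSIX calls, not the actual RTS code; getrusage typically only resolves to the kernel scheduling tick, while clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is much finer where it exists):

    #include <stdio.h>
    #include <sys/resource.h>   /* getrusage */
    #include <time.h>           /* clock_gettime, CLOCK_PROCESS_CPUTIME_ID */

    /* CPU time via getrusage(): microsecond fields, but the underlying
     * resolution is typically only the kernel scheduling tick (1-10 ms). */
    static double cpu_time_getrusage(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    }

    /* CPU time via clock_gettime(): nanosecond resolution where the
     * clock exists (it did not on OS X at the time of this ticket). */
    static double cpu_time_clock_gettime(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        volatile unsigned long x = 0;
        unsigned long i;
        for (i = 0; i < 100000000UL; i++) x += i;   /* burn some CPU */
        printf("getrusage:     %.6f s\n", cpu_time_getrusage());
        printf("clock_gettime: %.6f s\n", cpu_time_clock_gettime());
        return 0;
    }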

comment:5 Changed 3 years ago by simonmar

  • Owner simonmar deleted

comment:6 Changed 3 years ago by simonmar

  • Priority changed from normal to high

comment:7 Changed 3 years ago by bos

I doubt that commit is to blame, as OS X lacks support for clock_gettime.

comment:8 Changed 3 years ago by dmp

  • Cc dmp@… added

I'm seeing the odd profiling results, but only with the single-threaded runtime. If I compile with -threaded, the profiling results look correct.

For the non-threaded runtime, it looks like all of the profiling ticks are going to the SYSTEM cost centre. Since these ticks are not included in the profiling summary, it looks as though very few ticks are actually logged. To see where the ticks are going, I did some simple printf debugging by adding a statement to the handleProfTick function that prints out the cost centre.

    if (do_prof_ticks) {
        printf("%s\n", CCCS->root->label);  /* label of the root of the current cost-centre stack */
        CCCS->time_ticks++;
    }

When running with the threaded runtime it prints mostly "CAF" and the profiling output looks good. When running with the non-threaded runtime it prints mostly "SYSTEM" and the profiling summary looks bad.

I looked around, but couldn't see anything obviously wrong that would cause all the ticks to go to the SYSTEM entry. I'm at a dead end, but could do some more if there are some suggestions on where to look.

I'm using a simplified version of bos's benchmark that just does a busy loop:

{-# LANGUAGE BangPatterns #-}

module Main (main) where

import System.Environment (getArgs)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Busy loop: apply 'act' to each index from 0 to count-1.  NOINLINE
-- keeps 'counting' from being inlined into its caller, so it shows up
-- separately in the profile.
counting :: Int -> (Int -> () -> IO ()) -> IO ()
counting count act = loop 0
    where loop !i | i < count = act i () >> loop (i+1)
                  | otherwise = return ()
{-# NOINLINE counting #-}

idle :: Int -> IO ()
idle count = counting count $ \_ _ -> return ()

main :: IO ()
main = do
  args <- getArgs
  let count = case args of
                (_:x:_) -> read x
                _       -> 100000
  let bm = case args of
             ("idle":_)   -> idle
             _            -> error "wut?"
  start <- getCurrentTime
  bm count
  elapsed <- (`diffUTCTime` start) `fmap` getCurrentTime
  print elapsed

and running it with:

$ ./Simple idle 10000000 +RTS -P -RTS

comment:9 follow-up: Changed 3 years ago by simonmar

I can't imagine why the threaded and non-threaded runtimes would differ in this way. You could try gdb and set a watchpoint on the CCCS variable ("Current Cost Centre Stack"), and see if you can see what differs between the two runtimes.

comment:10 follow-up: Changed 3 years ago by strake

I can reproduce this on Linux 2.6.38 with GHC 7.0.3; however, the number of ticks is not just low, but zero.

$ ./program +RTS -P -RTS

program.prof, in part:

total time  =        0.00 secs   (0 ticks @ 20 ms)
total alloc =      10,600 bytes  (excludes profiling overheads)

The problem vanishes when the program is compiled with -threaded, and the number of ticks is then reasonable.

comment:11 Changed 3 years ago by strake

  • Operating System changed from MacOS X to Unknown/Multiple

comment:12 in reply to: ↑ 10 Changed 3 years ago by simonmar

Replying to strake:

I can reproduce this on Linux 2.6.38 with GHC 7.0.3; however, the number of ticks is not just low, but zero.

Can you attach a program that demonstrates this please?

comment:13 in reply to: ↑ 9 Changed 3 years ago by dmp

Replying to simonmar:

I can't imagine why the threaded and non-threaded runtimes would differ in this way. You could try gdb and set a watchpoint on the CCCS variable ("Current Cost Centre Stack"), and see if you can see what differs between the two runtimes.

I spent some time with gdb and could not find anything obviously wrong. In the single-threaded runtime, if I put a breakpoint on the timer callback, the backtrace shows it is in the scheduler getting ready to do a GC, so the CCCS is set correctly.

In the threaded version I'm seeing some odd things in the debugger, but it seems fine if I run it outside the debugger. In the debugger it looks like the timer interrupt is delivered to every thread in the RTS, including two worker threads waiting on pthread condition variables. It's probably just an odd interaction with the debugger, because running normally seems to work fine.

The simple benchmark I'm using does do a large number of GCs: 5208 collections in about 2.5 seconds, which is a collection roughly every half millisecond. Even though the collections are fast (98% productivity reported by -s), we could just be getting unlucky with the timer signal. If I set a larger allocation area with -A100M, there are only 26 collections and the profile data looks good. I don't know why the threaded runtime wouldn't have the same problem, though.

comment:14 Changed 3 years ago by scpmw

A few more tidbits of information from debugging under Mac OS: the signal seems to arrive at strangely predictable places. For the single-threaded RTS, it keeps getting delivered inside blockUserSignals and unblockUserSignals. When using --install-signal-handlers=no, suddenly all of the signals arrive during the getrusage call in getProcessCPUTime. I have verified this using a global status variable: all of the offending timer signals happen while in that function.

Another interesting puzzle piece: The way the itimer is initialized seems to be the reason that it is specific to -threaded: If I put "|| 1" at the end of Itimer.c:83, I get correct results for non-threaded builds.

So... signal masking causing SIGVTALRM to fire? Hm.

comment:15 Changed 3 years ago by simonmar

So the single-threaded RTS uses a CPU timer (that is, the signals are generated every 20ms of CPU time), whereas the multi-threaded RTS uses a wall-clock timer. This is explained in a comment at the top of source:rts/posix/Itimer.c (the comment says "Major bogosity", but I now think that's overstating it a bit; it has been this way forever).

So perhaps on Mac OS X the CPU-time signals occur at predictable places, whereas wall-clock signals are delivered more evenly. It would be unfortunate if we had to switch to the wall-clock timer, because it's less accurate for profiling (periods when the process is descheduled introduce noise into the results).
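
For illustration, a minimal standalone sketch of the two timer flavours being contrasted here (plain setitimer calls with a stub handler; this is an assumed approximation of the scheme described above, not the actual Itimer.c code):

    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>

    static void prof_tick(int sig) { (void)sig; /* a profiling sample would be taken here */ }

    /* Install a 20ms periodic timer.  With ITIMER_VIRTUAL the interval is
     * measured in the process's own CPU time and the signal is SIGVTALRM
     * (roughly the non-threaded scheme); with ITIMER_REAL it is measured
     * in wall-clock time and the signal is SIGALRM (the threaded scheme). */
    static void start_ticker(int use_cpu_time)
    {
        struct sigaction sa;
        struct itimerval it;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = prof_tick;
        sigaction(use_cpu_time ? SIGVTALRM : SIGALRM, &sa, NULL);

        it.it_interval.tv_sec  = 0;
        it.it_interval.tv_usec = 20000;   /* 20 ms tick */
        it.it_value = it.it_interval;
        setitimer(use_cpu_time ? ITIMER_VIRTUAL : ITIMER_REAL, &it, NULL);
    }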

comment:16 Changed 3 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:17 Changed 3 years ago by jwlato

  • Cc jwlato@… added

comment:18 Changed 2 years ago by simonmar

  • Owner set to simonmar

I'm on this.

comment:19 Changed 2 years ago by simonmar

Here's a profile from today's HEAD with a patch I'm about to commit, on a Mac:

	Fri Nov 25 15:23 2011 Time and Allocation Profiling Report  (Final)

	   5282 +RTS -P -RTS idle 10000000

	total time  =        1.00 secs   (1001 ticks @ 1000 us)
	total alloc = 800,082,392 bytes  (excludes profiling overheads)

COST CENTRE   MODULE  %time %alloc  ticks     bytes

counting.loop Main     98.7  100.0    988 800000000
idle.\        Main      1.3    0.0     13         0


                                                            individual     inherited
COST CENTRE       MODULE                  no.     entries  %time %alloc   %time %alloc  ticks     bytes

MAIN              MAIN                    108           0    0.0    0.0   100.0  100.0      0       656
 main             Main                    217           0    0.0    0.0   100.0  100.0      0     28848
  main.count      Main                    223           1    0.0    0.0     0.0    0.0      0     10880
  idle            Main                    220           1    0.0    0.0   100.0  100.0      0         0
   counting       Main                    221           1    0.0    0.0   100.0  100.0      0        72
    counting.loop Main                    222    10000001   98.7  100.0   100.0  100.0    988 800000000
     idle.\       Main                    224    10000000    1.3    0.0     1.3    0.0     13         0
  main.bm         Main                    218           1    0.0    0.0     0.0    0.0      0         0
 CAF              Main                    215           0    0.0    0.0     0.0    0.0      0        48
  main            Main                    216           1    0.0    0.0     0.0    0.0      0         0
   main.bm        Main                    219           0    0.0    0.0     0.0    0.0      0       304
 CAF              GHC.Conc.Signal         206           0    0.0    0.0     0.0    0.0      0       672
 CAF              GHC.IO.Encoding         190           0    0.0    0.0     0.0    0.0      0      3360
 CAF              GHC.IO.Encoding.Iconv   188           0    0.0    0.0     0.0    0.0      0       248
 CAF              GHC.IO.Handle.FD        179           0    0.0    0.0     0.0    0.0      0     34944
 CAF              Text.Read.Lex           133           0    0.0    0.0     0.0    0.0      0      2360

I've switched to using wall-clock time for profile samples on both the non-threaded and the threaded RTS (CPU time was far too unreliable), and I've bumped the default sample interval to 1ms. The profile on Linux looks the same, and on Windows I've switched to using a better timer so the results there are better too.
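
For reference, a minimal sketch of the kind of realtime ticker the fix moves to, assuming POSIX timers (timer_create with CLOCK_MONOTONIC); the actual RTS implementation differs in detail and uses other mechanisms on platforms without POSIX timers:

    #include <signal.h>
    #include <string.h>
    #include <time.h>

    static void prof_tick(int sig) { (void)sig; /* take a profiling sample */ }

    /* A wall-clock (CLOCK_MONOTONIC) ticker at 1ms, the new default
     * interval when profiling.  Unlike a CPU-time timer, it keeps
     * ticking even while the process is blocked or descheduled. */
    static void start_ticker(void)
    {
        struct sigaction  sa;
        struct sigevent   sev;
        struct itimerspec its;
        timer_t           timer;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = prof_tick;
        sigaction(SIGALRM, &sa, NULL);

        memset(&sev, 0, sizeof(sev));
        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo  = SIGALRM;
        timer_create(CLOCK_MONOTONIC, &sev, &timer);

        its.it_interval.tv_sec  = 0;
        its.it_interval.tv_nsec = 1000000;   /* 1 ms tick */
        its.it_value = its.it_interval;
        timer_settime(timer, 0, &its, NULL);
    }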

comment:20 Changed 2 years ago by simonmar

  • Resolution set to fixed
  • Status changed from new to closed

commit 6b1098511aaabd2c9503ee7be6da1944466f9cb4

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Fri Nov 25 13:11:39 2011 +0000

    Time handling overhaul
    
    Terminology cleanup: the type "Ticks" has been renamed "Time", which
    is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds).
    The terminology "tick" is now used consistently to mean the interval
    between timer signals.
    
    The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if
    we have it).  Before it used CPU time in the non-threaded RTS and
    realtime in the threaded RTS, but I've discovered that the CPU timer
    has terrible resolution (at least on Linux) and isn't much use for
    profiling.  So now we always use realtime.  This should also fix
    
    The default tick interval is now 10ms, except when profiling where we
    drop it to 1ms.  This gives more accurate profiles without affecting
    runtime too much (<1%).
    
    Lots of cleanups - the resolution of Time is now in one place
    only (Rts.h) rather than having calculations that depend on the
    resolution scattered all over the RTS.  I hope I found them all.
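
As a rough illustration of the terminology in the commit message (the macro and type names below are illustrative stand-ins, not necessarily the exact ones in Rts.h):

    #include <stdint.h>

    /* "Time" is a 64-bit count in units of TIME_RESOLUTION (nanoseconds),
     * defined in a single place; "tick" means the interval between timer
     * signals.  Names here are hypothetical. */
    typedef uint64_t Time;

    #define TIME_RESOLUTION         1000000000ULL          /* ns per second */
    #define SecondsToTime(s)        ((Time)(s)  * TIME_RESOLUTION)
    #define MillisecondsToTime(ms)  ((Time)(ms) * (TIME_RESOLUTION / 1000))
    #define TimeToSecondsDbl(t)     ((double)(t) / TIME_RESOLUTION)

    /* Default tick interval: 10ms normally, 1ms when profiling,
     * per the commit message above. */
    #define DEFAULT_TICK_INTERVAL   MillisecondsToTime(10)
    #define PROFILING_TICK_INTERVAL MillisecondsToTime(1)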