Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#8456 closed bug (fixed)

Control flow optimisations duplicate blocks

Reported by: jstolarek Owned by: jstolarek
Priority: highest Milestone: 7.8.1
Component: Compiler Version: 7.7
Keywords: Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking: #8275
Related Tickets: Differential Rev(s):
Wiki Page:

Description

During work on #8275 I observed that control flow optimisation pass in the Cmm pipeline duplicates block that does the stack check, which is completely redundant. There is at least one bug in CmmContFlowOpt module. Consider this:

L1: goto L2
L2: whatever
L3: goto L1

We are processing blocks from the end. When we reach L3 first guard in maybe_concat function (CmmContFlowOpt.hs, line 123) will turn that blocks into:

L1: goto L2
L2: whatever
L3: goto L2

However, the number of predecessors of L2 block is not updated because backEdges is computed once before we run maybe_concat and is not updated when we make changes to the block structure.

Another issue, which I have not yet encountered in practice but I believe may arise as well, comes from the fact that we may map one label to a different one, but again we don't that take into account when determining the number of predecessors.

Change History (19)

comment:1 Changed 4 years ago by jstolarek

Simon Marlow says:

I wonder if the right way to fix these issues is to not shortcut or concat a jump when the jump destination has not already been processed. This might be easier than keeping track of the predecessor counts accurately, and should give results that are just as good.

Let me see if I understand this correctly. Consider again the same code as in the ticket report:

L1: goto L2
L2: whatever
L3: goto L1

According to your proposal L3 will not process its destination - L1 - because L1 itself has not been processed (we process blocks from the back). L1 however will concatenate with L2, causing L2 to be unreachable and removed from the graph later on:

L1: whatever
L2: whatever -- UNREACHABLE, eliminated later
L3: goto L1

Now consider this:

L1: goto L2
L2: whatever
L3: goto L1
L4: goto L2

L3 will not shortcut its destination, because it has not been processed yet. L1 will not concatenate L2, because L2 has two predecessors. So this graph will not be modified in any way. However, using current algorithm (and providing that we correctly keep track of predecessors) we will optimise that graph to:

L1: goto L2 -- UNREACHABLE, eliminated later
L2: whatever
L3: goto L2
L4: goto L2

Do I got this right?

comment:2 Changed 4 years ago by jstolarek

Cc: simonmar added

comment:3 Changed 4 years ago by simonpj

About CmmContFlowOpt, it looks to me as if there may be other bugs. When deciding in shouldConcatWith whether the destination block is small, we look in blocks; but those are the original un-rewritten blocks, and we might have already rewritten the destination block.

I think it might be better to consider this algorithm as something that rewrites a full graph to a graph (both represented as usual as a finite map). The algorithm takes a post-order DFS list of block-ids (not blocks!) which is used as the list of blocks to consider, one by one. Look up that block id to find the block, consider whether to rewrite it; if so, rewrite it and replace it in the finite map, and move on to the next one.

You are right that rewriting changes predecessor counts, so you'd need to maintain that alongside the current graph. An easy approximiation would be this: keep a set of block-ids that are referenced exactly once, and delete things from it when necessary. (But never add anything.) That would be an approximation, but probably a reasonable one.

I don't agree with Simon M that "not shortcut or concat backward jumps" will give results that are at least as good. Imagine a backward goto to a goto!

Thanks for working on this Jan

Simon

comment:4 Changed 4 years ago by simonmar

Simon - the set of blocks is updated as we go along (there's a foldr at the top). So it does what you're suggesting.

For a backward goto to a goto, I was imagining that the forward goto would be concatenated with its destination. But it might not be, if the destination has multiple predecessors, so I withdraw my suggestion.

comment:5 Changed 4 years ago by simonpj

Ah yes, you're right. In this line 135

           , Just blk' <- mapLookup b' blocks

I thought blocks was the original blocks. But it's the working set of blocks. (Shadowing.) I'll rename it

Simon

comment:6 in reply to:  5 Changed 4 years ago by jstolarek

I'll rename it

Please, don't - I have a major patch coming for this module. I'll write more in a moment, just let me finish my lunch :)

comment:7 Changed 4 years ago by jstolarek

I have rewritten maybe_concat function and some subsequent functions - code is here on github. In the modified algorithm I maintain an up-to-date map of predecessors. I don't like the idea of having an approximation - up till now we had an approximation (i.e. changing number of predecessors usually doesn't matter) and now we're fixing it because it does the wrong thing. Having an approximation just feels dangerous.

My patch validates, except for the fact that it slightly increases allocation and trips two performance tests. I think that's to be expected, given that we are passing around one more data structure and updating it. I'll be updating the comments for the next hour or so, so if there's something wrong or unclear about the patch please tell me.

As a side note I see some recurring problems in the design of Cmm pipeline:

  • here we have a problem, because we create a list of predecessors, which we use to perform graph transformations but at the same time we invalidate that list
  • in #8205 we computed a list of proc-points and passed it to stack layout, which needed it for graph transformations but at the same time invalidated that list
  • CmmContFlowOpt creates unreachable blocks and relies on common block elimination to remove them
  • stack layout creates unreachable blocks and relies on sinking pass to remove them

The last two have not manifested themselves as bugs, but fall into the category of introducing inconsistencies in our data and assuming that someone will later clean up those inconsistencies. I will not be surprised if more such problems are lurking in the code. At the moment I don't know how to do better, but certainly the current design feels tightly coupled and that is not a good thing.

comment:8 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In 057bef6ef759bc0819d4ced291da92aa3feb445a/ghc:

Improve control flow optimisation algorithm

Fixes #8456. Previous version of control flow optimisations
did not update the list of block predecessors, leading to
unnecessary duplication of blocks in some cases. See Trac
and comments in the code for more details.

comment:9 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In 1e31b0083597c2c1cf3198c8e8c5736360640f91/testsuite:

Update performance test due to fix for #8456

comment:10 Changed 4 years ago by jstolarek

Resolution: fixed
Status: newclosed

comment:11 Changed 4 years ago by simonmar

Owner: jstolarek deleted
Priority: highhighest
Resolution: fixed
Status: closednew

I looked at this patch, I think we need to revisit it.

         | CmmBranch b' <- last
         , Just blk' <- mapLookup b' blocks
         , hasOnePredecessor b'
         , hasNotBeenMappedTo b' shortcut_map

In particular, the hasNotBeenMappedTo here is very suspicious:

          hasNotBeenMappedTo :: BlockId -> BlockEnv BlockId -> Bool
          hasNotBeenMappedTo b successor_map = mapMember b successor_map

It looks like the condition is reversed. Indeed, fixing this makes the performance problem go away, probably because this condition is preventing any block concatenation from happening at all, which means later phases are seeing a lot more code than they would if concatenation was happening.

What's more, even when it is reversed, I don't think this condition does what you want, because what you actually wanted to do was look in the range of the map, not the domain. And we don't want to do that, because searching the range of a map is O(n) operation, leaving the whole pass O(n2).

I suggest just deleting this condition. If the block is in the shortcut_map, then it will also have another predecessor (the call itself), and we won't attempt to concatenate with it.

comment:12 Changed 4 years ago by simonmar

Milestone: 7.8.1

comment:13 Changed 4 years ago by jstolarek

Owner: set to jstolarek

It looks like the condition is reversed

Yes, you're right.

what you actually wanted to do was look in the range of the map, not the domain.

Argh... I don't know why I assumed that it looks in the range :-/

searching the range of a map is O(n) operation

According to documentation of IntMap lookup has O(min(n,W)) complexity:

Many operations have a worst-case complexity of O(min(n,W)). This means that the operation can become linear in the number of elements with a maximum of W -- the number of bits in an Int (32 or 64).

I'm not sure if I understand correctly, but this means that for n < W lookup is linear, right?

I suggest just deleting this condition. If the block is in the shortcut_map, then it will also have another predecessor (the call itself), and we won't attempt to concatenate with it.

Not any more - see third guard and update_cont function. I wanted to avoid this kind of hidden invariants, because they often cause us trouble.

My idea here would be to replace

, hasNotBeenMappedTo b' shortcut_map

with

, Nothing <- mapLookup b' shortcut_map

Would that be acceptable? I think that for the most time we keep a small number of entries in shortcut_map, so lookup will be linear, but then again if these numbers are small it shouldn't be that much of a problem?

comment:14 Changed 4 years ago by simonmar

lookup looks up in the domain, not the range. You would need to do something like elem b' (mapElems shortcut_map). This is O(n).

But I still think you don't need this test. You're keeping backEdges up to date, so there's no reason to believe that you have something "hidden" in the shortcut_map that would represent more predecessors, right? The shortcut_map is for replacing labels inside expressions after the pass is complete. It used to update calls, but as you point out, the pass now does this as it goes along.

There are no extra invariants. The shortcut_map doesn't represent any extra edges in the graph.

comment:15 Changed 4 years ago by jstolarek

Ah, now I see. Yes, I think you're right - the test for mapping can be removed.

comment:16 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In 29db17fe02786a0af543e0f6de1d4ee4cca7e299/ghc:

Remove unnecessary check in CmmContFlowOpt

Fixes #8456

comment:17 Changed 4 years ago by Jan Stolarek <jan.stolarek@…>

In 3ac7046a52d1d37e3cd3f09d556caabbb9b75716/testsuite:

Adjust performance of T783 to #8456 fix

comment:18 Changed 4 years ago by jstolarek

Resolution: fixed
Status: newclosed

comment:19 Changed 4 years ago by simonmar

Thanks!

Note: See TracTickets for help on using tickets.