Refactor of Coercion

changed milestone to %8.6.1

changed weight to 5

Trac metadata

Trac field	Value
Version	8.2.2 → -

This new rule corresponds to the coherence rule used in my thesis and in last ICFP's paper on dependent Haskell.

Trac metadata

Trac field	Value
Differential revisions	- → D4747

Can you elaborate?

What are the advantages of the new form? (It looks more symmetrical; but it has more arguments.) Can you say what the "several places" where is is useful are? Are there any disadvanatages?
Can we use the new form in OptCoercion to simplify coercions?
I see no constraints on r in the rule. So you could fix on N and use SubCo to downgrade to R?

Replying to [ticket:15192#comment:153901 simonpj]:

Can you elaborate?

What are the advantages of the new form? (It looks more symmetrical; but it has more arguments.) Can you say what the "several places" where is is useful are? Are there any disadvanatages?

Theoretically, this new form emphasizes that type equality relation is coherent, namely coercions within types is immaterial. This is studied in Section 5.8.3 in Richard's thesis and the An-EraseEq rule in A Specification for Dependent Types in Haskell paper
Practically, it simplifies some sort of coercion constructions. In current implementation, coercions as t |> g ~ t is built as <t> |> g. Namely we build the reflexivity coercion first and insert kind coercion to the left. Using the new form it can be rewritten to (t |> g) ~ t. Similarly, coercions as t ~ t |> g is sym (<t> |> g), and can be rewritten to t ~ (t |> g) directly. We have several usage like that in current implementation, e.g. TcCanonical. Moreover, one possible optimization is to replace the buildCoercion :: Type -> Type -> CoercionN function in module Coercion, which builds a coercion between types that are the same ignoring coercions, which is exactly the intention of the new form.
There are indeed some disadvantages now. In current refactor, there are three places where given a coercion g, we need to call coercionKind to get g :: t1 ~ t2 in order to use the new form because the new form needs this information to build (t1 |> ki_co) ~ t1 ; g . This is very inefficient. We are still trying to figure out further optimizations in those places.

Can we use the new form in OptCoercion to simplify coercions?

We have optimizations for EraseEqCo in OptCoercion. For example opt_co4 and opt_trans_rule for the new form are both simpler than for the old form. Besides that, I don't know if further optimization is possible?

I see no constraints on r in the rule. So you could fix on N and use SubCo to downgrade to R?

It is doable I think. But I don't know if we could gain any benefit from this restriction?

I agree with ticket:15192#comment:153905. Note that the new form is essentially a generalized form of Refl, which also has an unrestricted role parameter.

One area where the new coercion form is simpler is when we need t ~r (t |> g). Currently, this is Sym (Coherence (Refl r t) g). The new form would be EraseEqCo r t (CastTy t g) g. I do suppose that, as sharing is destroyed, the new form could be worse than the original one.

The truth is that Stephanie and I have been using this new form in our work for quite some time now, and this feels like just bringing GHC up to speed. But perhaps the old form is more convenient for implementation. I'd love to bring Stephanie in on the conversation.

This new form may be more efficient if there are multiple uses of Coherence needed --- one EraseEqCo can stand in for many casts (on either side).

However, If coherence has better sharing in some situations, then perhaps we don't want to lose that information.

There is no reason why we can't have both in the implementation. At least until we can see if EraseEqCo is useful in connection with other extensions. Richard and I both found it useful in generating coercions in our proofs. Perhaps that will show up in some of the coercions that GHC needs to construct.

Eventually, we may want to get rid of both of these coercions. In other words --- we can allow Core to work "up-to-erasure-equivalence", as in the system in the appendix of Richard's thesis. In that case, we wouldn't need to use Coherence or EraseEqCo very much (If at all).

I agree with Stephanie's first few paragraphs, but not her last one. GHC already works up to erasure-equivalence. That is, eqType returns True whenever two types are erasure-equivalent (and have the same kinds). But we still need the coherence coercions, because they sometimes change kinds.

I guess that will change once we move to homogeneous equality, though...

changed the description

Update the ticket from EraseEqCo to GRefl according to the discussion with Richard and Simon

changed the description

We don't know coercion co in MCo co is reflexive or not.

As a follow-on to ticket:15192#comment:153938: even with homogeneous equality, coercions will still be heterogeneous, and thus the need for a coherence coercion (of some form) will remain. The type of ~# becomes homogeneous, but that affects only what coercions we can quantify over, not what ones we can form.

Overall changes so far:

Removed Refl r ty and CoherenceCo
Introduced Refl ty and GRefl r ty MCoercion

Refl ty :: ty ~n ty, note that Refl ty is always nominal.
GRefl r ty MRefl :: ty ~r ty. If r == Nominal, use Refl.
GRefl r ty (MCo co) :: ty ~r ty |>co.

Replaced original Refl Nominal ty with Refl ty.
Given g1 :: s ~r t, to construct s |> g2 ~r t we used CoherenceCo g1 g2. It is now replaced with Sym (GRefl r s g2) ; g1. Similar for s ~ t |> g3.
It turns out that the explicit patten match in homogenise_result in TcFlatten triggers some optimization of GHC and improves the performance. However it is not useful in master branch.
Added a small regression for T9872d (the added number is the allocation of current master on T9872d).
Added note about flatten_exact_fam_app_fully performance in TcFlatten.

Performance summary

This patch intends to improve the overall performance about coercions.
It does perform better in all cases under perf/compiler, except T9872b (0.6%), T9872d(3.7%), and T14683(0.02%).
It failed T9872d, thus we added a small regression.
It seems to perform better to compile large packages, e.g. Cabal.

TODO: Further analysis of the performance: a step-by-step replay of the refactor.

OK. I propose that we merge the work Ningning has done so far, given that it is generally an improvement. Ningning, are you happy with the current patch on D4747? If so, I can merge it.

Replying to [ticket:15192#comment:156459 goldfire]:

OK. I propose that we merge the work Ningning has done so far, given that it is generally an improvement. Ningning, are you happy with the current patch on D4747? If so, I can merge it.

I am generally happy with it. Should I rebase it?

Yes, please. Then re-post to Phab and let me know you're ready to commit; I can to the merge to the master branch.

mentioned in commit 55a3f855

Is this fixed?

Replying to [ticket:15192#comment:156795 RyanGlScott]:

Is this fixed?

Not yet. We committed but currently we still have some performance issue.

I am still working on it.

assigned to @trac-ningning

Ah, my apologies then. (I interpreted ticket:15192#comment:156459 as "let's just accept the performance regressions and move on".)

I think that the regression is on perf/compiler/T9872d.

mentioned in commit e5f3de2c

I did some experiments recently. What I did:

add back CoherenceCo and related functions, such as mkCoherenceLeftCo (renaming it to mkCoherenceLeftCo').
replace uses of functions defined in terms of GRefl to corresponding functions defined in terms of CoherenceCo; for example, replace mkCoherenceLeftCo with mkCoherenceLeftCo'
test allocation after each change

A general observation is, mkCoherenceLeftCo' saves more allocation than mkCoherenceLeftCo when kind_co is not reflexive (similarly for mkCoherenceRightCo)

-- | Given @ty :: k1@, @co :: k1 ~ k2@, @co2:: ty ~ ty'@,
-- produces @co' :: (ty |> co) ~r ty'
mkCoherenceLeftCo r ty co co2
  | isGReflCo co = co2
  | otherwise    = (mkSymCo co $ GRefl r ty (MCo co)) `mktransCo` co2
                              -- stores an extra r and ty

mkCoherenceLeftCo' co (Refl _) = co
mkCoherenceLeftCo' (CoherenceCo co1 co2) co3 = CoherenceCo co1 (co2 `mkTransCo` co3)
mkCoherenceLeftCo' co1 co2 = CoherenceCo co1 co2

In the test case T9872d, TcFlatten.homogenise_result is called 340k+ times, which means mkCoherenceLeftCo is called 340k+ times. (not exactly as mkCoherenceLeftCo is inlined by hand in TcFlatten.homogenise_result now. Bur morally it is still true.)

If I leave everything unchanged, except for using mkCoherenceLeftCo' instead of mkCoherenceLeftCo in TcFlatten.homogenise_result, I save allocation in T9872d by ~2.5% and it passes the test before the regression.

(I also tried to replace the suspicious use of zipWith3 in TcFlatten with the original implementation by zipWith, but it save little allocation.)

I propose to update Note [flatten_exact_fam_app_fully performance] in TcFlatten to include the analysis.

Ningning, let me check my understanding

In HEAD we have GRefl but not CoherenceCo
In EXPERIMENT you added CoherenceCo and used it.
You found that EXPERIMENT allocated up to 2.5% less.

Right so far?

What are the costs of EXPERIMENT?

The extra constructor in Coercion, and extra cases in free-vars, substitution, pretty-printing etc.
Extra stuff in OptCoercion, I assume. This is already a very tricky module, so may be the biggest cost.

Is that all?

Other thoughts

I recall that CoerherenceCo was biased to the laft. One could add a right-biased version too -- would that help even more?
Are you implicitly advocating adding CoherenceCo not for expressiveness but for efficiency? And if so, are there any other combinations of coercions that we could play a similar game with?
I've noticed (in tc_trace printing) that we build a LOT of Refl coercions. Yet the isGReflCo case obviously doesn't fire, perhaps because the coercion hasn't been zonked. Yet in fact zonkTcType does zonk coercions, so I'm a bit puzzled about why we get quite so many of them. There might be some low-hanging fruit in finding out why.
In writing that comment I also realised that when zonking a type (with zonkTcType) we fully zonk all the coercions it mentions. I wonder if that is really necessary? That is, in zonkTcType could we simply not-zonk any coercions, just doing the latter when we do zonkTcTypeToType?

Replying to [ticket:15192#comment:161456 simonpj]:

Ningning, let me check my understanding

In HEAD we have GRefl but not CoherenceCo

In EXPERIMENT you added CoherenceCo and used it.

You found that EXPERIMENT allocated up to 2.5% less.

Right so far?

The change of homogenise_result saves ~2.5%. With more similar changes the allocation could become fewer.

For example, a similar change in TcFlatten.flatten_args_slow (use mkTcCoherenceLeftCo' instead of mkTcCoherenceLeftCo saves ~1% in T9872d. Namely, if we combine those two changes, it saves ~3.5%.

Changes in other places does not have a significant result in T9872d though.

What are the costs of EXPERIMENT?

The extra constructor in Coercion, and extra cases in free-vars, substitution, pretty-printing etc.

Extra stuff in OptCoercion, I assume. This is already a very tricky module, so may be the biggest cost.

Is that all?

Extra stuff in the compilation pipeline, including Coercion, OptCoercion, CoreLint, Ifacexxx etc. Basically I recovered everything I have removed of CoherenceCo in previous commit. Except that, I added nothing new.

Other thoughts

I recall that CoerherenceCo was biased to the laft. One could add a right-biased version too -- would that help even more?

I don't know the answer for sure. But I guess not. As mkCoherenceRightCo differs from CoherenceCo only by 2 SymCo, which won' save much I presume. In contrast, the GRefl-version of mkCoherenceLeftCo differs from CoherenceCo-version of mkCoherenceLeftCo' by a whole type, and the type can be large.

Are you implicitly advocating adding CoherenceCo not for expressiveness but for efficiency? And if so, are there any other combinations of coercions that we could play a similar game with?

No I am not advocating addting CoherenceCo back. I don't know if it is worth it to add all those stuff for one particular test case.

If we look through Coercion.hs in head, we can find that functions mkCoherenceLeftCo and mkCoherenceRightCo are utility functions that construct coercions and are used frequently but don't have their own constructor of coercion (namely, CoherenceCo). Other ones don't seem to need a new constructor. (castCoercionKind could be one, but I don't think it will benefit from a new constructor of coercion).

I've noticed (in tc_trace printing) that we build a LOT of Refl coercions. Yet the isGReflCo case obviously doesn't fire, perhaps because the coercion hasn't been zonked. Yet in fact zonkTcType does zonk coercions, so I'm a bit puzzled about why we get quite so many of them. There might be some low-hanging fruit in finding out why.

In writing that comment I also realised that when zonking a type (with zonkTcType) we fully zonk all the coercions it mentions. I wonder if that is really necessary? That is, in zonkTcType could we simply not-zonk any coercions, just doing the latter when we do zonkTcTypeToType?

I am not familiar with the zonking process (yet). If there are some other ways to improve the performance that would be great.

About not zonking in coercions in types: I'm worried about violating Note [The tcType invariant] in TcHsType. I do think we'll need to zonk these.

An unappealing alternative would be to have data Coercion = ... | ZonkedCo Coercion Type Type FV; zonking would zonk the types that the coercion relates instead of the coercion body. These types would be put into a ZonkedCo, along with the fvs and the unzonked underlying coercion. zonkTcTypeToType would then drop the ZonkedCo and zonk the underlying coercion instead. I don't think this is a good idea, though.

Back to the main points above: Having an extra constructor for performance has a precedent: that's why we have both TyConApp and FunTy. (Both can be encoded by AppTy.) The problem is that it will be hard to know when to stop playing that game: we will probably always be able to identify patterns in individual test cases that would be improved by an extra constructor.

Keeping this ticket open because while the main commit in #15192 is done, there is still work to be done on the perf aspects.

NB: re ticket:15192#comment:161564, I hope that Note [The tcType invariant] is going to go away.

added Pnormal label

added Coercions compiler perf labels

Refactor of Coercion

Child items ...

Activity