This patch aims at generalizing the reflexive coercion to
|GReflRoleTypeMCoercion
As its name suggests, and as its typing rule makes precise, GRefl is generalised version of Refl:
t1 : k1 ------------------------------------ GRefl r t1 MRefl: t1 ~r t1 t1 : k1 co :: k1 ~ k2 ------------------------------------ GRefl r t1 (MCo co) : t1 ~r t1 |> co
MCoercion wraps a coercion, which might be reflexive (MRefl) or any coercion (MCo co). To know more about MCoercion see #14975.
This new coercion form will replace both Refl and the current CoherenceCo:
|ReflRoleType-- Coherence applies a coercion to the left-hand type of another coercion-- See Note [Coherence]|CoherenceCoCoercionKindCoercion-- :: e -> N -> e
When the MCoercion is MRefl, GRefl is just the same as Refl.
It is easy to express CoherenceCo using GRefl. If g1 :: s ~r t, then
CoherenceCo g1 g2 :: (s |> g2) ~r tSym (GRefl r s g2) ; g1 :: (s |> g1) ~r t
What are the advantages of the new form? (It looks more symmetrical; but it has more arguments.) Can you say what the "several places" where is is useful are? Are there any disadvanatages?
Can we use the new form in OptCoercion to simplify coercions?
I see no constraints on r in the rule. So you could fix on N and use SubCo to downgrade to R?
What are the advantages of the new form? (It looks more symmetrical; but it has more arguments.) Can you say what the "several places" where is is useful are? Are there any disadvanatages?
Practically, it simplifies some sort of coercion constructions. In current implementation, coercions as t |> g ~ t is built as <t> |> g. Namely we build the reflexivity coercion first and insert kind coercion to the left. Using the new form it can be rewritten to (t |> g) ~ t. Similarly, coercions as t ~ t |> g is sym (<t> |> g), and can be rewritten to t ~ (t |> g) directly. We have several usage like that in current implementation, e.g. TcCanonical. Moreover, one possible optimization is to replace the buildCoercion :: Type -> Type -> CoercionN function in module Coercion, which builds a coercion between types that are the same ignoring coercions, which is exactly the intention of the new form.
There are indeed some disadvantages now. In current refactor, there are three places where given a coercion g, we need to call coercionKind to get g :: t1 ~ t2 in order to use the new form because the new form needs this information to build (t1 |> ki_co) ~ t1 ; g . This is very inefficient. We are still trying to figure out further optimizations in those places.
Can we use the new form in OptCoercion to simplify coercions?
We have optimizations for EraseEqCo in OptCoercion. For example opt_co4 and opt_trans_rule for the new form are both simpler than for the old form. Besides that, I don't know if further optimization is possible?
I see no constraints on r in the rule. So you could fix on N and use SubCo to downgrade to R?
It is doable I think. But I don't know if we could gain any benefit from this restriction?
I agree with ticket:15192#comment:153905. Note that the new form is essentially a generalized form of Refl, which also has an unrestricted role parameter.
One area where the new coercion form is simpler is when we need t ~r (t |> g). Currently, this is Sym (Coherence (Refl r t) g). The new form would be EraseEqCo r t (CastTy t g) g. I do suppose that, as sharing is destroyed, the new form could be worse than the original one.
The truth is that Stephanie and I have been using this new form in our work for quite some time now, and this feels like just bringing GHC up to speed. But perhaps the old form is more convenient for implementation. I'd love to bring Stephanie in on the conversation.
This new form may be more efficient if there are multiple uses of Coherence needed --- one EraseEqCo can stand in for many casts (on either side).
However, If coherence has better sharing in some situations, then perhaps we don't want to lose that information.
There is no reason why we can't have both in the implementation. At least until we can see if EraseEqCo is useful in connection with other extensions. Richard and I both found it useful in generating coercions in our proofs. Perhaps that will show up in some of the coercions that GHC needs to construct.
Eventually, we may want to get rid of both of these coercions. In other words --- we can allow Core to work "up-to-erasure-equivalence", as in the system in the appendix of Richard's thesis. In that case, we wouldn't need to use Coherence or EraseEqCo very much (If at all).
I agree with Stephanie's first few paragraphs, but not her last one. GHC already works up to erasure-equivalence. That is, eqType returns True whenever two types are erasure-equivalent (and have the same kinds). But we still need the coherence coercions, because they sometimes change kinds.
I guess that will change once we move to homogeneous equality, though...
As a follow-on to ticket:15192#comment:153938: even with homogeneous equality, coercions will still be heterogeneous, and thus the need for a coherence coercion (of some form) will remain. The type of ~# becomes homogeneous, but that affects only what coercions we can quantify over, not what ones we can form.
Refl ty :: ty ~n ty, note that Refl ty is always nominal.
GRefl r ty MRefl :: ty ~r ty. If r == Nominal, use Refl.
GRefl r ty (MCo co) :: ty ~r ty |>co.
Replaced original Refl Nominal ty with Refl ty.
Given g1 :: s ~r t, to construct s |> g2 ~r t we used CoherenceCo g1 g2. It is now replaced with Sym (GRefl r s g2) ; g1. Similar for s ~ t |> g3.
It turns out that the explicit patten match in homogenise_result in TcFlatten triggers some optimization of GHC and improves the performance. However it is not useful in master branch.
Added a small regression for T9872d (the added number is the allocation of current master on T9872d).
Added note about flatten_exact_fam_app_fully performance in TcFlatten.
Performance summary
This patch intends to improve the overall performance about coercions.
It does perform better in all cases under perf/compiler, except T9872b (0.6%), T9872d(3.7%), and T14683(0.02%).
It failed T9872d, thus we added a small regression.
It seems to perform better to compile large packages, e.g. Cabal.
TODO: Further analysis of the performance: a step-by-step replay of the refactor.
OK. I propose that we merge the work Ningning has done so far, given that it is generally an improvement. Ningning, are you happy with the current patch on D4747? If so, I can merge it.
OK. I propose that we merge the work Ningning has done so far, given that it is generally an improvement. Ningning, are you happy with the current patch on D4747? If so, I can merge it.
add back CoherenceCo and related functions, such as mkCoherenceLeftCo (renaming it to mkCoherenceLeftCo').
replace uses of functions defined in terms of GRefl to corresponding functions defined in terms of CoherenceCo; for example, replace mkCoherenceLeftCo with mkCoherenceLeftCo'
test allocation after each change
A general observation is, mkCoherenceLeftCo' saves more allocation than mkCoherenceLeftCo when kind_co is not reflexive (similarly for mkCoherenceRightCo)
-- | Given @ty :: k1@, @co :: k1 ~ k2@, @co2:: ty ~ ty'@,-- produces @co' :: (ty |> co) ~r ty'mkCoherenceLeftCortycoco2|isGReflCoco=co2|otherwise=(mkSymCoco$GReflrty(MCoco))`mktransCo`co2-- stores an extra r and tymkCoherenceLeftCo'co(Refl_)=comkCoherenceLeftCo'(CoherenceCoco1co2)co3=CoherenceCoco1(co2`mkTransCo`co3)mkCoherenceLeftCo'co1co2=CoherenceCoco1co2
In the test case T9872d, TcFlatten.homogenise_result is called 340k+ times, which means mkCoherenceLeftCo is called 340k+ times. (not exactly as mkCoherenceLeftCo is inlined by hand in TcFlatten.homogenise_result now. Bur morally it is still true.)
If I leave everything unchanged, except for using mkCoherenceLeftCo' instead of mkCoherenceLeftCo in TcFlatten.homogenise_result, I save allocation in T9872d by ~2.5% and it passes the test before the regression.
(I also tried to replace the suspicious use of zipWith3 in TcFlatten with the original implementation by zipWith, but it save little allocation.)
I propose to update Note [flatten_exact_fam_app_fully performance] in TcFlatten to include the analysis.
You found that EXPERIMENT allocated up to 2.5% less.
Right so far?
What are the costs of EXPERIMENT?
The extra constructor in Coercion, and extra cases in free-vars, substitution, pretty-printing etc.
Extra stuff in OptCoercion, I assume. This is already a very tricky module, so may be the biggest cost.
Is that all?
Other thoughts
I recall that CoerherenceCo was biased to the laft. One could add a right-biased version too -- would that help even more?
Are you implicitly advocating adding CoherenceCo not for expressiveness but for efficiency? And if so, are there any other combinations of coercions that we could play a similar game with?
I've noticed (in tc_trace printing) that we build a LOT of Refl coercions. Yet the isGReflCo case obviously doesn't fire, perhaps because the coercion hasn't been zonked. Yet in fact zonkTcTypedoes zonk coercions, so I'm a bit puzzled about why we get quite so many of them. There might be some low-hanging fruit in finding out why.
In writing that comment I also realised that when zonking a type (with zonkTcType) we fully zonk all the coercions it mentions. I wonder if that is really necessary? That is, in zonkTcType could we simply not-zonk any coercions, just doing the latter when we do zonkTcTypeToType?
You found that EXPERIMENT allocated up to 2.5% less.
Right so far?
The change of homogenise_result saves ~2.5%. With more similar changes the allocation could become fewer.
For example, a similar change in TcFlatten.flatten_args_slow (use mkTcCoherenceLeftCo' instead of mkTcCoherenceLeftCo saves ~1% in T9872d. Namely, if we combine those two changes, it saves ~3.5%.
Changes in other places does not have a significant result in T9872d though.
What are the costs of EXPERIMENT?
The extra constructor in Coercion, and extra cases in free-vars, substitution, pretty-printing etc.
Extra stuff in OptCoercion, I assume. This is already a very tricky module, so may be the biggest cost.
Is that all?
Extra stuff in the compilation pipeline, including Coercion, OptCoercion, CoreLint, Ifacexxx etc. Basically I recovered everything I have removed of CoherenceCo in previous commit. Except that, I added nothing new.
Other thoughts
I recall that CoerherenceCo was biased to the laft. One could add a right-biased version too -- would that help even more?
I don't know the answer for sure. But I guess not. As mkCoherenceRightCo differs from CoherenceCo only by 2 SymCo, which won' save much I presume. In contrast, the GRefl-version of mkCoherenceLeftCo differs from CoherenceCo-version of mkCoherenceLeftCo' by a whole type, and the type can be large.
Are you implicitly advocating adding CoherenceCo not for expressiveness but for efficiency? And if so, are there any other combinations of coercions that we could play a similar game with?
No I am not advocating addting CoherenceCo back.
I don't know if it is worth it to add all those stuff for one particular test case.
If we look through Coercion.hs in head, we can find that functions mkCoherenceLeftCo and mkCoherenceRightCo are utility functions that construct coercions and are used frequently but don't have their own constructor of coercion (namely, CoherenceCo). Other ones don't seem to need a new constructor. (castCoercionKind could be one, but I don't think it will benefit from a new constructor of coercion).
I've noticed (in tc_trace printing) that we build a LOT of Refl coercions. Yet the isGReflCo case obviously doesn't fire, perhaps because the coercion hasn't been zonked. Yet in fact zonkTcTypedoes zonk coercions, so I'm a bit puzzled about why we get quite so many of them. There might be some low-hanging fruit in finding out why.
In writing that comment I also realised that when zonking a type (with zonkTcType) we fully zonk all the coercions it mentions. I wonder if that is really necessary? That is, in zonkTcType could we simply not-zonk any coercions, just doing the latter when we do zonkTcTypeToType?
I am not familiar with the zonking process (yet). If there are some other ways to improve the performance that would be great.
About not zonking in coercions in types: I'm worried about violating Note [The tcType invariant] in TcHsType. I do think we'll need to zonk these.
An unappealing alternative would be to have data Coercion = ... | ZonkedCo Coercion Type Type FV; zonking would zonk the types that the coercion relates instead of the coercion body. These types would be put into a ZonkedCo, along with the fvs and the unzonked underlying coercion. zonkTcTypeToType would then drop the ZonkedCo and zonk the underlying coercion instead. I don't think this is a good idea, though.
Back to the main points above: Having an extra constructor for performance has a precedent: that's why we have both TyConApp and FunTy. (Both can be encoded by AppTy.) The problem is that it will be hard to know when to stop playing that game: we will probably always be able to identify patterns in individual test cases that would be improved by an extra constructor.