Anna Sophia Welker via llvm-dev
2020-Oct-30 10:08 UTC
[llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hi all,

I am looking into the benefits of a VPlan-based cost model, and into how such a cost model should be implemented to make the most of these benefits.

Over the last year I have been working with LLVM, mainly focused on the ARM backend, in the course of a one-year internship at Arm Ltd. My main project from December 2019 to August 2020 was to introduce gather/scatters for MVE auto-vectorization. One of the recurring challenges during this work was to get things right in the cost model. For example, gathers can extend the data they load, while scatters can truncate their input, meaning that an extend following the load, or a truncate preceding the store, is free if it meets certain type conditions. As the current cost model is not designed for context-based analysis, this was a pain to model.

I have done some research and found that one of the proposed benefits of VPlan is that a new cost model making use of it would be able to better support context-dependent decisions like this. However, there is not much specification of what such a cost model should look like.

I have also read through the respective code to understand how loop vectorization is currently done and how far the introduction of VPlan has progressed, and have realised that the only recipe that actually handles more than one instruction from the input IR is the one for interleaved groups. When the VPlan is generated on the VPlan-native path, every IR instruction is considered and transformed into a recipe separately, ignoring its context (to give a code location, I am looking at VPlanTransforms::VPInstructionsToVPRecipes). And there may be architectures that in some cases do not have the same vector instructions, so a pattern that works great for one could be useless for others. So I am wondering: is there any plan to have target-dependent flavours of recipes, or how will those things be handled?
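As a concrete, if toy, illustration of the context dependence Anna describes: a cost query that sees only the zext in isolation cannot distinguish the two cases below, while one that also sees the producing instruction can. A minimal sketch, with hypothetical type rules and costs, not Arm MVE's actual tables:

```python
# Toy cost model for the "free extend after a gather" case. The type
# table and the costs are hypothetical, for illustration only.

# Simplified rule: a gather can widen i8->i32 or i16->i32 as part of
# the load, so a matching zext afterwards costs nothing.
FOLDABLE_EXTENDS = {("i8", "i32"), ("i16", "i32")}

def zext_cost(src_type, dst_type, producer_op):
    """Cost of a vector zext, given the opcode producing its operand."""
    if producer_op == "gather" and (src_type, dst_type) in FOLDABLE_EXTENDS:
        return 0   # the extend folds into the extending gather load
    return 1       # otherwise it is a separate instruction

# The same zext is free after a gather but not after an add:
print(zext_cost("i8", "i32", "gather"))  # 0
print(zext_cost("i8", "i32", "add"))     # 1
```

A context-free `getCastInstrCost`-style query would have to pick one of the two answers for both cases, which is exactly the averaging problem described in the thread.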
Right now it makes sense that nothing like this has been implemented yet, as there is no cost model that could guide the transformation. But if recipes are a general thing, should the cost model be the component that actually provides the target-specific pattern for a recipe, together with its cost?

I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master's thesis, with the goal of presenting a solution and implementing a prototype.

What I would like to ask the community is
(1) what goals there are or should be for VPlan-based cost modelling,
(2) whether there have been any discussions on target-dependent patterns yet, and
(3) which examples of inefficiencies and shortcomings of the current cost model they have come across.

I am looking forward to your feedback.

Many thanks,
Anna Welker

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201030/da187159/attachment.html>
Florian Hahn via llvm-dev
2020-Oct-30 11:38 UTC
[llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hi,

> On Oct 30, 2020, at 10:08, Anna Sophia Welker via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi all,
>
> I am looking into the benefits of a VPlan-based cost model, and into how such a cost model should be implemented to make the most out of these benefits.
>
> Over the last year I have been working with LLVM, mainly focused on the ARM backend, in the course of a one-year internship at Arm Ltd. My main project from December 2019 to August 2020 was to introduce gather/scatters for MVE auto-vectorization. One of the recurring challenges during this work was to get things right in the cost model.
> For example, gathers can extend the data they load, while scatters can truncate their input, meaning that an extend following the load, or a truncate preceding the store, is for free if it meets certain type conditions. As the current cost model is not designed for context-based analysis, this was a pain to model.

One goal is a more modular approach to cost modelling and transformations. The current cost model is quite monolithic and keeps some context information in various tables, which is hard to reason about and also hard to change and extend. One idea with VPlan is that cost decisions could be materialized in the VPlans directly, by transformations working in concert to find the best overall strategy.

I think the example with gathers that extend the loaded value illustrates that quite well. In the legacy cost model, you could maintain a set of values that are free to extend and add the loaded values to it; when visiting a zext, return a cost of zero for those. With VPlan, you could instead add a new VPInstruction opcode ZExtLoad and have a transformation that replaces all `zext (load)` instructions with the new ZExtLoad one. The cost model needs to be taught about this special instruction and how much it costs. Then you could apply this transformation to an existing plan and check whether the overall cost improved.
Of course, there is also the issue of how to generalize the TTI APIs to allow for computing the cost of instructions with more context.

> I have done some research and found out that one of the proposed benefits of VPlan is that a new cost model making use of it would be able to better support context-dependent decisions like this.
> However, there does not exist much specification about what such a cost model should look like.
>
> Also, I have read through the respective code to understand how loop vectorization is currently done and how far the introduction of VPlan has progressed and have realised that the only recipe that actually handles more than one instruction from the input IR is the one for interleaved groups. When the VPlan is generated on the VPlan-native path, every IR instruction is considered and transformed into a recipe separately, ignoring its context (to give a code location, I am looking at VPlanTransforms::VPInstructionsToVPRecipes).
> And maybe there are architectures that for some cases do not have the same vector instructions, so a pattern that works great for one could be useless for others. So I am wondering: Is there any plan to have target-dependent flavours of recipes, or how will those things be handled?

I think we probably want to keep things as generic as possible: try to model generic concepts and use TTI to decide whether to use them or not (e.g. see how masked loads/stores are handled).

> Right now it makes sense that nothing like this has been implemented yet, as there is no cost model that could guide the transformation. But if recipes are a general thing, should the cost model be the component actually providing the target-specific pattern for a recipe, together with its cost?

I am not sure what patterns specifically you are thinking about, but I think the cost model should just evaluate the cost of a given plan and not provide anything beyond that.
Of course, this can still mean that there are certain recipes/instructions that are not available or profitable on some targets, and we decide to never generate them, based on the cost information.

> I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master thesis, with the goal to present a solution and implement a prototype.

I am hoping to make some progress on this in the next months (hopefully the work on modelling the def-use chains between recipes in VPlan will be wrapped up soon) and I expect there to be a few moving parts. Not sure what that means for a master thesis in this area.

Cheers,
Florian
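Florian's ZExtLoad suggestion can be sketched end to end with a toy plan representation: a transformation fuses `zext (load)` pairs into one combined opcode, and the rewrite is kept only if the plan's total cost improves. The opcode names and the costs below are hypothetical:

```python
# Sketch of the "fuse, then compare plan costs" workflow. A plan is a
# list of (opcode, value) pairs; all costs are invented.

COST = {"load": 1, "zext": 1, "zext-load": 1, "add": 1}

def plan_cost(plan):
    return sum(COST[op] for op, _ in plan)

def fuse_zext_load(plan):
    """Replace each load immediately followed by a zext of its result."""
    out, i = [], 0
    while i < len(plan):
        if (i + 1 < len(plan)
                and plan[i][0] == "load"
                and plan[i + 1] == ("zext", plan[i][1])):
            out.append(("zext-load", plan[i][1]))  # combined opcode
            i += 2
        else:
            out.append(plan[i])
            i += 1
    return out

plan = [("load", "a"), ("zext", "a"), ("add", "a")]
fused = fuse_zext_load(plan)
best = fused if plan_cost(fused) < plan_cost(plan) else plan
print(plan_cost(plan), plan_cost(best))  # 3 2
```

The key property is that the decision is made by comparing two complete plans, not by special-casing the zext inside a per-instruction cost query.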
Sjoerd Meijer via llvm-dev
2020-Oct-30 11:39 UTC
[llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hello Anna,

There was a VPlan round-table at the US LLVM dev conference just a few weeks ago, and I have linked in the people who were involved, as cost modelling was one of the topics that was discussed. You may have seen this already, but Dave started doing some work in this area recently:

https://reviews.llvm.org/D89322
https://reviews.llvm.org/D89323

Just some first high-level remarks on your proposal: I think work in this area would be very valuable and would solve a real problem that we currently have. If you plan to choose this as a subject for a master's thesis, you probably need to think about how big a task this is, and having some motivating examples would be good too (but you mentioned some already). To me it sounds like a big task that needs a lot of plumbing in VPlan, but that's why having this discussion here is interesting, and hopefully people with more VPlan experience can provide insights.

Cheers,
Sjoerd.
Renato Golin via llvm-dev
2020-Oct-30 13:37 UTC
[llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hi Anna,

This sounds like great work for a master's! The main difference with VPlan (versus the original vectoriser) is that we can have multiple paths in parallel and compare them to each other via some cost function. The problem, as you have stated, is that the cost model is really naive and has a lot of heuristics that were fudged to suit the old vectoriser. Transitioning to the new model will likely produce a cost model that has nothing to do with the previous one. At all.

I see the problem from two angles. Both need to be thought about in parallel, as their designs will need to be in sync.

1. How to build the cost model

The cost model structure needs to be target-independent but with heavy target-specific knowledge. The kind of specialisation we created for the previous vectoriser is too naive, and assumes the concepts are the same, just not necessarily the costs. That's not true at all for different hardware.

Another big flaw is that the cost is per instruction, not per pattern. So casts have zero cost because we expect the targets to optimise them away, but in some cases they don't, so the heuristic is an average of how efficient the back-end is at getting rid of them. When you're choosing between two VPlan paths, this will generate noise beyond usefulness. So we need to find the cost of the instructions that are actually generated, and not of those that aren't.

One idea is to ignore most instructions and focus on key ones (memory, arithmetic, etc.), then follow the use-def chain to see the patterns and then, *per target*, know what the costs are for the pattern, not for each individual instruction. Different targets can have different key instructions, so the walk through the basic block can be target-dependent (ex: `for (auto inst : TTI->AllKeyInsts(bb))`).
Once you try to take the cost of a key instruction, it will follow all of the others that you have ignored, to see if they would make a difference in the pattern, stopping short at other key instructions, for example, to avoid counting the same cost twice. There are probably many other ways to do this; I'm just thinking out loud to give an idea of the problems we faced earlier. I'm sure you'll find better ways to do this yourself once you start looking at it more.

Regardless of how we do it, having the ability to say "this shuffle is free with this instruction" but "not with that other one" will make the cost model a lot more precise and need far fewer fudged heuristics.

2. How to use the cost model

Once we have costs for a VPlan, we need to traverse the problem space efficiently. It would be awesome if we could use some smart traversal (like RLO), but that also brings highly uncertain execution times and the need for training and generalisation, so it is not generally feasible. But we can do something simpler while still having the same idea. What we really don't want is to be as greedy as the original vectoriser. We must have a way to occasionally increase costs without giving up, but for that we need a policy that tells us that it's ok to go one way and not some other (less optimal) way.

This policy has to be target-specific. Given its more deterministic nature (more than RLO, at least), this will probably be the place where we fudge most of our heuristics. Things like "keep adding shuffles because it's likely they'll disappear": even if they don't all the time, it's worth pushing a bit more in case they do. So we'll also need hard limits, possibly per target, and possibly benchmark-driven heuristics.
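Renato's per-pattern costing can be sketched with a toy use-def walk: start from a key instruction, collect the ignorable ops it absorbs, stop at other key instructions so nothing is counted twice, and price the whole pattern per target. All opcodes, tables and costs here are invented:

```python
# Toy per-pattern costing over a use-def graph. `defs` maps a value name
# to the instruction defining it. Tables are hypothetical.

KEY_OPS = {"store", "load", "mul"}
PATTERN_COST = {("store", "trunc"): 1}   # the trunc folds into the store
OP_COST = {"store": 1, "load": 1, "mul": 1, "trunc": 1}

def key_pattern(name, defs):
    """Ops absorbed into a key instruction, stopping at other key ops."""
    ops, work = [defs[name]["op"]], list(defs[name]["args"])
    while work:
        v = work.pop()
        d = defs.get(v)
        if d is None or d["op"] in KEY_OPS:
            continue                      # priced by its own pattern
        ops.append(d["op"])
        work.extend(d["args"])
    return tuple(ops)

def block_cost(defs):
    total = 0
    for name, d in defs.items():
        if d["op"] in KEY_OPS:
            pat = key_pattern(name, defs)
            # Target knows the pattern: one price; else sum the parts.
            total += PATTERN_COST.get(pat, sum(OP_COST[o] for o in pat))
    return total

# store(trunc(mul(a, b))): naively 3, but the trunc is free here.
defs = {
    "m": {"op": "mul",   "args": ["a", "b"]},
    "t": {"op": "trunc", "args": ["m"]},
    "s": {"op": "store", "args": ["t"]},
}
print(block_cost(defs))  # 2: mul costs 1, the store+trunc pattern costs 1
```

A per-instruction model would charge the trunc separately on every target; here the target's pattern table decides.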
Once we have the policy and the cost, we can traverse one or many paths (depending on the budget we give the compiler, which could be a command-line option), then push as hard as we can through all paths within the budget, and when we stop, we take the lowest cost and apply the series. How we traverse this will depend on the implementation of the cost model and the policy, but some simple heuristic search would be fine as a first approach.

The folks working on VPlan are creating a really nice structure to work with the different concepts in vectorisation strategies (predication, scalable vectors, use-def chain costs), so I'm sure you'll have at least a few good ways to proceed, and hopefully help define the interfaces between the different parts of the vectoriser.

cheers,
--renato
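The budgeted, non-greedy traversal Renato outlines can be sketched as a small beam search: keep several candidate plans alive, allow transformations that temporarily raise the cost, and take the cheapest plan when the budget runs out. The plans, transformations and costs below are invented for illustration:

```python
# Beam search over plan transformations. A plan is a tuple of opcodes.

COST = {"mul": 2, "add": 1, "shuffle": 1, "shmul": 1}

def cost(plan):
    return sum(COST[op] for op in plan)

def add_shuffle(plan):
    # Speculatively insert a shuffle: costs more now, may pay off later.
    return ("shuffle",) + plan if plan[0] == "mul" else plan

def fuse(plan):
    # shuffle+mul combine into one cheap instruction on this target.
    return ("shmul",) + plan[2:] if plan[:2] == ("shuffle", "mul") else plan

def search(plan, transforms, budget, beam=4):
    candidates = [plan]
    for _ in range(budget):               # budget = how hard we try
        for p in list(candidates):
            for t in transforms:
                q = t(p)
                if q not in candidates:
                    candidates.append(q)
        candidates = sorted(candidates, key=cost)[:beam]  # keep, don't commit
    return min(candidates, key=cost)

best = search(("mul", "add"), [add_shuffle, fuse], budget=2)
print(best, cost(best))  # ('shmul', 'add') 2
```

A greedy search would never insert the shuffle, since it raises the cost from 3 to 4, and would stay at cost 3; keeping the more expensive intermediate plan in the beam lets the later fusion reach cost 2.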
Anna Sophia Welker via llvm-dev
2020-Nov-02 09:48 UTC
[llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hi Florian,

> With VPlan, instead you could add a new VPInstruction opcode ZExtLoad and have a transformation that replaces all `zext (load)` instructions with the new ZExtLoad one. The cost model needs to be taught about this special instruction and how much it costs. Then you could apply this transformation to an existing plan and check if the overall cost improved.
>
> Of course there's also the issue of how to generalize the TTI APIs to allow for computing the cost of instructions with more context.
>
> I think we probably want to keep things as generic as possible, try to model generic concepts and use TTI to decide whether to use it or not (e.g. see how masked loads/stores are handled).

Hm, so what you are saying is that for any instruction that could be merged into another one on some architecture, there should be a new VPInstruction (naturally based on the 'plain' VPInstruction for that instruction) that symbolises this, and it would be assigned the cost 0 by the TTI (or whatever a new cost model would use) of only the respective architecture. I had to think about this a bit, but now it makes sense to me. Handling it in a smart way, such that TTIs that do not know about this special case fall back to what they do for the normal case, would also save us from generating additional VPlans. And with the load being the ingredient of the extend, and thus accessible, everything is available to do some type checking and ensure the merge is doable for the architecture. This really helped me to understand how things are supposed to work, thanks a lot for explaining!
I do still have a few doubts as to whether this approach might still bear some risk of over-generating (see my reply to Renato's email), but those might simply originate from me being too cautious, because I do not yet have a good overview of how many (different) examples like the one I mentioned there are, and so how many special cases would have to get their own VPInstruction flavour.

> I am not sure what patterns specifically you are thinking about, but I think the cost model should just evaluate the cost of a given plan and not provide anything beyond that. Of course, this still can mean that there might be certain recipes/instructions that are not available/profitable on some targets and we decide to never generate them, based on the cost-information.

I was wondering how one would merge the extend from the example above into the load without any changes to the VPlan structure/classes, so basically without the trick of adding special instructions that you mentioned above. But that does not seem a nice way to do it.

>> I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master thesis, with the goal to present a solution and implement a prototype.
>
> I am hoping to make some progress on this in the next months (hopefully the work on modeling the def-use chains between recipes in VPlan will be wrapped up soon) and I expect there to be a few moving parts. Not sure what that means for a master thesis in this area.

I guess the extent of the impact that has on a thesis strongly depends on what exactly 'this' entails; could you elaborate a bit, please? Are you planning to work on a cost model for VPlan, or to change how VPInstructions and VPRecipes work, or something else? If I were to start working on a thesis named 'Cost modelling with VPlan' or similar, the first thing to do would be a lot of thinking and drafting to come up with a good and solid concept.
I don't think that would depend too much on concrete implementation details of VPlan, as long as the core of the individual components stays the same...

Best,
Anna
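The fallback Anna describes, where a TTI that knows nothing about a special combined opcode prices it as its decomposed parts, can be sketched like this. The opcodes, class names and costs are hypothetical:

```python
# A generic cost interface that decomposes unknown combined opcodes, and
# a target that overrides just the case it can fold. All values invented.

DECOMPOSE = {"zext-gather": ["gather", "zext"]}  # generic meaning

class GenericTTI:
    COST = {"gather": 4, "zext": 1, "add": 1}

    def cost(self, op):
        if op in self.COST:
            return self.COST[op]
        # Unknown combined opcode: fall back to the sum of its parts.
        return sum(self.cost(p) for p in DECOMPOSE[op])

class MVELikeTTI(GenericTTI):
    def cost(self, op):
        if op == "zext-gather":
            return self.COST["gather"]   # the extend folds into the load
        return super().cost(op)

print(GenericTTI().cost("zext-gather"))  # 5: gather plus a separate zext
print(MVELikeTTI().cost("zext-gather"))  # 4: one extending gather
```

With this shape, a single plan containing the combined opcode remains costable on every target, so no extra VPlans are needed just to keep unaware targets working.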
Anna Sophia Welker via llvm-dev
2020-Nov-02 09:53 UTC
[llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hi Renato,

> The cost model structure needs to be target independent but with heavy target-specific knowledge. The kind of specialisation we created for the previous vectoriser is too naive, and assumes the concepts are the same, just not necessarily the costs. That's not true at all for different hardware.

Yes, that's a big issue to be solved. I do like the idea from Florian's reply: basically, encode an instruction's environment in the VPInstruction that is generated, and make it such that target-specific cost components overwrite the cost calculations for the general class if the architecture can treat them in a special way. Though currently the gather that is extended and the scatter whose input is truncated are my only examples of this kind of thing, which is why I am hoping that the community will have encountered more examples like these and is willing to share. Maybe there are cases that are more complex, that e.g. merge or depend on far more than two instructions, in which case the generation of the right VPInstructions might become tricky, especially if there are multiple possible 'merge patterns' (for one or different architectures) that overlap, which would require generating alternative VPlans, whose number for long loops with many special cases might grow exponentially to cover all possible combinations.

> There are probably many other ways to do this, I'm just thinking out loud to give an idea of the problems we faced earlier. I'm sure you'll find better ways to do this yourself once you start looking at it more.
>
> Regardless of how we do it, having the ability to say "this shuffle is free on this instruction" but "not with that other one" will make the cost model a lot more precise and need a lot less fudged heuristics.

Well, thank you for thinking out loud, that is exactly what I was hoping to get out of this RFC!
Yes, there needs to be a good way to say things like that, and preferably 'good' should not only mean efficient, but also easy to understand, maintain and extend - but that's just me dreaming :)

Are you thinking of a concrete example when mentioning those shuffles? Concrete examples would really help me a lot when thinking further about this. If I decide to do this as a thesis, I might have to constrain the actual implementation to (a subset of instructions for) a single architecture, but for thinking about the issue itself, concrete examples for any architecture would be a great help!

> 2. How to use the cost model
>
> Once we have costs of a VPlan, we need to traverse the problem space efficiently. It would be awesome if we could use some smart traversing (like RLO) but that also brings highly uncertain execution times and the need for training and generalisation, so not generally feasible. But we can do something simpler while still having the same idea. What we really don't want is to be as greedy as the original vectoriser. We must have a way to occasionally increase costs without giving up, but for that, we need a policy that tells us that it's ok to go that way and not some other (less optimal) way.
>
> This policy has to be target-specific. Given its more deterministic nature (more than RLO at least), this will probably be the place where we fudge most of our heuristics. Things like "keep adding shuffles that it's likely they'll disappear", even if they don't all the time, it's worth pushing a bit more, in case they do. So we'll also need to have hard limits, possibly per target, and possibly benchmark-driven heuristics.
>
> Once we have the policy and the cost, we can traverse one or many paths (depending on the budget we give the compiler, which could be a command line option), and then push as hard as we can through all paths within the budget and when we stop, we take the lowest cost and apply the series. How we traverse this will depend on the implementation of the cost model and the policy, but some simple heuristics search would be fine as a first approach.
>
> Folks working on VPlan are creating a really nice structure to work with the different concepts in vectorisation strategies (predication, scalable vectors, use-def chain costs), so I'm sure you'll have at least a few good ways to proceed, and hopefully help define the interfaces between the different parts of the vectoriser.

Hm, I hadn't yet gotten as far as how to use it, but you're right, of course: that will be another challenge. I know that there are many things being planned and on the way for VPlan, but I'll willingly admit that I am currently still having some trouble gaining an overview of what is there, what is not, and what might be on its way. It's simply hard and takes a lot of time to find documentation for all the new things, but I'm sure I'll get there.

Thanks for your detailed reply and all the ideas, they do help a lot.

Best,
Anna
David Green via llvm-dev
2020-Nov-04 15:20 UTC
[llvm-dev] [RFC] Goals for VPlan-based cost modelling
Hello,

The grammar idea sounds interesting. My brain tends to just work in terms of code, but it might make a great master's thesis :) The pattern matchers were modelled after the existing ones in LLVM that are used all over the place, especially in instcombine. It was meant to feel familiar with what already existed.

I imagine a lot of people must have thought about the same kinds of ideas, and there must be some research out there on it. Whether it worked well or not, it would be good to look into what was already tried. I imagine there will be cases where it works well and times where it doesn't. The same concept of specifying patterns in some way that isn't C++ code could apply to instcombine or DAG combining too. The fact that no one ever did it is perhaps a bad sign? I know there are some tablegen-defined combines for GlobalISel now; they might give some ideas.

Combining a few instructions into a vmulh is simple enough, and I was hoping that there would not be a huge number of combines needed. There are bigger transforms that perhaps wouldn't fit that very well. Imagine trying to take code that does:

  x, y = vld2
  x *= -1
  vst2 x, y

and turn that into:

  x = load
  x *= [-1, 1]
  str x

That is like "de-interleaving" an interleaving group, much like SLP vectorization. Plus it would need to make that transform generic across all the different operations that could be between the loads/store, and across the size of the interleaving groups. It could, of course, get quite complex. Looking forward to seeing what you come up with.

Dave

--------------------------------------------

Sent: 03 November 2020 15:45
Subject: Re: [llvm-dev] [RFC] Goals for VPlan-based cost modelling

Hi Dave,

thanks for your detailed reply; the reductions you describe really seem like a perfect example of what a new cost model and IR transformation should be able to cope with, and so do vmulh and the other instructions you mentioned.
In particular, they underline the importance of finding a solution for patterns of arbitrary size, which is much harder than finding a feasible solution for patterns of size two like the ones I mentioned.

A thought that has been sitting in my head for a while is whether this part of transforming IR (or VP-) instructions into VPRecipes could be handled by some kind of grammar, like the grammars used for parsing in NLP, or also in compilers. Patterns could be represented as grammar rules, and if more than one pattern matches, the analysis can fork and continue on two or more VPlans in parallel. Recipes that can be used in many architectures would be in the main grammar. It could be a probabilistic grammar with values either 0 or 1, such that architectures that do not implement a recipe can set the probability of it ever being generated to 0. For very target-specific patterns, target-specific sub-grammars could be maintained.

On a totally crazy thought, a cost model could be integrated into this grammar step, given that the patterns need legality (type) checking anyway, like it is currently done in some TTI cost functions. Then the recipes can be annotated with their cost at generation time, we maintain a total cost for each VPlan, and we can define a threshold on `cost(VPlan) - minVPlanCost` above which we abandon plans on the fly, thus abandoning options that are unlikely to be fruitful mid-analysis (I know that threshold would be difficult to find, but at least it would reduce the problem space a bit).

In an approach like that, patterns could be arbitrarily large, but I do have some fears that it might lead to over-generating VPlans. It would also probably mean that a lot of recipes are needed to model every instruction that can possibly be generated, but if the concrete costs were calculated at parse time and the information from the matched instructions were reduced to what is needed to generate the output IR, it might be possible to keep that down.
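A toy version of this grammar idea: rules rewrite a prefix of the instruction sequence into a recipe, every applicable rule forks the candidate set, and partial plans whose cost exceeds the best finished plan by more than a threshold are abandoned mid-analysis. The rules, costs and threshold are all invented for illustration:

```python
# Grammar-style parsing of an instruction sequence into costed recipes.

RULES = [                                  # (pattern, recipe, cost)
    (("load", "zext"), "zext-load", 1),    # target-specific merged recipe
    (("load",), "load", 1),
    (("zext",), "zext", 1),
    (("add",), "add", 1),
]
THRESHOLD = 2

def parse(instrs):
    worklist = [((), 0, instrs)]           # (recipes, cost, remaining)
    finished = []
    while worklist:
        recipes, cost, rest = worklist.pop()
        if not rest:
            finished.append((recipes, cost))
            continue
        for pattern, recipe, c in RULES:   # fork on every matching rule
            if rest[:len(pattern)] == pattern:
                worklist.append(
                    (recipes + (recipe,), cost + c, rest[len(pattern):]))
        best = min((c for _, c in finished), default=None)
        if best is not None:               # prune hopeless partial plans
            worklist = [p for p in worklist if p[1] <= best + THRESHOLD]
    return min(finished, key=lambda p: p[1])

print(parse(("load", "zext", "add")))
# (('zext-load', 'add'), 2): the merged recipe beats the plain parse
```

An architecture without the merged instruction would simply omit the first rule (the "probability 0" case), and the same parser would return the plain three-recipe plan.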
I think your example in D88152 goes in a similar direction, only it transforms VPInstructions into another VPInstruction, not into a recipe (the pattern matching does remind me a bit of TableGen, tbh). What I wonder is: if it looks like leveraging VPlan requires some kind of pattern matching anyway, be it in the cost model for correct cost calculation, or to detect or transform patterns of VPInstructions, wouldn't it conceptually make sense to explicitly define something like a grammar, instead of doing it per instruction/pattern and per application (costing/transformation)?

What are your thoughts on this?

Thanks,
Anna

On 03.11.20 11:55, David Green wrote:
> Hello
>
> Perhaps I can talk a little about the things I have at least been thinking about lately. Unfortunately a few other important things have come up and I have not had the time to do as much as I would have liked. What I have is mostly on the pragmatic end of the spectrum.
>
> So we've been thinking about this for a while, but it really starts with vector reductions. Arm MVE has some instructions that make it beneficial to put the vector reduction instruction into the loop, instead of summing in the loop and reducing after it. So instead of:
> loop:
>   a = load v16i8
>   s = s + a
> reduce(s)
>
> We can efficiently do:
> loop:
>   a = load v16i8
>   s += reduce a
>
> That is all working now, and seems to be humming along nicely in the vectorizer. It even handles tail-predicated loops. The real benefit from those instructions, though, comes from the fact that they can natively handle larger-than-legal reductions. So we have an instruction that can take 16 i8's and sum them into an i32. That would look something like this in IR:
> loop:
>   a = load v16i8
>   e = sext a to v16i32
>   s += reduce e
>
> It can also do an mla at the same time, doing reduce(mul(sext, sext)) as a single instruction. We even have i64 variants of the instruction, even though there is no i64 vector add in MVE.
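Just to check I follow the semantics: one way to model that larger-than-legal reduce(mul(sext, sext)) at the scalar level (a Python sketch, names made up, not actual MVE intrinsics) would be:

```python
def sext(v, bits):
    # Interpret the low `bits` of v as a signed two's-complement value.
    sign = 1 << (bits - 1)
    return (v & (sign - 1)) - (v & sign)

def reduce_mla_i8_to_i32(a, b):
    # reduce(mul(sext a, sext b)): widen each i8 lane, multiply
    # pairwise, and sum everything into a single i32 accumulator.
    return sum(sext(x, 8) * sext(y, 8) for x, y in zip(a, b))
```

So the whole widen-multiply-accumulate chain collapses into one instruction, which is exactly what the per-instruction costing cannot see.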
> Unfortunately all of this is both too large for the vectorizer to consider at the moment, and looks like a set of very expensive operations. Extends especially are very expensive at the moment in MVE.
>
> We can already cost model load(sext(..)) well (along with other patterns like add(sext, ..) for AArch64). We have the context parameter (plus a parameter for the type of the load now). For 2-instruction patterns this mostly works OK, if not very elegantly, provided that the context instruction can be trusted (which, with vplan making more changes, becomes a big if). Any more than 2 starts to get very awkward, though, trying to check that you are the middle mul in a reduce(mul(sext, sext)) pattern. Especially if the reduce looks like an add!
>
> I had a go at implementing some code to try and get the reductions more correct. One idea was a better way of providing context. Something like an "InstructionView", in the same way as there is an ArrayView or StringView. It gives a way to query possibly-fake instructions for the info the backend might need to make decisions, and could be backed by real instructions or vplan nodes. Another, simpler idea was to keep a list of "free instructions" that the backend could provide, and iterate backwards through the costs instead of forwards, letting the backend greedily choose patterns. I didn't much like the changes they required, though, and didn't get very far with them, as they are all very large changes that do not play well with the cost model as it currently exists.
>
> So I knew that there were some other things that I wanted to start making better in the vectorizer/cost model; maybe investigating those would give a better view onto what the best way to handle this would be. One is instructions like vmulh. They multiply and take the top half, and are common across a number of architectures. Things like vrmulh, vhadd, vrhadd, vqdmulh, vabsd, etc. can all fit into the same bucket.
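If I understand the backwards "free instructions" idea correctly, a toy version (purely hypothetical, not your actual interface) might look like this: walk the block in reverse so that when a consumer is costed, the backend can declare some of its operands free because they fold into one instruction:

```python
def cost_block(insts, cost_of, free_operands):
    # insts: instructions in program order.
    # cost_of(inst): the backend's cost for one instruction.
    # free_operands(inst): operands the backend considers free
    #   because they fold into `inst` (greedy pattern choice).
    free = set()
    total = 0
    for inst in reversed(insts):
        if inst in free:
            continue  # already claimed by a consumer's pattern
        total += cost_of(inst)
        free.update(free_operands(inst))
    return total
```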
> They conceptually do something like:
> loop:
>   a = load v16i8
>   b = load v16i8
>   ae = sext a to v16i16
>   be = sext b to v16i16
>   m = mul ae, be
>   s = shr m, 8
>   t = trunc s to v16i8
>   store t
>
> So the same issue with sign extends being expensive exists, plus the costing of those instructions will be very wrong (especially if you consider an i32 extended to an i64 on an architecture with no 64-bit operations). That I put up as https://reviews.llvm.org/D88152, mostly as a proof of concept to show how it can be done. It recognizes the pattern in vplan and turns it into a single mulh VPInstruction simply enough. The obvious problem is that we are still costing the original instructions in the loop, though. Hence the cost model really needs to be moved over to costing vplan nodes, not the original IR instructions (which leads us to https://reviews.llvm.org/D89322 making a start at that).
>
> D88152 obviously still needs some details filled in. The rough idea was to have a set of "VectorPatterns" that the backend could be queried for the cost of (like vmulh etc.). In this case we are replacing 5 instructions with 1 at a lower bitwidth, so the change is going to be fairly obvious to make if it is legal, and can be done by replacing nodes. In other cases it may be better to clone the vplan and cost them against one another. It still seems simpler in these cases (and the concepts are general enough) that combining in the vectorizer seems like a good way forward. It doesn't really help the cost model anywhere else in the compiler, though, if that is something that becomes important.
>
> We would love to be able to start doing things like costing predicated vplans against unpredicated vplans, along with things like de-interleaving the interleaving groups (kind of like SLP vectorization), costing gathers vs interleaving loads, and even possibly things like getting the vectorizer to consider lane interleaving if we can.
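For reference, one lane of that mulh sequence can be modelled in Python like this (a sketch with made-up names; the real instruction operates on whole vectors):

```python
def sext(v, bits):
    # Interpret the low `bits` of v as a signed two's-complement value.
    sign = 1 << (bits - 1)
    return (v & (sign - 1)) - (v & sign)

def mulh_lane(a, b, bits=8):
    # ae = sext a; be = sext b; m = mul ae, be
    m = sext(a, bits) * sext(b, bits)
    # s = shr m, bits; t = trunc s
    # (arithmetic shift, then keep the low `bits` again)
    return (m >> bits) & ((1 << bits) - 1)
```

Five IR instructions per lane, one machine instruction, which is why costing the pattern instruction by instruction goes so wrong.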
> All of this requires us to be costing vplans, though, at least as a first step. Some of these, like (what I have been calling) lane interleaving, certainly require more global analysis than the local checks we can do in places like instcombine or ISel, to make sure they are really profitable. It's hard to guess at their costs.
>
> I also found it very easy to introduce regressions when trying to rewrite an entire cost model, as you might imagine. Simple enough things (and actually costing predicated instructions) can have severe effects on some code. It was something I was trying to go slowly on, and hopefully do a bit at a time. We will see how that goes though.
>
> In short, I think I like the idea of vplan looking more like what the target expects it to end up as, as opposed to trying to reverse engineer or guess that in the cost model. None of this has been through review or anything like that. It was just the kinds of things we were trying to work on and hoping to improve. VPlan has a lot of potential, I believe, and it would be great to see it start solving some real problems.
>
> I'm not sure if that helped at all. Maybe it at least gave some context, even if it was a bit rambly.
>
> Looking forward to seeing what your thoughts are,
> Dave