Saito, Hideki via llvm-dev
2016-Aug-25 04:46 UTC
[llvm-dev] Questions on LLVM vectorization diagnostics
Hi, Gerolf. We've been a bit quiet for some time. After listening to feedback on the project internally and externally, we decided to take a more generally accepted community development model ---- building up through a collection of small incremental changes ---- rather than trying to make a big step forward. That change of course took a bit of time, but we are getting close to the first NFC patch, on which we hope to incrementally build up new functionality. Within a few weeks, we plan to send out the first of a series of RFCs, soon to be followed by the NFC patch for review as the first step. We are also submitting a talk about this project, plus a BoF about vector masking, to the 2016 LLVM Dev Meeting. I hope our submissions will be accepted. Looking forward to having great discussions on the mailing list, in patch review, and in person.

> I'm very interested in specific examples underlying the key design decisions.

Since the two paragraphs above aren't very useful in answering your questions, let me talk about one particular example: auto-vectorization of outer loops. I don't know whether any readers here have noticed: the ICC auto-vectorizer works inner to outer. If the inner loop is auto-vectorized, the outer loop is no longer a vectorization candidate. Currently, it does not have the ability to compare the benefit of vectorizing the outer loop against the benefit of vectorizing the inner loop(s) ---- people in academia, here's a paper opportunity. :) Oftentimes, outer loop vectorization requires massaging of the inner loop control flow ---- and the ICC vectorizer does such massaging at its underlying IR level ---- just like many of you who have implemented OpenMP SIMD, OpenCL, and other explicit vector programming models. This is fine when you know ahead of time which loop to vectorize. It's not so nice if you are trying to decide between inner loop vectorization and outer loop vectorization.
As such, one of the key design considerations was being able to "pseudo-massage inner loop control flow" without modifying the underlying IR, until the cost model decides where to vectorize. This, by itself, is a rather ambitious project; many people advised us to take many small incremental steps, and we listened. That has led to the small NFC patch mentioned above. I hope this revelation is interesting enough for many of you to stay tuned for our further development. I probably spoke too much about ICC vectorizer internals. One of the future RFCs (it'll certainly take some time to get to that point through many incremental steps) will talk about inner versus outer auto-vectorization. We hope to get there sooner rather than later.

---------------------------

Now, I have one question. Suppose we'd like to split the vectorization decision into an Analysis pass and the vectorization transformation into a Transformation pass. Is it acceptable if an Analysis pass creates new Instructions and new BasicBlocks, keeps them unreachable from the underlying IR of the Function/Loop, and passes those to the Transformation pass as part of the Analysis's internal data? We've been operating under the assumption that such Analysis pass behavior is unacceptable. Please let us know if this is a generally acceptable way for an Analysis pass to work ---- it might make our development move quicker. Why would we want to do this? As mentioned above, we need to "pseudo-massage inner loop control flow" before deciding where/whether to vectorize. I hope someone can give us a clear answer.
Thanks,
Hideki Saito
Intel Compilers and Languages

-----Original Message-----
From: ghoflehner at apple.com [mailto:ghoflehner at apple.com]
Sent: Wednesday, August 24, 2016 5:38 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: llvm-dev at lists.llvm.org; Dangeti Tharun kumar <cs15mtech11002 at iith.ac.in>; Santanu Das <cs15mtech11018 at iith.ac.in>
Subject: Re: [llvm-dev] Questions on LLVM vectorization diagnostics

Has there been a follow-up? I’m very interested in specific examples underlying the key design decisions. Specifically, I expect that you have examples that show an x% speed-up with ICC vs. clang because of XYZ in your design. Similarly, if you have examples of better diagnostics, it probably makes sense to share them.

Thanks
Gerolf

> On Jun 24, 2016, at 12:45 AM, Saito, Hideki via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi Dangeti, Ramakrishna, Adam, and Gerolf,
>
>> Yes, this is an area that needs further improvement. We have some immediate plans to make these more useful. See the recent llvm-dev threads [1], [2].
>
> It takes a lot of dedicated effort to make vectorization reports easier
> to understand by ordinary programmers (i.e., those who are not
> compiler writers). Having done that for ICC ourselves, we truly
> believe it was a good investment of resources. There are areas where
> both experts and non-experts in vectorizer development can equally
> contribute. That includes getting the source code location right and
> printing variable names (and memory references) in their source level
> representation. If anyone has data on how good LLVM is in these areas,
> we'd appreciate a pointer to such information. Otherwise, we'll study
> that when our development effort hits that area, report back, and contribute improvements.
>
>>> In our analysis we have never seen LLVM trying to vectorize outer loops.
>>> Is this well known? Is outer loop vectorization implemented in LLVM as in GCC?
>>> (http://dl.acm.org/citation.cfm?id=1454119) If not, is someone working on it?
>>
>> I heard various people mention this, but I am not sure whether actual work is already taking place.
>
> We are currently working on introducing a next generation vectorizer
> design to LLVM, aiming to support OpenMP 4.5 SIMD (i.e., including
> outer loop vectorization). I hope to be able to send an RFC on the
> high level design document to llvm-dev next month. We are currently working on an RFC for the "vectorizer's output" (IR, not diagnostics), to be discussed before the next gen design. As part of this next gen work, we'll also be working on improving diagnostics. Stay tuned.
>
>> actual work is already taking place.
>
> Yes, our hands are dirty with actual coding work to ensure that the
> high level design makes sense. :)
>
> Thanks,
> Hideki Saito (hideki dot saito at intel dot com)
> Technical Lead of Vectorizer Development
> Intel Compiler and Languages
>
> -----------------------------------------------------------------------------
> Message: 3
> Date: Thu, 23 Jun 2016 10:45:28 -0700
> From: Adam Nemet via llvm-dev <llvm-dev at lists.llvm.org>
> To: Dangeti Tharun kumar <cs15mtech11002 at iith.ac.in>
> Cc: llvm-dev at lists.llvm.org, Santanu Das <cs15mtech11018 at iith.ac.in>
> Subject: Re: [llvm-dev] Questions on LLVM vectorization diagnostics
> Message-ID: <B6F42D93-F676-4CB1-8413-A37A07490A55 at apple.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Dangeti,
>
>> On Jun 23, 2016, at 8:20 AM, Dangeti Tharun kumar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Dear LLVM Community,
>>
>> I am D Tharun Kumar, a masters student at the Indian Institute of Technology Hyderabad, working in a team to improve the current vectorizer in LLVM. As an initial study, we are studying various benchmarks to analyze and compare the vectorizing capabilities of LLVM, GCC, and ICC.
>> We found that the vectorization remarks given by LLVM are vague and brief; comparatively, GCC and ICC give detailed diagnostics.
>
> Yes, this is an area that needs further improvement. We have some immediate plans to make these more useful. See the recent llvm-dev threads [1], [2].
>
>> I am interested to know why the LLVM diagnostics are brief and not intuitive (making them less helpful)?
>
> I think it’s just lack of work, or weakness in the analyses to provide more detailed information. It would be good to file bugs for specific cases where we fall behind.
>
>> In our analysis we have never seen LLVM trying to vectorize outer loops. Is this well known? Is outer loop vectorization implemented in LLVM as in GCC? (http://dl.acm.org/citation.cfm?id=1454119) If not, is someone working on it?
>
> I heard various people mention this, but I am not sure whether actual work is already taking place.
>
>> On the TSVC benchmark suite, out of a total of 151 loops, LLVM, GCC and ICC vectorized 70, 82 and 112 loops respectively. Is the cause of LLVM's lag an inability of LLVM’s vectorizer, or are there (enabling) optimization passes running before GCC’s vectorizer that help GCC perform better?
>
> I don’t know about GCC, but I’ve seen ICC perform loop transformations more aggressively, which can increase the coverage for loop vectorization. ICC performs Loop Distribution/Fusion/Interchange, etc., by default at its highest optimization level. We have some of these passes (distribution, interchange), but they are not on by default yet.
>
> Arguably, there is also some difference between the focus areas of these compilers. I think ICC has more of an HPC focus than LLVM or GCC. We have Polly, which is geared more toward the HPC use cases.
>
>> Loop peeling to enhance vectorization is present in GCC and ICC, but the LLVM remarks don’t say anything about alignment.
>> Does LLVM have this functionality and the vectorizer simply doesn’t remark about it, or does it not have the functionality at all?
>
> We don’t have it.
>
>> Finally, we appreciate suggestions and directions for improving the vectorization framework of LLVM.
>
> This is a pretty active area. Reading up on recent llvm-dev discussions in this area would probably be helpful to you.
>
>> I would also like to know if anyone has worked or is working on improving vectorization remarks.
>
> Yes, we are. If you’re interested in working on this area, it would be good to coordinate.
>
> Adam
>
>> Regards,
>>
>> Dangeti Tharun kumar
>> M.Tech Computer Science
>> IIT Hyderabad
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> [1] http://thread.gmane.org/gmane.comp.compilers.llvm.devel/98334
> [2] http://thread.gmane.org/gmane.comp.compilers.llvm.devel/99126
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Renato Golin via llvm-dev
2016-Aug-27 14:15 UTC
[llvm-dev] Questions on LLVM vectorization diagnostics
On 25 August 2016 at 05:46, Saito, Hideki via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Now, I have one question. Suppose we'd like to split the vectorization decision as an Analysis pass and vectorization
> transformation as a Transformation pass. Is it acceptable if an Analysis pass creates new Instructions and new BasicBlocks,
> keeps them unreachable from the underlying IR of the Function/Loop, and passes those to the Transformation pass as
> part of the Analysis's internal data? We've been operating under the assumption that such Analysis pass behavior is unacceptable.

Hi Saito,

First let me say: impressive work you guys are planning for the vectoriser. Outer loop vectorisation is not an easy task, so feel free to share your ideas early and often, as that would probably mean a lot less work for you guys, too.

Regarding generation of dead code, I don't remember any pass doing this (though I haven't looked at many). Most passes do some kind of clean-up at the end, and DCE ends up getting rid of spurious things here and there, so you can't *rely* on it being there. It's even worse than metadata, which is normally left alone *unless* it needs to be destroyed; dead code is purposely destroyed.

But analysis passes shouldn't be touching code in the first place. Of course, creating additional dead code is not strictly changing code, but it could be a cause of code bloat, leaks, or worse results for other analyses. My personal view is that this is a bad move.

> Please let us know if this is a generally acceptable way for an Analysis pass to work ---- this might make our development
> move quicker. Why would we want to do this? As mentioned above, we need to "pseudo-massage inner loop control flow"
> before deciding where/whether to vectorize. I hope someone can give us a clear answer.

We discussed the split of analysis vs. transformation with Polly years ago, and it was considered "a good idea". But that relied exclusively on metadata.
So, first, the vectorisers and Polly would pass over the IR as analysis passes, leaving a trail of width/unroll factors, loop dependency trackers, recommended skew factors, etc. Then, the transformation passes (Loop/SLP/Polly) would use that information, transform the loop the best they can, and clean up the metadata, leaving only a single "width=1", which means "don't try to vectorise any more". Clean-ups as required, after the transformation pass.

The current loop vectoriser is split into three stages: validity, cost and transformation. We only check the cost if we know of a valid transformation, and we only transform if we know of a better cost than width=1. Where the cost analysis would go depends on how we arrange Polly, the Loop and SLP vectorisers, and their analysis passes. Conservatively, I'd leave the cost analysis with the transformation, so we only do it once.

The outer loop proposal, then, suffers from the cost analysis not being done at the same time as the validity analysis. It would also be a lot more complicated to pass "more than one" type of possible vectorisation technique via the same metadata structure, which will probably already be complex enough. This is the main reason why we haven't split yet.

Given that scenario of split responsibility, I'm curious about your opinion on carrying (and sharing) metadata between different vectorisation analysis passes and different transformation types.

cheers,
--renato
Michael Zolotukhin via llvm-dev
2016-Aug-30 01:34 UTC
[llvm-dev] Questions on LLVM vectorization diagnostics
Hi Hideki,

Thanks for the interesting writeup!

> On Aug 27, 2016, at 7:15 AM, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 25 August 2016 at 05:46, Saito, Hideki via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Now, I have one question. Suppose we'd like to split the vectorization decision as an Analysis pass and vectorization
>> transformation as a Transformation pass. Is it acceptable if an Analysis pass creates new Instructions and new BasicBlocks,
>> keep them unreachable from the underlying IR of the Function/Loop, and pass those to the Transformation pass as
>> part of Analysis's internal data? We've been operating under the assumption that such Analysis pass behavior is unacceptable.
>
> Hi Saito,
>
> First let me say, impressive work you guys are planning for the
> vectoriser. Outer loop vectorisation is not an easy task, so feel free
> to share your ideas early and often, as that would probably mean a lot
> less work for you guys, too.
>
> Regarding generation of dead code, I don't remember any pass doing
> this (though I haven't looked at many). Most passes do some kind of
> clean up at the end, and DCE ends up getting rid of spurious things
> here and there, so you can't *rely* on it being there. It's even worse
> than metadata, which is normally left alone *unless* it needs to be
> destroyed; dead code is purposely destroyed.
>
> But analysis passes shouldn't be touching code in the first place. Of
> course, creating additional dead code is not strictly changing code,
> but this could be a cause of code bloat, leaks, or making it worse for
> other analyses. My personal view is that this is a bad move.

While I agree with Renato, it is definitely worth mentioning LCSSA in this context. I still don’t know what we should call it: an analysis or a transformation. It can sometimes be viewed as an analysis, meaning that a pass can ‘preserve’ it (i.e., the IR is still in LCSSA form after the pass).
At the same time, LCSSA obviously can and does transform the IR, but it does so by generating ‘dead’ code: phi-nodes that can later be folded easily. So, to answer your question: I think it is OK to do some massaging of the IR before your pass, and you could use LCSSA as an example of how that can be implemented. However, creating unreachable blocks sounds a bit hacky - it looks like we’re just going to use the IR as a shadow data structure. If that’s the case, why not use a shadow data structure :-)? ScalarEvolution might be an example of how this can be done - it creates a map from IR instructions to SCEV objects.

Thanks,
Michael

>
>> Please let us know if this is a generally acceptable way for an Analysis pass to work ---- this might make our development
>> move quicker. Why we'd want to do this? As mentioned above, we need to "pseudo-massage inner loop control flow"
>> before deciding where/whether to vectorize. Hope someone can give us a clear answer.
>
> We have discussed the split of analysis vs transformation with Polly
> years ago, and it was considered "a good idea". But that relied
> exclusively on metadata.
>
> So, first, the vectorisers and Polly would pass on the IR as an
> analysis pass first, leaving a trail of width/unroll factors, loop
> dependency trackers, recommended skew factors, etc. Then, the
> transformation passes (Loop/SLP/Polly) would use that information and
> transform the loop the best they can, and clean up the metadata,
> leaving only a single "width=1", which means, "don't try to vectorise
> any more". Clean ups as required, after the transformation pass.
>
> The current loop vectoriser is split in three stages: validity, cost
> and transformation. We only check the cost if we know of any valid
> transformation, and we only transform if we know of any better cost
> than width=1. Where the cost analysis would be, depends on how we
> arrange Polly, Loop and SLP vectoriser and their analysis passes.
> Conservatively, I'd leave the cost analysis with the transformation, > so we only do it once. > > The outer loop proposal, then, suffers from the cost analysis not > being done at the same time as the validity analysis. It would also > complicate a lot to pass "more than one" types of possible > vectorisation techniques via the same metadata structure, which will > probably already be complex enough. This is the main reason why we > haven't split yet. > > Given that scenario of split responsibility, I'm curious as to your > opinion on the matter of carrying (and sharing) metadata between > different vectorisation analysis passes and different transformation > types. > > cheers, > --renato > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev