thr3ads.net - llvm dev - [llvm-dev] sum elements in the vector [May 2016]

Saito, Hideki via llvm-dev
2016-May-19 19:16 UTC
[llvm-dev] sum elements in the vector

Chandler> I'm starting to think we should directly implement horizontal
operations on vector types.

+1.

Inputs from our experience:

Anything that can be used as a reduction operator should have such support, as
should the vector of booleans resulting from a comparison of vector values.
This includes MIN/MAX, which is currently represented as a compare/select pair
in the IR.
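
As a rough illustration of the point above (not code from the thread), the
sketch below writes a horizontal MAX as nested compare/select pairs, and a
horizontal OR over the booleans produced by a lane-wise compare. The names
Vec4, select_max, horizontal_max and any_lane_greater are hypothetical, and a
4-lane integer vector is assumed.

```cpp
#include <array>
#include <cstddef>

// Assumption: a 4-lane integer vector stands in for a target vector register.
using Vec4 = std::array<int, 4>;

// Scalar MAX written the way the IR models it: a compare, then a select.
int select_max(int a, int b) {
    bool a_is_greater = a > b;    // compare
    return a_is_greater ? a : b;  // select
}

// Horizontal MAX over all four lanes, built only from compare/select pairs.
int horizontal_max(const Vec4 &v) {
    return select_max(select_max(v[0], v[1]), select_max(v[2], v[3]));
}

// Horizontal OR over the boolean vector produced by a lane-wise compare:
// returns true if any lane of `v` is greater than `limit`.
bool any_lane_greater(const Vec4 &v, int limit) {
    bool any = false;
    for (std::size_t i = 0; i < v.size(); ++i)
        any = any || (v[i] > limit);
    return any;
}
```

A first-class horizontal operation would let both functions be expressed as a
single node instead of a tree that each target has to rediscover.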

In general, the best code sequence to perform a horizontal operation depends on
the target micro-architecture (i.e., the optimal code sequence may differ even
within the same ISA). As such, there is merit in keeping horizontal operations
as they are until the compiler is ready to perform micro-architectural
optimization. For targets that do not have such characteristics, generic
lowering may be sufficient.

It is also good to have an FP-value-safe variant of the horizontal operation,
so that it can be used without the fast-math flag. It would certainly be
slower, but there are enough people who consider bitwise identity of FP
computation more important than a bit of speed.
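
To make the FP concern concrete, here is a minimal sketch (an illustration,
not code from the thread) that sums the same four floats in strict
left-to-right order and in pairwise (tree) order. The function names are
hypothetical; IEEE-754 single precision without excess intermediate precision
and without fast-math reassociation is assumed.

```cpp
#include <array>

// Strict left-to-right sum: the same rounding sequence as the scalar loop,
// which is what an FP-value-safe horizontal add would have to preserve.
float sum_ordered(const std::array<float, 4> &v) {
    float s = 0.0f;
    for (float x : v)
        s += x;
    return s;
}

// Pairwise (tree) sum: the order a fast horizontal add would typically use.
float sum_pairwise(const std::array<float, 4> &v) {
    float lo = v[0] + v[1];
    float hi = v[2] + v[3];
    return lo + hi;
}
```

For an input such as {1e8f, 1.0f, -1e8f, 1.0f} the two orders round
differently, because 1.0f is lost whenever it is added to a value of magnitude
1e8. That difference is why the tree form needs fast-math permission.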

Thanks,
Hideki Saito
Intel Compiler and Languages

----------------------
Date: Wed, 18 May 2016 22:35:49 +0100
From: "Martin J. O'Riordan via llvm-dev" <llvm-dev at
lists.llvm.org>
To: "'Rail Shafigulin'" <rail at esenciatech.com>
Cc: 'LLVM Developers' <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] sum elements in the vector
Message-ID: <007601d1b14d$467b2a10$d3717e30$@movidius.com>
Content-Type: text/plain; charset="utf-8"

That’s how the pattern is expressed, but it selects a single instruction.  I'm
not sure how TableGen does this, but it does match that particular IR pattern to
the single instruction without actually performing the element-by-element
extract.  It is crude though, and is very particular about specific IR
structures.  If it were elevated to a true ISD node supported by a C++ expressed
IR normalisation, then it would catch a wider number of variants, such as the
classic for loop you show.  Other patterns simply reduce to element extract and
scalar ADD, but my pattern catches a common case.  Horizontal patterns as first
class citizens would catch a broader range of IR forms.

 

            MartinO

 

From: Rail Shafigulin [mailto:rail at esenciatech.com]
Sent: 18 May 2016 18:55
To: Martin J. O'Riordan <martin.oriordan at movidius.com>
Cc: LLVM Developers <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] sum elements in the vector

 

 

 

On Wed, May 18, 2016 at 5:56 AM, Martin J. O'Riordan
<martin.oriordan at movidius.com> wrote:

Hi Rail,

 

We used a very simple pattern expansion (actually, not a pattern fragment).  For
example, for AND, ADD (horizontal sum), OR and XOR of 4 elements we use
something like the following TableGen structure:

 

class HORIZ_Op4<SDNode opc, RegisterClass regVT, ValueType rt, ValueType vt,
                string asmstr> :
    SHAVE_Instr<(outs regVT:$dst), (ins VRF128:$src),
                !strconcat(asmstr, " $dst $src"),
                [(set regVT:$dst,
                  (opc (rt (vector_extract (vt VRF128:$src), 0)),
                    (opc (rt (vector_extract (vt VRF128:$src), 1)),
                      (opc (rt (vector_extract (vt VRF128:$src), 2)),
                           (rt (vector_extract (vt VRF128:$src), 3))
                      )
                    )
                  )
                )]>;

 

This is okay for 4 element vectors, and it will get selected if the programmer
writes something like:

 

vec[0] & vec[1] & vec[2] & vec[3]

 

but not with a simple variant like:

 

vec[0] & vec[2] & vec[1] & vec[3]

 

If this was properly represented by an ISD node, the other permutations could be
more easily handled through normalisation.  We “could” write patterns for each
of the permutations, but it is verbose, and in practice most people only write
it one way anyway.
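
A minimal sketch of the kind of normalisation meant here, written as
standalone C++ rather than against the LLVM API. It assumes the chain of one
associative, commutative operation has already been flattened into the list of
lane indices it extracts; the helper name is hypothetical. It is only valid
for operations where reordering is allowed (integer AND/OR/XOR/ADD, or FP with
reassociation permitted).

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: `lanes` holds the element indices read by a flattened
// chain of one associative, commutative op, in source order. Returns true if
// the chain reads every lane of a `num_lanes`-wide vector exactly once, i.e.
// it is a full horizontal reduction regardless of the order written.
bool is_full_horizontal_reduction(std::vector<unsigned> lanes,
                                  unsigned num_lanes) {
    if (lanes.size() != num_lanes)
        return false;
    std::sort(lanes.begin(), lanes.end());  // canonical order: 0, 1, 2, ...
    for (unsigned i = 0; i < num_lanes; ++i)
        if (lanes[i] != i)
            return false;
    return true;
}
```

With this check, vec[0] & vec[1] & vec[2] & vec[3] and
vec[0] & vec[2] & vec[1] & vec[3] both canonicalise to the lane list
0, 1, 2, 3, so one pattern would cover every permutation.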

 

The 8-lane equivalent has TableGen left thinking for quite a long time, and the
16-lane equivalent seems to hang TableGen.

 

            MartinO

 

Martin,

 

Thanks for the reply. If I read the pattern correctly (and I'm not sure if I
do), then you are extracting data from the vector first and then performing an
operation. What I'm trying to do is place data into a vector and perform an
operation. Here is what I'm talking about:

 

Convert the following:

 

int a[] = {1, 2, 3, 4};
int sum = 0;
for (int i = 0; i < 4; i++)
  sum += a[i];

 

into

 

vector.load vector.register.0, addressOfA

horizontal.add gpr.0, vector.register.0

 

My original thought was to match a pattern of adds and then use an insert_elt
instruction, but this solution doesn't produce a good result, since it uses
more instructions than a chain of adds. Do you think there is a simple solution
to the problem that I'm describing, or do I have to make some major code
changes? As I said before, my experience with all of this is very limited, so
any help is greatly appreciated.

 

P.S. I wasn't able to find anything related to SHAVE_Instr in the LLVM
trunk. Are you guys not committing your work?

 

 

 

 

 

From: Rail Shafigulin [mailto:rail at esenciatech.com]
Sent: 16 May 2016 23:50
To: Martin J. O'Riordan
Cc: LLVM Developers


Subject: Re: [llvm-dev] sum elements in the vector

 

 

 

On Mon, May 16, 2016 at 3:11 AM, Martin J. O'Riordan via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

This would be really cool.  We have several instructions that perform horizontal
vector operations, and have to use built-ins to select them as there is no easy
way of expressing them in a TD file.  Some like SUM for a ‘v4i32’ are easy
enough to express with a pattern fragment,

Do you mind sharing how to do it with a pattern fragment? I'm not new to TD
files but all the work I've done was very simple.

 

 

SUM ‘v8i16’ takes TableGen a long time to compute, but SUM ‘v16i8’ resulted in
TableGen disappearing into itself for hours trying to reduce the patterns before
I gave up and cancelled it.

 

If there were ISD nodes for these, then it would be far simpler to express in
TableGen, and also, the pattern fragments only match a very specific form of IR
to the desired instruction.

 

The horizontal operations are particularly useful for finalising a vectorised
operation - for example I may want to compute the scalar MAX, MIN or SUM of a
large number of items.  If the number of items is divisible by the number of
vector lanes (e.g. 4, 8, or 16 in our case), then 4, 8 or 16 at a time can be
computed using normal vector operations, and then the final scalar value can be
computed using a single horizontal operation.
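
The scheme described above can be sketched in plain C++ as follows. This is an
illustration only: Vec4, horizontal_add and vectorised_sum are hypothetical
names, horizontal_add stands in for the target's single horizontal
instruction, and a 4-lane integer vector is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;
using Vec4 = std::array<int, kLanes>;

// Stand-in for the target's single horizontal-add instruction.
int horizontal_add(const Vec4 &v) {
    return v[0] + v[1] + v[2] + v[3];
}

// Sums `data`, whose size is assumed to be a multiple of kLanes.
int vectorised_sum(const std::vector<int> &data) {
    assert(data.size() % kLanes == 0);
    Vec4 acc{};  // per-lane partial sums, zero-initialised
    for (std::size_t i = 0; i < data.size(); i += kLanes)
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            acc[lane] += data[i + lane];  // one "normal" vector add per step
    return horizontal_add(acc);           // single horizontal op at the end
}
```

Only the last step needs the horizontal instruction; the loop body is ordinary
lane-wise arithmetic, so the cost of the reduction does not grow with the
number of items.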

 

            MartinO

 

From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Chandler Carruth via llvm-dev
Sent: 16 May 2016 2:16
To: Shahid, Asghar-ahmad; Rail Shafigulin; llvm-dev; Hal Finkel


Subject: Re: [llvm-dev] sum elements in the vector

 

I'm starting to think we should directly implement horizontal operations on
vector types.

 

My suspicion is that coming up with a nice model for this would help us a lot
with things like:

- Idiom recognition of reduction patterns that use horizontal arithmetic

- Ability to use horizontal operations in SLPVectorizer

- Significantly easier cost modeling of vectorizing loops with reductions in
LoopVectorize

- Other things I've not thought of?


 Curious what others think?

 

-Chandler

 

On Wed, May 11, 2016 at 10:07 PM Shahid, Asghar-ahmad via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> why in order to add this particular instruction (sum elements in a vector)
> I need to add an intrinsic?

Adding an intrinsic is not the only way, it is one of the ways, and the user
will NOT be required to invoke it specifically.

Currently LLVM does not have any instruction to directly represent “sum of
elements in a vector” and generate your particular instruction. However, you
can do it without an intrinsic by pattern matching the LLVM IR representing
“sum of elements in a vector” to your particular instruction in DAGCombiner.

 

Regards,

Shahid

 

 

From: Rail Shafigulin [mailto:rail at esenciatech.com]
Sent: Monday, May 09, 2016 11:59 PM
To: Shahid, Asghar-ahmad; llvm-dev
Cc: Das, Dibyendu


Subject: Re: [llvm-dev] sum elements in the vector

 

I'm a little confused. Here is why.

 

I was able to add a vector add instruction to my target without using any
intrinsics and without adding any new instructions to LLVM. So here is my
question: how come I managed to add a new vector instruction without adding an
intrinsic, and why do I need to add an intrinsic in order to add this
particular instruction (sum elements in a vector)?

Another question that I have is whether the compiler will be able to target
this new instruction (sum elements in a vector) if it is implemented as an
intrinsic, or whether the user will have to specifically invoke an intrinsic.

 

Pardon if questions seem dumb, I'm still learning things.

 

Any help is appreciated. 

 

On Fri, May 6, 2016 at 1:51 PM, Rail Shafigulin <rail at esenciatech.com>
wrote:

Thanks for the reply. These steps will add an instruction as an intrinsic. Is it
possible to add an actual new instruction so that a compiler could target it
during an optimization? How hard is it to do? Is that a realistic objective?

 

Rail

 

On Mon, Apr 4, 2016 at 9:02 PM, Shahid, Asghar-ahmad
<Asghar-ahmad.Shahid at amd.com> wrote:

Hi Rail,

 

We had done this for generation of the X86 PSAD (sum of absolute differences)
instruction through an LLVM intrinsic. Doing this requires the following:

1. Define an intrinsic, xyz(), for the required instruction and a
   corresponding SDNode
2. Generate the “call xyz()” IR based on the matched pattern
3. Map the “call xyz()” IR to the corresponding SDNode in
   SelectionDAGBuilder.cpp
4. Provide a default expansion of the xyz() intrinsic
5. Legalize type and/or operation
6. Provide lowering of the intrinsic/SDNode to generate your target
   instruction

 

You can visit http://llvm.org/docs/ExtendingLLVM.html for details.

 

Regards,

Shahid 

 

 

 

From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Rail Shafigulin via llvm-dev
Sent: Monday, April 04, 2016 11:00 PM
To: Das, Dibyendu
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] sum elements in the vector

 

Thanks for the pointers. I looked at the hadd instructions. They seem to do
something very similar to what I need. Unfortunately, as I said before, my LLVM
experience is limited. My understanding is that when I create a new type of
SDNode I need to specify a pattern for it, so that when LLVM analyzes the code
and sees that pattern it creates this particular node. I'm really struggling
to understand how it is done. So here are the problems that I'm having.

 

1. How do I identify the pattern that should be used?

2. How do I specify a given pattern? 

 

Do you (or someone else) mind helping me out? 

 

Any help is appreciated.

 

On Mon, Apr 4, 2016 at 9:59 AM, Das, Dibyendu <Dibyendu.Das at amd.com> wrote:

This is roughly along the lines of x86 hadd* instructions though the semantics
of hadd* may not exactly match what you are looking for. This is probably more
in line with x86/ARM SAD-like instructions but I don’t think llvm generates SAD
without intrinsics.

 

From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Rail Shafigulin via llvm-dev
Sent: Monday, April 04, 2016 9:34 AM
To: llvm-dev <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] sum elements in the vector

 

My target has an instruction that adds up all elements in the vector and stores
the result in a register. I'm trying to implement it in my compiler but
I'm not sure even where to start.

 

I did look at other targets, but they don't seem to have anything like it (I
could be wrong. My experience with LLVM is limited, so if I missed it, I'd
appreciate it if someone could point it out).

 

My understanding is that if SDNode for such an instruction doesn't exist I
have to define one. Unfortunately, I don't know how to do it. I don't
even know where to start looking. Would someone care to point me in the right
direction?

 

Any help is appreciated.


 

-- 

Rail Shafigulin

Software Engineer 
Esencia Technologies





 

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> 
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

