Hello,

I started working on adding vector support for the SELECT and CMP instructions in the codegen (bugs: 3384, 1784, 2314). Currently, the codegen scalarizes vector CMPs into multiple scalar CMPs. It would be easy to add similar scalarization support to the SELECT instruction. However, using multiple scalar operations is slower than using vector operations. In LLVM, vector-compare operations generate a vector of i1s, and the vector-select instruction consumes these vectors. In between, these values (masks) can be manipulated (XOR-ed, AND-ed, etc.). For x86, I would like the codegen to generate the ‘pcmpeq’ and ‘blend’ families of instructions. SSE masks are implemented using a 32-bit word per element, where the MSB is used as the predicate and the rest of the bits are ignored. I believe that PPC Altivec and ARM Neon work the same way.

I can think of two ways to represent masks in x86: sparse and packed. In the sparse method, the masks are kept in <4 x 32-bit> registers, which are mapped to xmm registers. This is the ‘native’ way of using masks. In the second representation, the packed method, the MSBs are collected from the xmm register into a packed general purpose register. Luckily, SSE has the MOVMSKPS instruction, which converts sparse masks to packed masks. I am not sure which representation is better, but both are reasonable: the former may cause register pressure in some cases, while the latter may add packing/unpacking overhead.

_Sparse_

After my discussion with Duncan last week, I started working on the promotion of the type <4 x i1> to <4 x i32>, and I ran into a problem. It looks like the codegen term ‘promote’ is overloaded. For scalars, the ‘promote’ operation converts scalars to larger bit-width scalars. For vectors, the ‘promote’ operation widens the vector to the next power of two. This is reasonable for types such as ‘<3 x float>’. Maybe we need to add another legalization operation which will mean widening the vectors?
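For illustration, the two representations and the MOVMSKPS conversion between them could be sketched in portable C like this (the helper name and the 4-lane width are mine, not from the codegen):

```c
#include <stdint.h>

/* Sparse mask: one 32-bit word per lane; only the MSB carries the
 * predicate, the remaining 31 bits are ignored.
 * Packed mask: lane i's MSB collected into bit i of a scalar register,
 * which is exactly what MOVMSKPS does for 4 x 32-bit lanes. */
static uint8_t pack_mask(const uint32_t sparse[4]) {
    uint8_t packed = 0;
    for (int i = 0; i < 4; i++)
        packed |= (uint8_t)((sparse[i] >> 31) << i); /* MSB of lane i -> bit i */
    return packed;
}
```

Note that only the sign bit matters: a lane holding 0x80000000 and a lane holding 0xFFFFFFFF pack to the same bit.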
In any case, I estimated that implementing this per-element promotion would require major changes and decided that this is not the way to go.

_Packed_

I followed Duncan’s original suggestion, which was packing vectors of i1s into general purpose registers. I started by adding several new types to ValueTypes (.td and .h): v4i1, v8i1, v16i1 … v64i1. For x86, I mapped v8i1 .. v64i1 to general purpose x86 registers. I started playing with a small program which performed a vector CMP on 4 elements. The legalizer promoted the v4i1 to the next legal pow-of-two type, which was v8i1. I changed WidenVecRes_SETCC and added a new method, WidenVecOp_Select, to handle the legalization of these types. The widening of the Select and SETCC ops was simple, since I only widened the operands which needed widening. I am not sure if this is correct, but I ran into more problems before I could test it.

Another problem that I had was that i1 types are still promoted to i8 types. So if I have a vector such as ‘4 x i1: <0, 0, 1, 1>’, it will be mapped to a ‘BUILD_VECTOR’ DAG node which accepts 4 i8s and returns a single v4i1. This fails somewhere because the cast is illegal. The desired result is that the above vector would be translated to the (packed) scalar value ‘3’. I hacked TargetLowering::ReplaceNodeResults and added minimal support for BUILD_VECTOR.

I’d be interested in hearing your suggestions on which direction(s) to proceed.

Thank you,
Nadav

---------------------------------------------------------------------
Intel Israel (74) Limited
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
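The pcmpeq/blend pattern the thread is aiming for can be sketched in C as a lane-wise compare producing all-ones/all-zeros sparse masks, followed by a bitwise blend (a sketch of the intended lowering, not actual codegen output; function names are mine):

```c
#include <stdint.h>

/* SETCC on <4 x i32>: each lane becomes all-ones (true) or all-zeros
 * (false), matching what pcmpeqd produces. */
static void cmp_eq_v4i32(const uint32_t a[4], const uint32_t b[4],
                         uint32_t mask[4]) {
    for (int i = 0; i < 4; i++)
        mask[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
}

/* SELECT with a sparse mask: per-lane bitwise blend, as a blend
 * instruction (or the and/andn/or sequence) would compute it. */
static void select_v4i32(const uint32_t mask[4], const uint32_t t[4],
                         const uint32_t f[4], uint32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (mask[i] & t[i]) | (~mask[i] & f[i]);
}
```

Because the mask lanes are all-ones or all-zeros, intermediate XOR/AND/OR manipulation of masks is just lane-wise bitwise logic on the same registers.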
"Rotem, Nadav" <nadav.rotem at intel.com> writes:

> I can think of two ways to represent masks in x86: sparse and
> packed. In the sparse method, the masks are kept in <4 x 32bit>
> registers, which are mapped to xmm registers. This is the ‘native’ way
> of using masks.

This argues for the sparse representation, I think.

> _Sparse_ After my discussion with Duncan, last week, I started working
> on the promotion of type <4 x i1> to <4 x i32>, and I ran into a
> problem. It looks like the codegen term ‘promote’ is overloaded.

Heavily. :-/

> For scalars, the ‘promote’ operation converts scalars to larger
> bit-width scalars. For vectors, the ‘promote’ operation widens the
> vector to the next power of two. This is reasonable for types such as
> ‘<3 x float>’. Maybe we need to add another legalization operation which
> will mean widening the vectors?

You mean widening the element type, correct? Yes, that's definitely a useful concept.

> In any case, I estimated that implementing this per-element promotion
> would require major changes and decided that this is not the way to
> go.

What major changes? I think this will end up giving much better code in the end. The pack/unpack operations could be very expensive.

There is another huge cost in using GPRs to hold masks. There will be fewer GPRs to hold addresses, which is a precious resource. We should avoid doing anything that uses more of that resource unnecessarily.

-Dave
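The element-type widening Dave asks about — promoting <4 x i1> to <4 x i32> — can be sketched as sign-extending each i1 to a full lane, so the MSB-based hardware predicate and the bitwise mask logic keep working unchanged (a sketch of the assumed semantics; the helper is hypothetical):

```c
#include <stdint.h>

/* Element-type promotion of <4 x i1> to <4 x i32>: each i1 is
 * sign-extended, so true becomes all-ones and false becomes zero.
 * The result is directly usable as a sparse SSE mask. */
static void promote_v4i1_to_v4i32(const int bits[4], uint32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = bits[i] ? 0xFFFFFFFFu : 0u;
}
```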
Hi David,

The MOVMSKPS instruction is cheap (2 cycles). Not to be confused with VMASKMOV, the AVX masked move, which is expensive.

One of the arguments for packing masks is that it reduces vector-register pressure. Auto-vectorizing compilers maintain multiple masks for different execution paths (for each loop nesting, etc.). Saving masks in xmm registers may create vector-register pressure, which will cause spilling of these registers. I agree with you that GP registers are also a precious resource. I am not sure what the best way to store masks is.

In my private branch, I added the [v4i1 .. v64i1] types. I also implemented a new type of target lowering: "PACK". This lowering packs vectors of i1s into integer registers. For example, the <4 x i1> type would get packed into the i8 type. I modified LegalizeTypes and LegalizeVectorTypes and added legalization for SETCC, XOR, OR, AND, and BUILD_VECTOR. I also changed the x86 lowering of SELECT to prevent lowering of selects with a vector condition operand. Next, I am going to add new patterns for SETCC and SELECT which use i8/i16/i32/i64 as a condition value.

I also plan to experiment with promoting <4 x i1> to <4 x i32>. At this point I can't really say what needs to be done. Implementing this kind of promotion also requires adding legalization support for strange vector types such as <4 x i65>.

-Nadav

-----Original Message-----
From: David A. Greene [mailto:greened at obbligato.org]
Sent: Wednesday, March 09, 2011 21:59
To: Rotem, Nadav
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Vector select/compare support in LLVM
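Under the "PACK" lowering described above, a <4 x i1> SETCC result lives in the low bits of an i8, and the XOR/OR/AND legalization becomes trivial: elementwise logic on the i1 vector is just a scalar bitwise op on the packed register (a sketch of the assumed mapping, element i in bit i; the function name is mine):

```c
#include <stdint.h>

/* SETCC (equality) on <4 x i32> with the result packed into an i8:
 * element i of the <4 x i1> result lands in bit i. Elementwise AND,
 * OR, and XOR of two such masks are then ordinary scalar &, |, ^. */
static uint8_t setcc_eq_packed(const uint32_t a[4], const uint32_t b[4]) {
    uint8_t m = 0;
    for (int i = 0; i < 4; i++)
        m |= (uint8_t)((a[i] == b[i]) << i);
    return m;
}
```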
After I implemented a new type of legalization (the packing of i1 vectors), I found that x86 does not have a way to load packed masks into SSE registers. So I guess that legalizing <4 x i1> to <4 x i32> is the way to go.

Cheers,
Nadav

-----Original Message-----
From: Rotem, Nadav
Sent: Thursday, March 10, 2011 11:04
To: 'David A. Greene'
Cc: llvmdev at cs.uiuc.edu
Subject: RE: [LLVMdev] Vector select/compare support in LLVM
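The missing operation the last message refers to — the inverse of MOVMSKPS, expanding a packed mask back into per-lane sparse words before a blend can consume it — would have to be synthesized from several instructions; in C it looks like this (a hypothetical helper illustrating the unpack cost, not an existing x86 instruction):

```c
#include <stdint.h>

/* Inverse of MOVMSKPS: expand bit i of a packed mask into lane i as
 * all-ones or all-zeros. x86 SSE has no single instruction for this,
 * which is why packing masks into GPRs adds unpack overhead on the
 * way back to a vector select. */
static void unpack_mask(uint8_t packed, uint32_t sparse[4]) {
    for (int i = 0; i < 4; i++)
        sparse[i] = ((packed >> i) & 1) ? 0xFFFFFFFFu : 0u;
}
```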