thr3ads.net - llvm dev - [LLVMdev] Modeling GPU vector registers, again (with my implementation) [Feb 2009]

Alex

2009-Feb-16 10:32 UTC

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

Evan Cheng-2 wrote:
> Well, how many possible permutations are there? Is it possible to
> model each case as a separate physical register?
>
> Evan

I don't think so. There are 4x4x4x4 = 256 permutations. For example:

* xyzw: default
* zxyw
* yyyy: splat
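
To make the count concrete, here is a minimal sketch that enumerates every 4-component source swizzle and applies one to a register value. Strictly speaking these are selections with repetition rather than permutations, which is why "yyyy" is allowed. The helper names (`all_swizzles`, `apply_swizzle`) are made up for illustration and do not come from any real API.

```python
from itertools import product

COMPONENTS = "xyzw"

def all_swizzles():
    """Return every 4-component source swizzle, e.g. 'xyzw', 'zxyw', 'yyyy'."""
    return ["".join(p) for p in product(COMPONENTS, repeat=4)]

def apply_swizzle(value, swizzle):
    """Read a 4-element register value through a swizzle such as 'zxyw'."""
    return tuple(value[COMPONENTS.index(c)] for c in swizzle)

if __name__ == "__main__":
    print(len(all_swizzles()))                           # 4**4 combinations
    print(apply_swizzle((1.0, 2.0, 3.0, 4.0), "zxyw"))
    print(apply_swizzle((1.0, 2.0, 3.0, 4.0), "yyyy"))   # splat of y
```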

Even if I could model each of these 256 cases as a separate physical register,
how can I model the use of r0.xyzw in the following example:

// dp4 = dot product 4-element
dp4 r0.x, r1, r2
dp4 r0.y, r3, r4
dp4 r0.z, r5, r6
dp4 r0.w, r7, r8
sub r5, r0.xyzw, r6
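
The semantics of this sequence can be sketched with a small simulation: each dp4 writes only one component of r0, and the final sub reads all four, so r0 is live as a whole register built from four partial definitions. The register contents below are arbitrary test values and the helpers (`dp4`, `write_component`, `run_example`) are hypothetical names for this sketch only.

```python
def dp4(a, b):
    """4-element dot product."""
    return sum(x * y for x, y in zip(a, b))

def write_component(reg, comp, value):
    """Return a copy of reg with one component overwritten (a write mask)."""
    out = list(reg)
    out["xyzw".index(comp)] = value
    return tuple(out)

def run_example():
    # Arbitrary inputs: register rN holds (N, N, N, N).
    regs = {f"r{i}": (float(i),) * 4 for i in range(9)}
    pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]
    # dp4 r0.x, r1, r2 ... dp4 r0.w, r7, r8: four partial writes to r0.
    for comp, (a, b) in zip("xyzw", pairs):
        regs["r0"] = write_component(regs["r0"], comp, dp4(regs[a], regs[b]))
    # sub r5, r0.xyzw, r6: reads all four components of r0 at once.
    regs["r5"] = tuple(p - q for p, q in zip(regs["r0"], regs["r6"]))
    return regs

if __name__ == "__main__":
    result = run_example()
    print(result["r0"], result["r5"])
```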



Villmow, Micah

2009-Feb-16 17:24 UTC

Alex,
  In my experience working with GPU vector registers, there is no
support for swizzles in the manner that you would normally code them,
and in my case I have 6^4 permutations on src registers and 24
combinations on the dst registers. The way I ended up handling this
was to have different register classes for 1-, 2-, 3- and 4-component
vectors. This made the generic cases very simple but still left
swizzling fairly difficult.
  To get swizzling to work you only need to handle three
SDNodes (insert_vector_elt, extract_vector_elt and build_vector) while
expanding the rest. I custom lowered those three nodes to
a target-specific node with an extra integer constant per register that
encodes the swizzle mask in 32 bits. The correct swizzles can then
be generated in the asm printer by decoding the integer constant. This
does require extra moves, but your example would end up
being something like the following:

dp4 r100, r1, r2
mov r0.x, r100 (float4 => float1 extract_vector_elt)
dp4 r101, r4, r5
mov r3.x, r101 (float4 => float1 extract_vector_elt)
iadd r6.xy__, r0.x000, r3.0x00 (float1 + float1 => float2 build_vector)
dp4 r7.x, r8, r9
<as above>
dp4 r10.x, r11, r12
<as above>
iadd r13.xy__, r7.x000, r10.0x00 (float1 + float1 => float2 build_vector)
iadd r14, r13.xy00, r6.00xy (float2 + float2 => float4 build_vector)
sub r15, r14, r9

It's not as compact and neat, but it works, and the move instructions will
get optimized away by the lower-level GPU compiler.
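
The swizzle-mask idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual backend code: the selector alphabet (`x y z w 0 1 _`), the 8-bits-per-lane packing, and every function name here are hypothetical. `decode_swizzle` plays the role of the asm printer, and `build_float2` emulates an instruction like `iadd r6.xy__, r0.x000, r3.0x00` using plain float addition with zero.

```python
# Assumed alphabet: four lanes, the constants 0 and 1, and '_' for a
# destination lane that is not written. Packing: 8 bits per lane.
SELECTORS = "xyzw01_"

def encode_swizzle(swizzle):
    """Pack a 4-character swizzle such as 'x000' into one 32-bit integer."""
    if len(swizzle) != 4:
        raise ValueError("swizzle must have exactly 4 components")
    mask = 0
    for lane, ch in enumerate(swizzle):
        mask |= SELECTORS.index(ch) << (8 * lane)  # ValueError on bad char
    return mask

def decode_swizzle(mask):
    """Inverse of encode_swizzle: recover the text an asm printer would emit."""
    return "".join(SELECTORS[(mask >> (8 * lane)) & 0xFF] for lane in range(4))

def read_source(value, mask):
    """Evaluate a source operand: select lanes or the constants 0 and 1."""
    out = []
    for ch in decode_swizzle(mask):
        if ch in "xyzw":
            out.append(value["xyzw".index(ch)])
        elif ch in "01":
            out.append(float(ch))
        else:
            raise ValueError("'_' is only meaningful on destinations")
    return tuple(out)

def build_float2(a, b):
    """Merge the x lanes of a and b into lanes x and y of one register."""
    lhs = read_source(a, encode_swizzle("x000"))
    rhs = read_source(b, encode_swizzle("0x00"))
    return tuple(l + r for l, r in zip(lhs, rhs))

if __name__ == "__main__":
    mask = encode_swizzle("0x00")
    print(hex(mask), decode_swizzle(mask))
    print(build_float2((3.0, 0.0, 0.0, 0.0), (5.0, 0.0, 0.0, 0.0)))
```

Keeping the mask as an ordinary integer operand means instruction selection never has to know about the individual swizzles; only the printing step interprets it.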

Hope this helps,
Micah


Scott Michel

2009-Feb-19 00:57 UTC

On Feb 16, 2009, at 9:24 AM, Villmow, Micah wrote:
>   In order to get swizzling to work you only need to handle three
> SDNodes, insert_vector_elt, extract_vector_elt and build_vector while
> expanding the rest. For those three nodes I then custom lowered them to
> a target specific node with an extra integer constant per register that
> would encode the swizzle mask in 32bits.

Villmow, Micah:

This problem is an argument for making SDNode target polymorphic. If
SDNodes were target polymorphic, a target-specific node would be
largely unnecessary. (By a target-specific node, I mean extending the
ISD enumeration for your target.) A target-polymorphic SDNode would
still capture all of the behaviors and attributes of
insert_vector_elt, extract_vector_elt and build_vector, but would also
allow you to add behaviors and attributes, which is mostly
the point of object-oriented programming. Assuming you don't need to
do extra DAGCombine work, you would get that for free from the parent
class.
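
As a rough illustration of the polymorphism argument (and nothing more), the sketch below shows a generic node class and a target subclass that adds a swizzle attribute while inheriting the generic behavior. None of these classes exist in LLVM; `BuildVectorNode`, `SwizzledBuildVectorNode`, `is_splat` and `describe` are all hypothetical names.

```python
class BuildVectorNode:
    """Hypothetical generic node; NOT LLVM's actual SDNode class."""

    def __init__(self, operands):
        self.operands = list(operands)

    def opcode(self):
        return "build_vector"

    def is_splat(self):
        """A generic query that a combine pass could rely on."""
        return len(set(self.operands)) == 1


class SwizzledBuildVectorNode(BuildVectorNode):
    """Hypothetical target subclass: same behavior plus a swizzle attribute."""

    def __init__(self, operands, swizzle):
        super().__init__(operands)
        self.swizzle = swizzle


def describe(node):
    """Generic code that only knows the base-class interface."""
    return f"{node.opcode()} splat={node.is_splat()}"


if __name__ == "__main__":
    node = SwizzledBuildVectorNode(["a", "a", "a", "a"], "xxxx")
    print(describe(node), node.swizzle)
```

Generic code such as `describe` keeps working on the subclass unchanged, which is the "for free from the parent class" property; no new opcode value is needed to carry the extra attribute.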

Unfortunately, that would mean a lot of work at this juncture and a
heavy overhaul of SelectionDAGNodes.h and the associated SelectionDAG
source. Node allocation, in particular, would become more complicated
(but not unsolvable).

Were anyone to tackle this problem, the solution would have to
be largely incremental, i.e., the source can't be overhauled all at
once, but should permit an incremental transition of SDNodes to a
behavioral interface style.

-scooter
