thr3ads.net - llvm dev - [LLVMdev] Modeling GPU vector registers, again (with my implementation) [Feb 2009]

If this information is useful, please help other people find it:
Share via:

Alex

2009-Feb-13 17:47 UTC

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

It seems to me that LLVM sub-register is not for the following hardware
architecture.

All instructions of a hardware are vector instructions. All registers
contains
4 32-bit FP sub-registers. They are called r0.x, r0.y, r0.z, r0.w.

Most instructions write more than one elements in this way:

  mul r0.xyw, r1, r2
  add r0.z, r3, r4
  sub r5, r0, r1

Notice that the four elements of r0 are written by two different
instructions.

My question is how should I model these sub-registers. If I treat each
component
as a register, and do the register allocation individually, it seems very
difficult to merge the scalars operations back into one vetor operation.

  // each %reg is a sub-register
  // r1, r2, r3, r4 here are virtual register number

  mul %reg1024, r1, r2  // x
  mul %reg1025, r1, r2  // y
  mul %reg1026, r1, r2  // z

  add %reg1027, r3, r4  // w

  sub %reg1028, %reg1024, r1
  sub %reg1029, %reg1025, r1
  sub %reg1030, %reg1026, r1
  sub %reg1031, %reg1027, r1

So I decided to model each 4-element register as one Register in *.td file.

Here are the details.

Since all the 4 elements of a vector register occupy the same 'alloca',
during the conversion of shader assembly to LLVM IR, I check if a vector
register is written (to different elements) by different instructions. When
the second write happens, I generate a shufflevector to multiplex the
existing value and the new value, and store the result of shufflevector.

Input assembly language:
  mul r0.xy, r1, r2
  add r0.zw, r3, r4
  sub r5, r0, r1

is converted to LLVM IR:

  %r0 = alloca <4 x float>
  %mul_1 = mul <4 x float> %r1, %r2
  store <4 x float> %mul_1, <4 x float>* %r0
  ...
  %add_1 = add <4 x float> %r3, %r4
  ; a store does not immediately happen here
  %load_1 = load <4 x float>* %r0

  ; select the first two elements from the existing value,
  ; the last two elements from the newly generated value
  %merge_1 = shufflevector <4 x float> %load_1,
                           <4 x float> %add_1,
                           <4 x i32> < i32 0, i32 1, i32 6, i32 7 >

  ; store the multiplexed value
  store <4 x float> %merge_1, <4 x float>* %r0


After mem2reg:

  %mul_1 = mul <4 x float> %r1, %r2
  %add_1 = add <4 x float> %r3, %r4
  %merge_1 = shufflevector <4 x float> %mul_1,
                           <4 x float> %add_1,
                           <4 x i32> < i32 0, i32 1, i32 6, i32 7 >


After instruction selection:

  MUL   %reg1024, %reg1025, %reg1026
  ADD   %reg1027, %reg1028, %reg1029
  MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"

The 'shufflevector' is selected to a MERGE instruction by the default
LLVM
instruction selector. The hardware doesn't have this instruction. I have a
*pre*-register allocation FunctionPass to remember:

  The phyicial regsiter allocated to the destination register of MERGE
  (%reg1030) should replace the destination register allocated to the
  destination register of MUL (%reg1024) and ADD(%reg1027).

In this way I ensure MUL and ADD write to the same physical register. This
replacement is done in the other FunctionPass *after* register allocation.

MUL and ADD have an 'OptionalDefOperand' writemask. By default the
writemask
is
"xyzw" (all elmenets are written).

  // 0xF == all elements are written by default
  def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops (i32
0xF))>

  {...}

  def MUL : MyInst<(outs REG4X32:$dst),
                   (ins  REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),

In the said post-register-allocation FunctionPass, in addition to replace
the
destination registers as described before, the writemask ($wm) of each
instruction is also replaced with the writemask operands of MERGE. So:

  MUL   %R0, %R1, %R2, "xyzw"
  ADD   %R5, %R3, %R4, "xyzw"
  MERGE %R6, %R0, "xy", %R5, "zw"

==>

  MUL   %R6, %R1, %R2, "xy"  // "xy" comes from MERGE
operand 2
  ADD   %R6, %R3, %R4, "zw"
  // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED

Final machine code:

  MUL r6.xy, r1, r2
  ADD r6.zw, r3, r4
  SUB r8, r6, r1

I don't feel very comfortable with these two very ad-hoc FunctionPass.

Alex.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20090213/ec84f395/attachment.html>

Evan Cheng

2009-Feb-13 23:05 UTC

head link

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

On Feb 13, 2009, at 9:47 AM, Alex wrote:
> It seems to me that LLVM sub-register is not for the following  
> hardware architecture.
>
> All instructions of a hardware are vector instructions. All  
> registers contains
> 4 32-bit FP sub-registers. They are called r0.x, r0.y, r0.z, r0.w.
>
> Most instructions write more than one elements in this way:
>
>   mul r0.xyw, r1, r2
>   add r0.z, r3, r4
>   sub r5, r0, r1
>
> Notice that the four elements of r0 are written by two different  
> instructions.
>
> My question is how should I model these sub-registers. If I treat  
> each component
> as a register, and do the register allocation individually, it seems  
> very
> difficult to merge the scalars operations back into one vetor  
> operation.
Well, how many possible permutations are there? Is it possible to  
model each case as a separate physical register?

Evan
>   // each %reg is a sub-register
>   // r1, r2, r3, r4 here are virtual register number
>
>   mul %reg1024, r1, r2  // x
>   mul %reg1025, r1, r2  // y
>   mul %reg1026, r1, r2  // z
>
>   add %reg1027, r3, r4  // w
>
>   sub %reg1028, %reg1024, r1
>   sub %reg1029, %reg1025, r1
>   sub %reg1030, %reg1026, r1
>   sub %reg1031, %reg1027, r1
>
> So I decided to model each 4-element register as one Register in  
> *.td file.
>
> Here are the details.
>
> Since all the 4 elements of a vector register occupy the same  
> 'alloca',
> during the conversion of shader assembly to LLVM IR, I check if a  
> vector
> register is written (to different elements) by different  
> instructions. When
> the second write happens, I generate a shufflevector to multiplex the
> existing value and the new value, and store the result of  
> shufflevector.
>
> Input assembly language:
>   mul r0.xy, r1, r2
>   add r0.zw, r3, r4
>   sub r5, r0, r1
>
> is converted to LLVM IR:
>
>   %r0 = alloca <4 x float>
>   %mul_1 = mul <4 x float> %r1, %r2
>   store <4 x float> %mul_1, <4 x float>* %r0
>   ...
>   %add_1 = add <4 x float> %r3, %r4
>   ; a store does not immediately happen here
>   %load_1 = load <4 x float>* %r0
>
>   ; select the first two elements from the existing value,
>   ; the last two elements from the newly generated value
>   %merge_1 = shufflevector <4 x float> %load_1,
>                            <4 x float> %add_1,
>                            <4 x i32> < i32 0, i32 1, i32 6, i32 7
>
>
>   ; store the multiplexed value
>   store <4 x float> %merge_1, <4 x float>* %r0
>
>
> After mem2reg:
>
>   %mul_1 = mul <4 x float> %r1, %r2
>   %add_1 = add <4 x float> %r3, %r4
>   %merge_1 = shufflevector <4 x float> %mul_1,
>                            <4 x float> %add_1,
>                            <4 x i32> < i32 0, i32 1, i32 6, i32 7
>
>
>
> After instruction selection:
>
>   MUL   %reg1024, %reg1025, %reg1026
>   ADD   %reg1027, %reg1028, %reg1029
>   MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"
>
> The 'shufflevector' is selected to a MERGE instruction by the  
> default LLVM
> instruction selector. The hardware doesn't have this instruction. I  
> have a
> *pre*-register allocation FunctionPass to remember:
>
>   The phyicial regsiter allocated to the destination register of MERGE
>   (%reg1030) should replace the destination register allocated to the
>   destination register of MUL (%reg1024) and ADD(%reg1027).
>
> In this way I ensure MUL and ADD write to the same physical  
> register. This
> replacement is done in the other FunctionPass *after* register  
> allocation.
>
> MUL and ADD have an 'OptionalDefOperand' writemask. By default the
> writemask is
> "xyzw" (all elmenets are written).
>
>   // 0xF == all elements are written by default
>   def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops  
> (i32 0xF))>
>   {...}
>
>   def MUL : MyInst<(outs REG4X32:$dst),
>                    (ins  REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),
>
> In the said post-register-allocation FunctionPass, in addition to  
> replace the
> destination registers as described before, the writemask ($wm) of each
> instruction is also replaced with the writemask operands of MERGE. So:
>
>   MUL   %R0, %R1, %R2, "xyzw"
>   ADD   %R5, %R3, %R4, "xyzw"
>   MERGE %R6, %R0, "xy", %R5, "zw"
>
> ==>
>
>   MUL   %R6, %R1, %R2, "xy"  // "xy" comes from MERGE
operand 2
>   ADD   %R6, %R3, %R4, "zw"
>   // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED
>
> Final machine code:
>
>   MUL r6.xy, r1, r2
>   ADD r6.zw, r3, r4
>   SUB r8, r6, r1
>
> I don't feel very comfortable with these two very ad-hoc FunctionPass.
>
> Alex.
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20090213/0dc19c01/attachment.html>

Alex

2009-Feb-16 10:32 UTC

head link

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

Evan Cheng-2 wrote:> 
> Well, how many possible permutations are there? Is it possible to  
> model each case as a separate physical register?
> 
> Evan
> 
I don't think so. There are 4x4x4x4 = 256 permutations. For example:

* xyzw: default
* zxyw
* yyyy: splat

Even if can model each of these 256 cases as a separate physical register,
how can I model the use of r0.xyzw in the following example:

// dp4 = dot product 4-element
dp4 r0.x, r1, r2
dp4 r0.y, r3, r4
dp4 r0.z, r5, r6
dp4 r0.w, r7, r8
sub r5, r0.xyzw, r6


-- 
View this message in context:
http://www.nabble.com/Modeling-GPU-vector-registers%2C-again-%28with-my-implementation%29-tp22001613p22034856.html
Sent from the LLVM - Dev mailing list archive at Nabble.com.

David Greene

2009-Feb-19 01:19 UTC

head link

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

On Friday 13 February 2009 11:47, Alex wrote:> It seems to me that LLVM sub-register is not for the following hardware
> architecture.
>
> All instructions of a hardware are vector instructions. All registers
> contains
> 4 32-bit FP sub-registers. They are called r0.x, r0.y, r0.z, r0.w.
>
> Most instructions write more than one elements in this way:
>
>   mul r0.xyw, r1, r2
>   add r0.z, r3, r4
>   sub r5, r0, r1
>
> Notice that the four elements of r0 are written by two different
> instructions.
>
> My question is how should I model these sub-registers. If I treat each
> component
> as a register, and do the register allocation individually, it seems very
> difficult to merge the scalars operations back into one vetor operation.
This is a very good use case for vector masks in LLVM.  Expressing this as 
two masked operations and a merge:

** Warning, pseudo-LLVM code ***
mul r0, r1, r2, [1101] ; [xy_w]
add r6, r3, r4, [0010] ; [__z_]
** The assumption here is that masked elements are undefined, so we need a 
merge **
select r0, r0, r6, [1101] ; Select 1's from r0, 0's from r6, merge
sub r5, r0, r1, [1111] ; Or have no mask == full mask

The registers are just vector registers then.  They don't have component 
pieces.  Regalloc will have no problem with them.

The MachineInstrs for your architecture would have to preserve the mask 
semantics.  In the AsmPrinter for your architecture, it would be a simple 
matter to dump out the mask as the field specifier on a register name.

The masks would allow you to get rid of the shufflevector stuff.  Since you 
don't have a hardware merge instruction you could keep your pre- and 
post-regalloc passes to rewrite things or a very simple post-regalloc peephole 
pass could examine the masks of the merge and rewrite the registers in the 
defs without a pre-regalloc pass needed to remember things.

Alas, we do not have masks in LLVM just yet.  But I'm getting to the point 
where I'm ready to restart that discussion.  :)

This also won't directly handle the more general case of swizzling:

r0.wyzx = ...

But a "regular" masked operation followed by a shufflevector should do
it.

                                      -Dave

Seemingly Similar Threads

Search for more seemingly similar threads

llvm dev - Feb 2009 - [LLVMdev] Modeling GPU vector registers, again (with my implementation)

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

[LLVMdev] Modeling GPU vector registers, again (with my implementation)

Seemingly Similar Threads