Alex
2009-Feb-13 17:47 UTC
[LLVMdev] Modeling GPU vector registers, again (with my implementation)
It seems to me that LLVM's sub-register support is not designed for the following hardware architecture.

All instructions of the hardware are vector instructions. Every register contains four 32-bit FP sub-registers, called r0.x, r0.y, r0.z, and r0.w.

Most instructions write more than one element, like this:

    mul r0.xyw, r1, r2
    add r0.z, r3, r4
    sub r5, r0, r1

Notice that the four elements of r0 are written by two different instructions.

My question is how I should model these sub-registers. If I treat each component as a register and do register allocation on each one individually, it seems very difficult to merge the scalar operations back into one vector operation.

    // each %reg is a sub-register
    // r1, r2, r3, r4 here are virtual register numbers

    mul %reg1024, r1, r2 // x
    mul %reg1025, r1, r2 // y
    mul %reg1026, r1, r2 // w

    add %reg1027, r3, r4 // z

    sub %reg1028, %reg1024, r1
    sub %reg1029, %reg1025, r1
    sub %reg1030, %reg1026, r1
    sub %reg1031, %reg1027, r1

So I decided to model each 4-element register as one Register in the *.td file.

Here are the details.

Since all 4 elements of a vector register occupy the same 'alloca', during the conversion of shader assembly to LLVM IR I check whether a vector register is written (to different elements) by different instructions. When the second write happens, I generate a shufflevector to multiplex the existing value and the new value, and store the result of the shufflevector.

Input assembly language:

    mul r0.xy, r1, r2
    add r0.zw, r3, r4
    sub r5, r0, r1

is converted to LLVM IR:

    %r0 = alloca <4 x float>
    %mul_1 = mul <4 x float> %r1, %r2
    store <4 x float> %mul_1, <4 x float>* %r0
    ...
    %add_1 = add <4 x float> %r3, %r4
    ; a store does not immediately happen here
    %load_1 = load <4 x float>* %r0

    ; select the first two elements from the existing value,
    ; the last two elements from the newly generated value
    %merge_1 = shufflevector <4 x float> %load_1,
                             <4 x float> %add_1,
                             <4 x i32> < i32 0, i32 1, i32 6, i32 7 >

    ; store the multiplexed value
    store <4 x float> %merge_1, <4 x float>* %r0

After mem2reg:

    %mul_1 = mul <4 x float> %r1, %r2
    %add_1 = add <4 x float> %r3, %r4
    %merge_1 = shufflevector <4 x float> %mul_1,
                             <4 x float> %add_1,
                             <4 x i32> < i32 0, i32 1, i32 6, i32 7 >

After instruction selection:

    MUL %reg1024, %reg1025, %reg1026
    ADD %reg1027, %reg1028, %reg1029
    MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"

The 'shufflevector' is selected to a MERGE instruction by the default LLVM instruction selector. The hardware doesn't have this instruction. I have a *pre*-register-allocation FunctionPass that records the following:

    The physical register allocated to the destination of MERGE (%reg1030)
    should replace the physical registers allocated to the destinations of
    MUL (%reg1024) and ADD (%reg1027).

In this way I ensure that MUL and ADD write to the same physical register. The replacement itself is done in a second FunctionPass *after* register allocation.

MUL and ADD have an 'OptionalDefOperand' writemask. By default the writemask is "xyzw" (all elements are written).

    // 0xF == all elements are written by default
    def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops (i32 0xF))>
    {...}

    def MUL : MyInst<(outs REG4X32:$dst),
                     (ins REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),

In the post-register-allocation FunctionPass, in addition to replacing the destination registers as described above, the writemask ($wm) of each instruction is also replaced with the corresponding writemask operand of MERGE. So:

    MUL %R0, %R1, %R2, "xyzw"
    ADD %R5, %R3, %R4, "xyzw"
    MERGE %R6, %R0, "xy", %R5, "zw"

==>

    MUL %R6, %R1, %R2, "xy" // "xy" comes from MERGE operand 2
    ADD %R6, %R3, %R4, "zw"
    // MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED

Final machine code:

    MUL r6.xy, r1, r2
    ADD r6.zw, r3, r4
    SUB r8, r6, r1

I don't feel very comfortable with these two very ad-hoc FunctionPasses.

Alex.
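Editor's sketch: the post-register-allocation rewrite described in the message above can be modelled in a few lines of Python. This is not LLVM code and not Alex's pass. The `Inst` record, `fold_merges`, and `fmt` are hypothetical names invented for illustration, and the sketch assumes that each MERGE input has exactly one defining instruction and no other uses.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Inst:
    """Hypothetical stand-in for a machine instruction after register allocation."""
    op: str
    dst: str
    srcs: tuple
    wm: str = "xyzw"  # write mask; the default writes all four lanes


def fold_merges(insts):
    """Fold every MERGE into the instructions that define its two inputs."""
    # Map each register feeding a MERGE to (MERGE destination, lanes taken from it).
    redirect = {}
    for inst in insts:
        if inst.op == "MERGE":
            src_a, mask_a, src_b, mask_b = inst.srcs
            redirect[src_a] = (inst.dst, mask_a)
            redirect[src_b] = (inst.dst, mask_b)

    out = []
    for inst in insts:
        if inst.op == "MERGE":
            continue  # the hardware has no MERGE, so it is dropped
        if inst.dst in redirect:
            new_dst, mask = redirect[inst.dst]
            # Retarget the def to the MERGE destination and narrow its write mask.
            inst = replace(inst, dst=new_dst, wm=mask)
        out.append(inst)
    return out


def fmt(inst):
    """Print the write mask as a field specifier on the destination register."""
    suffix = "" if inst.wm == "xyzw" else "." + inst.wm
    return f"{inst.op} {inst.dst}{suffix}, {', '.join(inst.srcs)}"


program = [
    Inst("MUL", "R0", ("R1", "R2")),
    Inst("ADD", "R5", ("R3", "R4")),
    Inst("MERGE", "R6", ("R0", "xy", "R5", "zw")),
    Inst("SUB", "R8", ("R6", "R1")),
]
lines = [fmt(i) for i in fold_merges(program)]
for line in lines:
    print(line)
```

The sketch only rewrites definitions. A real pass would also have to check that nothing else reads the original destinations of MUL and ADD before retargeting them, which is exactly the kind of bookkeeping the pre-register-allocation pass is described as doing.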
Evan Cheng
2009-Feb-13 23:05 UTC
[LLVMdev] Modeling GPU vector registers, again (with my implementation)
On Feb 13, 2009, at 9:47 AM, Alex wrote:

> It seems to me that LLVM sub-register is not for the following
> hardware architecture.
>
> All instructions of a hardware are vector instructions. All
> registers contain 4 32-bit FP sub-registers. They are called
> r0.x, r0.y, r0.z, r0.w.
>
> Most instructions write more than one element in this way:
>
> mul r0.xyw, r1, r2
> add r0.z, r3, r4
> sub r5, r0, r1
>
> Notice that the four elements of r0 are written by two different
> instructions.
>
> My question is how should I model these sub-registers. If I treat
> each component as a register, and do the register allocation
> individually, it seems very difficult to merge the scalar
> operations back into one vector operation.

Well, how many possible permutations are there? Is it possible to model each case as a separate physical register?

Evan
Alex
2009-Feb-16 10:32 UTC
[LLVMdev] Modeling GPU vector registers, again (with my implementation)
Evan Cheng-2 wrote:

> Well, how many possible permutations are there? Is it possible to
> model each case as a separate physical register?
>
> Evan

I don't think so. There are 4x4x4x4 = 256 permutations. For example:

* xyzw: default
* zxyw
* yyyy: splat

Even if I could model each of these 256 cases as a separate physical register, how would I model the use of r0.xyzw in the following example?

    // dp4 = 4-element dot product
    dp4 r0.x, r1, r2
    dp4 r0.y, r3, r4
    dp4 r0.z, r5, r6
    dp4 r0.w, r7, r8
    sub r5, r0.xyzw, r6
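Editor's sketch: the 4x4x4x4 count above can be checked by enumeration. The snippet below also counts destination write masks for contrast, on the assumption (taken from the "xy", "zw", "xyw" examples in this thread, not from any hardware manual) that a write mask is a non-empty subset of lanes kept in x, y, z, w order. `apply_swizzle` is a hypothetical helper.

```python
from itertools import product

LANES = "xyzw"

# Every source swizzle: each of the 4 output lanes may read any of the 4 input lanes.
swizzles = ["".join(p) for p in product(LANES, repeat=4)]

# Every non-empty write mask: each lane is either written or left alone.
write_masks = [
    "".join(lane for lane, bit in zip(LANES, bits) if bit)
    for bits in product((0, 1), repeat=4)
    if any(bits)
]


def apply_swizzle(vec, swizzle):
    """Return the 4-tuple obtained by reading vec through a swizzle such as 'zxyw'."""
    return tuple(vec[LANES.index(lane)] for lane in swizzle)


print(len(swizzles), len(write_masks))
print(apply_swizzle((1.0, 2.0, 3.0, 4.0), "zxyw"))
print(apply_swizzle((1.0, 2.0, 3.0, 4.0), "yyyy"))
```

Under that assumption, enumerating write masks is cheap while enumerating swizzles is not. Neither count answers the dp4 example, where one full-register read depends on four separate partial writes.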
David Greene
2009-Feb-19 01:19 UTC
[LLVMdev] Modeling GPU vector registers, again (with my implementation)
On Friday 13 February 2009 11:47, Alex wrote:

> It seems to me that LLVM sub-register is not for the following hardware
> architecture.
>
> All instructions of a hardware are vector instructions. All registers
> contain 4 32-bit FP sub-registers. They are called r0.x, r0.y, r0.z, r0.w.
>
> Most instructions write more than one element in this way:
>
> mul r0.xyw, r1, r2
> add r0.z, r3, r4
> sub r5, r0, r1
>
> Notice that the four elements of r0 are written by two different
> instructions.
>
> My question is how should I model these sub-registers. If I treat each
> component as a register, and do the register allocation individually,
> it seems very difficult to merge the scalar operations back into one
> vector operation.

This is a very good use case for vector masks in LLVM. Expressing this as two masked operations and a merge:

    ** Warning, pseudo-LLVM code **
    mul r0, r1, r2, [1101] ; [xy_w]
    add r6, r3, r4, [0010] ; [__z_]
    ** The assumption here is that masked elements are undefined, so we need a merge **
    select r0, r0, r6, [1101] ; Select 1's from r0, 0's from r6, merge
    sub r5, r0, r1, [1111] ; Or have no mask == full mask

The registers are just vector registers then. They don't have component pieces. Regalloc will have no problem with them.

The MachineInstrs for your architecture would have to preserve the mask semantics. In the AsmPrinter for your architecture, it would be a simple matter to dump out the mask as the field specifier on a register name.

The masks would allow you to get rid of the shufflevector stuff.

Since you don't have a hardware merge instruction, you could keep your pre- and post-regalloc passes to rewrite things. Alternatively, a very simple post-regalloc peephole pass could examine the masks of the merge and rewrite the registers in the defs, with no pre-regalloc pass needed to remember things.

Alas, we do not have masks in LLVM just yet. But I'm getting to the point where I'm ready to restart that discussion. :)

This also won't directly handle the more general case of swizzling:

    r0.wyzx = ...

But a "regular" masked operation followed by a shufflevector should do it.

-Dave
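Editor's sketch: the masked-operation semantics in the pseudo-code above can be emulated in plain Python to check that the select really reassembles r0. This is only an emulation. `masked_op` and `select` are hypothetical helpers, `None` stands in for an undefined lane, and mask bits are listed in x, y, z, w order as in the pseudo-code.

```python
from operator import add, mul, sub


def masked_op(fn, a, b, mask):
    """Apply fn lane by lane; lanes whose mask bit is 0 are undefined (None)."""
    return [fn(x, y) if m else None for x, y, m in zip(a, b, mask)]


def select(a, b, mask):
    """Take lanes from a where the mask bit is 1 and from b where it is 0."""
    return [x if m else y for x, y, m in zip(a, b, mask)]


r1 = [1.0, 2.0, 3.0, 4.0]
r2 = [2.0, 2.0, 2.0, 2.0]
r3 = [10.0, 20.0, 30.0, 40.0]
r4 = [1.0, 1.0, 1.0, 1.0]

XYW = [1, 1, 0, 1]   # [xy_w]
Z = [0, 0, 1, 0]     # [__z_]
FULL = [1, 1, 1, 1]

r0_partial = masked_op(mul, r1, r2, XYW)   # z lane undefined
r6 = masked_op(add, r3, r4, Z)             # x, y, w lanes undefined
r0 = select(r0_partial, r6, XYW)           # merge: every lane now defined
r5 = masked_op(sub, r0, r1, FULL)

print(r0_partial, r6, r0, r5)
```

Because the two masks are complementary, the select never picks an undefined lane, which is what would let a peephole pass drop the merge and have both defs write the same physical register.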