Alex
2009-Feb-16 10:32 UTC
[LLVMdev] Modeling GPU vector registers, again (with my implementation)
Evan Cheng-2 wrote:
> Well, how many possible permutations are there? Is it possible to
> model each case as a separate physical register?
>
> Evan

I don't think so. There are 4x4x4x4 = 256 permutations. For example:

* xyzw: default
* zxyw
* yyyy: splat

Even if I can model each of these 256 cases as a separate physical register, how can I model the use of r0.xyzw in the following example?

    // dp4 = 4-element dot product
    dp4 r0.x, r1, r2
    dp4 r0.y, r3, r4
    dp4 r0.z, r5, r6
    dp4 r0.w, r7, r8
    sub r5, r0.xyzw, r6

--
View this message in context: http://www.nabble.com/Modeling-GPU-vector-registers%2C-again-%28with-my-implementation%29-tp22001613p22034856.html
Sent from the LLVM - Dev mailing list archive at Nabble.com.
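The count of 256 comes from each of the four lanes independently choosing one of x, y, z, w, with repeats allowed (so strictly these are selections rather than permutations). A minimal sketch of that enumeration follows; the function names `allSwizzles` and `swizzleIndex` are made up for illustration and are not part of LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Enumerate every 4-lane swizzle over the components x, y, z, w.
// Each lane picks a component independently, so repeats such as "yyyy" occur.
std::vector<std::string> allSwizzles() {
  const std::string comps = "xyzw";
  std::vector<std::string> out;
  for (char a : comps)
    for (char b : comps)
      for (char c : comps)
        for (char d : comps)
          out.push_back(std::string{a, b, c, d});
  return out;
}

// Position of a swizzle in that enumeration: read it as a 4-digit base-4 number
// with x = 0, y = 1, z = 2, w = 3 (an arbitrary numbering chosen for this sketch).
unsigned swizzleIndex(const std::string &s) {
  const std::string comps = "xyzw";
  assert(s.size() == 4 && "a swizzle names exactly four lanes");
  unsigned idx = 0;
  for (char c : s) {
    std::size_t pos = comps.find(c);
    assert(pos != std::string::npos && "component must be x, y, z or w");
    idx = idx * 4 + static_cast<unsigned>(pos);
  }
  return idx;
}
```

Giving each of these indices its own physical register name is what the "one register per swizzle" idea would amount to, which is why it still does not describe four separate writes to r0.x through r0.w followed by one read of r0.xyzw.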
Villmow, Micah
2009-Feb-16 17:24 UTC
[LLVMdev] Modeling GPU vector registers, again (with my implementation)
Alex,

From my experience working with GPU vector registers, there is no support for swizzles in the manner that you would normally code them, and in my case I have 6^4 permutations on src registers and 24 combinations in the dst registers. The way that I ended up handling this was to have different register classes for 1-, 2-, 3- and 4-component vectors. This made the generic cases very simple but still made swizzling fairly difficult.

In order to get swizzling to work you only need to handle three SDNodes (insert_vector_elt, extract_vector_elt and build_vector) while expanding the rest. I custom lowered those three nodes to a target-specific node with an extra integer constant per register that encodes the swizzle mask in 32 bits. The correct swizzles can then be generated in the asm printer by decoding the integer constant.

This does require extra moves, but your example would end up being something like the following:

    dp4 r100, r1, r2
    mov r0.x, r100                      (float4 => float1 extract_vector_elt)
    dp4 r101, r4, r5
    mov r3.x, r101                      (float4 => float1 extract_vector_elt)
    iadd r6.xy__, r0.x000, r3.0x00      (float1 + float1 => float2 build_vector)
    dp4 r7.x, r8, r9                    <as above>
    dp4 r10.x, r11, r12                 <as above>
    iadd r13.xy__, r7.x000, r10.0x00    (float1 + float1 => float2 build_vector)
    iadd r14, r13.xy00, r6.00xy         (float2 + float2 => float4 build_vector)
    sub r15, r14, r9

It's not as compact and neat, but it works, and the move instructions will get optimized away by the lower-level GPU compiler.

Hope this helps,
Micah
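The idea of carrying the swizzle as a 32-bit integer constant and decoding it in the asm printer can be sketched roughly as below. The message does not give the actual bit layout, so everything here is an assumption: 8 bits per lane, lane 0 in the low byte, and selector codes for x, y, z, w, the constants 0 and 1, and `_` for an unwritten destination lane. The names `SwzSel`, `encodeSwizzle` and `decodeSwizzle` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical selector codes; the real encoding is not given in the thread.
enum SwzSel : std::uint8_t {
  SelX = 0, SelY = 1, SelZ = 2, SelW = 3,
  SelZero = 4, SelOne = 5, // constant lanes, as in "r0.x000"
  SelUnused = 6            // unwritten destination lane, as in "r6.xy__"
};

// Pack four lane selectors into one 32-bit immediate, 8 bits per lane,
// with lane 0 in the least significant byte.
std::uint32_t encodeSwizzle(SwzSel l0, SwzSel l1, SwzSel l2, SwzSel l3) {
  return std::uint32_t(l0) | (std::uint32_t(l1) << 8) |
         (std::uint32_t(l2) << 16) | (std::uint32_t(l3) << 24);
}

// What an asm printer would do with that immediate: turn it back into
// the textual suffix that follows the register name.
std::string decodeSwizzle(std::uint32_t mask) {
  static const char Names[] = {'x', 'y', 'z', 'w', '0', '1', '_'};
  std::string out = ".";
  for (int lane = 0; lane < 4; ++lane) {
    std::uint32_t code = (mask >> (8 * lane)) & 0xFFu;
    assert(code < sizeof(Names) && "unknown selector code");
    out += Names[code];
  }
  return out;
}
```

With this layout, `decodeSwizzle(encodeSwizzle(SelX, SelZero, SelZero, SelZero))` yields the `.x000` suffix used in the listing above. A denser packing (3 bits per lane) would work just as well; 8 bits is chosen here only to keep the shifts easy to read.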
Scott Michel
2009-Feb-19 00:57 UTC
[LLVMdev] Modeling GPU vector registers, again (with my implementation)
On Feb 16, 2009, at 9:24 AM, Villmow, Micah wrote:
> In order to get swizzling to work you only need to handle three
> SDNodes, insert_vector_elt, extract_vector_elt and build_vector while
> expanding the rest. For those three nodes I then custom lowered them to
> a target specific node with an extra integer constant per register that
> would encode the swizzle mask in 32bits.

Villmow, Micah:

This problem argues for why SDNode should be target polymorphic. If SDNodes were target polymorphic, a target-specific node would be largely unnecessary. (By a target-specific node, I mean extending the ISD enumeration for your target.)

A target-polymorphic SDNode would still capture all of the behaviors and attributes associated with insert_vector_elt, extract_vector_elt and build_vector, but would also allow you to add additional behaviors and attributes, which is mostly the point of object-oriented programming. Assuming you don't need to do extra DAGCombine work, you would get that for free from the parent class.

Unfortunately, that would mean a lot of work at this juncture and a heavy overhaul of SelectionDAGNodes.h and the associated SelectionDAG source. Node allocation, in particular, would become more complicated (but not unsolvable). Were anyone to tackle this problem, the solution would have to be largely incremental: the source can't be overhauled all at once, but should permit an incremental transition of SDNodes to a behavioral interface style.

-scooter
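To make the "target polymorphic" suggestion concrete, here is a toy sketch of the shape such a design might take. These classes are invented for illustration (`ToyNode`, `ToyBuildVector`, `ToyGpuBuildVector`, `genericPassRecognizes`); they are not LLVM's SDNode classes and say nothing about how SelectionDAG is actually organized.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-ins for illustration only; none of these are LLVM classes.
class ToyNode {
public:
  virtual ~ToyNode() = default;
  virtual std::string opcodeName() const = 0;
  // Generic hook that a target-independent pass (e.g. a combiner) would query.
  virtual bool isBuildVector() const { return false; }
};

// Generic build_vector node with its target-independent attributes.
class ToyBuildVector : public ToyNode {
public:
  explicit ToyBuildVector(unsigned numElts) : NumElts(numElts) {
    assert(numElts >= 1 && numElts <= 4 && "this sketch models 1 to 4 components");
  }
  std::string opcodeName() const override { return "build_vector"; }
  bool isBuildVector() const override { return true; }
  unsigned getNumElts() const { return NumElts; }

private:
  unsigned NumElts;
};

// Target subclass: inherits all generic behaviour unchanged and only adds
// the extra attribute the GPU backend needs (a packed swizzle mask).
class ToyGpuBuildVector : public ToyBuildVector {
public:
  ToyGpuBuildVector(unsigned numElts, std::uint32_t swizzleMask)
      : ToyBuildVector(numElts), SwizzleMask(swizzleMask) {}
  std::uint32_t getSwizzleMask() const { return SwizzleMask; }

private:
  std::uint32_t SwizzleMask;
};

// A "generic" routine that knows nothing about the GPU subclass.
bool genericPassRecognizes(const ToyNode &n) {
  return n.isBuildVector() && n.opcodeName() == "build_vector";
}
```

Generic code still sees a build_vector when handed a `ToyGpuBuildVector`, which is the "for free from the parent class" part, while the backend can recover the swizzle mask from the subclass. The allocation concern raised above shows up even in this toy: the subclass is larger than its parent, so a node allocator can no longer assume one fixed node size.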