Ruiling Song via llvm-dev
2016-Aug-22 13:46 UTC
[llvm-dev] How to describe the RegisterInfo?
Hello everyone,

I am trying to write a new LLVM backend target for Intel GPUs, starting with the OpenCL language. I am not very familiar with the LLVM backend infrastructure, and I have some problems describing the RegisterInfo.

An Intel GPU launches many hardware threads to run a GPGPU workload. Each hardware thread has 128 registers (r0-r127), each 32 bytes in size. Each hardware thread may run in SIMD 8/16/32 mode, which maps to 8/16/32 OpenCL work-items. The SIMD width is chosen at compile time (normally according to register pressure; a larger SIMD width means higher register pressure). Note that each instruction has its own exec-width, which may not be equal to the program SIMD width.

Normally we allocate contiguous registers for a divergent value. For example, in a program compiled as SIMD 8, a divergent float/i32 value needs 4 bytes * 8 = 32 bytes. A 'short' value only needs 2 bytes * 8 = 16 bytes, which is half of a 32-byte register. We may also allocate registers for 'uniform' values. A uniform value only needs a type-sized register, without multiplying by the SIMD width: a uniform float/i32 value needs only 4 bytes of physical register, so one 32-byte register can hold up to 8 different uniform float/i32 values.

Sometimes we also need to access registers in a strided way. For a bitcast from i64 to v2i32, for example, we need to access the i64 register with a horizontal stride of 2. In the example below, the i64 value is held in r10 and r11. L/H stands for the low/high 32 bits. The SIMD width of the program is SIMD 8, so we have 8 pairs of L/H.

    r10: L H L H L H L H
    r11: L H L H L H L H

The two instructions below extract the low 32-bit and high 32-bit parts:

    mov(8 | M0) r12.0<1>, r10.0<8,4,2>:D
    mov(8 | M0) r13.0<1>, r10.1<8,4,2>:D

(The format of a register region is RegNum.regSubNum<vertStride, width, horzStride>:type. Note that regSubNum is measured in units of the register type here.)

Then r12/r13 contain the result vector components. See the link below for more details on Intel GPU assembly and register usage:
https://software.intel.com/en-us/articles/introduction-to-gen-assembly

I notice that the hardware encoding of a register is 16 bits. That is not enough to encode all the register region parameters (regNum, type, hstride, vstride, width, ...) in RegisterInfo.td, and I am not sure what the reasonable place is to hold this stride/type/width information for a physical register. Maybe some other .cpp file is more suitable than the RegisterInfo.td file? I ask because I need to change the register region parameters in the bitcast instruction (from qword with hstride 1 to dword with hstride 2). At which stage should this bitcast logic be done? After register allocation?

The detailed hardware spec is located at:
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol07-3d_media_gpgpu_3.pdf
Page 921 describes the detailed instruction encoding format. It needs (regFile, regNum, subRegNum, width, type, addrMode, hStride, vStride) to describe a register.

I have attached my first version of RegisterInfo.td, and I also have a question about it. Do I have to define different SubRegIndex records like the ones below to make TableGen work correctly?

    foreach Index = 0-15 in {
      def subd#Index : SubRegIndex<32, !shl(Index, 5)>; // used as SubRegIndex when declaring gpr_d_simd8
      def subw#Index : SubRegIndex<16, !shl(Index, 4)>; // used as SubRegIndex when declaring gpr_w_simd8
      ...
    }

If anything I said is unclear, just reply to this mail. Thanks for any help!

Thanks!
Ruiling

Attachment: IntelGPURegisterInfo.td (application/octet-stream, 5907 bytes)
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160822/7e9761ee/attachment.obj>
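The region notation in the post above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original post, that assumes the usual reading of a region: channels are laid out in rows of `width` elements, `horzStride` separates elements within a row, `vertStride` separates rows, and all three are counted in elements of the region's type. The names `Region` and `element_locations` are made up for illustration.

```python
from dataclasses import dataclass

# Byte sizes for the type suffix used in the post (:D, dword), plus word
# and qword for comparison.
TYPE_BYTES = {"W": 2, "D": 4, "Q": 8}
REG_BYTES = 32  # each register is 32 bytes, per the post


@dataclass(frozen=True)
class Region:
    reg: int        # RegNum
    subreg: int     # regSubNum, in units of the element type
    vstride: int    # vertStride, in elements
    width: int      # elements per row
    hstride: int    # horzStride, in elements
    elem_type: str  # key into TYPE_BYTES


def element_locations(region, exec_size):
    """Return (register number, byte offset) for each execution channel."""
    size = TYPE_BYTES[region.elem_type]
    locations = []
    for channel in range(exec_size):
        row, col = divmod(channel, region.width)
        elem = region.subreg + row * region.vstride + col * region.hstride
        byte = region.reg * REG_BYTES + elem * size
        locations.append((byte // REG_BYTES, byte % REG_BYTES))
    return locations


# The two source operands from the post: r10.0<8,4,2>:D and r10.1<8,4,2>:D
low = element_locations(Region(10, 0, 8, 4, 2, "D"), 8)
high = element_locations(Region(10, 1, 8, 4, 2, "D"), 8)
```

Under that assumption, `low` touches bytes 0, 8, 16 and 24 of r10 and then of r11 (the L slots in the diagram), and `high` touches bytes 4, 12, 20 and 28 of each (the H slots).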
> On Aug 22, 2016, at 6:46 AM, Ruiling Song via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> [...]
> An Intel GPU launches many hardware threads to run a GPGPU workload.
> Each hardware thread has 128 registers (r0-r127), each 32 bytes in size.
> Each hardware thread may run in SIMD 8/16/32 mode, which maps to
> 8/16/32 OpenCL work-items.
> [...]
> We may also allocate registers for 'uniform' values. A uniform value only
> needs a type-sized register, without multiplying by the SIMD width.
> A uniform float/i32 value needs only 4 bytes of physical register,
> so one 32-byte register can hold up to 8 different uniform float/i32 values.

As a GPU backend maintainer, I strongly discourage trying to model the total register bank of the GPU in LLVM. Just model one thread. This will make things much, much easier.

> [...]

-escha
Ruiling Song via llvm-dev
2016-Aug-23 03:07 UTC
[llvm-dev] How to describe the RegisterInfo?
Hi Escha,

Great to have your comment! Do you have any specific reason for not doing it this way? I am not sure whether I understand your point correctly. By "just model one thread", do you mean "only consider ONE of the 8/16 work lanes that run in lock-step"? In my case, that might mean I only need to define r0~r127 as i32 registers (each r# is just large enough for a SIMD8 i32). Then the register allocator never needs to allocate sub-registers and just operates on whole registers, right? Yes, that looks really easy for divergent registers. But I think I would then lose the ability to allocate uniform registers. Am I right? Is there any way to allocate uniform registers as well as divergent registers?

Thanks!
Ruiling

2016-08-23 0:32 GMT+08:00 <escha at apple.com>:
> On Aug 22, 2016, at 6:46 AM, Ruiling Song via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > [...]
> > so one 32-byte register can hold up to 8 different uniform float/i32 values.
>
> As a GPU backend maintainer, I strongly discourage trying to model the
> total register bank of the GPU in LLVM. Just model one thread. This will
> make things much, much easier.
>
> -escha
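The trade-off raised in this reply rests on the size arithmetic from the original post. A small Python sketch of that arithmetic follows; it uses only the 32-byte register size and the type sizes mentioned in the thread, and the function names are invented for illustration.

```python
REG_BYTES = 32  # register size from the original post


def divergent_bytes(type_bytes, simd_width):
    """A divergent value holds one element per SIMD lane."""
    return type_bytes * simd_width


def registers_needed(type_bytes, simd_width):
    """Whole 32-byte registers touched by a divergent value (rounded up)."""
    return -(-divergent_bytes(type_bytes, simd_width) // REG_BYTES)


def uniform_slots_per_register(type_bytes):
    """Uniform values of this type that fit in one register."""
    return REG_BYTES // type_bytes
```

For instance, `divergent_bytes(4, 8)` is 32 (one full register for a SIMD8 i32), `registers_needed(8, 8)` is 2 (the r10/r11 pair used for the i64 example), and `uniform_slots_per_register(4)` is 8. If registers are only modeled as whole r0~r127 units, each uniform i32 would occupy one of those 32-byte registers while using a single 4-byte slot, which is the loss the question is pointing at.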
Tom Stellard via llvm-dev
2016-Aug-23 17:32 UTC
[llvm-dev] How to describe the RegisterInfo?
On Mon, Aug 22, 2016 at 09:46:10PM +0800, Ruiling Song via llvm-dev wrote:
> [...]
> I notice that the hardware encoding of a register is 16 bits. That is not
> enough to encode all the register region parameters (regNum, type, hstride,
> vstride, width, ...) in RegisterInfo.td, and I am not sure what the
> reasonable place is to hold this stride/type/width information for a
> physical register.
> Maybe some other .cpp file is more suitable than the RegisterInfo.td file?
> I ask because I need to change the register region parameters in the
> bitcast instruction (from qword with hstride 1 to dword with hstride 2).
> At which stage should this bitcast logic be done? After register allocation?
>

Hi,

I would recommend encoding some of the register region parameters as part of the instruction rather than using the register encoding, because something like 'width' seems more like a property of the instruction than of the register to me.

-Tom

> [...]
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
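One way to picture the suggestion above is an operand record in which only the register number comes from register allocation, while the region parameters travel with the instruction as ordinary immediates. The Python sketch below is a toy model rather than LLVM code; `SrcOperand` and `reinterpret` are hypothetical names, and it assumes the SIMD8 i64 value is read contiguously as r10.0<4,4,1>:Q before the bitcast.

```python
from dataclasses import dataclass, replace

TYPE_BYTES = {"W": 2, "D": 4, "Q": 8}


@dataclass(frozen=True)
class SrcOperand:
    reg: int        # the only field register allocation would fill in
    subreg: int     # region parameters below are per-instruction immediates
    vstride: int
    width: int
    hstride: int
    elem_type: str


def reinterpret(op, new_type, part):
    """View each element of op as narrower elements and select piece `part`.

    Strides and the sub-register number are counted in elements, so they
    scale by the ratio of the old element size to the new one.
    """
    scale, rem = divmod(TYPE_BYTES[op.elem_type], TYPE_BYTES[new_type])
    if rem != 0 or not 0 <= part < scale:
        raise ValueError("new type must evenly divide the old element size")
    return replace(
        op,
        subreg=op.subreg * scale + part,
        vstride=op.vstride * scale,
        hstride=op.hstride * scale,
        elem_type=new_type,
    )


# Assumed starting point: the i64 value read as r10.0<4,4,1>:Q
i64_src = SrcOperand(reg=10, subreg=0, vstride=4, width=4, hstride=1, elem_type="Q")
low = reinterpret(i64_src, "D", 0)
high = reinterpret(i64_src, "D", 1)
```

With these inputs, `low` and `high` come out as the r10.0<8,4,2>:D and r10.1<8,4,2>:D regions from the original post. The register number is never touched, which suggests that a rewrite of this kind does not depend on how the register itself is encoded.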
Ruiling Song via llvm-dev
2016-Aug-24 08:47 UTC
[llvm-dev] How to describe the RegisterInfo?
2016-08-24 1:32 GMT+08:00 Tom Stellard <tom at stellard.net>:
> On Mon, Aug 22, 2016 at 09:46:10PM +0800, Ruiling Song via llvm-dev wrote:
> > [...]
> > At which stage should this bitcast logic be done? After register allocation?
> >
>
> Hi,
>
> I would recommend encoding some of the register region parameters as part
> of the instruction rather than using the register encoding, because
> something like 'width' seems more like a property of the instruction
> than of the register to me.
>
> -Tom
>

Hi Tom,

Thanks for your suggestion. I agree that some region parameters need to be part of the instruction descriptor. But it is a little hard for me to tell which parameters should go into the instruction descriptor and which should be declared in RegisterInfo.td. My current idea is to describe uniform/non-uniform registers in RegisterInfo.td, while the other register region parameters (such as stride) are left to the instruction descriptor.

The SIMD width of the compiled program determines the width of a non-uniform register (normally 8 or 16 lanes), so I think this should be included in RegisterInfo.td. If a value is non-uniform, I would assign a non-uniform register class to it. I am not sure whether this can be done easily in LLVM, and I don't know whether there is any other way to do it besides declaring uniform/non-uniform registers in RegisterInfo.td. Please share any ideas on how to allocate non-uniform registers if this is not handled in RegisterInfo.td.

- Ruiling
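If uniform and non-uniform registers are both declared in RegisterInfo.td, the sub-register indices from the original post decide where each type-sized slot sits inside a wider register. The Python sketch below (not TableGen) only enumerates the (size, offset) pairs that the posted `foreach` would produce, reading the two SubRegIndex arguments as a size and an offset in bits. The helper names are made up.

```python
def sub_reg_indices(prefix, size_bits, shift, count=16):
    """Mirror `def <prefix>#Index : SubRegIndex<size_bits, !shl(Index, shift)>`
    for Index = 0 .. count-1, as a map from name to (size, offset) in bits."""
    return {f"{prefix}{i}": (size_bits, i << shift) for i in range(count)}


def covered_bytes(indices):
    """Bytes spanned from offset 0 to the end of the last sub-register."""
    return max(offset + size for size, offset in indices.values()) // 8


subd = sub_reg_indices("subd", 32, 5)  # dword slots, offset = Index * 32
subw = sub_reg_indices("subw", 16, 4)  # word slots,  offset = Index * 16
```

Here `subd7` ends at bit 256, the end of one 32-byte register, so `subd8` through `subd15` would fall in a second register: sixteen dword indices span 64 bytes, while sixteen word indices span 32 bytes. Whether all sixteen `subd` indices are wanted for a class named `gpr_d_simd8` depends on the attached file, which is not reproduced in this thread.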