thr3ads.net - llvm dev - [llvm-dev] How to describe the RegisterInfo? [Aug 2016]

Ruiling Song via llvm-dev

2016-Aug-23 03:07 UTC

[llvm-dev] How to describe the RegisterInfo?

Hi Escha,

Thanks for your comment! Is there a specific reason you advise against
this approach?
I am not sure I understand your point correctly. By "just model one thread",
do you mean "consider only ONE of the 8/16 work-item lanes that run in
lock-step"?

In my case, that might mean defining only r0~r127 as i32 registers
(each r# is just large enough for a SIMD8 i32 value).
Then the register allocator would never need to allocate sub-registers;
it would just treat each register as a whole, right?

That does look really easy for divergent registers, but I think I would then
lose the ability to allocate uniform registers. Am I right? Is there any way
to allocate uniform registers as well as divergent registers?

Thanks!
Ruiling

2016-08-23 0:32 GMT+08:00 <escha at apple.com>:
>
> On Aug 22, 2016, at 6:46 AM, Ruiling Song via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hello Everyone,
>
> I am trying to write a new LLVM backend target for Intel GPUs.
> I would start by targeting the OpenCL language first.
> But I am not very familiar with the LLVM backend infrastructure,
> and I have some problems describing the RegisterInfo.
>
> Intel GPUs launch many hardware threads to run a GPGPU workload.
> Each hardware thread has 128 registers (r0-r127), each 32 bytes in size.
> Each hardware thread may run in SIMD 8/16/32 mode, which maps to
> 8/16/32 OpenCL work-items. The SIMD width is chosen at
> compile time (normally according to register pressure; a bigger SIMD
> width means higher register pressure).
> Note that each instruction has its own exec-width, which may not be equal
> to the program SIMD width.
> Normally we would allocate contiguous registers for a divergent value.
> For example, in a program compiled as SIMD 8, a divergent float/i32 value
> needs 4 bytes * 8 = 32 bytes. A 'short' value
> needs only 2 bytes * 8 = 16 bytes, which is half of a 32-byte register.
> We may also allocate registers for 'uniform' values. A uniform value needs
> only a type-sized register, without multiplying by the SIMD width. A uniform
> float/i32 value needs only 4 bytes of a physical register.
> Thus a 32-byte register can hold up to 8 different uniform float/i32
> values.
>
>
> As a GPU backend maintainer, I strongly discourage trying to model the
> total register bank of the GPU in LLVM. Just model one thread. This will
> make things much, much easier.
>
>
> —escha
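
The register-size arithmetic in the original post quoted above can be sketched as follows. This is only an illustration in Python; the function names are hypothetical, and the only figures taken from the post are the 32-byte register size and the 128 registers per hardware thread.

```python
REG_BYTES = 32   # size of one r# register, as stated in the post
NUM_REGS = 128   # r0-r127, as stated in the post


def divergent_bytes(elem_bytes, simd_width):
    """Bytes needed by a divergent value: one element per SIMD lane."""
    return elem_bytes * simd_width


def registers_needed(num_bytes):
    """Whole 32-byte registers needed to hold num_bytes (rounded up)."""
    return -(-num_bytes // REG_BYTES)


def uniforms_per_register(elem_bytes):
    """How many uniform values of this element size fit in one register."""
    return REG_BYTES // elem_bytes


print(divergent_bytes(4, 8))      # SIMD8 float/i32
print(divergent_bytes(2, 8))      # SIMD8 short
print(uniforms_per_register(4))   # uniform float/i32 per r# register
```

Under the same figures, a SIMD16 float/i32 value needs `divergent_bytes(4, 16)` = 64 bytes, so `registers_needed(64)` gives two contiguous registers.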

via llvm-dev

2016-Aug-23 03:45 UTC

If I understand right, on this arch, ‘uniform’ refers to values that only take
one lane of register file instead of SIMD-width lanes, and they *share* the same
region of the register file as non-uniform values. This is in contrast to e.g.
AMDGPU where SGPRs (scalar GPRs) and VGPRs are separate register files.

If this understanding is correct, you may be able to define uniform and
non-uniform registers separately, but make sure that one aliases the other, e.g.
so that (if your SIMD width is 16) VGPR 20 overlaps SGPR 320, 321….335. So you
can have 128 vector registers, 16*128 uniforms, or a mix of the two.
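
The index arithmetic behind this aliasing suggestion can be sketched as below. It is a minimal illustration, assuming (as in the example above) that vector register v overlaps exactly `simd_width` consecutive uniform registers starting at `v * simd_width`; the function names are hypothetical and are not LLVM APIs.

```python
def aliased_uniforms(vector_index, simd_width):
    """Uniform (scalar) register indices overlapped by one vector register."""
    first = vector_index * simd_width
    return range(first, first + simd_width)


def owning_vector(uniform_index, simd_width):
    """Vector register index whose storage contains the given uniform register."""
    return uniform_index // simd_width


lanes = aliased_uniforms(20, 16)
print(lanes[0], lanes[-1])        # first and last uniform overlapped by VGPR 20
print(owning_vector(335, 16))     # vector register that contains SGPR 335
```

An allocator that has handed out vector register 20 must treat every index in `aliased_uniforms(20, 16)` as occupied, and the reverse holds too. That mutual exclusion is what the alias information in the register description has to express.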

(Maybe some of the AMDGPU maintainers have thoughts?)

—escha

Ruiling Song via llvm-dev

2016-Aug-23 07:08 UTC

Yes, the arch is just as you said. It is similar to AMD GPUs, except that
Intel GPUs do not have separate register files for scalar and vector values.
In fact, I borrowed the idea of defining register tuples from
SIRegisterInfo.td in the AMD GPU backend.
But it seems that AMD GPUs mainly support i32/i64 register types, while Intel
GPUs also support byte/short register types.
So I have to start by defining the registers at 'byte' granularity, and then
build up the registers for other types through RegisterTuples.
I thought RegisterTuples was a way of expressing register aliasing in the
RegisterInfo.td file, but I am not sure whether I understand it correctly. My
first attempt is shown below (to keep things simple, I removed some WORD/QWORD
register classes):
let Namespace = "IntelGPU" in {

foreach Index = 0-15 in {
  def sub#Index : SubRegIndex<32, !shl(Index, 5)>;
}
}

class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
  bits<2> HStride;
  bits<1> regFile;

  let Namespace = "IntelGPU";
  let HWEncoding{12-0}  = regIdx;
  let HWEncoding{15}    = regFile;
}
// here I define the whole 4096 byte registers
foreach Index = 0-4095 in {
  def Rb#Index : IntelGPUReg <"Rb"#Index, Index> {
    let regFile = 0;
  }
}

// b-->byte w-->word d-->dword q-->qword
// the set of uniform byte register
def gpr_b : RegisterClass<"IntelGPU", [i8], 8,
                          (sequence "Rb%u", 0, 4095)> {
  let AllocationPriority = 1;
}

def gpr_d : RegisterTuples<[sub0, sub1, sub2, sub3],
                              [(add (decimate gpr_b, 4)),
                               (add (decimate (shl gpr_b, 1), 4)),
                               (add (decimate (shl gpr_b, 2), 4)),
                               (add (decimate (shl gpr_b, 3), 4))]>;

// simd byte use stride 2 register as stride 1 does not support useful ALU
// instruction
def gpr_b_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
                                 [(add (decimate gpr_b, 16)),
                                  (add (decimate (shl gpr_b, 2), 16)),
                                  (add (decimate (shl gpr_b, 4), 16)),
                                  (add (decimate (shl gpr_b, 6), 16)),
                                  (add (decimate (shl gpr_b, 8), 16)),
                                  (add (decimate (shl gpr_b, 10), 16)),
                                  (add (decimate (shl gpr_b, 12), 16)),
                                  (add (decimate (shl gpr_b, 14), 16))]>;

def gpr_d_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
                                [(add (decimate gpr_d, 8)),
                                 (add (decimate (shl gpr_d, 1), 8)),
                                 (add (decimate (shl gpr_d, 2), 8)),
                                 (add (decimate (shl gpr_d, 3), 8)),
                                 (add (decimate (shl gpr_d, 4), 8)),
                                 (add (decimate (shl gpr_d, 5), 8)),
                                 (add (decimate (shl gpr_d, 6), 8)),
                                 (add (decimate (shl gpr_d, 7), 8))]>;
def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;
def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d_simd8)> {
}
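
To see which byte registers each tuple in the definitions above ends up covering, the TableGen set operations can be simulated in Python. This is only a sketch: `shl`, `decimate` and `register_tuples` are hypothetical stand-ins that follow the documented behaviour of the TableGen operators (`shl` drops the first N elements, `decimate` keeps every N-th element starting with the first, and `RegisterTuples` zips its lists up to the shortest one). It is not TableGen itself.

```python
def shl(seq, n):
    """Mimic TableGen (shl S, N): drop the first N elements."""
    return seq[n:]


def decimate(seq, n):
    """Mimic TableGen (decimate S, N): keep every N-th element, starting with the first."""
    return seq[::n]


def register_tuples(*lists):
    """Mimic RegisterTuples: zip the sub-register lists, stopping at the shortest."""
    return list(zip(*lists))


gpr_b = [f"Rb{i}" for i in range(4096)]

# Four consecutive bytes form one dword register.
gpr_d = register_tuples(*(decimate(shl(gpr_b, k), 4) for k in range(4)))

# Eight bytes at stride 2 form one SIMD8 byte register.
gpr_b_simd8 = register_tuples(*(decimate(shl(gpr_b, 2 * k), 16) for k in range(8)))

# Eight consecutive dwords form one SIMD8 dword register.
gpr_d_simd8 = register_tuples(*(decimate(shl(gpr_d, k), 8) for k in range(8)))

print(len(gpr_d), gpr_d[1])
print(len(gpr_b_simd8), gpr_b_simd8[1])
print(len(gpr_d_simd8))
```

In this simulation each `gpr_d_simd8` tuple is built from eight dword tuples, each of which is itself built from four byte registers, so one SIMD8 dword register spans 32 byte registers, that is, one full 32-byte r# register.
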
This makes it easy for me to define the register alias information, but it
does not work!
TableGen exits and tells me: "error:Ran out of lanemask bits to represent subregister sub1_then_sub1"
Does anybody know what is wrong here?

- Ruiling
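
A rough way to see why the lane mask might run out is to count the byte-granularity sub-register indices that the nested tuples imply. The sketch below is not a reproduction of TableGen's algorithm. It assumes that every sub-register index which is not itself decomposed further needs its own lane-mask bit, that the lane mask is 32 bits wide in the LLVM version used here, and that composed indices are named `subX_then_subY` as in the error message. All three are assumptions and are not confirmed anywhere in this thread.

```python
LANE_MASK_BITS = 32  # assumption: lane mask width in the TableGen version in use

# gpr_d_simd8 has sub0..sub7 (dwords); each dword has sub0..sub3 (bytes).
# Reaching a byte inside a SIMD8 dword register therefore needs a composed index.
composed = [f"sub{d}_then_sub{b}" for d in range(8) for b in range(4)]

# sub8..sub15 are declared by the foreach but never decomposed further.
never_composed = [f"sub{i}" for i in range(8, 16)]

leaf_indices = composed + never_composed
print(len(composed), len(leaf_indices), len(leaf_indices) > LANE_MASK_BITS)
```

Under these assumptions the composed byte indices alone already use all 32 bits, so any additional leaf index would not fit.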

