Jingyue Wu
2014-Oct-24 18:18 UTC
[LLVMdev] IndVar widening in IndVarSimplify causing performance regression on GPU programs
Hi,

I noticed a significant performance regression (up to 40%) on some internal CUDA benchmarks (a reduced example is presented below). The root cause appears to be that IndVarSimplify widens induction variables on the assumption that arithmetic on wider integer types is as cheap as arithmetic on narrower ones. That assumption is wrong, at least for the NVPTX64 target.

Although the NVPTX64 target supports 64-bit arithmetic, the actual NVIDIA GPU typically has only 32-bit integer registers, so one 64-bit arithmetic operation usually lowers to two machine instructions, one handling the low 32 bits and one handling the high 32 bits. I haven't looked at other GPU targets such as R600, but I suspect this problem is not restricted to NVPTX64.

Below is a reduced example:

    __attribute__((global)) void foo(int n, int *output) {
      for (int i = 0; i < n; i += 3) {
        output[i] = i * i;
      }
    }

Without widening, the loop body in the PTX (a low-level assembly-like language emitted by the NVPTX64 backend) is:

    BB0_2:                       // =>This Inner Loop Header: Depth=1
      mul.lo.s32   %r5, %r6, %r6;
      st.u32       [%rd4], %r5;
      add.s32      %r6, %r6, 3;
      add.s64      %rd4, %rd4, 12;
      setp.lt.s32  %p2, %r6, %r3;
      @%p2 bra     BB0_2;

in which %r6 is the induction variable i.

With widening, the loop body becomes:

    BB0_2:                       // =>This Inner Loop Header: Depth=1
      mul.lo.s64   %rd8, %rd10, %rd10;
      st.u32       [%rd9], %rd8;
      add.s64      %rd10, %rd10, 3;
      add.s64      %rd9, %rd9, 12;
      setp.lt.s64  %p2, %rd10, %rd1;
      @%p2 bra     BB0_2;

Although the number of PTX instructions is the same in both versions, the widened version uses mul.lo.s64, add.s64, and setp.lt.s64, which are more expensive than their 32-bit counterparts. Indeed, the SASS code (the disassembly of the actual machine code that runs on the GPU) for the widened version is significantly longer.

Without widening (7 instructions):

    .L_1:
      /*0048*/  IMUL R2, R0, R0;
      /*0050*/  IADD R0, R0, 0x1;
      /*0058*/  ST.E [R4], R2;
      /*0060*/  ISETP.NE.AND P0, PT, R0, c[0x0][0x140], PT;
      /*0068*/  IADD R4.CC, R4, 0x4;
      /*0070*/  IADD.X R5, R5, RZ;
      /*0078*/  @P0 BRA `(.L_1);

With widening (12 instructions):

    .L_1:
      /*0050*/  IMUL.U32.U32 R6.CC, R4, R4;
      /*0058*/  IADD R0, R0, -0x1;
      /*0060*/  IMAD.U32.U32.HI.X R8.CC, R4, R4, RZ;
      /*0068*/  IMAD.U32.U32.X R8, R5, R4, R8;
      /*0070*/  IMAD.U32.U32 R7, R4, R5, R8;
      /*0078*/  IADD R4.CC, R4, 0x1;
      /*0088*/  ST.E [R2], R6;
      /*0090*/  IADD.X R5, R5, RZ;
      /*0098*/  ISETP.NE.AND P0, PT, R0, RZ, PT;
      /*00a0*/  IADD R2.CC, R2, 0x4;
      /*00a8*/  IADD.X R3, R3, RZ;
      /*00b0*/  @P0 BRA `(.L_1);

I hope the issue is clear up to this point. So what would be a good way to fix it? I am thinking of having IndVarSimplify consult TargetTransformInfo about the cost of integer arithmetic at different widths: if operations on the wider integer type are more expensive, IndVarSimplify should not widen.

Another thing I am concerned about: are there other optimizations that make similar assumptions about integer widening? Those might cause performance regressions too, just as IndVarSimplify does.

Jingyue
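P.S. To make the TargetTransformInfo idea slightly more concrete, here is a rough sketch of the kind of check I have in mind. This is a hypothetical, untested helper: getArithmeticInstrCost is the existing TTI query, but the helper name and the choice of where in WidenIV to consult it are made up for illustration.

    // Hypothetical sketch only -- nothing like this exists in IndVarSimplify
    // today.  Idea: before WidenIV rewrites a narrow IV in the wide type, ask
    // TTI whether arithmetic in the wide type is really as cheap as in the
    // narrow type, and skip widening when it is not.
    #include "llvm/Analysis/TargetTransformInfo.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/IR/Type.h"
    using namespace llvm;

    static bool widerArithmeticIsFree(const TargetTransformInfo &TTI,
                                      Type *NarrowTy, Type *WideTy) {
      // Use a representative opcode (Mul is the worst offender in the example
      // above); a real patch would probably look at the opcodes actually used
      // by the IV's users.
      unsigned NarrowCost =
          TTI.getArithmeticInstrCost(Instruction::Mul, NarrowTy);
      unsigned WideCost =
          TTI.getArithmeticInstrCost(Instruction::Mul, WideTy);
      return WideCost <= NarrowCost;
    }

For this to kick in on NVPTX, the target's 64-bit costs would have to reflect the 32-bit emulation; targets with native 64-bit integer units would keep widening as they do today.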
Justin Holewinski
2014-Oct-24 18:29 UTC
[LLVMdev] IndVar widening in IndVarSimplify causing performance regression on GPU programs
On Fri, 24 Oct 2014, Jingyue Wu wrote:

> I hope the issue is clear up to this point. So what would be a good way to fix it? I am thinking of having IndVarSimplify consult TargetTransformInfo about the cost of integer arithmetic at different widths: if operations on the wider integer type are more expensive, IndVarSimplify should not widen.

TargetTransformInfo seems like a good place to put a hook for this. You're right that 64-bit integer math will be slower for NVPTX targets, as the hardware needs to emulate 64-bit integer ops with 32-bit ops.

How much is register usage affected by this in your benchmarks?
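(For reference, the 64-bit emulation mentioned above roughly has the following shape for a low multiply. This is a plain C++ sketch of the arithmetic, not the exact expansion ptxas emits, but it shows where the extra IMUL/IMAD instructions in the widened SASS come from.)

    #include <cstdint>

    // Sketch of how a 64x64 -> low-64 multiply is built from 32-bit pieces.
    uint64_t mul64_lo(uint64_t a, uint64_t b) {
      uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
      uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

      // Full 32x32 -> 64 product of the low halves; its high word carries
      // into the high word of the result.
      uint64_t lo = (uint64_t)a_lo * b_lo;

      // The two cross products only affect the high 32 bits, because only
      // the low 64 bits of the full 128-bit product are kept.
      uint32_t hi = (uint32_t)(lo >> 32) + a_lo * b_hi + a_hi * b_lo;

      return ((uint64_t)hi << 32) | (uint32_t)lo;
    }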
Jingyue Wu
2014-Oct-24 20:02 UTC
[LLVMdev] IndVar widening in IndVarSimplify causing performance regression on GPU programs
Hi Justin,

37 w/o widening, and 40 w/ widening. There is some other weirdness in the register allocation: I didn't specify any upper bound on register usage, but on the widened version ptxas aggressively rematerializes some arithmetic instructions to use fewer registers. Nevertheless, it still uses more registers than the version w/o widening.

Btw, Justin, do you have time to take a look at this (http://reviews.llvm.org/D5612)? Eli and I think it's OK, but would like you to confirm.

Jingyue

On Fri Oct 24 2014 at 11:29:33 AM Justin Holewinski <jholewinski at nvidia.com> wrote:

> TargetTransformInfo seems like a good place to put a hook for this. You're right that 64-bit integer math will be slower for NVPTX targets, as the hardware needs to emulate 64-bit integer ops with 32-bit ops.
>
> How much is register usage affected by this in your benchmarks?
Andrew Trick
2014-Oct-24 22:15 UTC
[LLVMdev] IndVar widening in IndVarSimplify causing performance regression on GPU programs
Please see: http://llvm.org/PR21148

I updated the bug with my suggestion. I hope it works.

-Andy

> On Oct 24, 2014, at 11:29 AM, Justin Holewinski <jholewinski at nvidia.com> wrote:
>
> TargetTransformInfo seems like a good place to put a hook for this. You're right that 64-bit integer math will be slower for NVPTX targets, as the hardware needs to emulate 64-bit integer ops with 32-bit ops.
>
> How much is register usage affected by this in your benchmarks?