thr3ads.net - llvm dev - [LLVMdev] MI scheduler produce badly code with inline function [Oct 2013]


Zakk

2013-Oct-16 04:28 UTC

[LLVMdev] MI scheduler produce badly code with inline function

Hi Andy, thanks for your help!
With the new machine model, the code scheduled by method A is the same as
that from method B. That makes sense, but there is another problem: the
scheduled code is bad.

The load/store instructions always reuse the same register.

Source:

#define N  2000000
static double b[N], c[N];
void Scale () {
    double scalar = 3.0;
    for (int j=0;j<N;j++)
        b[j] = scalar*c[j];
}

$ clang -O3 foo.c -static -S -o foo.s -mllvm -unroll-count=4 \
    -mcpu=cortex-a9 -fno-vectorize -fno-slp-vectorize --target=arm \
    -mfloat-abi=hard -mllvm -enable-misched -mllvm -sched-itins=false
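
To make the shape of the unrolled loop body easier to follow, here is a minimal hand-written C sketch. It assumes the unroller simply replicates the body four times; N is a multiple of 4, so no remainder loop is needed. The names scale_rolled, scale_unrolled4 and count_mismatches are illustrative and do not come from foo.c. Which registers end up holding t0..t3 is decided by the scheduler and register allocator, not by this source.

```c
#include <assert.h>

#define N 2000000
static double b[N], c[N];

/* The loop as written in the post. */
static void scale_rolled(void) {
    const double scalar = 3.0;
    for (int j = 0; j < N; j++)
        b[j] = scalar * c[j];
}

/* Hand-written stand-in for an unroll factor of 4 (an assumption about
 * what the unroller produces, not compiler output). */
static void scale_unrolled4(void) {
    const double scalar = 3.0;
    assert(N % 4 == 0); /* no remainder loop needed for this N */
    for (int j = 0; j < N; j += 4) {
        /* Four independent load/multiply/store chains per iteration. */
        double t0 = c[j], t1 = c[j + 1], t2 = c[j + 2], t3 = c[j + 3];
        b[j]     = scalar * t0;
        b[j + 1] = scalar * t1;
        b[j + 2] = scalar * t2;
        b[j + 3] = scalar * t3;
    }
}

/* Returns the number of elements where the two versions disagree. */
static int count_mismatches(void) {
    static double expected[N];
    int mismatches = 0;
    for (int j = 0; j < N; j++)
        c[j] = 0.5 * j;
    scale_rolled();
    for (int j = 0; j < N; j++) {
        expected[j] = b[j];
        b[j] = 0.0;
    }
    scale_unrolled4();
    for (int j = 0; j < N; j++)
        if (b[j] != expected[j])
            mismatches++;
    return mismatches;
}
```

The four chains are independent of each other, so a schedule may either give each chain its own register or run them one after another through a single register.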

Per-operand cost model:
Scale:
  push  {lr}
  movw  r12, :lower16:c
  movw  lr, :lower16:b
  movw  r3, #9216
  movt  r12, :upper16:c
  mov r1, #0
  vmov.f64  d16, #3.000000e+00
  movt  lr, :upper16:b
  movt  r3, #244
.LBB0_1:
  add r0, r12, r1
  add r2, lr, r1
  vldr  d17, [r0]
  add r1, r1, #32
  vmul.f64  d17, d17, d16
  cmp r1, r3
  vstr  d17, [r2]
  vldr  d17, [r0, #8]
  vmul.f64  d17, d17, d16
  vstr  d17, [r2, #8]
  vldr  d17, [r0, #16]
  vmul.f64  d17, d17, d16
  vstr  d17, [r2, #16]
  vldr  d17, [r0, #24]
  vmul.f64  d17, d17, d16
  vstr  d17, [r2, #24]
  bne .LBB0_1
  pop {lr}
  bx  lr
.Ltmp0:
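
As a reading aid for the listing above, the loop bound that the movw/movt pair builds in r3 can be checked with a few lines of C. This is only a sketch: movw_movt and loop_trip_count are hypothetical helper names, and the code assumes 8-byte doubles, as on this ARM target.

```c
#include <assert.h>
#include <stdint.h>

#define N 2000000

/* movw sets the low 16 bits (clearing the rest); movt then sets the
 * upper 16 bits and leaves the low half untouched. */
static uint32_t movw_movt(uint16_t lo, uint16_t hi) {
    return ((uint32_t)hi << 16) | lo;
}

/* r1 is a byte offset that advances by 32 (four doubles) per iteration
 * and is compared against r3. */
static uint32_t loop_trip_count(void) {
    uint32_t bound = movw_movt(9216, 244);
    assert(bound == (uint32_t)N * sizeof(double)); /* size of b[] in bytes */
    return bound / 32;
}
```

So r3 holds the size of the array in bytes, and each pass through .LBB0_1 handles four elements.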

Using the itinerary generates better-scheduled code:
$ clang -O3 foo.c -static -S -o foo.s -mllvm -unroll-count=4 \
    -mcpu=cortex-a9 -fno-vectorize -fno-slp-vectorize --target=arm \
    -mfloat-abi=hard -mllvm -enable-misched

Scale:
  movw  r12, :lower16:c
  movw  r2, :lower16:b
  movw  r3, #9216
  movt  r12, :upper16:c
  mov r1, #0
  vmov.f64  d16, #3.000000e+00
  movt  r2, :upper16:b
  movt  r3, #244
.LBB0_1:
  add r0, r12, r1
  vldr  d17, [r0]
  vldr  d18, [r0, #8]
  vmul.f64  d17, d17, d16
  vldr  d19, [r0, #16]
  vldr  d20, [r0, #24]
  add r0, r2, r1
  vmul.f64  d18, d18, d16
  add r1, r1, #32
  cmp r1, r3
  vmul.f64  d19, d19, d16
  vmul.f64  d20, d20, d16
  vstmia  r0, {d17, d18, d19, d20}
  bne .LBB0_1
  bx  lr

Is this just because the A9's per-operand machine model is not implemented
well?
By the way, why do you want to use the new machine model for mi-sched?

Thanks,

Kind regards
Kuan-Hsu



2013/10/15 Andrew Trick <atrick at apple.com>
>
> On Oct 14, 2013, at 3:27 AM, Zakk <zakk0610 at gmail.com> wrote:
>
> Hi all,
> I ran into this problem when compiling the STREAM benchmark (
> http://www.cs.virginia.edu/stream/FTP/Code/) with -enable-misched.
>
> The small function is scheduled well on its own, but if opt inlines
> the function, the inlined part is scheduled badly.
>
>
> A bug for this is welcome. Pretty soon, I’ll be verifying A9 performance
> and changing the default scheduler. When I do this, I’ll be using the new
> machine model:
>
> (-mllvm) -sched-itins=false
>
> However, some scheduler changes are required for that mode to fully
> enforce pipeline hazards.
>
> So I wrote a simple test case (foo.c, linked below) and compiled it with
> two different methods:
>
> Method A:
> $ clang -O3 foo.c -static -S -o foo.s -mllvm -enable-misched -mllvm \
>     -unroll-count=4 --target=arm -mfloat-abi=hard -mcpu=cortex-a9 \
>     -fno-vectorize -fno-slp-vectorize
>
> and
>
> Method B:
> $ clang foo.c -S -emit-llvm -o foo.bc --target=arm -mfloat-abi=hard \
>     -mcpu=cortex-a9
> $ opt foo.bc -O3 -unroll-count=4 -o foo.opt.bc
> $ llc foo.opt.bc -o foo.opt.s -march=arm -mcpu=cortex-a9 -enable-misched
>
>
> You can try “clang -O3 -mllvm -disable-llvm-optzns …”. clang should
> generate the same bitcode, but skip the “opt” step.
>
> If that doesn’t work it can be a nightmare trying to decompose the
> compilation steps with fidelity. You can try:
> - clang -### …
> - clang -mllvm -print-options …
> - Passing a full triple to all tools with -mtriple
> - Debug the TargetOptions fields
> - -print-after-all to see which phase is different
>
> Even if you get all the options right, the process of serializing and
> rereading the IR can affect the optimizations.
>
> Sorry. I’ve been trying to think of a way to improve this situation.
>
> -Andy
>
> (P.S. I checked with debug-pass=structure, so I think the two pass
> pipelines are equivalent.)
>
> But the results differ: in LBB1_4 of foo.s the same register is always
> reused for the computation, whereas in LBB1_4 of foo.opt.s it is not.
>
> My question is: how can I get the method B result using only clang
> (method A)? Or am I missing something here?
>
> I really appreciate any help and suggestions.
> Thanks
>
> Kuan-Hsu
>
> ------- file link -------
> foo.c: http://goo.gl/nVa2K0
> foo.s: http://goo.gl/ML9eNj
> foo.opt.s: http://goo.gl/31PCnf
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>

-- 
Best regards,
Kuan-Hsu

Andrew Trick

2013-Oct-16 05:38 UTC


On Oct 15, 2013, at 9:28 PM, Zakk <zakk0610 at gmail.com> wrote:
> Hi Andy, thanks for your help!
> With the new machine model, the code scheduled by method A is the same as
> that from method B. That makes sense, but there is another problem: the
> scheduled code is bad.
>
> The load/store instructions always reuse the same register.

I filed PR17593 with this information. However, I see the opposite of the
results you’re expecting. The code that uses fewer registers runs 4% faster
on my cortex-a9. The integer unit is out-of-order.

> Is this just because the A9's per-operand machine model is not implemented
> well?
> By the way, why do you want to use the new machine model for mi-sched?

I want to move all the targets we support to the new machine model so it will be
easier to maintain the scheduler. Additionally, the new model is much more
efficient and simpler (if you don’t use special features). It is also correct
for both preRA and postRA. Note that in the case of A9, the .td file for the new
machine model is horribly complicated because it handles load multiple
instructions. The A9 itinerary doesn’t even attempt to do that. (This was done
mainly to demonstrate the feature set of the new model, not because it’s
terribly important). The new model for A9 is also complicated by a mapping from
the old itinerary classes to the new machine model.

-Andy

Zakk

2013-Oct-16 09:15 UTC


2013/10/16 Andrew Trick <atrick at apple.com>
> I filed PR17593 with this information. However, I see opposite results
> from what you’re expecting. The code that uses fewer registers runs 4%
> faster on my cortex-a9. The integer unit is out-of-order.

I think you should use clang alone to generate the assembly, rather than
clang + llc.
I also replied at http://llvm.org/bugs/show_bug.cgi?id=17593

> I want to move all the targets we support to the new machine model so it
> will be easier to maintain the scheduler. Additionally, the new model is
> much more efficient and simpler (if you don’t use special features). It is
> also correct for both preRA and postRA. Note that in the case of A9, the
> .td file for the new machine model is horribly complicated because it
> handles load multiple instructions. The A9 itinerary doesn’t even attempt
> to do that. (This was done mainly to demonstrate the feature set of the new
> model, not because it’s terribly important). The new model for A9 is also
> complicated by a mapping from the old itinerary classes to the new machine
> model.

I got it, thanks. Writing itinerary classes is really tedious...



Kind regards
Kuan-Hsu


Zakk

2013-Oct-21 09:25 UTC


Hi Andy, I'm working on defining a new machine model for my target,
but I don't understand how to describe an in-order machine (reservation
tables) in the new model.

For example, if the target has IF, ID, EX, and WB stages, should I write:

let BufferSize = 0 in {
  def IF : ProcResource<1>;
  def ID : ProcResource<1>;
  def EX : ProcResource<1>;
  def WB : ProcResource<1>;
}
def : WriteRes<WriteALU, [IF, ID, EX, WB]>;

or should I define each stage as a SchedWrite type and use WriteSequence
to define the sequence?
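
For readers unfamiliar with the term, the idea of a reservation table for such a four-stage in-order pipeline can be sketched in C. This is a generic textbook-style model, not how LLVM's scheduler or the TableGen machine model is implemented. The stage names follow the question above; InstrClass, ALU_OP, MUL_OP (a hypothetical operation that holds EX for two cycles) and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

enum { IF, ID, EX, WB, NUM_STAGES };
#define MAX_CYCLES 64

/* busy[stage][cycle] is 1 when the single unit of that stage is taken. */
static unsigned char busy[NUM_STAGES][MAX_CYCLES];

/* How many consecutive cycles an instruction class holds each stage. */
typedef struct { int hold[NUM_STAGES]; } InstrClass;

static const InstrClass ALU_OP = {{1, 1, 1, 1}};
static const InstrClass MUL_OP = {{1, 1, 2, 1}}; /* hypothetical: 2 cycles in EX */

static void reset_table(void) { memset(busy, 0, sizeof busy); }

/* Returns 1 if an instruction entering IF at `start` hits no reserved slot. */
static int fits(const InstrClass *ic, int start) {
    int cycle = start;
    for (int s = 0; s < NUM_STAGES; s++)
        for (int k = 0; k < ic->hold[s]; k++, cycle++)
            if (cycle >= MAX_CYCLES || busy[s][cycle])
                return 0;
    return 1;
}

/* Marks every (stage, cycle) slot the instruction occupies. */
static void reserve(const InstrClass *ic, int start) {
    int cycle = start;
    assert(fits(ic, start));
    for (int s = 0; s < NUM_STAGES; s++)
        for (int k = 0; k < ic->hold[s]; k++, cycle++)
            busy[s][cycle] = 1;
}

/* Issues at the earliest conflict-free cycle >= not_before.
 * Returns that cycle, or -1 if the table is too small. */
static int issue(const InstrClass *ic, int not_before) {
    for (int start = not_before; start < MAX_CYCLES; start++)
        if (fits(ic, start)) {
            reserve(ic, start);
            return start;
        }
    return -1;
}
```

In this model two ALU_OP instructions can issue on consecutive cycles, while an ALU_OP that follows a MUL_OP is delayed by one cycle because the EX slot it needs is still reserved.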

Thanks,
Kuan-Hsu

