Hi all,

I ran into this problem when compiling the STREAM benchmark (http://www.cs.virginia.edu/stream/FTP/Code/) with -enable-misched.

When the small function is compiled on its own, it is scheduled well, but when opt inlines it, the inlined copy is scheduled badly.

So I wrote a simple reproducer (foo.c, linked below) and compiled it in two different ways:

Method A:

    $ clang -O3 foo.c -static -S -o foo.s -mllvm -enable-misched -mllvm -unroll-count=4 --target=arm -mfloat-abi=hard -mcpu=cortex-a9 -fno-vectorize -fno-slp-vectorize

Method B:

    $ clang foo.c -S -emit-llvm -o foo.bc --target=arm -mfloat-abi=hard -mcpu=cortex-a9
    $ opt foo.bc -O3 -unroll-count=4 -o foo.opt.bc
    $ llc foo.opt.bc -o foo.opt.s -march=arm -mcpu=cortex-a9 -enable-misched

(P.S. I checked both with debug-pass=structure, so I think they are equivalent.)

But the results differ: in LBB1_4 of foo.s the same register is always reused for the computation, whereas in LBB1_4 of foo.opt.s it is not.

My question: how can I get the method B result using only clang (method A)? Or am I missing something here?

I really appreciate any help and suggestions.

Thanks,
Kuan-Hsu

------- file link -------
foo.c: http://goo.gl/nVa2K0
foo.s: http://goo.gl/ML9eNj
foo.opt.s: http://goo.gl/31PCnf
Andrew Trick
2013-Oct-15 04:38 UTC
[LLVMdev] MI scheduler produce badly code with inline function
On Oct 14, 2013, at 3:27 AM, Zakk <zakk0610 at gmail.com> wrote:

> Hi all,
> I meet this problem when compiling the STREAM benchmark (http://www.cs.virginia.edu/stream/FTP/Code/) with enable-misched
>
> The small function will be scheduled as good code, but if opt inline this function, the inline part will be scheduled as bad code.

A bug for this is welcome. Pretty soon, I’ll be verifying A9 performance and changing the default scheduler. When I do this, I’ll be using the new machine model:

    (-mllvm) -sched-itins=false

However, some scheduler changes are required for that mode to fully enforce pipeline hazards.

> so I rewrite a simple code as attached link (foo.c), and compiled with two different methods:
>
> method A:
> $clang -O3 foo.c -static -S -o foo.s -mllvm -enable-misched -mllvm -unroll-count=4 --target=arm -mfloat-abi=hard -mcpu=cortex-a9 -fno-vectorize -fno-slp-vectorize
>
> and
>
> method B:
> $clang foo.c -S -emit-llvm -o foo.bc --target=arm -mfloat-abi=hard -mcpu=cortex-a9
> $opt foo.bc -O3 -unroll-count=4 -o foo.opt.bc
> $llc foo.opt.bc -o foo.opt.s -march=arm -mcpu=cortex-a9 -enable-misched

You can try “clang -O3 -mllvm -disable-llvm-optzns …”. clang should generate the same bitcode, but skip the “opt” step.

If that doesn’t work it can be a nightmare trying to decompose the compilation steps with fidelity. You can try:
- clang -### …
- clang -mllvm -print-options …
- Passing a full triple to all tools with -mtriple
- Debug the TargetOptions fields
- -print-after-all to see which phase is different

Even if you get all the options right, the process of serializing and rereading the IR can affect the optimizations.

Sorry. I’ve been trying to think of a way to improve this situation.

-Andy
Hi Andy,

Thanks for your help!

With the new machine model, method A produces the same schedule as method B, which makes sense. But there is another problem: the scheduled code is bad. The load/store instructions always reuse the same register.

Source:

    #define N 2000000
    static double b[N], c[N];
    void Scale () {
      double scalar = 3.0;
      for (int j=0;j<N;j++)
        b[j] = scalar*c[j];
    }

    $ clang -O3 foo.c -static -S -o foo.s -mllvm -unroll-count=4 -mcpu=cortex-a9 -fno-vectorize -fno-slp-vectorize --target=arm -mfloat-abi=hard -mllvm -enable-misched -mllvm -scheditins=false

Per-operand cost model:

    Scale:
            push    {lr}
            movw    r12, :lower16:c
            movw    lr, :lower16:b
            movw    r3, #9216
            movt    r12, :upper16:c
            mov     r1, #0
            vmov.f64 d16, #3.000000e+00
            movt    lr, :upper16:b
            movt    r3, #244
    .LBB0_1:
            add     r0, r12, r1
            add     r2, lr, r1
            vldr    d17, [r0]
            add     r1, r1, #32
            vmul.f64 d17, d17, d16
            cmp     r1, r3
            vstr    d17, [r2]
            vldr    d17, [r0, #8]
            vmul.f64 d17, d17, d16
            vstr    d17, [r2, #8]
            vldr    d17, [r0, #16]
            vmul.f64 d17, d17, d16
            vstr    d17, [r2, #16]
            vldr    d17, [r0, #24]
            vmul.f64 d17, d17, d16
            vstr    d17, [r2, #24]
            bne     .LBB0_1
            pop     {lr}
            bx      lr
    .Ltmp0:

Using the itinerary generates better-scheduled code:

    $ clang -O3 foo.c -static -S -o foo.s -mllvm -unroll-count=4 -mcpu=cortex-a9 -fno-vectorize -fno-slp-vectorize --target=arm -mfloat-abi=hard -mllvm -enable-misched

    Scale:
            movw    r12, :lower16:c
            movw    r2, :lower16:b
            movw    r3, #9216
            movt    r12, :upper16:c
            mov     r1, #0
            vmov.f64 d16, #3.000000e+00
            movt    r2, :upper16:b
            movt    r3, #244
    .LBB0_1:
            add     r0, r12, r1
            vldr    d17, [r0]
            vldr    d18, [r0, #8]
            vmul.f64 d17, d17, d16
            vldr    d19, [r0, #16]
            vldr    d20, [r0, #24]
            add     r0, r2, r1
            vmul.f64 d18, d18, d16
            add     r1, r1, #32
            cmp     r1, r3
            vmul.f64 d19, d19, d16
            vmul.f64 d20, d20, d16
            vstmia  r0, {d17, d18, d19, d20}
            bne     .LBB0_1
            bx      lr

Is this just because the A9's per-operand machine model is not implemented well?

By the way, why do you want to use the new machine model for mi-sched?
Thanks,
Kind regards,
Kuan-Hsu
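For readers comparing the two listings in the message above, the difference can be sketched at the C level. This is only an illustration and not the compiler's actual transformation: the first function is the loop from the thread, and the second is a hand-unrolled version that uses four separate temporaries and issues all loads before the multiplies and stores, roughly the shape of the itinerary-based schedule (d17 to d20). The names scale_rolled, scale_unrolled4, and check_scale_variants are hypothetical, the array size is reduced for a quick check, and the restrict qualifiers encode the assumption that the source and destination arrays do not overlap, as with the distinct static arrays b and c in foo.c.

```c
#include <assert.h>
#include <stddef.h>

#define N 1003  /* illustration only; deliberately not a multiple of 4 */

static double b_rolled[N], b_unrolled[N], c[N];

/* The loop as written in the thread's source. */
static void scale_rolled(double *restrict dst, const double *restrict src,
                         size_t n)
{
    const double scalar = 3.0;
    for (size_t j = 0; j < n; j++)
        dst[j] = scalar * src[j];
}

/* Hand-unrolled by 4 with one temporary per element. All four loads come
 * first, so no multiply or store has to wait on a single reused value.
 * Assumes dst and src do not overlap (hence restrict). */
static void scale_unrolled4(double *restrict dst, const double *restrict src,
                            size_t n)
{
    const double scalar = 3.0;
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        double t0 = src[j];
        double t1 = src[j + 1];
        double t2 = src[j + 2];
        double t3 = src[j + 3];
        t0 *= scalar;
        t1 *= scalar;
        t2 *= scalar;
        t3 *= scalar;
        dst[j]     = t0;
        dst[j + 1] = t1;
        dst[j + 2] = t2;
        dst[j + 3] = t3;
    }
    /* Remainder elements when n is not a multiple of 4. */
    for (; j < n; j++)
        dst[j] = scalar * src[j];
}

/* Runs both variants on the same input, asserts that every element
 * matches, and returns the number of elements compared. */
size_t check_scale_variants(void)
{
    for (size_t j = 0; j < N; j++)
        c[j] = (double)j * 0.5;
    scale_rolled(b_rolled, c, N);
    scale_unrolled4(b_unrolled, c, N);
    size_t checked = 0;
    for (size_t j = 0; j < N; j++) {
        assert(b_rolled[j] == b_unrolled[j]);
        checked++;
    }
    return checked;
}
```

Calling check_scale_variants() from a main function exercises both variants. Whether a given compiler keeps the four temporaries in separate registers still depends on its scheduler and register allocator, which is the very behaviour the thread is about.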