thr3ads.net - llvm dev - [llvm-dev] Comparing Clang and GCC: only clang stores updated value in each iteration. [Sep 2018]

Jonas Paulsson via llvm-dev

2018-Sep-20 13:52 UTC

[llvm-dev] Comparing Clang and GCC: only clang stores updated value in each iteration.

Hi,

I have a benchmark (mcf) that is currently slower when compiled with 
clang compared to gcc 8 (~10%). It seems that a hot loop has a few 
differences, where one interesting one is that while clang stores an 
incremented value in each iteration, gcc waits and just stores the final 
value just once after the loop. The value is a global variable.

I wonder if this is something clang does not do per default but can be 
activated, similarly to the fp-contract situation?

If not, is this a deficiency in clang? What pass should handle this? 
IndVarSimplify?

I have made a reduced test case which shows the same difference between 
the compilers: clang adds 1 and stores it back to 'a' in each iteration,
while gcc instead figures out the value a has after the loop (0) and 
stores it then once to 'a'.

/Jonas


int a = 1;
void b() {
   do
     if (a)
       a++;
   while (a != 0);
}

bin/clang -O3 -march=z13 -mllvm -unroll-count=1

         .text
         .file   "testfun.i"
         .globl  b                       # -- Begin function b
         .p2align        4
         .type   b,@function
b:                                      # @b
# %bb.0:                                # %entry
         lrl     %r0, a
.LBB0_1:                                # %do.body
                                         # =>This Inner Loop Header: Depth=1
         cije    %r0, 0, .LBB0_3
# %bb.2:                                # %if.then
                                         #   in Loop: Header=BB0_1 Depth=1
         ahi     %r0, 1
         strl    %r0, a
.LBB0_3:                                # %do.cond
                                         #   in Loop: Header=BB0_1 Depth=1
         cijlh   %r0, 0, .LBB0_1
# %bb.4:                                # %do.end
         br      %r14
.Lfunc_end0:
         .size   b, .Lfunc_end0-b
                                         # -- End function
         .type   a,@object                  # @a
         .data
         .globl  a
         .p2align        2
a:
         .long   1                       # 0x1
         .size   a, 4


gcc -O3 -march=z13:

         .file   "testfun.i"
         .machinemode zarch
         .machine "z13"
.text
         .align  8
.globl b
         .type   b, @function
b:
.LFB0:
         .cfi_startproc
         larl    %r1,a
         lt      %r1,0(%r1)
         je      .L1
         larl    %r1,a
         mvhi    0(%r1),0
.L1:
         br      %r14
         .cfi_endproc
.LFE0:
         .size   b, .-b
.globl a
.data
         .align  4
         .type   a, @object
         .size   a, 4
a:
         .long   1
         .ident  "GCC: (GNU) 8.0.1 20180324 (Red Hat 8.0.1-0.20)"
         .section        .note.GNU-stack,"",@progbits
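
For readers less familiar with z13 assembly, the two listings above can be paraphrased in C roughly as follows. This is only a reading aid, not compiler output: the names a_model, store_count, b_store_each_iteration, b_store_once and check_same_result are made up for illustration, and an unsigned variable stands in for 'a' so that the wraparound to 0 is well defined.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical unsigned stand-in for the global 'a'. */
static unsigned a_model;
/* Counts how many times a_model is written (illustration only). */
static unsigned long store_count;

/* Paraphrase of the clang listing: one load before the loop,
 * one store in every iteration that increments. */
static void b_store_each_iteration(void) {
    unsigned r0 = a_model;        /* lrl   %r0, a           */
    do {
        if (r0 != 0) {            /* cije  %r0, 0, .LBB0_3  */
            r0 += 1;              /* ahi   %r0, 1           */
            a_model = r0;         /* strl  %r0, a           */
            store_count++;
        }
    } while (r0 != 0);            /* cijlh %r0, 0, .LBB0_1  */
}

/* Paraphrase of the gcc listing: test 'a' once and, if it is
 * non-zero, store the known final value 0 a single time. */
static void b_store_once(void) {
    if (a_model != 0) {           /* lt / je .L1            */
        a_model = 0;              /* mvhi 0(%r1),0          */
        store_count++;
    }
}

/* Runs both paraphrases from the same start value and checks that
 * they leave the same final value behind. Pick a start value of 0
 * or one close to UINT_MAX to keep the first loop short. */
static void check_same_result(unsigned start) {
    a_model = start;
    b_store_each_iteration();
    unsigned per_iteration_result = a_model;

    a_model = start;
    b_store_once();
    assert(per_iteration_result == a_model);
}
```

Both versions leave the same final value in the variable; the difference is only in how many stores are issued on the way there.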

Friedman, Eli via llvm-dev

2018-Sep-20 19:34 UTC

On 9/20/2018 6:52 AM, Jonas Paulsson via llvm-dev wrote:
> Hi,
>
> I have a benchmark (mcf) that is currently slower when compiled with 
> clang compared to gcc 8 (~10%). It seems that a hot loop has a few 
> differences, where one interesting one is that while clang stores an 
> incremented value in each iteration, gcc waits and just stores the 
> final value just once after the loop. The value is a global variable.
>
> I wonder if this is something clang does not do per default but can be 
> activated, similarly to the fp-contract situation?
>
> If not, is this a deficiency in clang? What pass should handle this? 
> IndVarSimplify?
See http://lists.llvm.org/pipermail/llvm-dev/2018-September/126064.html .
>
> I have made a reduced test case which shows the same difference 
> between the compilers: clang adds 1 and stores it back to 'a' in each
> iteration, while gcc instead figures out the value a has after the 
> loop (0) and stores it then once to 'a'.
Your testcase is a bit weird because the condition of the while loop is 
the same as the condition of the if statement.  Is that really what the 
original loop looks like?

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project

Philip Reames via llvm-dev

2018-Sep-20 23:38 UTC

On 09/20/2018 06:52 AM, Jonas Paulsson via llvm-dev wrote:
> Hi,
>
> I have a benchmark (mcf) that is currently slower when compiled with 
> clang compared to gcc 8 (~10%). It seems that a hot loop has a few 
> differences, where one interesting one is that while clang stores an 
> incremented value in each iteration, gcc waits and just stores the 
> final value just once after the loop. The value is a global variable.
>
> I wonder if this is something clang does not do per default but can be 
> activated, similarly to the fp-contract situation?
>
> If not, is this a deficiency in clang? What pass should handle this? 
> IndVarSimplify?
>
> I have made a reduced test case which shows the same difference 
> between the compilers: clang adds 1 and stores it back to 'a' in each
> iteration, while gcc instead figures out the value a has after the 
> loop (0) and stores it then once to 'a'.
>
> /Jonas
>
>
> int a = 1;
> void b() {
>   do
>     if (a)
>       a++;
>   while (a != 0);
> }

I think your example may be a bit over-reduced.  Unless I'm misreading 
this, a starts at 1, is incremented once each iteration, and then is 
tested against zero.  The only way this loop can exit is if a has 
wrapped around, and C++ states that signed integers are assumed not to 
overflow.  We can/should be replacing the whole loop with an unreachable.

Do we still fail to optimize if either a) you use an unsigned which has 
defined overflow or b) you use a non-zero exit test?  That is, change 
the example to something like:
int a = 1;
void b() {
   do
     if (a)
       a++;
   while (a != 500);
}

If so, then yes, this is probably a case where the aggressive LoopPRE 
mentioned in the other thread that Eli linked to would be useful.  Once 
we'd done the PRE, then everything else should collapse.
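
As a rough illustration of the idea, the effect of hoisting the load and sinking the store can be written out by hand for the 500 variant. This is a sketch only, not the output of any LLVM pass; the names a_var, b_original, b_promoted and run_both_from are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for the global 'a' in the example. */
static int a_var = 1;

/* Source form: 'a' is read and written through memory in the loop. */
static void b_original(void) {
    do
        if (a_var)
            a_var++;
    while (a_var != 500);
}

/* Hand-promoted form: load once, work on a local, store once.
 * In this particular loop every path to the exit has incremented
 * at least once, so the single final store does not add a write
 * that the source form would not have performed. */
static void b_promoted(void) {
    int tmp = a_var;          /* single load before the loop */
    do
        if (tmp)
            tmp++;
    while (tmp != 500);
    a_var = tmp;              /* single store after the loop */
}

/* Runs both forms from the same start value and checks that they
 * agree. The start value must lie strictly between 0 and 500:
 * 0 never terminates, and 500 or above runs into signed overflow. */
static void run_both_from(int start) {
    assert(start > 0 && start < 500);

    a_var = start;
    b_original();
    int original_result = a_var;

    a_var = start;
    b_promoted();
    assert(original_result == a_var);
}
```

With the memory accesses out of the loop body, the remaining loop only manipulates a local, which is the point at which the rest could plausibly fold down to storing the constant 500.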

Jonas Paulsson via llvm-dev

2018-Sep-21 07:15 UTC

Hi Philip and Eli,

> I think your example may be a bit over reduced.  Unless I'm misreading 
> this, a starts at 1, is incremented one each iteration, and then is 
> tested against zero.  The only way this loop can exit is if a has 
> wrapped around and C++ states that signed integers are assumed to not 
> overflow.  We can/should be replacing the whole loop with an unreachable.
>
> Do we still fail to optimize if either a) you use an unsigned which 
> has defined overflow or b) you use a non-zero exit test? That is, 
> change the example to something like:
> int a = 1;
> void b() {
>   do
>     if (a)
>       a++;
>   while (a != 500);
> }
Yes. Whether I change 'a' to unsigned or replace the exit test with 
500, clang stores in each iteration while gcc does not.

> (Eli) Your testcase is a bit weird because the condition of the while 
> loop is the same as the condition of the if statement.  Is that really 
> what the original loop looks like?

No, not really, the reduced one just shows the difference between gcc 
and clang. There were some variations to this, but I chose this since it 
gave a very small output. Sorry if it was confusing.
>
> If so, then yes, this is probably a case where the aggressive LoopPRE 
> mentioned in the other thread that Eli linked to would be useful.  
> Once we'd done the PRE, then everything else should collapse.

Thanks for the link, it's good to know this issue is recognized. If I 
understand it correctly, the reason clang stores in each iteration 
is concurrency. As a newbie I wonder how this works in practice: 
even if the value is stored in each iteration, two threads could 
still do this simultaneously unless some sort of atomic operation is 
used, right? What happens here is that the value of 'a' is loaded 
once before the loop, then incremented and stored in each iteration. How 
does that help with multiple threads compared to storing it after the loop?

Is there an option to change this behavior in gcc or clang? It seems 
that gcc is assuming a single thread, while clang is not. It would be 
nice to have the same setting here when comparing them. Or am I missing 
something?

Thanks

Jonas
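
One common reading of the concurrency concern, not confirmed anywhere in this thread, is that the per-iteration stores are not meant to make the loop thread-safe. Under the C11 memory model a compiler may not introduce a write to a shared location on a path where the source program performs none, and sinking a conditional store out of a loop can do exactly that. The sketch below counts writes to show the difference; shared_counter, write_count and the three bump_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

static int shared_counter;   /* hypothetical global, possibly shared */
static int write_count;      /* counts writes, for illustration only */

/* Source form: writes only in iterations where the flag is set. */
static void bump_conditionally(const bool *flags, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (flags[i]) {
            shared_counter++;
            write_count++;
        }
    }
}

/* Naive promotion: always writes back after the loop, so it writes
 * even when no flag was set. That extra store is one the source
 * program never performed. */
static void bump_promoted_naive(const bool *flags, int n) {
    assert(n >= 0);
    int tmp = shared_counter;
    for (int i = 0; i < n; i++)
        if (flags[i])
            tmp++;
    shared_counter = tmp;
    write_count++;
}

/* Guarded promotion: writes back only if the source form would have
 * written at least once. */
static void bump_promoted_guarded(const bool *flags, int n) {
    assert(n >= 0);
    int tmp = shared_counter;
    bool dirty = false;
    for (int i = 0; i < n; i++) {
        if (flags[i]) {
            tmp++;
            dirty = true;
        }
    }
    if (dirty) {
        shared_counter = tmp;
        write_count++;
    }
}
```

The gcc listing for the reduced test case looks like the guarded form: it tests 'a' first (lt / je .L1) and only stores when the loop would have incremented.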