thr3ads.net - llvm dev - [llvm-dev] combineRepeatedFPDivisors design questions [Mar 2017]

If this information is useful, please help other people find it:
Share via:

vit9696 via llvm-dev

2017-Mar-06 12:11 UTC

[llvm-dev] combineRepeatedFPDivisors design questions

Hello,

Recently I have stumbled across an almost double performance decrease in one of
my codebases when compiling with clang instead of gcc. I was testing with the
currently latest 3.9.1 as well as 4.0.0 rc2 targeting x86-64 darwin with -Ofast,
and after some investigation was able to narrow down the code to the following
snippet:

__attribute__ ((noinline)) bool calculate(double c) {
	uint32_t b = 0;
	uint32_t a = 0;
	while (b < 0xFFFFFFFF) {
		if ((uint64_t)(b / c) % 2 == 0)
			a++;
		b++;
	}
	return a;
}

The culprit here is the division, which is not hoisted out of the loop and
replaced with a multiplication by a reciprocal. From what I understand the
current implementation of combineRepeatedFPDivisors responsible for floating
point division conversion does that given that the following preconditions are
met:
— unsafe math optimisations are enabled;
— the amount of the divisions by the same divisor in a single basic block is
above the threshold (2 for x86-64).

According to the IR the divisions are in two separate blocks, and I guess that
it must be the reason for the optimisation not taking place. This seems to be
proven by using an even loop count like 0xFFFFFFF0 to avoid a middle check. In
this case the optimisation works just fine. Here are the IR samples:
https://ghostbin.com/paste/r4td2 <https://ghostbin.com/paste/r4td2>.

At this point the thoughts arising on my side are:
— Should this be considered a bug and reported to llvm-bugs or a desired
behaviour?
— If this is considered a bug, how is one supposed to fix it? With the exact
sample provided the compiler could generally predict the amount of iterations
being over the threshold and optimise the output, however, the real world
examples (including mine) will unlikely be that simple. Loop body would be much
larger, the conditions would be more complicated, the divisions would likely
occur in several different basic blocks for various reasons, and I think in
general it is not possible to predict the iteration count.
— From what I can tell at least current gcc (6.3.0) and icc (17.0.2) are capable
of doing that, with both this snippet and even when the loop count depends on an
another argument, which means they are not capable to determine whether there
would only be 0 iterations or many more, however, the optimisation would still
be applied (https://ghostbin.com/paste/ttnrz
<https://ghostbin.com/paste/ttnrz>). Perhaps llvm could also be more
permissive when it sees a loop?

Best regards,
Vit

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170306/a51e8fe2/attachment.html>

Hal Finkel via llvm-dev

2017-Mar-06 19:34 UTC

head link

[llvm-dev] combineRepeatedFPDivisors design questions

Hi Vit,

Yes, please file a bug report. DAGCombiner::visitFDIV should call 
BuildReciprocalEstimate, and then MachineLICM should hoist the 
reciprocal-generation out of the loop. I imagine that the problem here 
is really in X86TargetLowering::getRecipEstimate, which says this:

   // It is likely not profitable to do this for f64 because a 
double-precision
   // reciprocal estimate with refinement on x86 prior to FMA requires
   // 15 instructions: convert to single, rcpss, convert back to double, 
refine
   // (3 steps = 12 insts). If an 'rcpsd' variant was added to the ISA
   // along with FMA, this could be a throughput win.

It seems like we need to be a bit smarter about this.

  -Hal

On 03/06/2017 06:11 AM, vit9696 via llvm-dev wrote:> Hello,
>
> Recently I have stumbled across an almost double performance decrease 
> in one of my codebases when compiling with clang instead of gcc. I was 
> testing with the currently latest 3.9.1 as well as 4.0.0 rc2 targeting 
> x86-64 darwin with -Ofast, and after some investigation was able to 
> narrow down the code to the following snippet:
>
> __attribute__ ((noinline)) bool calculate(double c) {
> uint32_t b = 0;
> uint32_t a = 0;
> while (b < 0xFFFFFFFF) {
> if ((uint64_t)(b / c) % 2 == 0)
> a++;
> b++;
> }
> return a;
> }
>
> The culprit here is the division, which is not hoisted out of the loop 
> and replaced with a multiplication by a reciprocal. From what I 
> understand the current implementation of combineRepeatedFPDivisors 
> responsible for floating point division conversion does that given 
> that the following preconditions are met:
> — unsafe math optimisations are enabled;
> — the amount of the divisions by the same divisor in a single basic 
> block is above the threshold (2 for x86-64).
>
> According to the IR the divisions are in two separate blocks, and I 
> guess that it must be the reason for the optimisation not taking 
> place. This seems to be proven by using an even loop count 
> like 0xFFFFFFF0 to avoid a middle check. In this case the optimisation 
> works just fine. Here are the IR samples: 
> https://ghostbin.com/paste/r4td2.
>
> At this point the thoughts arising on my side are:
> — Should this be considered a bug and reported to llvm-bugs or a 
> desired behaviour?
> — If this is considered a bug, how is one supposed to fix it? With the 
> exact sample provided the compiler could generally predict the amount 
> of iterations being over the threshold and optimise the output, 
> however, the real world examples (including mine) will unlikely be 
> that simple. Loop body would be much larger, the conditions would be 
> more complicated, the divisions would likely occur in several 
> different basic blocks for various reasons, and I think in general it 
> is not possible to predict the iteration count.
> — From what I can tell at least current gcc (6.3.0) and icc (17.0.2) 
> are capable of doing that, with both this snippet and even when the 
> loop count depends on an another argument, which means they are not 
> capable to determine whether there would only be 0 iterations or many 
> more, however, the optimisation would still be applied 
> (https://ghostbin.com/paste/ttnrz). Perhaps llvm could also be more 
> permissive when it sees a loop?
>
> Best regards,
> Vit
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170306/6d652794/attachment-0001.html>

vit9696 via llvm-dev

2017-Mar-06 21:27 UTC

head link

[llvm-dev] combineRepeatedFPDivisors design questions

Thank you for your participation. I have just created a new entry in bugzilla:
https://bugs.llvm.org/show_bug.cgi?id=32157

Vit
> 6 марта 2017 г., в 22:34, Hal Finkel <hfinkel at anl.gov> написал(а):
> 
> Hi Vit,
> 
> Yes, please file a bug report. DAGCombiner::visitFDIV should call
BuildReciprocalEstimate, and then MachineLICM should hoist the
reciprocal-generation out of the loop. I imagine that the problem here is really
in X86TargetLowering::getRecipEstimate, which says this:
> 
>   // It is likely not profitable to do this for f64 because a
double-precision
>   // reciprocal estimate with refinement on x86 prior to FMA requires
>   // 15 instructions: convert to single, rcpss, convert back to double,
refine
>   // (3 steps = 12 insts). If an 'rcpsd' variant was added to the
ISA
>   // along with FMA, this could be a throughput win.
> It seems like we need to be a bit smarter about this.
> 
>  -Hal
> On 03/06/2017 06:11 AM, vit9696 via llvm-dev wrote:
>> Hello,
>> 
>> Recently I have stumbled across an almost double performance decrease
in one of my codebases when compiling with clang instead of gcc. I was testing
with the currently latest 3.9.1 as well as 4.0.0 rc2 targeting x86-64 darwin
with -Ofast, and after some investigation was able to narrow down the code to
the following snippet:
>> 
>> __attribute__ ((noinline)) bool calculate(double c) {
>> 	uint32_t b = 0;
>> 	uint32_t a = 0;
>> 	while (b < 0xFFFFFFFF) {
>> 		if ((uint64_t)(b / c) % 2 == 0)
>> 			a++;
>> 		b++;
>> 	}
>> 	return a;
>> }
>> 
>> The culprit here is the division, which is not hoisted out of the loop
and replaced with a multiplication by a reciprocal. From what I understand the
current implementation of combineRepeatedFPDivisors responsible for floating
point division conversion does that given that the following preconditions are
met:
>> — unsafe math optimisations are enabled;
>> — the amount of the divisions by the same divisor in a single basic
block is above the threshold (2 for x86-64).
>> 
>> According to the IR the divisions are in two separate blocks, and I
guess that it must be the reason for the optimisation not taking place. This
seems to be proven by using an even loop count like 0xFFFFFFF0 to avoid a middle
check. In this case the optimisation works just fine. Here are the IR samples:
https://ghostbin.com/paste/r4td2 <https://ghostbin.com/paste/r4td2>.
>> 
>> At this point the thoughts arising on my side are:
>> — Should this be considered a bug and reported to llvm-bugs or a
desired behaviour?
>> — If this is considered a bug, how is one supposed to fix it? With the
exact sample provided the compiler could generally predict the amount of
iterations being over the threshold and optimise the output, however, the real
world examples (including mine) will unlikely be that simple. Loop body would be
much larger, the conditions would be more complicated, the divisions would
likely occur in several different basic blocks for various reasons, and I think
in general it is not possible to predict the iteration count.
>> — From what I can tell at least current gcc (6.3.0) and icc (17.0.2)
are capable of doing that, with both this snippet and even when the loop count
depends on an another argument, which means they are not capable to determine
whether there would only be 0 iterations or many more, however, the optimisation
would still be applied (https://ghostbin.com/paste/ttnrz
<https://ghostbin.com/paste/ttnrz>). Perhaps llvm could also be more
permissive when it sees a loop?
>> 
>> Best regards,
>> Vit
>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> 
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170307/166efff7/attachment.html>

Seemingly Similar Threads

Search for more seemingly similar threads

llvm dev - Mar 2017 - combineRepeatedFPDivisors design questions

[llvm-dev] combineRepeatedFPDivisors design questions

[llvm-dev] combineRepeatedFPDivisors design questions

[llvm-dev] combineRepeatedFPDivisors design questions

Seemingly Similar Threads