thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long [Apr 2017]

Kaylor, Andrew via llvm-dev

2017-Apr-21 00:30 UTC

[llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long

I think it’s generally true that whenever branches can reliably be predicted
branching is faster than a cmov that involves speculative execution, and I would
guess that your assessment regarding looping on input values is probably
correct.

I believe the code that actually creates most of the transformation you’re
interested in here is in SelectionDAGLegalize::ExpandNode() in LegalizeDAG.cpp. 
The X86 backend sets a table entry indicating that FP_TO_UINT should be expanded
for these value types, but the actual expansion is in target-independent code. 
This is what it looks like in the version I last fetched:

  case ISD::FP_TO_UINT: {
    SDValue True, False;
    EVT VT =  Node->getOperand(0).getValueType();
    EVT NVT = Node->getValueType(0);
    APFloat apf(DAG.EVTToAPFloatSemantics(VT),
                APInt::getNullValue(VT.getSizeInBits()));
    APInt x = APInt::getSignBit(NVT.getSizeInBits());
    (void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);
    Tmp1 = DAG.getConstantFP(apf, dl, VT);
    Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),
                        Node->getOperand(0),
                        Tmp1, ISD::SETLT);
    True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));
    // TODO: Should any fast-math-flags be set for the FSUB?
    False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,
                        DAG.getNode(ISD::FSUB, dl, VT,
                                    Node->getOperand(0), Tmp1));
    False = DAG.getNode(ISD::XOR, dl, NVT, False,
                        DAG.getConstant(x, dl, NVT));
    Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);
    Results.push_back(Tmp1);
    break;
  }

The tricky bit here is that this code is asking for a Select and then something
else will decide whether that select should be implemented as a branch or a
cmov.
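
For anyone following along, the data flow of that expansion can be modelled in plain C++. This is only a sketch under stated assumptions: `fp_to_sint_x86` and `expand_fp_to_uint` are hypothetical names, and the helper imitates the x86 behaviour of returning 0x8000000000000000 for out-of-range inputs so that both arms can be evaluated without undefined behaviour. It models result values only, not the FP status flags.

```cpp
#include <cassert>
#include <cstdint>

// 2^63: the constant the expansion builds from the sign bit of the result type.
constexpr float kTwo63 = 9223372036854775808.0f;
constexpr std::uint64_t kSignBit = 0x8000000000000000ull;

// Hypothetical stand-in for FP_TO_SINT. Out-of-range and NaN inputs yield
// 0x8000000000000000 (an assumption imitating x86 cvttss2si), so the cast
// below is never undefined.
std::int64_t fp_to_sint_x86(float f) {
    if (!(f >= -kTwo63 && f < kTwo63)) return INT64_MIN;
    return static_cast<std::int64_t>(f);
}

// Mirrors the node structure: SETLT, two FP_TO_SINT arms, FSUB, XOR, SELECT.
std::uint64_t expand_fp_to_uint(float f) {
    assert(f >= 0.0f && f < 2.0f * kTwo63);  // the sketch only covers [0, 2^64)
    bool lt = f < kTwo63;                                       // Tmp2 (SETLT)
    std::uint64_t t =
        static_cast<std::uint64_t>(fp_to_sint_x86(f));           // True arm
    std::uint64_t fl =
        static_cast<std::uint64_t>(fp_to_sint_x86(f - kTwo63))   // FSUB + FP_TO_SINT
        ^ kSignBit;                                              // XOR
    return lt ? t : fl;                                          // select
}
```

Note that both arms, including the subtraction, are computed before the select picks one of them. That is the shape a cmov lowering preserves.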

Circling back to what I said earlier, while I stand by the idea that we
shouldn’t sacrifice performance just to maintain FP status bits, you’ve got me
warming up to the idea that maybe an FP status safe implementation of this might
be faster in most cases.  It’s at least worth taking some measurements to see if
it is faster.

-Andy

From: Michael Clark [mailto:michaeljclark at mac.com]
Sent: Thursday, April 20, 2017 4:02 PM
To: Kaylor, Andrew <andrew.kaylor at intel.com>
Cc: Flamedoge <code.kchoi at gmail.com>; llvm-dev <llvm-dev at
lists.llvm.org>
Subject: Re: [llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion
from float to unsigned long long

On 21 Apr 2017, at 8:50 AM, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:
> This seems like it was done for perf reason (mispredict). Conditional-to-cmov
> transformation should keep from introducing additional observable
> side-effects, and it's clear that whatever did this did not account for
> floating point exception.
That’s a very reasonable statement, but I’m not sure it corresponds to the way
we have typically approached this sort of problem.

Just out of interest, I ran the two code sequences (branch vs predicate logic)
through Intel Architecture Code Analyzer:

- https://gist.github.com/michaeljclark/84f9e411ab959073996e10bff6ab8ec0

I am not sure how accurate the cycle estimate is for the GCC code sequence as it
says “possibly”. It would appear that if the branch is predicted in a loop
of floating point values that are < LLONG_MAX then GCC would win. A mix of
values > LLONG_MAX and values <= LLONG_MAX would fool the branch predictor.

In particular, there has been a de facto standard practice (and it has even been
openly stated and agreed upon once or twice) of assuming default rounding mode
and ignoring FP exceptions in the default (and currently only) optimization path
for FP-related instructions.  That is, clang/LLVM didn’t just not support
FENV_ACCESS by indifference but rather we have made conscious decisions to allow
transformations that violate the needs of FENV_ACCESS when doing so can improve
the performance of generated code.  Basically, we more or less pretend that
floating point status bits don’t exist (at least before you get to the
target-specific backend).

It is a shame as Clang/LLVM does surprisingly well at honouring floating point
accrued exceptions on x86_64 as is (see below).

As far as I can tell, besides the eager constant folding optimisation, the only
issues I am facing are with the float to u64 and double to u64 intrinsics.

I have expressed the RISC-V floating point ISA in a C notation, which I use to
build a RISC-V ISA simulator, and the floating point accrued exceptions are
tested for the complete set of RISC-V FPU operations. I’ve been using the RISC-V
ISA test suite essentially to test gcc and clang floating point accrued
exception state and the test suite has relatively complete accrued exception
state tests after any floating point operations that update the accrued
exception state. Clang does very well, with the exception of float to u64 and
double to u64. They are currently the only two operations that fail the RISC-V
QEMU tests, which is essentially testing all of the compiler floating point
intrinsics.

https://github.com/michaeljclark/riscv-meta/blob/ea306062bfd2f60a229daf6b04826cdeb2dfbe9d/meta/opcode-pseudocode-c#L132-L204

These are the tests that I am using to exercise clang and gcc FENV_ACCESS along
with the C code in the link above. riscv-qemu-tests is a user-mode port of the
RISC-V ISA conformance test suite.

- https://github.com/arsv/riscv-qemu-tests/tree/master/rv64f
- https://github.com/arsv/riscv-qemu-tests/tree/master/rv64d

The tests are not exhaustive. There are some special cases with rounding modes
that are not tested and the code is isolated from certain optimisations and is
essentially testing just the intrinsics. That said, it makes sense for the
intrinsics to correctly set the floating point accrued exception state to allow
any more complete handling of optimisation around floating point operations.

In any case I need these conversions to work on compilers today so I came up
with this work-around:

- https://godbolt.org/g/Ez8RZA

With the workaround above I am able to pass all of the RISC-V tests, however I
know of some corner cases that are not tested. Interestingly float and double to
unsigned need to round or clamp negative values to zero. I guess this is
undefined behaviour in C.
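
For illustration, a branch-based conversion along the lines of the GCC sequence could look like the sketch below. This is not the contents of the godbolt link above. `fcvt_lu_branch` is a hypothetical name, the clamp of negative inputs to zero is an assumption modelled on the RISC-V behaviour just mentioned (value only, no flags), and NaN or inputs of 2^64 and above are treated as a precondition violation rather than handled.

```cpp
#include <cassert>
#include <cstdint>

constexpr float kTwo63 = 9223372036854775808.0f;  // 2^63
constexpr std::uint64_t kSignBit = 0x8000000000000000ull;

// Branch-based float -> u64 conversion. Only the arm that is actually needed
// is executed, so the subtraction never runs for inputs below 2^63.
std::uint64_t fcvt_lu_branch(float f) {
    assert(f < 2.0f * kTwo63);  // precondition: not NaN and below 2^64
    if (f <= -1.0f) return 0;   // assumed clamp for negative inputs
    if (f < kTwo63) {
        // Values in (-1, 0) truncate to 0 here, which is well defined.
        return static_cast<std::uint64_t>(static_cast<std::int64_t>(f));
    }
    // For f in [2^63, 2^64) the subtraction is exact, so it cannot raise INEXACT.
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(f - kTwo63)) ^ kSignBit;
}
```

The cast goes through a signed 64-bit integer on purpose, so the compiler has no reason to emit its own unsigned expansion.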

You’ll find that the X86 backend doesn’t even model MXCSR at the moment.  I
tried to add it recently and it kind of blew up before I had even modeled it for
anything other than LDMXCSR and STMXCSR.  We may want to address that at some
point, but right now it just isn’t there.

When we discussed how FENV_ACCESS support should be implemented, Chandler
proposed that when restricted modes (whether FENV_ACCESS or any other front
end-specific analogous behavior) were not being used the optimizer should be
able to behave as described above and that nothing done to support restricted FP
behavior should complicate or restrict the default optimizer behavior.  This was
met with general agreement at the time.

I mention all this as prologue to saying that while we should do something to
get FPToUI lowered without incorrectly setting FP exception status bits, it
isn’t necessarily what we want as the default behavior.

I would also like to add that I personally am very pleased that you discovered
this issue and have gotten as far as you have in the analysis of the problem. 
I’m in the process of adding constrained versions of various FP intrinsics (I
have a patch ready to be sent out today) and what I’ve done up to now has been
to simply translate the constrained operations into their traditional
representations somewhere in the ISel process.  I was aware that something would
need to be done in the codegen space to continue protecting these operations,
but I was kind of hoping that the actual instruction selection would be
reasonably safe.  FWIW, FPToUI is not one of the parts my pending patch
addresses.

-Andy

From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Flamedoge via llvm-dev
Sent: Wednesday, April 19, 2017 10:14 AM
To: Michael Clark <michaeljclark at mac.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion
from float to unsigned long long
> Are we better off using branches instead of cmove to implement FP to unsigned i64?

This seems like it was done for perf reason (mispredict). Conditional-to-cmov
transformation should keep from introducing additional observable side-effects,
and it's clear that whatever did this did not account for floating point
exception.

On Wed, Apr 19, 2017 at 10:01 AM, Michael Clark via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
Changing the list from cfe-dev to llvm-dev

On 20 Apr 2017, at 4:52 AM, Michael Clark <michaeljclark at mac.com> wrote:

I’m getting close. I think it may be an issue with an individual intrinsic. I’m
looking for the X86 lowering of Instruction::FPToUI.

I found a comment around the rationale for using a conditional move versus a
branch. I believe the predicate logic using a conditional move is causing
INEXACT to be set from the other side of the predicate as the lowered x86_64
code executes both conversions whereas GCC uses a branch. That seems to be the
difference.

I can’t find FPToUI in llvm/lib/Target/X86 so I’m trying to figure out what the
cast gets renamed to in the target layer so I can find where the sequence is
emitted.

$ more llvm/lib/Target/X86/README-X86-64.txt
…
Are we better off using branches instead of cmove to implement FP to
unsigned i64?

_conv:
        ucomiss LC0(%rip), %xmm0
        cvttss2siq      %xmm0, %rdx
        jb      L3
        subss   LC0(%rip), %xmm0
        movabsq $-9223372036854775808, %rax
        cvttss2siq      %xmm0, %rdx
        xorq    %rax, %rdx
L3:
        movq    %rdx, %rax
        ret

instead of

_conv:
        movss LCPI1_0(%rip), %xmm1
        cvttss2siq %xmm0, %rcx
        movaps %xmm0, %xmm2
        subss %xmm1, %xmm2
        cvttss2siq %xmm2, %rax
        movabsq $-9223372036854775808, %rdx
        xorq %rdx, %rax
        ucomiss %xmm1, %xmm0
        cmovb %rcx, %rax
        ret

On 19 Apr 2017, at 2:10 PM, Michael Clark <michaeljclark at mac.com> wrote:

On 19 Apr 2017, at 1:14 PM, Tim Northover <t.p.northover at gmail.com> wrote:

On 18 April 2017 at 15:54, Michael Clark via cfe-dev
<cfe-dev at lists.llvm.org> wrote:

The only way towards completing a milestone is via fixing a number of small
issues along
the way…

I believe there's more to it than that. None of LLVM's optimizations
are aware of this extra side-channel of information (with possible
exceptions like avoiding speculating fdiv because of unavoidable
exceptions).

From what I remember, the real proposal is to replace all
floating-point IR with intrinsics when FENV_ACCESS is on, which the
optimizers by default won't have a clue about and will treat
conservatively (essentially like they're modifying external memory).

So be careful with drawing conclusions from small snippets; you're
probably not seeing the full range of LLVM's behaviour.

Yes. I’m sure.

It reproduces with just the cast on its own: https://godbolt.org/g/myUoL2

It appears to be in the LLVM lowering of the fptoui intrinsic, so it must be in
the MC layer optimisations.

; Function Attrs: noinline nounwind uwtable
define i64 @_Z7fcvt_luf(float %f) #0 {
  %1 = alloca float, align 4
  store float %f, float* %1, align 4
  %2 = load float, float* %1, align 4
  %3 = fptoui float %2 to i64
  ret i64 %3
}
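
The IR above corresponds to a one-line cast (the mangled name `_Z7fcvt_luf` is `fcvt_lu(float)`). A reproducer along the following lines can be used to check the accrued flag. This is a sketch, not the code behind the godbolt link: `cast_sets_inexact` is a hypothetical helper, and the `volatile` temporaries are an assumption to keep the conversion from being folded at compile time.

```cpp
#include <cassert>
#include <cfenv>
#include <cstdint>

// The cast on its own, as in the IR above.
unsigned long long fcvt_lu(float f) {
    return static_cast<unsigned long long>(f);
}

// Hypothetical helper: returns true if converting `f` left FE_INEXACT set.
bool cast_sets_inexact(float f) {
    volatile float in = f;  // forces the conversion to happen at run time
    std::feclearexcept(FE_ALL_EXCEPT);
    assert(std::fetestexcept(FE_ALL_EXCEPT) == 0);  // flags start out clear
    volatile unsigned long long out = fcvt_lu(in);
    (void)out;
    return std::fetestexcept(FE_INEXACT) != 0;
}
```

For an input such as 1.5f the flag is expected on any compiler, since the conversion really is inexact. The interesting case is an exactly representable input such as 1.0f, where the result depends on which lowering the compiler chose.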

GCC performs a comparison with ucomiss and branches whereas Clang computes both
forms and predicates the result using a conditional move. One of the conversions
obviously is setting the INEXACT MXCSR flag.

Clang lowering (inexact set when result is exact):

fcvt_lu(float):
        movss   xmm1, dword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero,zero,zero
        movaps  xmm2, xmm0
        subss   xmm2, xmm1
        cvttss2si       rax, xmm2
        movabs  rcx, -9223372036854775808
        xor     rcx, rax
        cvttss2si       rax, xmm0
        ucomiss xmm0, xmm1
        cmovae  rax, rcx
        ret

GCC lowering (sets flags correctly):

fcvt_lu(float):
        ucomiss xmm0, DWORD PTR .LC0[rip]
        jnb     .L4
        cvttss2si       rax, xmm0
        ret
.L4:
        subss   xmm0, DWORD PTR .LC0[rip]
        movabs  rdx, -9223372036854775808
        cvttss2si       rax, xmm0
        xor     rax, rdx
        ret

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Michael Clark via llvm-dev

2017-Apr-21 02:23 UTC

[llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long

> On 21 Apr 2017, at 12:30 PM, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:
>
> I think it’s generally true that whenever branches can reliably be predicted
> branching is faster than a cmov that involves speculative execution, and I
> would guess that your assessment regarding looping on input values is
> probably correct.
>
> I believe the code that actually creates most of the transformation you’re
> interested in here is in SelectionDAGLegalize::ExpandNode() in
> LegalizeDAG.cpp. The X86 backend sets a table entry indicating that
> FP_TO_UINT should be expanded for these value types, but the actual expansion
> is in target-independent code. This is what it looks like in the version I
> last fetched:
>
>   case ISD::FP_TO_UINT: {
>     SDValue True, False;
>     EVT VT =  Node->getOperand(0).getValueType();
>     EVT NVT = Node->getValueType(0);
>     APFloat apf(DAG.EVTToAPFloatSemantics(VT),
>                 APInt::getNullValue(VT.getSizeInBits()));
>     APInt x = APInt::getSignBit(NVT.getSizeInBits());
>     (void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);
>     Tmp1 = DAG.getConstantFP(apf, dl, VT);
>     Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),
>                         Node->getOperand(0),
>                         Tmp1, ISD::SETLT);
>     True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));
>     // TODO: Should any fast-math-flags be set for the FSUB?
>     False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,
>                         DAG.getNode(ISD::FSUB, dl, VT,
>                                     Node->getOperand(0), Tmp1));
>     False = DAG.getNode(ISD::XOR, dl, NVT, False,
>                         DAG.getConstant(x, dl, NVT));
>     Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);
>     Results.push_back(Tmp1);
>     break;
>   }
>
> The tricky bit here is that this code is asking for a Select and then
> something else will decide whether that select should be implemented as a
> branch or a cmov.

Good. I had found ISD::FP_TO_UINT but had not found the target-independent code
as I was digging in llvm/lib/Target/X86. I had in fact just started looking at
the target-independent code after realising it was likely not target specific.
This issue could potentially affect any hard float target with IEEE-754 accrued
exceptions and conditional moves as the unconditional FSUB will set INEXACT.
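
To make the point about the unconditional FSUB concrete, here is a small sketch that checks each half of the predicated sequence separately. The helper names are hypothetical, and the `volatile` temporaries are an assumption to force the operations to happen at run time. For an input of 1.0f the signed conversion is exact, but 1.0f - 2^63 is not representable in a float, so the subtraction is the operation that raises the flag.

```cpp
#include <cassert>
#include <cfenv>
#include <cstdint>

// True if computing f - 2^63 at run time raises FE_INEXACT.
bool fsub_sets_inexact(float f) {
    volatile float x = f;
    volatile float bias = 9223372036854775808.0f;  // 2^63
    std::feclearexcept(FE_ALL_EXCEPT);
    assert(std::fetestexcept(FE_ALL_EXCEPT) == 0);  // flags start out clear
    volatile float diff = x - bias;
    (void)diff;
    return std::fetestexcept(FE_INEXACT) != 0;
}

// True if the signed conversion of f raises FE_INEXACT.
// Precondition (assumed): |f| < 2^63, so the cast is well defined.
bool sint_conv_sets_inexact(float f) {
    volatile float x = f;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile std::int64_t r = static_cast<std::int64_t>(x);
    (void)r;
    return std::fetestexcept(FE_INEXACT) != 0;
}
```

With inputs at or above 2^63 the subtraction is exact, which is why a branch that only subtracts on that path leaves the flag alone.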

I can see comments in lib/Target/X86/X86ISelLowering.cpp LowerSELECT regarding
selection of branch or cmov and wonder if the DAG can be matched there or
whether the fix is in target-independent code.

It seems like a SELECT node with any sufficiently large number of child nodes
should use a branch instead of a conditional move. I wonder about the cost model
for predicate logic and cmov. Modern branch predictors are actually pretty good,
so if LLVM X86 is using predication when the cost of a branch is less it could
result in a loss of performance. I’m now curious about the more general
possibility of controlling whether SELECT is lowered to branches or predication
using cmov. Can this be controlled? Anecdotally, the RISC-V CPU architects
recommend branches over predicate logic as in their case (Rocket) branch
mis-predict is only 3 cycles.

BTW - semi off-topic. The RISC-V interpreter I am working on seems to be a
pathological test case for the LLVM/Clang optimiser (-O3) compared with GCC
(-O3) with LLVM/Clang producing code that runs nearly twice as slow as GCC. I
don’t know exactly what I’ve done for this to happen; too many switch statements
I suspect. Branchy code versus predication perhaps? Branchiness might also
explain GCC’s lead on SciMark Monte Carlo assuming Monte Carlo is branchy. Now I
am guessing, although after some googling I see that clang generates x86_64 asm
that prefers predication versus branches in gcc. Note this CPU simulator test
requires the RISC-V GCC toolchain to be installed.

Here is a step by step for anyone interested in a pathological optimiser test
case for Clang:

- https://github.com/riscv/riscv-gnu-toolchain/
- https://github.com/michaeljclark/riscv-meta/

$ git clone https://github.com/riscv/riscv-gnu-toolchain.git
$ git clone https://github.com/michaeljclark/riscv-meta.git
$ cd riscv-gnu-toolchain
$ export RISCV=/opt/riscv-gnu-toolchain
$ ./configure --prefix=$RISCV
$ make
$ cd ..
$ cd riscv-meta
$ git submodule update --init --recursive
$ export RISCV=/opt/riscv-gnu-toolchain
$ make -j4 CXX=g++ V=1
$ make test-build
$ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m28.280s
user	0m28.280s
sys	0m0.000s

$ make clean
$ make -j4 CXX=clang++-3.9 V=1
$ make test-build
$ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m52.533s
user	0m52.532s
sys	0m0.000s

$ g++ --version
g++ (Debian 6.3.0-6) 6.3.0 20170205
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ clang++-3.9 --version
clang version 3.9.0-6 (tags/RELEASE_390/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

There is also a RISC-V -> x86_64 JIT engine (x86_64 JIT currently for the
RISC-V integer ISA, hard float coming soon…):

$ time ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m0.838s
user	0m0.840s
sys	0m0.000s

Clang and GCC produce typical native code that performs the same.

$ clang -O3 src/test/test-sha512.c -o test-sha512
$ time ./test-sha512 
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m0.285s
user	0m0.280s
sys	0m0.004s

$ gcc -O3 src/test/test-sha512.c -o test-sha512 
$ time ./test-sha512 
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m0.285s
user	0m0.284s
sys	0m0.000s

Michael.

Michael Clark via llvm-dev

2017-Apr-21 07:30 UTC

[llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long

> On 21 Apr 2017, at 2:23 PM, Michael Clark <michaeljclark at mac.com> wrote:
>
>> On 21 Apr 2017, at 12:30 PM, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:
>>
>> I think it’s generally true that whenever branches can reliably be predicted
>> branching is faster than a cmov that involves speculative execution, and I
>> would guess that your assessment regarding looping on input values is
>> probably correct.

Yes, it’s based on an assumption that val <= LLONG_MAX, i.e. that the branch is
predicted successfully, which may not always be the case. To break branch
prediction, though, would require an unpredictable sequence of values
<= LLONG_MAX and > LLONG_MAX. I was curious and microbenchmarked it:

- https://godbolt.org/g/ytgk7l

Best of 10 runs on a MacBookPro Ivy Bridge Intel Core i7-3740QM

$ time fcvt-branch

real	0m0.208s
user	0m0.201s
sys	0m0.002s

$ time fcvt-cmov

real	0m0.241s
user	0m0.235s
sys	0m0.002s
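
For anyone wanting to repeat the measurement with both predictable and unpredictable inputs, a harness could be sketched roughly as follows. This is not the benchmark behind the godbolt link. All names are hypothetical, and the use of `std::mt19937` with a fixed seed is just one assumed way of getting a repeatable mix of values below and above 2^63.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

constexpr float kTwo63 = 9223372036854775808.0f;  // 2^63

// Branch-based conversion for inputs in [0, 2^64).
std::uint64_t to_u64_branch(float f) {
    assert(f >= 0.0f && f < 2.0f * kTwo63);
    if (f < kTwo63) return static_cast<std::uint64_t>(static_cast<std::int64_t>(f));
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(f - kTwo63))
           ^ 0x8000000000000000ull;
}

// Builds n inputs. With mixed == false every value is a small integer below
// 2^63, so the branch always goes the same way. With mixed == true the low bit
// of each random draw decides whether the value is 2^63 instead, which on
// average sends about half the inputs down the other path.
std::vector<float> make_inputs(std::size_t n, bool mixed, std::uint32_t seed = 42) {
    std::mt19937 rng(seed);
    std::vector<float> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t r = static_cast<std::uint32_t>(rng());
        bool high = mixed && (r & 1u) != 0;
        float low = static_cast<float>((r >> 8) & 0xFFFFu);  // exact, in [0, 65535]
        v.push_back(high ? kTwo63 : low);
    }
    return v;
}

// The loop to time: the sum keeps the conversions from being optimised away.
std::uint64_t sum_conversions(const std::vector<float>& in) {
    std::uint64_t acc = 0;
    for (float f : in) acc += to_u64_branch(f);  // wraps modulo 2^64
    return acc;
}
```

Timing `sum_conversions` over `make_inputs(n, false)` and `make_inputs(n, true)` would show how much of the branch version's advantage depends on the inputs staying on one side of 2^63.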