thr3ads.net - llvm dev - [LLVMdev] X86TargetLowering::LowerToBT [Jan 2015]

Chris Sears

2015-Jan-19 15:05 UTC

[LLVMdev] X86TargetLowering::LowerToBT

Which BTQ? There are three flavors.

BTQ reg/reg
BTQ reg/mem
BTQ reg/imm

I can imagine that the reg/reg and especially the reg/mem versions would be
slow. However the shrq/and versions *with the same operands* would be slow
as well. There's even a compiler comment about the reg/mem version saying
"this is for disassembly only".

But I doubt BTQ reg/imm would be microcoded.


-- 
Ite Ursi

Chris Sears

2015-Jan-19 18:29 UTC

Looking at the Intel Optimization Reference Manual, page 14-14, for Atom:

    BT m16, imm8; BT mem, imm8    latency 2, 1     throughput 1
    BT m16, r16;  BT mem, reg     latency 10, 9    throughput 8
    BT reg, imm8; BT reg, reg     latency 1        throughput 1

On C-26 the throughput listed for BT reg, imm8 is lower still: 0.5 clock cycles.

The posted functions were simplified for tracking down the code generation
problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ
for bit testing is even worse because you have to MOVE the tested reg to a
temporary before the SHRQ/ANDQ. And all of these instructions require a REX
prefix (well, not the AND). The result is some code bloat (3 instructions
vs 1) and a little register pressure.
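
For anyone following along, here is a minimal sketch of the kind of source-level bit test under discussion. These are hypothetical helpers written for illustration, not the functions posted earlier in the thread, and the names are made up. Both forms return 1 when bit n of x is set and 0 otherwise, assuming 0 <= n < 64. Which instruction sequence a given compiler version emits for them is exactly the open question here, so nothing about the generated code is claimed.

```c
#include <stdint.h>

/* Hypothetical helper: shift the tested bit down to position 0, then mask.
   Assumes 0 <= n < 64; a shift count of 64 or more is undefined in C. */
static inline uint64_t test_bit_shift(uint64_t x, unsigned n) {
    return (x >> n) & 1u;
}

/* Hypothetical helper: mask with a single-bit constant, then compare to 0.
   Same result as test_bit_shift for every x and every n in range. */
static inline uint64_t test_bit_mask(uint64_t x, unsigned n) {
    return (x & (UINT64_C(1) << n)) != 0;
}
```

Either form expresses the same test; they differ only in how the intent is spelled in the source.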

Mehdi Amini

2015-Jan-19 18:53 UTC

> On Jan 19, 2015, at 10:29 AM, Chris Sears <chris.sears at gmail.com> wrote:
>
> Looking at the Intel Optimization Reference Manual, page 14-14, for Atom:
>
>     BT m16, imm8; BT mem, imm8    latency 2, 1     throughput 1
>     BT m16, r16;  BT mem, reg     latency 10, 9    throughput 8
>     BT reg, imm8; BT reg, reg     latency 1        throughput 1
>
> On C-26 they lower that throughput to 0.5 clock cycle for BT reg, imm8.
>
> The posted functions were simplified for tracking down the code generation
> problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ
> for bit testing is even worse because you have to MOVE the tested reg to a
> temporary before the SHRQ/ANDQ. And all of these instructions require a REX
> prefix (well, not the AND). The result is some code bloat (3 instructions
> vs 1) and a little register pressure.

I’m not an X86 expert, but I’d still like to understand why you are comparing 1
instruction to 3. The results do not seem exactly the same, since (if I
understand correctly) BT only sets the carry flag, while the other combination
provides the result in a register.

The full sequence is:

	btq	%rsi, %rdi
	sbbq	%rax, %rax
	andq	$1, %rax
	popq	%rbp

vs:

	shrq	$25, %rdi
	andq	$1, %rdi
	movq	%rdi, %rax
	popq	%rbp

(I’m not saying that btq is not preferable, just that I don’t see a difference
in the number of instructions needed to get the result).
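
To make the comparison concrete, here is a rough C model of the two sequences above. It is only a sketch: the function names are made up, the popq is left out, and the flag handling is reduced to the carry bit. BT with a 64-bit register operand takes the bit index modulo 64, and SBB of a register with itself computes rax - rax - CF, which is 0 or all ones.

```c
#include <stdint.h>

/* Models: btq %rsi, %rdi ; sbbq %rax, %rax ; andq $1, %rax */
static uint64_t via_bt(uint64_t rdi, uint64_t rsi) {
    uint64_t cf = (rdi >> (rsi & 63)) & 1; /* BT: selected bit goes to CF */
    uint64_t rax = 0 - cf;                 /* SBB: rax - rax - CF, 0 or ~0 */
    rax &= 1;                              /* AND: reduce to 0 or 1 */
    return rax;
}

/* Models: shrq $25, %rdi ; andq $1, %rdi ; movq %rdi, %rax */
static uint64_t via_shift(uint64_t rdi) {
    rdi >>= 25;                            /* SHR: bit 25 moves to bit 0 */
    rdi &= 1;                              /* AND: keep only that bit */
    return rdi;                            /* MOV: result into rax */
}
```

With rsi = 25 the two functions return the same value for any rdi, and each takes three steps to produce it in a register.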

I agree with Ahmed that you probably should look into PerformAndCombine()
(X86ISelLowering.cpp).

Best,

Mehdi
