> On Jan 19, 2015, at 10:29 AM, Chris Sears <chris.sears at gmail.com>
wrote:
>
> Looking at the Intel Optimization Reference Manual, page 14-14, for Atom
>
> BT m16, imm8, BT mem, imm8 latency 2,1 throughput 1
> BT m16, r16, BT mem, reg latency 10, 9, throughput 8
> BT reg, imm8, BT reg, reg latency 1, throughput 1
>
> On C-26 they lower that throughput to 0.5 clock cycle for BT reg, imm8.
>
> The posted functions were simplified for tracking down the code generation
> problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ for
> bit testing is even worse because you have to MOVE the tested reg to a temporary
> before the SHRQ/ANDQ. And all of these instructions require a REX prefix (well,
> not the AND). The result is some code bloat (3 instructions vs 1) and a little
> register pressure.

I’m not an X86 expert, but I’d still like to understand why you are comparing 1
instruction to 3. The result does not seem exactly the same, since (if I
understand correctly) BT only sets the carry flag, while the other combination
provides the result in a register.
The full sequence is:
btq %rsi, %rdi
sbbq %rax, %rax
andq $1, %rax
popq %rbp
vs:
shrq $25, %rdi
andq $1, %rdi
movq %rdi, %rax
popq %rbp
(I’m not saying that btq is not preferable, just that I don’t see a difference
in the number of instructions needed to get the result).
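
For reference, here is a minimal C sketch of the kind of bit-test function being discussed. The function names and the exact source are assumptions made for illustration; the functions originally posted are not reproduced in this message. Which of the two sequences above a compiler emits for code like this depends on the compiler version and target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: test bit n of x, with n known only at run time.
 * This is the shape that could map to BT with a register bit index. */
uint64_t test_bit_var(uint64_t x, uint64_t n) {
    assert(n < 64); /* shifting a 64-bit value by 64 or more is undefined in C */
    return (x >> n) & 1;
}

/* Hypothetical helper: test a fixed bit (25, as in the SHRQ $25 sequence).
 * This is the shape that could map to either BT reg,imm8 or SHR/AND. */
uint64_t test_bit_25(uint64_t x) {
    return (x >> 25) & 1;
}
```

In both cases the caller wants the 0/1 value in a register, so a BT-based sequence still needs extra instructions to move the carry flag into a register.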
I agree with Ahmed that you probably should look into PerformAndCombine()
(X86ISelLowering.cpp).
Best,
Mehdi