I suspect that this is because the mask in your example is the result of a
variable shift, which (a) has its own performance and flags hazards pre-SHLX
and (b) requires additional µops to do with TEST. I expect that ICC is putting
a dummy TEST or XOR ahead of the BT to break the false flags dependency, as
well.

If the mask were constant, I expect ICC would generate TEST instead (but I
don't have it handy to check).

– Steve

> On Jan 23, 2015, at 11:32 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>
> If 'bt' is a perf sin, icc doesn't seem to know it:
>
> $ icc -v
> icc version 15.0.1 (gcc version 4.9.0 compatibility)
>
> $ cat bt.c
> unsigned long long IsBitSetB_64(unsigned long long val, int index) { return (val & (1ULL<<index)) != 0ULL; }
> unsigned int IsBitSetB_32(unsigned int val, int index) { return (val & (1U<<index)) != 0U; }
>
> $ icc -O3 -S bt.c -o - | grep bt
> .file "bt.c"
> btq %rsi, %rdi
> btl %esi, %edi
>
> Does anyone at Intel have guidance for us?
>
> On Thu, Jan 22, 2015 at 4:34 PM, Eric Christopher <echristo at gmail.com> wrote:
>
>> On Thu Jan 22 2015 at 3:32:53 PM Chris Sears <chris.sears at gmail.com> wrote:
>>
>>> The status quo is:
>>>
>>> a) 40b REX+BT instruction for the 64b case
>>> b) 48b TEST for the 32b case
>>> c) unless it's small TEST
>>>
>>> You are currently paying a 16b penalty for TEST vs BT in the 32b case.
>>> That may be worth testing the -Os flag.
>>
>> You'll want -Oz here, Os isn't supposed to affect the runtime as much as
>> this is going to.
>>
>> -eric
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Full icc code sequence (for the 32-bit case):
xorl %eax, %eax
movl $1, %edx
btl %esi, %edi
cmovc %edx, %eax
ret
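
For reference, a rough C model of what the sequence above computes. This is only a sketch: the names icc_sequence_model and check_icc_model are made up for illustration, and the model assumes the documented behavior of BT with a register operand, which takes the bit offset modulo the operand size (32 here). It therefore only matches the C source for index in 0..31, where the C shift is defined.

```c
#include <assert.h>

/* Source-level form from bt.c. Only defined for 0 <= index < 32. */
unsigned int IsBitSetB_32(unsigned int val, int index) {
    return (val & (1U << index)) != 0U;
}

/* Hypothetical step-by-step model of the icc output (illustrative name). */
unsigned int icc_sequence_model(unsigned int val, int index) {
    unsigned int eax = 0;                                    /* xorl  %eax, %eax */
    unsigned int edx = 1;                                    /* movl  $1, %edx   */
    unsigned int cf = (val >> ((unsigned)index & 31)) & 1U;  /* btl   %esi, %edi */
    if (cf)
        eax = edx;                                           /* cmovc %edx, %eax */
    return eax;                                              /* ret              */
}

/* Assert that both forms agree for every in-range bit index. */
void check_icc_model(unsigned int val) {
    for (int index = 0; index < 32; ++index)
        assert(IsBitSetB_32(val, index) == icc_sequence_model(val, index));
}
```

Calling check_icc_model on a handful of values is enough to confirm that the carry flag set by BT carries exactly the bit the mask test asks for.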
Chris's code example is actually returning the result, so no 'test' or 'bt'
in the constant mask case:
unsigned int IsBitSetA_32(unsigned int val) { return (val & (1U<<25)) != 0U; }
andl $33554432, %edi
shrl $25, %edi
movl %edi, %eax
ret
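
To illustrate why no 'test' or 'bt' is needed here: with a constant mask, isolating the bit and shifting it down already yields the 0/1 result. A small C sketch of the sequence above follows; the names and_shr_model and check_and_shr_model are invented for illustration and are not from the thread.

```c
#include <assert.h>

/* Constant-mask form from the example above. */
unsigned int IsBitSetA_32(unsigned int val) {
    return (val & (1U << 25)) != 0U;
}

/* Hypothetical model of the and/shr/mov sequence. */
unsigned int and_shr_model(unsigned int val) {
    unsigned int edi = val;
    edi &= 33554432U;  /* andl $33554432, %edi  (33554432 == 1 << 25) */
    edi >>= 25;        /* shrl $25, %edi */
    return edi;        /* movl %edi, %eax ; ret */
}

/* Assert that both forms agree on a few representative inputs. */
void check_and_shr_model(void) {
    const unsigned int samples[] = {0U, 1U << 25, ~(1U << 25), 0xFFFFFFFFU};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(IsBitSetA_32(samples[i]) == and_shr_model(samples[i]));
}
```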
Right, so the xor breaks the false dependency on the previous flags state.
Compare to what we get from clang with a variable mask:
bt %esi, %edi
sbb %eax, %eax // HAZARD: false dependency on the flags state prior to BT.
and $1, %eax
If we instead generated:
xor %eax, %eax
bt %esi, %edi
adc %eax, %eax
We’d mostly avoid the partial-flags hazard, though we’d still get one extra µop
generated. Targeting Haswell, I’d probably rather see:
shrx %esi, %edi, %eax
and $1, %eax
but a reasonable case can be made for the bt sequence under -Oz.
As I understand it though, this whole discussion is actually about the constant
mask case, for which clang already generates reasonable code.
– Steve
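
As a sanity check on the sequences above, here is a minimal C sketch of three variable-index forms: the source-level mask test, a model of the xor/bt/adc sequence, and the shift-then-mask form that shrx/and computes. Function names other than IsBitSetB_32 are hypothetical, and the models assume 0 <= index < 32 (BT and SHRX on 32-bit registers take the count modulo 32, while the C shift is undefined outside that range).

```c
#include <assert.h>

/* Source-level form from bt.c. Only defined for 0 <= index < 32. */
unsigned int IsBitSetB_32(unsigned int val, int index) {
    return (val & (1U << index)) != 0U;
}

/* Hypothetical model of: xor %eax,%eax ; bt %esi,%edi ; adc %eax,%eax */
unsigned int xor_bt_adc_model(unsigned int val, int index) {
    unsigned int eax = 0;                                    /* xor */
    unsigned int cf = (val >> ((unsigned)index & 31)) & 1U;  /* bt sets CF */
    eax = eax + eax + cf;                                    /* adc */
    return eax;
}

/* Hypothetical model of: shrx %esi,%edi,%eax ; and $1,%eax */
unsigned int shrx_and_model(unsigned int val, int index) {
    unsigned int eax = val >> ((unsigned)index & 31);        /* shrx */
    eax &= 1U;                                               /* and  */
    return eax;
}

/* Assert that all three forms agree for every in-range bit index. */
void check_variable_index_models(unsigned int val) {
    for (int index = 0; index < 32; ++index) {
        unsigned int expected = IsBitSetB_32(val, index);
        assert(xor_bt_adc_model(val, index) == expected);
        assert(shrx_and_model(val, index) == expected);
    }
}
```

The shift-then-mask form never touches the carry flag in the model, which mirrors why the shrx/and sequence sidesteps the flags question entirely.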