Eyal Soha via llvm-dev
2021-Feb-28 04:07 UTC
[llvm-dev] Help me modify llvm optimizer to eliminate valgrind false positive?
It worked! I have spent the last hour reading up on tablegen and I still don't understand any of that patch. But it definitely solved the problem in my test case. I also ran the clang tests with this change and they all passed, so I'm not sure where you are seeing the change in behavior around the movdqa. Maybe I ran them incorrectly?

I think that I understand the bit about the delay with pre-Ivybridge. The delay is because those CPUs aren't able to run at full speed when the mov is followed by the usage. It's a strange mov, though. Why did it even do that? Couldn't it just interleave the xmm register with itself?

As for collectTiedOperands, does undef mean that the register has yet to be allocated to a specific register in the machine? What is the situation that you think that Valgrind will still have trouble with?

Thanks for your help!

Eyal

On Sat, Feb 27, 2021 at 2:31 PM Craig Topper <craig.topper at gmail.com> wrote:

> Hi Eyal,
>
> LLVM knows that those bits are garbage and did this consciously. But it
> does create a false execution dependency on xmm3, potentially delaying the
> execution depending on what instruction is producing xmm3. This is
> definitely not the only case where this can happen.
>
> This is probably the easiest fix, which avoids ever creating the false
> dependency by recognizing that there is only one real input and assigning
> it to both sources.
>
> --- a/llvm/lib/Target/X86/X86InstrSSE.td
> +++ b/llvm/lib/Target/X86/X86InstrSSE.td
> @@ -3917,6 +3917,26 @@ let Constraints = "$src1 = $dst" in {
>    }
>  } // ExeDomain = SSEPackedInt
>
> +let Predicates = [UseSSE2] in {
> +def : Pat <(v16i8 (X86Unpckl undef, VR128:$src)),
> +           (PUNPCKLBWrr VR128:$src, VR128:$src)>;
> +def : Pat <(v8i16 (X86Unpckl undef, VR128:$src)),
> +           (PUNPCKLWDrr VR128:$src, VR128:$src)>;
> +def : Pat <(v4i32 (X86Unpckl undef, VR128:$src)),
> +           (PUNPCKLDQrr VR128:$src, VR128:$src)>;
> +def : Pat <(v2i64 (X86Unpckl undef, VR128:$src)),
> +           (PUNPCKLQDQrr VR128:$src, VR128:$src)>;
> +
> +def : Pat <(v16i8 (X86Unpckh undef, VR128:$src)),
> +           (PUNPCKHBWrr VR128:$src, VR128:$src)>;
> +def : Pat <(v8i16 (X86Unpckh undef, VR128:$src)),
> +           (PUNPCKHWDrr VR128:$src, VR128:$src)>;
> +def : Pat <(v4i32 (X86Unpckh undef, VR128:$src)),
> +           (PUNPCKHDQrr VR128:$src, VR128:$src)>;
> +def : Pat <(v2i64 (X86Unpckh undef, VR128:$src)),
> +           (PUNPCKHQDQrr VR128:$src, VR128:$src)>;
> +}
> +
>
> Unfortunately, this can introduce extra mov instructions to copy the input
> register if it is used again by a later instruction. For example:
>
>   define <16 x i8> @test_remconstant_16i8(<16 x i8> %a) nounwind {
>   ; SSE2-LABEL: test_remconstant_16i8:
>   ; SSE2:       # %bb.0:
>   +; SSE2-NEXT:    movdqa %xmm0, %xmm1
>   ; SSE2-NEXT:    punpckhbw {{.*#+}} xmm1 = xmm1[8],xmm0[8],xmm1[9],xmm0[9],xmm1[10],xmm0[10],xmm1[11],xmm0[11],xmm1[12],xmm0[12],xmm1[13],xmm0[13],xmm1[14],xmm0[14],xmm1[15],xmm0[15]
>
> Note, there's a weird quirk that only the tied source uses the result of
> the movdqa. The other source is still using xmm0.
>
> On CPUs without move elimination (pre-Ivybridge for Intel) this
> potentially adds an extra delay to the execution of the punpck, waiting for
> the movdqa to complete. But depending on the rest of the code in the
> function, that might be better than the false dependency.
>
> My other thought would be to do something different for undef sources
> in TwoAddressInstructionPass::collectTiedOperands. Rather than setting it
> to the destination virtual register, if there is another source with the
> same register class with a kill flag set, assign the undef vreg to that and
> keep the tied pair. This would do the same thing as the change in
> X86InstrSSE.td above, but applied much later, and would be limited to cases
> where the source register isn't used later. But valgrind could still have
> an issue for code that doesn't get fixed due to an additional use of the
> register.
>
> ~Craig
>
>
> On Sat, Feb 27, 2021 at 9:53 AM Eyal Soha via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> I spent a few days debugging a valgrind memcheck false positive that was
>> introduced by a new optimization in clang version 10.
>>
>> Valgrind is mistakenly reporting an error due to the way that clang 10 is
>> emitting asm. But it's a trivial matter to fix the code. I just needed to
>> change this:
>>
>>   401697: 66 0f 6e d2   movd      %edx,%xmm2
>>   40169b: 66 0f 60 d2   punpcklbw %xmm2,%xmm2
>>   40169f: 66 0f 61 da   punpcklwd %xmm2,%xmm3
>>
>> to this:
>>
>>   4016a3: 66 0f 6e da   movd      %edx,%xmm3
>>   4016a7: 66 0f 60 db   punpcklbw %xmm3,%xmm3
>>   4016ab: 66 0f 61 db   punpcklwd %xmm3,%xmm3
>>
>> So instead of using $xmm2 and then $xmm3, just use $xmm3. This was a tiny
>> change and I was able to modify the executable to do it.
>>
>> Is it possible to modify clang/llvm to do this where possible? Valgrind
>> is a popular tool for finding bugs in code, and if it's not too much
>> trouble for llvm and there's no performance penalty, maybe LLVM could
>> accommodate?
>>
>> I'm willing to do the work but I'm not sure where to start.
>>
>> There are more details about the issue in the Valgrind bug report:
>> https://bugs.kde.org/show_bug.cgi?id=432801#c12
>>
>> Thanks for your help!
>>
>> Eyal
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
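The fix above works because both unpck sources end up reading the same, fully initialized register. A rough Python model of the punpcklbw/punpcklwd interleave semantics (purely illustrative; this is not how LLVM or valgrind represent anything) shows why the original sequence leaks undefined bytes into the result:

```python
# Model an XMM register as a list of 16 byte values; None marks an
# undefined (never-written) byte, loosely mimicking how valgrind's
# memcheck tracks definedness.

def punpcklbw(dst, src):
    # Interleave the low 8 bytes: dst[0], src[0], dst[1], src[1], ...
    out = []
    for i in range(8):
        out += [dst[i], src[i]]
    return out

def punpcklwd(dst, src):
    # Interleave the low 4 words (byte pairs) of dst and src.
    out = []
    for i in range(4):
        out += dst[2 * i:2 * i + 2] + src[2 * i:2 * i + 2]
    return out

def movd(value_bytes):
    # movd %edx,%xmmN: low 4 bytes defined, upper 12 bytes zeroed.
    return value_bytes + [0] * 12

UNDEF = [None] * 16  # a register that was never written

# Original codegen: punpcklwd %xmm2,%xmm3 reads the never-written xmm3.
xmm2 = movd([1, 2, 3, 4])
xmm2 = punpcklbw(xmm2, xmm2)
xmm3 = punpcklwd(UNDEF, xmm2)   # tied first source is undefined
print(None in xmm3)             # undefined bytes reach the result

# Fixed codegen: both sources are the same defined register.
xmm3b = movd([1, 2, 3, 4])
xmm3b = punpcklbw(xmm3b, xmm3b)
xmm3b = punpcklwd(xmm3b, xmm3b)
print(None in xmm3b)            # result is fully defined
```

LLVM is still correct either way, since the lanes holding those undefined bytes are never consumed; the model just shows why a bit-level checker like memcheck has to track them.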
Craig Topper via llvm-dev
2021-Feb-28 04:49 UTC
[llvm-dev] Help me modify llvm optimizer to eliminate valgrind false positive?
On Sat, Feb 27, 2021 at 8:07 PM Eyal Soha <shawarmakarma at gmail.com> wrote:

> It worked! I have spent the last hour reading up on tablegen and I still
> don't understand any of that patch. But it definitely solved the problem
> in my test case. I also ran the clang tests with this change and they all
> passed, so I'm not sure where you are seeing the change in behavior
> around the movdqa. Maybe I ran them incorrectly?

The patch is detecting cases where llvm wants to create a punpck instruction where the first operand is undefined, meaning llvm knows that the value of those bits doesn't matter. If we find such a case, we just apply the other src register to both inputs instead. The first line of each of the Pats is the source pattern; the second line is the result pattern. In the source pattern you can see the "undef"; in the result pattern you can see that VR128:$src is listed twice.

Did you run make check-all or make check-llvm? It definitely should affect a bunch of tests in llvm/test/CodeGen/X86/.

> I think that I understand the bit about the delay with pre-Ivybridge. The
> delay is because those CPUs aren't able to run at full speed when the mov
> is followed by the usage. It's a strange mov, though. Why did it even do
> that? Couldn't it just interleave the xmm register with itself?

At least for the test case I copied from, the movdqa was needed because the register being copied from was used after the punpck. So it would be a bug for the punpck to overwrite it.

> As for collectTiedOperands, does undef mean that the register has yet to
> be allocated to a specific register in the machine? What is the situation
> that you think that Valgrind will still have trouble with?

That pass runs before register allocation. Undef here means that when we get to register allocation, the allocator has permission to use any register it finds convenient.
Unlike valgrind, LLVM knows that the bits that will become X don't matter, so it gave the register allocator permission to use any register. My proposal here wouldn't handle cases where xmm2 from the punpcklbw is used by another instruction after the punpcklwd that reads xmm3.
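The collectTiedOperands idea discussed above can be sketched in plain Python. This is an illustrative model only: the Operand class and retarget_undef_tied_source function are made up for this example and do not match LLVM's actual MachineOperand/MachineInstr API or the real pass logic.

```python
from dataclasses import dataclass, replace

@dataclass
class Operand:
    vreg: str               # virtual register, e.g. "%v2"
    reg_class: str          # register class, e.g. "VR128"
    is_undef: bool = False  # value is known not to matter
    is_kill: bool = False   # last use of this vreg

def retarget_undef_tied_source(tied_src, other_srcs):
    """If the tied source is undef, reuse another source of the same
    register class whose kill flag is set, instead of tying the undef
    source to the destination. This loosely mirrors the suggestion
    for TwoAddressInstructionPass::collectTiedOperands."""
    if not tied_src.is_undef:
        return tied_src
    for src in other_srcs:
        if src.reg_class == tied_src.reg_class and src.is_kill:
            # Reuse the dying register; no new copy, no false dependency.
            return replace(src, is_kill=False)
    return tied_src  # no candidate: the valgrind false positive can remain

# punpcklwd with an undef tied first source (%v3) and a killed %v2:
tied = Operand("%v3", "VR128", is_undef=True)
new_tied = retarget_undef_tied_source(tied, [Operand("%v2", "VR128", is_kill=True)])
print(new_tied.vreg)  # -> %v2: both sources now read the same register
```

The kill-flag check is what limits this approach to sources that aren't used later, which is exactly why valgrind could still complain about code where the register has an additional use.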