thr3ads.net - llvm dev - [llvm-dev] unnecessary reload of 8-byte struct on i386 [Oct 2019]

If this information is useful, please help other people find it:
Share via:

Seth Brenith via llvm-dev

2019-Oct-25 21:28 UTC

[llvm-dev] unnecessary reload of 8-byte struct on i386

Hello folks,

I've recently been looking at the generated code for a few functions in
Chromium while investigating crashes, and I came across a curious pattern. A
smallish repro case is available at https://godbolt.org/z/Dsu1WI . In that case,
the function Assembler::emit_arith receives a struct (Operand) by value and
passes it by value to another function. That struct is 8 bytes long, so the -O3
generated code uses movsd to copy it up the stack. However, we end up with some
loads that aren't needed, as in the following chunk:

movsd xmm0, qword ptr [ecx] # xmm0 = mem[0],zero
mov dword ptr [esp + 24], edx
movsd qword ptr [esp + 40], xmm0
movsd xmm0, qword ptr [esp + 40] # xmm0 = mem[0],zero
movsd qword ptr [esp + 8], xmm0

As far as I can tell, the fourth line has no effect. On its own, that seems like
a small missed opportunity for optimization. However, this sequence of
instructions also appears to trigger a hardware bug on a small fraction of
devices which sometimes end up storing zero at esp+8. A more in-depth discussion
of that issue can be found here:
https://bugs.chromium.org/p/v8/issues/detail?id=9774 .

I'm hoping that getting rid of the second load in the sequence above would
appease these misbehaving machines (though of course I don't know that it
would), as well as making the code a little smaller for everybody else. Does
that sound like a reasonable idea? Would LLVM be interested in a patch related
to eliminating reloads like this? Does anybody have advice about where I should
start looking, or any reasons it would be very hard to achieve the result
I'm hoping for?

Thanks,
Seth
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191025/2808aadf/attachment.html>

Eli Friedman via llvm-dev

2019-Oct-28 23:33 UTC

head link

[llvm-dev] unnecessary reload of 8-byte struct on i386

It's not unheard of for the compiler to work around CPU bugs... but
generally, we try to do it in a more disciplined way: with a code generation
pass that actually detects the bad sequence in question.   I'm not really
happy with trying to "get lucky" here to avoid a bug.

This particular missed optimization is a known issue with the LLVM IR
representation of "byval"; there's an implied copy that can't
be easily optimized away at the IR level due to calling convention rules.  For
ARM targets, clang works around this issue by changing the IR it generates; see
ARMABIInfo::classifyArgumentType in clang/lib/CodeGen/TargetInfo.cpp .

-Eli

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Seth
Brenith via llvm-dev
Sent: Friday, October 25, 2019 2:29 PM
To: llvm-dev at lists.llvm.org
Subject: [EXT] [llvm-dev] unnecessary reload of 8-byte struct on i386

Hello folks,

I've recently been looking at the generated code for a few functions in
Chromium while investigating crashes, and I came across a curious pattern. A
smallish repro case is available at https://godbolt.org/z/Dsu1WI . In that case,
the function Assembler::emit_arith receives a struct (Operand) by value and
passes it by value to another function. That struct is 8 bytes long, so the -O3
generated code uses movsd to copy it up the stack. However, we end up with some
loads that aren't needed, as in the following chunk:

movsd xmm0, qword ptr [ecx] # xmm0 = mem[0],zero
mov dword ptr [esp + 24], edx
movsd qword ptr [esp + 40], xmm0
movsd xmm0, qword ptr [esp + 40] # xmm0 = mem[0],zero
movsd qword ptr [esp + 8], xmm0

As far as I can tell, the fourth line has no effect. On its own, that seems like
a small missed opportunity for optimization. However, this sequence of
instructions also appears to trigger a hardware bug on a small fraction of
devices which sometimes end up storing zero at esp+8. A more in-depth discussion
of that issue can be found here:
https://bugs.chromium.org/p/v8/issues/detail?id=9774 .

I'm hoping that getting rid of the second load in the sequence above would
appease these misbehaving machines (though of course I don't know that it
would), as well as making the code a little smaller for everybody else. Does
that sound like a reasonable idea? Would LLVM be interested in a patch related
to eliminating reloads like this? Does anybody have advice about where I should
start looking, or any reasons it would be very hard to achieve the result
I'm hoping for?

Thanks,
Seth
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191028/3abfa7ae/attachment.html>

David Zarzycki via llvm-dev

2019-Oct-29 06:23 UTC

head link

[llvm-dev] unnecessary reload of 8-byte struct on i386

This just looks like a temporary stack variable wasn’t properly eliminated
because the compiler modeled `Operand` internally as an “i64” which i386 doesn’t
natively support. A further reduction, with notes about changes that sidestep
the bug:

https://godbolt.org/z/zcCguv


> On Oct 26, 2019, at 12:28 AM, Seth Brenith via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Hello folks,
>  
> I’ve recently been looking at the generated code for a few functions in
Chromium while investigating crashes, and I came across a curious pattern. A
smallish repro case is available at https://godbolt.org/z/Dsu1WI
<https://godbolt.org/z/Dsu1WI> . In that case, the function
Assembler::emit_arith receives a struct (Operand) by value and passes it by
value to another function. That struct is 8 bytes long, so the -O3 generated
code uses movsd to copy it up the stack. However, we end up with some loads that
aren’t needed, as in the following chunk:
>  
> movsd xmm0, qword ptr [ecx] # xmm0 = mem[0],zero
> mov dword ptr [esp + 24], edx
> movsd qword ptr [esp + 40], xmm0
> movsd xmm0, qword ptr [esp + 40] # xmm0 = mem[0],zero
> movsd qword ptr [esp + 8], xmm0
>  
> As far as I can tell, the fourth line has no effect. On its own, that seems
like a small missed opportunity for optimization. However, this sequence of
instructions also appears to trigger a hardware bug on a small fraction of
devices which sometimes end up storing zero at esp+8. A more in-depth discussion
of that issue can be found here:
https://bugs.chromium.org/p/v8/issues/detail?id=9774
<https://bugs.chromium.org/p/v8/issues/detail?id=9774> .
>  
> I’m hoping that getting rid of the second load in the sequence above would
appease these misbehaving machines (though of course I don’t know that it
would), as well as making the code a little smaller for everybody else. Does
that sound like a reasonable idea? Would LLVM be interested in a patch related
to eliminating reloads like this? Does anybody have advice about where I should
start looking, or any reasons it would be very hard to achieve the result I’m
hoping for?
>  
> Thanks,
> Seth
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191029/222534ae/attachment-0001.html>

Seth Brenith via llvm-dev

2019-Oct-30 15:52 UTC

head link

[llvm-dev] unnecessary reload of 8-byte struct on i386

Hi Eli,

Thanks for the thoughtful response! To check my understanding, is the following
statement true? Given infinite time/resources and a clearer understanding of the
exact trigger for this CPU bug, we would like two separate changes: one change
in Clang to emit different IR for x86 like we do for ARM, and a second change to
search for this instruction sequence and transform it into something harmless.

However, I don't really know what "something harmless" would be,
and this issue affects only a tiny minority of machines, and it feels a little
silly to stuff an increasing number of nops in between those instructions until
we stop getting these crash reports, so I think the reasonable next step is to
leave it alone. Please let me know if you think otherwise, and thanks again!

-Seth

From: Eli Friedman <efriedma at quicinc.com>
Sent: Monday, October 28, 2019 4:34 PM
To: Seth Brenith <Seth.Brenith at microsoft.com>; LLVM Dev <llvm-dev at
lists.llvm.org>
Subject: [EXTERNAL] RE: [llvm-dev] unnecessary reload of 8-byte struct on i386

It's not unheard of for the compiler to work around CPU bugs... but
generally, we try to do it in a more disciplined way: with a code generation
pass that actually detects the bad sequence in question.   I'm not really
happy with trying to "get lucky" here to avoid a bug.

This particular missed optimization is a known issue with the LLVM IR
representation of "byval"; there's an implied copy that can't
be easily optimized away at the IR level due to calling convention rules.  For
ARM targets, clang works around this issue by changing the IR it generates; see
ARMABIInfo::classifyArgumentType in clang/lib/CodeGen/TargetInfo.cpp .

-Eli

From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces
at lists.llvm.org>> On Behalf Of Seth Brenith via llvm-dev
Sent: Friday, October 25, 2019 2:29 PM
To: llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
Subject: [EXT] [llvm-dev] unnecessary reload of 8-byte struct on i386

Hello folks,

I've recently been looking at the generated code for a few functions in
Chromium while investigating crashes, and I came across a curious pattern. A
smallish repro case is available at
https://godbolt.org/z/Dsu1WI<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgodbolt.org%2Fz%2FDsu1WI&data=02%7C01%7CSeth.Brenith%40microsoft.com%7C3fbf0efccbcd4bb3518c08d75bff49e5%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637079025113565811&sdata=N4cEn%2BsUBdqAr6uhQHDaAnxqlyBk1hfcRGj%2FkO%2F%2Fns4%3D&reserved=0>
. In that case, the function Assembler::emit_arith receives a struct (Operand)
by value and passes it by value to another function. That struct is 8 bytes
long, so the -O3 generated code uses movsd to copy it up the stack. However, we
end up with some loads that aren't needed, as in the following chunk:

movsd xmm0, qword ptr [ecx] # xmm0 = mem[0],zero
mov dword ptr [esp + 24], edx
movsd qword ptr [esp + 40], xmm0
movsd xmm0, qword ptr [esp + 40] # xmm0 = mem[0],zero
movsd qword ptr [esp + 8], xmm0

As far as I can tell, the fourth line has no effect. On its own, that seems like
a small missed opportunity for optimization. However, this sequence of
instructions also appears to trigger a hardware bug on a small fraction of
devices which sometimes end up storing zero at esp+8. A more in-depth discussion
of that issue can be found here:
https://bugs.chromium.org/p/v8/issues/detail?id=9774<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.chromium.org%2Fp%2Fv8%2Fissues%2Fdetail%3Fid%3D9774&data=02%7C01%7CSeth.Brenith%40microsoft.com%7C3fbf0efccbcd4bb3518c08d75bff49e5%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637079025113575803&sdata=x1t77z5dpjMuQW3xawNP7jbxMcS12ZlhUAvZRAbBwbQ%3D&reserved=0>
.

I'm hoping that getting rid of the second load in the sequence above would
appease these misbehaving machines (though of course I don't know that it
would), as well as making the code a little smaller for everybody else. Does
that sound like a reasonable idea? Would LLVM be interested in a patch related
to eliminating reloads like this? Does anybody have advice about where I should
start looking, or any reasons it would be very hard to achieve the result
I'm hoping for?

Thanks,
Seth
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191030/70d9fcfe/attachment.html>

Seth Brenith via llvm-dev

2019-Oct-30 15:54 UTC

head link

[llvm-dev] [EXTERNAL] Re: unnecessary reload of 8-byte struct on i386

Very interesting, thanks for the analysis and further reduction! Over-aligning
the struct seems like a simple workaround that would have little impact
elsewhere.

From: David Zarzycki <dave at znu.io>
Sent: Monday, October 28, 2019 11:23 PM
To: Seth Brenith <Seth.Brenith at microsoft.com>
Cc: llvm-dev at lists.llvm.org
Subject: [EXTERNAL] Re: [llvm-dev] unnecessary reload of 8-byte struct on i386

This just looks like a temporary stack variable wasn’t properly eliminated
because the compiler modeled `Operand` internally as an “i64” which i386 doesn’t
natively support. A further reduction, with notes about changes that sidestep
the bug:

https://godbolt.org/z/zcCguv<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgodbolt.org%2Fz%2FzcCguv&data=02%7C01%7CSeth.Brenith%40microsoft.com%7C9a7bff8d4a864a3bd7a908d75c3880fb%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637079270882127747&sdata=Mlqw0Jmr9wsh4JYWLm0kX%2B4pj%2FjFR8Ef8YJSCYZOpDg%3D&reserved=0>

On Oct 26, 2019, at 12:28 AM, Seth Brenith via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

Hello folks,

I’ve recently been looking at the generated code for a few functions in Chromium
while investigating crashes, and I came across a curious pattern. A smallish
repro case is available at
https://godbolt.org/z/Dsu1WI<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgodbolt.org%2Fz%2FDsu1WI&data=02%7C01%7CSeth.Brenith%40microsoft.com%7C9a7bff8d4a864a3bd7a908d75c3880fb%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637079270882137742&sdata=OeSk0j2ISpza%2F%2FVUWDsmUcGUP8MX3WRwkAtyIyO3YxQ%3D&reserved=0>
. In that case, the function Assembler::emit_arith receives a struct (Operand)
by value and passes it by value to another function. That struct is 8 bytes
long, so the -O3 generated code uses movsd to copy it up the stack. However, we
end up with some loads that aren’t needed, as in the following chunk:

movsd xmm0, qword ptr [ecx] # xmm0 = mem[0],zero
mov dword ptr [esp + 24], edx
movsd qword ptr [esp + 40], xmm0
movsd xmm0, qword ptr [esp + 40] # xmm0 = mem[0],zero
movsd qword ptr [esp + 8], xmm0

As far as I can tell, the fourth line has no effect. On its own, that seems like
a small missed opportunity for optimization. However, this sequence of
instructions also appears to trigger a hardware bug on a small fraction of
devices which sometimes end up storing zero at esp+8. A more in-depth discussion
of that issue can be found here:
https://bugs.chromium.org/p/v8/issues/detail?id=9774<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.chromium.org%2Fp%2Fv8%2Fissues%2Fdetail%3Fid%3D9774&data=02%7C01%7CSeth.Brenith%40microsoft.com%7C9a7bff8d4a864a3bd7a908d75c3880fb%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637079270882137742&sdata=ujiic5Tb0%2F%2BsHM7sH6JXCJlpsQSKD6txl5g3HtEYOuo%3D&reserved=0>
.

I’m hoping that getting rid of the second load in the sequence above would
appease these misbehaving machines (though of course I don’t know that it
would), as well as making the code a little smaller for everybody else. Does
that sound like a reasonable idea? Would LLVM be interested in a patch related
to eliminating reloads like this? Does anybody have advice about where I should
start looking, or any reasons it would be very hard to achieve the result I’m
hoping for?

Thanks,
Seth
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.llvm.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fllvm-dev&data=02%7C01%7CSeth.Brenith%40microsoft.com%7C9a7bff8d4a864a3bd7a908d75c3880fb%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637079270882137742&sdata=CAD9ST%2FRFrzyfx3hOLCNvaGX%2BXNwavEpy788wFfbZ6Q%3D&reserved=0>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191030/c06ae3c7/attachment-0001.html>

Reasonably Related Threads

Search for more reasonably related threads

llvm dev - Oct 2019 - unnecessary reload of 8-byte struct on i386

[llvm-dev] unnecessary reload of 8-byte struct on i386

[llvm-dev] unnecessary reload of 8-byte struct on i386

[llvm-dev] unnecessary reload of 8-byte struct on i386

[llvm-dev] unnecessary reload of 8-byte struct on i386

[llvm-dev] [EXTERNAL] Re: unnecessary reload of 8-byte struct on i386

Reasonably Related Threads