thr3ads.net - llvm dev - [llvm-dev] LLVM struct, alloca, SROA and the entry basic block [Sep 2015]

Benoit Belley via llvm-dev

2015-Sep-08 17:11 UTC

[llvm-dev] LLVM struct, alloca, SROA and the entry basic block

> From: Philip Reames <listmail at philipreames.com>
> Date: Tuesday, 8 September 2015 12:50
> To: Benoit Belley <benoit.belley at autodesk.com>, "llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] LLVM struct, alloca, SROA and the entry basic block
>
> On 09/08/2015 07:21 AM, Benoit Belley via llvm-dev wrote:
>> Hi everyone,
>>
>> We have noticed that the SROA pass will only eliminate 'alloca'
>> instructions if those are located in the entry basic block of a function.
>>
>> As a general recommendation, should the LLVM IR emitted by our compiler
>> always place 'alloca' instructions in the entry basic block? (I couldn't
>> find any recommendations concerning this matter.)
>
> Yes.

Thanks Phil. Should this be mentioned somewhere in the documentation? As a
footnote in the LLVM Language Reference manual maybe?

As a note, I have also found that alloca instructions should be placed before
any call instructions, as these can get inlined, and then the original alloca
can no longer be placed in the entry basic block!

>> In addition, we have noticed that the MemCpy pass will attempt to copy LLVM
>> structs using moves that are as large as possible. For example, a struct of 3
>> floats is copied using a 64-bit and a 32-bit move. It is therefore important
>> that such a struct be aligned on an 8-byte boundary, not just 4 bytes!
>> Otherwise, one runs the risk of triggering store-forwarding-failure pipeline
>> stalls (which we encountered really badly with one of our internal
>> performance benchmarks).
>
> This sounds like a bug to me.  We shouldn't be using the large loads/stores
> without knowing they're aligned or that unaligned access is fast on a
> particular target.  Where this is best fixed (memcpy, store lowering?) I
> don't know.

I'll send out a test case. Maybe that will help.

>> Are there any guidelines for specifying the alignment of LLVM structs
>> allocated by alloca instructions? Is rounding the structure size down to the
>> next power of 2 a good strategy? Will the MemCpy pass issue moves of up to
>> 64 bytes on AVX-512 capable processors?
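
The power-of-2 rounding asked about above can be read two ways, so here is a minimal C++ sketch of both readings. The helper names floorPow2 and ceilPow2 are hypothetical and are not part of LLVM; the thread does not say which reading, if either, is a good alignment strategy.

```cpp
#include <cstddef>

// Hypothetical helper (not an LLVM API): largest power of two that is
// less than or equal to size. Assumes size > 0.
inline std::size_t floorPow2(std::size_t size) {
    std::size_t p = 1;
    while (p <= size / 2) {
        p *= 2;
    }
    return p;
}

// Hypothetical helper (not an LLVM API): smallest power of two that is
// greater than or equal to size. Assumes size > 0 and that the result
// fits in std::size_t.
inline std::size_t ceilPow2(std::size_t size) {
    std::size_t p = 1;
    while (p < size) {
        p *= 2;
    }
    return p;
}
```

For the struct of 3 floats discussed above (12 bytes), floorPow2 gives 8, which matches the 8-byte boundary mentioned in the message, while ceilPow2 gives 16.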

Cheers,
Benoit

Benoit Belley
Sr Principal Developer
M&E-Product Development Group

MAIN +1 514 393 1616
DIRECT +1 438 448 6304
FAX +1 514 393 0110

Twitter<http://twitter.com/autodesk>
Facebook<https://www.facebook.com/Autodesk>

Autodesk, Inc.
10 Duke Street
Montreal, Quebec, Canada H3C 2L7
www.autodesk.com<http://www.autodesk.com/>


_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


Mehdi Amini via llvm-dev

2015-Sep-08 17:27 UTC

Hi,
> On Sep 8, 2015, at 10:11 AM, Benoit Belley via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Thanks Phil. Should this be mentioned somewhere in the documentation? As a
> footnote in the LLVM Language Reference manual maybe?

This sounds like a candidate for:
http://llvm.org/docs/Frontend/PerformanceTips.html ?

— 
Mehdi



Benoit Belley via llvm-dev

2015-Sep-08 18:11 UTC

Hi Philip,

Attached you will find the LLVM IR that causes LLVM 3.7.0 to emit assembly
generating a whole bunch of blocked store-forwarding pipeline stalls.

Compile using:

$ llvm/3.7.0/Release/bin/opt -S -O3 store-forward-failure.ll -o - |
llvm/3.7.0/Release/bin/llc -filetype=asm -O3 -o -


You will find assembly sequences such as:

        movss   dword ptr [rcx - 12], xmm4 # 32-bit store
        movss   dword ptr [rcx - 16], xmm3 # 32-bit store
        mov     rdx, qword ptr [rcx - 16]  # 64-bit load

Notice how the stores and loads are back-to-back and of different bit-width.  On
my processor (Intel Sandy Bridge), this sequence seems to fail store-forwarding
and to cause a huge CPU pipeline stall. Or at least, this is what the following
CPU performance counter leads me to believe:

LD_BLOCKS.STORE_FORWARD: Loads blocked by overlapping with store buffer that
cannot be forwarded.
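
For readers who want to see the shape of this access pattern outside of assembly, here is a minimal C++ sketch: two 4-byte stores followed by one 8-byte load that covers both. The function name and buffer are hypothetical, and this is not the attached test case. An optimizing compiler may keep everything in registers, so this illustrates the pattern only and is not a guaranteed reproducer of the stall.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

// Hypothetical illustration: write two adjacent 32-bit values, then read
// both back through a single 64-bit load of the same 8 bytes.
inline std::uint64_t narrowStoresThenWideLoad(float first, float second) {
    alignas(8) unsigned char buf[8];
    std::memcpy(buf, &first, sizeof first);        // 32-bit store at offset 0
    std::memcpy(buf + 4, &second, sizeof second);  // 32-bit store at offset 4
    std::uint64_t wide;
    std::memcpy(&wide, buf, sizeof wide);          // 64-bit load spanning both stores
    return wide;
}
```

The buffer is 8-byte aligned on purpose, so the sketch separates the width mismatch between the stores and the load from any alignment concern.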

My test case generates 1,500,000,000 of these "blocked store-forwarding"
events when using LLVM 3.7 versus 74,000 for LLVM 3.6! The number of
instructions executed per CPU cycle goes down to 0.7 IPC instead of 2.2 IPC.

Further analysis suggests that it might be due to the GVN pass (which runs just
before the MemCpy pass), which actually combines two 32-bit loads into a single
64-bit load.  See the attached files.

I have also noted that the allocas are actually getting properly annotated with
an alignment of 8 bytes by the "Combine redundant instructions" pass. So, I
guess that annotating allocas when emitting LLVM IR within our JIT compiler is
unnecessary. Is that a fair assessment?

Is store forwarding always blocked on this kind of memory access, even if
the accesses are properly aligned?

(Side note: Moving the alloca into the entry BB causes all of these redundant
alloca, store and load instructions to be optimized out, and the entire
store-forwarding issue goes away for this particular test case. But isn’t this
an issue that could be triggered in other valid cases?)

Cheers,
Benoit

-------------- next part --------------
A non-text attachment was scrubbed...
Name: store-forward-failure.ll
Type: application/octet-stream
Size: 20583 bytes
Desc: store-forward-failure.ll
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150908/06e1cbc6/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: before-gvn.ll
Type: application/octet-stream
Size: 29231 bytes
Desc: before-gvn.ll
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150908/06e1cbc6/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: after-gvn.ll
Type: application/octet-stream
Size: 29902 bytes
Desc: after-gvn.ll
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150908/06e1cbc6/attachment-0005.obj>

Benoit Belley via llvm-dev

2015-Sep-08 18:39 UTC

How about:

--- a/docs/Frontend/PerformanceTips.rst
+++ b/docs/Frontend/PerformanceTips.rst
@@ -19,20 +19,32 @@ Avoid loads and stores of large aggregate type
 ===============================================
 LLVM currently does not optimize well loads and stores of large :ref:`aggregate
 types <t_aggregate>` (i.e. structs and arrays).  As an alternative, consider
 loading individual fields from memory.

 Aggregates that are smaller than the largest (performant) load or store
 instruction supported by the targeted hardware are well supported.  These can
 be an effective way to represent collections of small packed fields.

+Issue alloca in the entry basic block
+======================================
+Issue alloca instructions in the entry basic block of a function. Also, issue
+them before any call instructions. Call instructions might get inlined into
+multiple basic blocks. The end result is that a following alloca instruction
+would no longer be in the entry basic block afterward.
+
+The SROA (Scalar Replacement Of Aggregates) pass only attempts to eliminate
+alloca instructions that are in the entry basic block. Subsequent optimization
+passes rely on such alloca instructions having been eliminated.
+
 Prefer zext over sext when legal
 =================================
 On some architectures (X86_64 is one), sign extension can involve an extra
 instruction whereas zero extension can be folded into a load.  LLVM will try to
 replace a sext with a zext when it can be proven safe, but if you have
 information in your source language about the range of a integer value, it can
 be profitable to use a zext rather than a sext.

 Alternatively, you can :ref:`specify the range of the value using metadata
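
One way a frontend might follow the proposed tip is to route every stack allocation through a helper that always inserts at the top of the entry block, wherever code is currently being emitted. The C++ below is a toy sketch of that idea. Instr, Block, Function, emitEntryAlloca and allocasAreEntryFirst are hypothetical stand-ins invented for illustration and are not LLVM classes or APIs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for IR concepts; these are NOT LLVM classes.
struct Instr {
    std::string opcode;
    std::string name;
};

struct Block {
    std::string label;
    std::vector<Instr> instrs;
};

struct Function {
    std::vector<Block> blocks;  // blocks[0] plays the role of the entry block
};

// Insert an alloca after the allocas already grouped at the top of the entry
// block, so it stays ahead of calls and all other instructions.
// Assumes the function already has an entry block.
inline void emitEntryAlloca(Function &f, const std::string &name) {
    Block &entry = f.blocks.front();
    std::size_t pos = 0;
    while (pos < entry.instrs.size() && entry.instrs[pos].opcode == "alloca") {
        ++pos;
    }
    entry.instrs.insert(entry.instrs.begin() + pos, Instr{"alloca", name});
}

// Returns true when every alloca lives in the entry block and comes before
// any call in that block.
inline bool allocasAreEntryFirst(const Function &f) {
    for (std::size_t b = 0; b < f.blocks.size(); ++b) {
        bool seenCall = false;
        for (const Instr &i : f.blocks[b].instrs) {
            if (i.opcode == "call") {
                seenCall = true;
            }
            if (i.opcode == "alloca" && (b != 0 || seenCall)) {
                return false;
            }
        }
    }
    return true;
}
```

A check like allocasAreEntryFirst could serve as a debug assertion in a frontend before handing a function to the optimizer.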

Benoit


Sanjay Patel via llvm-dev

2015-Sep-09 15:38 UTC

Hi Benoit -

I've been looking at memcpy/memset lowering and alignment issues recently.
See:
https://llvm.org/bugs/show_bug.cgi?id=24678
and the links from there.

If you can file a bug report with your test case and any perf data that
you've collected, that would be very helpful.

Thanks!



James Y Knight via llvm-dev

2015-Sep-09 17:13 UTC

On Sep 8, 2015, at 2:11 PM, Benoit Belley via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> You will find assembly sequences such as:
>
>         movss   dword ptr [rcx - 12], xmm4 # 32-bit store
>         movss   dword ptr [rcx - 16], xmm3 # 32-bit store
>         mov rdx, qword ptr [rcx - 16]      # 64-bit load
>
> Notice how the stores and loads are back-to-back and of different
> bit-width.  On my processor (Intel Sandy Bridge), this sequence seems to fail
> store-forwarding and to cause a huge CPU pipeline stall. Or at least, this is
> what the following CPU performance counter leads me to believe:
>
> LD_BLOCKS.STORE_FORWARD: Loads blocked by overlapping with store buffer
> that cannot be forwarded.

Yep, that happens for all Intel archs: if you have a wider load than a store,
store forwarding is blocked. It doesn't support forwarding two stores into
one load, even if everything's properly aligned.

> My test case is generating 1,500,000,000 of these "blocked
> store-forwarding" when using LLVM 3.7 versus 74,000 for LLVM 3.6! The number
> of instructions executed per CPU cycles goes down to 0.7 IPC instead of 2.2 IPC.
>
> Further analysis suggests that it might be due to the GVN pass (which runs
> just before the MemCpy pass) which actually combines 2 32-bit loads into a
> single 64-bit load.  See the attached files.
>
> I have also noted that the alloca are actually getting properly annotated
> with an alignment of 8 bytes by the "Combine redundant instructions" pass. So,
> I guess that annotating alloca when emitting LLVM IR within our JIT compiler
> is unnecessary. Is that a fair assessment?
>
> Is store-forwarding always blocking on these kind of memory accesses even
> if they are properly aligned?
>
> (Side note: Moving the alloca into the entry BB, causes all of these
> redundant alloca, store and load instructions to be optimized out and the
> entire store-forwarding issue goes away for this particular test case. But,
> isn’t this an issue that could be triggered in other valid cases?)

I'd hope it's less likely to be a problem in other situations, as if the
optimizer is working properly (and not "broken" by the use of alloca
outside the entry block), it's less likely to have a load of an address
immediately following a store of the same address -- the compiler should've
just used the registers in the first place. Store forwarding is most important
when the store and load are neighboring -- the farther away you get the less
likely it is that the store is still working its way through the pipeline.
