I've definitely "run into this problem", and I would very much love to remove my kludges [which are incomplete, because I keep finding places where I need to modify the code-gen to "fix" the same problem - probably par for the course for a complete amateur compiler writer, someone who has only spent the last 14 months working (as a hobby) with LLVM].

So whilst I can't contribute much on "what is the right solution" or "how do we solve this", I would very much like to see something that allows a user of LLVM to use plain load/store without resorting to things like "is the thing I'm storing big? If so, don't generate a load/store, use a memcpy instead". Not only does this make LLVM harder to use, it also causes slow compilation [perhaps this is a separate problem, but I have a simple program that copies a large struct a few times, and if I turn off my "use memcpy for large things" workaround, the compile time gets quite a lot longer - approx 1000x. 48 seconds is a long time to compile 37 lines of relatively straightforward code; even the Pascal compiler on the PDP-11/70 that I used at my school in the 1980s managed more than one line per second, and it didn't run anywhere near 2.5GHz and had 20-30 users any time I could use it...]
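For illustration, the front-end choice described here looks roughly like this in LLVM IR of the era (the record type, its size, and the alignment below are invented for the example; this is a sketch, not lacsap's actual output):

```llvm
%bigrec = type { [4096 x i32] }   ; a hypothetical 16 KiB record

; Naive form: a first-class aggregate load/store pair. This is valid IR,
; but the backend must expand the huge aggregate value somehow.
define void @copy_direct(%bigrec* %dst, %bigrec* %src) {
  %tmp = load %bigrec, %bigrec* %src
  store %bigrec %tmp, %bigrec* %dst
  ret void
}

; Workaround form: the front end notices the type is "big" and emits a
; memcpy of the known size instead (LLVM 3.x-era intrinsic signature,
; with explicit alignment and is-volatile arguments).
declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

define void @copy_memcpy(%bigrec* %dst, %bigrec* %src) {
  %d = bitcast %bigrec* %dst to i8*
  %s = bitcast %bigrec* %src to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %d, i8* %s, i64 16384, i32 4, i1 false)
  ret void
}
```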
../lacsap -no-memcpy -tt longcompile.pas
Time for Parse 0.657 ms
Time for Analyse 0.018 ms
Time for Compile 1.248 ms
Time for CreateObject 48803.263 ms
Time for CreateBinary 48847.631 ms
Time for Compile 48854.064 ms

compared with:

../lacsap -tt longcompile.pas
Time for Parse 0.455 ms
Time for Analyse 0.013 ms
Time for Compile 1.138 ms
Time for CreateObject 44.627 ms
Time for CreateBinary 82.758 ms
Time for Compile 95.797 ms

wc longcompile.pas
 37  84 410 longcompile.pas

Source here:
https://github.com/Leporacanthicus/lacsap/blob/master/test/longcompile.pas

--
Mats

On 17 August 2015 at 21:18, deadal nix via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> OK, what about this plan:
>
> Slice the aggregate into a series of valid loads/stores for non-atomic ones.
> Use a big scalar for atomic/volatile ones.
> Try to generate memcpy or memmove when possible?
>
> 2015-08-17 12:16 GMT-07:00 deadal nix <deadalnix at gmail.com>:
>>
>> 2015-08-17 11:26 GMT-07:00 Mehdi Amini <mehdi.amini at apple.com>:
>>
>>> Hi,
>>>
>>> On Aug 17, 2015, at 12:13 AM, deadal nix via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> 2015-08-16 23:21 GMT-07:00 David Majnemer <david.majnemer at gmail.com>:
>>>
>>>> Because a solution which doesn't generalize is not a very powerful
>>>> solution. What happens when somebody says that they want to use atomics +
>>>> large aggregate loads and stores? Give them yet another, different answer?
>>>> That would mean our earlier, less general answer was either a
>>>> band-aid (bad) or that the new answer requires a parallel code path in
>>>> their frontend (worse).
>>>
>>> +1 with David's approach: making things incrementally better is fine *as
>>> long as* the long-term direction is identified. Small incremental changes
>>> that make things slightly better in the short term but drive us away from
>>> the long-term direction are not good.
>>> Don't get me wrong, I'm not saying that the current patch is not good,
>>> just that it does not seem clear to me that the long-term direction has
>>> been identified, which explains why some can be nervous about adding
>>> stuff prematurely.
>>> And I'm not for the status quo; while I can't judge it definitively
>>> myself, I even bugged David last month to look at this revision and try
>>> to identify what the long-term direction really is and how to make your
>>> (and other) frontends' lives easier.
>>
>> As long as there is something to be done, sure. Concern has been raised
>> about very large aggregates (64K, 1MB), but there is no way good codegen
>> can come out of these anyway. I don't know of any machine that has 1MB of
>> registers available to take the load. Even if we had a good way to handle
>> it in InstCombine, the backend would have no capability to generate
>> something nice for it anyway. Most aggregates are small, and there is no
>> good excuse not to handle them just because someone could generate
>> gigantic ones that won't map nicely to the hardware anyway.
>>
>> By that logic, SROA should not exist either, as one could generate
>> gigantic aggregates as well (in fact, SROA fails pretty badly on large
>> aggregates).
>>
>> The second concern raised is atomic/volatile, which needs to be handled
>> differently by the optimizer anyway, so is mostly irrelevant here.
>>
>>> clang has many developers behind it, some of them paid to work on it.
>>> That's simply not the case for many others.
>>>
>>> But to answer your questions:
>>> - Per-field loads/stores generate more loads/stores than necessary in
>>> many cases. These can't be aggregated back because of padding.
>>> - memcpy only works memory to memory. It is certainly usable in some
>>> cases, but it certainly does not cover all uses.
>>> I'm willing to do the memcpy optimization in InstCombine (in fact, had
>>> things not degenerated into so much bikeshedding, that would already be
>>> done).
>>>
>>> Calling out as "bikeshedding" what other devs think keeps the quality of
>>> the project high is unlikely to help your patch go through; it's
>>> probably quite the opposite, actually.
>>
>> I understand the desire to keep quality high. That is not where the
>> problem is. The problem lies in discussing an actual proposal against
>> hypothetical perfect ones that do not exist.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://llvm.cs.uiuc.edu
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
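The plan quoted above (slice non-atomic aggregate accesses into a series of valid scalar loads/stores, use a big scalar for atomic/volatile ones, emit memcpy/memmove when possible) can be sketched on a small struct. The type is invented and the layout assumes a target where i32 is 4-byte aligned, so this is illustrative only; it also shows the padding point raised above:

```llvm
%pair = type { i8, i32 }   ; on this assumed layout: 1 data byte, 3 padding bytes, then the i32

; Before: one first-class aggregate load/store of the whole pair,
; which conceptually moves the padding bytes too.
define void @copy(%pair* %dst, %pair* %src) {
  %v = load %pair, %pair* %src
  store %pair %v, %pair* %dst
  ret void
}

; After slicing: one valid scalar load/store per field. The padding is
; no longer copied - which is also why per-field accesses cannot, in
; general, be merged back into a single wide access.
define void @copy_sliced(%pair* %dst, %pair* %src) {
  %s0 = getelementptr inbounds %pair, %pair* %src, i32 0, i32 0
  %d0 = getelementptr inbounds %pair, %pair* %dst, i32 0, i32 0
  %b = load i8, i8* %s0
  store i8 %b, i8* %d0
  %s1 = getelementptr inbounds %pair, %pair* %src, i32 0, i32 1
  %d1 = getelementptr inbounds %pair, %pair* %dst, i32 0, i32 1
  %w = load i32, i32* %s1
  store i32 %w, i32* %d1
  ret void
}
```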
Hi Mats,

The performance issue seems like a potentially separate issue.
Can you send the input IR in both cases and the list of passes you are running?

Thanks,

—
Mehdi

> On Aug 17, 2015, at 2:02 PM, mats petersson <mats at planetcatfish.com> wrote:
>
> [quoted message trimmed]
Even if I turn to -O0 [in other words, no optimisation passes at all], it takes the same amount of time. The time is spent in:

 12.94%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator=
  7.68%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator*
  7.53%  lacsap  lacsap  [.] llvm::SelectionDAG::ReplaceAllUsesOfValueWith
  7.28%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator++
  5.59%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator!=
  4.65%  lacsap  lacsap  [.] llvm::SDNode::hasNUsesOfValue
  3.82%  lacsap  lacsap  [.] llvm::SDUse::getResNo
  2.33%  lacsap  lacsap  [.] llvm::SDValue::getResNo
  2.19%  lacsap  lacsap  [.] llvm::SDUse::getNext
  1.32%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::getUse
  1.28%  lacsap  lacsap  [.] llvm::SDUse::getUser

Here's the LLVM IR generated:
https://gist.github.com/Leporacanthicus/9b662f88e0c4a471e51a

And as can be seen here, -O0 produces "no passes":
https://github.com/Leporacanthicus/lacsap/blob/master/lacsap.cpp#L76

../lacsap -no-memcpy -tt longcompile.pas -O0
Time for Parse 0.502 ms
Time for Analyse 0.015 ms
Time for Compile 1.038 ms
Time for CreateObject 48134.541 ms
Time for CreateBinary 48179.720 ms
Time for Compile 48187.351 ms

And before someone says "but you are running a debug build": if I use the "production" build, it does speed things up quite nicely, about 3x, but it still takes 17 seconds vs 45 ms with that build of the compiler.

../lacsap -no-memcpy -tt longcompile.pas -O0
Time for Parse 0.937 ms
Time for Analyse 0.005 ms
Time for Compile 0.559 ms
Time for CreateObject 17241.177 ms
Time for CreateBinary 17286.701 ms
Time for Compile 17289.187 ms

../lacsap -tt longcompile.pas
Time for Parse 0.274 ms
Time for Analyse 0.004 ms
Time for Compile 0.258 ms
Time for CreateObject 7.504 ms
Time for CreateBinary 45.405 ms
Time for Compile 46.670 ms

I believe I know what happens: the compiler is trying to figure out the best order of instructions, and looks at N^2 instructions that are pretty much independently executable, with no code or data dependencies. So it iterates over a vast number of possible permutations, only to find that they are all pretty much equally good/bad... But like I said earlier, although I'm a professional software engineer, compilers are just a hobby project for me, and I only started a little over a year back, so I make no pretense of knowing the answer. Using memcpy instead avoids this problem.

--
Mats

On 17 August 2015 at 22:05, Mehdi Amini <mehdi.amini at apple.com> wrote:
> Hi Mats,
>
> The performance issue seems like a potentially separate issue.
> Can you send the input IR in both cases and the list of passes you are
> running?
>
> Thanks,
>
> —
> Mehdi
>
> [earlier messages quoted in full; trimmed]
Hi,

I'm a front-end author who has also run into similar problems when using LLVM with arrays and structs. I would also like to see this situation improved, although I can't comment on the particular proposed patch.

Here is a bug report I filed about arrays a while ago:
https://llvm.org/bugs/show_bug.cgi?id=17090

Cheers,
Nick C.

On 17/08/2015 22:02, mats petersson via llvm-dev wrote:
> [quoted message trimmed]
Oh, and another potential reason for handling aggregate loads and stores directly is that it expresses the semantics of the program more clearly, which I think should allow LLVM to optimise more aggresively. Here's a bug report showing a missed optimisation, which I think is due to the use of memcpy, which in turn is required to work around slow structure loads and stores: https://llvm.org/bugs/show_bug.cgi?id=23226 Cheers, Nick On 17/08/2015 22:02, mats petersson via llvm-dev wrote:> I've definitely "run into this problem", and I would very much love to > remove my kludges [that are incomplete, because I keep finding places > where I need to modify the code-gen to "fix" the same problem - this > is probably par for the course from a complete amateur compiler writer > and someone that has only spent the last 14 months working (as a > hobby) with LLVM]. > > So whilst I can't contribute much on the "what is the right solution" > and "how do we solve this", I would very much like to see something > that allows the user of LLVM to use load/store withing things like "is > my thing that I'm storing big, if so don't generate a load, use a > memcpy instead". Not only does this make the usage of LLVM harder, it > also causes slow compilation [perhaps this is a separte problem, but I > have a simple program that copies a large struct a few times, and if I > turn off my "use memcpy for large things", the compile time gets quite > a lot longer - approx 1000x, and 48 seconds is a long time to compile > 37 lines of relatively straight forward code - even the Pascal > compiler on PDP-11/70 that I used at my school in 1980's was capable > of doing more than 1 line per second, and it didn't run anywhere near > 2.5GHz and had 20-30 users anytime I could use it...] 
> > ../lacsap -no-memcpy -tt longcompile.pas > Time for Parse 0.657 ms > Time for Analyse 0.018 ms > Time for Compile 1.248 ms > Time for CreateObject 48803.263 ms > Time for CreateBinary 48847.631 ms > Time for Compile 48854.064 ms > > compared with: > ../lacsap -tt longcompile.pas > Time for Parse 0.455 ms > Time for Analyse 0.013 ms > Time for Compile 1.138 ms > Time for CreateObject 44.627 ms > Time for CreateBinary 82.758 ms > Time for Compile 95.797 ms > > wc longcompile.pas > 37 84 410 longcompile.pas > > Source here: > https://github.com/Leporacanthicus/lacsap/blob/master/test/longcompile.pas > > > -- > Mats > > On 17 August 2015 at 21:18, deadal nix via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > OK, what about that plan : > > Slice the aggregate into a serie of valid loads/stores for non > atomic ones. > Use big scalar for atomic/volatile ones. > Try to generate memcpy or memmove when possible ? > > > 2015-08-17 12:16 GMT-07:00 deadal nix <deadalnix at gmail.com > <mailto:deadalnix at gmail.com>>: > > > > 2015-08-17 11:26 GMT-07:00 Mehdi Amini <mehdi.amini at apple.com > <mailto:mehdi.amini at apple.com>>: > > Hi, > >> On Aug 17, 2015, at 12:13 AM, deadal nix via llvm-dev >> <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> wrote: >> >> >> >> 2015-08-16 23:21 GMT-07:00 David Majnemer >> <david.majnemer at gmail.com <mailto:david.majnemer at gmail.com>>: >> >> >> >> Because a solution which doesn't generalize is not a >> very powerful solution. What happens when somebody >> says that they want to use atomics + large aggregate >> loads and stores? Give them yet another, different >> answer? That would mean our earlier, less general >> answer, approach was either a bandaid (bad) or the >> new answer requires a parallel code path in their >> frontend (worse). >> > > > +1 with David’s approach: making thing incrementally > better is fine *as long as* the long term direction is > identified. 
>             Small incremental changes that make things slightly
>             better in the short term but drive us away from the
>             long-term direction are not good.
>
>             Don't get me wrong, I'm not saying that the current patch
>             is not good, just that it does not seem clear to me that
>             the long-term direction has been identified, which
>             explains why some are nervous about adding stuff
>             prematurely. And I'm not for the status quo: while I
>             can't judge it definitively myself, I even bugged David
>             last month to look at this revision and try to identify
>             what the long-term direction really is and how to make
>             your (and other) frontends' lives easier.
>
>         As long as there is something to be done. Concern has been
>         raised about very large aggregates (64K, 1MB), but there is no
>         way good codegen can come out of these anyway. I don't know of
>         any machine that has 1MB of registers available to tank the
>         load. Even if we had a good way to handle it in InstCombine,
>         the backend would have no capability to generate something
>         nice for it anyway. Most aggregates are small, and there is no
>         good excuse not to handle them just because someone could
>         generate gigantic ones that won't map nicely to the hardware
>         anyway.
>
>         By that logic, SROA should not exist, as one could generate
>         gigantic aggregates as well (in fact, SROA fails pretty badly
>         on large aggregates).
>
>         The second concern raised is for atomic/volatile, which needs
>         to be handled differently by the optimizer anyway, so it is
>         mostly irrelevant here.
>
>             clang has many developers behind it, some of them paid to
>             work on it. That's simply not the case for many others.
>
>             But to answer your questions:
>             - Per-field load/store generates more loads/stores than
>             necessary in many cases. These can't be aggregated back
>             because of padding.
>             - memcpy only works memory to memory. It is certainly
>             usable in some cases, but it certainly does not cover all
>             uses.
>             I'm willing to do the memcpy optimization in InstCombine
>             (in fact, if things had not degenerated into so much
>             bikeshedding, it would already be done).
>
>         Calling "bikeshedding" what other devs think keeps the quality
>         of the project high is unlikely to help your patch go through;
>         it's probably quite the opposite, actually.
>
>     I understand the desire to keep quality high. That is not where
>     the problem is. The problem lies in discussing actual proposals
>     against hypothetical perfect ones that do not exist.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://llvm.cs.uiuc.edu
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
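[To make the workaround discussed above concrete, here is a hedged sketch in LLVM IR of the era (typed pointers, and the pre-LLVM-7 `llvm.memcpy` signature with an explicit alignment argument). The struct type and function names are invented for illustration and do not come from the thread.]

```llvm
; A hypothetical 16-byte "fat pointer" aggregate.
%struct.Fat = type { i8*, i64 }

define void @copy_direct(%struct.Fat* %dst, %struct.Fat* %src) {
  ; First-class aggregate load/store: expresses the copy directly,
  ; but is what frontends report as producing slow codegen/compiles
  ; for large types.
  %v = load %struct.Fat, %struct.Fat* %src
  store %struct.Fat %v, %struct.Fat* %dst
  ret void
}

define void @copy_memcpy(%struct.Fat* %dst, %struct.Fat* %src) {
  ; The "use memcpy for large things" workaround: the frontend lowers
  ; the copy itself, discarding the aggregate-level semantics.
  %d = bitcast %struct.Fat* %dst to i8*
  %s = bitcast %struct.Fat* %src to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %d, i8* %s, i64 16, i32 8, i1 false)
  ret void
}

declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)
```

[Both functions perform the same copy; the point raised in the thread is that the first form keeps the structure visible to the optimizer, while the second is what frontends currently fall back to for big types.]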
It is pretty clear people need this. Let's get this moving.

I'll try to sum up the points that have been made, and address them carefully.

1/ There is no good solution for large aggregates.

That is true. However, I don't think this is a reason not to address smaller aggregates, as they appear to be needed. Realistically, the proportion of aggregates that are very large is small, and there is no expectation that such a thing would map nicely to the hardware anyway (the hardware won't have enough registers to load it all). I think it is reasonable to expect good handling of relatively small aggregates like fat pointers, while accepting that large ones will be inefficient. This limitation is not unique to the current discussion; SROA suffers from the same one. It is possible to disable the transformation for aggregates that are too large if this is too big a concern. The same should maybe be done for SROA.

2/ Slicing the aggregate breaks the semantics of atomic/volatile.

That is true. It means slicing should not be done for atomic/volatile accesses. It doesn't mean it should not be done for regular ones, as it is reasonable to handle atomic/volatile differently; after all, they have different semantics.

3/ Not slicing can create scalars that aren't supported by the target. This is undesirable.

Indeed. But as always, the important question is: compared to what? The hardware has no notion of aggregates, so an aggregate and a large scalar both end up requiring legalization. Doing the transformation is still beneficial:
 - Some aggregates will generate valid scalars. For such aggregates, this is a 100% win.
 - For aggregates that won't, the situation is still better, as various optimization passes will be able to handle the load in a sensible manner.
 - The transformation never makes the situation worse than it was to begin with.

In previous discussion, Hal Finkel seemed to think that the scalar solution is preferable to the slicing one.
Is that a fair assessment of the situation?

Considering all of this, I think the right path forward is:
 - Go for the scalar solution in the general case.
 - If that is a problem, use the slicing approach for non-atomic/volatile accesses.
 - If necessary, disable the transformation for very large aggregates (and consider doing so for SROA as well).

Do we have a plan?

2015-08-18 18:36 GMT-07:00 Nicholas Chapman via llvm-dev <llvm-dev at lists.llvm.org>:

> Oh, and another potential reason for handling aggregate loads and
> stores directly is that it expresses the semantics of the program more
> clearly, which I think should allow LLVM to optimise more aggressively.
> Here's a bug report showing a missed optimisation, which I think is due
> to the use of memcpy, which in turn is required to work around slow
> structure loads and stores:
> https://llvm.org/bugs/show_bug.cgi?id=23226
>
> Cheers,
> Nick

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev