thr3ads.net - llvm dev - [llvm-dev] Aggregate load/stores [Aug 2015]

deadal nix via llvm-dev

2015-Aug-17 19:16 UTC

[llvm-dev] Aggregate load/stores

2015-08-17 11:26 GMT-07:00 Mehdi Amini <mehdi.amini at apple.com>:
> Hi,
>
> On Aug 17, 2015, at 12:13 AM, deadal nix via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> 2015-08-16 23:21 GMT-07:00 David Majnemer <david.majnemer at gmail.com>:
>
>> Because a solution which doesn't generalize is not a very powerful
>> solution. What happens when somebody says that they want to use
>> atomics + large aggregate loads and stores? Give them yet another,
>> different answer? That would mean our earlier, less general approach
>> was either a bandaid (bad) or the new answer requires a parallel code
>> path in their frontend (worse).
>>
>
>
> +1 with David’s approach: making things incrementally better is fine *as
> long as* the long term direction is identified. Small incremental changes
> that make things slightly better in the short term but drive us away from
> the long term direction are not good.
>
> Don’t get me wrong, I’m not saying that the current patch is not good,
> just that it does not seem clear to me that the long term direction has
> been identified, which explains why some can be nervous about adding stuff
> prematurely.
> And I’m not for the status quo: while I can’t judge it definitively
> myself, I even bugged David last month to look at this revision and try to
> identify what is really the long term direction and how to make your (and
> other) frontends’ life easier.
>

As long as there is something to be done. Concern has been raised for very
large aggregates (64K, 1Mb), but there is no way good codegen can come out
of these anyway. I don't know of any machine that has 1Mb of registers
available to tank the load. Even if we had a good way to handle it in
InstCombine, the backend would have no capability to generate something
nice for it anyway. Most aggregates are small, and there is no good excuse
to not do anything to handle them because someone could generate gigantic
ones that won't map nicely to the hardware anyway.

By that logic, SROA should not exist, as one could generate gigantic
aggregates as well (in fact, SROA fails pretty badly on large aggregates).

The second concern raised is for atomic/volatile, which needs to be handled
by the optimizer differently anyway, so it is mostly irrelevant here.

> clang has many developers behind it, some of them paid to work on it.
> That's simply not the case for many others.
>
> But to answer your questions:
>  - Per-field load/store generates more loads/stores than necessary in
>    many cases. These can't be aggregated back because of padding.
>  - memcpy only works memory to memory. It is certainly usable in some
>    cases, but certainly does not cover all uses.
>
> I'm willing to do the memcpy optimization in InstCombine (in fact, if
> things did not degenerate into so much bikeshedding, that would already
> be done).
>
> Calling “bikeshedding” what other devs think is what keeps the quality
> of the project high is unlikely to help your patch go through; it’s
> probably quite the opposite actually.

I understand the desire to keep quality high. That is not where the
problem is. The problem lies in discussing actual proposals against
hypothetical perfect ones that do not exist.
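
The padding argument quoted above can be made concrete with a small
sketch. It is only an illustration in Python's ctypes, not LLVM code, and
the `Sample` layout is a hypothetical struct chosen for the example. It
lists the byte range each field covers and counts the bytes that belong to
no field, which is why a run of per-field accesses cannot simply be merged
back into one wide access.

```python
import ctypes


class Sample(ctypes.Structure):
    # Hypothetical layout for illustration: a 1-byte field, then a 4-byte field.
    _fields_ = [("tag", ctypes.c_uint8), ("value", ctypes.c_uint32)]


def field_ranges(struct_type):
    """Return (name, offset, size) for every field of a ctypes structure."""
    ranges = []
    for name, _ctype in struct_type._fields_:
        descriptor = getattr(struct_type, name)
        ranges.append((name, descriptor.offset, descriptor.size))
    return ranges


def padding_bytes(struct_type):
    """Count the bytes of the structure that belong to no field."""
    covered = sum(size for _name, _offset, size in field_ranges(struct_type))
    return ctypes.sizeof(struct_type) - covered


for name, offset, size in field_ranges(Sample):
    print(f"{name}: offset {offset}, size {size}")
print("total size:", ctypes.sizeof(Sample), "padding:", padding_bytes(Sample))
```

The exact numbers depend on the platform ABI; on common ones the 4-byte
field is aligned to 4, which leaves a gap after the 1-byte field.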
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150817/59a6be4a/attachment.html>

deadal nix via llvm-dev

2015-Aug-17 20:18 UTC

[llvm-dev] Aggregate load/stores

OK, what about this plan:

Slice the aggregate into a series of valid loads/stores for non-atomic ones.
Use a big scalar for atomic/volatile ones.
Try to generate memcpy or memmove when possible?
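
The three rules of the plan can be sketched as a toy decision function.
This is a model in Python, not an LLVM pass: `Field`,
`lower_aggregate_copy`, and the `ordered` and `memory_to_memory` flags are
hypothetical names introduced for the illustration, and the order in which
the rules are checked is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    offset: int  # byte offset inside the aggregate
    size: int    # byte size of the scalar field


def lower_aggregate_copy(fields, total_size, ordered=False, memory_to_memory=False):
    """Pick a lowering for one aggregate load/store pair (toy model)."""
    if ordered:
        # Atomic or volatile: keep a single access, typed as one wide scalar.
        return [("scalar", 0, total_size)]
    if memory_to_memory:
        # The loaded value is only stored back to memory: one block copy.
        return [("memcpy", 0, total_size)]
    # Otherwise emit one valid access per field; padding is never touched.
    return [("slice", f.offset, f.size) for f in fields]


layout = [Field(offset=0, size=1), Field(offset=4, size=4)]
print(lower_aggregate_copy(layout, total_size=8))
print(lower_aggregate_copy(layout, total_size=8, ordered=True))
print(lower_aggregate_copy(layout, total_size=8, memory_to_memory=True))
```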



mats petersson via llvm-dev

2015-Aug-17 21:02 UTC

[llvm-dev] Aggregate load/stores

I've definitely "run into this problem", and I would very much love to
remove my kludges [that are incomplete, because I keep finding places where
I need to modify the code-gen to "fix" the same problem - this is probably
par for the course from a complete amateur compiler writer and someone that
has only spent the last 14 months working (as a hobby) with LLVM].

So whilst I can't contribute much on the "what is the right solution" and
"how do we solve this", I would very much like to see something that allows
the user of LLVM to use load/store without things like "is my thing that
I'm storing big, if so don't generate a load, use a memcpy instead". Not
only does this make the usage of LLVM harder, it also causes slow
compilation [perhaps this is a separate problem, but I have a simple program
that copies a large struct a few times, and if I turn off my "use memcpy
for large things", the compile time gets quite a lot longer - approx 1000x,
and 48 seconds is a long time to compile 37 lines of relatively
straightforward code - even the Pascal compiler on the PDP-11/70 that I used
at my school in the 1980s was capable of doing more than 1 line per second,
and it didn't run anywhere near 2.5GHz and had 20-30 users anytime I could
use it...]

../lacsap -no-memcpy -tt longcompile.pas
Time for Parse 0.657 ms
Time for Analyse 0.018 ms
Time for Compile 1.248 ms
Time for CreateObject 48803.263 ms
Time for CreateBinary 48847.631 ms
Time for Compile 48854.064 ms

compared with:
../lacsap -tt longcompile.pas
Time for Parse 0.455 ms
Time for Analyse 0.013 ms
Time for Compile 1.138 ms
Time for CreateObject 44.627 ms
Time for CreateBinary 82.758 ms
Time for Compile 95.797 ms

wc longcompile.pas
 37  84 410 longcompile.pas

Source here:
https://github.com/Leporacanthicus/lacsap/blob/master/test/longcompile.pas
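
The frontend-side check described above ("is my thing big, if so use a
memcpy instead") might look roughly like the following sketch. It is not
lacsap's actual code: `choose_copy_strategy` and the 64-byte cutoff are
assumptions made for the illustration. The last lines recompute the
slowdown from the two CreateObject timings posted above.

```python
MEMCPY_THRESHOLD_BYTES = 64  # assumed cutoff, not a value taken from lacsap or LLVM


def choose_copy_strategy(size_in_bytes, threshold=MEMCPY_THRESHOLD_BYTES):
    """Return which kind of copy a frontend would emit for a value of this size."""
    if size_in_bytes < 0:
        raise ValueError("size must be non-negative")
    return "memcpy" if size_in_bytes > threshold else "load_store"


# CreateObject times in milliseconds, copied from the timings above.
create_object_without_memcpy_ms = 48803.263
create_object_with_memcpy_ms = 44.627
slowdown = create_object_without_memcpy_ms / create_object_with_memcpy_ms

print(choose_copy_strategy(8), choose_copy_strategy(4096))
print(f"CreateObject slowdown without memcpy: about {slowdown:.0f}x")
```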


--
Mats


Jonas Maebe via llvm-dev

2015-Aug-18 13:23 UTC

[llvm-dev] Aggregate load/stores

deadal nix via llvm-dev wrote on Mon, 17 Aug 2015:
> OK, what about this plan:
>
> Slice the aggregate into a series of valid loads/stores for non-atomic ones.
> Use a big scalar for atomic/volatile ones.
> Try to generate memcpy or memmove when possible?

Are memcpy/memmove guaranteed to be handled inline, i.e., without a
call to libc? Or are there plans to do this in the context of the
(afaik) long term goal of enabling llvm to (optionally) generate
freestanding code? If not, generating memcpy/memmove seems like a bad
idea, as it would make that goal harder to achieve.

FWIW, personally I think that all accepted sizes should be handled
reasonably efficiently (or, in other words, that other sizes should
result in a compile time error). I hadn't seen the "Performance tips
for frontend authors" before, and indeed got very ugly code when
trying to load/store a [256 x i16] (http://pastebin.com/krXhuEzF). I
had expected LLVM to generate at least a simple copy loop. I can of
course also generate it in our frontend (after which llvm can then try
to unroll and/or vectorise it :), but it feels redundant.
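
For reference, the "simple copy loop" mentioned above has roughly this
shape. The sketch is plain Python over an array of 16-bit elements; it
only shows the structure of the loop and is not what LLVM or any frontend
emits. `copy_elements` is a hypothetical helper name.

```python
from array import array

ELEMENT_COUNT = 256  # matches the [256 x i16] example from the message


def copy_elements(src, dst):
    """Copy src into dst one element at a time, as a plain loop."""
    if len(src) != len(dst):
        raise ValueError("source and destination must have the same length")
    for index in range(len(src)):
        dst[index] = src[index]


source = array("H", range(ELEMENT_COUNT))  # "H" is an unsigned 16-bit-style element
target = array("H", [0] * ELEMENT_COUNT)
copy_elements(source, target)
print(target == source)
```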


Jonas
