Hal Finkel via llvm-dev
2015-Aug-21 21:46 UTC
[llvm-dev] [RFC] Aggreate load/store, proposed plan
----- Original Message -----
> From: "deadal nix" <deadalnix at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Mehdi Amini" <mehdi.amini at apple.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 21, 2015 1:24:04 AM
> Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan
>
> I've done some investigation. The proposed approach seems to work well on
> structs with no padding, but the optimizer gets confused when the struct
> has some. That's a bummer, but it is already improving things, so I'd
> rather have that than nothing.
>
> Alternatively, I think I have an approach, in theory only so far, that may
> be better. The idea is to create an alloca, memcpy the aggregate into it,
> and transform extractvalues into loads, and insertvalues into an optional
> memcpy into a new alloca (if the old value is still alive after the
> insertvalue) plus a store. There is some extra trickery one needs to do
> for call arguments, ret and phi nodes, but overall, I think this can work.
> This approach would rely on SROA to generate something sensible. As it is
> going to do massive substitutions, I'm not sure InstCombine would be the
> right place to do it. Maybe its own pass?

I think this might make more sense; it sounds closer to the transformation
that Clang does for C-level aggregates, and should play nicer with SROA,
etc. I'm not entirely sure where it should go, but feel free to prototype
it as its own pass, and if it works well, we'll worry about pipeline
placement.
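To make the shape of that rewrite concrete, here is a minimal hand-written
sketch of the extractvalue side of the idea. It is illustrative only, not
the output of any existing pass; the %pair type and the function names are
made up for the example. An extractvalue of a loaded aggregate becomes a
memcpy into a temporary alloca plus an ordinary scalar load, which SROA is
then expected to clean up:

  %pair = type { i64, i32 }

  declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

  ; before: direct aggregate load followed by extractvalue
  define i64 @get_first(%pair* %p) {
    %agg = load %pair, %pair* %p
    %v = extractvalue %pair %agg, 0
    ret i64 %v
  }

  ; after the proposed rewrite: spill through an alloca, then load the member
  define i64 @get_first.rewritten(%pair* %p) {
    %tmp = alloca %pair, align 8
    %dst = bitcast %pair* %tmp to i8*
    %src = bitcast %pair* %p to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 16, i32 8, i1 false)
    %addr = getelementptr inbounds %pair, %pair* %tmp, i32 0, i32 0
    %v = load i64, i64* %addr
    ret i64 %v
  }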
> That wouldn't work for atomic/volatile, but I think that would work
> nicely for regular load/stores.

We only support atomics on integer pointee types anyway, so atomics are not
relevant.

 -Hal

> How does that sound?
>
> 2015-08-20 16:38 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:
>
> > ----- Original Message -----
> > From: "deadal nix" <deadalnix at gmail.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "Mehdi Amini" <mehdi.amini at apple.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> > Sent: Thursday, August 20, 2015 4:09:17 PM
> > Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan
> >
> > Problem:
> >
> > Many languages define aggregates and some way to manipulate them. LLVM
> > defines aggregate types (arrays and structs) to handle them. However,
> > when aggregates are loaded or stored, LLVM simply ignores them until
> > legalization in the backend. This leads to many missed optimizations.
> > Most frontends use a set of tricks to work around this limitation, but
> > that is an undesirable situation, as it increases the work required to
> > write a frontend. Ideally that work should be done once by LLVM instead
> > of every time by each frontend.
> >
> > In previous discussions on the subject, many LLVM users have expressed
> > interest in being able to use aggregate memory accesses. In addition, it
> > is likely that it would have reduced the workload of some existing
> > frontends.
> >
> > The proposed solution is to transform aggregate loads and stores into
> > something that the LLVM toolchain already understands and is able to
> > work with. The proposed solution will use InstCombine to do the
> > transformation, as it runs early and will allow subsequent passes to
> > work with something familiar (basically, canonicalization).
> >
> > Proposed solution:
> >
> > Aggregate loads and stores are turned into loads and stores of a scalar
> > of the same size and alignment. Binary manipulations, such as masks and
> > shifts, are used to build the aggregate from the scalar after loading,
> > and to build the scalar from the aggregate when storing.
> >
> > For instance, the following IR (extracted from a D frontend):
> >
> > %B__vtbl = type { i8*, i32 (%B*)* }
> > @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> >
> > %0 = tail call i8* @allocmemory(i64 32)
> > %1 = bitcast i8* %0 to %B*
> > store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> >
> > would be canonicalized into:
> >
> > %0 = tail call i8* @allocmemory(i64 32)
> > %1 = bitcast i8* %0 to i128*
> > store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to i64) to i128), i128 774763251095801167872), i128* %1, align 8
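For reference, the i128 constant above is just the two members packed
together after constant folding: assuming a little-endian data layout (as
in the original example), the low 64 bits hold the ptrtoint of @B__vtblZ,
and 774763251095801167872 is 42 shifted left by 64 bits. For a non-constant
value, the same store would be built with explicit shift/or operations,
along these lines (hand-written sketch; @store_B is a made-up helper):

  define void @store_B(i128* %dst, %B__vtbl* %vtbl, i32 %val) {
    ; pack the first member into the low 64 bits ...
    %p    = ptrtoint %B__vtbl* %vtbl to i64
    %lo   = zext i64 %p to i128
    ; ... and the second member into bits 64-95
    %v    = zext i32 %val to i128
    %hi   = shl i128 %v, 64
    %bits = or i128 %lo, %hi
    store i128 %bits, i128* %dst, align 8
    ret void
  }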
> As I've said before, we really should introduce some useful
> canonicalization for these, and so I support the investigations in this
> area.
>
> Have you verified that InstCombine can undo this transformation
> reasonably? Specifically, if you have a bunch of insertvalues and an
> aggregate store in one function, and a load in another function with a
> bunch of extractvalues, which gets inlined into the function with the
> store, does everything get combined appropriately to forward the values
> directly, regardless of how many of the members are extracted and in what
> order?
>
> -Hal
>
> > Which the rest of the LLVM pipeline can work with.
> >
> > Limitations:
> >
> > 1/ This solution will not properly handle large (tens of kilobytes)
> > aggregates. It is an accepted limitation, both for this proposal and for
> > other parts of the pipeline that handle aggregates. Optionally, checks
> > can be added both for this canonicalization and for SROA to disable them
> > on very large aggregates, so as to avoid wasting work that won't yield
> > good codegen in the end anyway. This limitation should not be a blocker,
> > as most aggregates are fairly small. For instance, some languages make
> > heavy use of fat pointers, and would greatly benefit from this
> > canonicalization.
> >
> > 2/ This solution will generate loads and stores of values that may not
> > be natively supported by the hardware. The hardware does not natively
> > support aggregates to begin with, so both the original IR and the
> > canonicalized IR will require legalization. This is not ideal, but the
> > canonicalization is a plus for two reasons:
> > - A subset of these memory accesses won't need any further legalization.
> > - Other passes in LLVM will be able to work with these loads and stores
> >   and perform adequate transformations.
> >
> > Possible alternatives:
> >
> > In order to mitigate 1/, it is possible to gate the canonicalization on
> > aggregates under a certain size, essentially avoiding work that will
> > lead to bad codegen no matter what.
> >
> > In order to mitigate 2/, it is possible to slice aggregate loads and
> > stores according to the target's data layout. This CANNOT be done for
> > atomic/volatile accesses, as it would change their semantics, but it can
> > be done for regular ones, which are the most common.
> >
> > Does that look better as an RFC?
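To illustrate the slicing alternative for 2/: for the (non-atomic,
non-volatile) %B store in the example above, the aggregate store would
instead be split into one scalar store per member, following the target's
data layout. A hand-written sketch, reusing %0 from the example (the value
names are made up):

  %vtbl.addr = bitcast i8* %0 to %B__vtbl**
  store %B__vtbl* @B__vtblZ, %B__vtbl** %vtbl.addr, align 8
  %int.raw  = getelementptr inbounds i8, i8* %0, i64 8
  %int.addr = bitcast i8* %int.raw to i32*
  store i32 42, i32* %int.addr, align 8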
> > 2015-08-19 22:11 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:
> >
> > > ----- Original Message -----
> > > From: "Mehdi Amini via llvm-dev" <llvm-dev at lists.llvm.org>
> > > To: "deadal nix" <deadalnix at gmail.com>
> > > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> > > Sent: Wednesday, August 19, 2015 7:24:28 PM
> > > Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan
> > >
> > > Hi,
> > >
> > > To be sure, because the RFC below is not detailed and assumes everyone
> > > knows about all the emails from 10 months ago,
> >
> > I agree. The RFC needs to summarize the problems and the potential
> > solutions.
> >
> > > is there more to do than what is proposed in
> > > http://reviews.llvm.org/D9766 ?
> > >
> > > So basically the proposal is that *InstCombine*
> >
> > I think that fixing this early in the optimizer makes sense
> > (InstCombine, etc.). This seems little different from any other
> > canonicalization problem. These direct aggregate IR values are valid IR,
> > but not our preferred canonical form, so we should transform the IR,
> > when possible, into our preferred canonical form.
> >
> > -Hal
> >
> > > turns aggregate load/store into a load/store using an integer of
> > > equivalent size and inserts the correct bitcasts before/after, right?
> > >
> > > Example is:
> > >
> > > %0 = tail call i8* @allocmemory(i64 32)
> > > %1 = bitcast i8* %0 to %B*
> > > store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> > >
> > > into:
> > >
> > > store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to i64) to i128), i128 774763251095801167872), i128* %1, align 8
> > >
> > > Where the aggregate is:
> > >
> > > %B__vtbl = type { i8*, i32 (%B*)* }
> > > @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> > >
> > > Thanks,
> > >
> > > —
> > > Mehdi
> > >
> > > > On Aug 19, 2015, at 5:02 PM, deadal nix via llvm-dev
> > > > <llvm-dev at lists.llvm.org> wrote:
> > > >
> > > > It is pretty clear people need this. Let's get this moving.
> > > >
> > > > I'll try to sum up the points that have been made and I'll try to
> > > > address them carefully.
> > > >
> > > > 1/ There is no good solution for large aggregates.
> > > > That is true. However, I don't think this is a reason not to address
> > > > smaller aggregates, as they appear to be needed. Realistically, the
> > > > proportion of aggregates that are very large is small, and there is
> > > > no expectation that such a thing would map nicely to the hardware
> > > > anyway (the hardware won't have enough registers to load it all
> > > > anyway). I do think it is reasonable to expect reasonable handling
> > > > of relatively small aggregates, like fat pointers, while accepting
> > > > that large ones will be inefficient.
> > > >
> > > > This limitation is not unique to the current discussion, as SROA
> > > > suffers from the same limitation. It is possible to disable the
> > > > transformation for aggregates that are too large if this is too big
> > > > of a concern. It should maybe also be done for SROA.
> > > >
> > > > 2/ Slicing the aggregate breaks the semantics of atomic/volatile.
> > > > That is true. It means slicing the aggregate should not be done for
> > > > atomic/volatile. It doesn't mean this should not be done for regular
> > > > ones, as it is reasonable to handle atomic/volatile differently.
> > > > After all, they have different semantics.
> > > >
> > > > 3/ Not slicing can create scalars that aren't supported by the
> > > > target. This is undesirable.
> > > > Indeed. But as always, the important question is: compared to what?
> > > >
> > > > The hardware has no notion of aggregates, so an aggregate and a
> > > > large scalar both end up requiring legalization. Doing the
> > > > transformation is still beneficial:
> > > > - Some aggregates will generate valid scalars. For such aggregates,
> > > >   this is a 100% win.
> > > > - For aggregates that won't, the situation is still better, as
> > > >   various optimization passes will be able to handle the load in a
> > > >   sensible manner.
> > > > - The transformation never makes the situation worse than it is to
> > > >   begin with.
> > > >
> > > > In previous discussions, Hal Finkel seemed to think that the scalar
> > > > solution is preferable to the slicing one.
> > > >
> > > > Is that a fair assessment of the situation? Considering all of this,
> > > > I think the right path forward is:
> > > > - Go for the scalar solution in the general case.
> > > > - If that is a problem, the slicing approach can be used for
> > > >   non-atomic/volatile.
> > > > - If necessary, disable the transformation for very large aggregates
> > > >   (and consider doing so for SROA as well).
> > > >
> > > > Do we have a plan?
> > > >
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > llvm-dev at lists.llvm.org
> > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
deadal nix via llvm-dev
2015-Aug-21 22:05 UTC
[llvm-dev] [RFC] Aggreate load/store, proposed plan
Awesome, I will work on a prototype.

2015-08-21 14:46 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:

> I think this might make more sense; it sounds closer to the transformation
> that Clang does for C-level aggregates, and should play nicer with SROA,
> etc. I'm not entirely sure where it should go, but feel free to prototype
> it as its own pass, and if it works well, we'll worry about pipeline
> placement.
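As an aside on the padding issue reported at the top of the thread (the
scalar form works well for structs without padding but confuses the
optimizer when there is some), here is a minimal illustration of what the
scalarized form of a padded aggregate store looks like. The %padded type
and the function are hypothetical; the thread does not say which structs
were actually tested. The packed i64 mixes defined fields with padding bits:

  %padded = type { i8, i32 }   ; 3 bytes of padding between the members

  define void @store_padded(%padded* %p, i8 %a, i32 %b) {
    ; scalarized form of an aggregate store of { %a, %b }:
    ; %b lands at bit 32, bits 8-31 correspond to padding
    %ip  = bitcast %padded* %p to i64*
    %a64 = zext i8 %a to i64
    %b64 = zext i32 %b to i64
    %bsh = shl i64 %b64, 32
    %v   = or i64 %a64, %bsh
    store i64 %v, i64* %ip, align 4
    ret void
  }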
Philip Reames via llvm-dev
2015-Aug-21 22:20 UTC
[llvm-dev] [RFC] Aggreate load/store, proposed plan
On 08/21/2015 02:46 PM, Hal Finkel via llvm-dev wrote:
>> That wouldn't work for atomic/volatile, but I think that would work
>> nicely for regular load/stores.
>>
> We only support atomics on integer pointee types anyway, so atomics are
> not relevant.
OT: I plan to change this at some point. In particular, I need atomic float
and double access. Doing the bitcasts in the frontend works, but given
there's an obvious lowering, I don't see why we shouldn't do this within
LLVM.
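For context, the frontend workaround Philip mentions looks roughly like the
following (illustrative sketch; @atomic_load_float is a made-up name): the
atomic access is performed on an integer of the same width, and the result
is bitcast back to the floating-point type.

  define float @atomic_load_float(float* %p) {
    ; perform the atomic access on an i32 of the same width ...
    %ip = bitcast float* %p to i32*
    %iv = load atomic i32, i32* %ip seq_cst, align 4
    ; ... then reinterpret the bits as float
    %fv = bitcast i32 %iv to float
    ret float %fv
  }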
Hal Finkel via llvm-dev
2015-Aug-21 22:27 UTC
[llvm-dev] [RFC] Aggreate load/store, proposed plan
----- Original Message -----
> From: "Philip Reames" <listmail at philipreames.com>
> To: "Hal Finkel" <hfinkel at anl.gov>, "deadal nix" <deadalnix at gmail.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 21, 2015 5:20:34 PM
> Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan
>
> On 08/21/2015 02:46 PM, Hal Finkel via llvm-dev wrote:
> > We only support atomics on integer pointee types anyway, so atomics are
> > not relevant.
>
> OT: I plan to change this at some point. In particular, I need atomic
> float and double access. Doing the bitcasts in the frontend works, but
> given there's an obvious lowering, I don't see why we shouldn't do this
> within LLVM.

I agree; I'd like them for float/double as well.

 -Hal
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory