Amara Emerson via llvm-dev
2018-Jul-30 14:01 UTC
[llvm-dev] GlobalISel design update and goals
Hi all,

Over the past few months we’ve been doing work on the foundations for the next stages of GlobalISel development. In terms of changes from this time last year, the IR translator, the legalizer, and the instruction selector have seen moderate to major changes. The most significant of these was the change to the legalizer API, allowing targets to use predicates to express legality. This gives more precise control over which forms of instructions are legal and how to legalize them. It was necessary to implement support for the new extending loads and truncating stores, but it also results in more concise and elegant expressions of legality for each target. For example, you can now apply a single definition to multiple opcodes (G_ADD, G_SUB, G_MUL, etc.).

The IR translator has been modified to split aggregates rather than handling them as one single large scalar. This change fixed some bugs and was necessary in order to handle big-endian code correctly in the future.

The tablegen instruction selector also saw significant performance improvements, helping to keep the overall compile time regression vs. FastISel under 5% geomean on CTMark. There are still a few outliers like sqlite3, which has a significant regression compared to FastISel, but most of the other benchmarks show little difference or even improvement.

The tablegen importer has also been improved, so that we can import more SelectionDAG selection rules. For example, on AArch64 about 40% of the rules are currently imported successfully.

New additions from last year include the beginnings of a new combiner, although there’s still significant work to be done here in terms of the final design. The combiner will become a critical part of the pipeline in order to begin improving runtime performance.
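As a rough illustration of the predicate-based style (a hypothetical, self-contained sketch with invented names, not the actual LLVM LegalizerInfo API): legality is expressed as predicates over an (opcode, type) query, and a single rule can cover several opcodes at once:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: a legality rule is a predicate over an
// (opcode, bit-width) query plus the action to take when it matches.
enum class Action { Legal, NarrowScalar, WidenScalar };

struct Query {
  std::string Opcode;
  unsigned Bits;
};

struct Rule {
  std::function<bool(const Query &)> Pred;
  Action Act;
};

// One definition covering several opcodes at once, mirroring how a single
// legality definition can now apply to G_ADD, G_SUB, and G_MUL together.
std::vector<Rule> makeIntegerArithRules() {
  const std::set<std::string> Ops = {"G_ADD", "G_SUB", "G_MUL"};
  return {
      // 32- and 64-bit forms are legal for all three opcodes.
      {[Ops](const Query &Q) {
         return Ops.count(Q.Opcode) && (Q.Bits == 32 || Q.Bits == 64);
       },
       Action::Legal},
      // Anything narrower gets widened to a legal scalar.
      {[Ops](const Query &Q) { return Ops.count(Q.Opcode) && Q.Bits < 32; },
       Action::WidenScalar},
  };
}

// First matching rule wins; default to narrowing oversized scalars.
Action getAction(const std::vector<Rule> &Rules, const Query &Q) {
  for (const Rule &R : Rules)
    if (R.Pred(Q))
      return R.Act;
  return Action::NarrowScalar;
}
```

The point of the predicate form is that legality decisions too awkward for a fixed type table (e.g. "legal only for these widths") become ordinary code, while common cases stay declarative.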
High-level goals

Going forward, we plan to improve GlobalISel in a number of key areas to achieve the following targets:
* Keep compile time under control: ideally within 5% of FastISel, and, when optimizations are enabled, maintaining a compile time advantage over SelectionDAG.
* Begin improving runtime performance by adding the most important optimizations required to be competitive at -Os. We will be targeting and measuring AArch64 for this goal but will endeavor to implement as many optimizations as possible in generic code to benefit other targets.
* Improve overall stability and test coverage. Maintaining a high level of code quality and minimizing regressions in correctness and performance will be a significant challenge.
* Ensure that the overall design meets the needs of targets in general, without being overly tuned to a specific implementation.

Design work planned

These are some design changes coming in the near to medium term:

* The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to handle different use cases. At the moment the opcode is too powerful, resulting in overly complex handling in places like the legalizer. G_MERGE will be restricted to merging scalars into one larger scalar. For other cases, like merging scalars into a vector, we will create a new G_BUILD_VECTOR opcode, with a new counterpart opcode for the opposite direction. For the current vector + vector case a new G_CONCAT_VECTOR will be introduced. These changes should simplify the implementations for all targets.

* Constant representation at the MI level needs some investigation. We currently represent constants as generic instructions, with each instance of a constant being largely independent of the others, and stored in the entry block except for a few places in IR translation where we emit at the point of use.
As a result, we run a localizer pass in an effort to reduce the live ranges of the constants (and the consequent spilling), using some heuristics to decide where to sink the constant definitions. Since we don’t do any real caching of MI constants, multiple G_CONSTANT definitions can exist for the same constant. This can result in a lot of redundant constants being created, especially for things like address computation. Reducing the number of constants can help reduce compile time and memory usage. Given this situation, one possible approach is to encode constants into the operands of their users, rather than having dedicated machine instructions. At instruction selection time the constant can then be materialized into a register or encoded as an immediate. Further investigation is needed to find the right way forward here.

* For optimizations to be supported, the combiner will become a crucial part of the GISel pipeline. We have already done some preliminary work on a generic combiner, which will eventually be used to support combines of extloads/truncstores. We’ve had discussions on and off list about what we need from the new combiner. The summary is that we want the combiner to be flexible, allowing each target to select from a library of combines, while being as efficient as possible. The combines are currently expressed in C++, but one piece of investigation work we might do is to prototype using the same tablegen-driven instruction selector code to match declarative combine patterns written in tablegen. Regardless, we will need to support the custom C++ use case.

* CSE throughout the pipeline. From a theoretical perspective, having a self-contained CSE pass that operates as a single phase in the pipeline is attractive for its simplicity and elegance. However, we know empirically that this is expensive in compile time.
Not only does the CSE pass itself take a non-negligible time to run, but having it as a late pass can result in the non-CSE’d code from the IRTranslator onwards surviving for a long time, taking up time in analyses at each stage of compilation. We believe running a lightweight CSE early is a win. SelectionDAG currently does CSE by default when building the DAG, and this is something we could explore as part of a custom IRBuilder.

* Known bits computation. Some optimizations require knowledge of which bits in a value are known to be 1 or 0, and obtain it via the computeKnownBits() capability for SelectionDAG nodes. We will need some way of getting the same information. In an ideal scenario the replacement infrastructure for this will be more efficient, as this part of the codebase seems to be disproportionately responsible for pathological compile time regressions.

* Load/store ordering needs some thought, as we currently don’t have a way to easily check at the MI level what the ordering requirements are on a set of memory operations. SelectionDAG uses chains to ensure that they’re scheduled to respect the orderings. How to achieve the same thing remains an open question for GlobalISel.

* More extensive tests that exercise multiple stages of the pipeline. One advantage of using MIR with GISel is that individual passes can be easily tested by feeding in the exact input expected for a particular pass and checking its immediate output. However, this approach can leave holes in the test coverage. To help mitigate this, we will be exploring writing/generating whole-pipeline tests, tracking some IR through each pass and checking how the MIR is mutated. We also have a proposed change to allow usage of FileCheck as a library, not just as a stand-alone tool. This would allow us to use FileCheck-style checks and improve testing of currently unused code paths.
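To make the known-bits idea concrete, here is a minimal self-contained sketch of such a lattice and its transfer functions for a couple of operations (an illustration only, not LLVM's actual KnownBits/computeKnownBits implementation):

```cpp
#include <cassert>
#include <cstdint>

// Minimal known-bits lattice in the spirit of SelectionDAG's
// computeKnownBits(): for each bit we track whether it is known 0 or known 1.
// A bit set in neither mask is unknown; setting it in both would be a conflict.
struct KnownBits {
  uint64_t Zero = 0; // bits known to be 0
  uint64_t One = 0;  // bits known to be 1
};

// AND: a result bit is known 1 only if both inputs are known 1,
// and known 0 if either input is known 0.
KnownBits knownAnd(KnownBits L, KnownBits R) {
  return {L.Zero | R.Zero, L.One & R.One};
}

// OR: a result bit is known 0 only if both inputs are known 0,
// and known 1 if either input is known 1.
KnownBits knownOr(KnownBits L, KnownBits R) {
  return {L.Zero & R.Zero, L.One | R.One};
}

// A fully known constant has every bit known one way or the other.
KnownBits knownConst(uint64_t C) { return {~C, C}; }
```

An optimization can then, for example, prove that a mask or alignment computation leaves certain low bits zero without knowing the full runtime value.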
Roadmap for enabling optimizations

I’ve filed a few PRs that people can follow or comment on to track progress towards enabling the -Os optimization level. The rough outline is:

PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks (Darwin)
PR 38366 - GlobalISel: Lightweight CSE
PR 32561 - GlobalISel: placement of constants in the entry-block and fast regalloc result in lots of reloaded constant
PR 38367 - GlobalISel: Implement support for obtaining known bits information
PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store orderings

These, along with general design and implementation work on the combiner, will then lead onto a long road of performance analysis, inevitable bug fixing, and implementing more optimizations.

If anyone is interested in discussing this in more detail, feel free to reach out on the list, or to any of the GlobalISel developers. We’d especially like to hear about any issues or concerns about porting targets to GlobalISel.

Thanks,
Amara
Quentin Colombet via llvm-dev
2018-Jul-30 23:04 UTC
[llvm-dev] GlobalISel design update and goals
Hi Amara, Thanks for sharing the plan going forward. Inlined a couple of comments. 2018-07-30 7:01 GMT-07:00 Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org>:> Hi all, > > Over the past few months we’ve been doing work on the foundations for the > next stages of GlobalISel development. In terms of changes from this time > last year, the IR translator, the legalizer, and instruction selector have > seen moderate to major changes. The most significant of these was the change > to the legalizer API, allowing targets to use predicates to express > legality, which gives more precise control over what forms of instructions > are legal, and how to legalize them. This was necessary to implement support > for the new extending loads and truncating stores, but also results in more > concise and elegant expressions of legality for each target. For example, > you can now apple a single definition to apply to multiples opcodes (G_ADD, > G_SUB, G_MUL etc). > > The IR translator has been modified to split aggregates rather than handling > them as one single large scalar. This change fixed some bugs and was > necessary in order handle big endian code correctly in future. > > The tablegen instruction selector also saw significant improvements in > performance, helping to keep overall compile time regression vs fastisel to > be <5% geomean on CTMark. There are still a few outliers like sqlite3 which > has a significant regression compared to FastISel, but most of the other > benchmarks show little difference or even improvement. > > The tablegen importer has had improvements made to it, so that we can import > more SelectionDAG selection rules. For example, currently on AArch64 we have > about 40% of the rules being successfully imported. > > New additions from last year include the beginnings of a new combiner, > although there’s still significant work to be done here in terms of the > final design. 
The combiner will become a critical part of the pipeline in > order to begin improving runtime performance. > > High levels goals > > Going forward, we plan to improve GlobalISel in a number of key areas to > achieve the following targets: > * Keeping compile time under control, ideally within 5% of FastISel, and > when optimizations are enabled to maintain a compile time advantage of > SelectionDAG. > * Begin improving runtime performance by adding the most important > optimizations required to be competitive at -Os. We will be targeting and > measuring AArch64 for this goal but will endeavor to implement as many > optimizations as possible in generic code to benefit other targets. > * Improving overall stability and test coverage. Maintaining a high level > of code quality and minimizing regressions in correctness and performance > will be a significant challenge. > * Ensure that the overall design meets the needs of general targets, not > being overly tuned to a specific implementation. > > Design work planned > > These are some design changes coming in the near to medium term future: > > * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to > handle different use cases. At the moment the opcode is too powerful, > resulting in overly complex handling in places like the legalizer. G_MERGE > will be split so that it only handles merging of scalars into one larger > scalar. For other cases like merging scalars into a vector we will create a > new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the > opposite. For the current vector + vector case a new G_CONCAT_VECTOR will be > introduced. With these changes it should simplify implementations for all > targets. > > * Constant representation at the MI level needs some investigation. 
We > currently represent constants as generic instructions, with each instance of > a constant being largely independent of each other, being stored in the > entry block except for a few places in IR translation where we emit at the > point of use. As a result we run a localizer pass in an effort to reduce the > live ranges of the constants (and the consequent spilling), using some > heuristics to decide where to sink the constant definitions to. > > Since we don’t do any real caching of MI constants, multiple G_CONSTANT > definitions can exist for the same constant. This can also result in a lot > of redundant constants being created, especially for things like address > computation. Reducing the number of constants can help reduce compile time > and memory usage. Given this situation, one possible approach is to encode > constants into the operands of the users, rather than have dedicated machine > instructions. At instruction selection time the constant can then be > materialized into a register or encoded as an immediate. Further > investigation is needed to find the right way forward here.

The initial design was not to have constants in machine operands. The main rationale is that materializing a constant may be expensive, so we had better not sprinkle them around. Now, in practice, this indeed means that we need to keep a table of all the constants created so that we don't start to duplicate them, which we fail to do. That should be easy to fix: just a map from virtual register + type to constant that could be kept at the function level (or even the module level). Better yet, this could be handled by the IRBuilder. E.g., when instantiating an IR builder, it could scan the function to see which constants are already there, build this mapping, and then create only the constants that are missing.

Moreover, another advantage of this model is that optimizations don't have to deal with two variants of the same instruction (ADDr and ADDi); the same goes for patterns. Alternatively, if we don't change the opcode but only the MachineOperands, then every optimization has to deal with two different kinds of operands, which is bad IMHO.

Also, this design was meant to absorb what the constant hoisting pass does on the IR, so that we can kill that pass while having a better cost model.

Finally, regarding the localizer pass: this was meant as a workaround for the fast regalloc problem and should be killed asap. In particular, the main motivation was to avoid code size bloat, but AFAIR during our last measurements with Ahmed we only saved a few % on a handful of benchmarks, so maybe we can just kill it.

> > * For optimizations to be supported, the combiner will become a crucial > part of the GISel pipeline. We have already done some preliminary work in a > generic combiner, which will be used to eventually support combines of > extloads/truncstores. We’ve had discussions on and off list about what we > need from the new combiner. The summary is that we want the combiner to be > flexible for each target to select from a library of combines, being as > efficient as possible. The expression of the combines are currently written > in C++, but one piece of investigation work we might do is to prototype > using the same tablegen driven instruction selector code to match > declarative combine patterns written in tablegen. Regardless, we will need > to support the custom C++ use case. > > * CSE throughout the pipeline. From a theoretical perspective, having a > self contained CSE pass that operates as a single phase in the pipeline is > attractive for the simplicity and elegance. However, we know empirically > that this is expensive in compile time. Not only does the CSE pass itself > take a non-negligible time to run, but having it as a late pass can result > in the non-CSE’d code from the IRTranslator onwards surviving for a long > time, taking up time in analysis at each stage of compilation.
We believe > running a light weight CSE early is a win. SelectionDAG currently does CSE > by default when building the DAG, and this is something we could explore as > part of a custom IRBuilder.

I have been pushing for having the IRBuilder be smarter. Having it do CSE was something I wanted us to try.

> > * Known bits computation. Some optimizations require the knowledge of which > bits in a value are known to be 1 or 0, and do this by using the > computeKnownBits() capability for SelectionDAG nodes. We will need some way > of getting the same information. In an ideal scenario the replacement > infrastructure for this will be more efficient, as this part of the codebase > seems to be disproportionately responsible for pathological compile time > regressions. > > * Load/store ordering needs some thought, as we currently don’t have a way > to easily check at the MI level what the ordering requirements are on a set > of memory operations. SelectionDAG uses the chains to ensure that they’re > scheduled to respect the orderings. How to achieve the same thing remains an > open question for GlobalISel.

I don't get this problem. GISel has a sequential IR; the order is already modeled there. Dominance should give us the relative ordering; then, if we want more, we should do exactly what we do at the IR level (alias analysis, etc.)

Cheers,
-Quentin

> > * More extensive tests that exercise multiple stages of the pipeline. One > advantage of using MIR with GISel is that individual passes can be easily > tested by feeding the exact input expected for a particular pass, and > checking the immediate output of the pass. However this approach can leave > holes in the test coverage. To help mitigate this, we will be exploring > writing/generating whole pipeline tests, tracking some IR through each pass > and checking how the MIR is mutated. We currently also have a proposed > change to allow usage of FileCheck as a library, not just as a stand-alone > tool.
This would allow us to use FileCheck style checks and Improve testing > of currently unused code paths. > > > Roadmap for enabling optimizations > > I’ve filed a few PRs that people can follow or comment on to track the > progress towards enabling the -Os optimization level. The rough outline is: > > PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks > (Darwin) > PR 38366 - GlobalISel: Lightweight CSE > PR 32561 - GlobalISel: placement of constants in the entry-block and fast > regalloc result in lots of reloaded constant > PR 38367 - GlobalISel: Implement support for obtaining known bits > information > PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store > orderings > > These, along with general design and implementation work on the combiner, > will then lead onto a long road of performance analysis, inevitable bug > fixing, and implementing more optimizations. > > If anyone is interested in discussing in more detail, feel free to reach out > on the list, or to any of the GlobalISel developers. We’d especially like to > hear about any issues or concerns about porting targets to GlobalISel. > > Thanks, > Amara > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >
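Quentin's suggestion of builder-level constant caching could look roughly like this (a hypothetical sketch with invented names, not real LLVM code): the builder keeps a per-function map keyed on (type, value), so a given constant is materialized at most once and subsequent requests reuse the existing virtual register.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch of a constant-caching builder: a map from
// (type, constant value) to the virtual register already defining it,
// so each distinct constant gets at most one G_CONSTANT-like definition.
class CachingConstBuilder {
  std::map<std::pair<std::string, int64_t>, unsigned> Cache;
  unsigned NextVReg = 0;
  unsigned NumDefsEmitted = 0;

public:
  // Returns the vreg holding the constant, emitting a definition only on
  // the first request for this (type, value) pair.
  unsigned buildConstant(const std::string &Ty, int64_t Val) {
    auto Key = std::make_pair(Ty, Val);
    auto It = Cache.find(Key);
    if (It != Cache.end())
      return It->second; // reuse the existing definition
    ++NumDefsEmitted;    // stand-in for emitting the instruction
    return Cache[Key] = NextVReg++;
  }

  unsigned defsEmitted() const { return NumDefsEmitted; }
};
```

Keying on the type as well as the value matters, since the same integer at different bit widths must get distinct definitions.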
Tom Stellard via llvm-dev
2018-Jul-31 14:53 UTC
[llvm-dev] GlobalISel design update and goals
On 07/30/2018 07:01 AM, Amara Emerson via llvm-dev wrote:
> Hi all,

Hi Amara,

Thanks for the update. One question and then a few comments below. Are there any plans to allow custom legalization of intrinsics? It would be convenient to be able to lower these to generic target instructions prior to register bank selection.

> Over the past few months we’ve been doing work on the foundations for the next stages of GlobalISel development. In terms of changes from this time last year, the IR translator, the legalizer, and instruction selector have seen moderate to major changes. The most significant of these was the change to the legalizer API, allowing targets to use predicates to express legality, which gives more precise control over what forms of instructions are legal, and how to legalize them. This was necessary to implement support for the new extending loads and truncating stores, but also results in more concise and elegant expressions of legality for each target. For example, you can now apple a single definition to apply to multiples opcodes (G_ADD, G_SUB, G_MUL etc). > > The IR translator has been modified to split aggregates rather than handling them as one single large scalar. This change fixed some bugs and was necessary in order handle big endian code correctly in future. > > The tablegen instruction selector also saw significant improvements in performance, helping to keep overall compile time regression vs fastisel to be <5% geomean on CTMark. There are still a few outliers like sqlite3 which has a significant regression compared to FastISel, but most of the other benchmarks show little difference or even improvement. > > The tablegen importer has had improvements made to it, so that we can import more SelectionDAG selection rules. For example, currently on AArch64 we have about 40% of the rules being successfully imported.
> > New additions from last year include the beginnings of a new combiner, although there’s still significant work to be done here in terms of the final design. The combiner will become a critical part of the pipeline in order to begin improving runtime performance. > > *High levels goals* > > Going forward, we plan to improve GlobalISel in a number of key areas to achieve the following targets: > * Keeping compile time under control, ideally within 5% of FastISel, and when optimizations are enabled to maintain a compile time advantage of SelectionDAG. > * Begin improving runtime performance by adding the most important optimizations required to be competitive at -Os. We will be targeting and measuring AArch64 for this goal but will endeavor to implement as many optimizations as possible in generic code to benefit other targets. > * Improving overall stability and test coverage. Maintaining a high level of code quality and minimizing regressions in correctness and performance will be a significant challenge. > * Ensure that the overall design meets the needs of general targets, not being overly tuned to a specific implementation. > > *Design work planned* > > These are some design changes coming in the near to medium term future: > > * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to handle different use cases. At the moment the opcode is too powerful, resulting in overly complex handling in places like the legalizer. G_MERGE will be split so that it only handles merging of scalars into one larger scalar. For other cases like merging scalars into a vector we will create a new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the opposite. For the current vector + vector case a new G_CONCAT_VECTOR will be introduced. With these changes it should simplify implementations for all targets. > > * Constant representation at the MI level needs some investigation. 
We currently represent constants as generic instructions, with each instance of a constant being largely independent of each other, being stored in the entry block except for a few places in IR translation where we emit at the point of use. As a result we run a localizer pass in an effort to reduce the live ranges of the constants (and the consequent spilling), using some heuristics to decide where to sink the constant definitions to. >

I think having the G_CONSTANT opcode is useful in some ways for the AMDGPU target, because almost every operand for this target can be encoded as an immediate, so being able to assume that all operands are registers simplifies the code a lot. However, one problem with G_CONSTANT on AMDGPU is that some instructions have operands that must be encoded as an immediate. This creates some ambiguity, because something like G_CONSTANT i8 0 would normally be illegal, but if its only use is in an operand that must be an i8 immediate, then we must keep it legal. I'm also not sure what register bank to assign in this case, because these values will never be stored in a register. So I think having a little more flexibility on being able to encode operands as immediates would be good, but I'm not sure I would really want to get rid of G_CONSTANT.

> Since we don’t do any real caching of MI constants, multiple G_CONSTANT definitions can exist for the same constant. This can also result in a lot of redundant constants being created, especially for things like address computation. Reducing the number of constants can help reduce compile time and memory usage. Given this situation, one possible approach is to encode constants into the operands of the users, rather than have dedicated machine instructions. At instruction selection time the constant can then be materialized into a register or encoded as an immediate. Further investigation is needed to find the right way forward here.
> > * For optimizations to be supported, the combiner will become a crucial part of the GISel pipeline. We have already done some preliminary work in a generic combiner, which will be used to eventually support combines of extloads/truncstores. We’ve had discussions on and off list about what we need from the new combiner. The summary is that we want the combiner to be flexible for each target to select from a library of combines, being as efficient as possible. The expression of the combines are currently written in C++, but one piece of investigation work we might do is to prototype using the same tablegen driven instruction selector code to match declarative combine patterns written in tablegen. Regardless, we will need to support the custom C++ use case. > > * CSE throughout the pipeline. From a theoretical perspective, having a self contained CSE pass that operates as a single phase in the pipeline is attractive for the simplicity and elegance. However, we know empirically that this is expensive in compile time. Not only does the CSE pass itself take a non-negligible time to run, but having it as a late pass can result in the non-CSE’d code from the IRTranslator onwards surviving for a long time, taking up time in analysis at each stage of compilation. We believe running a light weight CSE early is a win. SelectionDAG currently does CSE by default when building the DAG, and this is something we could explore as part of a custom IRBuilder. >

CSE also makes writing tests harder if it is constantly being run, like in SelectionDAG.

-Tom

> * Known bits computation. Some optimizations require the knowledge of which bits in a value are known to be 1 or 0, and do this by using the computeKnownBits() capability for SelectionDAG nodes. We will need some way of getting the same information.
In an ideal scenario the replacement infrastructure for this will be more efficient, as this part of the codebase seems to be disproportionately responsible for pathological compile time regressions. > > * Load/store ordering needs some thought, as we currently don’t have a way to easily check at the MI level what the ordering requirements are on a set of memory operations. SelectionDAG uses the chains to ensure that they’re scheduled to respect the orderings. How to achieve the same thing remains an open question for GlobalISel. > > * More extensive tests that exercise multiple stages of the pipeline. One advantage of using MIR with GISel is that individual passes can be easily tested by feeding the exact input expected for a particular pass, and checking the immediate output of the pass. However this approach can leave holes in the test coverage. To help mitigate this, we will be exploring writing/generating whole pipeline tests, tracking some IR through each pass and checking how the MIR is mutated. We currently also have a proposed change to allow usage of FileCheck as a library, not just as a stand-alone tool. This would allow us to use FileCheck style checks and Improve testing of currently unused code paths. > > > *Roadmap for enabling optimizations* > ** > I’ve filed a few PRs that people can follow or comment on to track the progress towards enabling the -Os optimization level. 
The rough outline is: > > PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks (Darwin) > PR 38366 - GlobalISel: Lightweight CSE > PR 32561 - GlobalISel: placement of constants in the entry-block and fast regalloc result in lots of reloaded constant > PR 38367 - GlobalISel: Implement support for obtaining known bits information > PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store orderings > > These, along with general design and implementation work on the combiner, will then lead onto a long road of performance analysis, inevitable bug fixing, and implementing more optimizations. > > If anyone is interested in discussing in more detail, feel free to reach out on the list, or to any of the GlobalISel developers. We’d especially like to hear about any issues or concerns about porting targets to GlobalISel. > > Thanks, > Amara >
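For comparison, the SelectionDAG-style "CSE while building" that the thread discusses amounts to a table lookup before each node is created, along these lines (an illustrative sketch with invented names, not SelectionDAG's actual implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Illustrative sketch of CSE at build time: before creating a node, look up
// (opcode, operands) in a table and reuse an existing node if the identical
// computation was already built, so duplicates never enter the IR at all.
class CSEBuilder {
  using Key = std::tuple<std::string, unsigned, unsigned>;
  std::map<Key, unsigned> Table; // (opcode, lhs, rhs) -> node id
  unsigned NextId = 0;

public:
  unsigned buildBinOp(const std::string &Opc, unsigned LHS, unsigned RHS) {
    Key K{Opc, LHS, RHS};
    auto It = Table.find(K);
    if (It != Table.end())
      return It->second; // identical node already exists: CSE hit
    return Table[K] = NextId++;
  }

  unsigned numNodes() const { return NextId; }
};
```

This is also why build-time CSE complicates test writing, as Tom notes: the emitted instructions no longer correspond one-to-one with the calls that built them.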
Amara Emerson via llvm-dev
2018-Jul-31 15:18 UTC
[llvm-dev] GlobalISel design update and goals
Hi Quentin,> On Jul 31, 2018, at 12:04 AM, Quentin Colombet <quentin.colombet at gmail.com> wrote: > > Hi Amara, > > Thanks for sharing the plan going forward. > > Inlined a couple of comments. > > 2018-07-30 7:01 GMT-07:00 Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>: >> Hi all, >> >> Over the past few months we’ve been doing work on the foundations for the >> next stages of GlobalISel development. In terms of changes from this time >> last year, the IR translator, the legalizer, and instruction selector have >> seen moderate to major changes. The most significant of these was the change >> to the legalizer API, allowing targets to use predicates to express >> legality, which gives more precise control over what forms of instructions >> are legal, and how to legalize them. This was necessary to implement support >> for the new extending loads and truncating stores, but also results in more >> concise and elegant expressions of legality for each target. For example, >> you can now apple a single definition to apply to multiples opcodes (G_ADD, >> G_SUB, G_MUL etc). >> >> The IR translator has been modified to split aggregates rather than handling >> them as one single large scalar. This change fixed some bugs and was >> necessary in order handle big endian code correctly in future. >> >> The tablegen instruction selector also saw significant improvements in >> performance, helping to keep overall compile time regression vs fastisel to >> be <5% geomean on CTMark. There are still a few outliers like sqlite3 which >> has a significant regression compared to FastISel, but most of the other >> benchmarks show little difference or even improvement. >> >> The tablegen importer has had improvements made to it, so that we can import >> more SelectionDAG selection rules. For example, currently on AArch64 we have >> about 40% of the rules being successfully imported. 
>> >> New additions from last year include the beginnings of a new combiner, >> although there’s still significant work to be done here in terms of the >> final design. The combiner will become a critical part of the pipeline in >> order to begin improving runtime performance. >> >> High levels goals >> >> Going forward, we plan to improve GlobalISel in a number of key areas to >> achieve the following targets: >> * Keeping compile time under control, ideally within 5% of FastISel, and >> when optimizations are enabled to maintain a compile time advantage of >> SelectionDAG. >> * Begin improving runtime performance by adding the most important >> optimizations required to be competitive at -Os. We will be targeting and >> measuring AArch64 for this goal but will endeavor to implement as many >> optimizations as possible in generic code to benefit other targets. >> * Improving overall stability and test coverage. Maintaining a high level >> of code quality and minimizing regressions in correctness and performance >> will be a significant challenge. >> * Ensure that the overall design meets the needs of general targets, not >> being overly tuned to a specific implementation. >> >> Design work planned >> >> These are some design changes coming in the near to medium term future: >> >> * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to >> handle different use cases. At the moment the opcode is too powerful, >> resulting in overly complex handling in places like the legalizer. G_MERGE >> will be split so that it only handles merging of scalars into one larger >> scalar. For other cases like merging scalars into a vector we will create a >> new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the >> opposite. For the current vector + vector case a new G_CONCAT_VECTOR will be >> introduced. With these changes it should simplify implementations for all >> targets. >> >> * Constant representation at the MI level needs some investigation. 
We >> currently represent constants as generic instructions, with each instance of >> a constant being largely independent of each other, being stored in the >> entry block except for a few places in IR translation where we emit at the >> point of use. As a result we run a localizer pass in an effort to reduce the >> live ranges of the constants (and the consequent spilling), using some >> heuristics to decide where to sink the constant definitions to. >> >> Since we don’t do any real caching of MI constants, multiple G_CONSTANT >> definitions can exist for the same constant. This can also result in a lot >> of redundant constants being created, especially for things like address >> computation. Reducing the number of constants can help reduce compile time >> and memory usage. Given this situation, one possible approach is to encode >> constants into the operands of the users, rather than have dedicated machine >> instructions. At instruction selection time the constant can then be >> materialized into a register or encoded as an immediate. Further >> investigation is needed to find the right way forward here. > > The initial design was to not have constants in machine operands. The > main rationale is that materializing a constant may be expensive, so we'd > better not sprinkle them around. > In practice, this indeed means that we need to keep a table of > all the constants created so that we don't start to duplicate them, > which we fail to do. That should be easy to fix with just a map > from virtual register + type to constant that could be kept at the function > level (or even module level). Better yet, this could be handled by the > IRBuilder. E.g., when instantiating an IRBuilder, it could scan the > function to see which constants are already there, build this > mapping, and then create only the constants that are missing.
> > Moreover, another advantage of this model is that optimizations > don't have to deal with two variants of the same instruction (ADDr and > ADDi), same for patterns. Alternatively, if we don't change the > opcode, but only the MachineOperands, then every optimization has to > deal with two different kinds of opcodes. Which is bad IMHO. > > Also, this design was meant to absorb what the constant hoisting pass > does on the IR, so that we can kill that pass while having a better > cost model. > > Finally, regarding the localizer pass, this was meant as a workaround > for the fast regalloc problem and should be killed asap. In > particular, the main motivation was to avoid code size bloat, but > AFAIR during our last measurements with Ahmed, we only saved a few % > on a handful of benchmarks, so maybe we can just kill it.

All valid points; however, we're now seeing more and more constants created, especially as part of address computation, e.g. GEP offsets. Without a solution to the regalloc problem, code size and compile time are taking a hit. IMO a few % is significant enough to warrant considerable effort. For example, a quick constant deduplication experiment I did saved around 1.5% in overall compile time alone. If the fast regalloc issue can be fixed without a significant compile time impact then I agree it sounds like the best approach, combined with early deduplication. It’s definitely something we’ll look into.

> >> >> * For optimizations to be supported, the combiner will become a crucial >> part of the GISel pipeline. We have already done some preliminary work in a >> generic combiner, which will be used to eventually support combines of >> extloads/truncstores. We’ve had discussions on and off list about what we >> need from the new combiner. The summary is that we want the combiner to be >> flexible for each target to select from a library of combines, being as >> efficient as possible.
The combines are currently written >> in C++, but one piece of investigation work we might do is to prototype >> using the same tablegen-driven instruction selector code to match >> declarative combine patterns written in tablegen. Regardless, we will need >> to support the custom C++ use case. >> >> * CSE throughout the pipeline. From a theoretical perspective, having a >> self-contained CSE pass that operates as a single phase in the pipeline is >> attractive for its simplicity and elegance. However, we know empirically >> that this is expensive in compile time. Not only does the CSE pass itself >> take a non-negligible time to run, but having it as a late pass can result >> in the non-CSE’d code from the IRTranslator onwards surviving for a long >> time, taking up time in analysis at each stage of compilation. We believe >> running a lightweight CSE early is a win. SelectionDAG currently does CSE >> by default when building the DAG, and this is something we could explore as >> part of a custom IRBuilder. > > I have been pushing for having the IRBuilder be smarter. Having > it do CSE was something I wanted us to try.

Yes, the IRBuilder is one of the candidates we’ll try for this.

> >> >> * Known bits computation. Some optimizations require the knowledge of which >> bits in a value are known to be 1 or 0, and do this by using the >> computeKnownBits() capability for SelectionDAG nodes. We will need some way >> of getting the same information. In an ideal scenario the replacement >> infrastructure for this will be more efficient, as this part of the codebase >> seems to be disproportionately responsible for pathological compile time >> regressions. >> >> * Load/store ordering needs some thought, as we currently don’t have a way >> to easily check at the MI level what the ordering requirements are on a set >> of memory operations. SelectionDAG uses the chains to ensure that they’re >> scheduled to respect the orderings.
How to achieve the same thing remains an >> open question for GlobalISel. > > I don't get this problem. > GISel has a sequential IR, the order is already modeled here. > Dominance should give us the relative ordering, then if we want more, > we should do exactly what we do at the IR level (alias analysis, etc.)

Sorry, I didn’t elaborate on this properly. The information is there, you’re right about that. The problem is that finding whether a load/store A precedes another in a block potentially requires a scan of every instruction, since they’re stored as ilists. What chains give as a side effect of the implementation is a way to walk through the dependent memory operations without doing a backwards scan of every instruction. A simple block-level cache might work here. As you say, if we want more information then something akin to MemorySSA might be useful one day for whole-CFG analysis.

> > Cheers, > -Quentin > >> >> * More extensive tests that exercise multiple stages of the pipeline. One >> advantage of using MIR with GISel is that individual passes can be easily >> tested by feeding the exact input expected for a particular pass, and >> checking the immediate output of the pass. However this approach can leave >> holes in the test coverage. To help mitigate this, we will be exploring >> writing/generating whole-pipeline tests, tracking some IR through each pass >> and checking how the MIR is mutated. We currently also have a proposed >> change to allow usage of FileCheck as a library, not just as a stand-alone >> tool. This would allow us to use FileCheck-style checks and improve testing >> of currently unused code paths. >> >> >> Roadmap for enabling optimizations >> >> I’ve filed a few PRs that people can follow or comment on to track the >> progress towards enabling the -Os optimization level.
The rough outline is: >> >> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks >> (Darwin) >> PR 38366 - GlobalISel: Lightweight CSE >> PR 32561 - GlobalISel: placement of constants in the entry-block and fast >> regalloc result in lots of reloaded constant >> PR 38367 - GlobalISel: Implement support for obtaining known bits >> information >> PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store >> orderings >> >> These, along with general design and implementation work on the combiner, >> will then lead onto a long road of performance analysis, inevitable bug >> fixing, and implementing more optimizations. >> >> If anyone is interested in discussing in more detail, feel free to reach out >> on the list, or to any of the GlobalISel developers. We’d especially like to >> hear about any issues or concerns about porting targets to GlobalISel. >> >> Thanks, >> Amara >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180731/6294ab28/attachment-0001.html>
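The constant deduplication idea discussed in this thread (a function-level map so that repeated G_CONSTANTs reuse an existing virtual register) can be sketched in isolation. This is a hypothetical, simplified model — `ConstantCache`, `VirtReg`, and `getOrCreate` are invented names, not GlobalISel API — intended only to illustrate the lookup a smarter builder could perform:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Stand-in for a MachineRegisterInfo virtual register number.
using VirtReg = unsigned;

// Hypothetical sketch: key each constant by (value, bit width) and hand back
// the virtual register that already defines it, creating a new one only on a
// miss. In the real pipeline the miss path would emit a G_CONSTANT.
class ConstantCache {
  std::map<std::pair<int64_t, unsigned>, VirtReg> Map; // (value, width) -> vreg
  VirtReg NextReg = 0;

public:
  // Returns the vreg for this constant; the bool reports whether a new
  // definition would have been emitted (i.e. this was a cache miss).
  std::pair<VirtReg, bool> getOrCreate(int64_t Value, unsigned BitWidth) {
    auto It = Map.find({Value, BitWidth});
    if (It != Map.end())
      return {It->second, false}; // reuse: no new G_CONSTANT
    VirtReg R = NextReg++;
    Map.insert({{Value, BitWidth}, R});
    return {R, true}; // miss: one G_CONSTANT would be emitted here
  }
};
```

Keyed this way, a second request for the same constant and type returns the same register with no new instruction, while the same value at a different width stays a distinct constant — which is the behavior the dedup experiment above measured.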
Aditya Nandakumar via llvm-dev
2018-Jul-31 15:32 UTC
[llvm-dev] GlobalISel design update and goals
Hi Tom,

This review has been sitting for a little over a year because of the lack of a test that custom-legalizes intrinsics: https://reviews.llvm.org/D31359

Aditya

> On Jul 31, 2018, at 7:53 AM, Tom Stellard via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > On 07/30/2018 07:01 AM, Amara Emerson via llvm-dev wrote: >> Hi all, >> > > Hi Amara, > > Thanks for the update. One question and then a few comments > below. > > Are there any plans to allow custom legalizing intrinsics? > It would be convenient to be able to lower these to generic > target instructions prior to register bank selection. > > >> Over the past few months we’ve been doing work on the foundations for the next stages of GlobalISel development. In terms of changes from this time last year, the IR translator, the legalizer, and instruction selector have seen moderate to major changes. The most significant of these was the change to the legalizer API, allowing targets to use predicates to express legality, which gives more precise control over what forms of instructions are legal, and how to legalize them. This was necessary to implement support for the new extending loads and truncating stores, but also results in more concise and elegant expressions of legality for each target. For example, you can now apply a single definition to multiple opcodes (G_ADD, G_SUB, G_MUL etc). >> >> The IR translator has been modified to split aggregates rather than handling them as one single large scalar. This change fixed some bugs and was necessary in order to handle big-endian code correctly in the future. >> >> The tablegen instruction selector also saw significant improvements in performance, helping to keep the overall compile time regression vs FastISel to <5% geomean on CTMark. There are still a few outliers like sqlite3 which has a significant regression compared to FastISel, but most of the other benchmarks show little difference or even improvement.
>> >> The tablegen importer has had improvements made to it, so that we can import more SelectionDAG selection rules. For example, currently on AArch64 we have about 40% of the rules being successfully imported. >> >> New additions from last year include the beginnings of a new combiner, although there’s still significant work to be done here in terms of the final design. The combiner will become a critical part of the pipeline in order to begin improving runtime performance. >> >> *High levels goals* >> >> Going forward, we plan to improve GlobalISel in a number of key areas to achieve the following targets: >> * Keeping compile time under control, ideally within 5% of FastISel, and when optimizations are enabled to maintain a compile time advantage of SelectionDAG. >> * Begin improving runtime performance by adding the most important optimizations required to be competitive at -Os. We will be targeting and measuring AArch64 for this goal but will endeavor to implement as many optimizations as possible in generic code to benefit other targets. >> * Improving overall stability and test coverage. Maintaining a high level of code quality and minimizing regressions in correctness and performance will be a significant challenge. >> * Ensure that the overall design meets the needs of general targets, not being overly tuned to a specific implementation. >> >> *Design work planned* >> >> These are some design changes coming in the near to medium term future: >> >> * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to handle different use cases. At the moment the opcode is too powerful, resulting in overly complex handling in places like the legalizer. G_MERGE will be split so that it only handles merging of scalars into one larger scalar. For other cases like merging scalars into a vector we will create a new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the opposite. 
For the current vector + vector case a new G_CONCAT_VECTOR will be introduced. With these changes it should simplify implementations for all targets. >> >> * Constant representation at the MI level needs some investigation. We currently represent constants as generic instructions, with each instance of a constant being largely independent of each other, being stored in the entry block except for a few places in IR translation where we emit at the point of use. As a result we run a localizer pass in an effort to reduce the live ranges of the constants (and the consequent spilling), using some heuristics to decide where to sink the constant definitions to. >> > > I think having the G_CONSTANT opcode is useful is some ways for > the AMDGPU target, because almost every operand for this target > can be encoded as an immediate so being able to assume that > all operands are registers simplifies the code a lot. > > However, one problem with G_CONSTANT on AMDGPU is that some > instructions have operands that must be encoded as an immediate. > This creates some ambiguity, because something like > G_CONSTANT i8 0 would normally be illegal, but if its only use in an > operand that must be an i8 immediate, then we must keep it legal. > > I'm also not sure what register bank to assign in this case, > because these values will never be stored in a register. > > So I think having a little more flexibility on being able > to encode operands as immediates would be good, but > I'm not sure I would really want to get rid of G_CONSTANT. > >> Since we don’t do any real caching of MI constants, multiple G_CONSTANT definitions can exist for the same constant. This can also result in a lot of redundant constants being created, especially for things like address computation. Reducing the number of constants can help reduce compile time and memory usage. 
Given this situation, one possible approach is to encode constants into the operands of the users, rather than have dedicated machine instructions. At instruction selection time the constant can then be materialized into a register or encoded as an immediate. Further investigation is needed to find the right way forward here. >> >> * For optimizations to be supported, the combiner will become a crucial part of the GISel pipeline. We have already done some preliminary work in a generic combiner, which will be used to eventually support combines of extloads/truncstores. We’ve had discussions on and off list about what we need from the new combiner. The summary is that we want the combiner to be flexible for each target to select from a library of combines, being as efficient as possible. The expression of the combines are currently written in C++, but one piece of investigation work we might do is to prototype using the same tablegen driven instruction selector code to match declarative combine patterns written in tablegen. Regardless, we will need to support the custom C++ use case. >> >> * CSE throughout the pipeline. From a theoretical perspective, having a self contained CSE pass that operates as a single phase in the pipeline is attractive for the simplicity and elegance. However, we know empirically that this is expensive in compile time. Not only does the CSE pass itself take a non-negligible time to run, but having it as a late pass can result in the non-CSE’d code from the IRTranslator onwards surviving for a long time, taking up time in analysis at each stage of compilation. We believe running a light weight CSE early is a win. SelectionDAG currently does CSE by default when building the DAG, and this is something we could explore as part of a custom IRBuilder. >> > > CSE also makes writing tests harder if it is constantly being > run like in the SelectionDAG. > > -Tom > >> * Known bits computation. 
Some optimizations require the knowledge of which bits in a value are known to be 1 or 0, and do this by using the computeKnownBits() capability for SelectionDAG nodes. We will need some way of getting the same information. In an ideal scenario the replacement infrastructure for this will be more efficient, as this part of the codebase seems to be disproportionately responsible for pathological compile time regressions. >> >> * Load/store ordering needs some thought, as we currently don’t have a way to easily check at the MI level what the ordering requirements are on a set of memory operations. SelectionDAG uses the chains to ensure that they’re scheduled to respect the orderings. How to achieve the same thing remains an open question for GlobalISel. >> >> * More extensive tests that exercise multiple stages of the pipeline. One advantage of using MIR with GISel is that individual passes can be easily tested by feeding the exact input expected for a particular pass, and checking the immediate output of the pass. However this approach can leave holes in the test coverage. To help mitigate this, we will be exploring writing/generating whole pipeline tests, tracking some IR through each pass and checking how the MIR is mutated. We currently also have a proposed change to allow usage of FileCheck as a library, not just as a stand-alone tool. This would allow us to use FileCheck style checks and Improve testing of currently unused code paths. >> >> >> *Roadmap for enabling optimizations* >> ** >> I’ve filed a few PRs that people can follow or comment on to track the progress towards enabling the -Os optimization level. 
The rough outline is: >> >> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks (Darwin) >> PR 38366 - GlobalISel: Lightweight CSE >> PR 32561 - GlobalISel: placement of constants in the entry-block and fast regalloc result in lots of reloaded constant >> PR 38367 - GlobalISel: Implement support for obtaining known bits information >> PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store orderings >> >> These, along with general design and implementation work on the combiner, will then lead onto a long road of performance analysis, inevitable bug fixing, and implementing more optimizations. >> >> If anyone is interested in discussing in more detail, feel free to reach out on the list, or to any of the GlobalISel developers. We’d especially like to hear about any issues or concerns about porting targets to GlobalISel. >> >> Thanks, >> Amara >> >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180731/43483335/attachment-0001.html>
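The predicate-based legality rules mentioned in the update ("apply a single definition to multiple opcodes") can be modeled in a standalone way. The real interface is LegalizerInfo / getActionDefinitionsBuilder inside LLVM; everything below is a simplified, self-contained stand-in (the names `LegalityTable`, `Rule`, and `Action` are invented) showing how one rule covering several opcodes, with a predicate on the type, selects an action:

```cpp
#include <functional>
#include <string>
#include <vector>

// Simplified stand-in for LegalizerInfo: not LLVM API.
enum class Action { Legal, NarrowScalar, Unsupported };

struct Rule {
  std::vector<std::string> Opcodes;    // e.g. {"G_ADD", "G_SUB", "G_MUL"}
  std::function<bool(unsigned)> Pred;  // predicate on the scalar bit width
  Action ActionIfMatch;
};

class LegalityTable {
  std::vector<Rule> Rules;

public:
  void add(Rule R) { Rules.push_back(std::move(R)); }

  // First matching rule wins, mirroring ordered rule definitions.
  Action query(const std::string &Opcode, unsigned BitWidth) const {
    for (const Rule &R : Rules)
      for (const std::string &Op : R.Opcodes)
        if (Op == Opcode && R.Pred(BitWidth))
          return R.ActionIfMatch;
    return Action::Unsupported;
  }
};
```

A single definition then covers G_ADD, G_SUB, and G_MUL at once: one rule marks 32- and 64-bit scalars legal, and a second narrows anything wider, without repeating the opcode list per action.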
Daniel Sanders via llvm-dev
2018-Jul-31 15:40 UTC
[llvm-dev] GlobalISel design update and goals
> On 30 Jul 2018, at 16:04, Quentin Colombet via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hi Amara, > > Thanks for sharing the plan going forward. > > Inlined a couple of comments. > > 2018-07-30 7:01 GMT-07:00 Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>: >> Hi all, >> >> Over the past few months we’ve been doing work on the foundations for the >> next stages of GlobalISel development. In terms of changes from this time >> last year, the IR translator, the legalizer, and instruction selector have >> seen moderate to major changes. The most significant of these was the change >> to the legalizer API, allowing targets to use predicates to express >> legality, which gives more precise control over what forms of instructions >> are legal, and how to legalize them. This was necessary to implement support >> for the new extending loads and truncating stores, but also results in more >> concise and elegant expressions of legality for each target. For example, >> you can now apple a single definition to apply to multiples opcodes (G_ADD, >> G_SUB, G_MUL etc). >> >> The IR translator has been modified to split aggregates rather than handling >> them as one single large scalar. This change fixed some bugs and was >> necessary in order handle big endian code correctly in future. >> >> The tablegen instruction selector also saw significant improvements in >> performance, helping to keep overall compile time regression vs fastisel to >> be <5% geomean on CTMark. There are still a few outliers like sqlite3 which >> has a significant regression compared to FastISel, but most of the other >> benchmarks show little difference or even improvement. >> >> The tablegen importer has had improvements made to it, so that we can import >> more SelectionDAG selection rules. For example, currently on AArch64 we have >> about 40% of the rules being successfully imported. 
>> >> New additions from last year include the beginnings of a new combiner, >> although there’s still significant work to be done here in terms of the >> final design. The combiner will become a critical part of the pipeline in >> order to begin improving runtime performance. >> >> High levels goals >> >> Going forward, we plan to improve GlobalISel in a number of key areas to >> achieve the following targets: >> * Keeping compile time under control, ideally within 5% of FastISel, and >> when optimizations are enabled to maintain a compile time advantage of >> SelectionDAG. >> * Begin improving runtime performance by adding the most important >> optimizations required to be competitive at -Os. We will be targeting and >> measuring AArch64 for this goal but will endeavor to implement as many >> optimizations as possible in generic code to benefit other targets. >> * Improving overall stability and test coverage. Maintaining a high level >> of code quality and minimizing regressions in correctness and performance >> will be a significant challenge. >> * Ensure that the overall design meets the needs of general targets, not >> being overly tuned to a specific implementation. >> >> Design work planned >> >> These are some design changes coming in the near to medium term future: >> >> * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to >> handle different use cases. At the moment the opcode is too powerful, >> resulting in overly complex handling in places like the legalizer. G_MERGE >> will be split so that it only handles merging of scalars into one larger >> scalar. For other cases like merging scalars into a vector we will create a >> new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the >> opposite. For the current vector + vector case a new G_CONCAT_VECTOR will be >> introduced. With these changes it should simplify implementations for all >> targets. >> >> * Constant representation at the MI level needs some investigation. 
We >> currently represent constants as generic instructions, with each instance of >> a constant being largely independent of each other, being stored in the >> entry block except for a few places in IR translation where we emit at the >> point of use. As a result we run a localizer pass in an effort to reduce the >> live ranges of the constants (and the consequent spilling), using some >> heuristics to decide where to sink the constant definitions to. >> >> Since we don’t do any real caching of MI constants, multiple G_CONSTANT >> definitions can exist for the same constant. This can also result in a lot >> of redundant constants being created, especially for things like address >> computation. Reducing the number of constants can help reduce compile time >> and memory usage. Given this situation, one possible approach is to encode >> constants into the operands of the users, rather than have dedicated machine >> instructions. At instruction selection time the constant can then be >> materialized into a register or encoded as an immediate. Further >> investigation is needed to find the right way forward here. > > The initial design was to not have constant in machine operands. The > main rational is materializing a constant may be expensive, so we > better not sprinkle it around. > Now, in practice, this indeed means that we need to keep a table of > all the constants created so that we don't start to duplicate them, > which we fail to do. That should be easy to fix that just a map > virtual register + type to constant that could be kept at the function > level (or even module level). Better yet, this could be handled by the > IRBuilder. E.g., when instantiating a IR builder, it could scan the > function to see which constants are already there and build this > mapping and then, create only the constants that are missing. 
> > Moreover, another the advantage of this model is that optimizations > don't have to deal with two variants of the same instruction (ADDr and > ADDi), same for patterns. Alternatively, if we don't change the > opcode, but only the MachineOperands, then every optimization has to > deal with two different kind of opcodes. Which is bad IMHO. > > Also, this design was meant to absorb what the constant hoisting pass > does on the IR, so that we can kill that pass while having a better > cost model. > > Finally, regarding the localizer pass, this was meant as a workaround > for the fast regalloc problem and should be killed asap. In > particular, the main motivation was to avoid code size bloat, but > AFAIR during our last measurements with Ahmed, we only saved a few % > on a handful of benchmarks, so maybe we can just kill it. > >> >> * For optimizations to be supported, the combiner will become a crucial >> part of the GISel pipeline. We have already done some preliminary work in a >> generic combiner, which will be used to eventually support combines of >> extloads/truncstores. We’ve had discussions on and off list about what we >> need from the new combiner. The summary is that we want the combiner to be >> flexible for each target to select from a library of combines, being as >> efficient as possible. The expression of the combines are currently written >> in C++, but one piece of investigation work we might do is to prototype >> using the same tablegen driven instruction selector code to match >> declarative combine patterns written in tablegen. Regardless, we will need >> to support the custom C++ use case. >> >> * CSE throughout the pipeline. From a theoretical perspective, having a >> self contained CSE pass that operates as a single phase in the pipeline is >> attractive for the simplicity and elegance. However, we know empirically >> that this is expensive in compile time. 
Not only does the CSE pass itself >> take a non-negligible time to run, but having it as a late pass can result >> in the non-CSE’d code from the IRTranslator onwards surviving for a long >> time, taking up time in analysis at each stage of compilation. We believe >> running a lightweight CSE early is a win. SelectionDAG currently does CSE >> by default when building the DAG, and this is something we could explore as >> part of a custom IRBuilder. > > I have been pushing for having the IRBuilder be smarter. Having > it do CSE was something I wanted us to try. > >> >> * Known bits computation. Some optimizations require the knowledge of which >> bits in a value are known to be 1 or 0, and do this by using the >> computeKnownBits() capability for SelectionDAG nodes. We will need some way >> of getting the same information. In an ideal scenario the replacement >> infrastructure for this will be more efficient, as this part of the codebase >> seems to be disproportionately responsible for pathological compile time >> regressions. >> >> * Load/store ordering needs some thought, as we currently don’t have a way >> to easily check at the MI level what the ordering requirements are on a set >> of memory operations. SelectionDAG uses the chains to ensure that they’re >> scheduled to respect the orderings. How to achieve the same thing remains an >> open question for GlobalISel. > > I don't get this problem. > GISel has a sequential IR, the order is already modeled here.

%t1 = load %addr1 :: (load 4 from @a)
%t2 = load %addr2 :: (load 4 from @a + 4)
store %t1 %addr3 :: (store 4 to @b)
store %t2 %addr4 :: (store 4 to @b + 4)

MIR specifies the order: load @a -> load @a+4 -> store @b -> store @b+4
but the dependencies are actually:
load @a -> store @b
load @a+4 -> store @b+4

We only have a representation of the former at the moment and have to re-calculate the latter each time we want to use that information to sink/hoist/fold/etc. instructions.
We need to build a means of preserving that dependency graph within a pass and between passes.> Dominance should give us the relative ordering, then if we want more, > we should do exactly what we do at the IR level (alias analysis, etc.) > > Cheers, > -Quentin >> * More extensive tests that exercise multiple stages of the pipeline. One >> advantage of using MIR with GISel is that individual passes can be easily >> tested by feeding the exact input expected for a particular pass, and >> checking the immediate output of the pass. However this approach can leave >> holes in the test coverage. To help mitigate this, we will be exploring >> writing/generating whole pipeline tests, tracking some IR through each pass >> and checking how the MIR is mutated. We currently also have a proposed >> change to allow usage of FileCheck as a library, not just as a stand-alone >> tool. This would allow us to use FileCheck style checks and Improve testing >> of currently unused code paths. >> >> >> Roadmap for enabling optimizations >> >> I’ve filed a few PRs that people can follow or comment on to track the >> progress towards enabling the -Os optimization level. The rough outline is: >> >> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks >> (Darwin) >> PR 38366 - GlobalISel: Lightweight CSE >> PR 32561 - GlobalISel: placement of constants in the entry-block and fast >> regalloc result in lots of reloaded constant >> PR 38367 - GlobalISel: Implement support for obtaining known bits >> information >> PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store >> orderings >> >> These, along with general design and implementation work on the combiner, >> will then lead onto a long road of performance analysis, inevitable bug >> fixing, and implementing more optimizations. >> >> If anyone is interested in discussing in more detail, feel free to reach out >> on the list, or to any of the GlobalISel developers. 
We’d especially like to >> hear about any issues or concerns about porting targets to GlobalISel. >> >> Thanks, >> Amara >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180731/e482f555/attachment.html>
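The "simple block-level cache" floated in this thread can be sketched without any LLVM machinery: number the memory operations of a block once in program order, so that "does A precede B?" becomes an O(1) index comparison instead of a backwards scan of the instruction ilist, and the stores an op must stay ordered against can be enumerated directly. `MemOp` and `BlockMemOrder` are invented names for illustration, not GlobalISel API, and the "every earlier store" rule is deliberately conservative (no alias analysis):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for a MachineInstr* performing a memory access.
struct MemOp {
  std::string Name;
  bool IsStore;
};

// Hypothetical block-level cache: one pass over the block records program
// order; subsequent ordering queries never rescan the instruction list.
class BlockMemOrder {
  std::vector<MemOp> Ops;                        // program order
  std::unordered_map<std::string, size_t> Index; // op -> position

public:
  void append(MemOp Op) {
    Index[Op.Name] = Ops.size();
    Ops.push_back(std::move(Op));
  }

  // True if A comes before B in this block; O(1) once the cache is built.
  bool precedes(const std::string &A, const std::string &B) const {
    return Index.at(A) < Index.at(B);
  }

  // Conservatively, the stores a given op must not be hoisted above:
  // every store recorded earlier in the block.
  std::vector<std::string> storesBefore(const std::string &Op) const {
    std::vector<std::string> Out;
    for (size_t I = 0, E = Index.at(Op); I != E; ++I)
      if (Ops[I].IsStore)
        Out.push_back(Ops[I].Name);
    return Out;
  }
};
```

On Daniel's example block (load @a, load @a+4, store @b, store @b+4), the cache answers precedence queries directly and reports that neither load has any earlier store constraining it, which is the property a sink/hoist combine would want to check cheaply.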
Daniel Sanders via llvm-dev
2018-Jul-31 16:03 UTC
[llvm-dev] GlobalISel design update and goals
> On 31 Jul 2018, at 07:53, Tom Stellard via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi Amara,
>
> Thanks for the update. One question and then a few comments below.
>
> Are there any plans to allow custom legalizing intrinsics?
> It would be convenient to be able to lower these to generic
> target instructions prior to register bank selection.

In addition to the legalization of intrinsics that Aditya mentioned, we've also been talking about the possibility of having target-specific IRTranslation for intrinsics. That would allow us to go straight to a generic target instruction rather than having to create a temporary G_INTRINSIC_* first. I don't think anyone has had a chance to look into it yet, though.
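The predicate-based legality API described in Amara's original mail can be modeled with a toy rule table. This is deliberately not LLVM's LegalizerInfo; all names below are invented for illustration. The point is that one rule definition covers several opcodes at once, as with G_ADD/G_SUB/G_MUL in the real API:

```cpp
#include <cassert>
#include <functional>
#include <initializer_list>
#include <map>
#include <vector>

// Toy stand-ins for generic opcodes and legalization actions.
enum Opcode { G_ADD, G_SUB, G_MUL };
enum class Action { Legal, WidenScalar, Lower };

// A legality rule is a predicate over the scalar bit width plus the action
// to take; the first matching rule wins, mirroring the rule-list style.
struct Rule {
  std::function<bool(unsigned)> Pred;
  Action Act;
};

struct LegalityTable {
  std::map<Opcode, std::vector<Rule>> Rules;

  // One definition applied to several opcodes at once.
  void addForOpcodes(std::initializer_list<Opcode> Ops, Rule R) {
    for (Opcode Op : Ops)
      Rules[Op].push_back(R);
  }

  Action getAction(Opcode Op, unsigned Bits) const {
    for (const Rule &R : Rules.at(Op))
      if (R.Pred(Bits))
        return R.Act;
    return Action::Lower; // fall back if nothing matched
  }
};

LegalityTable makeToyAArch64Like() {
  LegalityTable T;
  // Legal for 32- and 64-bit scalars; widen anything narrower.
  T.addForOpcodes({G_ADD, G_SUB, G_MUL},
                  {[](unsigned B) { return B == 32 || B == 64; }, Action::Legal});
  T.addForOpcodes({G_ADD, G_SUB, G_MUL},
                  {[](unsigned B) { return B < 32; }, Action::WidenScalar});
  return T;
}
```

The predicate form is what lets a target express things like "legal only for these type combinations" without enumerating every case.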
>> * Constant representation at the MI level needs some investigation. [...]
>
> I think having the G_CONSTANT opcode is useful in some ways for
> the AMDGPU target, because almost every operand for this target
> can be encoded as an immediate, so being able to assume that
> all operands are registers simplifies the code a lot.
>
> However, one problem with G_CONSTANT on AMDGPU is that some
> instructions have operands that must be encoded as an immediate.
> This creates some ambiguity, because something like
> G_CONSTANT i8 0 would normally be illegal, but if its only use is in an
> operand that must be an i8 immediate, then we must keep it legal.
>
> I'm also not sure what register bank to assign in this case,
> because these values will never be stored in a register.
>
> So I think having a little more flexibility on being able
> to encode operands as immediates would be good, but
> I'm not sure I would really want to get rid of G_CONSTANT.
>> * CSE throughout the pipeline. [...] We believe running a lightweight CSE early is a win. [...]
>
> CSE also makes writing tests harder if it is constantly being
> run like in the SelectionDAG.
>
> -Tom
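The computeKnownBits() replacement discussed in the thread boils down to tracking, per value, which bits are provably 0 or 1. A toy version for constants and AND/OR (invented names, not LLVM's KnownBits class, though that class follows the same model):

```cpp
#include <cassert>
#include <cstdint>

// Toy known-bits lattice: each bit may be known-zero, known-one, or unknown
// (neither bit set in Zero nor One).
struct ToyKnownBits {
  uint64_t Zero = 0; // bits proven to be 0
  uint64_t One = 0;  // bits proven to be 1
};

// Every bit of a constant is known.
ToyKnownBits fromConstant(uint64_t C) { return {~C, C}; }

// AND: a result bit is 0 if it is 0 on either side,
// and 1 only if it is 1 on both sides.
ToyKnownBits knownAnd(ToyKnownBits A, ToyKnownBits B) {
  return {A.Zero | B.Zero, A.One & B.One};
}

// OR: the dual of AND.
ToyKnownBits knownOr(ToyKnownBits A, ToyKnownBits B) {
  return {A.Zero & B.Zero, A.One | B.One};
}
```

The compile-time concern in the thread is about how often and how deeply such recursive queries walk the use-def graph, not about the transfer functions themselves.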
Amara Emerson via llvm-dev
2018-Jul-31 17:48 UTC
[llvm-dev] GlobalISel design update and goals
Hi Tom,

Thanks for the feedback.

> On Jul 31, 2018, at 3:53 PM, Tom Stellard <tstellar at redhat.com> wrote:
>
> Hi Amara,
>
> Thanks for the update. One question and then a few comments below.
>
> Are there any plans to allow custom legalizing intrinsics?
> It would be convenient to be able to lower these to generic
> target instructions prior to register bank selection.
> However, one problem with G_CONSTANT on AMDGPU is that some
> instructions have operands that must be encoded as an immediate.
> This creates some ambiguity, because something like
> G_CONSTANT i8 0 would normally be illegal, but if its only use is in an
> operand that must be an i8 immediate, then we must keep it legal.

This is interesting, as it's in conflict with our principle for the legalizer of only looking at instruction-local information. I think it warrants further discussion in its own thread (like most of these topics, eventually). Off the top of my head, I wonder if, in the absence of encoding immediates as machine operands, we should make an exception for G_CONSTANTs: allow them to be treated as "paired" with their user, and disallow multiple uses of them until they're selected away with their single user at ISel. Or perhaps introduce a TargetConstant opcode that doesn't require legalization.

> I'm also not sure what register bank to assign in this case,
> because these values will never be stored in a register.
>
> So I think having a little more flexibility on being able
> to encode operands as immediates would be good, but
> I'm not sure I would really want to get rid of G_CONSTANT.
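A toy of the encode-constants-into-operands idea mentioned in the original mail (invented types; the real decision would live in instruction selection): an operand holds either a virtual register or an inline immediate, and a constant is only materialized into a register when the user cannot encode it.

```cpp
#include <cassert>
#include <cstdint>

// Toy operand: either a virtual register or an inline immediate.
struct ToyOperand {
  bool IsImm;
  int64_t Imm = 0;   // valid if IsImm
  unsigned Reg = 0;  // valid if !IsImm
};

// Stand-in for a target hook: can this user encode C directly?
// Here: a 12-bit unsigned immediate, roughly AArch64-add-like.
bool canEncodeImm(int64_t C) { return C >= 0 && C < 4096; }

// At selection time, fold the constant into the operand if possible,
// otherwise "materialize" it into a fresh virtual register.
ToyOperand selectConstantUse(int64_t C, unsigned &NextVReg) {
  if (canEncodeImm(C))
    return {true, C, 0};
  return {false, 0, NextVReg++};
}
```

Under such a scheme the "must be an immediate" operands Tom describes stop being a legality question at all, since the constant never exists as a separate instruction.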
>> * CSE throughout the pipeline. [...] SelectionDAG currently does CSE by default when building the DAG, and this is something we could explore as part of a custom IRBuilder.
>
> CSE also makes writing tests harder if it is constantly being
> run like in the SelectionDAG.

We should be able to disable this for tests if needed, although it does mean we're not testing exactly the same thing any more.

> -Tom
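The custom-IRBuilder CSE idea amounts to value-numbering instructions as they are created and returning an existing definition on a hit. A toy sketch (invented names; the real version would key on a MachineInstr's opcode, types, and operands):

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Toy CSE-ing builder: each "instruction" is (opcode, lhs, rhs), producing a
// value number. Building the same expression twice yields the same value.
struct ToyCSEBuilder {
  using Key = std::tuple<int, unsigned, unsigned>;
  std::map<Key, unsigned> Seen;
  unsigned NextVal = 0;

  unsigned build(int Opcode, unsigned LHS, unsigned RHS) {
    Key K{Opcode, LHS, RHS};
    auto It = Seen.find(K);
    if (It != Seen.end())
      return It->second;        // CSE hit: reuse the existing definition
    return Seen[K] = NextVal++; // miss: emit a new definition
  }
};
```

Doing this at build time means the redundant code never exists at all, instead of a late pass cleaning it up after every earlier stage has paid to analyze it.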
Leslie Zhai via llvm-dev
2018-Aug-03 02:46 UTC
[llvm-dev] GlobalISel design update and goals
Hi Amara,

Thanks for your great work!

MIPS, RISC-V and other targets have refactoring requirements:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html

Please give us some suggestions for supporting a custom CCState and CCAssignFn in D41700, and also RegisterBank in D41653, because they need to take into account how to support the variable-sized register classes concept implemented in D24631.

I am building the Linux kernel and OpenJDK8 with the LLVM toolchain for mips64el:

http://lists.llvm.org/pipermail/llvm-dev/2018-July/124620.html

http://lists.llvm.org/pipermail/llvm-dev/2018-July/124717.html

And migrating to GlobalISel and the Machine Scheduler for LoongISA:
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123608.html

My sincere thanks go to the LLVM, Linux kernel and OpenJDK developers who have taught me a lot!

On 2018-07-30 22:01, Amara Emerson via llvm-dev wrote:

> Hi all,
>
> [...]
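For readers unfamiliar with the CCAssignFn hooks referenced by D41700: conceptually, such a function assigns each call argument either a register or a stack slot. A toy model (made-up convention, not any real target's ABI):

```cpp
#include <cassert>
#include <vector>

// Where one argument lives: a register index or a stack offset.
struct ArgLoc {
  bool InReg;
  unsigned RegOrOffset;
};

// Toy calling-convention assignment in the spirit of CCAssignFn:
// the first 4 arguments go in registers R0-R3, the rest get
// 8-byte stack slots in order.
std::vector<ArgLoc> assignArgs(unsigned NumArgs) {
  std::vector<ArgLoc> Locs;
  unsigned StackOffset = 0;
  for (unsigned I = 0; I < NumArgs; ++I) {
    if (I < 4) {
      Locs.push_back({true, I}); // R0..R3
    } else {
      Locs.push_back({false, StackOffset});
      StackOffset += 8;
    }
  }
  return Locs;
}
```

The question raised in D41700/D41653 is how much of this machinery GlobalISel's CallLowering should let targets customize rather than inherit from the SelectionDAG-era tables.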
Thanks,
Amara

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 
Regards,
Leslie Zhai
Leslie Zhai via llvm-dev
2018-Aug-03 03:56 UTC
[llvm-dev] GlobalISel design update and goals
Hi LLVM and HotSpot developers,

I have just spent 4+ months porting HotSpot C1 to MIPS, and my sincere thanks go to the HotSpot developers who taught me a lot! As an apprentice in the compiler world, I have some questions:

* Is there no instruction selection concept in HotSpot C1 equivalent to LLVM's SelectionDAG or GlobalISel? I manually write assembly[1] when lowering HIR to LIR in HotSpot C1. So which is better: LLVM's automated selection, or HotSpot's selection by hand?

* Why not use a greedy allocator, like LLVM's RegAllocGreedy, in place of Linear Scan[2] for HotSpot C1's register allocation?

Please teach me, thanks a lot!

1. http://hg.loongnix.org/jdk8-mips64-public/hotspot/file/tip/src/cpu/mips/vm/c1_LIRAssembler_mips.cpp#l1542
2. http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2018-March/028545.html

On 2018-08-03 10:46, Leslie Zhai wrote:
> Hi Amara,
>
> Thanks for your great work!
>
> MIPS, RISCV and other targets have refactoring requirements:
> http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html
>
> Please give us some suggestions for supporting custom CCState and
> CCAssignFn in D41700, and also RegisterBank in D41653, because we
> need to consider how to support the variable-sized register classes
> concept implemented in D24631.
>
> I am building the Linux kernel and OpenJDK8 with the LLVM toolchain
> for mips64el:
>
> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124620.html
> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124717.html
>
> And migrating to GlobalISel and the Machine Scheduler for LoongISA:
> http://lists.llvm.org/pipermail/llvm-dev/2018-May/123608.html
>
> My sincere thanks go to the LLVM, Linux kernel and OpenJDK developers
> who have taught me a lot!
>
> On 2018-07-30 22:01, Amara Emerson via llvm-dev wrote:
>> Hi all,
>>
>> Over the past few months we’ve been doing work on the foundations
>> for the next stages of GlobalISel development.
>> [snip]

-- 
Regards,
Leslie Zhai