thr3ads.net - llvm dev - [llvm-dev] GlobalISel design update and goals [Aug 2018]

If this information is useful, please help other people find it:
Share via:

Amara Emerson via llvm-dev

2018-Aug-04 12:49 UTC

[llvm-dev] GlobalISel design update and goals

Hi Leslie,

I haven’t had a deep look at those patches yet, but if you need a custom CCState
in assignArgs then you can modify the function to take a custom CCState object.
Or perhaps better to take a (possibly null) callback function that will
construct a CCState.

On the variable width vectors side, we haven’t considered that use case much.
Graham posted a patch to add scalable LLTs in https://reviews.llvm.org/D47769 so
I would first check that the codegen requirements of RISC-V and SVE are aligned
there. The variable vector type discussion is still ongoing, so I expect things
to become clearer in the near future.

Cheers,
Amara
> On Aug 3, 2018, at 3:46 AM, Leslie Zhai <lesliezhai at llvm.org.cn>
wrote:
> 
> Hi Amara,
> 
> Thanks for your great job!
> 
> MIPS, RISCV and other targets have refactory requirement
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html
> 
> Please give us some suggestion for supporting custom CCState, CCAssignFn in
D41700. And also RegisterBank in D41653. because it needs to consider about how
to support variable-sized register classes concept implemented in D24631.
> 
> I am building Linux Kernel and OpenJDK8 with LLVM toolchain for mips64el:
> 
> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124620.html
> 
> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124717.html
> 
> And migrate to GlobalISel and Machine Scheduler for LoongISA
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123608.html
> 
> My sincere thanks will goto LLVM, Linux Kernel and OpenJDK developers who
teach me a lot!
> 
> 
> 在 2018年07月30日 22:01, Amara Emerson via llvm-dev 写道:
>> Hi all,
>> 
>> Over the past few months we’ve been doing work on the foundations for
the next stages of GlobalISel development. In terms of changes from this time
last year, the IR translator, the legalizer, and instruction selector have seen
moderate to major changes. The most significant of these was the change to the
legalizer API, allowing targets to use predicates to express legality, which
gives more precise control over what forms of instructions are legal, and how to
legalize them. This was necessary to implement support for the new extending
loads and truncating stores, but also results in more concise and elegant
expressions of legality for each target. For example, you can now apple a single
definition to apply to multiples opcodes (G_ADD, G_SUB, G_MUL etc).
>> 
>> The IR translator has been modified to split aggregates rather than
handling them as one single large scalar. This change fixed some bugs and was
necessary in order handle big endian code correctly in future.
>> 
>> The tablegen instruction selector also saw significant improvements in
performance, helping to keep overall compile time regression vs fastisel to be
<5% geomean on CTMark. There are still a few outliers like sqlite3 which has
a significant regression compared to FastISel, but most of the other benchmarks
show little difference or even improvement.
>> 
>> The tablegen importer has had improvements made to it, so that we can
import more SelectionDAG selection rules. For example, currently on AArch64 we
have about 40% of the rules being successfully imported.
>> 
>> New additions from last year include the beginnings of a new combiner,
although there’s still significant work to be done here in terms of the final
design. The combiner will become a critical part of the pipeline in order to
begin improving runtime performance.
>> 
>> *High levels goals*
>> 
>> Going forward, we plan to improve GlobalISel in a number of key areas
to achieve the following targets:
>> * Keeping compile time under control, ideally within 5% of FastISel,
and when optimizations are enabled to maintain a compile time advantage of
SelectionDAG.
>> * Begin improving runtime performance by adding the most important
optimizations required to be competitive at -Os. We will be targeting and
measuring AArch64 for this goal but will endeavor to implement as many
optimizations as possible in generic code to benefit other targets.
>> * Improving overall stability and test coverage. Maintaining a high
level of code quality and minimizing regressions in correctness and performance
will be a significant challenge.
>> * Ensure that the overall design meets the needs of general targets,
not being overly tuned to a specific implementation.
>> 
>> *Design work planned*
>> 
>> These are some design changes coming in the near to medium term future:
>> 
>> * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes
to handle different use cases. At the moment the opcode is too powerful,
resulting in overly complex handling in places like the legalizer. G_MERGE will
be split so that it only handles merging of scalars into one larger scalar. For
other cases like merging scalars into a vector we will create a new
G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the opposite. For
the current vector + vector case a new G_CONCAT_VECTOR will be introduced. With
these changes it should simplify implementations for all targets.
>> 
>> * Constant representation at the MI level needs some investigation. We
currently represent constants as generic instructions, with each instance of a
constant being largely independent of each other, being stored in the entry
block except for a few places in IR translation where we emit at the point of
use. As a result we run a localizer pass in an effort to reduce the live ranges
of the constants (and the consequent spilling), using some heuristics to decide
where to sink the constant definitions to.
>> 
>> Since we don’t do any real caching of MI constants, multiple G_CONSTANT
definitions can exist for the same constant. This can also result in a lot of
redundant constants being created, especially for things like address
computation. Reducing the number of constants can help reduce compile time and
memory usage. Given this situation, one possible approach is to encode constants
into the operands of the users, rather than have dedicated machine instructions.
At instruction selection time the constant can then be materialized into a
register or encoded as an immediate. Further investigation is needed to find the
right way forward here.
>> 
>> * For optimizations to be supported, the combiner will become a crucial
part of the GISel pipeline. We have already done some preliminary work in a
generic combiner, which will be used to eventually support combines of
extloads/truncstores. We’ve had discussions on and off list about what we need
from the new combiner. The summary is that we want the combiner to be flexible
for each target to select from a library of combines, being as efficient as
possible. The expression of the combines are currently written in C++, but one
piece of investigation work we might do is to prototype using the same tablegen
driven instruction selector code to match declarative combine patterns written
in tablegen. Regardless, we will need to support the custom C++ use case.
>> 
>> * CSE throughout the pipeline. From a theoretical perspective, having a
self contained CSE pass that operates as a single phase in the pipeline is
attractive for the simplicity and elegance. However, we know empirically that
this is expensive in compile time. Not only does the CSE pass itself take a
non-negligible time to run, but having it as a late pass can result in the
non-CSE’d code from the IRTranslator onwards surviving for a long time, taking
up time in analysis at each stage of compilation. We believe running a light
weight CSE early is a win. SelectionDAG currently does CSE by default when
building the DAG, and this is something we could explore as part of a custom
IRBuilder.
>> 
>> * Known bits computation. Some optimizations require the knowledge of
which bits in a value are known to be 1 or 0, and do this by using the
computeKnownBits() capability for SelectionDAG nodes. We will need some way of
getting the same information. In an ideal scenario the replacement
infrastructure for this will be more efficient, as this part of the codebase
seems to be disproportionately responsible for pathological compile time
regressions.
>> 
>> * Load/store ordering needs some thought, as we currently don’t have a
way to easily check at the MI level what the ordering requirements are on a set
of memory operations. SelectionDAG uses the chains to ensure that they’re
scheduled to respect the orderings. How to achieve the same thing remains an
open question for GlobalISel.
>> 
>> * More extensive tests that exercise multiple stages of the pipeline.
One advantage of using MIR with GISel is that individual passes can be easily
tested by feeding the exact input expected for a particular pass, and checking
the immediate output of the pass. However this approach can leave holes in the
test coverage. To help mitigate this, we will be exploring writing/generating
whole pipeline tests, tracking some IR through each pass and checking how the
MIR is mutated. We currently also have a proposed change to allow usage of
FileCheck as a library, not just as a stand-alone tool. This would allow us to
use FileCheck style checks and Improve testing of currently unused code paths.
>> 
>> 
>> *Roadmap for enabling optimizations*
>> 
>> I’ve filed a few PRs that people can follow or comment on to track the
progress towards enabling the -Os optimization level. The rough outline is:
>> 
>> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or
benchmarks (Darwin)
>> PR 38366 - GlobalISel: Lightweight CSE
>> PR 32561 - GlobalISel: placement of constants in the entry-block and
fast regalloc result in lots of reloaded constant
>> PR 38367 - GlobalISel: Implement support for obtaining known bits
information
>> PR 38368 - GlobalISel: Investigate an efficient way to ensure
load/store orderings
>> 
>> These, along with general design and implementation work on the
combiner, will then lead onto a long road of performance analysis, inevitable
bug fixing, and implementing more optimizations.
>> 
>> If anyone is interested in discussing in more detail, feel free to
reach out on the list, or to any of the GlobalISel developers. We’d especially
like to hear about any issues or concerns about porting targets to GlobalISel.
>> 
>> Thanks,
>> Amara
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> -- 
> Regards,
> Leslie Zhai
> 
> 
>

Leslie Zhai via llvm-dev

2018-Aug-06 02:35 UTC

head link

[llvm-dev] GlobalISel design update and goals

Hi Amara,

Thanks for your response!

Alex lead me porting GlobalISel to RISCV target and he reviewed my 
patches, my sincere thanks will goto LLVM backend developers who taught 
me! But we also need other developers to review the shared *dependent* 
component, for example, Refactory CallLowering handleAssignments for 
custom <Target>CCState[1]

If not approved , it is not easy to port GlobalISel for MIPS and RISCV, 
perhaps GlobalISel is not mature for other targets. So I am just porting 
LoongISA[2] to SelectionDAG firstly, BTW learning Machine Scheduler.

1. https://reviews.llvm.org/D41774

2. 
https://www.researchgate.net/publication/276487823_LoongISA_for_compatibility_with_mainstream_instruction_set_architecture


在 2018年08月04日 20:49, Amara Emerson 写道:> Hi Leslie,
>
> I haven’t had a deep look at those patches yet, but if you need a custom
CCState in assignArgs then you can modify the function to take a custom CCState
object. Or perhaps better to take a (possibly null) callback function that will
construct a CCState.
>
> On the variable width vectors side, we haven’t considered that use case
much. Graham posted a patch to add scalable LLTs in
https://reviews.llvm.org/D47769 so I would first check that the codegen
requirements of RISC-V and SVE are aligned there. The variable vector type
discussion is still ongoing, so I expect things to become clearer in the near
future.
>
> Cheers,
> Amara
>
>> On Aug 3, 2018, at 3:46 AM, Leslie Zhai <lesliezhai at
llvm.org.cn> wrote:
>>
>> Hi Amara,
>>
>> Thanks for your great job!
>>
>> MIPS, RISCV and other targets have refactory requirement
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html
>>
>> Please give us some suggestion for supporting custom CCState,
CCAssignFn in D41700. And also RegisterBank in D41653. because it needs to
consider about how to support variable-sized register classes concept
implemented in D24631.
>>
>> I am building Linux Kernel and OpenJDK8 with LLVM toolchain for
mips64el:
>>
>> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124620.html
>>
>> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124717.html
>>
>> And migrate to GlobalISel and Machine Scheduler for LoongISA
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123608.html
>>
>> My sincere thanks will goto LLVM, Linux Kernel and OpenJDK developers
who teach me a lot!
>>
>>
>> 在 2018年07月30日 22:01, Amara Emerson via llvm-dev 写道:
>>> Hi all,
>>>
>>> Over the past few months we’ve been doing work on the foundations
for the next stages of GlobalISel development. In terms of changes from this
time last year, the IR translator, the legalizer, and instruction selector have
seen moderate to major changes. The most significant of these was the change to
the legalizer API, allowing targets to use predicates to express legality, which
gives more precise control over what forms of instructions are legal, and how to
legalize them. This was necessary to implement support for the new extending
loads and truncating stores, but also results in more concise and elegant
expressions of legality for each target. For example, you can now apple a single
definition to apply to multiples opcodes (G_ADD, G_SUB, G_MUL etc).
>>>
>>> The IR translator has been modified to split aggregates rather than
handling them as one single large scalar. This change fixed some bugs and was
necessary in order handle big endian code correctly in future.
>>>
>>> The tablegen instruction selector also saw significant improvements
in performance, helping to keep overall compile time regression vs fastisel to
be <5% geomean on CTMark. There are still a few outliers like sqlite3 which
has a significant regression compared to FastISel, but most of the other
benchmarks show little difference or even improvement.
>>>
>>> The tablegen importer has had improvements made to it, so that we
can import more SelectionDAG selection rules. For example, currently on AArch64
we have about 40% of the rules being successfully imported.
>>>
>>> New additions from last year include the beginnings of a new
combiner, although there’s still significant work to be done here in terms of
the final design. The combiner will become a critical part of the pipeline in
order to begin improving runtime performance.
>>>
>>> *High levels goals*
>>>
>>> Going forward, we plan to improve GlobalISel in a number of key
areas to achieve the following targets:
>>> * Keeping compile time under control, ideally within 5% of
FastISel, and when optimizations are enabled to maintain a compile time
advantage of SelectionDAG.
>>> * Begin improving runtime performance by adding the most important
optimizations required to be competitive at -Os. We will be targeting and
measuring AArch64 for this goal but will endeavor to implement as many
optimizations as possible in generic code to benefit other targets.
>>> * Improving overall stability and test coverage. Maintaining a high
level of code quality and minimizing regressions in correctness and performance
will be a significant challenge.
>>> * Ensure that the overall design meets the needs of general
targets, not being overly tuned to a specific implementation.
>>>
>>> *Design work planned*
>>>
>>> These are some design changes coming in the near to medium term
future:
>>>
>>> * The G_MERGE and G_UNMERGE opcodes will be split into separate
opcodes to handle different use cases. At the moment the opcode is too powerful,
resulting in overly complex handling in places like the legalizer. G_MERGE will
be split so that it only handles merging of scalars into one larger scalar. For
other cases like merging scalars into a vector we will create a new
G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the opposite. For
the current vector + vector case a new G_CONCAT_VECTOR will be introduced. With
these changes it should simplify implementations for all targets.
>>>
>>> * Constant representation at the MI level needs some investigation.
We currently represent constants as generic instructions, with each instance of
a constant being largely independent of each other, being stored in the entry
block except for a few places in IR translation where we emit at the point of
use. As a result we run a localizer pass in an effort to reduce the live ranges
of the constants (and the consequent spilling), using some heuristics to decide
where to sink the constant definitions to.
>>>
>>> Since we don’t do any real caching of MI constants, multiple
G_CONSTANT definitions can exist for the same constant. This can also result in
a lot of redundant constants being created, especially for things like address
computation. Reducing the number of constants can help reduce compile time and
memory usage. Given this situation, one possible approach is to encode constants
into the operands of the users, rather than have dedicated machine instructions.
At instruction selection time the constant can then be materialized into a
register or encoded as an immediate. Further investigation is needed to find the
right way forward here.
>>>
>>> * For optimizations to be supported, the combiner will become a
crucial part of the GISel pipeline. We have already done some preliminary work
in a generic combiner, which will be used to eventually support combines of
extloads/truncstores. We’ve had discussions on and off list about what we need
from the new combiner. The summary is that we want the combiner to be flexible
for each target to select from a library of combines, being as efficient as
possible. The expression of the combines are currently written in C++, but one
piece of investigation work we might do is to prototype using the same tablegen
driven instruction selector code to match declarative combine patterns written
in tablegen. Regardless, we will need to support the custom C++ use case.
>>>
>>> * CSE throughout the pipeline. From a theoretical perspective,
having a self contained CSE pass that operates as a single phase in the pipeline
is attractive for the simplicity and elegance. However, we know empirically that
this is expensive in compile time. Not only does the CSE pass itself take a
non-negligible time to run, but having it as a late pass can result in the
non-CSE’d code from the IRTranslator onwards surviving for a long time, taking
up time in analysis at each stage of compilation. We believe running a light
weight CSE early is a win. SelectionDAG currently does CSE by default when
building the DAG, and this is something we could explore as part of a custom
IRBuilder.
>>>
>>> * Known bits computation. Some optimizations require the knowledge
of which bits in a value are known to be 1 or 0, and do this by using the
computeKnownBits() capability for SelectionDAG nodes. We will need some way of
getting the same information. In an ideal scenario the replacement
infrastructure for this will be more efficient, as this part of the codebase
seems to be disproportionately responsible for pathological compile time
regressions.
>>>
>>> * Load/store ordering needs some thought, as we currently don’t
have a way to easily check at the MI level what the ordering requirements are on
a set of memory operations. SelectionDAG uses the chains to ensure that they’re
scheduled to respect the orderings. How to achieve the same thing remains an
open question for GlobalISel.
>>>
>>> * More extensive tests that exercise multiple stages of the
pipeline. One advantage of using MIR with GISel is that individual passes can be
easily tested by feeding the exact input expected for a particular pass, and
checking the immediate output of the pass. However this approach can leave holes
in the test coverage. To help mitigate this, we will be exploring
writing/generating whole pipeline tests, tracking some IR through each pass and
checking how the MIR is mutated. We currently also have a proposed change to
allow usage of FileCheck as a library, not just as a stand-alone tool. This
would allow us to use FileCheck style checks and Improve testing of currently
unused code paths.
>>>
>>>
>>> *Roadmap for enabling optimizations*
>>>
>>> I’ve filed a few PRs that people can follow or comment on to track
the progress towards enabling the -Os optimization level. The rough outline is:
>>>
>>> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or
benchmarks (Darwin)
>>> PR 38366 - GlobalISel: Lightweight CSE
>>> PR 32561 - GlobalISel: placement of constants in the entry-block
and fast regalloc result in lots of reloaded constant
>>> PR 38367 - GlobalISel: Implement support for obtaining known bits
information
>>> PR 38368 - GlobalISel: Investigate an efficient way to ensure
load/store orderings
>>>
>>> These, along with general design and implementation work on the
combiner, will then lead onto a long road of performance analysis, inevitable
bug fixing, and implementing more optimizations.
>>>
>>> If anyone is interested in discussing in more detail, feel free to
reach out on the list, or to any of the GlobalISel developers. We’d especially
like to hear about any issues or concerns about porting targets to GlobalISel.
>>>
>>> Thanks,
>>> Amara
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> -- 
>> Regards,
>> Leslie Zhai
>>
>>
>>
-- 
Regards,
Leslie Zhai

llvm dev - Aug 2018 - GlobalISel design update and goals

[llvm-dev] GlobalISel design update and goals

[llvm-dev] GlobalISel design update and goals