Daniel Sanders via llvm-dev
2018-Aug-03 01:40 UTC
[llvm-dev] GlobalISel design update and goals
> On 2 Aug 2018, at 14:50, Quentin Colombet <quentin.colombet at gmail.com> wrote:
>
> Hi Daniel,
>
> 2018-07-31 8:40 GMT-07:00 Daniel Sanders <daniel_l_sanders at apple.com>:
>>
>> On 30 Jul 2018, at 16:04, Quentin Colombet via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi Amara,
>>
>> Thanks for sharing the plan going forward.
>>
>> Inlined a couple of comments.
>>
>> 2018-07-30 7:01 GMT-07:00 Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org>:
>>
>> Hi all,
>>
>> Over the past few months we’ve been doing work on the foundations for the next stages of GlobalISel development. In terms of changes from this time last year, the IR translator, the legalizer, and the instruction selector have seen moderate to major changes. The most significant of these was the change to the legalizer API, allowing targets to use predicates to express legality, which gives more precise control over which forms of instructions are legal and how to legalize them. This was necessary to implement support for the new extending loads and truncating stores, but it also results in more concise and elegant expressions of legality for each target. For example, you can now apply a single definition to multiple opcodes (G_ADD, G_SUB, G_MUL, etc.).

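For illustration, such a shared definition might look roughly like the sketch below. The target class, the type choices, and the exact chain of rules here are made up for the example; only the general shape of the predicate-based rule API is the point.

    #include "llvm/CodeGen/GlobalISel/LegalizerInfo.h"
    #include "llvm/CodeGen/TargetOpcodes.h"
    #include "llvm/Support/LowLevelTypeImpl.h"
    using namespace llvm;

    // Hypothetical target: one rule set shared by three opcodes.
    struct ExampleLegalizerInfo : public LegalizerInfo {
      ExampleLegalizerInfo() {
        using namespace TargetOpcode;
        const LLT s32 = LLT::scalar(32);
        const LLT s64 = LLT::scalar(64);
        const LLT v4s32 = LLT::vector(4, 32);

        getActionDefinitionsBuilder({G_ADD, G_SUB, G_MUL})
            .legalFor({s32, s64, v4s32})   // these types are selectable as-is
            .widenScalarToNextPow2(0)      // odd scalar widths (e.g. s17) are widened first
            .clampScalar(0, s32, s64);     // then scalars are clamped into [s32, s64]

        computeTables();                   // finalize the rule tables
      }
    };
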
>> The IR translator has been modified to split aggregates rather than handling them as one single large scalar. This change fixed some bugs and was necessary in order to handle big-endian code correctly in the future.
>>
>> The tablegen instruction selector also saw significant improvements in performance, helping to keep the overall compile-time regression vs. FastISel under 5% geomean on CTMark. There are still a few outliers like sqlite3, which has a significant regression compared to FastISel, but most of the other benchmarks show little difference or even improvement.
>>
>> The tablegen importer has also been improved, so that we can import more SelectionDAG selection rules. For example, on AArch64 we currently have about 40% of the rules being successfully imported.
>>
>> New additions since last year include the beginnings of a new combiner, although there’s still significant work to be done here in terms of the final design. The combiner will become a critical part of the pipeline in order to begin improving runtime performance.
>>
>> High-level goals
>>
>> Going forward, we plan to improve GlobalISel in a number of key areas to achieve the following targets:
>> * Keep compile time under control, ideally within 5% of FastISel, and maintain a compile-time advantage over SelectionDAG when optimizations are enabled.
>> * Begin improving runtime performance by adding the most important optimizations required to be competitive at -Os. We will be targeting and measuring AArch64 for this goal but will endeavor to implement as many optimizations as possible in generic code to benefit other targets.
>> * Improve overall stability and test coverage. Maintaining a high level of code quality and minimizing regressions in correctness and performance will be a significant challenge.
>> * Ensure that the overall design meets the needs of general targets, without being overly tuned to a specific implementation.
>>
>> Design work planned
>>
>> These are some design changes coming in the near to medium term:
>>
>> * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to handle different use cases. At the moment the opcode is too powerful, resulting in overly complex handling in places like the legalizer. G_MERGE will be split so that it only handles merging of scalars into one larger scalar. For other cases, like merging scalars into a vector, we will create a new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the opposite. For the current vector + vector case a new G_CONCAT_VECTOR will be introduced. These changes should simplify implementations for all targets.
>>
>> * Constant representation at the MI level needs some investigation. We currently represent constants as generic instructions, with each instance of a constant largely independent of the others, stored in the entry block except for a few places in IR translation where we emit at the point of use. As a result we run a localizer pass in an effort to reduce the live ranges of the constants (and the consequent spilling), using some heuristics to decide where to sink the constant definitions.
>>
>> Since we don’t do any real caching of MI constants, multiple G_CONSTANT definitions can exist for the same constant. This can also result in a lot of redundant constants being created, especially for things like address computation. Reducing the number of constants can help reduce compile time and memory usage. Given this situation, one possible approach is to encode constants into the operands of their users, rather than have dedicated machine instructions. At instruction selection time the constant can then be materialized into a register or encoded as an immediate. Further investigation is needed to find the right way forward here.
>>
>> The initial design was to not have constants in machine operands. The main rationale is that materializing a constant may be expensive, so we had better not sprinkle it around. Now, in practice, this does mean that we need to keep a table of all the constants created so that we don't start to duplicate them, which we currently fail to do. That should be easy to fix with just a map from constant + type to virtual register that could be kept at the function level (or even module level). Better yet, this could be handled by the IRBuilder. E.g., when instantiating an IRBuilder, it could scan the function to see which constants are already there, build this mapping, and then create only the constants that are missing.
>>
>> Moreover, another advantage of this model is that optimizations don't have to deal with two variants of the same instruction (ADDr and ADDi), and the same goes for patterns. Alternatively, if we don't change the opcode but only the MachineOperands, then every optimization has to deal with two different kinds of opcodes, which is bad IMHO.
>>
>> Also, this design was meant to absorb what the constant hoisting pass does on the IR, so that we can kill that pass while having a better cost model.
>>
>> Finally, regarding the localizer pass, this was meant as a workaround for the fast regalloc problem and should be killed asap. In particular, the main motivation was to avoid code-size bloat, but AFAIR during our last measurements with Ahmed we only saved a few % on a handful of benchmarks, so maybe we can just kill it.

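To make the map-in-the-builder idea concrete, a constant-caching wrapper could look like the sketch below. CachingConstantBuilder is a hypothetical helper, not an existing class; it keys only on value and bit width for brevity, so a real version would have to handle vectors, pointers, and FP constants, and decide where the cached definitions live. In effect it is a lightweight CSE limited to G_CONSTANT.

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
    #include "llvm/CodeGen/MachineRegisterInfo.h"
    #include "llvm/Support/LowLevelTypeImpl.h"
    using namespace llvm;

    // Hypothetical wrapper: hand out one G_CONSTANT vreg per (value, width)
    // instead of emitting a fresh definition at every request.
    class CachingConstantBuilder {
      MachineIRBuilder &MIB;
      MachineRegisterInfo &MRI;
      DenseMap<std::pair<int64_t, unsigned>, unsigned> Cache; // (value, bits) -> vreg

    public:
      CachingConstantBuilder(MachineIRBuilder &MIB, MachineRegisterInfo &MRI)
          : MIB(MIB), MRI(MRI) {}

      unsigned getConstant(int64_t Val, LLT Ty) {
        auto Key = std::make_pair(Val, Ty.getSizeInBits());
        auto It = Cache.find(Key);
        if (It != Cache.end())
          return It->second;                  // reuse the existing definition
        unsigned Reg = MRI.createGenericVirtualRegister(Ty);
        MIB.buildConstant(Reg, Val);          // emit a single G_CONSTANT
        Cache[Key] = Reg;
        return Reg;
      }
    };

Whether something like this lives inside the IRTranslator's builder or behind a shared table at the function/module level is exactly the open question above.
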
>> * For optimizations to be supported, the combiner will become a crucial part of the GISel pipeline. We have already done some preliminary work on a generic combiner, which will eventually be used to support combines of extloads/truncstores. We’ve had discussions on and off list about what we need from the new combiner. The summary is that we want the combiner to be flexible enough for each target to select from a library of combines, while being as efficient as possible. The combines are currently expressed in C++, but one piece of investigation work we might do is to prototype using the same tablegen-driven instruction selector code to match declarative combine patterns written in tablegen. Regardless, we will need to support the custom C++ use case.
>>
>> * CSE throughout the pipeline. From a theoretical perspective, having a self-contained CSE pass that operates as a single phase in the pipeline is attractive for its simplicity and elegance. However, we know empirically that this is expensive in compile time. Not only does the CSE pass itself take a non-negligible time to run, but having it as a late pass can result in the non-CSE’d code from the IRTranslator onwards surviving for a long time, taking up time in analysis at each stage of compilation. We believe running a lightweight CSE early is a win. SelectionDAG currently does CSE by default when building the DAG, and this is something we could explore as part of a custom IRBuilder.
>>
>> I have been pushing for the IRBuilder to be smarter. Having it do CSE was something I wanted us to try.
>>
>> * Known-bits computation. Some optimizations require knowledge of which bits in a value are known to be 1 or 0, and currently get it from the computeKnownBits() capability for SelectionDAG nodes. We will need some way of getting the same information. In an ideal scenario the replacement infrastructure for this will be more efficient, as this part of the codebase seems to be disproportionately responsible for pathological compile-time regressions.

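For a sense of the shape such a replacement could take, here is a minimal hypothetical sketch; computeKnownBitsMIR is a made-up helper that handles only G_CONSTANT and G_AND, and the real infrastructure would need caching and far broader opcode coverage to avoid exactly the compile-time problems mentioned above.

    #include "llvm/CodeGen/MachineInstr.h"
    #include "llvm/CodeGen/MachineRegisterInfo.h"
    #include "llvm/CodeGen/TargetOpcodes.h"
    #include "llvm/IR/Constants.h"
    #include "llvm/Support/KnownBits.h"
    using namespace llvm;

    // Hypothetical sketch: derive known bits for a virtual register by walking
    // its defining generic instructions.
    static KnownBits computeKnownBitsMIR(unsigned Reg, const MachineRegisterInfo &MRI,
                                         unsigned Depth = 0) {
      KnownBits Known(MRI.getType(Reg).getSizeInBits()); // nothing known by default
      const MachineInstr *MI = MRI.getVRegDef(Reg);
      if (!MI || Depth > 6)                               // depth limit to bound cost
        return Known;

      switch (MI->getOpcode()) {
      case TargetOpcode::G_CONSTANT: {
        const APInt &Val = MI->getOperand(1).getCImm()->getValue();
        Known.One = Val;
        Known.Zero = ~Val;                                // every bit is known
        break;
      }
      case TargetOpcode::G_AND: {
        KnownBits L = computeKnownBitsMIR(MI->getOperand(1).getReg(), MRI, Depth + 1);
        KnownBits R = computeKnownBitsMIR(MI->getOperand(2).getReg(), MRI, Depth + 1);
        Known.Zero = L.Zero | R.Zero;                     // zero if zero on either side
        Known.One = L.One & R.One;                        // one only if one on both sides
        break;
      }
      default:
        break;                                            // conservatively unknown
      }
      return Known;
    }
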
>> * Load/store ordering needs some thought, as we currently don’t have a way to easily check at the MI level what the ordering requirements are on a set of memory operations. SelectionDAG uses chains to ensure that they’re scheduled to respect the orderings. How to achieve the same thing remains an open question for GlobalISel.
>>
>> I don't get this problem. GISel has a sequential IR; the order is already modeled there.
>>
>> %t1 = load %addr1 :: (load 4 from @a)
>> %t2 = load %addr2 :: (load 4 from @a + 4)
>> store %t1 %addr3 :: (store 4 to @b)
>> store %t2 %addr4 :: (store 4 to @b + 4)
>>
>> MIR specifies the order:
>> load @a -> load @a+4 -> store @b -> store @b+4
>> but the dependencies are actually:
>> load @a -> store @b
>> load @a+4 -> store @b+4
>>
>> We only have a representation of the former at the moment and have to re-calculate the latter each time we want to use that information to sink/hoist/fold/etc. instructions. We need to build a means of preserving that dependency graph within a pass and between passes.

> Thanks for the example. I think it matches what I was saying previously, i.e., if we want more than the relative ordering, we'll need some alias analysis. That said, the alias analysis data from the IR is already available (admittedly in a form that I don't like) by using the memory operands.

I agree we need alias analysis, but having alias analysis by itself isn't enough. We also need a means to find the right 'does x alias y' questions to ask for a given change to the MIR, and a means to avoid asking the same aliasing question multiple times. The answers to 'does x alias y' were encoded in artificial edges of the DAG before, so it was easy to find the relevant instructions; although the necessary information is available to GlobalISel, we don't have an efficient means of using it yet.

> Check the stack coloring pass to see how we use it.

As far as I can see, StackColoring is only maintaining AA information; it's not using it in the decision-making code. Which code are you referring to?

> Cheers,
> -Quentin

>> Dominance should give us the relative ordering; then if we want more, we should do exactly what we do at the IR level (alias analysis, etc.)
>>
>> Cheers,
>> -Quentin
>>
>> * More extensive tests that exercise multiple stages of the pipeline. One advantage of using MIR with GISel is that individual passes can be easily tested by feeding in the exact input expected for a particular pass and checking the immediate output of the pass. However, this approach can leave holes in the test coverage. To help mitigate this, we will be exploring writing/generating whole-pipeline tests, tracking some IR through each pass and checking how the MIR is mutated. We also currently have a proposed change to allow usage of FileCheck as a library, not just as a stand-alone tool. This would allow us to use FileCheck-style checks and improve testing of currently unused code paths.
>>
>> Roadmap for enabling optimizations
>>
>> I’ve filed a few PRs that people can follow or comment on to track the progress towards enabling the -Os optimization level. The rough outline is:
>>
>> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks (Darwin)
>> PR 38366 - GlobalISel: Lightweight CSE
>> PR 32561 - GlobalISel: placement of constants in the entry-block and fast regalloc result in lots of reloaded constant
>> PR 38367 - GlobalISel: Implement support for obtaining known bits information
>> PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store orderings
>>
>> These, along with general design and implementation work on the combiner, will then lead onto a long road of performance analysis, inevitable bug fixing, and implementing more optimizations.
>>
>> If anyone is interested in discussing this in more detail, feel free to reach out on the list, or to any of the GlobalISel developers. We’d especially like to hear about any issues or concerns about porting targets to GlobalISel.
>>
>> Thanks,
>> Amara
Quentin Colombet via llvm-dev
2018-Aug-03 01:50 UTC
[llvm-dev] GlobalISel design update and goals
2018-08-02 18:40 GMT-07:00 Daniel Sanders <daniel_l_sanders at apple.com>:
>
>> On 2 Aug 2018, at 14:50, Quentin Colombet <quentin.colombet at gmail.com> wrote:
>>
>>> [...]
>>>
>>> * Load/store ordering needs some thought, as we currently don’t have a way to easily check at the MI level what the ordering requirements are on a set of memory operations. SelectionDAG uses chains to ensure that they’re scheduled to respect the orderings. How to achieve the same thing remains an open question for GlobalISel.
>>>
>>> I don't get this problem. GISel has a sequential IR; the order is already modeled there.
>>>
>>> %t1 = load %addr1 :: (load 4 from @a)
>>> %t2 = load %addr2 :: (load 4 from @a + 4)
>>> store %t1 %addr3 :: (store 4 to @b)
>>> store %t2 %addr4 :: (store 4 to @b + 4)
>>>
>>> MIR specifies the order:
>>> load @a -> load @a+4 -> store @b -> store @b+4
>>> but the dependencies are actually:
>>> load @a -> store @b
>>> load @a+4 -> store @b+4
>>>
>>> We only have a representation of the former at the moment and have to re-calculate the latter each time we want to use that information to sink/hoist/fold/etc. instructions. We need to build a means of preserving that dependency graph within a pass and between passes.
>>
>> Thanks for the example. I think it matches what I was saying previously, i.e., if we want more than the relative ordering, we'll need some alias analysis. That said, the alias analysis data from the IR is already available (admittedly in a form that I don't like) by using the memory operands.

> I agree we need alias analysis, but having alias analysis by itself isn't enough. We also need a means to find the right 'does x alias y' questions to ask for a given change to the MIR, and a means to avoid asking the same aliasing question multiple times. The answers to 'does x alias y' were encoded in artificial edges of the DAG before, so it was easy to find the relevant instructions; although the necessary information is available to GlobalISel, we don't have an efficient means of using it yet.

As far as I remember, we have the exact same problem with the LLVM IR and we don't have a specific framework for that (the closest solution I can think of would be MemorySSA). The way I would approach this is to just do the naive thing which, like you said, would be pretty inefficient (e.g., traversing the few instructions between the two we want to merge) and see how bad it is in practice. In other words, given that (again IIRC) we don't have such a complex mechanism at the IR level, I wouldn't start by over-engineering a solution for something that in the end may not be on the critical path of performance.

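Roughly, the naive query could be as small as the sketch below. mayAliasMemOps is a hypothetical helper, not existing code; it leans entirely on the machine memory operands and gives up conservatively whenever information is missing.

    #include "llvm/Analysis/AliasAnalysis.h"
    #include "llvm/Analysis/MemoryLocation.h"
    #include "llvm/CodeGen/MachineInstr.h"
    #include "llvm/CodeGen/MachineMemOperand.h"
    using namespace llvm;

    // Hypothetical sketch: may two memory instructions access overlapping memory?
    // Conservative: any missing information means "assume they may alias".
    static bool mayAliasMemOps(const MachineInstr &A, const MachineInstr &B,
                               AAResults &AA) {
      if (A.memoperands_empty() || B.memoperands_empty())
        return true;                              // nothing to reason about
      for (const MachineMemOperand *MMOA : A.memoperands())
        for (const MachineMemOperand *MMOB : B.memoperands()) {
          const Value *VA = MMOA->getValue();
          const Value *VB = MMOB->getValue();
          if (!VA || !VB)
            return true;                          // no IR value to hand to AA
          MemoryLocation LocA(VA, MMOA->getSize(), MMOA->getAAInfo());
          MemoryLocation LocB(VB, MMOB->getSize(), MMOB->getAAInfo());
          if (!AA.isNoAlias(LocA, LocB))
            return true;
        }
      return false;
    }

A sinking/merging legality check would then just walk the instructions between the two accesses and ask something like this about each intervening store.
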
>
>> Check the stack coloring pass to see how we use it.
>
> As far as I can see, StackColoring is only maintaining AA information; it's not using it in the decision-making code. Which code are you referring to?

My bad, I thought it was using it. Though, if it maintains it, someone else must use it, right? :)

>> Cheers,
>> -Quentin
>
> [...]
Daniel Sanders via llvm-dev
2018-Aug-06 15:05 UTC
[llvm-dev] GlobalISel design update and goals
> On 2 Aug 2018, at 18:50, Quentin Colombet <quentin.colombet at gmail.com> wrote:
>
> 2018-08-02 18:40 GMT-07:00 Daniel Sanders <daniel_l_sanders at apple.com>:
>
> [...]
>
>> I agree we need alias analysis, but having alias analysis by itself isn't enough. We also need a means to find the right 'does x alias y' questions to ask for a given change to the MIR, and a means to avoid asking the same aliasing question multiple times. The answers to 'does x alias y' were encoded in artificial edges of the DAG before, so it was easy to find the relevant instructions; although the necessary information is available to GlobalISel, we don't have an efficient means of using it yet.
>
> As far as I remember, we have the exact same problem with the LLVM IR and we don't have a specific framework for that (the closest solution I can think of would be MemorySSA).

AliasSetTracker seems to provide some infrastructure. LICM is using it to aggregate AA within a loop to determine whether memory operations are hoistable outside the loop. I'm not sure if we have anything more than that though.

> The way I would approach this is to just do the naive thing which, like you said, would be pretty inefficient (e.g., traversing the few instructions between the two we want to merge) and see how bad it is in practice. In other words, given that (again IIRC) we don't have such a complex mechanism at the IR level, I wouldn't start by over-engineering a solution for something that in the end may not be on the critical path of performance.

That makes sense, but I'm expecting it to be pretty bad in practice. Combines/ISel/etc. are not constrained by basic blocks, so each match attempt has the potential to scan every preceding instruction.

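To make that cost concrete, the naive per-match legality check would look something like the sketch below. isSafeToFoldLoadIntoUse is hypothetical code; it stays within a single block and, without AA, treats any intervening store, call, or unmodeled side effect as a barrier.

    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/MachineInstr.h"
    #include <iterator>
    using namespace llvm;

    // Hypothetical sketch: can LoadMI be folded/sunk to just before UseMI?
    // Every instruction in between has to be inspected on every match attempt.
    static bool isSafeToFoldLoadIntoUse(const MachineInstr &LoadMI,
                                        const MachineInstr &UseMI) {
      if (LoadMI.getParent() != UseMI.getParent())
        return false;                              // sketch: same block only
      for (auto I = std::next(LoadMI.getIterator()), E = UseMI.getIterator();
           I != E; ++I) {
        if (I->isCall() || I->hasUnmodeledSideEffects())
          return false;                            // unknown memory effects
        if (I->mayStore())
          return false;                            // conservative without AA
      }
      return true;
    }

With AA wired in, each intervening store additionally becomes an alias query, and we don't yet have anything that caches those answers between match attempts.
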
>>
>>> Check the stack coloring pass to see how we use it.
>>
>> As far as I can see, StackColoring is only maintaining AA information; it's not using it in the decision-making code. Which code are you referring to?
>
> My bad, I thought it was using it. Though, if it maintains it, someone else must use it, right? :)

Possibly; we could be maintaining it for no reason, or we could always be taking the AA == nullptr path.

> [...]