On Sat, Feb 29, 2020 at 5:14 PM Nicholas Krause via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> On 2/29/20 7:23 PM, River Riddle wrote:
>> On Sat, Feb 29, 2020 at 4:00 PM Nicholas Krause <xerofoify at gmail.com> wrote:
>>> On 2/29/20 6:17 PM, River Riddle via llvm-dev wrote:
>>>> On Sat, Feb 29, 2020 at 2:25 PM David Blaikie via llvm-dev
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>> On Sat, Feb 29, 2020 at 2:19 PM Chris Lattner <clattner at nondot.org> wrote:
>>>>>> On Feb 29, 2020, at 2:08 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm curious as to how MLIR deals with IPO, as that's the problem
>>>>>>>> I was running into.
>>>>>>>
>>>>>>> FWIW I believe LLVM's new pass manager (NPM) was designed with
>>>>>>> parallelism in mind and with the ability to support this situation
>>>>>>> (which MLIR doesn't? Or doesn't to the degree/in the way the NPM
>>>>>>> does). I'll leave it to folks (Chandler probably has the most
>>>>>>> context here) to provide more detail there if they can/have time.
>>>>>>
>>>>>> Historically speaking, all of the LLVM pass managers have been
>>>>>> designed to support multithreaded compilation (check out the ancient
>>>>>> history of the WritingAnLLVMPass
>>>>>> <http://llvm.org/docs/WritingAnLLVMPass.html> doc if curious).
>>>>>
>>>>> I think the specific thing that might've been a bit different in the
>>>>> NPM was handling analysis invalidation in a way that's more
>>>>> parallelism-friendly than the previous one - but I may be
>>>>> misrepresenting/misunderstanding some of it.
>>>>>
>>>>>> The problem is that LLVM has global use-def chains on constants,
>>>>>> functions, globals, etc., so it is impractical to do this. Every
>>>>>> "inst->setOperand" would have to be able to take locks or use
>>>>>> something like software transactional memory techniques in its
>>>>>> implementation. This would be very complicated and very slow.
>>>>>
>>>>> Oh, yeah - I recall that particular limitation being discussed/not
>>>>> addressed as yet.
>>>>>
>>>>>> MLIR defines this away from the beginning. This is a result of the
>>>>>> core IR design, not the pass manager design itself.
>>>>>
>>>>> What does MLIR do differently here/how does it define that issue
>>>>> away? (Does it not have use-lists built in?)
>>>>
>>>> The major thing is that constants and global-like objects don't
>>>> produce SSA values and thus don't have use-lists.
>>>> https://mlir.llvm.org/docs/Rationale/#multithreading-the-compiler
>>>> discusses this a bit.
>>>>
>>>> For constants, the data is stored as an Attribute (context-uniqued
>>>> metadata that has no use-list and is not SSA). This attribute can
>>>> either be placed in the attribute list (if the operand is always
>>>> constant, like the value of a switch case), or it must be explicitly
>>>> materialized via some operation. For example, the `std.constant
>>>> <https://mlir.llvm.org/docs/Dialects/Standard/#constant-operation>`
>>>> operation will materialize an SSA value from some attribute data.
>>>>
>>>> For references to functions and other global-like objects, we have a
>>>> non-SSA mechanism built around `symbols`. This essentially uses a
>>>> special attribute to reference the function by name instead of by SSA
>>>> value. You can find more information on MLIR symbols here
>>>> <https://mlir.llvm.org/docs/SymbolsAndSymbolTables/>.
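>>>>
>>>> As a minimal sketch of how that looks in the textual IR (the function
>>>> names here are hypothetical, using the current standard dialect):
>>>>
>>>>   func @caller() -> i32 {
>>>>     // `constant` materializes an SSA value from the IntegerAttr
>>>>     // `42 : i32`; the attribute itself has no use-list.
>>>>     %c = constant 42 : i32
>>>>     // `@callee` is a symbol reference attribute - a by-name
>>>>     // reference, not an SSA use of a global object.
>>>>     %r = call @callee(%c) : (i32) -> i32
>>>>     return %r : i32
>>>>   }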
>>>> Along with the above, there is a trait that can be attached to
>>>> operations called `IsolatedFromAbove
>>>> <https://mlir.llvm.org/docs/Traits/#isolatedfromabove>`. This
>>>> essentially means that no SSA values defined above a region can be
>>>> referenced from within that region. The pass manager only allows
>>>> scheduling passes on operations that have this property, meaning that
>>>> all pipelines are implicitly multi-threaded.
>>>>
>>>> The pass manager in MLIR was heavily inspired by the work on the new
>>>> pass manager in LLVM, but with specific constraints/requirements that
>>>> are unique to the design of MLIR. That being said, there are some
>>>> usability features added that would also make great additions to
>>>> LLVM: instance-specific pass options and statistics, pipeline crash
>>>> reproducer generation, etc.
>>>>
>>>> Not sure if any of the above helps clarify, but happy to chat more if
>>>> you are interested.
>>>>
>>>> -- River
>>>>
>>>>> - Dave
>>>
>>> River,
>>> The big thing from my reading of the Pass Manager in MLIR is that it
>>> allows us to iterate through a pass per function or module as a group,
>>> allowing it to run async. I've proposed this on the GCC side:
>>> https://gcc.gnu.org/ml/gcc/2020-02/msg00247.html
>>>
>>> It's to walk through the IPA passes, which are similar to analysis
>>> passes on the LLVM side.
>>
>> Hi Nicholas,
>>
>> I can't say anything about the GCC side, but this isn't a particularly
>> novel aspect of the MLIR pass manager. In many ways, the pass manager
>> is the easiest/simplest part of the multi-threading problem. The bigger
>> problem is making sure that the rest of the compiler infrastructure is
>> structured in a way that is thread-safe, or can be made thread-safe.
>> This is why most of the discussion is based around how to model things
>> like constants, global values, etc. When I made MLIR multi-threaded a
>> year ago, the large majority of my time was spent outside of the pass
>> manager. For a real example, I spent much more time just on
>> multi-threaded pass timing
>> <https://mlir.llvm.org/docs/WritingAPass/#multi-threaded-pass-timing>
>> than on making the pass manager itself multi-threaded.
>>
>> -- River
>
> Actually, in my experience the biggest problem is whether we can detect
> IPO and provide async guarantees around it. MLIR runs operations, but
> only for a module or set of functions, without this. One of my dreams
> would be to run passes in parallel, including IPO detection, and stop
> if we cannot continue past an IPO pass or set of passes due to changes.
>
> Maybe MLIR does do that, but it's the one bottleneck that is really
> hard to fix.

What MLIR does (that would require quite some work in LLVM) is making
sure that you can process and transform functions in isolation, allowing
*local* optimizations to run in parallel. This does not solve the IPO
problem you're after. As I understand it, this is a difficult thing to
design, and it requires reconsidering how you think about the passes and
the pass pipeline entirely.

Running function passes and "local" optimizations in parallel in LLVM
isn't possible because the structures in the LLVMContext aren't
thread-safe, and because the IR itself isn't thread-safe. Something as
simple as DCE'ing or CSE'ing away a function call requires modifying the
callee (through its use-list).

--
Mehdi
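To make that last point concrete, here is a minimal C++ sketch (the
helper is hypothetical, but the LLVM IR APIs are real) of why erasing
even a provably dead call is not a function-local edit:

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"

  // Hypothetical helper: erase a call we have proven dead. The "local"
  // edit below unlinks a Use from the callee Function's use-list, which
  // is global, module-wide state shared with every other caller.
  void deleteDeadCall(llvm::CallInst *Call) {
    llvm::Function *Callee = Call->getCalledFunction();
    unsigned UsesBefore = Callee ? Callee->getNumUses() : 0;
    Call->eraseFromParent(); // mutates Callee's use-list
    // Two threads doing this for calls in *different* functions still
    // race on the same list: Callee->getNumUses() is now UsesBefore - 1.
    (void)UsesBefore;
  }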
On 2/29/20 9:38 PM, Mehdi AMINI wrote:
> [snip - earlier thread quoted above]
> What MLIR does (that would require quite some work in LLVM) is making
> sure that you can process and transform functions in isolation,
> allowing *local* optimizations to run in parallel. This does not solve
> the IPO problem you're after. As I understand it, this is a difficult
> thing to design, and it requires reconsidering how you think about the
> passes and the pass pipeline entirely.

I've been thinking about it on the GCC side for the last few months.
It's non-trivial, though something similar to this may be possible; I
will need to research both pass managers better:
https://elixir.bootlin.com/linux/latest/source/include/linux/workqueue.h#L102
I'm looking at some RFCs for the pass manager and SSA iterators on the
GCC side.
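For the sort of thing that link describes, a minimal sketch (hypothetical,
plain std::thread rather than either compiler's actual infrastructure)
might look like:

  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  // A work queue of independent per-function jobs drained by a fixed
  // pool of workers. Jobs must not touch shared, unsynchronized state -
  // which is exactly the hard part discussed in this thread.
  class WorkQueue {
    std::queue<std::function<void()>> Jobs;
    std::mutex Lock;

  public:
    void push(std::function<void()> Job) {
      std::lock_guard<std::mutex> Guard(Lock);
      Jobs.push(std::move(Job));
    }

    // Drain the queue with NumWorkers threads, then wait for them all.
    void run(unsigned NumWorkers) {
      std::vector<std::thread> Workers;
      for (unsigned I = 0; I != NumWorkers; ++I)
        Workers.emplace_back([this] {
          for (;;) {
            std::function<void()> Job;
            {
              std::lock_guard<std::mutex> Guard(Lock);
              if (Jobs.empty())
                return;
              Job = std::move(Jobs.front());
              Jobs.pop();
            }
            Job(); // e.g. "run this pass over this one function"
          }
        });
      for (std::thread &W : Workers)
        W.join();
    }
  };

Usage would be one push() per function for a function-local pass, then
run(std::thread::hardware_concurrency()).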
Though one easy thing to note is that many core classes in both GCC and
LLVM contain loops that run a lot; LRA on the GCC side does for register
allocation. We may want to figure out how, and whether it is a good
idea, to run these loops in small worker functions that are only used by
the one function that requires them. The real question is that some
loops are only performance-intensive in corner cases, so we may need
heuristics to decide when to launch, e.g. based on the number of
elements being iterated over, etc. (see the sketch after this message).
Maybe that's a bad idea, but after looking through the GCC and LLVM side
a little, this seems to be a later, tedious but mostly mechanical fix:
figuring out which loops are expensive enough to be worth it. I believe
Johannes was going to reach out next week about the Module classes and
we will go from there,

Nick

> Running function passes and "local" optimizations in parallel in LLVM
> isn't possible because the structures in the LLVMContext aren't
> thread-safe, and because the IR itself isn't thread-safe. Something as
> simple as DCE'ing or CSE'ing away a function call requires modifying
> the callee (through its use-list).
>
> --
> Mehdi
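A minimal sketch of that heuristic (hypothetical, not tied to either
compiler's code base): run the loop body sequentially unless the element
count clears a threshold that makes forking workers pay for itself.

  #include <algorithm>
  #include <cstddef>
  #include <thread>
  #include <vector>

  // The threshold would have to be tuned per loop; 4096 is a placeholder.
  template <typename T, typename Fn>
  void maybeParallelFor(std::vector<T> &Items, Fn Body,
                        std::size_t Threshold = 4096) {
    if (Items.size() < Threshold) {
      // Corner-case loops stay sequential; spawning threads here would
      // cost more than it saves.
      for (T &Item : Items)
        Body(Item);
      return;
    }
    unsigned NumWorkers = std::max(1u, std::thread::hardware_concurrency());
    std::size_t Chunk = (Items.size() + NumWorkers - 1) / NumWorkers;
    std::vector<std::thread> Workers;
    for (unsigned W = 0; W != NumWorkers; ++W) {
      std::size_t Begin = W * Chunk;
      std::size_t End = std::min(Items.size(), Begin + Chunk);
      if (Begin >= End)
        break;
      // Safe only if Body touches nothing shared across chunks.
      Workers.emplace_back([&Items, &Body, Begin, End] {
        for (std::size_t I = Begin; I != End; ++I)
          Body(Items[I]);
      });
    }
    for (std::thread &W : Workers)
      W.join();
  }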
This is a recent desktop running Xubuntu 19.10, compiling clang and llvm
10.0.0. See below. For this test I am running 14 processors in a GUI VM.
The cores are hyperthreaded (processors are twice the cores), but all
the cores show negligible activity before the run.

compile_commands.json has 3022 entries. The ninja compile run lasted 7
minutes and 43 seconds with ~99% usage on all processors throughout.
That is 7*60+43 = 463 seconds of wall-clock time. Compile seconds per
entry in compile_commands.json: 463/3022 = 0.1532 seconds. With 14
processors running concurrently, the average time spent compiling each
entry would then be about 14*0.1532 ≈ 2.14 seconds.

cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;llvm" -DLLVM_USE_LINKER=lld
-DCMAKE_BUILD_TYPE="Release" -DLLVM_TARGETS_TO_BUILD=X86
-DLLVM_ENABLE_LIBPFM=OFF -DRUN_HAVE_GNU_POSIX_REGEX=0
-DRUN_HAVE_THREAD_SAFETY_ATTRIBUTES=0 -Wno-dev ../llvm &>
/home/nnelson/Documents/cmake.log
ninja &> /home/nnelson/Documents/ninja.log

Here are some useful pages from Threading Building Blocks.

Task-Based Programming
https://software.intel.com/en-us/node/506100
Appendix A, Costs of Time Slicing
https://software.intel.com/en-us/node/506127

The point: when the number of compiles exceeds the number of cores such
that all the cores are utilized, nothing is gained by trying to
multi-thread the individual compiles. In fact, loading the cores with
more threads or tasks than there are cores will reduce compiling
efficiency because of time slicing. And sequencing through more tasks
rather than fewer when the cores are not overloaded will also reduce
efficiency, because more tasks have to be loaded onto and unloaded from
the cores.

Neil Nelson
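A quick check of the arithmetic (a throwaway snippet; the values are
copied from the run above):

  #include <cstdio>

  int main() {
    const double WallSeconds = 7 * 60 + 43; // 463 s for the whole run
    const int Jobs = 3022;                  // compile_commands.json entries
    const int Procs = 14;                   // virtual processors in the VM
    std::printf("wall-clock seconds per job: %.4f\n", WallSeconds / Jobs);
    std::printf("average seconds per job:    %.3f\n",
                Procs * WallSeconds / Jobs);
    return 0;
  }

This prints 0.1532 and 2.145, matching the figures above.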