Bekket McClane via llvm-dev
2016-Nov-30 01:22 UTC
[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection
> On Nov 30, 2016, at 5:14 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> On Nov 29, 2016, at 4:02 AM, Bekket McClane via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi,
>> Though there exists lots of research on parallelizing or scheduling optimization passes, if you open up the codegen timing metrics (llc -time-passes), you'll find that the most time-consuming task is actually instruction selection (40~50% of the time) rather than the optimization passes (0~10%). That's why we're trying to parallelize the (target-independent) instruction selection process, to improve JIT compilation speed.
>
> How much of this 40-50% is spent in the matcher table? I thought most of the overhead was inherent to SelectionDAG?

Do you mean DAG operations like adding SDNodes to CSENodes? Could you talk a little bit more about that?

> Also, why take such a fine-grained approach instead of trying to perform instruction selection in parallel across basic blocks or functions?

JIT compilation tends to compile a small number of DAG nodes at a time, and most JIT strategies merge several instructions (e.g. all of the hot instructions) into a single basic block or function, so I guess we wouldn't get a significant overall speedup if we only handled one (or a few) compilation units at a time. (Correct me if I'm wrong.)

B.R.
McClane

> I suspect you won't gain much for too much added complexity with this approach.
>
> —
> Mehdi
>
>> The instruction selector of LLVM is an interpreter that interprets the MatcherTable, a bytecode table generated by TableGen. I was surprised to find that the structure of the MatcherTable and of the interpreter seems suitable for parallelization. So we propose a prototype that parallelizes the interpretation of those OPC_Scope children that are likely to be time-consuming. Here is a quick overview:
>>
>> We add two new opcodes: OPC_Fork and OPC_Merge. During the matcher optimization process (utils/TableGen/DAGISelMatcherOpt.cpp), OPC_Fork is added to the front of each scope (OPC_Scope) child that fulfills the following conditions:
>> 1. The number of opcodes within the child exceeds a certain threshold (5 in the current prototype).
>> 2. The child resides in a contiguous sequence of scope children whose length also exceeds a certain threshold (7 in the current prototype).
>> For each valid sequence of scope children, an extra scope child, containing only an OPC_Merge, is appended to the sequence.
>>
>> In the interpreter, when an OPC_Fork is encountered inside a scope child, the main thread dispatches that scope child as a task to a central thread pool, then jumps to the next child. At the end of a valid "parallel sequence" (of scope children) an OPC_Merge must exist, and the main thread stops there and waits for the other threads to finish.
>>
>> For synchronization, a read-write lock is used: the handler of each checking-style opcode (e.g. OPC_CheckSame or OPC_CheckType, but not OPC_CheckComplexPat) takes a read lock; every other handler takes a write lock.
>>
>> Finally, although the generated code is correct, the total time barely breaks even with the original. Possible reasons:
>> 1. The original interpreter is actually pretty fast, so the thread-pool dispatch time for each selection task may be too long in comparison.
>> 2. X86 is the only architecture whose OPC_CheckComplexPat may modify the DAG. This constraint forces us to take a write lock for it, which blocks the other threads. Unfortunately, OPC_CheckComplexPat is probably the most time-consuming opcode on X86, and perhaps on other architectures too.
>> 3. Too many threads. We're now working on another approach that uses a larger region, consisting of multiple scope children, for each parallel task, in order to reduce the number of threads.
>> 4. Popular instructions, like add or sub, have lots of scope children, so one or several parallel regions exist. However, most of the common instruction variants (e.g. add %reg1, %reg2) sit near the "top" of the scope children and are therefore encountered early. So sometimes threads are fired, but the correct instruction is selected immediately afterwards, and lots of time is wasted joining threads.
>>
>> Here is our working repository and the diff against the 3.9 release: https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff
>> I don't think the current state is ready for code review, since there is no significant speedup. But folks are very welcome to discuss this idea, and also whether the current instruction selection approach has reached its upper bound on speed. (I deliberately ignore fast-isel, since it sacrifices too much quality in the generated code; one of our goals is to boost compilation speed while preserving code quality as much as possible.)
>>
>> Feel free to comment directly on the repo diff above.
>>
>> About the "region approach" mentioned in the third item above: it lives on the "dev-region-parallel" branch, but it still has some bugs affecting the correctness of the generated code. I will post more detail about it if the feedback is positive.
>>
>> NOTE: There seem to be some serious bugs in the concurrency and synchronization libraries shipped with old gcc/standard-library versions, so it is strongly recommended to use a recent clang to build our work.
>>
>> B.R.
>> --
>> Bekket McClane
>> Department of Computer Science,
>> National Tsing Hua University
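To make the OPC_Fork/OPC_Merge scheme above concrete, here is a minimal sketch of the dispatch loop it implies. This is not the prototype's actual code: runScopeChild and skipToNextChild are hypothetical helpers, the opcode values are illustrative, and std::async stands in for the central thread pool.

  #include <future>
  #include <vector>

  enum : unsigned char { OPC_Fork = 0xFE, OPC_Merge = 0xFF }; // illustrative values

  bool runScopeChild(const unsigned char *Table, unsigned Idx);     // try to match one child
  unsigned skipToNextChild(const unsigned char *Table, unsigned Idx);

  void interpretScopeChildren(const unsigned char *Table, unsigned Idx) {
    std::vector<std::future<bool>> Pending;
    for (;;) {
      switch (Table[Idx]) {
      case OPC_Fork:
        // Hand this scope child to another thread, then jump to the next one.
        Pending.push_back(std::async(std::launch::async,
                                     runScopeChild, Table, Idx + 1));
        Idx = skipToNextChild(Table, Idx);
        break;
      case OPC_Merge:
        // End of the parallel sequence: the main thread waits here for all
        // outstanding tasks before continuing.
        for (auto &F : Pending)
          F.get();
        return;
      default:
        // An ordinary child is still matched on the main thread.
        if (runScopeChild(Table, Idx))
          return; // a pattern matched; no need to scan further children
        Idx = skipToNextChild(Table, Idx);
      }
    }
  }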
Mehdi Amini via llvm-dev
2016-Nov-30 01:50 UTC
[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection
> On Nov 29, 2016, at 5:22 PM, Bekket McClane <bekket.mcclane at gmail.com> wrote:
>
>> On Nov 30, 2016, at 5:14 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>
>>> On Nov 29, 2016, at 4:02 AM, Bekket McClane via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hi,
>>> Though there exists lots of research on parallelizing or scheduling optimization passes, if you open up the codegen timing metrics (llc -time-passes), you'll find that the most time-consuming task is actually instruction selection (40~50% of the time) rather than the optimization passes (0~10%). That's why we're trying to parallelize the (target-independent) instruction selection process, to improve JIT compilation speed.
>>
>> How much of this 40-50% is spent in the matcher table? I thought most of the overhead was inherent to SelectionDAG?
>
> Do you mean DAG operations like adding SDNodes to CSENodes? Could you talk a little bit more about that?

Yes: building the DAG and mutating the DAG (performing the combine/legalize stages, with continuous CSE).

What does your profile show? Do you have some example IR out of your frontend that I could process with opt/llc and profile?

>> Also, why take such a fine-grained approach instead of trying to perform instruction selection in parallel across basic blocks or functions?
>
> JIT compilation tends to compile a small number of DAG nodes at a time, and most JIT strategies merge several instructions (e.g. all of the hot instructions) into a single basic block or function, so I guess we wouldn't get a significant overall speedup if we only handled one (or a few) compilation units at a time. (Correct me if I'm wrong.)

If you have one function and very small IR, as you seem to indicate here, I wouldn't expect parallelizing the matcher table to be worthwhile (I'd like to see a profile…). I rather think that if the matcher table is high in the profile, a "PGO" approach, as suggested by Nicolai, is likely to help the most.

Also, you should really measure the difference in generated code between fast-isel and SelectionDAG, as in practice it may not be that large for the kind of IR you have. Sometimes we have even seen fast-isel produce better code by matching patterns across basic blocks.

—
Mehdi

[...]
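On the fast-isel comparison suggested above: assuming the llc flags of the 3.9 era, one rough way to eyeball the code-quality gap is to select the same module both ways and diff the assembly (note that fast-isel still falls back to SelectionDAG for nodes it cannot handle, so the diff understates pure fast-isel quality):

  llc -O2 -fast-isel=false zlib-longest-match.ll -o sdagisel.s
  llc -O2 -fast-isel=true  zlib-longest-match.ll -o fastisel.s
  diff sdagisel.s fastisel.s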
Bekket McClane via llvm-dev
2016-Nov-30 03:03 UTC
[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection
> On Nov 30, 2016, at 10:55 AM, Bekket McClane <bekket.mcclane at gmail.com> wrote:
>
>> On Nov 30, 2016, at 9:50 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>
>> Yes: building the DAG and mutating the DAG (performing the combine/legalize stages, with continuous CSE).
>>
>> What does your profile show? Do you have some example IR out of your frontend that I could process with opt/llc and profile?
>
> The "large" one, to simulate the scenario where someone just wants to speed up the build of a large project rather than a JIT usage:

The mailing system seems to have blocked the message because of the large IR file, so here is the Google Drive link instead: https://drive.google.com/open?id=0BwIwAHzsWbekX04wRkFuUGs4dUU

> The normal one, from one of the test cases in LLVM, to simulate a JIT compilation usage:

>> If you have one function and very small IR, as you seem to indicate here, I wouldn't expect parallelizing the matcher table to be worthwhile (I'd like to see a profile…).

The reason we suspect the MatcherTable to be the bottleneck is that even when a program has few DAG nodes, each node's selection "path" through the MatcherTable can be long (i.e. it fails in many OPC_Scopes), so the overall selection time won't necessarily be linear in the number of DAG nodes. But indeed, we haven't done fine-grained profiling to confirm this assumption.

Also, the builtin timer (llc -time-passes) indicates that the "real instruction selector" part, where the MatcherTable plays the main role, is the most time-consuming of the instruction selection stages, ahead of the DAG combiner and the legalizer.

>> I rather think that if the matcher table is high in the profile, a "PGO" approach, as suggested by Nicolai, is likely to help the most.
>>
>> Also, you should really measure the difference in generated code between fast-isel and SelectionDAG, as in practice it may not be that large for the kind of IR you have. Sometimes we have even seen fast-isel produce better code by matching patterns across basic blocks.

We haven't performed deep profiling of fast-isel yet. Thanks for the information.

[...]
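A cheap way to test the "long selection path" hypothesis above, before parallelizing anything, would be to count interpreter steps per node and compare against the node count. A hypothetical instrumentation sketch (none of these names exist in the tree; the two count functions would have to be called from the dispatch loop of SelectionDAGISel::SelectCodeCommon):

  #include <cstdint>
  #include <cstdio>
  #include <unordered_map>

  // Per-ISD-opcode tallies: how many nodes were selected, and how many
  // matcher-table opcodes the interpreter executed for them in total.
  struct MatcherStats { uint64_t Nodes = 0; uint64_t Steps = 0; };
  static std::unordered_map<unsigned, MatcherStats> Stats;

  void countMatcherNode(unsigned NodeOpc) { ++Stats[NodeOpc].Nodes; }
  void countMatcherStep(unsigned NodeOpc) { ++Stats[NodeOpc].Steps; }

  void dumpMatcherStats() {
    for (const auto &KV : Stats) {
      if (!KV.second.Nodes)
        continue;
      std::fprintf(stderr, "node opcode %u: %llu nodes, %.1f steps/node\n",
                   KV.first, (unsigned long long)KV.second.Nodes,
                   (double)KV.second.Steps / (double)KV.second.Nodes);
    }
  }

If steps/node dwarfs the node count for hot opcodes, the path-length theory holds; if not, the time is elsewhere in SelectionDAG.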
Mehdi Amini via llvm-dev
2016-Nov-30 03:23 UTC
[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection
> On Nov 29, 2016, at 6:55 PM, Bekket McClane <bekket.mcclane at gmail.com> wrote:
>
>> What does your profile show? Do you have some example IR out of your frontend that I could process with opt/llc and profile?
>
> The "large" one, to simulate the scenario where someone just wants to speed up the build of a large project rather than a JIT usage:
> <Task.ll>
>
> The normal one, from one of the test cases in LLVM, to simulate a JIT compilation usage:
> <zlib-longest-match.ll>

If I run "clang -march=x86-64 zlib-longest-match.ll -mllvm -time-passes -c -O2" on this, I get this top 5:

   Total Execution Time: 0.1896 seconds (0.2101 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
   0.0024 ( 18.4%)   0.0450 ( 25.4%)   0.0473 ( 25.0%)   0.0536 ( 25.5%)   X86 DAG->DAG Instruction Selection
   0.0005 (  4.1%)   0.0192 ( 10.8%)   0.0197 ( 10.4%)   0.0226 ( 10.8%)   Combine redundant instructions
   0.0010 (  7.7%)   0.0128 (  7.2%)   0.0138 (  7.3%)   0.0144 (  6.9%)   Induction Variable Simplification
   0.0011 (  8.7%)   0.0101 (  5.7%)   0.0112 (  5.9%)   0.0116 (  5.5%)   Loop Strength Reduction
   0.0002 (  1.3%)   0.0083 (  4.7%)   0.0085 (  4.5%)   0.0101 (  4.8%)   Early CSE

And the DAG detail is:

   0.0004 ( 22.6%)   0.0137 ( 39.4%)   0.0141 ( 38.6%)   0.0168 ( 40.3%)   DAG Combining 1
   0.0003 ( 18.8%)   0.0106 ( 30.6%)   0.0109 ( 30.0%)   0.0118 ( 28.1%)   Instruction Selection
   0.0002 ( 14.6%)   0.0022 (  6.4%)   0.0025 (  6.8%)   0.0034 (  8.1%)   Instruction Scheduling
   0.0001 (  8.2%)   0.0027 (  7.9%)   0.0029 (  7.9%)   0.0033 (  7.9%)   DAG Legalization
   0.0001 (  8.0%)   0.0026 (  7.6%)   0.0028 (  7.6%)   0.0032 (  7.7%)   Type Legalization

The -time-passes option is nice, but in general I prefer to look at a profiler, so I ran one (Instruments) on the backend only (without running the optimizer!) using:

   llc -march=x86-64 zlib-longest-match.ll -O2 -time-passes -disable-verify -time-compilations 100

(It repeats the codegen 100 times, which helps to get enough runtime for a meaningful profile.)

Focusing on the "Instruction Selection" part, I get:

   Running Time     Self (ms)   Symbol Name
   20.5ms   3.2%    0.2         llvm::SelectionDAGISel::DoInstructionSelection()
   18.9ms   3.0%    0.5         (anonymous namespace)::X86DAGToDAGISel::Select(llvm::SDNode*)
   17.7ms   2.8%    6.9         llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
    3.9ms   0.6%    0.4         (anonymous namespace)::X86DAGToDAGISel::CheckComplexPattern(llvm::SDNode*, llvm::SDNode*, llvm::SDValue, unsigned int, llvm::SmallVectorImpl<std::__1::pair<llvm::SDValue, llvm::SDNode*> >&)

Note that the percentages are with respect to the total time, so it seems to me that you're trying to parallelize a function that takes ~3% of the backend, which doesn't play well with Amdahl.

Best,

—
Mehdi
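To put the Amdahl remark above in numbers: if a fraction p of the total time is parallelizable across n threads, the best overall speedup is

   S(n) = 1 / ((1 - p) + p / n)

With p ≈ 0.03, as in the profile above, even n → ∞ only yields S ≈ 1 / 0.97 ≈ 1.03, i.e. at most about a 3% end-to-end win, and that is before paying any locking or dispatch overhead.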
Bekket McClane via llvm-dev
2016-Nov-30 03:30 UTC
[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection
> On Nov 30, 2016, at 11:23 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> Note that the percentages are with respect to the total time, so it seems to me that you're trying to parallelize a function that takes ~3% of the backend, which doesn't play well with Amdahl.

Cheers! Thanks for the profiling; I didn't even know about the -time-compilations option. We'll re-verify our assumption.

B.R.
McClane

[...]