thr3ads.net - llvm dev - [llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection [Nov 2016]

If this information is useful, please help other people find it:
Share via:

Bekket McClane via llvm-dev

2016-Nov-29 12:02 UTC

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

Hi,
Though there exists lots of researches on parallelizing or scheduling
optimization passes, If you open up the time matrices of codegen(llc
-time-passes), you'll find that the most time consuming task is actually
instruction selection(40~50% of time) instead of optimization
passes(10~0%). That's why we're trying to parallelize the
(target-independent) instruction selection process in aid of JIT
compilation speed.

The instruction selector of LLVM is an interpreter that interpret the
MatcherTable which consists of bytecodes generated by TableGen. I'm
surprised to find that the structure of MatcherTable and the interpreter
seems to be suitable for parallelization. So we propose a prototype that
parallelizes the interpreting of OPC_Scope children that are possibly
time-consuming. Here is some quick overview:

We add two new opcodes: OPC_Fork and OPC_Merge. During DAG optimization
process(utils/TableGen/DAGISelMatcherOpt.cpp). OPC_Fork would be added to
the front of scope(OPC_Scope) children which fulfill following conditions:
1. Amount of opcodes within the child exceed certain threshold(5 in
current prototype).
2. The child is reside in a sequence of continuous scope children which
length also exceed certain threshold(7 in current prototype).
For each valid sequence of scope children, an extra scope child, where
OPC_Merge is the only opcode, would be appended to it(the sequence).

In the interpreter, when an OPC_Fork is encountered inside a scope child,
the main thread would dispatch the scope child as a task to a central
thread pool, then jump to the next child. At the end of a valid "parallel
sequence(of scope children)" an OPC_Merge must exist and the main thread
would stop there and wait other threads to finish.

About the synchronization, read-write lock is mainly used: In each
checking-style opcode(e.g. OPC_CheckSame, OPC_CheckType, except
OPC_CheckComplexPat) handlers, a read lock is used, otherwise, a write lock
is used.

Finally, although the generated code is correct, total consuming time
barely break even with the original one. Possible reasons may be:
1. The original interpreter is pretty fast actually. The thread pool
dispatching time for each selection task may be too long in comparison with
the original approach.
2. X86 is the only architecture which contains OPC_CheckComplexPat that
would modify DAG. This constraint force us to add write lock on it which
would block other threads at the same time. Unfortunately,
OPC_CheckComplexPat is probably the most time-consuming opcodes in X86 and
perhaps in other architectures, too.
3. Too many threads. We're now working on another approach that use larger
region, consist of multiple scope children, for each parallel task for the
sake of reducing thread amount.
4. Popular instructions, like add or sub, contain lots of scope children so
one or several parallel regions exist. However, most of the common
instruction variants(e.g. add %reg1, %reg2) is on "top" among scope
children which would be encountered pretty early. So sometimes threads are
fired, but the correct instruction is actually immediately selected after
that. Thus lots of time is wasted on joining threads.

Here is our working repository and diff with 3.9 release: https://bitbucket.
org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>
I don't think the current state is ready for code reviewing since there is
no significant speedup. But it's very welcome for folks to discuss about
this idea and also, whether current instruction selection approach had
reached its upper bound of speed.(I ignore fast-isel by mean since it
sacrifices too much on quality of generated code. One of our goals is to
boost the compilation speed while keeping the code quality as much as
possible)

Feel free to comment directly on the repo diff above.

About the "region approach" mentioned in the third item of possible
reasons
above. It's actually the "dev-region-parallel" branch, but it
still has
some bugs on correctness of generated code. I would put more detail about
it if the feedback is sound.

NOTE: There seems to be some serious bugs in concurrent and synchronization
library of old gcc/standard libraries. So it's strongly recommended to use
the latest version of clang to build our work.

B.R
--
Bekket McClane
Department of Computer Science,
National Tsing Hua University
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161129/eac7296a/attachment-0001.html>

Nicolai Hähnle via llvm-dev

2016-Nov-29 20:56 UTC

head link

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

It seems like you're trying to do extremely fine-grained 
parallelization, which usually doesn't end well. It seems to me that it 
would make much more sense to go for parallelization at the module level 
(i.e. multiple functions in parallel) or possibly basic block level or 
possibly at the selection DAG level itself, i.e. at the level of 
selection DAG nodes.

As far as the instruction selection itself is concerned, I've been 
wondering in the past whether somebody has looked into profile-guided 
optimization of the matcher (i.e. profile-guided optimization of the 
matcher table by TableGen).

Cheers,
Nicolai

On 29.11.2016 13:02, Bekket McClane via llvm-dev wrote:> Hi,
> Though there exists lots of researches on parallelizing or scheduling
> optimization passes, If you open up the time matrices of codegen(llc
> -time-passes), you'll find that the most time consuming task is
actually
> instruction selection(40~50% of time) instead of optimization
> passes(10~0%). That's why we're trying to parallelize the
> (target-independent) instruction selection process in aid of JIT
> compilation speed.
>
> The instruction selector of LLVM is an interpreter that interpret the
> MatcherTable which consists of bytecodes generated by TableGen. I'm
> surprised to find that the structure of MatcherTable and the interpreter
> seems to be suitable for parallelization. So we propose a prototype that
> parallelizes the interpreting of OPC_Scope children that are possibly
> time-consuming. Here is some quick overview:
>
> We add two new opcodes: OPC_Fork and OPC_Merge. During DAG optimization
> process(utils/TableGen/DAGISelMatcherOpt.cpp). OPC_Fork would be added
> to the front of scope(OPC_Scope) children which fulfill following
> conditions:
> 1.  Amount of opcodes within the child exceed certain threshold(5 in
> current prototype).
> 2. The child is reside in a sequence of continuous scope children which
> length also exceed certain threshold(7 in current prototype).
> For each valid sequence of scope children, an extra scope child, where
> OPC_Merge is the only opcode, would be appended to it(the sequence).
>
> In the interpreter, when an OPC_Fork is encountered inside a scope
> child, the main thread would dispatch the scope child as a task to a
> central thread pool, then jump to the next child. At the end of a valid
> "parallel sequence(of scope children)" an OPC_Merge must exist
and the
> main thread would stop there and wait other threads to finish.
>
> About the synchronization, read-write lock is mainly used: In each
> checking-style opcode(e.g. OPC_CheckSame, OPC_CheckType, except
> OPC_CheckComplexPat) handlers, a read lock is used, otherwise, a write
> lock is used.
>
> Finally, although the generated code is correct, total consuming time
> barely break even with the original one. Possible reasons may be:
> 1. The original interpreter is pretty fast actually. The thread pool
> dispatching time for each selection task may be too long in comparison
> with the original approach.
> 2. X86 is the only architecture which contains OPC_CheckComplexPat that
> would modify DAG. This constraint force us to add write lock on it which
> would block other threads at the same time. Unfortunately,
> OPC_CheckComplexPat is probably the most time-consuming opcodes in X86
> and perhaps in other architectures, too.
> 3. Too many threads. We're now working on another approach that use
> larger region, consist of multiple scope children, for each parallel
> task for the sake of reducing thread amount.
> 4. Popular instructions, like add or sub, contain lots of scope children
> so one or several parallel regions exist. However, most of the common
> instruction variants(e.g. add %reg1, %reg2) is on "top" among
scope
> children which would be encountered pretty early. So sometimes threads
> are fired, but the correct instruction is actually immediately selected
> after that. Thus lots of time is wasted on joining threads.
>
> Here is our working repository and diff with 3.9
> release:
https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff
>
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>
> I don't think the current state is ready for code reviewing since there
> is no significant speedup. But it's very welcome for folks to discuss
> about this idea and also, whether current instruction selection approach
> had reached its upper bound of speed.(I ignore fast-isel by mean since
> it sacrifices too much on quality of generated code. One of our goals is
> to boost the compilation speed while keeping the code quality as much as
> possible)
>
> Feel free to comment directly on the repo diff above.
>
> About the "region approach" mentioned in the third item of
possible
> reasons above. It's actually the "dev-region-parallel"
branch, but it
> still has some bugs on correctness of generated code. I would put more
> detail about it if the feedback is sound.
>
> NOTE: There seems to be some serious bugs in concurrent and
> synchronization library of old gcc/standard libraries. So it's strongly
> recommended to use the latest version of clang to build our work.
>
> B.R
> --
> Bekket McClane
> Department of Computer Science,
> National Tsing Hua University
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Mehdi Amini via llvm-dev

2016-Nov-29 21:14 UTC

head link

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

> On Nov 29, 2016, at 4:02 AM, Bekket McClane via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Hi,
> Though there exists lots of researches on parallelizing or scheduling
optimization passes, If you open up the time matrices of codegen(llc
-time-passes), you'll find that the most time consuming task is actually
instruction selection(40~50% of time) instead of optimization passes(10~0%).
That's why we're trying to parallelize the (target-independent)
instruction selection process in aid of JIT compilation speed.

How much of this 40-50% is spent in the matcher table? I though most of the
overhead was inherent to SelectionDAG?
Also why having such a fine grain approach instead of trying to perform
instruction selection in parallel across basic blocks or functions?

I suspect you won’t gain much for too much added complexity with this approach.

— 
Mehdi




> 
> The instruction selector of LLVM is an interpreter that interpret the
MatcherTable which consists of bytecodes generated by TableGen. I'm
surprised to find that the structure of MatcherTable and the interpreter seems
to be suitable for parallelization. So we propose a prototype that parallelizes
the interpreting of OPC_Scope children that are possibly time-consuming. Here is
some quick overview:
> 
> We add two new opcodes: OPC_Fork and OPC_Merge. During DAG optimization
process(utils/TableGen/DAGISelMatcherOpt.cpp). OPC_Fork would be added to the
front of scope(OPC_Scope) children which fulfill following conditions:
> 1.  Amount of opcodes within the child exceed certain threshold(5 in
current prototype).
> 2. The child is reside in a sequence of continuous scope children which
length also exceed certain threshold(7 in current prototype).
> For each valid sequence of scope children, an extra scope child, where
OPC_Merge is the only opcode, would be appended to it(the sequence).
> 
> In the interpreter, when an OPC_Fork is encountered inside a scope child,
the main thread would dispatch the scope child as a task to a central thread
pool, then jump to the next child. At the end of a valid "parallel
sequence(of scope children)" an OPC_Merge must exist and the main thread
would stop there and wait other threads to finish.
> 
> About the synchronization, read-write lock is mainly used: In each
checking-style opcode(e.g. OPC_CheckSame, OPC_CheckType, except
OPC_CheckComplexPat) handlers, a read lock is used, otherwise, a write lock is
used.
> 
> Finally, although the generated code is correct, total consuming time
barely break even with the original one. Possible reasons may be:
> 1. The original interpreter is pretty fast actually. The thread pool
dispatching time for each selection task may be too long in comparison with the
original approach.
> 2. X86 is the only architecture which contains OPC_CheckComplexPat that
would modify DAG. This constraint force us to add write lock on it which would
block other threads at the same time. Unfortunately, OPC_CheckComplexPat is
probably the most time-consuming opcodes in X86 and perhaps in other
architectures, too.
> 3. Too many threads. We're now working on another approach that use
larger region, consist of multiple scope children, for each parallel task for
the sake of reducing thread amount.
> 4. Popular instructions, like add or sub, contain lots of scope children so
one or several parallel regions exist. However, most of the common instruction
variants(e.g. add %reg1, %reg2) is on "top" among scope children which
would be encountered pretty early. So sometimes threads are fired, but the
correct instruction is actually immediately selected after that. Thus lots of
time is wasted on joining threads.
> 
> Here is our working repository and diff with 3.9 release:
https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff 
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>
> I don't think the current state is ready for code reviewing since there
is no significant speedup. But it's very welcome for folks to discuss about
this idea and also, whether current instruction selection approach had reached
its upper bound of speed.(I ignore fast-isel by mean since it sacrifices too
much on quality of generated code. One of our goals is to boost the compilation
speed while keeping the code quality as much as possible)
> 
> Feel free to comment directly on the repo diff above.
> 
> About the "region approach" mentioned in the third item of
possible reasons above. It's actually the "dev-region-parallel"
branch, but it still has some bugs on correctness of generated code. I would put
more detail about it if the feedback is sound.
> 
> NOTE: There seems to be some serious bugs in concurrent and synchronization
library of old gcc/standard libraries. So it's strongly recommended to use
the latest version of clang to build our work.
> 
> B.R
> -- 
> Bekket McClane
> Department of Computer Science,
> National Tsing Hua University
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161129/70c46028/attachment.html>

Mehdi Amini via llvm-dev

2016-Nov-29 21:16 UTC

head link

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

> On Nov 29, 2016, at 1:14 PM, Mehdi Amini <mehdi.amini at apple.com>
wrote:
> 
> 
>> On Nov 29, 2016, at 4:02 AM, Bekket McClane via llvm-dev <llvm-dev
at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> 
>> Hi,
>> Though there exists lots of researches on parallelizing or scheduling
optimization passes, If you open up the time matrices of codegen(llc
-time-passes), you'll find that the most time consuming task is actually
instruction selection(40~50% of time) instead of optimization passes(10~0%).
That's why we're trying to parallelize the (target-independent)
instruction selection process in aid of JIT compilation speed.
> 
> 
> How much of this 40-50% is spent in the matcher table? I though most of the
overhead was inherent to SelectionDAG?
> Also why having such a fine grain approach instead of trying to perform
instruction selection in parallel across basic blocks or functions?
> 
> I suspect you won’t gain much for too much added complexity with this
approach.
I forgot to add: did you try to enable fast-isel instead? In the context of a
JIT this is a quite common approach.

— 
Mehdi
> 
> 
> 
> 
> 
>> 
>> The instruction selector of LLVM is an interpreter that interpret the
MatcherTable which consists of bytecodes generated by TableGen. I'm
surprised to find that the structure of MatcherTable and the interpreter seems
to be suitable for parallelization. So we propose a prototype that parallelizes
the interpreting of OPC_Scope children that are possibly time-consuming. Here is
some quick overview:
>> 
>> We add two new opcodes: OPC_Fork and OPC_Merge. During DAG optimization
process(utils/TableGen/DAGISelMatcherOpt.cpp). OPC_Fork would be added to the
front of scope(OPC_Scope) children which fulfill following conditions:
>> 1.  Amount of opcodes within the child exceed certain threshold(5 in
current prototype).
>> 2. The child is reside in a sequence of continuous scope children which
length also exceed certain threshold(7 in current prototype).
>> For each valid sequence of scope children, an extra scope child, where
OPC_Merge is the only opcode, would be appended to it(the sequence).
>> 
>> In the interpreter, when an OPC_Fork is encountered inside a scope
child, the main thread would dispatch the scope child as a task to a central
thread pool, then jump to the next child. At the end of a valid "parallel
sequence(of scope children)" an OPC_Merge must exist and the main thread
would stop there and wait other threads to finish.
>> 
>> About the synchronization, read-write lock is mainly used: In each
checking-style opcode(e.g. OPC_CheckSame, OPC_CheckType, except
OPC_CheckComplexPat) handlers, a read lock is used, otherwise, a write lock is
used.
>> 
>> Finally, although the generated code is correct, total consuming time
barely break even with the original one. Possible reasons may be:
>> 1. The original interpreter is pretty fast actually. The thread pool
dispatching time for each selection task may be too long in comparison with the
original approach.
>> 2. X86 is the only architecture which contains OPC_CheckComplexPat that
would modify DAG. This constraint force us to add write lock on it which would
block other threads at the same time. Unfortunately, OPC_CheckComplexPat is
probably the most time-consuming opcodes in X86 and perhaps in other
architectures, too.
>> 3. Too many threads. We're now working on another approach that use
larger region, consist of multiple scope children, for each parallel task for
the sake of reducing thread amount.
>> 4. Popular instructions, like add or sub, contain lots of scope
children so one or several parallel regions exist. However, most of the common
instruction variants(e.g. add %reg1, %reg2) is on "top" among scope
children which would be encountered pretty early. So sometimes threads are
fired, but the correct instruction is actually immediately selected after that.
Thus lots of time is wasted on joining threads.
>> 
>> Here is our working repository and diff with 3.9 release:
https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff 
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>
>> I don't think the current state is ready for code reviewing since
there is no significant speedup. But it's very welcome for folks to discuss
about this idea and also, whether current instruction selection approach had
reached its upper bound of speed.(I ignore fast-isel by mean since it sacrifices
too much on quality of generated code. One of our goals is to boost the
compilation speed while keeping the code quality as much as possible)
>> 
>> Feel free to comment directly on the repo diff above.
>> 
>> About the "region approach" mentioned in the third item of
possible reasons above. It's actually the "dev-region-parallel"
branch, but it still has some bugs on correctness of generated code. I would put
more detail about it if the feedback is sound.
>> 
>> NOTE: There seems to be some serious bugs in concurrent and
synchronization library of old gcc/standard libraries. So it's strongly
recommended to use the latest version of clang to build our work.
>> 
>> B.R
>> -- 
>> Bekket McClane
>> Department of Computer Science,
>> National Tsing Hua University
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161129/2ffbb840/attachment.html>

Bekket McClane via llvm-dev

2016-Nov-30 00:55 UTC

head link

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

> Nicolai Hähnle <nhaehnle at gmail.com> 於 2016年11月30日 上午4:56 寫道：
> 
> It seems like you're trying to do extremely fine-grained
parallelization, which usually doesn't end well. It seems to me that it
would make much more sense to go for parallelization at the module level (i.e.
multiple functions in parallel) or possibly basic block level or possibly at the
selection DAG level itself, i.e. at the level of selection DAG nodes.
I think JIT compilation tend to compile small amount of DAG nodes(so, small
Function and BasicBlock) each time in no matter method-based or traced-based
strategies, so I guess it wouldn’t get significant benefit from it. (Correct me
if I’m not right)
> 
> As far as the instruction selection itself is concerned, I've been
wondering in the past whether somebody has looked into profile-guided
optimization of the matcher (i.e. profile-guided optimization of the matcher
table by TableGen).
That’s an interesting thing idea. I ‘m also wondering whether we can “cache” the
selection result or selection “path” of matchers like inline cache in dynamic
compilation.
Maybe you can talk more about your thoughts?

B.R
McClane> 
> Cheers,
> Nicolai
> 
> On 29.11.2016 13:02, Bekket McClane via llvm-dev wrote:
>> Hi,
>> Though there exists lots of researches on parallelizing or scheduling
>> optimization passes, If you open up the time matrices of codegen(llc
>> -time-passes), you'll find that the most time consuming task is
actually
>> instruction selection(40~50% of time) instead of optimization
>> passes(10~0%). That's why we're trying to parallelize the
>> (target-independent) instruction selection process in aid of JIT
>> compilation speed.
>> 
>> The instruction selector of LLVM is an interpreter that interpret the
>> MatcherTable which consists of bytecodes generated by TableGen. I'm
>> surprised to find that the structure of MatcherTable and the
interpreter
>> seems to be suitable for parallelization. So we propose a prototype
that
>> parallelizes the interpreting of OPC_Scope children that are possibly
>> time-consuming. Here is some quick overview:
>> 
>> We add two new opcodes: OPC_Fork and OPC_Merge. During DAG optimization
>> process(utils/TableGen/DAGISelMatcherOpt.cpp). OPC_Fork would be added
>> to the front of scope(OPC_Scope) children which fulfill following
>> conditions:
>> 1.  Amount of opcodes within the child exceed certain threshold(5 in
>> current prototype).
>> 2. The child is reside in a sequence of continuous scope children which
>> length also exceed certain threshold(7 in current prototype).
>> For each valid sequence of scope children, an extra scope child, where
>> OPC_Merge is the only opcode, would be appended to it(the sequence).
>> 
>> In the interpreter, when an OPC_Fork is encountered inside a scope
>> child, the main thread would dispatch the scope child as a task to a
>> central thread pool, then jump to the next child. At the end of a valid
>> "parallel sequence(of scope children)" an OPC_Merge must
exist and the
>> main thread would stop there and wait other threads to finish.
>> 
>> About the synchronization, read-write lock is mainly used: In each
>> checking-style opcode(e.g. OPC_CheckSame, OPC_CheckType, except
>> OPC_CheckComplexPat) handlers, a read lock is used, otherwise, a write
>> lock is used.
>> 
>> Finally, although the generated code is correct, total consuming time
>> barely break even with the original one. Possible reasons may be:
>> 1. The original interpreter is pretty fast actually. The thread pool
>> dispatching time for each selection task may be too long in comparison
>> with the original approach.
>> 2. X86 is the only architecture which contains OPC_CheckComplexPat that
>> would modify DAG. This constraint force us to add write lock on it
which
>> would block other threads at the same time. Unfortunately,
>> OPC_CheckComplexPat is probably the most time-consuming opcodes in X86
>> and perhaps in other architectures, too.
>> 3. Too many threads. We're now working on another approach that use
>> larger region, consist of multiple scope children, for each parallel
>> task for the sake of reducing thread amount.
>> 4. Popular instructions, like add or sub, contain lots of scope
children
>> so one or several parallel regions exist. However, most of the common
>> instruction variants(e.g. add %reg1, %reg2) is on "top" among
scope
>> children which would be encountered pretty early. So sometimes threads
>> are fired, but the correct instruction is actually immediately selected
>> after that. Thus lots of time is wasted on joining threads.
>> 
>> Here is our working repository and diff with 3.9
>> release:
https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff
>>
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>>
>> I don't think the current state is ready for code reviewing since
there
>> is no significant speedup. But it's very welcome for folks to
discuss
>> about this idea and also, whether current instruction selection
approach
>> had reached its upper bound of speed.(I ignore fast-isel by mean since
>> it sacrifices too much on quality of generated code. One of our goals
is
>> to boost the compilation speed while keeping the code quality as much
as
>> possible)
>> 
>> Feel free to comment directly on the repo diff above.
>> 
>> About the "region approach" mentioned in the third item of
possible
>> reasons above. It's actually the "dev-region-parallel"
branch, but it
>> still has some bugs on correctness of generated code. I would put more
>> detail about it if the feedback is sound.
>> 
>> NOTE: There seems to be some serious bugs in concurrent and
>> synchronization library of old gcc/standard libraries. So it's
strongly
>> recommended to use the latest version of clang to build our work.
>> 
>> B.R
>> --
>> Bekket McClane
>> Department of Computer Science,
>> National Tsing Hua University
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161130/033ab59c/attachment.html>

Bekket McClane via llvm-dev

2016-Nov-30 01:22 UTC

head link

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

> Mehdi Amini <mehdi.amini at apple.com> 於 2016年11月30日 上午5:14 寫道：
> 
>> 
>> On Nov 29, 2016, at 4:02 AM, Bekket McClane via llvm-dev <llvm-dev
at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> 
>> Hi,
>> Though there exists lots of researches on parallelizing or scheduling
optimization passes, If you open up the time matrices of codegen(llc
-time-passes), you'll find that the most time consuming task is actually
instruction selection(40~50% of time) instead of optimization passes(10~0%).
That's why we're trying to parallelize the (target-independent)
instruction selection process in aid of JIT compilation speed.
> 
> 
> How much of this 40-50% is spent in the matcher table? I though most of the
overhead was inherent to SelectionDAG?
Do you mean DAG operations like adding SDNode to CSENodes? Could you talk a
little bit more?
> Also why having such a fine grain approach instead of trying to perform
instruction selection in parallel across basic blocks or functions?
JIT compilation tends to compile small amount of DAG nodes each time, and also,
most of the JIT strategies tend to merge  several instructions(e.g. all of the
hot instructions) into single basic block or function, so I guess it wouldn’t
get significant speedup on overall performance if we handle only one(or few)
compilation unit each time.(correct me if I'm wrong)

B.R.
McClane> 
> I suspect you won’t gain much for too much added complexity with this
approach.
> 
> — 
> Mehdi
> 
> 
> 
> 
> 
>> 
>> The instruction selector of LLVM is an interpreter that interpret the
MatcherTable which consists of bytecodes generated by TableGen. I'm
surprised to find that the structure of MatcherTable and the interpreter seems
to be suitable for parallelization. So we propose a prototype that parallelizes
the interpreting of OPC_Scope children that are possibly time-consuming. Here is
some quick overview:
>> 
>> We add two new opcodes: OPC_Fork and OPC_Merge. During DAG optimization
process(utils/TableGen/DAGISelMatcherOpt.cpp). OPC_Fork would be added to the
front of scope(OPC_Scope) children which fulfill following conditions:
>> 1.  Amount of opcodes within the child exceed certain threshold(5 in
current prototype).
>> 2. The child is reside in a sequence of continuous scope children which
length also exceed certain threshold(7 in current prototype).
>> For each valid sequence of scope children, an extra scope child, where
OPC_Merge is the only opcode, would be appended to it(the sequence).
>> 
>> In the interpreter, when an OPC_Fork is encountered inside a scope
child, the main thread would dispatch the scope child as a task to a central
thread pool, then jump to the next child. At the end of a valid "parallel
sequence(of scope children)" an OPC_Merge must exist and the main thread
would stop there and wait other threads to finish.
>> 
>> About the synchronization, read-write lock is mainly used: In each
checking-style opcode(e.g. OPC_CheckSame, OPC_CheckType, except
OPC_CheckComplexPat) handlers, a read lock is used, otherwise, a write lock is
used.
>> 
>> Finally, although the generated code is correct, total consuming time
barely break even with the original one. Possible reasons may be:
>> 1. The original interpreter is pretty fast actually. The thread pool
dispatching time for each selection task may be too long in comparison with the
original approach.
>> 2. X86 is the only architecture which contains OPC_CheckComplexPat that
would modify DAG. This constraint force us to add write lock on it which would
block other threads at the same time. Unfortunately, OPC_CheckComplexPat is
probably the most time-consuming opcodes in X86 and perhaps in other
architectures, too.
>> 3. Too many threads. We're now working on another approach that use
larger region, consist of multiple scope children, for each parallel task for
the sake of reducing thread amount.
>> 4. Popular instructions, like add or sub, contain lots of scope
children so one or several parallel regions exist. However, most of the common
instruction variants(e.g. add %reg1, %reg2) is on "top" among scope
children which would be encountered pretty early. So sometimes threads are
fired, but the correct instruction is actually immediately selected after that.
Thus lots of time is wasted on joining threads.
>> 
>> Here is our working repository and diff with 3.9 release:
https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff 
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>
>> I don't think the current state is ready for code reviewing since
there is no significant speedup. But it's very welcome for folks to discuss
about this idea and also, whether current instruction selection approach had
reached its upper bound of speed.(I ignore fast-isel by mean since it sacrifices
too much on quality of generated code. One of our goals is to boost the
compilation speed while keeping the code quality as much as possible)
>> 
>> Feel free to comment directly on the repo diff above.
>> 
>> About the "region approach" mentioned in the third item of
possible reasons above. It's actually the "dev-region-parallel"
branch, but it still has some bugs on correctness of generated code. I would put
more detail about it if the feedback is sound.
>> 
>> NOTE: There seems to be some serious bugs in concurrent and
synchronization library of old gcc/standard libraries. So it's strongly
recommended to use the latest version of clang to build our work.
>> 
>> B.R
>> -- 
>> Bekket McClane
>> Department of Computer Science,
>> National Tsing Hua University
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161130/741a8433/attachment.html>

Apparently Analagous Threads

Search for more maybe matching threads

llvm dev - Nov 2016 - [RFC] Parallelizing (Target-Independent) Instruction Selection

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

Apparently Analagous Threads