thr3ads.net - llvm dev - [llvm-dev] Some questions about software pipeline in LLVM 4.0.0 [Jun 2017]


Brendon Cahoon via llvm-dev

2017-Jun-26 22:22 UTC

[llvm-dev] Some questions about software pipeline in LLVM 4.0.0

Hi Ehsan,

 

In some cases modulo scheduling will insert copy instructions that end up as
real copies in the final code.  It is unavoidable in some cases.  For example,
let's say an instruction defining a value is scheduled in the first
iteration, but one of its uses is scheduled two iterations later. In this
case, the kernel needs to create a copy because there will be two values
live in the kernel, from two different iterations.

 

        = R1     // use of the value from iteration n-2

        = R0     // use of the value from iteration n-1

  R0 = insn  // def at iteration n

  R1 = R0  

 

If all the uses of an instruction occur at most one iteration away, and the
uses appear before the definition, then the copies should be coalesced away.
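To make the two-live-values case concrete, here is a minimal Python sketch that simulates the four-line kernel above. It is only an illustration, not pipeliner code: `run_kernel` is a made-up name, `insn` is modeled as simply producing the iteration number, and the copy is assumed to read R0 before the new definition lands (as it would if both sat in the same packet), so in this strictly sequential model the copy is written first.

```python
def run_kernel(iterations):
    """Simulate the R0/R1 kernel; the value defined at iteration n is just n."""
    r0 = r1 = None  # nothing has been defined before the first pass
    trace = []
    for n in range(iterations):
        use_n2 = r1  # "= R1": value defined two iterations ago
        use_n1 = r0  # "= R0": value defined one iteration ago
        r1 = r0      # the copy: keep the older value alive for one more pass
        r0 = n       # "R0 = insn": definition at iteration n
        trace.append((n, use_n1, use_n2))
    return trace


if __name__ == "__main__":
    for n, from_prev, from_prev2 in run_kernel(4):
        print(n, from_prev, from_prev2)
```

Without the `r1 = r0` line there is only one register holding the value, so the use that wants the n-2 value has nothing to read. That is the copy that cannot be coalesced.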

 

In the examples that you show below, it all depends on which iteration each
instruction is scheduled in and/or the order in which the instructions are
scheduled.

  %vreg73<def> = PHI %vreg59, <BB#5>, %vreg62, <BB#6>;

  %vreg61<def> = INSN1 %vreg1, %vreg73;

  %vreg62<def> = INSN2 %vreg73, %vreg5;

  %vreg64<def> = INSN1 %vreg2, %vreg73;

 

For some reason, the instruction defining vreg64 was scheduled after the
instruction defining vreg62, which causes the copy to be generated.  Then,
the question is why did that happen? That can be hard to answer without
seeing the debug output from the pipeliner. In what order were the
instructions scheduled? I would assume that it's either vreg73, vreg61,
vreg64, and then vreg62, or it's the opposite order.  If that's the case,
then there was a cycle gap in between the scheduling of vreg61 and vreg64,
so vreg62 was inserted in between them. Perhaps there are multi-cycle
latencies that left the hole?  Also, can multiple instructions be executed
in parallel in the same cycle?
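The ordering condition can be checked mechanically. The following is a rough Python sketch, not anything taken from the MachinePipeliner: `needs_copy` is a hypothetical helper, and instructions are named by the vreg they define. It assumes that the PHI result and its loop-carried input can share one register only if no user of the PHI result appears after the loop-carried definition in the kernel.

```python
def needs_copy(kernel_order, phi_users, carried_def):
    """Return True if some user of the PHI value comes after the loop-carried def.

    kernel_order: instructions in final block order, named by the vreg they define.
    phi_users:    instructions that read the PHI result.
    carried_def:  instruction defining the PHI's loop-carried input.
    An instruction that both reads the PHI and defines the carried value is
    fine, because it reads the old value before writing the new one.
    """
    def_pos = kernel_order.index(carried_def)
    return any(kernel_order.index(user) > def_pos for user in phi_users)


# Order before modulo scheduling (PHI vreg6, loop-carried def vreg17).
before = ["vreg25", "vreg26", "vreg17"]
# Order after modulo scheduling (PHI vreg73, loop-carried def vreg62).
after = ["vreg61", "vreg62", "vreg64"]

if __name__ == "__main__":
    print(needs_copy(before, before, "vreg17"))  # all PHI users precede the def
    print(needs_copy(after, after, "vreg62"))    # vreg64 reads the PHI after the def
```

In the "after" order, vreg64 still reads vreg73 once vreg62 has been defined, so the two values overlap and cannot be coalesced into one register.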

 

Let me know if any of that isn't clear.  I apologize for the delay in
replying to your original email.

 

Thanks,

Brendon

 

From: Ehsan Amiri [mailto:ehsan.amiri at huawei.com] 
Sent: Monday, June 19, 2017 1:55 AM
To: Brendon Cahoon <bcahoon at codeaurora.org>
Cc: llvm-dev at lists.llvm.org
Subject: RE: [llvm-dev] Some questions about software pipeline in LLVM 4.0.0

 

Hi Brendon

 

Certainly, there are some real copies that end up being generated, but I
think it's better to exclude the copies from the schedule since most will be
eliminated. 

 

I was wondering what the cause of the real copies being generated was
in your experience? Something that I noticed when experimenting
with LLVM on our out-of-tree backend was that there are copy instructions
generated **because of** modulo scheduling. 

 

For example before modulo scheduling I have 

 

%vreg6<def> = PHI %vreg23, <BB#1>, %vreg17

%vreg25<def> = INSN1 %vreg1, %vreg6;

%vreg26<def> = INSN1 %vreg2, %vreg6     <-- same opcode as previous insn

%vreg17<def> = INSN2 %vreg6, %vreg5;

 

So for the phi node here, if we do phi elimination and register coalescing,
we won't have any copy insn left. But after modulo scheduling, the
instructions above now appear like this:

 

%vreg73<def> = PHI %vreg59, <BB#5>, %vreg62, <BB#6>;

%vreg61<def> = INSN1 %vreg1, %vreg73;

%vreg62<def> = INSN2 %vreg73, %vreg5;

%vreg64<def> = INSN1 %vreg2, %vreg73;

 

Now if you look right after the third insn after modulo scheduling, both
vreg73 and vreg62 are live here. So when we remove the corresponding phi
instruction, we end up with a copy instruction that cannot be removed by
register coalescing. 

 

IIUC, this is a byproduct of modulo scheduling. I have not really started
tuning modulo scheduling for our target, so I don't know whether this is a
result of modulo scheduling not being tuned. Have you seen this type of
copy? Any insights are greatly appreciated.

 

Thanks

Ehsan

 

 


Ehsan Amiri via llvm-dev

2017-Jun-27 12:56 UTC


[llvm-dev] Some questions about software pipeline in LLVM 4.0.0

Hi Brendon

Thanks for the answer. I completely agree with your comments. The main reason
that I brought up this issue is the following: Inserting a COPY instruction that
cannot be eliminated means that the loop has an instruction that was not taken
into account during modulo scheduling analysis. If we see this kind of copy
frequently enough, do you think it is worthwhile to work on the algorithm, so
that these instructions are predicted and taken into account during scheduling?
Or maybe we already do this and I am not aware of it?

Some other remarks/questions:

IIUC, this kind of copy will be generated even if we implement SMS after
register coalescing. Is this correct?

So far we have enabled the machine pipeliner for our backend, and we see
this kind of copy generated frequently for our workloads. Sometimes multiple
copies are inserted in a relatively small loop. IIUC, you don't see it
frequently though. Is that correct?

Thanks
Ehsan




Brendon Cahoon via llvm-dev

2017-Jun-27 14:26 UTC


[llvm-dev] Some questions about software pipeline in LLVM 4.0.0

Hi Ehsan,

 
> If we see these kind of copies frequently enough, do you think it is
> worthwhile to work on the algorithm, so these instructions are predicted and
> taken into account during the scheduling?

 

If you do see them frequently, then it would be worthwhile to find a way to
eliminate them. I'm not sure how easy it would be to predict them, though,
when the MII is computed.  My guess is that you want some solution where the
generated schedule doesn't require the extra copies. Without some more
information, it's hard to see why the algorithm is generating a schedule
that requires extra copies so frequently.
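For readers less familiar with the term, the MII (minimum initiation interval) is the larger of a resource bound and a recurrence bound. The Python sketch below uses the textbook simplification with a single resource class. The function names, the issue width of 1, and the latency and distance values are illustrative assumptions, not numbers from Ehsan's target or from the LLVM implementation.

```python
from math import ceil


def res_mii(num_ops, issue_width):
    """Resource bound: the loop body needs at least this many cycles to issue."""
    return ceil(num_ops / issue_width)


def rec_mii(recurrences):
    """Recurrence bound over (total latency, iteration distance) pairs."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=0)


def mii(num_ops, issue_width, recurrences):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(res_mii(num_ops, issue_width), rec_mii(recurrences))


if __name__ == "__main__":
    # Three real instructions on a hypothetical single-issue machine, with one
    # recurrence of latency 1 carried across 1 iteration.
    print(mii(3, 1, [(1, 1)]))
    # The same loop if one unexpected COPY had been counted up front.
    print(mii(4, 1, [(1, 1)]))
```

Under these assumptions, a copy that appears only after scheduling is an operation the resource bound never counted, which is the concern raised in the question.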

 
> Or maybe we already do this and I am not aware of it?
 

In some cases, the pipeliner does attempt to eliminate unnecessary copies.
For Hexagon, or any machine that allows multiple instructions per cycle, the
pipeliner attempts to order the generated instructions to minimize copies.
For example, Hexagon allows up to 4 instructions to execute "in parallel".
So, in the same cycle, there can be a use of a register and a definition of
that same register (the uses are read first, then the definition occurs). In
effect, the use is for the value generated in the previous iteration. Since
the pipeliner generates a linear list of instructions (i.e., serial
semantics), the final order needs to make sure that the use is generated
before the definition. Otherwise, an extra copy is generated.
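One way to picture that ordering step is as a small dependency sort over the instructions that share a cycle. The following Python sketch is only an illustration of the idea and makes no claim about how the pipeliner implements it: `order_cycle` and the instruction names are hypothetical, and it relies on `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import CycleError, TopologicalSorter


def order_cycle(insns):
    """Linearize instructions issued in one cycle so readers precede writers.

    insns maps an instruction name to (defs, uses), two lists of register names.
    Returns a valid order, or None when no order works (a copy would be needed).
    """
    # preds[w] holds every instruction that must be emitted before w.
    preds = {name: set() for name in insns}
    for writer, (defs, _) in insns.items():
        for reader, (_, uses) in insns.items():
            if reader != writer and set(defs) & set(uses):
                preds[writer].add(reader)  # reader must see the old value
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None


if __name__ == "__main__":
    same_cycle = {
        "def_r0": (["R0"], []),   # R0 = insn
        "use_r0": ([], ["R0"]),   #    = R0 (value from the previous iteration)
    }
    print(order_cycle(same_cycle))
```

If two instructions in the same cycle each read what the other writes, no linear order preserves the parallel semantics and the function returns `None`, which corresponds to the case where an extra copy is required.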

 
> IIUC, these kind of copies will be generated even if we implement SMS
> after register coalescing. Is this correct?

 

That is correct.

 
> you don't see it frequently though. Is that correct?
 

Correct. For Hexagon, we don't see the extra copies very frequently.  In
your earlier example though, it would be interesting to see why the
algorithm puts the definition prior to the last use.  If possible, it's
better to schedule the definition after the last use. Of course, in some
cases, that may not generate an efficient schedule.

 

Thanks,

Brendon

 

