thr3ads.net - llvm dev - [llvm-dev] Pattern transformation between scalar and vector on IR. [Sep 2016]

If this information is useful, please help other people find it:
Share via:

Jojo Ma via llvm-dev

2016-Sep-08 09:32 UTC

[llvm-dev] Pattern transformation between scalar and vector on IR.

Hi All,

I'm tring to use RSQRT instructions on follow case for ARM
(now what using is sqrt):

    1.0 / sqrt(x)

The RSQRT instructions(VRSQRTE/VRSQRTS) are vector type,
but above operation is scalar type. So a transformation must be
done(transform sqrt pattern to rsqrt).

I have completed a patch for this, but I made the transformation in the
backend which will leads to additional latencies.And actually it's not
reasonable doing transformation in backend.
I think it would be better done that on IR. I am a novice to llvm.I don't
know  anything about this subject. If anyone could
give me some advice would be appreciated.

Thanks!

-- 
Best Regards,
Jojo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160908/8f56b21a/attachment.html>

James Molloy via llvm-dev

2016-Sep-08 10:59 UTC

head link

[llvm-dev] Pattern transformation between scalar and vector on IR.

Hi,

The cost model for this transform is really difficult to get right. The
latencies and throughputs for VRSQRTE/SQRT vary between microarchitectures but
in general it is fair to say that;

* There are two possible fast sequences for calculating 1/sqrt(x):
  a) (1 / x)  * sqrt(x)  (DIV, SQRT, MUL, where the DIV and SQRT are
*independent* and can issue in parallel)
  b) VRSQRTE + s*(VRSQRTS + MUL + MUL) where s is the number of newton-raphson
steps required - 2 for 32-bit floats and 4 for doubles.

* SQRT and DIV are iterative instructions and commonly the hardware for this,
because it must iterate, is not pipelined.
  * As a consequence of this, these instructions can also commonly *exit early*
if the calculation converges early.

* SQRT and DIV will almost always have a shorter latency than the equivalent
VRSQRTE sequence due to the sheer number of instructions in that sequence and
the early exit capability of SQRT/DIV.

So the calculation on which to choose depends on several factors:
  1) Is the calculation throughput or latency limited? This loop is throughput
limited - the result of the sqrt is not on the cyclic critical path, so we
expect to be able to vectorize it or at least look ahead and have the core
execute multiple iterations in parallel. We’d probably then want to use VRSQRTE.
     for (int i = 0; i < n; ++i)
       p[i] = 1.0 / q[i];

      This loop is latency limited. Here, we don’t care about throughput as only
one iteration can ever be executed in the core at once due to the critical path.
We’d want to optimise for latency over anything else, so we’d use SQRT + DIV.
      for (int i = 0; i < n; ++i)
        p = 1.0 / p + q[i];

  2) Can a SQRT and DIV execute in parallel on the microarchitecture? both these
instructions use similar hardware, so it’s possible that they both need the same
functional unit that isn’t pipelined. If so, the sqrt() sequence’s latency gets
drastically increased and the profitability calculation changes.

The major one is latency versus throughput. This is very difficult to do at the
IR level, but at the MachineInstr level we have MachineTraceMetrics which is
able to analyze loops and obtain their functional unit usage (“resource height”)
and critical path length (“depth”). Using these two metrics we can determine if
it’d make sense to swap the SQRT for a VRSQRTE.

So in summary it is a hard problem with a difficult cost model, that can only
reasonably be done at the MachineInstr level.

Cheers,

James

On 8 Sep 2016, at 10:32, Jojo Ma <jojo.ma at linaro.org<mailto:jojo.ma at
linaro.org>> wrote:

Hi All,

I'm tring to use RSQRT instructions on follow case for ARM
(now what using is sqrt):

    1.0 / sqrt(x)

The RSQRT instructions(VRSQRTE/VRSQRTS) are vector type,
but above operation is scalar type. So a transformation must be
done(transform sqrt pattern to rsqrt).

I have completed a patch for this, but I made the transformation in the
backend which will leads to additional latencies.And actually it's not
reasonable doing transformation in backend.
I think it would be better done that on IR. I am a novice to llvm.I don't
know  anything about this subject. If anyone could
give me some advice would be appreciated.

Thanks!

--
Best Regards,
Jojo

IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended recipient,
please notify the sender immediately and do not disclose the contents to any
other person, use it for any purpose, or store or copy the information in any
medium. Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160908/2a3f5f10/attachment.html>

Jojo Ma via llvm-dev

2016-Sep-26 12:50 UTC

head link

[llvm-dev] Pattern transformation between scalar and vector on IR.

Hi James & all,

I don't know if you have saw the topic disscusion about Evandro's patch
on
AArch64
([llvm] r268539 - [AArch64] Use the reciprocal estimation machinery
<http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160919/391140.html>).The
patchset is relate
to this topic. And Actually I did a draft patch on ARM which exactly the
same with this before
starting this thread. Any more discussion about this topic will help me
doing things on the right way.

More introduce about what i was doing:
I rasied this thread because I am looking on Bug 27107
<https://llvm.org/bugs/show_bug.cgi?id=27107>- LLVM misses reciprocal
estimate instructions in ISel on ARMv7/8.
My planning about this task is :

   - Emitting rsqrt[es].

   Necessary preparation. I supposed we may not necessay to give a final
   design at this step. Just be able to emit.
   My patch on ARM is exactly the same with Evandro's. But I was troubled
   with scalar-to-vector
   when I tried to support the operation on f32 in rsqrt as my first
   step(rsqrt[es] of ARM is just for vector).
   I would appreciate it if you could help me take a look at it and show me
   the problems.
   https://reviews.llvm.org/D24911

   Or would it  make sense if I do validating on AArch64 and porting to ARM
   after that?

   - Benchmarking on both strategies.(N-body and others)
   Validating whether the replacing profitable or not.

   - Giving a final decision.
   Hi James, Thanks again for your detailed instroduction. I think what you
   instroduct would be our final goal,
   and changes based on MachineCombiner should be done for this.I have no
   specific progress on this yet.


Hi Evandro,

I believe we will be working for the same goal, I think we could work
together to make a patch that
not only won't break the LTO but also based on the context. I'm a new
recruit, any comments from
 this thread or bugzilla are welcome. Thanks!

Best Regards,
Jojo


On 8 September 2016 at 18:59, James Molloy <James.Molloy at arm.com>
wrote:
> Hi,
>
> The cost model for this transform is really difficult to get right. The
> latencies and throughputs for VRSQRTE/SQRT vary between microarchitectures
> but in general it is fair to say that;
>
> * There are two possible fast sequences for calculating 1/sqrt(x):
>   a) (1 / x)  * sqrt(x)  (DIV, SQRT, MUL, where the DIV and SQRT are
> *independent* and can issue in parallel)
>   b) VRSQRTE + s*(VRSQRTS + MUL + MUL) where s is the number of
> newton-raphson steps required - 2 for 32-bit floats and 4 for doubles.
>
> * SQRT and DIV are iterative instructions and commonly the hardware for
> this, because it must iterate, is not pipelined.
>   * As a consequence of this, these instructions can also commonly *exit
> early* if the calculation converges early.
>
> * SQRT and DIV will almost always have a shorter latency than the
> equivalent VRSQRTE sequence due to the sheer number of instructions in that
> sequence and the early exit capability of SQRT/DIV.
>
> So the calculation on which to choose depends on several factors:
>   1) Is the calculation throughput or latency limited? This loop is
> throughput limited - the result of the sqrt is not on the cyclic critical
> path, so we expect to be able to vectorize it or at least look ahead and
> have the core execute multiple iterations in parallel. We’d probably then
> want to use VRSQRTE.
>      for (int i = 0; i < n; ++i)
>        p[i] = 1.0 / q[i];
>
>       This loop is latency limited. Here, we don’t care about throughput
> as only one iteration can ever be executed in the core at once due to the
> critical path. We’d want to optimise for latency over anything else, so
> we’d use SQRT + DIV.
>       for (int i = 0; i < n; ++i)
>         p = 1.0 / p + q[i];
>
>   2) Can a SQRT and DIV execute in parallel on the microarchitecture? both
> these instructions use similar hardware, so it’s possible that they both
> need the same functional unit that isn’t pipelined. If so, the sqrt()
> sequence’s latency gets drastically increased and the profitability
> calculation changes.
>
> The major one is latency versus throughput. This is very difficult to do
> at the IR level, but at the MachineInstr level we have MachineTraceMetrics
> which is able to analyze loops and obtain their functional unit usage
> (“resource height”) and critical path length (“depth”). Using these two
> metrics we can determine if it’d make sense to swap the SQRT for a VRSQRTE.
>
> So in summary it is a hard problem with a difficult cost model, that can
> only reasonably be done at the MachineInstr level.
>
> Cheers,
>
> James
>
> On 8 Sep 2016, at 10:32, Jojo Ma <jojo.ma at linaro.org> wrote:
>
> Hi All,
>
> I'm tring to use RSQRT instructions on follow case for ARM
> (now what using is sqrt):
>
>     1.0 / sqrt(x)
>
> The RSQRT instructions(VRSQRTE/VRSQRTS) are vector type,
> but above operation is scalar type. So a transformation must be
> done(transform sqrt pattern to rsqrt).
>
> I have completed a patch for this, but I made the transformation in the
> backend which will leads to additional latencies.And actually it's not
> reasonable doing transformation in backend.
> I think it would be better done that on IR. I am a novice to llvm.I
don't
> know  anything about this subject. If anyone could
> give me some advice would be appreciated.
>
> Thanks!
>
> --
> Best Regards,
> Jojo
>
>
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy the
> information in any medium. Thank you.
>


-- 
Best Regards,
Jojo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160926/e2a28300/attachment.html>

llvm dev - Sep 2016 - Pattern transformation between scalar and vector on IR.

[llvm-dev] Pattern transformation between scalar and vector on IR.

[llvm-dev] Pattern transformation between scalar and vector on IR.

[llvm-dev] Pattern transformation between scalar and vector on IR.