thr3ads.net - llvm dev - [llvm-dev] [Proposal][RFC] Strided Memory Access [Jun 2016]

If this information is useful, please help other people find it:
Share via:
Saito, Hideki via llvm-dev
2016-Jun-15 20:38 UTC
[llvm-dev] [Proposal][RFC] Strided Memory Access

Ashutosh,

First, I'm all for enabling general stride load/store support for all
targets --- should be just a matter of proper
cost modeling. For that matter, we should enable general gather/scatter support
for all targets.

About the specific approach taken by this RFC:
1) It's best to first state that this is for small constant stride where a
single wide load/store can cover two or more
     valid stride load/store elements. If a wide load/store contain just one
valid element, there aren't any need
     to do a wide load/store.
2) Optimal code generation for strided memory accesses are target dependent.
     Breaking one gather/scatter w/ constant index vector into several IR
instructions like below might hurt such optimization,
     especially when the vectorization factor is greater than the target legal
vector width ---- this can happen
     in mixed data type code (int mixed with double, etc.)
3) Make sure that the last element of unmasked load is a valid stride load
element ---- otherwise,
     out-of-bounds SEGV can happen. Use of masked load intrinsic most likely
degrades any benefits over
     gather intrinsic w/ constant index.
4) Unfortunately, 3) most likely imply that the wide load used here needs to be
based on target HW physical
     vector width. There are reasons why such type legalization at vectorizer
should be avoided.
5) In general, I think leaking this much complexity into IR hurts (rather than
enable) optimization.
     It's best to keep vectorizer's output as simple as vector code can
be such that other optimizers can do their best
     in optimizing vector code.

If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce
stride load/store intrinsic
sometime ago and got a push back. I think it's best to address the following
questions.
a) What can this enable over gather/scatter intrinsic w/ constant index vector
b) What can this enable over hypothetical "stride load/store"
intrinsic
c) Any chance of introducing stride load/store as a form of wide load/store?
    I understand that there is a big hurdle here, but the reward should also be
pretty good here.

One might claim that the cost modeling can be easier, but that is also the
downside since
the generated cost may not match the cost of optimal code.

All these things considered, my suggestion is to start from enabling optimal
code generation (or optimal lowering
in CG prepare) from gather/scatter intrinsic and reflect that in
vectorizer's cost model.

As a vectorizer centric person, I'd like to eventually see
gather/scatter/stride-load/store to be supported
by IR instructions (i.e., not intrinsics), but I'll leave that discussion to
another time so that I won't pollute
Ashutosh's RFC.

Thanks,
Hideki Saito
Technical Lead of Vectorizer Development
Intel Compiler and Languages

-----------------------------------
Message: 7
Date: Wed, 15 Jun 2016 06:16:58 +0000
From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
To: llvm-dev <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access
	Vectorization
Message-ID:
	<BLUPR12MB0644468A76C335E9B7431853FB550 at
BLUPR12MB0644.namprd12.prod.outlook.com>
	
Content-Type: text/plain; charset="utf-8"

Hi,

This is a proposal about implementing strided memory access vectorization.

Currently LLVM does not support strided access vectorization completely, some
support is available via Interleave vectorization.

There are two main overheads with strided vectorizations:
*       An overhead of consolidating data into an operable vector.
*       An overhead of distributing the data elements after the operations.

Because of these overheads LLVM finds strided memory vectorization is not
profitable and generating serial code sequence over vectorized code.
GATHER & SCATTER instruction support is available on few(not all) targets
for consolidation & distribution operations.

In this approach we are trying to handle cases like this:
for (int i = 0; i < len; i++)
     a[i*3] = b[i*2] + c[i*3];

We model strided memory load & store using shuffle & load/mask-store
operations.
*       Load is modeled as loads followed by shuffle.
*       Store is modeled as shuffle followed by mask store.
*       To minimize load and store operation introduced 'SkipFactor'.

'SkipFactor':
*       Multiple load operation required for consolidating data into an operable
vector.
*       Between various loads we skip by few offsets to effective consolidate.
*       SkipFactor is the number of additional offsets required to move from the
previous vector load memory-end address for the next vector load.
*       SkipFactor allows all memory loads to be considered as identical (From
valid element perspective).
*       SkipFactor = (Stride - (VF % Stride)) % Stride)

For example:


How LOAD is modeled:
Load stride 3 (i.e. load for b [ 3 * i ])
  %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
  %6 = bitcast i32* %5 to <4 x i32>*
  %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
  %7 = getelementptr i32, i32* %5, i64 6
  %8 = bitcast i32* %7 to <4 x i32>*
  %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
  %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32>
%stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>

How STORE is modeled:
Store with stride 3 (i.e. store to c [ 3 * i ])
 %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
  %11 = bitcast i32* %10 to <4 x i32>*
  %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32>
undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
  call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x
i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1
true>)
  %12 = getelementptr i32, i32* %10, i64 6
  %13 = bitcast i32* %12 to <4 x i32>*
  %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x
i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
  call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x
i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1
true>)

NOTE: For Stride 3 SkipFactor is 2, that's why the second GEP offset by
6(4+2).

To enable this feature below LLVM changes required.
1)      Identify strided memory access (Its already there i.e. interleave
vectorizer does this)
2)      Costing changes:
a.      Identify number of Load[s]/Store[s] & Shuffle[s] required to model
Load/Store operation by considering SkipFactor.
b.      Return total cost by adding Load[s]/Store[s] and shuffle[s] costs.
3)      Transform:
a.      Generate Shuffle[s] followed by Mask-Store[s] instructions to model a
Store operation.
b.      Generate Load[s] followed by Shuffle[s] instructions to model a Load
operation.

Implemented a working prototype to demonstrate this feature, its available at
below review-thread.
http://reviews.llvm.org/D21363

Use below option to enable this feature:
"-mllvm -enable-interleaved-mem-accesses -mllvm
-enable-strided-vectorization"

Gains observed with prototype:
TSVC kernel S111                1.15x
TSVC kernel S1111               1.42x

If community is interested in the above work, I have below plan:

I'm not sure keeping this feature separate or the part of current interleave
vectorizer ?

1)      Will enable interleave vectorizer on X86 (If this feature is going as an
enhancement on interleave vectorizer)
2)      If it's going with interleave vectorizer, will modify its infra to
support this.
3)      Will enable strided vectorization by submitting changes on costing &
transform.

Please share your valuable thoughts.

Thanks,
Ashutosh
llvm dev - Jun 2016 - [Proposal][RFC] Strided Memory Access

[llvm-dev] [Proposal][RFC] Strided Memory Access