thr3ads.net - llvm dev - [llvm-dev] RFC: New intrinsics masked.expandload and masked.compressstore [Sep 2016]

Demikhovsky, Elena via llvm-dev

2016-Sep-25 18:28 UTC

[llvm-dev] RFC: New intrinsics masked.expandload and masked.compressstore

|
  |Hi Elena,
  |
  |Technically speaking, this seems straightforward.
  |
  |I wonder, however, how target-independent this is in a practical
  |sense; will there be an efficient lowering when targeting any other
  |ISA? I don't want to get into the territory where, because the
  |vectorizer is supposed to be architecture independent, we need to
  |add target-independent intrinsics for all potentially-side-effect-
  |carrying idioms (or just complicated idioms) we want the vectorizer to
  |support on any target. Is there a way we can design the vectorizer so
  |that the targets can plug in their own idiom recognition for these
  |kinds of things, and then, via that interface, let the vectorizer produce
  |the relevant target-dependent intrinsics?

Adding a target-specific plug-in to the vectorizer may be a good idea. We need
target-specific pattern recognition and a target-specific implementation of
“vectorizeMemoryInstruction”. (More functionality may be needed in the future.)
TTI->checkAdditionalVectorizationOpportunities() - detects target-specific
patterns; X86 will find compress/expand and possibly others
TTI->vectorizeMemoryInstruction()  - handles only exotic target-specific cases

Pros:
It will allow us to implement all X86-specific solutions.
The expandload and compressstore intrinsics may be x86-specific and polymorphic:
llvm.x86.masked.expandload()
llvm.x86.masked.compressstore()

Cons:

TTI will need to deal with Loop Info, SCEVs and other loop analysis info that it
does not have today. (I do not like this approach.)
Or we'll need to introduce TLV - Target Loop Vectorizer - a new class that
handles all target-specific cases. This solution seems more reasonable, but too
heavy just for compress/expand.
Do you see any other target plug-in solution?

-Elena
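
To make the proposed plug-in interface concrete, here is a minimal sketch. It is not LLVM code: `TargetVectorizerHooks`, `MemoryAccess`, `TargetPattern` and `X86Hooks` are hypothetical names introduced only for illustration, and the two methods simply mirror the two TTI hooks named in the message above. The assumption is that a target inspects one conditional memory access and reports whether it matches a compress or expand pattern.

```cpp
#include <string>

// Hypothetical description of one memory instruction inside a loop.
// None of these types exist in LLVM; they only model the discussion.
enum class AccessKind { Load, Store };

struct MemoryAccess {
  AccessKind kind;
  bool conditional;        // executed only when a per-iteration test holds
  bool pointerBumpedOnHit; // address advances only when the test holds (*A++)
};

enum class TargetPattern { None, CompressStore, ExpandLoad };

// Hypothetical hook interface modeled on the two proposed TTI calls.
class TargetVectorizerHooks {
public:
  virtual ~TargetVectorizerHooks() = default;

  // Detect a target-specific pattern; a generic target finds nothing.
  virtual TargetPattern
  checkAdditionalVectorizationOpportunities(const MemoryAccess &) const {
    return TargetPattern::None;
  }

  // Name of the intrinsic the vectorizer would emit. An empty string means
  // "not handled here, use the generic path".
  virtual std::string vectorizeMemoryInstruction(TargetPattern) const {
    return "";
  }
};

class X86Hooks : public TargetVectorizerHooks {
public:
  TargetPattern checkAdditionalVectorizationOpportunities(
      const MemoryAccess &m) const override {
    if (!m.conditional || !m.pointerBumpedOnHit)
      return TargetPattern::None;
    return m.kind == AccessKind::Store ? TargetPattern::CompressStore
                                       : TargetPattern::ExpandLoad;
  }

  std::string vectorizeMemoryInstruction(TargetPattern p) const override {
    switch (p) {
    case TargetPattern::CompressStore:
      return "llvm.x86.masked.compressstore";
    case TargetPattern::ExpandLoad:
      return "llvm.x86.masked.expandload";
    case TargetPattern::None:
      break;
    }
    return "";
  }
};
```

The sketch deliberately leaves out loop analyses (Loop Info, SCEVs), which is exactly the part listed under "Cons": a real hook would need them to prove that the pointer advances only on the taken path.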

  |
  |Thanks again,
  |Hal
  |
  |----- Original Message -----
  |> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com>
  |> To: "llvm-dev" <llvm-dev at lists.llvm.org>
  |> Cc: "Ayal Zaks" <ayal.zaks at intel.com>, "Michael Kuperstein" <mkuper at google.com>,
  |> "Adam Nemet (anemet at apple.com)" <anemet at apple.com>,
  |> "Hal Finkel (hfinkel at anl.gov)" <hfinkel at anl.gov>,
  |> "Sanjay Patel (spatel at rotateright.com)" <spatel at rotateright.com>,
  |> "Nadav Rotem" <nadav.rotem at me.com>
  |> Sent: Monday, September 19, 2016 1:37:02 AM
  |> Subject: RFC: New intrinsics masked.expandload and masked.compressstore
  |>
  |>
  |> Hi all,
  |>
  |> AVX-512 ISA introduces new vector instructions VCOMPRESS and VEXPAND
  |> in order to allow vectorization of the following loops with two
  |> specific types of cross-iteration dependencies:
  |>
  |> Compress:
  |> for (int i = 0; i < N; ++i)
  |>   if (t[i])
  |>     *A++ = expr;
  |>
  |> Expand:
  |> for (i = 0; i < N; ++i)
  |>   if (t[i])
  |>     X[i] = *A++;
  |>   else
  |>     X[i] = PassThruV[i];
  |>
  |> On this poster (
  |> http://llvm.org/devmtg/2013-11/slides/Demikhovsky-Poster.pdf )
  |> you’ll find the “compress” and “expand” patterns depicted.
  |>
  |> The RFC proposes to support this functionality by introducing two
  |> intrinsics to LLVM IR:
  |> llvm.masked.expandload.*
  |> llvm.masked.compressstore.*
  |>
  |> The syntax of these two intrinsics is similar to the syntax of
  |> llvm.masked.load.* and masked.store.*, respectively, but the
  |> semantics are different, matching the above patterns.
  |>
  |> %res = call <16 x float> @llvm.masked.expandload.v16f32.p0f32 (float* %ptr, <16 x i1> %mask, <16 x float> %passthru)
  |> void @llvm.masked.compressstore.v16f32.p0f32 (<16 x float> <value>, float* <ptr>, <16 x i1> <mask>)
  |>
  |> The arguments %mask, %value and %passthru all have the same vector
  |> length.
  |> The underlying type of %ptr corresponds to the scalar type of the
  |> vector value.
  |> (In brief; the full syntax description will be provided in subsequent
  |> full documentation.)
  |>
  |> The intrinsics are planned to be target independent, similar to
  |> masked.load/store/gather/scatter. They will be lowered efficiently
  |> on AVX-512 and scalarized on other targets, also akin to the masked.*
  |> intrinsics.
  |> The loop vectorizer will query TTI about the existence of efficient
  |> support for these intrinsics and, if it is provided, will be able to
  |> handle loops with such cross-iteration dependences.
  |>
  |> The first step will include the full documentation and implementation
  |> of the CodeGen part.
  |>
  |> Additional information about expand load (
  |> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=expandload&techs=AVX_512
  |> ) and compress store (
  |> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=compressstore&techs=AVX_512
  |> ) can also be found in the Intel Intrinsics Guide.
  |>
  |>
  |>     * Elena
  |>
  |> ---------------------------------------------------------------------
  |> Intel Israel (74) Limited
  |>
  |> This e-mail and any attachments may contain confidential material
  |for
  |> the sole use of the intended recipient(s). Any review or distribution
  |> by others is strictly prohibited. If you are not the intended
  |> recipient, please contact the sender and delete all copies.
  |
  |--
  |Hal Finkel
  |Lead, Compiler Technology and Programming Languages Leadership
  |Computing Facility Argonne National Laboratory
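
As a reading aid for the RFC quoted above, the intended semantics of the two intrinsics can be written as scalar reference code. This is only a sketch that follows the "Compress" and "Expand" loops in the RFC. The function names `expandLoad` and `compressStore` are illustrative, `std::vector<bool>` stands in for the `<N x i1>` mask, and returning the number of stored elements is a convenience added here (the proposed `compressstore` returns `void`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expanding load: consecutive elements are read from `ptr` and placed into
// the lanes whose mask bit is set; the other lanes take the pass-through value.
template <typename T>
std::vector<T> expandLoad(const T *ptr, const std::vector<bool> &mask,
                          const std::vector<T> &passthru) {
  assert(mask.size() == passthru.size());
  std::vector<T> result(mask.size());
  std::size_t next = 0; // index of the next unread element in memory
  for (std::size_t lane = 0; lane < mask.size(); ++lane)
    result[lane] = mask[lane] ? ptr[next++] : passthru[lane];
  return result;
}

// Compressing store: the lanes whose mask bit is set are written to
// consecutive locations starting at `ptr`. Returns how many were written.
template <typename T>
std::size_t compressStore(const std::vector<T> &value, T *ptr,
                          const std::vector<bool> &mask) {
  assert(mask.size() == value.size());
  std::size_t next = 0; // index of the next free slot in memory
  for (std::size_t lane = 0; lane < mask.size(); ++lane)
    if (mask[lane])
      ptr[next++] = value[lane];
  return next;
}
```

Note that in both functions memory is touched only for enabled lanes, and always contiguously. That is the difference from `llvm.masked.load` and `llvm.masked.store`, where lane `i` always maps to address `ptr + i`.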

Michael Kuperstein via llvm-dev

2016-Sep-26 07:31 UTC

In theory, we could offload several things to such a target plug-in, I'm
just not entirely sure we want to.

Two examples I can think of:

1) This could be a better interface for masked load/stores and gathers.

2) Horizontal reductions. I tried writing
yet-another-horizontals-as-first-class-citizens proposal a couple of months
ago, and the main problem from the previous discussions about this was that
there's no good common representation. E.g. should a horizontal add return
a vector or a scalar, should it return the base type of the vector (assumes
saturation) or a wider integer type, etc. With a plugin, we could have the
vectorizer emit the right target intrinsic, instead of the crazy backend
pattern-matching we have now.
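
The representation question for horizontal reductions can be illustrated with three plausible definitions of a horizontal add over 8-bit lanes. This is a sketch only: the function names are made up for illustration, and scalar loops over `std::vector` stand in for vector operations. The three functions have different result types or overflow behavior, which is the disagreement described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Variant 1: result has the element type and wraps around on overflow.
std::uint8_t horizontalAddWrapping(const std::vector<std::uint8_t> &v) {
  std::uint8_t sum = 0;
  for (std::uint8_t x : v)
    sum = static_cast<std::uint8_t>(sum + x);
  return sum;
}

// Variant 2: result has the element type and saturates at the maximum value.
std::uint8_t horizontalAddSaturating(const std::vector<std::uint8_t> &v) {
  unsigned sum = 0;
  for (std::uint8_t x : v)
    sum = std::min(sum + x, 255u);
  return static_cast<std::uint8_t>(sum);
}

// Variant 3: result has a wider type, so the exact sum is representable.
std::uint32_t horizontalAddWidening(const std::vector<std::uint8_t> &v) {
  std::uint32_t sum = 0;
  for (std::uint8_t x : v)
    sum += x;
  return sum;
}
```

For sixteen lanes that each hold 200, the exact sum is 3200, so the three variants return 128, 255 and 3200 respectively. A single target-independent intrinsic would have to pick one of these.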


Demikhovsky, Elena via llvm-dev

2016-Sep-26 08:38 UTC

We may also want to implement strided memory access on X86; masking allows us to
do this safely.
One day we’ll need to mask FP operations as part of FP exception mode.
Arithmetic operations with saturation are another candidate.


-           Elena
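
One way to read the remark about strided access is sketched below. It assumes that a strided access is built from wide loads over consecutive elements, and that masking keeps the unused lanes (including any lanes past the end of the array) from being touched. The helper names `maskedLoad` and `stridedLoad` and the scalar loops are illustrative; they are not an existing API and not necessarily the lowering Elena has in mind.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of a masked load: memory is read only for lanes whose mask
// bit is set, so masked-off lanes may lie outside the array.
template <typename T>
std::vector<T> maskedLoad(const T *ptr, const std::vector<bool> &mask,
                          T passthru) {
  std::vector<T> result(mask.size(), passthru);
  for (std::size_t lane = 0; lane < mask.size(); ++lane)
    if (mask[lane])
      result[lane] = ptr[lane];
  return result;
}

// Collect every `stride`-th element of data[0..n) using one masked load per
// chunk of `width` consecutive elements. Assumes stride > 0 and width > 0.
template <typename T>
std::vector<T> stridedLoad(const T *data, std::size_t n, std::size_t stride,
                           std::size_t width) {
  std::vector<T> out;
  for (std::size_t base = 0; base < n; base += width) {
    std::vector<bool> mask(width, false);
    for (std::size_t lane = 0; lane < width; ++lane) {
      std::size_t idx = base + lane;
      // Enable only in-bounds lanes that fall on the stride.
      mask[lane] = idx < n && idx % stride == 0;
    }
    std::vector<T> chunk = maskedLoad(data + base, mask, T{});
    for (std::size_t lane = 0; lane < width; ++lane)
      if (mask[lane])
        out.push_back(chunk[lane]);
  }
  return out;
}
```

With seven elements, a stride of 2 and a width of 4, the second chunk covers indices 4 to 7; index 7 is out of bounds and stays masked off, so no element beyond the array is read.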


Hal Finkel via llvm-dev

2016-Sep-26 19:47 UTC

----- Original Message -----
> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Ayal Zaks" <ayal.zaks at intel.com>, "Michael Kuperstein" <mkuper at google.com>,
> "Adam Nemet (anemet at apple.com)" <anemet at apple.com>,
> "Sanjay Patel (spatel at rotateright.com)" <spatel at rotateright.com>,
> "Nadav Rotem" <nadav.rotem at me.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Sunday, September 25, 2016 1:28:58 PM
> Subject: RE: RFC: New intrinsics masked.expandload and masked.compressstore
> 
> Entering target specific plug-in in vectorizer may be a good idea. We
> need target specific pattern recognition and target specific
> implementation of “vectorizeMemoryInstruction”. (It may be more
> functionality in the future)
> TTI->checkAdditionalVectorizationOpportunities() - detects target
> specific patterns;

How would this work in this case? The result would need to affect the legality
and cost of the memory instruction. From your poster, it looks like we're
talking about loops with constructs like this:

for (i = 0; i < N; i++) {
  if (topVal > b[i]) {
    *dst = a[i];
    dst++;
  }
}

Is this loop vectorizable at all without these constructs? It looks like the
target would need to analyze the PHI representing the store's address,
assign the store some reasonable cost, and also provide some alternative SCEVs
(perhaps lower and upper bounds) for use with the dependence checks?

> X86 will find compress/expand and may be others

What others might fit in here?
> TTI->vectorizeMemoryInstruction()  - handle only exotic
> target-specific cases
> 
> Pros:
> It will allow us to implement all X86 specific solutions.
> The expandload and compressstore intrinsics may be x86 specific,
> polymorphic:
> llvm.x86.masked.expandload()
> llvm.x86.masked.compressstore()
> 
> Cons:
> 
> TTI will need to deal with Loop Info, SCEVs and other loop analysis
> info that it does not have today. (I do not like this way)

Giving TTI the loop and other analyses, in itself, does not bother me.
getUnrollingPreferences already takes a Loop*. I'm more concerned about how
cleanly we could integrate everything.

> Or we'll need to introduce TLV - Target Loop Vectorizer - a new class
> that handles all target specific cases. This solution seems more
> reasonable, but too heavy just for compress/expand.

I don't see how this would work without duplicating a lot of the logic in
the vectorizer (unless it is really just doing loop-idiom recognition, in which
case none of this is really relevant). You'd want the cost model used by
the vectorizer, in general, to be integrated with whatever the target was
providing.

Thanks again,
Hal
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
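
The role of the store-address PHI in the `topVal > b[i]` loop from this message can be seen in a chunked emulation. The sketch below is an illustration under stated assumptions, not vectorizer output: `compressChunked` processes `VF` elements per step with scalar inner loops standing in for a vector compare and a compressing store, and all names are hypothetical. The key point is that the destination index advances by the number of set mask bits per chunk rather than by a fixed stride.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar form of the loop; returns the number of elements stored.
std::size_t compressScalar(const std::vector<int> &a, const std::vector<int> &b,
                           int topVal, std::vector<int> &dst) {
  assert(a.size() == b.size() && dst.size() >= a.size());
  std::size_t out = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    if (topVal > b[i])
      dst[out++] = a[i];
  return out;
}

// The same loop processed VF elements at a time.
std::size_t compressChunked(const std::vector<int> &a, const std::vector<int> &b,
                            int topVal, std::vector<int> &dst, std::size_t VF) {
  assert(a.size() == b.size() && dst.size() >= a.size() && VF > 0);
  std::size_t out = 0; // plays the role of the PHI for the store address
  for (std::size_t base = 0; base < a.size(); base += VF) {
    std::size_t lanes = std::min(VF, a.size() - base);
    // Step 1: compute the whole mask for this chunk (the vector compare).
    std::vector<bool> mask(lanes);
    for (std::size_t l = 0; l < lanes; ++l)
      mask[l] = topVal > b[base + l];
    // Step 2: compressing store of the enabled lanes, starting at dst[out].
    std::size_t written = 0;
    for (std::size_t l = 0; l < lanes; ++l)
      if (mask[l])
        dst[out + written++] = a[base + l];
    // Step 3: the address advances by the population count of the mask.
    out += written;
  }
  return out;
}
```

Because `out` can grow by anything from 0 to `VF` per chunk, the store address has no affine form. That is consistent with the suggestion above that dependence checks would need lower and upper bounds instead of an exact SCEV.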

Hal Finkel via llvm-dev

2016-Sep-26 19:55 UTC

----- Original Message -----
> From: "Michael Kuperstein" <mkuper at google.com>
> To: "Elena Demikhovsky" <elena.demikhovsky at intel.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "Ayal Zaks" <ayal.zaks at intel.com>,
> "Adam Nemet (anemet at apple.com)" <anemet at apple.com>,
> "Sanjay Patel (spatel at rotateright.com)" <spatel at rotateright.com>,
> "Nadav Rotem" <nadav.rotem at me.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Monday, September 26, 2016 2:31:41 AM
> Subject: Re: RFC: New intrinsics masked.expandload and masked.compressstore
> In theory, we could offload several things to such a target plug-in,
> I'm just not entirely sure we want to.
> Two examples I can think of:
> 1) This could be a better interface for masked load/stores and
> gathers.
> 2) Horizontal reductions. I tried writing
> yet-another-horizontals-as-first-class-citizens proposal a couple of
> months ago, and the main problem from the previous discussions about
> this was that there's no good common representation. E.g. should a
> horizontal add return a vector or a scalar, should it return the
> base type of the vector (assumes saturation) or a wider integer
> type, etc. With a plugin, we could have the vectorizer emit the
> right target intrinsic, instead of the crazy backend
> pattern-matching we have now.

I don't think we want to offload either of these things to the targets to
produce target-specific intrinsics - both are fairly generic. There's value
in using IR and then pattern-matching the result later because it also means
that we pick up cases where the same pattern comes from people using C-level
vector intrinsics, other portable frontends, etc. We don't want every
frontend wishing to emit a horizontal reduction to need to use target-specific
intrinsics for different targets. Our vectorizer should not be special in this
regard.

However, this does bring up another issue with our current cost model: it
currently estimates costs one instruction at a time, and so can't take
advantage of lower costs associated with target instructions that have
complicated behaviors (FMAs, saturating arithmetic, byte-swapping loads, etc.).
This is a separate problem, in a sense, but perhaps there's a common
solution.

-Hal 
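
The cost-model limitation mentioned above can be shown with a toy model. Everything here is an assumption made for illustration: the `Op` enum, the unit costs, and the rule that a multiply immediately followed by an add is charged as one FMA (a real model would check data dependences, not adjacency). The sketch only contrasts summing per-instruction costs with a model that recognizes one multi-instruction pattern.

```cpp
#include <cstddef>
#include <vector>

enum class Op { Load, Mul, Add, Store };

// Hypothetical unit cost: every instruction costs 1 in this toy model.
unsigned unitCost(Op) { return 1; }

// Cost estimated one instruction at a time.
unsigned perInstructionCost(const std::vector<Op> &seq) {
  unsigned total = 0;
  for (Op op : seq)
    total += unitCost(op);
  return total;
}

// Cost estimated with one fused pattern: a Mul immediately followed by an
// Add is charged as a single fused multiply-add.
unsigned patternAwareCost(const std::vector<Op> &seq) {
  unsigned total = 0;
  for (std::size_t i = 0; i < seq.size(); ++i) {
    if (seq[i] == Op::Mul && i + 1 < seq.size() && seq[i + 1] == Op::Add) {
      total += 1; // assumed cost of one FMA
      ++i;        // the Add was consumed by the fused instruction
    } else {
      total += unitCost(seq[i]);
    }
  }
  return total;
}
```

For the sequence load, load, mul, add, store, the per-instruction estimate is 5 while the pattern-aware estimate is 4. On a target with FMA, the first number overstates the cost of the vectorized loop.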
> On Sun, Sep 25, 2016 at 9:28 PM, Demikhovsky, Elena <
> elena.demikhovsky at intel.com > wrote:
> > |Hi Elena,
> > |
> > |Technically speaking, this seems straightforward.
> > |
> > |I wonder, however, how target-independent this is in a practical
> > |sense; will there be an efficient lowering when targeting any other
> > |ISA? I don't want to get into the territory where, because the
> > |vectorizer is supposed to be architecture independent, we need to
> > |add target-independent intrinsics for all potentially-side-effect-
> > |carrying idioms (or just complicated idioms) we want the vectorizer to
> > |support on any target. Is there a way we can design the vectorizer so
> > |that the targets can plug in their own idiom recognition for these
> > |kinds of things, and then, via that interface, let the vectorizer produce
> > |the relevant target-dependent intrinsics?
> >
> > Entering target specific plug-in in vectorizer may be a good idea. We
> > need target specific pattern recognition and target specific
> > implementation of “vectorizeMemoryInstruction”. (It may be more
> > functionality in the future)
> > TTI->checkAdditionalVectorizationOppotunities() - detects target
> > specific patterns; X86 will find compress/expand and may be others
> > TTI->vectorizeMemoryInstruction() - handle only exotic
> > target-specific cases
> >
> > Pros:
> > It will allow us to implement all X86 specific solutions.
> > The expandload and compresssrore intrinsics may be x86 specific,
> > polymorphic:
> > llvm.x86.masked.expandload()
> > llvm.x86.masked.compressstore()
> >
> > Cons:
> > TTI will need to deal with Loop Info, SCEVs and other loop analysis
> > info that it does not have today. (I do not like this way)
> > Or we'll need to introduce TLV - Target Loop Vectorizer - a new class
> > that handles all target specific cases. This solution seems more
> > reasonable, but too heavy just for compress/expand.
> >
> > Do you see any other target plug-in solution?
> >
> > -Elena
> >
> > |
> > |Thanks again,
> > |Hal
> > |
> > | ----- Original Message -----
> > |> From: "Elena Demikhovsky" < elena.demikhovsky at intel.com >
> > |> To: "llvm-dev" < llvm-dev at lists.llvm.org >
> > |> Cc: "Ayal Zaks" < ayal.zaks at intel.com >, "Michael Kuperstein"
> > |> < mkuper at google.com >, "Adam Nemet ( anemet at apple.com )"
> > |> < anemet at apple.com >, "Hal Finkel ( hfinkel at anl.gov )"
> > |> < hfinkel at anl.gov >, "Sanjay Patel ( spatel at rotateright.com )"
> > |> < spatel at rotateright.com >, "Nadav Rotem"
> > |> < nadav.rotem at me.com >
> > |> Sent: Monday, September 19, 2016 1:37:02 AM
> > |> Subject: RFC: New intrinsics masked.expandload and
> > |> masked.compressstore
> > |>
> > |> Hi all,
> > |>
> > |> AVX-512 ISA introduces new vector instructions VCOMPRESS and VEXPAND
> > |> in order to allow vectorization of the following loops with two
> > |> specific types of cross-iteration dependencies:
> > |>
> > |> Compress:
> > |> for (int i=0; i<N; ++i)
> > |>   if (t[i])
> > |>     *A++ = expr;
> > |>
> > |> Expand:
> > |> for (i=0; i<N; ++i)
> > |>   if (t[i])
> > |>     X[i] = *A++;
> > |>   else
> > |>     X[i] = PassThruV[i];
> > |>
> > |> On this poster (
> > |> http://llvm.org/devmtg/2013-11/slides/Demikhovsky-Poster.pdf ) you’ll
> > |> find depicted “compress” and “expand” patterns.
> > |>
> > |> The RFC proposes to support this functionality by introducing two
> > |> intrinsics to LLVM IR:
> > |> llvm.masked.expandload.*
> > |> llvm.masked.compressstore.*
> > |>
> > |> The syntax of these two intrinsics is similar to the syntax of
> > |> llvm.masked.load.* and masked.store.*, respectively, but the semantics
> > |> are different, matching the above patterns.
> > |>
> > |> %res = call <16 x float> @llvm.masked.expandload.v16f32.p0f32 (float*
> > |> %ptr, <16 x i1>%mask, <16 x float> %passthru)
> > |> void @llvm.masked.compressstore.v16f32.p0f32 (<16 x float> <value>,
> > |> float* <ptr>, <16 x i1> <mask>)
> > |>
> > |> The arguments - %mask, %value and %passthru all have the same vector
> > |> length.
> > |> The underlying type of %ptr corresponds to the scalar type of the
> > |> vector value.
> > |> (In brief; the full syntax description will be provided in subsequent
> > |> full documentation.)
> > |>
> > |> The intrinsics are planned to be target independent, similar to
> > |> masked.load/store/gather/scatter. They will be lowered effectively on
> > |> AVX-512 and scalarized on other targets, also akin to masked.*
> > |> intrinsics.
> > |> Loop vectorizer will query TTI about existence of effective support
> > |> for these intrinsics, and if provided will be able to handle loops
> > |> with such cross-iteration dependences.
> > |>
> > |> The first step will include the full documentation and implementation
> > |> of CodeGen part.
> > |>
> > |> An additional information about expand load (
> > |> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=expandload&techs=AVX_512
> > |> ) and compress store (
> > |> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=compressstore&techs=AVX_512
> > |> ) you also can find in the Intel Intrinsic Guide.
> > |>
> > |> * Elena
> > |>
> > |> ---------------------------------------------------------------------
> > |> Intel Israel (74) Limited
> > |>
> > |> This e-mail and any attachments may contain confidential material for
> > |> the sole use of the intended recipient(s). Any review or distribution
> > |> by others is strictly prohibited. If you are not the intended
> > |> recipient, please contact the sender and delete all copies.
> > |
> > |--
> > |Hal Finkel
> > |Lead, Compiler Technology and Programming Languages Leadership
> > |Computing Facility Argonne National Laboratory
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 

Demikhovsky, Elena via llvm-dev

2016-Sep-26 20:55 UTC

[llvm-dev] RFC: New intrinsics masked.expandload and masked.compressstore

|
  |How would this work in this case? The result would need to affect the
  |legality and cost of the memory instruction. From your poster, it looks
  |like we're talking about loops with constructs like this:
  |
  |for (i =0; i < N; i++) {
  | if (topVal > b[i]) {
  |   *dst = a[i];
  |   dst++;
  | }
  |}
  |
  |is this loop vectorizable at all without these constructs?

Good question. Today it isn't. In theory it could be, if we knew that only
a small part of the loop has a cross-iteration dependency or a similar issue: a
loop may be vectorized and still contain scalar pieces inside.
But that would require a full reconstruction of the cost model.
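
As an illustration of why a compressing store makes the quoted loop tractable, here is a scalar model in C++. It is a sketch of the semantics only, not the proposed intrinsic's lowering; the function names and the vector factor of 16 are assumptions. Within each chunk, the active lanes are written contiguously, so the only value carried between chunks is the advanced `dst`.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 16; // assumed vector factor (16 x float)

// Scalar model of one compressing store: lanes whose mask bit is set are
// written to consecutive locations starting at dst. Returns the advanced
// destination pointer (dst plus the number of active lanes).
inline float *compressStoreChunk(float *dst, const float *value,
                                 const bool *mask, std::size_t lanes) {
  assert(lanes <= VF && "chunk wider than the assumed vector factor");
  for (std::size_t lane = 0; lane < lanes; ++lane)
    if (mask[lane])
      *dst++ = value[lane];
  return dst;
}

// The quoted loop, processed up to VF elements at a time.
// Returns the number of elements written to dst.
inline std::size_t compressLoop(float *dst, const float *a, const float *b,
                                float topVal, std::size_t n) {
  float *const begin = dst;
  for (std::size_t i = 0; i < n; i += VF) {
    const std::size_t lanes = (n - i < VF) ? n - i : VF; // tail chunk
    bool mask[VF] = {};
    for (std::size_t lane = 0; lane < lanes; ++lane)
      mask[lane] = topVal > b[i + lane]; // models the vector compare
    dst = compressStoreChunk(dst, a + i, mask, lanes);
  }
  return static_cast<std::size_t>(dst - begin);
}
```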

  | It looks like
  |the target would need to analyze the PHI representing the store's
  |address, assign the store some reasonable cost, and also provide
  |some alternative SCEVs (perhaps lower and upper bounds) for use
  |with the dependence checks?

First of all, this loop must pass the legality check. Legality will need
additional work in order to detect the compress/expand pattern in a loop with
a cross-iteration dependency.
Once the pattern is detected, we mark the "store" as a "compressing
store" and TTI will give a cost for the compressing store.
  |
  |> X86 will find compress/expand and may be others
  |
  |What others might fit in here?
The compress/expand are special patterns that will require a separate analysis.
I was thinking of other X86-specific patterns that might be detected, such as
strided memory access with masks or arithmetic with saturation. But again, I'm
not sure that building a plug-in would not be overkill in this case.
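
For reference, the expand half of the pattern can be modeled the same way. The following C++ is a scalar sketch of the semantics described in the RFC (consecutive memory elements go to the active lanes, inactive lanes take the pass-through value); the function name and the use of `std::vector` are assumptions made for illustration, not part of the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of an expanding load: consecutive elements starting at src
// are placed into the lanes whose mask bit is set; inactive lanes take the
// corresponding pass-through value. src is not advanced here; a caller
// would step it by the number of set mask bits.
inline std::vector<float> expandLoad(const float *src,
                                     const std::vector<bool> &mask,
                                     const std::vector<float> &passthru) {
  assert(mask.size() == passthru.size() && "mask and passthru lengths differ");
  std::vector<float> result(mask.size());
  std::size_t next = 0; // index of the next unread element at src
  for (std::size_t lane = 0; lane < mask.size(); ++lane)
    result[lane] = mask[lane] ? src[next++] : passthru[lane];
  return result;
}
```

Only as many elements are read from `src` as there are set mask bits, which is what distinguishes this from an ordinary masked load.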
  |
  |> TTI->vectorizeMemoryInstruction()  - handle only exotic
  |> target-specific cases
  |>
  |> Pros:
  |> It will allow us to implement all X86 specific solutions.
  |> The expandload and compresssrore intrinsics may be x86 specific,
  |> polymorphic:
  |> llvm.x86.masked.expandload()
  |> llvm.x86.masked.compressstore()
  |>
  |> Cons:
  |>
  |> TTI will need to deal with Loop Info, SCEVs and other loop analysis
  |> info that it does not have today. (I do not like this way)
  |
  |Giving TTI the loop and other analyses, in itself, does not bother me.
  |getUnrollingPreferences takes a Loop*. I'm more concerned about
  |how cleanly we could integrate everything.
  |
  |> Or we'll need to introduce TLV - Target Loop Vectorizer - a new class
  |> that handles all target specific cases. This solution seems more
  |> reasonable, but too heavy just for compress/expand.
  |
  |I don't see how this would work without duplicating a lot of the logic
  |in the vectorizer (unless it is really just doing loop-idiom recognition,
  |in which case none of this is really relevant). You'd want the cost
  |model used by the vectorizer, in general, to be integrated with
  |whatever the target was providing.
  |
  |Thanks again,
  |Hal
  |
  |
  |--
  |Hal Finkel
  |Lead, Compiler Technology and Programming Languages Leadership
  |Computing Facility Argonne National Laboratory