Displaying 20 results from an estimated 3000 matches similar to: "enabling interleaved access loop vectorization"
2016 May 26
0
enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Is there a compile-time and/or potential runtime cost that makes
> enableInterleavedAccessVectorization() default to 'false'?
>
> I notice that this is set to true for ARM, AArch64, and PPC.
>
> In particular, I'm wondering if there's a reason it's not enabled for
2016 May 26
2
enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet.
We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutation (shuffle mask). And even if we estimate the number of shuffles, the shuffles are not generated in place. Vectorizer
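To make this concrete, here is a minimal illustrative loop (mine, not taken from the thread) with interleave factor 2; vectorizing it takes wide loads over both streams followed by de-interleaving shuffles, and the cost of those shuffles is exactly what is at issue on X86:

  // Illustrative sketch: a stride-2 loop (interleave factor 2). Vectorization
  // loads contiguous blocks covering both streams and then needs shuffles to
  // separate the even- and odd-indexed lanes.
  void split_even_odd(const float *in, float *even, float *odd, int n) {
    for (int i = 0; i < n; ++i) {
      even[i] = in[2 * i];     // lanes 0, 2, 4, ...
      odd[i]  = in[2 * i + 1]; // lanes 1, 3, 5, ...
    }
  }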
2016 Aug 05
3
enabling interleaved access loop vectorization
Hi Michael,
Some time back I did some experiments with the interleave vectorizer and did not find any degradation;
probably my tests/benchmarks are not extensive enough to cover much.
Elena is the right person to comment on it, as she has already seen cases where it hinders performance.
For the interleave vectorizer on X86 we do not have any specific costing; it goes to BasicTTI, where the costing is not
2016 Aug 05
2
enabling interleaved access loop vectorization
Regarding InterleavedAccessPass - sure, but proper strided/interleaved
access optimization ought to have a positive impact even without target
support.
Case in point - Hal enabled it on PPC last September. An important
difference vs. x86 seems to be that arbitrary shuffles are cheap on PPC,
but, as I said below, I hope we can enable it on x86 with a conservative
cost function, and still get
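As a rough sketch of the kind of conservative costing being discussed (my assumption, not LLVM's actual cost model), one could charge the wide memory operations plus one shuffle per member, so that targets with cheap arbitrary shuffles come out ahead while X86 stays conservative:

  // Rough sketch, not LLVM's actual model: cost of an interleave group of
  // `Factor` members, assuming one wide memory op and one shuffle per member.
  unsigned interleavedGroupCost(unsigned Factor, unsigned WideMemOpCost,
                                unsigned ShuffleCost) {
    return Factor * WideMemOpCost + Factor * ShuffleCost;
  }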
2016 Sep 19
3
[arm, aarch64] Alignment checking in interleaved access pass
Hi,
As a follow-up to Patch D23646 <https://reviews.llvm.org/D23646>, I'm
trying to figure out whether there should be an alignment check and what the
correct approach is.
Some background:
For stores, the pass turns:
%i.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1,
         <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9,
                     i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>
store <12 x i32> %i.vec, <12 x i32>* %ptr
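For context, a source-level loop (my illustration, not from the thread) that gives rise to this kind of factor-3 interleaved store; after vectorization the stored value is built by a shuffle like the <0, 4, 8, 1, 5, 9, ...> mask above, which the pass can lower to a single VST3 on ARM/AArch64:

  // Illustrative only: three planar inputs written to one interleaved buffer
  // (interleave factor 3).
  void interleave3(const int *a, const int *b, const int *c, int *out, int n) {
    for (int i = 0; i < n; ++i) {
      out[3 * i + 0] = a[i];
      out[3 * i + 1] = b[i];
      out[3 * i + 2] = c[i];
    }
  }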
2016 Oct 10
2
[arm, aarch64] Alignment checking in interleaved access pass
Hi Renato,
Thank you for the answers!
First, let me clarify a couple of things and give some context.
The patch is looking at VSTn, rather than VLDn (stores seem to be somewhat
harder to get the "right" patterns for; the pass is doing a good job for loads
already).
The examples you gave come mostly from loop vectorization, which, as I
understand it, was the reason for adding the
2016 Aug 05
2
enabling interleaved access loop vectorization
On 5 August 2016 at 21:00, Demikhovsky, Elena
<elena.demikhovsky at intel.com> wrote:
> As far as I remember, maybe I'm wrong, the vectorizer does not generate
> shuffles for interleaved access. It generates a bunch of extracts and inserts
> that ought to be combined into shuffles afterwards.
That's my understanding as well.
Whatever strategy we take, it will be a mix of telling
2016 Sep 01
2
enabling interleaved access loop vectorization
So it turns out it is a full reproducer after all (choosing to vectorize on AVX), good.
> The details are in PR29025.
Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? ;-)
> But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:
Indeed such
2016 Aug 07
2
enabling interleaved access loop vectorization
We checked the gathered data again. All regressions that we see are in 32-bit mode. The 64-bit mode looks good overall.
- Elena
From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Saturday, August 06, 2016 02:56
To: Renato Golin <renato.golin at linaro.org>
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <mssimpso at codeaurora.org>;
2016 Aug 05
3
enabling interleaved access loop vectorization
On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:
> I agree that we can get *more* improvement with better cost modeling, but
> I'd expect to be able to get *some* improvement the way things are right
> now.
Elena said she saw "some" improvements. :)
> That's why I'm curious about where we saw regressions - I'm wondering
>
2016 Aug 09
2
enabling interleaved access loop vectorization
Thanks Ayal!
I'll take a look at DENBench.
As another data point - I tried enabling this on our internal benchmarks.
I'm seeing one regression, and it seems to be a regression of the "good"
kind - without interleaving we don't vectorize the innermost loop, and with
interleaving we do. The vectorized loop is actually significantly faster
when benchmarked in isolation, but in
2016 Aug 16
2
enabling interleaved access loop vectorization
Hi Ayal, Elena,
I'd really like to enable this by default.
As I wrote above, I didn't see any regressions in internal benchmarks, and
there doesn't seem to be anything in SPEC2006 either. I do see a
performance improvement in an internal benchmark (that is, a real
workload).
Would you be able to provide an example that gets pessimized? I have no
doubt you've seen regressions
2016 Aug 17
2
enabling interleaved access loop vectorization
Thanks Ayal!
On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
> Hi Michael,
>
>
>
> Don’t quite have a full reproducer for you yet. You’re welcome to try and
> see what’s happening in 32 bit mode when enabling interleaving for the
> following, based on “https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ”:
>
>
>
> void rgb2yik
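The reproducer is cut off here; as a purely hypothetical stand-in (not Ayal's actual function), an RGB-to-YIQ conversion loop of the kind referenced would look roughly like this, with the factor-3 interleaved reads of r, g, b and writes of y, i, q being what exercises the interleaved-access costing:

  // Hypothetical stand-in for the truncated rgb2yik reproducer; coefficients
  // follow the RGB-to-YIQ matrix on the Wikipedia page cited above.
  void rgb2yiq_sketch(const float *rgb, float *yiq, int n) {
    for (int p = 0; p < n; ++p) {
      float r = rgb[3 * p + 0], g = rgb[3 * p + 1], b = rgb[3 * p + 2];
      yiq[3 * p + 0] = 0.299f * r + 0.587f * g + 0.114f * b; // Y
      yiq[3 * p + 1] = 0.596f * r - 0.274f * g - 0.322f * b; // I
      yiq[3 * p + 2] = 0.211f * r - 0.523f * g + 0.312f * b; // Q
    }
  }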
2016 Oct 10
2
[arm, aarch64] Alignment checking in interleaved access pass
On Mon, Oct 10, 2016 at 1:14 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 10 October 2016 at 19:39, Alina Sbirlea <alina.sbirlea at gmail.com>
> wrote:
> > Now, for ARM archs Halide is currently generating explicit VSTn
> intrinsics,
> > with some of the patterns I described, and I found no reason why Halide
> > shouldn't generate a single
2015 Apr 16
2
[LLVMdev] Code review for gather and scatter intrinsics
Hi Renato,
I fully agree with you, but indexed loads and stores are the next step.
I'm asking for a review of the gather and scatter code.
Thanks.
- Elena
-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Thursday, April 16, 2015 17:17
To: Demikhovsky, Elena
Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; James Molloy
Subject: Re: [LLVMdev] Code review for gather
2015 Apr 16
2
[LLVMdev] Code review for gather and scatter intrinsics
Hi,
I have the code ready and waiting for review here: http://reviews.llvm.org/D7665.
I presented this work at LLVM Euro and people are interested in this feature.
Can anybody review this code, please?
Thanks.
- Elena
2017 Feb 01
2
RFC: Generic IR reductions
> My proposal was to have a reduction intrinsic that can infer the type from its predecessors.
> For example:
> @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))
And if we don't have %b? We just want to sum all elements of %a? Something like @llvm.reduce(ext <N x double> ( add <N x float> %a, zeroinitializer))
Don't we have a problem with constant
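For reference, the scalar pattern under discussion (my sketch, not from the RFC) is summing every element of a vector into a wider accumulator; with no second vector operand, zeroinitializer is the natural identity value:

  // Illustrative scalar form of the reduction being discussed: widen each
  // element (ext), then add it into the accumulator.
  double sum_widened(const float *a, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
      acc += (double)a[i];
    return acc;
  }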
2016 Feb 26
2
how to force llvm generate gather intrinsic
If I'm understanding correctly, you're saying that vgather* is slow on all
of Excavator, Haswell, Broadwell, and Skylake (client). Therefore, we will
not generate it for any of those machines.
Even if that's true, we should not define "gatherIsSlow()" as "hasAVX2() &&
!hasAVX512()". It could break for some hypothetical future processor that
manages to
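As a concrete illustration (mine, not from the thread) of the access pattern at stake, a loop like the following can only be vectorized with a hardware gather, or with scalar loads plus inserts when the gather is judged too slow:

  // Illustrative only: an indexed (gather-style) access pattern; the loads at
  // table[idx[i]] are non-contiguous, so vectorizing needs a gather or a
  // scalar-load-and-insert fallback.
  void gather_copy(const float *table, const int *idx, float *out, int n) {
    for (int i = 0; i < n; ++i)
      out[i] = table[idx[i]];
  }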
2016 Feb 25
2
how to force llvm generate gather intrinsic
It seems that http://reviews.llvm.org/D15690 only implemented
gather/scatter for AVX-512, but not for AVX/AVX2. Is there any plan to
enable gather for AVX/AVX2? Thanks.
Best,
Zhi
On Thu, Feb 25, 2016 at 8:28 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> I don't think gather has been enabled for AVX2 as of r261875.
> Masked load/store were enabled for AVX with:
>
2016 Feb 25
2
how to force llvm generate gather intrinsic
Yes, masked load/store/gather/scatter are completed.
- Elena
From: zhi chen [mailto:zchenhn at gmail.com]
Sent: Thursday, February 25, 2016 01:20
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Sanjay Patel <spatel at rotateright.com>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] how to