Displaying 20 results from an estimated 3000 matches similar to: "enabling interleaved access loop vectorization"
2016 May 26
0
enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Is there a compile-time and/or potential runtime cost that makes
> enableInterleavedAccessVectorization() default to 'false'?
>
> I notice that this is set to true for ARM, AArch64, and PPC.
>
> In particular, I'm wondering if there's a reason it's not enabled for
2016 May 26
2
enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet.
We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutation (shuffle mask). And even if we estimate the number of shuffles, the shuffles are not generated in place. Vectorizer
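To make this concrete, here is a minimal illustrative loop (mine, not taken from the thread) with interleave factor 2; vectorizing it takes wide loads over both streams followed by de-interleaving shuffles, and the cost of those shuffles is exactly what is at issue on X86:

  // Illustrative sketch: a stride-2 loop (interleave factor 2). Vectorization
  // loads contiguous blocks covering both streams and then needs shuffles to
  // separate the even- and odd-indexed lanes.
  void split_even_odd(const float *in, float *even, float *odd, int n) {
    for (int i = 0; i < n; ++i) {
      even[i] = in[2 * i];     // lanes 0, 2, 4, ...
      odd[i]  = in[2 * i + 1]; // lanes 1, 3, 5, ...
    }
  }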
2016 Aug 05
3
enabling interleaved access loop vectorization
Hi Michael,
Some time back I did some experiments with the interleave vectorizer and did not find any degradation;
probably my tests/benchmarks are not extensive enough to cover much.
Elena is the right person to comment on it, as she has already seen cases where it hinders performance.
For the interleave vectorizer on X86 we do not have any specific costing; it goes to BasicTTI, where the costing is not
2016 Aug 05
2
enabling interleaved access loop vectorization
Regarding InterleavedAccessPass - sure, but proper strided/interleaved
access optimization ought to have a positive impact even without target
support.
Case in point - Hal enabled it on PPC last September. An important
difference vs. x86 seems to be that arbitrary shuffles are cheap on PPC,
but, as I said below, I hope we can enable it on x86 with a conservative
cost function, and still get
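As a rough sketch of the kind of conservative costing being discussed (my assumption, not LLVM's actual cost model), one could charge the wide memory operations plus one shuffle per member, so that targets with cheap arbitrary shuffles come out ahead while X86 stays conservative:

  // Rough sketch, not LLVM's actual model: cost of an interleave group of
  // `Factor` members, assuming one wide memory op and one shuffle per member.
  unsigned interleavedGroupCost(unsigned Factor, unsigned WideMemOpCost,
                                unsigned ShuffleCost) {
    return Factor * WideMemOpCost + Factor * ShuffleCost;
  }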
2016 Sep 19
3
[arm, aarch64] Alignment checking in interleaved access pass
Hi,
As a follow-up to Patch D23646 <https://reviews.llvm.org/D23646>, I'm
trying to figure out whether there should be an alignment check and what the
correct approach is.
Some background:
For stores, the pass turns:
%i.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1,
         <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9,
                     i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>
store <12 x i32> %i.vec, <12 x i32>* %ptr
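For context, a source-level loop (my illustration, not from the thread) that gives rise to this kind of factor-3 interleaved store; after vectorization the stored value is built by a shuffle like the <0, 4, 8, 1, 5, 9, ...> mask above, which the pass can lower to a single VST3 on ARM/AArch64:

  // Illustrative only: three planar inputs written to one interleaved buffer
  // (interleave factor 3).
  void interleave3(const int *a, const int *b, const int *c, int *out, int n) {
    for (int i = 0; i < n; ++i) {
      out[3 * i + 0] = a[i];
      out[3 * i + 1] = b[i];
      out[3 * i + 2] = c[i];
    }
  }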
2016 Oct 10
2
[arm, aarch64] Alignment checking in interleaved access pass
Hi Renato,
Thank you for the answers!
First, let me clarify a couple of things and give some context.
The patch is looking at VSTn, rather than VLDn (stores seem to be somewhat
harder to get the "right" patterns for; the pass is doing a good job for loads
already).
The examples you gave come mostly from loop vectorization, which, as I
understand it, was the reason for adding the
2016 Aug 05
2
enabling interleaved access loop vectorization
On 5 August 2016 at 21:00, Demikhovsky, Elena
<elena.demikhovsky at intel.com> wrote:
> As far as I remember, maybe I'm wrong, the vectorizer does not generate
> shuffles for interleaved access. It generates a bunch of extracts and inserts
> that ought to be combined into shuffles afterwards.
That's my understanding as well.
Whatever strategy we take, it will be a mix of telling
2016 Sep 01
2
enabling interleaved access loop vectorization
So it turns out it is a full reproducer after all (choosing to vectorize on AVX), good.
> The details are in PR29025.
Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? ;-)
> But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:
Indeed such
2016 Aug 07
2
enabling interleaved access loop vectorization
We checked the gathered data again. All regressions that we see are in 32-bit mode. The 64-bit mode looks good overall.
- Elena
From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Saturday, August 06, 2016 02:56
To: Renato Golin <renato.golin at linaro.org>
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <mssimpso at codeaurora.org>;
2016 Aug 05
3
enabling interleaved access loop vectorization
On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:
> I agree that we can get *more* improvement with better cost modeling, but
> I'd expect to be able to get *some* improvement the way things are right
> now.
Elena said she saw "some" improvements. :)
> That's why I'm curious about where we saw regressions - I'm wondering
>
2016 Aug 09
2
enabling interleaved access loop vectorization
Thanks Ayal!
I'll take a look at DENBench.
As another data point - I tried enabling this on our internal benchmarks.
I'm seeing one regression, and it seems to be a regression of the "good"
kind - without interleaving we don't vectorize the innermost loop, and with
interleaving we do. The vectorized loop is actually significantly faster
when benchmarked in isolation, but in
2016 Aug 16
2
enabling interleaved access loop vectorization
Hi Ayal, Elena,
I'd really like to enable this by default.
As I wrote above, I didn't see any regressions in internal benchmarks, and
there doesn't seem to be anything in SPEC2006 either. I do see a
performance improvement in an internal benchmark (that is, a real
workload).
Would you be able to provide an example that gets pessimized? I have no
doubt you've seen regressions
2016 Aug 17
2
enabling interleaved access loop vectorization
Thanks Ayal!
On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
> Hi Michael,
>
>
>
> Don’t quite have a full reproducer for you yet. You’re welcome to try and
> see what’s happening in 32 bit mode when enabling interleaving for the
> following, based on “https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ”:
>
>
>
> void rgb2yik
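The reproducer is cut off here; as a purely hypothetical stand-in (not Ayal's actual function), an RGB-to-YIQ conversion loop of the kind referenced would look roughly like this, with the factor-3 interleaved reads of r, g, b and writes of y, i, q being what exercises the interleaved-access costing:

  // Hypothetical stand-in for the truncated rgb2yik reproducer; coefficients
  // follow the RGB-to-YIQ matrix on the Wikipedia page cited above.
  void rgb2yiq_sketch(const float *rgb, float *yiq, int n) {
    for (int p = 0; p < n; ++p) {
      float r = rgb[3 * p + 0], g = rgb[3 * p + 1], b = rgb[3 * p + 2];
      yiq[3 * p + 0] = 0.299f * r + 0.587f * g + 0.114f * b; // Y
      yiq[3 * p + 1] = 0.596f * r - 0.274f * g - 0.322f * b; // I
      yiq[3 * p + 2] = 0.211f * r - 0.523f * g + 0.312f * b; // Q
    }
  }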
2016 Oct 10
2
[arm, aarch64] Alignment checking in interleaved access pass
On Mon, Oct 10, 2016 at 1:14 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 10 October 2016 at 19:39, Alina Sbirlea <alina.sbirlea at gmail.com>
> wrote:
> > Now, for ARM archs Halide is currently generating explicit VSTn
> intrinsics,
> > with some of the patterns I described, and I found no reason why Halide
> > shouldn't generate a single
2015 Apr 16
2
[LLVMdev] Code review for gather and scatter intrinsics
Hi Renato,
I fully agree with you, but indexed loads and stores are the next step.
I'm asking for a review of the gather and scatter code.
Thanks.
- Elena
-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Thursday, April 16, 2015 17:17
To: Demikhovsky, Elena
Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; James Molloy
Subject: Re: [LLVMdev] Code review for gather
2015 Apr 16
2
[LLVMdev] Code review for gather and scatter intrinsics
Hi,
I have the code ready and waiting for review here: http://reviews.llvm.org/D7665.
I presented this work at LLVM Euro and people are interested in this feature.
Can anybody review this code, please?
Thanks.
- Elena
2017 Feb 01
2
RFC: Generic IR reductions
> My proposal was to have a reduction intrinsic that can infer the type from its predecessors.
> For example:
> @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))
And if we don't have %b? We just want to sum all elements of %a? Something like @llvm.reduce(ext <N x double> ( add <N x float> %a, zeroinitializer))
Don't we have a problem with constant
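For reference, the scalar pattern under discussion (my sketch, not from the RFC) is summing every element of a vector into a wider accumulator; with no second vector operand, zeroinitializer is the natural identity value:

  // Illustrative scalar form of the reduction being discussed: widen each
  // element (ext), then add it into the accumulator.
  double sum_widened(const float *a, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
      acc += (double)a[i];
    return acc;
  }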
2016 Feb 26
2
how to force llvm generate gather intrinsic
If I'm understanding correctly, you're saying that vgather* is slow on all
of Excavator, Haswell, Broadwell, and Skylake (client). Therefore, we will
not generate it for any of those machines.
Even if that's true, we should not define "gatherIsSlow()" as "hasAVX2() &&
!hasAVX512()". It could break for some hypothetical future processor that
manages to
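As a concrete illustration (mine, not from the thread) of the access pattern at stake, a loop like the following can only be vectorized with a hardware gather, or with scalar loads plus inserts when the gather is judged too slow:

  // Illustrative only: an indexed (gather-style) access pattern; the loads at
  // table[idx[i]] are non-contiguous, so vectorizing needs a gather or a
  // scalar-load-and-insert fallback.
  void gather_copy(const float *table, const int *idx, float *out, int n) {
    for (int i = 0; i < n; ++i)
      out[i] = table[idx[i]];
  }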
2016 Feb 25
2
how to force llvm generate gather intrinsic
It seems that http://reviews.llvm.org/D15690 only implemented
gather/scatter for AVX-512, but not for AVX/AVX2. Is there any plan to
enable gather for AVX/AVX2? Thanks.
Best,
Zhi
On Thu, Feb 25, 2016 at 8:28 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> I don't think gather has been enabled for AVX2 as of r261875.
> Masked load/store were enabled for AVX with:
>
2016 Feb 25
2
how to force llvm generate gather intrinsic
Yes, masked load/store/gather/scatter are completed.
- Elena
From: zhi chen [mailto:zchenhn at gmail.com]
Sent: Thursday, February 25, 2016 01:20
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Sanjay Patel <spatel at rotateright.com>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] how to