similar to: [RFC] Allow loop vectorizer to choose vector widths that generate illegal types

Displaying 20 results from an estimated 10000 matches similar to: "[RFC] Allow loop vectorizer to choose vector widths that generate illegal types"

2016 Jun 16
2
[RFC] Allow loop vectorizer to choose vector widths that generate illegal types
Hi Michael,  Thank you for working on this. The loop vectorizer tries a bunch of different vectorization factors and stops at the widest word size mostly because of compile time concerns. On every vectorization factors that we check we have to scan all of the instructions in the loop and make multiple calls into TTI. If you decide to increase the VF enumeration space then you will linearly
2016 Jun 16
2
[RFC] Allow loop vectorizer to choose vector widths that generate illegal types
Some thoughts: o To determine the VF for a loop with mixed data sizes, choosing the smallest ensures each vector register used is full, choosing the largest will minimize the number of vector registers used. Which one’s better, or some size in between, depends on the target’s costs for the vector operations, availability of registers and possibly control/memory divergence and trip count. “This is
2016 Aug 05
3
enabling interleaved access loop vectorization
Hi Michael, Sometime back I did some experiments with interleave vectorizer and did not found any degrade, probably my tests/benchmarks are not extensive enough to cover much. Elina is the right person to comment on it as she already experienced cases where it hinders performance. For interleave vectorizer on X86 we do not have any specific costing, it goes to BasicTTI where the costing is not
2016 Aug 05
2
enabling interleaved access loop vectorization
Regarding InterleavedAccessPass - sure, but proper strided/interleaved access optimization ought to have a positive impact even without target support. Case in point - Hal enabled it on PPC last September. An important difference vs. x86 seems to be that arbitrary shuffles are cheap on PPC, but, as I said below, I hope we can enable it on x86 with a conservative cost function, and still get
2016 May 26
2
enabling interleaved access loop vectorization
Interleaved access is not enabled on X86 yet. We looked at this feature and got into conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost which depends on number of shuffles. Number of shuffles depends on permutations (shuffle mask). And even if we estimate the number of shuffles, the shuffles are not generated in-place. Vectorizer
2016 Feb 19
2
target triple in 3.8
I have some trouble making the SIMD vector length visible to the passes. My application is basically on the level of 'opt'. What I did in version 3.6 was functionPassManager->add(new llvm::TargetLibraryInfo(llvm::Triple(Mod->getTargetTriple()))); functionPassManager->add(new llvm::DataLayoutPass()); and then the -basicaa and -loop-vectorizer were able to vectorize the input
2016 May 26
0
enabling interleaved access loop vectorization
On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote: > Is there a compile-time and/or potential runtime cost that makes > enableInterleavedAccessVectorization() default to 'false'? > > I notice that this is set to true for ARM, AArch64, and PPC. > > In particular, I'm wondering if there's a reason it's not enabled for
2018 Jul 23
2
[LoopVectorizer] Improving the performance of dot product reduction loop
On 07/23/2018 06:23 PM, Hal Finkel via llvm-dev wrote: > > On 07/23/2018 05:22 PM, Craig Topper wrote: >> Hello all, >> >> This code https://godbolt.org/g/tTyxpf is a dot product reduction >> loop multipying sign extended 16-bit values to produce a 32-bit >> accumulated result. The x86 backend is currently not able to optimize >> it as well as gcc and icc.
2014 Dec 11
2
[LLVMdev] Vectorization factor limitation in Loop Vectorizer
Hi Nadav/Devs I am exploring Loop Vectorizer to vectorize i8 scalar operations into 8xi8 vector operation. I was expecting the Loop Vectorizer to analyze the profitability for vectorization factor(VF) of 8, However it is not doing so due to the widest type calculation done for the blocks inside the loop. May be I am missing something, however, I am curious to know why Loop Vectorizer limits the
2016 Aug 09
2
enabling interleaved access loop vectorization
Thanks Ayal! I'll take a look at DENBench. As another data point - I tried enabling this on our internal benchmarks. I'm seeing one regression, and it seems to be a regression of the "good" kind - without interleaving we don't vectorize the innermost loop, and with interleaving we do. The vectorized loop is actually significantly faster when benchmarked in isolation, but in
2014 Dec 13
2
[LLVMdev] Vectorization factor limitation in Loop Vectorizer
So IMO, if we modify the VF calculation for targets/subtargets using TTI where higher VF is supported The vectorizer’s scope will become wider. Did/do you foresee any issue with this? Thanks, Shahid From: Nadav Rotem [mailto:nrotem at apple.com] Sent: Saturday, December 13, 2014 2:47 AM To: Shahid, Asghar-ahmad Cc: llvmdev at cs.uiuc.edu Subject: Re: [LLVMdev] Vectorization factor limitation in
2016 Aug 16
2
enabling interleaved access loop vectorization
Hi Ayal, Elena, I'd really like to enable this by default. As I wrote above, I didn't see any regressions in internal benchmarks, and there doesn't seem to be anything in SPEC2006 either. I do see a performance improvement in an internal benchmark (that is, a real workload). Would you be able to provide an example that gets pessimized? I have no doubt you've seen regressions
2016 Aug 07
2
enabling interleaved access loop vectorization
We checked the gathered data again. All regressions that we see are in 32-bit mode. The 64-bit mode looks good overall. - Elena From: Michael Kuperstein [mailto:mkuper at google.com] Sent: Saturday, August 06, 2016 02:56 To: Renato Golin <renato.golin at linaro.org> Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <mssimpso at codeaurora.org>;
2018 Jul 24
2
[LoopVectorizer] Improving the performance of dot product reduction loop
On 07/24/2018 02:58 AM, Nema, Ashutosh wrote: > >   > >   > > *From:*Hal Finkel <hfinkel at anl.gov> > *Sent:* Tuesday, July 24, 2018 5:05 AM > *To:* Craig Topper <craig.topper at gmail.com>; hideki.saito at intel.com; > estotzer at ti.com; Nemanja Ivanovic <nemanja.i.ibm at gmail.com>; Adam > Nemet <anemet at apple.com>; graham.hunter at
2016 Aug 17
2
enabling interleaved access loop vectorization
Thanks Ayal! On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote: > Hi Michael, > > > > Don’t quite have a full reproducer for you yet. You’re welcome to try and > see what’s happening in 32 bit mode when enabling interleaving for the > following, based on “https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ”: > > > > void rgb2yik
2016 Sep 01
2
enabling interleaved access loop vectorization
So turns out it is a full reproducer after all (choosing to vectorize on AVX), good. > The details are in PR29025. Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? ;-) > But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get: Indeed such
2013 Oct 26
3
[LLVMdev] Why is the loop vectorizer not working on my function?
----- Original Message ----- > >>> LV: The Widest type: 32 bits. > >>> LV: The Widest register is: 32 bits. > > Yep, we don’t pick up the right TTI. > > Try -march=x86-64 (or leave it out) you already have this info in the > triple. > > Then it should work (does for me with your example below). That may depend on what CPU is picks by default; Frank,
2016 Jun 30
0
[Proposal][RFC] Strided Memory Access Vectorization
One common concern raised for cases where Loop Vectorizer generate bigger types than target supported: Based on VF currently we check the cost and generate the expected set of instruction[s] for bigger type. It has two challenges for bigger types cost is not always correct and code generation may not generate efficient instruction[s]. Probably can depend on the support provided by below RFC by
2013 Oct 27
3
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Frank, On Oct 26, 2013, at 6:29 PM, Frank Winter <fwinter at jlab.org> wrote: > I would need this to work when calling the vectorizer through > the function pass manager. Unfortunately I am having the same > problem there: I am not sure which function pass manager you are referring here. I assume you create your own (you are not using opt but configure your own pass
2016 Jun 30
1
[Proposal][RFC] Strided Memory Access Vectorization
As a strong advocate of logical vector representation, I'm counting on community liking Michael's RFC and that'll proceed sooner than later. I plan to pitch in (e.g., perf experiments). >Probably can depend on the support provided by below RFC by Michael: > "Allow loop vectorizer to choose vector widths that generate illegal types" >In that case Loop Vectorizer will