search for: psadbws

Displaying 20 results from an estimated 21 matches for "psadbws".

2015 Nov 19
5
[RFC] Introducing a vector reduction add instruction.
After some attempts to implement reduce-add in LLVM, I found an easier way to detect reduce-add without introducing new IR operations. The basic idea is to annotate the phi node instead of the add (so that it is easier to handle other reduction operations). In the PHINode class, we can add a flag indicating whether the phi node is a reduction one (the flag can be set in the loop vectorizer for vectorized phi nodes).
2015 Nov 13
2
[RFC] Introducing a vector reduction add instruction.
Hi When a reduction instruction is vectorized in a loop, it will be turned into an instruction with vector operands of the same operation type. This new instruction has a special property that can give us more flexibility during instruction selection later: this operation is valid as long as the reduction of all elements of the result vector is identical to the reduction of all elements of its
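The snippet above is cut off, but the part that survives says a vectorized reduction is valid as long as reducing all elements of the result vector gives the same value as the original reduction. A minimal C sketch of that idea for integer add follows. The lane count `LANES` and the function names `sum_scalar` and `sum_lanes` are hypothetical and chosen for illustration only; they are not taken from the RFC or from LLVM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* hypothetical vector width, for illustration only */

/* Reference: plain scalar reduction. */
uint32_t sum_scalar(const uint32_t *x, size_t n) {
    assert(x != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}

/* "Vectorized" reduction modelled in scalar code: each lane keeps its own
 * partial sum, and the lanes are only combined after the loop. */
uint32_t sum_lanes(const uint32_t *x, size_t n) {
    assert(x != NULL);
    uint32_t acc[LANES] = {0};
    size_t i = 0;

    /* Main loop: one vector-wide add per iteration. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; ++l)
            acc[l] += x[i + l];

    /* Horizontal reduction of the accumulator lanes. */
    uint32_t total = 0;
    for (size_t l = 0; l < LANES; ++l)
        total += acc[l];

    /* Scalar remainder for lengths that are not a multiple of LANES. */
    for (; i < n; ++i)
        total += x[i];
    return total;
}
```

The individual lanes of `acc` carry no meaning on their own; only their sum does. Because unsigned addition is associative and commutative (including on wraparound), `sum_lanes` and `sum_scalar` agree for any input.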
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel <hfinkel at anl.gov> wrote: > Hi Cong, > > After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it have something to do with the ordering of the reduction operations? The flag is only useful for vectorized reduction for now. I'll give
2014 Nov 04
3
[LLVMdev] supporting SAD in loop vectorizer
...Dibyendu <Dibyendu.Das at amd.com> wrote: > > Is there any plan to support special idioms in the loop vectorizer > > like sum of absolute difference (SAD) ? We see some useful cases > > where llvm is losing performance at -O3 due to SADs not being > > vectorized (hence PSADBWs not being generated). > > It's been a while, but this could either be that the legalisation > phase is not recognising the reduction or that the cost is not taking > into account the lowered abs(). > > What does -debug-only=loop-vectorize say about it? FWIW, I agree, this s...
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
----- Original Message ----- > From: "Xinliang David Li" <davidxl at google.com> > To: "Cong Hou" <congh at google.com> > Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org> > Sent: Wednesday, November 25, 2015 5:17:58 PM > Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
2014 Nov 11
3
[LLVMdev] supporting SAD in loop vectorizer
...already have code in the backend that matches other horizontal operations (see isHorizontalBinOp and its callers in lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be significantly more complicated. > including the fact that we would need a > 4-way unroll to use all of 128b PSADBWs. Or am I > missing something ? No, each unrolling will get its own, so you'll get a PSADBW from each time the loop is unrolled. Each unrolling is vectorized in terms of <4 x i32>, and that is the 128 bits you need. If you'd like to contribute support for this, look at isHorizonta...
2014 Nov 11
4
[LLVMdev] supporting SAD in loop vectorizer
...nd that matches other > horizontal operations (see isHorizontalBinOp and its callers in > lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be > significantly more complicated. > > > including the fact that we would need a > > 4-way unroll to use all of 128b PSADBWs. Or am I > > missing something ? > > No, each unrolling will get its own, so you'll get a PSADBW from each > time the loop is unrolled. Each unrolling is vectorized in terms of > <4 x i32>, and that is the 128 bits you need. > > If you'd like to contribute su...
2014 Nov 04
2
[LLVMdev] supporting SAD in loop vectorizer
Nadav and other vectorizer folks- Is there any plan to support special idioms in the loop vectorizer like sum of absolute difference (SAD) ? We see some useful cases where llvm is losing performance at -O3 due to SADs not being vectorized (hence PSADBWs not being generated). Also, since the abs() call is already lowered to a sequence of 'icmp; neg; select' by simplifylibcalls (in -O3), we may then need to get hold of this pattern in the loop vectorizer (part of reduction analysis) and do the needful. Thoughts ? -Thx Dibyendu
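For reference, a minimal sketch of the kind of scalar SAD loop the message above describes, assuming 8-bit inputs (the element size PSADBW operates on). The function name `sad_u8` is hypothetical and not taken from the thread. The absolute value is written out as a compare, a negate, and a select to mirror the 'icmp; neg; select' sequence the message mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences over two byte arrays of length n. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n) {
    assert(a != NULL && b != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Widen before subtracting so the difference can go negative. */
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        /* abs(d) spelled out: compare, negate, select. */
        int32_t neg = -d;
        sum += (uint32_t)(d < 0 ? neg : d);
    }
    return sum;
}
```

Whether a given compiler version turns this loop into PSADBW depends on its vectorizer and backend; the sketch only shows the source pattern under discussion.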
2018 Apr 07
0
SCEV and LoopStrengthReduction Formulae
> > I realize this is a micro-op saving a single cycle. But this reduces the instruction count, one less > instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition > to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass? This seems reasonable to me so long as rbx has no other uses that
2010 Jan 27
2
[LLVMdev] some llvm/clang missed optimizations
> Umm, can you find one that isn't a popcount implementation? Ok. MMX psadbw instruction: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml Position of first set bit: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml Log2 floor: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/83/837A80E9.shtml Pixel format
2009 Oct 13
3
Proposal for replacing asm code with intrinsics
Hi, I'm new to Theora and would like to propose several performance optimizations using advanced instructions in x86 CPUs (SSE2-SSE4.2). There are several source files in \x86 and \x86_vc which were developed using inline assembler. However, this causes several maintenance problems: 1) Need to sync gcc & msvc versions 2) Only the 32-bit environment is supported 3) No support for newer than MMX
2018 Apr 03
4
SCEV and LoopStrengthReduction Formulae
I am attempting to implement a minor loop strength reduction optimization for targets that support compare and jump fusion, specifically TTI::canMacroFuseCmp(). My approach might be wrong; however, I am soliciting the idea for feedback, so that I can implement this correctly. My plan is to add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes the following case, but perhaps
2010 Jan 27
2
[LLVMdev] some llvm/clang missed optimizations
>> Repetitive code with lots of bitwise operations is compiled by LLVM into >> much larger code than the other compilers: >> >> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml >> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml >> >> Note that this is straight-line code, so LLVM's output will
2010 Jan 27
0
[LLVMdev] some llvm/clang missed optimizations
On Tue, Jan 26, 2010 at 5:55 PM, John Regehr <regehr at cs.utah.edu> wrote: >>> Repetitive code with lots of bitwise operations is compiled by LLVM into >>> much larger code than the other compilers: >>> >>> >>> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml >>> >>>
2010 Jan 27
0
[LLVMdev] some llvm/clang missed optimizations
On Tue, Jan 26, 2010 at 7:42 PM, John Regehr <regehr at cs.utah.edu> wrote: >> Umm, can you find one that isn't a popcount implementation? > > Ok. > > MMX psadbw instruction: > > http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml > > Position of first set bit: > >
2005 Apr 19
0
mmx optimization
Hi, I've been looking through the archives of the mailing list and I've seen that you have rewritten a lot of functions using MMX to make them faster. I'm currently trying to optimize some code, but I'm having some problems, because I work with 16 bits per component and not 8 like Theora. I know that it is off topic, but I'm posting to ask for a little help. I've got
2016 May 28
4
sum elements in the vector
Hi Rail, The two revisions below, which detect SAD patterns and emit psadbw instructions on X86, might be of interest: http://reviews.llvm.org/D14840 http://reviews.llvm.org/D14897 Intrinsics related to absdiff revisions: http://reviews.llvm.org/D10867 http://reviews.llvm.org/D11678 Hope this helps. Regards, Suyog On Sat, May 28, 2016 at 4:20 AM, Rail Shafigulin via llvm-dev < llvm-dev at
2016 May 30
0
sum elements in the vector
Suyog, Thanks for the reply. Do you know if it is possible to add a new intrinsic without actually modifying core code (ISDOpcodes.h is an example of core code)? I'd like to add this intrinsic with as little code change as possible. On Fri, May 27, 2016 at 8:59 PM, suyog sarda <sardask01 at gmail.com> wrote: > Hi Rail, > > Below 2 revisions might be of your interest which
2004 Aug 24
5
MMX/mmxext optimisations
quite some speed improvement indeed. attached the updated patch to apply to svn/trunk. j -------------- next part -------------- A non-text attachment was scrubbed... Name: theora-mmx.patch.gz Type: application/x-gzip Size: 8648 bytes Desc: not available Url : http://lists.xiph.org/pipermail/theora-dev/attachments/20040824/5a5f2731/theora-mmx.patch-0001.bin
2016 May 27
0
sum elements in the vector
Hi Shahid. Do you mind providing a concrete example of X86 code where an intrinsic was added (preferably with filenames and line numbers)? I'm having difficulty tracking down the steps you provided. Any help is appreciated. On Mon, Apr 4, 2016 at 9:02 PM, Shahid, Asghar-ahmad < Asghar-ahmad.Shahid at amd.com> wrote: > Hi Rail, > > > > We had done this for generation