thr3ads.net - search: "psadbws"

Displaying 20 results from an estimated 21 matches for "psadbws".

[RFC] Introducing a vector reduction add instruction.

2015 Nov 19

After some attempts to implement reduce-add in LLVM, I found an easier way to detect reduce-add without introducing new IR operations. The basic idea is to annotate the phi node instead of the add (so that it is easier to handle other reduction operations). In the PHINode class, we can add a flag indicating whether the phi node is a reduction one (the flag can be set in the loop vectorizer for vectorized phi nodes).

[RFC] Introducing a vector reduction add instruction.

2015 Nov 13

Hi, When a reduction instruction is vectorized in a loop, it will be turned into an instruction with vector operands of the same operation type. This new instruction has a special property that can give us more flexibility during instruction selection later: this operation is valid as long as the reduction of all elements of the result vector is identical to the reduction of all elements of its

[RFC] Introducing a vector reduction add instruction.

2015 Nov 25

On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel <hfinkel at anl.gov> wrote: > Hi Cong, > > After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it have something to do with the ordering of the reduction operations? The flag is only useful for vectorized reduction for now. I'll give

[LLVMdev] supporting SAD in loop vectorizer

2014 Nov 04

...Dibyendu <Dibyendu.Das at amd.com> wrote: > > Is there any plan to support special idioms in the loop vectorizer > > like sum of absolute difference (SAD) ? We see some useful cases > > where llvm is losing performance at -O3 due to SADs not being > > vectorized (hence PSADBWs not being generated). > > It's been a while, but this could either be that the legalisation > phase is not recognising the reduction or that the cost is not taking > into account the lowered abs(). > > What does -debug-only=loop-vectorize say about it? FWIW, I agree, this s...

[RFC] Introducing a vector reduction add instruction.

2015 Nov 25

----- Original Message ----- > From: "Xinliang David Li" <davidxl at google.com> > To: "Cong Hou" <congh at google.com> > Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org> > Sent: Wednesday, November 25, 2015 5:17:58 PM > Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add

[LLVMdev] supporting SAD in loop vectorizer

2014 Nov 11

...already have code in the backend that matches other horizontal operations (see isHorizontalBinOp and its callers in lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be significantly more complicated. > including the fact that we would need a > 4-way unroll to use all of 128b PSADBWs. Or am I > missing something ? No, each unrolling will get its own, so you'll get a PSADBW from each time the loop is unrolled. Each unrolling is vectorized in terms of <4 x i32>, and that is the 128 bits you need. If you'd like to contribute support for this, look at isHorizonta...
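To make the "128 bits" remark in the snippet above concrete, here is a minimal scalar sketch of what a 128-bit PSADBW computes: the absolute differences of 16 unsigned byte pairs, summed separately for the low and high 8 bytes, with each sum placed in one 64-bit lane. The function name psadbw128_model is hypothetical and is not part of LLVM or of any intrinsic header; this models the instruction's arithmetic only, not the pattern matching discussed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of 128-bit PSADBW (illustration only).
   out[0] receives the sum of |a[i] - b[i]| for bytes 0..7,
   out[1] receives the same sum for bytes 8..15. */
void psadbw128_model(const uint8_t a[16], const uint8_t b[16], uint64_t out[2])
{
    assert(a != NULL && b != NULL && out != NULL);
    for (int lane = 0; lane < 2; ++lane) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; ++i) {
            uint8_t x = a[lane * 8 + i];
            uint8_t y = b[lane * 8 + i];
            /* Unsigned absolute difference of one byte pair. */
            sum += (x > y) ? (uint64_t)(x - y) : (uint64_t)(y - x);
        }
        /* At most 8 * 255 = 2040, so the sum fits in 16 bits;
           the remaining bits of the lane stay zero. */
        out[lane] = sum;
    }
}
```

One call therefore covers 16 bytes of each input, and the two partial sums still have to be added together to finish the reduction.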

[LLVMdev] supporting SAD in loop vectorizer

2014 Nov 11

...nd that matches other > horizontal operations (see isHorizontalBinOp and its callers in > lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be > significantly more complicated. > > > including the fact that we would need a > > 4-way unroll to use all of 128b PSADBWs. Or am I > > missing something ? > > No, each unrolling will get its own, so you'll get a PSADBW from each > time the loop is unrolled. Each unrolling is vectorized in terms of > <4 x i32>, and that is the 128 bits you need. > > If you'd like to contribute su...

[LLVMdev] supporting SAD in loop vectorizer

2014 Nov 04

Nadav and other vectorizer folks- Is there any plan to support special idioms in the loop vectorizer like sum of absolute differences (SAD)? We see some useful cases where llvm is losing performance at -O3 due to SADs not being vectorized (hence PSADBWs not being generated). Also, since the abs() call is already lowered to a sequence of 'icmp; neg; select' by simplifylibcalls (in -O3), we may then need to get hold of this pattern in the loop vectorizer (part of reduction analysis) and do the needful. Thoughts? -Thx Dibyendu
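For reference, the kind of loop in question can be sketched as below. This is a minimal illustration, not code from the thread: the names abs_diff and sad_u8 are hypothetical, and the absolute value is written out by hand in the compare/negate/select shape the message attributes to the lowered abs() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Absolute difference of two bytes, spelled out as
   compare / negate / select (illustration only). */
static inline int32_t abs_diff(uint8_t x, uint8_t y)
{
    int32_t d = (int32_t)x - (int32_t)y;
    int32_t neg = -d;            /* neg */
    return (d < 0) ? neg : d;    /* icmp + select */
}

/* Hypothetical helper: scalar sum of absolute differences over n bytes.
   The accumulator 'sum' is the reduction variable a vectorizer
   would have to recognise. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint32_t)abs_diff(a[i], b[i]);
    return sum;
}
```

Whether a given compiler version turns such a loop into PSADBW depends on its reduction analysis and cost model, which is the question the message raises.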

SCEV and LoopStrengthReduction Formulae

2018 Apr 07

> > I realize this is a micro-op saving a single cycle. But this reduces the instruction count, one less > instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition > to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass? This seems reasonable to me so long as rbx has no other uses that

[LLVMdev] some llvm/clang missed optimizations

2010 Jan 27

> Umm, can you find one that isn't a popcount implementation? Ok. MMX psadbw instruction: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml Position of first set bit: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml Log2 floor: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/83/837A80E9.shtml Pixel format

Proposal for replacing asm code with intrinsics

2009 Oct 13

Hi, I'm new to Theora and would like to propose several performance optimizations using advanced instructions in x86 CPUs (SSE2-SSE4.2). There are several source files in \x86 and \x86_vc which were developed using inline assembler. However, this causes several maintenance problems: 1) Need to sync gcc & msvc versions 2) Only 32bit environment is supported 3) No support for newer than MMX

SCEV and LoopStrengthReduction Formulae

2018 Apr 03

I am attempting to implement a minor loop strength reduction optimization for targets that support compare and jump fusion, specifically TTI::canMacroFuseCmp(). My approach might be wrong; however, I am soliciting the idea for feedback, so that I can implement this correctly. My plan is to add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes the following case, but perhaps

[LLVMdev] some llvm/clang missed optimizations

2010 Jan 27

>> Repetitive code with lots of bitwise operations is compiled by LLVM into >> much larger code than the other compilers: >> >> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml >> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml >> >> Note that this is straight-line code, so LLVM's output will

[LLVMdev] some llvm/clang missed optimizations

2010 Jan 27

On Tue, Jan 26, 2010 at 5:55 PM, John Regehr <regehr at cs.utah.edu> wrote: >>> Repetitive code with lots of bitwise operations is compiled by LLVM into >>> much larger code than the other compilers: >>> >>> >>> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml >>> >>>

[LLVMdev] some llvm/clang missed optimizations

2010 Jan 27

On Tue, Jan 26, 2010 at 7:42 PM, John Regehr <regehr at cs.utah.edu> wrote: >> Umm, can you find one that isn't a popcount implementation? > > Ok. > > MMX psadbw instruction: > > http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml > > Position of first set bit: > >

mmx optimization

2005 Apr 19

Hi, I've been looking through the archives of the mailing list and I've seen that you have rewritten a lot of functions using mmx to make them faster. I'm currently trying to optimize some code, but I'm having some problems, because I work with 16 bits per component and not 8 like theora. I know that it is off topic, but I'm posting to ask for a little help. I've got

sum elements in the vector

2016 May 28

Hi Rail, The 2 revisions below, which detect SAD patterns and emit psadbw instructions on X86, might be of interest to you: http://reviews.llvm.org/D14840 http://reviews.llvm.org/D14897 Revisions related to absdiff intrinsics: http://reviews.llvm.org/D10867 http://reviews.llvm.org/D11678 Hope this helps. Regards, Suyog On Sat, May 28, 2016 at 4:20 AM, Rail Shafigulin via llvm-dev < llvm-dev at

sum elements in the vector

2016 May 30

Suyog, Thanks for the reply. Do you know if it is possible to add a new intrinsic without actually modifying core code (ISDOpcodes.h is an example of core code)? I'd like to add this intrinsic with as little code change as possible. On Fri, May 27, 2016 at 8:59 PM, suyog sarda <sardask01 at gmail.com> wrote: > Hi Rail, > > Below 2 revisions might be of your interest which

MMX/mmxext optimisations

2004 Aug 24

quite some speed improvement indeed. attached the updated patch to apply to svn/trunk. j -------------- next part -------------- A non-text attachment was scrubbed... Name: theora-mmx.patch.gz Type: application/x-gzip Size: 8648 bytes Desc: not available Url : http://lists.xiph.org/pipermail/theora-dev/attachments/20040824/5a5f2731/theora-mmx.patch-0001.bin

sum elements in the vector

2016 May 27

Hi Shahid. Do you mind providing a concrete example of X86 code where an intrinsic was added (preferably with filenames and line numbers)? I'm having difficulty tracking down the steps you provided. Any help is appreciated. On Mon, Apr 4, 2016 at 9:02 PM, Shahid, Asghar-ahmad < Asghar-ahmad.Shahid at amd.com> wrote: > Hi Rail, > > > > We had done this for generation
