Displaying 20 results from an estimated 21 matches for "psadbws".
2015 Nov 19
5
[RFC] Introducing a vector reduction add instruction.
After some attempts to implement reduce-add in LLVM, I found an
easier way to detect reduce-add without introducing new IR operations.
The basic idea is to annotate the phi node instead of the add (so that
it is easier to handle other reduction operations). In the PHINode class,
we can add a flag indicating whether the phi node is a reduction (the flag
can be set in the loop vectorizer for vectorized phi nodes).
2015 Nov 13
2
[RFC] Introducing a vector reduction add instruction.
Hi
When a reduction instruction is vectorized in a loop, it will be
turned into an instruction with vector operands of the same operation
type. This new instruction has a special property that can give us
more flexibility during instruction selection later: this operation is
valid as long as the reduction of all elements of the result vector is
identical to the reduction of all elements of its
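A minimal sketch of the property described in this snippet, assuming the truncated sentence ends with "its operands". The helper names (hsum, add_lanewise, add_folded) are hypothetical and are not LLVM APIs. Both updates below leave the accumulator with the same horizontal sum, even though the individual lanes differ:

```c
#include <stdint.h>

#define LANES 4

/* Horizontal sum: the final reduction of all lanes of the accumulator. */
uint32_t hsum(const uint32_t v[LANES]) {
    uint32_t s = 0;
    for (int i = 0; i < LANES; ++i)
        s += v[i];
    return s;
}

/* Ordinary lane-wise vector add: acc[i] += x[i]. */
void add_lanewise(uint32_t acc[LANES], const uint32_t x[LANES]) {
    for (int i = 0; i < LANES; ++i)
        acc[i] += x[i];
}

/* Alternative placement (illustrative only): every element of x is
   folded into lane 0. The lanes differ from add_lanewise, but the
   horizontal sum of acc is the same. */
void add_folded(uint32_t acc[LANES], const uint32_t x[LANES]) {
    for (int i = 0; i < LANES; ++i)
        acc[0] += x[i];
}
```

If only the final reduced value is used after the loop, either update is acceptable, which is the kind of freedom the snippet says instruction selection could exploit.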
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Hi Cong,
>
> After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it have something to do with the ordering of the reduction operations?
The flag is only useful for vectorized reduction for now. I'll give
2014 Nov 04
3
[LLVMdev] supporting SAD in loop vectorizer
...Dibyendu <Dibyendu.Das at amd.com> wrote:
> > Is there any plan to support special idioms in the loop vectorizer
> > like sum of absolute difference (SAD) ? We see some useful cases
> > where llvm is losing performance at -O3 due to SADs not being
> > vectorized (hence PSADBWs not being generated).
>
> It's been a while, but this could either be that the legalisation
> phase is not recognising the reduction or that the cost is not taking
> into account the lowered abs().
>
> What does -debug-only=loop-vectorize say about it?
FWIW, I agree, this s...
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
----- Original Message -----
> From: "Xinliang David Li" <davidxl at google.com>
> To: "Cong Hou" <congh at google.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Wednesday, November 25, 2015 5:17:58 PM
> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
2014 Nov 11
3
[LLVMdev] supporting SAD in loop vectorizer
...already have code in the backend that matches other horizontal operations (see isHorizontalBinOp and its callers in lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be significantly more complicated.
> including the fact that we would need a
> 4-way unroll to use all of 128b PSADBWs. Or am I
> missing something ?
No, each unrolling will get its own, so you'll get a PSADBW from each time the loop is unrolled. Each unrolling is vectorized in terms of <4 x i32>, and that is the 128 bits you need.
If you'd like to contribute support for this, look at isHorizonta...
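For reference, a scalar model of what one 128-bit PSADBW computes, as a rough sketch. The function name psadbw128_model is hypothetical; the behavior modeled is the instruction's documented semantics: each 64-bit half of the result receives the sum of absolute differences of the corresponding eight unsigned bytes.

```c
#include <stdint.h>

/* Scalar model of 128-bit PSADBW (hypothetical helper, not an intrinsic).
   out[0] = SAD of bytes 0..7, out[1] = SAD of bytes 8..15.
   Each sum is at most 8 * 255 = 2040, so it fits in 16 bits; the
   remaining bits of each 64-bit lane are zero. */
void psadbw128_model(const uint8_t a[16], const uint8_t b[16],
                     uint64_t out[2]) {
    for (int lane = 0; lane < 2; ++lane) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; ++i) {
            uint8_t x = a[lane * 8 + i];
            uint8_t y = b[lane * 8 + i];
            /* Unsigned absolute difference of two bytes. */
            sum += (x > y) ? (uint64_t)(x - y) : (uint64_t)(y - x);
        }
        out[lane] = sum;
    }
}
```

One instruction therefore consumes sixteen byte pairs and leaves two partial sums, which still need a final add to produce the scalar SAD.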
2014 Nov 11
4
[LLVMdev] supporting SAD in loop vectorizer
...nd that matches other
> horizontal operations (see isHorizontalBinOp and its callers in
> lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be
> significantly more complicated.
>
> > including the fact that we would need a
> > 4-way unroll to use all of 128b PSADBWs. Or am I
> > missing something ?
>
> No, each unrolling will get its own, so you'll get a PSADBW from each
> time the loop is unrolled. Each unrolling is vectorized in terms of
> <4 x i32>, and that is the 128 bits you need.
>
> If you'd like to contribute su...
2014 Nov 04
2
[LLVMdev] supporting SAD in loop vectorizer
Nadav and other vectorizer folks-
Is there any plan to support special idioms in the loop vectorizer like sum of absolute difference (SAD) ? We see some useful cases where llvm is losing performance at -O3 due to SADs not being vectorized (hence PSADBWs not being generated).
Also, since the abs() call is already lowered to a sequence of 'icmp; neg; select' by simplifylibcalls (in -O3), we may then need to get hold of this pattern in the loop vectorizer (part of reduction analysis) and do the needful.
Thoughts ?
-Thx
Dibyendu
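A minimal sketch of the kind of loop under discussion, assuming 8-bit inputs and a 32-bit accumulator (the function name sad_u8 is hypothetical). The absolute value is written out as a compare, a negate, and a select, mirroring the 'icmp; neg; select' sequence the message says abs() is lowered to:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sum of absolute differences over two byte arrays. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        int32_t neg = -d;                    /* neg */
        int32_t abs_d = (d < 0) ? neg : d;   /* icmp + select */
        sum += (uint32_t)abs_d;              /* reduction add */
    }
    return sum;
}
```

A vectorizer that wants to emit PSADBW has to recognize both the abs pattern and the add reduction around it.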
2018 Apr 07
0
SCEV and LoopStrengthReduction Formulae
>
> I realize this is a micro-op saving a single cycle. But this reduces the instruction count, one less
> instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition
> to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?
This seems reasonable to me so long as rbx has no other uses that
2010 Jan 27
2
[LLVMdev] some llvm/clang missed optimizations
> Umm, can you find one that isn't a popcount implementation?
Ok.
MMX psadbw instruction:
http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml
Position of first set bit:
http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml
Log2 floor:
http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/83/837A80E9.shtml
Pixel format
2009 Oct 13
3
Proposal for replacing asm code with intrinsics
Hi,
I'm new to Theora and would like to propose several performance optimizations using advanced instructions in x86 CPUs (SSE2-SSE4.2).
There are several source files in \x86 and \x86_vc which were developed using inline assembler. However, this causes several maintenance problems:
1) Need to sync gcc & msvc versions
2) Only 32bit environment is supported
3) No support for newer than MMX
2018 Apr 03
4
SCEV and LoopStrengthReduction Formulae
I am attempting to implement a minor loop strength reduction optimization for
targets that support compare and jump fusion, specifically
TTI::canMacroFuseCmp(). My approach might be wrong; however, I am soliciting
the idea for feedback, so that I can implement this correctly. My plan is to
add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes the
following case, but perhaps
2010 Jan 27
2
[LLVMdev] some llvm/clang missed optimizations
>> Repetitive code with lots of bitwise operations is compiled by LLVM into
>> much larger code than the other compilers:
>>
>> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml
>> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml
>>
>> Note that this is straight-line code, so LLVM's output will
2010 Jan 27
0
[LLVMdev] some llvm/clang missed optimizations
On Tue, Jan 26, 2010 at 5:55 PM, John Regehr <regehr at cs.utah.edu> wrote:
>>> Repetitive code with lots of bitwise operations is compiled by LLVM into
>>> much larger code than the other compilers:
>>>
>>>
>>> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml
>>>
>>>
2010 Jan 27
0
[LLVMdev] some llvm/clang missed optimizations
On Tue, Jan 26, 2010 at 7:42 PM, John Regehr <regehr at cs.utah.edu> wrote:
>> Umm, can you find one that isn't a popcount implementation?
>
> Ok.
>
> MMX psadbw instruction:
>
> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml
>
> Position of first set bit:
>
>
2005 Apr 19
0
mmx optimization
Hi,
I've been looking through the archives of the mailing list and I've
seen that you have rewritten a lot of functions using mmx to make them
faster.
I'm currently trying to optimize some code, but I'm having some problems,
because I work with 16 bits per component and not 8 like theora. I know
that it is off topic, but I'm posting to ask you for a little help.
I've got
2016 May 28
4
sum elements in the vector
Hi Rail,
The two revisions below, which detect SAD patterns and
emit psadbw instructions on X86, might be of interest:
http://reviews.llvm.org/D14840
http://reviews.llvm.org/D14897
Revisions related to absdiff intrinsics:
http://reviews.llvm.org/D10867
http://reviews.llvm.org/D11678
Hope this helps.
Regards,
Suyog
On Sat, May 28, 2016 at 4:20 AM, Rail Shafigulin via llvm-dev <
llvm-dev at
2016 May 30
0
sum elements in the vector
Suyog,
Thanks for the reply. Do you know if it is possible to add a new intrinsic
without actually modifying core code (ISDOpcodes.h is an example of core
code)? I'd like to add this intrinsic with as little code change as
possible.
On Fri, May 27, 2016 at 8:59 PM, suyog sarda <sardask01 at gmail.com> wrote:
> Hi Rail,
>
> Below 2 revisions might be of your interest which
2004 Aug 24
5
MMX/mmxext optimisations
quite some speed improvement indeed.
attached the updated patch to apply to svn/trunk.
j
-------------- next part --------------
A non-text attachment was scrubbed...
Name: theora-mmx.patch.gz
Type: application/x-gzip
Size: 8648 bytes
Desc: not available
Url : http://lists.xiph.org/pipermail/theora-dev/attachments/20040824/5a5f2731/theora-mmx.patch-0001.bin
2016 May 27
0
sum elements in the vector
Hi Shahid.
Do you mind providing a concrete example of X86 code where an intrinsic was
added (preferably with filenames and line numbers)? I'm having difficulty
following the steps you provided.
Any help is appreciated.
On Mon, Apr 4, 2016 at 9:02 PM, Shahid, Asghar-ahmad <
Asghar-ahmad.Shahid at amd.com> wrote:
> Hi Rail,
>
>
>
> We had done this for generation