search for: psadbw

Displaying 20 results from an estimated 21 matches for "psadbw".

2015 Nov 19
5
[RFC] Introducing a vector reduction add instruction.
...ction selection, we detect those reduction phi nodes and then annotate reduction operations. This requires an additional flag in SDNodeFlags. We can then check this flag when combining instructions to detect reduction operations. In this approach, I have managed to let LLVM compile a SAD loop into psadbw instructions. Source code: const int N = 1024; unsigned char a[N], b[N]; int sad() { int s = 0; for (int i = 0; i < N; ++i) { int res = a[i] - b[i]; s += (res > 0) ? res : -res; } return s; } Emitted instructions on X86: # BB#0: # %entr...
2015 Nov 13
2
[RFC] Introducing a vector reduction add instruction.
...truction selection later: this operation is valid as long as the reduction of all elements of the result vector is identical to the reduction of all elements of its operands. One example that can benefit from this property is SAD (sum of absolute differences) pattern detection in SSE2, which provides a psadbw instruction whose description is shown below: ''' psadbw: Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce two unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 b...
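The psadbw description quoted above can be modelled in plain C. This is only a scalar sketch of the 128-bit SSE2 form, not code from the thread; the function name psadbw_model and the choice to return the two lanes through a uint64_t array are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of the 128-bit psadbw (hypothetical helper, for illustration).
   Each group of 8 byte pairs contributes one sum of absolute differences,
   stored in the low 16 bits of the matching 64-bit lane. */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint64_t out[2])
{
    assert(a != NULL && b != NULL && out != NULL);
    for (int lane = 0; lane < 2; ++lane) {
        uint16_t sum = 0; /* at most 8 * 255 = 2040, so 16 bits suffice */
        for (int i = 0; i < 8; ++i) {
            sum += (uint16_t)abs((int)a[lane * 8 + i] - (int)b[lane * 8 + i]);
        }
        out[lane] = sum; /* the upper 48 bits of the lane stay zero */
    }
}
```

Adding out[0] and out[1] gives the SAD of the whole 16-byte block, which is the final horizontal step a reduction loop needs after its last iteration.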
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
...sult is the reduction phi node, and this is usually true as long as a reduction loop can be vectorized). For example, if we let the result be [s0+s1, 0, s2+s3, 0] or [0, 0, s0+s1+s2+s3, 0], the reduction result won't change. This enables us to detect SAD or dot-product patterns and use SSE's psadbw and pmaddwd instructions. Please see my response to your other email for more details. Thanks! Cong > > Thanks again, > Hal > > ----- Original Message ----- >> From: "Cong Hou via llvm-dev" <llvm-dev at lists.llvm.org> >> To: "llvm-dev" <l...
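The claim that a result such as [s0+s1, 0, s2+s3, 0] leaves the reduction unchanged can be checked with a small C sketch. All names here (reduce_add4, accumulate_lanewise, accumulate_folded, reductions_agree) are hypothetical and only illustrate the property; they are not taken from the RFC or from LLVM.

```c
#include <assert.h>
#include <stdint.h>

/* Final horizontal reduction: add all four lanes of the accumulator. */
static uint32_t reduce_add4(const uint32_t v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Ordinary vector add: lane i of the accumulator receives s[i]. */
static void accumulate_lanewise(uint32_t acc[4], const uint32_t s[4])
{
    for (int i = 0; i < 4; ++i)
        acc[i] += s[i];
}

/* "Folded" add: the step contributes [s0+s1, 0, s2+s3, 0] instead. */
static void accumulate_folded(uint32_t acc[4], const uint32_t s[4])
{
    acc[0] += s[0] + s[1];
    acc[2] += s[2] + s[3];
}

/* Returns 1 when both strategies give the same reduced total. */
static int reductions_agree(const uint32_t steps[][4], int n)
{
    assert(n >= 0);
    uint32_t lanewise[4] = {0, 0, 0, 0};
    uint32_t folded[4] = {0, 0, 0, 0};
    for (int k = 0; k < n; ++k) {
        accumulate_lanewise(lanewise, steps[k]);
        accumulate_folded(folded, steps[k]);
    }
    return reduce_add4(lanewise) == reduce_add4(folded);
}
```

The individual lanes differ between the two accumulators, but only their total is observable once the loop ends, and that is the freedom a psadbw or pmaddwd based lowering relies on.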
2014 Nov 04
3
[LLVMdev] supporting SAD in loop vectorizer
...Dibyendu <Dibyendu.Das at amd.com> wrote: > > Is there any plan to support special idioms in the loop vectorizer > > like sum of absolute difference (SAD) ? We see some useful cases > > where llvm is losing performance at -O3 due to SADs not being > > vectorized (hence PSADBWs not being generated). > > It's been a while, but this could either be that the legalisation > phase is not recognising the reduction or that the cost is not taking > into account the lowered abs(). > > What does -debug-only=loop-vectorize say about it? FWIW, I agree, this...
2015 Nov 25
2
[RFC] Introducing a vector reduction add instruction.
...nly the product is useful, only the maximum or minimum value is useful, etc.). Now I completely understand why the flag is useful at the SDAG level. Because SDAG is basic-block local, we can't examine the loop structure when doing instruction selection for the relevant operations composing the psadbw (and friends). We also need to realize, when lowering the horizontal reduction at the end of the loop, to lower it in some more-trivial way (right?). Regarding the metadata at the IR level: the motivation here is that, without it, the SDAG builder would need to examine the uses of the PHI, determi...
2014 Nov 11
3
[LLVMdev] supporting SAD in loop vectorizer
...-------------------------------------------------- > > The loop vectorizer does vectorize the loop and then unrolls it > twice. The main body of the loop at the end looks like below where > we see the icmp, neg select pattern appearing twice. > Are we saying we pattern match this to PSADBW in target ? Yes. > That seems > to have some challenges It does, but we already have code in the backend that matches other horizontal operations (see isHorizontalBinOp and its callers in lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be significantly more complicated....
2014 Nov 11
4
[LLVMdev] supporting SAD in loop vectorizer
...------------------------- > > > > The loop vectorizer does vectorize the loop and then unrolls it > > twice. The main body of the loop at the end looks like below where > > we see the icmp, neg select pattern appearing twice. > > Are we saying we pattern match this to PSADBW in target ? > > Yes. > > > That seems > > to have some challenges > > It does, but we already have code in the backend that matches other > horizontal operations (see isHorizontalBinOp and its callers in > lib/Target/X86/X86ISelLowering.cpp), and I suspect this w...
2014 Nov 04
2
[LLVMdev] supporting SAD in loop vectorizer
Nadav and other vectorizer folks- Is there any plan to support special idioms in the loop vectorizer like sum of absolute difference (SAD) ? We see some useful cases where llvm is losing performance at -O3 due to SADs not being vectorized (hence PSADBWs not being generated). Also, since the abs() call is already lowered to a sequence of 'icmp; neg; select' by simplifylibcalls (in -O3), we may then need to get hold of this pattern in the loop vectorizer (part of reduction analysis) and do the needful. Thoughts ? -Thx Dibyendu
2018 Apr 07
0
SCEV and LoopStrengthReduction Formulae
...e, pseudocode): int add_delta_256(uint8 *in1, uint8 *in2) { int accum = 0; for (int i = 0; i < 16; ++i) { uint8x16 a = load16(in1 + i *16); // NOTE: takes an extra addressing op because x86 uint8x16 b = load16(in2 + i *16); // NOTE: takes an extra addressing op because x86 accum += psadbw(a, b); } return accum; } end of loop: inc i cmp i, 16 jl loop LSR’d code: int add_delta_256(uint8 *in1, uint8 *in2) { int accum = 0; for (int i = 0; i < 16; ++i, in1 += 16, in2 += 16) { uint8x16 a = load16(in1); uint8x16 b = load16(in2); accum += psadbw(a, b); } return ac...
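The two loop shapes in the pseudocode above can be written as compilable C to confirm that they compute the same value. This is a sketch under assumptions: sad16 is a hypothetical scalar stand-in for the psadbw call, and the function-name suffixes _indexed and _bumped are invented here to tell the two variants apart.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar stand-in for psadbw over one 16-byte block. */
static int sad16(const uint8_t *a, const uint8_t *b)
{
    assert(a != NULL && b != NULL);
    int sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Indexed form: every load address is recomputed as base + i * 16. */
static int add_delta_256_indexed(const uint8_t *in1, const uint8_t *in2)
{
    int accum = 0;
    for (int i = 0; i < 16; ++i)
        accum += sad16(in1 + i * 16, in2 + i * 16);
    return accum;
}

/* Strength-reduced form: both pointers advance by 16 per iteration. */
static int add_delta_256_bumped(const uint8_t *in1, const uint8_t *in2)
{
    int accum = 0;
    for (int i = 0; i < 16; ++i, in1 += 16, in2 += 16)
        accum += sad16(in1, in2);
    return accum;
}
```

Both functions read the same 256 bytes from each input; they differ only in how the addresses are formed, which is the part an LSR formula chooses between.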
2010 Jan 27
2
[LLVMdev] some llvm/clang missed optimizations
> Umm, can you find one that isn't a popcount implementation? Ok. MMX psadbw instruction: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml Position of first set bit: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml Log2 floor: http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/83/837A80E9.shtml Pixel f...
2009 Oct 13
3
Proposal for replacing asm code with intrinsics
...32bit environment is supported 3) No support for instruction sets newer than MMX My proposal is to replace all assembly functions with compiler intrinsics, which compile into 1-2 assembly instructions each and are much easier to maintain. For example: _mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction with compiler-allocated registers. And code like: psadbw mm4,mm5 paddw mm0,mm4 can be rewritten as __m64 mm0, mm4, mm5, mm6, mm7; // of course using meaningful names mm0 = _mm_add_epi16(mm0, _mm_sad_pu8(mm4, mm5)); The compiler will replace variables with actual registers, ensur...
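As a rough illustration of the intrinsic-based style proposed above, the following C sketch sums absolute differences over a byte buffer with the SSE2 intrinsic _mm_sad_epu8. The function name sad_u8, the requirement that the length be a multiple of 16, and the scalar fallback for builds without SSE2 are assumptions of this sketch, not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics, including _mm_sad_epu8 */
#endif

/* Sum of absolute differences over n bytes (hypothetical helper).
   n must be a multiple of 16 in this sketch. */
static uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* One psadbw yields two 64-bit partial sums; keep them per lane. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    /* Fold the two 64-bit lanes into a single scalar after the loop. */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return (uint32_t)(lanes[0] + lanes[1]);
#else
    /* Portable fallback with the same result. */
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
#endif
}
```

The compiler picks the registers for va, vb, and acc, which is the maintenance benefit the proposal describes; the unaligned loads are used so the sketch places no alignment requirement on the caller.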
2018 Apr 03
4
SCEV and LoopStrengthReduction Formulae
I am attempting to implement a minor loop strength reduction optimization for targets that support compare and jump fusion, specifically TTI::canMacroFuseCmp(). My approach might be wrong; however, I am soliciting the idea for feedback, so that I can implement this correctly. My plan is to add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes the following case, but perhaps
2010 Jan 27
2
[LLVMdev] some llvm/clang missed optimizations
>> Repetitive code with lots of bitwise operations is compiled by LLVM into >> much larger code than the other compilers: >> >> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml >> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml >> >> Note that this is straight-line code, so LLVM's output will
2010 Jan 27
0
[LLVMdev] some llvm/clang missed optimizations
On Tue, Jan 26, 2010 at 5:55 PM, John Regehr <regehr at cs.utah.edu> wrote: >>> Repetitive code with lots of bitwise operations is compiled by LLVM into >>> much larger code than the other compilers: >>> >>> >>> http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/ED/ED37DAF5.shtml >>> >>>
2010 Jan 27
0
[LLVMdev] some llvm/clang missed optimizations
On Tue, Jan 26, 2010 at 7:42 PM, John Regehr <regehr at cs.utah.edu> wrote: >> Umm, can you find one that isn't a popcount implementation? > > Ok. > > MMX psadbw instruction: > > http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/CE/CE3DA132.shtml > > Position of first set bit: > > http://embed.cs.utah.edu/embarrassing/jan_10/harvest/source/1F/1F4003C7.shtml > > Log2 floor: > > http://embed.cs.utah.edu/embarrassing/jan...
2005 Apr 19
0
mmx optimization
...i++) for (j = 0; j < 4; j++) sad += abs (mb->orig_mb[corner_x + i][corner_y + j] - mb->pred_mb[corner_x + i][corner_y + j]); return sad; } where mb->orig_mb and mb->pred_mb are arrays of short int and not unsigned char. I cannot therefore use psadbw, because it works on 8 bit data. I've currently rewritten the function in this way: si32 sad_4x4 (macroblock_t * mb, ui8 x, ui8 y) { zeros = _mm_setzero_si64 (); ones = _mm_set1_pi16 (1); orig = *((__m64*) &mb->orig_mb[corner_x][corner_y]); pred = *((__m64*) &mb->pre...
2016 May 28
4
sum elements in the vector
Hi Rail, The two revisions below, which detect SAD patterns and emit psadbw instructions on X86, might be of interest: http://reviews.llvm.org/D14840 http://reviews.llvm.org/D14897 Revisions for intrinsics related to absdiff: http://reviews.llvm.org/D10867 http://reviews.llvm.org/D11678 Hope this helps. Regards, Suyog On Sat, May 28, 2016 at 4:20 AM, Rail Shafigulin via llvm-dev < llvm-de...
2016 May 30
0
sum elements in the vector
...h is an example of core code)? I'd like to add this intrinsic with as little code change as possible. On Fri, May 27, 2016 at 8:59 PM, suyog sarda <sardask01 at gmail.com> wrote: > Hi Rail, > > The two revisions below, which detect SAD patterns and > emit psadbw instructions on X86, might be of interest: > > http://reviews.llvm.org/D14840 > http://reviews.llvm.org/D14897 > > Revisions for intrinsics related to absdiff: > > http://reviews.llvm.org/D10867 > http://reviews.llvm.org/D11678 > > Hope this helps. > > Regards, > Suyog > > On Sa...
2004 Aug 24
5
MMX/mmxext optimisations
quite some speed improvement indeed. attached the updated patch to apply to svn/trunk. j -------------- next part -------------- A non-text attachment was scrubbed... Name: theora-mmx.patch.gz Type: application/x-gzip Size: 8648 bytes Desc: not available Url : http://lists.xiph.org/pipermail/theora-dev/attachments/20040824/5a5f2731/theora-mmx.patch-0001.bin
2016 May 27
0
sum elements in the vector
Hi Shahid. Do you mind providing a concrete example of X86 code where an intrinsic was added (preferably with filenames and line numbers)? I'm having difficulty tracking down the steps you provided. Any help is appreciated. On Mon, Apr 4, 2016 at 9:02 PM, Shahid, Asghar-ahmad < Asghar-ahmad.Shahid at amd.com> wrote: > Hi Rail, > > > > We had done this for generation