search for: pmovmskb

Displaying 14 results from an estimated 14 matches for "pmovmskb".

2016 Jun 29
0
Question about VectorLegalizer::ExpandStore() with v4i1
...is consumed, it is best to look at what happens to v8i1. We can then let the same optimizer work to get the optimal ASM code out in the end, whether the vectorization factor is 4 or 8. In the end, I may be agreeing with Rob, but not for the reasons Rob mentioned. One of the headaches is that movmskps/pmovmskb do not have a quick reverse instruction (MIC-AVX512 and below). I do not know LLVM's X86 CodeGen well enough to say whether it internally has mask-to/from-vector nodes. If it has, I'd hope X86 CodeGen can cancel out such things in a peephole manner very efficiently so that blindly going for i1-p...
2012 Sep 05
0
[LLVMdev] branch on vector compare?
...eting altivec, probably no such issue as I think it doesn't have such blatantly missing shuffle instructions). But yes, ptest looks like the obvious winner. For CPUs not having sse41 (and there are tons of them still in use, not to mention still sold) it would be nice if llvm could come up with pmovmskb/movmskps/movmskpd + test (these instructions look like they were intended for exactly that use case after all). But the <4 x i8> sign-extend solution shouldn't hurt performance too much either, if you've got ssse3. Roland
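The "movmskps + test" idea from this snippet can be sketched with SSE intrinsics. This is only an illustration of the instruction pattern, not what LLVM emits; the helper name `any_lane_gt` and its pointer-based signature are assumptions made for the example.

```c
#include <xmmintrin.h>  /* SSE: _mm_cmpgt_ps, _mm_movemask_ps */

/* Hypothetical helper: returns 1 if any of the four lanes of a is
   greater than the matching lane of b, else 0. */
static int any_lane_gt(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 gt = _mm_cmpgt_ps(va, vb);  /* lane is all ones where a > b, else 0 */
    int mask = _mm_movemask_ps(gt);    /* movmskps: lane sign bits -> low 4 bits */
    return mask != 0;                  /* a scalar test the compiler can branch on */
}
```

No SSE4.1 is needed here, which is the point made about older CPUs that lack ptest.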
2017 Sep 25
3
What should a truncating store do?
...t definition of bitcast.  And in fact, the backend will lower a bitcast to a store+load to a stack temporary in cases where there isn't some other lowering specified. The end result is probably going to be pretty inefficient unless your target has a special instruction to handle it (x86 has pmovmskb for i1 vector bitcasts, but otherwise you probably end up with some terrible lowering involving a lot of shifts). > This also reminded me of the following test case that is in trunk: >  test/CodeGen/X86/pr20011.ll > > %destTy = type { i2, i2 } > > define void @crash(i64 %x0, i...
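As a rough illustration of the remark that pmovmskb covers i1 vector bitcasts: a byte-wise compare produces sixteen all-ones or all-zeros lanes, and pmovmskb packs one bit per lane into an integer, which is approximately what a bitcast from <16 x i1> to i16 asks for. The helper name `lanes_negative_mask` is hypothetical, and the sketch shows the instruction's behavior, not LLVM's actual lowering.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmplt_epi8, _mm_movemask_epi8 */
#include <stdint.h>

/* Hypothetical helper: bit i of the result is set when p[i] < 0. */
static uint16_t lanes_negative_mask(const int8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i neg = _mm_cmplt_epi8(v, _mm_setzero_si128()); /* 0xFF where p[i] < 0 */
    return (uint16_t)_mm_movemask_epi8(neg);              /* pmovmskb: 16 lanes -> 16 bits */
}
```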
2012 Sep 04
2
[LLVMdev] branch on vector compare?
Roland Scheidegger <sroland <at> vmware.com> writes: > This looks quite similar to something I filed a bug on (12312). Michael > Liao submitted fixes for this, so I think > if you change it to > %16 = fcmp ogt <4 x float> %15, %cr > %17 = sext <4 x i1> %16 to <4 x i32> > %18 = bitcast <4 x i32> %17 to i128 > %19 = icmp ne i128 %18, 0
2017 Sep 25
0
What should a truncating store do?
(Not sure if this exactly maps to “truncating store”, but I think it at least touches some of the subjects discussed in this thread) Our out-of-tree target needs several patches to get things working correctly for us. We have introduced i24 and i40 types in ValueTypes/MachineValueTypes (in addition to the normal pow-of-2 types). And we have vectors of those (v2i40, v4i40). And the byte size in our
2017 Sep 25
0
What should a truncating store do?
...ight definition of bitcast. And in fact, the backend will lower a bitcast to a store+load to a stack temporary in cases where there isn't some other lowering specified. The end result is probably going to be pretty inefficient unless your target has a special instruction to handle it (x86 has pmovmskb for i1 vector bitcasts, but otherwise you probably end up with some terrible lowering involving a lot of shifts). This also reminded me of the following test case that is in trunk: test/CodeGen/X86/pr20011.ll %destTy = type { i2, i2 } define void @crash(i64 %x0, i64 %y0, %destTy* nocapture %de...
2016 Jun 28
2
Question about VectorLegalizer::ExpandStore() with v4i1
On Tue, Jun 28, 2016 at 2:45 AM, jingu kang via llvm-dev <llvm-dev at lists.llvm.org> wrote: > Hi All, > > Can someone comment below question whether it is wrong or not please? > > 2016-06-25 7:52 GMT+01:00 jingu kang <jaykang10 at gmail.com>: >> Hi All, >> >> I have a problem with VectorLegalizer::ExpandStore() with v4i1. >> >> Let's
2018 Apr 26
2
windows ABI problem with i128?
...mov %rdx,-0x28(%rbp) 63: 48 89 45 d0 mov %rax,-0x30(%rbp) 67: c5 fa 6f 45 d0 vmovdqu -0x30(%rbp),%xmm0 6c: c5 fa 6f 4d e0 vmovdqu -0x20(%rbp),%xmm1 71: c5 f9 74 c1 vpcmpeqb %xmm1,%xmm0,%xmm0 75: c5 79 d7 c0 vpmovmskb %xmm0,%r8d 79: 41 81 e8 ff ff 00 00 sub $0xffff,%r8d 80: 44 89 45 cc mov %r8d,-0x34(%rbp) 84: 74 06 je 8c <_start+0x7c> 86: eb 00 jmp 88 <_start+0x78> 88: eb 00 jmp 8a <...
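The vpcmpeqb / vpmovmskb / sub $0xffff sequence in this listing looks like a 128-bit equality test: compare all sixteen bytes, gather the per-byte results into a 16-bit mask, and check whether every bit is set. A minimal sketch with SSE2 intrinsics follows; the helper name `equal_128` is an assumption, and the code is not taken from the thread.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */

/* Hypothetical helper: returns 1 when the 16 bytes at a and b are identical. */
static int equal_128(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);     /* pcmpeqb: 0xFF where bytes match */
    return _mm_movemask_epi8(eq) == 0xffff;  /* all 16 mask bits set -> equal */
}
```

Subtracting 0xffff and branching on zero, as in the disassembly, is the same check as comparing the mask against 0xffff.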
2018 Apr 26
0
windows ABI problem with i128?
...> 63: 48 89 45 d0 mov %rax,-0x30(%rbp) > 67: c5 fa 6f 45 d0 vmovdqu -0x30(%rbp),%xmm0 > 6c: c5 fa 6f 4d e0 vmovdqu -0x20(%rbp),%xmm1 > 71: c5 f9 74 c1 vpcmpeqb %xmm1,%xmm0,%xmm0 > 75: c5 79 d7 c0 vpmovmskb %xmm0,%r8d > 79: 41 81 e8 ff ff 00 00 sub $0xffff,%r8d > 80: 44 89 45 cc mov %r8d,-0x34(%rbp) > 84: 74 06 je 8c <_start+0x7c> > 86: eb 00 jmp 88 <_start+0x78> > 88: eb 00...
2018 Apr 26
1
windows ABI problem with i128?
...45 d0 mov %rax,-0x30(%rbp) > > 67: c5 fa 6f 45 d0 vmovdqu -0x30(%rbp),%xmm0 > > 6c: c5 fa 6f 4d e0 vmovdqu -0x20(%rbp),%xmm1 > > 71: c5 f9 74 c1 vpcmpeqb %xmm1,%xmm0,%xmm0 > > 75: c5 79 d7 c0 vpmovmskb %xmm0,%r8d > > 79: 41 81 e8 ff ff 00 00 sub $0xffff,%r8d > > 80: 44 89 45 cc mov %r8d,-0x34(%rbp) > > 84: 74 06 je 8c <_start+0x7c> > > 86: eb 00 jmp 88 <_start+0x78> > >...
2017 Sep 15
2
What should a truncating store do?
They are starting to look complicated. The patch linked is interesting, perhaps v1 vectors are special cased. It shouldn't be too onerous to work out what one or two in tree back ends do by experimentation. Thanks again, it's great to have context beyond the source. On Fri, Sep 15, 2017 at 9:41 PM, Friedman, Eli <efriedma at codeaurora.org> wrote: > On 9/15/2017 12:10 PM, Jon
2013 Jun 29
2
[PATCH] nv50: H.264/MPEG2 decoding support via VP2, available on NV84-NV96, NVA0
...one (mine). SSE4.2 appeared on Core i7 (Nehalem), AFAIK. Other optimizations may be possible with just SSE2, but I couldn't think of anything particularly clever. Those PCMPESTRM instructions are _really_ useful to skip over the useless 0's. I guess something could be done with PCMPEQW and PMOVMSKB (wish there were a PMOVMSKW...). I'll play with it. But the plain-C loop isn't so bad either. > >> out. (It gets tricky because a lot of the data is 0's, so it's unclear whether >> it's faster to use SSE to do operations on everything or one-at-a-time on the >...
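One way the PCMPEQW plus PMOVMSKB idea might look with plain SSE2 is sketched below. Since there is no PMOVMSKW, the byte mask carries two bits per 16-bit word, so only every second bit needs checking. The helper name `first_nonzero_word` and the choice to return 8 for an all-zero block are assumptions for the example, not something from the patch.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi16, _mm_movemask_epi8 */
#include <stdint.h>

/* Hypothetical helper: index (0..7) of the first nonzero 16-bit word in a
   block of eight, or 8 if the whole block is zero and can be skipped. */
static int first_nonzero_word(const int16_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i eqz = _mm_cmpeq_epi16(v, _mm_setzero_si128()); /* pcmpeqw against 0 */
    unsigned zero_mask = (unsigned)_mm_movemask_epi8(eqz); /* 2 bits per word */
    unsigned nonzero = ~zero_mask & 0xffffu;

    for (int i = 0; i < 8; i++) {
        if (nonzero & (1u << (2 * i)))  /* low bit of each 2-bit pair */
            return i;
    }
    return 8;
}
```

Whether this beats the plain-C loop depends on how sparse the data is, which is the open question raised in the snippet.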
2013 Jun 30
0
[PATCH] nv50: H.264/MPEG2 decoding support via VP2, available on NV84-NV96, NVA0
...appeared on Core i7 (Nehalem), AFAIK. > Other optimizations may be possible with just SSE2, but I couldn't > think of anything particularly clever. Those PCMPESTRM instructions > are _really_ useful to skip over the useless 0's. I guess something > could be done with PCMPEQW and PMOVMSKB (wish there were a > PMOVMSKW...). I'll play with it. But the plain-C loop isn't so bad > either. > >> >>> out. (It gets tricky because a lot of the data is 0's, so it's unclear whether >>> it's faster to use SSE to do operations on everything o...
2013 Jun 27
4
[PATCH] nv50: H.264/MPEG2 decoding support via VP2, available on NV84-NV96, NVA0
Adds H.264 and MPEG2 codec support via VP2, using firmware from the blob. Acceleration is supported at the bitstream level for H.264 and IDCT level for MPEG2. Known issues: - H.264 interlaced doesn't render properly - H.264 shows very occasional artifacts on a small fraction of videos - MPEG2 + VDPAU shows frequent but small artifacts, which aren't there when using XvMC on the same