Roland Scheidegger <sroland <at> vmware.com> writes:

> This looks quite similar to something I filed a bug on (12312). Michael
> Liao submitted fixes for this, so I think if you change it to
>
>   %16 = fcmp ogt <4 x float> %15, %cr
>   %17 = sext <4 x i1> %16 to <4 x i32>
>   %18 = bitcast <4 x i32> %17 to i128
>   %19 = icmp ne i128 %18, 0
>   br i1 %19, label %true1, label %false2
>
> it should do the trick (one cmpps + one ptest + one br instruction).
> This, however, requires sse41, which I don't know if you have - you say
> the extractelements go through memory, which I've never seen; then again,
> our code didn't try to extract the i1 directly. (Even without fixes for
> ptest, the above sequence will result in only 2 extraction steps instead
> of 4 if you're on x64 and the cpu supports sse41, but I guess without
> sse41, and hence no pextrd/q, it probably also will go through memory.)
> On altivec this sequence might not produce anything good, though; the
> free sext requires llvm 2.7 on x86 to work at all (certainly shouldn't
> be a problem nowadays, but on other backends it might be different), and
> for the ptest sequence very recent svn is required.
> I don't think the current code can generate movmskps + test (probably
> the next best thing without sse41) instead of ptest, though, if you've
> only got sse.

Thanks Roland, sign extending gets me part of the way at least.
I'm on version 3.1 and, as you say in the bug report, there are a
few extraneous instructions. For the record, casting to a <4 x i8>
seems to do a better job for x86 (shuffle, movd, test, jump). Using
<4 x i32> seems to issue a pextrd for each element. For x64, it seems
to be the same for either. I suppose it's all academic seeing as the
ptest patch looks good.

Looking at it again, I'm not sure how I saw memory spills; certainly
I can't reproduce them without using -O0. It's possible I did that
accidentally when investigating the issue.

Thanks, Stephen.
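[A minimal sketch of the <4 x i8> variant described above, reusing the
value and label names from Roland's example; with the narrower lanes the
mask bitcasts to i32 rather than i128:]

  %16 = fcmp ogt <4 x float> %15, %cr
  %17 = sext <4 x i1> %16 to <4 x i8>
  %18 = bitcast <4 x i8> %17 to i32
  %19 = icmp ne i32 %18, 0
  br i1 %19, label %true1, label %false2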
On 05.09.2012 00:24, Stephen wrote:
> Roland Scheidegger <sroland <at> vmware.com> writes:
>> [...]
>> I don't think the current code can generate movmskps + test (probably
>> the next best thing without sse41) instead of ptest, though, if you've
>> only got sse.
>
> Thanks Roland, sign extending gets me part of the way at least.
> I'm on version 3.1 and, as you say in the bug report, there are a
> few extraneous instructions. For the record, casting to a <4 x i8>
> seems to do a better job for x86 (shuffle, movd, test, jump). Using
> <4 x i32> seems to issue a pextrd for each element. For x64, it seems
> to be the same for either. I suppose it's all academic seeing as the
> ptest patch looks good.

Yes, the <4 x i8> cast looks like a good idea. Just be careful if you
also need to target cpus without ssse3; IIRC, without pshufb this will
create some horrible code (though that could have been with an older
llvm version). Then again, if you don't have ssse3 you also won't have
pextrd, which means more shuffling to extract the values if you
sign-extend them to <4 x i32> too. (If you're targeting altivec, there's
probably no such issue, as I think it doesn't have such blatantly
missing shuffle instructions.)

But yes, ptest looks like the obvious winner. For cpus without sse41
(and there are tons of them still in use, not to mention still being
sold) it would be nice if llvm could come up with
pmovmskb/movmskps/movmskpd + test; these instructions look like they
were intended for exactly that use case, after all. Still, the <4 x i8>
sign-extend solution shouldn't hurt performance too much either, if
you've got ssse3.

Roland
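[A minimal sketch of forcing the movmskps + test sequence by hand on
SSE-only targets, using the llvm.x86.sse.movmsk.ps intrinsic instead of
relying on the backend to pattern-match it; the function and label names
are assumed for illustration:]

  declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>)

  define void @branch_if_any(<4 x float> %v, <4 x float> %cr) {
  entry:
    %cmp  = fcmp ogt <4 x float> %v, %cr
    %ext  = sext <4 x i1> %cmp to <4 x i32>
    %asf  = bitcast <4 x i32> %ext to <4 x float>   ; movmskps takes floats
    %mask = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %asf)
    %any  = icmp ne i32 %mask, 0                    ; any sign bit set?
    br i1 %any, label %true1, label %false2
  true1:
    ret void
  false2:
    ret void
  }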
On Wed, Sep 5, 2012 at 9:07 AM, Roland Scheidegger <sroland at vmware.com> wrote:
> On 05.09.2012 00:24, Stephen wrote:
>> [...]
>
> [...]
> But yes, ptest looks like the obvious winner. For cpus without sse41
> (and there are tons of them still in use, not to mention still being
> sold) it would be nice if llvm could come up with
> pmovmskb/movmskps/movmskpd + test; these instructions look like they
> were intended for exactly that use case, after all. Still, the <4 x i8>
> sign-extend solution shouldn't hurt performance too much either, if
> you've got ssse3.

If all you need is to test whether any flag is set among the elements,
we could add pseudo PTEST support for CPUs without SSE4.1, i.e. we could
replace

  cmpltps %xmm0, %xmm1
  ptest   %xmm1, %xmm1
  jz      LABEL

with

  cmpltps  %xmm0, %xmm1
  movmskps %xmm1, %r8d
  test     %r8d, %r8d
  jz       LABEL

That looks much more efficient to me and relies only on SSE. But we have
to ensure the 2 operands to PTEST are the same and that the value is
generated from a packed CMP; I am figuring out how to simplify the
checking of these pre-conditions.

One more, off-topic, issue: most vector IR so far operates element-wise,
i.e. vertically.
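[For the complementary "all lanes true" test - which ptest can also
express, with suitable operands, via its carry flag - a movmskps-based
sketch under the same SSE-only assumption; names are illustrative, not
from Michael's patch:]

  declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>)

  define i1 @all_lanes(<4 x float> %a, <4 x float> %b) {
    %cmp  = fcmp olt <4 x float> %a, %b
    %ext  = sext <4 x i1> %cmp to <4 x i32>
    %asf  = bitcast <4 x i32> %ext to <4 x float>
    %mask = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %asf)
    %all  = icmp eq i32 %mask, 15   ; all four sign bits set
    ret i1 %all
  }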
The generalized issue, from here and from PR12312, is that we have no
simple way to express horizontal operations, e.g. primitives like

  float %s = reduce fadd <N x float> %x
  i32   %m = reduce max <N x i32> %x
  i1    %c = any <N x i1> %x    ; or: i1 %c = reduce or <N x i1> %x
  i1    %c = all <N x i1> %x    ; or: i1 %c = reduce and <N x i1> %x

One more interesting example would be scan - a horizontal operation that
still produces a vector:

  <N x i32> %s = scan add <N x i32> %x, 0  ; exclusive scan
  <N x i32> %s = scan add <N x i32> %x, 1  ; inclusive scan

With these primitives, some workloads could be expressed more simply in
IR, and backends (like X86) could support some of them directly.

- michael
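[Absent such primitives, a horizontal reduction has to be spelled out as
a log2-depth shuffle tree; a minimal sketch of "reduce add" for
<4 x i32>, with assumed names and not taken from any actual proposal:]

  define i32 @reduce_add(<4 x i32> %x) {
    ; fold the upper half onto the lower half
    %hi1 = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
    %s1  = add <4 x i32> %x, %hi1      ; lanes 0+2, 1+3
    ; fold lane 1 onto lane 0
    %hi2 = shufflevector <4 x i32> %s1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
    %s2  = add <4 x i32> %s1, %hi2     ; lane 0 now holds the total
    %r   = extractelement <4 x i32> %s2, i32 0
    ret i32 %r
  }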