search for: vpxor

Displaying 18 results from an estimated 18 matches for "vpxor".

2017 Oct 11
1
[PATCH v1 01/27] x86/crypto: Adapt assembly for PIE support
...\
@@ -482,7 +482,7 @@ ENDPROC(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
 #define inpack16_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
                      y6, y7, rio, key) \
        vmovq key, x0; \
-       vpshufb .Lpack_bswap, x0, x0; \
+       vpshufb .Lpack_bswap(%rip), x0, x0; \
        \
        vpxor 0 * 16(rio), x0, y7; \
        vpxor 1 * 16(rio), x0, y6; \
@@ -533,7 +533,7 @@ ENDPROC(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
        vmovdqu x0, stack_tmp0; \
        \
        vmovq key, x0; \
-       vpshufb .Lpack_bswap, x0, x0; \
+       vpshufb .Lpack_bswap(%rip), x0, x0; \
        \
        vpxor x0, y7, y7; \
        vp...
2018 Jul 24
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...that the vectorizer produces if you add '#pragma clang loop >> vectorize(enable) vectorize_width(16)' above the loop? I tried it in your >> godbolt example and the generated code looks very similar to the >> icc-generated code. >> > > It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily > carried across the loop. It's then redundantly added twice in the reduction > after the loop despite it being 0. This happens because we basically > tricked the backend into generating a 256-bit vpmaddwd concated with a > 256-bit zero vec...
2018 Jul 23
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...ffers from the > code that the vectorizer produces if you add '#pragma clang loop > vectorize(enable) vectorize_width(16)' above the loop? I tried it in your > godbolt example and the generated code looks very similar to the > icc-generated code. > It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily carried across the loop. It's then redundantly added twice in the reduction after the loop despite it being 0. This happens because we basically tricked the backend into generating a 256-bit vpmaddwd concated with a 256-bit zero vector going into a 512...
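For reference, the loop discussed in the two excerpts above (the real source sits behind the godbolt link) multiplies sign-extended 16-bit values into a 32-bit accumulator. A minimal C sketch with the quoted pragma applied might look like the following; the function and parameter names are assumptions, not the actual godbolt source:

    /* Hypothetical reconstruction of the dot product reduction loop under discussion. */
    int dot(const short *a, const short *b, int n) {
        int sum = 0;
    #pragma clang loop vectorize(enable) vectorize_width(16)
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];  /* shorts promote to int: a 16x16->32 multiply-accumulate */
        return sum;
    }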
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v2: - Adapt patch to work post KPTI and compiler changes - Redo all performance testing with latest configs and compilers - Simplify mov macro on PIE (MOVABS now) - Reduce GOT footprint - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v2: - Adapt patch to work post KPTI and compiler changes - Redo all performance testing with latest configs and compilers - Simplify mov macro on PIE (MOVABS now) - Reduce GOT footprint - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce
2017 Oct 04
28
x86: PIE support and option to extend KASLR randomization
These patches make the changes necessary to build the kernel as a Position Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below the top 2G of the virtual address space, which allows the KASLR randomization range to be optionally extended from 1G to 3G. Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler changes, PIE support and KASLR in general. Thanks to
2017 Oct 04
28
x86: PIE support and option to extend KASLR randomization
These patches make the changes necessary to build the kernel as a Position Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below the top 2G of the virtual address space, which allows the KASLR randomization range to be optionally extended from 1G to 3G. Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler changes, PIE support and KASLR in general. Thanks to
2018 May 23
33
[PATCH v3 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v3: - Update on message to describe longer term PIE goal. - Minor change on ftrace if condition. - Changed code using xchgq. - patch v2: - Adapt patch to work post KPTI and compiler changes - Redo all performance testing with latest configs and compilers - Simplify mov macro on PIE (MOVABS now) - Reduce GOT footprint - patch v1: - Simplify ftrace
2017 Oct 11
32
[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce dynamic relocation space on mapped memory. It also simplifies the relocation process. - Move the start of the module section next to the kernel. Remove the need for -mcmodel=large on modules. Extends
2017 Oct 11
32
[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce dynamic relocation space on mapped memory. It also simplifies the relocation process. - Move the start of the module section next to the kernel. Remove the need for -mcmodel=large on modules. Extends
2015 Jun 26
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
...4[0]
 vpcmpgtq %xmm3, %xmm4, %xmm3
 vptest %xmm3, %xmm3
 je .LBB10_66
# BB#5: # %for.body.preheader
 vpaddq %xmm15, %xmm2, %xmm3
 vpand %xmm15, %xmm3, %xmm3
 vpaddq .LCPI10_1(%rip), %xmm3, %xmm8
 vpand .LCPI10_5(%rip), %xmm8, %xmm5
 vpxor %xmm4, %xmm4, %xmm4
 vpcmpeqq %xmm4, %xmm5, %xmm6
 vptest %xmm6, %xmm6
 jne .LBB10_9
It turned out that the vector one is way more complicated than the scalar one. I was expecting it would not be so tedious.
On Fri, Jun 26, 2015 at 3:49 AM, suyog sarda <sardask01 at gmail....
2015 Jul 24
0
[LLVMdev] SIMD for sdiv <2 x i64>
...$63, %rax
 sarq $2, %rdx
 addq %rax, %rdx
 vmovq %rdx, %xmm1
 vmovq %xmm0, %rax
 imulq %rcx
 movq %rdx, %rax
 shrq $63, %rax
 sarq $2, %rdx
 addq %rax, %rdx
 vmovq %rdx, %xmm0
 vpunpcklqdq %xmm1, %xmm0, %xmm1 # xmm1 = xmm0[0],xmm1[0]
 vpxor %xmm4, %xmm1, %xmm0
 vpcmpgtq %xmm6, %xmm0, %xmm0
 vptest %xmm0, %xmm0
 je .LBB582_49
Thanks,
Zhi
On Fri, Jul 24, 2015 at 10:16 AM, Philip Reames <listmail at philipreames.com> wrote: > > > On 07/24/2015 03:42 AM, Benjamin Kramer wrote: > > On 24.07.2015, at...
2015 Jul 24
1
[LLVMdev] SIMD for sdiv <2 x i64>
..., %rdx
> vmovq %rdx, %xmm1
> vmovq %xmm0, %rax
> imulq %rcx
> movq %rdx, %rax
> shrq $63, %rax
> sarq $2, %rdx
> addq %rax, %rdx
> vmovq %rdx, %xmm0
> vpunpcklqdq %xmm1, %xmm0, %xmm1 # xmm1 = xmm0[0],xmm1[0]
> vpxor %xmm4, %xmm1, %xmm0
> vpcmpgtq %xmm6, %xmm0, %xmm0
> vptest %xmm0, %xmm0
> je .LBB582_49
>
> Thanks,
> Zhi
>
> On Fri, Jul 24, 2015 at 10:16 AM, Philip Reames
> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: >...
2015 Jul 24
2
[LLVMdev] SIMD for sdiv <2 x i64>
On 07/24/2015 03:42 AM, Benjamin Kramer wrote: >> On 24.07.2015, at 08:06, zhi chen <zchenhn at gmail.com> wrote: >> >> It seems that it's hard to vectorize int64 in LLVM. For example, LLVM 3.4 generates very complicated code for the following IR. I am running on a Haswell processor. Is it because there are no alternative AVX/AVX2 instructions for int64? The same thing
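For context, the operation behind the quoted sdiv <2 x i64> assembly, written as plain C; the divisor and the names below are illustrative assumptions. x86 has no packed 64-bit integer divide, so each lane ends up scalarized, and a constant divisor is lowered to the multiply-high / shrq $63 / sarq sequence visible in the excerpts above:

    #include <stdint.h>

    /* Illustrative only: signed division of both lanes of a <2 x i64> by a
     * constant (5 is an arbitrary example divisor). */
    void sdiv2(int64_t dst[2], const int64_t src[2]) {
        for (int i = 0; i < 2; ++i)
            dst[i] = src[i] / 5;
    }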
2018 Jul 23
3
[LoopVectorizer] Improving the performance of dot product reduction loop
Hello all, This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The
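The thread is about getting vpmaddwd-style code out of the x86 backend for this reduction (see the excerpts above). As an illustration only, not the vectorizer's output, a hand-written AVX2 intrinsics sketch of that vpmaddwd pattern, assuming the trip count is a multiple of 16:

    #include <immintrin.h>
    #include <stdint.h>

    /* Hand-written sketch of the vpmaddwd-based dot product; n is assumed
     * to be a multiple of 16. */
    int32_t dot_madd(const int16_t *a, const int16_t *b, int n) {
        __m256i acc = _mm256_setzero_si256();
        for (int i = 0; i < n; i += 16) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            /* vpmaddwd: 16 i16*i16 products, pairwise-summed into 8 i32 lanes */
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
        }
        /* horizontal sum of the 8 i32 lanes */
        __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                  _mm256_extracti128_si256(acc, 1));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
        return _mm_cvtsi128_si32(s);
    }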
2015 Jun 24
2
[LLVMdev] Can LLVM vectorize <2 x i32> type
Hi, Is LLVM able to generate code for the following? %mul = mul <2 x i32> %1, %2, where %1 and %2 are of <2 x i32> type. I am running it on a Haswell processor with LLVM-3.4.2. It seems that it generates really complicated code with vpaddq, vpmuludq, vpsllq, vpsrlq. Thanks, Zhi
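A C equivalent of the <2 x i32> multiply being asked about (illustrative only; the question concerns hand-written IR, so this source and its names are assumptions):

    #include <stdint.h>

    /* Two independent 32-bit multiplies, i.e. the scalar form of
     * %mul = mul <2 x i32> %1, %2. */
    void mul2(int32_t dst[2], const int32_t a[2], const int32_t b[2]) {
        for (int i = 0; i < 2; ++i)
            dst[i] = a[i] * b[i];
    }

The vpaddq/vpmuludq/vpsllq/vpsrlq sequence complained about suggests the <2 x i32> operands are being widened to 64-bit lanes and the 64-bit multiply expanded from 32-bit pieces, rather than a single packed 32-bit multiply being emitted.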
2014 Sep 10
13
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com> wrote: > Awesome, thanks for all the information! > > See below: > > On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> > wrote: >> >> You have already mentioned how the new shuffle lowering is missing >> some features; for example, you explicitly
2013 Oct 15
0
[LLVMdev] [llvm-commits] r192750 - Enable MI Sched for x86.
...18:33:07 2013
>> @@ -5,8 +5,8 @@
>> ; It's hard to test for the ISEL condition because CodeGen optimizes
>> ; away the bugpointed code. Just ensure the basics are still there.
>> ;CHECK-LABEL: func:
>> -;CHECK: vxorps
>> -;CHECK: vinsertf128
>> +;CHECK: vpxor
>> +;CHECK: vinserti128
>> ;CHECK: vpshufd
>> ;CHECK: vpshufd
>> ;CHECK: vmulps
>>
>> Modified: llvm/trunk/test/CodeGen/X86/3addr-16bit.ll
>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/3addr-16bit.ll?rev=192750&r1=192749&r...