search for: avx512bw

Displaying 17 results from an estimated 17 matches for "avx512bw".

2016 Oct 20
2
[AVX512BW] Nasty KAND issue
Hey guys, I've hit a pretty nasty issue on SKX with ANDs of masks <= 4 bits. In the IR, we represent a 4b vector mask as <4 x i1>. This assumes that the storage container for this type is also 4b, but it's not. The smallest mask register on SKX is 8b. This also implies that the smallest load/store moves 8b. We run into problems when we try to optimize ANDs (full test case
2020 May 18
2
Use Galois field New Instructions (GFNI) to combine affine instructions
On 5/18/20 8:24 PM, Craig Topper wrote: > I can tell you that your avx512 issue is that v64i8 gfni instructions also > require avx512bw to be enabled to make v64i8 a supported type. The C > intrinsics handling in the front end knows this rule. But since you > generated your own intrinsics you bypassed that. Indeed that's the issue... I was sticking with what Intel announces here (https://software.intel.com/sites/landingpage/...
2016 Oct 20
2
[AVX512BW] Nasty KAND issue
On Thu, Oct 20, 2016 at 12:05 PM, Mehdi Amini <mehdi.amini at apple.com> wrote: > >> On Oct 20, 2016, at 8:54 AM, Cameron McInally via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> Hey guys, >> >> I've hit a pretty nasty issue on SKX with ANDs of masks <= 4 bits. >> >> In the IR, we represent a 4b vector mask as <4 x i1>.
2016 Oct 20
2
[AVX512BW] Nasty KAND issue
On 10/20/2016 9:28 AM, Cameron McInally via llvm-dev wrote: > I should have attached the generated asm to save some trouble. > Apologies for that and attaching now... > > > > On Thu, Oct 20, 2016 at 12:26 PM, Cameron McInally > <cameron.mcinally at nyu.edu> wrote: >> On Thu, Oct 20, 2016 at 12:05 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
2017 Sep 30
2
invalid code generated on Windows x86_64 using skylake-specific features
...dows laptop that I am testing on, I get these values: target_specific_cpu_args: skylake target_specific_features: +sse2,+cx16,-tbm,-avx512ifma,-avx512dq,-fma4,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,-avx512er,-avx512vpopcntdq,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,-xop,+rdseed,-sse4a,-avx512bw,+clflushopt,+xsave,-avx512vl,-avx512cd,+avx,-rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,+sgx,+cmov,-avx512vbmi,+movbe,+xsaveopt,-sha,+adx,-avx512pf,+sse3 It successfully creates a binary, but the binary when run crashes with: Unhandled except...
2016 Jun 29
0
Question about VectorLegalizer::ExpandStore() with v4i1
...4bit-per-elem has to happen. We need to minimize conversion between 0/1 logic and 0/-1 logic, and also conversion between different element sizes. Doing so for AVX2 and below is challenging enough. Introduction of AVX512F in Xeon Phi added another challenge to the vectorizer developers. Addition of AVX512BW and VL should make it easier. Without AVX512BW and VL (i.e., all of today's x86 targets), optimal representation of the result of compare is determined by how it is consumed, and it is not a good idea to have such optimization in multiple different places. If the legalizer has to blindly legal...
2017 Oct 01
1
invalid code generated on Windows x86_64 using skylake-specific features
...get these values: > > target_specific_cpu_args: skylake > > target_specific_features: +sse2,+cx16,-tbm,-avx512ifma,- > avx512dq,-fma4,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes, > +xsaves,-avx512er,-avx512vpopcntdq,-clwb,-avx512f,-clzero,-pku,+mmx,- > lwp,-xop,+rdseed,-sse4a,-avx512bw,+clflushopt,+xsave,- > avx512vl,-avx512cd,+avx,-rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4. > 1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ > ssse3,+sgx,+cmov,-avx512vbmi,+movbe,+xsaveopt,-sha,+adx,-avx512pf,+sse3 > > > It successfully creates a binary, but the binary when run...
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
...prints the assembler. I stumbled on this since the result of an actual calculation was wrong. So, it's not only the text version of the assembler also the machine assembler is wrong. When I execute the exploit program on an Intel KNL the following output is produced: CPU name = knl -sse4a,-avx512bw,cx16,-tbm,xsave,-fma4,-avx512vl,prfchw,bmi2,adx,-xsavec,fsgsbase,avx,avx512cd,avx512pf,-rtm,popcnt,fma,bmi,aes,rdrnd,-xsaves,sse4.1,sse4.2,avx2,avx512er,sse,lzcnt,pclmul,avx512f,f16c,ssse3,mmx,-pku,cmov,-xop,rdseed,movbe,-hle,xsaveopt,-sha,sse2,sse3,-avx512dq, Assembly: .text .file "...
2016 Jun 29
0
avx512 JIT backend generates wrong code on <4 x float>
...the result of an > actual calculation was wrong. So, it's not only the text version of > the > assembler also the machine assembler is wrong. > > When I execute the exploit program on an Intel KNL the following > output > is produced: > > CPU name = knl > -sse4a,-avx512bw,cx16,-tbm,xsave,-fma4,-avx512vl,prfchw,bmi2,adx,-xsavec,fsgsbase,avx,avx512cd,avx512pf,-rtm,popcnt,fma,bmi,aes,rdrnd,-xsaves,sse4.1,sse4.2,avx2,avx512er,sse,lzcnt,pclmul,avx512f,f16c,ssse3,mmx,-pku,cmov,-xop,rdseed,movbe,-hle,xsaveopt,-sha,sse2,sse3,-avx512dq, > Assembly: > .text >...
2016 Jun 30
1
avx512 JIT backend generates wrong code on <4 x float>
...lation was wrong. So, it's not only the text version of >> the >> assembler also the machine assembler is wrong. >> >> When I execute the exploit program on an Intel KNL the following >> output >> is produced: >> >> CPU name = knl >> -sse4a,-avx512bw,cx16,-tbm,xsave,-fma4,-avx512vl,prfchw,bmi2,adx,-xsavec,fsgsbase,avx,avx512cd,avx512pf,-rtm,popcnt,fma,bmi,aes,rdrnd,-xsaves,sse4.1,sse4.2,avx2,avx512er,sse,lzcnt,pclmul,avx512f,f16c,ssse3,mmx,-pku,cmov,-xop,rdseed,movbe,-hle,xsaveopt,-sha,sse2,sse3,-avx512dq, >> Assembly: >> .tex...
2020 May 18
2
Use Galois field New Instructions (GFNI) to combine affine instructions
Hello everyone, On the last couple of days, I have been experimenting with teaching LLVM how to combine a set of affine instructions into an instruction that uses the GFNI [1] AVX512 extension, especially GF2P8AFFINEQB [2]. While the general idea seems to work, I have some questions about my current implementation (see below). FTR, I have named this transformation AffineCombineExpr (ACE).
2017 Feb 01
2
RFC: Generic IR reductions
...I'm okay with an intrinsic function call, and I heard that's a reasonable step to get to instruction. Let's say someone comes up with 1024bit vector working on char data. Nobody is really happy to see a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem of half that size (only a few instructions less). I don't think anything that is proportional to "LOG(#elems)" is "nice and concise". Such a representation is also useful inside of vectorized loop if the programmer wants bitwise ident...
2017 Oct 03
2
invalid code generated on Windows x86_64 using skylake-specific features
...fic_cpu_args: skylake >>> >>> target_specific_features: +sse2,+cx16,-tbm,-avx512ifma,- >>> avx512dq,-fma4,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes, >>> +xsaves,-avx512er,-avx512vpopcntdq,-clwb,-avx512f,-clzero,-p >>> ku,+mmx,-lwp,-xop,+rdseed,-sse4a,-avx512bw,+clflushopt,+xsav >>> e,-avx512vl,-avx512cd,+avx,-rtm,+fma,+bmi,+rdrnd,-mwaitx,+ >>> sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,-prefetchwt1,+ >>> f16c,+ssse3,+sgx,+cmov,-avx512vbmi,+movbe,+xsaveopt,- >>> sha,+adx,-avx512pf,+sse3 >>> >>> >>>...
2020 Jul 10
12
New x86-64 micro-architecture levels
...lowing features, it is assumed that the run-time selection takes full support coverage (from silicon to the kernel) into account. * Level C AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B. This is close to what glibc currently calls "haswell". * Level D AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in level C. This is the AVX-512 level implemented by Xeon Scalable Processors, not the Xeon Phi variant. glibc (or an alternative loader implementation) would search for libraries starting at level D, going back to level A, and finally the baseline...
2018 Mar 23
2
Issue with libguestfs-test-tool on a guest hosted on VMWare ESXi
...st tm2 ssse3 cid fma cx16 xtpr pdcm pcid dca sse4.1 sse4.2 x2apic movbe popcnt tsc-deadline aes xsave osxsave avx f16c rdrand hypervisor fsgsbase tsc-adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512ifma pcommit clflushopt clwb avx512pf avx512er avx512cd avx512bw avx512vl avx512vbmi umip pku ospke rdpid avx512-4vnniw avx512-4fmaps syscall nx mmxext fxsr-opt pdpe1gb rdtscp lm 3dnowext 3dnow lahf-lm cmp-legacy svm extapic cr8legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid-msr tbm topoext perfctr-core perfctr-...
2020 Jul 13
3
New x86-64 micro-architecture levels
...he > > kernel) into account. > > > > * Level C > > > > AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B. > > > > This is close to what glibc currently calls "haswell". > > > > * Level D > > > > AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in > > level C. > > > > This is the AVX-512 level implemented by Xeon Scalable Processors, not > > the Xeon Phi variant. > > > > > > glibc (or an alternative loader implementation) would search for > > li...
2017 Feb 01
2
RFC: Generic IR reductions
...to an instruction (mainly because of the op), but it can be done, just need time and community support. > Let's say someone comes up with 1024bit vector working on char data. Nobody is really happy to see > a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem > of half that size (only a few instructions less). I don't think anything that is proportional to "LOG(#elems)" > is "nice and concise". I agree it would be ugly, but I want to make sure we're clear that this is mostly irrel...