Hi Tim,

> > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)
>
> I get a VFP vmla here rather than a NEON one (clang -target
> armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
> you seeing something different?

As per Renato's comment above, the vmla instruction is a NEON instruction while vmfa is a VFP instruction. Correct me if I am wrong on this.

> > However, if gcc emits the vmla (NEON) instruction with cortex-a8 then
> > shouldn't LLVM also emit the vmla (NEON) instruction?
>
> It appears we've decided in the past that vmla just isn't worth it on
> Cortex-A8. There's this comment in the source:
>
> // Some processors have FP multiply-accumulate instructions that don't
> // play nicely with other VFP / NEON instructions, and it's generally better
> // to just not use them.
>
> Sufficient benchmarking evidence could overturn that decision, but I
> assume the people who added it in the first place didn't do so on a whim.
>
> > The performance gain with the vmla instruction is huge.
>
> Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
> in odd ways, and that was a very primitive core, so it's almost
> certainly not going to be just as good as a vmul (in fact, if I'm
> reading correctly, it takes pretty much exactly the same time as
> separate vmul and vadd instructions, 10 cycles vs 2 * 5).

It may seem that the total number of cycles is more or less the same for a single vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of vmla, intermediate results are generated which need to be stored in memory for future access. This leads to a lot of load/store ops being inserted, which degrades performance. Correct me if I am wrong on this, but my observations till date have shown this.

> Cheers.
>
> Tim.

--
With regards,
Suyog Sarda
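For concreteness, the pattern under discussion is a single floating-point multiply-accumulate. Below is a minimal sketch; the function name, variable names and the lowerings shown in the comments are illustrative assumptions, not taken from any benchmark or from actual compiler output.

    /* Minimal sketch of the multiply-accumulate pattern under discussion.
       Names are made up for illustration. */
    float madd(float acc, float x, float y)
    {
        /* Single-instruction lowering:  vmla.f32  <acc>, <x>, <y>
           Two-instruction lowering:     vmul.f32  <tmp>, <x>, <y>
                                         vadd.f32  <acc>, <acc>, <tmp>
           The split form needs an extra register for <tmp>; it only touches
           memory if register pressure forces a spill. */
        return acc + x * y;
    }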
> As per Renato's comment above, the vmla instruction is a NEON instruction
> while vmfa is a VFP instruction. Correct me if I am wrong on this.

My version of the ARM architecture reference manual (v7 A & R) lists versions requiring NEON and versions requiring VFP (section A8.8.337), split in just the way you'd expect (SIMD variants need NEON).

> It may seem that the total number of cycles is more or less the same for a
> single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> used instead of vmla, intermediate results are generated which need to be
> stored in memory for future access.

Well, it increases register pressure slightly I suppose, but there's no need to store anything to memory unless that gets critical.

> Correct me if I am wrong on this, but my observations till date have shown this.

Perhaps. Actual data is needed, I think, if you seriously want to change this behaviour in LLVM. The test-suite might be a good place to start, though it'll give an incomplete picture without the externals (SPEC & other things).

Of course, if we're just speculating we can carry on.

Cheers.

Tim.
On Thu, Dec 19, 2013 at 2:43 PM, Tim Northover <t.p.northover at gmail.com> wrote:

> > As per Renato's comment above, the vmla instruction is a NEON instruction
> > while vmfa is a VFP instruction. Correct me if I am wrong on this.
>
> My version of the ARM architecture reference manual (v7 A & R) lists
> versions requiring NEON and versions requiring VFP (section A8.8.337),
> split in just the way you'd expect (SIMD variants need NEON).

I will check on this part.

> > It may seem that the total number of cycles is more or less the same for a
> > single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> > used instead of vmla, intermediate results are generated which need to be
> > stored in memory for future access.
>
> Well, it increases register pressure slightly I suppose, but there's
> no need to store anything to memory unless that gets critical.
>
> > Correct me if I am wrong on this, but my observations till date have shown this.
>
> Perhaps. Actual data is needed, I think, if you seriously want to change
> this behaviour in LLVM. The test-suite might be a good place to start,
> though it'll give an incomplete picture without the externals (SPEC &
> other things).
>
> Of course, if we're just speculating we can carry on.

I wasn't speculating. Let's take the example of a simple 3x3 matrix multiplication with no loops, where all the multiplications and additions are hard coded - basically every operation is expanded, e.g.

Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0]

and so on for all 9 elements of the result.

If I compile the above code with "clang -O3 -mcpu=cortex-a8 -mfpu=vfpv3-d16" (only 16 floating-point registers are present on my ARM, so I specify vfpv3-d16), there are 27 vmul, 18 vadd, 23 store and 30 load ops in total. If the same code is compiled with gcc using the same options, there are 9 vmul, 18 vmla, 9 store and 20 load ops.

So it is clear that extra load/store ops get added with clang because it does not emit the vmla instruction. Won't this lead to performance degradation?

I would also like to know about accuracy with vmla compared to a pair of vmul and vadd ops.

--
With regards,
Suyog Sarda
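For reference, a fully expanded 3x3 multiply of the kind described above would look roughly like this; the function name and the float element type are assumptions for the sketch, not the exact benchmark source.

    /* Sketch of a fully unrolled 3x3 matrix multiplication, no loops.
       Each result element is one leading multiply plus two accumulates. */
    void matmul3x3(const float A[3][3], const float B[3][3], float Result[3][3])
    {
        Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0];
        Result[0][1] = A[0][0]*B[0][1] + A[0][1]*B[1][1] + A[0][2]*B[2][1];
        Result[0][2] = A[0][0]*B[0][2] + A[0][1]*B[1][2] + A[0][2]*B[2][2];
        Result[1][0] = A[1][0]*B[0][0] + A[1][1]*B[1][0] + A[1][2]*B[2][0];
        Result[1][1] = A[1][0]*B[0][1] + A[1][1]*B[1][1] + A[1][2]*B[2][1];
        Result[1][2] = A[1][0]*B[0][2] + A[1][1]*B[1][2] + A[1][2]*B[2][2];
        Result[2][0] = A[2][0]*B[0][0] + A[2][1]*B[1][0] + A[2][2]*B[2][0];
        Result[2][1] = A[2][0]*B[0][1] + A[2][1]*B[1][1] + A[2][2]*B[2][1];
        Result[2][2] = A[2][0]*B[0][2] + A[2][1]*B[1][2] + A[2][2]*B[2][2];
    }

Nine elements with one leading multiply and two accumulates each line up with the gcc counts quoted above (9 vmul + 18 vmla), while splitting every accumulate gives the 27 vmul + 18 vadd seen from clang.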
On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:

> It may seem that the total number of cycles is more or less the same for a
> single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> used instead of vmla, intermediate results are generated which need to be
> stored in memory for future access. This leads to a lot of load/store ops
> being inserted, which degrades performance. Correct me if I am wrong on
> this, but my observations till date have shown this.

VMLA.F can be either NEON or VFP on the A series, and the encoding determines which is used. In assembly files, the difference is mainly in the type vs. the registers used.

The problem we were trying to avoid a long time ago was well researched by Evan Cheng, and it showed that there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made that code much slower than a sequence of VMLA+VMUL+VADD.

Also, please note that, as far as cycle counts go, according to the A9 manual one VFP VMLA takes almost as long as a pair of VMUL+VADD to produce its result, so a sequence of VMUL+VADD might be faster, in some contexts or cores, than a sequence of half as many VMLAs.

As Tim and David said, and I agree, without hard data anything we say might be used against us. ;)

cheers,
--renato
On Thu, Dec 19, 2013 at 4:36 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:
>
> > It may seem that the total number of cycles is more or less the same for a
> > single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> > used instead of vmla, intermediate results are generated which need to be
> > stored in memory for future access. This leads to a lot of load/store ops
> > being inserted, which degrades performance. Correct me if I am wrong on
> > this, but my observations till date have shown this.
>
> VMLA.F can be either NEON or VFP on the A series, and the encoding
> determines which is used. In assembly files, the difference is mainly in
> the type vs. the registers used.
>
> The problem we were trying to avoid a long time ago was well researched by
> Evan Cheng, and it showed that there is a pipeline stall between two
> sequential VMLAs (possibly due to the need to re-use some registers), which
> made that code much slower than a sequence of VMLA+VMUL+VADD.
>
> Also, please note that, as far as cycle counts go, according to the A9
> manual one VFP VMLA takes almost as long as a pair of VMUL+VADD to produce
> its result, so a sequence of VMUL+VADD might be faster, in some contexts or
> cores, than a sequence of half as many VMLAs.
>
> As Tim and David said, and I agree, without hard data anything we say
> might be used against us. ;)

Sorry folks, I didn't specify the actual test case and results in detail previously. The details are as follows:

Test case name: llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c
- This is a 4x4 matrix multiplication; we can make small changes to turn it into a 3x3 matrix multiplication to keep things simple to understand.

clang version: trunk (latest as of today, 19 Dec 2013)
GCC version: 4.5 (I checked with 4.8 as well)

Flags passed to both gcc and clang: -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mcpu=cortex-a8
Optimization level used: O3

No vmla instruction is emitted by clang, but GCC happily emits it. This was tested on real hardware.

Time taken for a 4x4 matrix multiplication:
clang: ~14 secs
gcc: ~9 secs

Time taken for a 3x3 matrix multiplication:
clang: ~6.5 secs
gcc: ~5 secs

When the flag -mcpu=cortex-a8 is changed to -mcpu=cortex-a15, clang emits vmla instructions (gcc emits them by default).

Time for a 4x4 matrix multiplication:
clang: ~8.5 secs
GCC: ~9 secs

Time for a 3x3 matrix multiplication:
clang: ~3.8 secs
GCC: ~5 secs

Please let me know if I am missing something. (The -ffast-math option doesn't help in this case.)

On examining the assembly code for the various scenarios above, I reached the conclusion stated earlier regarding the extra load/store ops.

Also, as stated by Renato - "there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made that code much slower than a sequence of VMLA+VMUL+VADD" - when I use -mcpu=cortex-a15, clang emits vmla instructions back to back (sequentially). Is there something different about cortex-a15 regarding pipeline stalls, such that back-to-back vmla hazards can be ignored there?

--
With regards,
Suyog Sarda
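To illustrate the back-to-back point, here is roughly what one result element of the double-precision 4x4 case looks like once expanded; the names are made up for this sketch and it is not the actual matmul_f64_4x4.c source.

    /* Sketch: one element of a 4x4 double-precision matrix product. After
       the first multiply, the three accumulations form a serial dependence
       chain; if the compiler turns each of them into a vmla, those vmla
       instructions come out back to back. */
    double dot_row0_col0(const double A[4][4], const double B[4][4])
    {
        double r00 = A[0][0] * B[0][0];   /* vmul */
        r00 += A[0][1] * B[1][0];         /* candidate vmla */
        r00 += A[0][2] * B[2][0];         /* candidate vmla, back to back */
        r00 += A[0][3] * B[3][0];         /* candidate vmla */
        return r00;
    }

Whether such a dependent chain suffers the stall Renato described on Cortex-A15 is presumably what the timings above are measuring.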