Hi, I've been thinking about how to implement a framework for attempting instruction scheduling of small blocks of code by using (GA/simulated annealing/etc) controlled timing-test-evaluations of various orderings. (I'm particularly interested small-ish numerical inner loop code in low-power CPUs like Atom and various ARMs where there CPU doesn't have the ability to "massage" the compiler's scheduling.) I had been thinking that the minimal changes approach would be by attaching the desired ordering to the bit-code as metadata nodes, using the standard JIT interface and just adding a tiny bit of code in the scheduler to recognise the metadata and add those as extra dependencies via addPred(). However, looking at things in detail it looks like instruction selection does things that would make propagating bitcode-instruction ordering through to MachineInstr ordering actually pretty invasive. It looks like it may actually be less invasive overall to put the "try different orderings" controller code at the point where the SelectionDAG is just about to be converted to a list of MachineInstr, essntially repeatedly calling reg-alloc, executing and using the run-time info to suggest a new possible ordering. Are there any obvious reasons why this is actually a bad approach to the problem? On the assumption that its a reasonable idea, it involves repeatedly calling reg-alloc and machine code execution from an unexpected point in the llvm system, so is there any documentation about any "destructive updating" or "single-use" assumptions in the compilation-and-execution codepaths? (I haven't found anything obvious on the llvm website.) Many thanks for any suggestions, David Steven Tweed
Kalle.Raiskila at nokia.com
2010-Jun-11 14:17 UTC
[LLVMdev] thinking about timing-test-driven scheduler
On Wed, 2010-06-09 at 17:30 +0200, orthochronous wrote:> Hi, > > I've been thinking about how to implement a framework for attempting > instruction scheduling of small blocks of code by using (GA/simulated > annealing/etc) controlled timing-test-evaluations of various > orderings.This sounds interesting.> (I'm particularly interested small-ish numerical inner loop > code in low-power CPUs like Atom and various ARMs where there CPU > doesn't have the ability to "massage" the compiler's scheduling.)Have you tried to do some benchmarking here? E.g. high-end x86 vs. Atom processors? Does the current scheduler do a bad job here?> Are there any > obvious reasons why this is actually a bad approach to the problem? On > the assumption that its a reasonable idea, it involves repeatedly > calling reg-alloc and machine code execution from an unexpected point > in the llvm system,This sort of framework would be nice to have also for (especially for) processors that one must cross compile for (or don't have JIT). Just a thought... kalle
On Fri, Jun 11, 2010 at 3:17 PM, <Kalle.Raiskila at nokia.com> wrote:> On Wed, 2010-06-09 at 17:30 +0200, orthochronous wrote:>> I've been thinking about how to implement a framework for attempting >> instruction scheduling of small blocks of code by using (GA/simulated >> annealing/etc) controlled timing-test-evaluations of various >> orderings. > > This sounds interesting. > >> (I'm particularly interested small-ish numerical inner loop >> code in low-power CPUs like Atom and various ARMs where there CPU >> doesn't have the ability to "massage" the compiler's scheduling.) > > Have you tried to do some benchmarking here? E.g. high-end x86 vs. Atom > processors? Does the current scheduler do a bad job here?Some caveats: I was looking at LLVM because it's jit capability ought to make it useful for doing GA based stuff rather than the intrinsic quality of the LLVM scheduler. I'm more of an image processing/scientific computing guy rather than a compiler guy, but I'd expect that for common unpredictable branchy code (especially on OOO cpus) the difference between a good scheduling and a close to optimal schedule is probably in the noise. For the kind of non-branchy numeric code I'd expect that IF THE STOCHASTIC SEARCH CAN BE MADE TO ACTUALLY WORK you might be able to get 5-10 percent consistently. I don't have access to a high-end x86 at the moment; here's some benchmark results on an Atom N270 that I don't completely understand, and may well be wrong. I need to dig in further and see precisely what's happening. The program evaluates both uint64_t intrinsicsFn(__m128i* __restrict in,__m128i* __restrict out){ uint64_t startTSC,endTSC; int i=0; rdtscll(startTSC); do{ __m128i xmm0=in[0/16]; __m128i xmm1=in[16/16]; __m128i xmm2=in[32/16]; __m128i xmm3=in[48/16]; xmm0=_mm_add_epi16(in[64/16],xmm0); xmm1=_mm_add_epi16(in[80/16],xmm1); xmm2=_mm_add_epi16(in[96/16],xmm2); xmm3=_mm_add_epi16(in[112/16],xmm3); xmm0=_mm_add_epi16(xmm1,xmm0); xmm2=_mm_add_epi16(xmm3,xmm2); xmm0=_mm_add_epi16(xmm2,xmm0); out[0/16]=xmm0; in+=8;out+=1; }while(++i<LIM); rdtscll(endTSC); return endTSC-startTSC; } and the 9072 different valid reorderings of the hopefully equivalent inline assembly style function uint64_t f9072(__m128i * __restrict in,__m128i * __restrict out){ uint64_t startTSC,endTSC; int i=0; rdtscll(startTSC); do{ __asm__ __volatile__( "\tmovdqa 0(%1),%%xmm0\n" "\tmovdqa 16(%1),%%xmm1\n" "\tmovdqa 32(%1),%%xmm2\n" "\tmovdqa 48(%1),%%xmm3\n" "\tpaddw 64(%1),%%xmm0\n" "\tpaddw 80(%1),%%xmm1\n" "\tpaddw 96(%1),%%xmm2\n" "\tpaddw 112(%1),%%xmm3\n" "\tpaddw %%xmm1,%%xmm0\n" "\tpaddw %%xmm3,%%xmm2\n" "\tpaddw %%xmm2,%%xmm0\n" "\tmovdqa %%xmm0,(%0)\n" :"=r" (out):"r" (in):"xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7"); in+=8;out+=1; }while(++i<LIM); rdtscll(endTSC); return endTSC-startTSC; } Note that my inline asm isn't that good, so the array ptr increments are forced to be at end for the asm versions, which is probably not good for speed. I'd like to try a bigger inner loop but the number of combinations obviously grows too rapidly to try this way rather than via a stochastic sampling approach. Running those through g++-4.4 (which DOESN'T have an atom pipeline model) and clang++ trunk gives the attached graph (y is proportional to runtime whilst x is scheduling instance, ordered by increasing y value). The minimum g++ inline assembly: 73200 The time for g++ -O3 optimisation of intrinsics fn: 93712 The minimum clang++ inline assembly: 79275 The time for clang++ -O3 optimisation of intrinsics fn: 72581 Speculations: the clang intrinsics is finding some optimisation that it is being blocked on the asm version, so it's not just the SSE scheduling that's being measured. g++ is doing systematically some optimisation on the asm versions that clang isn't, hence the rough "curve offset". For some reason g++ is picking a noticeably worse ordering for the intrinsics; I need to find an ubuntu package for g++-4.5 and see if the atom pipeline model improves things. (I can make the hacked together benchmark code available by email if anyone wants it.) This probably raises as many questions as it answers... but it suggests the idea may be worthwhile for really intense inner loop code. Regards, David Steven Tweed -------------- next part -------------- A non-text attachment was scrubbed... Name: timeForSchedules.ps Type: application/postscript Size: 255628 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20100611/aa23896d/attachment.ps>
Reasonably Related Threads
- [LLVMdev] thinking about timing-test-driven scheduler
- [LLVMdev] RFB: Would like to flip the vector shuffle legality flag
- [LLVMdev] SIMD instructions and memory alignment on X86
- [LLVMdev] RFB: Would like to flip the vector shuffle legality flag
- [LLVMdev] RFB: Would like to flip the vector shuffle legality flag