Adding LLVMdev, since this is intimately related to the optimization passes.

> I think this is roughly because some function level optimisations are
> worse than O(N) in the number of instructions.

Please profile this and mail llvmdev regarding passes with significantly
superlinear behavior (e.g. O(n^2)). My understanding is that these kinds of
algorithmic problems are generally considered bugs in the optimizer and will
be fixed.

> Roughly, a pass would calculate the cost of running on some
> unit (translation unit, function), and a "goodness" estimating
> how much improvement might be expected. The optimiser
> then tries to simultaneously maximise goodness and minimise
> cost by selecting some sequence of passes.

I doubt there is a way to compute the expected benefit that is significantly
cheaper than actually performing the optimization itself. You also have to
consider that passes interact with each other in complicated and "poorly
understood" ways, so it's hard to extrapolate one pass's estimated benefit to
overall code improvement in the end. I say "poorly understood" because, as far
as I know, there is no formal or scientific underpinning to the choice and
ordering of the passes run at the "standard" optimization levels (-O1, -O2,
-O3, etc.) beyond very high-level, extremely rough heuristics.

> Another way to do this would be to calculate a time limit on
> a pass and abort it if it exceeded the limit.

No way. This would make the optimizations nondeterministic.

-- Sean Silva

On Thu, Nov 22, 2012 at 11:50 PM, john skaller
<skaller at users.sourceforge.net> wrote:
> Hi, I've been having some of the usual problems with C compiler
> optimisations: they work fine on small functions but drop dead on big ones.
>
> I think this is roughly because some function level optimisations are
> worse than O(N) in the number of instructions.
>
> Changing the command line -O level is not a solution.
> The optimiser should adapt to ensure reasonable compile times.
>
> Has anyone thought about costing optimisation passes and using
> a linear optimisation formula to select when to apply them?
>
> TeX does something like this to calculate the best places to put
> line breaks when typesetting.
>
> Roughly, a pass would calculate the cost of running on some
> unit (translation unit, function), and a "goodness" estimating
> how much improvement might be expected. The optimiser
> then tries to simultaneously maximise goodness and minimise
> cost by selecting some sequence of passes.
>
> Another way to do this would be to calculate a time limit on
> a pass and abort it if it exceeded the limit.
>
> Some optimisations could be coded to do this kind of thing
> internally, i.e. partition a function and optimise the pieces,
> rather than the whole function, to avoid time overruns.
>
> FYI: I have a product which generates big functions.
> It does this for two reasons: performance is one,
> and lack of suitable semantics in C, roughly this
> is coroutine calls (exchange of control), and probably
> continuation passing. LLVM also lacks these fundamentals
> and is too high level to implement them, so the implementation
> is done natively but the result is big functions.
>
> --
> john skaller
> skaller at users.sourceforge.net
> http://felix-lang.org
Hi,

Here's a personal perspective on the issue.

Firstly, I don't think TeX does pre-estimation of work; it's just the standard
stepwise "refinement" stuff. However, the idea itself isn't necessarily a bad
one. In the academic community there's a set of people (John Cavazos,
Louis-Noel Pouchet, others I've forgotten) doing work on estimating which sets
of optimizations are most likely to produce a good speed-up, using estimation
functions. (Note they're concerned primarily with the speed of the output
code, and secondarily with the fact that there are just too many combinations
of optimizations to evaluate directly.) There are a couple of important
differences: they're using estimation functions found by machine learning on
"scientific-ish" code, which probably has more uniformity than completely
unrestricted programs. It's also based on the idea that there's just not the
manpower available to understand the complicated interactions between
components on the plethora of modern machines. From reading papers (and I've
not yet implemented enough of this stuff to test how well it really works) it
looks like a reasonable argument.

However, it's also potentially introducing highly non-predictable behaviour
into the compiler, violating the principle of least surprise, particularly
since LLVM can be used as a JIT. So I suspect it wouldn't be something most
people would want for mainstream clang/llvm, but it might be interesting if
someone were to implement some passes like this (probably out-of-tree) and
see if it proves beneficial.

Of course, Sean's point that superlinear behaviour ought to be reported as
it's probably a bug is very true.

Regards,
Dave

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Sean Silva
Sent: 23 November 2012 06:47
To: john skaller
Cc: Clang; llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] [cfe-dev] costing optimisations
On 23/11/2012, at 5:46 PM, Sean Silva wrote:

> Adding LLVMdev, since this is intimately related to the optimization passes.
>
>> I think this is roughly because some function level optimisations are
>> worse than O(N) in the number of instructions.
>
> Please profile this and mail llvmdev regarding passes with
> significantly superlinear behavior (e.g. O(n^2)). My understanding is
> that these kinds of algorithmic problems are generally considered bugs
> in the optimizer and will be fixed.

I'll try, but at the moment I'm having trouble building Clang/LLVM.
Bug report in Bugzilla. 10 hours of build time followed by failure on the
Release version isn't much fun (and also uses a lot of power -- everything
here is solar/wind powered).

The Debug+Asserts version did build; that's what is taking so long.
I thought perhaps the assertions or the lack of optimisation might be slowing
things down, so I tried to build a Release version. I think I built
Debug+Asserts using clang 3.1. However, once installed, the Release version
is being built by 3.3svn/Debug+Asserts.

> You also have to consider that passes interact with each other
> in complicated and "poorly understood" ways so it's hard to
> extrapolate one pass's estimated benefit to overall code improvement
> in the end

I agree. It's also hard to tell how good linear passes are compared to
recursive optimisations. However, one has to have some way of choosing which
optimisation passes to run, when to run them, and how many times, so some
judgement has to be made anyhow. Making a numerical judgement and using a
formula to choose by weighting can't be worse (once suitable numbers are
found).

>> Another way to do this would be to calculate a time limit on
>> a pass and abort it if it exceeded the limit.
>
> No way. This would make the optimizations nondeterministic.

Not if you set the time to infinity. At present I have a test suite which
imposes a timeout on each test. This is in case a bug makes a test run
forever. But it also kills the test if the compile runs too long. So I have
non-deterministic behaviour already: sometimes a test runs and sometimes not
(because part of the compilation is cached, it may succeed on a second
attempt).

--
john skaller
skaller at users.sourceforge.net
http://felix-lang.org
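To make the "choose by weighting" idea concrete, here is a minimal sketch of
what such a selection rule could look like. Everything in it is a hypothetical
illustration: the pass names, the cost and goodness functions, and the weights
are made-up numbers, and this is not LLVM's actual pass manager interface.

/////////////////
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical per-pass model: "cost" estimates compile time and "goodness"
// estimates expected code improvement, both as functions of the unit size n
// (e.g. number of instructions in a function).
struct PassModel {
    std::string name;
    double (*cost)(double n);      // estimated compile-time cost
    double (*goodness)(double n);  // estimated benefit of running the pass
};

// Keep a pass only if its weighted goodness outweighs its weighted cost.
// The weights wGood/wCost play the role of the "linear optimisation formula"
// and would have to be tuned.
std::vector<std::string> selectPasses(const std::vector<PassModel>& passes,
                                      double n, double wGood, double wCost) {
    std::vector<std::string> chosen;
    for (const PassModel& p : passes)
        if (wGood * p.goodness(n) - wCost * p.cost(n) > 0.0)
            chosen.push_back(p.name);
    return chosen;
}

int main() {
    // Illustrative numbers only: a linear-cost cleanup pass and a
    // quadratic-cost "heavy" pass whose benefit grows only linearly.
    std::vector<PassModel> passes = {
        {"cheap-cleanup",  [](double n) { return n; },
                           [](double n) { return 0.5 * n; }},
        {"heavy-analysis", [](double n) { return n * n; },
                           [](double n) { return 20.0 * n; }},
    };
    // On a small function the heavy pass pays off; on a huge one it doesn't.
    for (double n : {100.0, 100000.0}) {
        std::printf("n=%.0f:", n);
        for (const std::string& s : selectPasses(passes, n, 1.0, 0.01))
            std::printf(" %s", s.c_str());
        std::printf("\n");
    }
    return 0;
}
/////////////////

With these made-up numbers the quadratic-cost pass is selected for a small
function but skipped for a huge one, which is the behaviour being asked for;
the hard part, as Sean says, is finding cost and goodness estimates that are
both cheap to compute and trustworthy.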
On 23/11/2012, at 8:28 PM, David Tweed wrote:

> Firstly I don't think TeX does pre-estimation of work, it's just the
> standard stepwise "refinement" stuff.

I didn't mean that example to be taken so literally.

When TeX formats a paragraph, it has to decide where to put line breaks.
Breaking a line on a space has less badness than hyphenating a word. If I
recall correctly, different hyphenations can have different badnesses. The
thing is that stretching whitespace also has a badness, dependent on how much
you stretch it. So TeX actually considers several possible break points and
calculates a badness for each one.

But there's more. The choice of a break point in a line affects the next
line. And the one after that, etc. TeX tries to solve for an optimal layout
of the whole paragraph. That's why it produces such awesome results.

> However, it's also potentially introducing highly
> non-predictable/principle-of-least-surprise behaviour into the compiler,
> particularly since llvm can be used as a JIT. So I suspect it wouldn't be
> something most people would want for mainstream clang/llvm, but it might be
> interesting if someone were to implement some passes like this (probably
> out-of-tree) and see if it proves beneficial.

My problem is that I get quantum non-determinism :)
Either the compiler finishes and does a "reasonable job", or it takes so long
the whole thing gets killed. Of course, I could re-run the compiler with a
lower optimisation switch in that case. I think I might try that.

> Of course Sean's point that superlinear behaviour ought to be reported as
> it's probably a bug is very true,

Isn't data flow analysis O(N^3)? You cannot do proper alias analysis
without it.

--
john skaller
skaller at users.sourceforge.net
http://felix-lang.org
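For concreteness, the whole-paragraph aspect of the TeX analogy can be
sketched as a small dynamic program: pick the set of break points that
minimises the total badness of all lines, rather than greedily finishing one
line at a time. This is a toy simplification (badness here is just the cube of
the leftover space, and hyphenation and stretch limits are ignored), not TeX's
actual algorithm.

/////////////////
#include <limits>
#include <vector>

// Toy whole-paragraph line breaking: choose break points so that the sum of
// per-line badness is minimal, instead of greedily filling each line.
// words[i] is the width of word i; words on a line are separated by a
// single space of width 1.
double minTotalBadness(const std::vector<double>& words, double lineWidth) {
    const std::size_t n = words.size();
    const double inf = std::numeric_limits<double>::infinity();
    // best[i] = minimal badness of laying out words[i..n-1].
    std::vector<double> best(n + 1, 0.0);
    for (std::size_t i = n; i-- > 0;) {
        best[i] = inf;
        double width = 0.0;
        for (std::size_t j = i; j < n; ++j) {
            width += words[j] + (j > i ? 1.0 : 0.0);  // space between words
            if (width > lineWidth) break;             // line overfull, stop
            double slack = lineWidth - width;
            // Stretching whitespace has a badness that grows with the slack;
            // the last line is conventionally free.
            double badness = (j == n - 1) ? 0.0 : slack * slack * slack;
            if (best[j + 1] + badness < best[i])
                best[i] = best[j + 1] + badness;
        }
    }
    return best[0];
}
/////////////////

The relevant property is that the choice for one line is made with knowledge
of its effect on the lines after it, which a greedy per-line (or per-pass)
decision cannot do.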
On 23/11/2012, at 5:46 PM, Sean Silva wrote:

> Adding LLVMdev, since this is intimately related to the optimization passes.
>
>> I think this is roughly because some function level optimisations are
>> worse than O(N) in the number of instructions.
>
> Please profile this and mail llvmdev regarding passes with
> significantly superlinear behavior (e.g. O(n^2)). My understanding is
> that these kinds of algorithmic problems are generally considered bugs
> in the optimizer and will be fixed.

I cannot do that yet, however here's some more info:

Felix compile time: 5 seconds.

C++ compile and run times:

With -O3:

/usr/local/bin/clang++ -c -fno-common -fno-strict-aliasing -std=c++11 -fPIC ....
Elapsed: 106.058, Result code 0

/usr/local/bin/clang++ -fno-common -fno-strict-aliasing -std=c++11 -dynamiclib -O3 ....
Elapsed: 0.049, Result code 0

env DYLD_LIBRARY_PATH=build/release/lib/rtl:$DYLD_LIBRARY_PATH build/release/bin/flx_run '/Users/johnskaller/felix/test/regress/rt/tuple-02.dylib'
Elapsed: 0.007, Result code 0

With -O2:
Compile: Elapsed: 106.918, Result code 0
Link:    Elapsed: 0.048, Result code 0
Run:     Elapsed: 0.010, Result code 0

With -O1:
Compile: Elapsed: 13.664, Result code 0
Link:    Elapsed: 0.042, Result code 0
Run:     Elapsed: 0.007, Result code 0

So on this particular test case, -O3 and -O2 take the same time, but -O1 is
8 times faster. Link and run times are roughly the same in all cases.

I made a modified program; this is Felix code but should be comprehensible.
The Felix compiler generates C++. This particular test checks printing and
comparison of tuples and arrays. What is relevant here is the "inline"
keyword: it tells Felix to inline the procedure, so the result is a single
big C++ function.

/////////////////
var z = 1, (2.2, "xx");
println$ str z;
var z2 = 1, 2.2, "xx", 3, 4.3;
println$ z2;

var z3 = 1,23.3, z, z2, 99, "Jah";
println$ z3;
a := 1,2,3,4,5,6,7,8,9,10;

inline proc A () {
  println$ 1,2;
  println$ 1,2,3;
  println$ 1,2,3,4;
  println$ 1,2,3,4,5;
  println$ (a, a) :>> (int ^ 20);
}
A;

inline proc B () {
  println$ (1,"x") == (1,"x");
  println$ (1,"x",4.2) == (1,"x",4.2);
  println$ (1,"x",4.2,"x") == (1,"x",4.2,"y");
  x1 := (1,"x",4.2,"x");
  x2 := (1,"x",4.2,"y");
  println$ x1 == x2;
}
B;

inline proc C() {
  println$ "-" * 20;
  y1 := (1,"x",4.2,"x",55,"hello",4.222);
  y2 := (1,"x",4.2,"y",55,"hello",4.222);
  println$ y1 == y2; // false
  println$ y1 != y2; // true
  println$ y1 < y2;  // true
  println$ y1 <= y2; // true
  println$ y1 > y2;  // false
  println$ y1 >= y2; // false
}
C;

println$ a == a;

b:= 1,2;
println$ b == b;

inline proc D() {
  println$ (1,2) == (1,2);
  println$ (1,2,3) == (1,2,3);
  println$ (1,2,3,4) == (1,2,3,4);
  println$ (1,2,3,4,5) == (1,2,3,4,5);
  println$ (1,2,3,4,5,6) == (1,2,3,4,5,6);

  println$ ("1",2,3,4,5,6) == ("1",2,3,4,5,6);
}
D;
/////////////////

The elapsed time for the compile on this variant with -O2 is 120 seconds.

Now, watch what happens if I change "inline" to "noinline", which makes each
procedure A, B, C, D a separate C++ function:

Elapsed time: 47 seconds.

With -O1:

inline version:   14 seconds.
noinline version: 11 seconds

So the -O1 compilation takes a bit longer with a single big function.
The -O2 compilation is almost a decimal order of magnitude slower with the
inline version, but twice as fast when the function is split up into 4 pieces
(plus some other stuff).

The actual C++ here:

1729  5498 93095 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.cpp
 753  2246 17518 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.hpp

plus library files. Felix optimises the use of headers, so only required
library headers are actually used.

100 seconds for 2000 lines is only 20 lines a second. At that speed a
2 million line program would take 100K seconds to compile (about one day).
Actually, Clang + LLVM took over 10 hours to build (using Clang as the
compiler).

OK .. so I split it up again like this:

/////////////////////////////
noinline proc A1 () {
  println$ 1,2;
}
noinline proc A2() {
  println$ 1,2,3;
}
noinline proc A3() {
  println$ 1,2,3,4;
}
noinline proc A4() {
  println$ 1,2,3,4,5;
}
noinline proc A5() {
  println$ (a, a) :>> (int ^ 20);
}
A1; A2; A3; A4; A5;
......
////////////////////////

Now with -O1 the compilation time is the same: 11 seconds.
But with -O2 we're down to 30 seconds (-O3 is the same).

The thing to note here is that at a coarse grain, the program contains no
branches. It's straight-line code. Of course there may be local branches at
a lower level.

Roughly: the optimiser isn't smart enough to recognise sequential
computations. It should be able to partition the control flow graph of a
function into a sequence of "super blocks", if you like, which can be
optimised independently and the results merged. This would reduce the cost
of a non-linear algorithm significantly.

I have no idea how to do this at the moment. However, in the bug finding code
I worked on, there was a natural way to "clump" the control flow graph. The
graph was built directly on the C AST, so in general each assignment

  x = expr

would be a clump. Inside the expr there is control flow, but it is usually
entirely linear. So when doing path analysis, you could basically look at the
C-level control flow and disregard the flow inside the expression. Most of
the time, data flow followed, because sub-expressions only assign to
temporaries (these variables are created and destroyed within the statement
and can usually be ignored).

Obviously LLVM cannot do that (the higher level stuff has been compiled
away), but it could use heuristics to search for the patterns that would be
left behind. I do that in my compiler with some success.

--
john skaller
skaller at users.sourceforge.net
http://felix-lang.org
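A back-of-the-envelope illustration of why such partitioning would help, using
a toy cost model rather than measurements of any real LLVM pass: if a pass
costs on the order of N^2 on a region of N instructions, then running it on
fixed-size chunks of a straight-line region costs roughly N*k instead.

/////////////////
#include <cstdio>

// Toy cost model for a quadratic pass: cost(N) ~ N^2 on a region of N
// instructions. (Illustrative only; real pass costs vary.)
static double quadraticCost(double n) { return n * n; }

// Optimising a straight-line region of n instructions in chunks of size k
// costs (n/k) * k^2 = n*k, instead of n^2 for the whole region at once.
static double chunkedCost(double n, double k) {
    double chunks = n / k;                 // number of independent pieces
    return chunks * quadraticCost(k);      // each piece optimised separately
}

int main() {
    double n = 10000.0;                    // instructions in one big function
    for (double k : {100.0, 1000.0, n}) {
        std::printf("chunk size %6.0f: cost %.0f (whole function: %.0f)\n",
                    k, chunkedCost(n, k), quadraticCost(n));
    }
    return 0;
}
/////////////////

The trade-off is that whatever the pass could have learned across chunk
boundaries is lost, and the pieces still have to be merged, which is exactly
the difficulty mentioned above.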
On 23.11.2012, at 15:12, john skaller <skaller at users.sourceforge.net> wrote:
> The O2 compilation is almost a decimal order of magnitude slower
> with the inline version, but twice as fast when the function is split up
> into 4 pieces (plus some other stuff).
>
> The actual C++ here:
>
> 1729  5498 93095 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.cpp
>  753  2246 17518 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.hpp

How about providing a preprocessed version of those files as a test case?
I don't think there are many people on this list who speak Felix fluently
and/or are willing to install your compiler. You may have found a
pathological case in the optimizer that can be fixed by putting in a clever
heuristic, but we'll never know without a test case.

You can also pass -ftime-report to clang to get a breakdown of where the
compile time goes.

- Ben