Fernando Magno Quintao Pereira via llvm-dev
2020-Apr-06 18:53 UTC
[llvm-dev] Adding a new External Suite to test-suite
Hi Johannes,

> I'd also like to know what the intention here is. What is tested and how?

We have a few uses for these benchmarks in the technical report,
http://lac.dcc.ufmg.br/pubs/TechReports/LaC_TechReport012020.pdf, but since
then we have come up with other applications. All these programs produce
object files without external dependencies. We have been using them to train
a predictive compiler that reduces code size (the technical report has more
about that). In addition, you can use them to compare compilation time, for
instance, as Michael had asked. We have also used these benchmarks in two
studies:

1) http://cuda.dcc.ufmg.br/angha/chordAnalysis
2) http://cuda.dcc.ufmg.br/angha/staticProperties

A few other applications that I know about (outside our research group)
include:

* Comparing the size of code produced by three HLS tools: Intel HLS, Vivado
  and LegUp.
* Testing the Ultimate Buchi Automizer, to see which kinds of C constructs it
  handles.
* Comparing the compilation time of gcc vs. clang.

A few other studies that I would like to carry out:

* Checking the runtime of different C parsers that we have.
* Trying to infer, empirically, the complexity of compiler analyses and
  optimizations.

> Looking at a few of these it seems there is not much you can do as it is
> little code with a lot of unknown function calls and global symbols.

Most of the programs are small (avg. 63 bytecodes, std. dev. 97); however,
among these 1M C functions, we have a few large ones, with more than 40K
bytecodes.

Regards,

Fernando
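P.S.: to make the compilation-time and code-size comparisons concrete, below
is a rough sketch of the kind of driver one can write on top of the suite.
It is not the exact script we use; the directory layout ("anghabench/*.c"),
the flags, and the 30-second timeout are illustrative.

# Sketch: compare compilation time and object-file size of clang vs. gcc
# over the single-function C files in the suite.
import glob
import os
import subprocess
import time

COMPILERS = ["clang", "gcc"]
FLAGS = ["-c", "-O2", "-w"]   # -w: the extracted functions trigger many warnings

def compile_one(compiler, src):
    """Compile one file to an object file; return (seconds, size) or None."""
    obj = src + "." + compiler + ".o"
    start = time.perf_counter()
    try:
        proc = subprocess.run([compiler] + FLAGS + [src, "-o", obj],
                              capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return None
    elapsed = time.perf_counter() - start
    if proc.returncode != 0:
        return None
    return elapsed, os.path.getsize(obj)

for compiler in COMPILERS:
    total_time, total_size, failures = 0.0, 0, 0
    for src in sorted(glob.glob("anghabench/*.c")):
        outcome = compile_one(compiler, src)
        if outcome is None:
            failures += 1
            continue
        seconds, size = outcome
        total_time += seconds
        total_size += size
    print("%s: %.1f s total, %d bytes of objects, %d failures"
          % (compiler, total_time, total_size, failures))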
Johannes Doerfert via llvm-dev
2020-Apr-06 23:30 UTC
[llvm-dev] Adding a new External Suite to test-suite
Hi Fernando,

On 4/6/20 1:53 PM, Fernando Magno Quintao Pereira via llvm-dev wrote:
> Hi Johannes,
>
>> I'd also like to know what the intention here is. What is tested and how?
>
> We have a few uses for these benchmarks in the technical report,
> http://lac.dcc.ufmg.br/pubs/TechReports/LaC_TechReport012020.pdf, but since
> then we have come up with other applications. All these programs produce
> object files without external dependencies. We have been using them to
> train a predictive compiler that reduces code size (the technical report
> has more about that). In addition, you can use them to compare compilation
> time, for instance, as Michael had asked. We have also used these
> benchmarks in two studies:
>
> 1) http://cuda.dcc.ufmg.br/angha/chordAnalysis
> 2) http://cuda.dcc.ufmg.br/angha/staticProperties
>
> A few other applications that I know about (outside our research group)
> include:
>
> * Comparing the size of code produced by three HLS tools: Intel HLS,
>   Vivado and LegUp.
> * Testing the Ultimate Buchi Automizer, to see which kinds of C constructs
>   it handles.
> * Comparing the compilation time of gcc vs. clang.
>
> A few other studies that I would like to carry out:
>
> * Checking the runtime of different C parsers that we have.
> * Trying to infer, empirically, the complexity of compiler analyses and
>   optimizations.

All the use cases sound reasonable, but why do we need these kinds of "weird
files" to do this?

I mean, why would you train or measure something on single-definition
translation units and not on the original ones, potentially one function at
a time?

To me this looks like a really good way to skew the input data set, e.g.,
you don't ever see a call that can be inlined or for which inter-procedural
reasoning is performed. As a consequence, each function is way smaller than
it would be in a real run, with all the consequences on the results obtained
from such benchmarks. Again, why can't we take the original programs
instead?

>> Looking at a few of these it seems there is not much you can do as it is
>> little code with a lot of unknown function calls and global symbols.
>
> Most of the programs are small (avg. 63 bytecodes, std. dev. 97); however,
> among these 1M C functions, we have a few large ones, with more than 40K
> bytecodes.

How many duplicates are there among the small functions? I mean, close to 1M
functions of such a small size (and with similar pro- and epilogues).

Cheers,

Johannes
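P.S.: to make the duplicate question concrete, even a crude check along the
lines of the sketch below would give a lower bound, by hashing the files
after collapsing whitespace (the "anghabench/*.c" path is made up; renamed
or re-typed copies would of course not be caught).

# Sketch: lower-bound the number of duplicates among the extracted C files
# by hashing each file with all whitespace collapsed.
import glob
import hashlib
from collections import Counter

def fingerprint(path):
    with open(path, "rb") as f:
        normalized = b" ".join(f.read().split())   # collapse whitespace
    return hashlib.sha256(normalized).hexdigest()

counts = Counter(fingerprint(p) for p in glob.glob("anghabench/*.c"))
duplicates = sum(n - 1 for n in counts.values() if n > 1)
print("%d files are identical (modulo whitespace) to some other file" % duplicates)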
Fernando Magno Quintao Pereira via llvm-dev
2020-Apr-07 00:24 UTC
[llvm-dev] Adding a new External Suite to test-suite
Hi Johannes,

> All the use cases sound reasonable, but why do we need these kinds of
> "weird files" to do this?
>
> I mean, why would you train or measure something on single-definition
> translation units and not on the original ones, potentially one function
> at a time?

I think that's the fundamental question :) The short answer is that it is
hard to compile the files from open-source repositories automatically. The
weird files that you mentioned appear due to the type inference that we run
on them. Let me give you some data and tell you the whole story.

One of the benchmark collections distributed on our website consists of
529,498 C functions and their respective LLVM bytecodes. It was produced as
follows. Out of the C files that we downloaded from open-source
repositories, we extracted 698,449 functions, with sizes varying from one
line to 45,263 lines of code (the largest one comes from Radare2's
assembler). Thus, we produced an initial code base of 698,449 C files, each
file containing a single function. We ran Psyche-C
(http://cuda.dcc.ufmg.br/psyche-c/) on this code base, with a timeout of 30
seconds per function. Psyche-C was able to reconstruct the dependencies of
529,498 functions, thus ensuring their compilation. Compilation here means
generating an object file out of the function.

Out of the 698,449 functions, 31,935 were directly compilable as-is, that
is, without type inference. (To check this, we invoke clang on a whole C
file; in case of success, we count as compilable every function with a body
within that file.) Hence, without type inference, we could ensure
compilation of 4.6% of the programs; with type inference, we could ensure
compilation of 75.8% of all the programs. Failures to reconstruct types were
mostly due to macros that are not syntactically valid C without
preprocessing. Only 3,666 functions could not be reconstructed within the
allotted 30-second time slot.

So, without type inference we can automatically compile only about 5% of the
functions that we download, even considering all the dependencies in the C
files where these functions live. Nevertheless, given that we can download
millions of functions, 5% is already enough to give us a non-negligible
number of benchmarks. However, these compilable functions tend to be very
small: the median number of LLVM bytecodes is seven (in contrast with more
than 60 once we use type inference). Such functions are unlikely to contain
features such as arrays of structs, type casts, recursive types,
double-pointer dereferences, etc.

> To me this looks like a really good way to skew the input data set, e.g.,
> you don't ever see a call that can be inlined or for which inter-procedural
> reasoning is performed. As a consequence, each function is way smaller than
> it would be in a real run, with all the consequences on the results
> obtained from such benchmarks. Again, why can't we take the original
> programs instead?

Well, in the end, using just the naturally compilable functions leads to
poor predictions. For instance, trained on those functions, YaCoS (the
framework that we have been using) reduces the size of MiBench's Bitcount by
10%, whereas trained on AnghaBench it achieves 16.9%. On Susan, the
naturally compilable functions lead to an increase in code size (5.4%),
whereas AnghaBench reduces size by 1.7%. Although there are MiBench
benchmarks where the naturally compilable functions lead to better code
reduction, such cases are rare, and the gains tend to be very close to those
obtained with AnghaBench.

About inlining, you are right: there will be no inlining.
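Coming back to the compilation numbers: the check itself does nothing more
than the sketch below (this is an illustration, not our actual pipeline; the
type-reconstruction step with Psyche-C is elided, and the directory name is
made up).

# Sketch: count how many single-function C files compile to an object file.
# Type reconstruction (Psyche-C) is assumed to have run already.
import glob
import subprocess

def compiles(src):
    """Return True if clang can turn this C file into an object file."""
    try:
        proc = subprocess.run(["clang", "-c", "-w", src, "-o", "/dev/null"],
                              capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0

files = sorted(glob.glob("angha_functions/*.c"))
ok = sum(compiles(f) for f in files)
print("%d of %d single-function files compile to object code" % (ok, len(files)))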
To get around the lack of inlining, we also provide a database of 15K whole
files, i.e., files that may contain multiple functions. The programs are
available here:
http://cuda.dcc.ufmg.br/angha/files/suites/angha_wholefiles_all_15k.tar.gz

Regards,

Fernando