Matthias Braun via llvm-dev
2016-Oct-08 03:42 UTC
[llvm-dev] New test-suite result viewer/analyzer
I just put a little script into the llvm test-suite under util/compare.py. It is useful for situations in which you want to analyze the results of a few test-suite runs on the command line; if you have hundreds of results or want graphs, you should rather use LNT. The tool currently parses the json files produced by running "lit -o file.json" (so you should use the cmake/lit test-suite mode).
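For completeness, a minimal sketch of how a result file gets produced and fed to the script (the paths and the compiler choice are placeholders, adjust them to your own setup):

> cmake -DCMAKE_C_COMPILER=/path/to/clang /path/to/test-suite   # configure a test-suite build directory
> make                                                          # build the benchmarks
> lit -o base0.json .                                           # run them, writing the results as json
> compare.py base0.json                                         # inspect the results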
=== Basic usage ===
> compare.py base0.json
Warning: 'test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test' has No metrics!
Tests: 508
Metric: exec_time

Program                                   base0

INT2006/456.hmmer/456.hmmer             1222.90
INT2006/464.h264ref/464.h264ref          928.70
INT2006/458.sjeng/458.sjeng              873.93
INT2006/401.bzip2/401.bzip2              829.99
INT2006/445.gobmk/445.gobmk              782.92
INT2006/471.omnetpp/471.omnetpp          723.68
INT2006/473.astar/473.astar              701.71
INT2006/400.perlbench/400.perlbench      677.13
INT2006/483.xalancbmk/483.xalancbmk      502.35
INT2006/462.libquantum/462.libquantum    409.06
INT2000/164.gzip/164.gzip                150.25
FP2000/188.ammp/188.ammp                 149.88
INT2000/197.parser/197.parser            135.19
INT2000/300.twolf/300.twolf              119.94
INT2000/256.bzip2/256.bzip2              105.71

              base0
count    506.000000
mean      20.563098
std      111.423325
min        0.003400
25%        0.011200
50%        0.339450
75%        4.067200
max     1222.896800

- All numbers are arranged below each other on the dot (mail clients with variable-width fonts mess up the effect).
- Results are sorted by magnitude and limited to the 15 biggest ones by default; common prefixes and suffixes in the benchmark names are removed ('test-suite :: External/SPEC/' and '.test' in this case). Names that are still too long are shortened with a '...' in the middle. All of this can be disabled with the --full flag.
- The pandas library prints some neat statistics below the results.
- The 'exec_time' metric is shown by default; use --metric XXX to select a different one.

=== Compare multiple runs ===
> compare.py --filter-short base0.json base1.json base2.json
Tests: 508
Short Running: 281 (filtered out)
Remaining: 227
Metric: exec_time

Program                                        base0  base1  base2   diff
SingleSour...e/Benchmarks/Misc/himenobmtxpa     3.27   3.26   4.52  38.5%
MultiSource/Benchmarks/nbench/nbench           14.39  18.10  15.03  25.8%
SingleSour...Benchmarks/Shootout-C++/lists1     0.87   1.02   1.07  22.5%
MultiSourc...hmarks/MallocBench/cfrac/cfrac     2.95   2.44   2.41  22.3%
MultiSourc...chmarks/BitBench/five11/five11     8.69  10.21   8.67  17.9%
MultiSource/Benchmarks/Ptrdist/bc/bc            1.25   1.25   1.07  16.8%
SingleSour...out-C++/Shootout-C++-ackermann     1.22   1.17   1.35  16.2%
MultiSourc...chmarks/Prolangs-C++/life/life     4.23   3.76   3.75  12.8%
External/SPEC/CINT95/134.perl/134.perl         16.76  17.79  17.73   6.1%
MultiSourc...e/Applications/ClamAV/clamscan     0.80   0.82   0.77   5.9%
SingleSour...hootout-C++/Shootout-C++-sieve     3.04   3.21   3.21   5.8%
MultiSource/Applications/lemon/lemon            2.84   2.72   2.79   4.2%
SingleSour...Shootout-C++/Shootout-C++-hash     1.27   1.31   1.32   3.5%
SingleSour...h/stencils/fdtd-apml/fdtd-apml    16.15  15.61  15.66   3.5%
MultiSourc...e/Applications/sqlite3/sqlite3     5.62   5.81   5.62   3.3%

              base0        base1        base2         diff
count    226.000000   226.000000   225.000000   227.000000
mean      45.939256    45.985196    46.096998     0.013667
std      163.389800   163.494907   163.503512     0.042327
min        0.608000     0.600500     0.665200     0.000000
25%        2.432750     2.428300     2.432600     0.001370
50%        4.708250     4.697600     4.799300     0.002822
75%        9.674850    10.083075     9.698000     0.007492
max     1222.896800  1223.112600  1221.131300     0.385443

- Displays the results of the different result files next to each other, computes the relative difference between the smallest and the biggest number (for nbench above: (18.10 - 14.39) / 14.39 ≈ 25.8%), and sorts by that difference.

=== A/B Comparisons and multiple runs ===
> compare.py --filter-short base0.json base1.json base2.json vs try0.json try1.json try2.json
Tests: 508
Short Running: 283 (filtered out)
Remaining: 225
Metric: exec_time

Program                                          lhs     rhs    diff
SingleSour.../Benchmarks/Linpack/linpack-pc     5.16    4.30  -16.5%
SingleSour...Benchmarks/Misc/matmul_f64_4x4     1.25    1.09  -12.8%
SingleSour...enchmarks/BenchmarkGame/n-body     1.86    1.63  -12.4%
MultiSourc...erolling-dbl/LoopRerolling-dbl     7.01    7.86   12.2%
MultiSource/Benchmarks/sim/sim                  4.37    4.88   11.7%
SingleSour...UnitTests/Vectorizer/gcc-loops     3.89    3.54   -9.0%
SingleSource/Benchmarks/Misc/salsa20            9.30    8.54   -8.3%
MultiSourc...marks/Trimaran/enc-pc1/enc-pc1     1.00    0.92   -8.2%
SingleSource/UnitTests/Vector/build2            2.90    3.13    8.1%
External/SPEC/CINT2000/181.mcf/181.mcf        100.20   92.82   -7.4%
MultiSourc...VC/Symbolics-dbl/Symbolics-dbl     5.02    4.65   -7.4%
SingleSour...enchmarks/CoyoteBench/fftbench     2.73    2.53   -7.1%
External/SPEC/CFP2000/177.mesa/177.mesa        49.12   46.05   -6.2%
MultiSourc...VC/Symbolics-flt/Symbolics-flt     2.94    2.76   -6.2%
SingleSour...hmarks/BenchmarkGame/recursive     1.32    1.39    5.4%

                lhs          rhs         diff
count    225.000000   225.000000   225.000000
mean      46.018045    46.272968    -0.003343
std      163.377050   164.958080     0.028145
min        0.665200     0.658800    -0.164888
25%        2.424500     2.428200    -0.006311
50%        4.799300     4.650700    -0.000495
75%        9.684500     9.646200     0.004407
max     1221.131300  1219.680000     0.122050

- A/B comparison mode: Merges the metric values of the files before the "vs" argument into an "lhs" set and the ones after into an "rhs" set. By default it takes the smallest value on each side (--merge-max and --merge-average are available as well) and compares the resulting two sets.

=== Filtering Flags ===
--filter-hash: Exclude programs whose 'hash' metric has the same value everywhere.
--filter-short: Exclude programs when the metric is < 0.6 (typically used with 'exec_time').
--filter-blacklist blacklist.txt: Exclude programs listed in the blacklist.txt file (one name per line).

You need Python 2.7 and the pandas library installed to get this running.

Hope it helps your benchmarking/testing workflow.

- Matthias
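PS: The filtering and merging flags are meant to be combined with the comparison modes above; an A/B comparison that skips programs with identical 'hash' values and short-running programs, averaging the runs on each side, would look something like this (the file names are just placeholders for your own results):

> compare.py --filter-hash --filter-short --merge-average base0.json base1.json vs try0.json try1.json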