Matthias Braun via llvm-dev
2016-Oct-08 03:42 UTC
[llvm-dev] New test-suite result viewer/analyzer
I just put a little script into the llvm test-suite under util/compare.py. It is useful for situations in which you want to analyze the results of a few test-suite runs on the command line; if you have hundreds of results or want graphs, you should rather use LNT. The tool currently parses the json files produced by running "lit -o file.json" (so you should use the cmake/lit test-suite mode).
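For completeness, a minimal sketch of how a result file gets produced and fed to the script (the paths and the compiler choice are placeholders, adjust them to your own setup):

> cmake -DCMAKE_C_COMPILER=/path/to/clang /path/to/test-suite   # configure a test-suite build directory
> make                                                          # build the benchmarks
> lit -o base0.json .                                           # run them, writing the results as json
> compare.py base0.json                                         # inspect the results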
=== Basic usage ===
> compare.py base0.json
Warning: 'test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test' has No metrics!
Tests: 508
Metric: exec_time

Program                                   base0

INT2006/456.hmmer/456.hmmer             1222.90
INT2006/464.h264ref/464.h264ref          928.70
INT2006/458.sjeng/458.sjeng              873.93
INT2006/401.bzip2/401.bzip2              829.99
INT2006/445.gobmk/445.gobmk              782.92
INT2006/471.omnetpp/471.omnetpp          723.68
INT2006/473.astar/473.astar              701.71
INT2006/400.perlbench/400.perlbench      677.13
INT2006/483.xalancbmk/483.xalancbmk      502.35
INT2006/462.libquantum/462.libquantum    409.06
INT2000/164.gzip/164.gzip                150.25
FP2000/188.ammp/188.ammp                 149.88
INT2000/197.parser/197.parser            135.19
INT2000/300.twolf/300.twolf              119.94
INT2000/256.bzip2/256.bzip2              105.71

              base0
count    506.000000
mean      20.563098
std      111.423325
min        0.003400
25%        0.011200
50%        0.339450
75%        4.067200
max     1222.896800

- All numbers are arranged below each other on the dot (mail clients with variable-width fonts mess up the effect).
- Results are sorted by magnitude and limited to the 15 biggest ones by default; common prefixes and suffixes in the benchmark names are removed ('test-suite :: External/SPEC/' and '.test' in this case). Names that are still too long are shortened with a '...' in the middle. All of this can be disabled with the --full flag.
- The pandas library prints some neat statistics below the results.
- The 'exec_time' metric is shown by default; use --metric XXX to select a different one.

=== Compare multiple runs ===
> compare.py --filter-short base0.json base1.json base2.json
Tests: 508
Short Running: 281 (filtered out)
Remaining: 227
Metric: exec_time

Program                                        base0  base1  base2   diff
SingleSour...e/Benchmarks/Misc/himenobmtxpa     3.27   3.26   4.52  38.5%
MultiSource/Benchmarks/nbench/nbench           14.39  18.10  15.03  25.8%
SingleSour...Benchmarks/Shootout-C++/lists1     0.87   1.02   1.07  22.5%
MultiSourc...hmarks/MallocBench/cfrac/cfrac     2.95   2.44   2.41  22.3%
MultiSourc...chmarks/BitBench/five11/five11     8.69  10.21   8.67  17.9%
MultiSource/Benchmarks/Ptrdist/bc/bc            1.25   1.25   1.07  16.8%
SingleSour...out-C++/Shootout-C++-ackermann     1.22   1.17   1.35  16.2%
MultiSourc...chmarks/Prolangs-C++/life/life     4.23   3.76   3.75  12.8%
External/SPEC/CINT95/134.perl/134.perl         16.76  17.79  17.73   6.1%
MultiSourc...e/Applications/ClamAV/clamscan     0.80   0.82   0.77   5.9%
SingleSour...hootout-C++/Shootout-C++-sieve     3.04   3.21   3.21   5.8%
MultiSource/Applications/lemon/lemon            2.84   2.72   2.79   4.2%
SingleSour...Shootout-C++/Shootout-C++-hash     1.27   1.31   1.32   3.5%
SingleSour...h/stencils/fdtd-apml/fdtd-apml    16.15  15.61  15.66   3.5%
MultiSourc...e/Applications/sqlite3/sqlite3     5.62   5.81   5.62   3.3%

              base0        base1        base2         diff
count    226.000000   226.000000   225.000000   227.000000
mean      45.939256    45.985196    46.096998     0.013667
std      163.389800   163.494907   163.503512     0.042327
min        0.608000     0.600500     0.665200     0.000000
25%        2.432750     2.428300     2.432600     0.001370
50%        4.708250     4.697600     4.799300     0.002822
75%        9.674850    10.083075     9.698000     0.007492
max     1222.896800  1223.112600  1221.131300     0.385443

- Displays the results of the different result files next to each other, computes the relative difference between the smallest and the biggest number (for nbench above: (18.10 - 14.39) / 14.39 ≈ 25.8%), and sorts by that difference.

=== A/B Comparisons and multiple runs ===
> compare.py --filter-short base0.json base1.json base2.json vs try0.json try1.json try2.json
Tests: 508
Short Running: 283 (filtered out)
Remaining: 225
Metric: exec_time

Program                                          lhs     rhs    diff
SingleSour.../Benchmarks/Linpack/linpack-pc     5.16    4.30  -16.5%
SingleSour...Benchmarks/Misc/matmul_f64_4x4     1.25    1.09  -12.8%
SingleSour...enchmarks/BenchmarkGame/n-body     1.86    1.63  -12.4%
MultiSourc...erolling-dbl/LoopRerolling-dbl     7.01    7.86   12.2%
MultiSource/Benchmarks/sim/sim                  4.37    4.88   11.7%
SingleSour...UnitTests/Vectorizer/gcc-loops     3.89    3.54   -9.0%
SingleSource/Benchmarks/Misc/salsa20            9.30    8.54   -8.3%
MultiSourc...marks/Trimaran/enc-pc1/enc-pc1     1.00    0.92   -8.2%
SingleSource/UnitTests/Vector/build2            2.90    3.13    8.1%
External/SPEC/CINT2000/181.mcf/181.mcf        100.20   92.82   -7.4%
MultiSourc...VC/Symbolics-dbl/Symbolics-dbl     5.02    4.65   -7.4%
SingleSour...enchmarks/CoyoteBench/fftbench     2.73    2.53   -7.1%
External/SPEC/CFP2000/177.mesa/177.mesa        49.12   46.05   -6.2%
MultiSourc...VC/Symbolics-flt/Symbolics-flt     2.94    2.76   -6.2%
SingleSour...hmarks/BenchmarkGame/recursive     1.32    1.39    5.4%

                lhs          rhs         diff
count    225.000000   225.000000   225.000000
mean      46.018045    46.272968    -0.003343
std      163.377050   164.958080     0.028145
min        0.665200     0.658800    -0.164888
25%        2.424500     2.428200    -0.006311
50%        4.799300     4.650700    -0.000495
75%        9.684500     9.646200     0.004407
max     1221.131300  1219.680000     0.122050

- A/B comparison mode: Merges the metric values of the files before the "vs" argument into an "lhs" set and the ones after into an "rhs" set. By default it takes the smallest value on each side (--merge-max and --merge-average are available as well) and compares the resulting two sets.

=== Filtering Flags ===
--filter-hash: Exclude programs whose 'hash' metric has the same value everywhere.
--filter-short: Exclude programs when the metric is < 0.6 (typically used with 'exec_time').
--filter-blacklist blacklist.txt: Exclude programs listed in the blacklist.txt file (one name per line).

You need Python 2.7 and the pandas library installed to get this running.

Hope it helps your benchmarking/testing workflow.

- Matthias
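PS: The filtering and merging flags are meant to be combined with the comparison modes above; an A/B comparison that skips programs with identical 'hash' values and short-running programs, averaging the runs on each side, would look something like this (the file names are just placeholders for your own results):

> compare.py --filter-hash --filter-short --merge-average base0.json base1.json vs try0.json try1.json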