Stefanos Baziotis via llvm-dev
2021-Jul-19 19:46 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
Hi,

> Usually one does not compare executions of the entire test-suite, but
> look for which programs have regressed. In this scenario only relative
> changes between programs matter, so μs are only compared to μs and
> seconds only compared to seconds.

That's true, but there are different insights one can get from, say, a 30%
increase in a program that initially took 100μs and one which initially took 10s.

> What do you mean? Don't you get the exec_time per program?

Yes, but the JSON file does not include the time _unit_. Actually, I think the
correct phrasing is "unit of time", not "time unit", my bad. In any case, I mean
that you get e.g. "exec_time": 4, but you don't know whether this 4 is 4 seconds
or 4 μs or any other unit of time.

For example, the only reason it seems that MultiSource/ uses seconds is that I
ran a bunch of them manually (and because some outputs saved by llvm-lit, which
measure in seconds, match the numbers in the JSON).

If we knew the unit of time per test case (or per X grouping of tests, for that
matter), we could then, e.g., normalize the times, as you suggest, or at least
know the unit of time and act accordingly.

> Running the programs a second time did work for me in the past.

OK, it seems to work for me if I wait, but it seems to behave differently the
second time. Anyway, not important.

> It depends. You can run in parallel, but then you should increase the
> number of samples (executions) appropriately to counter the increased
> noise. Depending on how many cores your system has, it might not be
> worth it, but instead try to make the system as deterministic as
> possible (single thread, thread affinity, avoid background processes,
> use perf instead of timeit, avoid context switches etc.). To avoid
> systematic bias because always the same cache-sensitive programs run
> in parallel, use the --shuffle option.

I see, thanks. I didn't know about the --shuffle option, interesting.

Btw, when using perf (i.e., using TEST_SUITE_USE_PERF in cmake), it seems that
perf runs both during the build (i.e., make) and the run (i.e., llvm-lit) of the
tests. It's not important, but do you happen to know why this happens?

> Also, depending on what you are trying to achieve (and what your platform
> target is), you could enable perfcounter
> <https://github.com/google/benchmark/blob/main/docs/perf_counters.md>
> collection;

Thanks, that can be useful in a bunch of cases. I should note that perf stats
are not included in the JSON file. Is the "canonical" way to access them to
follow the pattern CMakeFiles/<benchmark name>.dir/<benchmark name>.time.perfstats ?

For example, let's say that I want the perf stats for
test-suite/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp
To find them, I should go to the same path but in the build directory, i.e.:
test-suite-build/SingleSource/Benchmarks/Adobe-C++/
and then follow the pattern above, so the .perfstats file will be in:
test-suite-build/SingleSource/Benchmarks/Adobe-C++/CMakeFiles/loop_unroll.dir/loop_unroll.cpp.time.perfstats

Sorry for the long path strings, but I couldn't make it clearer otherwise.

Thanks to both,
Stefanos

On Mon, Jul 19, 2021 at 5:36 PM Mircea Trofin <mtrofin at google.com> wrote:
>
> On Sun, Jul 18, 2021 at 8:58 PM Michael Kruse via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>> On Sun, Jul 18, 2021 at 11:14 AM Stefanos Baziotis via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> > Now, to the questions. First, there doesn't seem to be a common time unit for
>> > "exec_time" among the different tests. For instance, SingleSource/ seem to use
>> > seconds while MicroBenchmarks seem to use μs. So, we can't reliably judge
>> > changes. Although I get the fact that micro-benchmarks are different in nature
>> > than Single/MultiSource benchmarks, so maybe one should focus only on
>> > the one or the other depending on what they're interested in.
>>
>> Usually one does not compare executions of the entire test-suite, but
>> look for which programs have regressed. In this scenario only relative
>> changes between programs matter, so μs are only compared to μs and
>> seconds only compared to seconds.
>>
>> > In any case, it would at least be great if the JSON data contained the
>> > time unit per test, but that is not happening either.
>>
>> What do you mean? Don't you get the exec_time per program?
>>
>> > Do you think that the lack of time unit info is a problem? If yes, do
>> > you like the solution of adding the time unit in the JSON or do you
>> > want to propose an alternative?
>>
>> You could also normalize the time unit that is emitted to JSON to s or ms.
>>
>> > The second question has to do with re-running the benchmarks: I do
>> > cmake + make + llvm-lit -v -j 1 -o out.json .
>> > but if I try to do the latter another time, it just does/shows nothing.
>> > Is there any reason that the benchmarks can't be run a second time?
>> > Could I somehow run it a second time?
>>
>> Running the programs a second time did work for me in the past.
>> Remember to change the output to another file or the previous .json
>> will be overwritten.
>>
>> > Lastly, slightly off-topic but while we're on the subject of benchmarking,
>> > do you think it's reliable to run with -j <number of cores>? I'm a little
>> > bit afraid of the shared caches (because misses should be counted in the
>> > CPU time, which is what is measured in "exec_time" AFAIU)
>> > and any potential multi-threading that the tests may use.
>>
>> It depends. You can run in parallel, but then you should increase the
>> number of samples (executions) appropriately to counter the increased
>> noise. Depending on how many cores your system has, it might not be
>> worth it, but instead try to make the system as deterministic as
>> possible (single thread, thread affinity, avoid background processes,
>> use perf instead of timeit, avoid context switches etc.). To avoid
>> systematic bias because always the same cache-sensitive programs run
>> in parallel, use the --shuffle option.
>>
>> Michael
>
> Also, depending on what you are trying to achieve (and what your platform
> target is), you could enable perfcounter
> <https://github.com/google/benchmark/blob/main/docs/perf_counters.md> collection;
> if instruction counts are sufficient (for example), the value will probably
> not vary much with multi-threading.
>
> ...but it's probably best to avoid system noise altogether. On Intel,
> afaik that includes disabling turbo boost and hyperthreading, along with
> Michael's recommendations.
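As an aside, the path convention described in the message above can be captured
in a small helper. The sketch below only illustrates that convention, under the
assumption that the CMake target name matches the benchmark's source file name
without its extension (true for the loop_unroll example, but other benchmarks
may name their targets differently); the perfstats_path helper and its arguments
are made up for the example.

import os

def perfstats_path(test_suite_root, build_root, source_path):
    """Map a benchmark source file to its expected .time.perfstats file.

    Follows the pattern discussed above:
      <build dir>/<same subdir>/CMakeFiles/<target>.dir/<source file>.time.perfstats
    where <target> is assumed to be the source file name minus its extension.
    """
    rel_dir = os.path.relpath(os.path.dirname(source_path), test_suite_root)
    source_file = os.path.basename(source_path)    # e.g. loop_unroll.cpp
    target = os.path.splitext(source_file)[0]      # e.g. loop_unroll
    return os.path.join(build_root, rel_dir, "CMakeFiles",
                        target + ".dir", source_file + ".time.perfstats")

# Example:
#   perfstats_path("test-suite", "test-suite-build",
#                  "test-suite/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp")
#   -> "test-suite-build/SingleSource/Benchmarks/Adobe-C++/CMakeFiles/loop_unroll.dir/loop_unroll.cpp.time.perfstats"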
Stefanos Baziotis via llvm-dev
2021-Jul-19 19:53 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
> Btw, when using perf (i.e., using TEST_SUITE_USE_PERF in cmake), it seems
> that perf runs both during the build (i.e., make) and the run (i.e.,
> llvm-lit) of the tests. It's not important, but do you happen to know
> why this happens?

It seems one gathers measurements for the compilation command and the other
for the run. My bad, I hadn't noticed.

- Stefanos

On Mon, Jul 19, 2021 at 10:46 PM Stefanos Baziotis
<stefanos.baziotis at gmail.com> wrote:
> [...]
Michael Kruse via llvm-dev
2021-Jul-19 22:25 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
On Mon, Jul 19, 2021 at 2:47 PM Stefanos Baziotis
<stefanos.baziotis at gmail.com> wrote:
> For example, the only reason it seems that MultiSource/ uses
> seconds is that I ran a bunch of them manually (and because
> some outputs saved by llvm-lit, which measure in seconds, match
> the numbers in the JSON).
>
> If we knew the unit of time per test case (or per X grouping of tests,
> for that matter), we could then, e.g., normalize the times, as you
> suggest, or at least know the unit of time and act accordingly.

You know the unit of time from the top-level folder: MicroBenchmarks is in
microseconds (because Google Benchmark reports microseconds), everything else
is in seconds. That might be confusing when you don't know about it, but once
you do, there is no ambiguity.

Michael
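A minimal sketch of what that rule looks like in practice: the exec_time values
in the lit JSON output can be normalized to seconds by keying off the top-level
folder in the test name. The JSON layout assumed here
({"tests": [{"name": ..., "metrics": {"exec_time": ...}}]}) is lit's usual
output format, so treat the details as illustrative rather than definitive.

import json

def exec_times_in_seconds(lit_json_path):
    """Return {test name: exec_time in seconds} from an llvm-lit JSON file.

    Applies the rule above: tests under MicroBenchmarks/ report
    microseconds, everything else reports seconds.
    """
    with open(lit_json_path) as f:
        results = json.load(f)

    times = {}
    for test in results.get("tests", []):
        metrics = test.get("metrics", {})
        if "exec_time" not in metrics:
            continue
        name = test["name"]            # e.g. "test-suite :: MicroBenchmarks/..."
        t = float(metrics["exec_time"])
        if "MicroBenchmarks/" in name:
            t /= 1e6                   # microseconds -> seconds
        times[name] = t
    return times

Two such dictionaries (one per run or per toolchain) can then be compared entry
by entry to spot the regressed programs, which matches the per-program
comparison described earlier in the thread.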
Mircea Trofin via llvm-dev
2021-Jul-19 23:13 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
On Mon, Jul 19, 2021 at 12:47 PM Stefanos Baziotis
<stefanos.baziotis at gmail.com> wrote:
> [...]
>
>> Also, depending on what you are trying to achieve (and what your platform
>> target is), you could enable perfcounter
>> <https://github.com/google/benchmark/blob/main/docs/perf_counters.md>
>> collection;
>
> Thanks, that can be useful in a bunch of cases. I should note that perf
> stats are not included in the JSON file. Is the "canonical" way to access
> them to follow the pattern
> CMakeFiles/<benchmark name>.dir/<benchmark name>.time.perfstats ?

You need to specify which counters you want collected, up to 3 - see the link
above (also, you need to opt in to linking libpfm).

> [...]
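Once a .time.perfstats file has been located (see the path example earlier in
the thread), pulling the counter values out of it is straightforward. The sketch
below assumes the file contains plain perf stat text output, one
"<count> <event>" line per counter with thousands separators; if perf was
invoked with -x for CSV output, the parsing would need to change accordingly,
so this is an illustration rather than a guaranteed format.

import re

def read_perfstats(path):
    """Extract {event name: count} from a perf-stat-style .perfstats file."""
    counts = {}
    # Matches e.g. "  1,234,567      instructions   # 1.90 insn per cycle"
    # and          "     123.45 msec task-clock     # 0.999 CPUs utilized".
    line_re = re.compile(r"^\s*([\d,.]+)\s+(?:msec\s+)?([A-Za-z][\w./:-]*)")
    with open(path) as f:
        for line in f:
            m = line_re.match(line)
            if not m:
                continue
            value, event = m.groups()
            if event == "seconds":   # skip the "seconds time elapsed" summary line
                continue
            counts[event] = float(value.replace(",", ""))
    return counts

# e.g. read_perfstats("test-suite-build/SingleSource/Benchmarks/Adobe-C++/"
#                     "CMakeFiles/loop_unroll.dir/loop_unroll.cpp.time.perfstats")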