Stefanos Baziotis via llvm-dev
2021-Jul-19 19:46 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
Hi,

> Usually one does not compare executions of the entire test-suite, but
> look for which programs have regressed. In this scenario only relative
> changes between programs matter, so μs are only compared to μs and
> seconds only compared to seconds.

That's true, but there are different insights one can get from, say, a 30%
increase in a program that initially took 100μs and one which initially took 10s.

> What do you mean? Don't you get the exec_time per program?

Yes, but the JSON file does not include the time _unit_. Actually, I think the
correct phrasing is "unit of time", not "time unit", my bad. In any case, I mean
that you get e.g. "exec_time": 4, but you don't know whether this 4 is 4 seconds
or 4 μs or any other unit of time.

For example, the only reason it seems that MultiSource/ uses seconds is that I
ran a bunch of them manually (and because some outputs saved by llvm-lit, which
measure in seconds, match the numbers in the JSON).

If we knew the unit of time per test case (or per X grouping of tests, for that
matter), we could then, e.g., normalize the times, as you suggest, or at least
know the unit of time and act accordingly.

> Running the programs a second time did work for me in the past.

OK, it seems to work for me if I wait, but it seems to behave differently the
second time. Anyway, not important.

> It depends. You can run in parallel, but then you should increase the
> number of samples (executions) appropriately to counter the increased
> noise. Depending on how many cores your system has, it might not be
> worth it, but instead try to make the system as deterministic as
> possible (single thread, thread affinity, avoid background processes,
> use perf instead of timeit, avoid context switches etc.). To avoid
> systematic bias because always the same cache-sensitive programs run
> in parallel, use the --shuffle option.

I see, thanks. I didn't know about the --shuffle option, interesting.

Btw, when using perf (i.e., using TEST_SUITE_USE_PERF in cmake), it seems that
perf runs both during the build (i.e., make) and the run (i.e., llvm-lit) of the
tests. It's not important, but do you happen to know why this happens?

> Also, depending on what you are trying to achieve (and what your platform
> target is), you could enable perfcounter
> <https://github.com/google/benchmark/blob/main/docs/perf_counters.md>
> collection;

Thanks, that can be useful in a bunch of cases. I should note that perf stats
are not included in the JSON file. Is the "canonical" way to access them to
follow the pattern CMakeFiles/<benchmark name>.dir/<benchmark name>.time.perfstats ?

For example, let's say that I want the perf stats for
test-suite/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp
To find them, I should go to the same path but in the build directory, i.e.:
test-suite-build/SingleSource/Benchmarks/Adobe-C++/
and then follow the pattern above, so the .perfstats file will be in:
test-suite-build/SingleSource/Benchmarks/Adobe-C++/CMakeFiles/loop_unroll.dir/loop_unroll.cpp.time.perfstats

Sorry for the long path strings, but I couldn't make it clearer otherwise.

Thanks to both,
Stefanos

On Mon, Jul 19, 2021 at 5:36 PM Mircea Trofin <mtrofin at google.com> wrote:
>
> On Sun, Jul 18, 2021 at 8:58 PM Michael Kruse via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>> On Sun, Jul 18, 2021 at 11:14 AM Stefanos Baziotis via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> > Now, to the questions. First, there doesn't seem to be a common time unit for
>> > "exec_time" among the different tests. For instance, SingleSource/ seem to use
>> > seconds while MicroBenchmarks seem to use μs. So, we can't reliably judge
>> > changes. Although I get the fact that micro-benchmarks are different in nature
>> > than Single/MultiSource benchmarks, so maybe one should focus only on
>> > the one or the other depending on what they're interested in.
>>
>> Usually one does not compare executions of the entire test-suite, but
>> look for which programs have regressed. In this scenario only relative
>> changes between programs matter, so μs are only compared to μs and
>> seconds only compared to seconds.
>>
>> > In any case, it would at least be great if the JSON data contained the
>> > time unit per test, but that is not happening either.
>>
>> What do you mean? Don't you get the exec_time per program?
>>
>> > Do you think that the lack of time unit info is a problem? If yes, do
>> > you like the solution of adding the time unit in the JSON or do you
>> > want to propose an alternative?
>>
>> You could also normalize the time unit that is emitted to JSON to s or ms.
>>
>> > The second question has to do with re-running the benchmarks: I do
>> > cmake + make + llvm-lit -v -j 1 -o out.json .
>> > but if I try to do the latter another time, it just does/shows nothing.
>> > Is there any reason that the benchmarks can't be run a second time?
>> > Could I somehow run it a second time?
>>
>> Running the programs a second time did work for me in the past.
>> Remember to change the output to another file or the previous .json
>> will be overwritten.
>>
>> > Lastly, slightly off-topic but while we're on the subject of benchmarking,
>> > do you think it's reliable to run with -j <number of cores>? I'm a little
>> > bit afraid of the shared caches (because misses should be counted in the
>> > CPU time, which is what is measured in "exec_time" AFAIU)
>> > and any potential multi-threading that the tests may use.
>>
>> It depends. You can run in parallel, but then you should increase the
>> number of samples (executions) appropriately to counter the increased
>> noise. Depending on how many cores your system has, it might not be
>> worth it, but instead try to make the system as deterministic as
>> possible (single thread, thread affinity, avoid background processes,
>> use perf instead of timeit, avoid context switches etc.). To avoid
>> systematic bias because always the same cache-sensitive programs run
>> in parallel, use the --shuffle option.
>>
>> Michael
>
> Also, depending on what you are trying to achieve (and what your platform
> target is), you could enable perfcounter
> <https://github.com/google/benchmark/blob/main/docs/perf_counters.md> collection;
> if instruction counts are sufficient (for example), the value will probably
> not vary much with multi-threading.
>
> ...but it's probably best to avoid system noise altogether. On Intel,
> afaik that includes disabling turbo boost and hyperthreading, along with
> Michael's recommendations.
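As an aside, the path convention described in the message above can be captured
in a small helper. The sketch below only illustrates that convention, under the
assumption that the CMake target name matches the benchmark's source file name
without its extension (true for the loop_unroll example, but other benchmarks
may name their targets differently); the perfstats_path helper and its arguments
are made up for the example.

import os

def perfstats_path(test_suite_root, build_root, source_path):
    """Map a benchmark source file to its expected .time.perfstats file.

    Follows the pattern discussed above:
      <build dir>/<same subdir>/CMakeFiles/<target>.dir/<source file>.time.perfstats
    where <target> is assumed to be the source file name minus its extension.
    """
    rel_dir = os.path.relpath(os.path.dirname(source_path), test_suite_root)
    source_file = os.path.basename(source_path)    # e.g. loop_unroll.cpp
    target = os.path.splitext(source_file)[0]      # e.g. loop_unroll
    return os.path.join(build_root, rel_dir, "CMakeFiles",
                        target + ".dir", source_file + ".time.perfstats")

# Example:
#   perfstats_path("test-suite", "test-suite-build",
#                  "test-suite/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp")
#   -> "test-suite-build/SingleSource/Benchmarks/Adobe-C++/CMakeFiles/loop_unroll.dir/loop_unroll.cpp.time.perfstats"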
Stefanos Baziotis via llvm-dev
2021-Jul-19 19:53 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
> Btw, when using perf (i.e., using TEST_SUITE_USE_PERF in cmake), it seems
> that perf runs both during the build (i.e., make) and the run (i.e.,
> llvm-lit) of the tests. It's not important, but do you happen to know
> why this happens?

It seems one gathers measurements for the compilation command and the other
for the run. My bad, I hadn't noticed.

- Stefanos

On Mon, Jul 19, 2021 at 10:46 PM Stefanos Baziotis
<stefanos.baziotis at gmail.com> wrote:
> [...]
Michael Kruse via llvm-dev
2021-Jul-19 22:25 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
On Mon, Jul 19, 2021 at 2:47 PM Stefanos Baziotis
<stefanos.baziotis at gmail.com> wrote:
> For example, the only reason it seems that MultiSource/ uses
> seconds is that I ran a bunch of them manually (and because
> some outputs saved by llvm-lit, which measure in seconds, match
> the numbers in the JSON).
>
> If we knew the unit of time per test case (or per X grouping of tests,
> for that matter), we could then, e.g., normalize the times, as you
> suggest, or at least know the unit of time and act accordingly.

You know the unit of time from the top-level folder: MicroBenchmarks is in
microseconds (because Google Benchmark reports microseconds), everything else
is in seconds. That might be confusing when you don't know about it, but once
you do, there is no ambiguity.

Michael
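A minimal sketch of what that rule looks like in practice: the exec_time values
in the lit JSON output can be normalized to seconds by keying off the top-level
folder in the test name. The JSON layout assumed here
({"tests": [{"name": ..., "metrics": {"exec_time": ...}}]}) is lit's usual
output format, so treat the details as illustrative rather than definitive.

import json

def exec_times_in_seconds(lit_json_path):
    """Return {test name: exec_time in seconds} from an llvm-lit JSON file.

    Applies the rule above: tests under MicroBenchmarks/ report
    microseconds, everything else reports seconds.
    """
    with open(lit_json_path) as f:
        results = json.load(f)

    times = {}
    for test in results.get("tests", []):
        metrics = test.get("metrics", {})
        if "exec_time" not in metrics:
            continue
        name = test["name"]            # e.g. "test-suite :: MicroBenchmarks/..."
        t = float(metrics["exec_time"])
        if "MicroBenchmarks/" in name:
            t /= 1e6                   # microseconds -> seconds
        times[name] = t
    return times

Two such dictionaries (one per run or per toolchain) can then be compared entry
by entry to spot the regressed programs, which matches the per-program
comparison described earlier in the thread.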
Mircea Trofin via llvm-dev
2021-Jul-19 23:13 UTC
[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks
On Mon, Jul 19, 2021 at 12:47 PM Stefanos Baziotis
<stefanos.baziotis at gmail.com> wrote:
> [...]
>
>> Also, depending on what you are trying to achieve (and what your platform
>> target is), you could enable perfcounter
>> <https://github.com/google/benchmark/blob/main/docs/perf_counters.md>
>> collection;
>
> Thanks, that can be useful in a bunch of cases. I should note that perf
> stats are not included in the JSON file. Is the "canonical" way to access
> them to follow the pattern
> CMakeFiles/<benchmark name>.dir/<benchmark name>.time.perfstats ?

You need to specify which counters you want collected, up to 3 - see the link
above (also, you need to opt in to linking libpfm).

> [...]
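Once a .time.perfstats file has been located (see the path example earlier in
the thread), pulling the counter values out of it is straightforward. The sketch
below assumes the file contains plain perf stat text output, one
"<count> <event>" line per counter with thousands separators; if perf was
invoked with -x for CSV output, the parsing would need to change accordingly,
so this is an illustration rather than a guaranteed format.

import re

def read_perfstats(path):
    """Extract {event name: count} from a perf-stat-style .perfstats file."""
    counts = {}
    # Matches e.g. "  1,234,567      instructions   # 1.90 insn per cycle"
    # and          "     123.45 msec task-clock     # 0.999 CPUs utilized".
    line_re = re.compile(r"^\s*([\d,.]+)\s+(?:msec\s+)?([A-Za-z][\w./:-]*)")
    with open(path) as f:
        for line in f:
            m = line_re.match(line)
            if not m:
                continue
            value, event = m.groups()
            if event == "seconds":   # skip the "seconds time elapsed" summary line
                continue
            counts[event] = float(value.replace(",", ""))
    return counts

# e.g. read_perfstats("test-suite-build/SingleSource/Benchmarks/Adobe-C++/"
#                     "CMakeFiles/loop_unroll.dir/loop_unroll.cpp.time.perfstats")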