thr3ads.net - llvm dev - [llvm-dev] RFC: EfficiencySanitizer working set tool [Apr 2016]

If this information is useful, please help other people find it:
Share via:

Derek Bruening via llvm-dev

2016-Apr-20 05:44 UTC

[llvm-dev] RFC: EfficiencySanitizer working set tool

Please reference the prior RFC on EfficiencySanitizer.  This is one of the
performance analysis tools we would like to build under the
EfficiencySanitizer umbrella.

===================Motivation
===================
Knowing the working set size at periodic points during a given
application's execution helps to understand its cache behavior, how its
behavior changes over time, how its performance might vary on different
hardware configurations, and how best to direct performance improvement
efforts.  For example, knowing whether the working set is close to fitting
in current L3 caches or is many times larger can help determine which
efforts are most likely to bear fruit.

The typical approach to acquire working set information is to collect a
memory reference profile and feed it to a cache simulator, looking for the
point at which the miss rate drops to zero.  However, this incurs
prohibitively high runtime overhead.  Our goal is to build a fast shadow
memory-based working set measurement tool.

===================Approach
===================
We focus on the data working set size, partly because it is typically more
important and partly because to measure the instruction working set with
compiler instrumentation is much more difficult to implement, requiring
insertion at a later and more challenging point in the compilation pipeline.

Because we want to know how the program’s working set affects the cache
behavior, we define the working set as the set of cache lines that are
referenced by the program during a certain period of time.

Using the definition above, we are able to use a single shadow bit to
represent whether a cache line is referenced or not.  If the cache line
size is 64 bytes, we will use one bit to represent a 64 byte cache line via
a 512 byte to 1 byte shadow mapping.

For every memory reference, we insert code to set the shadow bits
corresponding to the cache line(s) touched by that reference.  The
resulting shadow memory bitmap describes the program’s working set.

There are two optimizations that can be added:

1) Add a shadow bit check before writing each bit to reduce the number of
memory stores.  In our experience this is always a win with shadow memory
tools.

2) Analyze the memory references within a basic block to coalesce multiple
memory references into a single shadow update.

We will divide the program execution into regular intervals and record a
snapshot of the working set at each interval boundary.  An adaptive
strategy can keep the number of snapshots bounded across varying total
execution time by combining adjacent snapshots via logical or.  When a
snapshot is recorded, the shadow memory is cleared so that the next
snapshot starts with a blank slate.  If we use a 64 byte to 1 byte shadow
mapping, we can use the upper bits to store up to 8 consecutive snapshots
in the shadow memory itself by shifting rather than clearing the shadow
memory on a snapshot.

Recording the snapshot will use optimizations to avoid storing for the
entire address space: only mapped regions will be saved.

The report from the tool to the user at the end of execution would
essentially be a histogram with time on the x axis and the cache lines
touched on the y axis.  The units of time could be selected by the user:
cpu time, wall clock time, memory access count, etc.  The unit determines
the method of snapshot sampling, whether performance counters for the
memory access count (or instrumentation to increment a global counter) or
an itimer.  We can record callstacks with each snapshot as well to help
give an indication of what the program is doing at that point in time, to
help the user understand the program phases.

We expect the time overhead of the tool to be well under the 5x
EfficiencySanitizer ceiling; presumably it should be under 3x.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160420/8884704a/attachment.html>

Sean Silva via llvm-dev

2016-Apr-20 06:48 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

On Tue, Apr 19, 2016 at 10:44 PM, Derek Bruening via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Please reference the prior RFC on EfficiencySanitizer.  This is one of the
> performance analysis tools we would like to build under the
> EfficiencySanitizer umbrella.
>
> ===================> Motivation
> ===================>
> Knowing the working set size at periodic points during a given
> application's execution helps to understand its cache behavior, how its
> behavior changes over time, how its performance might vary on different
> hardware configurations, and how best to direct performance improvement
> efforts.  For example, knowing whether the working set is close to fitting
> in current L3 caches or is many times larger can help determine which
> efforts are most likely to bear fruit.
>
> The typical approach to acquire working set information is to collect a
> memory reference profile and feed it to a cache simulator, looking for the
> point at which the miss rate drops to zero.  However, this incurs
> prohibitively high runtime overhead.  Our goal is to build a fast shadow
> memory-based working set measurement tool.
>
> ===================> Approach
> ===================>
> We focus on the data working set size, partly because it is typically more
> important and partly because to measure the instruction working set with
> compiler instrumentation is much more difficult to implement, requiring
> insertion at a later and more challenging point in the compilation
pipeline.
>
> Because we want to know how the program’s working set affects the cache
> behavior, we define the working set as the set of cache lines that are
> referenced by the program during a certain period of time.
>
> Using the definition above, we are able to use a single shadow bit to
> represent whether a cache line is referenced or not.  If the cache line
> size is 64 bytes, we will use one bit to represent a 64 byte cache line via
> a 512 byte to 1 byte shadow mapping.
>
> For every memory reference, we insert code to set the shadow bits
> corresponding to the cache line(s) touched by that reference.  The
> resulting shadow memory bitmap describes the program’s working set.
>
> There are two optimizations that can be added:
>
> 1) Add a shadow bit check before writing each bit to reduce the number of
> memory stores.  In our experience this is always a win with shadow memory
> tools.
>
> 2) Analyze the memory references within a basic block to coalesce multiple
> memory references into a single shadow update.
>
Will this fire often? Very few memory references are known to the compiler
to be >=cacheline aligned, so I'm not seeing how the compiler will prove
that two memory references definitely land on the same cacheline.

-- Sean Silva

> We will divide the program execution into regular intervals and record a
> snapshot of the working set at each interval boundary.  An adaptive
> strategy can keep the number of snapshots bounded across varying total
> execution time by combining adjacent snapshots via logical or.  When a
> snapshot is recorded, the shadow memory is cleared so that the next
> snapshot starts with a blank slate.  If we use a 64 byte to 1 byte shadow
> mapping, we can use the upper bits to store up to 8 consecutive snapshots
> in the shadow memory itself by shifting rather than clearing the shadow
> memory on a snapshot.
>
> Recording the snapshot will use optimizations to avoid storing for the
> entire address space: only mapped regions will be saved.
>
> The report from the tool to the user at the end of execution would
> essentially be a histogram with time on the x axis and the cache lines
> touched on the y axis.  The units of time could be selected by the user:
> cpu time, wall clock time, memory access count, etc.  The unit determines
> the method of snapshot sampling, whether performance counters for the
> memory access count (or instrumentation to increment a global counter) or
> an itimer.  We can record callstacks with each snapshot as well to help
> give an indication of what the program is doing at that point in time, to
> help the user understand the program phases.
>
> We expect the time overhead of the tool to be well under the 5x
> EfficiencySanitizer ceiling; presumably it should be under 3x.
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160419/78fc935a/attachment.html>

Qin Zhao via llvm-dev

2016-Apr-20 15:14 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

On Wed, Apr 20, 2016 at 2:48 AM, Sean Silva <chisophugis at gmail.com>
wrote:
>
>
> On Tue, Apr 19, 2016 at 10:44 PM, Derek Bruening via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Please reference the prior RFC on EfficiencySanitizer.  This is one of
>> the performance analysis tools we would like to build under the
>> EfficiencySanitizer umbrella.
>>
>> ===================>> Motivation
>> ===================>>
>> Knowing the working set size at periodic points during a given
>> application's execution helps to understand its cache behavior, how
its
>> behavior changes over time, how its performance might vary on different
>> hardware configurations, and how best to direct performance improvement
>> efforts.  For example, knowing whether the working set is close to
fitting
>> in current L3 caches or is many times larger can help determine which
>> efforts are most likely to bear fruit.
>>
>> The typical approach to acquire working set information is to collect a
>> memory reference profile and feed it to a cache simulator, looking for
the
>> point at which the miss rate drops to zero.  However, this incurs
>> prohibitively high runtime overhead.  Our goal is to build a fast
shadow
>> memory-based working set measurement tool.
>>
>> ===================>> Approach
>> ===================>>
>> We focus on the data working set size, partly because it is typically
>> more important and partly because to measure the instruction working
set
>> with compiler instrumentation is much more difficult to implement,
>> requiring insertion at a later and more challenging point in the
>> compilation pipeline.
>>
>> Because we want to know how the program’s working set affects the cache
>> behavior, we define the working set as the set of cache lines that are
>> referenced by the program during a certain period of time.
>>
>> Using the definition above, we are able to use a single shadow bit to
>> represent whether a cache line is referenced or not.  If the cache line
>> size is 64 bytes, we will use one bit to represent a 64 byte cache line
via
>> a 512 byte to 1 byte shadow mapping.
>>
>> For every memory reference, we insert code to set the shadow bits
>> corresponding to the cache line(s) touched by that reference.  The
>> resulting shadow memory bitmap describes the program’s working set.
>>
>> There are two optimizations that can be added:
>>
>> 1) Add a shadow bit check before writing each bit to reduce the number
of
>> memory stores.  In our experience this is always a win with shadow
memory
>> tools.
>>
>> 2) Analyze the memory references within a basic block to coalesce
>> multiple memory references into a single shadow update.
>>
>
> Will this fire often? Very few memory references are known to the compiler
> to be >=cacheline aligned, so I'm not seeing how the compiler will
prove
> that two memory references definitely land on the same cacheline.
>
> -- Sean Silva
>
I am not sure what you mean "fire often"?
This is an optimization during the instrumentation, so it will be performed
during the instrumentation pass.
If you are asking about how many opportunities there, we do not know
because we have not yet implemented it. And we probably won't implement it
until we implement the basic instrumentation and evaluate the performance
and result.

But I think there are opportunities:
example 1: if two references access address Addr, and Addr+64(cacheline
size), then we only need one shadow address calculation for the first one,
the second one's shadow address can be obtained with +1.
example 2: if three references are: Addr, Addr+16, Addr+32, we can skip
instrumentation for Addr+16 without prove which cacheline it belongs to.

-- Qin

>
>
>> We will divide the program execution into regular intervals and record
a
>> snapshot of the working set at each interval boundary.  An adaptive
>> strategy can keep the number of snapshots bounded across varying total
>> execution time by combining adjacent snapshots via logical or.  When a
>> snapshot is recorded, the shadow memory is cleared so that the next
>> snapshot starts with a blank slate.  If we use a 64 byte to 1 byte
shadow
>> mapping, we can use the upper bits to store up to 8 consecutive
snapshots
>> in the shadow memory itself by shifting rather than clearing the shadow
>> memory on a snapshot.
>>
>> Recording the snapshot will use optimizations to avoid storing for the
>> entire address space: only mapped regions will be saved.
>>
>> The report from the tool to the user at the end of execution would
>> essentially be a histogram with time on the x axis and the cache lines
>> touched on the y axis.  The units of time could be selected by the
user:
>> cpu time, wall clock time, memory access count, etc.  The unit
determines
>> the method of snapshot sampling, whether performance counters for the
>> memory access count (or instrumentation to increment a global counter)
or
>> an itimer.  We can record callstacks with each snapshot as well to help
>> give an indication of what the program is doing at that point in time,
to
>> help the user understand the program phases.
>>
>> We expect the time overhead of the tool to be well under the 5x
>> EfficiencySanitizer ceiling; presumably it should be under 3x.
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
> --
> You received this message because you are subscribed to the Google Groups
> "efficiency-sanitizer" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to efficiency-sanitizer+unsubscribe at google.com.
> To post to this group, send email to efficiency-sanitizer at google.com.
> To view this discussion on the web visit
>
https://groups.google.com/a/google.com/d/msgid/efficiency-sanitizer/CAHnXoa%3Dgv-V_mzDEK%3DE422%2B5nCpOX3-gHpScVJojk%3DVVq3Hw%3Dw%40mail.gmail.com
>
<https://groups.google.com/a/google.com/d/msgid/efficiency-sanitizer/CAHnXoa%3Dgv-V_mzDEK%3DE422%2B5nCpOX3-gHpScVJojk%3DVVq3Hw%3Dw%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160420/f3b3f11e/attachment.html>

Alexander Potapenko via llvm-dev

2016-Apr-20 16:09 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

On Wed, Apr 20, 2016 at 7:44 AM, Derek Bruening via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> Please reference the prior RFC on EfficiencySanitizer.  This is one of the
> performance analysis tools we would like to build under the
> EfficiencySanitizer umbrella.
>
> ===================> Motivation
> ===================>
> Knowing the working set size at periodic points during a given
application's
> execution helps to understand its cache behavior, how its behavior changes
> over time, how its performance might vary on different hardware
> configurations, and how best to direct performance improvement efforts. 
For
> example, knowing whether the working set is close to fitting in current L3
> caches or is many times larger can help determine which efforts are most
> likely to bear fruit.
>
> The typical approach to acquire working set information is to collect a
> memory reference profile and feed it to a cache simulator, looking for the
> point at which the miss rate drops to zero.  However, this incurs
> prohibitively high runtime overhead.  Our goal is to build a fast shadow
> memory-based working set measurement tool.
>
> ===================> Approach
> ===================>
> We focus on the data working set size, partly because it is typically more
> important and partly because to measure the instruction working set with
> compiler instrumentation is much more difficult to implement, requiring
> insertion at a later and more challenging point in the compilation
pipeline.
>
> Because we want to know how the program’s working set affects the cache
> behavior, we define the working set as the set of cache lines that are
> referenced by the program during a certain period of time.
>
> Using the definition above, we are able to use a single shadow bit to
> represent whether a cache line is referenced or not.  If the cache line
size
> is 64 bytes, we will use one bit to represent a 64 byte cache line via a
512
> byte to 1 byte shadow mapping.I've raised the concern about problems with this approach in
multithreaded environment, but this particular tool is a good example,
so we can discuss those problems here.
Your approach suggests storing the state of 512 cache lines in a
single shadow cache line. So, if I'm understanding the algorithm
correctly, parallel accesses to 512 adjacent cache lines from
different CPUs will cause unnecessary contention on that shadow cache
line, which will presumably degrade the application's performance.
Also note that because we don't have atomic bitwise operations,
updates of shadow bytes will require a CAS loop.
I'm not sure about the performance impact of the above issues, but
there might be a tradeoff between the shadow memory scale and the
slowdown.
> For every memory reference, we insert code to set the shadow bits
> corresponding to the cache line(s) touched by that reference.  The
resulting
> shadow memory bitmap describes the program’s working set.
>
> There are two optimizations that can be added:
>
> 1) Add a shadow bit check before writing each bit to reduce the number of
> memory stores.  In our experience this is always a win with shadow memory
> tools.
>
> 2) Analyze the memory references within a basic block to coalesce multiple
> memory references into a single shadow update.
>
> We will divide the program execution into regular intervals and record a
> snapshot of the working set at each interval boundary.  An adaptive
strategy
> can keep the number of snapshots bounded across varying total execution
time
> by combining adjacent snapshots via logical or.  When a snapshot is
> recorded, the shadow memory is cleared so that the next snapshot starts
with
> a blank slate.  If we use a 64 byte to 1 byte shadow mapping, we can use
the
> upper bits to store up to 8 consecutive snapshots in the shadow memory
> itself by shifting rather than clearing the shadow memory on a snapshot.
>
> Recording the snapshot will use optimizations to avoid storing for the
> entire address space: only mapped regions will be saved.
>
> The report from the tool to the user at the end of execution would
> essentially be a histogram with time on the x axis and the cache lines
> touched on the y axis.  The units of time could be selected by the user:
cpu
> time, wall clock time, memory access count, etc.  The unit determines the
> method of snapshot sampling, whether performance counters for the memory
> access count (or instrumentation to increment a global counter) or an
> itimer.  We can record callstacks with each snapshot as well to help give
an
> indication of what the program is doing at that point in time, to help the
> user understand the program phases.
>
> We expect the time overhead of the tool to be well under the 5x
> EfficiencySanitizer ceiling; presumably it should be under 3x.
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

Filipe Cabecinhas via llvm-dev

2016-Apr-20 17:08 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

My comment is related to Alexander's.
I see you haven't yet implemented this, but I would guess that having
1 byte per cache-line (instead of 1 bit) would be much better (read:
faster). It could still impact performance a lot (cache contention,
etc), but I think that's expected. We're gathering information about
it, and as long as we know what we're doing to the program, we can
probably still use that information in a useful way.

For our case, programs are limited in the amount of virtual memory
they use, and they won't use more than 16GiB Virtual Memory space.
Which implies that we wouldn't need more than 16*(1024^3)/64 == 256MB
for a 1byte per cache-line mapping, which should be easy enough to get
from program budget. This would make these information-gathering runs
much faster than bit-twiddling to save 200MB.

Of course your use case might be very different :-)

Thank you,

Filipe


On Wed, Apr 20, 2016 at 5:09 PM, Alexander Potapenko via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> On Wed, Apr 20, 2016 at 7:44 AM, Derek Bruening via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Please reference the prior RFC on EfficiencySanitizer.  This is one of
the
>> performance analysis tools we would like to build under the
>> EfficiencySanitizer umbrella.
>>
>> ===================>> Motivation
>> ===================>>
>> Knowing the working set size at periodic points during a given
application's
>> execution helps to understand its cache behavior, how its behavior
changes
>> over time, how its performance might vary on different hardware
>> configurations, and how best to direct performance improvement efforts.
For
>> example, knowing whether the working set is close to fitting in current
L3
>> caches or is many times larger can help determine which efforts are
most
>> likely to bear fruit.
>>
>> The typical approach to acquire working set information is to collect a
>> memory reference profile and feed it to a cache simulator, looking for
the
>> point at which the miss rate drops to zero.  However, this incurs
>> prohibitively high runtime overhead.  Our goal is to build a fast
shadow
>> memory-based working set measurement tool.
>>
>> ===================>> Approach
>> ===================>>
>> We focus on the data working set size, partly because it is typically
more
>> important and partly because to measure the instruction working set
with
>> compiler instrumentation is much more difficult to implement, requiring
>> insertion at a later and more challenging point in the compilation
pipeline.
>>
>> Because we want to know how the program’s working set affects the cache
>> behavior, we define the working set as the set of cache lines that are
>> referenced by the program during a certain period of time.
>>
>> Using the definition above, we are able to use a single shadow bit to
>> represent whether a cache line is referenced or not.  If the cache line
size
>> is 64 bytes, we will use one bit to represent a 64 byte cache line via
a 512
>> byte to 1 byte shadow mapping.
> I've raised the concern about problems with this approach in
> multithreaded environment, but this particular tool is a good example,
> so we can discuss those problems here.
> Your approach suggests storing the state of 512 cache lines in a
> single shadow cache line. So, if I'm understanding the algorithm
> correctly, parallel accesses to 512 adjacent cache lines from
> different CPUs will cause unnecessary contention on that shadow cache
> line, which will presumably degrade the application's performance.
> Also note that because we don't have atomic bitwise operations,
> updates of shadow bytes will require a CAS loop.
> I'm not sure about the performance impact of the above issues, but
> there might be a tradeoff between the shadow memory scale and the
> slowdown.
>
>> For every memory reference, we insert code to set the shadow bits
>> corresponding to the cache line(s) touched by that reference.  The
resulting
>> shadow memory bitmap describes the program’s working set.
>>
>> There are two optimizations that can be added:
>>
>> 1) Add a shadow bit check before writing each bit to reduce the number
of
>> memory stores.  In our experience this is always a win with shadow
memory
>> tools.
>>
>> 2) Analyze the memory references within a basic block to coalesce
multiple
>> memory references into a single shadow update.
>>
>> We will divide the program execution into regular intervals and record
a
>> snapshot of the working set at each interval boundary.  An adaptive
strategy
>> can keep the number of snapshots bounded across varying total execution
time
>> by combining adjacent snapshots via logical or.  When a snapshot is
>> recorded, the shadow memory is cleared so that the next snapshot starts
with
>> a blank slate.  If we use a 64 byte to 1 byte shadow mapping, we can
use the
>> upper bits to store up to 8 consecutive snapshots in the shadow memory
>> itself by shifting rather than clearing the shadow memory on a
snapshot.
>>
>> Recording the snapshot will use optimizations to avoid storing for the
>> entire address space: only mapped regions will be saved.
>>
>> The report from the tool to the user at the end of execution would
>> essentially be a histogram with time on the x axis and the cache lines
>> touched on the y axis.  The units of time could be selected by the
user: cpu
>> time, wall clock time, memory access count, etc.  The unit determines
the
>> method of snapshot sampling, whether performance counters for the
memory
>> access count (or instrumentation to increment a global counter) or an
>> itimer.  We can record callstacks with each snapshot as well to help
give an
>> indication of what the program is doing at that point in time, to help
the
>> user understand the program phases.
>>
>> We expect the time overhead of the tool to be well under the 5x
>> EfficiencySanitizer ceiling; presumably it should be under 3x.
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>
>
> --
> Alexander Potapenko
> Software Engineer
>
> Google Germany GmbH
> Erika-Mann-Straße, 33
> 80636 München
>
> Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
> Registergericht und -nummer: Hamburg, HRB 86891
> Sitz der Gesellschaft: Hamburg
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Filipe Cabecinhas via llvm-dev

2016-Apr-20 17:15 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

On Wed, Apr 20, 2016 at 6:44 AM, Derek Bruening via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> We will divide the program execution into regular intervals and record a
> snapshot of the working set at each interval boundary.  An adaptive
strategy
> can keep the number of snapshots bounded across varying total execution
time
> by combining adjacent snapshots via logical or.  When a snapshot is
> recorded, the shadow memory is cleared so that the next snapshot starts
with
> a blank slate.  If we use a 64 byte to 1 byte shadow mapping, we can use
the
> upper bits to store up to 8 consecutive snapshots in the shadow memory
> itself by shifting rather than clearing the shadow memory on a snapshot.I forgot to add my previous questions:

About the working set tool:
How are you thinking about doing the snapshots? How do you plan to
sync the several threads?
Spawning an external process/"thread" (kind of like LSan), or
internally?

I know I'm getting to low(ish)-level details, but if you've already
thought about this (possibly about several ways to do this), I'd like
to know more about what avenues you're planning on exploring.

I'm ok with a "we haven't thought that far ahead about this tool
yet" :-)

Thank you,

  Filipe
> Recording the snapshot will use optimizations to avoid storing for the
> entire address space: only mapped regions will be saved.
>
> The report from the tool to the user at the end of execution would
> essentially be a histogram with time on the x axis and the cache lines
> touched on the y axis.  The units of time could be selected by the user:
cpu
> time, wall clock time, memory access count, etc.  The unit determines the
> method of snapshot sampling, whether performance counters for the memory
> access count (or instrumentation to increment a global counter) or an
> itimer.  We can record callstacks with each snapshot as well to help give
an
> indication of what the program is doing at that point in time, to help the
> user understand the program phases.
>
> We expect the time overhead of the tool to be well under the 5x
> EfficiencySanitizer ceiling; presumably it should be under 3x.
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Derek Bruening via llvm-dev

2016-Apr-20 17:43 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

On Wed, Apr 20, 2016 at 1:15 PM, Filipe Cabecinhas <filcab at gmail.com>
wrote:
> On Wed, Apr 20, 2016 at 6:44 AM, Derek Bruening via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > We will divide the program execution into regular intervals and record
a
> > snapshot of the working set at each interval boundary.  An adaptive
> strategy
> > can keep the number of snapshots bounded across varying total
execution
> time
> > by combining adjacent snapshots via logical or.  When a snapshot is
> > recorded, the shadow memory is cleared so that the next snapshot
starts
> with
> > a blank slate.  If we use a 64 byte to 1 byte shadow mapping, we can
use
> the
> > upper bits to store up to 8 consecutive snapshots in the shadow memory
> > itself by shifting rather than clearing the shadow memory on a
snapshot.
> I forgot to add my previous questions:
>
> About the working set tool:
> How are you thinking about doing the snapshots? How do you plan to
> sync the several threads?
> Spawning an external process/"thread" (kind of like LSan), or
internally?
>
> I know I'm getting to low(ish)-level details, but if you've already
> thought about this (possibly about several ways to do this), I'd like
> to know more about what avenues you're planning on exploring.
>
> I'm ok with a "we haven't thought that far ahead about this
tool yet" :-)
>
Our plan is to live with races between shadow memory updates in other
threads and the thread making a copy of the shadow memory.  The nature of
the tool means that pinpoint accuracy is not required.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160420/02fe9be3/attachment.html>

Qin Zhao via llvm-dev

2016-Apr-20 18:24 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

>
>
> I've raised the concern about problems with this approach in
> multithreaded environment, but this particular tool is a good example,
> so we can discuss those problems here.
> Your approach suggests storing the state of 512 cache lines in a
> single shadow cache line. So, if I'm understanding the algorithm
> correctly, parallel accesses to 512 adjacent cache lines from
> different CPUs will cause unnecessary contention on that shadow cache
> line, which will presumably degrade the application's performance.
> Also note that because we don't have atomic bitwise operations,
> updates of shadow bytes will require a CAS loop.
> I'm not sure about the performance impact of the above issues, but
> there might be a tradeoff between the shadow memory scale and the
> slowdown.
>
1. As Derek said, we will do 64B-2-1B mapping for easier instrumentation.
2. The cache contention in fact is not as bad as you think if we apply the
optimization mentioned by Derek:
"1) Add a shadow bit check before writing each bit to reduce the number of
memory stores.  In our experience this is always a win with shadow memory
tools."
By doing that, most shadow access would be read instead of write, so much
less cache contention.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160420/6c971dd2/attachment.html>

Qin Zhao via llvm-dev

2016-Apr-20 21:23 UTC

head link

[llvm-dev] RFC: EfficiencySanitizer working set tool

On Wed, Apr 20, 2016 at 1:15 PM, Filipe Cabecinhas <filcab at gmail.com>
wrote:
> On Wed, Apr 20, 2016 at 6:44 AM, Derek Bruening via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > We will divide the program execution into regular intervals and record
a
> > snapshot of the working set at each interval boundary.  An adaptive
> strategy
> > can keep the number of snapshots bounded across varying total
execution
> time
> > by combining adjacent snapshots via logical or.  When a snapshot is
> > recorded, the shadow memory is cleared so that the next snapshot
starts
> with
> > a blank slate.  If we use a 64 byte to 1 byte shadow mapping, we can
use
> the
> > upper bits to store up to 8 consecutive snapshots in the shadow memory
> > itself by shifting rather than clearing the shadow memory on a
snapshot.
> I forgot to add my previous questions:
>
> About the working set tool:
> How are you thinking about doing the snapshots? How do you plan to
> sync the several threads?
> Spawning an external process/"thread" (kind of like LSan), or
internally?
>
I do not think synchronized operation is important here.
The tool should be able to tolerate certain level of racy updates.
We could either use the spawning thread or a simple time interrupt without
sideline thread.
Sampling without sideline thread has the advantage of providing some sample
call stack.
Sideline thread avoids app thread pausing for the scanning.
We may implement both.

>
> I know I'm getting to low(ish)-level details, but if you've already
> thought about this (possibly about several ways to do this), I'd like
> to know more about what avenues you're planning on exploring.
>
> I'm ok with a "we haven't thought that far ahead about this
tool yet" :-)
>
> Thank you,
>
>   Filipe
>
> > Recording the snapshot will use optimizations to avoid storing for the
> > entire address space: only mapped regions will be saved.
> >
> > The report from the tool to the user at the end of execution would
> > essentially be a histogram with time on the x axis and the cache lines
> > touched on the y axis.  The units of time could be selected by the
user:
> cpu
> > time, wall clock time, memory access count, etc.  The unit determines
the
> > method of snapshot sampling, whether performance counters for the
memory
> > access count (or instrumentation to increment a global counter) or an
> > itimer.  We can record callstacks with each snapshot as well to help
> give an
> > indication of what the program is doing at that point in time, to help
> the
> > user understand the program phases.
> >
> > We expect the time overhead of the tool to be well under the 5x
> > EfficiencySanitizer ceiling; presumably it should be under 3x.
> >
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
>
> --
> You received this message because you are subscribed to the Google Groups
> "efficiency-sanitizer" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to efficiency-sanitizer+unsubscribe at google.com.
> To post to this group, send email to efficiency-sanitizer at google.com.
> To view this discussion on the web visit
>
https://groups.google.com/a/google.com/d/msgid/efficiency-sanitizer/CAEDE853L4qXUw7YwwYWYLkyRRc6caT%3DgiW8%2BFS8%3D6fg6sT5ufg%40mail.gmail.com
> .
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160420/eac592ac/attachment.html>

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Apr 2016 - RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

[llvm-dev] RFC: EfficiencySanitizer working set tool

Possibly Parallel Threads