Edward Ned Harvey
2011-May-05 02:56 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
This is a summary of a much longer discussion, "Dedup and L2ARC memory requirements (again)". Sorry, even this summary is long. But the results vary enormously based on individual usage, so any "rule of thumb" metric that has been bouncing around on the internet is simply not sufficient. You need to go into this level of detail to get an estimate that's worth the napkin or bathroom tissue it's scribbled on.

This is how to (reasonably) accurately estimate the hypothetical ram requirements to hold the complete data deduplication tables (DDT) and L2ARC references in ram. Please note both the DDT and L2ARC references can be evicted from memory according to system policy, whenever the system decides some other data is more valuable to keep. So following this guide does not guarantee that the whole DDT will remain in ARC or L2ARC. But it's a good start.

I am using a Solaris 11 Express x86 test system for my example numbers below.

----------- To calculate size of DDT -----------

Each entry in the DDT is a fixed size, which varies by platform. You can find it with the command:
    echo ::sizeof ddt_entry_t | mdb -k
This will return a hex value, which you probably want to convert to decimal. On my test system, it is 0x178, which is 376 bytes.

There is one DDT entry per non-dedup'd (unique) block in the zpool. Be aware that you cannot reliably estimate #blocks by counting #files. You can find the total number of blocks, including dedup'd blocks, in your pool with this command:
    zdb -bb poolname | grep 'bp count'
Note: This command will run a long time and is IO intensive. On my systems, where a scrub runs for 8-9 hours, this zdb command ran for about 90 minutes. On my test system, the result is 44145049 (44.1M) total blocks.

To estimate the number of non-dedup'd (unique) blocks (assuming the average size of dedup'd blocks equals the average size of blocks in the whole pool), use:
    zpool list
Find the dedup ratio. On my test system, it is 2.24x. Divide the total blocks by the dedup ratio to find the number of non-dedup'd (unique) blocks.

In my test system:
    44145049 total blocks / 2.24 dedup ratio = 19707611 (19.7M) approx non-dedup'd (unique) blocks

Then multiply by the size of a DDT entry.
    19707611 * 376 = 7410061736 bytes = 7G total DDT size

----------- To calculate size of ARC/L2ARC references -----------

Each reference to an L2ARC entry requires an entry in ARC (ram). This is another fixed size, which varies by platform. You can find it with the command:
    echo ::sizeof arc_buf_hdr_t | mdb -k
On my test system, it is 0xb0, which is 176 bytes.

We need to know the average block size in the pool, to estimate the number of blocks that will fit into L2ARC. Find the amount of space ALLOC in the pool:
    zpool list
Divide by the number of non-dedup'd (unique) blocks in the pool to find the average block size. On my test system:
    790G / 19707611 = 42K average block size

Remember: If your L2ARC were only caching average-size blocks, then the payload ratio of L2ARC vs ARC would be excellent. On my test system, every 42K of L2ARC would require 176 bytes of ARC (a ratio of 244x). This would result in negligible ARC memory consumption. But since your DDT can be pushed out of ARC into L2ARC, you get a really bad ratio of L2ARC vs ARC memory consumption. On my test system, every 376-byte DDT entry in L2ARC consumes 176 bytes of ARC (a ratio of 2.1x). Yes, it is approximately possible to have the complete DDT present in ARC and L2ARC, thus consuming tons of ram.

Remember disk mfgrs use base-10. So my 32G SSD is only 30G base-2 (32,000,000,000 / 1024/1024/1024).

So I have 30G L2ARC, and the first 7G may be consumed by DDT. This leaves 23G remaining to be used for average-sized blocks. The ARC consumed to reference the DDT in L2ARC is 176/376 * DDT size. On my test system this is 176/376 * 7G = 3.3G.

Take the remaining size of your L2ARC and divide by average block size, to get the number of average-size blocks the L2ARC can hold. On my test system:
    23G / 42K = 574220 average-size blocks in L2ARC
Multiply by the ARC size of an L2ARC reference. On my test system:
    574220 * 176 = 101062720 bytes = 96MB ARC consumed to reference the average-size blocks in L2ARC

So the total ARC consumption to hold L2ARC references on my test system is 3.3G + 96M ~= 3.4G.

----------- To calculate total ram needed -----------

And finally - the max size the ARC is allowed to grow is a constant that varies by platform. On my system, it is 80% of system ram. You can find this value using the command:
    kstat -p zfs::arcstats:c_max
Divide by your total system memory to find the ratio. Assuming the ratio is 4/5, it means you need to buy 5/4 the amount of calculated ram to satisfy all your requirements.

So the end result is:
On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.)
On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.)
We calculated that I need 7G for DDT and 3.4G for L2ARC. That is 10.4G. Multiply by 5/4 and it means I need 13G.
My system needs to be built with at least 8G + 13G = 21G.
Of this, 20% (4.2G) is more than enough to run the OS and processes, while 80% (16.8G) is available for ARC. Of the 16.8G ARC, the DDT and L2ARC references will consume 10.4G, which leaves 6.4G for "normal" ARC caching.
These numbers are all fuzzy. Anything from 16G to 24G might be reasonable.

That's it. I'm done.

P.S. I'll just throw this out there: It is my personal opinion that you probably won't have the whole DDT in ARC and L2ARC at the same time. Because the L2ARC is populated from the soon-to-expire list of the ARC, it seems unlikely that all the DDT entries will get into ARC, then onto the soon-to-expire list, and then be pulled back into ARC and stay there. The above calculation is a sort of worst case. I think the following is likely to be a more realistic actual case:

Personally, I would model the ARC memory consumption of the L2ARC entries using the average block size of the data pool, and just neglect the DDT entries in the L2ARC. Well ... inflate some. Say 10% of the DDT is in the L2ARC and the ARC at the same time. I'm making up this number from thin air.

My revised end result is:
On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.)
On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.)
We calculated that I need 7G for DDT and (96M + 10% of 3.3G = 430M) for L2ARC. Multiply by 5/4 and it means I need 7.5G * 1.25 = 9.4G.
My system needs to be built with at least 8G + 9.4G = 17.4G.
Of this, 20% (3.5G) is more than enough to run the OS and processes, while 80% (13.9G) is available for ARC. Of the 13.9G ARC, the DDT and L2ARC references will consume 7.5G, which leaves 6.4G for "normal" ARC caching.
I personally think that's likely to be more accurate in the observable world.
My revised end result is still basically the same: These numbers are all fuzzy. Anything from 16G to 24G might be reasonable.
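[Editor's note: the DDT-sizing arithmetic in the sections above can be strung together into a small script. The following is only a sketch, assuming Solaris-style mdb/zdb/zpool output; the parsed field positions and the trailing "x" on the dedup ratio are assumptions that may need adjusting on other releases, and the zdb step is the same slow, IO-intensive scan noted above.]

    #!/bin/sh
    # Sketch of the DDT sizing estimate described above. Run as root so
    # "mdb -k" works. Output parsing is an assumption; adjust the awk field
    # numbers if your release formats these commands differently.
    POOL=${1:?usage: ddt-estimate poolname}

    # Per-entry sizes, reported in hex by mdb, converted to decimal.
    ddt_hex=`echo ::sizeof ddt_entry_t | mdb -k | awk '{print $NF}'`
    arc_hex=`echo ::sizeof arc_buf_hdr_t | mdb -k | awk '{print $NF}'`
    ddt_entry=`printf '%d' $ddt_hex`
    arc_entry=`printf '%d' $arc_hex`

    # Total block pointers -- slow and IO intensive, as noted above.
    bp_count=`zdb -bb $POOL | awk '/bp count/ {print $3}'`

    # Dedup ratio, printed like "2.24x"; strip the trailing x.
    ratio=`zpool list -H -o dedupratio $POOL | sed 's/x$//'`

    # Approximate unique (non-dedup'd) blocks and total DDT size.
    unique=`echo "$bp_count / $ratio" | bc`
    ddt_bytes=`echo "$unique * $ddt_entry" | bc`

    echo "DDT entry size:         $ddt_entry bytes"
    echo "ARC header size:        $arc_entry bytes"
    echo "Approx unique blocks:   $unique"
    echo "Approx total DDT size:  $ddt_bytes bytes"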
Erik Trimble
2011-May-05 03:45 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Good summary, Ned. A couple of minor corrections.

On 5/4/2011 7:56 PM, Edward Ned Harvey wrote:
> And finally - the max size the ARC is allowed to grow is a constant that varies by platform. On my system, it is 80% of system ram. You can find this value using the command:
>     kstat -p zfs::arcstats:c_max
> Divide by your total system memory to find the ratio. Assuming the ratio is 4/5, it means you need to buy 5/4 the amount of calculated ram to satisfy all your requirements.

Using the standard c_max value of 80%, remember that this is 80% of the TOTAL system RAM, including the RAM normally dedicated to other purposes. So long as the total amount of RAM you expect to dedicate to ARC usage (for all ZFS uses, not just dedup) is less than 4 times that of all other RAM consumption, you don't need to "overprovision".

> P.S. I'll just throw this out there: It is my personal opinion that you probably won't have the whole DDT in ARC and L2ARC at the same time. Because the L2ARC is populated from the soon-to-expire list of the ARC, it seems unlikely that all the DDT entries will get into ARC, then onto the soon-to-expire list, and then be pulled back into ARC and stay there. The above calculation is a sort of worst case. I think the following is likely to be a more realistic actual case:

There is a *very* low probability that a DDT entry will exist in both the ARC and L2ARC at the same time. That is, such a condition will occur ONLY in the very short period of time when the DDT entry is being migrated from the ARC to the L2ARC. Each DDT entry is tracked separately, so each can be migrated from ARC to L2ARC as needed. Any entry that is migrated back from L2ARC into ARC is considered "stale" data in the L2ARC, and thus is no longer tracked in the ARC's reference table for L2ARC.

As such, you can safely assume the DDT-related memory requirements for the ARC are (maximally) just slightly bigger than the size of the DDT itself. Even then, that is a worst-case scenario; a typical use case would have the actual ARC consumption somewhere closer to the case where the entire DDT is in the L2ARC. Using your numbers, that would mean the worst-case ARC usage would be a bit over 7G, and the more likely case would be somewhere in the 3.3-3.5G range.

> My revised end result is still basically the same: These numbers are all fuzzy. Anything from 16G to 24G might be reasonable.

For total system RAM, you need the GREATER of these two values:

(1) the sum of your OS & application requirements, plus your standard ZFS-related ARC requirements, plus the DDT size
(2) 1.25 times the sum of your standard ARC needs and the DDT size

Redoing your calculations based on my adjustments:

(a) worst case scenario is that you need 7GB for dedup-related ARC requirements
(b) you presume to need 8GB for standard ARC caching not related to dedup
(c) your system needs 1GB for basic operation

According to those numbers:

Case #1: 1 + 8 + 7 = 16GB
Case #2: 1.25 * (8 + 7) =~ 19GB

Thus, you should have 19GB of RAM in your system, with 16GB being a likely reasonable amount under most conditions (e.g. typical dedup ARC size is going to be ~3.5G, not the 7G maximum used above).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
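[Editor's note: a sketch of the "greater of two cases" rule above, with the thread's example figures plugged in as placeholders rather than measurements.]

    #!/bin/sh
    # Sizing rule sketch: total RAM is the greater of
    #   (1) OS + standard ARC needs + DDT size
    #   (2) 1.25 * (standard ARC needs + DDT size)
    # Values in GB; the numbers below are the example figures from this thread.
    os_gb=1         # OS and application requirements
    arc_gb=8        # standard (non-dedup) ARC needs
    ddt_gb=7        # estimated DDT size

    case1=`echo "$os_gb + $arc_gb + $ddt_gb" | bc`
    case2=`echo "1.25 * ($arc_gb + $ddt_gb)" | bc`

    echo "Case 1 (OS + ARC + DDT):     ${case1} GB"
    echo "Case 2 (1.25 * (ARC + DDT)): ${case2} GB"
    # Provision the larger of the two values.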
Edward Ned Harvey
2011-May-05 13:45 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: Erik Trimble [mailto:erik.trimble at oracle.com]
>
> Using the standard c_max value of 80%, remember that this is 80% of the TOTAL system RAM, including the RAM normally dedicated to other purposes. So long as the total amount of RAM you expect to dedicate to ARC usage (for all ZFS uses, not just dedup) is less than 4 times that of all other RAM consumption, you don't need to "overprovision".

Correct, usually you don't need to overprovision for the sake of ensuring enough ram is available for the OS and processes. But you do need to overprovision 25% if you want to increase the size of your usable ARC without reducing the amount of ARC you currently have in the system being used to cache other files etc.

> Any entry that is migrated back from L2ARC into ARC is considered "stale" data in the L2ARC, and thus is no longer tracked in the ARC's reference table for L2ARC.

Good news. I didn't know that. I thought the L2ARC was still valid, even if something was pulled back into ARC.

So there are two useful models:
(a) The upper bound: The whole DDT is in ARC, and the whole L2ARC is filled with average-size blocks.
or
(b) The lower bound: The whole DDT is in L2ARC, and all the rest of the L2ARC is filled with average-size blocks. ARC requirements are based only on L2ARC references.
The actual usage will be something between (a) and (b) ... and the actual is probably closer to (b).

In my test system:

(a) (upper bound)
On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.)
On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.)
I need 7G for DDT, and I have 748982 average-size blocks in L2ARC, which means 131820832 bytes = 125M or 0.1G for L2ARC.
I really just need to plan for 7.1G ARC usage.
Multiply by 5/4 and it means I need 8.875G system ram.
My system needs to be built with at least 8G + 8.875G = 16.875G.

(b) (lower bound)
On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.)
On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.)
I need 0G for DDT (because it's in L2ARC), and I need 3.4G ARC to hold all the L2ARC references, including the DDT in L2ARC.
So I really just need to plan for 3.4G ARC for my L2ARC references.
Multiply by 5/4 and it means I need 4.25G system ram.
My system needs to be built with at least 8G + 4.25G = 12.25G.

Thank you for your input, Erik. Previously I would have only been comfortable with 24G in this system, because I was calculating a need for significantly higher than 16G. But now, what we're calling the upper bound is just *slightly* higher than 16G, while the lower bound and most likely actual figure is significantly lower than 16G. So in this system, I would be comfortable running with 16G. But I would be even more comfortable running with 24G. ;-)
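[Editor's note: the two bounding models above reduce to a few lines of arithmetic. A sketch follows, with this system's figures as placeholder inputs to replace with your own measurements.]

    #!/bin/sh
    # Sketch of the upper/lower bound ARC sizing models above.
    ddt_gb=7          # total DDT size, in GB
    l2arc_gb=30       # usable L2ARC, in GB
    avg_block=43008   # average block size, in bytes (~42K)
    arc_hdr=176       # ARC bytes per L2ARC reference
    ddt_entry=376     # bytes per DDT entry

    # (a) Upper bound: whole DDT in ARC, whole L2ARC holds average-size blocks.
    upper=`echo "scale=2; $ddt_gb + ($l2arc_gb * 1024^3 / $avg_block) * $arc_hdr / 1024^3" | bc`

    # (b) Lower bound: whole DDT sits in L2ARC; ARC only holds L2ARC headers.
    ddt_refs_gb=`echo "scale=2; ($ddt_gb * 1024^3 / $ddt_entry) * $arc_hdr / 1024^3" | bc`
    rest_gb=`echo "scale=2; (($l2arc_gb - $ddt_gb) * 1024^3 / $avg_block) * $arc_hdr / 1024^3" | bc`
    lower=`echo "scale=2; $ddt_refs_gb + $rest_gb" | bc`

    echo "(a) upper bound ARC consumption: ${upper} GB"
    echo "(b) lower bound ARC consumption: ${lower} GB"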
Karl Wagner
2011-May-05 17:20 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
So there's an ARC entry referencing each individual DDT entry in the L2ARC?! I had made the assumption that DDT entries would be grouped into at least minimum-block-sized groups (8k?), which would have led to a much more reasonable ARC requirement.

It seems like a bad design to me, which leads to dedup only being usable by those prepared to spend a LOT of dosh ... which may as well go into more storage. (I know there are other benefits too, but that's my opinion.)

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Edward Ned Harvey
2011-May-06 03:33 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: Karl Wagner [mailto:karl at mouse-hole.com]
>
> So there's an ARC entry referencing each individual DDT entry in the L2ARC?! I had made the assumption that DDT entries would be grouped into at least minimum-block-sized groups (8k?), which would have led to a much more reasonable ARC requirement.
>
> It seems like a bad design to me, which leads to dedup only being usable by those prepared to spend a LOT of dosh ... which may as well go into more storage. (I know there are other benefits too, but that's my opinion.)

The whole point of the DDT is that it needs to be structured and really fast to search. So no, you're not going to consolidate it into an unstructured memory block as you said. You pay the memory consumption price for the sake of performance. Yes, it consumes a lot of ram, but don't call it a "bad design." It's just a different design than what you expected, because what you expected would hurt performance while consuming less ram.

And we're not talking crazy dollars here, so your emphasis on a LOT of dosh seems exaggerated. I just spec'd out a system where upgrading from 12 to 24G of ram to enable dedup effectively doubled the storage capacity of the system, and that upgrade cost the same as one of the disks. (This is a 12-disk system.) So it was actually a 6x cost reducer, at least. It all depends on how much mileage you get out of the dedup. Your mileage may vary.
Richard Elling
2011-May-06 03:44 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On May 4, 2011, at 7:56 PM, Edward Ned Harvey wrote:

> This is how to (reasonably) accurately estimate the hypothetical ram requirements to hold the complete data deduplication tables (DDT) and L2ARC references in ram. Please note both the DDT and L2ARC references can be evicted from memory according to system policy, whenever the system decides some other data is more valuable to keep. So following this guide does not guarantee that the whole DDT will remain in ARC or L2ARC. But it's a good start.

As the size of the data grows, the need to have the whole DDT in RAM or L2ARC decreases. With one notable exception: destroying a dataset or snapshot requires the DDT entries for the destroyed blocks to be updated. This is why people can go for months or years and not see a problem, until they try to destroy a dataset.

> There is one DDT entry per non-dedup'd (unique) block in the zpool.

The workloads which are nicely dedupable tend to not have unique blocks. So this is another way of saying, "if your workload isn't dedupable, don't bother with deduplication." For years now we have been trying to convey this message. One way to help convey the message is...

> Be aware that you cannot reliably estimate #blocks by counting #files. You can find the total number of blocks, including dedup'd blocks, in your pool with this command:
>     zdb -bb poolname | grep 'bp count'

Ugh. A better method is to simulate dedup on existing data:
    zdb -S poolname
or measure dedup efficacy:
    zdb -DD poolname
which offer similar tabular analysis.

> Find the dedup ratio. On my test system, it is 2.24x. Divide the total blocks by the dedup ratio to find the number of non-dedup'd (unique) blocks.

Or just count the unique and non-unique blocks with:
    zdb -D poolname

> Then multiply by the size of a DDT entry.
>     19707611 * 376 = 7410061736 bytes = 7G total DDT size

A minor gripe about zdb -D output is that it doesn't do the math.

> Each reference to an L2ARC entry requires an entry in ARC (ram). This is another fixed size, which varies by platform. You can find it with the command:
>     echo ::sizeof arc_buf_hdr_t | mdb -k
> On my test system, it is 0xb0, which is 176 bytes.

Better yet, without need for mdb privilege, measure the current L2ARC header size in use. Normal user accounts can:
    kstat -p zfs::arcstats:hdr_size
    kstat -p zfs::arcstats:l2_hdr_size
arcstat will allow you to easily track this over time.

> But since your DDT can be pushed out of ARC into L2ARC, you get a really bad ratio of L2ARC vs ARC memory consumption. On my test system, every 376-byte DDT entry in L2ARC consumes 176 bytes of ARC (a ratio of 2.1x). Yes, it is approximately possible to have the complete DDT present in ARC and L2ARC, thus consuming tons of ram.

This is a good thing for those cases when you need to quickly reference large numbers of DDT entries.

> And finally - the max size the ARC is allowed to grow is a constant that varies by platform. On my system, it is 80% of system ram.

It is surely not 80% of RAM unless you have 5GB RAM or by luck. The algorithm for c_max is well documented as starting with the larger of:
    7/8 of physmem
or
    physmem - 1GB
This value adjusts as memory demands from other processes are satisfied.

> So the end result is:
> On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.)

This is a little bit trickier to understand, which is why we have:
    echo ::memstat | mdb -k

> P.S. I'll just throw this out there: It is my personal opinion that you probably won't have the whole DDT in ARC and L2ARC at the same time.

yep, that is a safe bet.

> Personally, I would model the ARC memory consumption of the L2ARC entries using the average block size of the data pool, and just neglect the DDT entries in the L2ARC. Well ... inflate some. Say 10% of the DDT is in the L2ARC and the ARC at the same time. I'm making up this number from thin air.

Much better to just measure it. However, measurements are likely to not be appropriate for capacity planning purposes :-(

> My revised end result is still basically the same: These numbers are all fuzzy. Anything from 16G to 24G might be reasonable.

I think these RAM numbers are reasonable first guesses. Many of the systems I've seen deployed this year are 48 to 96 GB RAM. L2ARC devices are 250 to 600 GB.
 -- richard
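[Editor's note: a small sketch of the measurement approach suggested above - sample the header kstats over time rather than deriving sizes from struct definitions. It assumes only the two arcstats statistics named in the post exist on your release.]

    #!/bin/sh
    # Periodically sample ARC and L2ARC header memory, as suggested above.
    # Uses only the two kstats named in the post; the interval is arbitrary.
    while :; do
        date
        kstat -p zfs::arcstats:hdr_size
        kstat -p zfs::arcstats:l2_hdr_size
        sleep 60
    done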
Frank Van Damme
2011-May-06 08:31 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 06-05-11 05:44, Richard Elling wrote:
> As the size of the data grows, the need to have the whole DDT in RAM or L2ARC decreases. With one notable exception: destroying a dataset or snapshot requires the DDT entries for the destroyed blocks to be updated. This is why people can go for months or years and not see a problem, until they try to destroy a dataset.

So what you are saying is "you with your ram-starved system, don't even try to start using snapshots on that system". Right?

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Casper.Dik at oracle.com
2011-May-06 08:37 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> On 06-05-11 05:44, Richard Elling wrote:
>> As the size of the data grows, the need to have the whole DDT in RAM or L2ARC decreases. With one notable exception: destroying a dataset or snapshot requires the DDT entries for the destroyed blocks to be updated. This is why people can go for months or years and not see a problem, until they try to destroy a dataset.
>
> So what you are saying is "you with your ram-starved system, don't even try to start using snapshots on that system". Right?

I think it's more like "don't use dedup when you don't have RAM".

(It is not possible to not use snapshots in Solaris; they are used for everything.)

Casper
Erik Trimble
2011-May-06 10:24 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 1:37 AM, Casper.Dik at oracle.com wrote:
>> So what you are saying is "you with your ram-starved system, don't even try to start using snapshots on that system". Right?
>
> I think it's more like "don't use dedup when you don't have RAM".
>
> (It is not possible to not use snapshots in Solaris; they are used for everything.)
>
> Casper

Casper and Richard are correct - RAM starvation seriously impacts snapshot or dataset deletion when a pool has dedup enabled. The reason behind this is that ZFS needs to scan the entire DDT to check whether it can actually delete each block in the to-be-deleted snapshot/dataset, or whether it just needs to update the dedup reference count. If it can't store the entire DDT in either the ARC or L2ARC, it will be forced to do considerable I/O to disk as it brings in the appropriate DDT entries. Worst case, insufficient ARC/L2ARC space can increase deletion times by many orders of magnitude - e.g. days, weeks, or even months to do a deletion.

If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Tomas Ögren
2011-May-06 11:17 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 06 May, 2011 - Erik Trimble sent me these 1,8K bytes:

> If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so.

... or hours. We've had problems on an old raidz2 where a recursive snapshot creation over ~800 filesystems could take quite some time, up until the sata-scsi disk box ate the pool. Now we're using raid10 on a scsi box, and it takes 3-15 minutes or so, during which sync writes (NFS) are almost unusable. Using 2 fast usb sticks as l2arc, waiting for a Vertex2EX and a Vertex3 to arrive for ZIL & L2ARC testing.

IO to the filesystems is quite low (50 writes, 500k data per sec average), but snapshot times go way up during backups.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Edward Ned Harvey
2011-May-06 14:37 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
>> ----------- To calculate size of DDT -----------
>
> zdb -S poolname

Look at total blocks allocated. It is rounded, and uses a suffix like "K, M, G", but it's in decimal (powers of 10) notation, so you have to remember that... So I prefer the zdb -D method below, but this works too. Total blocks allocated * mem requirement per DDT entry, and you have the mem requirement to hold the whole DDT in ram.

> zdb -DD poolname

This just gives you the -S output and the -D output all in one go. So I recommend using -DD, and base your calculations on #duplicate and #unique, as mentioned below. Consider the histogram to be informational.

> zdb -D poolname

It gives you a number of duplicate and a number of unique blocks. Add them to get the total number of blocks. Multiply by the mem requirement per DDT entry, and you have the mem requirement to hold the whole DDT in ram.
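[Editor's note: that last recipe is easy to script once the two counts are in hand. A sketch; the counts below are placeholders to copy in by hand, since the exact zdb -D output format varies between releases.]

    #!/bin/sh
    # Sketch: DDT RAM requirement from the counts reported by "zdb -D pool".
    # Both counts are placeholders; paste in your own zdb -D numbers.
    unique=19707611        # unique entries reported by zdb -D
    duplicate=4000000      # duplicate entries reported by zdb -D
    ddt_entry=376          # ::sizeof ddt_entry_t in bytes, from mdb -k

    total=`echo "$unique + $duplicate" | bc`
    bytes=`echo "$total * $ddt_entry" | bc`
    echo "Total DDT entries:          $total"
    echo "RAM to hold the whole DDT:  $bytes bytes"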
Edward Ned Harvey
2011-May-06 14:40 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
>> zdb -DD poolname
> This just gives you the -S output and the -D output all in one go. So I

Sorry, zdb -DD only works for pools that are already dedup'd. If you want to get a measurement for a pool that is not already dedup'd, you have to use -S.
One of the quoted participants is Richard Elling, the other is Edward Ned Harvey, but my quoting was screwed up enough that I don't know which is which. Apologies.

>>> zdb -DD poolname
>> This just gives you the -S output and the -D output all in one go. So I
>
> Sorry, zdb -DD only works for pools that are already dedup'd.
> If you want to get a measurement for a pool that is not already dedup'd, you have to use -S.

And since zdb -S runs for 2 hours and dumps core (without results), the correct answer remains:
    zdb -bb poolname | grep 'bp count'
as was given in the summary. The theoretical output of "zdb -S" may be superior if you have a version that works, but I haven't seen anyone mention on-list which version(s) it is, or if/how it can be obtained, short of recompiling it yourself.
Richard Elling
2011-May-07 00:46 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On May 6, 2011, at 3:24 AM, Erik Trimble <erik.trimble at oracle.com> wrote:

> On 5/6/2011 1:37 AM, Casper.Dik at oracle.com wrote:
>> I think it's more like "don't use dedup when you don't have RAM".
>>
>> (It is not possible to not use snapshots in Solaris; they are used for everything.)

:-)

> Casper and Richard are correct - RAM starvation seriously impacts snapshot or dataset deletion when a pool has dedup enabled. The reason behind this is that ZFS needs to scan the entire DDT to check whether it can actually delete each block in the to-be-deleted snapshot/dataset, or whether it just needs to update the dedup reference count.

AIUI, the issue is not that the DDT is scanned; it is an AVL tree for a reason. The issue is that each reference update means that one small bit of data is changed. If the reference is not already in ARC, then a small, probably random read is needed. If you have a typical consumer disk, especially a "green" disk, and have not tuned zfs_vdev_max_pending, then that itty bitty read can easily take more than 100 milliseconds(!) Consider that you can have thousands or millions of reference updates to do during a zfs destroy, and the math gets ugly. This is why fast SSDs make good dedup candidates.

> If it can't store the entire DDT in either the ARC or L2ARC, it will be forced to do considerable I/O to disk as it brings in the appropriate DDT entries. Worst case, insufficient ARC/L2ARC space can increase deletion times by many orders of magnitude - e.g. days, weeks, or even months to do a deletion.

I've never seen months, but I have seen days, especially for low-perf disks.

> If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so.

Yes, perhaps a bit longer for recursive destruction, but everyone here knows recursion is evil, right? :-)
 -- richard
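[Editor's note: a hedged sketch for anyone who wants to inspect the queue-depth tunable mentioned above. The mdb expression assumes your kernel build exports a variable with exactly this name, and the /etc/system line is the conventional way to change it - verify both against your release's documentation first.]

    # Inspect the current value of the tunable mentioned above.
    # Assumes the kernel exports a symbol by this exact name.
    echo zfs_vdev_max_pending/D | mdb -k

    # The usual way to change it persistently is an /etc/system entry
    # such as the following, followed by a reboot:
    #   set zfs:zfs_vdev_max_pending=10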
Erik Trimble
2011-May-07 01:17 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 5:46 PM, Richard Elling wrote:
> On May 6, 2011, at 3:24 AM, Erik Trimble <erik.trimble at oracle.com> wrote:
>> Casper and Richard are correct - RAM starvation seriously impacts snapshot or dataset deletion when a pool has dedup enabled. The reason behind this is that ZFS needs to scan the entire DDT to check whether it can actually delete each block in the to-be-deleted snapshot/dataset, or whether it just needs to update the dedup reference count.
>
> AIUI, the issue is not that the DDT is scanned; it is an AVL tree for a reason. The issue is that each reference update means that one small bit of data is changed. If the reference is not already in ARC, then a small, probably random read is needed. If you have a typical consumer disk, especially a "green" disk, and have not tuned zfs_vdev_max_pending, then that itty bitty read can easily take more than 100 milliseconds(!) Consider that you can have thousands or millions of reference updates to do during a zfs destroy, and the math gets ugly. This is why fast SSDs make good dedup candidates.

Just out of curiosity - I'm assuming that a delete works like this:

(1) find the list of blocks associated with the file to be deleted
(2) using the DDT, find out if any other files are using those blocks
(3) delete/update any metadata associated with the file (dirents, ACLs, etc.)
(4) for each block in the file
    (4a) if the DDT indicates there ARE other files using this block, update the DDT entry to change the refcount
    (4b) if the DDT indicates there AREN'T any other files, move the physical block to the free list, and delete the DDT entry

In a bulk delete scenario (not just snapshot deletion), I'd presume #1 above almost always causes a random I/O request to disk, as all the relevant metadata for every (to-be-deleted) file is unlikely to be stored in ARC. If you can't fit the DDT in ARC/L2ARC, #2 above would require you to pull in the remainder of the DDT info from disk, right? #3 and #4 can be batched up, so they don't hurt that much.

Is that a (roughly) correct deletion methodology? Or can someone give a more accurate view of what's actually going on?

>> If it can't store the entire DDT in either the ARC or L2ARC, it will be forced to do considerable I/O to disk as it brings in the appropriate DDT entries. Worst case, insufficient ARC/L2ARC space can increase deletion times by many orders of magnitude - e.g. days, weeks, or even months to do a deletion.
>
> I've never seen months, but I have seen days, especially for low-perf disks.

I've seen an estimate of 5 weeks for removing a snapshot on a 1TB dedup pool made up of 1 disk. Not an optimal set up. :-)

>> If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so.
>
> Yes, perhaps a bit longer for recursive destruction, but everyone here knows recursion is evil, right? :-)
> -- richard

You, my friend, have obviously never worshipped at the Temple of the Lambda Calculus, nor been exposed to the Holy Writ that is "Structure and Interpretation of Computer Programs" (http://mitpress.mit.edu/sicp/full-text/book/book.html).

I sentence you to a semester of 6.001 problem sets, written by Prof Sussman sometime in the 1980s.

(yes, I went to MIT.)

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Edward Ned Harvey
2011-May-07 13:37 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
New problem: I'm following all the advice I summarized into the OP of this thread, and testing on a test system. (A laptop.) And it's just not working. I am jumping into the dedup performance abyss far, far earlier than predicted...

My test system is a laptop with 1.5G ram, c_min = 150M, c_max = 1.2G. I have just a single sata 7.2krpm hard drive, no SSD. Before I start, I have 1G free ram (according to top). According to everything we've been talking about, I expect roughly 1G divided by 376 bytes = 2855696 (2.8M) blocks in my pool before I start running out of ram to hold the DDT and performance degrades.

I create a pool. Enable dedup. Set recordsize=512. I write a program that will very quickly generate unique non-dedupable data:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int i;
        int numblocks = atoi(argv[1]);  /* Note: Expect one command-line argument integer. */
        FILE *outfile;

        outfile = fopen("junk.file", "w");
        for (i = 0; i < numblocks; i++)
            fprintf(outfile, "%512d", i);  /* each 512-byte record is unique */
        fflush(outfile);
        fclose(outfile);
    }

Disable dedup. Run with a small numblocks. For example:
    time ~/datagenerator 100
Enable dedup and repeat. They both complete instantly. Repeat with a higher numblocks... 1000, 10000, 100000... Repeat until you find the point where performance with dedup is significantly different from performance without dedup.

See below. Right around 400,000 blocks, dedup is suddenly an order of magnitude slower than without dedup. Times to create the file:

    numblocks   dedup=off   dedup=verify   DDTsize   Filesize
    100000      2.5sec      1.2sec         36 MB     49 MB
    110000      1.4sec      1.3sec         39 MB     54 MB
    120000      1.4sec      1.5sec         43 MB     59 MB
    130000      1.5sec      1.8sec         47 MB     63 MB
    140000      1.6sec      1.6sec         50 MB     68 MB
    150000      4.8sec      7.0sec         54 MB     73 MB
    160000      4.8sec      7.6sec         57 MB     78 MB
    170000      2.1sec      2.1sec         61 MB     83 MB
    180000      5.2sec      5.6sec         65 MB     88 MB
    190000      6.0sec      10.1sec        68 MB     93 MB
    200000      4.7sec      2.6sec         72 MB     98 MB
    210000      6.8sec      6.7sec         75 MB     103 MB
    220000      6.2sec      18.0sec        79 MB     107 MB
    230000      6.5sec      16.7sec        82 MB     112 MB
    240000      8.8sec      10.4sec        86 MB     117 MB
    250000      8.2sec      17.0sec        90 MB     122 MB
    260000      8.4sec      17.5sec        93 MB     127 MB
    270000      6.8sec      19.2sec        97 MB     132 MB
    280000      13.1sec     16.5sec        100 MB    137 MB
    290000      9.4sec      73.1sec        104 MB    142 MB
    300000      8.5sec      7.7sec         108 MB    146 MB
    310000      8.5sec      7.7sec         111 MB    151 MB
    320000      8.6sec      11.9sec        115 MB    156 MB
    330000      9.3sec      33.5sec        118 MB    161 MB
    340000      8.3sec      54.3sec        122 MB    166 MB
    350000      8.3sec      50.0sec        126 MB    171 MB
    360000      9.3sec      109.0sec       129 MB    176 MB
    370000      9.5sec      12.5sec        133 MB    181 MB
    380000      10.1sec     28.6sec        136 MB    186 MB
    390000      10.2sec     14.6sec        140 MB    190 MB
    400000      10.7sec     136.7sec       143 MB    195 MB
    410000      11.4sec     116.6sec       147 MB    200 MB
    420000      11.5sec     220.9sec       151 MB    205 MB
    430000      11.7sec     151.3sec       154 MB    210 MB
    440000      12.7sec     144.7sec       158 MB    215 MB
    450000      12.0sec     202.1sec       161 MB    220 MB
    460000      13.9sec     134.7sec       165 MB    225 MB
    470000      12.2sec     127.6sec       169 MB    229 MB
    480000      13.1sec     122.7sec       172 MB    234 MB
    490000      13.1sec     106.3sec       176 MB    239 MB
    500000      15.8sec     174.6sec       179 MB    244 MB
    550000      14.2sec     216.6sec       197 MB    269 MB
    600000      15.6sec     294.2sec       215 MB    293 MB
    650000      16.7sec     332.8sec       233 MB    317 MB
    700000      19.0sec     269.6sec       251 MB    342 MB
    750000      20.1sec     472.0sec       269 MB    366 MB
    800000      21.0sec     465.6sec       287 MB    391 MB
Edward Ned Harvey
2011-May-07 13:47 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> See below. Right around 400,000 blocks, dedup is suddenly an order of magnitude slower than without dedup.
>
> 400000      10.7sec     136.7sec       143 MB    195 MB
> 800000      21.0sec     465.6sec       287 MB    391 MB

The interesting thing is - in all these cases, the complete DDT and the complete data file itself should fit entirely in ARC comfortably. So it makes no sense for performance to be so terrible at this level.

So I need to start figuring out exactly what's going on. Unfortunately I don't know how to do that very well. I'm looking for advice from anyone - how to poke around and see how much memory is being consumed for what purposes. I know how to look up c_min and c and c_max... But that didn't do me much good. The actual value for c barely changes at all over time... Even when I rm the file, c does not change immediately.

All the other metrics from kstat ... have less than obvious names ... so I don't know what to look for...
Erik Trimble
2011-May-08 00:18 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/7/2011 6:47 AM, Edward Ned Harvey wrote:
>> See below. Right around 400,000 blocks, dedup is suddenly an order of magnitude slower than without dedup.
>>
>> 400000      10.7sec     136.7sec       143 MB    195 MB
>> 800000      21.0sec     465.6sec       287 MB    391 MB
>
> The interesting thing is - in all these cases, the complete DDT and the complete data file itself should fit entirely in ARC comfortably. So it makes no sense for performance to be so terrible at this level.
>
> So I need to start figuring out exactly what's going on. Unfortunately I don't know how to do that very well. I'm looking for advice from anyone - how to poke around and see how much memory is being consumed for what purposes. I know how to look up c_min and c and c_max... But that didn't do me much good. The actual value for c barely changes at all over time... Even when I rm the file, c does not change immediately.
>
> All the other metrics from kstat ... have less than obvious names ... so I don't know what to look for...

Some minor issues that might affect the above:

(1) I'm assuming you run your script repeatedly in the same pool, without deleting the pool. If that is the case, that means that a run of X+1 should dedup completely with the run of X. E.g. a run with 120000 blocks will dedup the first 110000 blocks with the prior run of 110000.

(2) Can you NOT enable "verify"? Verify *requires* a disk read before writing for any potential dedup-able block. If case #1 above applies, then by turning on dedup, you *rapidly* increase the amount of disk I/O you require on each subsequent run. E.g. the run of 100000 requires no disk I/O due to verify, but the run of 110000 requires 100000 I/O requests, while the run of 120000 requires 110000 requests, etc. This will skew your results as the ARC buffering of file info changes over time.

(3) fflush is NOT the same as fsync. If you're running the script in a loop, it's entirely possible that ZFS hasn't completely committed things to disk yet, which means that you get I/O requests to flush out the ARC write buffer in the middle of your runs. Honestly, I'd do the following for benchmarking:

    i=0
    while [ $i -lt 80 ];
    do
        j=$[100000 + ($i * 10000)]
        ./run_your_script $j
        sync
        sleep 10
        i=$[$i+1]
    done

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Garrett D'Amore
2011-May-08 01:21 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Just another data point. The DDT is considered metadata, and by default the ARC will not allow more than 1/4 of it to be used for metadata. Are you still sure it fits?

Erik Trimble <erik.trimble at oracle.com> wrote:
> On 5/7/2011 6:47 AM, Edward Ned Harvey wrote:
>> The interesting thing is - in all these cases, the complete DDT and the complete data file itself should fit entirely in ARC comfortably. So it makes no sense for performance to be so terrible at this level.
Edward Ned Harvey
2011-May-08 14:31 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: Erik Trimble [mailto:erik.trimble at oracle.com] > > (1) I''m assuming you run your script repeatedly in the same pool, > without deleting the pool. If that is the case, that means that a run of > X+1 should dedup completely with the run of X. E.g. a run with 120000 > blocks will dedup the first 110000 blocks with the prior run of 110000.I rm the file in between each run. So if I''m not mistaken, no dedup happens on consecutive runs based on previous runs.> (2) can you NOT enable "verify" ? Verify *requires* a disk read before > writing for any potential dedup-able block.Every block is unique. There is never anything to verify because there is never a checksum match. Why would I test dedup on non-dedupable data? You can see it''s a test. In any pool where you want to enable dedup, you''re going to have a number of dedupable blocks, and a number of non-dedupable blocks. The memory requirement is based on number of allocated blocks in the pool. So I want to establish an upper and lower bound for dedup performance. I am running some tests on entirely duplicate data to see how fast it goes, and also running the described test on entirely non-duplicate data... With enough ram and without enough ram... As verification that we know how to predict the lower bound. So far, I''m failing to predict the lower bound, which is why I''ve come here to talk about it. I''ve done a bunch of tests with dedup=verify or dedup=sha256. Results the same. But I didn''t do that for this particular test. I''ll run with just sha256 if you would still like me to after what I just said.> (3) fflush is NOT the same as fsync. If you''re running the script in a > loop, it''s entirely possible that ZFS hasn''t completely committed things > to disk yet,Oh. Well I''ll change that - but - I actually sat here and watched the HDD light, so even though I did that wrong, I can say the hard drive finished and became idle in between each run. (I stuck sleep statements in between each run specifically so I could watch the HDD light.)> i=0 > while [i -lt 80 ]; > do > j = $[100000 + ( 1 * 10000)] > ./run_your_script j > sync > sleep 10 > i = $[$i+1] > doneOh, yeah. That''s what I did, minus the sync command. I''ll make sure to include that next time. And I used "time ~/datagenerator" Incidentally, does fsync() and sync return instantly or wait? Cuz "time sync" might product 0 sec every time even if there were something waiting to be flushed to disk.
Edward Ned Harvey
2011-May-08 14:37 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: Garrett D''Amore [mailto:garrett at nexenta.com] > > Just another data point. The ddt is considered metadata, and by default the > arc will not allow more than 1/4 of it to be used for metadata. Are you still > sure it fits?That''s interesting. Is it tunable? That could certainly start to explain why my arc size arcstats:c never grew to any size I thought seemed reasonable... And in fact it grew larger when I had dedup disabled. Smaller when dedup was enabled. "Weird," I thought. Seems like a really important factor to mention in this summary.
Toby Thain
2011-May-08 14:48 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 06/05/11 9:17 PM, Erik Trimble wrote:> On 5/6/2011 5:46 PM, Richard Elling wrote: >> ... >> Yes, perhaps a bit longer for recursive destruction, but everyone here >> knows recursion is evil, right? :-) >> -- richard> You, my friend, have obviously never worshipped at the Temple of the > Lamba Calculus, nor been exposed to the Holy Writ that is "Structure and > Interpretation of Computer Programs"As someone who is studying Scheme and SICP, I had no trouble seeing that Richard was not being serious :)> (http://mitpress.mit.edu/sicp/full-text/book/book.html). > > I sentence you to a semester of 6.001 problem sets, written by Prof > Sussman sometime in the 1980s. >--Toby> (yes, I went to MIT.) >
Toby Thain
2011-May-08 14:51 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 08/05/11 10:31 AM, Edward Ned Harvey wrote:>... > Incidentally, does fsync() and sync return instantly or wait? Cuz "time > sync" might product 0 sec every time even if there were something waiting to > be flushed to disk.The semantics need to be synchronous. Anything else would be a horrible bug. --Toby> > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
Edward Ned Harvey
2011-May-08 14:56 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Edward Ned Harvey > > That could certainly start to explain why my > arc size arcstats:c never grew to any size I thought seemed reasonable...Also now that I''m looking closer at arcstats, it seems arcstats:size might be the appropriate measure, not arcstats:c
Garrett D''Amore
2011-May-08 15:05 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
It is tunable, I don''t remember the exact tunable name... Arc_metadata_limit or some such. -- Garrett D''Amore On May 8, 2011, at 7:37 AM, "Edward Ned Harvey" <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:>> From: Garrett D''Amore [mailto:garrett at nexenta.com] >> >> Just another data point. The ddt is considered metadata, and by default the >> arc will not allow more than 1/4 of it to be used for metadata. Are you still >> sure it fits? > > That''s interesting. Is it tunable? That could certainly start to explain why my arc size arcstats:c never grew to any size I thought seemed reasonable... And in fact it grew larger when I had dedup disabled. Smaller when dedup was enabled. "Weird," I thought. > > Seems like a really important factor to mention in this summary.
Edward Ned Harvey
2011-May-08 15:15 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: Garrett D''Amore [mailto:garrett at nexenta.com] > > It is tunable, I don''t remember the exact tunable name...Arc_metadata_limit> or some such.There it is: echo "::arc" | sudo mdb -k | grep meta_limit arc_meta_limit = 286 MB Looking at my chart earlier in this discussion, it seems like this might not be the cause of the problem. In my absolute largest test that I ran, my supposed (calculated) DDT size was 287MB, so this performance abyss was definitely happening at sizes smaller than the arc_meta_limit. But I''ll go tune and test with this knowledge, just to be sure.
Edward Ned Harvey
2011-May-08 15:20 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Edward Ned Harvey > > But I''ll go tune and test with this knowledge, just to be sure.BTW, here''s how to tune it: echo "arc_meta_limit/Z 0x30000000" | sudo mdb -kw echo "::arc" | sudo mdb -k | grep meta_limit arc_meta_limit = 768 MB
Andrew Gabriel
2011-May-08 15:22 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Toby Thain wrote:> On 08/05/11 10:31 AM, Edward Ned Harvey wrote: > >> ... >> Incidentally, does fsync() and sync return instantly or wait? Cuz "time >> sync" might product 0 sec every time even if there were something waiting to >> be flushed to disk. >> > > The semantics need to be synchronous. Anything else would be a horrible bug. >sync(2) is not required to be synchronous. I believe that for ZFS it is synchronous, but for most other filesystems, it isn''t (although a second sync will block until the actions resulting from a previous sync have completed). fsync(3C) is synchronous. -- Andrew Gabriel
Neil Perrin
2011-May-08 16:17 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 05/08/11 09:22, Andrew Gabriel wrote:> Toby Thain wrote: >> On 08/05/11 10:31 AM, Edward Ned Harvey wrote: >> >>> ... >>> Incidentally, does fsync() and sync return instantly or wait? Cuz >>> "time >>> sync" might product 0 sec every time even if there were something >>> waiting to >>> be flushed to disk. >>> >> >> The semantics need to be synchronous. Anything else would be a >> horrible bug. >> > > sync(2) is not required to be synchronous. > I believe that for ZFS it is synchronous...Indeed it is synchronous for zfs. Both sync(1)/sync(2) ensure that all cached data and metadata at the time of the sync(2) call are written out and stable before return from the call. Neil.
Richard Elling
2011-May-08 16:40 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
On May 8, 2011, at 7:56 AM, Edward Ned Harvey wrote:>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >> >> That could certainly start to explain why my >> arc size arcstats:c never grew to any size I thought seemed reasonable... > > > Also now that I''m looking closer at arcstats, it seems arcstats:size might > be the appropriate measure, not arcstats:csize is the current size. c is the target size. c_min and c_max are the target min and max, respectively. The source is well commented and describes the ARC algorithms and kstats in detail. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c What is not known is how well the algorithms work for the various use cases we encounter. That will be an ongoing project for a long time. -- richard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110508/c9117ffd/attachment.html>
Frank Van Damme
2011-May-09 09:19 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Op 08-05-11 17:20, Edward Ned Harvey schreef:>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >> >> But I''ll go tune and test with this knowledge, just to be sure. > > BTW, here''s how to tune it: > > echo "arc_meta_limit/Z 0x30000000" | sudo mdb -kw > > echo "::arc" | sudo mdb -k | grep meta_limit > arc_meta_limit = 768 MBMust. Try. This. Out. Otoh, I struggle to see the difference between arc_meta_limit and arc_meta_max. arc_meta_used = 1807 MB arc_meta_limit = 1275 MB arc_meta_max = 2289 MB Mine is 1275 MB ((6-1)/4) GB, let''s try doubling it... Is this persistent across reboots btw? Or only if you write it in /etc/system or somefile? -- No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Edward Ned Harvey
2011-May-09 12:22 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Frank Van Damme > > Otoh, I struggle to see the difference between arc_meta_limit and > arc_meta_max.Thanks for pointing this out. When I changed meta_limit and re-ran the test, there was no discernable difference. So now I''ll change meta_max and see if it helps...
Edward Ned Harvey
2011-May-09 12:36 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Edward Ned Harvey > > So now I''ll change meta_max and > see if it helps...Oh, know what? Nevermind. I just looked at the source, and it seems arc_meta_max is just a gauge for you to use, so you can know what''s the highest arc_meta_used has ever reached. So the most useful thing for you to do would be to set this to 0 to reset the counter. And then you can start watching it over time.
Frank Van Damme
2011-May-09 13:25 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Op 09-05-11 14:36, Edward Ned Harvey schreef:>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >> >> So now I''ll change meta_max and >> see if it helps... > > Oh, know what? Nevermind. > I just looked at the source, and it seems arc_meta_max is just a gauge for > you to use, so you can know what''s the highest arc_meta_used has ever > reached. So the most useful thing for you to do would be to set this to 0 > to reset the counter. And then you can start watching it over time.Ok good to know - but that confuses me even more since in my previous post my arc_meta_used was bigger than my arc_meta_limit (by about 50%) and now wince I doubled _limit, _used only shrunk by a couple megs. I''d really like to find some way to tell this machine "CACHE MORE METADATA, DAMNIT!" :-) -- No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Edward Ned Harvey
2011-May-09 13:42 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Frank Van Damme > > in my previous > post my arc_meta_used was bigger than my arc_meta_limit (by about 50%)I have the same thing. But as I sit here and run more and more extensive tests on it ... it seems like arc_meta_limit is sort of a soft limit. Or it only checks periodically or something like that. Because although I sometimes see size > limit, and I definitely see max > limit ... When I do bigger and bigger more intensive stuff, the size never grows much more than limit. It always gets knocked back down within a few seconds...
Frank Van Damme
2011-May-09 13:56 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Op 09-05-11 15:42, Edward Ned Harvey schreef:>> > in my previous >> > post my arc_meta_used was bigger than my arc_meta_limit (by about 50%) > I have the same thing. But as I sit here and run more and more extensive > tests on it ... it seems like arc_meta_limit is sort of a soft limit. Or it > only checks periodically or something like that. Because although I > sometimes see size > limit, and I definitely see max > limit ... When I do > bigger and bigger more intensive stuff, the size never grows much more than > limit. It always gets knocked back down within a few seconds...I found a script called arc_summary.pl and look what it says. ARC Size: Current Size: 1734 MB (arcsize) Target Size (Adaptive): 1387 MB (c) Min Size (Hard Limit): 637 MB (zfs_arc_min) Max Size (Hard Limit): 5102 MB (zfs_arc_max) c = 1512 MB c_min = 637 MB c_max = 5102 MB size = 1736 MB ... arc_meta_used = 1735 MB arc_meta_limit = 2550 MB arc_meta_max = 1832 MB There are a dew seconds between running the script and ::arc | mdb -k, but it seems that it just doesn''t use more arc than 1734 or so MB, and that nearly all of it is used for metadata. (I set primarycache=metadata to my data fs, so I deem it logical). So the goal seems shifted to trying to enlarge the arc size (what''s it doing with the other memory??? I have close to no processes running.) -- No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Edward Ned Harvey
2011-May-10 04:56 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Edward Ned Harvey > > BTW, here''s how to tune it: > > echo "arc_meta_limit/Z 0x30000000" | sudo mdb -kw > > echo "::arc" | sudo mdb -k | grep meta_limit > arc_meta_limit = 768 MBWell ... I don''t know what to think yet. I''ve been reading these numbers for like an hour, finding interesting things here and there, but nothing to really solidly point my finger at. The one thing I know for sure... The free mem drops at an unnatural rate. Initially the free mem disappears at a rate approx 2x faster than the sum of file size and metadata combined. Meaning the system could be caching the entire file and all the metadata, and that would only explain half of the memory disappearance. I set the arc_meta_limit to 768 as mentioned above. I ran all these tests, and here are the results: (sorry it''s extremely verbose) http://dl.dropbox.com/u/543241/dedup%20tests/runtest-output.xlsx BTW, here are all the scripts etc that I used to produce those results: http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh
Frank Van Damme
2011-May-10 08:03 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Op 09-05-11 15:42, Edward Ned Harvey schreef:>> > in my previous >> > post my arc_meta_used was bigger than my arc_meta_limit (by about 50%) > I have the same thing. But as I sit here and run more and more extensive > tests on it ... it seems like arc_meta_limit is sort of a soft limit. Or it > only checks periodically or something like that. Because although I > sometimes see size > limit, and I definitely see max > limit ... When I do > bigger and bigger more intensive stuff, the size never grows much more than > limit. It always gets knocked back down within a few seconds...What I really can''t wrap my head around is how arc_meta_used can possibly be bigger than the arcsize. Yet on my system right now there''s about 15 MB difference between the two. Also, arcsize is only about 1800M right now and since there is a parameter set that is called arc_no_grow I presume it''ll remain that way. Do you think it''s possible to - force it to "off" again - set a higher minimum limit? The latter can probably be done in /etc/system, but on a live system? -- No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Frank Van Damme
2011-May-11 08:29 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
Op 10-05-11 06:56, Edward Ned Harvey schreef:>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >> >> BTW, here''s how to tune it: >> >> echo "arc_meta_limit/Z 0x30000000" | sudo mdb -kw >> >> echo "::arc" | sudo mdb -k | grep meta_limit >> arc_meta_limit = 768 MB > > Well ... I don''t know what to think yet. I''ve been reading these numbers > for like an hour, finding interesting things here and there, but nothing to > really solidly point my finger at. > > The one thing I know for sure... The free mem drops at an unnatural rate. > Initially the free mem disappears at a rate approx 2x faster than the sum of > file size and metadata combined. Meaning the system could be caching the > entire file and all the metadata, and that would only explain half of the > memory disappearance.I''m seeing similar things. Yesterday I first rebooted with set zfs:zfs_arc_meta_limit=0x100000000 (that''s 4 GiB) set in /etc/system and monitored while the box was doing its regular job (taking backups). zfs_arc_min is also set to 4 GiB. What I noticed is that shortly after the reboot, the arc started filling up rapidly, mostly with metadata. It shot up to: arc_meta_max = 3130 MB afterwards, the number for arc_meta_used steadily dropped. Some 12 hours ago, I started deleting files, it has deleted about 6000000 files since then. Now at the moment the arc size stays right at the minimum of 2 GiB, of which metadata fluctuates around 1650 MB. This is the output of the getmemstats.sh script you posted. Memory: 6135M phys mem, 539M free mem, 6144M total swap, 6144M free swap zfs:0:arcstats:c 2147483648 = 2 GiB target size zfs:0:arcstats:c_max 5350862848 = 5 GiB zfs:0:arcstats:c_min 2147483648 = 2 GiB zfs:0:arcstats:data_size 829660160 = 791 MiB zfs:0:arcstats:hdr_size 93396336 = 89 MiB zfs:0:arcstats:other_size 411215168 = 392 MiB zfs:0:arcstats:size 1741492896 = 1661 Mi arc_meta_used = 1626 MB arc_meta_limit = 4096 MB arc_meta_max = 3130 MB I get way more cache misses then I''d like: Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 10:01:13 3K 380 10 166 7 214 15 259 7 1G 2G 10:02:13 2K 340 16 37 2 302 46 323 16 1G 2G 10:03:13 2K 368 18 47 3 321 46 347 17 1G 2G 10:04:13 1K 348 25 44 4 303 63 335 24 1G 2G 10:05:13 2K 420 15 87 4 332 36 383 14 1G 2G 10:06:13 3K 489 16 132 6 357 35 427 14 1G 2G 10:07:13 2K 405 15 49 2 355 39 401 15 1G 2G 10:08:13 2K 366 13 40 2 326 37 366 13 1G 2G 10:09:13 1K 364 20 18 1 345 58 363 20 1G 2G 10:10:13 4K 370 8 59 2 311 21 369 8 1G 2G 10:11:13 4K 351 8 57 2 294 21 350 8 1G 2G 10:12:13 3K 378 10 59 2 319 26 372 10 1G 2G 10:13:13 3K 393 11 53 2 339 28 393 11 1G 2G 10:14:13 2K 403 13 40 2 363 35 402 13 1G 2G 10:15:13 3K 365 11 48 2 317 30 365 11 1G 2G 10:16:13 2K 374 15 40 2 334 40 374 15 1G 2G 10:17:13 3K 385 12 43 2 341 28 383 12 1G 2G 10:18:13 4K 343 8 64 2 279 19 343 8 1G 2G 10:19:13 3K 391 10 59 2 332 23 391 10 1G 2G So, one explanation I can think of is that the "rest" of the memory are l2arc pointers, supposing they are not actually counted in the arc memory usage totals (AFAIK l2arc pointers are considered to be part of arc). Then again my l2arc is still growing (slowly) and I''m only caching metadata at the moment, so you''d think it''d shrink if there''s no more room for l2arc pointers. 
Besides I''m getting very little reads from ssd: capacity operations bandwidth pool alloc free read write read write ------------ ----- ----- ----- ----- ----- ----- backups 5.49T 1.57T 415 121 3.13M 1.58M raidz1 5.49T 1.57T 415 121 3.13M 1.58M c0t0d0s1 - - 170 16 2.47M 551K c0t1d0s1 - - 171 16 2.46M 550K c0t2d0s1 - - 170 16 2.53M 552K c0t3d0s1 - - 170 16 2.44M 550K cache - - - - - - c1t5d0 63.4G 48.4G 20 0 2.45M 0 ------------ ----- ----- ----- ----- ----- ----- (typical statistic over 1 minute) I might try the "windows solution" and reboot the machine to free up memory and let it fill the cache all over again and see if I get more cache hits... hmmm...> I set the arc_meta_limit to 768 as mentioned above. I ran all these tests, > and here are the results: > (sorry it''s extremely verbose) > http://dl.dropbox.com/u/543241/dedup%20tests/runtest-output.xlsx > > BTW, here are all the scripts etc that I used to produce those results: > http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c > http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh > http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh-- No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Edward Ned Harvey
2011-May-19 01:56 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Edward Ned Harvey > > New problem: > > I''m following all the advice I summarized into the OP of this thread, and > testing on a test system. (A laptop). And it''s just not working. I am > jumping into the dedup performance abyss far, far eariler thanpredicted... Now I''m repeating all these tests on a system that more closely resembles a server. This is a workstation with 6 core processor, 16G ram, and a single 1TB hard disk. In the default configuration, arc_meta_limit is 3837MB. And as I increase the number of unique blocks in the data pool, it is perfectly clear that performance jumps off a cliff when arc_meta_used starts to reach that level, which is approx 880,000 to 1,030,000 unique blocks. FWIW, this means, without evil tuning, a 16G server is only sufficient to run dedup on approx 33GB to 125GB unique data without severe performance degradation. I''m calling "severe degradation" anything that''s an order of magnitude or worse. (That''s 40K average block size * 880,000 unique blocks, and 128K average block size * 1,030,000 unique blocks.) So clearly this needs to be addressed, if dedup is going to be super-awesome moving forward. But I didn''t quit there. So then I tweak the arc_meta_limit. Set to 7680MB. And repeat the test. This time, the edge of the cliff is not so clearly defined, something like 1,480,000 to 1,620,000 blocks. But the problem is - arc_meta_used never even comes close to 7680MB. At all times, I still have at LEAST 2G unused free mem. I have 16G physical mem, but at all times, I always have at least 2G free. my arcstats:c_max is 15G. But my arc size never exceeds 8.7G my arc_meta_limit is 7680 MB, but my arc_meta_used never exceeds 3647 MB. So what''s the holdup? All of the above is, of course, just a summary. If you want complete overwhelming details, here they are: http://dl.dropbox.com/u/543241/dedup%20tests/readme.txt http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh http://dl.dropbox.com/u/543241/dedup%20tests/parse.py http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-1st-pass.txt http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-1st-pass-parsed.xlsx http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-2nd-pass.txt http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-2nd-pass-parsed.xlsx
Edward Ned Harvey
2011-May-20 13:39 UTC
[zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Edward Ned Harvey > > New problem: > > I''m following all the advice I summarized into the OP of this thread, and > testing on a test system. (A laptop). And it''s just not working. I am > jumping into the dedup performance abyss far, far eariler thanpredicted... (resending this message, because it doesn''t seem to have been delivered the first time. If this is a repeat, please ignore.) Now I''m repeating all these tests on a system that more closely resembles a server. This is a workstation with 6 core processor, 16G ram, and a single 1TB hard disk. In the default configuration, arc_meta_limit is 3837MB. And as I increase the number of unique blocks in the data pool, it is perfectly clear that performance jumps off a cliff when arc_meta_used starts to reach that level, which is approx 880,000 to 1,030,000 unique blocks. FWIW, this means, without evil tuning, a 16G server is only sufficient to run dedup on approx 33GB to 125GB unique data without severe performance degradation. I''m calling "severe degradation" anything that''s an order of magnitude or worse. (That''s 40K average block size * 880,000 unique blocks, and 128K average block size * 1,030,000 unique blocks.) So clearly this needs to be addressed, if dedup is going to be super-awesome moving forward. But I didn''t quit there. So then I tweak the arc_meta_limit. Set to 7680MB. And repeat the test. This time, the edge of the cliff is not so clearly defined, something like 1,480,000 to 1,620,000 blocks. But the problem is - arc_meta_used never even comes close to 7680MB. At all times, I still have at LEAST 2G unused free mem. I have 16G physical mem, but at all times, I always have at least 2G free. my arcstats:c_max is 15G. But my arc size never exceeds 8.7G my arc_meta_limit is 7680 MB, but my arc_meta_used never exceeds 3647 MB. So what''s the holdup? All of the above is, of course, just a summary. If you want complete overwhelming details, here they are: http://dl.dropbox.com/u/543241/dedup%20tests/readme.txt http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh http://dl.dropbox.com/u/543241/dedup%20tests/parse.py http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-1st-pass.txt http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-1st-pass-parsed.xlsx http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-2nd-pass.txt http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-outp ut-2nd-pass-parsed.xlsx