Edward Ned Harvey
2011-Jul-09 16:04 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
There were a lot of useful details put into the thread "Summary: Dedup and L2ARC memory requirements". Please refer to that thread as necessary... After much discussion leading up to that thread, I thought I had enough understanding to make dedup useful, but then in practice, it didn't work out. Now I've done a lot more work on it, reduced it all to practice, and I finally feel I can draw up conclusions that are actually useful:

I am testing on a Sun Oracle server, X4270, 1 Xeon 4-core 2.4GHz, 24G RAM, 12 disks ea 2T SAS 7.2krpm. The Solaris 11 Express snv_151a OS is installed on a single disk. The remaining 11 disks are all striped into a single 20 TB pool (no redundancy). Obviously not the way you would configure for production; the point is to get maximum usable size and max performance for testing purposes, so I can actually find the limits in this lifetime.

With and without dedup, the read and write performance characteristics on duplicate and unique data are completely different. That's a lot of variables. So here's how I'm going to break it down: performance gain of dedup versus performance loss of dedup.

--- Performance gain:

Unfortunately there was only one area where I found any performance gain. When you read back duplicate data that was previously written with dedup, you get a lot more cache hits, and as a result, the reads go faster. Unfortunately these gains are diminished... I don't know by what... But you only have about 2x to 4x performance gain reading previously dedup'd data, as compared to reading the same data which was never dedup'd. Even when repeatedly reading the same file which is 100% duplicate data (created by dd from /dev/zero), so all the data is 100% in cache... I still see only 2x to 4x performance gain with dedup.

--- Performance loss:

The first conclusion to draw is: for extremely small pools (say, a few GB or so), writing with or without dedup performs exactly the same. As you grow the unique blocks in the pool, the write performance with dedup starts to deviate from the write performance without dedup. It quickly reaches 4x, 6x, 10x slower with dedup... but this write performance degradation appears to be logarithmic. I reached 8x write performance degradation around 2.4M blocks (290G used), but I never exceeded 11x write performance degradation even when I got up to 14T used (123M blocks).

The second conclusion is: out of the box, dedup is basically useless. Thanks to arc_meta_limit being pathetically small by default (3840M in my 24G system), I ran into a write performance brick wall around 30M unique blocks in the system ~= 3.6T unique data in the pool. When I say "write performance brick wall," I mean that up until that point, dedup writing was about 8x slower than writing without dedup, and after that point, the write performance difference grew exponentially. (Maybe it's not mathematically exponential, but the numbers look exponential to my naked eye.) I left it running for about 19 hours, and I never got beyond 5.8T written in the system.

Fortunately, it's really easy to tweak arc_meta_limit. So in the second test, that's what I did: I set the arc_meta_limit so high it would never be reached. In this configuration, the previously described logarithmic write performance degradation continued much higher. In other words, dedup write performance was pretty bad, but there was no "brick wall" as previously described.
Basically I kept writing at this rate till the pool was 13.5T full (113M blocks), and the whole time, dedup write performance was approx 10x slower than writing without dedup. At this point, my arc_meta_used reached 15,500M and it would not grow any more, so I reached a "softer" brick wall. I could only conclude that the data being cached in ARC was pushing the metadata out of ARC. But that's only a guess.

So the 3rd test was to leave the arc_meta_limit at maximum value, and set primarycache to metadata only. Naturally if you use this configuration in production, you're likely to have poor read performance, because you're guaranteed to never have a data cache hit... But it could still be a useful configuration, for example, if you are using dedup on a write-only backup server.

In this configuration, write performance was much better than the other configurations. I reached 3x slower dedup write performance almost immediately. 4x occurred around 47M blocks (5.6T used). 5x occurred around 88M blocks (10.6T used). It maintained 6x until around 142M blocks (17T used) and 15,732M arc_meta_used. At this point I hit 90% full, and the whole system basically fell apart, so I disregard all the results that came later. Based on the numbers I'm seeing, I have every reason to believe the system could have continued writing with dedup merely 6x slower than writing without dedup, if I had not run out of disk space. At least theoretically this should continue until I cannot fit the metadata in RAM anymore, and then I'll hit a brick wall... But the only way I can measure that is to go remove RAM from my system and repeat the test. I don't think I'll bother.

It has been mentioned before that you suffer a big performance penalty when you delete things. This is very, very true. Snapshot destroy, or even rm a file, and the time to completion is on the same order as the time it took to initially create all that data. This is a really big weak point. It might take several hours to perform the snapshot destroy of the oldest daily snapshot, and naturally that would be likely to occur on a daily basis, every midnight.
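For reference, the two knobs used in tests 2 and 3 boil down to the following. (The mdb variable name is what I see on snv_151a and may differ on other builds; the limit value is just an example, and "tank" stands in for your pool name.)

   # raise arc_meta_limit on the live system (example: ~20G), then verify
   echo "arc_meta_limit/Z 0x500000000" | mdb -kw
   kstat -p zfs:0:arcstats:arc_meta_limit zfs:0:arcstats:arc_meta_used

   # test 3 only: keep metadata (including the DDT) in ARC, but not file data
   zfs set primarycache=metadata tank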
Edward Ned Harvey
2011-Jul-09 17:55 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Given the abysmal performance, I have to assume there is a significant number of "overhead" reads or writes in order to maintain the DDT for each "actual" block write operation. Something I didn't mention in the other email is that I also tracked iostat throughout the whole operation. It's all writes (or at least 99.9% writes). So I am forced to conclude it's a bunch of small DDT maintenance writes taking place and incurring access time penalties in addition to each intended single block access time penalty.

The nature of the DDT is that it's a bunch of small blocks that tend to be scattered randomly and require maintenance in order to do anything else. This sounds like precisely the usage pattern that benefits from low latency devices such as SSDs.

I understand the argument: the DDT must be stored in the primary storage pool so you can increase the size of the storage pool without running out of space to hold the DDT... But it's a fatal design flaw as long as you care about performance... If you don't care about performance, you might as well use the netapp and do offline dedup. The point of online dedup is to gain performance. So in ZFS you have to care about the performance.

There are only two possible ways to fix the problem.

Either ...
The DDT must be changed so it can be stored entirely in a designated sequential area of disk, and maintained entirely in RAM, so all DDT reads/writes can be infrequent and serial in nature... This would solve the case of async writes and large sync writes, but would still perform poorly for small sync writes. And it would be memory intensive. But it should perform very nicely given those limitations. ;-)

Or ...
The DDT stays as it is now, highly scattered small blocks, and there needs to be an option to store it entirely on low latency devices such as dedicated SSDs. Eliminate the need for the DDT to reside on the slow primary storage pool disks. I understand you must consider what happens when the dedicated SSD gets full. The obvious choices would be either (a) dedup turns off whenever the metadatadevice is full, or (b) it defaults to writing blocks in the main storage pool. Maybe that could even be a configurable behavior. Either way, there's a very realistic use case here. For some people in some situations, it may be acceptable to say "I have a 32G mirrored metadatadevice; divided by 137 bytes per entry, I can dedup up to a maximum of 218M unique blocks in the pool, and if I estimate 100K average block size, that means up to 20T primary pool storage. If I reach that limit, I'll add more metadatadevice."

Both of those options would also go a long way toward eliminating the "surprise" delete performance black hole.
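By the way, if you want to see what the DDT actually costs on your own pool, zdb will report the entry count and the in-core / on-disk sizes directly. (Exact output format varies between builds; "tank" is just a placeholder.)

   zdb -D tank     # one-line DDT summary: entries, size on disk, size in core
   zdb -DD tank    # same, plus a histogram of reference counts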
Edward Ned Harvey
2011-Jul-09 18:21 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> When you read back duplicate data that was previously written with
> dedup, you get a lot more cache hits, and as a result, the reads go
> faster. Unfortunately these gains are diminished... I don't know by
> what... But you only have about 2x to 4x performance gain reading
> previously dedup'd data, as compared to reading the same data which was
> never dedup'd. Even when repeatedly reading the same file which is 100%
> duplicate data (created by dd from /dev/zero), so all the data is 100% in
> cache... I still see only 2x to 4x performance gain with dedup.

For what it's worth: I also repeated this without dedup. Created a large file (17G, just big enough that it will fit entirely in my ARC). Rebooted. Timed reading it. Now it's entirely in cache. Timed reading it again.

When it's not cached, of course the read time was equal to the original write time. When it's cached, it goes 4x faster. Perhaps this is only because I'm testing on a machine that has super fast storage... 11 striped SAS disks yielding 8Gbit/sec, as compared to all-RAM which yielded 31.2Gbit/sec. It seems in this case, RAM is only 4x faster than the storage itself... But I would have expected a couple orders of magnitude... So perhaps my expectations are off, or the ARC itself simply incurs overhead. Either way, dedup is not to blame for obtaining merely 2x or 4x performance gain over the non-dedup equivalent.
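The test itself was nothing fancy; roughly this shape (file size, names and pool layout are illustrative):

   # write a file just small enough to fit entirely in ARC, on a non-dedup filesystem
   dd if=/dev/urandom of=/tank/nodedup/bigfile bs=1024k count=17000
   # reboot so the ARC is cold, then:
   ptime dd if=/tank/nodedup/bigfile of=/dev/null bs=1024k   # cold read, from disk
   ptime dd if=/tank/nodedup/bigfile of=/dev/null bs=1024k   # warm read, from ARC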
Roy Sigurd Karlsbakk
2011-Jul-09 18:33 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> When it's not cached, of course the read time was equal to the original
> write time. When it's cached, it goes 4x faster. Perhaps this is only
> because I'm testing on a machine that has super fast storage... 11 striped
> SAS disks yielding 8Gbit/sec, as compared to all-RAM which yielded
> 31.2Gbit/sec. It seems in this case, RAM is only 4x faster than the storage
> itself... But I would have expected a couple orders of magnitude... So
> perhaps my expectations are off, or the ARC itself simply incurs overhead.
> Either way, dedup is not to blame for obtaining merely 2x or 4x performance
> gain over the non-dedup equivalent.

Could you test with some SSD SLOGs and see how well or bad the system performs?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Edward Ned Harvey
2011-Jul-09 19:30 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> Sent: Saturday, July 09, 2011 2:33 PM
>
> Could you test with some SSD SLOGs and see how well or bad the system
> performs?

These are all async writes, so slog won't be used. Async writes that have a single fflush() and fsync() at the end to ensure system buffering is not skewing the results.
Roy Sigurd Karlsbakk
2011-Jul-09 19:43 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> > From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> > Sent: Saturday, July 09, 2011 2:33 PM
> >
> > Could you test with some SSD SLOGs and see how well or bad the
> > system performs?
>
> These are all async writes, so slog won't be used. Async writes that
> have a single fflush() and fsync() at the end to ensure system
> buffering is not skewing the results.

Sorry, my bad, I meant L2ARC to help buffer the DDT.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Edward Ned Harvey
2011-Jul-10 13:50 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> --- Performance loss:

I ran one more test that is rather enlightening. I repeated test #2 (tweak arc_meta_limit, use the default primarycache=all), but this time I wrote 100% duplicate data instead of unique. Dedup=sha256 (no verify).

Ideally, you would expect this to write very, very fast... Because it's all duplicate data, and it's all async, the system should just buffer a bunch of tiny metadata changes, aggregate them, and occasionally write a single serial block when it flushes the TXG. It should be much faster to write with dedup.

The results are: with dedup, it writes several times slower. Just the same as test #2, minus the amount of time it takes to write the actual data. For example, here's one datapoint, which is representative of the whole test:

time to write unique data without dedup:  7.090 sec
time to write unique data with dedup:    47.379 sec
time to write duplic data without dedup:  7.016 sec
time to write duplic data with dedup:    39.852 sec

This clearly breaks it down:
7 sec to write the actual data
40 sec overhead caused by dedup
<1 sec is about how fast it should have been writing duplicated data
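For concreteness, the duplicate-data leg of the test is essentially the following (sizes and pool name are illustrative; /dev/zero yields 100% duplicate blocks, while the real harness generates its own unique data for the other cases):

   zfs set dedup=off tank
   ptime dd if=/dev/zero of=/tank/dup-nodedup bs=128k count=80000   # ~10G duplicate data, no dedup
   zfs set dedup=sha256 tank
   ptime dd if=/dev/zero of=/tank/dup-dedup bs=128k count=80000     # ~10G duplicate data, dedup on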
Edward Ned Harvey
2011-Jul-10 14:01 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> Sent: Saturday, July 09, 2011 3:44 PM
>
> > > Could you test with some SSD SLOGs and see how well or bad the
> > > system performs?
> >
> > These are all async writes, so slog won't be used. Async writes that
> > have a single fflush() and fsync() at the end to ensure system
> > buffering is not skewing the results.
>
> Sorry, my bad, I meant L2ARC to help buffer the DDT.

Oh - it just so happens I don't have one available, but that doesn't mean I can't talk about it. ;-)

For quite a lot of these tests, all the data resides in the ARC, period. The only area where the L2ARC would have an effect is after that region... When I'm pushing the limits of ARC, then there may be some benefit from the use of L2ARC. So...

It is distinctly possible the L2ARC might help soften the "brick wall." When reaching arc_meta_limit, some of the metadata might have been pushed out to L2ARC in order to leave a (slightly) smaller footprint in the ARC... I doubt it, but maybe there could be some gain here.

It is distinctly possible the L2ARC might help test #2 approach the performance of test #3 (test #2 had primarycache=all and suffered approx 10x write performance degradation, while test #3 had primarycache=metadata and suffered approx 6x write performance degradation).

But there's positively no way the L2ARC would come into play on test #3. In this situation, all the metadata, the complete DDT, resides in RAM. So with or without the cache device, the best case we're currently looking at is approx 6x write performance degradation.
Edward Ned Harvey
2011-Jul-10 14:22 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> Sent: Saturday, July 09, 2011 3:44 PM
>
> Sorry, my bad, I meant L2ARC to help buffer the DDT.

Also, bear in mind, the L2ARC is only for reads. So it can't help accelerate writing updates to the DDT. Those updates need to hit the pool, period.

Yes, on test 1 and test 2, there were significant regions where reads were taking place. (Basically the whole test, approx 25% to 30% reads.)

On test 3, there were absolutely no reads up till 75M entries (9.07T used, arc_meta_used = 12960 MB). Up to this point, it was a 4x write performance degradation. Then it suddenly started performing about 5% reads and 95% writes, and suddenly jumped to 6x write performance degradation.
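For reference, the read/write mix and the arc_meta_used numbers I keep quoting come from watching roughly this while the tests run ("tank" again being a placeholder):

   iostat -xn 10                            # per-device read/write mix and service times
   kstat -p zfs:0:arcstats:arc_meta_used zfs:0:arcstats:arc_meta_limit
   zpool list tank                          # how full the pool is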
Jim Klimov
2011-Jul-12 11:40 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-09 20:04, Edward Ned Harvey wrote:
> --- Performance gain:
>
> Unfortunately there was only one area where I found any performance
> gain. When you read back duplicate data that was previously written
> with dedup, you get a lot more cache hits, and as a result, the
> reads go faster. Unfortunately these gains are diminished... I don't
> know by what... But you only have about 2x to 4x performance gain
> reading previously dedup'd data, as compared to reading the same data
> which was never dedup'd. Even when repeatedly reading the same file
> which is 100% duplicate data (created by dd from /dev/zero), so all the
> data is 100% in cache... I still see only 2x to 4x performance gain
> with dedup.

First of all, thanks for all the experimental research and results, even if the outlook is grim. I'd love to see comments about those systems which use dedup and actually gain benefits, and how much they gain (i.e. VM farms, etc.), and what may differ in terms of setup (i.e. at least 256Gb RAM or whatever).

Hopefully the discrepancy between blissful hopes (I had) - that dedup would save disk space and boost the systems somewhat, kinda like online compression can do - and cruel reality would result in some improvement project. Perhaps it would be an offline dedup implementation (perhaps with an online-dedup option turnable off), as recently discussed on the list.

Deleting stuff is still a pain though. For the past week my box has been trying to delete an rsynced backup of a linux machine, some 300k files summing up to 50Gb. Deleting large files was rather quick, but those consuming just a few blocks are really slow. Even if I batch background RMs so a hundred processes hang and then they all at once complete in a minute or two. And quite often the iSCSI initiator or target go crazy, so one of the boxes (or both) have to be rebooted, about thrice a day. I described my setup before, won't clobber it into here ;)

Regarding the low read performance gain, you suggested in a later post that this could be due to the RAM and disk bandwidth difference in your machine. I for one think that (without sufficient ARC block-caching) dedup reading would suffer greatly also from fragmentation - any one large file with some or all deduped data is basically guaranteed to have its blocks scattered across all of your storage. At least if this file was committed to the deduped pool late in its life, when most or all of the blocks were already there.

By the way, did you estimate how much dedup's overhead is in terms of metadata blocks? For example, it was often said on the list that you shouldn't bother with dedup unless your data can be deduped 2x or better, and if you're lucky to already have it on ZFS - you can estimate the reduction with zdb. Now, I wonder where the number comes from - is it empirical, or would dedup metadata take approx 1x the data space, thus under 2x reduction you gain little or nothing? ;)

Thanks for the research,
//Jim
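P.S. The zdb estimate I mean is the simulated dedup run, along the lines of (pool name is just an example):

   zdb -S mypool    # simulates dedup on an existing pool, prints the would-be DDT and dedup ratio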
Edward Ned Harvey
2011-Jul-12 12:46 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> By the way, did you estimate how much dedup's overhead is
> in terms of metadata blocks? For example, it was often said
> on the list that you shouldn't bother with dedup unless your
> data can be deduped 2x or better, and if you're lucky to
> already have it on ZFS - you can estimate the reduction
> with zdb. Now, I wonder where the number comes from -
> is it empirical, or would dedup metadata take approx 1x
> the data space, thus under 2x reduction you gain little
> or nothing? ;)

You and I seem to have different interpretations of the empirical "2x" soft-requirement to make dedup worthwhile. I always interpreted it like this: if read/write of DUPLICATE blocks with dedup enabled yields 4x performance gain, and read/write of UNIQUE blocks with dedup enabled yields 4x performance loss, then you need a 50/50 mix of unique and duplicate blocks in the system in order to break even. This is the same as having a 2x dedup ratio.

Unfortunately, based on this experience, I would now say something like a dedup ratio of 10x is more likely the break-even point.

Ideally, read/write of unique blocks should be just as fast, with or without dedup. Ideally, read/write of duplicate blocks would be an order of magnitude (or more) faster with dedup. It's not there right now... But I still have high hopes.

You know what? A year ago I would have said dedup still wasn't stable enough for production. Now I would say it's plenty stable enough... But it needs performance enhancement before it's truly useful for most cases.
Jim Klimov
2011-Jul-12 13:05 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
This dedup discussion (and my own bad experience) have also left me with another grim thought: some time ago, sparse-root zone support was ripped out of OpenSolaris. Among the published rationales were the transition to IPS and the assumption that most people used them to save on disk space (the notion about saving RAM on shared objects was somehow dismissed). Regarding the disk savings, it was said that dedup would solve the problem, at least for those systems which use dedup on the zoneroot dataset (and preferably that would be in the rpool, too).

On one hand, storing zoneroots in the rpool was never practical for us because we tend to keep the rpool small and un-clobbered, and on the other hand, now adding dedup to the rpool would seem like shooting oneself in the foot with a salt-loaded shotgun. Maybe it won't kill, but it would hurt a lot and for a long time.

On the third hand ;) with a small rpool hosting zoneroots as well, the DDT would reasonably be small too, and may actually boost performance while saving space. But lots of attention should now be paid to separating /opt, parts of /var and stuff into delegated datasets from a larger data pool. And software like Sun JES, which installs into a full-root zone's /usr, might overwhelm a small rpool as well.

Anyhow, Edward, is there a test for this scenario - i.e. a 10Gb pool with lots of non-unique data in small blocks?

Thanks,
//Jim
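P.S. Something like the following could probably mock that scenario up; the sizes, paths and properties are just my guess at what would be representative:

   # throwaway 10Gb file-backed pool with dedup and small records, standing in for zoneroots
   mkfile 10g /var/tmp/zonepool.img
   zpool create zonepool /var/tmp/zonepool.img
   zfs create -o recordsize=8k -o dedup=on zonepool/zones
   # then copy in a few near-identical zoneroot-sized trees and watch zpool list / zdb -D zonepool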
Jim Klimov
2011-Jul-12 13:19 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> You and I seem to have different interpretations of the
> empirical "2x" soft-requirement to make dedup worthwhile.

Well, until recently I had little interpretation for it at all, so your approach may be better. I hope that the authors of the requirement statement would step forward and explain what it is about "under the hood" and why 2x ;)

> You know what? A year ago I would have said dedup still
> wasn't stable enough for production. Now I would say it's plenty stable
> enough... But it needs performance enhancement before it's
> truly useful for most cases.

Well, not that this would contradict you, but on my oi_148a (which may be based on code close to a year old), it seems rather unstable, with systems either freezing or slowing down after some writes and having to be rebooted in order to work (fresh-after-boot writes are usually relatively good, i.e. 5Mb/s vs. 100k/s). On the iSCSI server side, the LUN and STMF service often lock up with "device busy" even though the volume "pool/dcpool" is not itself deduped. For me this is only solved by a reboot... And reboots of the VM client, which fights its way through deleting files from the deduped datasets inside "dcpool" (imported over iSCSI), are beyond counting.

Actually, in a couple of weeks I might be passing by that machine and may have a chance to update it to oi_151-dev. Would that buy me any improvements, or potentially worsen my situation? ;)

//Jim
Bob Friesenhahn
2011-Jul-12 13:58 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Tue, 12 Jul 2011, Edward Ned Harvey wrote:
> You know what? A year ago I would have said dedup still wasn't stable
> enough for production. Now I would say it's plenty stable enough... But it
> needs performance enhancement before it's truly useful for most cases.

What has changed for you to change your mind? Did the zfs code change in the past year, or is this based on experience with the same old stagnant code?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey
2011-Jul-13 01:08 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
> Sent: Tuesday, July 12, 2011 9:58 AM
>
> > You know what? A year ago I would have said dedup still wasn't stable
> > enough for production. Now I would say it's plenty stable enough... But it
> > needs performance enhancement before it's truly useful for most cases.
>
> What has changed for you to change your mind? Did the zfs code change
> in the past year, or is this based on experience with the same old
> stagnant code?

No idea. I assume they've been patching, and I don't hear many people complaining of dedup instability on this list anymore. But the other option is that nothing's changed, and only my perception has changed. I acknowledge that's possible.
Frank Van Damme
2011-Jul-14 07:54 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Op 12-07-11 13:40, Jim Klimov wrote:
> Even if I batch background RMs so a hundred processes hang
> and then they all at once complete in a minute or two.

Hmmm. I only run one rm process at a time. You think running more processes at the same time would be faster?

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Jim Klimov
2011-Jul-14 10:28 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-14 11:54, Frank Van Damme wrote:
> Op 12-07-11 13:40, Jim Klimov wrote:
>> Even if I batch background RMs so a hundred processes hang
>> and then they all at once complete in a minute or two.
> Hmmm. I only run one rm process at a time. You think running more
> processes at the same time would be faster?

Yes, quite often it seems so. Whenever my slow "dcpool" decides to accept a write, it processes a hundred pending deletions instead of one ;)

Even so, it took quite a few pool or iSCSI hangs and then reboots of both server and client, and about a week overall, to remove a 50Gb dir with 400k small files from a deduped pool served over iSCSI from a volume in a physical pool. Just completed this night ;)
Frank Van Damme
2011-Jul-14 11:48 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Op 14-07-11 12:28, Jim Klimov wrote:
> Yes, quite often it seems so. Whenever my slow "dcpool" decides to
> accept a write, it processes a hundred pending deletions instead of one ;)
>
> Even so, it took quite a few pool or iSCSI hangs and then reboots of
> both server and client, and about a week overall, to remove a 50Gb dir
> with 400k small files from a deduped pool served over iSCSI from a
> volume in a physical pool. Just completed this night ;)

It seems counter-intuitive - you'd say concurrent disk access makes things only slower - but it turns out to be true. I'm deleting a dozen times faster than before. How completely ridiculous. Thank you :-)

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Jim Klimov
2011-Jul-14 15:24 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-14 15:48, Frank Van Damme wrote:
> It seems counter-intuitive - you'd say concurrent disk access makes
> things only slower - but it turns out to be true. I'm deleting a
> dozen times faster than before. How completely ridiculous. Thank you :-)

Well, look at it this way: it is not only about singular disk accesses (i.e. unlike other FSes, you do not modify a directory entry in place); with ZFS COW it is about rewriting a tree of block pointers, with any new writes going into free (unreferenced ATM) disk blocks anyway.

So by hoarding writes you have a chance to reduce the mechanical IOPS required for your tasks. Until you run out of RAM ;)

Just in case it helps, to quickly fire up removals of the specific directory after yet another reboot of the box, and not overwhelm it with hundreds of thousands of queued "rm" processes either, I made this script as /bin/RM:

==
#!/bin/sh

# Optional first argument: seconds to sleep between process-count checks (default 10)
SLEEP=10
[ x"$1" != x ] && SLEEP=$1

A=0
# To rm only small files, add e.g.: find ... -size -10
find /export/OLD/PATH/TO/REMOVE -type f | while read LINE; do
    du -hs "$LINE"
    rm -f "$LINE" &
    A=$(($A+1))
    # After firing 100 background rm's, wait until fewer than ~50 are still running
    [ "$A" -ge 100 ] && ( date
        while [ `ps -ef | grep -wc rm` -gt 50 ]; do
            echo "Sleep $SLEEP..."; ps -ef | grep -wc rm
            sleep $SLEEP; ps -ef | grep -wc rm
        done
        date ) && A="`ps -ef | grep -wc rm`"
done; date
==

Essentially, after firing up 100 "rm attempts" it waits for the "rm" process count to go below 50, then goes on. Sizing may vary between systems, the phase of the moon and the computer's attitude. Sometimes I had 700 processes stacked and processed quickly. Sometimes it hung on 50...

HTH,
//Jim
Daniel Carosone
2011-Jul-15 02:21 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
um, this is what xargs -P is for ...

--
Dan.

On Thu, Jul 14, 2011 at 07:24:52PM +0400, Jim Klimov wrote:
> Just in case it helps, to quickly fire up removals of the specific directory
> after yet another reboot of the box, and not overwhelm it with hundreds
> of thousands of queued "rm" processes either, I made this script as /bin/RM:
> [...]
> Essentially, after firing up 100 "rm attempts" it waits for the "rm"
> process count to go below 50, then goes on. Sizing may vary
> between systems, the phase of the moon and the computer's attitude.
> Sometimes I had 700 processes stacked and processed quickly.
> Sometimes it hung on 50...
>
> HTH,
> //Jim
Edward Ned Harvey
2011-Jul-15 02:27 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> I understand the argument: the DDT must be stored in the primary storage pool
> so you can increase the size of the storage pool without running out of space
> to hold the DDT... But it's a fatal design flaw as long as you care about
> performance... If you don't care about performance, you might as well use
> the netapp and do offline dedup. The point of online dedup is to gain
> performance. So in ZFS you have to care about the performance.
>
> There are only two possible ways to fix the problem.
>
> Either ...
> The DDT must be changed so it can be stored entirely in a designated
> sequential area of disk, and maintained entirely in RAM, so all DDT
> reads/writes can be infrequent and serial in nature... This would solve the
> case of async writes and large sync writes, but would still perform poorly
> for small sync writes. And it would be memory intensive. But it should
> perform very nicely given those limitations. ;-)
>
> Or ...
> The DDT stays as it is now, highly scattered small blocks, and there needs
> to be an option to store it entirely on low latency devices such as
> dedicated SSDs. Eliminate the need for the DDT to reside on the slow
> primary storage pool disks. I understand you must consider what happens
> when the dedicated SSD gets full. The obvious choices would be either (a)
> dedup turns off whenever the metadatadevice is full, or (b) it defaults to
> writing blocks in the main storage pool. Maybe that could even be a
> configurable behavior. Either way, there's a very realistic use case here.
> For some people in some situations, it may be acceptable to say "I have a 32G
> mirrored metadatadevice; divided by 137 bytes per entry, I can dedup up to a
> maximum of 218M unique blocks in the pool, and if I estimate 100K average
> block size, that means up to 20T primary pool storage. If I reach that limit,
> I'll add more metadatadevice."
>
> Both of those options would also go a long way toward eliminating the
> "surprise" delete performance black hole.

Is anyone from Oracle reading this? I understand if you can't say what you're working on and stuff like that. But I am merely hopeful this work isn't going into a black hole...

Anyway. Thanks for listening (I hope.) ttyl
Jim Klimov
2011-Jul-15 03:56 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-15 6:21, Daniel Carosone wrote:
> um, this is what xargs -P is for ...

Thanks for the hint. True, I don't often use xargs.

However, from the man pages, I don't see a "-P" option on OpenSolaris boxes of different releases, and there is only a "-p" (prompt) mode. I am not eager to enter "yes" 400000 times ;)

The way I had this script in practice, I could enter "RM" once and it worked till the box hung. Even then, a watchdog script could often have it rebooted without my interaction, so it could continue in the next lifetime ;)
Daniel Carosone
2011-Jul-15 03:59 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Fri, Jul 15, 2011 at 07:56:25AM +0400, Jim Klimov wrote:
> 2011-07-15 6:21, Daniel Carosone wrote:
> > um, this is what xargs -P is for ...
>
> Thanks for the hint. True, I don't often use xargs.
>
> However, from the man pages, I don't see a "-P" option
> on OpenSolaris boxes of different releases, and there
> is only a "-p" (prompt) mode. I am not eager to enter
> "yes" 400000 times ;)

you want the /usr/gnu/{bin,share/man} version, at least in this case.

--
Dan.
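P.S. Something along these lines, assuming GNU findutils is also installed under /usr/gnu (batch size and parallelism picked out of thin air):

   /usr/gnu/bin/find /export/OLD/PATH/TO/REMOVE -type f -print0 | \
       /usr/gnu/bin/xargs -0 -n 100 -P 50 rm -f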
Frank Van Damme
2011-Jul-15 06:35 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Op 15-07-11 04:27, Edward Ned Harvey wrote:
> Is anyone from Oracle reading this? I understand if you can't say what
> you're working on and stuff like that. But I am merely hopeful this work
> isn't going into a black hole...
>
> Anyway. Thanks for listening (I hope.) ttyl

If they aren't, maybe someone from an open source Solaris version is :)

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
phil.harman@gmail.com
2011-Jul-15 07:10 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
If you clone zones from a golden image using ZFS cloning, you get fast, efficient dedup for free. Sparse root always was a horrible hack!

----- Reply message -----
From: "Jim Klimov" <jimklimov at cos.ru>
To:
Cc: <zfs-discuss at opensolaris.org>
Subject: [zfs-discuss] Summary: Dedup memory and performance (again, again)
Date: Tue, Jul 12, 2011 14:05

> This dedup discussion (and my own bad experience) have also left me with
> another grim thought: some time ago, sparse-root zone support was ripped
> out of OpenSolaris. Among the published rationales were the transition to
> IPS and the assumption that most people used them to save on disk space
> (the notion about saving RAM on shared objects was somehow dismissed).
> Regarding the disk savings, it was said that dedup would solve the problem,
> at least for those systems which use dedup on the zoneroot dataset (and
> preferably that would be in the rpool, too).
> [...]
> Anyhow, Edward, is there a test for this scenario - i.e. a 10Gb pool
> with lots of non-unique data in small blocks?
Jim Klimov
2011-Jul-15 10:19 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-15 11:10, phil.harman at gmail.com wrote:
> If you clone zones from a golden image using ZFS cloning, you get
> fast, efficient dedup for free. Sparse root always was a horrible hack!

Sounds like a holy war is flaming up ;)

From what I heard, sparse-root zones with shared common system libraries allowed saving not only on disk space but also on RAM. Can't vouch, never tested extensively myself.

Cloning of golden zones is of course used in our systems. But this approach breaks badly upon any major system update (i.e. LiveUpgrade to a new release) - many of the binaries change, and you either suddenly have the zones (wanting to) consume many gigabytes of disk space which are not there on a small rpool or a busy data pool, or you have to make a new golden image, clone a new set of zones and reinstall/migrate all applications and settings.

True, this is a no-brainer for zones running a single task, like an /opt/tomcat directory which can be tossed around to any OS, but it becomes tedious for software with many packages and complicated settings, especially if (in some extremity) it was homebrewed and/or home-compiled and unpackaged ;)

I am not the first (or probably last) to write about the inconvenience of zone upgrades, which loses the cloning benefit, and much of the same is true for upgrading cloned/deduped VM golden images as well, where the golden image is just some common baseline OS but the clones all run different software. And it is this different software which makes them useful and unique, and too distinct to maintain a dozen golden images efficiently (i.e. there might be just 2 or 3 clones of each gold).

But in general, the problem is there - you either accept that your OS images in effect won't be deduped, much or at all, after some lifespan involving OS upgrades, or you don't update them often (which may be unacceptable for security and/or paranoia types of deployments), or you use some trickery to update frequently and not lose much disk space, such as automation of software and config migration from one clone (of old gold) to another clone (of new gold).

Dedup was a promising variant in this context, unless it kills performance and/or stability... which was the subject of this thread, with Edward's research into the performance of the current dedup implementation (and perhaps some baseline to test whether real improvements appear in the future).

And in terms of performance there's some surprise in Edward's findings regarding i.e. reads from the deduped data. For infrequent maintenance (i.e. monthly upgrades) zoneroots (the OS image part) would be read-mostly, and write performance of dedup may not matter much. If the updates must pour in often for whatever reason, then write and delete performance of dedup may begin to matter.

Sorry about straying the discussion into zones - they, their performance, and coping with changes introduced during their lifetime (see OS upgrades) are one good example for discussion of dedup, and one application of it which may be commonly useful on any server or workstation, not only on hardware built for dedicated storage.

Sparse-root vs. full-root zones, or disk images of VMs; whether they are stuffed in one rpool or spread between rpool and data pools - that detail is not actually the point of the thread. Actual usability of dedup for savings and gains on these tasks (preferably working also on low-to-mid-range boxes, where adding a good enterprise SSD would double the server cost - not only on those big good systems with tens of GB of RAM), and hopefully simplifying system configuration and maintenance - that is indeed the point in question.

//Jim
Mike Gerdts
2011-Jul-15 14:55 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Fri, Jul 15, 2011 at 5:19 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2011-07-15 11:10, phil.harman at gmail.com wrote:
>> If you clone zones from a golden image using ZFS cloning, you get fast,
>> efficient dedup for free. Sparse root always was a horrible hack!
>
> Sounds like a holy war is flaming up ;)
>
> From what I heard, sparse-root zones with shared common
> system libraries allowed saving not only on disk space but
> also on RAM. Can't vouch, never tested extensively myself.

There may be some benefit to that; I'd argue that most of the time there's not that much. Using what is surely an imperfect way of measuring, I took a look at a zone on a Solaris 10 box that I happen to be logged into. I found it is using about 52 MB of memory in mappings of executables and libraries. By disabling webconsole (a java program that has an RSS of 100+ MB), the shared mappings drop to 40 MB.

# cd /proc
# pmap -xa * | grep r.x | grep -v ' anon ' | grep -v ' stack ' | grep -v ' heap ' | sort -u | nawk '{ t+= $3 } END { print t / 1024, "MB" }'
pmap: cannot examine 22427: system process
40.3281 MB

If you are running the same large application (large executable + libraries resident in memory) in many zones, you may have additional benefit.

Solaris 10 was released in 2005, meaning that sparse root zones were conceived sometime in the years leading up to that. In that time, entry level servers have gone from 1 - 2 GB of memory (e.g. a V210 or V240) to 12 - 16+ GB of memory (X2270 M2, T3-1). Further, large systems tend to have NUMA characteristics that challenge the logic of trying to maintain only one copy of hot read-only executable pages. It just doesn't make sense to constrain the design of zones around something that is going to save 0.3% of the memory of an entry level server. Even in 2005, I'm not so sure it was a strong argument.

Disk space is another issue. Jim does a fine job of describing the issues around that.

> Cloning of golden zones is of course used in our systems.
> But this approach breaks badly upon any major system
> update (i.e. LiveUpgrade to a new release) [...]

--
Mike Gerdts
http://mgerdts.blogspot.com/
Ian Collins
2011-Jul-23 08:01 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On 07/10/11 04:04 AM, Edward Ned Harvey wrote:
> There were a lot of useful details put into the thread "Summary: Dedup
> and L2ARC memory requirements".
>
> Please refer to that thread as necessary... After much discussion
> leading up to that thread, I thought I had enough understanding to
> make dedup useful, but then in practice, it didn't work out. Now I've
> done a lot more work on it, reduced it all to practice, and I finally
> feel I can draw up conclusions that are actually useful:
>
> I am testing on a Sun Oracle server, X4270, 1 Xeon 4-core 2.4GHz, 24G
> RAM, 12 disks ea 2T SAS 7.2krpm. The Solaris 11 Express snv_151a

Can you provide more details of your tests?

I'm currently testing a couple of slightly better configured X4270s (2 CPUs, 96GB RAM and a FLASH accelerator card) using real data from an existing server. So far, I haven't seen the levels of performance fall-off you report. I currently have about 5TB of uncompressed data in the pool (a stripe of 5 mirrors) and throughput is similar to the existing Solaris 10 servers. The pool dedup ratio is 1.7, so there's a good mix of unique and duplicate blocks.

--
Ian.
Edward Ned Harvey
2011-Jul-23 13:17 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Ian Collins [mailto:ian at ianshome.com]
> Sent: Saturday, July 23, 2011 4:02 AM
>
> Can you provide more details of your tests?

Here's everything: http://dl.dropbox.com/u/543241/dedup%20tests/dedup%20tests.zip
In particular, look under the "work server" directory.

The basic concept goes like this: find some amount of data that takes approx 10 sec to write. I don't know the size; I just kept increasing a block counter till I got times I felt were reasonable, so let's suppose it's 10G.

Time writing that much without dedup (all unique). Remove the file.
Time writing that much with dedup (sha256, no verify) (all unique). Remove the file.
Write 10x that much with dedup (all unique). Don't remove the file.
Repeat.

So I'm getting comparisons of write speeds for 10G files, sampling at 100G intervals. For a 6x performance degradation, it would be 7 sec to write without dedup, and 40-45 sec to write with dedup. I am doing fflush() and fsync() at the end of every file write, to ensure results are not skewed by write buffering.
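Stripped of the bookkeeping, each sample in those scripts is shaped roughly like this. (Pool name, sizes and iteration count are illustrative; the real harness writes its own unique data and does the fflush()/fsync() itself rather than using dd, so /dev/urandom here is only a stand-in for a unique-data generator.)

   #!/bin/sh
   # One sampling interval: time a ~10G write without dedup, then with dedup,
   # then leave ~100G of unique deduped data behind so the next sample sees a bigger DDT.
   POOL=tank
   i=1
   while [ $i -le 150 ]; do
       zfs set dedup=off $POOL
       ptime dd if=/dev/urandom of=/$POOL/sample.nodedup bs=128k count=80000
       rm /$POOL/sample.nodedup

       zfs set dedup=sha256 $POOL
       ptime dd if=/dev/urandom of=/$POOL/sample.dedup bs=128k count=80000
       rm /$POOL/sample.dedup

       # filler is written with dedup still enabled, and never removed
       dd if=/dev/urandom of=/$POOL/filler.$i bs=128k count=800000
       i=$(($i + 1))
   done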
Roberto Waltman
2011-Jul-24 16:21 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Edward Ned Harvey wrote:
> So I'm getting comparisons of write speeds for 10G files, sampling at 100G
> intervals. For a 6x performance degradation, it would be 7 sec to write
> without dedup, and 40-45 sec to write with dedup.

For a totally unscientific data point:

The HW: Server - Supermicro "server" motherboard, Intel 920 CPU, 6 GB memory, 1 x 16 GB SSD as a boot device, 8 x 2TB "green" (5400 RPM?) hard drives. The disks are configured with 3 equal size partitions: all p1's in one raidz2 pool, all p2's in another, all p3's in another. (Done to improve performance by limiting head movement when most of the disk activity is in one pool.)

The SW: the last release of OpenSolaris. (Current at the time; I have since moved to Solaris 11.)

The test: back up an almost full 750Gb external hard disk formatted as a single NTFS volume. The disk was connected via eSATA to a fast computer (also a Supermicro + i920) running Ubuntu. The Ubuntu machine had access to the file server via NFS. The NFS-exported file system was created new for this backup, with dedup enabled, encryption and compression disabled, atime=off. This was the first (and last) time I tried enabling dedup.

From previous similar transfers (without dedup), I expected the backup to be finished in a few hours overnight, with the bottlenecks being the NTFS-3G driver in Ubuntu and the 100Mbit ethernet connection.

It took more than EIGHT DAYS, without any other activity going on on either machine.

(My) conclusion: Running low on storage? Get more/bigger disks.

--
Roberto Waltman
Nico Williams
2011-Jul-24 18:42 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Jul 9, 2011 1:56 PM, "Edward Ned Harvey" <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>
> Given the abysmal performance, I have to assume there is a significant
> number of "overhead" reads or writes in order to maintain the DDT for each
> "actual" block write operation. Something I didn't mention in the other
> email is that I also tracked iostat throughout the whole operation. It's
> all writes (or at least 99.9% writes). So I am forced to conclude it's a
> bunch of small DDT maintenance writes taking place and incurring access time
> penalties in addition to each intended single block access time penalty.
>
> The nature of the DDT is that it's a bunch of small blocks that tend to be
> scattered randomly and require maintenance in order to do anything else.
> This sounds like precisely the usage pattern that benefits from low latency
> devices such as SSDs.

The DDT should be written to in COW fashion, and asynchronously, so there should be no access time penalty. Or so ISTM it should be.

Dedup is necessarily slower for writing because of the deduplication table lookups. Those are synchronous lookups, but for async writes you'd think that total write throughput would only be affected by a) the additional read load (which is zero in your case) and b) any inability to put together large transactions due to the high latency of each logical write, but (b) shouldn't happen, particularly if the DDT fits in RAM or L2ARC, as it does in your case.

So, at first glance, my guess is that ZFS is leaving dedup write performance on the table, most likely due to implementation reasons, not design reasons.

Nico
--
Ian Collins
2011-Jul-25 00:38 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On 07/25/11 04:21 AM, Roberto Waltman wrote:
> Edward Ned Harvey wrote:
> > So I'm getting comparisons of write speeds for 10G files, sampling at 100G
> > intervals. For a 6x performance degradation, it would be 7 sec to write
> > without dedup, and 40-45 sec to write with dedup.
>
> For a totally unscientific data point:
> [...]
> It took more than EIGHT DAYS, without any other activity going on on either
> machine.
>
> (My) conclusion: Running low on storage? Get more/bigger disks.

Add to that: if running dedup, get plenty of RAM and cache.

I'm still seeing similar performance on my test system with and without dedup enabled. Snapshot deletion appears slightly slower, but I have yet to run timed tests.

--
Ian.
Edward Ned Harvey
2011-Jul-25 13:31 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ian Collins
>
> Add to that: if running dedup, get plenty of RAM and cache.

Add plenty of RAM. And tweak your arc_meta_limit. You can at least get dedup performance that's on the same order of magnitude as performance without dedup.

Cache devices don't really help dedup very much, because each DDT entry stored in ARC/L2ARC takes 376 bytes, and each reference to an L2ARC entry requires 176 bytes of ARC. So in order to prevent an individual DDT entry from being evicted to disk, you must either keep the 376 bytes in ARC, or evict it to L2ARC and keep 176 bytes. This is a very small payload. A good payload would be to evict a 128k block from ARC into L2ARC, keeping only the 176 bytes in ARC.
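To put numbers on that, here's the back-of-envelope math for a pool about the size of my test pool, using the per-entry sizes above and assuming 128K blocks:

   BLOCKS=$((14 * 1024 * 1024 * 1024 * 1024 / 131072))   # ~14T of 128K blocks ~= 117M entries
   echo "DDT held entirely in ARC:          $((BLOCKS * 376 / 1048576)) MB"
   echo "ARC headers if DDT lives in L2ARC: $((BLOCKS * 176 / 1048576)) MB"

That works out to roughly 42000 MB versus roughly 19700 MB; even with the whole DDT evicted to a cache device, you still need on the order of 20G of ARC just to point at it.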