Brad Diggs
2011-Dec-07 18:46 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Hello,

I have a hypothetical question regarding ZFS deduplication. Does the L1ARC cache benefit from deduplication in the sense that the L1ARC will only need to cache one copy of the deduplicated data versus many copies? Here is an example:

Imagine that I have a server with 2TB of RAM and a PB of disk storage. On this server I create a single 1TB data file that is full of unique data. Then I make 9 copies of that file, giving each file a unique name and location within the same ZFS zpool. If I start up 10 application instances where each application reads all of its own unique copy of the data, will the L1ARC contain only the deduplicated data, or will it cache separate copies of the data from each file? In simpler terms, will the L1ARC require 10TB of RAM or just 1TB of RAM to cache all 10 1TB files' worth of data?

My hope is that since the data only physically occupies 1TB of storage via deduplication, the L1ARC will also only require 1TB of RAM for the data.

Note that I know the deduplication table will use the L1ARC as well. However, the focus of my question is on how the L1ARC would benefit from a data caching standpoint.

Thanks in advance!

Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs
Mertol Ozyoney
2011-Dec-07 20:48 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Unfortunately the answer is no. Neither the L1 nor the L2 cache is dedup aware. The only vendor I know that can do this is NetApp.

In fact, most of our functions, like replication, are not dedup aware. However, we have the significant advantage that ZFS keeps checksums regardless of whether dedup is on or off. So in the future we can perhaps make functions more dedup friendly regardless of dedup being enabled or not. For example, technically it's possible to optimize our replication so that it does not send data chunks if a chunk with the same checksum already exists on the target, without enabling dedup on the target and source.

Best regards,
Mertol

Sent from a mobile device

Mertol Ozyoney

On 07 Ara 2011, at 20:46, Brad Diggs <Brad.Diggs at oracle.com> wrote:

> I have a hypothetical question regarding ZFS deduplication. Does the L1ARC cache benefit from
> deduplication in the sense that the L1ARC will only need to cache one copy of the deduplicated
> data versus many copies?
> [...]
> In simpler terms, will the L1ARC require 10TB of RAM or just 1TB of RAM to cache all 10 1TB
> files' worth of data?
Jim Klimov
2011-Dec-08 06:44 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
It was my understanding that both dedup and caching work at the block level. So if you have identical on-disk blocks (same original data past the same compression and encryption), they turn into one(*) on-disk block with several references from the DDT. And that one block is only cached once, saving ARC space.

* (Technically, for very often referenced blocks there is a small number of copies, controlled by the pool's "dedupditto" property.)

HTH,
//Jim Klimov
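For reference, a minimal sketch of how to inspect these settings on a live system (pool and dataset names are hypothetical):

    # Hypothetical pool/dataset names; shows the dedup-related knobs discussed above.
    zfs get dedup,compression,copies tank/data    # per-dataset settings
    zpool get dedupratio,dedupditto tank          # pool-wide dedup ratio and ditto threshold
    zpool status -D tank                          # DDT histogram: how many blocks are referenced how often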
Darren J Moffat
2011-Dec-08 11:39 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/07/11 20:48, Mertol Ozyoney wrote:
> Unfortunately the answer is no. Neither the L1 nor the L2 cache is dedup aware.
>
> The only vendor I know that can do this is NetApp.
>
> In fact, most of our functions, like replication, are not dedup aware.
>
> For example, technically it's possible to optimize our replication so that
> it does not send data chunks if a chunk with the same checksum
> exists on the target, without enabling dedup on the target and source.

We already do that with 'zfs send -D':

     -D

         Perform dedup processing on the stream. Deduplicated
         streams cannot be received on systems that do not
         support the stream deduplication feature.

--
Darren J Moffat
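As an illustration of the stream-level dedup Darren describes (snapshot, pool, and host names are hypothetical; only the -D flag itself is taken from the man page excerpt above):

    # Hypothetical names; send a deduplicated stream to another host.
    zfs snapshot tank/data@monday
    zfs send -D tank/data@monday | ssh backuphost zfs receive backup/data

    # An incremental deduplicated stream works the same way:
    zfs send -D -i tank/data@sunday tank/data@monday | ssh backuphost zfs receive backup/data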
Edward Ned Harvey
2011-Dec-08 13:00 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Mertol Ozyoney
> Sent: Wednesday, December 07, 2011 3:49 PM
>
> Unfortunately the answer is no. Neither the L1 nor the L2 cache is dedup aware.

I haven't read the code, but I can reference experimental results that seem to defy that statement...

If you time writing a large stream of completely duplicated data to disk without dedup, and then time reading it back, it takes about the same amount of time. But if you enable dedup and repeat the same test, it goes much faster. Depending on a lot of variables, it might be 2x-12x faster.

To me, "significantly faster than disk speed" can only mean it's benefiting from cache.
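A rough sketch of the kind of comparison described here, assuming a test pool named "tank"; the names, sizes, and the use of /dev/zero as the source of fully duplicated data are illustrative only:

    # Without dedup: write ~8GB of duplicated data, drop the cache, time the read back.
    zfs set dedup=off tank/test
    dd if=/dev/zero of=/tank/test/nodedup bs=1024k count=8192
    zpool export tank && zpool import tank
    time dd if=/tank/test/nodedup of=/dev/null bs=1024k

    # With dedup: same write and timed read; after the first block is read,
    # every other logical block maps to the same cached physical block.
    zfs set dedup=on tank/test
    dd if=/dev/zero of=/tank/test/dedup bs=1024k count=8192
    zpool export tank && zpool import tank
    time dd if=/tank/test/dedup of=/dev/null bs=1024k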
Ian Collins
2011-Dec-08 21:41 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/ 9/11 12:39 AM, Darren J Moffat wrote:
> We already do that with 'zfs send -D':
>
>      -D
>
>          Perform dedup processing on the stream. Deduplicated
>          streams cannot be received on systems that do not
>          support the stream deduplication feature.

Is there any more published information on how this feature works?

--
Ian.
Mark Musante
2011-Dec-08 22:22 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
You can see the original ARC case here:

http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt

On 8 Dec 2011, at 16:41, Ian Collins wrote:

> Is there any more published information on how this feature works?
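For those curious about the on-the-wire format, the stream itself can also be inspected without receiving it; a small sketch with a hypothetical snapshot name:

    # zstreamdump summarizes the records in a send stream; for a -D stream this
    # includes WRITE_BYREF records, which refer back to blocks already sent
    # instead of carrying the data again.
    zfs send -D tank/data@monday | zstreamdump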
Pawel Jakub Dawidek
2011-Dec-10 14:05 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
> Unfortunately the answer is no. Neither the L1 nor the L2 cache is dedup aware.
>
> The only vendor I know that can do this is NetApp.

And you really work at Oracle? :)

The answer is definitely yes. The ARC caches on-disk blocks, and dedup just references those blocks. When you read, the dedup code is not involved at all. Let me show it to you with a simple test.

Create a file (dedup is on):

    # dd if=/dev/random of=/foo/a bs=1m count=1024

Copy this file so that it is deduped:

    # dd if=/foo/a of=/foo/b bs=1m

Export the pool so all cache is removed and reimport it:

    # zpool export foo
    # zpool import foo

Now let's read one file:

    # dd if=/foo/a of=/dev/null bs=1m
    1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)

We read file 'a' and all its blocks are in cache now. The 'b' file shares all the same blocks, so if the ARC caches blocks only once, reading 'b' should be much faster:

    # dd if=/foo/b of=/dev/null bs=1m
    1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)

Now look at it, 'b' was read 12.5 times faster than 'a' with no disk activity. Magic? :)

--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://yomoli.com
Nathan Kroenert
2011-Dec-11 11:10 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:
> The answer is definitely yes. The ARC caches on-disk blocks, and dedup just
> references those blocks. When you read, the dedup code is not involved at all.
> [...]
>     # dd if=/foo/b of=/dev/null bs=1m
>     1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
>
> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
> activity. Magic? :)

Hey all,

That reminds me of something I have been wondering about... Why only 12x faster? If we are effectively reading from memory - as compared to a disk reading at approximately 100MB/s (which is about an average PC HDD reading sequentially) - I'd have thought it should be a lot faster than 12x.

Can we really only pull stuff from cache at a little over one gigabyte per second if it's dedup data?

Cheers!

Nathan.
Jim Klimov
2011-Dec-11 12:04 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
2011-12-11 15:10, Nathan Kroenert wrote:
> That reminds me of something I have been wondering about... Why only 12x
> faster? If we are effectively reading from memory - as compared to a
> disk reading at approximately 100MB/s - I'd have thought it should be a
> lot faster than 12x.
>
> Can we really only pull stuff from cache at a little over one
> gigabyte per second if it's dedup data?

I believe there's a couple of things in play.

One is that you'd rarely get 100MB/s from a single HDD due to fragmentation, especially the fragmentation inherent to ZFS. But you do mention "sequential reading", so that's covered. Besides, from Pawel's dd examples we see that he first read at 98 MByte/sec average, and then at 1233 MByte/sec.

Another aspect is the RAM bandwidth, and we don't know the specs of Pawel's test rig. For example, DDR2-400 (PC2-3200) peaks out at about 3200 MByte/sec. That bandwidth would include walking the (cached) DDT tree for each block involved, determining which (cached) data blocks correspond to it, and fetching them from RAM or disk.

I would not be surprised to see that there is some disk IO adding delays in the second case (the read of a deduped file "clone"), because you still have to determine the references to this second file's blocks, and another path of on-disk blocks might lead to it from a separate inode in a separate dataset (or I might be wrong). Reading this second path of pointers to the same cached data blocks might decrease speed a little.

It would be interesting to see Pawel's test updated with second reads of both files (now that data and metadata are all cached in RAM). It's possible that NOW reads would be closer to RAM speeds with no disk IO. And I would be very surprised if the speeds were noticeably different ;)

//Jim
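A minimal extension of Pawel's test along the lines suggested here, using the same hypothetical pool and file names as the earlier post:

    # Second pass: data and metadata for both /foo/a and /foo/b should now be in the ARC,
    # so both reads ought to run at roughly RAM speed with no disk I/O.
    dd if=/foo/a of=/dev/null bs=1m
    dd if=/foo/b of=/dev/null bs=1m
    # Running 'zpool iostat foo 1' in another terminal should confirm the absence of disk reads.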
Edward Ned Harvey
2011-Dec-11 16:23 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Nathan Kroenert
>
> That reminds me of something I have been wondering about... Why only 12x
> faster? If we are effectively reading from memory - as compared to a
> disk reading at approximately 100MB/s (which is about an average PC HDD
> reading sequentially) - I'd have thought it should be a lot faster than 12x.
>
> Can we really only pull stuff from cache at a little over one
> gigabyte per second if it's dedup data?

Actually, CPUs and memory aren't as fast as you might think. In a system with 12 disks, I've had to write my own "dd" replacement, because "dd if=/dev/zero bs=1024k" wasn't fast enough to keep the disks busy. Later, I wanted to do something similar using unique data, and it was simply impossible to generate random data fast enough. I had to tweak my "dd" replacement to write serial numbers, which still wasn't fast enough, so I had to tweak it again to write a big block of static data, followed by a serial number, followed by another big block (always smaller than the disk block, so each block would be treated as unique when hitting the pool...)

One typical disk sustains 1 Gbit/sec. In theory, 12 should be able to sustain 12 Gbit/sec. According to Jim's email, the memory bandwidth might be 25 Gbit/sec, of which you probably need to both read and write, thus making it effectively 12.5 Gbit/sec... I'm sure the actual bandwidth available varies by system and memory type.
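The actual replacement tool isn't shown in the thread; purely as an illustration of the "static filler plus serial number" idea (the output path and sizes are hypothetical, and a real tool would be compiled code for speed):

    # Each ~64KiB chunk is identical filler followed by a unique serial number, so every
    # 128K ZFS record contains at least one unique run of bytes and will not dedup, while
    # the data itself is almost free to generate.
    awk 'BEGIN {
      pad = "x";
      while (length(pad) < 65536) pad = pad pad;   # build ~64KiB of static filler once
      for (i = 0; i < 16384; i++)
        printf "%s%016d", pad, i;                  # filler + serial number (~1GB total)
    }' > /tank/test/unique-data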
Gary Driggs
2011-Dec-11 16:46 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
What kind of drives are we talking about? Even SATA drives are available by application type (desktop, enterprise server, home PVR, surveillance PVR, etc.). Then there are drives with SAS and Fibre Channel interfaces. Then you've got Winchester platters vs. SSDs vs. hybrids. But even before considering that and all the other system factors, throughput for direct-attached storage can vary greatly, not only with the interface type and storage technology; even small differences in on-drive controller firmware can introduce variances. That's why server manufacturers like HP, Dell, et al. prefer that you replace failed drives with one of theirs instead of something off the shelf: they usually have firmware that has been fine-tuned in house or in conjunction with the manufacturer.

On Dec 11, 2011, at 8:25 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> Actually, CPUs and memory aren't as fast as you might think. [...]
>
> One typical disk sustains 1 Gbit/sec. In theory, 12 should be able to sustain
> 12 Gbit/sec. According to Jim's email, the memory bandwidth might be 25 Gbit/sec,
> of which you probably need to both read and write, thus making it effectively
> 12.5 Gbit/sec... I'm sure the actual bandwidth available varies by system and
> memory type.
Mertol Ozyoney
2011-Dec-11 22:59 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Not exactly. What is dedup'ed is the stream only, which is in fact not very efficient. Real dedup-aware replication takes the necessary steps to avoid sending a block that already exists on the other storage system.

Mertol Özyöney | Storage Sales
Mobile: +90 533 931 0752
Email: mertol.ozyoney at oracle.com

On 12/8/11 1:39 PM, "Darren J Moffat" <Darren.Moffat at Oracle.COM> wrote:

> We already do that with 'zfs send -D':
>
>      -D
>
>          Perform dedup processing on the stream. Deduplicated
>          streams cannot be received on systems that do not
>          support the stream deduplication feature.
Mertol Ozyoney
2011-Dec-11 23:05 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
I am almost sure that things in the cache are still hydrated. There is an outstanding RFE for this, and while I am not certain, I think this feature will be implemented sooner or later. And in theory there will be little benefit, as most dedup'ed shares are used for archive purposes...

PS: NetApp's do have significantly bigger problems in the caching department, like having virtually no L1 cache. However, it's also my duty to know where they have an advantage.

Br,
Mertol

Mertol Özyöney | Storage Sales
Mobile: +90 533 931 0752
Email: mertol.ozyoney at oracle.com

On 12/10/11 4:05 PM, "Pawel Jakub Dawidek" <pjd at FreeBSD.org> wrote:

> The answer is definitely yes. The ARC caches on-disk blocks, and dedup just
> references those blocks. When you read, the dedup code is not involved at all.
> [...]
> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
> activity. Magic? :)
Pawel Jakub Dawidek
2011-Dec-12 15:03 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Sun, Dec 11, 2011 at 04:04:37PM +0400, Jim Klimov wrote:
> I would not be surprised to see that there is some disk IO
> adding delays in the second case (the read of a deduped file
> "clone"), because you still have to determine the references
> to this second file's blocks, and another path of on-disk
> blocks might lead to it from a separate inode in a separate
> dataset (or I might be wrong). Reading this second path of
> pointers to the same cached data blocks might decrease speed
> a little.

As I said, the ZFS read path involves no dedup code. None at all. One proof is being able to boot from ZFS with dedup turned on, even though the ZFS boot code has zero dedup code in it. Another proof is the ZFS source code.

--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://yomoli.com
Brad Diggs
2011-Dec-12 16:05 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Thanks everyone for your input on this thread. It sounds like there is sufficient weight behind the affirmative that I will include this methodology in my performance analysis test plan. If the performance goes well, I will share some of the results when we conclude in the January/February timeframe.

Regarding the great dd use case provided earlier in this thread: the L1 and L2 ARC detect streaming reads such as those from dd and prevent them from populating the cache. See my previous blog post at the link below for a way around this protective caching control of ZFS.

http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html

Thanks again!

Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 8, 2011, at 4:22 PM, Mark Musante wrote:

> You can see the original ARC case here:
>
> http://arc.opensolaris.org/caselog/PSARC/2009/557/20091013_lori.alt
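The blog post itself is not reproduced in the thread. One plausible way to sidestep sequential-stream detection, sketched here only as an assumption (the file name and record count are hypothetical, and this is not necessarily the approach the post describes), is to read the file's records in a shuffled order so the access pattern does not look like a single sequential stream:

    # Read 8192 x 128K records of a hypothetical file in pseudo-random order.
    seq 0 8191 | awk 'BEGIN { srand() } { print rand(), $0 }' | sort -n | cut -d' ' -f2 |
    while read blk; do
        dd if=/data/db/primeme of=/dev/null bs=128k skip="$blk" count=1 2>/dev/null
    done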
Jim Klimov
2011-Dec-12 16:30 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
2011-12-12 19:03, Pawel Jakub Dawidek wrote:
> On Sun, Dec 11, 2011 at 04:04:37PM +0400, Jim Klimov wrote:
>> I would not be surprised to see that there is some disk IO
>> adding delays in the second case (the read of a deduped file
>> "clone"), because you still have to determine the references
>> to this second file's blocks [...]
>
> As I said, the ZFS read path involves no dedup code. None at all.

I am not sure we contradicted each other ;)

What I meant was that the ZFS read path involves reading logical data blocks at some point, consulting the cache(s) to see whether the block is already cached (and up to date). These blocks are addressed by some unique ID, and now with dedup there are several pointers to the same block.

So, basically, reading a file involves reading ZFS metadata, determining data block IDs, and fetching them from disk or cache.

Indeed, this does not need to be dedup-aware; but if the other chain of metadata blocks points to the same data or metadata blocks which were already cached (for whatever reason, not limited to dedup) - this is where the read-speed boost appears. Likewise, if some blocks are not cached, such as the metadata needed to determine the second file's block IDs, this incurs disk IOs and may decrease overall speed.

That's why I proposed redoing the test with re-reads of both files - now all relevant data and metadata would be cached and we might see a bit faster read speed. Just for kicks ;)

//Jim
Richard Elling
2011-Dec-12 20:23 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:

> Not exactly. What is dedup'ed is the stream only, which is in fact not very
> efficient. Real dedup-aware replication takes the necessary steps to
> avoid sending a block that already exists on the other storage system.

These exist outside of ZFS (e.g. rsync) and scale poorly.

Given that dedup is done at the pool level and ZFS send/receive is done at the dataset level, how would you propose implementing a dedup-aware ZFS send command?

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
Erik Trimble
2011-Dec-13 04:04 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/12/2011 12:23 PM, Richard Elling wrote:
> On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
>> Not exactly. What is dedup'ed is the stream only, which is in fact not very
>> efficient. Real dedup-aware replication takes the necessary steps to
>> avoid sending a block that already exists on the other storage system.
>
> These exist outside of ZFS (e.g. rsync) and scale poorly.
>
> Given that dedup is done at the pool level and ZFS send/receive is done at
> the dataset level, how would you propose implementing a dedup-aware
> ZFS send command?
> -- richard

I'm with Richard. There is no practical, optimally efficient way to dedup a stream from one system to another. The only way to do so would be to have total information about the pool composition on BOTH the receiver and sender side. That would involve sending the checksums of the complete pool's blocks between the receiver and sender, which is a non-trivial overhead and, indeed, would usually be far worse than simply doing what 'zfs send -D' does now (dedup the sending stream itself).

The only possible way such a scheme would work would be if the receiver and sender were the same machine (note: not VMs or zones on the same machine, but the same OS instance, since you would need the DDT to be shared). And that's not a use case that 'zfs send' is generally optimized for - that is, while it's entirely possible, it's not the primary use case for 'zfs send'.

Given the overhead of network communications, there's no way that sending block checksums between hosts can ever be more efficient than just sending the self-deduped whole stream (except in pedantic cases). Let's look at possible implementations (all assume that the local sending machine does its own dedup - that is, the stream-to-be-sent is already deduped within itself):

(1) When constructing the stream, every time a block is read from a fileset (or volume), its checksum is sent to the receiving machine. The receiving machine then looks up that checksum in its DDT and sends back a "needed" or "not-needed" reply to the sender. While this lookup is being done, the sender must hold the original block in RAM and cannot write it out to the to-be-sent stream.

(2) The sending machine reads all the to-be-sent blocks, creates a stream, AND creates a checksum table (a mini-DDT, if you will). The sender communicates this mini-DDT to the receiver. The receiver diffs it against its own master pool DDT, and then sends back an edited mini-DDT containing only the checksums that match blocks which aren't on the receiver. The original sending machine must then go back and reconstruct the stream (either as a whole, or by parsing the stream as it is being sent) to leave out the unneeded blocks.

(3) Some combination of #1 and #2 where several checksums are stuffed into a packet and sent over the wire to be checked at the destination, with the receiver sending back only those to be included in the stream.

In the first scenario, you produce a huge amount of small-packet network traffic, which trashes network throughput, with no real expectation that the reduction in the send stream will be worth it. In the second case, you induce a huge amount of latency into the construction of the sending stream - that is, the sender has to wait around and then spend a non-trivial amount of processing power essentially double-processing the send stream, when, in the current implementation, it just sends out stuff as soon as it gets it.

The third scenario is only an optimization of #1 and #2, and doesn't avoid the pitfalls of either. That is, even if ZFS did pool-level sends, you're still trapped by the need to share the DDT, which induces an overhead that can't reasonably be made up versus simply sending an internally-deduped source stream in the first place. I'm sure I can construct an instance where such DDT sharing would be better than the current 'zfs send' implementation; I'm just as sure that such an instance would be a small minority of usage, and that requiring such an implementation would radically alter the "typical" use case's performance for the worse.

In any case, as 'zfs send' works on filesets and volumes, and ZFS maintains DDT information at the pool level, there's no way to share an existing whole DDT between two systems (and, given the potential size of a pool-level DDT, that's a bad idea anyway). I see no ability to optimize the 'zfs send/receive' concept beyond what is currently done.

-Erik
Pawel Jakub Dawidek
2011-Dec-13 08:00 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Mon, Dec 12, 2011 at 08:30:56PM +0400, Jim Klimov wrote:
> 2011-12-12 19:03, Pawel Jakub Dawidek wrote:
>> As I said, the ZFS read path involves no dedup code. None at all.
>
> I am not sure we contradicted each other ;)
>
> What I meant was that the ZFS read path involves reading
> logical data blocks at some point, consulting the cache(s)
> to see whether the block is already cached (and up to date).
> [...]
> Likewise, if some blocks are not cached, such as the metadata
> needed to determine the second file's block IDs, this incurs
> disk IOs and may decrease overall speed.

OK, you are right, although in this test I believe the metadata of the other file was already prefetched. I'm using this box for something else now, so I can't retest, but the procedure is so easy that everyone is welcome to try it :)

--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://yomoli.com
Nico Williams
2011-Dec-13 12:30 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Dec 11, 2011 5:12 AM, "Nathan Kroenert" <nathan at tuneunix.com> wrote:

> On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:
>> The answer is definitely yes. The ARC caches on-disk blocks, and dedup just
>> references those blocks. When you read, the dedup code is not involved at all.
>> [...]
>> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
>> activity. Magic? :)
>
> That reminds me of something I have been wondering about... Why only 12x
> faster? If we are effectively reading from memory - as compared to a disk
> reading at approximately 100MB/s (which is about an average PC HDD reading
> sequentially) - I'd have thought it should be a lot faster than 12x.
>
> Can we really only pull stuff from cache at a little over one gigabyte per
> second if it's dedup data?

The second file may have the same data, but not the same metadata - the inode number at least must be different - so the znode for it must be read in, and that will slow reading the copy down a bit.

Nico
--
Robert Milkowski
2011-Dec-16 16:13 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Pawel Jakub Dawidek
> Sent: 10 December 2011 14:05
>
> The answer is definitely yes. The ARC caches on-disk blocks, and dedup just
> references those blocks. When you read, the dedup code is not involved at all.
> [...]
>     # dd if=/foo/b of=/dev/null bs=1m
>     1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
>
> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk activity.
> Magic? :)

Yep. However, in pre-Solaris 11 GA (and in Illumos) you would end up with two copies of the blocks in the ARC cache, while in S11 GA the ARC will keep only one copy of all blocks. This can make a big difference if more than just two files are being dedupped and you need ARC memory to cache other data as well.

--
Robert Milkowski
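One way to see the behaviour Robert describes is to watch the ARC size while re-reading the deduped copy (the pool and file names are from Pawel's example; the kstat is Solaris/Illumos-specific):

    # If the ARC stores each deduped block only once, re-reading the second copy
    # should leave the reported ARC size essentially unchanged.
    kstat -p zfs:0:arcstats:size
    dd if=/foo/b of=/dev/null bs=1024k
    kstat -p zfs:0:arcstats:size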
Brad Diggs
2011-Dec-28 21:14 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
As promised, here are the findings from my testing. I created 6 directory server instances where the first instance has roughly 8.5GB of data. Then I initialized the remaining 5 instances from a binary backup of the first instance. Then I rebooted the server to start off with an empty ZFS cache. A table (attached to the original post as an image) shows the increased L1ARC size, increased search rate performance, and increased CPU% busy when starting and applying load to each successive directory server instance. The L1ARC cache grew a little bit with each additional instance but largely stayed the same size. Likewise, the ZFS dedup ratio remained the same because no data on the directory server instances was changing.

However, once I started modifying the data of the replicated directory server topology, the caching efficiency quickly diminished. A second table (also attached as an image) shows that the delta for each instance increased by roughly 2GB after only 300k changes.

I suspect the divergence in data as seen by ZFS deduplication most likely occurs because deduplication occurs at the block level rather than at the byte level. When a write is sent to one directory server instance, the exact same write is propagated to the other 5 instances and therefore should be considered a duplicate. However, this was not the case. There could be other reasons for the divergence as well.

The two key takeaways from this exercise were as follows. There is tremendous caching potential through the use of ZFS deduplication. However, the current block-level deduplication does not benefit directory as much as it perhaps could if deduplication occurred at the byte level rather than the block level. It could very well be that byte-level deduplication wouldn't work much better either. Until that option is available, we won't know for sure.

Regards,

Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:

> Thanks everyone for your input on this thread. It sounds like there is sufficient weight
> behind the affirmative that I will include this methodology in my performance analysis
> test plan. [...]
>
> Regarding the great dd use case provided earlier in this thread: the L1 and L2 ARC
> detect streaming reads such as those from dd and prevent them from populating the
> cache. See my previous blog post at the link below for a way around this protective
> caching control of ZFS.
>
> http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html
Nico Williams
2011-Dec-28 22:01 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Wed, Dec 28, 2011 at 3:14 PM, Brad Diggs <brad.diggs at oracle.com> wrote:
> The two key takeaways from this exercise were as follows. There is tremendous caching potential
> through the use of ZFS deduplication. However, the current block-level deduplication does not
> benefit directory as much as it perhaps could if deduplication occurred at the byte level rather
> than the block level. It could very well be that byte-level deduplication wouldn't work much
> better either. Until that option is available, we won't know for sure.

How would byte-level dedup even work?

My best idea would be to apply the rsync algorithm and then start searching for little chunks of data with matching rsync CRCs, rolling the rsync CRC over the data until a match is found for some chunk (which then has to be read and compared), and so on. The result would be incredibly slow on write and would have huge storage overhead. On the read side you could have many more I/Os too, so reads would get much slower as well.

I suspect any other byte-level dedup solutions would be similarly lousy. There'd be no real savings to be had, making the idea not worthwhile.

Dedup is for very specific use cases. If your use case doesn't benefit from block-level dedup, then don't bother with dedup. (The same applies to compression, but compression is much more likely to be useful in general, which is why it should generally be on.)

Nico
--
Jim Klimov
2011-Dec-29 10:45 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Thanks for running and publishing the tests :)

A comment on your testing technique follows, though.

2011-12-29 1:14, Brad Diggs wrote:
> As promised, here are the findings from my testing. I created 6
> directory server instances ...
>
> However, once I started modifying the data of the replicated directory
> server topology, the caching efficiency quickly diminished. [...]
>
> I suspect the divergence in data as seen by ZFS deduplication most
> likely occurs because deduplication occurs at the block level rather
> than at the byte level. When a write is sent to one directory server
> instance, the exact same write is propagated to the other 5 instances
> and therefore should be considered a duplicate. However, this was not
> the case. There could be other reasons for the divergence as well.

Hello Brad,

If you tested with Sun DSEE (and I have no reason to believe other descendants of the iPlanet Directory Server would work differently under the hood), then there are two factors hindering your block-dedup gains:

1) The data is stored in the backend BerkeleyDB binary file. In Sun DSEE7 and/or in ZFS this could also be compressed data. Since ZFS dedups only identical blocks, meaning the same data at the same offsets within a block, it is quite unlikely you'd get identical blocks often enough. For example, each database might position the same userdata at different offsets due to garbage collection or whatever other optimisation the DB might apply, making the on-disk blocks different and undedupable.

You might look into whether it is possible to tune the database to write in sector-sized to min-block-sized (512b/4096b) records and consistently use the same DSEE compression (or lack thereof) - in this case you might get more identical blocks and win with dedup. But you'll likely lose with compression, especially of the empty sparse structure which a database initially is.

2) During replication each database actually becomes unique. There are hidden records with an "ns" prefix which mark when the record was created and replicated, who initiated it, etc. Timestamps in the data already guarantee uniqueness ;)

This might be an RFE for the DSEE team though - to keep such volatile metadata separately from userdata. Then your DS instances would more likely dedup well after replication, and unique metadata would be stored separately and stay unique. You might even keep it in a different dataset with no dedup, then... :)

---

So, at the moment, this expectation does not hold true: "When a write is sent to one directory server instance, the exact same write is propagated to the other five instances and therefore should be considered a duplicate." These writes are not exact.

HTH,
//Jim Klimov
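A sketch of the tuning suggested here (and echoed by Robert later in the thread), with hypothetical dataset names; the recordsize has to be set before the database files are created, since it only affects newly written files:

    # Match the ZFS recordsize to the database page size so identical pages can
    # land in identically sized, alignable on-disk blocks.
    zfs create -o recordsize=8k -o dedup=on tank/dsee/instance1
    zfs get recordsize,dedup,compression tank/dsee/instance1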
Robert Milkowski
2011-Dec-29 14:08 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Try reducing recordsize to 8K or even less *before* you put any data. This can potentially improve your dedup ratio and keep it higher after you start modifying data.

From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

> As promised, here are the findings from my testing. I created 6 directory server instances
> where the first instance has roughly 8.5GB of data. [...]
>
> I suspect the divergence in data as seen by ZFS deduplication most likely occurs because
> deduplication occurs at the block level rather than at the byte level. When a write is sent
> to one directory server instance, the exact same write is propagated to the other 5 instances
> and therefore should be considered a duplicate. However, this was not the case.
Robert Milkowski
2011-Dec-29 14:11 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
And these results are from S11 FCS I assume. On older builds or Illumos based distros I would expect L1 arc to grow much bigger. From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Brad Diggs Sent: 28 December 2011 21:15 To: zfs-discuss discussion list Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup As promised, here are the findings from my testing. I created 6 directory server instances where the first instance has roughly 8.5GB of data. Then I initialized the remaining 5 instances from a binary backup of the first instance. Then, I rebooted the server to start off with an empty ZFS cache. The following table shows the increased L1ARC size, increased search rate performance, and increase CPU% busy with each starting and applying load to each successive directory server instance. The L1ARC cache grew a little bit with each additional instance but largely stayed the same size. Likewise, the ZFS dedup ratio remained the same because no data on the directory server instances was changing. However, once I started modifying the data of the replicated directory server topology, the caching efficiency quickly diminished. The following table shows that the delta for each instance increased by roughly 2GB after only 300k of changes. I suspect the divergence in data as seen by ZFS deduplication most likely occurs because reduplication occurs at the block level rather than at the byte level. When a write is sent to one directory server instance, the exact same write is propagated to the other 5 instances and therefore should be considered a duplicate. However this was not the case. There could be other reasons for the divergence as well. The two key takeaways from this exercise were as follows. There is tremendous caching potential through the use of ZFS deduplication. However, the current block level deduplication does not benefit directory as much as it perhaps could if deduplication occurred at the byte level rather than the block level. It very could be that even byte level deduplication doesn''t work as well either. Until that option is available, we won''t know for sure. Regards, Brad Brad Diggs | Principal Sales Consultant Tech Blog: <http://TheZoneManager.com/> http://TheZoneManager.com LinkedIn: http://www.linkedin.com/in/braddiggs On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote: Thanks everyone for your input on this thread. It sounds like there is sufficient weight behind the affirmative that I will include this methodology into my performance analysis test plan. If the performance goes well, I will share some of the results when we conclude in January/February timeframe. Regarding the great dd use case provided earlier in this thread, the L1 and L2 ARC detect and prevent streaming reads such as from dd from populating the cache. See my previous blog post at the web site link below for a way around this protective caching control of ZFS. http://www.thezonemanager.com/2010/02/directory-data-priming-strategies.html Thanks again! 
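[For anyone reproducing this kind of measurement, a minimal sketch of how the ARC
size and dedup ratio can be sampled between test runs on Solaris 11 / illumos; the
pool name "tank" is made up for illustration:]

    # current ARC size in bytes, and the configured maximum
    kstat -p zfs:0:arcstats:size
    kstat -p zfs:0:arcstats:c_max

    # pool-wide dedup ratio as reported by ZFS
    zpool get dedupratio tank

    # on-disk DDT histogram (how many blocks are referenced 1x, 2x, 4x, ...)
    zdb -DD tank
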
Brad Diggs
2011-Dec-29 15:53 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Jim,

You are spot on. I was hoping that the writes would be close enough to identical
that there would be a high ratio of duplicate data, since I use the same record
size, page size, compression algorithm, etc. However, that was not the case. The
main thing that I wanted to prove, though, was that if the data is the same, the
L1 ARC only caches the data that was actually written to storage. That is a really
cool thing! I am sure there will be future study on this topic as it applies to
other scenarios.

With regards to directory engineering investing any energy into optimizing ODSEE DS
to more effectively leverage this caching potential, that won't happen. OUD far
outperforms ODSEE. That said, OUD may get some focus in this area; time will tell
on that one. For now, I hope everyone benefits from the little that I did validate.

Have a great day!

Brad

Brad Diggs | Principal Sales Consultant
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 4:45 AM, Jim Klimov wrote:

> Thanks for running and publishing the tests :)
>
> A comment on your testing technique follows, though.
>
> 2011-12-29 1:14, Brad Diggs wrote:
>> As promised, here are the findings from my testing. I created 6 directory server
>> instances ...
>>
>> However, once I started modifying the data of the replicated directory server
>> topology, the caching efficiency quickly diminished. The following table shows
>> that the delta for each instance increased by roughly 2GB after only 300k of
>> changes.
>>
>> I suspect the divergence in data as seen by ZFS deduplication most likely occurs
>> because deduplication occurs at the block level rather than at the byte level.
>> When a write is sent to one directory server instance, the exact same write is
>> propagated to the other 5 instances and therefore should be considered a
>> duplicate. However, this was not the case. There could be other reasons for the
>> divergence as well.
>
> Hello, Brad,
>
> If you tested with Sun DSEE (and I have no reason to believe other descendants of
> iPlanet Directory Server would work differently under the hood), then there are
> two factors hindering your block-dedup gains:
>
> 1) The data is stored in the backend BerkeleyDB binary file. In Sun DSEE7 and/or
> in ZFS this could also be compressed data. Since ZFS dedups whole blocks - the
> same data at the same offsets within identical blocks - it is quite unlikely you'd
> get the same blocks often enough. For example, each database might position the
> same userdata blocks at different offsets due to garbage collection or whatever
> other optimisation the DB might think of, making the on-disk blocks different and
> undedupable.
>
> You might look at whether it is possible to tune the database to write in
> sector-sized to minimum-block-sized (512b/4096b) records and consistently use the
> same DSEE compression (or lack thereof) - in that case you might get more
> identical blocks and win with dedup. But you'll likely lose with compression,
> especially of the empty sparse structure which a database initially is.
>
> 2) During replication each database actually becomes unique. There are hidden
> records with an "ns" prefix which mark when the record was created and replicated,
> who initiated it, etc. Timestamps in the data already warrant uniqueness ;)
>
> This might be an RFE for the DSEE team though - to keep
> such volatile metadata separately from userdata.
> Then your DS instances would more likely dedup well after replication, and the
> unique metadata would be stored separately and stay unique. You might even keep
> it in a different dataset with no dedup, then... :)
>
> ---
>
> So, at the moment, this expectation does not hold true: "When a write is sent to
> one directory server instance, the exact same write is propagated to the other
> five instances and therefore should be considered a duplicate." These writes are
> not exact.
>
> HTH,
> //Jim Klimov
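[A quick way to test Jim's point empirically - i.e., how many of the on-disk blocks
are actually identical - is to let zdb simulate dedup over the existing data. A
minimal sketch; the pool name "tank" is made up, and zdb -S reads every block in the
pool, so expect it to take a while:]

    # build a simulated DDT over the pool's existing blocks and print a histogram
    # plus an estimated dedup ratio, without enabling the dedup property
    zdb -S tank
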
Brad Diggs
2011-Dec-29 15:53 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
S11 FCS.

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: Brad.Diggs at Oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:11 AM, Robert Milkowski wrote:

> And these results are from S11 FCS, I assume. On older builds or Illumos-based
> distros I would expect the L1 ARC to grow much bigger.
Brad Diggs
2011-Dec-29 15:54 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Reducing the record size would negatively impact performance. For the rationale,
see the section titled "Match Average I/O Block Sizes" in my blog post on
filesystem caching:

http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698
eMail: Brad.Diggs at Oracle.com
Tech Blog: http://TheZoneManager.com
LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:

> Try reducing recordsize to 8K or even less *before* you put any data. This can
> potentially improve your dedup ratio and keep it higher after you start modifying
> data.
Robert Milkowski
2011-Dec-29 17:19 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
Citing yourself:

"The average block size for a given data block should be used as the metric to map
all other data block sizes to. For example, the ZFS recordsize is 128kb by default.
If the average block (or page) size of a directory server is 2k, then the mismatch
in size will result in degraded throughput for both read and write operations. One
of the benefits of ZFS is that you can change the recordsize of all write operations
from the time you set the new value going forward."

And the above is not entirely correct: if a file is bigger than the current value of
the recordsize property, reducing recordsize won't change the block size for that
file (it will continue to use the previous size, for example 128K). This is why you
need to set recordsize to the desired value *before* you create large files (or you
will have to copy them later on).

From the performance point of view it really depends on the workload, but as you
described in your blog, the default recordsize of 128K with an average write/read of
2K will negatively impact performance for many workloads, and lowering recordsize
can potentially improve it.

Nevertheless, I was referring to dedup efficiency: lower recordsize values should
improve dedup ratios (although they will require more memory for the DDT).

From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Brad Diggs
Sent: 29 December 2011 15:55
To: Robert Milkowski
Cc: 'zfs-discuss discussion list'
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

> Reducing the record size would negatively impact performance. For the rationale,
> see the section titled "Match Average I/O Block Sizes" in my blog post on
> filesystem caching:
>
> http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html
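[A small sketch of the sequence Robert describes; pool and dataset names are made
up. The key point is that recordsize has to be in place before the large database
files are created:]

    # create the dataset with the desired recordsize before loading any data
    zfs create -o recordsize=8k tank/instance1

    # confirm the setting
    zfs get recordsize tank/instance1

    # lowering recordsize later affects only newly written files; existing large
    # files keep their original block size until they are rewritten (e.g. copied)
    zfs set recordsize=8k tank/instance1
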
Nico Williams
2011-Dec-29 17:40 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs <brad.diggs at oracle.com> wrote:
> You are spot on. I was hoping that the writes would be close enough to identical
> that there would be a high ratio of duplicate data since I use the same record
> size, page size, compression algorithm, etc. However, that was not the case.

Databases are not as likely to benefit from dedup as virtual machines; indeed, DBs
are likely not to benefit at all from dedup.

The VM use case benefits from dedup for the obvious reason that many VMs will have
the exact same software installed most of the time, using the same filesystems and
the same patch/update installation order, so if you keep data out of their root
filesystems then you can expect enormous deduplicatiousness. But databases, not so
much. The unit of deduplicable data in the VM use case is the guest's preferred
block size, while in a DB the unit of deduplicable data might be a variable-sized
table row, or even smaller: a single row/column value -- and you have no way to
ensure alignment of individual deduplicable units, nor ordering of sets of
deduplicable units into larger ones.

When it comes to databases your best bets will be: a) database-level compression or
dedup features (e.g., Oracle's column-level compression feature), or b) ZFS
compression.

(Dedup makes VM management easier, because the alternative is to patch one master
guest VM [per guest type], then re-clone and re-configure all instances of that
guest type, in the process possibly losing any customizations in those guests. But
even before dedup, the ability to snapshot and clone datasets was an impressive
dedup-like tool for the VM use case, just not as convenient as dedup.)

Nico
--
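[As an aside, a minimal sketch of the snapshot/clone approach Nico refers to for the
VM use case; the pool, dataset, and snapshot names are invented:]

    # keep a patched "golden" guest image in its own dataset and snapshot it
    zfs snapshot tank/vm/golden@patched-2011-12

    # each guest is a clone of that snapshot; unchanged blocks are shared on disk
    # (and in the ARC) until a guest modifies them
    zfs clone tank/vm/golden@patched-2011-12 tank/vm/guest01
    zfs clone tank/vm/golden@patched-2011-12 tank/vm/guest02
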
Matthew Ahrens
2011-Dec-30 00:44 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble <trims at netdemons.com> wrote:
> On 12/12/2011 12:23 PM, Richard Elling wrote:
>> On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
>>> Not exactly. What is dedup'ed is the stream only, which is in fact not very
>>> efficient. Real dedup-aware replication takes the necessary steps to avoid
>>> sending a block that already exists on the other storage system.

As with all dedup-related performance, the efficiency of various methods of
implementing "zfs send -D" will vary widely, depending on the dedup-ability of the
data and on what is being sent. However, sending no blocks that already exist on
the target system does seem like a good goal, since it addresses some use cases
that intra-stream dedup does not.

> (1) when constructing the stream, every time a block is read from a fileset (or
> volume), its checksum is sent to the receiving machine. The receiving machine then
> looks up that checksum in its DDT, and sends back a "needed" or "not-needed" reply
> to the sender. While this lookup is being done, the sender must hold the original
> block in RAM, and cannot write it out to the to-be-sent stream.
...
> you produce a huge amount of small network packet traffic, which trashes network
> throughput

This seems like a valid approach to me. When constructing the stream, the sender
need not read the actual data, just the checksum in the indirect block, so there is
nothing that the sender "must hold in RAM". There is no need to create small (or
synchronous) network packets, because the sender need not wait for the receiver to
determine whether it needs the block or not. There can be multiple asynchronous
communication streams: one where the sender sends all the checksums to the
receiver; another where the receiver requests blocks that it does not have from the
sender; and another where the sender sends the requested blocks back to the
receiver. Implementing this may not be trivial, and in some cases it will not
improve on the current implementation. But in others it would be a considerable
improvement.

--matt
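[For anyone curious how much the current intra-stream dedup actually helps for a
given dataset, one rough check is to inspect a dedup'ed stream locally instead of
receiving it. This is a sketch, assuming a build that ships zstreamdump; the pool,
dataset, and snapshot names are made up:]

    # generate a deduplicated stream and summarize its records; WRITE_BYREF records
    # are blocks the stream replaced with references to earlier identical blocks
    zfs send -D tank/ldap@snap1 | zstreamdump
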
Nico Williams
2011-Dec-30 01:11 UTC
[zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens <mahrens at delphix.com> wrote:
> There can be multiple asynchronous communication streams: one where the sender
> sends all the checksums to the receiver; another where the receiver requests
> blocks that it does not have from the sender; and another where the sender sends
> the requested blocks back to the receiver. Implementing this may not be trivial,
> and in some cases it will not improve on the current implementation. But in
> others it would be a considerable improvement.

Right, you'd want to let the socket/transport buffer and flow-control the writes of
"I have this new block checksum" messages from the zfs sender and the "I need the
block with this checksum" messages from the zfs receiver. I like this.

A separate channel for bulk data definitely comes recommended for flow-control
reasons, but if you do that then securing the transport gets complicated: you
couldn't just zfs send ... | ssh ... zfs receive. You could use SSH channel
multiplexing, but that will net you lousy performance (well, no lousier than one
already gets with SSH anyway)[*]. (And SunSSH lacks this feature anyway.) It would
then begin to pay to have a bona fide zfs send network protocol, and now we're
talking about significantly more work.

Another option would be to have send/receive options to create the two separate
channels, so one would do something like:

% zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs receive --dedup-control-channel ... &
% zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive --dedup-bulk-channel
% wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it has flow-controlled channels layered over a
flow-controlled, congestion-controlled channel (TCP), and there's not enough
information flowing from TCP up to SSHv2 to make this work well. Also, SSHv2
channel windows cannot shrink except by the sender consuming them, which makes it
impossible to mix high-bandwidth bulk data and small control data over a congested
link. This means that in practice SSHv2 channels have to have relatively small
windows, which then forces the protocol to work very synchronously (i.e., with
effectively synchronous ACKs of bulk data). I now believe the idea of mixing bulk
and non-bulk data over a single TCP connection in SSHv2 is a failure. SSHv2 over
SCTP, or over multiple TCP connections, would be much better.

Nico
--