Whenever somebody asks, "How much memory do I need to dedup an X-terabyte filesystem?", the standard answer is "as much as you can afford to buy." This is true and correct, but I don't believe it's the best we can do, because "as much as you can buy" is a true assessment for memory in *any* situation. To improve knowledge in this area, I think the question just needs to be asked differently: "How much *extra* memory is required for X terabytes, with dedup enabled versus disabled?" I hope somebody knows more about this than me.

I expect the answer will be something like this: The default ZFS block size is 128K. If you have a filesystem with 128G used, that means you are consuming 1,048,576 blocks, each of which must be checksummed. ZFS uses adler32 and sha256, which means 4 bytes and 32 bytes ... 36 bytes * 1M blocks = an extra 36 Mbytes and some fluff consumed by enabling dedup.

I suspect my numbers are off, because 36 Mbytes seems impossibly small. But I hope some sort of similar (and more correct) logic will apply. ;-)
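A quick back-of-the-envelope sketch of the arithmetic above, in Python. This is a hypothetical illustration only: the 36-bytes-per-block figure is the guess being questioned, and the replies that follow put the real per-entry cost roughly an order of magnitude higher.

KIB = 1024
GIB = 1024 ** 3

def naive_ddt_bytes(used_bytes, block_size=128 * KIB, bytes_per_block=36):
    # 36 bytes/block = 4 (adler32-style) + 32 (sha256) checksum bytes,
    # i.e. the assumption made in the message above.
    blocks = used_bytes // block_size
    return blocks * bytes_per_block

print(naive_ddt_bytes(128 * GIB) / (1024 ** 2), "MiB")  # -> 36.0 MiB for 128G at 128K blocks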
On Fri, Jul 9, 2010 at 5:00 PM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
> The default ZFS block size is 128K. If you have a filesystem with 128G
> used, that means you are consuming 1,048,576 blocks, each of which must be
> checksummed. ZFS uses adler32 and sha256, which means 4 bytes and 32 bytes
> ... 36 bytes * 1M blocks = an extra 36 Mbytes and some fluff consumed by
> enabling dedup.
>
> I suspect my numbers are off, because 36 Mbytes seems impossibly small. But
> I hope some sort of similar (and more correct) logic will apply. ;-)

I think that DDT entries are a little bigger than what you're using. The size seems to range between 150 and 250 bytes depending on how it's calculated, so call it 200 bytes each. Your 128G dataset would require closer to 200M (+/- 25%) for the DDT if your data was completely unique. 1TB of unique data would require 600M - 1000M for the DDT.

The numbers are fuzzy of course, and assume only 128K blocks. Lots of small files will increase the memory cost of dedup, and using it on a zvol that has the default block size (8K) would require 16 times the memory.

-B

--
Brandon High : bhigh at freaks.com
On 7/9/2010 5:18 PM, Brandon High wrote:
> I think that DDT entries are a little bigger than what you're using.
> The size seems to range between 150 and 250 bytes depending on how it's
> calculated, so call it 200 bytes each. Your 128G dataset would require
> closer to 200M (+/- 25%) for the DDT if your data was completely unique.
> 1TB of unique data would require 600M - 1000M for the DDT.
>
> The numbers are fuzzy of course, and assume only 128K blocks. Lots of
> small files will increase the memory cost of dedup, and using it on a
> zvol that has the default block size (8K) would require 16 times the
> memory.

Go back and read several threads from last month about ZFS/L2ARC memory usage for dedup. In particular, I've been quite specific about how to calculate estimated DDT size. Richard has also been quite good at giving size estimates (as well as explaining how to see current block size usage in a filesystem).

The structure in question is this one: ddt_entry

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108

I'd have to fire up an IDE to track down the sizes of all the ddt_entry structure's members, but I feel comfortable using Richard's 270 bytes-per-entry estimate.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On 07/09/10 19:40, Erik Trimble wrote:
> I'd have to fire up an IDE to track down the sizes of all the ddt_entry
> structure's members, but I feel comfortable using Richard's 270
> bytes-per-entry estimate.

It must have grown a bit, because on 64-bit x86 a ddt_entry is currently 0x178 = 376 bytes:

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs sata sd ip hook neti sockfs arp usba fctl random cpc fcip nfs lofs ufs logindmux ptm sppp ipc ]
> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178
On Fri, Jul 9, 2010 at 5:18 PM, Brandon High <bhigh at freaks.com> wrote:
> I think that DDT entries are a little bigger than what you're using. The
> size seems to range between 150 and 250 bytes depending on how it's
> calculated, so call it 200 bytes each. Your 128G dataset would require
> closer to 200M (+/- 25%) for the DDT if your data was completely unique.
> 1TB of unique data would require 600M - 1000M for the DDT.

Using 376 bytes per entry, it's 376M for 128G of unique data, or just under 3GB for 1TB of unique data.

A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the DDT. Ouch.

-B

--
Brandon High : bhigh at freaks.com
On Jul 9, 2010, at 11:10 PM, Brandon High wrote:
> Using 376 bytes per entry, it's 376M for 128G of unique data, or just
> under 3GB for 1TB of unique data.

4% seems to be a pretty good SWAG.

> A 1TB zvol with 8k blocks would require almost 24GB of memory to hold
> the DDT. Ouch.

... or more than 300GB for 512-byte records.

The performance issue is that DDT access tends to be random. This implies that if you don't have a lot of RAM and your pool has poor random read I/O performance, then you will not be impressed with dedup performance. In other words, trying to dedup lots of data on a small DRAM machine using big, slow pool HDDs will not set any benchmark records. By contrast, using SSDs for the pool can demonstrate good random read performance. As the price per bit of HDDs continues to drop, the value of deduping pools using HDDs also drops.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
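For comparison, the same estimate redone with the 376-bytes-per-entry figure from the mdb output above, assuming every block is unique (the worst case for dedup). Note that at 376 bytes per entry, a 1TB zvol of 8K blocks works out closer to 48GB than the 24GB quoted above, and the per-entry overhead only approaches the 4% figure for small (roughly 8K) blocks.

GIB = 1024 ** 3
TIB = 1024 ** 4
ENTRY_BYTES = 376  # sizeof(struct ddt_entry) per the mdb output above

def ddt_estimate(used_bytes, block_size):
    # Assumes fully unique data: one DDT entry per block.
    return (used_bytes // block_size) * ENTRY_BYTES

for used, bs, label in [
    (128 * GIB, 128 * 1024, "128G at 128K"),
    (1 * TIB,   128 * 1024, "1T   at 128K"),
    (1 * TIB,     8 * 1024, "1T   at 8K  "),
    (1 * TIB,          512, "1T   at 512B"),
]:
    size = ddt_estimate(used, bs)
    print(f"{label}: ~{size / GIB:6.1f} GiB DDT  ({ENTRY_BYTES / bs:.2%} of data size)")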
On 7/10/2010 5:24 AM, Richard Elling wrote:
> The performance issue is that DDT access tends to be random. This implies that
> if you don't have a lot of RAM and your pool has poor random read I/O performance,
> then you will not be impressed with dedup performance. In other words, trying to
> dedup lots of data on a small DRAM machine using big, slow pool HDDs will not set
> any benchmark records. By contrast, using SSDs for the pool can demonstrate good
> random read performance. As the price per bit of HDDs continues to drop, the value
> of deduping pools using HDDs also drops.
> -- richard

Which brings up an interesting idea: if I have a pool with good random I/O (perhaps made from SSDs, or even one of those nifty Oracle F5100 things), I would probably not want to have a DDT created, or at least have one that was very significantly abbreviated. What capability does ZFS have for recognizing that we won't need a full DDT created for high-I/O-speed pools? Particularly given that such pools would almost certainly be heavy candidates for dedup (the $/GB being significantly higher than for other media, and thus space being at a premium)?

I'm not up on exactly how the DDT gets built and referenced to understand how this might happen. But I can certainly see it as being useful to tell ZFS (perhaps through a pool property?) that building an in-ARC DDT isn't really needed.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Jul 10, 2010, at 5:33 AM, Erik Trimble wrote:
> Which brings up an interesting idea: if I have a pool with good random I/O
> (perhaps made from SSDs, or even one of those nifty Oracle F5100 things),
> I would probably not want to have a DDT created, or at least have one that
> was very significantly abbreviated. What capability does ZFS have for
> recognizing that we won't need a full DDT created for high-I/O-speed pools?
> Particularly given that such pools would almost certainly be heavy candidates
> for dedup (the $/GB being significantly higher than for other media, and
> thus space being at a premium)?

Methinks it is impossible to build a complete DDT, we'll run out of atoms... maybe if we can use strings? :-)

Think of it as a very, very sparse array. Otherwise it is managed just like other metadata.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
On Sat, Jul 10, 2010 at 5:33 AM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Which brings up an interesting idea: if I have a pool with good random I/O
> (perhaps made from SSDs, or even one of those nifty Oracle F5100 things),
> I would probably not want to have a DDT created, or at least have one that
> was very significantly abbreviated. What capability does ZFS have for
> recognizing that we won't need a full DDT created for high-I/O-speed pools?
> Particularly given that such pools would almost certainly be heavy candidates
> for dedup (the $/GB being significantly higher than for other media, and
> thus space being at a premium)?

I'm not exactly sure what problem you're trying to solve. Dedup is to save space, not accelerate i/o. While the DDT is pool-wide, only data that's added to datasets with dedup enabled will create entries in the DDT. If there's data that you don't want to dedup, then don't add it to a pool with dedup enabled.

> I'm not up on exactly how the DDT gets built and referenced to understand
> how this might happen. But I can certainly see it as being useful to tell
> ZFS (perhaps through a pool property?) that building an in-ARC DDT isn't
> really needed.

The DDT is in the pool, not in the ARC. Because it's frequently accessed, some / most of it will reside in the ARC.

-B

--
Brandon High : bhigh at freaks.com
> From: Richard Elling [mailto:richard at nexenta.com]
>
> 4% seems to be a pretty good SWAG.

Is the above "4%" wrong, or am I wrong?

Suppose 200 bytes to 400 bytes per 128 Kbyte block ...
200/131072 = 0.0015 = 0.15%
400/131072 = 0.003 = 0.3%
which would mean for 100G unique data = 153M to 312M ram.

Around 3G ram for 1TB unique data, assuming the default 128K block size.

Next question:

Correct me if I'm wrong: if you have a lot of duplicated data, then dedup increases the probability of an arc/ram cache hit. So dedup allows you to stretch your disk, and also stretch your ram cache. Which also benefits performance.
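One way to reconcile the two sets of numbers: the DDT overhead as a percentage of the data depends entirely on the block size, so the ~0.3% figure (128K blocks) and the ~4% SWAG (roughly 8K blocks) can both be reasonable. A small sketch, assuming 376 bytes per entry (smaller per-entry estimates scale linearly):

ENTRY_BYTES = 376  # assumed per-entry cost, from earlier in the thread

for bs_kib in (128, 64, 32, 16, 8, 4):
    block_size = bs_kib * 1024
    print(f"{bs_kib:>3}K blocks: DDT is ~{ENTRY_BYTES / block_size:.2%} of unique data")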
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Brandon High
>
> Dedup is to save space, not accelerate i/o.

I'm going to have to disagree with you there. Dedup is a type of compression. Compression can be used for storage savings, and/or acceleration. Fast and lightweight compression algorithms (lzop, v.42bis, v.44) are usually used in-line for acceleration, while compute-expensive algorithms (bzip2, lzma, gzip) are usually used for space savings and rarely for acceleration (except when transmitting data across a slow channel).

Most general-purpose lossless compression algorithms (and certainly most of the ones I just mentioned) achieve compression by reducing duplicated data. There are special-purpose lossless formats (flac etc) and lossy ones (jpg, mp3 etc) which use other techniques. But general-purpose compression might possibly even be exclusively algorithms for reduction of repeated data.

Unless I'm somehow mistaken, the performance benefit of dedup comes from the fact that it increases cache hits. Instead of having to read a thousand duplicate blocks from different sectors of disks, you read it once, and the other 999 have all been stored "same as" the original block, so it's 999 cache hits and unnecessary to read disk again.
> > 4% seems to be a pretty good SWAG.
>
> Is the above "4%" wrong, or am I wrong?
>
> Suppose 200 bytes to 400 bytes per 128 Kbyte block ...
> 200/131072 = 0.0015 = 0.15%
> 400/131072 = 0.003 = 0.3%
> which would mean for 100G unique data = 153M to 312M ram.
>
> Around 3G ram for 1TB unique data, assuming the default 128K block size.

Recordsize means maximum block size. Smaller files will be stored in smaller blocks. With lots of files of different sizes, the block size will generally be smaller than the recordsize set for ZFS.

> Next question:
>
> Correct me if I'm wrong: if you have a lot of duplicated data, then dedup
> increases the probability of an arc/ram cache hit. So dedup allows you to
> stretch your disk, and also stretch your ram cache. Which also benefits
> performance.

Theoretically, yes, but there will be an overhead in cpu/memory that can reduce this benefit to a penalty.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> > increases the probability of an arc/ram cache hit. So dedup allows you to
> > stretch your disk, and also stretch your ram cache. Which also benefits
> > performance.
>
> Theoretically, yes, but there will be an overhead in cpu/memory that
> can reduce this benefit to a penalty.

That's why a really fast compression algorithm is used in-line, in hopes that the time cost of compression is smaller than the performance gain of compression. Take for example v.42bis and v.44, which were used to accelerate 56K modems. (Probably still are, if you actually have a modem somewhere. ;-)

Nowadays we have faster communication channels; in fact, when talking about dedup we're talking about local disk speed, which is really fast. But we also have fast processors, and the algorithm in question can be really fast.

I recently benchmarked lzop, gzip, bzip2, and lzma for some important data on our fileserver that I would call "typical." No matter what I did, lzop was so ridiculously lightweight that I could never get lzop up to 100% cpu. Even reading data 100% from cache and filtering through lzop to /dev/null, the kernel overhead of reading ram cache was higher than the cpu overhead to compress.

For the data in question, lzop compressed to 70%, gzip compressed to 42%, bzip2 to 32%, and lzma something like 16%. bzip2 was the slowest (by a factor of 4). lzma -1 and gzip --fast were closely matched in speed but not compression. So the compression of lzop was really weak for the data in question, but it contributed no significant cpu overhead. The point is: it's absolutely possible to compress quickly, if you have a fast algorithm, and gain performance. I'm boldly assuming dedup performs this fast. It would be nice to actually measure and prove it.
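A rough sketch of that kind of comparison using only Python's bundled codecs, with zlib at its fastest level standing in for a very light compressor like lzop (which is not in the standard library). The synthetic sample data and resulting ratios are illustrative only; substitute real files for a meaningful test.

import bz2, lzma, time, zlib

def bench(name, fn, data):
    # Time one compression pass and report ratio and throughput.
    t0 = time.perf_counter()
    out = fn(data)
    dt = time.perf_counter() - t0
    print(f"{name:10s} ratio={len(out)/len(data):6.1%}  {len(data)/dt/1e6:8.1f} MB/s")

if __name__ == "__main__":
    # Mildly compressible sample data; real fileserver data will behave differently.
    data = (b"some fileserver-ish text with repetition " * 4000) + bytes(range(256)) * 500
    bench("zlib -1", lambda d: zlib.compress(d, 1), data)
    bench("zlib -9", lambda d: zlib.compress(d, 9), data)
    bench("bzip2", bz2.compress, data)
    bench("lzma", lzma.compress, data)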
Even the most expensive decompression algorithms generally run significantly faster than I/O to disk -- at least when real disks are involved. So, as long as you don't run out of CPU and have to wait for CPU to be available for decompression, the decompression will win. The same concept is true for dedup, although I don't necessarily think of dedup as a form of compression (others might reasonably do so though.)

 - Garrett
On 7/10/2010 10:14 AM, Brandon High wrote:
> I'm not exactly sure what problem you're trying to solve. Dedup is to
> save space, not accelerate i/o. While the DDT is pool-wide, only data
> that's added to datasets with dedup enabled will create entries in the
> DDT. If there's data that you don't want to dedup, then don't add it to
> a pool with dedup enabled.

What I'm talking about here is that caching the DDT in the ARC takes a non-trivial amount of space (as we've discovered). For a pool consisting of backing store with access times very close to that of main memory, there's no real benefit from caching it in the ARC/L2ARC, so it would be useful if the DDT was simply kept somewhere on the actual backing store, and there was some way to tell ZFS to look there exclusively, and not try to build/store a DDT in ARC.

> The DDT is in the pool, not in the ARC. Because it's frequently accessed,
> some / most of it will reside in the ARC.

Are you sure? I was under the impression that the DDT had to be built from info in the pool, but that what we call the DDT only exists in the ARC. That's my understanding from reading the ddt.h and ddt.c files - that the 'ddt_entry' and 'ddt' structures exist in RAM/ARC/L2ARC, but not on disk. Those two are built using the 'ddt_key' and 'ddt_bookmark' structures on disk. Am I missing something?

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
> Even the most expensive decompression algorithms generally run
> significantly faster than I/O to disk -- at least when real disks are
> involved. So, as long as you don't run out of CPU and have to wait for
> CPU to be available for decompression, the decompression will win. The
> same concept is true for dedup, although I don't necessarily think of
> dedup as a form of compression (others might reasonably do so though.)

Effectively, dedup is a form of compression of the filesystem rather than of any single file, but one oriented to not interfering with access to any of what may be sharing blocks.

I would imagine that if it's read-mostly, it's a win, but otherwise it costs more than it saves. Even with conventional compression, compressing tends to be more resource-intensive than decompressing...

What I'm wondering is when dedup is a better value than compression. Most obviously, when there are a lot of identical blocks across different files; but I'm not sure how often that happens, aside from maybe blocks of zeros (which may well be sparse anyway).
--
This message posted from opensolaris.org
On 7/18/2010 4:18 PM, Richard L. Hamilton wrote:
> Effectively, dedup is a form of compression of the filesystem rather than
> of any single file, but one oriented to not interfering with access to any
> of what may be sharing blocks.
>
> I would imagine that if it's read-mostly, it's a win, but otherwise it
> costs more than it saves. Even with conventional compression, compressing
> tends to be more resource-intensive than decompressing...
>
> What I'm wondering is when dedup is a better value than compression.
> Most obviously, when there are a lot of identical blocks across different
> files; but I'm not sure how often that happens, aside from maybe blocks
> of zeros (which may well be sparse anyway).

From my own experience, a dedup "win" is much more data-usage-dependent than compression. Compression seems to be of general use across the vast majority of data I've encountered - with the sole big exception of media file servers (where the data is already compressed pictures, audio, or video). It seems to be of general utility, since I've always got spare CPU cycles, and it's really not very "expensive" in terms of CPU in most cases. Of course, the *value* of compression varies according to the data (i.e. how much it will compress), but that doesn't matter for *utility* for the most part.

Dedup, on the other hand, currently has a very steep price in terms of needed ARC/L2ARC/RAM, so it's much harder to justify in those cases where it only provides modest benefits. Additionally, we're still in the development stage of dedup (IMHO), so I can't really make a full evaluation of the dedup concept, as many of its issues today are implementation-related, not concept-related.

All that said, dedup has a showcase use case where it is of *massive* benefit: hosting virtual machines. For a machine hosting only 100 VM data stores, I can see 99% space savings. And I see a significant performance boost, since I can cache that one VM image in RAM easily. There are other places where dedup seems modestly useful these days (one is the afore-mentioned media file server, where you'd be surprised how much duplication there is), but it's *much* harder to pre-determine dedup's utility for a given dataset unless you have highly detailed knowledge of that dataset's composition.

I'll admit to not being a big fan of the dedup concept originally (go back a couple of years here on this list), but, given that the world is marching straight to virtualization as fast as we can go, I'm a convert now.

From my perspective, here's a couple of things that I think would help improve dedup's utility for me:

(a) fix the outstanding issues in the current implementation (duh!).

(b) add the ability to store the entire DDT in the backing store, and not have to construct it in ARC from disk-resident info (this would be of great help where backing store = SSD or RAM-based things).

(c) be able to "test-dedup" a given filesystem:
I'd like ZFS to be able to look at a filesystem and tell me how much dedup I'd get out of it, WITHOUT having to actually create a dedup-enabled filesystem and copy the data to it. While it would be nice to be able to simply turn on dedup for a filesystem and have ZFS dedup the existing data there (in place, without copying), I realize the implementation is hard given how things currently work, and frankly, that's of much lower priority for me than being able to test-dedup a dataset (a rough userland approximation of this is sketched below).

(d) increase the slab (record) size significantly, to at least 1MB or more. I daresay the primary way VM images are stored these days is as single, large files (though iSCSI volumes are coming up fast), and as such, I've got 20G files which would really, really benefit from having a much larger slab size.

(e) and, of course, seeing if there's some way we can cut down on dedup's piggy DDT size. :-)

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
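Along the lines of the "test-dedup" wish in (c), a userland approximation is possible today: hash fixed-size chunks of every file under a tree and compare unique versus total chunks. This is only a sketch and only roughly approximates what ZFS would do (real block boundaries, variable record sizes, sparse files, zvols, and metadata are all ignored).

import hashlib, os, sys

def dedup_estimate(root, chunk_size=128 * 1024):
    # Hash fixed 128K chunks, the way a 128K-recordsize filesystem roughly would.
    total = unique = 0
    seen = set()
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while chunk := f.read(chunk_size):
                        total += 1
                        digest = hashlib.sha256(chunk).digest()
                        if digest not in seen:
                            seen.add(digest)
                            unique += 1
            except OSError:
                continue  # skip unreadable files
    return total, unique

if __name__ == "__main__":
    total, unique = dedup_estimate(sys.argv[1] if len(sys.argv) > 1 else ".")
    if total:
        print(f"{total} chunks, {unique} unique, dedup ratio ~{total / unique:.2f}x")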
On Sun, 2010-07-18 at 16:18 -0700, Richard L. Hamilton wrote:
> I would imagine that if it's read-mostly, it's a win, but otherwise it
> costs more than it saves. Even with conventional compression, compressing
> tends to be more resource-intensive than decompressing...
>
> What I'm wondering is when dedup is a better value than compression.
> Most obviously, when there are a lot of identical blocks across different
> files; but I'm not sure how often that happens, aside from maybe blocks
> of zeros (which may well be sparse anyway).

Shared/identical blocks come into play in several specific scenarios:

1) Multiple VMs, cloud. If you have multiple guest OSes installed, they're going to benefit heavily from dedup. Even Zones can benefit here.

2) Situations with lots of copies of large amounts of data where only some of the data is different between each copy. The classic example is a Solaris build server, hosting dozens or even hundreds of copies of the Solaris tree, each being worked on by different developers. Typically the developer is working on something less than 1% of the total source code, so the other 99% can be shared via dedup.

For general purpose usage, e.g. hosting your music or movie collection, I doubt that dedup offers any real advantage. If I were talking about deploying dedup, I'd only use it in situations like the two I mentioned, and not for just a general purpose storage server. For general purpose applications I think compression is better. (Though I think dedup will have higher savings -- significantly so -- in the particular situation where you know you have lots and lots of duplicate/redundant data.)

Note also that dedup actually does some things where your duplicated data may gain an effective increase in redundancy/security, because it makes sure that the data that is deduped has higher redundancy than non-deduped data. (This sounds counterintuitive, but as long as you have at least 3 copies of the duplicated data, it's a net win.)

Btw, compression on top of dedup may actually kill your benefit of dedup. My hypothesis (unproven, admittedly) is that because many compression algorithms cause small permutations of the data to significantly change the bit values in the overall compressed object (even just by changing their offset in the binary), they can seriously defeat dedup's efficacy.

 - Garrett
Brandon High wrote:
> Using 376 bytes per entry, it's 376M for 128G of unique data, or just
> under 3GB for 1TB of unique data.
>
> A 1TB zvol with 8k blocks would require almost 24GB of memory to hold
> the DDT. Ouch.
>
> -B

To reduce RAM requirements, consider an offline or idle-time dedup. I suggested a variation of this in regard to compression a while ago, probably on this list.

In either case, you have the system write the data whichever way is fastest. If there is enough unused CPU power, run maximum compression, otherwise use fast compression. If new data-type-specific compression algorithms are added, attempt compression with those as well (e.g. lossless JPEG recompression that can save 20-25% space). Store the block in whichever compression format works best. If there is enough RAM to maintain a live dedup table, dedup right away. If CPU and RAM pressures are too high, defer dedup and compression to a periodic scrub (or some other new periodically run command).

In the deferred case, the dedup table entries could be generated as blocks are filled/changed and then kept on disk. Periodically that table would be quicksorted by the hash, and then any duplicates would be found next to each other. The blocks for the duplicates would be looked up, verified as truly identical, and then re-written (probably also using BP rewrite). Quicksort is parallelizable, and sorting a multi-gigabyte table is a plausible operation, even on disk: quicksort 100MB pieces of it in RAM and iterate until the whole table ends up sorted.

The end result of all this idle-time compression and deduping is that the initially allocated storage space becomes the upper bound storage requirement, and the data will end up packing tighter over time. The phrasing on bulk packaged items comes to mind: "Contents may have settled during shipping."

Now a theoretical question about dedup: what about the interaction with defragmentation (this also probably needs BP rewrite)? The first file will be completely defragmented, but a second file that is a slight variation of the first will have at least two fragments (the deduped portion and the unique portion). Probably the performance impact will be minor as long as each fragment is a decent minimum size (multiple MB).
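A toy illustration of the sort-then-scan pass described above: once (checksum, block) records are sorted by checksum, duplicates land next to each other and a single linear sweep finds the candidates. This is done in memory here; the proposal above would sort the on-disk table in pieces the same way.

import hashlib
from itertools import groupby

def find_duplicate_candidates(blocks):
    # blocks: iterable of (block_id, data). Returns lists of block_ids sharing a checksum.
    records = sorted(
        (hashlib.sha256(data).digest(), block_id) for block_id, data in blocks
    )
    groups = []
    for _, grp in groupby(records, key=lambda r: r[0]):
        ids = [block_id for _, block_id in grp]
        if len(ids) > 1:
            groups.append(ids)  # verify byte-for-byte before any rewrite, as above
    return groups

if __name__ == "__main__":
    sample = [(0, b"A" * 8192), (1, b"B" * 8192), (2, b"A" * 8192)]
    print(find_duplicate_candidates(sample))  # -> [[0, 2]]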
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Richard L. Hamilton
>
> I would imagine that if it's read-mostly, it's a win, but otherwise it
> costs more than it saves. Even with conventional compression, compressing
> tends to be more resource-intensive than decompressing...

I would imagine it's *easier* to have a win when it's read-mostly, but the expense of computing checksums is going to be paid either way, with or without dedup. The only extra cost dedup adds is maintaining a hash tree of some kind, to see if some block has already been stored on disk. So ... of course I'm speaking hypothetically and haven't proven it ... I think dedup will accelerate the system in nearly all use cases.

The main exception is when you have highly non-duplicated data. I think the cost of dedup CPU power is tiny, but in the case of highly non-duplicated data, even that little expense is a waste.

> What I'm wondering is when dedup is a better value than compression.

Whenever files have internal repetition, compression will be better. Whenever the repetition crosses file boundaries, dedup will be better.

> Most obviously, when there are a lot of identical blocks across different
> files; but I'm not sure how often that happens, aside from maybe blocks
> of zeros (which may well be sparse anyway).

I think the main value here is when there is more than one copy of some files in the filesystem. For example:

In subversion, there are two copies of every file in your working directory. Every file has a corresponding "base" copy located in the .svn directory. If you have a lot of developers ... software or whatever ... who have all checked out the same project, and they're all working on it in their home directories ... all of those copies get essentially cut down to 1. Combine the developers with subversion ... you would have 2x copies of every file, in every person's home dir = ... a lot of copies of the same files ... all cut down to 1.

You build some package from source code. somefile.c becomes somefile.o, and then the linker takes somefile.o and a bunch of other .o files and mashes them all together to make the "finalproduct" executable file. Well, that executable is just copies of all these .o files mashed together. So again ... cut it all down to 1. And multiply by the number of developers who are all doing the same thing in their home dirs.

Others have mentioned VMs, when VMs are duplicated ... I don't personally duplicate many VMs, so it doesn't matter to me ... but I can see the value for others ...
On 20/07/2010 04:41, Edward Ned Harvey wrote:
> I would imagine it's *easier* to have a win when it's read-mostly, but the
> expense of computing checksums is going to be paid either way, with or
> without dedup. The only extra cost dedup adds is maintaining a hash tree of
> some kind, to see if some block has already been stored on disk. So ... of
> course I'm speaking hypothetically and haven't proven it ... I think dedup
> will accelerate the system in nearly all use cases.
>
> The main exception is when you have highly non-duplicated data. I think
> the cost of dedup CPU power is tiny, but in the case of highly
> non-duplicated data, even that little expense is a waste.

Please note that by default ZFS uses fletcher4 checksums, but dedup currently allows only sha256, which is more CPU intensive.

Also, from a performance point of view, there will be a sudden drop in write performance the moment the DDT can't fit entirely in memory. L2ARC could mitigate the impact, though. Then there will be less memory available for data caching due to the extra memory requirements of the DDT. (However, please note that IIRC the DDT is treated as metadata, and by default there is a limit on the metadata cache size to be no bigger than 20% of the ARC - there is a bug open for it; I haven't checked if it's been fixed yet or not.)

> What I'm wondering is when dedup is a better value than compression.
>
> Whenever files have internal repetition, compression will be better.
> Whenever the repetition crosses file boundaries, dedup will be better.

Not necessarily. Compression in ZFS works only within the scope of a single fs block. So, for example, if you have a large file with most of its blocks identical, dedup should "compress" the file much better than compression would.

Also please note that you can use both compression and dedup at the same time.

--
Robert Milkowski
http://milek.blogspot.com
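A small illustration of the per-block point: compression applied record by record (as ZFS applies it) cannot exploit repetition between records, while dedup collapses identical records no matter what they contain. The figures below are illustrative only.

import hashlib, os, zlib

RECORD = 128 * 1024
# 64 identical but individually incompressible 128K records.
data = os.urandom(RECORD) * 64
records = [data[i:i + RECORD] for i in range(0, len(data), RECORD)]

# Per-record compression, roughly how a per-block compressor sees the file.
compressed = sum(len(zlib.compress(r, 6)) for r in records)
# Block-level dedup: count distinct records by checksum.
unique = len({hashlib.sha256(r).digest() for r in records})

print(f"per-record compression: ~{compressed / len(data):.0%} of original size")
print(f"dedup: {unique} unique record(s) out of {len(records)}")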