There are a number of threads (this one[1] for example) that describe memory requirements for deduplication. They're pretty high.

I'm trying to get a better understanding... on our NetApps we use 4K block sizes with their post-process deduplication and get pretty good dedupe ratios for VM content.

Using ZFS we are using 128K record sizes by default, which nets us less impressive savings... however, to drop to a 4K record size would theoretically require that we have nearly 40GB of memory for only 1TB of storage (based on 150 bytes per block for the DDT).

This obviously becomes prohibitively higher for 10+ TB file systems.

I will note that our NetApps are using only 2TB FlexVols, but would like to better understand ZFS's (apparently) higher memory requirements... or maybe I'm missing something entirely.

Thanks,
Ray

[1] http://markmail.org/message/wile6kawka6qnjdw
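[A minimal Python sketch of where the "nearly 40GB" figure above comes from, using the ~150 bytes-per-DDT-entry rule of thumb cited in the message (later posts in the thread quote larger in-core sizes); it assumes every block is unique:]

    # Rough DDT memory estimate: assumes every block in the pool is unique
    # and ~150 bytes of RAM per DDT entry (the rule of thumb cited above).
    def ddt_ram_bytes(pool_bytes, record_size, bytes_per_entry=150):
        blocks = pool_bytes // record_size
        return blocks * bytes_per_entry

    TiB = 1 << 40
    GiB = float(1 << 30)
    for rs in (4 * 1024, 128 * 1024):
        est = ddt_ram_bytes(1 * TiB, rs)
        print("recordsize %6d: %5.1f GiB of DDT per TiB stored" % (rs, est / GiB))

    # recordsize   4096:  37.5 GiB of DDT per TiB stored  (the "nearly 40GB" above)
    # recordsize 131072:   1.2 GiB of DDT per TiB stored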
On 5/4/2011 9:57 AM, Ray Van Dolson wrote:
> There are a number of threads (this one[1] for example) that describe
> memory requirements for deduplication. They're pretty high.
>
> I'm trying to get a better understanding... on our NetApps we use 4K
> block sizes with their post-process deduplication and get pretty good
> dedupe ratios for VM content.
> [...]
> I will note that our NetApps are using only 2TB FlexVols, but would
> like to better understand ZFS's (apparently) higher memory
> requirements... or maybe I'm missing something entirely.

I'm not familiar with NetApp's implementation, so I can't speak to why it might appear to use fewer resources. However, there are a couple of possible issues here:

(1) Pre-write vs post-write deduplication.
ZFS does pre-write dedup, where it looks for duplicates before it writes anything to disk. In order to do pre-write dedup, you really have to store the ENTIRE deduplication block lookup table in some sort of fast (random) access media, realistically Flash or RAM. The win is that you get significantly lower disk utilization (i.e. better I/O performance), as (potentially) much less data is actually written to disk.
Post-write dedup is done via batch processing - that is, such a design has the system periodically scan the saved data, looking for duplicates. While this method also greatly benefits from being able to store the dedup table in fast random storage, it's not anywhere near as critical. The downside here is that you see much higher disk utilization - the system must first write all new data to disk (without looking for dedup), and then must also perform significant I/O later on to do the dedup.

(2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will result in a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table.
ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC & RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary.

I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code completely avoid the need for an equivalent to the ARC-based lookup. In addition, I suspect they have a smaller DDT entry itself. Which boils down to probably needing 50% of the total resource consumption of ZFS, and NO (or extremely small, and fixed) RAM requirement.

Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The big issue is the ARC requirements, which, until they can be seriously reduced (or, best case, simply eliminated), really are a significant barrier to adoption of ZFS dedup.

Right now, ZFS treats DDT entries like any other data or metadata in how it ages from ARC to L2ARC to gone. IMHO, the better way to do this is simply to require the DDT to be entirely stored on the L2ARC (if present), and not ever keep any DDT info in the ARC at all (that is, the ARC should contain a pointer to the DDT in the L2ARC, and that's it, regardless of the amount or frequency of access of the DDT). Frankly, at this point, I'd almost change the design to REQUIRE an L2ARC device in order to turn on dedup.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:
> (1) Pre-write vs post-write deduplication.
> [...] Post-write dedup is done via batch processing - that is, such a
> design has the system periodically scan the saved data, looking for
> duplicates. While this method also greatly benefits from being able to
> store the dedup table in fast random storage, it's not anywhere near as
> critical. The downside here is that you see much higher disk utilization
> - the system must first write all new data to disk (without looking for
> dedup), and then must also perform significant I/O later on to do the dedup.

Makes sense.

> (3) Method of storing (and data stored in) the dedup table.
> ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
> lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
> for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it
> boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
> in the DDT. Also, the actual DDT entry itself is perhaps larger than
> absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for memory (at least not much if you're talking about 500 bytes combined)? I was hoping we could slap in 80GB's of SSD L2ARC and get away with "only" 16GB of RAM, for example.

Thanks for your response, Erik. Very helpful.

Ray
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
> I suspect that NetApp does the following to limit their resource
> usage: they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they can
> make sure there is always one present). Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware.

See page 29 of http://media.netapp.com/documents/tr-3505.pdf - smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them.

-B

--
Brandon High : bhigh at freaks.com
On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
> So the addition of L2ARC doesn't necessarily reduce the need for
> memory (at least not much if you're talking about 500 bytes combined)?
> I was hoping we could slap in 80GB's of SSD L2ARC and get away with
> "only" 16GB of RAM for example.

It reduces *somewhat* the need for RAM. Basically, if you have no L2ARC cache device, the DDT must be stored in RAM. That's about 376 bytes per dedup block.

If you have an L2ARC cache device, then the ARC must contain a reference to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT entry reference.

So, adding an L2ARC reduces the ARC consumption by about 55%.

Of course, the other benefit from an L2ARC is the data/metadata caching, which is likely worth it just by itself.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
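[A minimal sketch putting the per-entry figures above in context. The 376-byte and 176-byte numbers are taken from the message above and are approximate; the on-L2ARC entry size is assumed to be roughly the same as the in-RAM entry, which is consistent with the earlier "500+ bytes combined" estimate.]

    # Sketch: DDT footprint using the per-entry figures quoted above --
    # ~376 bytes of ARC per entry with no cache device, ~176 bytes of ARC
    # per entry plus the full entry in L2ARC when a cache device is present.
    GiB = float(1 << 30)
    TiB = 1 << 40

    def ddt_footprint(pool_bytes, record_size,
                      ram_only=376, arc_ref=176, l2_entry=376):
        entries = pool_bytes // record_size
        return entries * ram_only, entries * arc_ref, entries * l2_entry

    # Ray's example: 1 TiB of unique data at 4K records
    ram_no_l2, ram_with_l2, l2_space = ddt_footprint(1 * TiB, 4096)
    print("no L2ARC:   %5.1f GiB of ARC" % (ram_no_l2 / GiB))               # ~94 GiB
    print("with L2ARC: %5.1f GiB of ARC + %5.1f GiB of L2ARC" %
          (ram_with_l2 / GiB, l2_space / GiB))                              # ~44 + ~94 GiB

[By these figures, the 80GB L2ARC / 16GB RAM box mentioned earlier would be comfortable at 128K records (roughly 1.4 GiB of ARC and 3 GiB of L2ARC per TiB of unique data) but not at a 4K record size.]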
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
> NetApp solves the problem by putting rigid constraints around the
> problem, whereas ZFS lets you enable dedup for any size dataset. Both
> approaches have limitations, and it sucks when you hit them.

That is very true, although it's worth mentioning you can have quite a few of the dedupe/SIS-enabled FlexVols on even the lower-end filers (our FAS2050 has a bunch of 2TB SIS-enabled FlexVols). The FAS2050 of course has a fairly small memory footprint...

I do like the additional flexibility you have with ZFS, just trying to get a handle on the memory requirements.

Are any of you out there using deduped ZFS file systems to store VMware VMDK (or any VM tech, really)? Curious what recordsize you use and what your hardware specs / experiences have been.

Ray
On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:
> It reduces *somewhat* the need for RAM. Basically, if you have no L2ARC
> cache device, the DDT must be stored in RAM. That's about 376 bytes per
> dedup block.
>
> If you have an L2ARC cache device, then the ARC must contain a reference
> to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT
> entry reference.
>
> So, adding an L2ARC reduces the ARC consumption by about 55%.
>
> Of course, the other benefit from an L2ARC is the data/metadata caching,
> which is likely worth it just by itself.

Great info. Thanks Erik.

For dedupe workloads on larger file systems (8TB+), I wonder if it makes sense to use SLC / enterprise-class SSD (or better) devices for L2ARC instead of lower-end MLC stuff? Seems like we'd be seeing more writes to the device than in a non-dedupe scenario.

Thanks,
Ray
On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
> That is very true, although worth mentioning you can have quite a few
> of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
> FAS2050 has a bunch of 2TB SIS enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once, and not get horrid performance penalties?

If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn.

ZFS's problem is that it needs ALL the resources for EACH pool ALL the time, and can't really share them well if it expects to keep performance from tanking... (no pun intended)

> The FAS2050 of course has a fairly small memory footprint...
>
> I do like the additional flexibility you have with ZFS, just trying to
> get a handle on the memory requirements.
>
> Are any of you out there using dedupe ZFS file systems to store VMware
> VMDK (or any VM tech, really)? Curious what recordsize you use and
> what your hardware specs / experiences have been.
>
> Ray

Right now, I use it for my Solaris 8 containers and VirtualBox images. The VB images are mostly Windows (XP and Win2003). I tend to put the OS image in one VMdisk, and my scratch disks in another. That is, I generally don't want my apps writing much to my OS images. My scratch/data disks aren't deduped.

Overall, I'm running about 30 deduped images served out over NFS. My recordsize is set to 128k, but, given that they're OS images, my actual disk block usage has a significant 4k presence. One way I reduced this initially was to have the VMdisk image stored on local disk, then copy the *entire* image to the ZFS server, so the server saw a single large file, which meant it tended to write full 128k blocks. Do note that my 30 images only take about 20GB of actual space, after dedup. I figure about 5GB of dedup space per OS type (and I have 4 different setups).

My data VMdisks, however, chew through about 4TB of disk space, which is nondeduped. I'm still trying to determine if I'm better off serving those data disks as NFS mounts to my clients, or as VMdisk images available over iSCSI or NFS. Right now, I'm doing VMdisks over NFS.

The setup I'm using is an older X4200 (non-M2), with 3rd-party SSDs as L2ARC, hooked to an old 3500FC array. It has 8GB of RAM in total, and runs just fine with that.
I definitely am going to upgrade to something much larger in the near future, since I expect to up my number of VM images by at least a factor of 5.

That all said, if you're relatively careful about separating OS installs from active data, you can get really impressive dedup ratios using a relatively small amount of actual space. In my case, I expect to eventually be serving about 10 different configs out to a total of maybe 100 clients, and probably never exceed 100GB max on the deduped end. Which means that I'll be able to get away with 16GB of RAM for the whole server, comfortably.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Wed, May 4, 2011 at 6:36 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Stupid question - can you hit all the various SIS volumes at once, and
> not get horrid performance penalties?
>
> If so, I'm almost certain NetApp is doing post-write dedup. That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.
>
> ZFS's problem is that it needs ALL the resources for EACH pool ALL the time,
> and can't really share them well if it expects to keep performance from
> tanking... (no pun intended)

On a 2050? Probably not. It's got a single-core mobile Celeron CPU and 2GB of RAM. You couldn't even run ZFS on that box, much less ZFS+dedup. Can you do it on a model that isn't 4 years old without tanking performance? Absolutely.

Outside of those two 2000 series, the reason there are dedup limits isn't performance.

--Tim
On 5/4/2011 4:17 PM, Ray Van Dolson wrote:
> For dedupe workloads on larger file systems (8TB+), I wonder if it makes
> sense to use SLC / enterprise-class SSD (or better) devices for L2ARC
> instead of lower-end MLC stuff? Seems like we'd be seeing more writes
> to the device than in a non-dedupe scenario.

I'm using enterprise-class MLC drives (without a supercap), and they work fine with dedup. I'd have to test, but I don't think that the increase in writes is that much, so I don't expect an SLC to really make much of a difference. (The fill rate of the L2ARC is limited, so I can't imagine we'd bump up against the MLC's limits.)

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
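[A rough sketch to sanity-check the point about the limited L2ARC fill rate. It assumes the old OpenSolaris-era defaults of l2arc_write_max = 8 MiB with a one-second feed interval, and a hypothetical ~70 TB write-endurance rating; actual tunables and ratings vary, and real write volume is normally far below the cap.]

    # Worst-case L2ARC write volume if the feed thread wrote its maximum
    # every interval, compared against a drive's rated write endurance.
    # ASSUMPTIONS: l2arc_write_max = 8 MiB, 1-second feed interval (old
    # OpenSolaris defaults), and a hypothetical ~70 TB endurance rating.
    MiB = 1 << 20
    TiB = float(1 << 40)

    l2arc_write_max = 8 * MiB                  # bytes per feed interval
    feed_interval = 1                          # seconds
    per_day = l2arc_write_max / feed_interval * 86400.0

    endurance_bytes = 70 * TiB                 # hypothetical MLC endurance rating
    days = endurance_bytes / per_day

    print("worst-case L2ARC writes: %.2f TiB/day" % (per_day / TiB))       # ~0.66
    print("~70 TB endurance lasts at least %d days at that rate" % days)   # ~106

[In practice the feed thread writes far less than its cap, which is the basis of the remark above that MLC write limits are unlikely to be an issue.]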
On 5/4/2011 4:44 PM, Tim Cook wrote:
> On a 2050? Probably not. It's got a single-core mobile Celeron CPU
> and 2GB of RAM. You couldn't even run ZFS on that box, much less
> ZFS+dedup. Can you do it on a model that isn't 4 years old without
> tanking performance? Absolutely.
>
> Outside of those two 2000 series, the reason there are dedup limits
> isn't performance.
>
> --Tim

Indirectly, yes, it's performance, since NetApp has plainly chosen post-write dedup as a method to restrict the required hardware capabilities. The dedup limits on volume size are almost certainly driven by the local RAM requirements for post-write dedup.

It also looks like NetApp isn't providing for a dedicated DDT cache, which means that when the NetApp is doing dedup, it's consuming the normal filesystem cache (i.e. chewing through RAM). Frankly, I'd be very surprised if you didn't see a noticeable performance hit during the period that the NetApp appliance is performing the dedup scans.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Wed, May 04, 2011 at 04:51:36PM -0700, Erik Trimble wrote:
> It also looks like NetApp isn't providing for a dedicated DDT cache,
> which means that when the NetApp is doing dedup, it's consuming the
> normal filesystem cache (i.e. chewing through RAM). Frankly, I'd be
> very surprised if you didn't see a noticeable performance hit during
> the period that the NetApp appliance is performing the dedup scans.

Yep, when the dedupe process runs, there is a drop in performance (hence we usually schedule it to run during off-peak hours). Obviously this is a luxury that wouldn't be an option in every environment...

During normal operations outside of the dedupe period we haven't noticed a performance hit. I don't think we hit the filer too hard, however -- it's acting as a VMware datastore and only a few of the VMs have higher I/O footprints.
It is a 2050C, however, so we spread the load across the two filer heads (although we occasionally run everything on one head when performing maintenance on the other).

Ray
On Wed, May 4, 2011 at 4:36 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
> If so, I'm almost certain NetApp is doing post-write dedup. That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.

They are; it's in their docs. A volume is dedup'd when 20% of non-deduped data is added to it, or something similar. 8 volumes can be processed at once though, I believe, and it could be that weaker systems are not able to do as many in parallel.

> block usage has a significant 4k presence. One way I reduced this initially
> was to have the VMdisk image stored on local disk, then copied the *entire*
> image to the ZFS server, so the server saw a single large file, which meant
> it tended to write full 128k blocks. Do note that my 30 images only take

Wouldn't you have been better off cloning datasets that contain an unconfigured install and customizing from there?

-B

--
Brandon High : bhigh at freaks.com
On Wed, May 4, 2011 at 6:51 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Indirectly, yes, it's performance, since NetApp has plainly chosen
> post-write dedup as a method to restrict the required hardware
> capabilities. The dedup limits on volume size are almost certainly driven
> by the local RAM requirements for post-write dedup.
>
> It also looks like NetApp isn't providing for a dedicated DDT cache, which
> means that when the NetApp is doing dedup, it's consuming the normal
> filesystem cache (i.e. chewing through RAM). Frankly, I'd be very surprised
> if you didn't see a noticeable performance hit during the period that the
> NetApp appliance is performing the dedup scans.

Again, it depends on the model/load/etc. The smallest models will see performance hits for sure. If the volume size limits are strictly a matter of RAM, why exactly would they jump from 4TB to 16TB on a 3140 by simply upgrading ONTAP? If the limits haven't gone up on, at the very least, every one of the x2xx systems 12 months from now, feel free to dig up the thread and give an I-told-you-so. I'm quite confident that won't be the case.
The 16TB limit SCREAMS to me that it's a holdover from the same 32-bit limit that causes 32-bit volumes to have a 16TB limit. I'm quite confident they're just taking the cautious approach on moving to 64-bit dedup code.

--Tim
On 5/4/2011 5:11 PM, Brandon High wrote:
> They are; it's in their docs. A volume is dedup'd when 20% of
> non-deduped data is added to it, or something similar. 8 volumes can
> be processed at once though, I believe, and it could be that weaker
> systems are not able to do as many in parallel.

Sounds rational.

> Wouldn't you have been better off cloning datasets that contain an
> unconfigured install and customizing from there?

Given that my "OS" installs include a fair amount of 3rd-party add-ons (compilers, SDKs, et al), I generally find the best method for me is to fully configure a client (with the VMdisk on local storage), then copy that VMdisk to the ZFS server as a "golden image". I can then clone that image for my other clients of that type, and only have to change the network information. Initially, each new VM image consumes about 1MB of space. :-)

Overall, I've found that as I have to patch each image, it's worthwhile to take a new golden-image snapshot every so often, and then reconfigure each client machine again from that new golden image. I'm sure I could do some optimization here, but the method works well enough. What you want to avoid is having the OS image written to; waiting to do configuration and customization AFTER it has been placed on the ZFS server is sub-optimal.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Erik Trimble
>
> ZFS's problem is that it needs ALL the resources for EACH pool ALL the
> time, and can't really share them well if it expects to keep performance
> from tanking... (no pun intended)

That's true, but on the flipside, if you don't have adequate resources dedicated all the time, it means performance is unsustainable. Anything which is going to do post-write dedup will necessarily have degraded performance on a periodic basis. This is in *addition* to all your scrubs and backups and so on.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Ray Van Dolson
>
> Are any of you out there using dedupe ZFS file systems to store VMware
> VMDK (or any VM tech, really)? Curious what recordsize you use and
> what your hardware specs / experiences have been.

Generally speaking, dedup doesn't work on VM images. (Same is true for ZFS or NetApp or anything else.) Because the VM images are all going to have their own filesystems internally, with whatever blocksize is relevant to the guest OS. If the virtual blocks in the VM don't align with the ZFS (or whatever FS) host blocks... then even when you write duplicated data inside the guest, the host won't see it as a duplicated block.

There are some situations where dedup may help on VM images... For example, if you're not using sparse files and you have a zero-filled disk... but in that case, you should probably just use a sparse file instead... Or... if you have a "golden" image that you're copying all over the place... but in that case, you should probably just use clones instead...

Or if you're intimately familiar with both the guest & host filesystems, and you choose blocksizes carefully to make them align. But that seems complicated and likely to fail.
On Wed, May 4, 2011 at 10:15 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> That's true, but on the flipside, if you don't have adequate resources
> dedicated all the time, it means performance is unsustainable. Anything
> which is going to do post-write dedup will necessarily have degraded
> performance on a periodic basis. This is in *addition* to all your scrubs
> and backups and so on.

AGAIN, you're assuming that all system resources are used all the time and can't possibly go anywhere else. This is absolutely false. If someone is running a system at 99% capacity 24/7, perhaps that might be a factual statement. I'd argue that if someone is running the system at 99% all of the time, the system is grossly undersized for the workload. How can you EVER expect a highly available system to run at 99% on both nodes (all nodes in a vmax/vsp scenario) and ever be able to fail over? Either a home-brew OpenSolaris cluster, Oracle 7000 cluster, or NetApp?

I'm gathering that this list in general has a lack of understanding of how NetApp does things. If you don't know for a fact how it works, stop jumping to conclusions on how you think it works. I know for a fact that, short of the guys currently/previously writing the code at NetApp, there's a handful of people in the entire world who know (factually) how they're allocating resources from soup to nuts. As far as this discussion is concerned, there are only two points that matter: they've got dedup on primary storage, and it works in the field. The rest is just static that doesn't matter. Let's focus on how to make ZFS better instead of trying to guess how others are making it work, especially when they've got a completely different implementation.

--Tim
On Wed, May 4, 2011 at 10:23 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> Generally speaking, dedup doesn't work on VM images. (Same is true for ZFS
> or NetApp or anything else.) Because the VM images are all going to have
> their own filesystems internally, with whatever blocksize is relevant to the
> guest OS. If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks... then even when you write duplicated data inside
> the guest, the host won't see it as a duplicated block.

That's patently false. VM images are the absolute best use-case for dedup outside of backup workloads. I'm not sure who told you/where you got the idea that VM images are not ripe for dedup, but it's wrong.

--Tim
> From: Tim Cook [mailto:tim at cook.ms]
>
> AGAIN, you're assuming that all system resources are used all the time and
> can't possibly go anywhere else. This is absolutely false. If someone is
> running a system at 99% capacity 24/7, perhaps that might be a factual
> statement. I'd argue if someone is running the system 99% all of the time,
> the system is grossly undersized for the workload.

Well, here is my situation: I do IT for a company whose workload is very spiky. For weeks at a time, the system will be 99% idle. Then when the engineers have a deadline to meet, they will expand and consume all available resources, no matter how much you give them. So they will keep all systems 99% busy for a month at a time. After the deadline passes, they drop back down to 99% idle. The work is IO intensive, so it's not appropriate for something like the cloud.

> I'm gathering that this list in general has a lack of understanding of how
> NetApp does things. If you don't know for a fact how it works, stop jumping
> to conclusions on how you think it works. I know for a fact that short of the

I'm a little confused by this rant. Cuz I didn't say anything about NetApp.
> From: Tim Cook [mailto:tim at cook.ms]
>
> That's patently false. VM images are the absolute best use-case for dedup
> outside of backup workloads. I'm not sure who told you/where you got the
> idea that VM images are not ripe for dedup, but it's wrong.

Well, I got that idea from this list. I said a little bit about why I believed it was true... about dedup being ineffective for VMs...

Would you care to describe a use case where dedup would be effective for a VM? Or perhaps cite something specific, instead of just wiping the whole thing away and saying "patently false"? I don't feel like this comment was productive...
We have customers using dedup with lots of VM images... in one extreme case they are getting dedup ratios of over 200:1!

You don't need dedup or sparse files for zero filling. Simple zle compression will eliminate those for you far more efficiently and without needing massive amounts of RAM.

Our customers have the ability to access our systems engineers to design the solution for their needs. If you are serious about doing this stuff right, work with someone like Nexenta that can engineer a complete solution instead of trying to figure out which of us on this forum are quacks and which are cracks. :)

Tim Cook <tim at cook.ms> wrote:
> On Wed, May 4, 2011 at 10:23 PM, Edward Ned Harvey wrote:
>> There are some situations where dedup may help on VM images... For example,
>> if you're not using sparse files and you have a zero-filled disk... but in
>> that case, you should probably just use a sparse file instead...
>
> That's patently false. VM images are the absolute best use-case for dedup
> outside of backup workloads. I'm not sure who told you/where you got the
> idea that VM images are not ripe for dedup, but it's wrong.
> From: Garrett D'Amore [mailto:garrett at nexenta.com]
>
> We have customers using dedup with lots of vm images... in one extreme
> case they are getting dedup ratios of over 200:1!

I assume you're talking about a situation where there is an initial VM image, and then to clone the machine, the customers copy the VM, correct? If that is correct, have you considered ZFS cloning instead?

When I said dedup wasn't good for VMs, what I'm talking about is: if there is data inside the VM which is cloned... For example, if somebody logs into the guest OS and then does a "cp" operation... then dedup on the host is unlikely to be able to recognize that data as cloned data inside the virtual disk.

> Our customers have the ability to access our systems engineers to design the
> solution for their needs. If you are serious about doing this stuff right, work
> with someone like Nexenta that can engineer a complete solution instead of
> trying to figure out which of us on this forum are quacks and which are
> cracks. :)

Is this a zfs discussion list, or a Nexenta sales & promotion list?
Hi,

On 05/ 5/11 03:02 PM, Edward Ned Harvey wrote:
> I assume you're talking about a situation where there is an initial VM image,
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?
>
> When I said dedup wasn't good for VMs, what I'm talking about is: if there is
> data inside the VM which is cloned... For example, if somebody logs into the
> guest OS and then does a "cp" operation... then dedup on the host is unlikely
> to be able to recognize that data as cloned data inside the virtual disk.

ZFS cloning and ZFS dedup are solving two problems that are related, but different:

- Through cloning, a lot of space can be saved in situations where it is known beforehand that data is going to be used multiple times from multiple different "views". Virtualization is a perfect example of this.

- Through dedup, space can be saved in situations where the duplicate nature of data is not known, or not known beforehand. Again, in virtualization scenarios, this could be common modifications to VM images that are performed multiple times but not anticipated, such as extra software, OS patches, or simply many users saving the same files to their local desktops.

To go back to the "cp" example: If someone logs into a VM that is backed by ZFS with dedup enabled, then copies a file, the extra space that the file will take will be minimal. The act of copying the file will break down into a series of blocks that will be recognized as duplicate blocks. This is completely independent of the clone nature of the underlying VM's backing store.

But I agree that the biggest savings are to be expected from cloning first, as they typically translate into n GB (for the base image) x # of users, which is a _lot_. Dedup is still the icing on the cake for all those data blocks that were unforeseen. And that can be a lot, too, as everyone who has seen cluttered desktops full of downloaded files can probably confirm.

Cheers,
  Constantin

--
Constantin Gonzalez Schmitz, Sales Consultant, Oracle Hardware Presales Germany
Phone: +49 89 460 08 25 91 | Mobile: +49 172 834 90 30
Blog: http://constantin.glez.de/ | Twitter: zalez
ORACLE Deutschland B.V. & Co. KG, Sonnenallee 1, 85551 Kirchheim-Heimstetten
On Thu, 2011-05-05 at 09:02 -0400, Edward Ned Harvey wrote:
> I assume you're talking about a situation where there is an initial VM image,
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?

No. Obviously if you can clone, it's better. But sometimes you can't do this even with v12n, and we have this situation at customer sites today. (I have always said, zfs clone is far easier, far more proven, and far more efficient, *if* you can control the "ancestral" relationship to take advantage of the clone.)

For example, one area where cloning can't help is with patches and updates. In some instances these can get quite large, and across 1000s of VMs the space required can be considerable.

> When I said dedup wasn't good for VMs, what I'm talking about is: if there is
> data inside the VM which is cloned... For example, if somebody logs into the
> guest OS and then does a "cp" operation... then dedup on the host is unlikely
> to be able to recognize that data as cloned data inside the virtual disk.

I disagree. I believe that within the VMDKs data is aligned nicely, since these are disk images. At any rate, we are seeing real (and large) dedup ratios in the field when used with v12n. In fact, this is the killer app for dedup.

> Is this a zfs discussion list, or a Nexenta sales & promotion list?

My point here was that there is a lot of half-baked advice being given... the idea that you should only use dedup if you have a bunch of zeros on your disk images is absolutely and totally nuts, for example. It doesn't match real-world experience, and it doesn't match the theory either.

And sometimes real-world experience trumps the theory. I've been shown on numerous occasions that ideas that I thought were half-baked turned out to be very effective in the field, and vice versa. (I'm a developer, not a systems engineer. Fortunately I have a very close working relationship with a couple of awesome systems engineers.)

Folks come here looking for advice. I think the advice that, if you're contemplating these kinds of solutions, you should get someone with real-world experience solving these kinds of problems every day, is very sound advice. Trying to pull out the truths from the myths I see stated here nearly every day is going to be difficult for the average reader here, I think.

 - Garrett
> I assume you're talking about a situation where there is an initial VM image, and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?
>
> When I said dedup wasn't good for VMs, what I'm talking about is: if there is data inside the VM which is cloned... For example, if somebody logs into the guest OS and then does a "cp" operation... then dedup on the host is unlikely to be able to recognize that data as cloned data inside the virtual disk.

I have the same opinion. When talking with customers about the use of dedup and cloning, the answer is simple: when you know that duplicates will occur but don't know when, use dedup; when you know that duplicates will occur and that they are there from the beginning, use cloning. Thus VM images practically cry out for cloning. I'm not a fan of dedup for VMs.

I have heard the argument "but what about VM patching" once. Aside from the problem of detecting the clones, I wouldn't patch each VM; I would patch the master image and regenerate the clones, especially for a general patching session. Just saving a gigabyte because a patch landed on 2 or 3 of 100 servers isn't worth spending a lot of memory on dedup. The simple reason: patching each VM on its own is likely to increase VM sprawl. All I save is some iron, but I'm not simplifying administration. However, this needs good administrative processes. You can use dedup for VMs, but I'm not sure you should...

> Is this a zfs discussion list, or a nexenta sales & promotion list?

Well... I have an opinion on how he sees that... however, it's just my own ;)

--
Joerg Moellenkamp | Sales Consultant, Oracle Hardware Presales - Nord
ORACLE Deutschland B.V. & Co. KG | Nagelsweg 55 | 20097 Hamburg
On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> Generally speaking, dedup doesn't work on VM images. (Same is true for ZFS
> or netapp or anything else.) Because the VM images are all going to have
> their own filesystems internally with whatever blocksize is relevant to the
> guest OS. If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks... Then even when you write duplicated data inside
> the guest, the host won't see it as a duplicated block.

A zvol with 4k blocks should give you decent results with Windows guests. Recent versions use 4k alignment by default and 4k blocks, so there should be lots of duplicates for a base OS image.

> There are some situations where dedup may help on VM images... For example
> if you're not using sparse files and you have a zero-filled disk... But in

compression=zle works even better for these cases, since it doesn't require DDT resources.

> Or if you're intimately familiar with both the guest & host filesystems, and
> you choose blocksizes carefully to make them align. But that seems
> complicated and likely to fail.

Using a 4k block size is a safe bet, since most OSs use a block size that is a multiple of 4k. It's the same reason that the new "Advanced Format" drives use 4k sectors.

Windows uses 4k alignment and 4k (or larger) clusters.

ext3/ext4 uses 1k, 2k, or 4k blocks. Filesystems over 512MB should use 4k blocks by default. The block alignment is determined by the partitioning, so some care needs to be taken there.

zfs uses 'ashift'-size blocks. I'm not sure what ashift works out to be when using a zvol though, so it could be as small as 512B but may be set to the same as the volblocksize property.

ufs is 4k or 8k on x86 and 8k on sun4u. As with ext4, block alignment is determined by partitioning and slices.

-B

--
Brandon High : bhigh at freaks.com
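[Illustrative aside, not from the thread: a quick way to sanity-check the partition-alignment point Brandon raises. The sketch treats a partition's starting LBA as 512-byte sectors and checks whether it lands on a 4 KiB or 8 KiB boundary. The specific offsets (sector 63 vs. sector 2048) are assumed examples of the old and new Windows partitioning defaults, not values measured on anyone's system.]

#!/usr/bin/env python3
# Check whether a guest partition's start offset is aligned to the host block size.

SECTOR = 512  # bytes per LBA as presented to the guest

def aligned(start_lba, blocksize):
    """True if a partition starting at start_lba falls on a blocksize boundary."""
    return (start_lba * SECTOR) % blocksize == 0

# Classic XP-era offset (sector 63) vs. Vista/2008+ default (sector 2048 = 1 MiB)
for start in (63, 2048):
    for bs in (4096, 8192):
        print("start LBA %5d, host block %5d: %s"
              % (start, bs, "aligned" if aligned(start, bs) else "MISALIGNED"))

The sector-63 case comes out misaligned for every power-of-two block size, which is why older guests tend to dedup (and perform) poorly on 4k-block backing stores regardless of what the guest filesystem does.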
On May 5, 2011, at 2:58 PM, Brandon High wrote:
> On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey wrote:
>> Or if you're intimately familiar with both the guest & host filesystems, and
>> you choose blocksizes carefully to make them align. But that seems
>> complicated and likely to fail.
>
> Using a 4k block size is a safe bet, since most OSs use a block size
> that is a multiple of 4k. It's the same reason that the new "Advanced
> Format" drives use 4k sectors.

Yes, 4KB block sizes are replacing the 512B blocks of yesteryear. However, the real reason the HDD manufacturers headed this way is that they can get more usable bits per platter. The tradeoff is that your workload may consume more real space on the platter than before. TANSTAAFL.

The trick for best performance and the best opportunity for dedup (alignment notwithstanding) is to have a block size that is smaller than your workload. Or: don't bring a 128KB block to a 4KB block battle. For this reason, the default 8KB block size for a zvol is a reasonable choice, but perhaps 4KB is better for many workloads.
 -- richard
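[Illustrative aside on the "block smaller than your workload" point: a back-of-the-envelope calculation of how much data loses its duplicate status when one small guest write lands inside a host record, at several record sizes. The 10 GiB image size and the single 4 KiB guest write are assumptions chosen only for the example.]

#!/usr/bin/env python3
# How much duplicate data a single small write can destroy at different record sizes.

GIB = 1 << 30
image_size = 10 * GIB
guest_write = 4096          # one 4 KiB write inside the guest

for recsize in (4096, 8192, 131072):
    records = image_size // recsize
    dirtied = -(-guest_write // recsize)   # ceiling division: records touched
    lost = dirtied * recsize               # each touched record becomes unique in full
    print("recsize %6d: %8d records in the image, %6d bytes lose dedup per 4 KiB write"
          % (recsize, records, lost))

With 128K records, that one 4 KiB write makes 128 KiB of formerly shared data unique; with 4K records only 4 KiB is lost, which is the arithmetic behind "don't bring a 128KB block to a 4KB block battle".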
On May 5, 2011, at 6:02 AM, Edward Ned Harvey wrote:
> Is this a zfs discussion list, or a nexenta sales & promotion list?

Obviously, this is a Nexenta sales & promotion list. And Oracle. And OSX. And BSD. And Linux. And anyone who needs help or can offer help with ZFS technology :-) This list has never been more diverse.

The only sad part is the unnecessary assassination of the OpenSolaris brand. But life moves on, and so does good technology.
 -- richard-who-is-proud-to-work-at-Nexenta
> From: Brandon High [mailto:bhigh at freaks.com]
>
> On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
> <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>> Generally speaking, dedup doesn't work on VM images. (Same is true for ZFS
>> or netapp or anything else.) Because the VM images are all going to have
>> their own filesystems internally with whatever blocksize is relevant to the
>> guest OS. If the virtual blocks in the VM don't align with the ZFS (or
>> whatever FS) host blocks... Then even when you write duplicated data inside
>> the guest, the host won't see it as a duplicated block.
>
> A zvol with 4k blocks should give you decent results with Windows
> guests. Recent versions use 4k alignment by default and 4k blocks, so
> there should be lots of duplicates for a base OS image.

I agree with everything Brandon said. The one thing I would add is: the "correct" recordsize for each guest machine depends on the filesystem that the guest machine is using. Without knowing a specific filesystem on a specific guest OS, a 4k recordsize sounds like a reasonable general-purpose setting. But if you know more details of the guest, you could hopefully use a larger recordsize and therefore consume less RAM on the host.

If you have to use the 4k recordsize, it is likely to consume 32x more memory than the default 128k recordsize of ZFS. At this rate, it becomes increasingly difficult to justify enabling dedup. But it's certainly possible.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS. At this rate, it becomes
> increasingly difficult to justify enabling dedup. But it's
> certainly possible.

Sorry, I didn't realize... RE just said (and I take his word for it) that the default block size (volblocksize) for a zvol is 8k, while the default recordsize for a ZFS filesystem is 128k. The emphasis is that the memory requirement is a constant multiplied by the number of blocks, so smaller blocks ==> more blocks ==> more memory consumption.

This could be a major difference in implementation... If you use ZFS over NFS as your VM storage backend, it defaults to the 128k recordsize, while ZFS over iSCSI as your VM storage backend defaults to the 8k volblocksize. In either case, you really want to be aware of, and tune, your block size appropriately for the guest(s) you are running.
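[Illustrative aside: a rough sizing sketch for the "memory scales with block count" point. The ~320 bytes assumed per DDT entry is a ballpark figure chosen for illustration only; the real per-entry cost depends on the implementation and on whether L2ARC headers are counted as well, so treat the output as an order-of-magnitude estimate and check a real pool with "zdb -DD <pool>" rather than trusting the constant.]

#!/usr/bin/env python3
# Rough DDT footprint: entries scale with the number of unique blocks.

TIB = 1 << 40
BYTES_PER_ENTRY = 320  # assumed ballpark, not a measured figure

def ddt_bytes(data_bytes, blocksize, bytes_per_entry=BYTES_PER_ENTRY):
    blocks = data_bytes // blocksize
    return blocks * bytes_per_entry

for bs in (4096, 8192, 131072):
    est = ddt_bytes(1 * TIB, bs)
    print("1 TiB of unique data at %6d-byte blocks: ~%.1f GiB of DDT"
          % (bs, est / (1 << 30)))

Whatever constant you plug in, the ratio between the lines is fixed by the block counts: 4k needs 32x the entries of 128k and 2x the entries of the 8k zvol default, which is the relationship being discussed here.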
On Thu, May 5, 2011 at 8:50 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS. At this rate, it becomes
> increasingly difficult to justify enabling dedup. But it's
> certainly possible.

You're forgetting that zvols use an 8k volblocksize by default. If you're currently exporting volumes with iSCSI, it's only a 2x increase. The tradeoff is that you should have more duplicate blocks, and reap the rewards there. I'm fairly certain that won't offset the large increase in the size of the DDT, however. Dedup with zvols is probably never a good idea as a result.

Only if you're hosting your VM images in .vmdk files will you get 128k blocks. Of course, your chance of getting many identical blocks then gets much, much smaller. You'll have to worry about the guests' block alignment in the context of the image file, since two identical files may not create identical blocks as seen from ZFS. This means you may get only fractional savings and still have an enormous DDT.

-B

--
Brandon High : bhigh at freaks.com
On Wed, May 04, 2011 at 08:49:03PM -0700, Edward Ned Harvey wrote:
>> From: Tim Cook [mailto:tim at cook.ms]
>>
>> That's patently false. VM images are the absolute best use-case for dedup
>> outside of backup workloads. I'm not sure who told you/where you got the
>> idea that VM images are not ripe for dedup, but it's wrong.
>
> Well, I got that idea from this list. I said a little bit about why I
> believed it was true... about dedup being ineffective for VMs... Would
> you care to describe a use case where dedup would be effective for a VM? Or
> perhaps cite something specific, instead of just wiping the whole thing and
> saying "patently false"? I don't feel like this comment was productive...

We use dedupe on our VMware datastores and typically see 50% savings, often times more. We do of course keep "like" VMs on the same volume (at this point nothing more than groups of Windows VMs, Linux VMs and so on).

Note that this isn't on ZFS (yet), but we hope to begin experimenting with it soon (using NexentaStor).

Apologies for devolving the conversation too much in the NetApp direction -- simply was a point of reference for me to get a better understanding of things on the ZFS side. :)

Ray
On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson <rvandolson at esri.com> wrote:
> We use dedupe on our VMware datastores and typically see 50% savings,
> often times more. We do of course keep "like" VMs on the same volume

I think NetApp uses 4k blocks by default, so the block size and alignment should match up for most filesystems and yield better savings.

Your server's resource requirements for ZFS and dedup will be much higher due to the large DDT, as you initially suspected.

If bp_rewrite is ever completed and released, this might change. It should allow for offline dedup, which may make dedup usable in more situations.

> Apologies for devolving the conversation too much in the NetApp
> direction -- simply was a point of reference for me to get a better
> understanding of things on the ZFS side. :)

It's good to compare the two, since they have a pretty large overlap in functionality but sometimes very different implementations.

-B

--
Brandon High : bhigh at freaks.com
On 05/ 6/11 07:21 PM, Brandon High wrote:
> On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson <rvandolson at esri.com> wrote:
>> We use dedupe on our VMware datastores and typically see 50% savings,
>> often times more. We do of course keep "like" VMs on the same volume
>
> I think NetApp uses 4k blocks by default, so the block size and
> alignment should match up for most filesystems and yield better
> savings.

That assumes the VMware datastores are on NFS? Otherwise the VMware filesystem, VMFS, uses its own block sizes from 1M to 8M, so the important point is to align the guest OS partition to 1M, and Windows guests starting with Vista/2008 do that by default now.

Regards,
On Mon, May 9, 2011 at 2:11 AM, Evaldas Auryla <evaldas.auryla at edqm.eu> wrote:
> On 05/ 6/11 07:21 PM, Brandon High wrote:
>> On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson <rvandolson at esri.com> wrote:
>>> We use dedupe on our VMware datastores and typically see 50% savings,
>>> often times more. We do of course keep "like" VMs on the same volume
>>
>> I think NetApp uses 4k blocks by default, so the block size and
>> alignment should match up for most filesystems and yield better
>> savings.
>
> That assumes the VMware datastores are on NFS? Otherwise the VMware filesystem,
> VMFS, uses its own block sizes from 1M to 8M, so the important point is to
> align the guest OS partition to 1M, and Windows guests starting with Vista/2008
> do that by default now.

The VMFS filesystem itself is aligned by NetApp at LUN creation time. You still align to a 4K block on the filer because there is no way to automatically align an encapsulated guest, especially when you could have different guest OS types on a LUN.

--Tim