Hi All;

Is there any hope for deduplication on ZFS?

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com
Mertol,

Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. It would be interesting to see where it fits into our customers' priorities for ZFS. We have a long laundry list of projects, and in addition there are bug fixes and performance changes that customers are demanding.

Neil.

Mertol Ozyoney wrote:
> Hi All;
>
> Is there any hope for deduplication on ZFS?
>
> Mertol
A really smart nexus for dedup is right when archiving takes place. For systems like EMC Centera, dedup is basically a byproduct of checksumming. Two files with similar metadata that have the same hash? They're identical.

Charles

On 7/7/08 4:25 PM, "Neil Perrin" <Neil.Perrin at Sun.COM> wrote:
> Yes, dedup is certainly on our list and has been actively
> discussed recently, so there's hope and some forward progress.
Even better would be using the ZFS block checksums (assuming we are only summing the data, not its position or time :)...

Then we could have two files that have 90% the same blocks, and still get some dedup value... ;)

Nathan.

Charles Soto wrote:
> A really smart nexus for dedup is right when archiving takes place. For
> systems like EMC Centera, dedup is basically a byproduct of checksumming.
> Two files with similar metadata that have the same hash? They're identical.
Neil Perrin wrote:
> Yes, dedup is certainly on our list and has been actively
> discussed recently, so there's hope and some forward progress.

I want to cast my vote for getting dedup on ZFS. One place we currently use ZFS is as nearline storage for backup data. I have a 16TB server that provides a file store for an EMC Networker server. I'm seeing a compressratio of 1.73, which is mighty impressive, since we also use native EMC compression during the backups. But with dedup, we should see way more. Here at UCB SSL, we have demoed and investigated various dedup products, hardware and software, but they are all steep on the ROI curve. I would be very excited to see block-level ZFS deduplication roll out, especially since we already have the infrastructure in place using Solaris/ZFS.

Cheers,

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
On Tue, 8 Jul 2008, Nathan Kroenert wrote:
> Even better would be using the ZFS block checksums (assuming we are only
> summing the data, not its position or time :)...
>
> Then we could have two files that have 90% the same blocks, and still
> get some dedup value... ;)

It seems that the hard problem is not whether ZFS has the structure to support it (the implementation seems pretty obvious), but rather that ZFS is supposed to be able to scale to extremely large sizes. If you have a petabyte of storage in the pool, then the data structure to keep track of block similarity could grow exceedingly large. The block checksums are designed to be as random as possible, so their value does not suggest anything regarding the similarity of the data unless the values are identical. The checksums have enough bits and randomness that binary trees would not scale.

Except for the special case of backups or cloned server footprints, it does not seem that data deduplication is going to save the 90% (or more) space that Quantum claims at http://www.quantum.com/Solutions/datadeduplication/Index.aspx.

ZFS clones already provide a form of data deduplication.

The actual benefit of data deduplication to an enterprise seems negligible unless the backup system directly supports it. In the enterprise the cost of storage has more to do with backing up the data than the amount of storage media consumed.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
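To put a rough number on the scaling concern above, here is a back-of-envelope calculation; the per-entry size is an assumption for illustration, not a measured ZFS figure:

# Rough, back-of-envelope estimate of how large a per-block dedup table
# could get for a petabyte-class pool.  The per-entry size is an assumed
# figure (256-bit checksum plus a block address and a reference count),
# not anything measured from ZFS.

POOL_BYTES  = 10**15          # 1 PB of data in the pool
RECORD_SIZE = 128 * 1024      # default ZFS recordsize
ENTRY_BYTES = 32 + 16 + 8     # checksum + block address + refcount (assumed)

blocks = POOL_BYTES // RECORD_SIZE
table  = blocks * ENTRY_BYTES

print("blocks to track: %.1e" % blocks)          # ~7.6e9 blocks
print("table size: %.0f GB" % (table / 1e9))     # ~430 GB
# With smaller (e.g. 8 KB) records the table is 16x larger still --
# far too big to hold in memory, which is the scaling problem above.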
On Mon, 7 Jul 2008, Jonathan Loran wrote:
> use ZFS is as nearline storage for backup data. I have a 16TB server
> that provides a file store for an EMC Networker server. I'm seeing a
> compressratio of 1.73, which is mighty impressive, since we also use
> native EMC compression during the backups. But with dedup, we should
> see way more.

I was going to say something smart about how zfs could contribute to improved serialized compression. However, I retract that and think that when one starts with order, it is best to preserve order and not attempt to re-create order once things have devolved into utter chaos.

This deduplication technology seems similar to the Microsoft ads I see on TV which advertise how their new technology saves the customer 20% of the 500% additional cost incurred by Microsoft's previous technology (which was itself a band-aid to a previous technology).

Sun/Solaris should be about being smarter rather than working harder. If data devolution is a problem, it is most likely that the solution is to investigate the root causes and provide solutions which do not lead to devolution. For example, if Berkeley has 30,000 students who all require a home directory with similar stuff, perhaps they can be initialized using ZFS clones so that there is little waste of space until a student modifies an existing file.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, Jul 07, 2008 at 07:56:26PM -0500, Bob Friesenhahn wrote:
>
> This deduplication technology seems similar to the Microsoft ads I
> see on TV which advertise how their new technology saves the customer

Quantum's claim of 20:1 just doesn't jibe in my head, either, for some reason.

-brian
On Mon, Jul 7, 2008 at 7:40 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> The actual benefit of data deduplication to an enterprise seems
> negligible unless the backup system directly supports it. In the
> enterprise the cost of storage has more to do with backing up the data
> than the amount of storage media consumed.

Real data...

I did a survey of about 120 (mostly sparse) zone roots installed over an 18 month period and used for normal enterprise activity. Each zone root is installed into its own SVM soft partition with a strong effort to isolate application data elsewhere. Each zone's /var (including /var/tmp) was included in the survey.

My mechanism involved calculating the md5 checksum of every 4 KB block from the SVM raw device. This size was chosen because it is the fixed block size of the player in the market that does deduplication of live data today.

My results were that I found that I had 75% duplicate data - with no special effort to minimize duplicate data. If other techniques were applied to minimize duplicate data (e.g. periodic write of zeros over free space, extend file system to do the same for freed blocks, mount with noatime, etc.) or full root zones (or LDoms) were the subject of the test I would expect a higher level of duplication.

Supposition...

As I have considered deduplication for application data I see several things happen in various areas.

- Multiple business application areas use very similar software.

  When looking at various applications that directly (conscious choice) or indirectly (embedded in some other product) use various web servers, application servers, databases, etc., each application administrator uses the same installation media to perform an installation into a private (but commonly NFS mounted) area. Many/most of these applications do a full installation of java which is a statistically significant size of the installation.

- Maintenance activity creates duplicate data.

  When patching, upgrading, or otherwise performing maintenance, it is common to make a full copy or a fresh installation of the software. This allows most of the maintenance activity to be performed when the workload is live as well as rapid fallback by making small configuration changes. The vast majority of the data in these multiple versions is identical (e.g. small percentage of jars updated, maybe a bit of the included documentation, etc.)

- Application distribution tools create duplicate data.

  Some application-level clustering technologies cause a significant amount of data to be sent from the administrative server to the various cluster members. By application server design, this is duplicate data. If that data all resides on the same highly redundant storage frame, it could be reduced back down to one (or fewer) copies.

- Multiple development and release trees are duplicate.

  When various developers check out code from a source code repository or a single developer has multiple copies to work on different releases, the checked out code is nearly 100% duplicate, and objects that are created during builds may be highly duplicate.

- Relying on storage-based snapshots and clones is impractical.

  There tend to be organizational walls between those that manage storage and those that consume it. As storage is distributed across a network (NFS, iSCSI, FC) things like delegated datasets and RBAC are of limited practical use.
  Due to these factors and likely others, storage snapshots and clones are only used for the few cases where there is a huge financial incentive with minimal administrative effort. Deduplication could be deployed on the back end to do what clones can't do due to non-technical reasons.

- Clones diverge permanently but shouldn't.

  If I have a 3 GB OS image (inside an 8 GB block device) that I am patching, there is a reasonable chance that I unzip 500 MB of patches to the system, apply the patches, then remove them. If deduplication is done at the block device level (e.g. iSCSI LUNs shared from a storage server) the space "uncloned" by extracting the patches remains per-server used space. Additionally the other space used by the installed patches remains used. Deduplication can reclaim the majority of the space.

--
Mike Gerdts
http://mgerdts.blogspot.com/
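For anyone who wants to repeat this kind of survey on their own zone roots or disk images, a rough sketch of the approach described above might look like the following; the 4 KB block size and md5 match the description, but the script is only an illustration, not the actual tool used for that survey:

#!/usr/bin/env python
# Sketch of a duplicate-block survey over a raw device or disk image:
# read fixed 4 KB blocks, hash each with md5, and report what fraction
# of the blocks are duplicates.  Point it at your own device path or
# image file; this is an illustration, not a production tool.

import hashlib, sys

BLOCK = 4 * 1024    # fixed 4 KB blocks, as in the survey described above

def survey(path):
    counts = {}
    total = 0
    with open(path, 'rb') as dev:
        while True:
            buf = dev.read(BLOCK)
            if not buf:
                break
            total += 1
            digest = hashlib.md5(buf).digest()
            counts[digest] = counts.get(digest, 0) + 1
    unique = len(counts)
    print("%s: %d blocks, %d unique, %.1f%% duplicate"
          % (path, total, unique, 100.0 * (total - unique) / max(total, 1)))

if __name__ == '__main__':
    for p in sys.argv[1:]:
        survey(p)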
I second this, provided we also check that the data is in fact identical. Checksum collisions are likely given the sizes of disks and the sizes of checksums, and some users actually deliberately generate data with colliding checksums (researchers and nefarious users).

Dedup must be absolutely safe, and users should decide if they want the cost of checking blocks versus the space saving.

Maurice

On 08/07/2008, at 10:00 AM, Nathan Kroenert wrote:
> Even better would be using the ZFS block checksums (assuming we are only
> summing the data, not its position or time :)...
>
> Then we could have two files that have 90% the same blocks, and still
> get some dedup value... ;)
>
> Nathan.
Good points. I see the archival process as a good candidate for adding dedup because it is essentially doing what a stage/release archiving system already does - "faking" the existence of data via metadata. Those blocks aren't actually there, but they're still "accessible" because they're *somewhere* the system knows about (i.e. the "other twin").

Currently in SAMFS, if I store two identical files on the archiving filesystem and my policy generates 4 copies, I will have created 8 copies of the file (albeit with different metadata). Dedup would help immensely here. And as archiving (data management) is inherently a "costly" operation, it's used where potentially slower access to data is acceptable.

Another system that comes to mind that utilizes dedup is Xythos WebFS. As Bob points out, keeping track of dupes is a chore. IIRC, WebFS uses a relational database to track this (among much of its other metadata).

Charles

On 7/7/08 7:40 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:
> It seems that the hard problem is not whether ZFS has the structure to
> support it (the implementation seems pretty obvious), but rather that
> ZFS is supposed to be able to scale to extremely large sizes. If you
> have a petabyte of storage in the pool, then the data structure to
> keep track of block similarity could grow exceedingly large.
On Mon, 7 Jul 2008, Mike Gerdts wrote:
>
> As I have considered deduplication for application data I see several
> things happen in various areas.

You have provided an excellent description of gross inefficiencies in the way systems and software are deployed today, resulting in massive duplication. Massive duplication is used to ease service deployment and management. Most of this massive duplication is not technically necessary.

> There tend to be organizational walls between those that manage
> storage and those that consume it. As storage is distributed across
> a network (NFS, iSCSI, FC) things like delegated datasets and RBAC
> are of limited practical use.

It seems that deduplication on the server does not provide much benefit to the client since the client always sees a duplicate. It does not know that it doesn't need to cache or copy a block twice because it is a duplicate. Only the server benefits from the deduplication, except that maybe server-side caching improves and provides the client with a bit more performance.

While deduplication can obviously save server storage space, it does not seem to help much for backups, and it does not really help the user manage all of that data. It does help the user in terms of less raw storage space, but there is surely a substantial run-time cost associated with the deduplication mechanism. None of the existing applications (based on POSIX standards) has any understanding of deduplication, so they won't benefit from it. If you use tar, cpio, or 'cp -r' to copy the contents of a directory tree, they will transmit just as much data as before, and if the destination does real-time deduplication, then the copy will be slower. If the copy is to another server, then the copy time will be huge, just like before.

Unless the backup system fully understands and has access to the filesystem deduplication mechanism, it will be grossly inefficient just like before. Recovery from a backup stored in a sequential (e.g. tape) format which does understand deduplication would be quite interesting indeed.

Raw storage space is cheap. Managing the data is what is expensive. Perhaps deduplication is a response to an issue which should be solved elsewhere?

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, Jul 7, 2008 at 9:24 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> It seems that deduplication on the server does not provide much benefit to
> the client since the client always sees a duplicate. It does not know that
> it doesn't need to cache or copy a block twice because it is a duplicate.
> Only the server benefits from the deduplication except that maybe
> server-side caching improves and provides the client with a bit more
> performance.

I want the deduplication to happen where it can be most efficient. Just like with snapshots and clones, the client will have no idea that multiple metadata sets point to the same data. If deduplication makes it so that each GB of perceived storage is cheaper, clients benefit because the storage provider is (or should be) charging less.

> While deduplication can obviously save server storage space, it does not
> seem to help much for backups, and it does not really help the user manage
> all of that data.

I agree. Follow-on work needs to happen in the backup and especially restore areas. The first phase of work in this area is complete when a full restore of all data (including snapshots and clones) takes the same amount of space as was occupied during backup.

I suspect that if you take a look at the processor utilization on most storage devices you will find that there are lots of times that the processors are relatively idle. Deduplication can happen in real time when the processors are not very busy, but dirty block analysis should be queued during times of high processor utilization. If you find that the processor can't keep up with the deduplication workload, it suggests that your processors aren't fast/plentiful enough or you have deduplication enabled on inappropriate data sets. The same goes for I/O induced by the dedup process.

In another message it was suggested that the size of the checksum employed by zfs is so large that maintaining a database of the checksums would be too costly. It may be that a multi-level checksum scheme is needed. That is, perhaps the database of checksums uses a 32-bit or 64-bit hash of the 256-bit checksum. If a hash collision occurs then normal I/O routines are used for comparing the checksums. If they are also the same, then compare the data. It may be that the intermediate comparison is more overhead than is needed, because one set of data is already in cache and in the worst case an I/O is needed for the checksum or the data.
Why do two I/Os if only one is needed?

> Unless the backup system fully understands and has access to the filesystem
> deduplication mechanism, it will be grossly inefficient just like before.
> Recovery from a backup stored in a sequential (e.g. tape) format which does
> understand deduplication would be quite interesting indeed.

Right now it is a mess. Take a look at the situation for restoring snapshots/clones and you will see that unless you use deduplication during restore you need to go out and buy a lot of storage to do a restore of highly duplicate data.

> Raw storage space is cheap. Managing the data is what is expensive.

The systems that make the raw storage scale to petabytes of fault tolerant storage are very expensive and sometimes quite complex. Needing fewer or smaller spindles should mean less energy consumption, less space, lower MTTR, higher MTTDL, and less complexity in all the hardware used to string it all together.

> Perhaps deduplication is a response to an issue which should be solved
> elsewhere?

Perhaps. However, I take a look at my backup and restore options for zfs today and don't think the POSIX API is the right way to go - at least as I've seen it used so far. Unless something happens that makes restores of clones retain the initial space efficiency or deduplication hides the problem, clones are useless in most environments. If this problem is solved via fixing backups and restores, deduplication seems even more like the next step to take for storage efficiency. If it is solved by adding deduplication then we get the other benefits of deduplication at the same time.

And after typing this message, deduplication is henceforth known as "d11n". :)

--
Mike Gerdts
http://mgerdts.blogspot.com/
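A minimal sketch of the two-level lookup idea described above (a small 64-bit key kept in memory, with the full 256-bit checksum only compared on a candidate hit); the sizes and names here are assumptions for illustration, not ZFS internals:

# Minimal sketch of the two-level scheme: keep only a small (64-bit) key
# per block in memory, and compare the full 256-bit checksum only when
# the small key matches.  Sizes and structures are assumptions for
# illustration, not ZFS internals.

import hashlib

small_index = {}    # 64-bit key -> list of (full checksum, block reference)

def lookup_or_add(data, block_ref):
    """Return an existing block's reference if this data is a probable
    duplicate, otherwise record the new block and return None."""
    full = hashlib.sha256(data).digest()
    key = full[:8]                      # small hash of the big checksum
    for cand_sum, cand_ref in small_index.get(key, []):
        if cand_sum == full:            # worst case: one extra fetch/compare
            return cand_ref             # probable duplicate
    small_index.setdefault(key, []).append((full, block_ref))
    return None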
Oh, I agree. Much of the duplication described is clearly the result of "bad design" in many of our systems. After all, most of an OS can be served off the network (diskless systems etc.). But much of the dupe I'm talking about is less about not using the most efficient system administration tricks. Rather, it's about the fact that software (e.g. Samba) is used by people, and people don't always do things efficiently.

Case in point: students in one of our courses were hitting their quota by growing around 8GB per day. Rather than simply agree that "these kids need more space," we had a look at the files. Turns out just about every student copied a 600MB file into their own directories, as it was created by another student to be used as a "template" for many of their projects. Nobody understood that they could use the file right where it sat. Nope. 7GB of dupe data. And these students are even familiar with our practice of putting "class media" on a read-only share (these files serve as similar "templates" for their own projects - you can create a full video project with just a few MB in your "project file" this way).

So, while much of the situation is caused by "bad data management," there aren't always systems we can employ that prevent it. Done right, dedup can certainly be "worth it" for my operations. Yes, teaching the user the "right thing" is useful, but that user isn't there to know how to "manage data" for my benefit. They're there to learn how to be filmmakers, journalists, speech pathologists, etc.

Charles

On 7/7/08 9:24 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:
> You have provided an excellent description of gross inefficiencies in
> the way systems and software are deployed today, resulting in massive
> duplication. Massive duplication is used to ease service deployment
> and management. Most of this massive duplication is not technically
> necessary.
On Mon, Jul 7, 2008 at 11:07 PM, Charles Soto <csoto at mail.utexas.edu> wrote:
> So, while much of the situation is caused by "bad data management," there
> aren't always systems we can employ that prevent it. Done right, dedup can
> certainly be "worth it" for my operations.

Well said.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient, but I'd like to know how much it would actually benefit our dataset: HiRISE has a large set of spacecraft data (images) that could potentially have large amounts of redundancy, or not. Also, other up and coming missions have a large data volume with a lot of duplicate image info and a small budget; with "d11p" in OpenSolaris there is a good business case to invest in Sun/OpenSolaris rather than buy the cheaper storage (+ linux?) that can simply hold everything as is.

If someone feels like coding a tool up that basically makes a file of checksums and counts how many times a particular checksum gets hit over a dataset, I would be willing to run it and provide feedback. :)

-Tim

Charles Soto wrote:
> So, while much of the situation is caused by "bad data management," there
> aren't always systems we can employ that prevent it. Done right, dedup can
> certainly be "worth it" for my operations. Yes, teaching the user the
> "right thing" is useful, but that user isn't there to know how to "manage
> data" for my benefit. They're there to learn how to be filmmakers,
> journalists, speech pathologists, etc.
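A rough sketch of such a tool might look like the following; the 128 KB chunk size (the default ZFS recordsize) and sha256 are assumptions, and actual on-disk block boundaries could make the real dedup ratio differ:

#!/usr/bin/env python
# Sketch of a dedup-estimation tool: walk a directory tree, checksum every
# fixed-size chunk of every file, and report how often checksums repeat.
# The 128 KB chunk size (default ZFS recordsize) and sha256 are assumptions;
# actual on-disk block layout may differ.

import hashlib, os, sys

CHUNK = 128 * 1024

def scan(root):
    hits = {}
    total = 0
    for dirpath, dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    while True:
                        buf = f.read(CHUNK)
                        if not buf:
                            break
                        d = hashlib.sha256(buf).digest()
                        hits[d] = hits.get(d, 0) + 1
                        total += 1
            except (IOError, OSError):
                continue                    # skip unreadable files
    unique = len(hits)
    print("%d chunks, %d unique, %.1f%% potentially dedupable"
          % (total, unique, 100.0 * (total - unique) / max(total, 1)))

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else '.')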
> Raw storage space is cheap. Managing the data is what is expensive.

Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time than a scheduled dedup.

> Perhaps deduplication is a response to an issue which should be solved
> elsewhere?

I don't think you can make this generalisation. For most people, yes, but not everyone.

cheers,
--justin
> Does anyone know a tool that can look over a dataset and give
> duplication statistics? I'm not looking for something incredibly
> efficient but I'd like to know how much it would actually benefit our

Check out the following blog:

http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool
Justin Stringfellow wrote:
>> Raw storage space is cheap. Managing the data is what is expensive.
>
> Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have
> stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time
> than a scheduled dedup.
>
>> Perhaps deduplication is a response to an issue which should be solved
>> elsewhere?
>
> I don't think you can make this generalisation. For most people, yes, but not everyone.

Frankly, while I tend to agree with Bob that backend dedup is something that ever-cheaper disks and client-side misuse make unnecessary, I would _very_ much like us to have some mechanism by which we could have some sort of a 'pay-per-feature' system, so people who disagree with me can still get what they want. <grin>

By that, I mean something along the lines of a 'bounty' system where folks pony up cash for features. I'd love to have many more outside (from Sun) contributors to the OpenSolaris base, ZFS in particular. Right now, virtually all the development work is being driven by internal-to-Sun priorities, which, given that Sun pays the developers, is OK. However, I would really like to have some direct method where outsiders can show to Mgmt that there is direct cash for certain improvements.

For Justin, it sounds like being able to pony up several thousand (minimum) for a desired feature would be no problem. And, for the rest of us, I can think that a couple of hundred of us putting up $100 each to get RAIDZ expansion might move it to the front of the TODO list. <wink> Plus, we might be able to attract some more interest from the hobbyist folks that way. :-)

Buying a service contract and then bugging your service rep doesn't say the same thing as "I'm willing to pony up $10k right now for feature X". Big customers have weight to throw around, but we need some mechanism where a mid/small guy can make a real statement, and back it up.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Just going to make a quick comment here. It's a good point about wanting backup software to support this; we're a much smaller company but it's already more difficult to manage the storage needed for backups than our live storage.

However, we're actively planning that over the next 12 months, ZFS will actually *be* our backup system, so for us just ZFS and send/receive supporting de-duplication would be great :) In fact, I can see that being useful for a number of places. ZFS send/receive is already a good way to stream incremental changes and keep filesystems in sync. Having de-duplication built into that can only be a good thing.

PS. Yes, we'll still have off-site tape backups just in case, but the vast majority of our backup & restore functionality (including two off-site backups) will be just ZFS.
> Even better would be using the ZFS block checksums (assuming we are only
> summing the data, not its position or time :)...
>
> Then we could have two files that have 90% the same blocks, and still
> get some dedup value... ;)

Yes, but you will need to add some sort of highly collision-resistant checksum (sha+md5 maybe) and code to (a) bit-level compare blocks on collision (100% bit verification) and (b) handle linked or cascaded collision tables (2+ blocks with the same hash but differing bits).

I actually coded some of this and was playing with it. My testbed relied on another internal data store to track hash maps, collisions (dedup lists) and collision cascades (kind of like what perl does with hash key collisions). It turned out to be a real pain when taking into account snaps and clones. I decided to wait until the resilver/grow/remove code was in place, as this seems to be part of the puzzle.

-Wade
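A toy model of the verify-and-cascade approach described above; sha256 is used here as a stand-in for whatever strong checksum is chosen, and this is illustrative only, not the testbed code mentioned:

# Toy model of verify-and-cascade: blocks are grouped by a strong digest,
# but a byte-for-byte comparison decides whether a new block truly matches
# an existing entry or must be cascaded as a second entry under the same
# digest.

import hashlib

class DedupTable(object):
    def __init__(self):
        self.chains = {}    # digest -> list of entries sharing that digest

    def add(self, data):
        digest = hashlib.sha256(data).digest()
        chain = self.chains.setdefault(digest, [])
        for entry in chain:
            if entry['data'] == data:        # 100% bit verification
                entry['refs'] += 1
                return entry                 # deduped against existing block
        entry = {'data': data, 'refs': 1}    # same digest, different bits:
        chain.append(entry)                  # cascade a new entry
        return entry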
zfs-discuss-bounces at opensolaris.org wrote on 07/08/2008 03:08:26 AM:
> > Does anyone know a tool that can look over a dataset and give
> > duplication statistics? I'm not looking for something incredibly
> > efficient but I'd like to know how much it would actually benefit our
>
> Check out the following blog..:
>
> http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool

Just want to add, while this is ok to give you a ballpark dedup number -- fletcher2 is notoriously collision prone on real data sets. It is meant to be fast at the expense of collisions. This issue can show much more dedup possible than really exists on large datasets.
Wade.Stuart at fallon.com wrote:
> Just want to add, while this is ok to give you a ballpark dedup number --
> fletcher2 is notoriously collision prone on real data sets. It is meant to
> be fast at the expense of collisions. This issue can show much more dedup
> possible than really exists on large datasets.

Doing this using sha256 as the checksum algorithm would be much more interesting. I'm going to try that now and see how it compares with fletcher2 for a small contrived test.

--
Darren J Moffat
Justin Stringfellow wrote:
>> Raw storage space is cheap. Managing the data is what is expensive.
>
> Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have
> stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time
> than a scheduled dedup.

[donning my managerial accounting hat]

It is not a good idea to design systems based upon someone's managerial accounting whims. These are subject to change in illogical ways at unpredictable intervals. This is why managerial accounting can be so much fun for people who want to hide costs. For example, some bright manager decided that they should charge $100/month/port for ethernet drops. So now, instead of having a centralized, managed network with well defined port mappings, every cube has an el-cheapo ethernet switch. Saving money? Not really, but this can be hidden by the accounting.

In the interim, I think you will find that if the goal is to reduce the number of bits stored on some "expensive storage," there is more than one way to accomplish that goal.

-- richard
On Jul 8, 2008, at 11:00 AM, Richard Elling wrote:
> much fun for people who want to hide costs. For example, some bright
> manager decided that they should charge $100/month/port for ethernet
> drops. So now, instead of having a centralized, managed network with
> well defined port mappings, every cube has an el-cheapo ethernet switch.
> Saving money? Not really, but this can be hidden by the accounting.

Indeed, it actively hurts performance (mixing sunray, mobile, and fixed units on the same subnets rather than segregation by type).

--
Keith H. Bierman   khbkhb at gmail.com | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008
On Tue, 8 Jul 2008, Richard Elling wrote:
> [donning my managerial accounting hat]
> It is not a good idea to design systems based upon someone's managerial
> accounting whims. These are subject to change in illogical ways at
> unpredictable intervals. This is why managerial accounting can be so

Managerial accounting whims can be put to good use. If there is desire to reduce the amount of disk space consumed, then the accounting whims should make sure that those who consume the disk space get to pay for it. Apparently this is not currently the case, or else there would not be so much blatant waste. On the flip-side, the approach which results in so much blatant waste may be extremely profitable, so the waste does not really matter.

Imagine if university students were allowed to use as much space as they wanted but had to pay a per-megabyte charge every two weeks or their account is terminated. This would surely result in a huge reduction in disk space consumption.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Something else came to mind which is a negative regarding deduplication. When zfs writes new sequential files, it should try to allocate blocks in a way which minimizes "fragmentation" (disk seeks). Disk seeks are the bane of existing storage systems since they come out of the available IOPS budget, which is only a couple hundred ops/second per drive. The deduplication algorithm will surely result in increasing effective fragmentation (decreasing sequential performance) since duplicated blocks will result in a seek to the master copy of the block followed by a seek to the next block. Disk seeks will remain an issue until rotating media goes away, which (in spite of popular opinion) is likely quite a while from now.

Someone has to play devil's advocate here. :-)

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Hmmn, you might want to look at Andrew Tridgell's thesis (yes, Andrew of Samba fame), as he had to solve this very question to be able to select an algorithm to use inside rsync.

--dave

Darren J Moffat wrote:
> Doing this using sha256 as the checksum algorithm would be much more
> interesting. I'm going to try that now and see how it compares with
> fletcher2 for a small contrived test.

--
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb at sun.com                 |                      -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
zfs-discuss-bounces at opensolaris.org wrote on 07/08/2008 01:26:15 PM:
> Something else came to mind which is a negative regarding
> deduplication. When zfs writes new sequential files, it should try to
> allocate blocks in a way which minimizes "fragmentation" (disk seeks).

Yes, I think it should be close to common sense to realize that you are trading speed for space (but it should be well documented if dedup/squash ever makes it into the codebase). You find these types of tradeoffs in just about every area of disk administration, from the type of raid you select, inode numbers, and block size, to the number of spindles and size of disk you use. The key here is that it would be a choice, just as compression is per fs -- let the administrator choose her path. In some situations it would make sense, in others not.

-Wade

> Someone has to play devil's advocate here. :-)

Debate is welcome, it is the only way to flesh out the issues.
Bob Friesenhahn wrote:
> Something else came to mind which is a negative regarding
> deduplication. When zfs writes new sequential files, it should try to
> allocate blocks in a way which minimizes "fragmentation" (disk seeks).

It should, but because of its copy-on-write nature, fragmentation is a significant part of the ZFS data lifecycle. There was a discussion of this on this list at the beginning of the year...

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-November/044077.html

> Disk seeks are the bane of existing storage systems since they come
> out of the available IOPS budget, which is only a couple hundred
> ops/second per drive. The deduplication algorithm will surely result
> in increasing effective fragmentation (decreasing sequential
> performance) since duplicated blocks will result in a seek to the
> master copy of the block followed by a seek to the next block.

On ZFS, sequential files are rarely sequential anyway. The SPA tries to keep blocks nearby, but when dealing with snapshotted sequential files being rewritten, there is no way to keep everything in order. But if you read through the thread referenced above, you'll see that there's no clear data about just how that impacts performance (I still owe Mr. Elling a filebench run on one of my spare servers).

--Joe
On Tue, 8 Jul 2008, Moore, Joe wrote:
>
> On ZFS, sequential files are rarely sequential anyway. The SPA tries to
> keep blocks nearby, but when dealing with snapshotted sequential files
> being rewritten, there is no way to keep everything in order.

I think that rewriting files (updating existing blocks) is pretty rare. Only limited types of applications do such things. That is a good thing since zfs is not so good at rewriting files. The most common situation is that a new file is written, even if selecting "save" for an existing file in an application. Even if the user thinks that the file is being re-written, usually the application writes to a new temporary file and moves it into place once it is known to be written correctly. The majority of files will be written sequentially and most files will be small enough that zfs will see all the data before it outputs to disk.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Tue, Jul 8, 2008 at 12:25 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Managerial accounting whims can be put to good use. If there is
> desire to reduce the amount of disk space consumed, then the accounting
> whims should make sure that those who consume the disk space get to
> pay for it. Apparently this is not currently the case or else there
> would not be so much blatant waste.

The existence of the waste paves the way for new products to come in and offer competitive advantage over in-place solutions. When companies aren't buying anything due to budget constraints, the only way to make sales is to show businesses that by buying something they will save money - and quickly.

> Imagine if university students were allowed to use as much space as
> they wanted but had to pay a per megabyte charge every two weeks or
> their account is terminated? This would surely result in huge
> reduction in disk space consumption.

If you can offer the perception of more storage because efficiencies of the storage devices make it the same cost as less storage, then perhaps allocating more per student is feasible. Or maybe tuition could drop by a few bucks.

--
Mike Gerdts
http://mgerdts.blogspot.com/
On Tue, Jul 8, 2008 at 1:26 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Something else came to mind which is a negative regarding
> deduplication. When zfs writes new sequential files, it should try to
> allocate blocks in a way which minimizes "fragmentation" (disk seeks).

With L2ARC on SSD, seeks are free and IOPs are quite cheap (compared to spinning rust). Cold reads may be a problem, but there is a reasonable chance that L2ARC sizing can be helpful here.

Also, the blocks that are likely to be duplicate are going to be the same file but just with a different offset. That is, this file is going to be the same in every one of my LDom disk images.

# du -h /usr/jdk/instances/jdk1.5.0/jre/lib/rt.jar
 38M   /usr/jdk/instances/jdk1.5.0/jre/lib/rt.jar

There is a pretty good chance that the first copy will be sequential and as a result all of the deduped copies would be sequential as well. What's more - it is quite likely to be in the ARC or L2ARC.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Tim Spriggs wrote:
> Does anyone know a tool that can look over a dataset and give
> duplication statistics? I'm not looking for something incredibly
> efficient but I'd like to know how much it would actually benefit our
> dataset: HiRISE has a large set of spacecraft data (images) that could
> potentially have large amounts of redundancy, or not.
>
> If someone feels like coding a tool up that basically makes a file of
> checksums and counts how many times a particular checksum gets hit over
> a dataset, I would be willing to run it and provide feedback. :)
>
> -Tim

Me too. Our data profile is just like Tim's: terabytes of satellite data. I'm going to guess that the d11p ratio won't be fantastic for us. I sure would like to measure it though.

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
Justin Stringfellow wrote:
>> Does anyone know a tool that can look over a dataset and give
>> duplication statistics? I'm not looking for something incredibly
>> efficient but I'd like to know how much it would actually benefit our
>
> Check out the following blog:
>
> http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool

Unfortunately we are on Solaris 10 :( Can I get a zdb for zfs V4 that will dump those checksums?

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
Moore, Joe wrote:
> On ZFS, sequential files are rarely sequential anyway. The SPA tries to
> keep blocks nearby, but when dealing with snapshotted sequential files
> being rewritten, there is no way to keep everything in order.

In some cases, a d11p system could actually speed up data reads and writes. If you are repeatedly accessing duplicate data, then you will more likely hit your ARC, and not have to go to disk. With your data d11p, the ARC can hold a significantly higher percentage of your data set, just like the disks. For a d11p ARC, I would expire based upon block reference count. If a block has few references, it should expire first, and vice versa: blocks with many references should be the last out.

With all the savings on disks, think how much RAM you could buy ;)

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
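As a sketch of that eviction policy only (the real ARC is far more sophisticated than this toy), a cache could weight eviction by reference count roughly like this:

# Illustrative-only sketch of a cache that evicts the least-referenced
# (then least-recently-used) block first.

import itertools

class RefCountCache(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = itertools.count()
        self.entries = {}    # block_id -> [refcount, last_use, data]

    def access(self, block_id, data, refcount):
        if block_id not in self.entries and len(self.entries) >= self.capacity:
            # evict the block with the fewest dedup references,
            # breaking ties by least recent use
            victim = min(self.entries,
                         key=lambda b: (self.entries[b][0], self.entries[b][1]))
            del self.entries[victim]
        self.entries[block_id] = [refcount, next(self.clock), data]
        return data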
Mike Gerdts wrote:

[I agree with the comments in this thread, but... I think we're still being old fashioned...]

>> Imagine if university students were allowed to use as much space as
>> they wanted but had to pay a per megabyte charge every two weeks or
>> their account is terminated? This would surely result in huge
>> reduction in disk space consumption.
>
> If you can offer the perception of more storage because of
> efficiencies of the storage devices make it the same cost as less
> storage, then perhaps allocating more per student is feasible. Or
> maybe tuition could drop by a few bucks.

hmm... well, having spent the past two years at the University, I can provide the observation that:

0. Tuition never drops.
1. Everybody (yes everybody) had a laptop. I would say the average hard disk size per laptop was > 100 GBytes.
2. Everybody (yes everybody) had USB flash drives. In part because the school uses them for recruitment tools (give-aways), but they are inexpensive, too.
3. Everybody (yes everybody) had an MP3 player of some magnitude. Many were disk-based, but there were many iPod Nanos, too.
4. > 50% had smart phones -- crackberries, iPhones, etc.
5. The school actually provides some storage space, but I don't know anyone who took advantage of the service. E-mail and document sharing was outsourced to google -- no perceptible shortage of space there. Even Microsoft charges only $3/user/month for exchange and sharepoint services. I think many businesses would be hard-pressed to match that sort of efficiency.

Unlike my undergraduate days, where we had to make trade-offs between beer and floppy disks, there does not seem to be a shortage of storage space amongst the university students today -- in spite of the rise of beer prices recently (hops shortage, they claim ;-O)

Is the era of centralized home directories for students over?

I think that the normal enterprise backup scenarios are more likely to gain from de-dup, in part because they tend to make full backups of systems and end up with zillions of copies of (static) OS files. Actual work files tend to be smaller, for many businesses. De-dup on my desktop seems to be a non-issue.

Has anyone done a full value chain or data path analysis for de-dup? Will de-dup grow beyond the backup function? Will the performance penalty of SHA-256 and bit comparison kill all interactive performance? Should I set aside a few acres at the ranch to grow hops? So many good questions, so little time...

-- richard
> Hi All
> Is there any hope for deduplication on ZFS ?
> Mertol Ozyoney
> Storage Practice - Sales Manager
> Sun Microsystems
> Email mertol.ozyoney at sun.com

There is always hope.

Seriously though, looking at http://en.wikipedia.org/wiki/Comparison_of_revision_control_software there are a lot of choices of how we could implement this.

SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge one of those with ZFS.

It _could_ be as simple (with SVN as an example) as using directory listings to produce files which were then 'diffed'. You could then view the diffs as though they were changes made to lines of source code.

Just add a "tree" subroutine to allow you to grab all the diffs that referenced changes to file 'xyz' and you would have easy access to all the changes of a particular file (or directory).

With the speed-optimized ability added to use ZFS snapshots with the "tree subroutine" to roll back a single file (or directory), you could undo / redo your way through the filesystem.

Using a LKCD (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html) you could "sit out" on the play and watch from the sidelines -- returning to the OS when you thought you were 'safe' (and if not, jumping back out).

Thus, Mertol, it is possible (and could work very well).

Rob
zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM:
> There is always hope.
>
> Seriously though, looking at http://en.wikipedia.org/wiki/Comparison_of_revision_control_software
> there are a lot of choices of how we could implement this.
>
> SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge
> one of those with ZFS.
>
> With the speed-optimized ability added to use ZFS snapshots with the
> "tree subroutine" to roll back a single file (or directory), you could
> undo / redo your way through the filesystem.

dedup is not revision control, you seem to completely misunderstand the problem.

> Using a LKCD (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html)
> you could "sit out" on the play and watch from the sidelines --
> returning to the OS when you thought you were 'safe' (and if not,
> jumping back out).

Now it seems you have veered even further off course. What are you implying the LKCD has to do with zfs, solaris, dedup, let alone revision control software?

-Wade
To do dedup properly, it seems like there would have to be some overly complicated methodology for a sort of delayed dedup of the data. For speed, you''d want your writes to go straight into the cache and get flushed out as quickly as possibly, keep everything as ACID as possible. Then, a dedup scrubber would take what was written, do the voodoo magic of checksumming the new data, scanning the tree to see if there are any matches, locking the duplicates, run the usage counters up or down for that block of data, swapping out inodes, and marking the duplicate data as free space. It''s a lofty goal, but one that is doable. I guess this is only necessary if deduplication is done at the file level. If done at the block level, it could possibly be done on the fly, what with the already implemented checksumming at the block level, but then your reads will suffer because pieces of files can potentially be spread all over hell and half of Georgia on the zdevs. Deduplication is going to require the judicious application of hallucinogens and man hours. I expect that someone is up to the task. On Tue, Jul 22, 2008 at 10:39 AM, <Wade.Stuart at fallon.com> wrote:> zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM: > > > > Hi All > > >Is there any hope for deduplication on ZFS ? > > >Mertol Ozyoney > > >Storage Practice - Sales Manager > > >Sun Microsystems > > > Email mertol.ozyoney at sun.com > > > > There is always hope. > > > > Seriously thought, looking at http://en.wikipedia. > > org/wiki/Comparison_of_revision_control_software there are a lot of > > choices of how we could implement this. > > > > SVN/K , Mercurial and Sun Teamware all come to mind. Simply ;) merge > > one of those with ZFS. > > > > It _could_ be as simple (with SVN as an example) of using directory > > listings to produce files which were then ''diffed''. You could then > > view the diffs as though they were changes made to lines of source code. > > > > Just add a "tree" subroutine to allow you to grab all the diffs that > > referenced changes to file ''xyz'' and you would have easy access to > > all the changes of a particular file (or directory). > > > > With the speed optimized ability added to use ZFS snapshots with the > > "tree subroutine" to rollback a single file (or directory) you could > > undo / redo your way through the filesystem. > > > > > dedup is not revision control, you seem to completely misunderstand the > problem. > > > > > Using a LKCD ( > http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html > > ) you could "sit out" on the play and watch from the sidelines -- > > returning to the OS when you thought you were ''safe'' (and if not, > > jumping backout). > > > > Now it seems you have veered even further off course. What are you > implying the LKCD has to do with zfs, solaris, dedup, let alone revision > control software? > > -Wade > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-- chris -at- microcozm -dot- net === Si Hoc Legere Scis Nimium Eruditionis Habes -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080722/cd9db2ef/attachment.html>
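To make the bookkeeping of such a delayed dedup scrubber concrete, here is a
rough sketch in Python. It is purely illustrative: the in-memory block table,
the per-file block lists and the free list are hypothetical stand-ins for the
on-disk structures a real implementation would walk, and sha256 stands in for
whatever checksum the pool actually uses.

import hashlib
from collections import defaultdict

def dedup_scrub(blocks, file_maps):
    """Offline pass.  blocks: block_id -> bytes; file_maps: filename ->
    list of block_ids.  Returns the set of block_ids freed."""
    canonical = {}               # checksum -> block_id kept as the master copy
    refcount = defaultdict(int)  # block_id -> references after the pass
    freed = set()

    for blk_list in file_maps.values():
        for i, bid in enumerate(blk_list):
            csum = hashlib.sha256(blocks[bid]).hexdigest()
            keeper = canonical.get(csum)
            # Verify byte-for-byte before merging, so a checksum collision
            # can never silently corrupt data.
            if keeper is not None and keeper != bid and blocks[keeper] == blocks[bid]:
                blk_list[i] = keeper     # swap the pointer to the master copy
                refcount[keeper] += 1
                freed.add(bid)           # the duplicate block becomes free space
            else:
                canonical[csum] = bid
                refcount[bid] += 1
    return freed

The hard parts in a real filesystem are exactly the ones hand-waved here:
finding every reference to a block, taking the right locks, and doing it all
without stalling the pool.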
zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 09:58:53 AM:> To do dedup properly, it seems like there would have to be some > overly complicated methodology for a sort of delayed dedup of the > data. For speed, you''d want your writes to go straight into the > cache and get flushed out as quickly as possibly, keep everything as > ACID as possible. Then, a dedup scrubber would take what was > written, do the voodoo magic of checksumming the new data, scanning > the tree to see if there are any matches, locking the duplicates, > run the usage counters up or down for that block of data, swapping > out inodes, and marking the duplicate data as free space.I agree, but what you are describing is file based dedup, ZFS already has the groundwork for dedup in the system (block level checksuming and pointers).> It''s a > lofty goal, but one that is doable. I guess this is only necessary > if deduplication is done at the file level. If done at the block > level, it could possibly be done on the fly, what with the already > implemented checksumming at the block level,exactly -- that is why it is attractive for ZFS, so much of the groundwork is done and needed for the fs/pool already.> but then your reads > will suffer because pieces of files can potentially be spread all > over hell and half of Georgia on the zdevs.I don''t know that you can make this statement without some study of an actual implementation on real world data -- and then because it is block based, you should see varying degrees of this dedup-flack-frag depending on data/usage. For instance, I would imagine that in many scenarios much od the dedup data blocks would belong to the same or very similar files. In this case the blocks were written as best they could on the first write, the deduped blocks would point to a pretty sequential line o blocks. Now on some files there may be duplicate header or similar portions of data -- these may cause you to jump around the disk; but I do not know how much this would be hit or impact real world usage.> Deduplication is going > to require the judicious application of hallucinogens and man hours. > I expect that someone is up to the task.I would prefer the coder(s) not be seeing "pink elephants" while writing this, but yes it can and will be done. It (I believe) will be easier after the grow/shrink/evac code paths are in place though. Also, the grow/shrink/evac path allows (if it is done right) for other cool things like a base to build a roaming defrag that takes into account snaps, clones, live and the like. I know that some feel that the grow/shrink/evac code is more important for home users, but I think that it is super important for most of these additional features. -Wade> On Tue, Jul 22, 2008 at 10:39 AM, <Wade.Stuart at fallon.com> wrote: > zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM: > > > > Hi All > > >Is there any hope for deduplication on ZFS ? > > >Mertol Ozyoney > > >Storage Practice - Sales Manager > > >Sun Microsystems > > > Email mertol.ozyoney at sun.com > > > > There is always hope. > > > > Seriously thought, looking at http://en.wikipedia. > > org/wiki/Comparison_of_revision_control_software there are a lot of > > choices of how we could implement this. > > > > SVN/K , Mercurial and Sun Teamware all come to mind. Simply ;) merge > > one of those with ZFS. > > > > It _could_ be as simple (with SVN as an example) of using directory > > listings to produce files which were then ''diffed''. 
You could then > > view the diffs as though they were changes made to lines of sourcecode.> > > > Just add a "tree" subroutine to allow you to grab all the diffs that > > referenced changes to file ''xyz'' and you would have easy access to > > all the changes of a particular file (or directory). > > > > With the speed optimized ability added to use ZFS snapshots with the > > "tree subroutine" to rollback a single file (or directory) you could > > undo / redo your way through the filesystem. > > >> dedup is not revision control, you seem to completely misunderstand the > problem. > > > > > Using a LKCD(http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html> > ) you could "sit out" on the play and watch from the sidelines -- > > returning to the OS when you thought you were ''safe'' (and if not, > > jumping backout). > >> Now it seems you have veered even further off course. What are you > implying the LKCD has to do with zfs, solaris, dedup, let alone revision > control software? > > -Wade > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > -- > chris -at- microcozm -dot- net > === Si Hoc Legere Scis Nimium Eruditionis Habes > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On Tue, Jul 22, 2008 at 11:19 AM, <Wade.Stuart at fallon.com> wrote:> zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 09:58:53 AM: > > > To do dedup properly, it seems like there would have to be some > > overly complicated methodology for a sort of delayed dedup of the > > data. For speed, you''d want your writes to go straight into the > > cache and get flushed out as quickly as possibly, keep everything as > > ACID as possible. Then, a dedup scrubber would take what was > > written, do the voodoo magic of checksumming the new data, scanning > > the tree to see if there are any matches, locking the duplicates, > > run the usage counters up or down for that block of data, swapping > > out inodes, and marking the duplicate data as free space. > I agree, but what you are describing is file based dedup, ZFS already has > the groundwork for dedup in the system (block level checksuming and > pointers). > > > It''s a > > lofty goal, but one that is doable. I guess this is only necessary > > if deduplication is done at the file level. If done at the block > > level, it could possibly be done on the fly, what with the already > > implemented checksumming at the block level, > > exactly -- that is why it is attractive for ZFS, so much of the groundwork > is done and needed for the fs/pool already. > > > but then your reads > > will suffer because pieces of files can potentially be spread all > > over hell and half of Georgia on the zdevs. > > I don''t know that you can make this statement without some study of an > actual implementation on real world data -- and then because it is block > based, you should see varying degrees of this dedup-flack-frag depending > on data/usage.It''s just a NonScientificWAG. I agree that most of the duplicated blocks will in most cases be part of identical files anyway, and thus lined up exactly as you''d want them. I was just free thinking and typing.> > > For instance, I would imagine that in many scenarios much od the dedup > data blocks would belong to the same or very similar files. In this case > the blocks were written as best they could on the first write, the deduped > blocks would point to a pretty sequential line o blocks. Now on some files > there may be duplicate header or similar portions of data -- these may > cause you to jump around the disk; but I do not know how much this would be > hit or impact real world usage. > > > > Deduplication is going > > to require the judicious application of hallucinogens and man hours. > > I expect that someone is up to the task. > > I would prefer the coder(s) not be seeing "pink elephants" while writing > this, but yes it can and will be done. It (I believe) will be easier > after the grow/shrink/evac code paths are in place though. Also, the > grow/shrink/evac path allows (if it is done right) for other cool things > like a base to build a roaming defrag that takes into account snaps, > clones, live and the like. I know that some feel that the grow/shrink/evac > code is more important for home users, but I think that it is super > important for most of these additional features.The elephants are just there to keep the coders company. There are tons of benefits for dedup, both for home and non-home users. I''m happy that it''s going to be done. I expect the first complaints will come from those people who don''t understand it, and their df and du numbers look different than their zpool status ones. 
Perhaps df/du will just have to be faked out for those folks, or we just apply the same hallucinogens to them instead.> > > -Wade > > > On Tue, Jul 22, 2008 at 10:39 AM, <Wade.Stuart at fallon.com> wrote: > > zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM: > > > > > > Hi All > > > >Is there any hope for deduplication on ZFS ? > > > >Mertol Ozyoney > > > >Storage Practice - Sales Manager > > > >Sun Microsystems > > > > Email mertol.ozyoney at sun.com > > > > > > There is always hope. > > > > > > Seriously thought, looking at http://en.wikipedia. > > > org/wiki/Comparison_of_revision_control_software there are a lot of > > > choices of how we could implement this. > > > > > > SVN/K , Mercurial and Sun Teamware all come to mind. Simply ;) merge > > > one of those with ZFS. > > > > > > It _could_ be as simple (with SVN as an example) of using directory > > > listings to produce files which were then ''diffed''. You could then > > > view the diffs as though they were changes made to lines of source > code. > > > > > > Just add a "tree" subroutine to allow you to grab all the diffs that > > > referenced changes to file ''xyz'' and you would have easy access to > > > all the changes of a particular file (or directory). > > > > > > With the speed optimized ability added to use ZFS snapshots with the > > > "tree subroutine" to rollback a single file (or directory) you could > > > undo / redo your way through the filesystem. > > > > > > > > dedup is not revision control, you seem to completely misunderstand the > > problem. > > > > > > > > > Using a LKCD > (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html > > > ) you could "sit out" on the play and watch from the sidelines -- > > > returning to the OS when you thought you were ''safe'' (and if not, > > > jumping backout). > > > > > > Now it seems you have veered even further off course. What are you > > implying the LKCD has to do with zfs, solaris, dedup, let alone revision > > control software? > > > > -Wade > > > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > > > > > -- > > chris -at- microcozm -dot- net > > === Si Hoc Legere Scis Nimium Eruditionis Habes > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > >-- chris -at- microcozm -dot- net === Si Hoc Legere Scis Nimium Eruditionis Habes -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080722/177625a1/attachment.html>
Chris Cosby wrote:> > > On Tue, Jul 22, 2008 at 11:19 AM, <Wade.Stuart at fallon.com > <mailto:Wade.Stuart at fallon.com>> wrote: > > zfs-discuss-bounces at opensolaris.org > <mailto:zfs-discuss-bounces at opensolaris.org> wrote on 07/22/2008 > 09:58:53 AM: > > > To do dedup properly, it seems like there would have to be some > > overly complicated methodology for a sort of delayed dedup of the > > data. For speed, you''d want your writes to go straight into the > > cache and get flushed out as quickly as possibly, keep everything as > > ACID as possible. Then, a dedup scrubber would take what was > > written, do the voodoo magic of checksumming the new data, scanning > > the tree to see if there are any matches, locking the duplicates, > > run the usage counters up or down for that block of data, swapping > > out inodes, and marking the duplicate data as free space. > I agree, but what you are describing is file based dedup, ZFS > already has > the groundwork for dedup in the system (block level checksuming and > pointers). > > > It''s a > > lofty goal, but one that is doable. I guess this is only necessary > > if deduplication is done at the file level. If done at the block > > level, it could possibly be done on the fly, what with the already > > implemented checksumming at the block level, > > exactly -- that is why it is attractive for ZFS, so much of the > groundwork > is done and needed for the fs/pool already. > > > but then your reads > > will suffer because pieces of files can potentially be spread all > > over hell and half of Georgia on the zdevs. > > I don''t know that you can make this statement without some study of an > actual implementation on real world data -- and then because it is > block > based, you should see varying degrees of this dedup-flack-frag > depending > on data/usage. > > It''s just a NonScientificWAG. I agree that most of the duplicated > blocks will in most cases be part of identical files anyway, and thus > lined up exactly as you''d want them. I was just free thinking and typing. >No, you are right to be concerned over block-level dedup seriously impacting seeks. The problem is that, given many common storage scenarios, you will have not just similar files, but multiple common sections of many files. Things such as the various standard productivity app documents will not just have the same header sections, but internally, there will be significant duplications of considerable length with other documents from the same application. Your 5MB Word file is thus likely to share several (actually, many) multi-kB segments with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks. I can list at least a couple more common scenarios where dedup has to potential to save at least some reasonable amount of space, yet will absolutely kill performance.> For instance, I would imagine that in many scenarios much od the > dedup > data blocks would belong to the same or very similar files. In > this case > the blocks were written as best they could on the first write, > the deduped > blocks would point to a pretty sequential line o blocks. Now on > some files > there may be duplicate header or similar portions of data -- these may > cause you to jump around the disk; but I do not know how much this > would be > hit or impact real world usage. > > > > Deduplication is going > > to require the judicious application of hallucinogens and man hours. > > I expect that someone is up to the task. 
> > I would prefer the coder(s) not be seeing "pink elephants" while > writing > this, but yes it can and will be done. It (I believe) will be easier > after the grow/shrink/evac code paths are in place though. Also, the > grow/shrink/evac path allows (if it is done right) for other cool > things > like a base to build a roaming defrag that takes into account snaps, > clones, live and the like. I know that some feel that the > grow/shrink/evac > code is more important for home users, but I think that it is super > important for most of these additional features. > > The elephants are just there to keep the coders company. There are > tons of benefits for dedup, both for home and non-home users. I''m > happy that it''s going to be done. I expect the first complaints will > come from those people who don''t understand it, and their df and du > numbers look different than their zpool status ones. Perhaps df/du > will just have to be faked out for those folks, or we just apply the > same hallucinogens to them instead. >I''m still not convinced that dedup is really worth it for anything but very limited, constrained usage. Disk is just so cheap, that you _really_ have to have an enormous amount of dup before the performance penalties of dedup are countered. This in many ways reminds me the last year''s discussion over file versioning in the filesystem. It sounds like a cool idea, but it''s not a generally-good idea. I tend to think that this kind of problem is better served by applications handling it, if they are concerned about it. Pretty much, here''s what I''ve heard: Dedup Advantages: (1) save space relative to the amount of duplication. this is highly dependent on workload, and ranges from 0% to 99%, but the distribution of possibilities isn''t a bell curve (i.e. the average space saved isn''t 50%). Dedup Disadvantages: (1) increase codebase complexity, in both cases of dedup during write, and ex-post-facto batched dedup (2) noticable write performance penalty (assuming block-level dedup on write), with potential write cache issues. (3) very significant post-write dedup time, at least on the order of ''zfs scrub''. Also, during such a post-write scenario, it more or less takes the zpool out of usage. (4) If dedup is done at block level, not at file level, it kills read performance, effectively turning all dedup''d files from sequential read to a random read. That is, block-level dedup drastically accelerates filesystem fragmentation. (5) Something no one has talked about, but is of concern. By removing duplication, you increase the likelihood that loss of the "master" segment will corrupt many more files. Yes, ZFS has self-healing and such. But, particularly in the case where there is no ZFS pool redundancy (or pool-level redundancy has been compromised), loss of one block can thus be many more times severe. We need to think long and hard about what the real widespread benefits are of dedup before committing to a filesystem-level solution, rather than an application-level one. In particular, we need some real-world data on the actual level of duplication under a wide variety of circumstances. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
FWIW, Sun's VTL products use ZFS and offer de-duplication services.
http://www.sun.com/aboutsun/pr/2008-04/sunflash.20080407.2.xml
 -- richard
zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 11:48:30 AM:> Chris Cosby wrote: > > > > > > On Tue, Jul 22, 2008 at 11:19 AM, <Wade.Stuart at fallon.com > > <mailto:Wade.Stuart at fallon.com>> wrote: > > > > zfs-discuss-bounces at opensolaris.org > > <mailto:zfs-discuss-bounces at opensolaris.org> wrote on 07/22/2008 > > 09:58:53 AM: > > > > > To do dedup properly, it seems like there would have to be some > > > overly complicated methodology for a sort of delayed dedup of the > > > data. For speed, you''d want your writes to go straight into the > > > cache and get flushed out as quickly as possibly, keep everythingas> > > ACID as possible. Then, a dedup scrubber would take what was > > > written, do the voodoo magic of checksumming the new data,scanning> > > the tree to see if there are any matches, locking the duplicates, > > > run the usage counters up or down for that block of data,swapping> > > out inodes, and marking the duplicate data as free space. > > I agree, but what you are describing is file based dedup, ZFS > > already has > > the groundwork for dedup in the system (block level checksuming and > > pointers). > > > > > It''s a > > > lofty goal, but one that is doable. I guess this is onlynecessary> > > if deduplication is done at the file level. If done at the block > > > level, it could possibly be done on the fly, what with thealready> > > implemented checksumming at the block level, > > > > exactly -- that is why it is attractive for ZFS, so much of the > > groundwork > > is done and needed for the fs/pool already. > > > > > but then your reads > > > will suffer because pieces of files can potentially be spread all > > > over hell and half of Georgia on the zdevs. > > > > I don''t know that you can make this statement without some study ofan> > actual implementation on real world data -- and then because it is > > block > > based, you should see varying degrees of this dedup-flack-frag > > depending > > on data/usage. > > > > It''s just a NonScientificWAG. I agree that most of the duplicated > > blocks will in most cases be part of identical files anyway, and thus > > lined up exactly as you''d want them. I was just free thinking andtyping.> > > No, you are right to be concerned over block-level dedup seriously > impacting seeks. The problem is that, given many common storage > scenarios, you will have not just similar files, but multiple common > sections of many files. Things such as the various standard > productivity app documents will not just have the same header sections, > but internally, there will be significant duplications of considerable > length with other documents from the same application. Your 5MB Word > file is thus likely to share several (actually, many) multi-kB segments > with other Word files. You will thus end up seeking all over the disk > to read _most_ Word files. Which really sucks. I can list at least a > couple more common scenarios where dedup has to potential to save at > least some reasonable amount of space, yet will absolutely killperformance. While you may have a point on some data sets, actual testing of this type of data (28.000+ of actual end user doc files) using xdelta with 4k and 8k block sizes shows that the similar blocks in these files are in the 2% range (~ 6% for 4k). That means a full read of each file on average would require < 6% seeks to other disk areas. 
That is not bad, but this is the worst case picture as those duplicate blocks would need to live in the same offsets and have the same block boundaries to "match" under the proposed algo. To me this means word docs are not a good candidate for dedup at the block level -- but the actual cost to dedup anyways seems small. Of course you could come up with data that is pathologically bad for these benchmarks, but I do not believe it would be nearly as bad as you are making it out to be on real world data.> > > > For instance, I would imagine that in many scenarios much od the > > dedup > > data blocks would belong to the same or very similar files. In > > this case > > the blocks were written as best they could on the first write, > > the deduped > > blocks would point to a pretty sequential line o blocks. Now on > > some files > > there may be duplicate header or similar portions of data -- thesemay> > cause you to jump around the disk; but I do not know how much this > > would be > > hit or impact real world usage. > > > > > > > Deduplication is going > > > to require the judicious application of hallucinogens and manhours.> > > I expect that someone is up to the task. > > > > I would prefer the coder(s) not be seeing "pink elephants" while > > writing > > this, but yes it can and will be done. It (I believe) will beeasier> > after the grow/shrink/evac code paths are in place though. Also,the> > grow/shrink/evac path allows (if it is done right) for other cool > > things > > like a base to build a roaming defrag that takes into accountsnaps,> > clones, live and the like. I know that some feel that the > > grow/shrink/evac > > code is more important for home users, but I think that it issuper> > important for most of these additional features. > > > > The elephants are just there to keep the coders company. There are > > tons of benefits for dedup, both for home and non-home users. I''m > > happy that it''s going to be done. I expect the first complaints will > > come from those people who don''t understand it, and their df and du > > numbers look different than their zpool status ones. Perhaps df/du > > will just have to be faked out for those folks, or we just apply the > > same hallucinogens to them instead. > > > I''m still not convinced that dedup is really worth it for anything but > very limited, constrained usage. Disk is just so cheap, that you > _really_ have to have an enormous amount of dup before the performance > penalties of dedup are countered.If you can dedup 30% of your data, your disk just became 30% cheaper. Depending on workflow, the cost of disk is the barrier -- not cpu cycles or write/read speed.> > This in many ways reminds me the last year''s discussion over file > versioning in the filesystem. It sounds like a cool idea, but it''s not > a generally-good idea. I tend to think that this kind of problem is > better served by applications handling it, if they are concerned aboutit.>snapping a full filesystem for versions is expensive -- you are dealing with one file changing. doing dedup on zfs is inexpensive vs a follow the writes queue.> Pretty much, here''s what I''ve heard: > > Dedup Advantages: > > (1) save space relative to the amount of duplication. this is highly > dependent on workload, and ranges from 0% to 99%, but the distribution > of possibilities isn''t a bell curve (i.e. the average space saved isn''t > 50%). 
> > > Dedup Disadvantages: > > (1) increase codebase complexity, in both cases of dedup during write, > and ex-post-facto batched dedupyes, but the code path is optional.> > (2) noticable write performance penalty (assuming block-level dedup on > write), with potential write cache issues.there is cost, but smart use of hash lookups and caching should absorb most of these. most of the cost comes with using a better hashing algo instead of fletch2/4> > (3) very significant post-write dedup time, at least on the order of > ''zfs scrub''. Also, during such a post-write scenario, it more or less > takes the zpool out of usage.post write, while not as bad as a separate dedup app, reduces the value of tying it to zfs. it should be done inline.> > (4) If dedup is done at block level, not at file level, it kills read > performance, effectively turning all dedup''d files from sequential read > to a random read. That is, block-level dedup drastically accelerates > filesystem fragmentation.again, this is completely dependant on the implementation and data sets. looking at our real world data on a 14tb user file store shows that most dedup that would happen (using 4, 8, 16 and 128k blocks) happens on totally binary similar files, a small percentage of dedup happens on other data if a static block seek is used (no sliding delta window).> > (5) Something no one has talked about, but is of concern. By removing > duplication, you increase the likelihood that loss of the "master" > segment will corrupt many more files. Yes, ZFS has self-healing and > such. But, particularly in the case where there is no ZFS pool > redundancy (or pool-level redundancy has been compromised), loss of one > block can thus be many more times severe.I assume that no one has talked about that because it seems obvious. Your blocks become N times more "valuable" where N is the number of blocks that are pointed to that block for dedup. A lost block on zfs can therefore affect N files + X snapshots + Y clones, or the entire filesystem if it was holding one of a few zfs structures.> > > We need to think long and hard about what the real widespread benefits > are of dedup before committing to a filesystem-level solution, rather > than an application-level one. In particular, we need some real-world > data on the actual level of duplication under a wide variety of > circumstances.There was already a post that shows how to exploit the zfs block checksums to gather similar block stats. An issue I have with that is zfs default hashing is pretty collision prone and the data seems suspect. I can probably post the perl scripts I used to gather data on my systems. The hash lookup tables that they generate are pretty damn huge, but the reporting part could display relative info in a compact way for posting. Assumptions I made were fixed block seeks (slurping in the largest block of data each read and acting on it as all block sizes in the test phase to be efficient), md5 match = bin match (pretty safe but a real system would bit level compare on a hash match). -Wade> > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
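For the write-path variant mentioned in point (2) above, the extra cost per
block write is essentially one checksum lookup. A toy sketch of that table
follows; the DedupTable class and its allocate/read_block callbacks are
invented names for illustration, not anything in the ZFS code, and a real
implementation would live in the pool's write path rather than in Python.

import hashlib

class DedupTable:
    """Toy write-time dedup table: checksum -> (block_address, refcount)."""

    def __init__(self):
        self.entries = {}

    def write_block(self, data, allocate, read_block):
        """allocate(data) -> new address; read_block(addr) -> bytes."""
        csum = hashlib.sha256(data).digest()
        hit = self.entries.get(csum)
        if hit is not None:
            addr, refs = hit
            # Byte compare on a hash hit before sharing the block, so a
            # collision costs one extra read instead of corrupting data.
            if read_block(addr) == data:
                self.entries[csum] = (addr, refs + 1)
                return addr              # reference the existing block
        # No hit (or a genuine collision): fall back to a normal allocation.
        addr = allocate(data)
        self.entries[csum] = (addr, 1)
        return addr

The byte compare on a hit is the knob that trades write latency for safety
with a weaker checksum; with sha256 many designs would choose to skip it.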
On Tue, Jul 22, 2008 at 11:48 AM, Erik Trimble <Erik.Trimble at sun.com> wrote:> No, you are right to be concerned over block-level dedup seriously > impacting seeks. The problem is that, given many common storage > scenarios, you will have not just similar files, but multiple common > sections of many files. Things such as the various standard > productivity app documents will not just have the same header sections, > but internally, there will be significant duplications of considerable > length with other documents from the same application. Your 5MB Word > file is thus likely to share several (actually, many) multi-kB segments > with other Word files. You will thus end up seeking all over the disk > to read _most_ Word files. Which really sucks. I can list at least a > couple more common scenarios where dedup has to potential to save at > least some reasonable amount of space, yet will absolutely kill performance.This would actually argue in favor of dedup... If the blocks are common they are more likely to be in the ARC with dedup, thus avoiding a read altogether. There would likely be greater overhead in assembling smaller packets Here''s some real life... I have 442 Word documents created by me and others over several years. Many were created from the same corporate templates. I generated the MD5 hash of every 8 KB of each file and came up with a total of 8409 hash - implying 65 MB of word documents. Taking those hashes through "sort | uniq -c | sort -n" led to the following: 3 p9I7HgbxFme7TlPZmsD6/Q 3 sKE3RBwZt8A6uz+tAihMDA 3 uA4PK1+SQqD+h1Nv6vJ6fQ 3 wQoU2g7f+dxaBMzY5rVE5Q 3 yM0csnXKtRxjpSxg1Zma0g 3 yyokNamrTcD7lQiitcVgqA 4 jdsZZfIHtshYZiexfX3bQw 17 pohs0DWPFwF8HJ8p/HnFKw 19 s0eKyh/vT1LothTvsqtZOw 64 CCn3F0CqsauYsz6uId7hIg Note that "CCn3F0CqsauYsz6uId7hIg" is the MD5 hash of 8 KB of zeros. If compression is used as well, this block would not even be stored. If 512 byte blocks are used, the story is a bit different: 81 DEf6rofNmnr1g5f7oaV75w 109 3gP+ZaZ2XKqMkTQ6zGLP/A 121 ypk+0ryBeMVRnnjYQD2ZEA 124 HcuMdyNKV7FDYcPqvb2o3Q 371 s0eKyh/vT1LothTvsqtZOw 372 ozgGMCCoc+0/RFbFDO8MsQ 8535 v2GerAzfP2jUluqTRBN+iw As you might guess, that most common hash is a block of zeros. Most likely, however, these files will end up using 128K blocks for the first part of the file, smaller for the portions that don''t fit. When I look at just 128K... 1 znJqBX8RtPrAOV2I6b5Wew 2 6tuJccWHGVwv3v4nee6B9w 2 Qr//PMqqhMtuKfgKhUIWVA 2 idX0awfYjjFmwHwi60MAxg 2 s0eKyh/vT1LothTvsqtZOw 3 +Q/cXnknPr/uUCARsaSIGw 3 /kyIGuWnPH/dC5ETtMqqLw 3 4G/QmksvChYvfhAX+rfgzg 3 SCMoKuvPepBdQEBVrTccvA 3 vbaNWd5IQvsGdQ9R8dIqhw There is actually very little duplication in word files. Many of the dupes above are from various revisions of the same files.> Dedup Advantages: > > (1) save space relative to the amount of duplication. this is highly > dependent on workload, and ranges from 0% to 99%, but the distribution > of possibilities isn''t a bell curve (i.e. the average space saved isn''t > 50%).I have evidence that shows 75% duplicate data on (mostly sparse) zone roots created and maintained over a 18 month period. I show other evidence above that it is not nearly as good for one person''s copy of word documents. 
I suspect that it would be different if the file system that I did this on was on a file server where all of my colleagues also stored their documents (and revisions of mine that they have reviewed).> (2) noticable write performance penalty (assuming block-level dedup on > write), with potential write cache issues.Depends on the approach taken.> (3) very significant post-write dedup time, at least on the order of > ''zfs scrub''. Also, during such a post-write scenario, it more or less > takes the zpool out of usage.The ZFS competition that has this in shipping product today does not quiesce the file system during dedup passes.> (4) If dedup is done at block level, not at file level, it kills read > performance, effectively turning all dedup''d files from sequential read > to a random read. That is, block-level dedup drastically accelerates > filesystem fragmentation.Absent data that shows this, I don''t accept this claim. Arguably the blocks that are duplicate are more likely to be in cache. I think that my analysis above shows that this is not a concern for my data set.> (5) Something no one has talked about, but is of concern. By removing > duplication, you increase the likelihood that loss of the "master" > segment will corrupt many more files. Yes, ZFS has self-healing and > such. But, particularly in the case where there is no ZFS pool > redundancy (or pool-level redundancy has been compromised), loss of one > block can thus be many more times severe.I believe this is true and likely a good topic for discussion.> We need to think long and hard about what the real widespread benefits > are of dedup before committing to a filesystem-level solution, rather > than an application-level one. In particular, we need some real-world > data on the actual level of duplication under a wide variety of > circumstances.The key thing here is that distributed applications will not play nicely. In my best use case, Solaris zones and LDoms are the "application". I don''t expect or want Solaris to form some sort of P2P storage system across my data center to save a few terabytes. D12n at the storage device can do this much more reliably with less complexity. -- Mike Gerdts http://mgerdts.blogspot.com/
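For anyone who wants to repeat that kind of measurement on their own data,
something along these lines reproduces the hash-every-8-KB tally without the
sort | uniq plumbing. The paths and block size are whatever you point it at,
and MD5 is used here only for counting duplicates, as in the analysis above --
not as a proposal for what an in-kernel dedup should use.

import hashlib, os, sys
from collections import Counter

def block_hashes(path, blocksize=8192):
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(blocksize)
            if not chunk:
                break
            yield hashlib.md5(chunk).hexdigest()

def tally(root, blocksize=8192):
    counts = Counter()
    for dirpath, _, files in os.walk(root):
        for name in files:
            try:
                counts.update(block_hashes(os.path.join(dirpath, name), blocksize))
            except OSError:
                pass                     # unreadable file, skip it
    return counts

if __name__ == '__main__':
    bsize = int(sys.argv[2]) if len(sys.argv) > 2 else 8192
    counts = tally(sys.argv[1], bsize)
    for csum, n in counts.most_common(10):
        print(n, csum)
    total = sum(counts.values())
    print("blocks:", total, " unique:", len(counts),
          " duplicates a dedup pass could reclaim:", total - len(counts))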
On 7/22/08 11:48 AM, "Erik Trimble" <Erik.Trimble at Sun.COM> wrote:> I''m still not convinced that dedup is really worth it for anything but > very limited, constrained usage. Disk is just so cheap, that you > _really_ have to have an enormous amount of dup before the performance > penalties of dedup are countered.Again, I will argue that the spinning rust itself isn''t expensive, but data management is. If I am looking to protect multiple PB (through remote data replication and backup), I need more than just the rust to store that. I need to copy this data, which takes time and effort. If the system can say "these 500K blocks are the same as these 500K, don''t bother copying them to the DR site AGAIN," then I have a less daunting data management task. De-duplication makes a lot of sense at some layer(s) within the data management scheme. Charles
On Tue, 22 Jul 2008, Erik Trimble wrote:
> > Dedup Disadvantages:

Obviously you do not work in the Sun marketing department, which is
interested in this feature (due to other companies marketing it). Note
that the topic starter post came from someone in Sun's marketing
department.

I think that deduplication is a potential diversion which draws
attention away from the core ZFS things which are still not ideally
implemented or do not yet exist at all. Compared with other
filesystems, ZFS is still a toddler since it has only been deployed for
a few years. ZFS is intended to be an enterprise filesystem so let's
give it more time to mature before hitting it with the "feature" stick.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>>>>> "et" == Erik Trimble <Erik.Trimble at Sun.COM> writes:et> Dedup Advantages: et> (1) save space (2) coalesce data which is frequently used by many nodes in a large cluster into a small nugget of common data which can fit into RAM or L2 fast disk (3) back up non-ZFS filesystems that don''t have snapshots and clones (4) make offsite replication easier on the WAN but, yeah, aside from imagining ahead to possible disastrous problems with the final implementation, the imagined use cases should probably be carefully compared to existing large installations. Firstly, dedup may be more tempting as a bulletted marketing feature or a bloggable/banterable boasting point than it is valuable to real people. Secondly, the comparison may drive the implementation. For example, should dedup happen at write time and be something that doesn''t happen to data written before it''s turned on, like recordsize or compression, to make it simpler in the user interface, and avoid problems with scrubs making pools uselessly slow? Or should it be scrub-like so that already-written filesystems can be thrown into the dedup bag and slowly squeezed, or so that dedup can run slowly during the business day over data written quickly at night (fast outside-business-hours backup)? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080722/d3497f8b/attachment.bin>
On Tue, 22 Jul 2008, Miles Nordin wrote:> scrubs making pools uselessly slow? Or should it be scrub-like so > that already-written filesystems can be thrown into the dedup bag and > slowly squeezed, or so that dedup can run slowly during the business > day over data written quickly at night (fast outside-business-hours > backup)?I think that the scrub-like model makes the most sense since ZFS write performance should not be penalized. It is useful to implement score-boarding so that a block is not considered for de-duplication until it has been duplicated a certain number of times. In order to decrease resource consumption, it is useful to perform de-duplication over a span of multiple days or multiple weeks doing just part of the job each time around. Deduping a petabyte of data seems quite challenging yet ZFS needs to be scalable to these levels. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
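A score-board along those lines can be a very small structure: count sightings
of each checksum during the incremental passes and only hand a block to the
expensive dedup path once it crosses a threshold. A hedged sketch follows; the
threshold, table size and eviction policy are pulled out of thin air, not from
any ZFS design.

from collections import Counter

class DedupScoreboard:
    """Promote a checksum to dedup candidacy only after N sightings."""

    def __init__(self, threshold=3, max_entries=1000000):
        self.threshold = threshold
        self.max_entries = max_entries
        self.seen = Counter()

    def record(self, checksum):
        """Call once per block visited by a scrub-like pass.
        Returns True when the block looks worth deduplicating."""
        self.seen[checksum] += 1
        if len(self.seen) > self.max_entries:
            # Crude pressure valve: forget the rarest half of the table so
            # memory stays bounded across a pass that runs for weeks.
            for csum, _ in self.seen.most_common()[self.max_entries // 2:]:
                del self.seen[csum]
        return self.seen.get(checksum, 0) >= self.threshold

Blocks that never cross the threshold never cost more than a counter entry,
which is what keeps the petabyte case tractable.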
Bob Friesenhahn wrote:> On Tue, 22 Jul 2008, Erik Trimble wrote: > >> Dedup Disadvantages: >> > > Obviously you do not work in the Sun marketing department which is > intrested in this feature (due to some other companies marketing it). > Note that the topic starter post came from someone in Sun''s marketing > department. > > I think that dedupication is a potential diversion which draws > attention away from the core ZFS things which are still not ideally > implemented or do not yet exist at all. Compared with other > filesystems, ZFS is still a toddler since it has only been deployed > for a few years. ZFS is intended to be an enterprise filesystem so > let''s give it more time to mature before hiting it with the "feature" > stick. > > Bob > =====================================> Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >More than anything, Bob''s reply is my major feeling on this. Dedup may indeed turn out to be quite useful, but honestly, there''s no broad data which says that it is a Big Win (tm) _right_now_, compared to finishing other features. I''d really want a Engineering Study about the real-world use (i.e. what percentage of the userbase _could_ use such a feature, and what percentage _would_ use it, and exactly how useful would each segment find it...) before bumping it up in the priority queue of work to be done on ZFS. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
> On Tue, 22 Jul 2008, Miles Nordin wrote: > > scrubs making pools uselessly slow? Or should it be scrub-like so > > that already-written filesystems can be thrown into the dedup bag and > > slowly squeezed, or so that dedup can run slowly during the business > > day over data written quickly at night (fast outside-business-hours > > backup)? > > I think that the scrub-like model makes the most sense since ZFS write > performance should not be penalized. It is useful to implement > score-boarding so that a block is not considered for de-duplication > until it has been duplicated a certain number of times. In order to > decrease resource consumption, it is useful to perform de-duplication > over a span of multiple days or multiple weeks doing just part of the > job each time around. Deduping a petabyte of data seems quite > challenging yet ZFS needs to be scalable to these levels. > Bob FriesenhahnIn case anyone (other than Bob) missed it, this is why I suggested "File-Level" Dedup: "... using directory listings to produce files which were then ''diffed''. You could then view the diffs as though they were changes made ..." We could have: "Block-Level" (if we wanted to restore an exact copy of the drive - duplicate the ''dd'' command) or "Byte-Level" (if we wanted to use compression - duplicate the ''zfs set compression=on rpool'' _or_ ''bzip'' commands) ... etc... assuming we wanted to duplicate commands which already implement those features, and provide more than we (the filesystem) needs at a very high cost (performance). So I agree with your comment about the need to be mindful of "resource consumption", the ability to do this over a period of days is also useful. Indeed the Plan9 filesystem simply snapshots to WORM and has no delete - nor are they able to fill their drives faster than they can afford to buy new ones: Venti Filesystem http://www.cs.bell-labs.com/who/seanq/p9trace.html Rob This message posted from opensolaris.org
> with other Word files. You will thus end up seeking all over the disk
> to read _most_ Word files. Which really sucks.
<snip>
> very limited, constrained usage. Disk is just so cheap, that you
> _really_ have to have an enormous amount of dup before the performance
> penalties of dedup are countered.

Neither of these holds true for SSDs though, does it? Seeks are
essentially free, and the devices are not cheap.

cheers,
--justin
On Tue, Jul 22, 2008 at 10:44 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:> More than anything, Bob''s reply is my major feeling on this. Dedup may > indeed turn out to be quite useful, but honestly, there''s no broad data > which says that it is a Big Win (tm) _right_now_, compared to finishing > other features. I''d really want a Engineering Study about the > real-world use (i.e. what percentage of the userbase _could_ use such a > feature, and what percentage _would_ use it, and exactly how useful > would each segment find it...) before bumping it up in the priority > queue of work to be done on ZFS.I get this. However, for most of my uses of clones dedup is considered finishing the job. Without it, I run the risk of having way more writable data than I can restore. Another solution to this is to consider the output of "zfs send" to be a stable format and get integration with enterprise backup software that can perform restores in a way that maintains space efficiency. -- Mike Gerdts http://mgerdts.blogspot.com/
Just my 2c: Is it possible to do an "offline" dedup, kind of like
snapshotting?

What I mean in practice is: we make many Solaris full-root zones. They
share a lot of data as complete files, which makes it easy to save
space up front - make one zone as a template, snapshot/clone its
dataset, and make new zones from the clones.

However, as projects evolve (software gets installed, etc.) these zones
fill up with many similar files, many of which are duplicates.

It seems reasonable to have a dedup process which would create a
least-common-denominator snapshot for all the datasets involved (the
zone roots), with each dataset's current data then treated as a "clone
with modified data" of that common snapshot.

For the system (and the user) it would be perceived just the same as
today, where these datasets are "clones with modified data" of the
original template zone-root dataset. Only the "template" becomes
different...

Hope this idea makes sense, and perhaps makes its way into code
sometime :)

This message posted from opensolaris.org
zfs-discuss-bounces at opensolaris.org wrote on 08/22/2008 04:26:35 PM:> Just my 2c: Is it possible to do an "offline" dedup, kind of like > snapshotting? > > What I mean in practice, is: we make many Solaris full-root zones. > They share a lot of data as complete files. This is kind of easy to > save space - make one zone as a template, snapshot/clone its > dataset, make new zones. > > However, as projects evolve (software installed, etc.) these zones > are filled with many similar files, many of which are duplicates. > > It seems reasonable to make some dedup process which would create a > least-common-denominator snapshot for all the datasets involved > (zone roots), of which all other datasets'' current data are to be > dubbed "clones with modified data". > > For the system (and user) it should be perceived just the same as > these datasets are currently "clones with modified data" of the > original template zone-root dataset. Only the "template" becomesdifferent...> > Hope this idea makes sense, and perhaps makes its way into code sometime:)>Jim, There have been a few long threads about this in the past on this list. My take is that it is worthwhile, but should (or really needs to) wait until the resilver/resize/evac code is done and the zfs libs are stabilized and public (meaning people can actually write non throw away code against them). Some people feel that dedup is over extending the premise of the filesystem (and would unnecessarily complicate the code). Some feel that the benefits would be less than we suspect. I would expect first dedup code you see to be written by non sun people -- and if it is enticing enough to be backported to trunk (maybe). There are a bunch of need-to-haves sitting in queue that Sun needs to focus on such as real user/group quotas, disk shrink/evac, utility/toolkit for failed pool recovery (beyond skull and bones forensic tools), etc that should be way ahead of the line vs dedup. -Wade> > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Ok, thank you Nils, Wade, for the concise replies.

After much reading I agree that the features already queued for ZFS
development deserve a higher ranking on the priority list
(pool-shrinking/disk-removal and user/group quotas would be my
favourites), so the deduplication tool I'd need would probably be some
community-contributed script which runs many hash-checks over the
zone-root file systems and does what Nils described: calculate the
most-common "template" filesystem and derive the zone roots as minimal
changes to it.

Does anybody with a wider awareness know of such readily-available
scripts on some blog? :)

Does some script-usable ZFS API (if any) provide for fetching
block/file hashes (checksums) stored in the filesystem itself? In fact,
am I wrong to expect file checksums to be readily available?

This message posted from opensolaris.org
Jim Klimov wrote:> Ok, thank you Nils, Wade for the concise replies. > > After much reading I agree that the ZFS-development queued features do deserve a higher ranking on the priority list (pool-shrinking/disk-removal and user/group quotas would be my favourites), so probably the deduplication tool I''d need would, indeed, probably be some community-contributed script which does many hash-checks in zone-root file systems and does what Nils described to calculate the most-common "template" filesystem and derive zone roots as minimal changes to it. > > Does anybody with a wider awareness know of such readily-available scripts on some blog? :) > > Does some script-usable ZFS API (if any) provide for fetching block/file hashes (checksums) stored in the filesystem itself? In fact, am I wrong to expect file-checksums to be readily available? >Yes. Files are not checksummed, blocks are checksummed. -- richard
> > > > Does some script-usable ZFS API (if any) provide for fetching > block/file hashes (checksums) stored in the filesystem itself? In > fact, am I wrong to expect file-checksums to be readily available? > > > > Yes. Files are not checksummed, blocks are checksummed. > -- richardFurther, even if they were file level checksums, the default checksums in zfs are too collision prone to be used for that purpose. If I were to write such a script I would md5+sha and then bit level compare collisions to be safe. -Wade
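Lacking a public API for the stored checksums, a user-level script has to hash
the file data itself anyway. Along the lines Wade suggests -- group by size
first, then hash, then do a byte-level compare before trusting any match --
here is a rough sketch. The default /zones path is purely illustrative, and it
only reports duplicate files; turning them into clones, hard links or a common
template is a separate (and riskier) step.

import filecmp, hashlib, os, sys
from collections import defaultdict

def digest(path, algo):
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def duplicate_files(roots):
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _, files in os.walk(root):
            for name in files:
                p = os.path.join(dirpath, name)
                try:
                    if os.path.isfile(p) and not os.path.islink(p):
                        by_size[os.path.getsize(p)].append(p)
                except OSError:
                    pass

    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            try:
                by_hash[(digest(p, 'md5'), digest(p, 'sha256'))].append(p)
            except OSError:
                pass
        for group in by_hash.values():
            # Two hash matches are strong evidence; the byte-level compare
            # makes a false merge effectively impossible.
            confirmed = [p for p in group[1:]
                         if filecmp.cmp(group[0], p, shallow=False)]
            if confirmed:
                yield [group[0]] + confirmed

if __name__ == '__main__':
    for group in duplicate_files(sys.argv[1:] or ['/zones']):
        print(' == '.join(group))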
Wade.Stuart at fallon.com wrote:>>> Does some script-usable ZFS API (if any) provide for fetching >> block/file hashes (checksums) stored in the filesystem itself? In >> fact, am I wrong to expect file-checksums to be readily available? >> Yes. Files are not checksummed, blocks are checksummed. >> -- richard > > Further, even if they were file level checksums, the default checksums in > zfs are too collision prone to be used for that purpose. If I were to > write such a script I would md5+sha and then bit level compare collisions > to be safe.zfs set checksum=sha256 Remembering that doesn''t change existing data. -- Darren J Moffat
On Tue, 26 Aug 2008, Darren J Moffat wrote:> > zfs set checksum=sha256Expect performance to really suck after setting this. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:
> On Tue, 26 Aug 2008, Darren J Moffat wrote:
>>
>> zfs set checksum=sha256
>
> Expect performance to really suck after setting this.

Do you have evidence of that? What kind of workload and how did you
test it?

I've recently been benchmarking using the filebench filemicro and
filemacro workloads for ZFS Crypto, and as part of setting my baseline
I compared the default checksum (fletcher2) with sha256; I didn't see a
big enough difference to classify it as "sucks".

Here is my evidence for the filebench filemacro workload:

http://cr.opensolaris.org/~darrenm/zfs-checksum-compare.html

This was done on an X4500 running the zfs-crypto development binaries.

In the interest of "full disclosure", I have changed the sha256.c in
the ZFS source to use the default kernel one via the crypto framework
rather than a private copy. I wouldn't expect that to have too big an
impact (I will be verifying it; I just didn't have the data to hand
quickly).

-- Darren J Moffat
On Aug 26, 2008, at 9:58 AM, Darren J Moffat wrote:> > than a private copy. I wouldn''t expect that to have too big an > impact (I >On a SPARC CMT (Niagara 1+) based system wouldn''t that be likely to have a large impact? -- Keith H. Bierman khbkhb at gmail.com | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 <speaking for myself*> Copyright 2008
On Tue, Aug 26, 2008 at 10:58 AM, Darren J Moffat <Darren.Moffat at sun.com> wrote:> In the interest of "full disclosure" I have changed the sha256.c in the > ZFS source to use the default kernel one via the crypto framework rather > than a private copy. I wouldn''t expect that to have too big an impact (I > will be verifying it I just didn''t have the data to hand quickly).Would this also make it so that it would use hardware assisted sha256 on capable (e.g N2) platforms? Is that the same as this change from long ago? http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.html -- Mike Gerdts http://mgerdts.blogspot.com/
On Tue, 26 Aug 2008, Darren J Moffat wrote:
> Bob Friesenhahn wrote:
>> On Tue, 26 Aug 2008, Darren J Moffat wrote:
>>>
>>> zfs set checksum=sha256
>>
>> Expect performance to really suck after setting this.
>
> Do you have evidence of that? What kind of workload and how did you test it?

I did some random I/O throughput testing using iozone. While I saw
similar I/O performance degradation to what you did (similar to your
"large_db_oltp_8k_cached"), I did observe high CPU usage. The default
fletcher algorithm uses hardly any CPU.

In a dedicated file server, CPU usage is not a problem unless it slows
subsequent requests. In a desktop system or compute workstation,
filesystem CPU usage competes with application CPU usage. With Solaris
10, enabling sha256 resulted in jerky mouse and desktop application
behavior.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Keith Bierman wrote:
>
> On Aug 26, 2008, at 9:58 AM, Darren J Moffat wrote:
>
>> than a private copy. I wouldn't expect that to have too big an impact (I
>
> On a SPARC CMT (Niagara 1+) based system wouldn't that be likely to have
> a large impact?

UltraSPARC T1 has no hardware SHA256, so I wouldn't expect any real
change from running the private software sha256 copy in ZFS versus the
software sha256 in the crypto framework. The software sha256 in the
crypto framework has very little (if any) optimization for sun4v.

An UltraSPARC T2 has on-chip SHA256, and using it via the crypto
framework should have a good impact on performance. I don't have the
data to hand at the moment.

-- Darren J Moffat
Mike Gerdts wrote:> On Tue, Aug 26, 2008 at 10:58 AM, Darren J Moffat <Darren.Moffat at sun.com> wrote: >> In the interest of "full disclosure" I have changed the sha256.c in the >> ZFS source to use the default kernel one via the crypto framework rather >> than a private copy. I wouldn''t expect that to have too big an impact (I >> will be verifying it I just didn''t have the data to hand quickly). > > Would this also make it so that it would use hardware assisted sha256 > on capable (e.g N2) platforms?Yes.> Is that the same as this change from > long ago?> http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.htmlSlightly different implementation - in particular it doesn''t use PKCS#11 in userland only libmd. It also falls back to direct sha256 if the crypto framework call crypto_mech2id() call fails - this is needed to support ZFS boot. -- Darren J Moffat
On Tue, Aug 26, 2008 at 10:11 AM, Darren J Moffat <Darren.Moffat at sun.com>wrote:> Keith Bierman wrote: > >> >> >> >>> >> On a SPARC CMT (Niagara 1+) based system wouldn''t that be likely to have a >> large impact? >> > > UltraSPARC T1 has no hardware SHA256 so I wouldn''t expect any real change > from running the private software sha256 copy in ZFS versus the software > sha256 in the crypto framework. TheSorry for the typo (or thinko; I did know that but it''s possible that it slipped my mind in the moment). Admittedly most community members probably don''t have an N2 to play with, but it might well be available in the DC. -- Keith Bierman khbkhb at gmail.com kbiermank AIM -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080826/18f8512c/attachment.html>