In reading this blog post: http://blogs.sun.com/bobn/entry/taking_zfs_deduplication_for_a a question came to mind...

To understand the context of the question, consider the opening paragraph from the above post:

> Here is my test case: I have 2 directories of photos, totaling about 90MB each. And here's the trick - they are almost complete duplicates of each other. I downloaded all of the photos from the same camera on 2 different days. How many of you do that? Yeah, me too.

OK, I consider myself in that category most certainly. Through just plain ol' sloppiness I must have multiple copies of some images. Sad self-indictment...but anyway...

What happens if, once dedup is on, I (or someone else with delete rights) open a photo management app containing that collection, and start deleting dupes - AND - happen to delete the original that all other references are pointing to? I know, I know, it doesn't matter - snapshots save the day - but in this instance that's not the point, because I'm trying to properly understand the underlying dedup concept.

Logically, if you delete what everything is pointing at, all the pointers are now null values; they are - in effect - pointing at nothing...an empty hole.

I have the feeling the answer to this is: "no they don't, there is no spoon ('original'), you're still OK". I suspect that only because the people who thought this up couldn't possibly have missed such an "obvious" point. The problem I have is in trying to mentally frame this in such a way that I can subsequently explain it if asked to do so (which I see coming for sure).

Help in understanding this would be hugely helpful - anyone?

Regards & TIA,
-Me
Michael Schuster
2009-Dec-08 12:10 UTC
[zfs-discuss] Deduplication - deleting the original
Colin Raven wrote:
> What happens if, once dedup is on, I (or someone else with delete rights) open a photo management app containing that collection, and start deleting dupes - AND - happen to delete the original that all other references are pointing to? I know, I know, it doesn't matter - snapshots save the day - but in this instance that's not the point, because I'm trying to properly understand the underlying dedup concept.
>
> Logically, if you delete what everything is pointing at, all the pointers are now null values; they are - in effect - pointing at nothing...an empty hole.
>
> I have the feeling the answer to this is: "no they don't, there is no spoon ('original'), you're still OK". I suspect that only because the people who thought this up couldn't possibly have missed such an "obvious" point. The problem I have is in trying to mentally frame this in such a way that I can subsequently explain it if asked to do so (which I see coming for sure).
>
> Help in understanding this would be hugely helpful - anyone?

I mentally compare deduplication to links to files (hard, not soft) - as I understand it, there is no "original" and "copy"; rather, every directory entry points to "the data" (the inode, in ufs-speak), and if one directory entry of several is deleted, only the reference count changes. It's probably a little more complicated with dedup, but I think the parallel is valid.

HTH
Michael
-- 
Michael Schuster  http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
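Michael's hard-link analogy is easy to see on any POSIX filesystem. Here is a minimal sketch (plain Python, nothing ZFS-specific; the file names and contents are made up for illustration) showing that removing one name merely drops the link count, and the data stays reachable through the remaining name:

    import os
    import tempfile

    # Work in a throwaway directory so the demo is self-contained.
    tmp = tempfile.mkdtemp()
    original = os.path.join(tmp, "photo_a.jpg")   # hypothetical names
    duplicate = os.path.join(tmp, "photo_b.jpg")

    with open(original, "wb") as f:
        f.write(b"pretend this is image data")

    # A hard link is just a second directory entry for the same inode.
    os.link(original, duplicate)
    print(os.stat(original).st_nlink)    # 2 -- two names, one copy of the data

    # Removing the "original" name only decrements the link count;
    # the data is still reachable through the other name.
    os.remove(original)
    print(os.stat(duplicate).st_nlink)   # 1
    with open(duplicate, "rb") as f:
        print(f.read())                  # data is intact

Neither name is more "real" than the other; the filesystem only tracks how many names still point at the inode.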
On Tuesday 08 December 2009 14:00, Colin Raven wrote:
> Help in understanding this would be hugely helpful - anyone?

I am no pro in ZFS, but to my understanding there is no original. All the files have pointers to blocks on disk. Even if no other file shares a given block, there is still a pointer to it (and of course, if two or more files share the same block, there are correspondingly more pointers). So on any delete, all that needs to be done is to remove the correct pointer; once there are no more pointers to a block, the block is free.

But this is just how I understand it. Feel free to correct me.

-- 
Real programmers don't document. If it was hard to write, it should be hard to understand.
Thomas Uebermeier
2009-Dec-08 12:22 UTC
[zfs-discuss] Deduplication - deleting the original
Colin,

I think you are mixing up the filesystem layer (where the individual files are maintained) and the block layer, where the actual data is stored. The analogue of deduplication at the filesystem layer would be creating hard links to a file, where deleting one link does not remove the others. Block-layer deduplication is a black box; think of it simply as a form of compression that works in the background.

Thomas

* Colin Raven [Tue Dec 08, 2009 at 01:00:54PM +0100]:
> In reading this blog post: http://blogs.sun.com/bobn/entry/taking_zfs_deduplication_for_a a question came to mind...
>
> To understand the context of the question, consider the opening paragraph from the above post:
>
> > Here is my test case: I have 2 directories of photos, totaling about 90MB each. And here's the trick - they are almost complete duplicates of each other. I downloaded all of the photos from the same camera on 2 different days. How many of you do that? Yeah, me too.
>
> OK, I consider myself in that category most certainly. Through just plain ol' sloppiness I must have multiple copies of some images. Sad self-indictment...but anyway...
>
> What happens if, once dedup is on, I (or someone else with delete rights) open a photo management app containing that collection, and start deleting dupes - AND - happen to delete the original that all other references are pointing to? I know, I know, it doesn't matter - snapshots save the day - but in this instance that's not the point, because I'm trying to properly understand the underlying dedup concept.
>
> Logically, if you delete what everything is pointing at, all the pointers are now null values; they are - in effect - pointing at nothing...an empty hole.
>
> I have the feeling the answer to this is: "no they don't, there is no spoon ('original'), you're still OK". I suspect that only because the people who thought this up couldn't possibly have missed such an "obvious" point. The problem I have is in trying to mentally frame this in such a way that I can subsequently explain it if asked to do so (which I see coming for sure).
>
> Help in understanding this would be hugely helpful - anyone?
>
> Regards & TIA,
> -Me
> I am no pro in ZFS, but to my understanding there is no original.

That is correct. From a semantic perspective, there is no change in behavior between dedup=off and dedup=on. Even the accounting remains the same: each reference to a block is charged to the dataset making the reference. The only place you see the effect of dedup is at the pool level, which can now have more logical than physical data. You may also see a difference in performance, which can be either positive or negative depending on a whole bunch of factors.

At the implementation level, all that's really happening with dedup is that when you write a block whose contents are identical to an existing block, instead of allocating new disk space we just increment a reference count on the existing block. When you free the block (from the dataset's perspective), the storage pool decrements the reference count, but the block remains allocated at the pool level. When the reference count goes to zero, the storage pool frees the block for real (returns it to the storage pool's free space map).

But, to reiterate, none of this is visible semantically. The only way you can even tell dedup is happening is to observe that the total space used by all datasets exceeds the space allocated from the pool -- i.e. that the pool's dedup ratio is greater than 1.0.

Jeff
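Jeff's description boils down to reference counting keyed on block contents. The toy model below (ordinary Python with a made-up Pool class; the real ZFS dedup table is keyed on the block checksum and does considerably more) mirrors the write/free behaviour he describes - in particular, that deleting the "original" only decrements a count:

    import hashlib

    class Pool:
        """Toy model of block-level dedup: one stored copy per unique
        content, plus a reference count. Not real ZFS, just the logic."""

        def __init__(self):
            self.blocks = {}    # checksum -> block contents ("allocated space")
            self.refcount = {}  # checksum -> number of references

        def write(self, data):
            key = hashlib.sha256(data).hexdigest()
            if key in self.blocks:
                self.refcount[key] += 1      # duplicate: no new space used
            else:
                self.blocks[key] = data      # first copy: allocate it
                self.refcount[key] = 1
            return key                       # what a file would "point to"

        def free(self, key):
            self.refcount[key] -= 1
            if self.refcount[key] == 0:      # last reference gone:
                del self.blocks[key]         # only now is the space reclaimed
                del self.refcount[key]

    pool = Pool()
    a = pool.write(b"holiday photo bits")    # the "original" download
    b = pool.write(b"holiday photo bits")    # the duplicate download
    assert a == b and len(pool.blocks) == 1  # one physical copy, two references

    pool.free(a)                             # delete the "original"
    assert len(pool.blocks) == 1             # data still there for the other file
    pool.free(b)
    assert len(pool.blocks) == 0             # space freed when the last ref goes

The pool never records which reference came first, so there is nothing special to lose by deleting the file you happen to think of as the original.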
On Tue, Dec 8, 2009 at 22:54, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:
> > I am no pro in ZFS, but to my understanding there is no original.
>
> That is correct. From a semantic perspective, there is no change in behavior between dedup=off and dedup=on. Even the accounting remains the same: each reference to a block is charged to the dataset making the reference. The only place you see the effect of dedup is at the pool level, which can now have more logical than physical data. You may also see a difference in performance, which can be either positive or negative depending on a whole bunch of factors.
>
> At the implementation level, all that's really happening with dedup is that when you write a block whose contents are identical to an existing block, instead of allocating new disk space we just increment a reference count on the existing block. When you free the block (from the dataset's perspective), the storage pool decrements the reference count, but the block remains allocated at the pool level. When the reference count goes to zero, the storage pool frees the block for real (returns it to the storage pool's free space map).
>
> But, to reiterate, none of this is visible semantically. The only way you can even tell dedup is happening is to observe that the total space used by all datasets exceeds the space allocated from the pool -- i.e. that the pool's dedup ratio is greater than 1.0.

Jeff, Thomas, Ed & Michael;

Thank you all for assisting in the education of a n00bie in this most important ZFS feature. I *think* I have a better overall understanding now. This list is a resource treasure trove! I hope I'm able to acquire sufficient knowledge over time to eventually be able to contribute help to other newcomers.

Regards & thanks for all the help,
-Me