Does the dedupe functionality happen at the file level or at a lower block
level? I am writing a large number of files that have the following structure:

------ file begins
1024 lines of random ASCII chars, 64 chars long
some tilde chars .. about 1000 of them
some text ( english ) for 2K
more text ( english ) for 700 bytes or so
------------------

Each file has the same tilde chars and then english text at the end of 64K
of random character data.

Before writing the data I see:

# zpool get size,capacity,version,dedupratio,free,allocated zp_dd
NAME   PROPERTY    VALUE   SOURCE
zp_dd  size        67.5G   -
zp_dd  capacity    6%      -
zp_dd  version     21      default
zp_dd  dedupratio  1.16x   -
zp_dd  free        63.3G   -
zp_dd  allocated   4.19G   -

After, I see this:

# zpool get size,capacity,version,dedupratio,free,allocated zp_dd
NAME   PROPERTY    VALUE   SOURCE
zp_dd  size        67.5G   -
zp_dd  capacity    6%      -
zp_dd  version     21      default
zp_dd  dedupratio  1.11x   -
zp_dd  free        63.1G   -
zp_dd  allocated   4.36G   -

Note the drop in dedup ratio from 1.16x to 1.11x, which seems to indicate
that dedupe does not detect that the english text is identical in every file.

-- 
Dennis
Dennis Clarke wrote:
> Does the dedupe functionality happen at the file level or a lower block
> level?

Block level, but remember that block size may vary from file to file.

> I am writing a large number of files that have the following structure:
>
> ------ file begins
> 1024 lines of random ASCII chars, 64 chars long
> some tilde chars .. about 1000 of them
> some text ( english ) for 2K
> more text ( english ) for 700 bytes or so
> ------------------
>
> Each file has the same tilde chars and then english text at the end of 64K
> of random character data.
>
> [...]
>
> Note the drop in dedup ratio from 1.16x to 1.11x, which seems to indicate
> that dedupe does not detect that the english text is identical in every file.

Theory: your files may end up being in one large 128K block, or maybe a
couple of 64K blocks, where there isn't much redundancy to de-dup.

-tim
On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
> Does the dedupe functionality happen at the file level or a lower block
> level?

It occurs at the block allocation level.

> I am writing a large number of files that have the following structure:
>
> ------ file begins
> 1024 lines of random ASCII chars, 64 chars long
> some tilde chars .. about 1000 of them
> some text ( english ) for 2K
> more text ( english ) for 700 bytes or so
> ------------------

ZFS's default block size is 128K and is controlled by the "recordsize"
filesystem property. Unless you changed "recordsize", each of the files
above would be a single block, distinct from the others.

You may or may not get better dedup ratios with a smaller recordsize,
depending on how the common parts of the file line up with block
boundaries.

The cost of additional indirect blocks might overwhelm the savings from
deduping a small common piece of the file.

 - Bill
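For what it's worth, trying a smaller recordsize is cheap since it is a
per-dataset property; a rough sketch of such an experiment (the dataset
name is only an example, and recordsize affects only files written after
it is set):

# zfs create -o recordsize=8K -o dedup=on zp_dd/rstest
# zfs get recordsize,dedup zp_dd/rstest
  (rewrite the same test files into the new dataset)
# zpool get dedupratio zp_dd

Comparing dedupratio before and after would give a hint as to whether the
common tail of each file ever lands in a block of its own.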
> On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
>> Does the dedupe functionality happen at the file level or a lower block
>> level?
>
> It occurs at the block allocation level.
>
>> I am writing a large number of files that have the following structure:
>>
>> ------ file begins
>> 1024 lines of random ASCII chars, 64 chars long
>> some tilde chars .. about 1000 of them
>> some text ( english ) for 2K
>> more text ( english ) for 700 bytes or so
>> ------------------
>
> ZFS's default block size is 128K and is controlled by the "recordsize"
> filesystem property. Unless you changed "recordsize", each of the files
> above would be a single block, distinct from the others.
>
> You may or may not get better dedup ratios with a smaller recordsize,
> depending on how the common parts of the file line up with block
> boundaries.
>
> The cost of additional indirect blocks might overwhelm the savings from
> deduping a small common piece of the file.
>
>  - Bill

Well, I was curious about these sorts of things and figured that a simple
test would show me the behavior.

Now the first test I did was to write 26^2 files [a-z][a-z].dat in 26^2
directories named [a-z][a-z], where each file is 64K of random
non-compressible data and then some english text.

I guess I was wrong to call the 64K random chunk non-compressible, because
I wrote out that data as chars from the set { [A-Z][a-z][0-9] } and thus it
is compressible ASCII data as opposed to random binary data.

So ... after doing that a few times I now see something fascinating:

$ ls -lo /tester/foo/*/aa/aa.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:38 /tester/foo/1/aa/aa.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:45 /tester/foo/2/aa/aa.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:43 /tester/foo/3/aa/aa.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:43 /tester/foo/4/aa/aa.dat
$ ls -lo /tester/foo/*/zz/az.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:39 /tester/foo/1/zz/az.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:47 /tester/foo/2/zz/az.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:45 /tester/foo/3/zz/az.dat
-rw-r--r--   1 dclarke     68330 Nov  7 22:47 /tester/foo/4/zz/az.dat

$ find /tester/foo -type f | wc -l
   70304

Those files, all 70,000+ of them, are unique and smaller than the
filesystem blocksize.

However:

$ zfs get used,available,referenced,compressratio,recordsize,compression,dedup zp_dd/tester
NAME          PROPERTY       VALUE   SOURCE
zp_dd/tester  used           4.51G   -
zp_dd/tester  available      3.49G   -
zp_dd/tester  referenced     4.51G   -
zp_dd/tester  compressratio  1.00x   -
zp_dd/tester  recordsize     128K    default
zp_dd/tester  compression    off     local
zp_dd/tester  dedup          on      local

Compression factors don't interest me at the moment .. but see this:

$ zpool get all zp_dd
NAME   PROPERTY       VALUE                 SOURCE
zp_dd  size           67.5G                 -
zp_dd  capacity       6%                    -
zp_dd  altroot        -                     default
zp_dd  health         ONLINE                -
zp_dd  guid           14649016030066358451  default
zp_dd  version        21                    default
zp_dd  bootfs         -                     default
zp_dd  delegation     on                    default
zp_dd  autoreplace    off                   default
zp_dd  cachefile      -                     default
zp_dd  failmode       wait                  default
zp_dd  listsnapshots  off                   default
zp_dd  autoexpand     off                   default
zp_dd  dedupratio     1.95x                 -
zp_dd  free           63.3G                 -
zp_dd  allocated      4.22G                 -

The dedupe ratio has climbed to 1.95x with all those unique files that are
less than %recordsize% bytes.

-- 
Dennis Clarke
dclarke at opensolaris.ca  <- Email related to the open source Solaris
dclarke at blastwave.org   <- Email related to open source for Solaris
Dennis Clarke wrote:
> Well, I was curious about these sorts of things and figured that a simple
> test would show me the behavior.
>
> [...]
>
> The dedupe ratio has climbed to 1.95x with all those unique files that are
> less than %recordsize% bytes.

You can get more dedup information by running 'zdb -DD zp_dd'. This
should show you how we break things down.
Add more 'D' options and get even more detail.

- George
On Sat, 7 Nov 2009, Dennis Clarke wrote:
>
> Now the first test I did was to write 26^2 files [a-z][a-z].dat in 26^2
> directories named [a-z][a-z], where each file is 64K of random
> non-compressible data and then some english text.

What method did you use to produce this "random" data?

> The dedupe ratio has climbed to 1.95x with all those unique files that are
> less than %recordsize% bytes.

Perhaps there are other types of blocks besides user data blocks (e.g.
metadata blocks) which become subject to deduplication? Presumably
'dedupratio' is based on a count of blocks rather than percentage of
total data.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> On Sat, 7 Nov 2009, Dennis Clarke wrote:
>>
>> Now the first test I did was to write 26^2 files [a-z][a-z].dat in 26^2
>> directories named [a-z][a-z], where each file is 64K of random
>> non-compressible data and then some english text.
>
> What method did you use to produce this "random" data?

I'm using the tt800 method from Makoto Matsumoto, described here:

    http://random.mat.sbg.ac.at/generators/

and then used like this:

/*
 * Generate the random text before we need it and also
 * outside of the area that measures the IO time.
 * We could have just read bytes from /dev/urandom but
 * you would be *amazed* how slow that is.
 */
random_buffer_start_hrt = gethrtime();
if ( random_buffer_start_hrt == -1 ) {
    perror("Could not get random_buffer high res start time");
    exit(EXIT_FAILURE);
}

for ( char_count = 0; char_count < 65535; ++char_count ) {
    k_index = (int) ( genrand() * (double) 62 );
    buffer_64k_rand_text[char_count] = alph[k_index];
}

/* would be nice to break this into 0x40 char lines */
for ( p = 0x03fu; p < 65535; p = p + 0x040u )
    buffer_64k_rand_text[p] = '\n';

buffer_64k_rand_text[65535] = '\n';
buffer_64k_rand_text[65536] = '\0';

random_buffer_end_hrt = gethrtime();

That works well.

You know what ... I'm a schmuck. I didn't grab a time-based seed first.
All those files with random text .. have identical twins on the filesystem
somewhere. :-P  damn

I'll go fix that.

>> The dedupe ratio has climbed to 1.95x with all those unique files that
>> are less than %recordsize% bytes.
>
> Perhaps there are other types of blocks besides user data blocks (e.g.
> metadata blocks) which become subject to deduplication? Presumably
> 'dedupratio' is based on a count of blocks rather than percentage of
> total data.

I have no idea .. yet. I figure I'll try a few more experiments to see
what it does and maybe, dare I say it, look at the source :-)

-- 
Dennis Clarke
dclarke at opensolaris.ca  <- Email related to the open source Solaris
dclarke at blastwave.org   <- Email related to open source for Solaris
> You can get more dedup information by running 'zdb -DD zp_dd'. This
> should show you how we break things down. Add more 'D' options and get
> even more detail.
>
> - George

OKay .. thank you. Looks like I have piles of numbers here:

# zdb -DDD zp_dd
DDT-sha256-zap-duplicate: 37317 entries, size 342 on disk, 210 in core

bucket             allocated                      referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2    18.4K    763M    355M    355M    37.9K   1.52G    727M    727M
     4    18.0K   1.16G   1.15G   1.15G    72.4K   4.67G   4.61G   4.61G
     8       70   1.47M    849K    849K      657   12.0M   6.78M   6.78M
    16       27   39.5K   31.5K   31.5K      535    747K    598K    598K
    32        6      4K      4K      4K      276    180K    180K    180K
    64        4   9.00K   6.50K   6.50K      340    680K    481K    481K
   128        1      2K   1.50K   1.50K      170    340K    255K    255K
   256        1      1K      1K      1K      313    313K    313K    313K
   512        1     512     512     512      522    261K    261K    261K
 Total    36.4K   1.91G   1.50G   1.50G     113K   6.21G   5.33G   5.33G

DDT-sha256-zap-unique: 154826 entries, size 335 on disk, 196 in core

bucket             allocated                      referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     151K   5.61G   2.52G   2.52G     151K   5.61G   2.52G   2.52G
 Total     151K   5.61G   2.52G   2.52G     151K   5.61G   2.52G   2.52G

DDT histogram (aggregated over all DDTs):

bucket             allocated                      referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     151K   5.61G   2.52G   2.52G     151K   5.61G   2.52G   2.52G
     2    18.4K    763M    355M    355M    37.9K   1.52G    727M    727M
     4    18.0K   1.16G   1.15G   1.15G    72.4K   4.67G   4.61G   4.61G
     8       70   1.47M    849K    849K      657   12.0M   6.78M   6.78M
    16       27   39.5K   31.5K   31.5K      535    747K    598K    598K
    32        6      4K      4K      4K      276    180K    180K    180K
    64        4   9.00K   6.50K   6.50K      340    680K    481K    481K
   128        1      2K   1.50K   1.50K      170    340K    255K    255K
   256        1      1K      1K      1K      313    313K    313K    313K
   512        1     512     512     512      522    261K    261K    261K
 Total     188K   7.52G   4.01G   4.01G     264K   11.8G   7.85G   7.85G

dedup = 1.96, compress = 1.51, copies = 1.00, dedup * compress / copies = 2.95
#

I have no idea what any of that means, yet :-)

-- 
Dennis Clarke
dclarke at opensolaris.ca  <- Email related to the open source Solaris
dclarke at blastwave.org   <- Email related to open source for Solaris
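(For reference, the bottom-line figures look like straight arithmetic on
the aggregated totals above; this is a rough reading of the table, not a
description of zdb's internals:

    dedup    = referenced DSIZE / allocated DSIZE  = 7.85G / 4.01G ~= 1.96
    compress = referenced LSIZE / referenced PSIZE = 11.8G / 7.85G ~= 1.51
    copies   = referenced DSIZE / referenced PSIZE = 7.85G / 7.85G  = 1.00

so dedup * compress / copies ~= 2.95, i.e. the pool is referencing roughly
2.95x as much logical data as it has physically allocated for these blocks.)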
On Sun, 8 Nov 2009, Dennis Clarke wrote:
>
> That works well.
>
> You know what ... I'm a schmuck. I didn't grab a time-based seed first.
> All those files with random text .. have identical twins on the
> filesystem somewhere. :-P  damn

That is one reason why I asked. Failure to get a good seed is the most
common problem. Using the time() system call is no longer good enough if
multiple processes are somehow involved. It is useful to include
additional information such as the PID and microseconds. Reading a few
characters from /dev/random to create the seed is even better.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
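A rough sketch of what that could look like in C (illustrating the
suggestion above; seed_genrand() is just a placeholder name for however
the tt800 state actually gets seeded, not a real function from that code):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

static unsigned long
make_seed(void)
{
    unsigned long seed = 0;
    struct timeval tv;
    FILE *fp;

    /* prefer a few bytes from /dev/random if it is available */
    fp = fopen("/dev/random", "r");
    if (fp != NULL) {
        (void) fread(&seed, sizeof (seed), 1, fp);
        (void) fclose(fp);
    }

    /* mix in microseconds and the PID so that processes started
     * within the same second still get different seeds */
    (void) gettimeofday(&tv, NULL);
    seed ^= (unsigned long) tv.tv_sec;
    seed ^= (unsigned long) tv.tv_usec << 16;
    seed ^= (unsigned long) getpid();

    return (seed);
}

/*
 * then, once at startup and before any data is generated:
 *
 *     seed_genrand(make_seed());
 */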
Got some out-of-curiosity questions for the gurus if they have time to
answer:

Isn't dedupe in some ways the antithesis of setting copies > 1? We go to a
lot of trouble to create redundancy (n-way mirroring, raidz-n, copies=n,
etc.) to make things as robust as possible, and then we reduce redundancy
with dedupe and compression :-).

What would be the difference in MTTDL between a scenario where the dedupe
ratio is exactly two and you've set copies=2, vs. no dedupe and copies=1?
Intuitively MTTDL would be better because of the copies=2, but you'd lose
twice the data when DL eventually happens.

Similarly, if hypothetically the dedupe ratio = 1.5 and you have a two-way
mirror, vs. no dedupe and a 3-disk raidz1, which would be more reliable?
Again intuition says the mirror because there's one less device to fail,
but device failure isn't the only consideration.

In both cases it sounds like you might gain a bit in performance,
especially if the dedupe ratio is high, because you don't have to write
the actual duplicated blocks on a write, and on a read you are more likely
to have the data blocks in cache. Does this make sense?

Maybe there are too many variables, but it would be so interesting to hear
of possible decision-making algorithms. A similar discussion applies to
compression, although that seems to defeat redundancy more directly. This
analysis requires good statistical maths skills!

Thanks -- Frank
On Nov 12, 2009, at 1:36 PM, Frank Middleton wrote:

> Got some out-of-curiosity questions for the gurus if they have time to
> answer:
>
> Isn't dedupe in some ways the antithesis of setting copies > 1? We go to
> a lot of trouble to create redundancy (n-way mirroring, raidz-n,
> copies=n, etc.) to make things as robust as possible, and then we reduce
> redundancy with dedupe and compression :-).
>
> What would be the difference in MTTDL between a scenario where the
> dedupe ratio is exactly two and you've set copies=2, vs. no dedupe and
> copies=1? Intuitively MTTDL would be better because of the copies=2, but
> you'd lose twice the data when DL eventually happens.

The MTTDL models I've used consider any loss a complete loss. But there
are some interesting wrinkles to explore here... :-)

> Similarly, if hypothetically the dedupe ratio = 1.5 and you have a
> two-way mirror, vs. no dedupe and a 3-disk raidz1, which would be more
> reliable? Again intuition says the mirror because there's one less
> device to fail, but device failure isn't the only consideration.
>
> In both cases it sounds like you might gain a bit in performance,
> especially if the dedupe ratio is high, because you don't have to write
> the actual duplicated blocks on a write, and on a read you are more
> likely to have the data blocks in cache. Does this make sense?
>
> Maybe there are too many variables, but it would be so interesting to
> hear of possible decision-making algorithms. A similar discussion
> applies to compression, although that seems to defeat redundancy more
> directly. This analysis requires good statistical maths skills!

There are several dimensions here. But I'm not yet convinced there is a
configuration decision point to consume a more detailed analysis. In other
words, if you could decide between two or more possible configurations,
what would you wish to consider to improve the outcome? Thoughts?

 -- richard
> Isn't dedupe in some ways the antithesis of setting
> copies > 1? We go to a lot of trouble to create redundancy (n-way
> mirroring, raidz-n, copies=n, etc.) to make things as robust as
> possible and then we reduce redundancy with dedupe and compression

But are we reducing redundancy? I don't know the details of how dedupe is
implemented, but I'd have thought that if copies=2, you get 2 copies of
each deduped block. So your data is just as safe since you haven't
actually changed the redundancy; it's just that, like you say, you're
risking more data being lost in the event of a problem.

However, the flip side of that is that dedupe in many circumstances will
free up a lot of space, possibly enough to justify copies=3, or even 4. So
if you were to use dedupe and compression, you could probably add more
redundancy without losing capacity.

And with the speed benefits associated with dedupe to boot.

More reliable and faster, at the same price. Sounds good to me :D
-- 
This message posted from opensolaris.org
On 13.11.09 16:09, Ross wrote:
>> Isn't dedupe in some ways the antithesis of setting copies > 1? We go to
>> a lot of trouble to create redundancy (n-way mirroring, raidz-n,
>> copies=n, etc.) to make things as robust as possible and then we reduce
>> redundancy with dedupe and compression
>
> But are we reducing redundancy? I don't know the details of how dedupe is
> implemented, but I'd have thought that if copies=2, you get 2 copies of
> each deduped block. So your data is just as safe since you haven't
> actually changed the redundancy; it's just that, like you say, you're
> risking more data being lost in the event of a problem.
>
> However, the flip side of that is that dedupe in many circumstances will
> free up a lot of space, possibly enough to justify copies=3, or even 4.

It is not possible to set copies to 4. There's space for only 3 addresses
in the block pointer.

There's also a dedupditto property which specifies a threshold, and if the
reference count for a deduped block goes above the threshold, another
ditto copy of it is stored automatically.

victor
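For example (a rough sketch; zp_dd/tester is just the dataset from earlier
in the thread, and the threshold value here is arbitrary):

# zfs set copies=2 zp_dd/tester        (valid values are 1, 2 or 3)
# zpool set dedupditto=100 zp_dd

With the above, blocks in that dataset get two copies, and any deduped
block whose reference count climbs past 100 gets an additional ditto copy
stored automatically, as described above.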
On Fri, 13 Nov 2009, Ross wrote:
>
> But are we reducing redundancy? I don't know the details of how dedupe
> is implemented, but I'd have thought that if copies=2, you get 2 copies
> of each deduped block. So your data is just as safe since you haven't
> actually changed the redundancy; it's just that, like you say, you're
> risking more data being lost in the event of a problem.

Another point is that the degree of risk is related to the degree of total
exposure. The more disk space consumed, the greater the chance that there
will be data loss. Assuming that the algorithm and implementation are
quite solid, it seems that dedupe should increase data reliability.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Fri, Nov 13, 2009 at 7:09 AM, Ross <myxiplx at googlemail.com> wrote:

>> Isn't dedupe in some ways the antithesis of setting
>> copies > 1? We go to a lot of trouble to create redundancy (n-way
>> mirroring, raidz-n, copies=n, etc.) to make things as robust as
>> possible and then we reduce redundancy with dedupe and compression
>
> But are we reducing redundancy? I don't know the details of how dedupe is
> implemented, but I'd have thought that if copies=2, you get 2 copies of
> each deduped block. So your data is just as safe since you haven't
> actually changed the redundancy; it's just that, like you say, you're
> risking more data being lost in the event of a problem.
>
> However, the flip side of that is that dedupe in many circumstances will
> free up a lot of space, possibly enough to justify copies=3, or even 4.
> So if you were to use dedupe and compression, you could probably add more
> redundancy without losing capacity.
>
> And with the speed benefits associated with dedupe to boot.
>
> More reliable and faster, at the same price. Sounds good to me :D

I believe in a previous thread, Adam had said that it automatically keeps
more copies of a block based on how many references there are to that
block. I.e., if there are 20 references it would keep 2 copies, whereas if
there are 20,000 it would keep 5. I'll have to see if I can dig up the old
thread.

--Tim