Hi all,

I'm going to be trying out some tests using b130 for dedup on a server with about 1,7 TB of usable storage (14x146 in two raidz vdevs of 7 disks). What I'm trying to get a handle on is how to estimate the memory overhead required for dedup on that amount of storage. From what I gather, the dedup hash keys are held in ARC and L2ARC and as such are in competition for the available memory.

So the question is how much memory or L2ARC would be necessary to ensure that I'm never going back to disk to read out the hash keys. Better yet would be some kind of algorithm for calculating the overhead, e.g. an average block size of 4K means a hash key for every 4K stored, and a hash occupies 256 bits. An associated question is then how does the ARC handle competition between hash keys and regular ARC functions?

Based on these estimations, I think that I should be able to calculate the following:

1,7             TB
1740,8          GB
1782579,2       MB
1825361100,8    KB
4               average block size (KB)
456340275,2     blocks
256             hash key size (bits)
1,16823E+11     hash key overhead (bits)
14602888806,4   hash key size (bytes)
14260633,6      hash key size (KB)
13926,4         hash key size (MB)
13,6            hash key overhead (GB)

Of course the big question on this will be the average block size - or better yet - to be able to analyze an existing datastore to see just how many blocks it uses and what the current distribution of different block sizes is. I'm currently playing around with zdb with mixed success at extracting this kind of data. That's also a worst-case scenario, since it's counting really small blocks and using 100% of available storage - highly unlikely.

# zdb -ddbb siovale/iphone
Dataset siovale/iphone [ZPL], ID 2381, cr_txg 3764691, 44.6G, 99 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    7    16K    16K  57.0K    64K   77.34  DMU dnode
         1    1    16K     1K  1.50K     1K  100.00  ZFS master node
         2    1    16K    512  1.50K    512  100.00  ZFS delete queue
         3    2    16K    16K  18.0K    32K  100.00  ZFS directory
         4    3    16K   128K   408M   408M  100.00  ZFS plain file
         5    1    16K    16K  3.00K    16K  100.00  FUID table
         6    1    16K     4K  4.50K     4K  100.00  ZFS plain file
         7    1    16K  6.50K  6.50K  6.50K  100.00  ZFS plain file
         8    3    16K   128K   952M   952M  100.00  ZFS plain file
         9    3    16K   128K   912M   912M  100.00  ZFS plain file
        10    3    16K   128K   695M   695M  100.00  ZFS plain file
        11    3    16K   128K   914M   914M  100.00  ZFS plain file

Now, if I'm understanding this output properly, object 4 is composed of 128KB blocks with a total size of 408MB, meaning that it uses 3264 blocks. Can someone confirm (or correct) that assumption? Also, I note that each object (as far as my limited testing has shown) has a single block size with no internal variation.

Interestingly, all of my zvols seem to use fixed-size blocks - that is, there is no variation in the block sizes - they're all the size defined on creation, with no dynamic block sizes being used. I previously thought that the -b option set the maximum size, rather than fixing all blocks.
Learned something today :-)

# zdb -ddbb siovale/testvol
Dataset siovale/testvol [ZVOL], ID 45, cr_txg 4717890, 23.9K, 2 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    7    16K    16K  21.0K    16K    6.25  DMU dnode
         1    1    16K    64K      0    64K    0.00  zvol object
         2    1    16K    512  1.50K    512  100.00  zvol prop

# zdb -ddbb siovale/tm-media
Dataset siovale/tm-media [ZVOL], ID 706, cr_txg 4426997, 240G, 2 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    7    16K    16K  21.0K    16K    6.25  DMU dnode
         1    5    16K     8K   240G   250G   97.33  zvol object
         2    1    16K    512  1.50K    512  100.00  zvol prop
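As a cross-check of the table above, here is a minimal shell sketch of the same back-of-the-envelope math, assuming the 4K average block size and 256-bit (32-byte) keys from the table, plus the block count for object 4:

# Back-of-the-envelope DDT key overhead: capacity / average block size * key size.
# The 4K average block size and 32-byte key are the assumptions from the table above.
capacity_kb=1825361100
avg_block_kb=4
key_bytes=32
awk -v cap="$capacity_kb" -v blk="$avg_block_kb" -v key="$key_bytes" 'BEGIN {
    blocks = cap / blk
    printf "blocks:            %.0f\n", blocks
    printf "hash key overhead: %.1f GB\n", blocks * key / (1024 * 1024 * 1024)
}'

# Block count for object 4: 408 MB of 128K blocks.
awk 'BEGIN { printf "object 4 blocks: %.0f\n", 408 * 1024 / 128 }'

The first result lands on roughly the 13,6 GB from the table, and the second works out to the 3264 blocks assumed for object 4.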
On Jan 21, 2010, at 8:04 AM, erik.ableson wrote:

> Hi all,
>
> I'm going to be trying out some tests using b130 for dedup on a server with about 1,7 TB of usable storage (14x146 in two raidz vdevs of 7 disks). What I'm trying to get a handle on is how to estimate the memory overhead required for dedup on that amount of storage. From what I gather, the dedup hash keys are held in ARC and L2ARC and as such are in competition for the available memory.

... and written to disk, of course.

For ARC sizing, more is always better.

> So the question is how much memory or L2ARC would be necessary to ensure that I'm never going back to disk to read out the hash keys. Better yet would be some kind of algorithm for calculating the overhead, e.g. an average block size of 4K means a hash key for every 4K stored, and a hash occupies 256 bits. An associated question is then how does the ARC handle competition between hash keys and regular ARC functions?

AFAIK, there is no special treatment given to the DDT. The DDT is stored like other metadata and (currently) not easily accounted for.

Also, the DDT keys are 320 bits. The key itself includes the logical and physical block size and compression. The DDT entry is even larger.

I think it is better to think of the ARC as caching the uncompressed DDT blocks which were written to disk. The number of these will be data dependent. "zdb -S poolname" will give you an idea of the number of blocks and how well dedup will work on your data, but that means you already have the data in a pool.
 -- richard
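Reworking the earlier estimate with the 320-bit key raises the floor noticeably; a minimal variation of the sketch above, keeping in mind that the full in-core DDT entry is larger than the key alone, so this is only a lower bound:

# Same math as before, but with 320-bit (40-byte) DDT keys; the full in-core
# entry is larger than the key alone, so treat this as a lower bound.
capacity_kb=1825361100
avg_block_kb=4
key_bytes=40
awk -v cap="$capacity_kb" -v blk="$avg_block_kb" -v key="$key_bytes" 'BEGIN {
    printf "key-only overhead: %.1f GB\n", (cap / blk) * key / (1024 * 1024 * 1024)
}'

That moves the 4K-block worst case from about 13,6 GB to roughly 17 GB before the rest of the entry is even counted.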
On Thu, Jan 21, 2010 at 10:00 PM, Richard Elling <richard.elling at gmail.com> wrote:
> AFAIK, there is no special treatment given to the DDT. The DDT is stored like
> other metadata and (currently) not easily accounted for.
>
> Also the DDT keys are 320 bits. The key itself includes the logical and physical
> block size and compression. The DDT entry is even larger.

Looking at the dedupe code, I noticed that on-disk DDT entries are compressed less efficiently than possible: the key is not compressed at all (I'd expect roughly a 2:1 compression ratio with sha256 data), while the other entry data is currently passed through the zle compressor only (I'd expect this one to be less efficient than off-the-shelf compressors; feel free to correct me if I'm wrong). Is this v1, going to be improved in the future?

Further, with the huge dedup memory footprint and the heavy performance impact when DDT entries need to be read from disk, it might be worthwhile to consider compression of in-core DDT entries (specifically for DDTs or, more generally, making ARC/L2ARC compression-aware). Has this been considered?

Regards,
Andrey
On Thu, Jan 21, 2010 at 05:04:51PM +0100, erik.ableson wrote:
> What I'm trying to get a handle on is how to estimate the memory
> overhead required for dedup on that amount of storage.

We'd all appreciate better visibility of this. This requires:
 - time and observation and experience, and
 - better observability tools and (probably) data exposed for them

> So the question is how much memory or L2ARC would be necessary to
> ensure that I'm never going back to disk to read out the hash keys.

I think that's the wrong goal for optimisation.

For performance (rather than space) issues, I look at dedup as simply increasing the size of the working set, with a goal of reducing the amount of IO (avoided duplicate writes) in return.

If saving one large async write costs several small sync reads, you fall off a very steep performance cliff, especially for IOPS-limited seeking media. However, it doesn't matter whether those reads are for DDT entries or other filesystem metadata necessary to complete the write. Nor does it even matter if those reads are data reads, for other processes that have been pushed out of ARC because of the larger working set. So I think it's right that the ARC doesn't treat DDT entries specially.

The trouble is that the hash function produces (we can assume) random hits across the DDT, so the working set depends on the amount of data and the rate of potentially dedupable writes as well as the actual dedup hit ratio. A high rate of writes also means a large amount of data in ARC waiting to be written at the same time. This makes analysis very hard (and pushes you very fast towards that very steep cliff, as we've all seen).

Separately, what might help is something like "dedup=opportunistic" that would keep the working set smaller:
 - dedup the block IFF the DDT entry is already in (l2)arc
 - otherwise, just write another copy
 - maybe some future async dedup "cleaner", using bp-rewrite, to tidy up later.

I'm not sure what, in this scheme, would ever bring DDT entries into cache, though. Reads for previously dedup'd data?

I also think a threshold on the size of blocks to try deduping would help. If I only dedup blocks (say) 64k and larger, I might well get most of the space benefit for much less overhead.

--
Dan.
On Fri, Jan 22, 2010 at 08:55:16AM +1100, Daniel Carosone wrote:
> For performance (rather than space) issues, I look at dedup as simply
> increasing the size of the working set, with a goal of reducing the
> amount of IO (avoided duplicate writes) in return.

I should add "and avoided future duplicate reads" in those parentheses as well.

A CVS checkout, with identical CVS/Root files in every directory, is a great example. Every one of those files is read on cvs update. Developers often have multiple checkouts (different branches) from the same server. Good performance gains can be had by avoiding potentially many thousands of extra reads and cache entries, whether with dedup or simply by hardlinking them all together. I've hit the 64k limit on hardlinks to the one file more than once with this, on BSD FFS.

It's not a great example for my suggestion of a threshold lower blocksize for dedup, however :-/

--
Dan.
On Thu, Jan 21, 2010 at 2:51 PM, Andrey Kuzmin <andrey.v.kuzmin at gmail.com> wrote:
> Looking at dedupe code, I noticed that on-disk DDT entries are
> compressed less efficiently than possible: key is not compressed at
> all (I'd expect roughly 2:1 compression ratio with sha256 data),

A cryptographic hash such as sha256 should not be compressible. A trivial example shows this to be the case:

for i in {1..10000} ; do
    echo $i | openssl dgst -sha256 -binary
done > /tmp/sha256

$ gzip -c <sha256 >sha256.gz
$ compress -c <sha256 >sha256.Z
$ bzip2 -c <sha256 >sha256.bz2

$ ls -go sha256*
-rw-r--r--   1  320000 Jan 22 04:13 sha256
-rw-r--r--   1  428411 Jan 22 04:14 sha256.Z
-rw-r--r--   1  321846 Jan 22 04:14 sha256.bz2
-rw-r--r--   1  320068 Jan 22 04:14 sha256.gz

--
Mike Gerdts
http://mgerdts.blogspot.com/
On 21 janv. 2010, at 22:55, Daniel Carosone wrote:

> On Thu, Jan 21, 2010 at 05:04:51PM +0100, erik.ableson wrote:
>
>> What I'm trying to get a handle on is how to estimate the memory
>> overhead required for dedup on that amount of storage.
>
> We'd all appreciate better visibility of this. This requires:
> - time and observation and experience, and
> - better observability tools and (probably) data exposed for them

I'd guess that since every written block is going to go and ask for the hash keys, this should result in this data living in the ARC based on the MFU ruleset. The theory being that, as a result, if I can determine the maximum memory requirement for these keys, I know what my minimum memory baseline requirements will be to guarantee that I won't be caught short.

>> So the question is how much memory or L2ARC would be necessary to
>> ensure that I'm never going back to disk to read out the hash keys.
>
> I think that's the wrong goal for optimisation.
>
> For performance (rather than space) issues, I look at dedup as simply
> increasing the size of the working set, with a goal of reducing the
> amount of IO (avoided duplicate writes) in return.

True, but as a practical matter, we've seen that overall performance drops off the cliff if you overstep your memory bounds and the system is obliged to go to disk to evaluate a new block to write against the hash keys. Compounded by the fact that the ARC is full, so it's obliged to go straight to disk, further exacerbating the problem.

It's this particular scenario that I'm trying to avoid, and from the business aspect of selling ZFS based solutions (whether to a client or to an internal project) we need to be able to ensure that the performance is predictable with no surprises. Realizing, of course, that all of this is based on a slew of uncontrollable variables (size of the working set, IO profiles, ideal block sizes, etc.). The empirical approach of "give it lots and we'll see if we need to add an L2ARC later" is not really viable for many managers (despite the fact that the real world works like this).

> The trouble is that the hash function produces (we can assume) random
> hits across the DDT, so the working set depends on the amount of
> data and the rate of potentially dedupable writes as well as the
> actual dedup hit ratio. A high rate of writes also means a large
> amount of data in ARC waiting to be written at the same time. This
> makes analysis very hard (and pushes you very fast towards that very
> steep cliff, as we've all seen).

I don't think it would be random, since _any_ write operation on a deduplicated filesystem would require a hash check, forcing them to live in the MFU. However, I agree that a high write rate would result in memory pressure on the ARC, which could result in the eviction of the hash keys. So the next factor to include in memory sizing is the maximum write rate (determined by IO availability).

So with a team of two GbE cards, I could conservatively say that I need to size for inbound write IO of 160MB/s, worst case accumulated over the 30 second flush cycle, so say about 5GB of memory (leaving aside ZIL issues etc.). Noting that this is all very back-of-the-napkin estimation, and I also need to have some idea of what my physical storage is capable of ingesting, which could add to this value.

> I also think a threshold on the size of blocks to try deduping would
> help. If I only dedup blocks (say) 64k and larger, I might well get
> most of the space benefit for much less overhead.

Well - since my primary use case is iSCSI presentation to VMware backed by zvols and I can manually force the block size on volume creation to 64K, this reduces the unpredictability a little bit. That's based on the hypothesis that zvols use a fixed block size.
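For reference, a minimal sketch of the write-burst sizing above, with the interface throughput and flush interval left as adjustable assumptions rather than measured values:

# Rough worst-case dirty data held in ARC between flushes; 160 MB/s and 30 s
# are the assumptions from the message above, not measured figures.
write_mb_s=160
flush_interval_s=30
awk -v w="$write_mb_s" -v t="$flush_interval_s" 'BEGIN {
    printf "dirty data per flush interval: %.1f GB\n", w * t / 1024
}'

That works out to about 4,7 GB, i.e. the "about 5GB" figure above.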
On Fri, Jan 22, 2010 at 7:19 AM, Mike Gerdts <mgerdts at gmail.com> wrote:
> On Thu, Jan 21, 2010 at 2:51 PM, Andrey Kuzmin
> <andrey.v.kuzmin at gmail.com> wrote:
>> Looking at dedupe code, I noticed that on-disk DDT entries are
>> compressed less efficiently than possible: key is not compressed at
>> all (I'd expect roughly 2:1 compression ratio with sha256 data),
>
> A cryptographic hash such as sha256 should not be compressible. A
> trivial example shows this to be the case:

I'd certainly agree for block encryption, where encrypted bytes are uniformly distributed by design; I was not sure about digests. And, anyway, the DDT includes other data as well, so I'd still consider the compression-efficiency question, especially for in-core DDT entries.

Regards,
Andrey
Sorry for the late answer.

Approximately, it's 150 bytes per individual block, so increasing the block size is a good idea. Also, when the L1 and L2 ARC are not enough, the system will start making disk IOPS, and raidz is not very effective for random IOPS, so it's likely that when your DRAM is not enough your performance will suffer. You may choose to use RAID 10, which is a lot better under random loads.

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com
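Plugging that approximation into the earlier back-of-the-envelope sketch shows why block size dominates the overhead; the 150 bytes per entry is the figure above, and the block sizes are just illustrative choices:

# Rough DDT overhead at ~150 bytes per unique block, for a few block sizes;
# 1,7 TB usable capacity as before. All figures are approximations.
capacity_kb=1825361100
for blk_kb in 4 8 64 128; do
    awk -v cap="$capacity_kb" -v blk="$blk_kb" 'BEGIN {
        printf "%4dK blocks -> %6.1f GB\n", blk, (cap / blk) * 150 / (1024 * 1024 * 1024)
    }'
done

At 128K blocks the estimate drops to roughly 2 GB, versus around 64 GB at 4K, which is consistent with the advice to raise the block size where the workload allows.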