Jeff Bonwick
2005-Nov-29 08:59 UTC
[zfs-discuss] I can believe that ZFS is better than hardware RAID?
> At what point is it faster to simply copy a large swath of disk drive or
> LUN than going back and forth, taking speed hits, by copying only the data?

It depends on the I/O scheduling policy, the hardware characteristics, the placement of the data being copied, the block sizes, etc.

> Could you re-order your read/write operations to increase the speed
> of the overall copy operation yet also maintain the requisite amount
> of consistency? And do it all while other operations are ongoing?

Yes, and we do exactly that. In fact, this problem -- live pool traversal -- was one of the hardest things we did in ZFS. It took the better part of a year just to figure out how to pull it off. (It's a good war story -- I'll blog about it some cold night in December.) Once we had the algorithmic issues nailed down, the actual coding wasn't too bad -- dmu_traverse.c is only about 800 lines of code. But it was worth the effort because live pool traversal allows us to do all sorts of things without locking the filesystem: resilvering, disk scrubbing, snapshot deletion, and so on.

Going through the block tree has two major benefits. Most obviously, since we have the block pointers, we have the checksums -- so we can verify the correctness of the operations along the way.

But the other *very* cool thing is that ZFS resilvering is breadth-first. That is, the very first thing we resilver is the uberblock and disk labels. Then we resilver the meta-objset; then each objset's meta-dnode; and so on. Throughout the process we maintain this rule: no block is resilvered until all of its ancestors have been resilvered.

It's hard to overstate how important this is. With a blind disk copy there's a 50% chance that when you're 50% done, you still haven't resilvered the blocks needed to *find* the stuff you've resilvered. This means that from an MTTR perspective, you haven't actually made much progress: a second disk failure at this point would be catastrophic.

But with breadth-first resilvering, every single block copy increases the amount of discoverable data. If you had a second disk failure, everything that had been resilvered up to that point would be available.

Jeff
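
To make the ancestor-first rule concrete, here is a minimal sketch in Python. It is a toy model, not the code in dmu_traverse.c: the Block class, build_tree, breadth_first_order, discoverable and discoverable_after are all hypothetical names invented for this illustration, and the "blind" copy is simply assumed to hit the leaves before the root. It counts how many copied blocks can still be found by walking down from the root when a copy is interrupted halfway.

```python
from collections import deque


class Block:
    """Toy stand-in for a block in the pool's block tree (hypothetical)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def build_tree(fanout, depth, name="root"):
    """Build a uniform tree: every non-leaf block has `fanout` children."""
    if depth == 0:
        return Block(name)
    kids = [build_tree(fanout, depth - 1, f"{name}.{i}") for i in range(fanout)]
    return Block(name, kids)


def breadth_first_order(root):
    """Return blocks so that every block appears after all of its ancestors."""
    order, queue = [], deque([root])
    while queue:
        block = queue.popleft()
        order.append(block)
        queue.extend(block.children)
    return order


def discoverable(root, copied):
    """Count copied blocks reachable from the root via copied blocks only."""
    if root.name not in copied:
        return 0  # without the root, nothing that was copied can be found
    count, queue = 0, deque([root])
    while queue:
        block = queue.popleft()
        count += 1
        queue.extend(c for c in block.children if c.name in copied)
    return count


def discoverable_after(root, order, k):
    """Discoverable blocks if the copy stops after the first k blocks of `order`."""
    copied = {block.name for block in order[:k]}
    return discoverable(root, copied)


root = build_tree(fanout=3, depth=3)   # 1 + 3 + 9 + 27 = 40 blocks
bfs = breadth_first_order(root)
blind = list(reversed(bfs))            # assumed layout: leaves first, root last
half = len(bfs) // 2

print(discoverable_after(root, bfs, half))    # 20: every copied block is usable
print(discoverable_after(root, blind, half))  # 0: the root has not been copied yet
```

In the breadth-first order the discoverable count equals the number of blocks copied at every step, which is the property described above. The reversed order is a deliberately bad case chosen for illustration; a real blind copy would fall somewhere in between, depending on where the metadata happens to sit on disk.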
Torrey McMahon
2005-Dec-01 06:55 UTC
[zfs-discuss] I can believe that ZFS is better than hardware RAID?
Jeff Bonwick wrote:
>> At what point is it faster to simply copy a large swath of disk drive or
>> LUN than going back and forth, taking speed hits, by copying only the data?
>
> It depends on the I/O scheduling policy, the hardware characteristics,
> the placement of the data being copied, the block sizes, etc.

Which isn't too easy to decipher. Especially if your LUN is really made up of something else of something else. The big problem with virtualization is being able to peel the onion when needed ... but we already know that, right? :-)