thr3ads.net - zfs discuss - [zfs-discuss] I can believe that ZFS is better than hardware RAID? [Nov 2005]


Jeff Bonwick

2005-Nov-29 08:59 UTC

[zfs-discuss] I can believe that ZFS is better than hardware RAID?

> At what point is it faster to simply copy a large swath of disk drive or
> LUN than going back and forth, taking speed hits, by copying only the data?

It depends on the I/O scheduling policy, the hardware characteristics,
the placement of the data being copied, the block sizes, etc.

> Could you re-order your read/write operations to increase the speed
> of the overall copy operation yet also maintain the requisite amount
> of consistency?  And do it all while other operations are ongoing?

Yes, and we do exactly that.  In fact, this problem -- live pool
traversal -- was one of the hardest things we did in ZFS.  It took
the better part of a year just to figure out how to pull it off.
(It's a good war story -- I'll blog about it some cold night
in December.)

Once we had the algorithmic issues nailed down, the actual coding
wasn't too bad -- dmu_traverse.c is only about 800 lines of code.
But it was worth the effort because live pool traversal allows us
to do all sorts of things without locking the filesystem: resilvering,
disk scrubbing, snapshot deletion, and so on.

Going through the block tree has two major benefits.  Most obviously,
since we have the block pointers, we have the checksums -- so we can
verify the correctness of the operations along the way.  But the other
*very* cool thing is that ZFS resilvering is breadth-first.  That is,
the very first thing we resilver is the uberblock and disk labels.
Then we resilver the meta-objset; then each objset's meta-dnode;
and so on.  Throughout the process we maintain this rule: no block
is resilvered until all of its ancestors have been resilvered.
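
To make the ordering rule concrete, here is a minimal Python sketch of a
breadth-first walk over a toy block tree.  It is not the dmu_traverse.c
logic: the Block class, the block names, and the use of SHA-256 are all
assumptions made for illustration.  It only shows that a breadth-first
queue visits every block after its ancestors, and that each parent's
pointer can carry the checksum used to verify the child before it is
copied.

```python
from collections import deque
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(data).hexdigest()


class Block:
    """Toy block: a payload plus pointers to children (not the ZFS on-disk format)."""

    def __init__(self, name, data, children=()):
        self.name = name
        self.data = data
        # Each pointer stores the child's expected checksum next to the child.
        self.pointers = [(checksum(c.data), c) for c in children]


def resilver_order(root, root_checksum):
    """Return block names breadth-first, verifying each block before 'copying' it."""
    order = []
    queue = deque([(root_checksum, root)])
    while queue:
        expected, block = queue.popleft()
        if checksum(block.data) != expected:
            raise ValueError("checksum mismatch in " + block.name)
        order.append(block.name)  # stand-in for copying the block to the new disk
        queue.extend(block.pointers)  # children wait behind everything already queued
    return order


# Hypothetical tree: uberblock -> meta-objset -> two objsets -> one file each.
file_a = Block("file_a", b"aaa")
file_b = Block("file_b", b"bbb")
objset_1 = Block("objset_1", b"os1", [file_a])
objset_2 = Block("objset_2", b"os2", [file_b])
meta = Block("meta_objset", b"mos", [objset_1, objset_2])
uber = Block("uberblock", b"ub", [meta])

order = resilver_order(uber, checksum(uber.data))
print(order)
```

In this toy tree the root comes out first and the two files come out
last, so no block is ever copied before the blocks needed to reach it.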

It's hard to overstate how important this is.  With a blind disk copy
there's a 50% chance that when you're 50% done, you still haven't
resilvered the blocks needed to *find* the stuff you've resilvered.
This means that from an MTTR perspective, you haven't actually made
much progress: a second disk failure at this point would be catastrophic.

But with breadth-first resilvering, every single block copy increases
the amount of discoverable data.  If you had a second disk failure,
everything that had been resilvered up to that point would be available.
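
The difference can be illustrated with a small Python sketch that counts
how many copied blocks are actually reachable at each step.  The tree,
the block names, and especially the "blind" copy order are invented for
the example (a hypothetical disk layout where the metadata happens to
sit late on the disk); nothing here is measured from a real pool.  A
block counts as discoverable only if it and all of its ancestors have
been copied.

```python
# Hypothetical block tree, expressed as child -> parent (the root has no parent).
parent = {
    "uberblock": None,
    "meta_objset": "uberblock",
    "objset_1": "meta_objset",
    "objset_2": "meta_objset",
    "file_a": "objset_1",
    "file_b": "objset_1",
    "file_c": "objset_2",
    "file_d": "objset_2",
}


def discoverable(copied, parent):
    """Count copied blocks whose every ancestor has also been copied."""
    copied = set(copied)
    count = 0
    for block in copied:
        node = block
        # Walk toward the root; stop at the first ancestor that is missing.
        while node is not None and node in copied:
            node = parent[node]
        if node is None:  # reached past the root without hitting a gap
            count += 1
    return count


# Breadth-first order: ancestors always come first.
bfs_order = ["uberblock", "meta_objset", "objset_1", "objset_2",
             "file_a", "file_b", "file_c", "file_d"]

# Assumed disk-offset order for a blind copy; unrelated to tree position.
blind_order = ["file_c", "objset_1", "file_a", "file_d",
               "uberblock", "file_b", "meta_objset", "objset_2"]

bfs_progress = [discoverable(bfs_order[:n], parent) for n in range(1, 9)]
blind_progress = [discoverable(blind_order[:n], parent) for n in range(1, 9)]

print("breadth-first:", bfs_progress)
print("blind copy:   ", blind_progress)
```

With this particular layout the blind copy is half done before a single
block is reachable, while the breadth-first order gains one discoverable
block per copy.  A different layout would give different numbers for the
blind copy, but the breadth-first line is the same for any tree.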

Jeff

Torrey McMahon

2005-Dec-01 06:55 UTC

Jeff Bonwick wrote:
>> At what point is it faster to simply copy a large swath of disk drive or
>> LUN than going back and forth, taking speed hits, by copying only the data?
>
> It depends on the I/O scheduling policy, the hardware characteristics,
> the placement of the data being copied, the block sizes, etc.

Which isn't too easy to decipher, especially if your LUN is really made
up of something else, which is made up of something else. The big problem
with virtualization is being able to peel the onion when needed ... but we
already know that, right? :-)
