Hi.

I read most of the ZFS documentation and Jeff's blog entry about RAID-Z
(http://blogs.sun.com/roller/page/bonwick?entry=raid_z) and I fail to
understand the difference between RAID-Z and RAID-3. In RAID-3 every read or
write is a full-stripe operation, so there is no need for read-modify-write.
Because I found many differing RAID-3 descriptions, I'll try to explain how
it works. The most important thing: a RAID-3 device has a bigger sector size.
For a 5-disk RAID-3 array, where the disk sector size is 512 bytes, we get a
device with a 2 kB sector (4*512 bytes of data + 512 bytes of parity). Every
read/write operation (a multiple of 2 kB) needs all components.

I implemented RAID-3 for FreeBSD and it works very well, but some things
should be noted:

- linear/random (parallel) writing is faster than for RAID-5,
- linear (parallel) reads are faster than for RAID-5,
- random, parallel reads are slower than for RAID-5.

It is possible to speed up random reads by also using the parity component.
In case of power loss or a system crash, parity still has to be
resynchronized: there is no guarantee that one of the disks wasn't slower
and didn't finish its write.

After explaining this, my questions are:

Where is the difference between RAID-Z and RAID-3?

How does RAID-Z compare to RAID-5 on random parallel read performance? Have
you tried to put ZFS on a HW RAID-5 array and compare the performance with
ZFS's RAID-Z for many parallel random reads?

Does RAID-Z depend on ZFS checksums? Is it still reliable if checksumming is
turned off? If parity correctness is not verified at the first opportunity
(after power loss, on boot), the risk of data loss is bigger: one of the
raid's components can fail, leaving the data in an uncorrectable state if
that data was never read, so there was no chance to detect the problem.

And BTW, I fully understand ZFS's goal of being easy to use, but do you plan
to provide more flexibility in storage pool management? For example, if I
have 5 disks (3x40GB and 2x60GB), I'd like to create two stripes (RAID-0),
3x40GB and 2x60GB, so I can then mirror both stripes and get 120GB of
RAID-1-protected storage. It doesn't seem to be possible currently.

Thanks in advance.

PS. ...and let me repeat what you have already heard so many times: great work!

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
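To make the RAID-3 geometry described above concrete, here is a minimal
illustrative sketch in C (not the actual FreeBSD graid3 code; the layout and
helper names are assumed for the example): one 2 kB logical sector is split
into four 512-byte data sectors plus one XOR parity sector, so every read or
write touches all five components, and any single lost sector can be rebuilt
from the other four.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR      512
#define DATA_DISKS  4

/* Split one 2 kB logical sector into per-disk sectors and compute parity. */
static void
raid3_encode(const uint8_t *logical, uint8_t data[DATA_DISKS][SECTOR],
    uint8_t parity[SECTOR])
{
	memset(parity, 0, SECTOR);
	for (int d = 0; d < DATA_DISKS; d++) {
		memcpy(data[d], logical + d * SECTOR, SECTOR);
		for (int i = 0; i < SECTOR; i++)
			parity[i] ^= data[d][i];
	}
}

/* Reconstruct one failed disk's sector from the survivors plus parity. */
static void
raid3_rebuild(uint8_t data[DATA_DISKS][SECTOR], const uint8_t parity[SECTOR],
    int failed)
{
	memcpy(data[failed], parity, SECTOR);
	for (int d = 0; d < DATA_DISKS; d++)
		if (d != failed)
			for (int i = 0; i < SECTOR; i++)
				data[failed][i] ^= data[d][i];
}

int
main(void)
{
	uint8_t logical[SECTOR * DATA_DISKS];   /* one 2 kB "big sector" */
	uint8_t data[DATA_DISKS][SECTOR], parity[SECTOR];

	for (size_t i = 0; i < sizeof (logical); i++)
		logical[i] = (uint8_t)(i * 7 + 3);

	raid3_encode(logical, data, parity);    /* every write hits all 5 disks */

	memset(data[2], 0xff, SECTOR);          /* simulate losing disk 2 */
	raid3_rebuild(data, parity, 2);
	assert(memcmp(data[2], logical + 2 * SECTOR, SECTOR) == 0);
	return (0);
}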
> I read most of the ZFS documentation and Jeff's blog entry about RAID-Z
> and I fail to understand the difference between RAID-Z and RAID-3.

Ah. Let me try to clarify.

First, some background for folks who haven't written RAID software in the
last few weeks (and I'm told such people really do exist):

RAID-3 can be thought of as RAID-4 with two constraints and a twist:

(1) the number of disks must be exactly 2^n + 1
(2) the minimum transfer size is 2^n sectors
(3) data placement is row-major rather than column-major (see below)

The power-of-two blocksize constraint isn't imposed by RAID-3 proper, but
rather is necessary to make it useful to most filesystems.

By row-major vs. column-major I mean this: think of the disks as columns of
a matrix, such that (M, N) is the Mth sector of disk N. Disk I/O is always
column-major, by definition. The two most common options for data placement
are row-major and column-major.

For a 20-sector write at offset 40, this is row-major placement:

    40 41 42 43  P
    44 45 46 47  P
    48 49 50 51  P
    52 53 54 55  P
    56 57 58 59  P

And this is column-major placement:

    40 45 50 55  P
    41 46 51 56  P
    42 47 52 57  P
    43 48 53 58  P
    44 49 54 59  P

With row-major placement, you must either do a lot of scatter/gather setup
to get each sector to land where you want it, or you must interleave writes
and de-interleave reads by hand. With column-major placement, I/O is trivial
because the data is stored on disk in logical order.

RAID-2 and RAID-3 both do row-major data placement.

All other RAID schemes I'm aware of do column-major data placement.

One might ask: why on earth would anyone do row-major placement? Indeed, it
seems a bit crazy for disks because it's just extra work. But in some cases
the interleave may be a natural side-effect of the way the hardware works --
memory, for example. Parity memory is essentially a form of RAID-2, albeit
with different error semantics.

Now, to address the original question:

RAID-Z is like RAID-3 in that all writes are full-stripe writes. This
eliminates a number of thorny performance and correctness problems such as
the read-modify-write penalty and the RAID-5 write hole.

RAID-Z is like RAID-4 in that it supports any number of disks, it's
sector-addressable, and it does column-major data placement.

RAID-Z is like RAID-5 in that it doesn't have a dedicated parity disk; the
parity is distributed across all disks to maximize bandwidth.

And RAID-Z is unique in that it can detect and correct silent data
corruption, as described at
http://blogs.sun.com/roller/page/bonwick?entry=raid_z.

> PS. ...and let me repeat what you have already heard so many times: great work!

Thank you! Much appreciated.
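To make the row-major vs. column-major distinction concrete, here is a small
stand-alone C sketch (illustrative only, not code from ZFS or any RAID
driver) that reproduces the two matrices above for a 20-sector write at
offset 40 across 4 data columns; the parity column is left out.

#include <stdio.h>

#define NDATA 4

static void
place(int off, int len, int row_major)
{
	int rows = (len + NDATA - 1) / NDATA;   /* rows touched by this write */

	for (int i = 0; i < len; i++) {
		int row, col;
		if (row_major) {        /* consecutive sectors go across a row */
			row = i / NDATA;
			col = i % NDATA;
		} else {                /* consecutive sectors go down a column */
			row = i % rows;
			col = i / rows;
		}
		printf("logical sector %2d -> row %d, disk %d\n", off + i, row, col);
	}
}

int
main(void)
{
	printf("row-major placement (RAID-2/RAID-3 style):\n");
	place(40, 20, 1);
	printf("\ncolumn-major placement (everything else):\n");
	place(40, 20, 0);
	return (0);
}

With column-major placement the 20 logical sectors map straight down each
disk in order, which is why no scatter/gather or interleaving work is needed.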
> Does RAID-Z depend on ZFS checksums?
> Is it still reliable if checksumming is turned off?

The usual RAID capabilities (handling explicit I/O errors and/or whole-disk
failure) do not rely on checksums.

The self-healing data feature (detecting and correcting silent data
corruption) does, of course, require checksums.
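A conceptual sketch of that distinction, in C. The helper functions
(disk_read, reconstruct_from_parity, and so on) are hypothetical, not the
ZFS API; the point is only that an explicit I/O error is visible to any RAID,
while a read that "succeeds" with wrong contents can only be caught when a
separately stored checksum is available.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical primitives assumed for the sketch. */
bool     disk_read(int disk, uint64_t lba, void *buf, size_t len); /* false on I/O error */
void     disk_write(int disk, uint64_t lba, const void *buf, size_t len);
void     reconstruct_from_parity(int bad_disk, uint64_t lba, void *buf, size_t len);
uint64_t checksum(const void *buf, size_t len);

bool
self_healing_read(int disk, uint64_t lba, void *buf, size_t len,
    uint64_t expected_cksum)
{
	/* Explicit I/O error: any RAID can recover this, no checksum needed. */
	if (!disk_read(disk, lba, buf, len)) {
		reconstruct_from_parity(disk, lba, buf, len);
		disk_write(disk, lba, buf, len);        /* repair the bad copy */
		return (true);
	}

	/*
	 * Silent corruption: the read succeeded but the data is wrong.
	 * Only a checksum stored elsewhere can detect this case.
	 */
	if (checksum(buf, len) != expected_cksum) {
		reconstruct_from_parity(disk, lba, buf, len);
		if (checksum(buf, len) != expected_cksum)
			return (false);                 /* unrecoverable */
		disk_write(disk, lba, buf, len);        /* heal the bad copy */
	}
	return (true);
}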
On Mon, Nov 21, 2005 at 12:41:50AM -0800, Jeff Bonwick wrote:
> > Does RAID-Z depend on ZFS checksums?
> > Is it still reliable if checksumming is turned off?
>
> The usual RAID capabilities (handling explicit I/O errors and/or
> whole-disk failure) do not rely on checksums.
>
> The self-healing data feature (detecting and correcting silent data
> corruption) does, of course, require checksums.

I was pondering this particular situation:

ZFS receives a write request and sends it to three RAID-Z components. The
request is delivered to two of them and you lose power before it is
delivered to the last component (you know, ZFS likes cheap disks and one of
them was slower :)).

How is this issue detected after the reboot when one doesn't use checksums?
Maybe this is one of the reasons checksums are there?

This issue is ignored by most software RAID implementations, but it is very
real (rare, but real).

Thanks.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Pawel Jakub Dawidek wrote:

> ZFS receives a write request and sends it to three RAID-Z components. The
> request is delivered to two of them and you lose power before it is
> delivered to the last component (you know, ZFS likes cheap disks and one
> of them was slower :)).
>
> How is this issue detected after the reboot when one doesn't use checksums?
> Maybe this is one of the reasons checksums are there?

ZFS treats disk updates like a transaction, so an interrupted write
operation will not affect any of the reachable on-disk data. When the system
reboots, the blocks that were being written at the time of the power failure
will not be reachable from the filesystem metadata, so the on-disk state of
the filesystem will be fine.

And because ZFS knows more than an LVM about what space is free and what is
in use, it can always do full-stripe writes using a variable stripe size, so
it doesn't matter what's already on the disk. This also means that you don't
have to wait to zero out the components of a RAID-Z vdev when you add it to
a pool.

> This issue is ignored by most software RAID implementations, but it is
> very real (rare, but real).

Hardware RAID isn't enough by itself either. I was asked to look at a box
running RHEL which was acting a bit "funny". I found a whole bunch of kernel
messages going back a few months about inconsistent ext2 cylinder group free
block bitmaps. Yay, a whole bunch of files were sharing storage with each
other, and it was impossible to tell how many others had been corrupted
before that point. I had to verify 500GB of gzip'd bioinformatics data and
re-mirror many tens of gigabytes of compressed data that was unverifiable
(there's no checksum in .Z files). I really wanted ZFS that day...

-Jason
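A rough sketch of that transactional, copy-on-write idea (made-up structures
and helpers, not ZFS code): new blocks land only in previously free space,
and the very last step is rewriting the root pointer, so a crash at any
earlier point leaves the old, consistent state intact and nothing needs to
be resynchronized.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct root_block {
	uint64_t txg;        /* transaction group number */
	uint64_t tree_blkno; /* points at a fully written block tree */
	uint64_t cksum;      /* checksum over this root block */
};

/* Hypothetical helpers assumed for the sketch. */
uint64_t alloc_free_block(void);
void     write_block(uint64_t blkno, const void *data, size_t len);
void     write_root_copies(const struct root_block *rb);  /* multiple copies */
uint64_t checksum_of(const void *data, size_t len);

void
commit_txg(struct root_block *rb, const void *new_tree, size_t len)
{
	/* 1. Write the new data and metadata into previously free space. */
	uint64_t blkno = alloc_free_block();
	write_block(blkno, new_tree, len);

	/*
	 * 2. Only then point the root at it, in one self-checksummed update.
	 *    A crash anywhere before this leaves the old state untouched.
	 */
	rb->txg++;
	rb->tree_blkno = blkno;
	rb->cksum = checksum_of(rb, sizeof (*rb) - sizeof (rb->cksum));
	write_root_copies(rb);
}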
Pawel Jakub Dawidek wrote on 12/06/05 09:27:

> On Mon, Nov 21, 2005 at 12:41:50AM -0800, Jeff Bonwick wrote:
> > > Does RAID-Z depend on ZFS checksums?
> > > Is it still reliable if checksumming is turned off?
> >
> > The usual RAID capabilities (handling explicit I/O errors and/or
> > whole-disk failure) do not rely on checksums.
> >
> > The self-healing data feature (detecting and correcting silent data
> > corruption) does, of course, require checksums.
>
> I was pondering this particular situation:
>
> ZFS receives a write request and sends it to three RAID-Z components. The
> request is delivered to two of them and you lose power before it is
> delivered to the last component (you know, ZFS likes cheap disks and one
> of them was slower :)).
>
> How is this issue detected after the reboot when one doesn't use checksums?
> Maybe this is one of the reasons checksums are there?

This matters for filesystems like UFS: whenever the system crashes, some
volume managers or RAID arrays resync everything every time, while most of
them log the "dirty blocks region" to NVRAM (or even synchronously to a
region of disk) and thus resync only those dirty blocks.

In the case of ZFS it may not matter, since ZFS always writes to new
locations and a transaction is not committed until the root block is
written, and there are multiple copies of it.

Srinivas.

> This issue is ignored by most software RAID implementations, but it is
> very real (rare, but real).
>
> Thanks.
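For contrast, here is a sketch of the dirty-region-log approach mentioned
above, as a traditional volume manager might implement it (the interfaces
are hypothetical and heavily simplified): a region is marked dirty in NVRAM
before its data is written and cleared afterwards, so after a crash only the
still-dirty regions need their mirror or parity resynchronized. ZFS can skip
this machinery entirely because its on-disk state is always consistent.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_SHIFT 20          /* 1 MB regions */

/* Hypothetical primitives. */
void nvram_set_bit(uint64_t bit);    /* persisted before returning */
void nvram_clear_bit(uint64_t bit);
bool nvram_test_bit(uint64_t bit);
void raid_write(uint64_t offset, const void *buf, size_t len);
void raid_resync_region(uint64_t region);

void
logged_write(uint64_t offset, const void *buf, size_t len)
{
	uint64_t first = offset >> REGION_SHIFT;
	uint64_t last = (offset + len - 1) >> REGION_SHIFT;

	for (uint64_t r = first; r <= last; r++)
		nvram_set_bit(r);        /* must be durable before the write */
	raid_write(offset, buf, len);
	for (uint64_t r = first; r <= last; r++)
		nvram_clear_bit(r);
}

void
recover_after_crash(uint64_t nregions)
{
	/* Only regions that were mid-write need parity/mirror resync. */
	for (uint64_t r = 0; r < nregions; r++)
		if (nvram_test_bit(r))
			raid_resync_region(r);
}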