Hi.

I've a prototype RAID5 implementation for ZFS. It only works in
non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
performance, as I suspected that RAIDZ, because of full-stripe
operations, doesn't work well for random reads issued by many processes
in parallel.

There is of course the write-hole problem, which can be mitigated by
running a scrub after a power failure or system crash. Another idea is
to store dirty regions somewhere and regenerate parity only for those
regions after reboot, but I'm not yet sure how to make it efficient -
adding two write-cache-flush requests to each write request isn't a
good idea. Anyway, people live with the RAID5 write hole today, so why
not give them the option?

My test environment was 5 SATA disks and ZFS on FreeBSD. For testing I
used raidtest, a tool I wrote some time ago. It starts a given number
of processes that issue a given number of random I/O requests. I was
using 8 processes; I/O size was a random value between 2kB and 32kB
(with a 2kB step), and the offset was a random value between 0 and 10GB
(also with a 2kB step). I was testing a ZVOL created on top of my pool.
I first ran raidtest in write-only mode, so that ZFS allocates blocks,
then exported the pool and imported it again (to flush the cache), and
started a read-only test with the same I/O requests.

Pool configuration for RAIDZ:

	  pool: tank
	 state: ONLINE
	 scrub: none requested
	config:

		NAME        STATE     READ WRITE CKSUM
		tank        ONLINE       0     0     0
		  raidz1    ONLINE       0     0     0
		    ad1     ONLINE       0     0     0
		    ad4     ONLINE       0     0     0
		    ad5     ONLINE       0     0     0
		    ad6     ONLINE       0     0     0
		    ad7     ONLINE       0     0     0

	errors: No known data errors

Pool configuration for RAID5:

	  pool: tank
	 state: ONLINE
	 scrub: none requested
	config:

		NAME        STATE     READ WRITE CKSUM
		tank        ONLINE       0     0     0
		  raid5     ONLINE       0     0     0
		    ad1     ONLINE       0     0     0
		    ad4     ONLINE       0     0     0
		    ad5     ONLINE       0     0     0
		    ad6     ONLINE       0     0     0
		    ad7     ONLINE       0     0     0

And here are the results:

RAIDZ:

	Number of READ requests:      40000.
	Number of WRITE requests:     0.
	Number of bytes to transmit:  695678976.
	Number of processes:          8.
	Bytes per second:             1305213
	Requests per second:          75

RAID5:

	Number of READ requests:      40000.
	Number of WRITE requests:     0.
	Number of bytes to transmit:  695678976.
	Number of processes:          8.
	Bytes per second:             2749719
	Requests per second:          158

In other words, in this particular test RAID5 is 2.1 times faster than
RAIDZ. I would expect even better results for bigger pools (with more
disks in one RAIDZ vdev).

My question is: Is there any interest in finishing RAID5/RAID6 for ZFS?
If there is no chance it will be integrated into ZFS at some point, I
won't bother finishing it.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
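For the curious, here is a minimal sketch of the workload described
above - this is NOT the actual raidtest source, just an illustration
of the stated parameters (8 processes, 40000 requests total, sizes
2kB..32kB on 2kB boundaries, offsets within 10GB); the constants and
the read-only mode are simplifications:

	/*
	 * Sketch of the benchmark workload, not raidtest itself.
	 * Each of 8 forked processes issues 5000 random pread(2)
	 * calls: size 2kB..32kB in 2kB steps, offset 0..10GB
	 * aligned to 2kB.
	 */
	#include <sys/wait.h>

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define	NPROC	8
	#define	NREQS	5000
	#define	STEP	2048ULL			/* 2kB step */
	#define	MAXSIZE	(32ULL * 1024)		/* largest request */
	#define	SPAN	(10ULL << 30)		/* 10GB offset range */

	int
	main(int argc, char **argv)
	{
		if (argc != 2) {
			fprintf(stderr, "usage: %s <device>\n", argv[0]);
			return (1);
		}
		for (int i = 0; i < NPROC; i++) {
			if (fork() == 0) {
				int fd = open(argv[1], O_RDONLY);
				char *buf = malloc(MAXSIZE);

				if (fd == -1 || buf == NULL)
					_exit(1);
				srandom(getpid());
				for (int j = 0; j < NREQS; j++) {
					/* 2kB..32kB in 2kB steps */
					size_t size = STEP *
					    (1 + random() % (MAXSIZE / STEP));
					/* 0..10GB in 2kB steps */
					off_t off = STEP *
					    (random() % (SPAN / STEP));
					if (pread(fd, buf, size, off) == -1)
						_exit(1);
				}
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return (0);
	}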
Hello Pawel,

Excellent job!

Now I guess it would be a good idea to get writes done properly, even
if it means making them slow (like with SVM). The end result would be:
if you want fast writes/slow reads, go ahead with raid-z; if you need
fast reads/slow writes, go with raid-5.

btw: I'm just thinking out loud - for raid-5 writes, couldn't you
somehow utilize the ZIL to make writes safe? I'm asking because we've
got the ability to put the zil somewhere else, like an NVRAM card...

--
Best regards,
 Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                      http://milek.blogspot.com
> Now I guess it would be a good idea to get writes done properly, even
> if it means making them slow (like with SVM). The end result would
> be: if you want fast writes/slow reads, go ahead with raid-z; if you
> need fast reads/slow writes, go with raid-5.
>
> btw: I'm just thinking out loud - for raid-5 writes, couldn't you
> somehow utilize the ZIL to make writes safe? I'm asking because we've
> got the ability to put the zil somewhere else, like an NVRAM card...

But the safety of raidz (and the overall on-disk consistency of the
pool) does not currently depend on the ZIL. It instead depends on the
fact that blocks are never modified in place, but written first, then
activated atomically.

So I guess this depends on how the R5 is implemented in ZFS. As long as
all writes cause a new block to be written (which has a full R5
stripe?), then the activation will be atomic and there is no write
hole. The only problem comes if existing blocks were modified (and that
would cause problems with snapshots anyway, right?)

--
Darren Dunham                                      ddunham at taos.com
Senior Technical Consultant       TAOS         http://www.taos.com/
Got some Dr Pepper?                      San Francisco, CA bay area
        < This line left intentionally blank to confuse you. >
On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote:
> Hello Pawel,
>
> Excellent job!
>
> Now I guess it would be a good idea to get writes done properly, even
> if it means making them slow (like with SVM). The end result would
> be: if you want fast writes/slow reads, go ahead with raid-z; if you
> need fast reads/slow writes, go with raid-5.

Writes in non-degraded mode already work; only degraded mode doesn't
work yet. My implementation is based on RAIDZ, so I'm planning to
support RAID6 as well.

> btw: I'm just thinking out loud - for raid-5 writes, couldn't you
> somehow utilize the ZIL to make writes safe? I'm asking because we've
> got the ability to put the zil somewhere else, like an NVRAM card...

The problem with RAID5 is that different blocks share the same parity,
which is not the case for RAIDZ. When you write a block in RAIDZ, you
write the data and the parity, and then you switch the pointer in the
uberblock. For RAID5, you write the data and you need to update the
parity, which also protects some other data. Now if you write the data
but don't update the parity before a crash, you have a hole. If you
update the parity before the write and then crash, the parity is
inconsistent with the other blocks in the same stripe.

My idea was to have one sector every 1GB on each disk for a "journal"
to keep a list of blocks being updated. For example, you want to write
2kB of data at offset 1MB. You first store offset+size in this journal,
then write the data and update the parity, and then remove offset+size
from the journal. Unfortunately, we would need to flush the write cache
twice: after the offset+size addition and before the offset+size
removal. We could optimize it by doing lazy removal, e.g. wait for ZFS
to flush the write cache as part of a transaction and then remove the
old offset+size pairs. But I still expect this to give too much
overhead.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
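To make the journal idea a bit more concrete, here is a sketch of what
such a per-1GB journal sector might look like - the names, the exact
layout, and the entry count are hypothetical, not taken from the
prototype:

	/*
	 * Hypothetical on-disk layout for the dirty-region journal
	 * sketched above; illustration only, not prototype code.
	 */
	#include <stdint.h>

	#define	JOURNAL_SPACING	 (1ULL << 30)	/* one sector per 1GB */
	#define	JOURNAL_NENTRIES 31		/* (512 - 16) / 16 */

	struct journal_entry {
		uint64_t	je_offset;	/* start of in-flight write */
		uint64_t	je_size;	/* length of in-flight write */
	};

	struct journal_sector {
		uint64_t	js_magic;	/* identifies a valid journal */
		uint64_t	js_gen;		/* bumped on every update */
		struct journal_entry js_entries[JOURNAL_NENTRIES];
	};

	/*
	 * Write path, per data write:
	 *
	 *   1. append {offset, size} to the region's journal sector
	 *   2. flush the write cache                    <- 1st flush
	 *   3. write the data and the updated parity
	 *   4. clear the entry and flush again          <- 2nd flush
	 *
	 * The lazy-removal optimization drops the flush in step 4 by
	 * clearing stale entries the next time ZFS flushes the cache
	 * for a transaction group anyway.  After a crash, only regions
	 * with non-empty journal sectors need their parity rebuilt.
	 */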
Hello Pawel,

Monday, September 10, 2007, 6:18:37 PM, you wrote:

PJD> Writes in non-degraded mode already work; only degraded mode
PJD> doesn't work yet. My implementation is based on RAIDZ, so I'm
PJD> planning to support RAID6 as well.

PJD> The problem with RAID5 is that different blocks share the same
PJD> parity, which is not the case for RAIDZ. When you write a block
PJD> in RAIDZ, you write the data and the parity, and then you switch
PJD> the pointer in the uberblock. For RAID5, you write the data and
PJD> you need to update the parity, which also protects some other
PJD> data. Now if you write the data but don't update the parity
PJD> before a crash, you have a hole. If you update the parity before
PJD> the write and then crash, the parity is inconsistent with the
PJD> other blocks in the same stripe.

Are you overwriting old data? I hope you're not...

I don't think you should suffer from the above problem in ZFS, due to
COW. If you are not overwriting and you're just writing to new
locations from the pool's perspective, those changes (both the new
data block and the parity block) won't be active until they are both
flushed and the uberblock is updated... right?

--
Best regards,
 Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                      http://milek.blogspot.com
On Tue, Sep 11, 2007 at 08:16:02AM +0100, Robert Milkowski wrote:
> Are you overwriting old data? I hope you're not...

I am - I overwrite the parity; this is the whole point. That's why the
ZFS designers used RAIDZ instead of RAID5, I think.

> I don't think you should suffer from the above problem in ZFS, due to
> COW.

I do, because independent blocks share the same parity block.

> If you are not overwriting and you're just writing to new locations
> from the pool's perspective, those changes (both the new data block
> and the parity block) won't be active until they are both flushed and
> the uberblock is updated... right?

Assume a 128kB stripe size in RAID5. You have three disks: A, B and C.
ZFS writes 128kB at offset 0. This makes RAID5 write the data to disk A
and the parity to disk C (both at offset 0). Then ZFS writes 128kB at
offset 128kB. RAID5 writes the data to disk B (at offset 0) and updates
the parity on disk C (also at offset 0).

As you can see, two independent ZFS blocks share one parity block. COW
won't help you here; you would need to be sure that each ZFS
transaction goes to a different (and free) RAID5 row.

This is, I believe, the main reason why plain RAID5 wasn't used in the
first place.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
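A tiny sketch of the arithmetic in that example - assuming three disks,
a 128kB stripe unit, and non-rotated parity on disk C, purely for
illustration:

	/*
	 * With 3 disks, a 128kB stripe unit and parity pinned to
	 * disk C, offsets 0 and 128kB map to different data disks
	 * but to the *same* parity sector on disk C.
	 */
	#include <stdint.h>
	#include <stdio.h>

	#define	NDISKS	3		/* A, B data; C parity */
	#define	NDATA	(NDISKS - 1)
	#define	STRIPE	(128 * 1024)

	int
	main(void)
	{
		uint64_t offsets[] = { 0, 128 * 1024 };

		for (int i = 0; i < 2; i++) {
			uint64_t off = offsets[i];
			uint64_t chunk = off / STRIPE;	/* logical chunk */
			int data_disk = chunk % NDATA;	/* A=0, B=1 */
			uint64_t row = chunk / NDATA;	/* row on disk */

			printf("offset %7ju: data on disk %c, "
			    "parity on disk C, row %ju\n",
			    (uintmax_t)off, 'A' + data_disk,
			    (uintmax_t)row);
		}
		return (0);
	}

Both offsets land in row 0, so both writes must update the very same
parity sector - which is exactly the in-place overwrite COW cannot
cover.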
> As you can see, two independent ZFS blocks share one parity block.
> COW won't help you here; you would need to be sure that each ZFS
> transaction goes to a different (and free) RAID5 row.
>
> This is, I believe, the main reason why plain RAID5 wasn't used in
> the first place.

Exactly right. RAID-Z has different performance trade-offs than
RAID-5, but the deciding factor was correctness.

I'm really glad you're doing these experiments! It's good to know what
the trade-offs are, performance-wise, between RAID-Z and classic
RAID-5. At a minimum, it tells us what's on the table, and what we're
paying for transactional semantics.

To be honest, I'm pleased that it's only 2x. It wouldn't have
surprised me if it were Nx for an N+1 configuration. A factor of 2 is
something we can work with.

Jeff
> My question is: Is there any interest in finishing RAID5/RAID6 for
> ZFS? If there is no chance it will be integrated into ZFS at some
> point, I won't bother finishing it.

Your work is as pure an example as any of what OpenSolaris should be
about. I think there should be no problem having a new feature like
that integrated!... as long as it is the level of quality that the
community wants.
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> And here are the results:
>
> RAIDZ:
>
> 	Number of READ requests:      40000.
> 	Number of WRITE requests:     0.
> 	Number of bytes to transmit:  695678976.
> 	Number of processes:          8.
> 	Bytes per second:             1305213
> 	Requests per second:          75
>
> RAID5:
>
> 	Number of READ requests:      40000.
> 	Number of WRITE requests:     0.
> 	Number of bytes to transmit:  695678976.
> 	Number of processes:          8.
> 	Bytes per second:             2749719
> 	Requests per second:          158

I'm a bit surprised by these results. Assuming relatively large blocks
were written, RAID-Z and RAID-5 should be laid out on disk very
similarly, resulting in similar read performance.

Did you compare the I/O characteristics of both? Was the bottleneck in
the software or the hardware?

Very interesting experiment...

Adam

--
Adam Leventhal, FishWorks                  http://blogs.sun.com/ahl
On 9/10/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> Hi.
>
> I've a prototype RAID5 implementation for ZFS. It only works in
> non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
> performance, as I suspected that RAIDZ, because of full-stripe
> operations, doesn't work well for random reads issued by many
> processes in parallel.
>
> There is of course the write-hole problem, which can be mitigated by
> running a scrub after a power failure or system crash.

If I read your suggestion correctly, your implementation is much more
like traditional raid-5, with a read-modify-write cycle?

My understanding of the raid-z performance issue is that it requires
full-stripe reads in order to validate the checksum. So to get better
random read performance, why not simply have a separate checksum for
each chunk in the stripe? You still eliminate the raid-5 write hole
(albeit at some loss in performance, because you have to compute and
write extra checksums) but you allow multiple independent reads.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> I'm a bit surprised by these results. Assuming relatively large
> blocks were written, RAID-Z and RAID-5 should be laid out on disk
> very similarly, resulting in similar read performance.

Hmm, no. The data was organized very differently on the disks. The
smallest block size used was 2kB, to ensure each block is written to
all disks in the RAIDZ configuration. In the RAID5 configuration,
however, a 128kB stripe size was used, which means each block was
stored on one disk only.

Now when you read the data, RAIDZ needs to read all disks for each
block, and RAID5 needs to read only one disk for each block.

> Did you compare the I/O characteristics of both? Was the bottleneck
> in the software or the hardware?

The bottleneck was definitely the disks. The CPU was something like
96% idle. To be honest, I expected, just like Jeff, a much bigger win
in the RAID5 case.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> I'm a bit surprised by these results. Assuming relatively large
> blocks were written, RAID-Z and RAID-5 should be laid out on disk
> very similarly, resulting in similar read performance.
>
> Did you compare the I/O characteristics of both? Was the bottleneck
> in the software or the hardware?

Note that Pawel wrote:

Pawel> I was using 8 processes; I/O size was a random value between
Pawel> 2kB and 32kB (with a 2kB step), and the offset was a random
Pawel> value between 0 and 10GB (also with a 2kB step).

If the dataset's record size was the default (Pawel didn't say,
right?) then the reason for the lousy read performance is clear:
RAID-Z has to read full blocks to verify the checksum, whereas RAID-5
need only read as much as is requested (assuming aligned reads, which
Pawel did seem to indicate: "2kB steps").

Peter Tribble pointed out much the same thing already.

The crucial requirement is to match the dataset record size to the
I/O size done by the application. If the app writes in bigger chunks
than it reads and you want to optimize for write performance, then
set the record size to match the write size; otherwise set the record
size to match the read size.

Where the dataset record size is not matched to the application's I/O
size, I guess we could say that RAID-Z trades off the RAID-5 write
hole for a read hole.

Nico
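A back-of-the-envelope sketch of that read amplification, assuming the
default 128kB recordsize (which, as noted above, Pawel didn't confirm)
and the benchmark's uniformly distributed 2-32kB requests:

	/*
	 * With a 128kB record and 2..32kB application reads, RAID-Z
	 * must read the whole record to verify the checksum, while
	 * RAID-5 reads only the requested range.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		const double recordsize = 128 * 1024;
		/* mean of 2,4,...,32kB uniform request sizes */
		const double avg_req = (2 + 32) / 2.0 * 1024;

		printf("bytes read per request, RAID-Z: %.0f\n",
		    recordsize);
		printf("bytes read per request, RAID-5: %.0f\n",
		    avg_req);
		printf("read amplification:             %.1fx\n",
		    recordsize / avg_req);
		return (0);
	}

Under these assumptions RAID-Z reads roughly 7.5x as many bytes per
request, though seeks rather than bandwidth dominate here, which may
be why the measured gap is only about 2x.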
On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> On 9/10/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> > I've a prototype RAID5 implementation for ZFS. It only works in
> > non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
> > performance, as I suspected that RAIDZ, because of full-stripe
> > operations, doesn't work well for random reads issued by many
> > processes in parallel.
> >
> > There is of course the write-hole problem, which can be mitigated
> > by running a scrub after a power failure or system crash.
>
> If I read your suggestion correctly, your implementation is much
> more like traditional raid-5, with a read-modify-write cycle?
>
> My understanding of the raid-z performance issue is that it requires
> full-stripe reads in order to validate the checksum. [...]

No, the checksum is an independent thing, and this is not the reason
why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
doesn't read the parity.

This is how RAIDZ fills the disks (follow the numbers):

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P3
	D4	D5	D6	P7
	D8	D9	D10	P11
	D12	D13	D14	P15
	D16	D17	D18	P19
	D20	D21	D22	P23

D is data, P is parity.

And RAID5 does this:

	Disk0	Disk1	Disk2	Disk3

	D0	D3	D6	P0,3,6
	D1	D4	D7	P1,4,7
	D2	D5	D8	P2,5,8
	D9	D12	D15	P9,12,15
	D10	D13	D16	P10,13,16
	D11	D14	D17	P11,14,17

As you can see, even a small block is stored on all disks in RAIDZ,
where on RAID5 a small block can be stored on one disk only.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
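To put the two diagrams in code form, a small sketch (a hypothetical
helper, assuming 512-byte sectors, 4 disks, and a 128kB RAID5 stripe
unit as in the earlier example) of how many disks one read touches in
each layout:

	#include <stdio.h>

	#define	NDISKS	4
	#define	NDATA	(NDISKS - 1)
	#define	SECTOR	512
	#define	SUNIT	(128 * 1024)	/* RAID-5 stripe unit */

	static int
	raidz_read_disks(size_t size)
	{
		/* RAID-Z spreads every block across the data columns. */
		size_t sectors = (size + SECTOR - 1) / SECTOR;
		return (sectors < NDATA ? (int)sectors : NDATA);
	}

	static int
	raid5_read_disks(size_t off, size_t size)
	{
		/* RAID-5 touches only the stripe units the range overlaps. */
		size_t first = off / SUNIT;
		size_t last = (off + size - 1) / SUNIT;
		int n = (int)(last - first + 1);
		return (n < NDATA ? n : NDATA);
	}

	int
	main(void)
	{
		for (size_t size = 2048; size <= 32768; size += 2048)
			printf("%5zu bytes: raidz reads %d disks, "
			    "raid5 reads %d disk(s)\n", size,
			    raidz_read_disks(size),
			    raid5_read_disks(0, size));
		return (0);
	}

Every 2-32kB request keeps all RAID-Z data columns busy, while the
same request lands on a single RAID-5 disk, leaving the other spindles
free to serve other processes.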
On Thu, Sep 13, 2007 at 12:56:44AM +0200, Pawel Jakub Dawidek wrote:
> On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> > My understanding of the raid-z performance issue is that it
> > requires full-stripe reads in order to validate the checksum. [...]
>
> No, the checksum is an independent thing, and this is not the reason
> why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
> doesn't read the parity.

I doubt reading the parity could cost all that much (particularly if
there's enough I/O capacity). The cost is that you have to read the
full 128KB, if a file's record size is 128KB, in order to satisfy a
2KB read. And ZFS has to read full blocks in order to verify the
checksum.

Nico
On Thu, 13 Sep 2007, Pawel Jakub Dawidek wrote:
> This is how RAIDZ fills the disks (follow the numbers):
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D1	D2	P3
> 	D4	D5	D6	P7
> 	D8	D9	D10	P11
> 	D12	D13	D14	P15
> 	D16	D17	D18	P19
> 	D20	D21	D22	P23
>
> D is data, P is parity.
>
> And RAID5 does this:
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D3	D6	P0,3,6
> 	D1	D4	D7	P1,4,7
> 	D2	D5	D8	P2,5,8
> 	D9	D12	D15	P9,12,15
> 	D10	D13	D16	P10,13,16
> 	D11	D14	D17	P11,14,17

Surely the above is not accurate? You're showing the parity data only
being written to Disk3. In RAID5 the parity is distributed across all
disks in the RAID5 set. What is illustrated above is RAID3.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
On Wed, Sep 12, 2007 at 07:39:56PM -0500, Al Hopper wrote:
> Surely the above is not accurate? You're showing the parity data only
> being written to Disk3. In RAID5 the parity is distributed across all
> disks in the RAID5 set. What is illustrated above is RAID3.

It's actually RAID4 (RAID3 would look the same as RAIDZ, but there are
differences in practice), but my point wasn't how the parity is
distributed :)

Ok, RAID5 once again:

	Disk0	Disk1	Disk2		Disk3

	D0	D3	D6		P0,3,6
	D1	D4	D7		P1,4,7
	D2	D5	D8		P2,5,8
	D9	D12	P9,12,15	D15
	D10	D13	P10,13,16	D16
	D11	D14	P11,14,17	D17

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
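The rotation in the corrected table follows a simple pattern; one
common way to express it is the sketch below (a left-style RAID5
rotation with three rows per stripe unit, matching the table above -
not necessarily the exact rotation the prototype uses):

	#include <stdio.h>

	#define	NDISKS		4
	#define	ROWS_PER_UNIT	3	/* rows per stripe unit above */

	int
	main(void)
	{
		for (int row = 0; row < 12; row++) {
			/* parity moves left one disk per stripe unit */
			int pdisk = (NDISKS - 1) -
			    (row / ROWS_PER_UNIT) % NDISKS;
			printf("row %2d: parity on Disk%d\n", row, pdisk);
		}
		return (0);
	}

Rows 0-2 put parity on Disk3 and rows 3-5 on Disk2, as in the table;
the pattern then continues through Disk1 and Disk0 before wrapping.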
Pawel Jakub Dawidek <pjd <at> FreeBSD.org> writes:
>
> This is how RAIDZ fills the disks (follow the numbers):
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D1	D2	P3
> 	D4	D5	D6	P7
> 	D8	D9	D10	P11
> 	D12	D13	D14	P15
> 	D16	D17	D18	P19
> 	D20	D21	D22	P23
>
> D is data, P is parity.

This layout assumes, of course, that large stripes have been written
to the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
possible for a single logical block to span only 2 disks (for those
who don't know what I am talking about, see the "red" block occupying
LBAs D3 and E3 on page 13 of these ZFS slides [1]).

To read this logical block (and validate its checksum), only D_0 needs
to be read (LBA E3). So in this very specific case, a RAIDZ read
operation is as cheap as a RAID5 read operation. The existence of
these small stripes could explain why RAIDZ doesn't fall as far behind
RAID5 in Pawel's benchmark as one might expect...

[1] http://br.sun.com/sunnews/events/2007/techdaysbrazil/pdf/eric_zfs.pdf

-marc
On Thu, Sep 13, 2007 at 04:58:10AM +0000, Marc Bevand wrote:
> This layout assumes, of course, that large stripes have been written
> to the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
> possible for a single logical block to span only 2 disks (for those
> who don't know what I am talking about, see the "red" block occupying
> LBAs D3 and E3 on page 13 of these ZFS slides [1]).

Yes, I'm aware of that.

> To read this logical block (and validate its checksum), only D_0
> needs to be read (LBA E3). So in this very specific case, a RAIDZ
> read operation is as cheap as a RAID5 read operation. [...]

If you do single-sector writes - yes, but this is very inefficient,
for two reasons:

1. Bandwidth - writing one sector at a time? Come on.

2. Space - when you write one sector and its parity, you consume two
   sectors. You may have more than one parity column in that case,
   e.g.:

	Disk0	Disk1	Disk2	Disk3	Disk4	Disk5

	D0	P0	D1	P1	D2	P2

   In this case the space overhead is the same as in a mirror.

> [...] The existence of these small stripes could explain why RAIDZ
> doesn't fall as far behind RAID5 in Pawel's benchmark as one might
> expect...

No. As I said, the smallest block I used was 2kB, which means four
512-byte data sectors plus one 512-byte parity sector - each 2kB block
uses all 5 disks.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On 9/12/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> Hmm, no. The data was organized very differently on the disks. The
> smallest block size used was 2kB, to ensure each block is written to
> all disks in the RAIDZ configuration. In the RAID5 configuration,
> however, a 128kB stripe size was used, which means each block was
> stored on one disk only.
>
> Now when you read the data, RAIDZ needs to read all disks for each
> block, and RAID5 needs to read only one disk for each block.
>
> The bottleneck was definitely the disks. The CPU was something like
> 96% idle. To be honest, I expected, just like Jeff, a much bigger win
> in the RAID5 case.

Well, it depends. In both configurations the available read bandwidth
is the same. Presumably you're expecting each disk to seek
independently and concurrently. Is the spa aware that multiple,
offset-dependent reads can be issued concurrently to the RAID-5 vdev?

James
On 9/10/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> The problem with RAID5 is that different blocks share the same
> parity, which is not the case for RAIDZ. When you write a block in
> RAIDZ, you write the data and the parity, and then you switch the
> pointer in the uberblock. For RAID5, you write the data and you need
> to update the parity, which also protects some other data. Now if
> you write the data but don't update the parity before a crash, you
> have a hole. If you update the parity before the write and then
> crash, the parity is inconsistent with the other blocks in the same
> stripe.

This is why you should consider "old" data and parity as being "live".
The old data (being overwritten) is live, as it is needed for the
parity to be consistent - and the old parity is live because it
protects the other blocks.

What IMO should be done is object-level raid: write new parity and new
data into blocks not yet used - and as the new parity also protects
the "neighbouring" data, the old parity can be freed, and after it is
no longer live the "overwritten" data block can also be freed.

Note that this is very different from traditional raid5, as it
requires intimate knowledge of the FS structure. Traditional raids
also keep parity "in line" with the data blocks it protects - but that
is not necessary if the FS can store information about where the
parity is located.

Define "live data" well enough and you're safe if you never overwrite
any of it.

> My idea was to have one sector every 1GB on each disk for a "journal"
> to keep a list of blocks being updated.

This would be called a "write intent log" or "bitmap" (as in Linux
software raid). It speeds up recovery, but doesn't protect against
write-hole problems.
Here is a different twist on your interesting scheme.

First start with writing 3 blocks and parity in a full stripe:

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P0,1,2

Next the application modifies D0 -> D0' and also writes other data D3,
D4. Now you have:

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P0,1,2
	D0'	D3	D4	P0',3,4

So file updates combine with new data into new full stripes. This is
the trivial part. Now the hard part: we have to deal with D0. D0 is
free of data content (superseded by D0'). However, it holds parity
information protecting the live data D1, D2.

If the workload updates the data in D1 and D2, the full stripe becomes
free (this is the easy part). But if D1 and D2 stay immutable for a
long time, then we can run out of pool blocks with D0 held down in a
half-freed state. So as we near full pool capacity, a scrubber would
have to walk the stripes and look for partially freed ones. Then it
would need to do a scrubbing "read/write" on D1, D2 so that they
become part of a new stripe with some other data, freeing the full
initial stripe.

-r
On 9/20/07, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
> Next the application modifies D0 -> D0' and also writes other data
> D3, D4. Now you have:
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D1	D2	P0,1,2
> 	D0'	D3	D4	P0',3,4
>
> But if D1 and D2 stay immutable for a long time, then we can run out
> of pool blocks with D0 held down in a half-freed state. So as we near
> full pool capacity, a scrubber would have to walk the stripes and
> look for partially freed ones. Then it would need to do a scrubbing
> "read/write" on D1, D2 so that they become part of a new stripe with
> some other data, freeing the full initial stripe.

Or, given a list of partial stripes (and sufficient cache), the next
write of D5 could be combined with D1, D2:

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P0,1,2
	D0'	D3	D4	P0',3,4
	D5	free	free	P5,1,2

therefore freeing D0 and P0,1,2:

	Disk0	Disk1	Disk2	Disk3

	free	D1	D2	free
	D0'	D3	D4	P0',3,4
	D5	free	free	P5,1,2

(I assumed no need for alignment.)

Performance-wise, I'm guessing it might be beneficial to "quickly"
write mirrored blocks on the disk and later combine them, freeing the
now-unneeded mirrors.
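A sketch of that combining step - hypothetical structures only, and
simplified in that it reuses a freed slot and recomputes parity in
place, where a real implementation would write the new parity
copy-on-write:

	/*
	 * Drop a new block into a freed data slot of a half-freed
	 * stripe and recompute parity over the stripe's current
	 * contents, so the old data copy and old parity become free.
	 */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	#define	NDATA	3		/* data columns per stripe */
	#define	CHUNK	512		/* bytes per column chunk */

	struct stripe {
		uint8_t	data[NDATA][CHUNK];
		uint8_t	parity[CHUNK];
		bool	live[NDATA];	/* slots holding live blocks */
	};

	/* XOR-recompute parity over the current column contents. */
	static void
	stripe_reparity(struct stripe *s)
	{
		memset(s->parity, 0, CHUNK);
		for (int c = 0; c < NDATA; c++)
			for (int i = 0; i < CHUNK; i++)
				s->parity[i] ^= s->data[c][i];
	}

	/* Place a new block into the first non-live slot, if any. */
	static bool
	stripe_fill(struct stripe *s, const uint8_t *newdata)
	{
		for (int c = 0; c < NDATA; c++) {
			if (!s->live[c]) {
				memcpy(s->data[c], newdata, CHUNK);
				s->live[c] = true;
				stripe_reparity(s);
				return (true);
			}
		}
		return (false);	/* stripe already full */
	}

	int
	main(void)
	{
		/* D0's slot was freed; D1 and D2 are still live. */
		struct stripe s = { .live = { false, true, true } };
		uint8_t d5[CHUNK] = { 0xab };

		if (stripe_fill(&s, d5))
			printf("D5 placed; old D0 and old parity "
			    "are now free\n");
		return (0);
	}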