Jason J. W. Williams wrote:
> Hello All,
>
> I was curious if anyone had run a benchmark on the IOPS performance of
> RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was
> curious what others had seen. Thank you in advance.

I've been using a simple model for small, random reads. In that model,
the performance of a raidz[12] set will be approximately equal to a single
disk. For example, if you have 6 disks, then the performance for the
6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
dynamic stripe of 2-way mirrors will have a normalized performance of 6.
I'd be very interested to see if your results concur.

The models for writes or large reads are much more complicated because
of the numerous caches of varying size and policy throughout the system.
The small, random read workload will be largely unaffected by caches and
you should see the performance as predicted by the disk rpm and seek time.
 -- richard
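To put rough numbers on that model (the ~125 IOPS per disk below is an
assumed figure for a 7,200 rpm SATA drive, not a measurement from the
Thumper):

    6-disk raidz2 set:              ~1 x 125 = ~125 small random read IOPS
    3-way stripe of 2-way mirrors:  ~6 x 125 = ~750 small random read IOPS

In other words, the model predicts roughly a 6x difference for this
workload on the same six disks.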
Hi Richard,

Hmm... that's interesting. I wonder if it's worth benchmarking RAIDZ2 if
those are the results you're getting. The testing is to see the
performance gain we might get for MySQL moving off the FLX210 to an
active/passive pair of X4500s. Was hoping with that many SATA disks
RAIDZ2 would provide a nice safety net.

Best Regards,
Jason
Just got an interesting benchmark. I made two zpools:

RAID-10  (9x 2-way RAID-1 mirrors: 18 disks total)
RAID-Z2  (3x 6-way RAIDZ2 groups: 18 disks total)

Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307 seconds.
Deleted the data from the RAID-Z2. Then copying the 38.4GB of data from
the RAID-10 to the RAID-Z2 took 258 seconds. I would have expected the
RAID-10 to write data more quickly.

It's interesting to me that the RAID-10 pool registered the 38.4GB of
data as 38.4GB, whereas the RAID-Z2 registered it as 56.4GB.

Best Regards,
Jason
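For reference, assuming the writing pool was the bottleneck in each copy,
those times work out to roughly:

    RAID-Z2 -> RAID-10 (RAID-10 writing):  38.4 GB / 307 s  =  ~128 MB/s
    RAID-10 -> RAID-Z2 (RAID-Z2 writing):  38.4 GB / 258 s  =  ~152 MB/s

so the RAID-Z2 pool absorbed the writes about 20% faster in this test.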
> I've been using a simple model for small, random reads. In that model,
> the performance of a raidz[12] set will be approximately equal to a single
> disk. For example, if you have 6 disks, then the performance for the
> 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
> dynamic stripe of 2-way mirrors will have a normalized performance of 6.
> I'd be very interested to see if your results concur.

Is this expected behavior? Assuming concurrent reads (not synchronous and
sequential) I would naively expect an n-disk raidz2 pool to have a
normalized performance of n for small reads.

Is there some reason why a small read on a raidz2 is not statistically
very likely to require I/O on only one device? Assuming a non-degraded
pool of course.

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Hello Jason,

Wednesday, January 3, 2007, 11:11:31 PM, you wrote:

JJWW> Hmm... that's interesting. I wonder if it's worth benchmarking RAIDZ2
JJWW> if those are the results you're getting. The testing is to see the
JJWW> performance gain we might get for MySQL moving off the FLX210 to an
JJWW> active/passive pair of X4500s. Was hoping with that many SATA disks
JJWW> RAIDZ2 would provide a nice safety net.

Well, you weren't thinking about one big raidz2 group?

To get more performance you can create one pool with many smaller raidz2
groups - that way your worst-case read performance should increase
approximately N times, where N is the number of raidz2 groups.

However, keep in mind that write performance should be really good with
raidz2.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hi Robert,

Our X4500 configuration is multiple 6-way (across controllers) RAID-Z2
groups striped together. Currently, 3 RZ2 groups. I'm about to test write
performance against ZFS RAID-10. I'm curious why RAID-Z2 performance
should be good? I assumed it was an analog to RAID-6. In our recent
experience RAID-5 due to the 2 reads, an XOR calc and a write op per
write instruction is usually much slower than RAID-10 (two write ops).
Any advice is greatly appreciated.

Best Regards,
Jason
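As a concrete sketch, a pool like the one Jason describes (three 6-disk
raidz2 groups striped into one pool) would be created along these lines -
the pool name and device names are placeholders, not the real X4500
controller/target mapping:

    zpool create tank \
        raidz2 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
        raidz2 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
        raidz2 c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0

ZFS dynamically stripes writes across the three raidz2 vdevs; there is no
separate "stripe" step.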
Hello Jason,

Wednesday, January 3, 2007, 11:40:38 PM, you wrote:

JJWW> Just got an interesting benchmark. I made two zpools:

JJWW> RAID-10  (9x 2-way RAID-1 mirrors: 18 disks total)
JJWW> RAID-Z2  (3x 6-way RAIDZ2 groups: 18 disks total)

JJWW> Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307
JJWW> seconds. Deleted the data from the RAID-Z2. Then copying the 38.4GB
JJWW> of data from the RAID-10 to the RAID-Z2 took 258 seconds. I would
JJWW> have expected the RAID-10 to write data more quickly.

Actually, with 18 disks in raid-10, in theory you get write performance
equal to a stripe of 9 disks. With 18 disks in 3 raidz2 groups of 6 disks
each, you should expect something like (6-2)*3 = 12 disks, so write
performance equal to 12 disks in a stripe.

JJWW> It's interesting to me that the RAID-10 pool registered the 38.4GB
JJWW> of data as 38.4GB, whereas the RAID-Z2 registered it as 56.4GB.

If you checked with zpool - then it's "ok" - it reports disk usage with
the parity overhead included. If zfs list showed you those numbers, then
either you're using old snv bits or s10U2, as this was corrected some
time ago (in U3).

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
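That theory lines up reasonably well with the copy times above: the model
predicts about a 12:9 (1.33x) write advantage for the raidz2 layout, and
the measured 307 s vs. 258 s is about a 1.19x advantage - the right
direction, with the gap plausibly narrowed by the read side of the copy
and by caching. The reported capacity also roughly fits the parity
overhead: 38.4GB x 6/4 = 57.6GB of raw space consumed for data plus two
parity columns per 6-disk group.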
Hi Robert,

That makes sense. Thank you. :-) Also, it was zpool I was looking at.
zfs always showed the correct size.

-J
Hello Peter,

Thursday, January 4, 2007, 1:12:47 AM, you wrote:

>> I've been using a simple model for small, random reads. In that model,
>> the performance of a raidz[12] set will be approximately equal to a single
>> disk. For example, if you have 6 disks, then the performance for the
>> 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
>> dynamic stripe of 2-way mirrors will have a normalized performance of 6.
>> I'd be very interested to see if your results concur.

PS> Is this expected behavior? Assuming concurrent reads (not synchronous
PS> and sequential) I would naively expect an n-disk raidz2 pool to have a
PS> normalized performance of n for small reads.

PS> Is there some reason why a small read on a raidz2 is not statistically
PS> very likely to require I/O on only one device? Assuming a non-degraded
PS> pool of course.

Unfortunately, there is. With raid-z1 and raid-z2 there's no free lunch.
You get excellent write performance (better than raid-10), however read
performance for small I/Os will suffer. That's because in the raid-z[12]
case each logical file system block is spread to all disks (minus the
parity disks), so in order to read just one block you have to get data
from all disks in a raid-z[12] group. This is not something many people
would expect, knowing traditional RAID. It's not the case with striping
and raid-1[0] in ZFS.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
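As a rough illustration (ignoring padding, assuming the default 128K
recordsize, and with the parity placement simplified - it actually
rotates), a single file system block in a 6-disk raidz2 group is laid out
something like this, so even a 2K application read touches four disks:

    disk1     disk2     disk3    disk4    disk5    disk6
    P         Q         32K      32K      32K      32K     <- one 128K fs block
    (parity)  (parity)  (data)   (data)   (data)   (data)

A mirror, by contrast, keeps each 128K block whole on each side, so a
small read needs only one disk.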
Hello Jason,

Thursday, January 4, 2007, 1:55:02 AM, you wrote:

JJWW> Our X4500 configuration is multiple 6-way (across controllers) RAID-Z2
JJWW> groups striped together. Currently, 3 RZ2 groups. I'm about to test
JJWW> write performance against ZFS RAID-10. I'm curious why RAID-Z2
JJWW> performance should be good? I assumed it was an analog to RAID-6. In
JJWW> our recent experience RAID-5 due to the 2 reads, an XOR calc and a
JJWW> write op per write instruction is usually much slower than RAID-10
JJWW> (two write ops). Any advice is greatly appreciated.

I'm not going to describe it again - it was explained well here before.
However, one simple query to Google finds:

http://blogs.sun.com/roch/entry/when_to_and_not_to

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hi Robert,

I've read that paper. Thank you for the condescension.

-J
On Jan 3, 2007, at 19:55, Jason J. W. Williams wrote:

> performance should be good? I assumed it was an analog to RAID-6. In
> our recent experience RAID-5 due to the 2 reads, an XOR calc and a
> write op per write instruction is usually much slower than RAID-10
> (two write ops). Any advice is greatly appreciated.

RAIDZ and RAIDZ2 do not suffer from this malady (the RAID5 write hole).
This is explained nicely in segment four of the ZFS video (at about time
2:30):

http://www.sun.com/software/solaris/zfs_learning_center.jsp
>> In our recent experience RAID-5 due to the 2 reads, an XOR calc and a
>> write op per write instruction is usually much slower than RAID-10
>> (two write ops). Any advice is greatly appreciated.
>
> RAIDZ and RAIDZ2 do not suffer from this malady (the RAID5 write hole).

1. This isn't the "write hole".

2. RAIDZ and RAIDZ2 suffer from read-modify-write overhead when updating
a file in writes of less than 128K, but not when writing a new file or
issuing large writes.
> Is there some reason why a small read on a raidz2 is not statistically
> very likely to require I/O on only one device? Assuming a non-degraded
> pool of course.

ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks
must be read to compute and verify the checksum.
Jason J. W. Williams | 2007-Jan-04 03:08 UTC | [zfs-discuss] Re: Re[2]: RAIDZ2 vs. ZFS RAID-10
Hi Anton,

Thank you for the information. That is exactly our scenario. We're 70%
write heavy, and given the nature of the workload, our typical writes are
10-20K. Again, the information is much appreciated.

Best Regards,
Jason
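Given that write profile, one ZFS knob that usually comes up for MySQL on
ZFS is the dataset recordsize, so a 10-20K update does not have to
read-modify-write a whole 128K record. A minimal sketch - the dataset
name is a placeholder, 16K assumes an InnoDB-style page size and would
need to match the actual engine, and the setting only affects files
created after it is changed:

    zfs create tank/mysql
    zfs set recordsize=16k tank/mysql
    zfs get recordsize tank/mysql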
>> Is there some reason why a small read on a raidz2 is not statistically
>> very likely to require I/O on only one device? Assuming a non-degraded
>> pool of course.
>
> ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks
> must be read to compute and verify the checksum.

But why do ZFS reads require the computation of the RAIDZ checksum?

If the block checksum is fine, then you need not care about the parity.

Casper
Hello Anton,

Thursday, January 4, 2007, 3:46:48 AM, you wrote:

>> Is there some reason why a small read on a raidz2 is not statistically
>> very likely to require I/O on only one device? Assuming a non-degraded
>> pool of course.

ABR> ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all
ABR> disks must be read to compute and verify the checksum.

It's not about the checksum but about how a fs block is stored in the
raid-z[12] case - it's spread out to all non-parity disks, so in order to
read one fs block you have to read from all disks except the parity
disks.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
On Jan 4, 2007, at 3:25 AM, Casper.Dik at Sun.COM wrote:

>>> Is there some reason why a small read on a raidz2 is not statistically
>>> very likely to require I/O on only one device? Assuming a non-degraded
>>> pool of course.
>>
>> ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks
>> must be read to compute and verify the checksum.
>
> But why do ZFS reads require the computation of the RAIDZ checksum?
>
> If the block checksum is fine, then you need not care about the parity.

It's the block checksum that requires reading all of the disks. If ZFS
stored sub-block checksums for the RAID-Z case then short reads could
often be satisfied without reading the whole block (and all disks).

So actually I mis-spoke slightly; rather than "all disks", I should have
said "all data disks." In practice this has the same effect: No more than
one read may be processed at a time.

Anton
> So actually I mis-spoke slightly; rather than "all disks", I should
> have said "all data disks." In practice this has the same effect: No
> more than one read may be processed at a time.

But aren't short blocks sometimes stored on only a subset of disks?

Casper
> It's the block checksum that requires reading all of the disks. If
> ZFS stored sub-block checksums for the RAID-Z case then short reads
> could often be satisfied without reading the whole block (and all
> disks).

What happens when a sub-block is missing (single disk failure)? Surely
it doesn't have to discard the entire checksum and simply trust the
remaining blocks?

Also, even if it could read the data from a subset of the disks, isn't
it a feature that every read is also verifying the parity for
correctness/silent corruption? I'm assuming that any "short-read"
optimization wouldn't be able to perform that check.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Anton B. Rang writes:
> >> In our recent experience RAID-5 due to the 2 reads, an XOR calc and a
> >> write op per write instruction is usually much slower than RAID-10
> >> (two write ops). Any advice is greatly appreciated.
> >
> > RAIDZ and RAIDZ2 do not suffer from this malady (the RAID5 write hole).
>
> 1. This isn't the "write hole".
>
> 2. RAIDZ and RAIDZ2 suffer from read-modify-write overhead when updating
> a file in writes of less than 128K, but not when writing a new file or
> issuing large writes.

I don't think this is stated correctly. All filesystems will incur a
read-modify-write when an application is updating a portion of a block.
The read I/O only occurs if the block is not already in the memory cache.
The write is potentially deferred, and multiple block updates may occur
per write I/O. This is not RAIDZ specific.

ZFS stores files less than 128K (or less than the filesystem recordsize)
as a single block. Larger files are stored as multiple recordsize blocks.
For RAID-Z a block spreads onto all devices of a group.

-r
On Jan 4, 2007, at 10:26 AM, Roch - PAE wrote:

> All filesystems will incur a read-modify-write when an application is
> updating a portion of a block.

For most Solaris file systems it is the page size, rather than the block
size, that affects read-modify-write; hence 8K (SPARC) or 4K (x86/x64)
writes do not require read-modify-write for UFS/QFS, even when larger
block sizes are used. When direct I/O is enabled, UFS and QFS will write
directly to disk (without reading) for 512-byte-aligned I/O.

> The read I/O only occurs if the block is not already in the memory cache.

Of course.

> ZFS stores files less than 128K (or less than the filesystem recordsize)
> as a single block. Larger files are stored as multiple recordsize blocks.

So appending to any file less than 128K will result in a read-modify-write
cycle (modulo read caching), while a write to a file which is not
record-size-aligned (by default, 128K) results in a read-modify-write
cycle.

> For RAID-Z a block spreads onto all devices of a group.

Which means that all devices are involved in the read and the write;
except, as I believe Casper pointed out, that very small blocks (less
than 512 bytes per data device) will reside on a smaller set of disks.

Anton
> What happens when a sub-block is missing (single disk failure)? Surely
> it doesn't have to discard the entire checksum and simply trust the
> remaining blocks?

The checksum is over the data, not the data+parity. So when a disk fails,
the data is first reconstructed, and then the block checksum is computed.

> Also, even if it could read the data from a subset of the disks, isn't
> it a feature that every read is also verifying the parity for
> correctness/silent corruption?

It doesn't -- we only read the data, not the parity. (See line 708 of
vdev_raidz.c.) The parity is checked only when scrubbing.
Wade.Stuart at fallon.com | 2007-Jan-04 17:40 UTC | [zfs-discuss] Scrubbing on active zfs systems (many snaps per day)
From what I have read, it looks like there is a known issue with scrubbing
restarting when any of the other usages of the same code path run
(re-silver, snap ...). It looks like there is a plan to put in a marker so
that scrubbing knows where to start again after being preempted. This is
good.

I am wondering if any thought has been put into a scrubbing service that
would do constant low-priority scrubs (either full with the restart
marker, or randomized). I have noticed that the default scrub seems to be
very resource intensive and can cause significant slowdowns on the
filesystems; a much slower but constant scrub would be nice.

    while (1) {
        scrub_very_slowly();
    }

Are there any plans in this area documented anywhere, or can someone give
insight as to the devel team's goals?

Thanks!
-Wade
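Absent such a throttled scrub service, the usual workaround is simply to
kick off a scrub periodically from cron and let it run at whatever pace
it runs - not the constant low-priority scrub asked for above, and the
pool name is a placeholder:

    # crontab entry: start a scrub of 'tank' at 03:00 every Sunday
    0 3 * * 0  /usr/sbin/zpool scrub tank

Progress can then be checked with "zpool status -v tank".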
> > Also, even if it could read the data from a subset of the disks, isn't
> > it a feature that every read is also verifying the parity for
> > correctness/silent corruption?
>
> It doesn't -- we only read the data, not the parity. (See line 708 of
> vdev_raidz.c.) The parity is checked only when scrubbing.

Ah, that's a major misconception on my part then. I'd thought I'd read
that unlike any other RAID implementation, ZFS checked and verified
parity on normal data access.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Darren Dunham wrote:
>>> Also, even if it could read the data from a subset of the disks, isn't
>>> it a feature that every read is also verifying the parity for
>>> correctness/silent corruption?
>>
>> It doesn't -- we only read the data, not the parity. (See line 708 of
>> vdev_raidz.c.) The parity is checked only when scrubbing.
>
> Ah, that's a major misconception on my part then. I'd thought I'd read
> that unlike any other RAID implementation, ZFS checked and verified
> parity on normal data access.

Except that of course we compute the checksum for all data read and
compare that with the checksum stored in the block pointer... and then
use the parity data to reconstruct the block if the checksums don't
match.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com        http://blogs.sun.com/barts
> I'd thought I'd read that unlike any other RAID implementation, ZFS
> checked and verified parity on normal data access.

Not yet, it appears. :-)

(Incidentally, some hardware RAID controllers do verify parity, but
generally only for RAID-3, where the extra reads are free as long as you
have hardware parity checking.)
Darren Dunham wrote:
>>> Also, even if it could read the data from a subset of the disks, isn't
>>> it a feature that every read is also verifying the parity for
>>> correctness/silent corruption?
>>
>> It doesn't -- we only read the data, not the parity. (See line 708 of
>> vdev_raidz.c.) The parity is checked only when scrubbing.
>
> Ah, that's a major misconception on my part then. I'd thought I'd read
> that unlike any other RAID implementation, ZFS checked and verified
> parity on normal data access.

That would be useless, and not provide anything extra. ZFS will do a
block checksum check (that is, for each block read, read the checksum for
that block, and compare to see if it is OK). If the block checksums show
OK, then reading the parity for the corresponding data yields no
additional useful information.

I'm assuming that in a RAIDZ, RAIDZ2, or mirror configuration, should a
block checksum show the corresponding block is corrupted, then ZFS will
read the parity (or corresponding mirror) block, attempt to reconstruct
the "bad" block, give the corrected info to the calling process, and then
rewrite the corrected data to a new block section on the disk(s).

Right?

-Erik
> > Ah, that's a major misconception on my part then. I'd thought I'd read
> > that unlike any other RAID implementation, ZFS checked and verified
> > parity on normal data access.
>
> That would be useless, and not provide anything extra.

I think it's useless only if a (disk) block of data holding RAIDZ parity
never has silent corruption, or if scrubbing were a lightweight operation
that could be run often.

> ZFS will do a block checksum check (that is, for each block read, read
> the checksum for that block, and compare to see if it is OK). If the
> block checksums show OK, then reading the parity for the corresponding
> data yields no additional useful information.

It would yield useful information about the status of the parity
information on disk. The read would be done because you're already paying
the penalty for reading all the data blocks, so you can verify the
stability of the parity information on disk by reading an additional
amount.

> I'm assuming that in a RAIDZ, RAIDZ2, or mirror configuration, should a
> block checksum show the corresponding block is corrupted, then ZFS will
> read the parity (or corresponding mirror) block, attempt to reconstruct
> the "bad" block, give the corrected info to the calling process, and
> then rewrite the corrected data to a new block section on the disk(s).
>
> Right?

I was assuming that *all* the data for an FS block was read and, if
redundant, the redundancy was verified correct (same data on mirrors,
valid parity for RAIDZ) or the redundancy would be repaired.

At least with a mirror I have a chance of reading all copies over time.
With RAIDZ, I'll never read the parity until a problem or a scrub occurs.
Nothing wrong with that; I had simply managed to convince myself that it
did more.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
>> ... If the block checksums show OK, then reading the parity for the
>> corresponding data yields no additional useful information.
>
> It would yield useful information about the status of the parity
> information on disk.
>
> The read would be done because you're already paying the penalty for
> reading all the data blocks, so you can verify the stability of the
> parity information on disk by reading an additional amount.

Sounds like this additional checking (I see your point) could be optional?

--Toby
>>> ... If the block checksums show OK, then reading the parity for the
>>> corresponding data yields no additional useful information.
>>
>> It would yield useful information about the status of the parity
>> information on disk.
>>
>> The read would be done because you're already paying the penalty for
>> reading all the data blocks, so you can verify the stability of the
>> parity information on disk by reading an additional amount.
>
> Sounds like this additional checking (I see your point) could be
> optional?

Well, I'm not offering to implement it or anything. :-)

Somehow from some of the early discussions of ZFS, I managed to "learn"
that this was one of the features. What I read was wrong, or I
misinterpreted it. (Either way, I'm afraid I've managed to repeat it to
others since.)

I would expect such behavior to have some redundancy benefits and some
performance and code complexity impacts. I think it's a neat idea and I'm
sorry to learn that I've been misunderstanding this as a feature, but I
can't guess what the cost of implementing it would be. I suppose having
it as a per-pool option could make sense.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
> It's not about the checksum but about how a fs block is stored in the
> raid-z[12] case - it's spread out to all non-parity disks, so in order
> to read one fs block you have to read from all disks except the parity
> disks.

However, if we didn't need to verify the checksum, we wouldn't have to
read the whole file system block to satisfy small reads.

Anton
Darren Dunham wrote:
>> That would be useless, and not provide anything extra.
>
> I think it's useless only if a (disk) block of data holding RAIDZ parity
> never has silent corruption, or if scrubbing were a lightweight operation
> that could be run often.

The problem is that you will still need to perform a periodic scrub,
because you can't be sure that all data will be read during normal
operation. So it doesn't make sense to me to (further) penalize every
read, when doing so does not remove the need for scrub.
 -- richard
Peter Schuller wrote:
>> I've been using a simple model for small, random reads. In that model,
>> the performance of a raidz[12] set will be approximately equal to a single
>> disk. For example, if you have 6 disks, then the performance for the
>> 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
>> dynamic stripe of 2-way mirrors will have a normalized performance of 6.
>> I'd be very interested to see if your results concur.
>
> Is this expected behavior? Assuming concurrent reads (not synchronous and
> sequential) I would naively expect an n-disk raidz2 pool to have a
> normalized performance of n for small reads.

q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0
where such behavior in a hardware RAID array led to corruption which was
detected by ZFS. No free lunch today, either.
 -- richard
>> Is this expected behavior? Assuming concurrent reads (not synchronous
>> and sequential) I would naively expect an n-disk raidz2 pool to have a
>> normalized performance of n for small reads.
>
> q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0
> where such behavior in a hardware RAID array led to corruption which
> was detected by ZFS. No free lunch today, either.
> -- richard

I appreciate the advantage of checksumming, believe me. Though I don't
see why this is directly related to the small read problem, other than
that the implementation is such.

Is there some fundamental reason why one could not (though I understand
one *would* not) keep a checksum on a per-disk basis, so that in the
normal case one really could read from just one disk for a small read? I
realize it is not enough for a block to be self-consistent, but
theoretically couldn't the block which points to the block in question
contain multiple checksums for the various subsets on different disks,
rather than just the one checksum for the entire block?

Not that I consider this a major issue; but since you pointed me to that
article in response to my statement above...

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Peter Schuller wrote:
> I appreciate the advantage of checksumming, believe me. Though I don't
> see why this is directly related to the small read problem, other than
> that the implementation is such.
>
> Is there some fundamental reason why one could not (though I understand
> one *would* not) keep a checksum on a per-disk basis, so that in the
> normal case one really could read from just one disk for a small read?
> I realize it is not enough for a block to be self-consistent, but
> theoretically couldn't the block which points to the block in question
> contain multiple checksums for the various subsets on different disks,
> rather than just the one checksum for the entire block?

Then you would need to keep checksums for each physical block, which is
not part of the on-disk spec. It is not clear to me that this would be a
net win, because you would need that checksum to be physically placed on
another vdev, which implies that you still couldn't just read a single
block and be happy.

Note, there are lots of different possibilities here; ZFS implements the
end-to-end checksum, which would not be replaced by a lower-level
checksum anyway.
 -- richard
Hello Anton,

Saturday, January 6, 2007, 6:29:29 AM, you wrote:

>> It's not about the checksum but about how a fs block is stored in the
>> raid-z[12] case - it's spread out to all non-parity disks, so in order
>> to read one fs block you have to read from all disks except the parity
>> disks.

ABR> However, if we didn't need to verify the checksum, we wouldn't
ABR> have to read the whole file system block to satisfy small reads.

But we'd lose the end-to-end integrity feature. And still, with 9 or more
disks, for most workloads we would end up reading them all anyway, as
each disk would hold such a small portion of the fs block.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
On 03 January, 2007 - Jason J. W. Williams sent me these 0,4K bytes:

> Hello All,
>
> I was curious if anyone had run a benchmark on the IOPS performance of
> RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was
> curious what others had seen. Thank you in advance.

http://blogs.sun.com/roch/entry/when_to_and_not_to has some info for you.

/Tomas
--
Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Hello All,

I was curious if anyone had run a benchmark on the IOPS performance of
RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was
curious what others had seen. Thank you in advance.

Best Regards,
Jason