1) Is there a big performance difference between raidz1 and raidz2? In traditional RAID I would think there would be, since it would have two drives of parity to write/read at the same time.

2) Can the ZFS boot/root volumes use the same devices as the normal volumes? Like one raidz pool that is used for data (as normal) but also used for ZFS boot?

3) Does anyone have any heartburn about the new Seagate 1.5TB disks? I've had at least one person make comments about being nervous about them.

4) What are the largest raidz pools people have successfully used (with snv_94 or later) - 10 disk raidz2? What are the reasons for not creating larger raidz1/raidz2 groups? Is it only performance and the possibility of multi-disk failure? Am I really going to suffer that bad a performance hit if I created a 10 or 12 disk raidz2 group?

I'm trying to build a small machine (mid-tower hopefully) and maximize the amount of usable space... Thanks.
2008/9/15 Ben Rockwood:
> On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
> I've come to regret this. I recommend keeping pools reasonably sized
> and to keep stripes thinner than this.

Could you clarify why you came to regret it? I was intending to create a single pool for 8 1TB disks.
On Sun, Sep 14, 2008 at 9:51 PM, Ben Rockwood <benr at cuddletech.com> wrote:
> Frankly, most of the answers to the questions you've asked are based on
> your hardware configuration more than the theoretical optimum. If I
> were building out a setup I'd probably go with PCIe SATA adapters that can
> host up to 4 drives... buy 3 of them, hang 600GB+ disks off them, and
> create a RAIDZ per controller. I'd use RAIDZ only if this were a home
> box that I could replace drives on in a reasonable timeframe and did
> backups of. If that wasn't the case, RAIDZ2.

See, for my goal - a quiet, as-compact-as-possible storage solution - that won't fly. I'm going to need to do a 7 disk raidz1 or 8 disk raidz2, I think, to be able to stretch it. I don't want to be spending a lot of money on redundancy - these machines do not need to be highly available, and I can power the box down if a disk fails until I can physically replace the disk. I don't want to wind up with a lot of disks sitting around for parity's sake. If this were hardware RAID I'd be looking at maybe a 12 disk RAID6/dual-parity RAID5 (10 disks usable), for example...

> When building your box, don't forget to stash as much memory as possible
> in the box for cache (ZFS ARC), it's the best way to improve your read
> performance in real-world workloads.

This will be for home storage for DVDs, etc. Very low traffic, maybe 4 streams maximum. Shared mainly using CIFS, maybe some NFS. I'll be putting 4GB of RAM in it. I figure that should be decent.
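For readers sketching out the 8-disk raidz2 option mentioned above, pool creation is a single command. This is only a sketch; the pool name and the device names are hypothetical and would need to match what format reports on your own system:

    # One 8-wide raidz2 vdev: 6 disks' worth of usable space, survives any 2 disk failures.
    # Device names are examples only - substitute the ones "format" lists.
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
                             c1t4d0 c1t5d0 c1t6d0 c1t7d0

    # Verify the layout and redundancy level.
    zpool status tank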
I'm looking at this raidoptimizer spreadsheet that Richard generated, and now I'm wondering - couldn't I do a large stripe (like 10 disks) and then just have 3 disks marked as spare? Does it work like that? And if so, what would the pros/cons be of doing a large stripe with spares vs. raidz1 or raidz2?

I'm also showing a 15 disk raidz2 with 206 million MTTDL[1] years and 122,000 MTTDL[2] years, 13TB (at 1TB per disk), and it can suffer 2 disk failures... of course only 91 iops, but the max theoretical bandwidth is over 1,000MB/sec... so many options. Trying to find a decent mix and match of iops, bandwidth, MTTDLs...
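To make the comparison concrete, here is roughly what the two layouts asked about would look like on the command line (a sketch only; pool and device names are made up). Note that a hot spare only helps when the vdev has redundancy to resilver from, which is why the first layout is risky:

    # Layout A: a plain 10-disk stripe with 3 hot spares.
    # No parity anywhere - a spare cannot rebuild data that was never redundant,
    # so losing one disk outright can still lose data.
    zpool create bigpool c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                         c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
                         spare c1t10d0 c1t11d0 c1t12d0

    # Layout B: a 12-disk raidz2 with one hot spare.
    # Two disks of parity, and the spare resilvers in automatically on failure.
    zpool create bigpool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
                                c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
                                spare c1t12d0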
Pavan Chandrashekar - Sun Microsystems
2008-Sep-15 07:30 UTC
[zfs-discuss] [storage-discuss] A few questions
Ben Rockwood wrote:
> mike wrote:
>> 1) Is there a big performance difference between raidz1 and raidz2? In
>> traditional RAID I would think there would be since it would have two
>> drives of parity to write/read to at the same time.
>
> I haven't benchmarked this specific case in some time, but in my
> experience I couldn't call it "big".
>
>> 2) Can the ZFS boot/root volumes be using the same devices as the
>> normal volumes? Like one raidz pool that is used for data (like
>> normal) but also used for ZFS boot?
>
> That would involve putting pools on disk partitions, and I'm not certain
> how that would affect the bootloader code.

A performance downside to putting pools on a partition, as opposed to the entire disk, is that ZFS turns off the write cache in the former case. You might want to check whether that is a concern in your setup.

Pavan
mike wrote:
> I'm looking at this raidoptimizer spreadsheet that Richard generated,
> and now I'm wondering -
>
> Couldn't I do a large stripe (like 10 disks) and then just have 3
> disks marked as spare?
>
> Does it work like that? and if so, what would the pros/cons be doing a
> large stripe with spares vs. raidz1 or raidz2?

That would be rather silly; you wouldn't have any redundancy.

Ian
On Mon, Sep 15, 2008 at 01:00:49PM +0530, Pavan Chandrashekar - Sun Microsystems wrote:
> A performance downside to putting pools on a partition, as opposed to
> the entire disk, is that ZFS turns off the write cache in the former
> case. You might want to check whether that is a concern in your setup.

You might want to double-check that fact. I forget who I worked through on this one, but it was determined that in the case of SATA, that was not true. I forget the details now, but basically I went to turn the cache on (since there was only one partition, and it was "owned" by ZFS) for a pair of disks that were using a partition instead of the whole disk, and it was already on.

Someone else on the list pointed out that would likely be the case and also checked several systems of their own.

From what we can tell, SCSI/FC/SAS/etc all do indeed disable the disk's cache, whereas SATA does not.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
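If you would rather check this on your own hardware than take either claim on faith, the Solaris format utility in expert mode exposes the drive's cache settings. The session below is only a sketch; the exact menus depend on the driver and disk, so treat it as an assumption to verify on your system:

    # Run format in expert mode, select the disk in question, then inspect the cache.
    format -e
    (choose the disk when prompted)
    format> cache
    cache> write_cache
    write_cache> display      # shows whether the write cache is currently enabled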
Brian Hechinger wrote:
> On Mon, Sep 15, 2008 at 01:00:49PM +0530, Pavan Chandrashekar - Sun Microsystems wrote:
>> A performance downside to putting pools on a partition, as opposed to
>> the entire disk, is that ZFS turns off the write cache in the former
>> case. You might want to check whether that is a concern in your setup.
>
> You might want to double-check that fact. I forget who I worked through on this one,
> but it was determined that in the case of SATA, that was not true. I forget the
> details now, but basically I went to turn the cache on (since there was only one
> partition, and it was "owned" by ZFS) for a pair of disks that were using a partition
> instead of the whole disk, and it was already on.
>
> Someone else on the list pointed out that would likely be the case and also checked
> several systems of their own.
>
> From what we can tell, SCSI/FC/SAS/etc all do indeed disable the disk's cache, whereas
> SATA does not.

Methinks it will depend on the disk model and firmware rev, not the host interface technology.

-- richard
mike wrote:
> I'm looking at this raidoptimizer spreadsheet that Richard generated,
> and now I'm wondering -

cool :-)

> Couldn't I do a large stripe (like 10 disks) and then just have 3
> disks marked as spare?
>
> Does it work like that? and if so, what would the pros/cons be doing a
> large stripe with spares vs. raidz1 or raidz2?
>
> I'm also showing a 15 disk raidz2 with 206 million MTTDL[1] years and
> 122,000 MTTDL[2] years, 13TB (at 1TB per disk) and can suffer 2 disk
> failures... course only 91 iops. but the max theoretical bandwidth is
> over 1,000MB/sec... so many options. Trying to find a decent mix and
> match of iops, bandwidth, MTTDLs...

performance, space, RAS -- it is a trade-off

-- richard
On Mon, Sep 15, 2008 at 08:39:01AM -0700, Richard Elling wrote:
> Methinks it will depend on the disk model and firmware rev, not
> the host interface technology.

Hmmmm. This of course would take more research then, since I can't really say anything other than for what I have here (Seagate SATA disks).

-brian
On Mon, Sep 15, 2008 at 8:40 AM, Richard Elling <Richard.Elling at sun.com> wrote:
> performance, space, RAS -- it is a trade-off

How about you whip up a "weight" factor for people like myself... :) This is how I would weigh my priorities:

#1 Available space
#2 Redundancy
#3 Speed (as long as I can get at least 30-40MB/sec over CIFS I think that should be fine, any faster is awesome)

(Totally joking about making it a feature. But I would appreciate any tips for this.)

Everyone else: I get the stripe vs. raidz comparison now. Essentially a stripe has no ditto blocks then?
On Mon, Sep 15, 2008 at 13:18, mike <mike503 at gmail.com> wrote:
> Everyone else: I get the stripe vs. raidz comparison now. Essentially
> a stripe has no ditto blocks then?

"Ditto blocks" it has; this is ZFS's name for redundant metadata. There are multiple copies of directory entries and such, so that corrupting a single block can't cause you to lose the entire filesystem below that block, even on a single disk. But a stripe lacks parity blocks, which ZFS needs to recreate damaged data. Raidz{,2} have this, and mirrors have this, but single disks do not.

Will
On Mon, Sep 15, 2008 at 10:28 AM, Will Murnane <will.murnane at gmail.com> wrote:
> "Ditto blocks" it has; this is ZFS's name for redundant metadata.
> There are multiple copies of directory entries and such, so that
> corrupting a single block can't cause you to lose the entire
> filesystem below that block, even on a single disk. But a stripe
> lacks parity blocks, which ZFS needs to recreate damaged data.
> Raidz{,2} have this, and mirrors have this, but single disks do not.

okay. ditto is only metadata. gotcha.
>>>>> "wm" == Will Murnane <will.murnane at gmail.com> writes:

    wm> corrupting a single block can't cause you to lose the entire
    wm> filesystem below that block, even on a single disk.

an entire pool, on the other hand....
mike wrote:
> On Mon, Sep 15, 2008 at 10:28 AM, Will Murnane <will.murnane at gmail.com> wrote:
>> "Ditto blocks" it has; this is ZFS's name for redundant metadata.
>> There are multiple copies of directory entries and such, so that
>> corrupting a single block can't cause you to lose the entire
>> filesystem below that block, even on a single disk. But a stripe
>> lacks parity blocks, which ZFS needs to recreate damaged data.
>> Raidz{,2} have this, and mirrors have this, but single disks do not.
>
> okay. ditto is only metadata. gotcha.

By default, but you can set the number of copies for your data by setting the "copies" parameter on your file system. In other words, "copies=1" is the default.

    zfs get copies myfilesystemname

will show the current setting for your file system.

It is a little difficult to understand copies without pictures, so I blogged about it and drew some pictures.
http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection

The real takeaway is that you have many different ways to protect your data; which to choose is not always clear.

-- richard
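As a quick illustration of the property described above (the dataset name here is made up - substitute your own file system):

    # Ask for two copies of every data block on this dataset.
    # Note: the setting only affects blocks written after it is changed.
    zfs set copies=2 tank/media

    # Confirm the current value.
    zfs get copies tank/media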
Wade.Stuart at fallon.com
2008-Sep-15 18:13 UTC
[zfs-discuss] [storage-discuss] A few questions
zfs-discuss-bounces at opensolaris.org wrote on 09/15/2008 12:58:44 PM:

> mike wrote:
>> okay. ditto is only metadata. gotcha.
>
> By default, but you can set the number of copies for your data
> by setting the "copies" parameter on your file system. In other
> words, "copies=1" is the default.
>     zfs get copies myfilesystemname
> will show the current setting for your file system.
>
> It is a little difficult to understand copies without pictures, so I
> blogged about it and drew some pictures.
> http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection

The big takeaway here is that while copies=N guarantees there will be N copies of blocks, it _does not_ guarantee that they will always be on separate disks. If you lose a disk, there is still a chance that you lose the pool. raidz(2) > copies=N

-Wade
On Mon, Sep 15, 2008 at 01:13:28PM -0500, Wade.Stuart at fallon.com wrote:
> The big takeaway here is that while copies=N guarantees there will be N
> copies of blocks, it _does not_ guarantee that they will always be on
> separate disks. If you lose a disk, there is still a chance that you lose
> the pool. raidz(2) > copies=N

It's also worth pointing out that ZFS currently doesn't survive toplevel vdev failure. If it happens while the pool is up, it will enter the I/O failure state when trying to write the labels, despite the fact it could theoretically continue on with the rest of the vdevs. If it happens while the pool is exported or the system is down, it treats this like the root vdev is faulted. Both of these are bugs that are being worked on - the only reason a pool should be faulted is because critical pool-wide metadata is not available.

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
2008/9/15 gm_sjo:
> 2008/9/15 Ben Rockwood:
>> On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
>> I've come to regret this. I recommend keeping pools reasonably sized
>> and to keep stripes thinner than this.
>
> Could you clarify why you came to regret it? I was intending to create
> a single pool for 8 1TB disks.

Sorry, just bouncing this back for Ben in case he missed it.
On Tue, Sep 16, 2008 at 10:03 PM, Ben Rockwood <benr at cuddletech.com> wrote:
> gm_sjo wrote:
>> 2008/9/15 gm_sjo:
>>> 2008/9/15 Ben Rockwood:
>>>> On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
>>>> I've come to regret this. I recommend keeping pools reasonably sized
>>>> and to keep stripes thinner than this.
>>>
>>> Could you clarify why you came to regret it? I was intending to create
>>> a single pool for 8 1TB disks.
>>
>> Sorry, just bouncing this back for Ben in case he missed it.
>
> No, I didn't miss it, I just was hoping I could get some benchmarking in
> to justify my points.
>
> You want to keep stripes wide to reduce wasted disk space.... but you
> also want to keep them narrow to reduce the elements involved in parity
> calculation. In light home use I don't see a problem with an 8 disk
> RAIDZ/RAIDZ2. If you're serving in a multi-user environment your primary
> concern is to reduce the movement of the disk heads, and thus narrower
> stripes become advantageous.

I'm not sure that the width of the stripe is directly a problem. But what is true is that the random read performance of raidz1/2 is basically that of a single drive, so having more vdevs is better. Given a fixed number of drives, more vdevs implies narrower stripes, but that's a side-effect rather than a cause.

For what it's worth, we put all the disks on our thumpers into a single pool - mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for the OS - and we would happily go much bigger.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
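For reference, a pool laid out along the lines Peter describes - several raidz1 vdevs plus a hot spare in one pool - is built by listing multiple vdevs in a single command. This is only an illustrative sketch with invented device names, shortened to three vdevs to keep it readable:

    # Three 9-disk (8+1) raidz1 vdevs in one pool, plus a hot spare.
    # Writes are striped across the vdevs; each vdev handles its own parity.
    zpool create thumper \
        raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c1t0d0 \
        raidz1 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 \
        raidz1 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 c3t2d0 \
        spare  c3t3d0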
On Tue, Sep 16, 2008 at 2:28 PM, Peter Tribble <peter.tribble at gmail.com> wrote:
> For what it's worth, we put all the disks on our thumpers into a single pool -
> mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for the OS - and
> we would happily go much bigger.

So you have 9 drive raidz1 (8 disks usable) + hot spare, or 8 drive raidz1 (7 disks usable) + hot spare?

It sounds like people -can- build larger pools but, due to their storage needs (performance, availability, etc.), choose NOT to. For home usage with maybe 4 clients maximum, where I can deal with downtime when swapping out a drive, I think I can live with "decent" performance (not "insane") and try to maximize my space (without making ZFS's redundancy features useless).
Am I right in thinking, though, that for every raidz1/2 vdev you're effectively losing the storage of one/two disks in that vdev?
On Wed, Sep 17, 2008 at 8:40 AM, gm_sjo <saqmaster at gmail.com> wrote:
> Am I right in thinking, though, that for every raidz1/2 vdev you're
> effectively losing the storage of one/two disks in that vdev?

Well yeah - you've got to have some allowance for redundancy.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
2008/9/17 Peter Tribble:
> On Wed, Sep 17, 2008 at 8:40 AM, gm_sjo <saqmaster at gmail.com> wrote:
>> Am I right in thinking, though, that for every raidz1/2 vdev you're
>> effectively losing the storage of one/two disks in that vdev?
>
> Well yeah - you've got to have some allowance for redundancy.

This is what I'm struggling to get my head around - the chances of losing two disks at the same time are pretty darn remote (within a reasonable time-to-replace delta), so what advantage is there (other than potentially pointless uber-redundancy) in running multiple raidz/2 vdevs? Are you not in fact losing performance by reducing the number of spindles used for a given pool?
On Wed, Sep 17, 2008 at 10:11 AM, gm_sjo <saqmaster at gmail.com> wrote:
> This is what I'm struggling to get my head around - the chances of
> losing two disks at the same time are pretty darn remote (within a
> reasonable time-to-replace delta), so what advantage is there (other
> than potentially pointless uber-redundancy) in running multiple
> raidz/2 vdevs? Are you not in fact losing performance by reducing the
> number of spindles used for a given pool?

No. The number of spindles is constant. The snag is that for random reads, the performance of a raidz1/2 vdev is essentially that of a single disk. (The writes are fast because they're always full-stripe; but so are the reads.) So your effective random read performance is that of a single disk times the number of raidz vdevs.

It's a tradeoff, as in all things. Fewer vdevs means less wasted space, but lower performance.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
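Peter's rule of thumb is easy to put into rough numbers. A minimal sketch; the ~100 IOPS per disk figure is an assumption for a typical 7200rpm SATA drive, not a measurement:

    # Random-read IOPS scales with the number of raidz vdevs, not the number of disks.
    iops_per_disk=100     # assumed per-spindle random-read rate
    vdevs=5               # e.g. the 5x (8+1) raidz1 layout described above
    echo "~$((iops_per_disk * vdevs)) random-read IOPS for the whole pool"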
gm_sjo wrote:
> Are you not in fact losing performance by reducing the
> number of spindles used for a given pool?

This depends. Usually, RAIDZ1/2 isn't a good performer when it comes to random-access read I/O, for instance. If I wanted to scale performance by adding spindles, I would use mirrors (RAID 10). If you want to scale filesystem sizes, RAIDZ is your friend.

I once had the problem that I needed high random I/O performance and at least an 11 TB filesystem on an X4500. Mirroring was out of the question (not enough disk space left), and RAIDZ gave me only about 25% of the performance of the existing Linux ext2 boxes I had to compete with. But in the end, striping 13 RAIDZ sets of 3 drives each + 1 hot spare delivered acceptable results in both categories. But it took me a lot of benchmarks to get there.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
Nils Goroll
2008-Sep-18 10:09 UTC
[zfs-discuss] [storage-discuss] A few questions : RAID set width
Hi all,

Ben Rockwood wrote:
> You want to keep stripes wide to reduce wasted disk space.... but you
> also want to keep them narrow to reduce the elements involved in parity
> calculation.

I second Ben's argument, and the main point IMHO is how the RAID behaves in the degraded state. When a disk fails, that disk's data has to be reconstructed by reading from ALL the other disks of the RAID set. Effectively, in the degraded case, N disks of a RAID are reduced to the performance of one disk only.

Also, this situation lasts until the RAID is reconstructed after replacing the failed disk, which is an argument for not using too-large disks (see another thread on this list).

Nils
Nils Goroll
2008-Sep-18 10:15 UTC
[zfs-discuss] [storage-discuss] A few questions - small read I/O performance on RAIDZ
Hi Peter,

Sorry, I read your post only after posting a reply myself.

Peter Tribble wrote:
> No. The number of spindles is constant. The snag is that for random reads,
> the performance of a raidz1/2 vdev is essentially that of a single disk. (The
> writes are fast because they're always full-stripe; but so are the reads.)

Can you elaborate on this?

My understanding is that with RAIDZ the writes are always full-stripe for as much data as can be agglomerated into a single contiguous write, but I thought this did not imply that all of the data has to be read at once, except with a degraded RAID.

What about, for instance, writing 16MB chunks and reading 8K randomly? Wouldn't RAIDZ access only the disks containing the 8K bits?

Nils
Robert Milkowski
2008-Sep-18 10:53 UTC
[zfs-discuss] [storage-discuss] A few questions - small read I/O performance on RAIDZ
Hello Nils,

Thursday, September 18, 2008, 11:15:37 AM, you wrote:

NG> Can you elaborate on this?

NG> My understanding is that with RAIDZ the writes are always full-stripe for as
NG> much data as can be agglomerated into a single contiguous write, but I thought
NG> this did not imply that all of the data has to be read at once, except with a
NG> degraded RAID.

NG> What about, for instance, writing 16MB chunks and reading 8K randomly? Wouldn't
NG> RAIDZ access only the disks containing the 8K bits?

Basically, the way RAID-Z works is that it spreads each FS block across all disks in a given vdev (minus the parity disks). When you read data back, before it gets to the application ZFS will verify its checksum (the filesystem checksum, not a RAID-Z one), so it needs the entire FS block... which is spread across all data disks in the vdev.

--
Best regards,
Robert Milkowski                          mailto:milek at task.gda.pl
                                          http://milek.blogspot.com
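A quick bit of arithmetic makes this point concrete: even a small logical read touches every data disk in the vdev, because each record is split across all of them. The numbers below are only an illustration (one 128K record on an 8-disk raidz1; real layouts also round to sector boundaries and add parity sectors):

    recordsize=$((128 * 1024))   # one default-sized ZFS record, in bytes
    data_disks=7                 # 8-disk raidz1 = 7 data disks + 1 parity
    echo "each data disk holds roughly $((recordsize / data_disks)) bytes of this record"
    # Reading the record back - even to satisfy an 8K request - means reading
    # (and checksumming) pieces from all 7 data disks.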
Hi Robert,

> Basically, the way RAID-Z works is that it spreads each FS block across all
> disks in a given vdev (minus the parity disks). When you read data back,
> before it gets to the application ZFS will verify its checksum (the
> filesystem checksum, not a RAID-Z one), so it needs the entire FS block...
> which is spread across all data disks in the vdev.

Thank you very much for correcting my long-time misconception.

On the other hand, isn't there room for improvement here? If it were possible to break large writes into smaller blocks with individual checksums (for instance, those which are larger than a preferred_read_size parameter), we could still write all of these with a single RAIDZ(2) line, avoid the RAIDx write penalty, and improve read performance, because we'd only need to issue a single read I/O for each requested block - needing to access the full RAIDZ line only in the degraded RAID case.

I think that this could make a big difference for write-once, read-many, random-access-type applications like DSS systems etc.

Is this feasible at all?

Nils
On Thu, 18 Sep 2008, Nils Goroll wrote:
> On the other hand, isn't there room for improvement here? If it were possible to
> break large writes into smaller blocks with individual checksums (for instance,
> those which are larger than a preferred_read_size parameter), we could still
> write all of these with a single RAIDZ(2) line, avoid the RAIDx write penalty,
> and improve read performance, because we'd only need to issue a single read I/O
> for each requested block - needing to access the full RAIDZ line only in the
> degraded RAID case.
>
> I think that this could make a big difference for write-once, read-many, random-
> access-type applications like DSS systems etc.

I imagine that this is indeed possible, but that the law of diminishing returns would prevail. The level of per-block overhead would become much greater, so sequential throughput would be reduced and more disk space would be wasted. You can be sure that the ZFS inventors thoroughly explored all of these issues, and it would surprise me if someone didn't prototype it to see how it actually performs.

ZFS is designed for the present and the future. Legacy filesystems were designed for the past. In the present, the cost of memory is dramatically reduced, and in the future it will be even more so. This means that systems will contain massive cache RAM, which dramatically reduces the number of read (and write) accesses. Also, solid state disks (SSDs) will eventually become common, and SSDs don't exhibit a seek penalty, so designing the filesystem to avoid seeks does not carry over into the long-term future.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Thu, Sep 18, 2008 at 01:26:09PM +0200, Nils Goroll wrote:
> Thank you very much for correcting my long-time misconception.
>
> On the other hand, isn't there room for improvement here? If it were
> possible to break large writes into smaller blocks with individual
> checksums (for instance, those which are larger than a
> preferred_read_size parameter), we could still write all of these with
> a single RAIDZ(2) line, avoid the RAIDx write penalty, and improve read
> performance, because we'd only need to issue a single read I/O for each
> requested block - needing to access the full RAIDZ line only in the
> degraded RAID case.

Don't forget that the parent block contains the checksum so that it can be compared. There isn't room in the parent for an arbitrary number of checksums, as would be required with an arbitrary number of columns.

--
Darren
Nils Goroll wrote:
> Hi Robert,
>
>> Basically, the way RAID-Z works is that it spreads each FS block across all
>> disks in a given vdev (minus the parity disks). When you read data back,
>> before it gets to the application ZFS will verify its checksum (the
>> filesystem checksum, not a RAID-Z one), so it needs the entire FS block...
>> which is spread across all data disks in the vdev.
>
> Thank you very much for correcting my long-time misconception.
>
> On the other hand, isn't there room for improvement here? If it were possible to
> break large writes into smaller blocks with individual checksums (for instance,
> those which are larger than a preferred_read_size parameter), we could still
> write all of these with a single RAIDZ(2) line, avoid the RAIDx write penalty,
> and improve read performance, because we'd only need to issue a single read I/O
> for each requested block - needing to access the full RAIDZ line only in the
> degraded RAID case.
>
> I think that this could make a big difference for write-once, read-many, random-
> access-type applications like DSS systems etc.
>
> Is this feasible at all?

Someone in the community was supposedly working on this, at one time. It gets brought up about every 4-5 months or so. Lots of detail in the archives.

-- richard
Hi Richard,

> Someone in the community was supposedly working on this, at one
> time. It gets brought up about every 4-5 months or so. Lots of detail
> in the archives.

Thank you for the pointer, and sorry for the noise. I will definitely browse the archives to find out more regarding this question.

Bob and Darren, thank you as well for your comments. I don't expect it to be easy to optimize RAIDZ for random read I/O, but I do not agree with the argument that caching heals all I/O problems. Yes, it would be desirable to always have so large a cache as to eliminate almost all read I/O, but those of us who are responsible for deploying such systems know that physical read I/O performance does matter - for random access patterns in particular, but also for more sequential access patterns on ZFS, due to the inherent fragmentation that comes with COW (plus its relevance for cache warm-up times etc.).

In short, I consider this optimization approach worth exploring, but I don't think I'll be able to do it myself. I would appreciate any pointers to background information regarding this question.

Thank you,

Nils