Hi there,

Richard's blog post (http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance) got me thinking. I posted a comment but it got mangled, and I'm wondering if I got it right, so I'm reposting here:

Just to make sure I have things right:

Given (by the ZFS layer) a block D of data to store, RAID-Z will first split the block into several smaller blocks D_1..D_n as needed and calculate the parity block P from those. (n is the stripe width for this write.)

Then D_1..D_n and P are written to separate disks.

When you request data from block D, ZFS has to read D_1..D_n and calculate the checksum over all of them to ensure data integrity, taking up bandwidth on those n disks.

This makes the read performance of a RAID-Z pool the same as that of a single disk, even if you only needed a small read from block D.

RAID-Z as it stands now in fact reinforces the layering principle behind "traditional" filesystem+volume manager systems.

Now consider ditto blocks. They implement a flexible RAID-1 mirror by merging the filesystem and volume layers.

=> What if ZFS had parity blocks?

Try this scenario:

Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.

When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.

That would implement something akin to RAID-5 storage, but "the ZFS way", merging the filesystem and volume layers.

The read performance would be the same as that of ditto blocks.

On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.

If an application asks for fsync(), a smaller stripe could be written.

No RAID-5 write hole, RAID-1 read performance, tunable redundancy.

Am I missing something? Is there something that would break this scenario?

Cheers,

Wout.
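To make the two layouts easier to compare, here is a small sketch in plain Python -- not ZFS code. XOR parity and SHA-256 stand in for ZFS's real parity and checksum machinery, and the block sizes and stripe width are made up for illustration.

import hashlib
from functools import reduce

def xor_parity(blocks):
    # Byte-wise XOR of equally sized blocks -> one parity block.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# --- RAID-Z today: one filesystem block D is split into columns D_1..D_n ---
D = bytes(range(256)) * 4                    # one 1 KiB filesystem block
n = 4                                        # stripe width for this write
cols = [D[i::n] for i in range(n)]           # D_1..D_n (interleaved, for the example)
P = xor_parity(cols)                         # parity column
checksum_D = hashlib.sha256(D).hexdigest()   # ONE checksum, over the whole block D

# Verifying anything requires reading every column and reassembling D first:
reassembled = bytes(b for group in zip(*cols) for b in group)
assert hashlib.sha256(reassembled).hexdigest() == checksum_D

# --- Proposed "parity blocks": unrelated full ZFS blocks share one stripe ---
blocks = [bytes([i]) * 1024 for i in range(3)]               # 3 independent blocks
stripe_parity = xor_parity(blocks)                           # one parity block
checksums = [hashlib.sha256(b).hexdigest() for b in blocks]  # one checksum EACH

# A small read touches a single disk and can still be verified on its own:
assert hashlib.sha256(blocks[1]).hexdigest() == checksums[1]

# If that block turns out to be corrupt, rebuild it from its peers plus parity:
rebuilt = xor_parity([blocks[0], blocks[2], stripe_parity])
assert rebuilt == blocks[1]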
> Just to make sure I have things right:
>
> Given (by the ZFS layer) a block D of data to store, RAID-Z will first split the block into several smaller blocks D_1..D_n as needed and calculate the parity block P from those. (n is the stripe width for this write.)
>
> Then D_1..D_n and P are written to separate disks.

Seems right.

> When you request data from block D, ZFS has to read D_1..D_n and calculate the checksum over all of them to ensure data integrity, taking up bandwidth on those n disks.
>
> This makes the read performance of a RAID-Z pool the same as that of a single disk, even if you only needed a small read from block D.
>
> RAID-Z as it stands now in fact reinforces the layering principle behind "traditional" filesystem+volume manager systems.
>
> Now consider ditto blocks. They implement a flexible RAID-1 mirror by merging the filesystem and volume layers.
>
> => What if ZFS had parity blocks?
>
> Try this scenario:
>
> Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
>
> When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.

This seems identical to what raidz is doing, but you're somehow able to verify the contents of an individual disk block, something that you cannot do with the current raidz. Presumably if there were room for multiple checksums in the parent pointer, you could checksum the individual raidz disk blocks rather than only the filesystem block.

With ditto blocks, this isn't a problem. The contents of each block are identical, so the checksum doesn't have to be stored individually.

> That would implement something akin to RAID-5 storage, but "the ZFS way", merging the filesystem and volume layers.
>
> The read performance would be the same as that of ditto blocks.
>
> On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.

Isn't that true today?

> If an application asks for fsync(), a smaller stripe could be written.
>
> No RAID-5 write hole, RAID-1 read performance, tunable redundancy.
>
> Am I missing something? Is there something that would break this scenario?

How do you checksum the data to verify that the data you're reading from the single block is valid?

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
       < This line left intentionally blank to confuse you. >
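Darren's question comes down to where a checksum could live. The sketch below is hypothetical -- neither structure is ZFS's real blkptr layout -- but it shows the point: with one checksum per filesystem block, a single raidz column cannot be verified on its own; a parent pointer with room for per-column checksums could do it, at the cost of storing more checksums.

import hashlib
from dataclasses import dataclass, field
from typing import List

def cksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

@dataclass
class BlkptrToday:                 # simplified: one checksum per filesystem block;
    dvas: List[int]                # ditto copies hold identical data, so this
    checksum: str                  # single checksum covers every copy

@dataclass
class BlkptrPerColumn:             # hypothetical: room for per-column checksums
    dvas: List[int]
    checksum: str                  # still the whole-block checksum
    column_checksums: List[str] = field(default_factory=list)

columns = [bytes([i]) * 512 for i in range(3)]       # the raidz data columns
bp = BlkptrPerColumn(dvas=[0],
                     checksum=cksum(b"".join(columns)),
                     column_checksums=[cksum(c) for c in columns])

# With per-column checksums, one column read from one disk is verifiable alone:
assert cksum(columns[1]) == bp.column_checksums[1]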
Wout Mertens wrote:

> Hi there,
>
> Richard's blog post (http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance) got me thinking. I posted a comment but it got mangled, and I'm wondering if I got it right, so I'm reposting here:

Wout, I'm glad you started the thread here, I'll add a pointer from the blog comments.

> Just to make sure I have things right:
>
> Given (by the ZFS layer) a block D of data to store, RAID-Z will first split the block into several smaller blocks D_1..D_n as needed and calculate the parity block P from those. (n is the stripe width for this write.)
>
> Then D_1..D_n and P are written to separate disks.
>
> When you request data from block D, ZFS has to read D_1..D_n and calculate the checksum over all of them to ensure data integrity, taking up bandwidth on those n disks.
>
> This makes the read performance of a RAID-Z pool the same as that of a single disk, even if you only needed a small read from block D.
>
> RAID-Z as it stands now in fact reinforces the layering principle behind "traditional" filesystem+volume manager systems.
>
> Now consider ditto blocks. They implement a flexible RAID-1 mirror by merging the filesystem and volume layers.
>
> => What if ZFS had parity blocks?
>
> Try this scenario:
>
> Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
>
> When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.
>
> That would implement something akin to RAID-5 storage, but "the ZFS way", merging the filesystem and volume layers.
>
> The read performance would be the same as that of ditto blocks.
>
> On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.

Don't confuse bandwidth with iops. The model I used was for small, random iops where we are not bandwidth constrained, but we are seek constrained. Bandwidth should scale with the number of data disks.

-- richard
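Richard's distinction can be put into rough numbers. The per-disk figures below are invented for illustration (roughly what a 7200 rpm drive manages) and are not taken from his model; they only show why a raidz vdev looks like one disk for small random reads but like all of its data disks for streaming reads.

disks     = 5          # raidz: 4 data + 1 parity (assumed)
disk_iops = 100        # small random reads per second, per disk (assumed)
disk_mbps = 80         # streaming bandwidth per disk, MB/s (assumed)

# Small random reads: every filesystem block read touches all data disks,
# so the vdev delivers roughly the iops of a single disk.
raidz_random_iops = disk_iops                  # ~100, not ~500

# Streaming reads: all data columns are read in parallel, so bandwidth
# scales with the number of data disks.
raidz_stream_mbps = (disks - 1) * disk_mbps    # ~320 MB/s

# Per-block parity (or mirrors/ditto blocks): a small read touches one disk,
# so random iops scale with the number of disks instead.
per_block_random_iops = disks * disk_iops      # ~500

print(raidz_random_iops, raidz_stream_mbps, per_block_random_iops)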
> > => What if ZFS had parity blocks?
> >
> > Try this scenario:
> >
> > Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
> >
> > When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.
>
> This seems identical to what raidz is doing, but you're somehow able to verify the contents of an individual disk block, something that you cannot do with the current raidz. Presumably if there were room for multiple checksums in the parent pointer, you could checksum the individual raidz disk blocks rather than only the filesystem block.

Actually, no. RAID-Z is a zpool that takes the single block that the zfs layer is asking it to store, and splits it up across the storage.

In this scenario, however, it's the zfs layer that generates the RAID layout, just like with ditto blocks.

So each block that you store is a regular zfs block that has a checksum. Take n-1 blocks of the same size, calculate a parity block, and store them on the n disks.

> With ditto blocks, this isn't a problem. The contents of each block are identical, so the checksum doesn't have to be stored individually.

Ok, so you do need to keep track of which parity block belongs to which stripe. Is that a big problem?

> > On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.
>
> Isn't that true today?

Well, I don't think so. The RAID-Z layer takes one block and stripes it as appropriate; it doesn't take multiple blocks and concatenate them, right?

Doing RAID in the ZFS layer seems like a much more ZFS way to do things...

Wout.
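To picture the bookkeeping being discussed here, a rough sketch under invented names (none of this exists in ZFS): the filesystem layer writes n-1 independent blocks plus a parity block to n different disks and keeps a per-stripe record tying them together -- which is exactly the extra state Wout asks about above.

from dataclasses import dataclass
from functools import reduce
from typing import List, Tuple

def xor_parity(blocks: List[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

@dataclass
class StripeRecord:                 # hypothetical stripe bookkeeping
    members: List[Tuple[int, int]]  # (disk, offset) of each data block
    parity:  Tuple[int, int]        # (disk, offset) of the parity block

class Pool:
    def __init__(self, ndisks: int):
        self.disks = [dict() for _ in range(ndisks)]   # disk -> {offset: block}
        self.next_off = [0] * ndisks
        self.stripes: List[StripeRecord] = []

    def write_stripe(self, data_blocks: List[bytes]) -> StripeRecord:
        # Needs at least one spare disk for the parity block.
        assert len(data_blocks) < len(self.disks)
        parity = xor_parity(data_blocks)
        placed = []
        # Naive placement: block i goes to disk i. A real allocator would
        # rotate parity across disks and balance free space.
        for disk, blk in enumerate(data_blocks + [parity]):
            off = self.next_off[disk]
            self.disks[disk][off] = blk
            self.next_off[disk] += len(blk)
            placed.append((disk, off))
        rec = StripeRecord(members=placed[:-1], parity=placed[-1])
        self.stripes.append(rec)
        return rec

# Three unrelated 4 KiB blocks share one stripe on a 4-disk pool:
pool = Pool(4)
rec = pool.write_stripe([bytes([i]) * 4096 for i in range(3)])
print(rec.members, rec.parity)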
On 1/31/07, Wout Mertens <wmertens at cisco.com> wrote:

> > > => What if ZFS had parity blocks?
> > >
> > > Try this scenario:
> > >
> > > Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
> > >
> > > When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.
> >
> > This seems identical to what raidz is doing, but you're somehow able to verify the contents of an individual disk block, something that you cannot do with the current raidz. Presumably if there were room for multiple checksums in the parent pointer, you could checksum the individual raidz disk blocks rather than only the filesystem block.
>
> Actually, no. RAID-Z is a zpool that takes the single block that the zfs layer is asking it to store, and splits it up across the storage.
>
> In this scenario, however, it's the zfs layer that generates the RAID layout, just like with ditto blocks.
>
> So each block that you store is a regular zfs block that has a checksum. Take n-1 blocks of the same size, calculate a parity block, and store them on the n disks.

Indeed. This provides a clean interface where the vdev is asked to fetch/store blocks based on an offset and size.

> > With ditto blocks, this isn't a problem. The contents of each block are identical, so the checksum doesn't have to be stored individually.
>
> Ok, so you do need to keep track of which parity block belongs to which stripe. Is that a big problem?

I think it is. If you consider the read-modify-write cycle, the first write can be easy -- all the files you're writing can be concurrently streamed to separate disks and the parity calculated and written. When it comes to modifying an existing block, however, you will now have to both write new data and parity and update all the parity block pointers before the old data location and parity location can be marked as free. Currently a block-aligned write need not read; with this there is an extra two-read penalty for writes, plus the extra writes for metadata updates for the other columns in the 'stripe'. You must also ensure that parity and data are written on the correct RAID-Z disks (such that you can still recover from a full disk failure).

> > > On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.
> >
> > Isn't that true today?
>
> Well, I don't think so. The RAID-Z layer takes one block and stripes it as appropriate; it doesn't take multiple blocks and concatenate them, right?

Well, the RAID-Z layer is given an offset + size of the data to be read/written. It turns these into offsets on the leaf vdevs with a touch of magic, and the required number of blocks are written to each vdev (plus some extra if the size is not a multiple of cols - parity). I think Richard's point was that although random reading may result in poor IO throughput, sustained reads have the bandwidth of all the disks.

> Doing RAID in the ZFS layer seems like a much more ZFS way to do things...

I've been doing a little bit of hacking in the ZFS raid-z vdev code, and it's striking how clean and modular the design is. Your suggestion would probably require touching large portions of the fs.
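James's read-modify-write concern, sketched with XOR parity (my assumption; the arithmetic is the standard RAID-5 update, not anything taken from the raidz code): rewriting one member of a shared stripe costs two extra reads before the new parity can be computed, and in a copy-on-write pool every pointer to the stripe's parity has to be updated before the old locations can be freed.

from functools import reduce

def xor(*blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

old_members = [bytes([i]) * 4096 for i in range(3)]
old_parity  = xor(*old_members)

new_block = bytes([9]) * 4096     # new contents for member 1

# Two extra reads that a plain block-aligned copy-on-write write would not need:
read_old_data   = old_members[1]  # read 1: the block's old contents
read_old_parity = old_parity      # read 2: the stripe's old parity

# New parity without touching the other members:
new_parity = xor(read_old_parity, read_old_data, new_block)
assert new_parity == xor(old_members[0], new_block, old_members[2])

# Writes: new data and new parity go to fresh locations, and the bookkeeping
# of every stripe member must be updated to point at the new parity block
# before the old data and old parity locations can be marked free.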
Even if it were to increase concurrent read performance, it would do so at a greater computational and space overhead. You would now have to maintain a structure of parity locations and their associated data blocks, as well as free lists for each individual leaf vdev, plus the extra space for parity in the block pointer.

Unless, of course, I'm missing something...

James
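A back-of-the-envelope estimate of that overhead, with every figure invented purely for illustration:

blocks_in_pool = 10_000_000     # ~1.2 TiB of 128 KiB blocks (assumed)
stripe_width   = 4              # 3 data blocks + 1 parity block (assumed)
entry_bytes    = 128            # assumed size of one location entry in the map

stripes          = blocks_in_pool // (stripe_width - 1)
parity_map_bytes = stripes * stripe_width * entry_bytes
print(f"parity map: ~{parity_map_bytes / 2**30:.1f} GiB of extra metadata")

# On top of that come the parity blocks themselves (1/(stripe_width-1), i.e.
# ~33% extra space at this width) and per-leaf-vdev free lists that grow with
# fragmentation.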