Hi there,

Richard's blog post (http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance) got me thinking. I posted a comment but it got mangled, and I'm wondering if I got it right, so I'm reposting here:

Just to make sure I have things right:

Given (by the ZFS layer) a block D of data to store, RAID-Z will first split the block into several smaller blocks D_1..D_n as needed and calculate the parity block P from those. (n is the stripe width for this write.)

Then D_1..D_n and P are written to separate disks.

When you request data from block D, ZFS has to read D_1..D_n and calculate the checksum over all of them to ensure data integrity, taking up bandwidth on those n disks.

This makes the read performance of a RAID-Z pool the same as that of a single disk, even if you only needed a small read from block D.

RAID-Z as it stands now in fact reinforces the layering principle behind "traditional" filesystem+volume manager systems.

Now consider ditto blocks. They implement a flexible RAID-1 mirror by merging the filesystem and volume layers.

=> What if ZFS had parity blocks?

Try this scenario:

Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.

When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.

That would implement something akin to RAID-5 storage, but "the ZFS way", merging the filesystem and volume layers.

The read performance would be the same as that of ditto blocks.

On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.

If an application asks for fsync(), a smaller stripe could be written.

No RAID-5 write hole, RAID-1 read performance, tunable redundancy.

Am I missing something? Is there something that would break this scenario?

Cheers,

Wout.
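To make the two layouts easier to compare, here is a small sketch in plain Python -- not ZFS code. XOR parity and SHA-256 stand in for ZFS's real parity and checksum machinery, and the block sizes and stripe width are made up for illustration.

import hashlib
from functools import reduce

def xor_parity(blocks):
    # Byte-wise XOR of equally sized blocks -> one parity block.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# --- RAID-Z today: one filesystem block D is split into columns D_1..D_n ---
D = bytes(range(256)) * 4                    # one 1 KiB filesystem block
n = 4                                        # stripe width for this write
cols = [D[i::n] for i in range(n)]           # D_1..D_n (interleaved, for the example)
P = xor_parity(cols)                         # parity column
checksum_D = hashlib.sha256(D).hexdigest()   # ONE checksum, over the whole block D

# Verifying anything requires reading every column and reassembling D first:
reassembled = bytes(b for group in zip(*cols) for b in group)
assert hashlib.sha256(reassembled).hexdigest() == checksum_D

# --- Proposed "parity blocks": unrelated full ZFS blocks share one stripe ---
blocks = [bytes([i]) * 1024 for i in range(3)]               # 3 independent blocks
stripe_parity = xor_parity(blocks)                           # one parity block
checksums = [hashlib.sha256(b).hexdigest() for b in blocks]  # one checksum EACH

# A small read touches a single disk and can still be verified on its own:
assert hashlib.sha256(blocks[1]).hexdigest() == checksums[1]

# If that block turns out to be corrupt, rebuild it from its peers plus parity:
rebuilt = xor_parity([blocks[0], blocks[2], stripe_parity])
assert rebuilt == blocks[1]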
> Just to make sure I have things right:
>
> Given (by the ZFS layer) a block D of data to store, RAID-Z will first split the block into several smaller blocks D_1..D_n as needed and calculate the parity block P from those. (n is the stripe width for this write.)
>
> Then D_1..D_n and P are written to separate disks.

Seems right.

> When you request data from block D, ZFS has to read D_1..D_n and calculate the checksum over all of them to ensure data integrity, taking up bandwidth on those n disks.
>
> This makes the read performance of a RAID-Z pool the same as that of a single disk, even if you only needed a small read from block D.
>
> RAID-Z as it stands now in fact reinforces the layering principle behind "traditional" filesystem+volume manager systems.
>
> Now consider ditto blocks. They implement a flexible RAID-1 mirror by merging the filesystem and volume layers.
>
> => What if ZFS had parity blocks?
>
> Try this scenario:
>
> Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
>
> When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.

This seems identical to what raidz is doing, but you're somehow able to verify the contents of an individual disk block, something that you cannot do with the current raidz. Presumably if there were room for multiple checksums in the parent pointer, you could checksum the individual raidz disk blocks rather than only the filesystem block.

With ditto blocks, this isn't a problem. The contents of each block are identical, so the checksum doesn't have to be stored individually.

> That would implement something akin to RAID-5 storage, but "the ZFS way", merging the filesystem and volume layers.
>
> The read performance would be the same as that of ditto blocks.
>
> On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.

Isn't that true today?

> If an application asks for fsync(), a smaller stripe could be written.
>
> No RAID-5 write hole, RAID-1 read performance, tunable redundancy.
>
> Am I missing something? Is there something that would break this scenario?

How do you checksum the data to verify that the data you're reading from the single block is valid?

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
       < This line left intentionally blank to confuse you. >
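Darren's question comes down to where a checksum could live. The sketch below is hypothetical -- neither structure is ZFS's real blkptr layout -- but it shows the point: with one checksum per filesystem block, a single raidz column cannot be verified on its own; a parent pointer with room for per-column checksums could do it, at the cost of storing more checksums.

import hashlib
from dataclasses import dataclass, field
from typing import List

def cksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

@dataclass
class BlkptrToday:                 # simplified: one checksum per filesystem block;
    dvas: List[int]                # ditto copies hold identical data, so this
    checksum: str                  # single checksum covers every copy

@dataclass
class BlkptrPerColumn:             # hypothetical: room for per-column checksums
    dvas: List[int]
    checksum: str                  # still the whole-block checksum
    column_checksums: List[str] = field(default_factory=list)

columns = [bytes([i]) * 512 for i in range(3)]       # the raidz data columns
bp = BlkptrPerColumn(dvas=[0],
                     checksum=cksum(b"".join(columns)),
                     column_checksums=[cksum(c) for c in columns])

# With per-column checksums, one column read from one disk is verifiable alone:
assert cksum(columns[1]) == bp.column_checksums[1]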
Wout Mertens wrote:

> Hi there,
>
> Richard's blog post (http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance) got me thinking. I posted a comment but it got mangled, and I'm wondering if I got it right, so I'm reposting here:

Wout, I'm glad you started the thread here, I'll add a pointer from the blog comments.

> Just to make sure I have things right:
>
> Given (by the ZFS layer) a block D of data to store, RAID-Z will first split the block into several smaller blocks D_1..D_n as needed and calculate the parity block P from those. (n is the stripe width for this write.)
>
> Then D_1..D_n and P are written to separate disks.
>
> When you request data from block D, ZFS has to read D_1..D_n and calculate the checksum over all of them to ensure data integrity, taking up bandwidth on those n disks.
>
> This makes the read performance of a RAID-Z pool the same as that of a single disk, even if you only needed a small read from block D.
>
> RAID-Z as it stands now in fact reinforces the layering principle behind "traditional" filesystem+volume manager systems.
>
> Now consider ditto blocks. They implement a flexible RAID-1 mirror by merging the filesystem and volume layers.
>
> => What if ZFS had parity blocks?
>
> Try this scenario:
>
> Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
>
> When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.
>
> That would implement something akin to RAID-5 storage, but "the ZFS way", merging the filesystem and volume layers.
>
> The read performance would be the same as that of ditto blocks.
>
> On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.

Don't confuse bandwidth with iops. The model I used was for small, random iops where we are not bandwidth constrained, but we are seek constrained. Bandwidth should scale with the number of data disks.

-- richard
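Richard's distinction can be put into rough numbers. The per-disk figures below are invented for illustration (roughly what a 7200 rpm drive manages) and are not taken from his model; they only show why a raidz vdev looks like one disk for small random reads but like all of its data disks for streaming reads.

disks     = 5          # raidz: 4 data + 1 parity (assumed)
disk_iops = 100        # small random reads per second, per disk (assumed)
disk_mbps = 80         # streaming bandwidth per disk, MB/s (assumed)

# Small random reads: every filesystem block read touches all data disks,
# so the vdev delivers roughly the iops of a single disk.
raidz_random_iops = disk_iops                  # ~100, not ~500

# Streaming reads: all data columns are read in parallel, so bandwidth
# scales with the number of data disks.
raidz_stream_mbps = (disks - 1) * disk_mbps    # ~320 MB/s

# Per-block parity (or mirrors/ditto blocks): a small read touches one disk,
# so random iops scale with the number of disks instead.
per_block_random_iops = disks * disk_iops      # ~500

print(raidz_random_iops, raidz_stream_mbps, per_block_random_iops)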
> > => What if ZFS had parity blocks?
> >
> > Try this scenario:
> >
> > Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
> >
> > When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.
>
> This seems identical to what raidz is doing, but you're somehow able to verify the contents of an individual disk block, something that you cannot do with the current raidz. Presumably if there were room for multiple checksums in the parent pointer, you could checksum the individual raidz disk blocks rather than only the filesystem block.

Actually, no. RAID-Z is a zpool that takes the single block that the zfs layer is asking it to store, and splits it up across the storage.

In this scenario, however, it's the zfs layer that generates the RAID layout, just like with ditto blocks.

So each block that you store is a regular zfs block that has a checksum. Take n-1 blocks of the same size, calculate a parity block, and store them on the n disks.

> With ditto blocks, this isn't a problem. The contents of each block are identical, so the checksum doesn't have to be stored individually.

Ok, so you do need to keep track of which parity block belongs to which stripe. Is that a big problem?

> > On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.
>
> Isn't that true today?

Well, I don't think so. The RAID-Z layer takes one block and stripes it as appropriate; it doesn't take multiple blocks and concatenate them, right?

Doing RAID in the ZFS layer seems like a much more ZFS way to do things...

Wout.
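To picture the bookkeeping being discussed here, a rough sketch under invented names (none of this exists in ZFS): the filesystem layer writes n-1 independent blocks plus a parity block to n different disks and keeps a per-stripe record tying them together -- which is exactly the extra state Wout asks about above.

from dataclasses import dataclass
from functools import reduce
from typing import List, Tuple

def xor_parity(blocks: List[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

@dataclass
class StripeRecord:                 # hypothetical stripe bookkeeping
    members: List[Tuple[int, int]]  # (disk, offset) of each data block
    parity:  Tuple[int, int]        # (disk, offset) of the parity block

class Pool:
    def __init__(self, ndisks: int):
        self.disks = [dict() for _ in range(ndisks)]   # disk -> {offset: block}
        self.next_off = [0] * ndisks
        self.stripes: List[StripeRecord] = []

    def write_stripe(self, data_blocks: List[bytes]) -> StripeRecord:
        # Needs at least one spare disk for the parity block.
        assert len(data_blocks) < len(self.disks)
        parity = xor_parity(data_blocks)
        placed = []
        # Naive placement: block i goes to disk i. A real allocator would
        # rotate parity across disks and balance free space.
        for disk, blk in enumerate(data_blocks + [parity]):
            off = self.next_off[disk]
            self.disks[disk][off] = blk
            self.next_off[disk] += len(blk)
            placed.append((disk, off))
        rec = StripeRecord(members=placed[:-1], parity=placed[-1])
        self.stripes.append(rec)
        return rec

# Three unrelated 4 KiB blocks share one stripe on a 4-disk pool:
pool = Pool(4)
rec = pool.write_stripe([bytes([i]) * 4096 for i in range(3)])
print(rec.members, rec.parity)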
On 1/31/07, Wout Mertens <wmertens at cisco.com> wrote:

> > > => What if ZFS had parity blocks?
> > >
> > > Try this scenario:
> > >
> > > Given data to store, that data is stored in regular ZFS blocks, and a parity block is calculated. The data and parity blocks are laid out across the available disks in the pool.
> > >
> > > When you need data from one of those blocks, only one block needs to be read to calculate its checksum. If it is corrupt, the other data blocks and the parity block can be used to recreate its contents.
> >
> > This seems identical to what raidz is doing, but you're somehow able to verify the contents of an individual disk block, something that you cannot do with the current raidz. Presumably if there were room for multiple checksums in the parent pointer, you could checksum the individual raidz disk blocks rather than only the filesystem block.
>
> Actually, no. RAID-Z is a zpool that takes the single block that the zfs layer is asking it to store, and splits it up across the storage.
>
> In this scenario, however, it's the zfs layer that generates the RAID layout, just like with ditto blocks.
>
> So each block that you store is a regular zfs block that has a checksum. Take n-1 blocks of the same size, calculate a parity block, and store them on the n disks.

Indeed. This provides a clean interface where the vdev is asked to fetch/store blocks based on an offset and size.

> > With ditto blocks, this isn't a problem. The contents of each block are identical, so the checksum doesn't have to be stored individually.
>
> Ok, so you do need to keep track of which parity block belongs to which stripe. Is that a big problem?

I think it is. If you consider the read-modify-write cycle, the first write can be easy -- all the files you're writing can be concurrently streamed to separate disks and the parity calculated and written. When it comes to modifying an existing block, however, you will now have to both write new data and parity and update all the parity block pointers before the old data location and parity location can be marked as free. Currently a block-aligned write need not read; with this there is an extra two-read penalty for writes, plus the extra writes for metadata updates for the other columns in the 'stripe'. You must also ensure that parity and data are written on the correct RAID-Z disks (such that you can still recover from a full disk failure).

> > > On top of that, you would be able to store unrelated data blocks in the same stripe, thus ensuring maximum disk bandwidth and minimum parity size overhead.
> >
> > Isn't that true today?
>
> Well, I don't think so. The RAID-Z layer takes one block and stripes it as appropriate; it doesn't take multiple blocks and concatenate them, right?

Well, the RAID-Z layer is given an offset + size of the data to be read/written. It turns these into offsets on the leaf vdevs with a touch of magic, and the required number of blocks are written to each vdev (plus some extra if the size is not a multiple of cols - parity). I think Richard's point was that although random reading may result in poor IO throughput, sustained reads have the bandwidth of all the disks.

> Doing RAID in the ZFS layer seems like a much more ZFS way to do things...

I've been doing a little bit of hacking in the ZFS raid-z vdev code, and it's striking how clean and modular the design is. Your suggestion would probably require touching large portions of the fs.
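James's read-modify-write concern, sketched with XOR parity (my assumption; the arithmetic is the standard RAID-5 update, not anything taken from the raidz code): rewriting one member of a shared stripe costs two extra reads before the new parity can be computed, and in a copy-on-write pool every pointer to the stripe's parity has to be updated before the old locations can be freed.

from functools import reduce

def xor(*blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

old_members = [bytes([i]) * 4096 for i in range(3)]
old_parity  = xor(*old_members)

new_block = bytes([9]) * 4096     # new contents for member 1

# Two extra reads that a plain block-aligned copy-on-write write would not need:
read_old_data   = old_members[1]  # read 1: the block's old contents
read_old_parity = old_parity      # read 2: the stripe's old parity

# New parity without touching the other members:
new_parity = xor(read_old_parity, read_old_data, new_block)
assert new_parity == xor(old_members[0], new_block, old_members[2])

# Writes: new data and new parity go to fresh locations, and the bookkeeping
# of every stripe member must be updated to point at the new parity block
# before the old data and old parity locations can be marked free.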
Even if it were to increase concurrent read performance, it would do so at a greater computational and space overhead. You would now have to maintain a structure of parity locations and their associated data blocks, as well as free lists for each individual leaf vdev, plus the extra space for parity in the block pointer.

Unless, of course, I'm missing something...

James
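A back-of-the-envelope estimate of that overhead, with every figure invented purely for illustration:

blocks_in_pool = 10_000_000     # ~1.2 TiB of 128 KiB blocks (assumed)
stripe_width   = 4              # 3 data blocks + 1 parity block (assumed)
entry_bytes    = 128            # assumed size of one location entry in the map

stripes          = blocks_in_pool // (stripe_width - 1)
parity_map_bytes = stripes * stripe_width * entry_bytes
print(f"parity map: ~{parity_map_bytes / 2**30:.1f} GiB of extra metadata")

# On top of that come the parity blocks themselves (1/(stripe_width-1), i.e.
# ~33% extra space at this width) and per-leaf-vdev free lists that grow with
# fragmentation.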