Jason J. W. Williams wrote:
> Hello All,
>
> I was curious if anyone had run a benchmark on the IOPS performance of
> RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was
> curious what others had seen. Thank you in advance.

I've been using a simple model for small, random reads. In that model,
the performance of a raidz[12] set will be approximately equal to a single
disk. For example, if you have 6 disks, then the performance for the
6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
dynamic stripe of 2-way mirrors will have a normalized performance of 6.
I'd be very interested to see if your results concur.

The models for writes or large reads are much more complicated because
of the numerous caches of varying size and policy throughout the system.
The small, random read workload will be largely unaffected by caches and
you should see the performance as predicted by the disk rpm and seek time.
 -- richard
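To put rough numbers on that model (the ~125 IOPS per disk below is an
assumed figure for a 7,200 rpm SATA drive, not a measurement from the
Thumper):

    6-disk raidz2 set:              ~1 x 125 = ~125 small random read IOPS
    3-way stripe of 2-way mirrors:  ~6 x 125 = ~750 small random read IOPS

In other words, the model predicts roughly a 6x difference for this
workload on the same six disks.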
Hi Richard,

Hmm... that's interesting. I wonder if it's worth benchmarking RAIDZ2 if
those are the results you're getting. The testing is to see the
performance gain we might get for MySQL moving off the FLX210 to an
active/passive pair of X4500s. Was hoping with that many SATA disks
RAIDZ2 would provide a nice safety net.

Best Regards,
Jason
Just got an interesting benchmark. I made two zpools:

RAID-10  (9x 2-way RAID-1 mirrors: 18 disks total)
RAID-Z2  (3x 6-way RAIDZ2 groups: 18 disks total)

Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307 seconds.
Deleted the data from the RAID-Z2. Then copying the 38.4GB of data from
the RAID-10 to the RAID-Z2 took 258 seconds. I would have expected the
RAID-10 to write data more quickly.

It's interesting to me that the RAID-10 pool registered the 38.4GB of
data as 38.4GB, whereas the RAID-Z2 registered it as 56.4GB.

Best Regards,
Jason
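For reference, assuming the writing pool was the bottleneck in each copy,
those times work out to roughly:

    RAID-Z2 -> RAID-10 (RAID-10 writing):  38.4 GB / 307 s  =  ~128 MB/s
    RAID-10 -> RAID-Z2 (RAID-Z2 writing):  38.4 GB / 258 s  =  ~152 MB/s

so the RAID-Z2 pool absorbed the writes about 20% faster in this test.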
> I've been using a simple model for small, random reads. In that model,
> the performance of a raidz[12] set will be approximately equal to a single
> disk. For example, if you have 6 disks, then the performance for the
> 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
> dynamic stripe of 2-way mirrors will have a normalized performance of 6.
> I'd be very interested to see if your results concur.

Is this expected behavior? Assuming concurrent reads (not synchronous and
sequential) I would naively expect an n-disk raidz2 pool to have a
normalized performance of n for small reads.

Is there some reason why a small read on a raidz2 is not statistically
very likely to require I/O on only one device? Assuming a non-degraded
pool of course.

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Hello Jason,

Wednesday, January 3, 2007, 11:11:31 PM, you wrote:

JJWW> Hmm... that's interesting. I wonder if it's worth benchmarking RAIDZ2
JJWW> if those are the results you're getting. The testing is to see the
JJWW> performance gain we might get for MySQL moving off the FLX210 to an
JJWW> active/passive pair of X4500s. Was hoping with that many SATA disks
JJWW> RAIDZ2 would provide a nice safety net.

Well, you weren't thinking about one big raidz2 group?

To get more performance you can create one pool with many smaller raidz2
groups - that way your worst-case read performance should increase
approximately N times, where N is the number of raidz2 groups.

However, keep in mind that write performance should be really good with
raidz2.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hi Robert,

Our X4500 configuration is multiple 6-way (across controllers) RAID-Z2
groups striped together. Currently, 3 RZ2 groups. I'm about to test write
performance against ZFS RAID-10. I'm curious why RAID-Z2 performance
should be good? I assumed it was an analog to RAID-6. In our recent
experience RAID-5 due to the 2 reads, an XOR calc and a write op per
write instruction is usually much slower than RAID-10 (two write ops).
Any advice is greatly appreciated.

Best Regards,
Jason
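As a concrete sketch, a pool like the one Jason describes (three 6-disk
raidz2 groups striped into one pool) would be created along these lines -
the pool name and device names are placeholders, not the real X4500
controller/target mapping:

    zpool create tank \
        raidz2 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
        raidz2 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
        raidz2 c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0

ZFS dynamically stripes writes across the three raidz2 vdevs; there is no
separate "stripe" step.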
Hello Jason,

Wednesday, January 3, 2007, 11:40:38 PM, you wrote:

JJWW> Just got an interesting benchmark. I made two zpools:

JJWW> RAID-10  (9x 2-way RAID-1 mirrors: 18 disks total)
JJWW> RAID-Z2  (3x 6-way RAIDZ2 groups: 18 disks total)

JJWW> Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307
JJWW> seconds. Deleted the data from the RAID-Z2. Then copying the 38.4GB
JJWW> of data from the RAID-10 to the RAID-Z2 took 258 seconds. I would
JJWW> have expected the RAID-10 to write data more quickly.

Actually, with 18 disks in raid-10, in theory you get write performance
equal to a stripe of 9 disks. With 18 disks in 3 raidz2 groups of 6 disks
each, you should expect something like (6-2)*3 = 12 disks, so write
performance equal to 12 disks in a stripe.

JJWW> It's interesting to me that the RAID-10 pool registered the 38.4GB
JJWW> of data as 38.4GB, whereas the RAID-Z2 registered it as 56.4GB.

If you checked with zpool - then it's "ok" - it reports disk usage with
the parity overhead included. If zfs list showed you those numbers, then
either you're using old snv bits or s10U2, as this was corrected some
time ago (in U3).

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
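That theory lines up reasonably well with the copy times above: the model
predicts about a 12:9 (1.33x) write advantage for the raidz2 layout, and
the measured 307 s vs. 258 s is about a 1.19x advantage - the right
direction, with the gap plausibly narrowed by the read side of the copy
and by caching. The reported capacity also roughly fits the parity
overhead: 38.4GB x 6/4 = 57.6GB of raw space consumed for data plus two
parity columns per 6-disk group.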
Hi Robert,

That makes sense. Thank you. :-) Also, it was zpool I was looking at.
zfs always showed the correct size.

-J
Hello Peter,

Thursday, January 4, 2007, 1:12:47 AM, you wrote:

>> I've been using a simple model for small, random reads. In that model,
>> the performance of a raidz[12] set will be approximately equal to a single
>> disk. For example, if you have 6 disks, then the performance for the
>> 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
>> dynamic stripe of 2-way mirrors will have a normalized performance of 6.
>> I'd be very interested to see if your results concur.

PS> Is this expected behavior? Assuming concurrent reads (not synchronous
PS> and sequential) I would naively expect an n-disk raidz2 pool to have a
PS> normalized performance of n for small reads.

PS> Is there some reason why a small read on a raidz2 is not statistically
PS> very likely to require I/O on only one device? Assuming a non-degraded
PS> pool of course.

Unfortunately, there is. With raid-z1 and raid-z2 there's no free lunch.
You get excellent write performance (better than raid-10), however read
performance for small I/Os will suffer. That's because in the raid-z[12]
case each logical file system block is spread to all disks (minus the
parity disks), so in order to read just one block you have to get data
from all disks in a raid-z[12] group. This is not something many people
would expect, knowing traditional RAID. It's not the case with striping
and raid-1[0] in ZFS.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
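As a rough illustration (ignoring padding, assuming the default 128K
recordsize, and with the parity placement simplified - it actually
rotates), a single file system block in a 6-disk raidz2 group is laid out
something like this, so even a 2K application read touches four disks:

    disk1     disk2     disk3    disk4    disk5    disk6
    P         Q         32K      32K      32K      32K     <- one 128K fs block
    (parity)  (parity)  (data)   (data)   (data)   (data)

A mirror, by contrast, keeps each 128K block whole on each side, so a
small read needs only one disk.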
Hello Jason,

Thursday, January 4, 2007, 1:55:02 AM, you wrote:

JJWW> Our X4500 configuration is multiple 6-way (across controllers) RAID-Z2
JJWW> groups striped together. Currently, 3 RZ2 groups. I'm about to test
JJWW> write performance against ZFS RAID-10. I'm curious why RAID-Z2
JJWW> performance should be good? I assumed it was an analog to RAID-6. In
JJWW> our recent experience RAID-5 due to the 2 reads, an XOR calc and a
JJWW> write op per write instruction is usually much slower than RAID-10
JJWW> (two write ops). Any advice is greatly appreciated.

I'm not going to describe it again - it was explained well here before.
However, one simple query to Google finds:

http://blogs.sun.com/roch/entry/when_to_and_not_to

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hi Robert,

I've read that paper. Thank you for the condescension.

-J
On Jan 3, 2007, at 19:55, Jason J. W. Williams wrote:

> performance should be good? I assumed it was an analog to RAID-6. In
> our recent experience RAID-5 due to the 2 reads, an XOR calc and a
> write op per write instruction is usually much slower than RAID-10
> (two write ops). Any advice is greatly appreciated.

RAIDZ and RAIDZ2 do not suffer from this malady (the RAID5 write hole).
This is explained nicely in segment four of the ZFS video (at about time
2:30):

http://www.sun.com/software/solaris/zfs_learning_center.jsp
>> In our recent experience RAID-5 due to the 2 reads, an XOR calc and a
>> write op per write instruction is usually much slower than RAID-10
>> (two write ops). Any advice is greatly appreciated.
>
> RAIDZ and RAIDZ2 do not suffer from this malady (the RAID5 write hole).

1. This isn't the "write hole".

2. RAIDZ and RAIDZ2 suffer from read-modify-write overhead when updating
a file in writes of less than 128K, but not when writing a new file or
issuing large writes.
> Is there some reason why a small read on a raidz2 is not statistically
> very likely to require I/O on only one device? Assuming a non-degraded
> pool of course.

ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks
must be read to compute and verify the checksum.
Jason J. W. Williams | 2007-Jan-04 03:08 UTC | [zfs-discuss] Re: Re[2]: RAIDZ2 vs. ZFS RAID-10
Hi Anton,

Thank you for the information. That is exactly our scenario. We're 70%
write heavy, and given the nature of the workload, our typical writes are
10-20K. Again, the information is much appreciated.

Best Regards,
Jason
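Given that write profile, one ZFS knob that usually comes up for MySQL on
ZFS is the dataset recordsize, so a 10-20K update does not have to
read-modify-write a whole 128K record. A minimal sketch - the dataset
name is a placeholder, 16K assumes an InnoDB-style page size and would
need to match the actual engine, and the setting only affects files
created after it is changed:

    zfs create tank/mysql
    zfs set recordsize=16k tank/mysql
    zfs get recordsize tank/mysql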
>> Is there some reason why a small read on a raidz2 is not statistically
>> very likely to require I/O on only one device? Assuming a non-degraded
>> pool of course.
>
> ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks
> must be read to compute and verify the checksum.

But why do ZFS reads require the computation of the RAIDZ checksum?

If the block checksum is fine, then you need not care about the parity.

Casper
Hello Anton,

Thursday, January 4, 2007, 3:46:48 AM, you wrote:

>> Is there some reason why a small read on a raidz2 is not statistically
>> very likely to require I/O on only one device? Assuming a non-degraded
>> pool of course.

ABR> ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all
ABR> disks must be read to compute and verify the checksum.

It's not about the checksum but about how a fs block is stored in the
raid-z[12] case - it's spread out to all non-parity disks, so in order to
read one fs block you have to read from all disks except the parity
disks.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
On Jan 4, 2007, at 3:25 AM, Casper.Dik at Sun.COM wrote:

>>> Is there some reason why a small read on a raidz2 is not statistically
>>> very likely to require I/O on only one device? Assuming a non-degraded
>>> pool of course.
>>
>> ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks
>> must be read to compute and verify the checksum.
>
> But why do ZFS reads require the computation of the RAIDZ checksum?
>
> If the block checksum is fine, then you need not care about the parity.

It's the block checksum that requires reading all of the disks. If ZFS
stored sub-block checksums for the RAID-Z case then short reads could
often be satisfied without reading the whole block (and all disks).

So actually I mis-spoke slightly; rather than "all disks", I should have
said "all data disks." In practice this has the same effect: No more than
one read may be processed at a time.

Anton
> So actually I mis-spoke slightly; rather than "all disks", I should
> have said "all data disks." In practice this has the same effect: No
> more than one read may be processed at a time.

But aren't short blocks sometimes stored on only a subset of disks?

Casper
> It's the block checksum that requires reading all of the disks. If
> ZFS stored sub-block checksums for the RAID-Z case then short reads
> could often be satisfied without reading the whole block (and all
> disks).

What happens when a sub-block is missing (single disk failure)? Surely
it doesn't have to discard the entire checksum and simply trust the
remaining blocks?

Also, even if it could read the data from a subset of the disks, isn't
it a feature that every read is also verifying the parity for
correctness/silent corruption? I'm assuming that any "short-read"
optimization wouldn't be able to perform that check.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Anton B. Rang writes:
> >> In our recent experience RAID-5 due to the 2 reads, an XOR calc and a
> >> write op per write instruction is usually much slower than RAID-10
> >> (two write ops). Any advice is greatly appreciated.
> >
> > RAIDZ and RAIDZ2 do not suffer from this malady (the RAID5 write hole).
>
> 1. This isn't the "write hole".
>
> 2. RAIDZ and RAIDZ2 suffer from read-modify-write overhead when updating
> a file in writes of less than 128K, but not when writing a new file or
> issuing large writes.

I don't think this is stated correctly. All filesystems will incur a
read-modify-write when an application is updating a portion of a block.
The read I/O only occurs if the block is not already in the memory cache.
The write is potentially deferred, and multiple block updates may occur
per write I/O. This is not RAIDZ specific.

ZFS stores files less than 128K (or less than the filesystem recordsize)
as a single block. Larger files are stored as multiple recordsize blocks.
For RAID-Z a block spreads onto all devices of a group.

-r
On Jan 4, 2007, at 10:26 AM, Roch - PAE wrote:

> All filesystems will incur a read-modify-write when an application is
> updating a portion of a block.

For most Solaris file systems it is the page size, rather than the block
size, that affects read-modify-write; hence 8K (SPARC) or 4K (x86/x64)
writes do not require read-modify-write for UFS/QFS, even when larger
block sizes are used. When direct I/O is enabled, UFS and QFS will write
directly to disk (without reading) for 512-byte-aligned I/O.

> The read I/O only occurs if the block is not already in the memory cache.

Of course.

> ZFS stores files less than 128K (or less than the filesystem recordsize)
> as a single block. Larger files are stored as multiple recordsize blocks.

So appending to any file less than 128K will result in a read-modify-write
cycle (modulo read caching), while a write to a file which is not
record-size-aligned (by default, 128K) results in a read-modify-write
cycle.

> For RAID-Z a block spreads onto all devices of a group.

Which means that all devices are involved in the read and the write;
except, as I believe Casper pointed out, that very small blocks (less
than 512 bytes per data device) will reside on a smaller set of disks.

Anton
> What happens when a sub-block is missing (single disk failure)? Surely
> it doesn't have to discard the entire checksum and simply trust the
> remaining blocks?

The checksum is over the data, not the data+parity. So when a disk fails,
the data is first reconstructed, and then the block checksum is computed.

> Also, even if it could read the data from a subset of the disks, isn't
> it a feature that every read is also verifying the parity for
> correctness/silent corruption?

It doesn't -- we only read the data, not the parity. (See line 708 of
vdev_raidz.c.) The parity is checked only when scrubbing.
Wade.Stuart at fallon.com | 2007-Jan-04 17:40 UTC | [zfs-discuss] Scrubbing on active zfs systems (many snaps per day)
From what I have read, it looks like there is a known issue with scrubbing
restarting when any of the other usages of the same code path run
(re-silver, snap ...). It looks like there is a plan to put in a marker so
that scrubbing knows where to start again after being preempted. This is
good.

I am wondering if any thought has been put into a scrubbing service that
would do constant low-priority scrubs (either full with the restart
marker, or randomized). I have noticed that the default scrub seems to be
very resource intensive and can cause significant slowdowns on the
filesystems; a much slower but constant scrub would be nice.

    while (1) {
        scrub_very_slowly();
    }

Are there any plans in this area documented anywhere, or can someone give
insight as to the devel team's goals?

Thanks!
-Wade
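Absent such a throttled scrub service, the usual workaround is simply to
kick off a scrub periodically from cron and let it run at whatever pace
it runs - not the constant low-priority scrub asked for above, and the
pool name is a placeholder:

    # crontab entry: start a scrub of 'tank' at 03:00 every Sunday
    0 3 * * 0  /usr/sbin/zpool scrub tank

Progress can then be checked with "zpool status -v tank".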
> > Also, even if it could read the data from a subset of the disks, isn't
> > it a feature that every read is also verifying the parity for
> > correctness/silent corruption?
>
> It doesn't -- we only read the data, not the parity. (See line 708 of
> vdev_raidz.c.) The parity is checked only when scrubbing.

Ah, that's a major misconception on my part then. I'd thought I'd read
that unlike any other RAID implementation, ZFS checked and verified
parity on normal data access.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Darren Dunham wrote:
>>> Also, even if it could read the data from a subset of the disks, isn't
>>> it a feature that every read is also verifying the parity for
>>> correctness/silent corruption?
>>
>> It doesn't -- we only read the data, not the parity. (See line 708 of
>> vdev_raidz.c.) The parity is checked only when scrubbing.
>
> Ah, that's a major misconception on my part then. I'd thought I'd read
> that unlike any other RAID implementation, ZFS checked and verified
> parity on normal data access.

Except that of course we compute the checksum for all data read and
compare that with the checksum stored in the block pointer... and then
use the parity data to reconstruct the block if the checksums don't
match.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com        http://blogs.sun.com/barts
> I'd thought I'd read that unlike any other RAID implementation, ZFS
> checked and verified parity on normal data access.

Not yet, it appears. :-)

(Incidentally, some hardware RAID controllers do verify parity, but
generally only for RAID-3, where the extra reads are free as long as you
have hardware parity checking.)
Darren Dunham wrote:
>>> Also, even if it could read the data from a subset of the disks, isn't
>>> it a feature that every read is also verifying the parity for
>>> correctness/silent corruption?
>>
>> It doesn't -- we only read the data, not the parity. (See line 708 of
>> vdev_raidz.c.) The parity is checked only when scrubbing.
>
> Ah, that's a major misconception on my part then. I'd thought I'd read
> that unlike any other RAID implementation, ZFS checked and verified
> parity on normal data access.

That would be useless, and not provide anything extra. ZFS will do a
block checksum check (that is, for each block read, read the checksum for
that block, and compare to see if it is OK). If the block checksums show
OK, then reading the parity for the corresponding data yields no
additional useful information.

I'm assuming that in a RAIDZ, RAIDZ2, or mirror configuration, should a
block checksum show the corresponding block is corrupted, then ZFS will
read the parity (or corresponding mirror) block, attempt to reconstruct
the "bad" block, give the corrected info to the calling process, and then
rewrite the corrected data to a new block section on the disk(s).

Right?

-Erik
> > Ah, that's a major misconception on my part then. I'd thought I'd read
> > that unlike any other RAID implementation, ZFS checked and verified
> > parity on normal data access.
>
> That would be useless, and not provide anything extra.

I think it's useless only if a (disk) block of data holding RAIDZ parity
never has silent corruption, or if scrubbing were a lightweight operation
that could be run often.

> ZFS will do a block checksum check (that is, for each block read, read
> the checksum for that block, and compare to see if it is OK). If the
> block checksums show OK, then reading the parity for the corresponding
> data yields no additional useful information.

It would yield useful information about the status of the parity
information on disk. The read would be done because you're already paying
the penalty for reading all the data blocks, so you can verify the
stability of the parity information on disk by reading an additional
amount.

> I'm assuming that in a RAIDZ, RAIDZ2, or mirror configuration, should a
> block checksum show the corresponding block is corrupted, then ZFS will
> read the parity (or corresponding mirror) block, attempt to reconstruct
> the "bad" block, give the corrected info to the calling process, and
> then rewrite the corrected data to a new block section on the disk(s).
>
> Right?

I was assuming that *all* the data for an FS block was read and, if
redundant, the redundancy was verified correct (same data on mirrors,
valid parity for RAIDZ) or the redundancy would be repaired.

At least with a mirror I have a chance of reading all copies over time.
With RAIDZ, I'll never read the parity until a problem or a scrub occurs.
Nothing wrong with that; I had simply managed to convince myself that it
did more.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
>> ... If the block checksums show OK, then reading the parity for the
>> corresponding data yields no additional useful information.
>
> It would yield useful information about the status of the parity
> information on disk.
>
> The read would be done because you're already paying the penalty for
> reading all the data blocks, so you can verify the stability of the
> parity information on disk by reading an additional amount.

Sounds like this additional checking (I see your point) could be optional?

--Toby
>>> ... If the block checksums show OK, then reading the parity for the
>>> corresponding data yields no additional useful information.
>>
>> It would yield useful information about the status of the parity
>> information on disk.
>>
>> The read would be done because you're already paying the penalty for
>> reading all the data blocks, so you can verify the stability of the
>> parity information on disk by reading an additional amount.
>
> Sounds like this additional checking (I see your point) could be
> optional?

Well, I'm not offering to implement it or anything. :-)

Somehow from some of the early discussions of ZFS, I managed to "learn"
that this was one of the features. What I read was wrong, or I
misinterpreted it. (Either way, I'm afraid I've managed to repeat it to
others since.)

I would expect such behavior to have some redundancy benefits and some
performance and code complexity impacts. I think it's a neat idea and I'm
sorry to learn that I've been misunderstanding this as a feature, but I
can't guess what the cost of implementing it would be. I suppose having
it as a per-pool option could make sense.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
> It's not about the checksum but about how a fs block is stored in the
> raid-z[12] case - it's spread out to all non-parity disks, so in order
> to read one fs block you have to read from all disks except the parity
> disks.

However, if we didn't need to verify the checksum, we wouldn't have to
read the whole file system block to satisfy small reads.

Anton
Darren Dunham wrote:
>> That would be useless, and not provide anything extra.
>
> I think it's useless only if a (disk) block of data holding RAIDZ parity
> never has silent corruption, or if scrubbing were a lightweight operation
> that could be run often.

The problem is that you will still need to perform a periodic scrub,
because you can't be sure that all data will be read during normal
operation. So it doesn't make sense to me to (further) penalize every
read, when doing so does not remove the need for scrub.
 -- richard
Peter Schuller wrote:
>> I've been using a simple model for small, random reads. In that model,
>> the performance of a raidz[12] set will be approximately equal to a single
>> disk. For example, if you have 6 disks, then the performance for the
>> 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way
>> dynamic stripe of 2-way mirrors will have a normalized performance of 6.
>> I'd be very interested to see if your results concur.
>
> Is this expected behavior? Assuming concurrent reads (not synchronous and
> sequential) I would naively expect an n-disk raidz2 pool to have a
> normalized performance of n for small reads.

q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0
where such behavior in a hardware RAID array led to corruption which was
detected by ZFS. No free lunch today, either.
 -- richard
>> Is this expected behavior? Assuming concurrent reads (not synchronous
>> and sequential) I would naively expect an n-disk raidz2 pool to have a
>> normalized performance of n for small reads.
>
> q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0
> where such behavior in a hardware RAID array led to corruption which
> was detected by ZFS. No free lunch today, either.
> -- richard

I appreciate the advantage of checksumming, believe me. Though I don't
see why this is directly related to the small read problem, other than
that the implementation is such.

Is there some fundamental reason why one could not (though I understand
one *would* not) keep a checksum on a per-disk basis, so that in the
normal case one really could read from just one disk for a small read? I
realize it is not enough for a block to be self-consistent, but
theoretically couldn't the block which points to the block in question
contain multiple checksums for the various subsets on different disks,
rather than just the one checksum for the entire block?

Not that I consider this a major issue; but since you pointed me to that
article in response to my statement above...

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Peter Schuller wrote:
> I appreciate the advantage of checksumming, believe me. Though I don't
> see why this is directly related to the small read problem, other than
> that the implementation is such.
>
> Is there some fundamental reason why one could not (though I understand
> one *would* not) keep a checksum on a per-disk basis, so that in the
> normal case one really could read from just one disk for a small read?
> I realize it is not enough for a block to be self-consistent, but
> theoretically couldn't the block which points to the block in question
> contain multiple checksums for the various subsets on different disks,
> rather than just the one checksum for the entire block?

Then you would need to keep checksums for each physical block, which is
not part of the on-disk spec. It is not clear to me that this would be a
net win, because you would need that checksum to be physically placed on
another vdev, which implies that you still couldn't just read a single
block and be happy.

Note, there are lots of different possibilities here; ZFS implements the
end-to-end checksum, which would not be replaced by a lower-level
checksum anyway.
 -- richard
Hello Anton,

Saturday, January 6, 2007, 6:29:29 AM, you wrote:

>> It's not about the checksum but about how a fs block is stored in the
>> raid-z[12] case - it's spread out to all non-parity disks, so in order
>> to read one fs block you have to read from all disks except the parity
>> disks.

ABR> However, if we didn't need to verify the checksum, we wouldn't
ABR> have to read the whole file system block to satisfy small reads.

But we'd lose the end-to-end integrity feature. And still, with 9 or more
disks, for most workloads we would end up reading them all anyway, as
each disk would hold such a small portion of the fs block.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
On 03 January, 2007 - Jason J. W. Williams sent me these 0,4K bytes:

> Hello All,
>
> I was curious if anyone had run a benchmark on the IOPS performance of
> RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was
> curious what others had seen. Thank you in advance.

http://blogs.sun.com/roch/entry/when_to_and_not_to has some info for you.

/Tomas
--
Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Hello All,

I was curious if anyone had run a benchmark on the IOPS performance of
RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was
curious what others had seen. Thank you in advance.

Best Regards,
Jason