Hi

Now that Solaris 10 06/06 is finally downloadable, I have some questions about ZFS.

- We have a big storage system supporting RAID5 and RAID1. At the moment, we only use RAID5 (for non-Solaris systems as well). We are thinking about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID5 seems like overkill, an option would be to use RAID1 with RAID-Z. Then again, this is a waste of space, as it needs more disks due to the mirroring. Later on, we might be using asynchronous replication to another storage system over the SAN, which wastes even more space. It looks as if ZFS and storage virtualization, as of today, just don't work nicely together. What we would need is the ability to use JBODs.

- Does ZFS in the current version support LUN extension? With UFS, we have to zero the VTOC and then adjust the new disk geometry. How does it look with ZFS?

- I've read the threads about ZFS and databases. Still, I'm not 100% convinced about read performance. Doesn't the fragmentation of large database files (because of copy-on-write) impact read performance?

- Does anybody have any experience with database cloning using the ZFS clone mechanism? What factors influence performance when running the cloned database in parallel?

- I really like the idea of keeping all needed database files together, to allow fast and consistent cloning.

Thanks

Mika

# mv Disclaimer.txt /dev/null
About:

> I've read the threads about ZFS and databases. Still, I'm not 100% convinced
> about read performance. Doesn't the fragmentation of large database files
> (because of copy-on-write) impact read performance?

I do need to get back to this thread. The way I am currently looking at it is this: ZFS will perform great at the transaction component (say, the small 8K O_DSYNC writes) because the ZIL will aggregate them into fewer, larger I/Os and the block allocation will stream them to the disk surface. On the other hand, streaming reads will require good prefetch code (under review) to get the read performance we want.

If the requirement balances random writes and streaming reads, then ZFS should be right there with the best filesystems. If the critical requirement focuses exclusively on streaming reads of a file that was written randomly and, in addition, the number of spindles is limited, then that is not the sweet spot of ZFS. Read performance should still scale with the number of spindles. And, if the load can accommodate a reorder, a cp(1) of the file should do wonders for the layout and give top per-spindle read-streaming performance.

-r
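A minimal sketch of the cp(1) reorder Roch describes, assuming the database can be taken offline briefly, the filesystem has enough free space for a second copy, and that the file and path names here are hypothetical:

    # rewrite the file so ZFS allocates its blocks in fresh, mostly contiguous runs
    cp /tank/db/datafile.dbf /tank/db/datafile.dbf.new
    mv /tank/db/datafile.dbf.new /tank/db/datafile.dbf

The copy goes through the normal ZFS allocator, so the new blocks tend to land in larger contiguous regions than the randomly overwritten original, which is what helps subsequent sequential reads.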
On Jun 26, 2006, at 1:15 AM, Mika Borner wrote:

> -We have a big storage system supporting RAID5 and RAID1. [...] This looks
> somehow like storage virtualization as of today just doesn't work nicely
> together. What we need, would be the feature to use JBODs.

If you've got hardware RAID-5, why not just run regular (non-raid) pools on top of the RAID-5?

I wouldn't go back to JBOD. Hardware arrays offer a number of advantages over JBOD:
    - disk microcode management
    - optimized access to storage
    - large write caches
    - RAID computation can be done in specialized hardware
    - SAN-based hardware products allow sharing of storage among multiple hosts. This allows storage to be utilized more effectively.

> -Does ZFS in the current version support LUN extension? With UFS, we
> have to zero the VTOC, and then adjust the new disk geometry. How does
> it look like with ZFS?

I don't understand what you're asking. What problem is solved by zeroing the VTOC?

> -I've read the threads about zfs and databases. Still I'm not 100%
> convinced about read performance. Doesn't the fragmentation of the
> large database files (because of the concept of COW) impact
> read-performance?

This is discussed elsewhere in the zfs discussion group.

-----
Gregory Shaw, IT Architect
ITCTO Group, Sun Microsystems Inc.
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
> > -Does ZFS in the current version support LUN extension? With UFS, we
> > have to zero the VTOC, and then adjust the new disk geometry. How does
> > it look like with ZFS?
>
> I don't understand what you're asking. What problem is solved by
> zeroing the vtoc?

It matters when the underlying storage increases the size of the LUN. The old size is still on the label and the 'sd' driver doesn't recognize the increase. This doesn't appear to be a problem at the ZFS level, but the interactions with the EFI label may be interesting.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
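For context, a rough sketch of the traditional UFS-side procedure being referred to (zero the stale label, relabel, then grow the filesystem). This is only an illustration under stated assumptions - the device name c2t0d0 and mount point /data are hypothetical, the array is assumed to have already grown the LUN, and zeroing a disk label is destructive, so do not treat this as a procedure to run as-is:

    # 1. wipe the stale label so the new capacity can be discovered
    dd if=/dev/zero of=/dev/rdsk/c2t0d0s2 bs=512 count=16
    # 2. in format(1M), select the disk, choose "type" -> "Auto configure",
    #    then "label" to write a label matching the new geometry
    format c2t0d0
    # 3. grow the UFS filesystem into the new space
    growfs -M /data /dev/rdsk/c2t0d0s0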
On Tue, 2006-06-27 at 02:27, Gregory Shaw wrote:
> On Jun 26, 2006, at 1:15 AM, Mika Borner wrote:
> > <snip> What we need, would be the feature to use JBODs.
>
> If you've got hardware raid-5, why not just run regular (non-raid)
> pools on top of the raid-5?
>
> I wouldn't go back to JBOD. Hardware arrays offer a number of
> advantages to JBOD:
>     - disk microcode management
>     - optimized access to storage
>     - large write caches
>     - RAID computation can be done in specialized hardware
>     - SAN-based hardware products allow sharing of storage among
>       multiple hosts. This allows storage to be utilized more effectively.

How would ZFS self-heal in this case?

Nathan.
On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:
> On Tue, 2006-06-27 at 02:27, Gregory Shaw wrote:
> > If you've got hardware raid-5, why not just run regular (non-raid)
> > pools on top of the raid-5?
> >
> > I wouldn't go back to JBOD. Hardware arrays offer a number of
> > advantages to JBOD: [...]
>
> How would ZFS self heal in this case?
>
> Nathan.

You're using hardware RAID. The hardware RAID controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software RAID.

If you've got requirements for surviving an array failure, the recommended solution in that case is to mirror between volumes on multiple arrays. I've always liked software RAID (mirroring) in that case, as no manual intervention is needed in the event of an array failure. Mirroring between discrete arrays is usually reserved for mission-critical applications that cost thousands of dollars per hour in downtime.
Roch wrote:
> And, if the load can accommodate a reorder, to get top per-spindle
> read-streaming performance, a cp(1) of the file should do wonders on
> the layout.

...but there may not be filesystem space for double the data.
Sounds like there is a need for a zfs-defragment-file utility, perhaps?

Or if you want to be politically cagey about the naming choice, perhaps zfs-seq-read-optimize-file? :-)
> If you've got hardware raid-5, why not just run regular (non-raid)
> pools on top of the raid-5?
>
> I wouldn't go back to JBOD. Hardware arrays offer a number of
> advantages to JBOD:
>     - disk microcode management
>     - optimized access to storage
>     - large write caches
>     - RAID computation can be done in specialized hardware
>     - SAN-based hardware products allow sharing of storage among
>       multiple hosts. This allows storage to be utilized more effectively.

I'm a little confused by the first poster's message as well, but you lose some benefits of ZFS if you don't create your pools with either RAID1 or RAID-Z, such as data corruption detection. The array isn't going to detect that because all it knows about are blocks.

-Nate
On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:
>
> You're using hardware raid. The hardware raid controller will rebuild
> the volume in the event of a single drive failure. You'd need to keep
> on top of it, but that's a given in the case of either hardware or
> software raid.

True for total drive failure, but there are more failure modes than that. With hardware RAID, there is no way for the RAID controller to know which block was bad, and therefore it cannot repair the block. With RAID-Z, we have the integrated checksum and can do combinatorial analysis to know not only which drive was bad, but what the data _should_ be, and can repair it to prevent more corruption in the future.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric Schrock wrote:
> On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:
>> You're using hardware raid. [...]
>
> True for total drive failure, but there are more failure modes
> than that. With hardware RAID, there is no way for the RAID controller
> to know which block was bad, and therefore cannot repair the block.
> With RAID-Z, we have the integrated checksum and can do combinatorial
> analysis to know not only which drive was bad, but what the data
> _should_ be, and can repair it to prevent more corruption in the future.

Keep in mind that each disk data block is accompanied by a pretty long error correction code (ECC) which allows for (a) verification of data integrity and (b) repair of lost/misread bits (typically up to about 10% of the block data).

Therefore, in the case of single block errors there are several possible situations:

- non-recoverable errors - the number of correct bits in the combined data + ECC is insufficient. Such errors are visible to the RAID controller, the controller can use a redundant copy of the data, and the controller can perform the repair.

- recoverable errors - some bits can't be read correctly but they can be reconstructed using ECC. These errors are not directly visible to either the RAID controller or ZFS. However, the disks keep a count of recoverable errors, so disk scrubbers can identify disk areas with rotten blocks and force block relocation.

- silent data corruption - it can happen in memory before the data was written to disk, it can occur in the disk cache, or it can be caused by a bug in disk firmware. Here the disk controller can't do anything, and the end-to-end checksums which ZFS offers are the only solution.

-- Olaf
Gregory Shaw wrote:
> On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:
>> How would ZFS self heal in this case?
>
> You're using hardware raid. The hardware raid controller will rebuild
> the volume in the event of a single drive failure. You'd need to keep
> on top of it, but that's a given in the case of either hardware or
> software raid.
>
> If you've got requirements for surviving an array failure, the
> recommended solution in that case is to mirror between volumes on
> multiple arrays. I've always liked software raid (mirroring) in that
> case, as no manual intervention is needed in the event of an array
> failure. Mirroring between discrete arrays is usually reserved for
> mission-critical applications that cost thousands of dollars per hour in
> downtime.

In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array) all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com         http://blogs.sun.com/barts
> -Does ZFS in the current version support LUN extension? With UFS, we
> have to zero the VTOC, and then adjust the new disk geometry. How does
> it look like with ZFS?

The vdev can handle dynamic LUN growth, but the underlying VTOC or EFI label may need to be zeroed and reapplied if you set up the initial vdev on a slice. If you introduced the entire disk to the pool you should be fine, but I believe you'll still need to offline/online the pool.

.je
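A minimal sketch of the "offline/online the pool" step described above, assuming it maps to an export/import cycle, a pool named tank built on a whole-disk vdev, and a LUN the array has just grown (the pool name is hypothetical):

    zpool export tank     # release the devices so the label can be re-read
    zpool import tank     # re-open the pool with the refreshed device size
    zpool list tank       # check whether the extra capacity is now visible

Whether the extra space actually appears depends on the label and driver picking up the new LUN size, per the discussion above.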
Olaf Manczak wrote:
> Keep in mind that each disk data block is accompanied by a pretty
> long error correction code (ECC) which allows for (a) verification
> of data integrity (b) repair of lost/misread bits (typically up to
> about 10% of the block data).

AFAIK, typical disk ECC will correct 8 bytes. I'd love for it to be 10% (51 bytes). Do you have a pointer to such information?

> Therefore, in case of single block errors there are several possible
> situations: [...]
>
> - silent data corruption - it can happen in memory before the data
>   was written to disk, it can occur in the disk cache, it can be caused
>   by a bug in disk firmware. Here the disk controller can't do
>   anything and the end-to-end checksums, which ZFS offers,
>   are the only solution.

Another mode occurs when you use a format(1M)-like utility to scan and repair disks. For such utilities, if the data cannot be reconstructed it is zero-filled. If there was real data stored there, then ZFS will detect it and the majority of other file systems will not. For an array, one should not be able to readily access such utilities and cause such corrective actions, but I would not bet the farm on it -- end-to-end error detection will always prevail.

-- richard
> The vdev can handle dynamic LUN growth, but the underlying VTOC or
> EFI label may need to be zeroed and reapplied if you set up the initial
> vdev on a slice. If you introduced the entire disk to the pool you
> should be fine, but I believe you'll still need to offline/online the
> pool.

Fine, at least the vdev can handle this... I asked about this feature in October and hoped that it would be implemented when integrating ZFS into Sol10U2:

http://www.opensolaris.org/jive/thread.jspa?messageID=11646

Does anybody know when this feature is finally coming? It would keep the number of LUNs on the host low, especially as device names can get really ugly (long!).

//Mika

# mv Disclaimer.txt /dev/null
> I'm a little confused by the first poster's message as well, but you
> lose some benefits of ZFS if you don't create your pools with either
> RAID1 or RAIDZ, such as data corruption detection. The array isn't
> going to detect that because all it knows about are blocks.

That's the dilemma: the array provides nice features like RAID1 and RAID5, but those are of no real use when using ZFS.

The advantages of using ZFS on such an array are, e.g., the sometimes huge write cache available, the use of consolidated storage and, in SAN configurations, cloning and sharing storage between hosts.

The price comes of course in additional administrative overhead (lots of microcode updates, more components that can fail in between, etc.).

Also, in bigger companies there usually is a team of storage specialists who mostly do not know about the applications running on top of the storage, or do not care... (like: "here you have your bunch of gigabytes...")

//Mika

# mv Disclaimer.txt /dev/null
> but there may not be filesystem space for double the data.
> Sounds like there is a need for a zfs-defragment-file utility perhaps?
> Or if you want to be politically cagey about naming choice, perhaps,
> zfs-seq-read-optimize-file ? :-)

For data warehouse and streaming applications, a seq-read optimization could bring additional performance. For "normal" databases this should be benchmarked...

This brings me back to another question. We have a production database that is cloned at every end of month for end-of-month processing (currently with a feature of our storage array). I'm thinking about a ZFS version of this task; see the sketch after this message. The requirement is that the production database should not suffer performance degradation while the clone runs in parallel. As ZFS does not copy all the blocks when cloning, I wonder how much the production database will suffer from sharing most of the data with the clone (concurrent access vs. caching).

Maybe we need a feature in ZFS to do a full clone (that is, copy all blocks) inside the pool, if performance is an issue... just like the "Quick Copy" vs. "Shadow Image" features on HDS arrays...
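A minimal sketch of the ZFS version of that month-end task, assuming the database lives in a filesystem called tank/proddb (all dataset names here are hypothetical) and that the database is quiesced or in hot-backup mode when the snapshot is taken:

    zfs snapshot tank/proddb@eom-2006-06                  # point-in-time, space-efficient snapshot
    zfs clone tank/proddb@eom-2006-06 tank/proddb-eom     # writable clone sharing unchanged blocks
    # ... run the end-of-month processing against /tank/proddb-eom ...
    zfs destroy tank/proddb-eom                           # discard the clone when done
    zfs destroy tank/proddb@eom-2006-06

The clone initially consumes almost no extra space; only blocks modified by either copy diverge, which is exactly why the two instances end up sharing spindles, as discussed below.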
> That's the dilemma, the array provides nice features like RAID1 and
> RAID5, but those are of no real use when using ZFS.

RAID5 is not a "nice" feature when it breaks. A RAID controller cannot guarantee that all bits of a RAID5 stripe are written when power fails; then you have data corruption and no one can tell you what data was corrupted. ZFS RAID-Z can.

> The advantages to use ZFS on such array are e.g. the sometimes huge
> write cache available, use of consolidated storage and in SAN
> configurations, cloning and sharing storage between hosts.

Are huge write caches really an advantage? Or are you talking about huge write caches with non-volatile storage?

> The price comes of course in additional administrative overhead (lots
> of microcode updates, more components that can fail in between, etc).
>
> Also, in bigger companies there usually is a team of storage
> specialists, that mostly do not know about the applications running on
> top of it, or do not care... (like: "here you have your bunch of
> gigabytes...")

True enough ....

Casper
> RAID5 is not a "nice" feature when it breaks.

Let me correct myself... RAID5 is a "nice" feature for systems without ZFS...

> Are huge write caches really an advantage? Or are you talking about huge
> write caches with non-volatile storage?

Yes, you are right. The huge cache is needed mostly because of the poor write performance of RAID5 (and it is of course battery-backed)...

// Mika

# mv Disclaimer.txt /dev/null
Hello Nathanael,

NB> I'm a little confused by the first poster's message as well, but
NB> you lose some benefits of ZFS if you don't create your pools with
NB> either RAID1 or RAIDZ, such as data corruption detection. The
NB> array isn't going to detect that because all it knows about are blocks.

Actually, ZFS will detect data corruption even if the pool is not redundant, but it won't repair the data (metadata is protected with 2 and/or 3 copies anyway).

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
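A minimal sketch of how that detection surfaces in practice, assuming a non-redundant pool named tank (the pool name is hypothetical):

    zpool scrub tank       # read every block and verify it against its checksum
    zpool status -v tank   # per-vdev READ/WRITE/CKSUM error counters appear here;
                           # without redundancy ZFS reports the damage but cannot
                           # rewrite the affected blocks from a good copy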
Hello Mika,

Tuesday, June 27, 2006, 10:19:05 AM, you wrote:

MB> This brings me back to another question. We have a production database
MB> that is cloned at every end of month for end-of-month processing
MB> (currently with a feature of our storage array).

MB> I'm thinking about a ZFS version of this task. Requirements: the
MB> production database should not suffer from performance degradation,
MB> whilst running the clone in parallel. As ZFS does not clone all the
MB> blocks, I wonder how much the production database will suffer from
MB> sharing most of the data with the clone (concurrent access vs. caching)

MB> Maybe we need a feature in ZFS to do a full clone (speak: copy all
MB> blocks) inside the pool, if performance is an issue.... just like the
MB> "Quick Copy" vs. "Shadow Image" features on HDS arrays...

I believe you want a clone in a different pool (so on different disks); that way you get separation. The most important problem with two DBs after the current style of clone would be the shared spindles.

--
Best regards,
Robert                            mailto:rmilkowski at task.gda.pl
                                  http://milek.blogspot.com
Philip Brown writes:
 > Roch wrote:
 > > And, if the load can accommodate a
 > > reorder, to get top per-spindle read-streaming performance,
 > > a cp(1) of the file should do wonders on the layout.
 >
 > but there may not be filesystem space for double the data.
 > Sounds like there is a need for a zfs-defragment-file utility perhaps?
 >
 > Or if you want to be politically cagey about naming choice, perhaps,
 > zfs-seq-read-optimize-file ? :-)

Possibly, or maybe using fcntl? Now the goal is to take a file with scattered blocks and order them into contiguous chunks. So this is contingent on the existence of regions of free contiguous disk space. This will get more difficult as we get close to full on the storage.

-r
Mika Borner writes:
 > > RAID5 is not a "nice" feature when it breaks.
 >
 > Let me correct myself... RAID5 is a "nice" feature for systems without
 > ZFS...
 >
 > > Are huge write caches really an advantage? Or are you talking about
 > > huge write caches with non-volatile storage?
 >
 > Yes, you are right. The huge cache is needed mostly because of poor
 > write performance for RAID5 (of course battery backed)...

Having a certain amount of non-volatile cache is great to speed up the latency of ZIL operations, which directly impacts some application performance.

-r
Does it make sense to solve these problems piece-meal:

* Performance: ZFS algorithms and NVRAM
* Error detection: ZFS checksums
* Error correction: ZFS RAID1 or RAIDZ

Nathanael Burton wrote:
>> If you've got hardware raid-5, why not just run regular (non-raid) pools on
>> top of the raid-5?
>>
>> I wouldn't go back to JBOD. Hardware arrays offer a number of advantages to
>> JBOD: disk microcode management; optimized access to storage; large write
>> caches; RAID computation can be done in specialized hardware; SAN-based
>> hardware products allow sharing of storage among multiple hosts. This
>> allows storage to be utilized more effectively.
>
> I'm a little confused by the first poster's message as well, but you lose some
> benefits of ZFS if you don't create your pools with either RAID1 or RAIDZ, such
> as data corruption detection. The array isn't going to detect that because all
> it knows about are blocks.

--
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
On Tue, 2006-06-27 at 04:19, Mika Borner wrote:
> I'm thinking about a ZFS version of this task. Requirements: the
> production database should not suffer from performance degradation,
> whilst running the clone in parallel. As ZFS does not clone all the
> blocks, I wonder how much the production database will suffer from
> sharing most of the data with the clone (concurrent access vs. caching)

Given that ZFS always does copy-on-write for any updates, it's not clear why this would necessarily degrade performance.

> Maybe we need a feature in ZFS to do a full clone (speak: copy all
> blocks) inside the pool, if performance is an issue.... just like the
> "Quick Copy" vs. "Shadow Image" features on HDS arrays...

It seems to me that the main reason you'd need to do a full copy would be to get the clone and production on different sets of disks so their access patterns don't end up fighting. For ZFS that requires having separate pools; if they're in the same pool, sharing the unchanged blocks should only help performance.

If you want a full copy you can use zfs send / zfs receive -- either within the same pool or between two different pools.
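A minimal sketch of that full-copy approach, assuming a second pool named tank2 on separate spindles (the pool and dataset names here are hypothetical):

    zfs snapshot tank/proddb@eom
    zfs send tank/proddb@eom | zfs receive tank2/proddb-eom   # full, physically separate copy
    zfs destroy tank/proddb@eom                               # optionally drop the source snapshot afterwards

Unlike zfs clone, the received copy shares no blocks with the original, so the two databases no longer compete for the same spindles - at the cost of the space and the copy time.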
> given that zfs always does copy-on-write for any updates, it's not clear
> why this would necessarily degrade performance..

Writing should be no problem, as it is serialized... but when both database instances are reading a lot of different blocks at the same time, the spindles might "heat up".

> If you want a full copy you can use zfs send/zfs receive -- either
> within the same pool or between two different pools.

OK. But then again, it might be necessary to throttle zfs send/receive replication between pools. Otherwise the replication process might influence the production environment's performance too much. Or is there already some kind of prioritization that I have overlooked?

//Mika

# mv Disclaimer.txt /dev/null
Mika Borner writes:
 > Writing should be no problem, as it is serialized... but when both
 > database instances are reading a lot of different blocks at the same
 > time, the spindles might "heat up".
 >
 > OK. But then again, it might be necessary to throttle zfs send/receive
 > replication between pools. Otherwise the replication process might
 > influence the production environment's performance too much. Or is there
 > already some kind of prioritization that I have overlooked?

I think this is heading toward 'quotas and reservations' for IOPS. Sounds like something that would be very useful. I don't know if this is planned.

-r
Yes, but the idea of using software RAID on a large server doesn't make sense on modern systems. If you've got a large database server that runs a large Oracle instance, using CPU cycles for RAID is counterproductive. Add to that the need to manage the hardware directly (drive microcode, drive brownouts/restarts, etc.) and the idea of using JBOD on modern systems starts to lose value in a big way.

You will detect any corruption when doing a scrub. It's not end-to-end, but it's no worse than today with VxVM.

On Jun 26, 2006, at 6:09 PM, Nathanael Burton wrote:

> I'm a little confused by the first poster's message as well, but
> you lose some benefits of ZFS if you don't create your pools with
> either RAID1 or RAIDZ, such as data corruption detection. The
> array isn't going to detect that because all it knows about are
> blocks.
>
> -Nate

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
Most controllers support a background scrub that will read a volume and repair any bad stripes. This addresses the bad block issue in most cases.

It still doesn't help when a double failure occurs. Luckily, that's very rare. Usually, in that case, you need to evacuate the volume and try to restore what was damaged.

On Jun 26, 2006, at 6:40 PM, Eric Schrock wrote:

> True for total drive failure, but there are more failure modes
> than that. With hardware RAID, there is no way for the RAID controller
> to know which block was bad, and therefore cannot repair the block.
> With RAID-Z, we have the integrated checksum and can do combinatorial
> analysis to know not only which drive was bad, but what the data
> _should_ be, and can repair it to prevent more corruption in the future.
>
> - Eric

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
So everything you are saying seems to suggest you think ZFS was a waste of engineering time, since hardware RAID solves all the problems?

I don't believe it does, but I'm no storage expert and maybe I've drunk too much Kool-Aid. I'm a software person, and for me ZFS is brilliant: it is so much easier than managing any of the hardware RAID systems I've dealt with.

--
Darren J Moffat
I don't like to top-post, but there's no better way right now. This issue has recurred several times and there have been no answers to it that cover the bases. The question is: say I as a customer have a database, let's say it's around 8 TB, all built on a series of high-end storage arrays that _don't_ support the JBOD everyone seems to want - what is the preferred configuration for my storage arrays to present LUNs to the OS for ZFS to consume?

Let's say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 - that spans the breadth of about as good as it gets. What should I as a customer do? Should I create RAID0 sets and let ZFS self-heal via its own mirroring or RAID-Z when a disk blows in the set? Should I use RAID1 and eat the disk space used? RAID5 and be thankful I have a large write cache - and then which type of ZFS pool should I create over it?

See, telling folks "you should just use JBOD" when they don't have JBOD and have invested millions to get to the state they're in, where they're efficiently utilizing their storage via a SAN infrastructure, is just plain one big waste of everyone's time. Shouting down the advantages of storage arrays with the same arguments over and over without providing an answer to the customer problem doesn't do anyone any good. So, I'll restate the question. I have a 10TB database that's spread over 20 storage arrays that I'd like to migrate to ZFS. How should I configure the storage arrays? Let's at least get that conversation moving...

 - Pete

Gregory Shaw wrote:
> Yes, but the idea of using software raid on a large server doesn't make
> sense in modern systems. If you've got a large database server that
> runs a large oracle instance, using CPU cycles for RAID is counter
> productive. Add to that the need to manage the hardware directly (drive
> microcode, drive brownouts/restarts, etc.) and the idea of using JBOD in
> modern systems starts to lose value in a big way.
>
> You will detect any corruption when doing a scrub. It's not end-to-end,
> but it's no worse than today with VxVM.
Bart Smaalders wrote:
> Gregory Shaw wrote:
>> If you've got requirements for surviving an array failure, the
>> recommended solution in that case is to mirror between volumes on
>> multiple arrays. I've always liked software raid (mirroring) in that
>> case, as no manual intervention is needed in the event of an array
>> failure. Mirroring between discrete arrays is usually reserved for
>> mission-critical applications that cost thousands of dollars per hour in
>> downtime.
>
> In other words, it won't. You've spent the disk space, but
> because you're mirroring in the wrong place (the raid array)
> all ZFS can do is tell you that your data is gone. With luck,
> subsequent reads _might_ get the right data, but maybe not.

Careful here when you say "wrong place". There are many scenarios where mirroring in the hardware is the correct way to go, even when running ZFS on top of it.
Peter Rival wrote:> storage arrays with the same arguments over and over without providing > an answer to the customer problem doesn''t do anyone any good. So. I''ll > restate the question. I have a 10TB database that''s spread over 20 > storage arrays that I''d like to migrate to ZFS. How should I configure > the storage array? Let''s at least get that conversation moving...I''ll answer your question with more questions: What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ? What of that doesn''t work for you ? What functionality of ZFS is it that you want to leverage ? -- Darren J Moffat
Unfortunately, a storage-based RAID controller cannot detect errors which occurred between the filesystem layer and the RAID controller, in either direction - in or out. ZFS will detect them through its use of checksums.

But ZFS can only fix them if it can access redundant bits. It can't tell a storage device to provide the redundant bits, so it must use its own data protection system (RAIDZ or RAID1) in order to correct errors it detects.

Gregory Shaw wrote:
> Most controllers support a background-scrub that will read a volume and
> repair any bad stripes. This addresses the bad block issue in most cases.
>
> It still doesn't help when a double-failure occurs. Luckily, that's
> very rare. Usually, in that case, you need to evacuate the volume and
> try to restore what was damaged.

--
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Peter Rival wrote:
> See, telling folks "you should just use JBOD" when they don't have JBOD
> and have invested millions to get to the state they're in where they're
> efficiently utilizing their storage via a SAN infrastructure is just
> plain one big waste of everyone's time. [...] So. I'll
> restate the question. I have a 10TB database that's spread over 20
> storage arrays that I'd like to migrate to ZFS. How should I configure
> the storage array? Let's at least get that conversation moving...

In general, I'd say that if the storage has battery-backed cache, use RAID5 on the storage device - limit the amount of redundant data, but improve performance and achieve some data protection in fast special-purpose hardware.

Just my $.02.

--
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Peter Rival wrote:
> I don't like to top-post, but there's no better way right now. This
> issue has recurred several times and there have been no answers to it
> that cover the bases. The question is, say I as a customer have a
> database, let's say it's around 8 TB, all built on a series of high-end
> storage arrays that _don't_ support the JBOD everyone seems to want -
> what is the preferred configuration for my storage arrays to present
> LUNs to the OS for ZFS to consume?
>
> Let's say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 -
> that spans the breadth of about as good as it gets. What should I as a
> customer do? Should I create RAID0 sets and let ZFS self-heal via its
> own mirroring or RAIDZ when a disk blows in the set? Should I use RAID1
> and eat the disk space used? RAID5 and be thankful I have a large write
> cache - and then which type of ZFS pool should I create over it?

The only use I see for RAID-0 is when you are configuring your competitor's systems. Real friends don't let friends use RAID-0.

For most modern arrays, RAID-5 works pretty well wrt performance. While not quite as good as RAID-1+0, most people are OK with RAID-5. s/-5/-6/g

> See, telling folks "you should just use JBOD" when they don't have JBOD
> [...] I have a 10TB database that's spread over 20
> storage arrays that I'd like to migrate to ZFS. How should I configure
> the storage array? Let's at least get that conversation moving...

It almost always boils down to how much money you have to spend. Since I'm a RAS guy, I prefer multiple ZFS RAID-1 mirrors over RAID-1 LUNs with hot spares and multiple-kilometer separation with multiple data paths between them. After I win the lottery, I might be able to afford that :-).

More applicable guidance would be to use the best redundancy closest to the context of the data first, and work down the stack from there. This philosophy will give you the best fault detection and recovery. Having the applications themselves provide such redundancy is best, but very uncommon. Next in the stack is the file system, where RAID-1 and RAID-Z[2] can help. Finally, the hardware RAID. This begs for a performability analysis [*], which is on my plate once things settle down a bit.

[*] Does anyone know what performability analysis is? I'd be happy to post some info on how we do that at Sun.

-- richard
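A minimal sketch of that "redundancy closest to the data" guidance applied to array LUNs, assuming two RAID-protected LUNs exported from two different arrays and visible to the host as c2t0d0 and c3t0d0 (the device and pool names are hypothetical):

    zpool create dbpool mirror c2t0d0 c3t0d0   # ZFS mirror spanning the two arrays
    zpool status dbpool                        # ZFS can now detect and repair bad blocks

With this layout, ZFS holds redundancy at the filesystem level, so a checksum failure on one array's LUN can be healed from the copy on the other, while the arrays still provide their cache and rebuild behavior underneath.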
Not at all. ZFS is a quantum leap in Solaris filesystem/VM functionality.

However, I don't see a lot of use for RAID-Z (or Z2) in large enterprise customer situations. For instance, does ZFS enable Sun to walk into an account and say "You can now replace all of your high-end (EMC) disk with JBOD."? I don't think many customers would bite on that.

RAID-Z is an excellent feature; however, it doesn't address many of the reasons for using high-end arrays:

- Exporting snapshots to alternate systems (for live database or backup purposes)
- Remote replication
- Sharing of storage among multiple systems (LUN masking and equivalent)
- Storage management (migration between tiers of storage)
- No-downtime failure replacement (the system doesn't even know)
- Clustering

I know that ZFS is still a work in progress, so some of the above may arrive in future versions of the product.

I see the RAID-Z[2] value in small-to-mid size systems where the storage is relatively small and you don't have high availability requirements.

On Jun 27, 2006, at 8:48 AM, Darren J Moffat wrote:

> So everything you are saying seems to suggest you think ZFS was a
> waste of engineering time since hardware raid solves all the
> problems?
>
> I don't believe it does but I'm no storage expert and maybe I've
> drunk too much Kool-Aid. I'm a software person and for me ZFS is
> brilliant; it is so much easier than managing any of the hardware
> raid systems I've dealt with.
>
> --
> Darren J Moffat

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted.

If you don't trust your storage subsystem, you're going to encounter issues regardless of the software used to store the data. We'll have to see if ZFS can 'save' customers in this situation. I've found that regardless of the storage solution in question, you can't anticipate all issues, and when a brownout or other ugly loss-of-service occurs, you may or may not be intact, ZFS or no.

I've never seen a product that can deal with all possible situations.

On Jun 27, 2006, at 9:01 AM, Jeff Victor wrote:

> Unfortunately, a storage-based RAID controller cannot detect errors
> which occurred between the filesystem layer and the RAID controller,
> in either direction - in or out. ZFS will detect them through its use
> of checksums.
>
> But ZFS can only fix them if it can access redundant bits. It can't
> tell a storage device to provide the redundant bits, so it must use
> its own data protection system (RAIDZ or RAID1) in order to correct
> errors it detects.

-----
Gregory Shaw, IT Architect, Sun Microsystems Inc.
> This is getting pretty picky. You're saying that ZFS will detect any
> errors introduced after ZFS has gotten the data. However, as stated
> in a previous post, that doesn't guarantee that the data given to ZFS
> wasn't already corrupted.

But there's a big difference between the time ZFS gets the data and the time your typical storage system gets it. And your typical storage system does not store any information which allows it to detect all but the most simple errors.

Storage systems are complicated and have many failure modes at many different levels:

- disks not writing data, or writing data to an incorrect location
- disks not reporting failures when they occur
- bit errors in disk write buffers causing data corruption
- storage array software with bugs
- storage arrays with undetected hardware errors
- data corruption in the path (such as switches which mangle packets but keep the TCP checksum working)

> If you don't trust your storage subsystem, you're going to encounter
> issues regardless of the software used to store the data. We'll have to
> see if ZFS can 'save' customers in this situation. I've found that
> regardless of the storage solution in question you can't anticipate
> all issues and when a brownout or other ugly loss-of-service occurs,
> you may or may not be intact, ZFS or no.
>
> I've never seen a product that can deal with all possible situations.

ZFS attempts to deal with more problems than any of the currently existing solutions by giving end-to-end verification of the data.

One of the reasons why ZFS was created was a particular large customer who had data corruption which occurred two years (!) before it was detected. The bad data had migrated and the corruption had spread; the good data was no longer available on backups (which weren't very relevant anyway after such a long time).

ZFS tries to give one important guarantee: if the data is bad, we will not return it.

One case in point is the person in MPK with a SATA controller which corrupts memory; he didn't discover this using UFS (except for perhaps a few strange events he noticed). After switching to ZFS he started to find corruption, so now he uses a self-healing ZFS mirror (or RAIDZ). ZFS helps at the low end as much as it does at the high end.

I'll bet that ZFS will generate more calls about broken hardware, and fingers will be pointed at ZFS at first because it's the new kid; it will be some time before people realize that the data was rotting all along.

Casper
Gregory Shaw wrote:
> Yes, but the idea of using software raid on a large server doesn't make
> sense in modern systems. If you've got a large database server that
> runs a large oracle instance, using CPU cycles for RAID is counter
> productive. Add to that the need to manage the hardware directly (drive
> microcode, drive brownouts/restarts, etc.) and the idea of using JBOD in
> modern systems starts to lose value in a big way.
>
> You will detect any corruption when doing a scrub. It's not end-to-end,
> but it's no worse than today with VxVM.

Yes, but we're trying to be better than VxVM. The end-to-end guarantee that
ZFS offers is one of, if not the, primary attractions to using it in the
first place.

CPU cycles are cheap these days. In the era of sub-1GHz single-core/chip
systems, yes, those XOR calculations for software RAID were expensive. Now,
not so much, I think, as that problem has been solved by brute force.

When using ZFS in a storage network, I'm envisioning the arrays being a
hybrid between a JBOD and a full-fledged hardware RAID5 with tons o' cache.
As far as I'm concerned, the traditional RAID features on an array do not
offer me much when using those LUNs with ZFS; I lose that end-to-end
guarantee. But the arrays are still useful from a performance perspective
because of their caching, LUN management, and FC-related abilities, which
are things a JBOD largely lacks.

/dale
Gregory Shaw wrote:
> Not at all. ZFS is a quantum leap in Solaris filesystem/VM functionality.

Agreed.

> However, I don't see a lot of use for RAID-Z (or Z2) in large
> enterprise customers' situations. For instance, does ZFS enable Sun to
> walk into an account and say "You can now replace all of your high-end
> (EMC) disk with JBOD."? I don't think many customers would bite on that.

I don't see this happening, for organizational reasons more than technical
reasons -- the folks who manage storage are usually different than the
folks who specify OSes. Rather, I think they complement each other.

More interesting is the entrepreneur who builds a storage array using ZFS
in the back end. Leveraging ZFS could save a lot of feature development
work.

> RAID-Z is an excellent feature, however, it doesn't address many of the
> reasons for using high-end arrays:
>
> - Exporting snapshots to alternate systems (for live database or backup
>   purposes)
> - Remote replication
> - Sharing of storage among multiple systems (LUN masking and equivalent)
> - Storage management (migration between tiers of storage)
> - No-downtime failure replacement (the system doesn't even know)
> - Clustering

This list is beyond the scope of ZFS itself. I could see ZFS playing a part
in such solutions; Sun Cluster will support it, for example.

> I know that ZFS is still a work in progress, so some of the above may
> arrive in future versions of the product.
>
> I see the RAID-Z[2] value in small-to-mid size systems where the storage
> is relatively small and you don't have high availability requirements.

I see it being very applicable to any high availability requirement. If you
think of availability as a continuum, it gets you a little bit closer to
perfect availability no matter what else is in the system.

 -- richard
Darren J Moffat wrote:
> Peter Rival wrote:
>
>> storage arrays with the same arguments over and over without
>> providing an answer to the customer problem doesn't do anyone any
>> good. So. I'll restate the question. I have a 10TB database that's
>> spread over 20 storage arrays that I'd like to migrate to ZFS. How
>> should I configure the storage array? Let's at least get that
>> conversation moving...
>
> I'll answer your question with more questions:
>
> What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ?
>
> What of that doesn't work for you ?
>
> What functionality of ZFS is it that you want to leverage ?

It seems that the big thing we all want (relative to the discussion of
moving HW RAID to ZFS) from ZFS is the block checksumming (i.e. how to
reliably detect that a given block is bad, and have ZFS compensate). Now,
how do we get this when using HW arrays, without just treating them like
JBODs (which is impractical for large SAN and similar arrays that are
already configured)?

Since the best way to get this is to use a mirror or RAIDZ vdev, I'm
assuming that the proper way to get benefits from both ZFS and HW RAID is
the following:

(1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror hwStripe1
    hwStripe2"
(2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1
    hwMirror2"
(3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1
    hwStripe2"

Mirrors of mirrors and RAIDZ of RAID5 are also possible, but I'm pretty
sure they're considerably less useful than the 3 above.

Personally, I can't think of a good reason to use ZFS with HW RAID5; case
(3) above seems to me to provide better performance with roughly the same
amount of redundancy (not quite true, but close).

I'd vote for (1) if you need high performance, at the cost of disk space,
(2) for maximum redundancy, and (3) as maximum space with reasonable
performance.

I'm making a couple of assumptions here:

(a) you have the spare cycles on your hosts to allow for using ZFS RAIDZ,
    which is a non-trivial cost (though not that big, folks).
(b) your HW RAID controller uses NVRAM (or battery-backed cache), which
    you'd like to be able to use to speed up writes
(c) your HW RAID's NVRAM speeds up ALL writes, regardless of the
    configuration of arrays in the HW
(d) having your HW controller present individual disks to the machines is
    a royal pain (way too many, the HW does other nice things with arrays,
    etc.)

Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
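For concreteness, a rough sketch of the three layouts above, using the
placeholder LUN names from the message (hwStripe*, hwMirror*) rather than
real device paths; pick one of the three, each creates a pool named "tank":

  # (1) ZFS mirror across two hardware stripes
  zpool create tank mirror hwStripe1 hwStripe2

  # (2) ZFS RAID-Z across hardware mirrors (three or more is typical)
  zpool create tank raidz hwMirror1 hwMirror2 hwMirror3

  # (3) ZFS RAID-Z across hardware stripes
  zpool create tank raidz hwStripe1 hwStripe2 hwStripe3

  # sanity-check the resulting layout
  zpool status tank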
One of the key points here is that people seem focused on two types of
errors:

1. Total drive failure
2. Bit rot

Traditional RAID solves #1. Reed-Solomon ECC found in all modern drives
solves #2 for all but the most extreme cases. The real problem is the
rising complexity of firmware in modern drives and the reality of software
bugs. Misdirected reads and writes and phantom writes are all real
phenomena, and while more prevalent in SATA and commodity drives, are by no
means restricted to the low end. This type of corruption happens
everywhere, and results in corruption that is undetectable by drive
firmware. We've seen these failures in SCSI, FC, and SATA drives. At a
large storage company, a common story related to us was that they would see
approximately one silently corrupted block per 9 TB of storage (on high-end
FC drives).

As mentioned previously, traditional RAID can detect these failures, but
cannot repair the damaged data. Also, as pointed out previously, ZFS can
detect failures in the entire data path, up to the point where it reaches
main memory (at which point FMA takes over). Once again, bad switches,
cables, and drivers are a reality of life.

There will always be a tradeoff between hardware RAID and RAID-Z. But
saying that RAID-Z provides no discernible benefit over hardware RAID is a
lie, and has been disproven time and again by its ability to detect and
correct otherwise silent data corruption, even on top of hardware RAID. You
are welcome to argue that people will make a judgement call and choose
performance/familiarity over RAID-Z in the datacenter, but that is a matter
of opinion that can only be settled by watching the evolution of ZFS
deployment over the next five years.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote:
> This is getting pretty picky. You're saying that ZFS will detect any
> errors introduced after ZFS has gotten the data. However, as stated
> in a previous post, that doesn't guarantee that the data given to ZFS
> wasn't already corrupted.

There will always be some place where errors can be introduced and go
undetected. But some parts of the system are more error prone than others,
and ZFS targets the most error prone of them: rotating rust.

For the rest, make sure you have ECC memory and that you're using secure
NFS (with krb5i or krb5p), and the probability of undetectable data
corruption errors should be much closer to zero than what you'd get with
other systems.

That said, there's a proposal to add end-to-end data checksumming to NFSv4
(see the IETF NFSv4 WG list archives). That proposal can't protect
meta-data, and it doesn't remove any one type of data corruption error on
the client side, but it does on the server side.

Nico
--
On 6/27/06, Erik Trimble <Erik.Trimble at sun.com> wrote:> Darren J Moffat wrote: > > > Peter Rival wrote: > > > >> storage arrays with the same arguments over and over without > >> providing an answer to the customer problem doesn''t do anyone any > >> good. So. I''ll restate the question. I have a 10TB database that''s > >> spread over 20 storage arrays that I''d like to migrate to ZFS. How > >> should I configure the storage array? Let''s at least get that > >> conversation moving... > > > > > > I''ll answer your question with more questions: > > > > What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ? > > > > What of that doesn''t work for you ? > > > > What functionality of ZFS is it that you want to leverage ? > > > It seems that the big thing we all want (relative to the discussion of > moving HW RAID to ZFS) from ZFS is the block checksumming (i.e. how to > reliabily detect that a given block is bad, and have ZFS compensate). > Now, how do we get things when using HW arrays, and not just treat them > like JBODs (which is impractical for large SAN and similar arrays that > are already configured). > > Since the best way to get this is to use a Mirror or RAIDZ vdev, I''m > assuming that the proper way to get benefits from both ZFS and HW RAID > is the following: > > (1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror > hwStripe1 hwStripe2" > (2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1, > hwMirror2" > (3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1, > hwStripe2" > > mirrors of mirrors and raidz of raid5 is also possible, but I''m pretty > sure they''re considerably less useful than the 3 above. > > Personally, I can''t think of a good reason to use ZFS with HW RAID5; > case (3) above seems to me to provide better performance with roughly > the same amount of redundancy (not quite true, but close). > > I''d vote for (1) if you need high performance, at the cost of disk > space, (2) for maximum redundancy, and (3) as maximum space with > reasonable performance. > > > I''m making a couple of assumptions here: > > (a) you have the spare cycles on your hosts to allow for using ZFS > RAIDZ, which is a non-trivial cost (though not that big, folks). > (b) your HW RAID controller uses NVRAM (or battery-backed cache), which > you''d like to be able to use to speed up writes > (c) you HW RAID''s NVRAM speeds up ALL writes, regardless of the > configuration of arrays in the HW > (d) having your HW controller present individual disks to the machines > is a royal pain (way too many, the HW does other nice things with > arrays, etc) > >The case for HW RAID 5 with ZFS is easy: when you use iscsi. You get major performance degradation over iscsi when trying to coordinate writes and reads serially over iscsi using RAIDZ. The sweet spot in the iscsi world is let your targets do RAID5 or whatnot (RAID10, RAID50, RAID6), and combine those into ZFS pools, mirrored or not. There are other benefits to ZFS, including snapshots, easily managed storage pools, and with iscsi, ease of switching head nodes with a simple export/import.> > Erik Trimble > Java System Support > Mailstop: usca14-102 > Phone: x17195 > Santa Clara, CA > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
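A rough sketch of the iSCSI approach described above, assuming two
hypothetical LUNs (the c2t0d0/c3t0d0 names are placeholders for whatever
the initiator presents) that are each a hardware RAID-5 volume on the
target side; the pool mirrors them, and export/import is what moves the
pool between head nodes:

  # mirror two hardware-RAID5 iSCSI LUNs into one pool
  zpool create tank mirror c2t0d0 c3t0d0

  # to switch head nodes: export on the old host ...
  zpool export tank

  # ... and import on the new host
  zpool import tank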
Casper.Dik at Sun.COM wrote:
>> That's the dilemma, the array provides nice features like RAID1 and
>> RAID5, but those are of no real use when using ZFS.
>
> RAID5 is not a "nice" feature when it breaks.
>
> A RAID controller cannot guarantee that all bits of a RAID5 stripe
> are written when power fails; then you have data corruption and no
> one can tell you what data was corrupted. ZFS RAIDZ can.

That depends on the RAID controller. Some implementations use a log *and* a
battery backup. In some cases the battery is an embedded UPS of sorts, to
make sure the power stays up long enough to take the writes from the host
and get them to disk.
Darren J Moffat wrote:
> So everything you are saying seems to suggest you think ZFS was a
> waste of engineering time since hardware raid solves all the problems ?
>
> I don't believe it does but I'm no storage expert and maybe I've drank
> too much cool aid. I'm a software person and for me ZFS is brilliant; it
> is so much easier than managing any of the hardware raid systems I've
> dealt with.

ZFS is great....for the systems that can run it. However, any enterprise
datacenter is going to be made up of many, many hosts running many
different OSes. In that world you're going to consolidate on large arrays
and use the features of those arrays where they cover the most ground. For
example, if I've got 100 hosts all running different OSes and apps, and I
can perform my data replication and redundancy algorithms (in most cases
RAID) in one spot, then it will be much more cost efficient to do it there.
Your example would prove more effective if you added, "I''ve got ten databases. Five on AIX, Five on Solaris 8...." Peter Rival wrote:> I don''t like to top-post, but there''s no better way right now. This > issue has recurred several times and there have been no answers to it > that cover the bases. The question is, say I as a customer have a > database, let''s say it''s around 8 TB, all built on a series of high > end storage arrays that _don''t_ support the JBOD everyone seems to > want - what is the preferred configuration for my storage arrays to > present LUNs to the OS for ZFS to consume? > > Let''s say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 - > that spans the breadth of about as good as it gets. What should I as > a customer do? Should I create RAID0 sets and let ZFS self-heal via > its own mirroring or RAIDZ when a disk blows in the set? Should I use > RAID1 and eat the disk space used? RAID5 and be thankful I have a > large write cache - and then which type of ZFS pool should I create > over it? > > See, telling folks "you should just use JBOD" when they don''t have > JBOD and have invested millions to get to state they''re in where > they''re efficiently utilizing their storage via a SAN infrastructure > is just plain one big waste of everyone''s time. Shouting down the > advantages of storage arrays with the same arguments over and over > without providing an answer to the customer problem doesn''t do anyone > any good. So. I''ll restate the question. I have a 10TB database > that''s spread over 20 storage arrays that I''d like to migrate to ZFS. > How should I configure the storage array? Let''s at least get that > conversation moving... > > - Pete > > Gregory Shaw wrote: >> Yes, but the idea of using software raid on a large server doesn''t >> make sense in modern systems. If you''ve got a large database server >> that runs a large oracle instance, using CPU cycles for RAID is >> counter productive. Add to that the need to manage the hardware >> directly (drive microcode, drive brownouts/restarts, etc.) and the >> idea of using JBOD in modern systems starts to lose value in a big way. >> >> You will detect any corruption when doing a scrub. It''s not >> end-to-end, but it''s no worse than today with VxVM. >> >> On Jun 26, 2006, at 6:09 PM, Nathanael Burton wrote: >> >>>> If you''ve got hardware raid-5, why not just run >>>> regular (non-raid) >>>> pools on top of the raid-5? >>>> >>>> I wouldn''t go back to JBOD. Hardware arrays offer a >>>> number of >>>> advantages to JBOD: >>>> - disk microcode management >>>> - optimized access to storage >>>> - large write caches >>>> - RAID computation can be done in specialized >>>> d hardware >>>> - SAN-based hardware products allow sharing of >>>> f storage among >>>> multiple hosts. This allows storage to be utilized >>>> more effectively. >>>> >>> >>> I''m a little confused by the first poster''s message as well, but you >>> lose some benefits of ZFS if you don''t create your pools with either >>> RAID1 or RAIDZ, such as data corruption detection. The array isn''t >>> going to detect that because all it knows about are blocks. >>> >>> -Nate >>> >>> >
Casper.Dik at Sun.COM wrote:
>
> I'll bet that ZFS will generate more calls about broken hardware
> and fingers will be pointed at ZFS at first because it's the new
> kid; it will be some time before people realize that the data was
> rotting all along.

Ehhh....I don't think so. Most of our customers have HW arrays that have
been scrubbing data for years and years, as well as apps on top that have
been verifying the data (Oracle, for example). Not to mention there will be
a bit of time before people move over to ZFS in the high end.
Torrey McMahon wrote:
> ZFS is great....for the systems that can run it. However, any enterprise
> datacenter is going to be made up of many many hosts running many many
> OS. In that world you're going to consolidate on large arrays and use
> the features of those arrays where they cover the most ground. For
> example, if I've 100 hosts all running different OS and apps and I can
> perform my data replication and redundancy algorithms, in most cases
> Raid, in one spot then it will be much more cost efficient to do it there.

Exactly what I'm pondering. In the near to mid term, Solaris with ZFS can
be seen as a sort of storage virtualizer: it takes disks into ZFS pools and
volumes and then presents them to other hosts and OSes via iSCSI, NFS, SMB
and so on. At that point, those other OSes can enjoy the benefits of ZFS.

In the long term, it would be nice to see ZFS (or its concepts) integrated
as the LUN provisioning and backing-store mechanism on hardware RAID arrays
themselves, supplanting the traditional RAID paradigms that have been in
use for years.

/dale
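A minimal sketch of that near-term "storage virtualizer" role, assuming a
pool named tank and hypothetical dataset names; the NFS side uses the
standard sharenfs property, while exporting the volume to other hosts
(e.g. over iSCSI) would depend on whatever target software sits in front
of it:

  # a filesystem served to other hosts over NFS
  zfs create tank/export
  zfs set sharenfs=on tank/export

  # a raw volume (zvol) that could back another host's storage
  zfs create -V 100g tank/vol01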
Torrey McMahon wrote:
> Casper.Dik at Sun.COM wrote:
>>
>> I'll bet that ZFS will generate more calls about broken hardware
>> and fingers will be pointed at ZFS at first because it's the new
>> kid; it will be some time before people realize that the data was
>> rotting all along.
>
> Ehhh....I don't think so. Most of our customers have HW arrays that
> have been scrubbing data for years and years as well as apps on the
> top that have been verifying the data. (Oracle for example.) Not to
> mention there will be a bit of time before people move over to ZFS in
> the high end.

Ahh... but there is the rub. Today, you/we don't *really* know, do we?
Maybe there are bad juju blocks, maybe not. Running ZFS, whether in a
redundant vdev or not, will certainly turn the big spotlight on and give us
the data that checksums matched, or they didn't. And if we are in redundant
vdevs, hey - we'll fix it. If not, well, we are certainly no worse off than
with today's filesystems, but at least we'll know the bad juju is there.

How do the numbers of checksum mismatches compare across different
types/vendors/costs of storage subsystems? SLAs based on the number of bad
checksums? Price cuts on storage that routinely gives back data with bad
checksums? Now, that is what will be interesting to me to see....

ZFS, the DTrace of storage - no more guessing, just data.

/jason
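For reference, a minimal sketch of the commands that surface those mismatch
counts, assuming a pool named tank; the per-device CKSUM column in the
status output is where the mismatches show up:

  # kick off a scrub of the whole pool
  zpool scrub tank

  # check progress and the per-vdev READ/WRITE/CKSUM error counters
  zpool status -v tank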
Jason Schroeder wrote:> Torrey McMahon wrote: > >> Casper.Dik at Sun.COM wrote: >> >>> >>> I''ll bet that ZFS will generate more calls about broken hardware >>> and fingers will be pointed at ZFS at first because it''s the new >>> kid; it will be some time before people realize that the data was >>> rotting all along. >> >> >> >> Ehhh....I don''t think so. Most of our customers have HW arrays that >> have been scrubbing data for years and years as well as apps on the >> top that have been verifying the data. (Oracle for example.) Not to >> mention there will be a bit of time before people move over to ZFS in >> the high end. >> > > Ahh... but there is the rub. Today - you/we don''t *really* know, do > we? Maybe there''s bad juju blocks, maybe not. Running ZFS, whether > in a redundant vdev or not, will certainly turn the big spotlight on > and give us the data that checksums matched, or they didn''t.A spotlight on what? How is that data going to get into ZFS? The more I think about this more I realize it''s going to do little for existing data sets. You''re going to have to migrate that data from "filesystem X" into ZFS first. From that point on ZFS has no idea if the data was bad to begin with. If you can do an in place migration then you might be able to weed out some bad physical blocks/drives over time but I assert that the current disk scrubbing methodologies catch most of those. Yes, it''s great for new data sets where you started with ZFS. Sorry if I sound like I''m raining on the parade here folks. That''s not the case, really, and I''m all for the great new features and EAU ZFS gives where applicable.
Nicolas Williams wrote:> On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote: >> This is getting pretty picky. You''re saying that ZFS will detect any >> errors introduced after ZFS has gotten the data. However, as stated >> in a previous post, that doesn''t guarantee that the data given to ZFS >> wasn''t already corrupted. > > There will always be some place where errors can be introduced and go on > undetected. But some parts of the system are more error prone than > others, and ZFS targets the most error prone of them: rotating rust. > > For the rest, make sure you have ECC memory, that you''re using secure > NFS (with krb5i or krb5p), and the probability of undetectable data > corruption errors should be much closer to zero than what you''d get with > other systems.Another alternative is using IPsec with just AH. For the benefit of those outside of Sun MPK17 both krb5i and IPsec AH were used to diagnose and prove that we have a faulty router in a lab that was causing very strange build errors. TCP/IP alone didn''t catch the problems and sometimes they showed up with SCCS simple checksums and sometimes we had compile errors. -- Darren J Moffat
Torrey McMahon wrote:> Darren J Moffat wrote: >> So everything you are saying seems to suggest you think ZFS was a >> waste of engineering time since hardware raid solves all the problems ? >> >> I don''t believe it does but I''m no storage expert and maybe I''ve drank >> too much cool aid. I''m software person and for me ZFS is brilliant it >> is so much easier than managing any of the hardware raid systems I''ve >> dealt with. > > > ZFS is great....for the systems that can run it. However, any enterprise > datacenter is going to be made up of many many hosts running many many > OS. In that world you''re going to consolidate on large arrays and use > the features of those arrays where they cover the most ground. For > example, if I''ve 100 hosts all running different OS and apps and I can > perform my data replication and redundancy algorithms, in most cases > Raid, in one spot then it will be much more cost efficient to do it there.but you still need a local file system on those systems in many cases. So back to where we started I guess, how to effectively use ZFS to benefit Solaris (and the other platforms it gets ported to) while still using Hardware RAID because you have no choice but to use it. -- Darren J Moffat
On Tue, 27 Jun 2006 Casper.Dik at sun.com wrote:
>
> >This is getting pretty picky. You're saying that ZFS will detect any
> >errors introduced after ZFS has gotten the data. However, as stated
> >in a previous post, that doesn't guarantee that the data given to ZFS
> >wasn't already corrupted.
>
> But there's a big difference between the time ZFS gets the data
> and the time your typical storage system gets it.
>
> And your typical storage system does not store any information which
> allows it to detect all but the most simple errors.
>
> Storage systems are complicated and have many failure modes at many
> different levels.
>
> - disks not writing data or writing data in incorrect location
> - disks not reporting failures when they occur
> - bit errors in disk write buffers causing data corruption
> - storage array software with bugs

Case in point: there was a gentleman who posted on the Yahoo Groups solx86
list and described how faulty firmware on a Hitachi HDS system damaged a
bunch of data. The HDS system moves disk blocks around, between one disk
and another, in the background, to optimize the filesystem layout. Long
after he had written data, blocks from one data set were intermingled with
blocks from other data sets/files, causing extensive data corruption. I
know this is a simplistic explanation (and perhaps technically inaccurate)
of the exact failure mode - but the effect was that a lot of data was
silently corrupted and went undiscovered for several days.

.... snip .....

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
             OpenSolaris Governing Board (OGB) Member - Feb 2006
On Tue, 27 Jun 2006, Gregory Shaw wrote:> Yes, but the idea of using software raid on a large server doesn''t > make sense in modern systems. If you''ve got a large database server > that runs a large oracle instance, using CPU cycles for RAID is > counter productive. Add to that the need to manage the hardware > directly (drive microcode, drive brownouts/restarts, etc.) and the > idea of using JBOD in modern systems starts to lose value in a big way. > > You will detect any corruption when doing a scrub. It''s not end-to- > end, but it''s no worse than today with VxVM.The initial impression I got, after reading the original post, is that its author[1] does not grok something fundamental about ZFS and/or how it works! Or does not understand that there are many CPU cycles in a modern Unix box that are never taken advantage of. It''s clear to me that ZFS provides considerable, never before available, features and facilities, and that any impact that ZFS may have on CPU or memory utilization will become meaningless over time, as the # of CPU cores increase - along with their performance. And that average system memory size will continue to increase over time. Perhaps the author is taking a narrow view that ZFS will *replace* existing systems. I don''t think that this will be the general case. Especially in a large organization where politics and turf wars will dominate any "technical" discussions and implementation decisions will be made by senior management who are 100% buzzword compliant (and have questionable technical/engineering skills). Rather it will provide the system designer with a hugely powerful *new* tool to apply in system design. And will challenge the designer to use it creatively and effectively. There is no such thing as the universal screw-driver. Every toolbox has tens of screwdrivers and tool designers will continue to dream about replacing them all with _one_ tool. [1] Sorry Gregory. Regards, Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006
Al Hopper wrote:
> On Tue, 27 Jun 2006, Gregory Shaw wrote:
>
>> Yes, but the idea of using software raid on a large server doesn't
>> make sense in modern systems. If you've got a large database server
>> that runs a large oracle instance, using CPU cycles for RAID is
>> counter productive. Add to that the need to manage the hardware
>> directly (drive microcode, drive brownouts/restarts, etc.) and the
>> idea of using JBOD in modern systems starts to lose value in a big way.
>>
>> You will detect any corruption when doing a scrub. It's not end-to-
>> end, but it's no worse than today with VxVM.
>
> The initial impression I got, after reading the original post, is that its
> author[1] does not grok something fundamental about ZFS and/or how it
> works! Or does not understand that there are many CPU cycles in a modern
> Unix box that are never taken advantage of.

Just because there are idle CPU cycles does not mean it is OK for the
operating system to use them. If there is a valid reason for the OS to
consume those cycles then that is fine. But every cycle that the OS
consumes is one less cycle that is available for the customer apps (be it
Oracle or whatever, and I spend a lot of my time trying to squeeze those
cycles out of high-end systems). The job of the operating system is to get
the hell out of the way as quickly as possible so the user apps can do
their work. That can mean offloading some of the work onto smart arrays. As
someone once said to me, a customer does not buy hardware to run an OS on;
they buy it to accomplish some given piece of work.

> It's clear to me that ZFS provides considerable, never before available,
> features and facilities, and that any impact that ZFS may have on CPU or
> memory utilization will become meaningless over time, as the # of CPU
> cores increase - along with their performance. And that average system
> memory size will continue to increase over time.

This is true, will probably be true forever, and has been going on ever
since the first chip. There has always been demand for more power by the
end users. However, just because we have available cycles does not mean the
OS should consume them.

> Perhaps the author is taking a narrow view that ZFS will *replace*
> existing systems. I don't think that this will be the general case.
> Especially in a large organization where politics and turf wars will
> dominate any "technical" discussions and implementation decisions will be
> made by senior management who are 100% buzzword compliant (and have
> questionable technical/engineering skills). Rather it will provide the
> system designer with a hugely powerful *new* tool to apply in system
> design. And will challenge the designer to use it creatively and
> effectively.

It all depends on your needs. The idea of ZFS providing RAID capabilities
is very appealing for those systems that are desktop units or small
servers. But where we are talking petabyte+ storage with 30+ gig/sec of I/O
bandwidth capacity, I believe we will find the CPUs are going to consume
way too much to handle the I/O rate in such an environment, at which time
the work needs to be offloaded to smart arrays (I have yet to do that
experimentation). You do not buy an 18-wheel tractor trailer to simply move
a lawnmower from job site to job site; you buy an SUV, pickup truck or
trailer. Vice versa, you do not buy a pickup truck to move a tracked
excavator; you use a tractor trailer.

> There is no such thing as the universal screw-driver. Every toolbox has
> tens of screwdrivers and tool designers will continue to dream about
> replacing them all with _one_ tool.

How true. ZFS is one of many tools available. However, the impression I
have been picking up here at various times is that a lot of people view ZFS
as the only tool in the toolbox, so everything looks like a nail because
all you have is a hammer.

If ZFS is providing better data integrity than the current storage arrays,
that sounds to me like an opportunity for the next generation of
intelligent arrays to become better.

Dave Valin

> [1] Sorry Gregory.
>
> Regards,
>
> Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
>            Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
> OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
> OpenSolaris Governing Board (OGB) Member - Feb 2006
On Tue, 2006-06-27 at 17:50, Erik Trimble wrote:> Since the best way to get this is to use a Mirror or RAIDZ vdev, I''m > assuming that the proper way to get benefits from both ZFS and HW RAID > is the following: > > (1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror > hwStripe1 hwStripe2" > (2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1, > hwMirror2" > (3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1, > hwStripe2" > > mirrors of mirrors and raidz of raid5 is also possible, but I''m pretty > sure they''re considerably less useful than the 3 above. > > Personally, I can''t think of a good reason to use ZFS with HW RAID5; > case (3) above seems to me to provide better performance with roughly > the same amount of redundancy (not quite true, but close).You really need some level of redundancy if you''re using HW raid. Using plain stripes is downright dangerous. 0+1 vs 1+0 and all that. Seems to me that the simplest way to go is to use zfs to mirror HW raid5, preferably with the HW raid5 LUNs being completely independent disks attached to completely independent controllers with no components or datapaths in common. -- -Peter Tribble L.I.S., University of Hertfordshire - http://www.herts.ac.uk/ http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
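As a concrete sketch of that suggestion, assuming two hypothetical RAID-5
LUNs, one from each of two independent arrays/controllers (arrayA_r5 and
arrayB_r5 are placeholder names, not real device paths), the ZFS side is a
single mirrored pool:

  # ZFS mirror across two independent hardware RAID-5 LUNs
  zpool create tank mirror arrayA_r5 arrayB_r5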
Darren J Moffat wrote:> Torrey McMahon wrote: >> Darren J Moffat wrote: >>> So everything you are saying seems to suggest you think ZFS was a >>> waste of engineering time since hardware raid solves all the problems ? >>> >>> I don''t believe it does but I''m no storage expert and maybe I''ve >>> drank too much cool aid. I''m software person and for me ZFS is >>> brilliant it is so much easier than managing any of the hardware >>> raid systems I''ve dealt with. >> >> >> ZFS is great....for the systems that can run it. However, any >> enterprise datacenter is going to be made up of many many hosts >> running many many OS. In that world you''re going to consolidate on >> large arrays and use the features of those arrays where they cover >> the most ground. For example, if I''ve 100 hosts all running different >> OS and apps and I can perform my data replication and redundancy >> algorithms, in most cases Raid, in one spot then it will be much more >> cost efficient to do it there. > > but you still need a local file system on those systems in many cases. > > So back to where we started I guess, how to effectively use ZFS to > benefit Solaris (and the other platforms it gets ported to) while > still using Hardware RAID because you have no choice but to use it. >Too many variables in an overall storage environment. This is why I always jump on people that say, "Dude! You''ve got ZFS. Just use JBODs". They''re not based in a reality outside of the ones that constitute a brand new workstation or SMB server....and we don''t really target that market these days. You need to clearly define what the environment is, what the data growth will look like, what apps are going to be deployed, replication requirements, etc. It''s the way things have been for years. ZFS just changes a couple of variables. It doesn''t eliminate them or turn the equation into anything easier to solve.
On Jun 27, 2006, at 3:30 PM, Al Hopper wrote:> On Tue, 27 Jun 2006, Gregory Shaw wrote: > >> Yes, but the idea of using software raid on a large server doesn''t >> make sense in modern systems. If you''ve got a large database server >> that runs a large oracle instance, using CPU cycles for RAID is >> counter productive. Add to that the need to manage the hardware >> directly (drive microcode, drive brownouts/restarts, etc.) and the >> idea of using JBOD in modern systems starts to lose value in a big >> way. >> >> You will detect any corruption when doing a scrub. It''s not end-to- >> end, but it''s no worse than today with VxVM. > > The initial impression I got, after reading the original post, is > that its > author[1] does not grok something fundamental about ZFS and/or how it > works! Or does not understand that there are many CPU cycles in a > modern > Unix box that are never taken advantage of. > > It''s clear to me that ZFS provides considerable, never before > available, > features and facilities, and that any impact that ZFS may have on > CPU or > memory utilization will become meaningless over time, as the # of CPU > cores increase - along with their performance. And that average > system > memory size will continue to increase over time. > > Perhaps the author is taking a narrow view that ZFS will *replace* > existing systems. I don''t think that this will be the general case. > Especially in a large organization where politics and turf wars will > dominate any "technical" discussions and implementation decisions > will be > made by senior management who are 100% buzzword compliant (and have > questionable technical/engineering skills). Rather it will provide > the > system designer with a hugely powerful *new* tool to apply in system > design. And will challenge the designer to use it creatively and > effectively. > > There is no such thing as the universal screw-driver. Every > toolbox has > tens of screwdrivers and tool designers will continue to dream about > replacing them all with _one_ tool. > > [1] Sorry Gregory. > > Regards, > > Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com > Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT > OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 > OpenSolaris Governing Board (OGB) Member - Feb 2006 > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussNo insult taken. I was trying to point out that many customers don''t have ''free'' cpu cycles, and that every little bit you take from their machine for subsystem control is that much real work the system will not be doing. I think of the statement of "many cpu cycles in modern unix boxes that are never taken advantage of" in the similar vein of monitoring vendors: "It''s just another agent. It won''t take more than 5% of the box." I think we''ll let the customer decide on the above. I''ve encountered both situations: customers with large boxes with plenty of headroom, and customers that run 100% all day, every day and have no cycles that aren''t dedicated to real work. When I read as a ex-customer (e.g. not with Sun) that I''ve got to sacrifice cpu cycles in a software upgrade, it says to me that the upgrade will result in a slower system. ----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote:
> Case in point, there was a gentleman who posted on the Yahoo Groups solx86
> list and described how faulty firmware on a Hitach HDS system damaged a
> bunch of data. The HDS system moves disk blocks around, between one disk
> and another, in the background, to optimized the filesystem layout. Long
> after he had written data, blocks from one data set were intermingled with
> blocks for other data sets/files causing extensive data corruption.

Al,

the problem you described probably comes from failures in the firmware
code, not a failure of the disk surface. Sun's engineers can also make some
mistakes in ZFS code, right ?

przemol
Hello David, Wednesday, June 28, 2006, 12:30:54 AM, you wrote: DV> If ZFS is providing better data integrity then the current storage DV> arrays, that sounds like to me an opportunity for the next generation DV> of intelligent arrays to become better. Actually they can''t. If you want end-to-end data integrity it has to be checked on a server. -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
Hello Erik,

Tuesday, June 27, 2006, 6:50:52 PM, you wrote:

ET> Personally, I can't think of a good reason to use ZFS with HW RAID5;
ET> case (3) above seems to me to provide better performance with roughly
ET> the same amount of redundancy (not quite true, but close).

I can see a reason. In our environment it looks like HW raid-5 could
actually be faster than raid-z. We have lots of small random I/Os - enough
that caching doesn't help. I don't have actual data, as it's production and
unfortunately there wasn't time to test it - just the "feeling". I believe
raid-z will offer much better write performance in most scenarios, but not
necessarily better read performance.

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
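For anyone who wants to measure rather than guess, a minimal sketch of the
two layouts to compare, assuming a hypothetical RAID-5 LUN exported by the
array (hwR5lun) versus the same spindles presented individually (d1..d5 are
placeholder names); run the small-random-I/O workload against each in turn:

  # pool on top of one hardware RAID-5 LUN
  zpool create tankhw hwR5lun

  # the same disks as a software raidz group
  zpool create tankz raidz d1 d2 d3 d4 d5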
Hello Peter,

Wednesday, June 28, 2006, 1:11:29 AM, you wrote:

PT> On Tue, 2006-06-27 at 17:50, Erik Trimble wrote:

PT> You really need some level of redundancy if you're using HW raid.
PT> Using plain stripes is downright dangerous. 0+1 vs 1+0 and all
PT> that. Seems to me that the simplest way to go is to use zfs to mirror
PT> HW raid5, preferably with the HW raid5 LUNs being completely
PT> independent disks attached to completely independent controllers
PT> with no components or datapaths in common.

Well, it will give you less than half your raw storage. Due to cost, I
believe in most cases it won't be acceptable. People use raid-5 mostly
because of cost, and you are proposing something worse (in terms of
available logical storage) than mirroring.

--
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
On Wed, 28 Jun 2006 przemolicc at poczta.fm wrote:> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote: > > Case in point, there was a gentleman who posted on the Yahoo Groups solx86 > > list and described how faulty firmware on a Hitach HDS system damaged a > > bunch of data. The HDS system moves disk blocks around, between one disk > > and another, in the background, to optimized the filesystem layout. Long > > after he had written data, blocks from one data set were intermingled with > > blocks for other data sets/files causing extensive data corruption. > > Al, > > the problem you described comes probably from failures in code of firmware > not the failure of disk surface. Sun''s engineers can also do some mistakes > in ZFS code, right ?Yes! Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006
Hello przemolicc, Wednesday, June 28, 2006, 10:57:17 AM, you wrote: ppf> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote:>> Case in point, there was a gentleman who posted on the Yahoo Groups solx86 >> list and described how faulty firmware on a Hitach HDS system damaged a >> bunch of data. The HDS system moves disk blocks around, between one disk >> and another, in the background, to optimized the filesystem layout. Long >> after he had written data, blocks from one data set were intermingled with >> blocks for other data sets/files causing extensive data corruption.ppf> Al, ppf> the problem you described comes probably from failures in code of firmware ppf> not the failure of disk surface. Sun''s engineers can also do some mistakes ppf> in ZFS code, right ? But the point is that ZFS should detect also such errors and take proper actions. Other filesystems can''t. And of course there are bugs in ZFS :P -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:> Hello przemolicc, > > Wednesday, June 28, 2006, 10:57:17 AM, you wrote: > > ppf> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote: > >> Case in point, there was a gentleman who posted on the Yahoo Groups solx86 > >> list and described how faulty firmware on a Hitach HDS system damaged a > >> bunch of data. The HDS system moves disk blocks around, between one disk > >> and another, in the background, to optimized the filesystem layout. Long > >> after he had written data, blocks from one data set were intermingled with > >> blocks for other data sets/files causing extensive data corruption. > > ppf> Al, > > ppf> the problem you described comes probably from failures in code of firmware > ppf> not the failure of disk surface. Sun''s engineers can also do some mistakes > ppf> in ZFS code, right ? > > But the point is that ZFS should detect also such errors and take > proper actions. Other filesystems can''t.Does it mean that ZFS can detect errors in ZFS''s code itself ? ;-) What I wanted to point out is the Al''s example: he wrote about damaged data. Data were damaged by firmware _not_ disk surface ! In such case ZFS doesn''t help. ZFS can detect (and repair) errors on disk surface, bad cables, etc. But cannot detect and repair errors in its (ZFS) code. I am comparing firmware code to ZFS code. przemol
Hello,> What I wanted to point out is the Al''s example: he wrote about damaged data. Data > were damaged by firmware _not_ disk surface ! In such case ZFS doesn''t help. ZFS can > detect (and repair) errors on disk surface, bad cables, etc. But cannot detect and repair > errors in its (ZFS) code. > > I am comparing firmware code to ZFS code. >Firmware doesn''t do end to end checksumming. If ZFS code is buggy, the checksums won''t match up anyway, so you still detect errors. Plus it is a lot easier to debug ZFS code than firmware. -- Regards, Jeremy
przemolicc at poczta.fm wrote:
> On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:
>
> What I wanted to point out is the Al's example: he wrote about damaged
> data. Data were damaged by firmware _not_ disk surface ! In such case ZFS
> doesn't help. ZFS can detect (and repair) errors on disk surface, bad
> cables, etc. But cannot detect and repair errors in its (ZFS) code.

If you mean "ZFS doesn't help with firmware problems", that is not true.
For example, if ZFS is mirroring a pool across two different storage
arrays, a firmware error in one of them will cause problems that ZFS will
detect when it tries to read the data. Further, ZFS would be able to
correct the error by reading from the other mirror, unless the second array
also suffered from a firmware error.

There are categories of problems that ZFS cannot handle, mostly regarding
data availability after catastrophes (as Richard E described), but ZFS can
help with many firmware problems.

--
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ:    http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
Hello przemolicc, Wednesday, June 28, 2006, 3:05:42 PM, you wrote: ppf> On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:>> Hello przemolicc, >> >> Wednesday, June 28, 2006, 10:57:17 AM, you wrote: >> >> ppf> On Tue, Jun 27, 2006 at 04:16:13PM -0500, Al Hopper wrote: >> >> Case in point, there was a gentleman who posted on the Yahoo Groups solx86 >> >> list and described how faulty firmware on a Hitach HDS system damaged a >> >> bunch of data. The HDS system moves disk blocks around, between one disk >> >> and another, in the background, to optimized the filesystem layout. Long >> >> after he had written data, blocks from one data set were intermingled with >> >> blocks for other data sets/files causing extensive data corruption. >> >> ppf> Al, >> >> ppf> the problem you described comes probably from failures in code of firmware >> ppf> not the failure of disk surface. Sun''s engineers can also do some mistakes >> ppf> in ZFS code, right ? >> >> But the point is that ZFS should detect also such errors and take >> proper actions. Other filesystems can''t.ppf> Does it mean that ZFS can detect errors in ZFS''s code itself ? ;-) ppf> What I wanted to point out is the Al''s example: he wrote about damaged data. Data ppf> were damaged by firmware _not_ disk surface ! In such case ZFS doesn''t help. ZFS can ppf> detect (and repair) errors on disk surface, bad cables, etc. But cannot detect and repair ppf> errors in its (ZFS) code. Not in its code but definitely in a firmware code in a controller. -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
Robert Milkowski wrote:
> Hello David,
>
> Wednesday, June 28, 2006, 12:30:54 AM, you wrote:
>
> DV> If ZFS is providing better data integrity then the current storage
> DV> arrays, that sounds like to me an opportunity for the next generation
> DV> of intelligent arrays to become better.
>
> Actually they can't.
> If you want end-to-end data integrity it has to be checked on a
> server.

But the checking could be done by a cooperating ZFS module and support in
the hardware array. That would make some of ZFS pluggable, in a way that
parts of it can be delegated to hardware.

--
Darren J Moffat
Jeremy Teo wrote:
> Hello,
>
>> What I wanted to point out is the Al's example: he wrote about
>> damaged data. Data were damaged by firmware _not_ disk surface !
>> In such case ZFS doesn't help. ZFS can detect (and repair) errors
>> on disk surface, bad cables, etc. But cannot detect and repair
>> errors in its (ZFS) code.
>>
>> I am comparing firmware code to ZFS code.
>
> Firmware doesn't do end to end checksumming. If ZFS code is buggy, the
> checksums won't match up anyway, so you still detect errors.
>
> Plus it is a lot easier to debug ZFS code than firmware.

Depends on your definition of firmware. In higher-end arrays the data is
checksummed when it comes in, and a hash is written when it gets to disk.
Of course this is nowhere near end to end, but it is better than nothing.

... and code is code. "Easier to debug" is a context-sensitive term.
On Wed, 2006-06-28 at 09:05, przemolicc at poczta.fm wrote:> > But the point is that ZFS should detect also such errors and take > > proper actions. Other filesystems can''t. > > Does it mean that ZFS can detect errors in ZFS''s code itself ? ;-)In many cases, yes. As a hypothetical: Consider a bug in the file system''s block allocator which causes an allocated on-disk block to be prematurely reused by another file. With UFS, you''re doomed -- one file or the other (or both) will be corrupted and you''ll have no way to tell which one has correct data; all you can do is take the filesystem offline and run fsck on it to prune out the damaged area. With ZFS''s design, because block checksums are an integral part of the block pointer, the checksum error received when reading one or the other file will most likely indicate that something is wrong and these errors will be flagged; with an error of this form, the filesystem will either deliver the correct data to the app or will know that it can''t. - Bill
>Depends on your definition of firmware. In higher end arrays the data is
>checksummed when it comes in and a hash is written when it gets to disk.
>Of course this is no where near end to end but it is better then nothing.

The checksum is often stored with the data (so if the data is not written,
or is written in the wrong location, the checksum is still "valid"). ZFS
stores the checksum with the block pointer, so it knows more about the data
and whether it was proper. ZFS also checksums before the data travels over
the fabric.

>... and code is code. Easier to debug is a context sensitive term.

Uhm, well, firmware, in production systems?

Casper
> Depends on your definition of firmware. In higher end arrays the data
> is checksummed when it comes in and a hash is written when it gets to
> disk. Of course this is no where near end to end but it is better then
> nothing.
>
> ... and code is code. Easier to debug is a context sensitive term.

It's unfortunate that so many posts hinge on the code. It's the design that
protects your data, and with ZFS you have a better design for data
integrity. If the code is faulty, that's a bug, and the design should still
protect you unless your error detection and correction logic itself is
faulty. (I mean, this is like the anti-corruption bureau being corrupt
:-)). There is a huge difference between the ability to detect corruption
versus not knowing that the data is corrupted at all.

Whether the code lives up to the design is what real-world testing shows;
in most cases ZFS should help.

Kiran
Robert Milkowski wrote:
> Hello Peter,
>
> Wednesday, June 28, 2006, 1:11:29 AM, you wrote:
>
> PT> On Tue, 2006-06-27 at 17:50, Erik Trimble wrote:
>
> PT> You really need some level of redundancy if you're using HW raid.
> PT> Using plain stripes is downright dangerous. 0+1 vs 1+0 and all
> PT> that. Seems to me that the simplest way to go is to use zfs to mirror
> PT> HW raid5, preferably with the HW raid5 LUNs being completely
> PT> independent disks attached to completely independent controllers
> PT> with no components or datapaths in common.
>
> well, it will give you less than half your raw storage.
> Due to costs I belive in most cases it won't be acceptable.
> People are using raid-5 mostly due to costs and you are proposing
> something worse (in terms of available logical storage) than
> mirroring.

The main reason I don't see ZFS mirror / HW RAID5 as useful is this:

ZFS mirror / HW RAID5:   capacity = (N / 2) - 1
                         speed << (N / 2) - 1
                         minimum # disks to lose before loss of data: 4
                         maximum # disks to lose before loss of data: (N / 2) + 2

ZFS mirror / HW stripe:  capacity = N / 2
                         speed >= N / 2
                         minimum # disks to lose before loss of data: 2
                         maximum # disks to lose before loss of data: (N / 2) + 1

Given a reasonable number of hot spares, I simply can't see the (very)
marginal increase in safety given by using HW RAID5 as outweighing the
considerable speed hit using RAID5 takes.

Robert - I would definitely like to see the difference between reads on HW
RAID5 vs reads on RAIDZ. Naturally, one of the big concerns I would have is
how much RAM is needed to avoid any cache starvation on the ZFS machine.
I'd discount the NVRAM on the RAID controller, since I'd assume that it
would be dedicated to write acceleration, and not for reads.

My big problem right now is that I only have an old A3500FC to do testing
on, as all my other HW RAID controllers are IBM ServeRAIDs, for which the
Solaris driver isn't really the best.

-Erik
Erik Trimble wrote:> The main reason I don''t see ZFS mirror / HW RAID5 as useful is this: > > ZFS mirror/ RAID5: capacity = (N / 2) -1 > speed << N / 2 -1 > minimum # disks to lose before loss > of data: 4 > maximum # disks to lose before loss > of data: (N / 2) + 2 > > ZFS mirror / HW Stripe capacity = (N / 2) > speed >= N / 2 > minimum # disks to lose before loss > of data: 2 > maximum # disks to lose before loss > of data: (N / 2) + 1 > > Given a reasonable number of hot-spares, I simply can''t see the (very) > marginal increase in safety give by using HW RAID5 as out balancing the > considerable speed hit using RAID5 takes.Eric, Your analysis lacks some very important views of the problem. 0. Probability of failure is not constant across the components involved. 1. Disks don''t tend to fail completely as often as they fail partially. For partial failures, the recovery method is very different for the various hardware RAID types and ZFS. 2. Analysis for data availability is different than analysis for data loss and performance. Typically, we do a performability analysis which shows the relationship between availability and performance. Data loss analysis is handled differently, as it is often measured in years (perhaps tens of thousands of years) and is highly dependent upon maintenance activity. 3. For most hardware RAID arrays, RAID-5 performance is similar to RAID-1+0. In order to assign a value to the performance envelope, something must be known about the workload. RAID-6 or raidz2 performs ??? 4. Scrubbing methods are also different between ZFS and RAID arrays. This does impact latent fault detection which in turn impacts data loss. Depending on requirements, we might recommend something fast, but risky, or something designed to never forget. Saying that some configuration has little value only applies to a specific set of requirements. -- richard
On Jun 28, 2006, at 12:32, Erik Trimble wrote:
> The main reason I don't see ZFS mirror / HW RAID5 as useful is this:
>
> ZFS mirror / RAID5:      capacity = (N / 2) - 1
>                          speed << (N / 2) - 1
>                          minimum # disks to lose before loss of data: 4
>                          maximum # disks to lose before loss of data: (N / 2) + 2

shouldn't that be capacity = ((N - 1) / 2) ?

Loss of a single disk would cause a rebuild on the R5 stripe, which could
affect performance on that side of the mirror. Generally speaking, good RAID
controllers will dedicate processors and channels to calculate the parity and
write it out, so you're not impacted from the host-access point of view. There
is a similar sort of CoW behaviour that can happen between the array cache and
the drives, but in the ideal case you're dealing with this in dedicated hw
instead of shared hw.

> ZFS mirror / HW Stripe:  capacity = N / 2
>                          speed >= N / 2
>                          minimum # disks to lose before loss of data: 2
>                          maximum # disks to lose before loss of data: (N / 2) + 1
>
> Given a reasonable number of hot spares, I simply can't see the (very)
> marginal increase in safety given by using HW RAID5 as outweighing the
> considerable speed hit RAID5 takes.

I think you're comparing this to software R5, or at least badly implemented
array code, and divining that there is a considerable speed hit when using R5.
In practice this is not always the case, provided that the response time and
interaction between the array cache and drives is sufficient for the incoming
stream. By moving your operation to software you're now introducing more
layers between the CPU, L1/L2 cache, memory bus, and system bus before you get
to the interconnect, plus further latencies on the storage port and underlying
device (virtualized or not). Ideally it would be nice to see ZFS-style
improvements in array firmware, but given the state of embedded Solaris and
the predominance of 32-bit controllers, I think we're going to have some
issues. We'd also need some sort of client mechanism to interact with the
array if we're talking about moving the filesystem layer out there .. just a
thought

Jon E
On Wed, Jun 28, 2006 at 11:15:34AM +0200, Robert Milkowski wrote:
> DV> If ZFS is providing better data integrity than the current storage
> DV> arrays, that sounds to me like an opportunity for the next generation
> DV> of intelligent arrays to become better.
>
RM> Actually they can't.
RM> If you want end-to-end data integrity it has to be checked on a
RM> server.

But Joe makes a good point about RAID-Z and iSCSI.

It'd be nice if RAID HW could assist RAID-Z, and it wouldn't take much to do
that: parity computation on write, checksum verification on read and, if the
checksum verification fails, combinatorial reconstruction on read. The ZFS
system (iSCSI client) would still have to verify the checksum on read...

...but leaving parity computation/reconstruction to the iSCSI server would
greatly cut down the amount of I/O needed for RAID-Z to something similar to
that needed for HW RAID-5.

Sure, I don't expect HW-assisted RAID-Z anytime soon, nor iSCSI extensions
for server-assisted RAID-Z. But at least iSCSI protocol extensions could be
pursued now.

Nico
--
On Wed, 2006-06-28 at 17:32, Erik Trimble wrote:
> <snip>
> Given a reasonable number of hot spares, I simply can't see the (very)
> marginal increase in safety given by using HW RAID5 as outweighing the
> considerable speed hit RAID5 takes.

That's not quite right. There's no significant difference in performance,
and the question is whether you're prepared to give up a small amount of
space for an orders-of-magnitude increase in safety.

Each extra disk failure you can survive leads to a massive increase in
safety: something like (just considering isolated random disk failures) the
ratio of the MTBF of a disk to the time it takes to get a spare back in and
repair the LUN. That's something like 100,000 - which isn't a marginal
increase in safety!

Yes, I know it's not that simple. The point to take from this is simply that
being able to survive 2 failures instead of 1 doesn't double your safety, it
increases it by a very large number. And by a very large number again for
the next failure you can survive.

In the stripe case, a single disk loss (pretty common) loses all your
redundancy straight off. A second disk failure (not particularly rare) and
all your data's toast. Hot spares don't really help in this case.

At this point, having HW raid-5 underneath means you're still humming along
safely.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Robert,

> PT> <snip>
>
> well, it will give you less than half your raw storage.
> Due to costs I believe in most cases it won't be acceptable.
> People are using raid-5 mostly due to costs, and you are proposing
> something worse (in terms of available logical storage) than
> mirroring.

I realise that, but the question was about what combination of ZFS redundancy
and HW-raid redundancy made sense. My point was that putting no redundancy at
all at the HW-raid layer was a really bad idea, and the self-healing
capability of zfs means that you want a level of redundancy within zfs. So
you are inevitably going to lose some extra capacity. Which is better -
zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?

I wouldn't rule out raidz (or even raidz2) across multiple arrays that are
HW-raid5 internally. My real concern there is the small random read
performance issue.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Wed, 2006-06-28 at 13:24 -0400, Jonathan Edwards wrote:
> On Jun 28, 2006, at 12:32, Erik Trimble wrote:
> > <snip>
>
> shouldn't that be capacity = ((N - 1) / 2) ?

Nope. For instance, 12 drives: 2 mirrors of 6-drive RAID5, which actually has
5 drives of capacity. N=12, so (12 / 2) - 1 = 6 - 1 = 5.

> loss of a single disk would cause a rebuild on the R5 stripe, which could
> affect performance on that side of the mirror. Generally speaking, good
> RAID controllers will dedicate processors and channels to calculate the
> parity and write it out, so you're not impacted from the host-access point
> of view. There is a similar sort of CoW behaviour that can happen between
> the array cache and the drives, but in the ideal case you're dealing with
> this in dedicated hw instead of shared hw.

But, in all cases I've ever observed, even with hardware assist, writing to
an N-drive RAID5 array is slower than writing to an (N-1)-drive HW striped
array. NVRAM can of course mitigate this somewhat, but it comes down to the
fact that RAID 5/6 always requires more work than simple striping. And an
N-drive striped array will always outperform an N-drive RAID5/6 array.
Always.

I agree that there is some latitude for array design/cache performance/
workload variance in this, but I've compared what would be the generally
optimal RAID-5 workload (large streaming writes/streaming reads) against an
identical number of striped drives, and you are looking at, BEST CASE, the
RAID5 performing at (N-1)/N of the stripe.

[ In reality, that isn't quite the best case. The best case is that RAID-5
matches striping, in the case of reads of size <= (stripe size) * (N-1). ]

> I think you're comparing this to software R5, or at least badly implemented
> array code, and divining that there is a considerable speed hit when using
> R5.
> <snip>

What I was trying to provide was the case for those using HW arrays AND ZFS,
and what the best configuration would be to do so. I'm not saying either/or;
the discussion centered around what the best way to do BOTH is.

-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
> Which is better -
> zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?

The latter. With a mirror of RAID-5 arrays, you get:

  (1)  Self-healing data.

  (2)  Tolerance of whole-array failure.

  (3)  Tolerance of *at least* three disk failures.

  (4)  More IOPs than raidz of hardware mirrors (see Roch's blog entry).

  (5)  More convenient FRUs (the whole array becomes a FRU).

Jeff
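For anyone who wants to try that layout, here is a minimal host-side sketch.
The device names are invented; substitute whatever LUNs your arrays actually
present (ideally one RAID-5 LUN from each array, on separate controllers and
paths):

  # each cXtYdZ is assumed to be a hardware RAID-5 LUN from a separate array
  zpool create tank mirror c2t0d0 c3t0d0

  # confirm the layout and watch for checksum errors being repaired
  zpool status tank

  # walk the pool periodically so latent errors get found and self-healed
  zpool scrub tank

With this layout a failed disk is rebuilt inside the array from parity, while
ZFS's checksums catch - and repair from the other half of the mirror -
anything the array hands back incorrectly.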
Hello Peter,

Wednesday, June 28, 2006, 11:24:32 PM, you wrote:

PT> <snip>
PT> I realise that, but the question was about what combination of
PT> ZFS redundancy and HW-raid redundancy made sense. My point was
PT> that putting no redundancy at all at the HW-raid layer was a
PT> really bad idea, and the self-healing capability of zfs means
PT> that you want a level of redundancy within zfs. So you are
PT> inevitably going to lose some extra capacity. Which is better -
PT> zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?

PT> I wouldn't rule out raidz (or even raidz2) across multiple
PT> arrays that are HW-raid5 internally. My real concern there is
PT> the small random read performance issue.

I hit that problem (raidz on hw-raid5) with lots of small random reads (and
many small writes). The performance was not acceptable here (nor were more
raid-z groups an option, due to too much logical storage consumed for
redundancy). I believe that in many cases mirroring hw-raid-5 luns would
actually perform better.

And why exactly do you think that non-redundant luns on hw arrays are a bad
idea (other than the lack of hot spare support in zfs)? You would still
benefit from the caches in the array.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
Hello Erik,

Wednesday, June 28, 2006, 6:32:38 PM, you wrote:

ET> I would definitely like to see the difference between reads on HW RAID5
ET> vs reads on RAID-Z. Naturally, one of the big concerns I would have is
ET> how much RAM is needed to avoid any cache starvation on the ZFS
ET> machine. I'd discount the NVRAM on the RAID controller, since I'd
ET> assume it is dedicated to write acceleration, not reads. My big problem
ET> right now is that I only have an old A3500FC to do testing on, as all
ET> my other HW RAID controllers are IBM ServeRAIDs, for which the Solaris
ET> driver isn't really the best.

I believe the problem here was mostly due to the 64kB reads from each disk
in raid-z, while the dataset was many TBs of data with small random reads
from many threads (nfsd). It meant that during peak hours I probably wasn't
far from saturating the FC links (there was over 200MB/s of read throughput
at times) while nfsd was actually reading something like 10x less. I believe
that most of that "cached" data wasn't used.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
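As an aside, anyone who wants to check for this kind of read inflation on
their own pool can simply compare what the pool pulls from disk against what
the application actually consumes. A rough sketch (the pool name is just a
placeholder):

  # per-vdev bandwidth, refreshed every 5 seconds; compare the pool-level
  # read column against what nfsd (or the database) reports it is serving
  zpool iostat -v tank 5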
On Wed, 2006-06-28 at 22:13 +0100, Peter Tribble wrote:
> On Wed, 2006-06-28 at 17:32, Erik Trimble wrote:
> > Given a reasonable number of hot spares, I simply can't see the (very)
> > marginal increase in safety given by using HW RAID5 as outweighing the
> > considerable speed hit RAID5 takes.
>
> That's not quite right. There's no significant difference in performance,
> and the question is whether you're prepared to give up a small amount of
> space for an orders-of-magnitude increase in safety.

As indicated by previous posts, even with HW assist, RAID5/6 on N disks will
be noticeably slower than a stripe of N disks. Theoretical read performance
for RAID5 is at best (in a limited number of cases) equal to striping, and in
the general read case runs at (N-1)/N of the stripe. Even assuming no
performance hit at all for the parity calculation on writes, writes to a
RAID5 are at best equal to a stripe (assuming a full stripe has to be
written), and usually run at (N-1)/N of the stripe, as the parity must be
written in addition to the normal data (i.e. N/(N-1) times as much data must
be written).

> Each extra disk failure you can survive leads to a massive increase in
> safety: something like (just considering isolated random disk failures)
> the ratio of the MTBF of a disk to the time it takes to get a spare back
> in and repair the LUN. That's something like 100,000 - which isn't a
> marginal increase in safety!
>
> <snip>
>
> At this point, having HW raid-5 underneath means you're still humming
> along safely.

Agreed. However, part of the issue is that you have to take into account the
possibility of another drive failing before the hot spare can be resilvered
after a drive failure. In general, this is why I don't see stripes being more
than 6-8 drives wide. It takes somewhere between 30 minutes and 2 hours to
resilver a drive in a mirrored stripe of that size (depending on capacity and
activity). So a tradeoff has to be made. And note that the RAID5 resync will
reduce your performance for considerably longer than it takes to resilver the
stripe.

Put another way (assuming a hot spare is automatically put in place after a
drive failure): with mirrored stripes, I'm vulnerable to complete data loss
while the 2nd stripe resyncs - say 2 hours or so, worst case. By percentages,
a 2nd drive loss in a mirrored stripe has a 50% chance of causing data loss.
With mirrored RAID5, I'm invulnerable to a second drive loss. However, my
resync time is vastly greater than for a stripe, increasing my window for
more drive failures by at least double, if not more.

It's all a tradeoff. In general, though, I haven't seen a 2nd drive fail
within a stripe resync time, UNLESS I saw many drives fail (that is, 30-40%
of a group failing close together), in which case this overwhelms any RAID's
ability to compensate. It's certainly possible (which is why striped mirrors
are preferred to mirrored stripes). In the end, it comes down to the local
needs.

Speaking in generalizations, my opinion is that a mirror of RAID5 doesn't
significantly increase your safety enough to warrant a 15-20% reduction in
space, and at least that in performance, vs a mirror of stripes.

And, of course, backups help. :-)

-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Wed, 2006-06-28 at 14:55 -0700, Jeff Bonwick wrote:
> > Which is better -
> > zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?
>
> The latter. With a mirror of RAID-5 arrays, you get:
>
> (1) Self-healing data.
>
> (2) Tolerance of whole-array failure.
>
> (3) Tolerance of *at least* three disk failures.
>
> (4) More IOPs than raidz of hardware mirrors (see Roch's blog entry).
>
> (5) More convenient FRUs (the whole array becomes a FRU).
>
> Jeff

Not that I disagree with the initial assessment, but a couple of corrections:

(1) Both give you this.

(2) ZFS RAIDZ on HW mirrors can also survive a complete HW mirror array
failure.

(3) Both configs can survive AT LEAST 3 drive failures. RAIDZ of HW mirrors
is slightly better at being able to survive 4+ drive failures, statistically
speaking.

-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Jun 28, 2006, at 17:25, Erik Trimble wrote:
> On Wed, 2006-06-28 at 13:24 -0400, Jonathan Edwards wrote:
>> shouldn't that be capacity = ((N - 1) / 2) ?
>
> Nope. For instance, 12 drives: 2 mirrors of 6-drive RAID5, which actually
> has 5 drives of capacity. N=12, so (12 / 2) - 1 = 6 - 1 = 5.

right, sorry - was thinking of the case where i've got 2 luns built out of a
single R5 parity group .. but there's not much point to using a mirror there
since disk failure is typically much more common than LDEV failure.

If you're really concerned with reliability (the only reason you should be
thinking about doing both R5 and R1), you'd be better off mirroring each
component of a RAID stripe before you construct the parity group. This will
still give you an effective capacity of (N-2)/2, or (N/2) - 1, but now you
would have to lose 2 complete mirrors before you would fail. To me this says
that the best case for reliability here should be to go with HW mirrored
drives and RAID-Z on top. Of course, you're not going to be able to split
mirrors very easily if you ever have that intention.

<snip>

> And an N-drive striped array will always outperform an N-drive RAID5/6
> array. Always.

true - but with some modern hardware, I think you'll find that it's pretty
negligible.

> I agree that there is some latitude for array design/cache performance/
> workload variance in this, but I've compared what would be the generally
> optimal RAID-5 workload (large streaming writes/streaming reads) against
> an identical number of striped drives, and you are looking at, BEST CASE,
> the RAID5 performing at (N-1)/N of the stripe.

right, and you'll also have a read/<modify>/write penalty that will happen
somewhere, which can degrade performance particularly when you blow your
cache in a large streaming write. Realistically you'll typically give up the
performance addition of a drive or two for parity to get basic redundancy,
and then realign your stripe width for your filesystem allocation unit or
block commit based on the number of data drives in your RAID set (N-1 for R5
or N-2 for R6). You'll probably be giving up another couple of drives for hot
spares .. you're never doing R5 for performance - it's the recoverability
aspect with less capacity overhead than a full mirror.

.je
On Thu, 2006-06-29 at 03:40, Nicolas Williams wrote:
> But Joe makes a good point about RAID-Z and iSCSI.
>
> It'd be nice if RAID HW could assist RAID-Z, and it wouldn't take much
> to do that: parity computation on write, checksum verification on read
> and, if the checksum verification fails, combinatorial reconstruction on
> read. The ZFS system (iSCSI client) would still have to verify the
> checksum on read...
>
> ...but leaving parity computation/reconstruction to the iSCSI server
> would greatly cut down the amount of I/O needed for RAID-Z to something
> similar to that needed for HW RAID-5.
>
> Sure, I don't expect HW-assisted RAID-Z anytime soon, nor iSCSI
> extensions for server-assisted RAID-Z. But at least iSCSI protocol
> extensions could be pursued now.

But - this still fails to address the design concept of ZFS's end-to-end
checksumming, and fails to address things going bad over the system's
hardware bus, the IO card, and the Fibre...

So - it's not nearly the same level of protection, IMO.
[hit send too soon...]

Richard Elling wrote:
> Erik Trimble wrote:
>> <snip>
>
> Erik,
> Your analysis lacks some very important views of the problem.
> <snip>
> 4. Scrubbing methods are also different between ZFS and RAID arrays. This
>    does impact latent fault detection, which in turn impacts data loss.

5. Excepting recovery from tape, the availability of a ZFS volume is a
   function of the amount of space used. This is different from LVMs or HW
   RAID arrays, where the availability is a function of the size of the disk.

> Depending on requirements, we might recommend something fast but risky, or
> something designed to never forget. Saying that some configuration has
> little value only applies to a specific set of requirements.
>  -- richard
Roch wrote:
> Philip Brown writes:
> > but there may not be filesystem space for double the data.
> > Sounds like there is a need for a zfs-defragment-file utility perhaps?
> >
> > Or if you want to be politically cagey about naming choice, perhaps,
> >
> > zfs-seq-read-optimize-file ? :-)
>
> Possibly, or maybe using fcntl?
>
> Now the goal is to take a file with scattered blocks and order
> them in contiguous chunks. So this is contingent on the
> existence of regions of free contiguous disk space. This
> will get more difficult as we get close to full on the
> storage.

Quite so. It should be reasonable to require some minimum level of free space
on the filesystem, or the operation cannot continue. But even with some
relatively 'small' amount of free space, it should still be possible; it will
just take significantly longer.

CF: any "defrag your hard drive" algorithm. Same algorithm, just applied to a
file instead of a drive.
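Until such a utility exists, the copy-based reorder mentioned earlier in the
thread can be done by hand. A rough sketch, assuming the file is closed while
it is copied and that the path below is just an example:

  # rewrite the file sequentially so ZFS allocates fresh, mostly
  # contiguous blocks for it
  cp -p /tank/db/datafile /tank/db/datafile.new
  mv /tank/db/datafile.new /tank/db/datafile

The new copy is laid down as a streaming write, so subsequent sequential
reads see a much friendlier layout - at the cost of temporarily needing space
for a second copy of the file.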
Erik Trimble wrote:
> Since the best way to get this is to use a Mirror or RAIDZ vdev, I'm
> assuming that the proper way to get benefits from both ZFS and HW RAID
> is the following:
>
> (1) ZFS mirror of HW stripes, i.e. "zpool create tank mirror
>     hwStripe1 hwStripe2"
> (2) ZFS RAIDZ of HW mirrors, i.e. "zpool create tank raidz hwMirror1
>     hwMirror2"
> (3) ZFS RAIDZ of HW stripes, i.e. "zpool create tank raidz hwStripe1
>     hwStripe2"
>
> mirrors of mirrors and raidz of raid5 are also possible, but I'm pretty
> sure they're considerably less useful than the 3 above.
>
> Personally, I can't think of a good reason to use ZFS with HW RAID5;
> case (3) above seems to me to provide better performance with roughly
> the same amount of redundancy (not quite true, but close).

I almost regret extending this thread more :-) but I haven't seen anyone
spell out one thing in simple language, so I'll attempt to do that now.

#2 is incredibly wasteful of space, so I'm not going to address it. It is
highly redundant, and that's great; if you need it, do it. I'm more concerned
with the concept of

  zfs on two hardware raid boxes that have internal disk redundancy
vs
  zfs on two hardware raid boxes that are pure stripes (raid 0)

(it doesn't matter to me whether you use a zfs mirror or raidz, for this
aspect of things)

The point that I think people should remember is that if you lose a drive in
a pure raid0 configuration... your time to recover that hwraid unit and bring
it back to full operation in the filesystem.. is HUGE. It will most likely be
unacceptably long: hours if not days, for a decent sized raid box.

So, you can choose to throw away half your disk space in that hwraid box for
redundancy, or use raid5. raid5 IS useful in zfs+hwraid boxes, for "Mean Time
To Recover" purposes.
On Thu, Jun 29, 2006 at 09:25:21AM +1000, Nathan Kroenert wrote:
> On Thu, 2006-06-29 at 03:40, Nicolas Williams wrote:
> > <snip>
> > ...but leaving parity computation/reconstruction to the iSCSI server
> > would greatly cut down the amount of I/O needed for RAID-Z to something
> > similar to that needed for HW RAID-5.
>
> But - this still fails to address the design concept of ZFS's end-to-end
> checksumming, and fails to address things going bad over the system's
> hardware bus, the IO card and the Fibre...

No it doesn't. As I'd have it (and as I wrote), ZFS would compute the
checksum on both reads and writes, but on reads the iSCSI target would also
compute the checksum, so it could do combinatorial reconstruction if a block
is bad.

Nico
--
Philip Brown wrote:
> raid5 IS useful in zfs+hwraid boxes, for "Mean Time To Recover" purposes.

Or, and people haven't really mentioned this yet, if you're using R5 for the
raid set and carving LUNs out of it to multiple hosts.
On 6/28/06, Nathan Kroenert <Nathan.Kroenert at sun.com> wrote:
> On Thu, 2006-06-29 at 03:40, Nicolas Williams wrote:
> > <snip>
>
> But - this still fails to address the design concept of ZFS's end-to-end
> checksumming, and fails to address things going bad over the system's
> hardware bus, the IO card and the Fibre...
>
> So - it's not nearly the same level of protection, IMO.

It's not the same level of protection, but you get location independence,
multi-pathing throughout your device tree (all components redundant,
including the head node potentially), and, if Solaris ever got around to it,
ERL2 support would do wonders to ensure data integrity. Sure, you can still
have specific components fail partially and not know that the data is
corrupted, but again, mirroring your iSCSI-based storage allows the error
correction/checksumming route to work its wonders.

For exceptionally large data pools you'll need many systems (perhaps beyond
the scope of even FC). Just exposing each drive as a naked lun and doing
layers of raidz/mirrors will show poor performance and become an overly
centralized management nightmare. Segmenting off the workload solves some
performance ills (again, referring to Roch's work on raidz's failings for a
large number of luns) and provides its own level of compartmentalization,
redundancy, and manageability (you gain some and you lose some, I agree).
Let ZFS integrate as best it can based on the environment at hand. My target
use involves both tier1 (a la NetApp) and tier2 (very large multi-location
storage pools).
On Jun 28, 2006, at 18:25, Erik Trimble wrote:
> On Wed, 2006-06-28 at 14:55 -0700, Jeff Bonwick wrote:
>>> Which is better -
>>> zfs raidz on hardware mirrors, or zfs mirror on hardware raid-5?
>>
>> The latter. With a mirror of RAID-5 arrays, you get:
>> <snip>
>
> Not that I disagree with the initial assessment, but a couple of
> corrections:
>
> (1) Both give you this.
>
> (2) ZFS RAIDZ on HW mirrors can also survive a complete HW mirror array
> failure.
>
> (3) Both configs can survive AT LEAST 3 drive failures. RAIDZ of HW
> mirrors is slightly better at being able to survive 4+ drive failures,
> statistically speaking.

Here are 10 options I can think of to summarize combinations of zfs with hw
redundancy:

 #   ZFS   ARRAY HW     CAPACITY   COMMENTS
 --  ---   --------     --------   --------
 1   R0    R1           N/2        hw mirror - no zfs healing (XXX)
 2   R0    R5           N-1        hw R5 - no zfs healing (XXX)
 3   R1    2 x R0       N/2        flexible, redundant, good perf
 4   R1    2 x R5       (N/2)-1    flexible, more redundant, decent perf
 5   R1    1 x R5       (N-1)/2    parity and mirror on same drives (XXX)
 6   RZ    R0           N-1        standard RAID-Z - no array RAID (XXX)
 7   RZ    R1 (tray)    (N/2)-1    RAIDZ+1
 8   RZ    R1 (drives)  (N/2)-1    RAID1+Z (highest redundancy)
 9   RZ    2 x R5       N-3        triple parity calculations (XXX)
 10  RZ    1 x R5       N-2        double parity calculations (XXX)

If we eliminate the configs with no zfs healing, and the configs with double
parity calculations (overworking the drives), I believe that configs 3 and 4
on a decent RAID array will probably perform similarly for most workloads.
Config 4 (as Jeff pointed out) will probably get you the best performance and
redundancy, utilizing both the arrays' strengths and zfs' strengths.

If we optimize for performance we'd probably shy away from the RAID-Z
options, since we can't really dedicate channels and resources for the parity
calculations and writes (Roch's explanation is much better). But if we
optimize for reliability, config 8 would get you the highest overall
redundancy.

Other options not considered:
- Double mirroring - capacity loss is too high for too little gain
- RAID2/3/4/6/S - not commonly used - have their own flaw areas

Jonathan Edwards
(generic storage and filesystem engineer)
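To make config 8 concrete, the host side might look roughly like the
following (device names are invented; each LUN is assumed to be a
hardware-mirrored pair of drives exported by the array). Config 4 would look
like the two-LUN mirror sketched earlier in the thread.

  # config 8 (RAID1+Z): RAID-Z across five LUNs, each LUN a HW mirror pair
  zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0
  zpool status tank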
On Wed, Jun 28, 2006 at 09:30:25AM -0400, Jeff Victor wrote:
> przemolicc at poczta.fm wrote:
> > On Wed, Jun 28, 2006 at 02:23:32PM +0200, Robert Milkowski wrote:
> >
> > What I wanted to point out is Al's example: he wrote about damaged data.
> > The data were damaged by firmware, _not_ the disk surface! In such a case
> > ZFS doesn't help. ZFS can detect (and repair) errors on the disk surface,
> > bad cables, etc. But it cannot detect and repair errors in its (ZFS)
> > code.
>
> If you mean "ZFS doesn't help with firmware problems" that is not true.

No, I don't mean that. :-)

> For example, if ZFS is mirroring a pool across two different storage
> arrays, a firmware error in one of them will cause problems that ZFS will
> detect when it tries to read the data. Further, ZFS would be able to
> correct the error by reading from the other mirror, unless the second
> array also suffered from a firmware error.

In this case ZFS is going to help, I agree. But how often do you meet such a
solution (a mirror of two different storage arrays)?

> There are categories of problems that ZFS cannot handle, mostly regarding
> data availability after catastrophes (as Richard E described), but ZFS can
> help with many firmware problems.

Indeed.

przemol
On Wed, Jun 28, 2006 at 03:30:28PM +0200, Robert Milkowski wrote:
> ppf> What I wanted to point out is Al's example: he wrote about damaged
> ppf> data. The data were damaged by firmware, _not_ the disk surface! In
> ppf> such a case ZFS doesn't help. ZFS can detect (and repair) errors on
> ppf> the disk surface, bad cables, etc. But it cannot detect and repair
> ppf> errors in its (ZFS) code.
>
> Not in its code, but definitely in the firmware code in a controller.

As Jeff pointed out: if you mirror two different storage arrays.

przemol
Hello Philip,

Thursday, June 29, 2006, 2:58:41 AM, you wrote:

PB> <snip>
PB> The point that I think people should remember is that if you lose a
PB> drive in a pure raid0 configuration... your time to recover that hwraid
PB> unit and bring it back to full operation in the filesystem.. is HUGE.
PB> It will most likely be unacceptably long: hours if not days, for a
PB> decent sized raid box.

Not really. You can create many smaller raid-0 luns in one array and then do
raid-10 in zfs. That should also expand your available queue depth and
minimize the impact of resilvering. Storage capacity would still be the same.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
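A sketch of that layout, assuming each array exports four small RAID-0 LUNs
(the device names are made up), with every ZFS mirror pair spanning the two
arrays:

  # raid-10 in zfs across two arrays: c2* from array A, c3* from array B
  zpool create tank mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0 \
      mirror c2t2d0 c3t2d0 mirror c2t3d0 c3t3d0

Losing one LUN then only resilvers that LUN's slice of the pool rather than a
whole box worth of data, and the extra top-level vdevs give ZFS more
outstanding I/Os to spread around.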
Hello przemolicc,

Thursday, June 29, 2006, 8:01:26 AM, you wrote:

ppf> On Wed, Jun 28, 2006 at 03:30:28PM +0200, Robert Milkowski wrote:
>> <snip>
>> Not in its code, but definitely in the firmware code in a controller.

ppf> As Jeff pointed out: if you mirror two different storage arrays.

Not only then, I believe. There are some classes of firmware problems where
ZFS could help even within one array (with many controllers in an
active-active config, like a Symmetrix).

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
On Thu, Jun 29, 2006 at 10:01:15AM +0200, Robert Milkowski wrote:
> <snip>
> Not only then, I believe. There are some classes of firmware problems
> where ZFS could help even within one array (with many controllers in an
> active-active config, like a Symmetrix).

Any real example?

przemol
Hello przemolicc,

Thursday, June 29, 2006, 10:08:23 AM, you wrote:

ppf> <snip>
ppf> Any real example?

I wouldn't say such problems are common. The issue is that we don't know.
From time to time some files are bad, and sometimes fsck is needed for no
apparent reason. I think only the future will tell how and when ZFS will
protect us. All I can say is that there's big potential in ZFS.

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
przemolicc at poczta.fm wrote:
> On Wed, Jun 28, 2006 at 09:30:25AM -0400, Jeff Victor wrote:
>
>> For example, if ZFS is mirroring a pool across two different storage
>> arrays, a firmware error in one of them will cause problems that ZFS will
>> detect when it tries to read the data. Further, ZFS would be able to
>> correct the error by reading from the other mirror, unless the second
>> array also suffered from a firmware error.
>
> In this case ZFS is going to help, I agree. But how often do you meet such
> a solution (a mirror of two different storage arrays)?

I have never seen this for cabinet-sized storage systems, because they offer
the ability to perform on-line maintenance. But I do see mirroring across two
arrays for small arrays, which typically do not offer on-line maintenance.

--
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems            jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
1) We installed ZFS onto our Solaris 10 T2000 3 months ago. I have been told our ZFS code is downrev. What is the recommended way to upgrade ZFS on a production system (we want minimum downtime)? Can it safely be done without affecting our 3.5 million files? 2) We did not turn on compression as most of our 3+ million files are already gzipped. What is the performance penalty of having compression on (both read and write numbers)? Is there advantage to compressing already gzipped files? Should compression be the default when installing ZFS? Nearly all our files are ASCII. here is some info on our machine itsm-mpk-2% showrev Hostname: itsm-mpk-2 Hostid: 83d8d784 Release: 5.10 Kernel architecture: sun4v Application architecture: sparc Hardware provider: Sun_Microsystems Domain: Kernel version: SunOS 5.10 Generic_118833-08 T2000 32x1000mhz, 16gigs RAM. # zpool status pool: canary state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM canary ONLINE 0 0 0 c1t0d0s3 ONLINE 0 0 0 errors: No known data errors # zpool iostat 1 capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- canary 42.0G 12.0G 169 223 8.92M 1.39M canary 42.0G 12.0G 0 732 0 3.05M canary 42.0G 12.0G 0 573 0 2.47M canary 42.0G 12.0G 0 515 0 2.22M canary 42.0G 12.0G 0 680 0 3.11M canary 42.0G 12.0G 0 620 0 2.80M canary 42.0G 12.0G 0 687 0 2.85M canary 42.0G 12.0G 0 568 0 2.40M canary 42.0G 12.0G 0 688 0 2.91M canary 42.0G 12.0G 0 634 0 2.75M canary 42.0G 12.0G 0 625 0 2.61M canary 42.0G 12.0G 0 700 0 2.96M canary 42.0G 12.0G 0 733 0 3.19M canary 42.0G 12.0G 0 639 0 2.76M canary 42.0G 12.0G 1 573 127K 2.89M canary 42.0G 12.0G 0 652 0 2.48M canary 42.0G 12.0G 0 713 63.4K 3.55M canary 42.0G 12.0G 117 355 7.83M 782K canary 42.0G 12.0G 43 616 2.97M 1.11M canary 42.0G 12.0G 128 424 8.60M 1.57M canary 42.0G 12.0G 288 151 18.9M 795K canary 42.0G 12.0G 364 0 23.9M 0 canary 42.0G 12.0G 387 0 25.6M 0 thanks sean
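Taking a stab at both questions (hedged - verify against the docs for your
exact release): the ZFS code itself is updated through the normal Solaris
patch/upgrade path, and the on-disk pool format can then be brought forward
in place with zpool upgrade, which does not rewrite the files in the pool.
Compression won't buy anything for data that is already gzipped, so it is
only worth enabling on datasets holding uncompressed data. A rough sketch,
assuming the zpool upgrade subcommand is present in your bits:

  # after patching, see whether the pool's on-disk version is older than
  # what the kernel now supports, and which versions are available
  zpool upgrade
  zpool upgrade -v

  # upgrade the pool format in place (existing files are not rewritten)
  zpool upgrade canary

  # turn on (lzjb) compression where the data is not already gzipped;
  # it only affects blocks written after the property is set
  zfs set compression=on canary
  zfs get compression canary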