Hi,

When I read the ZFS manual, it usually recommends configuring redundancy at the ZFS layer, mainly because there are features that only work with a redundant configuration (such as correction of corrupted data), and it also implies that overall robustness will improve.

My question is simple: what is the recommended configuration on a SAN (on high-end EMC, like the Symmetrix DMX series, for example) where redundancy is usually configured at the array level, so most likely we would use a simple ZFS layout, without redundancy?

Is it worth moving the redundancy from the SAN array layer to the ZFS layer? (Configuring redundancy on both layers sounds like a waste to me.) There are certain advantages to having redundancy configured on the array (beyond protection against simple disk failure). Can we compare the advantages of having (for example) RAID5 configured on a high-end SAN with no redundancy at the ZFS layer versus no redundant RAID configuration on the high-end SAN but raidz or raidz2 at the ZFS layer?

Any tests, experience or best practices regarding this topic? How does ZFS perform (from a performance and robustness (or availability, if you like) point of view) on high-end SANs, compared to VxFS for example?

If you could share your experience with me, I would really appreciate it.

Regards,
sendai
--
This message posted from opensolaris.org
Damon,

Yes, we can provide simple concat inside the array (even though today we provide RAID5 or RAID1 as our standard, and use Veritas with concat); the question is more whether it's worth switching the redundancy from the array to the ZFS layer.

The RAID5/1 features of the high-end EMC arrays also provide performance improvements, which is why I wonder what the pros and cons of such a switch would be (I mean the switch of the redundancy from the array to the ZFS layer).

So, you are telling me that even if the SAN provides redundancy (HW RAID5 or RAID1), people still configure ZFS with either raidz or mirror?

Regards,
sendai

On Sat, Feb 14, 2009 at 6:06 AM, Damon Atkins <Damon.Atkins at _no_spam_yahoo.com.au> wrote:
> Andras,
> If you can get concat disks or RAID 0 disks inside the array, then use raidz
> (if the I/O is not a large amount or is mostly sequential); if the I/O is
> very high, then use a ZFS mirror. You cannot spread a zpool over multiple
> EMC arrays using SRDF if you are not using EMC PowerPath.
>
> HDS, for example, does not support anything other than a Mirror or RAID5
> configuration, so raidz or a ZFS mirror results in a lot of wasted disk
> space. However, people still use raidz on HDS RAID5, as the top-of-the-line
> HDS arrays are very fast and they want the features offered by ZFS.
>
> Cheers
>
> --
> This message posted from opensolaris.org
On 14-Feb-09, at 2:40 AM, Andras Spitzer wrote:
> Damon,
>
> Yes, we can provide simple concat inside the array (even though today we
> provide RAID5 or RAID1 as our standard, and use Veritas with concat); the
> question is more whether it's worth switching the redundancy from the
> array to the ZFS layer.
>
> The RAID5/1 features of the high-end EMC arrays also provide performance
> improvements, which is why I wonder what the pros and cons of such a
> switch would be (I mean the switch of the redundancy from the array to
> the ZFS layer).
>
> So, you are telling me that even if the SAN provides redundancy (HW RAID5
> or RAID1), people still configure ZFS with either raidz or mirror?

Without doing so, you don't get the benefit of checksummed self-healing.

--Toby
Andras Spitzer wrote:
> Is it worth moving the redundancy from the SAN array layer to the ZFS layer?
> (Configuring redundancy on both layers sounds like a waste to me.) There are
> certain advantages to having redundancy configured on the array (beyond
> protection against simple disk failure). Can we compare the advantages of
> having (for example) RAID5 configured on a high-end SAN with no redundancy
> at the ZFS layer versus no redundant RAID configuration on the high-end SAN
> but raidz or raidz2 at the ZFS layer?
>
> Any tests, experience or best practices regarding this topic?

I would also like to hear about experiences with ZFS on EMC's Symmetrix. Currently we are using VxFS with PowerPath for multipathing, and synchronous SRDF for replication to our other datacenter. At some point we will move to ZFS, but there are so many options for how to implement this. From a sysadmin point of view (simplicity), I would like to use MPxIO and host-based mirroring. ZFS self-healing would be available in this configuration.

Asking the EMC guys for their opinion is not an option; they will push you to buy SRDF and PowerPath licenses... :-)
On Fri, 13 Feb 2009, Andras Spitzer wrote:
> So, you are telling me that even if the SAN provides redundancy (HW RAID5
> or RAID1), people still configure ZFS with either raidz or mirror?

When ZFS's redundancy features are used, there is a decreased risk of total pool failure. With redundancy at the ZFS level, errors may be corrected. With care in the pool design, more overall performance may be obtained, since a number of independent arrays may be pooled together to obtain more bandwidth and storage space. With this in mind, if the SAN hardware is known to work very well, placing the pool on a single SAN device is still an option.

If you do use ZFS's redundancy features, it is important to consider resilver time. Try to keep volume size small enough that it may be resilvered in a reasonable amount of time.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
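To make that layout concrete, here is a minimal sketch; the pool name and LUN device names are hypothetical placeholders for whatever the arrays actually present:

    # Each mirror pairs a LUN from array A with a LUN from array B, so ZFS
    # can detect and repair corruption, and a whole-array outage leaves the
    # pool degraded rather than faulted.
    zpool create tank \
        mirror c4t0d0 c5t0d0 \
        mirror c4t1d0 c5t1d0

    # Check the layout and the error counters afterwards.
    zpool status tank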
>>>>> "as" == Andras Spitzer <wsendai at gmail.com> writes:

    as> So, you are telling me that even if the SAN provides redundancy
    as> (HW RAID5 or RAID1), people still configure ZFS with either
    as> raidz or mirror?

There's some experience that, in the case where the storage device or the FC mesh glitches or reboots while the ZFS host stays up across the reboot, you are less likely to lose the whole pool to ``ZFS-8000-72: The pool metadata is corrupted and cannot be opened. Destroy the pool and restore from backup.'' if you have ZFS-level redundancy than if you don't.

Note that this ``corrupt and cannot be opened'' failure is a different problem from ``not being able to self-heal.'' When you need self-healing and don't have it, you usually shouldn't lose the whole pool. You should get a message in 'zpool status' telling you the name of a file that has unrecoverable errors. Any attempt to read the file returns an I/O error (not the marginal data). Then you have to go delete that file to clear the error, but otherwise the pool keeps working. In this self-heal case, if you'd had the ZFS-layer redundancy you'd get a count in the checksum column of one device and wouldn't have to delete the file; in fact, you wouldn't even know the name of the file that got healed.

Some people have been trying to blame the ``corrupt and cannot be opened'' failures on bit flips supposedly happening inside the storage or the FC cloud (the same kind of bit flip that causes the other, self-healable problem), but I don't buy it. I think it's probably cache sync / write barrier problems that are killing the unredundant pools on SANs.
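As a hedged illustration of the non-redundant, self-heal-missing case described above (the pool name is made up), the usual sequence is roughly:

    # List the files that have unrecoverable checksum errors.
    zpool status -v tank

    # Restore or delete each affected file, then clear the error
    # counters and re-verify the rest of the pool.
    zpool clear tank
    zpool scrub tank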
Hello Bob,

Saturday, February 14, 2009, 6:16:54 PM, you wrote:

BF> If you do use ZFS's redundancy features, it is important to consider
BF> resilver time. Try to keep volume size small enough that it may be
BF> resilvered in a reasonable amount of time.

Well, in most cases a resilver in ZFS should be quicker than a resilver in a disk array, because ZFS will resilver only the blocks which are actually in use, while most disk arrays will blindly resilver full disk drives. So assuming you still have plenty of unused disk space in the pool, a ZFS resilver should take less time.

--
Best regards,
Robert Milkowski
http://milek.blogspot.com
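A hedged sketch of what that looks like in practice (pool and device names are hypothetical); the progress reported by zpool status covers only allocated blocks, which is why a mostly-empty pool recovers quickly:

    # Replace a failed LUN with a spare one and watch the resilver.
    zpool replace tank c4t2d0 c4t6d0
    zpool status tank    # reports resilver progress until it completes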
On Sun, 15 Feb 2009, Robert Milkowski wrote:
> Well, in most cases a resilver in ZFS should be quicker than a resilver in
> a disk array, because ZFS will resilver only the blocks which are actually
> in use, while most disk arrays will blindly resilver full disk drives.
> So assuming you still have plenty of unused disk space in the pool, a ZFS
> resilver should take less time.

It is reasonable to assume that storage will eventually become close to full. Then the user becomes entrapped by their design. Adding to the issues is that as the ZFS pool ages and becomes full, it becomes slower as well due to increased fragmentation, and this fragmentation slows down resilver performance.

We have heard here from people who based their pool on a mirror of large multi-terabyte LUNs. This seemed to work fine initially, but later on (to their dismay) they discovered that it would take several days or a week to resilver one of the LUNs. The most severe cases were when the huge LUN was actually a ZFS volume exported via iSCSI from a server (e.g. a whole Thumper). When one of the LUNs gets rebooted, it takes quite a long time for ZFS to catch it up, and possibly it (or its peer) will be rebooted again in the meantime.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sun, Feb 15, 2009 at 5:00 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Sun, 15 Feb 2009, Robert Milkowski wrote:
>> Well, in most cases a resilver in ZFS should be quicker than a resilver in
>> a disk array, because ZFS will resilver only the blocks which are actually
>> in use, while most disk arrays will blindly resilver full disk drives.
>> So assuming you still have plenty of unused disk space in the pool, a ZFS
>> resilver should take less time.
>
> It is reasonable to assume that storage will eventually become close to
> full. Then the user becomes entrapped by their design. Adding to the
> issues is that as the ZFS pool ages and becomes full, it becomes slower as
> well due to increased fragmentation, and this fragmentation slows down
> resilver performance.

Pardon me for jumping into this discussion. I invariably lurk and keep my mouth firmly shut. In this case, however, curiosity and a degree of alarm bade me to jump in... could you elaborate on 'fragmentation', since the only context in which I know this is Windows? Now surely ZFS doesn't suffer from the same sickness?

As a follow-up: is there any ongoing sensible way to defend against the dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?

Forgive the "silly questions" from the sidelines... ignorance knows no bounds, apparently :)

Warm Regards,
-Colin
On Sun, 15 Feb 2009, Colin Raven wrote:
> Pardon me for jumping into this discussion. I invariably lurk and keep my
> mouth firmly shut. In this case, however, curiosity and a degree of alarm
> bade me to jump in... could you elaborate on 'fragmentation', since the
> only context in which I know this is Windows? Now surely ZFS doesn't
> suffer from the same sickness?

ZFS is "fragmented by design". Regardless, it takes steps to minimize fragmentation, and the costs of fragmentation. Files written sequentially at a reasonable rate of speed are usually contiguous on disk as well. A "slab" allocator is used in order to allocate space in larger units, and then dice this space up into ZFS 128K blocks so that related blocks will be close together on disk. The use of larger block sizes (default 128K vs. 4K or 8K) dramatically reduces the amount of disk seeking required for sequential I/O when fragmentation is present. Written data is buffered in RAM for up to 5 seconds before being written, so that opportunities for contiguous storage are improved. When the pool has multiple vdevs, then ZFS's "load share" can also intelligently allocate file blocks across multiple disks such that there is minimal head movement, and multiple seeks can take place at once.

> As a follow-up: is there any ongoing sensible way to defend against the
> dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?
> Forgive the "silly questions" from the sidelines... ignorance knows no
> bounds, apparently :)

The most important thing is to never operate your pool close to 100% full. Always leave a reserve so that ZFS can use reasonable block allocation policies and is not forced to allocate blocks in a way which causes an additional performance penalty. Installing more RAM in the system is likely to decrease fragmentation, since then ZFS can defer writes longer and make better choices about where to put the data.

Updating already-written portions of files "in place" will convert a completely contiguous file into a fragmented file due to ZFS's copy-on-write design.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
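Related to the 128K default block size mentioned above: for workloads that rewrite files in place (databases, for example), the per-dataset recordsize property can be matched to the application's I/O size to limit read-modify-write and fragmentation. A hedged example, with made-up dataset names:

    # Inspect the current value (128K by default).
    zfs get recordsize tank/db

    # Match the dataset's block size to an 8K database page size before
    # loading data; the property only affects newly written blocks.
    zfs set recordsize=8k tank/db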
On Sun, Feb 15, 2009 at 8:02 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Sun, 15 Feb 2009, Colin Raven wrote:
>> Pardon me for jumping into this discussion. [...]
>
> ZFS is "fragmented by design". Regardless, it takes steps to minimize
> fragmentation, and the costs of fragmentation. [...]
>
> The most important thing is to never operate your pool close to 100% full.
> Always leave a reserve so that ZFS can use reasonable block allocation
> policies and is not forced to allocate blocks in a way which causes an
> additional performance penalty. Installing more RAM in the system is
> likely to decrease fragmentation, since then ZFS can defer writes longer
> and make better choices about where to put the data.
>
> Updating already-written portions of files "in place" will convert a
> completely contiguous file into a fragmented file due to ZFS's
> copy-on-write design.

Thank you for a most lucid and readily understandable explanation. I shall now return to the sidelines... hoping to have a ZFS box up and running sometime in the near future, when budget and time permit. Keeping up with this list is helpful in anticipation of that time arriving.
On Sun, 15 Feb 2009, Colin Raven wrote:
>> As a follow-up: is there any ongoing sensible way to defend against the
>> dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?
>> Forgive the "silly questions" from the sidelines... ignorance knows no
>> bounds, apparently :)

There is no "defrag" facility, since the ZFS pool design naturally controls the degree of fragmentation. ZFS filesystems don't tend to "fall apart" like Windows FAT or NTFS.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Sendai,

On Fri, Feb 13, 2009 at 03:21:25PM -0800, Andras Spitzer wrote:
> Hi,
>
> When I read the ZFS manual, it usually recommends configuring redundancy
> at the ZFS layer, mainly because there are features that only work with a
> redundant configuration (such as correction of corrupted data), and it
> also implies that overall robustness will improve.
>
> My question is simple: what is the recommended configuration on a SAN (on
> high-end EMC, like the Symmetrix DMX series, for example) where redundancy
> is usually configured at the array level, so most likely we would use a
> simple ZFS layout, without redundancy?

From my experience, this is a bad idea. I have seen a couple of cases with such a config (no redundancy at the ZFS level) where the connection between the HBA and the storage was flaky, and there was no way for ZFS to recover. I agree that MPxIO or any other multipathing handles failure of links, but that in itself is not sufficient.

Thanks and regards,
Sanjeev

--
----------------
Sanjeev Bagewadi
Solaris RPE
Bangalore, India
On Mon, Feb 16, 2009 at 9:11 AM, Sanjeev <sanjeev.bagewadi at sun.com> wrote:
> Sendai,
>
> On Fri, Feb 13, 2009 at 03:21:25PM -0800, Andras Spitzer wrote:
>> When I read the ZFS manual, it usually recommends configuring redundancy
>> at the ZFS layer, mainly because there are features that only work with a
>> redundant configuration (such as correction of corrupted data), and it
>> also implies that overall robustness will improve.
>>
>> My question is simple: what is the recommended configuration on a SAN (on
>> high-end EMC, like the Symmetrix DMX series, for example) where redundancy
>> is usually configured at the array level, so most likely we would use a
>> simple ZFS layout, without redundancy?
>
> From my experience, this is a bad idea. I have seen a couple of cases with
> such a config (no redundancy at the ZFS level) where the connection between
> the HBA and the storage was flaky, and there was no way for ZFS to recover.
> I agree that MPxIO or any other multipathing handles failure of links, but
> that in itself is not sufficient.

So what would you recommend then, Sanjeev?
- multiple ZFS pools running on a SAN?
- an S10 box or boxes that provide ZFS-backed iSCSI?

-- Sriram
Sriram,

On Mon, Feb 16, 2009 at 11:12:42AM +0530, Sriram Narayanan wrote:
> On Mon, Feb 16, 2009 at 9:11 AM, Sanjeev <sanjeev.bagewadi at sun.com> wrote:
>> From my experience, this is a bad idea. I have seen a couple of cases with
>> such a config (no redundancy at the ZFS level) where the connection between
>> the HBA and the storage was flaky, and there was no way for ZFS to recover.
>> I agree that MPxIO or any other multipathing handles failure of links, but
>> that in itself is not sufficient.
>
> So what would you recommend then, Sanjeev?
> - multiple ZFS pools running on a SAN?

That's fine. What I meant was that you need to have redundancy at the ZFS level.

> - an S10 box or boxes that provide ZFS-backed iSCSI?

That should be fine as well.

The point of discussion was whether we should have redundancy at the ZFS level, and the answer is yes.

Thanks and regards,
Sanjeev

--
----------------
Sanjeev Bagewadi
Solaris RPE
Bangalore, India
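A minimal sketch of adding that ZFS-level redundancy to an existing single-LUN pool, assuming a second LUN of at least the same size can be presented from another array (pool and device names hypothetical):

    # Attach a second LUN to the existing top-level device, converting the
    # plain vdev into a two-way mirror; ZFS resilvers the new side.
    zpool attach tank c4t0d0 c5t0d0
    zpool status tank    # wait for the resilver to complete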
Hello Bob,

Sunday, February 15, 2009, 9:42:25 PM, you wrote:

BF> On Sun, 15 Feb 2009, Colin Raven wrote:
>>> As a follow-up: is there any ongoing sensible way to defend against the
>>> dreaded fragmentation? A [shudder] "defrag" routine of some kind, perhaps?
>>> Forgive the "silly questions" from the sidelines... ignorance knows no
>>> bounds, apparently :)

BF> There is no "defrag" facility, since the ZFS pool design naturally
BF> controls the degree of fragmentation. ZFS filesystems don't tend to
BF> "fall apart" like Windows FAT or NTFS.

Well...

--
Best regards,
Robert Milkowski
http://milek.blogspot.com
On Mon, Feb 16, 2009 at 12:26 AM, Sanjeev <Sanjeev.Bagewadi at sun.com> wrote:
> Sriram,
>
> On Mon, Feb 16, 2009 at 11:12:42AM +0530, Sriram Narayanan wrote:
>> So what would you recommend then, Sanjeev?
>> - multiple ZFS pools running on a SAN?
>
> That's fine. What I meant was that you need to have redundancy at the ZFS
> level.
>
>> - an S10 box or boxes that provide ZFS-backed iSCSI?
>
> That should be fine as well.
>
> The point of discussion was whether we should have redundancy at the ZFS
> level, and the answer is yes.

Uhhh, an S10 box that provides ZFS-backed iSCSI is NOT fine. Cite the plethora of examples on this list of how the fault management stack takes so long to respond that it's basically unusable as it stands today.

--Tim
>>>>> "t" == Tim <tim at tcsac.net> writes:

    t> Uhhh, an S10 box that provides ZFS-backed iSCSI is NOT fine. Cite
    t> the plethora of examples on this list of how the fault
    t> management stack takes so long to respond that it's basically
    t> unusable as it stands today.

Well... if we are talking about reliability (whether or not you lose the whole pool when some network element or disk target reboots), that's separate from availability (do your final applications experience glitches and outages, or are they insulated from failures that happen far enough underneath the layering?). The insulating right now seems pretty poor compared to other SAN products, but that's not the same problem as the single-LUN reliability issue we've been discussing recently.

If you make a zpool mirror out of two S10 boxes providing iSCSI instead of one iSCSI target, that is better. If you make a zpool from one S10 box providing iSCSI, it does not matter whether the iSCSI target software is serving the one LUN from a zvol or a disk or an SVM slice; it is not fine for reliability to have a single-LUN iSCSI vdev. You must have ZFS-layer redundancy on the overall pool, above the iSCSI.

If you change from:

     client NFS
         |
   +-----------------+             +-------------+
   |                 |             |  SAN        |
   |   NFS    ZFS    |----lun0-----|  iSCSI/FC   |
   |                 |             |             |
   +-----------------+             +-------------+
                                      |      |      |
                             +--------++--------++--------+
                             |  disk  ||  disk  ||  disk  |
                             |  shelf ||  shelf ||  shelf |
                             |        ||        ||        |
                             +--------++--------++--------+

to using a non-ZFS filesystem above the iSCSI:

   [client | notZFS] -------+---------------+---------------+
                            |               |               |
                   +--------------+ +--------------+ +--------------+
                   | iSCSI target | | iSCSI target | | iSCSI target |
                   |     ZFS      | |     ZFS      | |     ZFS      |
                   |  local disk  | |  local disk  | |  local disk  |
                   +--------------+ +--------------+ +--------------+

then that is probably more okay. But for me the appeal of ZFS is to aggregate spread-out storage into really huge pools. Maybe it's hard to find a good notZFS to use in that diagram.

My understanding so far is, this is also better:

     client NFS
         |
   +-----------------+---lun0----+-------------+
   |                 |           |  SAN        |
   |   NFS    ZFS    |---lun1----|  iSCSI/FC   |
   |                 |           |             |
   +-----------------+           +-------------+
                                    |      |      |
                           +--------++--------++--------+
                           |  disk  ||  disk  ||  disk  |
                           |  shelf ||  shelf ||  shelf |
                           |        ||        ||        |
                           +--------++--------++--------+

where lun0 and lun1 make up a mirrored vdev in ZFS. It does not matter if lun0 and lun1 are carried on the same physical cable or connected to the same SAN controller, but obviously they DO need to be backed by separate storage, so you're wasting a lot of disk to do this. Even if some maintenance event reboots lun0 and lun1 at the same time, my understanding so far is that people have found the configuration above less likely to lose whole pools than running on a single LUN. For example, see the thread including message-id <B5EB902A18810E43800F26A02DB4279A2E6F13FE at CNJEXCHANGE.composers.caxton.com> around 2008-08-13.

Another alternative is to go ahead and use single-LUN vdevs, but have the data mirrored across multiple zpools using zfs send | zfs recv, or rsync, in case you lose a whole pool. That's appealing to me because the backup pool can be built from cheaper, slower pieces than the main pool, instead of burning up double the amount of expensive main-pool storage; it protects from problems/bugs/mistakes that a vdev mirror does not; and it allows changing pool geometry and removing slogs and stuff.
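For that last alternative (replicating to a second, cheaper pool rather than mirroring vdevs), a hedged sketch with hypothetical pool, dataset, and snapshot names looks something like:

    # Initial full copy to the backup pool.
    zfs snapshot tank/data@mon
    zfs send tank/data@mon | zfs recv backup/data

    # Later, ship only the changes since the previous snapshot.
    zfs snapshot tank/data@tue
    zfs send -i tank/data@mon tank/data@tue | zfs recv backup/data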
The downside to planning on restoring from backup is that, if you lose your single-LUN pool relatively often, these big, heavily-aggregated pools can take days to restore. It's a type of offline maintenance, which is always against the original ZFS kool-aid philosophy, because offline maintenance puts a de-facto cap on maximum pool size.
Hi all,

Ok, this might stir some things up again, but I would like to make this more clear. I have been reading this and other threads regarding ZFS on SAN and how well ZFS can recover from a serious error, such as a cached disk array going down or the connection to the SAN being lost. What I am hearing (Miles, ZFS-8000-72) is that sometimes you can end up in an unrecoverable state that forces you to restore the whole pool.

I have been operating quite large deployments of SVM/UFS and VxFS/VxVM for some years, and while you are sometimes forced to do a filesystem check and some files might end up in lost+found, I have never lost a whole filesystem. This is despite whole arrays crashing, split-brain scenarios, etc. In the previous discussion a lot of fingers were pointed at hardware and USB connections, but then some people mentioned losing pools located on a SAN in this thread.

We are currently evaluating whether we should begin to implement ZFS in our SAN. I can see great opportunities with ZFS, but if we have a higher risk of losing entire pools, that is a serious issue. I am aware that the other filesystems might not be in a correct state after a serious failure, but as stated before, that can be much better than restoring a multi-terabyte filesystem from yesterday's backup.

So, what is the opinion: is this an existing problem even when using enterprise arrays? If I understand this correctly, there should be no risk of losing an entire pool if DKIOCFLUSHWRITECACHE is honored by the array? If it is a problem, will the worst-case scenario be at least on par with UFS/VxFS once 6667683 is fixed?

Grateful for any additional information.

Regards
Henrik Johansson
http://sparcv9.blogspot.com
On Tue, 17 Feb 2009, Henrik Johansson wrote:
> We are currently evaluating whether we should begin to implement ZFS in
> our SAN. I can see great opportunities with ZFS, but if we have a higher
> risk of losing entire pools, that is a serious issue. I am aware that the
> other filesystems might not be in a correct state after a serious failure,
> but as stated before, that can be much better than restoring a
> multi-terabyte filesystem from yesterday's backup.

It is not clear that the risk of losing the entire pool is higher than with other filesystem types. This is a point of considerable conjecture, with no failure data to base statistics on. What is clear is that ZFS allows you to easily build much, much larger pools than other filesystem types do.

A 12-disk pool that I built a year ago is still working fine with absolutely no problems at all. Another two-disk pool built using cheap large USB drives has been running for maybe eight months, with no problems.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
bfriesen at simple.dallas.tx.us said:
> A 12-disk pool that I built a year ago is still working fine with
> absolutely no problems at all. Another two-disk pool built using cheap
> large USB drives has been running for maybe eight months, with no problems.

We have non-redundant ZFS pools on an HDS 9520V array, and also on a Sun 6120 array, some of them running for two years now (S10U3, S10U4, S10U5, both SPARC and x86), up to 4 TB in size. We have experienced SAN zoning mistakes, complete power loss to arrays, servers, and/or SAN switches, etc., with no pool corruption or data loss. We have not seen even one block checksum error detected by ZFS on these arrays (we have seen one such error on our X4500 in the past 6 months).

Note that the only available pool failure mode in the presence of a SAN I/O error for these OSes has been to panic/reboot, but so far when the systems have come back, the data has been fine. We also do tape backups of these pools, of course.

Regards,

--
Marion Hakanson <hakansom at ohsu.edu>
OHSU Advanced Computing Center
On Tue, February 17, 2009 01:50, Marion Hakanson wrote:
> Note that the only available pool failure mode in the presence of a SAN
> I/O error for these OSes has been to panic/reboot, but so far when the
> systems have come back, the data has been fine. We also do tape backups
> of these pools, of course.

Starting with Solaris 10u6 (?), the following property is available in zpool(1M):

     failmode=wait | continue | panic

         Controls the system behavior in the event of catastrophic
         pool failure. This condition is typically a result of a
         loss of connectivity to the underlying storage device(s)
         or a failure of all devices within the pool. The behavior
         of such an event is determined as follows:

         wait        Blocks all I/O access until the device
                     connectivity is recovered and the errors are
                     cleared. This is the default behavior.

         continue    Returns EIO to any new write I/O requests but
                     allows reads to any of the remaining healthy
                     devices. Any write requests that have yet to
                     be committed to disk would be blocked.

         panic       Prints out a message to the console and
                     generates a system crash dump.
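The property is set and inspected like any other pool property; a hedged example with a hypothetical pool name:

    # Let reads keep working and fail new writes with EIO, instead of
    # blocking all I/O, when the SAN path goes away.
    zpool set failmode=continue tank

    # Confirm the current setting.
    zpool get failmode tank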
Hi All,

I have been watching this thread for a while and thought it was time I chipped in my 2 cents' worth. I have been an aggressive adopter of ZFS here across all of our Solaris systems and have found the benefits have far outweighed any small issues that have arisen.

Currently I have many systems with LUNs provided from SAN-based storage for zpools. All our systems are configured with mirrored vdevs, and the reliability factor has been as good as, if not greater than, UFS and LVM. My rules of thumb tend to stem from getting the storage infrastructure right, as that generally leads to the best availability. To this end, every single SAN-attached system has dual paths to separate switches, and every array has dual controllers dual-pathed to different switches. ZFS may be more or less susceptible to any physical infrastructure problem, but in my experience it is on a par with UFS (and I gave up shelling out for VxFS long ago).

The reason for the above configuration is that our storage is evenly split between two sites, with dark fibre between them across redundant routes. This forms a ring configuration which is around 5 km around. We have so much storage that we need this in case of a data center catastrophe. The business recognizes that the time-to-recovery risk would be so great that if we didn't have it, we would be out of business in the event of one of our data centres burning down or some other natural disaster.

I have seen other people discussing power availability on other threads recently. If you want it, you can have it. You just need the business case for it. I don't buy the comments on UPS unreliability.

Quite frequently I have rebooted arrays and removed them from mirrored vdevs and have not had any issues with the LUNs they provided reattaching and resilvering. Scrubs on the pools have always been successful. The largest single mirrored pool is around 11 TB, which is formed from two 6140 RAID5s. We also use Loki boxes for very large storage pools which are routinely filled (I was a beta tester for Loki). I have two J4500s, one with 48 x 250 GB and one with 48 x 1 TB drives. No issues there either. The 48 x 1 TB one is used in a disk-to-disk-to-tape config with an SL500 to back up our entire site. It is routinely filled to the brim and it performs admirably attached to a T5220 which is 10-gig attached.

The systems I have mentioned range from Samba servers to compliance archives, Oracle DB servers, Blackboard content stores, Squid web caches, LDAP directory servers, mail stores, mail spools, and calendar server DBs. The list covers 60-plus systems. I have 0% Solaris older than Solaris 10. Why would you?

In short, I hope people don't hold back from adoption of ZFS because they are unsure about it. Judge for yourself, as I have done, and dip your toes in at whatever rate you are happy with. That's what I did.

/Scott.

I also use it at home, with an old D1000 attached to a V120 with 8 x 320 GB SCSI disks in a raidz2, for all our home data and home business (which is a printing outfit that creates a lot of very big files on our Macs).
--
_______________________________________________________________________

Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006, Manukau City, Auckland, New Zealand

Phone  : +64 09 968 7611
Fax    : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:scott at manukau.ac.nz
http://www.manukau.ac.nz
________________________________________________________________________

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
________________________________________________________________________
>>>>> "hj" == Henrik Johansson <henrikj at henkis.net> writes:

    hj> I have been operating quite large deployments of SVM/UFS and
    hj> VxFS/VxVM for some years, and while you are sometimes forced to
    hj> do a filesystem check and some files might end up in
    hj> lost+found, I have never lost a whole filesystem.

I think in the world we want, even with the other filesystems, the SAN fabric or array controller or disk shelf should be able to reboot without causing any files to show up in lost+found, or requiring anything other than the normal log roll-forward. I bet there are rampant misimplementations.

Maybe the whole SAN situation is ubiquitously misthought, because filesystem designers build things assuming that whenever anything ``crashes,'' the kernel and their own code will go down too. They invent a clever way to handle a non-SAN cord-yanking, test it, and yup, you can yank the cord and it works fine. But this isn't the way things actually fail. In the diagram below the disk loses power, but the host, SAN, and controller don't. I doubt this is too common. Probably I should redo diagrams like this after better understanding the disk command set and iSCSI tagged commands and stuff, for other parts of the stack rebooting, like the SAN or the controller.

  filesystem   initiator    SAN      controller   disk buffer   platter

  [...earlier writes not shown...]

  SYNC ------..
  |              ---------..
  time                      -----------..
  |                                      -------------
  v
  write(A)
  write(B)
  write(C)
                                        ..-------------
                            ..-----------
                 ..---------
  success ------   good.  A-C are on the platter.
                   commit ueberblock(D).
  write(D) -----..
                 ---------..
  write(E) -----..
                            -----------..
                 --------..
                                         ------------   [D]
  write(F) -----..
                            -----------..
                 -------..
                                         -------------  [E]
  write(G) -----..
                            -----------..
                          =======POWER FAILURE=======
                 -------..
                                         --------------  poof... [F] gone
                            -----------..
                                          XXXX no
                                      ..XXXX disk
                           ..-----------
                ..-------
  ERROR(G) <----  oh no! couldn't write G.  increment error counter
                          =======POWER RESTORED=======
  retry write(G) -----..
                 -------..
  SYNC -----..
                            -----------..
                 -------..
                                         --------------  [G]
                            -----------..
                                         --------------
  write(G)                              ..--------------
                           ..------------
                ..-------
  success -----   good.  that means D-G are on the platter.
                  commit ueberblock(H)
  write(H)   <-- DANGER, Will Robinson.

Writes D - F were lost in this ``event,'' and the filesystem has no idea.

If ===POWER FAILURE=== applied to the filesystem and the disk at the same time, then this problem would not exist---the way we are using SYNC here would be enough to stop H from being written---so power failures for non-SAN setups are safe from this.

Also, if we treat the disk as bad the moment it says ``write failure,'' and the array controller decides ``this disk is bad, forever'': if, the instant the disk loses power and times out write F, the controller considers the disk's entire contents lost and does not bother reading ANYthing from it until it's been resilvered by the other disks in the RAID set, then we also do not have this problem. So power failures on an SVM mirror with no understanding of the overlying filesystem are okay.

Using naked UFS or ext3 or whatever over a SAN still has this problem, I think. Those filesystems are just better at losing some data but not the whole filesystem, compared to ZFS. I think ZFS attempts to be smarter than SVM, and is also more broadly ambitious than one power supply all in one box, but is probably not smart enough to finish the job. Rather than just more UFS/VxFS-style robustness, I'd like to see the job finished and this SAN write hole closed up.
It's important to accept that nothing is broken in this event. It's just a yanked power cord. I won't accept ``a device failed, and you didn't have enough redundancy, so all bets are off. You must feed ZFS more redundancy. You expect the impossible.'' No, that argument is bullshit. Losing power unexpectedly is not the same as a device failure---unexpected power loss is part of the overall state diagram of a normal, working storage system.

    hj> We are currently evaluating whether we should begin to implement
    hj> ZFS in our SAN. I can see great opportunities with ZFS but if
    hj> we have a higher risk of losing entire pools

Optimistically, the ueberblock rollback will make ZFS like the other filesystems, though maybe faster to recover. If you are tied to stable Solaris it'll probably take something like a year before you get your hands on it, but so far I think everyone agrees it's promising.

I think it's not enough, though. If the problem is that a batch of writes was lost, then a trick to recover the pool still won't recover those lost writes, and you promised applications those writes were on the disk. Databases and filesystems inside zvols could still become corrupt. What this really means is that using SANs makes corruption in general more likely.

I think we sysadmins should start using some tiny 10-line programs to test the SANs and figure out what's wrong with them. I think in the end we will need about two things to fix it:

 * Some kind of commit/replay feature in iSCSI and FC initiators, or else the same feature implemented in the filesystems right above them but cooperating with the initiators pretty intimately. Gigabytes of write data could be ``in flight''---we are talking about however much data is between the return of a first SYNCHRONIZE CACHE command and the next one---so it'd be good to arrange that it not be buffered two or three or four times, which may require layer-violating cooperation. I'm all but certain nobody's doing this now.

   - Is it in the initiator? Commit/replay in the initiator would mean the initiator issues SYNCHRONIZE CACHE commands for itself, ones not demanded by the filesystem above it, whenever its replay write cache gets too large. I've never heard of that, and I don't think anyone would put up with an iSCSI/FC initiator burning up gigabytes of RAM without an explanation, which would mean that I'd hear about it and be worried about tuning it.

   - Is it in the filesystem? Any filesystem designed before SANs will expect to eventually get a successful return from any SYNCHRONIZE CACHE command it passes to storage. A failed SYNC will happen in the form of someone yanking the cord, so the filesystem code will never see the failure, because it won't be executing any longer. UFS and ext3 don't even bother to issue SYNCHRONIZE CACHE at all, much less pay attention to its return value and buffer writes so they can be replayed if it fails, so I doubt they have an exception path for a failed SYNC command.

     Putting replay in the filesystem also means that if the iSCSI initiator notices the target bounce, then it MUST warn the layers above that writes were lost, for example by waiting for the next SYNCHRONIZE CACHE command to come along and deliberately returning it failed without consulting the target, even though the LUN would say it succeeded if it were issued. I've never heard of anything like this.

 * Pay some attention to what happens to ZFS when a SAN controller reboots, separately with each 'failmode' setting.
To maintain correctness with NFS clients the zpool is serving, or with replicated/tiered database applications where the DBMS app is keeping several nodes in sync, ZFS may need a failmode=umount that kills any app with outstanding writes on a failed pool and un-NFS-exports all the pool's filesystems. The existing failmode=panic could probably be verified (and would likely have to be fixed) to provide the same level of correctness, but that would not be as good as the umount-and-kill, because it'd make HA and zones more antagonistic to each other by putting many zones at the mercy of the weakest pool on the system, which could even be a USB stick or something. That's the wrong direction to move.

I am not sure what failmode=continue and failmode=wait mean now, or what they should mean to fix this problem. It'd be nice if they meant what they claim to be:

    ``wait: use commit/replay schemes so that no writes are lost even if
      the SAN controller reboots. Apps should be frozen until they can
      be allowed to continue as if nothing went wrong.

      continue: fsync() returns -1 immediately for the first data that
      never made it to disk, and continues returning -1 until all writes
      issued up to now are on the platter, including writes that had to
      be replayed because of the reboot. Once fsync() has been called
      and has returned -1, all write() calls to that file must also fail
      because of the barrier. And once your app calls fsync() a second,
      third, fourth time and finally gets a 0 return from fsync(), it
      can be sure no data was lost.''

Of course all that seems optimistic beyond ridiculous, even for UFS and VxFS. But if implemented like that, panic and wait should both be safe for SAN outages; and continue we already understand to be unsafe, but implemented like this it becomes possible to write a cooperating app (like a database, or a user-mode iSCSI target app, for example) which is correct.

    hj> So, what is the opinion, is this an existing problem even when
    hj> using enterprise arrays? If I understand this correctly, there
    hj> should be no risk of losing an entire pool if
    hj> DKIOCFLUSHWRITECACHE is honored by the array?

No, the timing diagram I showed explains how I think data might still be lost during a SAN reboot, even for a SAN which respects cache flushes. But all this is pretty speculative for now.
On 17-Feb-09, at 3:01 PM, Scott Lawson wrote:
> Hi All,
> ...
> I have seen other people discussing power availability on other threads
> recently. If you want it, you can have it. You just need the business
> case for it. I don't buy the comments on UPS unreliability.

Hi,

I remarked on it. FWIW, my experience is that commercial data centres do not avoid 'unscheduled outages', no matter how many steely-eyed assurances they give. It seems rather imprudent to assume that power is never going to fail.

No matter how many diesel generators, rooftop tanks, or pebble-bed reactors you have, somebody is inevitably going to kick out a plug... at least in most of the real world.

--Toby
On Feb 17, 2009, at 21:35, Scott Lawson wrote:
> Everything we have has dual power supplies, fed from dual power rails,
> fed from separate switchboards, through separate very large UPSes,
> backed by generators, fed by two substations, and then cloned to
> another data center 3 km away. HA

http://www.geonet.org.nz/earthquake/quakes/recent_quakes.html

;)

> I am far, far more worried about someone with root access typing
> 'zpool destroy' than I am worried about the lights going out in the
> data centers I designed that house hundreds and hundreds of servers. ;)

Yeah, this is probably more likely.
Toby Thain wrote:
> On 17-Feb-09, at 3:01 PM, Scott Lawson wrote:
>> Hi All,
>> ...
>> I have seen other people discussing power availability on other threads
>> recently. If you want it, you can have it. You just need the business
>> case for it. I don't buy the comments on UPS unreliability.
>
> Hi,
>
> I remarked on it. FWIW, my experience is that commercial data centres
> do not avoid 'unscheduled outages', no matter how many steely-eyed
> assurances they give. It seems rather imprudent to assume that power
> is never going to fail.
>
> No matter how many diesel generators, rooftop tanks, or pebble-bed
> reactors you have, somebody is inevitably going to kick out a plug...
> at least in most of the real world.
>
> --Toby

That's why you have two plugs, if not more. I still don't buy your argument. It comes down to procedural issues on the site when it comes to people kicking plugs out. Everything we have has dual power supplies, fed from dual power rails, fed from separate switchboards, through separate very large UPSes, backed by generators, fed by two substations, and then cloned to another data center 3 km away. HA is all about design. (I won't even comment on further up the stack than electricity.)

We have secure data centers with strict practices of work and qualified staff following best practice for maintenance and risk management around maintenance.

I am far, far more worried about someone with root access typing 'zpool destroy' than I am worried about the lights going out in the data centers I designed that house hundreds and hundreds of servers. ;) And no, we don't have unplanned outages. Not in a long time. Not all people that design data centers know how to design power systems for them. Sometimes the IT people don't convey their requirements exactly enough to the electrical engineers. (I am an electrical engineer who got sidetracked by SunOS around '91 and never went back.)

Anyway, we diverge, I think. Maybe we can agree to disagree? Back to discussions about disk caddies and overpriced hardware... slightly closer to the topic at hand... ;)

/Scott
On 17-Feb-09, at 9:35 PM, Scott Lawson wrote:
> Toby Thain wrote:
>> I remarked on it. FWIW, my experience is that commercial data centres
>> do not avoid 'unscheduled outages', no matter how many steely-eyed
>> assurances they give. It seems rather imprudent to assume that power
>> is never going to fail.
>>
>> No matter how many diesel generators, rooftop tanks, or pebble-bed
>> reactors you have, somebody is inevitably going to kick out a plug...
>> at least in most of the real world.
>
> That's why you have two plugs, if not more. I still don't buy your
> argument. It comes down to procedural issues on the site when it comes
> to people kicking plugs out. Everything we have has dual power supplies,
> fed from dual power rails, fed from separate switchboards, through
> separate very large UPSes, backed by generators, fed by two substations,
> and then cloned to another data center 3 km away. HA is all about design.
> (I won't even comment on further up the stack than electricity.)
>
> We have secure data centers with strict practices of work and qualified
> staff following best practice for maintenance and risk management around
> maintenance.
>
> I am far, far more worried about someone with root access typing
> 'zpool destroy' than I am worried about the lights going out in the
> data centers I designed that house hundreds and hundreds of servers. ;)
> And no, we don't have unplanned outages. Not in a long time. Not all
> people that design data centers know how to design power systems for
> them. Sometimes the IT people don't convey their requirements exactly
> enough to the electrical engineers. (I am an electrical engineer who got
> sidetracked by SunOS around '91 and never went back.)
>
> Anyway, we diverge, I think. Maybe we can agree to disagree?

Not at all. You've convinced me. Your servers will never, ever lose power unexpectedly.

--Toby

> Back to discussions about disk caddies and overpriced hardware...
> slightly closer to the topic at hand... ;)
David Magda wrote:
> On Feb 17, 2009, at 21:35, Scott Lawson wrote:
>> Everything we have has dual power supplies, fed from dual power rails,
>> fed from separate switchboards, through separate very large UPSs,
>> backed by generators, fed by two substations, and then cloned to
>> another data center 3 km away. HA
>
> http://www.geonet.org.nz/earthquake/quakes/recent_quakes.html

Ha. Yeah, that's why we were once known to the British as "The Shaky
Isles". We do have lots of earthquakes around the Pacific Rim. We are in
Auckland, however, which is north of all those little stars on the pic
marking where the edge of the Pacific plate meets the Australian plate,
so there are not too many earthquakes to worry about in Auckland compared
to the rest of NZ. Although one of the data centers I built recently was
on the second floor of a building and had to be earthquake restrained,
because we were potentially going to be creating up to one-ton point
loads on the floor.

The rest of NZ gets little and biggish quakes fairly often, so much so
that the Aussies next door on the west island see fit to warn their
citizens about the potential for earthquakes in NZ when visiting... Our
capital city, Wellington, on the other hand, is built on fault lines...
think San Francisco...

Now a volcano in Auckland might be a different story... We have over 50
dormant cones and a whopping big one in the main harbor called
"Rangitoto". Translated from the Maori it means "Blood Sky". My UPSs
won't protect against that one...

> ;)
>
>> I am far, far more worried about someone with root access typing
>> 'zpool destroy' than I am about the lights going out in the data
>> centers I designed that house hundreds and hundreds of servers. ;)
>
> Yeah, this is probably more likely.

--
Scott Lawson
Systems Architect
Manukau Institute of Technology
Toby Thain wrote:
> Not at all. You've convinced me. Your servers will never, ever lose
> power unexpectedly.

Methinks living in Auckland has something to do with that :-)
http://en.wikipedia.org/wiki/1998_Auckland_power_crisis

When services are reliable, complacency becomes the risk. My favorite
recent example is the levees in New Orleans: Katrina didn't top the
levees, they were undermined.
 -- richard
Hi Andras,

No problem writing directly. Answers inline below. (If there are any
typos, it's because it's late and I have had a very long day ;))

andras spitzer wrote:
> Scott,
>
> Sorry for writing you directly, but most likely you missed my questions
> regarding your SW design; whenever you have time, would you reply to
> them? I really value your comments and appreciate them, as it seems you
> have great experience with ZFS in a professional environment, and that
> is not so common today.
>
> That was my e-mail, in response to yours (it's in the thread):
>
> "Scott,
>
> That is an awesome reference you wrote. I totally understand and agree
> with your idea of having everything redundant (dual path, redundant
> switches, dual controllers) in the SAN infrastructure. I have some
> questions about the SW design you use, if you don't mind.
>
> - are you using MPxIO as DMP?

Yes, configured via 'stmsboot' (there's a rough command sketch further
down). I have used Sun MPxIO for quite a few years now and have found it
works well (it was the SAN Foundation Kit for many years).

> - as I understood from your e-mail, all of your ZFS pools are ZFS
> mirrored? (you don't have a non-redundant ZFS configuration)

Certainly the ones built from SAN-based disk. No, there are no
non-redundant ZFS configurations; all storage is doubled up. Expensive,
but we tend to stick to modular storage for this and spread the cost over
many years. The storage budget is at least 50% of the systems group
infrastructure budget. There are many other ZFS file systems which aren't
SAN attached and are in mirrors, RAIDZs, etc. I mentioned the Lokis, aka
J4500s, which are in RAIDZs. Very nice, and they have worked very
reliably so far. I would strongly advocate these units for ZFS if you
want a lot of disk reasonably cheaply that performs well...

> - why did you decide to use ZFS mirror instead of ZFS raidz or raidz2?

Because we already have hardware-based RAID5 from our arrays (Sun 3510,
3511, 6140s). The ZFS file systems are used mostly for mirroring
purposes, but also to take advantage of the other nice things ZFS brings,
like snapshots, cloning, clone promotion, etc.

> - you have RAID5-protected LUNs from the SAN, and you put a ZFS mirror
> on top of them?

Yes. Covered above, I think.

> Could you please share some details about your configuration regarding
> SAN redundancy vs ZFS redundancy (I guess you use both here), and also
> some background on why you decided to go with that?

We have been doing it for many years, not just with ZFS but with UFS and
VxFS as well, and also on quite a large number of NTFS machines. We have
two geographically separate data centers a few kilometers apart, with
redundant dark fibre links over different routes. All core switches are
in a full mesh with two cores per site, each with a redundant connection
to the two cores at the other site, one via each route.

We believe strongly that storage is the key to our business. Servers are
just processing to work the data and are far easier to replace. We tend
to standardize on particular models and then buy a bunch of them, and not
necessarily maintenance for them.

There are a lot of key things to building a reliable data center. I have
been having a lively discussion on this with Toby and Richard, which has
raised some interesting points. I do firmly believe in getting things
right from the ground up: I start with power and environment. Storage
comes next in my book.
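To pull the MPxIO and mirroring answers above together, here is a rough
sketch of what that looks like from the command line. It is illustrative
only; the pool name 'sanpool' and the cXtYd0 device names are made-up
placeholders, not our actual LUNs.

  # Enable MPxIO multipathing; stmsboot prompts for a reboot so the new
  # scsi_vhci device paths can take effect.
  stmsboot -e

  # After the reboot, list the mapping from the old device names to the
  # new multipathed ones.
  stmsboot -L

  # Create the pool as a ZFS mirror of two RAID5 LUNs, one presented
  # from each array, so ZFS always has a second copy to self-heal from.
  zpool create sanpool mirror c2t6000AAAAd0 c3t6000BBBBd0

  # Confirm both halves of the mirror are online.
  zpool status sanpool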
> Regards,
> sendai"
>
> One point I'm really interested in is that it seems you deploy ZFS with
> ZFS mirror even when you have RAID redundancy at the HW/SAN level,
> which obviously means extra cost for you. I'm looking for a fairly
> decisive opinion on whether it is safe to use a ZFS configuration
> without redundancy when you have RAID redundancy in your high-end SAN,
> or whether you would still go with ZFS redundancy (ZFS mirror in your
> case, not even raidz or raidz2) because of the extra self-healing
> feature and the lowered risk of total pool failure?

I think this has also been covered in recent list posts. The important
thing is really to have two copies of every block if you wish to be able
to self-heal. The cost, I guess, depends on what value you place on the
availability and reliability of your data. ZFS mirrors are faster for
resilvering as well; much, much faster in my experience. We recently used
this during a data center move and rebuild: our SAN fabric was extended
to three sites, and we moved blocks of storage one piece at a time and
resynced them at the new location once they were in place, with zero
disruption to the business.

I do think the Fishworks gear is going to prove to be a game changer in
the near future for many people, as it will offer many of the features we
want in our storage. Once COMSTAR has been integrated into that line I
might buy some. (I have a large investment in fibre channel, and I don't
trust networking people as far as I can kick them when it comes to
understanding the potential problems that can arise from disconnecting
block targets that are coming in over Ethernet.)

> Also, if you could reply in the thread, so that everyone can read your
> experiences, that would be great!
>
> Regards,
> sendai
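P.S. For anyone following along, the mirror shuffle we used for the data
center move is just the standard ZFS attach/resilver/detach cycle. A
rough sketch follows; again, the pool and device names are made up for
illustration.

  # Attach a LUN presented from the new site to the existing device,
  # temporarily turning that vdev into a three-way mirror.
  zpool attach sanpool c2t6000AAAAd0 c4t6000CCCCd0

  # Watch the resilver; ZFS copies only allocated blocks rather than the
  # whole device, which is part of why this goes so quickly.
  zpool status -v sanpool

  # Once the resilver completes, drop the old side of the mirror.
  zpool detach sanpool c2t6000AAAAd0

  # Regular scrubs verify every checksum and repair silent errors from
  # the remaining good copy.
  zpool scrub sanpool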