If it's of interest, I've written up some articles on my experiences of building a ZFS NAS box, which you can read here: http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/

I used CIFS to share the filesystems, but it will be a simple matter to use NFS instead: issue the command 'zfs set sharenfs=on pool/filesystem' instead of 'zfs set sharesmb=on pool/filesystem'.

Hope it helps.

Simon

Originally posted to answer someone's request for info in storage-discuss.
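For readers new to ZFS sharing, a minimal sketch of the NFS variant Simon describes (the pool, filesystem and device names here are placeholders, not taken from his setup):

# zpool create tank mirror c1t0d0 c1t1d0
# zfs create tank/media
# zfs set sharenfs=on tank/media
# zfs get sharenfs tank/media

Clients then mount the filesystem at its default share path, e.g. server:/tank/media.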
Fascinating read, thanks Simon! I have been using ZFS in a production data center for a while now, but it never occurred to me to use iSCSI with ZFS as well. This gives me some ideas on how to back up our mail pools onto some older, slower disks offsite.

I find it interesting that while a local ZFS pool becoming unavailable will panic the system, losing access to iSCSI may not carry the same penalty. Not sure if it's a bug or a feature, but when I rebooted the target system, the initiator system stayed up and did not panic.
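In case it helps anyone trying the same offsite-backup idea, a rough sketch of one way to do it with the Solaris iSCSI target of this era (all names, sizes and addresses below are made up for illustration):

On the old backup box (target):

# zfs create -V 500g backup/mailvol
# zfs set shareiscsi=on backup/mailvol

On the mail server (initiator):

# iscsiadm modify discovery --sendtargets enable
# iscsiadm add discovery-address 192.168.10.50:3260
# devfsadm -i iscsi
# zpool create mailbackup c3t<GUID>d0      (use the device name that appears in 'format' after discovery)
# zfs snapshot mailpool/data@today
# zfs send mailpool/data@today | zfs receive mailbackup/data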
Thanks a lot, glad you liked it :) Yes, I agree: using older, slower disks in this way for backups seems a nice way to reuse old kit for something useful.

There's one nasty problem I've seen with making a pool from an iSCSI disk hosted on a different machine: if you turn off the hosting machine and then shut down the machine using the iSCSI disk in its pool, it takes ages to shut down. It seems to keep trying for a long time to reach the iSCSI disk, and obviously it can't. I think there's a bug report for this, and I thought it was fixed, but as of SXCE build 85 it seems not, as I saw the problem occur again yesterday. The workaround is to do a 'zpool export pool_importing_iSCSI_disks' before shutting down the machine; it will then shut down normally without trying to connect to the iSCSI target(s). More info here:
http://www.opensolaris.org/jive/thread.jspa?messageID=196459

This guy seems to have had lots of fun with iSCSI :)
http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html
http://web.ivy.net/~carton/oneNightOfWork/20071204-zfsnotes.txt

I wonder how many of his problems were due to using a non-Solaris iSCSI target? My experience of mixing iSCSI targets and initiators from different OSs was not very good, but I didn't do very much with it.
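A minimal sketch of that workaround, assuming a hypothetical pool named 'itank' built on the remote iSCSI disk:

# zpool export itank
# init 5

and, after the iSCSI host is back up and the machine has been rebooted:

# zpool import itank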
> This guy seems to have had lots of fun with iSCSI :)
> http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html

This is scaring the heck out of me. I have a project to create a zpool mirror out of two iSCSI targets, and if the failure of one of them will panic my system, that will be totally unacceptable. What's the point of having an HA mirror if one side can't fail without busting the host? Is it really true, as the guy at the above link states (please read the link, sorry), that when one iSCSI mirror goes offline, the initiator system will panic? Or even worse, not boot itself cleanly after such a panic? How could this be? Anyone else have experience with iSCSI-based ZFS mirrors?

Thanks,

Jon
On Sat, Apr 5, 2008 at 12:25 AM, Jonathan Loran <jloran at ssl.berkeley.edu> wrote:
> This is scaring the heck out of me. I have a project to create a zpool
> mirror out of two iSCSI targets, and if the failure of one of them will
> panic my system, that will be totally unacceptable. What's the point of
> having an HA mirror if one side can't fail without busting the host? Is
> it really true, as the guy at the above link states (please read the
> link, sorry), that when one iSCSI mirror goes offline, the initiator
> system will panic? Or even worse, not boot itself cleanly after such a
> panic? How could this be? Anyone else have experience with iSCSI-based
> ZFS mirrors?

Crazy question here... but has anyone tried this with, say, a QLogic hardware iSCSI card? Seems like it would solve all your issues. Granted, they aren't free like the software stack, but if you're trying to set up an HA solution, the ~$800 price tag per card seems pretty darn reasonable to me.
Will Murnane
2008-Apr-05 15:32 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
On Sat, Apr 5, 2008 at 5:25 AM, Jonathan Loran <jloran at ssl.berkeley.edu> wrote:
> This is scaring the heck out of me. I have a project to create a zpool
> mirror out of two iSCSI targets, and if the failure of one of them will
> panic my system, that will be totally unacceptable.

I haven't tried this myself, but perhaps the "failmode" property of ZFS will solve this?

Will
If you have a mirrored iSCSI zpool, it will NOT panic when one of the submirrors is unavailable. 'zpool status' will hang for some time, but after (I think) 300 seconds it will mark the device as unavailable.

The panic was the default behaviour in the past, and it only occurs if all devices are unavailable. Since (I think) build 77 there is a new zpool property, failmode, which you can set to prevent a panic:

     failmode=wait | continue | panic

         Controls the system behavior in the event of catastrophic
         pool failure. This condition is typically a result of a
         loss of connectivity to the underlying storage device(s)
         or a failure of all devices within the pool. The behavior
         of such an event is determined as follows:

         wait        Blocks all I/O access until the device
                     connectivity is recovered and the errors are
                     cleared. This is the default behavior.

         continue    Returns EIO to any new write I/O requests but
                     allows reads to any of the remaining healthy
                     devices. Any write requests that have yet to
                     be committed to disk would be blocked.

         panic       Prints out a message to the console and
                     generates a system crash dump.
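For reference, checking and changing the property is a one-liner on builds that have it (the pool name here is hypothetical):

# zpool get failmode tank
# zpool set failmode=continue tank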
Vincent Fox
2008-Apr-05 20:12 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
I don't think ANY situation in which you are mirrored and one half of the mirror pair becomes unavailable will panic the system. At least this has been the case when I've tested with local storage; I haven't tried with iSCSI yet, but will give it a whirl.

I had a simple single ZVOL shared over iSCSI, and thus no redundancy, and bringing down the target system didn't crash the initiator. And this is with Solaris 10u4, not even the latest OpenSolaris. Well, okay, if I'm logged onto the initiator and in the directory for the pool at the time I bring down the target, my shell gets hung. But it hasn't panicked. I will wait a good 15 minutes to make sure of this and post some failure-mode results later this evening.
Vincent Fox
2008-Apr-05 20:31 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
Followup, my initiator did eventually panic.

I will have to do some setup to get a ZVOL from another system to mirror with, and see what happens when one of them goes away. Will post in a day or two on that.
Jonathan Loran
2008-Apr-06 06:30 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
kristof wrote:
> If you have a mirrored iSCSI zpool, it will NOT panic when one of the
> submirrors is unavailable. 'zpool status' will hang for some time, but
> after (I think) 300 seconds it will mark the device as unavailable.
>
> The panic was the default behaviour in the past, and it only occurs if
> all devices are unavailable. Since (I think) build 77 there is a new
> zpool property, failmode, which you can set to prevent a panic:
>
> [failmode property description snipped -- see kristof's message above]

This is encouraging, but one problem: our system is on Solaris 10 U4. Will this guy be immune to panics when one side of the mirror goes down? Seriously, I'm tempted to upgrade this box to OS b8? However, there are a lot of dependencies which we need to worry about in doing that -- for example, will all our off-the-shelf software run with OpenSolaris? More things to test.

Thanks,

Jon
Jonathan Loran
2008-Apr-06 06:42 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
Vincent Fox wrote:
> Followup, my initiator did eventually panic.
>
> I will have to do some setup to get a ZVOL from another system to
> mirror with, and see what happens when one of them goes away. Will
> post in a day or two on that.

On Sol 10 U4, I could have told you that. A few weeks back, I was bone-headed and took down a target with a completely idle zpool on it. The initiator system eventually panicked -- when I brought the target back up! But this pool wasn't mirrored. I'm hoping I can set up a mirror of iSCSI targets and get all the benefits of HA.

BTW Vincent: thanks for doing all my testing for me ;) Seriously, I'm throwing together a test setup of my own on Monday. Need to be sure this will work.

Jon
Richard Elling
2008-Apr-06 16:33 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
Jonathan Loran wrote:
> Vincent Fox wrote:
>> Followup, my initiator did eventually panic.
>>
>> I will have to do some setup to get a ZVOL from another system to
>> mirror with, and see what happens when one of them goes away. Will
>> post in a day or two on that.
>
> On Sol 10 U4, I could have told you that. A few weeks back, I was
> bone-headed and took down a target with a completely idle zpool on it.
> The initiator system eventually panicked -- when I brought the target
> back up! But this pool wasn't mirrored. I'm hoping I can set up a
> mirror of iSCSI targets and get all the benefits of HA.

This is the expected behaviour for an unprotected Solaris 10 u4 setup.
 -- richard
To repeat what some others have said: yes, Solaris seems to handle an iSCSI device going offline, in that it doesn't panic and continues working once everything has timed out.

However, that doesn't necessarily mean it's ready for production use. ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client to time out. Now I don't know about you, but HA to me doesn't mean "Highly Available, but with occasional 3 minute breaks". Most of the client applications we would want to run on ZFS would be broken by a 3 minute delay returning data, and this was enough for us to give up on ZFS over iSCSI for now.
On Mon, Apr 07, 2008 at 01:06:34AM -0700, Ross wrote:
> To repeat what some others have said: yes, Solaris seems to handle
> an iSCSI device going offline, in that it doesn't panic and
> continues working once everything has timed out.
>
> However, that doesn't necessarily mean it's ready for production use.
> ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client
> to time out. Now I don't know about you, but HA to me doesn't mean
> "Highly Available, but with occasional 3 minute breaks". Most of
> the client applications we would want to run on ZFS would be broken
> by a 3 minute delay returning data, and this was enough for us to
> give up on ZFS over iSCSI for now.

Doesn't this also happen with UFS on an iSCSI device? iSCSI is just a local disk. What would happen if a physical disk went offline?

We like the 3-minute delay because it gives us time to reboot the Netapp that provides storage on our iSCSI SAN without having to shut down all of the applications. Something has to happen when a disk goes offline. We also use Solaris multipathing with two independent network paths to the Netapp so that a network failure won't break iSCSI.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
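For anyone wanting to replicate Gary's two-path setup, a rough sketch only (the addresses are placeholders; on builds of this vintage MPxIO also has to be enabled for the iSCSI initiator, so check your release's documentation for the exact switch):

# iscsiadm add discovery-address 10.0.0.10:3260
# iscsiadm add discovery-address 10.0.1.10:3260
# mpathadm list lu        (each LUN should show two operational paths)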
Ross wrote:
> To repeat what some others have said: yes, Solaris seems to handle an
> iSCSI device going offline, in that it doesn't panic and continues
> working once everything has timed out.
>
> However, that doesn't necessarily mean it's ready for production use.
> ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client
> to time out. Now I don't know about you, but HA to me doesn't mean
> "Highly Available, but with occasional 3 minute breaks". Most of the
> client applications we would want to run on ZFS would be broken by a
> 3 minute delay returning data, and this was enough for us to give up
> on ZFS over iSCSI for now.

By default, the sd driver has a 60 second timeout with either 3 or 5 retries before timing out the I/O request. In other words, for the same failure mode in a DAS or SAN you will get the same behaviour.
 -- richard
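As an aside, and as a hedged sketch only (the thread notes further down that changing the defaults is rarely recommended): on Solaris 10-era systems the sd timeout Richard refers to is commonly adjusted via /etc/system, for example:

* lower the sd I/O timeout from the default 60 seconds to 30 (value in seconds, hex)
set sd:sd_io_time = 0x1e

followed by a reboot. Whether that is wise for a given configuration is another matter.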
On Mon, 7 Apr 2008, Ross wrote:
> However, that doesn't necessarily mean it's ready for production use.
> ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client
> to time out. Now I don't know about you, but HA to me doesn't mean
> "Highly Available, but with occasional 3 minute breaks". Most of the
> client applications we would want to run on ZFS would be broken by a
> 3 minute delay returning data, and this was enough for us to give up
> on ZFS over iSCSI for now.

It seems to me that this is a problem with the iSCSI client timeout parameters rather than ZFS itself. Three minutes is sufficient for use over the "internet" but seems excessive on a LAN. Have you investigated to see if the iSCSI client timeout parameters can be adjusted?

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,     http://www.GraphicsMagick.org/
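A starting point for the investigation Bob suggests, assuming the stock Solaris software initiator, is simply to dump the parameters currently in effect; whether the 180-second session recovery value itself is tunable on a given build is something to confirm against that release's documentation:

# iscsiadm list initiator-node
# iscsiadm list target -v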
> Crazy question here... but has anyone tried this with, say, a QLogic
> hardware iSCSI card? Seems like it would solve all your issues.
> Granted, they aren't free like the software stack, but if you're trying
> to set up an HA solution, the ~$800 price tag per card seems pretty darn
> reasonable to me.

Not sure how this would help if one target fails. The card doesn't work any magic making the target always available. We are testing a QLA-4052C card; we believe QLogic tested it as installed on a Sun box, but not against Solaris iSCSI targets. An attempt to connect from this card *appears* to cause our iscsitgtd daemon to consume a great deal of CPU and memory. We're still trying to find out why.

CT
On Mon, Apr 7, 2008 at 10:40 AM, Christine Tran <Christine.Tran at sun.com> wrote:
> Not sure how this would help if one target fails. The card doesn't work
> any magic making the target always available. We are testing a QLA-4052C
> card; we believe QLogic tested it as installed on a Sun box, but not
> against Solaris iSCSI targets. An attempt to connect from this card
> *appears* to cause our iscsitgtd daemon to consume a great deal of CPU
> and memory. We're still trying to find out why.
>
> CT

How would it not help? From what I'm reading, there's a flag in the software iSCSI stack controlling how to react if a target is lost. This is completely bypassed if you use the hardware card. As far as the OS is concerned, it's just another SCSI disk.
Ross Smith wrote:
> Which again is unacceptable for network storage. If hardware raid
> controllers took over a minute to time out a drive, network admins would
> be in uproar. Why should software be held to a different standard?

You need to take a systems approach to analyzing these things. For example, how long does an array take to cold boot? When I was Chief Architect for Integrated Systems Engineering, we had a product which included a storage array and a server racked together. If you used the defaults and simulated a power-loss failure scenario, the whole thing fell apart. Why? Because the server cold booted much faster than the array. When Solaris started, it looked for the disks, found none because the array was still booting, and declared those disks dead. The result was that you needed system administrator intervention to get the services started again. Not acceptable. The solution was to delay the server boot to more closely match the array's boot time.

The default timeout values can be changed, but we rarely recommend it. You can get into all sorts of false failure modes with small timeouts. For example, most disks spec a 30 second spin-up time, so if your disk is spun down, perhaps for power savings, you need a timeout which is greater than 30 seconds by some margin. Similarly, if you have a CD-ROM hanging off the bus, you need a long timeout to accommodate the slow data access of a CD-ROM. I wrote a Sun BluePrint article discussing some of these issues a few years ago:
http://www.sun.com/blueprints/1101/clstrcomplex.pdf

> I can understand the driver being persistent if your data is on a
> single disk, however when you have any kind of redundant data, there
> is no need for these delays. And there should definitely not be
> delays in returning status information. Who ever heard of a hardware
> raid controller that takes 3 minutes to tell you which disk has gone bad?
>
> I can understand how the current configuration came about, but it
> seems to me that the design of ZFS isn't quite consistent. You do all
> this end-to-end checksumming to double check that data is consistent,
> because you don't trust the hardware, cables, or controllers not to
> corrupt data. Yet you trust that same equipment absolutely when it
> comes to making status decisions.
>
> It seems to me that you either trust the infrastructure or you don't,
> and the safest decision (as ZFS's integrity checking has shown) is not
> to trust it. ZFS would be better off assuming that drivers and
> controllers won't always return accurate status information, and having
> its own set of criteria to determine whether a drive (of any kind) is
> working as expected and returning responses in a timely manner.

I don't see any benefit for ZFS to add another set of timeouts over and above the existing timeouts. Indeed, we often want to delay any rash actions which would cause human intervention or prolonged recovery later. Sometimes patience is a virtue.
 -- richard
| Is it really true that as the guy on the above link states (please
| read the link, sorry) when one iSCSI mirror goes offline, the
| initiator system will panic? Or even worse, not boot itself cleanly
| after such a panic? How could this be? Anyone else with experience
| with iSCSI based ZFS mirrors?

Our experience with Solaris 10U4 and iSCSI targets is that Solaris only panics if the pool fails entirely (e.g., you lose both/all mirrors in a mirrored vdev). The fix for this is in current OpenSolaris builds, and we have been told by our Sun support people that it will (only) appear in Solaris 10 U6, apparently scheduled for sometime around fall.

My experience is that Solaris will normally recover after the panic and reboot, although failed ZFS pools will be completely inaccessible, as you'd expect. However, there are two gotchas:

* Under at least some circumstances, a completely inaccessible iSCSI target (as you might get with, e.g., a switch failure) will stall booting for a significant length of time (tens of minutes, depending on how many iSCSI disks you have on it).

* If a ZFS pool's storage is present but unwritable for some reason, Solaris 10 U4 will panic the moment it tries to bring the pool up; you will wind up stuck in a perpetual 'boot, panic, reboot, ...' cycle until you forcibly remove the storage entirely somehow.

The second issue is presumably fixed as part of the general fix for 'ZFS panics on pool failure', although we haven't tested it explicitly. I don't know if the first issue is fixed in current Nevada builds.

- cks
Just to report back to the list... sorry for the lengthy post.

So I've tested the iSCSI-based ZFS mirror on Sol 10u4, and it does more or less work as expected. If I unplug one side of the mirror -- unplug or power down one of the iSCSI targets -- I/O to the zpool stops for a while, perhaps a minute, and then things free up again. zpool commands seem to get unworkably slow, and error messages fly by on the console like fire ants running from a flood. Worst of all, after plugging the faulted mirror back in (before removing the mirror from the pool), it's very hard to bring the faulted device back online:

prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid. Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr 8 16:34:08 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t1d0  FAULTED      0 2.88K     0  corrupted data
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

>>>>>>>>> Comment: why are there now two instances of c2t1d0?? <<<<<<<<<<

prudhoe # zpool replace test c2t2d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).

prudhoe # zpool replace -f test c2t2d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).

prudhoe # zpool remove test c2t2d0
cannot remove c2t2d0: no such device in pool

prudhoe # zpool offline test c2t2d0
cannot offline c2t2d0: no such device in pool

prudhoe # zpool online test c2t2d0
cannot online c2t2d0: no such device in pool

>>>>>>>>>> OK, get more drastic <<<<<<<<<<<<<<

prudhoe # zpool clear test

prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid. Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr 8 16:34:08 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t1d0  FAULTED      0     0     0  corrupted data
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

>>>>>>>>>>>>>>>>>>>>> Frustration setting in. The error counts are zero, but still
two instances of c2t1d0 listed... <<<<<<<<<<<<<<<<

prudhoe # zpool export test
prudhoe # zpool import test

prudhoe # zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
test   12.9G  9.54G  3.34G  74%  ONLINE  -

prudhoe # zpool status
  pool: test
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 1.11% done, 0h20m to go
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

>>>>> Finally resilvering with the right devices.

The thing I really don't like here is that the pool had to be exported and then imported to make this work. For an NFS server, this is not really acceptable. Now I know this is ol' Solaris 10u4, but still, I'm surprised I needed to export/import the pool to get it working correctly again. Anyone know what I did wrong? Is there a canonical way to online the previously faulted device?

Anyway, it looks like for now I can get some sort of HA out of this iSCSI mirror. The other pluses are that the pool can self-heal, and reads will be spread across both units.

Cheers,

Jon

---

P.S. Playing with this more before sending this message: if you can detach the faulted mirror before putting it back online, it all works well. Hope that nothing bounces on your network when you have a failure:

---->>>> unplug one iSCSI mirror, then:

prudhoe # zpool status -v
  pool: test
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas
        exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: scrub completed with 0 errors on Wed Apr 9 14:18:45 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t2d0  UNAVAIL      4    91     0  cannot open
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

prudhoe # zpool detach test c2t2d0

prudhoe # zpool status -v
  pool: test
 state: ONLINE
 scrub: scrub completed with 0 errors on Wed Apr 9 14:18:45 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c2t1d0    ONLINE       0     0     0

errors: No known data errors

----->>>> replug the downed mirror, and:

prudhoe # zpool attach test c2t1d0 c2t2d0

prudhoe # zpool status -v
  pool: test
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.04% done, 2h17m to go
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0

errors: No known data errors

Voila!

Jon
Chris Siebenmann wrote:
> | What you're saying is independent of the iqn id?
>
> Yes. SCSI objects (including iSCSI ones) respond to specific SCSI
> INQUIRY commands with various 'VPD' pages that contain information about
> the drive/object, including serial number info.
>
> Some Googling turns up:
> http://wikis.sun.com/display/StorageDev/Solaris+OS+Disk+Driver+Device+Identifier+Generation
> http://www.bustrace.com/bustrace6/sas.htm
>
> Since you're using Linux IET as the target, you want to set the
> 'ScsiId' and 'ScsiSN' Lun parameters to unique (and different) values.
>
> (You can use sdparm, http://sg.torque.net/sg/sdparm.html, on Solaris
> to see exactly what you're currently reporting in the VPD data for each
> disk.)
>
> - cks

CC-ing the list, because this is of general interest...

Chris, indeed the older version of Open-E iSCSI I was using for my tests has no unique VPD identifiers whatsoever, so this could confuse the initiator:

prudhoe # sdparm -6 -i /devices/iscsi/disk at 0000iqn.2008-04%3Aiscsi-1.target10001,0:wd,raw
    /devices/iscsi/disk at 0000iqn.2008-04%3Aiscsi-1.target10001,0:wd,raw: IET  VIRTUAL-DISK  0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: IET
      vendor specific:

Whereas the new version of Open-E iSCSI (called iSCSI-R3) does. These are two LUNs from the system I will be doing a ZFS mirror on, running the new Open-E iSCSI-R3 on the target:

apollo # sdparm -i /devices/scsi_vhci/ssd at g695343534900000058424433517a6639707a71597273647a:wd,raw
    /devices/scsi_vhci/ssd at g695343534900000058424433517a6639707a71597273647a:wd,raw: iSCSI  DISK  0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: iSCSI
      vendor specific: XBD3Qzf9pzqYrsdz

apollo # sdparm -i /devices/scsi_vhci/ssd at g69534353490000005a6b6e43326c6257413579334d377636:wd,raw
    /devices/scsi_vhci/ssd at g69534353490000005a6b6e43326c6257413579334d377636:wd,raw: iSCSI  DISK  0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: iSCSI
      vendor specific: ZknC2lbWA5y3M7v6

Open-E iSCSI-R3 generates a unique vendor-specific serial number, so the ZFS mirror will most likely fail and recover more cleanly.

Thanks for the pointers.

Jon
I had similar problems replacing a drive myself; it's not intuitive exactly which ZFS commands you need to issue to recover from a drive failure. I think your problems stemmed from using -f. Generally, if you have to use that, there's a step or option you've missed somewhere. However, I'm not 100% sure what command you should have used instead. Things I've tried in the past include:

# zpool replace test c2t2d0 c2t2d0

or

# zpool online test c2t2d0
# zpool replace test c2t2d0

I know I did a whole load of testing of various options to work out how to replace a drive in a test machine. I'm looking to see if I have any iSCSI notes around, but from memory, when I tested iSCSI I was also testing ZFS on a cluster, so my solution was to simply get the iSCSI devices working on the offline node, then fail over ZFS. It only took 2-3 seconds to fail ZFS over to the other node, and I suspect I used that solution because I couldn't work out how to get ZFS to correctly bring faulted iSCSI devices back online.

However, in case it helps, I do have the whole process for physical disks on a Sun X4500 documented:

# zpool offline splash c5t7d0

Now, find the controller in use for this device:

# cfgadm | grep c5t7d0
sata3/7::dsk/c5t7d0     disk     connected     configured     ok

And offline it with:

# cfgadm -c unconfigure sata3/7

Verify that it is now offline with:

# cfgadm | grep sata3/7
sata3/7     disk     connected     unconfigured     ok

Now remove and replace the disk. Bring the disk online and check its status with:

# cfgadm -c configure sata3/7
# cfgadm | grep sata3/7
sata3/7::dsk/c5t7d0     disk     connected     configured     ok

Bring the disk back into the ZFS pool. You will get a warning:

# zpool online splash c5t7d0
warning: device 'c5t7d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

# zpool replace splash c5t7d0

You will now see zpool status report that a resilver is in progress, with detail as follows:

          raidz2           DEGRADED     0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                c5t7d0s0/o UNAVAIL      0     0     0  corrupted data
                c5t7d0     ONLINE       0     0     0

Once the resilver finishes, run zpool status again and it should appear fine.

Note: I sometimes had to run zpool status twice to get an up-to-date status of the devices.
Thanks myxiplx for the info on replacing a faulted drive. I think the X4500 has LEDs to show drive status, so you can see which physical drive to pull and replace, but how does one know which physical disk to pull out when you just have a standard PC with drives plugged directly into on-motherboard SATA connectors -- i.e. with no status LEDs?
On Fri, 11 Apr 2008, Simon Breden wrote:
> Thanks myxiplx for the info on replacing a faulted drive. I think
> the X4500 has LEDs to show drive status, so you can see which
> physical drive to pull and replace, but how does one know which
> physical disk to pull out when you just have a standard PC with
> drives plugged directly into on-motherboard SATA connectors -- i.e.
> with no status LEDs?

This should be a wake-up call to make sure that this is all figured out in advance, before the hardware fails. If you were to format the drive for a traditional filesystem, you would need to know which one it was. Failure recovery should be no different, except for the fact that the machine may be down, the pressure is on, and the information you expected to use for recovery was on that machine. :-)

This is a case where it is worthwhile maintaining a folder (in paper form) which contains important recovery information for your machines. Open up the machine in advance and put sticky labels on the drives with their device names.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,     http://www.GraphicsMagick.org/
Thanks Bob, that's good advice.

So, before I open my case: I've currently got 3 SATA drives, all the same model, so how do I know which one is plugged into which SATA connector on the motherboard? Is there a command I can issue which gives identifying info that includes the disk id AND the SATA connector number it is plugged into?

If I type 'format' I get the following info:

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 20007 alt 2 hd 255 sec 63>
          /pci at 0,0/pci-ide at 4/ide at 0/cmdk at 0,0
       1. c1t0d0 <ATA-WDC WD7500AAKS-0-4G30-698.64GB>
          /pci at 0,0/pci1043,8239 at 5/disk at 0,0
       2. c1t1d0 <ATA-WDC WD7500AAKS-0-4G30-698.64GB>
          /pci at 0,0/pci1043,8239 at 5/disk at 1,0
       3. c2t0d0 <ATA-WDC WD7500AAKS-0-4G30-698.64GB>
          /pci at 0,0/pci1043,8239 at 5,1/disk at 0,0
Specify disk (enter its number): ^C
#

Disks 1, 2 and 3 form my RAIDZ1 pool, but I don't see info relating to the SATA connector number (1 to 6, or 0 to 5 perhaps, as I have 6 onboard SATA connectors on the motherboard). And once a disk id (e.g. c1t0d0) is assigned to a disk, is it guaranteed never to change?
To answer my own question, I might have found the answer:

# cfgadm -al
Ap_Id                 Type        Receptacle   Occupant      Condition
sata0/0::dsk/c1t0d0   disk        connected    configured    ok
sata0/1::dsk/c1t1d0   disk        connected    configured    ok
sata1/0::dsk/c2t0d0   disk        connected    configured    ok
sata1/1               sata-port   empty        unconfigured  ok
sata2/0               sata-port   empty        unconfigured  ok
sata2/1               sata-port   empty        unconfigured  ok

It appears as if these SATA ids 0/0, 0/1 and 1/0 that are in use almost certainly follow the SATA connector numbering on the motherboard for my 6 SATA ports. I guess it probably maps out like this:

SATA conn #   cfgadm #   current disk id
1             0/0        c1t0d0
2             0/1        c1t1d0
3             1/0        c2t0d0
4             1/1        empty
5             2/0        empty
6             2/1        empty
So, for a general-purpose fileserver using standard SATA connectors on the motherboard, with no drive status LEDs for each drive, and using the info above from myxiplx, this faulty-drive replacement routine should work in the event that a drive fails. (I have copied and pasted the example from myxiplx and made a few changes for my array/drive ids.)

---------------------------

- Have a cron task do a 'zpool status pool' periodically and email you if it detects a 'FAULTED' status using grep (a minimal sketch of such a script follows this message).

- When you see the email, see which drive is faulted from the email text grepped from doing a 'zpool status pool | grep FAULTED' -- e.g. c1t1d0.

- Offline the drive with:

# zpool offline pool c1t1d0

- Then identify the SATA controller that maps to this drive by running:

# cfgadm | grep Ap_Id ; cfgadm | grep c1t1d0
Ap_Id                 Type   Receptacle   Occupant     Condition
sata0/1::dsk/c1t1d0   disk   connected    configured   ok
#

And offline it with:

# cfgadm -c unconfigure sata0/1

Verify that it is now offline with:

# cfgadm | grep sata0/1
sata0/1     disk     connected     unconfigured     ok

Now remove and replace the disk. For my motherboard (M2N-SLI Deluxe), SATA controller 0/1 maps to "SATA 1" in the manual -- i.e. SATA connector #1.

Bring the disk online and check its status with:

# cfgadm -c configure sata0/1
# cfgadm | grep sata0/1
sata0/1::dsk/c1t1d0     disk     connected     configured     ok

Bring the disk back into the ZFS pool. You will get a warning:

# zpool online pool c1t1d0
warning: device 'c1t1d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

# zpool replace pool c1t1d0

You will now see zpool status report that a resilver is in progress, with detail as follows (example from myxiplx's array; resilvering is the process whereby ZFS recreates the data on the new disk from redundant data: data held on the other drives in the array plus parity data):

          raidz2           DEGRADED     0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                c5t7d0s0/o UNAVAIL      0     0     0  corrupted data
                c5t7d0     ONLINE       0     0     0

Once the resilver finishes, run zpool status again and it should appear fine -- i.e. the array and drives marked as ONLINE and no errors shown.

Note: I sometimes had to run zpool status twice to get an up-to-date status of the devices.

---------------------------

Now I need to print out this info and keep it safe for the time when a drive fails. Also, I should print out the SATA connector mapping for each drive currently in my array, in case I'm unable to for any reason later.
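For completeness, here is a minimal sketch of the cron check mentioned in the first step above. It is only an illustration: the pool name, script path and email address are placeholders, not taken from this thread.

#!/bin/sh
# Warn by email if any device in the pool is not healthy.
POOL=pool
if zpool status $POOL | egrep 'FAULTED|DEGRADED|UNAVAIL' > /dev/null 2>&1; then
    zpool status $POOL | mailx -s "zpool $POOL needs attention" admin@example.com
fi

Saved as, say, /usr/local/bin/check_zpool.sh and run from root's crontab, e.g. every 15 minutes:

0,15,30,45 * * * * /usr/local/bin/check_zpool.sh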
I'm a bit late replying to this, but I'd take the quick-and-dirty approach personally. When the server is running fine, unplug one disk and just see which one is reported as faulty in ZFS. A couple of minutes doing that and you've tested that your RAID array is working fine, and you know exactly which disk is which -- no guesswork involved :)