If it's of interest, I've written up some articles on my experiences of building a ZFS NAS box, which you can read here: http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/

I used CIFS to share the filesystems, but it will be a simple matter to use NFS instead: issue the command 'zfs set sharenfs=on pool/filesystem' instead of 'zfs set sharesmb=on pool/filesystem'.

Hope it helps.

Simon

Originally posted to answer someone's request for info in storage-discuss.
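For readers new to ZFS sharing, a minimal sketch of the NFS variant Simon describes (the pool, filesystem and device names here are placeholders, not taken from his setup):

# zpool create tank mirror c1t0d0 c1t1d0
# zfs create tank/media
# zfs set sharenfs=on tank/media
# zfs get sharenfs tank/media

Clients then mount the filesystem at its default share path, e.g. server:/tank/media.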
Fascinating read, thanks Simon! I have been using ZFS in a production data center for a while now, but it never occurred to me to use iSCSI with ZFS as well. This gives me some ideas on how to back up our mail pools onto some older, slower disks offsite.

I find it interesting that while a local ZFS pool becoming unavailable will panic the system, losing access to iSCSI may not carry the same penalty. Not sure if it's a bug or a feature, but when I rebooted the target system, the initiator system stayed up and did not panic.
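In case it helps anyone trying the same offsite-backup idea, a rough sketch of one way to do it with the Solaris iSCSI target of this era (all names, sizes and addresses below are made up for illustration):

On the old backup box (target):

# zfs create -V 500g backup/mailvol
# zfs set shareiscsi=on backup/mailvol

On the mail server (initiator):

# iscsiadm modify discovery --sendtargets enable
# iscsiadm add discovery-address 192.168.10.50:3260
# devfsadm -i iscsi
# zpool create mailbackup c3t<GUID>d0      (use the device name that appears in 'format' after discovery)
# zfs snapshot mailpool/data@today
# zfs send mailpool/data@today | zfs receive mailbackup/data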
Thanks a lot, glad you liked it :) Yes, I agree: using older, slower disks in this way for backups seems a nice way to reuse old kit for something useful.

There's one nasty problem I've seen with making a pool from an iSCSI disk hosted on a different machine: if you turn off the hosting machine and then shut down the machine using the iSCSI disk in its pool, it takes ages to shut down. It seems to keep trying for a long time to reach the iSCSI disk, and obviously it can't. I think there's a bug report for this, and I thought it was fixed, but as of SXCE build 85 it seems not, as I saw the problem occur again yesterday. The workaround is to do a 'zpool export pool_importing_iSCSI_disks' before shutting down the machine; it will then shut down normally without trying to connect to the iSCSI target(s). More info here:
http://www.opensolaris.org/jive/thread.jspa?messageID=196459

This guy seems to have had lots of fun with iSCSI :)
http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html
http://web.ivy.net/~carton/oneNightOfWork/20071204-zfsnotes.txt

I wonder how many of his problems were due to using a non-Solaris iSCSI target? My experience of mixing iSCSI targets and initiators from different OSs was not very good, but I didn't do very much with it.
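A minimal sketch of that workaround, assuming a hypothetical pool named 'itank' built on the remote iSCSI disk:

# zpool export itank
# init 5

and, after the iSCSI host is back up and the machine has been rebooted:

# zpool import itank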
> This guy seems to have had lots of fun with iSCSI :)
> http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html

This is scaring the heck out of me. I have a project to create a zpool mirror out of two iSCSI targets, and if the failure of one of them will panic my system, that will be totally unacceptable. What's the point of having an HA mirror if one side can't fail without busting the host? Is it really true, as the guy at the above link states (please read the link, sorry), that when one iSCSI mirror goes offline, the initiator system will panic? Or even worse, not boot itself cleanly after such a panic? How could this be? Anyone else have experience with iSCSI-based ZFS mirrors?

Thanks,

Jon
On Sat, Apr 5, 2008 at 12:25 AM, Jonathan Loran <jloran at ssl.berkeley.edu> wrote:
> This is scaring the heck out of me. I have a project to create a zpool
> mirror out of two iSCSI targets, and if the failure of one of them will
> panic my system, that will be totally unacceptable. What's the point of
> having an HA mirror if one side can't fail without busting the host? Is
> it really true, as the guy at the above link states (please read the
> link, sorry), that when one iSCSI mirror goes offline, the initiator
> system will panic? Or even worse, not boot itself cleanly after such a
> panic? How could this be? Anyone else have experience with iSCSI-based
> ZFS mirrors?

Crazy question here... but has anyone tried this with, say, a QLogic hardware iSCSI card? Seems like it would solve all your issues. Granted, they aren't free like the software stack, but if you're trying to set up an HA solution, the ~$800 price tag per card seems pretty darn reasonable to me.
Will Murnane
2008-Apr-05 15:32 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
On Sat, Apr 5, 2008 at 5:25 AM, Jonathan Loran <jloran at ssl.berkeley.edu> wrote:
> This is scaring the heck out of me. I have a project to create a zpool
> mirror out of two iSCSI targets, and if the failure of one of them will
> panic my system, that will be totally unacceptable.

I haven't tried this myself, but perhaps the "failmode" property of ZFS will solve this?

Will
If you have a mirrored iSCSI zpool, it will NOT panic when one of the submirrors is unavailable. 'zpool status' will hang for some time, but after (I think) 300 seconds it will mark the device as unavailable.

The panic was the default behaviour in the past, and it only occurs if all devices are unavailable. Since (I think) build 77 there is a new zpool property, failmode, which you can set to prevent a panic:

     failmode=wait | continue | panic

         Controls the system behavior in the event of catastrophic
         pool failure. This condition is typically a result of a
         loss of connectivity to the underlying storage device(s)
         or a failure of all devices within the pool. The behavior
         of such an event is determined as follows:

         wait        Blocks all I/O access until the device
                     connectivity is recovered and the errors are
                     cleared. This is the default behavior.

         continue    Returns EIO to any new write I/O requests but
                     allows reads to any of the remaining healthy
                     devices. Any write requests that have yet to
                     be committed to disk would be blocked.

         panic       Prints out a message to the console and
                     generates a system crash dump.
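For reference, checking and changing the property is a one-liner on builds that have it (the pool name here is hypothetical):

# zpool get failmode tank
# zpool set failmode=continue tank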
Vincent Fox
2008-Apr-05 20:12 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
I don't think ANY situation in which you are mirrored and one half of the mirror pair becomes unavailable will panic the system. At least this has been the case when I've tested with local storage; I haven't tried with iSCSI yet, but will give it a whirl.

I had a simple single ZVOL shared over iSCSI, and thus no redundancy, and bringing down the target system didn't crash the initiator. And this is with Solaris 10u4, not even the latest OpenSolaris. Well, okay, if I'm logged onto the initiator and in the directory for the pool at the time I bring down the target, my shell gets hung. But it hasn't panicked. I will wait a good 15 minutes to make sure of this and post some failure-mode results later this evening.
Vincent Fox
2008-Apr-05 20:31 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
Followup, my initiator did eventually panic.

I will have to do some setup to get a ZVOL from another system to mirror with, and see what happens when one of them goes away. Will post in a day or two on that.
Jonathan Loran
2008-Apr-06 06:30 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
kristof wrote:
> If you have a mirrored iSCSI zpool, it will NOT panic when one of the
> submirrors is unavailable. 'zpool status' will hang for some time, but
> after (I think) 300 seconds it will mark the device as unavailable.
>
> The panic was the default behaviour in the past, and it only occurs if
> all devices are unavailable. Since (I think) build 77 there is a new
> zpool property, failmode, which you can set to prevent a panic:
>
> [failmode property description snipped -- see kristof's message above]

This is encouraging, but one problem: our system is on Solaris 10 U4. Will this guy be immune to panics when one side of the mirror goes down? Seriously, I'm tempted to upgrade this box to OS b8? However, there are a lot of dependencies which we need to worry about in doing that -- for example, will all our off-the-shelf software run with OpenSolaris? More things to test.

Thanks,

Jon
Jonathan Loran
2008-Apr-06 06:42 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
Vincent Fox wrote:
> Followup, my initiator did eventually panic.
>
> I will have to do some setup to get a ZVOL from another system to
> mirror with, and see what happens when one of them goes away. Will
> post in a day or two on that.

On Sol 10 U4, I could have told you that. A few weeks back, I was bone-headed and took down a target with a completely idle zpool on it. The initiator system eventually panicked -- when I brought the target back up! But this pool wasn't mirrored. I'm hoping I can set up a mirror of iSCSI targets and get all the benefits of HA.

BTW Vincent: thanks for doing all my testing for me ;) Seriously, I'm throwing together a test setup of my own on Monday. Need to be sure this will work.

Jon
Richard Elling
2008-Apr-06 16:33 UTC
[zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup
Jonathan Loran wrote:
> Vincent Fox wrote:
>> Followup, my initiator did eventually panic.
>>
>> I will have to do some setup to get a ZVOL from another system to
>> mirror with, and see what happens when one of them goes away. Will
>> post in a day or two on that.
>
> On Sol 10 U4, I could have told you that. A few weeks back, I was
> bone-headed and took down a target with a completely idle zpool on it.
> The initiator system eventually panicked -- when I brought the target
> back up! But this pool wasn't mirrored. I'm hoping I can set up a
> mirror of iSCSI targets and get all the benefits of HA.

This is the expected behaviour for an unprotected Solaris 10 u4 setup.
 -- richard
To repeat what some others have said: yes, Solaris seems to handle an iSCSI device going offline, in that it doesn't panic and continues working once everything has timed out.

However, that doesn't necessarily mean it's ready for production use. ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client to time out. Now I don't know about you, but HA to me doesn't mean "Highly Available, but with occasional 3 minute breaks". Most of the client applications we would want to run on ZFS would be broken by a 3 minute delay returning data, and this was enough for us to give up on ZFS over iSCSI for now.
On Mon, Apr 07, 2008 at 01:06:34AM -0700, Ross wrote:
> To repeat what some others have said: yes, Solaris seems to handle
> an iSCSI device going offline, in that it doesn't panic and
> continues working once everything has timed out.
>
> However, that doesn't necessarily mean it's ready for production use.
> ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client
> to time out. Now I don't know about you, but HA to me doesn't mean
> "Highly Available, but with occasional 3 minute breaks". Most of
> the client applications we would want to run on ZFS would be broken
> by a 3 minute delay returning data, and this was enough for us to
> give up on ZFS over iSCSI for now.

Doesn't this also happen with UFS on an iSCSI device? iSCSI is just a local disk. What would happen if a physical disk went offline?

We like the 3-minute delay because it gives us time to reboot the Netapp that provides storage on our iSCSI SAN without having to shut down all of the applications. Something has to happen when a disk goes offline. We also use Solaris multipathing with two independent network paths to the Netapp so that a network failure won't break iSCSI.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
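For anyone wanting to replicate Gary's two-path setup, a rough sketch only (the addresses are placeholders; on builds of this vintage MPxIO also has to be enabled for the iSCSI initiator, so check your release's documentation for the exact switch):

# iscsiadm add discovery-address 10.0.0.10:3260
# iscsiadm add discovery-address 10.0.1.10:3260
# mpathadm list lu        (each LUN should show two operational paths)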
Ross wrote:
> To repeat what some others have said: yes, Solaris seems to handle an
> iSCSI device going offline, in that it doesn't panic and continues
> working once everything has timed out.
>
> However, that doesn't necessarily mean it's ready for production use.
> ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client
> to time out. Now I don't know about you, but HA to me doesn't mean
> "Highly Available, but with occasional 3 minute breaks". Most of the
> client applications we would want to run on ZFS would be broken by a
> 3 minute delay returning data, and this was enough for us to give up
> on ZFS over iSCSI for now.

By default, the sd driver has a 60 second timeout with either 3 or 5 retries before timing out the I/O request. In other words, for the same failure mode in a DAS or SAN you will get the same behaviour.
 -- richard
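As an aside, and as a hedged sketch only (the thread notes further down that changing the defaults is rarely recommended): on Solaris 10-era systems the sd timeout Richard refers to is commonly adjusted via /etc/system, for example:

* lower the sd I/O timeout from the default 60 seconds to 30 (value in seconds, hex)
set sd:sd_io_time = 0x1e

followed by a reboot. Whether that is wise for a given configuration is another matter.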
On Mon, 7 Apr 2008, Ross wrote:
> However, that doesn't necessarily mean it's ready for production use.
> ZFS will hang for 3 minutes (180 seconds) waiting for the iSCSI client
> to time out. Now I don't know about you, but HA to me doesn't mean
> "Highly Available, but with occasional 3 minute breaks". Most of the
> client applications we would want to run on ZFS would be broken by a
> 3 minute delay returning data, and this was enough for us to give up
> on ZFS over iSCSI for now.

It seems to me that this is a problem with the iSCSI client timeout parameters rather than ZFS itself. Three minutes is sufficient for use over the "internet" but seems excessive on a LAN. Have you investigated to see if the iSCSI client timeout parameters can be adjusted?

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,     http://www.GraphicsMagick.org/
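A starting point for the investigation Bob suggests, assuming the stock Solaris software initiator, is simply to dump the parameters currently in effect; whether the 180-second session recovery value itself is tunable on a given build is something to confirm against that release's documentation:

# iscsiadm list initiator-node
# iscsiadm list target -v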
> Crazy question here... but has anyone tried this with, say, a QLogic
> hardware iSCSI card? Seems like it would solve all your issues.
> Granted, they aren't free like the software stack, but if you're trying
> to set up an HA solution, the ~$800 price tag per card seems pretty darn
> reasonable to me.

Not sure how this would help if one target fails. The card doesn't work any magic making the target always available. We are testing a QLA-4052C card; we believe QLogic tested it as installed on a Sun box, but not against Solaris iSCSI targets. An attempt to connect from this card *appears* to cause our iscsitgtd daemon to consume a great deal of CPU and memory. We're still trying to find out why.

CT
On Mon, Apr 7, 2008 at 10:40 AM, Christine Tran <Christine.Tran at sun.com> wrote:
> Not sure how this would help if one target fails. The card doesn't work
> any magic making the target always available. We are testing a QLA-4052C
> card; we believe QLogic tested it as installed on a Sun box, but not
> against Solaris iSCSI targets. An attempt to connect from this card
> *appears* to cause our iscsitgtd daemon to consume a great deal of CPU
> and memory. We're still trying to find out why.
>
> CT

How would it not help? From what I'm reading, there's a flag in the software iSCSI stack controlling how to react if a target is lost. This is completely bypassed if you use the hardware card. As far as the OS is concerned, it's just another SCSI disk.
Ross Smith wrote:
> Which again is unacceptable for network storage. If hardware raid
> controllers took over a minute to time out a drive, network admins would
> be in uproar. Why should software be held to a different standard?

You need to take a systems approach to analyzing these things. For example, how long does an array take to cold boot? When I was Chief Architect for Integrated Systems Engineering, we had a product which included a storage array and a server racked together. If you used the defaults and simulated a power-loss failure scenario, the whole thing fell apart. Why? Because the server cold booted much faster than the array. When Solaris started, it looked for the disks, found none because the array was still booting, and declared those disks dead. The result was that you needed system administrator intervention to get the services started again. Not acceptable. The solution was to delay the server boot to more closely match the array's boot time.

The default timeout values can be changed, but we rarely recommend it. You can get into all sorts of false failure modes with small timeouts. For example, most disks spec a 30 second spin-up time, so if your disk is spun down, perhaps for power savings, you need a timeout which is greater than 30 seconds by some margin. Similarly, if you have a CD-ROM hanging off the bus, you need a long timeout to accommodate the slow data access of a CD-ROM. I wrote a Sun BluePrint article discussing some of these issues a few years ago:
http://www.sun.com/blueprints/1101/clstrcomplex.pdf

> I can understand the driver being persistent if your data is on a
> single disk, however when you have any kind of redundant data, there
> is no need for these delays. And there should definitely not be
> delays in returning status information. Who ever heard of a hardware
> raid controller that takes 3 minutes to tell you which disk has gone bad?
>
> I can understand how the current configuration came about, but it
> seems to me that the design of ZFS isn't quite consistent. You do all
> this end-to-end checksumming to double check that data is consistent,
> because you don't trust the hardware, cables, or controllers not to
> corrupt data. Yet you trust that same equipment absolutely when it
> comes to making status decisions.
>
> It seems to me that you either trust the infrastructure or you don't,
> and the safest decision (as ZFS's integrity checking has shown) is not
> to trust it. ZFS would be better off assuming that drivers and
> controllers won't always return accurate status information, and having
> its own set of criteria to determine whether a drive (of any kind) is
> working as expected and returning responses in a timely manner.

I don't see any benefit for ZFS to add another set of timeouts over and above the existing timeouts. Indeed, we often want to delay any rash actions which would cause human intervention or prolonged recovery later. Sometimes patience is a virtue.
 -- richard
| Is it really true that as the guy on the above link states (please
| read the link, sorry) when one iSCSI mirror goes offline, the
| initiator system will panic? Or even worse, not boot itself cleanly
| after such a panic? How could this be? Anyone else with experience
| with iSCSI based ZFS mirrors?

Our experience with Solaris 10U4 and iSCSI targets is that Solaris only panics if the pool fails entirely (e.g., you lose both/all mirrors in a mirrored vdev). The fix for this is in current OpenSolaris builds, and we have been told by our Sun support people that it will (only) appear in Solaris 10 U6, apparently scheduled for sometime around fall.

My experience is that Solaris will normally recover after the panic and reboot, although failed ZFS pools will be completely inaccessible, as you'd expect. However, there are two gotchas:

* Under at least some circumstances, a completely inaccessible iSCSI target (as you might get with, e.g., a switch failure) will stall booting for a significant length of time (tens of minutes, depending on how many iSCSI disks you have on it).

* If a ZFS pool's storage is present but unwritable for some reason, Solaris 10 U4 will panic the moment it tries to bring the pool up; you will wind up stuck in a perpetual 'boot, panic, reboot, ...' cycle until you forcibly remove the storage entirely somehow.

The second issue is presumably fixed as part of the general fix for 'ZFS panics on pool failure', although we haven't tested it explicitly. I don't know if the first issue is fixed in current Nevada builds.

- cks
Just to report back to the list... sorry for the lengthy post.

So I've tested the iSCSI-based ZFS mirror on Sol 10u4, and it does more or less work as expected. If I unplug one side of the mirror -- unplug or power down one of the iSCSI targets -- I/O to the zpool stops for a while, perhaps a minute, and then things free up again. zpool commands seem to get unworkably slow, and error messages fly by on the console like fire ants running from a flood. Worst of all, after plugging the faulted mirror back in (before removing the mirror from the pool), it's very hard to bring the faulted device back online:

prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid. Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr 8 16:34:08 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t1d0  FAULTED      0 2.88K     0  corrupted data
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

>>>>>>>>> Comment: why are there now two instances of c2t1d0?? <<<<<<<<<<

prudhoe # zpool replace test c2t2d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).

prudhoe # zpool replace -f test c2t2d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).

prudhoe # zpool remove test c2t2d0
cannot remove c2t2d0: no such device in pool

prudhoe # zpool offline test c2t2d0
cannot offline c2t2d0: no such device in pool

prudhoe # zpool online test c2t2d0
cannot online c2t2d0: no such device in pool

>>>>>>>>>> OK, get more drastic <<<<<<<<<<<<<<

prudhoe # zpool clear test

prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid. Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr 8 16:34:08 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t1d0  FAULTED      0     0     0  corrupted data
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

>>>>>>>>>>>>>>>>>>>>> Frustration setting in. The error counts are zero, but still
two instances of c2t1d0 listed... <<<<<<<<<<<<<<<<

prudhoe # zpool export test
prudhoe # zpool import test

prudhoe # zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
test   12.9G  9.54G  3.34G  74%  ONLINE  -

prudhoe # zpool status
  pool: test
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 1.11% done, 0h20m to go
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

>>>>> Finally resilvering with the right devices.

The thing I really don't like here is that the pool had to be exported and then imported to make this work. For an NFS server, this is not really acceptable. Now I know this is ol' Solaris 10u4, but still, I'm surprised I needed to export/import the pool to get it working correctly again. Anyone know what I did wrong? Is there a canonical way to online the previously faulted device?

Anyway, it looks like for now I can get some sort of HA out of this iSCSI mirror. The other pluses are that the pool can self-heal, and reads will be spread across both units.

Cheers,

Jon

---

P.S. Playing with this more before sending this message: if you can detach the faulted mirror before putting it back online, it all works well. Hope that nothing bounces on your network when you have a failure:

---->>>> unplug one iSCSI mirror, then:

prudhoe # zpool status -v
  pool: test
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas
        exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: scrub completed with 0 errors on Wed Apr 9 14:18:45 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c2t2d0  UNAVAIL      4    91     0  cannot open
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

prudhoe # zpool detach test c2t2d0

prudhoe # zpool status -v
  pool: test
 state: ONLINE
 scrub: scrub completed with 0 errors on Wed Apr 9 14:18:45 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c2t1d0    ONLINE       0     0     0

errors: No known data errors

----->>>> replug the downed mirror, and:

prudhoe # zpool attach test c2t1d0 c2t2d0

prudhoe # zpool status -v
  pool: test
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.04% done, 2h17m to go
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0

errors: No known data errors

Voila!

Jon
Chris Siebenmann wrote:
> | What you're saying is independent of the iqn id?
>
> Yes. SCSI objects (including iSCSI ones) respond to specific SCSI
> INQUIRY commands with various 'VPD' pages that contain information about
> the drive/object, including serial number info.
>
> Some Googling turns up:
> http://wikis.sun.com/display/StorageDev/Solaris+OS+Disk+Driver+Device+Identifier+Generation
> http://www.bustrace.com/bustrace6/sas.htm
>
> Since you're using Linux IET as the target, you want to set the
> 'ScsiId' and 'ScsiSN' Lun parameters to unique (and different) values.
>
> (You can use sdparm, http://sg.torque.net/sg/sdparm.html, on Solaris
> to see exactly what you're currently reporting in the VPD data for each
> disk.)
>
> - cks

CC-ing the list, because this is of general interest...

Chris, indeed the older version of Open-E iSCSI I was using for my tests has no unique VPD identifiers whatsoever, so this could confuse the initiator:

prudhoe # sdparm -6 -i /devices/iscsi/disk at 0000iqn.2008-04%3Aiscsi-1.target10001,0:wd,raw
    /devices/iscsi/disk at 0000iqn.2008-04%3Aiscsi-1.target10001,0:wd,raw: IET  VIRTUAL-DISK  0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: IET
      vendor specific:

Whereas the new version of Open-E iSCSI (called iSCSI-R3) does. These are two LUNs from the system I will be doing a ZFS mirror on, running the new Open-E iSCSI-R3 on the target:

apollo # sdparm -i /devices/scsi_vhci/ssd at g695343534900000058424433517a6639707a71597273647a:wd,raw
    /devices/scsi_vhci/ssd at g695343534900000058424433517a6639707a71597273647a:wd,raw: iSCSI  DISK  0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: iSCSI
      vendor specific: XBD3Qzf9pzqYrsdz

apollo # sdparm -i /devices/scsi_vhci/ssd at g69534353490000005a6b6e43326c6257413579334d377636:wd,raw
    /devices/scsi_vhci/ssd at g69534353490000005a6b6e43326c6257413579334d377636:wd,raw: iSCSI  DISK  0
Device identification VPD page:
  Addressed logical unit:
    designator type: T10 vendor identification, code set: Binary
      vendor id: iSCSI
      vendor specific: ZknC2lbWA5y3M7v6

Open-E iSCSI-R3 generates a unique vendor-specific serial number, so the ZFS mirror will most likely fail and recover more cleanly.

Thanks for the pointers.

Jon
I had similar problems replacing a drive myself; it's not intuitive exactly which ZFS commands you need to issue to recover from a drive failure. I think your problems stemmed from using -f. Generally, if you have to use that, there's a step or option you've missed somewhere. However, I'm not 100% sure what command you should have used instead. Things I've tried in the past include:

# zpool replace test c2t2d0 c2t2d0

or

# zpool online test c2t2d0
# zpool replace test c2t2d0

I know I did a whole load of testing of various options to work out how to replace a drive in a test machine. I'm looking to see if I have any iSCSI notes around, but from memory, when I tested iSCSI I was also testing ZFS on a cluster, so my solution was to simply get the iSCSI devices working on the offline node, then fail over ZFS. It only took 2-3 seconds to fail ZFS over to the other node, and I suspect I used that solution because I couldn't work out how to get ZFS to correctly bring faulted iSCSI devices back online.

However, in case it helps, I do have the whole process for physical disks on a Sun X4500 documented:

# zpool offline splash c5t7d0

Now, find the controller in use for this device:

# cfgadm | grep c5t7d0
sata3/7::dsk/c5t7d0     disk     connected     configured     ok

And offline it with:

# cfgadm -c unconfigure sata3/7

Verify that it is now offline with:

# cfgadm | grep sata3/7
sata3/7     disk     connected     unconfigured     ok

Now remove and replace the disk. Bring the disk online and check its status with:

# cfgadm -c configure sata3/7
# cfgadm | grep sata3/7
sata3/7::dsk/c5t7d0     disk     connected     configured     ok

Bring the disk back into the ZFS pool. You will get a warning:

# zpool online splash c5t7d0
warning: device 'c5t7d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

# zpool replace splash c5t7d0

You will now see zpool status report that a resilver is in progress, with detail as follows:

          raidz2           DEGRADED     0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                c5t7d0s0/o UNAVAIL      0     0     0  corrupted data
                c5t7d0     ONLINE       0     0     0

Once the resilver finishes, run zpool status again and it should appear fine.

Note: I sometimes had to run zpool status twice to get an up-to-date status of the devices.
Thanks myxiplx for the info on replacing a faulted drive. I think the X4500 has LEDs to show drive status, so you can see which physical drive to pull and replace, but how does one know which physical disk to pull out when you just have a standard PC with drives plugged directly into on-motherboard SATA connectors -- i.e. with no status LEDs?
On Fri, 11 Apr 2008, Simon Breden wrote:
> Thanks myxiplx for the info on replacing a faulted drive. I think
> the X4500 has LEDs to show drive status, so you can see which
> physical drive to pull and replace, but how does one know which
> physical disk to pull out when you just have a standard PC with
> drives plugged directly into on-motherboard SATA connectors -- i.e.
> with no status LEDs?

This should be a wake-up call to make sure that this is all figured out in advance, before the hardware fails. If you were to format the drive for a traditional filesystem, you would need to know which one it was. Failure recovery should be no different, except for the fact that the machine may be down, the pressure is on, and the information you expected to use for recovery was on that machine. :-)

This is a case where it is worthwhile maintaining a folder (in paper form) which contains important recovery information for your machines. Open up the machine in advance and put sticky labels on the drives with their device names.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,     http://www.GraphicsMagick.org/
Thanks Bob, that's good advice.

So, before I open my case: I've currently got 3 SATA drives, all the same model, so how do I know which one is plugged into which SATA connector on the motherboard? Is there a command I can issue which gives identifying info that includes the disk id AND the SATA connector number it is plugged into?

If I type 'format' I get the following info:

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 20007 alt 2 hd 255 sec 63>
          /pci at 0,0/pci-ide at 4/ide at 0/cmdk at 0,0
       1. c1t0d0 <ATA-WDC WD7500AAKS-0-4G30-698.64GB>
          /pci at 0,0/pci1043,8239 at 5/disk at 0,0
       2. c1t1d0 <ATA-WDC WD7500AAKS-0-4G30-698.64GB>
          /pci at 0,0/pci1043,8239 at 5/disk at 1,0
       3. c2t0d0 <ATA-WDC WD7500AAKS-0-4G30-698.64GB>
          /pci at 0,0/pci1043,8239 at 5,1/disk at 0,0
Specify disk (enter its number): ^C
#

Disks 1, 2 and 3 form my RAIDZ1 pool, but I don't see info relating to the SATA connector number (1 to 6, or 0 to 5 perhaps, as I have 6 onboard SATA connectors on the motherboard). And once a disk id (e.g. c1t0d0) is assigned to a disk, is it guaranteed never to change?
To answer my own question, I might have found the answer:

# cfgadm -al
Ap_Id                 Type        Receptacle   Occupant      Condition
sata0/0::dsk/c1t0d0   disk        connected    configured    ok
sata0/1::dsk/c1t1d0   disk        connected    configured    ok
sata1/0::dsk/c2t0d0   disk        connected    configured    ok
sata1/1               sata-port   empty        unconfigured  ok
sata2/0               sata-port   empty        unconfigured  ok
sata2/1               sata-port   empty        unconfigured  ok

It appears as if these SATA ids 0/0, 0/1 and 1/0 that are in use almost certainly follow the SATA connector numbering on the motherboard for my 6 SATA ports. I guess it probably maps out like this:

SATA conn #   cfgadm #   current disk id
1             0/0        c1t0d0
2             0/1        c1t1d0
3             1/0        c2t0d0
4             1/1        empty
5             2/0        empty
6             2/1        empty
So, for a general-purpose fileserver using standard SATA connectors on the motherboard, with no drive status LEDs for each drive, and using the info above from myxiplx, this faulty-drive replacement routine should work in the event that a drive fails. (I have copied and pasted the example from myxiplx and made a few changes for my array/drive ids.)

---------------------------

- Have a cron task do a 'zpool status pool' periodically and email you if it detects a 'FAULTED' status using grep (a minimal sketch of such a script follows this message).

- When you see the email, see which drive is faulted from the email text grepped from doing a 'zpool status pool | grep FAULTED' -- e.g. c1t1d0.

- Offline the drive with:

# zpool offline pool c1t1d0

- Then identify the SATA controller that maps to this drive by running:

# cfgadm | grep Ap_Id ; cfgadm | grep c1t1d0
Ap_Id                 Type   Receptacle   Occupant     Condition
sata0/1::dsk/c1t1d0   disk   connected    configured   ok
#

And offline it with:

# cfgadm -c unconfigure sata0/1

Verify that it is now offline with:

# cfgadm | grep sata0/1
sata0/1     disk     connected     unconfigured     ok

Now remove and replace the disk. For my motherboard (M2N-SLI Deluxe), SATA controller 0/1 maps to "SATA 1" in the manual -- i.e. SATA connector #1.

Bring the disk online and check its status with:

# cfgadm -c configure sata0/1
# cfgadm | grep sata0/1
sata0/1::dsk/c1t1d0     disk     connected     configured     ok

Bring the disk back into the ZFS pool. You will get a warning:

# zpool online pool c1t1d0
warning: device 'c1t1d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

# zpool replace pool c1t1d0

You will now see zpool status report that a resilver is in progress, with detail as follows (example from myxiplx's array; resilvering is the process whereby ZFS recreates the data on the new disk from redundant data: data held on the other drives in the array plus parity data):

          raidz2           DEGRADED     0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                c5t7d0s0/o UNAVAIL      0     0     0  corrupted data
                c5t7d0     ONLINE       0     0     0

Once the resilver finishes, run zpool status again and it should appear fine -- i.e. the array and drives marked as ONLINE and no errors shown.

Note: I sometimes had to run zpool status twice to get an up-to-date status of the devices.

---------------------------

Now I need to print out this info and keep it safe for the time when a drive fails. Also, I should print out the SATA connector mapping for each drive currently in my array, in case I'm unable to for any reason later.
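For completeness, here is a minimal sketch of the cron check mentioned in the first step above. It is only an illustration: the pool name, script path and email address are placeholders, not taken from this thread.

#!/bin/sh
# Warn by email if any device in the pool is not healthy.
POOL=pool
if zpool status $POOL | egrep 'FAULTED|DEGRADED|UNAVAIL' > /dev/null 2>&1; then
    zpool status $POOL | mailx -s "zpool $POOL needs attention" admin@example.com
fi

Saved as, say, /usr/local/bin/check_zpool.sh and run from root's crontab, e.g. every 15 minutes:

0,15,30,45 * * * * /usr/local/bin/check_zpool.sh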
I'm a bit late replying to this, but I'd take the quick-and-dirty approach personally. When the server is running fine, unplug one disk and just see which one is reported as faulty in ZFS. A couple of minutes doing that and you've tested that your RAID array is working fine, and you know exactly which disk is which -- no guesswork involved :)