Has anybody here got any thoughts on how to resolve this problem: http://www.opensolaris.org/jive/thread.jspa?messageID=261204&tstart=0

It sounds like two of us have been affected by this now, and it's a bit of a nuisance having your entire server hang when a drive is removed; it makes you worry about how Solaris would handle a drive failure.

Has anybody tried pulling a drive on a live Thumper? Surely they don't hang like this? Having said that, I do remember they have a great big warning in the manual about using cfgadm to stop the disk before removal:

"Caution - You must follow these steps before removing a disk from service. Failure to follow the procedure can corrupt your data or render your file system inoperable."

Ross
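For reference, the procedure that warning refers to looks roughly like this (a sketch only; the pool name tank and attachment point sata1/7 are placeholders, so check your own Ap_Ids with cfgadm first):

# cfgadm                              # list attachment points and find the Ap_Id of the disk to remove
# zpool offline tank c2t7d0           # if the disk is part of a redundant pool, offline it in ZFS first
# cfgadm -c unconfigure sata1/7       # stop the disk so it can be removed safely
# cfgadm                              # confirm the port shows "unconfigured" before pulling the drive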
I've discovered this as well - b81 to b93 (latest I've tried). I switched from my on-board SATA controller to AOC-SAT2-MV8 cards because the MCP55 controller caused random disk hangs. Now the SAT2-MV8 works as long as the drives are working correctly, but the system can't handle a drive failure or disconnect. :(

I don't think there's a bug filed for it. That would probably be the first step to getting this resolved (might also post to storage-discuss).

-- Dave

Ross wrote:
> Has anybody here got any thoughts on how to resolve this problem:
> http://www.opensolaris.org/jive/thread.jspa?messageID=261204&tstart=0
> [snip]
Yeah, I thought of the storage forum today and found somebody else with the problem, and since my post a couple of people have reported similar issues on Thumpers. I guess the storage thread is the best place for this now: http://www.opensolaris.org/jive/thread.jspa?threadID=42507&tstart=0
Ok, after doing a lot more testing of this I've found it's not the Supermicro controller causing problems. It's purely ZFS, and it causes some major problems! I've even found one scenario that appears to cause huge data loss without any warning from ZFS - up to 30,000 files and 100MB of data missing after a reboot, with ZFS reporting that the pool is OK.

***********************************************************************
1. Solaris handles USB and SATA hot plug fine

If disks are not in use by ZFS, you can unplug USB or SATA devices and cfgadm will recognise the disconnection. USB devices are recognised automatically as you reconnect them; SATA devices need reconfiguring. Cfgadm even recognises the SATA device as an empty bay:

# cfgadm
Ap_Id            Type         Receptacle   Occupant     Condition
sata1/7          sata-port    empty        unconfigured ok
usb1/3           unknown      empty        unconfigured ok

-- insert devices --

# cfgadm
Ap_Id            Type         Receptacle   Occupant     Condition
sata1/7          disk         connected    unconfigured unknown
usb1/3           usb-storage  connected    configured   ok

To bring the SATA drive online it's just a case of running:

# cfgadm -c configure sata1/7

***********************************************************************
2. If ZFS is using a hot plug device, disconnecting it will hang all ZFS status tools.

While pools remain accessible, any attempt to run "zpool status" will hang. I don't know if there is any way to recover these tools once this happens. While this is a pretty big problem in itself, it also makes me worry whether other types of error could have the same effect. I see potential for this leaving a server in a state whereby you know there are errors in a pool, but have no way of finding out what those errors might be without rebooting the server.

***********************************************************************
3. Once ZFS status tools are hung, the computer will not shut down.

The only way I've found to recover from this is to physically power down the server. The Solaris shutdown process simply hangs.

***********************************************************************
4. While reading an offline disk causes errors, writing does not!
*** CAUSES DATA LOSS ***

This is a big one: ZFS can continue writing to an unavailable pool. It doesn't always generate errors (I've seen it copy over 100MB before erroring), and if not spotted, this *will* cause data loss after you reboot.

I discovered this while testing how ZFS coped with the removal of a hot plug SATA drive. I knew that the ZFS admin tools were hanging, but that redundant pools remained available. I wanted to see whether it was just the ZFS admin tools that were failing, or whether ZFS was also failing to send appropriate error messages back to the OS.

These are the tests I carried out:

Zpool: Single drive zpool, consisting of one 250GB SATA drive in a hot plug bay.
Test data: A folder tree containing 19,160 items, 71.1MB in total.

TEST1: Opened File Browser, copied the test data to the pool. Half way through the copy I pulled the drive. THE COPY COMPLETED WITHOUT ERROR. Zpool list reports the pool as online, however zpool status hung as expected.

Not quite believing the results, I rebooted and tried again.

TEST2: Opened File Browser, copied the data to the pool. Pulled the drive half way through. The copy again finished without error. Checking the properties shows 19,160 files in the copy. ZFS list again shows the filesystem as ONLINE.

Now I decided to see how many files I could copy before it errored. I started the copy again. File Browser managed a further 9,171 files before it stopped. That's nearly 30,000 files before any error was detected. Again, despite the copy having finally errored, zpool list shows the pool as online, even though zpool status hangs.

I rebooted the server, and found that after the reboot my first copy contains just 10,952 items, and my second copy is completely missing. That's a loss of almost 20,000 files. Zpool status however reports NO ERRORS.

For the third test I decided to see if these files are actually accessible before the reboot:

TEST3: This time I pulled the drive *before* starting the copy. The copy started much slower this time and only got to 2,939 files before reporting an error. At this point I copied all the files that had made it onto the test pool over to another pool, and then rebooted.

After the reboot, the folder in the test pool had disappeared completely, but the copy I took before rebooting was fine and contains 2,938 items, approximately 12MB of data. Again, zpool status reports no errors.

Further tests revealed that reading the pool results in an error almost immediately. Writing to the pool appears very inconsistent.

This is a huge problem. Data can be written without error, and is still served to users. It is only later on that the server will begin to issue errors, but at that point the ZFS admin tools are useless. The only possible recovery is a server reboot, which will lose recent data written to the pool, and will do so without any warning at all from ZFS.

Needless to say I have a lot less faith in ZFS's error checking after having seen it lose 30,000 files without error.

***********************************************************************
5. If you are using CIFS and pull a drive from the volume, the whole server hangs!

This appears to be the original problem I found. While ZFS doesn't handle drive removal well, the combination of ZFS and CIFS is worse. If you pull a drive from a ZFS pool (redundant or not) which is serving CIFS data, the entire server freezes until you re-insert the drive.

Note that ZFS itself does not recover after the drive is inserted; admin tools will still hang. However the re-insertion of the drive is enough to unfreeze the server.

Of course, you still need a physical reboot to get your ZFS admin tools back, but in the meantime data is accessible again.
Mattias Pantzare
2008-Jul-28 16:39 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> 4. While reading an offline disk causes errors, writing does not!
> *** CAUSES DATA LOSS ***
>
> This is a big one: ZFS can continue writing to an unavailable pool. It doesn't always generate errors (I've seen it copy over 100MB
> before erroring), and if not spotted, this *will* cause data loss after you reboot.
> [snip]

This is not unique to ZFS. If you need to know that your writes have reached stable storage, you have to call fsync(). It is not enough to close a file.

This is true even for UFS, but UFS won't delay writes for all operations, so you will notice faster. You will still lose data, though. I have been able to undo rm -rf / on a FreeBSD system by pulling the power cord before it wrote the changes...

Databases call fsync() (or similar) before they close a transaction; that is one of the reasons databases like hardware write caches. cp does not.
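To make that concrete at the shell level: a cp that "succeeds" only means the data reached the cache. A rough sketch along these lines (the paths are the ones from Ross's tests, and whether sync/lockfs surface the failure promptly on ZFS is exactly what's in question in this thread) at least forces a flush before you trust the copy:

# cp -r /rc-pool/copytest /test/copytest   # cp can return success with the data still only in memory
# sync                                     # schedule a flush of dirty data to disk
# lockfs -f /test                          # flush and wait; an error or a hang here is your first real clue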
Bob Friesenhahn
2008-Jul-28 18:03 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
On Mon, 28 Jul 2008, Ross wrote:
> TEST1: Opened File Browser, copied the test data to the pool.
> Half way through the copy I pulled the drive. THE COPY COMPLETED
> WITHOUT ERROR. Zpool list reports the pool as online, however zpool
> status hung as expected.

Are you sure that this reference software you call "File Browser" actually responds to errors? Maybe it is typical Linux-derived software which does not check for or handle errors, and ZFS is reporting errors all along while the program pretends to copy the lost files. If you were using Microsoft Windows, its file browser would probably report "Unknown error: 666", but at least you would see an error dialog and you could visit the Microsoft knowledge base to learn that message ID 666 means "Unknown error". The other possibility is that all of these files fit in the ZFS write cache, so the error reporting is delayed.

The DTrace Toolkit provides a very useful DTrace script called 'errinfo' which will list every system call that returns an error. This is very useful and informative. If you run it, you will see every error reported to the application level.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
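In practice, running errinfo in a second terminal while repeating the copy test would show whether errors are actually reaching the application. A sketch (assuming the DTraceToolkit is installed under /opt/DTT; adjust the path to wherever the toolkit lives on your system):

# /opt/DTT/errinfo                     # prints each system call that returns an error, as it happens
# /opt/DTT/errinfo | grep nautilus     # or watch just the file manager process during the copy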
Ross Smith
2008-Jul-28 18:09 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
"File Browser" is the name of the program that Solaris opens when you open "Computer" on the desktop. It''s the default graphical file manager. It does eventually stop copying with an error, but it takes a good long while for ZFS to throw up that error, and even when it does, the pool doesn''t report any problems at all.> Date: Mon, 28 Jul 2008 13:03:24 -0500> From: bfriesen at simple.dallas.tx.us> To: myxiplx at hotmail.com> CC: zfs-discuss at opensolaris.org> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed> > On Mon, 28 Jul 2008, Ross wrote:> >> > TEST1: Opened File Browser, copied the test data to the pool. > > Half way through the copy I pulled the drive. THE COPY COMPLETED > > WITHOUT ERROR. Zpool list reports the pool as online, however zpool > > status hung as expected.> > Are you sure that this reference software you call "File Browser" > actually responds to errors? Maybe it is typical Linux-derived > software which does not check for or handle errors and ZFS is > reporting errors all along while the program pretends to copy the lost > files. If you were using Microsoft Windows, its file browser would > probably report "Unknown error: 666" but at least you would see an > error dialog and you could visit the Microsoft knowledge base to learn > that message ID 666 means "Unknown error". The other possibility is > that all of these files fit in the ZFS write cache so the error > reporting is delayed.> > The Dtrace Toolkit provides a very useful DTrace script called > ''errinfo'' which will list every system call which reports and error. > This is very useful and informative. If you run it, you will see > every error reported to the application level.> > Bob> ======================================> Bob Friesenhahn> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/>_________________________________________________________________ Invite your Facebook friends to chat on Messenger http://clk.atdmt.com/UKM/go/101719649/direct/01/ -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080728/7ccd448c/attachment.html>
Ross Smith
2008-Jul-28 18:10 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
snv_91. I downloaded snv_94 today so I'll be testing with that tomorrow.

> Date: Mon, 28 Jul 2008 09:58:43 -0700
> From: Richard.Elling at Sun.COM
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>
> Which OS and revision?
> -- richard
> [snip]
Ross Smith
2008-Jul-28 18:41 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Heh, sounds like there are a few problems with that tool then. I guess that's one of the benefits of me being so new to Solaris. I'm still learning all the command line tools, so I'm playing with the graphical stuff as much as possible. :)

Regarding the delay, I plan to have a go tomorrow and see just how much of a delay there can be. I've definitely had the system up for 10 minutes still reading data that's going to disappear on reboot, and I suspect I can stretch it a lot longer than that.

The biggest concern for me with the delay is that the data appears fine to all intents and purposes. You can read it off the pool and copy it elsewhere. There doesn't seem to be any indication that it's going to disappear after a reboot.

> Date: Mon, 28 Jul 2008 13:35:21 -0500
> From: bfriesen at simple.dallas.tx.us
> Subject: RE: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>
> On Mon, 28 Jul 2008, Ross Smith wrote:
> > "File Browser" is the name of the program that Solaris opens when
> > you open "Computer" on the desktop. It's the default graphical file
> > manager.
>
> Got it. I have brought it up once or twice. I tend to distrust such
> tools since I am not sure if their implementation is sound. In fact,
> usually it is not.
>
> Now that you mention this tool, I am going to see what happens when it
> enters my test directory containing a million files. Hmmm, this turd
> says "Loading" and I see that system error messages are scrolling by
> as fast as dtrace can report them:
>
> nautilus ioctl 25 Inappropriate ioctl for device
> nautilus acl 89 Unsupported file system operation
> nautilus ioctl 25 Inappropriate ioctl for device
> nautilus acl 89 Unsupported file system operation
> nautilus ioctl 25 Inappropriate ioctl for device
>
> We shall see if it crashes or if it eventually returns. Ahhh, it has
> returned and declared that my directory with a million files is
> "(Empty)". So much for a short stint of trusting this tool.
>
> > It does eventually stop copying with an error, but it takes a good
> > long while for ZFS to throw up that error, and even when it does,
> > the pool doesn't report any problems at all.
>
> The delayed error report may be ok, but the pool not reporting a
> problem does not seem very ok.
>
> Bob
Miles Nordin
2008-Jul-29 01:24 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>>>>> "mp" == Mattias Pantzare <pantzer at ludd.ltu.se> writes:>> This is a big one: ZFS can continue writing to an unavailable >> pool. It doesn''t always generate errors (I''ve seen it copy >> over 100MB before erroring), and if not spotted, this *will* >> cause data loss after you reboot. mp> This is not unique for zfs. If you need to know that your mp> writes has reached stable store you have to call fsync(). seconded. How about this: * start the copy * pull the disk, without waiting for an error reported to the application * type ''lockfs -fa''. Does either lockfs hang, or you get an immediate error after requesting the lockfs? If so, I think it''s ok and within the unix tradition to allow all these writes, it''s just maybe a more extreme version of the tradition, which might not be an entirely bad compromise if ZFS can keep up this behavior, and actually retry the unreported failed writes, when confronted with FC, iSCSI, USB, FW targets that bounce. I''m not sure if it can ever do that yet or not, but architecturally I wouldn''t want to demand that it return failure to the app too soon, so long as fsync() still behaves correctly w.r.t. power failures. However the other problems you report are things I''ve run into, also. ''zpool status'' should not be touching the disk at all. so, we have: * ''zpool list'' shows ONLINE several minutes after a drive is yanked. At the time ''zpool list'' still shows ONLINE, ''zpool status'' doesn''t show anything at all because it hangs, so ONLINE seems too positive a report for the situation. I''d suggest: + ''zpool list'' should not borrow the ONLINE terminology from ''zpool status'' if the list command means something different by the word ONLINE. maybe SEEMS_TO_BE_AROUND_SOMEWHERE is more appropriate. + during this problem, ''zpool list'' is available while ''zpool status'' is not working. Fine, maybe, during a failure, not all status tools will be available. However it would be nice if, as a minimum, some status tool capable of reporting ``pool X is failing'''' were available. In the absence of that, you may have to reboot the machine without ever knowing even which pool failed to bring it down. * maybe sometimes certain types of status and statistics aren''t available, but no status-reporting tools should ever be subject to blocking inside the kernel. At worst they should refuse to give information, and return to a prompt, immediately. I''m in the habit of typing ''zpool status &'' during serious problems so I don''t lose control of the console. * ''zpool status'' is used when things are failing. Cabling and driver state machines are among the failures from which a volume manager should protect us---that''s why we say ``buy redundant controllers if possible.'''' In this scenario, a read is an intrusive act, because it could provoke a problem. so even if ''zpool status'' is only reading, not writing to disk nor to data structures inside the kernel, it is still not really a status tool. It''s an invasive poking/pinging/restarting/breaking tool. Such tools should be segregated, and shouldn''t substitute for the requirement to have true status tools that only read data structures kept in the kernel, not update kernel structures and not touch disks. This would be like if ''ps'' made an implicit call to rcapd, or activated some swapping thread, or something like that. ``My machine is sluggish. I wonder what''s slowing it down. ...''ps''... 
oh, shit, now it''s not responding at all, and I''ll never know why.'''' There can be other tools, too, but I think LVM2 and SVM both have carefully non-invasive status tools, don''t they? This principle should be followed everywhere. For example, ''iscsiadm list discovery-address'' should simply list the discovery addresses. It should not implicitly attempt to contact each discovery address in its list, while I wait. -----8<----- terabithia:/# time iscsiadm list discovery-address Discovery Address: 10.100.100.135:3260 Discovery Address: 10.100.100.138:3260 real 0m45.935s user 0m0.006s sys 0m0.019s terabithia:/# jobs [1]+ Running zpool status & terabithia:/# -----8<----- now, if you''re really scalable, try the above again with 100 iSCSI targets and 20 pools. A single ''iscsiadm list discovery-address'' command, even if it''s sort-of ``working'''', can take hours to complete. This does not happen on Linux where I configure through text files and inspect status through ''cat /proc/...'' In other words, it''s not just that the information ''zpool status'' gives is inaccurate. It''s not just that some information is hidden (like how sometimes a device listed as ONLINE will say ``no valid replicas'''' when you try to offline it, and sometimes it won''t, and the only way to tell the difference is to attempt to offline the device---so trying to ''zpool offline'' each device in turn is a way to get some more indication of pool health than what ''zpool status'' gives on its own). It''s also that I don''t trust ''zpool status'' not to affect the information it''s supposed to be reporting. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080728/dd4f305b/attachment.bin>
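Condensing that suggested test into commands, roughly (a sketch using Ross's pool and paths from earlier in the thread; run the lockfs in the background or from a second terminal in case it hangs):

# cp -r /rc-pool/copytest /test/copytest &    # start the copy

-- pull the drive while the copy is still running --

# lockfs -fa &                                # flush all file systems; note whether it errors or hangs
# jobs                                        # a lockfs stuck in "Running" is itself an answer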
Ross Smith
2008-Jul-29 10:07 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
A little more information today. I had a feeling that ZFS would continue quite some time before giving an error, and today I've shown that you can carry on working with the filesystem for at least half an hour with the disk removed.

I suspect on a system with little load you could carry on working for several hours without any indication that there is a problem. It looks to me like ZFS is caching reads & writes, and that provided requests can be fulfilled from the cache, it doesn't care whether the disk is present or not. I would guess that ZFS is attempting to write to the disk in the background, and that this is silently failing.

Here's the log of the tests I did today. After removing the drive, over a period of 30 minutes I copied folders to the filesystem, created an archive, set permissions, and checked properties. I did this both from the command line and with the graphical file manager tool in Solaris. Neither reported any errors, and all the data could be read & written fine. Until the reboot, at which point all the data was lost, again without error.

If you're not interested in the detail, please skip to the end where I've got some thoughts on just how many problems there are here.

# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c2t7d0    ONLINE       0     0     0
errors: No known data errors
# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   243M   228G   242M  /test
# zpool list test
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test   232G   243M   232G    0%  ONLINE  -

-- drive removed --

# cfgadm | grep sata1/7
sata1/7      sata-port    empty        unconfigured ok

-- cfgadm knows the drive is removed. How come ZFS does not? --

# cp -r /rc-pool/copytest /test/copytest
# zpool list test
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test   232G  73.4M   232G    0%  ONLINE  -
# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   142K   228G    18K  /test

-- Yup, still up. Let's start the clock --

# date
Tue Jul 29 09:31:33 BST 2008
# du -hs /test/copytest
 667K   /test/copytest

-- 5 minutes later, still going strong --

# date
Tue Jul 29 09:36:30 BST 2008
# zpool list test
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test   232G  73.4M   232G    0%  ONLINE  -
# cp -r /rc-pool/copytest /test/copytest2
# ls /test
copytest   copytest2
# du -h -s /test
 1.3M   /test
# zpool list test
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test   232G  73.4M   232G    0%  ONLINE  -
# find /test | wc -l
    2669
# find /test/copytest | wc -l
    1334
# find /rc-pool/copytest | wc -l
    1334
# du -h -s /rc-pool/copytest
 5.3M   /rc-pool/copytest

-- Not sure why the original pool has 5.3MB of data when I use du. --
-- File Manager reports that they both have the same size --

-- 15 minutes later it's still working. I can read data fine --

# date
Tue Jul 29 09:43:04 BST 2008
# chmod 777 /test/*
# mkdir /rc-pool/test2
# cp -r /test/copytest2 /rc-pool/test2/copytest2
# find /rc-pool/test2/copytest2 | wc -l
    1334
# zpool list test
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test   232G  73.4M   232G    0%  ONLINE  -

-- and yup, the drive is still offline --

# cfgadm | grep sata1/7
sata1/7      sata-port    empty        unconfigured ok

-- And finally, after 30 minutes the pool is still going strong --

# date
Tue Jul 29 09:59:56 BST 2008
# tar -cf /test/copytest.tar /test/copytest/*
# ls -l
total 3
drwxrwxrwx   3 root     root           3 Jul 29 09:30 copytest
-rwxrwxrwx   1 root     root     4626432 Jul 29 09:59 copytest.tar
drwxrwxrwx   3 root     root           3 Jul 29 09:39 copytest2
# zpool list test
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test   232G  73.4M   232G    0%  ONLINE  -

After a full 30 minutes there's no indication whatsoever of any problem. Checking properties of the folder in File Browser reports 2665 items, totalling 9.0MB.

At this point I tried "# zfs set sharesmb=on test". I didn't really expect it to work, and sure enough, that command hung. zpool status also hung, so I had to reboot the server.

-- Rebooted server --

Now I found that not only are all the files I've written in the last 30 minutes missing, but in fact files that I had deleted several minutes prior to removing the drive have re-appeared.

-- /test mount point is still present, I'll probably have to remove that manually --

# cd /
# ls
bin         export      media       proc        system
boot        home        mnt         rc-pool     test
dev         kernel      net         rc-usb      tmp
devices     lib         opt         root        usr
etc         lost+found  platform    sbin        var

-- ZFS still has the pool mounted, but at least now it realises it's not working --

# zpool list
NAME      SIZE   USED  AVAIL   CAP  HEALTH    ALTROOT
rc-pool  2.27T  52.6G  2.21T    2%  DEGRADED  -
test        -      -      -     -   FAULTED   -
# zpool status test
  pool: test
 state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        test        UNAVAIL      0     0     0  insufficient replicas
          c2t7d0    UNAVAIL      0     0     0  cannot open

-- At least re-activating the pool is simple, but gotta love the "No known data errors" line --

# cfgadm -c configure sata1/7
# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c2t7d0    ONLINE       0     0     0
errors: No known data errors

-- But of course, although ZFS thinks it's online, it didn't mount properly --

# cd /test
# ls
# zpool export test
# rm -r /test
# zpool import test
# cd test
# ls
var (copy)   var2

-- Now that's unexpected. Those folders should be long gone. Let's see how many files ZFS failed to delete --

# du -h -s /test
  77M   /test
# find /test | wc -l
   19033

So in addition to working for a full half hour creating files, it's also failed to remove 77MB of data contained in nearly 20,000 files. And it's done all that without reporting any error or problem with the pool.

In fact, if I didn't know what I was looking for, there would be no indication of a problem at all. Before the reboot I can't find out what's going on as "zpool status" hangs. After the reboot it says there's no problem. Both ZFS and its troubleshooting tools fail in a big way here.

As others have said, "zpool status" should not hang. ZFS has to know the state of all the drives and pools it's currently using, and "zpool status" should simply report the current known status from ZFS's internal state. It shouldn't need to scan anything. ZFS's internal state should also be checked against cfgadm so that it knows if a disk isn't there. It should also be updated if the cache can't be flushed to disk, and "zfs list" / "zpool list" need to borrow state information from the status commands so that they don't say ONLINE when the pool has problems.

ZFS needs to deal more intelligently with mount points when a pool has problems. Leaving the folder lying around in a way that prevents the pool mounting properly when the drives are recovered is not good. When the pool appears to come back online without errors, it would be very easy for somebody to assume the data was lost from the pool without realising that it simply hasn't mounted and they're actually looking at an empty folder. Firstly ZFS should be removing the mount point when problems occur, and secondly, zfs list or zpool status should include information to inform you that the pool could not be mounted properly.

zpool status really should be warning of any ZFS errors that occur, including things like being unable to mount the pool, CIFS mounts failing, etc.

And finally, if ZFS does find problems writing from the cache, it really needs to log somewhere the names of all the files affected, and the action that could not be carried out. ZFS knows the files it was meant to delete here; it also knows the files that were written. I can accept that with delayed writes files may occasionally be lost when a failure happens, but I don't accept that we need to lose all knowledge of the affected files when the filesystem has complete knowledge of what is affected. If there are any working filesystems on the server, ZFS should make an attempt to store a log of the problem; failing that, it should e-mail the data out. The admin really needs to know which files have been affected so that they can notify users of the data loss. I don't know where you would store this information, but wherever that is, "zpool status" should be reporting the error and directing the admin to the log file.

I would probably say this could be safely stored on the system drive. Would it be possible to have a number of possible places to store this log? What I'm thinking is that if the system drive is unavailable, ZFS could try each pool in turn and attempt to store the log there.

In fact e-mail alerts or external error logging would be a great addition to ZFS. Surely it makes sense that filesystem errors would be better off being stored and handled externally?

Ross

> Date: Mon, 28 Jul 2008 12:28:34 -0700
> From: Richard.Elling at Sun.COM
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>
> I'm trying to reproduce and will let you know what I find.
> -- richard
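Coming back to the suggestion above that ZFS should be consulting cfgadm: until something like that exists, the cross-checks have to be run by hand, outside ZFS. A rough sketch of what can be done today while "zpool status" is hung (all standard Solaris tools, though how much they show for a pulled SATA drive will vary):

# cfgadm | grep sata       # does the controller still see the disk at all?
# iostat -En               # per-device soft/hard/transport error counters
# fmdump -e                # FMA error reports logged against the device, if any
# fmadm faulty             # faults FMA has actually diagnosed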
David Collier-Brown
2008-Jul-29 15:59 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Just a side comment: this discussion shows all the classic symptoms of two groups of people with different basic assumptions, each wondering why the other said what they did. Getting these out in the open would be A Good Thing (;-))

--dave

Jonathan Loran wrote:
> I think the important point here is that this makes the case for ZFS
> handling at least one layer of redundancy. If the disk you pulled was
> part of a mirror or raidz, there wouldn't be data loss when your system
> was rebooted.
> [snip]

--
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb at sun.com                 |                      -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
Jonathan Loran
2008-Jul-29 19:23 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
I think the important point here is that this makes the case for ZFS handling at least one layer of redundancy. If the disk you pulled was part of a mirror or raidz, there wouldn''t be data loss when your system was rebooted. In fact, the zpool status commands would likely keep working, and a reboot wouldn''t be necessary at all. I think it''s unreasonable to expect a system with any file system to recover from a single drive being pulled. Of course, loosing extra work because of the delayed notification is bad, but none the less, this is not a reasonable test. Basically, always provide redundancy in your zpool config. Jon Ross Smith wrote:> A little more information today. I had a feeling that ZFS would > continue quite some time before giving an error, and today I''ve shown > that you can carry on working with the filesystem for at least half an > hour with the disk removed. > > I suspect on a system with little load you could carry on working for > several hours without any indication that there is a problem. It > looks to me like ZFS is caching reads & writes, and that provided > requests can be fulfilled from the cache, it doesn''t care whether the > disk is present or not. > > I would guess that ZFS is attempting to write to the disk in the > background, and that this is silently failing. > > Here''s the log of the tests I did today. After removing the drive, > over a period of 30 minutes I copied folders to the filesystem, > created an archive, set permissions, and checked properties. I did > this both in the command line and with the graphical file manager tool > in Solaris. Neither reported any errors, and all the data could be > read & written fine. Until the reboot, at which point all the data > was lost, again without error. > > If you''re not interested in the detail, please skip to the end where > I''ve got some thoughts on just how many problems there are here. > > > # zpool status test > pool: test > state: ONLINE > scrub: none requested > config: > NAME STATE READ WRITE CKSUM > test ONLINE 0 0 0 > c2t7d0 ONLINE 0 0 0 > errors: No known data errors > # zfs list test > NAME USED AVAIL REFER MOUNTPOINT > test 243M 228G 242M /test > # zpool list test > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > test 232G 243M 232G 0% ONLINE - > > > -- drive removed -- > > > # cfgadm |grep sata1/7 > sata1/7 sata-port empty unconfigured ok > > > -- cfgadmin knows the drive is removed. How come ZFS does not? -- > > > # cp -r /rc-pool/copytest /test/copytest > # zpool list test > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > test 232G 73.4M 232G 0% ONLINE - > # zfs list test > NAME USED AVAIL REFER MOUNTPOINT > test 142K 228G 18K /test > > > -- Yup, still up. Let''s start the clock -- > > > # date > Tue Jul 29 09:31:33 BST 2008 > # du -hs /test/copytest > 667K /test/copytest > > > -- 5 minutes later, still going strong -- > > > # date > Tue Jul 29 09:36:30 BST 2008 > # zpool list test > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > test 232G 73.4M 232G 0% ONLINE - > # cp -r /rc-pool/copytest /test/copytest2 > # ls /test > copytest copytest2 > # du -h -s /test > 1.3M /test > # zpool list test > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > test 232G 73.4M 232G 0% ONLINE - > # find /test | wc -l > 2669 > # find //test/copytest | wc -l > 1334 > # find /rc-pool/copytest | wc -l > 1334 > # du -h -s /rc-pool/copytest > 5.3M /rc-pool/copytest > > > -- Not sure why the original pool has 5.3MB of data when I use du. 
-- > -- File Manager reports that they both have the same size -- > > > -- 15 minutes later it''s still working. I can read data fine -- > > # date > Tue Jul 29 09:43:04 BST 2008 > # chmod 777 /test/* > # mkdir /rc-pool/test2 > # cp -r /test/copytest2 /rc-pool/test2/copytest2 > # find /rc-pool/test2/copytest2 | wc -l > 1334 > # zpool list test > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > test 232G 73.4M 232G 0% ONLINE - > > > -- and yup, the drive is still offline -- > > > # cfgadm | grep sata1/7 > sata1/7 sata-port empty unconfigured ok > > > -- And finally, after 30 minutes the pool is still going strong -- > > > # date > Tue Jul 29 09:59:56 BST 2008 > # tar -cf /test/copytest.tar /test/copytest/* > # ls -l > total 3 > drwxrwxrwx 3 root root 3 Jul 29 09:30 copytest > -rwxrwxrwx 1 root root 4626432 Jul 29 09:59 copytest.tar > drwxrwxrwx 3 root root 3 Jul 29 09:39 copytest2 > # zpool list test > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > test 232G 73.4M 232G 0% ONLINE - > > > After a full 30 minutes there''s no indication whatsoever of any > problem. Checking properties of the folder in File Browser reports > 2665 items, totalling 9.0MB. > > At this point I tried "# zfs set sharesmb=on test". I didn''t really > expect it to work, and sure enough, that command hung. zpool status > also hung, so I had to reboot the server. > > > -- Rebooted server -- > > > Now I found that not only are all the files I''ve written in the last > 30 minutes missing, but in fact files that I had deleted several > minutes prior to removing the drive have re-appeared. > > > -- /test mount point is still present, I''ll probably have to remove > that manually -- > > > # cd / > # ls > bin export media proc system > boot home mnt rc-pool test > dev kernel net rc-usb tmp > devices lib opt root usr > etc lost+found platform sbin var > > > -- ZFS still has the pool mounted, but at least now it realises it''s > not working -- > > > # zpool list > NAME SIZE USED AVAIL CAP HEALTH ALTROOT > rc-pool 2.27T 52.6G 2.21T 2% DEGRADED - > test - - - - FAULTED - > # zpool status test > pool: test > state: UNAVAIL > status: One or more devices could not be opened. There are insufficient > replicas for the pool to continue functioning. > action: Attach the missing device and online it using ''zpool online''. > see: http://www.sun.com/msg/ZFS-8000-3C > scrub: none requested > config: > NAME STATE READ WRITE CKSUM > test UNAVAIL 0 0 0 insufficient replicas > c2t7d0 UNAVAIL 0 0 0 cannot open > > > -- At least re-activating the pool is simple, but gotta love the "No > known data errors" line -- > > > # cfgadm -c configure sata1/7 > # zpool status test > pool: test > state: ONLINE > scrub: none requested > config: > NAME STATE READ WRITE CKSUM > test ONLINE 0 0 0 > c2t7d0 ONLINE 0 0 0 > errors: No known data errors > > > -- But of course, although ZFS thinks it''s online, it didn''t mount > properly -- > > > # cd /test > # ls > # zpool export test > # rm -r /test > # zpool import test > # cd test > # ls > var (copy) var2 > > > -- Now that''s unexpected. Those folders should be long gone. Let''s > see how many files ZFS failed to delete -- > > > # du -h -s /test > 77M /test > # find /test | wc -l > 19033 > > > So in addition to working for a full half hour creating files, it''s > also failed to remove 77MB of data contained in nearly 20,000 files. > And it''s done all that without reporting any error or problem with the > pool. > > In fact, if I didn''t know what I was looking for, there would be no > indication of a problem at all. 
Before the reboot I can't find what's > going on as "zpool status" hangs. After the reboot it says there's no > problem. Both ZFS and its troubleshooting tools fail in a big way > here. > > As others have said, "zpool status" should not hang. ZFS has to know > the state of all the drives and pools it's currently using, "zpool > status" should simply report the current known status from ZFS's > internal state. It shouldn't need to scan anything. ZFS's internal > state should also be checking with cfgadm so that it knows if a disk > isn't there. It should also be updated if the cache can't be flushed > to disk, and "zfs list / zpool list" needs to borrow state information > from the status commands so that they don't say 'online' when the pool > has problems. > > ZFS needs to deal more intelligently with mount points when a pool has > problems. Leaving the folder lying around in a way that prevents the > pool mounting properly when the drives are recovered is not good. > When the pool appears to come back online without errors, it would be > very easy for somebody to assume the data was lost from the pool > without realising that it simply hasn't mounted and they're actually > looking at an empty folder. Firstly ZFS should be removing the mount > point when problems occur, and secondly, zfs list or zpool status should > include information to inform you that the pool could not be mounted > properly. > > zpool status really should be warning of any ZFS errors that occur. > Including things like being unable to mount the pool, CIFS mounts > failing, etc... > > And finally, if ZFS does find problems writing from the cache, it > really needs to log somewhere the names of all the files affected, and > the action that could not be carried out. ZFS knows the files it was > meant to delete here, it also knows the files that were written. I > can accept that with delayed writes files may occasionally be lost > when a failure happens, but I don't accept that we need to lose all > knowledge of the affected files when the filesystem has complete > knowledge of what is affected. If there are any working filesystems > on the server, ZFS should make an attempt to store a log of the > problem, failing that it should e-mail the data out. The admin really > needs to know what files have been affected so that they can notify > users of the data loss. I don't know where you would store this > information, but wherever that is, "zpool status" should be reporting > the error and directing the admin to the log file. > > I would probably say this could be safely stored on the system drive. > Would it be possible to have a number of possible places to store this > log? What I'm thinking is that if the system drive is unavailable, > ZFS could try each pool in turn and attempt to store the log there. > > In fact e-mail alerts or external error logging would be a great > addition to ZFS. Surely it makes sense that filesystem errors would > be better off being stored and handled externally? > > Ross > > > > > Date: Mon, 28 Jul 2008 12:28:34 -0700 > > From: Richard.Elling at Sun.COM > > Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive > removed > > To: myxiplx at hotmail.com > > > > I'm trying to reproduce and will let you know what I find. > > -- richard
-- - _____/ _____/ / - Jonathan Loran - - - / / / IT Manager - - _____ / _____ / / Space Sciences Laboratory, UC Berkeley - / / / (510) 643-5146 jloran at ssl.berkeley.edu - ______/ ______/ ______/ AST:7731^29u18e3
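For anyone following along, here is a minimal sketch of the kind of redundant layout Jon is arguing for; the device names are placeholders rather than anything from the tests above, and the two create commands are alternatives, not a sequence:

# zpool create test mirror c2t7d0 c2t8d0
-- or, for single-parity raidz --
# zpool create test raidz c2t7d0 c2t8d0 c2t9d0
# zpool status -v test

With either layout, pulling one drive should leave the pool DEGRADED but still readable and writable after a reboot, rather than silently dropping writes the way the single-disk pool did here.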
Well yeah, this is obviously not a valid setup for my data, but if you read my first e-mail, the whole point of this test was that I had seen Solaris hang when a drive was removed from a fully redundant array (five sets of three way mirrors), and wanted to see what was going on. So I started with the most basic pool I could to see how ZFS and Solaris actually reacted to a drive being removed. I was fully expecting ZFS to simply error when the drive was removed, and move the test on to more complex pools. I did not expect to find so many problems with such a simple setup. And the problems I have found also lead to potential data loss in a redundant array, although it would have been much more difficult to spot:

Imagine you had a raid-z array and pulled a drive as I'm doing here. Because ZFS isn't aware of the removal it keeps writing to that drive as if it's valid. That means ZFS still believes the array is online when in fact it should be degraded. If any other drive now fails, ZFS will consider the status degraded instead of faulted, and will continue writing data. The problem is, ZFS is writing some of that data to a drive which doesn't exist, meaning all that data will be lost on reboot. This message posted from opensolaris.org
Bob Friesenhahn
2008-Jul-30 14:48 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
On Wed, 30 Jul 2008, Ross wrote:> > Imagine you had a raid-z array and pulled a drive as I''m doing here. > Because ZFS isn''t aware of the removal it keeps writing to that > drive as if it''s valid. That means ZFS still believes the array is > online when in fact it should be degrated. If any other drive now > fails, ZFS will consider the status degrated instead of faulted, and > will continue writing data. The problem is, ZFS is writing some of > that data to a drive which doesn''t exist, meaning all that data will > be lost on reboot.While I do believe that device drivers. or the fault system, should notify ZFS when a device fails (and ZFS should appropriately react), I don''t think that ZFS should be responsible for fault monitoring. ZFS is in a rather poor position for device fault monitoring, and if it attempts to do so then it will be slow and may misbehave in other ways. The software which communicates with the device (i.e. the device driver) is in the best position to monitor the device. The primary goal of ZFS is to be able to correctly read data which was successfully committed to disk. There are programming interfaces (e.g. fsync(), msync()) which may be used to ensure that data is committed to disk, and which should return an error if there is a problem. If you were performing your tests over an NFS mount then the results should be considerably different since NFS requests that its data be committed to disk. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Ross Smith
2008-Jul-30 15:03 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
I agree that device drivers should perform the bulk of the fault monitoring, however I disagree that this absolves ZFS of any responsibility for checking for errors. The primary goal of ZFS is to be a filesystem and maintain data integrity, and that entails both reading and writing data to the devices. It is no good having checksumming when reading data if you are losing huge amounts of data when a disk fails. I'm not saying that ZFS should be monitoring disks and drivers to ensure they are working, just that if ZFS attempts to write data and doesn't get the response it's expecting, an error should be logged against the device regardless of what the driver says. If ZFS is really about end-to-end data integrity, then you do need to consider the possibility of a faulty driver. Now I don't know what the root cause of this error is, but I suspect it will be either a bad response from the SATA driver, or something within ZFS that is not working correctly. Either way however I believe ZFS should have caught this. It's similar to the iSCSI problem I posted a few months back where the ZFS pool hangs for 3 minutes when a device is disconnected. There's absolutely no need for the entire pool to hang when the other half of the mirror is working fine. ZFS is often compared to hardware raid controllers, but so far its ability to handle problems is falling short. Ross> Date: Wed, 30 Jul 2008 09:48:34 -0500> From: bfriesen at simple.dallas.tx.us> To: myxiplx at hotmail.com> CC: zfs-discuss at opensolaris.org> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed> > On Wed, 30 Jul 2008, Ross wrote:> >> > Imagine you had a raid-z array and pulled a drive as I'm doing here. > > Because ZFS isn't aware of the removal it keeps writing to that > > drive as if it's valid. That means ZFS still believes the array is > > online when in fact it should be degraded. If any other drive now > > fails, ZFS will consider the status degraded instead of faulted, and > > will continue writing data. The problem is, ZFS is writing some of > > that data to a drive which doesn't exist, meaning all that data will > > be lost on reboot.> > While I do believe that device drivers. or the fault system, should > notify ZFS when a device fails (and ZFS should appropriately react), I > don't think that ZFS should be responsible for fault monitoring. ZFS > is in a rather poor position for device fault monitoring, and if it > attempts to do so then it will be slow and may misbehave in other > ways. The software which communicates with the device (i.e. the > device driver) is in the best position to monitor the device.> > The primary goal of ZFS is to be able to correctly read data which was > successfully committed to disk. There are programming interfaces > (e.g. fsync(), msync()) which may be used to ensure that data is > committed to disk, and which should return an error if there is a > problem. If you were performing your tests over an NFS mount then the > results should be considerably different since NFS requests that its > data be committed to disk.> > Bob> ======================================> Bob Friesenhahn> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2008-Jul-30 15:21 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
On Wed, 30 Jul 2008, Ross Smith wrote:> > I'm not saying that ZFS should be monitoring disks and drivers to > ensure they are working, just that if ZFS attempts to write data and > doesn't get the response it's expecting, an error should be logged > against the device regardless of what the driver says. If ZFS is...

A few things to consider:

* Maybe the device driver has not yet reported (or fails to report) an error and just seems "slow".

* ZFS is at such a high level that in many cases it has no useful knowledge of actual devices. For example, MPXIO (multipath) may be layered on top, or maybe an ethernet network is involved.

If ZFS experiences a temporary problem with reaching a device, does that mean the device has failed, or does it perhaps indicate that a path is temporarily slow? If one device is a local disk and the other device is accessed via iSCSI and is located on the other end of the country, should ZFS refuse to operate if the remote disk is slow or stops responding for several minutes? This would be a typical situation when using mirroring, and one mirror device is remote. The parameters that a device driver for a local device uses to decide if there is a fault will be (and should be) substantially different than the parameters for a remote device. That is why most responsibility is left to the device driver. ZFS will behave according to how the device driver behaves.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Peter Cudhea
2008-Jul-30 15:27 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Your point is well taken that ZFS should not duplicate functionality that is already or should be available at the device driver level. In this case, I think it misses the point of what ZFS should be doing that it is not. ZFS does its own periodic commits to the disk, and it knows if those commit points have reached the disk or not, or whether they are getting errors. In this particular case, those commits to disk are presumably failing, because one of the disks they depend on has been removed from the system. (If the writes are not being marked as failures, that would definitely be an error in the device driver, as you say.) In this case, however, the ZIL log has stopped being updated, but ZFS does nothing to announce that this has happened, or to indicate that a remedy is required. At the very least, it would be extremely helpful if ZFS had a status to report that indicates that the ZIL log is out of date, or that there are troubles writing to the ZIL log, or something like that. An additional feature would be to have user-selectable behavior when the ZIL log is significantly out of date. For example, if the ZIL log is more than X seconds out of date, then new writes to the system should pause, or give errors or continue to silently succeed. In an earlier phase of my career when I worked for a database company, I was responsible for a similar bug. It caused a major customer to lose a major amount of data when a system rebooted when not all good data had been successfully committed to disk. The resulting stink caused us to add a feature to detect the cases when the writing-to-disk process had fallen too far behind, and to pause new writes to the database until the situation was resolved. Peter Bob Friesenhahn wrote:> While I do believe that device drivers. or the fault system, should > notify ZFS when a device fails (and ZFS should appropriately react), I > don''t think that ZFS should be responsible for fault monitoring. ZFS > is in a rather poor position for device fault monitoring, and if it > attempts to do so then it will be slow and may misbehave in other > ways. The software which communicates with the device (i.e. the > device driver) is in the best position to monitor the device. > > The primary goal of ZFS is to be able to correctly read data which was > successfully committed to disk. There are programming interfaces > (e.g. fsync(), msync()) which may be used to ensure that data is > committed to disk, and which should return an error if there is a > problem. If you were performing your tests over an NFS mount then the > results should be considerably different since NFS requests that its > data be committed to disk. > > Bob > =====================================> Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
Richard Elling
2008-Jul-30 18:17 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
I was able to reproduce this in b93, but might have a different interpretation of the conditions. More below... Ross Smith wrote:> A little more information today. I had a feeling that ZFS would > continue quite some time before giving an error, and today I''ve shown > that you can carry on working with the filesystem for at least half an > hour with the disk removed. > > I suspect on a system with little load you could carry on working for > several hours without any indication that there is a problem. It > looks to me like ZFS is caching reads & writes, and that provided > requests can be fulfilled from the cache, it doesn''t care whether the > disk is present or not.In my USB-flash-disk-sudden-removal-while-writing-big-file-test, 1. I/O to the missing device stopped (as I expected) 2. FMA kicked in, as expected. 3. /var/adm/messages recorded "Command failed to complete... device gone." 4. After exactly 9 minutes, 17,951 e-reports had been processed and the diagnosis was complete. FMA logged the following to /var/adm/messages Jul 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci1458,5004 at b,1/storage at 8/disk at 0,0 (sd1): Jul 30 10:33:44 grond Command failed to complete...Device is gone Jul 30 10:42:31 grond fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008 Jul 30 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond Jul 30 10:42:31 grond SOURCE: zfs-diagnosis, REV: 1.0 Jul 30 10:42:31 grond EVENT-ID: d99769aa-28e8-cf16-d181-945592130525 Jul 30 10:42:31 grond DESC: The number of I/O errors associated with a ZFS device exceeded Jul 30 10:42:31 grond acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information. Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt Jul 30 10:42:31 grond will be made to activate a hot spare if available. Jul 30 10:42:31 grond IMPACT: Fault tolerance of the pool may be compromised. Jul 30 10:42:31 grond REC-ACTION: Run ''zpool status -x'' and replace the bad device. The above URL shows what you expect, but more (and better) info is available from zpool status -xv pool: rmtestpool state: UNAVAIL status: One or more devices are faultd in response to IO failures. action: Make sure the affected devices are connected, then run ''zpool clear''. see: http://www.sun.com/msg/ZFS-8000-HC scrub: none requested config: NAME STATE READ WRITE CKSUM rmtestpool UNAVAIL 0 15.7K 0 insufficient replicas c2t0d0p0 FAULTED 0 15.7K 0 experienced I/O failures errors: Permanent errors have been detected in the following files: /rmtestpool/random.data If you surf to http://www.sun.com/msg/ZFS-8000-HC you''ll see words to the effect that, The pool has experienced I/O failures. Since the ZFS pool property ''failmode'' is set to ''wait'', all I/Os (reads and writes) are blocked. See the zpool(1M) manpage for more information on the ''failmode'' property. Manual intervention is required for I/Os to be serviced.> > I would guess that ZFS is attempting to write to the disk in the > background, and that this is silently failing.It is clearly not silently failing. However, the default failmode property is set to "wait" which will patiently wait forever. If you would rather have the I/O fail, then you should change the failmode to "continue" I would not normally recommend a failmode of "panic" Now to figure out how to recover gracefully... zpool clear isn''t happy... 
[sidebar] while performing this experiment, I noticed that fmd was checkpointing the diagnosis engine to disk in the /var/fm/fmd/ckpt/zfs-diagnosis directory. If this had been the boot disk, with failmode=wait, I''m not convinced that we''d get a complete diagnosis... I''ll explore that later. [/sidebar] -- richard
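For anyone who wants to experiment with the behaviour Richard describes, a rough sketch of checking and switching the failmode property (pool name taken from his test, adjust to suit; the property only exists on builds that include PSARC 2007/567):

# zpool get failmode rmtestpool
# zpool set failmode=continue rmtestpool
# zpool set failmode=wait rmtestpool

With failmode=continue, new writes to a faulted pool should come back with an error rather than blocking; wait is the default shown above, and panic is also accepted but is rarely what you want on a server hosting other pools.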
Paul Fisher
2008-Jul-30 18:24 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Richard Elling wrote:> I was able to reproduce this in b93, but might have a different > interpretation of the conditions. More below... > > Ross Smith wrote: > >> A little more information today. I had a feeling that ZFS would >> continue quite some time before giving an error, and today I''ve shown >> that you can carry on working with the filesystem for at least half an >> hour with the disk removed. >> >> I suspect on a system with little load you could carry on working for >> several hours without any indication that there is a problem. It >> looks to me like ZFS is caching reads & writes, and that provided >> requests can be fulfilled from the cache, it doesn''t care whether the >> disk is present or not. >> > > In my USB-flash-disk-sudden-removal-while-writing-big-file-test, > 1. I/O to the missing device stopped (as I expected) > 2. FMA kicked in, as expected. > 3. /var/adm/messages recorded "Command failed to complete... device gone." > 4. After exactly 9 minutes, 17,951 e-reports had been processed and the > diagnosis was complete. FMA logged the following to /var/adm/messages >Wow! Who knew that 17, 951 was the magic number... Seriously, this does seem like an "excessive amount of certainty". -- paul
Neil Perrin
2008-Jul-30 18:41 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Peter Cudhea wrote:> Your point is well taken that ZFS should not duplicate functionality > that is already or should be available at the device driver level. In > this case, I think it misses the point of what ZFS should be doing that > it is not. > > ZFS does its own periodic commits to the disk, and it knows if those > commit points have reached the disk or not, or whether they are getting > errors. In this particular case, those commits to disk are presumably > failing, because one of the disks they depend on has been removed from > the system. (If the writes are not being marked as failures, that > would definitely be an error in the device driver, as you say.) In this > case, however, the ZIL log has stopped being updated, but ZFS does > nothing to announce that this has happened, or to indicate that a remedy > is required.I think you have some misconceptions about how the ZIL works. It doesn''t provide journalling like UFS. The following might help: http://blogs.sun.com/perrin/entry/the_lumberjack The ZIL isn''t used at all unless there''s fsync/O_DSYNC activity.> > At the very least, it would be extremely helpful if ZFS had a status to > report that indicates that the ZIL log is out of date, or that there are > troubles writing to the ZIL log, or something like that.If the ZIL cannot be written then we force a transaction group (txg) commit. That is the only recourse to force data to stable storage before returning to the application.> > An additional feature would be to have user-selectable behavior when the > ZIL log is significantly out of date. For example, if the ZIL log is > more than X seconds out of date, then new writes to the system should > pause, or give errors or continue to silently succeed.Again this doesn''t make sense given how the ZIL works.> > In an earlier phase of my career when I worked for a database company, I > was responsible for a similar bug. It caused a major customer to lose > a major amount of data when a system rebooted when not all good data had > been successfully committed to disk. The resulting stink caused us to > add a feature to detect the cases when the writing-to-disk process had > fallen too far behind, and to pause new writes to the database until the > situation was resolved. > > Peter > > Bob Friesenhahn wrote: >> While I do believe that device drivers. or the fault system, should >> notify ZFS when a device fails (and ZFS should appropriately react), I >> don''t think that ZFS should be responsible for fault monitoring. ZFS >> is in a rather poor position for device fault monitoring, and if it >> attempts to do so then it will be slow and may misbehave in other >> ways. The software which communicates with the device (i.e. the >> device driver) is in the best position to monitor the device. >> >> The primary goal of ZFS is to be able to correctly read data which was >> successfully committed to disk. There are programming interfaces >> (e.g. fsync(), msync()) which may be used to ensure that data is >> committed to disk, and which should return an error if there is a >> problem. If you were performing your tests over an NFS mount then the >> results should be considerably different since NFS requests that its >> data be committed to disk. 
>> >> Bob >> =====================================>> Bob Friesenhahn >> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ >> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >> >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
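A quick way to see Neil's point about the ZIL only being used for synchronous activity is to count zil_commit() calls while a workload runs; a rough DTrace sketch, on the assumption that the fbt probe name matches the build in question:

# dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'

A plain cp of the test folder shouldn't register here at all, while anything calling fsync() or writing O_DSYNC (a database, an NFS server) should.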
Peter Cudhea
2008-Jul-30 19:42 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Thanks, this is helpful. I was definitely misunderstanding the part that the ZIL plays in ZFS. I found Richard Elling''s discussion of the FMA response to the failure very informative. I see how the device driver, the fault analysis layer and the ZFS layer are all working together. Though the customer''s complaint that the change in state from "working" to "not working" is taking too long seems pretty valid. Peter Neil Perrin wrote:> > > Peter Cudhea wrote: >> Your point is well taken that ZFS should not duplicate functionality >> that is already or should be available at the device driver level. >> In this case, I think it misses the point of what ZFS should be doing >> that it is not. >> >> ZFS does its own periodic commits to the disk, and it knows if those >> commit points have reached the disk or not, or whether they are >> getting errors. In this particular case, those commits to disk are >> presumably failing, because one of the disks they depend on has been >> removed from the system. (If the writes are not being marked as >> failures, that would definitely be an error in the device driver, as >> you say.) In this case, however, the ZIL log has stopped being >> updated, but ZFS does nothing to announce that this has happened, or >> to indicate that a remedy is required. > > I think you have some misconceptions about how the ZIL works. > It doesn''t provide journalling like UFS. The following might help: > > http://blogs.sun.com/perrin/entry/the_lumberjack > > The ZIL isn''t used at all unless there''s fsync/O_DSYNC activity. > >> >> At the very least, it would be extremely helpful if ZFS had a status >> to report that indicates that the ZIL log is out of date, or that >> there are troubles writing to the ZIL log, or something like that. > > If the ZIL cannot be written then we force a transaction group (txg) > commit. That is the only recourse to force data to stable storage before > returning to the application. >> >> An additional feature would be to have user-selectable behavior when >> the ZIL log is significantly out of date. For example, if the ZIL >> log is more than X seconds out of date, then new writes to the system >> should pause, or give errors or continue to silently succeed. > > Again this doesn''t make sense given how the ZIL works. > >> >> In an earlier phase of my career when I worked for a database >> company, I was responsible for a similar bug. It caused a major >> customer to lose a major amount of data when a system rebooted when >> not all good data had been successfully committed to disk. The >> resulting stink caused us to add a feature to detect the cases when >> the writing-to-disk process had fallen too far behind, and to pause >> new writes to the database until the situation was resolved. >> >> Peter >> >> Bob Friesenhahn wrote: >>> While I do believe that device drivers. or the fault system, should >>> notify ZFS when a device fails (and ZFS should appropriately react), >>> I don''t think that ZFS should be responsible for fault monitoring. >>> ZFS is in a rather poor position for device fault monitoring, and if >>> it attempts to do so then it will be slow and may misbehave in other >>> ways. The software which communicates with the device (i.e. the >>> device driver) is in the best position to monitor the device. >>> >>> The primary goal of ZFS is to be able to correctly read data which >>> was successfully committed to disk. There are programming >>> interfaces (e.g. 
fsync(), msync()) which may be used to ensure that >>> data is committed to disk, and which should return an error if there >>> is a problem. If you were performing your tests over an NFS mount >>> then the results should be considerably different since NFS requests >>> that its data be committed to disk. >>> >>> Bob >>> =====================================>>> Bob Friesenhahn >>> bfriesen at simple.dallas.tx.us, >>> http://www.simplesystems.org/users/bfriesen/ >>> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >>> >>> _______________________________________________ >>> zfs-discuss mailing list >>> zfs-discuss at opensolaris.org >>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >>> >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Richard Elling
2008-Jul-30 21:04 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Peter Cudhea wrote:> Thanks, this is helpful. I was definitely misunderstanding the part that > the ZIL plays in ZFS. > > I found Richard Elling''s discussion of the FMA response to the failure > very informative. I see how the device driver, the fault analysis > layer and the ZFS layer are all working together. Though the > customer''s complaint that the change in state from "working" to "not > working" is taking too long seems pretty valid. >I wish there was a simple answer to the can-of-worms^TM that this question opens. But there really isn''t. As Paul Fisher points out, logging 17,951 e-reports in 9 minutes seems like a lot, but I''m quite sure that is CPU bound and I could log more with a faster system :-) The key here is that 9 minutes represents some combination of timeouts in the sd/scsa2usb/usb stack. The myth of layered software says that timeouts compound, so digging around for a better collection might or might not be generally satisfying. Since this is not a ZFS timeout, perhaps the conversation should be continued in a more appropriate forum? -- richard
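For completeness, if anyone wants to experiment with shrinking that window, the knobs live in the disk driver rather than in ZFS; this is only a sketch, and whether the sd(7D) tunable below is honoured on the scsa2usb path is an assumption that would need checking on the build in question:

# echo 'sd`sd_io_time/D' | mdb -k
-- to change it, add the following to /etc/system and reboot --
set sd:sd_io_time=20

Shortening the per-command timeout trades a faster diagnosis for more false positives on slow or busy devices, which is exactly the tension Miles raises later in the thread.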
Jonathan Loran
2008-Jul-30 21:44 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
From a reporting perspective, yes, zpool status should not hang, and should report an error if a drive goes away, or is in any way behaving badly. No arguments there. From the data integrity perspective, the only event zfs needs to know about is when a bad drive is replaced, such that a resilver is triggered. If a drive is suddenly gone, but it is only one component of a redundant set, your data should still be fine. Now, if enough drives go away to break the redundancy, that''s a different story altogether. Jon Ross Smith wrote:> I agree that device drivers should perform the bulk of the fault > monitoring, however I disagree that this absolves ZFS of any > responsibility for checking for errors. The primary goal of ZFS is to > be a filesystem and maintain data integrity, and that entails both > reading and writing data to the devices. It is no good having > checksumming when reading data if you are loosing huge amounts of data > when a disk fails. > > I''m not saying that ZFS should be monitoring disks and drivers to > ensure they are working, just that if ZFS attempts to write data and > doesn''t get the response it''s expecting, an error should be logged > against the device regardless of what the driver says. If ZFS is > really about end-to-end data integrity, then you do need to consider > the possibility of a faulty driver. Now I don''t know what the root > cause of this error is, but I suspect it will be either a bad response > from the SATA driver, or something within ZFS that is not working > correctly. Either way however I believe ZFS should have caught this. > > It''s similar to the iSCSI problem I posted a few months back where the > ZFS pool hangs for 3 minutes when a device is disconnected. There''s > absolutely no need for the entire pool to hang when the other half of > the mirror is working fine. ZFS is often compared to hardware raid > controllers, but so far it''s ability to handle problems is falling short. > > Ross > > > > Date: Wed, 30 Jul 2008 09:48:34 -0500 > > From: bfriesen at simple.dallas.tx.us > > To: myxiplx at hotmail.com > > CC: zfs-discuss at opensolaris.org > > Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive > removed > > > > On Wed, 30 Jul 2008, Ross wrote: > > > > > > Imagine you had a raid-z array and pulled a drive as I''m doing here. > > > Because ZFS isn''t aware of the removal it keeps writing to that > > > drive as if it''s valid. That means ZFS still believes the array is > > > online when in fact it should be degrated. If any other drive now > > > fails, ZFS will consider the status degrated instead of faulted, and > > > will continue writing data. The problem is, ZFS is writing some of > > > that data to a drive which doesn''t exist, meaning all that data will > > > be lost on reboot. > > > > While I do believe that device drivers. or the fault system, should > > notify ZFS when a device fails (and ZFS should appropriately react), I > > don''t think that ZFS should be responsible for fault monitoring. ZFS > > is in a rather poor position for device fault monitoring, and if it > > attempts to do so then it will be slow and may misbehave in other > > ways. The software which communicates with the device (i.e. the > > device driver) is in the best position to monitor the device. > > > > The primary goal of ZFS is to be able to correctly read data which was > > successfully committed to disk. There are programming interfaces > > (e.g. 
fsync(), msync()) which may be used to ensure that data is > > committed to disk, and which should return an error if there is a > > problem. If you were performing your tests over an NFS mount then the > > results should be considerably different since NFS requests that its > > data be committed to disk. > > > > Bob >-- - _____/ _____/ / - Jonathan Loran - - - / / / IT Manager - - _____ / _____ / / Space Sciences Laboratory, UC Berkeley - / / / (510) 643-5146 jloran at ssl.berkeley.edu - ______/ ______/ ______/ AST:7731^29u18e3
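The event Jon is describing is driven from the command line; a short sketch using the pool and device names from Ross's single-disk test (the replacement disk name is a placeholder):

# zpool online test c2t7d0
-- or, if the disk itself was swapped --
# zpool replace test c2t7d0 c2t8d0
# zpool status -v test

Either command is what actually triggers the resilver Jon mentions.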
Ross Smith
2008-Jul-31 12:28 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
I'm not sure you're actually seeing the same problem there Richard. It seems that for you I/O is stopping on removal of the device, whereas for me I/O continues for some considerable time. You are also able to obtain a result from "zpool status" whereas that completely hangs for me. To illustrate the difference, this is what I saw today in snv_94, with a pool created from a single external USB hard drive.

1. As before I started a copy of a directory using Solaris' file manager. About 1/3 of the way through I pulled the plug on the drive.

2. File manager continued to copy a further 30MB+ of files across. Checking the properties of the copy shows it contains 71.1MB of data and 19,160 files, despite me pulling the drive at around 8,000 files.

3. 8:24am I ran "zpool status":

# zpool status rc-usb
  pool: rc-usb
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested

That is as far as it gets. It never gives me any further information. I left it two hours, and it still had not displayed the status of the drive in the pool. I also did a "zfs list", that also hangs now although I'm pretty sure that if you run "zfs list" before "zpool status" it works fine.

As you can see from /var/adm/messages, I am getting nothing at all from FMA:

Jul 31 08:16:46 unknown usba: [ID 912658 kern.info] USB 2.0 device (usbd49,7350) operating at hi speed (USB 2.x) on USB 2.0 root hub: storage at 3, scsa2usb0 at bus address 2
Jul 31 08:16:46 unknown usba: [ID 349649 kern.info] Maxtor OneTouch 2HAP70DZ
Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info] scsa2usb0 is /pci at 0,0/pci15d9,a011 at 2,1/storage at 3
Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info] /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0) online
Jul 31 08:16:46 unknown scsi: [ID 193665 kern.info] sd17 at scsa2usb0: target 0 lun 0
Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info] sd17 is /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0
Jul 31 08:16:46 unknown genunix: [ID 340201 kern.warning] WARNING: Page83 data not standards compliant Maxtor OneTouch 0125
Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info] /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17) online
Jul 31 08:16:49 unknown pcplusmp: [ID 444295 kern.info] pcplusmp: ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3
Jul 31 08:16:49 unknown scsi: [ID 193665 kern.info] sd14 at marvell88sx1: target 7 lun 0
Jul 31 08:16:49 unknown genunix: [ID 936769 kern.info] sd14 is /pci at 1,0/pci1022,7458 at 2/pci11ab,11ab at 1/disk at 7,0
Jul 31 08:16:49 unknown genunix: [ID 408114 kern.info] /pci at 1,0/pci1022,7458 at 2/pci11ab,11ab at 1/disk at 7,0 (sd14) online
Jul 31 08:21:35 unknown usba: [ID 691482 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0): Disconnected device was busy, please reconnect.
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:21:38 unknown Command failed to complete...Device is gone
Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:24:26 unknown Command failed to complete...Device is gone
Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:24:26 unknown Command failed to complete...Device is gone
Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):
Jul 31 08:24:26 unknown drive offline
Jul 31 08:27:43 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 08:39:43 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 08:44:50 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 08:44:58 unknown last message repeated 3 times
Jul 31 08:45:06 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 08:45:06 unknown last message repeated 1 time
Jul 31 08:51:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 09:03:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 09:13:51 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 09:14:09 unknown last message repeated 5 times
Jul 31 09:15:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 09:27:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 09:27:55 unknown pcplusmp: [ID 444295 kern.info] pcplusmp: ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3

cfgadm reports that the port is empty but still configured:

# cfgadm
Ap_Id Type Receptacle Occupant Condition
usb1/3 unknown empty configured unusable

4. 9:32am I now tried writing more data to the pool, to see if I can trigger the I/O error you are seeing. I tried making a second copy of the files on the USB drive in the Solaris File manager, but that attempt simply hung the copy dialog. I'm still seeing nothing else that appears relevant in /var/adm/messages.

5. 10:08am While checking free space, I found that although df works, "df -kh" hangs, apparently when it tries to query any zfs pool:

# df
/ (/dev/dsk/c1t0d0s0 ): 2504586 blocks 656867 files
/devices (/devices ): 0 blocks 0 files
/dev (/dev ): 0 blocks 0 files
/system/contract (ctfs ): 0 blocks 2147483609 files
/proc (proc ): 0 blocks 29902 files
/etc/mnttab (mnttab ): 0 blocks 0 files
/etc/svc/volatile (swap ): 9850928 blocks 1180374 files
/system/object (objfs ): 0 blocks 2147483409 files
/etc/dfs/sharetab (sharefs ): 0 blocks 2147483646 files
/lib/libc.so.1 (/usr/lib/libc/libc_hwcap2.so.1): 2504586 blocks 656867 files
/dev/fd (fd ): 0 blocks 0 files
/tmp (swap ): 9850928 blocks 1180374 files
/var/run (swap ): 9850928 blocks 1180374 files
/export/home (/dev/dsk/c1t0d0s7 ):881398942 blocks 53621232 files
/rc-pool (rc-pool ):4344346098 blocks 4344346098 files
/rc-pool/admin (rc-pool/admin ):4344346098 blocks 4344346098 files
/rc-pool/ross-home (rc-pool/ross-home ):4344346098 blocks 4344346098 files
/rc-pool/vmware (rc-pool/vmware ):4344346098 blocks 4344346098 files
/rc-usb (rc-usb ):153725153 blocks 153725153 files
# df -kh
Filesystem size used avail capacity Mounted on
/dev/dsk/c1t0d0s0 7.2G 6.0G 1.1G 85% /
/devices 0K 0K 0K 0% /devices
/dev 0K 0K 0K 0% /dev
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 4.7G 1.1M 4.7G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
sharefs 0K 0K 0K 0% /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap2.so.1 7.2G 6.0G 1.1G 85% /lib/libc.so.1
fd 0K 0K 0K 0% /dev/fd
swap 4.7G 48K 4.7G 1% /tmp
swap 4.7G 76K 4.7G 1% /var/run
/dev/dsk/c1t0d0s7 425G 4.8G 416G 2% /export/home

6. 10:35am It's now been two hours, neither "zpool status" nor "zfs list" have ever finished. The file copy attempt has also been hung for over an hour (although that's not unexpected with 'wait' as the failmode). Richard, you say ZFS is not silently failing, well for me it appears that it is. I can't see any warnings from ZFS, I can't get any status information. I see no way that I could find out what files are going to be lost on this server. Yes, I'm now aware that the pool has hung since file operations are hanging, however had that been my first indication of a problem I believe I am now left in a position where I cannot find out either the cause, nor the files affected. I don't believe I have any way to find out which operations had completed without error, but are not currently committed to disk. I certainly don't get the status message you do saying permanent errors have been found in files. I plugged the USB drive back in now, Solaris detected it ok, but ZFS is still hung. 
The rest of /var/adm/messages is: Jul 31 09:39:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 09:45:22 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignoredJul 31 09:45:38 unknown last message repeated 5 timesJul 31 09:51:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 10:03:44 unknown last message repeated 2 timesJul 31 10:14:27 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignoredJul 31 10:14:45 unknown last message repeated 5 timesJul 31 10:15:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 10:27:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet Jul 31 10:36:25 unknown usba: [ID 691482 kern.warning] WARNING: /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0): Reinserted device is accessible again.Jul 31 10:39:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 10:45:53 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignoredJul 31 10:46:09 unknown last message repeated 5 timesJul 31 10:51:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet 7. 10:55am Gave up on ZFS ever recovering. A shutdown attempt hung as expected. I hard-reset the computer. Ross> Date: Wed, 30 Jul 2008 11:17:08 -0700> From: Richard.Elling at Sun.COM> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed> To: myxiplx at hotmail.com> CC: zfs-discuss at opensolaris.org> > I was able to reproduce this in b93, but might have a different> interpretation of the conditions. More below...> > Ross Smith wrote:> > A little more information today. I had a feeling that ZFS would > > continue quite some time before giving an error, and today I''ve shown > > that you can carry on working with the filesystem for at least half an > > hour with the disk removed.> > > > I suspect on a system with little load you could carry on working for > > several hours without any indication that there is a problem. It > > looks to me like ZFS is caching reads & writes, and that provided > > requests can be fulfilled from the cache, it doesn''t care whether the > > disk is present or not.> > In my USB-flash-disk-sudden-removal-while-writing-big-file-test,> 1. I/O to the missing device stopped (as I expected)> 2. FMA kicked in, as expected.> 3. /var/adm/messages recorded "Command failed to complete... device gone."> 4. After exactly 9 minutes, 17,951 e-reports had been processed and the> diagnosis was complete. FMA logged the following to /var/adm/messages> > Jul 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING: > /pci at 0,0/pci1458,5004 at b,1/storage at 8/disk at 0,0 (sd1):> Jul 30 10:33:44 grond Command failed to complete...Device is gone> Jul 30 10:42:31 grond fmd: [ID 441519 daemon.error] SUNW-MSG-ID: > ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major> Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008> Jul 30 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond> Jul 30 10:42:31 grond SOURCE: zfs-diagnosis, REV: 1.0> Jul 30 10:42:31 grond EVENT-ID: d99769aa-28e8-cf16-d181-945592130525> Jul 30 10:42:31 grond DESC: The number of I/O errors associated with a > ZFS device exceeded> Jul 30 10:42:31 grond acceptable levels. 
Refer to > http://sun.com/msg/ZFS-8000-FD for more information.> Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been offlined and > marked as faulted. An attempt> Jul 30 10:42:31 grond will be made to activate a hot spare if > available.> Jul 30 10:42:31 grond IMPACT: Fault tolerance of the pool may be > compromised.> Jul 30 10:42:31 grond REC-ACTION: Run ''zpool status -x'' and replace > the bad device.> > The above URL shows what you expect, but more (and better) info> is available from zpool status -xv> > pool: rmtestpool> state: UNAVAIL> status: One or more devices are faultd in response to IO failures.> action: Make sure the affected devices are connected, then run ''zpool > clear''.> see: http://www.sun.com/msg/ZFS-8000-HC> scrub: none requested> config:> > NAME STATE READ WRITE CKSUM> rmtestpool UNAVAIL 0 15.7K 0 insufficient replicas> c2t0d0p0 FAULTED 0 15.7K 0 experienced I/O failures> > errors: Permanent errors have been detected in the following files:> > /rmtestpool/random.data> > > If you surf to http://www.sun.com/msg/ZFS-8000-HC you''ll> see words to the effect that,> The pool has experienced I/O failures. Since the ZFS pool property> ''failmode'' is set to ''wait'', all I/Os (reads and writes) are> blocked. See the zpool(1M) manpage for more information on the> ''failmode'' property. Manual intervention is required for I/Os to> be serviced.> > > > > I would guess that ZFS is attempting to write to the disk in the > > background, and that this is silently failing.> > It is clearly not silently failing.> > However, the default failmode property is set to "wait" which will patiently> wait forever. If you would rather have the I/O fail, then you should change> the failmode to "continue" I would not normally recommend a failmode of> "panic"> > Now to figure out how to recover gracefully... zpool clear isn''t happy...> > [sidebar]> while performing this experiment, I noticed that fmd was checkpointing> the diagnosis engine to disk in the /var/fm/fmd/ckpt/zfs-diagnosis > directory.> If this had been the boot disk, with failmode=wait, I''m not convinced> that we''d get a complete diagnosis... I''ll explore that later.> [/sidebar]> > -- richard>_________________________________________________________________ The John Lewis Clearance - save up to 50% with FREE delivery http://clk.atdmt.com/UKM/go/101719806/direct/01/ -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080731/3dac936f/attachment.html>
Andrew Hisgen
2008-Aug-01 13:36 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Question embedded below... Richard Elling wrote: ...> If you surf to http://www.sun.com/msg/ZFS-8000-HC you''ll > see words to the effect that, > The pool has experienced I/O failures. Since the ZFS pool property > ''failmode'' is set to ''wait'', all I/Os (reads and writes) are > blocked. See the zpool(1M) manpage for more information on the > ''failmode'' property. Manual intervention is required for I/Os to > be serviced. > >> >> I would guess that ZFS is attempting to write to the disk in the >> background, and that this is silently failing. > > It is clearly not silently failing. > > However, the default failmode property is set to "wait" which will patiently > wait forever. If you would rather have the I/O fail, then you should change > the failmode to "continue" I would not normally recommend a failmode of > "panic"Hi Richard, Does failmode==wait cause ZFS itself to retry i/o, that is, to retry an i/o where an earlier request (of that same i/o) returned from the driver with an error? If so, that will compound timeouts even further. I''m also confused by your statement that wait means wait forever, given that the actual circumstances here are that zfs (and the rest of the i/o stack) returned after 9 minutes. thanks, Andy
Richard Elling
2008-Aug-01 15:59 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Hi Andy, answer & pointer below... Andrew Hisgen wrote:> Question embedded below... > > Richard Elling wrote: > ... >> If you surf to http://www.sun.com/msg/ZFS-8000-HC you''ll >> see words to the effect that, >> The pool has experienced I/O failures. Since the ZFS pool property >> ''failmode'' is set to ''wait'', all I/Os (reads and writes) are >> blocked. See the zpool(1M) manpage for more information on the >> ''failmode'' property. Manual intervention is required for I/Os to >> be serviced. >> >>> >>> I would guess that ZFS is attempting to write to the disk in the >>> background, and that this is silently failing. >> >> It is clearly not silently failing. >> >> However, the default failmode property is set to "wait" which will >> patiently >> wait forever. If you would rather have the I/O fail, then you should >> change >> the failmode to "continue" I would not normally recommend a failmode of >> "panic" > > Hi Richard, > > Does failmode==wait cause ZFS itself to retry i/o, that is, to retry an > i/o where an earlier request (of that same i/o) returned from the driver > with an error? If so, that will compound timeouts even further. > > I''m also confused by your statement that wait means wait forever, given > that the actual circumstances here are that zfs (and the rest of the > i/o stack) returned after 9 minutes.The details are in PSARC/2007/567. Externally available at: http://www.opensolaris.org/os/community/arc/caselog/2007/567/ With failmode=wait, I/Os will wait until "manual intervention" which is shown as an administrator running zpool clear on the affected pool. I see the need for a document to help people work through these cases as they can be complex at many different levels. -- richard
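Tying Richard's "manual intervention" to the cfgadm step from earlier in the thread, the recovery path for a pool stuck with failmode=wait looks roughly like this (port and pool names are the ones from Ross's tests):

# cfgadm -c configure sata1/7
# zpool clear test
# zpool status -xv test

This is a sketch of the documented path rather than a guarantee; as Ross's results and CR 6667199 suggest, the zpool clear itself can hang on a non-redundant pool on these builds.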
Miles Nordin
2008-Aug-05 01:16 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>> "pf" == Paul Fisher <pfisher at alertlogic.net> writes:re> I was able to reproduce this in b93, but might have a re> different interpretation You weren''t able to reproduce the hang of ''zpool status''? Your ''zpool status'' was after the FMA fault kicked in, though. How about before FMA decided to mark the pool faulted---did ''zpool status'' hang, or work? If it worked, what did it report? The ''zpool status'' hanging happens for me on b71 when an iSCSI target goes away. (IIRC ''iscsiadm remove discovery-address ...'' unwedges zpool status for me, but my notes could be more careful.) re> However, the default failmode property is set to "wait" which re> will patiently wait forever. If you would rather have the I/O re> fail, then you should change the failmode to "continue" for him, it sounds like it''s not doing either. I think he does not have the failmode property, since it is so new? It sounds like ''continue'' should return I/O errors sooner than 9 minutes after the unredundant disks generate them (but not at all for degraded redundant pools of course). And it sounds like ''wait'' should block the writing program, forever if necessary, like an NFS hard mount. (1) Is the latter what ''wait'' actually did for you? Or did the writing process get I/O errors after the 9-minutes-later FMA diagnosis? (2) is it like NFS ''hard'' or is it like ''hard,intr''? :) It''s great to see these things improving. pf> Wow! Who knew that 17, 951 was the magic number... Seriously, pf> this does seem like an "excessive amount of certainty". I agree it''s an awfully forgiving constant, so big that it sounds like it might not be a constant manually set to 16384 or something, but rather an accident. I''m surprised to find FMA is responsible for deciding the length of this 9-minute (or more, for Ross) delay. note that, if the false positives one is trying to filter out are things like USB/SAN cabling spasms and drive recalibrations, the right metric is time, not number of failed CDB''s. The hugely-delayed response may be a blessing in disguise though, because arranging for the differnet FMA states to each last tens of minutes means it''s possible to evaluate the system''s behavior in each state, to see if it''s correct. For example, within this 9-minute window: * what does ''zpool status'' say before the FMA faulting * what do applications experience, ex., + is it possible to get an I/O error during this window with failmode=wait? how about with failmode=continue? + are reads and writes that block interruptible or uninterruptible? + What about fsync()? o what about fsync() if there is a slog? * is the system stable or are there ``lazy panic'''' cases? + what if you ``ask for it'''' by calling ''zpool clear'' or ''zpool scrub'' within the 9-minute window? * are other pools that don''t include failed devices affected (for reading/writing. but, also, if ''zpool status'' is frozen for all pools, then other pools are affected.) * probably other stuff... God willing some day some of the states can be shortened to values more like 1 second or 1 minute, or really aggressive variance-and-average-based threshholds like TCP timers, so that FMA is actually useful rather than a step backwards from SVM as it seems to me right now. The NetApp paper Richard posted earlier was saying NetApp never waits the 30 seconds for an ATAPI error, they just ignore the disk if it doesn''t answer within 1000ms or so. 
But my crappy Linux iSCSI targets would probably miss 1000ms timeouts all the time just because they're heavily loaded---you could get pools that go FAULTED whenever they get heavy use. So some of FMA's states maybe should be short, but they're harder to observe when they're so short.

The point of FMA, AIUI, is to make the failure state machine really complicated. We want it complicated to deal with both NetApp's good example of aggressive timers and also with my crappy Linux IET setup, so that increasingly hairy rules can be written with experience. Complicated means that observing each state is important to verify the complicated system's correctness. And observing means the states can't be 1 second long even if that's the appropriate length. But I don't know if that's really the developers' intent, or just my dreaming and hoping.
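For anyone who wants to poke at those states themselves, a rough sketch of the commands involved -- the pool name "rmtestpool" and the 09:30 timestamp are only placeholders, and this assumes a build recent enough to have the failmode property at all:

# zpool get failmode rmtestpool
    (shows the current setting; "wait" is the default)
# zpool set failmode=continue rmtestpool
    (ask for EIO back to applications rather than blocking forever)
# fmstat -m zfs-diagnosis 5
    (watch the zfs-diagnosis engine working through ereports, sampled every 5 seconds)
# fmdump -e -t 09:30
    (list the error reports logged since 09:30, i.e. since the drive was pulled)
# fmdump
    (shows whether fmd has actually produced a fault diagnosis yet)

None of this is a fix, of course; it just makes the 9-minute window observable.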
Ross Smith
2008-Aug-05 14:04 UTC
[zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Ok, I think I've got to the bottom of all this now, but it took some work to figure out everything that was going on. I couldn't think of any way to sensibly write this all up in an e-mail, so I've written up my findings and they're all in the attached PDF.

The initial problem can be summarised as: ZFS can cause silent data loss if you accidentally remove a device from a pool that's in a non-redundant state. But that breaks down into several individual issues:

 - SATA hot plug is poorly supported on the Supermicro AOC-SAT2-MV8 card, which uses a Marvell 88SX6081 controller.
 - ZFS is inconsistent in its handling of SATA devices going offline.
 - FMA takes too long to diagnose a device removal, and can generate hundreds of MB of errors while doing so.
 - ZFS can continue to read and write from a pool for some considerable time after it has gone offline.
 - "zpool status" can not only hang, but can lock out other tools.
 - BUG: 6667199 "zpool clear" hangs on single drives (and probably also hangs for any pool in a non-redundant state). Probably related to BUG: 667208 "zpool status" doesn't report if there has been a problem mounting the pool.

Ross

> Date: Thu, 31 Jul 2008 09:17:46 -0700
> From: Richard.Elling at Sun.COM
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> To: myxiplx at hotmail.com
>
> Ross Smith wrote:
> > Ok, in snv_94, with a USB drive that I pulled half way through the
> > standard copy of 19k files (71MB).
> >
> > This time the copy operation paused after just 5-10MB more, and it's
> > currently sat there. FMdump doesn't have a lot to say, fmdump -e has
> > been scrolling zfs io & data messages down the screen for nearly 10
> > minutes now.
>
> OK, that is what I saw. There is a transaction group which is waiting
> to get out and it has up to 5 seconds of writes in it.
>
> There is a couple of rounds of logic going on here with the diagnosis
> and feedback to ZFS to stop trying. These things can get very complex
> to solve for the general case, but the current state seems to be
> suboptimal.
>
> >
> > # fmdump
> > TIME UUID SUNW-MSG-ID
> > Jul 25 11:27:27.2858 08faf2a3-e39f-e435-8229-d409514f8531 ZFS-8000-D3
>
> Interesting... you got 3 -D3 diagnoses and one -HC (which is what I also
> got). The -D3 is similar, but may also lead to a different zpool status -x
> result (which has yet another diagnosis).
>
> > Jul 29 16:27:56.5151 c2537861-80bb-6154-c8d2-cac9fb1674ae ZFS-8000-D3
> > Jul 30 14:11:08.8059 7e33e484-728e-4ffe-cbdc-e9d8a05e33aa ZFS-8000-HC
> > Jul 31 11:45:12.3883 d76fcc2c-acee-6b62-f70f-b770651ea5ad ZFS-8000-D3
> >
> > The fmdump -e lines are all along the lines of:
> > Jul 31 08:21:38.9999 ereport.fs.zf.io
> > Jul 31 08:21:38.9999 ereport.fs.zf.data
>
> Yes, these are error reports where ZFS hit an I/O error and that
> will stimulate a data error report, too. The correlation and analysis
> of these errors is done by FMA (actually fmd). I also noticed a
> lot of activity on the /var file system as fmd was busy checkpointing
> the zfs diagnosis.
This is probably redundant, redundant also.> > > > > I plugged the USB disk in again, /var/adm/messages says:> > > > Jul 31 16:45:06 unknown usba: [ID 691482 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0): Disconnected device > > was busy, please reconnect.> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone> > Jul 31 16:45:07 unknown scsi: [ID 107833 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > Jul 31 16:45:07 unknown Command failed to complete...Device is gone> > Jul 31 16:45:17 unknown smbd[516]: [ID 766186 daemon.error] > > NbtDatagramDecode[11]: too small packet> > Jul 31 16:47:17 unknown last message repeated 1 time> > Jul 31 16:49:06 unknown /sbin/dhcpagent[100]: [ID 732317 > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory > > lease option, ignored> > Jul 31 16:49:22 unknown last message repeated 5 times> > Jul 31 16:49:54 unknown usba: [ID 691482 kern.warning] WARNING: > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0): Reinserted device is > > accessible again.> >> > After a few minutes (at 16:51), fmdump -e changed from the above lines to:> > fmdump: warning: skipping record: log file corruption detected> > > > Checking /var/adm/messages now gives:> > Jul 31 16:50:17 unknown smbd[516]: [ID 766186 daemon.error] > > NbtDatagramDecode[11]: too small packet> > Jul 31 16:50:50 unknown fmd: [ID 441519 daemon.error] SUNW-MSG-ID: > > ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major> > Jul 31 16:50:50 unknown EVENT-TIME: Thu Jul 31 16:50:49 BST 2008> > Jul 31 16:50:50 unknown PLATFORM: H8DM3-2, CSN: 1234567890, HOSTNAME: > > unknown> > Jul 31 16:50:50 unknown SOURCE: zfs-diagnosis, REV: 1.0> > Jul 31 16:50:50 unknown EVENT-ID: 3a12b357-2d61-491f-e8ab-9247ebcea342> > Jul 31 16:50:50 unknown DESC: The number of I/O errors associated with > > a ZFS device exceeded> > Jul 31 16:50:50 unknown acceptable levels. Refer to > > http://sun.com/msg/ZFS-8000-FD for more information.> > Jul 31 16:50:50 unknown AUTO-RESPONSE: The device has been offlined > > and marked as faulted. 
An attempt> > Jul 31 16:50:50 unknown will be made to activate a hot spare if > > available.> > Jul 31 16:50:50 unknown IMPACT: Fault tolerance of the pool may be > > compromised.> > Jul 31 16:50:50 unknown REC-ACTION: Run ''zpool status -x'' and replace > > the bad device.> > > > Which looks pretty similar to what you saw. zpool status still > > appears to hang though.> > Yes. The hang is due to the failmode property. A process waiting on I/O> in UNIX will not receive any signals until it wakes from the wait... which> won''t happen because the failmode=wait. I''m going to try another test> with failmode=continue and see what happens.> > FWIW, there is considerable debate about whether failmode=wait or> continue is the best default. wait works like the default for NFS, which> works like most PC-like operating systems. For highly available systems,> we''d actually rather ''get off the pot'' than ''sh*t'' so we tend to prefer> panic, with a compromise on continue.> > > > > Running fmdump again, I now have this line at the bottom:> > > > TIME UUID SUNW-MSG-ID> > Jul 31 16:50:49.9906 3a12b357-2d61-491f-e8ab-9247ebcea342 ZFS-8000-FD> >> > This is the first time I''ve ever seen that FMD message appear in > > /var/adm/messages. I wonder if it''s the zpool status hanging that''s > > causing the FMD stuff to not work? What happens if you try to > > reproduce this there and run zpool status as you remove your drive?> > Some zpool commands will wait, but I had good luck with> zpool status -x... but now that seems to be hanging too. I don''t> think zpool status should hang, ever, so this looks like a real> bug.> -- richard> > > > > > Ross> > > >> >> > > Date: Thu, 31 Jul 2008 07:42:48 -0700> > > From: Richard.Elling at Sun.COM> > > Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive > > removed> > > To: myxiplx at hotmail.com> > >> > > [off-alias, as the e-mails may get large...]> > > what does fmdump and fmdump -e say?> > > -- richard> > >> > >> > > Ross Smith wrote:> > > > I''m not sure you''re actually seeing the same problem there Richard.> > > > It seems that for you I/O is stopping on removal of the device,> > > > whereas for me I/O continues for some considerable time. You are also> > > > able to obtain a result from ''zpool status'' whereas that completely> > > > hangs for me.> > > >> > > > To illustrate the difference, this is what I saw today in snv_94, > > with> > > > a pool created from a single external USB hard drive.> > > >> > > > 1. As before I started a copy of a directory using Solaris'' file> > > > manager. About 1/3 of the way through I pulled the plug on the drive.> > > > 2. File manager continued to copy a further 30MB+ of files across.> > > > Checking the properties of the copy shows it contains 71.1MB of data> > > > and 19,160 files, despite me pulling the drive at around 8,000 files.> > > >> > > > 3. 8:24am I ran ''zpool status'':> > > > # zpool status rc-usb> > > > pool: rc-usb> > > > state: ONLINE> > > > status: One or more devices has experienced an error resulting in data> > > > corruption. Applications may be affected.> > > > action: Restore the file in question if possible. Otherwise > > restore the> > > > entire pool from backup.> > > > see: http://www.sun.com/msg/ZFS-8000-8A> > > > scrub: none requested> > > >> > > > That is as far as it gets. It never gives me any further> > > > information. I left it two hours, and it still had not displayed the> > > > status of the drive in the pool. 
I also did a ''zfs list'', that also> > > > hangs now although I''m pretty sure that if you run ''zfs list'' before> > > > ''zpool status'' it works fine.> > > >> > > > As you can see from /var/adm/messages, I am getting nothing at all> > > > from FMA:> > > > Jul 31 08:16:46 unknown usba: [ID 912658 kern.info] USB 2.0 device> > > > (usbd49,7350) operating at hi speed (USB 2.x) on USB 2.0 root hub:> > > > storage at 3 <mailto:storage at 3>, scsa2usb0 at bus address 2> > > > Jul 31 08:16:46 unknown usba: [ID 349649 kern.info] Maxtor> > > > OneTouch 2HAP70DZ> > > > Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info] scsa2usb0 is> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3> > > > Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info]> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0) online> > > > Jul 31 08:16:46 unknown scsi: [ID 193665 kern.info] sd17 at > > scsa2usb0:> > > > target 0 lun 0> > > > Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info] sd17 is> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0> > > > Jul 31 08:16:46 unknown genunix: [ID 340201 kern.warning] WARNING:> > > > Page83 data not standards compliant Maxtor OneTouch 0125> > > > Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info]> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17) online> > > > Jul 31 08:16:49 unknown pcplusmp: [ID 444295 kern.info] pcplusmp: ide> > > > (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3> > > > Jul 31 08:16:49 unknown scsi: [ID 193665 kern.info] sd14 at> > > > marvell88sx1: target 7 lun 0> > > > Jul 31 08:16:49 unknown genunix: [ID 936769 kern.info] sd14 is> > > > /pci at 1,0/pci1022,7458 at 2/pci11ab,11ab at 1/disk at 7,0> > > > Jul 31 08:16:49 unknown genunix: [ID 408114 kern.info]> > > > /pci at 1,0/pci1022,7458 at 2/pci11ab,11ab at 1/disk at 7,0 (sd14) online> > > > Jul 31 08:21:35 unknown usba: [ID 691482 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0): Disconnected device> > > > was busy, please reconnect.> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command 
failed to complete...Device is gone> > > > Jul 31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:21:38 unknown Command failed to complete...Device is gone> > > > Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:24:26 unknown Command failed to complete...Device is gone> > > > Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:24:26 unknown Command failed to complete...Device is gone> > > > Jul 31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3/disk at 0,0 (sd17):> > > > Jul 31 08:24:26 unknown drive offline> > > > Jul 31 08:27:43 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 08:39:43 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 08:44:50 unknown /sbin/dhcpagent[95]: [ID 732317> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing > > mandatory> > > > lease option, ignored> > > > Jul 31 08:44:58 unknown last message repeated 3 times> > > > Jul 31 08:45:06 unknown /sbin/dhcpagent[95]: [ID 732317> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing > > mandatory> > > > lease option, ignored> > > > Jul 31 08:45:06 unknown last message repeated 1 time> > > > Jul 31 08:51:44 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 09:03:44 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 09:13:51 unknown /sbin/dhcpagent[95]: [ID 732317> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing > > mandatory> > > > lease option, ignored> > > > Jul 31 09:14:09 unknown last message repeated 5 times> > > > Jul 31 09:15:44 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 09:27:44 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 09:27:55 unknown pcplusmp: [ID 444295 kern.info] pcplusmp: ide> > > > (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3> > > >> > > > cfgadm reports that the port is empty but still configured:> > > > # cfgadm> > > > Ap_Id Type Receptacle Occupant> > > > Condition> > > > usb1/3 unknown empty configured> > > > unusable> > > >> > > > 4. 9:32am I now tried writing more data to the pool, to see if I can> > > > trigger the I/O error you are seeing. I tried making a second copy of> > > > the files on the USB drive in the Solaris File manager, but that> > > > attempt simply hung the copy dialog. I''m still seeing nothing else> > > > that appears relevant in /var/adm/messages.> > > >> > > > 5. 
10:08am While checking free space, I found that although df works,> > > > ''df -kh'' hangs, apparently when it tries to query any zfs pool:> > > > # df> > > > / (/dev/dsk/c1t0d0s0 ): 2504586 blocks 656867 files> > > > /devices (/devices ): 0 blocks 0 files> > > > /dev (/dev ): 0 blocks 0 files> > > > /system/contract (ctfs ): 0 blocks 2147483609 files> > > > /proc (proc ): 0 blocks 29902 files> > > > /etc/mnttab (mnttab ): 0 blocks 0 files> > > > /etc/svc/volatile (swap ): 9850928 blocks 1180374 files> > > > /system/object (objfs ): 0 blocks 2147483409 files> > > > /etc/dfs/sharetab (sharefs ): 0 blocks 2147483646 files> > > > /lib/libc.so.1 (/usr/lib/libc/libc_hwcap2.so.1): 2504586 blocks> > > > 656867 files> > > > /dev/fd (fd ): 0 blocks 0 files> > > > /tmp (swap ): 9850928 blocks 1180374 files> > > > /var/run (swap ): 9850928 blocks 1180374 files> > > > /export/home (/dev/dsk/c1t0d0s7 ):881398942 blocks 53621232 files> > > > /rc-pool (rc-pool ):4344346098 blocks 4344346098 files> > > > /rc-pool/admin (rc-pool/admin ):4344346098 blocks 4344346098 files> > > > /rc-pool/ross-home (rc-pool/ross-home ):4344346098 blocks > > 4344346098 files> > > > /rc-pool/vmware (rc-pool/vmware ):4344346098 blocks 4344346098 files> > > > /rc-usb (rc-usb ):153725153 blocks 153725153 files> > > > # df -kh> > > > Filesystem size used avail capacity Mounted on> > > > /dev/dsk/c1t0d0s0 7.2G 6.0G 1.1G 85% /> > > > /devices 0K 0K 0K 0% /devices> > > > /dev 0K 0K 0K 0% /dev> > > > ctfs 0K 0K 0K 0% /system/contract> > > > proc 0K 0K 0K 0% /proc> > > > mnttab 0K 0K 0K 0% /etc/mnttab> > > > swap 4.7G 1.1M 4.7G 1% /etc/svc/volatile> > > > objfs 0K 0K 0K 0% /system/object> > > > sharefs 0K 0K 0K 0% /etc/dfs/sharetab> > > > /usr/lib/libc/libc_hwcap2.so.1> > > > 7.2G 6.0G 1.1G 85% /lib/libc.so.1> > > > fd 0K 0K 0K 0% /dev/fd> > > > swap 4.7G 48K 4.7G 1% /tmp> > > > swap 4.7G 76K 4.7G 1% /var/run> > > > /dev/dsk/c1t0d0s7 425G 4.8G 416G 2% /export/home> > > >> > > > 6. 10:35am It''s now been two hours, neither ''zpool status'' nor ''zfs> > > > list'' have ever finished. The file copy attempt has also been hung> > > > for over an hour (although that''s not unexpected with ''wait'' as the> > > > failmode).> > > >> > > > Richard, you say ZFS is not silently failing, well for me it appears> > > > that it is. I can''t see any warnings from ZFS, I can''t get any status> > > > information. I see no way that I could find out what files are going> > > > to be lost on this server.> > > >> > > > Yes, I''m now aware that the pool has hung since file operations are> > > > hanging, however had that been my first indication of a problem I> > > > believe I am now left in a position where I cannot find out either > > the> > > > cause, nor the files affected. I don''t believe I have any way to find> > > > out which operations had completed without error, but are not> > > > currently committed to disk. I certainly don''t get the status message> > > > you do saying permanent errors have been found in files.> > > >> > > > I plugged the USB drive back in now, Solaris detected it ok, but ZFS> > > > is still hung. 
The rest of /var/adm/messages is:> > > > Jul 31 09:39:44 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 09:45:22 unknown /sbin/dhcpagent[95]: [ID 732317> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing > > mandatory> > > > lease option, ignored> > > > Jul 31 09:45:38 unknown last message repeated 5 times> > > > Jul 31 09:51:44 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 10:03:44 unknown last message repeated 2 times> > > > Jul 31 10:14:27 unknown /sbin/dhcpagent[95]: [ID 732317> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing > > mandatory> > > > lease option, ignored> > > > Jul 31 10:14:45 unknown last message repeated 5 times> > > > Jul 31 10:15:44 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 10:27:45 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 10:36:25 unknown usba: [ID 691482 kern.warning] WARNING:> > > > /pci at 0,0/pci15d9,a011 at 2,1/storage at 3 (scsa2usb0): Reinserted device is> > > > accessible again.> > > > Jul 31 10:39:45 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > > Jul 31 10:45:53 unknown /sbin/dhcpagent[95]: [ID 732317> > > > daemon.warning] accept_v4_acknak: ACK packet on nge0 missing > > mandatory> > > > lease option, ignored> > > > Jul 31 10:46:09 unknown last message repeated 5 times> > > > Jul 31 10:51:45 unknown smbd[603]: [ID 766186 daemon.error]> > > > NbtDatagramDecode[11]: too small packet> > > >> > > > 7. 10:55am Gave up on ZFS ever recovering. A shutdown attempt hung> > > > as expected. I hard-reset the computer.> > > >> > > > Ross> > > >> > > >> > > >> > > >> > > > > Date: Wed, 30 Jul 2008 11:17:08 -0700> > > > > From: Richard.Elling at Sun.COM> > > > > Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive> > > > removed> > > > > To: myxiplx at hotmail.com> > > > > CC: zfs-discuss at opensolaris.org> > > > >> > > > > I was able to reproduce this in b93, but might have a different> > > > > interpretation of the conditions. More below...> > > > >> > > > > Ross Smith wrote:> > > > > > A little more information today. I had a feeling that ZFS would> > > > > > continue quite some time before giving an error, and today > > I''ve shown> > > > > > that you can carry on working with the filesystem for at least> > > > half an> > > > > > hour with the disk removed.> > > > > >> > > > > > I suspect on a system with little load you could carry on > > working for> > > > > > several hours without any indication that there is a problem. It> > > > > > looks to me like ZFS is caching reads & writes, and that provided> > > > > > requests can be fulfilled from the cache, it doesn''t care > > whether the> > > > > > disk is present or not.> > > > >> > > > > In my USB-flash-disk-sudden-removal-while-writing-big-file-test,> > > > > 1. I/O to the missing device stopped (as I expected)> > > > > 2. FMA kicked in, as expected.> > > > > 3. /var/adm/messages recorded ''Command failed to complete... device> > > > gone.''> > > > > 4. After exactly 9 minutes, 17,951 e-reports had been processed > > and the> > > > > diagnosis was complete. 
FMA logged the following to > > /var/adm/messages> > > > >> > > > > Jul 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING:> > > > > /pci at 0,0/pci1458,5004 at b,1/storage at 8/disk at 0,0 (sd1):> > > > > Jul 30 10:33:44 grond Command failed to complete...Device is gone> > > > > Jul 30 10:42:31 grond fmd: [ID 441519 daemon.error] SUNW-MSG-ID:> > > > > ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major> > > > > Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008> > > > > Jul 30 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond> > > > > Jul 30 10:42:31 grond SOURCE: zfs-diagnosis, REV: 1.0> > > > > Jul 30 10:42:31 grond EVENT-ID: d99769aa-28e8-cf16-d181-945592130525> > > > > Jul 30 10:42:31 grond DESC: The number of I/O errors associated > > with a> > > > > ZFS device exceeded> > > > > Jul 30 10:42:31 grond acceptable levels. Refer to> > > > > http://sun.com/msg/ZFS-8000-FD for more information.> > > > > Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been > > offlined and> > > > > marked as faulted. An attempt> > > > > Jul 30 10:42:31 grond will be made to activate a hot spare if> > > > > available.> > > > > Jul 30 10:42:31 grond IMPACT: Fault tolerance of the pool may be> > > > > compromised.> > > > > Jul 30 10:42:31 grond REC-ACTION: Run ''zpool status -x'' and replace> > > > > the bad device.> > > > >> > > > > The above URL shows what you expect, but more (and better) info> > > > > is available from zpool status -xv> > > > >> > > > > pool: rmtestpool> > > > > state: UNAVAIL> > > > > status: One or more devices are faultd in response to IO failures.> > > > > action: Make sure the affected devices are connected, then run > > ''zpool> > > > > clear''.> > > > > see: http://www.sun.com/msg/ZFS-8000-HC> > > > > scrub: none requested> > > > > config:> > > > >> > > > > NAME STATE READ WRITE CKSUM> > > > > rmtestpool UNAVAIL 0 15.7K 0 insufficient replicas> > > > > c2t0d0p0 FAULTED 0 15.7K 0 experienced I/O failures> > > > >> > > > > errors: Permanent errors have been detected in the following files:> > > > >> > > > > /rmtestpool/random.data> > > > >> > > > >> > > > > If you surf to http://www.sun.com/msg/ZFS-8000-HC you''ll> > > > > see words to the effect that,> > > > > The pool has experienced I/O failures. Since the ZFS pool property> > > > > ''failmode'' is set to ''wait'', all I/Os (reads and writes) are> > > > > blocked. See the zpool(1M) manpage for more information on the> > > > > ''failmode'' property. Manual intervention is required for I/Os to> > > > > be serviced.> > > > >> > > > > >> > > > > > I would guess that ZFS is attempting to write to the disk in the> > > > > > background, and that this is silently failing.> > > > >> > > > > It is clearly not silently failing.> > > > >> > > > > However, the default failmode property is set to ''wait'' which will> > > > patiently> > > > > wait forever. If you would rather have the I/O fail, then you > > should> > > > change> > > > > the failmode to ''continue'' I would not normally recommend a > > failmode of> > > > > ''panic''> > > > >> > > > > Now to figure out how to recover gracefully... zpool clear isn''t> > > > happy...> > > > >> > > > > [sidebar]> > > > > while performing this experiment, I noticed that fmd was > > checkpointing> > > > > the diagnosis engine to disk in the /var/fm/fmd/ckpt/zfs-diagnosis> > > > > directory.> > > > > If this had been the boot disk, with failmode=wait, I''m not > > convinced> > > > > that we''d get a complete diagnosis... 
I''ll explore that later.> > > > > [/sidebar]> > > > >> > > > > -- richard
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Problems with ZFS + SATA hot plug.pdf
Type: application/pdf
Size: 84624 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080805/55e5b972/attachment.pdf>
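For reference, a sketch of what a clean removal looks like, as opposed to simply pulling the drive -- the pool, device and attachment-point names below ("rc-pool", "c2t7d0", "sata1/7") are only placeholders and need to be adjusted to match the "zpool status" and "cfgadm" output on the machine in question:

# zpool offline rc-pool c2t7d0
    (tell ZFS to stop using the disk first; only possible if the pool has sufficient redundancy)
# cfgadm -c unconfigure sata1/7
    (detach the disk from the OS so it is safe to remove)
-- physically remove the disk, insert the replacement --
# cfgadm -c configure sata1/7
# zpool replace rc-pool c2t7d0
    (or "zpool online rc-pool c2t7d0" if the original disk is going back in)

None of that helps with the surprise-removal case this thread is about, but it at least avoids tripping over the bugs listed above.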