Todd H. Poole
2008-Aug-24 04:06 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
Howdy yall,

Earlier this month I downloaded and installed the latest copy of OpenSolaris (2008.05) so that I could test out some of the newer features I've heard so much about, primarily ZFS.

My goal was to replace our aging Linux-based (SuSE 10.1) file and media server with a new machine running Sun's OpenSolaris and ZFS. Our old server ran your typical RAID5 setup with 4 500GB disks (3 data, 1 parity), used lvm, mdadm, and xfs to help keep things in order, and relied on NFS to export users' shares. It was solid, stable, and worked wonderfully well.

I would like to replicate this experience using the tools OpenSolaris has to offer, taking advantage of ZFS. However, there are enough differences between the two OSes - especially with respect to the filesystems and (for lack of a better phrase) "RAID managers" - to cause me to consult (on numerous occasions) the likes of Google, these forums, and other places for help. I've been successful in troubleshooting all problems up until now.

On our old media server (the SuSE 10.1 one), when a disk failed, the machine would send out an e-mail detailing the type of failure, and gracefully fall into a degraded state, but would otherwise continue to operate using the remaining 3 disks in the system. After the faulty disk was replaced, all of the data from the old disk would be replicated onto the new one (I think the term is "resilvered" around here?), and after a few hours, the RAID5 array would be seamlessly promoted from "degraded" back up to a healthy "clean" (or "online") state.

Throughout the entire process, there would be no interruptions to the end user: all NFS shares remained mounted, there were no noticeable drops in I/O, files, directories, and any other user-created data remained available, and if everything went smoothly, no one would notice a failure had even occurred.

I've tried my best to recreate something similar in OpenSolaris, but I'm stuck on making it all happen seamlessly.

For example, I have a standard beige box machine running OS 2008.05 with a zpool that contains 4 disks, similar to what the old SuSE 10.1 server had. However, whenever I unplug the SATA cable from one of the drives (to simulate a catastrophic drive failure) while doing moderate reading from the zpool (such as streaming HD video), not only does the video hang on the remote machine (which is accessing the zpool via NFS), but the server running OpenSolaris seems to either hang or become incredibly unresponsive.

And when I write unresponsive, I mean that when I type the command "zpool status" to see what's going on, the command hangs, followed by a frozen Terminal a few seconds later. After just a few more seconds, the entire GUI - mouse included - locks up or freezes, and all NFS shares become unavailable from the perspective of the remote machines. The whole machine locks up hard.

The machine then stays in this frozen state until I plug the hard disk back in, at which point everything, quite literally, pops back into existence all at once: the output of the "zpool status" command flies by (with all disks listed as "ONLINE" and all "READ," "WRITE," and "CKSUM" fields listed as "0"), the mouse jumps to a different part of the screen, the NFS share becomes available again, and the movie resumes right where it had left off.

While such a quick resume is encouraging, I'd like to avoid the freeze in the first place. How can I keep hardware failures like the above transparent to my users?
-Todd

PS: I've done some researching, and while my problem is similar to the following:

http://opensolaris.org/jive/thread.jspa?messageID=151719
http://opensolaris.org/jive/thread.jspa?messageID=240481

most of these posts are quite old, and do not offer any solutions.

PPS: I know I haven't provided any details on hardware, but I feel like this is more likely a higher-level issue (like some sort of configuration file or setting is needed) rather than a lower-level one (like faulty hardware). However, if someone were to give me a command to run, I'd gladly do it... I'm just not sure which ones would be helpful, or if I even know which ones to run. It took me half an hour of searching just to find out how to list the disks installed in this system (it's "format") so that I could build my zpool in the first place. It's not quite as simple as writing out /dev/hda, /dev/hdb, /dev/hdc, /dev/hdd. ;)
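For readers following along, a pool and NFS share like the ones described above can be set up with something along these lines. The pool name "tank", the filesystem name "tank/media", and the cNtNdN device names are only placeholders; "format" lists the real device names on a given system:

# format
(note the cNtNdN names of the four data disks, then quit without making changes)
# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
# zfs create tank/media
# zfs set sharenfs=on tank/media
# zpool status tank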
Tim
2008-Aug-24 04:13 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
On Sat, Aug 23, 2008 at 11:06 PM, Todd H. Poole <toddhpoole at gmail.com> wrote:
> [...]

It's a lower level one. What hardware are you running?
Hmm... I'm leaning away a bit from the hardware, but just in case you've got an idea, the machine is as follows:

CPU: AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model ADH4850DOBOX (http://www.newegg.com/Product/Product.aspx?Item=N82E16819103255)

Motherboard: GIGABYTE GA-MA770-DS3 AM2+/AM2 AMD 770 ATX All Solid Capacitor AMD Motherboard (http://www.newegg.com/Product/Product.aspx?Item=N82E16813128081)

RAM: G.SKILL 4GB (2 x 2GB) 240-Pin DDR2 SDRAM DDR2 800 (PC2 6400) Dual Channel Kit Desktop Memory Model F2-6400CL5D-4GBPQ (http://www.newegg.com/Product/Product.aspx?Item=N82E16820231122)

HDD (x4): Western Digital Caviar GP WD10EACS 1TB 5400 to 7200 RPM SATA 3.0Gb/s Hard Drive (http://www.newegg.com/Product/Product.aspx?Item=N82E16822136151)

The reason why I don't think there's a hardware issue is because before I got OpenSolaris up and running, I had a fully functional install of openSuSE 11.0 running (with everything similar to the original server) to make sure that none of the components were damaged during shipping from Newegg. Everything worked as expected.

Furthermore, before making my purchases, I made sure to check the HCL, and my processor and motherboard combination are supported: http://www.sun.com/bigadmin/hcl/data/systems/details/3079.html

But, like I said earlier, I'm new here, so you might be on to something that never occurred to me. Any ideas?
On Sat, Aug 23, 2008 at 11:41 PM, Todd H. Poole <toddhpoole at gmail.com> wrote:
> [...]

What are you using to connect the HDs to the system? The onboard ports? What driver is being used? AHCI, or IDE compatibility mode?

I'm not saying the hardware is bad, I'm saying the hardware is most likely the cause by way of driver. There really isn't any *setting* in Solaris I'm aware of that says "hey, freeze my system when a drive dies". That just sounds like hot-swap isn't working as it should be.

--Tim
Ross
2008-Aug-24 08:04 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
You're seeing exactly the same behaviour I found on my server, using a Supermicro AOC-SAT2-MV8 SATA controller. It's detailed on the forums under the topic "Supermicro AOC-SAT2-MV8 hang when drive removed", but unfortunately that topic split into 3 or 4 pieces, so it's a pain to find. I also reported it as a bug here: http://bugs.opensolaris.org/view_bug.do?bug_id=6735931
Ah, yes - all four hard drives are connected to the motherboard's onboard SATA II ports. There is one additional drive I have neglected to mention thus far (the boot drive) but that is connected via the motherboard's IDE channel, and has remained untouched since the install... I don't really consider it part of the problem, but I thought I should mention it just in case... you never know...

As for the drivers... well, I'm not sure of the command to determine that directly, but going under System > Administration > Device Driver Utility yields the following information under the "Storage" entry:

Components: "ATI Technologies Inc. SB600 IDE"
Driver: pci-ide
--Driver Information--
Driver: pci-ide
Instance: 1
Attach Status: Attached
--Hardware Information--
Vendor ID: 0x1002
Device ID: 0x438c
Class Code: 0001018a
DevPath: /pci@0,0/pci-ide@14,1

and

Components: "ATI Technologies Inc. SB600 Non-Raid-5 SATA"
Driver: pci-ide
--Driver Information--
Driver: pci-ide
Instance: 0
Attach Status: Attached
--Hardware Information--
Vendor ID: 0x1002
Device ID: 0x4380
Class Code: 0001018f
DevPath: /pci@0,0/pci-ide@12

Furthermore, there is one Driver Problem detected, but the error is under the "USB" entry. There are seven items listed:

Components: ATI Technologies Inc. SB600 USB Controller (EHCI)
Driver: ehci

Components: ATI Technologies Inc. SB600 USB (OHCI4)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI3)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI2)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI1)
Driver: ohci (Driver Misconfigured)

Components: ATI Technologies Inc. SB600 USB (OHCI0)
Driver: ohci

Components: Microsoft Corp. Wheel Mouse Optical
Driver: hid

As you can tell, the OHCI1 device isn't properly configured, but I don't know how to configure it (there's only a "Help", "Submit...", and "Close" button to click, no "Install Driver"). And, to tell you the truth, I'm not even sure it's worth mentioning because I don't have anything but my mouse plugged into USB, and even so... it's a mouse... plugged into USB... hardly something that is going to bring my machine to a grinding halt every time a SATA II disk gets yanked from a RAID-Z array (at least, I should hope the two don't have anything in common!).

And... wait... you mean to tell me that I can't just untick the checkbox that says "Hey, freeze my system when a drive dies" to solve this problem? Ugh. And here I was hoping for a quick fix... ;)

Anyway, how does the above sound? What else can I give you?

-Todd

PS: Thanks, by the way, for the support - I'm not sure where else to turn to for this kind of stuff!
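In case it helps, the same driver binding can usually be confirmed from a terminal; a minimal check, assuming only the stock prtconf tool (the grep patterns here are just examples):

# prtconf -D | grep -i pci-ide
(lists the device nodes currently bound to the pci-ide driver)
# prtconf -D | grep -i ahci
(no output here would suggest the ahci driver is not attached to anything)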
Ross
2008-Aug-24 08:31 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
PS. Does your system definitely support SATA hot swap? Could you for example test it under Windows to see if it runs fine there? I suspect this is a Solaris driver problem, but it would be good to have confirmation that the hardware handles this fine.
Todd H. Poole
2008-Aug-24 09:17 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
Hmm... You know, that's a good question. I'm not sure if those SATA II ports support hot swap or not. The motherboard is fairly new, but taking a look at the specifications provided by Gigabyte (http://www.gigabyte.com.tw/Products/Motherboard/Products_Spec.aspx?ProductID=2874) doesn't seem to yield anything. To tell you the truth, I think they're just plain ol' dumb SATA II ports - nothing fancy here.

But that's alright, because hot swap isn't something I'm necessarily chasing after. It would be nice, of course, but the thing that we want the most is stability during hardware failures. For this particular server, it is _far_ more important for the thing to keep chugging along and blow right through as many hardware failures as it can. If it's still got 3 of those 4 drives (which implies at least 2 data and 1 parity, or 3 data and no parity) then I still want to be able to read and write to those NFS exports like nothing happened.

Then, at the end of the day, if we need to bring the machine down in order to install a new disk and resilver the RAID-Z array, that is perfectly acceptable. We could do that around 6:00 or so when everyone goes home for the day and when it's much more convenient for us and the users, and let the resilvering/repairing operation run overnight.

I also read the PDF summary you included in your link to your other post. And it seems we're seeing similar behavior here. Although, in this case, things are even simpler: there are only 4 drives in the case (not 8), and there is no extra controller card (just the ports on the motherboard)... It's hard to get any more basic than that.

As for testing in other OSes, unfortunately I don't readily have a copy of Windows available. But even if I did, I wouldn't know where to begin: almost all of my experience in server administration has been with Linux. For what it's worth, I have already established the above (that is, the seamless experience) with OpenSuSE 11.0 as the operating system, LVM as the volume manager, mdadm as the RAID manager, and XFS as the filesystem, so I know it can work... I just want to get it working with OpenSolaris and ZFS. :)
James C. McPherson
2008-Aug-24 12:30 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Todd H. Poole wrote:
> Hmm... I'm leaning away a bit from the hardware, but just in case you've
> got an idea, the machine is as follows:
>
> CPU: AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model
> ADH4850DOBOX
> (http://www.newegg.com/Product/Product.aspx?Item=N82E16819103255)
>
> Motherboard: GIGABYTE GA-MA770-DS3 AM2+/AM2 AMD 770 ATX All Solid
> Capacitor AMD Motherboard
> (http://www.newegg.com/Product/Product.aspx?Item=N82E16813128081)
..
> The reason why I don't think there's a hardware issue is because before I
> got OpenSolaris up and running, I had a fully functional install of
> openSuSE 11.0 running (with everything similar to the original server) to
> make sure that none of the components were damaged during shipping from
> Newegg. Everything worked as expected.

Yes, but you're running a new operating system, new filesystem... that's a mountain of difference right in front of you.

A few commands that you could provide the output from include:

(these two show any FMA-related telemetry)
fmadm faulty
fmdump -v

(this shows your storage controllers and what's connected to them)
cfgadm -lav

You'll also find messages in /var/adm/messages which might prove useful to review.

Apart from that, your description of what you're doing to simulate failure is "however, whenever I unplug the SATA cable from one of the drives (to simulate a catastrophic drive failure) while doing moderate reading from the zpool (such as streaming HD video), not only does the video hang on the remote machine (which is accessing the zpool via NFS), but the server running OpenSolaris seems to either hang, or become incredibly unresponsive."

First and foremost, for me, this is a stupid thing to do. You've got common-or-garden PC hardware which almost *definitely* does not support hot plug of devices. Which is what you're telling us that you're doing. Would you try this with your pci/pci-e cards in this system? I think not.

If you absolutely must do something like this, then please use what's known as "coordinated hotswap" using the cfgadm(1m) command. Viz:

(detect fault in disk c2t3d0, in some way)
# cfgadm -c unconfigure c2::dsk/c2t3d0
# cfgadm -c disconnect c2::dsk/c2t3d0
(go and swap the drive, plug in new drive with same cable)
# zpool replace -f poolname c2t3d0

What this will do is tell the kernel to do things in the right order, and - for zpool - tell it to do an in-place replacement of device c2t3d0 in your pool.

There are manpages and admin guides you could have a look through, too:

http://docs.sun.com/app/docs/coll/40.17 (manpages)
http://docs.sun.com/app/docs/coll/47.23 (system admin collection)
http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide
http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guide

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
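To round that sequence out, bringing the replacement drive back under a coordinated hotswap would look roughly like the sketch below. The c2::dsk/c2t3d0 attachment point and the pool name are carried over from the example above and will likely differ on a real system (cfgadm -lav shows the actual attachment points):

# cfgadm -c connect c2::dsk/c2t3d0
# cfgadm -c configure c2::dsk/c2t3d0
# zpool replace -f poolname c2t3d0
# zpool status poolname
(watch the resilver progress until the pool reports ONLINE again)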
I'm pretty sure pci-ide doesn't support hot-swap. I believe you need ahci.

On 8/24/08, Todd H. Poole <toddhpoole at gmail.com> wrote:
> [...]
Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean, like I said in my response to myxiplx, if I have to bring down the machine in order to replace a faulty drive, that's perfectly acceptable - I can do that whenever it's most convenient for me.

What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_) is if the machine hangs/freezes/locks up or is otherwise brought down by an isolated failure in a supposedly redundant array... Yanking the drive is just how I chose to simulate that failure. I could just as easily have decided to take a sledgehammer or power drill to it,

http://www.youtube.com/watch?v=CN6iDzesEs0 (fast-forward to the 2:30 part)
http://www.youtube.com/watch?v=naKd9nARAes

and the machine shouldn't have skipped a beat. After all, that's the whole point behind the "redundant" part of RAID, no?

And besides, RAID's been around for almost 20 years now... It's nothing new. I've seen (countless times, mind you) plenty of regular old IDE drives fail in a simple software RAID5 array and not bring the machine down at all. Granted, you still had to power down to re-insert a new one (unless you were using some fancy controller card), but the point remains: the machine would still work perfectly with only 3 out of 4 drives present... So I know for a fact this type of stability can be achieved with IDE.

What I'm getting at is this: I don't think the method by which the drives are connected - or whether or not that method supports hot-swap - should matter. A machine _should_not_ crash when a single drive (out of a 4 drive ZFS RAID-Z array) is ungracefully removed, regardless of how abruptly that drive is excised (be it by a slow failure of the drive motor's spindle, by yanking the drive's power cable, by yanking the drive's SATA connector, by smashing it to bits with a sledgehammer, or by drilling into it with a power drill).

So we've established that one potential workaround is to use the ahci instead of the pci-ide driver. Good! I like this kind of problem solving! But that's still side-stepping the problem... While this machine is entirely SATA II, what about those who have a mix between SATA and IDE? Or even much larger entities whose vast majority of hardware is only a couple of years old, and still entirely IDE?

I'm grateful for your help, but is there another way that you can think of to get this to work?
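For anyone who does want to try the ahci route, the usual recipe - with the caveat that BIOS menu names vary by vendor and this is only a sketch - is to switch the onboard SATA controller from IDE/compatibility mode to AHCI in the BIOS setup, reboot, and then confirm which driver attached:

# prtconf -D | grep -i ahci
(the SB600 SATA controller should now show the ahci driver instead of pci-ide)
# cfgadm -al
(with the ahci driver and SATA framework in use, the SATA ports should appear as attachment points such as sata0/0, sata0/1, and so on)

Device paths can change when the controller mode changes, so exporting the pool before the switch and importing it afterwards (zpool export / zpool import) is a sensible precaution.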
James C. McPherson
2008-Aug-24 20:52 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Tim wrote:
> I'm pretty sure pci-ide doesn't support hot-swap. I believe you need ahci.

You're correct, it doesn't. Furthermore, to the best of my knowledge, it won't ever support hotswap.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
James C. McPherson
2008-Aug-24 21:28 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Todd H. Poole wrote:
> Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean, like I said in my response to myxiplx, if I have to bring down the machine in order to replace a faulty drive, that's perfectly acceptable - I can do that whenever it's most convenient for me.
>
> What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_) is if the machine hangs/freezes/locks up or is otherwise brought down by an isolated failure in a supposedly redundant array... Yanking the drive is just how I chose to simulate that failure. I could just as easily have decided to take a sledgehammer or power drill to it,

But you're not attempting hotswap, you're doing hot plug.... and unless you're using the onboard BIOS' concept of an actual RAID array, you don't have an array, you've got a JBOD and it's not a real JBOD - it's a PC motherboard which does _not_ have the same electronic and electrical protections that a JBOD has *by design*.

> http://www.youtube.com/watch?v=CN6iDzesEs0 (fast-forward to the 2:30 part)
> http://www.youtube.com/watch?v=naKd9nARAes
>
> and the machine shouldn't have skipped a beat. After all, that's the whole point behind the "redundant" part of RAID, no?

Sigh.

> And besides, RAID's been around for almost 20 years now... It's nothing new. I've seen (countless times, mind you) plenty of regular old IDE drives fail in a simple software RAID5 array and not bring the machine down at all. Granted, you still had to power down to re-insert a new one (unless you were using some fancy controller card), but the point remains: the machine would still work perfectly with only 3 out of 4 drives present... So I know for a fact this type of stability can be achieved with IDE.

And you're right, it can. But what you've been doing is outside the bounds of what IDE hardware on a PC motherboard is designed to cope with.

> What I'm getting at is this: I don't think the method by which the drives are connected - or whether or not that method supports hot-swap - should matter.

Well sorry, it does. Welcome to an OS which does care.

> A machine _should_not_ crash when a single drive (out of a 4 drive ZFS RAID-Z array) is ungracefully removed, regardless of how abruptly that drive is excised (be it by a slow failure of the drive motor's spindle, by yanking the drive's power cable, by yanking the drive's SATA connector, by smashing it to bits with a sledgehammer, or by drilling into it with a power drill).

If the controlling electronics for your disk can't handle it, then you're hosed. That's why FC, SATA (in SATA mode) and SAS are much more likely to handle this out of the box. Parallel SCSI requires funky hardware, which is why those old 6- or 12-disk multipacks are so useful to have.

Of the failure modes that you suggest above, only one is going to give you anything other than catastrophic failure (drive motor degradation) - and that is because the drive's electronics will realise this, and send warnings to the host.... which should have its drivers written so that these messages are logged for the sysadmin to act upon.

The other failure modes are what we call catastrophic. And where your hardware isn't designed with certain protections around drive connections, you're hosed. No two ways about it. If your system suffers that sort of failure, would you seriously expect that non-hardened hardware would survive it?

> So we've established that one potential workaround is to use the ahci instead of the pci-ide driver. Good! I like this kind of problem solving! But that's still side-stepping the problem... While this machine is entirely SATA II, what about those who have a mix between SATA and IDE? Or even much larger entities whose vast majority of hardware is only a couple of years old, and still entirely IDE?

If you've got newer hardware, which can support SATA in native SATA mode, USE IT. Don't _ever_ try that sort of thing with IDE. As I mentioned above, IDE is not designed to be able to cope with what you've been inflicting on this machine.

> I'm grateful for your help, but is there another way that you can think of to get this to work?

You could start by taking us seriously when we tell you that what you've been doing is not a good idea, and find other ways to simulate drive failures.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
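For what it's worth, there are gentler ways to exercise the degraded-pool path than pulling cables; one sketch, assuming a raidz pool named tank containing a member disk c2t3d0, is to take the device offline administratively and bring it back afterwards:

# zpool offline -t tank c2t3d0
(the -t flag makes the offline temporary, i.e. it does not persist across a reboot)
# zpool status tank
(the pool should now report DEGRADED while continuing to serve data)
# zpool online tank c2t3d0
# zpool status tank
(any writes made while the disk was offline are resilvered onto it)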
> But you're not attempting hotswap, you're doing hot plug....

Do you mean hot UNplug? Because I'm not trying to get this thing to recognize any new disks without a restart... Honest. I'm just trying to prevent the machine from freezing up when a drive fails. I have no problem restarting the machine with a new drive in it later so that it recognizes the new disk.

> and unless you're using the onboard BIOS' concept of an actual
> RAID array, you don't have an array, you've got a JBOD and
> it's not a real JBOD - it's a PC motherboard which does _not_
> have the same electronic and electrical protections that a
> JBOD has *by design*.

I'm confused by what your definition of a RAID array is, and for that matter, what a JBOD is... I've got plenty of experience with both, but just to make sure I wasn't off my rocker, I consulted the demigod:

http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/JBOD

and I think what I'm doing is indeed RAID... I'm not using some sort of controller card, or any specialized hardware, so it's certainly not hardware RAID (and thus doesn't contain any of the fancy electronic or electrical protections you mentioned), but lacking said protections doesn't preclude the machine from being considered a RAID. All the disks are the same capacity, the OS still sees the zpool I've created as one large volume, and since I'm using RAID-Z (RAID5), it should be redundant... What other qualifiers out there are necessary before a system can be called RAID compliant?

If it's hot-swappable technology, or a controller hiding the details from the OS and instead presenting a single volume, then I would argue those things are extra - not a fundamental prerequisite for a system to be called a RAID.

Furthermore, while I'm not sure what the difference between a "real JBOD" and a plain old JBOD is, this set-up certainly wouldn't qualify for either. I mean, there is no concatenation going on, redundancy should be present (but due to this issue, I haven't been able to verify that yet), and all the drives are the same size... Am I missing something in the definition of a JBOD? I don't think so...

> And you're right, it can. But what you've been doing is outside
> the bounds of what IDE hardware on a PC motherboard is designed
> to cope with.

Well, yes, you're right, but it's not like I'm making some sort of radical departure outside of the bounds of the hardware... It really shouldn't be a problem so long as it's not an unreasonable departure, because that's where software comes in. When the hardware can't cut it, that's where software picks up the slack.

Now, obviously, I'm not saying software can do anything with any piece of hardware you give it - no matter how many lines of code you write, your keyboard isn't going to turn into a speaker - but when it comes to reasonable stuff like ensuring a machine doesn't crash because a user did something with the hardware that he or she wasn't supposed to do? Prime target for software.

And that's the way it's always been... The whole push behind that whole ZFS promise thing (or if you want to make it less specific, the attractiveness of RAID in general) was that "RAID-Z [wouldn't] require any special hardware. It doesn't need NVRAM for correctness, and it doesn't need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap, commodity disks." (http://blogs.sun.com/bonwick/entry/raid_z)

> Well sorry, it does. Welcome to an OS which does care.

The half-hearted apology wasn't necessary... I understand that OpenSolaris cares about the method those disks use to plug into the motherboard, but what I don't understand is why that limitation exists in the first place. It would seem much better to me to have an OS that doesn't care (but developers that do) and just finds a way to work, versus one that does care (but developers that don't) and instead isn't as flexible and gets picky... I'm not saying OpenSolaris is the latter, but I'm not getting the impression it's the former either...

> If the controlling electronics for your disk can't
> handle it, then you're hosed. That's why FC, SATA (in SATA
> mode) and SAS are much more likely to handle this out of
> the box. Parallel SCSI requires funky hardware, which is why
> those old 6- or 12-disk multipacks are so useful to have.
>
> Of the failure modes that you suggest above, only one
> is going to give you anything other than catastrophic
> failure (drive motor degradation) - and that is because the
> drive's electronics will realise this, and send warnings to
> the host.... which should have its drivers written so
> that these messages are logged for the sysadmin to act upon.
>
> The other failure modes are what we call catastrophic. And
> where your hardware isn't designed with certain protections
> around drive connections, you're hosed. No two ways
> about it. If your system suffers that sort of failure, would
> you seriously expect that non-hardened hardware would survive it?

Yes, I would. At the risk of sounding repetitive, I'll summarize what I've been getting at in my previous responses: I certainly _do_ think it's reasonable to expect non-hardened hardware to survive this type of failure. In fact, I think it's unreasonable _not_ to expect it to. The Linux kernel, the BSD kernels, and the NT kernel (or whatever chunk of code runs Windows) all provide this type of functionality, and have done so for some time. Granted, they may all do it in different ways, but at the end of the day, unplugging an IDE hard drive from a software RAID5 array in OpenSuSE or RedHat, FreeBSD, or Windows XP Professional will not bring the machine down. And it shouldn't in OpenSolaris either. There might be some sort of noticeable bump (Windows, for example, pauses for a few seconds while it tries to figure out what the hell just happened to one of its disks), but there isn't anything show-stopping...

> If you've got newer hardware, which can support SATA
> in native SATA mode, USE IT.

I'll see what I can do - this might be some sort of BIOS setting that can be configured.

>> I'm grateful for your help, but is there another way that you can think
>> of to get this to work?
> You could start by taking us seriously when we tell
> you that what you've been doing is not a good idea, and
> find other ways to simulate drive failures.

Let's drop the confrontational attitude - I'm not trying to dick around with you here. I've done my due diligence in researching this issue on Google, these forums, and Sun's documentation before making a post, I've provided any clarifying information that has been requested by those kind enough to post a response, and I've yet to resort to any witty or curt remarks in my correspondence with you, tcook, or myxiplx. Whatever is causing you to think I'm not taking anyone seriously, let me reassure you, I am.

The only thing I'm doing is testing a system by applying the worst case scenario of survivable torture to it and seeing how it recovers. If that's not a good idea, then I guess we disagree. But that's ok - you're James C. McPherson, Senior Kernel Software Engineer, Solaris, and I'm just some user who's trying to find a solution to his problem. My bad for expecting the same level of respect I've given two other members of this community to be returned in kind by one of its leaders.

So aside from telling me to "[never] try this sort of thing with IDE", does anyone else have any other ideas on how to prevent OpenSolaris from locking up whenever an IDE drive is abruptly disconnected from a ZFS RAID-Z array?

-Todd
Todd H. Poole wrote:
> [...]

I'm far from being an expert on this subject, but this is what I understand: Unplugging a drive (actually pulling the cable out) does not simulate a drive failure, it simulates a drive getting unplugged, which is something the hardware is not capable of dealing with. If your drive were to suffer something more realistic, along the lines of how you would normally expect a drive to die, then the system should cope with it a whole lot better.

Unfortunately, hard drives don't come with a big button saying "simulate head crash now" or "make me some bad sectors", so it's going to be difficult to simulate those failures. All I can say is that unplugging a drive yourself will not simulate a failure, it merely causes the disk to disappear. Dying or dead disks will still normally be able to communicate with the driver to some extent, so they are still "there".

If you were using dedicated hotswappable hardware, then I wouldn't expect to see the problem, but AFAIK off-the-shelf SATA hardware doesn't support this fully, so unexpected results will occur.

I hope this has been of some small help, even just to explain why the system didn't cope as you expected.

Matt
John Sonnenschein
2008-Aug-25 03:19 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
James isn't being a jerk because he hates you or anything...

Look, yanking the drives like that can seriously damage the drives or your motherboard. Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in Linux than anything else.
Aye mate, I had the exact same problem, but where I work, we pay some pretty serious dollars for a direct 24/7 line to some of Sun's engineers, so I decided to call them up. After spending some time with tech support, I never really got the thing resolved, and I instead ended up going back to Debian for all of our simple IDE-based file servers.

If you really just want ZFS, you can add it to whatever installation you've got now (openSuSE?) through something like zfs-fuse, but you might take a 10-15% performance hit. If you don't want that, and you're not too concerned with violating a few licenses, you can just add it to your installation yourself, the source code is out there. You know, roll your own. ;-) You just might be trying too hard to force a round peg into a square hole.

Hey, besides, where do you work? I registered because I know a guy with the same name.
On Mon, Aug 25, 2008 at 5:19 AM, John Sonnenschein <johnsonnenschein at gmail.com> wrote:
> James isn't being a jerk because he hates you or anything...
>
> Look, yanking the drives like that can seriously damage the drives or your motherboard.

It can, but it's not very likely to.

> Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in Linux than anything else.

That sounds more like defensiveness. Pulling out the cable isn't advisable, but it simulates the controller card on the disk going belly up pretty well. Unless he pulls the power at the same time, because that would also simulate a power failure.

If a piece of hardware stops responding you might do well to stop talking to it, but there is nothing admirable about locking up the OS if there is enough redundancy to continue without that particular chunk of metal.

--
Peter Bortas
Howdy Matt. Just to make it absolutely clear, I appreciate your response. I would be quite lost if it weren't for all of the input.

> Unplugging a drive (actually pulling the cable out) does not simulate a
> drive failure, it simulates a drive getting unplugged, which is
> something the hardware is not capable of dealing with.
>
> If your drive were to suffer something more realistic, along the lines
> of how you would normally expect a drive to die, then the system should
> cope with it a whole lot better.

Hmmm... I see what you're saying. But, ok, let me play devil's advocate. What about the times when a drive fails in a way the system didn't expect? What you said was right - most of the time, when a hard drive goes bad, SMART will pick up on its impending doom long before it's too late - but what about the times when the cause of the problem is larger or more abrupt than that (like tin whiskers causing shorts, or a server room technician yanking the wrong drive)?

To imply that OpenSolaris with a RAID-Z array of IDE drives will _only_ protect me from data loss during _specific_ kinds of failures (the ones which OpenSolaris considers "normal") is a pretty big implication... and is certainly a show-stopping one at that. Nobody is going to want to rely on an OS/RAID solution that can only survive certain types of drive failures while there are others out there that can survive the same and more... But then again, I'm not sure if that's what you meant... is that what you were getting at, or did I misunderstand?

> Unfortunately, hard drives don't come with a big button saying "simulate
> head crash now" or "make me some bad sectors" so it's going to be
> difficult to simulate those failures.

lol, if only they did - just having a button to push would make testing these types of things a lot easier. ;)

> All I can say is that unplugging a drive yourself will not simulate a
> failure, it merely causes the disk to disappear.

But isn't that a perfect example of a failure!? One in which the drive just seems to pop out of existence? lol, forgive me if I'm sounding pedantic, but why is there even a distinction between the two? This is starting to sound more and more like a bug...

> I hope this has been of some small help, even just to
> explain why the system didn't cope as you expected.

It has, thank you - I appreciate the response.
Howdy Matt, thanks for the response. But I dunno man... I think I disagree... I'm kinda of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along? Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds.

Which is exactly what Linux, BSD, and even Windows _don't_ do, and why their continual operation even under such failures wouldn't be considered a bug. When I yank a drive in a RAID5 array - any drive, be it IDE, SATA, USB, or Firewire - in OpenSuSE or RedHat, the kernel will immediately notice its absence, and inform lvm and mdadm (the software responsible for keeping the RAID array together). mdadm will then degrade the array, and consult whatever instructions root gave it when the sysadmin was configuring the array. If the sysadmin wanted the array to "stay up as long as it could," then it would continue to do that. If root wanted the array to be "brought down after any sort of drive failure," then the array would be unmounted. If root wanted to "power the machine down," then the machine will dutifully turn off. Shouldn't OpenSolaris do the same thing?

And as for James not being a jerk because he hates me, does that mean he's just always like that? lol, it's alright: let's not try to explain or excuse trollish behavior, and instead just call it out and expose it for what it is, and then be done with it. I certainly am.

Anyways, thanks for the input Matt.
Howdy 404, thanks for the response.

But I dunno man... I think I disagree... I'm kind of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along? Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds. Going catatonic is exactly what Linux, BSD, and even Windows _don't_ do, and that's why their continued operation under such failures wouldn't be considered a bug.

When I yank a drive in a RAID5 array - any drive, be it IDE, SATA, USB, or Firewire - in OpenSuSE or RedHat, the kernel will immediately notice its absence, and inform lvm and mdadm (the software responsible for keeping the RAID array together). mdadm will then degrade the array, and consult whatever instructions root gave it when the sysadmin was configuring the array. If the sysadmin wanted the array to "stay up as long as it could," then it would continue to do that. If root wanted the array to be "brought down after any sort of drive failure," then the array would be unmounted. If root wanted to "power the machine down," then the machine will dutifully turn off. Shouldn't OpenSolaris do the same thing?

And as for James not being a jerk because he hates me, does that mean he's just always like that? lol, it's alright: let's not try to explain or excuse trollish behavior, and instead just call it out and expose it for what it is, and then be done with it. I certainly am.

As always, thanks for the input.

This message posted from opensolaris.org
John Sonnenschein wrote:

> James isn't being a jerk because he hates you or anything...
>
> Look, yanking the drives like that can seriously damage the drives or your motherboard. Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in linux than anything else.

One point that's been overlooked in all the chest thumping - PCs vibrate and cables fall out. I had this happen with a SCSI connector. Luckily for me, it fell in a fan and made a lot of noise!

So pulling a drive is a possible, if rare, failure mode.

Ian
jalex? As in Justin Alex? If you're who I think you are, don't you have a pretty long list of things you need to get done for Jerry before your little vacation?

This message posted from opensolaris.org
alright, alright, but it's your fault. you left your workstation logged on, what was i supposed to do? not chime in?

grotty yank

This message posted from opensolaris.org
John Sonnenschein wrote:

> Look, yanking the drives like that can seriously damage the drives or
> your motherboard. Solaris doesn't let you do it and assumes that
> something's gone seriously wrong if you try it. That Linux ignores
> the behavior and lets you do it sounds more like a bug in linux than
> anything else.

OK, so far we've had a lot of knee-jerk defense of Solaris. Sorry, but that isn't helping. Let's get back to science here, shall we?

What happens when you remove a disk?

A) The driver detects the removal and informs the OS. Solaris appears to behave reasonably well in this case.

B) The driver does not detect the removal. Commands must time out before a problem is detected. Due to driver layering, timeouts increase rapidly, causing the OS to "hang" for unreasonable periods of time.

We really need to fix (B). It seems the "easy" fixes are:

- Configure faster timeouts and fewer retries on redundant devices, similar to drive manufacturers' RAID edition firmware. This could be via driver config file, or (better) automatically via ZFS, similar to write cache behaviour.

- Propagate timeouts quickly between layers (immediate soft fail without retry) or perhaps just to the fault management system

-- Carson
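To make "via driver config file" a bit more concrete: on systems where the disks attach through sd(7d), a per-device retry override can be sketched roughly as below. This is only an illustration - the vendor/product string is a placeholder, the name:value tunable syntax is only accepted by newer sd(7d) builds, and it would not apply to the original poster's setup, where the disks sit behind pci-ide/cmdk rather than sd.

  # /kernel/drv/sd.conf -- illustrative entry only; match the string to your own drives
  # cut the retry count used when commands time out, so a dead disk is faulted sooner
  sd-config-list = "ATA     WDC WD5000AAKS", "retries-timeout:1";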
Todd H. Poole wrote:

> Hmmm... I see what you're saying. But, ok, let me play devil's advocate. What about the times when a drive fails in a way the system didn't expect? What you said was right - most of the time, when a hard drive goes bad, SMART will pick up on its impending doom long before it's too late - but what about the times when the cause of the problem is larger or more abrupt than that (like tin whiskers causing shorts, or a server room technician yanking the wrong drive)?
>
> To imply that OpenSolaris with a RAID-Z array of IDE drives will _only_ protect me from data loss during _specific_ kinds of failures (the ones which OpenSolaris considers "normal") is a pretty big implication... and is certainly a show-stopping one at that. Nobody is going to want to rely on an OS/RAID solution that can only survive certain types of drive failures, while there are others out there that can survive the same and more...
>
> But then again, I'm not sure if that's what you meant... is that what you were getting at, or did I misunderstand?

I think there's a misunderstanding concerning underlying concepts. I'll try to explain my thoughts, please excuse me in case this becomes a bit lengthy. Oh, and I am not a Sun employee or ZFS fan, I'm just a customer who loves and hates ZFS at the same time ;-)

You know, ZFS is designed for high *reliability*. This means that ZFS tries to keep your data as safe as possible. This includes faulty hardware, missing hardware (like in your testing scenario) and, to a certain degree, even human mistakes.

But there are limits. For instance, ZFS does not make a backup unnecessary. If there's a fire and your drives melt, then ZFS can't do anything. Or if the hardware is lying about the drive geometry. ZFS is part of the operating environment and, as a consequence, relies on the hardware. So ZFS can't make unreliable hardware reliable. All it can do is try to protect the data you saved on it. But it cannot guarantee this to you if the hardware becomes its enemy.

A real world example: I have a 32 core Opteron server here, with 4 FibreChannel controllers and 4 JBODs with a total of 64 FC drives connected to it, running a RAID 10 using ZFS mirrors. Sounds a lot like high end hardware compared to your NFS server, right? But ... I have exactly the same symptom. If one drive fails, an entire JBOD with all 16 included drives hangs, and all zpool access freezes. The reason for this is the miserable JBOD hardware. There's only one FC loop inside of it, the drives are connected serially to each other, and if one drive dies, the drives behind it go downhill, too. ZFS immediately starts caring about the data, the zpool command hangs (but I still have traffic on the other half of the ZFS mirror!), and it does the right thing by doing so: whatever happens, my data must not be damaged.

A "bad" filesystem like Linux ext2 or ext3 with LVM would just continue, whether the volume manager noticed the missing drive or not. That's what you experienced. But you run the real danger of having to use fsck at some point. Or, in my case, fsck'ing 5 TB of data on 64 drives. That's not much fun and results in a lot more downtime than replacing the faulty drive.

What can you expect from ZFS in your case? You can expect it to detect that a drive is missing and to make sure that your _data integrity_ isn't compromised. By any means necessary. This may even require making the system completely unresponsive until a timeout has passed.
But what you described is not a case of reliability. You want something completely different. You expect it to deliver *availability*. And availability is something ZFS doesn't promise. It simply can't deliver this. You have the impression that NTFS and various other filesystems do so, but that's an illusion. The next reboot followed by an fsck run will show you why.

Availability requires full reliability of every included component of your server as a minimum, and you can't expect ZFS or any other filesystem to deliver this with cheap IDE hardware. Usually people want to save money when buying hardware, and ZFS is a good choice to deliver the *reliability* then. But the conceptual stalemate between reliability and availability of such cheap hardware still exists - the hardware is cheap, the file system and services may be reliable, but as soon as you want *availability*, it's getting expensive again, because you have to buy every hardware component at least twice.

So, you have the choice:

a) If you want *availability*, stay with your old solution. But you have no guarantee that your data is always intact. You'll always be able to stream your video, but you have no guarantee that the client will receive a stream without drop outs forever.

b) If you want *data integrity*, ZFS is your best friend. But you may have slight availability issues when it comes to hardware defects. You may reduce the percentage of pain during a disaster by spending more money, e.g. by making the SATA controllers redundant and creating a mirror (then controller 1 will hang, but controller 2 will continue working), but you must not forget that your PCI bridges, fans, power supplies, etc. remain single points of failure which can take the entire service down, just like your pulling of the non-hotpluggable drive did.

c) If you want both, you should buy a second server and create an NFS cluster.

Hope I could help you a bit,

Ralf

-- 
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
Ralf Ramge wrote:
[...]

Oh, and please excuse the grammar mistakes and typos. I'm in a hurry, not a retard ;-) At least I think so.

-- 
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
Heikki Suonsivu on list forwarder
2008-Aug-25 14:36 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Justin wrote:

> Howdy Matt. Just to make it absolutely clear, I appreciate your
> response. I would be quite lost if it weren't for all of the input.
>
>> Unplugging a drive (actually pulling the cable out) does not
>> simulate a drive failure, it simulates a drive getting unplugged,
>> which is something the hardware is not capable of dealing with.
>>
>> If your drive were to suffer something more realistic, along the
>> lines of how you would normally expect a drive to die, then the
>> system should cope with it a whole lot better.
>
> Hmmm... I see what you're saying. But, ok, let me play devil's
> advocate. What about the times when a drive fails in a way the system
> didn't expect? What you said was right - most of the time, when a
> hard drive goes bad, SMART will pick up on its impending doom long
> before it's too late - but what about the times when the cause of the
> problem is larger or more abrupt than that (like tin whiskers causing
> shorts, or a server room technician yanking the wrong drive)?

I read a research paper by Google about this a while ago. Their conclusion was that SMART is a poor predictor of disk failure, even though they did find some useful indications. Google for "google disk failure"; it came out as the second link a moment ago, title "Failure Trends in a Large Disk Drive Population".

The problem is that trying to predict disk failures with SMART parameters only catches a certain percentage of failing disks, and that percentage is not all that great. Many disks will still decide to fail catastrophically, most often early morning December 25th, in particular if there is a huge snowstorm going :)

Heikki
Todd H. Poole wrote:

> Howdy 404, thanks for the response.
>
> But I dunno man... I think I disagree... I'm kind of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

The behavior of ZFS to an error reported by an underlying device driver is tunable by the zpool failmode property. By default, it is set to "wait." For root pools, the installer may change this to "continue." The key here is that you can argue with the choice of default behavior, but don't argue with the option to change.

> I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along?

If this does not occur, then please file a bug against the appropriate device driver (you're not operating in ZFS code here).

> Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds.

qv. zpool failmode property, at least when you are operating in the zfs code. I think the concerns here are that hangs can, and do, occur at other places in the software stack. Please report these in the appropriate forums and bug categories.
 -- richard
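For anyone who wants to try the property Richard mentions, it can be inspected and changed with the ordinary zpool commands; "mediapool" below is just a placeholder pool name:

  # show the current failure-mode policy of a pool
  zpool get failmode mediapool
  # switch it; accepted values are wait (the default), continue, and panic
  zpool set failmode=continue mediapool

Note that failmode only governs how ZFS reacts once the underlying driver has actually reported the device as gone; it does not shorten the driver-level timeouts discussed elsewhere in this thread.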
On Sun, 24 Aug 2008, Todd H. Poole wrote:

> So aside from telling me to "[never] try this sort of thing with
> IDE" does anyone else have any other ideas on how to prevent
> OpenSolaris from locking up whenever an IDE drive is abruptly
> disconnected from a ZFS RAID-Z array?

I think that your expectations from ZFS are reasonable. However, it is useful to determine if pulling the IDE drive locks the entire IDE channel, which serves the other disks as well. This could happen at a hardware level, or at a device driver level. If this happens, then there is nothing that ZFS can do.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, 25 Aug 2008, Carson Gaspar wrote:

> B) The driver does not detect the removal. Commands must time out before
>    a problem is detected. Due to driver layering, timeouts increase
>    rapidly, causing the OS to "hang" for unreasonable periods of time.
>
> We really need to fix (B). It seems the "easy" fixes are:
>
> - Configure faster timeouts and fewer retries on redundant devices,

I don't think that any of these "easy" fixes are wise. Any fix based on timeouts is going to cause problems with devices mysteriously timing out and being resilvered. Device drivers should know the expected behavior of the device and act appropriately. For example, if the device is in a powered-down state, then the device driver can expect that it will take at least 30 seconds for the device to return after being requested to power up, but that some weak devices might take a minute.

As far as device drivers go, I expect that IDE device drivers are at the very bottom of the feeding chain in Solaris, since Solaris is optimized for enterprise hardware. Since OpenSolaris is open source, perhaps some brave soul can investigate the issues with the IDE device driver and send a patch.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, Aug 25, 2008 at 08:17:55PM +1200, Ian Collins wrote:

> John Sonnenschein wrote:
>>
>> Look, yanking the drives like that can seriously damage the drives
>> or your motherboard. Solaris doesn't let you do it ...

Haven't seen an android/"universal soldier" shipping with Solaris ... ;-)

>> and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in linux than anything else.

Not sure whether everything that can't be understood is "likely a bug" - maybe it is "more forgiving" and tries its best to solve the problem without taking you out of business (see below), even if it requires some hacks not in line with specifications ...

> One point that's been overlooked in all the chest thumping - PCs vibrate
> and cables fall out. I had this happen with a SCSI connector. Luckily

Yes - and a colleague told me that he's had the same problem once. Also he managed a Fujitsu Siemens server where the SCSI controller card had a tiny hairline crack: very odd behavior, usually not reproducible. IIRC, the 4th service engineer finally replaced the card ...

> So pulling a drive is a possible, if rare, failure mode.

Definitely! And tolerating strange controller (or, in general, hardware) behavior is possibly a big + for an OS which targets SMEs and "home users" as well (everybody knows about far east and other cheap HW producers, which sometimes seem to say: let's ship it, later we build a special driver for MS Windows which works around the bug/problem ...).

"Similar" story: ~ 2000+ we had a WG server with 4 IDE PATA channels, one HDD on each. HDD0 on CH0 mirrored to HDD2 on CH2, HDD1 on CH1 mirrored to HDD3 on CH3, using the Linux software RAID driver. We found out that when HDD1 on CH1 got on the blink, for some reason the controller got on the blink as well, i.e. took CH0 (and vice versa) down too. After a reboot, we were able to force the md raid to re-take the drives marked bad, and even found out that the problem started when a certain part of a partition was accessed (which made the ops on that raid really slow for some minutes - but after the driver marked the drive(s) as bad, performance was back). Thus disabling the partition gave us the time to get a new drive...

During all these ops nobody (except sysadmins) realized that we had a problem - thanks to the md raid1 (with xfs btw.). And also we did not have any data corruption (at least, nobody has complained about it ;-)).

Wrt. what I've experienced and read on the ZFS-discuss list etc., I have the __feeling__ that we would have really gotten into trouble using Solaris (even the most recent one) on that system ... So if one asks me whether to run Solaris+ZFS on a production system, I usually say: definitely, but only if it is a Sun server ...

My 2 cents ;-)

Regards,
jel.

PS: And yes, all the vendor specific workarounds/hacks are a problem for Linux kernel folks as well - at least on Torvalds' side they are discouraged, IIRC ...
-- 
Otto-von-Guericke University    http://www.cs.uni-magdeburg.de/
Department of Computer Science  Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany        Tel: +49 391 67 12768
>>>>> "jcm" == James C McPherson <James.McPherson at Sun.COM> writes:
>>>>> "thp" == Todd H Poole <toddhpoole at gmail.com> writes:
>>>>> "mh" == Matt Harrison <iwasinnamuknow at genestate.com> writes:
>>>>> "js" == John Sonnenschein <johnsonnenschein at gmail.com> writes:
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>> "cg" == Carson Gaspar <carson at taltos.org> writes:

   jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
   jcm> above, IDE is not designed to be able to cope with [unplugging
   jcm> a cable]

It shouldn't have to be designed for it, if there's controller redundancy. On Linux, one drive per IDE bus (not using any "slave" drives) seems like it should be enough for any electrical issue, but is not quite good enough in my experience, when there are two PATA busses per chip. But one hard drive per chip seems to be mostly okay. In this SATA-based case, not even that much separation was necessary for Linux to survive on the same hardware, but I agree with you and haven't found that level with PATA either.

OTOH, if the IDE drivers are written such that a confusing interaction with one controller chip brings down the whole machine, then I expect the IDE drivers to do better. If they don't, why advise people to buy twice as much hardware "because, you know, controllers can also fail, so you should have some controller redundancy" - the advice is worse than a waste of money, it's snake oil - a false sense of security.

   jcm> You could start by taking us seriously when we tell you that
   jcm> what you've been doing is not a good idea, and find other ways
   jcm> to simulate drive failures.

Well, you could suggest a method. Except that the whole point of the story is, Linux, without any blather about "green-line" and "self-healing," without any concerted platform-wide effort toward availability at all, simply works more reliably.

   thp> So aside from telling me to "[never] try this sort of thing
   thp> with IDE" does anyone else have any other ideas on how to
   thp> prevent OpenSolaris from locking up whenever an IDE drive is
   thp> abruptly disconnected from a ZFS RAID-Z array?

Yeah, get a Sil3124 card, which will run in native SATA mode and be more likely to work. Then, redo your test and let us know what happens.

The not-fully-voiced suggestion to run your ATI SB600 in native/AHCI mode instead of pci-ide/compatibility mode is probably a bad one because of bug 6665032: the chip is only reliable in compatibility mode.

You could trade your ATI board for an nVidia board for about the same price as the Sil3124 add-on card. AIUI from the Linux wiki:

  http://ata.wiki.kernel.org/index.php/SATA_hardware_features

...the old nVidia chips use the nv_sata driver, and the new ones use the ahci driver, so both of these are different from pci-ide and more likely to work. Get an old one (MCP61 or older), and a new one (MCP65 or newer), repeat your test and let us know what happens.

If the Sil3124 doesn't work, and nv_sata doesn't work, and AHCI on newer-nVidia doesn't work, then hook the drives up to Linux running IET on basically any old chip, and mount them from Solaris using the built-in iSCSI initiator.

If you use iSCSI, you will find: you will get a pause like with NT. Also, if one of the iSCSI targets is down, 'zpool status' might hang _every time_ you run it, not just the first time when the failure is detected. The pool itself will only hang the first time.
Also, you cannot boot unless all iSCSI targets are available, but you can continue running if some go away after booting. Overall IMHO it's not as good as LVM2, but it's more robust than plugging the drives into Solaris. It also gives you the ability to run smartctl on the drives (by running it natively on Linux) with full support for all commands, while someone here who I told to run smartctl reported that on Solaris 'smartctl -a' worked but 'smartctl -t' did not.

I still have performance problems with iSCSI. I'm not sure yet if they're unresolvable: there are a lot of tweakables with iSCSI, like disabling Nagle's algorithm, and enabling RED on the initiator switchport, but first I need to buy faster CPUs for the targets.

   mh> Dying or dead disks will still normally be able to
   mh> communicate with the driver to some extent, so they are still
   mh> "there".

The dead disks I have which don't spin also don't respond to IDENTIFY(0), so they don't really communicate with the driver at all. Now, possibly, *possibly* they are still responsive after they fail, and become unresponsive after the first time they're rebooted - because I think they load part of their firmware off the platters. Also, the ATAPI standard says that "still communicating" drives are allowed to take up to 30sec to answer each command, which is probably too long to freeze a whole system.

And still, just because "possibly," it doesn't make sense to replace a tested-working system with a tested-broken system, not even after someone tells a complicated story trying to convince you the broken system is actually secretly working, just completely impossible to test, so you have to accept it based on stardust and fantasy.

   js> yanking the drives like that can seriously damage the
   js> drives or your motherboard.

No, it can't. And if I want a software developer's opinion on what will electrically damage my machine, I'll be sure to let you know first.

   jcm> If you absolutely must do something like this, then please use
   jcm> what's known as "coordinated hotswap" using the cfgadm(1m)
   jcm> command.
   jcm> Viz:
   jcm> (detect fault in disk c2t3d0, in some way)
   jcm> # cfgadm -c unconfigure c2::dsk/c2t3d0
   jcm> # cfgadm -c disconnect c2::dsk/c2t3d0

So... don't don't DON'T do it because it's STUPID and it might FRY YOUR DISK AND MOTHERBOARD. But, if you must do it, please warn our software first? I shouldn't have to say it, but aside from being absurd, this warning-command completely defeats the purpose of the test.

   jcm> Yes, but you're running a new operating system, new
   jcm> filesystem... that's a mountain of difference right in front
   jcm> of you.

So we do agree that Linux's not freezing in the same scenario indicates the difference is inside that mountain, which, however large, is composed entirely of SOFTWARE.

   re> The behavior of ZFS to an error reported by an underlying
   re> device driver is tunable by the zpool failmode property. By
   re> default, it is set to "wait."

I think you like speculation well enough, so long as it's optimistic. Which is the tunable setting that causes other pools, ones not even including failed devices, to freeze? Why is the failmode property involved at all in a pool that still has enough replicas to keep functioning?

   cg> We really need to fix (B). It seems the "easy" fixes are:
   cg> - Configure faster timeouts and fewer retries on redundant
   cg>   devices, similar to drive manufacturers' RAID edition
   cg>   firmware.
   cg>   This could be via driver config file, or (better)
   cg>   automatically via ZFS, similar to write cache behaviour.
   cg> - Propagate timeouts quickly between layers (immediate soft
   cg>   fail without retry) or perhaps just to the fault management
   cg>   system

It's also important that things unrelated to the failure aren't frozen. This was how I heard the "green line" marketing campaign when it was pitched to me, and I found it really compelling because I felt Linux had too little of this virtue. However compelling, I just don't find it even slightly acquainted with reality. I can understand "unrelated" is a tricky concept when the boot pool is involved, but for example when it isn't involved: I've had problems where one exported data pool's becoming FAULTED stops NFS service from all other pools. The pool that FAULTED contained no Solaris binaries. And the zpool status hangs people keep discovering.

I think this is a good test in general: configure two almost-completely independent stacks through the same kernel:

   NFS export     NFS export
   filesystem     filesystem
   pool           pool
         ZFS/NFS
   driver         driver
   controller     controller
   disks          disks

Simulate whatever you regard as a "catastrophic" or "unplanned" or "really stupid" failure, and see how big the shared region in the middle can be without affecting the other stack. Right now, my experience is even the stack above does not work. Maybe mountd gets blocked or something, I don't know.

Optimistically, we would of course like this stack below to remain failure-separate:

   NFS export     NFS export
   filesystem     filesystem
   pool           pool
         ZFS/NFS
         driver
         controller
   disks          disks

The OP is implying that, on Linux, that stack DOES keep failures separate. However, even if "hot plug" (or "hot unplug" for demanding Linux users) is not supported, at least this stack below should still be failure-independent:

   NFS export     NFS export
   filesystem     filesystem
   pool           pool
         ZFS/NFS
         driver
   controller     controller
   disks          disks

I suspect it isn't, because the less-demanding stack I started with isn't failure-independent. There is probably more than one problem making these failures spread more widely than they should, but so far we can't even agree on what we wish were working. I do think the failures need to be isolated better first, independent of time. It's not "a failure of a drive on the left should propagate up the stack faster so that the stack on the right unfreezes before anyone gets too upset." The stack on the right shouldn't freeze at all.
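For completeness, the iSCSI fallback Miles suggests above (a Linux box exporting the disks with IET, Solaris importing them with its built-in initiator) boils down to a few commands on the Solaris side. This is only a rough sketch - the target address 192.168.1.50 is a placeholder, the disk names are deliberately left as <disk1>..<disk4>, and the Linux/IET side must already be exporting the drives:

  # point the initiator at the Linux target and enable SendTargets discovery
  iscsiadm add discovery-address 192.168.1.50:3260
  iscsiadm modify discovery --sendtargets enable
  # create device nodes for the discovered LUNs, then build the pool on them
  devfsadm -i iscsi
  zpool create mediapool raidz <disk1> <disk2> <disk3> <disk4>

The actual cXtYdZ names show up in format(1m) once devfsadm has run.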
Richard Elling
2008-Aug-26 18:10 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

>>>>>> "jcm" == James C McPherson <James.McPherson at Sun.COM> writes:
>>>>>> "thp" == Todd H Poole <toddhpoole at gmail.com> writes:
>>>>>> "mh" == Matt Harrison <iwasinnamuknow at genestate.com> writes:
>>>>>> "js" == John Sonnenschein <johnsonnenschein at gmail.com> writes:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>>> "cg" == Carson Gaspar <carson at taltos.org> writes:
>
>   jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
>   jcm> above, IDE is not designed to be able to cope with [unplugging
>   jcm> a cable]
>
> It shouldn't have to be designed for it, if there's controller
> redundancy. On Linux, one drive per IDE bus (not using any "slave"
> drives) seems like it should be enough for any electrical issue, but
> is not quite good enough in my experience, when there are two PATA
> busses per chip. But one hard drive per chip seems to be mostly okay.
> In this SATA-based case, not even that much separation was necessary
> for Linux to survive on the same hardware, but I agree with you and
> haven't found that level with PATA either.
>
> OTOH, if the IDE drivers are written such that a confusing interaction
> with one controller chip brings down the whole machine, then I expect
> the IDE drivers to do better. If they don't, why advise people to buy
> twice as much hardware "because, you know, controllers can also fail,
> so you should have some controller redundancy" - the advice is worse
> than a waste of money, it's snake oil - a false sense of security.

No snake oil. Pulling cables only simulates pulling cables. If you are having difficulty with cables falling out, then this problem cannot be solved with software. It *must* be solved with hardware.

But the main problem with "simulating disk failures by pulling cables" is that the code paths executed during that test are different than those executed when the disk fails in other ways. It is not simply an issue of the success or failure of the test, but it is an issue of what you are testing.

Studies have shown that pulled cables are not the dominant failure mode in disk populations. Bairavasundaram et al. [1] showed that data checksum errors are much more common. In some internal Sun studies, we also see unrecoverable reads as the dominant disk failure mode. ZFS will do well for these errors, regardless of the underlying OS. AFAIK, none of the traditional software logical volume managers nor the popular open source file systems (other than ZFS :-) address this problem.

[1] http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
 -- richard
Miles Nordin
2008-Aug-26 18:38 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:

   re> unrecoverable read as the dominant disk failure mode. [...]
   re> none of the traditional software logical volume managers nor
   re> the popular open source file systems (other than ZFS :-)
   re> address this problem.

Other LVMs should address unrecoverable read errors as well or better than ZFS, because that's when the drive returns an error instead of data. Doing a good job with this error is mostly about not freezing the whole filesystem for the 30sec it takes the drive to report the error. Either the drives should be loaded with special firmware that returns errors earlier, or the software LVM should read redundant data and collect the statistic if the drive is well outside its usual response latency. I would expect all the software volume managers including ZFS fail to do this. It's really hard to test without somehow getting a drive that returns read errors frequently, but isn't about to die within the month - maybe ZFS should have an error injector at driver-level instead of block-level, and a model for time-based errors. One thing other LVMs seem like they may do better than ZFS, based on not-quite-the-same-scenario tests, is not freeze filesystems unrelated to the failing drive during the 30 seconds it's waiting for the I/O request to return an error.

In terms of FUD about "silent corruption", there is none of it when the drive clearly reports a sector is unreadable. Yes, traditional non-big-storage-vendor RAID5, and all software LVMs I know of except ZFS, depend on the drives to report unreadable sectors. And, generally, drives do. So let's be clear about that and not try to imply that the "dominant failure mode" causes silent corruption for everyone except ZFS and Netapp users - it doesn't.

The Netapp paper focused on when drives silently return incorrect data, which is different than returning an error. Both Netapp and ZFS do checksums to protect from this. However, Netapp never claimed this failure mode was more common than reported unrecoverable read errors, just that it was more interesting. I expect it's much *less* common.

Further, we know Netapp loaded special firmware into the enterprise drives in that study because they wanted the larger sector size. They are likely also loading special firmware into the desktop drives to make them return errors sooner than 30 seconds. So it's not improbable that the Netapp drives are more prone to deliver silently corrupt data instead of UNC/seek errors compared to off-the-shelf drives.

Finally, for the Google paper, silent corruption "didn't even make the chart." Saying something didn't make your chart and saying that it doesn't happen are two different things, and your favoured conclusion has a stake in maintaining that view, too.
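On the error-injector point: a block-level injector does ship with ZFS as an undocumented test tool, and it can be used instead of pulling cables. A rough sketch, assuming zinject is present on the build in question and using placeholder pool/device names:

  # make reads against one vdev of the pool return I/O errors
  zinject -d c1t0d0 -e io -T read mediapool
  # list the active injection handlers, then clear them when done
  zinject
  zinject -c all

Because it injects at the ZFS vdev layer rather than in the disk driver, it exercises ZFS's own error handling but not the driver timeout behaviour being debated here.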
Carson Gaspar
2008-Aug-26 18:56 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Richard Elling wrote:

> No snake oil. Pulling cables only simulates pulling cables. If you
> are having difficulty with cables falling out, then this problem cannot
> be solved with software. It *must* be solved with hardware.
>
> But the main problem with "simulating disk failures by pulling cables"
> is that the code paths executed during that test are different than those
> executed when the disk fails in other ways. It is not simply an issue
> of the success or failure of the test, but it is an issue of what you are
> testing.

All of that may be true, but it doesn't change the fact that Solaris' observed behaviour under these conditions is _abysmally_ bad, and for no good reason. It might not be a high priority to fix, but it would be nice if one of the Sun folks would at least acknowledge that something is terribly wrong here, rather than claiming it's not a problem.

-- 
Carson
> The behavior of ZFS to an error reported by an underlying device
> driver is tunable by the zpool failmode property. By default, it is
> set to "wait." For root pools, the installer may change this
> to "continue." The key here is that you can argue with the choice
> of default behavior, but don't argue with the option to change.

I didn't want to argue with the option to change... trust me. Being able to change those types of options and having that type of flexibility in the first place is what makes a very large part of my day possible.

> qv. zpool failmode property, at least when you are operating in the
> zfs code. I think the concerns here are that hangs can, and do, occur
> at other places in the software stack. Please report these in the
> appropriate forums and bug categories.
> -- richard

Now _that's_ a great constructive suggestion! Very good - I'll research this in a few hours, and report back on what I find. Thanks for the pointer!

-Todd

This message posted from opensolaris.org
> Since OpenSolaris is open source, perhaps some brave
> soul can investigate the issues with the IDE device driver and
> send a patch.

Fearing that other "Senior Kernel Engineers, Solaris" might exhibit similar responses, or join in and play "antagonize the noob," I decided that I would try to solve my problem on my own. I tried my best to unravel the source tree that is OpenSolaris with some help from a friend, but I'll be the first to admit - we didn't even know where to begin, much less understand what we were looking at. To say that he and I were lost would be an understatement.

I'm familiar with some subsections of the Linux kernel, and I can read and write code in a pinch, but there's a reason why most of my work is done for small, personal projects, or just for fun... Some people out there can see things like Neo sees the Matrix... I am not one of them.

I wish I knew how to write and then submit those types of patches. If I did, you can bet I would have been all over that days ago! :)

-Todd

This message posted from opensolaris.org
PS: I also think it's worth noting the level of supportive and constructive feedback that many others have provided, and how much I appreciate it. Thanks! Keep it coming!

This message posted from opensolaris.org
Richard Elling
2008-Aug-26 20:15 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Carson Gaspar wrote:

> Richard Elling wrote:
>
>> No snake oil. Pulling cables only simulates pulling cables. If you
>> are having difficulty with cables falling out, then this problem cannot
>> be solved with software. It *must* be solved with hardware.
>>
>> But the main problem with "simulating disk failures by pulling cables"
>> is that the code paths executed during that test are different than those
>> executed when the disk fails in other ways. It is not simply an issue
>> of the success or failure of the test, but it is an issue of what you are
>> testing.
>
> All of that may be true, but it doesn't change the fact that Solaris'
> observed behaviour under these conditions is _abysmally_ bad, and for no
> good reason.

Please file bugs. That is the best way to get things fixed. The most appropriate forum for storage driver discussions will be storage-discuss.
 -- richard
Ron Halstead
2008-Aug-26 20:45 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Todd, 3 days ago you were asked what mode the BIOS was using, AHCI or IDE compatibility. Which is it? Did you change it? What was the result? A few other posters suggested the same thing, but the thread went off into left field and I believe the question / suggestions got lost in the noise.

--ron

This message posted from opensolaris.org
> I think that your expectations from ZFS are
> reasonable. However, it is useful to determine if pulling the IDE drive locks
> the entire IDE channel, which serves the other disks as well. This
> could happen at a hardware level, or at a device driver level. If this
> happens, then there is nothing that ZFS can do.

Gotcha. But just to let you know, there are 4 SATA ports on the motherboard, with each drive getting its own port... how should I go about testing to see whether pulling one IDE drive (remember, they're really SATA drives, but they're being presented to the OS by the pci-ide driver) locks the entire IDE channel if there's only one drive per channel? Or do you think it's possible that two ports on the motherboard could be on one "logical channel" (for lack of a better phrase) while the other two are on the other, and thus we could test one drive while another on the same "logical channel" is unplugged?

Also, remember that OpenSolaris freezes when this occurs, so I'm only going to have 2-3 seconds to execute a command before Terminal and - after a few more seconds, the rest of the machine - stop responding to input...

I'm all for trying to test this, but I might need some instruction.

This message posted from opensolaris.org
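One rough way to test this, sketched with placeholder device names (c3t1d0 and friends are assumptions, not the actual devices on this box): start a continuous read from a disk you are *not* going to unplug, watch the per-device I/O counters, and then pull a different disk. If the untouched disk's I/O also stops, the whole channel - or the driver instance behind it - is blocking, not just the missing drive.

  # in one terminal: keep a disk you will NOT unplug busy
  dd if=/dev/rdsk/c3t1d0p0 of=/dev/null bs=1024k &
  # in another terminal: watch extended I/O stats plus error columns
  iostat -xne 5
  # and check how many controller instances the disks actually share
  prtconf -D | egrep -i 'ide|ata|ahci|sata'

If iostat keeps ticking for the busy disk while the pulled one racks up errors, the channels are independent; if everything stalls, the pull is taking the shared instance down with it.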
Richard Elling
2008-Aug-26 21:26 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>
>   re> unrecoverable read as the dominant disk failure mode. [...]
>   re> none of the traditional software logical volume managers nor
>   re> the popular open source file systems (other than ZFS :-)
>   re> address this problem.
>
> Other LVMs should address unrecoverable read errors as well or better
> than ZFS, because that's when the drive returns an error instead of
> data.

ZFS handles that case as well.

> Doing a good job with this error is mostly about not freezing
> the whole filesystem for the 30sec it takes the drive to report the
> error.

That is not a ZFS problem. Please file bugs in the appropriate category.

> Either the drives should be loaded with special firmware that
> returns errors earlier, or the software LVM should read redundant data
> and collect the statistic if the drive is well outside its usual
> response latency.

ZFS will handle this case as well.

> I would expect all the software volume managers
> including ZFS fail to do this. It's really hard to test without
> somehow getting a drive that returns read errors frequently, but isn't
> about to die within the month - maybe ZFS should have an error
> injector at driver-level instead of block-level, and a model for
> time-based errors.

qv. ztest. Project COMSTAR creates an opportunity for better testing in an open-source way. However, it will only work for the SCSI protocol and therefore does not provide coverage for IDE devices -- which is not a long-term issue.

> One thing other LVMs seem like they may do better
> than ZFS, based on not-quite-the-same-scenario tests, is not freeze
> filesystems unrelated to the failing drive during the 30 seconds it's
> waiting for the I/O request to return an error.

This is not operating in ZFS code.

> In terms of FUD about "silent corruption", there is none of it when
> the drive clearly reports a sector is unreadable. Yes, traditional
> non-big-storage-vendor RAID5, and all software LVMs I know of except
> ZFS, depend on the drives to report unreadable sectors. And,
> generally, drives do. So let's be clear about that and not try to imply
> that the "dominant failure mode" causes silent corruption for
> everyone except ZFS and Netapp users - it doesn't.

In my field data, the dominant failure mode for disks is unrecoverable reads. If your software does not handle this case, then you should be worried. We tend to recommend configuring ZFS to manage data redundancy for this reason.

> The Netapp paper focused on when drives silently return incorrect
> data, which is different than returning an error. Both Netapp and ZFS
> do checksums to protect from this. However, Netapp never claimed this
> failure mode was more common than reported unrecoverable read errors,
> just that it was more interesting. I expect it's much *less* common.

I would love for you to produce data to that effect.

> Further, we know Netapp loaded special firmware into the enterprise
> drives in that study because they wanted the larger sector size. They
> are likely also loading special firmware into the desktop drives to
> make them return errors sooner than 30 seconds. So it's not
> improbable that the Netapp drives are more prone to deliver silently
> corrupt data instead of UNC/seek errors compared to off-the-shelf
> drives.

I am not sure of the basis of your assertion.
Can you explain in more detail?

> Finally, for the Google paper, silent corruption "didn't even make
> the chart." Saying something didn't make your chart and saying
> that it doesn't happen are two different things, and your favoured
> conclusion has a stake in maintaining that view, too.

The Google paper [1] didn't deal with silent errors or corruption at all. Section 2 describes in nice detail how they decided when a drive had failed -- it was replaced. They also cite disk vendors who test "failed" drives and many times the drives test clean (what they call "no problem found"). This is not surprising, because it is unlikely that data corruption is detected in the systems under study.

[1] http://www.cs.cmu.edu/~bianca/fast07.pdf
 -- richard
Mattias Pantzare
2008-Aug-27 01:32 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
2008/8/26 Richard Elling <Richard.Elling at sun.com>:

>> Doing a good job with this error is mostly about not freezing
>> the whole filesystem for the 30sec it takes the drive to report the
>> error.
>
> That is not a ZFS problem. Please file bugs in the appropriate category.

Whose problem is it? It can't be the device driver, as that has no knowledge of zfs filesystems or redundancy.

>> Either the drives should be loaded with special firmware that
>> returns errors earlier, or the software LVM should read redundant data
>> and collect the statistic if the drive is well outside its usual
>> response latency.
>
> ZFS will handle this case as well.

How is ZFS handling this? Is there a timeout in ZFS?

>> One thing other LVMs seem like they may do better
>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze
>> filesystems unrelated to the failing drive during the 30 seconds it's
>> waiting for the I/O request to return an error.
>
> This is not operating in ZFS code.

In what way is freezing a ZFS filesystem not operating in ZFS code? Notice that he wrote filesystems unrelated to the failing drive.

>> In terms of FUD about "silent corruption", there is none of it when
>> the drive clearly reports a sector is unreadable. Yes, traditional
>> non-big-storage-vendor RAID5, and all software LVMs I know of except
>> ZFS, depend on the drives to report unreadable sectors. And,
>> generally, drives do. So let's be clear about that and not try to imply
>> that the "dominant failure mode" causes silent corruption for
>> everyone except ZFS and Netapp users - it doesn't.
>
> In my field data, the dominant failure mode for disks is unrecoverable
> reads. If your software does not handle this case, then you should be
> worried. We tend to recommend configuring ZFS to manage data
> redundancy for this reason.

He is writing that all software LVMs will handle unrecoverable reads. What is your definition of unrecoverable reads?
Richard Elling
2008-Aug-27 04:40 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Mattias Pantzare wrote:

> 2008/8/26 Richard Elling <Richard.Elling at sun.com>:
>
>>> Doing a good job with this error is mostly about not freezing
>>> the whole filesystem for the 30sec it takes the drive to report the
>>> error.
>>
>> That is not a ZFS problem. Please file bugs in the appropriate category.
>
> Whose problem is it? It can't be the device driver, as that has no
> knowledge of zfs filesystems or redundancy.

In most cases it is the drivers below ZFS. For an IDE disk it might be cmdk(7d) over ata(7d). For a USB disk it might be sd(7d) over scsa2usb(7d) over ehci(7d). prtconf -D will show which device drivers are attached to your system. If you search the ZFS source code, you will find very little error handling of devices, by design.

>>> Either the drives should be loaded with special firmware that
>>> returns errors earlier, or the software LVM should read redundant data
>>> and collect the statistic if the drive is well outside its usual
>>> response latency.
>>
>> ZFS will handle this case as well.
>
> How is ZFS handling this? Is there a timeout in ZFS?

Not for this case, but if configured to manage redundancy, ZFS will "read redundant data" from alternate devices. A business metric such as reasonable transaction latency would live at a level above ZFS.

>>> One thing other LVMs seem like they may do better
>>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze
>>> filesystems unrelated to the failing drive during the 30 seconds it's
>>> waiting for the I/O request to return an error.
>>
>> This is not operating in ZFS code.
>
> In what way is freezing a ZFS filesystem not operating in ZFS code?
>
> Notice that he wrote filesystems unrelated to the failing drive.

At the ZFS level, this is dictated by the failmode property.

>>> In terms of FUD about "silent corruption", there is none of it when
>>> the drive clearly reports a sector is unreadable. Yes, traditional
>>> non-big-storage-vendor RAID5, and all software LVMs I know of except
>>> ZFS, depend on the drives to report unreadable sectors. And,
>>> generally, drives do. So let's be clear about that and not try to imply
>>> that the "dominant failure mode" causes silent corruption for
>>> everyone except ZFS and Netapp users - it doesn't.
>>
>> In my field data, the dominant failure mode for disks is unrecoverable
>> reads. If your software does not handle this case, then you should be
>> worried. We tend to recommend configuring ZFS to manage data
>> redundancy for this reason.
>
> He is writing that all software LVMs will handle unrecoverable reads.

I agree. And if ZFS is configured to manage redundancy and a disk read returns EIO or the checksum does not match, then ZFS will attempt to read from the redundant data. However, not all devices return error codes which indicate unrecoverable reads. Also, data corrupted in the data path between media and main memory may not have an associated error condition reported. I find comparing unprotected ZFS configurations with LVMs using protected configurations to be disingenuous.

> What is your definition of unrecoverable reads?

I wrote data, but when I try to read, I don't get back what I wrote.
 -- richard
> James isn't being a jerk because he hates you or
> anything...
>
> Look, yanking the drives like that can seriously
> damage the drives or your motherboard. Solaris
> doesn't let you do it and assumes that something's
> gone seriously wrong if you try it. That Linux
> ignores the behavior and lets you do it sounds more
> like a bug in linux than anything else.

Solaris crashing is a Linux bug. That's a new one, folks.

This message posted from opensolaris.org
MC
2008-Aug-27 06:08 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> Pulling cables only simulates pulling cables. If you
> are having difficulty with cables falling out, then this problem cannot
> be solved with software. It *must* be solved with hardware.

I don't think anyone is asking for software to fix cables that fall out... they're asking for the OS to not crash, which they perceive to be better than a crash...

This message posted from opensolaris.org
Okay, so your AHCI hardware is not using an AHCI driver in Solaris. A crash when pulling a cable is still not great, but it is understandable, because that driver is old and bad and doesn't support hot swapping at all.

So there are two things to do here. File a bug about how pulling a SATA cable crashes Solaris when the device is using the old ide driver. And file another bug about how Solaris recognizes your AHCI SATA hardware as old ide hardware.

The two bonus things to do are: come to the forum and bitch about the bugs to give them some attention, and come to the forum asking for help on making Solaris recognize your AHCI SATA hardware properly :)

Good luck...

> Gotcha. But just to let you know, there are 4 SATA
> ports on the motherboard, with each drive getting its
> own port... how should I go about testing to see
> whether pulling one IDE drive (remember, they're
> really SATA drives, but they're being presented to
> the OS by the pci-ide driver) locks the entire IDE
> channel if there's only one drive per channel? Or do
> you think it's possible that two ports on the
> motherboard could be on one "logical channel" (for
> lack of a better phrase) while the other two are on
> the other, and thus we could test one drive while
> another on the same "logical channel" is unplugged?
>
> Also, remember that OpenSolaris freezes when this
> occurs, so I'm only going to have 2-3 seconds to
> execute a command before Terminal and - after a few
> more seconds, the rest of the machine - stop
> responding to input...
>
> I'm all for trying to test this, but I might need
> some instruction.

This message posted from opensolaris.org
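A quick way to confirm which driver a controller ended up bound to is to look at the driver column in the device tree; the grep pattern below is just a convenience and the instance names will differ from machine to machine:

  # show the device tree with bound drivers and pick out the disk controllers
  prtconf -D | egrep -i 'ide|ahci|sata'
  # the attachment points listed by cfgadm also hint at which driver is in use
  cfgadm -lav

If the disks show up under pci-ide/ata rather than ahci, the BIOS setting (or the driver binding) is still presenting the controller in IDE compatibility mode.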
Todd H. Poole
2008-Aug-27 06:53 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Howdy Ron,

Right, right - I know I dropped the ball on that one. Sorry, I haven't been able to log into OpenSolaris lately, and thus haven't been able to actually do anything useful... (lol, not to rag on OpenSolaris or anything, but it can also freeze just by logging in... See: http://defect.opensolaris.org/bz/show_bug.cgi?id=1681)

Ok, so, just to give a refresher of what's going on: When everything is in its default state (standard install of OpenSolaris, standard configuration of ZFS, factory-set BIOS settings, etc.), OpenSolaris will indeed freeze/hang/lock up, and generally become unusable _without exception_ on the hardware I've described above. I'm not confident enough to say that it will _always_ happen on _any_ machine using the 4 drive configuration of RAID-Z with the pci-ide driver and hardware set-up I've described thus far, but since I am not alone in experiencing this (see what myxiplx experienced on his [different] hardware set-up), I don't think it's an isolated case.

The factory-set BIOS setting for the 4 SATA II ports on my motherboard is [Native IDE]. I can change this setting from [Native IDE] to [RAID], [Legacy IDE], and [SATA->AHCI].

Changing the setting to [SATA->AHCI] prevents the machine from booting. There isn't any extra information that I can give aside from the fact that when I'm at the "SunOS Release 5.11 Version snv_86 64-bit" screen where the copyright is listed, the machine hangs right after listing "Hostname: ". A restart didn't fix anything (that would sometimes fix the login bug I wrote about a few paragraphs up, but it didn't work for this).

By the way: Is there a way to pull up a text-only interface from the login screen (or during the boot process?) without having to log in (or just sit there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be nice if I could see a bit more information during boot, or if I didn't have to use gnome if I just wanted to get at the CLI anyways... On some OSes, if you want to access TTY1 through 6, you only need to press ESC during boot, or CTRL + ALT + F1 through F6 (or something similar) during the login screen to gain access to other non-GUI login screens...

Anyway, after changing the setting back to [Native IDE], the machine boots fine. And this time, the freeze-on-login bug didn't get me.

Now, I know for a fact this motherboard supports SATA II (see link to manufacturer's website in earlier post), and that all 4 of these disks are _definitely_ SATA II disks (see hardware specifications listed in one of my earliest posts), and that I'm using all the right cables and everything... so, I don't know how to explore this any further...

Could it be that when I installed OpenSolaris, I was using the pci-ide (or [Native IDE]) setting on my BIOS, and thus if I were to change it, OpenSolaris might not know how to handle that, and might refuse to boot? Or that maybe OpenSolaris only installed the drivers it thought it would need, and the SATA AHCI one wasn't one of them?

Let me know what you think.

-Todd

This message posted from opensolaris.org
Howdy James, While responding to halstead''s post (see below), I had to restart several times to complete some testing. I''m not sure if that''s important to these commands or not, but I just wanted to put it out there anyway.> A few commands that you could provide the output from > include: > > > (these two show any FMA-related telemetry) > fmadm faulty > fmdump -vThis is the output from both commands: todd at mediaserver:~# fmadm faulty --------------- ------------------------------------ -------------- --------- TIME EVENT-ID MSG-ID SEVERITY --------------- ------------------------------------ -------------- --------- Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD Major Fault class : fault.fs.zfs.vdev.io Description : The number of I/O errors associated with a ZFS device exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information. Response : The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available. Impact : Fault tolerance of the pool may be compromised. Action : Run ''zpool status -x'' and replace the bad device. todd at mediaserver:~# fmdump -v TIME UUID SUNW-MSG-ID Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD 100% fault.fs.zfs.vdev.io Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719 Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719 FRU: - Location: -> (this shows your storage controllers and what''s > connected to them) cfgadm -lavThis is the output from cfgadm -lav todd at mediaserver:~# cfgadm -lav Ap_Id Receptacle Occupant Condition Information When Type Busy Phys_Id usb2/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13:1 usb2/2 connected configured ok Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM) NConfigs: 1 Config: 0 <no cfg str descr> unavailable usb-mouse n /devices/pci at 0,0/pci1458,5004 at 13:2 usb3/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,2:1 usb3/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,2:2 usb4/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,3:1 usb4/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,3:2 usb5/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,4:1 usb5/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,4:2 usb6/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:1 usb6/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:2 usb6/3 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:3 usb6/4 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:4 usb6/5 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:5 usb6/6 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:6 usb6/7 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:7 usb6/8 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:8 usb6/9 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:9 usb6/10 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:10 usb7/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,1:1 usb7/2 empty 
unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,1:2 You''ll notice that the only thing listed is my USB mouse... is that expected?> You''ll also find messages in /var/adm/messages which > might prove > useful to review.If you really want, I can list the output from /var/adm/messages, but it doesn''t seem to add anything new to what I''ve already copied and pasted.> First and foremost, for me, this is a stupid thing to > do. You''ve got common-or-garden PC hardware which almost > *definitely* does not support hot plug of devices. Which is what you''re > telling us that you''re doing. Would try this with your pci/pci-e > cards in this system? I think not.I would if I had some sort of set-up that supposedly promised me redundant PCI/PCI-E cards... You might think it''s stupid, but how else could one be sure that the back-up PCI/PCI-E card would take over when the primary one died? Unplugging one of them seems like a fine test to me - It''s definitely the worst case scenario, and if the rig survives that, then I _know_ I would be able to rely on it for redundancy should one of the cards fail (which would most likely occur in a less spectacular fashion than a quick yank anyways)> If you absolutely must do something like this, then > please use what''s known as "coordinated hotswap" using the > cfgadm(1m) command. > > > Viz: > > (detect fault in disk c2t3d0, in some way) > > # cfgadm -c unconfigure c2::dsk/c2t3d0 > # cfgadm -c disconnect c2::dsk/c2t3d0 > > (go and swap the drive, plugin new drive with same > cable) > > # zpool replace -f poolname c2t3d0 > > > What this will do is tell the kernel to do things in > the right order, and - for zpool - tell it to do an > in-place replacement of device c2t3d0 in your pool.Thanks for the command listings - they''ll certainly prove useful if I should ever find myself in a situation where I have to manually swap a disk like you described. Unfortunately though, I''m with Miles Nordin (see below) on this one - I don''t want to warn OpenSolaris of what I''m about to do... That would defeat the purpose of the test. Even with technologies (like S.M.A.R.T.) that are designed to give you a bit of a heads-up, as Heikki Suonsivu and Google have noted, they''re not very reliable at all (research.google.com/archive/disk_failures.pdf). And I want this test to be as rough as it gets. I don''t want to play nice with this system... I want to drag it through the most tortuous worst-case scenario tests I can imagine, and if it survives with all my test data intact, then (and only then) will I begin to trust it.> http://docs.sun.com/app/docs/coll/40.17 (manpages) > http://docs.sun.com/app/docs/coll/47.23 (system admin collection) > http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide > http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guideOohh... Thank you. Good Links. I''m bookmarking these for future reading. They''ll definitely be helpful if we end up choosing to deploy OpenSolaris + ZFS for our media servers. -Todd This message posted from opensolaris.org
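Two side notes on the above. Seeing only the USB mouse in cfgadm is consistent with the pci-ide attachment: disks driven through the legacy ide path generally don't show up as cfgadm attachment points the way SATA-framework devices do. And rather than posting all of /var/adm/messages, a filtered view is usually enough; the pattern below is only a guess at what's relevant on this box:

# egrep -i 'ide|ata|disk|error|warning' /var/adm/messages | tail -60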
Mattias Pantzare
2008-Aug-27 10:44 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
2008/8/27 Richard Elling <Richard.Elling at sun.com>:> >>>> Either the drives should be loaded with special firmware that >>>> returns errors earlier, or the software LVM should read redundant data >>>> and collect the statistic if the drive is well outside its usual >>>> response latency. >>>> >>> >>> ZFS will handle this case as well. >>> >> >> How is ZFS handling this? Is there a timeout in ZFS? >> > > Not for this case, but if configured to manage redundancy, ZFS will > "read redundant data" from alternate devices.No, ZFS will not: ZFS waits for the device driver to report an error, and only after that will it read from alternate devices. ZFS could detect that there is probably a problem with the device and read from an alternate device much faster while it waits for the device to answer. You can't do this at any level other than ZFS.>>>> One thing other LVM's seem like they may do better >>>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze >>>> filesystems unrelated to the failing drive during the 30 seconds it's >>>> waiting for the I/O request to return an error. >>>> >>>> >>> >>> This is not operating in ZFS code. >>> >> >> In what way is freezing a ZFS filesystem not operating in ZFS code? >> >> Notice that he wrote filesystems unrelated to the failing drive. >> >> > > At the ZFS level, this is dictated by the failmode property.But that is used after ZFS has detected an error?> I find comparing unprotected ZFS configurations with LVMs > using protected configurations to be disingenuous.I don't think anyone is doing that.> >> What is your definition of unrecoverable reads? >> > > I wrote data, but when I try to read, I don't get back what I wrote.There is only one case where ZFS is better, and that is when wrong data is returned. All other cases are managed by layers below ZFS. Wrong data returned is not normally called an unrecoverable read.
On Tue, Aug 26, 2008 at 11:18:51PM -0700, MC wrote:> The two bonus things to do are: come to the forum and bitch about the bugs to give them some attention, and come to the forum asking for help on making solaris recognize your AHCI SATA hardware properly :)Been there, done that. No t-shirt, though... The Solaris kernel might be the best thing since MULTICS, but the lack of drivers really hampers its spread. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163
I plan on fiddling around with this failmode property in a few hours. I''ll be using http://docs.sun.com/app/docs/doc/817-2271/gftgp?l=en&a=view as a reference. I''ll let you know what I find out. -Todd This message posted from opensolaris.org
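For reference, the property can be inspected and changed per pool; a minimal sketch, assuming the pool is still named mediapool:

# zpool get failmode mediapool
# zpool set failmode=continue mediapool     (the other values are wait, the default, and panic)

Note that failmode only governs how ZFS behaves once the pool can no longer satisfy I/O; it won't by itself shorten the time spent waiting on the driver.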
Ross
2008-Aug-27 15:38 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
Hi Todd, Having finally gotten the time to read through this entire thread, I think Ralf said it best. ZFS can provide data integrity, but you're reliant on hardware and drivers for data availability. In this case either your SATA controller or the drivers for it don't cope at all well with a device going offline, so what you need is a SATA card that can handle that. Provided you have a controller that can cope with the disk errors, it should be able to return the appropriate status information to ZFS, which will in turn ensure your data is ok. The technique obviously works or Sun's x4500 servers wouldn't be doing anywhere near as well as they are. The problem we all seem to be having is finding white box hardware that supports it. I suspect your best bet would be to pick up a SAS controller based on the LSI chipsets used in the new x4540 server. There's been a fair bit of discussion here on these, and while there's a limitation in that you will have to manually keep track of drive names, I would expect it to handle disk failures (and pulling disks) much better, but you would probably be well advised asking the folks on the forums running those SAS controllers whether they've been able to pull disks successfully. I think the solution you need is definitely to get a better disk controller, and your choice is either a plain SAS controller, or a RAID controller that can present individual disks in pass-through mode since they *definitely* are designed to handle failures. Ross This message posted from opensolaris.org
Tim
2008-Aug-27 16:21 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> > > By the way: Is there a way to pull up a text-only interface from the log in > screen (or during the boot process?) without having to log in (or just sit > there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be > nice if I could see a bit more information during boot, or if I didn''t have > to use gnome if I just wanted to get at the CLI anyways... On some OSes, if > you want to access TTY1 through 6, you only need to press ESC during boot, > or CTRL + ALT + F1 through F6 (or something similar) during the login screen > to gain access to other non-GUI login screens... >On SXDE/Solaris, there''s a dropdown menu that lets you select what type of logon you''d like to use. I haven''t touched 2008.11 so I have no idea if it''s got similar.> > Anyway, after changing the setting back to [Native IDE], the machine boots > fine. And this time, the freeze-on-login bug didn''t get me. Now, I know for > a fact this motherboard supports SATA II (see link to manufacturer''s website > in earlier post), and that all 4 of these disks are _definitely_ SATA II > disks (see hardware specifications listed in one of my earliest posts), and > that I''m using all the right cables and everything... so, I don''t know how > to explore this any further... > > Could it be that when I installed OpenSolaris, I was using the pci-ide (or > [Native IDE]) setting on my BIOS, and thus if I were to change it, > OpenSolaris might not know hot to handle that, and might refuse to boot? Or > that maybe OpenSolaris only installed the drivers it thought it would need, > and the stat-ahci one wasn''t one of them? >Did you do a reboot reconfigure? "reboot -- -r" or "init 6"? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/f4dba7e2/attachment.html>
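For completeness, the reconfiguration boot Tim mentions can also be forced without remembering boot flags; a sketch:

# touch /reconfigure
# init 6

or, on a running system, rebuild the device tree in place with:

# devfsadm -v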
On Wed, Aug 27, 2008 at 1:18 AM, MC <rac at eastlink.ca> wrote:> Okay, so your ACHI hardware is not using an ACHI driver in solaris. A > crash when pulling a cable is still not great, but it is understandable > because that driver is old and bad and doesn''t support hot swapping at all. >His AHCI is not using AHCI because he''s set it not to. If linux is somehow ignoring the BIOS configuration, and attempting to load an AHCI driver for the hardware anyways, that''s *BROKEN* behavior. I''ve yet to see WHAT driver linux was using because he was too busy having a pissing match to get that USEFUL information back to the list. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/7634480a/attachment.html>
Todd H. Poole wrote:> And I want this test to be as rough as it gets. I don''t want to play > nice with this system... I want to drag it through the most tortuous > worst-case scenario tests I can imagine, and if it survives with all > my test data intact, then (and only then) will I begin to trust it.http://www.youtube.com/watch?v=naKd9nARAes :-) -- richard
Richard Elling
2008-Aug-27 17:17 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Mattias Pantzare wrote:> 2008/8/27 Richard Elling <Richard.Elling at sun.com>: > >>>>> Either the drives should be loaded with special firmware that >>>>> returns errors earlier, or the software LVM should read redundant data >>>>> and collect the statistic if the drive is well outside its usual >>>>> response latency. >>>>> >>>>> >>>> ZFS will handle this case as well. >>>> >>>> >>> How is ZFS handling this? Is there a timeout in ZFS? >>> >>> >> Not for this case, but if configured to manage redundancy, ZFS will >> "read redundant data" from alternate devices. >> > > No, ZFS will not, ZFS waits for the device driver to report an error, > after that it will read from alternate devices. >Yes, ZFS will, ZFS waits for the device driver to report an error, after that it will read from alternate devices.> ZFS could detect that there is probably a problem with the device and > read from an alternate device much faster while it waits for the > device to answer. >Rather than complicating ZFS code with error handling code which is difficult to port or maintain over time, ZFS leverages the Solaris Fault Management Architecture. There is opportunity to expand features using the flexible FMA framework. Feel free to propose additional RFEs.> You can''t do this at any other level than ZFS. > > > > >>>>> One thing other LVM''s seem like they may do better >>>>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze >>>>> filesystems unrelated to the failing drive during the 30 seconds it''s >>>>> waiting for the I/O request to return an error. >>>>> >>>>> >>>>> >>>> This is not operating in ZFS code. >>>> >>>> >>> In what way is freezing a ZFS filesystem not operating in ZFS code? >>> >>> Notice that he wrote filesystems unrelated to the failing drive. >>> >>> >>> >> At the ZFS level, this is dictated by the failmode property. >> > > But that is used after ZFS has detected an error? >I don''t understand this question. Could you rephrase to clarify?>> I find comparing unprotected ZFS configurations with LVMs >> using protected configurations to be disingenuous. >> > > I don''t think anyone is doing that. >harrumph>>> What is your definition of unrecoverable reads? >>> >>> >> I wrote data, but when I try to read, I don''t get back what I wrote. >> > > There is only one case where ZFS is better, that is when wrong data is > returned. All other cases are managed by layers below ZFS. Wrong data > returned is not normaly called unrecoverable reads. >It depends on your perspective. T10 has provided a standard error code for a device to tell a host that it experienced an unrecoverable read error. However, we still find instances where what we wrote is not what we read, whether it is detected at the media level or higher in the software stack. In my pile of borken parts, I have devices which fail to indicate an unrecoverable read, yet do indeed suffer from forgetful media. To carry that discussion very far, it quickly descends into the ability of the device''s media checksums to detect bad data -- even ZFS''s checksums. But here is another case where enterprise-class devices tend to perform better than consumer-grade devices. -- richard
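For anyone who wants to see that telemetry directly, the FMA error reports can be dumped from the error log; a sketch:

# fmdump -eV | more     (raw ereports, e.g. ereport.fs.zfs.io vs ereport.fs.zfs.checksum)
# fmstat                (shows which FMA modules are receiving and diagnosing events)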
Miles Nordin
2008-Aug-27 17:48 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> not all devices return error codes which indicate re> unrecoverable reads. What you mean is, ``devices sometimes return bad data instead of an error code.'''' If you really mean there are devices out there which never return error codes, and always silently return bad data, please tell us which one and the story of when you encountered it, because I''m incredulous. I''ve never seen or heard of anything like that. Not even 5.25" floppies do that. Well...wait, actually I have. I heard some SGI disks had special firmware which could be ordered to behave this way, and some kind of ioctl or mount option to turn it on per-file or per-filesystem. But the drives wouldn''t disable error reporting unless ordered to. Another interesting lesson SGI offers here: they pushed this feature through their entire stack. The point was, for some video playback, data which arrives after the playback point has passed is just as useless as silently corrupt data, so the disk, driver, filesystem, all need to modify their exception handling to deliver the largest amount of on-time data possible, rather than the traditional goal of eventually returning the largest amount of correct data possible and clear errors instead of silent corruption. This whole-stack approach is exactly what I thought ``green line'''' was promising, and exactly what''s kept out of Solaris by the ``go blame the drivers'''' mantra. Maybe I was thinking of this SGI firmware when I suggested the customized firmware netapp loads into the drives in their study could silently return bad data more often than the firmware we''re all using, the standard firmware with 512-byte sectors intended for RAID layers without block checksums. re> I would love for you produce data to that effect. Read the netapp paper you cited earlier http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf on page 234 there''s a comparison of the relative prevalence of each kind of error. Latent sector errors / Unrecoverable reads nearline disks experiencing latent read errors per year: 9.5% Netapp calls the UNC errors, where the drive returns an error instead of data, ``latent sector errors.'''' Software RAID systems other than ZFS *do* handle this error, usually better than ZFS to my impression. And AIUI when it doesn''t freeze and reboot, ZFS counts this as a READ error. In addition to reporting it, most consumer drives seem to log the last five of these non-volatilely, and you can read the log with ''smartctl -a'' (if you''re using Linux always, or under Solaris only if smartctl is working with your particular disk driver). Silent corruption nearline disks experiencing silent corruption per year: 0.466% What netapp calls ``silent data corruption'''' is bad data silently returned by drives with no error indication, counted by ZFS as CKSUM and seems not to cause ZFS to freeze. I think you have been lumping this in with unrecoverable reads, but using the word ``silent'''' makes it clearer because unrecoverable makes it sound to me like the drive tried to recover, and failed, in which case the drive probably also reported the error making it a ``latent sector error''''. filesystem corruption This is also discovered silently w.r.t. the driver: the corruption that happens to ZFS systems when SAN targets disappear suddenly or when you offline a target and then reboot (which is also counted in the CKSUM column, and which ZFS-level redundancy also helps fix). 
I would call this ``ZFS bugs'''', ``filesystem corruption,'''' or ``manual resilvering''''. Obviously it''s not included on the Netapp table. It would be nice if ZFS had two separate CKSUM columns to distinguish between what netapp calls ``checksum errors'''' vs ``identity discrepancies''''. For ZFS the ``checksum error'''' would point with high certainty to the storage and silent corruption, and the ``identity discrepancy'''' would be more like filesystem corruption and flag things like one side of a mirror being out-of-date when ZFS thinks it shouldn''t be. but currently we have only one CKSUM column for both cases. so, I would say, yes, the type of read error that other software RAID systems besides ZFS do still handle is a lot more common: 9.5%/yr vs 0.466%/yr for nearline disks, and the same ~20x factor for enterprise disks. The rare silent error which other software LVM''s miss and only ZFS/Netapp/EMC/... handles is still common enough to worry about, at least on the nearline disks in the Netapp drive population. What this also shows, though, is that about 1 in 10 drives will return an UNC per year, and possibly cause ZFS to freeze up. It''s worth worrying about availability during an exception as common as that---it might even be more important for some applications than catching the silent corruption. not for my own application, but for some readily imagineable ones, yes. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/0599d597/attachment.bin>
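The drive-side error log mentioned above can be read wherever smartctl can actually talk to the device; a sketch only, since the device path and any -d option depend entirely on the platform and driver in use:

# smartctl -a /dev/rdsk/c0d0p0

Look for the "SMART Error Log" section in the output; recent uncorrectable-read events, if the drive logged any, show up there.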
>>>>> "m" == MC <rac at eastlink.ca> writes:m> file another bug about how solaris recognizes your AHCI SATA m> hardware as old ide hardware. I don't have that board but AIUI the driver attachment's selectable in the BIOS Blue Screen of Setup, by setting the controller to ``Compatibility'' mode (pci-ide) or ``Native'' mode (AHCI). This particular chip must be run in Compatibility mode because of bug 6665032.
Keith Bierman
2008-Aug-27 18:05 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Aug 27, 2008, at 11:17 AM, Richard Elling wrote:>>>> In my pile of broken parts, I have devices > which fail to indicate an unrecoverable read, yet do indeed suffer > from forgetful media.A long time ago, in a hw company long since dead and buried, I spent some months trying to find an intermittent error in the last bits of a complicated floating point application. It only occurred when disk striping was turned on (but the OS and device codes checked cleanly). In the end, it turned out that one of the device vendors had modified the specification slightly (by like 1 nano-sec) and the result was that least significant bits were often wrong when we drove the disk cage to its max. Errors were occurring randomly (e.g. swapping, paging, etc.) but no other application noticed. As the error was "within the margin of error" a less stubborn analyst might not have made a series of federal cases about the non-determinism ;> My point is that undetected errors happen all the time; that people don't notice doesn't mean that they don't happen ... -- Keith H. Bierman khbkhb at gmail.com | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 <speaking for myself*> Copyright 2008
>>>>> "thp" == Todd H Poole <toddhpoole at gmail.com> writes:>> Would try this with >> your pci/pci-e cards in this system? I think not. thp> Unplugging one of them seems like a fine test to me I''ve done it, with 32-bit 5 volt PCI, I forget why. I might have been trying to use a board, but bypass the broken etherboot ROM on the board. It was something like that. IIRC it works sometimes, crashes the machine sometimes, and fries the hardware eventually if you keep doing it long enough. The exact same three cases are true of cold-plugging a PCI card. It just works a-lot-more-often sometimes if you power down first. Does massively inappropriate hotplugging possibly weaken the hardware so that it''s more likely to pop later? maybe. Can you think of a good test for that? Believe it or not, sometimes accurate information is worth more than a motherboard that cost $50 five years ago. Sometimes saving ten minutes is worth more. or...<cough> recovering an openprom password. Testing availability claims rather than accepting them on faith, or rather than gaining experience in a slow, oozing, anecdotal way on production machinery, is definitely not stupid. Testing them in a way that compares one system to another is double-un-stupid. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/f89534e3/attachment.bin>
Richard Elling
2008-Aug-27 18:27 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> > > re> not all devices return error codes which indicate > re> unrecoverable reads. > > What you mean is, ``devices sometimes return bad data instead of an > error code.'''' > > If you really mean there are devices out there which never return > error codes, and always silently return bad data, please tell us which > one and the story of when you encountered it, because I''m incredulous. > I''ve never seen or heard of anything like that. Not even 5.25" > floppies do that. >I blogged about one such case. http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file However, I''m not inclined to publically chastise the vendor or device model. It is a major vendor and a popular device. ''nuff said.> Well...wait, actually I have. I heard some SGI disks had special > firmware which could be ordered to behave this way, and some kind of > ioctl or mount option to turn it on per-file or per-filesystem. But > the drives wouldn''t disable error reporting unless ordered to. > Another interesting lesson SGI offers here: they pushed this feature > through their entire stack. The point was, for some video playback, > data which arrives after the playback point has passed is just as > useless as silently corrupt data, so the disk, driver, filesystem, all > need to modify their exception handling to deliver the largest amount > of on-time data possible, rather than the traditional goal of > eventually returning the largest amount of correct data possible and > clear errors instead of silent corruption. This whole-stack approach > is exactly what I thought ``green line'''' was promising, and exactly > what''s kept out of Solaris by the ``go blame the drivers'''' mantra. > > Maybe I was thinking of this SGI firmware when I suggested the > customized firmware netapp loads into the drives in their study could > silently return bad data more often than the firmware we''re all using, > the standard firmware with 512-byte sectors intended for RAID layers > without block checksums. > > re> I would love for you produce data to that effect. > > Read the netapp paper you cited earlier > > http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf > > on page 234 there''s a comparison of the relative prevalence of each > kind of error. > > Latent sector errors / Unrecoverable reads > > nearline disks experiencing latent read errors per year: 9.5% >This number should scare the *%^ out of you. It basically means that no data redundancy is a recipe for disaster. Fortunately, with ZFS you can have data redundancy without requiring a logical volume manager to mirror your data. This is especially useful on single-disk systems like laptops.> Netapp calls the UNC errors, where the drive returns an error > instead of data, ``latent sector errors.'''' Software RAID systems > other than ZFS *do* handle this error, usually better than ZFS to > my impression. And AIUI when it doesn''t freeze and reboot, ZFS > counts this as a READ error. In addition to reporting it, most > consumer drives seem to log the last five of these non-volatilely, > and you can read the log with ''smartctl -a'' (if you''re using Linux > always, or under Solaris only if smartctl is working with your > particular disk driver). 
> > > Silent corruption > > nearline disks experiencing silent corruption per year: 0.466% > > What netapp calls ``silent data corruption'''' is bad data silently > returned by drives with no error indication, counted by ZFS as > CKSUM and seems not to cause ZFS to freeze. I think you have been > lumping this in with unrecoverable reads, but using the word > ``silent'''' makes it clearer because unrecoverable makes it sound to > me like the drive tried to recover, and failed, in which case the > drive probably also reported the error making it a ``latent sector > error''''. >Likewise, this number should scare you. AFAICT, logical volume managers like SVM will not detect this. Terminology wise, silent errors are, by-definition, not detected. But in the literature you might see this in studies of failures where the author intends to differentiate between one system which detects such errors and one which does not.> > filesystem corruption > > This is also discovered silently w.r.t. the driver: the corruption > that happens to ZFS systems when SAN targets disappear suddenly or > when you offline a target and then reboot (which is also counted in > the CKSUM column, and which ZFS-level redundancy also helps fix). > I would call this ``ZFS bugs'''', ``filesystem corruption,'''' or > ``manual resilvering''''. Obviously it''s not included on the Netapp > table. It would be nice if ZFS had two separate CKSUM columns to > distinguish between what netapp calls ``checksum errors'''' vs > ``identity discrepancies''''. For ZFS the ``checksum error'''' would > point with high certainty to the storage and silent corruption, and > the ``identity discrepancy'''' would be more like filesystem > corruption and flag things like one side of a mirror being > out-of-date when ZFS thinks it shouldn''t be. but currently we have > only one CKSUM column for both cases. > >This differentiation is noted in the FMA e-reports.> so, I would say, yes, the type of read error that other software RAID > systems besides ZFS do still handle is a lot more common: 9.5%/yr vs > 0.466%/yr for nearline disks, and the same ~20x factor for enterprise > disks. The rare silent error which other software LVM''s miss and only > ZFS/Netapp/EMC/... handles is still common enough to worry about, at > least on the nearline disks in the Netapp drive population. >0.466%/yr is a per-disk rate. If you have 10 disks, your exposure is 4.6% per year. For 100 disks, 46% per year, etc. For systems with thousands of disks this is a big problem. But I don''t think using a rate-per-unit-time is the best way to look at this problem because if you never read the data, you don''t care. This is why disk vendors spec UERs as rate-per-bits-read. I have some field data on bits read over time, but routine activities, like backups, zfs sends, or scrubs, can change the number of bits read per unit time by a significant amount.> What this also shows, though, is that about 1 in 10 drives will return > an UNC per year, and possibly cause ZFS to freeze up. It''s worth > worrying about availability during an exception as common as that---it > might even be more important for some applications than catching the > silent corruption. not for my own application, but for some readily > imagineable ones, yes. >UNCs don''t cause ZFS to freeze as long as failmode != wait or ZFS manages the data redundancy. -- richard
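This is essentially the argument for scheduling scrubs: force every allocated block to be read and verified before you need it in anger. A sketch, reusing the pool name from earlier in the thread:

# zpool scrub mediapool
# zpool status -v mediapool     (shows scrub progress and any errors it turned up)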
Ross
2008-Aug-27 18:31 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Forgive me for being a bit wooly with this explanation (I've only recently moved over from Windows), but changing disk mode from IDE to SATA may well not work without a re-install, or at the very least messing around with boot settings. I've seen many systems which list SATA disks in front of IDE ones, so changing the drives to SATA may now mean that instead of your OS being installed on drive 0 and your data on drive 1, you now have the data on drive 0 and the OS on drive 1. You'll get through the first part of the boot process fine, but the second stage is where you usually have problems, which sounds like what's happening to you. Unfortunately swapping hard disk controllers (which is what you're doing here) isn't as simple as just making the change and rebooting, and that would be just as true in Windows. I do think some Solaris drivers need a bit of work, but I suspect the standard SATA ones are pretty good, so there is a fair chance that you'll find hot plug works ok in SATA mode. Ultimately however you're trying to get enterprise kinds of performance out of consumer kit, and no matter how good Solaris and ZFS are, they can't guarantee to work with that. I used to have the same opinion as you, but I'm starting to see now that ZFS isn't quite an exact match for traditional raid controllers. It's close, but you do need to think about the hardware too and make sure it can definitely cope with what you're wanting to do. I think the sales literature is a little misleading in that sense. Ross This message posted from opensolaris.org
Tim
2008-Aug-27 18:38 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, Aug 27, 2008 at 1:31 PM, Ross <myxiplx at hotmail.com> wrote:> Forgive me for being a bit wooly with this explanation (I''ve only recently > moved over from Windows), but changing disk mode from IDE to SATA may well > not work without a re-install, or at the very least messing around with boot > settings. I''ve seen many systems which list SATA disks in front of IDE > ones, so you changing the drives to SATA may now mean that instead of your > OS being installed on drive 0, and your data on drive 1, you now have the > data on drive 0 and the OS on drive 1. >Solaris does not do this. This is one of the many annoyances I have with linux. The way they handle /dev is ridiculous. Did you add a new drive? Let''s renumber everything! --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/0cfe04e5/attachment.html>
Miles Nordin
2008-Aug-27 21:51 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:>> If you really mean there are devices out there which never >> return error codes, and always silently return bad data, please >> tell us which one and the story of when you encountered it, re> I blogged about one such case. re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file re> However, I''m not inclined to publically chastise the vendor or re> device model. It is a major vendor and a popular re> device. ''nuff said. It''s not really enough for me, but what''s more the case doesn''t match what we were looking for: a device which ``never returns error codes, always returns silently bad data.'''' I asked for this because you said ``However, not all devices return error codes which indicate unrecoverable reads,'''' which I think is wrong. Rather, most devices sometimes don''t, not some devices always don''t. Your experience doesn''t say anything about this drive''s inability to return UNC errors. It says you suspect it of silently returning bad data, once, but your experience doesn''t even clearly implicate the device once: It could have been cabling/driver/power-supply/zfs-bugs when the block was written. I was hoping for a device in your ``bad stack'''' which does it over and over. Remember, I''m not arguing ZFS checksums are worthless---I think they''re great. I''m arguing with your original statement that ZFS is the only software RAID which deals with the dominant error you find in your testing, unrecoverable reads. This is untrue! re> This number should scare the *%^ out of you. It basically re> means that no data redundancy is a recipe for disaster. yeah, but that 9.5% number alone isn''t an argument for ZFS over other software LVM''s. re> 0.466%/yr is a per-disk rate. If you have 10 disks, your re> exposure is 4.6% per year. For 100 disks, 46% per year, etc. no, you''re doing the statistics wrong, and in a really elementary way. You''re counting multiple times the possible years in which more than one disk out of the hundred failed. If what you care about for 100 disks is that no disk experiences an error within one year, then you need to calculate (1 - 0.00466) ^ 100 = 62.7% so that''s 37% probability of silent corruption. For 10 disks, the mistake doesn''t make much difference and 4.6% is about right. I don''t dispute ZFS checksums have value, but the point stands that the reported-error failure mode is 20x more common in netapp''s study than this one, and other software LVM''s do take care of the more common failure mode. re> UNCs don''t cause ZFS to freeze as long as failmode != wait or re> ZFS manages the data redundancy. The time between issuing the read and getting the UNC back can be up to 30 seconds, and there are often several unrecoverable sectors in a row as well as lower-level retries multiplying this 30-second value. so, it ends up being a freeze. To fix it, ZFS needs to dispatch read requests for redundant data if the driver doesn''t reply quickly. ``Quickly'''' can be ambiguous, but the whole point of FMD was supposed to be that complicated statistics could be collected at various levels to identify even more subtle things than READ and CKSUM errors, like drives that are working at 1/10th the speed they should be, yet right now we can''t even flag a drive taking 30 seconds to read a sector. 
ZFS is still ``patiently waiting'''', and now that FMD is supposedly integrated instead of a discussion of what knobs and responses there are, you''re passing the buck to the drivers and their haphazard nonuniform exception state machines. The best answer isn''t changing drivers to make the drive timeout in 15 seconds instead---it''s to send the read to other disks quickly using a very simple state machine, and start actually using FMD and a complicated state machine to generate suspicion-events for slow disks that aren''t returning errors. Also the driver and mid-layer need to work with the hypothetical ZFS-layer timeouts to be as good as possible about not stalling the SATA chip, the channel if there''s a port multiplier, or freezing the whole SATA stack including other chips, just because one disk has an outstanding READ command waiting to get an UNC back. In some sense the disk drivers and ZFS have different goals. The goal of drivers should be to keep marginal disk/cabling/... subsystems online as aggressively as possible, while the goal of ZFS should be to notice and work around slightly-failing devices as soon as possible. I thought the point of putting off reasonable exception handling for two years while waiting for FMD, was to be able to pursue both goals simultaneously without pressure to compromise one in favor of the other. In addition, I''m repeating myself like crazy at this point, but ZFS tools used for all pools like ''zpool status'' need to not freeze when a single pool, or single device within a pool, is unavailable or slow, and this expectation is having nothing to do with failmode on the failing pool. And NFS running above ZFS should continue serving filesystems from available pools even if some pools are faulted, again nothing to do with failmode. Neither is the case now, and it''s not a driver fix, but even beyond fixing these basic problems there''s vast room for improvement, to deliver something better than LVM2 and closer to NetApp, rather than just catching up. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/60051301/attachment.bin>
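For anyone who wants to check the arithmetic, it's a one-liner, assuming independent and identical per-disk rates:

$ echo '(1 - 0.00466)^100' | bc -l

which prints roughly 0.627, i.e. the ~37% chance of at least one silent-corruption event across 100 disks in a year.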
Ian Collins
2008-Aug-27 22:21 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin writes:> > In addition, I''m repeating myself like crazy at this point, but ZFS > tools used for all pools like ''zpool status'' need to not freeze when a > single pool, or single device within a pool, is unavailable or slow, > and this expectation is having nothing to do with failmode on the > failing pool. And NFS running above ZFS should continue serving > filesystems from available pools even if some pools are faulted, again > nothing to do with failmode. >I agree with the bulk of this post, but I''d like to add to this last point. I''ve had a few problems with ZFS tools hanging on recent builds due to problems with a pool on a USB stick. One tiny $20 component causing a fault that required a reboot of the host. This really shouldn''t happen. Ian
Miles Nordin
2008-Aug-27 22:33 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "t" == Tim <tim at tcsac.net> writes:t> Solaris does not do this. yeah but the locators for local disks are still based on pci/controller/channel not devid, so the disk will move to a different device name if he changes BIOS from pci-ide to AHCI because it changes the driver attachment. This may be the problem preventing his bootup, rather than the known AHCI bug. I'm not sure what's required to boot off a root pool that's moved devices, maybe nothing, but for UFS roots it often required booting off the install media, regenerating /dev (and /devices on sol9), editing vfstab, and so on. Linux device names don't move as much if you use LVM2, as some of the distros do by default even for single-device systems. Device names are then based on labels written onto the drive, which is a little scary and adds a lot of confusion, but I think helps with this moving-device problem and is analogous to what it sounds like ZFS might do on the latest SXCE's that don't put zpool.cache in the boot archive.
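The practical upshot for the data pool: if a controller-mode change renames the devices, ZFS can usually re-find the pool by scanning the on-disk labels rather than trusting the cached paths. A sketch; the pool name is just the one used earlier in the thread:

# zpool import                       (scans /dev/dsk and lists any pools it can find by label)
# zpool import -d /dev/dsk mediapool

The root pool is the harder case, since the device paths matter before ZFS is even running, which is what the UFS-era caveats above are getting at.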
Toby Thain
2008-Aug-27 22:39 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On 27-Aug-08, at 7:21 PM, Ian Collins wrote:> Miles Nordin writes: > >> >> In addition, I''m repeating myself like crazy at this point, but ZFS >> tools used for all pools like ''zpool status'' need to not freeze >> when a >> single pool, or single device within a pool, is unavailable or slow, >> and this expectation is having nothing to do with failmode on the >> failing pool. And NFS running above ZFS should continue serving >> filesystems from available pools even if some pools are faulted, >> again >> nothing to do with failmode. >> > I agree with the bulk of this post, but I''d like to add to this > last point. > I''ve had a few problems with ZFS tools hanging on recent builds due to > problems with a pool on a USB stick. One tiny $20 component > causing a fault > that required a reboot of the host. This really shouldn''t happen.Let''s not be too quick to assign blame, or to think that perfecting the behaviour is straightforward or even possible. Traditionally, systems bearing ''enterprisey'' expectations were/are integrated hardware and software from one vendor (e.g. Sun) which could be certified as a unit. Start introducing ''random $20 components'' and you begin to dilute the quality and predictability of the composite system''s behaviour. If hard drive firmware is as cr*ppy as anecdotes indicate, what can we really expect from a $20 USB pendrive? --Toby> > Ian > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Tim
2008-Aug-27 22:40 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, Aug 27, 2008 at 5:33 PM, Miles Nordin <carton at ivy.net> wrote:> >>>>> "t" == Tim <tim at tcsac.net> writes: > > t> Solaris does not do this. > > yeah but the locators for local disks are still based on > pci/controller/channel not devid, so the disk will move to a different > device name if he changes BIOS from pci-ide to AHCI because it changes > the driver attachment. This may be the problem preventing his bootup, > rather than the known AHCI bug. >Except he was, and is referring to a non-root disk. If I''m using raw devices and I unplug my root disk and move it somewhere else, I would expect to have to update my boot loader.> Linux device names don''t move as much if you use LVM2, as some of the > distros do by default even for single-device systems. Device names > are then based on labels written onto the drive, which is a little > scary and adds a lot of confusion, but I think helps with this > moving-device problem and is analagous to what it sounds like ZFS > might do on the latest SXCE''s that don''t put zpool.cache in the boot > archive. >LVM hardly changes the way devices move around in Linux, or it''s horrendous handling of /dev. You are correct in that it is a step towards masking the ugliness. I, however, do not consider it a fix. Unfortunately it''s not used in the majority of the sites I am involved in, and as such isn''t any sort of help. The administration overhead it adds is not worth the hassle for the majority of my customers. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/61528124/attachment.html>
Bob Friesenhahn
2008-Aug-27 22:42 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, 27 Aug 2008, Miles Nordin wrote:> > In some sense the disk drivers and ZFS have different goals. The goal > of drivers should be to keep marginal disk/cabling/... subsystems > online as aggressively as possible, while the goal of ZFS should be to > notice and work around slightly-failing devices as soon as possible.My buffer did overflow from this email, but I still noticed the stated goal of ZFS, which might differ from the objectives the ZFS authors have been working toward these past seven years. Could you please define "slightly-failing device" as well as how ZFS can know when the device is slightly-failing so it can start to work around it? Thanks, Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Tim
2008-Aug-27 22:43 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, Aug 27, 2008 at 5:39 PM, Toby Thain <toby at telegraphics.com.au>wrote:> > > Let''s not be too quick to assign blame, or to think that perfecting > the behaviour is straightforward or even possible. > > Traditionally, systems bearing ''enterprisey'' expectations were/are > integrated hardware and software from one vendor (e.g. Sun) which > could be certified as a unit. >PSSSHHH, Sun should be certifying every piece of hardware that is, or will ever be released. Community putback shmamunnity putback.> > Start introducing ''random $20 components'' and you begin to dilute the > quality and predictability of the composite system''s behaviour. >But this NEVER happens on linux *grin*.> > If hard drive firmware is as cr*ppy as anecdotes indicate, what can > we really expect from a $20 USB pendrive? > > --Toby > >Perfection? --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/079adda4/attachment.html>
Miles Nordin
2008-Aug-27 23:02 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "t" == Tim <tim at tcsac.net> writes:t> Except he was, and is referring to a non-root disk. wait, what? his root disk isn''t plugged into the pci-ide controller? t> LVM hardly changes the way devices move around in Linux, fine, be pedantic. It makes systems boot and mount all their filesystems including ''/'' even when you move disks around. agreed now? There''s a simpler Linux way of doing this which I use on my Linux systems: mounting by the UUID in the filesystem''s superblock. But I think RedHat is using LVM2 to do it. Anyway modern Linux systems don''t put names like /dev/sda in /etc/fstab, and they don''t use these names to find the root filesystem either---they have all that LVM2 stuff in the early userspace. Solaris seems to be going the same ``mount by label'''' direction with ZFS (except with zpool.cache, devid''s, and mpxio, it''s a bit of a hybrid approach---when it goes out searching for labels, and when it expects devices to be on the same bus/controller/channel, isn''t something I fully understand yet and I expect will only become clear through experience). -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/637b3d44/attachment.bin>
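For comparison, the mount-by-UUID approach on the Linux side looks roughly like this; a sketch, not taken from any particular distro's defaults, with the UUID value obviously a placeholder:

# blkid /dev/sda1                    (prints the filesystem's UUID)

and then in /etc/fstab:

UUID=0a1b2c3d-placeholder   /export/media   xfs   defaults   0   2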
Ian Collins
2008-Aug-27 23:04 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Toby Thain writes:> > On 27-Aug-08, at 7:21 PM, Ian Collins wrote: > >> Miles Nordin writes: >> >>> >>> In addition, I''m repeating myself like crazy at this point, but ZFS >>> tools used for all pools like ''zpool status'' need to not freeze when a >>> single pool, or single device within a pool, is unavailable or slow, >>> and this expectation is having nothing to do with failmode on the >>> failing pool. And NFS running above ZFS should continue serving >>> filesystems from available pools even if some pools are faulted, again >>> nothing to do with failmode. >>> >> I agree with the bulk of this post, but I''d like to add to this last >> point. >> I''ve had a few problems with ZFS tools hanging on recent builds due to >> problems with a pool on a USB stick. One tiny $20 component causing a >> fault >> that required a reboot of the host. This really shouldn''t happen. > > Let''s not be too quick to assign blame, or to think that perfecting the > behaviour is straightforward or even possible. >I''m not assigning blame, just illustrating a problem. If you look back a week or so you will see a thread I started with the subject " ZFS commands hanging in B95". This thread went off list but the cause was tracked back to a problem with a USB pool.> Traditionally, systems bearing ''enterprisey'' expectations were/are > integrated hardware and software from one vendor (e.g. Sun) which could > be certified as a unit. > > Start introducing ''random $20 components'' and you begin to dilute the > quality and predictability of the composite system''s behaviour. >So we shouldn''t be using USB sticks to transfer data between home and office systems? If the stick was a FAT device and it crapped out or was removed without unmounting, the system would not have hung.> If hard drive firmware is as cr*ppy as anecdotes indicate, what can we > really expect from a $20 USB pendrive? >All the more reason not to lock up if one craps out. Ian
Richard Elling
2008-Aug-27 23:24 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> > > >> If you really mean there are devices out there which never > >> return error codes, and always silently return bad data, please > >> tell us which one and the story of when you encountered it, > > re> I blogged about one such case. > re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file > > re> However, I''m not inclined to publically chastise the vendor or > re> device model. It is a major vendor and a popular > re> device. ''nuff said. > > It''s not really enough for me, but what''s more the case doesn''t match > what we were looking for: a device which ``never returns error codes, > always returns silently bad data.'''' I asked for this because you said > ``However, not all devices return error codes which indicate > unrecoverable reads,'''' which I think is wrong. Rather, most devices > sometimes don''t, not some devices always don''t. >I really don''t know how to please you. I''ve got a bunch of borken devices of all sorts. If you''d like to stop by some time and rummage in the boneyard, feel free. Make it quick before my wife makes me clean up :-) For the device which I mentioned in my blog, it does return bad data far more often than I''d like. But that is why I only use it for testing and don''t store my wife''s photo album on it. Anyone who has been around for a while will have similar anecdotes.> Your experience doesn''t say anything about this drive''s inability to > return UNC errors. It says you suspect it of silently returning bad > data, once, but your experience doesn''t even clearly implicate the > device once: It could have been cabling/driver/power-supply/zfs-bugs > when the block was written. I was hoping for a device in your ``bad > stack'''' which does it over and over. > > Remember, I''m not arguing ZFS checksums are worthless---I think > they''re great. I''m arguing with your original statement that ZFS is > the only software RAID which deals with the dominant error you find in > your testing, unrecoverable reads. This is untrue! >To be clear. I claim: 1. The dominant failure mode in my field data for magnetic disks is unrecoverable reads. You need some sort of data protection to get past this problem. 2. Unrecoverable reads are not always reported by disk drives. 3. You really want a system that performs end-to-end data verification, and if you don''t bother to code that into your applications, then you might rely on ZFS to do it for you. If you ignore this problem, it will not go away.> re> This number should scare the *%^ out of you. It basically > re> means that no data redundancy is a recipe for disaster. > > yeah, but that 9.5% number alone isn''t an argument for ZFS over other > software LVM''s. > > re> 0.466%/yr is a per-disk rate. If you have 10 disks, your > re> exposure is 4.6% per year. For 100 disks, 46% per year, etc. > > no, you''re doing the statistics wrong, and in a really elementary way. > You''re counting multiple times the possible years in which more than > one disk out of the hundred failed. If what you care about for 100 > disks is that no disk experiences an error within one year, then you > need to calculate > > (1 - 0.00466) ^ 100 = 62.7% > > so that''s 37% probability of silent corruption. For 10 disks, the > mistake doesn''t make much difference and 4.6% is about right. >Indeed. Intuitively, the AFR and population is more easily grokked by the masses. 
But if you go into a customer and say "dude, there is only a 62.7% chance that your system won''t be affected by a silent data corruption problem this year with my (insert favorite non-ZFS, non-NetApp solution here)" then you will have a difficult sale.> I don''t dispute ZFS checksums have value, but the point stands that > the reported-error failure mode is 20x more common in netapp''s study > than this one, and other software LVM''s do take care of the more > common failure mode. >I agree.> re> UNCs don''t cause ZFS to freeze as long as failmode != wait or > re> ZFS manages the data redundancy. > > The time between issuing the read and getting the UNC back can be up > to 30 seconds, and there are often several unrecoverable sectors in a > row as well as lower-level retries multiplying this 30-second value. > so, it ends up being a freeze. >Untrue. There are disks which will retry forever. But don''t take my word for it, believe another RAID software vendor: http://blogs.sun.com/relling/entry/adaptec_webinar_on_disks_and [sorry about the redirect, you have to sign up for an Adaptec webinar before you can get to the list of webinars, so it is hard to provide the direct URL] Incidentally, I have one such disk in my boneyard, but it isn''t much fun to work with because it just sits there and spins when you try to access the bad sector.> To fix it, ZFS needs to dispatch read requests for redundant data if > the driver doesn''t reply quickly. ``Quickly'''' can be ambiguous, but > the whole point of FMD was supposed to be that complicated statistics > could be collected at various levels to identify even more subtle > things than READ and CKSUM errors, like drives that are working at > 1/10th the speed they should be, yet right now we can''t even flag a > drive taking 30 seconds to read a sector. ZFS is still ``patiently > waiting'''', and now that FMD is supposedly integrated instead of a > discussion of what knobs and responses there are, you''re passing the > buck to the drivers and their haphazard nonuniform exception state > machines. The best answer isn''t changing drivers to make the drive > timeout in 15 seconds instead---it''s to send the read to other disks > quickly using a very simple state machine, and start actually using > FMD and a complicated state machine to generate suspicion-events for > slow disks that aren''t returning errors. >I think the proposed timeouts here are too short, but the idea has merit. Note that such a preemptive read will have negative performance impacts for high-workload systems, so it will not be a given that people will want this enabled by default. Designing such a proactive system which remains stable under high workloads may not be trivial. Please file an RFE at http://bugs.opensolaris.org> Also the driver and mid-layer need to work with the hypothetical > ZFS-layer timeouts to be as good as possible about not stalling the > SATA chip, the channel if there''s a port multiplier, or freezing the > whole SATA stack including other chips, just because one disk has an > outstanding READ command waiting to get an UNC back. > > In some sense the disk drivers and ZFS have different goals. The goal > of drivers should be to keep marginal disk/cabling/... subsystems > online as aggressively as possible, while the goal of ZFS should be to > notice and work around slightly-failing devices as soon as possible. 
> I thought the point of putting off reasonable exception handling for > two years while waiting for FMD, was to be able to pursue both goals > simultaneously without pressure to compromise one in favor of the > other. > > In addition, I''m repeating myself like crazy at this point, but ZFS > tools used for all pools like ''zpool status'' need to not freeze when a > single pool, or single device within a pool, is unavailable or slow, > and this expectation is having nothing to do with failmode on the > failing pool. And NFS running above ZFS should continue serving > filesystems from available pools even if some pools are faulted, again > nothing to do with failmode. > >You mean something like: http://bugs.opensolaris.org/view_bug.do?bug_id=6667208 http://bugs.opensolaris.org/view_bug.do?bug_id=6667199 Yes, we all wish these to be fixed soon.> Neither is the case now, and it''s not a driver fix, but even beyond > fixing these basic problems there''s vast room for improvement, to > deliver something better than LVM2 and closer to NetApp, rather than > just catching up. >If you find more issues, then please file bugs. http://bugs.opensolaris.org -- richard
Ian Collins
2008-Aug-27 23:41 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Richard Elling writes:

> I think the proposed timeouts here are too short, but the idea has
> merit. Note that such a preemptive read will have negative performance
> impacts for high-workload systems, so it will not be a given that people
> will want this enabled by default. Designing such a proactive system
> which remains stable under high workloads may not be trivial.

Isn't this how things already work with mirrors? By this I mean requests
are issued to all devices and if the first returned data is OK, the
others are not required.

Ian
Richard Elling
2008-Aug-27 23:59 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Ian Collins wrote:

> Richard Elling writes:
>> I think the proposed timeouts here are too short, but the idea has
>> merit. Note that such a preemptive read will have negative performance
>> impacts for high-workload systems, so it will not be a given that people
>> will want this enabled by default. Designing such a proactive system
>> which remains stable under high workloads may not be trivial.
>
> Isn't this how things already work with mirrors? By this I mean requests
> are issued to all devices and if the first returned data is OK, the
> others are not required.

No. Yes. Sometimes. The details on choice of read targets vary by
implementation. I've seen some telco systems which work this way, but
most of the general purpose systems will choose one target for the read
based on some policy: round-robin, location, etc. This way you could get
the read performance of all disks operating concurrently.

-- richard
Ah yes - that video is what got this whole thing going in the first place... I referenced it in one of my other posts much earlier. Heh... there''s something gruesomely entertaining about brutishly taking a drill or sledge hammer to a piece of precision hardware like that. But yes, that''s the kind of torture test I would like to conduct, however, I''m operating on a limited test-budget right now, and I have to get the damn thing working in the first place before I start performing tests I can''t easily reverse (I still have yet to fire up Bonnie++ and do some benchmarking), and most definitely before I can put on a show for those who control the draw strings to the purse... But, imagine: walking into... oh say, I dunno... your manager''s office, for example, and asking him to beat the hell out of one of your server''s hard drives all the while promising him that no data would be lost, and none of his video on demand customers would ever notice an interruption in service. He might think you''re crazy, but if it still works at the end of the day, your annual budget just might get a sizable increase to help you make all the other servers "sledge hammer resistant" like the first one. ;) But that''s just an example. That functionality could (and probably does) prove useful almost anywhere. This message posted from opensolaris.org
Ian Collins
2008-Aug-28 00:38 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Richard Elling writes:

> Ian Collins wrote:
>> Richard Elling writes:
>>> I think the proposed timeouts here are too short, but the idea has
>>> merit. Note that such a preemptive read will have negative performance
>>> impacts for high-workload systems, so it will not be a given that people
>>> will want this enabled by default. Designing such a proactive system
>>> which remains stable under high workloads may not be trivial.
>>
>> Isn't this how things already work with mirrors? By this I mean requests
>> are issued to all devices and if the first returned data is OK, the
>> others are not required.
>
> No. Yes. Sometimes. The details on choice of read targets vary by
> implementation. I've seen some telco systems which work this way, but
> most of the general purpose systems will choose one target for the read
> based on some policy: round-robin, location, etc. This way you could get
> the read performance of all disks operating concurrently.

Would it be possible to get ZFS to work the way I described? I was
looking at using an exported iSCSI target from a machine in another
building to mirror a fileserver with a mainly (>95%) read workload. A
"first read back wins" implementation would be a good fit for that
situation.

Ian
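To make the two policies in this sub-thread concrete, a minimal sketch,
assuming each mirror side is just a callable that returns the requested
bytes (an illustration of the idea only, not of how ZFS's mirror vdev is
implemented):

    import itertools, threading

    class MirrorReader(object):
        def __init__(self, sides):
            self.sides = list(sides)
            self._next = itertools.cycle(range(len(self.sides)))

        def read_round_robin(self, offset, length):
            """What most general-purpose implementations do: pick one side
            per read, so concurrent reads spread across all sides."""
            return self.sides[next(self._next)](offset, length)

        def read_first_back(self, offset, length):
            """Issue the read to every side and return whichever answer
            arrives first.  Lower latency when one side is slow or remote
            (e.g. iSCSI over a campus link), at the cost of extra I/O."""
            first = []
            done = threading.Event()

            def worker(side):
                data = side(offset, length)
                if not done.is_set():
                    first.append(data)
                    done.set()

            for side in self.sides:
                threading.Thread(target=worker, args=(side,)).start()
            done.wait()
            return first[0]

The first-read-back policy buys latency at the cost of issuing every
read to every side, which is the bandwidth concern raised elsewhere in
the thread.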
Miles Nordin
2008-Aug-28 01:27 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> I really don''t know how to please you. dd from the raw device instead of through ZFS would be better. If you could show that you can write data to a sector, and read back different data, without getting an error, over and over, I''d be totally stunned. The netapp paper was different from your test in many ways that make their claim that ``all drives silently corrupt data sometimes'''' more convincing than your claim that you have ``one drive which silently corrupts data always and never returns UNC'''': * not a desktop. The circumstances were more tightly-controlled, and their drive population installed in a repeated way * their checksum measurement was better than ZFS''s by breaking the type of error up into three buckets instead of one, and their filesystem more mature, and their filesystem is not already known to count CKSUM errors for circumstances other than silent corruption, which argues the checksums are less likely to come from software bugs * they make statistical arguments that at least some of the errors are really coming from the drives by showing they have spatial locality w.r.t. the LBA on the drive, and are correlated with drive age and impending drive failure. The paper was less convincing in one way: * their drives are using nonstandard firmware re> Anyone who has been around for a while will have similar re> anecdotes. yeah, you''d think, but my similar anecdote is that (a) I can get UNC''s repeatably on a specific bad sector that persist either forever or until I write new data to that sector with dd, and do get them on at least 10% of my drives per year, and (b) I get CKSUM errors from ZFS all the time with my iSCSI ghetto-SAN and with an IDE/Firewire mirror, often from things I can specifically trace back to not-a-drive-failure, but so far never from something I can for certain trace back to silent corruption by the disk drive. I don''t doubt that it happens, but CKSUM isn''t a way to spot it. ZFS may give me a way to stop it, but it doesn''t give me an accurate way to measure/notice it. re> Indeed. Intuitively, the AFR and population is more easily re> grokked by the masses. It''s nothing to do with masses. There''s an error in your math. It''s not right under any circumstance. Your point that a 100 drive population has bad/high odds of having silent corruption within a year isn''t diminished by the correction, but it would be nice if you would own up to the statistics mistake since we''re taking you at your word on a lot of other statistics. >> so, it ends up being a freeze. re> Untrue. There are disks which will retry forever. I don''t understand. ZFS freezes until the disk stops retrying and returns an error. Because some disks never stop retrying and never return an error, just lock up until they''re power-cycled, it''s untrue that ZFS freezes? I think either you or I have lost the thread of the argument in our reply chain bantering. re> please file bugs. k., I filed the NFS bug, but unfortunately I don''t have output to cut and paste into it. glad to see the ''zpool status'' bug is there already and includes the point that lots of other things are probably hanging which shouldn''t. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/8be11f79/attachment.bin>
James C. McPherson
2008-Aug-28 12:52 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Hi Todd, sorry for the delay in responding, been head down rewriting a utility for the last few days. Todd H. Poole wrote:> Howdy James, > > While responding to halstead''s post (see below), I had to restart several > times to complete some testing. I''m not sure if that''s important to these > commands or not, but I just wanted to put it out there anyway. > >> A few commands that you could provide the output from >> include: >> >> >> (these two show any FMA-related telemetry) >> fmadm faulty >> fmdump -v > > This is the output from both commands: > > todd at mediaserver:~# fmadm faulty > --------------- ------------------------------------ -------------- --------- > TIME EVENT-ID MSG-ID SEVERITY > --------------- ------------------------------------ -------------- --------- > Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD Major > > Fault class : fault.fs.zfs.vdev.io > Description : The number of I/O errors associated with a ZFS device exceeded > acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD > for more information. > Response : The device has been offlined and marked as faulted. An attempt > will be made to activate a hot spare if available. > Impact : Fault tolerance of the pool may be compromised. > Action : Run ''zpool status -x'' and replace the bad device.>> todd at mediaserver:~# fmdump -v > TIME UUID SUNW-MSG-ID > Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD > 100% fault.fs.zfs.vdev.io > > Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > FRU: - > Location: -In other emails in this thread you''ve mentioned the desire to get an email (or some sort of notification) when Problems Happen(tm) in your system, and the FMA framework is how we achieve that in OpenSolaris. # fmadm config MODULE VERSION STATUS DESCRIPTION cpumem-retire 1.1 active CPU/Memory Retire Agent disk-transport 1.0 active Disk Transport Agent eft 1.16 active eft diagnosis engine fabric-xlate 1.0 active Fabric Ereport Translater fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis io-retire 2.0 active I/O Retire Agent snmp-trapgen 1.0 active SNMP Trap Generation Agent sysevent-transport 1.0 active SysEvent Transport Agent syslog-msgs 1.0 active Syslog Messaging Agent zfs-diagnosis 1.0 active ZFS Diagnosis Engine zfs-retire 1.0 active ZFS Retire Agent You''ll notice that we''ve got an SNMP agent there... and you can acquire a copy of the FMA mib from the Fault Management community pages (http://opensolaris.org/os/community/fm and http://opensolaris.org/os/community/fm/mib/).>> (this shows your storage controllers and what''s >> connected to them) cfgadm -lav > > This is the output from cfgadm -lav > > todd at mediaserver:~# cfgadm -lav > Ap_Id Receptacle Occupant Condition Information > When Type Busy Phys_Id > usb2/1 empty unconfigured ok > unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13:1 > usb2/2 connected configured ok > Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM) > NConfigs: 1 Config: 0 <no cfg str descr> > unavailable usb-mouse n /devices/pci at 0,0/pci1458,5004 at 13:2 > usb3/1 empty unconfigured ok[snip]> usb7/2 empty unconfigured ok > unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,1:2 > > You''ll notice that the only thing listed is my USB mouse... is that expected?Yup. One of the artefacts of the cfgadm architecture. cfgadm(1m) works by using plugins - usb, FC, SCSI, SATA, pci hotplug, InfiniBand... but not IDE. 
I think you also were wondering how to tell what controller instances your disks were using in IDE mode - two basic ways of achieving this: /usr/bin/iostat -En and /usr/sbin/format Your IDE disks will attach using the cmdk driver and show up like this: c1d0 c1d1 c2d0 c2d1 In AHCI/SATA mode they''d show up as c1t0d0 c1t1d0 c1t2d0 c1t3d0 or something similar, depending on how the bios and the actual controllers sort themselves out.>> You''ll also find messages in /var/adm/messages which >> might prove >> useful to review. > > If you really want, I can list the output from /var/adm/messages, but it > doesn''t seem to add anything new to what I''ve already copied and pasted.No need - you''ve got them if you need them. [snip]>> http://docs.sun.com/app/docs/coll/40.17 (manpages) >> http://docs.sun.com/app/docs/coll/47.23 (system admin collection) >> http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide >> http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guide > > Oohh... Thank you. Good Links. I''m bookmarking these for future reading. > They''ll definitely be helpful if we end up choosing to deploy OpenSolaris > + ZFS for our media servers.There''s a heap of info there, getting started with it can be like trying to drink from a fire hose :) Best regards, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
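Tying that back to the notification question: a rough sketch of a poller
that mails the output of fmadm faulty whenever it changes. The
addresses, interval, and use of a localhost SMTP relay are placeholders,
and the SNMP trap agent James lists above is the more natural production
route:

    import subprocess, smtplib, time
    from email.mime.text import MIMEText

    CHECK_INTERVAL = 300                      # seconds between polls
    MAIL_FROM = "root@mediaserver.example"    # placeholder
    MAIL_TO = "admin@example.com"             # placeholder

    def faulty_report():
        # run `fmadm faulty` (needs sufficient privileges) and return its text
        p = subprocess.Popen(["/usr/sbin/fmadm", "faulty"],
                             stdout=subprocess.PIPE)
        out, _ = p.communicate()
        return out.decode("utf-8", "replace").strip()

    last = ""
    while True:
        report = faulty_report()
        if report and report != last:
            msg = MIMEText(report)
            msg["Subject"] = "FMA fault report"
            msg["From"], msg["To"] = MAIL_FROM, MAIL_TO
            s = smtplib.SMTP("localhost")
            s.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
            s.quit()
        last = report
        time.sleep(CHECK_INTERVAL)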
Richard Elling
2008-Aug-28 13:49 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

>     re> Indeed. Intuitively, the AFR and population is more easily
>     re> grokked by the masses.
>
> It's nothing to do with masses. There's an error in your math. It's
> not right under any circumstance.

There is no error in my math. I presented a failure rate for a time
interval, you presented a probability of failure over a time interval.
The two are both correct, but say different things. Mathematically, an
AFR > 100% is quite possible and quite common. A probability of
failure > 100% (1.0) is not.

In my experience, failure rates described as annualized failure rates
(AFR) are more intuitive than their mathematically equivalent
counterpart: MTBF.

-- richard
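To put numbers on both readings of that per-disk 0.466%/yr figure, a
quick back-of-the-envelope (Python, illustration only):

    p = 0.00466   # per-disk chance of at least one silent corruption per year

    for n in (10, 100, 1000):
        expected = n * p                    # expected number of affected disks
        at_least_one = 1 - (1 - p) ** n     # probability one or more are affected
        print("%5d disks: expect %5.2f affected, P(>=1) = %5.1f%%"
              % (n, expected, at_least_one * 100))

    #   10 disks: expect  0.05, P(>=1) ~  4.6%  (the two figures nearly agree)
    #  100 disks: expect  0.47, P(>=1) ~ 37.3%  (not 46.6%: simple addition overcounts)
    # 1000 disks: expect  4.66, P(>=1) ~ 99.1%  (an expectation is not a probability)

Both quantities are legitimate; they just answer different questions,
which seems to be most of the disagreement.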
Robert Milkowski
2008-Aug-28 13:55 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Hello Miles, Wednesday, August 27, 2008, 10:51:49 PM, you wrote: MN> It''s not really enough for me, but what''s more the case doesn''t match MN> what we were looking for: a device which ``never returns error codes, MN> always returns silently bad data.'''' I asked for this because you said MN> ``However, not all devices return error codes which indicate MN> unrecoverable reads,'''' which I think is wrong. Rather, most devices MN> sometimes don''t, not some devices always don''t. Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf -- Best regards, Robert Milkowski mailto:milek at task.gda.pl http://milek.blogspot.com
Richard Elling
2008-Aug-28 15:04 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Robert Milkowski wrote:> Hello Miles, > > Wednesday, August 27, 2008, 10:51:49 PM, you wrote: > > MN> It''s not really enough for me, but what''s more the case doesn''t match > MN> what we were looking for: a device which ``never returns error codes, > MN> always returns silently bad data.'''' I asked for this because you said > MN> ``However, not all devices return error codes which indicate > MN> unrecoverable reads,'''' which I think is wrong. Rather, most devices > MN> sometimes don''t, not some devices always don''t. > > > > Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf > >You really don''t have to look very far to find this sort of thing. The scar just below my left knee is directly attributed to a bugid fixed in patch 106129-12. Warning: the following link may frighten experienced datacenter personnel, fortunately, the affected device is long since EOL. http://sunsolve.sun.com/search/document.do?assetkey=1-21-106129-12-1 -- richard
Miles Nordin
2008-Aug-28 16:54 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> There is no error in my math. I presented a failure rate for re> a time interval, What is a ``failure rate for a time interval''''? AIUI, the failure rate for a time interval is 0.46% / yr, no matter how many drives you have. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080828/3c7133a9/attachment.bin>
Miles Nordin
2008-Aug-28 16:55 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:rm> Please look for slides 23-27 at rm> http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf yeah, ok, ONCE AGAIN, I never said that checksums are worthless. relling: some drives don''t return errors on unrecoverable read events. carton: I doubt that. Tell me a story about one that doesn''t. Your stories are about storage subsystems again, not drives. Also most or all of the slides aren''t about unrecoverable read events. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080828/4f742127/attachment.bin>
Jonathan Loran
2008-Aug-28 18:13 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

> What is a ``failure rate for a time interval''?

Failure rate => failures/unit time
Failure rate for a time interval => (failures/unit time) * time

For example, if we have a failure rate:

  Fr = 46% failures/month

Then the expectation value of a failure in one year:

  Fe = 46% failures/month * 12 months = 5.52 failures

Jon

--
Jonathan Loran                   IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146    jloran at ssl.berkeley.edu
AST:7731^29u18e3
Miles Nordin
2008-Aug-28 18:42 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "jl" == Jonathan Loran <jloran at ssl.berkeley.edu> writes:jl> Fe = 46% failures/month * 12 months = 5.52 failures the original statistic wasn''t of this kind. It was ``likelihood a single drive will experience one or more failures within 12 months''''. so, you could say, ``If I have a thousand drives, about 4.66 of those drives will silently-corrupt at least once within 12 months.'''' It is 0.466% no matter how many drives you have. And it''s 4.66 drives, not 4.66 corruptions. The estimated number of corruptions is higher because some drives will corrupt twice, or thousands of times. It''s not a BER, so you can''t just add it like Richard did. If the original statistic in the paper were of the kind you''re talking about, it would be larger than 0.466%. I''m not sure it would capture the situation well, though. I think you''d want to talk about bits of recoverable data after one year, not corruption ``events'''', and this is not really measured well by the type of telemetry NetApp has. If it were, though, it would still be the same size number no matter how many drives you had. The 37% I gave was ``one or more within a population of 100 drives silently corrupts within 12 months.'''' The 46% Richard gave has no meaning, and doesn''t mean what you just said. The only statistic under discussion which (a) gets intimidatingly large as you increase the number of drives, and (b) is a ratio rather than, say, an absolute number of bits, is the one I gave. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080828/fa784f20/attachment.bin>
Anton B. Rang
2008-Aug-28 20:35 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk
Many mid-range/high-end RAID controllers work by having a small timeout on individual disk I/O operations. If the disk doesn''t respond quickly, they''ll issue an I/O to the redundant disk(s) to get the data back to the host in a reasonable time. Often they''ll change parameters on the disk to limit how long the disk retries before returning an error for a bad sector (this is standardized for SCSI, I don''t recall offhand whether any of this is standardized for ATA). RAID 3 units, e.g. DataDirect, issue I/O to all disks simultaneously and when enough (N-1 or N-2) disks return data, they''ll return the data to the host. At least they do that for full stripes. But this strategy works better for sequential I/O, not so good for random I/O, since you''re using up extra bandwidth. Host-based RAID/mirroring almost never takes this strategy for two reasons. First, the bottleneck is almost always the channel from disk to host, and you don''t want to clog it. [Yes, I know there''s more bandwidth there than the sum of the disks, but consider latency.] Second, to read from two disks on a mirror, you''d need two memory buffers. -- This message posted from opensolaris.org
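A minimal sketch of that controller strategy, assuming each disk is just
a callable returning bytes (illustration only -- real controllers do
this in firmware, and the half-second deadline is invented):

    import queue, threading

    SHORT_TIMEOUT = 0.5   # the "small timeout on individual disk I/O operations"

    def read_with_fallback(primary, mirror, offset, length):
        answers = queue.Queue()

        def issue(dev):
            answers.put(dev(offset, length))

        threading.Thread(target=issue, args=(primary,)).start()
        try:
            # happy path: the primary answers within the short deadline
            return answers.get(timeout=SHORT_TIMEOUT)
        except queue.Empty:
            # primary is slow (retrying a bad sector?): issue the same read
            # to the redundant side and return whichever copy arrives first
            threading.Thread(target=issue, args=(mirror,)).start()
            return answers.get()

The trade-off described above applies here too: the fallback read costs
host-side bandwidth and a second buffer, which is part of why host-based
mirroring rarely does this.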
Todd H. Poole
2008-Aug-30 05:05 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk
> Let's not be too quick to assign blame, or to think that perfecting
> the behaviour is straightforward or even possible.
>
> Start introducing random $20 components and you begin to dilute the
> quality and predictability of the composite system's behaviour.
>
> But this NEVER happens on linux *grin*.

Actually, it really doesn't! At least, it hasn't in many years... I
can't tell if you were being sarcastic or not, but honestly... you find
a USB drive that can bring down your Linux machine, and I'll show you
someone running a kernel from November of 2003.

And for all the other "cheaper" components out there? Those are the
components we make serious bucks off of. Just because it costs $30
doesn't mean it won't last a _really_ long time under stress! But if it
doesn't, even when hardware fails, software's always there to route
around it. So no biggie.

> Perfection?

Is Linux perfect? Not even close. But it's certainly a lot closer on
what the topic of this thread seems to cover: not crashing. Linux may
get a small number of things wrong, but it gets a ridiculously large
number of them right, and stability/reliability on unstable/unreliable
hardware is one of them. ;)

PS: I found this guy's experiment amusing. Talk about adding a bunch of
cheap, $20 crappy components to a system, and still seeing it soar.
http://linuxgazette.net/151/weiner.html
--
This message posted from opensolaris.org
> Wrt. what I've experienced and read in ZFS-discussion etc. list I've the
> __feeling__, that we would have got really into trouble, using Solaris
> (even the most recent one) on that system ...
> So if one asks me, whether to run Solaris+ZFS on a production system, I
> usually say: definitely, but only, if it is a Sun server ...
>
> My 2 cents ;-)

I can't agree with you more. I'm beginning to understand what the phrase
"Sun's software is great - as long as you're running it on Sun's
hardware" means...

Whether it's deserved or not, I feel like this OS isn't mature yet. And
maybe it's not the whole OS, maybe it's some specific subsection (like
ZFS), but my general impression of OpenSolaris has been... not stellar.

I don't think it's ready yet for a prime time slot on commodity
hardware. And while I don't intend to fan any flames that might already
exist (remember, I've only just joined within the past week, and thus
haven't been around long enough to figure out whether any flames exist),
I believe I'm justified in making the above statement. Just off the top
of my head, here is a list of red flags I've run into in 7 days' time:

- If I don't wait for at least 2 minutes before logging into my system
  after I've powered everything up, my machine freezes.
- If I yank a hard drive out of a (supposedly redundant) RAID5 array (or
  "RAID-Z zpool," as it's called) that has an NFS mount attached to it,
  not only does that mount point get severed, but _all_ NFS connections
  to all mount points are dropped, regardless of whether they were on
  the zpool or not. Oh, and then my machine freezes.
- If I just yank a hard drive out of a (supposedly redundant) RAID5
  array (or "RAID-Z zpool," as it's called), and forget about NFS
  entirely, my machine freezes.
- If I query a zpool for its status, but don't do so under the right
  circumstances, my machine freezes.

I've had to use the hard reset button on my case more times than I've
had the ability to shut down the machine properly from a non-frozen
console or GUI. That shouldn't happen.

I dunno. If this sounds like bitching, that's fine: I'll file bug
reports and then move on. It's just that sometimes, software needs to
grow a bit more before it's ready for production, and I feel like trying
to run OpenSolaris + ZFS on commodity hardware just might be one of
those times.

Just two more cents to add to yours. As Richard said, the only way to
fix things is to file bug reports. Hopefully, the most helpful things to
come out of this thread will be those forms of constructive criticism.
As for now, it looks like a return to LVM2, XFS, and one of the Linux or
BSD kernels might be a more stable decision, but don't worry - I haven't
been completely dissuaded, and I definitely plan on checking back in a
few releases to see how things are going in the ZFS world. ;)

Thanks everyone for your help, and keep improving! :)

-Todd
--
This message posted from opensolaris.org
On 30-Aug-08, at 2:32 AM, Todd H. Poole wrote:

>> Wrt. what I've experienced and read in ZFS-discussion etc. list I've
>> the __feeling__, that we would have got really into trouble, using
>> Solaris (even the most recent one) on that system ...
>> So if one asks me, whether to run Solaris+ZFS on a production system,
>> I usually say: definitely, but only, if it is a Sun server ...
>>
>> My 2 cents ;-)
>
> I can't agree with you more. I'm beginning to understand what the
> phrase "Sun's software is great - as long as you're running it on
> Sun's hardware" means...
> ...

Totally OT, but this is also why Apple doesn't sell OS X for whitebox
junk. :)

--Toby
On Sat, 30 Aug 2008 09:35:31 -0300 Toby Thain
<toby at telegraphics.com.au> wrote:

> On 30-Aug-08, at 2:32 AM, Todd H. Poole wrote:
>> I can't agree with you more. I'm beginning to understand what the
>> phrase "Sun's software is great - as long as you're running it on
>> Sun's hardware" means...
>
> Totally OT, but this is also why Apple doesn't sell OS X for
> whitebox junk. :)

There are also a lot of whiteboxes that -do- run Solaris very well.
"Some apples are rotten, others are healthy." That's quite normal.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
++ http://nagual.nl/ + SunOS sxce snv95 ++
On Fri, Aug 29, 2008 at 10:32 PM, Todd H. Poole <toddhpoole at gmail.com> wrote:

> I can't agree with you more. I'm beginning to understand what the
> phrase "Sun's software is great - as long as you're running it on
> Sun's hardware" means...
>
> Whether it's deserved or not, I feel like this OS isn't mature yet. And
> maybe it's not the whole OS, maybe it's some specific subsection (like
> ZFS), but my general impression of OpenSolaris has been... not stellar.
>
> I don't think it's ready yet for a prime time slot on commodity
> hardware.

I agree, but with careful research, you can find the *right* hardware.
In my quest (it took weeks) to find reports of reliable hardware, I
found that the AMD chipsets were way too buggy. I also noticed that of
the workstations Sun sells, they use nVidia nForce chipsets for the AMD
CPUs and the Intel X38 (the only Intel desktop chipset that supports
ECC) for the Intel CPUs. I read good and bad stories about various
hardware and decided I would stay close to what Sun sells. I've found NO
Sun hardware using the same chipset as yours.

There are a couple of AHCI bugs with the AMD/ATI SB600 chipset. Both
Linux and Solaris were affected. Linux put in a workaround that may hurt
performance slightly. Sun still has the bug open, but for what it's
worth, who's gonna use or care about a buggy desktop chipset in a
storage server?

I have an nVidia nForce 750a chipset (not the same as the Sun
workstations, which use nForce Pro, but it's not too different) and the
same CPU (45 Watt dual core!) you have. My system works great (so far).
I haven't tried the disconnect-drive issue though. I will try it
tonight.