Todd H. Poole
2008-Aug-24 04:06 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
Howdy yall,

Earlier this month I downloaded and installed the latest copy of OpenSolaris (2008.05) so that I could test out some of the newer features I've heard so much about, primarily ZFS.

My goal was to replace our aging Linux-based (SuSE 10.1) file and media server with a new machine running Sun's OpenSolaris and ZFS. Our old server ran your typical RAID5 setup with 4 500GB disks (3 data, 1 parity), used lvm, mdadm, and xfs to help keep things in order, and relied on NFS to export users' shares. It was solid, stable, and worked wonderfully well.

I would like to replicate this experience using the tools OpenSolaris has to offer, taking advantage of ZFS. However, there are enough differences between the two OSes - especially with respect to the filesystems and (for lack of a better phrase) "RAID managers" - to cause me to consult (on numerous occasions) the likes of Google, these forums, and other places for help. I've been successful in troubleshooting all problems up until now.

On our old media server (the SuSE 10.1 one), when a disk failed, the machine would send out an e-mail detailing the type of failure, and gracefully fall into a degraded state, but would otherwise continue to operate using the remaining 3 disks in the system. After the faulty disk was replaced, all of the data from the old disk would be replicated onto the new one (I think the term is "resilvered" around here?), and after a few hours, the RAID5 array would be seamlessly promoted from "degraded" back up to a healthy "clean" (or "online") state.

Throughout the entire process, there would be no interruptions to the end user: all NFS shares remained mounted, there were no noticeable drops in I/O, files, directories, and any other user-created data remained available, and if everything went smoothly, no one would notice a failure had even occurred.

I've tried my best to recreate something similar in OpenSolaris, but I'm stuck on making it all happen seamlessly.

For example, I have a standard beige box machine running OS 2008.05 with a zpool that contains 4 disks, similar to what the old SuSE 10.1 server had. However, whenever I unplug the SATA cable from one of the drives (to simulate a catastrophic drive failure) while doing moderate reading from the zpool (such as streaming HD video), not only does the video hang on the remote machine (which is accessing the zpool via NFS), but the server running OpenSolaris seems to either hang or become incredibly unresponsive.

And when I write unresponsive, I mean that when I type the command "zpool status" to see what's going on, the command hangs, followed by a frozen Terminal a few seconds later. After just a few more seconds, the entire GUI - mouse included - locks up or freezes, and all NFS shares become unavailable from the perspective of the remote machines. The whole machine locks up hard.

The machine then stays in this frozen state until I plug the hard disk back in, at which point everything, quite literally, pops back into existence all at once: the output of the "zpool status" command flies by (with all disks listed as "ONLINE" and all "READ," "WRITE," and "CKSUM" fields listed as "0"), the mouse jumps to a different part of the screen, the NFS share becomes available again, and the movie resumes right where it had left off.

While such a quick resume is encouraging, I'd like to avoid the freeze in the first place. How can I keep hardware failures like the above transparent to my users?
-Todd

PS: I've done some researching, and while my problem is similar to the following:

http://opensolaris.org/jive/thread.jspa?messageID=151719
http://opensolaris.org/jive/thread.jspa?messageID=240481

most of these posts are quite old, and do not offer any solutions.

PPS: I know I haven't provided any details on hardware, but I feel like this is more likely a higher-level issue (like some sort of configuration file or setting is needed) rather than a lower-level one (like faulty hardware). However, if someone were to give me a command to run, I'd gladly do it... I'm just not sure which ones would be helpful, or if I even know which ones to run. It took me half an hour of searching just to find out how to list the disks installed in this system (it's "format") so that I could build my zpool in the first place. It's not quite as simple as writing out /dev/hda, /dev/hdb, /dev/hdc, /dev/hdd. ;)
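For readers following along, a pool and NFS share like the ones described above can be set up with something along these lines. The pool name "tank", the filesystem name "tank/media", and the cNtNdN device names are only placeholders; "format" lists the real device names on a given system:

# format
(note the cNtNdN names of the four data disks, then quit without making changes)
# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
# zfs create tank/media
# zfs set sharenfs=on tank/media
# zpool status tank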
Tim
2008-Aug-24 04:13 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
On Sat, Aug 23, 2008 at 11:06 PM, Todd H. Poole <toddhpoole at gmail.com> wrote:
> [...]

It's a lower level one. What hardware are you running?
Hmm... I'm leaning away a bit from the hardware, but just in case you've got an idea, the machine is as follows:

CPU: AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model ADH4850DOBOX (http://www.newegg.com/Product/Product.aspx?Item=N82E16819103255)

Motherboard: GIGABYTE GA-MA770-DS3 AM2+/AM2 AMD 770 ATX All Solid Capacitor AMD Motherboard (http://www.newegg.com/Product/Product.aspx?Item=N82E16813128081)

RAM: G.SKILL 4GB (2 x 2GB) 240-Pin DDR2 SDRAM DDR2 800 (PC2 6400) Dual Channel Kit Desktop Memory Model F2-6400CL5D-4GBPQ (http://www.newegg.com/Product/Product.aspx?Item=N82E16820231122)

HDD (x4): Western Digital Caviar GP WD10EACS 1TB 5400 to 7200 RPM SATA 3.0Gb/s Hard Drive (http://www.newegg.com/Product/Product.aspx?Item=N82E16822136151)

The reason why I don't think there's a hardware issue is because before I got OpenSolaris up and running, I had a fully functional install of openSuSE 11.0 running (with everything similar to the original server) to make sure that none of the components were damaged during shipping from Newegg. Everything worked as expected.

Furthermore, before making my purchases, I made sure to check the HCL, and my processor and motherboard combination are supported: http://www.sun.com/bigadmin/hcl/data/systems/details/3079.html

But, like I said earlier, I'm new here, so you might be on to something that never occurred to me. Any ideas?
On Sat, Aug 23, 2008 at 11:41 PM, Todd H. Poole <toddhpoole at gmail.com> wrote:
> [...]

What are you using to connect the HDs to the system? The onboard ports? What driver is being used? AHCI, or IDE compatibility mode?

I'm not saying the hardware is bad, I'm saying the hardware is most likely the cause by way of driver. There really isn't any *setting* in Solaris I'm aware of that says "hey, freeze my system when a drive dies". That just sounds like hot-swap isn't working as it should be.

--Tim
Ross
2008-Aug-24 08:04 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
You're seeing exactly the same behaviour I found on my server, using a Supermicro AOC-SAT2-MV8 SATA controller. It's detailed on the forums under the topic "Supermicro AOC-SAT2-MV8 hang when drive removed", but unfortunately that topic split into 3 or 4 pieces, so it's a pain to find. I also reported it as a bug here: http://bugs.opensolaris.org/view_bug.do?bug_id=6735931
Ah, yes - all four hard drives are connected to the motherboard's onboard SATA II ports. There is one additional drive I have neglected to mention thus far (the boot drive) but that is connected via the motherboard's IDE channel, and has remained untouched since the install... I don't really consider it part of the problem, but I thought I should mention it just in case... you never know...

As for the drivers... well, I'm not sure of the command to determine that directly, but going under System > Administration > Device Driver Utility yields the following information under the "Storage" entry:

Components: "ATI Technologies Inc. SB600 IDE"
Driver: pci-ide
--Driver Information--
Driver: pci-ide
Instance: 1
Attach Status: Attached
--Hardware Information--
Vendor ID: 0x1002
Device ID: 0x438c
Class Code: 0001018a
DevPath: /pci@0,0/pci-ide@14,1

and

Components: "ATI Technologies Inc. SB600 Non-Raid-5 SATA"
Driver: pci-ide
--Driver Information--
Driver: pci-ide
Instance: 0
Attach Status: Attached
--Hardware Information--
Vendor ID: 0x1002
Device ID: 0x4380
Class Code: 0001018f
DevPath: /pci@0,0/pci-ide@12

Furthermore, there is one Driver Problem detected, but the error is under the "USB" entry. There are seven items listed:

Components: ATI Technologies Inc. SB600 USB Controller (EHCI)
Driver: ehci

Components: ATI Technologies Inc. SB600 USB (OHCI4)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI3)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI2)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI1)
Driver: ohci (Driver Misconfigured)

Components: ATI Technologies Inc. SB600 USB (OHCI0)
Driver: ohci

Components: Microsoft Corp. Wheel Mouse Optical
Driver: hid

As you can tell, the OHCI1 device isn't properly configured, but I don't know how to configure it (there's only a "Help", "Submit...", and "Close" button to click, no "Install Driver"). And, to tell you the truth, I'm not even sure it's worth mentioning because I don't have anything but my mouse plugged into USB, and even so... it's a mouse... plugged into USB... hardly something that is going to bring my machine to a grinding halt every time a SATA II disk gets yanked from a RAID-Z array (at least, I should hope the two don't have anything in common!).

And... wait... you mean to tell me that I can't just untick the checkbox that says "Hey, freeze my system when a drive dies" to solve this problem? Ugh. And here I was hoping for a quick fix... ;)

Anyway, how does the above sound? What else can I give you?

-Todd

PS: Thanks, by the way, for the support - I'm not sure where else to turn to for this kind of stuff!
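In case it helps, the same driver binding can usually be confirmed from a terminal; a minimal check, assuming only the stock prtconf tool (the grep patterns here are just examples):

# prtconf -D | grep -i pci-ide
(lists the device nodes currently bound to the pci-ide driver)
# prtconf -D | grep -i ahci
(no output here would suggest the ahci driver is not attached to anything)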
Ross
2008-Aug-24 08:31 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
PS. Does your system definitely support SATA hot swap? Could you for example test it under Windows to see if it runs fine there? I suspect this is a Solaris driver problem, but it would be good to have confirmation that the hardware handles this fine.
Todd H. Poole
2008-Aug-24 09:17 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
Hmm... You know, that's a good question. I'm not sure if those SATA II ports support hot swap or not. The motherboard is fairly new, but taking a look at the specifications provided by Gigabyte (http://www.gigabyte.com.tw/Products/Motherboard/Products_Spec.aspx?ProductID=2874) doesn't seem to yield anything. To tell you the truth, I think they're just plain ol' dumb SATA II ports - nothing fancy here.

But that's alright, because hot swap isn't something I'm necessarily chasing after. It would be nice, of course, but the thing that we want the most is stability during hardware failures. For this particular server, it is _far_ more important for the thing to keep chugging along and blow right through as many hardware failures as it can. If it's still got 3 of those 4 drives (which implies at least 2 data and 1 parity, or 3 data and no parity) then I still want to be able to read and write to those NFS exports like nothing happened.

Then, at the end of the day, if we need to bring the machine down in order to install a new disk and resilver the RAID-Z array, that is perfectly acceptable. We could do that around 6:00 or so when everyone goes home for the day and when it's much more convenient for us and the users, and let the resilvering/repairing operation run overnight.

I also read the PDF summary you included in your link to your other post. And it seems we're seeing similar behavior here. Although, in this case, things are even simpler: there are only 4 drives in the case (not 8), and there is no extra controller card (just the ports on the motherboard)... It's hard to get any more basic than that.

As for testing in other OSes, unfortunately I don't readily have a copy of Windows available. But even if I did, I wouldn't know where to begin: almost all of my experience in server administration has been with Linux. For what it's worth, I have already established the above (that is, the seamless experience) with OpenSuSE 11.0 as the operating system, LVM as the volume manager, mdadm as the RAID manager, and XFS as the filesystem, so I know it can work... I just want to get it working with OpenSolaris and ZFS. :)
James C. McPherson
2008-Aug-24 12:30 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Todd H. Poole wrote:
> Hmm... I'm leaning away a bit from the hardware, but just in case you've
> got an idea, the machine is as follows:
>
> CPU: AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model
> ADH4850DOBOX
> (http://www.newegg.com/Product/Product.aspx?Item=N82E16819103255)
>
> Motherboard: GIGABYTE GA-MA770-DS3 AM2+/AM2 AMD 770 ATX All Solid
> Capacitor AMD Motherboard
> (http://www.newegg.com/Product/Product.aspx?Item=N82E16813128081)
..
> The reason why I don't think there's a hardware issue is because before I
> got OpenSolaris up and running, I had a fully functional install of
> openSuSE 11.0 running (with everything similar to the original server) to
> make sure that none of the components were damaged during shipping from
> Newegg. Everything worked as expected.

Yes, but you're running a new operating system, new filesystem... that's a mountain of difference right in front of you.

A few commands that you could provide the output from include:

(these two show any FMA-related telemetry)
fmadm faulty
fmdump -v

(this shows your storage controllers and what's connected to them)
cfgadm -lav

You'll also find messages in /var/adm/messages which might prove useful to review.

Apart from that, your description of what you're doing to simulate failure is "however, whenever I unplug the SATA cable from one of the drives (to simulate a catastrophic drive failure) while doing moderate reading from the zpool (such as streaming HD video), not only does the video hang on the remote machine (which is accessing the zpool via NFS), but the server running OpenSolaris seems to either hang, or become incredibly unresponsive."

First and foremost, for me, this is a stupid thing to do. You've got common-or-garden PC hardware which almost *definitely* does not support hot plug of devices. Which is what you're telling us that you're doing. Would you try this with your pci/pci-e cards in this system? I think not.

If you absolutely must do something like this, then please use what's known as "coordinated hotswap" using the cfgadm(1m) command. Viz:

(detect fault in disk c2t3d0, in some way)
# cfgadm -c unconfigure c2::dsk/c2t3d0
# cfgadm -c disconnect c2::dsk/c2t3d0
(go and swap the drive, plug in new drive with same cable)
# zpool replace -f poolname c2t3d0

What this will do is tell the kernel to do things in the right order, and - for zpool - tell it to do an in-place replacement of device c2t3d0 in your pool.

There are manpages and admin guides you could have a look through, too:

http://docs.sun.com/app/docs/coll/40.17 (manpages)
http://docs.sun.com/app/docs/coll/47.23 (system admin collection)
http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide
http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guide

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
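To round that sequence out, bringing the replacement drive back under a coordinated hotswap would look roughly like the sketch below. The c2::dsk/c2t3d0 attachment point and the pool name are carried over from the example above and will likely differ on a real system (cfgadm -lav shows the actual attachment points):

# cfgadm -c connect c2::dsk/c2t3d0
# cfgadm -c configure c2::dsk/c2t3d0
# zpool replace -f poolname c2t3d0
# zpool status poolname
(watch the resilver progress until the pool reports ONLINE again)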
I'm pretty sure pci-ide doesn't support hot-swap. I believe you need ahci.

On 8/24/08, Todd H. Poole <toddhpoole at gmail.com> wrote:
> [...]
Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean, like I said in my response to myxiplx, if I have to bring down the machine in order to replace a faulty drive, that's perfectly acceptable - I can do that whenever it's most convenient for me.

What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_) is if the machine hangs/freezes/locks up or is otherwise brought down by an isolated failure in a supposedly redundant array... Yanking the drive is just how I chose to simulate that failure. I could just as easily have decided to take a sledgehammer or power drill to it,

http://www.youtube.com/watch?v=CN6iDzesEs0 (fast-forward to the 2:30 part)
http://www.youtube.com/watch?v=naKd9nARAes

and the machine shouldn't have skipped a beat. After all, that's the whole point behind the "redundant" part of RAID, no?

And besides, RAID's been around for almost 20 years now... It's nothing new. I've seen (countless times, mind you) plenty of regular old IDE drives fail in a simple software RAID5 array and not bring the machine down at all. Granted, you still had to power down to re-insert a new one (unless you were using some fancy controller card), but the point remains: the machine would still work perfectly with only 3 out of 4 drives present... So I know for a fact this type of stability can be achieved with IDE.

What I'm getting at is this: I don't think the method by which the drives are connected - or whether or not that method supports hot-swap - should matter. A machine _should_not_ crash when a single drive (out of a 4 drive ZFS RAID-Z array) is ungracefully removed, regardless of how abruptly that drive is excised (be it by a slow failure of the drive motor's spindle, by yanking the drive's power cable, by yanking the drive's SATA connector, by smashing it to bits with a sledgehammer, or by drilling into it with a power drill).

So we've established that one potential workaround is to use the ahci instead of the pci-ide driver. Good! I like this kind of problem solving! But that's still side-stepping the problem... While this machine is entirely SATA II, what about those who have a mix between SATA and IDE? Or even much larger entities whose vast majority of hardware is only a couple of years old, and still entirely IDE?

I'm grateful for your help, but is there another way that you can think of to get this to work?
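For anyone who does want to try the ahci route, the usual recipe - with the caveat that BIOS menu names vary by vendor and this is only a sketch - is to switch the onboard SATA controller from IDE/compatibility mode to AHCI in the BIOS setup, reboot, and then confirm which driver attached:

# prtconf -D | grep -i ahci
(the SB600 SATA controller should now show the ahci driver instead of pci-ide)
# cfgadm -al
(with the ahci driver and SATA framework in use, the SATA ports should appear as attachment points such as sata0/0, sata0/1, and so on)

Device paths can change when the controller mode changes, so exporting the pool before the switch and importing it afterwards (zpool export / zpool import) is a sensible precaution.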
James C. McPherson
2008-Aug-24 20:52 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Tim wrote:
> I'm pretty sure pci-ide doesn't support hot-swap. I believe you need ahci.

You're correct, it doesn't. Furthermore, to the best of my knowledge, it won't ever support hotswap.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
James C. McPherson
2008-Aug-24 21:28 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Todd H. Poole wrote:
> Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean, like I said in my response to myxiplx, if I have to bring down the machine in order to replace a faulty drive, that's perfectly acceptable - I can do that whenever it's most convenient for me.
>
> What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_) is if the machine hangs/freezes/locks up or is otherwise brought down by an isolated failure in a supposedly redundant array... Yanking the drive is just how I chose to simulate that failure. I could just as easily have decided to take a sledgehammer or power drill to it,

But you're not attempting hotswap, you're doing hot plug.... and unless you're using the onboard BIOS' concept of an actual RAID array, you don't have an array, you've got a JBOD and it's not a real JBOD - it's a PC motherboard which does _not_ have the same electronic and electrical protections that a JBOD has *by design*.

> http://www.youtube.com/watch?v=CN6iDzesEs0 (fast-forward to the 2:30 part)
> http://www.youtube.com/watch?v=naKd9nARAes
>
> and the machine shouldn't have skipped a beat. After all, that's the whole point behind the "redundant" part of RAID, no?

Sigh.

> And besides, RAID's been around for almost 20 years now... It's nothing new. I've seen (countless times, mind you) plenty of regular old IDE drives fail in a simple software RAID5 array and not bring the machine down at all. Granted, you still had to power down to re-insert a new one (unless you were using some fancy controller card), but the point remains: the machine would still work perfectly with only 3 out of 4 drives present... So I know for a fact this type of stability can be achieved with IDE.

And you're right, it can. But what you've been doing is outside the bounds of what IDE hardware on a PC motherboard is designed to cope with.

> What I'm getting at is this: I don't think the method by which the drives are connected - or whether or not that method supports hot-swap - should matter.

Well sorry, it does. Welcome to an OS which does care.

> A machine _should_not_ crash when a single drive (out of a 4 drive ZFS RAID-Z array) is ungracefully removed, regardless of how abruptly that drive is excised (be it by a slow failure of the drive motor's spindle, by yanking the drive's power cable, by yanking the drive's SATA connector, by smashing it to bits with a sledgehammer, or by drilling into it with a power drill).

If the controlling electronics for your disk can't handle it, then you're hosed. That's why FC, SATA (in SATA mode) and SAS are much more likely to handle this out of the box. Parallel SCSI requires funky hardware, which is why those old 6- or 12-disk multipacks are so useful to have.

Of the failure modes that you suggest above, only one is going to give you anything other than catastrophic failure (drive motor degradation) - and that is because the drive's electronics will realise this, and send warnings to the host.... which should have its drivers written so that these messages are logged for the sysadmin to act upon.

The other failure modes are what we call catastrophic. And where your hardware isn't designed with certain protections around drive connections, you're hosed. No two ways about it. If your system suffers that sort of failure, would you seriously expect that non-hardened hardware would survive it?

> So we've established that one potential workaround is to use the ahci instead of the pci-ide driver. Good! I like this kind of problem solving! But that's still side-stepping the problem... While this machine is entirely SATA II, what about those who have a mix between SATA and IDE? Or even much larger entities whose vast majority of hardware is only a couple of years old, and still entirely IDE?

If you've got newer hardware, which can support SATA in native SATA mode, USE IT. Don't _ever_ try that sort of thing with IDE. As I mentioned above, IDE is not designed to be able to cope with what you've been inflicting on this machine.

> I'm grateful for your help, but is there another way that you can think of to get this to work?

You could start by taking us seriously when we tell you that what you've been doing is not a good idea, and find other ways to simulate drive failures.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
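For what it's worth, there are gentler ways to exercise the degraded-pool path than pulling cables; one sketch, assuming a raidz pool named tank containing a member disk c2t3d0, is to take the device offline administratively and bring it back afterwards:

# zpool offline -t tank c2t3d0
(the -t flag makes the offline temporary, i.e. it does not persist across a reboot)
# zpool status tank
(the pool should now report DEGRADED while continuing to serve data)
# zpool online tank c2t3d0
# zpool status tank
(any writes made while the disk was offline are resilvered onto it)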
> But you're not attempting hotswap, you're doing hot plug....

Do you mean hot UNplug? Because I'm not trying to get this thing to recognize any new disks without a restart... Honest. I'm just trying to prevent the machine from freezing up when a drive fails. I have no problem restarting the machine with a new drive in it later so that it recognizes the new disk.

> and unless you're using the onboard BIOS' concept of an actual
> RAID array, you don't have an array, you've got a JBOD and
> it's not a real JBOD - it's a PC motherboard which does _not_
> have the same electronic and electrical protections that a
> JBOD has *by design*.

I'm confused by what your definition of a RAID array is, and for that matter, what a JBOD is... I've got plenty of experience with both, but just to make sure I wasn't off my rocker, I consulted the demigod:

http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/JBOD

and I think what I'm doing is indeed RAID... I'm not using some sort of controller card, or any specialized hardware, so it's certainly not hardware RAID (and thus doesn't contain any of the fancy electronic or electrical protections you mentioned), but lacking said protections doesn't preclude the machine from being considered a RAID. All the disks are the same capacity, the OS still sees the zpool I've created as one large volume, and since I'm using RAID-Z (RAID5), it should be redundant... What other qualifiers out there are necessary before a system can be called RAID compliant?

If it's hot-swappable technology, or a controller hiding the details from the OS and instead presenting a single volume, then I would argue those things are extra - not a fundamental prerequisite for a system to be called a RAID.

Furthermore, while I'm not sure what the difference between a "real JBOD" and a plain old JBOD is, this set-up certainly wouldn't qualify for either. I mean, there is no concatenation going on, redundancy should be present (but due to this issue, I haven't been able to verify that yet), and all the drives are the same size... Am I missing something in the definition of a JBOD? I don't think so...

> And you're right, it can. But what you've been doing is outside
> the bounds of what IDE hardware on a PC motherboard is designed
> to cope with.

Well, yes, you're right, but it's not like I'm making some sort of radical departure outside of the bounds of the hardware... It really shouldn't be a problem so long as it's not an unreasonable departure, because that's where software comes in. When the hardware can't cut it, that's where software picks up the slack.

Now, obviously, I'm not saying software can do anything with any piece of hardware you give it - no matter how many lines of code you write, your keyboard isn't going to turn into a speaker - but when it comes to reasonable stuff like ensuring a machine doesn't crash because a user did something with the hardware that he or she wasn't supposed to do? Prime target for software.

And that's the way it's always been... The whole push behind that whole ZFS promise thing (or if you want to make it less specific, the attractiveness of RAID in general) was that "RAID-Z [wouldn't] require any special hardware. It doesn't need NVRAM for correctness, and it doesn't need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap, commodity disks." (http://blogs.sun.com/bonwick/entry/raid_z)

> Well sorry, it does. Welcome to an OS which does care.

The half-hearted apology wasn't necessary... I understand that OpenSolaris cares about the method those disks use to plug into the motherboard, but what I don't understand is why that limitation exists in the first place. It would seem much better to me to have an OS that doesn't care (but developers that do) and just finds a way to work, versus one that does care (but developers that don't) and instead isn't as flexible and gets picky... I'm not saying OpenSolaris is the latter, but I'm not getting the impression it's the former either...

> If the controlling electronics for your disk can't
> handle it, then you're hosed. That's why FC, SATA (in SATA
> mode) and SAS are much more likely to handle this out of
> the box. Parallel SCSI requires funky hardware, which is why
> those old 6- or 12-disk multipacks are so useful to have.
>
> Of the failure modes that you suggest above, only one
> is going to give you anything other than catastrophic
> failure (drive motor degradation) - and that is because the
> drive's electronics will realise this, and send warnings to
> the host.... which should have its drivers written so
> that these messages are logged for the sysadmin to act upon.
>
> The other failure modes are what we call catastrophic. And
> where your hardware isn't designed with certain protections
> around drive connections, you're hosed. No two ways
> about it. If your system suffers that sort of failure, would
> you seriously expect that non-hardened hardware would survive it?

Yes, I would. At the risk of sounding repetitive, I'll summarize what I've been getting at in my previous responses: I certainly _do_ think it's reasonable to expect non-hardened hardware to survive this type of failure. In fact, I think it's unreasonable _not_ to expect it to. The Linux kernel, the BSD kernels, and the NT kernel (or whatever chunk of code runs Windows) all provide this type of functionality, and have done so for some time. Granted, they may all do it in different ways, but at the end of the day, unplugging an IDE hard drive from a software RAID5 array in OpenSuSE or RedHat, FreeBSD, or Windows XP Professional will not bring the machine down. And it shouldn't in OpenSolaris either. There might be some sort of noticeable bump (Windows, for example, pauses for a few seconds while it tries to figure out what the hell just happened to one of its disks), but there isn't anything show-stopping...

> If you've got newer hardware, which can support SATA
> in native SATA mode, USE IT.

I'll see what I can do - this might be some sort of BIOS setting that can be configured.

>> I'm grateful for your help, but is there another way that you can think
>> of to get this to work?
> You could start by taking us seriously when we tell
> you that what you've been doing is not a good idea, and
> find other ways to simulate drive failures.

Let's drop the confrontational attitude - I'm not trying to dick around with you here. I've done my due diligence in researching this issue on Google, these forums, and Sun's documentation before making a post, I've provided any clarifying information that has been requested by those kind enough to post a response, and I've yet to resort to any witty or curt remarks in my correspondence with you, tcook, or myxiplx. Whatever is causing you to think I'm not taking anyone seriously, let me reassure you, I am.

The only thing I'm doing is testing a system by applying the worst case scenario of survivable torture to it and seeing how it recovers. If that's not a good idea, then I guess we disagree. But that's ok - you're James C. McPherson, Senior Kernel Software Engineer, Solaris, and I'm just some user who's trying to find a solution to his problem. My bad for expecting the same level of respect I've given two other members of this community to be returned in kind by one of its leaders.

So aside from telling me to "[never] try this sort of thing with IDE", does anyone else have any other ideas on how to prevent OpenSolaris from locking up whenever an IDE drive is abruptly disconnected from a ZFS RAID-Z array?

-Todd
Todd H. Poole wrote:
> [...]

I'm far from being an expert on this subject, but this is what I understand: Unplugging a drive (actually pulling the cable out) does not simulate a drive failure, it simulates a drive getting unplugged, which is something the hardware is not capable of dealing with. If your drive were to suffer something more realistic, along the lines of how you would normally expect a drive to die, then the system should cope with it a whole lot better.

Unfortunately, hard drives don't come with a big button saying "simulate head crash now" or "make me some bad sectors", so it's going to be difficult to simulate those failures. All I can say is that unplugging a drive yourself will not simulate a failure, it merely causes the disk to disappear. Dying or dead disks will still normally be able to communicate with the driver to some extent, so they are still "there".

If you were using dedicated hotswappable hardware, then I wouldn't expect to see the problem, but AFAIK off-the-shelf SATA hardware doesn't support this fully, so unexpected results will occur.

I hope this has been of some small help, even just to explain why the system didn't cope as you expected.

Matt
John Sonnenschein
2008-Aug-25 03:19 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
James isn't being a jerk because he hates you or anything...

Look, yanking the drives like that can seriously damage the drives or your motherboard. Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in Linux than anything else.
Aye mate, I had the exact same problem, but where I work, we pay some pretty serious dollars for a direct 24/7 line to some of Sun's engineers, so I decided to call them up. After spending some time with tech support, I never really got the thing resolved, and I instead ended up going back to Debian for all of our simple IDE-based file servers.

If you really just want ZFS, you can add it to whatever installation you've got now (openSuSE?) through something like zfs-fuse, but you might take a 10-15% performance hit. If you don't want that, and you're not too concerned with violating a few licenses, you can just add it to your installation yourself, the source code is out there. You know, roll your own. ;-) You just might be trying too hard to force a round peg into a square hole.

Hey, besides, where do you work? I registered because I know a guy with the same name.
On Mon, Aug 25, 2008 at 5:19 AM, John Sonnenschein <johnsonnenschein at gmail.com> wrote:
> James isn't being a jerk because he hates you or anything...
>
> Look, yanking the drives like that can seriously damage the drives or your motherboard.

It can, but it's not very likely to.

> Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in Linux than anything else.

That sounds more like defensiveness. Pulling out the cable isn't advisable, but it simulates the controller card on the disk going belly up pretty well. Unless he pulls the power at the same time, because that would also simulate a power failure.

If a piece of hardware stops responding you might do well to stop talking to it, but there is nothing admirable about locking up the OS if there is enough redundancy to continue without that particular chunk of metal.

--
Peter Bortas
Howdy Matt. Just to make it absolutely clear, I appreciate your response. I would be quite lost if it weren't for all of the input.

> Unplugging a drive (actually pulling the cable out) does not simulate a
> drive failure, it simulates a drive getting unplugged, which is
> something the hardware is not capable of dealing with.
>
> If your drive were to suffer something more realistic, along the lines
> of how you would normally expect a drive to die, then the system should
> cope with it a whole lot better.

Hmmm... I see what you're saying. But, ok, let me play devil's advocate. What about the times when a drive fails in a way the system didn't expect? What you said was right - most of the time, when a hard drive goes bad, SMART will pick up on its impending doom long before it's too late - but what about the times when the cause of the problem is larger or more abrupt than that (like tin whiskers causing shorts, or a server room technician yanking the wrong drive)?

To imply that OpenSolaris with a RAID-Z array of IDE drives will _only_ protect me from data loss during _specific_ kinds of failures (the ones which OpenSolaris considers "normal") is a pretty big implication... and is certainly a show-stopping one at that. Nobody is going to want to rely on an OS/RAID solution that can only survive certain types of drive failures while there are others out there that can survive the same and more... But then again, I'm not sure if that's what you meant... is that what you were getting at, or did I misunderstand?

> Unfortunately, hard drives don't come with a big button saying "simulate
> head crash now" or "make me some bad sectors" so it's going to be
> difficult to simulate those failures.

lol, if only they did - just having a button to push would make testing these types of things a lot easier. ;)

> All I can say is that unplugging a drive yourself will not simulate a
> failure, it merely causes the disk to disappear.

But isn't that a perfect example of a failure!? One in which the drive just seems to pop out of existence? lol, forgive me if I'm sounding pedantic, but why is there even a distinction between the two? This is starting to sound more and more like a bug...

> I hope this has been of some small help, even just to
> explain why the system didn't cope as you expected.

It has, thank you - I appreciate the response.
Howdy Matt, thanks for the response. But I dunno man... I think I disagree... I'm kinda of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along? Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds.

Which is exactly what Linux, BSD, and even Windows _don't_ do, and why their continual operation even under such failures wouldn't be considered a bug. When I yank a drive in a RAID5 array - any drive, be it IDE, SATA, USB, or Firewire - in OpenSuSE or RedHat, the kernel will immediately notice its absence, and inform lvm and mdadm (the software responsible for keeping the RAID array together). mdadm will then degrade the array, and consult whatever instructions root gave it when the sysadmin was configuring the array. If the sysadmin wanted the array to "stay up as long as it could," then it would continue to do that. If root wanted the array to be "brought down after any sort of drive failure," then the array would be unmounted. If root wanted to "power the machine down," then the machine will dutifully turn off. Shouldn't OpenSolaris do the same thing?

And as for James not being a jerk because he hates me, does that mean he's just always like that? lol, it's alright: let's not try to explain or excuse trollish behavior, and instead just call it out and expose it for what it is, and then be done with it. I certainly am.

Anyways, thanks for the input Matt.
Howdy 404, thanks for the response.

But I dunno man... I think I disagree... I'm kind of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along? Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds. Going catatonic is exactly what Linux, BSD, and even Windows _don't_ do, and that's why their continued operation under such failures wouldn't be considered a bug.

When I yank a drive in a RAID5 array - any drive, be it IDE, SATA, USB, or Firewire - in OpenSuSE or RedHat, the kernel will immediately notice its absence, and inform lvm and mdadm (the software responsible for keeping the RAID array together). mdadm will then degrade the array, and consult whatever instructions root gave it when the sysadmin was configuring the array. If the sysadmin wanted the array to "stay up as long as it could," then it would continue to do that. If root wanted the array to be "brought down after any sort of drive failure," then the array would be unmounted. If root wanted to "power the machine down," then the machine will dutifully turn off. Shouldn't OpenSolaris do the same thing?

And as for James not being a jerk because he hates me, does that mean he's just always like that? lol, it's alright: let's not try to explain or excuse trollish behavior, and instead just call it out and expose it for what it is, and then be done with it. I certainly am.

As always, thanks for the input.

This message posted from opensolaris.org
John Sonnenschein wrote:

> James isn't being a jerk because he hates you or anything...
>
> Look, yanking the drives like that can seriously damage the drives or your motherboard. Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in linux than anything else.

One point that's been overlooked in all the chest thumping - PCs vibrate and cables fall out. I had this happen with a SCSI connector. Luckily for me, it fell in a fan and made a lot of noise!

So pulling a drive is a possible, if rare, failure mode.

Ian
jalex? As in Justin Alex? If you're who I think you are, don't you have a pretty long list of things you need to get done for Jerry before your little vacation?

This message posted from opensolaris.org
alright, alright, but it's your fault. you left your workstation logged on, what was i supposed to do? not chime in?

grotty yank

This message posted from opensolaris.org
John Sonnenschein wrote:

> Look, yanking the drives like that can seriously damage the drives or
> your motherboard. Solaris doesn't let you do it and assumes that
> something's gone seriously wrong if you try it. That Linux ignores
> the behavior and lets you do it sounds more like a bug in linux than
> anything else.

OK, so far we've had a lot of knee-jerk defense of Solaris. Sorry, but that isn't helping. Let's get back to science here, shall we?

What happens when you remove a disk?

A) The driver detects the removal and informs the OS. Solaris appears to behave reasonably well in this case.

B) The driver does not detect the removal. Commands must time out before a problem is detected. Due to driver layering, timeouts increase rapidly, causing the OS to "hang" for unreasonable periods of time.

We really need to fix (B). It seems the "easy" fixes are:

- Configure faster timeouts and fewer retries on redundant devices, similar to drive manufacturers' RAID edition firmware. This could be via driver config file, or (better) automatically via ZFS, similar to write cache behaviour.

- Propagate timeouts quickly between layers (immediate soft fail without retry) or perhaps just to the fault management system

-- Carson
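To make "via driver config file" a bit more concrete: on systems where the disks attach through sd(7d), a per-device retry override can be sketched roughly as below. This is only an illustration - the vendor/product string is a placeholder, the name:value tunable syntax is only accepted by newer sd(7d) builds, and it would not apply to the original poster's setup, where the disks sit behind pci-ide/cmdk rather than sd.

  # /kernel/drv/sd.conf -- illustrative entry only; match the string to your own drives
  # cut the retry count used when commands time out, so a dead disk is faulted sooner
  sd-config-list = "ATA     WDC WD5000AAKS", "retries-timeout:1";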
Todd H. Poole wrote:

> Hmmm... I see what you're saying. But, ok, let me play devil's advocate. What about the times when a drive fails in a way the system didn't expect? What you said was right - most of the time, when a hard drive goes bad, SMART will pick up on its impending doom long before it's too late - but what about the times when the cause of the problem is larger or more abrupt than that (like tin whiskers causing shorts, or a server room technician yanking the wrong drive)?
>
> To imply that OpenSolaris with a RAID-Z array of IDE drives will _only_ protect me from data loss during _specific_ kinds of failures (the ones which OpenSolaris considers "normal") is a pretty big implication... and is certainly a show-stopping one at that. Nobody is going to want to rely on an OS/RAID solution that can only survive certain types of drive failures, while there are others out there that can survive the same and more...
>
> But then again, I'm not sure if that's what you meant... is that what you were getting at, or did I misunderstand?

I think there's a misunderstanding concerning underlying concepts. I'll try to explain my thoughts, please excuse me in case this becomes a bit lengthy. Oh, and I am not a Sun employee or ZFS fan, I'm just a customer who loves and hates ZFS at the same time ;-)

You know, ZFS is designed for high *reliability*. This means that ZFS tries to keep your data as safe as possible. This includes faulty hardware, missing hardware (like in your testing scenario) and, to a certain degree, even human mistakes.

But there are limits. For instance, ZFS does not make a backup unnecessary. If there's a fire and your drives melt, then ZFS can't do anything. Or if the hardware is lying about the drive geometry. ZFS is part of the operating environment and, as a consequence, relies on the hardware. So ZFS can't make unreliable hardware reliable. All it can do is try to protect the data you saved on it. But it cannot guarantee this to you if the hardware becomes its enemy.

A real world example: I have a 32 core Opteron server here, with 4 FibreChannel controllers and 4 JBODs with a total of 64 FC drives connected to it, running a RAID 10 using ZFS mirrors. Sounds a lot like high end hardware compared to your NFS server, right? But ... I have exactly the same symptom. If one drive fails, an entire JBOD with all 16 included drives hangs, and all zpool access freezes. The reason for this is the miserable JBOD hardware. There's only one FC loop inside of it, the drives are connected serially to each other, and if one drive dies, the drives behind it go downhill, too. ZFS immediately starts caring about the data, the zpool command hangs (but I still have traffic on the other half of the ZFS mirror!), and it does the right thing by doing so: whatever happens, my data must not be damaged.

A "bad" filesystem like Linux ext2 or ext3 with LVM would just continue, whether the volume manager noticed the missing drive or not. That's what you experienced. But you run the real danger of having to use fsck at some point. Or, in my case, fsck'ing 5 TB of data on 64 drives. That's not much fun and results in a lot more downtime than replacing the faulty drive.

What can you expect from ZFS in your case? You can expect it to detect that a drive is missing and to make sure that your _data integrity_ isn't compromised. By any means necessary. This may even require making the system completely unresponsive until a timeout has passed.
But what you described is not a case of reliability. You want something completely different. You expect it to deliver *availability*. And availability is something ZFS doesn't promise. It simply can't deliver this. You have the impression that NTFS and various other filesystems do so, but that's an illusion. The next reboot followed by an fsck run will show you why.

Availability requires full reliability of every included component of your server as a minimum, and you can't expect ZFS or any other filesystem to deliver this with cheap IDE hardware. Usually people want to save money when buying hardware, and ZFS is a good choice to deliver the *reliability* then. But the conceptual stalemate between reliability and availability of such cheap hardware still exists - the hardware is cheap, the file system and services may be reliable, but as soon as you want *availability*, it's getting expensive again, because you have to buy every hardware component at least twice.

So, you have the choice:

a) If you want *availability*, stay with your old solution. But you have no guarantee that your data is always intact. You'll always be able to stream your video, but you have no guarantee that the client will receive a stream without drop outs forever.

b) If you want *data integrity*, ZFS is your best friend. But you may have slight availability issues when it comes to hardware defects. You may reduce the percentage of pain during a disaster by spending more money, e.g. by making the SATA controllers redundant and creating a mirror (then controller 1 will hang, but controller 2 will continue working), but you must not forget that your PCI bridges, fans, power supplies, etc. remain single points of failure which can take the entire service down, just like your pulling of the non-hotpluggable drive did.

c) If you want both, you should buy a second server and create an NFS cluster.

Hope I could help you a bit,

Ralf

-- 
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
Ralf Ramge wrote:
[...]

Oh, and please excuse the grammar mistakes and typos. I'm in a hurry, not a retard ;-) At least I think so.

-- 
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
Heikki Suonsivu on list forwarder
2008-Aug-25 14:36 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Justin wrote:

> Howdy Matt. Just to make it absolutely clear, I appreciate your
> response. I would be quite lost if it weren't for all of the input.
>
>> Unplugging a drive (actually pulling the cable out) does not
>> simulate a drive failure, it simulates a drive getting unplugged,
>> which is something the hardware is not capable of dealing with.
>>
>> If your drive were to suffer something more realistic, along the
>> lines of how you would normally expect a drive to die, then the
>> system should cope with it a whole lot better.
>
> Hmmm... I see what you're saying. But, ok, let me play devil's
> advocate. What about the times when a drive fails in a way the system
> didn't expect? What you said was right - most of the time, when a
> hard drive goes bad, SMART will pick up on its impending doom long
> before it's too late - but what about the times when the cause of the
> problem is larger or more abrupt than that (like tin whiskers causing
> shorts, or a server room technician yanking the wrong drive)?

I read a research paper by Google about this a while ago. Their conclusion was that SMART is a poor predictor of disk failure, even though they did find some useful indications. Google for "google disk failure"; it came out as the second link a moment ago, title "Failure Trends in a Large Disk Drive Population".

The problem is that trying to predict disk failures with SMART parameters only catches a certain percentage of failing disks, and that percentage is not all that great. Many disks will still decide to fail catastrophically, most often early morning December 25th, in particular if there is a huge snowstorm going :)

Heikki
Todd H. Poole wrote:

> Howdy 404, thanks for the response.
>
> But I dunno man... I think I disagree... I'm kind of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

The behavior of ZFS to an error reported by an underlying device driver is tunable by the zpool failmode property. By default, it is set to "wait." For root pools, the installer may change this to "continue." The key here is that you can argue with the choice of default behavior, but don't argue with the option to change.

> I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along?

If this does not occur, then please file a bug against the appropriate device driver (you're not operating in ZFS code here).

> Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds.

qv. zpool failmode property, at least when you are operating in the zfs code. I think the concerns here are that hangs can, and do, occur at other places in the software stack. Please report these in the appropriate forums and bug categories.
 -- richard
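For anyone who wants to try the property Richard mentions, it can be inspected and changed with the ordinary zpool commands; "mediapool" below is just a placeholder pool name:

  # show the current failure-mode policy of a pool
  zpool get failmode mediapool
  # switch it; accepted values are wait (the default), continue, and panic
  zpool set failmode=continue mediapool

Note that failmode only governs how ZFS reacts once the underlying driver has actually reported the device as gone; it does not shorten the driver-level timeouts discussed elsewhere in this thread.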
On Sun, 24 Aug 2008, Todd H. Poole wrote:

> So aside from telling me to "[never] try this sort of thing with
> IDE" does anyone else have any other ideas on how to prevent
> OpenSolaris from locking up whenever an IDE drive is abruptly
> disconnected from a ZFS RAID-Z array?

I think that your expectations from ZFS are reasonable. However, it is useful to determine if pulling the IDE drive locks the entire IDE channel, which serves the other disks as well. This could happen at a hardware level, or at a device driver level. If this happens, then there is nothing that ZFS can do.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, 25 Aug 2008, Carson Gaspar wrote:

> B) The driver does not detect the removal. Commands must time out before
>    a problem is detected. Due to driver layering, timeouts increase
>    rapidly, causing the OS to "hang" for unreasonable periods of time.
>
> We really need to fix (B). It seems the "easy" fixes are:
>
> - Configure faster timeouts and fewer retries on redundant devices,

I don't think that any of these "easy" fixes are wise. Any fix based on timeouts is going to cause problems with devices mysteriously timing out and being resilvered. Device drivers should know the expected behavior of the device and act appropriately. For example, if the device is in a powered-down state, then the device driver can expect that it will take at least 30 seconds for the device to return after being requested to power up, but that some weak devices might take a minute.

As far as device drivers go, I expect that IDE device drivers are at the very bottom of the feeding chain in Solaris, since Solaris is optimized for enterprise hardware. Since OpenSolaris is open source, perhaps some brave soul can investigate the issues with the IDE device driver and send a patch.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, Aug 25, 2008 at 08:17:55PM +1200, Ian Collins wrote:

> John Sonnenschein wrote:
>>
>> Look, yanking the drives like that can seriously damage the drives
>> or your motherboard. Solaris doesn't let you do it ...

Haven't seen an android/"universal soldier" shipping with Solaris ... ;-)

>> and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in linux than anything else.

Not sure whether everything that can't be understood is "likely a bug" - maybe it is "more forgiving" and tries its best to solve the problem without taking you out of business (see below), even if it requires some hacks not in line with specifications ...

> One point that's been overlooked in all the chest thumping - PCs vibrate
> and cables fall out. I had this happen with a SCSI connector. Luckily

Yes - and a colleague told me that he's had the same problem once. Also he managed a Fujitsu Siemens server where the SCSI controller card had a tiny hairline crack: very odd behavior, usually not reproducible. IIRC, the 4th service engineer finally replaced the card ...

> So pulling a drive is a possible, if rare, failure mode.

Definitely! And tolerating strange controller (or, in general, hardware) behavior is possibly a big + for an OS which targets SMEs and "home users" as well (everybody knows about far east and other cheap HW producers, which sometimes seem to say: let's ship it, later we build a special driver for MS Windows which works around the bug/problem ...).

"Similar" story: ~ 2000+ we had a WG server with 4 IDE PATA channels, one HDD on each. HDD0 on CH0 mirrored to HDD2 on CH2, HDD1 on CH1 mirrored to HDD3 on CH3, using the Linux software RAID driver. We found out that when HDD1 on CH1 got on the blink, for some reason the controller got on the blink as well, i.e. took CH0 (and vice versa) down too. After a reboot, we were able to force the md raid to re-take the drives marked bad, and even found out that the problem started when a certain part of a partition was accessed (which made the ops on that raid really slow for some minutes - but after the driver marked the drive(s) as bad, performance was back). Thus disabling the partition gave us the time to get a new drive...

During all these ops nobody (except sysadmins) realized that we had a problem - thanks to the md raid1 (with xfs btw.). And also we did not have any data corruption (at least, nobody has complained about it ;-)).

Wrt. what I've experienced and read on the ZFS-discuss list etc., I have the __feeling__ that we would have really gotten into trouble using Solaris (even the most recent one) on that system ... So if one asks me whether to run Solaris+ZFS on a production system, I usually say: definitely, but only if it is a Sun server ...

My 2 cents ;-)

Regards,
jel.

PS: And yes, all the vendor specific workarounds/hacks are a problem for Linux kernel folks as well - at least on Torvalds' side they are discouraged, IIRC ...
-- 
Otto-von-Guericke University    http://www.cs.uni-magdeburg.de/
Department of Computer Science  Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany        Tel: +49 391 67 12768
>>>>> "jcm" == James C McPherson <James.McPherson at Sun.COM> writes:
>>>>> "thp" == Todd H Poole <toddhpoole at gmail.com> writes:
>>>>> "mh" == Matt Harrison <iwasinnamuknow at genestate.com> writes:
>>>>> "js" == John Sonnenschein <johnsonnenschein at gmail.com> writes:
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>> "cg" == Carson Gaspar <carson at taltos.org> writes:

   jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
   jcm> above, IDE is not designed to be able to cope with [unplugging
   jcm> a cable]

It shouldn't have to be designed for it, if there's controller redundancy. On Linux, one drive per IDE bus (not using any "slave" drives) seems like it should be enough for any electrical issue, but is not quite good enough in my experience, when there are two PATA busses per chip. But one hard drive per chip seems to be mostly okay. In this SATA-based case, not even that much separation was necessary for Linux to survive on the same hardware, but I agree with you and haven't found that level with PATA either.

OTOH, if the IDE drivers are written such that a confusing interaction with one controller chip brings down the whole machine, then I expect the IDE drivers to do better. If they don't, why advise people to buy twice as much hardware "because, you know, controllers can also fail, so you should have some controller redundancy" - the advice is worse than a waste of money, it's snake oil - a false sense of security.

   jcm> You could start by taking us seriously when we tell you that
   jcm> what you've been doing is not a good idea, and find other ways
   jcm> to simulate drive failures.

Well, you could suggest a method. Except that the whole point of the story is, Linux, without any blather about "green-line" and "self-healing," without any concerted platform-wide effort toward availability at all, simply works more reliably.

   thp> So aside from telling me to "[never] try this sort of thing
   thp> with IDE" does anyone else have any other ideas on how to
   thp> prevent OpenSolaris from locking up whenever an IDE drive is
   thp> abruptly disconnected from a ZFS RAID-Z array?

Yeah, get a Sil3124 card, which will run in native SATA mode and be more likely to work. Then, redo your test and let us know what happens.

The not-fully-voiced suggestion to run your ATI SB600 in native/AHCI mode instead of pci-ide/compatibility mode is probably a bad one because of bug 6665032: the chip is only reliable in compatibility mode.

You could trade your ATI board for an nVidia board for about the same price as the Sil3124 add-on card. AIUI from the Linux wiki:

  http://ata.wiki.kernel.org/index.php/SATA_hardware_features

...the old nVidia chips use the nv_sata driver, and the new ones use the ahci driver, so both of these are different from pci-ide and more likely to work. Get an old one (MCP61 or older), and a new one (MCP65 or newer), repeat your test and let us know what happens.

If the Sil3124 doesn't work, and nv_sata doesn't work, and AHCI on newer-nVidia doesn't work, then hook the drives up to Linux running IET on basically any old chip, and mount them from Solaris using the built-in iSCSI initiator.

If you use iSCSI, you will find: you will get a pause like with NT. Also, if one of the iSCSI targets is down, 'zpool status' might hang _every time_ you run it, not just the first time when the failure is detected. The pool itself will only hang the first time.
Also, you cannot boot unless all iSCSI targets are available, but you can continue running if some go away after booting. Overall IMHO it's not as good as LVM2, but it's more robust than plugging the drives into Solaris. It also gives you the ability to run smartctl on the drives (by running it natively on Linux) with full support for all commands, while someone here who I told to run smartctl reported that on Solaris 'smartctl -a' worked but 'smartctl -t' did not.

I still have performance problems with iSCSI. I'm not sure yet if they're unresolvable: there are a lot of tweakables with iSCSI, like disabling Nagle's algorithm, and enabling RED on the initiator switchport, but first I need to buy faster CPUs for the targets.

   mh> Dying or dead disks will still normally be able to
   mh> communicate with the driver to some extent, so they are still
   mh> "there".

The dead disks I have which don't spin also don't respond to IDENTIFY(0), so they don't really communicate with the driver at all. Now, possibly, *possibly* they are still responsive after they fail, and become unresponsive after the first time they're rebooted - because I think they load part of their firmware off the platters. Also, the ATAPI standard says that "still communicating" drives are allowed to take up to 30sec to answer each command, which is probably too long to freeze a whole system.

And still, just because "possibly," it doesn't make sense to replace a tested-working system with a tested-broken system, not even after someone tells a complicated story trying to convince you the broken system is actually secretly working, just completely impossible to test, so you have to accept it based on stardust and fantasy.

   js> yanking the drives like that can seriously damage the
   js> drives or your motherboard.

No, it can't. And if I want a software developer's opinion on what will electrically damage my machine, I'll be sure to let you know first.

   jcm> If you absolutely must do something like this, then please use
   jcm> what's known as "coordinated hotswap" using the cfgadm(1m)
   jcm> command.
   jcm> Viz:
   jcm> (detect fault in disk c2t3d0, in some way)
   jcm> # cfgadm -c unconfigure c2::dsk/c2t3d0
   jcm> # cfgadm -c disconnect c2::dsk/c2t3d0

So... don't don't DON'T do it because it's STUPID and it might FRY YOUR DISK AND MOTHERBOARD. But, if you must do it, please warn our software first? I shouldn't have to say it, but aside from being absurd, this warning-command completely defeats the purpose of the test.

   jcm> Yes, but you're running a new operating system, new
   jcm> filesystem... that's a mountain of difference right in front
   jcm> of you.

So we do agree that Linux's not freezing in the same scenario indicates the difference is inside that mountain, which, however large, is composed entirely of SOFTWARE.

   re> The behavior of ZFS to an error reported by an underlying
   re> device driver is tunable by the zpool failmode property. By
   re> default, it is set to "wait."

I think you like speculation well enough, so long as it's optimistic. Which is the tunable setting that causes other pools, ones not even including failed devices, to freeze? Why is the failmode property involved at all in a pool that still has enough replicas to keep functioning?

   cg> We really need to fix (B). It seems the "easy" fixes are:
   cg> - Configure faster timeouts and fewer retries on redundant
   cg>   devices, similar to drive manufacturers' RAID edition
   cg>   firmware.
   cg>   This could be via driver config file, or (better)
   cg>   automatically via ZFS, similar to write cache behaviour.
   cg> - Propagate timeouts quickly between layers (immediate soft
   cg>   fail without retry) or perhaps just to the fault management
   cg>   system

It's also important that things unrelated to the failure aren't frozen. This was how I heard the "green line" marketing campaign when it was pitched to me, and I found it really compelling because I felt Linux had too little of this virtue. However compelling, I just don't find it even slightly acquainted with reality. I can understand "unrelated" is a tricky concept when the boot pool is involved, but for example when it isn't involved: I've had problems where one exported data pool's becoming FAULTED stops NFS service from all other pools. The pool that FAULTED contained no Solaris binaries. And the zpool status hangs people keep discovering.

I think this is a good test in general: configure two almost-completely independent stacks through the same kernel:

   NFS export     NFS export
   filesystem     filesystem
   pool           pool
         ZFS/NFS
   driver         driver
   controller     controller
   disks          disks

Simulate whatever you regard as a "catastrophic" or "unplanned" or "really stupid" failure, and see how big the shared region in the middle can be without affecting the other stack. Right now, my experience is even the stack above does not work. Maybe mountd gets blocked or something, I don't know.

Optimistically, we would of course like this stack below to remain failure-separate:

   NFS export     NFS export
   filesystem     filesystem
   pool           pool
         ZFS/NFS
         driver
         controller
   disks          disks

The OP is implying that, on Linux, that stack DOES keep failures separate. However, even if "hot plug" (or "hot unplug" for demanding Linux users) is not supported, at least this stack below should still be failure-independent:

   NFS export     NFS export
   filesystem     filesystem
   pool           pool
         ZFS/NFS
         driver
   controller     controller
   disks          disks

I suspect it isn't, because the less-demanding stack I started with isn't failure-independent. There is probably more than one problem making these failures spread more widely than they should, but so far we can't even agree on what we wish were working. I do think the failures need to be isolated better first, independent of time. It's not "a failure of a drive on the left should propagate up the stack faster so that the stack on the right unfreezes before anyone gets too upset." The stack on the right shouldn't freeze at all.
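For completeness, the iSCSI fallback Miles suggests above (a Linux box exporting the disks with IET, Solaris importing them with its built-in initiator) boils down to a few commands on the Solaris side. This is only a rough sketch - the target address 192.168.1.50 is a placeholder, the disk names are deliberately left as <disk1>..<disk4>, and the Linux/IET side must already be exporting the drives:

  # point the initiator at the Linux target and enable SendTargets discovery
  iscsiadm add discovery-address 192.168.1.50:3260
  iscsiadm modify discovery --sendtargets enable
  # create device nodes for the discovered LUNs, then build the pool on them
  devfsadm -i iscsi
  zpool create mediapool raidz <disk1> <disk2> <disk3> <disk4>

The actual cXtYdZ names show up in format(1m) once devfsadm has run.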
Richard Elling
2008-Aug-26 18:10 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

>>>>>> "jcm" == James C McPherson <James.McPherson at Sun.COM> writes:
>>>>>> "thp" == Todd H Poole <toddhpoole at gmail.com> writes:
>>>>>> "mh" == Matt Harrison <iwasinnamuknow at genestate.com> writes:
>>>>>> "js" == John Sonnenschein <johnsonnenschein at gmail.com> writes:
>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>>> "cg" == Carson Gaspar <carson at taltos.org> writes:
>
>   jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
>   jcm> above, IDE is not designed to be able to cope with [unplugging
>   jcm> a cable]
>
> It shouldn't have to be designed for it, if there's controller
> redundancy. On Linux, one drive per IDE bus (not using any "slave"
> drives) seems like it should be enough for any electrical issue, but
> is not quite good enough in my experience, when there are two PATA
> busses per chip. But one hard drive per chip seems to be mostly okay.
> In this SATA-based case, not even that much separation was necessary
> for Linux to survive on the same hardware, but I agree with you and
> haven't found that level with PATA either.
>
> OTOH, if the IDE drivers are written such that a confusing interaction
> with one controller chip brings down the whole machine, then I expect
> the IDE drivers to do better. If they don't, why advise people to buy
> twice as much hardware "because, you know, controllers can also fail,
> so you should have some controller redundancy" - the advice is worse
> than a waste of money, it's snake oil - a false sense of security.

No snake oil. Pulling cables only simulates pulling cables. If you are having difficulty with cables falling out, then this problem cannot be solved with software. It *must* be solved with hardware.

But the main problem with "simulating disk failures by pulling cables" is that the code paths executed during that test are different than those executed when the disk fails in other ways. It is not simply an issue of the success or failure of the test, but it is an issue of what you are testing.

Studies have shown that pulled cables are not the dominant failure mode in disk populations. Bairavasundaram et al. [1] showed that data checksum errors are much more common. In some internal Sun studies, we also see unrecoverable reads as the dominant disk failure mode. ZFS will do well for these errors, regardless of the underlying OS. AFAIK, none of the traditional software logical volume managers nor the popular open source file systems (other than ZFS :-) address this problem.

[1] http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
 -- richard
Miles Nordin
2008-Aug-26 18:38 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:

   re> unrecoverable read as the dominant disk failure mode. [...]
   re> none of the traditional software logical volume managers nor
   re> the popular open source file systems (other than ZFS :-)
   re> address this problem.

Other LVMs should address unrecoverable read errors as well or better than ZFS, because that's when the drive returns an error instead of data. Doing a good job with this error is mostly about not freezing the whole filesystem for the 30sec it takes the drive to report the error. Either the drives should be loaded with special firmware that returns errors earlier, or the software LVM should read redundant data and collect the statistic if the drive is well outside its usual response latency. I would expect all the software volume managers including ZFS fail to do this. It's really hard to test without somehow getting a drive that returns read errors frequently, but isn't about to die within the month - maybe ZFS should have an error injector at driver-level instead of block-level, and a model for time-based errors. One thing other LVMs seem like they may do better than ZFS, based on not-quite-the-same-scenario tests, is not freeze filesystems unrelated to the failing drive during the 30 seconds it's waiting for the I/O request to return an error.

In terms of FUD about "silent corruption", there is none of it when the drive clearly reports a sector is unreadable. Yes, traditional non-big-storage-vendor RAID5, and all software LVMs I know of except ZFS, depend on the drives to report unreadable sectors. And, generally, drives do. So let's be clear about that and not try to imply that the "dominant failure mode" causes silent corruption for everyone except ZFS and Netapp users - it doesn't.

The Netapp paper focused on when drives silently return incorrect data, which is different than returning an error. Both Netapp and ZFS do checksums to protect from this. However, Netapp never claimed this failure mode was more common than reported unrecoverable read errors, just that it was more interesting. I expect it's much *less* common.

Further, we know Netapp loaded special firmware into the enterprise drives in that study because they wanted the larger sector size. They are likely also loading special firmware into the desktop drives to make them return errors sooner than 30 seconds. So it's not improbable that the Netapp drives are more prone to deliver silently corrupt data instead of UNC/seek errors compared to off-the-shelf drives.

Finally, for the Google paper, silent corruption "didn't even make the chart." Saying something didn't make your chart and saying that it doesn't happen are two different things, and your favoured conclusion has a stake in maintaining that view, too.
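On the error-injector point: a block-level injector does ship with ZFS as an undocumented test tool, and it can be used instead of pulling cables. A rough sketch, assuming zinject is present on the build in question and using placeholder pool/device names:

  # make reads against one vdev of the pool return I/O errors
  zinject -d c1t0d0 -e io -T read mediapool
  # list the active injection handlers, then clear them when done
  zinject
  zinject -c all

Because it injects at the ZFS vdev layer rather than in the disk driver, it exercises ZFS's own error handling but not the driver timeout behaviour being debated here.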
Carson Gaspar
2008-Aug-26 18:56 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Richard Elling wrote:

> No snake oil. Pulling cables only simulates pulling cables. If you
> are having difficulty with cables falling out, then this problem cannot
> be solved with software. It *must* be solved with hardware.
>
> But the main problem with "simulating disk failures by pulling cables"
> is that the code paths executed during that test are different than those
> executed when the disk fails in other ways. It is not simply an issue
> of the success or failure of the test, but it is an issue of what you are
> testing.

All of that may be true, but it doesn't change the fact that Solaris' observed behaviour under these conditions is _abysmally_ bad, and for no good reason. It might not be a high priority to fix, but it would be nice if one of the Sun folks would at least acknowledge that something is terribly wrong here, rather than claiming it's not a problem.

-- 
Carson
> The behavior of ZFS to an error reported by an underlying device
> driver is tunable by the zpool failmode property. By default, it is
> set to "wait." For root pools, the installer may change this
> to "continue." The key here is that you can argue with the choice
> of default behavior, but don't argue with the option to change.

I didn't want to argue with the option to change... trust me. Being able to change those types of options and having that type of flexibility in the first place is what makes a very large part of my day possible.

> qv. zpool failmode property, at least when you are operating in the
> zfs code. I think the concerns here are that hangs can, and do, occur
> at other places in the software stack. Please report these in the
> appropriate forums and bug categories.
> -- richard

Now _that's_ a great constructive suggestion! Very good - I'll research this in a few hours, and report back on what I find. Thanks for the pointer!

-Todd

This message posted from opensolaris.org
> Since OpenSolaris is open source, perhaps some brave
> soul can investigate the issues with the IDE device driver and
> send a patch.

Fearing that other "Senior Kernel Engineers, Solaris" might exhibit similar responses, or join in and play "antagonize the noob," I decided that I would try to solve my problem on my own. I tried my best to unravel the source tree that is OpenSolaris with some help from a friend, but I'll be the first to admit - we didn't even know where to begin, much less understand what we were looking at. To say that he and I were lost would be an understatement.

I'm familiar with some subsections of the Linux kernel, and I can read and write code in a pinch, but there's a reason why most of my work is done for small, personal projects, or just for fun... Some people out there can see things like Neo sees the Matrix... I am not one of them.

I wish I knew how to write and then submit those types of patches. If I did, you can bet I would have been all over that days ago! :)

-Todd

This message posted from opensolaris.org
PS: I also think it's worth noting the level of supportive and constructive feedback that many others have provided, and how much I appreciate it. Thanks! Keep it coming!

This message posted from opensolaris.org
Richard Elling
2008-Aug-26 20:15 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Carson Gaspar wrote:

> Richard Elling wrote:
>
>> No snake oil. Pulling cables only simulates pulling cables. If you
>> are having difficulty with cables falling out, then this problem cannot
>> be solved with software. It *must* be solved with hardware.
>>
>> But the main problem with "simulating disk failures by pulling cables"
>> is that the code paths executed during that test are different than those
>> executed when the disk fails in other ways. It is not simply an issue
>> of the success or failure of the test, but it is an issue of what you are
>> testing.
>
> All of that may be true, but it doesn't change the fact that Solaris'
> observed behaviour under these conditions is _abysmally_ bad, and for no
> good reason.

Please file bugs. That is the best way to get things fixed. The most appropriate forum for storage driver discussions will be storage-discuss.
 -- richard
Ron Halstead
2008-Aug-26 20:45 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Todd, 3 days ago you were asked what mode the BIOS was using, AHCI or IDE compatibility. Which is it? Did you change it? What was the result? A few other posters suggested the same thing, but the thread went off into left field and I believe the question / suggestions got lost in the noise.

--ron

This message posted from opensolaris.org
> I think that your expectations from ZFS are
> reasonable. However, it is useful to determine if pulling the IDE drive locks
> the entire IDE channel, which serves the other disks as well. This
> could happen at a hardware level, or at a device driver level. If this
> happens, then there is nothing that ZFS can do.

Gotcha. But just to let you know, there are 4 SATA ports on the motherboard, with each drive getting its own port... how should I go about testing to see whether pulling one IDE drive (remember, they're really SATA drives, but they're being presented to the OS by the pci-ide driver) locks the entire IDE channel if there's only one drive per channel? Or do you think it's possible that two ports on the motherboard could be on one "logical channel" (for lack of a better phrase) while the other two are on the other, and thus we could test one drive while another on the same "logical channel" is unplugged?

Also, remember that OpenSolaris freezes when this occurs, so I'm only going to have 2-3 seconds to execute a command before Terminal and - after a few more seconds, the rest of the machine - stop responding to input...

I'm all for trying to test this, but I might need some instruction.

This message posted from opensolaris.org
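One rough way to test this, sketched with placeholder device names (c3t1d0 and friends are assumptions, not the actual devices on this box): start a continuous read from a disk you are *not* going to unplug, watch the per-device I/O counters, and then pull a different disk. If the untouched disk's I/O also stops, the whole channel - or the driver instance behind it - is blocking, not just the missing drive.

  # in one terminal: keep a disk you will NOT unplug busy
  dd if=/dev/rdsk/c3t1d0p0 of=/dev/null bs=1024k &
  # in another terminal: watch extended I/O stats plus error columns
  iostat -xne 5
  # and check how many controller instances the disks actually share
  prtconf -D | egrep -i 'ide|ata|ahci|sata'

If iostat keeps ticking for the busy disk while the pulled one racks up errors, the channels are independent; if everything stalls, the pull is taking the shared instance down with it.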
Richard Elling
2008-Aug-26 21:26 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>
>   re> unrecoverable read as the dominant disk failure mode. [...]
>   re> none of the traditional software logical volume managers nor
>   re> the popular open source file systems (other than ZFS :-)
>   re> address this problem.
>
> Other LVMs should address unrecoverable read errors as well or better
> than ZFS, because that's when the drive returns an error instead of
> data.

ZFS handles that case as well.

> Doing a good job with this error is mostly about not freezing
> the whole filesystem for the 30sec it takes the drive to report the
> error.

That is not a ZFS problem. Please file bugs in the appropriate category.

> Either the drives should be loaded with special firmware that
> returns errors earlier, or the software LVM should read redundant data
> and collect the statistic if the drive is well outside its usual
> response latency.

ZFS will handle this case as well.

> I would expect all the software volume managers
> including ZFS fail to do this. It's really hard to test without
> somehow getting a drive that returns read errors frequently, but isn't
> about to die within the month - maybe ZFS should have an error
> injector at driver-level instead of block-level, and a model for
> time-based errors.

qv. ztest. Project COMSTAR creates an opportunity for better testing in an open-source way. However, it will only work for the SCSI protocol and therefore does not provide coverage for IDE devices -- which is not a long-term issue.

> One thing other LVMs seem like they may do better
> than ZFS, based on not-quite-the-same-scenario tests, is not freeze
> filesystems unrelated to the failing drive during the 30 seconds it's
> waiting for the I/O request to return an error.

This is not operating in ZFS code.

> In terms of FUD about "silent corruption", there is none of it when
> the drive clearly reports a sector is unreadable. Yes, traditional
> non-big-storage-vendor RAID5, and all software LVMs I know of except
> ZFS, depend on the drives to report unreadable sectors. And,
> generally, drives do. So let's be clear about that and not try to imply
> that the "dominant failure mode" causes silent corruption for
> everyone except ZFS and Netapp users - it doesn't.

In my field data, the dominant failure mode for disks is unrecoverable reads. If your software does not handle this case, then you should be worried. We tend to recommend configuring ZFS to manage data redundancy for this reason.

> The Netapp paper focused on when drives silently return incorrect
> data, which is different than returning an error. Both Netapp and ZFS
> do checksums to protect from this. However, Netapp never claimed this
> failure mode was more common than reported unrecoverable read errors,
> just that it was more interesting. I expect it's much *less* common.

I would love for you to produce data to that effect.

> Further, we know Netapp loaded special firmware into the enterprise
> drives in that study because they wanted the larger sector size. They
> are likely also loading special firmware into the desktop drives to
> make them return errors sooner than 30 seconds. So it's not
> improbable that the Netapp drives are more prone to deliver silently
> corrupt data instead of UNC/seek errors compared to off-the-shelf
> drives.

I am not sure of the basis of your assertion.
Can you explain in more detail?

> Finally, for the Google paper, silent corruption "didn't even make
> the chart." Saying something didn't make your chart and saying
> that it doesn't happen are two different things, and your favoured
> conclusion has a stake in maintaining that view, too.

The Google paper [1] didn't deal with silent errors or corruption at all. Section 2 describes in nice detail how they decided when a drive had failed -- it was replaced. They also cite disk vendors who test "failed" drives and many times the drives test clean (what they call "no problem found"). This is not surprising, because it is unlikely that data corruption is detected in the systems under study.

[1] http://www.cs.cmu.edu/~bianca/fast07.pdf
 -- richard
Mattias Pantzare
2008-Aug-27 01:32 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
2008/8/26 Richard Elling <Richard.Elling at sun.com>:

>> Doing a good job with this error is mostly about not freezing
>> the whole filesystem for the 30sec it takes the drive to report the
>> error.
>
> That is not a ZFS problem. Please file bugs in the appropriate category.

Whose problem is it? It can't be the device driver, as that has no knowledge of zfs filesystems or redundancy.

>> Either the drives should be loaded with special firmware that
>> returns errors earlier, or the software LVM should read redundant data
>> and collect the statistic if the drive is well outside its usual
>> response latency.
>
> ZFS will handle this case as well.

How is ZFS handling this? Is there a timeout in ZFS?

>> One thing other LVMs seem like they may do better
>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze
>> filesystems unrelated to the failing drive during the 30 seconds it's
>> waiting for the I/O request to return an error.
>
> This is not operating in ZFS code.

In what way is freezing a ZFS filesystem not operating in ZFS code? Notice that he wrote filesystems unrelated to the failing drive.

>> In terms of FUD about "silent corruption", there is none of it when
>> the drive clearly reports a sector is unreadable. Yes, traditional
>> non-big-storage-vendor RAID5, and all software LVMs I know of except
>> ZFS, depend on the drives to report unreadable sectors. And,
>> generally, drives do. So let's be clear about that and not try to imply
>> that the "dominant failure mode" causes silent corruption for
>> everyone except ZFS and Netapp users - it doesn't.
>
> In my field data, the dominant failure mode for disks is unrecoverable
> reads. If your software does not handle this case, then you should be
> worried. We tend to recommend configuring ZFS to manage data
> redundancy for this reason.

He is writing that all software LVMs will handle unrecoverable reads. What is your definition of unrecoverable reads?
Richard Elling
2008-Aug-27 04:40 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Mattias Pantzare wrote:

> 2008/8/26 Richard Elling <Richard.Elling at sun.com>:
>
>>> Doing a good job with this error is mostly about not freezing
>>> the whole filesystem for the 30sec it takes the drive to report the
>>> error.
>>
>> That is not a ZFS problem. Please file bugs in the appropriate category.
>
> Whose problem is it? It can't be the device driver, as that has no
> knowledge of zfs filesystems or redundancy.

In most cases it is the drivers below ZFS. For an IDE disk it might be cmdk(7d) over ata(7d). For a USB disk it might be sd(7d) over scsa2usb(7d) over ehci(7d). prtconf -D will show which device drivers are attached to your system. If you search the ZFS source code, you will find very little error handling of devices, by design.

>>> Either the drives should be loaded with special firmware that
>>> returns errors earlier, or the software LVM should read redundant data
>>> and collect the statistic if the drive is well outside its usual
>>> response latency.
>>
>> ZFS will handle this case as well.
>
> How is ZFS handling this? Is there a timeout in ZFS?

Not for this case, but if configured to manage redundancy, ZFS will "read redundant data" from alternate devices. A business metric such as reasonable transaction latency would live at a level above ZFS.

>>> One thing other LVMs seem like they may do better
>>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze
>>> filesystems unrelated to the failing drive during the 30 seconds it's
>>> waiting for the I/O request to return an error.
>>
>> This is not operating in ZFS code.
>
> In what way is freezing a ZFS filesystem not operating in ZFS code?
>
> Notice that he wrote filesystems unrelated to the failing drive.

At the ZFS level, this is dictated by the failmode property.

>>> In terms of FUD about "silent corruption", there is none of it when
>>> the drive clearly reports a sector is unreadable. Yes, traditional
>>> non-big-storage-vendor RAID5, and all software LVMs I know of except
>>> ZFS, depend on the drives to report unreadable sectors. And,
>>> generally, drives do. So let's be clear about that and not try to imply
>>> that the "dominant failure mode" causes silent corruption for
>>> everyone except ZFS and Netapp users - it doesn't.
>>
>> In my field data, the dominant failure mode for disks is unrecoverable
>> reads. If your software does not handle this case, then you should be
>> worried. We tend to recommend configuring ZFS to manage data
>> redundancy for this reason.
>
> He is writing that all software LVMs will handle unrecoverable reads.

I agree. And if ZFS is configured to manage redundancy and a disk read returns EIO or the checksum does not match, then ZFS will attempt to read from the redundant data. However, not all devices return error codes which indicate unrecoverable reads. Also, data corrupted in the data path between media and main memory may not have an associated error condition reported. I find comparing unprotected ZFS configurations with LVMs using protected configurations to be disingenuous.

> What is your definition of unrecoverable reads?

I wrote data, but when I try to read, I don't get back what I wrote.
 -- richard
> James isn't being a jerk because he hates you or
> anything...
>
> Look, yanking the drives like that can seriously
> damage the drives or your motherboard. Solaris
> doesn't let you do it and assumes that something's
> gone seriously wrong if you try it. That Linux
> ignores the behavior and lets you do it sounds more
> like a bug in linux than anything else.

Solaris crashing is a Linux bug. That's a new one, folks.

This message posted from opensolaris.org
MC
2008-Aug-27 06:08 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> Pulling cables only simulates pulling cables. If you
> are having difficulty with cables falling out, then this problem cannot
> be solved with software. It *must* be solved with hardware.

I don't think anyone is asking for software to fix cables that fall out... they're asking for the OS to not crash, which they perceive to be better than a crash...

This message posted from opensolaris.org
Okay, so your AHCI hardware is not using an AHCI driver in Solaris. A crash when pulling a cable is still not great, but it is understandable, because that driver is old and bad and doesn't support hot swapping at all.

So there are two things to do here. File a bug about how pulling a SATA cable crashes Solaris when the device is using the old ide driver. And file another bug about how Solaris recognizes your AHCI SATA hardware as old ide hardware.

The two bonus things to do are: come to the forum and bitch about the bugs to give them some attention, and come to the forum asking for help on making Solaris recognize your AHCI SATA hardware properly :)

Good luck...

> Gotcha. But just to let you know, there are 4 SATA
> ports on the motherboard, with each drive getting its
> own port... how should I go about testing to see
> whether pulling one IDE drive (remember, they're
> really SATA drives, but they're being presented to
> the OS by the pci-ide driver) locks the entire IDE
> channel if there's only one drive per channel? Or do
> you think it's possible that two ports on the
> motherboard could be on one "logical channel" (for
> lack of a better phrase) while the other two are on
> the other, and thus we could test one drive while
> another on the same "logical channel" is unplugged?
>
> Also, remember that OpenSolaris freezes when this
> occurs, so I'm only going to have 2-3 seconds to
> execute a command before Terminal and - after a few
> more seconds, the rest of the machine - stop
> responding to input...
>
> I'm all for trying to test this, but I might need
> some instruction.

This message posted from opensolaris.org
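A quick way to confirm which driver a controller ended up bound to is to look at the driver column in the device tree; the grep pattern below is just a convenience and the instance names will differ from machine to machine:

  # show the device tree with bound drivers and pick out the disk controllers
  prtconf -D | egrep -i 'ide|ahci|sata'
  # the attachment points listed by cfgadm also hint at which driver is in use
  cfgadm -lav

If the disks show up under pci-ide/ata rather than ahci, the BIOS setting (or the driver binding) is still presenting the controller in IDE compatibility mode.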
Todd H. Poole
2008-Aug-27 06:53 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Howdy Ron,

Right, right - I know I dropped the ball on that one. Sorry, I haven't been able to log into OpenSolaris lately, and thus haven't been able to actually do anything useful... (lol, not to rag on OpenSolaris or anything, but it can also freeze just by logging in... See: http://defect.opensolaris.org/bz/show_bug.cgi?id=1681)

Ok, so, just to give a refresher of what's going on: When everything is in its default state (standard install of OpenSolaris, standard configuration of ZFS, factory-set BIOS settings, etc.), OpenSolaris will indeed freeze/hang/lock up, and generally become unusable _without exception_ on the hardware I've described above. I'm not confident enough to say that it will _always_ happen on _any_ machine using the 4 drive configuration of RAID-Z with the pci-ide driver and hardware set-up I've described thus far, but since I am not alone in experiencing this (see what myxiplx experienced on his [different] hardware set-up), I don't think it's an isolated case.

The factory-set BIOS setting for the 4 SATA II ports on my motherboard is [Native IDE]. I can change this setting from [Native IDE] to [RAID], [Legacy IDE], and [SATA->AHCI].

Changing the setting to [SATA->AHCI] prevents the machine from booting. There isn't any extra information that I can give aside from the fact that when I'm at the "SunOS Release 5.11 Version snv_86 64-bit" screen where the copyright is listed, the machine hangs right after listing "Hostname: ". A restart didn't fix anything (that would sometimes fix the login bug I wrote about a few paragraphs up, but it didn't work for this).

By the way: Is there a way to pull up a text-only interface from the login screen (or during the boot process?) without having to log in (or just sit there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be nice if I could see a bit more information during boot, or if I didn't have to use gnome if I just wanted to get at the CLI anyways... On some OSes, if you want to access TTY1 through 6, you only need to press ESC during boot, or CTRL + ALT + F1 through F6 (or something similar) during the login screen to gain access to other non-GUI login screens...

Anyway, after changing the setting back to [Native IDE], the machine boots fine. And this time, the freeze-on-login bug didn't get me.

Now, I know for a fact this motherboard supports SATA II (see link to manufacturer's website in earlier post), and that all 4 of these disks are _definitely_ SATA II disks (see hardware specifications listed in one of my earliest posts), and that I'm using all the right cables and everything... so, I don't know how to explore this any further...

Could it be that when I installed OpenSolaris, I was using the pci-ide (or [Native IDE]) setting on my BIOS, and thus if I were to change it, OpenSolaris might not know how to handle that, and might refuse to boot? Or that maybe OpenSolaris only installed the drivers it thought it would need, and the SATA AHCI one wasn't one of them?

Let me know what you think.

-Todd

This message posted from opensolaris.org
Howdy James, While responding to halstead''s post (see below), I had to restart several times to complete some testing. I''m not sure if that''s important to these commands or not, but I just wanted to put it out there anyway.> A few commands that you could provide the output from > include: > > > (these two show any FMA-related telemetry) > fmadm faulty > fmdump -vThis is the output from both commands: todd at mediaserver:~# fmadm faulty --------------- ------------------------------------ -------------- --------- TIME EVENT-ID MSG-ID SEVERITY --------------- ------------------------------------ -------------- --------- Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD Major Fault class : fault.fs.zfs.vdev.io Description : The number of I/O errors associated with a ZFS device exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information. Response : The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available. Impact : Fault tolerance of the pool may be compromised. Action : Run ''zpool status -x'' and replace the bad device. todd at mediaserver:~# fmdump -v TIME UUID SUNW-MSG-ID Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD 100% fault.fs.zfs.vdev.io Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719 Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719 FRU: - Location: -> (this shows your storage controllers and what''s > connected to them) cfgadm -lavThis is the output from cfgadm -lav todd at mediaserver:~# cfgadm -lav Ap_Id Receptacle Occupant Condition Information When Type Busy Phys_Id usb2/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13:1 usb2/2 connected configured ok Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM) NConfigs: 1 Config: 0 <no cfg str descr> unavailable usb-mouse n /devices/pci at 0,0/pci1458,5004 at 13:2 usb3/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,2:1 usb3/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,2:2 usb4/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,3:1 usb4/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,3:2 usb5/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,4:1 usb5/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,4:2 usb6/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:1 usb6/2 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:2 usb6/3 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:3 usb6/4 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:4 usb6/5 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:5 usb6/6 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:6 usb6/7 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:7 usb6/8 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:8 usb6/9 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:9 usb6/10 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,5:10 usb7/1 empty unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,1:1 usb7/2 empty 
unconfigured ok unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,1:2 You''ll notice that the only thing listed is my USB mouse... is that expected?> You''ll also find messages in /var/adm/messages which > might prove > useful to review.If you really want, I can list the output from /var/adm/messages, but it doesn''t seem to add anything new to what I''ve already copied and pasted.> First and foremost, for me, this is a stupid thing to > do. You''ve got common-or-garden PC hardware which almost > *definitely* does not support hot plug of devices. Which is what you''re > telling us that you''re doing. Would try this with your pci/pci-e > cards in this system? I think not.I would if I had some sort of set-up that supposedly promised me redundant PCI/PCI-E cards... You might think it''s stupid, but how else could one be sure that the back-up PCI/PCI-E card would take over when the primary one died? Unplugging one of them seems like a fine test to me - It''s definitely the worst case scenario, and if the rig survives that, then I _know_ I would be able to rely on it for redundancy should one of the cards fail (which would most likely occur in a less spectacular fashion than a quick yank anyways)> If you absolutely must do something like this, then > please use what''s known as "coordinated hotswap" using the > cfgadm(1m) command. > > > Viz: > > (detect fault in disk c2t3d0, in some way) > > # cfgadm -c unconfigure c2::dsk/c2t3d0 > # cfgadm -c disconnect c2::dsk/c2t3d0 > > (go and swap the drive, plugin new drive with same > cable) > > # zpool replace -f poolname c2t3d0 > > > What this will do is tell the kernel to do things in > the right order, and - for zpool - tell it to do an > in-place replacement of device c2t3d0 in your pool.Thanks for the command listings - they''ll certainly prove useful if I should ever find myself in a situation where I have to manually swap a disk like you described. Unfortunately though, I''m with Miles Nordin (see below) on this one - I don''t want to warn OpenSolaris of what I''m about to do... That would defeat the purpose of the test. Even with technologies (like S.M.A.R.T.) that are designed to give you a bit of a heads-up, as Heikki Suonsivu and Google have noted, they''re not very reliable at all (research.google.com/archive/disk_failures.pdf). And I want this test to be as rough as it gets. I don''t want to play nice with this system... I want to drag it through the most tortuous worst-case scenario tests I can imagine, and if it survives with all my test data intact, then (and only then) will I begin to trust it.> http://docs.sun.com/app/docs/coll/40.17 (manpages) > http://docs.sun.com/app/docs/coll/47.23 (system admin collection) > http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide > http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guideOohh... Thank you. Good Links. I''m bookmarking these for future reading. They''ll definitely be helpful if we end up choosing to deploy OpenSolaris + ZFS for our media servers. -Todd This message posted from opensolaris.org
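Two side notes on the above. Seeing only the USB mouse in cfgadm is consistent with the pci-ide attachment: disks driven through the legacy ide path generally don't show up as cfgadm attachment points the way SATA-framework devices do. And rather than posting all of /var/adm/messages, a filtered view is usually enough; the pattern below is only a guess at what's relevant on this box:

# egrep -i 'ide|ata|disk|error|warning' /var/adm/messages | tail -60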
Mattias Pantzare
2008-Aug-27 10:44 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
2008/8/27 Richard Elling <Richard.Elling at sun.com>:> >>>> Either the drives should be loaded with special firmware that >>>> returns errors earlier, or the software LVM should read redundant data >>>> and collect the statistic if the drive is well outside its usual >>>> response latency. >>>> >>> >>> ZFS will handle this case as well. >>> >> >> How is ZFS handling this? Is there a timeout in ZFS? >> > > Not for this case, but if configured to manage redundancy, ZFS will > "read redundant data" from alternate devices.No, ZFS will not: ZFS waits for the device driver to report an error, and only after that will it read from alternate devices. ZFS could detect that there is probably a problem with the device and read from an alternate device much faster while it waits for the device to answer. You can't do this at any level other than ZFS.>>>> One thing other LVM's seem like they may do better >>>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze >>>> filesystems unrelated to the failing drive during the 30 seconds it's >>>> waiting for the I/O request to return an error. >>>> >>>> >>> >>> This is not operating in ZFS code. >>> >> >> In what way is freezing a ZFS filesystem not operating in ZFS code? >> >> Notice that he wrote filesystems unrelated to the failing drive. >> >> > > At the ZFS level, this is dictated by the failmode property.But that is used after ZFS has detected an error?> I find comparing unprotected ZFS configurations with LVMs > using protected configurations to be disingenuous.I don't think anyone is doing that.> >> What is your definition of unrecoverable reads? >> > > I wrote data, but when I try to read, I don't get back what I wrote.There is only one case where ZFS is better, and that is when wrong data is returned. All other cases are managed by layers below ZFS. Wrong data returned is not normally called an unrecoverable read.
On Tue, Aug 26, 2008 at 11:18:51PM -0700, MC wrote:> The two bonus things to do are: come to the forum and bitch about the bugs to give them some attention, and come to the forum asking for help on making solaris recognize your AHCI SATA hardware properly :)Been there, done that. No t-shirt, though... The Solaris kernel might be the best thing since MULTICS, but the lack of drivers really hampers its spread. florin -- Bruce Schneier expects the Spanish Inquisition. http://geekz.co.uk/schneierfacts/fact/163
I plan on fiddling around with this failmode property in a few hours. I''ll be using http://docs.sun.com/app/docs/doc/817-2271/gftgp?l=en&a=view as a reference. I''ll let you know what I find out. -Todd This message posted from opensolaris.org
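For reference, the property can be inspected and changed per pool; a minimal sketch, assuming the pool is still named mediapool:

# zpool get failmode mediapool
# zpool set failmode=continue mediapool     (the other values are wait, the default, and panic)

Note that failmode only governs how ZFS behaves once the pool can no longer satisfy I/O; it won't by itself shorten the time spent waiting on the driver.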
Ross
2008-Aug-27 15:38 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced
Hi Todd, Having finally gotten the time to read through this entire thread, I think Ralf said it best. ZFS can provide data integrity, but you're reliant on hardware and drivers for data availability. In this case either your SATA controller or the drivers for it don't cope at all well with a device going offline, so what you need is a SATA card that can handle that. Provided you have a controller that can cope with the disk errors, it should be able to return the appropriate status information to ZFS, which will in turn ensure your data is ok. The technique obviously works or Sun's x4500 servers wouldn't be doing anywhere near as well as they are. The problem we all seem to be having is finding white box hardware that supports it. I suspect your best bet would be to pick up a SAS controller based on the LSI chipsets used in the new x4540 server. There's been a fair bit of discussion here on these, and while there's a limitation in that you will have to manually keep track of drive names, I would expect it to handle disk failures (and pulling disks) much better, but you would probably be well advised asking the folks on the forums running those SAS controllers whether they've been able to pull disks successfully. I think the solution you need is definitely to get a better disk controller, and your choice is either a plain SAS controller, or a RAID controller that can present individual disks in pass-through mode since they *definitely* are designed to handle failures. Ross This message posted from opensolaris.org
Tim
2008-Aug-27 16:21 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
> > > By the way: Is there a way to pull up a text-only interface from the log in > screen (or during the boot process?) without having to log in (or just sit > there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be > nice if I could see a bit more information during boot, or if I didn''t have > to use gnome if I just wanted to get at the CLI anyways... On some OSes, if > you want to access TTY1 through 6, you only need to press ESC during boot, > or CTRL + ALT + F1 through F6 (or something similar) during the login screen > to gain access to other non-GUI login screens... >On SXDE/Solaris, there''s a dropdown menu that lets you select what type of logon you''d like to use. I haven''t touched 2008.11 so I have no idea if it''s got similar.> > Anyway, after changing the setting back to [Native IDE], the machine boots > fine. And this time, the freeze-on-login bug didn''t get me. Now, I know for > a fact this motherboard supports SATA II (see link to manufacturer''s website > in earlier post), and that all 4 of these disks are _definitely_ SATA II > disks (see hardware specifications listed in one of my earliest posts), and > that I''m using all the right cables and everything... so, I don''t know how > to explore this any further... > > Could it be that when I installed OpenSolaris, I was using the pci-ide (or > [Native IDE]) setting on my BIOS, and thus if I were to change it, > OpenSolaris might not know hot to handle that, and might refuse to boot? Or > that maybe OpenSolaris only installed the drivers it thought it would need, > and the stat-ahci one wasn''t one of them? >Did you do a reboot reconfigure? "reboot -- -r" or "init 6"? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/f4dba7e2/attachment.html>
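For completeness, the reconfiguration boot Tim mentions can also be forced without remembering boot flags; a sketch:

# touch /reconfigure
# init 6

or, on a running system, rebuild the device tree in place with:

# devfsadm -v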
On Wed, Aug 27, 2008 at 1:18 AM, MC <rac at eastlink.ca> wrote:> Okay, so your ACHI hardware is not using an ACHI driver in solaris. A > crash when pulling a cable is still not great, but it is understandable > because that driver is old and bad and doesn''t support hot swapping at all. >His AHCI is not using AHCI because he''s set it not to. If linux is somehow ignoring the BIOS configuration, and attempting to load an AHCI driver for the hardware anyways, that''s *BROKEN* behavior. I''ve yet to see WHAT driver linux was using because he was too busy having a pissing match to get that USEFUL information back to the list. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/7634480a/attachment.html>
Todd H. Poole wrote:> And I want this test to be as rough as it gets. I don''t want to play > nice with this system... I want to drag it through the most tortuous > worst-case scenario tests I can imagine, and if it survives with all > my test data intact, then (and only then) will I begin to trust it.http://www.youtube.com/watch?v=naKd9nARAes :-) -- richard
Richard Elling
2008-Aug-27 17:17 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Mattias Pantzare wrote:> 2008/8/27 Richard Elling <Richard.Elling at sun.com>: > >>>>> Either the drives should be loaded with special firmware that >>>>> returns errors earlier, or the software LVM should read redundant data >>>>> and collect the statistic if the drive is well outside its usual >>>>> response latency. >>>>> >>>>> >>>> ZFS will handle this case as well. >>>> >>>> >>> How is ZFS handling this? Is there a timeout in ZFS? >>> >>> >> Not for this case, but if configured to manage redundancy, ZFS will >> "read redundant data" from alternate devices. >> > > No, ZFS will not, ZFS waits for the device driver to report an error, > after that it will read from alternate devices. >Yes, ZFS will, ZFS waits for the device driver to report an error, after that it will read from alternate devices.> ZFS could detect that there is probably a problem with the device and > read from an alternate device much faster while it waits for the > device to answer. >Rather than complicating ZFS code with error handling code which is difficult to port or maintain over time, ZFS leverages the Solaris Fault Management Architecture. There is opportunity to expand features using the flexible FMA framework. Feel free to propose additional RFEs.> You can''t do this at any other level than ZFS. > > > > >>>>> One thing other LVM''s seem like they may do better >>>>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze >>>>> filesystems unrelated to the failing drive during the 30 seconds it''s >>>>> waiting for the I/O request to return an error. >>>>> >>>>> >>>>> >>>> This is not operating in ZFS code. >>>> >>>> >>> In what way is freezing a ZFS filesystem not operating in ZFS code? >>> >>> Notice that he wrote filesystems unrelated to the failing drive. >>> >>> >>> >> At the ZFS level, this is dictated by the failmode property. >> > > But that is used after ZFS has detected an error? >I don''t understand this question. Could you rephrase to clarify?>> I find comparing unprotected ZFS configurations with LVMs >> using protected configurations to be disingenuous. >> > > I don''t think anyone is doing that. >harrumph>>> What is your definition of unrecoverable reads? >>> >>> >> I wrote data, but when I try to read, I don''t get back what I wrote. >> > > There is only one case where ZFS is better, that is when wrong data is > returned. All other cases are managed by layers below ZFS. Wrong data > returned is not normaly called unrecoverable reads. >It depends on your perspective. T10 has provided a standard error code for a device to tell a host that it experienced an unrecoverable read error. However, we still find instances where what we wrote is not what we read, whether it is detected at the media level or higher in the software stack. In my pile of borken parts, I have devices which fail to indicate an unrecoverable read, yet do indeed suffer from forgetful media. To carry that discussion very far, it quickly descends into the ability of the device''s media checksums to detect bad data -- even ZFS''s checksums. But here is another case where enterprise-class devices tend to perform better than consumer-grade devices. -- richard
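For anyone who wants to see that telemetry directly, the FMA error reports can be dumped from the error log; a sketch:

# fmdump -eV | more     (raw ereports, e.g. ereport.fs.zfs.io vs ereport.fs.zfs.checksum)
# fmstat                (shows which FMA modules are receiving and diagnosing events)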
Miles Nordin
2008-Aug-27 17:48 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> not all devices return error codes which indicate re> unrecoverable reads. What you mean is, ``devices sometimes return bad data instead of an error code.'''' If you really mean there are devices out there which never return error codes, and always silently return bad data, please tell us which one and the story of when you encountered it, because I''m incredulous. I''ve never seen or heard of anything like that. Not even 5.25" floppies do that. Well...wait, actually I have. I heard some SGI disks had special firmware which could be ordered to behave this way, and some kind of ioctl or mount option to turn it on per-file or per-filesystem. But the drives wouldn''t disable error reporting unless ordered to. Another interesting lesson SGI offers here: they pushed this feature through their entire stack. The point was, for some video playback, data which arrives after the playback point has passed is just as useless as silently corrupt data, so the disk, driver, filesystem, all need to modify their exception handling to deliver the largest amount of on-time data possible, rather than the traditional goal of eventually returning the largest amount of correct data possible and clear errors instead of silent corruption. This whole-stack approach is exactly what I thought ``green line'''' was promising, and exactly what''s kept out of Solaris by the ``go blame the drivers'''' mantra. Maybe I was thinking of this SGI firmware when I suggested the customized firmware netapp loads into the drives in their study could silently return bad data more often than the firmware we''re all using, the standard firmware with 512-byte sectors intended for RAID layers without block checksums. re> I would love for you produce data to that effect. Read the netapp paper you cited earlier http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf on page 234 there''s a comparison of the relative prevalence of each kind of error. Latent sector errors / Unrecoverable reads nearline disks experiencing latent read errors per year: 9.5% Netapp calls the UNC errors, where the drive returns an error instead of data, ``latent sector errors.'''' Software RAID systems other than ZFS *do* handle this error, usually better than ZFS to my impression. And AIUI when it doesn''t freeze and reboot, ZFS counts this as a READ error. In addition to reporting it, most consumer drives seem to log the last five of these non-volatilely, and you can read the log with ''smartctl -a'' (if you''re using Linux always, or under Solaris only if smartctl is working with your particular disk driver). Silent corruption nearline disks experiencing silent corruption per year: 0.466% What netapp calls ``silent data corruption'''' is bad data silently returned by drives with no error indication, counted by ZFS as CKSUM and seems not to cause ZFS to freeze. I think you have been lumping this in with unrecoverable reads, but using the word ``silent'''' makes it clearer because unrecoverable makes it sound to me like the drive tried to recover, and failed, in which case the drive probably also reported the error making it a ``latent sector error''''. filesystem corruption This is also discovered silently w.r.t. the driver: the corruption that happens to ZFS systems when SAN targets disappear suddenly or when you offline a target and then reboot (which is also counted in the CKSUM column, and which ZFS-level redundancy also helps fix). 
I would call this ``ZFS bugs'''', ``filesystem corruption,'''' or ``manual resilvering''''. Obviously it''s not included on the Netapp table. It would be nice if ZFS had two separate CKSUM columns to distinguish between what netapp calls ``checksum errors'''' vs ``identity discrepancies''''. For ZFS the ``checksum error'''' would point with high certainty to the storage and silent corruption, and the ``identity discrepancy'''' would be more like filesystem corruption and flag things like one side of a mirror being out-of-date when ZFS thinks it shouldn''t be. but currently we have only one CKSUM column for both cases. so, I would say, yes, the type of read error that other software RAID systems besides ZFS do still handle is a lot more common: 9.5%/yr vs 0.466%/yr for nearline disks, and the same ~20x factor for enterprise disks. The rare silent error which other software LVM''s miss and only ZFS/Netapp/EMC/... handles is still common enough to worry about, at least on the nearline disks in the Netapp drive population. What this also shows, though, is that about 1 in 10 drives will return an UNC per year, and possibly cause ZFS to freeze up. It''s worth worrying about availability during an exception as common as that---it might even be more important for some applications than catching the silent corruption. not for my own application, but for some readily imagineable ones, yes. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/0599d597/attachment.bin>
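The drive-side error log mentioned above can be read wherever smartctl can actually talk to the device; a sketch only, since the device path and any -d option depend entirely on the platform and driver in use:

# smartctl -a /dev/rdsk/c0d0p0

Look for the "SMART Error Log" section in the output; recent uncorrectable-read events, if the drive logged any, show up there.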
>>>>> "m" == MC <rac at eastlink.ca> writes:m> file another bug about how solaris recognizes your AHCI SATA m> hardware as old ide hardware. I don't have that board but AIUI the driver attachment's selectable in the BIOS Blue Screen of Setup, by setting the controller to ``Compatibility'' mode (pci-ide) or ``Native'' mode (AHCI). This particular chip must be run in Compatibility mode because of bug 6665032.
Keith Bierman
2008-Aug-27 18:05 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Aug 27, 2008, at 11:17 AM, Richard Elling wrote:>>>> In my pile of broken parts, I have devices > which fail to indicate an unrecoverable read, yet do indeed suffer > from forgetful media.A long time ago, in a hw company long since dead and buried, I spent some months trying to find an intermittent error in the last bits of a complicated floating point application. It only occurred when disk striping was turned on (but the OS and device codes checked cleanly). In the end, it turned out that one of the device vendors had modified the specification slightly (by like 1 nano-sec) and the result was that least significant bits were often wrong when we drove the disk cage to its max. Errors were occurring randomly (e.g. swapping, paging, etc.) but no other application noticed. As the error was "within the margin of error" a less stubborn analyst might not have made a series of federal cases about the non-determinism ;> My point is that undetected errors happen all the time; that people don't notice doesn't mean that they don't happen ... -- Keith H. Bierman khbkhb at gmail.com | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 <speaking for myself*> Copyright 2008
>>>>> "thp" == Todd H Poole <toddhpoole at gmail.com> writes:>> Would try this with >> your pci/pci-e cards in this system? I think not. thp> Unplugging one of them seems like a fine test to me I''ve done it, with 32-bit 5 volt PCI, I forget why. I might have been trying to use a board, but bypass the broken etherboot ROM on the board. It was something like that. IIRC it works sometimes, crashes the machine sometimes, and fries the hardware eventually if you keep doing it long enough. The exact same three cases are true of cold-plugging a PCI card. It just works a-lot-more-often sometimes if you power down first. Does massively inappropriate hotplugging possibly weaken the hardware so that it''s more likely to pop later? maybe. Can you think of a good test for that? Believe it or not, sometimes accurate information is worth more than a motherboard that cost $50 five years ago. Sometimes saving ten minutes is worth more. or...<cough> recovering an openprom password. Testing availability claims rather than accepting them on faith, or rather than gaining experience in a slow, oozing, anecdotal way on production machinery, is definitely not stupid. Testing them in a way that compares one system to another is double-un-stupid. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/f89534e3/attachment.bin>
Richard Elling
2008-Aug-27 18:27 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> > > re> not all devices return error codes which indicate > re> unrecoverable reads. > > What you mean is, ``devices sometimes return bad data instead of an > error code.'''' > > If you really mean there are devices out there which never return > error codes, and always silently return bad data, please tell us which > one and the story of when you encountered it, because I''m incredulous. > I''ve never seen or heard of anything like that. Not even 5.25" > floppies do that. >I blogged about one such case. http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file However, I''m not inclined to publically chastise the vendor or device model. It is a major vendor and a popular device. ''nuff said.> Well...wait, actually I have. I heard some SGI disks had special > firmware which could be ordered to behave this way, and some kind of > ioctl or mount option to turn it on per-file or per-filesystem. But > the drives wouldn''t disable error reporting unless ordered to. > Another interesting lesson SGI offers here: they pushed this feature > through their entire stack. The point was, for some video playback, > data which arrives after the playback point has passed is just as > useless as silently corrupt data, so the disk, driver, filesystem, all > need to modify their exception handling to deliver the largest amount > of on-time data possible, rather than the traditional goal of > eventually returning the largest amount of correct data possible and > clear errors instead of silent corruption. This whole-stack approach > is exactly what I thought ``green line'''' was promising, and exactly > what''s kept out of Solaris by the ``go blame the drivers'''' mantra. > > Maybe I was thinking of this SGI firmware when I suggested the > customized firmware netapp loads into the drives in their study could > silently return bad data more often than the firmware we''re all using, > the standard firmware with 512-byte sectors intended for RAID layers > without block checksums. > > re> I would love for you produce data to that effect. > > Read the netapp paper you cited earlier > > http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf > > on page 234 there''s a comparison of the relative prevalence of each > kind of error. > > Latent sector errors / Unrecoverable reads > > nearline disks experiencing latent read errors per year: 9.5% >This number should scare the *%^ out of you. It basically means that no data redundancy is a recipe for disaster. Fortunately, with ZFS you can have data redundancy without requiring a logical volume manager to mirror your data. This is especially useful on single-disk systems like laptops.> Netapp calls the UNC errors, where the drive returns an error > instead of data, ``latent sector errors.'''' Software RAID systems > other than ZFS *do* handle this error, usually better than ZFS to > my impression. And AIUI when it doesn''t freeze and reboot, ZFS > counts this as a READ error. In addition to reporting it, most > consumer drives seem to log the last five of these non-volatilely, > and you can read the log with ''smartctl -a'' (if you''re using Linux > always, or under Solaris only if smartctl is working with your > particular disk driver). 
> > > Silent corruption > > nearline disks experiencing silent corruption per year: 0.466% > > What netapp calls ``silent data corruption'''' is bad data silently > returned by drives with no error indication, counted by ZFS as > CKSUM and seems not to cause ZFS to freeze. I think you have been > lumping this in with unrecoverable reads, but using the word > ``silent'''' makes it clearer because unrecoverable makes it sound to > me like the drive tried to recover, and failed, in which case the > drive probably also reported the error making it a ``latent sector > error''''. >Likewise, this number should scare you. AFAICT, logical volume managers like SVM will not detect this. Terminology wise, silent errors are, by-definition, not detected. But in the literature you might see this in studies of failures where the author intends to differentiate between one system which detects such errors and one which does not.> > filesystem corruption > > This is also discovered silently w.r.t. the driver: the corruption > that happens to ZFS systems when SAN targets disappear suddenly or > when you offline a target and then reboot (which is also counted in > the CKSUM column, and which ZFS-level redundancy also helps fix). > I would call this ``ZFS bugs'''', ``filesystem corruption,'''' or > ``manual resilvering''''. Obviously it''s not included on the Netapp > table. It would be nice if ZFS had two separate CKSUM columns to > distinguish between what netapp calls ``checksum errors'''' vs > ``identity discrepancies''''. For ZFS the ``checksum error'''' would > point with high certainty to the storage and silent corruption, and > the ``identity discrepancy'''' would be more like filesystem > corruption and flag things like one side of a mirror being > out-of-date when ZFS thinks it shouldn''t be. but currently we have > only one CKSUM column for both cases. > >This differentiation is noted in the FMA e-reports.> so, I would say, yes, the type of read error that other software RAID > systems besides ZFS do still handle is a lot more common: 9.5%/yr vs > 0.466%/yr for nearline disks, and the same ~20x factor for enterprise > disks. The rare silent error which other software LVM''s miss and only > ZFS/Netapp/EMC/... handles is still common enough to worry about, at > least on the nearline disks in the Netapp drive population. >0.466%/yr is a per-disk rate. If you have 10 disks, your exposure is 4.6% per year. For 100 disks, 46% per year, etc. For systems with thousands of disks this is a big problem. But I don''t think using a rate-per-unit-time is the best way to look at this problem because if you never read the data, you don''t care. This is why disk vendors spec UERs as rate-per-bits-read. I have some field data on bits read over time, but routine activities, like backups, zfs sends, or scrubs, can change the number of bits read per unit time by a significant amount.> What this also shows, though, is that about 1 in 10 drives will return > an UNC per year, and possibly cause ZFS to freeze up. It''s worth > worrying about availability during an exception as common as that---it > might even be more important for some applications than catching the > silent corruption. not for my own application, but for some readily > imagineable ones, yes. >UNCs don''t cause ZFS to freeze as long as failmode != wait or ZFS manages the data redundancy. -- richard
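This is essentially the argument for scheduling scrubs: force every allocated block to be read and verified before you need it in anger. A sketch, reusing the pool name from earlier in the thread:

# zpool scrub mediapool
# zpool status -v mediapool     (shows scrub progress and any errors it turned up)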
Ross
2008-Aug-27 18:31 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Forgive me for being a bit wooly with this explanation (I've only recently moved over from Windows), but changing disk mode from IDE to SATA may well not work without a re-install, or at the very least messing around with boot settings. I've seen many systems which list SATA disks in front of IDE ones, so changing the drives to SATA may now mean that instead of your OS being installed on drive 0 and your data on drive 1, you now have the data on drive 0 and the OS on drive 1. You'll get through the first part of the boot process fine, but the second stage is where you usually have problems, which sounds like what's happening to you. Unfortunately swapping hard disk controllers (which is what you're doing here) isn't as simple as just making the change and rebooting, and that would be just as true in Windows. I do think some Solaris drivers need a bit of work, but I suspect the standard SATA ones are pretty good, so there is a fair chance that you'll find hot plug works ok in SATA mode. Ultimately however you're trying to get enterprise kinds of performance out of consumer kit, and no matter how good Solaris and ZFS are, they can't guarantee to work with that. I used to have the same opinion as you, but I'm starting to see now that ZFS isn't quite an exact match for traditional raid controllers. It's close, but you do need to think about the hardware too and make sure it can definitely cope with what you're wanting to do. I think the sales literature is a little misleading in that sense. Ross This message posted from opensolaris.org
Tim
2008-Aug-27 18:38 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, Aug 27, 2008 at 1:31 PM, Ross <myxiplx at hotmail.com> wrote:> Forgive me for being a bit wooly with this explanation (I''ve only recently > moved over from Windows), but changing disk mode from IDE to SATA may well > not work without a re-install, or at the very least messing around with boot > settings. I''ve seen many systems which list SATA disks in front of IDE > ones, so you changing the drives to SATA may now mean that instead of your > OS being installed on drive 0, and your data on drive 1, you now have the > data on drive 0 and the OS on drive 1. >Solaris does not do this. This is one of the many annoyances I have with linux. The way they handle /dev is ridiculous. Did you add a new drive? Let''s renumber everything! --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/0cfe04e5/attachment.html>
Miles Nordin
2008-Aug-27 21:51 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:>> If you really mean there are devices out there which never >> return error codes, and always silently return bad data, please >> tell us which one and the story of when you encountered it, re> I blogged about one such case. re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file re> However, I''m not inclined to publically chastise the vendor or re> device model. It is a major vendor and a popular re> device. ''nuff said. It''s not really enough for me, but what''s more the case doesn''t match what we were looking for: a device which ``never returns error codes, always returns silently bad data.'''' I asked for this because you said ``However, not all devices return error codes which indicate unrecoverable reads,'''' which I think is wrong. Rather, most devices sometimes don''t, not some devices always don''t. Your experience doesn''t say anything about this drive''s inability to return UNC errors. It says you suspect it of silently returning bad data, once, but your experience doesn''t even clearly implicate the device once: It could have been cabling/driver/power-supply/zfs-bugs when the block was written. I was hoping for a device in your ``bad stack'''' which does it over and over. Remember, I''m not arguing ZFS checksums are worthless---I think they''re great. I''m arguing with your original statement that ZFS is the only software RAID which deals with the dominant error you find in your testing, unrecoverable reads. This is untrue! re> This number should scare the *%^ out of you. It basically re> means that no data redundancy is a recipe for disaster. yeah, but that 9.5% number alone isn''t an argument for ZFS over other software LVM''s. re> 0.466%/yr is a per-disk rate. If you have 10 disks, your re> exposure is 4.6% per year. For 100 disks, 46% per year, etc. no, you''re doing the statistics wrong, and in a really elementary way. You''re counting multiple times the possible years in which more than one disk out of the hundred failed. If what you care about for 100 disks is that no disk experiences an error within one year, then you need to calculate (1 - 0.00466) ^ 100 = 62.7% so that''s 37% probability of silent corruption. For 10 disks, the mistake doesn''t make much difference and 4.6% is about right. I don''t dispute ZFS checksums have value, but the point stands that the reported-error failure mode is 20x more common in netapp''s study than this one, and other software LVM''s do take care of the more common failure mode. re> UNCs don''t cause ZFS to freeze as long as failmode != wait or re> ZFS manages the data redundancy. The time between issuing the read and getting the UNC back can be up to 30 seconds, and there are often several unrecoverable sectors in a row as well as lower-level retries multiplying this 30-second value. so, it ends up being a freeze. To fix it, ZFS needs to dispatch read requests for redundant data if the driver doesn''t reply quickly. ``Quickly'''' can be ambiguous, but the whole point of FMD was supposed to be that complicated statistics could be collected at various levels to identify even more subtle things than READ and CKSUM errors, like drives that are working at 1/10th the speed they should be, yet right now we can''t even flag a drive taking 30 seconds to read a sector. 
ZFS is still ``patiently waiting'''', and now that FMD is supposedly integrated instead of a discussion of what knobs and responses there are, you''re passing the buck to the drivers and their haphazard nonuniform exception state machines. The best answer isn''t changing drivers to make the drive timeout in 15 seconds instead---it''s to send the read to other disks quickly using a very simple state machine, and start actually using FMD and a complicated state machine to generate suspicion-events for slow disks that aren''t returning errors. Also the driver and mid-layer need to work with the hypothetical ZFS-layer timeouts to be as good as possible about not stalling the SATA chip, the channel if there''s a port multiplier, or freezing the whole SATA stack including other chips, just because one disk has an outstanding READ command waiting to get an UNC back. In some sense the disk drivers and ZFS have different goals. The goal of drivers should be to keep marginal disk/cabling/... subsystems online as aggressively as possible, while the goal of ZFS should be to notice and work around slightly-failing devices as soon as possible. I thought the point of putting off reasonable exception handling for two years while waiting for FMD, was to be able to pursue both goals simultaneously without pressure to compromise one in favor of the other. In addition, I''m repeating myself like crazy at this point, but ZFS tools used for all pools like ''zpool status'' need to not freeze when a single pool, or single device within a pool, is unavailable or slow, and this expectation is having nothing to do with failmode on the failing pool. And NFS running above ZFS should continue serving filesystems from available pools even if some pools are faulted, again nothing to do with failmode. Neither is the case now, and it''s not a driver fix, but even beyond fixing these basic problems there''s vast room for improvement, to deliver something better than LVM2 and closer to NetApp, rather than just catching up. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/60051301/attachment.bin>
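For anyone who wants to check the arithmetic, it's a one-liner, assuming independent and identical per-disk rates:

$ echo '(1 - 0.00466)^100' | bc -l

which prints roughly 0.627, i.e. the ~37% chance of at least one silent-corruption event across 100 disks in a year.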
Ian Collins
2008-Aug-27 22:21 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin writes:> > In addition, I''m repeating myself like crazy at this point, but ZFS > tools used for all pools like ''zpool status'' need to not freeze when a > single pool, or single device within a pool, is unavailable or slow, > and this expectation is having nothing to do with failmode on the > failing pool. And NFS running above ZFS should continue serving > filesystems from available pools even if some pools are faulted, again > nothing to do with failmode. >I agree with the bulk of this post, but I''d like to add to this last point. I''ve had a few problems with ZFS tools hanging on recent builds due to problems with a pool on a USB stick. One tiny $20 component causing a fault that required a reboot of the host. This really shouldn''t happen. Ian
Miles Nordin
2008-Aug-27 22:33 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "t" == Tim <tim at tcsac.net> writes:t> Solaris does not do this. yeah but the locators for local disks are still based on pci/controller/channel not devid, so the disk will move to a different device name if he changes BIOS from pci-ide to AHCI because it changes the driver attachment. This may be the problem preventing his bootup, rather than the known AHCI bug. I'm not sure what's required to boot off a root pool that's moved devices, maybe nothing, but for UFS roots it often required booting off the install media, regenerating /dev (and /devices on sol9), editing vfstab, and so on. Linux device names don't move as much if you use LVM2, as some of the distros do by default even for single-device systems. Device names are then based on labels written onto the drive, which is a little scary and adds a lot of confusion, but I think helps with this moving-device problem and is analogous to what it sounds like ZFS might do on the latest SXCE's that don't put zpool.cache in the boot archive.
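The practical upshot for the data pool: if a controller-mode change renames the devices, ZFS can usually re-find the pool by scanning the on-disk labels rather than trusting the cached paths. A sketch; the pool name is just the one used earlier in the thread:

# zpool import                       (scans /dev/dsk and lists any pools it can find by label)
# zpool import -d /dev/dsk mediapool

The root pool is the harder case, since the device paths matter before ZFS is even running, which is what the UFS-era caveats above are getting at.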
Toby Thain
2008-Aug-27 22:39 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On 27-Aug-08, at 7:21 PM, Ian Collins wrote:> Miles Nordin writes: > >> >> In addition, I''m repeating myself like crazy at this point, but ZFS >> tools used for all pools like ''zpool status'' need to not freeze >> when a >> single pool, or single device within a pool, is unavailable or slow, >> and this expectation is having nothing to do with failmode on the >> failing pool. And NFS running above ZFS should continue serving >> filesystems from available pools even if some pools are faulted, >> again >> nothing to do with failmode. >> > I agree with the bulk of this post, but I''d like to add to this > last point. > I''ve had a few problems with ZFS tools hanging on recent builds due to > problems with a pool on a USB stick. One tiny $20 component > causing a fault > that required a reboot of the host. This really shouldn''t happen.Let''s not be too quick to assign blame, or to think that perfecting the behaviour is straightforward or even possible. Traditionally, systems bearing ''enterprisey'' expectations were/are integrated hardware and software from one vendor (e.g. Sun) which could be certified as a unit. Start introducing ''random $20 components'' and you begin to dilute the quality and predictability of the composite system''s behaviour. If hard drive firmware is as cr*ppy as anecdotes indicate, what can we really expect from a $20 USB pendrive? --Toby> > Ian > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Tim
2008-Aug-27 22:40 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, Aug 27, 2008 at 5:33 PM, Miles Nordin <carton at ivy.net> wrote:> >>>>> "t" == Tim <tim at tcsac.net> writes: > > t> Solaris does not do this. > > yeah but the locators for local disks are still based on > pci/controller/channel not devid, so the disk will move to a different > device name if he changes BIOS from pci-ide to AHCI because it changes > the driver attachment. This may be the problem preventing his bootup, > rather than the known AHCI bug. >Except he was, and is referring to a non-root disk. If I''m using raw devices and I unplug my root disk and move it somewhere else, I would expect to have to update my boot loader.> Linux device names don''t move as much if you use LVM2, as some of the > distros do by default even for single-device systems. Device names > are then based on labels written onto the drive, which is a little > scary and adds a lot of confusion, but I think helps with this > moving-device problem and is analagous to what it sounds like ZFS > might do on the latest SXCE''s that don''t put zpool.cache in the boot > archive. >LVM hardly changes the way devices move around in Linux, or it''s horrendous handling of /dev. You are correct in that it is a step towards masking the ugliness. I, however, do not consider it a fix. Unfortunately it''s not used in the majority of the sites I am involved in, and as such isn''t any sort of help. The administration overhead it adds is not worth the hassle for the majority of my customers. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/61528124/attachment.html>
Bob Friesenhahn
2008-Aug-27 22:42 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, 27 Aug 2008, Miles Nordin wrote:> > In some sense the disk drivers and ZFS have different goals. The goal > of drivers should be to keep marginal disk/cabling/... subsystems > online as aggressively as possible, while the goal of ZFS should be to > notice and work around slightly-failing devices as soon as possible.My buffer did overflow from this email, but I still noticed the stated goal of ZFS, which might differ from the objectives the ZFS authors have been working toward these past seven years. Could you please define "slightly-failing device" as well as how ZFS can know when the device is slightly-failing so it can start to work around it? Thanks, Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Tim
2008-Aug-27 22:43 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
On Wed, Aug 27, 2008 at 5:39 PM, Toby Thain <toby at telegraphics.com.au>wrote:> > > Let''s not be too quick to assign blame, or to think that perfecting > the behaviour is straightforward or even possible. > > Traditionally, systems bearing ''enterprisey'' expectations were/are > integrated hardware and software from one vendor (e.g. Sun) which > could be certified as a unit. >PSSSHHH, Sun should be certifying every piece of hardware that is, or will ever be released. Community putback shmamunnity putback.> > Start introducing ''random $20 components'' and you begin to dilute the > quality and predictability of the composite system''s behaviour. >But this NEVER happens on linux *grin*.> > If hard drive firmware is as cr*ppy as anecdotes indicate, what can > we really expect from a $20 USB pendrive? > > --Toby > >Perfection? --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/079adda4/attachment.html>
Miles Nordin
2008-Aug-27 23:02 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "t" == Tim <tim at tcsac.net> writes:t> Except he was, and is referring to a non-root disk. wait, what? his root disk isn''t plugged into the pci-ide controller? t> LVM hardly changes the way devices move around in Linux, fine, be pedantic. It makes systems boot and mount all their filesystems including ''/'' even when you move disks around. agreed now? There''s a simpler Linux way of doing this which I use on my Linux systems: mounting by the UUID in the filesystem''s superblock. But I think RedHat is using LVM2 to do it. Anyway modern Linux systems don''t put names like /dev/sda in /etc/fstab, and they don''t use these names to find the root filesystem either---they have all that LVM2 stuff in the early userspace. Solaris seems to be going the same ``mount by label'''' direction with ZFS (except with zpool.cache, devid''s, and mpxio, it''s a bit of a hybrid approach---when it goes out searching for labels, and when it expects devices to be on the same bus/controller/channel, isn''t something I fully understand yet and I expect will only become clear through experience). -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/637b3d44/attachment.bin>
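For comparison, the mount-by-UUID approach on the Linux side looks roughly like this; a sketch, not taken from any particular distro's defaults, with the UUID value obviously a placeholder:

# blkid /dev/sda1                    (prints the filesystem's UUID)

and then in /etc/fstab:

UUID=0a1b2c3d-placeholder   /export/media   xfs   defaults   0   2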
Ian Collins
2008-Aug-27 23:04 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Toby Thain writes:> > On 27-Aug-08, at 7:21 PM, Ian Collins wrote: > >> Miles Nordin writes: >> >>> >>> In addition, I''m repeating myself like crazy at this point, but ZFS >>> tools used for all pools like ''zpool status'' need to not freeze when a >>> single pool, or single device within a pool, is unavailable or slow, >>> and this expectation is having nothing to do with failmode on the >>> failing pool. And NFS running above ZFS should continue serving >>> filesystems from available pools even if some pools are faulted, again >>> nothing to do with failmode. >>> >> I agree with the bulk of this post, but I''d like to add to this last >> point. >> I''ve had a few problems with ZFS tools hanging on recent builds due to >> problems with a pool on a USB stick. One tiny $20 component causing a >> fault >> that required a reboot of the host. This really shouldn''t happen. > > Let''s not be too quick to assign blame, or to think that perfecting the > behaviour is straightforward or even possible. >I''m not assigning blame, just illustrating a problem. If you look back a week or so you will see a thread I started with the subject " ZFS commands hanging in B95". This thread went off list but the cause was tracked back to a problem with a USB pool.> Traditionally, systems bearing ''enterprisey'' expectations were/are > integrated hardware and software from one vendor (e.g. Sun) which could > be certified as a unit. > > Start introducing ''random $20 components'' and you begin to dilute the > quality and predictability of the composite system''s behaviour. >So we shouldn''t be using USB sticks to transfer data between home and office systems? If the stick was a FAT device and it crapped out or was removed without unmounting, the system would not have hung.> If hard drive firmware is as cr*ppy as anecdotes indicate, what can we > really expect from a $20 USB pendrive? >All the more reason not to lock up if one craps out. Ian
Richard Elling
2008-Aug-27 23:24 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:>>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes: >>>>>> > > >> If you really mean there are devices out there which never > >> return error codes, and always silently return bad data, please > >> tell us which one and the story of when you encountered it, > > re> I blogged about one such case. > re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file > > re> However, I''m not inclined to publically chastise the vendor or > re> device model. It is a major vendor and a popular > re> device. ''nuff said. > > It''s not really enough for me, but what''s more the case doesn''t match > what we were looking for: a device which ``never returns error codes, > always returns silently bad data.'''' I asked for this because you said > ``However, not all devices return error codes which indicate > unrecoverable reads,'''' which I think is wrong. Rather, most devices > sometimes don''t, not some devices always don''t. >I really don''t know how to please you. I''ve got a bunch of borken devices of all sorts. If you''d like to stop by some time and rummage in the boneyard, feel free. Make it quick before my wife makes me clean up :-) For the device which I mentioned in my blog, it does return bad data far more often than I''d like. But that is why I only use it for testing and don''t store my wife''s photo album on it. Anyone who has been around for a while will have similar anecdotes.> Your experience doesn''t say anything about this drive''s inability to > return UNC errors. It says you suspect it of silently returning bad > data, once, but your experience doesn''t even clearly implicate the > device once: It could have been cabling/driver/power-supply/zfs-bugs > when the block was written. I was hoping for a device in your ``bad > stack'''' which does it over and over. > > Remember, I''m not arguing ZFS checksums are worthless---I think > they''re great. I''m arguing with your original statement that ZFS is > the only software RAID which deals with the dominant error you find in > your testing, unrecoverable reads. This is untrue! >To be clear. I claim: 1. The dominant failure mode in my field data for magnetic disks is unrecoverable reads. You need some sort of data protection to get past this problem. 2. Unrecoverable reads are not always reported by disk drives. 3. You really want a system that performs end-to-end data verification, and if you don''t bother to code that into your applications, then you might rely on ZFS to do it for you. If you ignore this problem, it will not go away.> re> This number should scare the *%^ out of you. It basically > re> means that no data redundancy is a recipe for disaster. > > yeah, but that 9.5% number alone isn''t an argument for ZFS over other > software LVM''s. > > re> 0.466%/yr is a per-disk rate. If you have 10 disks, your > re> exposure is 4.6% per year. For 100 disks, 46% per year, etc. > > no, you''re doing the statistics wrong, and in a really elementary way. > You''re counting multiple times the possible years in which more than > one disk out of the hundred failed. If what you care about for 100 > disks is that no disk experiences an error within one year, then you > need to calculate > > (1 - 0.00466) ^ 100 = 62.7% > > so that''s 37% probability of silent corruption. For 10 disks, the > mistake doesn''t make much difference and 4.6% is about right. >Indeed. Intuitively, the AFR and population is more easily grokked by the masses. 
But if you go into a customer and say "dude, there is only a 62.7% chance that your system won''t be affected by a silent data corruption problem this year with my (insert favorite non-ZFS, non-NetApp solution here)" then you will have a difficult sale.> I don''t dispute ZFS checksums have value, but the point stands that > the reported-error failure mode is 20x more common in netapp''s study > than this one, and other software LVM''s do take care of the more > common failure mode. >I agree.> re> UNCs don''t cause ZFS to freeze as long as failmode != wait or > re> ZFS manages the data redundancy. > > The time between issuing the read and getting the UNC back can be up > to 30 seconds, and there are often several unrecoverable sectors in a > row as well as lower-level retries multiplying this 30-second value. > so, it ends up being a freeze. >Untrue. There are disks which will retry forever. But don''t take my word for it, believe another RAID software vendor: http://blogs.sun.com/relling/entry/adaptec_webinar_on_disks_and [sorry about the redirect, you have to sign up for an Adaptec webinar before you can get to the list of webinars, so it is hard to provide the direct URL] Incidentally, I have one such disk in my boneyard, but it isn''t much fun to work with because it just sits there and spins when you try to access the bad sector.> To fix it, ZFS needs to dispatch read requests for redundant data if > the driver doesn''t reply quickly. ``Quickly'''' can be ambiguous, but > the whole point of FMD was supposed to be that complicated statistics > could be collected at various levels to identify even more subtle > things than READ and CKSUM errors, like drives that are working at > 1/10th the speed they should be, yet right now we can''t even flag a > drive taking 30 seconds to read a sector. ZFS is still ``patiently > waiting'''', and now that FMD is supposedly integrated instead of a > discussion of what knobs and responses there are, you''re passing the > buck to the drivers and their haphazard nonuniform exception state > machines. The best answer isn''t changing drivers to make the drive > timeout in 15 seconds instead---it''s to send the read to other disks > quickly using a very simple state machine, and start actually using > FMD and a complicated state machine to generate suspicion-events for > slow disks that aren''t returning errors. >I think the proposed timeouts here are too short, but the idea has merit. Note that such a preemptive read will have negative performance impacts for high-workload systems, so it will not be a given that people will want this enabled by default. Designing such a proactive system which remains stable under high workloads may not be trivial. Please file an RFE at http://bugs.opensolaris.org> Also the driver and mid-layer need to work with the hypothetical > ZFS-layer timeouts to be as good as possible about not stalling the > SATA chip, the channel if there''s a port multiplier, or freezing the > whole SATA stack including other chips, just because one disk has an > outstanding READ command waiting to get an UNC back. > > In some sense the disk drivers and ZFS have different goals. The goal > of drivers should be to keep marginal disk/cabling/... subsystems > online as aggressively as possible, while the goal of ZFS should be to > notice and work around slightly-failing devices as soon as possible. 
> I thought the point of putting off reasonable exception handling for > two years while waiting for FMD, was to be able to pursue both goals > simultaneously without pressure to compromise one in favor of the > other. > > In addition, I''m repeating myself like crazy at this point, but ZFS > tools used for all pools like ''zpool status'' need to not freeze when a > single pool, or single device within a pool, is unavailable or slow, > and this expectation is having nothing to do with failmode on the > failing pool. And NFS running above ZFS should continue serving > filesystems from available pools even if some pools are faulted, again > nothing to do with failmode. > >You mean something like: http://bugs.opensolaris.org/view_bug.do?bug_id=6667208 http://bugs.opensolaris.org/view_bug.do?bug_id=6667199 Yes, we all wish these to be fixed soon.> Neither is the case now, and it''s not a driver fix, but even beyond > fixing these basic problems there''s vast room for improvement, to > deliver something better than LVM2 and closer to NetApp, rather than > just catching up. >If you find more issues, then please file bugs. http://bugs.opensolaris.org -- richard
Ian Collins
2008-Aug-27 23:41 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Richard Elling writes:

> I think the proposed timeouts here are too short, but the idea has
> merit. Note that such a preemptive read will have negative performance
> impacts for high-workload systems, so it will not be a given that people
> will want this enabled by default. Designing such a proactive system
> which remains stable under high workloads may not be trivial.

Isn't this how things already work with mirrors? By this I mean requests
are issued to all devices and if the first returned data is OK, the
others are not required.

Ian
Richard Elling
2008-Aug-27 23:59 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Ian Collins wrote:

> Richard Elling writes:
>> I think the proposed timeouts here are too short, but the idea has
>> merit. Note that such a preemptive read will have negative performance
>> impacts for high-workload systems, so it will not be a given that people
>> will want this enabled by default. Designing such a proactive system
>> which remains stable under high workloads may not be trivial.
>
> Isn't this how things already work with mirrors? By this I mean requests
> are issued to all devices and if the first returned data is OK, the
> others are not required.

No. Yes. Sometimes. The details on choice of read targets vary by
implementation. I've seen some telco systems which work this way, but
most of the general purpose systems will choose one target for the read
based on some policy: round-robin, location, etc. This way you could get
the read performance of all disks operating concurrently.

-- richard
Ah yes - that video is what got this whole thing going in the first place... I referenced it in one of my other posts much earlier. Heh... there''s something gruesomely entertaining about brutishly taking a drill or sledge hammer to a piece of precision hardware like that. But yes, that''s the kind of torture test I would like to conduct, however, I''m operating on a limited test-budget right now, and I have to get the damn thing working in the first place before I start performing tests I can''t easily reverse (I still have yet to fire up Bonnie++ and do some benchmarking), and most definitely before I can put on a show for those who control the draw strings to the purse... But, imagine: walking into... oh say, I dunno... your manager''s office, for example, and asking him to beat the hell out of one of your server''s hard drives all the while promising him that no data would be lost, and none of his video on demand customers would ever notice an interruption in service. He might think you''re crazy, but if it still works at the end of the day, your annual budget just might get a sizable increase to help you make all the other servers "sledge hammer resistant" like the first one. ;) But that''s just an example. That functionality could (and probably does) prove useful almost anywhere. This message posted from opensolaris.org
Ian Collins
2008-Aug-28 00:38 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Richard Elling writes:

> Ian Collins wrote:
>> Richard Elling writes:
>>> I think the proposed timeouts here are too short, but the idea has
>>> merit. Note that such a preemptive read will have negative performance
>>> impacts for high-workload systems, so it will not be a given that people
>>> will want this enabled by default. Designing such a proactive system
>>> which remains stable under high workloads may not be trivial.
>>
>> Isn't this how things already work with mirrors? By this I mean requests
>> are issued to all devices and if the first returned data is OK, the
>> others are not required.
>
> No. Yes. Sometimes. The details on choice of read targets vary by
> implementation. I've seen some telco systems which work this way, but
> most of the general purpose systems will choose one target for the read
> based on some policy: round-robin, location, etc. This way you could get
> the read performance of all disks operating concurrently.

Would it be possible to get ZFS to work the way I described? I was
looking at using an exported iSCSI target from a machine in another
building to mirror a fileserver with a mainly (>95%) read workload. A
"first read back wins" implementation would be a good fit for that
situation.

Ian
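To make the two policies in this sub-thread concrete, a minimal sketch,
assuming each mirror side is just a callable that returns the requested
bytes (an illustration of the idea only, not of how ZFS's mirror vdev is
implemented):

    import itertools, threading

    class MirrorReader(object):
        def __init__(self, sides):
            self.sides = list(sides)
            self._next = itertools.cycle(range(len(self.sides)))

        def read_round_robin(self, offset, length):
            """What most general-purpose implementations do: pick one side
            per read, so concurrent reads spread across all sides."""
            return self.sides[next(self._next)](offset, length)

        def read_first_back(self, offset, length):
            """Issue the read to every side and return whichever answer
            arrives first.  Lower latency when one side is slow or remote
            (e.g. iSCSI over a campus link), at the cost of extra I/O."""
            first = []
            done = threading.Event()

            def worker(side):
                data = side(offset, length)
                if not done.is_set():
                    first.append(data)
                    done.set()

            for side in self.sides:
                threading.Thread(target=worker, args=(side,)).start()
            done.wait()
            return first[0]

The first-read-back policy buys latency at the cost of issuing every
read to every side, which is the bandwidth concern raised elsewhere in
the thread.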
Miles Nordin
2008-Aug-28 01:27 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> I really don''t know how to please you. dd from the raw device instead of through ZFS would be better. If you could show that you can write data to a sector, and read back different data, without getting an error, over and over, I''d be totally stunned. The netapp paper was different from your test in many ways that make their claim that ``all drives silently corrupt data sometimes'''' more convincing than your claim that you have ``one drive which silently corrupts data always and never returns UNC'''': * not a desktop. The circumstances were more tightly-controlled, and their drive population installed in a repeated way * their checksum measurement was better than ZFS''s by breaking the type of error up into three buckets instead of one, and their filesystem more mature, and their filesystem is not already known to count CKSUM errors for circumstances other than silent corruption, which argues the checksums are less likely to come from software bugs * they make statistical arguments that at least some of the errors are really coming from the drives by showing they have spatial locality w.r.t. the LBA on the drive, and are correlated with drive age and impending drive failure. The paper was less convincing in one way: * their drives are using nonstandard firmware re> Anyone who has been around for a while will have similar re> anecdotes. yeah, you''d think, but my similar anecdote is that (a) I can get UNC''s repeatably on a specific bad sector that persist either forever or until I write new data to that sector with dd, and do get them on at least 10% of my drives per year, and (b) I get CKSUM errors from ZFS all the time with my iSCSI ghetto-SAN and with an IDE/Firewire mirror, often from things I can specifically trace back to not-a-drive-failure, but so far never from something I can for certain trace back to silent corruption by the disk drive. I don''t doubt that it happens, but CKSUM isn''t a way to spot it. ZFS may give me a way to stop it, but it doesn''t give me an accurate way to measure/notice it. re> Indeed. Intuitively, the AFR and population is more easily re> grokked by the masses. It''s nothing to do with masses. There''s an error in your math. It''s not right under any circumstance. Your point that a 100 drive population has bad/high odds of having silent corruption within a year isn''t diminished by the correction, but it would be nice if you would own up to the statistics mistake since we''re taking you at your word on a lot of other statistics. >> so, it ends up being a freeze. re> Untrue. There are disks which will retry forever. I don''t understand. ZFS freezes until the disk stops retrying and returns an error. Because some disks never stop retrying and never return an error, just lock up until they''re power-cycled, it''s untrue that ZFS freezes? I think either you or I have lost the thread of the argument in our reply chain bantering. re> please file bugs. k., I filed the NFS bug, but unfortunately I don''t have output to cut and paste into it. glad to see the ''zpool status'' bug is there already and includes the point that lots of other things are probably hanging which shouldn''t. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080827/8be11f79/attachment.bin>
James C. McPherson
2008-Aug-28 12:52 UTC
[zfs-discuss] ZFS hangs/freezes after disk failure,
Hi Todd, sorry for the delay in responding, been head down rewriting a utility for the last few days. Todd H. Poole wrote:> Howdy James, > > While responding to halstead''s post (see below), I had to restart several > times to complete some testing. I''m not sure if that''s important to these > commands or not, but I just wanted to put it out there anyway. > >> A few commands that you could provide the output from >> include: >> >> >> (these two show any FMA-related telemetry) >> fmadm faulty >> fmdump -v > > This is the output from both commands: > > todd at mediaserver:~# fmadm faulty > --------------- ------------------------------------ -------------- --------- > TIME EVENT-ID MSG-ID SEVERITY > --------------- ------------------------------------ -------------- --------- > Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD Major > > Fault class : fault.fs.zfs.vdev.io > Description : The number of I/O errors associated with a ZFS device exceeded > acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD > for more information. > Response : The device has been offlined and marked as faulted. An attempt > will be made to activate a hot spare if available. > Impact : Fault tolerance of the pool may be compromised. > Action : Run ''zpool status -x'' and replace the bad device.>> todd at mediaserver:~# fmdump -v > TIME UUID SUNW-MSG-ID > Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD > 100% fault.fs.zfs.vdev.io > > Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719 > FRU: - > Location: -In other emails in this thread you''ve mentioned the desire to get an email (or some sort of notification) when Problems Happen(tm) in your system, and the FMA framework is how we achieve that in OpenSolaris. # fmadm config MODULE VERSION STATUS DESCRIPTION cpumem-retire 1.1 active CPU/Memory Retire Agent disk-transport 1.0 active Disk Transport Agent eft 1.16 active eft diagnosis engine fabric-xlate 1.0 active Fabric Ereport Translater fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis io-retire 2.0 active I/O Retire Agent snmp-trapgen 1.0 active SNMP Trap Generation Agent sysevent-transport 1.0 active SysEvent Transport Agent syslog-msgs 1.0 active Syslog Messaging Agent zfs-diagnosis 1.0 active ZFS Diagnosis Engine zfs-retire 1.0 active ZFS Retire Agent You''ll notice that we''ve got an SNMP agent there... and you can acquire a copy of the FMA mib from the Fault Management community pages (http://opensolaris.org/os/community/fm and http://opensolaris.org/os/community/fm/mib/).>> (this shows your storage controllers and what''s >> connected to them) cfgadm -lav > > This is the output from cfgadm -lav > > todd at mediaserver:~# cfgadm -lav > Ap_Id Receptacle Occupant Condition Information > When Type Busy Phys_Id > usb2/1 empty unconfigured ok > unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13:1 > usb2/2 connected configured ok > Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM) > NConfigs: 1 Config: 0 <no cfg str descr> > unavailable usb-mouse n /devices/pci at 0,0/pci1458,5004 at 13:2 > usb3/1 empty unconfigured ok[snip]> usb7/2 empty unconfigured ok > unavailable unknown n /devices/pci at 0,0/pci1458,5004 at 13,1:2 > > You''ll notice that the only thing listed is my USB mouse... is that expected?Yup. One of the artefacts of the cfgadm architecture. cfgadm(1m) works by using plugins - usb, FC, SCSI, SATA, pci hotplug, InfiniBand... but not IDE. 
I think you also were wondering how to tell what controller instances your disks were using in IDE mode - two basic ways of achieving this: /usr/bin/iostat -En and /usr/sbin/format Your IDE disks will attach using the cmdk driver and show up like this: c1d0 c1d1 c2d0 c2d1 In AHCI/SATA mode they''d show up as c1t0d0 c1t1d0 c1t2d0 c1t3d0 or something similar, depending on how the bios and the actual controllers sort themselves out.>> You''ll also find messages in /var/adm/messages which >> might prove >> useful to review. > > If you really want, I can list the output from /var/adm/messages, but it > doesn''t seem to add anything new to what I''ve already copied and pasted.No need - you''ve got them if you need them. [snip]>> http://docs.sun.com/app/docs/coll/40.17 (manpages) >> http://docs.sun.com/app/docs/coll/47.23 (system admin collection) >> http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide >> http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guide > > Oohh... Thank you. Good Links. I''m bookmarking these for future reading. > They''ll definitely be helpful if we end up choosing to deploy OpenSolaris > + ZFS for our media servers.There''s a heap of info there, getting started with it can be like trying to drink from a fire hose :) Best regards, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
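Tying that back to the notification question: a rough sketch of a poller
that mails the output of fmadm faulty whenever it changes. The
addresses, interval, and use of a localhost SMTP relay are placeholders,
and the SNMP trap agent James lists above is the more natural production
route:

    import subprocess, smtplib, time
    from email.mime.text import MIMEText

    CHECK_INTERVAL = 300                      # seconds between polls
    MAIL_FROM = "root@mediaserver.example"    # placeholder
    MAIL_TO = "admin@example.com"             # placeholder

    def faulty_report():
        # run `fmadm faulty` (needs sufficient privileges) and return its text
        p = subprocess.Popen(["/usr/sbin/fmadm", "faulty"],
                             stdout=subprocess.PIPE)
        out, _ = p.communicate()
        return out.decode("utf-8", "replace").strip()

    last = ""
    while True:
        report = faulty_report()
        if report and report != last:
            msg = MIMEText(report)
            msg["Subject"] = "FMA fault report"
            msg["From"], msg["To"] = MAIL_FROM, MAIL_TO
            s = smtplib.SMTP("localhost")
            s.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
            s.quit()
        last = report
        time.sleep(CHECK_INTERVAL)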
Richard Elling
2008-Aug-28 13:49 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

>     re> Indeed. Intuitively, the AFR and population is more easily
>     re> grokked by the masses.
>
> It's nothing to do with masses. There's an error in your math. It's
> not right under any circumstance.

There is no error in my math. I presented a failure rate for a time
interval, you presented a probability of failure over a time interval.
The two are both correct, but say different things. Mathematically, an
AFR > 100% is quite possible and quite common. A probability of
failure > 100% (1.0) is not.

In my experience, failure rates described as annualized failure rates
(AFR) are more intuitive than their mathematically equivalent
counterpart: MTBF.

-- richard
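To put numbers on both readings of that per-disk 0.466%/yr figure, a
quick back-of-the-envelope (Python, illustration only):

    p = 0.00466   # per-disk chance of at least one silent corruption per year

    for n in (10, 100, 1000):
        expected = n * p                    # expected number of affected disks
        at_least_one = 1 - (1 - p) ** n     # probability one or more are affected
        print("%5d disks: expect %5.2f affected, P(>=1) = %5.1f%%"
              % (n, expected, at_least_one * 100))

    #   10 disks: expect  0.05, P(>=1) ~  4.6%  (the two figures nearly agree)
    #  100 disks: expect  0.47, P(>=1) ~ 37.3%  (not 46.6%: simple addition overcounts)
    # 1000 disks: expect  4.66, P(>=1) ~ 99.1%  (an expectation is not a probability)

Both quantities are legitimate; they just answer different questions,
which seems to be most of the disagreement.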
Robert Milkowski
2008-Aug-28 13:55 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Hello Miles, Wednesday, August 27, 2008, 10:51:49 PM, you wrote: MN> It''s not really enough for me, but what''s more the case doesn''t match MN> what we were looking for: a device which ``never returns error codes, MN> always returns silently bad data.'''' I asked for this because you said MN> ``However, not all devices return error codes which indicate MN> unrecoverable reads,'''' which I think is wrong. Rather, most devices MN> sometimes don''t, not some devices always don''t. Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf -- Best regards, Robert Milkowski mailto:milek at task.gda.pl http://milek.blogspot.com
Richard Elling
2008-Aug-28 15:04 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Robert Milkowski wrote:> Hello Miles, > > Wednesday, August 27, 2008, 10:51:49 PM, you wrote: > > MN> It''s not really enough for me, but what''s more the case doesn''t match > MN> what we were looking for: a device which ``never returns error codes, > MN> always returns silently bad data.'''' I asked for this because you said > MN> ``However, not all devices return error codes which indicate > MN> unrecoverable reads,'''' which I think is wrong. Rather, most devices > MN> sometimes don''t, not some devices always don''t. > > > > Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf > >You really don''t have to look very far to find this sort of thing. The scar just below my left knee is directly attributed to a bugid fixed in patch 106129-12. Warning: the following link may frighten experienced datacenter personnel, fortunately, the affected device is long since EOL. http://sunsolve.sun.com/search/document.do?assetkey=1-21-106129-12-1 -- richard
Miles Nordin
2008-Aug-28 16:54 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:re> There is no error in my math. I presented a failure rate for re> a time interval, What is a ``failure rate for a time interval''''? AIUI, the failure rate for a time interval is 0.46% / yr, no matter how many drives you have. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080828/3c7133a9/attachment.bin>
Miles Nordin
2008-Aug-28 16:55 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:rm> Please look for slides 23-27 at rm> http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf yeah, ok, ONCE AGAIN, I never said that checksums are worthless. relling: some drives don''t return errors on unrecoverable read events. carton: I doubt that. Tell me a story about one that doesn''t. Your stories are about storage subsystems again, not drives. Also most or all of the slides aren''t about unrecoverable read events. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080828/4f742127/attachment.bin>
Jonathan Loran
2008-Aug-28 18:13 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
Miles Nordin wrote:

> What is a ``failure rate for a time interval''?

Failure rate => failures/unit time
Failure rate for a time interval => (failures/unit time) * time

For example, if we have a failure rate:

  Fr = 46% failures/month

Then the expectation value of a failure in one year:

  Fe = 46% failures/month * 12 months = 5.52 failures

Jon

--
Jonathan Loran                   IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146    jloran at ssl.berkeley.edu
AST:7731^29u18e3
Miles Nordin
2008-Aug-28 18:42 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,
>>>>> "jl" == Jonathan Loran <jloran at ssl.berkeley.edu> writes:jl> Fe = 46% failures/month * 12 months = 5.52 failures the original statistic wasn''t of this kind. It was ``likelihood a single drive will experience one or more failures within 12 months''''. so, you could say, ``If I have a thousand drives, about 4.66 of those drives will silently-corrupt at least once within 12 months.'''' It is 0.466% no matter how many drives you have. And it''s 4.66 drives, not 4.66 corruptions. The estimated number of corruptions is higher because some drives will corrupt twice, or thousands of times. It''s not a BER, so you can''t just add it like Richard did. If the original statistic in the paper were of the kind you''re talking about, it would be larger than 0.466%. I''m not sure it would capture the situation well, though. I think you''d want to talk about bits of recoverable data after one year, not corruption ``events'''', and this is not really measured well by the type of telemetry NetApp has. If it were, though, it would still be the same size number no matter how many drives you had. The 37% I gave was ``one or more within a population of 100 drives silently corrupts within 12 months.'''' The 46% Richard gave has no meaning, and doesn''t mean what you just said. The only statistic under discussion which (a) gets intimidatingly large as you increase the number of drives, and (b) is a ratio rather than, say, an absolute number of bits, is the one I gave. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080828/fa784f20/attachment.bin>
Anton B. Rang
2008-Aug-28 20:35 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk
Many mid-range/high-end RAID controllers work by having a small timeout on individual disk I/O operations. If the disk doesn''t respond quickly, they''ll issue an I/O to the redundant disk(s) to get the data back to the host in a reasonable time. Often they''ll change parameters on the disk to limit how long the disk retries before returning an error for a bad sector (this is standardized for SCSI, I don''t recall offhand whether any of this is standardized for ATA). RAID 3 units, e.g. DataDirect, issue I/O to all disks simultaneously and when enough (N-1 or N-2) disks return data, they''ll return the data to the host. At least they do that for full stripes. But this strategy works better for sequential I/O, not so good for random I/O, since you''re using up extra bandwidth. Host-based RAID/mirroring almost never takes this strategy for two reasons. First, the bottleneck is almost always the channel from disk to host, and you don''t want to clog it. [Yes, I know there''s more bandwidth there than the sum of the disks, but consider latency.] Second, to read from two disks on a mirror, you''d need two memory buffers. -- This message posted from opensolaris.org
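A minimal sketch of that controller strategy, assuming each disk is just
a callable returning bytes (illustration only -- real controllers do
this in firmware, and the half-second deadline is invented):

    import queue, threading

    SHORT_TIMEOUT = 0.5   # the "small timeout on individual disk I/O operations"

    def read_with_fallback(primary, mirror, offset, length):
        answers = queue.Queue()

        def issue(dev):
            answers.put(dev(offset, length))

        threading.Thread(target=issue, args=(primary,)).start()
        try:
            # happy path: the primary answers within the short deadline
            return answers.get(timeout=SHORT_TIMEOUT)
        except queue.Empty:
            # primary is slow (retrying a bad sector?): issue the same read
            # to the redundant side and return whichever copy arrives first
            threading.Thread(target=issue, args=(mirror,)).start()
            return answers.get()

The trade-off described above applies here too: the fallback read costs
host-side bandwidth and a second buffer, which is part of why host-based
mirroring rarely does this.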
Todd H. Poole
2008-Aug-30 05:05 UTC
[zfs-discuss] pulling disks was: ZFS hangs/freezes after disk
> Let's not be too quick to assign blame, or to think that perfecting
> the behaviour is straightforward or even possible.
>
> Start introducing random $20 components and you begin to dilute the
> quality and predictability of the composite system's behaviour.
>
> But this NEVER happens on linux *grin*.

Actually, it really doesn't! At least, it hasn't in many years... I
can't tell if you were being sarcastic or not, but honestly... you find
a USB drive that can bring down your Linux machine, and I'll show you
someone running a kernel from November of 2003.

And for all the other "cheaper" components out there? Those are the
components we make serious bucks off of. Just because it costs $30
doesn't mean it won't last a _really_ long time under stress! But if it
doesn't, even when hardware fails, software's always there to route
around it. So no biggie.

> Perfection?

Is Linux perfect? Not even close. But it's certainly a lot closer on
what the topic of this thread seems to cover: not crashing. Linux may
get a small number of things wrong, but it gets a ridiculously large
number of them right, and stability/reliability on unstable/unreliable
hardware is one of them. ;)

PS: I found this guy's experiment amusing. Talk about adding a bunch of
cheap, $20 crappy components to a system, and still seeing it soar.
http://linuxgazette.net/151/weiner.html
--
This message posted from opensolaris.org
> Wrt. what I've experienced and read in ZFS-discussion etc. list I've the
> __feeling__, that we would have got really into trouble, using Solaris
> (even the most recent one) on that system ...
> So if one asks me, whether to run Solaris+ZFS on a production system, I
> usually say: definitely, but only, if it is a Sun server ...
>
> My 2 cents ;-)

I can't agree with you more. I'm beginning to understand what the phrase
"Sun's software is great - as long as you're running it on Sun's
hardware" means...

Whether it's deserved or not, I feel like this OS isn't mature yet. And
maybe it's not the whole OS, maybe it's some specific subsection (like
ZFS), but my general impression of OpenSolaris has been... not stellar.

I don't think it's ready yet for a prime time slot on commodity
hardware. And while I don't intend to fan any flames that might already
exist (remember, I've only just joined within the past week, and thus
haven't been around long enough to figure out whether any flames exist),
I believe I'm justified in making the above statement. Just off the top
of my head, here is a list of red flags I've run into in 7 days' time:

- If I don't wait for at least 2 minutes before logging into my system
  after I've powered everything up, my machine freezes.
- If I yank a hard drive out of a (supposedly redundant) RAID5 array (or
  "RAID-Z zpool," as it's called) that has an NFS mount attached to it,
  not only does that mount point get severed, but _all_ NFS connections
  to all mount points are dropped, regardless of whether they were on
  the zpool or not. Oh, and then my machine freezes.
- If I just yank a hard drive out of a (supposedly redundant) RAID5
  array (or "RAID-Z zpool," as it's called), and forget about NFS
  entirely, my machine freezes.
- If I query a zpool for its status, but don't do so under the right
  circumstances, my machine freezes.

I've had to use the hard reset button on my case more times than I've
had the ability to shut down the machine properly from a non-frozen
console or GUI. That shouldn't happen.

I dunno. If this sounds like bitching, that's fine: I'll file bug
reports and then move on. It's just that sometimes, software needs to
grow a bit more before it's ready for production, and I feel like trying
to run OpenSolaris + ZFS on commodity hardware just might be one of
those times.

Just two more cents to add to yours. As Richard said, the only way to
fix things is to file bug reports. Hopefully, the most helpful things to
come out of this thread will be those forms of constructive criticism.
As for now, it looks like a return to LVM2, XFS, and one of the Linux or
BSD kernels might be a more stable decision, but don't worry - I haven't
been completely dissuaded, and I definitely plan on checking back in a
few releases to see how things are going in the ZFS world. ;)

Thanks everyone for your help, and keep improving! :)

-Todd
--
This message posted from opensolaris.org
On 30-Aug-08, at 2:32 AM, Todd H. Poole wrote:

>> Wrt. what I've experienced and read in ZFS-discussion etc. list I've
>> the __feeling__, that we would have got really into trouble, using
>> Solaris (even the most recent one) on that system ...
>> So if one asks me, whether to run Solaris+ZFS on a production system,
>> I usually say: definitely, but only, if it is a Sun server ...
>>
>> My 2 cents ;-)
>
> I can't agree with you more. I'm beginning to understand what the
> phrase "Sun's software is great - as long as you're running it on
> Sun's hardware" means...
> ...

Totally OT, but this is also why Apple doesn't sell OS X for whitebox
junk. :)

--Toby
On Sat, 30 Aug 2008 09:35:31 -0300 Toby Thain
<toby at telegraphics.com.au> wrote:

> On 30-Aug-08, at 2:32 AM, Todd H. Poole wrote:
>> I can't agree with you more. I'm beginning to understand what the
>> phrase "Sun's software is great - as long as you're running it on
>> Sun's hardware" means...
>
> Totally OT, but this is also why Apple doesn't sell OS X for
> whitebox junk. :)

There are also a lot of whiteboxes that -do- run Solaris very well.
"Some apples are rotten, others are healthy." That's quite normal.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
++ http://nagual.nl/ + SunOS sxce snv95 ++
On Fri, Aug 29, 2008 at 10:32 PM, Todd H. Poole <toddhpoole at gmail.com> wrote:

> I can't agree with you more. I'm beginning to understand what the
> phrase "Sun's software is great - as long as you're running it on
> Sun's hardware" means...
>
> Whether it's deserved or not, I feel like this OS isn't mature yet. And
> maybe it's not the whole OS, maybe it's some specific subsection (like
> ZFS), but my general impression of OpenSolaris has been... not stellar.
>
> I don't think it's ready yet for a prime time slot on commodity
> hardware.

I agree, but with careful research, you can find the *right* hardware.
In my quest (it took weeks) to find reports of reliable hardware, I
found that the AMD chipsets were way too buggy. I also noticed that of
the workstations Sun sells, they use nVidia nForce chipsets for the AMD
CPUs and the Intel X38 (the only Intel desktop chipset that supports
ECC) for the Intel CPUs. I read good and bad stories about various
hardware and decided I would stay close to what Sun sells. I've found NO
Sun hardware using the same chipset as yours.

There are a couple of AHCI bugs with the AMD/ATI SB600 chipset. Both
Linux and Solaris were affected. Linux put in a workaround that may hurt
performance slightly. Sun still has the bug open, but for what it's
worth, who's gonna use or care about a buggy desktop chipset in a
storage server?

I have an nVidia nForce 750a chipset (not the same as the Sun
workstations, which use nForce Pro, but it's not too different) and the
same CPU (45 Watt dual core!) you have. My system works great (so far).
I haven't tried the disconnect-drive issue though. I will try it
tonight.