Brian Hechinger
2008-Dec-02 17:42 UTC
[zfs-discuss] A failed disk can bring down a machine?
I was not in front of the machine; I had remote hands working with me, so I apologize in advance for any lack of detail I'm about to give.

The server in question is running snv_81, booting ZFS Root using Tim's scripts to "convert" it over to ZFS Root.

My server in colo stopped responding. I had a screen session open and I could switch between screen windows and create new windows, but I could not run any commands. I also could not log into the box.

The hands-on person saw this on the console (transcribed from a video console):

SYNCHRONIZE CACHE command failed (5)
scsi: WARNING: /pci@1,0/pci1095,3124@2/disk@1,0 (sd1)

sd1 is one of two SATA disks connected to the machine via a SiL3124 controller.

I had the remote hands pull sd1 and reboot the machine. It came right up and has been running fine since. Lacking its mirrored disks, however.

Due to other issues I've had with this box (if you think you can get away with running ZFS on a 32-bit machine, you are mistaken) I'm looking to replace it anyway. What concerns me is that a single disk having gone bad like that can take out the whole machine. This is not what I would consider an ideal or acceptable setup for a machine that is in colo and doesn't have 24x7 onsite support.

What was to blame for this disk failure causing my machine to become unresponsive? Was it the SiL3124? Is it something else? Is this what I should expect from SATA?

I ask all these questions because I want to make sure that if this is indeed connected to the use of a SATA controller, or the use of a specific SATA controller, I avoid that with this next machine.

I've got a very slim budget on this, and based on that I found what looks like a pretty nice little server that is in my budget. It's an ASUS RS161-E2/PA2, which is based on the nForce Professional 2200, which from what I can tell is what the Ultra 40 is based on, so I would expect it to pretty much just work.

Will the nv_sata driver behave in a more sane fashion in a case like what I've just gone through? If this is a shortcoming of SATA, does anyone have any recommendations on a not too expensive setup based on a SAS controller?

As much as I would like this thing to do a great job in the performance arena, stability is definitely higher on the list of what's really important to me.

Thanks,

-brian
--
"Coding in C is like sending a 3 year old to do groceries. You gotta tell them exactly what you want or you'll end up with a cupboard full of pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Tim
2008-Dec-02 18:50 UTC
[zfs-discuss] A failed disk can bring down a machine?
On Tue, Dec 2, 2008 at 11:42 AM, Brian Hechinger <wonko at 4amlunch.net> wrote:
> What was to blame for this disk failure causing my machine to become
> unresponsive? Was it the SiL3124? Is it something else? Is this what I
> should expect from SATA?

I believe the issue you're running into is the failmode you currently have set. Take a look at this:

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

--Tim
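For reference, the failmode setting mentioned above is a per-pool ZFS property with three values: wait (the default, which blocks I/O until the device comes back and the errors are cleared), continue (which returns EIO to new write requests while still allowing reads from healthy devices), and panic. A minimal sketch of checking and changing it follows; the pool name rpool is an assumption, since the thread never names the pool, so substitute whatever `zpool list` reports.

```shell
# Hypothetical pool name (not given in the thread); replace with your own.
POOL=rpool

# Show the current failure-mode policy: wait, continue, or panic.
zpool get failmode "$POOL"

# Return EIO to new writes instead of blocking all I/O on pool failure.
zpool set failmode=continue "$POOL"

# Confirm the change took effect.
zpool get failmode "$POOL"
```

The setting is stored with the pool, so it should persist across reboots. Whether it prevents a hang like the one described also depends on how the controller driver reports the failed disk.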
Brian Hechinger
2008-Dec-02 20:12 UTC
[zfs-discuss] A failed disk can bring down a machine?
On Tue, Dec 02, 2008 at 12:50:08PM -0600, Tim wrote:
> I believe the issue you're running into is the failmode you currently have
> set. Take a look at this:
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

Ah ha! It's now set to continue. Hopefully that'll save me next time this happens. Which I hope isn't too soon. ;)

Sadly this has rid me of my urgent need to replace that box, which I suppose isn't a bad thing as I can now take my time.

Anyone have any opinions of that ASUS box running the latest OpenSolaris?

-brian