Roy Dragseth
2009-Aug-31 08:56 UTC
[Lustre-discuss] Changing error behaviour to kernel panic.
Hi.

We have a few problems with our storage hardware where we get file system corruption on some OSTs once in a while. Lustre prefers to try to continue, but I/O operations to the OSTs in question fail, making applications crash. I would prefer to have a full hang instead of a partially working system, as a reboot is needed anyway to fix the problem. The tune2fs manual says this can be done using the -e flag on a per-device basis, making the kernel panic on errors.

So, my question is: Will this have any severe side effects that I'm not aware of? Are there any alternatives to this approach?

System specs: CentOS 5.2 / Lustre 1.6.7.1 (we're upgrading to Lustre 1.8.X in a few weeks).

Just for the record, here is an example of the error message we get in dmesg:

LustreError: 16008:0:(filter_io_26.c:721:filter_commitrw_write()) error starting transaction: rc = -30

Any hints are greatly appreciated.

Regards,
r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
Brian J. Murrell
2009-Aug-31 13:01 UTC
[Lustre-discuss] Changing error behaviour to kernel panic.
On Mon, 2009-08-31 at 10:56 +0200, Roy Dragseth wrote:
> Hi.

Hi,

> We have a few problems with our storage hw where we get file systems corruption
> on some OSTs once in a while. Lustre prefers to try to continue, but io-
> operations to the OSTs in question fail making applications crash. I would
> prefer to have a full hang instead of a partially working system as a reboot
> is needed anyway to fix the problem. The tune2fs manual says this can be done
> using the -e flag on a per device basis making the kernel panic on errors.

What you are looking for is the "errors=panic" mount option. It can be set at mount time with "-o errors=panic" or, I believe, it can be set permanently in the device's configuration.

> So, my question is: Will this have any severe side-effects that I'm not aware
> of?

Well, if what you are looking for is a node to completely halt when a corruption is detected, this will certainly do it.

> LustreError: 16008:0:(filter_io_26.c:721:filter_commitrw_write()) error starting
> transaction: rc = -30

Yes, the event that made the backing store read-only (-30) will instead panic the node. That gives something like HA on another node the opportunity to start serving up the target. This is only useful, of course, if there is not really any corruption. If the target really is corrupted, though, you really ought to fix that first. You shouldn't just go on, letting corruptions pile up.

b.
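
For anyone reading this thread later: the "rc = -30" in the log line quoted above is a negated errno value, which is why it is read as "the backing store went read-only". Below is a minimal sketch, not anything shipped with Lustre, of how such a line could be decoded when scanning dmesg output. The function name decode_rc and the regular expression are made up for illustration, and the errno names come from whatever platform the script runs on.

```python
import errno
import os
import re

# Matches the trailing "rc = <number>" found in the error line quoted in this thread.
# The pattern is an assumption based on that single example line.
RC_PATTERN = re.compile(r"rc\s*=\s*(-?\d+)")


def decode_rc(line):
    """Return (rc, errno_name, description) for a log line, or None if it has no rc field."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)  # kernel-style return codes are negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


line = ("LustreError: 16008:0:(filter_io_26.c:721:filter_commitrw_write()) "
        "error starting transaction: rc = -30")
print(decode_rc(line))
```

On a Linux host, 30 should map to EROFS (read-only file system), which matches the explanation in the reply above. Lines without an "rc = N" field return None, so the function can be run over a whole dmesg dump.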