Roy Dragseth
2006-Dec-29 02:55 UTC
[Lustre-discuss] OSS/MDS server crashes after recovery.
We had a disk failure on one of our RAID arrays and had to run fsck on all OST/MDS devices. The fsck process made a lot of repairs, but eventually it succeeded in cleaning all devices.

When I try to start Lustre again, everything seems fine until I try to connect the clients. Then one of the two combined MDS/OSS servers crashes as soon as the recovery grace period is over. If I start with --abort_recovery, the server hangs almost immediately. It dumps a lot of debug log files in the /tmp directory, but I cannot make any sense of them.

The syslog gets filled with things like this:

Dec 29 10:52:36 lustre-11-1 kernel: LustreError: 6943:0: (client.c:554:ptlrpc_check_reply()) previously skipped 1 similar messages
Dec 29 10:52:36 lustre-11-1 kernel: Lustre: OSC_lustre-11-0.local_ost8_home-mds: Connection restored to service ost8 using nid 0@lo.
Dec 29 10:52:36 lustre-11-1 kernel: Lustre: previously skipped 1 similar messages
Dec 29 10:52:36 lustre-11-1 kernel: LustreError: 7113:0: (lov_obd.c:837:lov_clear_orphans()) error in orphan recovery on OST idx 0/4: rc = -16
Dec 29 10:52:36 lustre-11-1 kernel: LustreError: 7113:0: (lov_obd.c:837:lov_clear_orphans()) previously skipped 1 similar messages

This crash makes it impossible to move on to repairing things with lfsck.

System info:
RH EL4 w/2.6.9-34.EL_lustre1.4.6.4smp
lustre 1.4.6.4

Lustre setup:
Two combined MDS/OSS servers with dual FC connections to two SATA RAIDs, serving a home area and a scratch area.

Any help is greatly appreciated. My last resort is to reformat and restore everything from backup.

Regards,
r.