Roger Spellman
2009-Jun-04 21:29 UTC
[Lustre-discuss] Critical Situation -- trying to remove a badly configured OST
I wonder if anyone can help me. I'm in a bit of a rush, because my file system is down, and I can't seem to get it back up. I have a lot of users who need their data ASAP.

As of yesterday, I had a 5 node system. Node 1 was an MGS and MDT. Nodes 2-5 were OSTs. The network is both IB and tcp.

Today, I added another OST. Unfortunately, I messed up the --mgsnode option, and I only set it up for IB. We noticed this when the IB clients were OK, but the TCP clients were not.

Then, rather than try to change that (as I should have), I just reformatted that OST. That left me in a bad spot: OST-0004 was missing. The device list from the MDT looked as follows:

    3 UP lov lstr-ter-mdtlov lstr-ter-mdtlov_UUID 4
    4 UP mds lstr-ter-MDT0000 lstr-ter-MDT0000_UUID 437
    5 UP osc lstr-ter-OST0000-osc lstr-ter-mdtlov_UUID 5
    6 UP osc lstr-ter-OST0001-osc lstr-ter-mdtlov_UUID 5
    7 UP osc lstr-ter-OST0002-osc lstr-ter-mdtlov_UUID 5
    8 UP osc lstr-ter-OST0003-osc lstr-ter-mdtlov_UUID 5
    9 UP osc lstr-ter-OST0004-osc lstr-ter-mdtlov_UUID 5
   10 UP osc lstr-ter-OST0005-osc lstr-ter-OST0004-osc-mdtlov_UUID 4

OST-0004 was the "bad" one, and OST0004 was its replacement. Why is the UUID so messed up?

In any case, I just deactivated OST-0004 using:

    lctl conf_param lstr-ter-0ST0004.osc.active=0

That did not solve anything. I've even tried deactivating OST-0005, trying to get back to where I was yesterday.

I've rebooted my whole system, but can't even mount the MDT. When I try, I get the following errors:

    Jun 4 14:28:54 ts-nrel-01 sshd(pam_unix)[6717]: session opened for user root by (uid=0)
    Jun 4 14:28:55 ts-nrel-01 kernel: LustreError: 137-5: UUID 'lstr-ter-MDT0000_UUID' is not available for connect (stopping)
    Jun 4 14:28:56 ts-nrel-01 kernel: Lustre: Request x19 sent from lstr-ter-OST0000-osc to NID 172.16.103.22@tcp 5s ago has timed out (limit 5s).
    Jun 4 14:28:56 ts-nrel-01 kernel: Lustre: Request x20 sent from lstr-ter-OST0001-osc to NID 172.16.103.23@tcp 5s ago has timed out (limit 5s).
    Jun 4 14:28:56 ts-nrel-01 kernel: Lustre: Failing over lstr-ter-OST0000-osc
    Jun 4 14:29:03 ts-nrel-01 kernel: LustreError: 137-5: UUID 'lstr-ter-MDT0000_UUID' is not available for connect (stopping)
    Jun 4 14:29:13 ts-nrel-01 last message repeated 3 times
    Jun 4 14:29:13 ts-nrel-01 kernel: LustreError: Skipped 1 previous similar message
    Jun 4 14:29:16 ts-nrel-01 kernel: Lustre: Request x23 sent from lstr-ter-OST0004-osc to NID 172.17.103.27@o2ib 25s ago has timed out (limit 25s).
    Jun 4 14:29:16 ts-nrel-01 kernel: Lustre: Skipped 2 previous similar messages
    Jun 4 14:29:16 ts-nrel-01 kernel: Lustre: Failing over lstr-ter-OST0004-osc
    Jun 4 14:29:16 ts-nrel-01 kernel: Lustre: Skipped 3 previous similar messages
    Jun 4 14:29:16 ts-nrel-01 kernel: Lustre: lstr-ter-MDT0000: shutting down for failover; client state will be preserved.
    Jun 4 14:29:16 ts-nrel-01 kernel: Lustre: MDT lstr-ter-MDT0000 has stopped.

If it helps to see the output of tunefs.lustre, here it is on the MDT and the first OST:

    checking for existing Lustre data: found CONFIGS/mountdata
    Reading CONFIGS/mountdata

    Read previous values:
    Target: MGS
    Index: unassigned
    Lustre FS: lstr-ter
    Mount type: ldiskfs
    Flags: 0x174 (MGS needs_index first_time update writeconf )
    Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
    Parameters:

    Permanent disk data:
    Target: MGS
    Index: unassigned
    Lustre FS: lstr-ter
    Mount type: ldiskfs
    Flags: 0x174 (MGS needs_index first_time update writeconf )
    Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
    Parameters:

    Writing CONFIGS/mountdata

    checking for existing Lustre data: found CONFIGS/mountdata
    Reading CONFIGS/mountdata

    Read previous values:
    Target: lstr-ter-MDT0000
    Index: 0
    Lustre FS: lstr-ter
    Mount type: ldiskfs
    Flags: 0x1 (MDT )
    Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
    Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

    Permanent disk data:
    Target: lstr-ter-MDT0000
    Index: 0
    Lustre FS: lstr-ter
    Mount type: ldiskfs
    Flags: 0x1 (MDT )
    Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
    Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

    checking for existing Lustre data: found CONFIGS/mountdata
    Reading CONFIGS/mountdata

    Read previous values:
    Target: lstr-ter-OST0000
    Index: 0
    Lustre FS: lstr-ter
    Mount type: ldiskfs
    Flags: 0x2 (OST )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

    Permanent disk data:
    Target: lstr-ter-OST0000
    Index: 0
    Lustre FS: lstr-ter
    Mount type: ldiskfs
    Flags: 0x2 (OST )
    Persistent mount opts: errors=remount-ro,extents,mballoc
    Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

    Writing CONFIGS/mountdata

My goal for right now is to get something (even without the new OST) up and running ASAP, as my users are putting great pressure on me. If you can help, I'd greatly appreciate it.
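For readers comparing device lists like the one in the message above, here is a minimal Python sketch that parses such rows and flags osc entries whose UUID column differs from the others. The function names, and the rule that every osc should carry `<fsname>-mdtlov_UUID`, are assumptions drawn only from the other rows of this particular listing, not from Lustre documentation.

```python
import re

# Sample rows copied from the device list in the post (a subset, one row per line).
DEVICE_LIST = """\
  3 UP lov lstr-ter-mdtlov lstr-ter-mdtlov_UUID 4
  4 UP mds lstr-ter-MDT0000 lstr-ter-MDT0000_UUID 437
  5 UP osc lstr-ter-OST0000-osc lstr-ter-mdtlov_UUID 5
  9 UP osc lstr-ter-OST0004-osc lstr-ter-mdtlov_UUID 5
 10 UP osc lstr-ter-OST0005-osc lstr-ter-OST0004-osc-mdtlov_UUID 4
"""

# Six whitespace-separated columns: index, state, type, name, uuid, refcount.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$")


def parse_devices(text):
    """Return one dict per row that matches the six-column layout."""
    devices = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            index, state, dev_type, name, uuid, refs = match.groups()
            devices.append({
                "index": int(index),
                "state": state,
                "type": dev_type,
                "name": name,
                "uuid": uuid,
                "refs": int(refs),
            })
    return devices


def odd_osc_uuids(devices, fsname):
    """Hypothetical check: names of osc devices whose UUID is not <fsname>-mdtlov_UUID."""
    expected = f"{fsname}-mdtlov_UUID"
    return [d["name"] for d in devices
            if d["type"] == "osc" and d["uuid"] != expected]


if __name__ == "__main__":
    print(odd_osc_uuids(parse_devices(DEVICE_LIST), "lstr-ter"))
```

Run against the sample rows, this reports only `lstr-ter-OST0005-osc`, the entry whose UUID embeds `OST0004-osc`. It points at the odd row but says nothing about why the configuration log produced it.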
Roger Spellman
2009-Jun-04 22:09 UTC
[Lustre-discuss] Critical Situation -- trying to remove a badly configured OST
I rebooted my system and tried mounting just my MGS. When I start up the MGS, I get the following errors:

    Jun 4 16:01:10 ts-nrel-01 kernel: Lustre: Server MGS on device /dev/sdb1 has started
    Jun 4 16:01:12 ts-nrel-01 kernel: LustreError: 6481:0:(mgs_handler.c:560:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
    Jun 4 16:01:12 ts-nrel-01 kernel: LustreError: 6481:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011e5c9050 x4348809/t0 o400-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1244153372 ref 1 fl Interpret:/0/0 rc -107/0
    Jun 4 16:01:13 ts-nrel-01 kernel: LustreError: 6482:0:(mgs_handler.c:560:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
    Jun 4 16:01:13 ts-nrel-01 kernel: LustreError: 6482:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-107) req@ffff8101219c5850 x2634794/t0 o400-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1244153373 ref 1 fl Interpret:/0/0 rc -107/0

Any idea how to fix this?
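For readers trying to interpret the "processing error (-107)" lines in this thread, here is a small Python sketch that pulls the code out of such a line and looks it up in the local errno table. It assumes that the negative numbers Lustre logs follow Linux errno numbering (on Linux, 107 is ENOTCONN, "Transport endpoint is not connected"). The helper names are hypothetical, and the lookup reflects whatever platform the script runs on.

```python
import errno
import os
import re

# Prefix of one of the log lines quoted in this thread.
LOG_LINE = ("LustreError: 6481:0:(ldlm_lib.c:1619:target_send_reply_msg()) "
            "@@@ processing error (-107)")


def processing_error(line):
    """Return the integer inside 'processing error (N)', or None if absent."""
    match = re.search(r"processing error \((-?\d+)\)", line)
    return int(match.group(1)) if match else None


def describe(rc):
    """Map a negative return code to the local errno name and message.

    Assumption: the code is a negated Linux errno value; on other
    platforms the name and text printed here may differ.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "unknown")
    return f"{rc}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    rc = processing_error(LOG_LINE)
    if rc is not None:
        print(describe(rc))
```

This only names the error. It does not explain why the MGS considers the sender of operation 400 unconnected.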