Hi everyone,

We have 2 OSS's, each with 5 1TB OST's that share lun's on our san.

OST0-4 are on server3
OST5-9 are on server4
Each OST is 1TB with an external journal

Server3 crashed HARD (as in it wouldn't post upon power off, wait 30 seconds, power on) and we were told by the vendor that the motherboard died. In the meantime I attempted to mount the OSTs on server4. Server3 was powered off before attempting this (STONITH theory, right?)

I ended up with lots of problems and did end up hitting a few lbug's, specifically:

LustreError: 11283:0: (tracefile.c:431:libcfs_assertion_failed()) LBUG
LustreError: 8095:0: (tracefile.c:431:libcfs_assertion_failed()) LBUG

We are running an older lustre version (lustre-1.6.4.3-2.6.18_53.1.13.el5_lustre.1.6.4.3smp) on CentOS 5.2 boxes, with the matching e2fsck, utilities, etc. from the appropriate download page on the Sun website.

I had major problems getting the remaining lustre server to mount the new OSTs because of apparent journal problems. I kept hitting "LDISKFS: failed to claim external journal device" when trying to mount the OST's as type ldiskfs. Trying to mount them as type lustre gave me an error -22.

The way I fixed it was by taking the following steps:

* fsck /path/to/block/device/of/ost-data (this seemed to pick up the journal correctly)
* ls -la /path/to/block/device/of/journal-dev (the journal device of ost-data), which gives output such as:
  brw-rw---- 1 root disk 253, 7 Feb 24 20:31 /path/to/block/device/of/journal-dev
* mount -t ldiskfs -o journal_dev=0xFD07 /path/to/block/device/of/ost-data /mnt/tmp-mt-pt (FD = 253 in hex, 07 = 7 in hex)
* umount /mnt/tmp-mt-pt
* mount -t lustre /path/to/block/device/of/ost-data /mnt/normal-mountpoint-of-ost

My questions:

1) Since the mds did not crash, but half the OST's did, do I need to make any changes to the mds?
2) Any idea why e2fsck can figure out the journal device automatically but Lustre cannot? (at least until I manually mount/unmount as type ldiskfs and manually specify the journal major/minor dev numbers)
3) Is the LBUG above fixed in a newer version of lustre?

If there is not enough information, what steps should I take next time to get you everything you need?

Thanks,

Rob

The information contained in this message and its attachments is intended only for the private and confidential use of the intended recipient(s). If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly prohibited.
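
The journal_dev value used in the steps above (0xFD07 for major 253, minor 7) can be computed instead of worked out by hand. What follows is only a minimal Python sketch. It assumes the option expects the device number in the layout of the Linux kernel's new_encode_dev(), which agrees with the 0xFD07 example in the message; the function names encode_journal_dev and journal_dev_for are made up for illustration.

```python
import os
import stat


def encode_journal_dev(major: int, minor: int) -> str:
    """Return a hex device number suitable for -o journal_dev=...

    Assumption: the layout follows the Linux kernel's new_encode_dev(),
    i.e. the low 8 bits of the minor, then the major, then the
    remaining high bits of the minor.
    """
    value = (minor & 0xFF) | (major << 8) | ((minor & ~0xFF) << 12)
    return f"0x{value:04X}"


def journal_dev_for(path: str) -> str:
    """Read major/minor from a block device node and encode them."""
    st = os.stat(path)
    if not stat.S_ISBLK(st.st_mode):
        raise ValueError(f"{path} is not a block device")
    return encode_journal_dev(os.major(st.st_rdev), os.minor(st.st_rdev))


if __name__ == "__main__":
    # Major 253, minor 7, as in the ls output quoted in the message.
    print(encode_journal_dev(253, 7))
```

On a real system, journal_dev_for("/path/to/block/device/of/journal-dev") would read the same two numbers that ls -la shows, so the value does not have to be converted to hex manually.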
On Tue, Feb 24, 2009 at 09:38:42PM -0600, Hendelman, Rob wrote:
> ......
> I ended up with lots of problems and did end up hitting a few lbug's,
> specifically:
>
> LustreError: 11283:0: (tracefile.c:431:libcfs_assertion_failed()) LBUG
> LustreError: 8095:0: (tracefile.c:431:libcfs_assertion_failed()) LBUG

There should be some messages immediately before these two LBUGs, which would tell us the exact place where they happened.

Isaac
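
One way to collect the messages that precede an LBUG is to group console lines by the process ID at the start of each Lustre message and keep the last few lines logged by the process that hit the LBUG. The Python below is a rough sketch of that idea, not a Lustre tool; the function name context_before_lbug and the max_lines parameter are assumptions made for illustration, and the regular expression is inferred only from the message format quoted in this thread.

```python
import re

# Matches the "PID:0:(file.c:line:function())" prefix of a Lustre message.
LUSTRE_MSG = re.compile(r"Lustre(?:Error)?: (\d+):\d+:\s*\(")


def context_before_lbug(lines, max_lines=5):
    """Map each PID that hit an LBUG to the messages it logged just before."""
    history = {}  # pid -> every message seen so far for that pid
    result = {}
    for line in lines:
        match = LUSTRE_MSG.search(line)
        if not match:
            continue  # not a Lustre console message
        pid = match.group(1)
        text = line.rstrip()
        if text.endswith("LBUG"):
            # Keep only the most recent messages from the same process.
            result[pid] = history.get(pid, [])[-max_lines:]
        history.setdefault(pid, []).append(text)
    return result


if __name__ == "__main__":
    with open("/var/log/messages") as log:
        for pid, context in context_before_lbug(log).items():
            print(f"--- before LBUG in process {pid} ---")
            print("\n".join(context))
```

The line just before the LBUG is usually the interesting one, since it names the failed assertion and the source location.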
I grepped my /var/log/messages for 11283 and 8095:

messages:Feb 24 17:41:53 maglustre04 kernel: Lustre: 11283:0:(filter.c:805:filter_init_server_data()) RECOVERY: service fs01-OST0001, 6 recoverable clients, last_rcvd 3540640
messages:Feb 24 17:41:53 maglustre04 kernel: LustreError: 11283:0:(recov_thread.c:473:llog_start_commit_thread()) error starting thread #1: -513
messages:Feb 24 17:41:53 maglustre04 kernel: LustreError: 11283:0:(llog_obd.c:392:llog_cat_initialize()) rc: -513
messages:Feb 24 17:41:53 maglustre04 kernel: LustreError: 11283:0:(filter.c:1717:filter_common_setup()) failed to setup llogging subsystems
messages:Feb 24 17:41:53 maglustre04 kernel: LustreError: 11283:0:(lprocfs_status.c:671:lprocfs_obd_cleanup()) ASSERTION(obd->obd_proc_exports->subdir == NULL) failed
messages:Feb 24 17:41:53 maglustre04 kernel: LustreError: 11283:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG
messages:Feb 24 17:41:53 maglustre04 kernel: Lustre: 11283:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 11283
messages:Feb 24 19:39:16 maglustre04 kernel: Lustre: 8095:0:(filter.c:805:filter_init_server_data()) RECOVERY: service fs01-OST0006, 6 recoverable clients, last_rcvd 3390940
messages:Feb 24 19:39:16 maglustre04 kernel: LustreError: 8095:0:(recov_thread.c:473:llog_start_commit_thread()) error starting thread #1: -513
messages:Feb 24 19:39:16 maglustre04 kernel: LustreError: 8095:0:(llog_obd.c:392:llog_cat_initialize()) rc: -513
messages:Feb 24 19:39:16 maglustre04 kernel: LustreError: 8095:0:(filter.c:1717:filter_common_setup()) failed to setup llogging subsystems
messages:Feb 24 19:39:16 maglustre04 kernel: LustreError: 8095:0:(lprocfs_status.c:671:lprocfs_obd_cleanup()) ASSERTION(obd->obd_proc_exports->subdir == NULL) failed
messages:Feb 24 19:39:16 maglustre04 kernel: LustreError: 8095:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG
messages:Feb 24 19:39:16 maglustre04 kernel: Lustre: 8095:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 8095

Unfortunately I did not capture the stack...

Thanks,

Robert

-----Original Message-----
From: Isaac Huang [mailto:He.Huang at Sun.COM]
Sent: Tue 2/24/2009 11:18 PM
To: Hendelman, Rob
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] OST external journals & an lbug

On Tue, Feb 24, 2009 at 09:38:42PM -0600, Hendelman, Rob wrote:
> ......
> I ended up with lots of problems and did end up hitting a few lbug's,
> specifically:
>
> LustreError: 11283:0: (tracefile.c:431:libcfs_assertion_failed()) LBUG
> LustreError: 8095:0: (tracefile.c:431:libcfs_assertion_failed()) LBUG

There should be some messages immediately before these two LBUGs, which would tell us the exact place where they happened.

Isaac
On Wed, 2009-02-25 at 09:28 -0600, Hendelman, Rob wrote:
> I grepped my /var/log/messages for 11283 and 8095

You really should try searching bugzilla. It's easy. :-)

> messages:Feb 24 17:41:53 maglustre04 kernel: LustreError: 11283:0:(lprocfs_status.c:671:lprocfs_obd_cleanup()) ASSERTION(obd->obd_proc_exports->subdir == NULL) failed

Given this assertion, I found bug 14370, fixed in 1.6.5. I don't recall what release you said you were running but if it was < 1.6.5, of course an upgrade, or patching your release will solve that problem.

b.
Thanks for the help. I searched for the bug but I think I put the wrong term in.

I put in "tracefile.c:431:libcfs_assertion_failed()" and that brought up 2 items that didn't seem applicable to me. Next time I will search by ASSERTION.

Thanks for the help as always!

Robert
On Wed, 2009-02-25 at 12:18 -0600, Hendelman, Rob wrote:
> Thanks for the help. I searched for the bug but I think I put the wrong term in.
>
> I put in "tracefile.c:431:libcfs_assertion_failed()" and that brought up 2 items that didn't seem applicable to me. Next time I will search by ASSERTION.

Better still, just search for what's between the brackets.

> Thanks for the help as always!

NP.

b.
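
Reading "what's between the brackets" as the expression inside ASSERTION(...), the search term can be pulled out of a log line mechanically. This is a small Python sketch under that reading; the function name assertion_search_term is hypothetical and the pattern is based only on the log lines quoted in this thread.

```python
import re

# Capture everything between "ASSERTION(" and the closing ") failed".
ASSERTION = re.compile(r"ASSERTION\((.*)\) failed")


def assertion_search_term(log_line):
    """Return the asserted expression, or None if the line has no assertion."""
    match = ASSERTION.search(log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = (
        "LustreError: 11283:0:(lprocfs_status.c:671:lprocfs_obd_cleanup()) "
        "ASSERTION(obd->obd_proc_exports->subdir == NULL) failed"
    )
    print(assertion_search_term(line))
```

For the line quoted in this thread, the extracted term is obd->obd_proc_exports->subdir == NULL. The generic tracefile.c:431 location appears in every LBUG, which may be why searching on it turned up unrelated results.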