We hit an LBUG several days ago on our Lustre 1.8.0 system. One OSS reported:

kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG
kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 24669
......

I googled for this and found little information about it. It seems to be a race condition on the OSS, right? Should I open a Bugzilla ticket for this LBUG?

Thanks.
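For readers wondering what the 0x5a5a5a comparison in the assertion is for: one plausible reading is that it is a use-after-free check. The assumption here (not confirmed anywhere in this thread) is that freed memory gets overwritten with a 0x5a fill byte, so a refcount read from an already-freed export would appear as 0x5a5a5a5a and fail the "< 0x5a5a5a" test. The sketch below only illustrates that idea in plain userspace C; demo_export, demo_poison, demo_refcount_sane, and demo_export_get are hypothetical names, not Lustre code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a Lustre export; only the refcount matters here. */
struct demo_export {
    int exp_refcount; /* assumed to be a 4-byte int */
};

#define DEMO_POISON_BYTE  0x5a      /* assumed fill byte written on free */
#define DEMO_POISON_LIMIT 0x5a5a5a  /* threshold taken from the ASSERTION in the log */

/* Overwrite the object with the poison byte, as a freeing allocator might.
 * The memory is deliberately not released, so the demo can inspect it safely. */
static void demo_poison(struct demo_export *exp)
{
    memset(exp, DEMO_POISON_BYTE, sizeof(*exp));
}

/* Mirror of the failed check: returns 1 if the refcount looks sane, 0 otherwise. */
static int demo_refcount_sane(const struct demo_export *exp)
{
    return exp->exp_refcount < DEMO_POISON_LIMIT;
}

/* Take a reference, asserting first that the object does not look freed. */
static void demo_export_get(struct demo_export *exp)
{
    assert(demo_refcount_sane(exp));
    exp->exp_refcount++;
}
```

Under these assumptions, a refcount that has been through demo_poison() reads back as 0x5a5a5a5a, so demo_refcount_sane() returns 0. That would correspond to a thread touching an export after another thread has already freed it, which fits the race-condition guess above.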
Sure, but I think that for engineering to make progress on this bug, they are going to want a crash dump. If you can enable crash dumps and panic on LBUG (and, if you run HA, increase the dead timeout so the dump can complete before the node is shot in the head), it would provide more info for the bug report.

That being said, quite a few other bugs have been fixed since 1.8.0, so you really should upgrade to 1.8.4 ASAP.

Kevin

On Nov 21, 2010, at 6:59 PM, Larry <tsrjzq at gmail.com> wrote:
> We had a LBUG several days ago on our lustre 1.8.0. One OSS reported [...]
Looks like bug 17924. You may want to upgrade to a newer release.

On 2010-11-22, at 9:59, Larry wrote:
> We had a LBUG several days ago on our lustre 1.8.0. One OSS reported [...]
Yes, but it still has not been fixed: no patches were landed for the bug, and it was reported again after an update to 1.8.1.1 (under a restricted bug) and against 1.8.2 (bug 22936). Progress appears to be stalled without a crash dump.

Kevin

Wang Yibin wrote:
> Looks like bug 17924. You may want to upgrade to newer release.
We have added "options libcfs libcfs_panic_on_lbug=1" to modprobe.conf so that the server kernel panics as soon as an LBUG happens. Is there some way to make the server die a few seconds after the LBUG instead? We are also puzzled by the messages that were lost when the LBUG happened.

On Mon, Nov 22, 2010 at 10:42 AM, Kevin Van Maren <Kevin.Van.Maren at oracle.com> wrote:
> Sure, but I think for engineering to make progress on this bug, they are going to want a crash dump. [...]
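A small sketch for checking that the option actually took effect once the module is loaded. The assumption is that the running value can be read as a single integer from some procfs or sysfs file; the exact path depends on the Lustre version and is not given in this thread, so it is passed in as a parameter rather than hard-coded. read_int_setting and panic_on_lbug_enabled are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a procfs/sysfs-style file.
 * Returns 0 on success and stores the value in *out; returns -1 on any error. */
static int read_int_setting(const char *path, long *out)
{
    assert(path != NULL && out != NULL);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    int matched = fscanf(fp, "%ld", out);
    fclose(fp);
    return matched == 1 ? 0 : -1;
}

/* Returns 1 if the setting is non-zero (enabled), 0 if it is zero (disabled),
 * and -1 if the file is missing or unreadable.
 * The caller supplies the path; no particular location is assumed here. */
static int panic_on_lbug_enabled(const char *path)
{
    long value;

    if (read_int_setting(path, &value) != 0)
        return -1;
    return value != 0;
}
```

Calling panic_on_lbug_enabled() with the right path for your installation after boot would show whether the modprobe.conf line was picked up; a return of -1 usually just means the path guess was wrong.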
Larry wrote:
> We add the "options libcfs libcfs_panic_on_lbug=1" in modprobe.conf to
> make the server kernel panic ASAP the LBUG happened. Is there some way
> to make the server dead a few seconds after the LBUG? We are also
> puzzled with the message lost during the LBUG happened.

The messages should have gone to the console just fine (hopefully you are logging a serial console). If you are talking about /var/log/messages, then yes, it will be missing the final output, because the messages don't have time to get written to disk on a kernel panic.

Kevin
Unfortunately we don't have a serial console right now; perhaps we can add one per node. Thanks a lot.

On Mon, Nov 22, 2010 at 12:18 PM, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:
> The messages should have gone to the console just fine (hopefully you are
> logging a serial console). [...]