Erich Focht
2010-Apr-19 17:14 UTC
[Lustre-discuss] LBUG: ost_rw_hpreq_check() ASSERTION(nb != NULL) failed
Hi,

we saw this LBUG three times within the past week, and are puzzled about what is going on and why there is no Bugzilla entry for it.

What happens is that on an OSS a request (it must be a read or a write) expects, according to the content of the ioobj structure, to find an array of 22 struct niobuf_remote entries (niocount), but finds only one. The request is obviously corrupted.

We enabled checksumming where we could, but unfortunately the request headers don't seem to be covered by any checksum check (the reply path possibly is). In any case, we see no corruption or checksum failures for bulk data transfers, so it is improbable that corruption on the wire would produce the same "size 16 too small (required X)" message three times in a row (with X being 352, 432 and 4016 in our failures).

Has anybody seen this? Any ideas or hints?

We're using Lustre 1.6.7.2 on both the server and the client side.

The LBUG traceback is:

  LustreError: 12946:0:(pack_generic.c:566:lustre_msg_buf_v2()) msg ffff8101d0c4aad0 buffer[3] size 16 too small (required 352)
  LustreError: 12946:0:(ost_handler.c:1594:ost_rw_hpreq_check()) ASSERTION(nb != NULL) failed
  LustreError: 12946:0:(ost_handler.c:1594:ost_rw_hpreq_check()) LBUG
  Lustre: 12946:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 12946
  ll_ost_io_135 R running task 0 12946 1 12947 12945 (L-TLB)
  ffffffff88574438 ffffffff88abb2e0 000000000000063a ffff8101d0c4ac28
  ffffffff88abb2e0 ffffffff88571c20 0000000000000000 0000000000000000
  ffffffff88574a35 ffffffff88abc7e2 0000000000000000 0000000000000016
  Call Trace:
  [<ffffffff88571c20>] :libcfs:tracefile_init+0x0/0x110
  [<ffffffff88aac641>] :ost:ost_rw_hpreq_check+0x1b1/0x290
  [<ffffffff88ab9ebf>] :ost:ost_hpreq_handler+0x50f/0x7c0
  [<ffffffff886d243b>] :ptlrpc:ptlrpc_main+0xebb/0x13e0
  [<ffffffff8008a4aa>] default_wake_function+0x0/0xe
  [<ffffffff800b4a6d>] audit_syscall_exit+0x327/0x342
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff886d1580>] :ptlrpc:ptlrpc_main+0x0/0x13e0
  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Regards, Erich
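For anyone comparing their own logs against this report, the sizes in the message can be turned back into entry counts. The sketch below is only an illustration, not Lustre code: it assumes that one struct niobuf_remote occupies 16 bytes (a 64-bit offset plus two 32-bit fields), which matches the "finds only one" / "size 16" pairing and the 22 entries / 352 bytes pairing reported above. The names NIOBUF_REMOTE, TOO_SMALL and implied_counts are made up for this example.

```python
import re
import struct

# Assumption: struct niobuf_remote packs as u64 offset, u32 len, u32 flags,
# i.e. 16 bytes per entry. This is a model, not taken from the Lustre headers.
NIOBUF_REMOTE = struct.Struct("<QII")

# Matches the "size N too small (required M)" message format from the log.
TOO_SMALL = re.compile(r"buffer\[(\d+)\] size (\d+) too small \(required (\d+)\)")


def implied_counts(log_line):
    """Return (entries_present, entries_expected) for one log line, or None."""
    m = TOO_SMALL.search(log_line)
    if m is None:
        return None
    actual, required = int(m.group(2)), int(m.group(3))
    return actual // NIOBUF_REMOTE.size, required // NIOBUF_REMOTE.size


line = ("LustreError: 12946:0:(pack_generic.c:566:lustre_msg_buf_v2()) msg "
        "ffff8101d0c4aad0 buffer[3] size 16 too small (required 352)")
print(implied_counts(line))

# Entry counts implied by the three "required" sizes seen in the failures.
for required in (352, 432, 4016):
    print(required, required // NIOBUF_REMOTE.size)
```

Under that assumption, the three failures would correspond to requests announcing 22, 27 and 251 entries while carrying only one, which suggests the niocount and the buffer length disagree in a consistent way rather than randomly.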
Bernd Schubert
2010-Apr-19 18:09 UTC
[Lustre-discuss] LBUG: ost_rw_hpreq_check() ASSERTION(nb != NULL) failed
Hello Erich,

check out my bug report:

https://bugzilla.lustre.org/show_bug.cgi?id=19992

It was closed as a duplicate of bug 16129, which is probably not correct: 16129 is the root cause, but not the solution.

As we never observed it with 1.6.7.2, I didn't object when bug 19992 was closed. Since you can now confirm that it also happens with 1.6.7.2, please reopen that bug.

Thanks,
Bernd

--
Bernd Schubert
DataDirect Networks
Erich Focht
2010-Apr-20 07:51 UTC
[Lustre-discuss] LBUG: ost_rw_hpreq_check() ASSERTION(nb != NULL) failed
Hi Bernd,

thanks, I reopened your bug 19992. I wonder why I couldn't find it in Bugzilla...

Regards,
Erich