Brock Palen
2009-Jan-14 15:14 UTC
[Lustre-discuss] LBUG ASSERTION(lock->l_resource != NULL) failed
I am having servers LBUG on a regular basis. Clients are running 1.6.6 patchless on RHEL4; servers are running RHEL4 with the 1.6.5.1 RPMs from the download page. All connections are over Ethernet. Servers are x4600s.

The OSS that BUG'd has in its log:

Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
Jan 13 16:35:39 oss2 kernel: Lustre: 10243:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 10243
Jan 13 16:35:39 oss2 kernel: ldlm_cn_08 R running task 0 10243 1 10244 7776 (L-TLB)
Jan 13 16:35:39 oss2 kernel: 0000000000000000 ffffffffa0414629 00000103d83c7e00 0000000000000000
Jan 13 16:35:39 oss2 kernel: 00000101f8c88d40 ffffffffa021445e 00000103e315dd98 0000000000000001
Jan 13 16:35:39 oss2 kernel: 00000101f3993ea0 0000000000000000
Jan 13 16:35:39 oss2 kernel: Call Trace:<ffffffffa0414629>{:ptlrpc:ptlrpc_server_handle_request+2457}
Jan 13 16:35:39 oss2 kernel: <ffffffffa021445e>{:libcfs:lcw_update_time+30} <ffffffff80133855>{__wake_up_common+67}
Jan 13 16:35:39 oss2 kernel: <ffffffffa0416d05>{:ptlrpc:ptlrpc_main+3989} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
Jan 13 16:35:39 oss2 kernel: <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
Jan 13 16:35:39 oss2 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa0415d70>{:ptlrpc:ptlrpc_main+0}
Jan 13 16:35:39 oss2 kernel: <ffffffff80110ddb>{child_rip+0}
Jan 13 16:35:40 oss2 kernel: LustreError: dumping log to /tmp/lustre-log.1231882539.10243

At the same time a client (nyx346) lost contact with that OSS, and it has not been allowed to reconnect since.

Client /var/log/messages:

Jan 13 16:37:20 nyx346 kernel: Lustre: nobackup-OST000d-osc-000001022c2a7800: Connection to service nobackup-OST000d via nid 10.164.3.245@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 13 16:37:20 nyx346 kernel: Lustre: Skipped 6 previous similar messages
Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jan 13 16:37:20 nyx346 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.245@tcp. The ost_connect operation failed with -16
Jan 13 16:37:20 nyx346 kernel: LustreError: Skipped 10 previous similar messages
Jan 13 16:37:45 nyx346 kernel: Lustre: 3849:0:(import.c:410:import_select_connection()) nobackup-OST000d-osc-000001022c2a7800: tried all connections, increasing latency to 7s

Even now the server (OSS) is refusing connections to OST000d, with the message:

Lustre: 9631:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-OST000d: refuse reconnection from 145a1ec5-07ef-f7eb-0ca9-2a2b6503e0cd@10.164.1.90@tcp to 0x00000103d5ce7000; still busy with 2 active RPCs

If I reboot the OSS, the OSTs on it go through recovery like normal, and then the client is fine.

The network looks clean. I found one machine with lots of dropped packets between the servers, but that is not the client in question.

Thank you! If it happens again and I find any other data, I will let you know.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
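For anyone scanning a syslog for the same signature, here is a minimal Python sketch, not an official Lustre tool. It assumes the mail-wrapped log lines have been rejoined so each record sits on one line. The names `RECORD`, `find_lbugs`, and `decode_rc` are made up for illustration. The two return codes in the client log are Linux errno values: -11 is EAGAIN and -16 is EBUSY, which fits the server's "still busy with 2 active RPCs" refusal.

```python
import re

# Linux errno names for the two return codes seen in the client log.
# Illustrative table only; it is not a complete errno list.
ERRNO_NAMES = {11: "EAGAIN", 16: "EBUSY"}

# Matches a Lustre console record of the form
#   <date> <host> kernel: LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>
RECORD = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<level>Lustre(?:Error)?):\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)$"
)


def find_lbugs(lines):
    """Return (host, pid, assertion) for each thread that logged an LBUG."""
    assertions = {}  # (host, pid) -> most recent assertion text
    hits = []
    for raw in lines:
        m = RECORD.match(raw.strip())
        if not m:
            continue  # stack dump lines and other kernel output are skipped
        key = (m["host"], int(m["pid"]))
        msg = m["msg"]
        if msg.startswith("ASSERTION("):
            assertions[key] = msg
        elif msg == "LBUG":
            hits.append((key[0], key[1], assertions.get(key, "<no assertion seen>")))
    return hits


def decode_rc(rc):
    """Map a negative kernel return code to its Linux errno name, if listed."""
    return ERRNO_NAMES.get(-rc, f"errno {-rc}")
```

Feeding it the lines of /var/log/messages on each OSS would list which host and service thread hit the assertion, which is the server that needs the reboot.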
Cliff White
2009-Jan-15 00:27 UTC
[Lustre-discuss] LBUG ASSERTION(lock->l_resource != NULL) failed
Brock Palen wrote:
> I am having servers LBUG on a regular basis, Clients are running
> 1.6.6 patchless on RHEL4, servers are running RHEL4 with 1.6.5.1
> RPM's from the download page. All connection is over Ethernet,
> Servers are x4600's.

This looks like bug 16496, which is fixed in 1.6.6. You should upgrade your servers to 1.6.6.

cliffw
Brock Palen
2009-Jan-15 00:30 UTC
[Lustre-discuss] LBUG ASSERTION(lock->l_resource != NULL) failed
Gah! OK, no problem.

No risk of data loss, right? And is there any way to 'limp along' until an outage without rebooting the OSTs?

Thanks for the insight!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Jan 14, 2009, at 7:27 PM, Cliff White wrote:
> This looks like bug 16496, which is fixed in 1.6.6. You should upgrade
> your servers to 1.6.6
> cliffw
Cliff White
2009-Jan-15 03:08 UTC
[Lustre-discuss] LBUG ASSERTION(lock->l_resource != NULL) failed
Brock Palen wrote:
> Gah! Ok no problem.
>
> No risk of data loss right?

Umm... I can't say that. No idea really.

> And is there anyway to 'limp along' till an outage without rebooting OST's?

Nope. An LBUG always requires a reboot. We freeze the LBUG'd thread for debugging purposes. The frozen thread may stall other threads, which can eventually wedge the server. Best not to try to run with an LBUG.

However, 1.6.5.1 -> 1.6.6 is a minor bug fix upgrade, and you can run a mix of 1.6.5.1 and 1.6.6 quite well. So if you cannot take a big downtime, you could do a 'rolling upgrade':
- Make the MGS/MDS 1.6.6.
- Each time an OSS LBUGs, reboot, upgrade to 1.6.6, remount Lustre.

Installing the new RPMs should be quite quick. You don't have to change any configuration, just throw the new RPMs on there. Given a random distribution, eventually all your OSSs will be 1.6.6 :)

cliffw
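The bookkeeping for a rolling upgrade like the one described above can be sketched in a few lines of Python. This is only an illustration: the inventory mapping, the host names other than oss2, and the helper names `parse_version` and `still_exposed` are assumptions, not anything Lustre provides. It compares dotted version strings numerically, so 1.6.5.1 sorts before 1.6.6.

```python
def parse_version(text):
    """Turn '1.6.5.1' into (1, 6, 5, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Release that carries the fix, per the reply above.
FIXED_IN = parse_version("1.6.6")


def still_exposed(inventory):
    """Return the hosts whose version predates the release with the fix."""
    return sorted(
        host for host, version in inventory.items()
        if parse_version(version) < FIXED_IN
    )


# Hypothetical inventory: only oss2 appears in the thread, the rest are made up.
servers = {"mds1": "1.6.6", "oss1": "1.6.5.1", "oss2": "1.6.6", "oss3": "1.6.5.1"}
print(still_exposed(servers))
```

Each time an OSS is rebooted and upgraded, its entry changes to 1.6.6 and it drops off the list; the rolling upgrade is done when the list is empty.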