We consistently see random occurrences of a client being kicked out, and while Lustre says it tries to reconnect, it almost never can without a reboot:

Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:226:ptlrpc_invalidate_import()) nobackup-MDT0000_UUID: rc = -110 waiting for callback (3 != 0)
Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) @@@ still on sending list req@000001015dd9ec00 x979024/t0 o101->nobackup-MDT0000_UUID@10.164.3.246@tcp:12/10 lens 448/1184 e 0 to 100 dl 1226700928 ref 1 fl Rpc:RES/0/0 rc -4/0
Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) Skipped 1 previous similar message
Nov 14 18:28:18 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection restored to service nobackup-MDT0000 using nid 10.164.3.246@tcp.
Nov 14 18:30:32 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_statfs operation failed with -107
Nov 14 18:30:32 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 14 18:30:32 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
Nov 14 18:30:32 nyx-login1 kernel: LustreError: 16523:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -5
Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@000001000990fe00 x983192/t0 o41->nobackup-MDT0000_UUID@10.164.3.246@tcp:12/10 lens 128/400 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -108

Is there any way to make Lustre more robust against these types of failures? According to the manual (and many times in practice, like rebooting an MDS) the filesystem will just block and come back. Here it almost never comes back; after a while it will say reconnected, but will fail again right away.

On the MDS I see:

Nov 14 18:30:20 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from 12345-141.212.31.43@tcp
Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req@00000103f84eae00 x983190/t0 o41-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1226705528 ref 1 fl Interpret:/0/0 rc -107/0
Nov 14 18:34:15 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am evicting it.

The MDS just keeps kicking the client out; /proc/fs/lustre/health_check on the client and on the servers reports healthy.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985
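For reference, "healthy" here means what the standard proc interface reports; the check is roughly the following, run on the client and on each server (paths are the stock 1.6 locations, so treat this as a sketch):

  cat /proc/fs/lustre/health_check   # expected to print "healthy"
  lctl dl                            # device list; each Lustre device should show UP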
Brock:

What is the client version? I am getting the same type of failures.

Also, check your network for any TX/RX packet drops (netstat -i).

I am wondering if you are having the same problem as us.

On Fri, Nov 14, 2008 at 6:37 PM, Brock Palen <brockp@umich.edu> wrote:
> We consistently see random occurrences of a client being kicked out,
> and while Lustre says it tries to reconnect, it almost never can
> without a reboot:
> [...]
> The MDS just keeps kicking the client out; /proc/fs/lustre/health_check
> on the client and on the servers reports healthy.
Running 1.6.5.1 on both server and client; the clients are RHEL4 patchless.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985

On Nov 16, 2008, at 8:26 AM, Mag Gam wrote:
> What is the client version? I am getting the same type of failures.
>
> Also, check your network for any TX/RX packet drops (netstat -i).
>
> I am wondering if you are having the same problem as us.
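For completeness, this is roughly how we check the versions on our end (the proc file is the usual 1.6 location; the rpm query only applies where Lustre was installed from packages):

  cat /proc/fs/lustre/version     # running Lustre version, on clients and servers
  rpm -qa | grep -i lustre        # installed package versions, RPM-based installs only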
What does netstat -i give you? Any RX or TX drops?

On Sun, Nov 16, 2008 at 11:09 AM, Brock Palen <brockp@umich.edu> wrote:
> Running 1.6.5.1 on both server and client; the clients are RHEL4 patchless.
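Concretely, something like the following on the client and on the MDS is what I have in mind (eth0 is only a placeholder for whichever interface carries the Lustre traffic):

  netstat -i             # RX-DRP/TX-DRP and RX-ERR/TX-ERR should stay at or near zero
  ip -s link show eth0   # per-interface packet, error and drop counters from iproute2
  # counters that keep climbing while the evictions happen point at the NIC,
  # driver or switch rather than at Lustre itself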
Brock Palen wrote:
> We consistently see random occurrences of a client being kicked out,
> and while Lustre says it tries to reconnect, it almost never can
> without a reboot:

Maybe you can check:
https://bugzilla.lustre.org/show_bug.cgi?id=15927

Regards!
--
Fan Yong
I see no errors. If that is the bug causing this, is the fix, until we upgrade to a newer Lustre, to set statahead_max=0 again? I saw the same behavior this morning on a compute node.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985

On Nov 16, 2008, at 10:49 PM, Yong Fan wrote:
> Maybe you can check:
> https://bugzilla.lustre.org/show_bug.cgi?id=15927
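For reference, the workaround I have in mind is the per-client llite tunable, roughly as follows (from memory; the exact directory name under llite depends on the filesystem and mount instance, hence the wildcard, and the setting does not survive a remount):

  # disable statahead on the client until we can upgrade
  for f in /proc/fs/lustre/llite/*/statahead_max; do
      echo 0 > "$f"
  done

  cat /proc/fs/lustre/llite/*/statahead_max   # confirm it took effect (should print 0)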
On Nov 18, 2008 12:14 -0500, Brock Palen wrote:
> If that is the bug causing this, is the fix, until we upgrade to a
> newer Lustre, to set statahead_max=0 again?

Yes, this is another statahead bug.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Thanks,

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985

On Nov 18, 2008, at 4:47 PM, Andreas Dilger wrote:
> On Nov 18, 2008 12:14 -0500, Brock Palen wrote:
>> If that is the bug causing this, is the fix, until we upgrade to a
>> newer Lustre, to set statahead_max=0 again?
>
> Yes, this is another statahead bug.