Brock Palen
2008-Jul-21 15:43 UTC
[Lustre-discuss] Lustre locking up on login/interactive nodes
Every so often Lustre locks up. It will recover eventually. The process
shows itself in 'D' (uninterruptible I/O wait); in this case it was 'ar'
making an archive.

Dmesg then shows:

Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@0000010189e2f400 x912452/t0 o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0
LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101ed528a00 x912464/t0 o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages
Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184@tcp.
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages

Are there special options that should be set on interactive/login nodes?
I remember something about how much memory should be available on login
vs. batch nodes, but I don't know how to change that; I just assumed
Lustre would use what is there. The login nodes have 8 GB.

__________________________________________________
www.palen.serveftp.net
Center for Advanced Computing
http://cac.engin.umich.edu
brockp@umich.edu
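For reference, the client's own view of the MDT connection and of its lock
cache can be checked directly on the login node. This is only a rough
sketch, assuming a Lustre 1.6-era client; the target name nobackup-MDT0000
is taken from the log above, and on versions without lctl get_param the
same values live under /proc/fs/lustre:

  # State of the MDC import as this client sees it
  cat /proc/fs/lustre/mdc/nobackup-MDT0000-mdc-*/import
  # Number of DLM locks this client is caching, per namespace
  lctl get_param ldlm.namespaces.*.lru_size
  # Plain memory pressure on the node itself
  free -m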
Brian J. Murrell
2008-Jul-21 15:51 UTC
[Lustre-discuss] Lustre locking up on login/interactive nodes
On Mon, 2008-07-21 at 11:43 -0400, Brock Palen wrote:
> Every so often Lustre locks up. It will recover eventually. The process
> shows itself in 'D' (uninterruptible I/O wait); in this case it was 'ar'
> making an archive.
>
> Dmesg then shows:

Syslog is usually a better place to get messages from, as it gives some
context as to the time of the messages.

> Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
> LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
> LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@0000010189e2f400 x912452/t0 o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0
> LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
> LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101ed528a00 x912464/t0 o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
> LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
> LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108
> LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages
> Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184@tcp.
> LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
> LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
> LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116
> LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages

This looks like a pretty standard eviction. Probably the most
interesting information is on the node that did the evicting. If it
doesn't contain much other than a "have not heard from", then you have a
node that is either disappearing from the network or getting wedged
enough to stop sending pings (or any other traffic in lieu of them).

b.
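To see what the evictor recorded, the place to look is the MDS's syslog
around the same time as the client messages. A sketch only, assuming the
MDS logs kernel messages to /var/log/messages (paths and exact wording
vary with distribution and Lustre version):

  # On the MDS (141.212.30.184), around the time of the client eviction:
  grep -i evict /var/log/messages
  # Evictor-side lines usually read something like "lock callback timer
  # expired: evicting client ..." or "haven't heard from client ... in
  # ... seconds. I think it's dead, and I am evicting it."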
Brock Palen
2008-Jul-21 16:04 UTC
[Lustre-discuss] Lustre locking up on login/interactive nodes
On Jul 21, 2008, at 11:51 AM, Brian J. Murrell wrote:

> On Mon, 2008-07-21 at 11:43 -0400, Brock Palen wrote:
>> Every so often Lustre locks up. It will recover eventually. The process
>> shows itself in 'D' (uninterruptible I/O wait); in this case it was 'ar'
>> making an archive.
>>
>> Dmesg then shows:
>
> Syslog is usually a better place to get messages from, as it gives some
> context as to the time of the messages.

OK, I will keep that in mind. It looks the same, though. It's odd: if I
log in to the same machine I can move to that directory, list the files,
read files on those OSTs, and so on. I also notice this was an eviction
by the MDS, yet I see no lost network connections or network errors.
Strange; not good, not good at all.

The syslog data is the same; it is below.

Brock

Jul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@0000010189e2f400 x912452/t0 o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101ed528a00 x912464/t0 o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages
Jul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184@tcp.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages
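Since the client itself sees no obvious network errors, it may be worth
checking LNET connectivity from the login node to the MDS the next time
things look wedged. A rough sketch; the NID is taken from the log above:

  # LNET-level ping from the client to the MDS NID
  lctl ping 141.212.30.184@tcp
  # LNET's view of its peers (credits, queued messages)
  cat /proc/sys/lnet/peers
  # Plain TCP-level sanity check from the client
  ping -c 3 141.212.30.184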
Brian J. Murrell
2008-Jul-22 13:18 UTC
[Lustre-discuss] Lustre locking up on login/interactive nodes
On Mon, 2008-07-21 at 12:04 -0400, Brock Palen wrote:
>
> OK, I will keep that in mind. It looks the same, though.

Indeed, in most cases it is the same, with the one exception that from
syslog we get time context.

> It's odd: if I log in to the same machine I can move to that directory,
> list the files, read files on those OSTs, and so on. I also notice this
> was an eviction by the MDS, yet I see no lost network connections or
> network errors. Strange; not good, not good at all.
>
> The syslog data is the same; it is below.

Right. There is nothing terribly useful in it, though. It's just the
notification of an eviction. The real question is why the client got
evicted. The evictor will know more about that than the evictee. If the
evictor doesn't log anything more than a "didn't hear from client ... in
... seconds, evicting", then something is preventing the client from
sending traffic to the evictor.

b.
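If the MDS does turn out to log only the "didn't hear from client"
message, a few client-side checks can help narrow down why the pings
stopped. A hedged sketch, assuming a 1.6-era setup where the pinger sends
a ping roughly every obd_timeout/4 seconds:

  # Current obd timeout on the client (compare with the value on the MDS)
  cat /proc/sys/lustre/timeout
  # Confirm client and MDS clocks agree, so syslog timestamps can be correlated
  ntpq -p
  # Right after the next eviction, dump the Lustre debug buffer for later analysis
  lctl dk /tmp/lustre-debug.$(hostname).log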