Chris Worley
2008-Aug-06 13:26 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
This seems to be caused by a specific file or directory. The files in this directory cycle through states: first a node gets an error trying to stat a file, then, moments later, it sees the file with the wrong size, then it sees it fine. For example (<dir> and <user> replace the actual directory and username), these three "ls" commands were run in quick succession. At first et[06-09,11,13,15] report an I/O error trying to "ls" the file, then et[07-09,13] see the file at 0 length, then they see it correctly (with et11 still showing an I/O error... which still doesn't seem to be going away):

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et07: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et07: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et11: ssh exited with exit code 2
et06: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et06: ssh exited with exit code 2
et09: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et09: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et15: ssh exited with exit code 2
et08: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et08: ssh exited with exit code 2
et13: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et13: ssh exited with exit code 2
----------------
et[01-05,14,16-18]
----------------
-rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et06: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh@ln: et06: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh@ln: et15: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh@ln: et11: ssh exited with exit code 2
----------------
et[07-09,13]
----------------
-rwxr-xr-x 1 <user> <user> 0 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh
----------------
et[01-05,14,16-18]
----------------
-rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et11: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et11: ssh exited with exit code 2
----------------
et[01-09,13-18]
----------------
-rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh

Looking at the directory, I still see errors:

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/ | dshbak -c
et09: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et09: ssh exited with exit code 2
et08: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et08: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et11: ssh exited with exit code 2
et06: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et06: ssh exited with exit code 2
et07: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et07: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et15: ssh exited with exit code 2
et13: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et13: ssh exited with exit code 2
----------------
et[01-05,14,16-18]
----------------
total 20
drwxrwx--- 4 <user> <user> 4096 Aug 5 09:47 Gaussian-E.01
-rwxr-x--- 1 <user> <user> 9920 Aug 5 09:25 run_benchmarks
-rwxr-xr-x 1 <user> <user>   69 Jul 25 06:07 zre.sh

Any ideas? The affected clients are running Lustre 1.6.5.1 on RHEL5 (2.6.18-92.1.6 kernel), with the OSSs running 1.6.4.3. I'm not seeing this on RHEL4 clients (kernel 2.6.9-67.0.22.ELsmp-lfs-1.6.5.1) also running 1.6.5.1.

Chris
Brian J. Murrell
2008-Aug-06 14:19 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 07:26 -0600, Chris Worley wrote:
> This seems to be caused by a specific file or directory. The files in
> this directory cycle through states: first a node gets an error trying
> to stat a file, then, moments later, it sees the file with the wrong
> size, then it sees it fine. For example (<dir> and <user> replace the
> actual directory and username), these three "ls" commands were run in
> quick succession. At first et[06-09,11,13,15] report an I/O error
> trying to "ls" the file, then et[07-09,13] see the file at 0 length,
> then they see it correctly (with et11 still showing an I/O error...
> which still doesn't seem to be going away):

Are there any Lustre messages in your syslog on the clients which are returning unexpected results?

b.
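A minimal way to collect those messages across the clients, sketched with the same pdsh node list used earlier in this thread (the /var/log/messages path is an assumption; dmesg would also work):

    # Pull recent Lustre kernel messages from each client and collapse
    # identical output with dshbak, as with the earlier ls runs.
    pdsh -w et[01-18] -x et[10,12] \
        "grep -E 'Lustre(Error)?:' /var/log/messages | tail -20" | dshbak -c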
Chris Worley
2008-Aug-06 14:24 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 8:19 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> Are there any Lustre messages in your syslog on the clients which are
> returning unexpected results?

Indeed. These clients seem to be timing out and getting evicted:

LustreError: 11-0: an error occurred while communicating with 36.102.29.1@o2ib. The ldlm_enqueue operation failed with -107
LustreError: Skipped 36 previous similar messages
Lustre: lfs-MDT0000-mdc-ffff81027cca8d20: Connection to service lfs-MDT0000 via nid 36.102.29.1@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Skipped 36 previous similar messages
LustreError: 167-0: This client was evicted by lfs-MDT0000; in progress operations using this service will fail.
LustreError: Skipped 36 previous similar messages
LustreError: 3143:0:(mdc_locks.c:586:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 3143:0:(mdc_locks.c:586:mdc_enqueue()) Skipped 1 previous similar message
LustreError: 3143:0:(file.c:2664:ll_inode_revalidate_fini()) failure -5 inode 66781185
LustreError: 3143:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81016f0f0988 x2285852/t0 o101->lfs-MDT0000_UUID@36.102.29.1@o2ib:12/10 lens 432/2048 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 3143:0:(client.c:716:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
LustreError: 3143:0:(file.c:2664:ll_inode_revalidate_fini()) failure -108 inode 66781185
LustreError: 3188:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8104bfc93238 x2285905/t0 o101->lfs-MDT0000_UUID@36.102.29.1@o2ib:12/10 lens 432/2048 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 3188:0:(client.c:716:ptlrpc_import_delay_req()) Skipped 1 previous similar message
LustreError: 3188:0:(mdc_locks.c:586:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 3188:0:(mdc_locks.c:586:mdc_enqueue()) Skipped 2 previous similar messages
LustreError: 3188:0:(file.c:2664:ll_inode_revalidate_fini()) failure -108 inode 66781185
LustreError: 3188:0:(file.c:2664:ll_inode_revalidate_fini()) Skipped 1 previous similar message
LustreError: 3190:0:(file.c:2664:ll_inode_revalidate_fini()) failure -108 inode 66781185
LustreError: 3190:0:(file.c:2664:ll_inode_revalidate_fini()) Skipped 1 previous similar message
Chris Worley
2008-Aug-06 14:31 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
Going back further in the logs, this message appears many times:

Lustre: 5594:0:(client.c:1587:ptlrpc_unregister_reply()) @@@ Unexpectedly long timeout rvcng=1 unlnk=1 req@ffff810270717b68 x1187993/t0 o400->MGS@36.102.29.1@o2ib:26/25 lens 128/256 e 1 to 40 dl 1217828500 ref 1 fl Rpc:EN/0/0 rc -4/0

On Wed, Aug 6, 2008 at 8:24 AM, Chris Worley <worleys at gmail.com> wrote:
> On Wed, Aug 6, 2008 at 8:19 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>>
>> Are there any Lustre messages in your syslog on the clients which are
>> returning unexpected results?
>
> Indeed. These clients seem to be timing out and getting evicted:
>
> LustreError: 11-0: an error occurred while communicating with
> 36.102.29.1@o2ib. The ldlm_enqueue operation failed with -107
> [...]
Brian J. Murrell
2008-Aug-06 15:15 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 08:24 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 8:19 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> >
> > Are there any Lustre messages in your syslog on the clients which are
> > returning unexpected results?
>
> Indeed. These clients seem to be timing out and getting evicted:
>
> LustreError: 11-0: an error occurred while communicating with
> 36.102.29.1@o2ib. The ldlm_enqueue operation failed with -107
> LustreError: Skipped 36 previous similar messages
> Lustre: lfs-MDT0000-mdc-ffff81027cca8d20: Connection to service
> lfs-MDT0000 via nid 36.102.29.1@o2ib was lost; in progress operations
> using this service will wait for recovery to complete.
> Lustre: Skipped 36 previous similar messages
> LustreError: 167-0: This client was evicted by lfs-MDT0000; in
> progress operations using this service will fail.

So, now what does the MDS serving lfs-MDT0000 say about this? Why did it evict? What version of Lustre is this? Perhaps you said so already and I have just forgotten.

b.
Chris Worley
2008-Aug-06 15:29 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 9:15 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> So, now what does the MDS serving lfs-MDT0000 say about this? Why did
> it evict? What version of Lustre is this? Perhaps you said so already
> and I have just forgotten.

1.6.5.1 clients w/ 1.6.4.3 OSSs.

The MDS is very verbose. I get these all the time, even prior to the error:

Lustre: lfs-OST0000: haven't heard from client 12f00621-096c-b331-8774-abfc72dfd822 (at 36.102.36.15@o2ib) in 92 seconds. I think it's dead, and I am evicting it.

Would these messages be correlated?

LustreError: 6820:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6820:0:(handler.c:1499:mds_handle()) Skipped 6 previous similar messages
LustreError: 6820:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@000001022ed4fa00 x2285850/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 6820:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 27 previous similar messages
Lustre: lfs-MDT0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 27 previous similar messages
Lustre: 5696:0:(ldlm_lib.c:519:target_handle_reconnect()) MGS: e04def6d-186d-2d01-1beb-2e00418cfaca reconnecting
Lustre: 5696:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 12 previous similar messages
LustreError: 6877:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6877:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@0000010247c30050 x2286894/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 6877:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 1 previous similar message
Lustre: lfs-MDT0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
LustreError: 6829:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6829:0:(handler.c:1499:mds_handle()) Skipped 1 previous similar message
LustreError: 6829:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@000001022aa7bc00 x2288557/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 5898:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@000001022f07da00 x2290923/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Lustre: lfs-OST0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: 6713:0:(ldlm_lib.c:519:target_handle_reconnect()) MGS: e04def6d-186d-2d01-1beb-2e00418cfaca reconnecting
Lustre: 6713:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 12 previous similar messages
Lustre: lfs-MDT0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 5 previous similar messages
LustreError: 6464:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@0000010224c18200 x2239379/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 6464:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 5 previous similar messages
LustreError: 7324:0:(handler.c:1499:mds_handle()) operation 37 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6846:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.6@o2ib
Lustre: lfs-OST0000: haven't heard from client f99ffcca-2168-b2ab-8648-e40f4725acd1 (at 36.102.36.9@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: lfs-OST0000: haven't heard from client df74b077-e833-5fce-2744-9c0fa51427c9 (at 36.102.36.6@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 23 previous similar messages
Lustre: 5697:0:(ldlm_lib.c:519:target_handle_reconnect()) MGS: e04def6d-186d-2d01-1beb-2e00418cfaca reconnecting
Lustre: 5697:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 12 previous similar messages
LustreError: 5860:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@0000010251a06400 x2284477/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 5860:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 54 previous similar messages

Chris
Chris Worley
2008-Aug-06 15:48 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
It looks like I can keep the error from recurring if I continuously "ls" the files:

watch -n 10 'pdsh -w et[01-18] -x et[10,12] "find /lfs/<user> -type f | xargs ls -l" >/dev/null'

Otherwise, it comes back again.
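If the watch has to survive logouts, a hypothetical cron entry on the ln admin node could do the same thing (the 5-minute interval is a guess, not something suggested in this thread):

    # m h dom mon dow   command
    */5 * * * * pdsh -w et[01-18] -x et[10,12] "find /lfs/<user> -type f | xargs ls -l" >/dev/null 2>&1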
Brian J. Murrell
2008-Aug-06 16:17 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 09:29 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 9:15 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> >
> > So, now what does the MDS serving lfs-MDT0000 say about this? Why did
> > it evict? What version of Lustre is this? Perhaps you said so already
> > and I have just forgotten.
>
> 1.6.5.1 clients w/ 1.6.4.3 OSSs.
>
> The MDS is very verbose. I get these all the time, even prior to the error:
>
> Lustre: lfs-OST0000: haven't heard from client
> 12f00621-096c-b331-8774-abfc72dfd822 (at 36.102.36.15@o2ib) in 92
> seconds. I think it's dead, and I am evicting it.

Yup. If you can correlate those kinds of messages (they have the client IP address in them) with the errors on the client, you have your eviction events. I notice that you are getting messages out of dmesg rather than syslog; syslog makes correlation easier and more definite because of the timestamps.

But this kind of eviction simply means the clients are unresponsive from the point of view of the MDS. They are neither making filesystem RPCs nor sending "ping" keepalives, so the MDS assumes they have died and evicts them to reclaim the locks they may be holding, so that an apparently dead client does not hold up other, living clients.

So you need to investigate why the clients are dying, or appear dead (i.e. going silent), to the MDS.

b.
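One way to do that correlation, sketched with /var/log/messages assumed as the syslog target on both sides (adjust for the real logging setup):

    # On the MDS: when did it decide each client was dead?
    grep "evicting" /var/log/messages
    # On a suspect client (et11 here, purely as an example): what was it
    # logging around those same timestamps?
    ssh et11 "grep -E 'evicted|Connection' /var/log/messages"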
Chris Worley
2008-Aug-06 16:41 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 10:17 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> But this kind of eviction simply means the clients are unresponsive
> from the point of view of the MDS. They are neither making filesystem
> RPCs nor sending "ping" keepalives, so the MDS assumes they have died
> and evicts them to reclaim the locks they may be holding, so that an
> apparently dead client does not hold up other, living clients.
>
> So you need to investigate why the clients are dying, or appear dead
> (i.e. going silent), to the MDS.

Is there anything in /proc or /sys I can look at to see what "keepalive" parameters are set up?

The systems aren't dying. I need the least obtrusive way to force the clients to keep pinging, or to tell the MDS to give them a longer timeout.

I don't see why this only affects the RHEL5 clients. Maybe that's a hint.

Thanks,

Chris
Brian J. Murrell
2008-Aug-06 16:45 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 10:41 -0600, Chris Worley wrote:
>
> Is there anything in /proc or /sys I can look at to see what
> "keepalive" parameters are set up?

All timeouts are based on the obd_timeout in /proc/sys/lustre/timeout, which MUST be the same on all nodes.

> The systems aren't dying.

They are failing to communicate with the MDS for some reason. Network problems perhaps? You could try enabling +rpctrace debug and inspecting the debug file for RPCs to see if the client is indeed sending something (even if it's only a ping) at regular intervals.

b.
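A quick way to compare that value across the clients, reusing the node list from earlier in the thread (a sketch; run the equivalent cat on the MDS and OSSs as well):

    # obd_timeout must match everywhere; any outlier shows up as its own
    # dshbak group.
    pdsh -w et[01-18] -x et[10,12] cat /proc/sys/lustre/timeout | dshbak -c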
Chris Worley
2008-Aug-06 17:08 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 10:45 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 10:41 -0600, Chris Worley wrote:
> >
> > Is there anything in /proc or /sys I can look at to see what
> > "keepalive" parameters are set up?
>
> All timeouts are based on the obd_timeout in /proc/sys/lustre/timeout,
> which MUST be the same on all nodes.

Would you suggest I increase or decrease this value? Is there a way to inhibit the eviction, or is that necessary to keep really dead clients from locking out files?

> > The systems aren't dying.
>
> They are failing to communicate with the MDS for some reason. Network
> problems perhaps? You could try enabling +rpctrace debug and inspecting
> the debug file for RPCs to see if the client is indeed sending something
> (even if it's only a ping) at regular intervals.

All the systems (RHEL4 and RHEL5 clients, Lustre servers) are on the same Ethernet and IB switches. There were no issues before the 1.6.5.1 upgrade on the RHEL5 nodes.

Would a normal ping do it? I can jury-rig all the RHEL5 nodes to ping the MDS.

Chris
Brian J. Murrell
2008-Aug-06 17:26 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 11:08 -0600, Chris Worley wrote:
>
> Would you suggest I increase or decrease this value?

Neither, just make sure all nodes have the same value. I haven't seen anything that indicates it needs to be changed.

> Is there a way to inhibit the eviction,

No.

> or is that necessary to keep
> really dead clients from locking out files?

Indeed.

> All the systems (RHEL4 and RHEL5 clients, Lustre servers) are on the
> same Ethernet and IB switches. There were no issues before the 1.6.5.1
> upgrade on the RHEL5 nodes.

+rpctrace debug is probably the way to go to see what the clients are(n't) doing in terms of keeping the MDS aware of their existence.

> Would a normal ping do it?

No.

> I can jury-rig all the RHEL5 nodes to ping the MDS.

Even if that would work, it is a hack that would simply mask what is likely another, more fundamental problem. You need to get to the root of the real problem.

b.
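For reference, enabling and collecting that trace on a suspect client might look like this (a sketch for a 1.6.x client; the output path is arbitrary):

    # Add rpctrace to the debug mask, let the problem recur, then dump the
    # kernel debug buffer to a file for inspection.
    echo +rpctrace > /proc/sys/lnet/debug
    # ... wait for an eviction window to pass ...
    lctl dk /tmp/lustre-rpctrace.log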
Chris Worley
2008-Aug-06 18:11 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 11:26 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> +rpctrace debug is probably the way to go to see what the clients
> are(n't) doing in terms of keeping the MDS aware of their existence.

The log from the client perspective starts with a lot of packet mismatches:

00000100:00020000:3:1217629906.902955:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217633409.730495:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000400:00000100:3:1217633608.884380:0:5588:0:(lib-move.c:1647:lnet_parse_put()) Dropping PUT from 12345-36.102.29.1@o2ib portal 25 match 157556 offset 0 length 128: 2
00000100:00020000:3:1217633710.886759:0:5587:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217633852.887489:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217634099.512727:0:5591:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128

...repeated many times. Then it repeats this many times:

00000100:00080000:6:1217651769.943608:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:6:1217651769.943619:0:5596:0:(import.c:465:ptlrpc_connect_import()) ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:6:1217651769.943623:0:5596:0:(import.c:342:import_select_connection()) MGC36.102.29.1@o2ib: connect to NID 36.102.29.1@o2ib last attempt 4342590785
00000100:00080000:6:1217651769.943628:0:5596:0:(import.c:423:import_select_connection()) MGC36.102.29.1@o2ib: import ffff810275e06188 using connection MGC36.102.29.1@o2ib_0/36.102.29.1@o2ib
00000100:00080000:6:1217651769.943630:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (743) for MGS
00000100:00080000:6:1217651820.942855:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:6:1217651820.942858:0:5596:0:(import.c:465:ptlrpc_connect_import()) ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:6:1217651820.942862:0:5596:0:(import.c:342:import_select_connection()) MGC36.102.29.1@o2ib: connect to NID 36.102.29.1@o2ib last attempt 4342641786
00000100:00080000:6:1217651820.942866:0:5596:0:(import.c:423:import_select_connection()) MGC36.102.29.1@o2ib: import ffff810275e06188 using connection MGC36.102.29.1@o2ib_0/36.102.29.1@o2ib
00000100:00080000:6:1217651820.942868:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (744) for MGS

...repeat...

with one of these embedded:

00000100:00080000:6:1217651860.941624:0:5596:0:(pinger.c:144:ptlrpc_pinger_main()) not pinging MGS (in recovery: EVICTED or recovery disabled: 0/0)

...ending with this:

00000100:00080000:6:1217652002.937993:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (748) for MGS
00000100:00080000:3:1217663095.721928:0:9878:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:3:1217663095.721931:0:9878:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:3:1217663095.722708:0:9878:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217663095.722710:0:9878:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:3:1217663095.722712:0:9878:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:3:1217663095.722714:0:9878:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:3:1217663095.722715:0:9878:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:3:1217663095.722719:0:9878:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.

...then followed by length mismatches interspersed with "recovery":

00000100:00020000:3:1217663393.547988:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217663677.841799:0:5591:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217663714.726333:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217663819.561596:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217664140.732610:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217664244.038963:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217664556.735280:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00080000:3:1217665408.738550:0:9927:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:3:1217665408.738554:0:9927:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:3:1217665408.739328:0:9927:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217665408.739330:0:9927:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:3:1217665408.739333:0:9927:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:3:1217665408.739335:0:9927:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:3:1217665408.739336:0:9927:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:3:1217665408.739340:0:9927:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
00000100:00080000:3:1217665408.739344:0:9927:0:(recover.c:175:ptlrpc_wake_delayed()) @@@ waking (set 0000000000000000): req@ffff8101d3aa78e8 x328561/t0 o501->MGS@36.102.29.1@o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00020000:3:1217665653.806267:0:5589:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217665692.740540:0:5589:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:6:1217666544.737120:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:6:1217666647.982634:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:6:1217667112.732267:0:5588:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00080000:3:1217667254.732738:0:9966:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:3:1217667254.732742:0:9966:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:3:1217667254.733543:0:9966:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217667254.733546:0:9966:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:3:1217667254.733549:0:9966:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:3:1217667254.733552:0:9966:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:3:1217667254.733553:0:9966:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:3:1217667254.733556:0:9966:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
00000100:00080000:3:1217667254.733559:0:9966:0:(recover.c:175:ptlrpc_wake_delayed()) @@@ waking (set 0000000000000000): req@ffff8102341658e8 x338536/t0 o501->MGS@36.102.29.1@o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00020000:6:1217667254.733672:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:6:1217667502.281243:0:5589:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217668248.722633:0:5587:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128

...lots of these mismatches, followed by what look like more recoveries:

00000100:02000000:0:1217678411.535026:0:12692:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
00000100:00020000:0:1217678425.863040:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00080000:0:1217678462.534057:0:12693:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:0:1217678462.534061:0:12693:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:0:1217678462.534192:0:12693:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:0:1217678462.534194:0:12693:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:0:1217678462.534195:0:12693:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:0:1217678462.534196:0:12693:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:0:1217678462.534197:0:12693:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import

...repeated many, many, many times... then:

00000100:00080000:4:1217780793.292959:0:16703:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:4:1217780793.292960:0:16703:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:4:1217780793.292961:0:16703:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:4:1217780793.292962:0:16703:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
10000000:01000000:4:1217780806.980814:0:6127:0:(mgc_request.c:289:mgc_requeue_thread()) updating log lfs-client
10000000:01000000:4:1217780806.980817:0:6127:0:(mgc_request.c:1092:mgc_process_log()) Process log lfs-client:ffff81027cca8d20 from 530
10000000:01000000:4:1217780806.980819:0:6127:0:(mgc_request.c:627:mgc_enqueue()) Enqueue for lfs-client (res 0x73666c)
00000100:00020000:0:1217780806.980984:0:5588:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
10000000:01000000:4:1217780844.291599:0:6127:0:(mgc_request.c:1111:mgc_process_log()) Can't get cfg lock: -5
00000100:00080000:0:1217780844.291621:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:4:1217780844.291623:0:6127:0:(client.c:1801:ptlrpc_queue_wait()) @@@ "ll_cfg_requeue" waiting for recovery: (FULL != DISCONN) req@ffff810238534098 x935044/t0 o501->MGS@36.102.29.1@o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1217780844.291625:0:5596:0:(import.c:465:ptlrpc_connect_import()) ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:0:1217780844.291630:0:5596:0:(import.c:342:import_select_connection()) MGC36.102.29.1@o2ib: connect to NID 36.102.29.1@o2ib last attempt 4471669871
10000000:01000000:4:1217780844.291636:0:5660:0:(mgc_request.c:565:mgc_blocking_ast()) Lock res 0x73666c (lfs)
00000100:00080000:0:1217780844.291637:0:5596:0:(import.c:423:import_select_connection()) MGC36.102.29.1@o2ib: import ffff810275e06188 using connection MGC36.102.29.1@o2ib_0/36.102.29.1@o2ib
10000000:01000000:4:1217780844.291638:0:5660:0:(mgc_request.c:590:mgc_blocking_ast()) log lfs-client: original grant failed, will requeue later
00000100:00080000:0:1217780844.291639:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (3480) for MGS
00000100:00080000:4:1217780844.291870:0:5595:0:(import.c:709:ptlrpc_connect_interpret()) reconnected to MGS@36.102.29.1@o2ib after partition
00000100:00080000:4:1217780844.291873:0:5595:0:(import.c:714:ptlrpc_connect_interpret()) MGC36.102.29.1@o2ib: reconnected but import is invalid; marking evicted
00000100:00080000:4:1217780844.291874:0:5595:0:(import.c:715:ptlrpc_connect_interpret()) ffff810275e06188 MGS: changing import state from CONNECTING to EVICTED
00000100:00080000:4:1217780844.291877:0:5595:0:(import.c:1053:ptlrpc_import_recovery_state_machine()) evicted from MGS@36.102.29.1@o2ib; invalidating
10000000:01000000:4:1217780844.291975:0:5595:0:(mgc_request.c:880:mgc_import_event()) import event 0x808005
00000100:00080000:0:1217780844.291996:0:16704:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:0:1217780844.292000:0:16704:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:0:1217780844.292132:0:16704:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:0:1217780844.292134:0:16704:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:0:1217780844.292135:0:16704:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL

Does this give any hints? There's a lot more log to go.

Thanks,

Chris
Brian J. Murrell
2008-Aug-06 19:25 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 12:11 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 11:26 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> >
> > +rpctrace debug is probably the way to go to see what the clients
> > are(n't) doing in terms of keeping the MDS aware of their existence.
>
> The log from the client perspective starts with a lot of packet mismatches:
>
> 00000100:00020000:3:1217629906.902955:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
> 00000100:00020000:3:1217633409.730495:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128

Given the amount of debug provided, this looks like it could be bug 16534, which is on my plate to test the solution for and land.

b.
Chris Worley
2008-Aug-06 19:36 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 1:25 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> Given the amount of debug provided, this looks like it could be bug
> 16534, which is on my plate to test the solution for and land.

Would you like me to add my complete RPC log to that bug report? What distro/kernel are they running?

Note that I didn't have this error using RHEL4/5 with 1.6.4.2 clients and 1.6.4.3 servers.

This problem doesn't start immediately after a reboot. I'm guessing it's correlated with somebody's job, and they haven't complained... so I don't know who's doing what to put the client in this state.

Chris
Brian J. Murrell
2008-Aug-06 19:47 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 13:36 -0600, Chris Worley wrote:
> Note that I didn't have this error using RHEL4/5 with 1.6.4.2 clients
> and 1.6.4.3 servers.

Indeed. IIUC, this is something that was introduced with 1.6.5.

b.
Chris Worley
2008-Aug-06 21:39 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 1:47 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 13:36 -0600, Chris Worley wrote:
> > Note that I didn't have this error using RHEL4/5 with 1.6.4.2 clients
> > and 1.6.4.3 servers.
>
> Indeed. IIUC, this is something that was introduced with 1.6.5.

Are there any work-arounds?

Chris
Chris Worley
2008-Aug-15 06:00 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 1:25 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 12:11 -0600, Chris Worley wrote:
> > On Wed, Aug 6, 2008 at 11:26 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> > >
> > > +rpctrace debug is probably the way to go to see what the clients
> > > are(n't) doing in terms of keeping the MDS aware of their existence.
> >
> > The log from the client perspective starts with a lot of packet mismatches:
> >
> > 00000100:00020000:3:1217629906.902955:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
> > 00000100:00020000:3:1217633409.730495:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
>
> Given the amount of debug provided, this looks like it could be bug
> 16534, which is on my plate to test the solution for and land.

In adding info to bugzilla, you clarified that this is actually bug #16237.

This bug is "fixed" and in Lustre 1.6.6... but 1.6.6 isn't released.

Is there a work-around for the 1.6.5.1 client (short of rebooting)?

Thanks,

Chris
Andreas Dilger
2008-Aug-18 09:23 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Aug 15, 2008 00:00 -0600, Chris Worley wrote:
> In adding info to bugzilla, you clarified that this is actually bug #16237.
>
> This bug is "fixed" and in Lustre 1.6.6... but 1.6.6 isn't released.
>
> Is there a work-around for the 1.6.5.1 client (short of rebooting)?

If you are up to rebuilding the 1.6.5.1 client, you can likely fix it with a small change in lustre_start_mgc() (untested):

-        data->ocd_connect_flags = OBD_CONNECT_VERSION | OBD_CONNECT_AT;
+        data->ocd_connect_flags = OBD_CONNECT_VERSION;

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
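For anyone attempting that rebuild, a rough sketch of the steps (the file location, lustre/obdclass/obd_mount.c, and the configure arguments are assumptions about a typical 1.6.x client build, not instructions from this thread):

    # Edit lustre_start_mgc() (expected in lustre/obdclass/obd_mount.c) to
    # drop OBD_CONNECT_AT as shown above, then rebuild the client packages.
    cd lustre-1.6.5.1
    ./configure --with-linux=/usr/src/kernels/<kernel-version>
    make rpms    # install the rebuilt client RPMs and remount the filesystem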