Chris Worley
2008-Aug-06 13:26 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
This seems to be caused by a specific file or directory. The files in this directory cycle through states: first a node gets an error trying to stat a file, then, moments later, it sees the file with the wrong size, then it sees it fine. For example (<dir> and <user> replace the actual directory and username), these three "ls" commands were run in quick succession. At first et[06-09,11,13,15] report an I/O error trying to "ls" the file, then et[07-09,13] see the file at 0 length, then they see it correctly (with et11 still showing an I/O error... which still doesn't seem to be going away):

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et07: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et07: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et11: ssh exited with exit code 2
et06: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et06: ssh exited with exit code 2
et09: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et09: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et15: ssh exited with exit code 2
et08: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et08: ssh exited with exit code 2
et13: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et13: ssh exited with exit code 2
----------------
et[01-05,14,16-18]
----------------
-rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et06: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh@ln: et06: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh@ln: et15: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh@ln: et11: ssh exited with exit code 2
----------------
et[07-09,13]
----------------
-rwxr-xr-x 1 <user> <user> 0 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh
----------------
et[01-05,14,16-18]
----------------
-rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et11: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh@ln: et11: ssh exited with exit code 2
----------------
et[01-09,13-18]
----------------
-rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh

Looking at the directory, I still see errors:

[root@ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/ | dshbak -c
et09: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et09: ssh exited with exit code 2
et08: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et08: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et11: ssh exited with exit code 2
et06: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et06: ssh exited with exit code 2
et07: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et07: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et15: ssh exited with exit code 2
et13: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh@ln: et13: ssh exited with exit code 2
----------------
et[01-05,14,16-18]
----------------
total 20
drwxrwx--- 4 <user> <user> 4096 Aug 5 09:47 Gaussian-E.01
-rwxr-x--- 1 <user> <user> 9920 Aug 5 09:25 run_benchmarks
-rwxr-xr-x 1 <user> <user>   69 Jul 25 06:07 zre.sh

Any ideas? The affected clients are running Lustre 1.6.5.1 on RHEL5 (2.6.18-92.1.6 kernel), with the OSSs running 1.6.4.3. I'm not seeing this on RHEL4 clients (kernel 2.6.9-67.0.22.ELsmp-lfs-1.6.5.1) also running 1.6.5.1.

Chris
Brian J. Murrell
2008-Aug-06 14:19 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 07:26 -0600, Chris Worley wrote:
> This seems to be caused by a specific file or directory. The files in
> this directory cycle through states: first a node gets an error trying
> to stat a file, then, moments later, it sees the file with the wrong
> size, then it sees it fine. For example (<dir> and <user> replace the
> actual directory and username), these three "ls" commands were run in
> quick succession. At first et[06-09,11,13,15] report an I/O error
> trying to "ls" the file, then et[07-09,13] see the file at 0 length,
> then they see it correctly (with et11 still showing an I/O error...
> which still doesn't seem to be going away):

Are there any Lustre messages in your syslog on the clients which are returning unexpected results?

b.
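A minimal way to collect those messages across the clients, sketched with the same pdsh node list used earlier in this thread (the /var/log/messages path is an assumption; dmesg would also work):

    # Pull recent Lustre kernel messages from each client and collapse
    # identical output with dshbak, as with the earlier ls runs.
    pdsh -w et[01-18] -x et[10,12] \
        "grep -E 'Lustre(Error)?:' /var/log/messages | tail -20" | dshbak -c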
Chris Worley
2008-Aug-06 14:24 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 8:19 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> Are there any Lustre messages in your syslog on the clients which are
> returning unexpected results?

Indeed. These clients seem to be timing out and getting evicted:

LustreError: 11-0: an error occurred while communicating with 36.102.29.1@o2ib. The ldlm_enqueue operation failed with -107
LustreError: Skipped 36 previous similar messages
Lustre: lfs-MDT0000-mdc-ffff81027cca8d20: Connection to service lfs-MDT0000 via nid 36.102.29.1@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Skipped 36 previous similar messages
LustreError: 167-0: This client was evicted by lfs-MDT0000; in progress operations using this service will fail.
LustreError: Skipped 36 previous similar messages
LustreError: 3143:0:(mdc_locks.c:586:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 3143:0:(mdc_locks.c:586:mdc_enqueue()) Skipped 1 previous similar message
LustreError: 3143:0:(file.c:2664:ll_inode_revalidate_fini()) failure -5 inode 66781185
LustreError: 3143:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81016f0f0988 x2285852/t0 o101->lfs-MDT0000_UUID@36.102.29.1@o2ib:12/10 lens 432/2048 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 3143:0:(client.c:716:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
LustreError: 3143:0:(file.c:2664:ll_inode_revalidate_fini()) failure -108 inode 66781185
LustreError: 3188:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8104bfc93238 x2285905/t0 o101->lfs-MDT0000_UUID@36.102.29.1@o2ib:12/10 lens 432/2048 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 3188:0:(client.c:716:ptlrpc_import_delay_req()) Skipped 1 previous similar message
LustreError: 3188:0:(mdc_locks.c:586:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 3188:0:(mdc_locks.c:586:mdc_enqueue()) Skipped 2 previous similar messages
LustreError: 3188:0:(file.c:2664:ll_inode_revalidate_fini()) failure -108 inode 66781185
LustreError: 3188:0:(file.c:2664:ll_inode_revalidate_fini()) Skipped 1 previous similar message
LustreError: 3190:0:(file.c:2664:ll_inode_revalidate_fini()) failure -108 inode 66781185
LustreError: 3190:0:(file.c:2664:ll_inode_revalidate_fini()) Skipped 1 previous similar message
Chris Worley
2008-Aug-06 14:31 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
Going back further in the logs, this message appears many times:

Lustre: 5594:0:(client.c:1587:ptlrpc_unregister_reply()) @@@ Unexpectedly long timeout rvcng=1 unlnk=1 req@ffff810270717b68 x1187993/t0 o400->MGS@36.102.29.1@o2ib:26/25 lens 128/256 e 1 to 40 dl 1217828500 ref 1 fl Rpc:EN/0/0 rc -4/0

On Wed, Aug 6, 2008 at 8:24 AM, Chris Worley <worleys at gmail.com> wrote:
> On Wed, Aug 6, 2008 at 8:19 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>>
>> Are there any Lustre messages in your syslog on the clients which are
>> returning unexpected results?
>
> Indeed. These clients seem to be timing out and getting evicted:
>
> LustreError: 11-0: an error occurred while communicating with
> 36.102.29.1@o2ib. The ldlm_enqueue operation failed with -107
> [...]
Brian J. Murrell
2008-Aug-06 15:15 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 08:24 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 8:19 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> >
> > Are there any Lustre messages in your syslog on the clients which are
> > returning unexpected results?
>
> Indeed. These clients seem to be timing out and getting evicted:
>
> LustreError: 11-0: an error occurred while communicating with
> 36.102.29.1@o2ib. The ldlm_enqueue operation failed with -107
> LustreError: Skipped 36 previous similar messages
> Lustre: lfs-MDT0000-mdc-ffff81027cca8d20: Connection to service
> lfs-MDT0000 via nid 36.102.29.1@o2ib was lost; in progress operations
> using this service will wait for recovery to complete.
> Lustre: Skipped 36 previous similar messages
> LustreError: 167-0: This client was evicted by lfs-MDT0000; in
> progress operations using this service will fail.

So, now what does the MDS serving lfs-MDT0000 say about this? Why did it evict? What version of Lustre is this? Perhaps you said so already and I have just forgotten.

b.
Chris Worley
2008-Aug-06 15:29 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 9:15 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> So, now what does the MDS serving lfs-MDT0000 say about this? Why did
> it evict? What version of Lustre is this? Perhaps you said so already
> and I have just forgotten.

1.6.5.1 clients w/ 1.6.4.3 OSSs.

The MDS is very verbose. I get these all the time, even prior to the error:

Lustre: lfs-OST0000: haven't heard from client 12f00621-096c-b331-8774-abfc72dfd822 (at 36.102.36.15@o2ib) in 92 seconds. I think it's dead, and I am evicting it.

Would these messages be correlated?

LustreError: 6820:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6820:0:(handler.c:1499:mds_handle()) Skipped 6 previous similar messages
LustreError: 6820:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@000001022ed4fa00 x2285850/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 6820:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 27 previous similar messages
Lustre: lfs-MDT0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 27 previous similar messages
Lustre: 5696:0:(ldlm_lib.c:519:target_handle_reconnect()) MGS: e04def6d-186d-2d01-1beb-2e00418cfaca reconnecting
Lustre: 5696:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 12 previous similar messages
LustreError: 6877:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6877:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@0000010247c30050 x2286894/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 6877:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 1 previous similar message
Lustre: lfs-MDT0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
LustreError: 6829:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6829:0:(handler.c:1499:mds_handle()) Skipped 1 previous similar message
LustreError: 6829:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@000001022aa7bc00 x2288557/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 5898:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@000001022f07da00 x2290923/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Lustre: lfs-OST0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: 6713:0:(ldlm_lib.c:519:target_handle_reconnect()) MGS: e04def6d-186d-2d01-1beb-2e00418cfaca reconnecting
Lustre: 6713:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 12 previous similar messages
Lustre: lfs-MDT0000: haven't heard from client f8aab44b-4829-c626-8056-8b80d38c960e (at 36.102.36.11@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 5 previous similar messages
LustreError: 6464:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@0000010224c18200 x2239379/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 6464:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 5 previous similar messages
LustreError: 7324:0:(handler.c:1499:mds_handle()) operation 37 on unconnected MDS from 12345-36.102.36.11@o2ib
LustreError: 6846:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-36.102.36.6@o2ib
Lustre: lfs-OST0000: haven't heard from client f99ffcca-2168-b2ab-8648-e40f4725acd1 (at 36.102.36.9@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: lfs-OST0000: haven't heard from client df74b077-e833-5fce-2744-9c0fa51427c9 (at 36.102.36.6@o2ib) in 92 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 23 previous similar messages
Lustre: 5697:0:(ldlm_lib.c:519:target_handle_reconnect()) MGS: e04def6d-186d-2d01-1beb-2e00418cfaca reconnecting
Lustre: 5697:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 12 previous similar messages
LustreError: 5860:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@0000010251a06400 x2284477/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 5860:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 54 previous similar messages

Chris
Chris Worley
2008-Aug-06 15:48 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
It looks like I can keep the error from recurring if I continuously "ls" the files:

watch -n 10 'pdsh -w et[01-18] -x et[10,12] "find /lfs/<user> -type f | xargs ls -l" >/dev/null'

Otherwise, it comes back again.
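If the watch has to survive logouts, a hypothetical cron entry on the ln admin node could do the same thing (the 5-minute interval is a guess, not something suggested in this thread):

    # m h dom mon dow   command
    */5 * * * * pdsh -w et[01-18] -x et[10,12] "find /lfs/<user> -type f | xargs ls -l" >/dev/null 2>&1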
Brian J. Murrell
2008-Aug-06 16:17 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 09:29 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 9:15 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> >
> > So, now what does the MDS serving lfs-MDT0000 say about this? Why did
> > it evict? What version of Lustre is this? Perhaps you said so already
> > and I have just forgotten.
>
> 1.6.5.1 clients w/ 1.6.4.3 OSSs.
>
> The MDS is very verbose. I get these all the time, even prior to the error:
>
> Lustre: lfs-OST0000: haven't heard from client
> 12f00621-096c-b331-8774-abfc72dfd822 (at 36.102.36.15@o2ib) in 92
> seconds. I think it's dead, and I am evicting it.

Yup. If you can correlate those kinds of messages (they have the client IP address in them) with the errors on the client, you have your eviction events. I notice that you are getting messages out of dmesg rather than syslog; syslog makes correlation easier and more definite because of the timestamps.

But this kind of eviction simply means the clients are unresponsive from the point of view of the MDS. They are neither making filesystem RPCs nor sending "ping" keepalives, so the MDS assumes they have died and evicts them to reclaim the locks they may be holding, so that an apparently dead client does not hold up other, living clients.

So you need to investigate why the clients are dying, or appear dead (i.e. going silent), to the MDS.

b.
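One way to do that correlation, sketched with /var/log/messages assumed as the syslog target on both sides (adjust for the real logging setup):

    # On the MDS: when did it decide each client was dead?
    grep "evicting" /var/log/messages
    # On a suspect client (et11 here, purely as an example): what was it
    # logging around those same timestamps?
    ssh et11 "grep -E 'evicted|Connection' /var/log/messages"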
Chris Worley
2008-Aug-06 16:41 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 10:17 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> But this kind of eviction simply means the clients are unresponsive
> from the point of view of the MDS. They are neither making filesystem
> RPCs nor sending "ping" keepalives, so the MDS assumes they have died
> and evicts them to reclaim the locks they may be holding, so that an
> apparently dead client does not hold up other, living clients.
>
> So you need to investigate why the clients are dying, or appear dead
> (i.e. going silent), to the MDS.

Is there anything in /proc or /sys I can look at to see what "keepalive" parameters are set up?

The systems aren't dying. I need the least obtrusive way to force the clients to keep pinging, or to tell the MDS to give them a longer timeout.

I don't see why this only affects the RHEL5 clients. Maybe that's a hint.

Thanks,

Chris
Brian J. Murrell
2008-Aug-06 16:45 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 10:41 -0600, Chris Worley wrote:
>
> Is there anything in /proc or /sys I can look at to see what
> "keepalive" parameters are set up?

All timeouts are based on the obd_timeout in /proc/sys/lustre/timeout, which MUST be the same on all nodes.

> The systems aren't dying.

They are failing to communicate with the MDS for some reason. Network problems perhaps? You could try enabling +rpctrace debug and inspecting the debug file for RPCs to see if the client is indeed sending something (even if it's only a ping) at regular intervals.

b.
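A quick way to compare that value across the clients, reusing the node list from earlier in the thread (a sketch; run the equivalent cat on the MDS and OSSs as well):

    # obd_timeout must match everywhere; any outlier shows up as its own
    # dshbak group.
    pdsh -w et[01-18] -x et[10,12] cat /proc/sys/lustre/timeout | dshbak -c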
Chris Worley
2008-Aug-06 17:08 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 10:45 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 10:41 -0600, Chris Worley wrote:
> >
> > Is there anything in /proc or /sys I can look at to see what
> > "keepalive" parameters are set up?
>
> All timeouts are based on the obd_timeout in /proc/sys/lustre/timeout,
> which MUST be the same on all nodes.

Would you suggest I increase or decrease this value? Is there a way to inhibit the eviction, or is that necessary to keep really dead clients from locking out files?

> > The systems aren't dying.
>
> They are failing to communicate with the MDS for some reason. Network
> problems perhaps? You could try enabling +rpctrace debug and inspecting
> the debug file for RPCs to see if the client is indeed sending something
> (even if it's only a ping) at regular intervals.

All the systems (RHEL4 and RHEL5 clients, Lustre servers) are on the same Ethernet and IB switches. There were no issues before the 1.6.5.1 upgrade on the RHEL5 nodes.

Would a normal ping do it? I can jury-rig all the RHEL5 nodes to ping the MDS.

Chris
Brian J. Murrell
2008-Aug-06 17:26 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 11:08 -0600, Chris Worley wrote:
>
> Would you suggest I increase or decrease this value?

Neither, just make sure all nodes have the same value. I haven't seen anything that indicates it needs to be changed.

> Is there a way to inhibit the eviction,

No.

> or is that necessary to keep
> really dead clients from locking out files?

Indeed.

> All the systems (RHEL4 and RHEL5 clients, Lustre servers) are on the
> same Ethernet and IB switches. There were no issues before the 1.6.5.1
> upgrade on the RHEL5 nodes.

+rpctrace debug is probably the way to go to see what the clients are(n't) doing in terms of keeping the MDS aware of their existence.

> Would a normal ping do it?

No.

> I can jury-rig all the RHEL5 nodes to ping the MDS.

Even if that would work, it is a hack that would simply mask what is likely another, more fundamental problem. You need to get to the root of the real problem.

b.
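For reference, enabling and collecting that trace on a suspect client might look like this (a sketch for a 1.6.x client; the output path is arbitrary):

    # Add rpctrace to the debug mask, let the problem recur, then dump the
    # kernel debug buffer to a file for inspection.
    echo +rpctrace > /proc/sys/lnet/debug
    # ... wait for an eviction window to pass ...
    lctl dk /tmp/lustre-rpctrace.log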
Chris Worley
2008-Aug-06 18:11 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 11:26 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> +rpctrace debug is probably the way to go to see what the clients
> are(n't) doing in terms of keeping the MDS aware of their existence.

The log from the client perspective starts with a lot of packet mismatches:

00000100:00020000:3:1217629906.902955:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217633409.730495:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000400:00000100:3:1217633608.884380:0:5588:0:(lib-move.c:1647:lnet_parse_put()) Dropping PUT from 12345-36.102.29.1@o2ib portal 25 match 157556 offset 0 length 128: 2
00000100:00020000:3:1217633710.886759:0:5587:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217633852.887489:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217634099.512727:0:5591:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128

...repeated many times. Then it repeats this many times:

00000100:00080000:6:1217651769.943608:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:6:1217651769.943619:0:5596:0:(import.c:465:ptlrpc_connect_import()) ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:6:1217651769.943623:0:5596:0:(import.c:342:import_select_connection()) MGC36.102.29.1@o2ib: connect to NID 36.102.29.1@o2ib last attempt 4342590785
00000100:00080000:6:1217651769.943628:0:5596:0:(import.c:423:import_select_connection()) MGC36.102.29.1@o2ib: import ffff810275e06188 using connection MGC36.102.29.1@o2ib_0/36.102.29.1@o2ib
00000100:00080000:6:1217651769.943630:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (743) for MGS
00000100:00080000:6:1217651820.942855:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:6:1217651820.942858:0:5596:0:(import.c:465:ptlrpc_connect_import()) ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:6:1217651820.942862:0:5596:0:(import.c:342:import_select_connection()) MGC36.102.29.1@o2ib: connect to NID 36.102.29.1@o2ib last attempt 4342641786
00000100:00080000:6:1217651820.942866:0:5596:0:(import.c:423:import_select_connection()) MGC36.102.29.1@o2ib: import ffff810275e06188 using connection MGC36.102.29.1@o2ib_0/36.102.29.1@o2ib
00000100:00080000:6:1217651820.942868:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (744) for MGS

...repeat...

with one of these embedded:

00000100:00080000:6:1217651860.941624:0:5596:0:(pinger.c:144:ptlrpc_pinger_main()) not pinging MGS (in recovery: EVICTED or recovery disabled: 0/0)

...ending with this:

00000100:00080000:6:1217652002.937993:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (748) for MGS
00000100:00080000:3:1217663095.721928:0:9878:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:3:1217663095.721931:0:9878:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:3:1217663095.722708:0:9878:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217663095.722710:0:9878:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:3:1217663095.722712:0:9878:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:3:1217663095.722714:0:9878:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:3:1217663095.722715:0:9878:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:3:1217663095.722719:0:9878:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.

...then followed by length mismatches interspersed with "recovery":

00000100:00020000:3:1217663393.547988:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217663677.841799:0:5591:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217663714.726333:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217663819.561596:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217664140.732610:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:3:1217664244.038963:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217664556.735280:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00080000:3:1217665408.738550:0:9927:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:3:1217665408.738554:0:9927:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:3:1217665408.739328:0:9927:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217665408.739330:0:9927:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:3:1217665408.739333:0:9927:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:3:1217665408.739335:0:9927:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:3:1217665408.739336:0:9927:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:3:1217665408.739340:0:9927:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
00000100:00080000:3:1217665408.739344:0:9927:0:(recover.c:175:ptlrpc_wake_delayed()) @@@ waking (set 0000000000000000): req@ffff8101d3aa78e8 x328561/t0 o501->MGS@36.102.29.1@o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00020000:3:1217665653.806267:0:5589:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217665692.740540:0:5589:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:6:1217666544.737120:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:6:1217666647.982634:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:6:1217667112.732267:0:5588:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00080000:3:1217667254.732738:0:9966:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:3:1217667254.732742:0:9966:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:3:1217667254.733543:0:9966:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217667254.733546:0:9966:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:3:1217667254.733549:0:9966:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:3:1217667254.733552:0:9966:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:3:1217667254.733553:0:9966:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:3:1217667254.733556:0:9966:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
00000100:00080000:3:1217667254.733559:0:9966:0:(recover.c:175:ptlrpc_wake_delayed()) @@@ waking (set 0000000000000000): req@ffff8102341658e8 x338536/t0 o501->MGS@36.102.29.1@o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00020000:6:1217667254.733672:0:5586:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
00000100:00020000:6:1217667502.281243:0:5589:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00020000:3:1217668248.722633:0:5587:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128

...lots of these mismatches, followed by what look like more recoveries:

00000100:02000000:0:1217678411.535026:0:12692:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
00000100:00020000:0:1217678425.863040:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
00000100:00080000:0:1217678462.534057:0:12693:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:0:1217678462.534061:0:12693:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:0:1217678462.534192:0:12693:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:0:1217678462.534194:0:12693:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:0:1217678462.534195:0:12693:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:0:1217678462.534196:0:12693:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:0:1217678462.534197:0:12693:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import

...repeated many, many, many times... then:

00000100:00080000:4:1217780793.292959:0:16703:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL
10000000:01000000:4:1217780793.292960:0:16703:0:(mgc_request.c:880:mgc_import_event()) import event 0x808004
10000000:02000400:4:1217780793.292961:0:16703:0:(mgc_request.c:894:mgc_import_event()) MGC36.102.29.1@o2ib: Reactivating import
00000100:02000000:4:1217780793.292962:0:16703:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) MGC36.102.29.1@o2ib: Connection restored to service MGS using nid 36.102.29.1@o2ib.
10000000:01000000:4:1217780806.980814:0:6127:0:(mgc_request.c:289:mgc_requeue_thread()) updating log lfs-client
10000000:01000000:4:1217780806.980817:0:6127:0:(mgc_request.c:1092:mgc_process_log()) Process log lfs-client:ffff81027cca8d20 from 530
10000000:01000000:4:1217780806.980819:0:6127:0:(mgc_request.c:627:mgc_enqueue()) Enqueue for lfs-client (res 0x73666c)
00000100:00020000:0:1217780806.980984:0:5588:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
10000000:01000000:4:1217780844.291599:0:6127:0:(mgc_request.c:1111:mgc_process_log()) Can't get cfg lock: -5
00000100:00080000:0:1217780844.291621:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery()) MGS: starting recovery
00000100:00080000:4:1217780844.291623:0:6127:0:(client.c:1801:ptlrpc_queue_wait()) @@@ "ll_cfg_requeue" waiting for recovery: (FULL != DISCONN) req@ffff810238534098 x935044/t0 o501->MGS@36.102.29.1@o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1217780844.291625:0:5596:0:(import.c:465:ptlrpc_connect_import()) ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:0:1217780844.291630:0:5596:0:(import.c:342:import_select_connection()) MGC36.102.29.1@o2ib: connect to NID 36.102.29.1@o2ib last attempt 4471669871
10000000:01000000:4:1217780844.291636:0:5660:0:(mgc_request.c:565:mgc_blocking_ast()) Lock res 0x73666c (lfs)
00000100:00080000:0:1217780844.291637:0:5596:0:(import.c:423:import_select_connection()) MGC36.102.29.1@o2ib: import ffff810275e06188 using connection MGC36.102.29.1@o2ib_0/36.102.29.1@o2ib
10000000:01000000:4:1217780844.291638:0:5660:0:(mgc_request.c:590:mgc_blocking_ast()) log lfs-client: original grant failed, will requeue later
00000100:00080000:0:1217780844.291639:0:5596:0:(import.c:504:ptlrpc_connect_import()) Last reconnection attempt (3480) for MGS
00000100:00080000:4:1217780844.291870:0:5595:0:(import.c:709:ptlrpc_connect_interpret()) reconnected to MGS@36.102.29.1@o2ib after partition
00000100:00080000:4:1217780844.291873:0:5595:0:(import.c:714:ptlrpc_connect_interpret()) MGC36.102.29.1@o2ib: reconnected but import is invalid; marking evicted
00000100:00080000:4:1217780844.291874:0:5595:0:(import.c:715:ptlrpc_connect_interpret()) ffff810275e06188 MGS: changing import state from CONNECTING to EVICTED
00000100:00080000:4:1217780844.291877:0:5595:0:(import.c:1053:ptlrpc_import_recovery_state_machine()) evicted from MGS@36.102.29.1@o2ib; invalidating
10000000:01000000:4:1217780844.291975:0:5595:0:(mgc_request.c:880:mgc_import_event()) import event 0x808005
00000100:00080000:0:1217780844.291996:0:16704:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import MGC36.102.29.1@o2ib to MGS@36.102.29.1@o2ib
10000000:01000000:0:1217780844.292000:0:16704:0:(mgc_request.c:880:mgc_import_event()) import event 0x808003
00000100:00080000:0:1217780844.292132:0:16704:0:(import.c:1025:ptlrpc_invalidate_import_thread()) ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:0:1217780844.292134:0:16704:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to MGS@36.102.29.1@o2ib
00000100:00080000:0:1217780844.292135:0:16704:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) ffff810275e06188 MGS: changing import state from RECOVER to FULL

Does this give any hints? There's a lot more log to go.

Thanks,

Chris
Brian J. Murrell
2008-Aug-06 19:25 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 12:11 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 11:26 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> >
> > +rpctrace debug is probably the way to go to see what the clients
> > are(n't) doing in terms of keeping the MDS aware of their existence.
>
> The log from the client perspective starts with a lot of packet mismatches:
>
> 00000100:00020000:3:1217629906.902955:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
> 00000100:00020000:3:1217633409.730495:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128

Given the amount of debug provided, this looks like it could be bug 16534, which is on my plate to test the solution for and land.

b.
Chris Worley
2008-Aug-06 19:36 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 1:25 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> Given the amount of debug provided, this looks like it could be bug
> 16534, which is on my plate to test the solution for and land.

Would you like me to add my complete RPC log to that bug report? What distro/kernel are they running?

Note that I didn't have this error using RHEL4/5 with 1.6.4.2 clients and 1.6.4.3 servers.

This problem doesn't start immediately after a reboot. I'm guessing it's correlated with somebody's job, and they haven't complained... so I don't know who's doing what to put the client in this state.

Chris
Brian J. Murrell
2008-Aug-06 19:47 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, 2008-08-06 at 13:36 -0600, Chris Worley wrote:
> Note that I didn't have this error using RHEL4/5 with 1.6.4.2 clients
> and 1.6.4.3 servers.

Indeed. IIUC, this is something that was introduced with 1.6.5.

b.
Chris Worley
2008-Aug-06 21:39 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 1:47 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 13:36 -0600, Chris Worley wrote:
> > Note that I didn't have this error using RHEL4/5 with 1.6.4.2 clients
> > and 1.6.4.3 servers.
>
> Indeed. IIUC, this is something that was introduced with 1.6.5.

Are there any work-arounds?

Chris
Chris Worley
2008-Aug-15 06:00 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Wed, Aug 6, 2008 at 1:25 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 12:11 -0600, Chris Worley wrote:
> > On Wed, Aug 6, 2008 at 11:26 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> > >
> > > +rpctrace debug is probably the way to go to see what the clients
> > > are(n't) doing in terms of keeping the MDS aware of their existence.
> >
> > The log from the client perspective starts with a lot of packet mismatches:
> >
> > 00000100:00020000:3:1217629906.902955:0:5592:0:(events.c:116:reply_in_callback()) early reply sized 176, expect 128
> > 00000100:00020000:3:1217633409.730495:0:5590:0:(events.c:116:reply_in_callback()) early reply sized 240, expect 128
>
> Given the amount of debug provided, this looks like it could be bug
> 16534, which is on my plate to test the solution for and land.

In adding info to bugzilla, you clarified that this is actually bug #16237.

This bug is "fixed" and in Lustre 1.6.6... but 1.6.6 isn't released.

Is there a work-around for the 1.6.5.1 client (short of rebooting)?

Thanks,

Chris
Andreas Dilger
2008-Aug-18 09:23 UTC
[Lustre-discuss] Nodes claim error with files, then say everything is fine.
On Aug 15, 2008 00:00 -0600, Chris Worley wrote:
> In adding info to bugzilla, you clarified that this is actually bug #16237.
>
> This bug is "fixed" and in Lustre 1.6.6... but 1.6.6 isn't released.
>
> Is there a work-around for the 1.6.5.1 client (short of rebooting)?

If you are up to rebuilding the 1.6.5.1 client, you can likely fix it with a small change in lustre_start_mgc() (untested):

-        data->ocd_connect_flags = OBD_CONNECT_VERSION | OBD_CONNECT_AT;
+        data->ocd_connect_flags = OBD_CONNECT_VERSION;

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
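For anyone attempting that rebuild, a rough sketch of the steps (the file location, lustre/obdclass/obd_mount.c, and the configure arguments are assumptions about a typical 1.6.x client build, not instructions from this thread):

    # Edit lustre_start_mgc() (expected in lustre/obdclass/obd_mount.c) to
    # drop OBD_CONNECT_AT as shown above, then rebuild the client packages.
    cd lustre-1.6.5.1
    ./configure --with-linux=/usr/src/kernels/<kernel-version>
    make rpms    # install the rebuilt client RPMs and remount the filesystem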