I get a ferocious set of error messages when I mount a 1.6.7.2
filesystem on a b_release_1_8_1 client.
is this expected?

 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: MGC10.8.30.244 at o2ib: Reactivating import
 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: Client system-client has started
 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201 at o2ib failed: 5
 ... last message repeated 17 times ...
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208 at o2ib failed: 5

looks like it succeeds in the end, but only after a struggle.

I don't have any problems with 1.8.1 <-> 1.8.1 or 1.6.7.2 <-> 1.6.7.2.

servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group
quota fix).
client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from
cvs 20090712131220 + bz18793 again.

BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
I'm confused about which is closest to the final 1.8.1 :-/

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Brian J. Murrell
2009-Jul-15 14:10 UTC
[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: MGC10.8.30.244 at o2ib: Reactivating import
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: Client system-client has started
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201 at o2ib failed: 5
> ... last message repeated 17 times ...
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208 at o2ib failed: 5

These are all LND errors.  What versions of OFED are you using on each
end?

> looks like it succeeds in the end, but only after a struggle.

Is it completely stable and performant after the struggle?  Do the error
messages stop?

> BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
> I'm confused about which is closest to the final 1.8.1 :-/

b_release_1_8_1 is the branch and v1_8_1_RC1 is the tag (i.e. a snapshot
in time from that branch) which is being tested and which has the
potential to become 1.8.1 if the testing pans out.  It is entirely
possible that even when a v1_8_1_RCn becomes the final release, there
will be patches dangling on the tip of b_release_1_8_1 that are not
release blockers but are there in case we need a 1.8.1.1.

So the choice is yours.  If you want to be using exactly what could
potentially become the GA release, stick to the most recent tags.  If
you want to test ahead of what could be the GA, use the branch tip.

b.
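P.S. in case it helps, grabbing either one is just a normal cvs checkout
with -r; the repository root and module name below are placeholders, so
substitute whatever you normally use:

  # exact snapshot currently being tested as the 1.8.1 release candidate
  cvs -d "$CVSROOT" checkout -r v1_8_1_RC1 lustre

  # tip of the release branch, i.e. the RC plus anything landed since
  cvs -d "$CVSROOT" checkout -r b_release_1_8_1 lustre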
On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
> On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>> [o2iblnd mount log from the first message trimmed]
>
> These are all LND errors.  What versions of OFED are you using on each
> end?

all kernels are compiled with the rhel5 kernel tree's standard OFED.
I think 1.3.2 is what's in rhel5.3/centos5.3?

>> looks like it succeeds in the end, but only after a struggle.
>
> Is it completely stable and performant after the struggle?  Do the error
> messages stop?

the fs's appear to be fine.
the error messages are just on the initial mount of the first lustre fs.
subsequent mounts of other lustre fs's don't get any messages, so it
seems like it's just an extremely noisy protocol/version negotiation
the first time the 1.8.1 lnet fires up and tries to talk to 1.6.7.2
servers??

another data point is that the above errors don't happen with
2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
OFED, so it's probably something that's happened between 1.8.0.1 and
1.8.1-pre.
or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and
2.6.18-128.1.16.el5, but that seems less likely.
I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like...

>> BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
>> I'm confused about which is closest to the final 1.8.1 :-/
>
> b_release_1_8_1 is the branch and v1_8_1_RC1 is the tag (i.e. a snapshot
> in time from that branch) which is being tested and which has the
> potential to become 1.8.1 if the testing pans out.  It is entirely
> possible that even when a v1_8_1_RCn becomes the final release, there
> will be patches dangling on the tip of b_release_1_8_1 that are not
> release blockers but are there in case we need a 1.8.1.1.
>
> So the choice is yours.  If you want to be using exactly what could
> potentially become the GA release, stick to the most recent tags.  If
> you want to test ahead of what could be the GA, use the branch tip.

cool. thanks for the explanation.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Brian J. Murrell
2009-Jul-15 15:59 UTC
[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, 2009-07-15 at 11:22 -0400, Robin Humble wrote:
> all kernels are compiled with the rhel5 kernel tree's standard OFED.
> I think 1.3.2 is what's in rhel5.3/centos5.3?

Yeah, something like that IIRC.

> the error messages are just on the initial mount of the first lustre fs.
> subsequent mounts of other lustre fs's don't get any messages, so it
> seems like it's just an extremely noisy protocol/version negotiation
> the first time the 1.8.1 lnet fires up and tries to talk to 1.6.7.2
> servers??

Maybe one of our LNET experts might have some additional information to
offer.

> another data point is that the above errors don't happen with
> 2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
> OFED, so it's probably something that's happened between 1.8.0.1 and
> 1.8.1-pre.
> or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and
> 2.6.18-128.1.16.el5, but that seems less likely.
> I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like...

Yeah, it would be a great troubleshooting addition to see if the same
kernel on the clients and servers, with the different lustre versions,
has the same problem.  That would isolate the problem either to, or away
from, the difference in OFED stacks.

> cool. thanks for the explanation.

NP.

b.
On Wed, Jul 15, 2009 at 11:59:54AM -0400, Brian J. Murrell wrote:
> On Wed, 2009-07-15 at 11:22 -0400, Robin Humble wrote:
>> another data point is that the above errors don't happen with
>> 2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
>> OFED, so it's probably something that's happened between 1.8.0.1 and
>> 1.8.1-pre.
>> or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and
>> 2.6.18-128.1.16.el5, but that seems less likely.
>> I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like...
>
> Yeah, it would be a great troubleshooting addition to see if the same
> kernel on the clients and servers, with the different lustre versions,
> has the same problem.  That would isolate the problem either to, or away
> from, the difference in OFED stacks.

ok - I made a 2.6.18-128.1.14.el5 with b_release_1_8_1 and it behaves
the same as 2.6.18-128.1.16.el5 with b_release_1_8_1. ie. it spits out
a bunch of errors on the first lustre mount.

the only changes between those rhel .14 and .16 versions look pretty
unrelated to IB/lnet, so I guess that was to be expected:

 * Sat Jun 27 2009 Jiri Pirko <jpirko at redhat.com> [2.6.18-128.1.16.el5]
 - [mm] prevent panic in copy_hugetlb_page_range (Larry Woodman) [508030 507860]

 * Tue Jun 23 2009 Jiri Pirko <jpirko at redhat.com> [2.6.18-128.1.15.el5]
 - [mm] fix swap race condition in fork-gup-race patch (Andrea Arcangeli) [507297 506684]

so I guess the change is between Lustre 1.8.0.1 and
b_release_1_8_1-20090712131220 somewhere.
if only we had git bisect, and if only I knew how to use it, and if only
I had the time to try it... :-)

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
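P.S. for the record, if the tree were in git, the bisect itself would be
roughly the workflow below.  this is only a sketch - the tree is still
in CVS, and the tag names are hypothetical stand-ins for the "last good"
and "first bad" builds:

  git bisect start
  git bisect bad  v1_8_1_RC1    # first build that shows the noisy mount
  git bisect good v1_8_0_1      # last build that mounted quietly
  # git now checks out a revision roughly half way between the two;
  # build it, retry the mount against the 1.6.7.2 servers, then report:
  git bisect good      # or: git bisect bad
  # repeat until git names the first bad commit, then clean up with:
  git bisect reset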
On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote:
> I get a ferocious set of error messages when I mount a 1.6.7.2
> filesystem on a b_release_1_8_1 client.
> is this expected?

just to annotate the below a bit in case it's not clear... sorry -
should have done that in the first email :-/

10.8.30.244 is the MGS and one MDS, 10.8.30.245 is the other MDS in the
failover pair. 10.8.30.201 -> 208 are OSS's (one OST per OSS), and the
fs is mounted in the usual failover way eg.
  mount -t lustre 10.8.30.244 at o2ib:10.8.30.245 at o2ib:/system /system

from the below (and other similar logs) it kinda looks like the client
fails and then renegotiates with all the servers.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: MGC10.8.30.244 at o2ib: Reactivating import
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: Client system-client has started
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201 at o2ib failed: 5
> ... last message repeated 17 times ...
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208 at o2ib failed: 5
>
> [rest of the quoted message trimmed]
On Wed, Jul 15, 2009 at 11:22:26AM -0400, Robin Humble wrote:
> On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
>> On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>>> [o2iblnd mount log from the first message trimmed]
>>>
>>> looks like it succeeds in the end, but only after a struggle.
>>
>> Is it completely stable and performant after the struggle?  Do the error
>> messages stop?
>
> the fs's appear to be fine.

hmmm - actually, the fs's are _mostly_ fine... but sometimes i/o that
happens right after the above errors fails completely.
after a few more trials, this seems to happen about 40% of the time... :-/

eg. (+/- some missing characters from crappy IPMI SoL) you can see that
rsync has managed to list the files on the newly mounted lustre fs, but
then gets i/o errors when trying to copy the files off lustre to
ramdisk ->

...
rsync: readlink "/mnt/lustre_system/lib64/libattr.so.1.1.0.1.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libc-2.5.so" faile(5)
rsync: readlink "/mnt/lustre_system/lib64/libcrypt-2.5.so" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.a.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.so.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libexpat.so.0.5.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib6-2.0.a" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libgmodule-2.0.aject-2.0.a" failed: Input/output error (5)
...

so maybe lnet has renegotiated a connection to the MDS ok, but not to
the OSS's yet.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
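P.S. a quick way to see whether the client really has live connections
to all of the server nids when this hits might be something along the
lines of the below - only a sketch, since the exact output differs a bit
between lustre versions:

  # LNET-level reachability of each MDS/OSS nid from the client
  for n in 244 245 201 202 203 204 205 206 207 208; do
      lctl ping 10.8.30.$n@o2ib
  done

  # device list on the client - the MDC/OSC devices should all be UP
  lctl dl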
I added this to bugzilla.
https://bugzilla.lustre.org/show_bug.cgi?id=20227

cheers,
robin

On Wed, Jul 15, 2009 at 01:09:33PM -0400, Robin Humble wrote:
> On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote:
>> I get a ferocious set of error messages when I mount a 1.6.7.2
>> filesystem on a b_release_1_8_1 client.
>> is this expected?
>
> just to annotate the below a bit in case it's not clear... sorry -
> should have done that in the first email :-/
>
> 10.8.30.244 is the MGS and one MDS, 10.8.30.245 is the other MDS in the
> failover pair. 10.8.30.201 -> 208 are OSS's (one OST per OSS), and the
> fs is mounted in the usual failover way eg.
>   mount -t lustre 10.8.30.244 at o2ib:10.8.30.245 at o2ib:/system /system
>
> from the below (and other similar logs) it kinda looks like the client
> fails and then renegotiates with all the servers.
>
> [rest of the quoted message trimmed]
Robin,

These messages should be harmless.  1.8.1 uses a new o2iblnd message
protocol, so there is a version negotiation if the o2iblnd version of
the client is older.  Are there any other o2ib error messages, like
"Deleting messages for xxx.xxx.xxx.xxx at o2ib: connection failed", when
you see the IO failure?

Anyway, if you get any other complaints from o2ib besides these
messages, could you please post them on the bug you filed.

Thanks
Liang

Robin Humble wrote:
> I added this to bugzilla.
> https://bugzilla.lustre.org/show_bug.cgi?id=20227
>
> cheers,
> robin
>
> [rest of the quoted message trimmed]