I get a ferocious set of error messages when I mount a 1.6.7.2
filesystem on a b_release_1_8_1 client.
is this expected?

 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
 Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: MGC10.8.30.244 at o2ib: Reactivating import
 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: Client system-client has started
 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201 at o2ib failed: 5
 ... last message repeated 17 times ...
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205 at o2ib failed: 5
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
 Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208 at o2ib failed: 5

looks like it succeeds in the end, but only after a struggle.

I don't have any problems with 1.8.1 <-> 1.8.1 or 1.6.7.2 <-> 1.6.7.2.

servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group
quota fix).
client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from
cvs 20090712131220 + bz18793 again.

BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
I'm confused about which is closest to the final 1.8.1 :-/

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Brian J. Murrell
2009-Jul-15 14:10 UTC
[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: MGC10.8.30.244 at o2ib: Reactivating import
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: Client system-client has started
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201 at o2ib failed: 5
> ... last message repeated 17 times ...
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208 at o2ib failed: 5

These are all LND errors.  What versions of OFED are you using on each
end?

> looks like it succeeds in the end, but only after a struggle.

Is it completely stable and performant after the struggle?  Do the error
messages stop?

> BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
> I'm confused about which is closest to the final 1.8.1 :-/

b_release_1_8_1 is the branch and v1_8_1_RC1 is the tag (i.e. a snapshot
in time from that branch) which is being tested and which has the
potential to become 1.8.1 if the testing pans out.  It is entirely
possible that even when a v1_8_1_RCn becomes the final release, there
will be patches dangling on the tip of b_release_1_8_1 that are not
release blockers but are there in case we need a 1.8.1.1.

So the choice is yours.  If you want to be using exactly what could
potentially become the GA release, stick to the most recent tags.  If
you want to test ahead of what could be the GA, use the branch tip.

b.
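P.S. in case it helps, grabbing either one is just a normal cvs checkout
with -r; the repository root and module name below are placeholders, so
substitute whatever you normally use:

  # exact snapshot currently being tested as the 1.8.1 release candidate
  cvs -d "$CVSROOT" checkout -r v1_8_1_RC1 lustre

  # tip of the release branch, i.e. the RC plus anything landed since
  cvs -d "$CVSROOT" checkout -r b_release_1_8_1 lustre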
On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
> On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>> [o2iblnd mount log from the first message trimmed]
>
> These are all LND errors.  What versions of OFED are you using on each
> end?

all kernels are compiled with the rhel5 kernel tree's standard OFED.
I think 1.3.2 is what's in rhel5.3/centos5.3?

>> looks like it succeeds in the end, but only after a struggle.
>
> Is it completely stable and performant after the struggle?  Do the error
> messages stop?

the fs's appear to be fine.
the error messages are just on the initial mount of the first lustre fs.
subsequent mounts of other lustre fs's don't get any messages, so it
seems like it's just an extremely noisy protocol/version negotiation
the first time the 1.8.1 lnet fires up and tries to talk to 1.6.7.2
servers??

another data point is that the above errors don't happen with
2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
OFED, so it's probably something that's happened between 1.8.0.1 and
1.8.1-pre.
or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and
2.6.18-128.1.16.el5, but that seems less likely.
I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like...

>> BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
>> I'm confused about which is closest to the final 1.8.1 :-/
>
> b_release_1_8_1 is the branch and v1_8_1_RC1 is the tag (i.e. a snapshot
> in time from that branch) which is being tested and which has the
> potential to become 1.8.1 if the testing pans out.  It is entirely
> possible that even when a v1_8_1_RCn becomes the final release, there
> will be patches dangling on the tip of b_release_1_8_1 that are not
> release blockers but are there in case we need a 1.8.1.1.
>
> So the choice is yours.  If you want to be using exactly what could
> potentially become the GA release, stick to the most recent tags.  If
> you want to test ahead of what could be the GA, use the branch tip.

cool. thanks for the explanation.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Brian J. Murrell
2009-Jul-15 15:59 UTC
[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
On Wed, 2009-07-15 at 11:22 -0400, Robin Humble wrote:
> all kernels are compiled with the rhel5 kernel tree's standard OFED.
> I think 1.3.2 is what's in rhel5.3/centos5.3?

Yeah, something like that IIRC.

> the error messages are just on the initial mount of the first lustre fs.
> subsequent mounts of other lustre fs's don't get any messages, so it
> seems like it's just an extremely noisy protocol/version negotiation
> the first time the 1.8.1 lnet fires up and tries to talk to 1.6.7.2
> servers??

Maybe one of our LNET experts might have some additional information to
offer.

> another data point is that the above errors don't happen with
> 2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
> OFED, so it's probably something that's happened between 1.8.0.1 and
> 1.8.1-pre.
> or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and
> 2.6.18-128.1.16.el5, but that seems less likely.
> I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like...

Yeah, it would be a great troubleshooting addition to see if the same
kernel on the clients and servers, with the different lustre versions,
has the same problem.  That would isolate the problem either to, or away
from, the difference in OFED stacks.

> cool. thanks for the explanation.

NP.

b.
On Wed, Jul 15, 2009 at 11:59:54AM -0400, Brian J. Murrell wrote:
> On Wed, 2009-07-15 at 11:22 -0400, Robin Humble wrote:
>> another data point is that the above errors don't happen with
>> 2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
>> OFED, so it's probably something that's happened between 1.8.0.1 and
>> 1.8.1-pre.
>> or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and
>> 2.6.18-128.1.16.el5, but that seems less likely.
>> I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like...
>
> Yeah, it would be a great troubleshooting addition to see if the same
> kernel on the clients and servers, with the different lustre versions,
> has the same problem.  That would isolate the problem either to, or away
> from, the difference in OFED stacks.

ok - I made a 2.6.18-128.1.14.el5 with b_release_1_8_1 and it behaves
the same as 2.6.18-128.1.16.el5 with b_release_1_8_1. ie. it spits out
a bunch of errors on the first lustre mount.

the only changes between those rhel .14 and .16 versions look pretty
unrelated to IB/lnet, so I guess that was to be expected:

 * Sat Jun 27 2009 Jiri Pirko <jpirko at redhat.com> [2.6.18-128.1.16.el5]
 - [mm] prevent panic in copy_hugetlb_page_range (Larry Woodman) [508030 507860]

 * Tue Jun 23 2009 Jiri Pirko <jpirko at redhat.com> [2.6.18-128.1.15.el5]
 - [mm] fix swap race condition in fork-gup-race patch (Andrea Arcangeli) [507297 506684]

so I guess the change is between Lustre 1.8.0.1 and
b_release_1_8_1-20090712131220 somewhere.
if only we had git bisect, and if only I knew how to use it, and if only
I had the time to try it... :-)

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
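P.S. for the record, if the tree were in git, the bisect itself would be
roughly the workflow below.  this is only a sketch - the tree is still
in CVS, and the tag names are hypothetical stand-ins for the "last good"
and "first bad" builds:

  git bisect start
  git bisect bad  v1_8_1_RC1    # first build that shows the noisy mount
  git bisect good v1_8_0_1      # last build that mounted quietly
  # git now checks out a revision roughly half way between the two;
  # build it, retry the mount against the 1.6.7.2 servers, then report:
  git bisect good      # or: git bisect bad
  # repeat until git names the first bad commit, then clean up with:
  git bisect reset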
On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote:
> I get a ferocious set of error messages when I mount a 1.6.7.2
> filesystem on a b_release_1_8_1 client.
> is this expected?

just to annotate the below a bit in case it's not clear... sorry -
should have done that in the first email :-/

10.8.30.244 is the MGS and one MDS, 10.8.30.245 is the other MDS in the
failover pair. 10.8.30.201 -> 208 are OSS's (one OST per OSS), and the
fs is mounted in the usual failover way eg.
  mount -t lustre 10.8.30.244 at o2ib:10.8.30.245 at o2ib:/system /system

from the below (and other similar logs) it kinda looks like the client
fails and then renegotiates with all the servers.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: MGC10.8.30.244 at o2ib: Reactivating import
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: Client system-client has started
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201 at o2ib failed: 5
> ... last message repeated 17 times ...
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205 at o2ib failed: 5
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208 at o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208 at o2ib failed: 5
>
> [rest of the quoted message trimmed]
On Wed, Jul 15, 2009 at 11:22:26AM -0400, Robin Humble wrote:
> On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
>> On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>>> [o2iblnd mount log from the first message trimmed]
>>>
>>> looks like it succeeds in the end, but only after a struggle.
>>
>> Is it completely stable and performant after the struggle?  Do the error
>> messages stop?
>
> the fs's appear to be fine.

hmmm - actually, the fs's are _mostly_ fine... but sometimes i/o that
happens right after the above errors fails completely.
after a few more trials, this seems to happen about 40% of the time... :-/

eg. (+/- some missing characters from crappy IPMI SoL) you can see that
rsync has managed to list the files on the newly mounted lustre fs, but
then gets i/o errors when trying to copy the files off lustre to
ramdisk ->

...
rsync: readlink "/mnt/lustre_system/lib64/libattr.so.1.1.0.1.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libc-2.5.so" faile(5)
rsync: readlink "/mnt/lustre_system/lib64/libcrypt-2.5.so" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.a.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.so.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libexpat.so.0.5.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib6-2.0.a" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libgmodule-2.0.aject-2.0.a" failed: Input/output error (5)
...

so maybe lnet has renegotiated a connection to the MDS ok, but not to
the OSS's yet.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
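P.S. a quick way to see whether the client really has live connections
to all of the server nids when this hits might be something along the
lines of the below - only a sketch, since the exact output differs a bit
between lustre versions:

  # LNET-level reachability of each MDS/OSS nid from the client
  for n in 244 245 201 202 203 204 205 206 207 208; do
      lctl ping 10.8.30.$n@o2ib
  done

  # device list on the client - the MDC/OSC devices should all be UP
  lctl dl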
I added this to bugzilla.
https://bugzilla.lustre.org/show_bug.cgi?id=20227

cheers,
robin

On Wed, Jul 15, 2009 at 01:09:33PM -0400, Robin Humble wrote:
> On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote:
>> I get a ferocious set of error messages when I mount a 1.6.7.2
>> filesystem on a b_release_1_8_1 client.
>> is this expected?
>
> just to annotate the below a bit in case it's not clear... sorry -
> should have done that in the first email :-/
>
> 10.8.30.244 is the MGS and one MDS, 10.8.30.245 is the other MDS in the
> failover pair. 10.8.30.201 -> 208 are OSS's (one OST per OSS), and the
> fs is mounted in the usual failover way eg.
>   mount -t lustre 10.8.30.244 at o2ib:10.8.30.245 at o2ib:/system /system
>
> from the below (and other similar logs) it kinda looks like the client
> fails and then renegotiates with all the servers.
>
> [rest of the quoted message trimmed]
Robin,

These messages should be harmless.  1.8.1 uses a new o2iblnd message
protocol, so there is a version negotiation if the o2iblnd version of
the client is older.  Are there any other o2ib error messages, like
"Deleting messages for xxx.xxx.xxx.xxx at o2ib: connection failed", when
you see the IO failure?

Anyway, if you get any other complaints from o2ib besides these
messages, could you please post them on the bug you filed.

Thanks
Liang

Robin Humble wrote:
> I added this to bugzilla.
> https://bugzilla.lustre.org/show_bug.cgi?id=20227
>
> cheers,
> robin
>
> [rest of the quoted message trimmed]