Heiko Schröter
2009-Nov-05 09:26 UTC
[Lustre-discuss] lnet_try_match_md(), Matching packet from length too big
Hello,

for the past few days these messages have been popping up on a client, and the Lustre mount gets blocked. After a forced unmount and remount of Lustre everything seems to work fine for some minutes, but then the error appears again.

Our system has approx. 140 TB on 9 OSTs. We use Lustre via automount, and everything was fine until these errors occurred. On the MDS there are no errors in the logs. I can spot no network errors using "ping -s 20000 ..." et al., and "lctl ping" reports OK.

We had an LDAP crash a few days ago, and the Lustre system was down due to the UID requests of the MDS. Besides this, the data on some OSTs has been moved, and the OSTs were taken out of the system, reformatted and put back (with a different RAID level, that is). I did reboot the MDS, but that did not cure the problem.

What can cause such hangups? Are fscks needed on the OSTs? Should we upgrade to 1.6.7 or 1.8.1.1?

Thanks and regards,
Heiko

lustre: 1.6.6
vanilla: 2.6.22.19

On the CLIENT:
Nov 5 10:01:42 dras2 Lustre: scia-MDT0000-mdc-ffff8101ee1c7000: Connection restored to service scia-MDT0000 using nid 192.168.16.122@tcp.
Nov 5 10:01:42 dras2 Lustre: Skipped 6 previous similar messages
Nov 5 10:05:02 dras2 Lustre: Request x3223875 sent from scia-MDT0000-mdc-ffff8101ee1c7000 to NID 192.168.16.122@tcp 100s ago has timed out (limit 100s).
Nov 5 10:05:02 dras2 Lustre: Skipped 6 previous similar messages
Nov 5 10:05:02 dras2 Lustre: scia-MDT0000-mdc-ffff8101ee1c7000: Connection to service scia-MDT0000 via nid 192.168.16.122@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 5 10:05:02 dras2 Lustre: Skipped 6 previous similar messages
Nov 5 10:08:22 dras2 LustreError: 5027:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16.122@tcp, match 3223875 length 1336 too big: 1272 left, 1272 allowed
Nov 5 10:08:22 dras2 LustreError: 5027:0:(lib-move.c:111:lnet_try_match_md()) Skipped 6 previous similar messages
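As a rough illustration of what the last error line reports, the numbers in it can be pulled out and compared. This is a minimal sketch, not part of Lustre: the function name `parse_too_big` and the `overrun` field are made up for this example, and the pattern is only known to fit the line shown in the client log above.

```python
import re

# Pattern for the tail of the "too big" message as it appears in the client log.
TOO_BIG = re.compile(
    r"match (?P<match>\d+) length (?P<length>\d+) too big: "
    r"(?P<left>\d+) left, (?P<allowed>\d+) allowed"
)


def parse_too_big(line):
    """Return the numeric fields of a 'too big' line, or None if it does not match."""
    m = TOO_BIG.search(line)
    if m is None:
        return None
    fields = {name: int(value) for name, value in m.groupdict().items()}
    # Hypothetical convenience field: bytes by which the packet exceeds the limit.
    fields["overrun"] = fields["length"] - fields["allowed"]
    return fields


# Tail of the client log line from the message above.
sample = "match 3223875 length 1336 too big: 1272 left, 1272 allowed"
info = parse_too_big(sample)
print(info)
```

For the logged line this gives a length of 1336 against 1272 allowed, so the incoming packet is 64 bytes larger than the buffer that was posted for it. The match value 3223875 is the same number as the request x3223875 that timed out a few minutes earlier in the same log.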
Heiko Schröter
2009-Nov-05 09:31 UTC
[Lustre-discuss] lnet_try_match_md(), Matching packet from length too big
Hello,

this just came up on the MDS after a failed attempt to mount Lustre:

MDS:
Nov 5 10:25:07 mds1 LustreError: 138-a: scia-MDT0000: A client on nid 192.168.16.133@tcp was evicted due to a lock blocking callback to 192.168.16.133@tcp timed out: rc -107

No message on the client.
Alexey Lyashkov
2009-Nov-06 08:11 UTC
[Lustre-discuss] lnet_try_match_md(), Matching packet from length too big
Hi Heiko,

the Lustre team has fixed several bugs that produce this same error message since 1.6.6. A quick list:
1) a 1.8<>1.6 interop issue
2) access to a file with a very long ACL list
3) the client does not see all OSTs as connected, but the file is widely striped

In none of these cases is an fsck needed, but to say which case this is, a full Lustre log from the MDS covering the issue is needed.

Set the sysctl variables lnet.debug=-1, lnet.subsystem_debug=-1 and lnet.debug_size=100, reproduce the problem, then run "lctl dk > log" (save the log for analysis).

As for the upgrade: it is better to upgrade to 1.8.1.1, because some fixes have not landed in the 1.6 branch, which is close to the end of support.

On Thu, 2009-11-05 at 10:26 +0100, Heiko Schröter wrote:
> Nov 5 10:08:22 dras2 LustreError: 5027:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16.122@tcp, match 3223875 length 1336 too big: 1272 left, 1272 allowed
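The log-collection steps in this reply could be scripted roughly as follows. This is only a sketch under stated assumptions: the sysctl names and values are copied from the message and have not been checked against a particular Lustre release, the helper names (`enable_commands`, `enable_debug`, `dump_debug_log`) are invented for the example, and by default nothing is executed; the commands are only printed.

```python
import subprocess

# sysctl names and values exactly as given in the message (unverified here).
DEBUG_SETTINGS = {
    "lnet.debug": "-1",
    "lnet.subsystem_debug": "-1",
    "lnet.debug_size": "100",
}


def enable_commands():
    """Build one 'sysctl -w name=value' command per debug setting."""
    return [["sysctl", "-w", f"{name}={value}"]
            for name, value in DEBUG_SETTINGS.items()]


def enable_debug(dry_run=True):
    """Print the sysctl commands; run them only when dry_run is False."""
    for argv in enable_commands():
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


def dump_debug_log(log_path, dry_run=True):
    """Equivalent of 'lctl dk > log_path', as suggested in the message."""
    print(f"lctl dk > {log_path}")
    if not dry_run:
        with open(log_path, "wb") as out:
            subprocess.run(["lctl", "dk"], stdout=out, check=True)


# Dry run: show what would be executed on the MDS.
enable_debug()
# ... reproduce the problem here, then dump the debug buffer ...
dump_debug_log("log")
```

Passing dry_run=False would actually run the commands, which needs root on a node with Lustre loaded. The problem has to be reproduced between the two calls so that the dumped buffer covers it.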
Heiko Schröter
2009-Nov-09 13:36 UTC
[Lustre-discuss] lnet_try_match_md(), Matching packet from length too big
Hello Alexey,

thanks for your hints. We found two issues over the weekend:
1) A loose contact on the SATA backplane of the MDS. A nasty one, this was.
2) Sporadic high packet loss lasting a few seconds to minutes.

I think I need to fix these issues first before going on.

Regards,
Heiko
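Because the packet loss described here is sporadic, a single ping run can easily miss it; sampling repeatedly and recording the loss figure from each run is one way to catch it. A minimal sketch of the parsing half, assuming the summary line format printed by common Linux and BSD ping implementations. The function name `parse_ping_summary` and the sample line are made up for illustration and are not output from this system.

```python
import re

# Matches e.g. "10 packets transmitted, 7 received, 30% packet loss, time 9013ms"
# and the BSD form "10 packets transmitted, 7 packets received, 30.0% packet loss".
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received"
    r".*?(\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(text):
    """Return (sent, received, loss_percent) from ping output, or None."""
    m = SUMMARY.search(text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


# Invented sample summary line, for illustration only.
sample = "10 packets transmitted, 7 received, 30% packet loss, time 9013ms"
result = parse_ping_summary(sample)
print(result)
```

Running ping at a fixed interval and logging this tuple with a timestamp would show whether loss bursts line up with the Lustre timeouts.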