Michael D. Seymour
2009-May-22 20:38 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Hi all,

I hope you can help us with some connection problems we are having with our
Lustre file system. The filesystem roc consists of 6 OSSs with one OST per
OSS. Each OSS uses the 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses
CentOS 5.3). The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients
mount the filesystem and all use Lustre 1.6.7. All are connected via a Gb
ethernet switch stack.

One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
different network.

We get the following messages on a particular client:

May 22 15:07:45 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
May 22 15:07:45 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from roc-MDT0000-mdc-000001044e1d4c00 to NID 10.5.203.250@tcp 300s ago has timed out (limit 300s).
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00: Connection to service roc-MDT0000 via nid 10.5.203.250@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00: Connection restored to service roc-MDT0000 using nid 10.5.203.250@tcp.
May 22 15:12:45 trinity kernel: Lustre: Skipped 4 previous similar messages

[root@trinity ~]# cat /proc/fs/lustre/lov/roc-clilov-000001044e1d4c00/uuid
84adb9a1-8959-fcf5-cc72-81c6a1e171b8

On the MDS containing roc-MDT0000:

May 22 15:12:45 rocpile kernel: Lustre: 19236:0:(ldlm_lib.c:538:target_handle_reconnect()) roc-MDT0000: 84adb9a1-8959-fcf5-cc72-81c6a1e171b8 reconnecting
May 22 15:12:45 rocpile kernel: Lustre: 19236:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 4 previous similar messages

Any idea what could be causing this? Bug 11332 looked similar, but it was
closed because other related bugs were fixed.

Thanks,
Mike

--
Michael D. Seymour                 Phone: 416-978-8497
Scientific Computing Support       Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
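[Editor's note: the logs above already show how to tie the two hosts together, since the "match" value in the LNET error is the xid of the request that later times out (match 19154486 vs. Request x19154486). A minimal sketch of correlating the two sides, assuming syslog writes to /var/log/messages on both machines:

  # On the client: find the LNET size-mismatch error and the matching request.
  grep -e 'match 19154486' -e 'x19154486' /var/log/messages

  # On the MDS: look for the same xid in the request lines (x19154486/t0 ...).
  grep 'x19154486' /var/log/messages
]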
Alexey Lyashkov
2009-May-23 17:18 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Hi Michael,

On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
> Hi all,
>
> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
> different network.
>
> We get the following messages on a particular client:
>
> May 22 15:07:45 trinity kernel: LustreError:
> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
> 12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed

How frequently does this happen? If it can be reproduced quickly, please set
lnet.debug=-1, lnet.debug_subsystem=-1 and lnet.debug_mb=100 on the MDS and
the client, reproduce the problem, and save the logs with lctl dk > $logfile.
After that, please file a bug and attach the logs from the MDS and the client.

This message means the client expected less reply data than the MDS sent.

--
Alexey Lyashkov <Alexey.Lyashkov at Sun.COM>
Sun Microsystems
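[Editor's note: a minimal shell sketch of the procedure Alexey describes, assuming the Lustre 1.6.x tunables under /proc/sys/lnet are named debug, subsystem_debug and debug_mb (subsystem_debug is an assumption on my part; Alexey writes it as debug_subsystem) and that /tmp has room for the dump:

  # On both the MDS and the affected client: enable full debug logging.
  sysctl -w lnet.debug=-1            # all debug message types
  sysctl -w lnet.subsystem_debug=-1  # all subsystems
  sysctl -w lnet.debug_mb=100        # grow the in-kernel debug buffer to 100 MB

  # Reproduce the problem, then dump the debug buffer to a file
  # and attach it to the bug report.
  lctl dk > /tmp/lustre-debug-$(hostname).log
]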
Isaac Huang
2009-May-26 18:42 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
On Sat, May 23, 2009 at 09:18:43PM +0400, Alexey Lyashkov wrote:
> Hi Michael,
>
> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
> > Hi all,
> >
> > One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
> > different network.
> >
> > We get the following messages on a particular client:
> >
> > May 22 15:07:45 trinity kernel: LustreError:
> > 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
> > 12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
>
> How frequently does this happen? If it can be reproduced quickly, please set
> lnet.debug=-1, lnet.debug_subsystem=-1 and lnet.debug_mb=100 on the MDS and
> the client, reproduce the problem, and save the logs with lctl dk > $logfile.

Could it have something to do with bug 14379? Though 1.6.7 should have
the fixes already.

Isaac
Alexey Lyashkov
2009-May-27 03:30 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
On Tue, 2009-05-26 at 14:42 -0400, Isaac Huang wrote:
> On Sat, May 23, 2009 at 09:18:43PM +0400, Alexey Lyashkov wrote:
> > Hi Michael,
> >
> > On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
> > > Hi all,
> > >
> > > One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
> > > different network.
> > >
> > > We get the following messages on a particular client:
> > >
> > > May 22 15:07:45 trinity kernel: LustreError:
> > > 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
> > > 12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
> >
> > How frequently does this happen? If it can be reproduced quickly, please set
> > lnet.debug=-1, lnet.debug_subsystem=-1 and lnet.debug_mb=100 on the MDS and
> > the client, reproduce the problem, and save the logs with lctl dk > $logfile.
>
> Could it have something to do with bug 14379? Though 1.6.7 should have
> the fixes already.

In fact we have several bugs with similar symptoms (an ACL on the inode that
is too big, bad handling of an OST addition on the MDS/client, a bad reply
shrink in an error case, etc.), and I am not sure all of them are fixed. So I
need to investigate this quite a bit before I can say anything. But for that
investigation I need more information - what the request is, how it is
processed on the MDS, and so on.

--
Alexey Lyashkov <Alexey.Lyashkov at Sun.COM>
Sun Microsystems
Michael D. Seymour
2009-May-29 19:42 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Hi Alexey,

Alexey Lyashkov wrote:
> Hi Michael,
>
> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
>> Hi all,
>>
>> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
>> different network.
>>
>> We get the following messages on a particular client:
>>
>> May 22 15:07:45 trinity kernel: LustreError:
>> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
>> 12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
>
> How frequently does this happen?

Sets of entries (about 20) happen a few times per day, each entry spaced about
ten minutes apart.

> If it can be reproduced quickly, please set
> lnet.debug=-1, lnet.debug_subsystem=-1 and lnet.debug_mb=100 on the MDS and
> the client, reproduce the problem, and save the logs with lctl dk > $logfile.

Debugging has been enabled. I haven't been able to catch it in the act yet.
Will leaving debug logging enabled until I can catch the bug overflow anything?

> After that, please file a bug and attach the logs from the MDS and the client.

A bug will be filed as soon as it can be caught with logging enabled.

> This message means the client expected less reply data than the MDS sent.

Trinity cannot accept data as large as the MDS is sending?

Thanks for your help,
Mike

--
Michael D. Seymour                 Phone: 416-978-8497
Scientific Computing Support       Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
Michael D. Seymour
2009-May-29 19:51 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Michael D. Seymour wrote:
> Hi all,
>
> I hope you can help us with some connection problems we are having with our
> Lustre file system. The filesystem roc consists of 6 OSSs with one OST per
> OSS. Each OSS uses the 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses
> CentOS 5.3). The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients
> mount the filesystem and all use Lustre 1.6.7. All are connected via a Gb
> ethernet switch stack.
>
> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
> different network.

Also got this earlier today before more verbose debug logging was enabled:

On client trinity:

May 29 10:35:47 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 20177453 length 728 too big: 704 left, 704 allowed
May 29 10:40:47 trinity kernel: LustreError: 11-0: an error occurred while communicating with 10.5.203.250@tcp. The mds_close operation failed with -116
May 29 10:40:47 trinity kernel: LustreError: 26783:0:(file.c:113:ll_close_inode_openhandle()) inode 37609433 mdc close failed: rc = -116
May 29 10:40:47 trinity kernel: LustreError: 26783:0:(file.c:113:ll_close_inode_openhandle()) Skipped 1 previous similar message

On MDS rocpile:

May 29 10:35:47 rocpile kernel: LustreError: 10227:0:(mds_open.c:1561:mds_close()) @@@ no handle for file close ino 37609433: cookie 0xa00c7cf9e763396b req@ffff8101274e3400 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e171b8@NET_0x200000a05cc02_UUID:0/0 lens 296/728 e 0 to 0 dl 1243608047 ref 1 fl Interpret:/0/0 rc 0/0
May 29 10:35:47 rocpile kernel: LustreError: 10227:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-116) req@ffff8101274e3400 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e171b8@NET_0x200000a05cc02_UUID:0/0 lens 296/728 e 0 to 0 dl 1243608047 ref 1 fl Interpret:/0/0 rc -116/0
May 29 10:35:47 rocpile kernel: LustreError: 10227:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1 previous similar message
May 29 10:40:47 rocpile kernel: LustreError: 3611:0:(mds_open.c:1561:mds_close()) @@@ no handle for file close ino 37609433: cookie 0xa00c7cf9e763396b req@ffff81011f0cda00 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e171b8@NET_0x200000a05cc02_UUID:0/0 lens 296/728 e 0 to 0 dl 1243608347 ref 1 fl Interpret:/2/0 rc 0/0
May 29 10:40:47 rocpile kernel: LustreError: 3611:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-116) req@ffff81011f0cda00 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e171b8@NET_0x200000a05cc02_UUID:0/0 lens 296/728 e 0 to 0 dl 1243608347 ref 1 fl Interpret:/2/0 rc -116/0

I've already extended /proc/sys/lustre/timeout to 300s.

Thanks again,
Mike

--
Michael D. Seymour                 Phone: 416-978-8497
Scientific Computing Support       Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
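[Editor's note: for reference, a minimal sketch of checking and raising that timeout, using the /proc path quoted above; the setting does not persist across a reboot unless it is reapplied:

  # Show the current Lustre RPC timeout (seconds) on this node.
  cat /proc/sys/lustre/timeout

  # Raise it to 300 seconds; reapply from an init script (e.g. rc.local)
  # if it needs to survive a restart.
  echo 300 > /proc/sys/lustre/timeout
]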
Alexey Lyashkov
2009-May-30 05:16 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Hi Michael,

> > On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
> >> Hi all,
> >>
> >> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
> >> different network.
> >>
> >> We get the following messages on a particular client:
> >>
> >> May 22 15:07:45 trinity kernel: LustreError:
> >> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
> >> 12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
> >
> > How frequently does this happen?
>
> Sets of entries (about 20) happen a few times per day, each entry spaced about
> ten minutes apart.

Can you please show the syslog messages from around this time? There should be
lines with errors related to 'match XXXXX' (in this example match 19154486, so
there should be something about request x19154486).

> > If it can be reproduced quickly, please set
> > lnet.debug=-1, lnet.debug_subsystem=-1 and lnet.debug_mb=100 on the MDS and
> > the client, reproduce the problem, and save the logs with lctl dk > $logfile.
>
> Debugging has been enabled. I haven't been able to catch it in the act yet.
> Will leaving debug logging enabled until I can catch the bug overflow anything?

You can use 'lctl debug_daemon start $file', but in that case the size is
limited to 512M :\ If it is possible for you, I can make a small patch which
dumps the Lustre log when this error is hit.

> > After that, please file a bug and attach the logs from the MDS and the client.
>
> A bug will be filed as soon as it can be caught with logging enabled.
>
> > This message means the client expected less reply data than the MDS sent.
>
> Trinity cannot accept data as large as the MDS is sending?

Yes. This should cause a timeout while waiting for the reply to the request,
and a reconnect later.

--
Alexey Lyashkov <Alexey.Lyashkov at Sun.COM>
Sun Microsystems
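[Editor's note: a rough sketch of the debug_daemon approach Alexey mentions, assuming there is room under /var/tmp for the dump; the 512 MB cap is the limit he cites, and lctl debug_file converts the binary dump to text:

  # Stream the Lustre debug buffer to a file (binary format), capped at 512 MB.
  lctl debug_daemon start /var/tmp/lustre-debug.bin 512

  # ... wait for the lnet_try_match_md() error to show up in syslog ...

  # Stop the daemon and convert the binary dump to readable text before
  # attaching it to the bug report.
  lctl debug_daemon stop
  lctl debug_file /var/tmp/lustre-debug.bin /var/tmp/lustre-debug.txt
]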
Michael D. Seymour
2009-Jun-08 19:55 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Alexey Lyashkov wrote:
> Hi Michael,
>
>>> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
>>>> Hi all,
>>>>
>>>> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
>>>> different network.
>>>>
>>>> We get the following messages on a particular client:
>>>>
>>>> May 22 15:07:45 trinity kernel: LustreError:
>>>> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
>>>> 12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
>>> How frequently does this happen?
>> Sets of entries (about 20) happen a few times per day, each entry spaced about
>> ten minutes apart.
> Can you please show the syslog messages from around this time? There should be
> lines with errors related to 'match XXXXX' (in this example match 19154486, so
> there should be something about request x19154486).

I've upgraded the MDS to 1.6.7.1. So far no issues. I will probably upgrade to
1.8 very soon. Will write back if there are still problems.

Mike

--
Michael D. Seymour                 Phone: 416-978-8497
Scientific Computing Support       Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
Tim Burgess
2009-Jun-22 11:19 UTC
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Hi all,

We are seeing this also, with clients and servers running
2.6.18-92.1.26.el5_lustre.1.6.7.2smp, tcp over gig-e only, after an upgrade
from 1.6.5.1 over the weekend. (It appears that older client versions are
working fine, but I've had a couple of the new ones without trouble too, so I
don't really have enough stats to be sure that it's a version thing.)

If there's any chance it's related, we hit this bug on the MDS (also after an
fsck) just before the upgrade:

https://bugzilla.lustre.org/show_bug.cgi?id=19091

It was preventing the MDS/MGS from starting after the fsck (but before the
upgrade), but since bugzilla mentioned there was a related fix in 1.6.7.1 we
proceeded with the upgrade and the MDS started fine after that... There are
still some odd messages in the MDS log though - see the bottom log segment
below.

Any ideas out there?

Thanks,
Tim

-----

On the client: (hand transcribed, please forgive any typos)

LustreError: 647:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-172.16.0.251@tcp, match 115 length 1168 too big: 992 left, 992 allowed
Lustre: Request x115 sent from p1-MDT0000-mdc-ffff81012a031000 to NID 172.16.0.251@tcp 100s has timed out (limit 100s)
Lustre: p1-MDT0000-mdc-ffff81012a031000: Connection to service prod_mds_001 via nid 172.16.0.251@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: p1-MDT0000-mdc-ffff81012a031000: connection restored to service prod_mds_001 using nid 172.16.0.251@tcp

and then repeat...

On the servers:

Jun 22 19:01:29 mds001 kernel: LustreError: 3389:0:(service.c:611:ptlrpc_check_req()) @@@ DROPPING req from old connection 309 < 310 req@ffff81010965dc00 x77181/t0 o400->12dffd61-75ec-a926-c333-3c3d8acf9201@NET_0x20000ac100453_UUID:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Jun 22 19:01:29 mds001 kernel: LustreError: 3389:0:(service.c:611:ptlrpc_check_req()) Skipped 3 previous similar messages
Jun 22 19:02:06 mds001 kernel: Lustre: 3359:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: 23127a45-3e3a-5b92-dba5-c7444d593e7f reconnecting
Jun 22 19:02:06 mds001 kernel: Lustre: 3359:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 77 previous similar messages
Jun 22 19:02:25 oss019 kernel: Lustre: 3417:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0012: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss020 kernel: Lustre: 3370:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0013: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss014 kernel: Lustre: 3263:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST000d: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss025 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0018: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss024 kernel: Lustre: 3901:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0017: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss029 kernel: Lustre: 3879:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001c: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss028 kernel: Lustre: 3909:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001b: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss010 kernel: Lustre: 3462:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0009: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss021 kernel: Lustre: 3933:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0014: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss022 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0015: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss023 kernel: Lustre: 3928:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0016: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss030 kernel: Lustre: 3854:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001d: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss027 kernel: Lustre: 3907:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001a: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss026 kernel: Lustre: 3914:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0019: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss018 kernel: Lustre: 3379:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0011: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss016 kernel: Lustre: 3268:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST000f: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss017 kernel: Lustre: 3402:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0010: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss010 kernel: Lustre: 3462:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss016 kernel: Lustre: 3268:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 19:02:25 oss018 kernel: Lustre: 3379:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss017 kernel: Lustre: 3402:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0010: dcb418b0-12c5-61d2-ab8c-f9f3ced8130a reconnecting
Jun 22 19:02:25 oss030 kernel: Lustre: 3854:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss025 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 19:02:25 oss023 kernel: Lustre: 3928:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 19:02:25 oss022 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss021 kernel: Lustre: 3933:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss026 kernel: Lustre: 3914:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss027 kernel: Lustre: 3907:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:03:06 oss022 kernel: Lustre: p1-OST0015: haven't heard from client 6aaa9429-5a2c-9c20-1fe8-e42c3d108882 (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:06 oss011 kernel: Lustre: p1-OST000a: haven't heard from client 6aaa9429-5a2c-9c20-1fe8-e42c3d108882 (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:06 oss011 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:06 oss022 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss003 kernel: Lustre: p1-OST0002: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 mds001 kernel: Lustre: MGS: haven't heard from client d14860df-7906-9a56-5c84-79b25b9cc99e (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss007 kernel: Lustre: p1-OST0006: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 mds001 kernel: Lustre: Skipped 2 previous similar messages
Jun 22 19:03:07 oss006 kernel: Lustre: p1-OST0005: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss005 kernel: Lustre: p1-OST0004: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss021 kernel: Lustre: p1-OST0014: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss026 kernel: Lustre: p1-OST0019: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss024 kernel: Lustre: p1-OST0017: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss013 kernel: Lustre: p1-OST000c: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss029 kernel: Lustre: p1-OST001c: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss027 kernel: Lustre: p1-OST001a: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss030 kernel: Lustre: p1-OST001d: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss009 kernel: Lustre: p1-OST0008: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss012 kernel: Lustre: p1-OST000b: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss019 kernel: Lustre: p1-OST0012: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss020 kernel: Lustre: p1-OST0013: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss018 kernel: Lustre: p1-OST0011: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss028 kernel: Lustre: p1-OST001b: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss002 kernel: Lustre: p1-OST0001: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss025 kernel: Lustre: p1-OST0018: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss017 kernel: Lustre: p1-OST0010: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss001 kernel: Lustre: p1-OST0000: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss014 kernel: Lustre: p1-OST000d: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss004 kernel: Lustre: p1-OST0003: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss023 kernel: Lustre: p1-OST0016: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss016 kernel: Lustre: p1-OST000f: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss010 kernel: Lustre: p1-OST0009: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss008 kernel: Lustre: p1-OST0007: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss015 kernel: Lustre: p1-OST000e: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16.5.3@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss018 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss020 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss012 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss019 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss014 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss017 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss007 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss003 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss001 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss009 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss016 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss004 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss015 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss010 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss030 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss029 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss027 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss028 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss021 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss025 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss023 kernel: Lustre: Skipped 1 previous similar message

Possibly still related to the earlier problem, we have this sort of thing
appearing in the server logs too:

Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(llog_obd.c:226:llog_add()) Skipped 351 previous similar messages
Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(lov_log.c:118:lov_llog_origin_add()) Can't add llog (rc = -19) for stripe 0
Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(lov_log.c:118:lov_llog_origin_add()) Skipped 351 previous similar messages
Jun 21 11:48:04 mds001 kernel: Lustre: 4130:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:48:51 mds001 kernel: LustreError: 3624:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x45e0e2f sub-object on OST idx 15/1: rc = -110
Jun 21 11:49:44 mds001 kernel: Lustre: 4132:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:50:21 mds001 kernel: LustreError: 4151:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 11:50:21 mds001 kernel: LustreError: 4151:0:(lov_log.c:118:lov_llog_origin_add()) Can't add llog (rc = -19) for stripe 0
Jun 21 11:50:54 mds001 kernel: LustreError: 3631:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x51c0136 sub-object on OST idx 15/1: rc = -110
Jun 21 11:51:24 mds001 kernel: Lustre: 3644:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:51:50 mds001 kernel: LustreError: 4075:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 11:51:50 mds001 kernel: LustreError: 4075:0:(lov_log.c:118:lov_llog_origin_add()) Can't add llog (rc = -19) for stripe 0
Jun 21 11:53:05 mds001 kernel: Lustre: 4077:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:54:10 mds001 kernel: LustreError: 4128:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x45f118f sub-object on OST idx 15/1: rc = -110
Jun 21 11:54:10 mds001 kernel: LustreError: 4128:0:(lov_request.c:692:lov_update_create_set()) Skipped 1 previous similar message
Jun 21 11:54:45 mds001 kernel: Lustre: 4039:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:56:25 mds001 kernel: Lustre: 4147:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:58:05 mds001 kernel: Lustre: 4097:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:59:46 mds001 kernel: Lustre: 4075:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 12:03:06 mds001 kernel: Lustre: 4158:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 12:03:06 mds001 kernel: Lustre: 4158:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 21 12:05:40 mds001 kernel: LustreError: 4057:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x45e0ff4 sub-object on OST idx 15/1: rc = -110
Jun 21 12:05:40 mds001 kernel: LustreError: 4057:0:(lov_request.c:692:lov_update_create_set()) Skipped 1 previous similar message
Jun 21 12:07:20 mds001 kernel: LustreError: 4071:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 12:07:20 mds001 kernel: LustreError: 4071:0:(llog_obd.c:226:llog_add()) Skipped 8 previous similar messages

Cheers,
Tim

On Tue, Jun 9, 2009 at 3:55 AM, Michael D. Seymour <seymour at cita.utoronto.ca> wrote:
> I've upgraded the MDS to 1.6.7.1. So far no issues. I will probably upgrade to
> 1.8 very soon. Will write back if there are still problems.
>
> Mike