th@llnl.gov
2006-Dec-15 17:54 UTC
[Lustre-devel] [Bug 11332] lnet_try_match_md(): match X length Y too big: Z left, Z allowed
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11332

th@llnl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|S3 (normal)                 |S2 (major)
           Priority|P3                          |P2

Marking as Sev2. This bug is hitting many Peloton client nodes. Today 8 client nodes hit this on Zeus. We think we have a node that repeatedly hits this _even after a reboot_ (we don't understand that), so if we can get a debug patch that shows more of the state around such a bad message, we might be able to glean more insight.
th@llnl.gov
2006-Dec-21 21:54 UTC
[Lustre-devel] [Bug 11332] lnet_try_match_md(): match X length Y too big: Z left, Z allowed
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11332

Some snippets of updated info from Andreas & Brian, who continue to work on this problem:

(11:53:46) behlendo: adilger: In other news, and I'll put this all in the bug once I get to the bottom of it. But our 11040 appears to be server side now, our 1.4.7.2-pre-6llnl clients talking to 1.4.6.95-17.2llnl servers hit the bug when accessing a symlink from Lustre to NFS. But matched 1.4.7.2-pre-6llnl clients and servers work fine. The older server returns a larger reply to the newer client, the newer server returns the expected reply size. So I suppose I need to look at the server now and see what changed here, and why they're not 100% interoperable.
(12:05:23) adilger: behlendo: is it that the symlink is too long on the server, or is there a bug because the symlink is to NFS?
(12:34:57) behlendo: adilger: As for our symlink issue, no the symlink is very dull. I looked at the inode on the MDS and it's a fast link for a very short path. Nothing special about the inode, no EAs, etc.
(12:37:10) behlendo: adilger: But on a single client mounting lscratcha (1.4.6.95 servers) and lscratchb (1.4.7.2 servers), with a symlink from each FS to the same NFS file, it caused the issue every time for the lscratcha symlink and never for the lscratchb symlink. Now that we know what to look for we're also seeing it on non-Peloton-style systems. In fact I'm about to go reproduce it in the testbed so I can get clean MDS logs of the failure.
(14:28:30) behlendo: adilger: Got a sec? I've got an ugly ugly hack as a workaround for our 11332 issue, the symlink thing between mismatched Lustre versions. But I want to run it past you just in case I've made some oversight. Here are the ground rules since we're going into a holiday:
  1) We don't want to be putting new code on a server, so the change needs to be client side.
  2) It should be as minimal as possible to avoid introducing new issues while we're all away.
  3) It doesn't have to be pretty, we can get a -correct- fix in after the new year.
So with that in mind I basically adjusted the getattr case in mdc_enqueue() to add in an arbitrary extra 512 bytes to the repsize[3] to increase the allocated replen. This seems to work... and is ugly as sin... but since we can't tweak the server it seemed reasonable. Can you think of any bad side effect this might cause?
(14:47:49) adilger: behlendo: it likely isn't harmful, but yes it's ugly
(14:48:28) adilger: do you know WHY the MDS is trying to reply with a larger buffer? or conversely why the client thinks it only needs a smaller one?
(14:56:37) behlendo: adilger: Nope, not yet. Since my time is short before the holidays I was focusing on a quick hack, a little sanity testing, then put it out where folks are suffering. Once I've got that handed off to the admins I'm planning to look into it in the testbed. This afternoon I hope.
(14:57:32) behlendo: So, the 1.4.7.2 servers respond with the correctly sized reply, so I'm going to try and see why the older server code thinks it should be bigger.
(14:58:06) behlendo: Presumably the 1.4.6.95 clients expect the larger reply too, but I don't know that for sure.
(15:06:49) adilger: I'd imagine yes, but nothing pops out at me why this would have changed
(15:22:59) behlendo: Me either... well I've just built a version with the hack for zeus so I'll turn my attention to the real reason WHY this changed. Thanks for the quick sanity thoughts on that ugly ugly client mod.
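
For readers following along, the workaround discussed above can be sketched roughly as follows. This is a simplified standalone illustration, not the actual patch and not Lustre code: the helper names (total_reply_len, reply_fits, pad_getattr_reply), the buffer count, and the sample sizes are all hypothetical. Only the idea of adding 512 bytes to repsize[3] comes from the chat. The real replen calculation also accounts for message headers and alignment, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values for illustration; not taken from the Lustre source. */
#define REPLY_BUF_COUNT   4    /* number of reply buffers in this sketch */
#define GETATTR_REPLY_PAD 512  /* the "arbitrary extra 512 bytes" from the chat */

/* Total bytes the client posts for the reply: here simply the sum of the
 * per-buffer sizes (headers and rounding are deliberately left out). */
static size_t total_reply_len(const size_t *repsize, size_t count)
{
    size_t total = 0;

    for (size_t i = 0; i < count; i++)
        total += repsize[i];
    return total;
}

/* Returns 1 if an incoming reply fits in the posted space, 0 if it would be
 * rejected as "too big" (the failure mode named in the bug title). */
static int reply_fits(size_t incoming_len, size_t posted_len)
{
    return incoming_len <= posted_len;
}

/* Client-side workaround: enlarge buffer index 3 so that a larger reply
 * from an older server still fits. */
static void pad_getattr_reply(size_t *repsize, size_t count)
{
    assert(count > 3);
    repsize[3] += GETATTR_REPLY_PAD;
}
```

Under these assumptions, padding only helps when the older server's reply exceeds the client's expected size by no more than 512 bytes, which is consistent with the "ugly but likely harmless" assessment in the chat and with the plan to find the real cause afterwards.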