Frederik Ferner
2010-Nov-23 12:01 UTC
[Lustre-discuss] Bad lmm_size during open replay for inode
Hi List,

during a planned MDT fail over today, we got a number of these messages below, can anyone explain what this could be?

> Nov 23 08:33:26 cs04r-sc-mds01-01 kernel: Lustre: 21054:0:(mds_open.c:367:mds_create_objects()) Bad lmm_size during open replay for inode 111003141
> Nov 23 08:33:26 cs04r-sc-mds01-01 kernel: Lustre: 21043:0:(mds_open.c:367:mds_create_objects()) Bad lmm_size during open replay for inode 111003144
> Nov 23 08:33:27 cs04r-sc-mds01-01 kernel: Lustre: 21059:0:(mds_open.c:367:mds_create_objects()) Bad lmm_size during open replay for inode 110642714
> Nov 23 08:33:27 cs04r-sc-mds01-01 kernel: Lustre: 21059:0:(mds_open.c:367:mds_create_objects()) Skipped 7 previous similar messages

Searching for this message produced only Lustre source code.

This is using Lustre 1.8.3-ddn3.3 on all servers and most clients, some clients use 1.8.4.

So far we've not noticed any ill effect but would like to know what that message is and if we can safely ignore it.

Kind regards,
Frederik

--
Frederik Ferner
Computer Systems Administrator   phone: +44 1235 77 8624
Diamond Light Source Ltd.        mob:   +44 7917 08 5110

(Apologies in advance for the lines below. Some bits are a legal requirement and I have no control over them.)
Andreas Dilger
2010-Nov-23 17:12 UTC
[Lustre-discuss] Bad lmm_size during open replay for inode
On 2010-11-23, at 05:01, Frederik Ferner wrote:
> during a planned MDT fail over today, we got a number of these messages
> below, can anyone explain what this could be?
>
>> Nov 23 08:33:26 cs04r-sc-mds01-01 kernel: Lustre: 21054:0:(mds_open.c:367:mds_create_objects()) Bad lmm_size during open replay for inode 111003141

This means that the client (trying to recreate a file that was not saved to disk during the MDS failover) sent the layout information, but the size it reported for the layout information did not match the size that the MDS thought it should be for that kind of layout.

Unfortunately, the error message doesn't report what those sizes are, so it is hard to know what the impact might be. The message is only a warning, and it is not necessarily a problem if the client-specified size is larger than the size expected, but it might be a problem if the client-specified size is smaller than expected (which I think is the less likely case).

> This is using Lustre 1.8.3-ddn3.3 on all servers and most clients, some
> clients use 1.8.4.

I can't comment on what changes are in the DDN release, so I don't know if this is specific to that release or not. In any case, I've never seen these messages before.

> So far we've not noticed any ill effect but would like to know what that
> message is and if we can safely ignore it.

It would only affect the listed inodes, if at all.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
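As a rough illustration of the size comparison described in this reply, the sketch below computes the size an MDS might expect for a plain striped layout and classifies a client-reported size against it. The 32-byte header and 24-byte per-stripe entry are assumptions based on the `lov_mds_md_v1` and `lov_ost_data_v1` structures in Lustre 1.8; the function names are hypothetical, and this is not the code in `mds_open.c`.

```python
# Sketch only: both sizes are assumptions taken from the Lustre 1.8
# lov_mds_md_v1 / lov_ost_data_v1 layouts, not read from a live system.
LOV_V1_HEADER_BYTES = 32   # assumed: magic, pattern, object id/group, stripe size/count
LOV_V1_STRIPE_BYTES = 24   # assumed: object id, object group, OST generation, OST index


def expected_lmm_size(stripe_count):
    """Return the layout size expected for a file with stripe_count stripes."""
    if stripe_count < 0:
        raise ValueError("stripe_count must be non-negative")
    return LOV_V1_HEADER_BYTES + stripe_count * LOV_V1_STRIPE_BYTES


def classify_lmm_size(client_size, stripe_count):
    """Compare a client-reported size with the expected one.

    Returns "match", "larger" (not necessarily a problem, per the reply
    above) or "smaller" (the case that might be a problem).
    """
    expected = expected_lmm_size(stripe_count)
    if client_size == expected:
        return "match"
    if client_size > expected:
        return "larger"
    return "smaller"
```

Under these assumptions a file with a stripe count of 1 would have a 56-byte layout, so a client that replayed a buffer sized for more stripes would land in the "larger" case.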
Alexey Lyashkov
2010-Nov-24 09:23 UTC
[Lustre-discuss] Bad lmm_size during open replay for inode
On Nov 23, 2010, at 20:12, Andreas Dilger wrote:
> On 2010-11-23, at 05:01, Frederik Ferner wrote:
>> during a planned MDT fail over today, we got a number of these messages
>> below, can anyone explain what this could be?
>>
>>> Nov 23 08:33:26 cs04r-sc-mds01-01 kernel: Lustre: 21054:0:(mds_open.c:367:mds_create_objects()) Bad lmm_size during open replay for inode 111003141
>
> This means that the client (trying to recreate a file that was not saved to disk during the MDS failover) sent the layout information, but the size it reported for the layout information did not match the size that the MDS thought it should be for that kind of layout.

If you don't have PPC clients, that says the MDS forgot to shrink the LOV EA buffer before sending it to the client, or someone broke the code that shrinks the replay buffer on the client side. (The client trusts the LOV EA size from the MDS reply.)

--------------------------------------
Alexey Lyashkov
alexey.lyashkov at clusterstor.com
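One possible reading of the PPC remark is byte order: PPC clients are big-endian while x86 servers are little-endian, so a 32-bit size or stripe-count field that is not byte-swapped would be misread. The sketch below only demonstrates that arithmetic with Python's `struct` module. It is not Lustre code, the function name is hypothetical, and whether this is what was meant in the reply above is an assumption.

```python
import struct


def misread_u32(value):
    """Pack value as a little-endian u32, then read it back as big-endian.

    Illustrates what a field looks like when the two ends of a
    connection disagree on byte order and no swabbing is done.
    """
    return struct.unpack(">I", struct.pack("<I", value))[0]


# A stripe count of 1 written little-endian and read big-endian.
swapped_stripe_count = misread_u32(1)
```

A value misread this way is far outside any plausible range, which is one way a size derived from it could fail to match what the other side expects.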
Frederik Ferner
2010-Nov-24 16:19 UTC
[Lustre-discuss] Bad lmm_size during open replay for inode
Alexey Lyashkov wrote:
> On Nov 23, 2010, at 20:12, Andreas Dilger wrote:
>
>> On 2010-11-23, at 05:01, Frederik Ferner wrote:
>>> during a planned MDT fail over today, we got a number of these
>>> messages below, can anyone explain what this could be?
>>>
>>>> Nov 23 08:33:26 cs04r-sc-mds01-01 kernel: Lustre:
>>>> 21054:0:(mds_open.c:367:mds_create_objects()) Bad lmm_size
>>>> during open replay for inode 111003141
>>
>> This means that the client (trying to recreate a file that was not
>> saved to disk during the MDS failover) sent the layout information,
>> but the size it reported for the layout information did not match
>> the size that the MDS thought it should be for that kind of layout.
>
> If you don't have PPC clients, that says the MDS forgot to shrink the
> LOV EA buffer before sending it to the client, or someone broke the code
> that shrinks the replay buffer on the client side. (The client trusts
> the LOV EA size from the MDS reply.)

No PPC clients here. Other than that, I'm not sure I understand that paragraph. Do you mean that PPC clients misinterpret the data sent from the MDS during replay, and that these warnings could happen if the replay buffer on the client side somehow shrinks?

Cheers,
Frederik
Frederik Ferner
2010-Nov-24 16:27 UTC
[Lustre-discuss] Bad lmm_size during open replay for inode
Andreas Dilger wrote:
> On 2010-11-23, at 05:01, Frederik Ferner wrote:
>> during a planned MDT fail over today, we got a number of these
>> messages below, can anyone explain what this could be?
>>
>>> Nov 23 08:33:26 cs04r-sc-mds01-01 kernel: Lustre:
>>> 21054:0:(mds_open.c:367:mds_create_objects()) Bad lmm_size during
>>> open replay for inode 111003141
>
> This means that the client (trying to recreate a file that was not
> saved to disk during the MDS failover) sent the layout information,
> but the size it reported for the layout information did not match the
> size that the MDS thought it should be for that kind of layout.
>
> Unfortunately, the error message doesn't report what those sizes are,
> so it is hard to know what the impact might be. The message is only
> a warning, and it is not necessarily a problem if the
> client-specified size is larger than the size expected, but it might
> be a problem if the client-specified size is smaller than expected
> (which I think is the less likely case).

Thanks for this. I don't think I'll worry too much about it now, as the clients were all fairly quiet at the time of the failover, so I don't think many important files were being written then.

We tried to suspend all cluster jobs about 10 minutes before the failover, and at least some of the files/inodes now seem to belong to cluster jobs. So I'm not sure whether the inodes are still the same files, or what was going on then.

Does this relate to the stripe layout? Most files should have a stripe count of 1; would this make a difference?

>> This is using Lustre 1.8.3-ddn3.3 on all servers and most clients,
>> some clients use 1.8.4.
>
> I can't comment on what changes are in the DDN release, so I don't
> know if this is specific to that release or not. In any case, I've
> never seen these messages before.

I'll test this later on our test file system, but no promises that I'll be able to reproduce similar conditions.

>> So far we've not noticed any ill effect but would like to know what
>> that message is and if we can safely ignore it.
>
> It would only affect the listed inodes, if at all.

Unfortunately I don't have the full list of inodes, as syslog has skipped some 'similar messages', but as mentioned above, I'm not that worried at the moment.

Thanks,
Frederik
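Since syslog collapsed some of the warnings, a small script can at least collect the inode numbers that were logged and total the skipped-message counts. The sketch below assumes the line format shown in the first message of this thread; the function name is hypothetical, and the regular expressions may need adjusting for other syslog configurations.

```python
import re

# Patterns assumed from the log lines quoted at the top of this thread.
INODE_RE = re.compile(r"Bad lmm_size during open replay for inode (\d+)")
SKIPPED_RE = re.compile(r"mds_create_objects\(\)\) Skipped (\d+) previous similar messages?")


def summarize_lmm_warnings(lines):
    """Return (unique inode numbers in first-seen order, total skipped count)."""
    inodes = []
    skipped = 0
    for line in lines:
        match = INODE_RE.search(line)
        if match:
            inode = int(match.group(1))
            if inode not in inodes:
                inodes.append(inode)
            continue
        match = SKIPPED_RE.search(line)
        if match:
            skipped += int(match.group(1))
    return inodes, skipped


if __name__ == "__main__":
    import sys

    # Usage: python summarize_lmm.py < /var/log/messages
    found, hidden = summarize_lmm_warnings(sys.stdin)
    print("inodes logged:", found)
    print("messages skipped by syslog:", hidden)
```

The skipped total only says how many warnings were suppressed, not which inodes they referred to, so the list is a lower bound on the affected files.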