We are getting lots of these (always for the same resource) on one of our OSSs.

LustreError: 22308:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 22225:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 22277:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
LustreError: 22274:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 3 Time(s)
LustreError: 22204:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 22193:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
LustreError: 22253:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 22200:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
LustreError: 22264:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)

We've tried to track down the "object" with "lfs find" but no joy so far. I'm not even sure that is the right approach. We found a bug pertaining to this in the Lustre bugzilla, but it looks like it was resolved, so I'm not sure that's the issue either. Has anyone else run into this before? Is there something we can do to stop it?

We are running 1.6.4.2 on CentOS 4.5 with an updated kernel on the OSSs (Linux hpcio7.ufhpc 2.6.18-8.1.14.el5.L-1642 #1 SMP Mon Feb 18 13:24:27 EST 2008 x86_64 x86_64 x86_64 GNU/Linux). This file system has been in production for about six months, and this is the first time we've seen this.

Charlie Taylor
UF HPC Center
On Jul 17, 2008 08:05 -0400, Charles Taylor wrote:
> We are getting lots of these (always for the same resource) on one of
> our OSSs.
>
> LustreError: 22308:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
> LustreError: 22225:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
> LustreError: 22277:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
> LustreError: 22274:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 3 Time(s)
> LustreError: 22204:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
> LustreError: 22193:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
> LustreError: 22253:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
> LustreError: 22200:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
> LustreError: 22264:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
>
> We've tried to track down the "object" with "lfs find" but no joy so
> far. I'm not even sure that is the right approach. We found a bug
> pertaining to this in the lustre bugzilla but it looks like it was
> resolved so I'm not sure that's the issue either. Any one else run
> into this before? Is there something we can do to stop it?

This is an indication that some object is missing on the OST that one or more clients are trying to access.
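For reference, the "rc -2" in these messages is a negated Linux errno value, and errno 2 is ENOENT ("No such file or directory"), which fits the missing-object reading. A minimal Python sketch for decoding the code and totalling the messages per resource might look like the following. The function names are made up for illustration, and the "N Time(s)" suffix is assumed to be a repeat count added by whatever log summary produced the original post.

```python
import errno
import os
import re
from collections import Counter

# Matches the console message quoted in this thread. The trailing
# ": N Time(s)" part is optional and is ASSUMED to be a repeat count
# appended by a log-summary tool, not by Lustre itself.
LVBO_RE = re.compile(
    r"lvbo_init failed for resource (?P<res>\d+): rc (?P<rc>-?\d+)"
    r"(?:: (?P<count>\d+) Time\(s\))?"
)


def describe_rc(rc):
    """Render a negative return code as 'NAME (text)', e.g. for rc = -2."""
    code = abs(rc)
    name = errno.errorcode.get(code, "unknown")
    return f"{name} ({os.strerror(code)})"


def tally(lines):
    """Sum failures per (resource, rc), honouring the optional repeat count."""
    totals = Counter()
    for line in lines:
        m = LVBO_RE.search(line)
        if m:
            repeats = int(m.group("count") or 1)
            totals[(int(m.group("res")), int(m.group("rc")))] += repeats
    return totals


if __name__ == "__main__":
    sample = [
        "LustreError: 22308:0:(ldlm_resource.c:719:ldlm_resource_add()) "
        "lvbo_init failed for resource 5820180: rc -2: 1 Time(s)",
        "LustreError: 22277:0:(ldlm_resource.c:719:ldlm_resource_add()) "
        "lvbo_init failed for resource 5820180: rc -2: 2 Time(s)",
    ]
    for (resource, rc), count in tally(sample).items():
        print(resource, rc, describe_rc(rc), count)
```

Feeding it the full syslog rather than a pasted sample would show quickly whether any resource other than 5820180 is affected.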
You can look at the Lustre debug logs with "rpctrace" enabled to extract the "Handling RPC" and "Handled RPC" messages on the thread printing this message, e.g.:

00000100:00100000:0:1216234365.071325:1536:32091:0:(service.c:1064:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn_00:01318c63-cfd4-9199-8142-4e41ea812bd3+7:32099:x9:12345-0@lo:101
00000100:00100000:0:1216234366.071325:1536:32091:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2
00000100:00100000:0:1216234367.071325:1536:32091:0:(service.c:1064:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn_00:01318c63-cfd4-9199-8142-4e41ea812bd3+7:32099:x9:12345-0@lo:101

Here the thread is PID 32091, and the RPC came from the client "0@lo" (in this made-up example, a local client). Then you can check on the client (also with "vfstrace" and "rpctrace" debugging on) what it was trying to do on the thread that requested this RPC (PID 32091, XID "9" in this example).

To quiet it, the easiest mechanism is probably to just delete the affected file. If it is a small file (< 1MB) and the data is still valid, you could copy it to another file and rename it over the old one.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
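The log-reading step described in the message above can be sketched in Python. This is only an illustration under stated assumptions: the header layout is read off the example lines (the sixth colon-separated field is taken to be the thread PID, which is where 32091 sits), the NID is written with its usual "@" separator, and find_failed_rpcs is a made-up helper name. The fields inside the RPC message are reported as-is, following the message's own "pname:cluuid+ref:pid:xid:nid:opc" label.

```python
import re

# ASSUMED header layout, inferred from the example lines only:
# five leading fields, then the thread PID, one more field, then
# "(file:line:function())" and the message text.
HEADER_RE = re.compile(
    r"^[0-9a-fA-F]+:[0-9a-fA-F]+:\d+:(?P<time>\d+\.\d+):\d+:"
    r"(?P<pid>\d+):\d+:\((?P<where>\S+\(\))\)\s+(?P<msg>.*)$"
)

# Field order follows the label printed in the message itself.
RPC_RE = re.compile(
    r"(?P<state>Handling|Handled) RPC pname:cluuid\+ref:pid:xid:nid:opc\s+"
    r"(?P<pname>[^:\s]+):(?P<cluuid>[^+\s]+)\+(?P<ref>\d+):"
    r"(?P<pid>\d+):x(?P<xid>\d+):(?P<nid>\S+):(?P<opc>\d+)"
)

SAMPLE = [
    "00000100:00100000:0:1216234365.071325:1536:32091:0:"
    "(service.c:1064:ptlrpc_server_handle_request()) Handled RPC "
    "pname:cluuid+ref:pid:xid:nid:opc "
    "ldlm_cn_00:01318c63-cfd4-9199-8142-4e41ea812bd3+7:32099:x9:12345-0@lo:101",
    "00000100:00100000:0:1216234366.071325:1536:32091:0:"
    "(ldlm_resource.c:719:ldlm_resource_add()) "
    "lvbo_init failed for resource 5820180: rc -2",
    "00000100:00100000:0:1216234367.071325:1536:32091:0:"
    "(service.c:1064:ptlrpc_server_handle_request()) Handled RPC "
    "pname:cluuid+ref:pid:xid:nid:opc "
    "ldlm_cn_00:01318c63-cfd4-9199-8142-4e41ea812bd3+7:32099:x9:12345-0@lo:101",
]


def find_failed_rpcs(lines, needle="lvbo_init failed"):
    """Pair each failure line with the RPC messages around it on the same thread."""
    last_rpc = {}   # thread PID -> most recent RPC message on that thread
    pending = {}    # thread PID -> failures still waiting for the next RPC message
    results = []
    for line in lines:
        header = HEADER_RE.match(line.strip())
        if not header:
            continue
        pid = int(header.group("pid"))
        msg = header.group("msg")
        rpc = RPC_RE.search(msg)
        if rpc:
            info = rpc.groupdict()
            last_rpc[pid] = info
            for record in pending.pop(pid, []):
                record["after"] = info
        elif needle in msg:
            record = {
                "thread_pid": pid,
                "time": header.group("time"),
                "message": msg,
                "before": last_rpc.get(pid),
                "after": None,
            }
            results.append(record)
            pending.setdefault(pid, []).append(record)
    return results


if __name__ == "__main__":
    for hit in find_failed_rpcs(SAMPLE):
        rpc = hit["before"] or hit["after"]
        print(hit["thread_pid"], hit["message"])
        if rpc:
            print("  nid", rpc["nid"], "xid", rpc["xid"], "pid", rpc["pid"])
```

The NID and XID it prints are the values to search for in the client-side debug log.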
Awesome. We'll try it and let you know. Of course, if it works, we'll have a naming dilemma - "The Dilger Procedure" is already taken. :)

Thanks,

Charlie Taylor
UF HPC Center

On Jul 17, 2008, at 7:00 PM, Andreas Dilger wrote:
> This is an indication that some object is missing on the OST that one
> or more clients are trying to access. You can look at the Lustre debug
> logs with "rpctrace" enabled to extract the "Handling RPC" and
> "Handled RPC" messages on the thread printing this message, e.g.:
>
> 00000100:00100000:0:1216234365.071325:1536:32091:0:(service.c:1064:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn_00:01318c63-cfd4-9199-8142-4e41ea812bd3+7:32099:x9:12345-0@lo:101
> 00000100:00100000:0:1216234366.071325:1536:32091:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2
> 00000100:00100000:0:1216234367.071325:1536:32091:0:(service.c:1064:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn_00:01318c63-cfd4-9199-8142-4e41ea812bd3+7:32099:x9:12345-0@lo:101
>
> Here the thread is PID 32091, and the RPC came from the client "0@lo"
> (in this made-up example, a local client). Then you can check on the
> client (also with "vfstrace" and "rpctrace" debugging on) what it was
> trying to do on the thread that requested this RPC (PID 32091, XID "9"
> in this example).
>
> To quiet it, the easiest mechanism is probably to just delete the
> affected file. If it is a small file (< 1MB) and the data is still
> valid, you could copy it to another file and rename it over the old one.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
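The copy-and-rename step suggested in the quoted reply could be sketched as follows. This is a generic Python illustration, not a Lustre tool: recopy_in_place is a hypothetical helper name, the temporary file is created next to the original so the rename stays within one file system, and shutil.copy2 carries over the mode and timestamps but not the ownership.

```python
import os
import shutil
import tempfile


def recopy_in_place(path):
    """Copy `path` to a temporary sibling, then rename the copy over the original.

    If the copy fails (for example on a read error), the temporary file is
    removed and the original is left untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".recopy-")
    os.close(fd)
    try:
        shutil.copy2(path, tmp)   # data, mode and timestamps go to the new file
        os.replace(tmp, path)     # rename over the old name in one step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return path


if __name__ == "__main__":
    # Demonstration on a scratch file only.
    scratch = tempfile.mkdtemp()
    target = os.path.join(scratch, "example.dat")
    with open(target, "wb") as handle:
        handle.write(b"still valid data")
    recopy_in_place(target)
    with open(target, "rb") as handle:
        print(handle.read())
```

Whether this is worth doing instead of simply deleting the file depends on the file still being readable, as the reply notes.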
On Thu, 2008-07-17 at 19:09 -0400, Charles Taylor wrote:
> Awesome. We'll try it and let you know. Of course, if it works,
> we'll have a naming dilemma - "The Dilger Procedure" is already
> taken. :)

"The Dilger Maneuver" is still available IIRC.

b.