Ok, I inherited a lustre filesystem used on a cluster.

I am seeing an issue where on the frontend, I see all of /work.
On the nodes, however, I only see SOME of the users' directories.

/work consists of one MDT/MGS and 3 OSTs.
The OSTs are LVMs served from a DDN via InfiniBand.

Running the kernel modules/client on the nodes/frontend:
lustre-client-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2

On the OST/MDT:
lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
kernel-2.6.18-164.11.1.el5_lustre.1.8.2
lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2

I have so many error messages in the logs, I am not sure which to sift through for this issue.
A quick tail on the MDT:
========================
Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810669d35c50 x1334203739385128/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 419 previous similar messages
Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255.55@tcp
Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) Skipped 177 previous similar messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) Skipped 229 previous similar messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810673a78050 x1334009404220652/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410737 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 404 previous similar messages
Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255.46@tcp
Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) Skipped 181 previous similar messages
========================

Any direction/insight would be most helpful.

Brian Andrus
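P.S. For reference, a sketch of the kind of comparison I mean; /work is our client mount point and the temp file names are arbitrary:

  # On the frontend:
  ls /work | sort > /tmp/work.frontend
  # On a compute node:
  ls /work | sort > /tmp/work.node
  # Compare the two listings:
  diff /tmp/work.frontend /tmp/work.node

  # As root on both, check that the client sees the MDS and all three OSTs:
  lfs df -h /work
  lfs check servers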
Brian Andrus wrote:
> Ok, I inherited a lustre filesystem used on a cluster.
>
> I am seeing an issue where on the frontend, I see all of /work.
> On the nodes, however, I only see SOME of the users' directories.

That's rather odd. The directory structure is all on the MDS, so it's usually either all there, or not there. Are any of the user errors permission-related? That's the only thing I can think of that would change what directories one node sees vs. another.

> /work consists of one MDT/MGS and 3 OSTs.
> The OSTs are LVMs served from a DDN via InfiniBand.
> [...]
> A quick tail on the MDT:
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810669d35c50 x1334203739385128/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
> [...]

The ENOTCONN (-107) points at server/network health. I would umount the clients and verify server health, then verify LNET connectivity. However, this would not relate to missing directories - in the absence of other explanations, check the MDT with fsck. That's more of a generically useful thing to do rather than something indicated by your data. I would also look through older logs if available, and see if you can find a point in time where things went bad. The first error is always the most useful.

> Any direction/insight would be most helpful.

Hope this helps,
cliffw
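P.S. A rough sketch of those checks; the device path and NID below are placeholders you would replace with your own values:

  # On each server (MDS/MGS and every OSS): overall health and target state
  cat /proc/fs/lustre/health_check
  lctl dl                     # device list; targets should show as UP

  # LNET connectivity: note the server NIDs, then ping them from a client
  lctl list_nids              # run on the MDS/OSS
  lctl ping <server-nid>      # run on a client, e.g. the frontend

  # Read-only check of the MDT backend, with the MDT unmounted
  # (needs the Lustre-patched e2fsprogs)
  e2fsck -fn /dev/<mdt-device>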
Hello!

On Apr 27, 2010, at 7:29 PM, Brian Andrus wrote:
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810669d35c50 x1334203739385128/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
>
> Any direction/insight would be most helpful.

That's way too late in the logs to see what happened, aside from the server deciding to evict some clients for some reason.
The interesting parts should be around where "evicting" or "timeout" were first mentioned.

Bye,
    Oleg
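P.S. Something like this usually finds that point, assuming syslog on the MDS goes to the usual place (check the rotated files too, oldest first):

  # On the MDS: first occurrences of evictions/timeouts
  grep -inE 'evict|timeout|timed out' /var/log/messages* | head -40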
On 4/27/2010 6:10 PM, Oleg Drokin wrote:
> That's way too late in the logs to see what happened, aside from the server deciding to evict some clients for some reason.
> The interesting parts should be around where "evicting" or "timeout" were first mentioned.

Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). Only ONE other line, on an OST:

Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST0000_UUID' is not available for connect (no target)

Brian
Hello!

On Apr 27, 2010, at 9:38 PM, Brian Andrus wrote:
> Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). Only ONE other line, on an OST:

Each such message means there was an attempt to send a ping to this server from a client that the server does not recognize.

> Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST0000_UUID' is not available for connect (no target)

This one tells you that a client tried to contact OST0, but this service is not hosted on that node (or did not yet start up). This might be a somewhat valid message if you have failover configured and this node is currently a passive failover target for the service.

Bye,
    Oleg
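P.S. A quick way to see which targets a given server is actually serving right now (a sketch; run on each MDS/OSS node, output details vary with your configuration):

  # Lustre target mounts present on this node
  mount -t lustre
  # Device list with states; work-OST0000 etc. should show as UP on its OSS
  lctl dl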
On 2010-04-27, at 19:38, Brian Andrus <toomuchit@gmail.com> wrote:
> [...]
> Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST0000_UUID' is not available for connect (no target)

This means that your OST is not available. Maybe it is not mounted?

Cheers, Andreas
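P.S. A minimal sketch of what to check, assuming the target sits on an LVM volume as described earlier in the thread; the device and mount point names below are placeholders:

  # On the OSS that should serve work-OST0000: is the target mounted?
  mount -t lustre

  # If it is missing, mount it again (placeholder device and mount point):
  mount -t lustre /dev/<vg>/<ost0000-lv> /mnt/lustre/work-OST0000

  # Then, from a client, confirm the OST is back and /work looks complete:
  lfs check osts
  lfs df -h /work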