Ok, I inherited a lustre filesystem used on a cluster.

I am seeing an issue where on the frontend, I see all of /work.
On the nodes, however, I only see SOME of the users' directories.

/work consists of one MDT/MGS and 3 OSTs.
The OSTs are LVMs served from a DDN via InfiniBand.

Running the kernel modules/client on the nodes/frontend:
lustre-client-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2

On the OST/MDT:
lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
kernel-2.6.18-164.11.1.el5_lustre.1.8.2
lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2

I have so many error messages in the logs, I am not sure which to sift through for this issue.
A quick tail on the MDT:
========================
Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810669d35c50 x1334203739385128/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 419 previous similar messages
Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255.55@tcp
Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) Skipped 177 previous similar messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) Skipped 229 previous similar messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810673a78050 x1334009404220652/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410737 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 404 previous similar messages
Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255.46@tcp
Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) Skipped 181 previous similar messages
========================

Any direction/insight would be most helpful.

Brian Andrus
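P.S. For reference, a sketch of the kind of comparison I mean; /work is our client mount point and the temp file names are arbitrary:

  # On the frontend:
  ls /work | sort > /tmp/work.frontend
  # On a compute node:
  ls /work | sort > /tmp/work.node
  # Compare the two listings:
  diff /tmp/work.frontend /tmp/work.node

  # As root on both, check that the client sees the MDS and all three OSTs:
  lfs df -h /work
  lfs check servers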
Brian Andrus wrote:
> Ok, I inherited a lustre filesystem used on a cluster.
>
> I am seeing an issue where on the frontend, I see all of /work.
> On the nodes, however, I only see SOME of the users' directories.

That's rather odd. The directory structure is all on the MDS, so it's usually either all there, or not there. Are any of the user errors permission-related? That's the only thing I can think of that would change what directories one node sees vs. another.

> /work consists of one MDT/MGS and 3 OSTs.
> The OSTs are LVMs served from a DDN via InfiniBand.
> [...]
> A quick tail on the MDT:
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810669d35c50 x1334203739385128/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
> [...]

The ENOTCONN (-107) points at server/network health. I would umount the clients and verify server health, then verify LNET connectivity. However, this would not relate to missing directories - in the absence of other explanations, check the MDT with fsck. That's more of a generically useful thing to do rather than something indicated by your data. I would also look through older logs if available, and see if you can find a point in time where things went bad. The first error is always the most useful.

> Any direction/insight would be most helpful.

Hope this helps,
cliffw
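P.S. A rough sketch of those checks; the device path and NID below are placeholders you would replace with your own values:

  # On each server (MDS/MGS and every OSS): overall health and target state
  cat /proc/fs/lustre/health_check
  lctl dl                     # device list; targets should show as UP

  # LNET connectivity: note the server NIDs, then ping them from a client
  lctl list_nids              # run on the MDS/OSS
  lctl ping <server-nid>      # run on a client, e.g. the frontend

  # Read-only check of the MDT backend, with the MDT unmounted
  # (needs the Lustre-patched e2fsprogs)
  e2fsck -fn /dev/<mdt-device>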
Hello!

On Apr 27, 2010, at 7:29 PM, Brian Andrus wrote:
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) req@ffff810669d35c50 x1334203739385128/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
>
> Any direction/insight would be most helpful.

That's way too late in the logs to see what happened, aside from the server deciding to evict some clients for some reason.
The interesting parts should be around where "evicting" or "timeout" were first mentioned.

Bye,
    Oleg
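P.S. Something like this usually finds that point, assuming syslog on the MDS goes to the usual place (check the rotated files too, oldest first):

  # On the MDS: first occurrences of evictions/timeouts
  grep -inE 'evict|timeout|timed out' /var/log/messages* | head -40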
On 4/27/2010 6:10 PM, Oleg Drokin wrote:
> That's way too late in the logs to see what happened, aside from the server deciding to evict some clients for some reason.
> The interesting parts should be around where "evicting" or "timeout" were first mentioned.

Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). Only ONE other line, on an OST:

Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST0000_UUID' is not available for connect (no target)

Brian
Hello!

On Apr 27, 2010, at 9:38 PM, Brian Andrus wrote:
> Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). Only ONE other line, on an OST:

Each such message means there was an attempt to send a ping to this server from a client that the server does not recognize.

> Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST0000_UUID' is not available for connect (no target)

This one tells you that a client tried to contact OST0, but this service is not hosted on that node (or did not yet start up). This might be a somewhat valid message if you have failover configured and this node is currently a passive failover target for the service.

Bye,
    Oleg
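P.S. A quick way to see which targets a given server is actually serving right now (a sketch; run on each MDS/OSS node, output details vary with your configuration):

  # Lustre target mounts present on this node
  mount -t lustre
  # Device list with states; work-OST0000 etc. should show as UP on its OSS
  lctl dl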
On 2010-04-27, at 19:38, Brian Andrus <toomuchit@gmail.com> wrote:
> [...]
> Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST0000_UUID' is not available for connect (no target)

This means that your OST is not available. Maybe it is not mounted?

Cheers, Andreas
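P.S. A minimal sketch of what to check, assuming the target sits on an LVM volume as described earlier in the thread; the device and mount point names below are placeholders:

  # On the OSS that should serve work-OST0000: is the target mounted?
  mount -t lustre

  # If it is missing, mount it again (placeholder device and mount point):
  mount -t lustre /dev/<vg>/<ost0000-lv> /mnt/lustre/work-OST0000

  # Then, from a client, confirm the OST is back and /work looks complete:
  lfs check osts
  lfs df -h /work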