thr3ads.net - Lustre discuss - [Lustre-discuss] Stalled autofs + lustre summary [Nov 2009]

Heiko Schröter

2009-Nov-20 08:31 UTC

[Lustre-discuss] Stalled autofs + lustre summary

Hello,

FYI, over the last few weeks we had stalling lustre mounts in conjunction with
automount.
This is a short summary in case you are using automount + lustre.

When lustre gets automounted ok you will see the messages as in 1).

A user can stall the lustre mount by not using a fully qualified (FQN) filename.
Example file: /lustre_automount/myfile.dat

When lustre is *NOT* mounted, a user can stall the client mount for at least
100s with 'ls /lustre_automount/myfile' (no asterisk after myfile!).
Error messages as in 2) will pop up, with the 'lnet_try_match_md()' sequence.
After that you will see messages of type 3), which may indicate a network problem
(hm, well, they did to us ...)
After 100s the user gets back 'ls: cannot access
/lustre_automount/myfile.dat: No such file or directory'.
After that it looks as if lustre is mounted. But a simple 'ls
/lustre_automount/' in a second shell will not return anything and will
produce the same message sequence as above.
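
To tell the "mounted" from the "not mounted" case in a script without touching the automount path itself, one option is to look at the mount table. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the usual /proc/mounts layout (device, mount point, filesystem type, ...). Note that after a stall the mount can be listed here and still not respond, as described above.

```python
def lustre_mounted(mount_point, mounts_text):
    """Return True if mounts_text (content of /proc/mounts) lists
    mount_point with filesystem type 'lustre'."""
    target = mount_point.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed field order: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == target and fields[2] == "lustre":
            return True
    return False


def read_proc_mounts(path="/proc/mounts"):
    """Read the mount table; this reads a file and does not walk the automount path."""
    with open(path) as handle:
        return handle.read()
```

Usage would be something like lustre_mounted("/lustre_automount", read_proc_mounts()).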

Attention:
When several 'ill-formed' ls commands are sent at once, the
lustre mount freezes completely and forever on that client.
This happened in our case because the command sequence was driven by
scripts running in parallel.
You have to 'umount -f /lustre_automount/' or even run
'lustre_rmmod' to recover.

If umount works correctly it looks like 4).

Because a lot of other messages appear between 1), 2) and 3), we were misled
and searched for the error in the wrong places,
especially the MDS/MGS hardware. Additionally, due to 2), we replaced
nearly all network components we could get our hands on.

Unfortunately, doing the same ill-formed ls command over an NFS automount will not
result in a stalled system but will return the 'cannot access'
message at once.

Examples of what does work correctly when lustre is not mounted:
a) ls /lustre_automount/myfi*
b) find /lustre_automount -iname 'myfi*' (optionally:
-maxdepth 1)
c) lfs find /lustre_automount --name 'myfile*' --maxdepth 1
(returns the file)
d) lfs find /lustre_automount --name 'myfile' --maxdepth 1
(does not return anything, but will not freeze the system)
.....
Another 'ill-formed' command is 'gunzip -c
/lustre_automount/myfile > /tmp/test' instead of 'gunzip -c
/lustre_automount/myfile.gz > /tmp/test'.
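
For scripts, the working examples a) to d) suggest looking a name up through a directory listing rather than accessing the full path directly. Below is a rough sketch of that idea. The helper name exists_via_listing is invented here, and it is only an assumption (not tested against lustre 1.6.6 + autofs) that a plain directory listing behaves like the glob in a).

```python
import os


def exists_via_listing(directory, name, lister=os.listdir):
    """Check for `name` by listing `directory`, instead of accessing
    the full path directory/name directly."""
    try:
        return name in lister(directory)
    except OSError:
        # Directory missing or not readable: treat as "not found".
        return False
```

A script would then call exists_via_listing("/lustre_automount", "myfile.gz") before running gunzip on it.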

The solution seems to be not to use autofs + lustre if the above cannot be
ruled out for sure, including mistyping.
Or to tar and feather the user .... that's what we did .... ;-)

Hairless by now
Heiko

################################################################
Gentoo x86_64 GNU/Linux
lustre: 1.6.6
vanilla-kernel 2.6.22.19
autofs 5.0.3-r6
mount 2.14.2
################################################################
Client syslog. Automount timing 60s + 120s WAIT, just for testing. The same
holds true for timeouts of 600s.
1) Mounting OK:
Nov 19 17:29:58 quadcore2 automount[21803]: attempting to mount entry
/lustre_automount
Nov 19 17:29:58 quadcore2 Lustre: fs_lustre-OST0006-osc-ffff8101c918b800.osc:
set parameter active=0
Nov 19 17:29:58 quadcore2 Lustre: Skipped 16 previous similar messages
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd())
not connecting OSC fs_lustre-OST0006_UUID; administratively disabled
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd())
Skipped 13 previous similar messages
Nov 19 17:29:58 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:29:58 quadcore2 automount[21803]: mount(generic): mounted
mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:29:58 quadcore2 automount[21803]: mounted /lustre_automount

2) Mounting failed:
Nov 19 17:43:09 quadcore2 automount[21803]: attempting to mount entry
/lustre_automount
Nov 19 17:43:09 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:43:09 quadcore2 automount[21803]: mount(generic): mounted
mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:43:09 quadcore2 automount[21803]: mounted /lustre_automount
Nov 19 17:43:10 quadcore2 LustreError:
25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from
12345-192.168.16.122@tcp, match 776 length 1336 too big: 1272 left, 1272
allowed
Nov 19 17:43:16 quadcore2 automount[21803]: 1 remaining in /home

3) The possible network problem message:
Nov 19 17:44:50 quadcore2 Lustre: Request x776 sent from
fs_lustre-MDT0000-mdc-ffff8101aac5f400 to NID 192.168.16.122@tcp 100s ago has
timed out (limit 100s).
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400:
Connection to service fs_lustre-MDT0000 via nid 192.168.16.122@tcp was lost;
in progress operations using this service will wait for recovery to complete.
Nov 19 17:44:50 quadcore2 LustreError: 25692:0:(mdc_locks.c:598:mdc_enqueue())
ldlm_cli_enqueue: -4
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400:
Connection restored to service fs_lustre-MDT0000 using nid
192.168.16.122@tcp.

4) Umount OK:
Nov 19 17:45:37 quadcore2 automount[21803]: expiring path /lustre_automount
Nov 19 17:45:37 quadcore2 automount[21803]: unmounting dir = /lustre_automount
Nov 19 17:45:37 quadcore2 LustreError:
25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC:
canceling anyway
Nov 19 17:45:37 quadcore2 LustreError:
25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Skipped 2 previous similar
messages
Nov 19 17:45:37 quadcore2 LustreError:
25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Nov 19 17:45:37 quadcore2 LustreError:
25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) Skipped 2 previous similar
messages
Nov 19 17:45:37 quadcore2 LustreError:
25298:0:(connection.c:155:ptlrpc_put_connection()) NULL connection
Nov 19 17:45:37 quadcore2 LustreError:
25298:0:(connection.c:155:ptlrpc_put_connection()) Skipped 13 previous similar
messages
Nov 19 17:45:37 quadcore2 Lustre: client ffff8101aac5f400 umount complete
Nov 19 17:45:37 quadcore2 automount[21803]: expired /lustre_automount
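
Since the messages in 2) and 3) are about 100s apart and buried among other output, a small filter may help to spot the pattern in a client syslog. This is only a sketch: the function name and the regular expressions are illustrative and written against the lines posted above (unwrapped, one message per line). It assumes that the "match 776" in 2) and the "Request x776" in 3) refer to the same request, which the numbers in this log suggest but which is not confirmed.

```python
import re

# Patterns modelled on the syslog lines in 2) and 3) above.
TOO_BIG_RE = re.compile(r"lnet_try_match_md\(\)\).*match (\d+) length \d+ too big")
TIMED_OUT_RE = re.compile(r"Request x(\d+) sent from .* has timed out")


def stalled_requests(lines):
    """Return request numbers that first hit a 'too big' match error
    and later timed out."""
    too_big = set()
    stalled = []
    for line in lines:
        hit = TOO_BIG_RE.search(line)
        if hit:
            too_big.add(hit.group(1))
            continue
        timeout = TIMED_OUT_RE.search(line)
        if timeout and timeout.group(1) in too_big:
            stalled.append(timeout.group(1))
    return stalled
```

Feeding it the lines of the client syslog would list the request numbers that show the 2)-then-3) sequence.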

Brian J. Murrell

2009-Nov-20 13:15 UTC

On Fri, 2009-11-20 at 09:31 +0100, Heiko Schröter wrote:
> Hello,
Hi,
> A user can stall the lustre mount by not using a FQN Filename.
> Example file: /lustre_automount/myfile.dat
This sounds very strange and does not represent what I would think is
correct behaviour.
> 
> When lustre is *NOT* mounted a user can stall the client mount with
> 'ls /lustre_automount/myfile' (no asterisk after myfile !)
IOW, an invalid filename?
> for at minimum 100s.
> Error messages as in 2) will popup with the
> 'lnet_try_match_md()' sequence.
Hrm.  That seems very strange, given that automount should be using the
same mount command in both instances.
> lustre: 1.6.6
Do you have an opportunity to test this on a newer release?
> vanilla-kernel 2.6.22.19
Ideally on one of the platforms you can download binary RPMs from us for
(i.e. RHEL5 or SLES10)?
> 2) Mounting failed:
> Nov 19 17:43:09 quadcore2 automount[21803]: attempting to mount entry
> /lustre_automount
> Nov 19 17:43:09 quadcore2 Lustre: Client fs_lustre-client has started
> Nov 19 17:43:09 quadcore2 automount[21803]: mount(generic): mounted
> mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
> Nov 19 17:43:09 quadcore2 automount[21803]: mounted /lustre_automount
> Nov 19 17:43:10 quadcore2 LustreError:
> 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from
> 12345-192.168.16.122@tcp, match 776 length 1336 too big: 1272 left, 1272
> allowed
I think this is the key to this issue.  There were one or more bugs
around this symptom fixed in the 1.6.6-1.6.7 time frame.  Perhaps even
an upgrade to 1.6.7.2 might prove fruitful.  It would likely require an
MDS upgrade at least and should probably include clients and OSSes as
well.

Cheers,
b.

Heiko Schröter

2009-Nov-23 13:36 UTC

On Friday, 20 November 2009 14:15:17, Brian J. Murrell wrote:
> > When lustre is *NOT* mounted a user can stall the client mount with
> > 'ls /lustre_automount/myfile' (no asterisk after myfile !)
> 
> IOW, an invalid filename?
Yes, this behaviour is 100% reproducible with the lustre/autofs versions
mentioned.
> > lustre: 1.6.6
> > vanilla-kernel 2.6.22.19
> 
> Ideally on one of the platforms you can download binary RPMs from us for
> (i.e. RHEL5 or SLES10)?
An upgrade to 1.8.x is scheduled for Jan/Feb 2010. Until then I cannot interrupt
the system because of some important deadlines coming up.
We are tied to the Gentoo distro. So a RHEL5/SLES10 kernel probably
won't help.
Installing lustre from an rpm or so would probably not work because it is
compiled against different libs.

Are there any "killer" options needed within the kernel which are
crucial for lustre+autofs ?
Would it make any difference to only update a client ? This could be done quite
easily.
> > Nov 19 17:43:10 quadcore2 LustreError:
> > 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from
> > 12345-192.168.16.122@tcp, match 776 length 1336 too big: 1272 left, 1272
> > allowed
> 
> I think this is the key to this issue.  There was one or more bugs
> around this symptom fixed in the 1.6.6-1.6.7 time frame.
Is it known if that is fixed in 1.8.x.x ?

We turned off autofs+lustre last week (week 47) and since then we haven't
had any problems with the fs.

Thanks and Regards
Heiko

Brian J. Murrell

2009-Nov-23 15:14 UTC

On Mon, 2009-11-23 at 14:36 +0100, Heiko Schröter wrote:
> 
> An upgrade to 1.8.x is scheduled for Jan/Feb 2010. Until then I cannot
> interrupt the system because of some important deadlines coming up.
> We are tied to the Gentoo distro. So a RHEL5/SLES10 kernel probably
> won't help.
Yeah, it gets more and more difficult to provide support the further
one diverges from the "tested and known working set".  Given that the
servers are supposed to be dedicated, treated almost as "sealed server"
systems, it really should not be difficult to run one of our packaged
and supported releases (i.e. rhel5, sles10/11) on them.  It would sure
make your life easier.

Then all you have to worry about on the "divergence" scale is clients
and we are pretty loose about them given that we support patchless
clients now.
> Are there any "killer" options needed within the kernel which are
> crucial for lustre+autofs ?
There should not be.  autofs is (supposed to be) nothing more than
simply demand mounting.  It really should not be any different than
issuing a mount command at a root prompt.
> Would it make any difference to only update a client ?
It might.
> Is it known if that is fixed in 1.8.x.x ?
Should be, given 1.6.6-1.6.7's time frame.
> We turned off autofs+lustre last week (week 47) and since then we
> don't have any problems with the fs.
Well, that's good news.  In terms of autofs being your only issue,
anyway.  :-)

b.
