Hello,

FYI, over the last few weeks we had stalling lustre mounts in conjunction with automount.
This is a short summary in case you are using automount + lustre.

When lustre gets automounted OK you will see messages as in 1).

A user can stall the lustre mount by not using a complete (fully qualified) filename.
Example file: /lustre_automount/myfile.dat

When lustre is *NOT* mounted, a user can stall the client mount for at least 100s with
    ls /lustre_automount/myfile
(no asterisk after myfile!).
Error messages as in 2) will pop up with the 'lnet_try_match_md()' sequence.
After that you will see messages of type 3), which may indicate a network problem (hm, well, OK to us ...).
After 100s the user gets back
    ls: cannot access /lustre_automount/myfile.dat: No such file or directory
After that it looks as if lustre is mounted. But a simple 'ls /lustre_automount/' in a second shell will not return anything and produces the same message sequence as above.

Attention:
When several 'ill-formed' ls commands are sent at once, the lustre mount freezes completely and permanently on that client. This happened in our case because the command sequence was driven by scripts running in parallel.
You have to 'umount -f /lustre_automount/' or even 'lustre_rmmod' to recover.
If umount works correctly it looks like 4).

Because there are a lot of messages between 1), 2) and 3), we were misled and searched for the error in the wrong places, especially the MDS/MGS hardware. Additionally, due to 2), we replaced nearly all the network components we could get our hands on.

By contrast, the same ill-formed ls command over an NFS automount does not stall the system; it returns the 'cannot access' message at once.

Examples of what does work correctly when lustre is not mounted:
a) ls /lustre_automount/myfi*
b) find /lustre_automount -iname 'myfi*' (optionally with -maxdepth 1)
c) lfs find /lustre_automount --name 'myfile*' --maxdepth 1 (returns the file)
d) lfs find /lustre_automount --name 'myfile' --maxdepth 1 (does not return anything, but will not freeze the system)
.....

Another 'ill-formed' command is
    gunzip -c /lustre_automount/myfile > /tmp/test
instead of
    gunzip -c /lustre_automount/myfile.gz > /tmp/test

The solution seems to be not to use autofs + lustre if the above cannot be ruled out for certain, including mistyping.
Or to tar and feather the user .... that's what we did .... ;-)

Hairless by now
Heiko

################################################################
Gentoo x86_64 GNU/Linux
lustre: 1.6.6
vanilla-kernel 2.6.22.19
autofs 5.0.3-r6
mount 2.14.2
################################################################

Client Syslog.
Automount timing 60s + 120s WAIT, just for testing. The same holds true for timeouts of 600s.

1) Mounting OK:
Nov 19 17:29:58 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:29:58 quadcore2 Lustre: fs_lustre-OST0006-osc-ffff8101c918b800.osc: set parameter active=0
Nov 19 17:29:58 quadcore2 Lustre: Skipped 16 previous similar messages
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) not connecting OSC fs_lustre-OST0006_UUID; administratively disabled
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) Skipped 13 previous similar messages
Nov 19 17:29:58 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:29:58 quadcore2 automount[21803]: mount(generic): mounted mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:29:58 quadcore2 automount[21803]: mounted /lustre_automount

2) Mounting failed:
Nov 19 17:43:09 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:43:09 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:43:09 quadcore2 automount[21803]: mount(generic): mounted mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:43:09 quadcore2 automount[21803]: mounted /lustre_automount
Nov 19 17:43:10 quadcore2 LustreError: 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16.122@tcp, match 776 length 1336 too big: 1272 left, 1272 allowed
Nov 19 17:43:16 quadcore2 automount[21803]: 1 remaining in /home

3) The possible network problem message:
Nov 19 17:44:50 quadcore2 Lustre: Request x776 sent from fs_lustre-MDT0000-mdc-ffff8101aac5f400 to NID 192.168.16.122@tcp 100s ago has timed out (limit 100s).
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection to service fs_lustre-MDT0000 via nid 192.168.16.122@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 19 17:44:50 quadcore2 LustreError: 25692:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -4
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection restored to service fs_lustre-MDT0000 using nid 192.168.16.122@tcp.

4) Umount OK:
Nov 19 17:45:37 quadcore2 automount[21803]: expiring path /lustre_automount
Nov 19 17:45:37 quadcore2 automount[21803]: unmounting dir = /lustre_automount
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) NULL connection
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) Skipped 13 previous similar messages
Nov 19 17:45:37 quadcore2 Lustre: client ffff8101aac5f400 umount complete
Nov 19 17:45:37 quadcore2 automount[21803]: expired /lustre_automount
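
Since the report above says the messages between 1), 2) and 3) sent the search in the wrong direction, a small filter that picks out just the relevant signatures may help when reading a client syslog. The following is a minimal sketch, not something used in the thread: the function and label names are hypothetical, the patterns are copied from the excerpts above, and whether other lustre releases word these messages the same way is an assumption.

```python
import re

# Patterns taken from the syslog excerpts in this message.
# Assumption: other lustre versions may phrase these messages differently.
SIGNATURES = {
    "md_too_big": re.compile(r"lnet_try_match_md\(\)\).*too big"),
    "request_timeout": re.compile(r"Request x\d+ sent from .* has timed out"),
    "connection_lost": re.compile(r"Connection to service \S+ via nid .* was lost"),
}


def classify_line(line):
    """Return the label of the first signature that matches, or None."""
    for label, pattern in SIGNATURES.items():
        if pattern.search(line):
            return label
    return None


def stall_suspected(lines):
    """True if a 'too big' lnet error is later followed by a request timeout.

    This mirrors the order of sections 2) and 3) above; it is a heuristic,
    not a diagnosis.
    """
    seen_md_error = False
    for line in lines:
        label = classify_line(line)
        if label == "md_too_big":
            seen_md_error = True
        elif label == "request_timeout" and seen_md_error:
            return True
    return False
```

Feeding it the lines of a client syslog (for example via `open(path)`) gives a quick yes/no hint; a hit only says the two messages appeared in that order, not why.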

On Fri, 2009-11-20 at 09:31 +0100, Heiko Schröter wrote:
> Hello,

Hi,

> A user can stall the lustre mount by not using a FQN Filename.
> Example file: /lustre_automount/myfile.dat

This sounds very strange and does not represent what I would think is correct behaviour.

> When lustre is *NOT* mounted a user can stall the client mount with 'ls /lustre_automount/myfile' (no asterisk after myfile !)

IOW, an invalid filename?

> for at minimum 100s.
> Error messages as in 2) will popup with the 'lnet_try_match_md()' sequence.

Hrm. That seems very strange, given that automount should be using the same mount command in both instances.

> lustre: 1.6.6

Do you have an opportunity to test this on a newer release?

> vanilla-kernel 2.6.22.19

Ideally on one of the platforms you can download binary RPMs from us for (i.e. RHEL5 or SLES10)?

> 2) Mounting failed:
> Nov 19 17:43:09 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
> Nov 19 17:43:09 quadcore2 Lustre: Client fs_lustre-client has started
> Nov 19 17:43:09 quadcore2 automount[21803]: mount(generic): mounted mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
> Nov 19 17:43:09 quadcore2 automount[21803]: mounted /lustre_automount
> Nov 19 17:43:10 quadcore2 LustreError: 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16.122@tcp, match 776 length 1336 too big: 1272 left, 1272 allowed

I think this is the key to this issue. There were one or more bugs around this symptom fixed in the 1.6.6-1.6.7 time frame. Perhaps even an upgrade to 1.6.7.2 might prove fruitful. It would likely require an MDS upgrade at least and should probably include clients and OSSes as well.

Cheers,
b.

On Friday, 20 November 2009 14:15:17, Brian J. Murrell wrote:
> > When lustre is *NOT* mounted a user can stall the client mount with 'ls /lustre_automount/myfile' (no asterisk after myfile !)
>
> IOW, an invalid filename?

Yes, this behaviour is 100% reproducible with the lustre/autofs versions mentioned.

> > lustre: 1.6.6
> > vanilla-kernel 2.6.22.19
>
> Ideally on one of the platforms you can download binary RPMs from us for
> (i.e. RHEL5 or SLES10)?

An upgrade to 1.8.x is scheduled for Jan/Feb 2010. Until then I cannot interrupt the system because of some important deadlines coming up.
We are tied to the Gentoo distro, so a RHEL5/SLES10 kernel probably won't help. Installing lustre from an RPM would probably not work either, because it is compiled against different libs.

Are there any "killer" options needed within the kernel which are crucial for lustre+autofs?
Would it make any difference to update only a client? This could be done quite easily.

> > Nov 19 17:43:10 quadcore2 LustreError: 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16.122@tcp, match 776 length 1336 too big: 1272 left, 1272 allowed
>
> I think this is the key to this issue. There was one or more bugs
> around this symptom fixed in the 1.6.6-1.6.7 time frame.

Is it known whether that is fixed in 1.8.x.x?

We turned off autofs+lustre last week (week 47) and since then we haven't had any problems with the fs.

Thanks and Regards
Heiko

On Mon, 2009-11-23 at 14:36 +0100, Heiko Schröter wrote:
>
> An upgrade to 1.8.x is scheduled for Jan/Feb 2010. Until then I cannot interrupt the system because of some important deadlines coming up.
> We are bundled to the Gentoo Distro. So a RHEL5/SLES10 Kernel probably won't help.

Yeah, it gets more and more difficult to try to support the further one diverges from the "tested and known working set". Given that the servers are supposed to be dedicated, treated almost as "sealed server" systems, it really should not be difficult to run one of our packaged and supported releases (i.e. rhel5, sles10/11) on them. It would sure make your life easier. Then all you have to worry about on the "divergence" scale is clients, and we are pretty loose about them given that we support patchless clients now.

> Are there any "killer" options needed within the kernel which are crucial for lustre+autofs ?

There should not be. autofs is (supposed to be) nothing more than simply demand mounting. It really should not be any different than issuing a mount command at a root prompt.

> Would it make any difference to only update a client ?

It might.

> Is it known if that is fixed in 1.8.x.x ?

It should be, given the 1.6.6-1.6.7 time frame of the fixes.

> We turned off autofs+lustre last week (week 47) and since then we don't have any problems with the fs.

Well, that's good news. In terms of autofs being your only issue, anyway. :-)

b.
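
For scripts that have to keep running against an automounted lustre path until an upgrade is possible, one precaution is to check whether the filesystem is already mounted before touching any file name under it. Below is a minimal sketch under stated assumptions, not something proposed in the thread: it only parses text in /proc/mounts format, the function name is hypothetical, and, as the first message notes, a mount that merely looks present after a stall may still be unresponsive.

```python
def lustre_mounted(mounts_text, mount_point):
    """Return True if mounts_text lists mount_point with fs type 'lustre'.

    mounts_text is expected in /proc/mounts format:
    device, mount point, fs type, options, dump, pass.
    An autofs placeholder entry on the same mount point does not count.
    """
    target = mount_point.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == target and fields[2] == "lustre":
            return True
    return False
```

A caller would pass the contents of /proc/mounts, e.g. `lustre_mounted(open("/proc/mounts").read(), "/lustre_automount")`, and decide for itself what to do when the result is False; reading /proc/mounts does not access the automount path.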