I have seen this behavior a few times. Under heavy IO, Lustre will just stop and dmesg will show the following:

LustreError: 3976:0:(events.c:134:client_bulk_callback()) event type 0, status -5, desc 000001012ce12000
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_statfs operation failed with -107
LustreError: Skipped 1 previous similar message
Lustre: nobackup-MDT0000-mdc-00000100e9e9ac00: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.

There are no network connection issues between the login nodes. When this happens, the client does not recover until we reboot the node. This does happen at times on the compute nodes, but I see it most on login hosts. If I just go to the Lustre mount and try to ls it, it hangs forever.

Many times when Lustre screws up it recovers, but more and more often it does not, and we see these bulk errors followed by MDS errors.

We are using Lustre 1.6.x.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
Brian J. Murrell
2008-May-16 20:13 UTC
[Lustre-discuss] Luster access locking up login nodes
On Fri, 2008-05-16 at 15:48 -0400, Brock Palen wrote:
> I have seen this behavior a few times.
> Under heavy IO lustre will just stop and dmesg will have the following:

Review the list archives for statahead problems.

b.
Ahh, I didn't realize this was related to that. Good to know a fix is in the works (2 x4500's are on the way, so we have made a commitment to Lustre).

How would I make this option the default on boot? I don't see an llite module on the clients. I can pdsh to all the clients, but machines do get rebooted sometimes.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On May 16, 2008, at 4:13 PM, Brian J. Murrell wrote:
> On Fri, 2008-05-16 at 15:48 -0400, Brock Palen wrote:
>> I have seen this behavior a few times.
>> Under heavy IO lustre will just stop and dmesg will have the
>> following:
>
> Review the list archives for statahead problems.
>
> b.
Brian J. Murrell
2008-May-16 20:28 UTC
[Lustre-discuss] Luster access locking up login nodes
On Fri, 2008-05-16 at 16:24 -0400, Brock Palen wrote:
> Ahh didn't realize this was related to that.

Sure looks like it from what you posted.

> How would I make this option the default on boot?

initscript.

b.
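For anyone finding this thread later, here is a rough sketch of what the "initscript" suggestion could look like. The thread never names the option, so this rests on an assumption: that the tunable in question is statahead_max, exposed per mount under /proc/fs/lustre/llite/, and that writing 0 to it turns statahead off. The function name disable_statahead and the LLITE_DIR variable are made up for illustration. Check the list archives and your own /proc tree before relying on any of it.

```shell
#!/bin/sh
# Sketch of a boot-time hook, meant to run after the Lustre client mounts.
# ASSUMPTION (not stated in this thread): the tunable is statahead_max under
# /proc/fs/lustre/llite/<fsname>-<id>/, and writing 0 disables statahead.

# Hypothetical override so the directory can be changed for testing.
LLITE_DIR="${LLITE_DIR:-/proc/fs/lustre/llite}"

# Write 0 to every statahead_max file one level below the given directory.
# Prints the number of files successfully updated.
disable_statahead() {
    dir="$1"
    count=0
    for f in "$dir"/*/statahead_max; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        if echo 0 > "$f"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

changed=$(disable_statahead "$LLITE_DIR")
echo "statahead disabled on $changed Lustre mount(s)"
```

Called from rc.local or from the same initscript that mounts the filesystem, this would reapply the setting after each reboot. On a node with no Lustre mount it changes nothing and reports 0.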