Chris Exton
2011-Apr-20 09:08 UTC
[Lustre-discuss] Lustre filesystem hangs when reading large files
Hello,

We are currently using Lustre 1.8.1.1 with kernel version 2.6.18_128.7.1.el5_lustre.

We are experiencing problems when reading large files from our Lustre filesystem; small reads are not affected.

The read process hangs and the following messages are reported in /var/log/messages:

Feb 22 15:59:38 leopard kernel: LustreError: 11-0: an error occurred while communicating with 192.168.13.200@o2ib. The obd_ping operation failed with -107

Feb 22 15:59:38 leopard kernel: Lustre: lustre-OST0000-osc-ffff81067e0eac00: Connection to service lustre-OST0000 via nid 192.168.13.200@o2ib was lost; in progress operations using this service will wait for recovery to complete.

Feb 22 15:59:38 leopard kernel: LustreError: 6811:0:(import.c:939:ptlrpc_connect_interpret()) lustre-OST0000_UUID went back in time (transno 476754140074 was previously committed, server now claims 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646

Feb 22 15:59:38 leopard kernel: LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.

Feb 22 15:59:38 leopard kernel: Lustre: lustre-OST0000-osc-ffff81067e0eac00: Connection restored to service lustre-OST0000 using nid 192.168.13.200@o2ib.

Feb 22 15:59:38 leopard kernel: LustreError: 17592:0:(lov_request.c:196:lov_update_enqueue_set()) enqueue objid 0x18f87222 subobj 0x4d0c9f on OST idx 0: rc -5

I have checked the bugzilla report, but we have not had a disk crash and the system was not restarted. Could this be an underlying hardware problem that's not getting logged?

Any additional help on this matter would be much appreciated.

Kind Regards
Chris
Kevin Van Maren
2011-Apr-20 10:42 UTC
[Lustre-discuss] Lustre filesystem hangs when reading large files
Chris Exton wrote:
> I have checked the bugzilla report but we have not had a disk crash
> and the system was not restarted. Could this be an underlying hardware
> problem that's not getting logged?
Could be a hardware issue with your network, but not your disk. It looks like a network failure resulted in client eviction (the server was unable to contact the client, so the client was evicted), which in turn produced the "back in time" message when the client reconnected and could not complete its outstanding I/Os. Pending writes, i.e. those still in the client cache, get dropped on the floor when a client is evicted.

See https://bugzilla.lustre.org/show_bug.cgi?id=21681
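
For reference, the two return codes in the log messages from the original post are negated Linux errno values: -107 is ENOTCONN (transport endpoint is not connected) and -5 is EIO (I/O error). That fits a lost connection followed by failed I/O after the eviction. Below is a minimal Python sketch for pulling such codes out of /var/log/messages lines. The function name `decode_return_codes` and the two-entry lookup table are illustrative assumptions, not part of Lustre, and the pattern only covers the two message shapes seen in this thread.

```python
import re

# Linux errno numbers for the two codes seen in this thread.
# Hard-coded on purpose (assumption: the logs come from a Linux host),
# because errno numbering differs between operating systems.
LINUX_ERRNO = {5: "EIO", 107: "ENOTCONN"}

# Matches the "failed with -107" and "rc -5" endings used in the log above.
RC_PATTERN = re.compile(r"\b(?:failed with|rc) (-\d+)")


def decode_return_codes(log_text):
    """Return a list of (code, errno_name) pairs found in log_text."""
    results = []
    for match in RC_PATTERN.finditer(log_text):
        code = int(match.group(1))
        # Lustre reports errors as negative errno values, so negate to look up.
        results.append((code, LINUX_ERRNO.get(-code, "unknown")))
    return results


# Shortened fragments of two of the messages quoted in the original post.
sample = (
    "LustreError: 11-0: The obd_ping operation failed with -107\n"
    "LustreError: 17592:0:(lov_request.c:196:lov_update_enqueue_set()) "
    "enqueue objid 0x18f87222 subobj 0x4d0c9f on OST idx 0: rc -5\n"
)

for code, name in decode_return_codes(sample):
    print(code, name)
```

Codes outside the small table are reported as "unknown" rather than guessed, so the table would need extending for other errors.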