Roger Spellman
2009-Mar-23 15:46 UTC
[Lustre-discuss] Intermittent "obd_ping operation failed" errors
I am seeing the following error multiple times on clients trying to talk to a particular OST. The errors are intermittent: I get five to ten every few seconds, then none for several hours (or even several days), then five to ten again.

The servers are running Lustre 1.6.6, and are configured for IB and tcp (over eth0). There are 20 IB clients, and 200 tcp/eth0 clients. The system was relatively quiet while these errors were occurring.

modprobe.conf contains:

options ib_mthca msi_x=1
options lnet networks="tcp0(eth0),o2ib(ib0)"
options ko2iblnd ipif_name=ib0

Any idea what causes this, and how to resolve it?

Thanks.

Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 167-0: This client was evicted by lstr-ter-OST0003; in progress operations using this service will fail.
Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection restored to service lstr-ter-OST0003 using nid 172.16.103.26@tcp.
Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000: lstr-ter-OST0003_UUID now active, resetting orphans
Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 167-0: This client was evicted by lstr-ter-OST0003; in progress operations using this service will fail.
Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection restored to service lstr-ter-OST0003 using nid 172.16.103.26@tcp.
Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000: lstr-ter-OST0003_UUID now active, resetting orphans
Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 167-0: This client was evicted by lstr-ter-OST0003; in progress operations using this service will fail.
Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection restored to service lstr-ter-OST0003 using nid 172.16.103.26@tcp.
Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000: lstr-ter-OST0003_UUID now active, resetting orphans

Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com <http://www.terascala.com/>
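For anyone who wants to summarize a log like the one above, here is a minimal Python sketch that pulls out the obd_ping failures and the time between them. The function names, the regular expression, and the `year` parameter are illustrative assumptions, not part of Lustre; syslog lines carry no year, so one has to be supplied. For reference, on Linux errno 107 is ENOTCONN ("Transport endpoint is not connected").

```python
import re
from datetime import datetime

# Syslog timestamp at the start of the line, return code at the end of
# an "obd_ping operation failed" message.
PING_FAIL = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*"
    r"obd_ping operation failed with (?P<rc>-?\d+)"
)


def ping_failures(lines, year=2009):
    """Return (timestamp, return_code) for every obd_ping failure line."""
    events = []
    for line in lines:
        match = PING_FAIL.search(line)
        if match:
            # Syslog omits the year, so the caller provides it.
            stamp = datetime.strptime(
                f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, int(match.group("rc"))))
    return events


def gaps_in_seconds(events):
    """Seconds between consecutive failures, in log order."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(events, events[1:])
    ]


# Four lines taken from the log in this post (three failures, one eviction).
SAMPLE = [
    "Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107",
    "Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 167-0: This client was evicted by lstr-ter-OST0003; in progress operations using this service will fail.",
    "Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107",
    "Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107",
]

if __name__ == "__main__":
    failures = ping_failures(SAMPLE)
    print(len(failures), "obd_ping failures; gaps (s):", gaps_in_seconds(failures))
```

Run against the excerpt, the three failures come out exactly 250 seconds apart. Whether that regularity means anything would need a longer log to judge.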
Wang Yibin
2009-Mar-24 03:26 UTC
[Lustre-discuss] Intermittent "obd_ping operation failed" errors
It looks like the network of lstr-ter-OST0003 (IP: 172.16.103.26) is quite flaky. Can you please check the stability of this particular OST?
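One rough way to watch for a flaky TCP path to that server is to probe it repeatedly from an affected client. The sketch below is only a TCP-level check written in Python, not a Lustre tool: `tcp_reachable` and `probe` are hypothetical helper names, and port 988 (the default LNet acceptor port) is an assumption about this site's configuration. An LNet-level check such as `lctl ping` against the server's NID is the more direct test.

```python
import socket
import time


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host, port, count, interval=1.0, timeout=2.0):
    """Probe host:port `count` times and return the number of failed attempts."""
    failures = 0
    for attempt in range(count):
        if not tcp_reachable(host, port, timeout):
            failures += 1
            print(f"{time.strftime('%H:%M:%S')} connect to {host}:{port} failed")
        if attempt < count - 1:
            time.sleep(interval)
    return failures


# Example call (address from the log above; port 988 is an assumption):
#   probe("172.16.103.26", 988, count=60, interval=5.0)
```

A clean run only shows that TCP connects succeeded during the probe window; intermittent problems like the ones in this thread may need a long-running probe or switch and NIC error counters to catch.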