Patrick Shopbell
2013-Apr-29 23:28 UTC
[Lustre-discuss] OSTs inactive on one client (only)
Hi everyone,

I have seen this question here before, but without a very satisfactory answer. One of our half a dozen clients has lost access to a set of OSTs:

  > lfs osts
  OBDS::
  0: lustre-OST0000_UUID ACTIVE
  1: lustre-OST0001_UUID ACTIVE
  2: lustre-OST0002_UUID INACTIVE
  3: lustre-OST0003_UUID INACTIVE
  4: lustre-OST0004_UUID INACTIVE
  5: lustre-OST0005_UUID ACTIVE
  6: lustre-OST0006_UUID ACTIVE

All OSTs show as completely fine on the other clients, and the system is working there. In addition, I have run numerous checks of the IB network (ibhosts, ibping, etc.), and I do not see any networking issues.

Moreover, the OSSs include:

  OSS #1 --> OST #0, #1, #2
  OSS #2 --> OST #3, #4, #5
  OSS #3 --> OST #6

So, the machine is seeing two of three OSTs on OSS #1 and one of three OSTs on OSS #2. It is showing some OSTs on an OSS as active and others as inactive. So this does not seem to be a networking issue.

I am getting a set of errors on that client periodically:

  Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0003_UUID: rc = -110 waiting for callback (3 != 0)
  Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:324:ptlrpc_invalidate_import()) Skipped 18 previous similar messages
  Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff8803b45c6c00 x1430098383471272/t0(0) o101->lustre-OST0003-osc-ffff880331f33400@192.168.100.103@o2ib:28/4 lens 328/352 e 0 to 0 dl 1367194410 ref 1 fl Interpret:RE/0/0 rc -5/0
  Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:350:ptlrpc_invalidate_import()) Skipped 61 previous similar messages
  Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0003_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
  Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:366:ptlrpc_invalidate_import()) Skipped 18 previous similar messages

I seem to recall some talk of what happens when a client or two does a lot of I/O and sort of takes over. Indeed, a couple of the other clients are very busily using Lustre. But still, I would have hoped that this client (abacus) would have regained its connections after a few hours.

Any ideas as to what I can do, short of rebooting the client? I am nervous about that leaving incomplete I/O.

Thanks,
Patrick Shopbell
pls at astro.caltech.edu
Hi Patrick,

Verify interconnect health from those clients to the OSS hosting those OST's.

-cf
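
The reasoning in the original message (some OSTs on the same OSS are active while others are inactive, so a plain network fault looks unlikely) can be checked mechanically. The following is a minimal sketch, not a Lustre tool: it parses the `lfs osts` output pasted in the post and groups the states by OSS using the mapping given there. The function names and the "OSS #n" labels are hypothetical, chosen for illustration only.

```python
import re

# Output of `lfs osts` exactly as pasted in the original message.
LFS_OSTS_OUTPUT = """\
OBDS::
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID INACTIVE
3: lustre-OST0003_UUID INACTIVE
4: lustre-OST0004_UUID INACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
"""

# OSS -> OST index mapping as described in the post (labels are illustrative).
OSS_TO_OSTS = {
    "OSS #1": [0, 1, 2],
    "OSS #2": [3, 4, 5],
    "OSS #3": [6],
}

# Matches lines such as "2: lustre-OST0002_UUID INACTIVE".
LINE_RE = re.compile(r"^\s*(\d+):\s+(\S+)\s+(ACTIVE|INACTIVE)\s*$")


def parse_lfs_osts(text):
    """Return {ost_index: state} for every OST line in the given text."""
    states = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            states[int(match.group(1))] = match.group(3)
    return states


def summarize_by_oss(states, oss_to_osts):
    """Split each server's OST indices into active and inactive lists."""
    summary = {}
    for oss, indices in oss_to_osts.items():
        summary[oss] = {
            "active": [i for i in indices if states.get(i) == "ACTIVE"],
            "inactive": [i for i in indices if states.get(i) == "INACTIVE"],
        }
    return summary


def mixed_state_servers(summary):
    """Return the servers that have both active and inactive OSTs."""
    return [oss for oss, s in summary.items() if s["active"] and s["inactive"]]


if __name__ == "__main__":
    summary = summarize_by_oss(parse_lfs_osts(LFS_OSTS_OUTPUT), OSS_TO_OSTS)
    for oss, s in summary.items():
        print(f"{oss}: active={s['active']} inactive={s['inactive']}")
    print("Servers with mixed OST states:", mixed_state_servers(summary))
```

A server that appears in the mixed-state list is reachable for at least one of its OSTs, which is the observation the original message relies on. It does not by itself rule out an interconnect problem, so the health check suggested in the reply is still worth doing.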