I have a client (one of our login nodes) that was evicted by one of the OSTs but not both of them, so some files are accessible and others are not. The strange thing is that both OSTs live on the same OSS. The errors in dmesg are:

  LustreError: 11-0: an error occurred while communicating with 141.212.30.181@tcp. The obd_ping operation failed with -107
  Lustre: nobackup-OST0001-osc-000001007d548400: Connection to service nobackup-OST0001 via nid 141.212.30.181@tcp was lost; in progress operations using this service will wait for recovery to complete.
  LustreError: 167-0: This client was evicted by nobackup-OST0001; in progress operations using this service will fail.
  LustreError: 29595:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
  LustreError: 29629:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

OST0000 also lives at 141.212.30.181, so it's strange that only one of them would evict the client.

Is there a way to ask Lustre to restore this connection? Up till this point the client would recover quickly, but this time it's just waiting.

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Jan 24, 2008 10:23 -0500, Brock Palen wrote:
> I have a client (one of our login nodes) that was evicted by one of
> the OSTs but not both of them. So some files are accessible others
> are not. Strange thing is that both the OSTs live on the same OSS.
>
> Is there a way to ask lustre to restore this? Up
> till this point, the client would recover quickly, but this time its
> just waiting.

You could try "lctl --device {OSC device in question} recover".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

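The "{OSC device in question}" in the suggestion above is a device index on the client. As a rough sketch only (not something posted in this thread), the index could be looked up from the output of "lctl dl". The helper name find_osc_device is hypothetical, and the column layout (index, state, type, name, uuid, refcount) is an assumption that may differ between Lustre versions:

```shell
# Hypothetical helper: print the device index of the OSC that talks to
# the given OST. Reads `lctl dl`-style text on stdin and ASSUMES the
# columns are: index  state  type  name  uuid  refcount.
find_osc_device() {
    # $1 is the OST name, e.g. nobackup-OST0001; OSC device names are
    # expected to start with "<OST name>-osc".
    awk -v t="$1" '$3 == "osc" && index($4, t "-osc") == 1 { print $1 }'
}

# Possible use on the affected client (left as comments, not run here):
#   dev=$(lctl dl | find_osc_device nobackup-OST0001)
#   [ -n "$dev" ] && lctl --device "$dev" recover
```

If the helper prints nothing, no matching OSC was found and the recover command should not be attempted with an empty device number.
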
On the client I tried

  lctl --device $number deactivate

which worked, followed by

  lctl --device $number activate

which I believe should have done the same thing. This failed without any error notice to me.

I ended up having to umount and mount, which finally reconnected the OST.

At 12:55 PM -0700 1/25/08, Andreas Dilger wrote:
>You could try "lctl --device {OSC device in question} recover".

--
}}}===============>> LLNL
James E. Harm (Jim); jharm at llnl.gov
System Administrator, ICCD Clusters
(925) 422-4018 Page: 423-7705x57152

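Since the activate step above gave no error notice, one way to see whether it took effect might be to look at the state column of "lctl dl" afterwards. This is only a sketch under assumptions: the helper name osc_state is hypothetical, the column layout (index, state, type, name, uuid, refcount) is assumed, and the reading of "UP" as active and "IN" as inactive should be checked against the Lustre version in use:

```shell
# Hypothetical helper: print the state column of the OSC for a given OST.
# Reads `lctl dl`-style text on stdin; ASSUMES the second column is the
# device state (e.g. "UP" for active, "IN" for inactive).
osc_state() {
    awk -v t="$1" '$3 == "osc" && index($4, t "-osc") == 1 { print $2 }'
}

# Possible sequence on the client (left as comments, not run here):
#   lctl --device "$number" deactivate
#   lctl --device "$number" activate
#   lctl dl | osc_state nobackup-OST0001
```

A device being listed as active does not by itself prove that the client has reconnected to the OST; dmesg on the client remains the place to confirm that.
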
Is there a tool that will really attempt a reconnect from a client to a single OST? It would be helpful for those rare cases when this happens and there is nothing really wrong with either. I imagine the original cause could be something as simple as repeated delays on a very busy network. Other OSTs from the same OSS remained connected to the same client during this problem. If umount and mount could be avoided, it would be less disruptive to other processes on the client.

At 2:10 PM -0800 1/25/08, Jim Harm wrote:
>i ended up having to umount and mount, which finally reconnected the ost.

--
}}}===============>> LLNL
James E. Harm (Jim); jharm at llnl.gov
System Administrator, ICCD Clusters
(925) 422-4018 Page: 423-7705x57152

On Feb 04, 2008 08:31 -0800, Jim Harm wrote:
> Is there a tool that will really attempt a reconnect from a client to
> a single OST?
> it would be helpful for those rare cases
> when this happens and there is nothing really wrong with either.
> i imagine original cause could be something as simple as repeated delays
> on a very busy network?
> Other OSTs from the same OSS remained connected to the same client
> during this problem.
> If umount and mount could be avoided,
> it would be less disruptive to other processes on the client.

You can use "echo_client" to perform operations on a single OST. See the lustre-iokit obdfilter-survey for usage details.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.