thr3ads.net - Lustre discuss - Failure to connect to some OST from a client machine [Sep 2013]

Bob Ball

2013-Sep-05 19:56 UTC

Failure to connect to some OST from a client machine

We are running Lustre 2.1.6 on Scientific Linux 6.4, kernel 
2.6.32-358.11.1.el6.x86_64.  This was an upgrade from Lustre 1.8.4 on SL5.

We have had a few situations lately where a client stops talking to some 
subset of the OST (about 58 of these total on 8 OSS, nearly 500TB in 
total).  I have a couple of questions.

1. "lctl dl" on the OSS shows a smaller count on the affected servers;
on the client, all OSS showed UP in "lctl dl".  Today, I first tried 
rebooting this OSS, but that did not change the situation.  I ended up 
rebooting the client before I could get full connectivity.  Is there any 
way from the client to get the reconnect, short of rebooting that client?

2. It used to be the case under Lustre 1.8.4 that I could run "lfs df 
-h" on the client, and see all OST, even those where the connection was 
not working, for whatever reason.  That is no longer the case; now the 
lfs command stops at the first non-talking OST. This seems more like a 
bug than a feature.  Is there some other way to see a list of 
non-communicating OST on a client?

Thanks in advance for any help offered.

bob

Kris Howard

2013-Sep-05 20:01 UTC

Re: [HPDD-discuss] Failure to connect to some OST from a client machine

Might check lctl show_route and look for downed routes.



_______________________________________________
Lustre-discuss mailing list
Lustre-discuss-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Bob Ball

2013-Sep-05 20:14 UTC

Re: [HPDD-discuss] Failure to connect to some OST from a client machine

That is an interesting mix.  Nothing shows up at all on the clients, 
even on those 3 that route to a second NIC.  On the OSS, it is quite the 
mix of up/down on the 3 routers, with no obvious pattern.

Most of our traffic is on the 10.10 network, with the 3 machines shown 
below routing to a small number of clients on a more public network.

FYI, the current situation is one in which all machines are happy, as 
far as I can tell.

bob

Running lctl show_route on all machines in lustre_fss.txt
On umdist05.local
net               tcp2 hops 1 gw                   10.10.1.52@tcp down
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
          Succeeded
On umfs06.local
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
net               tcp2 hops 1 gw                   10.10.1.52@tcp up
          Succeeded
On umdist01.local
net               tcp2 hops 1 gw                   10.10.1.52@tcp down
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
          Succeeded
On umdist02.local
net               tcp2 hops 1 gw                   10.10.1.52@tcp down
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
          Succeeded
On umdist03.local
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.52@tcp up
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
          Succeeded
On umdist04.local
net               tcp2 hops 1 gw                   10.10.1.52@tcp down
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
          Succeeded
On umdist07.local
net               tcp2 hops 1 gw                   10.10.1.50@tcp down
net               tcp2 hops 1 gw                   10.10.1.52@tcp down
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
          Succeeded
On umdist08.local
net               tcp2 hops 1 gw                   10.10.1.50@tcp down
net               tcp2 hops 1 gw                   10.10.1.52@tcp down
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
          Succeeded
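
[Editor's sketch] The mix of up/down routes above is easier to compare across servers once it is tabulated. The following is a minimal Python sketch, not part of the original post, that parses text in the same layout as the aggregated output shown here ("On <host>" headers followed by "net ... gw ... up|down" lines) and lists the down gateways per host. The function names and the SAMPLE excerpt are illustrative assumptions; the line layout is taken only from the output pasted in this message and may differ on other Lustre versions.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   net               tcp2 hops 1 gw                   10.10.1.52@tcp down
# (layout assumed from the output pasted in this thread)
ROUTE_RE = re.compile(
    r"^net\s+(\S+)\s+hops\s+(\d+)\s+gw\s+(\S+)\s+(up|down)$"
)


def parse_show_route(text):
    """Return {host: [(net, gateway_nid, state), ...]} from aggregated output."""
    routes = defaultdict(list)
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("On "):
            # Header written by the wrapper script, e.g. "On umdist05.local"
            host = line[3:].strip()
            continue
        match = ROUTE_RE.match(line)
        if match and host is not None:
            net, _hops, gateway, state = match.groups()
            routes[host].append((net, gateway, state))
    return dict(routes)


def down_gateways(routes):
    """Return {host: sorted list of gateway NIDs whose route is 'down'}."""
    return {
        host: sorted(gw for _net, gw, state in entries if state == "down")
        for host, entries in routes.items()
    }


# Two hosts copied from the output above, used here as sample input.
SAMPLE = """\
Running lctl show_route on all machines in lustre_fss.txt
On umdist05.local
net               tcp2 hops 1 gw                   10.10.1.52@tcp down
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
          Succeeded
On umfs06.local
net               tcp2 hops 1 gw                   10.10.1.51@tcp down
net               tcp2 hops 1 gw                   10.10.1.50@tcp up
net               tcp2 hops 1 gw                   10.10.1.52@tcp up
          Succeeded
"""

for host_name, gateways in down_gateways(parse_show_route(SAMPLE)).items():
    print(host_name, "down:", ", ".join(gateways) or "none")
```

Feeding the full output to parse_show_route would give one entry per OSS, which makes it simple to spot servers such as umdist07 and umdist08 where all three gateways are reported down.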


Kris Howard

2013-Sep-05 20:22 UTC

Re: [HPDD-discuss] Failure to connect to some OST from a client machine

If you run lctl ping 10.10.X.XX@tcp from both sides, it should bring the
route up.
With all of those routes down, everything is happy?
Hmm.
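
[Editor's sketch] The ping suggestion above can be scripted when several gateways are involved. This minimal Python sketch, not part of the original message, builds one "lctl ping <nid>" command per gateway NID and, by default, only prints the commands (dry run). The helper names and the dry_run flag are illustrative assumptions; whether a ping actually brings a route back up is the suggestion made in this thread and is not verified here.

```python
import subprocess


def ping_commands(nids):
    """Build one 'lctl ping <nid>' argument list per gateway NID."""
    return [["lctl", "ping", nid] for nid in nids]


def ping_all(nids, dry_run=True):
    """Ping each NID; return {nid: True/False}, or {nid: None} on a dry run.

    With dry_run=True nothing is executed and the commands are only printed,
    so the sketch can be tried on a machine without Lustre installed.
    """
    results = {}
    for cmd in ping_commands(nids):
        nid = cmd[2]
        if dry_run:
            print(" ".join(cmd))
            results[nid] = None
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[nid] = proc.returncode == 0
    return results


# Gateway NIDs taken from the show_route output earlier in the thread.
GATEWAYS = ["10.10.1.50@tcp", "10.10.1.51@tcp", "10.10.1.52@tcp"]
ping_all(GATEWAYS, dry_run=True)
```

Since the advice is to ping "from both sides", the same idea would have to be run on each OSS against the routers and on each router against the OSS NIDs; the list above covers only the OSS-to-router direction.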



Bob Ball

2013-Sep-05 20:27 UTC

Re: [HPDD-discuss] Failure to connect to some OST from a client machine

There are very few clients right now on the tcp2 network.  AFAIK, they 
are happy.  No one is complaining, but they may not be using the Lustre 
storage right now, either.

bob

