thr3ads.net - Lustre discuss - [Lustre-discuss] lustre using wrong network [Jun 2009]

Michael Di Domenico

2009-Jun-19 01:11 UTC

[Lustre-discuss] lustre using wrong network

I cannot figure out what exactly has happened here and how to recover from it.

Jun 18 21:02:52 node0-eth1 kernel: LustreError: 2722:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.0.248
Jun 18 21:02:52 node0-eth1 kernel: LustreError: 11b-b: Connection to 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.0.248@tcp one of its NIDs?

for some reason when i mount the OST on the above node it's trying to
connect to itself on eth0, even though i have networks=tcp0(eth1) in
my modprobe.conf and the NID is set to 192.168.1.248

Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started
Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI

I'm trying to mount a client now, and for some reason it's using eth0,
even though modprobe.conf says network=tcp0(eth1)

Jun 18 21:02:57 node0-eth1 kernel: Lustre: Request x1002438670 sent from data1-OST0005-osc-f4070600 to NID 192.168.0.248@tcp 5s ago has timed out (limit 5s).
Jun 18 21:03:05 node7-eth0 kernel: LustreError: 120-3: Refusing connection from 192.168.0.254 for 192.168.0.248@tcp: No matching NI
Jun 18 21:03:05 node1-eth0 kernel: Lustre: 2527:0:(import.c:508:import_select_connection()) data1-OST0005-osc: tried all connections, increasing latency to 30s
Jun 18 21:03:05 node1-eth0 kernel: Lustre: 2527:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Jun 18 21:03:05 node1-eth0 kernel: LustreError: 2520:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.0.248
Jun 18 21:03:05 node1-eth0 kernel: LustreError: 11b-b: Connection to 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.0.248@tcp one of its NIDs?
Jun 18 21:03:10 node0-eth1 ntpd[2321]: synchronized to 204.9.136.253, stratum 2
Jun 18 21:03:17 node0-eth1 kernel: Lustre: 2727:0:(import.c:508:import_select_connection()) data1-OST0005-osc-f4070600: tried all connections, increasing latency to 5s
Jun 18 21:03:17 node0-eth1 kernel: LustreError: 2719:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.0.248
Jun 18 21:03:17 node0-eth1 kernel: LustreError: 11b-b: Connection to 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.0.248@tcp one of its NIDs?
Jun 18 21:03:17 node7-eth0 kernel: LustreError: 120-3: Refusing connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI
Jun 18 21:03:27 node0-eth1 kernel: Lustre: Request x1002438682 sent from data1-OST0005-osc-f4070600 to NID 192.168.0.248@tcp 10s ago has timed out (limit 10s).
Jun 18 21:03:33 node3-eth0 ntpd[2315]: synchronized to 192.168.0.50, stratum 3

Then i get in this vicious cycle where the filesystem will mount,
but i'm unable to df or ls it
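
The symptom in the excerpts above is that every refused or reset connection targets a 192.168.0.x NID, while the NIDs are meant to live on 192.168.1.x. A minimal sketch for pulling that out of a larger syslog, assuming the raw log carries NIDs in the usual address@tcp form; the function name unexpected_nids and its prefix argument are made up for illustration and are not part of Lustre:

```shell
#!/bin/sh
# Hypothetical helper: list the distinct NIDs mentioned in Lustre syslog
# lines that are NOT on the expected Lustre subnet.
# $1 = expected address prefix (for example "192.168.1.")
# Log text is read from stdin.
unexpected_nids() {
    prefix="$1"
    # Pull out every address@tcp token, de-duplicate, then drop the ones
    # whose address starts with the expected prefix.
    grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+@tcp[0-9]*' \
        | sort -u \
        | awk -v p="$prefix" 'index($0, p) != 1'
}

# Example (the log path is an assumption):
#   unexpected_nids 192.168.1. < /var/log/messages
```

Any NID this prints is one that some node is still being told to contact on the wrong network.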

Isaac Huang

2009-Jun-19 01:48 UTC

On Thu, Jun 18, 2009 at 09:11:50PM -0400, Michael Di Domenico wrote:
> I cannot figure out what exactly has happened here and how to recover
> from it.
>
> Jun 18 21:02:52 node0-eth1 kernel: LustreError:
> 2722:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading
> HELLO from 192.168.0.248
> Jun 18 21:02:52 node0-eth1 kernel: LustreError: 11b-b: Connection to
> 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it
> running a compatible version of Lustre and is 192.168.0.248@tcp one of
> its NIDs?

Lustre asked lnet to connect to 192.168.0.248@tcp.

> for some reason when i mount the OST on the above node it's trying to
> connect to itself on eth0, even though i have networks=tcp0(eth1) in
> my modprobe.conf and the NID is set to 192.168.1.248
>
> Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started
> Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing
> connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI

But the connection was rejected because the server didn't have
192.168.0.248@tcp as one of its NIDs.

What was your mount command line? What does 'lctl list_nids' say on
the nodes?

Isaac
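
One way to answer the second question is to compare what the module configuration asks for with what LNET actually brought up. A rough sketch, assuming the lnet options live in a modprobe-style file such as /etc/modprobe.conf; the helper name lnet_networks is invented for this example, while lctl list_nids is the real command mentioned above:

```shell
#!/bin/sh
# Hypothetical helper: print the value of the lnet "networks" option from a
# modprobe-style config file.
# $1 = path to the file (for example /etc/modprobe.conf).
# Prints nothing if no "options lnet ... networks=..." line is present,
# which also happens when the option name is misspelled.
lnet_networks() {
    sed -n 's/^[[:space:]]*options[[:space:]][[:space:]]*lnet[[:space:]].*networks=\([^[:space:]]*\).*/\1/p' "$1"
}

# On a live node, compare the configured value with the NIDs in use:
#   lnet_networks /etc/modprobe.conf
#   lctl list_nids
```

If the first command prints tcp0(eth1) and the second prints an address on the eth1 subnet, the local LNET setup matches the intent, and any remaining references to the other subnet are coming from somewhere else.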

Michael Di Domenico

2009-Jun-19 01:51 UTC

On Thu, Jun 18, 2009 at 9:48 PM, Isaac Huang <He.Huang at sun.com> wrote:
> On Thu, Jun 18, 2009 at 09:11:50PM -0400, Michael Di Domenico wrote:
>> I cannot figure out what exactly has happened here and how to recover
>> from it.
>>
>> Jun 18 21:02:52 node0-eth1 kernel: LustreError:
>> 2722:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading
>> HELLO from 192.168.0.248
>> Jun 18 21:02:52 node0-eth1 kernel: LustreError: 11b-b: Connection to
>> 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it
>> running a compatible version of Lustre and is 192.168.0.248@tcp one of
>> its NIDs?
>
> Lustre asked lnet to connect to 192.168.0.248@tcp.
>
>> for some reason when i mount the OST on the above node it's trying to
>> connect to itself on eth0, even though i have networks=tcp0(eth1) in
>> my modprobe.conf and the NID is set to 192.168.1.248
>>
>> Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started
>> Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing
>> connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI
>
> But the connection was rejected because the server didn't have
> 192.168.0.248@tcp as one of its NIDs.
>
> What was your mount command line? What does 'lctl list_nids' say on
> the nodes?

list_nids show the right nid on all the nodes 192.168.1.x@tcp

192.168.0.x does exist on all the nodes, but lustre shouldn't be
trying to use it ever

Isaac Huang

2009-Jun-19 01:57 UTC

On Thu, Jun 18, 2009 at 09:51:33PM -0400, Michael Di Domenico wrote:
> ......
> > But the connection was rejected because the server didn't have
> > 192.168.0.248@tcp as one of its NIDs.
> >
> > What was your mount command line? What does 'lctl list_nids' say on
> > the nodes?
>
> list_nids show the right nid on all the nodes 192.168.1.x@tcp
>
> 192.168.0.x does exist on all the nodes, but lustre shouldn't be
> trying to use it ever

Have you changed server NIDs without updating configuration logs with
--writeconf?

Isaac

Michael Di Domenico

2009-Jun-19 12:43 UTC

On Thu, Jun 18, 2009 at 9:57 PM, Isaac Huang <He.Huang at sun.com> wrote:
> On Thu, Jun 18, 2009 at 09:51:33PM -0400, Michael Di Domenico wrote:
>> ......
>> > But the connection was rejected because the server didn't have
>> > 192.168.0.248@tcp as one of its NIDs.
>> >
>> > What was your mount command line? What does 'lctl list_nids' say on
>> > the nodes?
>>
>> list_nids show the right nid on all the nodes 192.168.1.x@tcp
>>
>> 192.168.0.x does exist on all the nodes, but lustre shouldn't be
>> trying to use it ever
>
> Have you changed server NIDs without updating configuration logs with
> --writeconf?

By accident the lnet configs came up with the 192.168.0.x config
because a modprobe setting was wrong. However, i took the system back
down and corrected the configuration.

So i guess the question is how do i clear out the logs if a host comes
up with a bad NID by accident?

Isaac Huang

2009-Jun-19 14:47 UTC

On Fri, Jun 19, 2009 at 08:43:11AM -0400, Michael Di Domenico wrote:
> > ......
> > Have you changed server NIDs without updating configuration logs with
> > --writeconf?
>
> By accident the lnet configs came up with the 192.168.0.x config
> because a modprobe setting was wrong. However, i took the system back
> down and corrected the configuration.
>
> So i guess the question is how do i clear out the logs if a host comes
> up with a bad NID by accident?

I'm no expert on this, but you might try:
http://wiki.lustre.org/index.php/Mount_Conf#Changing_a_server_nid

Isaac

Lundgren, Andrew

2009-Jun-19 16:55 UTC

Look at doing a --writeconf with tunefs.lustre on all of your OSTs and MDTs.
Then remount them with the correct settings.
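
As a rough illustration of that suggestion, the sketch below only prints the tunefs.lustre commands in the usual order (MDT first, then each OST) instead of running them. The helper name writeconf_plan and the device paths are placeholders invented for this example; substitute the real block devices, and unmount all clients and targets before running the real commands:

```shell
#!/bin/sh
# Dry run: print the tunefs.lustre commands for regenerating the
# configuration logs, MDT first and then each OST. Nothing is executed.
# $1 = MDT device, remaining arguments = OST devices.
# The device paths used below are hypothetical placeholders.
writeconf_plan() {
    mdt="$1"
    shift
    echo "tunefs.lustre --writeconf $mdt"
    for ost in "$@"; do
        echo "tunefs.lustre --writeconf $ost"
    done
}

writeconf_plan /dev/mdt_dev /dev/ost0_dev /dev/ost1_dev
```

After the real commands have been run, the targets are normally remounted in the same order (MGS/MDT, then OSTs) before any client mounts, so that each server registers again with the NID it has now.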
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-
> bounces at lists.lustre.org] On Behalf Of Michael Di Domenico
> Sent: Friday, June 19, 2009 6:43 AM
> To: Michael Di Domenico; Lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] lustre using wrong network
> 
> On Thu, Jun 18, 2009 at 9:57 PM, Isaac Huang <He.Huang at sun.com> wrote:
> > On Thu, Jun 18, 2009 at 09:51:33PM -0400, Michael Di Domenico wrote:
> >> ......
> >> > But the connection was rejected because the server didn't have
> >> > 192.168.0.248@tcp as one of its NIDs.
> >> >
> >> > What was your mount command line? What does 'lctl list_nids' say on
> >> > the nodes?
> >>
> >> list_nids show the right nid on all the nodes 192.168.1.x@tcp
> >>
> >> 192.168.0.x does exist on all the nodes, but lustre shouldn't be
> >> trying to use it ever
> >
> > Have you changed server NIDs without updating configuration logs with
> > --writeconf?
> 
> By accident the lnet configs came up with the 192.168.0.x config
> because a modprobe setting was wrong.  However, i took the system back
> down and corrected the configuration.
> 
> So i guess the question is how do i clear out the logs if a host comes
> up with a bad NID by accident?
