I cannot figure out what exactly has happened here and how to recover from it.

Jun 18 21:02:52 node0-eth1 kernel: LustreError: 2722:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.0.248
Jun 18 21:02:52 node0-eth1 kernel: LustreError: 11b-b: Connection to 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.0.248@tcp one of its NIDs?

For some reason, when I mount the OST on the above node it's trying to connect to itself on eth0, even though I have networks=tcp0(eth1) in my modprobe.conf and the NID is set to 192.168.1.248.

Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started
Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI

I'm trying to mount a client now, and for some reason it's using eth0, even though modprobe.conf says network=tcp0(eth1).

Jun 18 21:02:57 node0-eth1 kernel: Lustre: Request x1002438670 sent from data1-OST0005-osc-f4070600 to NID 192.168.0.248@tcp 5s ago has timed out (limit 5s).
Jun 18 21:03:05 node7-eth0 kernel: LustreError: 120-3: Refusing connection from 192.168.0.254 for 192.168.0.248@tcp: No matching NI
Jun 18 21:03:05 node1-eth0 kernel: Lustre: 2527:0:(import.c:508:import_select_connection()) data1-OST0005-osc: tried all connections, increasing latency to 30s
Jun 18 21:03:05 node1-eth0 kernel: Lustre: 2527:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Jun 18 21:03:05 node1-eth0 kernel: LustreError: 2520:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.0.248
Jun 18 21:03:05 node1-eth0 kernel: LustreError: 11b-b: Connection to 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.0.248@tcp one of its NIDs?
Jun 18 21:03:10 node0-eth1 ntpd[2321]: synchronized to 204.9.136.253, stratum 2
Jun 18 21:03:17 node0-eth1 kernel: Lustre: 2727:0:(import.c:508:import_select_connection()) data1-OST0005-osc-f4070600: tried all connections, increasing latency to 5s
Jun 18 21:03:17 node0-eth1 kernel: LustreError: 2719:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.0.248
Jun 18 21:03:17 node0-eth1 kernel: LustreError: 11b-b: Connection to 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.0.248@tcp one of its NIDs?
Jun 18 21:03:17 node7-eth0 kernel: LustreError: 120-3: Refusing connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI
Jun 18 21:03:27 node0-eth1 kernel: Lustre: Request x1002438682 sent from data1-OST0005-osc-f4070600 to NID 192.168.0.248@tcp 10s ago has timed out (limit 10s).
Jun 18 21:03:33 node3-eth0 ntpd[2315]: synchronized to 192.168.0.50, stratum 3

Then I get into this vicious cycle where the filesystem will mount, but I'm unable to df or ls it.

On Thu, Jun 18, 2009 at 09:11:50PM -0400, Michael Di Domenico wrote:
> I cannot figure out what exactly has happened here and how to recover from it.
>
> Jun 18 21:02:52 node0-eth1 kernel: LustreError:
> 2722:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading
> HELLO from 192.168.0.248
> Jun 18 21:02:52 node0-eth1 kernel: LustreError: 11b-b: Connection to
> 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it
> running a compatible version of Lustre and is 192.168.0.248@tcp one of
> its NIDs?

Lustre asked lnet to connect to 192.168.0.248@tcp.

> for some reason when i mount the OST on the above node it's trying to
> connect to itself on eth0, even though i have networks=tcp0(eth1) in
> my modprobe.conf and the NID is set to 192.168.1.248
>
> Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started
> Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing
> connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI

But the connection was rejected because the server didn't have 192.168.0.248@tcp as one of its NIDs.

What was your mount command line? What does 'lctl list_nids' say on the nodes?

Isaac

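The diagnosis above (peers ask for a NID that the server does not own) can be checked mechanically against a syslog excerpt. What follows is only a rough sketch, not part of Lustre: the function names `refused_targets` and `stale_nids` are made up for illustration, the regular expression is written against the "Refusing connection ... No matching NI" message exactly as it appears in this thread, and the configured NID list is assumed to be whatever `lctl list_nids` prints on the server (192.168.1.248@tcp here, going by the original post).

```python
import re

# Server-side refusal message as seen in this thread. The NID separator is
# accepted both as "@" and as the archive-obfuscated " at ".
REFUSED = re.compile(
    r"Refusing connection from (?P<peer>\S+) "
    r"for (?P<target>\d+\.\d+\.\d+\.\d+)(?:@| at )(?P<net>\w+): No matching NI"
)


def refused_targets(log_lines):
    """Return the set of NIDs that peers asked for and the server refused."""
    targets = set()
    for line in log_lines:
        match = REFUSED.search(line)
        if match:
            targets.add(f"{match.group('target')}@{match.group('net')}")
    return targets


def stale_nids(log_lines, configured_nids):
    """Refused NIDs that are absent from the server's own NID list."""
    return refused_targets(log_lines) - set(configured_nids)


if __name__ == "__main__":
    sample = [
        "Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started",
        "Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing "
        "connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI",
    ]
    # Assumed output of 'lctl list_nids' on the server, per the original post.
    print(sorted(stale_nids(sample, ["192.168.1.248@tcp"])))
```

Any NID this reports is one that clients or other servers still believe in but that lnet on the target node is not configured with, which points at stale configuration rather than at the current modprobe.conf.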
On Thu, Jun 18, 2009 at 9:48 PM, Isaac Huang<He.Huang at sun.com> wrote:
> On Thu, Jun 18, 2009 at 09:11:50PM -0400, Michael Di Domenico wrote:
>> I cannot figure out what exactly has happened here and how to recover from it.
>>
>> Jun 18 21:02:52 node0-eth1 kernel: LustreError:
>> 2722:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading
>> HELLO from 192.168.0.248
>> Jun 18 21:02:52 node0-eth1 kernel: LustreError: 11b-b: Connection to
>> 192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it
>> running a compatible version of Lustre and is 192.168.0.248@tcp one of
>> its NIDs?
>
> Lustre asked lnet to connect to 192.168.0.248@tcp.
>
>> for some reason when i mount the OST on the above node it's trying to
>> connect to itself on eth0, even though i have networks=tcp0(eth1) in
>> my modprobe.conf and the NID is set to 192.168.1.248
>>
>> Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started
>> Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing
>> connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI
>
> But the connection was rejected because the server didn't have
> 192.168.0.248@tcp as one of its NIDs.
>
> What was your mount command line? What does 'lctl list_nids' say on
> the nodes?

list_nids shows the right NID on all the nodes: 192.168.1.x@tcp

192.168.0.x does exist on all the nodes, but Lustre shouldn't ever be trying to use it.

On Thu, Jun 18, 2009 at 09:51:33PM -0400, Michael Di Domenico wrote:
> ......
> > But the connection was rejected because the server didn't have
> > 192.168.0.248@tcp as one of its NIDs.
> >
> > What was your mount command line? What does 'lctl list_nids' say on
> > the nodes?
>
> list_nids show the right nid on all the nodes 192.168.1.x@tcp
>
> 192.168.0.x does exist on all the nodes, but lustre shouldn't be
> trying to use it ever

Have you changed server NIDs without updating configuration logs with --writeconf?

Isaac

On Thu, Jun 18, 2009 at 9:57 PM, Isaac Huang<He.Huang at sun.com> wrote:
> On Thu, Jun 18, 2009 at 09:51:33PM -0400, Michael Di Domenico wrote:
>> ......
>> > But the connection was rejected because the server didn't have
>> > 192.168.0.248@tcp as one of its NIDs.
>> >
>> > What was your mount command line? What does 'lctl list_nids' say on
>> > the nodes?
>>
>> list_nids show the right nid on all the nodes 192.168.1.x@tcp
>>
>> 192.168.0.x does exist on all the nodes, but lustre shouldn't be
>> trying to use it ever
>
> Have you changed server NIDs without updating configuration logs with
> --writeconf?

By accident, the lnet configs came up with the 192.168.0.x config because a modprobe setting was wrong. However, I took the system back down and corrected the configuration.

So I guess the question is: how do I clear out the logs if a host comes up with a bad NID by accident?

On Fri, Jun 19, 2009 at 08:43:11AM -0400, Michael Di Domenico wrote:
> > ......
> > Have you changed server NIDs without updating configuration logs with
> > --writeconf?
>
> By accident the lnet configs came up with the 192.168.0.x config
> because a modprobe setting was wrong. However, i took the system back
> down and corrected the configuration.
>
> So i guess the question is how do i clear out the logs if a host comes
> up with a bad NID by accident?

I'm no expert on this, but you might try:
http://wiki.lustre.org/index.php/Mount_Conf#Changing_a_server_nid

Isaac

Look at doing a --writeconf with tunefs.lustre on all of your OSTs and MDTs. Then remount them with the correct settings.

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Michael Di Domenico
> Sent: Friday, June 19, 2009 6:43 AM
> To: Michael Di Domenico; Lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] lustre using wrong network
>
> On Thu, Jun 18, 2009 at 9:57 PM, Isaac Huang<He.Huang at sun.com> wrote:
> > On Thu, Jun 18, 2009 at 09:51:33PM -0400, Michael Di Domenico wrote:
> >> ......
> >> > But the connection was rejected because the server didn't have
> >> > 192.168.0.248@tcp as one of its NIDs.
> >> >
> >> > What was your mount command line? What does 'lctl list_nids' say on
> >> > the nodes?
> >>
> >> list_nids show the right nid on all the nodes 192.168.1.x@tcp
> >>
> >> 192.168.0.x does exist on all the nodes, but lustre shouldn't be
> >> trying to use it ever
> >
> > Have you changed server NIDs without updating configuration logs with
> > --writeconf?
>
> By accident the lnet configs came up with the 192.168.0.x config
> because a modprobe setting was wrong. However, i took the system back
> down and corrected the configuration.
>
> So i guess the question is how do i clear out the logs if a host comes
> up with a bad NID by accident?
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
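
The --writeconf suggestion above can be laid out as a dry-run plan before anything is touched. This is a sketch under stated assumptions, not an official procedure: `writeconf_plan` is a made-up helper, the device paths are placeholders, and the MDT-before-OST ordering follows the Lustre manual's procedure for regenerating configuration logs as I understand it. The only real command it emits is `tunefs.lustre --writeconf <device>`, and it prints the commands rather than running them.

```python
import shlex


def writeconf_plan(mdt_devices, ost_devices):
    """Return the tunefs.lustre commands in order, MDT(s) before OSTs.

    Nothing is executed here. The commands are meant to be reviewed and run
    by hand once every client and every target has been unmounted.
    """
    return [
        ["tunefs.lustre", "--writeconf", device]
        for device in [*mdt_devices, *ost_devices]
    ]


if __name__ == "__main__":
    # Placeholder device paths (assumptions); substitute the real block
    # devices backing the MDT and the OSTs.
    for command in writeconf_plan(["/dev/sdb1"], ["/dev/sdc1", "/dev/sdd1"]):
        print(shlex.join(command))
```

Once the logs have been regenerated, the targets would be remounted with the corrected modprobe setting in place (MGS/MDT first, then the OSTs, then the clients), after confirming that 'lctl list_nids' on each server shows the intended 192.168.1.x@tcp NID.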