thr3ads.net - Lustre discuss - [Lustre-discuss] list_nids shows wrong IP address [Aug 2007]

Wojciech Turek

2007-Aug-20 11:40 UTC

[Lustre-discuss] list_nids shows wrong IP address

Hi,

I have Lustre 1.6.1 configured with a combined MGS/MDS and one OSS.
The kernel version is 2.6.9-55.EL_lustre-1.6.1smp.
Both the OSS and the MGS/MDS machines are configured with two NIDs:

options lnet networks=tcp0(eth2),tcp1(eth1)

where
eth2 is in the 10.143.0.0/16 network
eth1 is in the 10.142.10.0/24 network
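
To make the server-side setting easier to read, here is a minimal Python sketch that splits a "networks" value like the one above into network names and interfaces. The helper name parse_lnet_networks is hypothetical and not part of Lustre; it only handles the simple "net(interface)" form shown in this post.

```python
import re


def parse_lnet_networks(value):
    """Map each LNET network name to the text inside its parentheses.

    Hypothetical helper for illustration only. A network listed without
    parentheses maps to an empty string.
    """
    pairs = re.findall(r"(\w+)(?:\(([^)]*)\))?", value)
    return {network: interfaces for network, interfaces in pairs}


# Value taken from the "options lnet" line quoted above.
servers = parse_lnet_networks("tcp0(eth2),tcp1(eth1)")
for network, interface in servers.items():
    print(f"{network} -> {interface}")
```

Note that the lctl list_nids output in this post reports the tcp0 network as "tcp".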

OSS
 > lctl list_nids
10.143.245.2@tcp
10.142.10.2@tcp1

MDS/MGS
 > lctl list_nids
10.143.245.3@tcp
10.142.10.3@tcp1

Client nodes have two interfaces, but only one is meant to be used. I
have the following line in /etc/modprobe.conf:
options lnet networks=tcp0(eth1)

eth0 - management interface, which is not configured for Lustre on this
particular client node; IP address 10.142.4.25, netmask 255.255.255.0
eth1 - data interface configured with lnet; IP address 10.143.4.25,
netmask 255.255.0.0

When I run this on the client node:
 > lctl list_nids
10.142.4.25@tcp
it shows the wrong address.
However,
 > lctl which_nid node-d25
10.143.4.25@tcp
shows the right address.
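
The mismatch described above can be written down as a small check. This is only a sketch in Python: the address table is copied from this post, and the helper names nid_for_interface and interface_for_nid are hypothetical, not lctl or Lustre functions.

```python
def nid_for_interface(addresses, ifname, network="tcp"):
    """Build the NID one would expect if LNET were bound to ifname."""
    return f"{addresses[ifname]}@{network}"


def interface_for_nid(addresses, nid):
    """Return the interface whose address matches the NID, or None."""
    ip = nid.split("@", 1)[0]
    for ifname, addr in addresses.items():
        if addr == ip:
            return ifname
    return None


# Values copied from the client node described in this post.
addresses = {"eth0": "10.142.4.25", "eth1": "10.143.4.25"}
configured_ifname = "eth1"          # from "options lnet networks=tcp0(eth1)"
reported_nid = "10.142.4.25@tcp"    # from "lctl list_nids"

expected_nid = nid_for_interface(addresses, configured_ifname)
actual_ifname = interface_for_nid(addresses, reported_nid)

if reported_nid != expected_nid:
    print(f"configured {configured_ifname} (expected {expected_nid}), "
          f"but list_nids reports {reported_nid}, the address of {actual_ifname}")
```

The check only restates the symptom: the reported NID carries the eth0 address even though eth1 is the configured interface. It does not say why.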

I have the following entry in /etc/fstab:
10.143.245.3@tcp0:/home-md      /home       lustre      defaults,nosuid,nodev 0 0

The node mounts /home correctly, and I can read from and write to Lustre
without any problem. However, I don't understand why list_nids shows the
wrong NID. Also, if I unmount /home on the client, the MGS/MDS syslog
shows the wrong address for the unmounted client:
Aug 20 15:52:25 storage03 kernel: Lustre: MGS: haven't heard from client 8b3ee19c-ffe8-091c-b504-bf2082cca83e (at 10.142.4.25@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 20 17:51:58 storage03 kernel: Lustre: MGS: haven't heard from client 33aaa972-83b9-cf56-a72e-1907d1e3641e (at 10.142.4.25@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 20 17:51:58 storage03 kernel: Lustre: Skipped 1 previous similar message
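
To collect the NIDs the MGS reports for evicted clients, the syslog lines above can be scanned with a regular expression. This is a rough Python sketch that assumes each message sits on a single line in the real log (the mail client wrapped them here); the pattern and the function name evicted_clients are illustrative and only match the message wording shown in this post.

```python
import re

# Pattern written against the message text quoted in this post only.
EVICTION_RE = re.compile(
    r"haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<nid>[^)]+)\) in (?P<seconds>\d+) seconds"
)


def evicted_clients(log_text):
    """Return (uuid, nid, seconds) for every eviction message found."""
    return [
        (m.group("uuid"), m.group("nid"), int(m.group("seconds")))
        for m in EVICTION_RE.finditer(log_text)
    ]


SAMPLE_LOG = (
    "Aug 20 15:52:25 storage03 kernel: Lustre: MGS: haven't heard from "
    "client 8b3ee19c-ffe8-091c-b504-bf2082cca83e (at 10.142.4.25@tcp) in "
    "227 seconds. I think it's dead, and I am evicting it.\n"
    "Aug 20 17:51:58 storage03 kernel: Lustre: Skipped 1 previous similar "
    "message\n"
)

for uuid, nid, seconds in evicted_clients(SAMPLE_LOG):
    print(uuid, nid, seconds)
```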

Also, there is no physical connection between ordinary clients and the
MGS/MDS and OSS on the 10.142.10.0/24 network.

I would appreciate it if someone could help me find a logical
explanation for my problem.

Regards

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517




Isaac Huang

2007-Aug-20 13:15 UTC

[Lustre-discuss] list_nids shows wrong IP address

On Mon, Aug 20, 2007 at 06:40:17PM +0100, Wojciech Turek wrote:
[...]
> eth0 - management interface which is not configured for lustre on  
> this particular client node, IP address: 10.142.4.25 netmask  
> 255.255.255.0
> eth1 - data interface configured with lnet, IP address 10.143.4.25  
> netmask 255.255.0.0
> 
> When I do on client node:
> > lctl list_nids
> 10.142.4.25@tcp
> Which is wrong address.
Can you please also let me know:
What does ifconfig say on this node?
What does "dmesg | grep 'Added.LNI'" say after lnet
module is loaded?
> However
> > lctl which_nid node-d25
> 10.143.4.25@tcp
> This is right address.
According to the Lustre manual:

To get the best remote nid - 
$ lctl which_nid 
This will take the "best" nid from a list of the nids of a remote
host. The "best" nid is the one the local node will use when trying to
communicate with the remote node.

So lctl believed that 10.143.4.25@tcp is a remote nid, and this
doesn't contradict what "lctl list_nids" said.

Wojciech Turek

2007-Aug-20 13:23 UTC

[Lustre-discuss] list_nids shows wrong IP address

On 20 Aug 2007, at 20:15, Isaac Huang wrote:
> On Mon, Aug 20, 2007 at 06:40:17PM +0100, Wojciech Turek wrote:
> [...]
>> eth0 - management interface which is not configured for lustre on
>> this particular client node, IP address: 10.142.4.25 netmask
>> 255.255.255.0
>> eth1 - data interface configured with lnet, IP address 10.143.4.25
>> netmask 255.255.0.0
>>
>> When I do on client node:
>>> lctl list_nids
>> 10.142.4.25@tcp
>> Which is wrong address.
>
> Can you please also let me know:
> What does ifconfig say on this node?

eth0      Link encap:Ethernet  HWaddr 00:13:72:FB:09:66
           inet addr:10.142.4.25  Bcast:10.142.4.255  Mask:255.255.255.0
           inet6 addr: fe80::213:72ff:fefb:966/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
           RX packets:429602 errors:0 dropped:0 overruns:0 frame:0
           TX packets:286189 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:42096900 (40.1 MiB)  TX bytes:27083024 (25.8 MiB)
           Interrupt:169 Memory:f8000000-f8011100

eth1      Link encap:Ethernet  HWaddr 00:13:72:FB:09:68
           inet addr:10.143.4.25  Bcast:10.143.255.255  Mask:255.255.0.0
           inet6 addr: fe80::213:72ff:fefb:968/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
           RX packets:181050667 errors:0 dropped:0 overruns:0 frame:0
           TX packets:167463035 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:213247025573 (198.6 GiB)  TX bytes:215923168460 (201.0 GiB)
           Interrupt:169 Memory:f4000000-f4011100

lo        Link encap:Local Loopback
           inet addr:127.0.0.1  Mask:255.0.0.0
           inet6 addr: ::1/128 Scope:Host
           UP LOOPBACK RUNNING  MTU:16436  Metric:1
           RX packets:240 errors:0 dropped:0 overruns:0 frame:0
           TX packets:240 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:20520 (20.0 KiB)  TX bytes:20520 (20.0 KiB)

> What does "dmesg | grep 'Added.LNI'" say after lnet module is loaded?

 > dmesg | grep 'Added.LNI'
Lustre: Added LNI 10.142.1.25@tcp [8/256]

>
>> However
>>> lctl which_nid node-d25
>> 10.143.4.25@tcp
>> This is right address.
>
> According to the Lustre manual:
>
> To get the best remote nid -
> $ lctl which_nid
> This will take the "best" nid from a list of the nids of a remote
> host. The "best" nid is the one the local node will use when trying to
> communicate with the remote node.
>
> So lctl believed that 10.143.4.25@tcp is a remote nid, and this
> doesn't contradict what "lctl list_nids" said.
>
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517




Wojciech Turek

2007-Aug-20 13:26 UTC

[Lustre-discuss] list_nids shows wrong IP address

Hi,
>
> On 20 Aug 2007, at 20:15, Isaac Huang wrote:
>
>> On Mon, Aug 20, 2007 at 06:40:17PM +0100, Wojciech Turek wrote:
>> [...]
>>> eth0 - management interface which is not configured for lustre on
>>> this particular client node, IP address: 10.142.4.25 netmask
>>> 255.255.255.0
>>> eth1 - data interface configured with lnet, IP address 10.143.4.25
>>> netmask 255.255.0.0
>>>
>>> When I do on client node:
>>>> lctl list_nids
>>> 10.142.4.25@tcp
>>> Which is wrong address.
>>
>> Can you please also let me know:
>> What does ifconfig say on this node?
> eth0      Link encap:Ethernet  HWaddr 00:13:72:FB:09:66
>           inet addr:10.142.4.25  Bcast:10.142.4.255  Mask:255.255.255.0
>           inet6 addr: fe80::213:72ff:fefb:966/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:429602 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:286189 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:42096900 (40.1 MiB)  TX bytes:27083024 (25.8 MiB)
>           Interrupt:169 Memory:f8000000-f8011100
>
> eth1      Link encap:Ethernet  HWaddr 00:13:72:FB:09:68
>           inet addr:10.143.4.25  Bcast:10.143.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::213:72ff:fefb:968/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:181050667 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:167463035 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:213247025573 (198.6 GiB)  TX bytes:215923168460 (201.0 GiB)
>           Interrupt:169 Memory:f4000000-f4011100
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:240 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:240 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:20520 (20.0 KiB)  TX bytes:20520 (20.0 KiB)
>
>> What does "dmesg | grep 'Added.LNI'" say after lnet module is loaded?
> > dmesg | grep 'Added.LNI'
> Lustre: Added LNI 10.142.1.25@tcp [8/256]

I need to correct this, as I ran it on the wrong node:
 > dmesg | grep 'Added.LNI'
Lustre: Added LNI 10.142.4.25@tcp [8/256]
The ifconfig output is from the right node.

>>
>>> However
>>>> lctl which_nid node-d25
>>> 10.143.4.25@tcp
>>> This is right address.
>>
>> According to the Lustre manual:
>>
>> To get the best remote nid -
>> $ lctl which_nid
>> This will take the "best" nid from a list of the nids of a remote
>> host. The "best" nid is the one the local node will use when trying to
>> communicate with the remote node.
>>
>> So lctl believed that 10.143.4.25@tcp is a remote nid, and this
>> doesn't contradict what "lctl list_nids" said.
>>
>
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service
> email: wjt27@cam.ac.uk
> tel. +441223763517
>
>
>
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517




Isaac Huang

2007-Aug-21 08:25 UTC

[Lustre-discuss] list_nids shows wrong IP address

On Mon, Aug 20, 2007 at 08:23:09PM +0100, Wojciech Turek wrote:
> eth0      Link encap:Ethernet  HWaddr 00:13:72:FB:09:66
>           inet addr:10.142.4.25  Bcast:10.142.4.255  Mask:255.255.255.0
>           inet6 addr: fe80::213:72ff:fefb:966/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:429602 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:286189 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:42096900 (40.1 MiB)  TX bytes:27083024 (25.8 MiB)
>           Interrupt:169 Memory:f8000000-f8011100
> 
> eth1      Link encap:Ethernet  HWaddr 00:13:72:FB:09:68
>           inet addr:10.143.4.25  Bcast:10.143.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::213:72ff:fefb:968/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:181050667 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:167463035 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:213247025573 (198.6 GiB)  TX bytes:215923168460 (201.0 GiB)
>           Interrupt:169 Memory:f4000000-f4011100

The traffic really went through eth1, as can be seen from "RX bytes" and
"TX bytes" above.

> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:240 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:240 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:20520 (20.0 KiB)  TX bytes:20520 (20.0 KiB)
> 
> > What does "dmesg | grep 'Added.LNI'" say after lnet module is loaded?
> > dmesg | grep 'Added.LNI'
> Lustre: Added LNI 10.142.1.25@tcp [8/256]

This looks weird. Can you please run "lctl --net tcp print_interfaces"
and give me the output?

Can you also unload all lustre/lnet modules, and load lnet separately by
running "modprobe lnet networks='tcp(eth1)' config_on_load=1"? Does it
change anything?

Isaac
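
The observation in the reply above, that the traffic went through eth1, can be reproduced by comparing the byte counters in the ifconfig output. Here is a minimal Python sketch that assumes the old net-tools layout quoted in this thread ("RX bytes:N ... TX bytes:N" on one line); the function name byte_counters is hypothetical and the sample is trimmed to the relevant lines.

```python
import re


def byte_counters(ifconfig_text):
    """Return {interface: (rx_bytes, tx_bytes)} from net-tools style output."""
    counters = {}
    current = None
    for line in ifconfig_text.splitlines():
        # An interface block starts with a name in the first column.
        if line and not line[0].isspace():
            current = line.split()[0]
        match = re.search(r"RX bytes:(\d+).*TX bytes:(\d+)", line)
        if match and current:
            counters[current] = (int(match.group(1)), int(match.group(2)))
    return counters


# Counter lines copied from the ifconfig output quoted in this thread.
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:13:72:FB:09:66
          RX bytes:42096900 (40.1 MiB)  TX bytes:27083024 (25.8 MiB)

eth1      Link encap:Ethernet  HWaddr 00:13:72:FB:09:68
          RX bytes:213247025573 (198.6 GiB)  TX bytes:215923168460 (201.0 GiB)
"""

counters = byte_counters(SAMPLE)
busiest = max(counters, key=lambda name: sum(counters[name]))
print(busiest, counters[busiest])
```

The counters include all traffic on each interface, not only Lustre traffic, so this is an indication rather than proof.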
