Hi,

I have Lustre 1.6.1 configured with a combined MGS/MDS and one OSS. The kernel version is 2.6.9-55.EL_lustre-1.6.1smp.

Both the OSS and the MGS/MDS machines are configured with two NIDs:

options lnet networks=tcp0(eth2),tcp1(eth1)

where eth2 is in the 10.143.0.0/16 network and eth1 is in the 10.142.10.0/24 network.

OSS
> lctl list_nids
10.143.245.2@tcp
10.142.10.2@tcp1

MDS/MGS
> lctl list_nids
10.143.245.3@tcp
10.142.10.3@tcp1

Client nodes have two interfaces, but only one is meant to be used. I have the following line in /etc/modprobe.conf:

options lnet networks=tcp0(eth1)

eth0 - management interface, which is not configured for Lustre on this particular client node, IP address 10.142.4.25, netmask 255.255.255.0
eth1 - data interface configured with lnet, IP address 10.143.4.25, netmask 255.255.0.0

When I run this on the client node:
> lctl list_nids
10.142.4.25@tcp
I get the wrong address. However,
> lctl which_nid node-d25
10.143.4.25@tcp
gives the right address.

I have the following entry in /etc/fstab:

10.143.245.3@tcp0:/home-md /home lustre defaults,nosuid,nodev 0 0

The node mounts /home correctly, and I can read from and write to Lustre without any problem. However, I don't understand why list_nids shows the wrong NID. Also, if I umount /home on the client, the MGS/MDS syslog shows the wrong address for the unmounted client:

Aug 20 15:52:25 storage03 kernel: Lustre: MGS: haven't heard from client 8b3ee19c-ffe8-091c-b504-bf2082cca83e (at 10.142.4.25@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 20 17:51:58 storage03 kernel: Lustre: MGS: haven't heard from client 33aaa972-83b9-cf56-a72e-1907d1e3641e (at 10.142.4.25@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Aug 20 17:51:58 storage03 kernel: Lustre: Skipped 1 previous similar message

Also, there is no physical connection between ordinary clients and the MGS/MDS or OSS on the 10.142.10.0/24 network.

I would appreciate it if someone could help me find a logical explanation for my problem.
Regards

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517
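The "networks=" option in the message above maps each LNet network to an interface. As a minimal sketch (Python, not part of Lustre; the function name parse_lnet_networks is hypothetical), the option value can be split into network/interface pairs to double-check what a given modprobe.conf line asks for. It assumes the simple "net(iface,...)" form used in this thread, and it treats a bare "tcp" as "tcp0", which matches how list_nids prints 10.143.245.2@tcp for the tcp0 network.

```python
import re


def parse_lnet_networks(spec):
    """Return {network: [interfaces]} for a value like 'tcp0(eth2),tcp1(eth1)'.

    Hypothetical helper for illustration only; it handles just the simple
    'net(iface,...)' form seen in this thread.
    """
    networks = {}
    for match in re.finditer(r"([a-z][a-z0-9]*)(?:\(([^)]*)\))?", spec):
        net, ifaces = match.group(1), match.group(2)
        # Assumption: a network name without a trailing number means network 0.
        if not net[-1].isdigit():
            net += "0"
        networks[net] = [i.strip() for i in ifaces.split(",")] if ifaces else []
    return networks


if __name__ == "__main__":
    print(parse_lnet_networks("tcp0(eth2),tcp1(eth1)"))  # server setting
    print(parse_lnet_networks("tcp0(eth1)"))             # client setting
```

For the client line, the sketch yields a single network, tcp0, tied to eth1, so a NID carrying the eth0 address is not what the configuration requests.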
On Mon, Aug 20, 2007 at 06:40:17PM +0100, Wojciech Turek wrote:
[...]
> eth0 - management interface which is not configured for lustre on
> this particular client node, IP address: 10.142.4.25 netmask
> 255.255.255.0
> eth1 - data interface configured with lnet, IP address 10.143.4.25
> netmask 255.255.0.0
>
> When I do on client node:
> > lctl list_nids
> 10.142.4.25@tcp
> Which is wrong address.

Can you please also let me know:
What does ifconfig say on this node?
What does "dmesg | grep 'Added.LNI'" say after lnet module is loaded?

> However
> > lctl which_nid node-d25
> 10.143.4.25@tcp
> This is right address.

According to the Lustre manual:

    To get the best remote nid -
    $ lctl which_nid
    This will take the "best" nid from a list of the nids of a remote
    host. The "best" nid is the one the local node will use when trying to
    communicate with the remote node.

So lctl believed that 10.143.4.25@tcp is a remote nid, and this doesn't contradict what "lctl list_nids" said.
On 20 Aug 2007, at 20:15, Isaac Huang wrote:
> On Mon, Aug 20, 2007 at 06:40:17PM +0100, Wojciech Turek wrote:
> [...]
>> eth0 - management interface which is not configured for lustre on
>> this particular client node, IP address: 10.142.4.25 netmask
>> 255.255.255.0
>> eth1 - data interface configured with lnet, IP address 10.143.4.25
>> netmask 255.255.0.0
>>
>> When I do on client node:
>>> lctl list_nids
>> 10.142.4.25@tcp
>> Which is wrong address.
>
> Can you please also let me know:
> What does ifconfig say on this node?

eth0      Link encap:Ethernet  HWaddr 00:13:72:FB:09:66
          inet addr:10.142.4.25  Bcast:10.142.4.255  Mask:255.255.255.0
          inet6 addr: fe80::213:72ff:fefb:966/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:429602 errors:0 dropped:0 overruns:0 frame:0
          TX packets:286189 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:42096900 (40.1 MiB)  TX bytes:27083024 (25.8 MiB)
          Interrupt:169 Memory:f8000000-f8011100

eth1      Link encap:Ethernet  HWaddr 00:13:72:FB:09:68
          inet addr:10.143.4.25  Bcast:10.143.255.255  Mask:255.255.0.0
          inet6 addr: fe80::213:72ff:fefb:968/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:181050667 errors:0 dropped:0 overruns:0 frame:0
          TX packets:167463035 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:213247025573 (198.6 GiB)  TX bytes:215923168460 (201.0 GiB)
          Interrupt:169 Memory:f4000000-f4011100

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:240 errors:0 dropped:0 overruns:0 frame:0
          TX packets:240 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20520 (20.0 KiB)  TX bytes:20520 (20.0 KiB)

> What does "dmesg | grep 'Added.LNI'" say after lnet module is loaded?

> dmesg | grep 'Added.LNI'
Lustre: Added LNI 10.142.1.25@tcp [8/256]

>> However
>>> lctl which_nid node-d25
>> 10.143.4.25@tcp
>> This is right address.
>
> According to the Lustre manual:
>
> To get the best remote nid -
> $ lctl which_nid
> This will take the "best" nid from a list of the nids of a remote
> host. The "best" nid is the one the local node will use when trying to
> communicate with the remote node.
>
> So lctl believed that 10.143.4.25@tcp is a remote nid, and this
> doesn't contradict what "lctl list_nids" said.
Hi,

> On 20 Aug 2007, at 20:15, Isaac Huang wrote:
> [...]
>> What does "dmesg | grep 'Added.LNI'" say after lnet module is loaded?
>
> > dmesg | grep 'Added.LNI'
> Lustre: Added LNI 10.142.1.25@tcp [8/256]

I need to correct this, as I ran it on the wrong node:

> dmesg | grep 'Added.LNI'
Lustre: Added LNI 10.142.4.25@tcp [8/256]

The ifconfig output is from the right node.
On Mon, Aug 20, 2007 at 08:23:09PM +0100, Wojciech Turek wrote:
> eth0      Link encap:Ethernet  HWaddr 00:13:72:FB:09:66
>           inet addr:10.142.4.25  Bcast:10.142.4.255  Mask:255.255.255.0
>           inet6 addr: fe80::213:72ff:fefb:966/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:429602 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:286189 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:42096900 (40.1 MiB)  TX bytes:27083024 (25.8 MiB)
>           Interrupt:169 Memory:f8000000-f8011100
>
> eth1      Link encap:Ethernet  HWaddr 00:13:72:FB:09:68
>           inet addr:10.143.4.25  Bcast:10.143.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::213:72ff:fefb:968/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:181050667 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:167463035 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:213247025573 (198.6 GiB)  TX bytes:215923168460 (201.0 GiB)
>           Interrupt:169 Memory:f4000000-f4011100

The traffic really went through eth1, as can be seen from "RX bytes" and "TX bytes" above.

> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:240 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:240 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:20520 (20.0 KiB)  TX bytes:20520 (20.0 KiB)
>
>> What does "dmesg | grep 'Added.LNI'" say after lnet module is loaded?
>
> dmesg | grep 'Added.LNI'
> Lustre: Added LNI 10.142.1.25@tcp [8/256]

This looks weird. Can you please run "lctl --net tcp print_interfaces" and send me the output?

Can you also unload all lustre/lnet modules, and load lnet separately by running "modprobe lnet networks='tcp(eth1)' config_on_load=1"? Does it change anything?

Isaac
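With the ifconfig output and the corrected "Added LNI" line (10.142.4.25@tcp) from this thread side by side, one way to see which interface LNet picked is to match the NID's address against the interface addresses. The sketch below is in Python and is only an illustration: the helper names are hypothetical, and the addresses are copied from the messages above. It shows the mismatch but does not explain why LNet chose that interface.

```python
import ipaddress

# Client interface addresses, copied from the ifconfig output in this thread.
INTERFACES = {
    "eth0": ipaddress.ip_interface("10.142.4.25/24"),
    "eth1": ipaddress.ip_interface("10.143.4.25/16"),
}


def interface_for_nid(nid, interfaces):
    """Return the interface whose address equals the IP part of a NID, or None."""
    addr = ipaddress.ip_address(nid.split("@", 1)[0])
    for name, iface in interfaces.items():
        if iface.ip == addr:
            return name
    return None


def interface_reaching(target_ip, interfaces):
    """Return the interface whose directly attached subnet contains target_ip."""
    addr = ipaddress.ip_address(target_ip)
    for name, iface in interfaces.items():
        if addr in iface.network:
            return name
    return None


if __name__ == "__main__":
    # NID reported by "Added LNI" and "lctl list_nids" on the client.
    print("NID belongs to:", interface_for_nid("10.142.4.25@tcp", INTERFACES))
    # MGS/MDS address used in /etc/fstab.
    print("MGS subnet is on:", interface_reaching("10.143.245.3", INTERFACES))
```

Under these assumptions the reported NID carries the eth0 address, while the MGS/MDS address sits on the subnet attached to eth1. That is consistent with the byte counters showing the traffic on eth1 even though the NID shows the eth0 address.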