--- On Thu, 1/14/10, Arden Wiebe <albert682 at yahoo.com> wrote:

> DM:
>
> Your mount command is wrong - try this format.
>
> mount -t lustre 192.168.0.7@tcp0:/ioio /mnt/ioio
>
> > [root@oss ~]# mkfs.lustre --fsname=datafs --ost --mgsnode=192.168.0.2@tcp0
> > [root@oss ~]# mount -t lustre /dev/lustre/OST /lustre/OSS
> > mount.lustre: mount /dev/lustre/OST at /lustre/OSS failed: Input/output error
> > Is the MGS running?
>
> So, substituting the values you supplied, your mount line should read:
>
> mount -t datafs 192.168.0.2@tcp0:/datafs /mnt/datafs
>
> Enjoy the required reading and testing. I found that naming things uniquely helped me clarify what was actually required. Try calling your filesystem "Dusty" or "Mark" and that should make things clearer for you.
>
> --- On Thu, 1/14/10, Andreas Dilger <adilger at sun.com> wrote:
>
> > From: Andreas Dilger <adilger at sun.com>
> > Subject: Re: [Lustre-discuss] Unable to activate OST
> > To: "Dusty Marks" <dustynmarks at gmail.com>
> > Cc: "lustre-discuss at lists.lustre.org discuss" <lustre-discuss at lists.lustre.org>
> > Date: Thursday, January 14, 2010, 9:03 PM
> >
> > On 2010-01-14, at 23:51, Dusty Marks wrote:
> > > You are correct, there is information in messages. Following are the entries related to Lustre. The line that says 192.168.0.2@tcp is unreachable makes sense, but what exactly is the problem? I entered the line "options lnet networks=tcp" in modprobe.conf on the OSS and MDS. The only difference was, I entered that line AFTER I set up Lustre on the OSS. Could that be the problem? I don't see why that would be the problem, as the OSS is trying to reach the MDS/MGS, which is 192.168.0.2.
> > >
> > > --------------------------------------- /var/log/messages ---------------------------------------
> > > Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.0.2/988
> > > Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.0.2@tcp at host 192.168.0.2 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
> >
> > Please read the chapter in the manual about network configuration. I suspect the .0.2 network is not your eth0 network interface, and your modprobe.conf needs to be fixed.
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
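The "Error -113" in the log excerpt above is a negative Linux errno value. On Linux, 113 is EHOSTUNREACH ("No route to host"), which matches the "was unreachable" text on the following log line. Below is a minimal sketch for decoding the few connect-related values likely to show up in such lines. The function name lnet_errno_name is made up for illustration, and the numbers apply to Linux only.

```shell
# Map the (negative) errno printed in a Lustre console line to a Linux errno name.
# Only a handful of connect()-related values are covered; numbers are Linux-specific.
lnet_errno_name() {
    code=${1#-}    # accept "-113" as well as "113"
    case "$code" in
        101) echo "ENETUNREACH" ;;
        110) echo "ETIMEDOUT" ;;
        111) echo "ECONNREFUSED" ;;
        113) echo "EHOSTUNREACH" ;;
        *)   echo "unknown" ;;
    esac
}

# Example: lnet_errno_name -113
```

EHOSTUNREACH on a connect can come from a missing route, but also from a firewall on the target host rejecting the connection, so both are worth checking.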
On 2010-01-15, at 00:21, Arden Wiebe wrote:
> Your mount command is wrong - try this format.
>
> mount -t lustre 192.168.0.7@tcp0:/ioio /mnt/ioio
>
> So, substituting the values you supplied, your mount line should read:
>
> mount -t datafs 192.168.0.2@tcp0:/datafs /mnt/datafs

No, that isn't correct. You are showing the mount command for a client. It is the OST that is failing to mount, likely because the network is not configured correctly. The OST always needs to contact the MGS node on its first mount in order to join the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
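To make the distinction concrete: a server target such as an OST is mounted from its block device, while a client mounts by MGS NID and filesystem name. In both cases the type given to -t is lustre. The helper below only prints the commands (a dry run) and runs nothing. lustre_mount_cmd is a hypothetical name, and the device, NID and mount points are the ones quoted in this thread.

```shell
# Dry-run helper: prints (does not execute) the mount command for each role.
lustre_mount_cmd() {
    role=$1
    case "$role" in
        ost)
            # Server target: block device -> local mount point
            echo "mount -t lustre $2 $3"
            ;;
        client)
            # Client: MGS NID and filesystem name -> local mount point
            echo "mount -t lustre $2:/$3 $4"
            ;;
        *)
            echo "usage: lustre_mount_cmd ost DEVICE DIR | client MGSNID FSNAME DIR" >&2
            return 1
            ;;
    esac
}

# Example:
#   lustre_mount_cmd ost /dev/lustre/OST /lustre/OSS
#   lustre_mount_cmd client 192.168.0.2@tcp0 datafs /mnt/datafs
```

The OST form is the one that failed in the original report. The client form only becomes relevant once the OST has registered with the MGS.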
I did some googling and found the command lctl ping. So I went on the OSS and typed in "lctl ping 192.168.0.2@tcp". This errored out with an I/O error.

It is quite obvious that I've simply misconfigured the network. Could someone explain how to properly configure it?

I don't understand what the entry in modprobe.conf actually means, so I cannot say what should be entered.

Each one of my machines has one NIC (eth0). What do I enter in modprobe.conf to make this work correctly? If I update the entry in modprobe.conf, do I have to redo anything, or does Lustre pick up the changes without restarting anything?

Thanks all for the help so far.

- Dusty

On Fri, Jan 15, 2010 at 10:36 AM, Dusty Marks <dustynmarks at gmail.com> wrote:

> I searched through the manual, and the only section I could find dealing with networking configuration is section 4.1.0.2, titled "Module Setup", in the Lustre 1.8 operations manual.
>
> It tells me to run the command modprobe -v lustre "networks=tcp0(eth0)", and I did so on the MDS; however, it errored out with:
>
> [root@mds ~]# modprobe -v lustre "networks=tcp0(eth0)"
> insmod /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko networks=tcp0(eth0)
> FATAL: Error inserting lustre (/lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko): Unknown symbol in module, or unknown parameter (see dmesg)
>
> dmesg says nothing, but messages says this:
> Jan 15 10:27:48 mds kernel: lustre: Unknown parameter `networks'
>
> I even tried adding "options lnet networks=tcp0(eth0)"; however, that didn't work either.
>
> I'm terribly sorry for my incompetence, but I'm having a difficult time understanding Lustre's abstractions.
>
> Each one of my nodes has a single ethernet card (eth0).

--
The graduate with a Science degree asks, "Why does it work?" The graduate with an Engineering degree asks, "How does it work?" The graduate with an Accounting degree asks, "How much will it cost?" The graduate with an Arts degree asks, "Do you want fries with that?"
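On the "Unknown parameter" error: the kernel message shows that networks was rejected by the lustre module. networks is an option of the lnet module, which is why the modprobe.conf entry starts with "options lnet". For a node with a single eth0, the entry would normally be "options lnet networks=tcp0(eth0)". The sketch below, assuming the /etc/modprobe.conf file mentioned in the thread, lists the lnet option lines found in a file. lnet_options is a hypothetical helper name.

```shell
# Print the "options lnet ..." lines from a modprobe configuration file.
# Defaults to /etc/modprobe.conf (an assumption based on the thread).
lnet_options() {
    conf=${1:-/etc/modprobe.conf}
    grep -E '^[[:space:]]*options[[:space:]]+lnet[[:space:]]' "$conf"
}

# Example: lnet_options /etc/modprobe.conf
```

Module options are read when a module is loaded, so an edited line should only take effect after the Lustre and LNet modules have been unloaded and loaded again, or after a reboot.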
Hi,

Could you please post the output of the 'lctl list_nids' command on the OSS and on the MDS? This will show us which network was configured to work with Lustre.

Regarding the entries in modprobe.conf: they tell the lnet module which NIC (or NICs) will be configured to work with Lustre. If your modprobe.conf doesn't have an lnet options line, by default Lustre will configure the first NIC, which is usually eth0.

Below is a modprobe.conf entry from my Lustre setup. My OSS(s) and MDS(s) have two NICs, eth0 and eth1, and an InfiniBand NIC, ib0. The IB is set to work as IPoIB, so Lustre treats it as an ordinary Ethernet NIC.

    options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0)

So the line above means that:
  the first Lustre network, tcp0, is configured on interface ib0
  the second Lustre network, tcp1, is configured on interface eth1
  the third Lustre network, tcp2, is configured on alias interface eth1:0

eth0 is not mentioned on this line because I have chosen not to configure it to work with Lustre.

Once the lnet module is loaded, you can check which network or networks are configured to work with Lustre using the 'lctl list_nids' command.

Cheers

Wojciech

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
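The mapping described above can be sketched as a small parser that turns a networks= value into one "network interface" pair per line. This is an illustration only: lnet_networks is a hypothetical name, and the sketch assumes one interface per network, so it does not handle several interfaces inside one pair of parentheses.

```shell
# Print one "network interface" pair per line from an lnet networks= setting.
# Accepts either the bare value or a full "options lnet networks=..." line.
# Limitation: assumes a single interface per network, e.g. tcp0(eth0).
lnet_networks() {
    # Drop everything up to and including "networks=", and any double quotes
    spec=$(printf '%s\n' "$1" | sed -e 's/.*networks=//' -e 's/"//g')
    # Split on commas, then rewrite "net(iface)" as "net iface"
    printf '%s\n' "$spec" | tr ',' '\n' | sed -e 's/^\([^(]*\)(\(.*\))$/\1 \2/'
}

# Example: lnet_networks 'options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0)'
```

An entry without parentheses, such as plain tcp, is printed unchanged, which corresponds to the case where LNet picks the interface itself.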
The output of lctl list_nids on the OSS is:

    [root@oss ~]# lctl list_nids
    192.168.0.3@tcp

and from the MDS:

    [root@mds ~]# lctl list_nids
    192.168.0.2@tcp

Thanks,
Dusty
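A NID has the form address@network, and a network name without a number (tcp) is the same network as tcp0. So the two NIDs above are on the same LNet network as the 192.168.0.2@tcp0 passed to --mgsnode. A minimal sketch of that comparison follows; the helper names nid_addr, nid_net and same_lnet_net are hypothetical.

```shell
# Address part of a NID, e.g. 192.168.0.2@tcp -> 192.168.0.2
nid_addr() {
    echo "${1%@*}"
}

# Network part of a NID, normalised so that "tcp" becomes "tcp0"
nid_net() {
    net=${1#*@}
    case "$net" in
        *[0-9]) echo "$net" ;;
        *)      echo "${net}0" ;;
    esac
}

# Succeeds when both NIDs are on the same LNet network
same_lnet_net() {
    [ "$(nid_net "$1")" = "$(nid_net "$2")" ]
}

# Example: same_lnet_net 192.168.0.3@tcp 192.168.0.2@tcp0 && echo "same network"
```

Matching networks only show that the LNet configuration is consistent. They say nothing about whether traffic between the two hosts actually gets through.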
Can you check whether you can ping the MDS and OSS using the normal ping command?

--
Wojciech Turek
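The point of comparing a normal ping with lctl ping is to tell an IP-level problem apart from an LNet-level one. The sketch below only encodes that decision. It takes the exit statuses of the two commands as arguments, so it can be read without a Lustre node at hand. diagnose_reach is a hypothetical name, and port 988 is the destination port shown in the log lines earlier in the thread.

```shell
# Classify a connectivity problem from two exit statuses:
#   $1 = exit status of "ping -c 3 HOST"      (plain ICMP ping)
#   $2 = exit status of "lctl ping HOST@tcp"  (LNet ping)
diagnose_reach() {
    icmp_status=$1
    lnet_status=$2
    if [ "$icmp_status" -ne 0 ]; then
        # No IP connectivity: check addresses, cabling and routes first
        echo "ip"
    elif [ "$lnet_status" -ne 0 ]; then
        # IP works but LNet does not: check firewall rules for TCP port 988
        # and the "options lnet networks=" line on both nodes
        echo "lnet"
    else
        echo "ok"
    fi
}

# Example, run on the OSS:
#   ping -c 3 192.168.0.2;      icmp=$?
#   lctl ping 192.168.0.2@tcp;  lnet=$?
#   diagnose_reach "$icmp" "$lnet"
```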
Could you also post the syslog messages from the OSS here?

2010/1/16 Wojciech Turek <wjt27 at cam.ac.uk>:
> Can you check if you can ping MDS and OSS using normal ping command?
>
> 2010/1/16 Dusty Marks <dustynmarks at gmail.com>:
>> the output of lctl list_nids on the oss is
>>
>> [root@oss ~]# lctl list_nids
>> 192.168.0.3@tcp
>>
>> and from the mds
>>
>> [root@mds ~]# lctl list_nids
>> 192.168.0.2@tcp
>>
>> Thanks,
>> Dusty
>>
>> On Fri, Jan 15, 2010 at 5:39 PM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
>>> Hi,
>>>
>>> Could you please post output of the 'lctl list_nids' command on OSS
>>> system and on MDS system. This will show us which network was
>>> configured to work with lustre.
>>>
>>> Regarding entries in the modprobe.conf, they tell lnet module which
>>> NIC or multiple NICs will be configured to work with lustre. If your
>>> modprobe.conf doesn't have lnet options line, by default Lustre will
>>> configure the first NIC which is usually eth0.
>>> Below is a modprobe.conf entry from my lustre setup.
>>> My OSS(s) and MDS(s) have 2 NICs eth0 and eth1 and an Infiniband NIC
>>> ib0. The IB is set to work as IPoIB so lustre treats it as an ordinary
>>> Ethernet NIC
>>> options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0)
>>> So the line above means that:
>>> - first lustre network tcp0 is configured on interface ib0
>>> - second lustre network tcp1 is configured on interface eth1
>>> - third lustre network tcp2 is configured on alias interface eth1:0
>>>
>>> eth0 is not mentioned on this line because I have chosen not to
>>> configure it to work with lustre.
>>>
>>> Once lnet module is loaded you can check which network or networks are
>>> configured to work with Lustre using 'lctl list_nids' command
>>>
>>> Cheers
>>>
>>> Wojciech
>>>
>>> 2010/1/15 Dusty Marks <dustynmarks at gmail.com>:
>>>> I did some googling and i found the command lctl ping. So i went on
>>>> the oss and typed in "lctl ping 192.168.0.2@tcp". This errored out
>>>> with an I/O error.
>>>>
>>>> It is quite obvious that i've simply misconfigured the network. Could
>>>> someone explain how to properly configure it?
>>>>
>>>> I don't understand what the entry in modprobe actually means, so i
>>>> cannot say what should be entered.
>>>>
>>>> Each one of my machines has one NIC (eth0). What do i enter in
>>>> modprobe.conf to make this work correctly? If i update the entry in
>>>> modprobe.conf, do i have to redo anything? Or does lustre pick up on
>>>> the changes without restarting anything?
>>>>
>>>> Thanks all for the help so far.
>>>>
>>>> - Dusty
>>>>
>>>> On Fri, Jan 15, 2010 at 10:36 AM, Dusty Marks <dustynmarks at gmail.com> wrote:
>>>>> I searched through the manual, and the only section i could find
>>>>> dealing with networking configuration is section 4.1.0.2 titled
>>>>> "Module Setup" in the Lustre 1.8 operations manual.
>>>>>
>>>>> It tells me to run the command modprobe -v lustre
>>>>> "networks=tcp0(eth0)", and i did such on the MDS, however it
>>>>> errored out with:
>>>>>
>>>>> [root@mds ~]# modprobe -v lustre "networks=tcp0(eth0)"
>>>>> insmod /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko networks=tcp0(eth0)
>>>>> FATAL: Error inserting lustre (/lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko): Unknown symbol in module, or unknown parameter (see dmesg)
>>>>>
>>>>> dmesg says nothing, but message says this:
>>>>> Jan 15 10:27:48 mds kernel: lustre: Unknown parameter `networks'
>>>>>
>>>>> I even tried adding "options lnet networks=tcp0(eth0)" however that
>>>>> didn't work either
>>>>>
>>>>> I'm terribly sorry for my incompetence, but i'm having a difficult
>>>>> time understanding lustre's abstractions.
>>>>>
>>>>> Each one of my nodes have a single ethernet card (eth0)
>>>>>
>>>>> On Thu, Jan 14, 2010 at 11:32 PM, Andreas Dilger <adilger at sun.com> wrote:
>>>>>> On 2010-01-15, at 00:21, Arden Wiebe wrote:
>>>>>>> Your mount command is wrong - try this format.
>>>>>>>
>>>>>>> mount -t lustre 192.168.0.7@tcp0:/ioio /mnt/ioio
>>>>>>>
>>>>>>> So by substitution for supplied your mount line should read:
>>>>>>>
>>>>>>> mount -t datafs 192.168.0.2@tcp0:/datafs /mnt/datafs
>>>>>>
>>>>>> No, that isn't correct. You are showing the mount command for a
>>>>>> client. It is the OST that is failing to mount, likely because
>>>>>> the network is not configured correctly, and the OST needs to
>>>>>> contact the MGS node always on the first mount in order to join
>>>>>> the filesystem.
>>>>>>
>>>>>>> Enjoy the required reading and testing. I found by naming things
>>>>>>> uniquely helped me clarify what was actually required. Try
>>>>>>> calling your filesystem "Dusty" or "Mark" and that should make
>>>>>>> things clearer for you.
>>>>>>>
>>>>>>> --- On Thu, 1/14/10, Andreas Dilger <adilger at sun.com> wrote:
>>>>>>>> On 2010-01-14, at 23:51, Dusty Marks wrote:
>>>>>>>>> You are correct, there is information in messages. Following
>>>>>>>>> are the entries related the lustre. The line that says
>>>>>>>>> 192.168.0.2@tcp is unreachable makes sense, but what exactly
>>>>>>>>> is the problem? I entered the line "options lnet networks=tcp"
>>>>>>>>> in modprobe.conf on the oss and mds. The only difference was,
>>>>>>>>> i entered that line AFTER i setup lustre on the OSS. Could
>>>>>>>>> that be the problem? I don't see why that would be the
>>>>>>>>> problem, as the oss is trying to reach the MDS/MGS, which is
>>>>>>>>> 192.168.0.2.
>>>>>>>>>
>>>>>>>>> ---------------------------------------
>>>>>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.0.2/988
>>>>>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.0.2@tcp at host 192.168.0.2 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
>>>>>>>>
>>>>>>>> Please read the chapter in the manual about network
>>>>>>>> configuration. I suspect the .0.2 network is not your eth0
>>>>>>>> network interface, and your modprobe.conf needs to be fixed.
>>>>>>
>>>>>> Cheers, Andreas
>>>>>> --
>>>>>> Andreas Dilger
>>>>>> Sr. Staff Engineer, Lustre Group
>>>>>> Sun Microsystems of Canada, Inc.

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
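The networks= value quoted in this message is a comma-separated list of LNET network names, each followed by an interface in parentheses. As a rough illustration only, the sketch below splits such a value into network/interface pairs. The helper name parse_lnet_networks is hypothetical and is not part of Lustre; it only parses the string and does not touch the lnet module.

```shell
# Hypothetical helper (not a Lustre tool): split the value of an
# "options lnet networks=..." line into "network interface" pairs.
parse_lnet_networks() {
    # $1 is the text after "networks=", e.g. "tcp0(ib0),tcp1(eth1)"
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -E 's/^([^()]+)\(([^()]+)\)$/\1 \2/'
}

# The example line from the thread:
parse_lnet_networks 'tcp0(ib0),tcp1(eth1),tcp2(eth1:0)'

# The single-NIC form mentioned for a node with only eth0:
parse_lnet_networks 'tcp0(eth0)'
```

An entry without parentheses, such as the bare "tcp" used earlier in the thread, is printed unchanged, which matches the description that Lustre then picks the first NIC by default.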
Yes. MDS and OSS can be pinged using the normal ping command. (ping 192.168.0.2 and ping 192.168.0.3)

I am not sure what you want me to post for syslog? I always thought syslog was just a config file telling an application where to append certain error messages.

thanks
dusty

On Fri, Jan 15, 2010 at 7:19 PM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> Could you also post here syslog messages from the OSS ?

--
The graduate with a Science degree asks, "Why does it work?" The graduate with an Engineering degree asks, "How does it work?" The graduate with an Accounting degree asks, "How much will it cost?" The graduate with an Arts degree asks, "Do you want fries with that?"
Dusty Marks wrote:
> Yes. MDS and OSS can be pinged using the normal ping command. (ping
> 192.168.0.2 and ping 192.168.0.3)
>
> I am not sure what you want me to post for syslog? I always thought
> syslog was just a config file telling an application where to append
> certain error messages

syslog in this case means /var/log/messages.

Richard

> thanks
> dusty
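Since the request is for the Lustre-related part of /var/log/messages, a filter along the following lines is one way to pull it out. This is a sketch, not an official tool: the function name lustre_log_lines is made up here, and the pattern is based only on the message prefixes that appear in this thread (Lustre, LustreError, LDISKFS).

```shell
# Hypothetical helper: print only Lustre-related kernel lines from a log file.
# Defaults to /var/log/messages, the file named in the reply above.
lustre_log_lines() {
    logfile="${1:-/var/log/messages}"
    grep -E 'kernel: (Lustre|LustreError|LDISKFS)' "$logfile"
}
```

Run on the OSS as, for example, `lustre_log_lines | tail -n 50` to get the most recent entries for posting to the list.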
Christopher J. Walker
2010-Jan-16 11:14 UTC
[Lustre-discuss] Fw: Re: Unable to activate OST
Wojciech Turek wrote:
> Hi,
>
> Could you please post output of the 'lctl list_nids' command on OSS
> system and on MDS system. This will show us which network was
> configured to work with lustre.
>
> Regarding entries in the modprobe.conf, they tell lnet module which
> NIC or multiple NICs will be configured to work with lustre.

There's a gotcha here which I've been meaning to write up. We have a 10Gig card as eth2, assigned a different IP address on the same subnet as eth0, a 1Gig card. Whilst Lustre correctly bound to the IP address of eth2, the kernel decided (correctly) that it could route packets via eth0. This worked, but gave poor performance (partly due to a bottleneck on that part of the network). The solution was to ensure that packets from eth2's IP address were routed out of eth2.

Chris
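The message above does not say how the routing was fixed. One common way to do this on Linux is source-based policy routing with iproute2, and the sketch below shows that approach under stated assumptions: the address, subnet and table number are placeholders, not values from the thread, and the function only prints the commands rather than running them.

```shell
# Sketch only: print (do not run) iproute2 commands that send traffic
# sourced from one address out of a specific interface.
# All argument values used below are placeholders.
print_source_routing() {
    src_ip="$1"   # address bound to the fast NIC (hypothetical)
    dev="$2"      # interface that should carry that traffic, e.g. eth2
    subnet="$3"   # the shared subnet, in CIDR form
    table="$4"    # an otherwise unused routing table number
    echo "ip route add $subnet dev $dev src $src_ip table $table"
    echo "ip rule add from $src_ip table $table"
}

print_source_routing 192.168.0.12 eth2 192.168.0.0/24 100
```

With two NICs on one subnet, ARP behaviour may also need attention (for example the arp_filter sysctl); whether that was necessary in the setup described above is not stated.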
I''ve posted my /var/log/messages here before, but here it is again: --------------------------------------- /var/log/messages ----------------------------------------------------------- Jan 14 22:41:05 oss kernel: Lustre: OBD class driver, http://www.lustre.org/ Jan 14 22:41:05 oss kernel: Lustre: Lustre Version: 1.8.1.1 Jan 14 22:41:05 oss kernel: Lustre: Build Version: 1.8.1.1-20091009075116-PRISTINE-2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007 Jan 14 22:41:06 oss kernel: Lustre: Added LNI 192.168.0.3 at tcp [8/256/0/0] Jan 14 22:41:06 oss kernel: Lustre: Accept secure, port 988 Jan 14 22:41:06 oss kernel: Lustre: Lustre Client File System; http://www.lustre.org/ Jan 14 22:41:07 oss kernel: kjournald starting. Commit interval 5 seconds Jan 14 22:41:07 oss kernel: LDISKFS FS on dm-2, internal journal Jan 14 22:41:07 oss kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jan 14 22:41:07 oss kernel: kjournald starting. Commit interval 5 seconds Jan 14 22:41:07 oss kernel: LDISKFS FS on dm-2, internal journal Jan 14 22:41:07 oss kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jan 14 22:41:07 oss kernel: LDISKFS-fs: file extents enabled Jan 14 22:41:07 oss kernel: LDISKFS-fs: mballoc enabled Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.0.2/988 Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.0.2 at tcp at host 192.168.0.2 was unreachable: the network or that node may be down, or Lustre may be misconfigured. Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.0.3 at tcp->192.168.0.2 at tcp Jan 14 22:41:12 oss kernel: Lustre: 2853:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1324907721916417 sent from MGC192.168.0.2 at tcp to NID 192.168.0.2 at tcp 5s ago has timed out (limit 5s). 
Jan 14 22:41:12 oss kernel: req at f5d7fe00 x1324907721916417/t0 o250->MGS at MGC192.168.0.2@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1263530472 ref 1 fl Rpc:N/0/0 rc 0/0 Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1085:server_start_targets()) Required registration failed for datafs-OSTffff: -5 Jan 14 22:41:12 oss kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running? Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1629:server_fill_super()) Unable to start targets: -5 Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1412:server_put_super()) no obd datafs-OSTffff Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:136:server_deregister_mount()) datafs-OSTffff not registered Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success) Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 generated and it took 0 Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded Jan 14 22:41:12 oss kernel: Lustre: server umount datafs-OSTffff complete Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-5) On Sat, Jan 16, 2010 at 5:14 AM, Christopher J. Walker < C.J.Walker at qmul.ac.uk> wrote:> Wojciech Turek wrote: > >> Hi, >> >> Could you please post output of the ''lctl list_nids'' command on OSS >> system and on MDS system. This will show us which network was >> configured to work with lustre. >> >> Regarding entries in the modprobe.conf, they tell lnet module which >> NIC or multiple NICs will be configured to work with lustre. >> > > There''s a gotcha here which I''ve been meaning to write up. We have a 10Gig > card as eth2 assigned a different IP address on the same subnet as eth0, a > 1Gig card. 
Whilst lustre correctly bound to the ip address of eth2, the > kernel decided (correctly) it could route packets via eth0. This worked, but > gave poor performance (partly due to a bottleneck on that art of the > network). The solution was to ensure that packets from eth2''s IP address > were routed out of eth2. > > Chris > > > If your >> modprobe.conf doesn''t have lnet options line, by default Lustre will >> configure the first NIC which is usually eth0. >> Below is a modprobe.conf entry from my lustre setup. >> My OSS(s) and MDS(s) have 2 NICs eth0 and eth1 and an Infiniband NIC >> ib0. The IB is set to work as IPoIB so lustre treats it as an ordinary >> Ethernet NIC >> options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0) >> So the line above means that: >> first lustre network tcp0 is configured on interface ib0 >> second lustre network tcp1 is configured on interface eth1 >> third lustre network tcp2 is confiured on alias interface eth1:0 >> >> eth0 is not mentioned on this line because I have chosen not to >> configure it to work with lustre. >> >> >> Once lnet module is loaded you can check which network or networks are >> configured to work with Lustre using ''lctl list_nids'' command >> >> Cheers >> >> Wojciech >> 2010/1/15 Dusty Marks <dustynmarks at gmail.com>: >> >>> I did some googling and i found the command lctl ping. So i went on the >>> oss >>> and typed in "lctl ping 192.168.0.2 at tcp". This errored out with an I/O >>> error. >>> >>> It is quite obvious that i''ve simply misconfigured the network. Could >>> someone explain how to properly configure it? >>> >>> I don''t understand what the entry in modprobe actually means, so i cannot >>> say what should be entered. >>> >>> Each one of my machines has one NIC (eth0). What do i enter in >>> modprobe.conf? To make this work correctly? if i update the entry in >>> modprobe.conf, do i have to redo anything? or does lustre pickup on the >>> changes without restarting anything? 
>>> >>> Thanks all for the help so far. >>> >>> - Dusty >>> >>> On Fri, Jan 15, 2010 at 10:36 AM, Dusty Marks <dustynmarks at gmail.com> >>> wrote: >>> >>>> I searched through the manual, and the only section i could find dealing >>>> with networking configuration is section 4.1.0.2 titled "Module Setup" >>>> in >>>> the Lustre 1.8 operations manual. >>>> >>>> It tells me to run the command modprobe -v lustre "networks=tcp0(eth0)", >>>> and i did such on the MDS, however it errored out with: >>>> >>>> [root at mds ~]# modprobe -v lustre "networks=tcp0(eth0)" >>>> insmod >>>> >>>> /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko >>>> networks=tcp0(eth0) >>>> FATAL: Error inserting lustre >>>> >>>> (/lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko): >>>> Unknown symbol in module, or unknown parameter (see dmesg) >>>> >>>> dmesg says nothing, but message says this: >>>> Jan 15 10:27:48 mds kernel: lustre: Unknown parameter `networks'' >>>> >>>> I even tried adding "options lnet networks=tcp0(eth0)" however that >>>> didn''t >>>> work either >>>> >>>> I''m terribly sorry for my incompetence, but i''m having a difficult time >>>> understanding lustre''s abstractions. >>>> >>>> Each one of my nodes have a single ethernet card (eth0) >>>> >>>> >>>> On Thu, Jan 14, 2010 at 11:32 PM, Andreas Dilger <adilger at sun.com> >>>> wrote: >>>> >>>>> On 2010-01-15, at 00:21, Arden Wiebe wrote: >>>>> >>>>>> Your mount command is wrong - try this format. >>>>>> >>>>>> mount -t lustre 192.168.0.7 at tcp0:/ioio /mnt/ioio >>>>>> >>>>>> So by substitution for supplied your mount line should >>>>>> read: >>>>>> >>>>>> mount -t datafs 192.168.0.2 at tcp0:/datafs /mnt/datafs >>>>>> >>>>> No, that isn''t correct. You are showing the mount command for a >>>>> client. 
It is the OST that is failing to mount, likely because >>>>> the network is not configured correctly, and the OST needs to >>>>> contact the MGS node always on the first mount in order to join >>>>> the filesystem. >>>>> >>>>> Enjoy the required reading and testing. I found by >>>>>> naming things uniquely helped me clarify what was actually >>>>>> required. Try calling your filesystem "Dusty" or >>>>>> "Mark" and that should make things clearer for you. >>>>>> >>>>>> --- On Thu, 1/14/10, Andreas Dilger <adilger at sun.com> wrote: >>>>>> >>>>>>> On 2010-01-14, at 23:51, Dusty Marks wrote: >>>>>>> >>>>>>>> You are correct, there is information in messages. Following are >>>>>>>> the >>>>>>>> entries related the lustre. The line that says 192.168.0.2 at tcp is >>>>>>>> unreachable makes sense, but what exactly is the problem? I entered >>>>>>>> the line "options lnet networks=tcp" in modprobe.conf on the oss and >>>>>>>> mds. The only difference was, i entered that line AFTER i setup >>>>>>>> lustre on the OSS. Could that be the problem? I don''t see why that >>>>>>>> would be the problem, as the oss is trying to reach the MDS/MGS, >>>>>>>> which is 192.168.0.2. >>>>>>>> >>>>>>>> --------------------------------------- >>>>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(linux-tcpip.c: >>>>>>>> 688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> >>>>>>>> 192.168.0.2/988 >>>>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(acceptor.c: >>>>>>>> 95:lnet_connect_console_error()) Connection to 192.168.0.2 at tcp at >>>>>>>> host 192.168.0.2 was unreachable: the network or that node may be >>>>>>>> down, or Lustre may be misconfigured. >>>>>>>> >>>>>>> >>>>>>> Please read the chapter in the manual about network configuration. I >>>>>>> suspect the .0.2 network is not your eth0 network interface, and your >>>>>>> modprobe.conf needs to be fixed. >>>>>>> >>>>>> >>>>> Cheers, Andreas >>>>> -- >>>>> Andreas Dilger >>>>> Sr. 
Staff Engineer, Lustre Group
>>>>> Sun Microsystems of Canada, Inc.
>>>>>
>>>>>
>>>>
>>>> --
>>>> The graduate with a Science degree asks, "Why does it work?" The graduate
>>>> with an Engineering degree asks, "How does it work?" The graduate with an
>>>> Accounting degree asks, "How much will it cost?" The graduate with an Arts
>>>> degree asks, "Do you want fries with that?"
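A note for readers following the quoted exchange: the kernel message "lustre: Unknown parameter networks" indicates that the lustre module does not accept a networks parameter, while the "options lnet networks=..." lines quoted in this thread set it on the lnet module instead. The following is only a rough sketch of how one might check what is configured. It assumes a single eth0 NIC per node and the MGS at 192.168.0.2, as described in the thread; the helper names lnet_networks and check_lnet are hypothetical and not part of Lustre.

```shell
#!/bin/sh
# Sketch only. Assumes /etc/modprobe.conf is the module config file in use
# (as in this thread) and that the MGS NID is 192.168.0.2@tcp.

# Print the networks= value from the "options lnet" line of a modprobe
# config file. For a single-NIC node one would expect: tcp0(eth0)
lnet_networks() {
    conf="${1:-/etc/modprobe.conf}"
    sed -n 's/^options[[:space:]][[:space:]]*lnet[[:space:]].*networks=\([^[:space:]]*\).*/\1/p' "$conf"
}

# Run on the OSS once the Lustre modules are loaded: list the local NIDs,
# then ping the MGS NID over LNET (both lctl commands appear in this thread).
check_lnet() {
    mgs_nid="${1:-192.168.0.2@tcp}"
    lctl list_nids
    lctl ping "$mgs_nid"
}
```

Calling lnet_networks with no argument reads /etc/modprobe.conf; check_lnet needs lctl on the PATH and is meant to be run by hand as root.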
Got it working. The firewall was blocking lustre traffic. :( After disabling it, it works.

Thanks all for the help!

On Sat, Jan 16, 2010 at 9:57 AM, Dusty Marks <dustynmarks at gmail.com> wrote:

> I've posted my /var/log/messages here before, but here it is again:
>
>
> --------------------------------------- /var/log/messages -----------------------------------------------------------
> Jan 14 22:41:05 oss kernel: Lustre: OBD class driver, http://www.lustre.org/
> Jan 14 22:41:05 oss kernel: Lustre: Lustre Version: 1.8.1.1
> Jan 14 22:41:05 oss kernel: Lustre: Build Version: 1.8.1.1-20091009075116-PRISTINE-2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007
> Jan 14 22:41:06 oss kernel: Lustre: Added LNI 192.168.0.3@tcp [8/256/0/0]
> Jan 14 22:41:06 oss kernel: Lustre: Accept secure, port 988
> Jan 14 22:41:06 oss kernel: Lustre: Lustre Client File System; http://www.lustre.org/
> Jan 14 22:41:07 oss kernel: kjournald starting. Commit interval 5 seconds
> Jan 14 22:41:07 oss kernel: LDISKFS FS on dm-2, internal journal
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Jan 14 22:41:07 oss kernel: kjournald starting. Commit interval 5 seconds
> Jan 14 22:41:07 oss kernel: LDISKFS FS on dm-2, internal journal
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: file extents enabled
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: mballoc enabled
>
> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.0.2/988
> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.0.2@tcp at host 192.168.0.2 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 192.168.0.3@tcp->192.168.0.2@tcp
> Jan 14 22:41:12 oss kernel: Lustre: 2853:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1324907721916417 sent from MGC192.168.0.2@tcp to NID 192.168.0.2@tcp 5s ago has timed out (limit 5s).
> Jan 14 22:41:12 oss kernel: req@f5d7fe00 x1324907721916417/t0 o250->MGS@MGC192.168.0.2@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1263530472 ref 1 fl Rpc:N/0/0 rc 0/0
> Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1085:server_start_targets()) Required registration failed for datafs-OSTffff: -5
> Jan 14 22:41:12 oss kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running?
> Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1629:server_fill_super()) Unable to start targets: -5
> Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1412:server_put_super()) no obd datafs-OSTffff
> Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:136:server_deregister_mount()) datafs-OSTffff not registered
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
> Jan 14 22:41:12 oss kernel: Lustre: server umount datafs-OSTffff complete
> Jan 14 22:41:12 oss kernel: LustreError: 2819:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount (-5)
>
>
> On Sat, Jan 16, 2010 at 5:14 AM, Christopher J. Walker <C.J.Walker at qmul.ac.uk> wrote:
>
>> Wojciech Turek wrote:
>>
>>> Hi,
>>>
>>> Could you please post output of the 'lctl list_nids' command on OSS
>>> system and on MDS system. This will show us which network was
>>> configured to work with lustre.
>>>
>>> Regarding entries in the modprobe.conf, they tell lnet module which
>>> NIC or multiple NICs will be configured to work with lustre.
>>>
>>
>> There's a gotcha here which I've been meaning to write up. We have a 10Gig
>> card as eth2 assigned a different IP address on the same subnet as eth0, a
>> 1Gig card. Whilst lustre correctly bound to the ip address of eth2, the
>> kernel decided (correctly) it could route packets via eth0. This worked, but
>> gave poor performance (partly due to a bottleneck on that part of the
>> network). The solution was to ensure that packets from eth2's IP address
>> were routed out of eth2.
>>
>> Chris
>>
>>
>>> If your modprobe.conf doesn't have lnet options line, by default Lustre will
>>> configure the first NIC which is usually eth0.
>>> Below is a modprobe.conf entry from my lustre setup.
>>> My OSS(s) and MDS(s) have 2 NICs eth0 and eth1 and an Infiniband NIC
>>> ib0. The IB is set to work as IPoIB so lustre treats it as an ordinary
>>> Ethernet NIC
>>> options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0)
>>> So the line above means that:
>>> first lustre network tcp0 is configured on interface ib0
>>> second lustre network tcp1 is configured on interface eth1
>>> third lustre network tcp2 is configured on alias interface eth1:0
>>>
>>> eth0 is not mentioned on this line because I have chosen not to
>>> configure it to work with lustre.
>>>
>>>
>>> Once lnet module is loaded you can check which network or networks are
>>> configured to work with Lustre using 'lctl list_nids' command
>>>
>>> Cheers
>>>
>>> Wojciech
>>> 2010/1/15 Dusty Marks <dustynmarks at gmail.com>:
>>>
>>>> I did some googling and i found the command lctl ping. So i went on the oss
>>>> and typed in "lctl ping 192.168.0.2@tcp". This errored out with an I/O
>>>> error.
>>>>
>>>> It is quite obvious that i've simply misconfigured the network. Could
Could >>>> someone explain how to properly configure it? >>>> >>>> I don''t understand what the entry in modprobe actually means, so i >>>> cannot >>>> say what should be entered. >>>> >>>> Each one of my machines has one NIC (eth0). What do i enter in >>>> modprobe.conf? To make this work correctly? if i update the entry in >>>> modprobe.conf, do i have to redo anything? or does lustre pickup on the >>>> changes without restarting anything? >>>> >>>> Thanks all for the help so far. >>>> >>>> - Dusty >>>> >>>> On Fri, Jan 15, 2010 at 10:36 AM, Dusty Marks <dustynmarks at gmail.com> >>>> wrote: >>>> >>>>> I searched through the manual, and the only section i could find >>>>> dealing >>>>> with networking configuration is section 4.1.0.2 titled "Module Setup" >>>>> in >>>>> the Lustre 1.8 operations manual. >>>>> >>>>> It tells me to run the command modprobe -v lustre >>>>> "networks=tcp0(eth0)", >>>>> and i did such on the MDS, however it errored out with: >>>>> >>>>> [root at mds ~]# modprobe -v lustre "networks=tcp0(eth0)" >>>>> insmod >>>>> >>>>> /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko >>>>> networks=tcp0(eth0) >>>>> FATAL: Error inserting lustre >>>>> >>>>> (/lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko): >>>>> Unknown symbol in module, or unknown parameter (see dmesg) >>>>> >>>>> dmesg says nothing, but message says this: >>>>> Jan 15 10:27:48 mds kernel: lustre: Unknown parameter `networks'' >>>>> >>>>> I even tried adding "options lnet networks=tcp0(eth0)" however that >>>>> didn''t >>>>> work either >>>>> >>>>> I''m terribly sorry for my incompetence, but i''m having a difficult time >>>>> understanding lustre''s abstractions. 
>>>>> >>>>> Each one of my nodes have a single ethernet card (eth0) >>>>> >>>>> >>>>> On Thu, Jan 14, 2010 at 11:32 PM, Andreas Dilger <adilger at sun.com> >>>>> wrote: >>>>> >>>>>> On 2010-01-15, at 00:21, Arden Wiebe wrote: >>>>>> >>>>>>> Your mount command is wrong - try this format. >>>>>>> >>>>>>> mount -t lustre 192.168.0.7 at tcp0:/ioio /mnt/ioio >>>>>>> >>>>>>> So by substitution for supplied your mount line should >>>>>>> read: >>>>>>> >>>>>>> mount -t datafs 192.168.0.2 at tcp0:/datafs /mnt/datafs >>>>>>> >>>>>> No, that isn''t correct. You are showing the mount command for a >>>>>> client. It is the OST that is failing to mount, likely because >>>>>> the network is not configured correctly, and the OST needs to >>>>>> contact the MGS node always on the first mount in order to join >>>>>> the filesystem. >>>>>> >>>>>> Enjoy the required reading and testing. I found by >>>>>>> naming things uniquely helped me clarify what was actually >>>>>>> required. Try calling your filesystem "Dusty" or >>>>>>> "Mark" and that should make things clearer for you. >>>>>>> >>>>>>> --- On Thu, 1/14/10, Andreas Dilger <adilger at sun.com> wrote: >>>>>>> >>>>>>>> On 2010-01-14, at 23:51, Dusty Marks wrote: >>>>>>>> >>>>>>>>> You are correct, there is information in messages. Following are >>>>>>>>> the >>>>>>>>> entries related the lustre. The line that says 192.168.0.2 at tcp is >>>>>>>>> unreachable makes sense, but what exactly is the problem? I entered >>>>>>>>> the line "options lnet networks=tcp" in modprobe.conf on the oss >>>>>>>>> and >>>>>>>>> mds. The only difference was, i entered that line AFTER i setup >>>>>>>>> lustre on the OSS. Could that be the problem? I don''t see why that >>>>>>>>> would be the problem, as the oss is trying to reach the MDS/MGS, >>>>>>>>> which is 192.168.0.2. 
>>>>>>>>> >>>>>>>>> --------------------------------------- >>>>>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(linux-tcpip.c: >>>>>>>>> 688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> >>>>>>>>> 192.168.0.2/988 >>>>>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(acceptor.c: >>>>>>>>> 95:lnet_connect_console_error()) Connection to 192.168.0.2 at tcp at >>>>>>>>> host 192.168.0.2 was unreachable: the network or that node may be >>>>>>>>> down, or Lustre may be misconfigured. >>>>>>>>> >>>>>>>> >>>>>>>> Please read the chapter in the manual about network configuration. >>>>>>>> I >>>>>>>> suspect the .0.2 network is not your eth0 network interface, and >>>>>>>> your >>>>>>>> modprobe.conf needs to be fixed. >>>>>>>> >>>>>>> >>>>>> Cheers, Andreas >>>>>> -- >>>>>> Andreas Dilger >>>>>> Sr. Staff Engineer, Lustre Group >>>>>> Sun Microsystems of Canada, Inc. >>>>>> >>>>>> >>>>> >>>>> -- >>>>> The graduate with a Science degree asks, "Why does it work?" The >>>>> graduate >>>>> with an Engineering degree asks, "How does it work?" The graduate with >>>>> an >>>>> Accounting degree asks, "How much will it cost?" The graduate with an >>>>> Arts >>>>> degree asks, "Do you want fries with that?" >>>>> >>>> >>>> >>>> -- >>>> The graduate with a Science degree asks, "Why does it work?" The >>>> graduate >>>> with an Engineering degree asks, "How does it work?" The graduate with >>>> an >>>> Accounting degree asks, "How much will it cost?" The graduate with an >>>> Arts >>>> degree asks, "Do you want fries with that?" >>>> >>>> _______________________________________________ >>>> Lustre-discuss mailing list >>>> Lustre-discuss at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>>> >>>> >>>> >>> >>> >>> >> > > > -- > The graduate with a Science degree asks, "Why does it work?" The graduate > with an Engineering degree asks, "How does it work?" The graduate with an > Accounting degree asks, "How much will it cost?" 
The graduate with an Arts > degree asks, "Do you want fries with that?" >-- The graduate with a Science degree asks, "Why does it work?" The graduate with an Engineering degree asks, "How does it work?" The graduate with an Accounting degree asks, "How much will it cost?" The graduate with an Arts degree asks, "Do you want fries with that?" -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100116/3de89e07/attachment.html
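The fix reported at the top of this message was to disable the firewall. The log shows LNET accepting connections on TCP port 988 ("Accept secure, port 988") and the OSS failing to connect to 192.168.0.2/988, so a narrower alternative may be to leave the firewall on and allow that port between the Lustre nodes. What follows is a hedged sketch, not a tested recipe: the 192.168.0.0/24 subnet is an assumption based on the addresses in the thread, lnet_rule is a hypothetical helper, and the "service iptables save" step assumes a RHEL/CentOS 5 style init script, which matches the el5 kernel in the logs.

```shell
#!/bin/sh
# Sketch only. Builds the iptables match/target arguments that accept
# inbound LNET TCP traffic. Defaults are assumptions taken from this thread:
# port 988 (from the log) and the 192.168.0.0/24 subnet.
lnet_rule() {
    port="${1:-988}"
    src="${2:-192.168.0.0/24}"
    echo "-p tcp -s $src --dport $port -j ACCEPT"
}

# To apply, as root on the MGS/MDS and on each OSS:
#   iptables -I INPUT $(lnet_rule)
#   service iptables save    # assumed RHEL/CentOS 5; persists the rule
#
# Then re-test from the OSS with the command used earlier in the thread:
#   lctl ping 192.168.0.2@tcp
```

Printing the rule rather than applying it directly makes it easy to inspect the arguments before touching a live firewall.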