Hi,

I'm a Lustre newbie. The server I set up has a combined MGS/MDT file system on one block device and an OST on another block device. Both are on the same machine, using 2 disks. The NID is 10.0.38.102@tcp, and the address 10.0.38.102 was assigned to eth0. One day I noticed that eth0 was broken, so I switched to another NIC, eth1, and assigned the IP address 10.0.38.102 to that card.

Then I used a client to mount the server's Lustre FS with the following command:

mount -t lustre 10.0.38.102@tcp:/ericlfs /mnt/foobar

It reported the following error messages:

Lustre: Request x1310428982411274 sent from MGC10.0.38.102@tcp to NID 10.0.38.102@tcp 5s ago has timed out (limit 5s).
LustreError: 4397:0:(client.c:792:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81002cb7d800 x1310428982411276/t0 o501->MGS@MGC10.0.38.102@tcp_0:26/25 lens 264/432 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 15c-8: MGC10.0.38.102@tcp: The configuration from log 'ericlfs-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 4397:0:(llite_lib.c:1169:ll_fill_super()) Unable to process log: -108
Lustre: client ffff81002bd17400 umount complete
mount.lustre: mount 10.0.38.102@tcp:/ericlfs at /mnt failed: Cannot send after transport endpoint shutdown

So I feel a little confused. Is this problem caused by replacing the NIC? And furthermore, how do I fix that problem?

Thank you very much.

Best Regards,
Amy
Hi Amy,

You may want to try the following options in your /etc/modprobe.conf

options lnet networks=tcp0(eth1)

Regards,
Rhys

2009/8/8 Lee Amy <openlinuxsource at gmail.com>
> So I feel a little confused. Is this problem caused by replacing the
> NIC? And furthermore, how do I fix that problem?
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
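A small check can confirm how the lnet option line is actually spelled in the config file before reloading the modules. This is only a sketch: the function name check_lnet_options is made up for illustration, and it assumes the plain "options lnet key=value" syntax from the suggestion above. The parameter there is written `networks` (plural); as far as I know, a singular `network=` is not an lnet option, so the sketch flags it.

```shell
#!/bin/sh
# Sketch: report how the lnet module options are spelled in a modprobe config.
# Assumption: the file uses the plain "options lnet key=value" syntax.

# check_lnet_options FILE (hypothetical helper)
# Prints "ok: <value>" when a "networks=" setting is found,
# "misspelled: network=" when only the singular form is present,
# and "missing" when there is no lnet networks setting at all.
check_lnet_options() {
    conf="$1"
    # Keep only non-comment "options lnet ..." lines.
    lines=$(grep -E '^[[:space:]]*options[[:space:]]+lnet[[:space:]]' "$conf" 2>/dev/null)
    # Extract the value of the first "networks=" setting, if any.
    value=$(printf '%s\n' "$lines" | sed -n 's/.*[[:space:]]networks=\([^[:space:]]*\).*/\1/p' | head -n 1)
    if [ -n "$value" ]; then
        echo "ok: $value"
    elif printf '%s\n' "$lines" | grep -q '[[:space:]]network='; then
        echo "misspelled: network="
    else
        echo "missing"
    fi
}
```

It could be run as `check_lnet_options /etc/modprobe.conf`. Once the line is right and the Lustre modules have been reloaded, `lctl list_nids` on the server should show whether the expected NID came up.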
---------- Forwarded message ----------
From: Lee Amy <openlinuxsource at gmail.com>
Date: Mon, Aug 10, 2009 at 9:32 AM
Subject: Re: [Lustre-discuss] Help: NIC Changed Error
To: Rhys McMurdo <rhys at mcmurdo.id.au>

On Mon, Aug 10, 2009 at 6:14 AM, Rhys McMurdo <rhys at mcmurdo.id.au> wrote:
> You may want to try the following options in your /etc/modprobe.conf
>
> options lnet networks=tcp0(eth1)

Thanks very much. Anyway, my nid is 10.0.38.102@tcp, not 10.0.38.102@tcp0. If I add the above item to /etc/modprobe.conf, I don't know whether it will break something.

Could you tell me what's the difference between tcp and tcp0?

Thank you very much.

Regards,
Amy
On Mon, Aug 10, 2009 at 9:32 AM, Lee Amy <openlinuxsource at gmail.com> wrote:
> Thanks very much. Anyway, my nid is 10.0.38.102@tcp, not
> 10.0.38.102@tcp0.

Hi,

It seems this method cannot solve my problem. My NID is 10.0.38.102@tcp, and furthermore when I added the item

options lnet network=tcp0(eth1)

I still encountered the same problem. After this failure I changed the item back to

options lnet network=tcp

and it still failed. So I really feel very confused about that. When I installed Lustre the NID was 10.0.68.102@tcp, without the tcp0 suffix.

Could someone tell me how to fix that problem?

Thank you very much.

Regards,
Amy
Hi Amy,

You could first unmount all the OSTs, the MGS, etc., and then redo a tunefs on each relevant disk:

tunefs.lustre --writeconf --mgs --mdt --fsname=lufs DISKNAME
tunefs.lustre --erase-param --mgsnode=10.0.38.102@tcp0 --writeconf DISKNAME

Best Regards,
Jiawei

On Aug 10, 2009, at 3:56 PM, Lee Amy wrote:
> Could someone tell me how to fix that problem?
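To see what the two tunefs.lustre commands above would look like for a concrete setup without touching any disk, a dry run that only prints them may help. This is a sketch under several assumptions: the helper name print_writeconf_plan and the device paths /dev/sdb and /dev/sdc are hypothetical, the flags are copied from the message above, and the file system name is taken from the original mount command (ericlfs) rather than the lufs used in the example. It also reflects my reading that the first command is meant for the combined MGS/MDT disk and the second for each OST disk.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the tunefs.lustre commands for a
# configuration rewrite. Device paths and file system name are assumptions.

# print_writeconf_plan FSNAME MGS_NID MDT_DEV OST_DEV... (hypothetical helper)
print_writeconf_plan() {
    fsname="$1"; mgs_nid="$2"; mdt_dev="$3"
    shift 3
    # Combined MGS/MDT disk: regenerate the configuration logs.
    echo "tunefs.lustre --writeconf --mgs --mdt --fsname=$fsname $mdt_dev"
    # Each OST disk: drop old parameters and point it at the MGS NID again.
    for ost_dev in "$@"; do
        echo "tunefs.lustre --erase-param --mgsnode=$mgs_nid --writeconf $ost_dev"
    done
}

# Example with made-up device names; nothing is run, only printed.
print_writeconf_plan ericlfs 10.0.38.102@tcp0 /dev/sdb /dev/sdc
```

Printing first makes it easy to review the file system name and device for each command, with all targets unmounted as described above, before running anything for real.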
On Mon, Aug 10, 2009 at 03:56:13PM +0800, Lee Amy wrote:
> ......
> It seems this method cannot solve my problem. My NID is
> 10.0.38.102@tcp, and furthermore when I added the item
>
> options lnet network=tcp0(eth1)
>
> I still encountered the same problem. After this failure I changed
> the item back to
>
> options lnet network=tcp
>
> and it still failed. So I really feel very confused about that.
> When I installed Lustre the NID was 10.0.68.102@tcp, without the tcp0 suffix.

@tcp is just a shorthand for @tcp0 - they are 100% equivalent to each other.

> Could someone tell me how to fix that problem?

Please run "lctl list_nids" on the client and the MGS, then "lctl ping 10.0.68.102@tcp" from the client, and show us the outputs.

Before running the commands, please run "echo +neterror > /proc/sys/lnet/printk" on the client and server, and gather any console or dmesg error messages after the commands have been run.

Isaac
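The equivalence of @tcp and @tcp0 can be illustrated with a small helper that expands the shorthand before comparing two NIDs, for example when checking "lctl list_nids" output against the NID used in a mount command. This is only a sketch: normalize_nid and same_nid are made-up names, and the code assumes a NID has the form ADDRESS@NETTYPE with an optional trailing network number.

```shell
#!/bin/sh
# Sketch: expand the shorthand network name in a NID, e.g. "@tcp" -> "@tcp0".
# Assumption: a NID has the form ADDRESS@NETTYPE[NUMBER].

# normalize_nid NID (hypothetical helper)
normalize_nid() {
    nid="$1"
    case "$nid" in
        *@*[0-9]) echo "$nid" ;;      # network number already present
        *@*)      echo "${nid}0" ;;   # bare network type: append 0
        *)        echo "$nid" ;;      # no network part: leave unchanged
    esac
}

# same_nid A B: succeeds when both NIDs name the same address and network.
same_nid() {
    [ "$(normalize_nid "$1")" = "$(normalize_nid "$2")" ]
}
```

With this, `same_nid 10.0.38.102@tcp 10.0.38.102@tcp0` succeeds, while a NID on a different network number such as @tcp1 does not match.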