Hi all,

I am mounting Lustre through an fstab entry. This fails quite often; the
nodes end up without the Lustre mount. Even when I log in, it takes 2-3
tries to get it to mount. This is what I get:

mount /lustre
mount.lustre: mount 10.1.1.1@tcp0:/lustre at /lustre failed: Cannot send after transport endpoint shutdown

This is /var/log/messages:

Nov 15 16:27:43 compute-1-10 kernel: LustreError: 2124:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-10.1.1.1@tcp: -113
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 2124:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@d73d7c00 x1352468062535684/t0 o250->MGS@MGC10.1.1.1@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1289834868 ref 2 fl Rpc:N/0/0 rc 0/0
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@d73d7800 x1352468062535685/t0 o101->MGS@MGC10.1.1.1@tcp_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 15c-8: MGC10.1.1.1@tcp: The configuration from log 'lustre-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(llite_lib.c:1176:ll_fill_super()) Unable to process log: -108
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-108)

I have no errors on the interface, so I assume this is a timing problem.
Can I improve this through some timeout setting?

Cheers,
Arne
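P.S. For completeness, the fstab entry is essentially of this form; the options beyond the defaults in the sketch below are my guesses at what might help (_netdev delays the mount until the network is up, and retry= only applies if this mount.lustre version supports it):

    # /etc/fstab on the client (options beyond the defaults are untested guesses)
    10.1.1.1@tcp0:/lustre  /lustre  lustre  defaults,_netdev,retry=5  0 0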
From the log, we can see that either your MGS node was not ready for
connections yet, or there is a network error between the client and the MGS node.
Were you rebooting the MGS at the moment? Since you said there are no errors
on the interface, you need to check the LNET connection and also verify that
the MGS/MDT are up and running.

On 2010-11-15, at 11:32, Arne Brutschy wrote:

> I am mounting Lustre through an fstab entry. This fails quite often; the
> nodes end up without the Lustre mount. [...]
> I have no errors on the interface, so I assume this is a timing problem.
> Can I improve this through some timeout setting?
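P.S. To be concrete, by "check the LNET connection" I mean something like the following (a sketch; the NID is the one from your log, and the procfs path depends on the Lustre release):

    # On the client: can LNET reach the MGS NID at all?
    lctl ping 10.1.1.1@tcp0

    # On the MGS/MDS node: is the MGS/MDT actually set up and running?
    lctl dl                      # lists obd devices; MGS and MDT should show "UP"
    cat /proc/fs/lustre/devices  # same information via procfs on older releases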
Hello,

> From the log, we can see that either your MGS node was not ready for
> connections yet, or there is a network error between the client and the MGS node.

No error on the server nor on the client. What else can it be? Maybe the
switch is bad; I can see RX errors on most of its interfaces.

> Were you rebooting the MGS at the moment?

No. It's something that happens regularly.

> Since you said there are no errors on the interface, you need to check
> the LNET connection and also verify that the MGS/MDT are up and running.

As far as I can tell, everything seems to be set up correctly. I have
quite a simple setup (single network, single GbE interface).

Thanks
Arne

--
Arne Brutschy
Ph.D. Student                    Email  arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web    iridia.ulb.ac.be/~abrutschy
Université Libre de Bruxelles    Tel    +32 2 650 2273
Avenue Franklin Roosevelt 50     Fax    +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)
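P.S. For reference, this is the kind of check I mean by "no errors on the interface" (a sketch; the exact counter names vary by NIC driver):

    # Kernel view of the interface counters on client and servers
    ip -s link show eth0

    # Driver/NIC statistics, filtered for error and drop counters
    ethtool -S eth0 | grep -i -E 'err|drop|crc'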
Hi,

On 2010-11-16, at 7:25, Arne Brutschy wrote:

> No error on the server nor on the client. What else can it be? Maybe the
> switch is bad; I can see RX errors on most of its interfaces.

The switch could be the culprit - the error message shows the client failed
to send a request to the MGS. The network sending status was -EHOSTUNREACH.
I suggest you reexamine the network of your system.
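Since the failures are intermittent, a simple loop on one of the affected clients can tell you whether the MGS NID actually becomes unreachable around the time a mount fails. A rough sketch (the 10-second interval is arbitrary):

    # Log every failed LNET ping to the MGS with a timestamp
    while true; do
        lctl ping 10.1.1.1@tcp0 > /dev/null 2>&1 || echo "$(date) ping to MGS failed"
        sleep 10
    done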
Hi,

> The switch could be the culprit - the error message shows the client failed
> to send a request to the MGS. The network sending status was -EHOSTUNREACH.

OK, I started monitoring the switch - maybe I have to replace it, though I am
reluctant to do so, as I am not *sure* that it's the switch. I guess our
overall network layout is bad - it's a single network, single GbE interface,
without any routing happening. Maybe the switch just gets overloaded...

> I suggest you reexamine the network of your system.

I realized that I did not give any parameters to the lnet module. Might this
have been a problem? If yes, why did it work at all? Anyway, I added a simple

    options lnet networks=tcp0(eth0)

to my modprobe.conf on my clients (I had it already on the servers).

Thanks
Arne
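P.S. Is this the right way to make the new option take effect on a client? As far as I understand, the module options are only re-read when the modules are reloaded, so the filesystem has to be unmounted first (a sketch, assuming the lustre_rmmod helper from the Lustre utilities is installed):

    umount /lustre      # modules cannot be unloaded while the client is mounted
    lustre_rmmod        # unloads the Lustre and LNET modules
    modprobe lnet
    lctl network up     # brings LNET up with the networks= option
    lctl list_nids      # should now print this node's NID, e.g. 10.1.1.x@tcp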
Hi,

> So I was incorrect to assume you are mounting this for the first time?
> Sounded like other clients have successfully mounted and have written
> data to the OSTs? Status -113 = no route to host, so you might want
> to check your "attempt mount" client connectivity to both the MDS/MGS
> and, most importantly, to the OSS.

Yes, sorry if this wasn't clear before. I have been using this system in
production for 8 months (~40 users, ~600GB of data). The whole thing comes
spinning down from time to time, most probably through flaky networking.
I can't really predict when it happens, or what triggers it.

The mount problem I described here is always present and seems to be
connected to the main problem. Random nodes can't mount on the first try
(even if it worked before on that node). Retrying usually works.

> I have encountered the exact same error before, when my mgsnode=xxxx@o2ib0,yyyy@o2ib1
> was missing the zzzz@tcp0 on the OSTs. I tried using "tunefs.lustre --erase-param
> --mgsnode=....." to avoid reformatting it, but in the end decided to wipe it out
> and start from scratch. My clients only use tcp to the MDS/MGS/OSS via two separate
> networks, so once I appended the missing piece to the mgsnode= parameter it mounted
> immediately.

This is what I did according to my notes:

mds:
    mkfs.lustre --fsname=lustre --mgs --mdt /dev/sda6
    mount -t lustre /dev/sda6 /mdt0

oss0:
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sda3
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sdb3
    mount -t lustre /dev/sda3 /ost0
    mount -t lustre /dev/sdb3 /ost1

oss1:
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sda3
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sdb3
    mount -t lustre /dev/sda3 /ost0
    mount -t lustre /dev/sdb3 /ost1

client:
    mount -t lustre 10.1.1.1@tcp0:/lustre /lustre

I can try to rewrite the parameters on the file systems and power-cycle the
whole cluster, but I guess I have to wait for quieter times for this.
Reformatting the OSTs would be a major hassle, as I would need to back up
all the data first.

Arne
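P.S. Before power-cycling anything I would probably just inspect and, if needed, rewrite the parameters with tunefs.lustre. A rough sketch of what I have in mind - please correct me if the flags are wrong for this version (e.g. --erase-param vs. --erase-params):

    # Print the current on-disk parameters without changing anything
    tunefs.lustre --dryrun /dev/sda3

    # Rewrite the MGS NID and regenerate the config logs on next mount
    tunefs.lustre --erase-param --mgsnode=10.1.1.1@tcp0 --writeconf /dev/sda3

As I understand it, a --writeconf has to be done on the MDT and all OSTs together, with the whole filesystem stopped, so this would indeed have to wait for quieter times.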