Hi all,

I am mounting Lustre through an fstab entry. This fails quite often; the
nodes end up without the Lustre mount. Even when I log in, it takes 2-3
tries to get it to mount. This is what I get:

mount /lustre
mount.lustre: mount 10.1.1.1@tcp0:/lustre at /lustre failed: Cannot send after transport endpoint shutdown

This is /var/log/messages:

Nov 15 16:27:43 compute-1-10 kernel: LustreError: 2124:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-10.1.1.1@tcp: -113
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 2124:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@d73d7c00 x1352468062535684/t0 o250->MGS@MGC10.1.1.1@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1289834868 ref 2 fl Rpc:N/0/0 rc 0/0
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@d73d7800 x1352468062535685/t0 o101->MGS@MGC10.1.1.1@tcp_0:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 15c-8: MGC10.1.1.1@tcp: The configuration from log 'lustre-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(llite_lib.c:1176:ll_fill_super()) Unable to process log: -108
Nov 15 16:27:43 compute-1-10 kernel: LustreError: 29069:0:(obd_mount.c:2045:lustre_fill_super()) Unable to mount (-108)

I have no errors on the interface, so I assume this is a timing problem.
Can I improve this through some timeout setting?

Cheers,
Arne
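P.S. For completeness, the fstab entry is essentially of this form; the options beyond the defaults in the sketch below are my guesses at what might help (_netdev delays the mount until the network is up, and retry= only applies if this mount.lustre version supports it):

    # /etc/fstab on the client (options beyond the defaults are untested guesses)
    10.1.1.1@tcp0:/lustre  /lustre  lustre  defaults,_netdev,retry=5  0 0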
From the log, we can see that either your MGS node was not ready for
connections yet, or there is a network error between the client and the MGS node.
Were you rebooting the MGS at the moment? Since you said there are no errors
on the interface, you need to check the LNET connection and also verify that
the MGS/MDT are up and running.

On 2010-11-15, at 11:32, Arne Brutschy wrote:

> I am mounting Lustre through an fstab entry. This fails quite often; the
> nodes end up without the Lustre mount. [...]
> I have no errors on the interface, so I assume this is a timing problem.
> Can I improve this through some timeout setting?
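P.S. To be concrete, by "check the LNET connection" I mean something like the following (a sketch; the NID is the one from your log, and the procfs path depends on the Lustre release):

    # On the client: can LNET reach the MGS NID at all?
    lctl ping 10.1.1.1@tcp0

    # On the MGS/MDS node: is the MGS/MDT actually set up and running?
    lctl dl                      # lists obd devices; MGS and MDT should show "UP"
    cat /proc/fs/lustre/devices  # same information via procfs on older releases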
Hello,

> From the log, we can see that either your MGS node was not ready for
> connections yet, or there is a network error between the client and the MGS node.

No error on the server nor on the client. What else can it be? Maybe the
switch is bad; I can see RX errors on most of its interfaces.

> Were you rebooting the MGS at the moment?

No. It's something that happens regularly.

> Since you said there are no errors on the interface, you need to check
> the LNET connection and also verify that the MGS/MDT are up and running.

As far as I can tell, everything seems to be set up correctly. I have
quite a simple setup (single network, single GbE interface).

Thanks
Arne

--
Arne Brutschy
Ph.D. Student                    Email  arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web    iridia.ulb.ac.be/~abrutschy
Université Libre de Bruxelles    Tel    +32 2 650 2273
Avenue Franklin Roosevelt 50     Fax    +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)
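P.S. For reference, this is the kind of check I mean by "no errors on the interface" (a sketch; the exact counter names vary by NIC driver):

    # Kernel view of the interface counters on client and servers
    ip -s link show eth0

    # Driver/NIC statistics, filtered for error and drop counters
    ethtool -S eth0 | grep -i -E 'err|drop|crc'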
Hi,

On 2010-11-16, at 7:25, Arne Brutschy wrote:

> No error on the server nor on the client. What else can it be? Maybe the
> switch is bad; I can see RX errors on most of its interfaces.

The switch could be the culprit - the error message shows the client failed
to send a request to the MGS. The network sending status was -EHOSTUNREACH.
I suggest you reexamine the network of your system.
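Since the failures are intermittent, a simple loop on one of the affected clients can tell you whether the MGS NID actually becomes unreachable around the time a mount fails. A rough sketch (the 10-second interval is arbitrary):

    # Log every failed LNET ping to the MGS with a timestamp
    while true; do
        lctl ping 10.1.1.1@tcp0 > /dev/null 2>&1 || echo "$(date) ping to MGS failed"
        sleep 10
    done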
Hi,

> The switch could be the culprit - the error message shows the client failed
> to send a request to the MGS. The network sending status was -EHOSTUNREACH.

OK, I started monitoring the switch - maybe I have to replace it, though I am
reluctant to do so, as I am not *sure* that it's the switch. I guess our
overall network layout is bad - it's a single network, single GbE interface,
without any routing happening. Maybe the switch just gets overloaded...

> I suggest you reexamine the network of your system.

I realized that I did not give any parameters to the lnet module. Might this
have been a problem? If yes, why did it work at all? Anyway, I added a simple

    options lnet networks=tcp0(eth0)

to my modprobe.conf on my clients (I had it already on the servers).

Thanks
Arne
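P.S. Is this the right way to make the new option take effect on a client? As far as I understand, the module options are only re-read when the modules are reloaded, so the filesystem has to be unmounted first (a sketch, assuming the lustre_rmmod helper from the Lustre utilities is installed):

    umount /lustre      # modules cannot be unloaded while the client is mounted
    lustre_rmmod        # unloads the Lustre and LNET modules
    modprobe lnet
    lctl network up     # brings LNET up with the networks= option
    lctl list_nids      # should now print this node's NID, e.g. 10.1.1.x@tcp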
Hi,

> So I was incorrect to assume you are mounting this for the first time?
> Sounded like other clients have successfully mounted and have written
> data to the OSTs? Status -113 = no route to host, so you might want
> to check your "attempt mount" client connectivity to both the MDS/MGS
> and, most importantly, to the OSS.

Yes, sorry if this wasn't clear before. I have been using this system in
production for 8 months (~40 users, ~600GB of data). The whole thing comes
spinning down from time to time, most probably through flaky networking.
I can't really predict when it happens, or what triggers it.

The mount problem I described here is always present and seems to be
connected to the main problem. Random nodes can't mount on the first try
(even if it worked before on that node). Retrying usually works.

> I have encountered the exact same error before, when my mgsnode=xxxx@o2ib0,yyyy@o2ib1
> was missing the zzzz@tcp0 on the OSTs. I tried using "tunefs.lustre --erase-param
> --mgsnode=....." to avoid reformatting it, but in the end decided to wipe it out
> and start from scratch. My clients only use tcp to the MDS/MGS/OSS via two separate
> networks, so once I appended the missing piece to the mgsnode= parameter it mounted
> immediately.

This is what I did according to my notes:

mds:
    mkfs.lustre --fsname=lustre --mgs --mdt /dev/sda6
    mount -t lustre /dev/sda6 /mdt0

oss0:
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sda3
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sdb3
    mount -t lustre /dev/sda3 /ost0
    mount -t lustre /dev/sdb3 /ost1

oss1:
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sda3
    mkfs.lustre --ost --fsname=lustre --mgsnode=10.1.1.1@tcp0 /dev/sdb3
    mount -t lustre /dev/sda3 /ost0
    mount -t lustre /dev/sdb3 /ost1

client:
    mount -t lustre 10.1.1.1@tcp0:/lustre /lustre

I can try to rewrite the parameters on the file systems and power-cycle the
whole cluster, but I guess I have to wait for quieter times for this.
Reformatting the OSTs would be a major hassle, as I would need to back up
all the data first.

Arne
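P.S. Before power-cycling anything I would probably just inspect and, if needed, rewrite the parameters with tunefs.lustre. A rough sketch of what I have in mind - please correct me if the flags are wrong for this version (e.g. --erase-param vs. --erase-params):

    # Print the current on-disk parameters without changing anything
    tunefs.lustre --dryrun /dev/sda3

    # Rewrite the MGS NID and regenerate the config logs on next mount
    tunefs.lustre --erase-param --mgsnode=10.1.1.1@tcp0 --writeconf /dev/sda3

As I understand it, a --writeconf has to be done on the MDT and all OSTs together, with the whole filesystem stopped, so this would indeed have to wait for quieter times.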