Aaron S. Knister
2008-Mar-04 20:31 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
This morning I've had both my InfiniBand and TCP Lustre clients hiccup. They are evicted from the server, presumably as a result of their high load and the timeouts that follow. My question is: why don't the clients re-connect? Both the InfiniBand and TCP clients give the following message when I type "df": Cannot send after transport endpoint shutdown (-108). I've been battling with this on and off for a few months now. I've upgraded my InfiniBand switch firmware, and all the clients and servers are running the latest version of Lustre and the Lustre-patched kernel. Any ideas?

-Aaron
Charles Taylor
2008-Mar-04 20:41 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
We've seen this before as well. Our experience is that the obd_timeout is far too small for large clusters (ours is 400+ nodes), and the only way we avoid these errors is by setting it to 1000. That seems high to us, but it appears to work and puts an end to the transport endpoint shutdowns.

On the MDS:

    lctl conf_param srn.sys.timeout=1000

You may have to do this on the OSSs as well unless you restart the OSSs, but I could be wrong on that. You should check it everywhere with:

    cat /proc/sys/lustre/timeout
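For reference, a minimal sketch of that sequence, assuming the filesystem is named "srn" as in the example above and that pdsh (or any parallel-ssh equivalent; it is only an illustration here) is available for reaching every node:

    # On the MGS/MDS: set the filesystem-wide timeout (in seconds).
    # conf_param writes to the configuration log, so it should persist
    # and be picked up by clients when they mount.
    lctl conf_param srn.sys.timeout=1000

    # On every OSS and client: confirm the running value actually changed.
    pdsh -a 'cat /proc/sys/lustre/timeout'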
Aaron S. Knister
2008-Mar-04 20:55 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
I think I tried that before and it didn't help, but I will try it again. Thanks for the suggestion.

-Aaron
Brian J. Murrell
2008-Mar-04 21:04 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
On Tue, 2008-03-04 at 15:55 -0500, Aaron S. Knister wrote:
> I think I tried that before and it didn't help, but I will try it
> again. Thanks for the suggestion.

Just so you guys know, 1000 seconds for the obd_timeout is very, very large! As you could probably guess, we have some very, very big Lustre installations and to the best of my knowledge none of them are using anywhere near that. AFAIK (and perhaps a Sun engineer with closer experience to some of these very large clusters might correct me) the largest value that the largest clusters are using is in the neighbourhood of 300s. There has to be some other problem at play here that you need 1000s.

Can you both please report your Lustre and kernel versions? I know you said "latest", Aaron, but some version numbers might be more solid to go on.

b.
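For anyone collecting those version numbers, something along these lines on each client and server is usually enough. Treat it as a sketch: the /proc path below is where a 1.6.x installation reports the running Lustre version, and the exact location may differ between releases.

    uname -r                        # kernel version
    cat /proc/fs/lustre/version     # running Lustre version (1.6.x location)
    rpm -qa | grep -i lustre        # installed Lustre packages, if RPM-based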
Aaron Knister
2008-Mar-04 22:42 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
I made this change and clients are still being evicted. This is very frustrating. It happens over TCP and InfiniBand. My timeout is 1000. Anybody know why the clients don't reconnect?

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org
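In case it helps narrow this down, these are the sorts of things worth capturing on an affected client while it is stuck in that state. The /proc path is based on a 1.6.x client layout and the NID below is only an example, so adjust both to your setup:

    dmesg | tail -50                              # recent eviction / reconnect messages
    lctl dl                                       # device list and state on the client
    cat /proc/fs/lustre/osc/*/ost_server_uuid     # per-OST import state (FULL, DISCONN, ...)
    lctl ping 192.168.0.10@o2ib                   # basic LNET reachability to a server NID (example address)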
Craig Prescott
2008-Mar-05 00:37 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Hi Aaron;

As Charlie mentioned, we have 400 clients and a timeout value of 1000 is "enough" for us. How many clients do you have? If it is more than 400, or the ratio of your o2ib/tcp clients is not like ours (80/20), you may need a bigger value.

Also, we have observed that occasionally we set the timeout on our MGS/MDS machine via:

    lctl conf_param <fsname>.sys.timeout=1000

but it does not "take" everywhere. That is, you should check your OSSes and clients to observe that the correct timeout is reflected in /proc/sys/lustre/timeout. If it isn't, just echo the correct number in there.

If you already checked this, maybe try a bigger value?

Hope that helps,
Craig Prescott
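Spelling out that echo step as a minimal illustration: on any node where the configured value did not take, set the running value directly (as root). Note this changes the live setting only; it does not persist across a remount or reboot the way conf_param does.

    echo 1000 > /proc/sys/lustre/timeout
    cat /proc/sys/lustre/timeout    # confirm the new value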
Charles Taylor
2008-Mar-05 11:56 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Sure, we will provide you with more details of our installation, but let me first say that, if recollection serves, we did not pull that number out of a hat. I believe there is a formula in one of the Lustre tuning manuals for calculating the recommended timeout value. I'll have to take a moment to go back and find it. Anyway, if you use that formula for our cluster, the recommended timeout value, I think, comes out to be *much* larger than 1000. Later this morning we will go back, find that formula, and share with the list how we came up with our timeout. Perhaps you can show us where we are going wrong.

One more comment: we just brought up our second large Lustre file system. It is 80+ TB served by 24 OSTs on two (pretty beefy) OSSs. We just achieved over 2 GB/sec of sustained (large block, sequential) I/O from an aggregate of 20 clients. Our design target was 1.0 GB/sec/OSS and we hit that pretty comfortably.

That said, when we first mounted the new (1.6.4.2) file system across all 400 nodes in our cluster, we immediately started getting "transport endpoint failures" and evictions. We looked rather intensively for network/fabric problems (we have both o2ib and tcp nids) and could find none. All of our MPI apps are/were running just fine. The only way we could get rid of the evictions and transport endpoint failures was by increasing the timeout. Also, we knew to do this based on our experience with our first Lustre file system (1.6.3 + patches), where we had to do the same thing.

Like I said, a little bit later Craig or I will post more details about our implementation. If we are doing something wrong with regard to this timeout business, I would love to know what it is.

Thanks,

Charlie Taylor
UF HPC Center
Frank Leers
2008-Mar-05 16:03 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
On Tue, 2008-03-04 at 22:04 +0100, Brian J. Murrell wrote:
> AFAIK (and perhaps a Sun engineer with closer experience to some of
> these very large clusters might correct me) the largest value that
> the largest clusters are using is in the neighbourhood of 300s.
> There has to be some other problem at play here that you need 1000s.

I can confirm that at a recent large installation with several thousand clients, the default of 100 is in effect.
Aaron Knister
2008-Mar-05 16:08 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
That's very strange. What interconnect is that site using?

My versions are:

Lustre - 1.6.4.2
Kernel (servers) - 2.6.18-8.1.14.el5_lustre.1.6.4.2smp
Kernel (clients) - 2.6.18-53.1.13.el5
Frank Leers
2008-Mar-05 16:33 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
On Wed, 2008-03-05 at 11:08 -0500, Aaron Knister wrote:
> That's very strange. What interconnect is that site using?

Not really strange, but:

SDR IB/OFED
Lustre 1.6.4.2
2.6.18.8 clients
2.6.9-55.0.9 servers
Charles Taylor
2008-Mar-05 16:34 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Well, go figure. We are running:

Lustre: 1.6.4.2 on clients and servers
Kernel: 2.6.18-8.1.14.el5Lustre (clients and servers)
Platform: x86_64 (Opteron 275s, mostly)
Interconnect: IB, Ethernet
IB Stack: OFED 1.2

We already posted our procedure for patching the kernel, building OFED, and building Lustre, so I don't think I'll go into that again. Like I said, we just brought a new file system online. Everything looked fine at first with just a few clients mounted. Once we mounted all 408 (or so), we started getting all kinds of "transport endpoint failures" and the MGSs and OSTs were evicting clients left and right. We looked for network problems and could not find any of any substance. Once we increased the obd/lustre system timeout setting as previously discussed, the errors vanished. This was consistent with our experience with 1.6.3 as well. That file system has been online since early December. Both file systems appear to be working well.

I'm not sure what to make of it. Perhaps we are just masking another problem. Perhaps there are some other, related values that need to be tuned. We've done the best we could, but I'm sure there is still much about Lustre we don't know. We'll try to get someone out to the next class, but until then we're on our own, so to speak.

Charlie Taylor
UF HPC Center
Aaron Knister
2008-Mar-05 18:09 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Are you running DDR or SDR IB? Also, what hardware are you using for your storage?
Charles Taylor
2008-Mar-05 18:30 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
SDR on the IB side. Our storage is RAID Inc. Falcon 3s, host-attached via 4Gb QLogic FC HBAs.

http://www.raidinc.com/falcon_III.php

Regards,

Charlie
Aaron Knister
2008-Mar-05 18:37 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Could you tell me what version of OFED was being used? Was it the version that ships with the kernel?

-Aaron
Aaron Knister
2008-Mar-05 18:39 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
I wonder if the issue is related to the kernels being run on the servers. Both Mr. Taylor's setup and mine are running the 2.6.18 kernel on the servers, whereas the setup mentioned with a timeout of 100 was using the 2.6.9 kernel on the servers.

-Aaron
Frank Leers
2008-Mar-05 19:03 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
On Wed, 2008-03-05 at 13:37 -0500, Aaron Knister wrote:
> Could you tell me what version of OFED was being used? Was it the
> version that ships with the kernel?

OFED version is 1.2.5.4
Aaron Knister
2008-Mar-05 23:00 UTC
[Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Are the clients SuSE, Red Hat, or another distro? I can't get OFED 1.2.5.4 to build with RHEL 5, but I'm working on that.