Hi all,

we have a problem with our production system (v. 1.6.5.1). It is in recovery, but recovery never finishes.

The background: some unknown problems with the MDT, attempts to restart the MDS, etc. The MDT would start recovery, at some point during recovery lose connection to its OSTs, restart recovery, and so on.

I then moved the service to a partner machine, where recovery started with

>> 11:37:07: ... in recovery for at least 5:00, or until 415 clients reconnect.

(I always understood these numbers as minutes; the /proc/.../recovery_status usually starts at 3000 sec, though 5 min would be a little less...)

The countdown went on until

>> 12:03:32: ... 227 clients in recovery for 1457s

Four minutes later, there were

>> 12:07:21: ... 133 recoverable clients remain

Then something bad must have happened, because

>> 12:07:42: ... 121 clients in recovery for 20721s

Most of these clients seemed to be no problem, because only 4 minutes later

>> 12:11:52: ... 1 clients in recovery for 20471s

So far, the countdown continues, but of course these are extremely long recovery times.

My questions:
Where might I have misconfigured the system to make it wait that long for a client?
Is there a command to abort the recovery?

All the OSTs seem to be connected and happy. I therefore guess that the remaining client is just one client in the usual sense - a batch node or similar machine that still has the system mounted. Of course I would not hesitate to kick out that client - or many of these if necessary - but I don't know which it is. So another question: how do I find out the identities of the clients that are recoverable / in recovery / without problems / gone for good?

Many thanks,
Thomas
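For that last question, the MDT's proc tree is usually the quickest place to look on Lustre 1.6 - a sketch, in which "work-MDT0000" is a placeholder for the actual MDT device name:

# On the MDS: watch the recovery countdown
# ("work-MDT0000" is a placeholder for your MDT's name):
cat /proc/fs/lustre/mds/work-MDT0000/recovery_status

# Every client the MDT knows about has an export directory named
# after its NID, so listing them enumerates the clients:
ls /proc/fs/lustre/mds/work-MDT0000/exports/

# Diffing that list against the NIDs of machines you know to be
# alive points at the client that never came back.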
Ok, at an ETA of 8100 sec we lost patience and did

> lctl --device MDS-Name abort_recovery

This obviously did the trick:

>> recovery period over; 1 clients never reconnected after 14483s (414 clients did)

Access to the system seems to work as expected. Still, we are not satisfied at all. One thing we would like to know, urgently, is how to find out which client caused that delay. As indicated before, we have no problem nuking a silly client, tearing it apart, ripping out its memory banks or whatever violent action might be needed. Most probably, though, the fault lies within our configuration, not this single client (perhaps this is a machine that had a Lustre mount some time ago and is now switched off - batch nodes tend to die every now and then).

Our /proc/sys/lustre/timeout is 1000 - there has been some debate on this large value here, but most other installations will not run in a network environment with a setup as crazy as ours. Putting the timeout to 100 immediately results in "Transport endpoint" errors; it is impossible to run Lustre like this.

Since this is a 1.6.5.1 system, I activated the adaptive timeouts - and put them to equally large values:

/sys/module/ptlrpc/parameters/at_max = 6000
/sys/module/ptlrpc/parameters/at_history = 6000
/sys/module/ptlrpc/parameters/at_early_margin = 50
/sys/module/ptlrpc/parameters/at_extra = 30

Reading the manual, I understood that at_max is a maximum value. I learned from an earlier question I posted on this list that with the static timeout from /proc/sys/lustre/timeout, recovery will take 2.5 times this value. Assuming the worst, 2.5 times at_max, I still don't arrive at 21000 sec! So I'm quite clueless as to what mistakes I have made here.

Btw, when trying to find out about connected/disconnected clients, I ran "lctl conn_list", which gave me a very long listing (how do you do "| less" in this lctl shell?), with all entries marked as "nonagle" - what does that mean?

Oh, last remark for the records: to do this "lctl abort_recovery" command, you have to find out the right device number or name. "lctl dl" gives me five entries on my MGS/MDT server: "mgs", "mgc", "mdt", "lov", "mds". The correct device name for the lctl command is the one after "mds".

Regards,
Thomas
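To spell that last remark out - a sketch with illustrative device names and output; the actual names depend on your filesystem label:

# "lctl dl" lists the local Lustre devices on the MGS/MDT server:
lctl dl
#  0 UP mgs MGS MGS 9
#  1 UP mgc MGC10.1.1.1@tcp <uuid> 5
#  2 UP mdt MDS MDS_uuid 3
#  3 UP lov work-mdtlov work-mdtlov_UUID 4
#  4 UP mds work-MDT0000 work-MDT0000_UUID 417
# The name on the "mds" line (here the hypothetical work-MDT0000)
# is the device that abort_recovery wants:
lctl --device work-MDT0000 abort_recovery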
On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
> Our /proc/sys/lustre/timeout is 1000

That's way too high. Long recoveries are exactly the reason you don't want this number to be huge.

> - there has been some debate on this large value here, but most other
> installations will not run in a network environment with a setup as
> crazy as ours.

What's so crazy about your setup? Unless your network is very flaky and/or you have not tuned your OSSes properly, there should be no need for such a high timeout, and if there is, you need to address the problems requiring it.

> Putting the timeout to 100 immediately results in "Transport endpoint"
> errors; it is impossible to run Lustre like this.

300 is the max that we recommend, and we have very large production clusters that use such values successfully.

> Since this is a 1.6.5.1 system, I activated the adaptive timeouts - and
> put them to equally large values:
> /sys/module/ptlrpc/parameters/at_max = 6000
> /sys/module/ptlrpc/parameters/at_history = 6000
> /sys/module/ptlrpc/parameters/at_early_margin = 50
> /sys/module/ptlrpc/parameters/at_extra = 30

This is likely not good as well. I will let somebody more knowledgeable about AT comment in detail, though. It's a new feature and not getting wide use at all yet, so the real-world experience is still low.

b.
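For concreteness, lowering the static timeout on 1.6 would look something like this - a sketch only; "work" is a placeholder filesystem name, and the conf_param syntax should be checked against your version of the manual:

# Takes effect immediately on the node it is run on; servers and
# clients should agree on the value:
echo 300 > /proc/sys/lustre/timeout

# To persist it filesystem-wide, run on the MGS
# ("work" is a placeholder for the filesystem name):
lctl conf_param work.sys.timeout=300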
I'm going to pipe in here. We too use a very large (1000) timeout value. We have two separate Lustre file systems: one consists of two rather beefy OSSs with 12 OSTs each (FalconIII FC-SATA RAID); the other consists of 8 OSSs with 3 OSTs each (Xyratex 4900FC). We have about 500 clients and support both tcp and o2ib NIDs. We run Lustre 1.6.4.2 on a patched 2.6.18-8.1.14 CentOS/RH kernel. It has worked *very* well for us for over a year now - very few problems, with very good performance under very heavy loads.

We've tried setting our timeout to lower values but settled on the 1000 value (despite the long recovery periods) because if we don't, our Lustre connectivity starts to break down and our mounts come and go with errors like "transport endpoint failure" or "transport endpoint not connected" or some such (it's been a while now). File system access comes and goes randomly on nodes. We tried many tunings and looked for other sources of problems (underlying network issues). Ultimately, the only thing we found that fixed this was to extend the timeout value.

I know you will be tempted to tell us that our network must be flaky, but it simply is not. We'd love to understand why we need such a large timeout value and why, if we don't use a large value, we see these transport endpoint failures. However, after spending several days trying to understand and resolve the issue, we finally just accepted the long timeout as a suitable workaround.

I wonder if there are others who have silently done the same. We'll be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future. Maybe then we'll be able to do away with the long timeout value but until then, we need it. :(

Just my two cents,

Charlie Taylor
UF HPC Center
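One low-effort way to watch those connections flap is the import state in each OSC's proc entry on a client - a sketch, assuming the 1.6 proc layout; the OST names and output are illustrative:

# On a client, each OSC reports the server UUID plus the import
# state: FULL means healthy, DISCONN/CONNECTING means bouncing.
cat /proc/fs/lustre/osc/*/ost_server_uuid
# work-OST0000_UUID FULL
# work-OST0001_UUID DISCONN

# Polling this while the "transport endpoint" errors occur shows
# which OSTs (and hence which OSSes) the client keeps losing.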
We used to do something similar, and still had issues. Upgrading all servers (2 OSSs, 7 OSTs each) and clients (800) to 1.6.6 fixed all our issues; we run default timeouts and default everything, really - no issues.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734) 936-1985
On Wed, 2009-02-25 at 11:22 -0500, Charles Taylor wrote:
> I know you will be tempted to tell us that our network must be flaky,
> but it simply is not. We'd love to understand why we need such a
> large timeout value and why, if we don't use a large value, we see
> these transport endpoint failures. However, after spending several
> days trying to understand and resolve the issue, we finally just
> accepted the long timeout as a suitable workaround.

I'd encourage you to upgrade to the latest version of Lustre (just so we are not chasing possibly old and fixed bugs), re-evaluate your timeout, and report how it works out for you. If you still see unreliability, then file a bug.

I'd also suggest (if you have not already done it) that you use the iokit to be sure your OSSes are properly tuned for the storage bandwidth they have available to them and are not tying up OST threads for overly long periods waiting for storage access.

> I wonder if there are others who have silently done the same. We'll
> be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future. Maybe
> then we'll be able to do away with the long timeout value but until
> then, we need it. :(

Sounds like a good idea.

b.
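The backend check suggested here would typically be an obdfilter-survey run from lustre-iokit - a sketch only; the environment-variable invocation follows the iokit README of that era, and the values are illustrative, not recommendations:

# Run locally on an OSS to measure raw OST backend throughput,
# bypassing clients and the network entirely:
size=8192 nobjlo=1 nobjhi=4 thrlo=1 thrhi=16 ./obdfilter-survey

# If throughput here is far below what the RAID hardware can do,
# long service times (and hence timeouts) point to a storage-tuning
# problem rather than a network one.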