Hi all,

we have a problem with our production system (v. 1.6.5.1). It is in recovery, but recovery never finishes.

The background: some unknown problems with the MDT, attempts to restart the MDS, etc. The MDT would start recovery, at some point during recovery lose connection to its OSTs, restart recovery, and so on.

I then moved the service to a partner machine, where recovery started with

>> 11:37:07: ... in recovery for at least 5:00, or until 415 clients reconnect.

(I always understood these numbers as minutes; the /proc/.../recovery_status usually starts at 3000 sec, though 5 min would be a little less...)

The countdown went on until

>> 12:03:32: ... 227 clients in recovery for 1457s

Four minutes later, there were

>> 12:07:21: ... 133 recoverable clients remain

Then something bad must have happened, because

>> 12:07:42: ... 121 clients in recovery for 20721s

Most of these clients seemed to be no problem, because only 4 minutes later

>> 12:11:52: ... 1 clients in recovery for 20471s

So far, the countdown continues, but of course these are extremely long recovery times.

My questions:
Where might I have misconfigured the system to make it wait that long for a client?
Is there a command to abort the recovery?

All the OSTs seem to be connected and happy. I therefore guess that the remaining client is just one client in the usual sense - a batch node or similar machine that still has the system mounted. Of course I would not hesitate to kick out that client - or many of these if necessary - but I don't know which it is. So another question: how do I find out the identities of the clients that are recoverable / in recovery / without problems / gone for good?

Many thanks,
Thomas
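For that last question, the MDT's proc tree is usually the quickest place to look on Lustre 1.6 - a sketch, in which "work-MDT0000" is a placeholder for the actual MDT device name:

# On the MDS: watch the recovery countdown
# ("work-MDT0000" is a placeholder for your MDT's name):
cat /proc/fs/lustre/mds/work-MDT0000/recovery_status

# Every client the MDT knows about has an export directory named
# after its NID, so listing them enumerates the clients:
ls /proc/fs/lustre/mds/work-MDT0000/exports/

# Diffing that list against the NIDs of machines you know to be
# alive points at the client that never came back.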
Ok, at an ETA of 8100 sec we lost patience and did

> lctl --device MDS-Name abort_recovery

This obviously did the trick:

>> recovery period over; 1 clients never reconnected after 14483s (414 clients did)

Access to the system seems to work as expected. Still, we are not satisfied at all. One thing we would like to know, urgently, is how to find out which client caused that delay. As indicated before, we have no problem nuking a silly client, tearing it apart, ripping out its memory banks or whatever violent action might be needed. Most probably, though, the fault lies within our configuration, not this single client (perhaps this is a machine that had a Lustre mount some time ago and is now switched off - batch nodes tend to die every now and then).

Our /proc/sys/lustre/timeout is 1000 - there has been some debate on this large value here, but most other installations will not run in a network environment with a setup as crazy as ours. Putting the timeout to 100 immediately results in "Transport endpoint" errors; it is impossible to run Lustre like this.

Since this is a 1.6.5.1 system, I activated the adaptive timeouts - and put them to equally large values:

/sys/module/ptlrpc/parameters/at_max = 6000
/sys/module/ptlrpc/parameters/at_history = 6000
/sys/module/ptlrpc/parameters/at_early_margin = 50
/sys/module/ptlrpc/parameters/at_extra = 30

Reading the manual, I understood that at_max is a maximum value. I learned from an earlier question I posted on this list that with the static timeout from /proc/sys/lustre/timeout, recovery will take 2.5 times this value. Assuming the worst, 2.5 times at_max, I still don't arrive at 21000 sec! So I'm quite clueless as to what mistakes I have made here.

Btw, when trying to find out about connected/disconnected clients, I ran "lctl conn_list", which gave me a very long listing (how do you do "| less" in this lctl shell?), with all entries marked as "nonagle" - what does that mean?

Oh, last remark for the records: to do this "lctl abort_recovery" command, you have to find out the right device number or name. "lctl dl" gives me five entries on my MGS/MDT server: "mgs", "mgc", "mdt", "lov", "mds". The correct device name for the lctl command is the one after "mds".

Regards,
Thomas
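To spell that last remark out - a sketch with illustrative device names and output; the actual names depend on your filesystem label:

# "lctl dl" lists the local Lustre devices on the MGS/MDT server:
lctl dl
#  0 UP mgs MGS MGS 9
#  1 UP mgc MGC10.1.1.1@tcp <uuid> 5
#  2 UP mdt MDS MDS_uuid 3
#  3 UP lov work-mdtlov work-mdtlov_UUID 4
#  4 UP mds work-MDT0000 work-MDT0000_UUID 417
# The name on the "mds" line (here the hypothetical work-MDT0000)
# is the device that abort_recovery wants:
lctl --device work-MDT0000 abort_recovery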
On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
> Our /proc/sys/lustre/timeout is 1000

That's way too high. Long recoveries are exactly the reason you don't want this number to be huge.

> - there has been some debate on this large value here, but most other
> installations will not run in a network environment with a setup as
> crazy as ours.

What's so crazy about your setup? Unless your network is very flaky and/or you have not tuned your OSSes properly, there should be no need for such a high timeout, and if there is, you need to address the problems requiring it.

> Putting the timeout to 100 immediately results in "Transport endpoint"
> errors; it is impossible to run Lustre like this.

300 is the max that we recommend, and we have very large production clusters that use such values successfully.

> Since this is a 1.6.5.1 system, I activated the adaptive timeouts - and
> put them to equally large values:
> /sys/module/ptlrpc/parameters/at_max = 6000
> /sys/module/ptlrpc/parameters/at_history = 6000
> /sys/module/ptlrpc/parameters/at_early_margin = 50
> /sys/module/ptlrpc/parameters/at_extra = 30

This is likely not good as well. I will let somebody more knowledgeable about AT comment in detail, though. It's a new feature and not getting wide use at all yet, so the real-world experience is still low.

b.
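For concreteness, lowering the static timeout on 1.6 would look something like this - a sketch only; "work" is a placeholder filesystem name, and the conf_param syntax should be checked against your version of the manual:

# Takes effect immediately on the node it is run on; servers and
# clients should agree on the value:
echo 300 > /proc/sys/lustre/timeout

# To persist it filesystem-wide, run on the MGS
# ("work" is a placeholder for the filesystem name):
lctl conf_param work.sys.timeout=300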
I'm going to pipe in here. We too use a very large (1000) timeout value. We have two separate Lustre file systems: one consists of two rather beefy OSSs with 12 OSTs each (FalconIII FC-SATA RAID); the other consists of 8 OSSs with 3 OSTs each (Xyratex 4900FC). We have about 500 clients and support both tcp and o2ib NIDs. We run Lustre 1.6.4.2 on a patched 2.6.18-8.1.14 CentOS/RH kernel. It has worked *very* well for us for over a year now - very few problems, with very good performance under very heavy loads.

We've tried setting our timeout to lower values but settled on the 1000 value (despite the long recovery periods) because if we don't, our Lustre connectivity starts to break down and our mounts come and go with errors like "transport endpoint failure" or "transport endpoint not connected" or some such (it's been a while now). File system access comes and goes randomly on nodes. We tried many tunings and looked for other sources of problems (underlying network issues). Ultimately, the only thing we found that fixed this was to extend the timeout value.

I know you will be tempted to tell us that our network must be flaky, but it simply is not. We'd love to understand why we need such a large timeout value and why, if we don't use a large value, we see these transport endpoint failures. However, after spending several days trying to understand and resolve the issue, we finally just accepted the long timeout as a suitable workaround.

I wonder if there are others who have silently done the same. We'll be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future. Maybe then we'll be able to do away with the long timeout value but until then, we need it. :(

Just my two cents,

Charlie Taylor
UF HPC Center
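One low-effort way to watch those connections flap is the import state in each OSC's proc entry on a client - a sketch, assuming the 1.6 proc layout; the OST names and output are illustrative:

# On a client, each OSC reports the server UUID plus the import
# state: FULL means healthy, DISCONN/CONNECTING means bouncing.
cat /proc/fs/lustre/osc/*/ost_server_uuid
# work-OST0000_UUID FULL
# work-OST0001_UUID DISCONN

# Polling this while the "transport endpoint" errors occur shows
# which OSTs (and hence which OSSes) the client keeps losing.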
We used to do something similar, and still had issues. Upgrading all servers (2 OSSs, 7 OSTs each) and clients (800) to 1.6.6 fixed all our issues; we run default timeouts and default everything, really - no issues.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734) 936-1985
On Wed, 2009-02-25 at 11:22 -0500, Charles Taylor wrote:
> I know you will be tempted to tell us that our network must be flaky,
> but it simply is not. We'd love to understand why we need such a
> large timeout value and why, if we don't use a large value, we see
> these transport endpoint failures. However, after spending several
> days trying to understand and resolve the issue, we finally just
> accepted the long timeout as a suitable workaround.

I'd encourage you to upgrade to the latest version of Lustre (just so we are not chasing possibly old and fixed bugs), re-evaluate your timeout, and report how it works out for you. If you still see unreliability, then file a bug.

I'd also suggest (if you have not already done it) that you use the iokit to be sure your OSSes are properly tuned for the storage bandwidth they have available to them and are not tying up OST threads for overly long periods waiting for storage access.

> I wonder if there are others who have silently done the same. We'll
> be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future. Maybe
> then we'll be able to do away with the long timeout value but until
> then, we need it. :(

Sounds like a good idea.

b.
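The backend check suggested here would typically be an obdfilter-survey run from lustre-iokit - a sketch only; the environment-variable invocation follows the iokit README of that era, and the values are illustrative, not recommendations:

# Run locally on an OSS to measure raw OST backend throughput,
# bypassing clients and the network entirely:
size=8192 nobjlo=1 nobjhi=4 thrlo=1 thrhi=16 ./obdfilter-survey

# If throughput here is far below what the RAID hardware can do,
# long service times (and hence timeouts) point to a storage-tuning
# problem rather than a network one.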