Hi everyone,

2 months ago, we switched our ~80 node cluster from NFS to Lustre: 1 MDS, 4 OSTs, Lustre 1.6.7.2 on Rocks 4.2.1/CentOS 4.2/Linux 2.6.9-78.0.22.

We were quite happy with Lustre's performance, especially because the bottlenecks caused by /home disk access were history.

On Saturday, the cluster went down (= was inaccessible). After some investigation I found out that the reason seems to be an overloaded MDS. Over the following 4 days this happened multiple times, and it could only be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.

The MDS did not respond to any command. When I managed to get a video signal (not often), the load was >170. Additionally, a kernel oops was displayed twice, but unfortunately I have no record of them.

The clients showed the following error:

> Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req@f6222800 x8702488/t0 o250->MGS@10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages

So, my question is: what could cause such a load? The cluster was not being used excessively... Is this a bug, or is it a user's job that creates the load? How can I protect Lustre against this kind of failure?

Thanks in advance,
Arne

--
Arne Brutschy
Ph.D. Student                    Email  arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web    iridia.ulb.ac.be/~abrutschy
Université Libre de Bruxelles    Tel    +32 2 650 3168
Avenue Franklin Roosevelt 50     Fax    +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)
Hi Arne,

It could be memory pressure and the OOM killer running and shooting at things. How much memory does your server have?

Michael

On Friday, 09.10.2009, 10:26 +0200, Arne Brutschy wrote:
> On Saturday, the cluster went down (= was inaccessible). After some
> investigation I found out that the reason seems to be an overloaded MDS.
> Over the following 4 days this happened multiple times, and it could only
> be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.
> [...]
> So, my question is: what could cause such a load? The cluster was not
> being used excessively... Is this a bug, or is it a user's job that
> creates the load? How can I protect Lustre against this kind of failure?

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge@tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:
> The clients showed the following error:
> > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req@f6222800 x8702488/t0 o250->MGS@10.255.255.206@tcp:26/25 lens 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > Oct 8 09:58:55 majorana kernel: LustreError: 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar messages
>
> So, my question is: what could cause such a load? The cluster was not
> being used excessively... Is this a bug, or is it a user's job that
> creates the load? How can I protect Lustre against this kind of failure?

Without any more information we could not possibly know. If you really are getting oopses then you will need console logs (i.e. a serial console) so that we can see the stack trace.

b.
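P.S. A rough sketch of what I mean by a serial console, in case it helps -- the tty device, baud rate and exact file contents are assumptions for your hardware, so adjust as needed:

    # /boot/grub/grub.conf on the MDS: append to the existing kernel line
    #   console=tty0 console=ttyS0,115200

    # /etc/inittab: optionally add a getty so you can also log in over serial
    S0:2345:respawn:/sbin/agetty -L 115200 ttyS0 vt100

    # make sure oopses actually go to the console
    echo 8 > /proc/sys/kernel/printk

    # on the machine at the other end of the null-modem cable, capture everything
    screen -L /dev/ttyS0 115200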
Hi,

8GB of RAM, 2x 4-core Intel Xeon E5410 @ 2.33GHz.

Arne

On Fri, 2009-10-09 at 12:16 +0200, Michael Kluge wrote:
> It could be memory pressure and the OOM killer running and shooting at
> things. How much memory does your server have?
Hmm, that should be enough. I guess you then need to set up a loghost for syslog and a reliable serial console to get stack traces. Everything else would be just a wild guess (as the question about the RAM size was).

Michael

> Hi,
>
> 8GB of RAM, 2x 4-core Intel Xeon E5410 @ 2.33GHz.
>
> Arne
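P.S. The syslog side is quick to do on CentOS 4; a sketch, where "loghost" is just a placeholder for whatever machine you want to collect the logs on:

    # on the MDS and OSS nodes, add to /etc/syslog.conf:
    *.*          @loghost

    # on the loghost itself, let syslogd accept remote messages
    # (/etc/sysconfig/syslog):
    SYSLOGD_OPTIONS="-r -m 0"

    # then on every node you touched:
    service syslog restart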
Hi,

thanks for replying!

I understand that without further information we can't do much about the oopses. I was more hoping for some information regarding possible sources of such an overload. Is it normal for an MDS to get overloaded like this while the OSTs have nothing to do, and what can I do about it? How can I find the source of the problem?

More specifically, what are the operations that lead to a lot of MDS load and none for the OSTs? Although our MDS (8GB RAM, 2x 4-core, SATA) is not a top-notch server, it is fairly recent, and I would not expect the load we are generating to be more than a single MDS can handle.

My problem is that I can't make out major problems in the users' jobs running on the cluster, and I can neither quantify nor track down the problem because I don't know what behavior might have caused it.

As I said, oopses appeared only twice; all other incidents were only apparent through a non-responsive MDS.

Thanks,
Arne

On Fri, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote:
> Without any more information we could not possibly know. If you really
> are getting oopses then you will need console logs (i.e. a serial console)
> so that we can see the stack trace.
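P.S. For what it's worth, the only thing I know to look at so far are the per-operation counters under /proc -- something like the following (the paths are guessed from our 1.6 tree and may need adjusting on other versions):

    # on the MDS: per-operation RPC counters (open, getattr, unlink, ...);
    # take two snapshots a minute apart and diff them to see what the clients
    # are hammering
    cat /proc/fs/lustre/mds/lustre-MDT0000/stats

    # on a suspect client: the same idea from the client side
    cat /proc/fs/lustre/llite/*/stats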
On Fri, 2009-10-09 at 16:15 +0200, Arne Brutschy wrote:
> I understand that without further information we can't do much about the
> oopses.

Not just the oopses. It's pretty hard to diagnose any problem without data.

> I was more hoping for some information regarding possible
> sources of such an overload.

The possibilities are endless. Lustre is a complex system with many, many interactions with many other components. Sure, I could make stab-in-the-dark guesses like too little memory, too slow a disk, etc., but none of those are actually useful for anything more than a shopping trip, and even then new hardware might still not solve your problem.

> How can I find the source of the problem?

Examine the logs. Lustre logs a lot of information when things are going wrong.

b.
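P.S. Besides syslog, the Lustre debug buffer on the MDS can be dumped when things go bad; a sketch (the debug mask value is just an example, and the /proc path may vary slightly between versions):

    # widen the debug mask (on 1.6 this lives under /proc/sys/lnet)
    echo -1 > /proc/sys/lnet/debug

    # dump the in-kernel debug buffer to a file for later inspection
    lctl dk /tmp/lustre-debug.$(date +%s)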
LMT (http://code.google.com/p/lmt) might be able to give some hints if users are using the FS in a 'wild' fashion. As for the question "what can cause this behaviour of my MDS", I guess the answer is: a million things ;) There is no way of being more specific without more input about the problem itself.

Michael

On Friday, 09.10.2009, 16:15 +0200, Arne Brutschy wrote:
> More specifically, what are the operations that lead to a lot of MDS
> load and none for the OSTs?
> [...]
> My problem is that I can't make out major problems in the users' jobs
> running on the cluster, and I can neither quantify nor track down the
> problem because I don't know what behavior might have caused it.
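P.S. If LMT turns out to be a pain to set up, even a dumb logging loop on the MDS can tell you whether the box is CPU-bound, swapping, or stuck in I/O when the load spikes -- a rough sketch, nothing more (log path and interval are arbitrary):

    while :; do
        echo "== $(date)"
        uptime
        vmstat 1 2 | tail -1                               # run queue, swap, iowait
        ps -eo stat,comm | grep ll_mdt | sort | uniq -c    # how many MDS threads, and in which state
        sleep 60
    done >> /var/log/mds-watch.log &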
Hi,

OK, thanks then. I was hoping there was something like "doing this or that operation in abundance will cause this sort of behavior". :)

Well then, I will add a serial console and the Lustre monitoring. I will then try to send the MDS into overload, as we have by now managed to isolate the user who is apparently causing it.

Thanks for now, and have a nice weekend,
Arne

On Fri, 2009-10-09 at 10:23 -0400, Brian J. Murrell wrote:
> > How can I find the source of the problem?
>
> Examine the logs. Lustre logs a lot of information when things are
> going wrong.
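P.S. If rerunning the isolated user's jobs doesn't reproduce it, I may also try a synthetic metadata storm from a few clients at once -- something crude like the loop below (directory name and file count are arbitrary):

    # run concurrently on a handful of clients, in a scratch dir on the Lustre mount
    mkdir -p $HOME/mds-stress && cd $HOME/mds-stress
    for i in $(seq 1 100000); do
        touch f.$i
        stat f.$i > /dev/null
    done
    rm -f f.*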
Hello,

thanks for your suggestions. I tried to investigate the reason for the MDS overload further. It happened 3 times last week, even after I locked out the user that I thought was causing it.

As suggested, I set up a serial console monitor on the MDS and installed LMT (not particularly easy :). This morning, the MDS was overloaded again. The CPU load spiked beyond 150 and the logs were showing the following errors:

-------8<----------
Oct 20 12:02:38 compute-2-0 kernLustre: 9806:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x64452376 took longer than estimated (100+87s); client may timeout. req@f6081c00 x64452376/t1397543027 o101->5f1b0ba3-5e5b-d6b0-17a2-8e974cb5df12@:0/0 lens 496/472 e 0 to 0 dl 1256032871 ref 1 fl Complete:/0/0 rc 301/301
el: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 187s: evicting client at 10.255.255.190@tcp ns: mds-lustre-MDT0000_UUID lock: f0249540/0xefbdbd73c4253265 lrc: 3/0,0 mode: CR/CR res: 1343024/2832666510 bits 0x3 rrc: 174 type: IBT flags: 0x4000030 remote: 0x73cb76d4b7121d86 expref: 1253 pid: 9807 timeout: 8475134
Lustre: 9747:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x78099102 took longer than estimated (100+92s); client may timeout. req@f4843a00 x78099102/t0 o101->99653b1f-c07d-b7c3-93fc-11f9ba3630f8@:0/0 lens 432/416 e 0 to 0 dl 1256032871 ref 1 fl Complete:/0/0 rc 0/0
LustreError: 9886:0:(ldlm_lockd.c:1161:ldlm_handle_enqueue()) ### lock on destroyed export f631b000 ns: mds-lustre-MDT0000_UUID lock: d23069c0/0xefbdbd73c4253281 lrc: 3/0,0 mode: CR/CR res: 1343024/2832666510 bits 0x3 rrc: 168 type: IBT flags: 0x4000030 remote: 0x9422ea29787ca6a7 expref: 3 pid: 9886 timeout: 0
Oct 20 12:02:43 compute-2-0 kernel: LustreError: 9886:0:(ldlm_lockd.c:1161:ldlm_handle_enqueue()) ### lock on destroyed export f631b000 ns: mds-lustre-MDT0000_UUID lock: d23069c0/0xefbdbd73c4253281 lrc: 3/0,0 mode: CR/CR res: 1343024/2832666510 bits 0x3 rrc: 168 type: IBT flags: 0x4000030 remote: 0x9422ea29787ca6a7 expref: 3 pid: 9886 timeout: 0
Lustre: 4380:0:(import.c:507:import_select_connection()) lustre-OST0002-osc: tried all connections, increasing latency to 26s
LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) ptlrpc_queue_wait failed, rc: -3
Oct 20 12:02:45 LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) Skipped 1 previous similar message
compute-2-0 kernLustreError: 12203:0:(quota_master.c:1661:qmaster_recovery_main()) qmaster recovery failed! (id:552 type:0 rc:-3)
el: LustreError: 167-0: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
Oct 20 12:02:45 compute-2-0 kernel: LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) ptlrpc_queue_wait failed, rc: -3
Oct 20 12:02:45 compute-2-0 kernel: LustreError: 12203:0:(quota_ctl.c:288:client_quota_ctl()) Skipped 1 previous similar message
Oct 20 12:02:45 compute-2-0 kernel: LustreError: 12203:0:(quota_master.c:1661:qmaster_recovery_main()) qmaster recovery failed!
(id:552 type:0 rc:-3) LustreError: 11770:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2 Oct 20 12:02:47 compute-2-0 kernel: LustreError: 11770:0:(llog_server.c:471:llog_origin_handle_cancel()) Cancel 62 of 124 llog-records failed: -2 LustreError: 9771:0:(handler.c:1601:mds_handle()) operation 35 on unconnected MDS from 12345-10.255.255.190 at tcp Oct 20 12:02:49 compute-2-0 kernel: LustreError: 9771:0:(handler.c:1601:mds_handle()) operation 35 on unconnected MDS from 12345-10.255.255.190 at tcp Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 9891: it was inactive for 200s Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 9816 ll_mdt_75 S 00000246 ll_mdt_94 3660 9816 1 S 00000246 9817 9815 (L-TLB) 3556 9882 1 eef1f8e4 00000046 988300000000 00000246 9881f9743d7c f60c17b0 (L-TLB) 4add8a03 ee1337a4 d1e28c40 00000046 f9677b82 00000000 00000000 00000246 f9743d7c c5664780 f6af2cb0 c5663de0 4add8a03 e02c1640 00000001 f9677b82 00000000 a135db00 00000000 c566c780 0098e380 c566bde0 00000002 f60c17b0 f6cf3170 00000000 a135db00 f6cf32dc 0098e380 00000000 f6af2cb0 f9678064 00000286 ecfaf870 eef1e000 ecfaf9dc d1e28c40 00000000 Call Trace: f9678064 00000286 ee132000 e02c1640 Call Trace: [<f9677b82>] [<f9677b82>] ldlm_expired_completion_wait+0xb2/0x510 [ptlrpc] ldlm_expired_completion_wait+0xb2/0x510 [ptlrpc] [<f9678064>] [<f9678064>] is_granted_or_cancelled+0x44/0x80 [ptlrpc] is_granted_or_cancelled+0x44/0x80 [ptlrpc] [<f96789fd>] [<f96789fd>] ldlm_completion_ast+0x95d/0xfb0 [ptlrpc] ldlm_completion_ast+0x95d/0xfb0 [ptlrpc] [<f969b06a>] [<f969b06a>] ldlm_process_inodebits_lock+0x27a/0x7a0 [ptlrpc] ldlm_process_inodebits_lock+0x27a/0x7a0 [ptlrpc] [<c02de9c6>] [<c02de9c6>] __cond_resched+0x14/0x39 __cond_resched+0x14/0x39 [<c011e7ac>] [<c011e7ac>] default_wake_function+0x0/0xc default_wake_function+0x0/0xc [<f9652cb7>] [<f9652cb7>] ldlm_lock_enqueue+0x1e7/0x11f0 [ptlrpc] ldlm_lock_enqueue+0x1e7/0x11f0 [ptlrpc] [<f9677ad0>] [<f9677ad0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] interrupted_completion_wait+0x0/0x10 [ptlrpc] interrupted_completion_wait+0x0/0x10 [ptlrpc] [<f9677ad0>] [<f9677ad0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] ldlm_expired_completion_wait+0x0/0x510 [ptlrpc] [<f9677ac0>] interrupted_completion_wait+0x0/0x10 [ptlrpc] interrupted_completion_wait+0x0/0x10 [ptlrpc] [<f96780a0>] [<f96780a0>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<f967a00a>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<f967a00a>] ldlm_cli_enqueue_local+0x91a/0xac0 [ptlrpc] ldlm_cli_enqueue_local+0x91a/0xac0 [ptlrpc] [<f9679050>] [<f9679050>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<f8f468eb>] [<f8f468eb>] enqueue_ordered_locks+0x1db/0x850 [mds] enqueue_ordered_locks+0x1db/0x850 [mds] [<f9679050>] [<f9679050>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_blocking_ast+0x0/0x690 [ptlrpc] [<f96780a0>] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] ldlm_completion_ast+0x0/0xfb0 [ptlrpc] [<c0168e0f>] [<c0168e44>] __lookup_hash+0x47/0x95 __lookup_hash+0x7c/0x95 [<f8f48a8c>] [<f8f48a8c>] mds_get_parent_child_locked+0x3cc/0xe30 [mds] mds_get_parent_child_locked+0x3cc/0xe30 [mds] [<f8cdf24f>] ksocknal_queue_tx_locked+0x1cf/0x630 [ksocklnd] [<f895c025>] upcall_cache_get_entry+0x295/0xd10 [lvfs] [<f8f6fdce>] 
mds_open+0x70e/0x53c0 [mds] [<f8f204d1>] mds_getattr_lock+0x241/0x1490 [mds] [<c011e832>] [<f96c8a3d>] lustre_msg_get_flags+0x10d/0x1f0 [ptlrpc] [<f8f34040>] fixup_handle_for_resent_req+0x40/0x480 [mds] [<f96c8a3d>] lustre_msg_get_flags+0x10d/0x1f0 [ptlrpc] [<f8f34e71>] mds_intent_policy+0x9f1/0x1500 [mds] [<f8bbdfdb>] LNetMDBind+0x1fb/0x6b0 [lnet] [<f964bc9e>] __wake_up+0x29/0x3c [<f895c025>] ldlm_lock_remove_from_lru+0x5e/0x210 [ptlrpc] upcall_cache_get_entry+0x295/0xd10 [lvfs] [<f8cdf8e3>] ksocknal_launch_packet+0x143/0x5c0 [ksocklnd] [<f96c7ea3>] [<f8f34480>] mds_intent_policy+0x0/0x1500 [mds] [<f96531a3>] lustre_msg_string+0x43/0x510 [ptlrpc] [<c0170c30>] ldlm_lock_enqueue+0x6d3/0x11f0 [ptlrpc] [<f89dc8e0>] lustre_hash_add+0x180/0x4d0 [obdclass] [<f968a6ab>] ldlm_handle_enqueue+0x67b/0x23f0 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f9689060>] ldlm_server_completion_ast+0x0/0x990 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f8f29674>] mds_handle+0x21e4/0x5830 [mds] [<f96cac0c>] lustre_msg_get_conn_cnt+0x10c/0x1f0 [ptlrpc] [<c011cdc3>] activate_task+0x88/0x95 [<f96d4d2e>] ptlrpc_check_req+0xe/0x220 [ptlrpc] [<f96c9a8c>] dput+0x34/0x19b [<f8f55c48>] mds_reint_rec+0xe8/0x3a0 [mds] lustre_msg_get_opc+0x10c/0x1f0 [ptlrpc] [<c011de69>] [<f96c4260>] lustre_pack_reply_v2+0x1f0/0x620 [ptlrpc] [<f8f7b3d0>] mds_update_unpack+0x320/0x550 [mds] [<c02de9c6>] find_busiest_group+0xdd/0x295 [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f96d9f8a>] ptlrpc_server_handle_request+0x28a/0x1e80 [ptlrpc] [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f894218c>] lcw_update_time+0x1c/0x1a0 [libcfs] [<c02de2dd>]<3>LustreError: 9881:0:(handler.c:1601:mds_handle()) operation 400 on unconnected MDS from 12345-10.255.255.191 at tcp __cond_resched+0x14/0x39 [<f8f2438e>] mds_reint+0xfe/0x770 [mds] [<f96c8a3d>] lustre_msg_get_flags+0x10d/0x1f0 [ptlrpc] [<f895ee68>] lprocfs_counter_add+0xf8/0x120 [lvfs] [<f8f34f28>] mds_intent_policy+0xaa8/0x1500 [mds] [<f96547a4>] ldlm_run_cp_ast_work+0x174/0x330 [ptlrpc] [<f9653d87>] ldlm_reprocess_queue+0xc7/0x390 [ptlrpc] [<f969cbd8>] ldlm_pool_recalc+0x108/0x120 [ptlrpc] [<f964b045>] lock_res_and_lock+0x45/0xc0 [ptlrpc] [<f8f34480>] mds_intent_policy+0x0/0x1500 [mds] [<f96531a3>] ldlm_lock_enqueue+0x6d3/0x11f0 [ptlrpc] [<f89dc8e0>] schedule+0x855/0x8f3 lustre_hash_add+0x180/0x4d0 [obdclass] [<c02de30d>] [<f968a6ab>] ldlm_handle_enqueue+0x67b/0x23f0 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f9689060>] ldlm_server_completion_ast+0x0/0x990 [ptlrpc] [<f9688450>] ldlm_server_blocking_ast+0x0/0xc10 [ptlrpc] [<f8f29674>] mds_handle+0x21e4/0x5830 [mds] [<f96cac0c>] lustre_msg_get_conn_cnt+0x10c/0x1f0 [ptlrpc] [<c011cdc3>] activate_task+0x88/0x95 [<f96d4d2e>] ptlrpc_check_req+0xe/0x220 [ptlrpc] [<f96c9a8c>] lustre_msg_get_opc+0x10c/0x1f0 [ptlrpc] [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f96d9f8a>] ptlrpc_server_handle_request+0x28a/0x1e80 [ptlrpc] [<c010aea2>] do_gettimeofday+0x1a/0x9c [<f894218c>] lcw_update_time+0x1c/0x1a0 [libcfs] [<f894218c>] lcw_update_time+0x1c/0x1a0 [libcfs] [<c011e832>] __wake_up+0x29/0x3c [<f89423c0>] lc_watchdog_touch_ms+0xb0/0x280 [libcfs] [<f894263c>] lc_watchdog_disable+0x7c/0x240 [libcfs] [<f89345b3>] cfs_block_sigs+0x73/0xa0 [libcfs] [<f96dcb3b>] ptlrpc_main+0x62b/0xdb0 [ptlrpc] [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<c02e03b2>] schedule+0x885/0x8f3 [<f89423c0>] lc_watchdog_touch_ms+0xb0/0x280 [libcfs] [<f89345b3>] 
cfs_block_sigs+0x73/0xa0 [libcfs] [<f96dcb3b>] ptlrpc_main+0x62b/0xdb0 [ptlrpc] [<c011e7ac>] default_wake_function+0x0/0xc [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<c02e03b2>] ret_from_fork+0x6/0x14 [<f96dc500>] ptlrpc_retry_rqbds+0x0/0x10 [ptlrpc] [<f96dc510>] ptlrpc_main+0x0/0xdb0 [ptlrpc] [<c01041f5>] kernel_thread_helper+0x5/0xb
-------8<----------

This was followed by pages and pages of call traces. The system locked up and I had to reboot the MDS (I managed via the console; ssh wasn't possible anymore). I have attached the full log, from yesterday's reboot up to today's reboot.

Reading the logs myself, it seems that the MDS loses its connection to one of the OSSes and tries to recover from it. Afterwards, things go haywire and the qmaster seems to crash? But I have trouble deciphering the logs. Can you see what's going on?

Cheers,
Arne

PS: a small recap: 1 MDS, 4 OSSes, ~80 clients, ~350 CPUs, max of ~1000 active jobs.

On Fri, 2009-10-09 at 10:23 -0400, Brian J. Murrell wrote:
> > How can I find the source of the problem?
>
> Examine the logs. Lustre logs a lot of information when things are
> going wrong.

[Attachment: mds.log.bz2, 21908 bytes]
On Tue, 2009-10-20 at 16:13 +0200, Arne Brutschy wrote:
> As suggested, I set up a serial console monitor on the MDS and installed
> LMT (not particularly easy :). This morning, the MDS was overloaded
> again. The CPU load spiked beyond 150 and the logs were showing the
> following errors:

Well, prior to this there are lots of quota errors you really ought to look into fixing. Try searching bugzilla for similar problems. What version of Lustre is this MDS running?

As for the MDS problems, the first sign of trouble looks like:

Lustre: Request x7178129 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
Lustre: lustre-OST0002-osc: Connection to service lustre-OST0002 via nid 10.255.255.202@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Request x7178131 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
Lustre: Request x7178132 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 50s ago has timed out (limit 50s).
Lustre: Request x7178178 sent from lustre-OST0002-osc to NID 10.255.255.202@tcp 5s ago has timed out (limit 5s).

Which leads you to OST0002, which is at 10.255.255.202, yes? So what's going on there that the MDS lost its connection to the OST there?

b.
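P.S. A couple of quick checks from the MDS side next time this happens, assuming the usual 1.6 tools are installed (adjust the NID if OST0002 lives elsewhere):

    # can the MDS still reach the OSS over LNET?
    lctl ping 10.255.255.202@tcp

    # list the local Lustre devices and their state as seen by the MDS
    lctl dl

    # and on the OSS itself: load, iowait and recent kernel messages
    uptime; dmesg | tail -50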
On Tue, 2009-10-20 at 11:20 -0400, Brian J. Murrell wrote:
> Well, prior to this there are lots of quota errors you really ought to
> look into fixing. Try searching bugzilla for similar problems. What
> version of Lustre is this MDS running?

As far as I understand it, these 'errors' just report that a user ran out of quota. The whole cluster is running Lustre 1.6.7.2.

> Which leads you to OST0002, which is at 10.255.255.202, yes?
> So what's going on there that the MDS lost its connection to the OST
> there?

Well, that's the big question. The OSSes are basically idle, as we mostly use small files. The OSS log for this time window:

Oct 19 11:59:33 compute-2-3 kernel: LustreError: 166-1: MGC10.255.255.206@tcp: Connection to service MGS via nid 10.255.255.206@tcp was lost; in progress operations using this service will fail.
Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:552:target_handle_reconnect()) lustre-mdtlov_UUID reconnecting from MGC10.255.255.206@tcp_0, handle mismatch (ours 0x9f3a42f18876fee7, theirs 0x1b158acd85bf638)
Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:786:target_handle_connect()) lustre-OST0003: NID 10.255.255.206@tcp (lustre-mdtlov_UUID) reconnected with 1 conn_cnt; cookies not random?
Oct 19 12:00:08 compute-2-3 kernel: LustreError: 4421:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-114) req@f56e9e00 x10/t0 o8-><?>@<?>:0/0 lens 240/144 e 0 to 0 dl 1255946508 ref 1 fl Interpret:/0/0 rc -114/0

It just reports that the connection was lost. It seems that somehow the MDS loses its connection to at least one of the OSSes, and then the situation deteriorates as it gets overwhelmed with reconnect attempts, which it refuses. In the end, it seems that every client is refused by the MDS.

This seems to be similar to this problem:
http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010244.html

Currently, the connections to the clients and between the MDS and OSSes run over a plain gigabit network (2 48-port managed switches). We don't have a separate storage network, as traffic on the job side of things is very low. Looking at the interface statistics, I can't see any errors or dropped packets during normal operation. Maybe I should connect the Lustre servers among themselves directly and see if the latency/timeout problems persist.

Regards,
Arne
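P.S. Before rewiring anything, I will probably watch the NIC and TCP counters on the MDS and the OSSes during one of these episodes -- roughly like this (the interface name is a guess for our boxes):

    # per-NIC error/drop counters
    ethtool -S eth0 | grep -i -e err -e drop
    cat /proc/net/dev

    # TCP-level retransmissions, which the interface counters won't show
    netstat -s | grep -i retrans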