Hello,

As I wrote in #11742 [1], I experienced a kernel panic on the MDS after
doing heavy I/O on the 1.6.5rc2 cluster. Since nobody has answered that
bug so far (and I think in other cases the Lustre team is _really_ fast
(thanks for that :))), I fear it has not been noticed by anybody.

This kernel panic seems to be somehow related to the bug mentioned above
(#11742), as that bug number shows up in the dmesg output when the node
died. Furthermore, right before it started to fail there were several
messages like the following:

LustreError: 3342:0:(osc_request.c:678:osc_announce_cached()) dirty 81108992 > dirty_max 33554432

This behaviour is described in #13344 [2].

Any ideas?

Greetings
Patrick Winnertz

[1]: https://bugzilla.lustre.org/show_bug.cgi?id=11742
[2]: https://bugzilla.lustre.org/show_bug.cgi?id=13344

-- 
Patrick Winnertz
Tel.: +49 (0) 2161 / 4643 - 0

credativ GmbH, HRB Mönchengladbach 12080
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz
On May 16, 2008 12:45 +0200, Patrick Winnertz wrote:
> As I wrote in #11742 [1], I experienced a kernel panic on the MDS after
> doing heavy I/O on the 1.6.5rc2 cluster. Since nobody has answered that
> bug so far (and I think in other cases the Lustre team is _really_ fast
> (thanks for that :))), I fear it has not been noticed by anybody.
>
> This kernel panic seems to be somehow related to the bug mentioned above
> (#11742), as that bug number shows up in the dmesg output when the node
> died. Furthermore, right before it started to fail there were several
> messages like the following:
>
> LustreError: 3342:0:(osc_request.c:678:osc_announce_cached()) dirty
> 81108992 > dirty_max 33554432
>
> This behaviour is described in #13344 [2].

Sorry, I don't have net access right now, so I can't see your comments
in the bug, but the above message is definitely unusual and an
indication of some kind of code bug.

The client imposes a limit on the amount of dirty data that it can
cache (in /proc/fs/lustre/osc/*/max_dirty_mb, default 32MB), on a
per-OST basis. This ensures that in case of lock cancellation there
isn't 5TB of dirty data on the client that would take 30 minutes to
flush to the OST. It seems that either the accounting of the number of
dirty pages on the client has gone badly wrong, or the client has
actually dirtied far more data (80MB) than it should have (32MB).

Could you please explain the type of IO that the client is doing? Is
this normal write(), or writev(), pwrite(), O_DIRECT, mmap, other?
Were there IO errors, or IO resends, or some other unusual problem?
The entry points for this IO into Lustre are all slightly different,
and it wouldn't be the first time there was an accounting error
somewhere.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
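As an aside on the tunable Andreas mentions: the per-OST limit can be
read back from the /proc path given in his mail. Below is a minimal
standalone sketch in plain C (illustrative only, not Lustre code; it
assumes a client node with the Lustre /proc tree present, and the file
name and the use of glob() are just for the example):

/*
 * check_dirty_limits.c - illustrative only, not part of Lustre.
 * Print the per-OST max_dirty_mb tunable that the OSC uses to cap
 * the amount of dirty data cached on the client.  The /proc path
 * below is the one given in Andreas' mail.
 */
#include <glob.h>
#include <stdio.h>

int main(void)
{
        glob_t g;
        size_t i;

        if (glob("/proc/fs/lustre/osc/*/max_dirty_mb", 0, NULL, &g) != 0) {
                fprintf(stderr, "no OSC devices found - is Lustre mounted?\n");
                return 1;
        }

        for (i = 0; i < g.gl_pathc; i++) {
                FILE *f = fopen(g.gl_pathv[i], "r");
                char buf[64];

                if (f == NULL)
                        continue;
                /* each file contains a single number: the limit in MB */
                if (fgets(buf, sizeof(buf), f) != NULL)
                        printf("%s: %s", g.gl_pathv[i], buf);
                fclose(f);
        }

        globfree(&g);
        return 0;
}

Comparing the value it prints (in MB, default 32) with the byte counts
in the LustreError line quoted above shows how far past the limit the
client had gone: 81108992 bytes is roughly 77MB against a 32MB cap.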
Hello!

On May 16, 2008, at 6:45 AM, Patrick Winnertz wrote:
> As I wrote in #11742 [1], I experienced a kernel panic on the MDS after
> doing heavy I/O on the 1.6.5rc2 cluster. Since nobody has answered that
> bug so far (and I think in other cases the Lustre team is _really_ fast
> (thanks for that :))), I fear it has not been noticed by anybody.

I just looked into the logs; you have out-of-memory issues, at the very
least, during that I/O, and also checksum errors. The log you uploaded
does not contain the actual crash info, only these messages (which do
not cause a crash by themselves), followed by OOM and checksum error
messages.

I do not see any panic messages in your logs. Any chance you have a
serial console or some other way to capture the actual panic, complete
with a stack trace and other useful info (ideally a crashdump)?

Bug 11742, which was referenced, is just related to checksum error
problems.

How do you do your heavy I/O? Just regular writes, or mmap writes, or
something else?

Bye,
    Oleg
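Only as a sketch, since the exact setup depends on the distribution and
hardware: a serial console for catching panics like this is usually
enabled by booting the kernel with parameters along the lines of
console=tty0 console=ttyS0,115200 and attaching another machine or a
terminal server to the serial port. The port name and speed here are
assumptions, not details of Patrick's setup, and whether a crashdump
mechanism (e.g. kdump) is also available depends on the kernel in use.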
Hello Andreas,

Sorry for my late answer.

> Could you please explain the type of IO that the client is doing? Is
> this normal write(), or writev(), pwrite(), O_DIRECT, mmap, other?
> Were there IO errors, or IO resends, or some other unusual problem?
> The entry points for this IO into Lustre are all slightly different,
> and it wouldn't be the first time there was an accounting error
> somewhere.

I'm doing the I/O with fsstress. What kind of I/O it generates exactly
I don't know, but if you like I can publish the sources on our company
server so that you can download it in order to test and debug this
better.

Greetings
Patrick Winnertz
-- 
Patrick Winnertz
Tel.: +49 (0) 2161 / 4643 - 0

credativ GmbH, HRB Mönchengladbach 12080
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz
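For context on what kind of I/O that implies: fsstress (the stress tool
available in LTP and the xfstests suite) issues a pseudo-random mix of
data and metadata operations (creates, renames, unlinks, truncates,
plain read()/write(), and in some versions direct I/O), so it can hit
several of the entry points Andreas listed. A typical invocation looks
something like

    fsstress -d /mnt/lustre/testdir -p 4 -n 10000 -s 1234

where -d is the working directory, -p the number of processes, -n the
number of operations per process, and -s the random seed; the values
shown here are placeholders, not the ones Patrick actually used.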
Hello Oleg,

Sorry for this late answer...

> I do not see any panic messages in your logs. Any chance you have a
> serial console or some other way to capture the actual panic, complete
> with a stack trace and other useful info (ideally a crashdump)?

After the MGS crashes there is no chance to get a console there; the
only possible action is to hit the reset button and reboot the server.
But I'll try to create a coredump of the kernel and send it to you or
place it somewhere else where you can download it. (Maybe a bug report?
If yes, which one? One of the two I mentioned, or a new one?)

> Bug 11742, which was referenced, is just related to checksum error
> problems.
> How do you do your heavy I/O? Just regular writes, or mmap writes, or
> something else?

I'm using fsstress for the I/O tests. I'll send it to you.

Greetings
Patrick Winnertz
-- 
Patrick Winnertz
Tel.: +49 (0) 2161 / 4643 - 0

credativ GmbH, HRB Mönchengladbach 12080
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz
Hello!

On May 27, 2008, at 5:20 AM, Patrick Winnertz wrote:
>> I do not see any panic messages in your logs. Any chance you have a
>> serial console or some other way to capture the actual panic,
>> complete with a stack trace and other useful info (ideally a
>> crashdump)?
>
> After the MGS crashes there is no chance to get a console there; the
> only possible action is to hit the reset button and reboot the server.
> But I'll try to create a coredump of the kernel and send it to you or
> place it somewhere else where you can download it. (Maybe a bug
> report? If yes, which one? One of the two I mentioned, or a new one?)

Thanks. No need to send the kernel crashdump to us. Just obtain a
backtrace of the crashed thread and post an email with that (and the
panic message) to lustre-discuss, and also indicate that you have a
complete crashdump. We'll ask you for more details if we need anything
else.

>> Bug 11742, which was referenced, is just related to checksum error
>> problems.
>> How do you do your heavy I/O? Just regular writes, or mmap writes,
>> or something else?
>
> I'm using fsstress for the I/O tests. I'll send it to you.

Ok, thanks.

Bye,
    Oleg