thr3ads.net - Ocfs2 users - [Ocfs2-users] server crash : Assertion failure in do_get_write_access (kernel 2.6.9-42.0.2.ELs [Sep 2008]


Derek Hazell

2008-Sep-24 09:55 UTC

[Ocfs2-users] server crash : Assertion failure in do_get_write_access (kernel 2.6.9-42.0.2.ELs

Hi OCFS2 forum
A few things:
(i) Thanks for your support of OCFS2 on this forum.
(ii) The advice I received on August 24 to run with elevator=deadline I/O
scheduling seems to have helped: there have been no unexpected reboots since then.
(iii) We did, however, have a crash last night on the same RHEL AS4 server
(running ocfs2 1.2.9-1). The crash may be unrelated to ocfs2, but I thought
I'd run it past you anyway. Here is a copy of a post I made to a Linux
forum:

Last night one of our Linux servers (running RHEL AS4, kernel
2.6.9-42.0.2.ELsmp) crashed. The server is part of a four-node ocfs2 1.2.9-1
cluster. After the crash, I believe the server had to be restarted
manually.

I have cut the following out of /var/log/messages event log:
Sep 23 19:15:33 ImageInt1 sshd(pam_unix)[10011]: session opened for user root by root(uid=0)
Sep 23 22:31:04 ImageInt1 kernel: Assertion failure in do_get_write_access() at fs/jbd/transaction.c:693: "handle->h_buffer_credits > 0"
Sep 23 22:31:04 ImageInt1 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Sep 23 22:31:06 ImageInt1 kernel: Kernel BUG at transaction:693
Sep 23 22:31:06 ImageInt1 kernel: invalid operand: 0000 [1] SMP
Sep 23 22:31:06 ImageInt1 kernel: CPU 1
Sep 23 22:49:51 ImageInt1 syslogd 1.4.1: restart.

I searched the internet for the assertion failure and found one report saying
it is a bug in the code, but no fix was mentioned.
As always, any help is appreciated

regards
Derek
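
For readers triaging a similar crash, the assertion line in the log excerpt above can be pulled apart mechanically. The following is only a minimal sketch in Python: the names `ASSERTION_RE` and `parse_assertion` are made up for illustration, and it assumes the wrapped log line has already been rejoined into a single line. The asserted expression, `handle->h_buffer_credits > 0`, indicates that the journal handle had no buffer credits left when write access was requested.

```python
import re

# Pattern for the JBD assertion line as it appears in the excerpt above
# (after rejoining the line that the mail client wrapped).
ASSERTION_RE = re.compile(
    r'Assertion failure in (?P<function>\w+)\(\) '
    r'at (?P<file>[^:\s]+):(?P<line>\d+): "(?P<expr>[^"]*)"'
)


def parse_assertion(log_line):
    """Return the assertion details as a dict, or None if the line does not match."""
    match = ASSERTION_RE.search(log_line)
    if match is None:
        return None
    details = match.groupdict()
    details["line"] = int(details["line"])
    return details


# The assertion line from the post, rejoined into one line.
sample = (
    'Sep 23 22:31:04 ImageInt1 kernel: Assertion failure in '
    'do_get_write_access() at fs/jbd/transaction.c:693: '
    '"handle->h_buffer_credits > 0"'
)
print(parse_assertion(sample))
```

This only identifies where the assertion fired (function, source file, line, expression). It does not show which filesystem triggered it; that needs the full oops trace.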

####################################################

[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Derek Hazell derek.hazell at gmail.com
Sun Aug 24 04:08:01 PDT 2008

Hi Sunil,
I checked the grub.conf file on the machine that reboots and there is no
reference to the deadline I/O scheduler. I will check when I am back at work on
Monday, but I suspect that we are just using the default I/O scheduler, which
would be cfq.

Just to elaborate briefly, our ocfs2 cluster consists of three nodes: one
node (or its backup) mounts the ocfs2 filesystem read/write, while the two other
nodes mount it read-only. It is always the read/write node that
automatically reboots (fences, as we now know), though sometimes, but not
always, the other systems also need to be rebooted to get everything working
properly. The problem could be load-related, but it is difficult to be sure.

I will discuss with my colleagues whether to try the deadline option
and/or set up a private network for the ocfs2 members. The deadline option
is very easy to try (it involves a small change to grub.conf and a
reboot), while setting up the private network is a little more work but
not hard.
rgds
Derek
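
After the grub.conf change and reboot described above, one way to confirm that the option took effect is to look for `elevator=` on the running kernel's command line, which Linux exposes in /proc/cmdline. This is a rough sketch in Python, not something taken from the ocfs2 FAQ. The helper name `elevator_from_cmdline` and the sample command line (root device and other options) are invented for illustration.

```python
def elevator_from_cmdline(cmdline):
    """Return the value of the elevator= boot parameter, or None if it is absent."""
    for token in cmdline.split():
        if token.startswith("elevator="):
            return token.split("=", 1)[1]
    return None


# Hypothetical kernel command line; on a live system the string would
# come from reading /proc/cmdline instead.
sample_cmdline = "ro root=/dev/VolGroup00/LogVol00 rhgb quiet elevator=deadline"
print(elevator_from_cmdline(sample_cmdline))
```

If the function returns None, no scheduler was requested at boot and the kernel's default applies (cfq on el4, according to the reply quoted below).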

2008/8/24 Sunil Mushran <sunil.mushran at oracle.com>
> Which io scheduler are you using? On el4, it is best to use deadline.
> cfq is the default. Check the faq for details on using deadline.
>
> Derek Hazell wrote:

Sunil Mushran

2008-Sep-24 17:50 UTC

Do you have a netconsole server set up? If not, we recommend setting
one up because it captures the full oops log. For example,
if we had the full oops log, we would know not only the component
(ext3 or ocfs2) that triggered this but also the potential fix.
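
On the receiving side, a netconsole server is essentially a process that listens for UDP datagrams and records them, since netconsole sends kernel messages as plain UDP packets. The following is a minimal sketch of such a listener in Python. The function names `open_listener` and `receive_netconsole` are illustrative, and port 6666 is assumed here because it is the commonly used netconsole target port; the sending node has to be configured to match.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Open a UDP socket bound to host:port (6666 is an assumed default)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_netconsole(sock, max_packets):
    """Collect up to max_packets datagrams from sock as (sender_ip, text) pairs."""
    messages = []
    for _ in range(max_packets):
        data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        # Kernel messages are expected to be text; replace undecodable bytes.
        messages.append((sender_ip, data.decode("utf-8", errors="replace")))
    return messages
```

A real collector would call `receive_netconsole` in a loop and append each message, tagged with the sender address, to a log file, so that the oops text survives even when the crashing node cannot write to its own disk.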

The server did not restart automatically because you have not set
/proc/sys/kernel/panic to a number > 0. You will find more in the ocfs2 faq.
Or you could go through the section on kernel configuration in the ocfs2 1.4
user's guide.
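
The /proc/sys/kernel/panic setting mentioned above can be checked with a few lines of Python. This is only a sketch: `auto_reboot_enabled` is a made-up helper name, and it follows the rule as stated in this reply (a value greater than 0 means the node reboots that many seconds after a panic). Other values are treated here as "no automatic reboot", which is an assumption of the sketch.

```python
def auto_reboot_enabled(panic_value_text):
    """Interpret the contents of /proc/sys/kernel/panic.

    Returns True when the value is a number > 0, i.e. the kernel is set
    to reboot that many seconds after a panic.
    """
    return int(panic_value_text.strip()) > 0


# Sample file contents; on a live system the text would be read from
# /proc/sys/kernel/panic.
print(auto_reboot_enabled("0\n"))   # value 0: stays down until restarted by hand
print(auto_reboot_enabled("30\n"))  # value 30: reboots 30 seconds after a panic
```

The value 30 is only an example; the faq and user's guide referred to above are the place to look for the recommended setting.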

