Derek Hazell
2008-Sep-24 09:55 UTC
[Ocfs2-users] server crash : Assertion failure in do_get_write_access (kernel 2.6.9-42.0.2.ELs
Hi OCFS2 forum,

A few things:

(i) thanks for your support of OCFS2 on this forum
(ii) the advice I received August 24 to run elevator=deadline io scheduling seems to have helped - there have been no unexpected reboots since then
(iii) we did however have a crash last night on the same RHEL AS4 server (running ocfs2 1.2.9-1) - the crash may be unrelated to ocfs2 but I thought I'd run it past you anyway - here is a copy of a post I made to a linux forum:

Last night one of our Linux servers (running RHEL AS4, kernel 2.6.9-42.0.2.ELsmp) crashed. The server is part of a four node ocfs2 1.2.9-1 cluster. After the crash I believe the server needed to be manually restarted.

I have cut the following out of the /var/log/messages event log:

Sep 23 19:15:33 ImageInt1 sshd(pam_unix)[10011]: session opened for user root by root(uid=0)
Sep 23 22:31:04 ImageInt1 kernel: Assertion failure in do_get_write_access() at fs/jbd/transaction.c:693: "handle->h_buffer_credits > 0"
Sep 23 22:31:04 ImageInt1 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Sep 23 22:31:06 ImageInt1 kernel: Kernel BUG at transaction:693
Sep 23 22:31:06 ImageInt1 kernel: invalid operand: 0000 [1] SMP
Sep 23 22:31:06 ImageInt1 kernel: CPU 1
Sep 23 22:49:51 ImageInt1 syslogd 1.4.1: restart.

I searched the internet for the assertion failure and found one report saying it is a bug in the code, but there was no fix mentioned.

As always, any help is appreciated

regards
Derek

####################################################

[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Derek Hazell derek.hazell at gmail.com
Sun Aug 24 04:08:01 PDT 2008

Hi Sunil,

I checked the grub.conf file on the machine that reboots and there is no (deadline) reference to the io scheduler. I will check when back at work on Monday, but I suspect that we are just using the default io scheduler, which would be cfq.

Just to briefly elaborate, our ocfs2 cluster consists of three nodes: one node (or its backup) mounts the ocfs2 filesystem read/write, while two other nodes mount it read only. It is always the read/write node that automatically reboots (fences, as we know now), though sometimes, but not always, the other systems also need to be rebooted to get the system working properly. The problem could be load-related but it is difficult to be sure.

I will discuss with my colleagues whether to try the deadline option and/or set up a private network for the ocfs2 members. The deadline option is very easy to try (involving a small change to the grub.conf, and a reboot), while setting up the private network is a little bit more work but not hard.

rgds
Derek

2008/8/24 Sunil Mushran <sunil.mushran at oracle.com>
> Which io scheduler are you using? On el4, it is best to use deadline.
> cfq is the default. Check the faq for details on using deadline.

--
best wishes
Derek

Psalm 71:14 "But as for me, I will always have hope; I will praise you more and more". (NIV)
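For readers following along: the elevator=deadline setting discussed above is a kernel boot parameter, so one way to confirm it took effect after the grub.conf change and reboot is to look at the running kernel's command line. The Python sketch below is only an illustration and was not part of the thread; the function name elevator_from_cmdline is hypothetical, and it assumes /proc/cmdline is readable on the node.

```python
from typing import Optional


def elevator_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the value of the elevator= boot parameter, or None if absent.

    Hypothetical helper for illustration; it only splits the command line
    on whitespace and looks for a token of the form elevator=<name>.
    """
    for token in cmdline.split():
        if token.startswith("elevator="):
            return token.split("=", 1)[1]
    return None


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the parameters the running kernel was booted with.
        with open("/proc/cmdline") as f:
            scheduler = elevator_from_cmdline(f.read())
    except OSError:
        scheduler = None
    if scheduler is None:
        print("no elevator= parameter found (kernel default scheduler in use)")
    else:
        print("booted with elevator=" + scheduler)
```

If no elevator= token is present, the kernel is using its default scheduler, which the thread says is cfq on el4.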
Sunil Mushran
2008-Sep-24 17:50 UTC
[Ocfs2-users] server crash : Assertion failure in do_get_write_access (kernel 2.6.9-42.0.2.ELs
Do you have a netconsole server set up? If not, it is recommended that you set one up, because it captures the full oops logs. For example, if we had the full oops log, we would know not only the component (ext3 or ocfs2) that triggered this but also the potential fix.

The server did not restart automatically because you have not set /proc/sys/kernel/panic to a number > 0. You will find more in the ocfs2 faq. Or you could go through the section on kernel configuration in the ocfs2 1.4 user's guide.

Derek Hazell wrote:
> Hi OCFS2 forum
> A few things:
> (i) thanks for your support of OCFS2 on this forum
> (ii) the advice I received August 24 to run elevator=deadline io
> scheduling seems to have helped - there have been no unexpected
> reboots since then
> (iii) we did however have a crash last night on the same RHEL AS4
> server (running ocfs2 1.2.9-1) -the crash may be unrelated to ocfs2
> but I thought I'd run it past you anyway - here is a copy of a post I
> made to a linux forum:
>
> Last night one of our Linux servers (running RHEL AS4, kernel
> 2.6.9-42.0.2.ELsmp) crashed. The server is part of a four node ocfs2
> 1.2.9-1 cluster. After the crash I believe the server needed to be
> manually restarted.
>
> I have cut the following out of /var/log/messages event log:
> Sep 23 19:15:33 ImageInt1 sshd(pam_unix)[10011]: session opened for
> user root by root(uid=0)
> Sep 23 22:31:04 ImageInt1 kernel: Assertion failure in
> do_get_write_access() at fs/jbd/transaction.c:693:
> "handle->h_buffer_credits > 0"
> Sep 23 22:31:04 ImageInt1 kernel: ----------- [cut here ] ---------
> [please bite here ] ---------
> Sep 23 22:31:06 ImageInt1 kernel: Kernel BUG at transaction:693
> Sep 23 22:31:06 ImageInt1 kernel: invalid operand: 0000 [1] SMP
> Sep 23 22:31:06 ImageInt1 kernel: CPU 1
> Sep 23 22:49:51 ImageInt1 syslogd 1.4.1: restart.
>
> I googled on internet for the assertion failure and found one report
> saying it is a bug in the code, but there was no fix mentioned.
>
> As always, any help is appreciated
>
> regards
> Derek
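As an illustration of the /proc/sys/kernel/panic point in the reply above, the following Python sketch reads the current value and reports whether the node would reboot by itself after a panic. It is not from the thread: the function name panic_reboots_automatically is hypothetical, and it applies only the "number > 0" rule stated in the reply, so any other value is treated as "no automatic reboot".

```python
def panic_reboots_automatically(raw: str) -> bool:
    """Interpret the contents of /proc/sys/kernel/panic.

    Hypothetical helper for illustration. A value greater than 0 is the
    number of seconds the kernel waits before rebooting after a panic;
    0 means the machine stays down until someone restarts it manually.
    """
    return int(raw.strip()) > 0


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/panic") as f:
            value = f.read()
        if panic_reboots_automatically(value):
            print("node reboots " + value.strip() + " seconds after a panic")
        else:
            print("node waits for a manual restart after a panic")
    except (OSError, ValueError):
        print("could not read /proc/sys/kernel/panic")
```

The value to use is a site decision; the ocfs2 faq and the kernel configuration section of the user's guide mentioned in the reply are the places to check for a recommended setting.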