Not sure if it would help, but try kernel 2.4.21.

On 6/10/06, ocfs2-users-request at oss.oracle.com
<ocfs2-users-request at oss.oracle.com> wrote:
>
> Send Ocfs2-users mailing list submissions to
>         ocfs2-users at oss.oracle.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://oss.oracle.com/mailman/listinfo/ocfs2-users
> or, via email, send a message with subject or body 'help' to
>         ocfs2-users-request at oss.oracle.com
>
> You can reach the person managing the list at
>         ocfs2-users-owner at oss.oracle.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Ocfs2-users digest..."
>
>
> Today's Topics:
>
>    1. RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)
>    2. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)
>    3. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)
>    4. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 09 Jun 2006 13:38:58 -0400
> From: Brian Long <brilong at cisco.com>
> Subject: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: ocfs2-users at oss.oracle.com
> Message-ID: <1149874738.4142.17.camel at brilong-lnx>
> Content-Type: text/plain
>
> Hello,
>
> I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
> 1.2.1 RPMs. About once a week, one of the nodes crashes itself
> (self-fencing) and I get a full vmcore on my netdump server. The
> netdump log file shows the shared filesystem LUN (/dev/dm-6) did not
> respond within 12000ms. I have not changed the default heartbeat
> values in /etc/sysconfig/o2cb. There was no other IO ongoing when
> this happened, but they are HP Proliant servers running the Insight
> Manager agents.
>
> Why would the heartbeat fail roughly once a week? Should I open a
> bugzilla and upload my netdump log file?
>
> Thanks.
>
> /Brian/
> --
> Brian Long                  |       |        |
> IT Data Center Systems      |      .|||.    .|||.
> Cisco Linux Developer       |   ..:|||||||:...:|||||||:..
> Phone: (919) 392-7363       |   C i s c o  S y s t e m s
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 09 Jun 2006 10:49:48 -0700
> From: Sunil Mushran <Sunil.Mushran at oracle.com>
> Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: Brian Long <brilong at cisco.com>
> Cc: ocfs2-users at oss.oracle.com
> Message-ID: <4489B4BC.50309 at oracle.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> The hb failure is just the effect of the ios not completing within
> 12 secs. The full oops trace gives the last 24 ops and their timings.
>
> One solution is to double the hb timeout. Set:
> O2CB_HEARTBEAT_THRESHOLD = 14
>
> Brian Long wrote:
> > Hello,
> >
> > I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
> > 1.2.1 RPMs. About once a week, one of the nodes crashes itself
> > (self-fencing) and I get a full vmcore on my netdump server. The
> > netdump log file shows the shared filesystem LUN (/dev/dm-6) did not
> > respond within 12000ms. I have not changed the default heartbeat
> > values in /etc/sysconfig/o2cb. There was no other IO ongoing when
> > this happened, but they are HP Proliant servers running the Insight
> > Manager agents.
> >
> > Why would the heartbeat fail roughly once a week? Should I open a
> > bugzilla and upload my netdump log file?
> >
> > Thanks.
> >
> > /Brian/
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 09 Jun 2006 15:30:05 -0400
> From: Brian Long <brilong at cisco.com>
> Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: Sunil Mushran <Sunil.Mushran at oracle.com>
> Cc: ocfs2-users at oss.oracle.com
> Message-ID: <1149881406.4142.27.camel at brilong-lnx>
> Content-Type: text/plain
>
> Understood, but how do I determine why once a week I'm failing the
> 12-second heartbeat? Before I bump the HB, shouldn't I figure out
> why dm-6 is gone for 12 seconds?
> The last 24 ops are as follows:
>
> (7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
> dm-6 after 12000 milliseconds
> Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
> Heartbeat thread stuck at waiting for read completion, stuffing
> current time into that blocker (index 3)
> Index 4: took 0 ms to do submit_bio for read
> Index 5: took 0 ms to do waiting for read completion
> Index 6: took 0 ms to do bio alloc write
> Index 7: took 0 ms to do bio add page write
> Index 8: took 0 ms to do submit_bio for write
> Index 9: took 0 ms to do checking slots
> Index 10: took 0 ms to do waiting for write completion
> Index 11: took 1998 ms to do msleep
> Index 12: took 0 ms to do allocating bios for read
> Index 13: took 0 ms to do bio alloc read
> Index 14: took 0 ms to do bio add page read
> Index 15: took 0 ms to do submit_bio for read
> Index 16: took 0 ms to do waiting for read completion
> Index 17: took 0 ms to do bio alloc write
> Index 18: took 0 ms to do bio add page write
> Index 19: took 0 ms to do submit_bio for write
> Index 20: took 0 ms to do checking slots
> Index 21: took 0 ms to do waiting for write completion
> Index 22: took 1999 ms to do msleep
> Index 23: took 0 ms to do allocating bios for read
> Index 0: took 0 ms to do bio alloc read
> Index 1: took 0 ms to do bio add page read
> Index 2: took 0 ms to do submit_bio for read
> Index 3: took 9998 ms to do waiting for read completion
> (7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all
> active regions.
> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
> system by panicing
>
> /Brian/
>
> On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
> > The hb failure is just the effect of the ios not completing within
> > 12 secs. The full oops trace gives the last 24 ops and their
> > timings.
> >
> > One solution is to double the hb timeout. Set:
> > O2CB_HEARTBEAT_THRESHOLD = 14
> >
> > Brian Long wrote:
> > > Hello,
> > >
> > > I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the
> > > OCFS2 1.2.1 RPMs. About once a week, one of the nodes crashes
> > > itself (self-fencing) and I get a full vmcore on my netdump
> > > server. The netdump log file shows the shared filesystem LUN
> > > (/dev/dm-6) did not respond within 12000ms. I have not changed
> > > the default heartbeat values in /etc/sysconfig/o2cb. There was no
> > > other IO ongoing when this happened, but they are HP Proliant
> > > servers running the Insight Manager agents.
> > >
> > > Why would the heartbeat fail roughly once a week? Should I open a
> > > bugzilla and upload my netdump log file?
> > >
> > > Thanks.
> > >
> > > /Brian/
>
> --
> Brian Long                  |       |        |
> IT Data Center Systems      |      .|||.    .|||.
> Cisco Linux Developer       |   ..:|||||||:...:|||||||:..
> Phone: (919) 392-7363       |   C i s c o  S y s t e m s
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 09 Jun 2006 13:00:48 -0700
> From: Sunil Mushran <Sunil.Mushran at oracle.com>
> Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: Brian Long <brilong at cisco.com>
> Cc: ocfs2-users at oss.oracle.com
> Message-ID: <4489D370.4050103 at oracle.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> This dump is very much like the one we used to see with the cfq io
> scheduler: the very last io op would consume all the time. I am
> assuming that you are running with the deadline io scheduler.
>
> Are there any other common factors in all the crashes? For example,
> does it always happen on one node, or around the same time? How do
> you know there is no other io happening at that time? What about
> cron jobs?
>
> Also, is the shared disk connected to some other nodes which could be
> the cause of the io spike?
>
> Brian Long wrote:
> > Understood, but how do I determine why once a week I'm failing the
> > 12-second heartbeat?
> > Before I bump the HB, shouldn't I figure out why dm-6 is gone for
> > 12 seconds? The last 24 ops are as follows:
> >
> > (7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to
> > device dm-6 after 12000 milliseconds
> > Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
> > Heartbeat thread stuck at waiting for read completion, stuffing
> > current time into that blocker (index 3)
> > Index 4: took 0 ms to do submit_bio for read
> > Index 5: took 0 ms to do waiting for read completion
> > Index 6: took 0 ms to do bio alloc write
> > Index 7: took 0 ms to do bio add page write
> > Index 8: took 0 ms to do submit_bio for write
> > Index 9: took 0 ms to do checking slots
> > Index 10: took 0 ms to do waiting for write completion
> > Index 11: took 1998 ms to do msleep
> > Index 12: took 0 ms to do allocating bios for read
> > Index 13: took 0 ms to do bio alloc read
> > Index 14: took 0 ms to do bio add page read
> > Index 15: took 0 ms to do submit_bio for read
> > Index 16: took 0 ms to do waiting for read completion
> > Index 17: took 0 ms to do bio alloc write
> > Index 18: took 0 ms to do bio add page write
> > Index 19: took 0 ms to do submit_bio for write
> > Index 20: took 0 ms to do checking slots
> > Index 21: took 0 ms to do waiting for write completion
> > Index 22: took 1999 ms to do msleep
> > Index 23: took 0 ms to do allocating bios for read
> > Index 0: took 0 ms to do bio alloc read
> > Index 1: took 0 ms to do bio add page read
> > Index 2: took 0 ms to do submit_bio for read
> > Index 3: took 9998 ms to do waiting for read completion
> > (7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all
> > active regions.
> > Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
> > system by panicing
> >
> > /Brian/
> >
> > On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
> >
> >> The hb failure is just the effect of the ios not completing within
> >> 12 secs. The full oops trace gives the last 24 ops and their
> >> timings.
> >>
> >> One solution is to double the hb timeout. Set:
> >> O2CB_HEARTBEAT_THRESHOLD = 14
> >>
> >> Brian Long wrote:
> >>
> >>> Hello,
> >>>
> >>> I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the
> >>> OCFS2 1.2.1 RPMs. About once a week, one of the nodes crashes
> >>> itself (self-fencing) and I get a full vmcore on my netdump
> >>> server. The netdump log file shows the shared filesystem LUN
> >>> (/dev/dm-6) did not respond within 12000ms. I have not changed
> >>> the default heartbeat values in /etc/sysconfig/o2cb. There was
> >>> no other IO ongoing when this happened, but they are HP Proliant
> >>> servers running the Insight Manager agents.
> >>>
> >>> Why would the heartbeat fail roughly once a week? Should I open a
> >>> bugzilla and upload my netdump log file?
> >>>
> >>> Thanks.
> >>>
> >>> /Brian/
>
>
> ------------------------------
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
>
> End of Ocfs2-users Digest, Vol 30, Issue 7
> ******************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://oss.oracle.com/pipermail/ocfs2-users/attachments/20060611/056f2b17/attachment.html
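For anyone reading this thread in the archives: the relationship between O2CB_HEARTBEAT_THRESHOLD and the timeout in the panic message can be sketched as below. This is only a sketch inferred from the numbers in the trace (the heartbeat thread loops roughly every 2 seconds, per the ~2000 ms msleep entries, and the default threshold of 7 matches the 12000 ms in the log); the `hb_timeout_ms` helper is hypothetical and not part of the o2cb tools.

```shell
#!/bin/sh
# Hypothetical helper: effective disk-heartbeat timeout for a given
# O2CB_HEARTBEAT_THRESHOLD. A node self-fences after (threshold - 1)
# heartbeat iterations of ~2 seconds each, so:
#   timeout_ms = (threshold - 1) * 2000
hb_timeout_ms() {
    echo $(( ($1 - 1) * 2000 ))
}

hb_timeout_ms 7    # default threshold -> 12000 ms, matching the panic message
hb_timeout_ms 14   # Sunil's suggested value -> 26000 ms
```

To apply the change, the variable is set in /etc/sysconfig/o2cb (e.g. `O2CB_HEARTBEAT_THRESHOLD=14`), followed by an o2cb restart on every node; the threshold should be identical cluster-wide.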