Derek Hazell
2008-Aug-18 03:59 UTC
[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Dear OCFS2 forum,

We run ocfs2 version 1.2.9-1 as an ocfs2 cluster on four Linux servers running RHEL 4 (kernel: 2.6.9-42.0.2.ELs).

We are getting unexpected reboots of one of the Linux servers and are wondering whether the reboots are related to ocfs2. We enabled ocfs2 tracing on the node we suspected would reboot

# debugfs.ocfs2 -l SUPER allow
# debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

and then waited for the reboot to occur. A sample of log messages around the time of the reboot is included below. There are no strange ocfs2 messages in the /var/log/messages log file, but I thought I would check with your forum in case you see anything strange.

Can you confirm that ocfs2 version 1.2.9-1 is compatible with the Linux kernel 2.6.9-42.0.2.ELs? Also, if ocfs2 fences a node, can you confirm that a message is written to the /var/log/messages logfile noting that such fencing has occurred? Your responses may help us narrow down the cause.

Can you let us know if there are any particular logfiles we should check, or if there is anything we can do to confirm that ocfs2 is, or is not, the cause of these reboots?
Appreciate any responses

regards
Derek Hazell | System Administrator

#####################################################################
APPENDIX 1 : REBOOT on Friday night (ocfs2 tracing running)

Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5b1914dc72d356
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5b1914dc72d356
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M00000000000000049c73bf5e1d8e29
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M00000000000000049c73bf5e1d8e29
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M00000000000000049c73bf5e1d8e29
[UNEXPECTED REBOOT]
Aug 15 21:05:09 Sysname syslogd 1.4.1: restart.
Aug 15 21:05:09 Sysname syslog: syslogd startup succeeded
Aug 15 21:05:09 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 15 21:05:09 Sysname kernel: Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
Aug 15 21:05:09 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile at ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 17:57:31 EDT 2006
Aug 15 21:05:09 Sysname kernel: BIOS-provided physical RAM map:

######################################################################
APPENDIX 2 : REBOOT on Saturday night (ocfs2 tracing NOT running)

Aug 15 21:08:12 Sysname kernel: o2net: connected to node Othersystem2.x.y (num 1) at 172.16.172.172:7777
Aug 15 21:08:13 Sysname kernel: o2net: accepted connection from node Othersystem1.x.y (num 3) at 172.16.172.171:7777
Aug 15 21:08:16 Sysname kernel: OCFS2 1.2.9 Mon May 19 13:00:33 PDT 2008 (build a693806cb619dd7f225004092b675ede)
Aug 15 21:08:16 Sysname kernel: ocfs2_dlm: Nodes in domain ("46C5D4A751514E55B04786DFEC7B2175"): 1 2 3
Aug 15 21:08:17 Sysname kernel: kjournald starting. Commit interval 5 seconds
Aug 15 21:08:17 Sysname kernel: ocfs2: Mounting device (120,1) on (node 2, slot 2)
Aug 15 21:08:21 Sysname kernel: ocfs2_dlm: Nodes in domain ("0D29B3C9792B46E1BD0DFF0A97E03534"): 1 2 3
Aug 15 21:08:21 Sysname kernel: kjournald starting. Commit interval 5 seconds
Aug 15 21:08:21 Sysname kernel: ocfs2: Mounting device (120,17) on (node 2, slot 2)
Aug 15 21:08:31 Sysname ntpd[7076]: synchronized to 172.16.32.254, stratum 2
Aug 15 21:08:31 Sysname ntpd[7076]: kernel time sync disabled 0041
Aug 15 21:08:38 Sysname su(pam_unix)[9656]: session opened for user digicol by root(uid=0)
Aug 15 21:08:41 Sysname su(pam_unix)[9656]: session closed for user digicol
Aug 15 21:13:52 Sysname ntpd[7076]: kernel time sync enabled 0001
Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector 1291272320
Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector 1487646848
Aug 15 21:41:47 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:41:47 Sysname kernel: end_request: I/O error, dev sdc, sector 1301852288
Aug 15 21:41:48 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:41:48 Sysname kernel: end_request: I/O error, dev sdc, sector 1498484864
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1611251840
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1045610624
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1234243712
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 989614208
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1115283584
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1240952960
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 995807360
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 1104961664
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 1008507952
Aug 16 03:00:26 Sysname Server Administrator: Storage Service EventID: 2242 The Patrol Read has started.: Controller 0 (PERC 5/i Integrated)
Aug 16 03:00:27 Sysname snmpd[7589]: Got trap from peer on fd 13
Aug 16 03:52:02 Sysname Server Administrator: Storage Service EventID: 2243 The Patrol Read has stopped.: Controller 0 (PERC 5/i Integrated)
Aug 16 03:52:02 Sysname snmpd[7589]: Got trap from peer on fd 13
Aug 16 16:38:33 Sysname sshd(pam_unix)[31901]: session opened for user root by root(uid=0)
Aug 16 16:55:55 Sysname sshd(pam_unix)[32254]: session opened for user root by root(uid=0)
Aug 16 17:27:06 Sysname sshd(pam_unix)[966]: session opened for user root by root(uid=0)
[UNEXPECTED REBOOT]
Aug 16 23:18:31 Sysname syslogd 1.4.1: restart.
Aug 16 23:18:31 Sysname syslog: syslogd startup succeeded
Aug 16 23:18:31 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 16 23:18:31 Sysname kernel: Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
Aug 16 23:18:31 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile at ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 17:57:31 EDT 2006
Aug 16 23:18:31 Sysname kernel: BIOS-provided physical RAM map:
#####################################################################
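The Appendix 2 log contains a burst of "end_request: I/O error, dev sdc" lines well before the second reboot. As a rough way to see how many such errors each block device has logged, a small shell helper along these lines could be run against the syslog file. This is only a sketch: the function name count_io_errors is made up for illustration, and it assumes one syslog message per line in the format shown in the appendix.

```shell
# Count "end_request: I/O error" messages per block device in a syslog file.
# count_io_errors is a hypothetical helper name, not part of any ocfs2 tool.
# Assumes one message per line, in the format shown in Appendix 2.
count_io_errors() {
    awk '/end_request: I\/O error, dev / {
             for (i = 1; i <= NF; i++)
                 if ($i == "dev") {
                     dev = $(i + 1)        # e.g. "sdc,"
                     sub(/,$/, "", dev)    # strip the trailing comma
                     n[dev]++
                 }
         }
         END { for (d in n) print d, n[d] }' "$1"
}

# Example use (path is the one mentioned in the post):
#   count_io_errors /var/log/messages
```

A non-zero count does not by itself show that ocfs2 fenced the node; it only indicates that the storage path reported errors during that period.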
Sunil Mushran
2008-Aug-18 17:55 UTC
[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Configure a netdump or netconsole server. It will catch the relevant messages.
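For readers unfamiliar with netconsole, the sketch below shows how the module parameter is typically put together. It follows the parameter format described in the mainline kernel's netconsole documentation (source port, source IP and interface, then target port, target IP and target MAC); RHEL 4 ships its own netdump client tooling, which may be configured differently, so check the distribution documentation first. The helper name netconsole_param and every address in the example are placeholders.

```shell
# Build a netconsole= module parameter in the mainline kernel format:
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# netconsole_param is a hypothetical helper; 6665 and 6666 are the default
# source and target ports named in the kernel's netconsole documentation.
netconsole_param() {
    # $1 = local IP, $2 = local interface, $3 = receiver IP, $4 = receiver MAC
    printf '%s@%s/%s,%s@%s/%s\n' 6665 "$1" "$2" 6666 "$3" "$4"
}

# Example with placeholder addresses (run as root on the node that reboots):
#   modprobe netconsole \
#       netconsole="$(netconsole_param 192.168.0.10 eth0 192.168.0.20 00:11:22:33:44:55)"
```

The receiving machine then needs something listening for UDP on the target port, for example a syslog daemon or a netcat listener, so that the last kernel messages before a reboot are captured off the failing node.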
Derek Hazell
2008-Aug-23 06:24 UTC
[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Hi Ocfs2 users,

We captured some relevant log messages via a serial console and via a PuTTY session logged in as root. I suspect we need to set up a private network between the ocfs2 cluster members. Is this right? Is there anything else we might need to do?

regards, I appreciate your help
Derek

########################################################
CURRENT O2CB CONFIG

[root at sysname fs]# /etc/init.d/o2cb configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot. The current values will be shown in brackets ('[]'). Hitting
<ENTER> without typing an answer will keep that current value. Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [y]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [61]:
Specify network idle timeout in ms (>=5000) [60000]: 120000
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
O2CB cluster ocfs2 already online
[root at sysname fs]#

##################
TRACE OF ROOT PUTTY LOGIN

[root at sysname ~]#
Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
sysname kernel: Heartbeat thread (11) printing last 24 blocking operations (cur = 8):

Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
sysname kernel: Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 8)

Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
sysname kernel: Index 9: took 0 ms to do bio alloc read
. . .
Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 3: took 5240 ms to do waiting for write completion

Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 4: took 0 ms to do allocating bios for read

Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 5: took 0 ms to do bio alloc read

Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 6: took 0 ms to do bio add page read

Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 7: took 0 ms to do submit_bio for read

Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 8: took 120303 ms to do waiting for read completion

#############
TRACE OF SERIAL CONSOLE:

(11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device emcpowerb1 after 120000 milliseconds
Heartbeat thread (11) printing last 24 blocking operations (cur = 8):
Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 8)
Index 9: took 0 ms to do bio alloc read
Index 10: took 0 ms to do bio add page read
Index 11: took 0 ms to do submit_bio for read
Index 12: took 3025 ms to do waiting for read completion
Index 13: took 0 ms to do bio alloc write
Index 14: took 0 ms to do bio add page write
Index 15: took 0 ms to do submit_bio for write
Index 16: took 0 ms to do checking slots
Index 17: took 7221 ms to do waiting for write completion
Index 18: took 0 ms to do allocating bios for read
Index 19: took 0 ms to do bio alloc read
Index 20: took 0 ms to do bio add page read
Index 21: took 0 ms to do submit_bio for read
Index 22: took 3892 ms to do waiting for read completion
Index 23: took 0 ms to do bio alloc write
Index 0: took 0 ms to do bio add page write
Index 1: took 0 ms to do submit_bio for write
Index 2: took 0 ms to do checking slots
Index 3: took 5240 ms to do waiting for write completion
Index 4: took 0 ms to do allocating bios for read
Index 5: took 0 ms to do bio alloc read
Index 6: took 0 ms to do bio add page read
Index 7: took 0 ms to do submit_bio for read
Index 8: took 120303 ms to do waiting for read completion
*** ocfs2 is very sorry to be fencing this system by restarting ***
Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1 console=ttyS0,9600n8)
################################################################################
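Two numbers in the traces above can be cross-checked with a little shell. First, the OCFS2 documentation gives the disk heartbeat timeout as (dead threshold - 1) * 2 seconds, so the configured threshold of 61 corresponds to the 120000 ms reported by o2hb_write_timeout. Second, the "Index N: took X ms to do ..." lines can be scanned for the slowest step. The sketch below does both; the function names hb_timeout_ms and slowest_hb_op are invented for illustration, and the parser assumes one "Index" message per line.

```shell
# Disk heartbeat timeout in ms implied by the o2cb heartbeat dead threshold,
# using the (threshold - 1) * 2 seconds rule from the OCFS2 documentation.
# hb_timeout_ms is a hypothetical helper name.
hb_timeout_ms() {
    echo $(( ($1 - 1) * 2 * 1000 ))
}

# Read "Index N: took X ms to do <operation>" lines on stdin and print the
# slowest one as "<X> ms: <operation>". Prints nothing if every entry is 0 ms.
# slowest_hb_op is a hypothetical helper name; assumes one message per line.
slowest_hb_op() {
    awk '/Index [0-9]+: took [0-9]+ ms to do / {
             for (i = 1; i <= NF; i++)
                 if ($i == "took") {
                     ms = $(i + 1) + 0
                     if (ms > max) {
                         max = ms
                         op = ""
                         # the operation name starts after "took X ms to do"
                         for (j = i + 5; j <= NF; j++)
                             op = op (op == "" ? "" : " ") $j
                     }
                 }
         }
         END { if (op != "") print max " ms: " op }'
}

# Example use:
#   hb_timeout_ms 61
#   slowest_hb_op < serial-console.log    # file name is a placeholder
```

In the serial console trace the slowest entry is a disk read on the heartbeat device (emcpowerb1) rather than a network operation, which suggests the storage path is worth examining alongside any network changes.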