Derek Hazell
2008-Aug-18 03:59 UTC
[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Dear OCFS2 forum
We run ocfs2 version 1.2.9-1 as an ocfs2 cluster on four Linux servers
running RHEL 4 (kernel: 2.6.9-42.0.2.ELs).
We are getting unexpected reboots of one of the Linux servers and are
wondering if the reboots are related to ocfs2 or not.
We enabled tracing of ocfs2 on the node we suspected would reboot
# debugfs.ocfs2 -l SUPER allow
# debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow
and then waited for the reboot to occur. A sample of log messages around the
time of the reboot is included below. There are no strange ocfs2 messages in
the /var/log/messages log file but I thought I would just check with your
forum if you see anything strange.
Can you confirm that ocfs2 version 1.2.9-1 is compatible with the Linux
kernel 2.6.9-42.0.2.ELs? Also, if ocfs2 fences a node, can you confirm
that a message is written to the /var/log/messages logfile noting that
such fencing has occurred? Your responses may help us narrow down the
cause.
Please let us know if there are any particular logfiles we should check, or
if there is anything we can do to confirm that ocfs2 is, or is not, the
cause of these reboots.
Appreciate any responses
regards
Derek Hazell | System Administrator
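One way to check /var/log/messages for an abrupt reset is to look at the last message written before each syslogd restart. The helper below is only a sketch and is not part of ocfs2: the function name last_before_restart is made up for illustration, and it assumes the "syslogd 1.4.1: restart." line format shown in the appendices below. syslogd also logs a restart line when it is signalled at log rotation, so each hit still needs a manual look.

```shell
# Hypothetical helper: print the last line logged before each syslogd restart.
# Ordinary activity right before a restart suggests the node went down abruptly.
last_before_restart() {
    awk '
        /syslogd [0-9.]+: restart\./ { if (prev != "") print prev }
        { prev = $0 }
    ' "$1"
}
```

Usage would be along the lines of: last_before_restart /var/log/messages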
#####################################################################
APPENDIX 1 : REBOOT on Friday night (ocfs2 tracing running)
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M000000000000000c5b1914dc72d356
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M000000000000000c5b1914dc72d356
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M0000000000000009f1bbc95e1dad74
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M000000000000000c5bc95ddc72d357
Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M00000000000000049c73bf5e1d8e29
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148
ENTRY:M00000000000000049c73bf5e1d8e29
Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182
ENTRY:M00000000000000049c73bf5e1d8e29
[UNEXPECTED REBOOT]
Aug 15 21:05:09 Sysname syslogd 1.4.1: restart.
Aug 15 21:05:09 Sysname syslog: syslogd startup succeeded
Aug 15 21:05:09 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg
started.
Aug 15 21:05:09 Sysname kernel: Bootdata ok (command line is ro
root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
Aug 15 21:05:09 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp (
bhcompile at ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat
3.4.6-3)) #1
SMP Thu Aug 17 17:57:31 EDT 2006
Aug 15 21:05:09 Sysname kernel: BIOS-provided physical RAM map:
######################################################################
APPENDIX 2 : REBOOT on Saturday night (ocfs2 tracing NOT running)
Aug 15 21:08:12 Sysname kernel: o2net: connected to node Othersystem2.x.y
(num 1) at 172.16.172.172:7777
Aug 15 21:08:13 Sysname kernel: o2net: accepted connection from node
Othersystem1.x.y (num 3) at 172.16.172.171:7777
Aug 15 21:08:16 Sysname kernel: OCFS2 1.2.9 Mon May 19 13:00:33 PDT 2008
(build a693806cb619dd7f225004092b675ede)
Aug 15 21:08:16 Sysname kernel: ocfs2_dlm: Nodes in domain
("46C5D4A751514E55B04786DFEC7B2175"): 1 2 3
Aug 15 21:08:17 Sysname kernel: kjournald starting. Commit interval 5
seconds
Aug 15 21:08:17 Sysname kernel: ocfs2: Mounting device (120,1) on (node 2,
slot 2)
Aug 15 21:08:21 Sysname kernel: ocfs2_dlm: Nodes in domain
("0D29B3C9792B46E1BD0DFF0A97E03534"): 1 2 3
Aug 15 21:08:21 Sysname kernel: kjournald starting. Commit interval 5
seconds
Aug 15 21:08:21 Sysname kernel: ocfs2: Mounting device (120,17) on (node 2,
slot 2)
Aug 15 21:08:31 Sysname ntpd[7076]: synchronized to 172.16.32.254, stratum
2
Aug 15 21:08:31 Sysname ntpd[7076]: kernel time sync disabled 0041
Aug 15 21:08:38 Sysname su(pam_unix)[9656]: session opened for user digicol
by root(uid=0)
Aug 15 21:08:41 Sysname su(pam_unix)[9656]: session closed for user digicol
Aug 15 21:13:52 Sysname ntpd[7076]: kernel time sync enabled 0001
Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector
1291272320
Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector
1487646848
Aug 15 21:41:47 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:41:47 Sysname kernel: end_request: I/O error, dev sdc, sector
1301852288
Aug 15 21:41:48 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:41:48 Sysname kernel: end_request: I/O error, dev sdc, sector
1498484864
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector
1611251840
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector
1045610624
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector
1234243712
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector
989614208
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector
1115283584
Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector
1240952960
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector
995807360
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector
1104961664
Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code =
0x20000
Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector
1008507952
Aug 16 03:00:26 Sysname Server Administrator: Storage Service EventID:
2242 The Patrol Read has started.: Controller 0 (PERC 5/i Integrated)
Aug 16 03:00:27 Sysname snmpd[7589]: Got trap from peer on fd 13
Aug 16 03:52:02 Sysname Server Administrator: Storage Service EventID:
2243 The Patrol Read has stopped.: Controller 0 (PERC 5/i Integrated)
Aug 16 03:52:02 Sysname snmpd[7589]: Got trap from peer on fd 13
Aug 16 16:38:33 Sysname sshd(pam_unix)[31901]: session opened for user root
by root(uid=0)
Aug 16 16:55:55 Sysname sshd(pam_unix)[32254]: session opened for user root
by root(uid=0)
Aug 16 17:27:06 Sysname sshd(pam_unix)[966]: session opened for user root
by root(uid=0)
[UNEXPECTED REBOOT]
Aug 16 23:18:31 Sysname syslogd 1.4.1: restart.
Aug 16 23:18:31 Sysname syslog: syslogd startup succeeded
Aug 16 23:18:31 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg
started.
Aug 16 23:18:31 Sysname kernel: Bootdata ok (command line is ro
root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
Aug 16 23:18:31 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp (
bhcompile at ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat
3.4.6-3)) #1
SMP Thu Aug 17 17:57:31 EDT 2006
Aug 16 23:18:31 Sysname kernel: BIOS-provided physical RAM map:
#####################################################################
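Appendix 2 contains 13 end_request I/O errors, all on sdc. To see whether such errors cluster on one device over a longer period, they can be tallied per device. This is a rough sketch: the function name io_errors_by_device is made up, and it assumes the "end_request: I/O error, dev <name>," wording of this 2.6.9 kernel.

```shell
# Hypothetical helper: count end_request I/O errors per block device.
io_errors_by_device() {
    awk '
        /end_request: I\/O error, dev / {
            for (i = 1; i < NF; i++) {
                if ($i == "dev") {
                    d = $(i + 1)
                    sub(/,$/, "", d)   # strip the trailing comma after the device name
                    count[d]++
                    break
                }
            }
        }
        END { for (d in count) print d, count[d] }
    ' "$1"
}
```

Running io_errors_by_device /var/log/messages prints one "device count" pair per line, in no particular order.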
Sunil Mushran
2008-Aug-18 17:55 UTC
[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Configure a netdump or netconsole server. It will catch the relevant messages.
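For readers who have not set up netconsole before, here is a rough sketch of how its module parameter is put together. It follows the format documented for the mainline kernel's netconsole module, netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]. Whether the RHEL 4 kernel and its netdump packaging accept the same form is an assumption to check against the distribution documentation. The function name, interface, address and port are all placeholders.

```shell
# Hypothetical helper: build a netconsole= parameter in the mainline documented form.
#   $1 local interface, $2 log host IP, $3 log host UDP port, $4 log host MAC (optional)
# Source port and source IP are left empty so the module picks its defaults.
netconsole_param() {
    printf 'netconsole=@/%s,%s@%s/%s\n' "$1" "$3" "$2" "${4:-}"
}
```

For example, netconsole_param eth0 192.168.1.10 6666 yields a string that could be passed to modprobe netconsole on the node, with a UDP listener on the log host capturing the kernel messages.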
Derek Hazell
2008-Aug-23 06:24 UTC
[Ocfs2-users] ocfs2 issue? : unexplained reboots of RHEL 4 server (kernel:2.6.9-42.0.2.ELs)
Hi Ocfs2 users
We got some relevant log messages via a serial console and via a PuTTY
session logged on as root.
I suspect we need to set up a private network between the ocfs2 cluster
members. Is this right? Is there anything else we might need to do?
regards, I appreciate your help
Derek
########################################################
CURRENT O2CB CONFIG
[root@sysname fs]# /etc/init.d/o2cb configure
Configuring the O2CB driver.
This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot. The current values will be shown in brackets ('[]'). Hitting
<ENTER> without typing an answer will keep that current value. Ctrl-C
will abort.
Load O2CB driver on boot (y/n) [y]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [61]:
Specify network idle timeout in ms (>=5000) [60000]: 120000
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
O2CB cluster ocfs2 already online
[root@sysname fs]#
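The heartbeat dead threshold is counted in 2-second heartbeat iterations rather than in seconds. As far as I understand the OCFS2 FAQ, the relation is threshold = (timeout in seconds / 2) + 1, so a threshold of 61 corresponds to the 120000 ms reported in the serial console trace below. A small sketch of the arithmetic (the function name is made up):

```shell
# Hypothetical helper: convert an O2CB heartbeat dead threshold to a timeout in ms.
# Assumes one heartbeat iteration every 2000 ms, per the OCFS2 FAQ formula.
hb_threshold_to_ms() {
    echo $(( ($1 - 1) * 2000 ))
}

hb_threshold_to_ms 61   # prints 120000
```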
##################
TRACE OF ROOT PUTTY LOGIN
[root@sysname ~]#
Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
sysname kernel: Heartbeat thread (11) printing last 24 blocking operations
(cur = 8):
Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
sysname kernel: Heartbeat thread stuck at waiting for read completion,
stuffing current time into that blocker (index 8)
Message from syslogd at sysname at Fri Aug 22 23:12:03 2008 ...
sysname kernel: Index 9: took 0 ms to do bio alloc read
.
.
.
Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 3: took 5240 ms to do waiting for write completion
Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 4: took 0 ms to do allocating bios for read
Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 5: took 0 ms to do bio alloc read
Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 6: took 0 ms to do bio add page read
Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 7: took 0 ms to do submit_bio for read
Message from syslogd at sysname at Fri Aug 22 23:12:04 2008 ...
sysname kernel: Index 8: took 120303 ms to do waiting for read completion
#############
TRACE OF SERIAL CONSOLE:
(11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
emcpowerb1 after 120000 milliseconds
Heartbeat thread (11) printing last 24 blocking operations (cur = 8):
Heartbeat thread stuck at waiting for read completion, stuffing current time
into that blocker (index 8)
Index 9: took 0 ms to do bio alloc read
Index 10: took 0 ms to do bio add page read
Index 11: took 0 ms to do submit_bio for read
Index 12: took 3025 ms to do waiting for read completion
Index 13: took 0 ms to do bio alloc write
Index 14: took 0 ms to do bio add page write
Index 15: took 0 ms to do submit_bio for write
Index 16: took 0 ms to do checking slots
Index 17: took 7221 ms to do waiting for write completion
Index 18: took 0 ms to do allocating bios for read
Index 19: took 0 ms to do bio alloc read
Index 20: took 0 ms to do bio add page read
Index 21: took 0 ms to do submit_bio for read
Index 22: took 3892 ms to do waiting for read completion
Index 23: took 0 ms to do bio alloc write
Index 0: took 0 ms to do bio add page write
Index 1: took 0 ms to do submit_bio for write
Index 2: took 0 ms to do checking slots
Index 3: took 5240 ms to do waiting for write completion
Index 4: took 0 ms to do allocating bios for read
Index 5: took 0 ms to do bio alloc read
Index 6: took 0 ms to do bio add page read
Index 7: took 0 ms to do submit_bio for read
Index 8: took 120303 ms to do waiting for read completion
*** ocfs2 is very sorry to be fencing this system by restarting ***
Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1
console=ttyS0,9600n8)
################################################################################
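In the dump above, the step that stands out is index 8, a wait for read completion on the heartbeat device that took 120303 ms against a 120000 ms timeout. For longer dumps, the slowest step can be picked out mechanically. This is a sketch only: the function name slowest_hb_op is made up, and it assumes the "Index N: took X ms to do ..." wording shown here.

```shell
# Hypothetical helper: report the slowest step in an o2hb "blocking operations" dump.
slowest_hb_op() {
    awk '
        {
            # Scan for "took <N> ms to do <operation>", with or without a syslog prefix.
            for (i = 1; i + 5 <= NF; i++) {
                if ($i == "took" && $(i + 2) == "ms") {
                    ms = $(i + 1) + 0
                    if (!found || ms > max) {
                        found = 1
                        max = ms
                        op = $(i + 5)
                        for (j = i + 6; j <= NF; j++) op = op " " $j
                    }
                    break
                }
            }
        }
        END { if (found) print max, op }
    ' "$1"
}
```

Given a file holding the console capture, slowest_hb_op console.log prints the duration in ms followed by the operation name.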