thr3ads.net - Ocfs2 users - [Ocfs2-users] heartbeat write timeout [Jun 2006]

Thomas.Zimolong@bmi.bund.de
2006-Jun-14 15:56 UTC
[Ocfs2-users] heartbeat write timeout

Hi,

Today one of the nodes in our cluster fenced itself, and we found the
lines below in /var/log/messages.

--snip
Jun 14 13:05:44 bmiam113 kernel: (18,0):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device dm-0 after 12000 milliseconds
Jun 14 13:05:44 bmiam113 kernel: Heartbeat thread (18) printing last 24 blocking operations (cur = 22):
Jun 14 13:05:44 bmiam113 kernel: Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 22)
Jun 14 13:05:44 bmiam113 kernel: Index 23: took 0 ms to do submit_bio for read
Jun 14 13:05:44 bmiam113 kernel: Index 0: took 0 ms to do waiting for read completion
Jun 14 13:05:44 bmiam113 kernel: Index 1: took 0 ms to do bio alloc write
Jun 14 13:05:44 bmiam113 kernel: Index 2: took 0 ms to do bio add page write
Jun 14 13:05:44 bmiam113 kernel: Index 3: took 0 ms to do submit_bio for write
Jun 14 13:05:44 bmiam113 kernel: Index 4: took 0 ms to do checking slots
Jun 14 13:05:44 bmiam113 kernel: Index 5: took 0 ms to do waiting for write completion
Jun 14 13:05:44 bmiam113 kernel: Index 6: took 2001 ms to do msleep
Jun 14 13:05:44 bmiam113 kernel: Index 7: took 0 ms to do allocating bios for read
Jun 14 13:05:44 bmiam113 kernel: Index 8: took 0 ms to do bio alloc read
Jun 14 13:05:44 bmiam113 kernel: Index 9: took 0 ms to do bio add page read
Jun 14 13:05:44 bmiam113 kernel: Index 10: took 0 ms to do submit_bio for read
Jun 14 13:05:44 bmiam113 kernel: Index 11: took 0 ms to do waiting for read completion
Jun 14 13:05:44 bmiam113 kernel: Index 12: took 0 ms to do bio alloc write
Jun 14 13:05:44 bmiam113 kernel: Index 13: took 0 ms to do bio add page write
Jun 14 13:05:44 bmiam113 kernel: Index 14: took 0 ms to do submit_bio for write
Jun 14 13:05:44 bmiam113 kernel: Index 15: took 0 ms to do checking slots
Jun 14 13:05:44 bmiam113 kernel: Index 16: took 0 ms to do waiting for write completion
Jun 14 13:05:44 bmiam113 kernel: Index 17: took 2000 ms to do msleep
Jun 14 13:05:44 bmiam113 kernel: Index 18: took 0 ms to do allocating bios for read
Jun 14 13:05:44 bmiam113 kernel: Index 19: took 0 ms to do bio alloc read
Jun 14 13:05:44 bmiam113 kernel: Index 20: took 0 ms to do bio add page read
Jun 14 13:05:44 bmiam113 kernel: Index 21: took 0 ms to do submit_bio for read
Jun 14 13:05:44 bmiam113 kernel: Index 22: took 9998 ms to do waiting for read completion
Jun 14 13:05:44 bmiam113 kernel: (18,0):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
Jun 14 13:05:44 bmiam113 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
Jun 14 13:05:44 bmiam113 kernel:
--snap
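
For anyone who wants to pick the slow steps out of a dump like the one
above, here is a minimal sketch in Python. It is not part of OCFS2 or its
tools; the names parse_blockers and slow_blockers, the 1000 ms cutoff and
the shortened SAMPLE text are assumptions made for illustration only. It
relies solely on the "Index N: took M ms to do ..." wording shown in the
log and tolerates mail line wrapping.

```python
import re

# A few lines copied from the log above, wrapped the way a mail client wraps them.
SAMPLE = """\
Jun 14 13:05:44 bmiam113 kernel: Index 16: took 0 ms to do waiting for
write completion
Jun 14 13:05:44 bmiam113 kernel: Index 17: took 2000 ms to do msleep
Jun 14 13:05:44 bmiam113 kernel: Index 21: took 0 ms to do submit_bio
for read
Jun 14 13:05:44 bmiam113 kernel: Index 22: took 9998 ms to do waiting
for read completion
"""

# One entry runs from "Index N:" up to the next syslog timestamp or the end of the text.
ENTRY = re.compile(
    r"Index (\d+): took (\d+) ms to do (.+?)"
    r"(?= \w{3} +\d+ \d\d:\d\d:\d\d |$)"
)


def parse_blockers(text):
    """Return (index, milliseconds, operation) tuples from a blocker dump."""
    flat = " ".join(text.split())  # collapse newlines so wrapped entries become whole
    return [(int(i), int(ms), op) for i, ms, op in ENTRY.findall(flat)]


def slow_blockers(text, min_ms=1000):
    """Keep only entries that took at least min_ms, slowest first."""
    hits = [b for b in parse_blockers(text) if b[1] >= min_ms]
    return sorted(hits, key=lambda b: b[1], reverse=True)


if __name__ == "__main__":
    for index, ms, op in slow_blockers(SAMPLE):
        print(f"index {index}: {ms} ms in {op}")
```

On the lines above, the two entries after the last write that stand out
are the 2000 ms msleep (index 17) and the 9998 ms wait for read
completion (index 22). Together they come to 11998 ms, which is close to
the 12000 ms named in the timeout message, so the stalled read on dm-0
looks like the step to investigate.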

So far we have no clue what caused the timeout, except that the response
time of the LUN behind device dm-0 increased slightly.
So my question is: apart from checking the performance of our SAN, are
there any suggestions on where to "dig" for the cause of this problem?

Maybe someone with a similar config has experienced the same problem.

Our config is:
SLES9 SP3 +
Linux bmiam113 2.6.5-7.257-bigsmp #1 SMP Mon May 15 14:14:14 UTC 2006 i686 i686 i386 GNU/Linux
all OCFS2 modules version 1.2.1-SLES
ocfs2console-1.2.1-4.2
ocfs2-tools-1.2.1-4.2
SAN: EMC CLARiiON CX
HBA: QLA2340
version:        8.01.02-sles DBB4213CE377B7165AFB8AC
description:    QLogic ISP23xx FC-SCSI Host Bus Adapter driver
version:        8.01.02-sles 4C6E1EF52A188F2D559E2ED
description:    QLogic Fibre Channel HBA Driver
multipath
multipath-tools-0.4.5-0.16

greetings

thomas zimolong

Bundesministerium des Inneren
Referat Z 6 - Funktionsbereich Anwendungsentwicklung
Alt-Moabit 101 D
D-10559 Berlin
Fon 01888 681 2383
Fax 01888 681 5 2383
mailto:thomas.zimolong at bmi.bund.de
http://bmi.bund.de