Hi,

today a node in our cluster fenced itself, and we found the lines below in /var/log/messages.

--snip
Jun 14 13:05:44 bmiam113 kernel: (18,0):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device dm-0 after 12000 milliseconds
Jun 14 13:05:44 bmiam113 kernel: Heartbeat thread (18) printing last 24 blocking operations (cur = 22):
Jun 14 13:05:44 bmiam113 kernel: Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 22)
Jun 14 13:05:44 bmiam113 kernel: Index 23: took 0 ms to do submit_bio for read
Jun 14 13:05:44 bmiam113 kernel: Index 0: took 0 ms to do waiting for read completion
Jun 14 13:05:44 bmiam113 kernel: Index 1: took 0 ms to do bio alloc write
Jun 14 13:05:44 bmiam113 kernel: Index 2: took 0 ms to do bio add page write
Jun 14 13:05:44 bmiam113 kernel: Index 3: took 0 ms to do submit_bio for write
Jun 14 13:05:44 bmiam113 kernel: Index 4: took 0 ms to do checking slots
Jun 14 13:05:44 bmiam113 kernel: Index 5: took 0 ms to do waiting for write completion
Jun 14 13:05:44 bmiam113 kernel: Index 6: took 2001 ms to do msleep
Jun 14 13:05:44 bmiam113 kernel: Index 7: took 0 ms to do allocating bios for read
Jun 14 13:05:44 bmiam113 kernel: Index 8: took 0 ms to do bio alloc read
Jun 14 13:05:44 bmiam113 kernel: Index 9: took 0 ms to do bio add page read
Jun 14 13:05:44 bmiam113 kernel: Index 10: took 0 ms to do submit_bio for read
Jun 14 13:05:44 bmiam113 kernel: Index 11: took 0 ms to do waiting for read completion
Jun 14 13:05:44 bmiam113 kernel: Index 12: took 0 ms to do bio alloc write
Jun 14 13:05:44 bmiam113 kernel: Index 13: took 0 ms to do bio add page write
Jun 14 13:05:44 bmiam113 kernel: Index 14: took 0 ms to do submit_bio for write
Jun 14 13:05:44 bmiam113 kernel: Index 15: took 0 ms to do checking slots
Jun 14 13:05:44 bmiam113 kernel: Index 16: took 0 ms to do waiting for write completion
Jun 14 13:05:44 bmiam113 kernel: Index 17: took 2000 ms to do msleep
Jun 14 13:05:44 bmiam113 kernel: Index 18: took 0 ms to do allocating bios for read
Jun 14 13:05:44 bmiam113 kernel: Index 19: took 0 ms to do bio alloc read
Jun 14 13:05:44 bmiam113 kernel: Index 20: took 0 ms to do bio add page read
Jun 14 13:05:44 bmiam113 kernel: Index 21: took 0 ms to do submit_bio for read
Jun 14 13:05:44 bmiam113 kernel: Index 22: took 9998 ms to do waiting for read completion
Jun 14 13:05:44 bmiam113 kernel: (18,0):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
Jun 14 13:05:44 bmiam113 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
Jun 14 13:05:44 bmiam113 kernel:
--snap

So far we have no clue what caused the timeout, except that there was a slight increase in the response time of the LUN behind device dm-0.

So my question is: are there any suggestions on where to "dig" for the cause of this problem, besides checking the performance of our SAN? Maybe someone with a similar configuration has experienced the same problem.

Our config is:

SLES9 SP3 +
Linux bmiam113 2.6.5-7.257-bigsmp #1 SMP Mon May 15 14:14:14 UTC 2006 i686 i686 i386 GNU/Linux

all OCFS modules version 1.2.1-SLES,
ocfs2console-1.2.1-4.2
ocfs2-tools-1.2.1-4.2

SAN: EMC CLARiiON CX

HBA: QLA2340
version: 8.01.02-sles DBB4213CE377B7165AFB8AC
description: QLogic ISP23xx FC-SCSI Host Bus Adapter driver
version: 8.01.02-sles 4C6E1EF52A188F2D559E2ED
description: QLogic Fibre Channel HBA Driver

multipath
multipath-tools-0.4.5-0.16

Greetings,

thomas zimolong
Bundesministerium des Inneren
Referat Z 6 - Funktionsbereich Anwendungsentwicklung
Alt-Moabit 101 D
D-10559 Berlin
Fon 01888 681 2383
Fax 01888 681 5 2383
mailto:thomas.zimolong at bmi.bund.de
http://bmi.bund.de
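
P.S. For anyone who wants to go through a trace like the one above, here is a minimal Python sketch that pulls out the heartbeat operations that took a long time. It is only an illustration: the function names `parse_trace` and `slow_operations` and the 1000 ms cutoff are assumptions made for this example, not part of OCFS2 or its tools, and the sample lines are copied from the log quoted above.

```python
import re

# Matches the per-operation lines of the o2hb trace,
# e.g. "... kernel: Index 22: took 9998 ms to do waiting for read completion"
ENTRY = re.compile(r"Index (\d+): took (\d+) ms to do (.+?)\s*$")


def parse_trace(lines):
    """Return (index, milliseconds, operation) tuples for each trace entry."""
    entries = []
    for line in lines:
        match = ENTRY.search(line)
        if match:
            entries.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return entries


def slow_operations(entries, min_ms=1000):
    """Keep only entries that took at least min_ms (cutoff is arbitrary)."""
    return [entry for entry in entries if entry[1] >= min_ms]


# A few lines taken from the log in this post.
sample = [
    "Jun 14 13:05:44 bmiam113 kernel: Index 16: took 0 ms to do waiting for write completion",
    "Jun 14 13:05:44 bmiam113 kernel: Index 17: took 2000 ms to do msleep",
    "Jun 14 13:05:44 bmiam113 kernel: Index 21: took 0 ms to do submit_bio for read",
    "Jun 14 13:05:44 bmiam113 kernel: Index 22: took 9998 ms to do waiting for read completion",
]

for index, ms, operation in slow_operations(parse_trace(sample)):
    print(f"index {index}: {ms} ms in {operation}")
```

On the sample lines this reports the 2000 ms msleep and the 9998 ms wait for read completion. To run it against a full log, pass the lines of /var/log/messages to `parse_trace` instead of `sample`.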