Peter Santos
2006-Nov-16 10:49 UTC
[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout
Folks,

I'm trying to piece together what happened during a recent event where our 3-node RAC cluster had problems. It appears that all 3 nodes restarted, which is likely to occur if all 3 nodes cannot communicate with the shared ocfs2 storage.

I did find out from our SA that this happened while he was replacing a failed drive on the storage and the storage was in a degraded mode. I'm trying to understand whether the 3 nodes had a difficult time accessing the shared ocfs2 volume or whether it was a TCP connectivity issue. Nobody is currently using the cluster, so it should have been idle from a user perspective.

prompt># cat /etc/fstab | grep ocfs2
/dev/sdb1 /ocfs2 ocfs2 _netdev,datavolume,nointr 0 0
/dev/sdb2 /backups ocfs2 _netdev,datavolume,nointr 0 0

We have 2 ocfs2 volumes: one is for the voting and OCR files, while the other is to be used as shared storage for backups of archivelog files etc.

/var/log/messages

NODE1 (dbo1)
============================================================
Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 13):
Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.

NODE2 (dbo2)
============================================================
Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at 192.168.134.140:7777 has been idle for 10 seconds, shutting it down.
Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv 1163628767.826104:1163628767.826105 func (f0735f96:506) 1163454320.893701:1163454320.893708)
Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1 (num 0) at 192.168.134.140:7777
Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at 192.168.134.142:7777 has been idle for 10 seconds, shutting it down.
Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv 1163628769.44159:1163628769.44160 func (f7e0383f:504) 1163540424.444236:1163540424.444248)
Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3 (num 2) at 192.168.134.142:7777
Nov 15 17:32:37 dbo2 -- MARK --
Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
Nov 15 17:33:03 dbo2 kernel:

NODE3 (dbo3)
============================================================
Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 11):
Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.

Any help is greatly appreciated (BTW, I've read the ocfs2 user guide).

thanks
-peter
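The o2net_idle_timer lines in the dbo2 log carry raw epoch timestamps. As a rough aid for reading them, the sketch below takes the `tmr` and `now` values from the dbo1 entry, computes the gap between them, and converts `now` to UTC. Reading `tmr` as the moment the idle timer was last armed and `now` as the moment it fired is an assumption based on the field names, and the function names are made up for this example.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Values copied from the dbo2 log entry about node dbo1 (num 0).
TMR = "1163628767.826089"  # assumed: when the idle timer was last armed
NOW = "1163628777.825614"  # assumed: when the idle timer fired


def idle_gap(tmr: str, now: str) -> Decimal:
    """Seconds between the two stamps, computed exactly with Decimal."""
    return Decimal(now) - Decimal(tmr)


def to_utc(stamp: str) -> str:
    """Render an epoch stamp as a UTC wall-clock string (whole seconds)."""
    moment = datetime.fromtimestamp(float(stamp), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(idle_gap(TMR, NOW))  # 9.999525
    print(to_utc(NOW))         # 2006-11-15 22:12:57
```

The gap of just under 10 seconds agrees with the "has been idle for 10 seconds" message. The UTC time 22:12:57 would line up with the syslog stamp 17:12:57 if the hosts log in a UTC-5 local time, which is an assumption and is not stated in the thread.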
Sunil Mushran
2006-Nov-16 11:49 UTC
[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout
On nodes dbo1 and dbo3, the heartbeat timed out at 17:12:49. However, the nodes did not fully panic: the network was shut down, but the hb thread was still going strong for some reason.

Within 10 secs of that, by 17:12:59, dbo2 detected loss of network connectivity with both dbo1 and dbo3. However, it was still seeing those nodes' heartbeats on disk and assumed that they were alive. As per quorum rules, it panicked.

So the question is: what was happening on nodes dbo1 and dbo3 after 17:12:49?
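The fencing message in the dbo2 log ("only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes") can be read as a majority check. The sketch below is a simplified illustration of that check for the odd-sized case seen in the log, not the actual o2quo code; the function names are hypothetical, and the count of connected nodes is assumed to include the local node, as the log wording suggests.

```python
def quorum_needed(heartbeating: int) -> int:
    """Smallest strict majority of the nodes still heartbeating on disk."""
    if heartbeating < 1:
        raise ValueError("at least one node must be heartbeating")
    return heartbeating // 2 + 1


def should_fence(connected: int, heartbeating: int) -> bool:
    """True when this node reaches fewer nodes than a majority.

    `connected` counts the local node itself (an assumption taken from
    the log message). Even-sized clusters may need a tie-break rule
    that this sketch does not model.
    """
    return connected < quorum_needed(heartbeating)


if __name__ == "__main__":
    # dbo2 after losing both network links: it reaches only itself,
    # but still sees all 3 nodes heartbeating on disk.
    print(quorum_needed(3))    # 2
    print(should_fence(1, 3))  # True
    # With one peer still reachable over the network, it would stay up.
    print(should_fence(2, 3))  # False
```

This shows why dbo2 fenced itself even though its own disk heartbeat was fine: the other two nodes still counted as heartbeating, so a lone node could not reach a majority.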
Randy Ramsdell
2006-Nov-16 12:49 UTC
[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout
Hard to say anything with the info provided. If you are on 1.2.1, upgrade to 1.2.3. Then file a bug report.

rcr
Hi.

Do you have any idea how to change the default ID that ocfs2 uses as a system ID when running heartbeat? (It appears under /config/cluster/<cluster>/heartbeat/*.)

I found that if I clone a server (by disk copy) and then change the IP, hostnames, etc. (so that they work as 2 different servers), heartbeat still uses the same string on both, which makes an OCFSv2 cluster impossible.

The system is SLES9 SP3, but I guess this does not depend on it.
The heartbeat region ID is the same as the device UUID. One can change the device UUID using tunefs.ocfs2; this is available with ocfs2-tools 1.2.2.

Alexei_Roudnev wrote:
> Do you have any idea how to change the default ID that ocfs2 uses as a
> system ID when running heartbeat?
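Since the heartbeat region ID follows the device UUID, a disk-level clone carries the same ID as its source. As a minimal sketch for spotting that before trying to form a cluster, the snippet below groups volumes that are meant to be distinct by their UUID and reports any UUID that appears more than once. The volume labels and UUID strings are made-up placeholders, the function name is hypothetical, and how the UUIDs are collected is left out.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_uuids(uuid_by_volume: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each UUID used by more than one volume to those volumes."""
    volumes_by_uuid = defaultdict(list)
    for volume, uuid in uuid_by_volume.items():
        # Normalize case and whitespace so equal UUIDs compare equal.
        volumes_by_uuid[uuid.strip().upper()].append(volume)
    return {
        uuid: sorted(volumes)
        for uuid, volumes in volumes_by_uuid.items()
        if len(volumes) > 1
    }


if __name__ == "__main__":
    # Placeholder data: two volumes that should be different, one cloned.
    sample = {
        "original-server volume": "AAAA1111BBBB2222CCCC3333DDDD4444",
        "cloned-server volume": "aaaa1111bbbb2222cccc3333dddd4444",
        "unrelated volume": "EEEE5555FFFF66660000777711118888",
    }
    for uuid, volumes in find_shared_uuids(sample).items():
        print(uuid, volumes)
```

Nodes that mount the same shared volume are expected to see the same UUID; the check only makes sense for volumes that are supposed to be separate, such as a source disk and its clone.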