thr3ads.net - Ocfs2 users - [Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout [Nov 2006]


Peter Santos

2006-Nov-16 10:49 UTC

[Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout

Folks,
	
I'm trying to piece together what happened during a recent event where our
3-node RAC cluster had problems. It appears that all 3 nodes restarted, which
is likely to occur if all 3 nodes cannot communicate with the shared ocfs2
storage.

I did find out from our SA that this happened while he was replacing a failed
drive on the storage, and the storage was in a degraded mode. I'm trying to
understand whether the 3 nodes had a difficult time accessing the shared ocfs2
volume or whether it was a TCP connectivity issue. Nobody is currently using
the cluster, so it should have been idle from a user perspective.


prompt># cat /etc/fstab | grep ocfs2

/dev/sdb1  /ocfs2       ocfs2      _netdev,datavolume,nointr  0 0
/dev/sdb2  /backups     ocfs2      _netdev,datavolume,nointr  0 0

We have 2 ocfs2 volumes: one is for the voting and OCR files, while the other
is to be used as shared storage for backups of archivelog files etc.


/var/log/messages


NODE1 (dbo1)
=======================================================================================================
Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 13):
Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.


NODE2 (dbo2)
=======================================================================================================
Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at 192.168.134.140:7777 has been idle for 10 seconds, shutting it down.
Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv 1163628767.826104:1163628767.826105 func (f0735f96:506) 1163454320.893701:1163454320.893708)
Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1 (num 0) at 192.168.134.140:7777
Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at 192.168.134.142:7777 has been idle for 10 seconds, shutting it down.
Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv 1163628769.44159:1163628769.44160 func (f7e0383f:504) 1163540424.444236:1163540424.444248)
Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3 (num 2) at 192.168.134.142:7777
Nov 15 17:32:37 dbo2 -- MARK --
Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
Nov 15 17:33:03 dbo2 kernel:

NODE3 (dbo3)
=======================================================================================================
Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 11):
Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.


any help is greatly appreciated (BTW, I've read the ocfs2 user guide).

thanks
-peter


Sunil Mushran

2006-Nov-16 11:49 UTC

On nodes dbo1 and dbo3, the heartbeat timed out at 17:12:49. However, the
nodes did not fully panic: the network was shut down, but the heartbeat
thread was still going strong for some reason.

Within 10 seconds of that, by 17:12:59, dbo2 detected the loss of network
connectivity with both dbo1 and dbo3. However, it was still seeing those
nodes' heartbeats on disk and assumed that they were alive. As per the
quorum rules, it panicked.
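
The quorum decision here matches the fencing message in dbo2's log ("only
connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating
nodes"). Below is a minimal sketch of that majority arithmetic. It assumes an
odd number of heartbeating nodes and that the connected count includes the
local node; the function names are hypothetical, and this is not the o2quo
source, which may apply further rules not modeled here.

```python
def quorum_needed(heartbeating: int) -> int:
    """Nodes required for a majority of an odd number of heartbeating nodes."""
    if heartbeating < 1 or heartbeating % 2 == 0:
        # Assumption: only the odd-sized case from the log is modeled.
        raise ValueError("this sketch only covers an odd number of nodes")
    return (heartbeating + 1) // 2


def should_fence(connected: int, heartbeating: int) -> bool:
    """True when the local node sees too few peers to form a majority.

    `connected` counts the local node itself, so a node that has lost
    both network peers in a 3-node cluster has connected == 1.
    """
    return connected < quorum_needed(heartbeating)


if __name__ == "__main__":
    # dbo2 after losing o2net connections to dbo1 and dbo3, while still
    # seeing all three nodes heartbeating on disk.
    print(quorum_needed(3))    # nodes needed out of 3
    print(should_fence(1, 3))  # dbo2's situation
    print(should_fence(2, 3))  # one peer still reachable
```

With 3 heartbeating nodes the sketch requires 2, so a node connected only to
itself fences, which is what dbo2 logged at 17:33:03.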

So the question is: what was happening on nodes dbo1 and dbo3 after 17:12:49?
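
One way to approach that question is to merge the /var/log/messages lines
from all three nodes onto a single timeline and look at the gaps. The sketch
below does this for the lines posted in this thread. The year is an
assumption (syslog timestamps carry none; 2006 is taken from the thread
date), the helper names are hypothetical, and the node clocks are assumed to
be in sync.

```python
from datetime import datetime

ASSUMED_YEAR = 2006  # assumption: syslog lines do not include a year

# Log lines copied from the original post (continuation lines rejoined).
LOG_LINES = [
    "Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: "
    "Heartbeat write timeout to device sdb2 after 12000 milliseconds",
    "Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at "
    "192.168.134.140:7777 has been idle for 10 seconds, shutting it down.",
    "Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at "
    "192.168.134.142:7777 has been idle for 10 seconds, shutting it down.",
    "Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: "
    "fencing this node because it is only connected to 1 nodes and 2 is "
    "needed to make a quorum out of 3 heartbeating nodes",
    "Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: "
    "Heartbeat write timeout to device sdb2 after 12000 milliseconds",
]


def parse_line(line):
    """Split a syslog line into (timestamp, host, message)."""
    month, day, clock, host, message = line.split(None, 4)
    when = datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )
    return when, host, message


def build_timeline(lines):
    """Return (timestamp, host, gap_seconds, message) tuples in time order."""
    events = sorted((parse_line(line) for line in lines), key=lambda e: e[0])
    timeline = []
    previous = None
    for when, host, message in events:
        gap = 0 if previous is None else int((when - previous).total_seconds())
        timeline.append((when, host, gap, message))
        previous = when
    return timeline


if __name__ == "__main__":
    for when, host, gap, message in build_timeline(LOG_LINES):
        print(f"{when:%H:%M:%S} +{gap:>4}s {host} {message[:50]}")
```

On these lines the largest gap is the roughly 20 minutes between dbo2 dropping
its last o2net connection (17:12:59) and fencing itself (17:33:03); nothing
from dbo1 or dbo3 was posted for that window.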

Randy Ramsdell

2006-Nov-16 12:49 UTC

Hard to say anything with the info provided.

If you are on 1.2.1, upgrade to 1.2.3.

Then file a bug report.

rcr

Alexei_Roudnev

2006-Nov-18 00:17 UTC

[Ocfs2-users] Heartbeat ID and cloning systems

Hi.

Do you have any idea how to change the default ID that ocfs2 uses as a system
ID when running heartbeat? (It appears under
/config/cluster/<cluster>/heartbeat/* )


I found that if I clone a server (by disk copy) and then change the IP,
hostnames, etc. (so that they work as 2 different servers), heartbeat still
uses the same string for the heartbeat, which makes an OCFSv2 cluster
impossible.

The system is SLES9 SP3, but I guess that it does not depend on that.

Sunil Mushran

2006-Nov-20 10:39 UTC

The heartbeat region id is the same as the device uuid.
One can change the device uuid using tunefs.ocfs2.
Available with ocfs2-tools 1.2.2.

