b52@entrap.de
2008-Mar-08 02:33 UTC
[Ocfs2-users] AoE+ocfs2 = Heartbeat write timeout to device
Hi,

I have a problem involving 100 Mbit Ethernet, AoE and ocfs2. I set up 2 boxes connected via 100 Mbit Ethernet to their ATA-over-Ethernet storage. The ocfs2 filesystem resides on such an AoE partition. If I produce high throughput to that ocfs2 partition on one node, the node reboots after a few seconds.

I use dd for testing, like this:

dd if=/dev/zero of=test bs=1M count=1000

If I write 100 MB of data to the disk, everything is fine. If I write 1 GB of data to the disk, the node reboots after a few seconds and prints the following error:

(9,0):o2hb_write_timeout:167 ERROR: Heartbeat write timeout to device etherd/e402.0 after 12000 milliseconds
(9,0):o2hb_stop_all_regions:1865 ERROR: stopping heartbeat on all active regions.

This cannot be caused by lost heartbeat packets: I set up a separate network for the heartbeat to rule that out.

I know that 100 Mbit Ethernet is a bottleneck, but that should not cause the system to reboot, right? Even if I switched to Gigabit Ethernet, it might become the bottleneck again in the future.

Has anyone experienced this already? Do you know how to solve this issue? Please help, I need to run some tests.

Your help is really appreciated.

Cheers,
Holger
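A rough back-of-the-envelope check shows why the 100 MB write survives and the 1 GB write does not. The error message names the AoE device (etherd/e402.0) as the heartbeat target, so the sketch below assumes the heartbeat write shares the 100 Mbit storage link with the dd data and can be queued behind it. It also ignores protocol overhead and caching, so the real numbers will be somewhat worse. The function name and constants are illustrative only.

```python
# Estimate how long the storage link stays saturated by a dd write,
# assuming an ideal 100 Mbit/s link with no protocol overhead (assumption).

LINK_MBIT_PER_S = 100        # link speed stated in the post
HEARTBEAT_TIMEOUT_S = 12.0   # from the log: "after 12000 milliseconds"
MIB = 1024 * 1024            # dd's bs=1M is one MiB


def transfer_seconds(num_bytes, link_mbit_per_s=LINK_MBIT_PER_S):
    """Best-case time to push num_bytes over the link at full line rate."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    return num_bytes / bytes_per_second


for count in (100, 1000):    # dd count=100 and count=1000
    seconds = transfer_seconds(count * MIB)
    verdict = "exceeds" if seconds > HEARTBEAT_TIMEOUT_S else "stays under"
    print(f"{count} MiB: about {seconds:.1f} s, {verdict} the 12 s timeout")
```

Under these assumptions the smaller write finishes in less than the 12-second window while the larger one keeps the link busy for well over a minute, which is consistent with the behaviour described above.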
Sunil Mushran
2008-Mar-08 09:43 UTC
[Ocfs2-users] AoE+ocfs2 = Heartbeat write timeout to device
The older 12 sec default timeout was too low. It has been bumped up to 60 secs. The FAQ has details on this.
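For reference, the OCFS2 FAQ expresses the disk heartbeat timeout through the O2CB_HEARTBEAT_THRESHOLD setting, which counts 2-second heartbeat iterations: threshold = (timeout in seconds / 2) + 1. The sketch below only reproduces that arithmetic; the function names are made up for illustration, and the formula and the location of the setting (commonly /etc/sysconfig/o2cb) should be checked against the FAQ for the installed version. The value has to be the same on every node in the cluster.

```python
# Convert between O2CB_HEARTBEAT_THRESHOLD and the heartbeat timeout,
# assuming the FAQ formula: threshold = (timeout_seconds / 2) + 1.

HEARTBEAT_INTERVAL_S = 2  # seconds between disk heartbeat iterations (per the FAQ)


def timeout_for_threshold(threshold):
    """Timeout in seconds that a given threshold value corresponds to."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def threshold_for_timeout(timeout_seconds):
    """Threshold value needed for a desired timeout in seconds."""
    return timeout_seconds // HEARTBEAT_INTERVAL_S + 1


print(timeout_for_threshold(7))    # old default: 12 seconds
print(threshold_for_timeout(60))   # new default of 60 seconds: threshold 31
```

A threshold of 7 matches the 12000 milliseconds in the error message, and 31 matches the 60-second default mentioned above.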