Hello,

I've made some progress with the o2net_idle_timer issue. Various people seem to occasionally report instability and faults where the following message is generated:

(From Andrew Brunton)
Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been idle for 10 seconds, shutting it down.

(From Peter Santos)
Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down.

And from me:
Aug 2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.

I've tried, unsuccessfully, to replicate the issue in my testbed environment. The problem stems from the o2net layer function o2net_idle_timer firing after no valid packet has been received for O2NET_IDLE_TIMEOUT_SECS, which is defined to be 10 seconds in ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h. Once the underlying socket goes, the rest of the code falls over in a heap. (A rough userspace illustration of the idle-timer mechanism is further down this mail.)

It turns out that it's very likely not a bug in ocfs2. This code is doing what it's supposed to do. Others will (and have) argued that the network timeout is too low - see any and all posts by Alexei to this list. Leaving that aside, and leaving aside the idea that the network layer should attempt to reconnect before killing the entire machine, I'll focus on the causes of this problem we've found here, which are not spanning-tree related.

One common thread is that the people hitting this are on EM64T or Opteron based systems. There are various bugs reported against Red Hat Linux (and probably SuSE as well) for the kernels before RHAS 4.4, e.g. page 16 of this document, '"lost ticks" Message Under Stress With Non Uniform Memory Access Enabled on AMD Processor-Based Systems':
http://support.dell.com/support/edocs/software/osrhel4/en/INT/HJ834A00.pdf

Or Oracle bug 4593892, referenced in:
http://www.oracle.com/technology/tech/linux/validated-configurations/html/vc_dell6850-rhel4-cx500-1_1.html

We were also seeing messages of the form:

Dec 18 10:35:44 gs2dwdb02 kernel: warning: many lost ticks.
Dec 18 10:35:44 gs2dwdb02 kernel: Your time source seems to be instable or some driver is hogging interupts
(sic)

Our problem seems to have been at least partially down to dodgy AMI megaraid firmware for the system disks. We were getting messages from the megaraid driver module on the console, which correlated with dropped packets as logged by Oracle RAC's cssd.log.

So given the above NUMA and driver/hardware errors, it's likely that ocfs2 was going for periods as long as 10 seconds without receiving a packet, and failing accordingly. Ocfs2 was hit the worst, as it has the finest trigger on lost packets. The heartbeat failure times for RAC are over 60 seconds. The o2cb heartbeat threshold is set to 61 for us, which works out to about 120 seconds IIRC, and which is fine for riding out SAN interruptions and multipathing failovers.

We're planning an upgrade to 4.4, which apparently has fixed several of these bugs, and would recommend others with this problem to check carefully for signs of driver misbehaviour, particularly "lost ticks" messages. If you're running a large AMD box with more than a couple of sockets, then turning NUMA off seems to be a way of making things more stable, according to some of the PDFs above.

Sunil, I think that 10 seconds is too low for this timeout. Please consider making it tunable, in the way that O2CB_HEARTBEAT_THRESHOLD is tunable in /etc/sysconfig/o2cb.
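Here's the illustration I promised above: a tiny userspace analogue of the idle-timer behaviour. It is purely a sketch - not the ocfs2 code - and all the names in it are made up.

    /* idle_timer_demo.c - userspace analogue of the o2net idle timer:
     * a 10 second timeout that is re-armed every time "data" arrives,
     * and that tears the "connection" down if it ever expires.
     * Illustrative only; this is not the ocfs2 code.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/time.h>

    #define IDLE_TIMEOUT_SECS 10   /* mirrors O2NET_IDLE_TIMEOUT_SECS */

    int main(void)
    {
        char buf[256];

        for (;;) {
            fd_set rfds;
            struct timeval tv = { .tv_sec = IDLE_TIMEOUT_SECS, .tv_usec = 0 };

            FD_ZERO(&rfds);
            FD_SET(STDIN_FILENO, &rfds);

            /* select() returns 0 if nothing arrived within the timeout */
            int rc = select(STDIN_FILENO + 1, &rfds, NULL, NULL, &tv);
            if (rc < 0) {
                perror("select");
                return 1;
            }
            if (rc == 0) {
                /* analogous to o2net_idle_timer firing and shutting the socket */
                fprintf(stderr, "connection has been idle for %d seconds, "
                        "shutting it down.\n", IDLE_TIMEOUT_SECS);
                return 1;
            }
            /* "packet" received: consume it and re-arm the timeout */
            if (read(STDIN_FILENO, buf, sizeof(buf)) <= 0)
                return 0;
        }
    }

Type something every few seconds and it stays up; go quiet for 10 seconds and it "shuts down". That is essentially all o2net is doing, which is why anything that stalls packet delivery or processing for 10 seconds - NUMA weirdness, a driver spinning with interrupts off, or a genuinely dead peer - looks exactly the same to it.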
Coming back to the tunable request: this timeout can kill the box, and it's a bit counterintuitive to have the documented O2CB_HEARTBEAT_THRESHOLD effectively ignored when it comes to the network heartbeat. Having this in 1.2.4 would be ideal. Please.

This is the point where Alexei can jump in and tell us all that he told us so. He has a point about network spanning-tree convergence, even though most sensible designs for heartbeat networks would never allow that to happen. I hope I've made it clear that this is a somewhat different problem.

What we're planning to do next - once we've confirmed that our new disk firmware has eliminated the problem - is to test with numa=off, and eventually upgrade. We're also looking at trying to simulate a bad driver blocking interrupts in the kernel for configurable periods, to confirm that this diagnosis is correct (a rough sketch of the sort of test module we have in mind is appended at the end of this mail).

I hope this somewhat long-winded message is of use to people.

Andy

-- 
Andy Phillips
Systems Architecture Manager, Betfair.com
Office: 0208 8348436
Betfair Ltd | Winslow Road | Hammersmith Embankment | London | W6 9HP
Company No. 5140986
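PS: the interrupt-blocking test mentioned above would look roughly like the module below. It is an entirely untested sketch - the module and parameter names are invented - and deliberately blocking interrupts for whole seconds is obviously something to do only on a disposable test box.

    /* irqhog.c - hypothetical test module: spin with local interrupts
     * disabled for a configurable number of milliseconds, to mimic a
     * driver hogging the CPU with interrupts off.  Untested sketch,
     * written against a recent 2.6 kernel.
     */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>
    #include <linux/delay.h>
    #include <linux/irqflags.h>

    static int hog_ms = 1000;
    module_param(hog_ms, int, 0444);
    MODULE_PARM_DESC(hog_ms, "milliseconds to spin with local interrupts disabled");

    static int __init irqhog_init(void)
    {
        unsigned long flags;

        printk(KERN_INFO "irqhog: disabling local interrupts for %d ms\n", hog_ms);
        local_irq_save(flags);
        mdelay(hog_ms);         /* busy-wait; must not sleep with irqs off */
        local_irq_restore(flags);
        printk(KERN_INFO "irqhog: done\n");
        return 0;
    }

    static void __exit irqhog_exit(void)
    {
    }

    module_init(irqhog_init);
    module_exit(irqhog_exit);
    MODULE_LICENSE("GPL");

Note that local_irq_save() only masks interrupts on the CPU the module happens to load on, so to actually starve the network path you'd want to pin it to the CPU that services the NIC's interrupts, and be aware that the NMI watchdog may object to such a long interrupts-off spin.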
That, and we've also seen similar issues with Broadcom TG3 drivers. We use Intel E1000 mostly and thus did not experience the same issue.

As far as the configurable net timeouts go, the patch was added into mainline on Dec 4th, so it will be available with ocfs2 1.4. We are still seeing whether we have the bandwidth to backport it to 1.2.

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=fs/ocfs2/cluster/tcp.c;h=ae4ff4a6636b23759522994898a95c148a4401f1;hb=HEAD

commit 828ae6afbef03bfe107a4a8cc38798419d6a2765
Author: Andrew Beekhof <abeekhof@suse.de>
Date:   Mon Dec 4 14:04:55 2006 +0100

    [patch 3/3] OCFS2 Configurable timeouts - Protocol changes

    Modify the OCFS2 handshake to ensure essential timeouts are configured
    identically on all nodes. Only allow changes when there are no connected
    peers.

    Improves the logic in o2net_advance_rx() which broke now that
    sizeof(struct o2net_handshake) is greater than sizeof(struct o2net_msg).

    Included is the field for userspace-heartbeat timeout to avoid the need
    for further protocol changes.

    Uses a global spinlock to ensure the decisions to update configfs entries
    are made on the correct value. The region covered by the spinlock when
    incrementing the counter is much larger as this is the more critical case.

    Small cleanup contributed by Adrian Bunk <bunk@stusta.de>

    Signed-off-by: Andrew Beekhof <abeekhof@suse.de>
    Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

commit b5dd80304da482d77b2320e1a01a189e656b9770
Author: Jeff Mahoney <jeffm@suse.de>
Date:   Mon Dec 4 14:04:54 2006 +0100

    [patch 2/3] OCFS2 Configurable timeouts

    Allow configuration of OCFS2 timeouts from userspace via configfs

    Signed-off-by: Andrew Beekhof <abeekhof@suse.de>
    Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
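To give a flavour of what "configurable from userspace via configfs" means in practice, here is a minimal sketch of a helper that writes a new idle timeout. The configfs path and attribute name used below (/sys/kernel/config/cluster/<name>/idle_timeout_ms) are assumptions based on the patch descriptions rather than something verified against a tree, and per the protocol-changes patch the value can only be changed while no peers are connected.

    /* set_o2net_idle_timeout.c - hypothetical helper illustrating the
     * configfs-based timeout tuning.  The path and attribute name are
     * assumptions; check them against the actual tree before use.
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* assumed attribute: /sys/kernel/config/cluster/<name>/idle_timeout_ms */
        const char *path  = argc > 1 ? argv[1]
                          : "/sys/kernel/config/cluster/ocfs2/idle_timeout_ms";
        const char *value = argc > 2 ? argv[2] : "30000";   /* 30 seconds */

        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        /* the patch only accepts new values while no peers are connected */
        if (fprintf(f, "%s\n", value) < 0 || fclose(f) == EOF) {
            perror("write");
            return 1;
        }
        printf("wrote %s ms to %s\n", value, path);
        return 0;
    }

In practice you would simply echo the value from a shell, or let the cluster tooling manage it; the point is only that once the patches are in, the knob becomes an ordinary writable file rather than a compile-time constant.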