Hello,

I've made some progress with the o2net_idle_timer issue. Various people seem to occasionally report instability and faults where the following message is generated:

(From Andrew Brunton)
Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been idle for 10 seconds, shutting it down.

(From Peter Santos)
Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down.

And from me:
Aug 2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.

I've tried, unsuccessfully, to replicate the issue in my testbed environment. The problem stems from the o2net layer function o2net_idle_timer firing after no valid packet has been received for O2NET_IDLE_TIMEOUT_SECS, which is defined to be 10 seconds in ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h. Once the underlying socket goes, the rest of the code falls over in a heap. (A rough userspace illustration of the idle-timer mechanism is further down this mail.)

It turns out that it's very likely not a bug in ocfs2. This code is doing what it's supposed to do. Others will (and have) argued that the network timeout is too low - see any and all posts by Alexei to this list. Leaving that aside, and leaving aside the idea that the network layer should attempt to reconnect before killing the entire machine, I'll focus on the causes of this problem we've found here, which are not spanning-tree related.

One common thread is that the people hitting this are on EM64T or Opteron based systems. There are various bugs reported against Red Hat Linux (and probably SuSE as well) for the kernels before RHAS 4.4, e.g. page 16 of this document, '"lost ticks" Message Under Stress With Non Uniform Memory Access Enabled on AMD Processor-Based Systems':
http://support.dell.com/support/edocs/software/osrhel4/en/INT/HJ834A00.pdf

Or Oracle bug 4593892, referenced in:
http://www.oracle.com/technology/tech/linux/validated-configurations/html/vc_dell6850-rhel4-cx500-1_1.html

We were also seeing messages of the form:

Dec 18 10:35:44 gs2dwdb02 kernel: warning: many lost ticks.
Dec 18 10:35:44 gs2dwdb02 kernel: Your time source seems to be instable or some driver is hogging interupts
(sic)

Our problem seems to have been at least partially down to dodgy AMI megaraid firmware for the system disks. We were getting messages from the megaraid driver module on the console, which correlated with dropped packets as logged by Oracle RAC's cssd.log.

So given the above NUMA and driver/hardware errors, it's likely that ocfs2 was going for periods as long as 10 seconds without receiving a packet, and failing accordingly. Ocfs2 was hit the worst, as it has the finest trigger on lost packets. The heartbeat failure times for RAC are over 60 seconds. The o2cb heartbeat threshold is set to 61 for us, which works out to about 120 seconds IIRC, and which is fine for riding out SAN interruptions and multipathing failovers.

We're planning an upgrade to 4.4, which apparently has fixed several of these bugs, and would recommend others with this problem to check carefully for signs of driver misbehaviour, particularly "lost ticks" messages. If you're running a large AMD box with more than a couple of sockets, then turning NUMA off seems to be a way of making things more stable, according to some of the PDFs above.

Sunil, I think that 10 seconds is too low for this timeout. Please consider making it tunable, in the way that O2CB_HEARTBEAT_THRESHOLD is tunable in /etc/sysconfig/o2cb.
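Here's the illustration I promised above: a tiny userspace analogue of the idle-timer behaviour. It is purely a sketch - not the ocfs2 code - and all the names in it are made up.

    /* idle_timer_demo.c - userspace analogue of the o2net idle timer:
     * a 10 second timeout that is re-armed every time "data" arrives,
     * and that tears the "connection" down if it ever expires.
     * Illustrative only; this is not the ocfs2 code.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/time.h>

    #define IDLE_TIMEOUT_SECS 10   /* mirrors O2NET_IDLE_TIMEOUT_SECS */

    int main(void)
    {
        char buf[256];

        for (;;) {
            fd_set rfds;
            struct timeval tv = { .tv_sec = IDLE_TIMEOUT_SECS, .tv_usec = 0 };

            FD_ZERO(&rfds);
            FD_SET(STDIN_FILENO, &rfds);

            /* select() returns 0 if nothing arrived within the timeout */
            int rc = select(STDIN_FILENO + 1, &rfds, NULL, NULL, &tv);
            if (rc < 0) {
                perror("select");
                return 1;
            }
            if (rc == 0) {
                /* analogous to o2net_idle_timer firing and shutting the socket */
                fprintf(stderr, "connection has been idle for %d seconds, "
                        "shutting it down.\n", IDLE_TIMEOUT_SECS);
                return 1;
            }
            /* "packet" received: consume it and re-arm the timeout */
            if (read(STDIN_FILENO, buf, sizeof(buf)) <= 0)
                return 0;
        }
    }

Type something every few seconds and it stays up; go quiet for 10 seconds and it "shuts down". That is essentially all o2net is doing, which is why anything that stalls packet delivery or processing for 10 seconds - NUMA weirdness, a driver spinning with interrupts off, or a genuinely dead peer - looks exactly the same to it.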
Coming back to the tunable request: this timeout can kill the box, and it's a bit counterintuitive to have the documented O2CB_HEARTBEAT_THRESHOLD effectively ignored when it comes to the network heartbeat. Having this in 1.2.4 would be ideal. Please.

This is the point where Alexei can jump in and tell us all that he told us so. He has a point about network spanning-tree convergence, even though most sensible designs for heartbeat networks would never allow that to happen. I hope I've made it clear that this is a somewhat different problem.

What we're planning to do next - once we've confirmed that our new disk firmware has eliminated the problem - is to test with numa=off, and eventually upgrade. We're also looking at trying to simulate a bad driver blocking interrupts in the kernel for configurable periods, to confirm that this diagnosis is correct (a rough sketch of the sort of test module we have in mind is appended at the end of this mail).

I hope this somewhat long-winded message is of use to people.

Andy

-- 
Andy Phillips
Systems Architecture Manager, Betfair.com
Office: 0208 8348436
Betfair Ltd | Winslow Road | Hammersmith Embankment | London | W6 9HP
Company No. 5140986
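PS: the interrupt-blocking test mentioned above would look roughly like the module below. It is an entirely untested sketch - the module and parameter names are invented - and deliberately blocking interrupts for whole seconds is obviously something to do only on a disposable test box.

    /* irqhog.c - hypothetical test module: spin with local interrupts
     * disabled for a configurable number of milliseconds, to mimic a
     * driver hogging the CPU with interrupts off.  Untested sketch,
     * written against a recent 2.6 kernel.
     */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>
    #include <linux/delay.h>
    #include <linux/irqflags.h>

    static int hog_ms = 1000;
    module_param(hog_ms, int, 0444);
    MODULE_PARM_DESC(hog_ms, "milliseconds to spin with local interrupts disabled");

    static int __init irqhog_init(void)
    {
        unsigned long flags;

        printk(KERN_INFO "irqhog: disabling local interrupts for %d ms\n", hog_ms);
        local_irq_save(flags);
        mdelay(hog_ms);         /* busy-wait; must not sleep with irqs off */
        local_irq_restore(flags);
        printk(KERN_INFO "irqhog: done\n");
        return 0;
    }

    static void __exit irqhog_exit(void)
    {
    }

    module_init(irqhog_init);
    module_exit(irqhog_exit);
    MODULE_LICENSE("GPL");

Note that local_irq_save() only masks interrupts on the CPU the module happens to load on, so to actually starve the network path you'd want to pin it to the CPU that services the NIC's interrupts, and be aware that the NMI watchdog may object to such a long interrupts-off spin.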
That, and we've also seen similar issues with Broadcom TG3 drivers. We use Intel E1000 mostly and thus did not experience the same issue.

As far as the configurable net timeouts go, the patch was added into mainline on Dec 4th, so it will be available with ocfs2 1.4. We are still seeing whether we have the bandwidth to backport it to 1.2.

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=fs/ocfs2/cluster/tcp.c;h=ae4ff4a6636b23759522994898a95c148a4401f1;hb=HEAD

commit 828ae6afbef03bfe107a4a8cc38798419d6a2765
Author: Andrew Beekhof <abeekhof@suse.de>
Date:   Mon Dec 4 14:04:55 2006 +0100

    [patch 3/3] OCFS2 Configurable timeouts - Protocol changes

    Modify the OCFS2 handshake to ensure essential timeouts are configured
    identically on all nodes. Only allow changes when there are no connected
    peers.

    Improves the logic in o2net_advance_rx() which broke now that
    sizeof(struct o2net_handshake) is greater than sizeof(struct o2net_msg).

    Included is the field for userspace-heartbeat timeout to avoid the need
    for further protocol changes.

    Uses a global spinlock to ensure the decisions to update configfs entries
    are made on the correct value. The region covered by the spinlock when
    incrementing the counter is much larger as this is the more critical case.

    Small cleanup contributed by Adrian Bunk <bunk@stusta.de>

    Signed-off-by: Andrew Beekhof <abeekhof@suse.de>
    Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

commit b5dd80304da482d77b2320e1a01a189e656b9770
Author: Jeff Mahoney <jeffm@suse.de>
Date:   Mon Dec 4 14:04:54 2006 +0100

    [patch 2/3] OCFS2 Configurable timeouts

    Allow configuration of OCFS2 timeouts from userspace via configfs

    Signed-off-by: Andrew Beekhof <abeekhof@suse.de>
    Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
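To give a flavour of what "configurable from userspace via configfs" means in practice, here is a minimal sketch of a helper that writes a new idle timeout. The configfs path and attribute name used below (/sys/kernel/config/cluster/<name>/idle_timeout_ms) are assumptions based on the patch descriptions rather than something verified against a tree, and per the protocol-changes patch the value can only be changed while no peers are connected.

    /* set_o2net_idle_timeout.c - hypothetical helper illustrating the
     * configfs-based timeout tuning.  The path and attribute name are
     * assumptions; check them against the actual tree before use.
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* assumed attribute: /sys/kernel/config/cluster/<name>/idle_timeout_ms */
        const char *path  = argc > 1 ? argv[1]
                          : "/sys/kernel/config/cluster/ocfs2/idle_timeout_ms";
        const char *value = argc > 2 ? argv[2] : "30000";   /* 30 seconds */

        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        /* the patch only accepts new values while no peers are connected */
        if (fprintf(f, "%s\n", value) < 0 || fclose(f) == EOF) {
            perror("write");
            return 1;
        }
        printf("wrote %s ms to %s\n", value, path);
        return 0;
    }

In practice you would simply echo the value from a shell, or let the cluster tooling manage it; the point is only that once the patches are in, the knob becomes an ordinary writable file rather than a compile-time constant.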