Hi,
I've got _exactly_ the same problem, but I've not had the time to dig
through the source code and confirm the cause. We're on ES4.3 and
ocfs-1.2.3. For us the problem (same trace as below) was not reliably
repeatable, and was possibly related to the I/O pattern.
What seems to happen is that the underlying "network services" layer of
ocfs2 (o2net) believes that no packets are being sent. The TCP socket is
surrounded by wrapper functions, one of which records the time at which
the last packet was received. It's this that decides the socket is dead
and then closes it. Meanwhile, the upper layers (which are actually
sending data regularly) find the carpet yanked out from underneath them,
and decide to halt the cluster to protect the data.
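
To make the suspected mechanism concrete, here is a minimal sketch of an idle timer of the kind described above. It is not taken from the o2net source: the class IdleWatchdog and its methods are hypothetical names, and the 10 second timeout is simply the figure reported in the trace below.

```python
class IdleWatchdog:
    """Hypothetical stand-in for an idle timer wrapped around a socket."""

    def __init__(self, idle_timeout_s, now):
        self.idle_timeout_s = idle_timeout_s
        self.last_rx = now  # time the last packet was recorded

    def packet_received(self, now):
        # Every received packet should push the idle deadline out again.
        self.last_rx = now

    def is_idle(self, now):
        # The connection is declared dead once nothing has been
        # recorded for idle_timeout_s seconds.
        return now - self.last_rx >= self.idle_timeout_s


wd = IdleWatchdog(idle_timeout_s=10, now=0.0)
wd.packet_received(now=4.0)
healthy = wd.is_idle(now=12.0)  # only 8 s since the last recorded packet

# Suspected failure: traffic arrives at t=13 but the bookkeeping is
# skipped, so the timer still measures from t=4 and fires at t=14.
stale = wd.is_idle(now=14.0)
```

The point of the sketch is that the decision depends only on the recorded timestamp, so any fault in updating or comparing it can kill a connection that is still carrying traffic.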
Highly annoying. My guess is that a signed 32-bit integer is
wrapping somewhere....
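
As an illustration of that guess only (nothing here is confirmed against the ocfs2 code), the sketch below emulates a signed 32-bit tick counter in Python and shows how a naive elapsed-time calculation goes wrong at the wrap point, while subtracting first and wrapping afterwards stays correct. The helper names are hypothetical.

```python
def to_int32(x):
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def naive_elapsed(now, last):
    # Wraps each timestamp separately, then subtracts: breaks as soon
    # as the counter crosses the signed 32-bit boundary.
    return to_int32(now) - to_int32(last)


def wrap_safe_elapsed(now, last):
    # Subtracts first, then wraps the difference: correct as long as
    # the real gap is smaller than 2**31 ticks.
    return to_int32(now - last)


last = 0x7FFFFFFF - 5  # 5 ticks before the signed counter wraps
now = last + 10        # 10 ticks later, past the wrap point

naive = naive_elapsed(now, last)     # hugely negative, not 10
safe = wrap_safe_elapsed(now, last)  # the true gap of 10 ticks
```

A timeout check built on the naive form would misjudge the idle period around the wrap, which is the kind of fault that shows up rarely and is hard to reproduce.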
Andy
On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
> Hi,
>
> We have 2 Dell 1850s in a cluster, both machines are running Red Hat
> Enterprise Linux 4 AS, update 2.
>
> The boxes are connected to a Dell EMC CX300 using Emulex HBAs.
>
> The cluster is running an Oracle 10gR2 std edition RAC.
>
> We are using ocfs2 to store files generated by our application and not
> to store anything to do with the database.
>
> We've been having a few problems where the servers appear to hang, and
> have to be shut down (using the power button) and then started up again.
> This seems to be happening every weekend and I don't really understand
> what's happening, or how to fix it.
>
> I've included an extract from messages in the hope someone can shed
> some light on the matter.
>
> Kind regards
>
> Andrew
>
> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
> idle for 10 seconds, shutting it down.
>
> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
> some times that might help debug the situation: (tmr 1158527154.993223
> now 1158527164.993090 dr 1158527154.993213 adv
> 1158527154.993227:1158527154.993228 func (101e0528:505)
> 1158527153.796194:1158527153.796200)
>
> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
> 10.1.1.110:7777
>
> Sep 17 22:06:04 argon2 kernel:
> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
>
> Sep 17 22:06:04 argon2 kernel:
> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>
> Sep 17 22:06:05 argon2 last message repeated 185 times
>
> Sep 17 22:06:05 argon2 kernel:
> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>
> Sep 17 22:06:05 argon2 last message repeated 154 times
>
> Sep 17 22:06:05 argon2 kernel:
> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
>
> Sep 17 22:06:05 argon2 last message repeated 123 times
>
> Sep 17 22:06:05 argon2 kernel:
> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>
> Sep 17 22:06:05 argon2 last message repeated 472 times
>
> Sep 17 22:06:05 argon2 kernel:
> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>
> Sep 17 22:06:08 argon2 last message repeated 3239 times
>
> Sep 17 22:06:08 argon2 kernel:
> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>
> Sep 17 22:06:08 argon2 last message repeated 118 times
>
> Sep 17 22:06:08 argon2 kernel:
> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>
> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
>
> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
>
> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
> started.
>
> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
> root=LABEL=/ apic rhgb quiet)
>
> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
> (bhcompile@hs20-bc1-2.build.redhat.com) (gcc version 3.4.4 20050721
> (Red Hat 3.4.4-2)) #1 SMP
>
> Andrew Brunton
> Senior Application Developer
> UK Fuels Limited
>
> Tel +44 (0)1270 655636
> Fax +44 (0)1270 655700
>
> andrew.brunton@ukfuels.co.uk
--
Andy Phillips, FRAS
Systems Architect, Information Systems.
Betfair.com
Direct Line: 0208 834 8436