lingu
2008-Nov-12 13:47 UTC
[CentOS] Cluster Broken Pipe error and Heartbeat configuration
Hi,

I am running a two-node active/passive cluster on the RHEL3U8 64-bit operating system for my Oracle database. Both nodes are connected to HP MSA-500 storage (SCSI, not Fibre Channel). My hardware and clumanager version details are below. The cluster ran fine and was stable for the last two years, but for the past month I have been getting the errors below in syslog, and the cluster has been restarting locally.

Server Hardware: HP ProLiant DL580 G4
OS: RHEL3U8-64BIT INTEL EMT
Kernel: 2.4.21-47.EL
Storage: HP MSA-500 storage (SCSI channel)
Cluster Version: clumanager-1.2.26.1-1, redhat-config-cluster-1.0.7-1
NODE1 ip: 20.2.135.161 (network bonding configured)
NODE2 ip: 20.2.135.162 (network bonding configured)
VIP: 20.2.135.35

Syslog errors:

cluquorumd[1921]: <warning> Disk-TB: Detected I/O Hang!
clulockd[1996]: <warning> Potential recursive lock #0 grant to member #1, PID1962
clulockd[1996]: <warning> Denied 20.1.135.162: Broken pipe
clulockd[1996]: <err> select error: Broken pipe
clulockd[1996]: <warning> Denied 20.1.135.162: Broken pipe
clulockd[1996]: <err> select error: Broken pipe
cluquorumd[1921]: <warning> Disk-TB: Detected I/O Hang!
clulockd[1996]: <warning> Denied 20.1.135.161: Broken pipe
clulockd[1996]: <err> select error: Broken pipe
clusvcmgrd[2011]: <err> Unable to obtain cluster lock: Connection timed out
cluquorumd[2100]: <err> VF: Abort: Invalid header in reply from member #0
cluquorumd[1934]: <err> __msg_send: Incomplete write to 13. Error: Connection reset by peer

Can anyone explain what the above errors indicate and how to troubleshoot them? After a long Google search I found the link below from Red Hat, which matches my scenario. Can I follow the same steps, given that this is a very critical production server?
https://bugzilla.redhat.com/show_bug.cgi?id=185484

Also, can anyone help me configure a dedicated LAN (for example eth3) as the heartbeat, that is, a private point-to-point crossover-cable network for cluster communications? I don't want the heartbeat on the public LAN because of heavy network saturation. I did not find any suitable document for RHEL covering this heartbeat configuration. Can anyone provide a suitable link, or tell me which changes I have to make in my existing cluster.xml file for this private heartbeat configuration to work?

Waiting for a reply; this is urgent for me.

Regards,
Lingu
lingu wrote:
> Can any one guide me what is this above error indicates and how to
> troubleshoot.After a long google search i found the below link from
> redhat that is matching my scenario.Can i follow the same because it
> is my very critical production server.

I suggest you contact Red Hat support for this issue, since it's such a critical server and it sounds like a pretty fragile situation. That's what they are there for. And you're running a really old version of RH.

If it were me, I would upgrade the system to Fibre Channel instead of SCSI and apply all the latest patches for your version of RH. The bug mentions that using SCSI-attached storage as your shared storage medium is not at all proven reliable. For at least some MSAs you can get a Fibre Channel head unit, a few HBAs, and perhaps a switch, and hook things up without too much downtime, ending up with a better system as a result.

nate
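
Before opening a support case it may help to summarize which daemons are logging what, for example to check how often the clulockd "Broken pipe" errors appear alongside the cluquorumd "Detected I/O Hang!" warnings. The following is only a minimal sketch, assuming each syslog line has the shape `daemon[pid]: <level> message` seen in the excerpt from the original post; the `summarize` helper and the `sample` list are hypothetical names used for illustration and are not part of clumanager.

```python
import re
from collections import Counter

# Assumed line shape, taken from the posted excerpt: daemon[pid]: <level> message
LINE_RE = re.compile(
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\]: <(?P<level>\w+)> (?P<msg>.*)"
)


def summarize(lines):
    """Count messages per (daemon, level) pair; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("daemon"), match.group("level"))] += 1
    return counts


# A few lines copied from the syslog excerpt in the original post.
sample = [
    "cluquorumd[1921]: <warning> Disk-TB: Detected I/O Hang!",
    "clulockd[1996]: <warning> Denied 20.1.135.162: Broken pipe",
    "clulockd[1996]: <err> select error: Broken pipe",
    "clusvcmgrd[2011]: <err> Unable to obtain cluster lock: Connection timed out",
]

if __name__ == "__main__":
    for (daemon, level), count in sorted(summarize(sample).items()):
        print(f"{daemon:12s} {level:8s} {count}")
```

In practice the lines would come from the system log file rather than a hard-coded list; the counts only show which daemons complain most often and do not by themselves identify the root cause.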