Ryan Bunce
2011-Jul-01 19:54 UTC
[CentOS] Cluster Failover Troubleshooting (luci and ricci)
Hello all. I posted this in the forum and was told to instead post it to the mailing list. My apologies for the redundancy if you have already seen and been irritated by my blatherings. Thanks.

_________________________

I am working on a CentOS clustered LAMP stack and running into problems. I have searched extensively and have come up empty.

Here's my setup:

Two-node cluster, identical hardware: IBM x226 with RSAII adapters for fencing.
Configured for Active/Passive failover - no load balancing.
No shared storage - manual rsync of data (shared SSH keys, rsync over SSH, cron job; a sketch of the sync job is at the end of this message).
Single shared IP address.

I used luci and ricci to configure the cluster. It's a bit confusing that there's an 'apache' script but you have to use the custom init script. I'm past that, though.

Failover works when it's kicked off manually from the luci web interface: I can tell it to transfer the services (IP, httpd, mysqld) to the secondary server and it works fine. I run into problems when I attempt to simulate a failure (a pulled network cord, for instance). The primary system recognizes the failure, shuts down its services, and attempts to inform the secondary server to take over, but the secondary never does.

Here is a log excerpt from a cable-pull test:

Jun 16 15:33:27 flex kernel: tg3: eth0: Link is down.
Jun 16 15:33:34 flex clurgmgrd: [2970]: <warning> Link for eth0: Not detected
Jun 16 15:33:34 flex clurgmgrd: [2970]: <warning> No link on eth0...
Jun 16 15:33:34 flex clurgmgrd[2970]: <notice> status on ip "10.6.2.25" returned 1 (generic error)
Jun 16 15:33:34 flex clurgmgrd[2970]: <notice> Stopping service service:web
Jun 16 15:33:35 flex proftpd[6321]: 10.6.2.47 - ProFTPD killed (signal 15)
Jun 16 15:33:35 flex proftpd[6321]: 10.6.2.47 - ProFTPD 1.3.3c standalone mode SHUTDOWN
Jun 16 15:33:39 flex avahi-daemon[2850]: Withdrawing address record for 10.6.2.25 on eth0.
Jun 16 15:33:49 flex clurgmgrd[2970]: <notice> Service service:web is recovering
Jun 16 15:33:49 flex clurgmgrd[2970]: <notice> Recovering failed service service:web
Jun 16 15:33:49 flex clurgmgrd: [2970]: <warning> Link for eth0: Not detected
Jun 16 15:33:49 flex clurgmgrd[2970]: <notice> start on ip "10.6.2.25" returned 1 (generic error)
Jun 16 15:33:49 flex clurgmgrd[2970]: <warning> #68: Failed to start service:web; return value: 1
Jun 16 15:33:49 flex clurgmgrd[2970]: <notice> Stopping service service:web
Jun 16 15:33:49 flex clurgmgrd: [2970]: <err> script:mysqld: stop of /etc/rc.d/init.d/mysqld failed (returned 1)
Jun 16 15:33:49 flex clurgmgrd[2970]: <notice> stop on script "mysqld" returned 1 (generic error)
Jun 16 15:33:49 flex clurgmgrd[2970]: <crit> #12: RG service:web failed to stop; intervention required
Jun 16 15:33:49 flex clurgmgrd[2970]: <notice> Service service:web is failed
Jun 16 15:33:49 flex clurgmgrd[2970]: <crit> #13: Service service:web failed to stop cleanly
Jun 16 15:36:43 flex kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Jun 16 15:36:43 flex kernel: tg3: eth0: Flow control is off for TX and off for RX.
Jun 16 16:04:52 flex luci[2904]: Unable to retrieve batch 306226694 status from web2:11111: Unable to disable failed service web before starting it:clusvcadm failed to stop web:
Jun 16 16:05:28 flex clurgmgrd[2970]: <notice> Starting disabled service service:web
Jun 16 16:05:31 flex avahi-daemon[2850]: Registering new address record for 10.6.2.25 on eth0.
Jun 16 16:05:31 flex luci[2904]: Unable to retrieve batch 1997354692 status from web2:11111: module scheduled for execution
Jun 16 16:05:33 flex proftpd[1926]: 10.6.2.47 - ProFTPD 1.3.3c (maint) (built Thu Nov 18 2010 03:38:57 CET) standalone mode STARTUP
Jun 16 16:05:33 flex clurgmgrd[2970]: <notice> Service service:web started

I have followed the HowTos for setting up the cluster (with the exception of the shared storage) as closely as possible. Here's what I've already ruled out:

No iptables running.
No SELinux running.
The hosts file resolves all IP addresses/host names properly.

I must say that I am less familiar with how all of the cluster components work together. All of the Linux clusters I have built thus far have been heartbeat+mon style clusters.

I'm looking to find out if there is an additional debug layer I can put in place to get more detailed information about what is transacting (or not) between the two cluster members.

Many thanks.
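For reference, a minimal sketch of the kind of cron-driven rsync-over-SSH sync mentioned in the setup list above; the paths, host name, and schedule are illustrative placeholders, not the exact production job:

    #!/bin/bash
    # /usr/local/bin/web-sync.sh (hypothetical path) -- push the web root
    # from the active node to the passive node over SSH, using the
    # pre-shared key so cron is never blocked by a password prompt.
    rsync -az --delete -e "ssh -i /root/.ssh/id_rsa" /var/www/ web2:/var/www/

    # Matching crontab entry (/etc/cron.d/web-sync), every 15 minutes as root:
    */15 * * * * root /usr/local/bin/web-sync.sh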
Ljubomir Ljubojevic
2011-Jul-01 20:51 UTC
[CentOS] Cluster Failover Troubleshooting (luci and ricci)
Ryan Bunce wrote:
> I must say that I am less familiar with how all of the cluster
> components work together. All of the Linux clusters I have built thus
> far have been heartbeat+mon style clusters.
>
> I'm looking to find out if there is an additional debug layer I can
> put in place to get more detailed information about what is
> transacting (or not) between the two cluster members.
>
> Many thanks.

I have never installed or used any Conga/luci/ricci system. But as far as I know and understand, you need a way for the failing server to warn the rest of the nodes, and your log says that warning failed. Some failover systems need a separate network connected to the shared file systems, so that when eth0 is not working, the main node can use eth1 (2, 3, 4) to report the event to all other nodes. What comes to mind is that the IPs set for the interconnect (in the luci config) must not be the public IPs but those of that separate/secondary network, so the main node can still contact the rest of the nodes.

I hope this helps.

Ljubomir
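If the standard cluster tools are installed, something like this should show which addresses the nodes are actually talking on (I have not used Conga myself, so treat this as a pointer rather than a recipe):

    # List cluster members; with -a (where supported) cman_tool also
    # prints the address each node communicates on -- it should be on the
    # private interconnect, not the public LAN.
    cman_tool nodes -a

    # Overall cluster state: quorum, votes, this node's name.
    cman_tool status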
m.roth at 5-cent.us
2011-Jul-01 21:25 UTC
[CentOS] Cluster Failover Troubleshooting (luci and ricci)
Ryan Bunce wrote:
<snip>
> I am working on a CentOS clustered LAMP stack and running into problems. I
> have searched extensively and have come up empty.
>
> Here's my setup:
>
> Two-node cluster, identical hardware: IBM x226 with RSAII adapters for
> fencing.
> Configured for Active/Passive failover - no load balancing.
> No shared storage - manual rsync of data (shared SSH keys, rsync over SSH,
> cron job).
> Single shared IP address.
>
> I used luci and ricci to configure the cluster. It's a bit confusing that
> there's an 'apache' script but you have to use the custom init script. I'm
> past that, though.
<snip>

I'm not sure if either of them installs heartbeat.

mark
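Easy enough to check from the RPM database:

    # On a stock Conga setup, heartbeat should come back "not installed"
    # while the cluster suite packages are present.
    rpm -q heartbeat luci ricci cman rgmanager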
Ryan Bunce
2011-Jul-06 17:59 UTC
[CentOS] Cluster Failover Troubleshooting (luci and ricci)
Ljubomir Ljubojevic wrote:
> I have never installed or used any Conga/luci/ricci system. But as far as
> I know and understand, you need a way for the failing server to warn the
> rest of the nodes, and your log says that warning failed. Some failover
> systems need a separate network connected to the shared file systems, so
> that when eth0 is not working, the main node can use eth1 (2, 3, 4) to
> report the event to all other nodes. What comes to mind is that the IPs
> set for the interconnect (in the luci config) must not be the public IPs
> but those of that separate/secondary network, so the main node can still
> contact the rest of the nodes.
>
> I hope this helps.

Ljubomir,

Thank you for your reply. I do have a secondary NIC providing the communication between the cluster nodes.

I set this up by creating host entries in the /etc/hosts file and pointing those entries to the IP addresses assigned to the NICs connected via x-over cable. I then created the cluster using the names specified in the hosts file.

I've done some network sniffing on the NICs connected with the x-over cable and there's clearly constant communication between the two boxes. This leads me to conclude that the cluster communication is both working and moving over the channel I intended.

Thanks for the input. Let me know if you have any other suggestions.

Ryan
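Roughly, the hosts entries look like this (the names and addresses below are illustrative placeholders, not the production values):

    # /etc/hosts (identical on both nodes) -- the -hb names sit on the
    # 172.2.2.x crossover link and are the names the cluster was created
    # with; the plain names sit on the 10.6.2.x LAN.
    10.6.2.40     web1           # public interface (placeholder address)
    10.6.2.41     web2           # public interface (placeholder address)
    172.2.2.1     web1-hb        # interconnect over the x-over cable
    172.2.2.2     web2-hb        # interconnect over the x-over cable

The sniffing was nothing fancy, just tcpdump on the crossover NIC (something like tcpdump -i eth1 udp port 5404 or udp port 5405, the ports the openais totem traffic normally uses).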
Ryan Bunce
2011-Jul-06 19:11 UTC
[CentOS] Cluster Failover Troubleshooting (luci and ricci)
m.roth wrote:
> I'm not sure if either of them installs heartbeat.

Indeed they do not. Heartbeat is what I have been using for CentOS 4. I thought I'd give the new cluster system a go since it's what is included with CentOS 5.

Thanks,
Ryan
Ljubomir Ljubojevic
2011-Jul-06 20:36 UTC
[CentOS] Cluster Failover Troubleshooting (luci and ricci)
Ryan Bunce wrote:
> Thank you for your reply. I do have a secondary NIC providing the
> communication between the cluster nodes.
>
> I set this up by creating host entries in the /etc/hosts file and
> pointing those entries to the IP addresses assigned to the NICs
> connected via x-over cable. I then created the cluster using the names
> specified in the hosts file.
>
> I've done some network sniffing on the NICs connected with the x-over
> cable and there's clearly constant communication between the two
> boxes. This leads me to conclude that the cluster communication is both
> working and moving over the channel I intended.
>
> Thanks for the input. Let me know if you have any other suggestions.

You should provide your complete network setup (IPs, routes, DNS records) for both systems; maybe someone else can find the error.

Ljubomir
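For example, run on both nodes (standard tools, nothing cluster-specific):

    ifconfig -a            # interfaces and addresses
    route -n               # kernel routing table
    cat /etc/hosts         # local name resolution the cluster relies on
    cat /etc/resolv.conf   # DNS resolver configuration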
Ryan Bunce
2011-Jul-07 15:50 UTC
[CentOS] Cluster Failover Troubleshooting (luci and ricci)
Ljubomir Ljubojevic wrote:
> You should provide your complete network setup (IPs, routes, DNS records)
> for both systems; maybe someone else can find the error.

Original posting on mailing list:
http://lists.centos.org/pipermail/centos/2011-July/113454.html

Here is my network configuration:

                         |------ 10.6.2.x -------- Web1 ------- 172.2.2.x ---|
    Firewall ----- Switch -----------virt shared IP 10.6.2.42                |  <- x-over link
                         |------ 10.6.2.x -------- Web2 ------- 172.2.2.x ---|

I'm currently using internal DNS for testing, so no public DNS has been registered. I use the hosts file for the backend (172.x.x.x) network resolution. Routing is pretty simple; I don't think there's a problem with that.

I think the piece I need most is some info on how to turn up the logging to provide additional debug information. If someone knows how to do that, I'm sure I could come up with more information to troubleshoot on my own.

Thanks again.

Ryan
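For anyone who finds this thread later: one commonly cited way to get more verbose rgmanager output on CentOS 5, going by the Red Hat cluster documentation (not verified on this particular setup), is to raise the log level on the rm section of /etc/cluster/cluster.conf:

    <!-- /etc/cluster/cluster.conf excerpt (illustrative): log_level="7"
         asks rgmanager for debug-level messages (the default is 4) and
         log_facility picks the syslog facility they are sent to.
         Remember to bump config_version and propagate the change, e.g.
         with: ccs_tool update /etc/cluster/cluster.conf -->
    <rm log_level="7" log_facility="local4">
        ...existing failoverdomains and service definitions...
    </rm>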