I'm desperately looking for more ideas on how to debug what's going on with our CLVM cluster.

Background: 4-node "cluster" -- machines are Dell blades with Dell M6220/M6348 switches. The sole purpose of the Cluster Suite tools is to use CLVM against an iSCSI storage array.

Machines are running CentOS 5.8 with the Xen kernels. These blades host various VMs for a project. The iSCSI back-end storage holds the disk images for the VMs; the blades themselves run from local disk.

Each blade has 3 active networks:
-- Front-end, public net.
-- Back-end net (backups, database connections to external servers, cluster communication)
-- iSCSI net

Front and back nets are on Xen bridges and are available to/used by the VMs. The iSCSI net is only used by Dom0/the blades.

Originally got CLVM working by installing Luci/Ricci. The cluster config is dead simple:

<?xml version="1.0"?>
<cluster alias="Alliance Blades" config_version="6" name="Alliance Blades">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="calgb-blade1-mn.mayo.edu" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="calgb-blade2-mn.mayo.edu" nodeid="3" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="calgb-blade3-mn.mayo.edu" nodeid="4" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="calgb-blade4-mn.mayo.edu" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

All the basics are covered... LVM locking set to "Cluster", etc.

This all worked fine for a period (pre-5.8)... at some point, an update took a step backwards. I can bring up the entire cluster and it will work for about 10 minutes. During that time, I'll get the following on several of the nodes:

Nov 3 17:28:18 calgb-blade2 openais[7154]: [TOTEM] Retransmit List: fe
Nov 3 17:28:49 calgb-blade2 last message repeated 105 times
Nov 3 17:29:50 calgb-blade2 last message repeated 154 times
Nov 3 17:30:51 calgb-blade2 last message repeated 154 times
Nov 3 17:31:52 calgb-blade2 last message repeated 154 times
Nov 3 17:32:53 calgb-blade2 last message repeated 154 times
Nov 3 17:33:54 calgb-blade2 last message repeated 154 times
Nov 3 17:34:55 calgb-blade2 last message repeated 154 times
Nov 3 17:35:56 calgb-blade2 last message repeated 154 times
Nov 3 17:36:36 calgb-blade2 last message repeated 105 times
Nov 3 17:36:36 calgb-blade2 openais[7154]: [TOTEM] Retransmit List: fe ff
Nov 3 17:37:07 calgb-blade2 last message repeated 104 times
Nov 3 17:38:08 calgb-blade2 last message repeated 154 times
Nov 3 17:39:09 calgb-blade2 last message repeated 154 times
Nov 3 17:40:10 calgb-blade2 last message repeated 154 times
Nov 3 17:41:11 calgb-blade2 last message repeated 154 times
Nov 3 17:42:12 calgb-blade2 last message repeated 154 times
Nov 3 17:43:13 calgb-blade2 last message repeated 154 times
Nov 3 17:44:14 calgb-blade2 last message repeated 154 times
Nov 3 17:44:24 calgb-blade2 last message repeated 26 times

Around the 10-minute mark, one of the nodes (it is not always the same node) will do:

Nov 3 17:44:24 calgb-blade1 openais[7179]: [TOTEM] FAILED TO RECEIVE
Nov 3 17:44:24 calgb-blade1 openais[7179]: [TOTEM] entering GATHER state from 6.
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Creating commit token because I am the rep.
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Storing new sequence id for ring 34
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] entering COMMIT state.
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] entering RECOVERY state.
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] position [0] member 192.168.226.161:
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] previous ring seq 48 rep 192.168.226.161
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] aru fd high delivered fd received flag 1
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Did not need to originate any messages in recovery.
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] Sending initial ORF token
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] New Configuration:
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]      r(0) ip(192.168.226.161)
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Left:
Nov 3 17:44:34 calgb-blade1 clurgmgrd[9933]: <emerg> #1: Quorum Dissolved
Nov 3 17:44:34 calgb-blade1 kernel: dlm: closing connection to node 2
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]      r(0) ip(192.168.226.162)
Nov 3 17:44:34 calgb-blade1 kernel: dlm: closing connection to node 3
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]      r(0) ip(192.168.226.163)
Nov 3 17:44:34 calgb-blade1 kernel: dlm: closing connection to node 4
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]      r(0) ip(192.168.226.164)
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Joined:
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CMAN ] quorum lost, blocking activity
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] New Configuration:
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ]      r(0) ip(192.168.226.161)
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Left:
Nov 3 17:44:34 calgb-blade1 openais[7179]: [CLM  ] Members Joined:
Nov 3 17:44:34 calgb-blade1 openais[7179]: [SYNC ] This node is within the primary component and will provide service.
Nov 3 17:44:34 calgb-blade1 openais[7179]: [TOTEM] entering OPERATIONAL state.
...

When this happens, the node that lost connections kills CMAN. The other 3 nodes get into this state:

[root@calgb-blade2 ~]# group_tool
type             level name       id       state
fence            0     default    00010001 FAIL_START_WAIT
[1 2 3]
dlm              1     clvmd      00010003 FAIL_ALL_STOPPED
[1 2 3 4]
dlm              1     rgmanager  00020003 FAIL_ALL_STOPPED
[1 2 3 4]

One of the nodes will be barking about trying to fence the "failed" node (expected, as I don't have real fencing).

Nothing (physically) has actually failed. All 4 blades are running their VMs and accessing the shared back-end storage. No network burps, etc. CLVM is unusable and any LVM commands hang.

Here's where it gets really strange...

1. Go to the blade that has "failed" and shut down all the VMs (just to be safe -- they are all running fine).

2. Go to the node that tried to fence the failed node and run:
   fence_ack_manual -e -n <failed node>

3. The 3 "surviving" nodes are instantly fine. group_tool reports "none" for state and all CLVM commands work properly.

4. OK, now reboot the "failed" node. It reboots and rejoins the cluster. CLVM commands work, but are slow. Lots of these errors:
   openais[7154]: [TOTEM] Retransmit List: fe ff

5. This goes on for about 10 minutes and the whole cycle repeats (one of the other nodes will "fail"...).

What I've tried:

1. Switches have IGMP snooping disabled. This is a simple config, so no switch-to-switch multicast is needed (all cluster/multicast traffic stays on the blade enclosure switch).
   I've had the cluster messages use the front-end net and the back-end net (different switch model) -- no change in behavior.

2. Running the latest RH-patched openais from their Beta channel: openais-0.80.6-37.el5 (yes, I actually have RH licenses, I just prefer CentOS for various reasons).

3. Tested multicast by enabling multicast/ICMP and running multicast pings. Ran with no data loss for > 10 minutes. (Per the IBM tech web site article -- where the end of the article says it's almost always a network problem.)

4. Tried various configuration changes such as defining two rings. No change (or it got worse -- the dual-ring config triggers a kernel bug).

I've read about every article I can find... they usually fall into two camps:

1. Multiple years old, so no idea whether they're accurate with today's versions of the software.

2. People trying to do 2-node clusters and wondering why they lose quorum.

My feeling is that this is an openais issue -- the ring gets out of sync and can't fix itself. This system needs to go "production" in a matter of weeks, so I'm not sure I want to dive into some sort of custom-compiled CLVM/Corosync/Pacemaker config. Since this software has been around for years, I believe it's something simple I'm just missing.

Thanks for any help/ideas...

--- Cris

--
Cristopher J. Rhea
Mayo Clinic - Research Computing Facility
200 First St SW, Rochester, MN 55905
crhea@Mayo.EDU
(507) 284-0587
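(For reference, the "LVM locking set to Cluster" mentioned above corresponds to the clustered-locking settings in /etc/lvm/lvm.conf on each node. A minimal sketch of what that usually looks like on EL5 -- the exact file contents on these blades may differ:

    # /etc/lvm/lvm.conf -- relevant fragment
    # locking_type 3 = built-in clustered locking via clvmd/DLM
    locking_type = 3
    # don't silently fall back to local locking if clvmd is unreachable
    fallback_to_local_locking = 0

On RHEL/CentOS 5 this is normally set with "lvmconf --enable-cluster", and clvmd must be running on every node for cluster-wide LVM commands to work.)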
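(The multicast ping test described in item 3 can be reproduced roughly like this -- a sketch only; the interface name and the 239.192.x.x group address below are illustrative, since cman derives the actual address from the cluster name:

    # On every node, allow replies to ICMP echoes sent to multicast addresses:
    sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0

    # Find the multicast address the cluster is actually using:
    cman_tool status | grep -i multicast

    # From one node, ping that group on the cluster interface and watch
    # whether all peers answer and whether replies ever drop out:
    ping -I eth1 239.192.0.1

A test like this confirms basic multicast delivery, but it won't necessarily show intermittent pruning or latency spikes that only hit the cluster's own traffic.)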
On 11/04/2012 10:48 AM, Cris Rhea wrote:
> One of the nodes will be barking about trying to fence the "failed" node
> (expected, as I don't have real fencing).

This is your problem. Without fencing, DLM (which is required for clustered LVM, GFS2 and rgmanager) is designed to block when a fence is called and stay blocked until the fence succeeds. Why it was called is secondary; even if a human calls the fence and everything is otherwise working fine, the cluster will hang. This is by design: "A hung cluster is better than a corrupt cluster."

> Nothing (physically) has actually failed. All 4 blades are running their
> VMs and accessing the shared back-end storage. No network burps, etc.
> CLVM is unusable and any LVM commands hang.
>
> Here's where it gets really strange...
>
> 1. Go to the blade that has "failed" and shut down all the VMs (just to be
> safe -- they are all running fine).
>
> 2. Go to the node that tried to fence the failed node and run:
>    fence_ack_manual -e -n <failed node>
>
> 3. The 3 "surviving" nodes are instantly fine. group_tool reports "none" for
> state and all CLVM commands work properly.
>
> 4. OK, now reboot the "failed" node. It reboots and rejoins the cluster.
> CLVM commands work, but are slow. Lots of these errors:
>    openais[7154]: [TOTEM] Retransmit List: fe ff

The only time I've seen this happen is when something starves a node (slow network, loaded CPU, insufficient RAM...).

> 5. This goes on for about 10 minutes and the whole cycle repeats (one of
> the other nodes will "fail"...).

If a totem packet fails to return from a node within a set period of time, more than a set number of times in a row, the node is declared lost and a fence action is initiated.

> What I've tried:
>
> 1. Switches have IGMP snooping disabled. This is a simple config, so
> no switch-to-switch multicast is needed (all cluster/multicast traffic
> stays on the blade enclosure switch). I've had the cluster
> messages use the front-end net and the back-end net (different switch
> model) -- no change in behavior.

Are the multicast groups static? Is STP disabled?

> 2. Running the latest RH-patched openais from their Beta channel:
> openais-0.80.6-37.el5 (yes, I actually have RH licenses, I just prefer
> CentOS for various reasons).
>
> 3. Tested multicast by enabling multicast/ICMP and running multicast
> pings. Ran with no data loss for > 10 minutes. (Per the IBM tech web site
> article -- where the end of the article says it's almost always a network
> problem.)

I'm leaning toward a network problem, too.

> 4. Tried various configuration changes such as defining two rings. No change
> (or it got worse -- the dual-ring config triggers a kernel bug).

I don't think RRP worked in EL5... Maybe it does now?

> I've read about every article I can find... they usually fall into
> two camps:
>
> 1. Multiple years old, so no idea whether they're accurate with today's versions of
> the software.
>
> 2. People trying to do 2-node clusters and wondering why they lose quorum.
>
> My feeling is that this is an openais issue -- the ring gets out of sync
> and can't fix itself. This system needs to go "production" in a matter of
> weeks, so I'm not sure I want to dive into some sort of
> custom-compiled CLVM/Corosync/Pacemaker config. Since this software
> has been around for years, I believe it's something simple I'm just missing.
>
> Thanks for any help/ideas...
>
> --- Cris

First and foremost: get fencing working. At the very least, a lost node will reboot and the cluster will recover as designed.
It's amazing how many problems "just go away" once fencing is properly configured. Please read this:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing

It's for EL6, but it applies to EL5 exactly the same (corosync being the functional replacement for openais).

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
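(For Dell blades, the usual way to get real fencing is to point a power-fencing agent such as fence_ipmilan, or the DRAC agent, at each blade's management controller. A hypothetical cluster.conf fragment along those lines -- the IP addresses, credentials, and choice of agent are made up and depend on the actual hardware:

    <clusternode name="calgb-blade1-mn.mayo.edu" nodeid="1" votes="1">
            <fence>
                    <method name="1">
                            <device name="blade1-drac"/>
                    </method>
            </fence>
    </clusternode>
    ...
    <fencedevices>
            <fencedevice agent="fence_ipmilan" name="blade1-drac"
                         ipaddr="10.0.0.11" login="root" passwd="secret" lanplus="1"/>
    </fencedevices>

With a device like this defined per node, a lost node gets power-cycled automatically and DLM unblocks on its own instead of waiting for fence_ack_manual.)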
> One of the nodes will be barking about trying to fence the "failed" node
> (expected, as I don't have real fencing).

Have you tried the Red Hat cluster mailing list?

Out of curiosity, how come you aren't using fencing? That will probably be one of the first things that gets critiqued. At a minimum, you could at least use the ifmib agent, I would think?

jlc
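(The ifmib agent mentioned here, fence_ifmib, fences a node by administratively downing its switch port over SNMP, so it needs no power or IPMI access. A rough, hypothetical sketch of how it might look in cluster.conf -- the attribute names and values are illustrative, and the exact options supported should be checked against the fence_ifmib man page for the installed version:

    <fencedevice agent="fence_ifmib" name="m6348-switch"
                 ipaddr="10.0.0.250" community="private" snmp_version="2c"/>
    ...
    <method name="1">
            <device name="m6348-switch" port="17"/>
    </method>

Fencing by cutting a port is only safe if that port carries every path the node has to the shared storage; otherwise the "fenced" node can still write to the iSCSI array.)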
On 11/05/2012 02:04 PM, Cris Rhea wrote:
> On Sun, Nov 04, 2012 at 11:59:08AM -0500, Digimer wrote:
>> On 11/04/2012 10:48 AM, Cris Rhea wrote:
>>> One of the nodes will be barking about trying to fence the "failed" node
>>> (expected, as I don't have real fencing).
>>
>> This is your problem. Without fencing, DLM (which is required for
>> clustered LVM, GFS2 and rgmanager) is designed to block when a fence is
>> called and stay blocked until the fence succeeds. Why it was called is
>> secondary; even if a human calls the fence and everything is otherwise
>> working fine, the cluster will hang.
>>
>> This is by design: "A hung cluster is better than a corrupt cluster."
>
> I understand what you're saying, but I've got three concerns:
>
> 1. Yes, I believe DLM is acting appropriately. It does "hang everything"
> until fencing succeeds. If I manually fence the node (fence_ack_manual),
> the remaining 3 nodes are fine.
>
> 2. As a practical matter, I cannot enable real fencing at this point.
> These nodes are running VMs that are doing real stuff. Fencing a "failed
> node" would dump the VMs. While I understand doing that in a real
> failure situation, I have no indication that anything is wrong (other than
> aisexec/TOTEM issues).
>
> 3. At this point, this is NOT an HA cluster -- so I don't have VMs defined
> as resources that need to be running someplace. All I'm trying to achieve
> is to use CLVM (reliably) across a set of nodes.

As far as the cluster is concerned, it is HA. The cluster does not understand the concept of unimportant things; it treats everything as critical. If you need CLVM (or anything else related to the cluster), then either make the VMs HA resources or move them off. Until you use real fencing, you will have problems. That said, some won't appear until it's too late if you try to avoid this.

In short: use real fencing, period. Nothing else is supported or safe.

>>> 4. OK, now reboot the "failed" node. It reboots and rejoins the cluster.
>>> CLVM commands work, but are slow. Lots of these errors:
>>> openais[7154]: [TOTEM] Retransmit List: fe ff
>>
>> The only time I've seen this happen is when something starves a node (slow
>> network, loaded CPU, insufficient RAM...).
>
> What methods would you use to pin this down? From my perspective, the
> machines have enough RAM (large blades with fairly small VMs), decent CPU
> (I can be logged into the node via SSH while this is happening and don't
> see a performance issue), and no network "glitches" (the aisexec failure
> happens well after the machine has booted and come on-line on the network).

This requires the assistance of the devs/advanced support people. If you have Red Hat support, please call them.

>>> 5. This goes on for about 10 minutes and the whole cycle repeats (one of
>>> the other nodes will "fail"...).
>>
>> If a totem packet fails to return from a node within a set period of
>> time, more than a set number of times in a row, the node is declared lost
>> and a fence action is initiated.
>
> Yup, got that. How can I debug this further? I have no indication (other
> than aisexec) that anything is wrong.

Same as the above comment.

>>> 1. Switches have IGMP snooping disabled. This is a simple config, so
>>> no switch-to-switch multicast is needed (all cluster/multicast traffic
>>> stays on the blade enclosure switch). I've had the cluster
>>> messages use the front-end net and the back-end net (different switch
>>> model) -- no change in behavior.
>>
>> Are the multicast groups static? Is STP disabled?
>
> I'm not a Cisco command guru (so please provide real commands in any hints).

Nor am I. Whenever I hear "Cisco", it's along with problems caused by Cisco doing non-standard things.

> I have not defined anything in the switch for MC groups. In speaking with
> my local network team, turning off IGMP snooping should allow full/unlimited
> MC within that switch. Again, only a single switch is involved, so no
> switch-to-switch configs are needed for MC.

As I understand it, Cisco periodically purges multicast groups, forcing machines to resubscribe, as a way to clean out disused groups. This breaks the cluster comms. That is just one example of what might be happening. The take-away is that latency must remain below 2 ms and multicast messages must never be interrupted. Your network people should be able to interpret that further.

> Do I need to do something else within the switch or cluster configs to
> aid MC? All the nodes are using the default MC address/port.

Set a static multicast group, for one.

> STP is currently enabled (set to pvst).

This can cause problems. When STP tries to find loops, it can block a port or ports. This can break the cluster as well. Disable switch-wide STP and only enable it on outward-facing ports (if any at all).

>>> 3. Tested multicast by enabling multicast/ICMP and running multicast
>>> pings. Ran with no data loss for > 10 minutes. (Per the IBM tech web site
>>> article -- where the end of the article says it's almost always a network
>>> problem.)
>>
>> I'm leaning toward a network problem, too.
>
> What else would you use to pinpoint the problem? A Dell M6220 switch
> seems to be a Cisco clone, so any config suggestions are welcome.

Again, I am not a Cisco user. Ask your network engineer(s) and/or Red Hat for help specific to your environment.

>> I don't think RRP worked in EL5... Maybe it does now?
>
> Good to know. As of 5.8, RRP doesn't seem to work for me (kernel faults).
>
>> First and foremost: get fencing working. At the very least, a lost node
>> will reboot and the cluster will recover as designed. It's amazing how
>> many problems "just go away" once fencing is properly configured. Please
>> read this:
>>
>> https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing
>
> Sure... makes perfect sense if one has REAL node failures. In my case, all
> I'd have is a set of VMs crashing (due to fencing) every 10 minutes. The people
> using those VMs wouldn't be very happy with me.... :)

No, it makes sense whenever a node loses connection. The goal of fencing is to ensure that two nodes don't both try to provide HA services. If they can't talk to each other, then the *only* way to ensure that a node is the only one providing services is to fence the lost node. It does not matter at all that the rest of the machine is otherwise healthy.

> (I've done AIX and HP HA clusters before, so I understand the need/purpose
> for fencing failed nodes. In this case, I have real users using the VMs
> on the nodes -- the only thing that appears to fail is aisexec's communication.)

Then put your VMs under HA control. If the host node is lost, they will restart on a healthy node. It's the *only* safe option.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
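(On the switch side, the suggestions above -- keep STP from ever blocking the node-facing ports and keep multicast group membership from being pruned -- might translate into something roughly like the following. This is IOS-style syntax as a sketch only; the Dell M6220/M6348 PowerConnect CLI is similar but not identical, so the exact commands need to be checked against Dell's documentation:

    ! On each port facing a blade, skip STP listening/learning so the port
    ! never blocks while the cluster is exchanging totem traffic:
    interface GigabitEthernet1/0/1
     spanning-tree portfast

    ! If IGMP snooping is ever re-enabled, make sure a querier exists on the
    ! VLAN so group membership does not time out underneath the cluster:
    ip igmp snooping querier

With snooping disabled, as in this setup, multicast is simply flooded within the VLAN, which sidesteps the pruning problem at the cost of some extra traffic.)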
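(Putting a Xen guest "under HA control" in rgmanager terms is, at its simplest, a vm resource in the <rm> section of cluster.conf. A hypothetical sketch -- the guest name and config path are made up, and the exact attributes supported vary by RHEL 5 minor release:

    <rm>
            <failoverdomains/>
            <resources/>
            <vm name="guest01" path="/etc/xen" autostart="1"
                recovery="restart" migrate="live"/>
    </rm>

With the VMs defined this way, fencing a lost node is no longer "dumping" its guests; rgmanager restarts them on a surviving node.)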