thr3ads.net - CentOS - [CentOS] Pacemaker bugs? [Nov 2016]

Andreas Haumer

2016-Nov-25 10:30 UTC

[CentOS] Pacemaker bugs?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi!

I think I stumbled on at least two bugs in the CentOS 7.2 pacemaker package,
though I'm not quite sure if or where to report them.

I'm using the following package to set up a 2-node active/passive cluster:

[root@clnode1 ~]# rpm -q pacemaker
pacemaker-1.1.13-10.el7_2.4.x86_64

The installation is up-to-date on both nodes as of the current PIT.

I currently have the following cluster resources running:

[root@clnode2 ~]# pcs status
Cluster name: rucluster1
Last updated: Fri Nov 25 11:26:51 2016          Last change: Fri Nov 25 10:51:32 2016 by root via cibadmin on clnode1
Stack: corosync
Current DC: clnode2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
2 nodes and 12 resources configured

Online: [ clnode1 clnode2 ]

Full list of resources:

 p_ip_cluster   (ocf::heartbeat:IPaddr2):       Started clnode2
 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
     Masters: [ clnode2 ]
     Slaves: [ clnode1 ]
 p_fs_drbd1     (ocf::heartbeat:Filesystem):    Started clnode2
 p_apache       (ocf::heartbeat:apache):        Started clnode2
 p_dhcpd        (ocf::heartbeat:dhcpd): Started clnode2
 p_named        (ocf::heartbeat:named): Started clnode2
 p_slapd        (ocf::heartbeat:slapd): Started clnode2
 p_postgres     (ocf::heartbeat:pgsql): Started clnode2
 p_nmb  (systemd:nmb):  Started clnode2
 p_smb  (systemd:smb):  Started clnode2
 p_winbind      (systemd:winbind):      Started clnode2

PCSD Status:
  clnode1: Online
  clnode2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


The first bug is rather serious, though a workaround exists!

The cluster works fine, but as soon as I add a cluster resource of
class "service", the cluster manager software wreaks havoc on node
failover. In that situation, the lrmd process hangs in an infinite
loop (neither strace nor ltrace shows any output, so it seems to be
an internal loop without any system or library calls) and almost any
call to the cluster manager software (crmsh or pcs) runs into a timeout.
It's quite hard to recover the whole cluster from this situation.

When I replace the resource class "service" with resource class
"systemd", everything seems to work just fine.
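
Before applying this workaround it may help to check which resources still use the "service" class. What follows is a minimal sketch, not part of the original report: a small Python helper that scans saved `pcs status` text for resource lines and returns those whose agent spec starts with `service:`. The function name, the sample resource `p_example`, and the assumption that a service-class resource is displayed as `(service:<name>)` are illustrative only.

```python
import re

# Matches resource lines in `pcs status` output such as:
#  p_nmb  (systemd:nmb):  Started clnode2
RESOURCE_LINE = re.compile(r"^\s*(\S+)\s+\(([^)]+)\):\s+(\S+)")

def find_service_class_resources(status_text):
    """Return names of resources whose agent spec uses the 'service' class."""
    names = []
    for line in status_text.splitlines():
        m = RESOURCE_LINE.match(line)
        # The resource class is the part of the spec before the first colon.
        if m and m.group(2).split(":", 1)[0] == "service":
            names.append(m.group(1))
    return names

# Hypothetical sample: p_example is not a resource from the original cluster.
sample = """\
 p_ip_cluster   (ocf::heartbeat:IPaddr2):       Started clnode2
 p_nmb  (systemd:nmb):  Started clnode2
 p_example      (service:smb):  Started clnode2
"""
print(find_service_class_resources(sample))
```

Any resource the helper reports would be a candidate for switching to the "systemd" class as described above.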

I found a rather old, already closed bug for Fedora which looks similar:

<https://bugzilla.redhat.com/show_bug.cgi?id=1117151>


Another bug seems to be rather minor: I see the following assertions in the
corosync logs:

Nov 25 11:13:56 [3206] clnode1       crmd:    error: crm_abort: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE

They seem to be related to the drbd resource, but do not appear to cause any
functional problem.
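
To judge how often such assertions occur, the log can be summarized by assertion site. This is a minimal sketch under stated assumptions: each message occupies a single line in the actual log file (the line breaks in this mail come from wrapping), and the message layout matches the example above. The function name `count_asserts` is hypothetical.

```python
import re
from collections import Counter

# Captures the reporting function, the file:line site, and the failed expression
# from lines like "... crm_abort: pcmkRegisterNode: Triggered assert at xml.c:594 : <expr>".
ASSERT_RE = re.compile(r"crm_abort:\s+(\S+): Triggered assert at (\S+:\d+) : (.+)$")

def count_asserts(log_text):
    """Count triggered assertions, keyed by (file:line, failed expression)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = ASSERT_RE.search(line)
        if m:
            counts[(m.group(2), m.group(3).strip())] += 1
    return counts

log_line = ("Nov 25 11:13:56 [3206] clnode1       crmd:    error: crm_abort: "
            "pcmkRegisterNode: Triggered assert at xml.c:594 : "
            "node->type == XML_ELEMENT_NODE")
for (site, expr), n in count_asserts(log_line).items():
    print(site, expr, n)
```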

For this particular problem I found the following patch:

<https://github.com/ClusterLabs/pacemaker/commit/68c7506aa84c69e5f425ef5f3025a9efb41d13da>


Are these already known bugs?
(I searched the CentOS bugzilla site but couldn't find any ticket
describing these bugs)


Any advice on whether or where I should report them?

Thanks!

- - andreas

- -- 
Andreas Haumer                     | mailto:andreas at xss.co.at
*x Software + Systeme              | http://www.xss.co.at/
Karmarschgasse 51/2/20             | Tel: +43-1-6060114-0
A-1100 Vienna, Austria             | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iD8DBQFYOBLBxJmyeGcXPhERAmSKAJ9NNI+D2OaBR1I8jum6AywMxQxsmACfU71C
HV+6j+4YRy71BkjHfipPJFg=Okfp
-----END PGP SIGNATURE-----

Johnny Hughes

2016-Nov-25 15:24 UTC

[CentOS] Pacemaker bugs?

On 11/25/2016 04:30 AM, Andreas Haumer wrote:
> Hi!
> 
> I think I stumbled on at least two bugs in the CentOS 7.2 pacemaker package,
> though I'm not quite sure if or where to report them.
> 
> I'm using the following package to set up a 2-node active/passive cluster:
> 
> [root@clnode1 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7_2.4.x86_64
> 
> The installation is up-to-date on both nodes as of the current PIT.
> 
> I currently have the following cluster resources running:
> 
> [root@clnode2 ~]# pcs status
> Cluster name: rucluster1
> Last updated: Fri Nov 25 11:26:51 2016          Last change: Fri Nov 25 10:51:32 2016 by root via cibadmin on clnode1
> Stack: corosync
> Current DC: clnode2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
> 2 nodes and 12 resources configured
> 
> Online: [ clnode1 clnode2 ]
> 
> Full list of resources:
> 
>  p_ip_cluster   (ocf::heartbeat:IPaddr2):       Started clnode2
>  Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
>      Masters: [ clnode2 ]
>      Slaves: [ clnode1 ]
>  p_fs_drbd1     (ocf::heartbeat:Filesystem):    Started clnode2
>  p_apache       (ocf::heartbeat:apache):        Started clnode2
>  p_dhcpd        (ocf::heartbeat:dhcpd): Started clnode2
>  p_named        (ocf::heartbeat:named): Started clnode2
>  p_slapd        (ocf::heartbeat:slapd): Started clnode2
>  p_postgres     (ocf::heartbeat:pgsql): Started clnode2
>  p_nmb  (systemd:nmb):  Started clnode2
>  p_smb  (systemd:smb):  Started clnode2
>  p_winbind      (systemd:winbind):      Started clnode2
> 
> PCSD Status:
>   clnode1: Online
>   clnode2: Online
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> 
> The first bug is rather serious, though a workaround exists!
> 
> The cluster works fine, but as soon as I add a cluster resource of
> class "service", the cluster manager software wreaks havoc on node
> failover. In that situation, the lrmd process hangs in an infinite
> loop (neither strace nor ltrace shows any output, so it seems to be
> an internal loop without any system or library calls) and almost any
> call to the cluster manager software (crmsh or pcs) runs into a timeout.
> It's quite hard to recover the whole cluster from this situation.
> 
> When I replace the resource class "service" with resource class
> "systemd", everything seems to work just fine.
> 
> I found a rather old, already closed bug for Fedora which looks similar:
> 
> <https://bugzilla.redhat.com/show_bug.cgi?id=1117151>
> 
> 
> Another bug seems to be rather minor: I see the following assertions in the
> corosync logs:
> 
> Nov 25 11:13:56 [3206] clnode1       crmd:    error: crm_abort: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
> 
> They seem to be related to the drbd resource, but do not appear to cause any
> functional problem.
> 
> For this particular problem I found the following patch:
> 
> <https://github.com/ClusterLabs/pacemaker/commit/68c7506aa84c69e5f425ef5f3025a9efb41d13da>
> 
> 
> Are these already known bugs?
> (I searched the CentOS bugzilla site but couldn't find any ticket
> describing these bugs)
> 
> 
> Any advice on whether or where I should report them?
> 
The new pacemaker built from the RHEL 7.3 source code is now in CR
(pacemaker-1.1.15-11.el7).

A still newer version will land in CR later today:
pacemaker-1.1.15-11.el7_3.2
