thr3ads.net - samba - [Samba] smbstatus hang with CTDB 2.5.4 and Samba 4.1.13 [Oct 2014]

If this information is useful, please help other people find it:
Share via:

John P Janosik

2014-Oct-29 17:48 UTC

[Samba] smbstatus hang with CTDB 2.5.4 and Samba 4.1.13

Can anyone help with some pointers to debug a problem with Samba and CTDB 
with smbstatus traversing the connections tdb?  I've got a new two node 
cluster with Samba and CTDB on AIX.  If I run smbstatus when the server 
has much user activity it hangs and the node it was run on gets banned.  I 
see the following in the ctdb log:

2014/10/29 11:12:45.374580 [3932342]: server/ctdb_traverse.c:644 Traverse 
cancelled by client disconnect for database:0x6b06a26d
2014/10/29 11:12:45.429592 [3932342]: Recovery daemon ping timeout. Count 
: 0
2014/10/29 11:12:45.430655 [3932342]: Handling event took 195 seconds!
2014/10/29 11:12:45.452636 [3932342]: pnn 1 Invalid reqid 220668 in 
ctdb_reply_control
2014/10/29 11:12:48.462334 [recoverd:6488266]: server/ctdb_recoverd.c:3990 
Remote node:0 has different flags for node 1. It has 0x02 vs our 0x00
2014/10/29 11:12:48.462448 [recoverd:6488266]: Use flags 0x00 from local 
recmaster node for cluster update of node 1 flags
2014/10/29 11:12:48.483362 [3932342]: Freeze priority 1
2014/10/29 11:12:58.574548 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:13:08.569593 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:13:18.604124 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:13:28.638279 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:13:38.653236 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:13:48.454994 [3932342]: Recovery daemon ping timeout. Count 
: 0
2014/10/29 11:13:48.484659 [recoverd:6488266]: ctdb_control error: 
'ctdb_control timed out'
2014/10/29 11:13:48.484723 [recoverd:6488266]: ctdb_control error: 
'ctdb_control timed out'
2014/10/29 11:13:48.484782 [recoverd:6488266]: Async operation failed with 
ret=-1 res=-1 opcode=33
2014/10/29 11:13:48.484837 [recoverd:6488266]: Failed to freeze node 1 
during recovery. Set it as ban culprit for 2 credits
2014/10/29 11:13:48.484890 [recoverd:6488266]: Async wait failed - 
fail_count=1
2014/10/29 11:13:48.485047 [recoverd:6488266]: server/ctdb_recoverd.c:410 
Unable to freeze nodes. Recovery failed.
2014/10/29 11:13:48.485095 [recoverd:6488266]: server/ctdb_recoverd.c:1881 
Unable to set recovery mode to active on cluster
2014/10/29 11:13:48.503273 [3932342]: Freeze priority 1
2014/10/29 11:13:48.688292 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:13:58.704454 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:14:08.739136 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:14:18.753675 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:14:28.788835 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:14:38.824317 [3932342]: 
/usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
2014/10/29 11:14:48.486128 [3932342]: Recovery daemon ping timeout. Count 
: 0
2014/10/29 11:14:48.504242 [recoverd:6488266]: ctdb_control error: 
'ctdb_control timed out'
2014/10/29 11:14:48.504298 [recoverd:6488266]: ctdb_control error: 
'ctdb_control timed out'
2014/10/29 11:14:48.504357 [recoverd:6488266]: Async operation failed with 
ret=-1 res=-1 opcode=33
2014/10/29 11:14:48.504411 [recoverd:6488266]: Failed to freeze node 1 
during recovery. Set it as ban culprit for 2 credits
2014/10/29 11:14:48.504463 [recoverd:6488266]: Async wait failed - 
fail_count=1
2014/10/29 11:14:48.504507 [recoverd:6488266]: server/ctdb_recoverd.c:410 
Unable to freeze nodes. Recovery failed.
2014/10/29 11:14:48.504552 [recoverd:6488266]: server/ctdb_recoverd.c:1881 
Unable to set recovery mode to active on cluster
2014/10/29 11:14:48.506682 [3932342]: Banning this node for 300 seconds

The tdb mentioned in the first log message is:

ctdb getdbmap | grep 0x6b06a26d
dbid:0x6b06a26d name:smbXsrv_session_global.tdb 
path:/usr/smb_cluster/var/lib/ctdb/smbXsrv_session_global.tdb.1

I was going to start by trying to update 
/usr/smb_cluster/etc/ctdb/debug_locks.sh since I don't think flock is 
available on AIX.  I'm hoping the ctdb log will then contain some more 
help debugging the problem.  Any other ideas on how to debug this?


Thanks,

John Janosik
jpjanosi at us.ibm.com

Amitay Isaacs

2014-Oct-29 23:16 UTC

head link

[Samba] smbstatus hang with CTDB 2.5.4 and Samba 4.1.13

On Thu, Oct 30, 2014 at 4:48 AM, John P Janosik <jpjanosi at us.ibm.com>
wrote:
> Can anyone help with some pointers to debug a problem with Samba and CTDB
> with smbstatus traversing the connections tdb?  I've got a new two node
> cluster with Samba and CTDB on AIX.  If I run smbstatus when the server
> has much user activity it hangs and the node it was run on gets banned.  I
> see the following in the ctdb log:
>
> 2014/10/29 11:12:45.374580 [3932342]: server/ctdb_traverse.c:644 Traverse
> cancelled by client disconnect for database:0x6b06a26d
> 2014/10/29 11:12:45.429592 [3932342]: Recovery daemon ping timeout. Count
> : 0
> 2014/10/29 11:12:45.430655 [3932342]: Handling event took 195 seconds!
> 2014/10/29 11:12:45.452636 [3932342]: pnn 1 Invalid reqid 220668 in
> ctdb_reply_control
> 2014/10/29 11:12:48.462334 [recoverd:6488266]: server/ctdb_recoverd.c:3990
> Remote node:0 has different flags for node 1. It has 0x02 vs our 0x00
> 2014/10/29 11:12:48.462448 [recoverd:6488266]: Use flags 0x00 from local
> recmaster node for cluster update of node 1 flags
> 2014/10/29 11:12:48.483362 [3932342]: Freeze priority 1
> 2014/10/29 11:12:58.574548 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:13:08.569593 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:13:18.604124 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:13:28.638279 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:13:38.653236 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:13:48.454994 [3932342]: Recovery daemon ping timeout. Count
> : 0
> 2014/10/29 11:13:48.484659 [recoverd:6488266]: ctdb_control error:
> 'ctdb_control timed out'
> 2014/10/29 11:13:48.484723 [recoverd:6488266]: ctdb_control error:
> 'ctdb_control timed out'
> 2014/10/29 11:13:48.484782 [recoverd:6488266]: Async operation failed with
> ret=-1 res=-1 opcode=33
> 2014/10/29 11:13:48.484837 [recoverd:6488266]: Failed to freeze node 1
> during recovery. Set it as ban culprit for 2 credits
> 2014/10/29 11:13:48.484890 [recoverd:6488266]: Async wait failed -
> fail_count=1
> 2014/10/29 11:13:48.485047 [recoverd:6488266]: server/ctdb_recoverd.c:410
> Unable to freeze nodes. Recovery failed.
> 2014/10/29 11:13:48.485095 [recoverd:6488266]: server/ctdb_recoverd.c:1881
> Unable to set recovery mode to active on cluster
> 2014/10/29 11:13:48.503273 [3932342]: Freeze priority 1
> 2014/10/29 11:13:48.688292 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:13:58.704454 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:14:08.739136 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:14:18.753675 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:14:28.788835 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:14:38.824317 [3932342]:
> /usr/smb_cluster/etc/ctdb/debug_locks.sh[23]: flock:  not found
> 2014/10/29 11:14:48.486128 [3932342]: Recovery daemon ping timeout. Count
> : 0
> 2014/10/29 11:14:48.504242 [recoverd:6488266]: ctdb_control error:
> 'ctdb_control timed out'
> 2014/10/29 11:14:48.504298 [recoverd:6488266]: ctdb_control error:
> 'ctdb_control timed out'
> 2014/10/29 11:14:48.504357 [recoverd:6488266]: Async operation failed with
> ret=-1 res=-1 opcode=33
> 2014/10/29 11:14:48.504411 [recoverd:6488266]: Failed to freeze node 1
> during recovery. Set it as ban culprit for 2 credits
> 2014/10/29 11:14:48.504463 [recoverd:6488266]: Async wait failed -
> fail_count=1
> 2014/10/29 11:14:48.504507 [recoverd:6488266]: server/ctdb_recoverd.c:410
> Unable to freeze nodes. Recovery failed.
> 2014/10/29 11:14:48.504552 [recoverd:6488266]: server/ctdb_recoverd.c:1881
> Unable to set recovery mode to active on cluster
> 2014/10/29 11:14:48.506682 [3932342]: Banning this node for 300 seconds
>
> The tdb mentioned in the first log message is:
>
> ctdb getdbmap | grep 0x6b06a26d
> dbid:0x6b06a26d name:smbXsrv_session_global.tdb
> path:/usr/smb_cluster/var/lib/ctdb/smbXsrv_session_global.tdb.1
>
> I was going to start by trying to update
> /usr/smb_cluster/etc/ctdb/debug_locks.sh since I don't think flock is
> available on AIX.  I'm hoping the ctdb log will then contain some more
> help debugging the problem.  Any other ideas on how to debug this?
>
>These logs are incomplete.  They do not include the traverse messages
(Starting/Ending traverse) and why the node became unhealthy.  Without
complete logs it's difficult to figure out what's happening.  However,
CTDB
is definitely busy.  If you have lots of processes, may be processes are
getting swapped to disk.  Usually AIX boxes are configured with small
amount of memory.

    2014/10/29 11:12:45.430655 [3932342]: Handling event took 195 seconds!

Node is getting banned because it could not be frozen.  You cannot use the
shipped debug_locks.sh script on AIX.  That script uses /proc/locks to
extract information about current locks.  There is no /proc/locks on AIX.
I do not know if there is any other way of finding information about
processes holding fcntl locks.

Amitay.

Seemingly Similar Threads

Search for more seemingly similar threads

samba - Oct 2014 - smbstatus hang with CTDB 2.5.4 and Samba 4.1.13

[Samba] smbstatus hang with CTDB 2.5.4 and Samba 4.1.13

[Samba] smbstatus hang with CTDB 2.5.4 and Samba 4.1.13

Seemingly Similar Threads