Smith, Jarrod A
2016-Mar-08 02:09 UTC
[Samba] Troubleshooting a suspected ctdb performance issue
We have a three-node ctdb/samba cluster (8x Sandy Bridge cores + 64GB RAM each node) running on top of GPFS 4.1.0.8, serving 500-600 CIFS clients. We use sernet-samba-4.1.6 with ctdb 1.0.114 on CentOS 6.6. Unfortunately the administrator who originally installed the ctdb/samba solution left some time ago, and we are still learning it.

Users are reporting intermittent "latency" issues that have occurred multiple times per day over the past several months. Typical complaints include 30-60s to open folders or files, and sometimes being disconnected from the service. The samba and ctdb logs show nothing at debug level WARNING. We have taken tcpdump/wireshark packet captures during such events and analyzed them; they showed no obvious ill behavior in the network.

I have recently been probing ctdb itself, and today I realized that periodically the number of ctdbd processes on a node quickly grows from 2 to 250+. This lasts anywhere from 30 seconds to several minutes, at which point it corrects itself. It seems to be correlated with a large increase in the number of lines in /proc/locks. We also see what I feel are fairly high max_lockwait_latency and max_call_latency values (see our statistics outputs below). I don't know what causes this, or how to fix it (if it indeed needs fixing).

Keeping in mind that I am new to samba and ctdb, do you have any recommendations for further troubleshooting, or for fixing the issue if you think I have already hit upon it?

Thanks for your advice,

--
Jarrod A. Smith, Ph.D.
Asst. Director, Center for Structural Biology
Research Assoc. Professor, Biochemistry
Vanderbilt University - 5135 MRB III
615-322-1739

-----------------------------------
CTDB statistics for each node. The counters were reset a week or two ago.
-----------------------------------
CTDB version 1
 num_clients 136
 frozen 0
 recovering 0
 client_packets_sent 286501416
 client_packets_recv 325458087
 node_packets_sent 354199901
 node_packets_recv 266253799
 keepalive_packets_sent 394382
 keepalive_packets_recv 394374
 node
     req_call 143496620
     reply_call 90253
     req_dmaster 55005319
     reply_dmaster 60629271
     reply_error 0
     req_message 1720274
     req_control 74416680
     reply_control 29674368
 client
     req_call 253163227
     req_message 1113562
     req_control 71335811
 timeouts
     call 0
     control 1
     traverse 3
 total_calls 253163227
 pending_calls 0
 lockwait_calls 11138044
 pending_lockwait_calls 0
 childwrite_calls 6
 pending_childwrite_calls 0
 memory_used 210352
 max_hop_count 2162
 max_reclock_ctdbd 0.141385 sec
 max_reclock_recd 169.497819 sec
 max_call_latency 310.868259 sec
 max_lockwait_latency 214.839209 sec
 max_childwrite_latency 0.014314 sec
-----------------------------------
CTDB version 1
 num_clients 132
 frozen 0
 recovering 0
 client_packets_sent 247024177
 client_packets_recv 286512929
 node_packets_sent 336526909
 node_packets_recv 255250235
 keepalive_packets_sent 394339
 keepalive_packets_recv 394328
 node
     req_call 128153759
     reply_call 70305
     req_dmaster 60194286
     reply_dmaster 53335499
     reply_error 0
     req_message 1521121
     req_control 73830804
     reply_control 24189543
 client
     req_call 219206250
     req_message 1037383
     req_control 66378108
 timeouts
     call 0
     control 3
     traverse 5
 total_calls 219206250
 pending_calls 0
 lockwait_calls 3265686
 pending_lockwait_calls 0
 childwrite_calls 6
 pending_childwrite_calls 0
 memory_used 253340
 max_hop_count 2163
 max_reclock_ctdbd 0.342660 sec
 max_reclock_recd 0.000000 sec
 max_call_latency 437.201033 sec
 max_lockwait_latency 67.572988 sec
 max_childwrite_latency 0.015522 sec
-----------------------------------
CTDB version 1
 num_clients 205
 frozen 0
 recovering 0
 client_packets_sent 299537550
 client_packets_recv 349951795
 node_packets_sent 376914119
 node_packets_recv 272621669
 keepalive_packets_sent 417794
 keepalive_packets_recv 417782
 node
     req_call 139163332
     reply_call 154848
     req_dmaster 59987264
     reply_dmaster 58997998
     reply_error 0
     req_message 1083367
     req_control 85827912
     reply_control 34427476
 client
     req_call 262120527
     req_message 2169333
     req_control 85858251
 timeouts
     call 0
     control 2
     traverse 5
 total_calls 262120527
 pending_calls 0
 lockwait_calls 5550667
 pending_lockwait_calls 0
 childwrite_calls 6
 pending_childwrite_calls 0
 memory_used 250736
 max_hop_count 2152
 max_reclock_ctdbd 0.016747 sec
 max_reclock_recd 166.169447 sec
 max_call_latency 16350.672816 sec
 max_lockwait_latency 74.970163 sec
 max_childwrite_latency 0.016126 sec
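
-----------------------------------
For anyone who wants to check the same correlation on their own nodes, here is a minimal sketch of how the ctdbd process count and the /proc/locks line count could be sampled together. It is not the exact procedure used to gather the observations above. It assumes Python 3 is available on the node and that process names can be read from the "Name:" line of /proc/<pid>/status. The function names (count_processes, count_lines, sample, watch) are made up for this illustration.

```python
import os
import time


def process_name(status_path):
    """Return the process name from the 'Name:' line of a /proc/<pid>/status file."""
    with open(status_path) as f:
        first = f.readline()
    if first.startswith("Name:"):
        return first.split(":", 1)[1].strip()
    return None


def count_processes(name, proc_root="/proc"):
    """Count running processes whose name equals `name`."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            if process_name(os.path.join(proc_root, entry, "status")) == name:
                count += 1
        except OSError:
            continue  # the process exited while we were looking at it
    return count


def count_lines(path):
    """Count the lines in a file such as /proc/locks."""
    with open(path) as f:
        return sum(1 for _ in f)


def sample(proc_root="/proc"):
    """Take one (timestamp, ctdbd process count, lock line count) sample."""
    return (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        count_processes("ctdbd", proc_root),
        count_lines(os.path.join(proc_root, "locks")),
    )


def watch(interval=5, iterations=None):
    """Print one sample every `interval` seconds; loop forever if iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        print("%s ctdbd=%d locks=%d" % sample(), flush=True)
        done += 1
        time.sleep(interval)
```

Calling watch(interval=5) on each node and redirecting the output to a file would give a timeline that can be lined up against the times users report slow folder opens. The 5-second interval is an arbitrary choice, not a recommendation.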