Hello.

We've been using glusterfs for five months without any problems until yesterday, when all clients that tried to write anything suddenly began to hang in "D" state (uninterruptible sleep, waiting for disk). At the same time the gluster nodes began to consume very high CPU, which had never happened before. "htop" suggested that it was mostly kernel load from the gluster processes.

I tried to stop the rebalance task, restarted the volume, and even upgraded from 3.5.2 to 3.5.3, but it didn't help. Currently we are only reading from the cluster and writing elsewhere, and CPU usage is still high.

Gluster is running as a 3 x 2 distributed-replicate volume on top of ext4. We keep about 5 TB of small files with a large number of hard links. Binaries are built from source (via Gentoo Portage). Kernel version is 3.8.13.

There is a lot of debug information in the logs, but to me it looks vague and I don't know where to start. I can provide it if needed.

I'm lost here. Any help appreciated.

Thanks,
Alex
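For anyone who wants to confirm the same symptom on their own clients, here is a minimal sketch that lists processes currently in "D" state by reading /proc/<pid>/stat on Linux. The function names parse_stat and d_state_processes are made up for this illustration; this is not a gluster tool, only a rough equivalent of filtering the state column in htop or ps.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

Running it a few times in a row helps tell processes that are stuck for good from ones that are only briefly waiting on I/O.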
Okay, I did some digging. On the client there were many errors such as:

[2015-04-29 15:47:08.700174] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-img-client-0: remote operation failed: Transport endpoint is not connected. Path: /www/img/gallery/9722926_4130.jpg (00000000-0000-0000-0000-000000000000)
[2015-04-29 15:47:08.700268] I [afr-self-heal-entry.c:607:afr_sh_entry_expunge_entry_cbk] 0-img-replicate-0: looking up /www/img/gallery/9722926_4130.jpg under img-client-0 failed (Transport endpoint is not connected)

And at the same time on the cluster:

[2015-04-29 15:47:59.989897] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-img-client-0: remote operation failed: Transport endpoint is not connected. Path: /www/pdf/23096091-1722.pdf (00000000-0000-0000-0000-000000000000)
[2015-04-29 15:47:59.989923] I [afr-self-heal-entry.c:607:afr_sh_entry_expunge_entry_cbk] 0-img-replicate-0: looking up /www/pdf/23096091-1722.pdf under img-client-0 failed (Transport endpoint is not connected)

What could it mean? Is there some kind of network error? BTW, there was nothing that indicated any network connectivity problems between the nodes and the clients.
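To see whether every such warning points at the same brick connection (img-client-0 in the excerpts above) or at several, the log can be tallied per translator. The following is a rough sketch that assumes every log line has the same layout as the ones quoted here; the regular expression and the name tally_disconnects are my own, and the log file path is passed on the command line because its location varies between installs.

```python
import re
import sys
from collections import Counter

# Layout assumed from the quoted lines:
# [timestamp] LEVEL [source-file:line:function] translator: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^\]]+)\]\s+(?P<xlator>\S+):\s+(?P<msg>.*)$"
)

NEEDLE = "Transport endpoint is not connected"


def tally_disconnects(lines):
    """Count warning-level 'not connected' messages per translator name."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # line does not follow the assumed layout
        # Only count W lines so the matching I line from self-heal
        # is not counted as a second failure.
        if m.group("level") == "W" and NEEDLE in m.group("msg"):
            counts[m.group("xlator")] += 1
    return counts


if __name__ == "__main__":
    with open(sys.argv[1], errors="replace") as fh:
        for xlator, n in tally_disconnects(fh).most_common():
            print(xlator, n)
```

If nearly all hits land on a single translator such as 0-img-client-0, that suggests one brick or its connection is the problem rather than the network as a whole.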