I replaced the brick in a node in my 3x2 dist+repl volume (RHS 3). I'm seeing that the heal process, which should essentially be a dump from the working replica to the newly added one, is taking exceptionally long: it has moved ~100 GB over a day on a 1 Gigabit network. The CPU usage on both nodes of the replica has been pretty high.

I also think that Nagios is making it worse. The heal is slow enough as it is, and Nagios keeps triggering heal info, which I think never completes. I also see my logs filling up. These are some of the log contents which I got by running tail on them:

cli.log:

[2015-08-06 19:52:20.926000] T [socket.c:2759:socket_connect] (-->/lib64/libpthread.so.0() [0x3ec1407a51] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0x120) [0x7fb84c0f6980] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xd9) [0x7fb84bc96249]))) 0-glusterfs: connect () called on transport already connected
[2015-08-06 19:52:21.926068] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2015-08-06 19:52:21.926091] T [socket.c:2767:socket_connect] 0-glusterfs: connecting 0xa198b0, state=0 gen=0 sock=-1
[2015-08-06 19:52:21.926114] W [dict.c:1060:data_to_str] (-->/usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so(+0x6bea) [0x7fb844f82bea] (-->/usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0xad) [0x7fb844f873bd] (-->/usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so(client_fill_address_family+0x200) [0x7fb844f87270]))) 0-dict: data is NULL
[2015-08-06 19:52:21.926125] W [dict.c:1060:data_to_str] (-->/usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so(+0x6bea) [0x7fb844f82bea] (-->/usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0xad) [0x7fb844f873bd] (-->/usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so(client_fill_address_family+0x20b) [0x7fb844f8727b]))) 0-dict: data is NULL
[2015-08-06 19:52:21.926129] E [name.c:140:client_fill_address_family] 0-glusterfs: transport.address-family not specified. Could not guess default value from (remote-host:(null) or transport.unix.connect-path:(null)) options
[2015-08-06 19:52:21.926179] T [cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_DISCONNECT

The brick log is full of these messages, which occurred while creating symlinks:

[2015-08-06 19:54:22.494254] I [server-rpc-fops.c:693:server_removexattr_cbk] 0-gluster-server: 2206495: REMOVEXATTR file path (fadccb1e-ea0c-416a-94ec-ec88fafec2a5) of key security.ima ==> (No data available)
[2015-08-06 19:54:22.514814] E [marker.c:2574:marker_removexattr_cbk] 0-gluster-marker: No data available

$ sestatus
SELinux status: disabled

$ glusterfs --version
glusterfs 3.6.0.53 built on Mar 18 2015 08:12:38

Does anyone know what's going on?

PS: I am using RHS because our school's satellite has the repos. Contacting RHN over this would likely be complicated, and I would prefer solving this on my own.
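PPS: For reference, heal progress can be checked from the CLI with something like the commands below; <volname> is a placeholder for the actual volume name, and whether "statistics heal-count" is available on this 3.6 build is an assumption on my part:

$ gluster volume heal <volname> info                   # full listing of pending entries; this is what Nagios keeps triggering
$ gluster volume heal <volname> statistics heal-count  # lighter-weight count of entries still pending heal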
On 08/07/2015 01:33 AM, Prasun Gera wrote:
> I replaced the brick in a node in my 3x2 dist+repl volume (RHS 3).
> I'm seeing that the heal process, which should essentially be a dump
> from the working replica to the newly added one, is taking
> exceptionally long: it has moved ~100 GB over a day on a 1 Gigabit
> network. The CPU usage on both nodes of the replica has been pretty
> high.

Does setting `cluster.data-self-heal-algorithm` to full make a difference in the CPU usage?

> I also think that Nagios is making it worse. The heal is slow enough
> as it is, and Nagios keeps triggering heal info, which I think never
> completes. I also see my logs filling up. These are some of the log
> contents which I got by running tail on them:
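To be concrete, the setting suggested above can be changed on a live volume with something like the following; <volname> is a placeholder for the actual volume name:

$ gluster volume set <volname> cluster.data-self-heal-algorithm full

The default "diff" algorithm computes checksums on both bricks so that only changed blocks get transferred, which is CPU-intensive; "full" copies whole files instead, trading network bandwidth for CPU. Since a freshly replaced brick starts out empty, diff has little to save there anyway.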