thr3ads.net - Gluster users - [Gluster-users] How to diagnose volume rebalance failure? [Dec 2015]


PuYun

2015-Dec-14 13:51 UTC

[Gluster-users] How to diagnose volume rebalance failure?

Hi,

Thank you for your reply. I don't know how to send you the rebalance log
file; it is huge, about 2 GB.

However, I may have found the reason the task failed. My gluster server has
only 2 CPU cores and carries 2 SSD bricks. When the rebalance task began, the
top 3 processes were using 70%~80%, 30%~40%, and 30%~40% CPU; all others were
below 1%. After a while both CPU cores were fully used, and I could not even
log in until the rebalance task had failed.

It seems 2 bricks require at least 4 CPU cores. I have now upgraded the virtual
server to 8 CPU cores and started the rebalance task again. Everything is going
well so far.

I will report again when the current task completes or fails.



PuYun
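
The CPU figures in the message above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes the percentages reported by top are fractions of a single core (top's default per-process view), which the message does not state.

```python
def cpu_demand(ranges_percent, cores):
    """Sum per-process CPU ranges and compare them to total capacity.

    ranges_percent: list of (low, high) tuples, in percent of one core
                    (assumption: top's default per-process reporting).
    cores: number of CPU cores on the server.
    Returns (low_total, high_total, capacity), all in percent of one core.
    """
    low_total = sum(low for low, _ in ranges_percent)
    high_total = sum(high for _, high in ranges_percent)
    return low_total, high_total, cores * 100


# Figures from the message: three busy processes on a 2-core server.
low, high, capacity = cpu_demand([(70, 80), (30, 40), (30, 40)], cores=2)
print(f"demand at start: {low}%-{high}% of {capacity}% available")
```

Under that assumption the three processes already needed 130% to 160% of the 200% available when the task began, which leaves little headroom before both cores saturate.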
 
From: Nithya Balachandran
Date: 2015-12-14 18:57
To: PuYun
CC: gluster-users
Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
Hi,
 
Can you send us the rebalance log?
 
Regards,
Nithya
 
----- Original Message -----
> From: "PuYun" <cloudor at 126.com>
> To: "gluster-users" <gluster-users at gluster.org>
> Sent: Monday, December 14, 2015 11:33:40 AM
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
> 
> Here is the tail of the failed rebalance log, any clue?
> 
> [2015-12-13 21:30:31.527493] I [dht-rebalance.c:2340:gf_defrag_process_dir]
> 0-FastVol-dht: Migration operation on dir
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/5F/1MsH5--BcoGRAJPI took 20.95 secs
> [2015-12-13 21:30:31.528704] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:30:31.543901] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:31:37.210496] I [MSGID: 109081]
> [dht-common.c:3780:dht_setxattr] 0-FastVol-dht: fixing the layout of
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q
> [2015-12-13 21:31:37.722825] I [MSGID: 109045]
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 0
> (FastVol-client-0): 1032124 chunks
> [2015-12-13 21:31:37.722837] I [MSGID: 109045]
> [dht-selfheal.c:1508:dht_fix_layout_of_directory] 0-FastVol-dht: subvolume 1
> (FastVol-client-1): 1032124 chunks
> [2015-12-13 21:33:03.955539] I [MSGID: 109064]
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol:
> FastVol-client-0; inode layout - 0 - 2146817919 - 1; disk layout -
> 2146817920 - 4294967295 - 1
> [2015-12-13 21:33:04.069859] I [MSGID: 109018]
> [dht-common.c:806:dht_revalidate_cbk] 0-FastVol-dht: Mismatching layouts for
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q, gfid = f38c4ed2-a26a-4d83-adfd-6b0331831738
> [2015-12-13 21:33:04.118800] I [MSGID: 109064]
> [dht-layout.c:808:dht_layout_dir_mismatch] 0-FastVol-dht: subvol:
> FastVol-client-1; inode layout - 2146817920 - 4294967295 - 1; disk layout -
> 0 - 2146817919 - 1
> [2015-12-13 21:33:19.979507] I [MSGID: 109022]
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration
> of
> /for_ybest_fsdir/user/Weixin.oClDcjhe/Kn/hM/oHcPMp4hKq5Tq2ZQ/flag_finished
> from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:19.979459] I [MSGID: 109022]
> [dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration
> of /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/userPoint
> from subvolume FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.543941] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_origin.jpg:
> attempting to move from FastVol-client-0 to FastVol-client-1
> [2015-12-13 21:33:25.962547] I [dht-rebalance.c:1010:dht_migrate_file]
> 0-FastVol-dht:
> /for_ybest_fsdir/user/Weixin.oClDcjhe/PU/ps/qUa-n38i8QBgeMdI/portrait_small.jpg:
> attempting to move from FastVol-client-0 to FastVol-client-1
> 
> 
> Cloudor
> 
> 
> 
> From: Sakshi Bansal
> Date: 2015-12-12 13:02
> To: ??
> CC: gluster-users
> Subject: Re: [Gluster-users] How to diagnose volume rebalance failure?
> In the rebalance log file you can check the file/directory for which the
> rebalance has failed. It may also mention the fop (file operation) for which
> the failure happened.
> 
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
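
One way to follow the suggestion quoted above without shipping the whole 2 GB file is to stream the rebalance log and keep only the warning and error lines. The sketch below assumes each entry sits on one line on disk and starts with "[timestamp] <severity letter>", as in the excerpts in this thread; the function names and the choice of W, E, and C as the interesting severities are assumptions, and the E line in the sample is made up purely to exercise the filter.

```python
import re

# "[2015-12-13 21:31:37.210496] I ..." -> timestamp and one-letter severity.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) ")


def problem_lines(lines, severities=("W", "E", "C")):
    """Yield (timestamp, severity, text) for entries at the given severities."""
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) in severities:
            yield match.group(1), match.group(2), line.rstrip("\n")


def scan_file(path, severities=("W", "E", "C")):
    """Stream a log file line by line so a multi-GB file never sits in memory."""
    with open(path, errors="replace") as handle:
        yield from problem_lines(handle, severities)


# Small demo on in-memory lines. The first line is taken from this thread;
# the second is a hypothetical error line, not real Gluster output.
sample = [
    "[2015-12-13 21:31:37.210496] I [MSGID: 109081] "
    "[dht-common.c:3780:dht_setxattr] 0-FastVol-dht: fixing the layout of "
    "/for_ybest_fsdir/user/Weixin.oClDcjhe/Ny/7Q\n",
    "[2015-12-13 21:33:26.000000] E [hypothetical.c:1:example] "
    "0-FastVol-dht: made-up error line for the demo\n",
]
for timestamp, severity, text in problem_lines(sample):
    print(severity, timestamp)
```

Calling `scan_file` on the real rebalance log and writing the result to a new file should give an excerpt small enough to attach to a mail. If it yields nothing, as the all-informational tail above suggests it might, that in itself hints that the process stopped without logging a failure.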

Mauro M.

2015-Dec-14 15:34 UTC


[Gluster-users] All issues resolved after disabling RDMA

I have been experiencing several issues with glusterfs for several months.

These started more or less after upgrading from release 3.5 to 3.7; I
skipped the 3.6 series. At almost the same time I had introduced an
InfiniBand point-to-point network between my two gluster bricks.

The symptoms were failures to start the volume even when both nodes were
up and running, failed synchronizations, and inexplicable split-brain, even
for files that I was certain were accessed by a single client only.

I was about to give up on glusterfs altogether. As a last resort I tried
disabling RDMA (over InfiniBand) again, and this time I rebuilt the bricks
from scratch using only TCP from the start. (I had tried disabling RDMA
before, but without starting from scratch, so I must have been hitting
latent issues.)

I cannot tell whether the RDMA defects are caused by gluster, the hardware,
or the operating system. However, using TCP only over InfiniBand, I now
have a stable cluster with an active node and a second node which I usually
leave turned off and which synchronizes perfectly every time I turn it back
on.

I hope this helps.

Mauro

PuYun

2015-Dec-14 23:30 UTC


[Gluster-users] How to diagnose volume rebalance failure?

Hi,

It failed again. I can see disconnections in the logs, but no further details.

=========== mnt-b1-brick.log ==========
[2015-12-14 21:46:54.179662] I [MSGID:
115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting
connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
[2015-12-14 21:46:54.181764] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.181815] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir
[2015-12-14 21:46:54.181856] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user
[2015-12-14 21:46:54.181918] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.181961] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.182003] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.182036] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.182076] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.182110] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.182203] I [MSGID: 101055] [client_t.c:419:gf_client_unref]
0-FastVol-server: Shutting down connection
d001-1799-2015/12/14-12:54:56:347561-FastVol-client-1-0-0
=====================================
============== mnt-c1-brick.log ===========
[2015-12-14 21:46:54.179597] I
[MSGID: 115036] [server.c:552:server_rpc_notify] 0-FastVol-server: disconnecting
connection from d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
[2015-12-14 21:46:54.180428] W [inodelk.c:404:pl_inodelk_log_cleanup]
0-FastVol-server: releasing lock on 5e300cdb-7298-44c0-90eb-5b50018daed6 held by
{client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180454] W [inodelk.c:404:pl_inodelk_log_cleanup]
0-FastVol-server: releasing lock on 3c9a1cd5-84c8-4967-98d5-e75a402b1f74 held by
{client=0x7effc810cce0, pid=-3 lk-owner=fdffffff}
[2015-12-14 21:46:54.180483] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on /
[2015-12-14 21:46:54.180525] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir
[2015-12-14 21:46:54.180570] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user
[2015-12-14 21:46:54.180604] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg
[2015-12-14 21:46:54.180634] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji
[2015-12-14 21:46:54.180678] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay
[2015-12-14 21:46:54.180725] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/an/ling00
[2015-12-14 21:46:54.180779] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif
[2015-12-14 21:46:54.180820] I [MSGID: 115013]
[server-helpers.c:294:do_fd_cleanup] 0-FastVol-server: fd cleanup on
/for_ybest_fsdir/user/ji/ay/an
[2015-12-14 21:46:54.180859] I [MSGID: 101055] [client_t.c:419:gf_client_unref]
0-FastVol-server: Shutting down connection
d001-1799-2015/12/14-12:54:56:347561-FastVol-client-0-0-0
=====================================

============== etc-glusterfs-glusterd.vol.log =========
[2015-12-14
21:46:54.179819] W [socket.c:588:__socket_rwv] 0-management: readv on
/var/run/gluster/gluster-rebalance-dbee250a-e3fe-4448-b905-b76c5ba80b25.sock
failed (No data available)
[2015-12-14 21:46:54.209586] I [MSGID: 106007]
[glusterd-rebalance.c:162:__glusterd_defrag_notify] 0-management: Rebalance
process for volume FastVol has disconnected.
[2015-12-14 21:46:54.209627] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy]
0-management: size=588 max=1 total=1
[2015-12-14 21:46:54.209640] I [MSGID: 101053] [mem-pool.c:616:mem_pool_destroy]
0-management: size=124 max=1 total=1
============================================

================== FastVol-rebalance.log ===========
...
[2015-12-14 21:46:53.423719] I [MSGID: 109022]
[dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/07.jpg from subvolume
FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.423976] I [MSGID: 109022]
[dht-rebalance.c:1290:dht_migrate_file] 0-FastVol-dht: completed migration of
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/126724/1d0ca0de913c4e50f85f2b29694e4e64.html
from subvolume FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436268] I [dht-rebalance.c:1010:dht_migrate_file]
0-FastVol-dht: /for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/0.jpg:
attempting to move from FastVol-client-0 to FastVol-client-1
[2015-12-14 21:46:53.436597] I [dht-rebalance.c:1010:dht_migrate_file]
0-FastVol-dht:
/for_ybest_fsdir/user/ji/ay/up/a19640529/linkwrap/129836/icon_loading_white22c04a.gif:
attempting to move from FastVol-client-0 to FastVol-client-1
<EOF>
=============================================


PuYun
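
Since the four excerpts above all end within about a second of each other, it can help to interleave them into a single timeline. The sketch below is one way to do that for log tails; the function names are hypothetical, it assumes every entry starts with a bracketed timestamp with microseconds as in the excerpts, and it sorts everything in memory, so it is meant for short excerpts rather than the full 2 GB log. The demo lines are shortened to the start of entries quoted in this message.

```python
import re
from datetime import datetime
from itertools import chain

TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\]")


def entries(name, lines):
    """Yield (timestamp, log name, text) for each line that starts an entry."""
    for line in lines:
        match = TS_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            yield stamp, name, line.rstrip("\n")


def timeline(logs):
    """Merge several logs (name -> iterable of lines) into one sorted list.

    A full sort is used instead of a streaming merge because entries within
    one log are not always strictly in timestamp order.
    """
    return sorted(chain.from_iterable(entries(n, l) for n, l in logs.items()))


# Shortened first lines of entries from the excerpts above.
logs = {
    "mnt-b1-brick.log": ["[2015-12-14 21:46:54.179662] I [MSGID: 115036]"],
    "mnt-c1-brick.log": ["[2015-12-14 21:46:54.179597] I [MSGID: 115036]"],
    "etc-glusterfs-glusterd.vol.log": ["[2015-12-14 21:46:54.179819] W"],
    "FastVol-rebalance.log": ["[2015-12-14 21:46:53.436597] I"],
}
for stamp, name, text in timeline(logs):
    print(stamp.time(), name)
```

On these lines the last rebalance log entry comes roughly 0.74 seconds before the bricks and glusterd record the disconnect. That would be consistent with the rebalance process stopping abruptly rather than logging an error, though this is an inference from the timestamps and not something the logs state.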
 
