Hi,
Well, our rebalance seems to have failed.  Here is the output:
# gluster vol rebalance tank status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------   ----------------
                               localhost          1348706        57.8TB       2234439             9             6               failed          190:24:3
                                 serverB                0        0Bytes             7             0             0            completed          63:47:55
volume rebalance: tank: success
# gluster vol status tank
Status of volume: tank
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick serverA:/gluster_bricks/data1       49162     0          Y       20318
Brick serverB:/gluster_bricks/data1       49166     0          Y       3432
Brick serverA:/gluster_bricks/data2       49163     0          Y       20323
Brick serverB:/gluster_bricks/data2       49167     0          Y       3435
Brick serverA:/gluster_bricks/data3       49164     0          Y       4625
Brick serverA:/gluster_bricks/data4       49165     0          Y       4644
Brick serverA:/gluster_bricks/data5       49166     0          Y       5088
Brick serverA:/gluster_bricks/data6       49167     0          Y       5128
Brick serverB:/gluster_bricks/data3       49168     0          Y       22314
Brick serverB:/gluster_bricks/data4       49169     0          Y       22345
Brick serverB:/gluster_bricks/data5       49170     0          Y       22889
Brick serverB:/gluster_bricks/data6       49171     0          Y       22932
Self-heal Daemon on localhost               N/A       N/A        Y       6202
Self-heal Daemon on serverB               N/A       N/A        Y       22981
Task Status of Volume tank
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : eec64343-8e0d-4523-ad05-5678f9eb9eb2
Status               : failed
# df -hP |grep data
/dev/mapper/gluster_vg-gluster_lv1_data   60T   31T   29T  52%  /gluster_bricks/data1
/dev/mapper/gluster_vg-gluster_lv2_data   60T   31T   29T  51%  /gluster_bricks/data2
/dev/mapper/gluster_vg-gluster_lv3_data   60T   15T   46T  24%  /gluster_bricks/data3
/dev/mapper/gluster_vg-gluster_lv4_data   60T   15T   46T  24%  /gluster_bricks/data4
/dev/mapper/gluster_vg-gluster_lv5_data   60T   15T   45T  25%  /gluster_bricks/data5
/dev/mapper/gluster_vg-gluster_lv6_data   60T   15T   45T  25%  /gluster_bricks/data6
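For completeness, I also plan to re-check peer connectivity (assuming
peer status is the right thing to look at here):
# gluster peer status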
The rebalance log on serverA shows a disconnect from serverB:
[2019-09-08 15:41:44.285591] C
[rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-tank-client-10: server
<serverB>:49170 has not responded in the last 42 seconds, disconnecting.
[2019-09-08 15:41:44.285739] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-tank-client-10: disconnected from
tank-client-10. Client process will keep trying to connect to glusterd
until brick's port is available
[2019-09-08 15:41:44.286023] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7ff986e8b132] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7ff986c5299e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7ff986c52aae] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7ff986c54220] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x2b0)[0x7ff986c54ce0] )))))
0-tank-client-10: forced unwinding frame type(GlusterFS 3.3)
op(FXATTROP(34)) called at 2019-09-08 15:40:44.040333 (xid=0x7f8cfac)
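As an aside, the 42 seconds in the first message looks like the default
network.ping-timeout. If raising it is advisable, I assume the change
would be something like the following (60 is only an illustrative value):
# gluster volume get tank network.ping-timeout
# gluster volume set tank network.ping-timeout 60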
Does this type of failure cause data corruption?  What is the best course
of action at this point?
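If simply retrying is reasonable, I assume it would just be:
# gluster volume rebalance tank start
but I did not want to kick that off before understanding whether the
failed run left anything in a bad state.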
Thanks,
HB
On Wed, Sep 11, 2019 at 11:58 PM Strahil <hunter86_bg at yahoo.com> wrote:
> Hi Nithya,
>
> Thanks for the detailed explanation.
> It makes sense.
>
> Best Regards,
> Strahil Nikolov
> On Sep 12, 2019 08:18, Nithya Balachandran <nbalacha at redhat.com>
> wrote:
>
>
>
> On Wed, 11 Sep 2019 at 09:47, Strahil <hunter86_bg at yahoo.com>
> wrote:
>
> Hi Nithya,
>
> I was just reminded of your previous e-mail, which left me with the
> impression that old volumes need that.
> This is the one I mean:
>
> >It looks like this is a replicate volume. If that is the case then yes,
> >you are running an old version of Gluster for which this was the
> >default behaviour.
> >
> >Regards,
> >Nithya
>
> Hi Strahil,
>
> I'm providing a little more detail here which I hope will explain
> things.
> Rebalance has always been a volume-wide operation - a *rebalance start*
> operation will start rebalance processes on all nodes of the volume.
> However, different processes would behave differently. In earlier releases,
> all nodes would crawl the bricks and update the directory layouts. However,
> only one node in each replica/disperse set would actually migrate files, so
> the rebalance status would only show one node doing any "work" (scanning,
> rebalancing, etc.). However, this one node will process all the files in its
> replica sets. Rerunning rebalance on other nodes would make no difference
> as it will always be the same node that ends up migrating files.
> So for instance, for a replicate volume with server1:/brick1,
> server2:/brick2 and server3:/brick3 in that order, only the rebalance
> process on server1 would migrate files. In newer releases, all 3 nodes
> would migrate files.
>
> The rebalance status does not capture the directory operations of fixing
> layouts, which is why it looks like the other nodes are not doing anything.
>
> Hope this helps.
>
> Regards,
> Nithya
>
> Best Regards,
> Strahil Nikolov
> On Sep 9, 2019 06:36, Nithya Balachandran <nbalacha at redhat.com>
> wrote:
>
>
>
> On Sat, 7 Sep 2019 at 00:03, Strahil Nikolov <hunter86_bg at yahoo.com>
> wrote:
>
> As was mentioned, you might have to run rebalance on the other node -
> but it is better to wait until this node is finished.
>
>
> Hi Strahil,
>
> Rebalance does not need to be run on the other node - the operation is a
> volume-wide one. Only a single node per replica set would migrate files in
> the version used in this case.
>
> Regards,
> Nithya
>
> Best Regards,
> Strahil Nikolov
>
> On Friday, 6 September 2019, 15:29:20 GMT+3, Herb Burnswell <
> herbert.burnswell at gmail.com> wrote:
>
>
On Sat, 14 Sep 2019 at 01:25, Herb Burnswell <herbert.burnswell at gmail.com>
wrote:

> Hi,
>
> Well, our rebalance seems to have failed. Here is the output:
> [...]

Hi,

Rebalance will abort itself if it cannot reach any of the nodes. Are all
the bricks still up and reachable?

Regards,
Nithya