Atin Mukherjee
2016-Apr-14 09:03 UTC
[Gluster-users] [Gluster-devel] Gluster Brick Offline after reboot!!
On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
>
> On Tue, Apr 5, 2016 at 2:22 PM, Atin Mukherjee <amukherj at redhat.com
> <mailto:amukherj at redhat.com>> wrote:
>
> > On 04/05/2016 01:04 PM, ABHISHEK PALIWAL wrote:
> > Hi Team,
> >
> > We are using Gluster 3.7.6 and facing a problem in which a brick does
> > not come online after the board is restarted.
> >
> > To understand our setup, please see the following steps:
> > 1. We have two boards, A and B, on which a Gluster volume runs in
> >    replicated mode with one brick on each board.
> > 2. The Gluster mount point is present on Board A and is shared by a
> >    number of processes.
> > 3. Up to this point the volume is in sync and everything works fine.
> > 4. Now we have a test case in which we stop glusterd, reboot Board B,
> >    and when the board comes up, start glusterd on it again.
> > 5. We repeated step 4 multiple times to check the reliability of the
> >    system.
> > 6. After step 4 the system sometimes comes up in a working state (i.e.
> >    in sync), but sometimes the brick of Board B is listed in the
> >    'gluster volume status' output yet does not come online even after
> >    waiting for more than a minute.
>
> As I mentioned in another email thread, until and unless the log shows
> evidence that there was a reboot, nothing can be concluded. The last
> log you shared with us a few days back gave no indication that the
> brick process wasn't running.
>
> How can we identify from the brick logs that the brick process is
> running?
>
> > 7. While step 4 is executing, some processes on Board A start
> >    accessing files from the Gluster mount point.
> >
> > As a solution to bring this brick online, we found existing threads on
> > the Gluster mailing list suggesting 'gluster volume start <vol_name>
> > force' to take the brick from 'offline' to 'online'.
> >
> > If we use the 'gluster volume start <vol_name> force' command, it will
> > kill the existing volume process and start a new one. What will happen
> > if other processes are accessing the same volume at the moment the
> > volume process is killed internally by this command? Will it cause any
> > failure in those processes?
>
> This is not true: volume start force will start the brick processes
> only if they are not running. Running brick processes will not be
> interrupted.
>
> We have tried this and checked the PID of the process before and after
> the force start: the PID had changed after the force start.
>
> Please find the logs at the time of failure attached once again with
> log-level=debug.
>
> If you can point to the exact line in the brick log file where you see
> that the brick process is running, please give me the line number in
> that file.

Here is the sequence in which glusterd and the respective brick process
were restarted.

1. glusterd restart trigger - line number 1014 in glusterd.log:

[2016-04-03 10:12:29.051735] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
(args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level DEBUG)
2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log:

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
(args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)

3. The following log indicates that the brick is up and has now started.
Refer to line 16123 in glusterd.log:

[2016-04-03 10:14:25.336855] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

This clearly indicates that the brick is up and running, as after that I
do not see any disconnect event processed by glusterd for the brick
process.

Please note that all the logs referred to and pasted are from 002500.

~Atin

> 002500 - Board B, where the brick is offline
> 000300 - Board A logs
>
> *Question: What could be contributing to the brick being offline?*
>
> --
> Regards
> Abhishek Paliwal
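To Abhishek's question about how to tell from the logs that a brick
process is running: the checks Atin describes can be scripted. A minimal
sketch, using the pidfile and brick log path taken from the glusterfsd
arguments quoted above (the glusterd log location is an assumption and
may differ per setup):

#!/bin/sh
# Sketch: is the brick process for /opt/lvmdir/c2/brick alive, and what
# do the logs say? Paths come from the glusterfsd args quoted above.
PIDFILE=/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
BRICKLOG=/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log

# 1. Check the brick process via its pidfile (kill -0 only tests that
#    the process exists; it sends no signal).
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "brick process $(cat "$PIDFILE") is running"
else
    echo "brick process is NOT running"
fi

# 2. Last brick (re)start event, i.e. the 'Started running' line above.
grep 'glusterfsd.c.*Started running' "$BRICKLOG" | tail -1

# 3. Last time glusterd saw the brick connect (needs DEBUG log level, as
#    in the logs above). The glusterd log path here is an assumption.
grep '__glusterd_brick_rpc_notify.*Connected to' \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail -1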
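The disputed 'volume start force' behaviour can be settled the same way,
by recording the brick PID before and after the command. Again a sketch,
assuming the same pidfile and the volume name used in this thread; if the
behaviour is as Atin describes, the PID should change only when the brick
was not running to begin with:

#!/bin/sh
# Sketch: does 'gluster volume start ... force' replace a running brick?
PIDFILE=/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid

BEFORE=$(cat "$PIDFILE" 2>/dev/null)
gluster volume start c_glusterfs force
sleep 2   # give glusterd a moment to (re)spawn the brick if needed
AFTER=$(cat "$PIDFILE" 2>/dev/null)

if [ -n "$BEFORE" ] && [ "$BEFORE" = "$AFTER" ]; then
    echo "brick PID unchanged ($BEFORE): running brick was left alone"
else
    echo "brick PID changed ($BEFORE -> $AFTER): brick was (re)started"
fi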
ABHISHEK PALIWAL
2016-Apr-14 10:37 UTC
[Gluster-users] [Gluster-devel] Gluster Brick Offline after reboot!!
On Thu, Apr 14, 2016 at 2:33 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
> On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
>
> [...]
>
> Here is the sequence in which glusterd and the respective brick process
> were restarted.
>
> 1. glusterd restart trigger - line number 1014 in glusterd.log:
>
> [2016-04-03 10:12:29.051735] I [MSGID: 100030] [glusterfsd.c:2318:main]
> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
> (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level DEBUG)
> 2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log:
>
> [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
> (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
> c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
> /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> --brick-name /opt/lvmdir/c2/brick -l
> /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
>
> 3. The following log indicates that the brick is up and has now started.
> Refer to line 16123 in glusterd.log:
>
> [2016-04-03 10:14:25.336855] D [MSGID: 0]
> [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> Connected to 10.32.1.144:/opt/lvmdir/c2/brick
>
> This clearly indicates that the brick is up and running, as after that I
> do not see any disconnect event processed by glusterd for the brick
> process.

Thanks for replying descriptively, but please also clear up some more
doubts:

1. At the 10:14:25 timestamp the brick is available because we had
removed the brick and added it again to bring it online. The following
are the logs from the cmd-history.log file of 000300:

[2016-04-03 10:14:21.446570]  : volume status : SUCCESS
[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

Also, 10:12:29 was the last reboot time before this failure, so I fully
agree with what you said earlier.

2. As you said, glusterd restarted at 10:12:29. Why, then, are we not
getting any 'brick start trigger' logs like the one below between the
10:12:29 and 10:14:25 timestamps, an interval of roughly two minutes?

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
(args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
3. We were continuously checking the brick status during the above time
window using "gluster volume status"; refer to the cmd-history.log file
from 000300. In the glusterd.log file we also get the logs below

[2016-04-03 10:12:31.771051] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

[2016-04-03 10:12:32.981152] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

twice between 10:12:29 and 10:14:25, and as you said these logs "clearly
indicate that the brick is up and running". Why, then, is the brick not
online in the "gluster volume status" output?

[2016-04-03 10:12:33.990487]  : volume status : SUCCESS
[2016-04-03 10:12:34.007469]  : volume status : SUCCESS
[2016-04-03 10:12:35.095918]  : volume status : SUCCESS
[2016-04-03 10:12:35.126369]  : volume status : SUCCESS
[2016-04-03 10:12:36.224018]  : volume status : SUCCESS
[2016-04-03 10:12:36.251032]  : volume status : SUCCESS
[2016-04-03 10:12:37.352377]  : volume status : SUCCESS
[2016-04-03 10:12:37.374028]  : volume status : SUCCESS
[2016-04-03 10:12:38.446148]  : volume status : SUCCESS
[2016-04-03 10:12:38.468860]  : volume status : SUCCESS
[2016-04-03 10:12:39.534017]  : volume status : SUCCESS
[2016-04-03 10:12:39.553711]  : volume status : SUCCESS
[2016-04-03 10:12:40.616610]  : volume status : SUCCESS
[2016-04-03 10:12:40.636354]  : volume status : SUCCESS
......
......
......
[2016-04-03 10:14:21.446570]  : volume status : SUCCESS
[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

In the above logs we were continuously checking the brick status, but
when we did not find the brick 'online' even after ~2 minutes, we removed
it and added it again to bring it online (the full sequence is sketched
below):

[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

That is why in the logs we get the 'brick start trigger' log at timestamp
10:14:25:

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
(args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
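Taken together, the check-and-recover logic described in points 1-3 boils
down to something like the sketch below. The gluster commands are
verbatim from the cmd-history.log excerpts; the status parsing and the
120-second timeout are illustrative assumptions, not the exact script:

#!/bin/sh
# Sketch of the recovery sequence from cmd-history.log: poll 'volume
# status' for ~2 minutes, then remove and re-add the brick if it never
# shows up as online.
VOL=c_glusterfs
PEER=10.32.1.144
BRICK=$PEER:/opt/lvmdir/c2/brick

i=0
while [ $i -lt 120 ]; do
    # Illustrative parsing: in 3.7.x 'volume status' prints Y/N in the
    # Online column; adjust the match to your version's output.
    if gluster volume status "$VOL" | grep "$BRICK" | grep -q ' Y '; then
        echo "brick is online"
        exit 0
    fi
    sleep 1
    i=$((i + 1))
done

# Still offline after ~2 minutes: remove and re-add the brick, as in the
# cmd-history.log excerpts (--mode=script suppresses the interactive
# confirmation prompts).
gluster --mode=script volume remove-brick "$VOL" replica 1 "$BRICK" force
gluster peer detach "$PEER"
gluster peer probe "$PEER"
gluster --mode=script volume add-brick "$VOL" replica 2 "$BRICK" force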
Regards,
Abhishek

> Please note that all the logs referred to and pasted are from 002500.
>
> ~Atin
>
> > 002500 - Board B, where the brick is offline
> > 000300 - Board A logs
> >
> > *Question: What could be contributing to the brick being offline?*
> >
> > --
> > Regards
> > Abhishek Paliwal

--
Regards
Abhishek Paliwal