Atin Mukherjee
2016-Apr-14 09:03 UTC
[Gluster-users] [Gluster-devel] Gluster Brick Offline after reboot!!
On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
>
> On Tue, Apr 5, 2016 at 2:22 PM, Atin Mukherjee <amukherj at redhat.com
> <mailto:amukherj at redhat.com>> wrote:
>
> > On 04/05/2016 01:04 PM, ABHISHEK PALIWAL wrote:
> > Hi Team,
> >
> > We are using Gluster 3.7.6 and facing a problem in which a brick does
> > not come online after the board is restarted.
> >
> > To understand our setup, please see the following steps:
> > 1. We have two boards, A and B, on which a Gluster volume runs in
> >    replicated mode with one brick on each board.
> > 2. The Gluster mount point is present on Board A and is shared by a
> >    number of processes.
> > 3. Up to this point the volume is in sync and everything works fine.
> > 4. Now we have a test case in which we stop glusterd, reboot Board B,
> >    and when the board comes up, start glusterd on it again.
> > 5. We repeated step 4 multiple times to check the reliability of the
> >    system.
> > 6. After step 4 the system sometimes comes up in a working state (i.e.
> >    in sync), but sometimes the brick of Board B is listed in the
> >    'gluster volume status' output yet does not come online even after
> >    waiting for more than a minute.
>
> As I mentioned in another email thread, until and unless the log shows
> evidence that there was a reboot, nothing can be concluded. The last
> log you shared with us a few days back gave no indication that the
> brick process wasn't running.
>
> How can we identify from the brick logs that the brick process is
> running?
>
> > 7. While step 4 is executing, some processes on Board A start
> >    accessing files from the Gluster mount point.
> >
> > As a solution to bring this brick online, we found existing threads on
> > the Gluster mailing list suggesting 'gluster volume start <vol_name>
> > force' to take the brick from 'offline' to 'online'.
> >
> > If we use the 'gluster volume start <vol_name> force' command, it will
> > kill the existing volume process and start a new one. What will happen
> > if other processes are accessing the same volume at the moment the
> > volume process is killed internally by this command? Will it cause any
> > failure in those processes?
>
> This is not true: volume start force will start the brick processes
> only if they are not running. Running brick processes will not be
> interrupted.
>
> We have tried this and checked the PID of the process before and after
> the force start: the PID had changed after the force start.
>
> Please find the logs at the time of failure attached once again with
> log-level=debug.
>
> If you can point to the exact line in the brick log file where you see
> that the brick process is running, please give me the line number in
> that file.

Here is the sequence in which glusterd and the respective brick process
were restarted.

1. glusterd restart trigger - line number 1014 in glusterd.log:

[2016-04-03 10:12:29.051735] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
(args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level DEBUG)
2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log:

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
(args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)

3. The following log indicates that the brick is up and has now started.
Refer to line 16123 in glusterd.log:

[2016-04-03 10:14:25.336855] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

This clearly indicates that the brick is up and running, as after that I
do not see any disconnect event processed by glusterd for the brick
process.

Please note that all the logs referred to and pasted are from 002500.

~Atin

> 002500 - Board B, where the brick is offline
> 000300 - Board A logs
>
> *Question: What could be contributing to the brick being offline?*
>
> --
> Regards
> Abhishek Paliwal
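To Abhishek's question about how to tell from the logs that a brick
process is running: the checks Atin describes can be scripted. A minimal
sketch, using the pidfile and brick log path taken from the glusterfsd
arguments quoted above (the glusterd log location is an assumption and
may differ per setup):

#!/bin/sh
# Sketch: is the brick process for /opt/lvmdir/c2/brick alive, and what
# do the logs say? Paths come from the glusterfsd args quoted above.
PIDFILE=/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
BRICKLOG=/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log

# 1. Check the brick process via its pidfile (kill -0 only tests that
#    the process exists; it sends no signal).
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "brick process $(cat "$PIDFILE") is running"
else
    echo "brick process is NOT running"
fi

# 2. Last brick (re)start event, i.e. the 'Started running' line above.
grep 'glusterfsd.c.*Started running' "$BRICKLOG" | tail -1

# 3. Last time glusterd saw the brick connect (needs DEBUG log level, as
#    in the logs above). The glusterd log path here is an assumption.
grep '__glusterd_brick_rpc_notify.*Connected to' \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail -1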
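The disputed 'volume start force' behaviour can be settled the same way,
by recording the brick PID before and after the command. Again a sketch,
assuming the same pidfile and the volume name used in this thread; if the
behaviour is as Atin describes, the PID should change only when the brick
was not running to begin with:

#!/bin/sh
# Sketch: does 'gluster volume start ... force' replace a running brick?
PIDFILE=/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid

BEFORE=$(cat "$PIDFILE" 2>/dev/null)
gluster volume start c_glusterfs force
sleep 2   # give glusterd a moment to (re)spawn the brick if needed
AFTER=$(cat "$PIDFILE" 2>/dev/null)

if [ -n "$BEFORE" ] && [ "$BEFORE" = "$AFTER" ]; then
    echo "brick PID unchanged ($BEFORE): running brick was left alone"
else
    echo "brick PID changed ($BEFORE -> $AFTER): brick was (re)started"
fi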
ABHISHEK PALIWAL
2016-Apr-14 10:37 UTC
[Gluster-users] [Gluster-devel] Gluster Brick Offline after reboot!!
On Thu, Apr 14, 2016 at 2:33 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
> On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
>
> [...]
>
> Here is the sequence in which glusterd and the respective brick process
> were restarted.
>
> 1. glusterd restart trigger - line number 1014 in glusterd.log:
>
> [2016-04-03 10:12:29.051735] I [MSGID: 100030] [glusterfsd.c:2318:main]
> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
> (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level DEBUG)
> 2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log:
>
> [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
> (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
> c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
> /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> --brick-name /opt/lvmdir/c2/brick -l
> /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
>
> 3. The following log indicates that the brick is up and has now started.
> Refer to line 16123 in glusterd.log:
>
> [2016-04-03 10:14:25.336855] D [MSGID: 0]
> [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> Connected to 10.32.1.144:/opt/lvmdir/c2/brick
>
> This clearly indicates that the brick is up and running, as after that I
> do not see any disconnect event processed by glusterd for the brick
> process.

Thanks for replying descriptively, but please also clear up some more
doubts:

1. At the 10:14:25 timestamp the brick is available because we had
removed the brick and added it again to bring it online. The following
are the logs from the cmd-history.log file of 000300:

[2016-04-03 10:14:21.446570]  : volume status : SUCCESS
[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

Also, 10:12:29 was the last reboot time before this failure, so I fully
agree with what you said earlier.

2. As you said, glusterd restarted at 10:12:29. Why, then, are we not
getting any 'brick start trigger' logs like the one below between the
10:12:29 and 10:14:25 timestamps, an interval of roughly two minutes?

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
(args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
3. We were continuously checking the brick status during the above time
window using "gluster volume status"; refer to the cmd-history.log file
from 000300. In the glusterd.log file we also get the logs below

[2016-04-03 10:12:31.771051] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

[2016-04-03 10:12:32.981152] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

twice between 10:12:29 and 10:14:25, and as you said these logs "clearly
indicate that the brick is up and running". Why, then, is the brick not
online in the "gluster volume status" output?

[2016-04-03 10:12:33.990487]  : volume status : SUCCESS
[2016-04-03 10:12:34.007469]  : volume status : SUCCESS
[2016-04-03 10:12:35.095918]  : volume status : SUCCESS
[2016-04-03 10:12:35.126369]  : volume status : SUCCESS
[2016-04-03 10:12:36.224018]  : volume status : SUCCESS
[2016-04-03 10:12:36.251032]  : volume status : SUCCESS
[2016-04-03 10:12:37.352377]  : volume status : SUCCESS
[2016-04-03 10:12:37.374028]  : volume status : SUCCESS
[2016-04-03 10:12:38.446148]  : volume status : SUCCESS
[2016-04-03 10:12:38.468860]  : volume status : SUCCESS
[2016-04-03 10:12:39.534017]  : volume status : SUCCESS
[2016-04-03 10:12:39.553711]  : volume status : SUCCESS
[2016-04-03 10:12:40.616610]  : volume status : SUCCESS
[2016-04-03 10:12:40.636354]  : volume status : SUCCESS
......
......
......
[2016-04-03 10:14:21.446570]  : volume status : SUCCESS
[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

In the above logs we were continuously checking the brick status, but
when we did not find the brick 'online' even after ~2 minutes, we removed
it and added it again to bring it online (the full sequence is sketched
below):

[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

That is why in the logs we get the 'brick start trigger' log at timestamp
10:14:25:

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.6
(args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
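Taken together, the check-and-recover logic described in points 1-3 boils
down to something like the sketch below. The gluster commands are
verbatim from the cmd-history.log excerpts; the status parsing and the
120-second timeout are illustrative assumptions, not the exact script:

#!/bin/sh
# Sketch of the recovery sequence from cmd-history.log: poll 'volume
# status' for ~2 minutes, then remove and re-add the brick if it never
# shows up as online.
VOL=c_glusterfs
PEER=10.32.1.144
BRICK=$PEER:/opt/lvmdir/c2/brick

i=0
while [ $i -lt 120 ]; do
    # Illustrative parsing: in 3.7.x 'volume status' prints Y/N in the
    # Online column; adjust the match to your version's output.
    if gluster volume status "$VOL" | grep "$BRICK" | grep -q ' Y '; then
        echo "brick is online"
        exit 0
    fi
    sleep 1
    i=$((i + 1))
done

# Still offline after ~2 minutes: remove and re-add the brick, as in the
# cmd-history.log excerpts (--mode=script suppresses the interactive
# confirmation prompts).
gluster --mode=script volume remove-brick "$VOL" replica 1 "$BRICK" force
gluster peer detach "$PEER"
gluster peer probe "$PEER"
gluster --mode=script volume add-brick "$VOL" replica 2 "$BRICK" force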
Regards,
Abhishek

> Please note that all the logs referred to and pasted are from 002500.
>
> ~Atin
>
> > 002500 - Board B, where the brick is offline
> > 000300 - Board A logs
> >
> > *Question: What could be contributing to the brick being offline?*
> >
> > --
> > Regards
> > Abhishek Paliwal

--
Regards
Abhishek Paliwal