You'd basically have to copy the content of /var/lib/glusterd from fs001 to
fs003 without overwriting fs003's node-specific details. Please ensure you
don't touch the glusterd.info file or the contents of /var/lib/glusterd/peers
on fs003; everything else can be copied. After that, I expect glusterd will
come up.
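
Roughly something like this (a sketch only; it assumes rsync and ssh access
between the nodes, and the short hostnames from this thread stand in for the
real ones):

  # on fs003
  systemctl stop glusterd        # or "service glusterd stop" on older systems

  # keep a backup of the current state first
  cp -a /var/lib/glusterd /var/lib/glusterd.bak

  # pull everything from fs001 except the node-specific files
  rsync -av --exclude=glusterd.info --exclude=peers \
      fs001:/var/lib/glusterd/ /var/lib/glusterd/

  # bring glusterd back up and verify
  systemctl start glusterd
  gluster peer status
  gluster volume info hpcscratch
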
On Fri, 26 May 2017 at 20:30, Jarsulic, Michael [CRI] <
mjarsulic at bsd.uchicago.edu> wrote:
> Here is some further information on this issue:
>
> The version of gluster we are using is 3.7.6.
>
> Also, the error I found in the cmd history is:
> [2017-05-26 04:28:28.332700] : volume remove-brick hpcscratch
> cri16fs001-ib:/data/brick1/scratch commit : FAILED : Commit failed on
> cri16fs003-ib. Please check log file for details.
>
> I did not notice this at the time and attempted to remove the next
> brick to migrate its data off the system. This left the servers in the
> following state.
>
> fs001 - /var/lib/glusterd/vols/hpcscratch/info
>
> type=0
> count=3
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=42
> transport-type=0
> volume-id=80b8eeed-1e72-45b9-8402-e01ae0130105
> ?
> op-version=30700
> client-op-version=3
> quota-version=0
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> server.event-threads=8
> performance.client-io-threads=on
> client.event-threads=8
> performance.cache-size=32MB
> performance.readdir-ahead=on
> brick-0=cri16fs001-ib:-data-brick2-scratch
> brick-1=cri16fs003-ib:-data-brick5-scratch
> brick-2=cri16fs003-ib:-data-brick6-scratch
>
>
> fs003 - /var/lib/glusterd/vols/hpcscratch/info
>
> type=0
> count=4
> status=1
> sub_count=0
> stripe_count=1
> replica_count=1
> disperse_count=0
> redundancy_count=0
> version=35
> transport-type=0
> volume-id=80b8eeed-1e72-45b9-8402-e01ae0130105
> ?
> op-version=30700
> client-op-version=3
> quota-version=0
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> performance.readdir-ahead=on
> performance.cache-size=32MB
> client.event-threads=8
> performance.client-io-threads=on
> server.event-threads=8
> brick-0=cri16fs001-ib:-data-brick1-scratch
> brick-1=cri16fs001-ib:-data-brick2-scratch
> brick-2=cri16fs003-ib:-data-brick5-scratch
> brick-3=cri16fs003-ib:-data-brick6-scratch
>
>
> fs001 - /var/lib/glusterd/vols/hpcscratch/node_state.info
>
> rebalance_status=5
> status=4
> rebalance_op=0
> rebalance-id=00000000-0000-0000-0000-000000000000
> brick1=cri16fs001-ib:/data/brick2/scratch
> count=1
>
>
> fs003 - /var/lib/glusterd/vols/hpcscratch/node_state.info
>
> rebalance_status=1
> status=0
> rebalance_op=9
> rebalance-id=0184577f-eb64-4af9-924d-91ead0605a1e
> brick1=cri16fs001-ib:/data/brick1/scratch
> count=1
>
>
> --
> Mike Jarsulic
>
>
> On 5/26/17, 8:22 AM, "gluster-users-bounces at gluster.org on behalf of
> Jarsulic, Michael [CRI]" <gluster-users-bounces at gluster.org on behalf of
> mjarsulic at bsd.uchicago.edu> wrote:
>
> Recently, I had some problems with the OS hard drives in my glusterd
> servers and took one of my systems down for maintenance. The first step was
> to remove one of the bricks (brick1) hosted on the server (fs001). The data
> migrated successfully and completed last night. After that, I went to
> commit the changes and the commit failed. Afterwards, glusterd will not
> start on one of my servers (fs003). When I check the glusterd logs on fs003
> I get the following errors whenever glusterd starts:
>
> [2017-05-26 04:37:21.358932] I [MSGID: 100030]
> [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running
> /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd
> --pid-file=/var/run/glusterd.pid)
> [2017-05-26 04:37:21.382630] I [MSGID: 106478] [glusterd.c:1350:init]
> 0-management: Maximum allowed open file descriptors set to 65536
> [2017-05-26 04:37:21.382712] I [MSGID: 106479] [glusterd.c:1399:init]
> 0-management: Using /var/lib/glusterd as working directory
> [2017-05-26 04:37:21.422858] I [MSGID: 106228]
> [glusterd.c:433:glusterd_check_gsync_present] 0-glusterd: geo-replication
> module not installed in the system [No such file or directory]
> [2017-05-26 04:37:21.450123] I [MSGID: 106513]
> [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved
> op-version: 30706
> [2017-05-26 04:37:21.463812] E [MSGID: 101032]
> [store.c:434:gf_store_handle_retrieve] 0-: Path corresponding to
> /var/lib/glusterd/vols/hpcscratch/bricks/cri16fs001-ib:-data-brick1-scratch.
> [No such file or directory]
> [2017-05-26 04:37:21.463866] E [MSGID: 106201]
> [glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management:
> Unable to restore volume: hpcscratch
> [2017-05-26 04:37:21.463919] E [MSGID: 101019]
> [xlator.c:428:xlator_init] 0-management: Initialization of volume
> 'management' failed, review your volfile again
> [2017-05-26 04:37:21.463943] E [graph.c:322:glusterfs_graph_init]
> 0-management: initializing translator failed
> [2017-05-26 04:37:21.463970] E [graph.c:661:glusterfs_graph_activate]
> 0-graph: init failed
> [2017-05-26 04:37:21.466703] W [glusterfsd.c:1236:cleanup_and_exit]
> (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xda) [0x405cba]
> -->/usr/sbin/glusterd(glusterfs_process_volfp+0x116) [0x405b96]
> -->/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received
> signum (0), shutting down
>
> The volume is distribute-only. The problem looks to me like glusterd
> is still expecting brick1 on fs001 to be part of the volume. Is there any
> way to recover from this? Is there any more information I can provide?
>
>
> --
> Mike Jarsulic
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
--
- Atin (atinm)