thr3ads.net - Gluster users - [Gluster-users] File Corruption when adding bricks to live replica volumes [Jan 2016]


Krutika Dhananjay

2016-Jan-19 12:06 UTC

[Gluster-users] File Corruption when adding bricks to live replica volumes

Hi Lindsay, 

Just to be sure we are not missing any steps here, you did invoke 'gluster
volume heal datastore1 full' after adding the third brick, before the heal
could begin, right?

As far as the reverse heal is concerned, there is one issue with add-brick where
replica count is increased, which is still under review.
Could you instead try the following steps at the time of add-brick and tell me
if it works fine:

1. Run 'gluster volume add-brick datastore1 replica 3
vng.proxmox.softlog:/vmdata/datastore1' as usual.
2. Kill the glusterfsd process corresponding to the newly added brick (the
brick on vng in your case). You should be able to get its pid from the output
of 'gluster volume status datastore1'.
3. Create a dummy file at the root of the volume from the mount point. It can
have any random name.
4. Delete the dummy file created in step 3.
5. Bring the killed brick back up. For this, you can run 'gluster volume
start datastore1 force'.
6. Then execute 'gluster volume heal datastore1 full' on the node with
the highest uuid (this we know how to do from the previous thread on the same
topic).

Then monitor heal-info output to track heal progress. 
Let me know if this works. 
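
For anyone following along, the six steps above could be wrapped in a small shell helper, roughly as sketched here. This is only an illustration, not something from the thread: the function names, the DRY_RUN switch and the mount path /mnt/datastore1 are assumptions, and by default the helper only prints the commands rather than running them. The brick pid still has to be read by hand from the 'gluster volume status' output, the kill has to happen on the node hosting the new brick, and the full heal has to be started on the node with the highest uuid.

```shell
#!/bin/sh
# Sketch of the add-brick workaround. Names below are illustrative only.
VOLUME="${VOLUME:-datastore1}"
NEW_BRICK="${NEW_BRICK:-vng.proxmox.softlog:/vmdata/datastore1}"
MOUNT_POINT="${MOUNT_POINT:-/mnt/datastore1}"   # assumption: client mount path
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1, then show volume status so the new brick's pid can be read off.
add_brick() {
    run gluster volume add-brick "$VOLUME" replica 3 "$NEW_BRICK" &&
    run gluster volume status "$VOLUME"
}

# Steps 2-5. Argument: pid of the new brick's glusterfsd process.
# Meant to be run on the node that hosts the new brick.
cycle_brick() {
    pid="${1:-}"
    if [ -z "$pid" ]; then
        echo "usage: cycle_brick <glusterfsd-pid>" >&2
        return 1
    fi
    dummy="$MOUNT_POINT/heal-trigger.$$"
    run kill "$pid" &&
    run touch "$dummy" &&
    run rm -f "$dummy" &&
    run gluster volume start "$VOLUME" force
}

# Step 6 plus a first look at heal progress.
# Meant to be run on the node with the highest uuid.
start_full_heal() {
    run gluster volume heal "$VOLUME" full &&
    run gluster volume heal "$VOLUME" info
}
```

With the defaults, calling add_brick, then cycle_brick 1234 (with 1234 replaced by the real pid), then start_full_heal just prints the command sequence for review; setting DRY_RUN=0 runs the same commands for real.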

-Krutika 

----- Original Message -----
> From: "Lindsay Mathieson" <lindsay.mathieson at gmail.com>
> To: "gluster-users" <Gluster-users at gluster.org>
> Sent: Tuesday, January 19, 2016 4:54:07 PM
> Subject: [Gluster-users] File Corruption when adding bricks to live replica
> volumes
>
> gluster 3.7.6
>
> I seem to be able to reliably reproduce this. I have a replica 2 volume with
> 1 test VM image. While the VM is running with heavy disk read/writes (disk
> benchmark) I add a 3rd brick for replica 3:
>
> gluster volume add-brick datastore1 replica 3
> vng.proxmox.softlog:/vmdata/datastore1
>
> I pretty much immediately get this:
>
> > gluster volume heal datastore1 info
> >
> > Brick vna.proxmox.softlog:/vmdata/datastore1
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.20
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.55 - Possibly undergoing heal
> > /images/301/vm-301-disk-1.qcow2 - Possibly undergoing heal
> > Number of entries: 4
> >
> > Brick vnb.proxmox.softlog:/vmdata/datastore1
> > /images/301/vm-301-disk-1.qcow2 - Possibly undergoing heal
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.55 - Possibly undergoing heal
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.20
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
> > Number of entries: 4
> >
> > Brick vng.proxmox.softlog:/vmdata/datastore1
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.16
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.28
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.1
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.77
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.9
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.5
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.2
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.26
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.15
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.13
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.3
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.18
> > Number of entries: 13
>
> The brick on vng is the new empty brick, but it has 13 shards being healed
> back to vna & vnb. That can't be right, and if I leave it the VM becomes
> hopelessly corrupted. Also, there are 81 shards in the file; they should all
> be queued for healing.
>
> Additionally, I get read errors when I run a qemu-img check on the VM image.
>
> If I remove the vng brick the problems are resolved.
>
> If I do the same process while the VM is not running, i.e. no files are being
> accessed, everything proceeds as expected. All shards on vna & vnb are healed
> to vng.
>
> --
> Lindsay Mathieson

Lindsay Mathieson

2016-Jan-19 12:41 UTC


[Gluster-users] File Corruption when adding bricks to live replica volumes

On 19/01/2016 10:06 PM, Krutika Dhananjay wrote:
> Just to be sure we are not missing any steps here, you did invoke
> 'gluster volume heal datastore1 full' after adding the third brick,
> before the heal could begin, right?

Possibly not. First I immediately ran 'gluster volume heal datastore1 
info' which showed the oddball heal in progress. Then I ran the 'heal 
full' which didn't change anything (on the highest uuid node :))
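
To make a reverse heal like this easier to spot, the 'heal info' output could be condensed to one line per brick, roughly as sketched below. This is an illustration only: the function names are made up, and the parsing assumes the output layout quoted earlier in the thread (a "Brick host:/path" header followed by one path per line). Only lines that begin with "/" are counted as entries.

```shell
#!/bin/sh
# Reads 'gluster volume heal <volume> info' output on stdin and prints
# "<brick> <entry-count>" for each brick, in the order the bricks appear.
summarize_heal_info() {
    awk '
        /^Brick / { brick = $2; order[++n] = brick; count[brick] = 0; next }
        /^\// && brick != "" { count[brick]++ }
        END { for (i = 1; i <= n; i++) printf "%s %d\n", order[i], count[order[i]] }
    '
}

# Prints only the entry count for the brick named in $1.
entries_for_brick() {
    summarize_heal_info | awk -v b="$1" '$1 == b { print $2 }'
}
```

For example, 'gluster volume heal datastore1 info | entries_for_brick vng.proxmox.softlog:/vmdata/datastore1' prints the number of entries listed under the new brick. A non-zero count there right after add-brick is the symptom described in this thread.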

>
> As far as the reverse heal is concerned, there is one issue with 
> add-brick where replica count is increased, which is still under review.
> Could you instead try the following steps at the time of add-brick and 
> tell me if it works fine:
>
> 1. Run 'gluster volume add-brick datastore1 replica 3 
> vng.proxmox.softlog:/vmdata/datastore1' as usual.
>
> 2. Kill the glusterfsd process corresponding to newly added brick (the 
> brick in vng in your case). You should be able to get its pid in the 
> output of 'gluster volume status datastore1'.
> 3. Create a dummy file on the root of the volume from the mount point. 
> This can be any random name.
> 4. Delete the dummy file created in step 3.
> 5. Bring the killed brick back up. For this, you can run 'gluster 
> volume start datastore1 force'.
> 6. Then execute 'gluster volume heal datastore1 full' on the node with
> the highest uuid (this we know how to do from the previous thread on
> the same topic).
>
> Then monitor heal-info output to track heal progress.
> Let me know if this works.

Will do - not right now, have to go to bed :) but will let you know 
tomorrow.

Thanks,

-- 
Lindsay Mathieson

Lindsay Mathieson

2016-Jan-21 00:54 UTC


[Gluster-users] File Corruption when adding bricks to live replica volumes

On 19/01/16 22:06, Krutika Dhananjay wrote:
> As far as the reverse heal is concerned, there is one issue with
> add-brick where replica count is increased, which is still under review.
> Could you instead try the following steps at the time of add-brick and 
> tell me if it works fine:
>
> 1. Run 'gluster volume add-brick datastore1 replica 3 
> vng.proxmox.softlog:/vmdata/datastore1' as usual.
>
> 2. Kill the glusterfsd process corresponding to newly added brick (the 
> brick in vng in your case). You should be able to get its pid in the 
> output of 'gluster volume status datastore1'.
> 3. Create a dummy file on the root of the volume from the mount point. 
> This can be any random name.
> 4. Delete the dummy file created in step 3.
> 5. Bring the killed brick back up. For this, you can run 'gluster 
> volume start datastore1 force'.
> 6. Then execute 'gluster volume heal datastore1 full' on the node with
> the highest uuid (this we know how to do from the previous thread on
> the same topic).
>
> Then monitor heal-info output to track heal progress.

I'm afraid it didn't work, Krutika. I still got the reverse heal problem.

NB: I am starting from a replica 3 store, removing a brick, cleaning it,
then re-adding it. Possibly that affects the process?

-- 
Lindsay Mathieson
