Krutika Dhananjay
2016-Jan-19 12:06 UTC
[Gluster-users] File Corruption when adding bricks to live replica volumes
Hi Lindsay,

Just to be sure we are not missing any steps here, you did invoke 'gluster volume heal datastore1 full' after adding the third brick, before the heal could begin, right?

As far as the reverse heal is concerned, there is one issue with add-brick where replica count is increased, which is still under review. Could you instead try the following steps at the time of add-brick and tell me if it works fine:

1. Run 'gluster volume add-brick datastore1 replica 3 vng.proxmox.softlog:/vmdata/datastore1' as usual.
2. Kill the glusterfsd process corresponding to the newly added brick (the brick in vng in your case). You should be able to get its pid in the output of 'gluster volume status datastore1'.
3. Create a dummy file on the root of the volume from the mount point. This can be any random name.
4. Delete the dummy file created in step 3.
5. Bring the killed brick back up. For this, you can run 'gluster volume start datastore1 force'.
6. Then execute 'gluster volume heal datastore1 full' on the node with the highest uuid (this we know how to do from the previous thread on the same topic).

Then monitor heal-info output to track heal progress.

Let me know if this works.

-Krutika

----- Original Message -----
> From: "Lindsay Mathieson" <lindsay.mathieson at gmail.com>
> To: "gluster-users" <Gluster-users at gluster.org>
> Sent: Tuesday, January 19, 2016 4:54:07 PM
> Subject: [Gluster-users] File Corruption when adding bricks to live replica volumes
>
> gluster 3.7.6
>
> I seem to be able to reliably reproduce this. I have a replica 2 volume with 1 test VM image. While the VM is running with heavy disk read/writes (disk benchmark) I add a 3rd brick for replica 3:
>
> gluster volume add-brick datastore1 replica 3 vng.proxmox.softlog:/vmdata/datastore1
>
> I pretty much immediately get this:
>
> > gluster volume heal datastore1 info
> > Brick vna.proxmox.softlog:/vmdata/datastore1
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.20
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.55 - Possibly undergoing heal
> > /images/301/vm-301-disk-1.qcow2 - Possibly undergoing heal
> > Number of entries: 4
> >
> > Brick vnb.proxmox.softlog:/vmdata/datastore1
> > /images/301/vm-301-disk-1.qcow2 - Possibly undergoing heal
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.55 - Possibly undergoing heal
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.20
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
> > Number of entries: 4
> >
> > Brick vng.proxmox.softlog:/vmdata/datastore1
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.16
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.28
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.1
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.77
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.9
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.5
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.2
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.26
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.15
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.13
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.3
> > /.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.18
> > Number of entries: 13
>
> The brick on vng is the new empty brick, but it has 13 shards being healed back to vna & vnb. That can't be right, and if I leave it the VM becomes hopelessly corrupted. Also, there are 81 shards in the file; they should all be queued for healing.
>
> Additionally I get read errors when I run a qemu-img check on the VM image. If I remove the vng brick the problems are resolved.
>
> If I do the same process while the VM is not running, i.e. no files are being accessed, everything proceeds as expected. All shards on vna & vnb are healed to vng.
>
> --
> Lindsay Mathieson
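The six-step procedure Krutika proposes in this message could be scripted roughly as follows. This is only a sketch under stated assumptions: the variable names, the `run` helper, the `DRY_RUN` switch and the mount point `/mnt/datastore1` are hypothetical and do not come from the thread. The pid of the new brick is not parsed automatically; it is expected to be read by hand from 'gluster volume status datastore1' after step 1.

```shell
#!/bin/sh
# Sketch of the add-brick workaround from this thread (not an official tool).
# VOLUME and NEW_BRICK are taken from the thread; MOUNT is a hypothetical
# client mount point; DRY_RUN and run() are illustration-only helpers.
VOLUME="${VOLUME:-datastore1}"
NEW_BRICK="${NEW_BRICK:-vng.proxmox.softlog:/vmdata/datastore1}"
MOUNT="${MOUNT:-/mnt/datastore1}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

add_brick() {
    # Step 1: raise the replica count by adding the new brick.
    run gluster volume add-brick "$VOLUME" replica 3 "$NEW_BRICK"
}

cycle_new_brick() {
    # $1: pid of the new brick's glusterfsd process, read by hand from
    #     'gluster volume status' after step 1.
    brick_pid="$1"
    dummy="$MOUNT/add-brick-dummy-$$"
    run kill "$brick_pid"                      # step 2: stop the new brick
    run touch "$dummy"                         # step 3: create a dummy file
    run rm -f "$dummy"                         # step 4: delete it again
    run gluster volume start "$VOLUME" force   # step 5: restart the brick
    # Step 6: per the thread, run this on the node with the highest uuid.
    run gluster volume heal "$VOLUME" full
    run gluster volume heal "$VOLUME" info     # then monitor heal progress
}
```

With the default `DRY_RUN=1`, calling `add_brick` and then `cycle_new_brick <pid>` only prints the commands, which makes it easy to review the sequence before running anything against a live volume.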
Lindsay Mathieson
2016-Jan-19 12:41 UTC
[Gluster-users] File Corruption when adding bricks to live replica volumes
On 19/01/2016 10:06 PM, Krutika Dhananjay wrote:
> Just to be sure we are not missing any steps here, you did invoke
> 'gluster volume heal datastore1 full' after adding the third brick,
> before the heal could begin, right?

Possibly not. First I immediately ran 'gluster volume heal datastore1 info', which showed the oddball heal in progress. Then I ran the 'heal full' (on the highest uuid node :)), which didn't change anything.

> As far as the reverse heal is concerned, there is one issue with
> add-brick where replica count is increased, which is still under review.
> Could you instead try the following steps at the time of add-brick and
> tell me if it works fine:
>
> 1. Run 'gluster volume add-brick datastore1 replica 3
> vng.proxmox.softlog:/vmdata/datastore1' as usual.
> 2. Kill the glusterfsd process corresponding to newly added brick (the
> brick in vng in your case). You should be able to get its pid in the
> output of 'gluster volume status datastore1'.
> 3. Create a dummy file on the root of the volume from the mount point.
> This can be any random name.
> 4. Delete the dummy file created in step 3.
> 5. Bring the killed brick back up. For this, you can run 'gluster
> volume start datastore1 force'.
> 6. Then execute 'gluster volume heal datastore1 full' on the node with
> the highest uuid (this we know how to do from the previous thread on
> the same topic).
>
> Then monitor heal-info output to track heal progress.
> Let me know if this works.

Will do - not right now, have to go to bed :) but will let you know tomorrow.

Thanks,

--
Lindsay Mathieson
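For the "monitor heal-info output" part of the exchange above, a small filter can condense the 'gluster volume heal datastore1 info' listing to one line per brick. This is a minimal sketch, assuming the output has the shape shown earlier in the thread (a "Brick host:/path" line followed by entries and a "Number of entries: N" line); the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: summarise heal-info output read from stdin.
# Assumes the "Brick ..." / "Number of entries: N" layout quoted in the thread.

heal_entry_counts() {
    # Print "<brick> <pending entries>" for every brick in the listing.
    awk '
        /^Brick /             { brick = $2 }
        /^Number of entries:/ { print brick, $NF }
    '
}

heal_pending_total() {
    # Sum the pending entries over all bricks.
    heal_entry_counts | awk '{ total += $2 } END { print total + 0 }'
}
```

Usage would be along the lines of `gluster volume heal datastore1 info | heal_entry_counts`. In the scenario reported here, a non-zero count on the freshly added vng brick right after add-brick is the symptom of the reverse heal.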
Lindsay Mathieson
2016-Jan-21 00:54 UTC
[Gluster-users] File Corruption when adding bricks to live replica volumes
On 19/01/16 22:06, Krutika Dhananjay wrote:
> As far as the reverse heal is concerned, there is one issue with
> add-brick where replica count is increased, which is still under review.
> Could you instead try the following steps at the time of add-brick and
> tell me if it works fine:
>
> 1. Run 'gluster volume add-brick datastore1 replica 3
> vng.proxmox.softlog:/vmdata/datastore1' as usual.
> 2. Kill the glusterfsd process corresponding to newly added brick (the
> brick in vng in your case). You should be able to get its pid in the
> output of 'gluster volume status datastore1'.
> 3. Create a dummy file on the root of the volume from the mount point.
> This can be any random name.
> 4. Delete the dummy file created in step 3.
> 5. Bring the killed brick back up. For this, you can run 'gluster
> volume start datastore1 force'.
> 6. Then execute 'gluster volume heal datastore1 full' on the node with
> the highest uuid (this we know how to do from the previous thread on
> the same topic).
>
> Then monitor heal-info output to track heal progress.

I'm afraid it didn't work, Krutika; I still got the reverse heal problem.

NB: I am starting from a replica 3 store, removing a brick, cleaning it, then re-adding it. Possibly that affects the process?

--
Lindsay Mathieson