Lindsay Mathieson
2016-Jan-19 11:24 UTC
[Gluster-users] File Corruption when adding bricks to live replica volumes
gluster 3.7.6

I seem to be able to reliably reproduce this. I have a replica 2 volume with one test VM image. While the VM is running with heavy disk reads/writes (a disk benchmark), I add a third brick for replica 3:

gluster volume add-brick datastore1 replica 3 vng.proxmox.softlog:/vmdata/datastore1

I pretty much immediately get this:

gluster volume heal datastore1 info

Brick vna.proxmox.softlog:/vmdata/datastore1
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.20
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.55 - Possibly undergoing heal
/images/301/vm-301-disk-1.qcow2 - Possibly undergoing heal
Number of entries: 4

Brick vnb.proxmox.softlog:/vmdata/datastore1
/images/301/vm-301-disk-1.qcow2 - Possibly undergoing heal
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.55 - Possibly undergoing heal
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.20
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
Number of entries: 4

Brick vng.proxmox.softlog:/vmdata/datastore1
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.16
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.28
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.1
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.22
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.77
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.9
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.5
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.2
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.26
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.15
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.13
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.3
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.18
Number of entries: 13

The brick on vng is the new, empty brick, yet it shows 13 shards being healed back to vna & vnb. That can't be right, and if I leave it, the VM becomes hopelessly corrupted. Also, the image consists of 81 shards; they should all be queued for healing. Additionally, I get read errors when I run a qemu-img check on the VM image. If I remove the vng brick, the problems are resolved.

If I do the same process while the VM is not running - i.e. no files are being accessed - everything proceeds as expected: all shards on vna & vnb are healed to vng.

--
Lindsay Mathieson
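[The check and revert mentioned above, as a shell sketch. The qcow2 path on the client mount and the exact remove-brick invocation for dropping back to replica 2 are assumptions based on the standard gluster/qemu CLIs, not taken verbatim from the report.]

# verify the VM image from the client mount (mount path is an assumption)
qemu-img check /mnt/datastore1/images/301/vm-301-disk-1.qcow2

# revert to replica 2 by removing the newly added brick (syntax assumed from the standard gluster CLI)
gluster volume remove-brick datastore1 replica 2 vng.proxmox.softlog:/vmdata/datastore1 force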
Krutika Dhananjay
2016-Jan-19 12:06 UTC
[Gluster-users] File Corruption when adding bricks to live replica volumes
Hi Lindsay,

Just to be sure we are not missing any steps here: you did invoke 'gluster volume heal datastore1 full' after adding the third brick, so that the heal could begin, right?

As far as the reverse heal is concerned, there is an issue with add-brick when the replica count is increased, for which the fix is still under review. Could you instead try the following steps at the time of add-brick and tell me if it works fine:

1. Run 'gluster volume add-brick datastore1 replica 3 vng.proxmox.softlog:/vmdata/datastore1' as usual.
2. Kill the glusterfsd process corresponding to the newly added brick (the brick on vng in your case). You should be able to get its pid from the output of 'gluster volume status datastore1'.
3. Create a dummy file in the root of the volume from the mount point. It can have any random name.
4. Delete the dummy file created in step 3.
5. Bring the killed brick back up. For this, you can run 'gluster volume start datastore1 force'.
6. Then execute 'gluster volume heal datastore1 full' on the node with the highest uuid (this we know how to do from the previous thread on the same topic), and monitor the heal-info output to track heal progress.

Let me know if this works.

-Krutika
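[The workaround above, gathered into a single shell sketch. The client mount path /mnt/datastore1 and the brick PID are placeholders; the node with the highest UUID has to be chosen as described in step 6.]

gluster volume add-brick datastore1 replica 3 vng.proxmox.softlog:/vmdata/datastore1
gluster volume status datastore1          # note the PID of the new vng brick process
kill <PID-of-vng-brick>                   # placeholder; use the PID from the status output
touch /mnt/datastore1/dummyfile           # any random name, created from the client mount (path assumed)
rm /mnt/datastore1/dummyfile
gluster volume start datastore1 force     # brings the killed brick back up
gluster volume heal datastore1 full       # run on the node with the highest UUID
gluster volume heal datastore1 info       # monitor heal progress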