Roger Lehmann
2015-Jun-04 14:08 UTC
[Gluster-users] GlusterFS 3.6.1 breaks VM images on cluster node restart
Hello, I'm having a serious problem with my GlusterFS cluster.
I'm using Proxmox 3.4 for highly available VM management, which works with
GlusterFS as storage.
Unfortunately, when I restart every node in the cluster one by one (with
online migration of the running HA VM first, of course), the qemu image of
the HA VM gets corrupted and the VM itself has problems accessing it.

May 15 10:35:09 blog kernel: [339003.942602] end_request: I/O error, dev vda, sector 2048
May 15 10:35:09 blog kernel: [339003.942829] Buffer I/O error on device vda1, logical block 0
May 15 10:35:09 blog kernel: [339003.942929] lost page write due to I/O error on vda1
May 15 10:35:09 blog kernel: [339003.942952] end_request: I/O error, dev vda, sector 2072
May 15 10:35:09 blog kernel: [339003.943049] Buffer I/O error on device vda1, logical block 3
May 15 10:35:09 blog kernel: [339003.943146] lost page write due to I/O error on vda1
May 15 10:35:09 blog kernel: [339003.943153] end_request: I/O error, dev vda, sector 4196712
May 15 10:35:09 blog kernel: [339003.943251] Buffer I/O error on device vda1, logical block 524333
May 15 10:35:09 blog kernel: [339003.943350] lost page write due to I/O error on vda1
May 15 10:35:09 blog kernel: [339003.943363] end_request: I/O error, dev vda, sector 4197184

After the image is broken, it's impossible to migrate the VM or start it
when it's down.

root@pve2 ~ # gluster volume heal pve-vol info
Gathering list of entries to be healed on volume pve-vol has been successful

Brick pve1:/var/lib/glusterd/brick
Number of entries: 1
/images//200/vm-200-disk-1.qcow2

Brick pve2:/var/lib/glusterd/brick
Number of entries: 1
/images/200/vm-200-disk-1.qcow2

Brick pve3:/var/lib/glusterd/brick
Number of entries: 1
/images//200/vm-200-disk-1.qcow2

I couldn't really reproduce this in my test environment with GlusterFS 3.6.2,
but I had other problems while testing (possibly because the test environment
is itself virtualized), so I don't want to upgrade to 3.6.2 until I know for
sure that the problems I encountered are fixed there.
Has anybody else experienced this problem? I'm not sure whether issue 1161885
(Possible file corruption on dispersed volumes) is the issue I'm experiencing,
since I have a 3-node replicate cluster, not a dispersed volume.
Thanks for your help!

Regards,
Roger Lehmann
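
One way to narrow down whether the image is genuinely split-brained rather
than just pending heal (a rough sketch only; the volume name pve-vol and the
brick path are taken from the output above, and the exact subcommands
available depend on the 3.6.x build):

    # list files the self-heal daemon currently considers split-brained
    gluster volume heal pve-vol info split-brain

    # inspect the AFR change-tracking xattrs directly on one of the bricks;
    # trusted.afr.* counters on two bricks each blaming the other indicate
    # split-brain
    getfattr -d -m . -e hex /var/lib/glusterd/brick/images/200/vm-200-disk-1.qcow2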
Justin Clift
2015-Jun-04 14:33 UTC
[Gluster-users] GlusterFS 3.6.1 breaks VM images on cluster node restart
On 4 Jun 2015, at 15:08, Roger Lehmann <roger.lehmann at marktjagd.de> wrote:
<snip>
> I couldn't really reproduce this in my test environment with GlusterFS
> 3.6.2 but I had other problems while testing (may also be because of a
> virtualized test environment), so I don't want to upgrade to 3.6.2 until
> I definitely know the problems I encountered are fixed in 3.6.2.
<snip>

Just to point out, version 3.6.3 was released a while ago.  It's effectively
3.6.2 + bug fixes.

Have you looked at testing that? :)

+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
André Bauer
2015-Jun-08 16:22 UTC
[Gluster-users] GlusterFS 3.6.1 breaks VM images on cluster node restart
I saw similar behaviour when the file permissions of the VM image were set
to root:root instead of the hypervisor user.

"chown -R libvirt-qemu:kvm /var/lib/libvirt/images" before starting the VM
did the trick for me...

On 04.06.2015 at 16:08, Roger Lehmann wrote:
<snip>

--
Kind regards

André Bauer

MAGIX Software GmbH
André Bauer
Administrator
August-Bebel-Straße 48
01219 Dresden
GERMANY

tel.: 0351 41884875
e-mail: abauer at magix.net
www.magix.com

Geschäftsführer | Managing Directors: Dr. Arnd Schröder, Michael Keith
Amtsgericht | Commercial Register: Berlin Charlottenburg, HRB 127205
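
A quick way to check whether this is what's biting you (a sketch; the path
and the libvirt-qemu:kvm owner are from my libvirt setup and may differ on a
Proxmox storage mount):

    # check the current owner/group of the image files
    ls -lR /var/lib/libvirt/images

    # reset ownership to the hypervisor user before starting the VM
    chown -R libvirt-qemu:kvm /var/lib/libvirt/images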
Joe Julian
2015-Jun-08 20:10 UTC
[Gluster-users] GlusterFS 3.6.1 breaks VM images on cluster node restart
"Unfortunately, when I restart every node in the cluster sequentially...qemu image of the HA VM gets corrupted..." Even client nodes? Make sure that your client can connect to all of the servers. Make sure, after you restart a server, that the self-heal finishes before you restart the next one. What I suspect is happening is that you restart server A, writes happen on server B. You restart server B before the heal has happened to copy the changes from server A to server B, thus causing the client to write changes to server B. When server A comes back, both server A and server B think they have changes for the other. This is a classic split-brain state. On 06/04/2015 07:08 AM, Roger Lehmann wrote:> Hello, I'm having a serious problem with my GlusterFS cluster. > I'm using Proxmox 3.4 for high available VM management which works > with GlusterFS as storage. > Unfortunately, when I restart every node in the cluster sequentially > one by one (with online migration of the running HA VM first of > course) the qemu image of the HA VM gets corrupted and the VM itself > has problems accessing it. > > May 15 10:35:09 blog kernel: [339003.942602] end_request: I/O error, > dev vda, sector 2048 > May 15 10:35:09 blog kernel: [339003.942829] Buffer I/O error on > device vda1, logical block 0 > May 15 10:35:09 blog kernel: [339003.942929] lost page write due to > I/O error on vda1 > May 15 10:35:09 blog kernel: [339003.942952] end_request: I/O error, > dev vda, sector 2072 > May 15 10:35:09 blog kernel: [339003.943049] Buffer I/O error on > device vda1, logical block 3 > May 15 10:35:09 blog kernel: [339003.943146] lost page write due to > I/O error on vda1 > May 15 10:35:09 blog kernel: [339003.943153] end_request: I/O error, > dev vda, sector 4196712 > May 15 10:35:09 blog kernel: [339003.943251] Buffer I/O error on > device vda1, logical block 524333 > May 15 10:35:09 blog kernel: [339003.943350] lost page write due to > I/O error on vda1 > May 15 10:35:09 blog kernel: [339003.943363] end_request: I/O error, > dev vda, sector 4197184 > > > After the image is broken, it's impossible to migrate the VM or start > it when it's down. > > root at pve2 ~ # gluster volume heal pve-vol info > Gathering list of entries to be healed on volume pve-vol has been > successful > > Brick pve1:/var/lib/glusterd/brick > Number of entries: 1 > /images//200/vm-200-disk-1.qcow2 > > Brick pve2:/var/lib/glusterd/brick > Number of entries: 1 > /images/200/vm-200-disk-1.qcow2 > > Brick pve3:/var/lib/glusterd/brick > Number of entries: 1 > /images//200/vm-200-disk-1.qcow2 > > > > I couldn't really reproduce this in my test environment with GlusterFS > 3.6.2 but I had other problems while testing (may also be because of a > virtualized test environment), so I don't want to upgrade to 3.6.2 > until I definitely know the problems I encountered are fixed in 3.6.2. > Anybody else experienced this problem? I'm not sure if issue 1161885 > (Possible file corruption on dispersed volumes) is the issue I'm > experiencing. I have a 3 node replicate cluster. > Thanks for your help! > > Regards, > Roger Lehmann > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > http://www.gluster.org/mailman/listinfo/gluster-users