João Pagaime
2014-Feb-07  12:13 UTC
[Gluster-users] self-heal stops some vms (virtual machines)
Hello all,

I have a replicated volume that holds KVM VMs (virtual machines).

I had to stop one gluster server for maintenance. That part of the operation went well: no VM problems after the shutdown.

The problems started after booting the gluster server again. Self-healing started as expected, but some VMs locked up with disk problems (time-outs) as self-healing went through them. Some VMs did survive the self-healing; I suppose they were the ones with low I/O activity or less sensitivity to disk problems.

Is there some specific gluster configuration that lets running VMs ride through self-healing? (cluster.data-self-heal-algorithm is already set to diff.)

Are there any tweaks recommended for VMs running on top of gluster?

Current config:

gluster: 3.3.0-1.el6.x86_64

---------------------
volume:

# gluster volume info VOL

Volume Name: VOL
Type: Distributed-Replicate
Volume ID: f44182d9-24eb-4953-9cdd-71464f9517e0
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: one-gluster01:/san02-v2
Brick2: one-gluster02:/san02-v2
Brick3: one-gluster01:/san03
Brick4: one-gluster02:/san04
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: on
auth.allow:x
performance.flush-behind: off
cluster.self-heal-window-size: 1
performance.cache-size: 67108864
cluster.data-self-heal-algorithm: diff
performance.io-thread-count: 32
cluster.min-free-disk: 250GB

Thanks,
best regards,
joao
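For anyone comparing setups in this thread, the reconfigured options can be pulled out of the `gluster volume info` text with a few lines of Python. This is only a minimal sketch, assuming the output layout shown in the post above; the function name `parse_volume_info` and the shortened `SAMPLE` string are made up for illustration and are not part of any Gluster tooling.

```python
def parse_volume_info(text):
    """Split `gluster volume info` text into general fields, bricks and options.

    Assumes the plain "Key: value" layout shown in the post above.
    """
    info = {"fields": {}, "bricks": [], "options": {}}
    section = "fields"
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so brick paths like host:/path survive.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Bricks" and not value:
            section = "bricks"
        elif key == "Options Reconfigured" and not value:
            section = "options"
        elif section == "bricks":
            info["bricks"].append(value)
        elif section == "options":
            info["options"][key] = value
        else:
            info["fields"][key] = value
    return info


# Shortened copy of the output from the post, used only as sample input.
SAMPLE = """\
Volume Name: VOL
Type: Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Bricks:
Brick1: one-gluster01:/san02-v2
Brick2: one-gluster02:/san02-v2
Brick3: one-gluster01:/san03
Brick4: one-gluster02:/san04
Options Reconfigured:
auth.allow:x
cluster.self-heal-window-size: 1
cluster.data-self-heal-algorithm: diff
"""

if __name__ == "__main__":
    parsed = parse_volume_info(SAMPLE)
    # Print only the self-heal related settings.
    for name, value in parsed["options"].items():
        if "self-heal" in name:
            print(name, "=", value)
```

Filtering on the substring `self-heal` is just a convenience for this thread; the returned dictionary keeps every option, so two volumes can be compared key by key.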
Fabio Rosati
2014-Feb-27  09:08 UTC
[Gluster-users] self-heal stops some vms (virtual machines)
Hi all,

I ran into exactly the same problem Joao encountered. After rebooting one of the GlusterFS nodes, self-heal starts and some VMs can't access their disk images anymore.

Logs from one of the VMs after one gluster node has rebooted:

Feb 25 23:35:47 fwrt2 kernel: EXT4-fs error (device dm-2): __ext4_get_inode_loc: unable to read inode block - inode=2145, block=417
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15032608
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307504
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307552
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307568
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307504
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 12972672
Feb 25 23:35:47 fwrt2 kernel: EXT4-fs error (device dm-1): ext4_find_entry: reading directory #123 offset 0
Feb 25 23:35:47 fwrt2 kernel: Core dump to |/usr/libexec/abrt-hook-ccpp 7 0 2757 0 23 1393367747 e pipe failed
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 9250632
Feb 25 23:35:47 fwrt2 kernel: Read-error on swap-device (253:0:30536)
Feb 25 23:35:47 fwrt2 kernel: Read-error on swap-device (253:0:30544)
[...]

A few hours later the VM seemed to be frozen and I had to kill and restart it; no more problems after the reboot.
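To see at a glance which guest devices the errors hit, kernel lines like the ones above can be tallied with a short script. This is a minimal sketch, assuming the `end_request: I/O error, dev ..., sector ...` wording seen in this log; the helper name `tally_io_errors` and the shortened `LOG` sample are made up for illustration.

```python
import re
from collections import Counter

# Matches the block-layer error lines quoted in the log above.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\S+), sector (\d+)")


def tally_io_errors(lines):
    """Return (error count per device, set of distinct failing sectors per device)."""
    per_device = Counter()
    sectors = {}
    for line in lines:
        match = IO_ERROR.search(line)
        if match is None:
            continue  # EXT4, swap and other messages are ignored here
        dev, sector = match.group(1), int(match.group(2))
        per_device[dev] += 1
        sectors.setdefault(dev, set()).add(sector)
    return per_device, sectors


# A few lines copied from the log above, used only as sample input.
LOG = """\
Feb 25 23:35:47 fwrt2 kernel: EXT4-fs error (device dm-2): __ext4_get_inode_loc: unable to read inode block - inode=2145, block=417
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15032608
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307504
Feb 25 23:35:47 fwrt2 kernel: end_request: I/O error, dev vda, sector 15307504
"""

if __name__ == "__main__":
    counts, failing = tally_io_errors(LOG.splitlines())
    for dev, count in counts.items():
        print(dev, count, "errors,", len(failing[dev]), "distinct sectors")
```

Run against the full `/var/log/messages` of an affected guest, the same function would show whether the errors are confined to one virtual disk or spread across several.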
This is the volume layout:

Volume Name: gv_pri
Type: Distributed-Replicate
Volume ID: 3d91b91e-4d72-484f-8655-e5ed8d38bb28
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: nw1glus.gem.local:/glustexp/pri1/brick
Brick2: nw2glus.gem.local:/glustexp/pri1/brick
Brick3: nw3glus.gem.local:/glustexp/pri2/brick
Brick4: nw4glus.gem.local:/glustexp/pri2/brick
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
server.allow-insecure: on
network.remote-dio: on
performance.write-behind-window-size: 16MB
performance.cache-size: 128MB

OS: CentOS 6.5
GlusterFS version: 3.4.2

The qemu-kvm VMs access their qcow2 disk images using the native Gluster support (no FUSE mount). In the Gluster logs I didn't find anything special logged during self-heal, but I can post them if needed.

Does anyone have an idea of what could cause these problems?

Thank you,
Fabio