Niels de Vos
2015-Oct-26 20:56 UTC
[Gluster-users] [Gluster-devel] VM fs becomes read only when one gluster node goes down
On Thu, Oct 22, 2015 at 08:45:04PM +0200, André Bauer wrote:
> Hi,
>
> I have a 4-node GlusterFS 3.5.6 cluster.
>
> My VM images are in a distributed replicated volume which is accessed
> from kvm/qemu via libgfapi.
>
> The mount is against storage.domain.local, which has IPs for all 4
> Gluster nodes set in DNS.
>
> When one of the Gluster nodes goes down (accidental reboot), a lot of
> the VMs get a read-only filesystem, even after the node comes back up.
>
> How can I prevent this? I expect the VM to simply use the replicated
> file on the other node, without the filesystem going read-only.
>
> Any hints?

There are at least two timeouts involved in this problem:

1. The filesystem in a VM can go read-only when the virtual disk where
   the filesystem is located does not respond for a while.

2. When a storage server that holds a replica of the virtual disk
   becomes unreachable, the Gluster client (qemu+libgfapi) waits for at
   most network.ping-timeout seconds before it resumes I/O.

Once a filesystem in a VM goes read-only, you might be able to fsck and
re-mount it read-write again. It is not something a VM will do by
itself.

The timeout for (1) is set in sysfs:

  $ cat /sys/block/sda/device/timeout
  30

30 seconds is the default for SD devices, and for testing you can change
it with an echo:

  # echo 300 > /sys/block/sda/device/timeout

This is not a persistent change; you can create a udev rule to apply it
at bootup.

Some filesystems offer a mount option that changes the behaviour after a
disk error is detected. "man mount" shows the "errors" option for ext*.
Changing this to "continue" is not recommended; "abort" or "panic" are
the safest for your data.

The timeout mentioned in (2) is for the Gluster volume and is checked by
the client. When a client writes to a replicated volume, the write needs
to be acknowledged by both/all replicas. The client (libgfapi) delays
the reply to the application (qemu) until both/all replies from the
replicas have been received. This delay is configured as the volume
option network.ping-timeout (42 seconds by default).

Now, if the VM returns block errors after 30 seconds, and the client
waits up to 42 seconds for recovery, there is an issue... So, your
solution could be to increase the timeout for error detection of the
disks inside the VMs, and/or decrease the network.ping-timeout.

It would be interesting to know if adapting these values prevents the
read-only occurrences in your environment. If you do any testing with
this, please keep me informed about the results.

Niels
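As a rough sketch of the two changes suggested above (the udev rule file
name is arbitrary, the timeout and ping-timeout values are only
illustrative, and the volume name "vmimages" is merely inferred from the
qemu log quoted later in this thread):

  # Inside the VM: persist a larger disk timeout for SCSI/SD disks via a
  # udev rule (type 0 matches disk devices; file name is arbitrary):
  cat > /etc/udev/rules.d/99-disk-timeout.rules <<'EOF'
  ACTION=="add|change", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="300"
  EOF

  # On a Gluster node: lower the ping timeout on the volume:
  gluster volume set vmimages network.ping-timeout 10

Note that the sysfs timeout only exists for emulated SCSI/SATA disks;
virtio-blk devices do not expose such an attribute, as comes up later in
the thread.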
Roman
2015-Oct-26 23:56 UTC
[Gluster-users] [Gluster-devel] VM fs becomes read only when one gluster node goes down
Aren't we talking about this patch?
https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/gluster-backupserver.patch;h=ad241ee1154ebbd536d7c2c7987d86a02255aba2;hb=HEAD

--
Best regards,
Roman.
André Bauer
2015-Oct-27 18:21 UTC
[Gluster-users] [Gluster-devel] VM fs becomes read only when one gluster node goes down
Hi Niels,

my network.ping-timeout was already set to 5 seconds.

Unfortunately, it seems I don't have the timeout setting in Ubuntu 14.04
for my vda disk.

  ls -al /sys/block/vda/device/

gives me only:

  drwxr-xr-x 4 root root    0 Oct 26 20:21 ./
  drwxr-xr-x 5 root root    0 Oct 26 20:21 ../
  drwxr-xr-x 3 root root    0 Oct 26 20:21 block/
  -r--r--r-- 1 root root 4096 Oct 27 18:13 device
  lrwxrwxrwx 1 root root    0 Oct 27 18:13 driver -> ../../../../bus/virtio/drivers/virtio_blk/
  -r--r--r-- 1 root root 4096 Oct 27 18:13 features
  -r--r--r-- 1 root root 4096 Oct 27 18:13 modalias
  drwxr-xr-x 2 root root    0 Oct 27 18:13 power/
  -r--r--r-- 1 root root 4096 Oct 27 18:13 status
  lrwxrwxrwx 1 root root    0 Oct 26 20:21 subsystem -> ../../../../bus/virtio/
  -rw-r--r-- 1 root root 4096 Oct 26 20:21 uevent
  -r--r--r-- 1 root root 4096 Oct 26 20:21 vendor

Is the quorum setting a problem if you only have 2 replicas? My volume
has these quorum options set:

  cluster.quorum-type: auto
  cluster.server-quorum-type: server

As I understand the documentation
( https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/sect-User_Guide-Managing_Volumes-Quorum.html ),
cluster.server-quorum-ratio is set to "< 50%" by default, which can
never happen if you only have 2 replicas and one node goes down, right?
Do I need cluster.server-quorum-ratio = 50% in this case?

@ Josh

Qemu had this in its log for the time the VM got a read-only filesystem:

  [2015-10-22 17:44:42.699990] E [socket.c:2244:socket_connect_finish]
  0-vmimages-client-2: connection to 192.168.0.43:24007 failed
  (Connection refused)
  [2015-10-22 17:45:03.411721] E [client-handshake.c:1760:client_query_portmap_cbk]
  0-vmimages-client-2: failed to get the port number for remote subvolume.
  Please run 'gluster volume status' on server to see if brick process
  is running.

netstat looks good. As expected, I have connections to all 4 GlusterFS
nodes at the moment.

@ Eivind

I don't think I had a split brain. Only the VM got a read-only
filesystem, not the file on the GlusterFS node.

Regards
André
--
Best regards

André Bauer

MAGIX Software GmbH
Administrator
August-Bebel-Straße 48
01219 Dresden
GERMANY

tel.: 0351 41884875
e-mail: abauer at magix.net
www.magix.com
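For the quorum question, a sketch of how these options can be inspected
and changed from one of the Gluster nodes (the volume name "vmimages" is
only inferred from the qemu log above, and the values are an
illustration rather than a recommendation):

  # Show the options currently set on the volume, including quorum settings:
  gluster volume info vmimages

  # cluster.server-quorum-ratio is a cluster-wide option, set against "all":
  gluster volume set all cluster.server-quorum-ratio 51%

  # Client-side quorum can be relaxed for a 2-way replica, at the cost of
  # allowing writes that may later need split-brain resolution:
  gluster volume set vmimages cluster.quorum-type none

With cluster.quorum-type set to auto on a 2-way replica, writes are in
general only allowed while the first brick of the replica pair is
reachable, so losing that particular node would make the volume
read-only for clients, which would match the symptoms described in this
thread.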