Niels de Vos
2015-Oct-26 20:56 UTC
[Gluster-users] [Gluster-devel] VM fs becomes read only when one gluster node goes down
On Thu, Oct 22, 2015 at 08:45:04PM +0200, André Bauer wrote:
> Hi,
>
> I have a 4-node GlusterFS 3.5.6 cluster.
>
> My VM images are in a distributed replicated volume which is accessed
> from kvm/qemu via libgfapi.
>
> The mount is against storage.domain.local, which has IPs for all 4
> Gluster nodes set in DNS.
>
> When one of the Gluster nodes goes down (accidental reboot), a lot of
> the VMs get a read-only filesystem, even after the node comes back up.
>
> How can I prevent this? I expect the VM to simply use the replicated
> file on the other node, without the filesystem going read-only.
>
> Any hints?

There are at least two timeouts involved in this problem:

1. The filesystem in a VM can go read-only when the virtual disk where
   the filesystem is located does not respond for a while.

2. When a storage server that holds a replica of the virtual disk
   becomes unreachable, the Gluster client (qemu+libgfapi) waits for at
   most network.ping-timeout seconds before it resumes I/O.

Once a filesystem in a VM goes read-only, you might be able to fsck and
re-mount it read-write again. It is not something a VM will do by
itself.

The timeout for (1) is set in sysfs:

  $ cat /sys/block/sda/device/timeout
  30

30 seconds is the default for SD devices, and for testing you can change
it with an echo:

  # echo 300 > /sys/block/sda/device/timeout

This is not a persistent change; you can create a udev rule to apply it
at bootup.

Some filesystems offer a mount option that changes the behaviour after a
disk error is detected. "man mount" shows the "errors" option for ext*.
Changing this to "continue" is not recommended; "abort" or "panic" are
the safest for your data.

The timeout mentioned in (2) is for the Gluster volume and is checked by
the client. When a client writes to a replicated volume, the write needs
to be acknowledged by both/all replicas. The client (libgfapi) delays
the reply to the application (qemu) until both/all replies from the
replicas have been received. This delay is configured as the volume
option network.ping-timeout (42 seconds by default).

Now, if the VM returns block errors after 30 seconds, and the client
waits up to 42 seconds for recovery, there is an issue... So, your
solution could be to increase the timeout for error detection of the
disks inside the VMs, and/or decrease the network.ping-timeout.

It would be interesting to know if adapting these values prevents the
read-only occurrences in your environment. If you do any testing with
this, please keep me informed about the results.

Niels
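As a rough sketch of the two changes suggested above (the udev rule file
name is arbitrary, the timeout and ping-timeout values are only
illustrative, and the volume name "vmimages" is merely inferred from the
qemu log quoted later in this thread):

  # Inside the VM: persist a larger disk timeout for SCSI/SD disks via a
  # udev rule (type 0 matches disk devices; file name is arbitrary):
  cat > /etc/udev/rules.d/99-disk-timeout.rules <<'EOF'
  ACTION=="add|change", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="300"
  EOF

  # On a Gluster node: lower the ping timeout on the volume:
  gluster volume set vmimages network.ping-timeout 10

Note that the sysfs timeout only exists for emulated SCSI/SATA disks;
virtio-blk devices do not expose such an attribute, as comes up later in
the thread.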
Roman
2015-Oct-26 23:56 UTC
[Gluster-users] [Gluster-devel] VM fs becomes read only when one gluster node goes down
Aren't we talking about this patch?
https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/gluster-backupserver.patch;h=ad241ee1154ebbd536d7c2c7987d86a02255aba2;hb=HEAD

--
Best regards,
Roman.
André Bauer
2015-Oct-27 18:21 UTC
[Gluster-users] [Gluster-devel] VM fs becomes read only when one gluster node goes down
Hi Niels,

my network.ping-timeout was already set to 5 seconds.

Unfortunately, it seems I don't have the timeout setting in Ubuntu 14.04
for my vda disk.

  ls -al /sys/block/vda/device/

gives me only:

  drwxr-xr-x 4 root root    0 Oct 26 20:21 ./
  drwxr-xr-x 5 root root    0 Oct 26 20:21 ../
  drwxr-xr-x 3 root root    0 Oct 26 20:21 block/
  -r--r--r-- 1 root root 4096 Oct 27 18:13 device
  lrwxrwxrwx 1 root root    0 Oct 27 18:13 driver -> ../../../../bus/virtio/drivers/virtio_blk/
  -r--r--r-- 1 root root 4096 Oct 27 18:13 features
  -r--r--r-- 1 root root 4096 Oct 27 18:13 modalias
  drwxr-xr-x 2 root root    0 Oct 27 18:13 power/
  -r--r--r-- 1 root root 4096 Oct 27 18:13 status
  lrwxrwxrwx 1 root root    0 Oct 26 20:21 subsystem -> ../../../../bus/virtio/
  -rw-r--r-- 1 root root 4096 Oct 26 20:21 uevent
  -r--r--r-- 1 root root 4096 Oct 26 20:21 vendor

Is the quorum setting a problem if you only have 2 replicas? My volume
has these quorum options set:

  cluster.quorum-type: auto
  cluster.server-quorum-type: server

As I understand the documentation
( https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/sect-User_Guide-Managing_Volumes-Quorum.html ),
cluster.server-quorum-ratio is set to "< 50%" by default, which can
never happen if you only have 2 replicas and one node goes down, right?
Do I need cluster.server-quorum-ratio = 50% in this case?

@ Josh

Qemu had this in its log for the time the VM got a read-only filesystem:

  [2015-10-22 17:44:42.699990] E [socket.c:2244:socket_connect_finish]
  0-vmimages-client-2: connection to 192.168.0.43:24007 failed
  (Connection refused)
  [2015-10-22 17:45:03.411721] E [client-handshake.c:1760:client_query_portmap_cbk]
  0-vmimages-client-2: failed to get the port number for remote subvolume.
  Please run 'gluster volume status' on server to see if brick process
  is running.

netstat looks good. As expected, I have connections to all 4 GlusterFS
nodes at the moment.

@ Eivind

I don't think I had a split brain. Only the VM got a read-only
filesystem, not the file on the GlusterFS node.

Regards
André
--
Best regards

André Bauer

MAGIX Software GmbH
Administrator
August-Bebel-Straße 48
01219 Dresden
GERMANY

tel.: 0351 41884875
e-mail: abauer at magix.net
www.magix.com
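For the quorum question, a sketch of how these options can be inspected
and changed from one of the Gluster nodes (the volume name "vmimages" is
only inferred from the qemu log above, and the values are an
illustration rather than a recommendation):

  # Show the options currently set on the volume, including quorum settings:
  gluster volume info vmimages

  # cluster.server-quorum-ratio is a cluster-wide option, set against "all":
  gluster volume set all cluster.server-quorum-ratio 51%

  # Client-side quorum can be relaxed for a 2-way replica, at the cost of
  # allowing writes that may later need split-brain resolution:
  gluster volume set vmimages cluster.quorum-type none

With cluster.quorum-type set to auto on a 2-way replica, writes are in
general only allowed while the first brick of the replica pair is
reachable, so losing that particular node would make the volume
read-only for clients, which would match the symptoms described in this
thread.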