That really isn't an arbiter issue, or for that matter a Gluster issue. We have
seen the same thing with vanilla NAS servers that had some issue or another.
Arbiter simply makes it less likely to be a problem than replica 2, but in turn
arbiter is less 'safe' than replica 3.

However, in regards to Gluster and RO behaviour:

The default SCSI disk timeout for most OS versions is 30 seconds and the
Gluster network.ping-timeout is 42, so yes, you can trigger an RO event.

# cat /sys/block/sda/device/timeout
30

Though it is easy enough to raise, as Pavel mentioned (see the persistence
sketch after this message):

# echo 90 > /sys/block/sda/device/timeout

As a purely observational note, we have noticed that EXT3/4 filesystems on VMs
will go read-only much more easily than XFS systems (even with the default
timeout and regardless of storage type). We have always wondered about that,
though part of that observation is biased because we tend to use XFS on newer
VMs, which means newer, better kernels.

Likewise, virtio "disks" don't even have a timeout value that I am aware of,
and I don't recall them being particularly sensitive to disk issues on
Gluster, NFS or DAS.

All our newer VMs use virtio instead of SATA/IDE emulation AND XFS, so we
rarely see an RO situation, and when we do, it was a good thing the VMs went
RO to protect themselves while the storage system freaked out.

On 8/23/2017 12:26 PM, lemonnierk at ulrar.net wrote:
> Really? I can't see why. But I've never used arbiter, so you probably
> know more about this than I do.
>
> In any case, with replica 3, I've never had a problem.
>
> On Wed, Aug 23, 2017 at 09:13:28PM +0200, Pavel Szalbot wrote:
>> Hi, I believe it is not that simple. Even a replica 2 + arbiter volume
>> with the default network.ping-timeout will cause the underlying VM to
>> remount its filesystem as read-only (a device error will occur) unless
>> you tune the mount options in the VM's fstab.
>> -ps
>>
>> On Wed, Aug 23, 2017 at 6:59 PM, <lemonnierk at ulrar.net> wrote:
>>> What he is saying is that, on a two-node volume, upgrading a node will
>>> cause the volume to go down. That's nothing weird; you really should
>>> use 3 nodes.
>>>
>>> On Wed, Aug 23, 2017 at 06:51:55PM +0200, Gionatan Danti wrote:
>>>> On 23-08-2017 18:14, Pavel Szalbot wrote:
>>>>> Hi, after many VM crashes during upgrades of Gluster, losing network
>>>>> connectivity on one node etc., I would advise running replica 2 with
>>>>> arbiter.
>>>> Hi Pavel, this is bad news :(
>>>> So, in your case at least, Gluster was not stable? Something as simple
>>>> as an update could crash it?
>>>>
>>>>> I once even managed to break this setup (with arbiter) due to network
>>>>> partitioning - one data node never healed and I had to restore from
>>>>> backups (it was easier, and it was kind of non-production). Be
>>>>> extremely careful and plan for failure.
>>>> I would use VM locking via sanlock or virtlock, so a split brain should
>>>> not cause simultaneous changes on both replicas. I am more concerned
>>>> about volume heal time: what will happen if the standby node
>>>> crashes/reboots? Will *all* data be re-synced from the master, or only
>>>> the changed bits? As stated above, I would like to avoid using
>>>> sharding...
>>>>
>>>> Thanks.
>>>>
>>>> --
>>>> Danti Gionatan
>>>> Supporto Tecnico
>>>> Assyoma S.r.l.
>>>> www.assyoma.it
>>>> email: g.danti at assyoma.it - info at assyoma.it
>>>> GPG public key ID: FF5F32A8
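One note on the timeout commands above: the echo is lost at reboot. Below is a
minimal, untested sketch of making it persistent in a udev-based guest that
uses SATA/IDE emulation, plus the Gluster side of the 30 s vs. 42 s gap. The
rule file name and the volume name "myvol" are placeholders, not anything from
this thread.

# cat > /etc/udev/rules.d/99-disk-timeout.rules <<'EOF'
# raise the SCSI command timeout to 90s on emulated sd* disks (applied at boot/hotplug)
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="90"
EOF

# gluster volume get myvol network.ping-timeout
# gluster volume set myvol network.ping-timeout 42

Lowering network.ping-timeout below the guest disk timeout is the other way to
close the gap, but the Gluster documentation generally warns against reducing
it, since short timeouts make transient network blips look like brick failures.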
Hi,

On Thu, Aug 24, 2017 at 2:13 AM, WK <wkmail at bneit.com> wrote:
> The default timeout for most OS versions is 30 seconds and the Gluster
> timeout is 42, so yes you can trigger an RO event.

I get a read-only mount within approximately 2 seconds after a failed IO.

> Though it is easy enough to raise as Pavel mentioned
>
> # echo 90 > /sys/block/sda/device/timeout

AFAIK this is applicable only to directly attached block devices
(non-virtualized).

> Likewise virtio "disks" don't even have a timeout value that I am aware of
> and I don't recall them being extremely sensitive to disk issues on either
> Gluster, NFS or DAS.

We use only virtio and these problems are persistent - temporarily
suspending a node (e.g. for a HW or Gluster upgrade, or a reboot) is very
scary, because we often end up with read-only filesystems on all VMs.

However, we use ext4, so I cannot comment on XFS.

This discussion will probably end before I migrate VMs from Gluster to
local storage on our OpenStack nodes, but I might run some tests
afterwards and keep you posted.
-ps
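The fstab tuning Pavel mentioned earlier in the thread is, for ext4, presumably
the errors= mount option (and the matching superblock default). A rough sketch
inside the guest - /dev/vda1 is just an example device, and note WK's point
above that going read-only is a protection mechanism, so "continue" is a
deliberate trade-off rather than a fix.

Check what is currently in effect (superblock default plus any errors= option
in fstab):

# tune2fs -l /dev/vda1 | grep -i 'errors behavior'

Either change the superblock default, or override it per mount in /etc/fstab:

# tune2fs -e continue /dev/vda1

/dev/vda1  /  ext4  defaults,errors=continue  0  1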
On 8/23/2017 10:44 PM, Pavel Szalbot wrote:
> Hi,
>
> On Thu, Aug 24, 2017 at 2:13 AM, WK <wkmail at bneit.com> wrote:
>> The default timeout for most OS versions is 30 seconds and the Gluster
>> timeout is 42, so yes you can trigger an RO event.
> I get read-only mount within approximately 2 seconds after failed IO.

Hmm, we don't see that, even on busy VMs. We ARE using QCOW2 disk images,
though.

Also, though we no longer use oVirt, I am still on the list. They are heavy
Gluster users and they would be howling if they all had your experience.

>> Though it is easy enough to raise as Pavel mentioned
>>
>> # echo 90 > /sys/block/sda/device/timeout
> AFAIK this is applicable only for directly attached block devices
> (non-virtualized).

No, if you use SATA/IDE emulation (NOT virtio) it is there WITHIN the VM. We
have a lot of legacy VMs from older projects/workloads that have that, and we
haven't bothered changing them because "they are working fine now".

It is NOT there on virtio.

>> Likewise virtio "disks" don't even have a timeout value that I am aware of
>> and I don't recall them being extremely sensitive to disk issues on either
>> Gluster, NFS or DAS.
> We use only virtio and these problems are persistent - temporarily
> suspending a node (e.g. HW or Gluster upgrade, reboot) is very scary,
> because we often end up with read-only filesystems on all VMs.
>
> However we use ext4, so I cannot comment on XFS.

We use the FUSE mount, because we are lazy and haven't upgraded to libgfapi.
I hope to start a new cluster with libgfapi shortly because of the better
performance.

Also, we use a localhost mount for the gluster driveset on each compute node
(i.e. so-called hyperconverged), so the only gluster-only kit is the
lightweight arbiter box.

So those VMs in the gluster 'pool' have a local write and then only one
off-server write (to the other gluster-enabled compute host), which means
pretty good performance.

We use the Gluster-included 'virt' tuning set of:

performance.quick-read=off
performance.read-ahead=off
performance.io-cache=off
performance.stat-prefetch=off
performance.low-prio-threads=32
network.remote-dio=enable
cluster.eager-lock=enable
cluster.quorum-type=auto
cluster.server-quorum-type=server
cluster.data-self-heal-algorithm=full
cluster.locking-scheme=granular
cluster.shd-max-threads=8
cluster.shd-wait-qlength=10000
features.shard=on
user.cifs=off

We do play with shard size and have settled on 64M, though I've seen
recommendations of 128M and 512M for VMs. We didn't really notice much of a
difference with any of those, as long as they were at least 64M.

> This discussion will probably end before I migrate VMs from Gluster to
> local storage on our Openstack nodes, but I might run some tests
> afterwards and keep you posted.

I would be interested in your results.

You may also look into Ceph. It is more complicated than Gluster (well, more
complicated than our simple little Gluster arrangement), but the OpenStack
people swear by it. It wasn't suited to our needs, but it tested well when we
looked into it last year.
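For anyone wanting to reproduce that setup, a rough sketch of applying the
profile to a volume. "gv0" is a placeholder volume name, and the 'group virt'
shortcut assumes the stock /var/lib/glusterd/groups/virt file shipped with the
Gluster packages, which may not match WK's list exactly.

# gluster volume set gv0 group virt
# gluster volume set gv0 features.shard on
# gluster volume set gv0 features.shard-block-size 64MB
# gluster volume get gv0 features.shard-block-size

As far as I know, changing features.shard-block-size only affects newly
written files; existing images keep the shard size they were created with.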