Nope, not solved!
Looks like directsync just delays the problem; this morning the VM had
thrown a bunch of I/O errors again. I tried writethrough and it seems to
behave exactly like cache=none, the errors appear within a few minutes.
Trying again with directsync and no errors for now, so it looks like
directsync is better than nothing, but it still doesn't solve the problem.
I really can't use this in production: the VM goes read-only after a few
days because it sees too many I/O errors. I must be missing something.
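
For reference, here is roughly how I'm switching the cache mode on the
test VM; the VM ID, storage name and disk name below are only examples,
adapt them to your setup:

  # Proxmox: change the cache mode on the VM's disk
  qm set 100 --virtio0 gluster:100/vm-100-disk-1.qcow2,cache=directsync

  # What qemu does underneath, as far as I understand it:
  #   cache=none         -> O_DIRECT (host page cache bypassed)
  #   cache=directsync   -> O_DIRECT + O_DSYNC (every write flushed)
  #   cache=writethrough -> host page cache used, writes synchronous

The same cache option can also be changed on the disk in the Proxmox GUI.
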
On Tue, May 24, 2016 at 12:24:44PM +0200, Kevin Lemonnier wrote:
> So the VMs were configured with cache set to none; I just tried with
> cache=directsync and it seems to be fixing the issue. Still need to run
> more tests, but I did a couple already with that option and no I/O errors.
>
> Never had to do this before, is it a known issue? I found the clue in
> some old mail from this mailing list; did I miss some doc saying you
> should be using directsync with GlusterFS?
>
> On Tue, May 24, 2016 at 11:33:28AM +0200, Kevin Lemonnier wrote:
> > Hi,
> >
> > Some news on this.
> > I actually don't need to trigger a heal to get corruption, so the
> > problem is not the healing. Live migrating the VM seems to trigger
> > corruption every time, and even without that, just doing a database
> > import, rebooting, then doing another import seems to corrupt as well.
> >
> > To check, I created local storages on each node on the same partition
> > as the gluster bricks, on XFS, moved the VM disk to each local storage
> > and tested the same procedure one by one: no corruption. It seems to
> > happen only on GlusterFS, so I'm not so sure it's hardware anymore: if
> > it was, using local storage would corrupt too, right?
> > Could I be missing some critical configuration for VM storage on my
> > gluster volume?
> >
> >
> > On Mon, May 23, 2016 at 01:54:30PM +0200, Kevin Lemonnier wrote:
> > > Hi,
> > >
> > > I didn't specify it, but I use "localhost" to add the storage in
> > > Proxmox. My thinking is that every Proxmox node is also a GlusterFS
> > > node, so that should work fine.
> > >
> > > I don't want to use the "normal" way of setting a regular address in
> > > there because you can't change it afterwards in Proxmox, but could
> > > that be the source of the problem? Maybe during live migration there
> > > are writes coming from two different servers at the same time?
> > >
> > >
> > >
> > > On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
> > > > Hi,
> > > >
> > > > I will try to recreate this issue tomorrow on my machines with
> > > > the steps that Lindsay provided in this thread. I will let you
> > > > know the result soon after that.
> > > >
> > > > -Krutika
> > > >
> > > > On Wednesday, May 18, 2016, Kevin Lemonnier
> > > > <lemonnierk at ulrar.net> wrote:
> > > > > Hi,
> > > > >
> > > > > Some news on this.
> > > > > Over the weekend the RAID card of the node ipvr2 died, and I
> > > > > thought that maybe that was the problem all along. The RAID
> > > > > card was changed and yesterday I reinstalled everything.
> > > > > Same problem just now.
> > > > >
> > > > > My test is simple: using the website hosted on the VMs all the
> > > > > time, I reboot ipvr50, wait for the heal to complete, migrate
> > > > > all the VMs off ipvr2 then reboot it, wait for the heal to
> > > > > complete, then migrate all the VMs off ipvr3 then reboot it.
> > > > > Every time, the first database VM (which is the only one really
> > > > > using the disk during the heal) starts showing I/O errors on its
> > > > > disk.
> > > > >
> > > > > Am I really the only one with that problem?
> > > > > Maybe one of the drives is dying too, who knows, but SMART isn't
> > > > > saying anything...
> > > > >
> > > > >
> > > > > On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> > > > >> Hi,
> > > > >>
> > > > >> I had a problem some time ago with 3.7.6 and freezing during
> > > > >> heals, and multiple people advised to use 3.7.11 instead.
> > > > >> Indeed, with that version the freeze problem is fixed, it works
> > > > >> like a dream! You can almost not tell that a node is down or
> > > > >> healing; everything keeps working except for a little freeze
> > > > >> when the node just went down and, I assume, hasn't timed out
> > > > >> yet, but that's fine.
> > > > >>
> > > > >> Now I have a 3.7.11 volume on 3 nodes for testing, and the VMs
> > > > >> are Proxmox VMs with qcow2 disks stored on the gluster volume.
> > > > >> Here is the config:
> > > > >>
> > > > >> Volume Name: gluster
> > > > >> Type: Replicate
> > > > >> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> > > > >> Status: Started
> > > > >> Number of Bricks: 1 x 3 = 3
> > > > >> Transport-type: tcp
> > > > >> Bricks:
> > > > >> Brick1: ipvr2.client:/mnt/storage/gluster
> > > > >> Brick2: ipvr3.client:/mnt/storage/gluster
> > > > >> Brick3: ipvr50.client:/mnt/storage/gluster
> > > > >> Options Reconfigured:
> > > > >> cluster.quorum-type: auto
> > > > >> cluster.server-quorum-type: server
> > > > >> network.remote-dio: enable
> > > > >> cluster.eager-lock: enable
> > > > >> performance.quick-read: off
> > > > >> performance.read-ahead: off
> > > > >> performance.io-cache: off
> > > > >> performance.stat-prefetch: off
> > > > >> features.shard: on
> > > > >> features.shard-block-size: 64MB
> > > > >> cluster.data-self-heal-algorithm: full
> > > > >> performance.readdir-ahead: on
> > > > >>
> > > > >>
> > > > >> As mentioned, I rebooted one of the nodes to test the freezing
> > > > >> issue I had on previous versions, and apart from the initial
> > > > >> timeout, nothing: the website hosted on the VMs keeps working
> > > > >> like a charm even during heal.
> > > > >> Since it's testing, there isn't any load on it though, and I
> > > > >> just tried to refresh the database by importing the production
> > > > >> one on the two MySQL VMs, and both of them started doing I/O
> > > > >> errors. I tried shutting them down and powering them on again,
> > > > >> but same thing; even starting full heals by hand doesn't solve
> > > > >> the problem, the disks are corrupted. They still work, but
> > > > >> sometimes they remount their partitions read-only...
> > > > >>
> > > > >> I believe there are a few people already using 3.7.11; has no
> > > > >> one noticed corruption problems?
> > > > >> Anyone using Proxmox? As already mentioned in multiple other
> > > > >> threads on this mailing list by other users, I also pretty much
> > > > >> always have shards in heal info, but nothing "stuck" there:
> > > > >> they always go away in a few seconds, getting replaced by other
> > > > >> shards.
> > > > >>
> > > > >> Thanks
> > > > >>
> > > > >> --
> > > > >> Kevin Lemonnier
> > > > >> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kevin Lemonnier
> > > > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > >
> > >
> > > --
> > > Kevin Lemonnier
> > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> >
> >
> >
> >
> >
> > --
> > Kevin Lemonnier
> > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
>
>
>
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111