Hi,
Some news on this.
I actually don't need to trigger a heal to get corruption, so the problem
is not the healing. Live migrating the VM seems to trigger corruption every
time, and even without that, just doing a database import, rebooting, then
doing another import seems to corrupt the disk as well.
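To be concrete, the two sequences that trigger it look roughly like this
(VM ID, target node and database name are just examples):

  qm migrate 101 ipvr3 --online       # live migration alone is enough
  # or, without any migration:
  mysql mydb < production_dump.sql    # first import
  qm shutdown 101 && qm start 101     # reboot (or reboot inside the guest)
  mysql mydb < production_dump.sql    # second import -> I/O errors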
To check, I created local storage on each node on the same partition as the
gluster bricks (XFS), moved the VM disk to each local storage in turn, and
ran the same procedure: no corruption. It seems to happen only on GlusterFS,
so I'm not so sure it's hardware anymore: if it were, using local storage
would corrupt too, right?
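Roughly what I did for that comparison, in case it matters (the storage IDs,
VM ID and disk slot are just examples; the path is on the same XFS partition
the bricks live on):

  pvesm add dir local-xfs --path /mnt/storage/local-test --content images
  qm move_disk 101 virtio0 local-xfs --delete 1
  # ... same migrate / import / reboot / import test, no corruption ...
  qm move_disk 101 virtio0 gluster-vm --delete 1   # back onto gluster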
Could I be missing some critical configuration for VM storage on my gluster
volume?
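For what it's worth, one thing I could still try is applying the whole
predefined virt group in one go, something like:

  gluster volume set gluster group virt

though as far as I can tell the volume already has most of those options
(quorum, remote-dio, eager-lock, the performance.* ones) set by hand, as
shown in the config further down.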
On Mon, May 23, 2016 at 01:54:30PM +0200, Kevin Lemonnier wrote:
> Hi,
>
> I didn't specify it, but I use "localhost" to add the storage in Proxmox.
> My thinking is that every Proxmox node is also a GlusterFS node, so that
> should work fine.
>
> I don't want to use the "normal" way of setting a regular address in there,
> because you can't change it afterwards in Proxmox. But could that be the
> source of the problem? Maybe during live migration there are writes coming
> from two different servers at the same time?
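>
> For reference, the storage entry in /etc/pve/storage.cfg looks roughly
> like this (the storage ID is just an example):
>
> glusterfs: gluster-vm
>         server localhost
>         volume gluster
>         content images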
>
>
>
> On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
> > Hi,
> >
> > I will try to recreate this issue tomorrow on my machines with the steps
> > that Lindsay provided in this thread. I will let you know the result soon
> > after that.
> >
> > -Krutika
> >
> > On Wednesday, May 18, 2016, Kevin Lemonnier <lemonnierk at ulrar.net> wrote:
> > > Hi,
> > >
> > > Some news on this.
> > > Over the weekend the RAID card of the node ipvr2 died, and I thought
> > > that maybe that was the problem all along. The RAID card was changed
> > > and yesterday I reinstalled everything.
> > > Same problem just now.
> > >
> > > My test is simple: using the website hosted on the VMs all the time,
> > > I reboot ipvr50, wait for the heal to complete, migrate all the VMs off
> > > ipvr2 then reboot it, wait for the heal to complete, then migrate all
> > > the VMs off ipvr3 and reboot it.
> > > Every time, the first database VM (which is the only one really using
> > > the disk during the heal) starts showing I/O errors on its disk.
> > >
> > > Am I really the only one with that problem?
> > > Maybe one of the drives is dying too, who knows, but SMART isn't saying
> > > anything..
> > >
> > >
> > > On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> > >> Hi,
> > >>
> > >> I had a problem some time ago with 3.7.6 and freezing during heals,
> > >> and multiple people advised to use 3.7.11 instead. Indeed, with that
> > >> version the freeze problem is fixed, it works like a dream! You can
> > >> hardly tell that a node is down or healing; everything keeps working
> > >> except for a little freeze when the node has just gone down and, I
> > >> assume, hasn't timed out yet, but that's fine.
> > >>
> > >> Now I have a 3.7.11 volume on 3 nodes for testing, and the VMs are
> > >> Proxmox VMs with qcow2 disks stored on the gluster volume.
> > >> Here is the config:
> > >>
> > >> Volume Name: gluster
> > >> Type: Replicate
> > >> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> > >> Status: Started
> > >> Number of Bricks: 1 x 3 = 3
> > >> Transport-type: tcp
> > >> Bricks:
> > >> Brick1: ipvr2.client:/mnt/storage/gluster
> > >> Brick2: ipvr3.client:/mnt/storage/gluster
> > >> Brick3: ipvr50.client:/mnt/storage/gluster
> > >> Options Reconfigured:
> > >> cluster.quorum-type: auto
> > >> cluster.server-quorum-type: server
> > >> network.remote-dio: enable
> > >> cluster.eager-lock: enable
> > >> performance.quick-read: off
> > >> performance.read-ahead: off
> > >> performance.io-cache: off
> > >> performance.stat-prefetch: off
> > >> features.shard: on
> > >> features.shard-block-size: 64MB
> > >> cluster.data-self-heal-algorithm: full
> > >> performance.readdir-ahead: on
> > >>
> > >>
> > >> As mentioned, I rebooted one of the nodes to test the freezing issue
> > >> I had on previous versions, and apart from the initial timeout,
> > >> nothing: the website hosted on the VMs keeps working like a charm
> > >> even during the heal.
> > >> Since it's a test setup there isn't any load on it, though, and I
> > >> just tried to refresh the database by importing the production one on
> > >> the two MySQL VMs, and both of them started showing I/O errors. I
> > >> tried shutting them down and powering them on again, but same thing;
> > >> even starting full heals by hand doesn't solve the problem, the disks
> > >> are corrupted. They still work, but sometimes they remount their
> > >> partitions read-only..
> > >>
> > >> I believe there are already a few people using 3.7.11; has no one
> > >> noticed corruption problems? Anyone using Proxmox? As already
> > >> mentioned in multiple other threads on this mailing list by other
> > >> users, I also pretty much always have shards in heal info, but
> > >> nothing "stuck" there; they always go away in a few seconds, getting
> > >> replaced by other shards.
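> > >> (For what it's worth, what I'm watching is just the heal queue,
> > >> roughly:
> > >>
> > >> watch -n 1 gluster volume heal gluster info
> > >>
> > >> nothing fancier than that, and the shards listed there keep changing.)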
> > >>
> > >> Thanks
> > >>
> > >> --
> > >> Kevin Lemonnier
> > >> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > >
> > >
> > >
> > >> _______________________________________________
> > >> Gluster-users mailing list
> > >> Gluster-users at gluster.org
> > >> http://www.gluster.org/mailman/listinfo/gluster-users
> > >
> > >
> > > --
> > > Kevin Lemonnier
> > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > >
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111