thr3ads.net - Gluster users - [Gluster-users] Freezing during heal [May 2016]


Krutika Dhananjay

2016-Apr-18 14:47 UTC

[Gluster-users] Freezing during heal

On Mon, Apr 18, 2016 at 8:02 PM, Kevin Lemonnier <lemonnierk at ulrar.net>
wrote:
> I will try migrating to 3.7.10, is it considered stable yet ?
>
Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)

> Should I change the self heal algorithm even if I move to 3.7.10, or is
> that not necessary ?
> Not sure what that change might do.
>
So the other algorithm is 'diff', which computes a rolling checksum on chunks
of the src(es) and sink(s), compares them, and heals on mismatch. This is
known to consume a lot of CPU. The 'full' algo, on the other hand, simply
copies the src into the sink in chunks. With sharding, it shouldn't be all
that bad copying a 256MB file (in your case) from src to sink. We've used
double that block size and had no issues reported.

So you could change self heal algo to full even in the upgraded cluster.
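
To make the difference between the two algorithms concrete, here is a minimal Python sketch. It is not Gluster's implementation: the chunk size is tiny for readability, SHA-256 stands in for the rolling checksum mentioned above, and the names `heal_full` and `heal_diff` are made up for this illustration.

```python
import hashlib

CHUNK = 4  # bytes; deliberately tiny so the sample data stays readable

SRC = b"AAAABBBBCCCCDDDD"    # good copy (source brick)
STALE = b"AAAAxxxxCCCCDDDD"  # out-of-date copy (sink brick), one bad chunk


def heal_full(src, chunk=CHUNK):
    """'full': copy every chunk of the source, no comparison at all."""
    sink = bytearray()
    written = 0
    for i in range(0, len(src), chunk):
        piece = src[i:i + chunk]
        sink += piece
        written += len(piece)
    return bytes(sink), written


def heal_diff(src, stale, chunk=CHUNK):
    """'diff': checksum each chunk on both sides, copy only on mismatch."""
    # Pad or truncate the stale copy so both sides have the same length.
    sink = bytearray(stale[:len(src)].ljust(len(src), b"\0"))
    written = 0
    hashed = 0  # number of checksum computations (the CPU cost)
    for i in range(0, len(src), chunk):
        s = src[i:i + chunk]
        d = bytes(sink[i:i + chunk])
        hashed += 2
        if hashlib.sha256(s).digest() != hashlib.sha256(d).digest():
            sink[i:i + chunk] = s
            written += len(s)
    return bytes(sink), written, hashed


print(heal_full(SRC))
print(heal_diff(SRC, STALE))
```

On this sample, 'full' writes all 16 bytes and computes nothing, while 'diff' writes only the 4 mismatching bytes but pays for 8 checksum computations. That is the trade-off described above: 'diff' saves writes at the cost of CPU, and with small shards the extra copying done by 'full' stays bounded.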

-Krutika

>
> Anyway, I'll try to create a 3.7.10 cluster over the weekend and slowly move
> the VMs onto it then.
> Thanks a lot for your help,
>
> Regards
>
>
> On Mon, Apr 18, 2016 at 07:58:44PM +0530, Krutika Dhananjay wrote:
> > Hi,
> >
> > Yeah, so the fuse mount log didn't convey much information.
> >
> > So one of the reasons heal may have taken so long (and also consumed
> > resources) is a bug in self-heal where it would heal from both source
> > bricks in 3-way replication. With such a bug, heal would take twice the
> > time and consume the same amount of resources both times.
> >
> > This issue is fixed at http://review.gluster.org/#/c/14008/ and will be
> > available in 3.7.12.
> >
> > The other thing you could do is set cluster.data-self-heal-algorithm to
> > 'full', for better heal performance and more regulated resource
> > consumption by the same.
> >  #gluster volume set <VOL> cluster.data-self-heal-algorithm full
> >
> > As far as sharding is concerned, some critical caching issues were fixed
> > in 3.7.7 and 3.7.8.
> > And my guess is that the vm crash/unbootable state could be because of
> > this issue, which exists in 3.7.6.
> >
> > 3.7.10 saw the introduction of throttled client-side heals, which also
> > moves such heals to the background; that is all the more helpful for
> > preventing starvation of vms during client heal.
> >
> > Considering these factors, I think it would be better if you upgraded
> > your machines to 3.7.10.
> >
> > Do let me know if migrating to 3.7.10 solves your issues.
> >
> > -Krutika
> >
> > On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier <lemonnierk at ulrar.net>
> > wrote:
> >
> > > Yes, but as I was saying I don't believe KVM is using a mount point, I
> > > think it uses the API (
> > > http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt
> > > ).
> > > I might be mistaken of course. Proxmox does have a mountpoint for
> > > convenience; I'll attach those logs, hoping they contain the information
> > > you need. They do seem to contain a lot of errors for the 15th.
> > > For reference, there was a disconnect of the first brick (10.10.0.1) in
> > > the morning and then a successful heal that caused about 40 minutes of
> > > downtime for the VMs. Right after that heal finished (if my memory is
> > > correct it was about noon or close), the second brick (10.10.0.2)
> > > rebooted, and that's the one I disconnected to prevent the heal from
> > > causing another downtime.
> > > I reconnected it at the end of the afternoon, hoping the heal would go
> > > well, but everything went down like in the morning, so I disconnected it
> > > again and waited until 11pm (23:00) to reconnect it and let it finish.
> > >
> > > Thanks for your help,
> > >
> > >
> > > On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika Dhananjay wrote:
> > > > Sorry, I was referring to the glusterfs client logs.
> > > >
> > > > Assuming you are using FUSE mount, your log file will be in
> > > > /var/log/glusterfs/<hyphenated-mount-point-path>.log
> > > >
> > > > -Krutika
> > > >
> > > > On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier <lemonnierk at ulrar.net>
> > > > wrote:
> > > >
> > > > > I believe Proxmox is just an interface to KVM that uses the lib, so if
> > > > > I'm not mistaken there aren't any client logs?
> > > > >
> > > > > It's not the first time I've had the issue; it happens on every heal on
> > > > > the 2 clusters I have.
> > > > >
> > > > > I did let the heal finish that night and the VMs are working now, but it
> > > > > is pretty scary for future crashes or brick replacements.
> > > > > Should I maybe lower the shard size? It won't solve the fact that 2
> > > > > bricks out of 3 aren't keeping the filesystem usable, but it might make
> > > > > the healing quicker, right?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On 17 April 2016 17:56:37 GMT+02:00, Krutika Dhananjay
> > > > > <kdhananj at redhat.com> wrote:
> > > > > >Could you share the client logs and information about the approx
> > > > > >time/day when you saw this issue?
> > > > > >
> > > > > >-Krutika
> > > > > >
> > > > > >On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier
> > > > > ><lemonnierk at ulrar.net>
> > > > > >wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >>
> > > > > >> We have a small glusterFS 3.7.6 cluster with 3 nodes running with
> > > > > >> proxmox VMs on it. I did set up the different recommended options like
> > > > > >> the virt group, but by hand since it's on debian. The shards are 256MB,
> > > > > >> if that matters.
> > > > > >>
> > > > > >> This morning the second node crashed, and as it came back up it started
> > > > > >> a heal, but that basically froze all the VMs running on that volume.
> > > > > >> Since we really really can't have 40 minutes of downtime in the middle
> > > > > >> of the day, I just removed the node from the network and that stopped
> > > > > >> the heal, allowing the VMs to access their disks again. The plan was to
> > > > > >> reconnect the node in a couple of hours to let it heal at night.
> > > > > >> But a VM crashed now, and it can't boot up again: it seems to freeze
> > > > > >> trying to access the disks.
> > > > > >>
> > > > > >> Looking at the heal info for the volume, it has gone way up since this
> > > > > >> morning; it looks like the VMs aren't writing to both nodes, just the
> > > > > >> one they are on.
> > > > > >> It seems pretty bad: we have 2 nodes out of 3 up, so I would expect the
> > > > > >> volume to work just fine since it has quorum. What am I missing?
> > > > > >>
> > > > > >> It is still too early to start the heal; is there a way to start the VM
> > > > > >> anyway right now? I mean, it was running a moment ago so the data is
> > > > > >> there, it just needs to let the VM access it.
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> Volume Name: vm-storage
> > > > > >> Type: Replicate
> > > > > >> Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
> > > > > >> Status: Started
> > > > > >> Number of Bricks: 1 x 3 = 3
> > > > > >> Transport-type: tcp
> > > > > >> Bricks:
> > > > > >> Brick1: first_node:/mnt/vg1-storage
> > > > > >> Brick2: second_node:/mnt/vg1-storage
> > > > > >> Brick3: third_node:/mnt/vg1-storage
> > > > > >> Options Reconfigured:
> > > > > >> cluster.quorum-type: auto
> > > > > >> cluster.server-quorum-type: server
> > > > > >> network.remote-dio: enable
> > > > > >> cluster.eager-lock: enable
> > > > > >> performance.readdir-ahead: on
> > > > > >> performance.quick-read: off
> > > > > >> performance.read-ahead: off
> > > > > >> performance.io-cache: off
> > > > > >> performance.stat-prefetch: off
> > > > > >> features.shard: on
> > > > > >> features.shard-block-size: 256MB
> > > > > >> cluster.server-quorum-ratio: 51%
> > > > > >>
> > > > > >>
> > > > > >> Thanks for your help
> > > > > >>
> > > > > >> --
> > > > > >> Kevin Lemonnier
> > > > > >> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > > >>
> > > > > >>
> > > > > >> _______________________________________________
> > > > > >> Gluster-users mailing list
> > > > > >> Gluster-users at gluster.org
> > > > > >> http://www.gluster.org/mailman/listinfo/gluster-users
> > > > > >>
> > > > >
> > > > >
> > >
> > > --
> > > Kevin Lemonnier
> > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > >
> > >
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>

Lindsay Mathieson

2016-Apr-18 23:38 UTC


[Gluster-users] Freezing during heal

On 19/04/2016 12:47 AM, Krutika Dhananjay wrote:
> Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)

Wasn't the 3.7.10 regression only a problem on reboots if you used
gluster snapshots (which Proxmox doesn't)? I'm currently using 3.7.10,
with no issues on restarts so far.

> So one of the reasons heal may have taken so long (and also consumed
> resources) is because of a bug in self-heal where it would do heal
> from both source bricks in 3-way replication. With such a bug, heal
> would take twice the amount of time and consume resources both the
> times by the same amount.
>
> This issue is fixed at http://review.gluster.org/#/c/14008/ and will
> be available in 3.7.12.

Rats, I thought it was in 3.7.11 :( But can't it also be worked around by
disabling "cluster.data-self-heal"? Or was that something else?


I'm getting pretty good heal performance with the following settings:

- features.shard-block-size: 64MB
- cluster.self-heal-window-size: 1024
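
For anyone wanting to apply the two settings above, the following Python sketch only renders the command lines in the `gluster volume set <VOL> <option> <value>` form quoted earlier in this thread; it does not run them. The helper name `volume_set_commands` is hypothetical, and `vm-storage` is just a placeholder volume name borrowed from the original post.

```python
import shlex


def volume_set_commands(volume, options):
    """Return one 'gluster volume set' command line per option (not executed)."""
    return [
        " ".join([
            "gluster", "volume", "set",
            shlex.quote(volume),
            shlex.quote(key),
            shlex.quote(str(value)),
        ])
        for key, value in options.items()
    ]


# The two options listed above; values copied as written in the message.
settings = {
    "features.shard-block-size": "64MB",
    "cluster.self-heal-window-size": 1024,
}

for line in volume_set_commands("vm-storage", settings):
    print(line)
```

One caveat worth checking against the Gluster documentation for your version: a changed shard block size is generally described as applying only to files created afterwards, so existing VM images would keep their old shard size.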





-- 
Lindsay Mathieson

Kevin Lemonnier

2016-Apr-25 12:01 UTC


[Gluster-users] Freezing during heal

Hi,

So I'm trying that now.
I installed 3.7.11 on two nodes and put a few VMs on it, same config
as before but with 64MB shards and the heal algo to full. As expected,
if I poweroff one of the nodes, everything is dead, which is fine.
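
The "everything is dead" result with two nodes fits the way the quorum options on this volume are usually described. The Python sketch below is a simplified model, not Gluster code, and both rules are assumptions: server quorum is modelled as needing at least ceil(ratio x peers) active peers (51% is the ratio from the first volume's settings; this volume does not list one), and `cluster.quorum-type: auto` is modelled as needing more than half of the replica bricks up, or exactly half including the first brick.

```python
def server_quorum_met(total_peers, active_peers, ratio_percent=51):
    """Assumed rule: active peers must reach ceil(total * ratio / 100)."""
    required = -(-total_peers * ratio_percent // 100)  # integer ceiling
    return active_peers >= required


def client_quorum_auto(bricks_up):
    """Assumed 'auto' rule for one replica set.

    bricks_up is a list of booleans, first brick first.
    """
    total = len(bricks_up)
    if total == 0:
        return False
    up = sum(bricks_up)
    if 2 * up > total:
        return True
    # Exactly half up: only acceptable if the first brick is among them.
    return bool(2 * up == total and bricks_up[0])


# Two-node pool with one node powered off: server quorum is lost.
print(server_quorum_met(total_peers=2, active_peers=1))
# Three-node pool with one node down: both kinds of quorum hold.
print(server_quorum_met(total_peers=3, active_peers=2))
print(client_quorum_auto([True, False, True]))
```

Under these assumptions, a two-node pool cannot lose either node without losing server quorum, whereas a three-node pool tolerates one node down, which is why the freeze on the original 3-node cluster looked like a bug rather than expected behaviour.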

Now I'm adding a third node, a big heal was started after the add-brick
of everything (7000+ shards), and for now everything seems to be working
fine on the VMs. Last time I tried adding a brick, all those VM died for
the duration of the heal, so that's already pretty good.
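
As a rough sense of scale for the "7000+ shards" figure, here is a small Python sketch of the arithmetic. It assumes "64MB" means 64 MiB and treats every shard as full, so the result is an upper bound; the function names are made up for this example.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def max_data_bytes(shard_count, shard_bytes):
    """Upper bound on data covered by shard_count shards (all assumed full)."""
    return shard_count * shard_bytes


def shards_needed(file_bytes, shard_bytes):
    """Number of shards needed to hold file_bytes (ceiling division)."""
    return -(-file_bytes // shard_bytes)


healed = max_data_bytes(7000, 64 * MIB)
print(healed / GIB)                     # data implied by 7000 shards of 64 MiB
print(shards_needed(healed, 256 * MIB)) # same data with 256 MiB shards
print(shards_needed(100 * GIB, 64 * MIB))
```

So 7000 shards of 64 MiB correspond to at most about 437.5 GiB, which would be 1750 shards at the earlier 256 MiB setting. Smaller shards mean more, smaller units of heal work, so a single changed block forces less data to be copied under the 'full' algorithm.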

I'm going to let it finish copying everything to the new node, then I'll try
to simulate nodes going down to see if my original problem of freezing and
slow heal time is solved with this config.
For reference, here is the volume info, in case someone sees something I
should change:

Volume Name: gluster
Type: Replicate
Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ipvr2.client_name:/mnt/storage/gluster
Brick2: ipvr3.client_name:/mnt/storage/gluster
Brick3: ipvr50.client_name:/mnt/storage/gluster
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
performance.readdir-ahead: on


The numbering starts at 2 and jumps to 50 because the first server is doing
something else for now, and I use 50 as the temporary third node. If
everything goes well, I'll migrate production onto the cluster, re-install the
first server and do a replace-brick, which I hope will work just as well as
the add-brick I'm doing now. The last replace-brick also brought everything
down, but I guess that was the joy of 3.7.6 :).

Thanks !


-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111

Kevin Lemonnier

2016-May-02 09:05 UTC


[Gluster-users] Freezing during heal

Hi,

So after some testing, it is a lot better, but I do still have some problems
with 3.7.11.
When I reboot a server it seems to show some strange behaviour sometimes, but
I need to test that better.
Removing a server from the network, waiting for a while, then adding it back
and letting it heal works perfectly: completely invisible to the user, and
that's perfect!

However, when I add a brick, changing the replica count from 2 to 3, it starts
a heal and some VMs switch to read-only. I have to power them off and then on
again to fix it. Clearly it's better than with 3.7.6, which froze the VMs
until the heal was complete, but I would still like to understand why some of
the VMs are switching to read-only.
It looks like it happens every time I add a brick to increase the replica
count. I would like to test adding a whole replica set at once, but I just
don't have the hardware for that.

Rebooting a node looks like it's making some VMs go read-only too, but I need
to test that better. For some reason it looks like rebooting a brick or adding
a brick is causing I/O errors on some VM disks and not others, and I have to
power them off and then on to fix it.
I can't just reboot them; I guess I have to actually re-open the file to
trigger a heal?

Any idea on how to prevent that? It's a lot better than 3.7.6 because it can
be fixed in a minute, but that's still not great to explain to the clients.

Thanks


On Mon, Apr 25, 2016 at 02:01:09PM +0200, Kevin Lemonnier
wrote:> Hi,
> 
> So I'm trying that now.
> I installed 3.7.11 on two nodes and put a few VMs on it, same config
> as before but with 64MB shards and the heal algo to full. As expected,
> if I poweroff one of the nodes, everything is dead, which is fine.
> 
> Now I'm adding a third node, a big heal was started after the add-brick
> of everything (7000+ shards), and for now everything seems to be working
> fine on the VMs. Last time I tried adding a brick, all those VM died for
> the duration of the heal, so that's already pretty good.
> 
> I'm gonna let it finish to copy everything on the new nodes, then
I'll try
> to simulate nodes going down to see if my original problem of freezing and
> low heal time is solved with this config.
> For reference, here is the volume info, if someone sees something I should
change :
> 
> Volume Name: gluster
> Type: Replicate
> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: ipvr2.client_name:/mnt/storage/gluster
> Brick2: ipvr3.client_name:/mnt/storage/gluster
> Brick3: ipvr50.client_name:/mnt/storage/gluster
> Options Reconfigured:
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> features.shard: on
> features.shard-block-size: 64MB
> cluster.data-self-heal-algorithm: full
> performance.readdir-ahead: on
> 
> 
> It starts at 2 and jumps to 50 because the first server is doing something
else for now,
> and I use 50 to be the temporary third node. If everything goes well,
I'll migrate the production
> on the cluster, re-install the first server and do a replace-brick, which I
hope will work just as well
> as the add-brick I'm doing now. Last replace-brick also brought
everything down, but I guess that was the
> joy of 3.7.6 :).
> 
> Thanks !
> 
> 
> On Mon, Apr 18, 2016 at 08:17:05PM +0530, Krutika Dhananjay wrote:
> > On Mon, Apr 18, 2016 at 8:02 PM, Kevin Lemonnier <lemonnierk at
ulrar.net>
> > wrote:
> > 
> > > I will try migrating to 3.7.10, is it considered stable yet ?
> > >
> > 
> > Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
> > 
> > 
> > > Should I change the self heal algorithm even if I move to 3.7.10,
or is
> > > that not necessary ?
> > > Not sure what that change might do.
> > >
> > 
> > So the other algorithm is 'diff' which computes rolling
checksum on chunks
> > of the src(es) and sink(s), compares them and heals upon mismatch.
This is
> > known to consume lot of CPU. 'full' algo on the other hand
simply copies
> > the src into sink in chunks. With sharding, it shouldn't be all
that bad
> > copying a 256MB file (in your case) from src to sink. We've used
double the
> > block size and had no issues reported.
> > 
> > So you could change self heal algo to full even in the upgraded
cluster.
> > 
> > -Krutika
> > 
> > 
> > >
> > > Anyway, I'll try to create a 3.7.10 cluster in the week end
slowly move
> > > the VMs on it then,
> > > Thanks a lot for your help,
> > >
> > > Regards
> > >
> > >
> > > On Mon, Apr 18, 2016 at 07:58:44PM +0530, Krutika Dhananjay
wrote:
> > > > Hi,
> > > >
> > > > Yeah, so the fuse mount log didn't convey much
information.
> > > >
> > > > So one of the reasons heal may have taken so long (and also
consumed
> > > > resources) is because of a bug in self-heal where it would
do heal from
> > > > both source bricks in 3-way replication. With such a bug,
heal would take
> > > > twice the amount of time and consume resources both the
times by the same
> > > > amount.
> > > >
> > > > This issue is fixed at http://review.gluster.org/#/c/14008/
and will be
> > > > available in 3.7.12.
> > > >
> > > > The other thing you could do is to set
cluster.data-self-heal-algorithm
> > > to
> > > > 'full', for better heal performance and more
regulated resource
> > > consumption
> > > > by the same.
> > > >  #gluster volume set <VOL>
cluster.data-self-heal-algorithm full
> > > >
> > > > As far as sharding is concerned, some critical caching
issues were fixed
> > > in
> > > > 3.7.7 and 3.7.8.
> > > > And my guess is that the vm crash/unbootable state could be
because of
> > > this
> > > > issue, which exists in 3.7.6.
> > > >
> > > > 3.7.10 saw the introduction of throttled client side heals
which also
> > > moves
> > > > such heals to the background, which is all the more helpful
for
> > > preventing
> > > > starvation of vms during client heal.
> > > >
> > > > Considering these factors, I think it would be better if you
upgraded
> > > your
> > > > machines to 3.7.10.
> > > >
> > > > Do let me know if migrating to 3.7.10 solves your issues.
> > > >
> > > > -Krutika
> > > >
> > > > On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier
<lemonnierk at ulrar.net>
> > > > wrote:
> > > >
> > > > > Yes, but as I was saying I don't believe KVM is
using a mount point, I
> > > > > think it uses
> > > > > the API (
> > > > >
> > >
http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt
> > > > > ).
> > > > > Might be mistaken ofcourse. Proxmox does have a
mountpoint for
> > > > > conveniance, I'll attach those
> > > > > logs, hoping they contain the informations you need.
They do seem to
> > > > > contain a lot of errors
> > > > > for the 15.
> > > > > For reference, there was a disconnect of the first
brick (10.10.0.1) in
> > > > > the morning and then a successfull
> > > > > heal that caused about 40 minutes downtime of the VMs.
Right after that
> > > > > heal finished (if my memory is
> > > > > correct it was about noon or close) the second brick
(10.10.0.2)
> > > rebooted,
> > > > > and that's the one I disconnected
> > > > > to prevent the heal from causing another downtime.
> > > > > I reconnected it one at the end of the afternoon,
hoping the heal
> > > would go
> > > > > well but everything went down
> > > > > like in the morning so I disconnected it again, and
waited 11pm
> > > (23:00) to
> > > > > reconnect it and let it finish.
> > > > >
> > > > > Thanks for your help,
> > > > >
> > > > >
> > > > > On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika
Dhananjay wrote:
> > > > > > Sorry, I was referring to the glusterfs client
logs.
> > > > > >
> > > > > > Assuming you are using FUSE mount, your log file
will be in
> > > > > >
/var/log/glusterfs/<hyphenated-mount-point-path>.log
> > > > > >
> > > > > > -Krutika
> > > > > >
> > > > > > On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier
<
> > > lemonnierk at ulrar.net>
> > > > > > wrote:
> > > > > >
> > > > > > > I believe Proxmox is just an interface to KVM
that uses the lib,
> > > so if
> > > > > I'm
> > > > > > > not mistaken there isn't client logs ?
> > > > > > >
> > > > > > > It's not the first time I have the issue,
it happens on every heal
> > > on
> > > > > the
> > > > > > > 2 clusters I have.
> > > > > > >
> > > > > > > I did let the heal finish that night and the
VMs are working now,
> > > but
> > > > > it
> > > > > > > is pretty scarry for future crashes or brick
replacement.
> > > > > > > Should I maybe lower the shard size ?
Won't solve the fact that 2
> > > > > bricks
> > > > > > > on 3 aren't keeping the filesystem usable
but might make the
> > > healing
> > > > > > > quicker right ?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Le 17 avril 2016 17:56:37 GMT+02:00, Krutika
Dhananjay <
> > > > > > > kdhananj at redhat.com> a ?crit :
> > > > > > > >Could you share the client logs and
information about the approx
> > > > > > > >time/day
> > > > > > > >when you saw this issue?
> > > > > > > >
> > > > > > > >-Krutika
> > > > > > > >
> > > > > > > >On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier
> > > > > > > ><lemonnierk at ulrar.net> wrote:
> > > > > > > >
> > > > > > > >> Hi,
> > > > > > > >>
> > > > > > > >> We have a small GlusterFS 3.7.6 cluster with 3 nodes running
> > > > > > > >> with Proxmox VMs on it. I did set up the different recommended
> > > > > > > >> options like the virt group, but by hand since it's on Debian.
> > > > > > > >> The shards are 256MB, if that matters.
> > > > > > > >>
> > > > > > > >> This morning the second node crashed, and as it came back up it
> > > > > > > >> started a heal, but that basically froze all the VMs running on
> > > > > > > >> that volume. Since we really, really can't have 40 minutes of
> > > > > > > >> downtime in the middle of the day, I just removed the node from
> > > > > > > >> the network and that stopped the heal, allowing the VMs to
> > > > > > > >> access their disks again. The plan was to reconnect the node in
> > > > > > > >> a couple of hours to let it heal at night.
> > > > > > > >> But a VM crashed now, and it can't boot up again: it seems to
> > > > > > > >> freeze trying to access the disks.
> > > > > > > >>
> > > > > > > >> Looking at the heal info for the volume, it has gone way up
> > > > > > > >> since this morning; it looks like the VMs aren't writing to
> > > > > > > >> both nodes, just the one they are on.
> > > > > > > >> It seems pretty bad: we have 2 nodes out of 3 up, so I would
> > > > > > > >> expect the volume to work just fine since it has quorum.
> > > > > > > >> What am I missing?
> > > > > > > >>
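
The expectation that 2 bricks out of 3 still have quorum can be checked with simple arithmetic. The sketch below is one reading of how `cluster.quorum-type: auto` is documented to behave: more than half of the bricks in the replica set must be up, and when exactly half are up the first brick must be among them. The function name `has_client_quorum` is hypothetical, and this is an illustration of the rule, not GlusterFS code.

```python
def has_client_quorum(bricks_up: int, replica_count: int,
                      first_brick_up: bool = True) -> bool:
    """Assumed 'auto' client-quorum rule for one replica set."""
    if not 0 <= bricks_up <= replica_count:
        raise ValueError("bricks_up must be between 0 and replica_count")
    # A strict majority of bricks always satisfies quorum.
    if 2 * bricks_up > replica_count:
        return True
    # Exactly half (only possible for even replica counts): the first
    # brick acts as the tie-breaker.
    return 2 * bricks_up == replica_count and first_brick_up


# The volume in this thread is 1 x 3 with one brick disconnected.
print(has_client_quorum(bricks_up=2, replica_count=3))
```

Under this reading, two bricks up out of three does satisfy quorum, which suggests the freeze is not explained by quorum alone.
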
> > > > > > > >> It is still too early to start the heal; is there a way to
> > > > > > > >> start the VM anyway right now? I mean, it was running a moment
> > > > > > > >> ago so the data is there, it just needs to let the VM access it.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Volume Name: vm-storage
> > > > > > > >> Type: Replicate
> > > > > > > >> Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
> > > > > > > >> Status: Started
> > > > > > > >> Number of Bricks: 1 x 3 = 3
> > > > > > > >> Transport-type: tcp
> > > > > > > >> Bricks:
> > > > > > > >> Brick1: first_node:/mnt/vg1-storage
> > > > > > > >> Brick2: second_node:/mnt/vg1-storage
> > > > > > > >> Brick3: third_node:/mnt/vg1-storage
> > > > > > > >> Options Reconfigured:
> > > > > > > >> cluster.quorum-type: auto
> > > > > > > >> cluster.server-quorum-type: server
> > > > > > > >> network.remote-dio: enable
> > > > > > > >> cluster.eager-lock: enable
> > > > > > > >> performance.readdir-ahead: on
> > > > > > > >> performance.quick-read: off
> > > > > > > >> performance.read-ahead: off
> > > > > > > >> performance.io-cache: off
> > > > > > > >> performance.stat-prefetch: off
> > > > > > > >> features.shard: on
> > > > > > > >> features.shard-block-size: 256MB
> > > > > > > >> cluster.server-quorum-ratio: 51%
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thanks for your help
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Kevin Lemonnier
> > > > > > > >> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > > > > >>
> > > > > > > >> _______________________________________________
> > > > > > > >> Gluster-users mailing list
> > > > > > > >> Gluster-users at gluster.org
> > > > > > > >> http://www.gluster.org/mailman/listinfo/gluster-users
> > > > > > > >>
> > > > > > >
> > > > > > > --
> > > > > > > Sent from my Android device with K-9 Mail. Please excuse my
> > > > > > > brevity.
> > > > > > >
> > > > >
> > > > > --
> > > > > Kevin Lemonnier
> > > > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > >
> > >
> > > --
> > > Kevin Lemonnier
> > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > >
> 
> -- 
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
