Paul Penev
2014-Apr-06 15:52 UTC
[Gluster-users] libgfapi failover problem on replica bricks
Hello,

I'm having an issue with rebooting bricks holding images for live KVM machines (using libgfapi).

I have a replicated+distributed setup of 4 bricks (2x2). The cluster contains images for a couple of KVM virtual machines.

My problem is that when I reboot a brick containing an image of a VM, the VM will start throwing disk errors and eventually die.

The gluster volume is made like this:

# gluster vol info pool

Volume Name: pool
Type: Distributed-Replicate
Volume ID: xxxxxxxxxxxxxxxxxxxx
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: srv10g:/data/gluster/brick
Brick2: srv11g:/data/gluster/brick
Brick3: srv12g:/data/gluster/brick
Brick4: srv13g:/data/gluster/brick
Options Reconfigured:
network.ping-timeout: 10
cluster.server-quorum-type: server
diagnostics.client-log-level: WARNING
auth.allow: 192.168.0.*,127.*
nfs.disable: on

The KVM instances run on the same gluster bricks, with disks mounted as:
file=gluster://localhost/pool/images/vm-xxx-disk-1.raw,.......,cache=writethrough,aio=native

My self-heal backlog is not always 0. It looks like some writes are not going to all bricks at the same time (?).

gluster vol heal pool info

sometimes shows the images needing sync on one brick, the other, or both. There are no network problems or errors on the wire.

Any ideas what could be causing this?

Thanks.
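For reference, a minimal sketch of what the setup above translates to in practice. Only the gluster:// URL and the cache/aio flags come from the post; the rest of the qemu command line (memory size, if=virtio, format=raw) is assumed for illustration.

# Guest attaching its image over libgfapi, matching the drive string above
# (everything except the URL and cache/aio settings is assumed):
qemu-system-x86_64 -enable-kvm -m 2048 \
  -drive file=gluster://localhost/pool/images/vm-xxx-disk-1.raw,if=virtio,format=raw,cache=writethrough,aio=native

# Inspecting the self-heal backlog mentioned above:
gluster volume heal pool info               # entries pending heal, per brick
gluster volume heal pool info split-brain   # any split-brain entries?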
Fabio Rosati
2014-Apr-09 08:19 UTC
[Gluster-users] libgfapi failover problem on replica bricks
Hi Paul,

you're not alone. I get the same issue after rebooting a brick belonging to a 2 x 2 volume, and the same is true for João P. and Nick M. (added in cc).

[root at networker ~]# gluster volume info gv_pri

Volume Name: gv_pri
Type: Distributed-Replicate
Volume ID: 3d91b91e-4d72-484f-8655-e5ed8d38bb28
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: nw1glus.gem.local:/glustexp/pri1/brick
Brick2: nw2glus.gem.local:/glustexp/pri1/brick
Brick3: nw3glus.gem.local:/glustexp/pri2/brick
Brick4: nw4glus.gem.local:/glustexp/pri2/brick
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
server.allow-insecure: on
network.remote-dio: on
performance.write-behind-window-size: 16MB
performance.cache-size: 128MB

I hope someone will address this problem in the near future, since not being able to shut down a server hosting a brick is a big limitation.

It seems someone solved the problem using cgroups: http://www.gluster.org/author/andrew-lau/

Anyway, I think it's not easy to implement, because cgroups is already configured and in use for libvirt; if I had a test environment and some spare time I would have tried.

Regards,
Fabio Rosati

----- Original message -----
From: "Paul Penev" <ppquant at gmail.com>
To: Gluster-users at gluster.org
Sent: Sunday, 6 April 2014 17:52:53
Subject: [Gluster-users] libgfapi failover problem on replica bricks
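For anyone reproducing this volume, the options listed above map onto plain "gluster volume set" calls (values copied from the volume info output; the glusterd.vol line is the usual companion to server.allow-insecure for libgfapi clients, and is an assumption about this setup, not stated in the post).

# Volume options, as shown in the volume info above:
gluster volume set gv_pri storage.owner-uid 107
gluster volume set gv_pri storage.owner-gid 107
gluster volume set gv_pri server.allow-insecure on
gluster volume set gv_pri network.remote-dio on
gluster volume set gv_pri performance.write-behind-window-size 16MB
gluster volume set gv_pri performance.cache-size 128MB

# Usually paired with the following in /etc/glusterfs/glusterd.vol on each
# server (then restart glusterd) so libgfapi clients connecting from
# unprivileged ports are accepted -- assumed here, not stated in the post:
#   option rpc-auth-allow-insecure on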
Paul Penev
2014-Apr-16 16:20 UTC
[Gluster-users] libgfapi failover problem on replica bricks
>> I can easily reproduce the problem on this cluster. It appears that
>> there is a "primary" replica and a "secondary" replica.
>>
>> If I reboot or kill the glusterfs process there are no problems on the
>> running VM.
>
> Good. That is as expected.

Sorry, I was not clear enough. I meant that if I reboot the "secondary" replica, there are no problems.

>> If I reboot or "killall -KILL glusterfsd" the primary replica (so I
>> don't let it terminate properly), I can block the VM each time.
>
> Have you followed my blog advice to prevent the vm from remounting the
> image filesystem read-only, and waited ping-timeout seconds (42 by default)?

I have not followed your advice, but there is a difference: I get i/o errors *reading* from the disk. Once the problem kicks in, I cannot issue commands (like ls) because they can't be read.

There is a problem with that setup: it cannot be implemented on Windows machines (which are more vulnerable), nor on machines over which I have no control (customers).

>> If I "reset" the VM it will not find the boot disk.
>
> Somewhat expected if within the ping-timeout.

The issue persists beyond the ping-timeout. The KVM process needs to be reinitialized. I guess libgfapi needs to reconnect from scratch.

>> If I power down and power up the VM, then it will boot but will find
>> corruption on disk during the boot that requires fixing.
>
> Expected, since the vm doesn't use the image filesystem synchronously.
> You can change that with mount options at the cost of performance.

Ok. I understand this point.

> Unless you wait for ping-timeout and then continue writing, the replica is
> actually still in sync. It's only out of sync if you write to one replica
> but not the other.
>
> You can shorten the ping timeout. There is a cost to reconnection if you
> do. Be sure to test a scenario with servers under production loads and see
> what the performance degradation during a reconnect is. Balance your needs
> appropriately.

Could you please elaborate on the cost of reconnection?

I will try to run with a very short ping timeout (2 sec) and see if the problem is in the ping-timeout or perhaps not.

Paul
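For completeness, a rough sketch of the two knobs discussed in this exchange. The 2-second value is the one Paul says he will test (the default is 42, and his volume currently uses 10); the guest-side fstab line is only one possible reading of the blog advice about read-only remounts, and is an assumption here.

# Shorten the client ping timeout on the volume:
gluster volume set pool network.ping-timeout 2
gluster volume info pool    # confirm the new value under "Options Reconfigured"

# One way to keep a Linux guest from remounting its filesystem read-only
# when writes stall past the timeout (ext3/ext4 mount option; as Paul notes,
# there is no equivalent for Windows guests) -- in the guest's /etc/fstab:
#   /dev/vda1  /  ext4  defaults,errors=continue  0  1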