Tomas Corej
2011-Apr-18  10:47 UTC
[Gluster-users] XEN VPS unresponsive because of selfhealing
Hello,

I've been following this project since its early 2.0 releases and think it has made great progress. Personally, I find both the problems it solves and the way it solves them interesting.

We are a web hosting company and use GlusterFS to serve some of our hosting accounts because of their size. We also serve XEN domUs from GlusterFS, and yesterday we tried to upgrade GlusterFS from 3.1.2 to the latest version, 3.1.4. Our setup is a fairly standard distribute-replicate volume:

Volume Name: images
Type: Distributed-Replicate
Status: Started
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks:
Brick1: gnode002.local:/data1/images
Brick2: gnode004.local:/data1/images
Brick3: gnode002.local:/data2/images
Brick4: gnode004.local:/data2/images
Brick5: gnode002.local:/data3/images
Brick6: gnode004.local:/data3/images
Brick7: gnode002.local:/data4/images
Brick8: gnode004.local:/data4/images
Brick9: gnode006.local:/data1/images
Brick10: gnode008.local:/data1/images
Brick11: gnode006.local:/data2/images
Brick12: gnode008.local:/data2/images
Brick13: gnode006.local:/data3/images
Brick14: gnode008.local:/data3/images
Brick15: gnode006.local:/data4/images
Brick16: gnode008.local:/data4/images
Brick17: gnode010.local:/data1/images
Brick18: gnode012.local:/data1/images
Brick19: gnode010.local:/data2/images
Brick20: gnode012.local:/data2/images
Brick21: gnode010.local:/data3/images
Brick22: gnode012.local:/data3/images
Brick23: gnode010.local:/data4/images
Brick24: gnode012.local:/data4/images
Options Reconfigured:
performance.quick-read: off
network.ping-timeout: 30

The XEN servers mount the images through the GlusterFS native client and serve them using the tap:aio driver.

We wanted to upgrade Gluster on each node, one at a time (but so far we have done only gnode002). So we ran this:

root@gnode002.local: /etc/init.d/glusterd stop && killall glusterfsd && /etc/init.d/glusterd start

We had to kill the glusterfsd processes because glusterd did not shut them down properly.
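The restart sequence above could be sketched as a small script that gives the brick processes a bounded amount of time to exit before falling back to killall. This is only an illustration of the same three steps: the names GLUSTERD_INIT, WAIT_SECS, brick_running and restart_gluster_node are made up for this sketch, the 10-second default is an arbitrary assumption, and nothing here changes how or when self-healing runs.

```shell
#!/bin/sh
# Sketch only: restart GlusterFS on one node, as in the one-liner above,
# but wait briefly for glusterfsd to exit before killing it.
# GLUSTERD_INIT, WAIT_SECS and the function names are illustrative,
# not part of GlusterFS.

GLUSTERD_INIT="${GLUSTERD_INIT:-/etc/init.d/glusterd}"
WAIT_SECS="${WAIT_SECS:-10}"   # assumed grace period, tune as needed

# Succeeds (exit 0) while any glusterfsd brick process is still alive.
brick_running() {
    pgrep -x glusterfsd >/dev/null 2>&1
}

restart_gluster_node() {
    # Step 1: stop the management daemon.
    "$GLUSTERD_INIT" stop || return 1

    # Step 2: give the brick processes up to WAIT_SECS seconds to exit.
    waited=0
    while brick_running && [ "$waited" -lt "$WAIT_SECS" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Step 3: only if bricks are still up, kill them (SIGTERM by default).
    if brick_running; then
        killall glusterfsd
    fi

    # Step 4: start the daemon again, which brings the bricks back up.
    "$GLUSTERD_INIT" start
}
```

Calling restart_gluster_node as root on the node being upgraded runs the sequence once; the function is deliberately not invoked by the script itself.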

The problem was that, once the command had run, self-healing immediately started checking consistency. The glusterfsd processes were down for perhaps 5-6 seconds, so we did not expect self-healing to start, but it did.

That would not be a problem in itself, except that self-healing left our VPS totally unresponsive for 90 minutes, until it stopped, because Gluster had locked the image (or access to the image was extremely slow?).

So the question is: is there a way to avoid this, or at least to minimize these effects? Has anyone had the same experience with self-healing in a GlusterFS+XEN environment?

Regards,
Tomas Corej

--
[ Rate the quality of this email: http://nicereply.com/websupport/Corej/ ]
Tomáš Čorej | admin section
+421 (0)2 20 60 80 89
+421 (0)2 20 60 80 80
http://WebSupport.sk
*** TAKE IT AND ENJOY ***