Hi all,

yesterday I had a strange situation where Gluster healing corrupted *all* my VM images.

In detail:
I had about 15 VMs running (in Proxmox 4.0) totaling about 600 GB of qcow2 images. Gluster is used as storage for those images in a replica 3 setup (i.e. 3 physical servers replicating all data).
All VMs were running on machine #1 - the two other machines (#2 and #3) were *idle*.
Gluster was fully operating (no healing in progress) when I rebooted machine #2. For other reasons I had to reboot machines #2 and #3 a few times, but since all VMs were running on machine #1 and nothing on the other machines was accessing Gluster files, I was confident that this wouldn't disturb Gluster. In any case, this means that I rebooted Gluster nodes during a healing process.

After a few minutes, Gluster files began showing corruption - up to the point that the qcow2 files became unreadable and all VMs stopped working. I was forced to restore VM backups (losing a few hours of data), which means that the corrupt files were left as-is by Proxmox and new qcow2 files were created for the VMs. This also means that Gluster could continue healing its files all night long. Today, many files seem to be intact, but I think they have been replaced with older versions (it's a bit difficult to tell exactly).

Please note that node #1 was up at all times.

It seems to me that the healing process corrupted the files. How can that be? Does anybody have an explanation? Is there any way to avoid such a situation (except checking whether Gluster is healing before rebooting)?

Setup details:
- Proxmox 4.0 cluster (not yet in HA mode) = Debian 8 Jessie
- redundant Gbit LAN (bonding)
- Gluster 3.5.2 (most current Proxmox package)
- two volumes, both "replicate" type, 1 x 3 = 3 bricks
- cluster.server-quorum-ratio: 51%

Thanks,
Udo
On 7/12/2015 9:03 PM, Udo Giacomozzi wrote:
> Setup details:
> - Proxmox 4.0 cluster (not yet in HA mode) = Debian 8 Jessie
> - redundant Gbit LAN (bonding)
> - Gluster 3.5.2 (most current Proxmox package)
> - two volumes, both "replicate" type, 1 x 3 = 3 bricks
> - cluster.server-quorum-ratio: 51%

Probably post the output from "gluster volume info" as well.
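It would also help to know whether any heals were actually pending when you rebooted #2 and #3. Roughly something like the following (the volume name is just an example, substitute your own):

    # entries that still need healing
    gluster volume heal vm-images info
    # any entries in split-brain
    gluster volume heal vm-images info split-brain
    # state of the bricks and self-heal daemons
    gluster volume status vm-images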
Hi Udo, thanks for posting your volume info settings.

Please note for the following: I am not one of the devs, just a user, so unfortunately I have no authoritative answers :(

I am running a very similar setup - Proxmox 4.0, three nodes, but using ceph for our production storage. Am heavily testing gluster 3.7 on the side. We find the performance of ceph slow on these small setups and management of it a PITA.

Some more questions:
- How are your VM images being accessed by Proxmox? gfapi (the Proxmox Gluster storage type) or the fuse mount?
- What's your underlying filesystem (ext4, zfs, etc.)?
- Are you using the HA/watchdog system in Proxmox?

On 07/12/15 21:03, Udo Giacomozzi wrote:
> Yesterday I had a strange situation where Gluster healing corrupted
> *all* my VM images.
>
> In detail:
> I had about 15 VMs running (in Proxmox 4.0) totaling about 600 GB of
> qcow2 images. Gluster is used as storage for those images in a replica
> 3 setup (i.e. 3 physical servers replicating all data).
> All VMs were running on machine #1 - the two other machines (#2 and
> #3) were *idle*.
> Gluster was fully operating (no healing) when I rebooted machine #2.
> For other reasons I had to reboot machines #2 and #3 a few times, but
> since all VMs were running on machine #1 and nothing on the other
> machines was accessing Gluster files, I was confident that this
> wouldn't disturb Gluster.
> But anyway this means that I rebooted Gluster nodes during a healing
> process.
>
> After a few minutes, Gluster files began showing corruption - up to
> the point that the qcow2 files became unreadable and all VMs stopped
> working.

:( sounds painful - my sympathies.

You're running 3.5.2 - that's getting rather old. I use the gluster debian repos:

3.6.7 : http://download.gluster.org/pub/gluster/glusterfs/3.6/LATEST/Debian/
3.7.6 : http://download.gluster.org/pub/gluster/glusterfs/LATEST/Debian/jessie/

3.6.x is the latest stable, 3.7 is close to stable(?). 3.7 has some nice new features such as sharding, which is very useful for VM hosting - it enables much faster heal times.

Regarding what happened with your VMs, I'm not sure. Having two servers down should have disabled the entire store, making it neither readable nor writable.

I note that you are missing some settings that need to be set for VM stores - there will be corruption problems if you live migrate without them:

quick-read=off
read-ahead=off
io-cache=off
stat-prefetch=off
eager-lock=enable
remote-dio=enable
quorum-type=auto
server-quorum-type=server

"stat-prefetch=off" is particularly important.
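Off the top of my head, applying them would look something like the following - "vm-images" is just a placeholder for your volume name, and it's worth double-checking the exact option names against "gluster volume set help" on your version:

    gluster volume set vm-images performance.quick-read off
    gluster volume set vm-images performance.read-ahead off
    gluster volume set vm-images performance.io-cache off
    gluster volume set vm-images performance.stat-prefetch off
    gluster volume set vm-images cluster.eager-lock enable
    gluster volume set vm-images network.remote-dio enable
    gluster volume set vm-images cluster.quorum-type auto
    gluster volume set vm-images cluster.server-quorum-type server

(and the same again for your second volume).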
On 7/12/2015 9:03 PM, Udo Giacomozzi wrote:
> All VMs were running on machine #1 - the two other machines (#2 and
> #3) were *idle*.
> Gluster was fully operating (no healing) when I rebooted machine #2.
> For other reasons I had to reboot machines #2 and #3 a few times, but
> since all VMs were running on machine #1 and nothing on the other
> machines was accessing Gluster files, I was confident that this
> wouldn't disturb Gluster.
> But anyway this means that I rebooted Gluster nodes during a healing
> process.
>
> After a few minutes, Gluster files began showing corruption - up to
> the point that the qcow2 files became unreadable and all VMs stopped
> working.

Udo, it occurs to me that if your VMs were running on #2 & #3 and you live migrated them to #1 prior to rebooting #2/#3, then you would indeed rapidly get progressive VM corruption. However, it wouldn't be due to the heal process, but rather the live migration with "performance.stat-prefetch" on. This always leads to qcow2 files becoming corrupted and unusable.

--
Lindsay Mathieson
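PS - to see what a volume currently has set, "gluster volume info <volname>" lists the reconfigured options; if performance.stat-prefetch doesn't show up there, it is still at its default (on). I believe 3.7 can also query a single option directly, something like:

    gluster volume get <volname> performance.stat-prefetch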