thr3ads.net - Gluster users - [Gluster-users] XFS corruption reported by QEMU virtual machine with image hosted on gluster [Oct 2024]

Jacobson, Erik

2024-Oct-13 21:59 UTC

[Gluster-users] XFS corruption reported by QEMU virtual machine with image hosted on gluster

Hello all! We are experiencing a strange problem with QEMU virtual machines
where the virtual machine image is hosted on a gluster volume and accessed via
fuse. (Our GFAPI attempt failed; it doesn't seem to work properly with current
QEMU/distro/gluster.) We have the volume tuned for "virt".

So we use qemu-img to create a raw image. You can use sparse or falloc with
equal results. We start a virtual machine (libvirt, qemu-kvm) and libvirt/qemu
points to the fuse mount with the QEMU image file we created.

When we create partitions and filesystems, like you might do for installing an
operating system, all is well at first. This includes a root XFS filesystem.

When we try to re-make the XFS filesystem over the old one, it will not mount
and will report XFS corruption.
If you dig into XFS repair, you can find a UUID mismatch between the superblock
and the log. The log always retains the UUID of the original filesystem (the one
we tried to replace). Running xfs_repair doesn't truly repair; it just reports
more corruption. Using xfs_db to force the log to be remade doesn't help either.

We can duplicate this with even a QEMU raw image of 50 megabytes. As far as we
can tell, XFS is the only filesystem showing this behavior or at least the only
one reporting a problem.

If we take QEMU out of the picture and create partitions directly on the QEMU
raw image file, then use kpartx to create devices to the partitions, and run a
similar test, the gluster-hosted image behaves as you would expect and there is
no problem reported by XFS. We can't duplicate the problem outside of QEMU.

We have observed the issue with Rocky 9.4 and SLES15 SP5 environments (including
the matching QEMU versions). We have not tested more distros yet.

We observed the problem originally with Gluster 9.3. We reproduced it with
Gluster 9.6 and 10.5.

If we switch from QEMU RAW to QCOW2, the problem disappears.
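
For anyone who wants to try the same switch on an existing image, here is a minimal sketch of a raw-to-QCOW2 conversion with qemu-img. The helper name to_qcow2, the RUN switch, and the output filename (same path with a .qcow2 suffix) are illustrative assumptions and not part of the original report. By default it only prints the command.

```shell
#!/bin/sh
# to_qcow2 RAW_IMAGE
# Prints the qemu-img command that converts a raw image to qcow2.
# With RUN=1 in the environment, the command is executed instead.
# The destination name is a hypothetical choice: foo.img -> foo.qcow2.
to_qcow2() {
    src=$1
    dst="${src%.img}.qcow2"
    if [ "${RUN:-0}" = "1" ]; then
        qemu-img convert -f raw -O qcow2 "$src" "$dst"
    else
        printf 'qemu-img convert -f raw -O qcow2 %s %s\n' "$src" "$dst"
    fi
}

to_qcow2 /adminvm/images/adminvm.img
```

The libvirt disk definition would also need its driver type changed from raw to qcow2 and its source path pointed at the new file before the VM is started.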

The problem is not reproduced when we take gluster out of the equation (meaning
pointing QEMU at a local disk image instead of a gluster-hosted one; that works
fine).

The problem can be reproduced this way:

* Assume /adminvm/images on a gluster sharded volume
* rm /adminvm/images/adminvm.img
* qemu-img create -f raw /adminvm/images/adminvm.img 50M

Now start the virtual machine that refers to the above adminvm.img file

* Boot up a rescue environment or a live mode or similar
* sgdisk --zap-all /dev/sda
* sgdisk --set-alignment=4096 --clear /dev/sda
* sgdisk --set-alignment=4096 --new=1:0:0 /dev/sda
* mkfs.xfs -L fs1 /dev/sda1
* mkdir -p /a
* mount /dev/sda1 /a
* umount /a
* # Make the same filesystem again:
* mkfs.xfs -f -L fs1 /dev/sda1
* mount /dev/sda1 /a
* This will fail with kernel backtraces and corruption reported
* xfs_repair will report the log vs. superblock UUID mismatch I mentioned
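
The in-guest steps above can be collected into a small script for repeated runs. This is only a sketch: the DISK, PART, and MNT variables, the run and repro helpers, and the RUN switch are illustrative names that do not appear in the original report. By default it only prints the commands. With RUN=1 it executes them, which destroys everything on the target disk.

```shell
#!/bin/sh
# Reproduction steps from the report, meant to run inside the rescue/live guest.
# Dry-run by default. Set RUN=1 to execute (DESTROYS all data on $DISK).
DISK="${DISK:-/dev/sda}"
PART="${PART:-${DISK}1}"
MNT="${MNT:-/a}"

# Print the command in dry-run mode, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

repro() {
    run sgdisk --zap-all "$DISK" &&
    run sgdisk --set-alignment=4096 --clear "$DISK" &&
    run sgdisk --set-alignment=4096 --new=1:0:0 "$DISK" &&
    run mkfs.xfs -L fs1 "$PART" &&
    run mkdir -p "$MNT" &&
    run mount "$PART" "$MNT" &&
    run umount "$MNT" &&
    # Make the same filesystem again over the old one (-f forces overwrite).
    run mkfs.xfs -f -L fs1 "$PART" &&
    # Per the report, this second mount is where the failure shows up.
    run mount "$PART" "$MNT"
}

repro
```

When executed for real, a non-zero exit status from the final mount is the symptom being reported, not a bug in the script.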

Here are the volume settings:

# gluster volume info adminvm

Volume Name: adminvm
Type: Replicate
Volume ID: de655913-aad9-4e17-bac4-ff0ad9c28223
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.254.181:/data/brick_adminvm_slot2
Brick2: 172.23.254.182:/data/brick_adminvm_slot2
Brick3: 172.23.254.183:/data/brick_adminvm_slot2
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
performance.io-thread-count: 32
network.frame-timeout: 10800
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 20
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: disable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.granular-entry-heal: enable
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
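
To check the reconfigured options against a reference list such as the "virt" group, a small comparison helper may be useful. This is a sketch under assumptions: the function name compare_group is made up for illustration, and the reference file is assumed to contain plain key=value lines. It reads a saved copy of the volume info output rather than calling gluster itself.

```shell
#!/bin/sh
# compare_group INFO_FILE GROUP_FILE
# INFO_FILE:  saved output of "gluster volume info <volume>"
# GROUP_FILE: reference options, one key=value per line (assumed format)
# Prints one line for each reference option that the volume does not list
# or lists with a different value. Comparison is a plain string match.
compare_group() {
    info_file=$1
    group_file=$2
    while IFS='=' read -r key want || [ -n "$key" ]; do
        # Skip blank lines and comments in the reference file.
        case $key in
            ''|\#*) continue ;;
        esac
        # Look for an exact "key: value" line in the volume info output.
        have=$(awk -F': ' -v k="$key" '$1 == k { print $2; exit }' "$info_file")
        if [ -z "$have" ]; then
            printf 'unset    %s (group wants %s)\n' "$key" "$want"
        elif [ "$have" != "$want" ]; then
            printf 'differs  %s: volume=%s group=%s\n' "$key" "$have" "$want"
        fi
    done < "$group_file"
}
```

A possible use would be to save the output of gluster volume info adminvm to a file and pass it together with the group file, which on many installations lives under /var/lib/glusterd/groups/ (treat that path as an assumption for your distro). "unset" only means the option is not listed under Options Reconfigured, so the volume is using its default. Because the comparison is textual, equivalent spellings such as on and enable are reported as different.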

Any help or ideas would be appreciated. Let us know if we have a setting
incorrect or have made an error.

Thank you all!

Erik

Andreas Schwibbe

2024-Oct-14 09:33 UTC

Hey Erik,

I am running a similar setup with no issues, using Ubuntu host systems
on HPE DL380 Gen 10.
However, I used to run libvirt/qemu via nfs-ganesha on top of gluster
flawlessly.
Recently I upgraded to the native GFAPI implementation, which is poorly
documented, with snippets scattered all over the internet.

Although I cannot provide a direct solution for your issue, I would
suggest trying either nfs-ganesha or GFAPI as a replacement for the
fuse mount.
Happy to share libvirt/GFAPI config hints to make it happen.

Best
A.
