In our Xen cluster, we have:

- Many DomU hosts (CentOS 5.2, paravirtualized) mounting a GFS filesystem on a VBD,
- A few Dom0 hosts (CentOS 5.2), connected over GigE,
- A single SAN providing shared block storage for all of the above.

Works great most of the time. The DomU storage is backed by logical volumes on the Dom0s, all part of a clustered VG on the SAN. Once every few weeks, however, we experience FS corruption with kernel messages like:

May 28 09:30:26 r3core-roll03 kernel: GFS: fsid=r3core-inner:wwwdocs.1: fatal: invalid metadata block
May 28 09:30:26 r3core-roll03 kernel: GFS: fsid=r3core-inner:wwwdocs.1: bh = 27845 (type: exp=4, found=9)
May 28 09:30:26 r3core-roll03 kernel: GFS: fsid=r3core-inner:wwwdocs.1: function = gfs_get_meta_buffer
May 28 09:30:26 r3core-roll03 kernel: GFS: fsid=r3core-inner:wwwdocs.1: file /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_xen/src/gfs/dio.c, line = 1225
May 28 09:30:26 r3core-roll03 kernel: GFS: fsid=r3core-inner:wwwdocs.1: time = 1243517426
May 28 09:30:27 r3core-roll03 kernel: GFS: fsid=r3core-inner:wwwdocs.1: about to withdraw from the cluster
May 28 09:30:27 r3core-roll03 kernel: GFS: fsid=r3core-inner:wwwdocs.1: telling LM to withdraw

The remedy is to shut down the nodes accessing the shared FS, fsck and/or mkfs it, then start up again.

What is puzzling is the exact cause of the FS corruption. As we try to narrow it down, I've been forced to closely examine the block layers in Xen. While I don't fully understand (yet) what blkback is doing, I'm nervous that its request queueing causes blocks to be flushed to disk asynchronously. That could be very bad for shared filesystems, as I'd expect a file's metadata blocks to need to be written to physical media before a lock is released.

So I'm looking at blktap now. Most documentation suggests configuring VBDs with tap:aio:, however my reading of this suggests it can also reorder or defer block writes, which I'm trying to avoid.
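
For reference, here is a rough sketch of how the three VBD variants under consideration would look in a DomU config file (xm config files are parsed as Python). The LV path and the xvdb device name are hypothetical placeholders, not our real ones, and the "w!" mode is, as I understand it, how xm is told to allow a writable device to be shared between guests. Nothing here is a claim about which driver actually preserves write ordering; that is the open question.

```python
# Sketch of a DomU config fragment (xm config files use Python syntax).
# ASSUMPTION: the LV path and guest device name below are hypothetical.
lv_path = "/dev/vg_san/wwwdocs"  # hypothetical clustered LV on the SAN

# Variant 1: plain blkback on the logical volume.
# "w!" asks for shared read-write access (assumed needed for GFS).
disk_phy = ["phy:%s,xvdb,w!" % lv_path]

# Variant 2: blktap with the asynchronous I/O driver.
disk_aio = ["tap:aio:%s,xvdb,w!" % lv_path]

# Variant 3: blktap with the synchronous driver.
disk_sync = ["tap:sync:%s,xvdb,w!" % lv_path]

# Only one variant is active at a time; xm reads the name "disk".
disk = disk_sync
```

Switching between variants is then a one-line change to the final assignment, followed by a restart of the guest.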
It looks like tap:sync: is what I really need, though very little documentation is available on that specific driver.

Surely somebody must have had this problem before, but a couple of days of searching and reading have yielded very little. Or am I way off base in understanding the magic that is GFS and how it guarantees filesystem consistency?

Help please?

-Jeff

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users