Does anyone have any experience running gluster with XFS and MD RAID as the
backend, and/or LSI HBAs, especially bad experience?

In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
controllers, MD RAID) I can cause XFS corruption just by throwing some
bonnie++ load at the array - locally, without gluster. This happens within
hours. The same test run over a week doesn't corrupt with ext4.

I've just been bitten by this in production too, on a gluster brick I hadn't
converted to ext4. I have the details I can post separately if you wish, but
the main symptoms were XFS timeout errors and stack traces in dmesg, and XFS
corruption (requiring a reboot and xfs_repair showing lots of errors, almost
certainly some data loss).

However, this leaves me with some unpalatable conclusions and I'm not sure
where to go from here.

(1) XFS is a shonky filesystem, at least in the version supplied in Ubuntu
kernels. This seems unlikely given its pedigree and the fact that it is
heavily endorsed by Red Hat for their storage appliance.

(2) Heavy write load in XFS is tickling a bug lower down in the stack
(either MD RAID or the LSI mpt2sas driver/firmware), but heavy write load in
ext4 doesn't. This would have to be a gross error such as blocks queued for
write being thrown away without being sent to the drive.

I guess this is plausible - perhaps the usage pattern of write barriers is
different, for example. However I don't want to point the finger there
without direct evidence either. There are no block I/O error events logged
in dmesg.

The only way I can think of pinning this down is to find out what's the
smallest MD RAID array I can reproduce the problem with, then try to build a
new system with a different controller card (as MD RAID + JBOD, and/or as a
hardware RAID array).

However, while I try to see what I can do for that, I would be grateful for
any other experience people have in this area.

Many thanks,

Brian.
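For narrowing things down to a smaller array, a rough sketch of the kind of
test loop described above (the device names /dev/sd[b-g] and /dev/md1, the
mount point /mnt/xfstest, and the RAID-6 layout are illustrative assumptions,
not the actual configuration from this thread):

    # build a smaller MD array from a subset of the drives (layout is assumed)
    mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[b-g]
    mkfs.xfs /dev/md1
    mkdir -p /mnt/xfstest
    mount /dev/md1 /mnt/xfstest

    # hammer it with bonnie++ in a loop, watching dmesg for XFS shutdown messages
    while true; do
        bonnie++ -d /mnt/xfstest -u root || break
    done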
On 08/29/2012 03:48 AM, Brian Candler wrote:
> Does anyone have any experience running gluster with XFS and MD RAID as the
> backend, and/or LSI HBAs, especially bad experience?

We have a few servers with 12-drive LSI RAID controllers we use for gluster
(running XFS on RHEL 6.2). I don't recall seeing major issues, but to be fair
these particular systems see more hacking/dev/unit-test work than longevity or
stress testing. We also are not using MD in any way (hardware RAID).

I'd be happy to throw a similar workload at one of them if you can describe
your configuration in a bit more detail:

  - specific MD configuration (RAID type, chunk size, etc.)
  - XFS format options and mount options
  - anything else that might be in the I/O stack (LVM?)
  - the specific bonnie++ test you're running (a single instance, or some kind
    of looping test?)

> In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
> controllers, MD RAID) I can cause XFS corruption just by throwing some
> bonnie++ load at the array - locally without gluster. This happens within
> hours. The same test run over a week doesn't corrupt with ext4.
>
> I've just been bitten by this in production too on a gluster brick I hadn't
> converted to ext4. I have the details I can post separately if you wish,
> but the main symptoms were XFS timeout errors and stack traces in dmesg, and
> xfs corruption (requiring a reboot and xfs_repair showing lots of errors,
> almost certainly some data loss).

Could you collect the generic data and post it to linux-xfs? Somebody might be
able to read further into the problem via the stack traces. It also might be
worth testing an upstream kernel on your server, if possible.

Brian

> However, this leaves me with some unpalatable conclusions and I'm not sure
> where to go from here.
>
> (1) XFS is a shonky filesystem, at least in the version supplied in Ubuntu
> kernels. This seems unlikely given its pedigree and the fact that it is
> heavily endorsed by Red Hat for their storage appliance.
>
> (2) Heavy write load in XFS is tickling a bug lower down in the stack
> (either MD RAID or LSI mpt2sas driver/firmware), but heavy write load in
> ext4 doesn't. This would have to be a gross error such as blocks queued for
> write being thrown away without being sent to the drive.
>
> I guess this is plausible - perhaps the usage pattern of write barriers is
> different for example. However I don't want to point the finger there
> without direct evidence either. There are no block I/O error events logged
> in dmesg.
>
> The only way I can think of pinning this down is to find out what's the
> smallest MD RAID array I can reproduce the problem with, then try to build a
> new system with a different controller card (as MD RAID + JBOD, and/or as a
> hardware RAID array)
>
> However while I try to see what I can do for that, I would be grateful for
> any other experience people have in this area.
>
> Many thanks,
>
> Brian.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
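The configuration details requested above could be gathered with something
along these lines (a sketch only: /dev/md0 and the /brick mount point are
assumed names, and the bonnie++ invocation is an example rather than the exact
test that was run):

    mdadm --detail /dev/md0        # RAID level, chunk size, member disks
    cat /proc/mdstat               # overall MD state
    xfs_info /brick                # XFS geometry and format options
    grep /brick /proc/mounts       # mount options actually in effect

    # example bonnie++ run against the brick
    bonnie++ -d /brick/bonnie -s 64g -n 128 -u root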
On 08/29/2012 03:48 AM, Brian Candler wrote:
> Does anyone have any experience running gluster with XFS and MD RAID as the
> backend, and/or LSI HBAs, especially bad experience?

Lots.

It's pretty solid as long as your hardware/driver/kernel revs are solid, and
that requires updated firmware. We've found that modern LSI HBA and RAID gear
has had issues with occasional "events" that seem to be more firmware bugs or
driver bugs than anything else. The gear is stable for very light usage, but
when pushed hard (without driver/firmware updates), it does crash, hard, often
with corruption.

> In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
> controllers, MD RAID) I can cause XFS corruption just by throwing some
> bonnie++ load at the array - locally without gluster. This happens within
> hours. The same test run over a week doesn't corrupt with ext4.

Which kernel? I can't say I've ever seen XFS corruption from light use. It
usually takes some significant failure of some sort to cause this: an iffy
driver, a bad disk, etc.

The ext4 comparison might not be apt. ext4 isn't designed for parallel I/O
workloads, while XFS is. Chances are you are tickling a driver/kernel bug with
the higher amount of work being done by XFS versus ext4.

> I've just been bitten by this in production too on a gluster brick I hadn't
> converted to ext4. I have the details I can post separately if you wish,
> but the main symptoms were XFS timeout errors and stack traces in dmesg, and
> xfs corruption (requiring a reboot and xfs_repair showing lots of errors,
> almost certainly some data loss).
>
> However, this leaves me with some unpalatable conclusions and I'm not sure
> where to go from here.
>
> (1) XFS is a shonky filesystem, at least in the version supplied in Ubuntu
> kernels. This seems unlikely given its pedigree and the fact that it is
> heavily endorsed by Red Hat for their storage appliance.

Uh ... no. It's pretty much the best/only choice for large storage systems out
there. It's almost 20 years old at this point, making its first appearance in
Irix in the 1995 time frame or so and moving to Linux a few years later. It's
many things, but crappy ain't one of them.

> (2) Heavy write load in XFS is tickling a bug lower down in the stack
> (either MD RAID or LSI mpt2sas driver/firmware), but heavy write load in
> ext4 doesn't. This would have to be a gross error such as blocks queued for
> write being thrown away without being sent to the drive.

XFS is a parallel I/O file system; ext4 is not. There is a very good chance
you are tickling a bug lower in the stack. Which LSI HBA or RAID are you
using? How have you set this up? What kernel rev, and what is the output of
modinfo mpt2sas, lspci and uname -a?

> I guess this is plausible - perhaps the usage pattern of write barriers is
> different for example. However I don't want to point the finger there
> without direct evidence either. There are no block I/O error events logged
> in dmesg.

It's very different. XFS is pretty good about not corrupting things: the file
system shuts down if it detects that it is corrupt.
So if the in-memory image of the current state at the moment of sync is not
matched by what's on the platters/SSD chips, then chances are you have a
problem in that pathway.

> The only way I can think of pinning this down is to find out what's the
> smallest MD RAID array I can reproduce the problem with, then try to build a
> new system with a different controller card (as MD RAID + JBOD, and/or as a
> hardware RAID array)

This would be a good start.

> However while I try to see what I can do for that, I would be grateful for
> any other experience people have in this area.

We've had lots of problems with LSI drivers/firmware before rev 11.x.y.z.

FWIW: we have siCluster storage customers with exactly these types of designs,
with uptimes measurable in hundreds of days, using Gluster atop XFS atop MD
RAID on our units. We also have customers who tickle obscure and
hard-to-reproduce bugs, causing crashes. It's not frequent, but it does
happen; not with the file system, but usually with the network drivers or
overloaded NFS servers.

> Many thanks,
>
> Brian.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
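For what it's worth, the driver/kernel versions and the cache/barrier state
discussed in this thread can be collected with commands along these lines (a
sketch; /dev/sdb and md0 are placeholder device names):

    uname -a                                  # kernel rev
    modinfo mpt2sas | grep -i version         # mpt2sas driver version
    lspci | grep -i lsi                       # which LSI controller is present
    dmesg | grep -i mpt2sas                   # driver/firmware messages from boot

    hdparm -W /dev/sdb                        # is the drive's volatile write cache on?
    grep md0 /proc/mounts                     # confirm XFS isn't mounted with "nobarrier"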
* Brian Candler <B.Candler at pobox.com> [2012 08 29, 08:48]:
> In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
> controllers, MD RAID) I can cause XFS corruption just by throwing some
> bonnie++ load at the array - locally without gluster.

Randomly found on Google:
http://www.jive.nl/nexpres/doku.php?id=nexpres:nexpres_wp8#tests_on_xfs_file_system

"It is our opinion that the normalization of XFS behavior on a 24 disks array
is due to some proprietary round-robin algorithm on the raid card that caused
during the tests on a 12 disks array a 'missing disk' signal that slowed down
the pace, even though some downfalls on the 24 disks array still happen every
18/20 files written. We ought to say that the downfall pattern is not related
to time delays or file sizes, but it is instead a peculiarity of the XFS file
system."

Now I'd _really_ like to know whether you are using a MegaRAID or, as you say
at the end, an mpt2sas controller/driver, because I am going to set up a new
gluster volume with them, and considering this issue and the ext4 one I don't
really know what to choose...

Regards