Does anyone have any experience running gluster with XFS and MD RAID as the
backend, and/or LSI HBAs, especially bad experience?

In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
controllers, MD RAID) I can cause XFS corruption just by throwing some
bonnie++ load at the array - locally, without gluster. This happens within
hours. The same test run over a week doesn't corrupt with ext4.

I've just been bitten by this in production too, on a gluster brick I hadn't
converted to ext4. I have the details I can post separately if you wish, but
the main symptoms were XFS timeout errors and stack traces in dmesg, and XFS
corruption (requiring a reboot and xfs_repair showing lots of errors, almost
certainly some data loss).

However, this leaves me with some unpalatable conclusions and I'm not sure
where to go from here.

(1) XFS is a shonky filesystem, at least in the version supplied in Ubuntu
kernels. This seems unlikely given its pedigree and the fact that it is
heavily endorsed by Red Hat for their storage appliance.

(2) Heavy write load in XFS is tickling a bug lower down in the stack
(either MD RAID or the LSI mpt2sas driver/firmware), but heavy write load in
ext4 doesn't. This would have to be a gross error such as blocks queued for
write being thrown away without being sent to the drive.

I guess this is plausible - perhaps the usage pattern of write barriers is
different, for example. However I don't want to point the finger there
without direct evidence either. There are no block I/O error events logged
in dmesg.

The only way I can think of pinning this down is to find out what's the
smallest MD RAID array I can reproduce the problem with, then try to build a
new system with a different controller card (as MD RAID + JBOD, and/or as a
hardware RAID array).

However, while I try to see what I can do for that, I would be grateful for
any other experience people have in this area.

Many thanks,

Brian.
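For narrowing things down to a smaller array, a rough sketch of the kind of
test loop described above (the device names /dev/sd[b-g] and /dev/md1, the
mount point /mnt/xfstest, and the RAID-6 layout are illustrative assumptions,
not the actual configuration from this thread):

    # build a smaller MD array from a subset of the drives (layout is assumed)
    mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[b-g]
    mkfs.xfs /dev/md1
    mkdir -p /mnt/xfstest
    mount /dev/md1 /mnt/xfstest

    # hammer it with bonnie++ in a loop, watching dmesg for XFS shutdown messages
    while true; do
        bonnie++ -d /mnt/xfstest -u root || break
    done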
On 08/29/2012 03:48 AM, Brian Candler wrote:
> Does anyone have any experience running gluster with XFS and MD RAID as the
> backend, and/or LSI HBAs, especially bad experience?

We have a few servers with 12-drive LSI RAID controllers we use for gluster
(running XFS on RHEL 6.2). I don't recall seeing major issues, but to be fair
these particular systems see more hacking/dev/unit-test work than longevity or
stress testing. We also are not using MD in any way (hardware RAID).

I'd be happy to throw a similar workload at one of them if you can describe
your configuration in a bit more detail:

  - specific MD configuration (RAID type, chunk size, etc.)
  - XFS format options and mount options
  - anything else that might be in the I/O stack (LVM?)
  - the specific bonnie++ test you're running (a single instance, or some kind
    of looping test?)

> In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
> controllers, MD RAID) I can cause XFS corruption just by throwing some
> bonnie++ load at the array - locally without gluster. This happens within
> hours. The same test run over a week doesn't corrupt with ext4.
>
> I've just been bitten by this in production too on a gluster brick I hadn't
> converted to ext4. I have the details I can post separately if you wish,
> but the main symptoms were XFS timeout errors and stack traces in dmesg, and
> xfs corruption (requiring a reboot and xfs_repair showing lots of errors,
> almost certainly some data loss).

Could you collect the generic data and post it to linux-xfs? Somebody might be
able to read further into the problem via the stack traces. It also might be
worth testing an upstream kernel on your server, if possible.

Brian

> However, this leaves me with some unpalatable conclusions and I'm not sure
> where to go from here.
>
> (1) XFS is a shonky filesystem, at least in the version supplied in Ubuntu
> kernels. This seems unlikely given its pedigree and the fact that it is
> heavily endorsed by Red Hat for their storage appliance.
>
> (2) Heavy write load in XFS is tickling a bug lower down in the stack
> (either MD RAID or LSI mpt2sas driver/firmware), but heavy write load in
> ext4 doesn't. This would have to be a gross error such as blocks queued for
> write being thrown away without being sent to the drive.
>
> I guess this is plausible - perhaps the usage pattern of write barriers is
> different for example. However I don't want to point the finger there
> without direct evidence either. There are no block I/O error events logged
> in dmesg.
>
> The only way I can think of pinning this down is to find out what's the
> smallest MD RAID array I can reproduce the problem with, then try to build a
> new system with a different controller card (as MD RAID + JBOD, and/or as a
> hardware RAID array)
>
> However while I try to see what I can do for that, I would be grateful for
> any other experience people have in this area.
>
> Many thanks,
>
> Brian.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
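The configuration details requested above could be gathered with something
along these lines (a sketch only: /dev/md0 and the /brick mount point are
assumed names, and the bonnie++ invocation is an example rather than the exact
test that was run):

    mdadm --detail /dev/md0        # RAID level, chunk size, member disks
    cat /proc/mdstat               # overall MD state
    xfs_info /brick                # XFS geometry and format options
    grep /brick /proc/mounts       # mount options actually in effect

    # example bonnie++ run against the brick
    bonnie++ -d /brick/bonnie -s 64g -n 128 -u root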
On 08/29/2012 03:48 AM, Brian Candler wrote:
> Does anyone have any experience running gluster with XFS and MD RAID as the
> backend, and/or LSI HBAs, especially bad experience?

Lots.

It's pretty solid as long as your hardware/driver/kernel revs are solid, and
that requires updated firmware. We've found that modern LSI HBA and RAID gear
has had issues with occasional "events" that seem to be more firmware bugs or
driver bugs than anything else. The gear is stable for very light usage, but
when pushed hard (without driver/firmware updates), it does crash, hard, often
with corruption.

> In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
> controllers, MD RAID) I can cause XFS corruption just by throwing some
> bonnie++ load at the array - locally without gluster. This happens within
> hours. The same test run over a week doesn't corrupt with ext4.

Which kernel? I can't say I've ever seen XFS corruption from light use. It
usually takes some significant failure of some sort to cause this: an iffy
driver, a bad disk, etc.

The ext4 comparison might not be apt. ext4 isn't designed for parallel I/O
workloads, while XFS is. Chances are you are tickling a driver/kernel bug with
the higher amount of work being done by XFS versus ext4.

> I've just been bitten by this in production too on a gluster brick I hadn't
> converted to ext4. I have the details I can post separately if you wish,
> but the main symptoms were XFS timeout errors and stack traces in dmesg, and
> xfs corruption (requiring a reboot and xfs_repair showing lots of errors,
> almost certainly some data loss).
>
> However, this leaves me with some unpalatable conclusions and I'm not sure
> where to go from here.
>
> (1) XFS is a shonky filesystem, at least in the version supplied in Ubuntu
> kernels. This seems unlikely given its pedigree and the fact that it is
> heavily endorsed by Red Hat for their storage appliance.

Uh ... no. It's pretty much the best/only choice for large storage systems out
there. It's almost 20 years old at this point, making its first appearance in
Irix in the 1995 time frame or so and moving to Linux a few years later. It's
many things, but crappy ain't one of them.

> (2) Heavy write load in XFS is tickling a bug lower down in the stack
> (either MD RAID or LSI mpt2sas driver/firmware), but heavy write load in
> ext4 doesn't. This would have to be a gross error such as blocks queued for
> write being thrown away without being sent to the drive.

XFS is a parallel I/O file system; ext4 is not. There is a very good chance
you are tickling a bug lower in the stack. Which LSI HBA or RAID are you
using? How have you set this up? What kernel rev, and what is the output of
modinfo mpt2sas, lspci and uname -a?

> I guess this is plausible - perhaps the usage pattern of write barriers is
> different for example. However I don't want to point the finger there
> without direct evidence either. There are no block I/O error events logged
> in dmesg.

It's very different. XFS is pretty good about not corrupting things: the file
system shuts down if it detects that it is corrupt.
So if the in-memory image of the current state at the moment of sync is not
matched by what's on the platters/SSD chips, then chances are you have a
problem in that pathway.

> The only way I can think of pinning this down is to find out what's the
> smallest MD RAID array I can reproduce the problem with, then try to build a
> new system with a different controller card (as MD RAID + JBOD, and/or as a
> hardware RAID array)

This would be a good start.

> However while I try to see what I can do for that, I would be grateful for
> any other experience people have in this area.

We've had lots of problems with LSI drivers/firmware before rev 11.x.y.z.

FWIW: we have siCluster storage customers with exactly these types of designs,
with uptimes measurable in hundreds of days, using Gluster atop XFS atop MD
RAID on our units. We also have customers who tickle obscure and
hard-to-reproduce bugs, causing crashes. It's not frequent, but it does
happen; not with the file system, but usually with the network drivers or
overloaded NFS servers.

> Many thanks,
>
> Brian.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
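For what it's worth, the driver/kernel versions and the cache/barrier state
discussed in this thread can be collected with commands along these lines (a
sketch; /dev/sdb and md0 are placeholder device names):

    uname -a                                  # kernel rev
    modinfo mpt2sas | grep -i version         # mpt2sas driver version
    lspci | grep -i lsi                       # which LSI controller is present
    dmesg | grep -i mpt2sas                   # driver/firmware messages from boot

    hdparm -W /dev/sdb                        # is the drive's volatile write cache on?
    grep md0 /proc/mounts                     # confirm XFS isn't mounted with "nobarrier"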
* Brian Candler <B.Candler at pobox.com> [2012 08 29, 08:48]:
> In a test setup (Ubuntu 12.04, gluster 3.3.0, 24 x SATA HD on LSI Megaraid
> controllers, MD RAID) I can cause XFS corruption just by throwing some
> bonnie++ load at the array - locally without gluster.

Randomly found on Google:
http://www.jive.nl/nexpres/doku.php?id=nexpres:nexpres_wp8#tests_on_xfs_file_system

"It is our opinion that the normalization of XFS behavior on a 24 disks array
is due to some proprietary round-robin algorithm on the raid card that caused
during the tests on a 12 disks array a 'missing disk' signal that slowed down
the pace, even though some downfalls on the 24 disks array still happen every
18/20 files written. We ought to say that the downfall pattern is not related
to time delays or file sizes, but it is instead a peculiarity of the XFS file
system."

Now I'd _really_ like to know whether you are using a MegaRAID or, as you say
at the end, an mpt2sas controller/driver, because I am going to set up a new
gluster volume with them, and considering this issue and the ext4 one I don't
really know what to choose...

Regards