I'm running 8.0-RELEASE-p2 (amd64) on a larger number of Supermicro
SBI-7425C-T3 blades. Each of the blades has 2 x 500GB disks striped
into a single volume via the on-board ICH9 RAID controller.
However, after running fine for a while (days), the blades crash
eventually with file system problems such as the one below.
Initially I thought that must be a bad disk, but by now 5 different
blades have shown similar problems so I'm suspecting some OS issue.
Has anybody seen something similar before? Could this be an
incompatibility with the RAID controller? I haven't found much
recent on Google, but there are a number of older threads indicating
that it might not be well supported. I'm not sure whether those
still apply, though.
Any other thoughts?
Thanks,
Robin
--------- syslog -------------------------------------------------------
Jun 9 10:00:02 <user.crit> blade19 kernel:
ar0s1a[WRITE(offset=704187858944, length=114688)]error = 5
Jun 9 10:00:02 <user.crit> blade19 kernel:
g_vfs_done():ar0s1a[WRITE(offset=704188219392, length=131072)]error = 5
Jun 9 10:00:02 <user.crit> blade19 kernel:
g_vfs_done():ar0s1a[WRITE(offset=704188891136, length=114688)]error = 5
Jun 9 10:00:02 <user.crit> blade19 kernel:
g_vfs_done():ar0s1a[WRITE(offset=704189382656, length=114688)]error = 5
Jun 9 10:00:02 <user.crit> blade19 kernel:
g_vfs_done():ar0s1a[WRITE(offset=704189743104, length=131072)]
Jun 9 10:00:02 <user.crit> blade19 kernel: error = 5
--------- system information ------------------------------------------
# uname -a
FreeBSD blade5 8.0-RELEASE-p2 FreeBSD 8.0-RELEASE-p2 #0: Tue Jan 5 21:11:58 UTC
2010 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
# pciconf -lv | grep SATA
device = '82801IB/IR/IH (ICH9 Family) SATA RAID Controller'
# atacontrol list
ATA channel 2:
Master: ad4 <ST9500325AS/0001SDM1> SATA revision 2.x
Slave: no device present
ATA channel 3:
Master: ad6 <ST9500325AS/0001SDM1> SATA revision 2.x
Slave: no device present
# dmesg | grep ata
atapci0: <Intel ICH9 SATA300 controller> port
0x1c50-0x1c57,0x1c44-0x1c47,0x1c48-0x1c4f,0x1c40-0x1c43,0x18e0-0x18ff mem
0xfcc00000-0xfcc007ff irq 17 at device 31.2 on pci0
atapci0: [ITHREAD]
atapci0: AHCI called from vendor specific driver
atapci0: AHCI v1.20 controller with 6 3Gbps ports, PM supported
ata2: <ATA channel 0> on atapci0
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci0
ata3: [ITHREAD]
ata4: <ATA channel 2> on atapci0
ata4: stopping AHCI engine failed
ata4: [ITHREAD]
ata5: <ATA channel 3> on atapci0
ata5: stopping AHCI engine failed
ata5: [ITHREAD]
ata6: <ATA channel 4> on atapci0
ata6: [ITHREAD]
ata7: <ATA channel 5> on atapci0
ata7: [ITHREAD]
ad4: 476940MB <Seagate ST9500325AS 0001SDM1> at ata2-master SATA300
ad6: 476940MB <Seagate ST9500325AS 0001SDM1> at ata3-master SATA300
ar0: writing of DDF metadata is NOT supported yet
ar0: disk0 READY using ad4 at ata2-master
ar0: disk1 READY using ad6 at ata3-master
--
Robin Sommer * Phone +1 (510) 666-2886 * robin@icir.org
ICSI/LBNL * Fax +1 (510) 666-2956 * www.icir.org
On Thu, Jun 10, 2010 at 09:29:19AM -0700, Robin Sommer wrote:
> I'm running 8.0-RELEASE-p2 (amd64) on a larger number of Supermicro
> SBI-7425C-T3 blades. Each of the blades has 2 x 500GB disks striped
> into a single volume via the on-board ICH9 RAID controller.
>
> However, after running fine for a while (days), the blades crash
> eventually with file system problems such as the one below.
> Initially I thought that must be a bad disk, but by now 5 different
> blades have shown similar problems so I'm suspecting some OS issue.
>
> Has anybody seen something similar before? Could this be an
> incompatibility with the RAID controller (I haven't found much
> recent on Google but there are a number of older threads indicating
> that it might not be well supported. Not sure though whether those
> still apply).
>
> Jun 9 10:00:02 <user.crit> blade19 kernel: ar0s1a[WRITE(offset=704187858944, length=114688)]error = 5
> Jun 9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704188219392, length=131072)]error = 5
> Jun 9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704188891136, length=114688)]error = 5
> Jun 9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704189382656, length=114688)]error = 5
> Jun 9 10:00:02 <user.crit> blade19 kernel: g_vfs_done():ar0s1a[WRITE(offset=704189743104, length=131072)]
> Jun 9 10:00:02 <user.crit> blade19 kernel: error = 5

You're using Intel MatrixRAID. Please stop[1]; you're living
dangerously.

The messages your kernel is spitting out could indicate a lot of
different things. Tracking it down will take time. So let's start
with this:

1) Provide output from "gpart show ar0s1". I'm curious about
   something (likely a red herring, but I want to see).

2) Install sysutils/smartmontools and run "smartctl -a /dev/adXX" on
   each of the disks which make up the RAID array. I believe FreeBSD
   can see the disks associated with the array (meaning you should
   have a few adXX disks, in addition to an ar0 entry). I can help
   you decode the output, to see if any of the disks have actual
   problems that indicate they could be going bad.

3) Remove use of MatrixRAID. Alternatives include ccd, gstripe,
   gvinum, or ZFS. I would recommend ZFS if you run RELENG_8 instead
   of -RELEASE, the system is amd64, and it has at least 4GB RAM.
   Remove use of MatrixRAID first, then see if the problem goes away.

4) If the problem still happens after this, there should be
   developers who can help diagnose the problem. Keeping MatrixRAID
   out of the picture helps greatly.

More details: you might consider these opinions, but they're based on
personal experience (I've dealt many a time with MatrixRAID).

The problem is not with the ICH9, given that most of our systems are
Supermicro (not blades, but that doesn't matter) and use ICH9 with
AHCI (both with and without ahci.ko). Intel ICHxx and ESBx
controllers are heavily tested on FreeBSD, both by users and
developers.

[1]: http://en.wikipedia.org/wiki/Intel_Matrix_RAID

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
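
For anyone collecting these kernel messages from several blades, a small script can make them easier to compare. The following is only a minimal sketch, not something posted in the thread: it assumes the one-line message format quoted above, and the names LINE_RE and parse_io_errors are made up for illustration. It extracts the device, operation, offset, and length, and maps the error number to its symbolic errno name (5 is EIO, a generic input/output error).

```python
import errno
import re

# Matches the I/O error lines quoted in the thread, for example:
#   g_vfs_done():ar0s1a[WRITE(offset=704188219392, length=131072)]error = 5
# Assumption: the whole message sits on one line; messages that syslog
# split across two lines will not match.
LINE_RE = re.compile(
    r"(?P<dev>\w+)\[(?P<op>READ|WRITE)\(offset=(?P<offset>\d+), "
    r"length=(?P<length>\d+)\)\]\s*error = (?P<err>\d+)"
)


def parse_io_errors(text):
    """Return one dict per I/O error message found in `text`."""
    results = []
    for m in LINE_RE.finditer(text):
        err = int(m.group("err"))
        results.append({
            "device": m.group("dev"),
            "op": m.group("op"),
            "offset": int(m.group("offset")),
            "length": int(m.group("length")),
            "errno": err,
            # Symbolic name for the error number, e.g. 5 -> "EIO".
            "errno_name": errno.errorcode.get(err, "unknown"),
        })
    return results


# Two of the messages from the original post, used as sample input.
sample = (
    "g_vfs_done():ar0s1a[WRITE(offset=704188219392, length=131072)]error = 5\n"
    "g_vfs_done():ar0s1a[WRITE(offset=704188891136, length=114688)]error = 5\n"
)

for e in parse_io_errors(sample):
    gib = e["offset"] / 2**30  # byte offset into ar0s1a, in GiB
    print(f'{e["device"]} {e["op"]} at {gib:.1f} GiB, '
          f'{e["length"]} bytes: {e["errno_name"]}')
```

EIO only says that the write failed somewhere below the file system. It does not by itself distinguish a failing disk from a controller or driver problem, which is why the SMART output from the individual adXX disks is still needed.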