Folks,
I kicked off a thread just before the holidays regarding some problems
we are having with an Intel SRCU42X RAID controller in a dual
processor production server originally under 5.3-STABLE and now
under 4.10-STABLE. The thread ran out of steam, with no resolution to
the problem, but I'm hoping that with extra information I might get to
the bottom of it.
Basically, after some amount of uptime the kernel will emit an "amr0:
Bad slot x completed" message, and pretty soon after this the box goes
into a partially unresponsive state, forcing us to reboot it. So far the
only
thing triggering the problem is the nightly jobs, where the amount of
IO is higher than during the day.
Before deployment, we tested the box with 5.3-STABLE and managed to
trigger the problem twice. This forced us to try 4.10-STABLE which
was fine in testing and for a number of weeks after deployment.
However, just before New Year we saw our first Bad Slot and crash under
4.10. Since then it has happened 3 more times. We have upgraded the
firmware to the latest version available from Intel, and if anything
this has made the problem worse.
We're beginning to suspect a dud card but could do with a few "works
fine for us" style posts to build confidence in the support for the
card under FreeBSD. The amr driver doesn't explicitly support the
card, but it's a rebadged MegaRAID 320 as far as we can tell.
Scott Long has posted to say that he is seeing similar problems,
but I'm wondering: if it really is a problem with the driver, wouldn't
more of you be having problems?
The machine has 3 disks configured as a single RAID5 array. A fourth
disk is configured as a hot standby. The card is equipped with 128MB
of battery-backed cache. Write-back caching is enabled on the card.
Read-ahead caching is enabled in non-adaptive mode.
Is anyone else using an SRCU42X RAID card and seeing similar
problems to ours? What about other cards supported by the amr driver?
We could just change the controller, but the problem we are having is
pretty random and the feedback gap between change and outcome is long.
We'd like to have more information to work with before deciding the
next step.
uname -a
FreeBSD xxxxx 4.10-STABLE FreeBSD 4.10-STABLE #7: Tue Nov 16 12:50:42 GMT 2004
dmesg
Copyright (c) 1992-2004 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.10-STABLE #7: Tue Nov 16 12:50:42 GMT 2004
dermot@pooh.traveldev.com:/usr/obj/usr/src/sys/POOH
Timecounter "i8254" frequency 1193182 Hz
CPU: Intel(R) Xeon(TM) CPU 3.20GHz (3189.72-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf25 Stepping = 5
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Hyperthreading: 2 logical CPUs
real memory = 4026466304 (3932096K bytes)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
Programming 24 pins in IOAPIC #1
Programming 24 pins in IOAPIC #2
FreeBSD/SMP: Multiprocessor motherboard: 4 CPUs
cpu0 (BSP): apic id: 0, version: 0x00050014, at 0xfee00000
cpu1 (AP): apic id: 1, version: 0x00050014, at 0xfee00000
cpu2 (AP): apic id: 6, version: 0x00050014, at 0xfee00000
cpu3 (AP): apic id: 7, version: 0x00050014, at 0xfee00000
io0 (APIC): apic id: 8, version: 0x00178020, at 0xfec00000
io1 (APIC): apic id: 9, version: 0x00178020, at 0xfec81000
io2 (APIC): apic id: 10, version: 0x00178020, at 0xfec81400
Preloaded elf kernel "kernel" at 0xc03cc000.
Preloaded userconfig_script "/boot/kernel.conf" at 0xc03cc09c.
Warning: Pentium 4 CPU: PSE disabled
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 19 entries at 0xc00f3630
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
IOAPIC #0 intpin 16 -> irq 2
IOAPIC #0 intpin 19 -> irq 16
pci0: <PCI bus> on pcib0
pci0: <unknown card> (vendor=0x8086, dev=0x2541) at 0.1
pcib1: <PCI to PCI bridge (vendor=8086 device=2545)> at device 3.0 on pci0
pci2: <PCI bus> on pcib1
pci2: <unknown card> (vendor=0x8086, dev=0x1461) at 28.0
pcib2: <PCI to PCI bridge (vendor=8086 device=1460)> at device 29.0 on pci2
IOAPIC #2 intpin 2 -> irq 18
IOAPIC #2 intpin 1 -> irq 19
pci5: <PCI bus> on pcib2
ahd0: <Adaptec AIC7902 Ultra320 SCSI adapter> port 0x4000-0x40ff,0x3800-0x38ff mem 0xfe9e0000-0xfe9e1fff irq 18 at device 7.0 on pci5
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
ahd1: <Adaptec AIC7902 Ultra320 SCSI adapter> port 0x3400-0x34ff,0x3000-0x30ff mem 0xfe9f0000-0xfe9f1fff irq 19 at device 7.1 on pci5
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
pci2: <unknown card> (vendor=0x8086, dev=0x1461) at 30.0
pcib3: <PCI to PCI bridge (vendor=8086 device=1460)> at device 31.0 on pci2
IOAPIC #1 intpin 6 -> irq 20
IOAPIC #1 intpin 7 -> irq 21
pci3: <PCI bus> on pcib3
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x2040-0x207f mem 0xfe6c0000-0xfe6dffff irq 20 at device 7.0 on pci3
em0: Speed:N/A Duplex:N/A
em1: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0x2000-0x203f mem 0xfe6e0000-0xfe6fffff irq 21 at device 7.1 on pci3
em1: Speed:N/A Duplex:N/A
pcib4: <PCI to PCI bridge (vendor=1014 device=01a7)> at device 9.0 on pci3
IOAPIC #1 intpin 3 -> irq 22
pci4: <PCI bus> on pcib4
amr0: <LSILogic MegaRAID> mem 0xfe580000-0xfe5fffff,0xfbef0000-0xfbefffff irq 22 at device 0.0 on pci4
amr0: <LSILogic Intel(R) RAID Controller SRCU42X> Firmware 413Y, BIOS H420, 128MB RAM
pci0: <unknown card> (vendor=0x8086, dev=0x2546) at 3.1
uhci0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> port 0x5020-0x503f irq 2 at device 29.0 on pci0
usb0: <Intel 82801CA/CAM (ICH3) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> port 0x5000-0x501f irq 16 at device 29.1 on pci0
usb1: <Intel 82801CA/CAM (ICH3) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
pcib5: <Intel 82801BA/BAM (ICH2) Hub to PCI bridge> at device 30.0 on pci0
pci1: <PCI bus> on pcib5
pci1: <ATI Mach64-GR graphics accelerator> at 12.0 irq 17
isab0: <PCI to ISA bridge (vendor=8086 device=2480)> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH3 ATA100 controller> port 0x3a0-0x3af,0-0x3,0-0x7,0-0x3,0-0x7 irq 0 at device 31.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: <unknown card> (vendor=0x8086, dev=0x2483) at 31.3 irq 17
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xc9000-0xc9fff on isa0
pmtimer0 on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: routing 8254 via IOAPIC #0 intpin 2
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
acd0: CDROM <SAMSUNG CD-ROM SN-124> at ata1-master PIO4
Waiting 15 seconds for SCSI devices to settle
amrd0: <LSILogic MegaRAID logical drive> on amr0
amrd0: 140012MB (286744576 sectors) RAID 5 (optimal)
pass0 at amr0 bus 0 target 6 lun 0
pass0: <ESG-SHV SCA HSBP M22 0.06> Fixed Processor SCSI-2 device
Mounting root from ufs:/dev/amrd0s1a
Regards,
Tony.
--
Tony Byrne