Hey everybody!
A while back I noticed strange system crashes on my personal file
server. At first I thought they were caused by bad connections. But
after exchanging all the SATA cables, the crashes didn't go away. Then I
started looking around the web.
As it seems there is a problem with FreeBSD ata(4) and certain Promise
controllers out there - probably as a result of the lack of
documentation from the hardware vendor. There have been threads
discussing the problem. I haven't however seen any where someone with
the actual problem and controller tried out much of anything.
Sofar I have *not* been able to reproduce the errors at will, I can only
say that heavy I/O helps - which however shouldn't come as much of a
surprise. As far as I can tell by now, a crash only occurs when writing
to a drive. I have gone through several gigs of reading (as fast as the
system will read) and there hasn't been a crash yet. Writing files (even
over a slow 100MB network connection) has crashed the system several
times now within the last 48 hours.
I have to slightly explain the word "crash" here: I don't actually
have
to hard reset the system myself. My box just does a reboot by itself. No
filesystem is unmounted cleanly and because the machine isn't really new
and powerful fsck takes pretty long.
As you can imagine, the situation is rather ball busting. This is not so
much a complaint but rather a show of motivation to do something about
it. If one of the developers wants me to try out certain things to find
clues, hit me. I don't mind compiling the kernel serveral times a day -
or even the whole world if that will do any good. Just remember that
this is only a 32 bit AMD, so compiling and results may take some time. :-)
Below is a selection[1] of the error messages from /var/log/messages.
Not that I use geli(8). All of the drives where this happens are
encrypted. I doubt that that has anything to do with my problems though
since I found lots of accounts of this problem on the web without any
mention of geli.
I have also includes the boot-messages[2] from my box. You will notice
that there are two identical controller in the system. Before FreeBSD 8
I used 6 (and skipped 7). Before the installation of 6 I had no problems
with there controllers.
On a side note...
Should this problem prove to be non.resolvable, I would appreciate a
recommendation about alternative controllers that work and where the
vendor isn't quite so reluctant to hand out documentation.
Best regards from Germany,
Chris
[1]
Jun 13 15:31:10 jon kernel: ata8: error issuing SET_MULTI command
Jun 13 15:31:10 jon kernel: ad16: FAILURE - WRITE_DMA timed out LBA=511
Jun 13 15:31:10 jon kernel: GEOM_ELI: Crypto WRITE request failed
(error=5). ad16s1d.eli[WRITE(offset=229376,
length=20480)]
Jun 13 15:31:10 jon kernel:
g_vfs_done():ad16s1d.eli[WRITE(offset=229376, length=20480)]error = 5
Jun 13 15:31:34 jon kernel: ata8: SIGNATURE: ffffffff
Jun 13 15:31:34 jon kernel: ata8: timeout waiting to issue command
Jun 13 15:31:34 jon kernel: ata8: error issuing SETFEATURES SET TRANSFER
MODE command
Jun 13 15:31:34 jon kernel: ata8: timeout waiting to issue command
Jun 13 15:31:34 jon kernel: ata8: error issuing SETFEATURES ENABLE
RCACHE command
Jun 13 15:31:34 jon kernel: ata8: timeout waiting to issue command
Jun 13 15:31:34 jon kernel: ata8: error issuing SETFEATURES ENABLE
WCACHE command
Jun 13 15:31:34 jon kernel: ata8: timeout waiting to issue command
Jun 13 15:31:34 jon kernel: ata8: error issuing SET_MULTI command
Jun 13 15:31:34 jon kernel: ad16: TIMEOUT - READ_DMA48 retrying (0
retries left) LBA=509133183
Jun 13 15:33:17 jon syslogd: kernel boot file is /boot/kernel/kernel
Jun 13 15:33:17 jon kernel: Copyright (c) 1992-2011 The FreeBSD Project.
Jun 13 15:33:17 jon kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988,
1989, 1991, 1992, 1993, 1994
[2]
Jun 13 15:33:17 jon kernel: FreeBSD 8.2-STABLE #1: Sun Jun 12 17:42:17
CEST 2011
Jun 13 15:33:17 jon kernel:
root@jon.rz1.convenimus.net:/usr/obj/usr/src/sys/JON i386
Jun 13 15:33:17 jon kernel: Timecounter "i8254" frequency 1193182 Hz
quality 0
Jun 13 15:33:17 jon kernel: CPU: AMD Athlon(tm) XP 2400+ (1992.03-MHz
686-class CPU)
Jun 13 15:33:17 jon kernel: Origin = "AuthenticAMD" Id = 0x681
Family
= 6 Model = 8 Stepping = 1
Jun 13 15:33:17 jon kernel:
Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,P
AT,PSE36,MMX,FXSR,SSE>
Jun 13 15:33:17 jon kernel: AMD
Features=0xc0400800<SYSCALL,MMX+,3DNow!+,3DNow!>
Jun 13 15:33:17 jon kernel: real memory = 671088640 (640 MB)
Jun 13 15:33:17 jon kernel: avail memory = 647278592 (617 MB)
Jun 13 15:33:17 jon kernel: ACPI Error: A valid RSDP was not found
(20101013/tbxfroot-309)
Jun 13 15:33:17 jon kernel: MPTable: <OEM00000 PROD00000000>
Jun 13 15:33:17 jon kernel: ioapic0: Assuming intbase of 0
Jun 13 15:33:17 jon kernel: ioapic0 <Version 1.1> irqs 0-23 on motherboard
Jun 13 15:33:17 jon kernel: kbd1 at kbdmux0
Jun 13 15:33:17 jon kernel: ACPI Error: A valid RSDP was not found
(20101013/tbxfroot-309)
Jun 13 15:33:17 jon kernel: ACPI: Table initialisation failed: AE_NOT_FOUND
Jun 13 15:33:17 jon kernel: ACPI: Try disabling either ACPI or apic support.
Jun 13 15:33:17 jon kernel: pcib0: <MPTable Host-PCI bridge> pcibus 0 on
motherboard
Jun 13 15:33:17 jon kernel: pci0: <PCI bus> on pcib0
Jun 13 15:33:17 jon kernel: agp0: <VIA 82C8363 (Apollo KT133x/KM133)
host to PCI bridge> on hostb0
Jun 13 15:33:17 jon kernel: agp0: aperture size is 4M
Jun 13 15:33:17 jon kernel: pcib1: <PCI-PCI bridge> at device 1.0 on pci0
Jun 13 15:33:17 jon kernel: pci1: <PCI bus> on pcib1
Jun 13 15:33:17 jon kernel: isab0: <PCI-ISA bridge> at device 7.0 on pci0
Jun 13 15:33:17 jon kernel: isa0: <ISA bus> on isab0
Jun 13 15:33:17 jon kernel: atapci0: <VIA 82C686B UDMA100 controller>
port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xd000-0xd00f at device 7.1 on pci0
Jun 13 15:33:17 jon kernel: atapci0: Correcting VIA config for
southbridge data corruption bug
Jun 13 15:33:17 jon kernel: ata0: <ATA channel 0> on atapci0
Jun 13 15:33:17 jon kernel: ata0: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata1: <ATA channel 1> on atapci0
Jun 13 15:33:17 jon kernel: ata1: [ITHREAD]
Jun 13 15:33:17 jon kernel: atapci1: <Promise PDC40718 SATA300
controller> port 0xdc00-0xdc7f,0xe000-0xe0ff mem
0xd3461000-0xd3461fff,0xd3420000-0xd343ffff irq 11 at device 8.0 on pci0
Jun 13 15:33:17 jon kernel: atapci1: [ITHREAD]
Jun 13 15:33:17 jon kernel: atapci1: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata2: <ATA channel 0> on atapci1
Jun 13 15:33:17 jon kernel: ata2: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata2: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata3: <ATA channel 1> on atapci1
Jun 13 15:33:17 jon kernel: ata3: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata3: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata4: <ATA channel 2> on atapci1
Jun 13 15:33:17 jon kernel: ata4: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata4: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata5: <ATA channel 3> on atapci1
Jun 13 15:33:17 jon kernel: ata5: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata5: [ITHREAD]
Jun 13 15:33:17 jon kernel: fxp0: <Intel 82550 Pro/100 Ethernet> port
0xe400-0xe43f mem 0xd3460000-0xd3460fff,0xd3440000-0xd345ffff irq 10 at
device 9.0 on pci0
Jun 13 15:33:17 jon kernel: miibus0: <MII bus> on fxp0
Jun 13 15:33:17 jon kernel: inphy0: <i82555 10/100 media interface> PHY
1 on miibus0
Jun 13 15:33:17 jon kernel: inphy0: 10baseT, 10baseT-FDX, 100baseTX,
100baseTX-FDX, auto, auto-flow
Jun 13 15:33:17 jon kernel: fxp0: Ethernet address: 00:02:b3:ad:ad:d7
Jun 13 15:33:17 jon kernel: fxp0: [ITHREAD]
Jun 13 15:33:17 jon kernel: atapci2: <Promise PDC40718 SATA300
controller> port 0xe800-0xe87f,0xec00-0xecff mem
0xd3462000-0xd3462fff,0xd3400000-0xd341ffff irq 11 at device 12.0 on pci0
Jun 13 15:33:17 jon kernel: atapci2: [ITHREAD]
Jun 13 15:33:17 jon kernel: atapci2: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata6: <ATA channel 0> on atapci2
Jun 13 15:33:17 jon kernel: ata6: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata6: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata7: <ATA channel 1> on atapci2
Jun 13 15:33:17 jon kernel: ata7: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata7: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata8: <ATA channel 2> on atapci2
Jun 13 15:33:17 jon kernel: ata8: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata8: [ITHREAD]
Jun 13 15:33:17 jon kernel: ata9: <ATA channel 3> on atapci2
Jun 13 15:33:17 jon kernel: ata9: SIGNATURE: 00000101
Jun 13 15:33:17 jon kernel: ata9: [ITHREAD]
Jun 13 15:33:17 jon kernel: vgapci0: <VGA-compatible display> mem
0xd1000000-0xd1003fff,0xd2000000-0xd27fffff at device 13.0 on pci0
Jun 13 15:33:17 jon kernel: cpu0 on motherboard
Jun 13 15:33:17 jon kernel: pmtimer0 on isa0
Jun 13 15:33:17 jon kernel: atrtc0: <AT realtime clock> at port
0x70-0x71 irq 8 pnpid PNP0b00 on isa0
Jun 13 15:33:17 jon kernel: atkbdc0: <Keyboard controller (i8042)> at
port 0x60,0x64 irq 1 pnpid PNP0303 on isa0
Jun 13 15:33:17 jon kernel: atkbd0: <AT Keyboard> irq 1 on atkbdc0
Jun 13 15:33:17 jon kernel: kbd0 at atkbd0
Jun 13 15:33:17 jon kernel: atkbd0: [GIANT-LOCKED]
Jun 13 15:33:17 jon kernel: atkbd0: [ITHREAD]
Jun 13 15:33:17 jon kernel: unknown: <PNP0c01> can't assign resources
(memory)
Jun 13 15:33:17 jon kernel: psm0: <PS/2 Mouse> irq 12 on atkbdc0
Jun 13 15:33:17 jon kernel: psm0: [GIANT-LOCKED]
Jun 13 15:33:17 jon kernel: psm0: [ITHREAD]
Jun 13 15:33:17 jon kernel: psm0: model Generic PS/2 mouse, device ID 0
Jun 13 15:33:17 jon kernel: fdc0: <Enhanced floppy controller> at port
0x3f0-0x3f5,0x3f7 irq 6 drq 2 pnpid PNP0700 on isa0
Jun 13 15:33:17 jon kernel: fdc0: [FILTER]
Jun 13 15:33:17 jon kernel: fd0: <1440-KB 3.5" drive> on fdc0 drive 0
Jun 13 15:33:17 jon kernel: orm0: <ISA Option ROM> at iomem
0xc0000-0xc7fff pnpid ORM0000 on isa0
Jun 13 15:33:17 jon kernel: sc0: <System console> at flags 0x100 on isa0
Jun 13 15:33:17 jon kernel: sc0: VGA <16 virtual consoles, flags=0x300>
Jun 13 15:33:17 jon kernel: vga0: <Generic ISA VGA> at port 0x3c0-0x3df
iomem 0xa0000-0xbffff on isa0
Jun 13 15:33:17 jon kernel: ppc0: parallel port not found.
Jun 13 15:33:17 jon kernel: unknown: <PNP0c01> can't assign resources
(memory)
Jun 13 15:33:17 jon kernel: Timecounter "TSC" frequency 1992030730 Hz
quality 800
Jun 13 15:33:17 jon kernel: Timecounters tick every 1.000 msec
Jun 13 15:33:17 jon kernel: ad0: 14664MB <IBM DJNA-351520 J56OA30K> at
ata0-master UDMA66
Jun 13 15:33:17 jon kernel: ad2: 305245MB <WDC WD3200JB-00KFA0 08.05J08>
at ata1-master UDMA100
Jun 13 15:33:17 jon kernel: ad3: 238475MB <Seagate ST3250823A 3.03> at
ata1-slave UDMA100
Jun 13 15:33:17 jon kernel: ad4: 476940MB <SAMSUNG HD501LJ CR100-11> at
ata2-master UDMA100 SATA 3Gb/s
Jun 13 15:33:17 jon kernel: ad6: 1907729MB <WDC WD20EARS-07MVWB0
51.0AB51> at ata3-master UDMA100 SATA 3Gb/s
Jun 13 15:33:17 jon kernel: ad8: 476940MB <SAMSUNG HD501LJ CR100-11> at
ata4-master UDMA100 SATA 3Gb/s
Jun 13 15:33:17 jon kernel: ad10: 305245MB <WDC WD3200SD-01KNB0
08.05J08> at ata5-master UDMA100 SATA 1.5Gb/s
Jun 13 15:33:17 jon kernel: ad12: 1430799MB <SAMSUNG HD154UI 1AG01118>
at ata6-master UDMA100 SATA 3Gb/s
Jun 13 15:33:17 jon kernel: ad14: 476940MB <Seagate ST3500320AS SD04> at
ata7-master UDMA100 SATA 1.5Gb/s
Jun 13 15:33:17 jon kernel: ad16: 1907729MB <SAMSUNG HD203WI 1AN10002>
at ata8-master UDMA100 SATA 3Gb/s
Jun 13 15:33:17 jon kernel: ad18: 953869MB <SAMSUNG HD103UI 1AA01113> at
ata9-master UDMA100 SATA 3Gb/s
Jun 13 15:33:17 jon kernel: GEOM: ad2s1: geometry does not match label
(16h,63s != 1h,63s).
Jun 13 15:33:17 jon kernel: GEOM: ad3s1: geometry does not match label
(16h,63s != 1h,63s).
Jun 13 15:33:17 jon kernel: Trying to mount root from ufs:/dev/ad0s1a
Jun 13 15:33:17 jon kernel: WARNING: / was not properly dismounted
Jun 13 15:33:17 jon kernel: WARNING: /usr was not properly dismounted
Jun 13 15:33:17 jon kernel: WARNING: /var was not properly dismounted
Jun 13 15:33:17 jon kernel: /var: mount pending error: blocks 8 files 2
Jun 13 15:33:18 jon kernel: fxp0: link state changed to UP
Jun 13 15:33:19 jon kernel: fxp0: link state changed to DOWN
Jun 13 15:33:21 jon kernel: fxp0: link state changed to UP
Jun 13 15:33:22 jon dhclient: New IP Address (fxp0): 192.168.100.9
Jun 13 15:33:22 jon kernel: fxp0: link state changed to DOWN
Jun 13 15:33:22 jon dhclient: New Subnet Mask (fxp0): 255.255.255.0
Jun 13 15:33:22 jon dhclient: New Broadcast Address (fxp0): 192.168.100.255
Jun 13 15:33:22 jon dhclient: New Routers (fxp0): 192.168.100.250
Jun 13 15:33:24 jon kernel: fxp0: link state changed to UP
Jun 13 15:38:07 jon fsck: /dev/ad0s1e: 285849 files, 1607328 used,
2887803 free (117739 frags, 346258 blocks, 2.6% fragmentation)
[after that only messages from fsck]