Karl Denninger
2005-Mar-29 18:14 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
WARNING!
FreeBSD 5.4-PRERELEASE #6: Tue Mar 29 12:44:22 CST 2005
karl@FS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP
Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-PRERELEASE #6: Tue Mar 29 12:44:22 CST 2005
karl@FS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP
ACPI APIC Table: <DELL PE400SC>
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz (2394.01-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf29 Stepping = 9
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Hyperthreading: 2 logical CPUs
real memory = 267862016 (255 MB)
avail memory = 252456960 (240 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
ioapic0: Changing APIC ID to 2
ioapic0 <Version 2.0> irqs 0-23 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <DELL PE400SC> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <Intel 82875P host to AGP bridge> mem 0xe8000000-0xefffffff at
device 0.0 on pci0
pcib1: <PCI-PCI bridge> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
pci1: <display, VGA> at device 0.0 (no driver attached)
uhci0: <Intel 82801EB (ICH5) USB controller USB-A> port 0xff80-0xff9f irq
16 at device 29.0 on pci0
usb0: <Intel 82801EB (ICH5) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801EB (ICH5) USB controller USB-B> port 0xff60-0xff7f irq
19 at device 29.1 on pci0
usb1: <Intel 82801EB (ICH5) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801EB (ICH5) USB controller USB-C> port 0xff40-0xff5f irq
18 at device 29.2 on pci0
usb2: <Intel 82801EB (ICH5) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
uhci3: <Intel 82801EB (ICH5) USB controller USB-D> port 0xff20-0xff3f irq
16 at device 29.3 on pci0
usb3: <Intel 82801EB (ICH5) USB controller USB-D> on uhci3
usb3: USB revision 1.0
uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
pci0: <serial bus, USB> at device 29.7 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci2: <ACPI PCI bus> on pcib2
atapci0: <SiI 3112 SATA150 controller> port
0xcd70-0xcd7f,0xcd5c-0xcd5f,0xcd68-0xcd6f,0xcd58-0xcd5b,0xcd60-0xcd67 mem
0xfe7dee00-0xfe7defff irq 21 at device 0.0 on pci2
ata2: channel #0 on atapci0
ata3: channel #1 on atapci0
ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0xce00-0xceff mem
0xfe7df000-0xfe7dffff irq 22 at device 1.0 on pci2
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
rp0: <RocketPort PCI> port 0xcd80-0xcdbf irq 17 at device 2.0 on pci2
RocketPort0 (Version 3.02) 4 ports.
pcib3: <PCI-PCI bridge> at device 3.0 on pci2
pci3: <PCI bus> on pcib3
fxp0: <Intel 82558 Pro/100 Ethernet> port 0xbf80-0xbf9f mem
0xfe400000-0xfe4fffff,0xf8001000-0xf8001fff irq 19 at device 4.0 on pci3
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:d0:b7:6f:ce:e8
fxp1: <Intel 82558 Pro/100 Ethernet> port 0xbfe0-0xbfff mem
0xfe500000-0xfe5fffff,0xf8000000-0xf8000fff irq 18 at device 5.0 on pci3
miibus1: <MII bus> on fxp1
inphy1: <i82555 10/100 media interface> on miibus1
inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1: Ethernet address: 00:d0:b7:6f:ce:e9
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port
0xcdc0-0xcdff mem 0xfe7e0000-0xfe7fffff irq 18 at device 12.0 on pci2
em0: Ethernet address: 00:0c:f1:c9:df:c5
em0: Speed:N/A Duplex:N/A
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <Intel ICH5 UDMA100 controller> port
0xffa0-0xffaf,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 irq 18 at device 31.1 on pci0
ata0: channel #0 on atapci1
ata1: channel #1 on atapci1
atapci2: <Intel ICH5 SATA150 controller> port
0xfea0-0xfeaf,0xfe30-0xfe33,0xfe20-0xfe27,0xfe10-0xfe13,0xfe00-0xfe07 irq 18 at
device 31.2 on pci0
ata4: channel #0 on atapci2
ata5: channel #1 on atapci2
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
pci0: <multimedia, audio> at device 31.5 (no driver attached)
fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on
acpi0
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on
acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
orm0: <ISA Option ROMs> at iomem
0xd1800-0xd3fff,0xd0000-0xd17ff,0xcf800-0xcffff,0xcb000-0xcf7ff,0xc0000-0xcafff
on isa0
pmtimer0 on isa0
ppc0: parallel port not found.
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
RTC BIOS diagnostic error 18<memory_size,fixed_disk>
Timecounters tick every 10.000 msec
ipfw2 initialized, divert enabled, rule-based forwarding disabled, default to
deny, logging disabled
acd0: CDROM <Lite-On LTN486S 48x Max/YDS6> at ata1-master UDMA33
ad4: 238475MB <HDS722525VLSA80/V36OA63A> [484521/16/63] at ata2-master
SATA150
ad6: 239372MB <Maxtor 6B250S0/BANC1B70> [486344/16/63] at ata3-master
SATA150
em0: Link is up 100 Mbps Full Duplex
ad8: 239372MB <Maxtor 6B250S0/BANC1980> [486344/16/63] at ata4-master
SATA150
ad10: 238475MB <HDS722525VLSA80/V36OA63A> [484521/16/63] at ata5-master
SATA150
Waiting 15 seconds for SCSI devices to settle
GEOM_MIRROR: Device boot created (id=1131801609).
GEOM_MIRROR: Device boot: provider ad8s1 detected.
GEOM_MIRROR: Device boot: provider ad10s1 detected.
GEOM_MIRROR: Device boot: provider ad10s1 activated.
GEOM_MIRROR: Device boot: provider mirror/boot launched.
GEOM_MIRROR: Device boot: rebuilding provider ad8s1.
sa0 at ahc0 bus 0 target 2 lun 0
sa0: <DEC DLT2000 15/30 GB 840B> Removable Sequential Access SCSI-2 device
sa0: 5.000MB/s transfers (5.000MHz, offset 15)
SMP: AP CPU #1 Launched!
Mounting root from ufs:/dev/mirror/boota
WARNING: /dbms was not properly dismounted
WARNING: /disk was not properly dismounted
WARNING: /archive was not properly dismounted
em0: Link is up 100 Mbps Full Duplex
Built this afternoon.
This has a fix in the ATA code, to wit:
mdodd 2005-03-23 04:50:26 UTC
FreeBSD src repository
Modified files: (Branch: RELENG_5)
sys/dev/ata ata-queue.c
Log:
MFC
1.42: When resubmitting a timed out request, reset donecount.
1.41: Reset timeout when we are back from interrupt.
1.40: Correct logical error, result was that retries wasn't always made but
failure reported instead.
1.39: Do not retry on requests that have lost their device during reinit.
Approved by: re
Revision Changes Path
1.32.2.6 +10 -3 src/sys/dev/ata/ata-queue.c
This change is EXTREMELY DANGEROUS.
Previously, if your system took a RECOVERABLE DMA write error (as is
happening frequently on machines with SATA drives on the PCI bus!), you got
a drive disconnect from the mirror.
NOW you end up with a RADICALLY unstable machine. Specifically, interrupts
now get seriously screwed up, serial I/O on the machine stops working
immediately, and ultimately the machine crashes or hangs in VERY odd ways.
This change needs to be backed out immediately until it can be determined
why a requeued request destabilizes the system.
I have removed the ad4 and ad6 drives here (which are the ones on the PCI
bus) until this is addressed.
--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net My home on the net - links to everything I do!
http://scubaforum.org Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com Musings Of A Sentient Mind
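As a reading aid for the MFC log quoted in the message above, items 1.39, 1.40 and 1.42 can be pictured with a small userland model. This is only a sketch under stated assumptions: `struct toy_request`, its fields, and `toy_handle_timeout()` are hypothetical names invented for illustration, and nothing here is taken from the real `ata-queue.c`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified request; NOT the real FreeBSD ATA request. */
struct toy_request {
    int  bytecount;    /* total bytes the request must transfer */
    int  donecount;    /* bytes transferred so far */
    int  retries;      /* retries remaining */
    bool device_gone;  /* device was lost during reinit */
    bool failed;       /* final failure reported to the caller */
};

/*
 * Handle a timed-out request in the toy model.
 * Returns true if the request should be resubmitted,
 * false if it is finished and a failure has been reported.
 */
static bool toy_handle_timeout(struct toy_request *req)
{
    assert(req->donecount <= req->bytecount);

    /* cf. item 1.39: do not retry a request whose device is gone */
    if (req->device_gone) {
        req->failed = true;
        return false;
    }

    /* cf. item 1.40: retry while retries remain, report failure only after */
    if (req->retries > 0) {
        req->retries--;
        /* cf. item 1.42: a resubmitted request starts from zero again */
        req->donecount = 0;
        return true;
    }

    req->failed = true;
    return false;
}
```

In this model a request with one retry left is resubmitted once with `donecount` cleared, and the next timeout reports the failure. It says nothing about why the real change destabilized the reporter's machine.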
Matthew N. Dodd
2005-Mar-29 20:40 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
On Tue, 29 Mar 2005, Karl Denninger wrote:
> 1.42: When resubmitting a timed out request, reset donecount.
> 1.41: Reset timeout when we are back from interrupt.
> 1.40: Correct logical error, result was that retries wasn't always made but
> failure reported instead.
> 1.39: Do not retry on requests that have lost their device during reinit.
>
> This change is EXTREMELY DANGEROUS.
>
> This change needs to be backed out immediately until it can be determined
> why a requeued request destabilizes the system.
The changes in question are very small. Could you attempt to isolate which
one is the cause?
Thanks.
--
10 40 80 C0 00 FF FF FF FF C0 00 00 00 00 10 AA AA 03 00 00 00 08 00
Karl Denninger
2005-Mar-31 09:46 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)
On Thu, Mar 31, 2005 at 12:19:13PM -0500, Matthew N. Dodd wrote:
> On Thu, 31 Mar 2005, Karl Denninger wrote:
> > What do you expect the patch to do, given that removing the delta
> > appears to fix the instability problem?
>
> I expect the patch to properly stop the callout.
Ok, so the implication is that the callouts are/were being triggered and
dumped on the floor without it?
Would you prefer me to validate that path before or after I attempt to
validate that removing the first delta fixes the crashes on my production
machine?
(Reason for the question is that the latter will require most of the rest of
the day and evening, while the former can likely be done by 3-4 pm today)
Matthew N. Dodd
2005-Mar-31 11:20 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)
On Thu, 31 Mar 2005, Karl Denninger wrote:
> Would you prefer me to validate that path before or after I attempt to
> validate that removing the first delta fixes the crashes on my
> production machine? (Reason for the question is that the latter will
> require most of the rest of the day and evening, while the former can
> likely be done by 3-4 pm today)
Your previous email suggests that the callout stuff is to blame. Testing
the fix seems like a good plan unless you have any doubts about your
earlier results.
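The "stop the callout" concern in this exchange can be illustrated with a toy model of a timeout timer that is, or is not, cancelled when a request completes. This is an illustration only, not a diagnosis of the actual bug and not the patch under discussion: `struct toy_io` and the `toy_*` functions are hypothetical names, and a plain flag stands in for a kernel timer (in FreeBSD the real primitive for cancelling one is `callout_stop(9)`).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one I/O request with a timeout timer attached. */
struct toy_io {
    bool timer_armed;     /* a timeout is pending for this request */
    bool completed;       /* the request has finished */
    int  stale_timeouts;  /* timeouts that fired after completion */
};

/* Start the request and arm its timeout. */
static void toy_io_start(struct toy_io *io)
{
    io->completed = false;
    io->timer_armed = true;
}

/* Complete the request; stop_timer models cancelling the pending timeout. */
static void toy_io_complete(struct toy_io *io, bool stop_timer)
{
    assert(!io->completed);
    io->completed = true;
    if (stop_timer)
        io->timer_armed = false;
}

/* Model the timer expiring: it only fires if it is still armed. */
static void toy_timer_expire(struct toy_io *io)
{
    if (!io->timer_armed)
        return;
    io->timer_armed = false;
    if (io->completed)
        io->stale_timeouts++;  /* timeout ran against a finished request */
}
```

With `stop_timer` set, a later expiry is a no-op. With it clear, the timeout handler runs against a request that has already finished, which is the general kind of stray event a missing timer cancellation can produce.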
Karl Denninger
2005-Mar-31 11:57 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)
On Thu, Mar 31, 2005 at 02:16:11PM -0500, Matthew N. Dodd wrote:
> On Thu, 31 Mar 2005, Karl Denninger wrote:
> > Would you prefer me to validate that path before or after I attempt to
> > validate that removing the first delta fixes the crashes on my
> > production machine? (Reason for the question is that the latter will
> > require most of the rest of the day and evening, while the former can
> > likely be done by 3-4 pm today)
>
> Your previous email suggests that the callout stuff is to blame. Testing
> the fix seems like a good plan unless you have any doubts about your
> earlier results.
I cannot provide solid validation until I test it on the production machine
here.
The sandbox survived overnight with dozens of write errors, but whether the
production system will is not yet known.
Will test on the production system and advise.