On 18 Feb, Matthias Andree wrote:> Hi,
>
> I have seen a SCSI timeout that preceded massive file system corruption
> some days ago. As this happened with the drive's write cache turned off
> and with ffs softupdates, I think it's worth reporting.
>
> Hardware: Micropolis 4345WS (UWSCSI disk drive, 7200/min, device da0)
> attached to an Adaptec 2940UW Pro (ahc0, aic7880 chip); on the same bus
> there's a Yamaha CRW4416S CD writer (current firmware) and a Plextor
> PX-20TS CD-ROM (firmware 1.00).
>
> This machine had a SCSI timeout problem on Friday Feb 6th and went down
> hard, suffering massive file system corruption on /var. At that time,
> the machine was running portupgrade -a. /var is using softupdates and
> uses default mount options. As said before, the drive's FWC enable was
> set to 0 in both the current and saved editions of mode page 8, and I
> wonder how such massive corruption can happen. I was under the
> impression that softupdates prevented any on-disk corruptions that
> require user intervention at fsck time. Given that the write cache was
> off, I am wondering if there are any ffs+softupdates or tagged command
> queueing bugs left (that might reorder writes - ordered tag forgotten or
> something).
The UNKNOWN FILE TYPE complains are a pretty good clue that a block
containing inodes got overwritten by garbage. I've seen this sort of
thing happen if power to a drive fails. It could also be caused by a
driver or firmware bug that causes data to get written to the wrong
place, or a cabling or termination problem that causes the drive to see
the wrong command.
> Here's the log, first the SCSI crash, then, on Feb 7, then excerpts
from
> the reboot including fsck. / and /usr also suffered minor file system
> inconsistencies that "fsck -p" was able to fix automatically.
>
> Feb 6 20:10:20 libertas /kernel: swap_pager: indefinite wait buffer:
device: #da/0x20001, blkno: 4296, size: 24576
> Feb 6 20:10:41 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x0 - timed out
> Feb 6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
> Feb 6 20:13:12 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out
> Feb 6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message
buffer
> Feb 6 20:13:21 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x21 - timed out
> Feb 6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout,
status = 34b
> Feb 6 20:13:30 libertas /kernel: ahc0: Issued Channel A Bus Reset. 12 SCBs
aborted
> Feb 6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer:
device: #da/0x20001, blkno: 4296, size: 24576
> Feb 6 20:13:30 libertas /kernel: swap_pager: indefinite wait buffer:
device: #da/0x20001, blkno: 2056, size: 8192
> Feb 6 20:13:30 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x20 - timed out
> Feb 6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
> Feb 6 20:13:37 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb 6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message
buffer
> Feb 6 20:13:44 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb 6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout,
status = 34b
> Feb 6 20:13:51 libertas /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs
aborted
> Feb 6 20:13:51 libertas /kernel: (da0:ahc0:0:0:0): SCB 0x22 - timed out
> Feb 6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): Other SCB Timeout
> Feb 6 20:13:58 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb 6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): BDR message in message
buffer
> Feb 6 20:14:05 libertas /kernel: (da0:ahc0:0:0:0): SCB 0xa - timed out
> Feb 6 20:14:14 libertas /kernel: (da0:ahc0:0:0:0): no longer in timeout,
status = 34b
> Feb 6 20:14:14 libertas /kernel: ahc0: Issued Channel A Bus Reset. 4 SCBs
aborted
> Feb 6 20:15:01 libertas /usr/sbin/cron[58193]: (root) CMD
(/usr/libexec/atrun)
> Feb 6 20:20:00 libertas /usr/sbin/cron[64038]: (root) CMD
(/usr/libexec/atrun)
>
> Feb 7 12:37:03 libertas /kernel: ahc0: <Adaptec 2940 Pro Ultra SCSI
adapter> port 0xe800-0xe8ff mem 0xe3020000-0xe3020fff irq 15 at devi
> ce 8.0 on pci0
> Feb 7 12:37:03 libertas /kernel: aic7880: Ultra Wide Channel A, SCSI Id=7,
16/253 SCBs
>
> Feb 7 12:37:04 libertas /kernel: da0 at ahc0 bus 0 target 0 lun 0
> Feb 7 12:37:04 libertas /kernel: da0: <MICROP 4345WS x43h> Fixed
Direct Access SCSI-2 device
> Feb 7 12:37:04 libertas /kernel: da0: 40.000MB/s transfers (20.000MHz,
offset 8, 16bit), Tagged Queueing Enabled
> Feb 7 12:37:04 libertas /kernel: da0: 4340MB (8890029 512 byte sectors:
255H 63S/T 553C)
>
> Feb 7 12:37:04 libertas /kernel: /dev/da0s1e:
> Feb 7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel: /dev/da0s1e: UNEXPECTED SOFT UPDATE
INCONSISTENCY; RUN fsck MANUALLY.
> Feb 7 12:37:04 libertas /kernel: THE FOLLOWING FILE SYSTEM HAD AN
UNEXPECTED INCONSISTENCY:
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel: /dev/da0s1e (/var)
> Feb 7 12:37:04 libertas /kernel: Automatic file system check failed . . .
help!
> Feb 7 12:37:04 libertas /kernel: Enter full pathname of shell or RETURN
for /bin/sh:
> Feb 7 12:37:04 libertas /kernel: #
>
> Entered shell and typed "fsck":
>
> Feb 7 12:37:04 libertas /kernel: ** /dev/da0s1e
> Feb 7 12:37:04 libertas /kernel: ** Last Mounted on /var
> Feb 7 12:37:04 libertas /kernel: ** Phase 1 - Check Blocks and Sizes
>
> Feb 7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=256
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel: CLEAR? [yn]
> Feb 7 12:37:04 libertas /kernel:
>
> Last 6 lines repeat with "I=" figures monotonically increasing,
finally:
>
> Feb 7 12:37:04 libertas /kernel: UNKNOWN FILE TYPE I=264
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel: UNEXPECTED SOFT UPDATE INCONSISTENCY
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel: CLEAR? [yn]
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel:
> Feb 7 12:37:04 libertas /kernel: UNREF FILE
>
> The log ends here, but isn't complete, I have agreed to reconnect
dozens
> of files to lost+found after fsck asked, many but not all of them from
> /var/db/pkg.
>