Matthias Andree
2005-Jan-19 05:49 UTC
4.11-RC3: SCSI+UFS+softupdates corruption (write cache DISABLED!)
Hi,

I had a FreeBSD 4.11-RC3 machine reboot without advance notice. The last entries the network syslogd captured were attempted ahc0 (Adaptec 2940 UW Pro) recovery. Below is the syslog excerpt as captured by the remote machine, with the date, "hostname /kernel:" and card state dumps removed (these can be provided if necessary). I wonder if the SCSI error recovery attempts caused the reboot; I have no hints either way, but this machine is otherwise stable.

13:28:35 ahc0: Recovery Initiated
13:28:53 (da0:ahc0:0:0:0): SCB 0x16 - timed out
13:28:53 sg[0] - Addr 0x6da3800 : Length 2048
13:28:53 (da0:ahc0:0:0:0): Other SCB Timeout
13:28:53 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:28:53 ahc0: Recovery Initiated
13:29:02 (da0:ahc0:0:0:0): SCB 0x1b - timed out
13:29:04 (da0:ahc0:0:0:0): BDR message in message buffer
13:29:04 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:29:04 ahc0: Recovery Initiated
13:29:16 Kernel Free SCB list: 9 4 15 20
13:29:17 sg[7] - Addr 0x3bea000 : Length 4096
13:29:18 ahc0: Issued Channel A Bus Reset. 25 SCBs aborted

When the machine came back up, it remained in single-user mode because fsck reported a softupdates inconsistency:

| # fsck -p /usr
| /dev/da0s1g: DIRECTORY CORRUPTED I=175105 OWNER=root MODE=40755
| /dev/da0s1g: SIZE=512 MTIME=Jan 18 15:14 2005
| /dev/da0s1g: DIR=?
|
| /dev/da0s1g: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.

I have not yet run fsck for interactive repair, because I want to know what is going on here and to allow debugging this.

At the time of the crash, these tasks were running:

1. amanda was running a dump(8)
2. I was installing manpages from /usr/src/share/man/man4
3. a cvsup for the ports tree was running (this is likely related to the problem)

| # fsdb -r /dev/da0s1g
| fsdb (inum: 2)> inode 175105
| current inode: directory
| I=175105 MODE=40755 SIZE=512
| MTIME=Jan 18 15:14:48 2005 [0 nsec]
| CTIME=Jan 18 15:14:48 2005 [0 nsec]
| ATIME=Jun 19 03:05:43 2003 [0 nsec]
| OWNER=root GRP=wheel LINKCNT=2 FLAGS=0 BLKCNT=4 GEN=4e5151f9
| fsdb (inum: 175105)> cd ..
| component `..': fsdb: name `..' not found in current inode directory

I checked with camcontrol: the write cache is off (see below), but the queue algorithm modifier is on and cannot be switched off.

Digging through the old structures with find reveals:

| 175101 4 drwxr-xr-x 3 root wheel 512 Sep 1 2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd
| 175102 4 drwxr-xr-x 2 root wheel 512 Sep 1 2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd/auto
| 175103 4 drwxr-xr-x 5 root wheel 512 Aug 23 2002 /usr/sup
| 175104 4 drwxr-xr-x 2 root wheel 512 Jan 19 13:29 /usr/sup/src-all
> 175105 4 drwxr-xr-x 2 root wheel 512 Jan 18 15:14 /usr/sup/ports-all
| 175106 4 drwxr-xr-x 2 root wheel 512 Jan 18 15:14 /usr/sup/doc-all
| 175107 4 drwxr-xr-x 22 root wheel 1024 Sep 28 19:47 /usr/doc
| 175108 4 drwxr-xr-x 6 root wheel 512 Dec 19 13:26 /usr/doc/de_DE.ISO8859-1
| 175109 4 drwxr-xr-x 5 root wheel 512 Dec 27 2003 /usr/doc/de_DE.ISO8859-1/books

And, as expected:

| # ls -la /usr/sup/ports-all/
| #

Why can a softupdates file system, under such circumstances, become so corrupt that fsck -p cannot fix it, and end up with directories without . and ..? Is this a kernel/softupdates bug? How can this directory become empty?

locate has this information recorded:

/usr/sup/ports-all
/usr/sup/ports-all/#cvs.cvsup-2279.0
/usr/sup/ports-all/checkouts.cvs:.

so apparently three (checkouts.cvs:., . and ..) or four files (perhaps the # file) have disappeared. I'm not sure if fsck will revive them, and I want to avoid destroying data useful for debugging.

Is the Queue Algorithm Modifier a problem? (see below) I cannot set this to 0 on this drive (Micropolis 4345WS): I get "camcontrol: error sending mode select command" with both -P0 and -P3.

How do I go about providing the file system metadata so someone can take a look at it? The file system is 3.5 G in size, so anything that goes beyond metadata is not feasible. Providing SSH access to the failed machine may work, though, if I'm sent your OpenSSH v2-format key.

# camcontrol inquiry da0
pass0: <MICROP 4345WS x43h> Fixed Direct Access SCSI-2 device
pass0: Serial Number 77HT45XXXX
pass0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled

# camcontrol modepage da0 -m8
IC: 0
ABPF: 0
CAP: 0
DISC: 0
SIZE: 0
WCE: 0
MF: 0
RCD: 0
...

# camcontrol modepage da0 -m10
RLEC: 0
Queue Algorithm Modifier: 1
QErr: 0
DQue: 0
...

-- 
Matthias Andree
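
[Editorial sketch, not part of the original message.] The two mode page settings discussed above (WCE on page 8, Queue Algorithm Modifier on page 10) can be checked mechanically from saved camcontrol output. The following is a minimal Python sketch under stated assumptions: `parse_modepage` is a hypothetical helper, and the only input format assumed is the "NAME: value" layout quoted in this message. The sample text is copied from the report; elided lines ("...") are skipped rather than guessed.

```python
def parse_modepage(text):
    """Parse 'NAME: value' lines into a dict (hypothetical helper).

    Lines without a colon, such as the '...' elision, are ignored.
    Values that are not integers are kept as plain strings.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        try:
            fields[name.strip()] = int(value, 0)
        except ValueError:
            fields[name.strip()] = value
    return fields


# Output quoted in the report above (caching page 8, control page 10).
CACHE_PAGE = """IC: 0
ABPF: 0
CAP: 0
DISC: 0
SIZE: 0
WCE: 0
MF: 0
RCD: 0
..."""

CONTROL_PAGE = """RLEC: 0
Queue Algorithm Modifier: 1
QErr: 0
DQue: 0
..."""

cache = parse_modepage(CACHE_PAGE)
control = parse_modepage(CONTROL_PAGE)

write_cache_enabled = cache["WCE"] != 0
queue_modifier = control["Queue Algorithm Modifier"]

print("write cache enabled:", write_cache_enabled)
print("queue algorithm modifier:", queue_modifier)
```

On a live system the text would come from running the two camcontrol commands shown in the message; the sketch deliberately works on saved output so it does not depend on the hardware.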
Matthias Andree
2005-Jan-19 07:08 UTC
4.11-RC3: SCSI+UFS+softupdates corruption (write cache DISABLED!)
Matthias Andree <matthias.andree@gmx.de> writes:

> so apparently, three (checkouts.cvs:., . and ..) or four files (perhaps
> the # file) have disappeared. I'm not sure if fsck will revive them, I
> want to avoid destroying data useful for debugging.

OK, I dd'd the whole partition to an SLR tape and ran fsck for interactive repairs.

| ** /dev/da0s1g
| ** Last Mounted on /usr
| ** Phase 1 - Check Blocks and Sizes
| ** Phase 2 - Check Pathnames
| DIRECTORY CORRUPTED I=175105 OWNER=root MODE=40755
| SIZE=512 MTIME=Jan 18 15:14 2005
| DIR=?
|
| UNEXPECTED SOFT UPDATE INCONSISTENCY
|
| SALVAGE? [yn] y
|
| MISSING '.' I=175105 OWNER=root MODE=40755
| SIZE=512 MTIME=Jan 18 15:14 2005
| DIR=?
|
| UNEXPECTED SOFT UPDATE INCONSISTENCY
|
| FIX? [yn] y
|
| MISSING '..' I=175105 OWNER=root MODE=40755
| SIZE=512 MTIME=Jan 18 15:14 2005
| DIR=/sup/ports-all
|
| UNEXPECTED SOFT UPDATE INCONSISTENCY
|
| FIX? [yn] y
|
| ** Phase 3 - Check Connectivity
| ** Phase 4 - Check Reference Counts
| UNREF FILE I=176801 OWNER=root MODE=100644
| SIZE=14098161 MTIME=Jan 18 15:14 2005
| RECONNECT? [yn] y
|
| NO lost+found DIRECTORY
| CREATE? [yn] y
|
| UNREF FILE I=179558 OWNER=root MODE=100644
| SIZE=8327913 MTIME=Mar 20 03:11 2004
| RECONNECT? [yn] y
|
| ** Phase 5 - Check Cyl groups
| FREE BLK COUNT(S) WRONG IN SUPERBLK
| SALVAGE? [yn] y
|
| SUMMARY INFORMATION BAD
| SALVAGE? [yn] y
|
| BLK(S) MISSING IN BIT MAPS
| SALVAGE? [yn] y
|
| 243085 files, 1465923 used, 274252 free (102444 frags, 21476 blocks, 5.9% fragmentation)
|
| ***** FILE SYSTEM MARKED CLEAN *****
|
| ***** FILE SYSTEM WAS MODIFIED *****

Turns out the missing two files ended up in lost+found.

Is this a failure mode that is allowed to happen for softupdates?

-- 
Matthias Andree
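
[Editorial sketch, not part of the original message.] As a sanity check on the repaired file system, the numbers in the fsck summary line can be cross-checked against each other. This is a small Python sketch under two assumptions that the thread does not state: that the file system uses 8 fragments per block, and that the fragmentation percentage is the free fragment count divided by used plus free. Both assumptions happen to reproduce the figures fsck printed.

```python
# Figures copied from the fsck summary line above.
used = 1465923
free = 274252
free_frags = 102444
free_blocks = 21476

# Assumption: 8 fragments per block (not stated in the thread).
FRAGS_PER_BLOCK = 8

# Free space expressed in fragments: loose frags plus whole free blocks.
total_free = free_frags + free_blocks * FRAGS_PER_BLOCK

# Assumption: fragmentation = free frags as a share of used + free.
fragmentation = 100.0 * free_frags / (used + free)

print("free count consistent:", total_free == free)
print("fragmentation: %.1f%%" % fragmentation)
```

With these inputs the free count adds up exactly (102444 + 21476 * 8 = 274252) and the percentage rounds to the 5.9% shown by fsck, which suggests the summary information was rebuilt consistently.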