Alan Romeril
2006-Feb-07  20:30 UTC
[zfs-discuss] ZFS panic after falling over a motherboard bug
Hi All,
      I have an ASUS A8N Deluxe motherboard with 4GB RAM and 8 SATA drives
connected, and after messing around with replacing fans I decided to swap the
CPU with the one in my desktop.  The processor I took out was an AMD64 3500+
and the one that went in was an AMD64 4200+, a newer-stepping CPU that enabled
the H/W DRAM Over 4G Remapping setting in the BIOS.  The disks probably went
back in the wrong order, but the raidz zpool imported correctly :-
bash-3.00# zpool status -v
  pool: raidpool
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        raidpool    ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c6d0s0  ONLINE       0     0     0
            c7d0s0  ONLINE       0     0     0
            c7d1s0  ONLINE       0     0     0
            c6d1s0  ONLINE       0     0     0
bash-3.00# df -k /raidpool
Filesystem            kbytes    used   avail capacity  Mounted on
raidpool             2313814016 40152697 2273660169     2%    /raidpool
*BUT* the machine started panicking fairly soon after.  The panics were in the
ufs module, the root filesystem ended up totally corrupt after an fsck, and
GRUB wouldn't boot.  Booting from CD, I found that files in /tmp would not
cksum consistently; in fact the cksums changed every time cksum was run on an
unchanging file.
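A loop along these lines shows the effect (a minimal sketch; the file name and
size are arbitrary, the point is only that the checksum of an unchanging file
should never vary between runs):

# create a test file once and leave it alone
mkfile 64m /tmp/testfile
# on healthy hardware this prints the same checksum five times;
# on the broken configuration the values kept changing
for i in 1 2 3 4 5; do cksum /tmp/testfile; done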
Long and short of it, disabling the H/W DRAM Over 4G Remapping setting allowed
me to reinstall the box, and everything now behaves as it should.
Except the zpool....
Again, the filesystem mounts fine, and I can read from it without a problem. 
But writing to it causes a panic every time :-
bash-3.00# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace uppc pcplusmp ufs ip sctp
usba fctl s1394 nca lofs zfs random nfs fcip cpc ptm sppp
]
> ::status
debugging crash dump vmcore.0 (64-bit) from sol
operating system: 5.11 b31p (i86pc)
panic message:
assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file:
../../common/fs/zfs/dnode_sync.c, line: 372
dump content: kernel pages only
> *panic_thread::findstack -v
stack pointer for thread fffffe8000872c80: fffffe8000872660
  fffffe8000872750 panic+0x9e()
  fffffe80008727e0 assfail3+0xbe(fffffffff0560500, 95a, fffffffff05604f8, 0,
  fffffffff05604d0, 174)
  fffffe8000872840 dnode_sync_free+0x2f6(ffffffff8aa15a08, fffffe81cb26c180)
  fffffe80008728f0 dnode_sync+0x49e(ffffffff8aa15a08, 0, ffffffff8aa09880,
  fffffe81cb26c180)
  fffffe8000872960 dmu_objset_sync_dnodes+0xb0(ffffffff82af4100,
  ffffffff82af4260, fffffe81cb26c180)
  fffffe80008729b0 dmu_objset_sync+0xf6(ffffffff82af4100, fffffe81cb26c180)
  fffffe80008729e0 dsl_dataset_sync+0x59(ffffffff8abd3800, fffffe81cb26c180)
  fffffe8000872b50 dsl_pool_sync+0xa3(ffffffff89695c00, 4534)
  fffffe8000872bd0 spa_sync+0x100(ffffffff87a88500, 4534)
  fffffe8000872c60 txg_sync_thread+0x230(ffffffff89695c00)
  fffffe8000872c70 thread_start+8()
> ::spa -ve
ADDR                 STATE NAME
ffffffff87a88500    ACTIVE raidpool
    ADDR             STATE     AUX          DESCRIPTION
    ffffffff82dfc6c0 HEALTHY   -            root
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS               0            0            0            0            0
        BYTES             0            0            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff82dfcc40 HEALTHY   -              raidz
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS            0x27         0x4a            0            0            0
        BYTES       0x64200      0xdd000            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff82dfd1c0 HEALTHY   -                /dev/dsk/c5d0s0
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS            0x13         0x50            0            0            0
        BYTES      0x15c400      0x93600            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff82e07040 HEALTHY   -                /dev/dsk/c4d0s0
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS            0x17         0x50            0            0            0
        BYTES      0x18c000      0x91e00            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff87a81780 HEALTHY   -                /dev/dsk/c3d0s0
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS             0xf         0x50            0            0            0
        BYTES      0x12c000      0x94200            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff82e00700 HEALTHY   -                /dev/dsk/c2d0s0
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS            0x13         0x4e            0            0            0
        BYTES      0x16c000      0x93200            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff87a81200 HEALTHY   -                /dev/dsk/c6d0s0
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS            0x10         0x58            0            0            0
        BYTES      0x13c000      0x95800            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff87a80c80 HEALTHY   -                /dev/dsk/c7d0s0
                       READ        WRITE         FREE        CLAIM        IOCTL 
        OPS            0x14         0x58            0            0            0
        BYTES      0x16ca00      0x93400            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff87a80700 HEALTHY   -                /dev/dsk/c7d1s0
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS            0x13         0x58            0            0            0
        BYTES      0x15ca00      0x93c00            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
    ffffffff87a80180 HEALTHY   -                /dev/dsk/c6d1s0
                       READ        WRITE         FREE        CLAIM        IOCTL
        OPS            0x14         0x4e            0            0            0
        BYTES      0x16ca00      0x90800            0            0            0
        EREAD             0
        EWRITE            0
        ECKSUM            0
>
bash-3.00# zdb -bbc raidpool
Traversing all blocks to verify checksums and verify nothing leaked ...
leaked space: vdev 0, offset 0x3409aecc00, size 19456
leaked space: vdev 0, offset 0x3409ae6c00, size 19456
leaked space: vdev 0, offset 0x3409b43800, size 38912
leaked space: vdev 0, offset 0x3409b15000, size 155648
leaked space: vdev 0, offset 0x340ecab800, size 77824
leaked space: vdev 0, offset 0x340ed0ec00, size 77824
leaked space: vdev 0, offset 0x340ed30c00, size 214016
leaked space: vdev 0, offset 0x340ecd9c00, size 194560
leaked space: vdev 0, offset 0x3409b4f800, size 97280
leaked space: vdev 0, offset 0x700ed12800, size 155648
leaked space: vdev 0, offset 0x700ed54400, size 116736
leaked space: vdev 0, offset 0x700ed3d800, size 58368
block traversal size 41116431360 != alloc 41117657088 (leaked 1225728)
        bp count:          277589
        bp logical:    36057796096       avg: 129896
        bp physical:   35801011200       avg: 128971    compression:   1.01
        bp allocated:  41116431360       avg: 148119    compression:   0.88
        SPA allocated: 41117657088      used:  1.72%
Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     9  96.0K   96.0K    115K   12.8K    1.00     0.00  deferred free
     1    512     512      1K      1K    1.00     0.00  object directory
     3  1.50K   1.50K   3.00K      1K    1.00     0.00  object array
     1    16K     16K   19.0K   19.0K    1.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     1    16K     16K   19.0K   19.0K    1.00     0.00  bplist
     -      -       -       -       -       -        -  bplist header
     -      -       -       -       -       -        -  SPA space map header
   188   776K    776K    968K   5.15K    1.00     0.00  SPA space map
     -      -       -       -       -       -        -  ZIL intent log
    39   624K    624K    741K   19.0K    1.00     0.00  DMU dnode
     2     2K      2K      4K      2K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     2     1K      1K      2K      1K    1.00     0.00  DSL directory child map
     1    512     512      1K      1K    1.00     0.00  DSL dataset snap map
     2     1K      1K      2K      1K    1.00     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS ACL
  271K  33.6G   33.3G   38.3G    145K    1.01    99.98  ZFS plain file
    73  3.74M   3.74M   4.32M   60.6K    1.00     0.01  ZFS directory
     1    512     512      1K      1K    1.00     0.00  ZFS master node
     1    512     512      1K      1K    1.00     0.00  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
  271K  33.6G   33.3G   38.3G    145K    1.01   100.00  Total
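As a quick sanity check on the zdb output above (plain arithmetic, nothing
beyond the numbers already printed), the twelve leaked-space entries add up
exactly to the reported leak, which is the difference between the SPA
allocated figure and the block traversal size:

# sum of the twelve "leaked space" sizes -> 1225728
echo $((19456+19456+38912+155648+77824+77824+214016+194560+97280+155648+116736+58368))
# SPA allocated minus block traversal size -> 1225728
echo $((41117657088-41116431360))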
Looks like an interesting problem ;)
Anyone have any thoughts?
Thanks,
Alan
Eric Schrock
2006-Feb-07  20:54 UTC
[zfs-discuss] ZFS panic after falling over a motherboard bug
On Tue, Feb 07, 2006 at 12:30:36PM -0800, Alan Romeril wrote:
> > ::status
> debugging crash dump vmcore.0 (64-bit) from sol
> operating system: 5.11 b31p (i86pc)
> panic message:
> assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file:
> ../../common/fs/zfs/dnode_sync.c, line: 372
> dump content: kernel pages only

This looks like:

6357736 Panic dn->dn_phys->dn_secphys == 0 (0x60 == 0x0), file:
        ../../common/fs/zfs/dnode_sync.c, line: 372

Which is another symptom of:

6357699 Panic assertion failed: db->db_blkptr != 0

However, this was supposed to have been fixed in build 31.  Is that what
you're running?  What is "b31p"?  Since the line number is still 372, I'm
guessing that you're not really running build 31, since this assert moved down
to line 374 in that build.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
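A quick way to check which sources a given kernel was actually built from (a
sketch, assuming the build workspace is still on disk) is to look up the
assertion in that tree and compare its line number with the one in the panic
message:

# the line number grep reports should match the one printed in the panic
grep -n dn_secphys usr/src/uts/common/fs/zfs/dnode_sync.c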
Alan Romeril
2006-Feb-07  22:16 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Ah yes, heh, this is built from opensolaris-src-20060102.tar which is b30 I
think.  The only addition I made to this build was in
usr/src/uts/i86pc/io/pci/pci_boot.c line 375, changing :-

if (subcl != PCI_MASS_OTHER && subcl != PCI_MASS_SATA)

to

if (subcl != PCI_MASS_OTHER && subcl != PCI_MASS_SATA &&
    subcl != PCI_MASS_RAID)

to get my 3114 SATA RAID controller to work in JBOD mode, should have named it
b30p :)  Okay, sorry for the false alarm, I'll wait for the buildable b33
sources and try again.

Thanks Eric :)

Alan
Alan Romeril
2006-Feb-07  23:39 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
I've just bfu'ed with the 20060201 archives and dropped in a normal Sil 3114
PCI card so I don't have to patch anything to have the 8 disks connected.  I
get the same panic string as before!

bash-3.00# mdb -k vmcore.0 unix.0
mdb: vmcore.0 is not an ELF file
bash-3.00# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs uppc pcplusmp ufs ip sctp usba
s1394 nca lofs zfs random nfs sppp ptm ]
> ::status
debugging crash dump vmcore.0 (64-bit) from sol
operating system: 5.11 opensol-20060201 (i86pc)
panic message:
assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file:
../../common/fs/zfs/dnode_sync.c, line: 374
dump content: kernel pages only

Some of the disks have renumbered:

AVAILABLE DISK SELECTIONS:
       0. c0d1 <DEFAULT cyl 48449 alt 2 hd 128 sec 63>
          /pci@0,0/pci-ide@6/ide@0/cmdk@1,0
       1. c2d0 <ST330083- 3NF10JK-0001-279.45GB>
          /pci@0,0/pci-ide@7/ide@0/cmdk@0,0
       2. c3d0 <ST330083- 3NF0ZLA-0001-279.45GB>
          /pci@0,0/pci-ide@7/ide@1/cmdk@0,0
       3. c4d0 <ST330083- 3NF1379-0001-279.45GB>
          /pci@0,0/pci-ide@8/ide@0/cmdk@0,0
       4. c5d0 <ST330083- 3NF0QZS-0001-279.45GB>
          /pci@0,0/pci-ide@8/ide@1/cmdk@0,0
       5. c8d0 <ST330083- 3NF13FA-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@0/cmdk@0,0
       6. c8d1 <ST330083- 3NF12V5-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@0/cmdk@1,0
       7. c9d0 <ST330083- 3NF13EX-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@1/cmdk@0,0
       8. c9d1 <ST330083- 3NF12CF-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@1/cmdk@1,0

But the pool did import :-

bash-3.00# zpool status
  pool: raidpool
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        raidpool    ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c8d0s0  ONLINE       0     0     0
            c9d0s0  ONLINE       0     0     0
            c9d1s0  ONLINE       0     0     0
            c8d1s0  ONLINE       0     0     0

Maybe a different issue?

Cheers,
Alan
Per Öberg
2006-Feb-08  08:17 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Could this be related to my problems described in this thread?

http://www.opensolaris.org/jive/thread.jspa?threadID=5446&tstart=0

/Per

(I'll wait until SXCR 33 to try and install again..)
Alan Romeril
2006-Feb-08  21:02 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Hi Per,
     I don't think so.  I've reverted to the 1013 version of the BIOS, which I
know is stable on this hardware, and I still get the panic.  However, I have
narrowed it down: there seems to be one file that causes the panic, which was
the one I was writing to when the box first bailed.
I had run a mkfile 1g test in the root of the pool when it panicked, and if I
try to delete this file I get the panic exactly as before.
panic message:
assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file:
../../common/fs/zfs/dnode_sync.c, line: 374
But the rest of the pool looks fine; I've ftp'ed a few isos to the pool and
they cksum correctly, with no stability issues from the pool there.
rm'ing the file test panics the machine every time.
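For anyone who wants to look at the damaged object without triggering the
panic, something along these lines should work (a sketch only; the path to the
suspect file is an example, and the amount of detail zdb prints varies by
build):

# a file's ZFS object number is its inode number
ls -i /raidpool/test
# dump that object's dnode, including the block pointers and the space
# accounting that the failed assertion is complaining about
zdb -dddd raidpool <object-number>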
Cheers,
Alan
Alan Romeril
2006-Feb-08  22:55 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
I tried a zpool scrub and one cksum problem was found.
But it still panics on the rm....
I might have to copy the files off and rebuild this pool at this rate...
Alan
bash-3.00# zpool status
  pool: raidpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Wed Feb  8 21:32:53 2006
config:
        NAME        STATE     READ WRITE CKSUM
        raidpool    ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c8d0s0  ONLINE       0     0     0
            c9d0s0  ONLINE       0     0     0
            c9d1s0  ONLINE       0     0     0
            c8d1s0  ONLINE       0     0     1  18.5K repaired
Matthew A. Ahrens
2006-Feb-09  00:04 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
> I've just bfu'ed with the 20060201 archives and
> dropped in a normal Sil 3114 pci card so I don't have
> to patch anything to have the 8 disks connected.  I
> get the same panic string as before!

Unfortunately, this is expected.  The bug you encountered messes up the
on-disk accounting, so you will need to destroy this pool to eliminate the
problem.

--matt
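A rough sketch of what that rebuild might look like, assuming the
still-readable data is first copied to some other filesystem with enough room
(/backup is just a placeholder) and that the same eight slices are reused; the
device names are taken from the earlier zpool status output and may change
again after a reboot:

# the pool still reads fine, so copy its contents somewhere safe first
mkdir -p /backup/raidpool-copy
cp -rp /raidpool/. /backup/raidpool-copy/
# destroy the damaged pool; this discards the corrupted on-disk accounting
zpool destroy raidpool
# recreate the raidz from the same slices and copy the data back
zpool create raidpool raidz c5d0s0 c4d0s0 c3d0s0 c2d0s0 \
    c8d0s0 c9d0s0 c9d1s0 c8d1s0
cp -rp /backup/raidpool-copy/. /raidpool/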
Per Öberg
2006-Feb-09  09:04 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Hi Alan,

Thanks for your reply.  I'll revert to BIOS 1012.006 (I can't find your
version - 1013 - on the ASUS homepage..) and give it a shot when I can get
hold of SXCR 33 or later, which seems to cure the disk-related bugs mentioned.
I haven't got time for OpenSolaris at the moment.

/Per