Alan Romeril
2006-Feb-07 20:30 UTC
[zfs-discuss] ZFS panic after falling over a motherboard bug
Hi All,

I have an ASUS A8N Deluxe motherboard with 4GB RAM and 8 SATA drives connected, and after messing around with replacing fans I decided to swap the CPU with the one in my desktop. The processor that I took out was an AMD64 3500+ and the one that went in was an AMD64 4200+; this is a newer stepping of the CPU, which enabled the "H/W DRAM Over 4G Remapping" setting in the BIOS. Now the disks probably went back in the wrong order, but the raidz zpool imported correctly :-

bash-3.00# zpool status -v
  pool: raidpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raidpool    ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c6d0s0  ONLINE       0     0     0
            c7d0s0  ONLINE       0     0     0
            c7d1s0  ONLINE       0     0     0
            c6d1s0  ONLINE       0     0     0

bash-3.00# df -k /raidpool
Filesystem            kbytes       used      avail capacity  Mounted on
raidpool             2313814016 40152697 2273660169     2%   /raidpool

*BUT* the machine started panicking fairly soon after. The panics were in the ufs module, and the root filesystem ended up totally corrupt after an fsck; GRUB wouldn't boot. Booting from CD, I found that files in /tmp would not cksum consistently; in fact the cksums changed every time cksum was run on an unchanging file. The long and the short of it: disabling the H/W DRAM Over 4G Remapping setting allowed me to reinstall the box, and everything behaves as it should. Except the zpool....

Again, the filesystem mounts fine, and I can read from it without a problem. But writing to it causes a panic every time :-

bash-3.00# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace uppc pcplusmp ufs ip sctp usba fctl s1394 nca lofs zfs random nfs fcip cpc ptm sppp ]
> ::status
debugging crash dump vmcore.0 (64-bit) from sol
operating system: 5.11 b31p (i86pc)
panic message:
assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file: ../../common/fs/zfs/dnode_sync.c, line: 372
dump content: kernel pages only
> *panic_thread::findstack -v
stack pointer for thread fffffe8000872c80: fffffe8000872660
  fffffe8000872750 panic+0x9e()
  fffffe80008727e0 assfail3+0xbe(fffffffff0560500, 95a, fffffffff05604f8, 0, fffffffff05604d0, 174)
  fffffe8000872840 dnode_sync_free+0x2f6(ffffffff8aa15a08, fffffe81cb26c180)
  fffffe80008728f0 dnode_sync+0x49e(ffffffff8aa15a08, 0, ffffffff8aa09880, fffffe81cb26c180)
  fffffe8000872960 dmu_objset_sync_dnodes+0xb0(ffffffff82af4100, ffffffff82af4260, fffffe81cb26c180)
  fffffe80008729b0 dmu_objset_sync+0xf6(ffffffff82af4100, fffffe81cb26c180)
  fffffe80008729e0 dsl_dataset_sync+0x59(ffffffff8abd3800, fffffe81cb26c180)
  fffffe8000872b50 dsl_pool_sync+0xa3(ffffffff89695c00, 4534)
  fffffe8000872bd0 spa_sync+0x100(ffffffff87a88500, 4534)
  fffffe8000872c60 txg_sync_thread+0x230(ffffffff89695c00)
  fffffe8000872c70 thread_start+8()
> ::spa -ve
ADDR             STATE  NAME
ffffffff87a88500 ACTIVE raidpool

    ADDR             STATE     AUX          DESCRIPTION
    ffffffff82dfc6c0 HEALTHY   -            root
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS              0         0         0         0         0
        BYTES            0         0         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff82dfcc40 HEALTHY   -            raidz
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x27      0x4a         0         0         0
        BYTES      0x64200   0xdd000         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff82dfd1c0 HEALTHY   -            /dev/dsk/c5d0s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x13      0x50         0         0         0
        BYTES     0x15c400   0x93600         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff82e07040 HEALTHY   -            /dev/dsk/c4d0s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x17      0x50         0         0         0
        BYTES     0x18c000   0x91e00         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff87a81780 HEALTHY   -            /dev/dsk/c3d0s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS            0xf      0x50         0         0         0
        BYTES     0x12c000   0x94200         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff82e00700 HEALTHY   -            /dev/dsk/c2d0s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x13      0x4e         0         0         0
        BYTES     0x16c000   0x93200         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff87a81200 HEALTHY   -            /dev/dsk/c6d0s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x10      0x58         0         0         0
        BYTES     0x13c000   0x95800         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff87a80c80 HEALTHY   -            /dev/dsk/c7d0s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x14      0x58         0         0         0
        BYTES     0x16ca00   0x93400         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff87a80700 HEALTHY   -            /dev/dsk/c7d1s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x13      0x58         0         0         0
        BYTES     0x15ca00   0x93c00         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0

    ffffffff87a80180 HEALTHY   -            /dev/dsk/c6d1s0
                      READ     WRITE      FREE     CLAIM     IOCTL
        OPS           0x14      0x4e         0         0         0
        BYTES     0x16ca00   0x90800         0         0         0
        EREAD            0
        EWRITE           0
        ECKSUM           0
>
bash-3.00# zdb -bbc raidpool

Traversing all blocks to verify checksums and verify nothing leaked ...
leaked space: vdev 0, offset 0x3409aecc00, size 19456
leaked space: vdev 0, offset 0x3409ae6c00, size 19456
leaked space: vdev 0, offset 0x3409b43800, size 38912
leaked space: vdev 0, offset 0x3409b15000, size 155648
leaked space: vdev 0, offset 0x340ecab800, size 77824
leaked space: vdev 0, offset 0x340ed0ec00, size 77824
leaked space: vdev 0, offset 0x340ed30c00, size 214016
leaked space: vdev 0, offset 0x340ecd9c00, size 194560
leaked space: vdev 0, offset 0x3409b4f800, size 97280
leaked space: vdev 0, offset 0x700ed12800, size 155648
leaked space: vdev 0, offset 0x700ed54400, size 116736
leaked space: vdev 0, offset 0x700ed3d800, size 58368
block traversal size 41116431360 != alloc 41117657088 (leaked 1225728)

        bp count:           277589
        bp logical:    36057796096    avg: 129896
        bp physical:   35801011200    avg: 128971    compression: 1.01
        bp allocated:  41116431360    avg: 148119    compression: 0.88
        SPA allocated: 41117657088    used:  1.72%

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     9  96.0K   96.0K    115K   12.8K    1.00     0.00  deferred free
     1    512     512      1K      1K    1.00     0.00  object directory
     3  1.50K   1.50K   3.00K      1K    1.00     0.00  object array
     1    16K     16K   19.0K   19.0K    1.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     1    16K     16K   19.0K   19.0K    1.00     0.00  bplist
     -      -       -       -       -       -        -  bplist header
     -      -       -       -       -       -        -  SPA space map header
   188   776K    776K    968K   5.15K    1.00     0.00  SPA space map
     -      -       -       -       -       -        -  ZIL intent log
    39   624K    624K    741K   19.0K    1.00     0.00  DMU dnode
     2     2K      2K      4K      2K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     2     1K      1K      2K      1K    1.00     0.00  DSL directory child map
     1    512     512      1K      1K    1.00     0.00  DSL dataset snap map
     2     1K      1K      2K      1K    1.00     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS ACL
  271K  33.6G   33.3G   38.3G    145K    1.01    99.98  ZFS plain file
    73  3.74M   3.74M   4.32M   60.6K    1.00     0.01  ZFS directory
     1    512     512      1K      1K    1.00     0.00  ZFS master node
     1    512     512      1K      1K    1.00     0.00  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
  271K  33.6G   33.3G   38.3G    145K    1.01   100.00  Total

Looks an interesting problem ;)  Anyone got any thoughts?

Thanks,
Alan
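P.S. In case it is useful to anyone else chasing flaky memory before blaming the filesystem, the check I used boils down to checksumming an unchanging file a few times in a row. This is only a minimal sketch; the file name, size and loop count are illustrative rather than copied from my session:

    # create a throwaway file and checksum it repeatedly
    mkfile 512m /tmp/cksumtest
    for i in 1 2 3 4 5; do
            cksum /tmp/cksumtest
    done
    # on healthy hardware every run prints the same checksum and length;
    # with the remapping setting enabled the sums here changed between runs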
Eric Schrock
2006-Feb-07 20:54 UTC
[zfs-discuss] ZFS panic after falling over a motherboard bug
On Tue, Feb 07, 2006 at 12:30:36PM -0800, Alan Romeril wrote:
> > ::status
> debugging crash dump vmcore.0 (64-bit) from sol
> operating system: 5.11 b31p (i86pc)
> panic message:
> assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file:
> ../../common/fs/zfs/dnode_sync.c, line: 372
> dump content: kernel pages only

This looks like:

    6357736 Panic dn->dn_phys->dn_secphys == 0 (0x60 == 0x0), file:
            ../../common/fs/zfs/dnode_sync.c, line: 372

Which is another symptom of:

    6357699 Panic assertion failed: db->db_blkptr != 0

However, this was supposed to have been fixed in build 31. Is that what you're running? What is "b31p"? Since the line number is still 372, I'm guessing that you're not really running build 31, since this assert moved down to line 374 in that build.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
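P.S. A small aside on reading the trace: the last argument to assfail3() in the findstack output looks like the source line number in hex, so the 174 in that frame is the same line 372 quoted in the panic message. A quick check from a shell (assuming a printf that accepts hex constants, as bash's builtin does):

    $ printf '%d\n' 0x174
    372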
Alan Romeril
2006-Feb-07 22:16 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Ah yes, heh, this is built from opensolaris-src-20060102.tar, which is b30 I think. The only addition I made to this build was in usr/src/uts/i86pc/io/pci/pci_boot.c at line 375, changing:

    if (subcl != PCI_MASS_OTHER && subcl != PCI_MASS_SATA)

to

    if (subcl != PCI_MASS_OTHER && subcl != PCI_MASS_SATA &&
        subcl != PCI_MASS_RAID)

to get my 3114 SATA RAID controller to work in JBOD mode. I should have named it b30p :)

Okay, sorry for the false alarm, I'll wait for the buildable b33 sources and try again. Thanks Eric :)

Alan
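P.S. For anyone else with one of these cards: the reason the stock test skips it appears to be the PCI class code the 3114 reports when it carries the RAID firmware, i.e. mass-storage subclass 0x04 (RAID) rather than 0x80 (other) or 0x06 (SATA). A rough way to see what your card advertises (exact property formatting varies by platform):

    # look for the controller's class-code property in the device tree;
    # a value like 010400 means base class 01 (mass storage), subclass 04 (RAID)
    prtconf -pv | grep -i class-code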
Alan Romeril
2006-Feb-07 23:39 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
I've just bfu'ed with the 20060201 archives and dropped in a normal Sil 3114 PCI card so I don't have to patch anything to have the 8 disks connected. I get the same panic string as before!

bash-3.00# mdb -k vmcore.0 unix.0
mdb: vmcore.0 is not an ELF file
bash-3.00# mdb -k unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs uppc pcplusmp ufs ip sctp usba s1394 nca lofs zfs random nfs sppp ptm ]
> ::status
debugging crash dump vmcore.0 (64-bit) from sol
operating system: 5.11 opensol-20060201 (i86pc)
panic message:
assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file: ../../common/fs/zfs/dnode_sync.c, line: 374
dump content: kernel pages only

Some of the disks have been renumbered:

AVAILABLE DISK SELECTIONS:
       0. c0d1 <DEFAULT cyl 48449 alt 2 hd 128 sec 63>
          /pci@0,0/pci-ide@6/ide@0/cmdk@1,0
       1. c2d0 <ST330083- 3NF10JK-0001-279.45GB>
          /pci@0,0/pci-ide@7/ide@0/cmdk@0,0
       2. c3d0 <ST330083- 3NF0ZLA-0001-279.45GB>
          /pci@0,0/pci-ide@7/ide@1/cmdk@0,0
       3. c4d0 <ST330083- 3NF1379-0001-279.45GB>
          /pci@0,0/pci-ide@8/ide@0/cmdk@0,0
       4. c5d0 <ST330083- 3NF0QZS-0001-279.45GB>
          /pci@0,0/pci-ide@8/ide@1/cmdk@0,0
       5. c8d0 <ST330083- 3NF13FA-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@0/cmdk@0,0
       6. c8d1 <ST330083- 3NF12V5-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@0/cmdk@1,0
       7. c9d0 <ST330083- 3NF13EX-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@1/cmdk@0,0
       8. c9d1 <ST330083- 3NF12CF-0001-279.45GB>
          /pci@0,0/pci10de,5c@9/pci-ide@6/ide@1/cmdk@1,0

But the pool did import :-

bash-3.00# zpool status
  pool: raidpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raidpool    ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c8d0s0  ONLINE       0     0     0
            c9d0s0  ONLINE       0     0     0
            c9d1s0  ONLINE       0     0     0
            c8d1s0  ONLINE       0     0     0

Maybe a different issue?

Cheers,
Alan
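P.S. The import working despite the renumbering seems to be expected: ZFS writes a label to each vdev containing the pool and device GUIDs, so the cXdY path is only a hint. A quick way to look at this, with the device name below purely illustrative (any of the raidz members should do):

    # dump the ZFS label from one of the pool's disks; the GUID fields are
    # what the import matches on, not the controller/disk numbering
    zdb -l /dev/dsk/c8d0s0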
Per Öberg
2006-Feb-08 08:17 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Could this be related to my problems described in this thread?

http://www.opensolaris.org/jive/thread.jspa?threadID=5446&tstart=0

/Per

(I'll wait until SXCR 33 to try and install again..)
Alan Romeril
2006-Feb-08 21:02 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Hi Per,

I don't think so. I've reverted to the 1013 version of the BIOS, which I know is stable on this hardware, and I still get the panic. However I have narrowed it down: there seems to be one file that causes the panic, which was the one I was writing to when the box first bailed. I had run a "mkfile 1g test" in the root of the pool when it panicked, and if I try to delete this file I get the panic exactly as before.

panic message:
assertion failed: dn->dn_phys->dn_secphys == 0 (0x95a == 0x0), file: ../../common/fs/zfs/dnode_sync.c, line: 374

But the rest of the pool looks fine. I've ftp'ed a few isos to the pool and they cksum correctly, with no stability issues from the pool there. rm'ing the file "test" panics the machine every time.

Cheers,
Alan
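P.S. If anyone wants to poke at the damaged object without triggering the panic, one read-only approach (just a sketch, and zdb option behaviour can differ between builds) is to grab the file's object number, which on ZFS is the same as its inode number, and dump its dnode with zdb:

    # object number of the suspect file (ZFS inode number == object number)
    ls -i /raidpool/test
    # dump that object's dnode and block pointers; this only reads the pool
    zdb -dddd raidpool <object-number>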
Alan Romeril
2006-Feb-08 22:55 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Tried a zpool scrub and one cksum problem was found. But it still panics on the rm.... Might have to copy files off and rebuild this pool at this rate...

Alan

bash-3.00# zpool status
  pool: raidpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Wed Feb  8 21:32:53 2006
config:

        NAME        STATE     READ WRITE CKSUM
        raidpool    ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c5d0s0  ONLINE       0     0     0
            c4d0s0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c2d0s0  ONLINE       0     0     0
            c8d0s0  ONLINE       0     0     0
            c9d0s0  ONLINE       0     0     0
            c9d1s0  ONLINE       0     0     0
            c8d1s0  ONLINE       0     0     1  18.50 repaired
Matthew A. Ahrens
2006-Feb-09 00:04 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
> I've just bfu'ed with the 20060201 archives and
> dropped in a normal Sil 3114 pci card so I don't have
> to patch anything to have the 8 disks connected. I
> get the same panic string as before!

Unfortunately, this is expected.  The bug you encountered messes up the
on-disk accounting, so you will need to destroy this pool to eliminate the
problem.

--matt
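P.S. For anyone in the same spot, the rebuild this implies is roughly the following sketch; /backup is an illustrative destination, and the device names are simply the ones from the current zpool status, so adjust both for your own setup:

    # copy everything somewhere safe first (any method you trust will do)
    cd /raidpool && find . -depth -print | cpio -pdmu /backup
    # then recreate the pool from scratch and copy the data back
    zpool destroy raidpool
    zpool create raidpool raidz c5d0s0 c4d0s0 c3d0s0 c2d0s0 \
        c8d0s0 c9d0s0 c9d1s0 c8d1s0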
Per Öberg
2006-Feb-09 09:04 UTC
[zfs-discuss] Re: ZFS panic after falling over a motherboard bug
Hi Alan,

Thanks for your reply. I'll revert to BIOS 1012.006 (I can't find your version, 1013, on the Asus homepage..) and give it a shot when I can get hold of SXCR 33 or later, which seems to cure the disk-related bugs mentioned? I haven't got time for OpenSolaris at the moment.

/Per