Hey all,

I'm not sure if this is a ZFS bug or a hardware issue I'm having - any pointers would be great!

Following contents include:
- high-level info about my system
- my first thought to debugging this
- stack trace
- format output
- zpool status output
- dmesg output

High-Level Info About My System
---------------------------------------------
- fresh install of b78
- first time trying to do anything IO-intensive with ZFS
  - command was `cp -r /cdrom /tank/sxce_b78_disk1`
  - but this also fails: `cp -r /usr /tank/usr`
- system has 24 SATA/SAS drive bays, but only 12 of them are populated
- system has three AOC-SAT2-MV8 cards plugged into 6 mini-SAS backplanes
  - card1 ("c3")
    - bp1 (c3t0d0, c3t1d0)
    - bp2 (c3t4d0, c3t5d0)
  - card2 ("c4")
    - bp1 (c4t0d0, c4t1d0)
    - bp2 (c4t4d0, c4t5d0)
  - card3 ("c5")
    - bp1 (c5t0d0, c5t1d0)
    - bp2 (c5t4d0, c5t5d0)
- system has one Barcelona Opteron (step BA)
  - the one with the potential translation look-aside buffer (TLB) bug...
  - though it's not clear this is related...

My First Thought To Debugging This
------------------------------------------------
After crashing my system several times (using `cp -r /usr /tank/usr`) and comparing the outputs, I noticed that the stack trace always points to device-path="/pci@0,0/pci10de,376@a/pci1033,125@0,1", which corresponds to all the drives connected to AOC-SAT2-MV8 card #3 (i.e. "c5").

But looking at the `format` output, this device path only differs from the other devices in that there is a ",1" trailing the "/pci@0,0/pci10de,376@a/pci1033,125@0" part. Further, again looking at the `format` output, "c3" devices have "4/disk", while "c4" and "c5" devices both have "6/disk".

The only other thing I can add is that if I boot a Xen kernel, which I was *not* using for all these tests, the following IRQ notices are reported:

SunOS Release 5.11 Version snv_78 64-bit
Copyright 1983-2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: san
NOTICE: IRQ17 is shared
Reading ZFS config: done
Mounting ZFS filesystems: (1/1)
NOTICE: IRQ20 is shared
NOTICE: IRQ21 is shared
NOTICE: IRQ22 is shared

Any ideas?

Stack Trace (note: I've done this a few times and it's always "/pci@0,0/pci10de,376@a/pci1033,125@0,1")
---------------
ATA UDMA data parity error

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x478f39ab.0x160dc688 (0x5344bb5958)
PLATFORM: i86pc, CSN: -, HOSTNAME: san
SOURCE: SunOS, REV: 5.11 snv_78
DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
IMPACT: The system will sync files, save a crash dump if needed, and reboot
REC-ACTION: Save the error summary below in case telemetry cannot be saved

panic[cpu3]/thread=ffffff000f7c2c80: pcie_pci-0: PCI(-X) Express Fatal Error

ffffff000f7c2bc0 pcie_pci:pepb_err_msi_intr+d2 ()
ffffff000f7c2c20 unix:av_dispatch_autovect+78 ()
ffffff000f7c2c60 unix:dispatch_hardint+2f ()
ffffff000fd09fd0 unix:switch_sp_and_call+13 ()
ffffff000fd0a020 unix:do_interrupt+a0 ()
ffffff000fd0a030 unix:cmnint+ba ()
ffffff000fd0a130 genunix:avl_first+1e ()
ffffff000fd0a1f0 zfs:metaslab_group_alloc+d1 ()
ffffff000fd0a2c0 zfs:metaslab_alloc_dva+1b7 ()
ffffff000fd0a360 zfs:metaslab_alloc+82 ()
ffffff000fd0a3b0 zfs:zio_dva_allocate+8a ()
ffffff000fd0a3d0 zfs:zio_next_stage+b3 ()
ffffff000fd0a400 zfs:zio_checksum_generate+6e ()
ffffff000fd0a420 zfs:zio_next_stage+b3 ()
ffffff000fd0a490 zfs:zio_write_compress+239 ()
ffffff000fd0a4b0 zfs:zio_next_stage+b3 ()
ffffff000fd0a500 zfs:zio_wait_for_children+5d ()
ffffff000fd0a520 zfs:zio_wait_children_ready+20 ()
ffffff000fd0a540 zfs:zio_next_stage_async+bb ()
ffffff000fd0a560 zfs:zio_nowait+11 ()
ffffff000fd0a870 zfs:dbuf_sync_leaf+1ac ()
ffffff000fd0a8b0 zfs:dbuf_sync_list+51 ()
ffffff000fd0a900 zfs:dbuf_sync_indirect+cd ()
ffffff000fd0a940 zfs:dbuf_sync_list+5e ()
ffffff000fd0a9b0 zfs:dnode_sync+23b ()
ffffff000fd0a9f0 zfs:dmu_objset_sync_dnodes+55 ()
ffffff000fd0aa70 zfs:dmu_objset_sync+13d ()
ffffff000fd0aac0 zfs:dsl_dataset_sync+5d ()
ffffff000fd0ab30 zfs:dsl_pool_sync+b5 ()
ffffff000fd0abd0 zfs:spa_sync+208 ()
ffffff000fd0ac60 zfs:txg_sync_thread+19a ()
ffffff000fd0ac70 unix:thread_start+8 ()

syncing file systems... 1 1 done
ereport.io.pciex.rc.fe-msg ena=5344b8176c00c01 detector=[ version=0 scheme "dev" device-path="/pci@0,0/pci10de,376@a" ] rc-status=800007c source-id=201 source-valid=1
ereport.io.pciex.rc.mue-msg ena=5344b8176c00c01 detector=[ version=0 scheme "dev" device-path="/pci@0,0/pci10de,376@a" ] rc-status=800007c
ereport.io.pci.sec-rserr ena=5344b8176c00c01 detector=[ version=0 scheme="dev" device-path="/pci@0,0/pci10de,376@a" ] pci-sec-status=6000 pci-bdg-ctrl=3
ereport.io.pci.sec-ma ena=5344b8176c00c01 detector=[ version=0 scheme="dev" device-path="/pci@0,0/pci10de,376@a" ] pci-sec-status=6000 pci-bdg-ctrl=3
ereport.io.pciex.bdg.sec-perr ena=5344b8176c00c01 detector=[ version=0 scheme "dev" device-path="/pci@0,0/pci10de,376@a/pci1033,125@0,1" ] sue-status=1800 source-id=201 source-valid=1
ereport.io.pciex.bdg.sec-serr ena=5344b8176c00c01 detector=[ version=0 scheme "dev" device-path="/pci@0,0/pci10de,376@a/pci1033,125@0,1" ] sue-status=1800
ereport.io.pci.sec-rserr ena=5344b8176c00c01 detector=[ version=0 scheme="dev" device-path="/pci@0,0/pci10de,376@a/pci1033,125@0,1" ] pci-sec-status=6420 pci-bdg-ctrl=7
dumping to /dev/dsk/c2t0d0s1, offset 215547904, content: kernel
NOTICE: /pci@0,0/pci15d9,1611@5: port 0: device reset
100% done: 114942 pages dumped, compression ratio 4.24, dump succeeded
rebooting...

Format output
------------------
# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
   0. c2t0d0 <DEFAULT cyl 6527 alt 2 hd 255 sec 63>
      /pci@0,0/pci15d9,1611@5/disk@0,0
   1. c2t1d0 <DEFAULT cyl 6527 alt 2 hd 255 sec 63>
      /pci@0,0/pci15d9,1611@5/disk@1,0
   2. c3t0d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@0,0
   3. c3t1d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@1,0
   4. c3t4d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@4,0
   5. c3t5d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@5,0
   6. c4t0d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@0,0
   7. c4t1d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@1,0
   8. c4t4d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@4,0
   9. c4t5d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@5,0
  10. c5t0d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@0,0
  11. c5t1d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@1,0
  12. c5t4d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@4,0
  13. c5t5d0 <ATA-ST31000340NS-SN03-931.51GB>
      /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@5,0

Zpool Status Output
-------------------------
# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0

errors: No known data errors

Dmesg Output (sorry, it looks like the beginning of the dmesg output is missing)
-------------------
# dmesg
Thu Jan 17 06:42:37 EST 2008
Jan 17 06:41:19 san sata: [ID 349649 kern.info] Supported queue depth 32
Jan 17 06:41:19 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors
Jan 17 06:41:19 san scsi: [ID 193665 kern.info] sd8 at marvell88sx0: target 4 lun 0
Jan 17 06:41:19 san genunix: [ID 936769 kern.info] sd8 is /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@4,0
Jan 17 06:41:19 san genunix: [ID 408114 kern.info] /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@4,0 (sd8) online
Jan 17 06:41:19 san sata: [ID 663010 kern.info] /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4 :
Jan 17 06:41:19 san sata: [ID 761595 kern.info] SATA disk device at port 5
Jan 17 06:41:19 san sata: [ID 846691 kern.info] model ST31000340NS
Jan 17 06:41:19 san sata: [ID 693010 kern.info] firmware SN03
Jan 17 06:41:19 san sata: [ID 163988 kern.info] serial number 5QJ0337Z
Jan 17 06:41:19 san sata: [ID 594940 kern.info] supported features:
Jan 17 06:41:19 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Jan 17 06:41:19 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps)
Jan 17 06:41:19 san sata: [ID 349649 kern.info] Supported queue depth 32
Jan 17 06:41:19 san sata: [ID 349649
kern.info] capacity = 1953525168 sectors Jan 17 06:41:19 san scsi: [ID 193665 kern.info] sd10 at marvell88sx0: target 5 lun 0 Jan 17 06:41:19 san genunix: [ID 936769 kern.info] sd10 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 4/disk at 5,0 Jan 17 06:41:19 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 4/disk at 5,0 (sd10) online Jan 17 06:41:19 san pcplusmp: [ID 444295 kern.info] pcplusmp: pci11ab,6081.9 (marvell88sx) instance #1 vector 0x1b ioapic 0xff intin 0xff is bound to cpu 1 Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 0 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 1 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 2 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 3 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 4 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 5 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 6 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx1: device on port 7 reset: initialization Jan 17 06:41:19 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6 : Jan 17 06:41:19 san sata: [ID 761595 kern.info] SATA disk device at port 0 Jan 17 06:41:19 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:19 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:19 san sata: [ID 163988 kern.info] serial number 9QJ00TE2 Jan 17 06:41:19 san sata: [ID 594940 kern.info] supported features: Jan 17 06:41:19 san sata: [ID 981177 kern.info] 48-bit 
LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:19 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:19 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:19 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:19 san scsi: [ID 193665 kern.info] sd12 at marvell88sx1: target 0 lun 0 Jan 17 06:41:19 san genunix: [ID 936769 kern.info] sd12 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 0,0 Jan 17 06:41:19 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 0,0 (sd12) online Jan 17 06:41:19 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6 : Jan 17 06:41:19 san sata: [ID 761595 kern.info] SATA disk device at port 1 Jan 17 06:41:19 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:19 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:19 san sata: [ID 163988 kern.info] serial number 5QJ02T3F Jan 17 06:41:19 san sata: [ID 594940 kern.info] supported features: Jan 17 06:41:19 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:19 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:19 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:19 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:19 san scsi: [ID 193665 kern.info] sd13 at marvell88sx1: target 1 lun 0 Jan 17 06:41:19 san genunix: [ID 936769 kern.info] sd13 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 1,0 Jan 17 06:41:19 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 1,0 (sd13) online Jan 17 06:41:19 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6 : Jan 17 06:41:19 san sata: [ID 761595 kern.info] SATA disk device 
at port 4 Jan 17 06:41:19 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:19 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:19 san sata: [ID 163988 kern.info] serial number 9QJ00PWQ Jan 17 06:41:19 san sata: [ID 594940 kern.info] supported features: Jan 17 06:41:19 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:19 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:19 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:19 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:19 san scsi: [ID 193665 kern.info] sd14 at marvell88sx1: target 4 lun 0 Jan 17 06:41:19 san genunix: [ID 936769 kern.info] sd14 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 4,0 Jan 17 06:41:19 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 4,0 (sd14) online Jan 17 06:41:19 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6 : Jan 17 06:41:19 san sata: [ID 761595 kern.info] SATA disk device at port 5 Jan 17 06:41:19 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:19 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:19 san sata: [ID 163988 kern.info] serial number 9QJ00QN7 Jan 17 06:41:19 san sata: [ID 594940 kern.info] supported features: Jan 17 06:41:19 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:19 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:19 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:19 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:19 san scsi: [ID 193665 kern.info] sd15 at marvell88sx1: target 5 lun 0 Jan 17 06:41:19 san genunix: [ID 936769 kern.info] sd15 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 
6/disk at 5,0 Jan 17 06:41:19 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 5,0 (sd15) online Jan 17 06:41:19 san pcplusmp: [ID 444295 kern.info] pcplusmp: pci11ab,6081.9 (marvell88sx) instance #2 vector 0x1c ioapic 0xff intin 0xff is bound to cpu 1 Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 0 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 1 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 2 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 3 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 5 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 6 reset: initialization Jan 17 06:41:19 san marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 7 reset: initialization Jan 17 06:41:19 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6 : Jan 17 06:41:19 san sata: [ID 761595 kern.info] SATA disk device at port 0 Jan 17 06:41:19 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:19 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:19 san sata: [ID 163988 kern.info] serial number 9QJ00WQM Jan 17 06:41:19 san sata: [ID 594940 kern.info] supported features: Jan 17 06:41:19 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:19 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:19 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:19 san sata: [ID 
349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:19 san scsi: [ID 193665 kern.info] sd4 at marvell88sx2: target 0 lun 0 Jan 17 06:41:19 san genunix: [ID 936769 kern.info] sd4 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 0,0 Jan 17 06:41:20 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 0,0 (sd4) online Jan 17 06:41:20 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6 : Jan 17 06:41:20 san sata: [ID 761595 kern.info] SATA disk device at port 1 Jan 17 06:41:20 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:20 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:20 san sata: [ID 163988 kern.info] serial number 9QJ00AWL Jan 17 06:41:20 san sata: [ID 594940 kern.info] supported features: Jan 17 06:41:20 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:20 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:20 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:20 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:20 san scsi: [ID 193665 kern.info] sd7 at marvell88sx2: target 1 lun 0 Jan 17 06:41:20 san genunix: [ID 936769 kern.info] sd7 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 1,0 Jan 17 06:41:20 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 1,0 (sd7) online Jan 17 06:41:20 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6 : Jan 17 06:41:20 san sata: [ID 761595 kern.info] SATA disk device at port 4 Jan 17 06:41:20 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:20 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:20 san sata: [ID 163988 kern.info] serial number 9QJ00WBG Jan 17 06:41:20 san sata: [ID 
594940 kern.info] supported features: Jan 17 06:41:20 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:20 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:20 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:20 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:20 san scsi: [ID 193665 kern.info] sd9 at marvell88sx2: target 4 lun 0 Jan 17 06:41:20 san genunix: [ID 936769 kern.info] sd9 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 4,0 Jan 17 06:41:20 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 4,0 (sd9) online Jan 17 06:41:20 san sata: [ID 663010 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6 : Jan 17 06:41:20 san sata: [ID 761595 kern.info] SATA disk device at port 5 Jan 17 06:41:20 san sata: [ID 846691 kern.info] model ST31000340NS Jan 17 06:41:20 san sata: [ID 693010 kern.info] firmware SN03 Jan 17 06:41:20 san sata: [ID 163988 kern.info] serial number 9QJ026NE Jan 17 06:41:20 san sata: [ID 594940 kern.info] supported features: Jan 17 06:41:20 san sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test Jan 17 06:41:20 san sata: [ID 514995 kern.info] SATA Gen1 signaling speed (1.5Gbps) Jan 17 06:41:20 san sata: [ID 349649 kern.info] Supported queue depth 32 Jan 17 06:41:20 san sata: [ID 349649 kern.info] capacity = 1953525168 sectors Jan 17 06:41:20 san scsi: [ID 193665 kern.info] sd11 at marvell88sx2: target 5 lun 0 Jan 17 06:41:20 san genunix: [ID 936769 kern.info] sd11 is /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 5,0 Jan 17 06:41:20 san genunix: [ID 408114 kern.info] /pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1/pci11ab,11ab at 6/disk at 5,0 (sd11) online Jan 17 06:41:22 san pseudo: [ID 129642 kern.info] pseudo-device: pm0 Jan 17 
06:41:22 san genunix: [ID 936769 kern.info] pm0 is /pseudo/pm at 0 Jan 17 06:41:22 san genunix: [ID 314293 kern.info] device pciclass,030000 at 5(display#0) keeps up device sd at 0,0(disk#0), but the latter is not power managed Jan 17 06:41:22 san pseudo: [ID 129642 kern.info] pseudo-device: power0 Jan 17 06:41:22 san genunix: [ID 936769 kern.info] power0 is /pseudo/power at 0 Jan 17 06:41:22 san pseudo: [ID 129642 kern.info] pseudo-device: srn0 Jan 17 06:41:22 san genunix: [ID 936769 kern.info] srn0 is /pseudo/srn at 0 Jan 17 06:41:22 san /usr/lib/power/powerd: [ID 387247 daemon.error] Able to open /dev/srn Jan 17 06:41:27 san pseudo: [ID 129642 kern.info] pseudo-device: dtrace0 Jan 17 06:41:27 san genunix: [ID 936769 kern.info] dtrace0 is /pseudo/dtrace at 0 Jan 17 06:41:29 san pcplusmp: [ID 803547 kern.info] pcplusmp: fdc (fdc) instance 0 vector 0x6 ioapic 0x4 intin 0x6 is bound to cpu 2 Jan 17 06:41:29 san isa: [ID 202937 kern.info] ISA-device: fdc0 Jan 17 06:41:29 san fdc: [ID 114370 kern.info] fd0 at fdc0 Jan 17 06:41:29 san genunix: [ID 936769 kern.info] fd0 is /isa/fdc at 1,3f0/fd at 0,0 Jan 17 06:41:30 san pcplusmp: [ID 803547 kern.info] pcplusmp: ide (ata) instance 0 vector 0xe ioapic 0x4 intin 0xe is bound to cpu 3 Jan 17 06:41:30 san genunix: [ID 640982 kern.info] ATAPI device at targ 0, lun 0 lastlun 0xfeff Jan 17 06:41:30 san genunix: [ID 846691 kern.info] model DW-224E-V Jan 17 06:41:30 san genunix: [ID 479077 kern.info] ATA/ATAPI-7 supported, majver 0xfe minver 0x0 Jan 17 06:41:30 san npe: [ID 236367 kern.info] PCI Express-device: ide at 0, ata0 Jan 17 06:41:30 san genunix: [ID 936769 kern.info] ata0 is /pci at 0,0/pci-ide at 4/ide at 0 Jan 17 06:41:30 san genunix: [ID 773945 kern.info] UltraDMA mode 2 selected Jan 17 06:41:30 san last message repeated 2 times Jan 17 06:41:30 san scsi: [ID 193665 kern.info] sd3 at ata0: target 0 lun 0 Jan 17 06:41:30 san genunix: [ID 936769 kern.info] sd3 is /pci at 0,0/pci-ide at 4/ide at 0/sd at 0,0 Jan 17 
06:41:30 san genunix: [ID 314293 kern.info] device pciclass,030000 at 5(display#0) keeps up device sd at 0,0(sd#3), but the latter is not power managed Jan 17 06:41:31 san savecore: [ID 570001 auth.error] reboot after panic: pcie_pci-0: PCI(-X) Express Fatal Error Jan 17 06:41:31 san savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/san/*.7 Jan 17 06:41:31 san pcplusmp: [ID 444295 kern.info] pcplusmp: asy (asy) instance #1 vector 0x3 ioapic 0x4 intin 0x3 is bound to cpu 3 Jan 17 06:41:31 san isa: [ID 202937 kern.info] ISA-device: asy1 Jan 17 06:41:31 san genunix: [ID 936769 kern.info] asy1 is /isa/asy at 1,2f8 Jan 17 06:41:33 san sendmail[885]: [ID 702911 mail.crit] My unqualified host name (san) unknown; sleeping for retry Jan 17 06:41:33 san sendmail[884]: [ID 702911 mail.crit] My unqualified host name (san) unknown; sleeping for retry Jan 17 06:41:46 san rootnex: [ID 349649 kern.info] xsvc0 at root: space 0 offset 0 Jan 17 06:41:46 san genunix: [ID 936769 kern.info] xsvc0 is /xsvc at 0,0 Jan 17 06:42:14 san pseudo: [ID 129642 kern.info] pseudo-device: devinfo0 Jan 17 06:42:14 san genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo at 0 Jan 17 06:42:15 san pseudo: [ID 129642 kern.info] pseudo-device: bmc0 Jan 17 06:42:15 san genunix: [ID 936769 kern.info] bmc0 is /pseudo/bmc at 0 Jan 17 06:42:15 san fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUNOS-8000-1L, TYPE: Defect, VER: 1, SEVERITY: Minor Jan 17 06:42:15 san EVENT-TIME: Thu Jan 17 06:42:15 EST 2008 Jan 17 06:42:15 san PLATFORM: H8DM8-2, CSN: 1234567890, HOSTNAME: san Jan 17 06:42:15 san SOURCE: eft, REV: 1.16 Jan 17 06:42:15 san EVENT-ID: f74efc28-56c6-ed54-b894-f8ad62c31ba3 Jan 17 06:42:15 san DESC: The EFT Diagnosis Engine encountered telemetry for which it is unable to produce a diagnosis. Refer to http://sun.com/msg/SUNOS-8000-1L for more information. Jan 17 06:42:15 san AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun. 
Jan 17 06:42:15 san IMPACT: Automated diagnosis and response for these events will not occur. Jan 17 06:42:15 san REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed properly. Contact Sun for support. Jan 17 06:42:15 san fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUNOS-8000-1L, TYPE: Defect, VER: 1, SEVERITY: Minor Jan 17 06:42:15 san EVENT-TIME: Thu Jan 17 06:42:15 EST 2008 Jan 17 06:42:15 san PLATFORM: H8DM8-2, CSN: 1234567890, HOSTNAME: san Jan 17 06:42:15 san SOURCE: eft, REV: 1.16 Jan 17 06:42:15 san EVENT-ID: e6a90741-e722-e248-a250-b2ee3c00c284 Jan 17 06:42:15 san DESC: The EFT Diagnosis Engine encountered telemetry for which it is unable to produce a diagnosis. Refer to http://sun.com/msg/SUNOS-8000-1L for more information. Jan 17 06:42:15 san AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun. Jan 17 06:42:15 san IMPACT: Automated diagnosis and response for these events will not occur. Jan 17 06:42:15 san REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed properly. Contact Sun for support. Jan 17 06:42:17 san pseudo: [ID 129642 kern.info] pseudo-device: pool0 Jan 17 06:42:17 san genunix: [ID 936769 kern.info] pool0 is /pseudo/pool at 0 Jan 17 06:42:34 san sendmail[885]: [ID 702911 mail.alert] unable to qualify my own domain name (san) -- using short name Jan 17 06:42:34 san sendmail[884]: [ID 702911 mail.alert] unable to qualify my own domain name (san) -- using short name # Thanks! Kent
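
The reasoning in the post (matching the device-path named in an ereport against the paths in the `format` listing to see which disks sit behind it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `normalize` and `disks_behind` are made up for this example and are not part of any Solaris tool, the table holds just one disk per card copied from the listing above, and the normalization step is an assumption that lets the sketch accept paths where `@` has been rendered as ` at `.

```python
import re

# One disk per controller, copied from the `format` listing in the post.
# The last entry deliberately uses the " at " spelling to exercise normalize().
FORMAT_PATHS = {
    "c2t0d0": "/pci@0,0/pci15d9,1611@5/disk@0,0",
    "c3t0d0": "/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@0,0",
    "c4t0d0": "/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@0,0",
    "c5t0d0": "/pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1"
              "/pci11ab,11ab at 6/disk at 0,0",
}


def normalize(path):
    """Rewrite ' at ' as '@' and drop any trailing slash (assumed convention)."""
    return re.sub(r"\s+at\s+", "@", path.strip()).rstrip("/")


def disks_behind(detector_path, paths):
    """Return the disks whose device path lies below detector_path.

    The trailing '/' in the prefix keeps '...125@0' from also matching
    '...125@0,1', which is a different node in the device tree.
    """
    prefix = normalize(detector_path) + "/"
    return sorted(d for d, p in paths.items() if normalize(p).startswith(prefix))


if __name__ == "__main__":
    # Detector paths taken from the ereport lines in the post.
    for detector in (
        "/pci@0,0/pci10de,376@a",
        "/pci@0,0/pci10de,376@a/pci1033,125@0",
        "/pci@0,0/pci10de,376@a/pci1033,125@0,1",
    ):
        print(detector, "->", disks_behind(detector, FORMAT_PATHS))
```

Matching on whole path components is the point of the design: a plain substring test would lump the "c5" disks in with "c3" and "c4", because their paths differ only by the ",1" suffix.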
On a lark, I decided to create a new pool not including any devices connected to card #3 (i.e. "c5").

It crashes again, but this time with a slightly different dump (see below). Actually, there are two dumps below: the first is using the xVM kernel and the second is not.

Any ideas?

Kent

[NOTE: this one using xVM kernel - see below for dump without xVM kernel]

# zpool destroy tank
# zpool status
no pools available
# zpool create tank raidz2 c3t0d0 c3t4d0 c4t0d0 c4t4d0 raidz2 c3t1d0 c3t5d0 c4t1d0 c4t5d0
# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0

errors: No known data errors
# ls /tank
# cp -r /usr /tank/usr
Jan 17 08:48:53 san sata: NOTICE: /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:
Jan 17 08:48:53 san port 5: device reset
Jan 17 08:48:53 san sata: NOTICE: /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:
Jan 17 08:48:53 san port 5: link lost
Jan 17 08:48:53 san sata: NOTICE: /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:
Jan 17 08:48:53 san port 5: link established
Jan 17 08:48:55 san marvell88sx: WARNING: marvell88sx1: port 4: DMA completed after timed out
Jan 17 08:48:55 san last message repeated 14 times
Jan 17 08:48:55 san sata: NOTICE: /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:
Jan 17 08:48:55 san port 4: device reset
Jan 17 08:48:55 san sata: NOTICE: /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:
Jan 17 08:48:55 san port 4: link lost
Jan 17 08:48:55 san sata: NOTICE: /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6:
Jan 17 08:48:55 san port 4: link established
Jan 17 08:48:55 san scsi: WARNING: /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@5,0 (sd15):
Jan 17
08:48:55 san Error for Command: write Error Level: Retryable Jan 17 08:48:55 san scsi: Requested Block: 11893 Error Block: 11893 Jan 17 08:48:55 san scsi: Vendor: ATA Serial Number: Jan 17 08:48:55 san scsi: Sense Key: No_Additional_Sense Jan 17 08:48:55 san scsi: ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 Jan 17 08:48:55 san scsi: WARNING: /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 5,0 (sd15): Jan 17 08:48:55 san Error for Command: write Error Level: Retryable Jan 17 08:48:55 san scsi: Requested Block: 11983 Error Block: 11983 Jan 17 08:48:55 san scsi: Vendor: ATA Serial Number: Jan 17 08:48:55 san scsi: Sense Key: No_Additional_Sense Jan 17 08:48:55 san scsi: ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 Jan 17 08:48:55 san scsi: WARNING: /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 5,0 (sd15): Jan 17 08:48:55 san Error for Command: write Error Level: Retryable Jan 17 08:48:55 san scsi: Requested Block: 12988 Error Block: 12988 Jan 17 08:48:55 san scsi: Vendor: ATA Serial Number: Jan 17 08:48:55 san scsi: Sense Key: No_Additional_Sense Jan 17 08:48:55 san scsi: ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 Jan 17 08:48:55 san scsi: WARNING: /pci at 0,0/pci10de,376 at a/pci1033,125 at 0/pci11ab,11ab at 6/disk at 5,0 (sd15): Jan 17 08:48:55 WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA 
UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error WARNING: marvell88sx1: error on port 4: EDMA self disabled panic[cpu2]/thread=ffffff000f180c80: BAD TRAP: type=e (#pf Page fault) rp=ffffff000f180ab0 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference sched: #pf Page fault Bad kernel fault at addr=0x0 pid=0, pc=0x0, sp=0xffffff000f180ba8, eflags=0x10246 cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 620<xmme,fxsr,pae> cr2: 0 rdi: ffffff02d0040380 rsi: 0 rdx: ffffff000f180c80 rcx: 2 r8: 0 r9: 0 rax: 0 rbx: 1 rbp: ffffff000f180be0 r10: 429c7a8d230 r11: fffffffffb81ec40 r12: ffffff02cd25aae8 r13: ffffff02d0040380 r14: ffffff02cd25aa80 r15: ffffff02cbab3c80 fsb: 0 gsb: ffffff02c6bb9b00 ds: 4b es: 4b fs: 0 gs: 1c3 trp: e err: 10 rip: 0 cs: e030 rfl: 10246 rsp: ffffff000f180ba8 ss: e02b 
ffffff000f180990 unix:die+c8 () ffffff000f180aa0 unix:trap+13b3 () ffffff000f180ab0 unix:cmntrap+12f () ffffff000f180be0 0 () ffffff000f180c30 unix:av_dispatch_softvect+5f () ffffff000f180c60 unix:dispatch_softint+38 () ffffff000f13c9a0 unix:switch_sp_and_call+13 () ffffff000f13c9e0 unix:dosoftint+59 () ffffff000f13ca30 unix:do_interrupt+f9 () ffffff000f13cae0 unix:xen_callback_handler+370 () ffffff000f13caf0 unix:xen_callback+cd () ffffff000f13cbf0 unix:HYPERVISOR_sched_op+29 () ffffff000f13cc00 unix:HYPERVISOR_block+11 () ffffff000f13cc10 unix:mach_cpu_idle+12 () ffffff000f13cc40 unix:cpu_idle+cc () ffffff000f13cc60 unix:idle+10e () ffffff000f13cc70 unix:thread_start+8 () syncing file systems... 1 1 done dumping to /dev/dsk/c2t0d0s1, offset 215547904, content: kernel NOTICE: /pci at 0,0/pci15d9,1611 at 5: port 0: device reset 100% done: 143600 pages dumped, compression ratio 3.32, dump succeeded rebooting... [NOTE: this one using the standard kernel - not the xVM kernel] # cp -r /usr /tank/usr WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major EVENT-TIME: 0x478f5e05.0x288802d5 (0x4504536150) PLATFORM: i86pc, CSN: -, HOSTNAME: san SOURCE: SunOS, REV: 5.11 snv_78 DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information. 
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry IMPACT: The system will sync files, save a crash dump if needed, and reboot REC-ACTION: Save the error summary below in case telemetry cannot be saved panic[cpu3]/thread=ffffff000f81ac80: pcie_pci-0: PCI(-X) Express Fatal Error ffffff000f81abc0 pcie_pci:pepb_err_msi_intr+d2 () ffffff000f81ac20 unix:av_dispatch_autovect+78 () ffffff000f81ac60 unix:dispatch_hardint+2f () ffffff000f7e4ac0 unix:switch_sp_and_call+13 () ffffff000f7e4b10 unix:do_interrupt+a0 () ffffff000f7e4b20 unix:cmnint+ba () ffffff000f7e4c10 unix:mach_cpu_idle+b () ffffff000f7e4c40 unix:cpu_idle+c8 () ffffff000f7e4c60 unix:idle+10e () ffffff000f7e4c70 unix:thread_start+8 () syncing file systems... done ereport.io.pciex.rc.fe-msg ena=450450261f00c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a" ] rc-status=800007c source-id=200 source-valid=1 ereport.io.pciex.rc.mue-msg ena=450450261f00c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a" ] rc-status=800007c ereport.io.pci.sec-rserr ena=450450261f00c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a" ] pci-sec-status=6000 pci-bdg-ctrl=3 ereport.io.pci.sec-ma ena=450450261f00c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a" ] pci-sec-status=6000 pci-bdg-ctrl=3 ereport.io.pciex.bdg.sec-perr ena=450450261f00c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0" ] sue-status=1800 source-id=200 source-valid=1 ereport.io.pciex.bdg.sec-serr ena=450450261f00c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0" ] sue-status=1800 ereport.io.pci.sec-rserr ena=450450261f00c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0" ] pci-sec-status=6420 pci-bdg-ctrl=7 dumping to /dev/dsk/c2t0d0s1, offset 215547904, content: kernel NOTICE: /pci at 
0,0/pci15d9,1611 at 5: port 0: device reset 100% done: 152687 pages dumped, compression ratio 5.33, dump succeeded rebooting...
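A note for archive readers: one rough way to summarize console output like the above is to tally the marvell88sx warnings per controller instance and port. The sketch below is only an illustration. The names `tally_warnings` and `SAMPLE` are made up for this example, and the regex assumes the two message texts seen in this thread ("ATA UDMA data parity error" and "EDMA self disabled").

```python
import re
from collections import Counter

# Matches console lines such as:
#   WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error
# Only the two message texts seen in this thread are recognized (an assumption).
WARNING_RE = re.compile(
    r"WARNING: marvell88sx(?P<inst>\d+): error on port (?P<port>\d+):\s*"
    r"(?P<msg>ATA UDMA data parity error|EDMA self disabled)"
)


def tally_warnings(log_text):
    """Count warnings keyed by (driver instance, port, message)."""
    counts = Counter()
    for match in WARNING_RE.finditer(log_text):
        key = (int(match["inst"]), int(match["port"]), match["msg"])
        counts[key] += 1
    return counts


# Hypothetical sample, shortened from the kind of output shown above.
SAMPLE = (
    "WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error "
    "WARNING: marvell88sx1: error on port 4: ATA UDMA data parity error "
    "WARNING: marvell88sx1: error on port 4: EDMA self disabled "
    "WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error"
)

if __name__ == "__main__":
    for (inst, port, msg), n in sorted(tally_warnings(SAMPLE).items()):
        print(f"marvell88sx{inst} port {port}: {n} x {msg}")
```

If the same instance and port dominate across several crashes, that points at one path through the hardware; if the failing instance moves around, something shared is more likely.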
Below I create zpools isolating one card at a time:
- when just card #1 - it works
- when just card #2 - it fails
- when just card #3 - it works

And then again using the two cards that seem to work:
- when cards #1 and #3 - it fails

So, at first I thought I had narrowed it down to a card, but my last test shows that it still fails when the zpool uses two cards that succeed individually...

The only thing I can think to point out here is that those two cards are on different buses - one connected to a NEC uPD720400 and the other connected to an AIC-7902, which itself is then connected to the NEC uPD720400.

Any ideas?

Thanks,
Kent


OK, doing it again using just card #1 (i.e. "c3") works!

# zpool destroy tank
# zpool create tank raidz2 c3t0d0 c3t4d0 c3t1d0 c3t5d0
# cp -r /usr /tank/usr
cp: cycle detected: /usr/ccs/lib/link_audit/32
cp: cannot access /usr/lib/amd64/libdbus-1.so.2


Doing it again using just card #2 (i.e. "c4") still fails:

# zpool destroy tank
# zpool create tank raidz2 c4t0d0 c4t4d0 c4t1d0 c4t5d0
# cp -r /usr /tank/usr
cp: cycle detected: /usr/ccs/lib/link_audit/32
cp: cannot access /usr/lib/amd64/libdbus-1.so.2
WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error
WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error
WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error
WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error
WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error
WARNING: marvell88sx1: error on port 1: ATA UDMA data parity error

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x478f6148.0x376ebd4b (0xbf8f86652d)
PLATFORM: i86pc, CSN: -, HOSTNAME: san
SOURCE: SunOS, REV: 5.11 snv_78
DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry IMPACT: The system will sync files, save a crash dump if needed, and reboot REC-ACTION: Save the error summary below in case telemetry cannot be saved panic[cpu3]/thread=ffffff000f7bcc80: pcie_pci-0: PCI(-X) Express Fatal Error ffffff000f7bcbc0 pcie_pci:pepb_err_msi_intr+d2 () ffffff000f7bcc20 unix:av_dispatch_autovect+78 () ffffff000f7bcc60 unix:dispatch_hardint+2f () ffffff000f786ac0 unix:switch_sp_and_call+13 () ffffff000f786b10 unix:do_interrupt+a0 () ffffff000f786b20 unix:cmnint+ba () ffffff000f786c10 unix:mach_cpu_idle+b () ffffff000f786c40 unix:cpu_idle+c8 () ffffff000f786c60 unix:idle+10e () ffffff000f786c70 unix:thread_start+8 () syncing file systems... done ereport.io.pciex.rc.fe-msg ena=bf8f828ea700c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a" ] rc-status=800007c source-id=200 source-valid=1 ereport.io.pciex.rc.mue-msg ena=bf8f828ea700c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a" ] rc-status=800007c ereport.io.pci.sec-rserr ena=bf8f828ea700c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a" ] pci-sec-status=6000 pci-bdg-ctrl=3 ereport.io.pci.sec-ma ena=bf8f828ea700c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a" ] pci-sec-status=6000 pci-bdg-ctrl=3 ereport.io.pciex.bdg.sec-perr ena=bf8f828ea700c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0" ] sue-status=1800 source-id=200 source-valid=1 ereport.io.pciex.bdg.sec-serr ena=bf8f828ea700c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0" ] sue-status=1800 ereport.io.pci.sec-rserr ena=bf8f828ea700c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0" ] pci-sec-status=6420 pci-bdg-ctrl=7 dumping to /dev/dsk/c2t0d0s1, offset 215547904, content: kernel NOTICE: /pci at 
0,0/pci15d9,1611 at 5: port 0: device reset
100% done:


And doing it again using just card #3 (i.e. "c5") works!

# zpool destroy tank
cannot open 'tank': no such pool
(interesting)
# zpool create tank raidz2 c5t0d0 c5t4d0 c5t1d0 c5t5d0
# cp -r /usr /tank/usr


And doing it again using cards #1 and #3 (i.e. "c3" and "c5") fails!

# zpool destroy tank
# zpool create tank raidz2 c3t0d0 c3t4d0 c3t1d0 c3t5d0 raidz2 c5t0d0 c5t4d0 c5t1d0 c5t5d0
# cp -r /usr /tank/usr
cp: cycle detected: /usr/ccs/lib/link_audit/32
cp: cannot access /usr/lib/amd64/libdbus-1.so.2
WARNING: marvell88sx2: error on port 4: ATA UDMA data parity error
WARNING: marvell88sx2: error on port 4: ATA UDMA data parity error
WARNING: marvell88sx2: error on port 4: ATA UDMA data parity error
WARNING: marvell88sx2: error on port 4: ATA UDMA data parity error
WARNING: marvell88sx2: error on port 4: ATA UDMA data parity error
WARNING: marvell88sx2: error on port 4:

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x478f6307.0x20c8668b (0x643e118fd4)
PLATFORM: i86pc, CSN: -, HOSTNAME: san
SOURCE: SunOS, REV: 5.11 snv_78
DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry IMPACT: The system will sync files, save a crash dump if needed, and reboot REC-ACTION: Save the error summary below in case telemetry cannot be saved panic[cpu3]/thread=ffffff000f7c2c80: pcie_pci-0: PCI(-X) Express Fatal Error ffffff000f7c2bc0 pcie_pci:pepb_err_msi_intr+d2 () ffffff000f7c2c20 unix:av_dispatch_autovect+78 () ffffff000f7c2c60 unix:dispatch_hardint+2f () ffffff000f78cac0 unix:switch_sp_and_call+13 () ffffff000f78cb10 unix:do_interrupt+a0 () ffffff000f78cb20 unix:cmnint+ba () ffffff000f78cc10 unix:mach_cpu_idle+b () ffffff000f78cc40 unix:cpu_idle+c8 () ffffff000f78cc60 unix:idle+10e () ffffff000f78cc70 unix:thread_start+8 () syncing file systems... done ereport.io.pciex.rc.fe-msg ena=643e0d446400c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a" ] rc-status=800007c source-id=201 source-valid=1 ereport.io.pciex.rc.mue-msg ena=643e0d446400c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a" ] rc-status=800007c ereport.io.pci.sec-rserr ena=643e0d446400c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a" ] pci-sec-status=6000 pci-bdg-ctrl=3 ereport.io.pci.sec-ma ena=643e0d446400c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a" ] pci-sec-status=6000 pci-bdg-ctrl=3 ereport.io.pciex.bdg.sec-perr ena=643e0d446400c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1" ] sue-status=1800 source-id=201 source-valid=1 ereport.io.pciex.bdg.sec-serr ena=643e0d446400c01 detector=[ version=0 scheme "dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1" ] sue-status=1800 ereport.io.pci.sec-rserr ena=643e0d446400c01 detector=[ version=0 scheme="dev" device-path="/pci at 0,0/pci10de,376 at a/pci1033,125 at 0,1" ] pci-sec-status=6420 pci-bdg-ctrl=7 dumping to /dev/dsk/c2t0d0s1, offset 215547904, content: kernel NOTICE: /pci 
at 0,0/pci15d9,1611 at 5: port 0: device reset 100% done: 178114 pages dumped, compression ratio 2.44, dump succeeded rebooting...
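The reasoning behind the four tests above can be checked mechanically. The sketch below assumes a deliberately simple model that is not from the original post: a "faulty card" makes every test that uses it fail, and a test with no faulty card passes. Under that assumption it enumerates every possible set of faulty cards and keeps the ones that match the reported outcomes. The names `RESULTS`, `consistent`, and `explanations` are made up for illustration.

```python
from itertools import combinations

CARDS = ("c3", "c4", "c5")

# Outcomes as reported in the message above:
# True means the copy completed, False means the system panicked.
RESULTS = {
    frozenset({"c3"}): True,
    frozenset({"c4"}): False,
    frozenset({"c5"}): True,
    frozenset({"c3", "c5"}): False,
}


def consistent(faulty, results):
    """Return True if 'a test fails exactly when it uses a faulty card'
    reproduces every observed outcome."""
    return all(
        (not (used & faulty)) == passed
        for used, passed in results.items()
    )


def explanations(cards, results):
    """List every set of faulty cards that is consistent with the results."""
    found = []
    for size in range(len(cards) + 1):
        for combo in combinations(cards, size):
            if consistent(frozenset(combo), results):
                found.append(set(combo))
    return found


if __name__ == "__main__":
    print(explanations(CARDS, RESULTS))
```

Under this model no combination of bad cards fits: c3 and c5 each pass alone, so neither can be faulty, yet the pool built from both fails. That suggests looking at something the cards share (slot, bridge, board, power) or at load, rather than at any one card.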
Looks like flaky or broken hardware to me. It could be a power supply issue, those tend to rear their ugly head when workloads get heavy and they are usually the easiest to replace.
 -- richard

Kent Watsen wrote:
> Below I create zpools isolating one card at a time
> - when just card#1 - it works
> - when just card #2 - it fails
> - when just card #3 - it works
>
> And then again using the two cards that seem to work:
> - when cards #1 and #3 - it fails

.... snip .....
On Thu, 17 Jan 2008, Richard Elling wrote:

> Looks like flaky or broken hardware to me. It could be a
> power supply issue, those tend to rear their ugly head when
> workloads get heavy and they are usually the easiest to
> replace.

+1 PSU or memory (run memtestx86)

> -- richard
>
> Kent Watsen wrote:
>> Below I create zpools isolating one card at a time
>> - when just card#1 - it works
>> - when just card #2 - it fails
>> - when just card #3 - it works
>>
>> And then again using the two cards that seem to work:
>> - when cards #1 and #3 - it fails

.... snip .....

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Thanks Richard and Al,

I'll refrain from expressing how disturbing this is, as I'm trying to help the Internet be kid-safe ;)

As for the PSU, I'd be very surprised if that were it, as it is a 3+1 redundant PSU that came with this system, built by a reputable integrator. Also, the PSU is plugged into a high-end APC UPS, drawing just 25% of its capacity. And the UPS has a dedicated 240V 30A circuit.

As for the memory, it might be - even though the same integrator installed the SIMMs and did a 24-hour burn-in test, you never know. So I'm running memtest86 now, which is 12% passed so far...

I'm going to try another hardware test, which is to switch around the backplanes my cards are plugged into. If the same backplanes are failing, then I know all my AOC-SAT2-MV8 cards are OK. Likewise, if the same backplane doesn't fail, then I know all my backplanes are OK. Either way, I'll eliminate one potential hardware issue.

But I still think that it might be software related. My first post was trying to point out some anomalies in how the devices are being named - see the highlighted parts below? - doesn't that look strange? - why would Solaris use a different naming convention for some disks?

/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@0,0
/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@1,0
/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@4,0
/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@5,0
/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@0,0
/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@1,0
/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@4,0
/pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@5,0
/pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@0,0
/pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@1,0
/pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@4,0
/pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@5,0

PS: in case you can't see it, look at the last four disks and notice how they contain a spurious ",1" and also have the same "@6" as the middle four disks

Any ideas, suggestions, condolences?

Thanks,
Kent

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Kent Watsen wrote:
> Thanks Richard and Al,
>
> I'll refrain from expressing how disturbing this is, as I'm trying to
> help the Internet be kid-safe ;)
>
> As for the PSU, I'd be very surprised if that were it, as it is a
> 3+1 redundant PSU that came with this system, built by a reputable
> integrator. Also, the PSU is plugged into a high-end APC UPS, drawing
> just 25% of its capacity. And the UPS has a dedicated 240V 30A circuit.
>
> As for the memory, it might be - even though the same integrator
> installed the SIMMs and did a 24-hour burn-in test, you never know.
> So I'm running memtest86 now, which is 12% passed so far...
>
> I'm going to try another hardware test, which is to switch around the
> backplanes my cards are plugged into. If the same backplanes are
> failing, then I know all my AOC-SAT2-MV8 cards are OK. Likewise, if
> the same backplane doesn't fail, then I know all my backplanes are
> OK. Either way, I'll eliminate one potential hardware issue.

You could also try the SunVTS system tests. They are what we use in the factory to prove systems work before being shipped to customers. They are located in /usr/sunvts, where the READMEs, man pages, and binaries live. There are a zillion options, but I highly recommend the readonly disk tests for your case.

> But I still think that it might be software related. My first post
> was trying to point out some anomalies in how the devices are being
> named - see the highlighted parts below? - doesn't that look strange?
> - why would Solaris use a different naming convention for some disks?
>
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@0,0
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@1,0
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@4,0
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@4/disk@5,0
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@0,0
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@1,0
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@4,0
> /pci@0,0/pci10de,376@a/pci1033,125@0/pci11ab,11ab@6/disk@5,0
> /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@0,0
> /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@1,0
> /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@4,0
> /pci@0,0/pci10de,376@a/pci1033,125@0,1/pci11ab,11ab@6/disk@5,0
>
> PS: in case you can't see it, look at the last four disks and
> notice how they contain a spurious ",1" and also have the same
> "@6" as the middle four disks

Looks reasonable to me. These are just PCI device identifiers, certainly nothing to be worried about.
 -- richard
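To make the "just PCI device identifiers" point concrete, here is a small sketch that splits each path into its nodes and groups the disks by the chain of PCI unit addresses above them. It rests on one assumption that should be checked against the Solaris documentation: that a PCI node's unit address is written `device[,function]` in hex, so `pci1033,125@0,1` would be function 1 of the same bridge device as `pci1033,125@0`. The function names are made up for this example.

```python
from collections import defaultdict

# The twelve disk paths quoted above.
PATHS = [
    f"/pci@0,0/pci10de,376@a/pci1033,125@{bridge}/pci11ab,11ab@{ctrl}/disk@{tgt},0"
    for bridge, ctrl in (("0", "4"), ("0", "6"), ("0,1", "6"))
    for tgt in (0, 1, 4, 5)
]


def parse_node(component):
    """Split 'pci1033,125@0,1' into ('pci1033,125', '0,1')."""
    name, _, unit = component.partition("@")
    return name, unit


def parse_pci_unit(unit):
    """Read a unit address 'D[,F]' as (device, function), both hex.
    Assumption: a missing function means function 0."""
    dev, _, func = unit.partition(",")
    return int(dev, 16), (int(func, 16) if func else 0)


def controller_key(path):
    """Return the (device, function) chain of every node above the disk."""
    nodes = [parse_node(c) for c in path.strip("/").split("/")]
    return tuple(parse_pci_unit(unit) for _name, unit in nodes[:-1])


def group_by_controller(paths):
    """Group disk paths by the PCI chain that leads to their controller."""
    groups = defaultdict(list)
    for path in paths:
        groups[controller_key(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    for key, disks in group_by_controller(PATHS).items():
        print(key, len(disks), "disks")
```

Read this way, the twelve disks fall into three groups of four, one per controller: two controllers (devices 4 and 6) sit behind bridge function 0 and the third (device 6) sits behind bridge function 1. The repeated "@6" is then a device number on a different secondary bus, not a naming clash.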
Definitely a hardware problem (possibly compounded by a bug).

Some key phrases and routines:

  ATA UDMA data parity error

This one actually looks like a misnomer. At least, I'd normally expect "data parity error" not to crash the system! (It should result in a retry or EIO.)

  PCI(-X) Express Fatal Error

This one's more of an issue -- it indicates that the PCI Express bus had an error.

  pcie_pci:pepb_err_msi_intr

This indicates an error on the PCI bus which has been reflected through to the PCI Express bus. There should be more detail, but it's hard to figure it out from what's below. (The report is showing multiple errors, including both parity errors & system errors, which seems unlikely unless there's a hardware design flaw or a software bug.)

Others have suggested the power supply or memory, but in my experience these types of errors are more often due to a faulty system backplane or card (and occasionally a bad bridge chip).


This message posted from opensolaris.org
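One way to dig the "more detail" out of a panic's ereport lines is to pull out each report's class, detector device path, and source-id, then look at the deepest path that reported an error. The sketch below is an illustration only. It assumes the archive's " at " stands for "@" and that `scheme "dev"` was originally `scheme="dev"`. It also assumes, without confirmation from this thread, that `source-id` is a hex PCI Express requester ID (8-bit bus, 5-bit device, 3-bit function). All function names are hypothetical.

```python
import re

# One match per ereport in a flattened console dump.
EREPORT_RE = re.compile(
    r"(?P<cls>ereport\.[\w.\-]+)\s+ena=(?P<ena>[0-9a-f]+).*?"
    r'device-path="(?P<path>[^"]+)"(?P<rest>.*?)(?=ereport\.|$)',
    re.DOTALL,
)


def parse_ereports(text):
    """Return a list of dicts with class, device path, and source-id (or None)."""
    reports = []
    for m in EREPORT_RE.finditer(text):
        src = re.search(r"source-id=([0-9a-f]+)", m["rest"])
        reports.append({
            "class": m["cls"],
            "path": m["path"],
            "source_id": src.group(1) if src else None,
        })
    return reports


def decode_source_id(text):
    """Assumed layout: hex requester ID -> (bus, device, function)."""
    value = int(text, 16)
    return (value >> 8) & 0xFF, (value >> 3) & 0x1F, value & 0x7


# Shortened sample in the style of the c3+c5 panic, with '@' restored.
SAMPLE = (
    'ereport.io.pciex.rc.fe-msg ena=643e0d446400c01 detector=[ version=0 '
    'scheme="dev" device-path="/pci@0,0/pci10de,376@a" ] rc-status=800007c '
    'source-id=201 source-valid=1 '
    'ereport.io.pciex.bdg.sec-perr ena=643e0d446400c01 detector=[ version=0 '
    'scheme="dev" device-path="/pci@0,0/pci10de,376@a/pci1033,125@0,1" ] '
    'sue-status=1800 source-id=201 source-valid=1 '
    'ereport.io.pci.sec-rserr ena=643e0d446400c01 detector=[ version=0 '
    'scheme="dev" device-path="/pci@0,0/pci10de,376@a/pci1033,125@0,1" ] '
    'pci-sec-status=6420 pci-bdg-ctrl=7'
)

if __name__ == "__main__":
    reports = parse_ereports(SAMPLE)
    deepest = max(reports, key=lambda r: r["path"].count("/"))
    print(deepest["class"], deepest["path"])
    print(decode_source_id(reports[0]["source_id"]))
```

If the requester-ID reading is right, source-id 200 versus 201 in the two panics would line up with bridge function 0 versus function 1 in the detector paths, which would fit the view that the errors originate on the PCI-X side of the bridge and are reflected upward.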
Thanks for the note, Anton.

I let memtest86 run overnight and it found no issues. I've also now moved the cards around and have confirmed that slot #3 on the mobo is bad (all my aoc-sat2-mv8 cards, cables, and backplanes are OK).

However, I think it's more than just slot #3 that has a fault, because when I have all three cards plugged into mobo slots other than #3, they all work fine individually, but when I run the exact same per-card tests in parallel, the system crashes.

I'm now going to have the system integrator that built my system send me a new mobo (ugh!)

Thanks again,
Kent

Anton B. Rang wrote:
> Definitely a hardware problem (possibly compounded by a bug).

.... snip .....
For the archive, I swapped the mobo and all is good now... (I copied 100GB into the pool without a crash)

One problem I had was that Solaris would hang whenever booting - even when all the aoc-sat2-mv8 cards were pulled out. Turns out that switching the BIOS field "USB 2.0 Controller Mode" from "HiSpeed" to "FullSpeed" makes the difference - any ideas why?

Thanks,
Kent