Hello,

I am not sure if I am posting in the correct forum, but it seems somewhat zfs related, so I thought I'd share it.

While the machine was idle, I started a scrub. Around the time the scrubbing was supposed to be finished, the machine panicked. This might be related to the 'metadata corruption' that happened earlier to me. Here is the log, any ideas?

Oct 24 20:13:51 FServe unix: [ID 836849 kern.notice]
Oct 24 20:13:51 FServe ^Mpanic[cpu0]/thread=fffffe8000311c80:
Oct 24 20:13:51 FServe genunix: [ID 683410 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=fffffe80003119c0 addr=fffffe00e24c6218
Oct 24 20:13:51 FServe unix: [ID 100000 kern.notice]
Oct 24 20:13:51 FServe unix: [ID 839527 kern.notice] sched:
Oct 24 20:13:51 FServe unix: [ID 753105 kern.notice] #pf Page fault
Oct 24 20:13:51 FServe unix: [ID 532287 kern.notice] Bad kernel fault at addr=0xfffffe00e24c6218
Oct 24 20:13:51 FServe unix: [ID 243837 kern.notice] pid=0, pc=0xfffffffffb92c360, sp=0xfffffe8000311ab0, eflags=0x10282
Oct 24 20:13:51 FServe unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
Oct 24 20:13:51 FServe unix: [ID 354241 kern.notice] cr2: fffffe00e24c6218 cr3: a22b000 cr8: c
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] rdi: ffffffff84233e88 rsi: fffffe00e24c6208 rdx: 3fffff8038931883
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] rcx: 0 r8: 1 r9: ffffffff
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] rax: 2 rbx: fffffe80eb90f7c0 rbp: fffffe8000311ab0
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] r10: ffffffffa5de7488 r11: 1 r12: ffffffff84233e88
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] r13: 20000 r14: fffffe80eb90f7c0 r15: ffffffff84233dd8
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] fsb: ffffffff80000000 gsb: fffffffffbc24060 ds: 43
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] es: 43 fs: 0 gs: 1c3
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] trp: e err: 0 rip: fffffffffb92c360
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] cs: 28 rfl: 10282 rsp: fffffe8000311ab0
Oct 24 20:13:51 FServe unix: [ID 266532 kern.notice] ss: 30
Oct 24 20:13:51 FServe unix: [ID 100000 kern.notice]
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe80003118d0 unix:real_mode_end+58d1 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe80003119b0 unix:trap+d77 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe80003119c0 unix:_cmntrap+13f ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ab0 genunix:avl_insert+60 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ae0 genunix:avl_add+33 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311b60 zfs:vdev_queue_io_to_issue+1ec ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ba0 zfs:zfsctl_ops_root+33c6e7a1 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311bc0 zfs:vdev_disk_io_done+11 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311bd0 zfs:vdev_io_done+12 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311be0 zfs:zio_vdev_io_done+1b ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311c60 genunix:taskq_thread+bc ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311c70 unix:thread_start+8 ()
Oct 24 20:13:51 FServe unix: [ID 100000 kern.notice]
Oct 24 20:13:51 FServe genunix: [ID 672855 kern.notice] syncing file systems...
Oct 24 20:13:51 FServe genunix: [ID 904073 kern.notice] done
Oct 24 20:13:52 FServe genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c0t3d0s1, offset 860356608, content: kernel
Oct 24 20:13:52 FServe marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] device disconnected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] device connected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] SError interrupt
Oct 24 20:13:52 FServe marvell88sx: [ID 131198 kern.info] SErrors:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] Recovered communication error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] PHY ready change
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] Disparity error
Oct 24 20:13:57 FServe genunix: [ID 409368 kern.notice] ^M100% done: 150751 pages dumped, compression ratio 4.23,
Oct 24 20:13:57 FServe genunix: [ID 851671 kern.notice] dump succeeded

Thanks,
Siegfried

This message posted from opensolaris.org

On 10/25/06, Siegfried Nikolaivich <exitware at gmail.com> wrote:
...
> While the machine was idle, I started a scrub. Around the time the scrubbing was supposed to be finished, the machine panicked.
> This might be related to the 'metadata corruption' that happened earlier to me. Here is the log, any ideas?
...
> Oct 24 20:13:52 FServe marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3:
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] device disconnected
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] device connected
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] SError interrupt
> Oct 24 20:13:52 FServe marvell88sx: [ID 131198 kern.info] SErrors:
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] Recovered communication error
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] PHY ready change
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info] Disparity error

Hi Siegfried,

this error from the marvell88sx driver is of concern. The 10b8b decode and disparity error messages make me think that you have a bad piece of hardware. I hope it's not your controller, but I can't tell without more data. You should have a look at the iostat -En output for the device on marvell88sx instance #0, attached as port 3. If there are any error counts above 0 then - after checking /var/adm/messages for medium errors - you should probably replace the disk.

However, don't discount the possibility that the controller and/or the cable is at fault.

cheers,
James
--
Solaris kernel software engineer, system admin and troubleshooter
http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson

On 24-Oct-06, at 9:11 PM, James McPherson wrote:
> this error from the marvell88sx driver is of concern, The 10b8b decode
> and disparity error messages make me think that you have a bad piece
> of hardware. I hope it's not your controller but I can't tell without more
> data. You should have a look at the iostat -En output for the device
> on marvell88sx instance #0, attached as port 3. If there are any error
> counts above 0 then - after checking /var/adm/messages for medium
> errors - you should probably replace the disk.
>

I have just tried another 'zpool scrub' and got the same result: a panic right when the scrub finishes (no errors reported during or after the panic). So I guess this problem is reproducible, and might not be an intermittent hardware malfunction.

It is odd that I get the marvell88sx driver error for port 3, as that is the Solaris UFS drive, whereas the rest of the ports are set up for ZFS. Since the scrub seems to be causing the panic, I don't see why an error on the root drive would be the root cause. Note that this error appears in the log only after the system starts writing the panic dump:

"genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c0t3d0s1, offset 860356608, content: kernel"

By the way, this is what iostat -En shows for port 3:

c0t3d0 Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST3320620AS Revision: C Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0

And this is shown on the rest of the ports:

c0t?d0 Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST3320620AS Revision: C Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 6 Predictive Failure Analysis: 0

Thanks,
Siegfried

On 24-Oct-06, at 9:47 PM, James McPherson wrote:
> On 10/25/06, Siegfried Nikolaivich <exitware at gmail.com> wrote:
>> And this is shown on the rest of the ports:
>> c0t?d0 Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
>> Vendor: ATA Product: ST3320620AS Revision: C Serial No:
>> Size: 320.07GB <320072932864 bytes>
>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>> Illegal Request: 6 Predictive Failure Analysis: 0
>
> Hmm. All your disks attached to the same controller and showing
> entries in the Illegal Request field ..... what's the common component
> between them - the cable?

I guess the common component between them is the power supply. Each drive has its own SATA cable connected directly to the controller.

> Could you look through your msgbuf and/or /var/adm/messages and
> find the full text of when these Illegal Request errors were
> logged. That
> will give an idea of where to look next.

That is the part I can't figure out. Nowhere does it say "Illegal Request" except when I run iostat -nE.

I found out that the "Illegal Request" count can be incremented on the ZFS drives by starting a scrub. For example:

# iostat -nE
...
c0t2d0 Soft Errors: 8 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST3320620AS Revision: C Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 8 Predictive Failure Analysis: 0
c0t3d0 Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST3320620AS Revision: C Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
...
# zpool scrub tank
# iostat -nE
...
c0t2d0 Soft Errors: 9 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST3320620AS Revision: C Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 9 Predictive Failure Analysis: 0
c0t3d0 Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: ST3320620AS Revision: C Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
...
# zpool scrub -s tank
(no panic at this point)

Happens every time.

Thanks,
Siegfried
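
[Editorial aside, not part of the original thread.] The before/after comparison in the post above can be automated. What follows is only a minimal Python sketch: the function names parse_iostat_en and diff_counters are hypothetical, and the list of counter names is taken solely from the iostat output quoted in this thread, so other Solaris releases or device types may print fields it does not know about. It assumes the two iostat runs were saved as text and does not invoke iostat itself.

```python
import re

# Counter names exactly as they appear in the iostat output quoted in this
# thread (assumption: other systems may print additional or different fields).
COUNTERS = [
    "Soft Errors", "Hard Errors", "Transport Errors",
    "Media Error", "Device Not Ready", "No Device", "Recoverable",
    "Illegal Request", "Predictive Failure Analysis",
]

# A device block starts with a name such as "c0t2d0" followed by "Soft Errors:".
DEVICE_RE = re.compile(r"\b(c\d+t\d+d\d+)\s+(?=Soft Errors:)")
# "<counter name>: <integer>" pairs, restricted to the known names above.
COUNTER_RE = re.compile(
    r"(" + "|".join(re.escape(name) for name in COUNTERS) + r"):\s*(\d+)"
)


def parse_iostat_en(text):
    """Return {device: {counter: value}} parsed from saved iostat -En text."""
    headers = list(DEVICE_RE.finditer(text))
    devices = {}
    for i, header in enumerate(headers):
        # A block runs from this device name to the next one (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        block = text[header.end():end]
        devices[header.group(1)] = {
            name: int(value) for name, value in COUNTER_RE.findall(block)
        }
    return devices


def diff_counters(before, after):
    """Return {device: {counter: delta}} for counters that changed."""
    changes = {}
    for device, counters in after.items():
        old = before.get(device, {})
        delta = {
            name: value - old.get(name, 0)
            for name, value in counters.items()
            if value != old.get(name, 0)
        }
        if delta:
            changes[device] = delta
    return changes
```

Feeding it the two listings above (before and after 'zpool scrub tank') should report a change only for c0t2d0, in the Soft Errors and Illegal Request counters, which matches what Siegfried observed by eye. Splitting on device names rather than on line breaks means the parser also copes with output that has been flattened onto one line.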

On 24-Oct-06, at 9:47 PM, James McPherson wrote:
> Could you look through your msgbuf and/or /var/adm/messages and
> find the full text of when these Illegal Request errors were
> logged. That
> will give an idea of where to look next.

OK, it doesn't look like it's the controller: I ran some tests and it functions just as well as it used to. I have no idea why it keeps panicking during the scrub... it doesn't seem hardware related.

Cheers,
Siegfried