thr3ads.net - zfs discuss - [zfs-discuss] Panic while scrubbing [Oct 2006]

Siegfried Nikolaivich

2006-Oct-25 03:58 UTC

[zfs-discuss] Panic while scrubbing

Hello,

I am not sure if I am posting in the correct forum, but it seems somewhat
ZFS-related, so I thought I'd share it.

While the machine was idle, I started a scrub.  Around the time the scrubbing
was supposed to be finished, the machine panicked.

This might be related to the 'metadata corruption' that
happened to me earlier.  Here is the log; any ideas?


Oct 24 20:13:51 FServe unix: [ID 836849 kern.notice]
Oct 24 20:13:51 FServe ^Mpanic[cpu0]/thread=fffffe8000311c80:
Oct 24 20:13:51 FServe genunix: [ID 683410 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=fffffe80003119c0 addr=fffffe00e24c6218
Oct 24 20:13:51 FServe unix: [ID 100000 kern.notice]
Oct 24 20:13:51 FServe unix: [ID 839527 kern.notice] sched:
Oct 24 20:13:51 FServe unix: [ID 753105 kern.notice] #pf Page fault
Oct 24 20:13:51 FServe unix: [ID 532287 kern.notice] Bad kernel fault at addr=0xfffffe00e24c6218
Oct 24 20:13:51 FServe unix: [ID 243837 kern.notice] pid=0, pc=0xfffffffffb92c360, sp=0xfffffe8000311ab0, eflags=0x10282
Oct 24 20:13:51 FServe unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
Oct 24 20:13:51 FServe unix: [ID 354241 kern.notice] cr2: fffffe00e24c6218 cr3: a22b000 cr8: c
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]    rdi: ffffffff84233e88 rsi: fffffe00e24c6208 rdx: 3fffff8038931883
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]    rcx:                0  r8:                1  r9:         ffffffff
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]    rax:                2 rbx: fffffe80eb90f7c0 rbp: fffffe8000311ab0
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]    r10: ffffffffa5de7488 r11:                1 r12: ffffffff84233e88
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]    r13:            20000 r14: fffffe80eb90f7c0 r15: ffffffff84233dd8
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]    fsb: ffffffff80000000 gsb: fffffffffbc24060  ds:               43
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]     es:               43  fs:                0  gs:              1c3
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]    trp:                e err:                0 rip: fffffffffb92c360
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]     cs:               28 rfl:            10282 rsp: fffffe8000311ab0
Oct 24 20:13:51 FServe unix: [ID 266532 kern.notice]     ss:               30
Oct 24 20:13:51 FServe unix: [ID 100000 kern.notice]
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe80003118d0 unix:real_mode_end+58d1 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe80003119b0 unix:trap+d77 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe80003119c0 unix:_cmntrap+13f ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ab0 genunix:avl_insert+60 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ae0 genunix:avl_add+33 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311b60 zfs:vdev_queue_io_to_issue+1ec ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ba0 zfs:zfsctl_ops_root+33c6e7a1 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311bc0 zfs:vdev_disk_io_done+11 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311bd0 zfs:vdev_io_done+12 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311be0 zfs:zio_vdev_io_done+1b ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311c60 genunix:taskq_thread+bc ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311c70 unix:thread_start+8 ()
Oct 24 20:13:51 FServe unix: [ID 100000 kern.notice]
Oct 24 20:13:51 FServe genunix: [ID 672855 kern.notice] syncing file systems...
Oct 24 20:13:51 FServe genunix: [ID 904073 kern.notice]  done
Oct 24 20:13:52 FServe genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c0t3d0s1, offset 860356608, content: kernel
Oct 24 20:13:52 FServe marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]       device disconnected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]       device connected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]       SError interrupt
Oct 24 20:13:52 FServe marvell88sx: [ID 131198 kern.info]       SErrors:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               Recovered communication error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               PHY ready change
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               10-bit to 8-bit decode error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               Disparity error
Oct 24 20:13:57 FServe genunix: [ID 409368 kern.notice] ^M100% done: 150751 pages dumped, compression ratio 4.23,
Oct 24 20:13:57 FServe genunix: [ID 851671 kern.notice] dump succeeded
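
The call chain is the most useful part of a panic log like this one, and it can be pulled out mechanically. The following is a minimal Python sketch, not a standard tool: the function name `stack_frames` is made up for illustration, and the pattern assumes that stack frames are always logged under message ID 655072 in the `frame-pointer module:function+offset ()` shape seen above. It also tolerates a frame that a mail client has wrapped onto a second line.

```python
import re

# One panic stack frame as it appears in the log above, e.g.
#   "[ID 655072 kern.notice] fffffe8000311ab0 genunix:avl_insert+60 ()"
# \s+ matches a space or a newline, so wrapped frames are accepted too.
FRAME_RE = re.compile(
    r"\[ID 655072 kern\.notice\]\s+([0-9a-f]+)\s+(\w+):(\w+)\+([0-9a-f]+) \(\)"
)


def stack_frames(log_text):
    """Return (module, function, offset) tuples in the order they were logged."""
    return [
        (module, function, int(offset, 16))
        for _frame_pointer, module, function, offset in FRAME_RE.findall(log_text)
    ]


# Three frames copied from the log above; the last one is deliberately wrapped.
sample = """\
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ab0 genunix:avl_insert+60 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311ae0 genunix:avl_add+33 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fffffe8000311b60
zfs:vdev_queue_io_to_issue+1ec ()
"""

frames = stack_frames(sample)
for module, function, offset in frames:
    print(f"{module}:{function}+0x{offset:x}")
```

Run over the full log, this lists the faulting path from `avl_insert` up through `vdev_queue_io_to_issue` to `taskq_thread` without the syslog prefixes.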


Thanks,
Siegfried
 
 
This message posted from opensolaris.org

James McPherson

2006-Oct-25 04:11 UTC

On 10/25/06, Siegfried Nikolaivich <exitware at gmail.com> wrote:
...
> While the machine was idle, I started a scrub.  Around the time the
> scrubbing was supposed to be finished, the machine panicked.
> This might be related to the 'metadata corruption' that
> happened earlier to me.  Here is the log, any ideas?
...
> Oct 24 20:13:52 FServe marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3:
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]       device disconnected
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]       device connected
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]       SError interrupt
> Oct 24 20:13:52 FServe marvell88sx: [ID 131198 kern.info]       SErrors:
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               Recovered communication error
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               PHY ready change
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               10-bit to 8-bit decode error
> Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]               Disparity error

Hi Siegfried,
this error from the marvell88sx driver is of concern. The 10b8b decode
and disparity error messages make me think that you have a bad piece
of hardware. I hope it's not your controller, but I can't tell without
more data. You should have a look at the iostat -En output for the device
on marvell88sx instance #0, attached as port 3. If there are any error
counts above 0 then - after checking /var/adm/messages for medium
errors - you should probably replace the disk.

However, don't discount the possibility that the controller and/or the
cable is at fault.

cheers,
James
--
Solaris kernel software engineer, system admin and troubleshooter
              http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson

Siegfried Nikolaivich

2006-Oct-25 04:41 UTC

On 24-Oct-06, at 9:11 PM, James McPherson wrote:
> this error from the marvell88sx driver is of concern. The 10b8b decode
> and disparity error messages make me think that you have a bad piece
> of hardware. I hope it's not your controller, but I can't tell without
> more data. You should have a look at the iostat -En output for the device
> on marvell88sx instance #0, attached as port 3. If there are any error
> counts above 0 then - after checking /var/adm/messages for medium
> errors - you should probably replace the disk.
>
I have just tried to do a 'zpool scrub' and I got the same result: a
panic right when the scrub finishes (no errors found during or after the
panic).  So I guess this problem is reproducible (and might not be an
intermittent hardware malfunction).

It is funny that I get the marvell88sx driver error for port 3, as that is
the Solaris UFS drive, whereas the rest of the ports are set up for
ZFS.  Since the scrub seems to be causing the panic, I don't see why
an error on the root drive would be the root cause.

Note that this error appears in the log only after the system starts
writing the panic dump: "genunix: [ID 111219 kern.notice] dumping to
/dev/dsk/c0t3d0s1, offset 860356608, content: kernel"


By the way, this is what iostat -En shows for port 3:
c0t3d0           Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3320620AS      Revision: C    Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0


And this is shown on the rest of the ports:
c0t?d0           Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3320620AS      Revision: C    Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 6 Predictive Failure Analysis: 0


Thanks,
Siegfried

Siegfried Nikolaivich

2006-Oct-25 05:09 UTC

On 24-Oct-06, at 9:47 PM, James McPherson wrote:
> On 10/25/06, Siegfried Nikolaivich <exitware at gmail.com> wrote:
>> And this is shown on the rest of the ports:
>> c0t?d0           Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
>> Vendor: ATA      Product: ST3320620AS      Revision: C    Serial No:
>> Size: 320.07GB <320072932864 bytes>
>> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
>> Illegal Request: 6 Predictive Failure Analysis: 0
>
> Hmm. All your disks attached to the same controller and showing
> entries in the Illegal Request field ..... what's the common component
> between them - the cable?

I guess the common component between them is the power supply.  Each
drive has its own SATA cable connected directly to the controller.

> Could you look through your msgbuf and/or /var/adm/messages and
> find the full text of when these Illegal Request errors were
> logged. That will give an idea of where to look next.

That is the part I can't figure out.  Nowhere does it say "Illegal
Request" except when I run iostat -nE.

I found out that the "Illegal Request" count can be incremented on  
the ZFS drives by starting a scrub.

For example:
# iostat -nE
...
c0t2d0           Soft Errors: 8 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3320620AS      Revision: C    Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 8 Predictive Failure Analysis: 0
c0t3d0           Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3320620AS      Revision: C    Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
...

# zpool scrub tank

# iostat -nE
...
c0t2d0           Soft Errors: 9 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3320620AS      Revision: C    Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 9 Predictive Failure Analysis: 0
c0t3d0           Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3320620AS      Revision: C    Serial No:
Size: 320.07GB <320072932864 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
...

# zpool scrub -s tank
(no panic at this point)

Happens every time.
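
Comparing two `iostat -nE` snapshots by eye gets tedious with several disks, so the before/after check can be scripted. This is a minimal Python sketch under stated assumptions: the helper names `parse_iostat_en` and `changed_counters` are hypothetical, the counter labels are taken only from the output shown in this thread, and the samples are trimmed to the counter lines.

```python
import re

# Counter labels exactly as they appear in the iostat output in this thread.
COUNTERS = (
    "Soft Errors", "Hard Errors", "Transport Errors", "Media Error",
    "Device Not Ready", "No Device", "Recoverable", "Illegal Request",
    "Predictive Failure Analysis",
)
COUNTER_RE = re.compile(r"(%s):\s*(\d+)" % "|".join(COUNTERS))
# A device block starts with the device name followed by "Soft Errors:".
DEVICE_RE = re.compile(r"^(\S+)\s+Soft Errors:")


def parse_iostat_en(text):
    """Map each device name to a dict of its error counters."""
    devices, current = {}, None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = devices.setdefault(match.group(1), {})
        if current is not None:
            current.update((name, int(value)) for name, value in COUNTER_RE.findall(line))
    return devices


def changed_counters(before, after):
    """Return {device: {counter: (old, new)}} for counters whose value changed."""
    changes = {}
    for device, counters in after.items():
        old = before.get(device, {})
        delta = {k: (old.get(k, 0), v) for k, v in counters.items() if v != old.get(k, 0)}
        if delta:
            changes[device] = delta
    return changes


BEFORE = """\
c0t2d0           Soft Errors: 8 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 8 Predictive Failure Analysis: 0
c0t3d0           Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
"""
AFTER = BEFORE.replace("Soft Errors: 8", "Soft Errors: 9").replace(
    "Illegal Request: 8", "Illegal Request: 9"
)

changes = changed_counters(parse_iostat_en(BEFORE), parse_iostat_en(AFTER))
for device, delta in changes.items():
    for counter, (old, new) in delta.items():
        print(f"{device}: {counter} {old} -> {new}")
```

On these two samples it reports only c0t2d0, whose Soft Errors and Illegal Request counters each go from 8 to 9, while the UFS disk c0t3d0 is unchanged.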



Thanks,
Siegfried

Siegfried Nikolaivich

2006-Oct-26 02:46 UTC

On 24-Oct-06, at 9:47 PM, James McPherson wrote:
> Could you look through your msgbuf and/or /var/adm/messages and
> find the full text of when these Illegal Request errors were
> logged. That will give an idea of where to look next.

OK, it doesn't look like it's the controller. I ran some tests and it
functions just as well as it used to.

I have no idea why it keeps panicking during the scrub... it doesn't
seem hardware related.


Cheers,
Siegfried
