thr3ads.net - freebsd stable - kern/130330: [mpt] [panic] Panic and reboot machine MPT ... [May 2009]

Riccardo Torrini

2009-May-07 16:02 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

I just submitted a follow-up to PR kern/130330 with the same
info.  I may have found the commit that causes the crash.

Please see PR for more detailed info (and cc: this thread to me).

I narrowed down the time window of the problem by doing (a lot of)
build & install world runs on sources from 2008.07 up to now (read: last week).

Sources from 2008.07.28.17.00.00 (7.0-STABLE) work fine, but
with 2008.07.28.18.00.00 the machine starts crashing when removing
the second disk of a mirror (when the mirror is OK)
or adding the second disk to a degraded one.

Also note that the same crash happens with every 7.1
(STABLE or RELEASE) and even every 7.2-PRE I tested.

(wrapping long lines)
# cd /home/ncvs/src/sys/
# grep -R "date.*2008\.07\.28\.17" ./ | grep -v /Attic

./dev/wi/if_wi.c,v:
        date    2008.07.28.17.00.37;    author imp;     state Exp;
./dev/wi/if_wivar.h,v:
        date    2008.07.28.17.00.37;    author imp;     state Exp;
./dev/mpt/mpt_raid.c,v:
        date    2008.07.28.17.10.09;    author jhb;     state Exp;
./dev/mpt/mpt_raid.c,v:
        date    2008.07.28.17.05.09;    author jhb;     state Exp;
./kern/sched_4bsd.c,v:
        date    2008.07.28.17.25.24;    author jhb;     state Exp;
./modules/et/Makefile,v:
        date    2008.07.28.17.56.37;    author antoine; state Exp;

In that time window there are only 4 changes under
src/sys/dev, and my bet is on mpt_raid.c  :-)
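
The time-window check done above with grep can also be sketched in C. This is only a minimal illustration, assuming the `date ...; author ...;` line layout shown in the grep output; the names `rcs_stamp`, `parse_rcs_date`, `stamp_key` and `in_window` are hypothetical and not part of any FreeBSD or RCS tool.

```c
#include <assert.h>
#include <stdio.h>

/* One "date ...; author ...;" line from an RCS ,v file (hypothetical type). */
struct rcs_stamp {
    int year, mon, day, hour, min, sec;
    char author[32];
};

/*
 * Parse a line such as
 *   "date    2008.07.28.17.05.09;    author jhb;     state Exp;"
 * Returns 1 on success, 0 if the line does not have that shape.
 */
static int
parse_rcs_date(const char *line, struct rcs_stamp *out)
{
    assert(line != NULL && out != NULL);
    return (sscanf(line, " date %d.%d.%d.%d.%d.%d; author %31[^;];",
        &out->year, &out->mon, &out->day,
        &out->hour, &out->min, &out->sec, out->author) == 7);
}

/* Fold the stamp into one sortable number: YYYYMMDDhhmmss. */
static long long
stamp_key(const struct rcs_stamp *s)
{
    long long k = s->year;

    k = k * 100 + s->mon;
    k = k * 100 + s->day;
    k = k * 100 + s->hour;
    k = k * 100 + s->min;
    k = k * 100 + s->sec;
    return (k);
}

/* True if the stamp falls in the half-open window [from, to). */
static int
in_window(const struct rcs_stamp *s, long long from, long long to)
{
    long long k = stamp_key(s);

    return (k >= from && k < to);
}
```

Feeding each `date` line to `parse_rcs_date()` and then calling `in_window(&s, 20080728170000LL, 20080728180000LL)` would select the same revisions as the grep pattern, without depending on the regular expression matching the hour field.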

This is the commit log extracted from cvsweb
-----8<-----
Revision 1.15.2.1:
Mon Jul 28 17:05:09 2008 UTC (9 months, 1 week ago) by jhb
Branches: RELENG_7
CVS tags: RELENG_7_1_BP
Branch point for: RELENG_7_1
Diff to: previous 1.15: preferred, colored
Changes since revision 1.15: +4 -4 lines

SVN rev 180920 on 2008-07-28 17:05:09Z by jhb

MFC: Allocate a single CCB at the start of the main loop of the RAID
monitoring kthread of the mpt(4) driver.
-----8<-----

Here is the diff:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/mpt/mpt_raid.c.diff?r1=1.15;r2=1.15.2.1


What can I do now?


-- 
Riccardo.
Network Manager @ ESAOTE S.p.A.

John Baldwin

2009-May-11 16:50 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

On Thursday 07 May 2009 11:50:12 am Riccardo Torrini wrote:
> I just submitted a follow-up to PR kern/130330 with the same
> info.  I may have found the commit that causes the crash.
> 
> Please see PR for more detailed info (and cc: this thread to me).
> 
> I narrowed down the time window of the problem by doing (a lot of)
> build & install world runs on sources from 2008.07 up to now (read: last week).
> 
> Sources from 2008.07.28.17.00.00 (7.0-STABLE) work fine, but
> with 2008.07.28.18.00.00 the machine starts crashing when removing
> the second disk of a mirror (when the mirror is OK)
> or adding the second disk to a degraded one.
> 
> Also note that the same crash happens with every 7.1
> (STABLE or RELEASE) and even every 7.2-PRE I tested.
> 
> (wrapping long lines)
> # cd /home/ncvs/src/sys/
> # grep -R "date.*2008\.07\.28\.17" ./ | grep -v /Attic
> 
> ./dev/wi/if_wi.c,v:
>         date    2008.07.28.17.00.37;    author imp;     state Exp;
> ./dev/wi/if_wivar.h,v:
>         date    2008.07.28.17.00.37;    author imp;     state Exp;
> ./dev/mpt/mpt_raid.c,v:
>         date    2008.07.28.17.10.09;    author jhb;     state Exp;
> ./dev/mpt/mpt_raid.c,v:
>         date    2008.07.28.17.05.09;    author jhb;     state Exp;
> ./kern/sched_4bsd.c,v:
>         date    2008.07.28.17.25.24;    author jhb;     state Exp;
> ./modules/et/Makefile,v:
>         date    2008.07.28.17.56.37;    author antoine; state Exp;
> 
> In that time window there are only 4 changes under
> src/sys/dev, and my bet is on mpt_raid.c  :-)
> 
> This is the commit log extracted from cvsweb
> -----8<-----
> Revision 1.15.2.1:
> Mon Jul 28 17:05:09 2008 UTC (9 months, 1 week ago) by jhb
> Branches: RELENG_7
> CVS tags: RELENG_7_1_BP
> Branch point for: RELENG_7_1
> Diff to: previous 1.15: preferred, colored
> Changes since revision 1.15: +4 -4 lines
> 
> SVN rev 180920 on 2008-07-28 17:05:09Z by jhb
> 
> MFC: Allocate a single CCB at the start of the main loop of the RAID
> monitoring kthread of the mpt(4) driver.
> -----8<-----
> 
> Here is the diff:
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/mpt/mpt_raid.c.diff?r1=1.15;r2=1.15.2.1
> 
> What can I do now?

Can you get more details on the crash, perhaps a crash dump?

-- 
John Baldwin

Riccardo Torrini

2009-May-11 16:55 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

On Mon, May 11, 2009 at 09:53:21AM -0400, John Baldwin wrote:
>> What can I do now?
> Can you get more details on the crash, perhaps a crash dump?
Anything you want, but you will need to guide me.  I was unable
to set up a serial/debug console, so I have to write things down by hand
(I followed the handbook and tried all speed/duplex pairs, but I still
get strange "graphics" characters; maybe it is the cable or the setup).

I am using a kernel with all the debug knobs known to me enabled  :-)


-- 
Riccardo.

John Baldwin

2009-May-11 18:25 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

On Monday 11 May 2009 12:55:22 pm Riccardo Torrini wrote:
> On Mon, May 11, 2009 at 09:53:21AM -0400, John Baldwin wrote:
> 
> >> What can I do now?
> 
> > Can you get more details on the crash, perhaps a crash dump?
> 
> Anything you want, but you will need to guide me.  I was unable
> to set up a serial/debug console, so I have to write things down by hand
> (I followed the handbook and tried all speed/duplex pairs, but I still
> get strange "graphics" characters; maybe it is the cable or the setup).
> 
> I am using a kernel with all the debug knobs known to me enabled  :-)

Do you have kernel crashdumps enabled and a swap partition?  If so, do you 
happen to have any files in /var/crash?

-- 
John Baldwin

John Baldwin

2009-May-20 14:21 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

On Tuesday 12 May 2009 12:10:25 pm Riccardo Torrini wrote:
> On Tue, May 12, 2009 at 11:44:20AM -0400, John Baldwin wrote:
> 
> > If you can get a stack trace, that would be most helpful.
> > My guess is that the recovery thread is holding the mpt lock
> > and calling some CAM routine which attempts to relock it via
> > cam_periph_lock().  A stack trace would be most telling in
> > that case.
> 
> Rebooted, inserted 2nd disk (copied by hand, sorry for delay)
Try this.  It reverts the single-CCB part of the previous commit while keeping 
the other fixes.  I missed that the CCB might still be in flight when we 
schedule another rescan.

Index: mpt_raid.c
===================================================================
--- mpt_raid.c	(revision 192376)
+++ mpt_raid.c	(working copy)
@@ -658,19 +658,19 @@
 static void
 mpt_cam_rescan_callback(struct cam_periph *periph, union ccb *ccb)
 {
+
 	xpt_free_path(ccb->ccb_h.path);
+	xpt_free_ccb(ccb);
 }
 
 static void
 mpt_raid_thread(void *arg)
 {
 	struct mpt_softc *mpt;
-	union ccb *ccb;
 	int firstrun;
 
 	mpt = (struct mpt_softc *)arg;
 	firstrun = 1;
-	ccb = xpt_alloc_ccb();
 	MPT_LOCK(mpt);
 	while (mpt->shutdwn_raid == 0) {
 
@@ -698,15 +698,21 @@
 		}
 
 		if (mpt->raid_rescan != 0) {
+			union ccb *ccb;
 			struct cam_path *path;
 			int error;
 
 			mpt->raid_rescan = 0;
+			MPT_UNLOCK(mpt);
 
+			ccb = xpt_alloc_ccb();
+
+			MPT_LOCK(mpt);
 			error = xpt_create_path(&path, xpt_periph,
 			    cam_sim_path(mpt->phydisk_sim),
 			    CAM_TARGET_WILDCARD, CAM_LUN_WILDCARD);
 			if (error != CAM_REQ_CMP) {
+				xpt_free_ccb(ccb);
 				mpt_prt(mpt, "Unable to rescan RAID Bus!\n");
 			} else {
 				xpt_setup_ccb(&ccb->ccb_h, path, 5);
@@ -719,7 +725,6 @@
 			}
 		}
 	}
-	xpt_free_ccb(ccb);
 	mpt->raid_thread = NULL;
 	wakeup(&mpt->raid_thread);
 	MPT_UNLOCK(mpt);
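
The reasoning behind the patch (a request that may still be in flight must not be reused for the next rescan) can be illustrated with a small user-space C sketch. It is only a model under stated assumptions: `struct request`, `request_submit()`, `request_done()` and the other names are hypothetical stand-ins, not the real CAM `union ccb` or `xpt_*` routines, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a CCB; not the real union ccb. */
struct request {
    bool in_flight;     /* true while the lower layer still owns it */
};

static int live_requests;   /* allocated but not yet freed */

static struct request *
request_alloc(void)
{
    struct request *r = calloc(1, sizeof(*r));

    if (r != NULL)
        live_requests++;
    return (r);
}

static void
request_free(struct request *r)
{
    assert(r != NULL);
    free(r);
    live_requests--;
}

/* Hand a request to the lower layer; refuse one that is still in flight. */
static bool
request_submit(struct request *r)
{
    if (r->in_flight)
        return (false);
    r->in_flight = true;
    return (true);
}

/* Completion callback: the request is released here, as in the patch. */
static void
request_done(struct request *r)
{
    r->in_flight = false;
    request_free(r);
}

/* Old scheme: one request reused for every rescan. */
static bool
shared_request_collides(void)
{
    struct request *shared = request_alloc();
    bool first, second;

    if (shared == NULL)
        return (false);
    first = request_submit(shared);
    second = request_submit(shared);    /* earlier rescan not finished yet */
    request_done(shared);
    return (first && !second);
}

/* New scheme: a fresh request per rescan, freed by its own completion. */
static bool
per_rescan_requests_ok(void)
{
    struct request *a = request_alloc();
    struct request *b = request_alloc();
    bool ok;

    if (a == NULL || b == NULL) {
        if (a != NULL)
            request_free(a);
        if (b != NULL)
            request_free(b);
        return (false);
    }
    ok = request_submit(a) && request_submit(b);
    request_done(a);
    request_done(b);
    return (ok && live_requests == 0);
}
```

In this model the second submit of the shared request is merely refused. In the real driver nothing refuses it, so the object would presumably be overwritten while still in use, which is why the patch allocates per rescan and frees in the callback.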

-- 
John Baldwin

Riccardo Torrini

2009-May-21 07:29 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:
> Try this.  It reverts the single-CCB part of the previous
> commit while keeping the other fixes.  I missed that the
> CCB might still be in flight when we schedule another rescan.
Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it
differs only in line positions; the adjacent lines are the same).
I also redid a diff -u4 to verify, recompiled, installed, and...

YOO-HOO.  Now it rebuilds _without_ crashing.

May 20 17:39:08 horse kernel: \
	mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
	mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
	mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
	mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining

Let me test against 7.2-STABLE (and maybe some -CURRENT too)...

[some time later]

Bad news: I removed the second disk during the rebuild and it
still crashes.  I took a screen snapshot with a camera because there
were too many messages to write down by hand  :)

Image, src tarball and info here (about 2.2MB):
ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/


-- 
Riccardo.

Attilio Rao

2009-May-21 10:18 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

2009/5/21 Riccardo Torrini <riccardo.torrini@esaote.com>:
> On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:
>
>> Try this.  It reverts the single-CCB part of the previous
>> commit while keeping the other fixes.  I missed that the
>> CCB might still be in flight when we schedule another rescan.
>
> Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it
> differs only in line positions; the adjacent lines are the same).
> I also redid a diff -u4 to verify, recompiled, installed, and...
>
> YOO-HOO.  Now it rebuilds _without_ crashing.
>
> May 20 17:39:08 horse kernel: \
>        mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
>        mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
>        mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
>        mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining
>
> Let me test against 7.2-STABLE (and maybe some -CURRENT too)...
>
> [some time later]
>
> Bad news: I removed the second disk during the rebuild and it
> still crashes.  I took a screen snapshot with a camera because there
> were too many messages to write down by hand  :)
>
> Image, src tarball and info here (about 2.2MB):
> ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/
Please try the patch here:
http://www.freebsd.org/~attilio/notify.diff

I think this approach is perfectly fine, because devctl_notify()
will also "silently" fail if no memory is available.
Note that this is more a CAM "bug" than a driver one; the driver just triggers it.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

Riccardo Torrini

2009-May-21 16:55 UTC

kern/130330: [mpt] [panic] Panic and reboot machine MPT ...

On Thu, May 21, 2009 at 11:47:54AM +0200, Attilio Rao wrote:
> Please try the patch here:
> http://www.freebsd.org/~attilio/notify.diff
As promised, I checked against today's 7.2-STABLE (cvsup finished
at 15:17 CEST, GMT+2, Italian time with DST) and... it works!
(I added and removed a disk 4 times, even while a sync was in progress.)

# uname -v
FreeBSD 7.2-STABLE #3: Thu May 21 18:26:04 CEST 2009 ...


-----[ 1st remove ]-----

mpt0: External Bus Reset Detected
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:0): Volume Status Changed
(mpt0:vol0:1): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:1): No longer configured

-----[ 1st add ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Domain Validation Required
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71087625 of 71087625 blocks remaining
(mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:1:0)
(mpt0:vol0:1): Online
(mpt0:vol0:1): Status ( Out-Of-Sync )
(mpt0:vol0:1): SMART Data Received
(mpt0:vol0:1): ASC 0x5d, ASCQ 0x0)
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71076421 of 71087625 blocks remaining
mpt0:vol0(mpt0:0:0): Volume Status Changed


-----[ 2nd remove ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:1): No longer configured
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed

-----[ 2nd add ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Domain Validation Required
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71087625 of 71087625 blocks remaining
(mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:1:0)
(mpt0:vol0:1): Online
(mpt0:vol0:1): Status ( Out-Of-Sync )
(mpt0:vol0:1): SMART Data Received
(mpt0:vol0:1): ASC 0x5d, ASCQ 0x0)
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 70896522 of 71087625 blocks remaining
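
The "N of M blocks remaining" lines above can be turned into a progress figure. The following is a minimal C sketch, assuming the exact message layout shown in these logs; `resync_percent_done()` is a hypothetical helper, not something provided by mpt(4).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Extract re-sync progress from a log line such as
 *   "mpt0:vol0(mpt0:0:0): 70896522 of 71087625 blocks remaining"
 * On success stores the percentage already synced in *pct and returns 1;
 * returns 0 for any line that does not carry a block count.
 */
static int
resync_percent_done(const char *line, double *pct)
{
    unsigned long long remaining, total;

    assert(line != NULL && pct != NULL);
    /* Skip everything up to the first "): ", then read the two counters. */
    if (sscanf(line, "%*[^)]): %llu of %llu blocks remaining",
        &remaining, &total) != 2)
        return (0);
    if (total == 0 || remaining > total)
        return (0);
    *pct = 100.0 * (double)(total - remaining) / (double)total;
    return (1);
}
```

For the last line of the 2nd add this gives about 0.27% synced, which matches the disk having been re-added only moments before.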


Thanks again.


-- 
Riccardo.
