Riccardo Torrini
2009-May-07  16:02 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
I just submitted a follow-up to PR kern/130330 with the same
info.  I may have found the commit that causes the crash.
Please see the PR for more detailed info (and please cc: me on this thread).
I narrowed the time window of the problem by doing (a lot of)
build&install world runs from 2008.07 up to now (read: last week).
Sources from 2008.07.28.17.00.00 (7.0-STABLE) work fine, but
sources from 2008.07.28.18.00.00 start crashing when removing
the second disk of a mirror (when the mirror is ok)
or adding the second disk to a degraded one.
Also note that the same crash happens with all 7.1
stable or release and even all 7.2-PRE I tested.
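One systematic way to narrow such a window is bisection over dated source snapshots. The thread does not say that exactly this procedure was used; the sketch below only illustrates the idea. The names `first_bad_snapshot`, `crashes_fn` and `example_crashes` are hypothetical, and in practice the predicate is a manual build, install and disk pull/insert test rather than a function call.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical predicate: nonzero if a world built from the sources of
 * snapshot number `index` panics during the disk remove/add test.
 */
typedef int (*crashes_fn)(size_t index);

/*
 * Return the index of the first crashing snapshot.
 * Assumptions: snapshot `lo` is known good, snapshot `hi` is known bad,
 * and once the bug appears it stays (good ... good bad ... bad).
 */
size_t
first_bad_snapshot(size_t lo, size_t hi, crashes_fn crashes)
{
	assert(lo < hi);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (crashes(mid))
			hi = mid;	/* bug already present at mid */
		else
			lo = mid;	/* mid is still good */
	}
	return hi;
}

/* Illustration only: pretend snapshots 6 and later crash. */
int
example_crashes(size_t index)
{
	return index >= 6;
}
```

With ten snapshots numbered 0 to 9, `first_bad_snapshot(0, 9, example_crashes)` needs three test builds instead of eight, which matters when every test is a full build&install world.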
(wrapping long lines)
# cd /home/ncvs/src/sys/
# grep -R "date.*2008\.07\.28\.17" ./ | grep -v /Attic
./dev/wi/if_wi.c,v:
        date    2008.07.28.17.00.37;    author imp;     state Exp;
./dev/wi/if_wivar.h,v:
        date    2008.07.28.17.00.37;    author imp;     state Exp;
./dev/mpt/mpt_raid.c,v:
        date    2008.07.28.17.10.09;    author jhb;     state Exp;
./dev/mpt/mpt_raid.c,v:
        date    2008.07.28.17.05.09;    author jhb;     state Exp;
./kern/sched_4bsd.c,v:
        date    2008.07.28.17.25.24;    author jhb;     state Exp;
./modules/et/Makefile,v:
        date    2008.07.28.17.56.37;    author antoine; state Exp;
In that time window there are only 4 files changed in
src/sys/dev, and my bet is on mpt_raid.c  :-)
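The grep above matches RCS `date` fields textually. As a minimal sketch of the same window test done numerically, the helper below parses a `YYYY.MM.DD.hh.mm.ss` stamp and checks whether it falls in a half-open window. The function names `rcs_date_key` and `rcs_date_in_window` are hypothetical and not part of any CVS tool.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Turn "YYYY.MM.DD.hh.mm.ss" into a sortable number such as
 * 20080728170509.  Returns -1 if the string is not in that form.
 */
long long
rcs_date_key(const char *s)
{
	int y, mo, d, h, mi, sec;
	char extra;

	/* The trailing %c rejects input with characters after the seconds. */
	if (sscanf(s, "%d.%d.%d.%d.%d.%d%c",
	    &y, &mo, &d, &h, &mi, &sec, &extra) != 6)
		return -1;
	return (((((long long)y * 100 + mo) * 100 + d) * 100 + h) * 100 +
	    mi) * 100 + sec;
}

/* Nonzero if start <= date < end; an unparsable date is never inside. */
int
rcs_date_in_window(const char *date, const char *start, const char *end)
{
	long long k = rcs_date_key(date);
	long long a = rcs_date_key(start);
	long long b = rcs_date_key(end);

	assert(a >= 0 && b >= 0 && a <= b);
	return k >= a && k < b;
}
```

For the window discussed here, `rcs_date_in_window("2008.07.28.17.05.09", "2008.07.28.17.00.00", "2008.07.28.18.00.00")` is true, while a stamp of exactly `2008.07.28.18.00.00` falls outside.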
This is the commit log extracted from cvsweb
-----8<-----
Revision 1.15.2.1:
Mon Jul 28 17:05:09 2008 UTC (9 months, 1 week ago) by jhb
Branches: RELENG_7
CVS tags: RELENG_7_1_BP
Branch point for: RELENG_7_1
Diff to: previous 1.15: preferred, colored
Changes since revision 1.15: +4 -4 lines
SVN rev 180920 on 2008-07-28 17:05:09Z by jhb
MFC: Allocate a single CCB at the start of the main loop of the RAID
monitoring kthread of the mpt(4) driver.
-----8<-----
Here is the diff:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/mpt/mpt_raid.c.diff?r1=1.15;r2=1.15.2.1
What can I do now?
-- 
Riccardo.
Network Manager @ ESAOTE S.p.A.
John Baldwin
2009-May-11  16:50 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Thursday 07 May 2009 11:50:12 am Riccardo Torrini wrote:
> I just submitted a follow-up to PR kern/130330 with the same
> info.  I may have found the commit that causes the crash.
>
> Please see the PR for more detailed info (and please cc: me on this thread).
>
> I narrowed the time window of the problem by doing (a lot of)
> build&install world runs from 2008.07 up to now (read: last week).
>
> Sources from 2008.07.28.17.00.00 (7.0-STABLE) work fine, but
> sources from 2008.07.28.18.00.00 start crashing when removing
> the second disk of a mirror (when the mirror is ok)
> or adding the second disk to a degraded one.
>
> Also note that the same crash happens with all 7.1
> stable or release and even all 7.2-PRE I tested.
>
> (wrapping long lines)
> # cd /home/ncvs/src/sys/
> # grep -R "date.*2008\.07\.28\.17" ./ | grep -v /Attic
>
> ./dev/wi/if_wi.c,v:
>         date    2008.07.28.17.00.37;    author imp;     state Exp;
> ./dev/wi/if_wivar.h,v:
>         date    2008.07.28.17.00.37;    author imp;     state Exp;
> ./dev/mpt/mpt_raid.c,v:
>         date    2008.07.28.17.10.09;    author jhb;     state Exp;
> ./dev/mpt/mpt_raid.c,v:
>         date    2008.07.28.17.05.09;    author jhb;     state Exp;
> ./kern/sched_4bsd.c,v:
>         date    2008.07.28.17.25.24;    author jhb;     state Exp;
> ./modules/et/Makefile,v:
>         date    2008.07.28.17.56.37;    author antoine; state Exp;
>
> In that time window there are only 4 files changed in
> src/sys/dev, and my bet is on mpt_raid.c :-)
>
> This is the commit log extracted from cvsweb
> -----8<-----
> Revision 1.15.2.1:
> Mon Jul 28 17:05:09 2008 UTC (9 months, 1 week ago) by jhb
> Branches: RELENG_7
> CVS tags: RELENG_7_1_BP
> Branch point for: RELENG_7_1
> Diff to: previous 1.15: preferred, colored
> Changes since revision 1.15: +4 -4 lines
>
> SVN rev 180920 on 2008-07-28 17:05:09Z by jhb
>
> MFC: Allocate a single CCB at the start of the main loop of the RAID
> monitoring kthread of the mpt(4) driver.
> -----8<-----
>
> Here is the diff:
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/mpt/mpt_raid.c.diff?r1=1.15;r2=1.15.2.1
>
> What can I do now?

Can you get more details on the crash, perhaps a crash dump?

-- 
John Baldwin
Riccardo Torrini
2009-May-11  16:55 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Mon, May 11, 2009 at 09:53:21AM -0400, John Baldwin wrote:

>> What can I do now?

> Can you get more details on the crash, perhaps a crash dump?

Anything you want, but you need to guide me. I was unable
to set up a serial/debug console, so I have to write things down by hand
(followed the handbook, tried all speed/duplex pairs, still
getting strange "graphics" chars; maybe the cable or the setup).

I am using a kernel with all the debug knobs known to me enabled :-)

-- 
Riccardo.
John Baldwin
2009-May-11  18:25 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Monday 11 May 2009 12:55:22 pm Riccardo Torrini wrote:
> On Mon, May 11, 2009 at 09:53:21AM -0400, John Baldwin wrote:
>
> >> What can I do now?
>
> > Can you get more details on the crash, perhaps a crash dump?
>
> Anything you want, but you need to guide me. I was unable
> to set up a serial/debug console, so I have to write things down by hand
> (followed the handbook, tried all speed/duplex pairs, still
> getting strange "graphics" chars; maybe the cable or the setup).
>
> I am using a kernel with all the debug knobs known to me enabled :-)

Do you have kernel crashdumps enabled and a swap partition?  If so, do you
happen to have any files in /var/crash?

-- 
John Baldwin
John Baldwin
2009-May-20  14:21 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Tuesday 12 May 2009 12:10:25 pm Riccardo Torrini wrote:
> On Tue, May 12, 2009 at 11:44:20AM -0400, John Baldwin wrote:
>
> > If you can get a stack trace, that would be most helpful.
> > My guess is that the recovery thread is holding the mpt lock
> > and calling some CAM routine which attempts to relock it via
> > cam_periph_lock().  A stack trace would be most telling in
> > that case.
>
> Rebooted, inserted 2nd disk (copied by hand, sorry for delay)

Try this.  It reverts the single-CCB part of the previous commit while keeping
the other fixes.  I missed that the CCB might still be in flight when we
schedule another rescan.

Index: mpt_raid.c
===================================================================
--- mpt_raid.c	(revision 192376)
+++ mpt_raid.c	(working copy)
@@ -658,19 +658,19 @@
 static void
 mpt_cam_rescan_callback(struct cam_periph *periph, union ccb *ccb)
 {
+	xpt_free_path(ccb->ccb_h.path);
+	xpt_free_ccb(ccb);
 }
 
 static void
 mpt_raid_thread(void *arg)
 {
 	struct mpt_softc *mpt;
-	union ccb *ccb;
 	int firstrun;
 
 	mpt = (struct mpt_softc *)arg;
 	firstrun = 1;
-	ccb = xpt_alloc_ccb();
 	MPT_LOCK(mpt);
 	while (mpt->shutdwn_raid == 0) {
 
@@ -698,15 +698,21 @@
 		}
 
 		if (mpt->raid_rescan != 0) {
+			union ccb *ccb;
 			struct cam_path *path;
 			int error;
 
 			mpt->raid_rescan = 0;
+			MPT_UNLOCK(mpt);
 
+			ccb = xpt_alloc_ccb();
+
+			MPT_LOCK(mpt);
 			error = xpt_create_path(&path, xpt_periph,
 			    cam_sim_path(mpt->phydisk_sim),
 			    CAM_TARGET_WILDCARD, CAM_LUN_WILDCARD);
 			if (error != CAM_REQ_CMP) {
+				xpt_free_ccb(ccb);
 				mpt_prt(mpt, "Unable to rescan RAID Bus!\n");
 			} else {
 				xpt_setup_ccb(&ccb->ccb_h, path, 5);
@@ -719,7 +725,6 @@
 			}
 		}
 	}
-	xpt_free_ccb(ccb);
 	mpt->raid_thread = NULL;
 	wakeup(&mpt->raid_thread);
 	MPT_UNLOCK(mpt);

-- 
John Baldwin
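The ownership rule behind this patch can be shown outside the kernel. The sketch below is a user-space stand-in, not CAM code: `struct request`, `rescan_start`, `rescan_done` and `live_request_count` are hypothetical names. It only illustrates the pattern the patch returns to, where every rescan gets its own request object and the completion callback frees it, so a second rescan can never overwrite a request that is still in flight.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a CAM CCB; this is NOT the real `union ccb`. */
struct request {
	int in_flight;	/* nonzero while the lower layer owns the request */
	int target;	/* what this particular rescan was issued for */
};

static int live_requests;	/* allocated and not yet completed */

/* Queue one rescan with its own freshly allocated request. */
struct request *
rescan_start(int target)
{
	struct request *req = malloc(sizeof(*req));

	if (req == NULL)
		return NULL;
	req->in_flight = 1;
	req->target = target;
	live_requests++;
	return req;
}

/* Completion callback: the request is finished, so it is released here. */
void
rescan_done(struct request *req)
{
	assert(req->in_flight);
	req->in_flight = 0;
	free(req);
	live_requests--;
}

int
live_request_count(void)
{
	return live_requests;
}
```

Two back-to-back calls to `rescan_start()` return two distinct objects, each keeping its own `target` until its own `rescan_done()` runs. With a single shared object, the second call would rewrite fields the first operation may still be using, which is the situation John describes as the CCB still being in flight.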
Riccardo Torrini
2009-May-21  07:29 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:

> Try this.  It reverts the single-CCB part of the previous
> commit while keeping the other fixes.  I missed that the
> CCB might still be in flight when we schedule another rescan.

Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it
differs only in line positions, but the adjacent lines are the same).
I also redid a diff -u4 to verify, recompiled, installed, and...

YOO-HOO.  Now it rebuilds _without_ crashing.

May 20 17:39:08 horse kernel: \
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining

Let me test against a 7.2-STABLE (and even some -CURRENT)...

[some time later]

Bad news: I removed the second disk during the rebuild and it
still crashes.  I took a screen snapshot with a camera because there
were too many messages to write down by hand :)

Image, src tarball and info here (about 2.2MB):
ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/

-- 
Riccardo.
Attilio Rao
2009-May-21  10:18 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
2009/5/21 Riccardo Torrini <riccardo.torrini@esaote.com>:
> On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:
>
>> Try this.  It reverts the single-CCB part of the previous
>> commit while keeping the other fixes.  I missed that the
>> CCB might still be in flight when we schedule another rescan.
>
> Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it
> differs only in line positions, but the adjacent lines are the same).
> I also redid a diff -u4 to verify, recompiled, installed, and...
>
> YOO-HOO.  Now it rebuilds _without_ crashing.
>
> May 20 17:39:08 horse kernel: \
> mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
> mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
> mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
> mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining
>
> Let me test against a 7.2-STABLE (and even some -CURRENT)...
>
> [some time later]
>
> Bad news: I removed the second disk during the rebuild and it
> still crashes.  I took a screen snapshot with a camera because there
> were too many messages to write down by hand :)
>
> Image, src tarball and info here (about 2.2MB):
> ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/

Please try the patch here:
http://www.freebsd.org/~attilio/notify.diff

I think this approach is perfectly fine, because devctl_notify()
will also "silently" fail if no memory is available.
Note that this is more a CAM "bug" than one arising from the driver.

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
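The contents of notify.diff are not reproduced in this thread, so the following is not that patch. It is only a generic user-space sketch of the behaviour Attilio refers to: a notification path that never waits for memory and simply drops the event when an allocation fails. All names here (`struct event`, `notify_best_effort`, `failing_alloc`, `dropped_count`, `newest_event`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical queue entry; not the kernel's devctl structures. */
struct event {
	struct event *next;
	char text[];		/* message stored inline after the header */
};

/* Allocator hook so the failure path can be exercised (an assumption). */
typedef void *(*alloc_fn)(size_t);

static struct event *queue_head;
static unsigned long dropped;

/*
 * Queue a notification on a best-effort basis: if memory is not
 * available, count the drop and return without reporting an error.
 */
void
notify_best_effort(const char *text, alloc_fn alloc)
{
	size_t len = strlen(text) + 1;
	struct event *ev = alloc(sizeof(*ev) + len);

	if (ev == NULL) {
		dropped++;	/* silently give up on this event */
		return;
	}
	memcpy(ev->text, text, len);
	ev->next = queue_head;
	queue_head = ev;
}

/* An allocator that always fails, to simulate memory exhaustion. */
void *
failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

unsigned long
dropped_count(void)
{
	return dropped;
}

/* Text of the most recently queued event, or NULL if none. */
const char *
newest_event(void)
{
	return queue_head != NULL ? queue_head->text : NULL;
}
```

Calling `notify_best_effort("disk removed", malloc)` queues the event; calling it with `failing_alloc` leaves the queue untouched and only increments the drop counter. The trade-off is that a listener may miss an event under memory pressure, which is acceptable for an advisory notification but would not be for the I/O path itself.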
Riccardo Torrini
2009-May-21  16:55 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Thu, May 21, 2009 at 11:47:54AM +0200, Attilio Rao wrote:

> Please try the patch here:
> http://www.freebsd.org/~attilio/notify.diff

As promised, I checked against today's 7.2-STABLE (cvsup ended
at 15:17 CEST, GMT+2, Italy time with DST) and ... it works!
(added and removed a disk 4 times, even during a sync-in-progress)

# uname -v
FreeBSD 7.2-STABLE #3: Thu May 21 18:26:04 CEST 2009 ...

-----[ 1st remove ]-----
mpt0: External Bus Reset Detected
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:0): Volume Status Changed
(mpt0:vol0:1): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:1): No longer configured

-----[ 1st add ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Domain Validation Required
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71087625 of 71087625 blocks remaining
(mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:1:0)
(mpt0:vol0:1): Online
(mpt0:vol0:1): Status ( Out-Of-Sync )
(mpt0:vol0:1): SMART Data Received
(mpt0:vol0:1): ASC 0x5d, ASCQ 0x0)
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71076421 of 71087625 blocks remaining
mpt0:vol0(mpt0:0:0): Volume Status Changed

-----[ 2nd remove ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:1): No longer configured
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed

-----[ 2nd add ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Domain Validation Required
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71087625 of 71087625 blocks remaining
(mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:1:0)
(mpt0:vol0:1): Online
(mpt0:vol0:1): Status ( Out-Of-Sync )
(mpt0:vol0:1): SMART Data Received
(mpt0:vol0:1): ASC 0x5d, ASCQ 0x0)
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 70896522 of 71087625 blocks remaining

Thanks again.

-- 
Riccardo.