Riccardo Torrini
2009-May-07 16:02 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
I just submitted a follow-up to PR kern/130330 with the same info. I may have found the commit that causes the crash.

Please see the PR for more detailed info (and cc: this thread to me).

I narrowed the time window of the problem by doing (a lot of) build&install world runs from 2008.07 up to now (read: last week).

With 2008.07.28.17.00.00 (7.0-STABLE) it works fine, but with 2008.07.28.18.00.00 it starts crashing when removing the second disk of a mirror (when the mirror is ok) or adding the second disk to a degraded one.

Also note that the same crash happens with every 7.1 stable or release and even every 7.2-PRE I tested.

(wrapping long lines)
# cd /home/ncvs/src/sys/
# grep -R "date.*2008\.07\.28\.17" ./ | grep -v /Attic

./dev/wi/if_wi.c,v:
  date 2008.07.28.17.00.37; author imp; state Exp;
./dev/wi/if_wivar.h,v:
  date 2008.07.28.17.00.37; author imp; state Exp;
./dev/mpt/mpt_raid.c,v:
  date 2008.07.28.17.10.09; author jhb; state Exp;
./dev/mpt/mpt_raid.c,v:
  date 2008.07.28.17.05.09; author jhb; state Exp;
./kern/sched_4bsd.c,v:
  date 2008.07.28.17.25.24; author jhb; state Exp;
./modules/et/Makefile,v:
  date 2008.07.28.17.56.37; author antoine; state Exp;

In that time window there are only 4 files changed in src/sys/dev, and I bet on mpt_raid.c :-)

This is the commit log extracted from cvsweb:

-----8<-----
Revision 1.15.2.1:
Mon Jul 28 17:05:09 2008 UTC (9 months, 1 week ago) by jhb
Branches: RELENG_7
CVS tags: RELENG_7_1_BP
Branch point for: RELENG_7_1
Diff to: previous 1.15: preferred, colored
Changes since revision 1.15: +4 -4 lines

SVN rev 180920 on 2008-07-28 17:05:09Z by jhb

MFC: Allocate a single CCB at the start of the main loop of the RAID
monitoring kthread of the mpt(4) driver.
-----8<-----

Here is the diff:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/mpt/mpt_raid.c.diff?r1=1.15;r2=1.15.2.1

What can I do now?

--
Riccardo. Network Manager @ ESAOTE S.p.A.
John Baldwin
2009-May-11 16:50 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Thursday 07 May 2009 11:50:12 am Riccardo Torrini wrote:
> I just submitted a follow-up to PR kern/130330 with the same
> info. Maybe I found the committed lines doing the crash.
>
> Please see PR for more detailed info (and cc: this thread to me).
>
> I restricted the time window of the problem doing (a lot of)
> build&install world from 2008.07 up to now (read last week).
>
> With 2008.07.28.17.00.00 (7.0-STABLE) works fine but
> with 2008.07.28.18.00.00 start crashing removing the
> the second disk of a mirror (when the mirror is ok)
> or adding the second disk of a degraded ones.
>
> Also note that the same crash happens with all 7.1
> stable or release and even all 7.2-PRE I tested.
>
> (wrapping long lines)
> # cd /home/ncvs/src/sys/
> # grep -R "date.*2008\.07\.28\.17" ./ | grep -v /Attic
>
> ./dev/wi/if_wi.c,v:
>   date 2008.07.28.17.00.37; author imp; state Exp;
> ./dev/wi/if_wivar.h,v:
>   date 2008.07.28.17.00.37; author imp; state Exp;
> ./dev/mpt/mpt_raid.c,v:
>   date 2008.07.28.17.10.09; author jhb; state Exp;
> ./dev/mpt/mpt_raid.c,v:
>   date 2008.07.28.17.05.09; author jhb; state Exp;
> ./kern/sched_4bsd.c,v:
>   date 2008.07.28.17.25.24; author jhb; state Exp;
> ./modules/et/Makefile,v:
>   date 2008.07.28.17.56.37; author antoine; state Exp;
>
> In that time window there are only 4 file changed in
> src/sys/dev, and I bet to mpt_raid.c :-)
>
> This is the commit log extracted from cvsweb
> -----8<-----
> Revision 1.15.2.1:
> Mon Jul 28 17:05:09 2008 UTC (9 months, 1 week ago) by jhb
> Branches: RELENG_7
> CVS tags: RELENG_7_1_BP
> Branch point for: RELENG_7_1
> Diff to: previous 1.15: preferred, colored
> Changes since revision 1.15: +4 -4 lines
>
> SVN rev 180920 on 2008-07-28 17:05:09Z by jhb
>
> MFC: Allocate a single CCB at the start of the main loop of the RAID
> monitoring kthread of the mpt(4) driver.
> -----8<-----
>
> Here are the diff:
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/mpt/mpt_raid.c.diff?r1=1.15;r2=1.15.2.1
>
> What can I do now?

Can you get more details on the crash, perhaps a crash dump?

--
John Baldwin
Riccardo Torrini
2009-May-11 16:55 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Mon, May 11, 2009 at 09:53:21AM -0400, John Baldwin wrote:

>> What can I do now?

> Can you get more details on the crash, perhaps a crash dump?

Anything you want, but you need to guide me. I was unable to set up a serial/debug console, so I must write things down by hand (I followed the handbook and tried all speed/duplex pairs, but I still get strange "graphics" chars; maybe it is the cable or the setup).

I am using a kernel with all the debug knobs known to me enabled :-)

--
Riccardo.
John Baldwin
2009-May-11 18:25 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Monday 11 May 2009 12:55:22 pm Riccardo Torrini wrote:
> On Mon, May 11, 2009 at 09:53:21AM -0400, John Baldwin wrote:
>
> >> What can I do now?
>
> > Can you get more details on the crash, perhaps a crash dump?
>
> All what you want, but you need to drive me, I was unable
> to setup serial/debug console so I must wrote down by hand
> (followed handbook, tryed all speed/duplex pairs, still
> having "graphics" strange chars, maybe the cable or setup).
>
> Using a kernel with all know (to me) debug knobs enabled :-)

Do you have kernel crashdumps enabled and a swap partition? If so, do you happen to have any files in /var/crash?

--
John Baldwin
John Baldwin
2009-May-20 14:21 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Tuesday 12 May 2009 12:10:25 pm Riccardo Torrini wrote:
> On Tue, May 12, 2009 at 11:44:20AM -0400, John Baldwin wrote:
>
> > If you can get a stack trace, that would be most helpful.
> > My guess is that the recovery thread is holding the mpt lock
> > and calling some CAM routine which attempts to relock it via
> > cam_periph_lock(). A stack trace would be most telling in
> > that case.
>
> Rebooted, inserted 2nd disk (copied by hand, sorry for delay)

Try this. It reverts the single-CCB part of the previous commit while keeping the other fixes. I missed that the CCB might still be in flight when we schedule another rescan.

Index: mpt_raid.c
===================================================================
--- mpt_raid.c	(revision 192376)
+++ mpt_raid.c	(working copy)
@@ -658,19 +658,19 @@
 static void
 mpt_cam_rescan_callback(struct cam_periph *periph, union ccb *ccb)
 {
+	xpt_free_path(ccb->ccb_h.path);
+	xpt_free_ccb(ccb);
 }
 
 static void
 mpt_raid_thread(void *arg)
 {
 	struct mpt_softc *mpt;
-	union ccb *ccb;
 	int firstrun;
 
 	mpt = (struct mpt_softc *)arg;
 	firstrun = 1;
-	ccb = xpt_alloc_ccb();
 	MPT_LOCK(mpt);
 	while (mpt->shutdwn_raid == 0) {
 
@@ -698,15 +698,21 @@
 		}
 
 		if (mpt->raid_rescan != 0) {
+			union ccb *ccb;
 			struct cam_path *path;
 			int error;
 
 			mpt->raid_rescan = 0;
+			MPT_UNLOCK(mpt);
 
+			ccb = xpt_alloc_ccb();
+
+			MPT_LOCK(mpt);
 			error = xpt_create_path(&path, xpt_periph,
 			    cam_sim_path(mpt->phydisk_sim),
 			    CAM_TARGET_WILDCARD, CAM_LUN_WILDCARD);
 			if (error != CAM_REQ_CMP) {
+				xpt_free_ccb(ccb);
 				mpt_prt(mpt, "Unable to rescan RAID Bus!\n");
 			} else {
 				xpt_setup_ccb(&ccb->ccb_h, path, 5);
@@ -719,7 +725,6 @@
 			}
 		}
 	}
-	xpt_free_ccb(ccb);
 	mpt->raid_thread = NULL;
 	wakeup(&mpt->raid_thread);
 	MPT_UNLOCK(mpt);

--
John Baldwin
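The ownership rule behind this patch (one request object per rescan, freed by its completion callback rather than by the thread that submitted it) can be illustrated outside the kernel. What follows is only a userland sketch under stated assumptions: `struct request`, `submit_rescan()`, `rescan_done()` and `complete_pending()` are hypothetical names invented for the illustration and are not part of CAM or of mpt(4). It shows why a second rescan can safely be scheduled while the first is still in flight when each one has its own allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a CAM CCB: one heap object per request. */
struct request {
	int id;
	void (*done)(struct request *);
	struct request *next;
};

static struct request *pending_head;	/* requests still "in flight" */
static struct request **pending_tail = &pending_head;
static int live_requests;		/* allocated but not yet freed */
static int completed;			/* callbacks that have run */

/* Completion callback: the callback, not the submitter, frees the request. */
static void
rescan_done(struct request *req)
{
	completed++;
	live_requests--;
	free(req);
}

/* Allocate a fresh request for every rescan and hand it to the queue. */
static int
submit_rescan(int id)
{
	struct request *req = malloc(sizeof(*req));

	if (req == NULL)
		return (-1);
	req->id = id;
	req->done = rescan_done;
	req->next = NULL;
	*pending_tail = req;
	pending_tail = &req->next;
	live_requests++;
	return (0);
}

/* Simulate the lower layer finishing every queued request, in order. */
static void
complete_pending(void)
{
	struct request *req;

	while ((req = pending_head) != NULL) {
		pending_head = req->next;
		if (pending_head == NULL)
			pending_tail = &pending_head;
		req->done(req);
	}
}
```

Calling `submit_rescan()` twice before `complete_pending()` leaves two independent objects on the queue. With a single shared object, the second submission would have overwritten a request that the lower layer still held, which is the situation the message above describes as the CCB being "still in flight".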
Riccardo Torrini
2009-May-21 07:29 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:

> Try this. It reverts the single-CCB part of the previous
> commit while keeping the other fixes. I missed that the
> CCB might still be in flight when we schedule another rescan.

Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it differs only in line positions; the adjacent lines are the same). I also redid a diff -u4 to verify, recompiled, installed, and...

YOO-HOO. Now it rebuilds _without_ crashing.

May 20 17:39:08 horse kernel: \
  mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
  mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
  mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
  mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining

Let me test against a 7.2-STABLE (and even some -CURRENT)...

[some time later]

Bad news: I removed the second disk during the rebuild and it still crashes. I took a screen snapshot with a camera because there were too many messages to write down by hand :)

Image, src tarball and info here (about 2.2MB):
ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/

--
Riccardo.
Attilio Rao
2009-May-21 10:18 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
2009/5/21 Riccardo Torrini <riccardo.torrini@esaote.com>:
> On Wed, May 20, 2009 at 10:21:23AM -0400, John Baldwin wrote:
>
>> Try this. It reverts the single-CCB part of the previous
>> commit while keeping the other fixes. I missed that the
>> CCB might still be in flight when we schedule another rescan.
>
> Applied to mpt_raid.c,v 1.15.2.1 2008/07/28 17:05:09 jhb (it
> differ only for line position but adiacent lines are the same).
> Also redone a diff -u4 to verify, recompiled, installed, and...
>
> YOO-HOO. Now it rebuild _without_ crashing.
>
> May 20 17:39:08 horse kernel: \
>   mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
>   mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
>   mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
>   mpt0:vol0(mpt0:0:0): 64461754 of 71087625 blocks remaining
>
> Let me test against a 7.2-STABLE (and even to some -CURRENT)...
>
> [some times ahead]
>
> Bad news: I removed the second disk during rebuilding and it
> still crash. I take a screen shapshot with camera because of
> too many messages for write down by hand :)
>
> Image, src tarball and info here (about 2.2MB):
> ftp://ftp.torrini.org/pub/FreeBSD/mpt_crash_on_rebuild/

Please try the patch here:
http://www.freebsd.org/~attilio/notify.diff

I think this approach is perfectly fine because devctl_notify() will also "silently" fail if no memory is available. Note that this is more a CAM "bug" that the driver exposes.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein
Riccardo Torrini
2009-May-21 16:55 UTC
kern/130330: [mpt] [panic] Panic and reboot machine MPT ...
On Thu, May 21, 2009 at 11:47:54AM +0200, Attilio Rao wrote:

> Please try the patch here:
> http://www.freebsd.org/~attilio/notify.diff

As promised I checked against today's 7.2-STABLE (cvsup ended at 15:17 CEST, GMT+2, Italy time with DST) and ... it works! (I added and removed a disk 4 times, even during a sync in progress.)

# uname -v
FreeBSD 7.2-STABLE #3: Thu May 21 18:26:04 CEST 2009 ...

-----[ 1st remove ]-----
mpt0: External Bus Reset Detected
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:0): Volume Status Changed
(mpt0:vol0:1): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:1): No longer configured

-----[ 1st add ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Domain Validation Required
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71087625 of 71087625 blocks remaining
(mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:1:0)
(mpt0:vol0:1): Online
(mpt0:vol0:1): Status ( Out-Of-Sync )
(mpt0:vol0:1): SMART Data Received
(mpt0:vol0:1): ASC 0x5d, ASCQ 0x0)
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71076421 of 71087625 blocks remaining
mpt0:vol0(mpt0:0:0): Volume Status Changed

-----[ 2nd remove ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:1): Physical Disk Status Changed
(mpt0:vol0:1): No longer configured
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed

-----[ 2nd add ]-----
mpt0: External Bus Reset Detected
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Physical Disk Status Changed
mpt0:vol0(mpt0:0:0): Domain Validation Required
mpt0:vol0(mpt0:0:0): Volume Status Changed
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 71087625 of 71087625 blocks remaining
(mpt0:vol0:1): Physical (mpt0:0:1:0), Pass-thru (mpt0:1:1:0)
(mpt0:vol0:1): Online
(mpt0:vol0:1): Status ( Out-Of-Sync )
(mpt0:vol0:1): SMART Data Received
(mpt0:vol0:1): ASC 0x5d, ASCQ 0x0)
mpt0:vol0(mpt0:0:0): RAID-1 - Degraded
mpt0:vol0(mpt0:0:0): Status ( Enabled Re-Syncing )
mpt0:vol0(mpt0:0:0): Low Priority Re-Sync
mpt0:vol0(mpt0:0:0): 70896522 of 71087625 blocks remaining

Thanks again.

--
Riccardo.