Craig Leres
2011-Sep-22 03:26 UTC
Panic during kernel booting on HP Proliant DL180G6 and latest STABLE
I have a lot of Supermicro motherboards and the newest ones have igb chipsets; they've been quite a headache with respect to FreeBSD 8. I'm running 8.2-RELEASE but have upgraded parts of my kernel to 8-RELENG (as of a few months ago). Some of them work OK while others panic on bootup. Upgrading to newer versions of the Intel igb code fixes some but breaks others. It's been frustrating.

While working on this today, I saw two different kernel panics:

    Could not setup receive structures
    m_getzone: m_getjcl: invalid cluster type

I tried John Baldwin's patch but got the "invalid cluster type" panic, so I backed it out.

Later I figured out that either turning off hw.igb.enable_msix (loader.conf), or raising kern.ipc.nmbclusters to 131072 (sysctl.conf) and setting hw.igb.num_queues to 4 (loader.conf), would avoid the "receive structures" panic, but either way I was still seeing the "invalid cluster type" panic.

Looking at m_getjcl(), I suspected the passed size to be 0; some debugging confirmed this. It looks like a race where a receive interrupt comes in before adapter->rx_mbuf_sz has been initialized.

Attached is the hack I added to avoid the panic when booting. The idea is to pretend m_getjcl() failed to allocate a cluster rather than to go down in flames.

		Craig

-------------- next part --------------
Index: if_igb.c
===================================================================
--- if_igb.c (revision 31)
+++ if_igb.c (working copy)
@@ -3695,6 +3695,11 @@
                 htole64(hseg[0].ds_addr);
 no_split:
         if (rxbuf->m_pack == NULL) {
+            if (adapter->rx_mbuf_sz == 0) {
+                printf("igb_refresh_mbufs: "
+                    "avoid m_getjcl() panic\n");
+                goto update;
+            }
             mp = m_getjcl(M_DONTWAIT, MT_DATA,
                 M_PKTHDR, adapter->rx_mbuf_sz);
             if (mp == NULL)
@@ -3912,6 +3917,12 @@
 
 skip_head:
         /* Now the payload cluster */
+        if (adapter->rx_mbuf_sz == 0) {
+            printf("igb_setup_receive_ring: "
+                "avoid m_getjcl() panic\n");
+            error = ENOBUFS;
+            goto fail;
+        }
         rxbuf->m_pack = m_getjcl(M_DONTWAIT, MT_DATA,
             M_PKTHDR, adapter->rx_mbuf_sz);
         if (rxbuf->m_pack == NULL) {
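
The guard in the patch above can be illustrated outside the kernel. What follows is only a minimal user-space sketch, not driver code: fake_getjcl(), struct fake_adapter and refill_one() are hypothetical names invented for the illustration, and the stub merely counts the calls that would have panicked in m_getjcl(). It shows the idea of treating an unset size as an ordinary allocation failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts calls that would have hit the "invalid cluster type" panic. */
static int invalid_size_calls = 0;

/*
 * Hypothetical stand-in for m_getjcl(). The real function panics on an
 * unsupported size; this stub only records the bad call.
 */
static void *fake_getjcl(size_t size)
{
    if (size == 0) {
        invalid_size_calls++;   /* the kernel would panic here */
        return NULL;
    }
    return malloc(size);
}

/* Hypothetical, trimmed-down adapter state. */
struct fake_adapter {
    size_t rx_mbuf_sz;          /* 0 until the init path has set it */
};

/* Guarded refill: an unset size is reported as "no cluster available". */
static void *refill_one(const struct fake_adapter *a)
{
    if (a->rx_mbuf_sz == 0)
        return NULL;            /* never reaches the allocator */
    return fake_getjcl(a->rx_mbuf_sz);
}
```

Calling refill_one() on an adapter whose rx_mbuf_sz is still 0 returns NULL without touching the allocator, so the caller takes its existing "allocation failed" path. As in the patch, this only hides the symptom; it does not close the race itself.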
Jeremy Chadwick
2011-Sep-22 04:19 UTC
Panic during kernel booting on HP Proliant DL180G6 and latest STABLE
On Wed, Sep 21, 2011 at 08:26:46PM -0700, Craig Leres wrote:
> I have a lot of supermicro motherboards and the newest ones have igb
> chipsets; they've been quite a headache with respect to FreeBSD 8. I'm
> running 8.2-RELEASE but have upgraded parts of my kernel to 8-RELENG (as
> of a few months ago). Some of them work ok while others panic on bootup.
> Upgrading to newer versions of the intel igb code fixes some but breaks
> others. It's been frustrating.
>
> While working on this today, I saw two different kernel panics:
>
>     Could not setup receive structures
>     m_getzone: m_getjcl: invalid cluster type
>
> I tried John Baldwin's patch but got the "invalid cluster type" panic so
> I backed it out.
>
> Later I figured out that either turning off hw.igb.enable_msix
> (loader.conf) or raising kern.ipc.nmbclusters to 131072 (sysctl.conf)
> and setting hw.igb.num_queues to 4 (loader.conf) would avoid the
> "receive structures" panic but either way I was seeing the "invalid
> cluster type" panic.
>
> Looking m_getjcl(), I suspected the passed size to be 0; some debugging
> confirmed this. Looks like a race here where a receive interrupt comes
> in before adapter->rx_mbuf_sz has been initialized.
>
> Attached is the hack I added to avoid the panic when booting. The idea
> is to pretend m_getjcl() failed to allocate a cluster rather than to go
> down in flames.
>
> 		Craig

> Index: if_igb.c
> ===================================================================
> --- if_igb.c (revision 31)
> +++ if_igb.c (working copy)
> @@ -3695,6 +3695,11 @@
>                  htole64(hseg[0].ds_addr);
>  no_split:
>          if (rxbuf->m_pack == NULL) {
> +            if (adapter->rx_mbuf_sz == 0) {
> +                printf("igb_refresh_mbufs: "
> +                    "avoid m_getjcl() panic\n");
> +                goto update;
> +            }
>              mp = m_getjcl(M_DONTWAIT, MT_DATA,
>                  M_PKTHDR, adapter->rx_mbuf_sz);
>              if (mp == NULL)
> @@ -3912,6 +3917,12 @@
>
>  skip_head:
>          /* Now the payload cluster */
> +        if (adapter->rx_mbuf_sz == 0) {
> +            printf("igb_setup_receive_ring: "
> +                "avoid m_getjcl() panic\n");
> +            error = ENOBUFS;
> +            goto fail;
> +        }
>          rxbuf->m_pack = m_getjcl(M_DONTWAIT, MT_DATA,
>              M_PKTHDR, adapter->rx_mbuf_sz);
>          if (rxbuf->m_pack == NULL) {

The fact that you have this happening on multiple systems makes me uncomfortable, because we use Supermicro hardware exclusively.

Your email contains no reference ID or In-Reply-To headers, so it appears as a new thread. As such, I'll point readers to the thread, which spans several months:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-May/062596.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-June/062949.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-September/063867.html

Also re-CC'ing Jack Vogel.

Chris, I am under the impression that to get proper visibility and attention in this matter, you're probably going to need to set up serial console (both BIOS-level and bootloader-level) for remote debugging capability. Jack, John, or someone familiar with kernel debugging is probably going to need to get access to a machine which is experiencing this problem so they can figure out what's going on.

The tricky part here is that you're going to need to have a custom kernel built that includes numerous debugging options. PXE booting is probably the easiest method.
Remember you don't need filesystems on the system, just a kernel that boots/loads and will drop to ddb> when the panic happens.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
David G Lawrence
2011-Sep-22 10:12 UTC
Panic during kernel booting on HP Proliant DL180G6 and latest STABLE
> I have a lot of supermicro motherboards and the newest ones have igb
> chipsets; they've been quite a headache with respect to FreeBSD 8. I'm
> running 8.2-RELEASE but have upgraded parts of my kernel to 8-RELENG (as
> of a few months ago). Some of them work ok while others panic on bootup.
> Upgrading to newer versions of the intel igb code fixes some but breaks
> others. It's been frustrating.
>
> While working on this today, I saw two different kernel panics:
>
>     Could not setup receive structures
>     m_getzone: m_getjcl: invalid cluster type

I fixed this awhile back in my local sources. A 12 core Supermicro MB system I'm building here was hitting the bug 100% of the time during startup. Patch attached.

-DG

Dr. David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
Pave the road of life with opportunities.

Index: if_igb.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/e1000/if_igb.c,v
retrieving revision 1.21.2.20
diff -c -r1.21.2.20 if_igb.c
*** if_igb.c 29 Jun 2011 16:16:59 -0000 1.21.2.20
--- if_igb.c 22 Sep 2011 10:04:31 -0000
***************
*** 1278,1286 ****
      /* Don't lose promiscuous settings */
      igb_set_promisc(adapter);
  
-     ifp->if_drv_flags |= IFF_DRV_RUNNING;
-     ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
- 
      callout_reset(&adapter->timer, hz, igb_local_timer, adapter);
      e1000_clear_hw_cntrs_base_generic(&adapter->hw);
  
--- 1278,1283 ----
***************
*** 1308,1313 ****
--- 1305,1313 ----
  
      /* Don't reset the phy next time init gets called */
      adapter->hw.phy.reset_disable = TRUE;
+ 
+     ifp->if_drv_flags |= IFF_DRV_RUNNING;
+     ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
  }
  
  static void
***************
*** 1490,1501 ****
      E1000_WRITE_REG(&adapter->hw, E1000_EIMC, que->eims);
      ++que->irqs;
  
      IGB_TX_LOCK(txr);
      more_tx = igb_txeof(txr);
      IGB_TX_UNLOCK(txr);
  
-     more_rx = igb_rxeof(que, adapter->rx_process_limit, NULL);
- 
      if (igb_enable_aim == FALSE)
          goto no_calc;
      /*
--- 1490,1505 ----
      E1000_WRITE_REG(&adapter->hw, E1000_EIMC, que->eims);
      ++que->irqs;
  
+     if (!(adapter->ifp->if_drv_flags & IFF_DRV_RUNNING)) {
+         return;
+     }
+ 
+     more_rx = igb_rxeof(que, adapter->rx_process_limit, NULL);
+ 
      IGB_TX_LOCK(txr);
      more_tx = igb_txeof(txr);
      IGB_TX_UNLOCK(txr);
  
      if (igb_enable_aim == FALSE)
          goto no_calc;
      /*
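
The ordering idea in the patch above (mark the interface running only after init has finished, and have the queue interrupt handler return early until then) can be sketched in user space. This is a simplified, single-threaded illustration under stated assumptions: struct sim_adapter, sim_init() and sim_msix_que() are hypothetical names, the `running` field stands in for IFF_DRV_RUNNING, the value 2048 is an arbitrary example size, and real driver code would also need the locking and memory ordering that this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down adapter state for the illustration. */
struct sim_adapter {
    bool   running;       /* stands in for IFF_DRV_RUNNING */
    size_t rx_mbuf_sz;    /* stands in for adapter->rx_mbuf_sz */
    int    rx_processed;  /* receive passes that actually ran */
};

/* Init path: finish all setup first, set the running flag last. */
static void sim_init(struct sim_adapter *a)
{
    a->rx_mbuf_sz = 2048; /* example value only */
    a->running = true;    /* last step, as in the patch */
}

/* Interrupt path: ignore interrupts that arrive before init is done. */
static void sim_msix_que(struct sim_adapter *a)
{
    if (!a->running)
        return;           /* early interrupt, nothing is set up yet */

    /* From here on the receive path may rely on rx_mbuf_sz. */
    assert(a->rx_mbuf_sz != 0);
    a->rx_processed++;
}
```

An "interrupt" delivered before sim_init() is dropped, while one delivered afterwards runs the receive pass with a non-zero buffer size, which is the property the early return in the handler is meant to provide.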