Royce Williams
2009-Nov-12 20:01 UTC
82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)
We have servers with dual 82573 NICs that work well during low-throughput activity, but during high-volume activity, they pause shortly after transfers start and do not recover. Other sessions to the system are not affected.

These systems are being repurposed, jumping from 6.3 to 7.2. The same system and its kin do not exhibit the symptom under 6.3-RELEASE-p13. The symptoms appear under a freebsd-updated 7.2-RELEASE GENERIC kernel with no tuning.

Previously, we've been using DCGDIS.EXE (from Jack Vogel) for this symptom. The first system to be repurposed accepts DCGDIS with 'Updated' and a subsequent 'update not needed', with no relief.

Notably, there are no watchdog timeout errors - unlike our various Supermicro models still running FreeBSD 6.x. All of our other 7.x Supermicro flavors had already received the flash update and haven't shown the symptom.

Details follow.

Kernel:

rand# uname -a
FreeBSD rand.acsalaska.net 7.2-RELEASE-p4 FreeBSD 7.2-RELEASE-p4 #0: Fri Oct  2 12:21:39 UTC 2009     root@i386-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  i386

sysctls:

rand# sysctl dev.em
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 6.9.6
dev.em.0.%driver: em
dev.em.0.%location: slot=0 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x108c subvendor=0x15d9 subdevice=0x108c class=0x020000
dev.em.0.%parent: pci13
dev.em.0.debug: -1
dev.em.0.stats: -1
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.rx_processing_limit: 100
dev.em.1.%desc: Intel(R) PRO/1000 Network Connection 6.9.6
dev.em.1.%driver: em
dev.em.1.%location: slot=0 function=0
dev.em.1.%pnpinfo: vendor=0x8086 device=0x108c subvendor=0x15d9 subdevice=0x108c class=0x020000
dev.em.1.%parent: pci14
dev.em.1.debug: -1
dev.em.1.stats: -1
dev.em.1.rx_int_delay: 0
dev.em.1.tx_int_delay: 66
dev.em.1.rx_abs_int_delay: 66
dev.em.1.tx_abs_int_delay: 66
dev.em.1.rx_processing_limit: 100

kenv:

rand# kenv | grep smbios | egrep -v 'socket|serial|uuid|tag|0123456789'
smbios.bios.reldate="03/05/2008"
smbios.bios.vendor="Phoenix Technologies LTD"
smbios.bios.version="6.00"
smbios.chassis.maker="Supermicro"
smbios.planar.maker="Supermicro"
smbios.planar.product="PDSMi "
smbios.planar.version="PCB Version"
smbios.system.maker="Supermicro"
smbios.system.product="PDSMi"

The system is not yet production, so I can invasively abuse it if needed. The other systems are in production under 6.3-RELEASE-p13 and can also be inspected.

Any pointers appreciated.

Royce
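For reference, a minimal way to trigger and watch the stall (just a sketch; it assumes nc(1) is available on both ends and that em0 is the active interface - adjust names as needed):

```shell
# On the affected 7.2 box: sink a bulk TCP stream on an arbitrary port.
nc -l 5001 > /dev/null &

# On any other host on the LAN, push data as fast as the link allows
# (hostname here is just this box's name):
#   dd if=/dev/zero bs=64k | nc rand.acsalaska.net 5001

# Meanwhile, on the affected box, watch per-second interface counters.
# The pause shows up as input packets flatlining with no matching
# rise in errs or drops:
netstat -I em0 -w 1
```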
Jeremy Chadwick
2009-Nov-12 20:47 UTC
82573 xfers pause, no watchdog timeouts, DCGDIS ineffective (7.2-R)
On Thu, Nov 12, 2009 at 10:36:16AM -0900, Royce Williams wrote:
> We have servers with dual 82573 NICs that work well during
> low-throughput activity, but during high-volume activity, they pause
> shortly after transfers start and do not recover. Other sessions to
> the system are not affected.

Please define "low-throughput" and "high-volume" if you could; it might help folks determine where the threshold is for problems.

> These systems are being repurposed, jumping from 6.3 to 7.2. The same
> system and its kin do not exhibit the symptom under 6.3-RELEASE-p13.
> The symptoms appear under a freebsd-updated 7.2-RELEASE GENERIC
> kernel with no tuning.
>
> Previously, we've been using DCGDIS.EXE (from Jack Vogel) for this
> symptom. The first system to be repurposed accepts DCGDIS with
> 'Updated' and a subsequent 'update not needed', with no relief.
>
> Notably, there are no watchdog timeout errors - unlike our various
> Supermicro models still running FreeBSD 6.x. All of our other 7.x
> Supermicro flavors had already received the flash update and haven't
> shown the symptom.
>
> Details follow.
> [quoted kernel, sysctl, and kenv output snipped]
>
> The system is not yet production, so I can invasively abuse it if
> needed. The other systems are in production under 6.3-RELEASE-p13
> and can also be inspected.
>
> Any pointers appreciated.
> Royce

For what it's worth as a comparison base: we use the following Supermicro SuperServers, and can confirm that no such issues occur for us using RELENG_6 or RELENG_7 on the following hardware:

Supermicro SuperServer 5015B-MTB  - amd64 - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B  - amd64 - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B  - amd64 - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B  - i386  - Intel 82573V + Intel 82573L
Supermicro SuperServer 5015M-T+B  - i386  - Intel 82573V + Intel 82573L

The 5015B-MTB system presently runs RELENG_8 -- no issues there either.

Relevant server configuration and network setup details:

- All machines use pf(4).
- All emX devices are configured for autoneg.
- All emX devices use RXCSUM, TXCSUM, and TSO4.
- We do not use polling.
- All machines use both NICs simultaneously at all times.
- All machines are connected to an HP ProCurve 2626 switch (100mbit,
  full-duplex ports, all autoneg).
- We do not use jumbo frames.
- No add-in cards (PCI, PCI-X, or PCIe) are used in the systems.
- All of the systems had DCGDIS.EXE run on them; no EEPROM settings
  were changed, indicating the from-the-Intel-factory MANC register in
  question was set properly.

Relevant throughput details per box:

- em0 pushes ~600-1000kbit/sec at all times.
- em1 pushes ~100-200kbit/sec at all times.
- During nightly maintenance (backups), em1 pushes ~2-3mbit/sec for a
  variable amount of time.
- For a full level 0 backup (which I've done numerous times), em1
  pushes 60-70mbit/sec without issues.

I've compared your sysctl dev.em output to that of our 5015M-T+B systems (which use the PDSMi+, not the PDSMi, but whatever), and ours is 100% identical. All of our 5015M-T+B systems are using BIOS 1.3, and the 5015B-MTB system is using BIOS 1.30.
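One quick isolation step you could try on the non-production box (a sketch - we haven't needed it ourselves, and it assumes em0 is the affected interface): turn off the offload features, and dump the driver's internal state while the transfer is wedged.

```shell
# Disable TSO and checksum offload; if the stall goes away, an offload
# path in em(4) is the likely suspect (re-enable with: tso txcsum rxcsum).
ifconfig em0 -tso -txcsum -rxcsum

# While wedged, ask the driver to dump its state and counters to the
# kernel message buffer, then read them back:
sysctl dev.em.0.debug=1
sysctl dev.em.0.stats=1
dmesg | tail -n 40
```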
If you'd like, I can provide the exact BIOS settings we use on the machines in question; they do deviate from the factory defaults a slight bit, but none of the adjustments are "tweaks" for performance or otherwise (just disabling things which we don't use, etc.).

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |