Dear folks,
We are installing a large diskless cluster using CentOS 5.1. The
hardware is pretty new - Supermicro X7DWT boards with Harpertown CPUs.
Unfortunately we have some PXE-related problems described by the
following scenario:
1) Set up DHCP, TFTP and NFS on a server, prepare PXE kernel and initrd
- fine.
2) Start up the node using PXE for the first time - fine.
3) Reboot the node - every subsequent PXE boot attempt fails. We see that
the server receives the DHCPDISCOVER requests and answers them, but the node
never responds to the offer with a DHCPREQUEST, so no DHCPACK follows. The
typical DHCP log is:
Jan 5 09:14:34 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan 5 09:14:34 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
Jan 5 09:14:36 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan 5 09:14:36 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
Jan 5 09:14:40 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan 5 09:14:40 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
Jan 5 09:14:48 shoffner dhcpd: DHCPDISCOVER from 00:30:48:7e:24:a6 via eth1
Jan 5 09:14:48 shoffner dhcpd: DHCPOFFER on 10.1.5.2 to 00:30:48:7e:24:a6 via eth1
4) Restarting the DHCP server, resetting the node, or powering the node off
and on does not help.
5) The only thing that enables the system to boot over PXE again is running
the "bmc reset cold" command on the node using ipmitool - yes, we have an
IPMI card sharing the same Ethernet interface. After that we can boot
CentOS again.
6) Once Linux is loaded, if we restart the node using "bmc power cycle"
instead of reboot or shutdown, the node boots without problems the next
time.
7) There are no problems with the second GbE interface (without IPMI).
8) So our guess is that Linux, on reboot, leaves the Ethernet device in some
state that causes brain damage for the IPMI+PXE combination. We tried some
e1000 driver options, and we also tried the latest Intel driver - nothing
helps.
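To spot affected nodes quickly, the DHCP log can be summarised per MAC address. The following is only a rough sketch: the function name dhcp_summary and the STUCK/ok labels are made up for illustration, and it assumes dhcpd syslog lines in the same format as the excerpt in step 3, one message per line. A client is flagged when it was offered a lease but never sent a DHCPREQUEST.

```shell
#!/bin/sh
# Summarise dhcpd syslog lines per client MAC address (illustrative sketch).
# Reads log lines on stdin and prints one line per MAC with message counts.
# A MAC is marked STUCK if it got at least one DHCPOFFER but sent no DHCPREQUEST.
dhcp_summary() {
    awk '
    BEGIN {
        # Build a regex for exactly six colon-separated hex pairs, so that
        # timestamps such as 09:14:34 are not mistaken for MAC addresses.
        h = "[0-9A-Fa-f][0-9A-Fa-f]"
        mac_re = "^" h ":" h ":" h ":" h ":" h ":" h "$"
    }
    {
        msg = ""; mac = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^DHCP[A-Z]+$/) msg = $i
            else if ($i ~ mac_re) mac = $i
        }
        # Skip lines that are not DHCP messages tied to a client MAC.
        if (msg == "" || mac == "") next
        if (!(mac in seen)) { seen[mac] = 1; order[++n] = mac }
        count[mac, msg]++
    }
    END {
        # Report MACs in the order they first appeared in the log.
        for (k = 1; k <= n; k++) {
            m = order[k]
            d = count[m, "DHCPDISCOVER"] + 0
            o = count[m, "DHCPOFFER"] + 0
            r = count[m, "DHCPREQUEST"] + 0
            a = count[m, "DHCPACK"] + 0
            status = (o > 0 && r == 0) ? "STUCK" : "ok"
            printf "%s discover=%d offer=%d request=%d ack=%d %s\n", m, d, o, r, a, status
        }
    }'
}
```

It could be run as "dhcp_summary < /var/log/messages", assuming dhcpd logs to that file on the server. Lines that are not DHCP messages are ignored, so the whole syslog file can be fed in without filtering first.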
Do you have any idea what is going wrong? Any help will be much appreciated.
A system summary is below:
[root@node-05-03 ~]# uname -a
Linux node-05-03 2.6.18-53.1.4.el5 #1 SMP Fri Nov 30 00:45:55 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@node-05-03 ~]# lspci
00:00.0 Host bridge: Intel Corporation Memory Controller Hub (rev 20)
00:01.0 PCI bridge: Intel Corporation PCI Express Port 1 (rev 20)
00:05.0 PCI bridge: Intel Corporation PCI Express Port 5 (rev 20)
00:07.0 PCI bridge: Intel Corporation PCI Express Port 7 (rev 20)
00:0f.0 System peripheral: Intel Corporation DMA/DCA Engine (rev 20)
00:10.0 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.1 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.2 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.3 Host bridge: Intel Corporation FSB Registers (rev 20)
00:10.4 Host bridge: Intel Corporation FSB Registers (rev 20)
00:11.0 Host bridge: Intel Corporation Unknown device 4031 (rev 20)
00:15.0 Host bridge: Intel Corporation FBD Registers (rev 20)
00:15.1 Host bridge: Intel Corporation FBD Registers (rev 20)
00:16.0 Host bridge: Intel Corporation FBD Registers (rev 20)
00:16.1 Host bridge: Intel Corporation FBD Registers (rev 20)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09)
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
02:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
03:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
03:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E3 (rev 01)
05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
05:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
08:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
Thanks in advance,
Andrey