I've searched all the archives and can't find anything relating to my problem.

I have a 128-node cluster. All NICs are Nvidia. I have them on a network with other machines, and I do PXE installs of Fedora on them. These 128 nodes have a problem with the install. The initial DHCP requests get answered, the machine accepts, and then the TFTP sequence takes place. The second DHCP request is sent, and the server sees it and answers. The offer gets to the machine (I've seen it with a tap on the wire), but the machine doesn't respond to the offer. This happens about 99.9% or more of the time on all machines. If I try to set the address statically, I get the same behavior. It appears the NIC is up, but I can't ping it, although I occasionally see packets coming from it.

I have another network that is very similar. If I move the nodes to that network, the install works properly. Package/DHCP servers are set up identically on both networks. Each network has similar devices, but not identical ones. The main router on the non-working network is an Extreme Aspen. The main router on the working network is a Cisco 6800.

One problem I have is that the 128 nodes are all headless, so I can't see any logging information on the serial port during the install.

Any suggestions are appreciated.

Thanks

--
Tony Heaton
HPC-5
(505) 667-9015
theaton at lanl.gov

'But when a long Train of Abuses and Usurpations, pursuing invariably the same Object, evinces a Design to reduce them under absolute Despotism, it is their Right, it is their Duty, to throw off such Government, and to provide new Guards for their Future Security.' - United States Declaration of Independence
Tony Heaton wrote:
> I've searched all the archives and can't find anything relating to my
> problem. I have a 128 node cluster. All NICs are Nvidia. I have them
> on a network with other machines. I do PXE installs of Fedora on them.
> These 128 nodes have a problem with the install. The initial DHCP
> requests get answered, the machine accepts and then the TFTP sequence
> takes place. The second DHCP request is sent and the server sees it and
> answers. The offer gets to the machine ( I've seen it with a tap on the
> wire ). The machine doesn't respond to the offer. This happens about
> 99.9% or more of the time on all machines. If I try and set the address
> statically I get the same behavior. It appears the NIC is up but I
> can't ping it but I occasionally see packets coming from it. I have
> another network that is very similar. If I move the nodes to that
> network the install works properly. Package/DHCP servers are set up
> identically on both networks. Each network has similar devices but not
> identical. The main router on the non-working network is an Extreme
> Aspen. The main router on the working network is a Cisco 6800. One
> problem I have is that the 128 nodes are all headless so I can't see any
> logging information on the serial port during the install. Any
> suggestions are appreciated.

There was a bug in the Nvidia PXE stack at some point that would make the NIC unavailable after PXE had used it, not just to the Linux forcedeth driver but also to the Windows driver. There seems to have been an update to the driver to reset the appropriate state, because I haven't been able to reproduce it recently. Depending on your kernel version, though, this might be the problem you're seeing.

	-hpa
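
The symptom described in the original message (an OFFER reaches the node but no REQUEST follows) can be checked mechanically against payloads pulled from a wire capture. What follows is only a minimal sketch, not anything used in this thread: it assumes you already have the raw UDP payloads of the DHCP packets in capture order, and the function names are hypothetical. The field offsets, magic cookie, and option 53 values come from the standard BOOTP/DHCP layout.

```python
import struct

# Standard BOOTP/DHCP layout: 236-byte fixed header, then a 4-byte magic cookie,
# then a list of (code, length, data) options.
BOOTP_HEADER_LEN = 236
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"
OPTION_PAD = 0
OPTION_MESSAGE_TYPE = 53
OPTION_END = 255

MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def transaction_id(payload):
    """Return the 32-bit xid from a BOOTP/DHCP payload, or None if too short."""
    if len(payload) < 8:
        return None
    return struct.unpack("!I", payload[4:8])[0]


def dhcp_message_type(payload):
    """Return the DHCP message type name (option 53), or None if absent."""
    options_start = BOOTP_HEADER_LEN + len(DHCP_MAGIC_COOKIE)
    if len(payload) < options_start:
        return None
    if payload[BOOTP_HEADER_LEN:options_start] != DHCP_MAGIC_COOKIE:
        return None  # plain BOOTP or not a DHCP packet at all
    i = options_start
    while i < len(payload):
        code = payload[i]
        if code == OPTION_PAD:      # pad option has no length byte
            i += 1
            continue
        if code == OPTION_END or i + 1 >= len(payload):
            break
        length = payload[i + 1]
        data = payload[i + 2:i + 2 + length]
        if code == OPTION_MESSAGE_TYPE and len(data) == 1:
            return MESSAGE_TYPES.get(data[0])
        i += 2 + length
    return None


def unanswered_offers(payloads):
    """Return sorted xids for which an OFFER was seen with no later REQUEST.

    `payloads` is an iterable of raw UDP payloads in capture order
    (hypothetical input; how you extract them from the capture is up to you).
    """
    pending = set()
    for payload in payloads:
        xid = transaction_id(payload)
        kind = dhcp_message_type(payload)
        if kind == "OFFER":
            pending.add(xid)
        elif kind == "REQUEST":
            pending.discard(xid)
    return sorted(pending)
```

If the xids returned by `unanswered_offers` all belong to the second DHCP exchange (the one started by the installer kernel rather than the PXE ROM), that would be consistent with the NIC being left in a bad state after PXE, though it would not prove it.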