Joe Mroczek
2006-Jan-29 03:24 UTC
[syslinux] PXELinux sporadic hangs searching for config file
Very sporadically (1 out of every 40-5000 boots), our blade system will hang indefinitely while PXELinux is attempting to locate its configuration file. This is causing our automated testing to hang and generate failures. I have put more details below.

My concerns are threefold:
1) System hangs
2) PXELinux bootstrap never seems to retry the transaction
3) PXELinux bootstrap never seems to reboot the system

Versions of PXELinux used: 3.11, 2.04
PXE agent: Intel Boot Agent for e1000 NICs 1.2.14, 1.2.16 (latest)
NICs: 2x e1000 (82544ei integrated into board)
CPUs: 2x LV Xeon @ 1.6 GHz or 2.0 GHz
Chipset: LV E7501

Log from PXE server:
Jan 24 16:34:34 ssh-pad dhcpd: DHCPREQUEST for 192.168.77.254 (192.168.77.77) from 00:0e:0c:52:d0:8b via eth1
Jan 24 16:34:34 ssh-pad dhcpd: DHCPACK on 192.168.77.254 to 00:0e:0c:52:d0:8b via eth1
Jan 25 00:34:34 ssh-pad in.tftpd[7324]: RRQ from 192.168.77.254 filename pxelinux.0
Jan 25 00:34:34 ssh-pad in.tftpd[7324]: tftp: client does not accept options
Jan 25 00:34:34 ssh-pad in.tftpd[7325]: RRQ from 192.168.77.254 filename pxelinux.0
Jan 25 00:34:35 ssh-pad in.tftpd[7326]: RRQ from 192.168.77.254 filename pxelinux.cfg/01-00-0e-0c-52-d0-8b
Jan 25 00:34:35 ssh-pad in.tftpd[7326]: sending NAK (1, File not found) to 192.168.77.254
Jan 25 00:34:35 ssh-pad in.tftpd[7327]: RRQ from 192.168.77.254 filename pxelinux.cfg/C0A84DFE
Jan 25 00:34:35 ssh-pad in.tftpd[7327]: sending NAK (1, File not found) to 192.168.77.254
Jan 25 00:34:35 ssh-pad in.tftpd[7328]: RRQ from 192.168.77.254 filename pxelinux.cfg/C0A84DF
Jan 25 00:34:35 ssh-pad in.tftpd[7328]: sending NAK (1, File not found) to 192.168.77.254
Jan 25 00:34:35 ssh-pad in.tftpd[7329]: RRQ from 192.168.77.254 filename pxelinux.cfg/C0A84D
Jan 25 00:34:35 ssh-pad in.tftpd[7329]: sending NAK (1, File not found) to 192.168.77.254
Jan 25 00:34:35 ssh-pad in.tftpd[7330]: RRQ from 192.168.77.254 filename pxelinux.cfg/C0A84
Jan 25 00:34:35 ssh-pad in.tftpd[7330]: sending NAK (1, File not found) to 192.168.77.254
Jan 25 00:34:35 ssh-pad in.tftpd[7331]: RRQ from 192.168.77.254 filename pxelinux.cfg/C0A8
Jan 25 00:34:35 ssh-pad in.tftpd[7331]: sending NAK (1, File not found) to 192.168.77.254
Jan 25 00:34:35 ssh-pad in.tftpd[7332]: RRQ from 192.168.77.254 filename pxelinux.cfg/C0A
Jan 25 00:34:35 ssh-pad in.tftpd[7332]: sending NAK (1, File not found) to 192.168.77.254

Ethereal shows the last NAK going across the wire. We had Intel come in and look for issues within their boot agent, and they were able to trace the NAK all the way up to pxelinux. Further, they could not find any sign of corruption of the PXE structure, and the RX and TX routines in the boot agent were still running. For some reason pxelinux was still calling the RX routine as if it was still expecting something.

Any clues on how to either debug the root issue, or at least get PXELinux to retry the failed transaction?

Regards,
Joe M.

Can you provide any details on how
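The request sequence in the TFTP log above matches the documented PXELINUX config-file search: first a name built from the ARP hardware type (01 for Ethernet) and the MAC address, then the client's IPv4 address in uppercase hex with one trailing digit dropped per attempt, and finally "default". A minimal Python sketch of that ordering follows. The function name config_search_order is made up for illustration, this is not PXELINUX's own code, and some versions also try a client-UUID name first, which is left out here.

```python
import ipaddress


def config_search_order(mac: str, ip: str, prefix: str = "pxelinux.cfg/") -> list[str]:
    """Return candidate config file names in the order they would be requested."""
    # Hardware type 01 (Ethernet) plus the MAC, lowercase and dash-separated.
    names = ["01-" + mac.lower().replace(":", "-")]
    # IPv4 address as eight uppercase hex digits, e.g. 192.168.77.254 -> C0A84DFE.
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    # Drop one trailing hex digit per attempt until a single digit is left.
    names += [hex_ip[:n] for n in range(8, 0, -1)]
    # Last resort when nothing more specific exists.
    names.append("default")
    return [prefix + name for name in names]
```

Calling config_search_order("00:0e:0c:52:d0:8b", "192.168.77.254") yields the names seen in the log, in the same order; the hang reported here occurred after the pxelinux.cfg/C0A request, before the remaining names were tried.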
H. Peter Anvin
2006-Jan-29 19:17 UTC
[syslinux] PXELinux sporadic hangs searching for config file
Joe Mroczek wrote:
>
> Ethereal shows the last NAK going across the wire. We had Intel come in and
> look for issues within their boot agent and were able to trace the NAK all
> the way up to pxelinux. Further they could not find any sign of corruption
> of the PXE structure and the RX and TX routines in boot agent were still
> running. For some reason pxelinux was still calling the RX routine as if it
> was still expecting something.
>
> Any clues on how to either debug the root issue, or at least get PXELinux to
> retry the failed transaction?
>

You've ended up at the command prompt, and pxelinux calls the receive routine so that the PXE stack can respond to ARP. The real question is why your server returns file not found.

However, it has been discussed that the default syslinux behaviour (command prompt) may not be appropriate for pxelinux.

-hpa
Joe Mroczek
2006-Feb-06 19:45 UTC
[syslinux] PXELinux sporadic hangs searching for config file
On 1/29/06, H. Peter Anvin <hpa at zytor.com> wrote:
> Joe Mroczek wrote:
> > I should have included what we get on our serial console, which at
> > least appears to match up to a VGA console when it is attached. This
> > obviously does not match up with the TFTP server log from earlier,
> > however it should still illustrate what we are seeing:
> >
> > - pxelinux is 2.04 2003-04-16 H.Peter Anvin
> > - UNDI data segment at: 00089930
> > UNDI data segment size : B440
> > UNDI code segment at : 00084d70
> > UNDI Code segment size: 2e00
> > - pxe entry point found(we hope) at 94d7:0106
> > Trying to load: pxelinux.cfg/80000afe
> > Trying to load: pxelinux.cfg/80000af
> > Trying to load: pxelinux.cfg/80000a
> > Trying to load: pxelinux.cfg/80000
> > Trying to load: pxelinux.cfg/8000
> >
> > There is nothing more on either console after this. We have let the
> > system sit for over 2 days with no further reporting to the console.
> > When checking the TFTP log we see the NAK for the last file printed to
> > the screen, a sniffer attached to a snoop port on an intermediary
> > switch shows the NAK and it appears well formatted.
> >
>
> OK, you have me flabbergasted. No idea what the issue is.

This goes back to my original questions:

1) If a NAK is not received for a file, will PXELinux re-request the file?

2) If a NAK is received but corrupted, will PXELinux re-request the file?

3) Is there a timer in PXELinux that should reboot the machine if boot fails? If so, is it expected to fire in the above listed circumstances?

> -hpa
>
> P.S. Please don't take things off-list.

Sorry, gmail seems to lack a reply-to-all button.

~joe
H. Peter Anvin
2006-Feb-06 23:08 UTC
[syslinux] PXELinux sporadic hangs searching for config file
Joe Mroczek wrote:
>>
>> OK, you have me flabbergasted. No idea what the issue is.
>
> This goes back to my original questions:
>
> 1) If NAK is not received for a file, will PXELinux rerequest the file?
>
> 2) If NAK is received but corrupted, will PXELinux rerequest the file?
>
> 3) Is there a timer in PXELinux that should reboot the machine if boot
> fails? If so, is it expected to fire in the above listed
> circumstances?
>

Yes, yes, and yes.

-hpa
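The three behaviours confirmed above (re-request when no reply arrives, re-request when the reply is malformed, give up after repeated failure so the machine can be reset) can be sketched as a bounded retry loop. This is only an illustration in Python, not PXELINUX's actual code: fetch_with_retry, the send and recv callables, the attempt count and the timeout value are all hypothetical. The packet layouts (RRQ is opcode 1, DATA is opcode 3, ERROR is opcode 5 followed by a 2-byte error code and a NUL-terminated message) are those of TFTP as specified in RFC 1350.

```python
import struct
from typing import Callable, Optional

OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """TFTP read request: opcode 1, then NUL-terminated filename and mode."""
    return (struct.pack("!H", OP_RRQ) + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def parse_error(packet: bytes) -> Optional[int]:
    """Return the error code of a well-formed TFTP ERROR packet, else None."""
    if len(packet) < 5 or not packet.endswith(b"\0"):
        return None
    opcode, code = struct.unpack("!HH", packet[:4])
    return code if opcode == OP_ERROR else None


def fetch_with_retry(filename: str,
                     send: Callable[[bytes], object],
                     recv: Callable[[float], Optional[bytes]],
                     attempts: int = 4,
                     timeout: float = 1.0) -> str:
    """Request a file, retrying on silence or garbage (hypothetical transport)."""
    request = build_rrq(filename)
    for _ in range(attempts):
        send(request)
        reply = recv(timeout)  # assumed to return None when the timeout expires
        if reply is None:
            continue  # question 1: nothing received, re-request
        code = parse_error(reply)
        if code is not None:
            # A clean NAK: code 1 means "File not found", so try the next name.
            return "not_found" if code == 1 else "error"
        if len(reply) >= 4 and struct.unpack("!H", reply[:2])[0] == OP_DATA:
            return "data"
        # question 2: reply was malformed, fall through and re-request
    return "give_up"  # question 3: the caller would reset the machine here
```

Passing the transport in as two callables keeps the sketch testable without a network; a real client would wrap a UDP socket with a receive timeout.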
Joe Mroczek
2006-Feb-07 00:53 UTC
[syslinux] PXELinux sporadic hangs searching for config file
On 2/6/06, H. Peter Anvin <hpa at zytor.com> wrote:
> Joe Mroczek wrote:
> >>
> >> OK, you have me flabbergasted. No idea what the issue is.
> >
> > This goes back to my original questions:
> >
> > 1) If NAK is not received for a file, will PXELinux rerequest the file?
> >
> > 2) If NAK is received but corrupted, will PXELinux rerequest the file?
> >
> > 3) Is there a timer in PXELinux that should reboot the machine if boot
> > fails? If so, is it expected to fire in the above listed
> > circumstances?
> >
>
> Yes, yes, and yes.
>

Thank you. I am off to get RedHat to examine PXELinux to understand why none of the three normal outcomes is coming to pass in this scenario.

Regards,
Joe M.
Joe Mroczek
2006-Mar-13 22:35 UTC
[syslinux] PXELinux sporadic hangs searching for config file
On 1/29/06, H. Peter Anvin <hpa at zytor.com> wrote:
> Joe Mroczek wrote:
> > I should have included what we get on our serial console, which at
> > least appears to match up to a VGA console when it is attached. This
> > obviously does not match up with the TFTP server log from earlier,
> > however it should still illustrate what we are seeing:
> >
> > - pxelinux is 2.04 2003-04-16 H.Peter Anvin
> > - UNDI data segment at: 00089930
> > UNDI data segment size : B440
> > UNDI code segment at : 00084d70
> > UNDI Code segment size: 2e00
> > - pxe entry point found(we hope) at 94d7:0106
> > Trying to load: pxelinux.cfg/80000afe
> > Trying to load: pxelinux.cfg/80000af
> > Trying to load: pxelinux.cfg/80000a
> > Trying to load: pxelinux.cfg/80000
> > Trying to load: pxelinux.cfg/8000
> >
> > There is nothing more on either console after this. We have let the
> > system sit for over 2 days with no further reporting to the console.
> > When checking the TFTP log we see the NAK for the last file printed to
> > the screen, a sniffer attached to a snoop port on an intermediary
> > switch shows the NAK and it appears well formatted.
> >
>
> OK, you have me flabbergasted. No idea what the issue is.
>
> -hpa

FYI... Root cause has been found. The UDPRead function in PXE was not exiting. This was caused by the serial console handler improperly disabling an interrupt the PXE agent was using for a timer function. The net result was that PXELinux hung waiting for a response that would never come.

> P.S. Please don't take things off-list.

Sorry, hit the wrong button in Gmail.

~joe
H. Peter Anvin
2006-Mar-14 01:58 UTC
[syslinux] PXELinux sporadic hangs searching for config file
Joe Mroczek wrote:
>
> FYI... Root cause has been found. UDPRead function in PXE was not
> exiting. This was caused by the serial console handler improperly
> disabling an interrupt the PXE agent was using for a timer function.
> The net result was that PXELinux hung waiting for a response that
> would never come.
>

I take it this is a serial console handler built into the BIOS, as opposed to the serial console support in pxelinux (since that never gets run due to no config file?)

-hpa
Joe Mroczek
2006-Mar-16 01:05 UTC
[syslinux] PXELinux sporadic hangs searching for config file
On 3/13/06, H. Peter Anvin <hpa at zytor.com> wrote:
> Joe Mroczek wrote:
> >
> > FYI... Root cause has been found. UDPRead function in PXE was not
> > exiting. This was caused by the serial console handler improperly
> > disabling an interrupt the PXE agent was using for a timer function.
> > The net result was that PXELinux hung waiting for a response that
> > would never come.
> >
>
> I take it this is a serial console handler built into the BIOS, as
> opposed to the serial console support in pxelinux (since that never gets
> run due to no config file?)
>
> -hpa

That is correct.

~joe
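The failure mode identified in this thread, a receive wait whose timeout depends on a timer interrupt that something else has disabled, can be illustrated with a small simulation. This is a Python sketch under loud assumptions: wait_for_packet, the tick counter and the max_polls guard are invented for illustration and do not reflect the actual Intel boot agent or BIOS code.

```python
from typing import Callable, Optional


def wait_for_packet(poll: Callable[[], Optional[bytes]],
                    read_ticks: Callable[[], int],
                    timeout_ticks: int,
                    max_polls: Optional[int] = None) -> str:
    """Poll for a packet until one arrives or the tick-based timeout expires.

    read_ticks stands in for a counter advanced by a timer interrupt.
    If that interrupt is disabled the counter never advances, so the
    timeout can never expire; max_polls is an independent safety bound.
    """
    start = read_ticks()
    polls = 0
    while True:
        if poll() is not None:
            return "received"
        if read_ticks() - start >= timeout_ticks:
            return "timeout"
        polls += 1
        if max_polls is not None and polls >= max_polls:
            # Without this guard, a frozen tick counter means an endless loop.
            return "stuck"
```

With a tick source that advances, the function returns "timeout" and the caller can retry or reset. With a frozen tick source and no max_polls, it never returns, which mirrors the hang described above.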