Oscar Roozen
2015-Jun-17 15:24 UTC
[syslinux] EFI & PXE-booting: very slow TFTP performance on a VMWare test setup
Dear people on the Syslinux Mailinglist,

Are there any known problems with the performance of TFTP in (U)EFI environments in general, or maybe just on VMWare?

Right now I'm running tests on two virtual environments, using either VMWare Workstation 11.1.0 on a Fedora 21 system or VMWare Player 7.1.0 on an Ubuntu 14.10 system. Both virtual systems support PXE booting in UEFI mode. Both systems behave exactly the same.

The test setup is as follows:

Master:
  2 CPU, 2GB RAM,
  Network Adapter "Host-only"
  Centos 7.1
  * syslinux-6.03
  * isc-dhcpd-V3.0.5-RedHat
  * atftp-0.7.1

Slaves:
  2 CPU, 2GB RAM,
  Network Adapter "Host-only"
  Configured to boot from the network only

The Master is configured to provide a pxelinux.0 and supporting files through TFTP. The client boots, gets a DHCP answer, loads pxelinux.0, presents a menu through menu.c32, defaults to loading the kernel and an initrd, and starts booting from there. No problems here in BIOS mode. The whole process takes less than 8 seconds, measured from the moment I hit enter in the menu.

One alternate choice in our menu loads a rescue environment with a much bigger initrd (87M instead of 13M), which loads in 12 seconds.

Now I'm testing the same setup in UEFI mode. The client boots, gets an answer from the DHCPd, loads efi64/syslinux.efi and the supporting files from the efi64 subtree, presents a menu, starts loading the kernel and initrd, and continues booting from there. Sounds good? Yes, but instead of 8 seconds, this all takes at least a few minutes! And quite often the system just hangs before even presenting the menu, or presents an empty menu, indicating it failed to load one of the .c32 files or menu definitions. No errors are shown. Sometimes VMWare pop-ups appear indicating the machine crashed on invalid code or could not load an OS.
If the boot process finishes successfully, the kernel and initrd go on into our custom installer, and after the slave is up and running, its network performance (actually, all performance) is equal to that of the slave that was booted in BIOS mode.

It turns out to be impossible to load the rescue environment, though. Loading the 87M initrd takes longer than 15 minutes. After 15 minutes the VMWare workstation (or is it syslinux?) decides to reboot.

By the way, this is not an ARP issue or anything like that. I did check that both slaves have different MAC addresses, and I think I covered most other obvious causes. The only obvious thing I have not yet tried is to run the setup on real machines or in another virtual environment.

(Actually, I tried VirtualBox first but could not get it to PXE boot in UEFI mode.)

So, the simple question is: is anyone else experiencing slow performance while network-booting in UEFI mode? Is it caused by the UEFI API implementation of VMWare, is there an issue in gnu-efi, or is it even the way syslinux interacts with the EFI layer that causes this?

This is the question I've been asking myself for the past few days.

I've been debugging the syslinux code by inspection and by adding debug statements everywhere. I also removed "git submodule update --init" from efi/build-gnu-efi.sh so I could rebuild syslinux locally using:

# make spotless
# make bios
# make efi64

What I've found so far:

/* TCPdump/Wireshark analysis - problem 1 */

Running tcpdump from the host on /dev/vmnet1, I found the following.

The file "syslinux.efi" loads from a client that does not understand the additional options. This is probably the native UEFI firmware in action. Also, this step never fails. The next file will always at least start to load, even when the slave decides to crash before (or after?) the next file finishes downloading. Even ldlinux can fail to load.

On bigger files, there is an increasing chance that core_udp_recv in efi/udp.c hits the 15ms (or is it 15 undefined jiffies?)
timeout and returns -1 to its caller in core/fs/pxe/tftp.c:tftp_get_packet(). This function then goes to ack_again, causing atftpd to resend the packet that just timed out. Then, on the next call, the original(!) packet is received anyway and handled according to plan.

But the next packet to be received will be the same packet again, causing tftp.c to hit this code:

    if (serial != last_pkt) {
        /*
         * Wrong packet, ACK the packet and try again.
         * This is presumably because the ACK got lost,
         * so the server just resent the previous packet.
         */
< #if 0
<         printf("Wrong packet, wanted %04x, got %04x\n", \
<                htons(last_pkt), htons(*(uint16_t *)(data+2)));
---
> #if 1
>         printf("Wrong packet, wanted %04x, got %04x\n", \
>                last_pkt, serial);
#endif
        goto ack_again;
    }

So, what happens now is that the next packet will also be sent twice. From now on, every packet will be sent twice! Until another one hits the timeout; from then on, each packet will be sent three times.

Unfortunately, this is not *the* problem. It accounts for some delay, but this timeout only happens a few times. Typically 1 to 3 times during a download. It should still load the kernel and initrd in less than 24s, not in over 5 minutes. So, there's something else going on, but what?

I tried to figure out why a UDP packet on a virtual network with unlimited bandwidth and zero packet loss would hit a 15ms timeout. The comment says it waits for 15ms, but it does this by counting volatile uint32_t __jiffies, which gets incremented from efi/main.c by an event initiated from the UEFI API. According to the comment near the #define DEFAULT_TIMER_TICK_DURATION, the jiffies should increment every 50 msec. So a jiffy is not 1 msec but 50 times longer. If this comment is correct, the actual timeout isn't 15ms but 750ms. Or is it 750us because a few zeros are missing? I don't have the UEFI spec, so I can't verify.

What I did next was just increase the timeout in efi/udp.c:core_udp_recv to 60 jiffies.
This made the double-ACK problem disappear completely. From now on, every packet is received within the timeout. All calls to core_udp_recv return 0. But still, the booting process sometimes crashes, and loading the kernel and initrd takes more than a few minutes. I have yet to successfully boot the rescue environment through TFTP, because it still takes more than 15 minutes to load.

This leads me to the conclusion that this double-ACK problem is not the cause of the bad performance, so I had to investigate further.

/* TCPdump/Wireshark analysis - problem 2 */

After configuring Wireshark to display the difference in time between each DATA and ACK packet, I found another weirdness. Obviously, there is a delay on the client side between receiving a DATA packet and sending an ACK, and on the server side between receiving an ACK and sending the next DATA packet. To illustrate:

Client:
 * receives a DATA pkt
   (delay1)
 * sends an ACK pkt
   (wait...)

Server:
 * receives an ACK pkt
   (delay2)
 * sends the next DATA pkt
   (wait...)

Delay1 and delay2 are caused by the time it takes to evaluate the packet and to compute what packet to send next. No fancy stuff, just some easy calculations and memory transactions. So they should both be very short and stay constant. But they don't.

The server does just fine. It typically reacts within 0.000100 seconds. The client also starts out fine, reacting within about 0.000250 seconds, but this delay slowly increases. Look at the beginning of the download of vmlinuz:

No.     Time       Source          Destination     Protocol  Length  Info
15002   0.274814   10.4.0.1        10.4.255.254    TFTP         105  Read Request, File: efi64/images/default-image/initrd, Transfer type: octet, tsize\000=0\000, blksize\000=1408\000
15003   0.001082   10.4.255.254    10.4.0.1        TFTP          72  Option Acknowledgement, tsize\000=13442875\000, blksize\000=1408\000
15004   0.003735   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 0
15005   0.000081   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 1
15006   0.000236   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 1
15007   0.000076   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 2
15008   0.000220   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 2
15009   0.000051   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 3
15010   0.000244   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 3
15011   0.000038   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 4

The time displayed is the time since the previous displayed packet. So after receiving packet 15005, the client needs 0.000236 seconds to send the ACK for block 1 in packet 15006. The server then takes 0.000076 seconds to send block 2.
Now fast forward a few thousand packets and observe the delays:

51117   0.000115   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9120
51118   0.023451   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9120
51119   0.000098   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9121
51120   0.017573   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9120
51121   0.000088   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9121
51122   0.011682   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9121
51123   0.000089   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9122
51124   0.017590   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9121
51125   0.000088   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9122
51126   0.017732   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9122
51127   0.000114   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9123
51128   0.011833   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9122
51129   0.000097   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9123
51130   0.017628   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9123
51131   0.000088   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9124
51132   0.017785   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9123
51133   0.000090   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9124
51134   0.017949   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9124
51135   0.000090   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9125
51136   0.030592   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9124
51137   0.000097   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9125
51138   0.037680   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9125

See how each Acknowledgement now takes 100 times as long to send? We're not in 0.000x territory anymore; we're down to a handful of packets per second.
On the next transfer, the delays are back to normal again, only to start increasing and increasing, again slowing down to a crawl.

I've been ploughing through the code for quite some time now, but I cannot explain why this happens. The main obstacle for me is that I barely understand how this glue code to the UEFI API actually works, and that I don't have a copy of the spec.

So, is there anybody on this list willing to give it a shot?

-- 
Oscar Roozen
Linux Developer
Bright Computing BV
Gene Cumm
2015-Jun-17 17:02 UTC
[syslinux] EFI & PXE-booting: very slow TFTP performance on a VMWare test setup
On Jun 17, 2015 11:27 AM, "Oscar Roozen via Syslinux" <syslinux at zytor.com> wrote:

> Dear people on the Syslinux Mailinglist,
>
> Are there any known problems with the performance of TFTP in (U)EFI
> environments in general or maybe just on VMWare?

I don't believe so. I use Workstation 10.0.2 on Ubuntu 12.04 for the moment.

> Right now I'm running tests on two virtual environments using either
> VMWare Workstation 11.1.0 on a Fedora 21 system or VMWare Player 7.1.0
> on an Ubuntu 14.10 system. Both virtual systems support PXE booting in
> UEFI mode. Both systems behave exactly the same.
>
> The test setup is as follows:
>
> Master:
>   2 CPU, 2GB RAM,
>   Network Adapter "Host-only"
>   Centos 7.1
>   * syslinux-6.03
>   * isc-dhcpd-V3.0.5-RedHat
>   * atftp-0.7.1
>
> Slaves:
>   2 CPU, 2GB RAM,
>   Network Adapter "Host-only"
>   Configured to boot from the network only

I hope this is a hex-core host with 6+GiB of RAM. Have you considered single vCPU client VMs? I presume you disabled VMware's isc-dhcp on vmnet1?

> The Master is configured to provide a pxelinux.0 and supporting files
> through TFTP. The client boots, gets a DHCP answer, loads pxelinux.0,
> presents a menu through menu.c32, defaults to loading the kernel and an
> initrd, and starts booting from there. No problems here in BIOS mode. The
> whole process takes less than 8 seconds, measured from the moment I hit
> enter in the menu.
>
> One alternate choice in our menu loads a rescue environment with a much
> bigger initrd (87M instead of 13M), which loads in 12 seconds.

Nice.

> Now I'm testing the same setup in UEFI mode. The client boots, gets an
> answer from the DHCPd, loads efi64/syslinux.efi and the supporting files
> from the efi64 subtree, presents a menu, starts loading the kernel and
> initrd, and continues booting from there. Sounds good? Yes, but instead
> of 8 seconds, this all takes at least a few minutes!
> And quite often the system just hangs before even presenting the menu,
> or presents an empty menu, indicating it failed to load one of the .c32
> files or menu definitions. No errors are shown. Sometimes VMWare pop-ups
> appear indicating the machine crashed on invalid code or could not load
> an OS.

Good idea to serve from a subtree.

> If the boot process finishes successfully, the kernel and initrd go on
> into our custom installer, and after the slave is up and running, its
> network performance (actually, all performance) is equal to that of the
> slave that was booted in BIOS mode.
>
> It turns out to be impossible to load the rescue environment, though.
> Loading the 87M initrd takes longer than 15 minutes. After 15 minutes
> the VMWare workstation (or is it syslinux?) decides to reboot.

The VM's vmware.log might help. Is it exactly 15 minutes from hitting enter on your selection? 15 minutes of loading the initrd?

> By the way, this is not an ARP issue or anything like that. I did check
> that both slaves have different MAC addresses, and I think I covered most
> other obvious causes. The only obvious thing I have not yet tried is to
> run the setup on real machines or in another virtual environment.
>
> (Actually, I tried VirtualBox first but could not get it to PXE boot in
> UEFI mode.)
>
> So, the simple question is: is anyone else experiencing slow performance
> while network-booting in UEFI mode? Is it caused by the UEFI API
> implementation of VMWare, is there an issue in gnu-efi, or is it even the
> way syslinux interacts with the EFI layer that causes this?

Not sure, but worth investigation.

> This is the question I've been asking myself for the past few days.
>
> I've been debugging the syslinux code by inspection and by adding debug
> statements everywhere.
> I also removed "git submodule update --init" from efi/build-gnu-efi.sh
> so I could rebuild syslinux locally using:
>
> # make spotless
> # make bios
> # make efi64

'make bios' should be unnecessary.

> What I've found so far:
>
> /* TCPdump/Wireshark analysis - problem 1 */
>
> Running tcpdump from the host on /dev/vmnet1, I found the following.
>
> The file "syslinux.efi" loads from a client that does not understand the
> additional options. This is probably the native UEFI firmware in action.
> Also, this step never fails. The next file will always at least start to
> load, even when the slave decides to crash before (or after?) the next
> file finishes downloading. Even ldlinux can fail to load.
>
> On bigger files, there is an increasing chance that core_udp_recv in
> efi/udp.c hits the 15ms (or is it 15 undefined jiffies?) timeout and
> returns -1 to its caller in core/fs/pxe/tftp.c:tftp_get_packet().
>
> This function then goes to ack_again, causing atftpd to resend the packet
> that just timed out. Then, on the next call, the original(!) packet is
> received anyway and handled according to plan.
>
> But the next packet to be received will be the same packet again,
> causing tftp.c to hit this code:
>
>     if (serial != last_pkt) {
>         /*
>          * Wrong packet, ACK the packet and try again.
>          * This is presumably because the ACK got lost,
>          * so the server just resent the previous packet.
>          */
> < #if 0
> <         printf("Wrong packet, wanted %04x, got %04x\n", \
> <                htons(last_pkt), htons(*(uint16_t *)(data+2)));
> ---
> > #if 1
> >         printf("Wrong packet, wanted %04x, got %04x\n", \
> >                last_pkt, serial);
> #endif
>         goto ack_again;
>     }
>
> So, what happens now is that the next packet will also be sent twice.
> From now on, every packet will be sent twice! Until another one hits the
> timeout; from then on, each packet will be sent three times.

That's not good.

> Unfortunately, this is not *the* problem. It accounts for some delay,
> but this timeout only happens a few times.
> Typically 1 to 3 times during a download. It should still load the
> kernel and initrd in less than 24s, not in over 5 minutes. So, there's
> something else going on, but what?
>
> I tried to figure out why a UDP packet on a virtual network with
> unlimited bandwidth and zero packet loss would hit a 15ms timeout.

You'd be surprised how much CPU and RAM speed influence this. Handling a sustained 2Gbps, for example, doesn't go well on most servers.

> The comment says it waits for 15ms, but it does this by counting volatile
> uint32_t __jiffies, which gets incremented from efi/main.c by an event
> initiated from the UEFI API. According to the comment near the #define
> DEFAULT_TIMER_TICK_DURATION, the jiffies should increment every 50 msec.
> So a jiffy is not 1 msec but 50 times longer. If this comment is
> correct, the actual timeout isn't 15ms but 750ms. Or is it 750us because
> a few zeros are missing? I don't have the UEFI spec, so I can't verify.
>
> What I did next was just increase the timeout in efi/udp.c:core_udp_recv
> to 60 jiffies. This made the double-ACK problem disappear completely.
> From now on, every packet is received within the timeout. All calls to
> core_udp_recv return 0. But still, the booting process sometimes crashes,
> and loading the kernel and initrd takes more than a few minutes. I have
> yet to successfully boot the rescue environment through TFTP, because it
> still takes more than 15 minutes to load.
>
> This leads me to the conclusion that this double-ACK problem is not the
> cause of the bad performance, so I had to investigate further.
>
> /* TCPdump/Wireshark analysis - problem 2 */
>
> After configuring Wireshark to display the difference in time between
> each DATA and ACK packet, I found another weirdness. Obviously, there is
> a delay on the client side between receiving a DATA packet and sending an
> ACK, and on the server side between receiving an ACK and sending the next
> DATA packet.
> To illustrate:
>
> Client:
>  * receives a DATA pkt
>    (delay1)
>  * sends an ACK pkt
>    (wait...)
>
> Server:
>  * receives an ACK pkt
>    (delay2)
>  * sends the next DATA pkt
>    (wait...)
>
> Delay1 and delay2 are caused by the time it takes to evaluate the packet
> and to compute what packet to send next. No fancy stuff, just some easy
> calculations and memory transactions. So they should both be very short
> and stay constant. But they don't.
>
> The server does just fine. It typically reacts within 0.000100 seconds.
> The client also starts out fine, reacting within about 0.000250 seconds,
> but this delay slowly increases.

This makes me wonder about the source; there are at least 3 factors: VMware's UEFI on multiple vCPUs, a memory or other leak, or a strange bug in Syslinux.

> Look at the beginning of the download of vmlinuz:
>
> No.     Time       Source          Destination     Protocol  Length  Info
> 15002   0.274814   10.4.0.1        10.4.255.254    TFTP         105  Read Request, File: efi64/images/default-image/initrd, Transfer type: octet, tsize\000=0\000, blksize\000=1408\000
> 15003   0.001082   10.4.255.254    10.4.0.1        TFTP          72  Option Acknowledgement, tsize\000=13442875\000, blksize\000=1408\000
> 15004   0.003735   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 0
> 15005   0.000081   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 1
> 15006   0.000236   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 1
> 15007   0.000076   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 2
> 15008   0.000220   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 2
> 15009   0.000051   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 3
> 15010   0.000244   10.4.0.1        10.4.255.254    TFTP          60  Acknowledgement, Block: 3
> 15011   0.000038   10.4.255.254    10.4.0.1        TFTP        1454  Data Packet, Block: 4
>
> The time displayed is the time since the previous displayed packet. So
> after receiving packet 15005, the client needs 0.000236 seconds to send
> the ACK for block 1 in packet 15006. The server then takes 0.000076
> seconds to send block 2.
> And so on...
>
> Now fast forward a few thousand packets and observe the delays:
>
> 51117   0.000115   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9120
> 51118   0.023451   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9120
> 51119   0.000098   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9121
> 51120   0.017573   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9120
> 51121   0.000088   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9121
> 51122   0.011682   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9121
> 51123   0.000089   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9122
> 51124   0.017590   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9121
> 51125   0.000088   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9122
> 51126   0.017732   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9122
> 51127   0.000114   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9123
> 51128   0.011833   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9122
> 51129   0.000097   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9123
> 51130   0.017628   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9123
> 51131   0.000088   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9124
> 51132   0.017785   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9123
> 51133   0.000090   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9124
> 51134   0.017949   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9124
> 51135   0.000090   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9125
> 51136   0.030592   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9124
> 51137   0.000097   10.141.255.254  10.141.0.1      TFTP        1454  Data Packet, Block: 9125
> 51138   0.037680   10.141.0.1      10.141.255.254  TFTP          60  Acknowledgement, Block: 9125
>
> See how each Acknowledgement now takes 100 times as long to send?
> We're not in 0.000x territory anymore; we're down to a handful of
> packets per second.
>
> On the next transfer, the delays are back to normal again, only to start
> increasing and increasing, again slowing down to a crawl.

A very important observation!

> I've been ploughing through the code for quite some time now, but I
> cannot explain why this happens. The main obstacle for me is that I
> barely understand how this glue code to the UEFI API actually works, and
> that I don't have a copy of the spec.

I recall finding the UEFI 2.4 spec freely available.

> So, is there anybody on this list willing to give it a shot?

Certainly worth some comparisons. By the way, what version/commit?

--Gene
Didier Spaier
2015-Jun-17 18:12 UTC
[syslinux] EFI & PXE-booting: very slow TFTP performance on a VMWare test setup
>> I've been ploughing through the code for quite some time now, but I
>> cannot explain why this happens. The main obstacle for me is that I
>> barely understand how this glue code to the UEFI API actually works, and
>> that I don't have a copy of the spec.
>
> I recall finding the UEFI 2.4 spec freely available.

Yes, here: http://www.uefi.org/specifications

Best regards,
Didier
Oscar Roozen
2015-Jun-17 18:20 UTC
[syslinux] EFI & PXE-booting: very slow TFTP performance on a VMWare test setup
>> Are there any known problems with the performance of TFTP in (U)EFI
>> environments in general or maybe just on VMWare?
>
> I don't believe so. I use Workstation 10.0.2 on Ubuntu 12.04 for the
> moment.

<stupid question> and have EFI enabled? </stupid>

> I hope this is a hex-core host with 6+GiB of RAM.

Fedora runs on an AMD A8-6600K at 3GHz and has 8G of RAM. Ubuntu runs on an Intel i7-2600K at 3.4GHz with 16G of RAM. I think they perform at least "good enough" to quickly boot and install in BIOS mode. The trouble only starts when I toggle the EFI switch.

> Have you considered single vCPU client VMs?

Yes. That's what I started with. But then I saw this piece of code in efi/udp.c and thought it would be wise to enable an extra CPU to receive the UDP packet. A single CPU has a 100% load here:

    start = jiffies();
    while (cb_status == -1) {
        /* 15ms receive timeout... */
        if (jiffies() - start >= 30) {
            if (jiffies() - start >= 60)
                printf("Failed to cancel UDP\n");
            uefi_call_wrapper(udp->Cancel, 2, udp, &token);
            printf("core_udp_recv: timed out\n");
        }
        uefi_call_wrapper(udp->Poll, 1, udp);
    }

Just to rule out some stupid missed interrupt or so, I gave it an extra CPU.

> I presume you disabled VMware's isc-dhcp on vmnet1?

Yes. I should have mentioned that. Found out the hard way... ;-)

> Good idea to serve from a subtree.

Thanks. We need to support both modes, so we now have /tftpboot/bios and ../efi64.

>> # make spotless
>> # make bios
>> # make efi64
>
> 'make bios' should be unnecessary.

I have to test both modes, but while debugging I mostly did just "make efi64 && install-stuff-to-/tftpboot/efi64".

>> The server does just fine. It typically reacts within 0.000100 seconds.
>> The client also starts out fine, reacting within about 0.000250 seconds,
>> but this delay slowly increases.
>
> This makes me wonder about the source; there are at least 3 factors:
> VMware's UEFI on multiple vCPUs, a memory or other leak, or a strange
> bug in Syslinux.

This also happens on a single vCPU.
I did check efi/udp.c over and over again (especially the event handling), but I don't see a place that could cause this behaviour.

One thing I did just now is a bit, ehm, hackish. I symlinked /tftpboot/efi64/syslinux.efi to /dev/zero and tried booting from that. I wanted to see if the "built-in" TFTP client of VMWare's UEFI firmware also showed this behaviour, but it didn't. Well, it got a bit slower, but not as much as the client from syslinux.

> I recall finding the UEFI 2.4 spec freely available.

I recall running into some NDA that had to be manually approved before I could download anything. I also remember preferring to go on hacking in more debug statements to find out where the delay comes from. Unfortunately, I still have no clue and no specs as a result. Even with the specs, my guess is it will take some time to understand them. I'm just becoming a bit more comfortable reading syslinux code.

> By the way, what version/commit?

A fresh syslinux-6.03.tar.gz from kernel.org. I also tried 81ad566f155fac31089fde69c87059b217e7e9b6 with the same results.

-- 
Oscar Roozen
Linux Developer
Bright Computing BV