thr3ads.net - Syslinux - [syslinux] gpxelinux.0 and slow HTTP performance on VMware ESX VM [Jun 2011]

Schlomo Schapiro

2011-Jun-29 18:57 UTC

[syslinux] gpxelinux.0 and slow HTTP performance on VMware ESX VM

Hi,

First of all, I would like to voice my deep gratitude to all syslinux developers
for this really important software. I am using it in all my automation projects
and could not manage without it.

Unfortunately, I have now stumbled upon a problem where I am at my wits' end and
need some help.

The core problem is that HTTP transfers by gpxelinux.0 are very slow. Sadly this
problem seems to be somehow related to our VMware ESX environment and I am not
able to pin the problem down.

Please bear with me while I try to give you a picture of what we are doing.

We have a VMware ESX VM that serves as boot and installation server, hosting
DHCP, DNS, TFTP and HTTP services for two networks (called "d" and "a"). The
"d" network consists mostly of desktop computers (mostly Dell T5500
workstations), while the "a" network consists exclusively of VMware ESX 4.1 VMs.

The boot configuration is shared between the networks and looks like this:

DHCP filename: gpxelinux.0 (from syslinux 4.04)
pxelinux prefix: http://server.domain/boot

http://server.domain/boot/pxelinux.cfg/default loads a vesamenu.c32-based menu
structure that allows various installs of Ubuntu, RHEL and CentOS, all
accessible exclusively via HTTP (e.g. kernel
http://server.domain/centos/5/x86_64/.../vmlinuz).

Booting a desktop system on the "d" network is really fast: loading the 35MB
initrd of the RHEL 6.1 installer takes 1-2 seconds.

Booting a VM on the "a" network is much slower: loading the same 35MB initrd
from the same URL takes more than 20 seconds. In addition, wireshark on the boot
server shows lots of TCP retransmissions and duplicate ACK packets, and about
10-20% of all boot attempts on the "a" network fail, either by getting stuck in
vesamenu.c32 or by reporting an error, aborting the boot and rebooting after the
pxelinux reboot timeout.
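The gap between the two networks is easier to judge as a transfer rate. The Python sketch below uses hypothetical helpers (not part of Syslinux) to convert a size and a duration into MB/s, taking 35MB to mean 35 * 10**6 bytes, which is an assumption. It also times a plain HTTP download with the standard library's `urllib`, so the same initrd could be fetched from an ordinary OS on each network to check whether the slowness follows the network path or only gpxelinux.0.

```python
import time
import urllib.request


def throughput_mb_per_s(size_bytes: float, seconds: float) -> float:
    """Average transfer rate in MB/s, where 1 MB = 10**6 bytes (an assumption)."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return size_bytes / 1e6 / seconds


def time_http_download(url: str, chunk_size: int = 64 * 1024) -> tuple[int, float]:
    """Download url, discarding the body; return (bytes received, elapsed seconds)."""
    start = time.monotonic()
    received = 0
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            received += len(chunk)
    return received, time.monotonic() - start


# Figures reported in the thread for the 35MB RHEL 6.1 initrd.
desktop_fast = throughput_mb_per_s(35e6, 1)  # "d" network, 1 second
desktop_slow = throughput_mb_per_s(35e6, 2)  # "d" network, 2 seconds
vm = throughput_mb_per_s(35e6, 20)           # "a" network, bound from ">20 seconds"
print(f"d: {desktop_slow:.2f}-{desktop_fast:.2f} MB/s, a: under {vm:.2f} MB/s")
```

By this arithmetic the "d" network delivers roughly 17.5-35 MB/s and the "a" network less than 1.75 MB/s, at least a tenfold difference. The URL passed to `time_http_download` has to be filled in by hand; the initrd path in this thread is elided.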

Some other things we also noticed:
* TFTP on the "a" network is much faster than HTTP, but a small fraction of
boot attempts still fail.
* Some TFTP requests seem to arrive twice, e.g. I see two log entries for
gpxelinux.0, but only one for vesamenu.c32.
* We were not able to find any difference in the network configuration.
* The VMware VM network card type seems to have no effect on the HTTP transfer
times; at least e1000 and vmxnet3 behave similarly.
* Using gPXE and vesamenu.c32 directly fails with vesamenu.c32 newer than about
3.85 and simply reboots the system.

The questions we have are the following:

* Is there any known issue with gpxelinux.0 and HTTP transfers on VMware ESX
(4.1)?

* How could we debug gpxelinux.0 and HTTP transfers, specifically find out why
there are so many TCP retransmits?

* Is it possible to tune the HTTP and TFTP protocol engines, e.g. timeouts,
retries etc.?

* Can you give us any advice on how to troubleshoot such a problem?

We'll be happy to try out anything that could help; this boot issue is
basically the last problem in setting up a fairly large self-service
virtualization environment for our developers. The reason we need to use HTTP
booting is that we provide the pxelinux configuration dynamically and use it as
a quality gate that new VMs have to pass before they are allowed to boot. This
is the core of our self-service virtualization, which is published as "Lab
Manager Light" on
http://blog.schlomo.schapiro.org/2011/05/lab-manager-light-self-service.html.
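A minimal sketch of such a dynamically generated configuration, using only Python's standard `http.server`. It is not the Lab Manager Light implementation. The allowlist `APPROVED_MACS`, the included file `pxelinux.cfg/menu.cfg` and the port are hypothetical. The sketch relies only on PXELINUX requesting `pxelinux.cfg/01-<mac-with-dashes>` under its path prefix before trying other names, and on the standard `INCLUDE`, `DEFAULT`, `LABEL` and `LOCALBOOT` configuration directives.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical allowlist: MAC addresses (lowercase, dash-separated) of VMs
# that have passed the quality gate.
APPROVED_MACS = {"00-50-56-01-02-03"}

# PXELINUX asks for pxelinux.cfg/01-<mac>, where "01" is the ARP hardware
# type for Ethernet and the MAC is lowercase hex separated by dashes.
MAC_CONFIG_PATH = re.compile(r"/pxelinux\.cfg/01-((?:[0-9a-f]{2}-){5}[0-9a-f]{2})$")


def render_config(mac: str) -> str:
    """Return the PXELINUX configuration text for one client."""
    if mac in APPROVED_MACS:
        # Approved VMs get the shared install menu (hypothetical file name).
        return "INCLUDE pxelinux.cfg/menu.cfg\n"
    # Everything else is sent back to its local disk.
    return "DEFAULT local\nLABEL local\n  LOCALBOOT 0\n"


def config_for_path(path: str):
    """Map a request path to config text, or None if it is not a MAC request."""
    match = MAC_CONFIG_PATH.search(path)
    return render_config(match.group(1)) if match else None


class PxeConfigHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = config_for_path(self.path)
        if body is None:
            # Not a MAC-based name: answer 404 so PXELINUX keeps searching.
            self.send_error(404)
            return
        data = body.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Serve generated configs until interrupted (the port is an assumption)."""
    HTTPServer(("", port), PxeConfigHandler).serve_forever()
```

Calling `serve()` starts the server. In a setup like the one above, the real web server would still deliver `vesamenu.c32`, kernels and initrds, and would need to route `/boot/pxelinux.cfg/` requests to this handler, which is an assumption about the deployment. Because every MAC-based request gets an answer, approved or not, the client stops its filename search at that point.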

Kind Regards,
Schlomo Schapiro

H. Peter Anvin

2011-Jun-29 20:05 UTC

On 06/29/2011 11:57 AM, Schlomo Schapiro wrote:
>
> The core problem is that HTTP transfers by gpxelinux.0 are very slow. Sadly this
> problem seems to be somehow related to our VMware ESX environment and I am not
> able to pin the problem down.
>
> 
The requirement for gpxelinux.0 to support HTTP transfers is going to be
dropped in Syslinux 4.10, which is now on the release track.  Could you
test out pxelinux.0 (*not* gpxelinux.0) from Syslinux 4.10-pre15 and see
if you have any problems?

Other than that, it would be good to get a packet trace
(tcpdump/wireshark).

	-hpa
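When reading such a trace, it can help to count suspicious segments per capture rather than scrolling through Wireshark. The following is a rough, standard-library-only Python sketch for classic libpcap files (as written by `tcpdump -w`; pcapng and nanosecond-resolution files are not handled) with Ethernet framing and IPv4. Its heuristics are assumptions and much simpler than Wireshark's TCP analysis: a data segment whose starting sequence number was already seen in the same direction counts as a retransmission, and a pure ACK repeating the previous ACK number in that direction counts as a duplicate ACK (window updates are not told apart).

```python
import struct

TCP_FIN, TCP_SYN, TCP_RST = 0x01, 0x02, 0x04


def iter_tcp_segments(data: bytes):
    """Yield (flow, seq, ack, flags, payload_len) for IPv4/TCP frames in a pcap."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # little-endian capture file
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # big-endian capture file
    else:
        raise ValueError("not a classic (microsecond) libpcap file")
    (linktype,) = struct.unpack(endian + "I", data[20:24])
    if linktype != 1:
        raise ValueError("only Ethernet captures (linktype 1) are handled")

    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        _, _, incl_len, _ = struct.unpack(endian + "IIII", data[offset:offset + 16])
        frame = data[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short, or not IPv4 (VLAN-tagged frames are skipped too)
        ip = frame[14:]
        if ip[9] != 6:
            continue  # not TCP
        ihl = (ip[0] & 0x0F) * 4
        (total_len,) = struct.unpack("!H", ip[2:4])
        tcp = ip[ihl:]
        if len(tcp) < 20:
            continue
        sport, dport, seq, ack = struct.unpack("!HHII", tcp[:12])
        data_offset = (tcp[12] >> 4) * 4
        flow = (ip[12:16], sport, ip[16:20], dport)
        # Use the IP length, not the frame length, so Ethernet padding is ignored.
        yield flow, seq, ack, tcp[13], total_len - ihl - data_offset


def count_retransmits_and_dup_acks(data: bytes):
    """Return (retransmissions, duplicate ACKs) using the simple heuristics above."""
    seen_segments = set()
    last_pure_ack = {}
    retransmits = dup_acks = 0
    for flow, seq, ack, flags, payload_len in iter_tcp_segments(data):
        if payload_len > 0:
            # Same direction and same starting sequence number seen before.
            if (flow, seq) in seen_segments:
                retransmits += 1
            seen_segments.add((flow, seq))
        elif not flags & (TCP_SYN | TCP_FIN | TCP_RST):
            # A pure ACK repeating the previous ACK number in this direction.
            if last_pure_ack.get(flow) == ack:
                dup_acks += 1
            last_pure_ack[flow] = ack
    return retransmits, dup_acks
```

For example, `count_retransmits_and_dup_acks(pathlib.Path("boot.pcap").read_bytes())` returns a `(retransmissions, duplicate_acks)` pair; the file name is hypothetical. Running it on one capture of a boot on the "d" network and one on the "a" network gives a quick, if approximate, comparison of how many segments each path loses.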

Gene Cumm

2011-Jun-29 21:57 UTC

On Wed, Jun 29, 2011 at 14:57, Schlomo Schapiro
<syslinux at schlomo.schapiro.org> wrote:
> Hi,
>
> first of all I would like to voice my deep gratitude to all syslinux developers
> for this really important software. I am using it in all my automation projects
> and could not manage without.
>
> Unfortunately now I stumbled upon a problem where I am out of my wits and need
> some help.
>
> The core problem is that HTTP transfers by gpxelinux.0 are very slow. Sadly this
> problem seems to be somehow related to our VMware ESX environment and I am not
> able to pin the problem down.
>
> Please bear with me while I try to give you a picture of what we are doing.
>
> We have a VMware ESX VM that serves as boot and installation server, hosting
> DHCP, DNS, TFTP and HTTP services for 2 networks (called "d" and "a"). The "d"
> network consist mostly of desktop computers (mostly Dell T5500 workstations)
> while the "a" network consists exclusively of VMware ESX 4.1 VMs.
Which network is the boot/install server attached to, "d", "a" or
another?  If "a", do you see an issue when the booting VM is on the
same host as the boot/install server VM?
> The boot configuration is shared between the networks and looks like this:
>
> DHCP filename: gpxelinux.0 (from syslinux 4.04)
> pxelinux prefix: http://server.domain/boot
>
> http://server.domain/boot/pxelinux.cfg/default loads a vesamenu.c32 based menu
> structure that allows various installs of Ubuntu, RHEL and CentOS, all
> accessible exclusively via HTTP (e.g. kernel
> http://server.domain/centos/5/x86_64/.../vmlinuz)
>
> Booting a desktop system on the "d" network goes really fast, and loading the
> 35MB initrd of the RHEL6.1 installer takes 1-2 seconds.
>
> Booting a VM on the "a" network is much slower, loading the same 35MB initrd
> from the same URL takes >20 seconds. Also, wireshark on the boot server shows
> lots of TCP retransmissions and duplicate ACK packets. Also, about 10-20% of all
> boot attempts on the "a" network fail by either getting stuck in vesamenu.c32 or
> by reporting an error, aborting the boot and rebooting after the pxelinux reboot
> timeout.
>
> Some other things we also noticed:
> * TFTP on the "a" network is much faster than HTTP, but still a small fraction
> of boot attempts fail.
> * some TFTP requests seem to come twice, e.g. I see two log entries for
> gpxelinux.0, but only one for vesamenu.c32.
> * we where not able to find any difference in the network configuration.
> * The VMware VM network card type seems to have no effect on the HTTP transfer
> times, at least e1000 and vmxnet3 behave similar.
You should also be able to use an older NIC type, which may still have issues
with PXELINUX 4.10-pre15.  Configure a VM as something like "Other 2.6
Linux" and it should act like a PCnet32 for PXE.
> * using gpxe and vesamenu.c32 directly fails with vesamenu.c32 >3.85 (or so) and
> simply reboots the system.
Using vesamenu.c32 from Syslinux 4.xx will do this; the issue is that
gPXE doesn't recognize the fact that it's a COM32R rather than a 3.xx
COM32.
> The questions we have are the following:
>
> * Is there any known issue with gpxelinux.0 and HTTP transfers on VMware ESX (4.1)?
I'm pretty sure I've tried this personally and not had issues.  I know
with those two NICs, I have not had any issues.
> * How could we debug gpxelinux.0 and HTTP transfers, specifically find out why
> there are so many TCP retransmits?
>
> * Is it possible to tune the HTTP and TFTP protocol engines, e.g. timeouts,
> retries etc.?
>
> * Can you give us any advice on how to troubleshoot such a problem?
I've already got a similar system.  I'll try some tests myself.
Guesses that don't involve a bug include a vSwitch or physical switch with
problems or with bandwidth throttling active, a bad NIC or switch port, or an
overloaded host or network.
> We'll be happy to try out anything that could help, this boot issue is basically
> the last problem in setting up a fairly large self-service virtualization
> environment for our developers. The reason we need to use HTTP booting is that
> we provide the pxelinux configuration dynamically and use this as a quality gate
> that new VMs have to pass before they are allowed to boot. This is the core of
> our self-service virtualization which is published as "Lab Manager Light" on
> http://blog.schlomo.schapiro.org/2011/05/lab-manager-light-self-service.html.
Sounds like an interesting project.

-- 
-Gene

syslinux at schlomo.schapiro.org

2011-Jun-30 14:53 UTC

Hi,

On 29.06.2011 22:05, H. Peter Anvin wrote:
> On 06/29/2011 11:57 AM, Schlomo Schapiro wrote:
>>
>> The core problem is that HTTP transfers by gpxelinux.0 are very slow. Sadly this
>> problem seems to be somehow related to our VMware ESX environment and I am not
>> able to pin the problem down.
>>
> 
> The requirement for gpxelinux.0 to support HTTP transfers is going to be
> dropped in Syslinux 4.10, which is now on the release track.  Could you
> test out pxelinux.0 (*not* gpxelinux.0) from Syslinux 4.10-pre15 and see
> if you have any problems?
> 
> Other than that, it would be good to get a package trace
> (tcpdump/wireshark).
I tried your suggestion today (and I think it is a very good thing to
have native HTTP support in pxelinux).

Unfortunately, it does not work. I have put together some debug information and
traces at http://files.schapiro.org/schlomo/syslinux/index.html

Things to notice:
* Syslinux 4.10 apparently did not try to load vesamenu.c32 but immediately
tried vesamenu.c32.0 and other variations. The access log shows this very
clearly: first there is a successful boot with a user agent of gPXE from
gpxelinux.0 4.04, followed by the failed boot requests with a user agent of
Syslinux/4.10.

* There are far fewer network errors in the pcap trace with pxelinux 4.10 than
with gpxelinux 4.04, but there are still some.

* I tried various VM settings with 32 and 64 bit and various NIC types, which
seems to have no effect.

* gpxelinux.0 4.04 HTTP performance differs greatly between hardware (fast)
and VM (slow). Could it be that the code is taxing the VM in a way that makes
virtualization very slow compared to hardware?

Kind Regards,
Schlomo

Shantanu Gadgil

2011-Jul-01 05:44 UTC

Hi,
Please see my comments inline. I have faced similar problems too.

----- Original Message ----

Date: Wed, 29 Jun 2011 20:57:22 +0200
From: Schlomo Schapiro <syslinux at schlomo.schapiro.org>
To: syslinux at zytor.com
Subject: [syslinux] gpxelinux.0 and slow HTTP performance on VMware
    ESX VM
Message-ID: <4E0B7592.5060404 at schlomo.schapiro.org>
Content-Type: text/plain; charset=UTF-8

Hi,

Unfortunately now I stumbled upon a problem where I am out of my wits and need
some help.

[Shantanu]: Yes ... exactly, many times I too face this! :) ;P

The core problem is that HTTP transfers by gpxelinux.0 are very slow. Sadly this
problem seems to be somehow related to our VMware ESX environment and I am not
able to pin the problem down.

[Shantanu]: The symptom on my end was ...

syslinux-4.10-pre14: The scan from the "mac_address_with_dashes" file down to
"default" would be very slow.
HTTP transfers were sometimes slow, sometimes not.

This was seen when the TFTP server was inside a VM (ESXi/VirtualBox) but not 
when the TFTP server was a real machine.
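For reference, the PXELINUX documentation describes the order in which it tries configuration file names under `pxelinux.cfg/`: the client UUID (if any), then `01-` followed by the MAC address in lowercase hex with dashes, then the IPv4 address in uppercase hex, shortened one digit at a time, and finally `default`. Each miss is a separate request, so a slow or lossy path to the server stretches the whole scan. The Python sketch below is a hypothetical helper, not Syslinux code; it only lists those names in that order.

```python
import ipaddress
from typing import List, Optional


def pxelinux_config_candidates(mac: str, ip: str, uuid: Optional[str] = None) -> List[str]:
    """List pxelinux.cfg/ file names in the order PXELINUX tries them."""
    names = []
    if uuid:
        names.append(uuid)
    # "01" is the ARP hardware type for Ethernet.
    names.append("01-" + mac.lower().replace(":", "-"))
    # The IPv4 address as 8 uppercase hex digits, then ever shorter prefixes.
    hex_ip = f"{int(ipaddress.IPv4Address(ip)):08X}"
    names.extend(hex_ip[:n] for n in range(len(hex_ip), 0, -1))
    names.append("default")
    return names


for name in pxelinux_config_candidates("88:99:AA:BB:CC:DD", "192.0.2.91"):
    print(name)
```

With these example values (no UUID) the list runs from `01-88-99-aa-bb-cc-dd` through `C000025B`, `C000025`, ... `C` to `default`, so a client without a MAC- or IP-specific file makes nine unsuccessful requests before it reaches `default`.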

So I ended up going back to 4.05-preX.

My current solution is to do *just* the vmlinuz/initrd loading via TFTP (it just
works :)) for all the OSes and to have the OS installation configured to install
via HTTP/NFS. This has been working quite smoothly so far.
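A sketch of what one such menu entry might look like, generated in Python. It assumes the PXELINUX path prefix is a TFTP one, so that relative `KERNEL` and `initrd=` paths are fetched via TFTP, and it assumes an Anaconda-based installer (RHEL/CentOS), which accepts `ks=` with an HTTP URL for a kickstart file that in turn points the installation at an HTTP or NFS source. The label name, file paths and kickstart URL are hypothetical.

```python
def install_label(name: str, kernel: str, initrd: str, kickstart_url: str) -> str:
    """Build a PXELINUX LABEL stanza: kernel and initrd via the (TFTP) path
    prefix, installer configuration via an HTTP kickstart URL."""
    return (
        f"LABEL {name}\n"
        f"  KERNEL {kernel}\n"
        f"  APPEND initrd={initrd} ks={kickstart_url}\n"
    )


print(install_label(
    "centos5-x86_64",                       # hypothetical label name
    "centos/5/x86_64/vmlinuz",              # hypothetical relative path, via TFTP
    "centos/5/x86_64/initrd.img",           # hypothetical relative path, via TFTP
    "http://server.domain/ks/centos5.cfg",  # hypothetical kickstart URL
))
```

The split keeps the two large downloads on TFTP, which the thread reports as more reliable in the VMs, while the bulk of the installation traffic is handled by the installer's own network stack instead of the boot loader's.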

The 4.10-preXX releases hit one issue or another, and I keep coming back to 4.04
(now testing 4.05-pre3)!!! :( :(

Cheers and Regards,
Shantanu
