thr3ads.net - klibc - [klibc] Bug#511959: klibc-utils: ipconfig times out when several machines boot at the very same time [Jan 2009]

Cyril Brulebois

2009-Jan-15 22:19 UTC

[klibc] Bug#511959: klibc-utils: ipconfig times out when several machines boot at the very same time

Package: klibc-utils
Version: 1.5.12-2
Severity: important
Tags: patch

(I'm using X-Debbugs-Cc so that the klibc list receives a copy directly.
 I'd be glad to be kept in Cc since I don't follow that list, thanks
 already.)

Hi maks, hpa, Louis, and others,

For a while now, I've been experiencing DHCP timeouts in ipconfig when
starting up several machines at the very same time (say: 2 out of 4
don't boot). FWIW, we use the 'reboot' mechanism of our SSI system,
which means the reboots happen in sync.

I finally found some time to look into it, and discovered that the
state machine doesn't take some cases into account. I rebuilt ipconfig
with DEBUG/IPC_DEBUG and saw that the machines that don't boot receive
the messages broadcast by the other DHCP clients (be it DISCOVER or
REQUEST) and, from that point on, receive nothing else.

Now, looking at the code (under usr/kinit/ipconfig for those following
at home):

packet.c: packet_recv()
| [...]
|         if (udp->source != htons(cfg_remote_port) ||
|             udp->dest != htons(cfg_local_port))
|                 goto free_pkt;
| [...]
| free_pkt:
|         free(ip);
|         return 0;
| [...]

Which means in case of source/dest mismatch (which is the case when a
message from another client is received), 0 is returned.

Now, looking at the callers:
bootp_proto.c: bootp_recv_reply() & dhcp_proto.c: dhcp_recv()
| [...]
|         ret = packet_recv(iov, 3);
|         if (ret <= 0)
|                 return ret;
| [...]

Again, 0 is returned.

dhcp_recv() is wrapped into dhcp_recv_offer() & dhcp_recv_ack().

Finally, all those are used in switch() statements in main.c, where -1
and strictly positive values are checked for, but not 0. Hence the
attached patch: 0001-trivial.patch.
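
As a rough illustration of the gap described above (this is not the
klibc code), the sketch below models a caller that reacts to -1 and to
strictly positive return values but leaves 0 unhandled. The function
names, the DEVST_WAITING and DEVST_COMPLETE states, and the use of
if-chains instead of switch() are assumptions made for the example;
only DEVST_ERROR and the -1/0/positive return convention come from the
report. The "patched" variant shows one plausible reading of "make
sure unexpected received messages restart the handshake", not the
actual content of 0001-trivial.patch.

```c
#include <assert.h>

/* Hypothetical states; only DEVST_ERROR is named in the report. */
enum dev_state { DEVST_WAITING, DEVST_ERROR, DEVST_COMPLETE };

/*
 * Models the reported behaviour: -1 and positive values are acted on,
 * while 0 (packet dropped because the ports did not match) is not.
 */
static enum dev_state next_state_unpatched(enum dev_state cur, int ret)
{
	assert(ret >= -1);	/* only -1, 0 or a positive value is expected */
	if (ret == -1)
		return DEVST_ERROR;
	if (ret > 0)
		return DEVST_COMPLETE;
	return cur;		/* ret == 0: the state is left untouched */
}

/*
 * Assumed fix: treat 0 like an error so the handshake is restarted
 * instead of waiting for a reply that may never be processed.
 */
static enum dev_state next_state_patched(enum dev_state cur, int ret)
{
	assert(ret >= -1);
	if (ret == -1 || ret == 0)
		return DEVST_ERROR;
	if (ret > 0)
		return DEVST_COMPLETE;
	return cur;		/* not reached for valid input */
}
```

With a return value of 0, the first function keeps the device in
whatever state it was in, while the second moves it to DEVST_ERROR so
that a retry is scheduled.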

I guess one could consider it a special case that might deserve a
DEVST_SOFTERROR state, which could have a shorter retry delay than
DEVST_ERROR. That matters especially for setups with a hundred machines
or more: it'd be quite a PITA to wait 10 seconds for each retry when
only a couple of machines complete the DHCP handshake at a time. Not to
mention the default timeout that'll bite. That's why I'm proposing a
second patch, 0002-introduce-softerror.patch. Since it's probably
overkill to introduce that additional state, I think the functionally
equivalent 0003-cleaner.patch will be better if you want to implement my
suggestion.
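
A minimal sketch of the soft-error idea, under stated assumptions: the
enum below is a reduced, hypothetical stand-in for the real state list,
retry_delay_secs() is an invented helper, and the 1-second soft delay
is an arbitrary illustrative value. Only the DEVST_ERROR and
DEVST_SOFTERROR names and the 10-second retry delay come from the
message; the attached patches may well be structured differently.

```c
#include <assert.h>

/* Reduced, hypothetical state list for illustration only. */
enum dev_state { DEVST_ERROR, DEVST_SOFTERROR };

#define HARD_RETRY_SECS	10	/* retry delay mentioned in the report */
#define SOFT_RETRY_SECS	1	/* assumed value, not taken from any patch */

/* Pick the retry delay according to how serious the error was. */
static int retry_delay_secs(enum dev_state st)
{
	assert(st == DEVST_ERROR || st == DEVST_SOFTERROR);
	return st == DEVST_SOFTERROR ? SOFT_RETRY_SECS : HARD_RETRY_SECS;
}
```

The same effect could presumably be had without a new state, for
instance by choosing the delay at the point where the error is
detected, which is the kind of simplification the third patch is said
to aim for.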

Patches against master branch, tested on Debian's sid version (1.5.12).

Errm, now that I'm rebooting in a loopy fashion, it looks like those
patches don't cure the problem totally, so I guess I'm back to
debugging. Hopefully, upstream will figure this out better than I do.

Cheers,
-- 
Cyril Brulebois
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages
	restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 1667
Url:
http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment.mht
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages
	restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 3091
Url:
http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment-0001.mht
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages
	restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 2533
Url:
http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment-0002.mht

Cyril Brulebois

2009-Jan-16 10:39 UTC

[klibc] klibc-utils: ipconfig times out when several machines boot at the very same time

Cyril Brulebois <cyril.brulebois at kerlabs.com> (15/01/2009):
> Errm, now that I'm rebooting in a loopy fashion, it looks like those
> patches don't cure the problem totally, so I guess I'm back to
> debugging.
OK, my patches aren't actually fixing the issue, trash them. :)

I've found [1] by Άλκης Γεωργόπουλος, which definitely describes and
fixes my problem. With an automated reboot every 90 seconds, I haven't
been able to reach the timeout.

 1. http://www.zytor.com/pipermail/klibc/2008-June/002319.html

It'd be very nice to see this included.


Also, as noted in [2], I'm reaching the retry code on every boot (it's
easier for me to reproduce than for Άλκης, I believe due to the sync'd
boots). This means that when e.g. 3 boxes out of 4 are stuck, only one
completes the DHCP handshake upon each retry: box1 boots up while box2
to box4 are waiting. After 10 seconds, box3 completes the handshake.
After another 10 seconds, box4 completes. And finally, after yet
another 10 seconds, box2 completes. For our cluster use, we'll probably
lower the 10-second delay to one or two seconds, but it'd be nice to
see this other problem fixed too. I'll try and get back to you with
full traces.
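
The arithmetic behind that sequence can be written down as a tiny
helper. This is only a back-of-the-envelope sketch:
last_box_wait_secs() is a hypothetical name, and it assumes that
exactly one stuck box completes per retry with a fixed delay between
retries, which is what was observed with 4 boxes and is not guaranteed
on larger setups.

```c
#include <assert.h>

/*
 * Seconds until the last stuck box completes its handshake, assuming
 * one stuck box completes per retry and a fixed delay between retries.
 */
static int last_box_wait_secs(int stuck_boxes, int retry_delay_secs)
{
	assert(stuck_boxes >= 0 && retry_delay_secs >= 0);
	return stuck_boxes * retry_delay_secs;
}
```

With 3 stuck boxes and the 10-second delay this gives 30 seconds for
the last box; lowering the delay to 1 second brings it down to 3
seconds.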

The relevant excerpt from [2] describing the problem:
| Output of ipconfig-1.5.10-patched receiving an ARP packet,
| considering it an error and delaying for 10 secs. It didn't drop any
| packets before the error (as the other versions did), the error
| happened before the offer (rare - took me many minutes to reproduce).
| So this ARP error is on all versions. The delay depends on the errors
| received, I've seen all versions needing from 1 sec to some minutes.
| http://users.sch.gr/alkisg/temp/output3.txt

 2. http://www.zytor.com/pipermail/klibc/2008-June/002322.html

Cheers,
-- 
Cyril Brulebois
