Cyril Brulebois
2009-Jan-15 22:19 UTC
[klibc] Bug#511959: klibc-utils: ipconfig times out when several machines boot at the very same time
Package: klibc-utils
Version: 1.5.12-2
Severity: important
Tags: patch

(I'm using X-Debbugs-Cc so that the klibc list receives a copy directly.
I'd be glad to be kept in Cc since I don't follow that list, thanks
already.)

Hi maks, hpa, Louis, and others,

For a while now I've been experiencing timeouts at DHCP time in ipconfig
when starting up several machines at the very same time (say, 2 out of 4
don't boot). FWIW, I'm using the 'reboot' mechanism of our SSI system,
which means the reboots happen in sync. I finally found some time to
look into it, and discovered that the state machine doesn't take some
cases into account.

First, I rebuilt ipconfig with DEBUG/IPC_DEBUG and discovered that the
machines that don't boot receive the messages broadcast by the other
DHCP clients (be it DISCOVER or REQUEST) and, from that point on,
receive nothing else.

Now, looking at the code (under usr/kinit/ipconfig for those following
at home):

packet.c: packet_recv()
| ...
|	if (udp->source != htons(cfg_remote_port) ||
|	    udp->dest != htons(cfg_local_port))
|		goto free_pkt;
| ...
| free_pkt:
|	free(ip);
|	return 0;
| ...

Which means that in case of a source/dest port mismatch (which is the
case when a message from another client is received), 0 is returned.

Now, looking at the callers:

bootp_proto.c: bootp_recv_reply() & dhcp_proto.c: dhcp_recv()
| ...
|	ret = packet_recv(iov, 3);
|	if (ret <= 0)
|		return ret;
| ...

Again, 0 is returned. dhcp_recv() is wrapped into dhcp_recv_offer() &
dhcp_recv_ack(). Finally, all those are used in switch() statements in
main.c, where -1 and strictly positive values are checked for, but
not 0. Hence the attached patch: 0001-trivial.patch.

I guess one could consider it a special case that might deserve a
DEVST_SOFTERROR state, which could have a shorter retry delay than
DEVST_ERROR. That's especially true for setups with a hundred machines
or more: it'd be quite a PITA to wait 10 seconds for a retry in which
only a couple of machines will complete the DHCP handshake.
Not to mention the default timeout that'll bite. That's why I'm
proposing a second patch: 0002-introduce-softerror.patch. Since it's
probably overkill to introduce that additional state, I think the
functionally equivalent 0003-cleaner.patch would be the better choice if
you want to implement my suggestion.

The patches are against the master branch and were tested on Debian's
sid version (1.5.12).

Errm, now that I'm rebooting on a loopy fashion, it looks like those
patches don't cure the problem totally, so I guess I'm back to
debugging. Hopefully, upstream will figure this out better than I can.

Cheers,
-- 
Cyril Brulebois

-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 1667
Url: http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment.mht
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 3091
Url: http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment-0001.mht
-------------- next part --------------
An embedded message was scrubbed...
From: Cyril Brulebois <cyril.brulebois at kerlabs.com>
Subject: [PATCH] klibc: ipconfig: Make sure unexpected received messages restart the handshake.
Date: Thu, 15 Jan 2009 21:31:38 +0100
Size: 2533
Url: http://www.zytor.com/pipermail/klibc/attachments/20090115/01e426e2/attachment-0002.mht
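For readers following along, the return-value path described in the
message above can be modelled in a few lines of C. This is a simplified
sketch, not the klibc source and not any of the attached patches: the
`_model` functions only mimic the behaviour the message describes, and
`struct udp_hdr`, `enum outcome`, `classify()`, and the port values
(the standard DHCP ports 68 and 67) are assumptions made for
illustration.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Assumed values: the standard DHCP client and server ports. */
static const uint16_t cfg_local_port = 68;
static const uint16_t cfg_remote_port = 67;

/* Hypothetical, reduced UDP header; ports are in network byte order. */
struct udp_hdr {
	uint16_t source;
	uint16_t dest;
};

/* Hypothetical result type for the caller sketched below. */
enum outcome { OUTCOME_ERROR, OUTCOME_IGNORED, OUTCOME_REPLY };

/* Model of packet_recv(): returns 0 when the packet is discarded,
 * otherwise the payload length. */
static int packet_recv_model(const struct udp_hdr *udp, int payload_len)
{
	if (udp->source != htons(cfg_remote_port) ||
	    udp->dest != htons(cfg_local_port))
		return 0;	/* e.g. another client's DISCOVER or REQUEST */
	return payload_len;
}

/* Model of dhcp_recv(): 0 and negative values pass straight through. */
static int dhcp_recv_model(const struct udp_hdr *udp, int payload_len)
{
	int ret = packet_recv_model(udp, payload_len);

	if (ret <= 0)
		return ret;
	return 1;		/* pretend the reply was parsed successfully */
}

/* A caller that handles 0 explicitly, unlike the switch() statements
 * the message describes, which only check -1 and positive values. */
static enum outcome classify(int ret)
{
	switch (ret) {
	case -1:
		return OUTCOME_ERROR;
	case 0:
		return OUTCOME_IGNORED;
	default:
		return OUTCOME_REPLY;
	}
}
```

In this model, a header with source port 68 and destination port 67
(another client's broadcast) makes dhcp_recv_model() return 0, which
classify() reports as OUTCOME_IGNORED; a header with the ports reversed
(a server reply) is reported as OUTCOME_REPLY.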
Cyril Brulebois
2009-Jan-16 10:39 UTC
[klibc] klibc-utils: ipconfig times out when several machines boot at the very same time
Cyril Brulebois <cyril.brulebois at kerlabs.com> (15/01/2009):
> Errm, now that I'm rebooting on a loopy fashion, it looks like those
> patches don't cure the problem totally, so I guess I'm back to
> debugging.

OK, my patches aren't actually fixing the issue, trash them. :)

I've found [1] by ????? ????????????, which definitely describes and
fixes my problem. With an automated reboot every 90 seconds, I haven't
been able to reach the timeout.

 1. http://www.zytor.com/pipermail/klibc/2008-June/002319.html

It'd be very nice to see this included.

Also, as noted in [2], I'm reaching the retry code on every boot (it's
easier for me than for ????? to reproduce, I believe due to the sync'd
boots), which means that e.g. when 3 boxes out of 4 are stuck, one
completes the DHCP handshake upon each retry. That is: box1 boots up
while box2 to box4 are waiting. After 10 seconds, box3 completes the
handshake. After another 10 seconds, box4 completes. And finally, after
yet another 10 seconds, box2 completes.

For our cluster use, we'll probably lower the 10-second delay to one or
two seconds, but it'd be nice to see this other problem fixed too. I'll
try and get back to you with full traces.

The relevant excerpt from [2] describing the problem:
| Output of ipconfig-1.5.10-patched receiving an ARP packet,
| considering it an error and delaying for 10 secs. It didn't drop any
| packets before the error (as the other versions did), the error
| happend before the offer (rare - took me many minutes to reproduce).
| So this ARP error is on all versions. The delay depends on the errors
| received, I've seen all versions needing from 1 sec to some minutes.
| http://users.sch.gr/alkisg/temp/output3.txt

 2. http://www.zytor.com/pipermail/klibc/2008-June/002322.html

Cheers,
-- 
Cyril Brulebois
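To put numbers on the retry delay discussed above, here is a small
worked calculation in C. It rests on an assumption extrapolated from
the 4-box observation, not on anything measured at larger scale: the
first machine completes immediately and exactly one more machine
completes after each retry delay. The function name worst_case_wait()
is hypothetical.

```c
/* Worst-case wait, in seconds, before the last machine completes the
 * DHCP handshake. Assumes (extrapolating from the 4-box observation)
 * that the first machine completes at once and exactly one more
 * machine completes after each retry delay. */
static unsigned int worst_case_wait(unsigned int machines,
				    unsigned int retry_delay)
{
	if (machines == 0)
		return 0;
	return (machines - 1) * retry_delay;
}
```

Under that assumption, 4 boxes with the 10-second delay give 30
seconds, matching the sequence described in the message; lowering the
delay to 2 seconds gives 6 seconds, and a hundred machines with the
10-second delay would need 990 seconds.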