I've been seeing a problem lately that I thought I'd run by this list to see if it's familiar to anyone.

We've been running tftpd from tftp-hpa (at first version 0.29, but recently version 0.34) on DigitalUNIX 4.0F, serving files off of an NFS filesystem (from a very busy server). The daemon is started as:

    /noc/bin/tftpd -v -p -l -m /noc/etc/tftp.filename-translations \
        -u noc-tftp -s /noc/tftp

Occasionally, when one of our switches uploads its config with TFTP, we end up with a 0-byte file. In syslog, we see something like this:

| May 1 16:23:09 [server] tftpd[14250]: WRQ from [client] filename switch.config remapped to /switch.config
| May 1 16:23:10 [server] tftpd[14255]: WRQ from [client] filename switch.config remapped to /switch.config
| May 1 16:23:10 [server] tftpd[14255]: tftpd: read: Connection refused

If I watch the file while this is happening, I can see it grow from 0 bytes to some size, then get truncated back to 0.

I don't know too much about TFTP, but my theory as to what is happening here is this: something (probably NFS) is making the tftp daemon slow to respond to the first WRQ, so the switch sends another one. In the meantime, the tftp daemon responds to the first WRQ, opens the file, sends an ACK, and the switch sends its data. Then, the daemon responds to the second WRQ, opens the file, sends an ACK, but gets an error (because the switch has already finished the transfer) and closes the file. Unfortunately, the opens are done with O_TRUNC, so the second open truncates the file.

A tcpdump seems to support this idea:

| 17:23:07.090411 [client].1036 > [server].69: 29 WRQ "widener-le-sw.config"
| 17:23:08.532809 [client].1036 > [server].69: 29 WRQ "widener-le-sw.config"
| 17:23:09.451769 [server].1334 > [client].1036: udp 4
| 17:23:09.458605 [client].1036 > [server].1334: udp 516
| [...]
| 17:23:09.634386 [client].1036 > [server].1334: udp 516
| 17:23:09.634386 [server].1334 > [client].1036: udp 4
| 17:23:09.643175 [client].1036 > [server].1334: udp 174
| 17:23:09.661730 [server].1334 > [client].1036: udp 4
| 17:23:10.651003 [server].1335 > [client].1036: udp 4
| 17:23:10.656862 [client] > [server]: icmp: [client] udp port 1036 unreachable

Does that sound reasonable, or does anyone have some other explanation? Has anyone seen this behavior before?

It seems to me like the problems I'm seeing could be avoided on the tftpd side in one (or both) of two ways:

1) not truncating the file until the first data packet is received
2) not responding to a second WRQ with the same host+port within some period (since the TIDs should be chosen such that they are unlikely to repeat, if I read the RFCs correctly).

But, like I said, I'm no TFTP or tftp-hpa expert, which is why I'm emailing the SYSLINUX list.

--Alan Sundell
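
For readers following the thread, option 1) above (late truncation) can be sketched roughly as follows. This is only an illustration, not tftp-hpa code: the helper names open_for_upload and write_block are hypothetical, and the O_CREAT flag and 0666 mode are assumptions made so the sketch is self-contained. The idea is to open the target without O_TRUNC and call ftruncate() only once the first DATA block has actually arrived.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open the upload target WITHOUT O_TRUNC, so that a
 * session started by a retransmitted WRQ cannot wipe data that an earlier
 * session has already written. O_CREAT and the mode are assumptions. */
int open_for_upload(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT, 0666);
}

/* Hypothetical helper: call once for each newly accepted DATA block.
 * The old contents are discarded only when block 1 arrives, i.e. when
 * the client has proven it is really sending data on this session. */
ssize_t write_block(int fd, unsigned block, const void *buf, size_t len)
{
    if (block == 1 && ftruncate(fd, 0) < 0)
        return -1;
    return write(fd, buf, len);
}
```

With this ordering, the second session in the trace above would open the file, fail to receive any data, and close it again without changing its length.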
Alan Sundell wrote:
>
> Does that sound reasonable, or does anyone have some other explanation?
> Has anyone seen this behavior before?
>
> It seems to me like the problems I'm seeing could be avoided on the
> tftpd side in one (or both) of two ways:
> 1) not truncating the file until the first data packet is received
> 2) not responding to a second WRQ with the same host+port within some
> period (since the TIDs should be chosen such that they are unlikely to
> repeat, if I read the RFCs correctly).
>
> But, like I said, I'm no TFTP or tftp-hpa expert, which is why I'm
> emailing the SYSLINUX list.
>

It's entirely reasonable. Since two sessions have been opened, you have effectively a two-writer problem.

The "connection refused" is probably the best bet for solving this, since the natural mutex is the kernel lock on the port number. This would mean we shouldn't open the file until we have connect()ed, which is probably correct anyway; if nothing else we still need to send back the error code, so we need to connect() no matter what.

-hpa
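
The ordering suggested here (connect() first, open the file afterwards) might look something like the sketch below. It is an illustration only, not the actual tftp-hpa change: begin_write_session is a hypothetical name, and the open flags and mode are assumptions. Note that on a UDP socket connect() only fixes the peer address; a "connection refused" triggered by an ICMP port-unreachable is typically reported on a later send or receive, so a real fix would also have to delay the truncating open until that exchange has succeeded.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not tftp-hpa code: start a write session on an
 * already created datagram socket. Returns a file descriptor for the
 * upload target, or -1 on failure. */
int begin_write_session(int sock, const struct sockaddr *peer,
                        socklen_t peer_len, const char *path)
{
    assert(peer != NULL && path != NULL);

    /* Step 1: bind the socket to this one peer. If this fails, return
     * before the file has been opened or truncated. */
    if (connect(sock, peer, peer_len) < 0)
        return -1;

    /* Step 2: only now touch the file (flags and mode are assumptions). */
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
}
```

The point of the sketch is the order of the two steps: any failure on the network side is seen before the existing file contents are put at risk.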
On Thu, May 01, 2003 at 06:52:46PM -0400, Alan Sundell wrote:
[...]
> Occasionally, when one of our switches uploads its config with TFTP, we
> end up with a 0-byte file.
[...]
> It seems to me like the problems I'm seeing could be avoided on the
> tftpd side in one (or both) of two ways:
> 1) not truncating the file until the first data packet is received
> 2) not responding to a second WRQ with the same host+port within some
> period (since the TIDs should be chosen such that they are unlikely to
> repeat, if I read the RFCs correctly).

Hey, remember this issue? I've done some things to mitigate the NIS delays on the server side, but AFAIK, the possibility of data loss from a truncated file due to network or other delays still exists.

Back then, I wrote a couple of quick and dirty patches against tftp-hpa-0.3.4 to test out numbers 1) and 2) above. (Haven't had time to think about this issue much since then, hence the long delay.) Either one seemed to fix the problem, and they both seemed to confirm my theory of what was going on here (that the re-transmitted request caused the truncation of the file on open).

I've attached the two patches to this message. They are, as I said, quick and dirty, but they should give you an idea of what I'm talking about.

The late-truncate patch should be harmless, and I would like to see something like it get applied, since it will solve our data loss problem.

The request-tracking patch is harmless for clients that choose a different TID for every request, but will of course cause problems if clients re-use TIDs within a short period of time. Since it may cause problems with some clients, is a bit pedantic, and requires ugly additions to the code, it's probably a less-advisable way to go. What do you think?

Interestingly, as a footnote, "some clients" above includes the client in tftp-hpa.
To give some context, the RFC says:

| The TID's chosen for a connection should be randomly chosen, so that
| the probability that the same number is chosen twice in immediate
| succession is very low.

However, the client in tftp-hpa opens a socket once in main() and re-uses it for all subsequent requests. Therefore the probability that a single invocation of the client will choose the same "TID" (port) twice in immediate succession is 1.

Whether this causes any practical problems with any TFTP servers out there, I don't know. It doesn't cause me any trouble, like the truncation does -- I just noticed it in passing and thought I would point it out in case it causes anyone else any grief.

Thanks again for your help with this,

--Alan
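
For illustration, the request-tracking idea (option 2) could be sketched along these lines. This is not the attached patch: the table size, the 5-second window, and the name is_duplicate_request are all assumptions chosen for the example, and the table would have to live in whichever process receives the initial requests on port 69.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define TRACK_SLOTS  16   /* arbitrary table size (assumption) */
#define TRACK_WINDOW 5    /* seconds to remember a request (assumption) */

struct recent_req {
    uint32_t addr;   /* client IPv4 address */
    uint16_t port;   /* client UDP port, i.e. its TID */
    time_t   seen;   /* when the request was last accepted */
    int      used;
};

static struct recent_req recent[TRACK_SLOTS];

/* Returns 1 if a request from the same host+port was accepted less than
 * TRACK_WINDOW seconds ago (treat it as a retransmission and ignore it),
 * otherwise records the request and returns 0. */
int is_duplicate_request(uint32_t addr, uint16_t port, time_t now)
{
    int i, victim = 0;

    for (i = 0; i < TRACK_SLOTS; i++) {
        struct recent_req *r = &recent[i];

        if (r->used && r->addr == addr && r->port == port) {
            if ((long)(now - r->seen) < TRACK_WINDOW)
                return 1;
            r->seen = now;          /* old entry expired: accept again */
            return 0;
        }
        if (!recent[victim].used)
            continue;               /* already found a free slot */
        if (!r->used || r->seen < recent[victim].seen)
            victim = i;             /* prefer a free slot, else the oldest */
    }

    assert(victim >= 0 && victim < TRACK_SLOTS);
    recent[victim].addr = addr;
    recent[victim].port = port;
    recent[victim].seen = now;
    recent[victim].used = 1;
    return 0;
}
```

As noted above, a client that legitimately re-uses its port for back-to-back requests (such as the tftp-hpa client) would have its second request dropped inside the window, which is the main drawback of this approach.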