Anthony Sadler
2006-May-03 02:04 UTC
Rsync error on client end: unexpected tag 3 [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(843) [sender]
Hi guys: First post for me on this list. THE PROBLEM ==========I have several linux servers (ranging from Redhat 7.3 to FC3) that use Rsync (version 2.6.8 and 2.6.7, read on for reasons) to backup their data. A typical rsync backup script looks like this: ------------------------------------------------------------------------------------------ #!/bin/sh RSYNC_PASSWORD=<PASSWORD> rsync -a --delete --password-file=<PASSWORD FILE> --ignore-errors --exclude '/dev/*' --exclude '/sys/*' --exclude '/proc/*' --exclude '/var/run/*' --exclude '/tmp/*' --exclude '/home/downloads/*' / <USERNAME>@<ADDRESS>::<MODULE> >> /var/log/rsync.log rsync -a --progress --delete --ignore-errors --password-file=<PASSWORD FILE> /home/downloads/ <USERNAME>@<ADDRESS>::<MODULE> >> /var/log/rsync.log ------------------------------------------------------------------------------------------ (Some things have been replaced to prevent people doing mean things to our servers :P) For a long time rsync has worked, occasionly barfing due to lack of harddrive space or for other external reasons. When we upgraded to version 2.6.8 however, things started to die. Here is a typical error we get: ------------------------------------------------------------------------------------------ unexpected tag 3 [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(828) [sender=2.6.8] ------------------------------------------------------------------------------------------ SERVERS ======For the purpose of this email, I will talk about two servers in particular. One is an apache server, which I will refer to as APACHE, the other our backup server, which we will call RSERVER. APACHE is a Fedora Core release 3 (Heidelberg) box, running Linux 2.6.12-1.1381_FC3 with Rsync 2.6.8, installed from an RPM. All software is up to date to my knowledge. RSERVER is also a Fedora Core release 3 (Heidelberg) box, also running Linux 2.6.12-1.1381_FC3 with Rsync 2.6.8, again installed from RPM. Again, all software is up to date. SETUP ====The script you see above is the script used to copy the data between APACHE and RSERVER. It runs once a night, starting at 21:00 and goes for some hours. The amount of data it has to think about amounts to is some 100 gig. What it has to sync over the course of an evening can be siginificantly less (I see two reports that say 46 meg and 32 meg). The two machines talk to each other over a 100 megabit switched LAN. APACHE is a server that has port forwards to it and has no software firewall running (well Iptables almost always runs, but it has no rules and all chains are set to ACCEPT). The same with RSERVER. The script is kicked off by a cronjob running as root. DETAILS ======Around about 10 minutes after the script runs, it will fail with the above error message. With small amounts of data (eg 10 meg, 20 meg, no known boundary) the script will work. In fact, I see that some files do work with the cron script. However when I start to sync more data (1.6 gig, 1.7 gig etc), it fails part of the way through with the above. Interestingly, CWRsync seems to work (ie Rsync running under Cygwin on Windows). However this is not exclusive. Some do, some do not. I cannot work out why (it may be because they are under the size threshold?) TROUBLESHOOTING ==============- Small Rsyncs will work. No known boundary between small and large though (ie I cant say "More than this amount of data it will fail") - Tried running Rsync in Xinetd. - Tried running Rsync as a daemon, as root - Using blocking-io - Using no-blocking-io - Pings and SSH sessions to watch for network problems. - Packet dumps (very limited, I have trouble with TCPDump... give me ethereal any day :P) - Straces - Replacing 2.6.8 with 2.6.7. 2.6.7 used to work. (that's why I have said above two seperate version): We get a marginally different error message: ------------------------------------------------------------------------------------------ unexpected tag 3 [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(843) [sender] ------------------------------------------------------------------------------------------ The difference is that the sender on the 2nd line does not have (2.6.8). That's the only difference I can find. STRACE #1 ========I can provide whole straces if people request, but here are some interesting things I have found. This is on server RSERVER, the rsync server side, running 2.6.8 as xinetd: ------------------------------------------------------------------------------------------ 21219 open("/etc/localtime", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 143) = 143 21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "s\0\0\n", 4) = 4 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "recv 51.32.168.192.in-addr.arpa "..., 115) = 115 21219 time(NULL) = 1146546636 21219 open("/etc/localtime", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 143) = 143 21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "j\0\0\n", 4) = 4 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "recv 51.32.168.192.in-addr.arpa "..., 106) = 106 21219 time(NULL) = 1146546636 21219 open("/etc/localtime", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 134) = 134 21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "f\0\0\n", 4) = 4 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "recv 51.32.168.192.in-addr.arpa "..., 102) = 102 21219 time(NULL) = 1146546636 21219 open("/etc/localtime", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 130) = 130 21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "2\0\0\r", 4) = 4 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "rsync: read error: Connection re"..., 50) = 50 21219 time(NULL) = 1146546636 21219 open("/etc/localtime", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 write(3, "2006/05/02 05:10:36 [21219] rsyn"..., 78) = 78 21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "Y\0\0\10", 4) = 4 21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0}) 21219 read(4, "rsync error: error in rsync prot"..., 89) = 89 21219 time(NULL) = 1146546636 21219 open("/etc/localtime", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 write(3, "2006/05/02 05:10:36 [21219] rsyn"..., 117) = 117 ------------------------------------------------------------------------------------------ Notice /etc/localtime not being able to open? ------------------------------------------------------------------------------------------ [root@RSERVER tmp]# ls -lah /etc/localtime lrwxrwxrwx 1 root root 36 Mar 27 13:50 /etc/localtime -> /usr/share/zoneinfo/Australia/Sydney [root@RSERVER tmp]# ls -lah /usr/share/zoneinfo/Australia/Sydney -rw-r--r-- 1 root root 785 Mar 27 11:05 /usr/share/zoneinfo/Australia/Sydney [root@RSERVER tmp]# ------------------------------------------------------------------------------------------ Another weird thing is that it does the same for /etc/fstab! (The below is significantly further back in the STRACE than the above) ------------------------------------------------------------------------------------------ 21219 open("/etc/mtab", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 open("/etc/fstab", O_RDONLY) = -1 ENOENT (No such file or directory) 21219 open("/proc/meminfo", O_RDONLY) = -1 ENOENT (No such file or directory) ------------------------------------------------------------------------------------------ Yet at other places in the same STRACE, it does open the file: ------------------------------------------------------------------------------------------ 21219 open("/etc/localtime", O_RDONLY) = 3 21219 fstat64(3, {st_mode=S_IFREG|0644, st_size=785, ...}) = 0 21219 fstat64(3, {st_mode=S_IFREG|0644, st_size=785, ...}) = 0 ------------------------------------------------------------------------------------------ I assume that is exit code 0, which means it works. Either way it doesn't throw an error I can see. STRACE #2 ========This is an strace run on APACHE (the rsync client). I apologise, I cannot say whether it was 2.6.8 or 2.6.7. I'm pretty sure it's 2.6.8. Here is the interesting section: ------------------------------------------------------------------------------------------ 5503 gettimeofday({1146614119, 369460}, NULL) = 0 5503 ioctl(1, TIOCGPGRP, [48]) = -1 ENOTTY (Inappropriate ioctl for device) 5503 write(1, " 583 100% 0.94kB/s "..., 41) = 41 5503 select(4, NULL, [3], NULL, {60, 0}) = 1 (out [3], left {60, 0}) 5503 write(3, "\333\323\361\276[\177\376\33\3\34\302\4\235.\237\301", 16) = 16 5503 gettimeofday({1146614119, 369703}, NULL) = 0 5503 write(1, " 583 100% 0.94kB/s "..., 75) = 75 5503 close(4) = 0 5503 select(4, [3], [], NULL, {60, 0}) = 1 (in [3], left {60, 0}) 5503 read(3, "Y\0\0\n", 4) = 4 5503 write(2, "unexpected tag 3 [sender]", 25) = 25 5503 write(2, "\n", 1) = 1 5503 rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 5503 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 5503 write(2, "rsync error: error in rsync prot"..., 80) = 80 5503 write(2, "\n", 1) = 1 5503 munmap(0xb7d67000, 4096) = 0 5503 exit_group(12) = ? ------------------------------------------------------------------------------------------ Specifically the 2nd line: 5503 ioctl(1, TIOCGPGRP, [48]) = -1 ENOTTY (Inappropriate ioctl for device) TCPDUMPS =======I played with this for all of about 3 minutes. I didn't get far. RSYNC LOG FILE =============2006/05/03 00:28:31 [22725] rsync: writefd_unbuffered failed to write 131 bytes [generator]: Connection reset by peer (104) 2006/05/03 00:28:31 [22725] rsync error: error in rsync protocol data stream (code 12) at io.c(1119) [generator=2.6.8] OTHER INFO =========------------------------------------------------------------------------------------------ [root@celery tmp]# df -h Filesystem Size Used Avail Use% Mounted on /dev/hda1 9.7G 3.6G 5.6G 39% / none 379M 0 379M 0% /dev/shm /dev/hda4 17G 675M 15G 5% /home /dev/hda2 9.7G 360M 8.8G 4% /var /dev/hdb1 276G 176G 87G 68% /disk2 /dev/hdc1 276G 175G 87G 67% /disk3 /dev/sda1 276G 225G 37G 87% /media/usbdisk ------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------------ /etc/rsyncd.conf: #global paramaters log file = /var/log/rsyncd.log read only = no uid = root [APACHE] comment = Backup of APACHE path = /disk3/backup/0/APACHE use chroot = yes auth users = <USERNAME> secrets file = /etc/rsyncd.secrets hosts allow = <SUBNET> 127.0.0.1 transfer logging = yes #log file = /var/log/rsync.APACHE.log ------------------------------------------------------------------------------------------ MY THOUGHTS ==========It seems that rsync is reaching some random file and then going, nope, I've done enough work. Go away. I've tried looking for bad files, but I have found nothing. FROM HERE ========I'm just about out of ideas what to do... Returning to the old version doesn't seem to fix it. I'm more than happy to provide more information if requested. I've put all that I can think into the above email. Let me know if you need anything else and I'll provide it. Many thanks, Anthony Sadler Far Edge Technology w: (02) 8425 1400
Wayne Davison
2006-May-03 05:31 UTC
Rsync error on client end: unexpected tag 3 [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(843) [sender]
On Wed, May 03, 2006 at 11:58:07AM +1000, Anthony Sadler wrote:> unexpected tag 3 [sender]My analysis shows that some deferred messages (which are those that arrive in the generator when the generator is already trying to flush a partial message down the socket) are not being handled by rwrite() like they should be. Furthermore, the MSG_LOG message doesn't actually need to be deferred, since they cannot result in any data being written to the socket. This bug can only affect an rsync daemon that is receiving files, as all other scenarios behave correctly with the existing code. The attached patch should fix the problem. I've also checked the fix into CVS (and it will appear in the next nightly tar file that gets generated). Thanks for the detailed report, it was very helpful. ..wayne.. -------------- next part -------------- --- io.c 25 Apr 2006 23:51:15 -0000 1.193 +++ io.c 3 May 2006 05:13:31 -0000 @@ -294,6 +294,7 @@ static void read_msg_fd(void) exit_cleanup(RERR_STREAMIO); } close_multiplexing_out(); + defer_forwarding_messages = 0; /* FALL THROUGH */ case MSG_INFO: case MSG_ERROR: @@ -303,7 +304,8 @@ static void read_msg_fd(void) if (n >= sizeof buf) n = sizeof buf - 1; read_loop(fd, buf, n); - if (am_generator && am_server && defer_forwarding_messages) + if (am_generator && am_server + && defer_forwarding_messages && tag != MSG_LOG) msg_list_add(&msg2sndr, tag, buf, n); else rwrite((enum logcode)tag, buf, n); @@ -1134,11 +1136,20 @@ static void msg2sndr_flush(void) while (msg2sndr.head && io_multiplexing_out) { struct msg_list_item *m = msg2sndr.head; + int tag = (IVAL(m->buf, 0) >> 24) - MPLEX_BASE; if (!(msg2sndr.head = m->next)) msg2sndr.tail = NULL; - stats.total_written += m->len; defer_forwarding_messages = 1; - writefd_unbuffered(sock_f_out, m->buf, m->len); + switch (tag) { + case MSG_INFO: + case MSG_ERROR: + rwrite((enum logcode)tag, m->buf + 4, m->len - 4); + break; + default: + stats.total_written += m->len; + writefd_unbuffered(sock_f_out, m->buf, m->len); + break; + } defer_forwarding_messages = 0; free(m); }
Wayne Davison
2006-May-03 06:15 UTC
Rsync error on client end: unexpected tag 3 [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(843) [sender]
On Wed, May 03, 2006 at 11:58:07AM +1000, Anthony Sadler wrote:> Notice /etc/localtime not being able to open?That happens because you're using "use chroot = yes". When rsync is running inside a chroot jail, it cannot access files that exist outside the rsync module's path. Rsync attempts to handle the timezone situation by using localtime() prior to the chroot call, but on some systems this is not enough to keep the library code really happy. I've checked in a change that makes rsync also call strftime() prior to the chroot (for those systems that are configured to use strftime()). ..wayne.. -------------- next part -------------- --- log.c 25 Apr 2006 23:51:15 -0000 1.146 +++ log.c 3 May 2006 06:06:19 -0000 @@ -154,17 +154,14 @@ static void logfile_open(void) void log_init(void) { - time_t t; - if (log_initialised) return; log_initialised = 1; /* this looks pointless, but it is needed in order for the * C library on some systems to fetch the timezone info * before the chroot */ - t = time(NULL); - localtime(&t); + timestring(time(NULL)); /* optionally use a log file instead of syslog */ logfname = lp_log_file();