I see very odd results from rsync 2.4.7pre1, the latest CVS version (Sept 12, I think, was the date on the last modified file).

We have a number of network-attached storage devices: 10/100 ethernet, NFS2-mounted (under NFS3 they buffer deletes, and recursive deletions fail). Usually these are kept synchronized across a WAN by a nightly cron job. We also keep a few in reserve, which we synchronize locally, one at a time, during the week. They are all the same, aside from whatever is left different by rsync failures.

They fail on different files. As you can see below, bildmax4 failed on only one file; bildmax2 failed on several screensful, but I trimmed it down. I don't see any duplication between systems. At the bottom I show an ls of one of the files that failed on bildmax2, both on the master (/big1/...) and on the bildmax itself (/bildmax/bildmax2/...). What value is too big?

The command line:

/cadappl/encap/bin/rsync -Wa --delete --force --bwlimit=524288 source destination

It's about 128GB of data in about 2.8M files.

Any idea what this randomness is? Might "Value too large for defined data type" be thrown if the system runs out of memory? These jobs get up to over half a gigabyte of memory used. rsync was compiled (and is running) on a 64-bit machine.

My life would be greatly simplified if I could run these syncs as single chunks: things are being removed as well as added, so automatically breaking up the runs may leave things out. Anybody got any ideas? I think I've heard of others running much larger distributions.

Incidentally, this test makes it plain that the old false-timeout problem is fixed in this version: the one where a process waiting for another to finish its work would hit the 60-second select_timeout used when no io_timeout is set, and time out and stop the run even though the other processes were still working.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
starting /bildmax/bildmax2 at Wed Oct 31 02:42:46 PST 2001
readlink big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/.22-4.htm.IFaG5g: Value too large for defined data type
readlink big1/cadappl1/hpux/ictools/arm_ads/1.1/common/include/.sstream.MLaG5g: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax2 at Wed Oct 31 12:46:00 PST 2001
starting /bildmax/bildmax3 at Wed Oct 31 12:46:00 PST 2001
readlink big1/cadappl1/hpux/ictools/tmpTempo: No such file or directory
symlink big/tools/DI/factory_integrator2.2/data -> /cadappl/ictools/factory_integrator/2.2/data : File exists
symlink big/tools/synopsys/synopsys1999.05/doc -> /cadappl/ictools/synopsys/1999.05/doc : File exists
symlink big1/tools1/DI/dis2.2.2/DI/documentation -> /cadappl/ictools/Design_Integrator/2.2.2/documentation : File exists
symlink big1/tools1/DI/dis2.2.2/DI/system -> /cadappl/ictools/Design_Integrator/2.2.2/system : File exists
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax3 at Wed Oct 31 18:07:04 PST 2001
starting /bildmax/bildmax4 at Wed Oct 31 18:07:04 PST 2001
readlink big/tools/DI/dis2.2.1/DI/system/product/vsc983/spice_models/.vsc9a_wire.inc.enc.1.1.0.fobG1u: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax4 at Thu Nov 1 05:09:48 PST 2001
starting /bildmax/bildmax5 at Thu Nov 1 05:09:48 PST 2001
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_danger_p/2.0/lib/corelib_danger_p/dly6x3pd/auLvs/.master.tag.dNsOZO: Value too large for defined data type
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_p/2.0.1/lib/corelib_p/ors2pd/datasheet/.master.tag.U8zOZO: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax5 at Fri Nov 2 00:41:47 PST 2001
starting /bildmax/bildmax6 at Fri Nov 2 00:41:47 PST 2001
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_p/2.0.1/lib/corelib_p/ao6anx4pd/abstract/.layout.cdb.hSuGZO: Value too large for defined data type

Tools@willy /site/local/share/ToolSync/localrep> ls -l /big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
-rw-r--r--   1 Tools    Tools      12025 Apr 27  1999 /big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
Tools@willy /site/local/share/ToolSync/localrep> ls -l /bildmax/bildmax2/big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
-rw-r--r--   1 Tools    Tools      12025 Apr 27  1999 /bildmax/bildmax2/big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
Tools@willy /site/local/share/ToolSync/localrep>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

main.c line 537 is just exit_cleanup(status)

+++++++++++++++++++++++++++ Memory usage +++++++++++++++++++++++++++++++
load averages:  0.25,  0.33,  0.35                              08:46:15
102 processes: 100 sleeping, 1 zombie, 1 on cpu
CPU states: 94.1% idle,  0.5% user,  5.3% kernel,  0.0% iowait,  0.0% swap
Memory: 3072M real, 1709M free, 644M swap in use, 5501M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
10838 Tools      1  33    0  535M  176M sleep   55:19  0.00% rsync
19393 Tools      1  33    0  535M 1728K sleep    5:04  0.59% rsync
10837 Tools      1  33    0  285M   78M sleep   26:46  0.28% rsync
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn, 19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), ".\n" '
"There are some who call me.... Tim?"
On Fri, Nov 02, 2001 at 08:55:14AM -0800, tim.conway@philips.com wrote:
> I see very odd results from rsync 2.4.7pre1, the latest cvs version (sept
> 12, i think was the last modified file).
...
> It's about 128Gb of data in about 2.8M files.
> Any idea what this randomness is? might the "Value too large for defined
> data type" be thrown if the system runs out of memory? These jobs get up
> to over a half a gig memory used.
> It was compiled (and is running) on a 64-bit machine.

I don't think it would get that message from running out of memory, although a process size of >512MB is awfully big. The message "Value too large for defined data type" is what is printed for an EOVERFLOW error, at least on Solaris 7. What operating system are you using?

It looks like your messages all say "readlink", which is printed in the function make_file() in flist.c after a failed call to readlink_stat(). readlink_stat() calls do_lstat() in syscall.c, which calls lstat64() if HAVE_OFF64_T is defined; otherwise it calls lstat(). Check your config.h to see whether HAVE_OFF64_T is defined. With that much data I assume you've got large filesystems, and you would need the 64-bit interface.

rsync 2.4.7pre1 uses a relatively new autoconf rule for support of 64-bit systems. You didn't happen to regenerate the configure script with autoconf, did you? If you did, it has to be version 2.52 or later.

- Dave Dykstra
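For reference, the dispatch Dave describes lives in syscall.c. Below is a minimal, self-contained sketch of it (not the literal rsync source; the HAVE_OFF64_T macro and the STRUCT_STAT name follow his description and rsync's config.h conventions), which also reproduces the shape of the error line rsync prints when the call fails:

    /* sketch.c -- minimal sketch of the lstat dispatch described above.
     * HAVE_OFF64_T and STRUCT_STAT mirror rsync's names; this is an
     * illustration, not the actual rsync code. */
    #define _LARGEFILE64_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    #ifdef HAVE_OFF64_T
    #define STRUCT_STAT struct stat64
    #else
    #define STRUCT_STAT struct stat
    #endif

    static int do_lstat(const char *fname, STRUCT_STAT *st)
    {
    #ifdef HAVE_OFF64_T
        return lstat64(fname, st);  /* large-file-aware interface */
    #else
        return lstat(fname, st);    /* 32-bit interface; fails with EOVERFLOW
                                       when a size, inode number or timestamp
                                       does not fit its 32-bit fields */
    #endif
    }

    int main(int argc, char **argv)
    {
        STRUCT_STAT st;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <path>\n", argv[0]);
            return 2;
        }
        if (do_lstat(argv[1], &st) != 0) {
            /* Mirrors the shape of rsync's message:
             * "readlink <path>: Value too large for defined data type" */
            fprintf(stderr, "readlink %s: %s\n", argv[1], strerror(errno));
            return 1;
        }
        printf("ok: size=%lld\n", (long long)st.st_size);
        return 0;
    }

Compiled once plain and once with -DHAVE_OFF64_T, then pointed at one of the failing paths, it would show whether the 32-bit lstat() path is the one returning EOVERFLOW.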
I'd thought of the 32- vs. 64-bit issue. Here's a snatch of a truss trace (I AM running Solaris 7, as you mentioned):

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5089:  read(0, " 0 , 1 2 0 )\0 _ c e l l".., 32768)    = 32768
5085:  read(8, "\0\0\0\0", 4)                          = 4
5085:  poll(0xEFFF6B98, 1, 60000)                      = 1
5085:  read(8, "BC02\0\0", 4)                          = 4
5085:  poll(0xEFFF6B98, 1, 60000)                      = 1
5085:  read(8, "\0\0\0\0", 4)                          = 4
5085:  open64("/sql/rsync/test/tools/DI/dis2.2.1/DI/tools/VLSIMemoryIntegrator/solaris_bin/vlsi_PhantomGen", O_RDONLY) = 6
5085:  fstat64(6, 0xEFFFF808)                          = 0
5085:  poll(0xEFFF7098, 1, 60000)                      = 1
5089:  write(1, " 0 , 1 2 0 )\0 _ c e l l".., 32768)   = 32768
5089:  poll(0xEFFFE000, 1, 60000)                      = 1
5089:  read(0, "\0\0\0\0", 4)                          = 4
5089:  poll(0xEFFFE0E8, 1, 60000)                      = 1
5085:  write(4, "BB1F\0\0", 4)                         = 4
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

It's not file size, anyway: the example I gave showed that multiple duplicate runs failed on different files, and that one randomly chosen file that had failed was very small (<1M). That is what I meant by unpredictable. I was hoping to find a certain file content, or an exact file size, or something, but it seems to be the product of randomness rather than of any particular file. No file fails in any two runs.

That is why I was wondering about total memory issues: maybe something is getting close to using it all. There's 3G of RAM, and plenty of swap, but if there's a case in which memory is pinned, that might make a difference. I've not heard of pinning memory in any context except AIX, so that may be irrelevant. It means making an allocation unpageable, so it never leaves physical memory. I don't think that's even available in most unices, but just in case, I thought I'd bring it up.

Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn, 19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), ".\n" '
"There are some who call me.... Tim?"
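An aside on the "pinned memory" point above: POSIX does expose pinning on most Unices, Solaris included, as mlock()/mlockall() (it needs privilege, typically root). The fragment below only illustrates the concept of making pages unpageable; it is not anything rsync itself does:

    /* pin.c -- illustration of "pinning" memory with mlockall(); needs
     * sufficient privilege.  Nothing to do with rsync's own behaviour. */
    #include <sys/mman.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;   /* pretend 64 MB working set */
        char *buf;

        /* Lock every page the process maps, now and in the future, into
         * physical RAM; the VM system may not page these out. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            fprintf(stderr, "mlockall: %s\n", strerror(errno));
            return 1;
        }

        buf = malloc(len);
        if (buf == NULL) {
            perror("malloc");
            return 1;
        }
        memset(buf, 1, len);             /* these pages stay resident */

        munlockall();                    /* undo the pinning */
        free(buf);
        return 0;
    }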
I'm not familiar with that issue, at least as a known issue, yet. I'll look into it. It sounds possible: the filesystems are NFS, on FreeBSD network-attached storage.

Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn, 19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), ".\n" '
"There are some who call me.... Tim?"

Jos Backus <josb@cncdsl.com>  11/05/2001 04:43 PM
Please respond to Jos Backus
To: Tim Conway/LMT/SC/PHILIPS@AMEC
Subject: Re: unpredictable behaviour

Hi Tim,

Could this be the NFS timestamp/EOVERFLOW problem? See e.g.
http://www.google.com/search?q=cache:_GGdsnFTb8g:lists.sourceforge.net/pipermail/nfs/2000q2/001299.html+EOVERFLOW+NFS+Solaris&hl=en

-- 
Jos Backus
josb@cncdsl.com
Santa Clara, CA
use Std::Disclaimer;
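A hypothetical way to chase the NFS/EOVERFLOW theory Jos raises (none of this is rsync code): if a value the NFS server hands back does not fit the 32-bit struct stat, lstat() fails with EOVERFLOW while lstat64() on the same path usually succeeds, so dumping the 64-bit fields shows which value is out of range. Compile it without -D_FILE_OFFSET_BITS=64 so the plain lstat() really is the 32-bit interface:

    /* nfsover.c -- compare the 32-bit and 64-bit stat interfaces on one
     * path to see which field overflows.  Diagnostic sketch only. */
    #define _LARGEFILE64_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        struct stat st;      /* 32-bit interface (no large-file compilation) */
        struct stat64 st64;  /* 64-bit interface */
        int err;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <path>\n", argv[0]);
            return 2;
        }
        if (lstat(argv[1], &st) == 0) {
            printf("32-bit lstat() is fine: size=%ld mtime=%ld\n",
                   (long)st.st_size, (long)st.st_mtime);
            return 0;
        }
        err = errno;
        fprintf(stderr, "lstat %s: %s\n", argv[1], strerror(err));
        if (err != EOVERFLOW)
            return 1;

        /* The 32-bit call overflowed; the 64-bit call shows the raw values. */
        if (lstat64(argv[1], &st64) != 0) {
            fprintf(stderr, "lstat64 %s: %s\n", argv[1], strerror(errno));
            return 1;
        }
        printf("size=%lld ino=%llu uid=%ld gid=%ld\n",
               (long long)st64.st_size, (unsigned long long)st64.st_ino,
               (long)st64.st_uid, (long)st64.st_gid);
        printf("mtime=%ld atime=%ld ctime=%ld\n",
               (long)st64.st_mtime, (long)st64.st_atime, (long)st64.st_ctime);
        return 0;
    }

Run against a path from the log (for example the 22-4.htm file shown in the ls output above), it would either confirm or rule out an out-of-range value coming back from the NFS server for that file.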