I see very odd results from rsync 2.4.7pre1, the latest CVS version (Sept
12, I think, was the date of the last-modified file).
We have a number of network-attached storage devices: 10/100 Ethernet,
NFSv2-mounted (under NFSv3 they buffer deletes, and recursive deletions
fail). Usually these are kept synchronized across a WAN by a nightly
cron job. We also keep a few in reserve, which we synchronize locally.
They're all the same, aside from whatever is left different by rsync
failures; they are synchronized, one at a time, during the week.
They fail on different files. As you can see, bildmax4 failed on only one
file; bildmax2 failed on several screens' worth, but I trimmed the list
down. I don't see any overlap between systems.
At the bottom, I show an ls of one of the files that failed on bildmax2,
both on the master (/big1/....) and on the bildmax itself
(/bildmax/bildmax2/...). What value is too big?
Oh, the command line:
/cadappl/encap/bin/rsync -Wa --delete --force --bwlimit=524288 source destination
It's about 128GB of data in about 2.8M files.
Any idea what this randomness is? Might the "Value too large for defined
data type" error be thrown if the system runs out of memory? These jobs
get up to over half a gigabyte of memory used.
rsync was compiled (and is running) on a 64-bit machine.
My life would be greatly simplified if I could run these syncs as single
chunks; since things are being removed as well as added, automatically
breaking up the runs may leave things out.
Anybody got any ideas? I think I've heard of others running much larger
distributions.
Incidentally, this test makes it plain that the old false-timeout
problem is fixed in this version. That was the bug where one process,
waiting for another to finish its work, would time out at the 60-second
select_timeout value (used when no io_timeout is set) and stop the run,
even though the other processes were still working.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
starting /bildmax/bildmax2 at Wed Oct 31 02:42:46 PST 2001
readlink big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/.22-4.htm.IFaG5g: Value too large for defined data type
readlink big1/cadappl1/hpux/ictools/arm_ads/1.1/common/include/.sstream.MLaG5g: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax2 at Wed Oct 31 12:46:00 PST 2001
starting /bildmax/bildmax3 at Wed Oct 31 12:46:00 PST 2001
readlink big1/cadappl1/hpux/ictools/tmpTempo: No such file or directory
symlink big/tools/DI/factory_integrator2.2/data -> /cadappl/ictools/factory_integrator/2.2/data : File exists
symlink big/tools/synopsys/synopsys1999.05/doc -> /cadappl/ictools/synopsys/1999.05/doc : File exists
symlink big1/tools1/DI/dis2.2.2/DI/documentation -> /cadappl/ictools/Design_Integrator/2.2.2/documentation : File exists
symlink big1/tools1/DI/dis2.2.2/DI/system -> /cadappl/ictools/Design_Integrator/2.2.2/system : File exists
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax3 at Wed Oct 31 18:07:04 PST 2001
starting /bildmax/bildmax4 at Wed Oct 31 18:07:04 PST 2001
readlink big/tools/DI/dis2.2.1/DI/system/product/vsc983/spice_models/.vsc9a_wire.inc.enc.1.1.0.fobG1u: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax4 at Thu Nov 1 05:09:48 PST 2001
starting /bildmax/bildmax5 at Thu Nov 1 05:09:48 PST 2001
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_danger_p/2.0/lib/corelib_danger_p/dly6x3pd/auLvs/.master.tag.dNsOZO: Value too large for defined data type
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_p/2.0.1/lib/corelib_p/ors2pd/datasheet/.master.tag.U8zOZO: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax5 at Fri Nov 2 00:41:47 PST 2001
starting /bildmax/bildmax6 at Fri Nov 2 00:41:47 PST 2001
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_p/2.0.1/lib/corelib_p/ao6anx4pd/abstract/.layout.cdb.hSuGZO: Value too large for defined data type
Tools@willy /site/local/share/ToolSync/localrep> ls -l /big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
-rw-r--r-- 1 Tools Tools 12025 Apr 27 1999 /big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
Tools@willy /site/local/share/ToolSync/localrep> ls -l /bildmax/bildmax2/big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
-rw-r--r-- 1 Tools Tools 12025 Apr 27 1999 /bildmax/bildmax2/big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
Tools@willy /site/local/share/ToolSync/localrep>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
main.c line 537 is just exit_cleanup(status).
+++++++++++++++++++++++++++++++++++++++Memory
usage+++++++++++++++++++++++++++++++++++++++++++++++++++++
load averages: 0.25, 0.33, 0.35 08:46:15
102 processes: 100 sleeping, 1 zombie, 1 on cpu
CPU states: 94.1% idle, 0.5% user, 5.3% kernel, 0.0% iowait, 0.0% swap
Memory: 3072M real, 1709M free, 644M swap in use, 5501M swap free
PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
10838 Tools 1 33 0 535M 176M sleep 55:19 0.00% rsync
19393 Tools 1 33 0 535M 1728K sleep 5:04 0.59% rsync
10837 Tools 1 33 0 285M 78M sleep 26:46 0.28% rsync
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn,
19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970),
".\n" '
"There are some who call me.... Tim?"
On Fri, Nov 02, 2001 at 08:55:14AM -0800, tim.conway@philips.com wrote:
> I see very odd results from rsync 2.4.7pre1, the latest cvs version (sept
> 12, i think was the last modified file).
...
> It's about 128Gb of data in about 2.8M files.
> Any idea what this randomness is? might the "Value too large for defined
> data type" be thrown if the system runs out of memory? These jobs get up
> to over a half a gig memory used.
> It was compiled (and is running) on a 64-bit machine.

I don't think it would get that message from running out of memory,
although a process size of >512MB of memory is awfully big.

The message "Value too large for defined data type" is what is printed
for an EOVERFLOW message, at least on Solaris 7. What operating system
are you using? It looks like all your messages say "readlink", which is
printed in the function make_file() in flist.c after a failed call to
readlink_stat(). readlink_stat() calls do_lstat() in syscall.c, which
calls lstat64() if HAVE_OFF64_T is defined; otherwise it calls lstat().
Check your config.h to see whether HAVE_OFF64_T is defined. With that
much data I assume you've got large filesystems, and you would need the
64-bit interface.

rsync 2.4.7pre1 uses a relatively new autoconf rule for support of
64-bit systems. You didn't happen to regenerate the configure script
with autoconf, did you? If you did, it has to be version 2.52 or later.

- Dave Dykstra
I'd thought of the 32-vs-64-bit issue. Here's a snatch of a trace
(truss; I AM running Solaris 7, as you mentioned).
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5089: read(0, " 0 , 1 2 0 )\0 _ c e l l".., 32768) = 32768
5085: read(8, "\0\0\0\0", 4) = 4
5085: poll(0xEFFF6B98, 1, 60000) = 1
5085: read(8, "BC02\0\0", 4) = 4
5085: poll(0xEFFF6B98, 1, 60000) = 1
5085: read(8, "\0\0\0\0", 4) = 4
5085: open64("/sql/rsync/test/tools/DI/dis2.2.1/DI/tools/VLSIMemoryIntegrator/solaris_bin/vlsi_PhantomGen", O_RDONLY) = 6
5085: fstat64(6, 0xEFFFF808) = 0
5085: poll(0xEFFF7098, 1, 60000) = 1
5089: write(1, " 0 , 1 2 0 )\0 _ c e l l".., 32768) = 32768
5089: poll(0xEFFFE000, 1, 60000) = 1
5089: read(0, "\0\0\0\0", 4) = 4
5089: poll(0xEFFFE0E8, 1, 60000) = 1
5085: write(4, "BB1F\0\0", 4) = 4
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
It's not file size, anyway: the example I gave showed that multiple
duplicate runs failed on different files, and that one randomly chosen
file that had failed was very small (<1M). That was what I meant by
unpredictable. I was hoping to find a particular file content, or an
exact file size, or something, but it seems to be the product of
randomness rather than of any particular file; no file fails in any two
runs. That was why I was wondering about total memory issues: maybe
something is getting close to using it all. There's 3G, and plenty of
swap, but if there's a case in which memory is pinned, that might make a
difference. I've not heard of pinning memory in any context except AIX,
so it may be irrelevant; it means making the allocation unpageable, so
it never leaves physical memory. I don't think that's even available in
most unices, but just in case, I thought I'd bring it up.
Tim Conway
tim.conway@philips.com
Dave Dykstra <dwd@bell-labs.com>
11/05/2001 01:58 PM
To: Tim Conway/LMT/SC/PHILIPS@AMEC
cc: rsync@lists.samba.org
Subject: Re: unpredictable behaviour
Classification:
[quoted message trimmed; see Dave's reply above]
I'm not familiar with that issue, at least as a known issue, yet. I'll
look into it. It sounds possible; the filesystems are NFS on FreeBSD
network-attached storage.
Tim Conway
tim.conway@philips.com
Jos Backus <josb@cncdsl.com>
11/05/2001 04:43 PM
Please respond to Jos Backus
To: Tim Conway/LMT/SC/PHILIPS@AMEC
cc:
Subject: Re: unpredictable behaviour
Classification:
Hi Tim,
Could this be the NFS timestamp/EOVERFLOW problem? See e.g.
http://www.google.com/search?q=cache:_GGdsnFTb8g:lists.sourceforge.net/pipermail/nfs/2000q2/001299.html+EOVERFLOW+NFS+Solaris&hl=en
--
Jos Backus _/ _/_/_/ Santa Clara, CA
_/ _/ _/
_/ _/_/_/
_/ _/ _/ _/
josb@cncdsl.com _/_/ _/_/_/ use Std::Disclaimer;