I see very odd results from rsync 2.4.7pre1, the latest CVS version (Sept 12, I think, was the date on the last modified file).

We have a number of network-attached storage devices: 10/100 ethernet, NFS2-mounted (under NFS3 they buffer deletes, and recursive deletions fail). Usually these are kept synchronized across a WAN by a nightly cron job. We also keep a few in reserve, which we synchronize locally, one at a time, during the week. They are all the same, aside from whatever is left different by rsync failures.

They fail on different files. As you can see below, bildmax4 failed on only one file; bildmax2 failed on several screensful, but I trimmed it down. I don't see any duplication between systems. At the bottom I show an ls of one of the files that failed on bildmax2, both on the master (/big1/...) and on the bildmax itself (/bildmax/bildmax2/...). What value is too big?

The command line:

/cadappl/encap/bin/rsync -Wa --delete --force --bwlimit=524288 source destination

It's about 128GB of data in about 2.8M files.

Any idea what this randomness is? Might "Value too large for defined data type" be thrown if the system runs out of memory? These jobs get up to over half a gigabyte of memory used. rsync was compiled (and is running) on a 64-bit machine.

My life would be greatly simplified if I could run these syncs as single chunks: things are being removed as well as added, so automatically breaking up the runs may leave things out. Anybody got any ideas? I think I've heard of others running much larger distributions.

Incidentally, this test makes it plain that the old false-timeout problem is fixed in this version: the one where a process waiting for another to finish its work would hit the 60-second select_timeout used when no io_timeout is set, and time out and stop the run even though the other processes were still working.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
starting /bildmax/bildmax2 at Wed Oct 31 02:42:46 PST 2001
readlink big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/.22-4.htm.IFaG5g: Value too large for defined data type
readlink big1/cadappl1/hpux/ictools/arm_ads/1.1/common/include/.sstream.MLaG5g: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax2 at Wed Oct 31 12:46:00 PST 2001
starting /bildmax/bildmax3 at Wed Oct 31 12:46:00 PST 2001
readlink big1/cadappl1/hpux/ictools/tmpTempo: No such file or directory
symlink big/tools/DI/factory_integrator2.2/data -> /cadappl/ictools/factory_integrator/2.2/data : File exists
symlink big/tools/synopsys/synopsys1999.05/doc -> /cadappl/ictools/synopsys/1999.05/doc : File exists
symlink big1/tools1/DI/dis2.2.2/DI/documentation -> /cadappl/ictools/Design_Integrator/2.2.2/documentation : File exists
symlink big1/tools1/DI/dis2.2.2/DI/system -> /cadappl/ictools/Design_Integrator/2.2.2/system : File exists
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax3 at Wed Oct 31 18:07:04 PST 2001
starting /bildmax/bildmax4 at Wed Oct 31 18:07:04 PST 2001
readlink big/tools/DI/dis2.2.1/DI/system/product/vsc983/spice_models/.vsc9a_wire.inc.enc.1.1.0.fobG1u: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax4 at Thu Nov 1 05:09:48 PST 2001
starting /bildmax/bildmax5 at Thu Nov 1 05:09:48 PST 2001
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_danger_p/2.0/lib/corelib_danger_p/dly6x3pd/auLvs/.master.tag.dNsOZO: Value too large for defined data type
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_p/2.0.1/lib/corelib_p/ors2pd/datasheet/.master.tag.U8zOZO: Value too large for defined data type
rsync error: partial transfer (code 23) at main.c(537)
finished /bildmax/bildmax5 at Fri Nov 2 00:41:47 PST 2001
starting /bildmax/bildmax6 at Fri Nov 2 00:41:47 PST 2001
readlink big1/cadappl1/hpux/iclibs/CMOS18/PcCMOS18corelib_p/2.0.1/lib/corelib_p/ao6anx4pd/abstract/.layout.cdb.hSuGZO: Value too large for defined data type

Tools@willy /site/local/share/ToolSync/localrep> ls -l /big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
-rw-r--r--   1 Tools    Tools      12025 Apr 27  1999 /big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
Tools@willy /site/local/share/ToolSync/localrep> ls -l /bildmax/bildmax2/big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
-rw-r--r--   1 Tools    Tools      12025 Apr 27  1999 /bildmax/bildmax2/big1/cadappl1/hpux/ictools/arm_ads/1.1/common/html/stdug/general/22-4.htm
Tools@willy /site/local/share/ToolSync/localrep>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

main.c line 537 is just exit_cleanup(status)

+++++++++++++++++++++++++++ Memory usage +++++++++++++++++++++++++++++++
load averages:  0.25,  0.33,  0.35                              08:46:15
102 processes: 100 sleeping, 1 zombie, 1 on cpu
CPU states: 94.1% idle,  0.5% user,  5.3% kernel,  0.0% iowait,  0.0% swap
Memory: 3072M real, 1709M free, 644M swap in use, 5501M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
10838 Tools      1  33    0  535M  176M sleep   55:19  0.00% rsync
19393 Tools      1  33    0  535M 1728K sleep    5:04  0.59% rsync
10837 Tools      1  33    0  285M   78M sleep   26:46  0.28% rsync
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn, 19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), ".\n" '
"There are some who call me.... Tim?"
On Fri, Nov 02, 2001 at 08:55:14AM -0800, tim.conway@philips.com wrote:
> I see very odd results from rsync 2.4.7pre1, the latest cvs version (sept
> 12, i think was the last modified file).
...
> It's about 128Gb of data in about 2.8M files.
> Any idea what this randomness is? might the "Value too large for defined
> data type" be thrown if the system runs out of memory? These jobs get up
> to over a half a gig memory used.
> It was compiled (and is running) on a 64-bit machine.

I don't think it would get that message from running out of memory, although a process size of >512MB is awfully big. The message "Value too large for defined data type" is what is printed for an EOVERFLOW error, at least on Solaris 7. What operating system are you using?

It looks like your messages all say "readlink", which is printed in the function make_file() in flist.c after a failed call to readlink_stat(). readlink_stat() calls do_lstat() in syscall.c, which calls lstat64() if HAVE_OFF64_T is defined; otherwise it calls lstat(). Check your config.h to see whether HAVE_OFF64_T is defined. With that much data I assume you've got large filesystems, and you would need the 64-bit interface.

rsync 2.4.7pre1 uses a relatively new autoconf rule for support of 64-bit systems. You didn't happen to regenerate the configure script with autoconf, did you? If you did, it has to be version 2.52 or later.

- Dave Dykstra
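For reference, the dispatch Dave describes lives in syscall.c. Below is a minimal, self-contained sketch of it (not the literal rsync source; the HAVE_OFF64_T macro and the STRUCT_STAT name follow his description and rsync's config.h conventions), which also reproduces the shape of the error line rsync prints when the call fails:

    /* sketch.c -- minimal sketch of the lstat dispatch described above.
     * HAVE_OFF64_T and STRUCT_STAT mirror rsync's names; this is an
     * illustration, not the actual rsync code. */
    #define _LARGEFILE64_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    #ifdef HAVE_OFF64_T
    #define STRUCT_STAT struct stat64
    #else
    #define STRUCT_STAT struct stat
    #endif

    static int do_lstat(const char *fname, STRUCT_STAT *st)
    {
    #ifdef HAVE_OFF64_T
        return lstat64(fname, st);  /* large-file-aware interface */
    #else
        return lstat(fname, st);    /* 32-bit interface; fails with EOVERFLOW
                                       when a size, inode number or timestamp
                                       does not fit its 32-bit fields */
    #endif
    }

    int main(int argc, char **argv)
    {
        STRUCT_STAT st;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <path>\n", argv[0]);
            return 2;
        }
        if (do_lstat(argv[1], &st) != 0) {
            /* Mirrors the shape of rsync's message:
             * "readlink <path>: Value too large for defined data type" */
            fprintf(stderr, "readlink %s: %s\n", argv[1], strerror(errno));
            return 1;
        }
        printf("ok: size=%lld\n", (long long)st.st_size);
        return 0;
    }

Compiled once plain and once with -DHAVE_OFF64_T, then pointed at one of the failing paths, it would show whether the 32-bit lstat() path is the one returning EOVERFLOW.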
I'd thought of the 32- vs. 64-bit issue. Here's a snatch of a truss trace (I AM running Solaris 7, as you mentioned):

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5089:  read(0, " 0 , 1 2 0 )\0 _ c e l l".., 32768)    = 32768
5085:  read(8, "\0\0\0\0", 4)                          = 4
5085:  poll(0xEFFF6B98, 1, 60000)                      = 1
5085:  read(8, "BC02\0\0", 4)                          = 4
5085:  poll(0xEFFF6B98, 1, 60000)                      = 1
5085:  read(8, "\0\0\0\0", 4)                          = 4
5085:  open64("/sql/rsync/test/tools/DI/dis2.2.1/DI/tools/VLSIMemoryIntegrator/solaris_bin/vlsi_PhantomGen", O_RDONLY) = 6
5085:  fstat64(6, 0xEFFFF808)                          = 0
5085:  poll(0xEFFF7098, 1, 60000)                      = 1
5089:  write(1, " 0 , 1 2 0 )\0 _ c e l l".., 32768)   = 32768
5089:  poll(0xEFFFE000, 1, 60000)                      = 1
5089:  read(0, "\0\0\0\0", 4)                          = 4
5089:  poll(0xEFFFE0E8, 1, 60000)                      = 1
5085:  write(4, "BB1F\0\0", 4)                         = 4
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

It's not file size, anyway: the example I gave showed that multiple duplicate runs failed on different files, and that one randomly chosen file that had failed was very small (<1M). That is what I meant by unpredictable. I was hoping to find a certain file content, or an exact file size, or something, but it seems to be the product of randomness rather than of any particular file. No file fails in any two runs.

That is why I was wondering about total memory issues: maybe something is getting close to using it all. There's 3G of RAM, and plenty of swap, but if there's a case in which memory is pinned, that might make a difference. I've not heard of pinning memory in any context except AIX, so that may be irrelevant. It means making an allocation unpageable, so it never leaves physical memory. I don't think that's even available in most unices, but just in case, I thought I'd bring it up.

Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn, 19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), ".\n" '
"There are some who call me.... Tim?"
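An aside on the "pinned memory" point above: POSIX does expose pinning on most Unices, Solaris included, as mlock()/mlockall() (it needs privilege, typically root). The fragment below only illustrates the concept of making pages unpageable; it is not anything rsync itself does:

    /* pin.c -- illustration of "pinning" memory with mlockall(); needs
     * sufficient privilege.  Nothing to do with rsync's own behaviour. */
    #include <sys/mman.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;   /* pretend 64 MB working set */
        char *buf;

        /* Lock every page the process maps, now and in the future, into
         * physical RAM; the VM system may not page these out. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            fprintf(stderr, "mlockall: %s\n", strerror(errno));
            return 1;
        }

        buf = malloc(len);
        if (buf == NULL) {
            perror("malloc");
            return 1;
        }
        memset(buf, 1, len);             /* these pages stay resident */

        munlockall();                    /* undo the pinning */
        free(buf);
        return 0;
    }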
I'm not familiar with that issue, at least as a known issue, yet. I'll look into it. It sounds possible: the filesystems are NFS, on FreeBSD network-attached storage.

Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn, 19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), ".\n" '
"There are some who call me.... Tim?"

Jos Backus <josb@cncdsl.com>  11/05/2001 04:43 PM
Please respond to Jos Backus
To: Tim Conway/LMT/SC/PHILIPS@AMEC
Subject: Re: unpredictable behaviour

Hi Tim,

Could this be the NFS timestamp/EOVERFLOW problem? See e.g.
http://www.google.com/search?q=cache:_GGdsnFTb8g:lists.sourceforge.net/pipermail/nfs/2000q2/001299.html+EOVERFLOW+NFS+Solaris&hl=en

-- 
Jos Backus
josb@cncdsl.com
Santa Clara, CA
use Std::Disclaimer;
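A hypothetical way to chase the NFS/EOVERFLOW theory Jos raises (none of this is rsync code): if a value the NFS server hands back does not fit the 32-bit struct stat, lstat() fails with EOVERFLOW while lstat64() on the same path usually succeeds, so dumping the 64-bit fields shows which value is out of range. Compile it without -D_FILE_OFFSET_BITS=64 so the plain lstat() really is the 32-bit interface:

    /* nfsover.c -- compare the 32-bit and 64-bit stat interfaces on one
     * path to see which field overflows.  Diagnostic sketch only. */
    #define _LARGEFILE64_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        struct stat st;      /* 32-bit interface (no large-file compilation) */
        struct stat64 st64;  /* 64-bit interface */
        int err;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <path>\n", argv[0]);
            return 2;
        }
        if (lstat(argv[1], &st) == 0) {
            printf("32-bit lstat() is fine: size=%ld mtime=%ld\n",
                   (long)st.st_size, (long)st.st_mtime);
            return 0;
        }
        err = errno;
        fprintf(stderr, "lstat %s: %s\n", argv[1], strerror(err));
        if (err != EOVERFLOW)
            return 1;

        /* The 32-bit call overflowed; the 64-bit call shows the raw values. */
        if (lstat64(argv[1], &st64) != 0) {
            fprintf(stderr, "lstat64 %s: %s\n", argv[1], strerror(errno));
            return 1;
        }
        printf("size=%lld ino=%llu uid=%ld gid=%ld\n",
               (long long)st64.st_size, (unsigned long long)st64.st_ino,
               (long)st64.st_uid, (long)st64.st_gid);
        printf("mtime=%ld atime=%ld ctime=%ld\n",
               (long)st64.st_mtime, (long)st64.st_atime, (long)st64.st_ctime);
        return 0;
    }

Run against a path from the log (for example the 22-4.htm file shown in the ls output above), it would either confirm or rule out an out-of-range value coming back from the NFS server for that file.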