Hi,

As a Gentoo user I frequently run `emerge sync`, which in turn does an rsync
with the main server. The 'problem' is that the portage directory tree
contains about 19,000 directories and 96,000 files, so building the file list
takes a pretty long time because of the many disk accesses that are necessary.
On the server side the disk-I/O problem is probably less severe, since after
the first run the whole tree is cached in the OS disk cache (but all the
syscalls still cost a lot of CPU, I think).

My idea is to create a patch for something like a --cache option that will use
a cached version of the file list. Instead of creating the file list every
time (100,000s of system calls and disk accesses), we can then load the file
list in one read. This is even more useful for rsync servers, which are
usually read-only (like the Gentoo mirrors, or kernel.org which always seems
to have a 100+ load ;).

I see the following problem with this: the cache will become out of sync if
something changes the local files manually. So using the cache option wouldn't
be recommended for users who don't know what's going on. However, it could be
enabled manually under the right circumstances. Maybe it's even possible to do
some extra checks on directory ctimes in the main dir, or some other checks.

-What are the opinions of other people on this list?
-Would it be easy to implement, or would it give too much trouble?
-What are the most likely problems I would run into when I try to implement
 this?
-Any ideas on WHERE to store such a cache? (a magic hidden file in the
 directory that is being built, perhaps?)

Thanks,

Edwin

-- 
  //||\\    Edwin Eefting
 || || ||   DatuX, Linux solutions and innovations
  \\||//    http://www.datux.nl

            Nieuw Amsterdamsestraat 40
            7814 VA Emmen
            Tel. 0591-857037
            Fax. 0591-633001
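[The one-read cache idea could be prototyped outside rsync with plain GNU
find; the tree and cache paths below are just assumptions for illustration:]

```shell
#!/bin/sh
# Sketch only: dump everything a file list needs (type, size, mtime, path)
# in a single tree walk, so later runs can read one file instead of
# issuing ~100,000 stat() calls. Paths are hypothetical.
TREE=${TREE:-/usr/portage}
CACHE=${CACHE:-/var/cache/rsync/portage.filelist}

mkdir -p "$(dirname "$CACHE")"

# %y = file type, %s = size, %T@ = mtime (epoch), %p = path
find "$TREE" -printf '%y %s %T@ %p\n' > "$CACHE"

wc -l "$CACHE"    # one line per directory entry in the tree
```

[A consumer would then parse this file instead of walking the tree; the hard
part, as noted above, is deciding when the snapshot has gone stale.]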
Edwin Eefting wrote:

> -What are the opinions of other people on this list?

Sounds like a great idea to me, but I'm just an rsync user.

> -Would it be easy to implement, or would it give too much trouble?

Without looking into the sources, I think it should not be that difficult to
dump the list of found items into a file and resume later from that point.

[ re-ordering ]

> -Any ideas on WHERE to store such a cache? (a magic hidden file in the
> directory that is being builded perhaps?)

I'd leave that to the user/admin. The corresponding option would name a file
where the cache information is stored.

> -What are the most likely problems i would run into when i try to implement
> this?

You can expect a feature request to allow manipulating only certain parts of
the cache (re-scanning or deleting a subtree). That would turn the cache into
a database, which would surely introduce a lot of trouble to implement. Other
problems, like the foreseeable synchronisation problems and races in cache
generation, should be caught by locking and/or "It's a feature".

Christoph
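[Edwin's directory-ctime check could catch much of the staleness problem
cheaply, since scanning only the ~19,000 directories is far less work than
stat()ing all 96,000 files. A sketch with assumed paths, using GNU find:]

```shell
#!/bin/sh
# Sketch only: snapshot directory ctimes at cache-creation time, then
# later diff a fresh directory-only scan against the snapshot. Any
# difference flags a subtree whose cached entries can't be trusted.
TREE=${TREE:-/usr/portage}
SNAP=${SNAP:-/var/cache/rsync/portage.dirs}    # hypothetical location

# Taken when the file-list cache is written (%C@ = ctime, epoch seconds):
find "$TREE" -type d -printf '%C@ %p\n' | sort -k2 > "$SNAP"

# Before trusting the cache on a later run:
if find "$TREE" -type d -printf '%C@ %p\n' | sort -k2 | diff -q - "$SNAP"
then
    echo "cache still valid"
else
    echo "cache stale: changed directories need a real re-scan"
fi
```

[This still misses in-place edits that don't touch a directory's ctime (e.g.
rewriting a file's contents), so it narrows the window rather than closing it.]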
--On Monday, May 23, 2005 03:24:07 PM +0200 Edwin Eefting <edwin@datux.nl>
wrote:

> My idea is to create a patch for something like a --cache option that
> will use a cached version of the filelist: This way instead of creating
> the filelist every time (100.000's of system calls, diskaccesses), we
> can now load the filelist in one instance. This is even more usefull for
> rsync-servers, that are usually read-only. (like the gentoo mirrors or
> kernel.org which always has a +100 load it seems ;)

You should take a look at how sup/cvsup behave before diving in, as they do
something similar.

-- 
Carson
On Mon, May 23, 2005 at 03:24:07PM +0200, Edwin Eefting wrote:

> My idea is to create a patch for something like a --cache option that
> will use a cached version of the filelist:

Something like that would be fairly easy to write, but only if there are no
conflicts between the cache and the live disk. One would simply need an
on-disk representation of the file list's in-memory data structure, and a way
to save/restore it. If you limited the code to a single source hierarchy, it
might even be possible to use the current send & receive code for the file
list (with just a little touch-up of the dir.root value, which is sender-side
only and thus not properly set by the receive code).

Every time the server updates, it would want to use an atomic-update
algorithm, like the one implemented in the atomic-rsync perl script in the
"support" dir (which uses a parallel hierarchy and the --link-dest option to
update all the files at the same time).

An alternative to this --cache idea is to use the existing batch-file
mechanism to provide a daily (or twice daily, etc.) update method for users.
It would work like this:

- A master cache server would maintain its files using a batch-writing rsync
  transfer that updates them atomically (as mentioned above), so that (1) the
  batch-creating process can be restarted from scratch if the rsync run
  doesn't finish successfully, and (2) users have a source hierarchy that
  exactly matches the last batch file's end state.

- The resulting batch file would be put into a place where it could be
  downloaded via some file-transfer protocol, such as on a webserver.

- As long as the user didn't modify the portage hierarchy between batched
  runs, it would be possible to just apply each batched transfer, one after
  the other, to update to the latest hierarchy. If something goes wrong with
  the receive, it is safe to just run the batch-reading command again (since
  rsync skips the updates that were already applied; N.B. --partial must NOT
  be enabled).
- As a fall-back, a normal rsync command to fetch files from the server would
  fix any defects and get you back in sync with the batched updates.

- I'd imagine using something like an HTTP-capable perl script to grab the
  data and output it on stdout -- this would let the batch be processed as it
  arrived instead of being written out to disk first.

Such an update mechanism would work quite well for a consistent N batched
updates per day (where N is not overly large). A set of source servers could
even use this method to mirror the N-update hierarchy throughout the day. As
long as the batch files are named uniquely, the end user doesn't need to run
the command on a regular schedule: the script could be smart enough to notice
when the local portage hierarchy was last updated and choose either to perform
one or more batch-reading runs, or to fall back to doing a normal rsync
update.

..wayne..