Hi,

As a Gentoo user I frequently run `emerge sync`, which in turn does an rsync
with the main server. The 'problem' is that the portage directory tree
contains about 19,000 directories and 96,000 files, so building the file list
takes a pretty long time because of the many disk accesses that are necessary.
On the server side the disk-I/O problem is probably less severe, since after
the first run the whole tree is cached in the OS disk cache (but all the
syscalls still cost a lot of CPU, I think).

My idea is to create a patch for something like a --cache option that will use
a cached version of the file list. Instead of creating the file list every
time (100,000s of system calls and disk accesses), we can then load the file
list in one read. This is even more useful for rsync servers, which are
usually read-only (like the Gentoo mirrors, or kernel.org which always seems
to have a 100+ load ;).

I see the following problem with this: the cache will become out of sync if
something changes the local files manually. So using the cache option wouldn't
be recommended for users who don't know what's going on. However, it could be
enabled manually under the right circumstances. Maybe it's even possible to do
some extra checks on directory ctimes in the main dir, or some other checks.

-What are the opinions of other people on this list?
-Would it be easy to implement, or would it give too much trouble?
-What are the most likely problems I would run into when I try to implement
 this?
-Any ideas on WHERE to store such a cache? (a magic hidden file in the
 directory that is being built, perhaps?)

Thanks,

Edwin

-- 
  //||\\    Edwin Eefting
 || || ||   DatuX, Linux solutions and innovations
  \\||//    http://www.datux.nl

            Nieuw Amsterdamsestraat 40
            7814 VA Emmen
            Tel. 0591-857037
            Fax. 0591-633001
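[The one-read cache idea could be prototyped outside rsync with plain GNU
find; the tree and cache paths below are just assumptions for illustration:]

```shell
#!/bin/sh
# Sketch only: dump everything a file list needs (type, size, mtime, path)
# in a single tree walk, so later runs can read one file instead of
# issuing ~100,000 stat() calls. Paths are hypothetical.
TREE=${TREE:-/usr/portage}
CACHE=${CACHE:-/var/cache/rsync/portage.filelist}

mkdir -p "$(dirname "$CACHE")"

# %y = file type, %s = size, %T@ = mtime (epoch), %p = path
find "$TREE" -printf '%y %s %T@ %p\n' > "$CACHE"

wc -l "$CACHE"    # one line per directory entry in the tree
```

[A consumer would then parse this file instead of walking the tree; the hard
part, as noted above, is deciding when the snapshot has gone stale.]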
Edwin Eefting wrote:

> -What are the opinions of other people on this list?

Sounds like a great idea to me, but I'm just an rsync user.

> -Would it be easy to implement, or would it give too much trouble?

Without looking into the sources, I think it should not be that difficult to
dump the list of found items into a file and resume later from that point.

[ re-ordering ]

> -Any ideas on WHERE to store such a cache? (a magic hidden file in the
> directory that is being builded perhaps?)

I'd leave that to the user/admin. The corresponding option would name a file
where the cache information is stored.

> -What are the most likely problems i would run into when i try to implement
> this?

You can expect a feature request to allow manipulating only certain parts of
the cache (re-scanning or deleting a subtree). That would turn the cache into
a database, which would surely introduce a lot of trouble to implement. Other
problems, like the foreseeable synchronisation problems and races in cache
generation, should be caught by locking and/or "It's a feature".

Christoph
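[Edwin's directory-ctime check could catch much of the staleness problem
cheaply, since scanning only the ~19,000 directories is far less work than
stat()ing all 96,000 files. A sketch with assumed paths, using GNU find:]

```shell
#!/bin/sh
# Sketch only: snapshot directory ctimes at cache-creation time, then
# later diff a fresh directory-only scan against the snapshot. Any
# difference flags a subtree whose cached entries can't be trusted.
TREE=${TREE:-/usr/portage}
SNAP=${SNAP:-/var/cache/rsync/portage.dirs}    # hypothetical location

# Taken when the file-list cache is written (%C@ = ctime, epoch seconds):
find "$TREE" -type d -printf '%C@ %p\n' | sort -k2 > "$SNAP"

# Before trusting the cache on a later run:
if find "$TREE" -type d -printf '%C@ %p\n' | sort -k2 | diff -q - "$SNAP"
then
    echo "cache still valid"
else
    echo "cache stale: changed directories need a real re-scan"
fi
```

[This still misses in-place edits that don't touch a directory's ctime (e.g.
rewriting a file's contents), so it narrows the window rather than closing it.]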
--On Monday, May 23, 2005 03:24:07 PM +0200 Edwin Eefting <edwin@datux.nl>
wrote:

> My idea is to create a patch for something like a --cache option that
> will use a cached version of the filelist: This way instead of creating
> the filelist every time (100.000's of system calls, diskaccesses), we
> can now load the filelist in one instance. This is even more usefull for
> rsync-servers, that are usually read-only. (like the gentoo mirrors or
> kernel.org which always has a +100 load it seems ;)

You should take a look at how sup/cvsup behave before diving in, as they do
something similar.

-- 
Carson
On Mon, May 23, 2005 at 03:24:07PM +0200, Edwin Eefting wrote:

> My idea is to create a patch for something like a --cache option that
> will use a cached version of the filelist:

Something like that would be fairly easy to write, but only if there are no
conflicts between the cache and the live disk. One would simply need an
on-disk representation of the file list's in-memory data structure, and a way
to save/restore it. If you limited the code to a single source hierarchy, it
might even be possible to use the current send & receive code for the file
list (with just a little touch-up of the dir.root value, which is sender-side
only and thus not properly set by the receive code).

Every time the server updates, it would want to use an atomic-update
algorithm, like the one implemented in the atomic-rsync perl script in the
"support" dir (which uses a parallel hierarchy and the --link-dest option to
update all the files at the same time).

An alternative to this --cache idea is to use the existing batch-file
mechanism to provide a daily (or twice daily, etc.) update method for users.
It would work like this:

- A master cache server would maintain its files using a batch-writing rsync
  transfer that updates them atomically (as mentioned above), so that (1) the
  batch-creating process can be restarted from scratch if the rsync run
  doesn't finish successfully, and (2) users have a source hierarchy that
  exactly matches the last batch file's end state.

- The resulting batch file would be put into a place where it could be
  downloaded via some file-transfer protocol, such as on a webserver.

- As long as the user didn't modify the portage hierarchy between batched
  runs, it would be possible to just apply each batched transfer, one after
  the other, to update to the latest hierarchy. If something goes wrong with
  the receive, it is safe to just run the batch-reading command again (since
  rsync skips the updates that were already applied; N.B. --partial must NOT
  be enabled).
- As a fall-back, a normal rsync command to fetch files from the server would
  fix any defects and get you back in sync with the batched updates.

- I'd imagine using something like an HTTP-capable perl script to grab the
  data and output it on stdout -- this would let the batch be processed as it
  arrived instead of being written out to disk first.

Such an update mechanism would work quite well for a consistent N batched
updates per day (where N is not overly large). A set of source servers could
even use this method to mirror the N-update hierarchy throughout the day. As
long as the batch files are named uniquely, the end user doesn't need to run
the command on a regular schedule: the script could be smart enough to notice
when the local portage hierarchy was last updated and choose either to perform
one or more batch-reading runs, or to fall back to doing a normal rsync
update.

..wayne..