Attached is a patch that implements compressing output files as they're
written to disk, using zlib. Thus far I've only used it for
synchronizing directories on a single machine.

What seems to work / what's done:

  - Synchronizing directories with all files in the target directory
    gzip'd. Files seem to contain the correct data. Use the option
    "--gzip-dest".

  - Only transferring files whose checksums are different. Destination
    files are gunzip'd before their checksums are calculated.

  - Added an option "--ignore-sizes", since there is no easy way for
    the receiver to know the uncompressed size of the files it already
    has. For now you have to use --checksum to be sure...

  - Added gzio.c from the latest zlib distribution so we can call
    gzwrite() etc. (see the sketch after this message for the general
    shape of that API).

What remains to be done / problems:

  - Needs more testing, especially with remote clients / servers.

  - Batch files are not compressed.

  - Reading compressed files should be implemented in a more generic
    fashion, perhaps in map_file() and its cousins. I started working
    on this but saw that changing map_file() et al. could have
    far-reaching consequences, so I took the easy way out: I just
    changed the one routine I cared about for now.

  - Add documentation of the new options to the manpages etc.

  - Find a way to record the uncompressed file sizes on the receiving
    side. Perhaps the least-bad way to do this would be to append some
    rsync-specific data, including the uncompressed size, to the end of
    the gzip'd files. The receiver could read this back in on future
    runs when it needed to. Gunzip'ing the file from the command line
    would still work but would give an "ignoring trailing garbage" kind
    of error.

What I've done so far isn't pretty, but I thought I'd send it in in
case someone else finds it useful.

	-Joel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: rsync-2.5.5-jav-20020702.diff.gz
Type: application/octet-stream
Size: 10770 bytes
Url : http://lists.samba.org/archive/rsync/attachments/20020702/6d65e755/rsync-2.5.5-jav-20020702.diff.obj
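For reference, the zlib gz* API that gzio.c provides is small. Here is a
minimal sketch of the approach the patch takes - open the destination
with gzopen(), push received data through gzwrite(), and let gzclose()
write the gzip trailer. gzopen(), gzwrite(), and gzclose() are real
zlib calls; the file name, buffer, and compression level here are
placeholders for illustration, not values taken from the patch:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char buf[] = "data as it arrives from the sender\n";

    /* "wb6" opens for writing at compression level 6; the level is a
     * placeholder - the patch may choose its own. */
    gzFile out = gzopen("dest-file.gz", "wb6");
    if (out == NULL) {
        perror("gzopen");
        return 1;
    }

    /* gzwrite() compresses and writes; it returns the number of
     * uncompressed bytes consumed, or 0 on error. */
    if (gzwrite(out, buf, (unsigned)strlen(buf)) == 0) {
        fprintf(stderr, "gzwrite failed\n");
        gzclose(out);
        return 1;
    }

    /* gzclose() flushes the stream and writes the gzip trailer,
     * including the CRC32 and the ISIZE field discussed in the
     * follow-up below. */
    if (gzclose(out) != Z_OK) {
        fprintf(stderr, "gzclose failed\n");
        return 1;
    }
    return 0;
}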
On Wednesday 03 July 2002 01:51, Joel Votaw wrote:
> Attached is a patch that implements compressing output files as
> they're written to disk, using zlib. Thus far I've only used it for
> synchronizing directories on a single machine.
...
> - Added an option "--ignore-sizes", since there is no easy way for
>   the receiver to know the uncompressed size of the files it already
>   has. For now you have to use --checksum to be sure...
>
> - Find a way to record the uncompressed file sizes on the receiving
>   side. Perhaps the least-bad way to do this would be to append some
>   rsync-specific data, including the uncompressed size, to the end of
>   the gzip'd files. The receiver could read this back in on future
>   runs when it needed to. Gunzip'ing the file from the command line
>   would still work but would give an "ignoring trailing garbage" kind
>   of error.

The gzip standard, RFC 1952 (http://www.faqs.org/rfcs/rfc1952.html),
defines the field ISIZE:

    This contains the size of the original (uncompressed) input data
    modulo 2^32.

I'd expect that zlib sets that field and has a way to read it back
(see the sketch after this message).

> What I've done so far isn't pretty, but I thought I'd send it in in
> case someone else finds it useful.

It's amazing. I'll have a use for it once the problem with the sizes
is solved - completely unzipping and checksumming the file doesn't make
sense for local file systems.

BTW: we might even save the MD4 checksum of the original file in a gzip
extra field. See the RFC:

    XLEN (eXtra LENgth)
        If FLG.FEXTRA is set, this gives the length of the optional
        extra field. See below for details.

Maybe even the checksums of the individual blocks - but that depends on
the block size (which can vary with every invocation) and needs some
space. If there were a way to decompress starting from the middle of
the file it could make sense, as we wouldn't have to unzip the complete
file just to send the checksums over the wire...

Maybe we should save the block size that was used when generating the
checksums, together with the resulting block checksums. Then the only
time we would unzip is to actually send portions of the file. At least
the whole-file MD4 checksum would be OK, I guess.

Regards,

Phil
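zlib's gz* layer doesn't expose ISIZE directly, but RFC 1952 places it
in the last four bytes of the file, little-endian, so it can be read
with plain stdio. A minimal sketch, assuming a single-member gzip file
(with concatenated members this reports only the last member's size);
the function name gzip_isize() is made up for illustration:

#include <stdio.h>
#include <stdint.h>

/* Read the ISIZE field from the gzip trailer: per RFC 1952 it is the
 * last 4 bytes of the file, little-endian, and holds the uncompressed
 * size modulo 2^32.  Returns 0 on success, -1 on error. */
static int gzip_isize(const char *path, uint32_t *isize)
{
    FILE *f = fopen(path, "rb");
    unsigned char trailer[4];

    if (f == NULL)
        return -1;
    if (fseek(f, -4L, SEEK_END) != 0 ||
        fread(trailer, 1, 4, f) != 4) {
        fclose(f);
        return -1;
    }
    fclose(f);

    *isize = (uint32_t)trailer[0]
           | ((uint32_t)trailer[1] << 8)
           | ((uint32_t)trailer[2] << 16)
           | ((uint32_t)trailer[3] << 24);
    return 0;
}

int main(int argc, char **argv)
{
    uint32_t size;

    if (argc < 2 || gzip_isize(argv[1], &size) != 0) {
        fprintf(stderr, "usage: gzip_isize file.gz\n");
        return 1;
    }
    printf("uncompressed size (mod 2^32): %lu\n", (unsigned long)size);
    return 0;
}

Note the "modulo 2^32" caveat: for files of 4GB and over, ISIZE alone
cannot distinguish sizes that differ by a multiple of 2^32.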
On Tue, Jul 02, 2002 at 05:51:58PM -0600, Joel Votaw wrote:
>
> Attached is a patch that implements compressing output files as
> they're written to disk, using zlib. Thus far I've only used it for
> synchronizing directories on a single machine.

This would certainly be useful once it is reliable. It would be handy
for dirvish and other backup tools.

>
> What seems to work / what's done:
>
> - Synchronizing directories with all files in the target directory
>   gzip'd. Files seem to contain the correct data. Use the option
>   "--gzip-dest".

It should also have a --gzip-src option to allow reciprocal transfers.
Comments in the patch mention this, I notice.

> - Only transferring files whose checksums are different. Destination
>   files are gunzip'd before their checksums are calculated.
>
> - Added an option "--ignore-sizes", since there is no easy way for
>   the receiver to know the uncompressed size of the files it already
>   has. For now you have to use --checksum to be sure...

This option shouldn't be necessary once you extract the size from the
internal gzip file structure.

>
> - Added gzio.c from the latest zlib distribution so we can call
>   gzwrite() etc.
>
> What remains to be done / problems:
>
> - Needs more testing, especially with remote clients / servers.
>
> - Batch files are not compressed.

Huh? Please explain what a "batch" file is and why it doesn't get
compressed.

>
> - Reading compressed files should be implemented in a more generic
>   fashion, perhaps in map_file() and its cousins. I started working
>   on this but saw that changing map_file() et al. could have
>   far-reaching consequences, so I took the easy way out: I just
>   changed the one routine I cared about for now.
>
> - Add documentation of the new options to the manpages etc.
>
> - Find a way to record the uncompressed file sizes on the receiving
>   side. Perhaps the least-bad way to do this would be to append some
>   rsync-specific data, including the uncompressed size, to the end of
>   the gzip'd files. The receiver could read this back in on future
>   runs when it needed to. Gunzip'ing the file from the command line
>   would still work but would give an "ignoring trailing garbage" kind
>   of error.
>
> What I've done so far isn't pretty, but I thought I'd send it in in
> case someone else finds it useful.
>
> -Joel

It seems (to me) a reasonable start so far. The comments show some
foresight regarding the bidirectional plans and support for other
compression libraries and levels.

I don't know if I'd support multiple compression libraries, but if you
do, might I suggest calling the option --zip-dest and having it take an
argument to specify the compression library, i.e.

    --zip-dest (bzip2|gzip)[=(1-9)]

In any case I would make --gzip-dest take an optional argument for
specifying the compression level right away (a sketch of parsing such
an argument follows this message). Also, downgrade the default level to
6, as the speed penalty of level 9 is seldom worth the marginal
compression increase.

One extra issue to consider is that accidentally leaving off the
--gzip-* option would really mess things up (imagine restoring /usr).
Long term, a sanity check might be in order, with a way to override it.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt
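A minimal sketch of parsing an option argument of the suggested
"(bzip2|gzip)[=(1-9)]" form, with the level defaulting to 6 as argued
above. parse_zip_dest() and the exact syntax accepted are hypothetical;
the patch itself only implements a bare --gzip-dest:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Split "lib" or "lib=N" into a library name and a level 1-9
 * (defaulting to 6).  Returns 0 on success, -1 on a bad argument. */
static int parse_zip_dest(const char *arg, char lib[16], int *level)
{
    const char *eq = strchr(arg, '=');
    size_t len = eq ? (size_t)(eq - arg) : strlen(arg);

    if (len == 0 || len >= 16)
        return -1;
    memcpy(lib, arg, len);
    lib[len] = '\0';

    /* Only the two libraries discussed in the thread. */
    if (strcmp(lib, "gzip") != 0 && strcmp(lib, "bzip2") != 0)
        return -1;

    if (eq) {
        char *end;
        long l = strtol(eq + 1, &end, 10);
        if (*end != '\0' || l < 1 || l > 9)
            return -1;
        *level = (int)l;
    } else {
        *level = 6;  /* default level 6, per the suggestion above */
    }
    return 0;
}

int main(void)
{
    char lib[16];
    int level;

    if (parse_zip_dest("bzip2=7", lib, &level) == 0)
        printf("library=%s level=%d\n", lib, level);  /* bzip2, 7 */
    if (parse_zip_dest("gzip", lib, &level) == 0)
        printf("library=%s level=%d\n", lib, level);  /* gzip, 6 */
    return 0;
}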