Attached is a patch that implements compressing output files as they're
written to disk, using zlib. Thus far I've only used it for
synchronizing directories on a single machine.

What seems to work / what's done:

  - Synchronizing directories with all files in the target directory
    gzip'd. Files seem to contain the correct data. Use the option
    "--gzip-dest".

  - Only transferring files whose checksums are different. Destination
    files are gunzip'd before their checksums are calculated.

  - Added an option "--ignore-sizes", since there is no easy way for
    the receiver to know the uncompressed size of the files it already
    has. For now you have to use --checksum to be sure...

  - Added gzio.c from the latest zlib distribution so we can call
    gzwrite() etc. (see the sketch after this message for the general
    shape of that API).

What remains to be done / problems:

  - Needs more testing, especially with remote clients / servers.

  - Batch files are not compressed.

  - Reading compressed files should be implemented in a more generic
    fashion, perhaps in map_file() and its cousins. I started working
    on this but saw that changing map_file() et al. could have
    far-reaching consequences, so I took the easy way out: I just
    changed the one routine I cared about for now.

  - Add documentation of the new options to the manpages etc.

  - Find a way to record the uncompressed file sizes on the receiving
    side. Perhaps the least-bad way to do this would be to append some
    rsync-specific data, including the uncompressed size, to the end of
    the gzip'd files. The receiver could read this back in on future
    runs when it needed to. Gunzip'ing the file from the command line
    would still work but would give an "ignoring trailing garbage" kind
    of error.

What I've done so far isn't pretty, but I thought I'd send it in in
case someone else finds it useful.

	-Joel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: rsync-2.5.5-jav-20020702.diff.gz
Type: application/octet-stream
Size: 10770 bytes
Url : http://lists.samba.org/archive/rsync/attachments/20020702/6d65e755/rsync-2.5.5-jav-20020702.diff.obj
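For reference, the zlib gz* API that gzio.c provides is small. Here is a
minimal sketch of the approach the patch takes - open the destination
with gzopen(), push received data through gzwrite(), and let gzclose()
write the gzip trailer. gzopen(), gzwrite(), and gzclose() are real
zlib calls; the file name, buffer, and compression level here are
placeholders for illustration, not values taken from the patch:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char buf[] = "data as it arrives from the sender\n";

    /* "wb6" opens for writing at compression level 6; the level is a
     * placeholder - the patch may choose its own. */
    gzFile out = gzopen("dest-file.gz", "wb6");
    if (out == NULL) {
        perror("gzopen");
        return 1;
    }

    /* gzwrite() compresses and writes; it returns the number of
     * uncompressed bytes consumed, or 0 on error. */
    if (gzwrite(out, buf, (unsigned)strlen(buf)) == 0) {
        fprintf(stderr, "gzwrite failed\n");
        gzclose(out);
        return 1;
    }

    /* gzclose() flushes the stream and writes the gzip trailer,
     * including the CRC32 and the ISIZE field discussed in the
     * follow-up below. */
    if (gzclose(out) != Z_OK) {
        fprintf(stderr, "gzclose failed\n");
        return 1;
    }
    return 0;
}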
On Wednesday 03 July 2002 01:51, Joel Votaw wrote:
> Attached is a patch that implements compressing output files as
> they're written to disk, using zlib. Thus far I've only used it for
> synchronizing directories on a single machine.
...
> - Added an option "--ignore-sizes", since there is no easy way for
>   the receiver to know the uncompressed size of the files it already
>   has. For now you have to use --checksum to be sure...
>
> - Find a way to record the uncompressed file sizes on the receiving
>   side. Perhaps the least-bad way to do this would be to append some
>   rsync-specific data, including the uncompressed size, to the end of
>   the gzip'd files. The receiver could read this back in on future
>   runs when it needed to. Gunzip'ing the file from the command line
>   would still work but would give an "ignoring trailing garbage" kind
>   of error.

The gzip standard, RFC 1952 (http://www.faqs.org/rfcs/rfc1952.html),
defines the field ISIZE:

    This contains the size of the original (uncompressed) input data
    modulo 2^32.

I'd expect that zlib sets that field and has a way to read it back
(see the sketch after this message).

> What I've done so far isn't pretty, but I thought I'd send it in in
> case someone else finds it useful.

It's amazing. I'll have a use for it once the problem with the sizes
is solved - completely unzipping and checksumming the file doesn't make
sense for local file systems.

BTW: we might even save the MD4 checksum of the original file in a gzip
extra field. See the RFC:

    XLEN (eXtra LENgth)
        If FLG.FEXTRA is set, this gives the length of the optional
        extra field. See below for details.

Maybe even the checksums of the individual blocks - but that depends on
the block size (which can vary with every invocation) and needs some
space. If there were a way to decompress starting from the middle of
the file it could make sense, as we wouldn't have to unzip the complete
file just to send the checksums over the wire...

Maybe we should save the block size that was used when generating the
checksums, together with the resulting block checksums. Then the only
time we would unzip is to actually send portions of the file. At least
the whole-file MD4 checksum would be OK, I guess.

Regards,

Phil
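zlib's gz* layer doesn't expose ISIZE directly, but RFC 1952 places it
in the last four bytes of the file, little-endian, so it can be read
with plain stdio. A minimal sketch, assuming a single-member gzip file
(with concatenated members this reports only the last member's size);
the function name gzip_isize() is made up for illustration:

#include <stdio.h>
#include <stdint.h>

/* Read the ISIZE field from the gzip trailer: per RFC 1952 it is the
 * last 4 bytes of the file, little-endian, and holds the uncompressed
 * size modulo 2^32.  Returns 0 on success, -1 on error. */
static int gzip_isize(const char *path, uint32_t *isize)
{
    FILE *f = fopen(path, "rb");
    unsigned char trailer[4];

    if (f == NULL)
        return -1;
    if (fseek(f, -4L, SEEK_END) != 0 ||
        fread(trailer, 1, 4, f) != 4) {
        fclose(f);
        return -1;
    }
    fclose(f);

    *isize = (uint32_t)trailer[0]
           | ((uint32_t)trailer[1] << 8)
           | ((uint32_t)trailer[2] << 16)
           | ((uint32_t)trailer[3] << 24);
    return 0;
}

int main(int argc, char **argv)
{
    uint32_t size;

    if (argc < 2 || gzip_isize(argv[1], &size) != 0) {
        fprintf(stderr, "usage: gzip_isize file.gz\n");
        return 1;
    }
    printf("uncompressed size (mod 2^32): %lu\n", (unsigned long)size);
    return 0;
}

Note the "modulo 2^32" caveat: for files of 4GB and over, ISIZE alone
cannot distinguish sizes that differ by a multiple of 2^32.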
On Tue, Jul 02, 2002 at 05:51:58PM -0600, Joel Votaw wrote:
>
> Attached is a patch that implements compressing output files as
> they're written to disk, using zlib. Thus far I've only used it for
> synchronizing directories on a single machine.

This would certainly be useful once it is reliable. It would be handy
for dirvish and other backup tools.

>
> What seems to work / what's done:
>
> - Synchronizing directories with all files in the target directory
>   gzip'd. Files seem to contain the correct data. Use the option
>   "--gzip-dest".

It should also have a --gzip-src option to allow reciprocal transfers.
Comments in the patch mention this, I notice.

> - Only transferring files whose checksums are different. Destination
>   files are gunzip'd before their checksums are calculated.
>
> - Added an option "--ignore-sizes", since there is no easy way for
>   the receiver to know the uncompressed size of the files it already
>   has. For now you have to use --checksum to be sure...

This option shouldn't be necessary once you extract the size from the
internal gzip file structure.

>
> - Added gzio.c from the latest zlib distribution so we can call
>   gzwrite() etc.
>
> What remains to be done / problems:
>
> - Needs more testing, especially with remote clients / servers.
>
> - Batch files are not compressed.

Huh? Please explain what a "batch" file is and why it doesn't get
compressed.

>
> - Reading compressed files should be implemented in a more generic
>   fashion, perhaps in map_file() and its cousins. I started working
>   on this but saw that changing map_file() et al. could have
>   far-reaching consequences, so I took the easy way out: I just
>   changed the one routine I cared about for now.
>
> - Add documentation of the new options to the manpages etc.
>
> - Find a way to record the uncompressed file sizes on the receiving
>   side. Perhaps the least-bad way to do this would be to append some
>   rsync-specific data, including the uncompressed size, to the end of
>   the gzip'd files. The receiver could read this back in on future
>   runs when it needed to. Gunzip'ing the file from the command line
>   would still work but would give an "ignoring trailing garbage" kind
>   of error.
>
> What I've done so far isn't pretty, but I thought I'd send it in in
> case someone else finds it useful.
>
> -Joel

It seems (to me) a reasonable start so far. The comments show some
foresight regarding the bidirectional plans and support for other
compression libraries and levels.

I don't know if I'd support multiple compression libraries, but if you
do, might I suggest calling the option --zip-dest and having it take an
argument to specify the compression library, i.e.

    --zip-dest (bzip2|gzip)[=(1-9)]

In any case I would make --gzip-dest take an optional argument for
specifying the compression level right away (a sketch of parsing such
an argument follows this message). Also, downgrade the default level to
6, as the speed penalty of level 9 is seldom worth the marginal
compression increase.

One extra issue to consider is that accidentally leaving off the
--gzip-* option would really mess things up (imagine restoring /usr).
Long term, a sanity check might be in order, with a way to override it.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt
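A minimal sketch of parsing an option argument of the suggested
"(bzip2|gzip)[=(1-9)]" form, with the level defaulting to 6 as argued
above. parse_zip_dest() and the exact syntax accepted are hypothetical;
the patch itself only implements a bare --gzip-dest:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Split "lib" or "lib=N" into a library name and a level 1-9
 * (defaulting to 6).  Returns 0 on success, -1 on a bad argument. */
static int parse_zip_dest(const char *arg, char lib[16], int *level)
{
    const char *eq = strchr(arg, '=');
    size_t len = eq ? (size_t)(eq - arg) : strlen(arg);

    if (len == 0 || len >= 16)
        return -1;
    memcpy(lib, arg, len);
    lib[len] = '\0';

    /* Only the two libraries discussed in the thread. */
    if (strcmp(lib, "gzip") != 0 && strcmp(lib, "bzip2") != 0)
        return -1;

    if (eq) {
        char *end;
        long l = strtol(eq + 1, &end, 10);
        if (*end != '\0' || l < 1 || l > 9)
            return -1;
        *level = (int)l;
    } else {
        *level = 6;  /* default level 6, per the suggestion above */
    }
    return 0;
}

int main(void)
{
    char lib[16];
    int level;

    if (parse_zip_dest("bzip2=7", lib, &level) == 0)
        printf("library=%s level=%d\n", lib, level);  /* bzip2, 7 */
    if (parse_zip_dest("gzip", lib, &level) == 0)
        printf("library=%s level=%d\n", lib, level);  /* gzip, 6 */
    return 0;
}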