thr3ads.net - rsync - performance suggestion: sparse files [Aug 2003]

Jon Howell

2003-Aug-27 04:28 UTC

performance suggestion: sparse files

So I was transferring a 2GB virtual machine disk image over a slow
wireless link. Of course I used --sparse, to keep the image small on the
destination end as well as on the source end.

Much to my surprise, I noticed that the transfer took a long time even
when it got past the first 0.5GB of actually-populated file. A little
sleuthing with strace revealed that the source rsync was dutifully reading
block after block of zeros, sending them to ssh, who compressed them and
sent them across the wire(less), where another rsync got the zero blocks,
realized that they were sparse, and just bode its time until it could do
one big seek to the next non-sparse block. ("bode its time"? Who writes
like that?)

Of course, it never survived to see that moment; a cruel SIGINT arrived
and dispatched both rsyncs.

It seems like the right thing would be for the local end to skim past the
zero blocks and send some metainformation, to avoid encrypting and
transferring many GB of zeros.

I worked around the problem by adding -z to compress the stream first
(blocks of zeros compress remarkably well), and that made the virtual disk
image transfer go much faster. Of course, all of the .tgzs and .tbzs in
the same transfer got slower waiting on the source CPU to compress the
incompressible.

The obvious solution is to <music type=organ register=bass>change the
protocol</music>, but that seems like a scary thing to do for a
performance tweak. What about an option for "really-crappy-compression"?
Something really cheezy (RLE) that can decide in a hurry whether to
compress away a string of zeros, and if not, just send them raw. That way,
performance on compressed files stays I/O bound even on systems with pokey
CPUs, but sparse files are disk-bound on the source system (as they should
be). (And, of course, --sparse would automatically promote the compression
level to "really-crappy" if it was at "none" before.)
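A minimal sketch of what such a cheap zero-run encoding might look like, purely as an illustration. It is not rsync code and not part of any rsync protocol; the names `encode_zero_runs`, `decode_zero_runs`, `ZERO_RUN`, `LITERAL`, and the 4096-byte block size are all assumptions made up for this example. All-zero blocks collapse into a single length record, and everything else is passed through raw, so already-compressed data costs only one scan per block.

```python
ZERO_RUN = 0  # hypothetical record tag: a run of zero bytes, stored as a length only
LITERAL = 1   # hypothetical record tag: raw bytes passed through unchanged


def encode_zero_runs(data: bytes, block_size: int = 4096):
    """Split data into blocks and merge consecutive all-zero blocks into one record."""
    records = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if not any(block):
            # The block is entirely zeros: extend the previous zero run if there is one.
            if records and records[-1][0] == ZERO_RUN:
                records[-1] = (ZERO_RUN, records[-1][1] + len(block))
            else:
                records.append((ZERO_RUN, len(block)))
        else:
            # Anything else is sent raw, with no attempt to compress it.
            records.append((LITERAL, block))
    return records


def decode_zero_runs(records) -> bytes:
    """Rebuild the original bytes from the records produced by encode_zero_runs."""
    out = bytearray()
    for tag, payload in records:
        if tag == ZERO_RUN:
            out.extend(bytes(payload))  # bytes(n) is n zero bytes
        else:
            out.extend(payload)
    return bytes(out)
```

A receiver writing with --sparse could seek past each zero-run record instead of writing it, but that part is not shown here.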

Well, okay, they shouldn't even be disk bound; the source system should be
able to discover the sparsity of the file without making 1.5GB-worth of
read calls. Does POSIX (or do specific OSes) offer a call that provides a
map of allocated regions in the file?

Source rsync: 2.5.6
Destination rsync: 2.5.5
Diligence: I searched for 'sparse' in the faqomatic, the bug database, the
current issues page, the TODO document, and the mailing list archive, and
didn't find anything relevant; please don't flame if I missed an existing
comment.

Thanks!

    --Jon

jw schultz

2003-Aug-27 06:45 UTC

performance suggestion: sparse files

On Tue, Aug 26, 2003 at 11:28:12AM -0700, Jon Howell wrote:
> So I was transferring a 2GB virtual machine disk image over a slow
> wireless link. Of course I used --sparse, to keep the image small on the
> destination end as well as on the source end.
> 
> Much to my surprise, I noticed that the transfer took a long time even
> when it got past the first 0.5GB of actually-populated file. A little
> sleuthing with strace revealed that the source rsync was dutifully reading
> block after block of zeros, sending them to ssh, who compressed them and
> sent them across the wire(less), where another rsync got the zero blocks,
> realized that they were sparse, and just bode its time until it could do
> one big seek to the next non-sparse block. ("bode its time"? Who writes
> like that?)
Had you been updating an existing image file, the blocks of
zeroes would have had matches and not been sent.  A workaround
if you do this again in future would be to create an original
file full of zeros:
dd if=/dev/zero of=$dest bs=1024 count=$block_size
> 
> Of course, it never survived to see that moment; a cruel SIGINT arrived
> and dispatched both rsyncs.
> 
> It seems like the right thing would be for the local end to skim past the
> zero blocks and send some metainformation, to avoid encrypting and
> transferring many GB of zeros.
> 
> I worked around the problem by adding -z to compress the stream first
> (blocks of zeros compress remarkably well), and that made the virtual disk
> image transfer go much faster. Of course, all of the .tgzs and .tbzs in
> the same transfer got slower waiting on the source CPU to compress the
> incompressible.
That is what I would have recommended.
> The obvious solution is to <music type=organ register=bass>change the
> protocol</music>, but that seems like a scary thing to do for a
> performance tweak. What about an option for "really-crappy-compression"?
> Something really cheezy (RLE) that can decide in a hurry whether to
> compress away a string of zeros, and if not, just send them raw. That way,
> performance on compressed files stays I/O bound even on systems with pokey
> CPUs, but sparse files are disk-bound on the source system (as they should
> be). (And, of course, --sparse would automatically promote the compression
> level to "really-crappy" if it was at "none" before.)
This is really only an issue when rsync hits a new file.  I
agree an RLE of the stream _sounds_ like a good idea.  But
even better might be an extra phantom block that represents
all zeros.  That too would require a protocol bump.
> Well, okay, they shouldn't even be disk bound; the source system should be
> able to discover the sparsity of the file without making 1.5GB-worth of
> read calls. Does POSIX (or do specific OSes) offer a call that provides a
> map of allocated regions in the file?
There is no way in user-mode to distinguish between a sparse file and a
file full of zeroed blocks.
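
For readers on later systems: some platforms have since gained an lseek() extension, SEEK_DATA and SEEK_HOLE, that can report which regions of a file hold data. Nothing in this thread refers to it, so the following is only a hedged sketch. It assumes a Python build and OS that expose `os.SEEK_DATA` and `os.SEEK_HOLE` (Linux, for example), and the helper name `data_extents` is made up for illustration. A filesystem without hole support may simply report the whole file as one data region.

```python
import errno
import os


def data_extents(path):
    """Return (start, end) byte ranges that the filesystem reports as data.

    Assumes os.SEEK_DATA and os.SEEK_HOLE are available on this platform.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Find the next offset at or after `offset` that holds data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # only a hole remains between offset and end of file
                raise
            # Find where that data region ends (end of file counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

A sender could read only the returned ranges and skip the rest without issuing read calls for the holes.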
> Source rsync: 2.5.6
> Destination rsync: 2.5.5
> Diligence: I searched for 'sparse' in the faqomatic, the bug database, the
> current issues page, the TODO document, and the mailing list archive, and
> didn't find anything relevant; please don't flame if I missed an existing
> comment.
> 
> Thanks!
> 
>     --Jon
> 
> 
> -- 
> To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
> Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
> 
-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt
