On Mon, 15 Dec 2003, jw schultz <jw@pegasys.ws> wrote:

> OK, first pass on TODO complete.
....
> PERFORMANCE ----------------------------------------------------------
....
> Traverse just one directory at a time
>
>     Traverse just one directory at a time.  Tridge says it's possible.
>
>     At the moment rsync reads the whole file list into memory at the
>     start, which makes us use a lot of memory and also not pipeline
>     network access as much as we could.

An additional comment should be added observing that this will affect
hardlink processing, since it relies on the entire flist array being
present in order to match dev and inode numbers.  But perhaps the
required hlist array could be saved and built on the fly as items
with node counts > 1 are encountered.

....
> Hard-link handling
>
>     At the moment hardlink handling is very expensive, so it's off by
>     default.  It does not need to be so.
>
>     Since most of the solutions are rather intertwined with the file
>     list it is probably better to fix that first, although fixing
>     hardlinks is possibly simpler.
>
>     We can rule out hardlinked directories since they will probably
>     screw us up in all kinds of ways.  They simply should not be used.
>
>     At the moment rsync only cares about hardlinks to regular files.  I
>     guess you could also use them for sockets, devices and other beasts,
>     but I have not seen them.
>
>     When trying to reproduce hard links, we only need to worry about
>     files that have more than one name (nlinks>1 && !S_ISDIR).

It would be very helpful if file_struct.flags could have a bit set to
indicate that the node count is greater than 1.  This info could be
used later to optimize the hardlink search by only considering those
flist entries with this flag bit set.

It'd be nice to implement this bit-setting under the current protocol
number so it can be widely distributed before 2.6.1 is released, which
could then contain the code that actually makes use of it.  I'd be
interested in doing the later changes, but if Martin or jw could at
least get the bit set...  It doesn't even have to depend on the
--hard-links option.  Just examine the node count and set the bit.
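To illustrate what I mean (a sketch only -- the flag name, the helper,
and this file_struct layout are hypothetical, not rsync's actual code):

    #include <sys/stat.h>

    /* Hypothetical flag bit for "node (link) count > 1". */
    #define FLAG_MULTI_LINKED (1 << 0)

    struct file_struct {            /* stand-in for rsync's flist entry */
        unsigned flags;
        /* ... other members elided ... */
    };

    /* Called while building the file list, with the stat info in hand. */
    static void note_multi_linked(struct file_struct *file,
                                  const struct stat *st)
    {
        /* Only non-directories with more than one name are interesting. */
        if (!S_ISDIR(st->st_mode) && st->st_nlink > 1)
            file->flags |= FLAG_MULTI_LINKED;
    }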
> The basic point of this is to discover alternate names that refer to
> the same file.  All operations, including creating the file and
> writing modifications to it need only to be done for the first name.
> For all later names, we just create the link and then leave it
> alone.

An earlier thread started 11/25/2003 points out that in certain cases,
a hardlinked file is unnecessarily transferred in full.  This is due
to the algorithm described above.  If the first file in the sorted list
is missing, but a later one exists, then that file should be used as
the master.  I've been thinking of solutions to this as well.  But
not until after 2.6.0 is released.

> If hard links are to be preserved:
>
>     Before the generator/receiver fork, the list of files is received
>     from the sender (recv_file_list), and a table for detecting hard
>     links is built.
>
>     The generator looks for hard links within the file list and does
>     not send checksums for them, though it does send other metadata.
>
>     The sender sends the device number and inode with file entries, so
>     that files are uniquely identified.
>
>     The receiver goes through and creates hard links (do_hard_links)
>     after all data has been written, but before directory permissions
>     are set.
>
>     At the moment device and inum are sent as 4-byte integers, which
>     will probably cause problems on large filesystems.  On Linux the
>     kernel uses 64-bit ino_t's internally, and people will soon have
>     filesystems big enough to use them.  We ought to follow NFS4 in
>     using 64-bit device and inode identification, perhaps with a
>     protocol version bump.
>
>     Once we've seen all the names for a particular file, we no longer
>     need to think about it and we can deallocate the memory.
>
>     We can also have the case where there are links to a file that are
>     not in the tree being transferred.  There's nothing we can do about
>     that.  Because we rename the destination into place after writing,
>     any hardlinks to the old file are always going to be orphaned.  In
>     fact that is almost necessary because otherwise we'd get really
>     confused if we were generating checksums for one name of a file and
>     modifying another.
>
>     At the moment the code seems to make a whole second copy of the file
>     list, which seems unnecessary.

Indeed!  It does!  Very wasteful.  It should only need a list of pointers
to the flist entries and sort that list.  Furthermore, with the addition
of the new multiple nodes flag bit requested above, the list of pointers
would only contain pointers to flist entries with that bit set, resulting
in a much smaller list.
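Roughly like this (again only a sketch: the struct is the hypothetical
one from above, extended with the dev and inode fields the comparison
needs):

    #include <stdint.h>
    #include <stdlib.h>

    struct file_struct {            /* hypothetical flist entry */
        unsigned flags;
        uint64_t dev, inode;
    };
    #define FLAG_MULTI_LINKED (1 << 0)

    /* Order by (dev, inode) so all names for one file end up adjacent. */
    static int hlink_compare(const void *p1, const void *p2)
    {
        const struct file_struct *f1 = *(struct file_struct *const *)p1;
        const struct file_struct *f2 = *(struct file_struct *const *)p2;

        if (f1->dev != f2->dev)
            return f1->dev < f2->dev ? -1 : 1;
        if (f1->inode != f2->inode)
            return f1->inode < f2->inode ? -1 : 1;
        return 0;
    }

    /* Build a sorted array of pointers to just the multiply-linked
     * entries, instead of copying the entire file list. */
    static struct file_struct **build_hlink_list(struct file_struct **files,
                                                 int count, int *num_hlinks)
    {
        struct file_struct **hlinks = malloc(count * sizeof *hlinks);
        int i, n = 0;

        if (!hlinks)
            return NULL;
        for (i = 0; i < count; i++) {
            if (files[i]->flags & FLAG_MULTI_LINKED)
                hlinks[n++] = files[i];
        }
        qsort(hlinks, n, sizeof *hlinks, hlink_compare);
        *num_hlinks = n;
        return hlinks;
    }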
-- 
       John Van Essen  Univ of MN Alumnus  <vanes002@umn.edu>


On Tue, Dec 16, 2003 at 03:18:15PM -0600, John Van Essen wrote:
> On Mon, 15 Dec 2003, jw schultz <jw@pegasys.ws> wrote:
>
> > OK, first pass on TODO complete.
> ....
> > PERFORMANCE ----------------------------------------------------------
> ....
> > Traverse just one directory at a time
> >
> >     Traverse just one directory at a time.  Tridge says it's possible.
> >
> >     At the moment rsync reads the whole file list into memory at the
> >     start, which makes us use a lot of memory and also not pipeline
> >     network access as much as we could.
>
> An additional comment should be added observing that this will affect
> hardlink processing, since it relies on the entire flist array being
> present in order to match dev and inode numbers.  But perhaps the
> required hlist array could be saved and built on the fly as items
> with node counts > 1 are encountered.

Dynamic creation of the hlist set (hash perhaps) would deal
with that, yes.
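Something in this vein, perhaps (a throwaway sketch; none of these
names exist in rsync today):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* One entry per (dev, inode) pair seen so far. */
    struct hlink_entry {
        uint64_t dev, inode;
        char *master_name;          /* first name seen for this inode */
        struct hlink_entry *next;   /* collision chain */
    };

    #define HLINK_BUCKETS 4096
    static struct hlink_entry *hlink_table[HLINK_BUCKETS];

    /* Look up (dev, inode), inserting on first sight.  Call this only
     * for entries whose node count is > 1.  If the returned entry's
     * master_name differs from name, the caller just makes a link
     * instead of transferring the file. */
    static struct hlink_entry *hlink_lookup(uint64_t dev, uint64_t inode,
                                            const char *name)
    {
        unsigned h = (unsigned)((dev ^ (inode * 31)) % HLINK_BUCKETS);
        struct hlink_entry *e;

        for (e = hlink_table[h]; e; e = e->next) {
            if (e->dev == dev && e->inode == inode)
                return e;
        }
        if (!(e = malloc(sizeof *e)))
            return NULL;
        e->dev = dev;
        e->inode = inode;
        e->master_name = strdup(name);
        e->next = hlink_table[h];
        hlink_table[h] = e;
        return e;
    }

That keeps memory proportional to the number of multiply-linked files
actually seen, which fits the one-directory-at-a-time plan.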
> ....
> > Hard-link handling
> >
> >     At the moment hardlink handling is very expensive, so it's off by
> >     default.  It does not need to be so.
> >
> >     Since most of the solutions are rather intertwined with the file
> >     list it is probably better to fix that first, although fixing
> >     hardlinks is possibly simpler.
> >
> >     We can rule out hardlinked directories since they will probably
> >     screw us up in all kinds of ways.  They simply should not be used.
> >
> >     At the moment rsync only cares about hardlinks to regular files.  I
> >     guess you could also use them for sockets, devices and other beasts,
> >     but I have not seen them.
> >
> >     When trying to reproduce hard links, we only need to worry about
> >     files that have more than one name (nlinks>1 && !S_ISDIR).
>
> It would be very helpful if file_struct.flags could have a bit set to
> indicate that the node count is greater than 1.  This info could be
> used later to optimize the hardlink search by only considering those
> flist entries with this flag bit set.
>
> It'd be nice to implement this bit-setting under the current protocol
> number so it can be widely distributed before 2.6.1 is released, which
> could then contain the code that actually makes use of it.  I'd be
> interested in doing the later changes, but if Martin or jw could at
> least get the bit set...  It doesn't even have to depend on the
> --hard-links option.  Just examine the node count and set the bit.

I'm not keen on squeezing that in at this time.  Let's get it out
the door; hardlink performance improvements can be made in a minor
release.

I'm also a bit more inclined to pass nlinks (IFF non-zero and ~IS_DIR).
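On the wire that might look something like the following (a rough
sketch: the XMIT_HAS_NLINKS bit and the calling convention are invented
here, and write_int()/write_longint() are only assumed to behave like
the io.c helpers):

    #include <sys/stat.h>

    #define XMIT_HAS_NLINKS (1 << 7)        /* hypothetical per-entry flag */

    void write_int(int f, int x);           /* assumed io.c helpers */
    void write_longint(int f, long long x);

    static void send_link_info(int f, const struct stat *st,
                               unsigned *xflags)
    {
        /* Skip directories and the common single-link case, so the
         * protocol stays cheap for the vast majority of entries. */
        if (S_ISDIR(st->st_mode) || st->st_nlink <= 1)
            return;

        *xflags |= XMIT_HAS_NLINKS;     /* tell the receiver it's coming */
        write_int(f, (int)st->st_nlink);

        /* 64 bits each for dev and inode, as the TODO suggests, so
         * large filesystems don't wrap. */
        write_longint(f, (long long)st->st_dev);
        write_longint(f, (long long)st->st_ino);
    }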
> > The basic point of this is to discover alternate names that refer to
> > the same file.  All operations, including creating the file and
> > writing modifications to it need only to be done for the first name.
> > For all later names, we just create the link and then leave it
> > alone.
>
> An earlier thread started 11/25/2003 points out that in certain cases,
> a hardlinked file is unnecessarily transferred in full.  This is due
> to the algorithm described above.  If the first file in the sorted list
> is missing, but a later one exists, then that file should be used as
> the master.  I've been thinking of solutions to this as well.  But
> not until after 2.6.0 is released.
>
> > If hard links are to be preserved:
> >
> >     Before the generator/receiver fork, the list of files is received
> >     from the sender (recv_file_list), and a table for detecting hard
> >     links is built.
> >
> >     The generator looks for hard links within the file list and does
> >     not send checksums for them, though it does send other metadata.
> >
> >     The sender sends the device number and inode with file entries, so
> >     that files are uniquely identified.
> >
> >     The receiver goes through and creates hard links (do_hard_links)
> >     after all data has been written, but before directory permissions
> >     are set.
> >
> >     At the moment device and inum are sent as 4-byte integers, which
> >     will probably cause problems on large filesystems.  On Linux the
> >     kernel uses 64-bit ino_t's internally, and people will soon have
> >     filesystems big enough to use them.  We ought to follow NFS4 in
> >     using 64-bit device and inode identification, perhaps with a
> >     protocol version bump.
> >
> >     Once we've seen all the names for a particular file, we no longer
> >     need to think about it and we can deallocate the memory.
> >
> >     We can also have the case where there are links to a file that are
> >     not in the tree being transferred.  There's nothing we can do about
> >     that.  Because we rename the destination into place after writing,
> >     any hardlinks to the old file are always going to be orphaned.  In
> >     fact that is almost necessary because otherwise we'd get really
> >     confused if we were generating checksums for one name of a file and
> >     modifying another.
> >
> >     At the moment the code seems to make a whole second copy of the file
> >     list, which seems unnecessary.
>
> Indeed!  It does!  Very wasteful.  It should only need a list of pointers
> to the flist entries and sort that list.  Furthermore, with the addition
> of the new multiple nodes flag bit requested above, the list of pointers
> would only contain pointers to flist entries with that bit set, resulting
> in a much smaller list.

You do mean multiple paths, same node?  :)

Let's take this up after the release, shall we?

-- 
________________________________________________________________
       J.W. Schultz            Pegasystems Technologies
       email address:          jw@pegasys.ws

                Remember Cernan and Schmitt


Hello,

I read with interest the mailing list thread found here:

  http://marc.10east.com/?t=107160967400007&r=1&w=2

We have a "situation" with rsync and --hard-links that was the reason
for my search in MARC's rsync list archive that turned up the thread
shown above.  After reading through that thread, and other information
on this topic, I believe that sharing our situation with you will in
itself prove to be a good contribution to rsync (which is an excellent
tool, BTW).  So, here goes:

We have a process on a backup server (I called it "s" below) that each
night rsyncs a full copy of /, /var, and /usr from a great number of
systems.  As a rule we put /, /var, and /usr on separate partitions,
but that detail is not important.  What is important is to understand
exactly how we do these nightly, full system backups.

First, let me show you what a small part of the system_backups
hierarchy looks like:

root@s:/vol/6/system_backups# find . -type d -maxdepth 1
.
./client1
./docs1.colo1
./docs2.colo1
./ipfw-internal.colo1
./ipfw1
./ipfw2
./docsdev1

root@s:/vol/6/system_backups# find . -type d -maxdepth 2|head -25|egrep -v '^\./[^/]+$'|sort
.
./client1/20031223
./client1/20031224
./client1/20031225
./client1/20031226
./client1/20031227
./client1/20031229
./client1/20040102
./client1/current
./docs1.colo1/20031219
./docs1.colo1/20031223
./docs1.colo1/20031224
./docs1.colo1/20031225
./docs1.colo1/20031226
./docs1.colo1/20031227
./docs1.colo1/20031229
./docs1.colo1/20040102
./docs1.colo1/current
./docs1.colo1/image-20031218
./docs2.colo1/20031218
./docs2.colo1/20031219
./docs2.colo1/current

OK, that gives you an idea of how the hierarchy looks.  Here is the
critical part, though.  The logic that creates these each night looks
like this:

TODAY=$(date +%Y%m%d)
for HOST in $HOSTS; do      # $HOSTS holds the list of machines to back up
    cp -al $HOST/current $HOST/$TODAY
    # ...now rsync remote $HOST into my local $HOST/current...
done

For those not familiar with the -l option to cp:

root@s:/vol/6/system_backups# man cp|grep -B1 -A1 'hard links instead'
       -l, --link
              Make hard links instead of copies of non-directories.

What we end up with is a tree that is _very_ fast to rsync each night,
with revision history going back indefinitely, at the disk usage cost
of only the files that change (rare) and the directories (about 8MB
per machine).

Note, however, that the _vast_ majority of file entries on these file
systems (system_backups) are hard links.  Many inodes will have 20,
30, or more filename entries pointing at them (depending strictly on
how much history we choose to keep).

Keeping all that in mind, now understand that server "s" has
/vol/(0..14) installed in its disk subsystem, and (the important part)
each of those volumes has a slow mirror -- one rsync per day.  We do
not keep those mirrors mounted, but you can think of /vol/0 as having
a /vol/0_mirror partner that is rsynced once every twenty-four hours.

All of this works absolutely perfectly, with one exception: the daily
rsync of /vol/N to /vol/N_mirror for volumes that hold system_backups,
and the reason appears to be the --hard-links flag.  Rsync, which runs
completely locally for the /vol/N to /vol/N_mirror work, exhausts all
of the RAM and swap available to it on this machine (3GB) and sends
the machine into a maddening swap spiral.  The issue only exists for
/vol/N volumes where we have "system_backups" stored.
I wanted to share this circumstance with you because my reading of the
discussion on this topic, though encouraging, left me with the
impression that some might not be thinking about situations like this
one, where it is perfectly normal and desirable to have many hard
links to one inode, and hundreds of thousands of hard links in one
file system.

To give you an idea of the type of information one can glean from such
a backup process, here are a couple of examples.  Keep in mind that
files with a link count of 1 changed on the date indicated by the
directory:

root@s:/vol/6/system_backups/client1# find 20040102 -links 1 -type f|head -2
20040102/root/.bash_history
20040102/tmp/.803.e4a1

root@s:/vol/6/system_backups/client1# diff 20040102/root/.bash_history current/root/.bash_history
1d0
< lynx http://localhost:1081 --source | grep Rebuilding | head -1 | cut 10-
500a500
> ssh ljacobs@supermag

root@s:/vol/6/system_backups/client1# find 20040102 -links 1 -type f|cut -d/ -f1,2,3,4|sort |uniq -c
      1 20040102/SYMLINKS
      1 20040102/root/.bash_history
      1 20040102/tmp/.803.e4a1
      1 20040102/usr/local/BMS
     54 20040102/usr/local/WWW
     17 20040102/usr/local/etc
      1 20040102/usr/sbin/symlinks
     42 20040102/vol/1/bmshome
      1 20040102/vol/2/webalizer_working
     12 20040102/vol/3/home

You'll notice that the hard link counts in this file system are not
very high yet (only 8), yet it is _very_ intensive to have rsync try
to sync /vol/6/system_backups/client1 to
/vol/6_mirror/system_backups/client1 with the --hard-links flag set:

root@s:/vol/6/system_backups/client1# find 20040102 ! -links 1 -type f -printf '%n\t%i\t%s\t%d\t%h/%f\n'|head -50|tail -5
8       11323   10108   2       20040102/bin/mknod
8       11324   25108   2       20040102/bin/more
8       11325   60912   2       20040102/bin/mount
8       11326   10556   2       20040102/bin/mt-GNU
8       11327   33848   2       20040102/bin/mv

If there is anything that I did not articulate clearly, if you have
any follow-up questions, if you would like us to test some code for
you guys, or if there is anything else that you feel I can do to help,
please do not hesitate to ask.

Sincerely,

-- 
Lester Hightower
10East Corp.

p.s. 10East created and now supports the MARC system (marc.10east.com)
in various ways, including hosting it, though it is primarily
administered by Mr. Hank Leininger, a good friend and former employee.
I didn't see any mention of MARC on the rsync web site.  Please feel
free to use it.