thr3ads.net - Lustre devel - [Lustre-devel] [Bug 10929] scalable libext2fs based backup [Jun 2007]

If this information is useful, please help other people find it:
Share via:

braam@clusterfs.com

2007-Jun-03 22:20 UTC

[Lustre-devel] [Bug 10929] scalable libext2fs based backup

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=10929



Huangwei:

The parent directory of a file that was removed has an mtime newer than the RF,
all new files are in the list.

The real problem that we need to solve are the backup operations related to b)
and to a variant of a), as follows.  From e2scan we get a list of inodes (and
pathnames) for these modified inodes.

Let''s assume that on a backup file system we have the directory tree at
time T,
then at time T+1 we make a snapshot of the new directory tree and run e2scan on
that snapshot.  (If we leave the snapshot out, things will be more confusing,
but we will probably have to look at that.)  So we have two name spaces (the
tree with attributes, ignoring content), N and N+1.

By running rsync from N+1->N, rsync somehow manages to change N into N+1. 
Our
goal is to dramatically reduce the work of rsync, using the list of changed
inodes.

I do not know exactly what algorithm rsync uses to synchronize the namespaces
and it would be good if you summarized that for me.  Could you also summarize
how Unison does this?

Let me give a few examples of how we can help rsync (we may have to make small
changes to rsync):

1 - Rsync does not have to check which regular files have changed, we know that
in our list.
2 - We only need to synchronize directories (and their subtrees) that were
modified, we don''t need to scan the whole file system.  Note that this:
 2a - deals with case b), if a file was removed (even recursively) rsync can
descend down the modified directory, and find whatever else was removed.  But,
first we would check if there is in fact a file missing in N+1 that was present
in N, if not then syncing the directory is simpler, because things were added to
it or perhaps renamed inside it.  But we need to be careful to distinguish a
rename from an add / remove.
 2b - adding files is detected in this way, but we need to prevent rsync from
working recursively when it sees a new file, we know the new files.  I think
rsync cannot work recursively.
 2c - simple rename operations are easy to distinguish (at least often) from
removing and adding files: if you see a directory that is modified, search for
other directories modified with the same timestamp.  By comparing N and N+1 you
will find missing inodes and you can then search if the same inode was added to
another directory with the same modification time.
 2d - if this doesn''t work out then you have to rsync the entire
subtree,
probably, so that could lead to some bad cases.

Remember, first send me the algorithm that rsync uses, because there is no
obvious correct solution at all, and we need to understand what shortcuts it is
taking to make the right choices.  But here is a sketch of what might work,
(don''t take it too seriously, but it may be good):

A. sort the list of modified paths in a top-down manner
B. feed the list to rsync
C. modify rsync to have a "no recurse" option, when checking one
directory
D. when a modified directory is encountered by rsync (using our list), use the
database and N to check for modifications on directories that are rename
operations, move the files back if you find one.
E. when a modified directory is encountered check for removals (rsync can do
this nicely) that are not caused by renames, if you find them execute them
recursively, maybe using rm without using rsync.  But remember, some things may
have been renamed away, if you remove them from N you will have to recreate them
with a full rsync from N+1.
F. when a modified directory is encountered, check for additions and create the
files and directories underneath using the change list, but remember that some
directories may have been renamed into the new tree (not created in that new
subtree).
G. change rsync to only synchronize directories (or write a new tool to do this,
without network interfaces, ie. the new tool only syncs file trees in two
directory trees on one system.
H. When the namespaces are the same, then sync the modified files using a list
(now the pathnames should be the same in the two directory trees).

So the fixup tasks is:

 - determine rsync modifications, or another directory synchronization
operation, and how to use rsync to sychronize N+1 to N.  We want this to be much
more efficient than a normal rsync run, in most cases.
  - when the namespace is fixed up, use the file size information from the DB to
divide the change database into multiple smaller lists so that multiple backup
nodes can move the files in parallel.

We have very amibitous plans with these kind of tools:

 - we want e2scan to find inodes for different rules, for example, find files
that were not accessed for a while, or files modified between two time
intervals.

 - we will build a changelog to make the synchronization of N and N+1 much
easier in the future.

I look forward to working with you on the design.

braam@clusterfs.com

2007-Jun-03 22:35 UTC

head link

[Lustre-devel] [Bug 10929] scalable libext2fs based backup

Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=10929



Huangwei:

http://www.cis.upenn.edu/~bcpierce/papers/unisonspec.pdf

this is probably the right material to learn how this should be designed - lucky
you, you are a mathematician!

Lustre devel - Jun 2007 - [Bug 10929] scalable libext2fs based backup

[Lustre-devel] [Bug 10929] scalable libext2fs based backup

[Lustre-devel] [Bug 10929] scalable libext2fs based backup