I'd like btrfs to support full featured send and receive in the future. If
nobody is currently working on it, I'll grab the send/receive lock. Now that
I own the lock, I'm opening several discussions on this topic. If you are in
a hurry, it would be great if you could at least read and comment on the KEY
PROPERTIES section.

In short, the purpose of this mail is to
- acquire the send/receive lock
- find a name for a new feature
- define key properties and achieve consensus about them
- find a suitable streaming format

0) REMARK

The first discussion point is not for discussion. Proof reading the email
showed that I'm using the term "file system" both for implementations such
as ext3 and btrfs and for a file system image. It should be clear from the
context everywhere. I furthermore realized that the term "subvolume" is
omitted in favor of the term "snapshot". This is because I tend to think of
snapshots as being read-only (though I very much appreciate that they are
not). Just replace the term wherever you feel appropriate.

1) NAMING

Personally, I like "send" and "receive" as they convey the purpose and do
not leave much room to swap their meaning unintentionally. I'll call the
file system you use "send" on the source file system, and (drum roll) the
file system you use "receive" on the destination file system.

2) USE CASES

I see two related use cases:
- backup of a file system
- migration of a file system to another disk / machine / ...
3) KEY PROPERTIES

I wrote down key features that are must-haves for me, please add to the
list if you have anything on top:

- "send" must generate a stream that can either be "receive"d immediately
  or stored in a file for asynchronous "receive"
- streams must obviously be byte order safe
- a stream must contain a complete fs (full stream) or an incremental
  update to a file system
- a stream must not be restricted in size
- an incremental stream must contain the information which version it is
  based on
- "receive" of an incremental stream must check whether the base is the
  current state of the file system
  - YES => "receive"
  - NO, but it is a previous version
    => abort; should offer --force for rollback and "receive"
  - NO, it does not match any previous version => abort
- a stream must be taken from a consistent state of the file system
- the source file system must remain read-writable during a "send"
- the destination file system must at least remain readable during a
  "receive"
- btrfs as a destination file system should reflect all features of the
  source file system
- other destination file systems must be supported (although some features
  will not map to all file systems)

4) EXISTING SOLUTIONS

Currently, some people use rsync for the aforementioned tasks. It solves
some of the key properties quite well, others not. Depending on how you use
rsync, you might not sync snapshots very well. You might have problems with
reflinks or sparse files. And rsync knows nothing about when your latest
sync was. Some problems can be solved with the utility function "btrfs
find-new", but it does not provide any kind of consistency and has several
other drawbacks.

5) STREAMING FORMAT

An ideal streaming format can contain a complete file system or incremental
updates to a file system. It must transport meta information (such as
snapshots, reflinks, base of the file system, etc.)
and file information (such as holes, extended attributes, atime, ctime,
mtime, user, group, hardlinks, softlinks, device nodes, etc.). It should
have a feature to (optionally) store only parts of a modified file.

It would help if we could use tools already widely available to encapsulate
our backup streams. Imagine an existing streaming format that is flexible
enough to encode all the information needed for our key properties. I like
to put my backups on a different file system (like ext3 or zfs) on another
machine, hence I'd love to do so without needing btrfs or the btrfs tools
on that machine.

Currently, what I have in mind is a solution where "send --compatible"
produces a stream that can easily be unpacked by an unmodified version of a
standard tool (e.g. tar). This would likely include each file completely
that was modified since the reference point - it would never contain a file
partially. In contrast, "send --minimal" produces a stream that might need
a patched tool to be received and which contains parts of files to save
space. Meta information should be included in both streams.

I haven't decided yet whether I'd like compression to be an integral part
of the stream. I currently tend to dislike that, but to be honest, I have
no good reason to do so.

For now, I did some quick research and looked at cpio, tar, ustar, pax and
dar:

* cpio and tar have several drawbacks, I'll just mention that they can't go
  beyond 8GB in file size, making them unusable here.
* The successor of traditional tar, uniform standard tar (ustar), allows
  only 255 characters (at max) per file name in the archive and is not
  extendable.
* pax (portable archive exchange, do not confuse it with PaX) looks a lot
  better from a features perspective [1], and so does ...
* dar (disk archiver) [2].

5.A) Why it won't be dar

dar comes as a GPL program and a library (libdar), where the interesting
bits are encapsulated in the library.
No formal or informal specification of the file format exists; the library
is the interface. This can speed up implementation considerably, but it
sucks in terms of flexibility. dar has a lot of useful features, one of
which is built-in support for the creation of incremental archives, and
even decremental archives [6]. It has no built-in support for reflinks,
though.

A second no-go for libdar: it does all the work required to detect files
that changed between two backup runs, which is great for some file systems.
However, we want to make use of the fact that btrfs knows exactly what
changed.

5.B) Confusing pax

I found a utility named pax at OpenBSD [3], which does not implement the
pax format. It has support for several formats, but the newest of them is
ustar. This implementation is used at least by Debian (and derivatives),
Gentoo, RedHat and MacOS. OpenIndiana has a pax utility that implements the
pax format, for which I could not find source code. I found a Makefile
which refers to pax as "$(CLOSED)/cmd/pax" [4], which makes me think it's
not open source. I was already about to drop pax from my consideration
completely when I accidentally realized that GNU tar has a --format=pax
option (and has had it since 2004) [5]. Users of Solaris, OpenIndiana or
similar will have to use their pax utility, though, because their tar does
not support the pax format. Kind of confusing...

5.C) The good in pax

The good thing about the pax format is that it is extendable at will. You
can use custom header records with key=value pairs of any length. There are
predefined keys, and application specific ones can be added. pax archives
can be generated compatible with ustar, which means such an archive could
be unpacked almost everywhere. The general concept of pax is to use the
pax-specific headers in a way that they will be ignored by a tar utility
that understands ustar but not pax.
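As a quick illustration of these key=value records, Python's tarfile module
can write pax archives with custom per-file extended headers. The BTRFS.*
key names below are purely made up for the example, not a proposed naming
scheme:

```python
import io
import tarfile

# Build a small pax archive in memory with custom key=value header records
# attached to one member. The BTRFS.* keys are illustrative only.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    data = b"hello from a snapshot"
    info = tarfile.TarInfo(name="home/file.txt")
    info.size = len(data)
    info.pax_headers = {
        "BTRFS.reflink.source": "home/other-file.txt",
        "BTRFS.snapshot": "snap1",
    }
    tar.addfile(info, io.BytesIO(data))

# A pax-unaware ustar tool would simply skip the extended records; a
# pax-aware reader gets them back as a dictionary.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("home/file.txt")
    print(member.pax_headers["BTRFS.snapshot"])  # -> snap1
```

Since the keys travel as ordinary extended header records, an old tar still
extracts the file contents untouched, which is exactly the compatibility
property described above.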
5.D) How pax could be used

(Knowledge of the format is required for this paragraph, see [1].) This is
more like brainstorming than something figured out carefully: btrfs "send"
could generate a stream beginning with a global pax header (typeflag=g) for
the name of the current snapshot, then all the files from this snapshot
with custom pax headers (typeflag=x) as needed, to encode reflinks, for
example. After the next global pax header, we're in the next snapshot. This
can either contain any changed file completely (--compatible) or the diffs
for the file along with a custom header telling where the diffs go
(--minimal). The --compatible version could be extracted by any tar off the
shelf (provided file name length and such fit). The result would be one
file containing multiple snapshots of your file system. Extraction of a
single file would be possible, though listing the files in the archive
requires reading the whole file (with a lot of large seeks over the data
portions) as there is no central directory. As an alternative, we could
also start a new file for every snapshot we're about to "send".

We can use more of the custom headers to encode reflinks in a way that they
will become either hard- or softlinks when extracted with a standard tar.
We can add inode numbers for each entry if we feel those should be
replicated to a destination btrfs, and much more.

5.E) So it will be pax - will it?

To me it looks like pax is the most suitable, flexible and available format
to use. Unless somebody has serious objections or thoughts for a better
choice.

6) FINAL REMARK

I hope this longish introduction creates a lively discussion about the
advertised features - or at least silent acknowledgement and endorsement.
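The snapshot-per-segment layout sketched in 5.D can be approximated with
stock tools: write each snapshot as its own pax segment whose global header
(typeflag=g) carries the snapshot name. A sketch using Python's tarfile;
the BTRFS.snapshot key is invented for the example, and note that a plain
tar stops at the first end-of-archive marker, so walking all segments needs
something like GNU tar's --ignore-zeros:

```python
import io
import tarfile

def add_snapshot(stream, snap_name, files):
    """Append one snapshot segment: a pax global header (typeflag=g)
    carrying the snapshot name, followed by that snapshot's files."""
    with tarfile.open(fileobj=stream, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"BTRFS.snapshot": snap_name}) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

stream = io.BytesIO()
add_snapshot(stream, "snap1", {"home/a.txt": b"version 1"})
add_snapshot(stream, "send-test", {"home/a.txt": b"version 2"})

# Reading the first segment back; its global header names the snapshot.
stream.seek(0)
with tarfile.open(fileobj=stream, mode="r") as tar:
    print(tar.pax_headers["BTRFS.snapshot"])  # -> snap1
```

This is only a feasibility sketch; a real --minimal stream would carry file
diffs plus placement headers instead of whole file contents.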
-Jan

[1] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
[2] http://dar.linux.free.fr/doc/man/dar.html
[3] http://www.openbsd.org/cgi-bin/cvsweb/src/bin/pax/
[4] http://hg.openindiana.org/illumos-gate/raw-file/d3807abc6720/usr/src/cmd/Makefile
[5] http://git.savannah.gnu.org/cgit/tar.git/commit/?id=ba08e339a6e05e2a0d1432efdadd67ff2c63f834
[6] http://dar.linux.free.fr/doc/usage_notes.html#Decremental_Backup

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jan,

On 08/01/2011 02:22 PM, Jan Schmidt wrote:
> I furthermore realized that the term "subvolume" is omitted in favor
> of the term "snapshot". This is because I tend to think of snapshots
> being read-only (though I very much appreciate they are not). Just
> replace the term wherever you feel appropriate.

I think that we have to cope with both terms. A snapshot is a subvolume
with an ancestor. This is important if we want to be able to "transfer" a
snapshot between two filesystems as subvolume + delta.

> 3) KEY PROPERTIES
> [...]
> - btrfs as a destination file system should reflect all features of
>   the source file system

I think that we should define what "all features" means.

1) If we are interested in transporting only the file
type/contents/timestamps/acls/owners/permissions, that could be obtained
with a combination of "find-new" (with some extensions [1]) and a
user-space tool. No extension to btrfs is needed.

1.1) As above, plus preserving the inode number.
2) If we also want to preserve the COW relations (between
snapshots/subvolumes and files), I think that we need some help from the
kernel side to be able to inject this information into the destination
btrfs filesystem. Moreover, we need to cope with all the possible errors
due to the fact that the snapshots/subvolumes are out of sync between the
source fs and the destination fs: what if we want to transport a snapshot
to another filesystem where the snapshotted subvolume (previously
successfully transported) was removed or changed? How can we check whether
a snapshot/subvolume was changed?

> - other destination file systems must be supported (although some
>   features will not map to all file systems)

BR
G.Baroncelli

[1] http://comments.gmane.org/gmane.comp.file-systems.btrfs/8201
On 01.08.2011 20:51, Goffredo Baroncelli wrote:
> On 08/01/2011 02:22 PM, Jan Schmidt wrote:
>> I furthermore realized that the term "subvolume" is omitted in favor
>> of the term "snapshot". This is because I tend to think of snapshots
>> being read-only (though I very much appreciate they are not). Just
>> replace the term wherever you feel appropriate.
>
> I think that we have to cope to both the terms. A snapshot is a
> subvolume with an ancestor. This is important if we want to be able to
> "transfer" between two filesystem a snapshot as subvolume + delta.

To be precise, each snapshot is again a subvolume. On the other hand, we
can call every subvolume but the root subvolume a (writable) snapshot. I'd
like to continue discussing the real points now :-)

>> 3) KEY PROPERTIES
>> [...]
>> - btrfs as a destination file system should reflect all features of
>>   the source file system
>
> I think that we should define what means "all features".
> 1) If we are interested to transport only the file
> type/contents/timestamps/acls/owners/permissions, that could be obtained
> with a combination of "find-new" (with some extensions [1]) and an
> user-space tool. No extension to BTRFS are needed.

Right. This is not what I'm after.

> 1.1) as above plus preserve the inode number.
>
> 2) If we want to have also to conserver the COW relation (between
> snapshots/subvolumes and files), I think that we need some help from
> the kernel side to be able to injecting these information in the
> destination btrfs filesystem.

I'd rather gather that information (possibly with help from the kernel)
when generating the stream. I realized that I should have used some
examples in my original mail. What I have in mind (as briefly described in
my section 5.D) is the following:

To not make it overly complex, let's assume snapshots are read-only for a
moment.

We have a subvolume /home with one snapshot /home/snap1. When we want to
send the whole subvolume, we could do the following (if you read section 5,
assume --minimal was the default):

btrfs subvol snapshot /home send-test
btrfs send /home send-test > /tmp/stream

Algorithm: First pick all files from the oldest snapshot (snap1) and put
them into the stream. Then there is a block of meta information saying
"snap1 complete". Next in the stream are the diffs between snap1 and
send-test (reflecting the current state of /home), again with an end
marker.

Let's assume we have a freshly created empty /backup subvolume. To receive
our changes, we'd call the following:

btrfs receive /backup < /tmp/stream

Algorithm: Use data from the stream to create files in /backup/home. Once
we reach the meta information mentioned above, we create a snapshot of
/backup/home in /backup/home/snap1.
Then we go on receiving the diffs in the stream to /backup/home and create
a snapshot send-test.

> Moreover we need to cope all the possible errors due to the fact that
> the snapshot/subvolume are out-sync between the source fs and the
> destination fs: what about if we want to transport a snapshot to an
> another filesystem where the snapshotted subvolume (previously
> successful transported) was removed or changed ? How we can check if a
> snapshot/subvolume was changed ?

Continuing the above example, let's assume we have a /backup subvolume that
did receive a /backup/home earlier. Receiving a full stream (as generated
above) would fail, then. You could remove the subvolume and receive the new
full stream, if you like.

Now let's assume we have /home with snap1, send-test, snap2 and snap3. On
the backup side, we have /backup/home with snap1 and send-test. We make
another snapshot and generate an incremental stream on the sender:

btrfs subvol snapshot /home incr-test
btrfs send /home send-test incr-test

The stream contains the information that it's based on the snapshot
send-test (maybe by uuid rather than name). When we receive the stream, we
first check whether the destination file system /backup/home has that
send-test snapshot.

a) If it does and there are no diffs between /backup/home and
   /backup/home/send-test, no one modified the destination => receive.
b) If it does and there are diffs between the two, receive should fail
   unless --force is specified, which would eliminate all the changes made
   in /backup/home so that it's rolled back to send-test.
c) If it does not, receive just fails.

Although the current algorithm breaks when we release the read-only
precondition, I'm certain we can turn these fuzzy ideas into an actually
working solution.

Thanks,
-Jan
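The three receive-side cases a) to c) above reduce to a small decision
function. A sketch with all names hypothetical; the real tool would query
btrfs for snapshot state, which plain values stand in for here:

```python
import uuid

def check_base(stream_base_uuid, dest_snapshots, dest_modified, force=False):
    """Return the action an incremental "receive" should take.

    stream_base_uuid -- uuid the stream says it is based on
    dest_snapshots   -- uuids of snapshots present on the destination
    dest_modified    -- True if the destination diverged from the base
    """
    if stream_base_uuid not in dest_snapshots:
        return "abort"                 # case c): base snapshot missing
    if not dest_modified:
        return "receive"               # case a): clean base, apply diffs
    if force:
        return "rollback-and-receive"  # case b) with --force
    return "abort"                     # case b): diverged, no --force

base = uuid.uuid4()
print(check_base(base, {base}, dest_modified=False))  # -> receive
print(check_base(base, {base}, dest_modified=True))   # -> abort
print(check_base(base, set(), dest_modified=False))   # -> abort
```

Identifying the base by uuid rather than name, as suggested above, is what
makes the first membership test meaningful across renames.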
Excerpts from Jan Schmidt's message of 2011-08-02 05:43:39 -0400:
> On 01.08.2011 20:51, Goffredo Baroncelli wrote:
> [...]
>
> To be precise, each snapshot is again a subvolume. On the other hand, we
> can call every subvolume but the root subvolume a (writable) snapshot.
> I'd like to continue discussing the real points now :-)

First: awesome! I can't wait to have this feature.

I think you have a very sound list of requirements here, but I'll add one
more. If there are metadata corruptions on the sender, we must not transmit
them over to the receiver. In order for the send/receive command to be a
backup, it needs to have a first-do-no-harm rule for the receiving end.

In terms of formats, I came to similar conclusions a while ago about cpio,
tar and dar. I haven't looked in detail at pax but don't have any strong
feelings against it.

But I'll toss in an alternative: adapt the git pack files a little and use
them as the format. There are a few reasons for this:

Git has a very strong developer community and is already being hammered
into use as a backup application. You'll find a lot of interested people to
help out.

Git separates the contents from the metadata (names). This makes it
naturally suited to describing snapshots and other features. The big
exception is large file handling, but you could extend the format to
describe filename,offset,len->sha instead of just filename->sha.
This doesn't mean I'll reject a pax setup, it's just an alternative to
think about. We should have the actual data transmission format pretty well
abstracted away so we can experiment with alternatives.

In terms of transmitting snapshot details, I always assumed we would need a
snapshot tool that added extra metadata about parent relationships on the
snapshots. I didn't want to enforce this in the metadata on disk, but I
have no problems with saying the send/receive tool requires extra metadata
to tell us about parents.

-chris
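Chris's filename,offset,len->sha idea amounts to content-addressing fixed
ranges of a file. A toy sketch of that mapping; the chunking scheme and
names are invented here and are not git's actual pack layout:

```python
import hashlib

CHUNK = 4  # tiny chunk size so the example stays readable; real packs
           # would use far larger chunks

def index_file(name, data, chunk=CHUNK):
    """Map (name, offset, len) -> sha for fixed-size chunks of a file."""
    table = {}
    for off in range(0, len(data), chunk):
        piece = data[off:off + chunk]
        table[(name, off, len(piece))] = hashlib.sha1(piece).hexdigest()
    return table

v1 = index_file("home/a.txt", b"aaaabbbbcccc")
v2 = index_file("home/a.txt", b"aaaaXXXXcccc")

# Only chunks whose sha differs need to travel in an incremental stream.
changed = [k for k in v2 if v2[k] != v1.get(k)]
print(changed)  # -> [('home/a.txt', 4, 4)]
```

The point of the extension is visible even at toy scale: a large file that
changes in one place contributes only the changed ranges to the stream.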
On 02.08.2011 17:21, Chris Mason wrote:
> First: awesome! I can't wait to have this feature.
>
> I think you have a very sound list of requirements here, but I'll add
> one more. If there are metadata corruptions on the sender, we must not
> transmit them over to the receiver. In order for the send/receive
> command to be a backup, it needs to have a first-do-no-harm rule to the
> receiving end.

Full ack. I'm planning to fetch as much information from user space as
possible. Anything that needs kernel support will have consistency checks
added. No guessing will be made. If it doesn't look safe and sound, that
file system will not be sendable.

Furthermore, receiving should not need kernel support at all (except for an
optional interface to create a file with a certain inode, we'll see). Thus,
replicating metadata corruptions should be very unlikely.

One more thing to add: we have to make sure our stream doesn't get
corrupted. So if the file format we're choosing does not include it, we
should keep in mind to add something ourselves.

> In terms of formats, I came to similar conclusions a while ago about
> cpio, tar and dar. I haven't looked in detail at pax but don't have any
> strong feelings against it.
>
> But, I'll toss in an alternative. Adapt the git pack files a little and
> use them as the format. There are a few reasons for this:
>
> Git has a very strong developer community and is already being
> hammered into use as a backup application. You'll find a lot of
> interested people to help out.
>
> Git separates the contents from the metadata (names). This makes it
> naturally suited to describing snapshots and other features. The big
> exception is in large file handling, but you could extend the format to
> describe filename,offset,len->sha instead of just filename->sha.

That sounds interesting. I haven't thought of git until now. It will lack
the appealing feature to unpack without any special tools or a modified git
client, I think.
But I believe there are things that would get easier compared to pax. I'll
try to make a plan for how it could be implemented with git, so that we
have something we can compare.

> This doesn't mean I'll reject a pax setup, it's just an alternative to
> think about. We should have the actual data transmission format pretty
> well abstracted away so we can experiment with alternatives.

Yes, that would be nice. I'll keep that in mind. If both have their
advantages, we might end up having one format in the first implementation
and another one added later once the rest is working.

> In terms of transmitting snapshot details, I always assumed we would
> need a snapshot tool that added extra metadata about parent
> relationships on the snapshots. I didn't want to enforce this in the
> metadata on disk, but I have no problems with saying the send/receive
> tool requires extra metadata to tell us about parents.

Oh, right. That's something that might not only need kernel support for
"send" to determine a parent, but also a new key representing a snapshot's
parent relationship information.

I'll think that over; currently I tend towards adding these relationship
keys around btrfs_ioctl_snap_create soon, so we have at least some file
systems in the wild that are ready for send and receive once it's done.

Thanks,
-Jan
Hi all,

[...]
> Furthermore, receiving should not need kernel support at all (except for
> an optional interface to create a file with a certain inode, we'll see).
> Thus, replicating metadata corruptions should be very unlikely.

I think that for receiving we can have three levels, which may represent
three stages of development:

1) We store the information in a pax|tar|git|... file format. It is then
the user who can expand this file when needed. I think that in the backup
case this is more useful than having a full filesystem. No help from the
kernel is required.

2) We expand the stream into files, so the final result would be a
filesystem.

2.1) As above, but preserving the inode numbers (a small amount of kernel
help is required; it may also be file-system independent).

2.2) As above, but preserving the COW properties: if we update an already
snapshotted file, btrfs stores the original one and the modified data. The
same would happen in the destination filesystem: if the previous file
snapshot exists, the file is COW-ed in the filesystem, updating only the
"new data". (Help from the kernel side is needed; I don't know if it is
possible to adapt this strategy to filesystems other than btrfs.)

3) Extracting the btree structure from the source filesystem and injecting
this structure into the destination btrfs filesystem. I think that this has
the best performance, both in terms of CPU power and in bandwidth. Full
kernel support is required.

> One more thing to add: We have to make sure our stream doesn't get
> corrupted. So if the file format we're choosing does not include it, we
> should keep in mind to add something ourselves.

The best would be using the btrfs checksums.

> > In terms of formats, I came to similar conclusions a while ago about
> > cpio, tar and dar. I haven't looked in detail at pax but don't have any
> > strong feelings against it.
> >
> > But, I'll toss in an alternative. Adapt the git pack files a little and
> > use them as the format.
> > There are a few reasons for this:
> >
> > Git has a very strong developer community and is already being
> > hammered into use as a backup application. You'll find a lot of
> > interested people to help out.
> >
> > Git separates the contents from the metadata (names). This makes it
> > naturally suited to describing snapshots and other features. The big
> > exception is in large file handling, but you could extend the format to
> > describe filename,offset,len->sha instead of just filename->sha.
>
> That sounds interesting. I haven't thought of git until now. It will
> lack the appealing feature to unpack without any special tools or a
> modified git client, I think. But I believe there are things that would
> get easier compared to pax.
>
> I'll try to make a plan how it could be implemented with git, so that we
> have something we can compare.

I suggest giving a look at the fast-import/export format, which is the "de
facto" standard for sharing information between the new version control
systems.

> > This doesn't mean I'll reject a pax setup, it's just an alternative to
> > think about. We should have the actual data transmission format pretty
> > well abstracted away so we can experiment with alternatives.
>
> Yes, that would be nice. I'll keep that in mind. If both have their
> advantages, we might end up having one format in the first
> implementation and another one added later once the rest is working.
>
> > In terms of transmitting snapshot details, I always assumed we would
> > need a snapshot tool that added extra metadata about parent
> > relationships on the snapshots. I didn't want to enforce this in the
> > metadata on disk, but I have no problems with saying the send/receive
> > tool requires extra metadata to tell us about parents.
>
> Oh, right.
> That's something that might not only need kernel support for
> "send" to determine a parent, but also a new key representing a
> snapshot's parent relationship information.

I think that this information already exists. In fact, every snapshot has a
reference to the original data, on the basis of which it is possible to
obtain the snapshot's parent relationship information.

However, we need to be sure that when we send the "delta" between two
snapshots to the receiver side, the receiver side:
1) has a copy of the previous snapshot
2) has that copy in sync with the original one

I think (please Chris confirm that) that we can check this with the
subvolume id and the generation number of every snapshot, which should be
unique.

> I'll think that over, currently I tend to adding these relationship keys
> around btrfs_ioctl_snap_create soon, so we have at least some file
> systems in the wild that are ready for send and receive once it's done.
>
> Thanks,
> -Jan

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
On 02.08.2011 19:42, Goffredo Baroncelli wrote:
>> Furthermore, receiving should not need kernel support at all (except for
>> an optional interface to create a file with a certain inode, we'll see).
>> Thus, replicating metadata corruptions should be very unlikely.
>
> I think that for receiving we can have three level, which may represent
> three level in the develop:
>
> 1) we store the information as a pax|tar|git|... file format. Then is
> the user that can expand this file when needed. I think that in case of
> backup this is more useful than having a full filesystem. No help from
> kernel required.
>
> 2) we expand the stream in files; so the final results would be a
> filesystem.

How would you test your stream from 1) if you can't unpack it?

> 2.1) as above but preserving the inode number (small help from kernel
> required, may be file-system independent also)

I would skip that and add it as an extension, later.

> 2.2) as above but preserving the COW properties: if we update an already
> snapshotted file, btrfs store the original one and the modified data.
> The same would be in the destination filesystem: if exists the previous
> file snapshot, in the filesystem is COW-ed the file updating only the
> "new data". (help from kernel side. I don't know if it is possible to
> adapt this strategy for other filesystem than BTRFS)

Again, I'd rather gather that information (possibly with help from the
kernel) when generating the stream. This is what I answered and tried to
explain by example in my mail yesterday. Please tell me which part was
unclear and I'll try to explain better.

With the algorithm outlined yesterday, you don't need any kernel support
when receiving, so it should be adaptable by any filesystem that supports
snapshots.

> 3) extracting from the source filesystem the btree structure, and
> injecting in the btrfs filesystem this structure. I think that this has
> the best performance, both in terms of CPU-power and in bandwidth.
> Full kernel support required.

This is like a diff-aware dd, or did I get you wrong? If it is: do you
really think we need it? What for?

>> One more thing to add: We have to make sure our stream doesn't get
>> corrupted. So if the file format we're choosing does not include it, we
>> should keep in mind to add something ourselves.
>
> The best would be using the BTRFS checksum.

Sounds interesting. How would you add a btrfs checksum to a stream file (no
matter what format we'll use)? And how would you verify it?

>> I'll try to make a plan how it could be implemented with git, so that we
>> have something we can compare.
>
> I suggest to give a look to the fast-import/export format, which is "de
> facto" standard about sharing information between the new CVS system.

Thanks for the hint, I will include that in my considerations.

>>> In terms of transmitting snapshot details, I always assumed we would
>>> need a snapshot tool that added extra metadata about parent
>>> relationships on the snapshots. I didn't want to enforce this in the
>>> metadata on disk, but I have no problems with saying the send/receive
>>> tool requires extra metadata to tell us about parents.
>>
>> Oh, right. That's something that might not only need kernel support for
>> "send" to determine a parent, but also a new key representing a
>> snapshot's parent relationship information.
>
> I think that this information already exists. In fact every snapshot has
> a reference to the original data, on the basis of which it is possible
> to obtain the snapshot's parent relationship information.

How can that be done?
I don''t see such a link.> However we need to be sure that when we send the "delta" between two snapshot > to the receiver side, the receiver side: > 1) has a copy of the previous snapshot > 2) this copy is in sync to the original one > > I think (please Chris confirm that) that we can check this with the subvolume > id and the generation-no of every snapshot, which should be unique.uuid + generation was my suggestion as well, should be unique, yes. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
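The uuid + generation check agreed on above could be sketched as follows.
This is only a sketch of the decision logic; `SnapshotId` and
`check_base` are hypothetical names, not actual btrfs interfaces:

```python
from dataclasses import dataclass

@dataclass
class SnapshotId:
    uuid: str        # uuid of the snapshot the stream is based on
    generation: int  # generation-no recorded at "send" time

def check_base(stream_base: SnapshotId, local: SnapshotId) -> str:
    """Decide whether an incremental stream may be received.

    Returns "receive" when the local snapshot matches the stream's base
    exactly, "rollback-needed" when only the uuid matches (the
    destination changed since, so --force would be required), and
    "abort" when the base snapshot is not present at all.
    """
    if stream_base.uuid != local.uuid:
        return "abort"
    if stream_base.generation != local.generation:
        return "rollback-needed"
    return "receive"
```

The generation number is what distinguishes "same snapshot" from "same
snapshot, but modified since": the uuid alone would not catch a
destination that was written to after the last receive.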
On Tuesday, 02 August, 2011 11:43:39 you wrote:
> On 01.08.2011 20:51, Goffredo Baroncelli wrote:
[...]
> > 1) If we are interested to transport only the file
> > type/contents/timestamps/acls/owners/permissions, that could be
> > obtained with a combination of "find-new" (with some extensions [1])
> > and a user-space tool. No extension to btrfs is needed.
>
> Right. This is not what I'm after.
>
> > 1.1) as above plus preserve the inode number.
> >
> > 2) If we want also to conserve the COW relation (between
> > snapshots/subvolumes and files), I think that we need some help from
> > the kernel side to be able to inject this information into the
> > destination btrfs filesystem.
>
> I'd rather gather that information (possibly with help from the
> kernel) when generating the stream. I realized that I should have used
> some examples in my original mail. What I have in mind (as briefly
> described in my section 5.D) is the following:
>
> To not make it overly complex, let's assume snapshots are read-only for
> a moment.
>
> We have a subvolume /home with one snapshot /home/snap1. When we want
> to send the whole subvolume we could do the following (if you read
> section 5, assume --minimal was the default):
>
> btrfs subvol snapshot /home send-test
> btrfs send /home send-test > /tmp/stream
>
> Algorithm: First pick all files from the oldest snapshot (snap1) and
> put them into the stream. Then, there is a block of meta information
> saying "snap1 complete". Next in the stream are the diffs between
> snap1 and send-test (reflecting the current state of /home), again
> with an end-marker.
>
> Let's assume we have a freshly created empty /backup subvolume. To
> receive our changes, we'd call the following:
>
> btrfs receive /backup < /tmp/stream
>
> Algorithm: Use data from the stream to create files in /backup/home.
> Once we reach the meta information mentioned above, we create a
> snapshot of /backup/home in /backup/home/snap1. Then we go on
> receiving the diffs in the stream to /backup/home and create a
> snapshot send-test.

Basically, we need a tool to evaluate the difference (in terms of
metadata and file content) between two snapshots which have an ancestor
in common. This is a typical problem for VCS software. On the basis of
that we can generate a stream, which may be a diff between two snapshots
or a full subvolume (no diff).

On the receiver side, in case of a diff it is necessary to check whether
both sides have the "old" snapshot, and whether these snapshots are
aligned. I think that by tracking the subvolume-id and the generation-no
we can check this easily.

> > Moreover, we need to cope with all the possible errors due to the
> > fact that the snapshots/subvolumes are out of sync between the source
> > fs and the destination fs: what about if we want to transport a
> > snapshot to another filesystem where the snapshotted subvolume
> > (previously successfully transported) was removed or changed? How can
> > we check if a snapshot/subvolume was changed?
>
> Continuing the above example, let's assume we have a /backup subvolume
> that did receive a /backup/home earlier. Receiving a full stream (as
> generated above) would fail, then. You could remove the subvolume and
> receive the new full stream, if you like.
>
> Now let's assume we have /home with snap1, send-test, snap2 and snap3.
> On the backup side, we have /backup/home with snap1 and send-test.
>
> We make another snapshot and generate an incremental stream on the
> sender:
>
> btrfs subvol snapshot /home incr-test
> btrfs send /home send-test incr-test
>
> The stream contains the information that it's based on the snapshot
> send-test (maybe by uuid rather than name). When we receive the
> stream, we first check if the destination filesystem /backup/home has
> that send-test snapshot.
>
> a) If it does and there are no diffs between /backup/home and
>    /backup/home/send-test, no one modified the destination => receive
>
> b) If it does and there are diffs between the two, receive should fail
>    unless --force is specified, which would eliminate all the changes
>    made in /backup/home so that it's rolled back to send-test.
>
> c) If it does not, receive just fails.
>
> Although the current algorithm breaks when we release the read-only
> precondition, I'm certain we can turn these fuzzy ideas into an
> actually working solution.
>
> Thanks,
> -Jan

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
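Jan's three receive cases a), b) and c) above can be sketched as a small
decision function. This is a sketch only; the boolean inputs stand in
for whatever mechanism (e.g. the uuid + generation comparison discussed
in this thread) ends up answering those questions:

```python
def decide_receive(dest_has_base: bool, dest_unmodified: bool,
                   force: bool) -> str:
    """Map the three receive cases to an action.

    dest_has_base:   the destination has the snapshot the stream is
                     based on (case c applies when False)
    dest_unmodified: no diffs between the destination subvolume and that
                     base snapshot (distinguishes case a from case b)
    force:           whether --force was given
    """
    if not dest_has_base:
        return "fail"                  # c) base snapshot missing
    if dest_unmodified:
        return "receive"               # a) destination untouched
    if force:
        return "rollback-and-receive"  # b) --force: roll back first
    return "fail"                      # b) without --force
```

Note that "rollback-and-receive" is destructive: it discards every
change made in the destination subvolume since the base snapshot, which
is why it hides behind --force.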
On Wednesday, 03 August, 2011 17:04:40 Jan Schmidt wrote:
> On 02.08.2011 19:42, Goffredo Baroncelli wrote:
> >> Furthermore, receiving should not need kernel support at all (except
> >> for an optional interface to create a file with a certain inode,
> >> we'll see). Thus, replicating metadata corruptions should be very
> >> unlikely.
> >
> > I think that for receiving we can have three levels, which may
> > represent three levels in the development:
> >
> > 1) we store the information in a pax|tar|git|... file format. Then it
> > is the user who can expand this file when needed. I think that in
> > case of backup this is more useful than having a full filesystem. No
> > help from kernel required.
> >
> > 2) we expand the stream into files; so the final result would be a
> > filesystem.
>
> How would you test your stream from 1) if you can't unpack it?

If we are able to store the information in a standard format (like tar),
we are able to unpack it when we need to. The difference between points
1 and 2 is that for point 1 it is not required to develop the
"extraction side". This doesn't mean that we *must not* develop it; it
means that we *may* delay the development of the extraction side and
still have something really useful.

Point 2) requires developing an extraction tool (the "btrfs receive"
command), which would be able to handle further metadata like the
"parent relationship" which you refer to below. I think that the
extraction would be like:

sender>   hello "receiver", which snapshots do you have?
receiver> hello "sender", I have snapshots A, B, D
sender>   ok, I have the snapshots B and C, so I will send you the delta
          from snapshot B, which is the latest in common.
sender>   send data .....

This is far away from a simple tar (or pax or git...) file format.

> > 2.1) as above but preserving the inode number (small help from
> > kernel required, may be filesystem independent also)
>
> I would skip that and add it as an extension, later.
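The sender/receiver dialogue above amounts to finding the latest
snapshot both sides have in common. A minimal sketch of that step,
assuming snapshots are identified by some unique id (e.g. uuid) and the
sender's list is ordered oldest to newest:

```python
from typing import List, Optional

def latest_common_snapshot(sender_has: List[str],
                           receiver_has: List[str]) -> Optional[str]:
    """Pick the base for an incremental send: the newest snapshot
    present on both sides, or None if a full stream is needed."""
    common = set(sender_has) & set(receiver_has)
    # sender_has is ordered oldest to newest; scan from the newest end
    for snap in reversed(sender_has):
        if snap in common:
            return snap
    return None
```

In the dialogue above the sender has B and C and the receiver has A, B
and D, so B is chosen as the base; if nothing is in common, the sender
falls back to a full stream.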
> > 2.2) as above but preserving the COW properties: if we update an
> > already snapshotted file, btrfs stores the original one and the
> > modified data. The same would be in the destination filesystem: if
> > the previous file snapshot exists, the file is COW-ed in the
> > filesystem, updating only the "new data". (help from kernel side
> > required. I don't know if it is possible to adapt this strategy for
> > filesystems other than btrfs)
>
> Again, I'd rather gather that information (possibly with help from the
> kernel) when generating the stream. This is what I answered and tried
> to explain by example in my mail yesterday. Please tell me which part
> was unclear and I'll try to explain better.

I am talking *only* about the receiving. How we gather this information
is not (for the moment) under discussion.

> With the algorithm outlined yesterday, you don't need any kernel
> support when receiving, so it should be adaptable by any filesystem
> that supports snapshots.

Right.

> > 3) extracting the btree structure from the source filesystem, and
> > injecting this structure into the btrfs filesystem. I think that
> > this has the best performance, both in terms of CPU power and in
> > bandwidth. Full kernel support required.
>
> This is like a diff-aware dd, or did I get you wrong? If it is: do you
> really think we need it? What for?

I cited it only as a "brainstorming" approach. The only gain is its
space efficiency.

> >> One more thing to add: We have to make sure our stream doesn't get
> >> corrupted. So if the file format we're choosing does not include
> >> it, we should keep in mind to add something ourselves.
> >
> > The best would be using the btrfs checksum.
>
> Sounds interesting. How would you add a btrfs checksum to a stream
> file (no matter what format we'll use)? And how would you verify it?

I think that btrfs already stores a checksum on a per-block basis. When
we send the stream, we could get this information from btrfs and send it
along. This is only to avoid recalculating a checksum.

Pay attention that I think btrfs stores the checksum only for the data,
and not for the full files. What I mean is that if a file is COW-ed,
btrfs stores the original data and only the updated data, then stores
the checksum for the original file and the checksum for the updated
data. It doesn't store a checksum for the full updated file. This means
that if we try to rebuild the file by applying a delta, we don't have a
checksum of the full file to compare against.

> >> I'll try to make a plan how it could be implemented with git, so
> >> that we have something we can compare.
> >
> > I suggest giving a look at the fast-import/export format, which is
> > the "de facto" standard for sharing information between the new VCS
> > systems.
>
> Thanks for the hint, I will include that in my considerations.
>
> >>> In terms of transmitting snapshot details, I always assumed we
> >>> would need a snapshot tool that added extra metadata about parent
> >>> relationships on the snapshots. I didn't want to enforce this in
> >>> the metadata on disk, but I have no problems with saying the
> >>> send/receive tool requires extra metadata to tell us about
> >>> parents.
> >>
> >> Oh, right. That's something that might not only need kernel support
> >> for "send" to determine a parent, but also a new key representing a
> >> snapshot's parent relationship information.
> >
> > I think that this information already exists. In fact, every
> > snapshot has a reference to the original data, on the basis of which
> > it is possible to obtain the snapshot's parent relationship
> > information.
>
> How can that be done? I don't see such a link.

Give a look at
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Backref_walking_utilities
but I have to admit that the real state is different from what I
(wrongly) understood of the btrfs internals.

> > However, we need to be sure that when we send the "delta" between
> > two snapshots to the receiver side, the receiver side:
> > 1) has a copy of the previous snapshot
> > 2) this copy is in sync with the original one
> >
> > I think (please Chris confirm that) that we can check this with the
> > subvolume id and the generation-no of every snapshot, which should
> > be unique.
>
> uuid + generation was my suggestion as well, should be unique, yes.
>
> -Jan
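Independent of whether btrfs's internal per-block checksums can be
reused, the stream itself could carry a checksum per record, so
"receive" can detect corruption regardless of the container format
chosen. A minimal sketch; the record framing (length + CRC-32 + payload)
is an illustrative assumption, not a settled format:

```python
import struct
import zlib

def pack_record(payload: bytes) -> bytes:
    """Frame a stream record: 4-byte length, 4-byte CRC-32, payload."""
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload

def unpack_record(record: bytes) -> bytes:
    """Verify and strip the framing; raise on corruption."""
    length, crc = struct.unpack(">II", record[:8])
    payload = record[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError("stream record corrupted")
    return payload
```

A per-record checksum also localizes the damage: a single flipped bit
invalidates one record, not the whole stream.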
On 02.08.2011 18:01, Jan Schmidt wrote:
> On 02.08.2011 17:21, Chris Mason wrote:
>> But, I'll toss in an alternative. Adapt the git pack files a little
>> and use them as the format. There are a few reasons for this:
>>
>> Git has a very strong developer community and is already being
>> hammered into use as a backup application. You'll find a lot of
>> interested people to help out.
>>
>> Git separates the contents from the metadata (names). This makes it
>> naturally suited to describing snapshots and other features. The big
>> exception is in large file handling, but you could extend the format
>> to describe filename,offset,len->sha instead of just filename->sha.
>
> That sounds interesting. I haven't thought of git until now. It will
> lack the appealing feature to unpack without any special tools or a
> modified git client, I think. But I believe there are things that
> would get easier compared to pax.

There are easier questions to google. You'll find a lot of backup
applications having a git repository for maintaining their source code.
You'll find a lot of "linuxquestions.org * of the year" hits - because
in the news the versioning system of the year (git, of course) comes
right before the backup application of the year. And you'll also find
this thread in the top 10 or top 20 hits, depending on your search.

Using git as a backend for backups has been discussed earlier on the git
mailing list [1], though this RFC got no comments at all and development
apparently stopped after the initial post. This one [2] got a lot more
discussion, but keeps focused on text files (/etc dir). It may have
formed the base for etckeeper [3], aiming at the same target, but I did
not check that.

lwn.net discusses bup [4], which is mentioned several times on the git
mailing list, too. It's an actively developed backup tool writing its
own git files, including files' meta data. It is a collection of python
scripts calling git helper functions (namely git config, init, cat-file,
verify-pack, show-ref, rev-list and update-ref). I did not look deeper
as I'm for a C-only solution.

There is coldstorage [5] that has been stuck in a seemingly early phase
for more than a year.

Goffredo suggested looking at the fast-import/export format [6], which I
did. It is a text based protocol, used to transport commits and
associated meta information from one VCS to another (possibly of a
different kind). My conclusion is that it's not suitable for solving the
problems being discussed here.

> I'll try to make a plan how it could be implemented with git, so that
> we have something we can compare.

Finally, we'll have to create a solution on our own. We could borrow
some ideas from bup if we decided to do so. We'd need a concept to store
more (arbitrary) meta data in the index, which would not be too hard to
add. And the content-addressed concept of git certainly has charm.
Although this inherent deduplication comes for free, we cannot save any
work on stream creation: As a bit of meta information, we will still
need to tell plain copies from reflinks, which could be stored in the
index. However, once we've figured out that something is referencing the
same data, we can use it to not store data multiple times in pax format,
too.

>> This doesn't mean I'll reject a pax setup, it's just an alternative
>> to think about.

After having done so, I'd like to say it's good that you don't reject
pax :-) It is definitely possible to use git's object store methods for
our stream format, but for me, pax still wins. Step by step:

On the plus side of git, I currently only have deduplication in our
stream format - for files that share content blocks (in the size of
blocks we would store). This can make the stream a little smaller;
however, as the content blocks get smaller in size (making dedup more
likely), meta data overhead increases.
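The content-addressed concept mentioned above - storing each unique
content block once, keyed by its hash, the way git's object store does -
could be sketched like this (the block size and the use of SHA-1 are
illustrative assumptions):

```python
import hashlib
from typing import Dict, List

class BlockStore:
    """Content-addressed store: identical blocks are kept only once,
    mirroring how git's object store deduplicates identical content."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks: Dict[str, bytes] = {}  # sha1 hex -> block data

    def add_file(self, data: bytes) -> List[str]:
        """Split data into blocks, store each once, and return the hash
        list that stands in for the file in the stream index."""
        hashes = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            h = hashlib.sha1(block).hexdigest()
            self.blocks[h] = block  # overwrite is a no-op for same content
            hashes.append(h)
        return hashes
```

This also illustrates the trade-off stated above: smaller blocks make
duplicate hits more likely, but each block costs an index entry, so the
meta data overhead grows as the block size shrinks.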
On the plus side of pax, there is the possibility to create streams in
compatibility mode, making it possible to unpack them with any
(sufficiently recent) tar program. This advantage is such a big one that
I would put a good amount of extra work into it - which is not even
necessary. So, I'll not hard-wire the stream output format and will make
it easily replaceable.

If no more facts come up here, I'll make my proof-of-concept
implementation with pax as the stream format.

Thanks!
-Jan

[1] http://kerneltrap.org/mailarchive/git/2006/2/21/201380/thread#mid-201380
[2] http://thread.gmane.org/gmane.comp.version-control.git/33887
[3] http://kitenet.net/~joey/code/etckeeper/
[4] http://lwn.net/Articles/380983/
[5] http://amarok.kde.org/blog/archives/1151-ColdStorage-A-Backup-Tool-Using-Git-At-Its-Core.html
[6] http://www.kernel.org/pub/software/scm/git/docs/git-fast-import.html
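The point about pax compatibility mode can be illustrated with a small
sketch. Python's tarfile module is used here purely to show that pax
extended headers can carry arbitrary key/value metadata per file (the
"BTRFS." key prefix is an invented example, not a settled convention):

```python
import io
import tarfile

def write_pax_stream(out, files):
    """Write (name, data, extra_meta) tuples as a pax archive.

    extra_meta is a dict of custom metadata (e.g. reflink information)
    stored as pax extended header records alongside each file.
    """
    with tarfile.open(fileobj=out, mode="w",
                      format=tarfile.PAX_FORMAT) as tar:
        for name, data, extra in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            # pax extended headers accept arbitrary string key/value pairs
            info.pax_headers = {"BTRFS." + k: v for k, v in extra.items()}
            tar.addfile(info, io.BytesIO(data))
```

Any sufficiently recent tar program can still unpack such a stream:
unknown extended header keywords are simply ignored, which is exactly
the compatibility-mode property argued for above.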