I'd like btrfs to support full featured send and receive in the future. If
nobody is currently working on it, I'll grab the send/receive lock. Now that
I own the lock, I'm opening several discussions on this topic. If you are in
a hurry, it would be great if you could at least read and comment on the KEY
PROPERTIES section.

In short, the purpose of this mail is to
- acquire the send/receive lock
- find a name for a new feature
- define key properties and achieve consensus about them
- find a suitable streaming format

0) REMARK

The first discussion point is not for discussion. Proof reading the email
showed that I'm using the term "file system" both for implementations such
as ext3 and btrfs and for a file system image. It should be clear from the
context everywhere. I furthermore realized that the term "subvolume" is
omitted in favor of the term "snapshot". This is because I tend to think of
snapshots as being read-only (though I very much appreciate that they are
not). Just replace the term wherever you feel appropriate.

1) NAMING

Personally, I like "send" and "receive" as they convey the purpose and do
not leave much room to swap their meaning unintentionally. I'll call the
file system you use "send" on the source file system, and (drum roll) the
file system you use "receive" on the destination file system.

2) USE CASES

I see two related use cases:
- backup of a file system
- migration of a file system to another disk / machine / ...
3) KEY PROPERTIES

I wrote down key features that are must-haves for me, please add to the
list if you have anything on top:

- "send" must generate a stream that can either be "receive"d immediately
  or stored in a file for asynchronous "receive"
- streams must obviously be byte order safe
- a stream must contain a complete fs (full stream) or an incremental
  update to a file system
- a stream must not be restricted in size
- an incremental stream must contain the information which version it is
  based on
- "receive" of an incremental stream must check whether the base is the
  current state of the file system
  - YES => "receive"
  - NO, but it is a previous version
    => abort; should offer --force for rollback and "receive"
  - NO, it does not match any previous version => abort
- a stream must be taken from a consistent state of the file system
- the source file system must remain read-writable during a "send"
- the destination file system must at least remain readable during a
  "receive"
- btrfs as a destination file system should reflect all features of the
  source file system
- other destination file systems must be supported (although some features
  will not map to all file systems)

4) EXISTING SOLUTIONS

Currently, some people use rsync for the aforementioned tasks. It solves
some of the key properties quite well, others not. Depending on how you use
rsync, you might not sync snapshots very well. You might have problems with
reflinks or sparse files. And rsync knows nothing about when your latest
sync was. Some problems can be solved with the utility function "btrfs
find-new", but it does not provide any kind of consistency and has several
other drawbacks.

5) STREAMING FORMAT

An ideal streaming format can contain a complete file system or incremental
updates to a file system. It must transport meta information (such as
snapshots, reflinks, base of the file system, etc.)
and file information (such as holes, extended attributes, atime, ctime,
mtime, user, group, hardlinks, softlinks, device nodes, etc.). It should
have a feature to (optionally) store only parts of a modified file.

It would help if we could use tools already widely available to encapsulate
our backup streams. Imagine an existing streaming format that is flexible
enough to encode all the information needed for our key properties. I like
to put my backups on a different file system (like ext3 or zfs) on another
machine, hence I'd love to do so without needing btrfs or the btrfs tools
on that machine.

Currently, what I have in mind is a solution where "send --compatible"
produces a stream that can easily be unpacked by an unmodified version of a
standard tool (e.g. tar). This would likely include each file completely
that was modified since the reference point - it would never contain a file
partially. In contrast, "send --minimal" produces a stream that might need
a patched tool to be received and which contains parts of files to save
space. Meta information should be included in both streams.

I haven't decided yet whether I'd like compression to be an integral part
of the stream. I currently tend to dislike that, but to be honest, I have
no good reason to do so.

For now, I did some quick research and looked at cpio, tar, ustar, pax and
dar:

* cpio and tar have several drawbacks, I'll just mention that they can't go
  beyond 8GB in file size, making them unusable here.
* The successor of traditional tar, uniform standard tar (ustar), allows
  only 255 characters (at max) per file name in the archive and is not
  extendable.
* pax (portable archive exchange, do not confuse it with PaX) looks a lot
  better from a features perspective [1], and so does ...
* dar (disk archiver) [2].

5.A) Why it won't be dar

dar comes as a GPL program and a library (libdar), where the interesting
bits are encapsulated in the library.
No formal or informal specification of the file format exists; the library
is the interface. This can speed up implementation considerably, but it
sucks in terms of flexibility. dar has a lot of useful features, one of
which is built-in support for the creation of incremental archives, and
even decremental archives [6]. It has no built-in support for reflinks,
though.

A second no-go for libdar: it does all the work required to detect files
that changed between two backup runs, which is great for some file systems.
However, we want to make use of the fact that btrfs knows exactly what
changed.

5.B) Confusing pax

I found a utility named pax at OpenBSD [3], which does not implement the
pax format. It has support for several formats, but the newest of them is
ustar. This implementation is used at least by Debian (and derivatives),
Gentoo, RedHat and MacOS. OpenIndiana has a pax utility that implements the
pax format, for which I could not find source code. I found a Makefile
which refers to pax as "$(CLOSED)/cmd/pax" [4], which makes me think it's
not open source. I was already about to drop pax from my consideration
completely when I accidentally realized that GNU tar has a --format=pax
option (and has had it since 2004) [5]. Users of Solaris, OpenIndiana or
similar will have to use their pax utility, though, because their tar does
not support the pax format. Kind of confusing...

5.C) The good in pax

The good thing about the pax format is that it is extendable at will. You
can use custom header records with key=value pairs of any length. There are
predefined keys, and application specific ones can be added. pax archives
can be generated compatible with ustar, which means such an archive could
be unpacked almost everywhere. The general concept of pax is to use the
pax-specific headers in a way that they will be ignored by a tar utility
that understands ustar but not pax.
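As a quick illustration of these key=value records, Python's tarfile module
can write pax archives with custom per-file extended headers. The BTRFS.*
key names below are purely made up for the example, not a proposed naming
scheme:

```python
import io
import tarfile

# Build a small pax archive in memory with custom key=value header records
# attached to one member. The BTRFS.* keys are illustrative only.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    data = b"hello from a snapshot"
    info = tarfile.TarInfo(name="home/file.txt")
    info.size = len(data)
    info.pax_headers = {
        "BTRFS.reflink.source": "home/other-file.txt",
        "BTRFS.snapshot": "snap1",
    }
    tar.addfile(info, io.BytesIO(data))

# A pax-unaware ustar tool would simply skip the extended records; a
# pax-aware reader gets them back as a dictionary.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("home/file.txt")
    print(member.pax_headers["BTRFS.snapshot"])  # -> snap1
```

Since the keys travel as ordinary extended header records, an old tar still
extracts the file contents untouched, which is exactly the compatibility
property described above.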
5.D) How pax could be used

(Knowledge of the format is required for this paragraph, see [1].) This is
more like brainstorming than something figured out carefully: btrfs "send"
could generate a stream beginning with a global pax header (typeflag=g) for
the name of the current snapshot, then all the files from this snapshot
with custom pax headers (typeflag=x) as needed, to encode reflinks, for
example. After the next global pax header, we're in the next snapshot. This
can either contain any changed file completely (--compatible) or the diffs
for the file along with a custom header telling where the diffs go
(--minimal). The --compatible version could be extracted by any tar off the
shelf (provided file name length and such fit). The result would be one
file containing multiple snapshots of your file system. Extraction of a
single file would be possible, though listing the files in the archive
requires reading the whole file (with a lot of large seeks over the data
portions) as there is no central directory. As an alternative, we could
also start a new file for every snapshot we're about to "send".

We can use more of the custom headers to encode reflinks in a way that they
will become either hard- or softlinks when extracted with a standard tar.
We can add inode numbers for each entry if we feel those should be
replicated to a destination btrfs, and much more.

5.E) So it will be pax - will it?

To me it looks like pax is the most suitable, flexible and available format
to use. Unless somebody has serious objections or thoughts for a better
choice.

6) FINAL REMARK

I hope this longish introduction creates a lively discussion about the
advertised features - or at least silent acknowledgement and endorsement.
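The snapshot-per-segment layout sketched in 5.D can be approximated with
stock tools: write each snapshot as its own pax segment whose global header
(typeflag=g) carries the snapshot name. A sketch using Python's tarfile;
the BTRFS.snapshot key is invented for the example, and note that a plain
tar stops at the first end-of-archive marker, so walking all segments needs
something like GNU tar's --ignore-zeros:

```python
import io
import tarfile

def add_snapshot(stream, snap_name, files):
    """Append one snapshot segment: a pax global header (typeflag=g)
    carrying the snapshot name, followed by that snapshot's files."""
    with tarfile.open(fileobj=stream, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"BTRFS.snapshot": snap_name}) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

stream = io.BytesIO()
add_snapshot(stream, "snap1", {"home/a.txt": b"version 1"})
add_snapshot(stream, "send-test", {"home/a.txt": b"version 2"})

# Reading the first segment back; its global header names the snapshot.
stream.seek(0)
with tarfile.open(fileobj=stream, mode="r") as tar:
    print(tar.pax_headers["BTRFS.snapshot"])  # -> snap1
```

This is only a feasibility sketch; a real --minimal stream would carry file
diffs plus placement headers instead of whole file contents.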
-Jan

[1] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
[2] http://dar.linux.free.fr/doc/man/dar.html
[3] http://www.openbsd.org/cgi-bin/cvsweb/src/bin/pax/
[4] http://hg.openindiana.org/illumos-gate/raw-file/d3807abc6720/usr/src/cmd/Makefile
[5] http://git.savannah.gnu.org/cgit/tar.git/commit/?id=ba08e339a6e05e2a0d1432efdadd67ff2c63f834
[6] http://dar.linux.free.fr/doc/usage_notes.html#Decremental_Backup

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jan,

On 08/01/2011 02:22 PM, Jan Schmidt wrote:
> I furthermore realized that the term "subvolume" is omitted in favor
> of the term "snapshot". This is because I tend to think of snapshots
> being read-only (though I very much appreciate they are not). Just
> replace the term wherever you feel appropriate.

I think that we have to cope with both terms. A snapshot is a subvolume
with an ancestor. This is important if we want to be able to "transfer" a
snapshot between two filesystems as subvolume + delta.

> 3) KEY PROPERTIES
> [...]
> - btrfs as a destination file system should reflect all features of
>   the source file system

I think that we should define what "all features" means.

1) If we are interested in transporting only the file
type/contents/timestamps/acls/owners/permissions, that could be obtained
with a combination of "find-new" (with some extensions [1]) and a
user-space tool. No extension to btrfs is needed.

1.1) As above, plus preserving the inode number.
2) If we also want to preserve the COW relations (between
snapshots/subvolumes and files), I think that we need some help from the
kernel side to be able to inject this information into the destination
btrfs filesystem. Moreover, we need to cope with all the possible errors
due to the fact that the snapshots/subvolumes are out of sync between the
source fs and the destination fs: what if we want to transport a snapshot
to another filesystem where the snapshotted subvolume (previously
successfully transported) was removed or changed? How can we check whether
a snapshot/subvolume was changed?

> - other destination file systems must be supported (although some
>   features will not map to all file systems)

BR
G.Baroncelli

[1] http://comments.gmane.org/gmane.comp.file-systems.btrfs/8201
On 01.08.2011 20:51, Goffredo Baroncelli wrote:
> On 08/01/2011 02:22 PM, Jan Schmidt wrote:
>> I furthermore realized that the term "subvolume" is omitted in favor
>> of the term "snapshot". This is because I tend to think of snapshots
>> being read-only (though I very much appreciate they are not). Just
>> replace the term wherever you feel appropriate.
>
> I think that we have to cope to both the terms. A snapshot is a
> subvolume with an ancestor. This is important if we want to be able to
> "transfer" between two filesystem a snapshot as subvolume + delta.

To be precise, each snapshot is again a subvolume. On the other hand, we
can call every subvolume but the root subvolume a (writable) snapshot. I'd
like to continue discussing the real points now :-)

>> 3) KEY PROPERTIES
>> [...]
>> - btrfs as a destination file system should reflect all features of
>>   the source file system
>
> I think that we should define what means "all features".
> 1) If we are interested to transport only the file
> type/contents/timestamps/acls/owners/permissions, that could be obtained
> with a combination of "find-new" (with some extensions [1]) and an
> user-space tool. No extension to BTRFS are needed.

Right. This is not what I'm after.

> 1.1) as above plus preserve the inode number.
>
> 2) If we want to have also to conserver the COW relation (between
> snapshots/subvolumes and files), I think that we need some help from
> the kernel side to be able to injecting these information in the
> destination btrfs filesystem.

I'd rather gather that information (possibly with help from the kernel)
when generating the stream. I realized that I should have used some
examples in my original mail. What I have in mind (as briefly described in
my section 5.D) is the following:

To not make it overly complex, let's assume snapshots are read-only for a
moment.

We have a subvolume /home with one snapshot /home/snap1. When we want to
send the whole subvolume, we could do the following (if you read section 5,
assume --minimal was the default):

btrfs subvol snapshot /home send-test
btrfs send /home send-test > /tmp/stream

Algorithm: First pick all files from the oldest snapshot (snap1) and put
them into the stream. Then there is a block of meta information saying
"snap1 complete". Next in the stream are the diffs between snap1 and
send-test (reflecting the current state of /home), again with an end
marker.

Let's assume we have a freshly created empty /backup subvolume. To receive
our changes, we'd call the following:

btrfs receive /backup < /tmp/stream

Algorithm: Use data from the stream to create files in /backup/home. Once
we reach the meta information mentioned above, we create a snapshot of
/backup/home in /backup/home/snap1.
Then we go on receiving the diffs in the stream to /backup/home and create
a snapshot send-test.

> Moreover we need to cope all the possible errors due to the fact that
> the snapshot/subvolume are out-sync between the source fs and the
> destination fs: what about if we want to transport a snapshot to an
> another filesystem where the snapshotted subvolume (previously
> successful transported) was removed or changed ? How we can check if a
> snapshot/subvolume was changed ?

Continuing the above example, let's assume we have a /backup subvolume that
did receive a /backup/home earlier. Receiving a full stream (as generated
above) would fail, then. You could remove the subvolume and receive the new
full stream, if you like.

Now let's assume we have /home with snap1, send-test, snap2 and snap3. On
the backup side, we have /backup/home with snap1 and send-test. We make
another snapshot and generate an incremental stream on the sender:

btrfs subvol snapshot /home incr-test
btrfs send /home send-test incr-test

The stream contains the information that it's based on the snapshot
send-test (maybe by uuid rather than name). When we receive the stream, we
first check whether the destination file system /backup/home has that
send-test snapshot.

a) If it does and there are no diffs between /backup/home and
   /backup/home/send-test, no one modified the destination => receive.
b) If it does and there are diffs between the two, receive should fail
   unless --force is specified, which would eliminate all the changes made
   in /backup/home so that it's rolled back to send-test.
c) If it does not, receive just fails.

Although the current algorithm breaks when we release the read-only
precondition, I'm certain we can turn these fuzzy ideas into an actually
working solution.

Thanks,
-Jan
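The three receive-side cases a) to c) above reduce to a small decision
function. A sketch with all names hypothetical; the real tool would query
btrfs for snapshot state, which plain values stand in for here:

```python
import uuid

def check_base(stream_base_uuid, dest_snapshots, dest_modified, force=False):
    """Return the action an incremental "receive" should take.

    stream_base_uuid -- uuid the stream says it is based on
    dest_snapshots   -- uuids of snapshots present on the destination
    dest_modified    -- True if the destination diverged from the base
    """
    if stream_base_uuid not in dest_snapshots:
        return "abort"                 # case c): base snapshot missing
    if not dest_modified:
        return "receive"               # case a): clean base, apply diffs
    if force:
        return "rollback-and-receive"  # case b) with --force
    return "abort"                     # case b): diverged, no --force

base = uuid.uuid4()
print(check_base(base, {base}, dest_modified=False))  # -> receive
print(check_base(base, {base}, dest_modified=True))   # -> abort
print(check_base(base, set(), dest_modified=False))   # -> abort
```

Identifying the base by uuid rather than name, as suggested above, is what
makes the first membership test meaningful across renames.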
Excerpts from Jan Schmidt's message of 2011-08-02 05:43:39 -0400:
> On 01.08.2011 20:51, Goffredo Baroncelli wrote:
> [...]
>
> To be precise, each snapshot is again a subvolume. On the other hand, we
> can call every subvolume but the root subvolume a (writable) snapshot.
> I'd like to continue discussing the real points now :-)

First: awesome! I can't wait to have this feature.

I think you have a very sound list of requirements here, but I'll add one
more. If there are metadata corruptions on the sender, we must not transmit
them over to the receiver. In order for the send/receive command to be a
backup, it needs to have a first-do-no-harm rule for the receiving end.

In terms of formats, I came to similar conclusions a while ago about cpio,
tar and dar. I haven't looked in detail at pax but don't have any strong
feelings against it.

But I'll toss in an alternative: adapt the git pack files a little and use
them as the format. There are a few reasons for this:

Git has a very strong developer community and is already being hammered
into use as a backup application. You'll find a lot of interested people to
help out.

Git separates the contents from the metadata (names). This makes it
naturally suited to describing snapshots and other features. The big
exception is large file handling, but you could extend the format to
describe filename,offset,len->sha instead of just filename->sha.
This doesn't mean I'll reject a pax setup, it's just an alternative to
think about. We should have the actual data transmission format pretty well
abstracted away so we can experiment with alternatives.

In terms of transmitting snapshot details, I always assumed we would need a
snapshot tool that added extra metadata about parent relationships on the
snapshots. I didn't want to enforce this in the metadata on disk, but I
have no problems with saying the send/receive tool requires extra metadata
to tell us about parents.

-chris
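Chris's filename,offset,len->sha idea amounts to content-addressing fixed
ranges of a file. A toy sketch of that mapping; the chunking scheme and
names are invented here and are not git's actual pack layout:

```python
import hashlib

CHUNK = 4  # tiny chunk size so the example stays readable; real packs
           # would use far larger chunks

def index_file(name, data, chunk=CHUNK):
    """Map (name, offset, len) -> sha for fixed-size chunks of a file."""
    table = {}
    for off in range(0, len(data), chunk):
        piece = data[off:off + chunk]
        table[(name, off, len(piece))] = hashlib.sha1(piece).hexdigest()
    return table

v1 = index_file("home/a.txt", b"aaaabbbbcccc")
v2 = index_file("home/a.txt", b"aaaaXXXXcccc")

# Only chunks whose sha differs need to travel in an incremental stream.
changed = [k for k in v2 if v2[k] != v1.get(k)]
print(changed)  # -> [('home/a.txt', 4, 4)]
```

The point of the extension is visible even at toy scale: a large file that
changes in one place contributes only the changed ranges to the stream.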
On 02.08.2011 17:21, Chris Mason wrote:
> First: awesome! I can't wait to have this feature.
>
> I think you have a very sound list of requirements here, but I'll add
> one more. If there are metadata corruptions on the sender, we must not
> transmit them over to the receiver. In order for the send/receive
> command to be a backup, it needs to have a first-do-no-harm rule to the
> receiving end.

Full ack. I'm planning to fetch as much information from user space as
possible. Anything that needs kernel support will have consistency checks
added. No guessing will be made. If it doesn't look safe and sound, that
file system will not be sendable.

Furthermore, receiving should not need kernel support at all (except for an
optional interface to create a file with a certain inode, we'll see). Thus,
replicating metadata corruptions should be very unlikely.

One more thing to add: we have to make sure our stream doesn't get
corrupted. So if the file format we're choosing does not include it, we
should keep in mind to add something ourselves.

> In terms of formats, I came to similar conclusions a while ago about
> cpio, tar and dar. I haven't looked in detail at pax but don't have any
> strong feelings against it.
>
> But, I'll toss in an alternative. Adapt the git pack files a little and
> use them as the format. There are a few reasons for this:
>
> Git has a very strong developer community and is already being
> hammered into use as a backup application. You'll find a lot of
> interested people to help out.
>
> Git separates the contents from the metadata (names). This makes it
> naturally suited to describing snapshots and other features. The big
> exception is in large file handling, but you could extend the format to
> describe filename,offset,len->sha instead of just filename->sha.

That sounds interesting. I haven't thought of git until now. It will lack
the appealing feature to unpack without any special tools or a modified git
client, I think.
But I believe there are things that would get easier compared to pax. I'll
try to make a plan for how it could be implemented with git, so that we
have something we can compare.

> This doesn't mean I'll reject a pax setup, it's just an alternative to
> think about. We should have the actual data transmission format pretty
> well abstracted away so we can experiment with alternatives.

Yes, that would be nice. I'll keep that in mind. If both have their
advantages, we might end up having one format in the first implementation
and another one added later once the rest is working.

> In terms of transmitting snapshot details, I always assumed we would
> need a snapshot tool that added extra metadata about parent
> relationships on the snapshots. I didn't want to enforce this in the
> metadata on disk, but I have no problems with saying the send/receive
> tool requires extra metadata to tell us about parents.

Oh, right. That's something that might not only need kernel support for
"send" to determine a parent, but also a new key representing a snapshot's
parent relationship information.

I'll think that over; currently I tend towards adding these relationship
keys around btrfs_ioctl_snap_create soon, so we have at least some file
systems in the wild that are ready for send and receive once it's done.

Thanks,
-Jan
Hi all,

[...]
> Furthermore, receiving should not need kernel support at all (except for
> an optional interface to create a file with a certain inode, we'll see).
> Thus, replicating metadata corruptions should be very unlikely.

I think that for receiving we can have three levels, which may represent
three stages of development:

1) We store the information in a pax|tar|git|... file format. It is then
the user who can expand this file when needed. I think that in the backup
case this is more useful than having a full filesystem. No help from the
kernel is required.

2) We expand the stream into files, so the final result would be a
filesystem.

2.1) As above, but preserving the inode numbers (a small amount of kernel
help is required; it may also be file-system independent).

2.2) As above, but preserving the COW properties: if we update an already
snapshotted file, btrfs stores the original one and the modified data. The
same would happen in the destination filesystem: if the previous file
snapshot exists, the file is COW-ed in the filesystem, updating only the
"new data". (Help from the kernel side is needed; I don't know if it is
possible to adapt this strategy to filesystems other than btrfs.)

3) Extracting the btree structure from the source filesystem and injecting
this structure into the destination btrfs filesystem. I think that this has
the best performance, both in terms of CPU power and in bandwidth. Full
kernel support is required.

> One more thing to add: We have to make sure our stream doesn't get
> corrupted. So if the file format we're choosing does not include it, we
> should keep in mind to add something ourselves.

The best would be using the btrfs checksums.

> > In terms of formats, I came to similar conclusions a while ago about
> > cpio, tar and dar. I haven't looked in detail at pax but don't have any
> > strong feelings against it.
> >
> > But, I'll toss in an alternative. Adapt the git pack files a little and
> > use them as the format.
> > There are a few reasons for this:
> >
> > Git has a very strong developer community and is already being
> > hammered into use as a backup application. You'll find a lot of
> > interested people to help out.
> >
> > Git separates the contents from the metadata (names). This makes it
> > naturally suited to describing snapshots and other features. The big
> > exception is in large file handling, but you could extend the format to
> > describe filename,offset,len->sha instead of just filename->sha.
>
> That sounds interesting. I haven't thought of git until now. It will
> lack the appealing feature to unpack without any special tools or a
> modified git client, I think. But I believe there are things that would
> get easier compared to pax.
>
> I'll try to make a plan how it could be implemented with git, so that we
> have something we can compare.

I suggest giving a look at the fast-import/export format, which is the "de
facto" standard for sharing information between the new version control
systems.

> > This doesn't mean I'll reject a pax setup, it's just an alternative to
> > think about. We should have the actual data transmission format pretty
> > well abstracted away so we can experiment with alternatives.
>
> Yes, that would be nice. I'll keep that in mind. If both have their
> advantages, we might end up having one format in the first
> implementation and another one added later once the rest is working.
>
> > In terms of transmitting snapshot details, I always assumed we would
> > need a snapshot tool that added extra metadata about parent
> > relationships on the snapshots. I didn't want to enforce this in the
> > metadata on disk, but I have no problems with saying the send/receive
> > tool requires extra metadata to tell us about parents.
>
> Oh, right.
> That's something that might not only need kernel support for
> "send" to determine a parent, but also a new key representing a
> snapshot's parent relationship information.

I think that this information already exists. In fact, every snapshot has a
reference to the original data, on the basis of which it is possible to
obtain the snapshot's parent relationship information.

However, we need to be sure that when we send the "delta" between two
snapshots to the receiver side, the receiver side:
1) has a copy of the previous snapshot
2) has that copy in sync with the original one

I think (please Chris confirm that) that we can check this with the
subvolume id and the generation number of every snapshot, which should be
unique.

> I'll think that over, currently I tend to adding these relationship keys
> around btrfs_ioctl_snap_create soon, so we have at least some file
> systems in the wild that are ready for send and receive once it's done.
>
> Thanks,
> -Jan

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
On 02.08.2011 19:42, Goffredo Baroncelli wrote:
>> Furthermore, receiving should not need kernel support at all (except for
>> an optional interface to create a file with a certain inode, we'll see).
>> Thus, replicating metadata corruptions should be very unlikely.
>
> I think that for receiving we can have three level, which may represent
> three level in the develop:
>
> 1) we store the information as a pax|tar|git|... file format. Then is
> the user that can expand this file when needed. I think that in case of
> backup this is more useful than having a full filesystem. No help from
> kernel required.
>
> 2) we expand the stream in files; so the final results would be a
> filesystem.

How would you test your stream from 1) if you can't unpack it?

> 2.1) as above but preserving the inode number (small help from kernel
> required, may be file-system independent also)

I would skip that and add it as an extension, later.

> 2.2) as above but preserving the COW properties: if we update an already
> snapshotted file, btrfs store the original one and the modified data.
> The same would be in the destination filesystem: if exists the previous
> file snapshot, in the filesystem is COW-ed the file updating only the
> "new data". (help from kernel side. I don't know if it is possible to
> adapt this strategy for other filesystem than BTRFS)

Again, I'd rather gather that information (possibly with help from the
kernel) when generating the stream. This is what I answered and tried to
explain by example in my mail yesterday. Please tell me which part was
unclear and I'll try to explain better.

With the algorithm outlined yesterday, you don't need any kernel support
when receiving, so it should be adaptable by any filesystem that supports
snapshots.

> 3) extracting from the source filesystem the btree structure, and
> injecting in the btrfs filesystem this structure. I think that this has
> the best performance, both in terms of CPU-power and in bandwidth.
> Full kernel support required.

This is like a diff-aware dd, or did I get you wrong? If it is: do you
really think we need it? What for?

>> One more thing to add: We have to make sure our stream doesn't get
>> corrupted. So if the file format we're choosing does not include it, we
>> should keep in mind to add something ourselves.
>
> The best would be using the BTRFS checksum.

Sounds interesting. How would you add a btrfs checksum to a stream file (no
matter what format we'll use)? And how would you verify it?

>> I'll try to make a plan how it could be implemented with git, so that we
>> have something we can compare.
>
> I suggest to give a look to the fast-import/export format, which is "de
> facto" standard about sharing information between the new CVS system.

Thanks for the hint, I will include that in my considerations.

>>> In terms of transmitting snapshot details, I always assumed we would
>>> need a snapshot tool that added extra metadata about parent
>>> relationships on the snapshots. I didn't want to enforce this in the
>>> metadata on disk, but I have no problems with saying the send/receive
>>> tool requires extra metadata to tell us about parents.
>>
>> Oh, right. That's something that might not only need kernel support for
>> "send" to determine a parent, but also a new key representing a
>> snapshot's parent relationship information.
>
> I think that this information already exists. In fact every snapshot has
> a reference to the original data, on the basis of which it is possible
> to obtain the snapshot's parent relationship information.

How can that be done?
I don''t see such a link.> However we need to be sure that when we send the "delta" between two snapshot > to the receiver side, the receiver side: > 1) has a copy of the previous snapshot > 2) this copy is in sync to the original one > > I think (please Chris confirm that) that we can check this with the subvolume > id and the generation-no of every snapshot, which should be unique.uuid + generation was my suggestion as well, should be unique, yes. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
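The uuid + generation check agreed on above could be sketched as follows.
This is only a sketch of the decision logic; `SnapshotId` and
`check_base` are hypothetical names, not actual btrfs interfaces:

```python
from dataclasses import dataclass

@dataclass
class SnapshotId:
    uuid: str        # uuid of the snapshot the stream is based on
    generation: int  # generation-no recorded at "send" time

def check_base(stream_base: SnapshotId, local: SnapshotId) -> str:
    """Decide whether an incremental stream may be received.

    Returns "receive" when the local snapshot matches the stream's base
    exactly, "rollback-needed" when only the uuid matches (the
    destination changed since, so --force would be required), and
    "abort" when the base snapshot is not present at all.
    """
    if stream_base.uuid != local.uuid:
        return "abort"
    if stream_base.generation != local.generation:
        return "rollback-needed"
    return "receive"
```

The generation number is what distinguishes "same snapshot" from "same
snapshot, but modified since": the uuid alone would not catch a
destination that was written to after the last receive.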
On Tuesday, 02 August, 2011 11:43:39 you wrote:
> On 01.08.2011 20:51, Goffredo Baroncelli wrote:
[...]
> > 1) If we are interested to transport only the file
> > type/contents/timestamps/acls/owners/permissions, that could be
> > obtained with a combination of "find-new" (with some extensions [1])
> > and a user-space tool. No extension to btrfs is needed.
>
> Right. This is not what I'm after.
>
> > 1.1) as above plus preserve the inode number.
> >
> > 2) If we want also to conserve the COW relation (between
> > snapshots/subvolumes and files), I think that we need some help from
> > the kernel side to be able to inject this information into the
> > destination btrfs filesystem.
>
> I'd rather gather that information (possibly with help from the
> kernel) when generating the stream. I realized that I should have used
> some examples in my original mail. What I have in mind (as briefly
> described in my section 5.D) is the following:
>
> To not make it overly complex, let's assume snapshots are read-only for
> a moment.
>
> We have a subvolume /home with one snapshot /home/snap1. When we want
> to send the whole subvolume we could do the following (if you read
> section 5, assume --minimal was the default):
>
> btrfs subvol snapshot /home send-test
> btrfs send /home send-test > /tmp/stream
>
> Algorithm: First pick all files from the oldest snapshot (snap1) and
> put them into the stream. Then, there is a block of meta information
> saying "snap1 complete". Next in the stream are the diffs between
> snap1 and send-test (reflecting the current state of /home), again
> with an end-marker.
>
> Let's assume we have a freshly created empty /backup subvolume. To
> receive our changes, we'd call the following:
>
> btrfs receive /backup < /tmp/stream
>
> Algorithm: Use data from the stream to create files in /backup/home.
> Once we reach the meta information mentioned above, we create a
> snapshot of /backup/home in /backup/home/snap1. Then we go on
> receiving the diffs in the stream to /backup/home and create a
> snapshot send-test.

Basically, we need a tool to evaluate the difference (in terms of
metadata and file content) between two snapshots which have an ancestor
in common. This is a typical problem for VCS software. On the basis of
that we can generate a stream, which may be a diff between two snapshots
or a full subvolume (no diff).

On the receiver side, in case of a diff it is necessary to check whether
both sides have the "old" snapshot, and whether these snapshots are
aligned. I think that by tracking the subvolume-id and the generation-no
we can check this easily.

> > Moreover, we need to cope with all the possible errors due to the
> > fact that the snapshots/subvolumes are out of sync between the source
> > fs and the destination fs: what about if we want to transport a
> > snapshot to another filesystem where the snapshotted subvolume
> > (previously successfully transported) was removed or changed? How can
> > we check if a snapshot/subvolume was changed?
>
> Continuing the above example, let's assume we have a /backup subvolume
> that did receive a /backup/home earlier. Receiving a full stream (as
> generated above) would fail, then. You could remove the subvolume and
> receive the new full stream, if you like.
>
> Now let's assume we have /home with snap1, send-test, snap2 and snap3.
> On the backup side, we have /backup/home with snap1 and send-test.
>
> We make another snapshot and generate an incremental stream on the
> sender:
>
> btrfs subvol snapshot /home incr-test
> btrfs send /home send-test incr-test
>
> The stream contains the information that it's based on the snapshot
> send-test (maybe by uuid rather than name). When we receive the
> stream, we first check if the destination filesystem /backup/home has
> that send-test snapshot.
>
> a) If it does and there are no diffs between /backup/home and
>    /backup/home/send-test, no one modified the destination => receive
>
> b) If it does and there are diffs between the two, receive should fail
>    unless --force is specified, which would eliminate all the changes
>    made in /backup/home so that it's rolled back to send-test.
>
> c) If it does not, receive just fails.
>
> Although the current algorithm breaks when we release the read-only
> precondition, I'm certain we can turn these fuzzy ideas into an
> actually working solution.
>
> Thanks,
> -Jan

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack@inwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
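Jan's three receive cases a), b) and c) above can be sketched as a small
decision function. This is a sketch only; the boolean inputs stand in
for whatever mechanism (e.g. the uuid + generation comparison discussed
in this thread) ends up answering those questions:

```python
def decide_receive(dest_has_base: bool, dest_unmodified: bool,
                   force: bool) -> str:
    """Map the three receive cases to an action.

    dest_has_base:   the destination has the snapshot the stream is
                     based on (case c applies when False)
    dest_unmodified: no diffs between the destination subvolume and that
                     base snapshot (distinguishes case a from case b)
    force:           whether --force was given
    """
    if not dest_has_base:
        return "fail"                  # c) base snapshot missing
    if dest_unmodified:
        return "receive"               # a) destination untouched
    if force:
        return "rollback-and-receive"  # b) --force: roll back first
    return "fail"                      # b) without --force
```

Note that "rollback-and-receive" is destructive: it discards every
change made in the destination subvolume since the base snapshot, which
is why it hides behind --force.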
On Wednesday, 03 August, 2011 17:04:40 Jan Schmidt wrote:
> On 02.08.2011 19:42, Goffredo Baroncelli wrote:
> >> Furthermore, receiving should not need kernel support at all (except
> >> for an optional interface to create a file with a certain inode,
> >> we'll see). Thus, replicating metadata corruptions should be very
> >> unlikely.
> >
> > I think that for receiving we can have three levels, which may
> > represent three levels in the development:
> >
> > 1) we store the information in a pax|tar|git|... file format. Then it
> > is the user who can expand this file when needed. I think that in
> > case of backup this is more useful than having a full filesystem. No
> > help from kernel required.
> >
> > 2) we expand the stream into files; so the final result would be a
> > filesystem.
>
> How would you test your stream from 1) if you can't unpack it?

If we are able to store the information in a standard format (like tar),
we are able to unpack it when we need to. The difference between points
1 and 2 is that for point 1 it is not required to develop the
"extraction side". This doesn't mean that we *must not* develop it; it
means that we *may* delay the development of the extraction side and
still have something really useful.

Point 2) requires developing an extraction tool (the "btrfs receive"
command), which would be able to handle further metadata like the
"parent relationship" which you refer to below. I think that the
extraction would be like:

sender>   hello "receiver", which snapshots do you have?
receiver> hello "sender", I have snapshots A, B, D
sender>   ok, I have the snapshots B and C, so I will send you the delta
          from snapshot B, which is the latest in common.
sender>   send data .....

This is far away from a simple tar (or pax or git...) file format.

> > 2.1) as above but preserving the inode number (small help from
> > kernel required, may be filesystem independent also)
>
> I would skip that and add it as an extension, later.
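The sender/receiver dialogue above amounts to finding the latest
snapshot both sides have in common. A minimal sketch of that step,
assuming snapshots are identified by some unique id (e.g. uuid) and the
sender's list is ordered oldest to newest:

```python
from typing import List, Optional

def latest_common_snapshot(sender_has: List[str],
                           receiver_has: List[str]) -> Optional[str]:
    """Pick the base for an incremental send: the newest snapshot
    present on both sides, or None if a full stream is needed."""
    common = set(sender_has) & set(receiver_has)
    # sender_has is ordered oldest to newest; scan from the newest end
    for snap in reversed(sender_has):
        if snap in common:
            return snap
    return None
```

In the dialogue above the sender has B and C and the receiver has A, B
and D, so B is chosen as the base; if nothing is in common, the sender
falls back to a full stream.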
> > 2.2) as above but preserving the COW properties: if we update an
> > already snapshotted file, btrfs stores the original one and the
> > modified data. The same would be in the destination filesystem: if
> > the previous file snapshot exists, the file is COW-ed in the
> > filesystem, updating only the "new data". (help from kernel side
> > required. I don't know if it is possible to adapt this strategy for
> > filesystems other than btrfs)
>
> Again, I'd rather gather that information (possibly with help from the
> kernel) when generating the stream. This is what I answered and tried
> to explain by example in my mail yesterday. Please tell me which part
> was unclear and I'll try to explain better.

I am talking *only* about the receiving. How we gather this information
is not (for the moment) under discussion.

> With the algorithm outlined yesterday, you don't need any kernel
> support when receiving, so it should be adaptable by any filesystem
> that supports snapshots.

Right.

> > 3) extracting the btree structure from the source filesystem, and
> > injecting this structure into the btrfs filesystem. I think that
> > this has the best performance, both in terms of CPU power and in
> > bandwidth. Full kernel support required.
>
> This is like a diff-aware dd, or did I get you wrong? If it is: do you
> really think we need it? What for?

I cited it only as a "brainstorming" approach. The only gain is its
space efficiency.

> >> One more thing to add: We have to make sure our stream doesn't get
> >> corrupted. So if the file format we're choosing does not include
> >> it, we should keep in mind to add something ourselves.
> >
> > The best would be using the btrfs checksum.
>
> Sounds interesting. How would you add a btrfs checksum to a stream
> file (no matter what format we'll use)? And how would you verify it?

I think that btrfs already stores a checksum on a per-block basis. When
we send the stream, we could get this information from btrfs and send it
along. This is only to avoid recalculating a checksum.

Pay attention that I think btrfs stores the checksum only for the data,
and not for the full files. What I mean is that if a file is COW-ed,
btrfs stores the original data and only the updated data, then stores
the checksum for the original file and the checksum for the updated
data. It doesn't store a checksum for the full updated file. This means
that if we try to rebuild the file by applying a delta, we don't have a
checksum of the full file to compare against.

> >> I'll try to make a plan how it could be implemented with git, so
> >> that we have something we can compare.
> >
> > I suggest giving a look at the fast-import/export format, which is
> > the "de facto" standard for sharing information between the new VCS
> > systems.
>
> Thanks for the hint, I will include that in my considerations.
>
> >>> In terms of transmitting snapshot details, I always assumed we
> >>> would need a snapshot tool that added extra metadata about parent
> >>> relationships on the snapshots. I didn't want to enforce this in
> >>> the metadata on disk, but I have no problems with saying the
> >>> send/receive tool requires extra metadata to tell us about
> >>> parents.
> >>
> >> Oh, right. That's something that might not only need kernel support
> >> for "send" to determine a parent, but also a new key representing a
> >> snapshot's parent relationship information.
> >
> > I think that this information already exists. In fact, every
> > snapshot has a reference to the original data, on the basis of which
> > it is possible to obtain the snapshot's parent relationship
> > information.
>
> How can that be done? I don't see such a link.

Give a look at
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Backref_walking_utilities
but I have to admit that the real state is different from what I
(wrongly) understood of the btrfs internals.

> > However, we need to be sure that when we send the "delta" between
> > two snapshots to the receiver side, the receiver side:
> > 1) has a copy of the previous snapshot
> > 2) this copy is in sync with the original one
> >
> > I think (please Chris confirm that) that we can check this with the
> > subvolume id and the generation-no of every snapshot, which should
> > be unique.
>
> uuid + generation was my suggestion as well, should be unique, yes.
>
> -Jan
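Independent of whether btrfs's internal per-block checksums can be
reused, the stream itself could carry a checksum per record, so
"receive" can detect corruption regardless of the container format
chosen. A minimal sketch; the record framing (length + CRC-32 + payload)
is an illustrative assumption, not a settled format:

```python
import struct
import zlib

def pack_record(payload: bytes) -> bytes:
    """Frame a stream record: 4-byte length, 4-byte CRC-32, payload."""
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload

def unpack_record(record: bytes) -> bytes:
    """Verify and strip the framing; raise on corruption."""
    length, crc = struct.unpack(">II", record[:8])
    payload = record[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError("stream record corrupted")
    return payload
```

A per-record checksum also localizes the damage: a single flipped bit
invalidates one record, not the whole stream.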
On 02.08.2011 18:01, Jan Schmidt wrote:
> On 02.08.2011 17:21, Chris Mason wrote:
>> But, I'll toss in an alternative. Adapt the git pack files a little
>> and use them as the format. There are a few reasons for this:
>>
>> Git has a very strong developer community and is already being
>> hammered into use as a backup application. You'll find a lot of
>> interested people to help out.
>>
>> Git separates the contents from the metadata (names). This makes it
>> naturally suited to describing snapshots and other features. The big
>> exception is in large file handling, but you could extend the format
>> to describe filename,offset,len->sha instead of just filename->sha.
>
> That sounds interesting. I haven't thought of git until now. It will
> lack the appealing feature to unpack without any special tools or a
> modified git client, I think. But I believe there are things that
> would get easier compared to pax.

There are easier questions to google. You'll find a lot of backup
applications having a git repository for maintaining their source code.
You'll find a lot of "linuxquestions.org * of the year" hits - because
in the news the versioning system of the year (git, of course) comes
right before the backup application of the year. And you'll also find
this thread in the top 10 or top 20 hits, depending on your search.

Using git as a backend for backups has been discussed earlier on the git
mailing list [1], though this RFC got no comments at all and development
apparently stopped after the initial post. This one [2] got a lot more
discussion, but keeps focused on text files (/etc dir). It may have
formed the base for etckeeper [3], aiming at the same target, but I did
not check that.

lwn.net discusses bup [4], which is mentioned several times on the git
mailing list, too. It's an actively developed backup tool writing its
own git files, including files' meta data. It is a collection of python
scripts calling git helper functions (namely git config, init, cat-file,
verify-pack, show-ref, rev-list and update-ref). I did not look deeper
as I'm for a C-only solution.

There is coldstorage [5] that has been stuck in a seemingly early phase
for more than a year.

Goffredo suggested looking at the fast-import/export format [6], which I
did. It is a text based protocol, used to transport commits and
associated meta information from one VCS to another (possibly of a
different kind). My conclusion is that it's not suitable for solving the
problems being discussed here.

> I'll try to make a plan how it could be implemented with git, so that
> we have something we can compare.

Finally, we'll have to create a solution on our own. We could borrow
some ideas from bup if we decided to do so. We'd need a concept to store
more (arbitrary) meta data in the index, which would not be too hard to
add. And the content-addressed concept of git certainly has charm.
Although this inherent deduplication comes for free, we cannot save any
work on stream creation: As a bit of meta information, we will still
need to tell plain copies from reflinks, which could be stored in the
index. However, once we've figured out that something is referencing the
same data, we can use it to not store data multiple times in pax format,
too.

>> This doesn't mean I'll reject a pax setup, it's just an alternative
>> to think about.

After having done so, I'd like to say it's good that you don't reject
pax :-) It is definitely possible to use git's object store methods for
our stream format, but for me, pax still wins. Step by step:

On the plus side of git, I currently only have deduplication in our
stream format - for files that share content blocks (in the size of
blocks we would store). This can make the stream a little smaller;
however, as the content blocks get smaller in size (making dedup more
likely), meta data overhead increases.
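The content-addressed concept mentioned above - storing each unique
content block once, keyed by its hash, the way git's object store does -
could be sketched like this (the block size and the use of SHA-1 are
illustrative assumptions):

```python
import hashlib
from typing import Dict, List

class BlockStore:
    """Content-addressed store: identical blocks are kept only once,
    mirroring how git's object store deduplicates identical content."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks: Dict[str, bytes] = {}  # sha1 hex -> block data

    def add_file(self, data: bytes) -> List[str]:
        """Split data into blocks, store each once, and return the hash
        list that stands in for the file in the stream index."""
        hashes = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            h = hashlib.sha1(block).hexdigest()
            self.blocks[h] = block  # overwrite is a no-op for same content
            hashes.append(h)
        return hashes
```

This also illustrates the trade-off stated above: smaller blocks make
duplicate hits more likely, but each block costs an index entry, so the
meta data overhead grows as the block size shrinks.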
On the plus side of pax, there is the possibility to create streams in
compatibility mode, making it possible to unpack them with any
(sufficiently recent) tar program. This advantage is such a big one that
I would put a good amount of extra work into it - which is not even
necessary. So, I'll not hard-wire the stream output format and will make
it easily replaceable.

If no more facts come up here, I'll make my proof-of-concept
implementation with pax as the stream format.

Thanks!
-Jan

[1] http://kerneltrap.org/mailarchive/git/2006/2/21/201380/thread#mid-201380
[2] http://thread.gmane.org/gmane.comp.version-control.git/33887
[3] http://kitenet.net/~joey/code/etckeeper/
[4] http://lwn.net/Articles/380983/
[5] http://amarok.kde.org/blog/archives/1151-ColdStorage-A-Backup-Tool-Using-Git-At-Its-Core.html
[6] http://www.kernel.org/pub/software/scm/git/docs/git-fast-import.html
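The point about pax compatibility mode can be illustrated with a small
sketch. Python's tarfile module is used here purely to show that pax
extended headers can carry arbitrary key/value metadata per file (the
"BTRFS." key prefix is an invented example, not a settled convention):

```python
import io
import tarfile

def write_pax_stream(out, files):
    """Write (name, data, extra_meta) tuples as a pax archive.

    extra_meta is a dict of custom metadata (e.g. reflink information)
    stored as pax extended header records alongside each file.
    """
    with tarfile.open(fileobj=out, mode="w",
                      format=tarfile.PAX_FORMAT) as tar:
        for name, data, extra in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            # pax extended headers accept arbitrary string key/value pairs
            info.pax_headers = {"BTRFS." + k: v for k, v in extra.items()}
            tar.addfile(info, io.BytesIO(data))
```

Any sufficiently recent tar program can still unpack such a stream:
unknown extended header keywords are simply ignored, which is exactly
the compatibility-mode property argued for above.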