----- Forwarded Message -----
From: "Stefan Hajnoczi" <stefanha@redhat.com>
To: "Eric Sandeen" <sandeen@redhat.com>
Cc: virt-devel@redhat.com, "Kevin Wolf" <kwolf@redhat.com>
Sent: Friday, November 22, 2013 9:20:51 AM
Subject: [virt-devel] btrfs NOCOW for VM disk images

Hi,

In upstream QEMU we're discussing patches that set the NOCOW flag on disk image files. We're told that this increases btrfs performance greatly since the file system will modify data in-place like ext4/xfs.

During testing I found that the NOCOW flag prevents file cloning from working. cp --reflink fails with EINVAL when the source file has the NOCOW flag set.

It is not possible to toggle NOCOW back and forth later on since it can only be set when no data has been allocated for the file yet.

This leaves us with the choice between performance (NOCOW) and snapshots (default). Both are important for VM disk images!

Questions:

* Would it be possible to extend btrfs so that cp --reflink works on NOCOW files? (Clueless idea: quiesce I/O to the NOCOW file and clone it, then resume I/O and COW only writes to shared blocks.)

* Does NOCOW prevent any other functionality besides file-level cloning?

* Does NOCOW increase risk of data loss/corruption? (I guess yes since overwriting in place puts data at risk of power failure or drive failure.)

Thanks,
Stefan

--
John Dulaney, RHCE
IRC: handsome_pirate
jdulaney.wordpress.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
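Stefan's point that NOCOW can only be set while the file has no data yet can be illustrated with a minimal Python sketch. It assumes 64-bit Linux and uses the generic inode-flag ioctls from <linux/fs.h>; the helper name create_nocow_file is hypothetical, and this is not how the QEMU patches do it. On a filesystem that does not support the flag, the sketch simply reports False.

```python
import fcntl
import os
import struct
import tempfile

# ioctl numbers and flag value as defined in <linux/fs.h> on 64-bit Linux
# (assumed platform; these are not taken from the thread).
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # the inode flag that "chattr +C" sets


def create_nocow_file(path):
    """Create an empty file and try to set NOCOW before any data is written.

    Returns True if the flag was set, False if the filesystem refused
    (for example, because it is not btrfs).
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o644)
    try:
        buf = bytearray(struct.pack("I", 0))
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)  # kernel fills in current flags
        flags = struct.unpack("I", buf)[0]
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("I", flags | FS_NOCOW_FL))
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.img")
        print("NOCOW set:", create_nocow_file(image))
```

The important ordering is create first, set the flag second, write data last; a disk image tool would allocate or write the image only after this call returns.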
John Dulaney posted on Fri, 22 Nov 2013 11:17:34 -0500 as excerpted:

> In upstream QEMU we're discussing patches that set the NOCOW flag on
> disk image files. We're told that this increases btrfs performance
> greatly since the file system will modify data in-place like ext4/xfs.

Indeed. For VM images and similar large "internally modified" files, NOCOW is definitely recommended, since otherwise they can very rapidly become extremely heavily fragmented. This is a use-case that COW-based filesystems simply don't deal well with, so turning off the COW is definitely recommended.

> During testing I found that the NOCOW flag prevents file cloning from
> working. cp --reflink fails with EINVAL when the source file has the
> NOCOW flag set.

That would be expected, since disabling COW means the file will be updated in-place, and if reflink-copying was allowed, changing the one view in-place would by definition change the other view of the same file, since it /is/ the same file data.

If you want both views of the file to change together, why not use a normal hardlink? If you don't want them to change together, then you can't set NOCOW and reflink-copy, since by definition NOCOW makes changes in-place, and if reflinks were allowed, that'd change both views.

Quoting the cp manpage --reflink discussion:

>>>>
When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails, or if --reflink=auto is specified, fall back to a standard copy.
<<<<

Since you disabled COW, the data blocks cannot be copied when modified, so the copy fails (or with auto falls back to a normal copy). Defined, documented and expected behavior.

> It is not possible to toggle NOCOW back and forth later on since it can
> only be set when no data has been allocated for the file yet.
>
> This leaves us with the choice between performance (NOCOW) and snapshots
> (default). Both are important for VM disk images!
>
> Questions:
>
> * Would it be possible to extend btrfs so that cp --reflink works on
>   NOCOW files? (Clueless idea: quiesce I/O to the NOCOW file and clone
>   it, then resume I/O and COW only writes to shared blocks.)

Of course it's /possible/, but doing so would pervert the definition of NOCOW or of reflink or both. Either reflinks would effectively become hardlinks and writing to one view of the NOCOW data would change them all, or it would no longer be NOCOW. Since hardlinks already exist as a solution and COW is the default...

> * Does NOCOW prevent any other functionality besides file-level
>   cloning?

Being a simple btrfs user/sysadmin, I'm not sure about the file-level option, but certainly when given as a mount option (nodatacow)[1], both data checksumming and file compression are turned off as well. Given the technical requirements, I'd assume the same applies to NOCOW file attributes as well.

It's worth noting that there have been several bugs related to this as well, where btrfs was doing the wrong thing with "internally changed" files in one case or another. One now fixed bug was triggered most often with systemd's journal, where systemd was doing direct-IO and btrfs wasn't properly handling checksums. (Someone else reported a file-preallocating bittorrent client triggering that same bug, so it wasn't /just/ systemd triggering it, but systemd was the most widely deployed and thus most common trigger.)

Turning off checksums for this sort of "internally changed" image file thus becomes the easiest way to avoid such issues, and NOCOW is the way this type of file usage pattern is conveyed to the filesystem.

Mixing compression and internal-writes is another problematic situation, so turning that off for NOCOW files also makes sense.

> * Does NOCOW increase risk of data loss/corruption? (I guess yes since
>   overwriting in place puts data at risk of power failure or drive
>   failure.)

Absolutely, for that file at least. The loss of data checksumming means loss of that normally important data integrity check as well, tho at the same time it's actually safer in some ways, since you don't have the filesystem checksums fighting and racing with the internal file updates, the source of the now fixed systemd journal triggered bug mentioned above.

However, NOCOW on a large and very frequently internally changed file arguably makes other data/metadata on the filesystem safer. The very frequent changes are now contained and isolated to their own unchanging location on the filesystem, with no constantly changing, partially shared extent-tracking metadata and data checksum records to keep updated at the same time, and thus no possibility of endangering the other files sharing the same metadata records.

[1] Btrfs mount options. They aren't yet documented in the mount manpage, and mount doesn't ship with btrfs-progs so there's no manpage documentation for mount options there, so the kernel's btrfs.txt file and the wiki are the only good places to look up btrfs-specific mount-options:
https://btrfs.wiki.kernel.org/index.php/Mount_options

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
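The --reflink=auto behavior quoted from the cp manpage above (try a lightweight clone, fall back to a standard copy if that is not possible) can be sketched in Python. This is only an illustration, not cp's implementation: it assumes Linux, uses the FICLONE ioctl number from <linux/fs.h>, and the function name copy_reflink_auto is hypothetical.

```python
import fcntl
import os
import shutil
import tempfile

# Clone ioctl number from <linux/fs.h> (assumed Linux; not from the thread).
FICLONE = 0x40049409


def copy_reflink_auto(src, dst):
    """Try a lightweight clone of src into dst; fall back to a full copy.

    Returns "clone" or "copy" to say which path was taken.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "clone"
        except OSError:
            # Cloning is not possible here (for example EINVAL or
            # EOPNOTSUPP); fall through to a standard copy.
            pass
    shutil.copyfile(src, dst)
    return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.img")
        dst = os.path.join(tmp, "dst.img")
        with open(src, "wb") as f:
            f.write(b"\0" * 4096)
        print("copied via:", copy_reflink_auto(src, dst))
```

With --reflink=always semantics the except branch would re-raise instead of falling back, which corresponds to the EINVAL failure reported at the top of the thread.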
On Fri, 22 Nov 2013 21:26:16 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> > During testing I found that the NOCOW flag prevents file cloning from
> > working. cp --reflink fails with EINVAL when the source file has the
> > NOCOW flag set.
>
> That would be expected, since disabling COW means the file will be
> updated in-place, and if reflink-copying was allowed, changing the one
> view in-place would by definition change the other view of the same file,
> since it /is/ the same file data.

However snapshotting a subvolume which has NOCOW files *is* allowed. I'm told data is then COW'ed only once, and only the areas that are changed after the snapshot has been made (or something along those lines).

So since snapshotting+NOCOW can be combined and everything works automagically as expected, maybe reflink could be made to work as well?

--
With respect,
Roman
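The "COW'ed only once, and only the areas that are changed" behavior described above can be pictured with a toy model in Python. Everything here is an assumption made for illustration (the Block and ToyNocowFile classes, per-block reference counts); it is not btrfs code and says nothing about how btrfs tracks sharing internally.

```python
class Block:
    """One block of file data with a count of files that reference it."""

    def __init__(self, data=b""):
        self.data = data
        self.refs = 1


class ToyNocowFile:
    """Toy NOCOW file: overwrite in place unless the block is shared."""

    def __init__(self, blocks):
        self.blocks = blocks

    def snapshot(self):
        # A snapshot shares every block with the original file.
        for block in self.blocks:
            block.refs += 1
        return ToyNocowFile(list(self.blocks))

    def write(self, index, data):
        """Write one block; return "cow" or "in-place" for the path taken."""
        block = self.blocks[index]
        if block.refs > 1:
            # Shared with a snapshot: copy once, then this file owns
            # the new block and later writes go in place again.
            block.refs -= 1
            self.blocks[index] = Block(data)
            return "cow"
        block.data = data
        return "in-place"
```

In this model, the first write to a block after a snapshot takes the "cow" path, a second write to the same block is "in-place" again, and blocks that are never rewritten stay shared with the snapshot.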
On Nov 22, 2013, at 9:17 AM, John Dulaney <jdulaney@redhat.com> wrote:

> In upstream QEMU we're discussing patches that set the NOCOW flag on
> disk image files. We're told that this increases btrfs performance
> greatly since the file system will modify data in-place like ext4/xfs.

The best-performing qemu/kvm results I have, using a Fedora 20 installation as the benchmark and anaconda's time stamping of the start and completion of the installation, are with Btrfs on the host, a preallocated Raw file with the +C attribute, and Btrfs in the guest. The test matrix is 3x3: ext4, XFS, Btrfs. So each fs was used on the host, and in the guest. By "best performing" we're talking about maybe 20-30 seconds better over a 7-8 minute install time on spinning rust. So with respect to installing an OS (the live image uses rsync), it seems Btrfs on Btrfs is at least no worse off than other file systems.

A 20GB preallocated Raw on Btrfs with +C set has 33 extents, which doesn't ever change. When I do this with a qcow2 file with preallocated metadata, it starts out with only 5 extents upon creation, but with each successive installation using the same qcow2 file, also with +C set, the extent count grows quite a bit. It's unclear from the testing whether this negatively impacts performance, or whether the extent increase eventually flattens out.

after installation1> fedoratest.img: 1255 extents found
after installation2> fedoratest.img: 1773 extents found
after installation3> fedoratest.img: 2148 extents found
after installation4> fedoratest.img: 2245 extents found

This is a whole lot less, however, than non-preallocated Raw without +C, where it rapidly ends up with tens of thousands of extents with no end in sight.

> This leaves us with the choice between performance (NOCOW) and snapshots
> (default). Both are important for VM disk images!

Some testing needs to be done with qcow2 on Btrfs with +C long term to see if there's a meaningful performance hit as the qcow2 ages. It may also be possible to defragment the qcow2 file once extent allocation tapers off. And another possibility would be for qcow2 to support full preallocation, so that its initial extent count is no worse than Raw.

If you don't need host based snapshotting, another possibility is using Btrfs in the guest, and snapshotting within the guest. It depends on the use case if this is preferred or not, but I think there could be some advantages to snapshotting within the guest. In this case, using Btrfs in the guest, regardless of the backing method used, gives the guest the ability to at least flag fs/data corruption, if not repair it (if a raid 1+ data profile is employed).

Chris Murphy
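Whether the qcow2 extent growth is tapering off can be read directly from the four filefrag summary lines reported above. The following Python sketch parses lines of that shape and prints the increase between installations; the function names extent_counts and growth are hypothetical, and only the four reported counts are used as data.

```python
import re

# Matches filefrag summary lines such as "fedoratest.img: 1255 extents found".
FILEFRAG_SUMMARY = re.compile(r"^(?P<name>.+): (?P<count>\d+) extents? found$")


def extent_counts(lines):
    """Extract the extent count from each filefrag summary line."""
    counts = []
    for line in lines:
        match = FILEFRAG_SUMMARY.match(line.strip())
        if match:
            counts.append(int(match.group("count")))
    return counts


def growth(counts):
    """Return the increase in extent count between consecutive runs."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# The four summary lines reported in the message above.
reported = [
    "fedoratest.img: 1255 extents found",
    "fedoratest.img: 1773 extents found",
    "fedoratest.img: 2148 extents found",
    "fedoratest.img: 2245 extents found",
]

counts = extent_counts(reported)
print(counts)          # [1255, 1773, 2148, 2245]
print(growth(counts))  # [518, 375, 97]
```

The increments shrink with each installation (518, 375, 97), which is consistent with allocation tapering off, though four data points are not enough to say it flattens out.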
On Sat, Nov 23, 2013 at 04:00:28AM +0600, Roman Mamedov wrote:
> On Fri, 22 Nov 2013 21:26:16 +0000 (UTC)
> Duncan <1i5t5.duncan@cox.net> wrote:
>
> > > During testing I found that the NOCOW flag prevents file cloning from
> > > working. cp --reflink fails with EINVAL when the source file has the
> > > NOCOW flag set.
> >
> > That would be expected, since disabling COW means the file will be
> > updated in-place, and if reflink-copying was allowed, changing the one
> > view in-place would by definition change the other view of the same file,
> > since it /is/ the same file data.
>
> However snapshotting a subvolume which has NOCOW files *is* allowed.
> I'm told data is then COW'ed only once, and only the areas that are changed
> after the snapshot has been made (or something along those lines).

This is correct.

> So since snapshotting+NOCOW can be combined and everything works
> automagically as expected, maybe reflink could be made to work as
> well?

This works (to my own surprise). The clone ioctl checks if the files have the same status regarding checksums, so reflink from nocow -> nocow should work. What does not work is if one does

$ cp --reflink=always nocow-file somefile
cp: failed to clone ‘somefile’ from ‘nocow’: Invalid argument

because cp creates somefile without nocow status. But after precreating somefile with chattr +C and then calling the command above, cp does not complain.

Rewriting one file does not modify the other, though the output of filefrag after the modification does not seem to reflect that the files do not in fact share the same blocks:

File size of somefile is 2097152 (512 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     511:   54865266..  54865777:    512:             eof
somefile: 1 extent found
File size of nocow is 2097152 (512 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     511:   54865266..  54865777:    512:             eof
nocow: 1 extent found

The files were dd'ed with zeros and differ in the first 4k:

dd if=/dev/urandom of=somefile bs=4k count=1 conv=notrunc

So there's a bug somewhere, probably in reporting extents through fiemap of the modified nocow file. This hides the actual position of the new block and fragmentation.

david
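The comparison made by eye above (both files report the same physical range) can be done mechanically. The Python sketch below parses extent rows of the shape shown in the filefrag output and checks whether two extents claim overlapping physical blocks. The names Extent, parse_extent and physical_overlap are hypothetical, and the parser only assumes rows laid out like the two in this message.

```python
import re
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    index: int
    logical_start: int
    logical_end: int
    physical_start: int
    physical_end: int
    length: int


# Matches rows like "0: 0.. 511: 54865266.. 54865777: 512: eof".
EXTENT_ROW = re.compile(
    r"^\s*(\d+):\s*(\d+)\.\.\s*(\d+):\s*(\d+)\.\.\s*(\d+):\s*(\d+):"
)


def parse_extent(line: str) -> Optional[Extent]:
    """Parse one extent row, or return None for headers and summaries."""
    match = EXTENT_ROW.match(line)
    if match is None:
        return None
    return Extent(*(int(group) for group in match.groups()))


def physical_overlap(a: Extent, b: Extent) -> bool:
    """True if the two extents report any physical block in common."""
    return a.physical_start <= b.physical_end and b.physical_start <= a.physical_end


# The extent rows reported for somefile and nocow in the message above.
somefile = parse_extent("0: 0.. 511: 54865266.. 54865777: 512: eof")
nocow = parse_extent("0: 0.. 511: 54865266.. 54865777: 512: eof")
print(physical_overlap(somefile, nocow))
```

An overlap here only means the reports agree; as noted above, the fiemap report itself appears to be wrong after the rewrite, so comparing file contents (the first 4k differ) is the more reliable check.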