Richard W.M. Jones
2018-Jan-20 16:58 UTC
[Libguestfs] [PATCH nbdkit] filters: Add copy-on-write filter.
Eric, you'll probably find the design "interesting" ... It does work, for me at least. Rich.
Richard W.M. Jones
2018-Jan-20 16:58 UTC
[Libguestfs] [PATCH nbdkit] filters: Add copy-on-write filter.
---
 configure.ac                      |   5 +
 filters/Makefile.am               |   4 +
 filters/cow/Makefile.am           |  65 +++++++
 filters/cow/cow.c                 | 392 ++++++++++++++++++++++++++++++++++++++
 filters/cow/nbdkit-cow-filter.pod | 162 ++++++++++++++++
 tests/Makefile.am                 |   6 +
 tests/test-cow.sh                 |  98 ++++++++++
 7 files changed, 732 insertions(+)

diff --git a/configure.ac b/configure.ac
index 367b2ba..aa7f406 100644
--- a/configure.ac
+++ b/configure.ac
@@ -483,6 +483,10 @@ AC_SUBST([VDDK_LIBS])
 AC_DEFINE_UNQUOTED([VDDK_LIBDIR],["$VDDK_LIBDIR"],[VDDK 'libDir'.])
 AM_CONDITIONAL([HAVE_VDDK],[test "x$VDDK_LIBS" != "x"])
 
+dnl Check for <linux/fs.h>, optional but needed for COW filter.
+AC_CHECK_HEADER([linux/fs.h], [have_linux_fs_h=yes])
+AM_CONDITIONAL([HAVE_COW_FILTER], [test "x$have_linux_fs_h" = "xyes"])
+
 dnl Produce output files.
 AC_CONFIG_HEADERS([config.h])
 AC_CONFIG_FILES([nbdkit],
@@ -513,6 +517,7 @@ AC_CONFIG_FILES([Makefile
                  plugins/vddk/Makefile
                  plugins/xz/Makefile
                  filters/Makefile
+                 filters/cow/Makefile
                  filters/delay/Makefile
                  filters/offset/Makefile
                  filters/partition/Makefile
diff --git a/filters/Makefile.am b/filters/Makefile.am
index d918b81..15a1995 100644
--- a/filters/Makefile.am
+++ b/filters/Makefile.am
@@ -34,3 +34,7 @@ SUBDIRS = \
 	delay \
 	offset \
 	partition
+
+if HAVE_COW_FILTER
+SUBDIRS += cow
+endif
diff --git a/filters/cow/Makefile.am b/filters/cow/Makefile.am
new file mode 100644
index 0000000..5b8ae5a
--- /dev/null
+++ b/filters/cow/Makefile.am
@@ -0,0 +1,65 @@
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+EXTRA_DIST = nbdkit-cow-filter.pod
+
+if HAVE_COW_FILTER
+
+CLEANFILES = *~
+
+filterdir = $(libdir)/nbdkit/filters
+
+filter_LTLIBRARIES = nbdkit-cow-filter.la
+
+nbdkit_cow_filter_la_SOURCES = \
+	cow.c \
+	$(top_srcdir)/include/nbdkit-filter.h
+
+nbdkit_cow_filter_la_CPPFLAGS = \
+	-I$(top_srcdir)/include
+nbdkit_cow_filter_la_CFLAGS = \
+	$(WARNINGS_CFLAGS)
+nbdkit_cow_filter_la_LDFLAGS = \
+	-module -avoid-version -shared
+
+if HAVE_POD2MAN
+
+man_MANS = nbdkit-cow-filter.1
+CLEANFILES += $(man_MANS)
+
+nbdkit-cow-filter.1: nbdkit-cow-filter.pod
+	$(POD2MAN) $(POD2MAN_ARGS) --section=1 --name=`basename $@ .1` $< $@.t && \
+		if grep 'POD ERROR' $@.t; then rm $@.t; exit 1; fi && \
+		mv $@.t $@
+
+endif
+endif
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
new file mode 100644
index 0000000..6744772
--- /dev/null
+++ b/filters/cow/cow.c
@@ -0,0 +1,392 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/* Notes on the design and implementation of this filter:
+ *
+ * The filter works by creating a large, sparse temporary file, the
+ * same size as the underlying device.  Being sparse, initially this
+ * takes up no space.
+ *
+ * We confine all pread/pwrite operations to the filesystem block
+ * size.  The blk_read and blk_write functions below always happen on
+ * whole filesystem block boundaries.  A smaller-than-block-size
+ * pwrite will turn into a read-modify-write of a whole block.  We
+ * also assume that the plugin returns the same immutable data for
+ * each pread call we make, and optimize on this basis.
+ *
+ * When reading a block we first check the temporary file to see if
+ * that file block is allocated or a hole.  If allocated, we return it
+ * from the temporary file.  If a hole, we issue a pread to the
+ * underlying plugin.
+ *
+ * When writing a block we unconditionally write the data to the
+ * temporary file (allocating a block in that file if it wasn't
+ * before).
+ *
+ * No locking is needed for blk_* calls, but there is a potential
+ * problem of multiple pwrite calls are doing a read-modify-write
+ * cycle because the last write would win, erasing earlier writes.  To
+ * avoid this we limit the thread model to SERIALIZE_ALL_REQUESTS so
+ * that there cannot be concurrent pwrite requests.  We could relax
+ * this restriction with a bit of work.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <string.h>
+#include <inttypes.h>
+#include <alloca.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+
+#include <nbdkit-filter.h>
+
+/* XXX See design comment above. */
+#define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS
+
+/* The temporary overlay. */
+static int fd = -1;
+
+/* The filesystem block size. */
+static int blksize;
+
+/* Because all requests are serialized, we can use a globally
+ * allocated block as a temporary place to store blocks when
+ * reading and writing, instead of stack allocation or awkward
+ * temporary mallocs.
+ */
+static uint8_t *block;
+
+static void
+cow_load (void)
+{
+  const char *tmpdir;
+  size_t len;
+  char *template;
+
+  tmpdir = getenv ("TMPDIR");
+  if (!tmpdir)
+    tmpdir = "/var/tmp";
+
+  nbdkit_debug ("cow: temporary directory for overlay: %s", tmpdir);
+
+  len = strlen (tmpdir) + 8;
+  template = alloca (len);
+  snprintf (template, len, "%s/XXXXXX", tmpdir);
+
+  fd = mkostemp (template, O_CLOEXEC);
+  if (fd == -1) {
+    nbdkit_error ("mkostemp: %s: %m", tmpdir);
+    exit (EXIT_FAILURE);
+  }
+
+  unlink (template);
+
+  if (ioctl (fd, FIGETBSZ, &blksize) == -1) {
+    nbdkit_error ("ioctl: FIGETBSZ: %m");
+    exit (EXIT_FAILURE);
+  }
+  if (blksize <= 0) {
+    nbdkit_error ("filesystem block size is < 0 or cannot be read");
+    exit (EXIT_FAILURE);
+  }
+
+  nbdkit_debug ("cow: filesystem block size: %d", blksize);
+
+  block = malloc (blksize);
+  if (block == NULL) {
+    nbdkit_error ("malloc: %m");
+    exit (EXIT_FAILURE);
+  }
+}
+
+static void
+cow_unload (void)
+{
+  if (fd >= 0)
+    close (fd);
+}
+
+static void *
+cow_open (nbdkit_next_open *next, void *nxdata, int readonly)
+{
+  /* We don't use the handle, so this just provides a non-NULL
+   * pointer that we can return.
+   */
+  static int handle;
+
+  /* Always pass readonly=1 to the underlying plugin. */
+  if (next (nxdata, 1) == -1)
+    return NULL;
+
+  return &handle;
+}
+
+/* Get the file size and ensure the overlay is the correct size. */
+static int64_t
+cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
+              void *handle)
+{
+  int64_t size;
+
+  size = next_ops->get_size (nxdata);
+  if (size == -1)
+    return -1;
+
+  if (ftruncate (fd, size) == -1)
+    return -1;
+
+  nbdkit_debug ("cow: underlying file size: %" PRIi64, size);
+
+  return size;
+}
+
+/* Force an early call to cow_get_size, consequently truncating the
+ * overlay to the correct size.
+ */
+static int
+cow_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
+             void *handle)
+{
+  int64_t r;
+
+  r = cow_get_size (next_ops, nxdata, handle);
+  return r >= 0 ? 0 : -1;
+}
+
+/* Whatever the underlying plugin can or can't do, we can write, we
+ * cannot trim, and we can flush.
+ */
+static int
+cow_can_write (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  return 1;
+}
+
+static int
+cow_can_trim (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  return 0;
+}
+
+static int
+cow_can_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  return 1;
+}
+
+/* These are the block operations.  Note they implicitly read or write
+ * into the global ‘block’ array.
+ */
+static int
+blk_read (struct nbdkit_next_ops *next_ops, void *nxdata, uint64_t blknum)
+{
+  off_t offset = blknum * blksize, roffset;
+  bool hole;
+
+  nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64 ")",
+                blknum, (uint64_t) offset);
+
+  /* Find out if the current block contains data or is a hole. */
+  roffset = lseek (fd, offset, SEEK_DATA);
+  if (roffset == -1) {
+    /* Undocumented?  Anyway if SEEK_DATA returns ENXIO it means
+     * "there are no more data regions past the supplied offset", ie.
+     * we're in a hole.
+     */
+    if (errno == ENXIO)
+      hole = true;
+    else {
+      nbdkit_error ("lseek: SEEK_DATA: %m");
+      return -1;
+    }
+  }
+  else
+    hole = offset != roffset;
+
+  nbdkit_debug ("cow: block %" PRIu64 " is %s",
+                blknum, hole ? "a hole" : "allocated");
+
+  if (hole)                     /* Read underlying plugin. */
+    return next_ops->pread (nxdata, block, blksize, offset);
+  else {                        /* Read overlay. */
+    if (pread (fd, block, blksize, offset) == -1) {
+      nbdkit_error ("pread: %m");
+      return -1;
+    }
+    return 0;
+  }
+}
+
+static int
+blk_write (uint64_t blknum)
+{
+  off_t offset = blknum * blksize;
+
+  nbdkit_debug ("cow: blk_write block %" PRIu64 " (offset %" PRIu64 ")",
+                blknum, (uint64_t) offset);
+
+  if (pwrite (fd, block, blksize, offset) == -1) {
+    nbdkit_error ("pwrite: %m");
+    return -1;
+  }
+  return 0;
+}
+
+/* Read data. */
+static int
+cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
+           void *handle, void *buf, uint32_t count, uint64_t offset)
+{
+  while (count > 0) {
+    uint64_t blknum, blkoffs, n;
+
+    blknum = offset / blksize;  /* block number */
+    blkoffs = offset % blksize; /* offset within the block */
+    n = blksize - blkoffs;      /* max bytes we can read from this block */
+    if (n > count)
+      n = count;
+
+    if (blk_read (next_ops, nxdata, blknum) == -1)
+      return -1;
+
+    memcpy (buf, &block[blkoffs], n);
+
+    buf += n;
+    count -= n;
+    offset += n;
+  }
+
+  return 0;
+}
+
+/* Write data. */
+static int
+cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
+            void *handle, const void *buf, uint32_t count, uint64_t offset)
+{
+  while (count > 0) {
+    uint64_t blknum, blkoffs, n;
+
+    blknum = offset / blksize;  /* block number */
+    blkoffs = offset % blksize; /* offset within the block */
+    n = blksize - blkoffs;      /* max bytes we can read from this block */
+    if (n > count)
+      n = count;
+
+    /* Do a read-modify-write operation on the current block. */
+    if (blk_read (next_ops, nxdata, blknum) == -1)
+      return -1;
+    memcpy (&block[blkoffs], buf, n);
+    if (blk_write (blknum) == -1)
+      return -1;
+
+    buf += n;
+    count -= n;
+    offset += n;
+  }
+
+  return 0;
+}
+
+/* Zero data. */
+static int
+cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
+          void *handle, uint32_t count, uint64_t offset, int may_trim)
+{
+  while (count > 0) {
+    uint64_t blknum, blkoffs, n;
+
+    blknum = offset / blksize;  /* block number */
+    blkoffs = offset % blksize; /* offset within the block */
+    n = blksize - blkoffs;      /* max bytes we can read from this block */
+    if (n > count)
+      n = count;
+
+    /* XXX There is the possibility of optimizing this: ONLY if we are
+     * writing a whole, aligned block, then use FALLOC_FL_ZERO_RANGE.
+     */
+    if (blk_read (next_ops, nxdata, blknum) == -1)
+      return -1;
+    memset (&block[blkoffs], 0, n);
+    if (blk_write (blknum) == -1)
+      return -1;
+
+    count -= n;
+    offset += n;
+  }
+
+  return 0;
+}
+
+static int
+cow_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  /* I think we don't care about file metadata for this temporary
+   * file, so only flush the data.
+   */
+  if (fdatasync (fd) == -1) {
+    nbdkit_error ("fdatasync: %m");
+    return -1;
+  }
+
+  return 0;
+}
+
+static struct nbdkit_filter filter = {
+  .name              = "cow",
+  .longname          = "nbdkit copy-on-write (COW) filter",
+  .version           = PACKAGE_VERSION,
+  .load              = cow_load,
+  .unload            = cow_unload,
+  .open              = cow_open,
+  .prepare           = cow_prepare,
+  .get_size          = cow_get_size,
+  .can_write         = cow_can_write,
+  .can_flush         = cow_can_flush,
+  .can_trim          = cow_can_trim,
+  .pread             = cow_pread,
+  .pwrite            = cow_pwrite,
+  .zero              = cow_zero,
+  .flush             = cow_flush,
+};
+
+NBDKIT_REGISTER_FILTER(filter)
diff --git a/filters/cow/nbdkit-cow-filter.pod b/filters/cow/nbdkit-cow-filter.pod
new file mode 100644
index 0000000..accf81c
--- /dev/null
+++ b/filters/cow/nbdkit-cow-filter.pod
@@ -0,0 +1,162 @@
+=encoding utf8
+
+=head1 NAME
+
+nbdkit-cow-filter - nbdkit copy-on-write (COW) filter
+
+=head1 SYNOPSIS
+
+ nbdkit --filter=cow plugin [plugin-args...]
+
+=head1 DESCRIPTION
+
+C<nbdkit-cow-filter> is a filter that makes a temporary writable copy
+on top of a read-only plugin.  It can be used to enable writes for
+plugins which only implement read-only access.  Note that:
+
+=over 4
+
+=item *
+
+B<Anything written is thrown away as soon as nbdkit exits.>
+
+=item *
+
+All connections to the nbdkit instance see the same view of the disk.
+
+This is different from L<nbd-server(1)> where each connection sees its
+own copy-on-write overlay and simply disconnecting the client throws
+that away.  It also allows us to create diffs, see below.
+
+=item *
+
+The plugin is opened read-only (as if the I<-r> flag was passed), but
+you should B<not> pass the I<-r> flag to nbdkit.
+
+=back
+
+Limitations of the filter include:
+
+=over 4
+
+=item *
+
+The underlying file/device must not be resized.
+
+=item *
+
+The underlying plugin must behave “normally”, meaning that it must
+serve the same data to each client.
+
+=back
+
+=head1 PARAMETERS
+
+There are no parameters specific to nbdkit-cow-filter.  Any parameters
+are passed through to and processed by the underlying plugin in the
+normal way.
+
+=head1 EXAMPLES
+
+Serve the file F<disk.img>, allowing writes, but do not save any
+changes into the file:
+
+ nbdkit --filter=cow file file=disk.img
+
+L<nbdkit-xz-plugin(1)> only supports read access, but you can provide
+temporary write access by doing (although this does B<not> save
+changes to the file):
+
+ nbdkit --filter=cow xz file=disk.xz
+
+=head1 CREATING A DIFF WITH QEMU-IMG
+
+Although nbdkit-cow-filter itself cannot save the differences, it is
+possible to do this using an obscure feature of L<qemu-img(1)>.
+B<nbdkit must remain continuously running during the whole operation,
+otherwise all changes will be lost>.
+
+Run nbdkit:
+
+ nbdkit --filter=cow file file=disk.img
+
+and then connect with a client and make whatever changes you need.
+At the end, disconnect the client.
+
+Run these C<qemu-img> commands to construct a qcow2 file containing
+the differences:
+
+ qemu-img create -f qcow2 -b nbd:localhost diff.qcow2
+ qemu-img rebase -b disk.img diff.qcow2
+
+F<diff.qcow2> now contains the differences between the base
+(F<disk.img>) and the changes stored in nbdkit-cow-filter.  C<nbdkit>
+can now be killed.
+
+=head1 ENVIRONMENT VARIABLES
+
+=over 4
+
+=item C<TMPDIR>
+
+The copy-on-write changes are stored in a temporary file located in
+C</var/tmp> by default.  You can override this location by setting the
+C<TMPDIR> environment variable before starting nbdkit.
+
+=back
+
+=head1 SEE ALSO
+
+L<nbdkit(1)>,
+L<nbdkit-file-plugin(1)>,
+L<nbdkit-xz-plugin(1)>,
+L<nbdkit-filter(3)>,
+L<qemu-img(1)>.
+
+=head1 AUTHORS
+
+Richard W.M. Jones
+
+=head1 COPYRIGHT
+
+Copyright (C) 2018 Red Hat Inc.
+
+=head1 LICENSE
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+=over 4
+
+=item *
+
+Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+
+=item *
+
+Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+=item *
+
+Neither the name of Red Hat nor the names of its contributors may be
+used to endorse or promote products derived from this software without
+specific prior written permission.
+
+=back
+
+THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGE.
diff --git a/tests/Makefile.am b/tests/Makefile.am
index a3029a7..ae22801 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -41,6 +41,7 @@ EXTRA_DIST = \
 	shebang.py \
 	shebang.rb \
 	test-captive.sh \
+	test-cow.sh \
 	test-cxx.sh \
 	test-dump-config.sh \
 	test-dump-plugin.sh \
@@ -411,6 +412,11 @@ endif HAVE_RUBY
 #----------------------------------------------------------------------
 # Tests of filters.
 
+# cow filter test.
+if HAVE_COW_FILTER
+TESTS += test-cow.sh
+endif HAVE_COW_FILTER
+
 # delay filter test.
 check_PROGRAMS += test-delay
 TESTS += test-delay
diff --git a/tests/test-cow.sh b/tests/test-cow.sh
new file mode 100755
index 0000000..ea4e5c0
--- /dev/null
+++ b/tests/test-cow.sh
@@ -0,0 +1,98 @@
+#!/bin/bash -
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+set -e
+
+files="cow-base.img cow-diff.qcow2 cow.sock cow.pid"
+rm -f $files
+
+# Create a base image which is partitioned with an empty filesystem.
+guestfish -N cow-base.img=fs exit
+lastmod="$(stat -c "%y" cow-base.img)"
+
+# Run nbdkit with a COW overlay.
+nbdkit -P cow.pid -U cow.sock --filter cow file file=cow-base.img
+
+# We may have to wait a short time for the pid file to appear.
+for i in `seq 1 10`; do
+    if test -f cow.pid; then
+        break
+    fi
+    sleep 1
+done
+if ! test -f cow.pid; then
+    echo "$0: PID file was not created"
+    exit 1
+fi
+
+pid="$(cat cow.pid)"
+
+# Kill the nbdkit process on exit.
+cleanup ()
+{
+    status=$?
+
+    kill $pid
+    rm -f $files
+
+    exit $status
+}
+trap cleanup INT QUIT TERM EXIT ERR
+
+# Write some data into the overlay.
+guestfish --format=raw -a 'nbd://?socket=cow.sock' -m /dev/sda1 <<EOF
+  fill-dir / 10000
+  fill-pattern "abcde" 5M /large
+  write /hello "hello, world"
+EOF
+
+# The original file must not be modified.
+currmod="$(stat -c "%y" cow-base.img)"
+
+if [ "$lastmod" != "$currmod" ]; then
+    echo "$0: FAILED last modified time of base file changed"
+    exit 1
+fi
+
+# If we have qemu-img, try the hairy rebase operation documented
+# in the nbdkit-cow-filter manual.
+if qemu-img --version >/dev/null 2>&1; then
+    qemu-img create -f qcow2 -b nbd:unix:cow.sock cow-diff.qcow2
+    time qemu-img rebase -b cow-base.img cow-diff.qcow2
+    qemu-img info cow-diff.qcow2
+
+    # This checks the file we created exists.
+    guestfish --ro -a cow-diff.qcow2 -m /dev/sda1 cat /hello
+fi
+
+# The cleanup() function is called implicitly on exit.
-- 
2.15.1
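
The hole test that blk_read performs in the patch above can be tried outside nbdkit. What follows is only a minimal sketch, not code from the patch: the names make_sparse_overlay and block_is_hole are hypothetical, the /tmp path is an assumption (the patch uses TMPDIR or /var/tmp), and the SEEK_DATA fallback value is Linux-specific. It creates an unlinked sparse file and asks lseek(2) whether a given offset falls in a hole, treating ENXIO as "hole" the same way the patch does.

```c
#define _GNU_SOURCE             /* for SEEK_DATA on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: on Linux SEEK_DATA is 3; only used if the headers
 * did not already define it. */
#if defined(__linux__) && !defined(SEEK_DATA)
#define SEEK_DATA 3
#endif

/* Hypothetical helper: create an unlinked, sparse temporary file of
 * the given size.  Returns a file descriptor, or -1 on error. */
int
make_sparse_overlay (off_t size)
{
  char template[] = "/tmp/cow-sketch-XXXXXX";   /* assumed location */
  int fd;

  assert (size >= 0);

  fd = mkstemp (template);
  if (fd == -1)
    return -1;
  unlink (template);            /* file disappears when fd is closed */

  if (ftruncate (fd, size) == -1) {
    close (fd);
    return -1;
  }
  return fd;
}

/* Hypothetical helper: returns 1 if 'offset' lies in a hole, 0 if it
 * lies in data, -1 on error.  ENXIO from SEEK_DATA means there is no
 * data at or after 'offset', which is also a hole. */
int
block_is_hole (int fd, off_t offset)
{
  off_t r = lseek (fd, offset, SEEK_DATA);

  if (r == -1)
    return errno == ENXIO ? 1 : -1;
  return r != offset;
}
```

Note that a filesystem without real SEEK_DATA support may report every offset as data, so a freshly truncated file is not guaranteed to look like a hole everywhere. A block that has actually been written should always be reported as data.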
Richard W.M. Jones
2018-Jan-20 18:24 UTC
Re: [Libguestfs] [PATCH nbdkit] filters: Add copy-on-write filter.
On Sat, Jan 20, 2018 at 04:58:43PM +0000, Richard W.M. Jones wrote:
> Eric, you'll probably find the design "interesting" ...
> It does work, for me at least.

I should add that if the use of SEEK_DATA etc proves too adventurous,
it wouldn't be much trouble to modify this to use a simple block
bitmap.  With a 4K block size the bitmap is 1/32768th of the size of
the disk, so even a 1 TB disk only needs a 32 MB bitmap which can
comfortably be kept in memory.

But it seemed a shame to do that when the operating system is
essentially maintaining equivalent state.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
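
The bitmap size quoted in the message above follows from one bit per block: each bitmap byte covers 8 blocks, so with 4096-byte blocks one byte covers 32768 bytes of disk. A small sketch of that arithmetic, assuming a hypothetical helper named bitmap_bytes (it is not part of either patch) and rounding up so that a partial final group of blocks still gets a byte:

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: number of bytes needed for a one-bit-per-block
 * bitmap covering 'disk_size' bytes with blocks of 'blksize' bytes. */
uint64_t
bitmap_bytes (uint64_t disk_size, uint64_t blksize)
{
  assert (blksize > 0);
  return DIV_ROUND_UP (disk_size, blksize * 8);
}
```

For a 1 TB (2^40 byte) disk and 4K blocks this gives 2^40 / 2^15 = 2^25 bytes, that is 32 MB, matching the figure in the message.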
Richard W.M. Jones
2018-Jan-21 22:08 UTC
Re: [Libguestfs] [PATCH nbdkit] filters: Add copy-on-write filter.
Here's the patch (on top of the preceding one) which uses a bitmap
instead of SEEK_DATA.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

[Attachment: 0001-filters-cow-Modify-cow-filter-to-use-a-bitmap.patch]

From e9f7ff5ea68f2a0391a3319cef1bf9e3f5581942 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjones@redhat.com>
Date: Sun, 21 Jan 2018 21:52:26 +0000
Subject: [PATCH] filters: cow: Modify cow filter to use a bitmap.

---
 configure.ac            |   4 --
 filters/Makefile.am     |   5 +-
 filters/cow/Makefile.am |   3 -
 filters/cow/cow.c       | 179 +++++++++++++++++++++++++++++-------------------
 tests/Makefile.am       |   2 -
 5 files changed, 111 insertions(+), 82 deletions(-)

diff --git a/configure.ac b/configure.ac
index aa7f406..1091d27 100644
--- a/configure.ac
+++ b/configure.ac
@@ -483,10 +483,6 @@ AC_SUBST([VDDK_LIBS])
 AC_DEFINE_UNQUOTED([VDDK_LIBDIR],["$VDDK_LIBDIR"],[VDDK 'libDir'.])
 AM_CONDITIONAL([HAVE_VDDK],[test "x$VDDK_LIBS" != "x"])
 
-dnl Check for <linux/fs.h>, optional but needed for COW filter.
-AC_CHECK_HEADER([linux/fs.h], [have_linux_fs_h=yes])
-AM_CONDITIONAL([HAVE_COW_FILTER], [test "x$have_linux_fs_h" = "xyes"])
-
 dnl Produce output files.
 AC_CONFIG_HEADERS([config.h])
 AC_CONFIG_FILES([nbdkit],
diff --git a/filters/Makefile.am b/filters/Makefile.am
index 15a1995..7e6fe5a 100644
--- a/filters/Makefile.am
+++ b/filters/Makefile.am
@@ -31,10 +31,7 @@
 # SUCH DAMAGE.
 
 SUBDIRS = \
+	cow \
 	delay \
 	offset \
 	partition
-
-if HAVE_COW_FILTER
-SUBDIRS += cow
-endif
diff --git a/filters/cow/Makefile.am b/filters/cow/Makefile.am
index 5b8ae5a..31526d8 100644
--- a/filters/cow/Makefile.am
+++ b/filters/cow/Makefile.am
@@ -32,8 +32,6 @@
 
 EXTRA_DIST = nbdkit-cow-filter.pod
 
-if HAVE_COW_FILTER
-
 CLEANFILES = *~
 
 filterdir = $(libdir)/nbdkit/filters
@@ -62,4 +60,3 @@ nbdkit-cow-filter.1: nbdkit-cow-filter.pod
 		mv $@.t $@
 
 endif
-endif
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 287c94e..2b023af 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -38,20 +38,22 @@
  * takes up no space.
  *
  * We confine all pread/pwrite operations to the filesystem block
- * size.  The blk_read and blk_write functions below always happen on
- * whole filesystem block boundaries.  A smaller-than-block-size
- * pwrite will turn into a read-modify-write of a whole block.  We
- * also assume that the plugin returns the same immutable data for
- * each pread call we make, and optimize on this basis.
+ * size.  The blk_* functions below only work on whole filesystem
+ * block boundaries.  A smaller-than-block-size pwrite will turn into
+ * a read-modify-write of a whole block.  We also assume that the
+ * plugin returns the same immutable data for each pread call we make,
+ * and optimize on this basis.
  *
- * When reading a block we first check the temporary file to see if
- * that file block is allocated or a hole.  If allocated, we return it
- * from the temporary file.  If a hole, we issue a pread to the
- * underlying plugin.
+ * A block bitmap is maintained in memory recording if each block in
+ * the temporary file is "allocated" (1) or "hole" (0).
+ *
+ * When reading a block we first check the bitmap to see if that file
+ * block is allocated or a hole.  If allocated, we return it from the
+ * temporary file.  If a hole, we issue a pread to the underlying
+ * plugin.
  *
  * When writing a block we unconditionally write the data to the
- * temporary file (allocating a block in that file if it wasn't
- * before).
+ * temporary file, setting the bit in the bitmap.
  *
  * No locking is needed for blk_* calls, but there is a potential
  * problem of multiple pwrite calls doing a read-modify-write cycle
@@ -75,18 +77,27 @@
 #include <errno.h>
 #include <sys/types.h>
 #include <sys/ioctl.h>
-#include <linux/fs.h>
 
 #include <nbdkit-filter.h>
 
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+
 /* XXX See design comment above. */
 #define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS
 
+/* Size of a block in the overlay.  A 4K block size means that we need
+ * 32 MB of memory to store the bitmap for a 1 TB underlying image.
+ */
+#define BLKSIZE 4096
+
 /* The temporary overlay. */
 static int fd = -1;
 
-/* The filesystem block size. */
-static int blksize;
+/* Bitmap.  Bit 1 = allocated, 0 = hole. */
+static uint8_t *bitmap;
+
+/* Size of the bitmap in bytes. */
+static uint64_t bm_size;
 
 static void
 cow_load (void)
@@ -112,17 +123,6 @@ cow_load (void)
   }
 
   unlink (template);
-
-  if (ioctl (fd, FIGETBSZ, &blksize) == -1) {
-    nbdkit_error ("ioctl: FIGETBSZ: %m");
-    exit (EXIT_FAILURE);
-  }
-  if (blksize <= 0) {
-    nbdkit_error ("filesystem block size is < 0 or cannot be read");
-    exit (EXIT_FAILURE);
-  }
-
-  nbdkit_debug ("cow: filesystem block size: %d", blksize);
 }
 
 static void
@@ -147,6 +147,34 @@ cow_open (nbdkit_next_open *next, void *nxdata, int readonly)
   return &handle;
 }
 
+/* Allocate or resize the overlay file and bitmap. */
+static int
+blk_set_size (uint64_t new_size)
+{
+  uint8_t *new_bm;
+  const size_t old_bm_size = bm_size;
+  size_t new_bm_size = DIV_ROUND_UP (new_size, BLKSIZE*8);
+
+  new_bm = realloc (bitmap, new_bm_size);
+  if (new_bm == NULL) {
+    nbdkit_error ("realloc: %m");
+    return -1;
+  }
+  bitmap = new_bm;
+  bm_size = new_bm_size;
+  if (old_bm_size < new_bm_size)
+    memset (&bitmap[old_bm_size], 0, new_bm_size-old_bm_size);
+
+  nbdkit_debug ("cow: bitmap resized to %" PRIu64 " bytes", new_bm_size);
+
+  if (ftruncate (fd, new_size) == -1) {
+    nbdkit_error ("ftruncate: %m");
+    return -1;
+  }
+
+  return 0;
+}
+
 /* Get the file size and ensure the overlay is the correct size. */
 static int64_t
 cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -158,11 +186,11 @@ cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
   if (size == -1)
     return -1;
 
-  if (ftruncate (fd, size) == -1)
-    return -1;
-
   nbdkit_debug ("cow: underlying file size: %" PRIi64, size);
 
+  if (blk_set_size (size))
+    return -1;
+
   return size;
 }
 
@@ -200,6 +228,36 @@ cow_can_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
   return 1;
 }
 
+/* Return true if the block is allocated.  Consults the bitmap. */
+static bool
+blk_is_allocated (uint64_t blknum)
+{
+  uint64_t bm_offset = blknum / 8;
+  uint64_t bm_bit = blknum % 8;
+
+  if (bm_offset >= bm_size) {
+    nbdkit_debug ("blk_is_allocated: block number is out of range");
+    return false;
+  }
+
+  return bitmap[bm_offset] & (1 << bm_bit);
+}
+
+/* Mark a block as allocated. */
+static void
+blk_set_allocated (uint64_t blknum)
+{
+  uint64_t bm_offset = blknum / 8;
+  uint64_t bm_bit = blknum % 8;
+
+  if (bm_offset >= bm_size) {
+    nbdkit_debug ("blk_set_allocated: block number is out of range");
+    return;
+  }
+
+  bitmap[bm_offset] |= 1 << bm_bit;
+}
+
 /* These are the block operations.  They always read or write a single
  * whole block of size blksize.
  */
@@ -207,36 +265,17 @@ static int
 blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
           uint64_t blknum, uint8_t *block)
 {
-  off_t offset = blknum * blksize, roffset;
-  bool hole;
-
-  nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64 ")",
-                blknum, (uint64_t) offset);
+  off_t offset = blknum * BLKSIZE;
+  bool allocated = blk_is_allocated (blknum);
 
-  /* Find out if the current block contains data or is a hole. */
-  roffset = lseek (fd, offset, SEEK_DATA);
-  if (roffset == -1) {
-    /* Undocumented?  Anyway if SEEK_DATA returns ENXIO it means
-     * "there are no more data regions past the supplied offset", ie.
-     * we're in a hole.
-     */
-    if (errno == ENXIO)
-      hole = true;
-    else {
-      nbdkit_error ("lseek: SEEK_DATA: %m");
-      return -1;
-    }
-  }
-  else
-    hole = offset != roffset;
-
-  nbdkit_debug ("cow: block %" PRIu64 " is %s",
-                blknum, hole ? "a hole" : "allocated");
+  nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64 ") is %s",
+                blknum, (uint64_t) offset,
+                !allocated ? "a hole" : "allocated");
 
-  if (hole)                     /* Read underlying plugin. */
-    return next_ops->pread (nxdata, block, blksize, offset);
+  if (!allocated)               /* Read underlying plugin. */
+    return next_ops->pread (nxdata, block, BLKSIZE, offset);
   else {                        /* Read overlay. */
-    if (pread (fd, block, blksize, offset) == -1) {
+    if (pread (fd, block, BLKSIZE, offset) == -1) {
       nbdkit_error ("pread: %m");
       return -1;
     }
@@ -247,15 +286,17 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
 static int
 blk_write (uint64_t blknum, const uint8_t *block)
 {
-  off_t offset = blknum * blksize;
+  off_t offset = blknum * BLKSIZE;
 
   nbdkit_debug ("cow: blk_write block %" PRIu64 " (offset %" PRIu64 ")",
                 blknum, (uint64_t) offset);
 
-  if (pwrite (fd, block, blksize, offset) == -1) {
+  if (pwrite (fd, block, BLKSIZE, offset) == -1) {
     nbdkit_error ("pwrite: %m");
     return -1;
   }
+  blk_set_allocated (blknum);
+
   return 0;
 }
 
@@ -266,7 +307,7 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
 {
   uint8_t *block;
 
-  block = malloc (blksize);
+  block = malloc (BLKSIZE);
   if (block == NULL) {
     nbdkit_error ("malloc: %m");
     return -1;
@@ -275,9 +316,9 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
   while (count > 0) {
     uint64_t blknum, blkoffs, n;
 
-    blknum = offset / blksize;  /* block number */
-    blkoffs = offset % blksize; /* offset within the block */
-    n = blksize - blkoffs;      /* max bytes we can read from this block */
+    blknum = offset / BLKSIZE;  /* block number */
+    blkoffs = offset % BLKSIZE; /* offset within the block */
+    n = BLKSIZE - blkoffs;      /* max bytes we can read from this block */
     if (n > count)
       n = count;
 
@@ -304,7 +345,7 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
 {
   uint8_t *block;
 
-  block = malloc (blksize);
+  block = malloc (BLKSIZE);
   if (block == NULL) {
     nbdkit_error ("malloc: %m");
     return -1;
@@ -313,9 +354,9 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
   while (count > 0) {
     uint64_t blknum, blkoffs, n;
 
-    blknum = offset / blksize;  /* block number */
-    blkoffs = offset % blksize; /* offset within the block */
-    n = blksize - blkoffs;      /* max bytes we can read from this block */
+    blknum = offset / BLKSIZE;  /* block number */
+    blkoffs = offset % BLKSIZE; /* offset within the block */
+    n
= BLKSIZE - blkoffs; /* max bytes we can read from this block */ if (n > count) n = count; @@ -346,7 +387,7 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata, { uint8_t *block; - block = malloc (blksize); + block = malloc (BLKSIZE); if (block == NULL) { nbdkit_error ("malloc: %m"); return -1; @@ -355,9 +396,9 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata, while (count > 0) { uint64_t blknum, blkoffs, n; - blknum = offset / blksize; /* block number */ - blkoffs = offset % blksize; /* offset within the block */ - n = blksize - blkoffs; /* max bytes we can read from this block */ + blknum = offset / BLKSIZE; /* block number */ + blkoffs = offset % BLKSIZE; /* offset within the block */ + n = BLKSIZE - blkoffs; /* max bytes we can read from this block */ if (n > count) n = count; diff --git a/tests/Makefile.am b/tests/Makefile.am index ae22801..b073f22 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -413,9 +413,7 @@ endif HAVE_RUBY # Tests of filters. # cow filter test. -if HAVE_COW_FILTER TESTS += test-cow.sh -endif HAVE_COW_FILTER # delay filter test. check_PROGRAMS += test-delay -- 2.15.1 --4VrXvz3cwkc87Wze--
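The bitmap bookkeeping in the patch above (blk_set_size, blk_is_allocated, blk_set_allocated) can be restated outside nbdkit as a minimal sketch. Everything named sketch_* or SKETCH_* below is hypothetical and not part of the patch, and the block size of 4096 is an assumption, since the definition of BLKSIZE is not shown in this part of the thread. The sketch keeps one bit per block and zeroes any bytes added by a resize, which mirrors what the patch appears to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed block size; the real BLKSIZE is defined elsewhere in the patch. */
#define SKETCH_BLKSIZE 4096

/* Hypothetical container; the patch uses file-scope globals instead. */
struct sketch_bitmap {
  uint8_t *bits;   /* one bit per block, 1 = block is in the overlay */
  size_t size;     /* size of bits[] in bytes */
};

/* Resize the bitmap to cover a disk of new_size bytes.
 * Newly added bytes are cleared.  Returns 0 on success, -1 on failure.
 */
static int
sketch_resize (struct sketch_bitmap *bm, uint64_t new_size)
{
  const uint64_t span = (uint64_t) SKETCH_BLKSIZE * 8; /* disk bytes per bitmap byte */
  size_t new_bm_size = (size_t) (new_size / span + (new_size % span != 0));
  uint8_t *p;

  assert (bm != NULL);

  if (new_bm_size == 0) {       /* avoid realloc (ptr, 0), which is not portable */
    free (bm->bits);
    bm->bits = NULL;
    bm->size = 0;
    return 0;
  }

  p = realloc (bm->bits, new_bm_size);
  if (p == NULL)
    return -1;                  /* the old bitmap is still valid here */
  if (new_bm_size > bm->size)
    memset (&p[bm->size], 0, new_bm_size - bm->size);
  bm->bits = p;
  bm->size = new_bm_size;
  return 0;
}

/* Return true if the block has been written to the overlay. */
static bool
sketch_is_allocated (const struct sketch_bitmap *bm, uint64_t blknum)
{
  uint64_t byte = blknum / 8;
  unsigned bit = (unsigned) (blknum % 8);

  if (byte >= bm->size)
    return false;               /* out of range is treated as a hole */
  return (bm->bits[byte] >> bit) & 1;
}

/* Mark a block as written.  Out-of-range block numbers are ignored. */
static void
sketch_set_allocated (struct sketch_bitmap *bm, uint64_t blknum)
{
  uint64_t byte = blknum / 8;
  unsigned bit = (unsigned) (blknum % 8);

  if (byte >= bm->size)
    return;
  bm->bits[byte] |= (uint8_t) (1u << bit);
}
```

With these assumptions a 10-block disk needs 2 bitmap bytes, and a read path would consult sketch_is_allocated before choosing between the overlay and the underlying plugin.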
Eric Blake
2018-Jan-22 15:50 UTC
Re: [Libguestfs] [PATCH nbdkit] filters: Add copy-on-write filter.
On 01/20/2018 10:58 AM, Richard W.M. Jones wrote:
> ---
>  configure.ac                      |   5 +
>  filters/Makefile.am               |   4 +
>  filters/cow/Makefile.am           |  65 +++++++
>  filters/cow/cow.c                 | 392 ++++++++++++++++++++++++++++++++++++++
>  filters/cow/nbdkit-cow-filter.pod | 162 ++++++++++++++++
>  tests/Makefile.am                 |   6 +
>  tests/test-cow.sh                 |  98 ++++++++++
>  7 files changed, 732 insertions(+)
>

> +
> +/* Notes on the design and implementation of this filter:
> + *
> + * The filter works by creating a large, sparse temporary file, the
> + * same size as the underlying device.  Being sparse, initially this
> + * takes up no space.
> + *
> + * We confine all pread/pwrite operations to the filesystem block
> + * size.  The blk_read and blk_write functions below always happen on
> + * whole filesystem block boundaries.  A smaller-than-block-size
> + * pwrite will turn into a read-modify-write of a whole block.  We
> + * also assume that the plugin returns the same immutable data for
> + * each pread call we make, and optimize on this basis.
> + *
> + * When reading a block we first check the temporary file to see if
> + * that file block is allocated or a hole.  If allocated, we return it
> + * from the temporary file.  If a hole, we issue a pread to the
> + * underlying plugin.

This is great on supported file systems. However, you are defaulting
to /tmp, and even recent Linux has had tmpfs that did not support
SEEK_HOLE (using the default of treating the entire file as data, even
when it is sparse, which means your reads fail to pick up the
underlying plugin) or which supported it only with something like
O(n^2) instead of O(1) performance according to the offset of the next
hole (which makes performance painful the larger the file is).

You may want to list caveats that this filter requires decent
filesystem support for SEEK_HOLE before this makes sense; or, I see
your followup patch that uses a bitmap to avoid lseek issues and thus
the problems on straggler filesystems.

> + *
> + * When writing a block we unconditionally write the data to the
> + * temporary file (allocating a block in that file if it wasn't
> + * before).
> + *
> + * No locking is needed for blk_* calls, but there is a potential
> + * problem of multiple pwrite calls are doing a read-modify-write
> + * cycle because the last write would win, erasing earlier writes.  To
> + * avoid this we limit the thread model to SERIALIZE_ALL_REQUESTS so
> + * that there cannot be concurrent pwrite requests.  We could relax
> + * this restriction with a bit of work.

Yep, this is a pretty cool idea, when it works!

> +static void
> +cow_load (void)
> +{
> +  const char *tmpdir;
> +  size_t len;
> +  char *template;
> +
> +  tmpdir = getenv ("TMPDIR");
> +  if (!tmpdir)
> +    tmpdir = "/var/tmp";
> +
> +  nbdkit_debug ("cow: temporary directory for overlay: %s", tmpdir);
> +
> +  len = strlen (tmpdir) + 8;
> +  template = alloca (len);
> +  snprintf (template, len, "%s/XXXXXX", tmpdir);
> +
> +  fd = mkostemp (template, O_CLOEXEC);
> +  if (fd == -1) {
> +    nbdkit_error ("mkostemp: %s: %m", tmpdir);
> +    exit (EXIT_FAILURE);
> +  }
> +
> +  unlink (template);
> +
> +  if (ioctl (fd, FIGETBSZ, &blksize) == -1) {
> +    nbdkit_error ("ioctl: FIGETBSZ: %m");
> +    exit (EXIT_FAILURE);
> +  }
> +  if (blksize <= 0) {
> +    nbdkit_error ("filesystem block size is < 0 or cannot be read");
> +    exit (EXIT_FAILURE);
> +  }

The file is still size zero here...

> +/* Force an early call to cow_get_size, consequently truncating the
> + * overlay to the correct size.
> + */
> +static int
> +cow_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
> +             void *handle)
> +{
> +  int64_t r;
> +
> +  r = cow_get_size (next_ops, nxdata, handle);
> +  return r >= 0 ? 0 : -1;

...so here, now that you've resized it, it may be worth an
lseek(SEEK_HOLE) (must be 0) and SEEK_DATA (must fail with ENXIO
because there is no data yet) to make sure the tmp file has
appropriate seek support, so you can at least kill the nbdkit process
up front rather than suffer from a file system that reports the entire
sparse file as DATA.

> +=head1 CREATING A DIFF WITH QEMU-IMG
> +
> +Although nbdkit-cow-filter itself cannot save the differences, it is
> +possible to do this using an obscure feature of L<qemu-img(1)>.
> +B<nbdkit must remain continuously running during the whole operation,
> +otherwise all changes will be lost>.
> +
> +Run nbdkit:
> +
> + nbdkit --filter=cow file file=disk.img
> +
> +and then connect with a client and make whatever changes you need.
> +At the end, disconnect the client.
> +
> +Run these C<qemu-img> commands to construct a qcow2 file containing
> +the differences:
> +
> + qemu-img create -f qcow2 -b nbd:localhost diff.qcow2
> + qemu-img rebase -b disk.img diff.qcow2
> +
> +F<diff.qcow2> now contains the differences between the base
> +(F<disk.img>) and the changes stored in nbdkit-cow-filter.  C<nbdkit>
> +can now be killed.

Cute!

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
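The up-front check suggested in the reply above (SEEK_HOLE at offset 0 must return 0, and SEEK_DATA must fail with ENXIO on a freshly truncated, still-empty overlay) could look roughly like the sketch below. The function name probe_sparse_seek is hypothetical and does not appear in the patch. The sketch assumes Linux, a file descriptor that has already been truncated to a nonzero size, and that nothing has been written to it yet.

```c
#define _GNU_SOURCE             /* for SEEK_HOLE and SEEK_DATA on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback using the Linux values, in case the feature macro above
 * was defined too late to take effect.  This is a Linux-only assumption.
 */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Hypothetical helper, not part of the patch.
 *
 * fd must refer to a file of nonzero size that contains no data yet.
 * Returns 1 if the filesystem reports the file as one big hole,
 *         0 if it reports data where there should be none
 *           (for example the fallback "whole file is data" behaviour),
 *        -1 if lseek failed for some other reason (errno is set).
 */
static int
probe_sparse_seek (int fd)
{
  off_t r;

  assert (fd >= 0);

  /* The first hole at or after offset 0 must be offset 0 itself. */
  r = lseek (fd, 0, SEEK_HOLE);
  if (r == -1)
    return -1;
  if (r != 0)
    return 0;

  /* There must be no data at all, which lseek reports as ENXIO. */
  r = lseek (fd, 0, SEEK_DATA);
  if (r == -1)
    return errno == ENXIO ? 1 : -1;
  return 0;
}
```

In a filter, a result of 0 from a check like this in cow_prepare could be turned into an nbdkit_error call and a -1 return, so that nbdkit fails at startup instead of silently serving the empty overlay. That wiring is left out here because it depends on the rest of the patch.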