Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 0/9] cache: Implement cache-max-size and method of reclaiming space from the cache.
This patch series enhances the cache filter in a few ways, primarily adding a "cache-on-read" feature (similar to qemu's copyonread); and adding the ability to limit the cache size and the antecedent of that which is having a method to reclaim cache blocks. As the cache is stored as a sparse temporary file, reclaiming cache blocks simply means punching holes in the temporary file. The tricky bit is working out where to punch the holes to avoid reclaiming recently/frequently used blocks. I believe the status of the patch series is: [1/9] cache: Rename blk_writeback function. [2/9] cache: Add cache-on-read mode. [3/9] common/bitmap: Include <nbdkit-plugin.h>. [4/9] cache: Split out the block/bitmap code into - All of the above patches are pretty uncontroversial and ready for review and upstreaming. [5/9] cache: Allow this filter to serve requests in - Probably needs more testing, but should be OK too. [6/9] common/bitmap: Add bitmap_next function and tests. [7/9] common/bitmap: Add bitmap_clear function. [8/9] cache: Implement LRU structure. [9/9] cache: Implement cache-max-size and method of - Needs much better testing, and maybe even review of the design. Rich.
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 1/9] cache: Rename blk_writeback function.
The blk_writeback function was misnamed, since depending on the cache
mode and flags it could either do a writethrough or a writeback [write
to the cache only].
Since it's really the general "write a block" function, rename it
simply as blk_write.  Note there exists a blk_read function already.
There are several related changes in this commit, but it's entirely
refactoring and makes no functional difference.
---
 filters/cache/cache.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index e9a5c38..1dc17fd 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -262,8 +262,7 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void
*nxdata,
 {
   off_t offset = blknum * BLKSIZE;
 
-  nbdkit_debug ("cache: blk_writethrough block %" PRIu64
-                " (offset %" PRIu64 ")",
+  nbdkit_debug ("cache: writethrough block %" PRIu64 " (offset
%" PRIu64 ")",
                 blknum, (uint64_t) offset);
 
   if (pwrite (fd, block, BLKSIZE, offset) == -1) {
@@ -280,11 +279,18 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void
*nxdata,
   return 0;
 }
 
-/* Write to the cache only. */
+/* Write a whole block.
+ *
+ * If the cache is in writethrough mode, or the FUA flag is set, then
+ * this calls blk_writethrough above which will write both to the
+ * cache and through to the underlying device.
+ *
+ * Otherwise it will only write to the cache.
+ */
 static int
-blk_writeback (struct nbdkit_next_ops *next_ops, void *nxdata,
-               uint64_t blknum, const uint8_t *block, uint32_t flags,
-               int *err)
+blk_write (struct nbdkit_next_ops *next_ops, void *nxdata,
+           uint64_t blknum, const uint8_t *block, uint32_t flags,
+           int *err)
 {
   off_t offset;
 
@@ -294,8 +300,7 @@ blk_writeback (struct nbdkit_next_ops *next_ops, void
*nxdata,
 
   offset = blknum * BLKSIZE;
 
-  nbdkit_debug ("cache: blk_writeback block %" PRIu64
-                " (offset %" PRIu64 ")",
+  nbdkit_debug ("cache: writeback block %" PRIu64 " (offset
%" PRIu64 ")",
                 blknum, (uint64_t) offset);
 
   if (pwrite (fd, block, BLKSIZE, offset) == -1) {
@@ -385,7 +390,7 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
       return -1;
     }
     memcpy (&block[blkoffs], buf, n);
-    if (blk_writeback (next_ops, nxdata, blknum, block, flags, err) == -1) {
+    if (blk_write (next_ops, nxdata, blknum, block, flags, err) == -1) {
       free (block);
       return -1;
     }
@@ -437,7 +442,7 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
       return -1;
     }
     memset (&block[blkoffs], 0, n);
-    if (blk_writeback (next_ops, nxdata, blknum, block, flags, err) == -1) {
+    if (blk_write (next_ops, nxdata, blknum, block, flags, err) == -1) {
       free (block);
       return -1;
     }
-- 
2.19.2
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 2/9] cache: Add cache-on-read mode.
The same as qemu's copyonread flag, this caches read requests.
---
 filters/cache/nbdkit-cache-filter.pod | 11 +++++
 filters/cache/cache.c                 | 37 +++++++++++++--
 tests/Makefile.am                     |  4 +-
 tests/test-cache-on-read.sh           | 66 +++++++++++++++++++++++++++
 4 files changed, 114 insertions(+), 4 deletions(-)
diff --git a/filters/cache/nbdkit-cache-filter.pod
b/filters/cache/nbdkit-cache-filter.pod
index 31e533d..bfbada0 100644
--- a/filters/cache/nbdkit-cache-filter.pod
+++ b/filters/cache/nbdkit-cache-filter.pod
@@ -5,6 +5,7 @@ nbdkit-cache-filter - nbdkit caching filter
 =head1 SYNOPSIS
 
  nbdkit --filter=cache plugin [cache=writeback|writethrough|unsafe]
+                              [cache-on-read=true|false]
                               [plugin-args...]
 
 =head1 DESCRIPTION
@@ -50,6 +51,16 @@ This is dangerous and can cause data loss, but this may be
acceptable
 if you only use it for testing or with data that you don't care about
 or can cheaply reconstruct.
 
+=item B<cache-on-read=true>
+
+Cache read requests as well as write requests.  Any time a block is
+read from the plugin, it is saved in the cache (if there is sufficient
+space) so the same data can be served more quickly later.
+
+=item B<cache-on-read=false>
+
+Do not cache read requests (this is the default).
+
 =back
 
 =head1 ENVIRONMENT VARIABLES
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index 1dc17fd..b1c3d53 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -90,6 +90,9 @@ static enum cache_mode {
   CACHE_MODE_UNSAFE,
 } cache_mode = CACHE_MODE_WRITEBACK;
 
+/* Cache read requests. */
+static bool cache_on_read = false;
+
 static int cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void
*handle, uint32_t flags, int *err);
 
 static void
@@ -156,6 +159,15 @@ cache_config (nbdkit_next_config *next, void *nxdata,
       return -1;
     }
   }
+  else if (strcmp (key, "cache-on-read") == 0) {
+    int r;
+
+    r = nbdkit_parse_bool (value);
+    if (r == -1)
+      return -1;
+    cache_on_read = r;
+    return 0;
+  }
   else {
     return next (nxdata, key, value);
   }
@@ -242,9 +254,28 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
                 state == BLOCK_DIRTY ? "dirty" :
                 "unknown");
 
-  if (state == BLOCK_NOT_CACHED) /* Read underlying plugin. */
-    return next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err);
-  else {                         /* Read cache. */
+  if (state == BLOCK_NOT_CACHED) { /* Read underlying plugin. */
+    if (next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err) == -1)
+      return -1;
+
+    /* If cache-on-read, copy the block to the cache. */
+    if (cache_on_read) {
+      off_t offset = blknum * BLKSIZE;
+
+      nbdkit_debug ("cache: cache-on-read block %" PRIu64
+                    " (offset %" PRIu64 ")",
+                    blknum, (uint64_t) offset);
+
+      if (pwrite (fd, block, BLKSIZE, offset) == -1) {
+        *err = errno;
+        nbdkit_error ("pwrite: %m");
+        return -1;
+      }
+      bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
+    }
+    return 0;
+  }
+  else {                        /* Read cache. */
     if (pread (fd, block, BLKSIZE, offset) == -1) {
       *err = errno;
       nbdkit_error ("pread: %m");
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 79b723a..ccd5ff5 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -702,7 +702,9 @@ TESTS += test-blocksize.sh
 
 # cache filter test.
 if HAVE_GUESTFISH
-TESTS += test-cache.sh
+TESTS += \
+	test-cache.sh \
+	test-cache-on-read.sh
 endif HAVE_GUESTFISH
 
 # cow filter test.
diff --git a/tests/test-cache-on-read.sh b/tests/test-cache-on-read.sh
new file mode 100755
index 0000000..51764a2
--- /dev/null
+++ b/tests/test-cache-on-read.sh
@@ -0,0 +1,66 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+source ./functions.sh
+set -e
+set -x
+
+files="cache-on-read.img cache-on-read.sock cache-on-read.pid"
+rm -f $files
+cleanup_fn rm -f $files
+
+# Create an empty base image.
+truncate -s 1G cache-on-read.img
+
+# Run nbdkit with the caching filter and cache-on-read set.
+start_nbdkit -P cache-on-read.pid -U cache-on-read.sock \
+             --filter=cache \
+             file cache-on-read.img \
+             cache-on-read=true
+
+# Open the overlay and perform some operations.
+guestfish --format=raw -a "nbd://?socket=$PWD/cache-on-read.sock"
<<'EOF'
+  run
+  part-disk /dev/sda gpt
+  mkfs ext4 /dev/sda1
+  mount /dev/sda1 /
+  fill-dir / 10000
+  fill-pattern "abcde" 5M /large
+  write /hello "hello, world"
+EOF
+
+# Check the last files we created exist.
+guestfish --ro -a cache-on-read.img -m /dev/sda1 <<'EOF'
+  cat /hello
+  cat /large | cat >/dev/null
+EOF
-- 
2.19.2
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 3/9] common/bitmap: Include <nbdkit-plugin.h>.
Inline functions in <bitmap.h> call nbdkit_debug so the include is needed. --- common/bitmap/bitmap.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/common/bitmap/bitmap.h b/common/bitmap/bitmap.h index ffed92e..f943933 100644 --- a/common/bitmap/bitmap.h +++ b/common/bitmap/bitmap.h @@ -44,6 +44,8 @@ #include <stdint.h> #include <assert.h> +#include <nbdkit-plugin.h> + #include "ispowerof2.h" /* This is the bitmap structure. */ -- 2.19.2
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 4/9] cache: Split out the block/bitmap code into separate files.
This change is almost completely code motion.
The only part which is not simple movement of code is that I changed
‘cache_flush’ so that it does not lazily allocate the bounce buffer.
This change considerably simplifies the code and the performance
impact should be minimal.
---
 filters/cache/blk.h       |  66 +++++++++
 filters/cache/cache.h     |  54 ++++++++
 filters/cache/blk.c       | 244 ++++++++++++++++++++++++++++++++
 filters/cache/cache.c     | 285 +++++++++-----------------------------
 filters/cache/Makefile.am |   3 +
 5 files changed, 429 insertions(+), 223 deletions(-)
diff --git a/filters/cache/blk.h b/filters/cache/blk.h
new file mode 100644
index 0000000..24bf6a1
--- /dev/null
+++ b/filters/cache/blk.h
@@ -0,0 +1,66 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef NBDKIT_BLK_H
+#define NBDKIT_BLK_H
+
+/* Initialize the cache and bitmap. */
+extern int blk_init (void);
+
+/* Close the cache, free the bitmap. */
+extern void blk_free (void);
+
+/* Allocate or resize the cache file and bitmap. */
+extern int blk_set_size (uint64_t new_size);
+
+/* Read a single block from the cache or plugin. */
+extern int blk_read (struct nbdkit_next_ops *next_ops, void *nxdata, uint64_t
blknum, uint8_t *block, int *err);
+
+/* Write to the cache and the plugin. */
+extern int blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t blknum, const uint8_t *block, uint32_t flags, int *err);
+
+/* Write a whole block.
+ *
+ * If the cache is in writethrough mode, or the FUA flag is set, then
+ * this calls blk_writethrough above which will write both to the
+ * cache and through to the underlying device.
+ *
+ * Otherwise it will only write to the cache.
+ */
+extern int blk_write (struct nbdkit_next_ops *next_ops, void *nxdata, uint64_t
blknum, const uint8_t *block, uint32_t flags, int *err);
+
+/* Iterates over each dirty block in the cache. */
+typedef int (*block_callback) (uint64_t blknum, void *vp);
+extern int for_each_dirty_block (block_callback f, void *vp);
+
+#endif /* NBDKIT_BLK_H */
diff --git a/filters/cache/cache.h b/filters/cache/cache.h
new file mode 100644
index 0000000..50416b1
--- /dev/null
+++ b/filters/cache/cache.h
@@ -0,0 +1,54 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef NBDKIT_CACHE_H
+#define NBDKIT_CACHE_H
+
+/* Size of a block in the cache.  A 4K block size means that we need
+ * 64 MB of memory to store the bitmaps for a 1 TB underlying image.
+ * It is also smaller than the usual hole size for sparse files, which
+ * means we have no reason to call next_ops->zero.
+ */
+#define BLKSIZE 4096
+
+/* Caching mode. */
+extern enum cache_mode {
+  CACHE_MODE_WRITEBACK,
+  CACHE_MODE_WRITETHROUGH,
+  CACHE_MODE_UNSAFE,
+} cache_mode;
+
+/* Cache read requests. */
+extern bool cache_on_read;
+
+#endif /* NBDKIT_CACHE_H */
diff --git a/filters/cache/blk.c b/filters/cache/blk.c
new file mode 100644
index 0000000..4000276
--- /dev/null
+++ b/filters/cache/blk.c
@@ -0,0 +1,244 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/* These are the block operations.  They always read or write a single
+ * whole block of size ‘blksize’.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <inttypes.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+
+#include <nbdkit-filter.h>
+
+#include "bitmap.h"
+
+#include "cache.h"
+#include "blk.h"
+
+/* The cache. */
+static int fd = -1;
+
+/* Bitmap.  There are two bits per block which are updated as we read,
+ * write back or write through blocks.
+ *
+ * 00 = not in cache
+ * 01 = block cached and clean
+ * 10 = <unused>
+ * 11 = block cached and dirty
+ */
+static struct bitmap bm;
+
+enum bm_entry {
+  BLOCK_NOT_CACHED = 0,
+  BLOCK_CLEAN = 1,
+  BLOCK_DIRTY = 3,
+};
+
+int
+blk_init (void)
+{
+  const char *tmpdir;
+  size_t len;
+  char *template;
+
+  bitmap_init (&bm, BLKSIZE, 2 /* bits per block */);
+
+  tmpdir = getenv ("TMPDIR");
+  if (!tmpdir)
+    tmpdir = LARGE_TMPDIR;
+
+  nbdkit_debug ("cache: temporary directory for cache: %s", tmpdir);
+
+  len = strlen (tmpdir) + 8;
+  template = alloca (len);
+  snprintf (template, len, "%s/XXXXXX", tmpdir);
+
+#ifdef HAVE_MKOSTEMP
+  fd = mkostemp (template, O_CLOEXEC);
+#else
+  fd = mkstemp (template);
+  fcntl (fd, F_SETFD, FD_CLOEXEC);
+#endif
+  if (fd == -1) {
+    nbdkit_error ("mkostemp: %s: %m", tmpdir);
+    return -1;
+  }
+
+  unlink (template);
+
+  return 0;
+}
+
+void
+blk_free (void)
+{
+  if (fd >= 0)
+    close (fd);
+
+  bitmap_free (&bm);
+}
+
+int
+blk_set_size (uint64_t new_size)
+{
+  if (bitmap_resize (&bm, new_size) == -1)
+    return -1;
+
+  if (ftruncate (fd, new_size) == -1) {
+    nbdkit_error ("ftruncate: %m");
+    return -1;
+  }
+
+  return 0;
+}
+
+int
+blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
+          uint64_t blknum, uint8_t *block, int *err)
+{
+  off_t offset = blknum * BLKSIZE;
+  enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_CACHED);
+
+  nbdkit_debug ("cache: blk_read block %" PRIu64 " (offset
%" PRIu64 ") is %s",
+                blknum, (uint64_t) offset,
+                state == BLOCK_NOT_CACHED ? "not cached" :
+                state == BLOCK_CLEAN ? "clean" :
+                state == BLOCK_DIRTY ? "dirty" :
+                "unknown");
+
+  if (state == BLOCK_NOT_CACHED) { /* Read underlying plugin. */
+    if (next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err) == -1)
+      return -1;
+
+    /* If cache-on-read, copy the block to the cache. */
+    if (cache_on_read) {
+      off_t offset = blknum * BLKSIZE;
+
+      nbdkit_debug ("cache: cache-on-read block %" PRIu64
+                    " (offset %" PRIu64 ")",
+                    blknum, (uint64_t) offset);
+
+      if (pwrite (fd, block, BLKSIZE, offset) == -1) {
+        *err = errno;
+        nbdkit_error ("pwrite: %m");
+        return -1;
+      }
+      bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
+    }
+    return 0;
+  }
+  else {                        /* Read cache. */
+    if (pread (fd, block, BLKSIZE, offset) == -1) {
+      *err = errno;
+      nbdkit_error ("pread: %m");
+      return -1;
+    }
+    return 0;
+  }
+}
+
+int
+blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
+                  uint64_t blknum, const uint8_t *block, uint32_t flags,
+                  int *err)
+{
+  off_t offset = blknum * BLKSIZE;
+
+  nbdkit_debug ("cache: writethrough block %" PRIu64 " (offset
%" PRIu64 ")",
+                blknum, (uint64_t) offset);
+
+  if (pwrite (fd, block, BLKSIZE, offset) == -1) {
+    *err = errno;
+    nbdkit_error ("pwrite: %m");
+    return -1;
+  }
+
+  if (next_ops->pwrite (nxdata, block, BLKSIZE, offset, flags, err) == -1)
+    return -1;
+
+  bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
+
+  return 0;
+}
+
+int
+blk_write (struct nbdkit_next_ops *next_ops, void *nxdata,
+           uint64_t blknum, const uint8_t *block, uint32_t flags,
+           int *err)
+{
+  off_t offset;
+
+  if (cache_mode == CACHE_MODE_WRITETHROUGH ||
+      (cache_mode == CACHE_MODE_WRITEBACK && (flags &
NBDKIT_FLAG_FUA)))
+    return blk_writethrough (next_ops, nxdata, blknum, block, flags, err);
+
+  offset = blknum * BLKSIZE;
+
+  nbdkit_debug ("cache: writeback block %" PRIu64 " (offset
%" PRIu64 ")",
+                blknum, (uint64_t) offset);
+
+  if (pwrite (fd, block, BLKSIZE, offset) == -1) {
+    *err = errno;
+    nbdkit_error ("pwrite: %m");
+    return -1;
+  }
+  bitmap_set_blk (&bm, blknum, BLOCK_DIRTY);
+
+  return 0;
+}
+
+int
+for_each_dirty_block (block_callback f, void *vp)
+{
+  uint64_t blknum;
+  enum bm_entry state;
+
+  bitmap_for (&bm, blknum) {
+    state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_CACHED);
+    if (state == BLOCK_DIRTY) {
+      if (f (blknum, vp) == -1)
+        return -1;
+    }
+  }
+
+  return 0;
+}
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index b1c3d53..0d006bc 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -52,89 +52,28 @@
 
 #include <nbdkit-filter.h>
 
-#include "bitmap.h"
+#include "cache.h"
+#include "blk.h"
 
 /* XXX See design comment in filters/cow/cow.c. */
 #define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS
 
-/* Size of a block in the cache.  A 4K block size means that we need
- * 64 MB of memory to store the bitmaps for a 1 TB underlying image.
- * It is also smaller than the usual hole size for sparse files, which
- * means we have no reason to call next_ops->zero.
- */
-#define BLKSIZE 4096
-
-/* The cache. */
-static int fd = -1;
-
-/* Bitmap.  There are two bits per block which are updated as we read,
- * write back or write through blocks.
- *
- * 00 = not in cache
- * 01 = block cached and clean
- * 10 = <unused>
- * 11 = block cached and dirty
- */
-static struct bitmap bm;
-
-enum bm_entry {
-  BLOCK_NOT_CACHED = 0,
-  BLOCK_CLEAN = 1,
-  BLOCK_DIRTY = 3,
-};
-
-/* Caching mode. */
-static enum cache_mode {
-  CACHE_MODE_WRITEBACK,
-  CACHE_MODE_WRITETHROUGH,
-  CACHE_MODE_UNSAFE,
-} cache_mode = CACHE_MODE_WRITEBACK;
-
-/* Cache read requests. */
-static bool cache_on_read = false;
+enum cache_mode cache_mode = CACHE_MODE_WRITEBACK;
+bool cache_on_read = false;
 
 static int cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void
*handle, uint32_t flags, int *err);
 
 static void
 cache_load (void)
 {
-  const char *tmpdir;
-  size_t len;
-  char *template;
-
-  bitmap_init (&bm, BLKSIZE, 2 /* bits per block */);
-
-  tmpdir = getenv ("TMPDIR");
-  if (!tmpdir)
-    tmpdir = LARGE_TMPDIR;
-
-  nbdkit_debug ("cache: temporary directory for cache: %s", tmpdir);
-
-  len = strlen (tmpdir) + 8;
-  template = alloca (len);
-  snprintf (template, len, "%s/XXXXXX", tmpdir);
-
-#ifdef HAVE_MKOSTEMP
-  fd = mkostemp (template, O_CLOEXEC);
-#else
-  fd = mkstemp (template);
-  fcntl (fd, F_SETFD, FD_CLOEXEC);
-#endif
-  if (fd == -1) {
-    nbdkit_error ("mkostemp: %s: %m", tmpdir);
+  if (blk_init () == -1)
     exit (EXIT_FAILURE);
-  }
-
-  unlink (template);
 }
 
 static void
 cache_unload (void)
 {
-  if (fd >= 0)
-    close (fd);
-
-  bitmap_free (&bm);
+  blk_free ();
 }
 
 static int
@@ -187,21 +126,6 @@ cache_open (nbdkit_next_open *next, void *nxdata, int
readonly)
   return &handle;
 }
 
-/* Allocate or resize the cache file and bitmap. */
-static int
-blk_set_size (uint64_t new_size)
-{
-  if (bitmap_resize (&bm, new_size) == -1)
-    return -1;
-
-  if (ftruncate (fd, new_size) == -1) {
-    nbdkit_error ("ftruncate: %m");
-    return -1;
-  }
-
-  return 0;
-}
-
 /* Get the file size and ensure the cache is the correct size. */
 static int64_t
 cache_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -237,113 +161,6 @@ cache_prepare (struct nbdkit_next_ops *next_ops, void
*nxdata,
   return 0;
 }
 
-/* These are the block operations.  They always read or write a single
- * whole block of size ‘blksize’.
- */
-static int
-blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
-          uint64_t blknum, uint8_t *block, int *err)
-{
-  off_t offset = blknum * BLKSIZE;
-  enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_CACHED);
-
-  nbdkit_debug ("cache: blk_read block %" PRIu64 " (offset
%" PRIu64 ") is %s",
-                blknum, (uint64_t) offset,
-                state == BLOCK_NOT_CACHED ? "not cached" :
-                state == BLOCK_CLEAN ? "clean" :
-                state == BLOCK_DIRTY ? "dirty" :
-                "unknown");
-
-  if (state == BLOCK_NOT_CACHED) { /* Read underlying plugin. */
-    if (next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err) == -1)
-      return -1;
-
-    /* If cache-on-read, copy the block to the cache. */
-    if (cache_on_read) {
-      off_t offset = blknum * BLKSIZE;
-
-      nbdkit_debug ("cache: cache-on-read block %" PRIu64
-                    " (offset %" PRIu64 ")",
-                    blknum, (uint64_t) offset);
-
-      if (pwrite (fd, block, BLKSIZE, offset) == -1) {
-        *err = errno;
-        nbdkit_error ("pwrite: %m");
-        return -1;
-      }
-      bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
-    }
-    return 0;
-  }
-  else {                        /* Read cache. */
-    if (pread (fd, block, BLKSIZE, offset) == -1) {
-      *err = errno;
-      nbdkit_error ("pread: %m");
-      return -1;
-    }
-    return 0;
-  }
-}
-
-/* Write to the cache and the plugin. */
-static int
-blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
-                  uint64_t blknum, const uint8_t *block, uint32_t flags,
-                  int *err)
-{
-  off_t offset = blknum * BLKSIZE;
-
-  nbdkit_debug ("cache: writethrough block %" PRIu64 " (offset
%" PRIu64 ")",
-                blknum, (uint64_t) offset);
-
-  if (pwrite (fd, block, BLKSIZE, offset) == -1) {
-    *err = errno;
-    nbdkit_error ("pwrite: %m");
-    return -1;
-  }
-
-  if (next_ops->pwrite (nxdata, block, BLKSIZE, offset, flags, err) == -1)
-    return -1;
-
-  bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
-
-  return 0;
-}
-
-/* Write a whole block.
- *
- * If the cache is in writethrough mode, or the FUA flag is set, then
- * this calls blk_writethrough above which will write both to the
- * cache and through to the underlying device.
- *
- * Otherwise it will only write to the cache.
- */
-static int
-blk_write (struct nbdkit_next_ops *next_ops, void *nxdata,
-           uint64_t blknum, const uint8_t *block, uint32_t flags,
-           int *err)
-{
-  off_t offset;
-
-  if (cache_mode == CACHE_MODE_WRITETHROUGH ||
-      (cache_mode == CACHE_MODE_WRITEBACK && (flags &
NBDKIT_FLAG_FUA)))
-    return blk_writethrough (next_ops, nxdata, blknum, block, flags, err);
-
-  offset = blknum * BLKSIZE;
-
-  nbdkit_debug ("cache: writeback block %" PRIu64 " (offset
%" PRIu64 ")",
-                blknum, (uint64_t) offset);
-
-  if (pwrite (fd, block, BLKSIZE, offset) == -1) {
-    *err = errno;
-    nbdkit_error ("pwrite: %m");
-    return -1;
-  }
-  bitmap_set_blk (&bm, blknum, BLOCK_DIRTY);
-
-  return 0;
-}
-
 /* Read data. */
 static int
 cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -489,57 +306,79 @@ cache_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,
 }
 
 /* Flush: Go through all the dirty blocks, flushing them to disk. */
+struct flush_data {
+  uint8_t *block;               /* bounce buffer */
+  unsigned errors;              /* count of errors seen */
+  int first_errno;              /* first errno seen */
+  struct nbdkit_next_ops *next_ops;
+  void *nxdata;
+};
+
+static int flush_dirty_block (uint64_t blknum, void *);
+
 static int
 cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle,
              uint32_t flags, int *err)
 {
-  uint8_t *block = NULL;
-  uint64_t blknum;
-  enum bm_entry state;
-  unsigned errors = 0;
+  struct flush_data data +    { .errors = 0, .first_errno = 0, .next_ops =
next_ops, .nxdata = nxdata };
   int tmp;
 
   if (cache_mode == CACHE_MODE_UNSAFE)
     return 0;
 
+  assert (!flags);
+
+  /* Allocate the bounce buffer. */
+  data.block = malloc (BLKSIZE);
+  if (data.block == NULL) {
+    *err = errno;
+    nbdkit_error ("malloc: %m");
+    return -1;
+  }
+
   /* In theory if cache_mode == CACHE_MODE_WRITETHROUGH then there
    * should be no dirty blocks.  However we go through the cache here
    * to be sure.  Also we still need to issue the flush to the
    * underlying storage.
    */
-  assert (!flags);
-  bitmap_for (&bm, blknum) {
-    state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_CACHED);
-    if (state == BLOCK_DIRTY) {
-      /* Lazily allocate the bounce buffer. */
-      if (!block) {
-        block = malloc (BLKSIZE);
-        if (block == NULL) {
-          *err = errno;
-          nbdkit_error ("malloc: %m");
-          return -1;
-        }
-      }
-      /* Perform a read + writethrough which will read from the
-       * cache and write it through to the underlying storage.
-       */
-      if (blk_read (next_ops, nxdata, blknum, block,
-                    errors ? &tmp : err) == -1 ||
-          blk_writethrough (next_ops, nxdata, blknum, block, 0,
-                            errors ? &tmp : err) == -1) {
-        nbdkit_error ("cache: flush of block %" PRIu64 "
failed", blknum);
-        errors++;
-      }
-    }
-  }
-
-  free (block);
+  for_each_dirty_block (flush_dirty_block, &data);
+  free (data.block);
 
   /* Now issue a flush request to the underlying storage. */
-  if (next_ops->flush (nxdata, 0, errors ? &tmp : err) == -1)
-    errors++;
+  if (next_ops->flush (nxdata, 0,
+                       data.errors ? &tmp : &data.first_errno) == -1)
+    data.errors++;
 
-  return errors == 0 ? 0 : -1;
+  if (data.errors > 0) {
+    *err = data.first_errno;
+    return -1;
+  }
+  return 0;
+}
+
+static int
+flush_dirty_block (uint64_t blknum, void *datav)
+{
+  struct flush_data *data = datav;
+  int tmp;
+
+  /* Perform a read + writethrough which will read from the
+   * cache and write it through to the underlying storage.
+   */
+  if (blk_read (data->next_ops, data->nxdata, blknum, data->block,
+                data->errors ? &tmp : &data->first_errno) == -1)
+    goto err;
+  if (blk_writethrough (data->next_ops, data->nxdata, blknum,
data->block, 0,
+                        data->errors ? &tmp : &data->first_errno)
== -1)
+    goto err;
+
+  return 0;
+
+ err:
+  nbdkit_error ("cache: flush of block %" PRIu64 " failed",
blknum);
+  data->errors++;
+  return 0; /* continue scanning and flushing. */
 }
 
 static struct nbdkit_filter filter = {
diff --git a/filters/cache/Makefile.am b/filters/cache/Makefile.am
index 89f4396..0b79be5 100644
--- a/filters/cache/Makefile.am
+++ b/filters/cache/Makefile.am
@@ -37,7 +37,10 @@ EXTRA_DIST = nbdkit-cache-filter.pod
 filter_LTLIBRARIES = nbdkit-cache-filter.la
 
 nbdkit_cache_filter_la_SOURCES = \
+	blk.c \
+	blk.h \
 	cache.c \
+	cache.h \
 	$(top_srcdir)/include/nbdkit-filter.h
 
 nbdkit_cache_filter_la_CPPFLAGS = \
-- 
2.19.2
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 5/9] cache: Allow this filter to serve requests in parallel.
Make the implicit lock explicit, and hold it around blk_* operations.
This allows us to relax the thread model for the filter to
NBDKIT_THREAD_MODEL_PARALLEL.
---
 filters/cache/blk.h   |  7 ++++++
 filters/cache/cache.c | 57 +++++++++++++++++++++++++++++++------------
 2 files changed, 49 insertions(+), 15 deletions(-)
diff --git a/filters/cache/blk.h b/filters/cache/blk.h
index 24bf6a1..ab9134e 100644
--- a/filters/cache/blk.h
+++ b/filters/cache/blk.h
@@ -40,6 +40,13 @@ extern int blk_init (void);
 /* Close the cache, free the bitmap. */
 extern void blk_free (void);
 
+/*----------------------------------------------------------------------
+ * ** NOTE **
+ *
+ * An exclusive lock must be held when you call any function below
+ * this line.
+ */
+
 /* Allocate or resize the cache file and bitmap. */
 extern int blk_set_size (uint64_t new_size);
 
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index 0d006bc..580e4f5 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -46,6 +46,8 @@
 #include <sys/ioctl.h>
 #include <assert.h>
 
+#include <pthread.h>
+
 #ifdef HAVE_ALLOCA_H
 #include <alloca.h>
 #endif
@@ -55,8 +57,12 @@
 #include "cache.h"
 #include "blk.h"
 
-/* XXX See design comment in filters/cow/cow.c. */
-#define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS
+#define THREAD_MODEL NBDKIT_THREAD_MODEL_PARALLEL
+
+/* In order to handle parallel requests safely, this lock must be held
+ * when calling any blk_* functions.
+ */
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 
 enum cache_mode cache_mode = CACHE_MODE_WRITEBACK;
 bool cache_on_read = false;
@@ -132,6 +138,7 @@ cache_get_size (struct nbdkit_next_ops *next_ops, void
*nxdata,
               void *handle)
 {
   int64_t size;
+  int r;
 
   size = next_ops->get_size (nxdata);
   if (size == -1)
@@ -139,7 +146,10 @@ cache_get_size (struct nbdkit_next_ops *next_ops, void
*nxdata,
 
   nbdkit_debug ("cache: underlying file size: %" PRIi64, size);
 
-  if (blk_set_size (size))
+  pthread_mutex_lock (&lock);
+  r = blk_set_size (size);
+  pthread_mutex_unlock (&lock);
+  if (r == -1)
     return -1;
 
   return size;
@@ -179,6 +189,7 @@ cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
 
   while (count > 0) {
     uint64_t blknum, blkoffs, n;
+    int r;
 
     blknum = offset / BLKSIZE;  /* block number */
     blkoffs = offset % BLKSIZE; /* offset within the block */
@@ -186,7 +197,10 @@ cache_pread (struct nbdkit_next_ops *next_ops, void
*nxdata,
     if (n > count)
       n = count;
 
-    if (blk_read (next_ops, nxdata, blknum, block, err) == -1) {
+    pthread_mutex_lock (&lock);
+    r = blk_read (next_ops, nxdata, blknum, block, err);
+    pthread_mutex_unlock (&lock);
+    if (r == -1) {
       free (block);
       return -1;
     }
@@ -225,6 +239,7 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
   }
   while (count > 0) {
     uint64_t blknum, blkoffs, n;
+    int r;
 
     blknum = offset / BLKSIZE;  /* block number */
     blkoffs = offset % BLKSIZE; /* offset within the block */
@@ -232,13 +247,17 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
     if (n > count)
       n = count;
 
-    /* Do a read-modify-write operation on the current block. */
-    if (blk_read (next_ops, nxdata, blknum, block, err) == -1){
-      free (block);
-      return -1;
+    /* Do a read-modify-write operation on the current block.
+     * Hold the lock over the whole operation.
+     */
+    pthread_mutex_lock (&lock);
+    r = blk_read (next_ops, nxdata, blknum, block, err);
+    if (r != -1) {
+      memcpy (&block[blkoffs], buf, n);
+      r = blk_write (next_ops, nxdata, blknum, block, flags, err);
     }
-    memcpy (&block[blkoffs], buf, n);
-    if (blk_write (next_ops, nxdata, blknum, block, flags, err) == -1) {
+    pthread_mutex_unlock (&lock);
+    if (r == -1) {
       free (block);
       return -1;
     }
@@ -278,6 +297,7 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
   }
   while (count > 0) {
     uint64_t blknum, blkoffs, n;
+    int r;
 
     blknum = offset / BLKSIZE;  /* block number */
     blkoffs = offset % BLKSIZE; /* offset within the block */
@@ -285,12 +305,17 @@ cache_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,
     if (n > count)
       n = count;
 
-    if (blk_read (next_ops, nxdata, blknum, block, err) == -1) {
-      free (block);
-      return -1;
+    /* Do a read-modify-write operation on the current block.
+     * Hold the lock over the whole operation.
+     */
+    pthread_mutex_lock (&lock);
+    r = blk_read (next_ops, nxdata, blknum, block, err);
+    if (r != -1) {
+      memset (&block[blkoffs], 0, n);
+      r = blk_write (next_ops, nxdata, blknum, block, flags, err);
     }
-    memset (&block[blkoffs], 0, n);
-    if (blk_write (next_ops, nxdata, blknum, block, flags, err) == -1) {
+    pthread_mutex_unlock (&lock);
+    if (r == -1) {
       free (block);
       return -1;
     }
@@ -342,7 +367,9 @@ cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle,
    * to be sure.  Also we still need to issue the flush to the
    * underlying storage.
    */
+  pthread_mutex_lock (&lock);
   for_each_dirty_block (flush_dirty_block, &data);
+  pthread_mutex_unlock (&lock);
   free (data.block);
 
   /* Now issue a flush request to the underlying storage. */
-- 
2.19.2
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 6/9] common/bitmap: Add bitmap_next function and tests.
It's useful to be able to search for the next non-zero entry in a
bitmap.  This commit adds a ‘bitmap_next’ function to do that.
Because the bitmap is just a uint8_t buffer, using fast string
functions we should be able to do this quickly even if the bitmap is
sparse.  (However the actual implementation is not optimized since
that is quite complicated - see to-do comments in
common/include/nextnonzero.h).
I wasn't confident about the correctness of the code and so this
commit also adds some unit tests covering all of the bitmap code.
---
 common/bitmap/bitmap.h       |   6 ++
 common/include/iszero.h      |   3 +
 common/include/nextnonzero.h |  59 +++++++++++++
 common/bitmap/bitmap.c       |  35 ++++++++
 common/bitmap/test-bitmap.c  | 162 +++++++++++++++++++++++++++++++++++
 .gitignore                   |   1 +
 common/bitmap/Makefile.am    |  12 +++
 common/include/Makefile.am   |   1 +
 8 files changed, 279 insertions(+)
diff --git a/common/bitmap/bitmap.h b/common/bitmap/bitmap.h
index f943933..80fd5e4 100644
--- a/common/bitmap/bitmap.h
+++ b/common/bitmap/bitmap.h
@@ -165,4 +165,10 @@ bitmap_set (const struct bitmap *bm, uint64_t offset,
unsigned v)
 #define bitmap_for(bm, /* uint64_t */ blknum)                           \
   for ((blknum) = 0; (blknum) < (bm)->size * (bm)->ibpb; ++(blknum))
 
+/* Find the next non-zero block in the bitmap, starting at ‘blk’.
+ * Returns -1 if the bitmap is all zeroes from blk to the end of the
+ * bitmap.
+ */
+extern int64_t bitmap_next (const struct bitmap *bm, uint64_t blk);
+
 #endif /* NBDKIT_BITMAP_H */
diff --git a/common/include/iszero.h b/common/include/iszero.h
index 1079f3e..7a54f0a 100644
--- a/common/include/iszero.h
+++ b/common/include/iszero.h
@@ -42,6 +42,9 @@
  * The clever approach here was suggested by Eric Blake.  See:
  * https://www.redhat.com/archives/libguestfs/2017-April/msg00171.html
  * https://rusty.ozlabs.org/?p=560
+ *
+ * See also:
+ * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908
  */
 static inline bool
 is_zero (const char *buffer, size_t size)
diff --git a/common/include/nextnonzero.h b/common/include/nextnonzero.h
new file mode 100644
index 0000000..3f96e85
--- /dev/null
+++ b/common/include/nextnonzero.h
@@ -0,0 +1,59 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef NBDKIT_NEXTNONZERO_H
+#define NBDKIT_NEXTNONZERO_H
+
+/* Given a byte buffer, return a pointer to the first non-zero byte,
+ * or return NULL if we reach the end of the buffer.
+ *
+ * XXX: Even with -O3 -mavx2, gcc 8.2.1 makes a terrible job with this
+ * loop, compiling it completely naively.  QEMU has an AVX2
+ * implementation of a similar loop.
+ *
+ * See also:
+ * https://sourceware.org/bugzilla/show_bug.cgi?id=19920
+ * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908
+ */
+static inline const char *
+next_non_zero (const char *buffer, size_t size)
+{
+  size_t i;
+
+  for (i = 0; i < size; ++i)
+    if (buffer[i] != 0)
+      return &buffer[i];
+  return NULL;
+}
+
+#endif /* NBDKIT_NEXTNONZERO_H */
diff --git a/common/bitmap/bitmap.c b/common/bitmap/bitmap.c
index fb5dbe7..9690a2e 100644
--- a/common/bitmap/bitmap.c
+++ b/common/bitmap/bitmap.c
@@ -42,6 +42,7 @@
 
 #include "bitmap.h"
 #include "rounding.h"
+#include "nextnonzero.h"
 
 int
 bitmap_resize (struct bitmap *bm, uint64_t new_size)
@@ -73,3 +74,37 @@ bitmap_resize (struct bitmap *bm, uint64_t new_size)
 
   return 0;
 }
+
+int64_t
+bitmap_next (const struct bitmap *bm, uint64_t blk)
+{
+  uint64_t limit = bm->size * bm->ibpb;
+  const uint8_t *p;
+
+  /* Align to the next byte boundary. */
+  for (; blk < limit && (blk & (bm->ibpb-1)) != 0; ++blk) {
+    if (bitmap_get_blk (bm, blk, 0) != 0)
+      return blk;
+  }
+  if (blk == limit)
+    return -1;
+
+  /* Now we're at a byte boundary we can use a fast string function to
+   * find the next non-zero byte.
+   */
+  p = &bm->bitmap[blk >> (3 - bm->bitshift)];
+  p = (const uint8_t *) next_non_zero ((const char *) p,
+                                       &bm->bitmap[bm->size] - p);
+  if (p == NULL)
+    return -1;
+
+  /* Now check the non-zero byte to find out which bit (ie. block) is set. */
+  blk = (p - bm->bitmap) << (3 - bm->bitshift);
+  for (; blk < limit; ++blk) {
+    if (bitmap_get_blk (bm, blk, 0) != 0)
+      return blk;
+  }
+
+  /* Should never be reached. */
+  abort ();
+}
diff --git a/common/bitmap/test-bitmap.c b/common/bitmap/test-bitmap.c
new file mode 100644
index 0000000..6bf4574
--- /dev/null
+++ b/common/bitmap/test-bitmap.c
@@ -0,0 +1,162 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/* Unit tests of the bitmap code. */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdarg.h>
+#include <string.h>
+#include <errno.h>
+#include <assert.h>
+
+#include <nbdkit-plugin.h>
+
+#include "bitmap.h"
+
+static void
+test (int bpb, int blksize)
+{
+  struct bitmap bm;
+  const int nr_blocks = 1000;
+  int blks[] +    {
+     1, 2, 3, 10, 12,
+     90, 91, 92, 93, 94, 99,
+     800, 801, 802, 803,
+     902, 903, 905, 907, 911, 913, 917, 919, 923, 929,
+     999
+    };
+  unsigned v, vexp;
+  size_t i, j;
+
+  printf ("bpb = %d, blksize = %d\n", bpb, blksize);
+  fflush (stdout);
+
+  bitmap_init (&bm, blksize, bpb);
+  if (bitmap_resize (&bm, nr_blocks * blksize) == -1)
+    exit (EXIT_FAILURE);
+
+  /* Set some bits at known block numbers. */
+  for (j = 0; j < sizeof blks / sizeof blks[0]; ++j) {
+    v = (j & 1) == 0 ? 1 : (1<<bpb) - 1;
+    bitmap_set_blk (&bm, blks[j], v);
+  }
+
+  /* Check the values of all bits. */
+  for (i = j = 0; i < nr_blocks; ++i) {
+    if (i == blks[j]) { /* previously set bit */
+      vexp = (j & 1) == 0 ? 1 : (1<<bpb) - 1;
+      v = bitmap_get_blk (&bm, blks[j], 0);
+      assert (v == vexp);
+      ++j;
+    }
+    else { /* unset bit, except it to be zero */
+      v = bitmap_get_blk (&bm, i, 0);
+      assert (v == 0);
+    }
+  }
+
+  /* Same as above, but using bitmap_for macro. */
+  j = 0;
+  bitmap_for (&bm, i) {
+    if (i == blks[j]) { /* previously set bit */
+      vexp = (j & 1) == 0 ? 1 : (1<<bpb) - 1;
+      v = bitmap_get_blk (&bm, blks[j], 0);
+      assert (v == vexp);
+      ++j;
+    }
+    else { /* unset bit, except it to be zero */
+      v = bitmap_get_blk (&bm, i, 0);
+      assert (v == 0);
+    }
+  }
+
+  /* Use bitmap_next to iterate over the non-zero entries in the bitmap. */
+  i = bitmap_next (&bm, 0);
+  j = 0;
+  while (i != -1) {
+    assert (i == blks[j]);
+    vexp = (j & 1) == 0 ? 1 : (1<<bpb) - 1;
+    v = bitmap_get_blk (&bm, i, 0);
+    assert (v == vexp);
+    i = bitmap_next (&bm, i+1);
+    ++j;
+  }
+
+  bitmap_free (&bm);
+}
+
+int
+main (void)
+{
+  int bpb;
+  size_t i;
+  int blksizes[] = { 1, 2, 4, 1024, 2048, 4096, 16384 };
+
+  /* Try the tests at each bpb setting and at a range of block sizes. */
+  for (bpb = 1; bpb <= 8; bpb <<= 1)
+    for (i = 0; i < sizeof blksizes / sizeof blksizes[0]; ++i)
+      test (bpb, blksizes[i]);
+
+  exit (EXIT_SUCCESS);
+}
+
+/* The bitmap code uses nbdkit_debug, normally provided by the main
+ * server program.  So we have to provide it here.
+ */
+void
+nbdkit_debug (const char *fs, ...)
+{
+  /* do nothing */
+}
+
+/* Same for nbdkit_error. */
+void
+nbdkit_error (const char *fs, ...)
+{
+  int err = errno;
+  va_list args;
+
+  va_start (args, fs);
+  fprintf (stderr, "error: ");
+  errno = err; /* Must restore in case fs contains %m */
+  vfprintf (stderr, fs, args);
+  fprintf (stderr, "\n");
+  va_end (args);
+
+  errno = err;
+}
diff --git a/.gitignore b/.gitignore
index c34dcc5..c3a4e67 100644
--- a/.gitignore
+++ b/.gitignore
@@ -24,6 +24,7 @@ Makefile.in
 
 /aclocal.m4
 /autom4te.cache
+/common/bitmap/test-bitmap
 /compile
 /config.guess
 /config.h
diff --git a/common/bitmap/Makefile.am b/common/bitmap/Makefile.am
index cbd82bd..dade5ec 100644
--- a/common/bitmap/Makefile.am
+++ b/common/bitmap/Makefile.am
@@ -42,3 +42,15 @@ libbitmap_la_CPPFLAGS = \
 	-I$(top_srcdir)/common/include
 libbitmap_la_CFLAGS = \
         $(WARNINGS_CFLAGS)
+
+# Unit tests.
+
+TESTS = test-bitmap
+check_PROGRAMS = test-bitmap
+
+test_bitmap_SOURCES = test-bitmap.c bitmap.c bitmap.h
+test_bitmap_CPPFLAGS = \
+	-I$(top_srcdir)/include \
+	-I$(top_srcdir)/common/include
+test_bitmap_CFLAGS = \
+        $(WARNINGS_CFLAGS)
diff --git a/common/include/Makefile.am b/common/include/Makefile.am
index 81f4804..c7416fb 100644
--- a/common/include/Makefile.am
+++ b/common/include/Makefile.am
@@ -41,4 +41,5 @@ EXTRA_DIST = \
 	isaligned.h \
 	ispowerof2.h \
 	iszero.h \
+	nextnonzero.h \
 	rounding.h
-- 
2.19.2
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 7/9] common/bitmap: Add bitmap_clear function.
Clears the bitmap.
---
 common/bitmap/bitmap.h | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/common/bitmap/bitmap.h b/common/bitmap/bitmap.h
index 80fd5e4..0da650a 100644
--- a/common/bitmap/bitmap.h
+++ b/common/bitmap/bitmap.h
@@ -42,6 +42,7 @@
 #define NBDKIT_BITMAP_H
 
 #include <stdint.h>
+#include <string.h>
 #include <assert.h>
 
 #include <nbdkit-plugin.h>
@@ -100,6 +101,13 @@ bitmap_free (struct bitmap *bm)
  */
 extern int bitmap_resize (struct bitmap *bm, uint64_t new_size);
 
+/* Clear the bitmap (set everything to zero). */
+static inline void
+bitmap_clear (struct bitmap *bm)
+{
+  memset (bm->bitmap, 0, bm->size);
+}
+
 /* This macro calculates the byte offset in the bitmap and which
  * bit/mask we are addressing within that byte.
  *
-- 
2.19.2
Richard W.M. Jones
2018-Dec-28  18:45 UTC
[Libguestfs] [PATCH nbdkit 8/9] cache: Implement LRU structure.
---
 filters/cache/lru.h       |  54 ++++++++++++++
 filters/cache/blk.c       |  12 +++
 filters/cache/lru.c       | 150 ++++++++++++++++++++++++++++++++++++++
 filters/cache/Makefile.am |   2 +
 4 files changed, 218 insertions(+)
diff --git a/filters/cache/lru.h b/filters/cache/lru.h
new file mode 100644
index 0000000..4faefcd
--- /dev/null
+++ b/filters/cache/lru.h
@@ -0,0 +1,54 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef NBDKIT_LRU_H
+#define NBDKIT_LRU_H
+
+#include <stdbool.h>
+
+/* Initialize LRU. */
+extern void lru_init (void);
+
+/* Free the LRU. */
+extern void lru_free (void);
+
+/* Notify LRU that the virtual size has changed. */
+extern int lru_set_size (uint64_t new_size);
+
+/* Mark a block as recently accessed in the LRU structure. */
+extern void lru_set_recently_accessed (uint64_t blknum);
+
+/* Check if a block has been recently accessed. */
+extern bool lru_has_been_recently_accessed (uint64_t blknum);
+
+#endif /* NBDKIT_LRU_H */
diff --git a/filters/cache/blk.c b/filters/cache/blk.c
index 4000276..b256446 100644
--- a/filters/cache/blk.c
+++ b/filters/cache/blk.c
@@ -53,6 +53,7 @@
 
 #include "cache.h"
 #include "blk.h"
+#include "lru.h"
 
 /* The cache. */
 static int fd = -1;
@@ -80,6 +81,8 @@ blk_init (void)
   size_t len;
   char *template;
 
+  lru_init ();
+
   bitmap_init (&bm, BLKSIZE, 2 /* bits per block */);
 
   tmpdir = getenv ("TMPDIR");
@@ -115,6 +118,8 @@ blk_free (void)
     close (fd);
 
   bitmap_free (&bm);
+
+  lru_free ();
 }
 
 int
@@ -128,6 +133,9 @@ blk_set_size (uint64_t new_size)
     return -1;
   }
 
+  if (lru_set_size (new_size) == -1)
+    return -1;
+
   return 0;
 }
 
@@ -163,6 +171,7 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
         return -1;
       }
       bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
+      lru_set_recently_accessed (blknum);
     }
     return 0;
   }
@@ -172,6 +181,7 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
       nbdkit_error ("pread: %m");
       return -1;
     }
+    lru_set_recently_accessed (blknum);
     return 0;
   }
 }
@@ -196,6 +206,7 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void
*nxdata,
     return -1;
 
   bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
+  lru_set_recently_accessed (blknum);
 
   return 0;
 }
@@ -222,6 +233,7 @@ blk_write (struct nbdkit_next_ops *next_ops, void *nxdata,
     return -1;
   }
   bitmap_set_blk (&bm, blknum, BLOCK_DIRTY);
+  lru_set_recently_accessed (blknum);
 
   return 0;
 }
diff --git a/filters/cache/lru.c b/filters/cache/lru.c
new file mode 100644
index 0000000..60cf379
--- /dev/null
+++ b/filters/cache/lru.c
@@ -0,0 +1,150 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/* These are the block operations.  They always read or write a single
+ * whole block of size ‘blksize’.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <inttypes.h>
+
+#include <nbdkit-filter.h>
+
+#include "bitmap.h"
+
+#include "cache.h"
+#include "blk.h"
+#include "lru.h"
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+/* LRU bitmaps.  These bitmaps implement a simple, fast LRU structure.
+ *
+ *    bm[0]         bm[1]         blocks not in either bitmap
+ * ┌─────────┬──────────────────┬─────────────────────────────┐
+ * │         │                  │                             │
+ * └─────────┴──────────────────┴─────────────────────────────┘
+ *    ↑          c1 bits set
+ *  c0 bits set
+ *
+ * The LRU structure keeps track of the [approx] last N distinct
+ * blocks which have been most recently accessed.  It can answer in
+ * O(1) time the question: “Is a particular block in or not in the N
+ * distinct blocks most recently accessed?”
+ *
+ * To do this we keep two bitmaps.
+ *
+ * When a block is accessed, we set the corresponding bit in bm[0] and
+ * increment c0 (c0 counts the number of bits set in bm[0]).  If c0 =+ * N/2
then we swap the two bitmaps, clear bm[0], and reset c0 to 0.
+ *
+ * To check if a block has been accessed within the previous N
+ * distinct accesses, we simply have to check both bitmaps.  If it is
+ * not in either bitmap, then it's old and a candidate to be
+ * reclaimed.
+ *
+ * You'll note that in fact we only keep track of between N/2 and N
+ * recently accessed blocks.  We could make the estimate more accurate
+ * by having more bitmaps, but as this is only a heuristic we choose
+ * to keep the implementation simple and memory usage low instead.
+ */
+static struct bitmap bm[2];
+static unsigned c0 = 0, c1 = 0;
+static unsigned N = 100;
+
+void
+lru_init (void)
+{
+  bitmap_init (&bm[0], BLKSIZE, 1 /* bits per block */);
+  bitmap_init (&bm[1], BLKSIZE, 1 /* bits per block */);
+}
+
+void
+lru_free (void)
+{
+  bitmap_free (&bm[0]);
+  bitmap_free (&bm[1]);
+}
+
+int
+lru_set_size (uint64_t new_size)
+{
+  if (bitmap_resize (&bm[0], new_size) == -1)
+    return -1;
+  if (bitmap_resize (&bm[1], new_size) == -1)
+    return -1;
+
+  /* XXX Choose this better. */
+  N = MAX (new_size / BLKSIZE / 4, 100);
+
+  return 0;
+}
+
+void
+lru_set_recently_accessed (uint64_t blknum)
+{
+  /* If the block is already set in the first bitmap, don't need to do
+   * anything.
+   */
+  if (bitmap_get_blk (&bm[0], blknum, false))
+    return;
+
+  bitmap_set_blk (&bm[0], blknum, true);
+  c0++;
+
+  /* If we've reached N/2 then we need to swap over the bitmaps. */
+  if (c0 >= N/2) {
+    struct bitmap tmp;
+
+    tmp = bm[0];
+    bm[0] = bm[1];
+    bm[1] = tmp;
+    c1 = c0;
+
+    bitmap_clear (&bm[0]);
+    c0 = 0;
+  }
+}
+
+bool
+lru_has_been_recently_accessed (uint64_t blknum)
+{
+  return
+    bitmap_get_blk (&bm[0], blknum, false) ||
+    bitmap_get_blk (&bm[1], blknum, false);
+}
diff --git a/filters/cache/Makefile.am b/filters/cache/Makefile.am
index 0b79be5..fcd927a 100644
--- a/filters/cache/Makefile.am
+++ b/filters/cache/Makefile.am
@@ -41,6 +41,8 @@ nbdkit_cache_filter_la_SOURCES = \
 	blk.h \
 	cache.c \
 	cache.h \
+	lru.c \
+	lru.h \
 	$(top_srcdir)/include/nbdkit-filter.h
 
 nbdkit_cache_filter_la_CPPFLAGS = \
-- 
2.19.2
Richard W.M. Jones
2018-Dec-28  18:46 UTC
[Libguestfs] [PATCH nbdkit 9/9] cache: Implement cache-max-size and method of reclaiming space from the cache.
The original plan was to have a background thread doing the reclaim.
However that cannot work given the design of filters, because a
background thread cannot access the next_ops struct which is only
available during requests.
Therefore we spread the work over the request threads.  Each blk_*
function checks whether there is work to do, and if there is will
reclaim up to two blocks from the cache (thus ensuring that we always
make forward progress, since each blk_* function can only add at most
one block to the cache).
---
 filters/cache/nbdkit-cache-filter.pod |  70 +++++++++++
 filters/cache/cache.h                 |  11 ++
 filters/cache/blk.c                   | 163 ++++++++++++++++++++++++++
 filters/cache/cache.c                 |  56 +++++++++
 filters/cache/lru.c                   |   7 +-
 TODO                                  |  11 +-
 tests/Makefile.am                     |   3 +-
 tests/test-cache-max-size.sh          |  90 ++++++++++++++
 8 files changed, 399 insertions(+), 12 deletions(-)
diff --git a/filters/cache/nbdkit-cache-filter.pod
b/filters/cache/nbdkit-cache-filter.pod
index bfbada0..0f44179 100644
--- a/filters/cache/nbdkit-cache-filter.pod
+++ b/filters/cache/nbdkit-cache-filter.pod
@@ -5,6 +5,8 @@ nbdkit-cache-filter - nbdkit caching filter
 =head1 SYNOPSIS
 
  nbdkit --filter=cache plugin [cache=writeback|writethrough|unsafe]
+                              [cache-max-size=SIZE]
+                              [cache-high-threshold=N] [cache-low-threshold=N]
                               [cache-on-read=true|false]
                               [plugin-args...]
 
@@ -51,6 +53,14 @@ This is dangerous and can cause data loss, but this may be
acceptable
 if you only use it for testing or with data that you don't care about
 or can cheaply reconstruct.
 
+=item B<cache-max-size=SIZE>
+
+=item B<cache-high-threshold=N>
+
+=item B<cache-low-threshold=N>
+
+Limit the size of the cache to C<SIZE>.  See L</CACHE MAXIMUM SIZE>
below.
+
 =item B<cache-on-read=true>
 
 Cache read requests as well as write requests.  Any time a block is
@@ -63,6 +73,66 @@ Do not cache read requests (this is the default).
 
 =back
 
+=head1 CACHE MAXIMUM SIZE
+
+By default the cache can grow to any size (although not larger than
+the virtual size of the underlying plugin) and you have to ensure
+there is sufficient space in C<$TMPDIR> for it.
+
+Using the parameters C<cache-max-size>, C<cache-high-threshold> and
+C<cache-low-threshold> you can limit the maximum size of the cache.
+
+Some examples:
+
+=over 4
+
+=item C<cache-max-size=1G>
+
+The cache is limited to around 1 gigabyte.
+
+=item C<cache-max-size=1G cache-high-threshold=95 cache-low-threshold=80>
+
+Once the cache size reaches 972M (95% of 1G), blocks are reclaimed
+from the cache until the size is reduced to 819M (80% of 1G).
+
+=back
+
+The way this works is once the size of the cache exceeds
+S<C<SIZE> ✕ the high threshold>, the filter works to reduce the
size
+of the cache until it is less than S<C<SIZE> ✕ the low threshold>.
+Once the size is below the low threshold, no more reclaim work is done
+until the size exceeds the high threshold again.
+
+The default thresholds are high 95% and low 80%.  The thresholds are
+expressed as percentages of C<cache-max-size>, and may be greater than
+100.
+
+The reclaiming work discards blocks from the cache in the following
+order of priority:
+
+=over 4
+
+=item *
+
+(Highest priority / first to be reclaimed.)
+Least recently used clean blocks are discarded from the cache.
+
+=item *
+
+Least recently used dirty blocks are flushed to the plugin and
+discarded.
+
+=item *
+
+Recently used clean blocks are discarded.
+
+=item *
+
+(Lowest priority / last to be reclaimed.)
+Recently used dirty blocks are flushed to the plugin and discarded.
+
+=back
+
 =head1 ENVIRONMENT VARIABLES
 
 =over 4
diff --git a/filters/cache/cache.h b/filters/cache/cache.h
index 50416b1..889414b 100644
--- a/filters/cache/cache.h
+++ b/filters/cache/cache.h
@@ -48,7 +48,18 @@ extern enum cache_mode {
   CACHE_MODE_UNSAFE,
 } cache_mode;
 
+/* Maximum size of the cache and high/low thresholds. */
+extern int64_t max_size;
+extern int hi_thresh, lo_thresh;
+
 /* Cache read requests. */
 extern bool cache_on_read;
 
+/* Do we support reclaiming cache blocks? */
+#ifdef FALLOC_FL_PUNCH_HOLE
+#define HAVE_CACHE_RECLAIM 1
+#else
+#undef HAVE_CACHE_RECLAIM
+#endif
+
 #endif /* NBDKIT_CACHE_H */
diff --git a/filters/cache/blk.c b/filters/cache/blk.c
index b256446..4a7f65f 100644
--- a/filters/cache/blk.c
+++ b/filters/cache/blk.c
@@ -46,6 +46,8 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <errno.h>
+#include <sys/types.h>
+#include <sys/stat.h>
 
 #include <nbdkit-filter.h>
 
@@ -74,6 +76,31 @@ enum bm_entry {
   BLOCK_DIRTY = 3,
 };
 
+#ifdef HAVE_CACHE_RECLAIM
+/* If we are currently reclaiming blocks from the cache.
+ *
+ * The state machine starts in the NOT_RECLAIMING state.  When the
+ * size of the cache exceeds the high threshold, we move to
+ * RECLAIMING_LRU.  Once we have exhausted all LRU blocks, we move to
+ * RECLAIMING_ANY (reclaiming any blocks).
+ *
+ * If at any time the size of the cache goes below the low threshold
+ * we move back to the NOT_RECLAIMING state.
+ *
+ * reclaim_blk is the last block that we will looked at.
+ */
+enum reclaim_state {
+  NOT_RECLAIMING = 0,
+  RECLAIMING_LRU = 1,
+  RECLAIMING_ANY = 2,
+};
+
+static enum reclaim_state reclaiming = NOT_RECLAIMING;
+static uint64_t reclaim_blk;
+#endif
+
+static void reclaim (void);
+
 int
 blk_init (void)
 {
@@ -146,6 +173,8 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
   off_t offset = blknum * BLKSIZE;
   enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_CACHED);
 
+  reclaim ();
+
   nbdkit_debug ("cache: blk_read block %" PRIu64 " (offset
%" PRIu64 ") is %s",
                 blknum, (uint64_t) offset,
                 state == BLOCK_NOT_CACHED ? "not cached" :
@@ -193,6 +222,8 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void
*nxdata,
 {
   off_t offset = blknum * BLKSIZE;
 
+  reclaim ();
+
   nbdkit_debug ("cache: writethrough block %" PRIu64 " (offset
%" PRIu64 ")",
                 blknum, (uint64_t) offset);
 
@@ -224,6 +255,8 @@ blk_write (struct nbdkit_next_ops *next_ops, void *nxdata,
 
   offset = blknum * BLKSIZE;
 
+  reclaim ();
+
   nbdkit_debug ("cache: writeback block %" PRIu64 " (offset
%" PRIu64 ")",
                 blknum, (uint64_t) offset);
 
@@ -254,3 +287,133 @@ for_each_dirty_block (block_callback f, void *vp)
 
   return 0;
 }
+
+#ifdef HAVE_CACHE_RECLAIM
+
+static void reclaim_one (void);
+static void reclaim_lru (void);
+static void reclaim_any (void);
+static void reclaim_block (void);
+
+/* See if we need to reclaim blocks. */
+static void
+reclaim (void)
+{
+  struct stat statbuf;
+  uint64_t cache_allocated;
+
+  /* If the user didn't set cache-max-size, do nothing. */
+  if (max_size == -1) return;
+
+  /* Check the allocated size of the cache. */
+  if (fstat (fd, &statbuf) == -1) {
+    nbdkit_debug ("cache: fstat: %m");
+    return;
+  }
+  cache_allocated = statbuf.st_blocks * UINT64_C(512);
+
+  if (reclaiming) {
+    /* Keep reclaiming until the cache size drops below the low threshold. */
+    if (cache_allocated < max_size * lo_thresh / 100) {
+      nbdkit_debug ("cache: stop reclaiming");
+      reclaiming = NOT_RECLAIMING;
+      return;
+    }
+  }
+  else {
+    if (cache_allocated < max_size * hi_thresh / 100)
+      return;
+
+    /* Start reclaiming if the cache size goes over the high threshold. */
+    nbdkit_debug ("cache: start reclaiming");
+    reclaiming = RECLAIMING_LRU;
+  }
+
+  /* Reclaim up to 2 cache blocks. */
+  reclaim_one ();
+  reclaim_one ();
+}
+
+/* Reclaim a single cache block. */
+static void
+reclaim_one (void)
+{
+  assert (reclaiming);
+
+  if (reclaiming == RECLAIMING_LRU)
+    reclaim_lru ();
+  else
+    reclaim_any ();
+}
+
+static void
+reclaim_lru (void)
+{
+  uint64_t old_reclaim_blk;
+
+  /* Find the next block in the cache. */
+  reclaim_blk = bitmap_next (&bm, reclaim_blk+1);
+  old_reclaim_blk = reclaim_blk;
+
+  /* Search for an LRU block after this one. */
+  do {
+    if (! lru_has_been_recently_accessed (reclaim_blk)) {
+      reclaim_block ();
+      return;
+    }
+
+    reclaim_blk = bitmap_next (&bm, reclaim_blk+1);
+    if (reclaim_blk == -1)    /* wrap around */
+      reclaim_blk = bitmap_next (&bm, 0);
+  } while (reclaim_blk >= 0 && old_reclaim_blk != reclaim_blk);
+
+  if (old_reclaim_blk == reclaim_blk) {
+    /* Run out of LRU blocks, so start reclaiming any block in the cache. */
+    nbdkit_debug ("cache: reclaiming any blocks");
+    reclaiming = RECLAIMING_ANY;
+    reclaim_one ();
+    return;
+  }
+}
+
+static void
+reclaim_any (void)
+{
+  /* Find the next block in the cache. */
+  reclaim_blk = bitmap_next (&bm, reclaim_blk+1);
+  if (reclaim_blk == -1)        /* wrap around */
+    reclaim_blk = bitmap_next (&bm, 0);
+
+  reclaim_block ();
+}
+
+static void
+reclaim_block (void)
+{
+  if (reclaim_blk == -1) {
+    nbdkit_debug ("cache: run out of blocks to reclaim!");
+    return;
+  }
+
+  nbdkit_debug ("cache: reclaiming block %" PRIu64, reclaim_blk);
+#ifdef FALLOC_FL_PUNCH_HOLE
+  if (fallocate (fd, FALLOC_FL_PUNCH_HOLE,
+                 reclaim_blk * BLKSIZE, BLKSIZE) == -1) {
+    nbdkit_error ("cache: reclaiming cache blocks: "
+                  "fallocate: FALLOC_FL_PUNCH_HOLE: %m");
+    return;
+  }
+#else
+#error "no implementation for punching holes"
+#endif
+
+  bitmap_set_blk (&bm, reclaim_blk, BLOCK_NOT_CACHED);
+}
+
+#else /* !HAVE_CACHE_RECLAIM */
+static void
+reclaim (void)
+{
+  /* nothing */
+}
+#endif /* !HAVE_CACHE_RECLAIM */
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index 580e4f5..5ea7b6a 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -65,6 +65,8 @@
 static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 
 enum cache_mode cache_mode = CACHE_MODE_WRITEBACK;
+int64_t max_size = -1;
+int hi_thresh = 95, lo_thresh = 80;
 bool cache_on_read = false;
 
 static int cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void
*handle, uint32_t flags, int *err);
@@ -104,6 +106,44 @@ cache_config (nbdkit_next_config *next, void *nxdata,
       return -1;
     }
   }
+#ifdef HAVE_CACHE_RECLAIM
+  else if (strcmp (key, "cache-max-size") == 0) {
+    int64_t r;
+
+    r = nbdkit_parse_size (value);
+    if (r == -1)
+      return -1;
+    /* We set a lower limit for the cache size just to keep out of
+     * trouble.
+     */
+    if (r < 1024*1024) {
+      nbdkit_error ("cache-max-size is too small");
+      return -1;
+    }
+    max_size = r;
+    return 0;
+  }
+  else if (strcmp (key, "cache-high-threshold") == 0) {
+    if (sscanf (value, "%d", &hi_thresh) != 1) {
+      nbdkit_error ("invalid cache-high-threshold parameter: %s",
value);
+      return -1;
+    }
+    if (hi_thresh <= 0) {
+      nbdkit_error ("cache-high-threshold must be greater than
zero");
+    }
+    return 0;
+  }
+  else if (strcmp (key, "cache-low-threshold") == 0) {
+    if (sscanf (value, "%d", &lo_thresh) != 1) {
+      nbdkit_error ("invalid cache-low-threshold parameter: %s",
value);
+      return -1;
+    }
+    if (lo_thresh <= 0) {
+      nbdkit_error ("cache-low-threshold must be greater than zero");
+    }
+    return 0;
+  }
+#endif
   else if (strcmp (key, "cache-on-read") == 0) {
     int r;
 
@@ -118,6 +158,21 @@ cache_config (nbdkit_next_config *next, void *nxdata,
   }
 }
 
+static int
+cache_config_complete (nbdkit_next_config_complete *next, void *nxdata)
+{
+  /* If cache-max-size was set then check the thresholds. */
+  if (max_size != -1) {
+    if (lo_thresh >= hi_thresh) {
+      nbdkit_error ("cache-low-threshold must be "
+                    "less than cache-high-threshold");
+      return -1;
+    }
+  }
+
+  return next (nxdata);
+}
+
 static void *
 cache_open (nbdkit_next_open *next, void *nxdata, int readonly)
 {
@@ -415,6 +470,7 @@ static struct nbdkit_filter filter = {
   .load              = cache_load,
   .unload            = cache_unload,
   .config            = cache_config,
+  .config_complete   = cache_config_complete,
   .open              = cache_open,
   .prepare           = cache_prepare,
   .get_size          = cache_get_size,
diff --git a/filters/cache/lru.c b/filters/cache/lru.c
index 60cf379..8159040 100644
--- a/filters/cache/lru.c
+++ b/filters/cache/lru.c
@@ -109,8 +109,11 @@ lru_set_size (uint64_t new_size)
   if (bitmap_resize (&bm[1], new_size) == -1)
     return -1;
 
-  /* XXX Choose this better. */
-  N = MAX (new_size / BLKSIZE / 4, 100);
+  if (max_size != -1)
+    /* Make the threshold about 1/4 the maximum size of the cache. */
+    N = MAX (max_size / BLKSIZE / 4, 100);
+  else
+    N = MAX (new_size / BLKSIZE / 4, 100);
 
   return 0;
 }
diff --git a/TODO b/TODO
index 9d6e727..7107ea9 100644
--- a/TODO
+++ b/TODO
@@ -100,15 +100,8 @@ Suggestions for filters
 * masking plugin features for testing clients (see 'nozero' and
'fua'
   filters for examples)
 
-nbdkit-cache-filter needs considerable work:
-
-* allow the user to limit the total size of the cache
-
-* handle ENOSPC errors
-
-* implement some cache replacement policies
-
-* some sort of background task or thread to write out dirty blocks
+* nbdkit-cache-filter should handle ENOSPC errors automatically by
+  reclaiming blocks from the cache
 
 Composing nbdkit
 ----------------
diff --git a/tests/Makefile.am b/tests/Makefile.am
index ccd5ff5..3c9eb4d 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -704,7 +704,8 @@ TESTS += test-blocksize.sh
 if HAVE_GUESTFISH
 TESTS += \
 	test-cache.sh \
-	test-cache-on-read.sh
+	test-cache-on-read.sh \
+	test-cache-max-size.sh
 endif HAVE_GUESTFISH
 
 # cow filter test.
diff --git a/tests/test-cache-max-size.sh b/tests/test-cache-max-size.sh
new file mode 100755
index 0000000..5e228da
--- /dev/null
+++ b/tests/test-cache-max-size.sh
@@ -0,0 +1,90 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+source ./functions.sh
+set -e
+set -x
+
+files="cache-max-size.img cache-max-size.sock cache-max-size.pid"
+rm -f $files
+cleanup_fn rm -f $files
+
+# Create an empty base image.
+truncate -s 1G cache-max-size.img
+
+# Run nbdkit with the caching filter and a low size limit to ensure
+# that the reclaim code is exercised.
+start_nbdkit -P cache-max-size.pid -U cache-max-size.sock \
+             --filter=cache \
+             file cache-max-size.img \
+             cache-max-size=10M
+
+# Open the overlay and perform some operations.
+guestfish --format=raw -a "nbd://?socket=$PWD/cache-max-size.sock"
<<'EOF'
+  run
+  part-disk /dev/sda gpt
+  mkfs ext4 /dev/sda1
+  mount /dev/sda1 /
+  fill-dir / 10000
+  fill-pattern "abcde" 5M /large1
+  sync
+  fill-pattern "abcde" 5M /large2
+  sync
+  fill-pattern "abcde" 5M /large3
+  sync
+  fill-pattern "abcde" 5M /large4
+  sync
+  fill-pattern "abcde" 5M /large5
+  sync
+  fill-pattern "abcde" 5M /large6
+  sync
+  fill-pattern "abcde" 5M /large7
+  sync
+  fill-pattern "abcde" 5M /large8
+  sync
+  write /hello "hello, world"
+  sync
+EOF
+
+# Check the last files we created exist.
+guestfish --ro -a cache-max-size.img -m /dev/sda1 <<'EOF'
+  cat /hello
+  cat /large1 | cat >/dev/null
+  cat /large2 | cat >/dev/null
+  cat /large3 | cat >/dev/null
+  cat /large4 | cat >/dev/null
+  cat /large5 | cat >/dev/null
+  cat /large6 | cat >/dev/null
+  cat /large7 | cat >/dev/null
+  cat /large8 | cat >/dev/null
+EOF
-- 
2.19.2
Richard W.M. Jones
2018-Dec-29  07:25 UTC
[Libguestfs] [PATCH v2 nbdkit 9/9] cache: Implement cache-max-size and method of reclaiming space from the cache.
This is version 2 of patch 9/9. Patches 1 through 8 didn't change so I didn't repost them. v2: - Implement a proper test. This turned out to be quite a lot more complicated than anticipated. - Fix the call to fallocate. It turns out that fallocate hole punching silently does nothing unless you use ‘FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE’. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top --tNQTSEo8WG/FKZ8E Content-Type: text/plain; charset=utf-8 Content-Disposition: attachment; filename="0001-cache-Implement-cache-max-size-and-method-of-reclaim.patch" Content-Transfer-Encoding: 8bit>From 0be67694b75c802254d79f66c31bc31989ea2635 Mon Sep 17 00:00:00 2001From: "Richard W.M. Jones" <rjones@redhat.com> Date: Sat, 22 Dec 2018 15:46:14 +0000 Subject: [PATCH] cache: Implement cache-max-size and method of reclaiming space from the cache. The original plan was to have a background thread doing the reclaim. However that cannot work given the design of filters, because a background thread cannot access the next_ops struct which is only available during requests. Therefore we spread the work over the request threads. Each blk_* function checks whether there is work to do, and if there is will reclaim up to two blocks from the cache (thus ensuring that we always make forward progress, since each blk_* function can only add at most one block to the cache). --- filters/cache/nbdkit-cache-filter.pod | 71 +++++++++++ filters/cache/cache.h | 13 ++ filters/cache/blk.c | 163 ++++++++++++++++++++++++++ filters/cache/cache.c | 56 +++++++++ filters/cache/lru.c | 7 +- TODO | 11 +- tests/Makefile.am | 3 +- tests/test-cache-max-size.sh | 96 +++++++++++++++ 8 files changed, 408 insertions(+), 12 deletions(-) create mode 100755 tests/test-cache-max-size.sh diff --git a/filters/cache/nbdkit-cache-filter.pod b/filters/cache/nbdkit-cache-filter.pod index bfbada0..c2b8e5c 100644 --- a/filters/cache/nbdkit-cache-filter.pod +++ b/filters/cache/nbdkit-cache-filter.pod @@ -5,6 +5,9 @@ nbdkit-cache-filter - nbdkit caching filter =head1 SYNOPSIS nbdkit --filter=cache plugin [cache=writeback|writethrough|unsafe] + [cache-max-size=SIZE] + [cache-high-threshold=N] + [cache-low-threshold=N] [cache-on-read=true|false] [plugin-args...] @@ -51,6 +54,14 @@ This is dangerous and can cause data loss, but this may be acceptable if you only use it for testing or with data that you don't care about or can cheaply reconstruct. +=item B<cache-max-size=SIZE> + +=item B<cache-high-threshold=N> + +=item B<cache-low-threshold=N> + +Limit the size of the cache to C<SIZE>. See L</CACHE MAXIMUM SIZE> below. + =item B<cache-on-read=true> Cache read requests as well as write requests. Any time a block is @@ -63,6 +74,66 @@ Do not cache read requests (this is the default). =back +=head1 CACHE MAXIMUM SIZE + +By default the cache can grow to any size (although not larger than +the virtual size of the underlying plugin) and you have to ensure +there is sufficient space in C<$TMPDIR> for it. + +Using the parameters C<cache-max-size>, C<cache-high-threshold> and +C<cache-low-threshold> you can limit the maximum size of the cache. + +Some examples: + +=over 4 + +=item C<cache-max-size=1G> + +The cache is limited to around 1 gigabyte. + +=item C<cache-max-size=1G cache-high-threshold=95 cache-low-threshold=80> + +Once the cache size reaches 972M (95% of 1G), blocks are reclaimed +from the cache until the size is reduced to 819M (80% of 1G). + +=back + +The way this works is once the size of the cache exceeds +S<C<SIZE> ✕ the high threshold>, the filter works to reduce the size +of the cache until it is less than S<C<SIZE> ✕ the low threshold>. +Once the size is below the low threshold, no more reclaim work is done +until the size exceeds the high threshold again. + +The default thresholds are high 95% and low 80%. The thresholds are +expressed as percentages of C<cache-max-size>, and may be greater than +100. + +The reclaiming work discards blocks from the cache in the following +order of priority: + +=over 4 + +=item * + +(Highest priority / first to be reclaimed.) +Least recently used clean blocks are discarded from the cache. + +=item * + +Least recently used dirty blocks are flushed to the plugin and +discarded. + +=item * + +Recently used clean blocks are discarded. + +=item * + +(Lowest priority / last to be reclaimed.) +Recently used dirty blocks are flushed to the plugin and discarded. + +=back + =head1 ENVIRONMENT VARIABLES =over 4 diff --git a/filters/cache/cache.h b/filters/cache/cache.h index 50416b1..233a4fa 100644 --- a/filters/cache/cache.h +++ b/filters/cache/cache.h @@ -34,6 +34,8 @@ #ifndef NBDKIT_CACHE_H #define NBDKIT_CACHE_H +#include <fcntl.h> + /* Size of a block in the cache. A 4K block size means that we need * 64 MB of memory to store the bitmaps for a 1 TB underlying image. * It is also smaller than the usual hole size for sparse files, which @@ -48,7 +50,18 @@ extern enum cache_mode { CACHE_MODE_UNSAFE, } cache_mode; +/* Maximum size of the cache and high/low thresholds. */ +extern int64_t max_size; +extern int hi_thresh, lo_thresh; + /* Cache read requests. */ extern bool cache_on_read; +/* Do we support reclaiming cache blocks? */ +#ifdef FALLOC_FL_PUNCH_HOLE +#define HAVE_CACHE_RECLAIM 1 +#else +#undef HAVE_CACHE_RECLAIM +#endif + #endif /* NBDKIT_CACHE_H */ diff --git a/filters/cache/blk.c b/filters/cache/blk.c index b256446..6f67a7f 100644 --- a/filters/cache/blk.c +++ b/filters/cache/blk.c @@ -46,6 +46,8 @@ #include <unistd.h> #include <fcntl.h> #include <errno.h> +#include <sys/types.h> +#include <sys/stat.h> #include <nbdkit-filter.h> @@ -74,6 +76,31 @@ enum bm_entry { BLOCK_DIRTY = 3, }; +#ifdef HAVE_CACHE_RECLAIM +/* If we are currently reclaiming blocks from the cache. + * + * The state machine starts in the NOT_RECLAIMING state. When the + * size of the cache exceeds the high threshold, we move to + * RECLAIMING_LRU. Once we have exhausted all LRU blocks, we move to + * RECLAIMING_ANY (reclaiming any blocks). + * + * If at any time the size of the cache goes below the low threshold + * we move back to the NOT_RECLAIMING state. + * + * reclaim_blk is the last block that we will looked at. + */ +enum reclaim_state { + NOT_RECLAIMING = 0, + RECLAIMING_LRU = 1, + RECLAIMING_ANY = 2, +}; + +static enum reclaim_state reclaiming = NOT_RECLAIMING; +static uint64_t reclaim_blk; +#endif + +static void reclaim (void); + int blk_init (void) { @@ -146,6 +173,8 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata, off_t offset = blknum * BLKSIZE; enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_CACHED); + reclaim (); + nbdkit_debug ("cache: blk_read block %" PRIu64 " (offset %" PRIu64 ") is %s", blknum, (uint64_t) offset, state == BLOCK_NOT_CACHED ? "not cached" : @@ -193,6 +222,8 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata, { off_t offset = blknum * BLKSIZE; + reclaim (); + nbdkit_debug ("cache: writethrough block %" PRIu64 " (offset %" PRIu64 ")", blknum, (uint64_t) offset); @@ -224,6 +255,8 @@ blk_write (struct nbdkit_next_ops *next_ops, void *nxdata, offset = blknum * BLKSIZE; + reclaim (); + nbdkit_debug ("cache: writeback block %" PRIu64 " (offset %" PRIu64 ")", blknum, (uint64_t) offset); @@ -254,3 +287,133 @@ for_each_dirty_block (block_callback f, void *vp) return 0; } + +#ifdef HAVE_CACHE_RECLAIM + +static void reclaim_one (void); +static void reclaim_lru (void); +static void reclaim_any (void); +static void reclaim_block (void); + +/* See if we need to reclaim blocks. */ +static void +reclaim (void) +{ + struct stat statbuf; + uint64_t cache_allocated; + + /* If the user didn't set cache-max-size, do nothing. */ + if (max_size == -1) return; + + /* Check the allocated size of the cache. */ + if (fstat (fd, &statbuf) == -1) { + nbdkit_debug ("cache: fstat: %m"); + return; + } + cache_allocated = statbuf.st_blocks * UINT64_C(512); + + if (reclaiming) { + /* Keep reclaiming until the cache size drops below the low threshold. */ + if (cache_allocated < max_size * lo_thresh / 100) { + nbdkit_debug ("cache: stop reclaiming"); + reclaiming = NOT_RECLAIMING; + return; + } + } + else { + if (cache_allocated < max_size * hi_thresh / 100) + return; + + /* Start reclaiming if the cache size goes over the high threshold. */ + nbdkit_debug ("cache: start reclaiming"); + reclaiming = RECLAIMING_LRU; + } + + /* Reclaim up to 2 cache blocks. */ + reclaim_one (); + reclaim_one (); +} + +/* Reclaim a single cache block. */ +static void +reclaim_one (void) +{ + assert (reclaiming); + + if (reclaiming == RECLAIMING_LRU) + reclaim_lru (); + else + reclaim_any (); +} + +static void +reclaim_lru (void) +{ + uint64_t old_reclaim_blk; + + /* Find the next block in the cache. */ + reclaim_blk = bitmap_next (&bm, reclaim_blk+1); + old_reclaim_blk = reclaim_blk; + + /* Search for an LRU block after this one. */ + do { + if (! lru_has_been_recently_accessed (reclaim_blk)) { + reclaim_block (); + return; + } + + reclaim_blk = bitmap_next (&bm, reclaim_blk+1); + if (reclaim_blk == -1) /* wrap around */ + reclaim_blk = bitmap_next (&bm, 0); + } while (reclaim_blk >= 0 && old_reclaim_blk != reclaim_blk); + + if (old_reclaim_blk == reclaim_blk) { + /* Run out of LRU blocks, so start reclaiming any block in the cache. */ + nbdkit_debug ("cache: reclaiming any blocks"); + reclaiming = RECLAIMING_ANY; + reclaim_one (); + return; + } +} + +static void +reclaim_any (void) +{ + /* Find the next block in the cache. */ + reclaim_blk = bitmap_next (&bm, reclaim_blk+1); + if (reclaim_blk == -1) /* wrap around */ + reclaim_blk = bitmap_next (&bm, 0); + + reclaim_block (); +} + +static void +reclaim_block (void) +{ + if (reclaim_blk == -1) { + nbdkit_debug ("cache: run out of blocks to reclaim!"); + return; + } + + nbdkit_debug ("cache: reclaiming block %" PRIu64, reclaim_blk); +#ifdef FALLOC_FL_PUNCH_HOLE + if (fallocate (fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, + reclaim_blk * BLKSIZE, BLKSIZE) == -1) { + nbdkit_error ("cache: reclaiming cache blocks: " + "fallocate: FALLOC_FL_PUNCH_HOLE: %m"); + return; + } +#else +#error "no implementation for punching holes" +#endif + + bitmap_set_blk (&bm, reclaim_blk, BLOCK_NOT_CACHED); +} + +#else /* !HAVE_CACHE_RECLAIM */ +static void +reclaim (void) +{ + /* nothing */ +} +#endif /* !HAVE_CACHE_RECLAIM */ diff --git a/filters/cache/cache.c b/filters/cache/cache.c index 580e4f5..5ea7b6a 100644 --- a/filters/cache/cache.c +++ b/filters/cache/cache.c @@ -65,6 +65,8 @@ static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; enum cache_mode cache_mode = CACHE_MODE_WRITEBACK; +int64_t max_size = -1; +int hi_thresh = 95, lo_thresh = 80; bool cache_on_read = false; static int cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle, uint32_t flags, int *err); @@ -104,6 +106,44 @@ cache_config (nbdkit_next_config *next, void *nxdata, return -1; } } +#ifdef HAVE_CACHE_RECLAIM + else if (strcmp (key, "cache-max-size") == 0) { + int64_t r; + + r = nbdkit_parse_size (value); + if (r == -1) + return -1; + /* We set a lower limit for the cache size just to keep out of + * trouble. + */ + if (r < 1024*1024) { + nbdkit_error ("cache-max-size is too small"); + return -1; + } + max_size = r; + return 0; + } + else if (strcmp (key, "cache-high-threshold") == 0) { + if (sscanf (value, "%d", &hi_thresh) != 1) { + nbdkit_error ("invalid cache-high-threshold parameter: %s", value); + return -1; + } + if (hi_thresh <= 0) { + nbdkit_error ("cache-high-threshold must be greater than zero"); + } + return 0; + } + else if (strcmp (key, "cache-low-threshold") == 0) { + if (sscanf (value, "%d", &lo_thresh) != 1) { + nbdkit_error ("invalid cache-low-threshold parameter: %s", value); + return -1; + } + if (lo_thresh <= 0) { + nbdkit_error ("cache-low-threshold must be greater than zero"); + } + return 0; + } +#endif else if (strcmp (key, "cache-on-read") == 0) { int r; @@ -118,6 +158,21 @@ cache_config (nbdkit_next_config *next, void *nxdata, } } +static int +cache_config_complete (nbdkit_next_config_complete *next, void *nxdata) +{ + /* If cache-max-size was set then check the thresholds. */ + if (max_size != -1) { + if (lo_thresh >= hi_thresh) { + nbdkit_error ("cache-low-threshold must be " + "less than cache-high-threshold"); + return -1; + } + } + + return next (nxdata); +} + static void * cache_open (nbdkit_next_open *next, void *nxdata, int readonly) { @@ -415,6 +470,7 @@ static struct nbdkit_filter filter = { .load = cache_load, .unload = cache_unload, .config = cache_config, + .config_complete = cache_config_complete, .open = cache_open, .prepare = cache_prepare, .get_size = cache_get_size, diff --git a/filters/cache/lru.c b/filters/cache/lru.c index 60cf379..8159040 100644 --- a/filters/cache/lru.c +++ b/filters/cache/lru.c @@ -109,8 +109,11 @@ lru_set_size (uint64_t new_size) if (bitmap_resize (&bm[1], new_size) == -1) return -1; - /* XXX Choose this better. */ - N = MAX (new_size / BLKSIZE / 4, 100); + if (max_size != -1) + /* Make the threshold about 1/4 the maximum size of the cache. */ + N = MAX (max_size / BLKSIZE / 4, 100); + else + N = MAX (new_size / BLKSIZE / 4, 100); return 0; } diff --git a/TODO b/TODO index 9d6e727..7107ea9 100644 --- a/TODO +++ b/TODO @@ -100,15 +100,8 @@ Suggestions for filters * masking plugin features for testing clients (see 'nozero' and 'fua' filters for examples) -nbdkit-cache-filter needs considerable work: - -* allow the user to limit the total size of the cache - -* handle ENOSPC errors - -* implement some cache replacement policies - -* some sort of background task or thread to write out dirty blocks +* nbdkit-cache-filter should handle ENOSPC errors automatically by + reclaiming blocks from the cache Composing nbdkit ---------------- diff --git a/tests/Makefile.am b/tests/Makefile.am index ccd5ff5..3c9eb4d 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -704,7 +704,8 @@ TESTS += test-blocksize.sh if HAVE_GUESTFISH TESTS += \ test-cache.sh \ - test-cache-on-read.sh + test-cache-on-read.sh \ + test-cache-max-size.sh endif HAVE_GUESTFISH # cow filter test. diff --git a/tests/test-cache-max-size.sh b/tests/test-cache-max-size.sh new file mode 100755 index 0000000..4d34dbc --- /dev/null +++ b/tests/test-cache-max-size.sh @@ -0,0 +1,96 @@ +#!/usr/bin/env bash +# nbdkit +# Copyright (C) 2018 Red Hat Inc. +# All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are +# met: +# +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# +# * Neither the name of Red Hat nor the names of its contributors may be +# used to endorse or promote products derived from this software without +# specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, +# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF +# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT +# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +# SUCH DAMAGE. + +source ./functions.sh +set -e +set -x + +# Check that this is a Linux-like system supporting /proc/pid/fd. +if ! test -d /proc/self/fd; then + echo "$0: not a Linux-like system supporting /proc/pid/fd" + exit 77 +fi + +# Test that qemu-io works +if ! qemu-io --help >/dev/null; then + echo "$0: missing or broken qemu-io" + exit 77 +fi + +# Need the stat command from coreutils. +if ! stat --version >/dev/null; then + echo "$0: missing or broken stat command" + exit 77 +fi + +d=cache-max-size.d +rm -rf $d +mkdir -p $d +cleanup_fn rm -rf $d + +# Create a cache directory. +mkdir $d/cache +TMPDIR=$d/cache +export TMPDIR + +# Create an empty base image. +truncate -s 1G $d/cache-max-size.img + +# Run nbdkit with the caching filter and a low size limit to ensure +# that the reclaim code is exercised. +start_nbdkit -P $d/cache-max-size.pid -U $d/cache-max-size.sock \ + --filter=cache \ + file $d/cache-max-size.img \ + cache-max-size=10M cache-on-read=true + +# Write > 10M to the plugin. +qemu-io -f raw "nbd+unix://?socket=$d/cache-max-size.sock" \ + -c "w -P 0 0 10M" \ + -c "w -P 0 20M 10M" \ + -c "r 20M 10M" \ + -c "r 30M 10M" \ + -c "w -P 0 40M 10M" + +# We can't use ‘du’ on the cache directory because the cache file is +# deleted by the filter, and so is only accessible via /proc/PID/fd. +# Get the /proc link to the cache file, and the size of it in bytes. +fddir="/proc/$( cat $d/cache-max-size.pid )/fd" +ls -l $fddir +fd="$fddir/$( ls -l $fddir | grep $TMPDIR | head -1 | awk '{print $9}' )" +stat -L $fd +size=$(( $(stat -L -c '%b * %B' $fd) )) + +if [ "$size" -gt $(( 11 * 1024 * 1024 )) ]; then + echo "$0: cache size is larger than 10M (actual size: $size bytes)" + exit 1 +fi -- 2.19.2 --tNQTSEo8WG/FKZ8E--
Eric Blake
2018-Dec-31  22:59 UTC
Re: [Libguestfs] [PATCH nbdkit 2/9] cache: Add cache-on-read mode.
On 12/28/18 12:45 PM, Richard W.M. Jones wrote:> The same as qemu's copyonread flag, this caches read requests. > --- > filters/cache/nbdkit-cache-filter.pod | 11 +++++ > filters/cache/cache.c | 37 +++++++++++++-- > tests/Makefile.am | 4 +- > tests/test-cache-on-read.sh | 66 +++++++++++++++++++++++++++ > 4 files changed, 114 insertions(+), 4 deletions(-)Makes sense.> +=item B<cache-on-read=true> > + > +Cache read requests as well as write requests. Any time a block is > +read from the plugin, it is saved in the cache (if there is sufficient > +space) so the same data can be served more quickly later.Is it worth mentioning that this is fine for a client that is expected to be the only writing client of a given server; but that it can result in stale data if the server can modify the data via other means? (In particular, since we don't implement NBD_FLAG_CAN_MULTI_CONN, we already admit that two clients trying to write in parallel are not guaranteed to see consistent read results after a flush - while caching read data only makes that more apparent)> +++ b/tests/test-cache-on-read.shWhat you have here looks okay, but I have a possible idea for an additional test: use the delay filter to prove that repeated reads of a region of the disk only suffer a one-time read penalty, rather than a penalty per read. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2018-Dec-31  23:18 UTC
Re: [Libguestfs] [PATCH nbdkit 5/9] cache: Allow this filter to serve requests in parallel.
On 12/28/18 12:45 PM, Richard W.M. Jones wrote:> Make the implicit lock explicit, and hold it around blk_* operations. > This allows us to relax the thread model for the filter to > NBDKIT_THREAD_MODEL_PARALLEL. > --- > filters/cache/blk.h | 7 ++++++ > filters/cache/cache.c | 57 +++++++++++++++++++++++++++++++------------ > 2 files changed, 49 insertions(+), 15 deletions(-)Nice. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Reasonably Related Threads
- [PATCH nbdkit v5 3/3] cache: Implement cache-max-size and cache space reclaim.
- [PATCH nbdkit v3 0/2] cache: Implement cache-max-size and method of reclaiming space from the cache.
- [PATCH nbdkit v4 0/2] cache: Implement cache-max-size and method of
- [PATCH nbdkit v2 0/4] cache: Implement cache-max-size etc.
- [PATCH nbdkit] common: Move shared bitmap code to a common library.