Eric Blake
2022-May-26 21:00 UTC
[Libguestfs] [nbdkit PATCH v2] blocksize: Avoid losing aligned writes to RMW race
[We are still investigating if a CVE needs to be assigned.]

We have a bug where the blocksize filter can lose the effects of an
aligned write that happens in parallel with an overlapping unaligned
write.  In general, a client should be extremely cautious about
overlapping writes; it is not NBD's fault which of the two writes
lands on the disk (or even, if both writes land partially, which one
landed last at block-level granularity).  However, a client should
still be able to assume that even when two writes partially overlap,
the final state on disk of the non-overlapping portion will be that of
the one write explicitly touching that portion, rather than stale data
from before either write touched that block.

Fix the bug by switching the blocksize filter locking from a mutex to
a rwlock.  We still need mutual exclusion during all unaligned
operations (whether reading or writing to the plugin), because we
share the same bounce buffer among all such code; we get that via the
mutual exclusion of wrlock.  But we now also add shared exclusion for
aligned write-like operations (pwrite, trim, zero): these can operate
in parallel with one another, but must not be allowed to fall in the
window between when an overlapping unaligned operation has read from
the plugin but not yet written, so that we do not end up re-writing
stale contents that undo the portion of the aligned write that was
outside the unaligned region [it's ironic that we only need this
protection on write I/O, but get it by using the rdlock feature of the
rwlock].  Read-like operations (pread, extents, cache) don't need to
be protected from RMW operations, because it is already the user's
problem if they execute parallel overlapping reads while modifying the
same portion of the image; they will still end up seeing either the
state before or after the unaligned write, indistinguishable from if
we had added more locking.

Whether this lock favors rdlock (aka aligned pwrite) or wrlock (aka
unaligned access) is implementation-defined by the pthread
implementation; if we need to be more precise, we'll have to introduce
our own rwlock implementation on top of lower-level primitives
(https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock mentions
the use of two mutexes to favor readers, or a condvar/mutex to favor
writers).  This is also not very fine-grained; it may turn out that we
could get more performance if, instead of locking the entire operation
(across all clients, and regardless of whether the offsets overlap),
we just lock metadata and then track whether a new operation overlaps
with an existing unaligned operation, or even add support for parallel
unaligned operations by having more than one bounce buffer.  But if
performance truly matters, you're better off fixing the client to do
aligned access in the first place, rather than needing the blocksize
filter.
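For reference, scope-based lock guards of this kind can be built on
GCC/Clang's __attribute__((cleanup)); the following is a minimal
illustrative sketch of the ACQUIRE_*_FOR_CURRENT_SCOPE pattern used by
the patch (a simplified stand-in, not nbdkit's actual helper code, and
with error checking omitted):

  #include <pthread.h>

  static void
  unlock_rwlock (pthread_rwlock_t **lk)
  {
    if (*lk)
      pthread_rwlock_unlock (*lk);
  }

  /* Exclusive: taken around any code that touches the shared bounce
   * buffer (the unaligned head/tail RMW paths).  The lock is dropped
   * automatically when the enclosing scope exits.
   */
  #define ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE(lk)                \
    __attribute__ ((cleanup (unlock_rwlock)))                 \
      pthread_rwlock_t *_scoped_wrlock = (lk);                \
    pthread_rwlock_wrlock (_scoped_wrlock)

  /* Shared: taken around aligned pwrite/trim/zero, which may run in
   * parallel with each other but are excluded from the window between
   * the read and the write of any in-flight unaligned RMW.
   */
  #define ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE(lk)                \
    __attribute__ ((cleanup (unlock_rwlock)))                 \
      pthread_rwlock_t *_scoped_rdlock = (lk);                \
    pthread_rwlock_rdlock (_scoped_rdlock)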
Add a test that fails without the change to blocksize.c.  With some
carefully timed setups (using the delay filter to stall writes
reaching the plugin, and the plugin itself to stall reads coming back
from the plugin, so that we are reasonably sure of the overall
timeline), we can demonstrate the bug present in the unpatched code,
where an aligned write is lost when it lands in the wrong part of an
unaligned RMW cycle; the fixed code instead shows that the overlapping
operations were serialized.  We may need to further tweak the test to
be more reliable even under heavy load, but hopefully 2 seconds per
stall (where a successful test requires 18 seconds) is sufficient for
now.

Reported-by: Nikolaus Rath <Nikolaus at rath.org>
Fixes: 1aadbf9971 ("blocksize: Allow parallel requests", v1.25.3)
---

In v2:
- Implement the blocksize fix.
- Drop RFC; I think this is ready, other than determining if we want
  to tag it with a CVE number.
- Improve the test: print out more details before assertions, to aid
  in debugging if it ever dies under CI resources

 tests/Makefile.am                |   2 +
 filters/blocksize/blocksize.c    |  26 +++--
 tests/test-blocksize-sharding.sh | 168 +++++++++++++++++++++++++++++++
 3 files changed, 187 insertions(+), 9 deletions(-)
 create mode 100755 tests/test-blocksize-sharding.sh

diff --git a/tests/Makefile.am b/tests/Makefile.am
index b6d30c0a..67e7618b 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -1376,11 +1376,13 @@ TESTS += \
 	test-blocksize.sh \
 	test-blocksize-extents.sh \
 	test-blocksize-default.sh \
+	test-blocksize-sharding.sh \
 	$(NULL)
 EXTRA_DIST += \
 	test-blocksize.sh \
 	test-blocksize-extents.sh \
 	test-blocksize-default.sh \
+	test-blocksize-sharding.sh \
 	$(NULL)

 # blocksize-policy filter test.
diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
index 03da4971..1c81d5e3 100644
--- a/filters/blocksize/blocksize.c
+++ b/filters/blocksize/blocksize.c
@@ -51,10 +51,15 @@

 #define BLOCKSIZE_MIN_LIMIT (64U * 1024)

-/* In order to handle parallel requests safely, this lock must be held
- * when using the bounce buffer.
+/* Lock in order to handle overlapping requests safely.
+ *
+ * Grabbed for exclusive access (wrlock) when using the bounce buffer.
+ *
+ * Grabbed for shared access (rdlock) when doing aligned writes.
+ * These can happen in parallel with one another, but must not land in
+ * between the read and write of an unaligned RMW operation.
  */
-static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

 /* A single bounce buffer for alignment purposes, guarded by the lock.
  * Size it to the maximum we allow for minblock.
@@ -255,7 +260,7 @@ blocksize_pread (nbdkit_next *next,

   /* Unaligned head */
   if (offs & (h->minblock - 1)) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     drop = offs & (h->minblock - 1);
     keep = MIN (h->minblock - drop, count);
     if (next->pread (next, bounce, h->minblock, offs - drop, flags, err) == -1)
@@ -278,7 +283,7 @@ blocksize_pread (nbdkit_next *next,

   /* Unaligned tail */
   if (count) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     if (next->pread (next, bounce, h->minblock, offs, flags, err) == -1)
       return -1;
     memcpy (buf, bounce, count);
@@ -306,7 +311,7 @@ blocksize_pwrite (nbdkit_next *next,

   /* Unaligned head */
   if (offs & (h->minblock - 1)) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     drop = offs & (h->minblock - 1);
     keep = MIN (h->minblock - drop, count);
     if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
@@ -321,6 +326,7 @@ blocksize_pwrite (nbdkit_next *next,

   /* Aligned body */
   while (count >= h->minblock) {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
     keep = MIN (h->maxdata, ROUND_DOWN (count, h->minblock));
     if (next->pwrite (next, buf, keep, offs, flags, err) == -1)
       return -1;
@@ -331,7 +337,7 @@ blocksize_pwrite (nbdkit_next *next,

   /* Unaligned tail */
   if (count) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
       return -1;
     memcpy (bounce, buf, count);
@@ -371,6 +377,7 @@ blocksize_trim (nbdkit_next *next,

   /* Aligned body */
   while (count) {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
     keep = MIN (h->maxlen, count);
     if (next->trim (next, keep, offs, flags, err) == -1)
       return -1;
@@ -413,7 +420,7 @@ blocksize_zero (nbdkit_next *next,

   /* Unaligned head */
   if (offs & (h->minblock - 1)) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     drop = offs & (h->minblock - 1);
     keep = MIN (h->minblock - drop, count);
     if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
@@ -428,6 +435,7 @@ blocksize_zero (nbdkit_next *next,

   /* Aligned body */
   while (count >= h->minblock) {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
     keep = MIN (h->maxlen, ROUND_DOWN (count, h->minblock));
     if (next->zero (next, keep, offs, flags, err) == -1)
       return -1;
@@ -437,7 +445,7 @@ blocksize_zero (nbdkit_next *next,

   /* Unaligned tail */
   if (count) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
       return -1;
     memset (bounce, 0, count);
diff --git a/tests/test-blocksize-sharding.sh b/tests/test-blocksize-sharding.sh
new file mode 100755
index 00000000..177668ed
--- /dev/null
+++ b/tests/test-blocksize-sharding.sh
@@ -0,0 +1,168 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2022 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+# Demonstrate a fix for a bug where blocksize could lose aligned writes
+# run in parallel with unaligned writes
+
+source ./functions.sh
+set -e
+set -x
+
+requires_plugin eval
+requires_nbdsh_uri
+
+files='blocksize-sharding.img blocksize-sharding.tmp'
+rm -f $files
+cleanup_fn rm -f $files
+
+# Script a server that requires 16-byte aligned requests, and which delays
+# 4s after reads if a witness file exists.  Couple it with the delay filter
+# that delays 2s before writes.  If an unaligned and aligned write overlap,
+# and can execute in parallel, we would have this timeline:
+#
+#      T1 aligned write 1's to 0/16    T2 unaligned write 2's to 4/12
+#t=0   blocksize: next->pwrite(0, 16)  blocksize: next->pread(0, 16)
+#      delay: wait 2s                  delay: next->pread(0, 16)
+#      ...                             eval: read 0's, wait 4s
+#t=2   delay: next->pwrite(0, 16)      ...
+#      eval: write 1's                 ...
+#      return                          ...
+#t=4                                   return 0's (now stale)
+#                                      blocksize: next->pwrite(0, 16)
+#                                      delay: wait 2s
+#t=6                                   delay: next->pwrite(0, 16)
+#                                      eval: write stale RMW buffer
+#
+# leaving us with a sharded 0000222222222222 (T1's write is lost).
+# But as long as the blocksize filter detects the overlap, we should end
+# up with either 1111222222222222 (aligned write completed first), or with
+# 1111111111111111 (unaligned write completed first), either taking 8s,
+# but with no sharding.
+#
+# We also need an nbdsh script that kicks off parallel writes.
+export script='
+import os
+import time
+
+witness = os.getenv("witness")
+
+def touch(path):
+    open(path, "a").close()
+
+# First pass: check that two aligned operations work in parallel
+# Total time should be closer to 2 seconds, rather than 4 if serialized
+print("sanity check")
+ba1 = bytearray(b"1"*16)
+ba2 = bytearray(b"2"*16)
+buf1 = nbd.Buffer.from_bytearray(ba1)
+buf2 = nbd.Buffer.from_bytearray(ba2)
+touch(witness)
+start_t = time.time()
+h.aio_pwrite(buf1, 0)
+h.aio_pwrite(buf2, 0)
+
+while h.aio_in_flight() > 0:
+    h.poll(-1)
+end_t = time.time()
+os.unlink(witness)
+
+out = h.pread(16,0)
+print(out)
+t = end_t - start_t
+print(t)
+assert out in [b"1"*16, b"2"*16]
+assert t >= 2 and t <= 3
+
+# Next pass: try to kick off unaligned first
+print("unaligned first")
+h.zero(16, 0)
+ba3 = bytearray(b"3"*12)
+ba4 = bytearray(b"4"*16)
+buf3 = nbd.Buffer.from_bytearray(ba3)
+buf4 = nbd.Buffer.from_bytearray(ba4)
+touch(witness)
+start_t = time.time()
+h.aio_pwrite(buf3, 4)
+h.aio_pwrite(buf4, 0)
+
+while h.aio_in_flight() > 0:
+    h.poll(-1)
+end_t = time.time()
+os.unlink(witness)
+
+out = h.pread(16,0)
+print(out)
+t = end_t - start_t
+print(t)
+assert out in [b"4"*4 + b"3"*12, b"4"*16]
+assert t >= 8
+
+# Next pass: try to kick off aligned first
+print("aligned first")
+ba5 = bytearray(b"5"*16)
+ba6 = bytearray(b"6"*12)
+buf5 = nbd.Buffer.from_bytearray(ba5)
+buf6 = nbd.Buffer.from_bytearray(ba6)
+h.zero(16, 0)
+touch(witness)
+start_t = time.time()
+h.aio_pwrite(buf5, 0)
+h.aio_pwrite(buf6, 4)
+
+while h.aio_in_flight() > 0:
+    h.poll(-1)
+end_t = time.time()
+os.unlink(witness)
+
+out = h.pread(16,0)
+print(out)
+t = end_t - start_t
+print(t)
+assert out in [b"5"*4 + b"6"*12, b"5"*16]
+assert t >= 8
+'
+
+# Now run everything
+truncate --size=16 blocksize-sharding.img
+export witness="$PWD/blocksize-sharding.tmp"
+nbdkit -U - --filter=blocksize --filter=delay eval delay-write=2 \
+    config='ln -sf "$(realpath "$3")" $tmpdir/$2' \
+    img="$PWD/blocksize-sharding.img" tmp="$PWD/blocksize-sharding.tmp" \
+    get_size='echo 16' block_size='echo 16 64K 1M' \
+    thread_model='echo parallel' \
+    zero='dd if=/dev/zero of=$tmpdir/img skip=$4 count=$3 \
+        iflag=count_bytes,skip_bytes' \
+    pread='
+        dd if=$tmpdir/img skip=$4 count=$3 iflag=count_bytes,skip_bytes
+        if [ -f $tmpdir/tmp ]; then sleep 4; fi ' \
+    pwrite='dd of=$tmpdir/img seek=$4 conv=notrunc oflag=seek_bytes' \
+    --run 'nbdsh -u "$uri" -c "$script"'
--
2.36.1
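As an aside on the reader-vs-writer preference point in the commit
message: a writer-preferring rwlock can be built from one mutex and
one condition variable, as the Wikipedia article cited there
describes.  A minimal hypothetical sketch (illustrative only, not part
of the patch; error handling omitted):

  #include <pthread.h>

  typedef struct {
    pthread_mutex_t mu;       /* protects the fields below */
    pthread_cond_t cond;
    unsigned readers;         /* readers currently holding the lock */
    unsigned writers_waiting; /* queued writers; blocks new readers */
    int writer_active;
  } wp_rwlock;

  #define WP_RWLOCK_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

  static void
  wp_rdlock (wp_rwlock *l)
  {
    pthread_mutex_lock (&l->mu);
    /* Writer preference: new readers also yield to *waiting* writers. */
    while (l->writer_active || l->writers_waiting > 0)
      pthread_cond_wait (&l->cond, &l->mu);
    l->readers++;
    pthread_mutex_unlock (&l->mu);
  }

  static void
  wp_wrlock (wp_rwlock *l)
  {
    pthread_mutex_lock (&l->mu);
    l->writers_waiting++;
    while (l->writer_active || l->readers > 0)
      pthread_cond_wait (&l->cond, &l->mu);
    l->writers_waiting--;
    l->writer_active = 1;
    pthread_mutex_unlock (&l->mu);
  }

  static void
  wp_unlock (wp_rwlock *l)    /* releases either kind of hold */
  {
    pthread_mutex_lock (&l->mu);
    if (l->writer_active)
      l->writer_active = 0;
    else
      l->readers--;
    pthread_cond_broadcast (&l->cond);
    pthread_mutex_unlock (&l->mu);
  }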
Laszlo Ersek
2022-May-27 06:59 UTC
[Libguestfs] [nbdkit PATCH v2] blocksize: Avoid losing aligned writes to RMW race
On 05/26/22 23:00, Eric Blake wrote:
> [We are still investigating if a CVE needs to be assigned.]
> [...]
> In v2:
> - Implement the blocksize fix.
> - Drop RFC; I think this is ready, other than determining if we want
>   to tag it with a CVE number.
> - Improve the test: print out more details before assertions, to aid
>   in debugging if it ever dies under CI resources
Apparently there was a bug in the zero callback too (which I had
missed): the original invocation lacked of=$tmpdir/img, so dd sent the
zeroes to stdout instead of updating the image:

-+    zero='dd if=/dev/zero skip=$4 count=$3 iflag=count_bytes,skip_bytes' \
++    zero='dd if=/dev/zero of=$tmpdir/img skip=$4 count=$3 \
++        iflag=count_bytes,skip_bytes' \

After comparing RFC against v2: if you can split the test case into a
separate patch (= the second patch in a two-part series), then I'd be
happy for you to carry my R-b forward.

I don't feel confident enough to review the change to the blocksize
filter itself.  (My confidence is lacking in my knowledge of nbdkit,
not in your patch.)

Thanks
Laszlo
Richard W.M. Jones
2022-May-27 10:19 UTC
[Libguestfs] [nbdkit PATCH v2] blocksize: Avoid losing aligned writes to RMW race
On Thu, May 26, 2022 at 04:00:08PM -0500, Eric Blake wrote:
> [We are still investigating if a CVE needs to be assigned.]
> [...]
Reviewed-by: Richard W.M. Jones <rjones at redhat.com>

I have no preference on whether or not to put the test into a separate
commit.

I'm not sure I understand where the potential CVE is.  In what
scenario could the old filter be exploited?

Rich.

--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top