Richard W.M. Jones
2020-Aug-07  13:16 UTC
Re: [Libguestfs] [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
On Fri, Aug 07, 2020 at 07:53:13AM -0500, Eric Blake wrote:> >$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random > > Hmm - the -W actually says that qemu-img is performing semi-random > access (there is no guarantee that the 16 coroutines are serviced in > linear order of the file), even though we really are making only one > pass through the file in bulk. I don't know if fadvise=normal would > be any better; dropping -W but keeping -m 16 might also be an > interesting number to check (where qemu-img tries harder to do > in-order access, but still take advantage of parallel threads). > > > total used free shared buff/cache available > >Mem: 32083 1188 27928 1 2966 30440 > >Swap: 16135 16 16119 > > (100.00/100%) > > > >real 0m13.107s > >user 0m2.051s > >sys 0m37.556s > > total used free shared buff/cache available > >Mem: 32083 1196 27861 1 3024 30429 > >Swap: 16135 16 16119 > >pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]Without -W it's very similar: $ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random total used free shared buff/cache available Mem: 32083 1184 26113 1 4785 30444 Swap: 16135 16 16119 (100.00/100%) real 0m13.308s user 0m1.961s sys 0m40.455s total used free shared buff/cache available Mem: 32083 1188 26049 1 4845 30438 Swap: 16135 16 16119 pages in cache: 14808/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K] With -W and using fadvise=random is also about the same: $ free -m; time ./nbdkit file /var/tmp/random fadvise=random cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random total used free shared buff/cache available Mem: 32083 1187 26109 1 4785 30440 Swap: 16135 16 16119 (100.00/100%) real 0m13.030s user 0m1.986s sys 0m37.498s total used free shared buff/cache available Mem: 32083 1187 26053 1 4842 30440 Swap: 16135 16 16119 pages in cache: 14336/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K] I'm going to guess that for this case readahead doesn't have much time to get ahead of qemu.> >+=item B<fadvise=normal> > >+ > >+=item B<fadvise=random> > >+ > >+=item B<fadvise=sequential> > >+ > >+This optional flag hints to the kernel that you will access the file > >+normally, or in a random order, or sequentially. The exact behaviour > >+depends on your operating system, but for Linux using C<normal> causes > >+the kernel to read-ahead, C<sequential> causes the kernel to > >+read-ahead twice as much as C<normal>, and C<random> turns off > >+read-ahead. > > Is it worth a mention of L<posix_fadvise(3)> here, to let the user > get some idea of what their operating system supports?Yes I had this at one point but I seem to have dropped it. Will add it back, thanks.> >+=head2 Reducing evictions from the page cache > >+ > >+If the file is very large and you known the client will only > >+read/write the file sequentially one time (eg for making a single copy > >+or backup) then this will stop other processes from being evicted from > >+the page cache: > >+ > >+ nbdkit file disk.img fadvise=sequential cache=none > > It's also possible to avoid polluting the page cache by using > O_DIRECT, but that comes with harder guarantees (aligned access > through aligned buffers), so we may add it as another mode later on. > But in the meantime, cache=none is fairly nice while still avoiding > O_DIRECT.I'm not sure if or even how we could ever do a robust O_DIRECT implementation, but my idea was that it might be an alternate implementation of cache=none. But if we thought we might use O_DIRECT as a separate mode, then maybe we should rename cache=none. cache=advise? cache=dontneed? I can't think of a good name!> >@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, > > { > > struct handle *h = handle; > >+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED) > >+ uint32_t orig_count = count; > >+ uint64_t orig_offset = offset; > >+ > >+ /* If cache=none we want to force pages we have just written to the > >+ * file to be flushed to disk so we can immediately evict them from > >+ * the page cache. > >+ */ > >+ if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA; > >+#endif > >+ > > while (count > 0) { > > ssize_t r = pwrite (h->fd, buf, count, offset); > > if (r == -1) { > >@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, > > if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1) > > return -1; > >+#ifdef HAVE_POSIX_FADVISE > >+ /* On Linux this will evict the pages we just wrote from the page cache. */ > >+ if (cache_mode == cache_none) > >+ posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED); > >+#endif > > So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed > doesn't help? You did point out that the use of FUA for flushing > slows things down, but that's a fair price to pay to keep the cache > clean.On Linux POSIX_FADV_DONTNEED won't flush dirty buffers. I expect (but didn't actually measure) that just after a medium sized write the buffers would all be dirty so the posix_fadvise(DONTNEED) call would do nothing at all. The advice online does seem to be that you must flush before calling this. (Linus advocates a complex double-buffering solution so that you can be reading into one buffer while flushing the other, so you don't have the overhead of waiting for the flush). I'm going to do a bit of benchmarking of the write side now. Thanks, Rich.> Patch looks good to me. > > -- > Eric Blake, Principal Software Engineer > Red Hat, Inc. +1-919-301-3226 > Virtualization: qemu.org | libvirt.org-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Nir Soffer
2020-Aug-07  13:43 UTC
Re: [Libguestfs] [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
On Fri, Aug 7, 2020, 16:16 Richard W.M. Jones <rjones@redhat.com> wrote:> On Fri, Aug 07, 2020 at 07:53:13AM -0500, Eric Blake wrote: > > >$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential > cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd > "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; > cachestats /var/tmp/random > > > > Hmm - the -W actually says that qemu-img is performing semi-random > > access (there is no guarantee that the 16 coroutines are serviced in > > linear order of the file), even though we really are making only one > > pass through the file in bulk. I don't know if fadvise=normal would > > be any better; dropping -W but keeping -m 16 might also be an > > interesting number to check (where qemu-img tries harder to do > > in-order access, but still take advantage of parallel threads). > > > > > total used free shared buff/cache > available > > >Mem: 32083 1188 27928 1 2966 > 30440 > > >Swap: 16135 16 16119 > > > (100.00/100%) > > > > > >real 0m13.107s > > >user 0m2.051s > > >sys 0m37.556s > > > total used free shared buff/cache > available > > >Mem: 32083 1196 27861 1 3024 > 30429 > > >Swap: 16135 16 16119 > > >pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, > pagesize=4K] > > Without -W it's very similar: > > $ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential > cache=none --run 'qemu-img convert -n -p -m 16 $nbd > "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; > cachestats /var/tmp/random > total used free shared buff/cache > available > Mem: 32083 1184 26113 1 4785 > 30444 > Swap: 16135 16 16119 > (100.00/100%) > > real 0m13.308s > user 0m1.961s > sys 0m40.455s > total used free shared buff/cache > available > Mem: 32083 1188 26049 1 4845 > 30438 > Swap: 16135 16 16119 > pages in cache: 14808/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K] > > With -W and using fadvise=random is also about the same: > > $ free -m; time ./nbdkit file /var/tmp/random fadvise=random cache=none > --run 'qemu-img convert -n -p -m 16 -W $nbd > "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; > cachestats /var/tmp/random > total used free shared buff/cache > available > Mem: 32083 1187 26109 1 4785 > 30440 > Swap: 16135 16 16119 > (100.00/100%) > > real 0m13.030s > user 0m1.986s > sys 0m37.498s > total used free shared buff/cache > available > Mem: 32083 1187 26053 1 4842 > 30440 > Swap: 16135 16 16119 > pages in cache: 14336/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K] > > I'm going to guess that for this case readahead doesn't have much time > to get ahead of qemu. > > > >+=item B<fadvise=normal> > > >+ > > >+=item B<fadvise=random> > > >+ > > >+=item B<fadvise=sequential> > > >+ > > >+This optional flag hints to the kernel that you will access the file > > >+normally, or in a random order, or sequentially. The exact behaviour > > >+depends on your operating system, but for Linux using C<normal> causes > > >+the kernel to read-ahead, C<sequential> causes the kernel to > > >+read-ahead twice as much as C<normal>, and C<random> turns off > > >+read-ahead. > > > > Is it worth a mention of L<posix_fadvise(3)> here, to let the user > > get some idea of what their operating system supports? > > Yes I had this at one point but I seem to have dropped it. Will > add it back, thanks. > > > >+=head2 Reducing evictions from the page cache > > >+ > > >+If the file is very large and you known the client will only > > >+read/write the file sequentially one time (eg for making a single copy > > >+or backup) then this will stop other processes from being evicted from > > >+the page cache: > > >+ > > >+ nbdkit file disk.img fadvise=sequential cache=none > > > > It's also possible to avoid polluting the page cache by using > > O_DIRECT, but that comes with harder guarantees (aligned access > > through aligned buffers), so we may add it as another mode later on. > > But in the meantime, cache=none is fairly nice while still avoiding > > O_DIRECT. > > I'm not sure if or even how we could ever do a robust O_DIRECT >We can let the plugin an filter deal with that. The simplest solution is to drop it on the user and require aligned requests. Maybe a filter can handle alignment? implementation, but my idea was that it might be an alternate> implementation of cache=none. But if we thought we might use O_DIRECT > as a separate mode, then maybe we should rename cache=none. > cache=advise? cache=dontneed? I can't think of a good name! >Yes, don't call it none if you use the cache. How about advise=? I would keep cache semantics similar to qemu.> > >@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, > uint32_t count, uint64_t offset, > > > { > > > struct handle *h = handle; > > >+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED) > > >+ uint32_t orig_count = count; > > >+ uint64_t orig_offset = offset; > > >+ > > >+ /* If cache=none we want to force pages we have just written to the > > >+ * file to be flushed to disk so we can immediately evict them from > > >+ * the page cache. > > >+ */ > > >+ if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA; > > >+#endif > > >+ > > > while (count > 0) { > > > ssize_t r = pwrite (h->fd, buf, count, offset); > > > if (r == -1) { > > >@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, > uint32_t count, uint64_t offset, > > > if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1) > > > return -1; > > >+#ifdef HAVE_POSIX_FADVISE > > >+ /* On Linux this will evict the pages we just wrote from the page > cache. */ > > >+ if (cache_mode == cache_none) > > >+ posix_fadvise (h->fd, orig_offset, orig_count, > POSIX_FADV_DONTNEED); > > >+#endif > > > > So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed > > doesn't help? You did point out that the use of FUA for flushing > > slows things down, but that's a fair price to pay to keep the cache > > clean. > > On Linux POSIX_FADV_DONTNEED won't flush dirty buffers. I expect (but > didn't actually measure) that just after a medium sized write the > buffers would all be dirty so the posix_fadvise(DONTNEED) call would > do nothing at all. The advice online does seem to be that you must > flush before calling this. (Linus advocates a complex > double-buffering solution so that you can be reading into one buffer > while flushing the other, so you don't have the overhead of waiting > for the flush). > > I'm going to do a bit of benchmarking of the write side now. >We already tried this with dd and the results were not good. Nir> Thanks, > > Rich. > > > Patch looks good to me. > > > > -- > > Eric Blake, Principal Software Engineer > > Red Hat, Inc. +1-919-301-3226 > > Virtualization: qemu.org | libvirt.org > > -- > Richard Jones, Virtualization Group, Red Hat > http://people.redhat.com/~rjones > Read my programming and virtualization blog: http://rwmj.wordpress.com > libguestfs lets you edit virtual machines. Supports shell scripting, > bindings from many languages. http://libguestfs.org > >
Eric Blake
2020-Aug-07  13:59 UTC
Re: [Libguestfs] [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
On 8/7/20 8:43 AM, Nir Soffer wrote:>>> >>> It's also possible to avoid polluting the page cache by using >>> O_DIRECT, but that comes with harder guarantees (aligned access >>> through aligned buffers), so we may add it as another mode later on. >>> But in the meantime, cache=none is fairly nice while still avoiding >>> O_DIRECT. >> >> I'm not sure if or even how we could ever do a robust O_DIRECT >> > > We can let the plugin an filter deal with that. The simplest solution is to > drop it on the user and require aligned requests. > > Maybe a filter can handle alignment?The blocksize filter seems like a good fit. Also, I'm trying to add support for NBD_INFO_BLOCK_SIZE (qemu already supports it), so that we can advertise minimum granularity across to the client to avoid the need for read-modify-write in the blocksize filter.> > implementation, but my idea was that it might be an alternate >> implementation of cache=none. But if we thought we might use O_DIRECT >> as a separate mode, then maybe we should rename cache=none. >> cache=advise? cache=dontneed? I can't think of a good name! >> > > Yes, don't call it none if you use the cache. > > How about advise=? > > I would keep cache semantics similar to qemu.qemu-img has cache=, but it affects O_DIRECT rather than posix_fadvise. Yeah, naming is hard, and I don't know if I have better ideas. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Richard W.M. Jones
2020-Aug-07  14:07 UTC
Re: [Libguestfs] [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
On Fri, Aug 07, 2020 at 04:43:12PM +0300, Nir Soffer wrote:> On Fri, Aug 7, 2020, 16:16 Richard W.M. Jones <rjones@redhat.com> wrote: > > I'm not sure if or even how we could ever do a robust O_DIRECT > > > > We can let the plugin an filter deal with that. The simplest solution is to > drop it on the user and require aligned requests.I mean this is very error prone. It requires the end user to know about the basically unknowable restrictions of O_DIRECT and isn't even possible in one common case - if the size of the file isn't an exact multiple of the filesystem block size.> Maybe a filter can handle alignment? > > > implementation, but my idea was that it might be an alternate > > implementation of cache=none. But if we thought we might use O_DIRECT > > as a separate mode, then maybe we should rename cache=none. > > cache=advise? cache=dontneed? I can't think of a good name! > > > > Yes, don't call it none if you use the cache. > > How about advise=? > > I would keep cache semantics similar to qemu.qemu uses cache=none as a synonym for O_DIRECT, but AFAIK it has nothing that tries to use posix_fadvise(DONTNEED) with or without Linus's double buffering technique. qemu does use posix_fadvise(DONTNEED) in one place but AFAICT it is only used for live migration. ...> We already tried this with dd and the results were not good.These ones? https://www.redhat.com/archives/libguestfs/2020-August/msg00078.html Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Reasonably Related Threads
- Re: [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
- Re: [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
- Re: [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
- Re: [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.
- Re: [PATCH nbdkit] file: Implement cache=none and fadvise=normal|random|sequential.