Instead of malloc'ing a C buffer, nbd_pread()ing into it, then copying
it into an immutable Python bytes object, we can instead pre-create a
correctly-sized Python bytearray object, then nbd_pread() directly
into that object's underlying bytes.

While the data copying might not be the bottleneck compared to the
networking costs, it does have noticeable results; on my machine:

$ export script='
m=1024*1024
size=h.get_size()
for i in range(size // m):
    buf = h.pread(m, m*i)
'
$ time ./run nbdkit -U - pattern 10G --run 'nbdsh -u $uri -c "$script"'

sped up from about 7.8s pre-patch to about 7.1s post-patch,
approximately a 10% speedup.

Note that this slightly changes the python API: h.pread[_structured]
now returns a mutable 'bytearray' object, rather than an immutable
'bytes' object. This makes it possible to modify the just-read string
in place, rather than having to create yet another memory buffer for
any modifications, which offers even more speedups when writing a
read-modify-write paradigm in python. But the change is
backwards-compatible - python already states that a read-write buffer
may be returned instead of readonly for any client that already
operated only on a buffer in a readonly manner.
---

Note that h.pread is more like Python read() semantics in creating a
buffer, while h.aio_pread is more like Python readinto() semantics in
modifying a passed-in buffer. But now that both code paths have a
python object prior to calling into the C API, my next task is to
improve the h.*pread_structured callback function to pass its buffer
as a slice of the Python input buffer, rather than doing yet another
round of pointless memcpy from C into python objects.

 generator/Python.ml | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/generator/Python.ml b/generator/Python.ml
index 4ab18f6..1c4446e 100644
--- a/generator/Python.ml
+++ b/generator/Python.ml
@@ -1,6 +1,6 @@
 (* hey emacs, this is OCaml code: -*- tuareg -*- *)
 (* nbd client library in userspace: Python bindings
- * Copyright (C) 2013-2021 Red Hat Inc.
+ * Copyright (C) 2013-2022 Red Hat Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -293,7 +293,7 @@ let
     | BytesIn (n, _) ->
        pr "  Py_buffer %s = { .obj = NULL };\n" n
     | BytesOut (n, count) ->
-       pr "  char *%s = NULL;\n" n;
+       pr "  PyObject *%s = NULL;\n" n;
        pr "  Py_ssize_t %s;\n" count
     | BytesPersistIn (n, _)
     | BytesPersistOut (n, _) ->
@@ -432,8 +432,8 @@ let
     | Bool _ -> ()
     | BytesIn _ -> ()
     | BytesOut (n, count) ->
-       pr "  %s = malloc (%s);\n" n count;
-       pr "  if (%s == NULL) { PyErr_NoMemory (); goto out; }\n" n
+       pr "  %s = PyByteArray_FromStringAndSize (NULL, %s);\n" n count;
+       pr "  if (%s == NULL) goto out;\n" n
     | BytesPersistIn (n, _) | BytesPersistOut (n, _) ->
        pr "  %s_buf = nbd_internal_py_get_aio_buffer (%s);\n" n n;
        pr "  if (!%s_buf) goto out;\n" n;
@@ -479,7 +479,7 @@ let
     function
     | Bool n -> pr ", %s" n
     | BytesIn (n, _) -> pr ", %s.buf, %s.len" n n
-    | BytesOut (n, count) -> pr ", %s, %s" n count
+    | BytesOut (n, count) -> pr ", PyByteArray_AS_STRING (%s), %s" n count
     | BytesPersistIn (n, _)
     | BytesPersistOut (n, _) -> pr ", %s_buf->data, %s_buf->len" n n
     | Closure { cbname } -> pr ", %s" cbname
@@ -524,8 +524,9 @@ let
   let use_ret = ref true in
   List.iter (
     function
-    | BytesOut (n, count) ->
-       pr "  py_ret = PyBytes_FromStringAndSize (%s, %s);\n" n count;
+    | BytesOut (n, _) ->
+       pr "  py_ret = %s;\n" n;
+       pr "  %s = NULL;\n" n;
        use_ret := false
     | Bool _
     | BytesIn _
@@ -572,7 +573,7 @@ let
     | BytesIn (n, _) ->
        pr "  if (%s.obj)\n" n;
        pr "    PyBuffer_Release (&%s);\n" n
-    | BytesOut (n, _) -> pr "  free (%s);\n" n
+    | BytesOut (n, _) -> pr "  Py_XDECREF (%s);\n" n
     | BytesPersistIn _ | BytesPersistOut _ -> ()
     | Closure { cbname } ->
       pr "  free_user_data (%s_user_data);\n" cbname
-- 
2.36.1
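[Editor's note: the read-modify-write speedup mentioned in the commit
message above can be pictured with a short libnbd Python snippet. This
is only an illustrative sketch, not part of the patch: the NBD URI is
hypothetical, and it assumes the usual libnbd Python entry points
(nbd.NBD, connect_uri, pread, pwrite, shutdown).]

import nbd

# Hypothetical server URI; any writable NBD export would do.
URI = "nbd://localhost"

h = nbd.NBD()
h.connect_uri(URI)

# h.pread() now returns a mutable bytearray, so the just-read block can
# be patched in place instead of being copied into a fresh buffer.
buf = h.pread(512, 0)          # read the first 512-byte block
buf[0:8] = b"new-sig\0"        # modify it in place (8 bytes)
h.pwrite(buf, 0)               # write the modified block back

h.shutdown()

The same object that nbd_pread() filled is what the caller mutates and
hands back to pwrite, so no extra copy sits between the read and the
modification.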
On Fri, May 27, 2022 at 04:44:16PM -0500, Eric Blake wrote:
[...]
>  generator/Python.ml | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)

I've been previously reminded (commit e0953c for example) that showing
the diff of the generated output is also useful:

--- python/methods.c.bak	2022-05-27 17:08:43.035658709 -0500
+++ python/methods.c	2022-05-27 17:08:50.678669864 -0500
@@ -2295,7 +2295,7 @@
   struct nbd_handle *h;
   int ret;
   PyObject *py_ret = NULL;
-  char *buf = NULL;
+  PyObject *buf = NULL;
   Py_ssize_t count;
   uint64_t offset_u64;
   unsigned long long offset; /* really uint64_t */
@@ -2309,19 +2309,20 @@
   h = get_handle (py_h);
   if (!h) goto out;
   flags_u32 = flags;
-  buf = malloc (count);
-  if (buf == NULL) { PyErr_NoMemory (); goto out; }
+  buf = PyByteArray_FromStringAndSize (NULL, count);
+  if (buf == NULL) goto out;
   offset_u64 = offset;
 
-  ret = nbd_pread (h, buf, count, offset_u64, flags_u32);
+  ret = nbd_pread (h, PyByteArray_AS_STRING (buf), count, offset_u64, flags_u32);
   if (ret == -1) {
     raise_exception ();
     goto out;
   }
-  py_ret = PyBytes_FromStringAndSize (buf, count);
+  py_ret = buf;
+  buf = NULL;
 
  out:
-  free (buf);
+  Py_XDECREF (buf);
   return py_ret;
 }
 
@@ -2332,7 +2333,7 @@
   struct nbd_handle *h;
   int ret;
   PyObject *py_ret = NULL;
-  char *buf = NULL;
+  PyObject *buf = NULL;
   Py_ssize_t count;
   uint64_t offset_u64;
   unsigned long long offset; /* really uint64_t */
@@ -2350,8 +2351,8 @@
   h = get_handle (py_h);
   if (!h) goto out;
   flags_u32 = flags;
-  buf = malloc (count);
-  if (buf == NULL) { PyErr_NoMemory (); goto out; }
+  buf = PyByteArray_FromStringAndSize (NULL, count);
+  if (buf == NULL) goto out;
   offset_u64 = offset;
   chunk.user_data = chunk_user_data = alloc_user_data ();
   if (chunk_user_data == NULL) goto out;
@@ -2364,16 +2365,17 @@
   Py_INCREF (py_chunk_fn);
   chunk_user_data->fn = py_chunk_fn;
 
-  ret = nbd_pread_structured (h, buf, count, offset_u64, chunk, flags_u32);
+  ret = nbd_pread_structured (h, PyByteArray_AS_STRING (buf), count, offset_u64, chunk, flags_u32);
   chunk_user_data = NULL;
   if (ret == -1) {
     raise_exception ();
     goto out;
   }
-  py_ret = PyBytes_FromStringAndSize (buf, count);
+  py_ret = buf;
+  buf = NULL;
 
  out:
-  free (buf);
+  Py_XDECREF (buf);
   free_user_data (chunk_user_data);
   return py_ret;
 }

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
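[Editor's note: as an aside on the read()/readinto() analogy in the
patch notes above, the two styles look roughly like this from Python.
This is a sketch only; the URI is hypothetical, and the asynchronous
names (nbd.Buffer, aio_pread, aio_command_completed, poll,
to_bytearray) are quoted from memory rather than from the patch, so
check the generated libnbd Python documentation before relying on
them.]

import nbd

h = nbd.NBD()
h.connect_uri("nbd://localhost")     # hypothetical server

# read()-style: h.pread allocates and returns a (now mutable) buffer.
first = h.pread(4096, 0)

# readinto()-style: h.aio_pread fills a caller-supplied nbd.Buffer.
buf = nbd.Buffer(4096)
cookie = h.aio_pread(buf, 4096)
while not h.aio_command_completed(cookie):
    h.poll(-1)                       # wait for the command to finish
second = buf.to_bytearray()

h.shutdown()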
Richard W.M. Jones
2022-May-28 17:16 UTC
[Libguestfs] [libnbd PATCH] python: Speed up pread
On Fri, May 27, 2022 at 04:44:16PM -0500, Eric Blake wrote:
> Instead of malloc'ing a C buffer, nbd_pread()ing into it, then copying
> it into an immutable Python bytes object, we can instead pre-create a
> correctly-sized Python bytearray object, then nbd_pread() directly
> into that object's underlying bytes.
>
> While the data copying might not be the bottleneck compared to the
> networking costs, it does have noticeable results; on my machine:
>
> $ export script='
> m=1024*1024
> size=h.get_size()
> for i in range(size // m):
>     buf = h.pread(m, m*i)
> '
> $ time ./run nbdkit -U - pattern 10G --run 'nbdsh -u $uri -c "$script"'
>
> sped up from about 7.8s pre-patch to about 7.1s post-patch,
> approximately a 10% speedup.
>
> Note that this slightly changes the python API: h.pread[_structured]
> now returns a mutable 'bytearray' object, rather than an immutable
> 'bytes' object. This makes it possible to modify the just-read string
> in place, rather than having to create yet another memory buffer for
> any modifications, which offers even more speedups when writing a
> read-modify-write paradigm in python. But the change is
> backwards-compatible - python already states that a read-write buffer
> may be returned instead of readonly for any client that already
> operated only on a buffer in a readonly manner.

Although this is not an API change, in general we have no obligation
to maintain backwards compat for the Python API. (The C API is quite a
different matter of course.) Nevertheless, it's nice not to break
things.

> Note that h.pread is more like Python read() semantics in creating a
> buffer, while h.aio_pread is more like Python readinto() semantics in
> modifying a passed-in buffer. But now that both code paths have a
> python object prior to calling into the C API, my next task is to
> improve the h.*pread_structured callback function to pass its buffer
> as a slice of the Python input buffer, rather than doing yet another
> round of pointless memcpy from C into python objects.

You might want to have a look at the nbdkit Python bindings for
inspiration because Nir made those either zero- or low-copy (I forget
exactly which).

>  generator/Python.ml | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
[...]
Looks good,

Reviewed-by: Richard W.M. Jones <rjones at redhat.com>

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
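[Editor's note: a final illustration of the backwards-compatibility
point discussed in the patch and in the review above. Python code that
only ever treated the result of h.pread() as a read-only buffer behaves
identically whether it receives bytes or bytearray; the parse_header()
helper below is hypothetical, purely to demonstrate that.]

import struct

def parse_header(buf):
    # Read-only consumer: slicing, comparison and struct unpacking work
    # the same way on bytes and on bytearray.
    magic, version = struct.unpack_from(">4sI", buf, 0)
    return magic == b"QFI\xfb", version

hdr = b"QFI\xfb\x00\x00\x00\x03" + b"\x00" * 8
assert parse_header(hdr) == (True, 3)               # old return type: bytes
assert parse_header(bytearray(hdr)) == (True, 3)    # new return type: bytearray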