Richard W.M. Jones
2023-Feb-13 18:07 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
On Sun, Feb 12, 2023 at 03:31:08PM +0200, Yonatan Shtarkman wrote:> Hey, > When downloading a file whose path contains multi-byte utf-8, libguestfs > sometimes crashes. > This reproduces when using python, and not when using guestfish. > > Code to reproduce: > for i in range(2000): > ? ? g.download ('/xxx?', '/tmp/1')'i' is not used inside the loop? Or is this error intermittent?> #0 ?raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50 > #1 ?0x00007ffff7fac140 in <signal handler called> () at /lib/x86_64-linux-gnu/ > libpthread.so.0 > #2 ?0x00007ffff6f77701 in _Py_INCREF (op=<optimized out>) at /usr/include/ > python3.9/object.h:408 > #3 ?guestfs_int_py_event_callback_wrapper > ? ? (g=<optimized out>, flags=<optimized out>, array_len=0, array=0x0, buf_len> 47, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > event_handle=0, event=16, callback=0x7ffff2516600) at handle.c:137 > #4 ?guestfs_int_py_event_callback_wrapper > ? ? (g=<optimized out>, callback=0x7ffff2516600, event=16, event_handle=0, > flags=<optimized out>, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > settle -E \303by", buf_len=47, array=0x0, array_len=0) at handle.c:104 > #5 ?0x00007ffff6e0076a in guestfs_int_call_callbacks_message (g=0xf31290, event > =16, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > buf_len=47) > ? ? at events.c:117 > #6 ?0x00007ffff6e1702e in guestfs_int_log_message_callback > ? ? (g=g at entry=0xf31290, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > settle -E \303by", len=len at entry=47) at proto.c:145 > #7 ?0x00007ffff6dfb759 in handle_log_message (g=g at entry=0xf31290, conn> conn at entry=0x110e280) at conn-socket.c:395 > #8 ?0x00007ffff6dfbd63 in read_data (len=4, bufv=<optimized out>, connv> <optimized out>, g=<optimized out>) at conn-socket.c:179 > #9 ?read_data (g=0xf31290, connv=0x110e280, bufv=<optimized out>, len=4) at > conn-socket.c:142 > #10 0x00007ffff6e1764a in recv_from_daemon (buf_rtn=0x7fffffffd858, size_rtn> 0x7fffffffd854, g=0xf31290) at proto.c:545 > #11 guestfs_int_recv_from_daemon (g=g at entry=0xf31290, size_rtn=size_rtn at entry> 0x7fffffffd854, buf_rtn=buf_rtn at entry=0x7fffffffd858) at proto.c:623 > #12 0x00007ffff6e17a5a in guestfs_int_recv > ? ? (g=g at entry=0xf31290, fn=fn at entry=0x7ffff6e3b3e8 "download", hdr=hdr at entry> 0x7fffffffd920, err=err at entry=0x7fffffffd8f0, xdrp=xdrp at entry=0x0, ret> ret at entry=0x0) > ? ? at proto.c:668 > > I debugged this issue and noticed that the appliance logs from?commandrvf?are > truncated, leading to parse failure (missing utf-8 additional bytes): > https://github.com/libguestfs/libguestfs/blob/master/python/handle.c#L134 > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: invalid > start byteSo I thought we'd fixed this in: https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f This is specifically a Python API problem or would it affect the C API too? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Richard W.M. Jones
2023-Feb-13 18:09 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
On Mon, Feb 13, 2023 at 06:07:58PM +0000, Richard W.M. Jones wrote:> On Sun, Feb 12, 2023 at 03:31:08PM +0200, Yonatan Shtarkman wrote: > > Hey, > > When downloading a file whose path contains multi-byte utf-8, libguestfs > > sometimes crashes. > > This reproduces when using python, and not when using guestfish. > > > > Code to reproduce: > > for i in range(2000): > > ? ? g.download ('/xxx?', '/tmp/1') > > 'i' is not used inside the loop? Or is this error intermittent? > > > #0 ?raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50 > > #1 ?0x00007ffff7fac140 in <signal handler called> () at /lib/x86_64-linux-gnu/ > > libpthread.so.0 > > #2 ?0x00007ffff6f77701 in _Py_INCREF (op=<optimized out>) at /usr/include/ > > python3.9/object.h:408 > > #3 ?guestfs_int_py_event_callback_wrapper > > ? ? (g=<optimized out>, flags=<optimized out>, array_len=0, array=0x0, buf_len> > 47, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > > event_handle=0, event=16, callback=0x7ffff2516600) at handle.c:137 > > #4 ?guestfs_int_py_event_callback_wrapper > > ? ? (g=<optimized out>, callback=0x7ffff2516600, event=16, event_handle=0, > > flags=<optimized out>, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > > settle -E \303by", buf_len=47, array=0x0, array_len=0) at handle.c:104 > > #5 ?0x00007ffff6e0076a in guestfs_int_call_callbacks_message (g=0xf31290, event > > =16, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > > buf_len=47) > > ? ? at events.c:117 > > #6 ?0x00007ffff6e1702e in guestfs_int_log_message_callback > > ? ? (g=g at entry=0xf31290, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > > settle -E \303by", len=len at entry=47) at proto.c:145 > > #7 ?0x00007ffff6dfb759 in handle_log_message (g=g at entry=0xf31290, conn> > conn at entry=0x110e280) at conn-socket.c:395 > > #8 ?0x00007ffff6dfbd63 in read_data (len=4, bufv=<optimized out>, connv> > <optimized out>, g=<optimized out>) at conn-socket.c:179 > > #9 ?read_data (g=0xf31290, connv=0x110e280, bufv=<optimized out>, len=4) at > > conn-socket.c:142 > > #10 0x00007ffff6e1764a in recv_from_daemon (buf_rtn=0x7fffffffd858, size_rtn> > 0x7fffffffd854, g=0xf31290) at proto.c:545 > > #11 guestfs_int_recv_from_daemon (g=g at entry=0xf31290, size_rtn=size_rtn at entry> > 0x7fffffffd854, buf_rtn=buf_rtn at entry=0x7fffffffd858) at proto.c:623 > > #12 0x00007ffff6e17a5a in guestfs_int_recv > > ? ? (g=g at entry=0xf31290, fn=fn at entry=0x7ffff6e3b3e8 "download", hdr=hdr at entry> > 0x7fffffffd920, err=err at entry=0x7fffffffd8f0, xdrp=xdrp at entry=0x0, ret> > ret at entry=0x0) > > ? ? at proto.c:668 > > > > I debugged this issue and noticed that the appliance logs from?commandrvf?are > > truncated, leading to parse failure (missing utf-8 additional bytes): > > https://github.com/libguestfs/libguestfs/blob/master/python/handle.c#L134 > > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: invalid > > start byte > > So I thought we'd fixed this in: > > https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f > > This is specifically a Python API problem or would it affect > the C API too?Here's a similar bug ... https://bugzilla.redhat.com/show_bug.cgi?id=1778962 I would likely need a complete self-contained reproducer to really go anywhere with this. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Richard W.M. Jones
2023-Feb-13 18:09 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
And this one too: https://bugzilla.redhat.com/show_bug.cgi?id=1684004 -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Daniel P. Berrangé
2023-Feb-13 18:13 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
On Mon, Feb 13, 2023 at 06:07:58PM +0000, Richard W.M. Jones wrote:> On Sun, Feb 12, 2023 at 03:31:08PM +0200, Yonatan Shtarkman wrote: > > Hey, > > When downloading a file whose path contains multi-byte utf-8, libguestfs > > sometimes crashes. > > This reproduces when using python, and not when using guestfish. > > > > Code to reproduce: > > for i in range(2000): > > ? ? g.download ('/xxx?', '/tmp/1') > > 'i' is not used inside the loop? Or is this error intermittent? > > > #0 ?raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50 > > #1 ?0x00007ffff7fac140 in <signal handler called> () at /lib/x86_64-linux-gnu/ > > libpthread.so.0 > > #2 ?0x00007ffff6f77701 in _Py_INCREF (op=<optimized out>) at /usr/include/ > > python3.9/object.h:408 > > #3 ?guestfs_int_py_event_callback_wrapper > > ? ? (g=<optimized out>, flags=<optimized out>, array_len=0, array=0x0, buf_len> > 47, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > > event_handle=0, event=16, callback=0x7ffff2516600) at handle.c:137 > > #4 ?guestfs_int_py_event_callback_wrapper > > ? ? (g=<optimized out>, callback=0x7ffff2516600, event=16, event_handle=0, > > flags=<optimized out>, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > > settle -E \303by", buf_len=47, array=0x0, array_len=0) at handle.c:104 > > #5 ?0x00007ffff6e0076a in guestfs_int_call_callbacks_message (g=0xf31290, event > > =16, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > > buf_len=47) > > ? ? at events.c:117 > > #6 ?0x00007ffff6e1702e in guestfs_int_log_message_callback > > ? ? (g=g at entry=0xf31290, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > > settle -E \303by", len=len at entry=47) at proto.c:145 > > #7 ?0x00007ffff6dfb759 in handle_log_message (g=g at entry=0xf31290, conn> > conn at entry=0x110e280) at conn-socket.c:395 > > #8 ?0x00007ffff6dfbd63 in read_data (len=4, bufv=<optimized out>, connv> > <optimized out>, g=<optimized out>) at conn-socket.c:179 > > #9 ?read_data (g=0xf31290, connv=0x110e280, bufv=<optimized out>, len=4) at > > conn-socket.c:142 > > #10 0x00007ffff6e1764a in recv_from_daemon (buf_rtn=0x7fffffffd858, size_rtn> > 0x7fffffffd854, g=0xf31290) at proto.c:545 > > #11 guestfs_int_recv_from_daemon (g=g at entry=0xf31290, size_rtn=size_rtn at entry> > 0x7fffffffd854, buf_rtn=buf_rtn at entry=0x7fffffffd858) at proto.c:623 > > #12 0x00007ffff6e17a5a in guestfs_int_recv > > ? ? (g=g at entry=0xf31290, fn=fn at entry=0x7ffff6e3b3e8 "download", hdr=hdr at entry> > 0x7fffffffd920, err=err at entry=0x7fffffffd8f0, xdrp=xdrp at entry=0x0, ret> > ret at entry=0x0) > > ? ? at proto.c:668 > > > > I debugged this issue and noticed that the appliance logs from?commandrvf?are > > truncated, leading to parse failure (missing utf-8 additional bytes): > > https://github.com/libguestfs/libguestfs/blob/master/python/handle.c#L134 > > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: invalid > > start byte > > So I thought we'd fixed this in: > > https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f > > This is specifically a Python API problem or would it affect > the C API too?The difference with any C API is that almost nothing at the C level will be validating that the bytes are actually valid utf-8 sequences. So the truncated data is unlikely to result in a fatal error. Python is aggressively validating all bytes, and so you get a hard error from the truncated UTF-8. Other languages may vary, but I've not seen anything that makes validation errors a failure in quite such an aggressive way as python. The problems with decode exceptions have hit soooo many apps using python over the past few years. Even worse if running in a C locale as python will reject anything with 8th bit set as being outside 7-bit asciii, instead of being 8-bit clean in its stream handling. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|