Daniel P. Berrangé
2023-Feb-13 18:13 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
On Mon, Feb 13, 2023 at 06:07:58PM +0000, Richard W.M. Jones wrote:> On Sun, Feb 12, 2023 at 03:31:08PM +0200, Yonatan Shtarkman wrote: > > Hey, > > When downloading a file whose path contains multi-byte utf-8, libguestfs > > sometimes crashes. > > This reproduces when using python, and not when using guestfish. > > > > Code to reproduce: > > for i in range(2000): > > ? ? g.download ('/xxx?', '/tmp/1') > > 'i' is not used inside the loop? Or is this error intermittent? > > > #0 ?raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50 > > #1 ?0x00007ffff7fac140 in <signal handler called> () at /lib/x86_64-linux-gnu/ > > libpthread.so.0 > > #2 ?0x00007ffff6f77701 in _Py_INCREF (op=<optimized out>) at /usr/include/ > > python3.9/object.h:408 > > #3 ?guestfs_int_py_event_callback_wrapper > > ? ? (g=<optimized out>, flags=<optimized out>, array_len=0, array=0x0, buf_len> > 47, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > > event_handle=0, event=16, callback=0x7ffff2516600) at handle.c:137 > > #4 ?guestfs_int_py_event_callback_wrapper > > ? ? (g=<optimized out>, callback=0x7ffff2516600, event=16, event_handle=0, > > flags=<optimized out>, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > > settle -E \303by", buf_len=47, array=0x0, array_len=0) at handle.c:104 > > #5 ?0x00007ffff6e0076a in guestfs_int_call_callbacks_message (g=0xf31290, event > > =16, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \303by", > > buf_len=47) > > ? ? at events.c:117 > > #6 ?0x00007ffff6e1702e in guestfs_int_log_message_callback > > ? ? (g=g at entry=0xf31290, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug > > settle -E \303by", len=len at entry=47) at proto.c:145 > > #7 ?0x00007ffff6dfb759 in handle_log_message (g=g at entry=0xf31290, conn> > conn at entry=0x110e280) at conn-socket.c:395 > > #8 ?0x00007ffff6dfbd63 in read_data (len=4, bufv=<optimized out>, connv> > <optimized out>, g=<optimized out>) at conn-socket.c:179 > > #9 ?read_data (g=0xf31290, connv=0x110e280, bufv=<optimized out>, len=4) at > > conn-socket.c:142 > > #10 0x00007ffff6e1764a in recv_from_daemon (buf_rtn=0x7fffffffd858, size_rtn> > 0x7fffffffd854, g=0xf31290) at proto.c:545 > > #11 guestfs_int_recv_from_daemon (g=g at entry=0xf31290, size_rtn=size_rtn at entry> > 0x7fffffffd854, buf_rtn=buf_rtn at entry=0x7fffffffd858) at proto.c:623 > > #12 0x00007ffff6e17a5a in guestfs_int_recv > > ? ? (g=g at entry=0xf31290, fn=fn at entry=0x7ffff6e3b3e8 "download", hdr=hdr at entry> > 0x7fffffffd920, err=err at entry=0x7fffffffd8f0, xdrp=xdrp at entry=0x0, ret> > ret at entry=0x0) > > ? ? at proto.c:668 > > > > I debugged this issue and noticed that the appliance logs from?commandrvf?are > > truncated, leading to parse failure (missing utf-8 additional bytes): > > https://github.com/libguestfs/libguestfs/blob/master/python/handle.c#L134 > > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: invalid > > start byte > > So I thought we'd fixed this in: > > https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f > > This is specifically a Python API problem or would it affect > the C API too?The difference with any C API is that almost nothing at the C level will be validating that the bytes are actually valid utf-8 sequences. So the truncated data is unlikely to result in a fatal error. Python is aggressively validating all bytes, and so you get a hard error from the truncated UTF-8. Other languages may vary, but I've not seen anything that makes validation errors a failure in quite such an aggressive way as python. The problems with decode exceptions have hit soooo many apps using python over the past few years. Even worse if running in a C locale as python will reject anything with 8th bit set as being outside 7-bit asciii, instead of being 8-bit clean in its stream handling. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Yonatan Shtarkman
2023-Feb-13 19:38 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
Thanks for looking into this Rich and Daniel, and thanks for the profound explanation. See my answers below. `'i' is not used inside the loop? Or is this error intermittent?` - 'i' can be removed, the error is intermittent. This reproduces on libguestfs 1.46.2 and 1.48.6. `This is specifically a Python API problem or would it affect the C API too?` - I assume it's not in the C API, since this issue doesn't reproduce when using guestfish. I would likely need a complete self-contained reproducer to really go anywhere with this.` - I can supply a container image with libguestfs 1.46.2 running the for loop. Let me know if that helps. Also, as a workaround, I avoided calling the event callback if null is returned by Py_BuildValue (still print the error and release the thread). This seems to work for our usage (we only use the event callbacks for logging), any potential issues I'm missing? On Mon, Feb 13, 2023 at 8:14 PM Daniel P. Berrang? <berrange at redhat.com> wrote:> On Mon, Feb 13, 2023 at 06:07:58PM +0000, Richard W.M. Jones wrote: > > On Sun, Feb 12, 2023 at 03:31:08PM +0200, Yonatan Shtarkman wrote: > > > Hey, > > > When downloading a file whose path contains multi-byte utf-8, > libguestfs > > > sometimes crashes. > > > This reproduces when using python, and not when using guestfish. > > > > > > Code to reproduce: > > > for i in range(2000): > > > g.download ('/xxx?', '/tmp/1') > > > > 'i' is not used inside the loop? Or is this error intermittent? > > > > > #0 raise (sig=<optimized out>) at > ../sysdeps/unix/sysv/linux/raise.c:50 > > > #1 0x00007ffff7fac140 in <signal handler called> () at > /lib/x86_64-linux-gnu/ > > > libpthread.so.0 > > > #2 0x00007ffff6f77701 in _Py_INCREF (op=<optimized out>) at > /usr/include/ > > > python3.9/object.h:408 > > > #3 guestfs_int_py_event_callback_wrapper > > > (g=<optimized out>, flags=<optimized out>, array_len=0, array=0x0, > buf_len> > > 47, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E > \303by", > > > event_handle=0, event=16, callback=0x7ffff2516600) at handle.c:137 > > > #4 guestfs_int_py_event_callback_wrapper > > > (g=<optimized out>, callback=0x7ffff2516600, event=16, > event_handle=0, > > > flags=<optimized out>, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm > --debug > > > settle -E \303by", buf_len=47, array=0x0, array_len=0) at handle.c:104 > > > #5 0x00007ffff6e0076a in guestfs_int_call_callbacks_message > (g=0xf31290, event > > > =16, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E > \303by", > > > buf_len=47) > > > at events.c:117 > > > #6 0x00007ffff6e1702e in guestfs_int_log_message_callback > > > (g=g at entry=0xf31290, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm > --debug > > > settle -E \303by", len=len at entry=47) at proto.c:145 > > > #7 0x00007ffff6dfb759 in handle_log_message (g=g at entry=0xf31290, > conn> > > conn at entry=0x110e280) at conn-socket.c:395 > > > #8 0x00007ffff6dfbd63 in read_data (len=4, bufv=<optimized out>, > connv> > > <optimized out>, g=<optimized out>) at conn-socket.c:179 > > > #9 read_data (g=0xf31290, connv=0x110e280, bufv=<optimized out>, > len=4) at > > > conn-socket.c:142 > > > #10 0x00007ffff6e1764a in recv_from_daemon (buf_rtn=0x7fffffffd858, > size_rtn> > > 0x7fffffffd854, g=0xf31290) at proto.c:545 > > > #11 guestfs_int_recv_from_daemon (g=g at entry=0xf31290, > size_rtn=size_rtn at entry> > > 0x7fffffffd854, buf_rtn=buf_rtn at entry=0x7fffffffd858) at proto.c:623 > > > #12 0x00007ffff6e17a5a in guestfs_int_recv > > > (g=g at entry=0xf31290, fn=fn at entry=0x7ffff6e3b3e8 "download", > hdr=hdr at entry> > > 0x7fffffffd920, err=err at entry=0x7fffffffd8f0, xdrp=xdrp at entry=0x0, > ret> > > ret at entry=0x0) > > > at proto.c:668 > > > > > > I debugged this issue and noticed that the appliance logs > from commandrvf are > > > truncated, leading to parse failure (missing utf-8 additional bytes): > > > > https://github.com/libguestfs/libguestfs/blob/master/python/handle.c#L134 > > > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position > 0: invalid > > > start byte > > > > So I thought we'd fixed this in: > > > > > https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f > > > > This is specifically a Python API problem or would it affect > > the C API too? > > The difference with any C API is that almost nothing at the C level will > be validating that the bytes are actually valid utf-8 sequences. > > So the truncated data is unlikely to result in a fatal error. Python is > aggressively validating all bytes, and so you get a hard error from the > truncated UTF-8. Other languages may vary, but I've not seen anything > that makes validation errors a failure in quite such an aggressive way > as python. The problems with decode exceptions have hit soooo many apps > using python over the past few years. Even worse if running in a C > locale as python will reject anything with 8th bit set as being outside > 7-bit asciii, instead of being 8-bit clean in its stream handling. > > > With regards, > Daniel > -- > |: https://berrange.com -o- > https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- > https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- > https://www.instagram.com/dberrange :| > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://listman.redhat.com/archives/libguestfs/attachments/20230213/8cdf0b2a/attachment.htm>