Yonatan Shtarkman
2023-Feb-13 19:38 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
Thanks for looking into this Rich and Daniel, and thanks for the profound explanation. See my answers below. `'i' is not used inside the loop? Or is this error intermittent?` - 'i' can be removed, the error is intermittent. This reproduces on libguestfs 1.46.2 and 1.48.6. `This is specifically a Python API problem or would it affect the C API too?` - I assume it's not in the C API, since this issue doesn't reproduce when using guestfish. I would likely need a complete self-contained reproducer to really go anywhere with this.` - I can supply a container image with libguestfs 1.46.2 running the for loop. Let me know if that helps. Also, as a workaround, I avoided calling the event callback if null is returned by Py_BuildValue (still print the error and release the thread). This seems to work for our usage (we only use the event callbacks for logging), any potential issues I'm missing? On Mon, Feb 13, 2023 at 8:14 PM Daniel P. Berrang? <berrange at redhat.com> wrote:> On Mon, Feb 13, 2023 at 06:07:58PM +0000, Richard W.M. Jones wrote: > > On Sun, Feb 12, 2023 at 03:31:08PM +0200, Yonatan Shtarkman wrote: > > > Hey, > > > When downloading a file whose path contains multi-byte utf-8, > libguestfs > > > sometimes crashes. > > > This reproduces when using python, and not when using guestfish. > > > > > > Code to reproduce: > > > for i in range(2000): > > > g.download ('/xxx?', '/tmp/1') > > > > 'i' is not used inside the loop? Or is this error intermittent? > > > > > #0 raise (sig=<optimized out>) at > ../sysdeps/unix/sysv/linux/raise.c:50 > > > #1 0x00007ffff7fac140 in <signal handler called> () at > /lib/x86_64-linux-gnu/ > > > libpthread.so.0 > > > #2 0x00007ffff6f77701 in _Py_INCREF (op=<optimized out>) at > /usr/include/ > > > python3.9/object.h:408 > > > #3 guestfs_int_py_event_callback_wrapper > > > (g=<optimized out>, flags=<optimized out>, array_len=0, array=0x0, > buf_len> > > 47, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E > \303by", > > > event_handle=0, event=16, callback=0x7ffff2516600) at handle.c:137 > > > #4 guestfs_int_py_event_callback_wrapper > > > (g=<optimized out>, callback=0x7ffff2516600, event=16, > event_handle=0, > > > flags=<optimized out>, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm > --debug > > > settle -E \303by", buf_len=47, array=0x0, array_len=0) at handle.c:104 > > > #5 0x00007ffff6e0076a in guestfs_int_call_callbacks_message > (g=0xf31290, event > > > =16, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E > \303by", > > > buf_len=47) > > > at events.c:117 > > > #6 0x00007ffff6e1702e in guestfs_int_log_message_callback > > > (g=g at entry=0xf31290, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm > --debug > > > settle -E \303by", len=len at entry=47) at proto.c:145 > > > #7 0x00007ffff6dfb759 in handle_log_message (g=g at entry=0xf31290, > conn> > > conn at entry=0x110e280) at conn-socket.c:395 > > > #8 0x00007ffff6dfbd63 in read_data (len=4, bufv=<optimized out>, > connv> > > <optimized out>, g=<optimized out>) at conn-socket.c:179 > > > #9 read_data (g=0xf31290, connv=0x110e280, bufv=<optimized out>, > len=4) at > > > conn-socket.c:142 > > > #10 0x00007ffff6e1764a in recv_from_daemon (buf_rtn=0x7fffffffd858, > size_rtn> > > 0x7fffffffd854, g=0xf31290) at proto.c:545 > > > #11 guestfs_int_recv_from_daemon (g=g at entry=0xf31290, > size_rtn=size_rtn at entry> > > 0x7fffffffd854, buf_rtn=buf_rtn at entry=0x7fffffffd858) at proto.c:623 > > > #12 0x00007ffff6e17a5a in guestfs_int_recv > > > (g=g at entry=0xf31290, fn=fn at entry=0x7ffff6e3b3e8 "download", > hdr=hdr at entry> > > 0x7fffffffd920, err=err at entry=0x7fffffffd8f0, xdrp=xdrp at entry=0x0, > ret> > > ret at entry=0x0) > > > at proto.c:668 > > > > > > I debugged this issue and noticed that the appliance logs > from commandrvf are > > > truncated, leading to parse failure (missing utf-8 additional bytes): > > > > https://github.com/libguestfs/libguestfs/blob/master/python/handle.c#L134 > > > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position > 0: invalid > > > start byte > > > > So I thought we'd fixed this in: > > > > > https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f > > > > This is specifically a Python API problem or would it affect > > the C API too? > > The difference with any C API is that almost nothing at the C level will > be validating that the bytes are actually valid utf-8 sequences. > > So the truncated data is unlikely to result in a fatal error. Python is > aggressively validating all bytes, and so you get a hard error from the > truncated UTF-8. Other languages may vary, but I've not seen anything > that makes validation errors a failure in quite such an aggressive way > as python. The problems with decode exceptions have hit soooo many apps > using python over the past few years. Even worse if running in a C > locale as python will reject anything with 8th bit set as being outside > 7-bit asciii, instead of being 8-bit clean in its stream handling. > > > With regards, > Daniel > -- > |: https://berrange.com -o- > https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- > https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- > https://www.instagram.com/dberrange :| > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://listman.redhat.com/archives/libguestfs/attachments/20230213/8cdf0b2a/attachment.htm>
Richard W.M. Jones
2023-Feb-13 20:13 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
On Mon, Feb 13, 2023 at 09:38:50PM +0200, Yonatan Shtarkman wrote:> Thanks for looking into this Rich and Daniel, and thanks for the profound > explanation. > See my answers below. > > `'i' is not used inside the loop?? Or is this error intermittent?` - 'i' can?be > removed, the error is intermittent. > This reproduces on libguestfs 1.46.2 and 1.48.6. > > `This is specifically a Python API problem or would it affect > the C API too?` - I assume it's not in the C API, since this issue doesn't > reproduce when using guestfish. > > I would likely need a complete self-contained reproducer to really go > anywhere with this.` - I can supply a container image with libguestfs 1.46.2 > running the for loop. Let me know if that helps.Could you come up with a Python program, or a Python + guestfish program. Something like ... $ guestfish -N fs -m /dev/sda1 touch '/xxx?' + something in Python that uses the image (test1.img) to reproduce it? Rich.> Also, as a workaround, I avoided calling the event callback if null is returned > by?Py_BuildValue (still print the error and release the thread). > This seems to work for our usage (we only use the event callbacks for logging), > any potential?issues?I'm missing? > > On Mon, Feb 13, 2023 at 8:14 PM Daniel P. Berrang? <berrange at redhat.com> wrote: > > On Mon, Feb 13, 2023 at 06:07:58PM +0000, Richard W.M. Jones wrote: > > On Sun, Feb 12, 2023 at 03:31:08PM +0200, Yonatan Shtarkman wrote: > > > Hey, > > > When downloading a file whose path contains multi-byte utf-8, > libguestfs > > > sometimes crashes. > > > This reproduces when using python, and not when using guestfish. > > > > > > Code to reproduce: > > > for i in range(2000): > > > ? ? g.download ('/xxx?', '/tmp/1') > > > > 'i' is not used inside the loop?? Or is this error intermittent? > > > > > #0 ?raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/ > raise.c:50 > > > #1 ?0x00007ffff7fac140 in <signal handler called> () at /lib/ > x86_64-linux-gnu/ > > > libpthread.so.0 > > > #2 ?0x00007ffff6f77701 in _Py_INCREF (op=<optimized out>) at /usr/ > include/ > > > python3.9/object.h:408 > > > #3 ?guestfs_int_py_event_callback_wrapper > > > ? ? (g=<optimized out>, flags=<optimized out>, array_len=0, array=0x0, > buf_len> > > 47, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \ > 303by", > > > event_handle=0, event=16, callback=0x7ffff2516600) at handle.c:137 > > > #4 ?guestfs_int_py_event_callback_wrapper > > > ? ? (g=<optimized out>, callback=0x7ffff2516600, event=16, event_handle > =0, > > > flags=<optimized out>, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm > --debug > > > settle -E \303by", buf_len=47, array=0x0, array_len=0) at handle.c:104 > > > #5 ?0x00007ffff6e0076a in guestfs_int_call_callbacks_message (g> 0xf31290, event > > > =16, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm --debug settle -E \ > 303by", > > > buf_len=47) > > > ? ? at events.c:117 > > > #6 ?0x00007ffff6e1702e in guestfs_int_log_message_callback > > > ? ? (g=g at entry=0xf31290, buf=0x113b8a0 "gs=0x0\r\ncommandrvf: udevadm > --debug > > > settle -E \303by", len=len at entry=47) at proto.c:145 > > > #7 ?0x00007ffff6dfb759 in handle_log_message (g=g at entry=0xf31290, conn> > > conn at entry=0x110e280) at conn-socket.c:395 > > > #8 ?0x00007ffff6dfbd63 in read_data (len=4, bufv=<optimized out>, connv > > > > <optimized out>, g=<optimized out>) at conn-socket.c:179 > > > #9 ?read_data (g=0xf31290, connv=0x110e280, bufv=<optimized out>, len> 4) at > > > conn-socket.c:142 > > > #10 0x00007ffff6e1764a in recv_from_daemon (buf_rtn=0x7fffffffd858, > size_rtn> > > 0x7fffffffd854, g=0xf31290) at proto.c:545 > > > #11 guestfs_int_recv_from_daemon (g=g at entry=0xf31290, size_rtn> size_rtn at entry> > > 0x7fffffffd854, buf_rtn=buf_rtn at entry=0x7fffffffd858) at proto.c:623 > > > #12 0x00007ffff6e17a5a in guestfs_int_recv > > > ? ? (g=g at entry=0xf31290, fn=fn at entry=0x7ffff6e3b3e8 "download", hdr> hdr at entry> > > 0x7fffffffd920, err=err at entry=0x7fffffffd8f0, xdrp=xdrp at entry=0x0, ret> > > ret at entry=0x0) > > > ? ? at proto.c:668 > > > > > > I debugged this issue and noticed that the appliance logs > from?commandrvf?are > > > truncated, leading to parse failure (missing utf-8 additional bytes): > > > https://github.com/libguestfs/libguestfs/blob/master/python/handle.c# > L134 > > > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 0: > invalid > > > start byte > > > > So I thought we'd fixed this in: > > > > https://github.com/libguestfs/libguestfs/commit/ > 0ee02e0117527b86a31b2a88a14994ce7f15571f > > > > This is specifically a Python API problem or would it affect > > the C API too? > > The difference with any C API is that almost nothing at the C level will > be validating that the bytes are actually valid utf-8 sequences. > > So the truncated data is unlikely to result in a fatal error. Python is > aggressively validating all bytes, and so you get a hard error from the > truncated UTF-8.? Other languages may vary, but I've not seen anything > that makes validation errors a failure in quite such an aggressive way > as python. The problems with decode exceptions have hit soooo many apps > using python over the past few years. Even worse if running in a C > locale as python will reject anything with 8th bit set as being outside > 7-bit asciii, instead of being 8-bit clean in its stream handling. > > > With regards, > Daniel > -- > |: https://berrange.com? ? ? -o-? ? https://www.flickr.com/photos/dberrange > :| > |: https://libvirt.org? ? ? ? ?-o-? ? ? ? ? ? https://fstop138.berrange.com > :| > |: https://entangle-photo.org? ? -o-? ? https://www.instagram.com/dberrange > :| > >-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
Richard W.M. Jones
2023-Feb-13 20:16 UTC
[Libguestfs] Issue with downloading files whose path contains multi-byte utf-8 characters
On Mon, Feb 13, 2023 at 09:38:50PM +0200, Yonatan Shtarkman wrote:> Also, as a workaround, I avoided calling the event callback if null is returned > by?Py_BuildValue (still print the error and release the thread). > This seems to work for our usage (we only use the event callbacks for logging), > any potential?issues?I'm missing?Oh I see so the issue is actually when logging and not in the download call? That's strange that it should be happening intermittently. I'd expect a UTF-8 encoding error would happen every time not sometimes. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top