Sam Eiderman
2020-Jun-03 11:52 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
On Wed, May 13, 2020 at 10:06 PM Richard W.M. Jones <rjones@redhat.com> wrote:
>
> On Sun, Apr 26, 2020 at 09:14:03PM +0300, Sam Eiderman wrote:
> > The python3 bindings create PyUnicode objects from application strings
> > on the guest (i.e. installed rpm, deb packages).
> > It is documented that rpm package fields such as description should be
> > utf8 encoded - however in some cases they are not a valid unicode
> > string, on SLES11 SP4 the encoding of the description of the following
> > packages is latin1 and they fail to be converted to unicode using
> > guestfs_int_py_fromstring() (which invokes PyUnicode_FromString()):
> >
> >   PackageKit
> >   aaa_base
> >   coreutils
> >   dejavu
> >   desktop-data-SLED
> >   gnome-utils
> >   hunspell
> >   hunspell-32bit
> >   hunspell-tools
> >   libblocxx6
> >   libexif
> >   libgphoto2
> >   libgtksourceview-2_0-0
> >   libmpfr1
> >   libopensc2
> >   libopensc2-32bit
> >   liborc-0_4-0
> >   libpackagekit-glib10
> >   libpixman-1-0
> >   libpixman-1-0-32bit
> >   libpoppler-glib4
> >   libpoppler5
> >   libsensors3
> >   libtelepathy-glib0
> >   m4
> >   opensc
> >   opensc-32bit
> >   permissions
> >   pinentry
> >   poppler-tools
> >   python-gtksourceview
> >   splashy
> >   syslog-ng
> >   tar
> >   tightvnc
> >   xorg-x11
> >   xorg-x11-xauth
> >   yast2-mouse
> >
> > Fix this by globally changing guestfs_int_py_fromstring()
> > and guestfs_int_py_fromstringsize() to fallback to latin1 decoding if
> > utf-8 decoding fails.
> >
> > Using the "strict" error handler doesn't matter in the case of latin1
> > and has the same effect of "replace":
> >
> > https://docs.python.org/3/library/codecs.html#error-handlers
> >
> > Signed-off-by: Sam Eiderman <sameid@google.com>
> > ---
> >  python/handle.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/python/handle.c b/python/handle.c
> > index 2fb8c18f0..fe89dc58a 100644
> > --- a/python/handle.c
> > +++ b/python/handle.c
> > @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromString (str);
> >  #else
> > -  return PyUnicode_FromString (str);
> > +  return guestfs_int_py_fromstringsize (str, strlen (str));
> >  #endif
> >  }
> >
> > @@ -397,7 +397,12 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromStringAndSize (str, size);
> >  #else
> > -  return PyUnicode_FromStringAndSize (str, size);
> > +  PyObject *s = PyUnicode_FromString (str);
> > +  if (s == NULL) {
> > +    PyErr_Clear ();
> > +    s = PyUnicode_Decode (str, strlen(str), "latin1", "strict");
> > +  }
> > +  return s;
> >  #endif
> >  }
>
> Looks OK to me.  Pino - any objections to merging this?
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-df lists disk usage of guests without needing to install any
> software inside the virtual machine.  Supports Linux and Windows.
> http://people.redhat.com/~rjones/virt-df/
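For readers who want to see the fallback idea in isolation, here is a minimal Python-level sketch of what the quoted patch is described as doing: try UTF-8 first, then decode as latin1 if that fails. The function name decode_with_latin1_fallback and the sample strings are hypothetical, made up for this illustration; the actual change lives in C in python/handle.c.

```python
def decode_with_latin1_fallback(raw: bytes) -> str:
    """Decode raw as UTF-8; if the bytes are not valid UTF-8, decode as latin1.

    Hypothetical helper for illustration only, not part of libguestfs.
    """
    try:
        return raw.decode("utf-8", "strict")
    except UnicodeDecodeError:
        # latin1 maps every one of the 256 byte values to a code point,
        # so the "strict" handler can never raise on this path.
        return raw.decode("latin1", "strict")


valid_utf8 = "café".encode("utf-8")    # b'caf\xc3\xa9'
legacy = "café".encode("latin1")       # b'caf\xe9', not valid UTF-8

print(decode_with_latin1_fallback(valid_utf8))
print(decode_with_latin1_fallback(legacy))
```

One caveat of this approach: since latin1 accepts any byte sequence, the fallback cannot tell latin1 apart from other 8-bit encodings, so text in some other legacy encoding would decode without error but to the wrong characters.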
Sam Eiderman
2020-Jun-30 08:35 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
gentle ping
Richard W.M. Jones
2020-Jul-06 11:39 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
Hi Sam,

I was doing some work on the Python bindings, starting with removing
support for Python 2 since it's EOL.  I thought I would have a look at
this patch.

So firstly I think the last version posted is:

https://www.redhat.com/archives/libguestfs/2020-April/msg00190.html

My impression of this is that we shouldn't just hack the Python
bindings to make this apparently work.  But I wanted to ask you a few
questions about this:

- Does the SUSE RPM output contain a mix of encodings?  Or is it all
  latin-1 or utf-8?

- Is there any indication of the correct encoding from RPM?

- Can we not instead escape the bad sequences using whatever is the
  C-level equivalent of str.encode(..., 'backslashreplace')?  Or I
  guess better, escape them as Unicode compatibility characters
  https://en.wikipedia.org/wiki/Unicode_compatibility_characters

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
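As a rough illustration of the escaping alternative raised in the last question, the sketch below decodes with Python's built-in "backslashreplace" and "surrogateescape" error handlers instead of guessing a second encoding. The sample bytes are invented for the example, and this is only a Python-level sketch. Passing the same handler name to PyUnicode_DecodeUTF8() is presumably the C-level counterpart, but that is an assumption and was not tested here.

```python
# Invented sample: a latin1-encoded "é" (0xE9) inside otherwise ASCII text.
raw = b"caf\xe9 utility"

# backslashreplace (usable when decoding since Python 3.5) keeps the
# invalid byte visible as a literal \xNN escape in the resulting string.
escaped = raw.decode("utf-8", "backslashreplace")

# surrogateescape maps each invalid byte to a lone surrogate code point,
# so the original bytes can be recovered exactly later on.
lossless = raw.decode("utf-8", "surrogateescape")
recovered = lossless.encode("utf-8", "surrogateescape")

print(escaped)
print(recovered == raw)
```

The trade-off, compared with a latin1 fallback, is that the caller sees an escape sequence rather than a guessed character, which avoids mis-decoding when the real encoding is unknown or mixed.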