Sam Eiderman
2020-Apr-26  18:14 UTC
[Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
The python3 bindings create PyUnicode objects from application strings
on the guest (i.e. installed rpm, deb packages).
It is documented that rpm package fields such as description should be
utf8 encoded - however in some cases they are not a valid unicode
string, on SLES11 SP4 the encoding of the description of the following
packages is latin1 and they fail to be converted to unicode using
guestfs_int_py_fromstring() (which invokes PyUnicode_FromString()):
 PackageKit
 aaa_base
 coreutils
 dejavu
 desktop-data-SLED
 gnome-utils
 hunspell
 hunspell-32bit
 hunspell-tools
 libblocxx6
 libexif
 libgphoto2
 libgtksourceview-2_0-0
 libmpfr1
 libopensc2
 libopensc2-32bit
 liborc-0_4-0
 libpackagekit-glib10
 libpixman-1-0
 libpixman-1-0-32bit
 libpoppler-glib4
 libpoppler5
 libsensors3
 libtelepathy-glib0
 m4
 opensc
 opensc-32bit
 permissions
 pinentry
 poppler-tools
 python-gtksourceview
 splashy
 syslog-ng
 tar
 tightvnc
 xorg-x11
 xorg-x11-xauth
 yast2-mouse
Fix this by globally changing guestfs_int_py_fromstring()
and guestfs_int_py_fromstringsize() to fallback to latin1 decoding if
utf-8 decoding fails.
Using the "strict" error handler doesn't matter in the case of latin1
and has the same effect of "replace":
 https://docs.python.org/3/library/codecs.html#error-handlers
Signed-off-by: Sam Eiderman <sameid@google.com>
---
 python/handle.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/python/handle.c b/python/handle.c
index 2fb8c18f0..fe89dc58a 100644
--- a/python/handle.c
+++ b/python/handle.c
@@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
 #if PY_MAJOR_VERSION < 3
   return PyString_FromString (str);
 #else
-  return PyUnicode_FromString (str);
+  return guestfs_int_py_fromstringsize (str, strlen (str));
 #endif
 }
 
@@ -397,7 +397,12 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
 #if PY_MAJOR_VERSION < 3
   return PyString_FromStringAndSize (str, size);
 #else
-  return PyUnicode_FromStringAndSize (str, size);
+  PyObject *s = PyUnicode_FromString (str);
+  if (s == NULL) {
+    PyErr_Clear ();
+    s = PyUnicode_Decode (str, strlen(str), "latin1", "strict");
+  }
+  return s;
 #endif
 }
 
-- 
2.26.2.303.gf8c07b1a785-goog
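The behaviour the patch relies on can be checked from Python itself. The sketch below is an illustration only, not libguestfs code: `decode_guest_string` is a hypothetical helper that mirrors the C logic (try UTF-8, then fall back to latin1), and the sample byte strings are made up. It also shows why the "strict" handler cannot fail for latin1: each of the 256 possible byte values maps to a code point.

```python
def decode_guest_string(raw: bytes) -> str:
    """Hypothetical helper mirroring the patched C logic."""
    try:
        return raw.decode("utf-8", "strict")
    except UnicodeDecodeError:
        # latin1 assigns a code point to every byte value, so the
        # "strict" handler never raises here.
        return raw.decode("latin1", "strict")


# Made-up sample: the same text encoded two ways.
utf8_bytes = "caf\u00e9".encode("utf-8")      # b"caf\xc3\xa9"
latin1_bytes = "caf\u00e9".encode("latin1")   # b"caf\xe9", not valid UTF-8

# Both inputs decode to the same text.
print(decode_guest_string(utf8_bytes) == decode_guest_string(latin1_bytes))

# Every possible byte decodes under latin1 with the strict handler.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin1", "strict") == "".join(chr(i) for i in range(256))
```

Because the latin1 step always succeeds, text stored in some other legacy encoding would come back as wrong characters rather than as an error. That is the trade-off of a fallback of this kind.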
Sam Eiderman
2020-May-13  15:44 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
bump

On Sun, Apr 26, 2020 at 9:14 PM Sam Eiderman <sameid@google.com> wrote:
>
> The python3 bindings create PyUnicode objects from application strings
> on the guest (i.e. installed rpm, deb packages).
> It is documented that rpm package fields such as description should be
> utf8 encoded - however in some cases they are not a valid unicode
> string, on SLES11 SP4 the encoding of the description of the following
> packages is latin1 and they fail to be converted to unicode using
> guestfs_int_py_fromstring() (which invokes PyUnicode_FromString()):
>
>  PackageKit
>  aaa_base
>  coreutils
>  dejavu
>  desktop-data-SLED
>  gnome-utils
>  hunspell
>  hunspell-32bit
>  hunspell-tools
>  libblocxx6
>  libexif
>  libgphoto2
>  libgtksourceview-2_0-0
>  libmpfr1
>  libopensc2
>  libopensc2-32bit
>  liborc-0_4-0
>  libpackagekit-glib10
>  libpixman-1-0
>  libpixman-1-0-32bit
>  libpoppler-glib4
>  libpoppler5
>  libsensors3
>  libtelepathy-glib0
>  m4
>  opensc
>  opensc-32bit
>  permissions
>  pinentry
>  poppler-tools
>  python-gtksourceview
>  splashy
>  syslog-ng
>  tar
>  tightvnc
>  xorg-x11
>  xorg-x11-xauth
>  yast2-mouse
>
> Fix this by globally changing guestfs_int_py_fromstring()
> and guestfs_int_py_fromstringsize() to fallback to latin1 decoding if
> utf-8 decoding fails.
>
> Using the "strict" error handler doesn't matter in the case of latin1
> and has the same effect of "replace":
>
>  https://docs.python.org/3/library/codecs.html#error-handlers
>
> Signed-off-by: Sam Eiderman <sameid@google.com>
> ---
>  python/handle.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/python/handle.c b/python/handle.c
> index 2fb8c18f0..fe89dc58a 100644
> --- a/python/handle.c
> +++ b/python/handle.c
> @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromString (str);
>  #else
> -  return PyUnicode_FromString (str);
> +  return guestfs_int_py_fromstringsize (str, strlen (str));
>  #endif
>  }
>
> @@ -397,7 +397,12 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromStringAndSize (str, size);
>  #else
> -  return PyUnicode_FromStringAndSize (str, size);
> +  PyObject *s = PyUnicode_FromString (str);
> +  if (s == NULL) {
> +    PyErr_Clear ();
> +    s = PyUnicode_Decode (str, strlen(str), "latin1", "strict");
> +  }
> +  return s;
>  #endif
>  }
>
> --
> 2.26.2.303.gf8c07b1a785-goog
>
Richard W.M. Jones
2020-May-13  19:06 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
On Sun, Apr 26, 2020 at 09:14:03PM +0300, Sam Eiderman wrote:
> The python3 bindings create PyUnicode objects from application strings
> on the guest (i.e. installed rpm, deb packages).
> It is documented that rpm package fields such as description should be
> utf8 encoded - however in some cases they are not a valid unicode
> string, on SLES11 SP4 the encoding of the description of the following
> packages is latin1 and they fail to be converted to unicode using
> guestfs_int_py_fromstring() (which invokes PyUnicode_FromString()):
>
>  PackageKit
>  aaa_base
>  coreutils
>  dejavu
>  desktop-data-SLED
>  gnome-utils
>  hunspell
>  hunspell-32bit
>  hunspell-tools
>  libblocxx6
>  libexif
>  libgphoto2
>  libgtksourceview-2_0-0
>  libmpfr1
>  libopensc2
>  libopensc2-32bit
>  liborc-0_4-0
>  libpackagekit-glib10
>  libpixman-1-0
>  libpixman-1-0-32bit
>  libpoppler-glib4
>  libpoppler5
>  libsensors3
>  libtelepathy-glib0
>  m4
>  opensc
>  opensc-32bit
>  permissions
>  pinentry
>  poppler-tools
>  python-gtksourceview
>  splashy
>  syslog-ng
>  tar
>  tightvnc
>  xorg-x11
>  xorg-x11-xauth
>  yast2-mouse
>
> Fix this by globally changing guestfs_int_py_fromstring()
> and guestfs_int_py_fromstringsize() to fallback to latin1 decoding if
> utf-8 decoding fails.
>
> Using the "strict" error handler doesn't matter in the case of latin1
> and has the same effect of "replace":
>
>  https://docs.python.org/3/library/codecs.html#error-handlers
>
> Signed-off-by: Sam Eiderman <sameid@google.com>
> ---
>  python/handle.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/python/handle.c b/python/handle.c
> index 2fb8c18f0..fe89dc58a 100644
> --- a/python/handle.c
> +++ b/python/handle.c
> @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromString (str);
>  #else
> -  return PyUnicode_FromString (str);
> +  return guestfs_int_py_fromstringsize (str, strlen (str));
>  #endif
>  }
>
> @@ -397,7 +397,12 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromStringAndSize (str, size);
>  #else
> -  return PyUnicode_FromStringAndSize (str, size);
> +  PyObject *s = PyUnicode_FromString (str);
> +  if (s == NULL) {
> +    PyErr_Clear ();
> +    s = PyUnicode_Decode (str, strlen(str), "latin1", "strict");
> +  }
> +  return s;
>  #endif
>  }

Looks OK to me. Pino - any objections to merging this?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Sam Eiderman
2020-Jun-03  11:52 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
On Wed, May 13, 2020 at 10:06 PM Richard W.M. Jones <rjones@redhat.com> wrote:
>
> On Sun, Apr 26, 2020 at 09:14:03PM +0300, Sam Eiderman wrote:
> > The python3 bindings create PyUnicode objects from application strings
> > on the guest (i.e. installed rpm, deb packages).
> > It is documented that rpm package fields such as description should be
> > utf8 encoded - however in some cases they are not a valid unicode
> > string, on SLES11 SP4 the encoding of the description of the following
> > packages is latin1 and they fail to be converted to unicode using
> > guestfs_int_py_fromstring() (which invokes PyUnicode_FromString()):
> >
> >  PackageKit
> >  aaa_base
> >  coreutils
> >  dejavu
> >  desktop-data-SLED
> >  gnome-utils
> >  hunspell
> >  hunspell-32bit
> >  hunspell-tools
> >  libblocxx6
> >  libexif
> >  libgphoto2
> >  libgtksourceview-2_0-0
> >  libmpfr1
> >  libopensc2
> >  libopensc2-32bit
> >  liborc-0_4-0
> >  libpackagekit-glib10
> >  libpixman-1-0
> >  libpixman-1-0-32bit
> >  libpoppler-glib4
> >  libpoppler5
> >  libsensors3
> >  libtelepathy-glib0
> >  m4
> >  opensc
> >  opensc-32bit
> >  permissions
> >  pinentry
> >  poppler-tools
> >  python-gtksourceview
> >  splashy
> >  syslog-ng
> >  tar
> >  tightvnc
> >  xorg-x11
> >  xorg-x11-xauth
> >  yast2-mouse
> >
> > Fix this by globally changing guestfs_int_py_fromstring()
> > and guestfs_int_py_fromstringsize() to fallback to latin1 decoding if
> > utf-8 decoding fails.
> >
> > Using the "strict" error handler doesn't matter in the case of latin1
> > and has the same effect of "replace":
> >
> >  https://docs.python.org/3/library/codecs.html#error-handlers
> >
> > Signed-off-by: Sam Eiderman <sameid@google.com>
> > ---
> >  python/handle.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/python/handle.c b/python/handle.c
> > index 2fb8c18f0..fe89dc58a 100644
> > --- a/python/handle.c
> > +++ b/python/handle.c
> > @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromString (str);
> >  #else
> > -  return PyUnicode_FromString (str);
> > +  return guestfs_int_py_fromstringsize (str, strlen (str));
> >  #endif
> >  }
> >
> > @@ -397,7 +397,12 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromStringAndSize (str, size);
> >  #else
> > -  return PyUnicode_FromStringAndSize (str, size);
> > +  PyObject *s = PyUnicode_FromString (str);
> > +  if (s == NULL) {
> > +    PyErr_Clear ();
> > +    s = PyUnicode_Decode (str, strlen(str), "latin1", "strict");
> > +  }
> > +  return s;
> >  #endif
> >  }
>
> Looks OK to me. Pino - any objections to merging this?
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-df lists disk usage of guests without needing to install any
> software inside the virtual machine. Supports Linux and Windows.
> http://people.redhat.com/~rjones/virt-df/
>
Pino Toscano
2020-Jun-30  08:42 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
On Sunday, 26 April 2020 20:14:03 CEST Sam Eiderman wrote:
> The python3 bindings create PyUnicode objects from application strings
> on the guest (i.e. installed rpm, deb packages).
> It is documented that rpm package fields such as description should be
> utf8 encoded - however in some cases they are not a valid unicode
> string, on SLES11 SP4 the encoding of the description of the following
> packages is latin1 and they fail to be converted to unicode using
> guestfs_int_py_fromstring() (which invokes PyUnicode_FromString()):

Sorry, I wanted to reach our resident Python maintainers to get their
feedback, and so far had no time for it. Will do it shortly.

BTW do you have a reproducer I can actually try freely?

> diff --git a/python/handle.c b/python/handle.c
> index 2fb8c18f0..fe89dc58a 100644
> --- a/python/handle.c
> +++ b/python/handle.c
> @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromString (str);
>  #else
> -  return PyUnicode_FromString (str);
> +  return guestfs_int_py_fromstringsize (str, strlen (str));
>  #endif
>  }
>
> @@ -397,7 +397,12 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromStringAndSize (str, size);
>  #else
> -  return PyUnicode_FromStringAndSize (str, size);
> +  PyObject *s = PyUnicode_FromString (str);
> +  if (s == NULL) {
> +    PyErr_Clear ();
> +    s = PyUnicode_Decode (str, strlen(str), "latin1", "strict");

Minor nit: space between "strlen" and the opening bracket.

Also, isn't there any error we can check as a way to detect this
situation, rather than always attempting to decode it as latin1?

Thanks,
-- 
Pino Toscano
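On the question of detecting the situation through a specific error: at the Python level, a failed UTF-8 decode raises `UnicodeDecodeError`, which records the offset of the offending byte. The sketch below is a hypothetical illustration (the helper name `decode_or_report` and the sample bytes are assumptions, not part of the patch). It falls back only for that exception and reports where decoding failed. In the C API a comparable check could presumably be made with `PyErr_ExceptionMatches (PyExc_UnicodeDecodeError)` before clearing the error; the posted patch does not do this.

```python
from typing import Optional, Tuple


def decode_or_report(raw: bytes) -> Tuple[str, Optional[int]]:
    """Hypothetical helper: return (text, offset of first bad byte or None)."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that is not valid UTF-8.
        # Any other exception type propagates to the caller unchanged.
        return raw.decode("latin1"), exc.start


# Made-up input: 0xEF is a latin1 character but starts an incomplete
# UTF-8 sequence here, so the fallback path is taken.
text, bad_offset = decode_or_report(b"na\xefve")
print(ascii(text), bad_offset)
```

Reporting the offset lets the caller log which field needed the fallback instead of silently accepting it.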
Sam Eiderman
2020-Jun-30  08:53 UTC
Re: [Libguestfs] [PATCH v3] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
Hey Pino,

Can you search for the previous patches I submitted?
I had some discussions regarding this with Daniel and Nir.

Thanks!

On Tue, Jun 30, 2020 at 11:43 AM Pino Toscano <ptoscano@redhat.com> wrote:
> On Sunday, 26 April 2020 20:14:03 CEST Sam Eiderman wrote:
> > The python3 bindings create PyUnicode objects from application strings
> > on the guest (i.e. installed rpm, deb packages).
> > It is documented that rpm package fields such as description should be
> > utf8 encoded - however in some cases they are not a valid unicode
> > string, on SLES11 SP4 the encoding of the description of the following
> > packages is latin1 and they fail to be converted to unicode using
> > guestfs_int_py_fromstring() (which invokes PyUnicode_FromString()):
>
> Sorry, I wanted to reach our resident Python maintainers to get their
> feedback, and so far had no time for it. Will do it shortly.
>
> BTW do you have a reproducer I can actually try freely?
>
> > diff --git a/python/handle.c b/python/handle.c
> > index 2fb8c18f0..fe89dc58a 100644
> > --- a/python/handle.c
> > +++ b/python/handle.c
> > @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromString (str);
> >  #else
> > -  return PyUnicode_FromString (str);
> > +  return guestfs_int_py_fromstringsize (str, strlen (str));
> >  #endif
> >  }
> >
> > @@ -397,7 +397,12 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
> >  #if PY_MAJOR_VERSION < 3
> >    return PyString_FromStringAndSize (str, size);
> >  #else
> > -  return PyUnicode_FromStringAndSize (str, size);
> > +  PyObject *s = PyUnicode_FromString (str);
> > +  if (s == NULL) {
> > +    PyErr_Clear ();
> > +    s = PyUnicode_Decode (str, strlen(str), "latin1", "strict");
>
> Minor nit: space between "strlen" and the opening bracket.
>
> Also, isn't there any error we can check as a way to detect this
> situation, rather than always attempting to decode it as latin1?
>
> Thanks,
> --
> Pino Toscano