Sam Eiderman
2020-Apr-20 12:37 UTC
[Libguestfs] [PATCH v2] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
The python3 bindings create unicode objects from application strings
on the guest (i.e. installed rpm, deb packages).
It is documented that rpm package fields such as description should be
utf8 encoded - however in some cases they are not a valid unicode
string, on SLES11 SP4 the following packages fail to be converted to
unicode using guestfs_int_py_fromstring() (which invokes
PyUnicode_FromString()):

 PackageKit
 aaa_base
 coreutils
 dejavu
 desktop-data-SLED
 gnome-utils
 hunspell
 hunspell-32bit
 hunspell-tools
 libblocxx6
 libexif
 libgphoto2
 libgtksourceview-2_0-0
 libmpfr1
 libopensc2
 libopensc2-32bit
 liborc-0_4-0
 libpackagekit-glib10
 libpixman-1-0
 libpixman-1-0-32bit
 libpoppler-glib4
 libpoppler5
 libsensors3
 libtelepathy-glib0
 m4
 opensc
 opensc-32bit
 permissions
 pinentry
 poppler-tools
 python-gtksourceview
 splashy
 syslog-ng
 tar
 tightvnc
 xorg-x11
 xorg-x11-xauth
 yast2-mouse

Fix this by globally changing guestfs_int_py_fromstring()
and guestfs_int_py_fromstringsize() to decode utf-8 with the "replace"
error handler:

    https://docs.python.org/3/library/codecs.html#error-handlers

For example, this will decode PackageKit's description on SLES4 the
following way:

      Backend: pisi
          S.�ağlar Onur <caglar@pardus.org.tr>

Signed-off-by: Sam Eiderman <sameid@google.com>
---
 python/handle.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/handle.c b/python/handle.c
index 2fb8c18f0..427424707 100644
--- a/python/handle.c
+++ b/python/handle.c
@@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
 #if PY_MAJOR_VERSION < 3
   return PyString_FromString (str);
 #else
-  return PyUnicode_FromString (str);
+  return PyUnicode_Decode(str, strlen(str), "utf-8", "replace");
 #endif
 }

@@ -397,7 +397,7 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
 #if PY_MAJOR_VERSION < 3
   return PyString_FromStringAndSize (str, size);
 #else
-  return PyUnicode_FromStringAndSize (str, size);
+  return PyUnicode_Decode(str, size, "utf-8", "replace");
 #endif
 }
-- 
2.26.1.301.g55bc3eb7cb9-goog
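The behaviour change in the patch can be sketched at the Python level. This is only an illustration and not part of the patch: `bytes.decode(..., errors="replace")` stands in for `PyUnicode_Decode(str, size, "utf-8", "replace")`, the helper names are made up for this sketch, and the byte string is a hypothetical input chosen to reproduce the output shown in the commit message. The real bytes stored in the SLES package are not given in this thread.

```python
# Hypothetical raw description bytes: a lone 0xC7 byte (invalid as UTF-8 in
# this position) followed by text that is valid UTF-8. This input is an
# assumption for illustration; the real package bytes are not in the thread.
raw = b"S.\xc7a\xc4\x9flar Onur <caglar@pardus.org.tr>"


def decode_strict(data: bytes) -> str:
    """Strict UTF-8 decoding: raises UnicodeDecodeError on invalid input."""
    return data.decode("utf-8")


def decode_replace(data: bytes) -> str:
    """UTF-8 decoding with the "replace" handler: invalid bytes become U+FFFD."""
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_replace(raw)
print("strict decoding failed:", strict_failed)
print("with 'replace':", text)
```

With the "replace" handler only the offending byte is lost; the rest of the string, including correctly encoded non-ASCII characters, survives.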
Daniel P. Berrangé
2020-Apr-20 12:59 UTC
Re: [Libguestfs] [PATCH v2] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
On Mon, Apr 20, 2020 at 03:37:16PM +0300, Sam Eiderman wrote:
> [...]
> Fix this by globally changing guestfs_int_py_fromstring()
> and guestfs_int_py_fromstringsize() to decode utf-8 with the "replace"
> error handler:
>
>     https://docs.python.org/3/library/codecs.html#error-handlers
> [...]
> diff --git a/python/handle.c b/python/handle.c
> index 2fb8c18f0..427424707 100644
> --- a/python/handle.c
> +++ b/python/handle.c
> @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromString (str);
>  #else
> -  return PyUnicode_FromString (str);
> +  return PyUnicode_Decode(str, strlen(str), "utf-8", "replace");
>  #endif
>  }
>
> @@ -397,7 +397,7 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromStringAndSize (str, size);
>  #else
> -  return PyUnicode_FromStringAndSize (str, size);
> +  return PyUnicode_Decode(str, size, "utf-8", "replace");
>  #endif
>  }

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Sam Eiderman
2020-Apr-21 09:29 UTC
Re: [Libguestfs] [PATCH v2] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
If it is possible to fix a typo in the commit message while submitting,
that would be great:

SLES4 -> SLES11 SP4

On Mon, Apr 20, 2020 at 3:59 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Mon, Apr 20, 2020 at 03:37:16PM +0300, Sam Eiderman wrote:
> > [...]
> > For example, this will decode PackageKit's description on SLES4 the
> > following way:
> >
> >       Backend: pisi
> >           S.�ağlar Onur <caglar@pardus.org.tr>
> > [...]
>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>
> Regards,
> Daniel
Nir Soffer
2020-Apr-23 18:33 UTC
Re: [Libguestfs] [PATCH v2] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
On Mon, Apr 20, 2020 at 3:38 PM Sam Eiderman <sameid@google.com> wrote:
>
> The python3 bindings create unicode objects from application strings
> on the guest (i.e. installed rpm, deb packages).
> It is documented that rpm package fields such as description should be
> utf8 encoded - however in some cases they are not a valid unicode
> string,

So what are they? latin1 maybe?

Maybe use:

    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        value.decode("latin1")

This will always succeed, producing possibly garbage output but so is
errors='replace'.

> on SLES11 SP4 the following packages fail to be converted to
> unicode using guestfs_int_py_fromstring() (which invokes
> PyUnicode_FromString()):
> [...]
> For example, this will decode PackageKit's description on SLES4 the
> following way:
>
>       Backend: pisi
>           S.�ağlar Onur <caglar@pardus.org.tr>

What is the original text?

Nir
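The trade-off Nir describes can be made concrete with a small sketch. It assumes a hypothetical input (the real package bytes are not given in this thread) that mixes one latin1-encoded character with one UTF-8-encoded character, and the function name `decode_with_latin1_fallback` is made up for illustration.

```python
def decode_with_latin1_fallback(value: bytes) -> str:
    """Try UTF-8 first; on failure decode as latin1, which accepts any byte."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return value.decode("latin1")


# Hypothetical bytes (an assumption, not taken from the real package):
# 0xC7 is "Ç" in latin1, while 0xC4 0x9F is "ğ" in UTF-8.
raw = b"S.\xc7a\xc4\x9flar Onur"

fallback = decode_with_latin1_fallback(raw)
replaced = raw.decode("utf-8", errors="replace")

# ascii() makes the difference visible without depending on the terminal.
print("latin1 fallback:", ascii(fallback))
print("utf-8 replace:  ", ascii(replaced))
```

On this input the latin1 fallback recovers the first character but turns the UTF-8 sequence into two unrelated characters, whereas "replace" keeps the UTF-8 character and loses the first one. Neither approach recovers the original text when encodings are mixed within one string.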
Sam Eiderman
2020-Apr-25 17:32 UTC
Re: [Libguestfs] [PATCH v2] python: Fix UnicodeError in inspect_list_applications2() (RHBZ#1684004)
Hi Nir,

I think latin1.

How do you think we should handle latin1 errors then? Replace on latin1
or replace on utf-8?

    for codec in ["utf8", "latin1"]:
        try:
            return decode(b, codec)
        except:
            pass
    return decode(b, "utf8", errors="replace")

(Pseudocode, will be implemented in C)

On Thu, Apr 23, 2020, 21:34 Nir Soffer <nsoffer@redhat.com> wrote:
> On Mon, Apr 20, 2020 at 3:38 PM Sam Eiderman <sameid@google.com> wrote:
> > [...]
> > It is documented that rpm package fields such as description should be
> > utf8 encoded - however in some cases they are not a valid unicode
> > string,
>
> So what are they? latin1 maybe?
>
> Maybe use:
>
>     try:
>         value.decode("utf-8")
>     except UnicodeDecodeError:
>         value.decode("latin1")
>
> This will always succeed, producing possibly garbage output but so is
> errors='replace'.
>
> > [...]
> > For example, this will decode PackageKit's description on SLES4 the
> > following way:
> >
> >       Backend: pisi
> >           S.�ağlar Onur <caglar@pardus.org.tr>
>
> What is the original text?
>
> Nir
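The pseudocode above can be turned into a runnable Python sketch. The function name `decode_guest_string` and its `codecs` parameter are hypothetical, and this is not the C implementation mentioned in the message. The sketch also checks one fact relevant to the question about latin1 errors: Python's latin1 codec maps every byte value 0 to 255 to the code point with the same number, so decoding with it cannot fail and the final "replace" step is only reached when latin1 is not in the codec list.

```python
def decode_guest_string(b: bytes, codecs=("utf-8", "latin1")) -> str:
    """Try each codec strictly, in order; fall back to UTF-8 with "replace"."""
    for codec in codecs:
        try:
            return b.decode(codec)
        except UnicodeDecodeError:
            continue
    return b.decode("utf-8", errors="replace")


# latin1 accepts every possible byte value and maps it to the same code point.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin1")
latin1_is_total = [ord(c) for c in latin1_text] == list(range(256))

# With latin1 in the list, invalid UTF-8 is always absorbed by latin1.
via_latin1 = decode_guest_string(b"\xff")

# The final fallback only matters for a list without such a catch-all codec.
via_replace = decode_guest_string(b"\xff", codecs=("utf-8", "ascii"))

print(latin1_is_total, ascii(via_latin1), ascii(via_replace))
```

Under these assumptions, "replace on latin1" never triggers, so the only real choice is between the latin1 fallback and "replace" on utf-8.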