thr3ads.net - Xapian discuss - [Xapian-discuss] Python bindings and unicode strings [Aug 2007]


Deron Meranda

2007-Aug-30 20:02 UTC

[Xapian-discuss] Python bindings and unicode strings

I understand that the Xapian core uses UTF-8, but is there a way to
get the Python bindings to always work with Python's native unicode
string type so that the underlying UTF-8 is not exposed?  It appears
that I can store unicode strings, like;
>>>  document.set_term( u'panach\u00e9' )
but then when I get them back out they're plain byte sequences (UTF-8
encoded) rather than nice unicode strings,
>>>  [t.term for t in document.allterms()]
['panach\xc3\xa9']

I would have expected to get [u'panach\u00e9'] out instead.

Deron Meranda

James Aylett

2007-Sep-02 13:38 UTC

[Xapian-discuss] Python bindings and unicode strings

On Thu, Aug 30, 2007 at 03:02:22PM -0400, Deron Meranda wrote:
> I understand that the Xapian core uses UTF-8, but is there a way to
> get the Python bindings to always work with Python's native unicode
> string type so that the underlying UTF-8 is not exposed?

This isn't true, and therein lies the problem. Xapian core treats
everything as blobs of bytes; in many cases the sensible choice for
applications is to put UTF-8 in there.
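
To illustrate the "blobs of bytes" point, here is a minimal Python 3 sketch that uses no Xapian calls at all. It only shows that the same text becomes different bytes depending on the encoding the application picks, which is why the stored bytes alone do not say how to turn them back into a unicode string. The variable names are illustrative only.

```python
# Python 3 sketch: the same text becomes different bytes under different encodings.
term = "panach\u00e9"

utf8_bytes = term.encode("utf-8")      # what a UTF-8 application would store
latin1_bytes = term.encode("latin-1")  # what a Latin-1 application would store

# The stored bytes carry no label, so the reader has to know which codec to apply.
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("latin-1")
```

Both decodes recover the original text, but only because each one uses the codec that produced the bytes.
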
> It appears that I can store unicode strings, like;
> 
> >>>  document.set_term( u'panach\u00e9' )
> 
> but then when I get them back out they're plain byte sequences (UTF-8
> encoded) rather than nice unicode strings,
> 
> >>>  [t.term for t in document.allterms()]
> ['panach\xc3\xa9']
> 
> I would have expected to get [u'panach\u00e9'] out instead.

I'm not sure what the right way of solving this is. Ideally we want a
way of saying what encoding is being used, and have Python do the
right thing. It would probably always come out as a Unicode string,
but the deserialisation would depend on the encoding used.
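
Until something like that exists in the bindings, an application can do the decoding itself. The following is a rough Python 3 sketch, assuming the terms arrive as byte strings; `decode_terms` and its `encoding` parameter are hypothetical names, not part of the Xapian API.

```python
def decode_terms(raw_terms, encoding="utf-8"):
    """Return text for every term, decoding byte strings with the given encoding.

    Hypothetical helper: values that are already text are passed through unchanged.
    """
    return [t.decode(encoding) if isinstance(t, bytes) else t for t in raw_terms]


# Bytes as they would come back from an index that was fed UTF-8.
raw = [b"panach\xc3\xa9"]
terms = decode_terms(raw)
```

The caller always gets unicode strings back, while the encoding stays a parameter rather than being hard-wired.
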

We might be okay having one encoding for everything, rather than
separate for terms and doc data... and values. Hmm. And I guess we
could stuff this into database metadata, which would make it
automatic. Some more thought may be required here first, though.
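
The metadata idea could look roughly like the Python 3 sketch below. It is only an illustration under stated assumptions: `InMemoryStore` is a hypothetical stand-in for a database, its `set_metadata`/`get_metadata` methods and the `"encoding"` key are invented for this example, and whether the real bindings expose an equivalent should be checked against the Xapian documentation.

```python
class InMemoryStore:
    """Hypothetical stand-in for a database: byte-string terms plus metadata."""

    def __init__(self):
        self._metadata = {}
        self._terms = []

    def set_metadata(self, key, value):
        self._metadata[key] = value

    def get_metadata(self, key):
        # An unset key yields empty bytes in this stand-in.
        return self._metadata.get(key, b"")

    def add_term(self, raw):
        self._terms.append(raw)

    def allterms(self):
        return list(self._terms)


ENCODING_KEY = "encoding"  # hypothetical metadata key


def stored_encoding(store):
    """Read the encoding recorded in metadata, falling back to UTF-8."""
    return store.get_metadata(ENCODING_KEY).decode("ascii") or "utf-8"


def add_text_term(store, text):
    """Encode text with the store's recorded encoding before adding it."""
    store.add_term(text.encode(stored_encoding(store)))


def text_terms(store):
    """Decode every stored term with the store's recorded encoding."""
    encoding = stored_encoding(store)
    return [t.decode(encoding) for t in store.allterms()]


store = InMemoryStore()
store.set_metadata(ENCODING_KEY, b"utf-8")
add_text_term(store, "panach\u00e9")
```

Because the encoding travels with the data, callers of `text_terms` never have to name a codec themselves.
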

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
