thr3ads.net - Xapian discuss - [Xapian-discuss] indexing and queryparsing: UTF-8 and PHP [Feb 2006]


tata 668

2006-Feb-25 16:55 UTC

[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Hi,

It's my first message in this mailing list; I hope I'm sending it to the
correct address. I'm also new to Xapian and my English is not perfect.

I am testing Xapian from PHP 4.4.1, using the bindings, and it works pretty
well. Thanks to everyone involved in this project!

My questions:

1) Am I correct when I say that Xapian doesn't provide an indexer function?
I mean, from what I understand, the only way to index a text in Xapian is to
split it, word by word, *by ourselves*, and then to insert those words, one
by one, in Xapian using Document::add_term(). There is no Xapian function
that takes a whole text, splits it into words itself, and indexes them,
right? I have to write my own indexer, my own string splitting function. Is
that correct? (And I don't think I want to look at Omega because I do not
index webpages; I'm using Xapian to index some custom text inside my
application, to provide a fast plain-text search functionality.)

2) My second question is related to the queryparser. I've heard that UTF-8
support is not yet available in release versions. I'm not a C or C++
programmer so I'd prefer not to mess with patches (
http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ). But anyway,
I don't need full support for my queries so I wrote my own UTF-8 aware
queryparser, even if it's not perfect (see question #3).

Here's my question: I don't understand how you can use your own parsing
method for indexing (see question #1) AND use the provided Xapian
queryparser (even if it supported UTF-8)! Am I missing something, or do both
sides (the indexing and the queryparsing) have to use the same splitting
algorithm for the results to be correct? If my indexing algorithm splits
"aaa?bbb" into one word only ("aaa?bbb") but the Xapian queryparser doesn't
consider "?" an alphanumeric character and therefore splits the string into
two words ("aaa" and "bbb"), my search results won't be correct, right? So I
don't see how it is possible to rely on a provided queryparser if there is
no indexing function also provided that uses the exact same splitting
algorithm.

3) If someone has experience with splitting UTF-8 strings into words using
PHP 4, I would be really happy to hear about it. I thought
mb_split("\W", $text); would do the job, but it seems that it considers some
characters as alphanumeric (e.g. "?") where, I think, it shouldn't. Any help?


Thanks,

Jules Landry
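
As an illustration for question 3 above, here is a minimal sketch of
Unicode-aware word splitting, written in Python rather than PHP 4. The helper
names describe_chars and split_words and the sample text are assumptions made
for this example. The idea is to list each non-ASCII character with its
Unicode category, so you can see which ones a \w-based splitter keeps inside
words.

```python
import re
import unicodedata


def describe_chars(text):
    """Return (char, Unicode category, is_word_char) for each non-ASCII character."""
    report = []
    for ch in text:
        if ord(ch) > 127:
            # In Python 3, \w is Unicode-aware when matching str objects.
            is_word = re.fullmatch(r"\w", ch) is not None
            report.append((ch, unicodedata.category(ch), is_word))
    return report


def split_words(text):
    """Split text into lowercase words using the Unicode-aware \\w class."""
    return [w.lower() for w in re.findall(r"\w+", text)]


# Hypothetical sample text: two accented letters and two quotation marks.
sample = "höchst «élan»"
for ch, category, is_word in describe_chars(sample):
    print(repr(ch), category, is_word)
print(split_words(sample))
```

Characters in letter categories (such as Ll) stay inside words, while
punctuation categories (such as Pi and Pf) act as separators.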

Jim Lynch

2006-Feb-25 18:31 UTC


I'm using a combination of scriptindex and omega to index German
language texts and the words do not split on accented characters.  E.g.
höchstpersönlichen remains höchstpersönlichen and a search for it finds
it fine.  What does happen is that Xapian transliterates the
accented characters into digraphs, but since these are unique it doesn't
make any difference unless you want to use the term list that is
returned for something.

Olly posted a patch recently to eliminate that behavior.  While omega is
a CGI program, that does not mean you cannot use it to search a database
and return results to a program.  In fact, that's the way I'm using it
myself.  I use html2text to produce plain text, read the text in, and
format it in a way that scriptindex likes.  I then have my search
program call omega to return an XML file to me with the results.  I am
using it in CGI mode, just because that is convenient, but I could have
called it via an exec call just as easily.

Hope that helps.

Jim.

Peter Karman

2006-Feb-25 19:19 UTC


These are good questions.

tata 668 scribbled on 2/25/06 10:54 AM:
> 1) Am I correct when I say that Xapian doesn't provide an indexer 
> function?
The Omega project provides a couple different indexers. That's a separate 
project from the Xapian library, but they're available together, as are the 
bindings for using other languages (like PHP).

Your question about how "words" are defined is one reason I prefer Swish-e
(http://swish-e.org) for smaller projects. Swish-e lets you define which
characters constitute a "word" and the indexer splits text strings accordingly.
Also, the indexer is "smart" about word context in HTML and XML and lets you
bias some words more than others (like titles or headings, for example).

Since this is the Xapian list and not the Swish-e list, I will say that Xapian 
offers some key features Swish-e does not, which is why I am on this list. :) I 
am currently working on the next version of Swish-e, which will offer the Xapian
library as a backend, thus combining the best of both worlds: the ease and power
of Swish-e's indexer with the scalability and ranking features of Xapian.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Olly Betts

2006-Feb-26 00:58 UTC


On Sat, Feb 25, 2006 at 11:54:51AM -0500, tata 668 wrote:
> 1) Am I correct when I say that Xapian doesn't provide an indexer function?
> I mean, from what I understand, the only way to index a text in Xapian is
> to split it, word by word, *by ourselves*, and then to insert, one by one,
> those words in Xapian using Document::add_term(). There is no Xapian
> function that takes a whole text, splits it into words itself, and
> indexes them, right?

Not currently, but it's on my list.  As you suggest, it's a bit odd that
there's a "parser for queries" component, but no matching "parser for
indexing document text" component.

> (And I don't think I want to look at
> Omega because I do not index webpages, I'm using Xapian to index some
> custom text inside my application, to provide a fast plain-text search
> functionality.)

Omega's omindex indexer assumes you're indexing webpages (or documents
in a web server tree).  However Omega's scriptindex indexer is a good
fit for what you want to do - it takes a "dump file" (or several) which
is essentially groups of NAME=VALUE pairs, and another file which
describes what to do for each NAME.  One possible action is to split
VALUE into terms.  Currently iso-8859-1 input is assumed by the word
splitting though.
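
A dump file of NAME=VALUE groups like the one described above could be
produced from an application with a few lines of code. The following is a
rough Python sketch, not part of Omega: the field names (id, title, text) and
the helper names are hypothetical, and the convention that a newline inside a
value is continued by starting the next line with "=" is an assumption to
check against the scriptindex documentation for your version.

```python
def format_record(fields):
    """Format one record as NAME=VALUE lines in the scriptindex dump style.

    Assumption: a newline inside a value is continued by starting the
    next line with '='.
    """
    lines = []
    for name, value in fields:
        first, *rest = str(value).split("\n")
        lines.append(f"{name}={first}")
        lines.extend(f"={cont}" for cont in rest)
    return "\n".join(lines)


def format_dump(records):
    """Join records with a blank line between them, ending with a newline."""
    return "\n\n".join(format_record(r) for r in records) + "\n"


# Hypothetical application data; the field names are illustrative only.
records = [
    [("id", 1), ("title", "First note"), ("text", "Plain text to index.\nSecond line.")],
    [("id", 2), ("title", "Second note"), ("text", "More text.")],
]
print(format_dump(records), end="")
```

Since the word splitting currently assumes iso-8859-1 input, the resulting
string would need to be written to disk in that encoding. The companion file
that says what to do for each NAME is not shown here.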

> 2) My second question is related to the queryparser. I've heard that UTF-8
> support is not yet available in release versions. I'm not a C or C++
> programmer so I'd prefer not to mess with patches (
> http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ).

Patching is really easy:

patch -p0 < PATCHFILE

And then configure and build as usual.

> Here's my question: I don't understand how you can use your own parsing
> method for indexing (see question #1) AND use the provided Xapian
> queryparser (even if it would support UTF-8)! Am I missing something, or do
> both sides (the indexing and the queryparsing) have to use the same
> splitting algorithm if you want the results to be correct?

Both indexing and query parsing do indeed need to have compatible
strategies for identifying terms.  

And currently to use the utf-8 QueryParser you have to implement a
compatible tokeniser for indexing.
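
The need for compatible strategies on both sides can be sketched with a toy
in-memory index. This is a Python illustration only and does not use Xapian's
API: all function names are hypothetical, and "é" stands in as an assumed
example of a non-ASCII letter.

```python
import re
from collections import defaultdict


def tokenize_unicode(text):
    """Treat any Unicode letter or digit as part of a word."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def tokenize_ascii(text):
    """Treat only ASCII letters and digits as part of a word."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, tokenize):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query, tokenize):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)


docs = {1: "aaaébbb", 2: "aaa bbb"}
index = build_index(docs, tokenize_unicode)

# Same tokenizer on both sides: finds document 1.
print(search(index, "aaaébbb", tokenize_unicode))  # {1}
# Mismatched tokenizer at query time: finds document 2 instead.
print(search(index, "aaaébbb", tokenize_ascii))    # {2}
```

With the mismatched tokenizer the query is split into "aaa" and "bbb", so it
misses the document it was copied from and matches a different one.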

Perhaps I should explain that there's an interrelated collection of
things I'm planning for what will probably be numbered 1.0.  I've
mentioned that I'm going to work on most of these before, but not
actually put all the pieces together like this before.

Most of these will result in databases built by pre-1.0 not being
reliably searchable by post-1.0, and vice versa, which is why I want to
do them all together at a major version change.  You've touched on a
number of them:

* update to the latest snowball stemmers (which support utf-8).

* clean up and apply the utf-8 patch for QueryParser.

* allow more control over what QueryParser treats as a word character
  (and tweak the defaults to avoid generating phrase searches in cases
  where we don't need to - for example: 2.4.1 is currently a 3 term
  phrase query, and a slow case).

* remove the "accent normalisation" code - for any languages where it
  is desirable, it would be better done by incorporating it into the
  stemming algorithm if it isn't done there already.  The reason why
  it's done separately is historical (Xapian's proprietary precursor
  expected accents to be represented in its own special way, and it was
  easier to normalise them than to translate them!)

* fix the routines from indextext.cc used by omindex/scriptindex to
  handle utf-8 text (hmm, I've already done this for the gmane indexer!)

* add word character configurability to the indextext.cc routines to
  match that of the QueryParser and make available in the core library.

* fix the $highlight command in Omega to handle utf-8 and the
  configurable definitions of what a word is.

* fix omindex to use utf-8 (and convert input from documents in other
  character sets).
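
To illustrate the word-character configurability mentioned in the list above,
here is a small Python sketch. It does not reflect how QueryParser or the
indextext.cc routines work internally; make_tokenizer and its
extra_word_chars parameter are hypothetical names. It shows why "2.4.1"
becomes three terms under a plain split, and one term when "." is allowed
between word characters.

```python
import re


def make_tokenizer(extra_word_chars=""):
    """Build a tokenizer; extra_word_chars are kept inside a term only
    when they sit between two word characters (e.g. the dots in "2.4.1")."""
    if extra_word_chars:
        joiner = "[" + re.escape(extra_word_chars) + "]"
        pattern = re.compile(rf"\w+(?:{joiner}\w+)*")
    else:
        pattern = re.compile(r"\w+")
    return lambda text: [t.lower() for t in pattern.findall(text)]


default_split = make_tokenizer()
dotted_split = make_tokenizer(".")

print(default_split("Xapian 2.4.1 released."))  # ['xapian', '2', '4', '1', 'released']
print(dotted_split("Xapian 2.4.1 released."))   # ['xapian', '2.4.1', 'released']
```

Whichever definition is chosen, the same tokenizer would have to be used for
both indexing and query parsing.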

Before you ask, I don't have a date for 1.0 yet.  I suspect we'll want
at least one more 0.9.X first, to collect up any bug fixes, especially
since upgrading to 1.0 will be a bigger deal than usual, because it will
require a reindex for many users.

Cheers,
    Olly
