Hi,

This is my first message to this mailing list; I hope I'm sending it to the correct address. I'm also new to Xapian and my English is not perfect.

I'm testing Xapian from PHP 4.4.1, using the bindings, and it works pretty well. Thanks to everyone involved in this project!

My questions:

1) Am I correct when I say that Xapian doesn't provide an indexer function? I mean, from what I understand, the only way to index a text in Xapian is to split it, word by word, *by ourselves*, and then to insert those words, one by one, into Xapian using Document::add_term(). There is no Xapian function that would take a whole text, split it into words by itself and index them, right? I have to write my own indexer, my own string splitting function. Is that correct? (And I don't think I want to look at Omega because I'm not indexing webpages; I'm using Xapian to index some custom text inside my application, to provide a fast plain-text search functionality.)

2) My second question is related to the queryparser. I've heard that UTF-8 support is not yet available in release versions. I'm not a C or C++ programmer so I'd prefer not to mess with patches ( http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ). But anyway, I don't need full support for my queries, so I wrote my own UTF-8 aware queryparser, even if it's not perfect (see question #3).

Here's my question: I don't understand how you can use your own parsing method for indexing (see question #1) AND use the provided Xapian queryparser (even if it supported UTF-8)! Am I missing something, or do both sides (the indexing and the query parsing) have to use the same splitting algorithm if you want the results to be correct? If my indexing algorithm splits "aaa?bbb" into one word only ("aaa?bbb") but the Xapian queryparser doesn't consider "?" an alphanumeric character and therefore splits the string into two words ("aaa" and "bbb"), my search results won't be correct, right? So I don't see how it is possible to rely on a provided queryparser if there is no indexing function also provided that uses the exact same splitting algorithm.

3) If someone has experience with splitting UTF-8 strings into words using PHP 4, I would be really happy to hear about it. I thought mb_split("\W", $text); would do the job, but it seems to consider some characters alphanumeric (e.g. "?") where, I think, it shouldn't. Any help?

Thanks,

Jules Landry
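The concern in question 2 can be illustrated with a small sketch. It is written in Python rather than PHP 4 and does not use the Xapian bindings. The names `tokenize`, `naive_index_terms` and `matches` are hypothetical, the rule "a word is a run of `\w` characters" is an assumption rather than what Xapian's QueryParser does, and "aaa-bbb" stands in for the "aaa?bbb" example above.

```python
import re

# Assumption: a "word" is a run of Unicode word characters (\w).
# This is an illustrative definition, not the rule Xapian's QueryParser uses.
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Shared tokeniser: lowercased runs of word characters."""
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]


def naive_index_terms(text):
    """A different indexer-side rule: split on whitespace only."""
    return set(text.lower().split())


def matches(query, indexed_terms):
    """True if every term produced by the query-side tokeniser was indexed."""
    return all(term in indexed_terms for term in tokenize(query))


doc = "see aaa-bbb for details"

# Mismatched rules: the index holds "aaa-bbb", the query asks for "aaa" and "bbb".
print(matches("aaa-bbb", naive_index_terms(doc)))   # prints False

# Same rule on both sides: the index holds "aaa" and "bbb", so the query matches.
print(matches("aaa-bbb", set(tokenize(doc))))       # prints True
```

Whatever definition of a word is chosen, the point is only that the indexing side and the query side apply the same one.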
I'm using a combination of scriptindex and omega to index German-language texts, and the words are not split on accented characters. E.g. höchstpersönlichen remains höchstpersönlichen, and a search for it finds it fine. What does happen is that Xapian transliterates the accented characters into digraphs, but since these are unique it doesn't make any difference unless you want to use the term list that is returned for something. Olly posted a patch recently to eliminate that behaviour.

While omega is a CGI program, that does not mean you cannot use it to search a database and return results to a program. In fact, that's the way I'm using it myself. I use html2text to produce plain text, then read the text in and format it the way scriptindex likes it. I then have my search program call omega to return an XML file with the results. I am using it in CGI mode, just because that is convenient, but I could just as easily have called it via an exec call.

Hope that helps.

Jim.
Peter Karman
2006-Feb-25 19:19 UTC
[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP
These are good questions.

tata 668 scribbled on 2/25/06 10:54 AM:
> 1) Am I correct when I say that Xapian doesn't provide an indexer
> function?

The Omega project provides a couple of different indexers. That's a separate project from the Xapian library, but they're available together, as are the bindings for using other languages (like PHP).

Your questions about how "words" are defined are one reason I prefer Swish-e (http://swish-e.org) for smaller projects. Swish-e lets you define which characters constitute a "word", and the indexer splits text strings accordingly. Also, the indexer is "smart" about word context in HTML and XML and lets you bias some words more than others (like titles or headings, for example).

Since this is the Xapian list and not the Swish-e list, I will say that Xapian offers some key features Swish-e does not, which is why I am on this list. :)

I am currently working on the next version of Swish-e, which will offer the Xapian library as a backend, thus combining the best of both worlds: the ease and power of Swish-e's indexer with the scalability and ranking features of Xapian.

-- 
Peter Karman . http://peknet.com/ . peter@peknet.com
Olly Betts
2006-Feb-26 00:58 UTC
[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP
On Sat, Feb 25, 2006 at 11:54:51AM -0500, tata 668 wrote:
> 1) Am I correct when I say that Xapian doesn't provide an indexer function?
> I mean, from what I understand, the only way to index a text in Xapian is
> to split it, word by word, *by ourselves*, and then to insert those words,
> one by one, into Xapian using Document::add_term(). There is no Xapian
> function that would take a whole text, split it into words by itself and
> index them, right?

Not currently, but it's on my list. As you suggest, it's a bit odd that there's a "parser for queries" component, but no matching "parser for indexing document text" component.

> (And I don't think I want to look at
> Omega because I'm not indexing webpages; I'm using Xapian to index some
> custom text inside my application, to provide a fast plain-text search
> functionality.)

Omega's omindex indexer assumes you're indexing webpages (or documents in a web server tree). However, Omega's scriptindex indexer is a good fit for what you want to do: it takes a "dump file" (or several), which is essentially groups of NAME=VALUE pairs, and another file which describes what to do for each NAME. One possible action is to split VALUE into terms. Currently, though, the word splitting assumes iso-8859-1 input.

> 2) My second question is related to the queryparser. I've heard that UTF-8
> support is not yet available in release versions. I'm not a C or C++
> programmer so I'd prefer not to mess with patches (
> http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ).

Patching is really easy:

    patch -p0 < PATCHFILE

And then configure and build as usual.

> Here's my question: I don't understand how you can use your own parsing
> method for indexing (see question #1) AND use the provided Xapian
> queryparser (even if it supported UTF-8)! Am I missing something, or
> do both sides (the indexing and the query parsing) have to use the same
> splitting algorithm if you want the results to be correct?

Both indexing and query parsing do indeed need to have compatible strategies for identifying terms. And currently, to use the utf-8 QueryParser you have to implement a compatible tokeniser for indexing.

Perhaps I should explain that there's an interrelated collection of things I'm planning for what will probably be numbered 1.0. I've mentioned before that I'm going to work on most of these, but I haven't actually put all the pieces together like this. Most of these will result in databases built by pre-1.0 not being reliably searchable by post-1.0, and vice versa, which is why I want to do them all together at a major version change. You've touched on a number of them:

* update to the latest snowball stemmers (which support utf-8).

* clean up and apply the utf-8 patch for QueryParser.

* allow more control over what QueryParser treats as a word character (and tweak the defaults to avoid generating phrase searches in cases where we don't need to - for example: 2.4.1 is currently a 3 term phrase query, and a slow case).

* remove the "accent normalisation" code - for any languages where it is desirable, it would be better done by incorporating it into the stemming algorithm if it isn't done there already. The reason why it's done separately is historical (Xapian's proprietary precursor expected accents to be represented in its own special way, and it was easier to normalise them than to translate them!)

* fix the routines from indextext.cc used by omindex/scriptindex to handle utf-8 text (hmm, I've already done this for the gmane indexer!)

* add word character configurability to the indextext.cc routines to match that of the QueryParser, and make them available in the core library.

* fix the $highlight command in Omega to handle utf-8 and the configurable definitions of what a word is.

* fix omindex to use utf-8 (and convert input from documents in other character sets).

Before you ask, I don't have a date for 1.0 yet. I suspect we'll want at least one more 0.9.X first, to collect up any bug fixes, especially since upgrading to 1.0 will be a bigger deal than usual, because it will require a reindex for many users.

Cheers,
    Olly
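To make the "word character configurability" item in the list above more concrete, here is a rough sketch in Python. It is not Xapian code and does not reflect how QueryParser or indextext.cc are implemented; `make_tokenizer` and its `extra_word_chars` parameter are hypothetical names invented for this illustration. It shows how allowing "." inside a word turns 2.4.1 from three terms into one.

```python
import re


def make_tokenizer(extra_word_chars=""):
    """Build a tokeniser (hypothetical helper, not a Xapian API).

    Characters in extra_word_chars count as part of a word only when
    they sit between two word characters, so a trailing full stop at
    the end of a sentence is still dropped.
    """
    if extra_word_chars:
        inner = "[" + re.escape(extra_word_chars) + "]"
        pattern = r"\w+(?:" + inner + r"\w+)*"
    else:
        pattern = r"\w+"
    word_re = re.compile(pattern)

    def tokenize(text):
        return [m.group(0).lower() for m in word_re.finditer(text)]

    return tokenize


default_tokenize = make_tokenizer()
dotted_tokenize = make_tokenizer(".")

# Default rule: the version number becomes three separate terms.
print(default_tokenize("Xapian 2.4.1"))   # prints ['xapian', '2', '4', '1']

# With "." allowed inside words, it stays a single term.
print(dotted_tokenize("Xapian 2.4.1"))    # prints ['xapian', '2.4.1']
```

If the same configured tokeniser is used both when indexing and when parsing queries, a search for 2.4.1 becomes a single-term lookup instead of a phrase search over three terms.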