Floris Bos
2005-Sep-05 18:35 UTC
[Xapian-discuss] Input files and special chars and spaces
Hi,

Thanks to the people on this mailing list I'm now successfully using Xapian/Omega, but I ran into a problem today. I'm using scriptindex input files to put data into the Xapian db. This works great as long as I don't use any special characters. As soon as I try to add a document that includes text with special characters such as ' / or \ only the part before the special char is added to the db.

I read somewhere (can't recall where) that Omega replaces these chars with spaces when using regular indexing. Do I also need to do this replacing when using input files? I know that this doesn't influence searching, but I'd like to be able to use at least the ' char in the sample field, because this char occurs often in Dutch.

Is it possible to create a sample text in the Xapian db that includes special chars?
Olly Betts
2005-Sep-07 00:01 UTC
[Xapian-discuss] Input files and special chars and spaces
On Mon, Sep 05, 2005 at 07:35:13PM +0200, Floris Bos wrote:
> I'm using scriptindex input files to put data in the Xapian db. This works
> great as long as I don't use any special characters. As soon as I try to
> add a document that includes a text with special characters like for
> example: ' / or \ only the part before the special char is added to the db.

A word such as "doesn't" is currently indexed by scriptindex as "doesn" and "t". Assuming you index with positional information, when searching, "doesn't" is treated as a phrase search so will match as you want. This approach also allows "Olly's" to be matched by a search for "olly".

Similarly, "/etc/passwd" is indexed as "etc" and "passwd", and searching for it generates a phrase search. But you can also search for "etc" and "passwd" separately and the document will match.

The downside of this is that some of the phrase searches we generate in this way can be rather slow with a big database, so this is an area which is likely to be revisited:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=22

Also the latest snowball stemmers can make use of apostrophes (at least in English).

> I read somewhere (can't recall where) that Omega replaces these chars
> with spaces when using regular indexing.

I'm not sure "replacing with spaces" is the best way to think about it. Such characters are simply treated as word breaks, like spaces are.

> Do I also need to do this replacing when using input files?

No.

> I know that this doesn't influence searching but I'd like to have
> the possibility to use at least the ' char in the sample field because
> this is a char that often occurs in dutch language.

But you can search for text containing "'" (unless you're not storing positional information).

> Is it possible to create a sample text in the Xapian db that includes
> special chars?

Are you asking about the sample text stored in the document data and used in the search results? That can contain any character (even zero bytes).

Cheers,
Olly
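
The word-break and phrase-search behaviour described above can be illustrated with a small toy model. This is only a sketch in Python, not scriptindex's or Xapian's actual code: the names tokenize, build_index and phrase_search are hypothetical, and it assumes every non-alphanumeric character is a word break and that term positions are stored.

```python
import re


def tokenize(text):
    """Return (position, term) pairs; any non-alphanumeric char is a word break."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return list(enumerate(terms, start=1))


def build_index(docs):
    """Map term -> {doc_id: [positions]} (a toy positional index)."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in tokenize(text):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, query):
    """Return ids of documents where the query terms occur at adjacent positions."""
    terms = [term for _, term in tokenize(query)]
    if not terms:
        return set()
    hits = set()
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            # Every later term must appear exactly i positions after the first.
            if all(start + i in index.get(term, {}).get(doc_id, [])
                   for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "It doesn't work",
    2: "Olly's file is /etc/passwd",
    3: "t doesn",
}
index = build_index(docs)
print(phrase_search(index, "doesn't"))
print(phrase_search(index, "olly"))
print(phrase_search(index, "/etc/passwd"))
```

In this model a query for "doesn't" becomes the phrase "doesn" followed by "t", so document 3 (which has the same terms in the other order) does not match, while "olly" alone still finds document 2. Without stored positions the adjacency check would not be possible, which is the caveat about positional information.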
Floris Bos
2005-Sep-09 18:55 UTC
[Xapian-discuss] Input files and special chars and spaces
Thank you for the explanation of the handling of special chars. As for the sample text: I was blaming Omega for the problems I had with my XML parser.

However, one problem keeps occurring: when I enter a text that contains < or >, the text is chopped right before the first occurrence of one of these chars. I figured out that it's $htmlstrip that does this, am I right here? It would not be a very big problem if this is the case, but I want to be sure that this problem doesn't have another cause.

So is it standard that $htmlstrip chops off the text before < or >? Is there a way to work round this and still remove the < and > chars?
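
One possible workaround, sketched here in Python, is to remove the < and > characters before the text is written to the scriptindex input file, so that nothing downstream can mistake them for the start of a tag. This is an assumption about the setup rather than a confirmed explanation of what $htmlstrip does, and drop_angle_brackets is a hypothetical helper name.

```python
def drop_angle_brackets(text):
    """Replace '<' and '>' with spaces, then collapse runs of whitespace."""
    cleaned = text.replace("<", " ").replace(">", " ")
    return " ".join(cleaned.split())


sample = "if x<y then a > b"
print(drop_angle_brackets(sample))
```

Replacing the characters with a space instead of deleting them keeps the neighbouring words separate, which matches the way such characters act as word breaks during indexing.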