Hi all,

I have recently indexed a collection of XML documents totalling roughly 500MB. However, when I search the database using the simplesearch program (in Java), it always returns 0 matches, even for words that I know to be in the documents. I have tried the simpleindex and simplesearch examples and they worked fine.

Looking through the documentation I found the delve utility and had a poke inside the database. These were the results:

    delve -v -r 1 /home/kwok/Desktop/latimes
    Term List for record #1: Rla072290 1 217 la072290 1 217

Changing the record number to #100, 200, 500, 1000, 2000 and 5000, the only difference I got was a change in the laxxxxxx, where laxxxxxx are the filenames of small individual compressed files.

I am actually unsure what kind of results I should be getting from delve, but I was expecting all the different words/terms to be indexed rather than the filenames. When I search for the terms "la072290" and "la052089", I get 217 and 156 results respectively.

I have also tried a search program written by my supervisor, but the end result is the same: Xapian is not retrieving the documents.

So I guess my question is: does it look as though I am not getting relevant search results because my indexer has not been indexing correctly (judging from the delve results), or is Xapian just not retrieving the documents for some other reason?

Many thanks,
Kwok

On Tue, Sep 04, 2007 at 10:24:50AM +0100, Kwok-yau Kwong wrote:
> I am actually unsure what kind of results I should be getting from using
> delve, but I was expecting all the different words/terms to be indexed
> rather than the filenames.

Yes, you should get the terms from the document.

I've no idea what your indexer looks like, but at a guess, are you doing this?

    termgenerator.index_text(filename);

You need to pass a string containing the text to index, not the filename of a file containing the text!

If the input is XML, you'll want to parse it first, as otherwise all the tags will be indexed as terms. Also, you often don't want to index the contents of all tags.

Cheers,
Olly
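
To illustrate the parsing step suggested above, here is a minimal sketch in Java using only the standard javax.xml DOM parser. It pulls the text out of selected elements and drops the markup, so the resulting string (rather than the filename or the raw XML) is what would be handed to the indexer. The class name XmlTextExtractor, the method extractText, the element names DOC, DOCNO, HEADLINE and TEXT, and the sample document are all made up for illustration; the real collection may be structured differently. No Xapian calls appear in the code, since the exact indexer in use is not known.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Xapian or of the original indexer.
class XmlTextExtractor {

    /**
     * Returns the text found inside the named elements, joined by single
     * spaces. Tags themselves and all other elements are left out.
     */
    static String extractText(String xml, List<String> wantedTags) {
        Document doc;
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }

        StringBuilder text = new StringBuilder();
        for (String tag : wantedTags) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                String content = nodes.item(i).getTextContent().trim();
                if (!content.isEmpty()) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(content);
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up sample document; element names are assumptions.
        String xml = "<DOC><DOCNO>EXAMPLE-0001</DOCNO>"
                   + "<HEADLINE>Harbor plan</HEADLINE>"
                   + "<TEXT>The council voted.</TEXT></DOC>";

        // Only HEADLINE and TEXT are selected, so DOCNO is not included.
        String text = extractText(xml, Arrays.asList("HEADLINE", "TEXT"));

        // This string, not the filename, is what the indexer should receive.
        System.out.println(text);  // prints "Harbor plan The council voted."
    }
}
```

Restricting extraction to a list of element names is one way to avoid indexing the contents of every tag. Which elements to include depends on the collection.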

On Tue, Sep 04, 2007 at 10:24:50AM +0100, Kwok-yau Kwong wrote:
> So I guess my question is, does it look as though I am not getting relevant
> search results because my indexer has not been indexing correctly (from the
> delve results) or is Xapian just not retrieving the documents for any
> reasons?

Looking at what you saw from delve(1), I'd say that your indexer isn't doing what you think it's doing. Without knowing what your indexer is, or how it works, I can't say much more, but from what you said it sounds like it's indexing the names of the files rather than the contents of them, which is a little odd.

I assume you're pulling the text nodes of the XML documents out in your indexer somehow?

J

-- 
James Aylett
xapian.org    james@tartarus.org    uncertaintydivision.org