Hello all,

I am using Xapian to index two XML files. Each file contains 6000+ pieces of news, and each piece of news is delimited by <DOC> </DOC>. The way I build the index is:

1) Read the XML file line by line and get one piece of news's head, date, and contents, which are separated by tags.
2) Remove numbers, change to lower case, and remove stop words; the result is saved in $buf.
3) Create a new Xapian::Document $doc, and use the TermGenerator to set_document($doc) and index_text($buf).
4) Add the $doc to the database $db.

For the next piece of news, repeat the above 1 to 3 steps. The average length of each piece of news is about 200 terms. Indexing is very fast, about one to two minutes.

My question is about the searching speed. I need to find the bigrams of the indexed documents, i.e., for any two terms, find their common posting list and their position lists in the same document. I found the speed is rather low, about 1562 bigrams/hour.

Is this an efficient way to build the index? If I do steps 1 and 2 above and save the results into one separate file, can I speed up searching? Can I index a file directly instead of using TermGenerator?

A previous post, http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html, mentioned tuning XAPIAN_FLUSH_THRESHOLD. How do I do this to speed up searching?

Thank you,
-Ying
Hello again,

I am working on a pretty fast computer, a Dell Optiplex 960. The memory is:

             total       used       free     shared    buffers     cached
Mem:       3094868    2943068     151800          0     329468    1590012
-/+ buffers/cache:    1023588    2071280
Swap:      9060620      76792    8983828

The cpu is:

00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev 03)

The two files, which contain 12000+ pieces of news in total, are about 17MB altogether. My colleague is doing the same test with Lemur, and her bigram search is about 10 times faster than Xapian on an identical machine. (Building the index is very fast with both.) I think there must be something I can improve in the way I build the index. How do you usually build the index? What is the more efficient way?

Thank you,
Ying
Olly Betts
2009-Nov-04 22:35 UTC
[Xapian-discuss] bigrams search speed and index documents
On Tue, Nov 03, 2009 at 07:38:08PM -0600, Ying Liu wrote:
> I am using Xapian to index two XML files. In each file, there are about
> 6000+ pieces of news. Each piece of news is separated by <DOC> </DOC>.
> The way I build the index is:
>
> 1) read the XML file line by line, get one piece of news's head, date,
> and contents which are separated by tags
> 2) remove numbers, change to lower case, remove stop words , and the
> information is saved in $buf
> 3) new a Xapian::Document $doc, and use the TermGenerator to
> set_document($doc) and index_text($buf).
> 4) add the $doc to the database $db

Please post actual code rather than trying to describe it in English.

> For the next piece of news, repeat the above 1 to 3 steps.

So you only actually add the first document to the database? If you'd posted the actual code you were using, I wouldn't have to guess...

> The average
> length of each news is about 200 terms. The index is very fast, about
> one to two minutes. My question is about the searching speed. I need to
> find the bigrams of indexed documents, i.e., find any two term's common
> postinglist and their positionlist in the same document. I found the
> speed is kind of low, about 1562 bigrams/hour.

I don't know how you're doing this without seeing the code.

> My question is, is it an efficient way to build the index? If I do the
> above step 1 and 2, and save the results into one separate file, can I
> speed up the searching speed?

I don't see how that would make any difference to search speed - the database will contain the same terms.

> Can I index a file directly instead of TermGenerator?

You can just call Document::add_term() and/or Document::add_posting() directly instead of generating a string to feed to TermGenerator. That would be an easier and more efficient approach I think.

> In a previous post,
> http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html,
> it mentioned tuning XAPIAN_FLUSH_THRESHOLD. How to do this to speed up
> the searching speed?

XAPIAN_FLUSH_THRESHOLD only affects indexing. It can slightly change where posting lists chunk boundaries are, and the internal layout of blocks in the Btree, which may indirectly affect search speed, but there's no direct effect on searching.

Cheers,
Olly
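
Since no code was posted in this thread, the following is only a rough sketch of the kind of lookup being discussed, written as a plain Python model rather than against the Xapian API. It records one (term, position) posting per occurrence, which is the role Document::add_posting() plays in Xapian, and then finds a bigram by intersecting the two terms' posting lists and checking for adjacent positions. The names STOP_WORDS, tokenize, build_index and bigram_hits, the stop-word list, and the sample documents are all assumptions made for illustration. Note that positions are assigned after stop words are removed, as in the steps described in the original question, so two terms separated only by a stop word count as adjacent.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for illustration; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}


def tokenize(text):
    """Lowercase the text, drop numbers and stop words, return the terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map term -> {doc_id: [positions]}, one posting per term occurrence."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def bigram_hits(index, first, second):
    """Return {doc_id: count} where `second` directly follows `first`."""
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    hits = {}
    # Only documents containing both terms can contain the bigram.
    for doc_id in postings_a.keys() & postings_b.keys():
        next_positions = set(postings_b[doc_id])
        count = sum(1 for p in postings_a[doc_id] if p + 1 in next_positions)
        if count:
            hits[doc_id] = count
    return hits


# Made-up sample documents, keyed by document id.
docs = {
    1: "Stock markets fell in 2009 as oil prices rose",
    2: "Oil prices and stock markets moved together",
    3: "Prices of oil rose",
}
index = build_index(docs)
oil_prices = bigram_hits(index, "oil", "prices")
```

In this model each bigram lookup touches only the documents that contain both terms, and the position check is a set lookup per occurrence of the first term. Whether the slow search reported above comes from a different access pattern cannot be determined without the actual code.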