thr3ads.net - Xapian discuss - [Xapian-discuss] bigrams search speed and index documents [Nov 2009]


Ying Liu

2009-Nov-04 01:38 UTC

[Xapian-discuss] bigrams search speed and index documents

Hello all,

I am using Xapian to index two XML files. Each file contains more than
6000 news items, and each item is delimited by <DOC> and </DOC> tags.
I build the index as follows:

1) read the XML file line by line and extract one news item's head, date,
and contents, which are separated by tags
2) remove numbers, convert to lower case, remove stop words, and save the
result in $buf
3) create a new Xapian::Document $doc, then use the TermGenerator to call
set_document($doc) and index_text($buf)
4) add $doc to the database $db
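
Steps 1 and 2 above might look roughly like the following. This is only a minimal sketch in plain Python, not the poster's actual code (which was never shown): the stop-word list, the HEAD and TEXT tag names in the sample, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical stop-word list; the post does not say which list was used.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z]+")  # letters only, so numbers are dropped


def split_docs(xml_text):
    """Return the raw body of each <DOC>...</DOC> block."""
    return DOC_RE.findall(xml_text)


def normalise(raw):
    """Strip tags, lower-case, drop numbers and stop words; return tokens."""
    text = TAG_RE.sub(" ", raw).lower()
    return [w for w in WORD_RE.findall(text) if w not in STOP_WORDS]


# The HEAD and TEXT tag names are assumed for this sample only.
sample = ("<DOC><HEAD>Markets Rise</HEAD>"
          "<TEXT>The index rose 3 points in 2009.</TEXT></DOC>")
tokens = [normalise(d) for d in split_docs(sample)]
```

Each inner list then plays the role of $buf for one news item.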

For the next news item, repeat steps 1 to 3 above. The average length
of a news item is about 200 terms. Indexing is very fast, taking about
one to two minutes. My question is about search speed. I need to find
the bigrams in the indexed documents, i.e., for any two terms, find the
documents their posting lists have in common and the terms' position
lists within those documents. I found this to be rather slow, about
1562 bigrams/hour.
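
To make the bigram lookup concrete, here is a small pure-Python stand-in for posting lists and position lists. It is a sketch of the general technique only: it does not use Xapian, the function names are invented for illustration, and it assumes "bigram" means one term immediately followed by the other.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs, start=1):
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def bigram_count(index, first, second):
    """Count occurrences of `first` immediately followed by `second`."""
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    total = 0
    # Only documents present in both posting lists can contain the bigram.
    for doc_id in postings_a.keys() & postings_b.keys():
        following = set(postings_b[doc_id])
        total += sum(1 for p in postings_a[doc_id] if p + 1 in following)
    return total


docs = [["stock", "market", "rose"], ["market", "stock", "market"]]
index = build_positional_index(docs)
```

Here `bigram_count(index, "stock", "market")` intersects the two posting lists first and only then compares positions, so documents containing just one of the terms are never examined.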

My question is: is this an efficient way to build the index? If I do
steps 1 and 2 above and save the results to a separate file, can I
speed up searching? Can I index a file directly instead of using
TermGenerator? A previous post,
http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html,
mentioned tuning XAPIAN_FLUSH_THRESHOLD. How do I do this to speed up
searching?

Thank you,
-Ying

Ying Liu

2009-Nov-04 17:03 UTC


[Xapian-discuss] bigrams search speed and index documents

Hello again,

I am working on a pretty fast computer, a Dell Optiplex 960. The memory is:
                     total       used       free     shared    buffers     cached
Mem:               3094868    2943068     151800          0     329468    1590012
-/+ buffers/cache:            1023588    2071280
Swap:              9060620      76792    8983828

The CPU is:
00:00.0 Host bridge: Intel Corporation 4 Series Chipset DRAM Controller (rev 03)

The two files, which together contain more than 12000 news items, total
about 17MB.

My colleague is running the same test with Lemur, and her bigram search
is about 10 times faster than mine with Xapian, and our machines are the
same. (Index building is very fast with both.) I think there must be
something I can improve in the way I build the index. How do you usually
build the index? What is the most efficient way?

Thank you,
Ying



Olly Betts

2009-Nov-04 22:35 UTC


[Xapian-discuss] bigrams search speed and index documents

On Tue, Nov 03, 2009 at 07:38:08PM -0600, Ying Liu wrote:
> I am using Xapian to index two XML files. In each file, there are about
> 6000+ pieces of news. Each piece of news is separated by <DOC> </DOC>.
> The way I build the index is:
>
> 1) read the XML file line by line, get one piece of news's head, date,
> and contents which are separated by tags
> 2) remove  numbers, change to lower case,  remove stop words , and the  
> information is saved in $buf
> 3) new a Xapian::Document $doc, and use the TermGenerator to  
> set_document($doc) and index_text($buf).
> 4) add the $doc to the database $db

Please post actual code rather than trying to describe it in English.

> For the next piece of news, repeat the above 1 to 3 steps.

So you only actually add the first document to the database?

If you'd posted the actual code you were using, I wouldn't have to guess...

> The average  
> length of each news is about 200 terms. The index is very fast, about  
> one to two minutes. My question is about the searching speed. I need to  
> find the bigrams of indexed documents, i.e., find any two term's common
> postinglist and their positionlist in the same document. I found the  
> speed is kind of low, about 1562 bigrams/hour.

I don't know how you're doing this without seeing the code.

> My question is, is it an efficient way to build the index? If I do the  
> above step 1 and 2, and save the results into one separate file, can I  
> speed up the searching speed?

I don't see how that would make any difference to search speed - the
database will contain the same terms.

> Can I index a file directly instead of  TermGenerator?

You can just call Document::add_term() and/or Document::add_posting() directly
instead of generating a string to feed to TermGenerator.  That would be an
easier and more efficient approach, I think.
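
As a rough illustration of that suggestion, the sketch below walks a token list and makes one add_posting(term, position) call per token. The helper name index_tokens and the first_pos parameter are hypothetical, and a recording callback stands in for a real document so the snippet runs without Xapian installed. With the Xapian Python bindings one would presumably pass a Document's add_posting method instead.

```python
def index_tokens(tokens, add_posting, first_pos=1):
    """Call add_posting(term, position) once per token, in order.

    Returns the next unused position, so a second field (for example the
    body after the head) can continue numbering where this one stopped.
    """
    pos = first_pos
    for term in tokens:
        add_posting(term, pos)
        pos += 1
    return pos


# Stand-in for a document: record the calls instead of indexing them.
# With Xapian this callback would be something like doc.add_posting
# (an assumption; the thread shows no code).
recorded = []
next_pos = index_tokens(["stock", "market", "rose"],
                        lambda term, pos: recorded.append((term, pos)))
```

Because the tokens are already split and cleaned, no intermediate string has to be built and parsed again.
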
> In a previous post,  
> http://lists.xapian.org/pipermail/xapian-discuss/2009-April/006626.html,  
> it mentioned  tuning XAPIAN_FLUSH_THRESHOLD. How to do this to speed up  
> the searching speed?

XAPIAN_FLUSH_THRESHOLD only affects indexing.  It can slightly change where
posting list chunk boundaries are, and the internal layout of blocks in the
Btree, which may indirectly affect search speed, but there's no direct
effect on searching.

Cheers,
    Olly
