thr3ads.net - Xapian discuss - [Xapian-discuss] custom [Dec 2004]

gervin23

2004-Dec-21 20:20 UTC

[Xapian-discuss] custom

hello,

i was playing with various ideas around hit-highlighting and have a 
version via python that parses the documents (html in this case) and 
performs regx to surround the highlighted terms with a <span> tag. it 
works for the moment and is quite speedy but i can't imagine the server 
could handle it nicely under load. so my next idea isn't possible with 
xapian at the moment (i don't think?) but was wondering if there's a 
workaround, patch, or better idea. basically, what i'd like is to add a 
4th parameter to add_posting(term, pos, wdf) where i could store 
something like the byte offset of the term (as opposed to the position 
list value).

for example, html like the following would take into account where the 
term(s) physically exist in the document while taking into account tags, 
stopwords ('by' in this case), etc:
<html><title>Search by Document</title></html>

where my indexing calls would look something like:
add_posting('search',1,1,13) <- position=1,byteoffset=13
add_posting('document',2,1,23) <- position=2,byteoffset=23

note, it might also serve better to use octal or hex for this purpose. 
any ideas greatly appreciated.
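
To make the idea concrete, here is a minimal sketch of how the (term, position, byte offset) triples for the example above could be computed at indexing time. It only illustrates the bookkeeping: the `term_offsets` helper and the tiny stopword set are hypothetical, and since add_posting() has no fourth parameter, the offsets would have to be stored somewhere outside the posting list.

```python
import re

# Hypothetical stopword list, just enough for the example.
STOPWORDS = {"by", "the", "a", "of"}

def term_offsets(html):
    """Return (term, position, byte_offset) for each indexed term.

    Tags are skipped, stopwords do not consume a position, and offsets
    are counted in bytes of the UTF-8 encoded document.
    """
    data = html.encode("utf-8")
    position = 0
    found = []
    # Each match is either a whole tag (ignored) or a run of ASCII
    # letters/digits (treated as a term).
    for m in re.finditer(rb"<[^>]*>|[A-Za-z0-9]+", data):
        token = m.group()
        if token.startswith(b"<"):
            continue
        term = token.decode("ascii").lower()
        if term in STOPWORDS:
            continue
        position += 1
        found.append((term, position, m.start()))
    return found

html = "<html><title>Search by Document</title></html>"
results = term_offsets(html)
for term, pos, offset in results:
    print(term, pos, offset)
# search 1 13
# document 2 23
```

The offsets agree with the add_posting() calls above: 13 for 'search' and 23 for 'document'.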

one other question i have has to do with 2 term phrase searches. i find 
these particular searches magnitudes longer than 3+ term phrase searches 
(sometimes 80 seconds). it seems the more terms i add, the faster the 
results. for these tests, i performed all new searches each time (trying 
to workaround the cache system) and found the results pretty consistent. 
any ideas as to why this might be happening? also, i dug a little into 
how the storage system was behaving while doing these searches and found 
xapian using less than 1% of available throughput (cpu pretty much 
idle). now, this is on a regular desktop system so if that's the case, 
how would a RAID'd system help? i'm most likely missing something here 
so a little more insight would help tremendously.

thanks much,
andrew

Olly Betts

2004-Dec-21 22:31 UTC

[Xapian-discuss] custom

On Tue, Dec 21, 2004 at 01:11:47PM -0700, gervin23 wrote:
> i was playing with various ideas around hit-highlighting and have a
> version via python that parses the documents (html in this case) and
> performs regx to surround the highlighted terms with a <span> tag. it
> works for the moment and is quite speedy but i can't imagine the server
> could handle it nicely under load. so my next idea isn't possible with
> xapian at the moment (i don't think?) but was wondering if there's a
> workaround, patch, or better idea. basically, what i'd like is to add a
> 4th parameter to add_posting(term, pos, wdf) where i could store
> something like the byte offset of the term (as opposed to the position
> list value).

It's not possible at present, and is a reasonably involved change, as
you'd need to invent somewhere to store this extra information -
probably a parallel structure to the posting list.

But for this to be useful, you also need to store the exact version of
the document which this information was extracted from.  Otherwise if
somebody updates the document, most of the byte offsets will probably
be wrong and using them blindly will give very odd results.
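
A rough sketch of that version check, assuming the offsets are kept in some side store next to a fingerprint of the exact bytes they were computed from. The `fingerprint` and `offsets_still_valid` helpers and the `stored` record are hypothetical; nothing like this exists in Xapian itself.

```python
import hashlib

def fingerprint(doc_bytes):
    """Hash of the exact document bytes the offsets were extracted from."""
    return hashlib.sha1(doc_bytes).hexdigest()

def offsets_still_valid(stored_fingerprint, current_bytes):
    """Offsets may only be trusted if the document is byte-for-byte unchanged."""
    return stored_fingerprint == fingerprint(current_bytes)

original = b"<html><title>Search by Document</title></html>"
# Hypothetical side record written at indexing time.
stored = {
    "fingerprint": fingerprint(original),
    "offsets": {"search": 13, "document": 23},
}

# Somebody later edits the title; every offset after the edit shifts.
edited = b"<html><title>A Search by Document</title></html>"

print(offsets_still_valid(stored["fingerprint"], original))  # True
print(offsets_still_valid(stored["fingerprint"], edited))    # False
```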

So I've always ended up concluding that you want to do the highlighting
on the fly.  Since large search systems end up I/O bound, you generally
have spare CPU cycles, and don't want to be fetching extra information
off disk anyway.
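
For reference, on-the-fly highlighting along these lines can be quite small. The following is only a sketch, not the code from the original post: the `highlight` function and the `hit` class name are made up, it matches literal terms rather than stemmed forms, and it does not treat <script>, <style> or comments specially.

```python
import re

def highlight(html, terms, css_class="hit"):
    """Wrap whole-word matches of terms in <span>, leaving tags untouched."""
    if not terms:
        return html
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    # Splitting on a capturing group keeps the tags in the list: tags land
    # at odd indexes, the text between tags at even indexes.
    parts = re.split(r"(<[^>]*>)", html)
    for i in range(0, len(parts), 2):
        parts[i] = pattern.sub(
            lambda m: '<span class="%s">%s</span>' % (css_class, m.group(0)),
            parts[i],
        )
    return "".join(parts)

page = "<html><title>Search by Document</title></html>"
print(highlight(page, ["search", "document"]))
# <html><title><span class="hit">Search</span> by <span class="hit">Document</span></title></html>
```

Because it works on whatever version of the document is fetched at display time, nothing stored in the index can go stale.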

What would be really neat is a client side solution - something like
a javascript based highlighter, but I don't know if javascript has
enough power to allow this.

> one other question i have has to do with 2 term phrase searches. i find 
> these particular searches magnitudes longer than 3+ term phrase searches 
> (sometimes 80 seconds). it seems the more terms i add, the faster the 
> results. for these tests, i performed all new searches each time (trying 
> to workaround the cache system) and found the results pretty consistent. 
> any ideas as to why this might be happening?

The phrase search is an AND search plus additional filtering - I suspect
that the more terms, the fewer need to be filtered.  Also, the filtering
keys off the least common term in the phrase, which is likely to work
better with more terms (one is more likely to be infrequent).
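
To illustrate the shape of this (a toy model only, not Xapian's actual code), here is a sketch over a small in-memory positional index. The `INDEX` data and the `phrase_search` function are invented for the example, and "least common" is approximated by the number of documents containing the term.

```python
# Toy positional index: term -> {doc_id: list of positions}.
# Hypothetical data, for illustration only.
INDEX = {
    "search":   {1: [1, 7], 2: [3], 3: [5]},
    "by":       {1: [2], 2: [9], 3: [6]},
    "document": {1: [3], 3: [2]},
}

def phrase_search(index, terms):
    """Return ids of documents containing the terms as an exact phrase."""
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    # Step 1: AND search, i.e. documents containing every term.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Step 2: position filtering, driven by the term found in the fewest
    # documents, so the fewest starting points have to be checked.
    rarest = min(range(len(terms)), key=lambda i: len(postings[i]))
    hits = []
    for doc in sorted(candidates):
        for pos in postings[rarest][doc]:
            start = pos - rarest  # where the phrase would have to begin
            if all(start + i in postings[i][doc] for i in range(len(terms))):
                hits.append(doc)
                break
    return hits

print(phrase_search(INDEX, ["search", "by"]))              # [1, 3]
print(phrase_search(INDEX, ["search", "by", "document"]))  # [1]
```

With two terms, all three documents survive the AND step and need position checks; adding the third term cuts the candidates to two before any positions are read.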

Profiling with nothing cached isn't terribly representative in most
cases.  You'd expect much of the "skeleton" of all the Btrees to be
cached, plus the most commonly used terms.

> also, i dug a little into 
> how the storage system was behaving while doing these searches and found 
> xapian using less than 1% of available throughput (cpu pretty much 
> idle). now, this is on a regular desktop system so if that's the case, 
> how would a RAID'd system help?

Sounds like you're I/O bound, so a RAID configuration which improves
the speed you can seek to and read a group of blocks should help.

There's also potential for changing Xapian's code to improve the speed
of phrase searches (more in some cases than others).  It's something I'm
going to be working on in the next few months.

Do you have some example queries which are particularly slow?

Cheers,
    Olly
