thr3ads.net - Xapian discuss - [Xapian-discuss] Get term from document by position [Jul 2015]

john.alveris at Safe-mail.net

2015-Jul-26 14:36 UTC

[Xapian-discuss] Get term from document by position

> Snippet highlighting is something that was worked on for a GSoC project a
> few years ago, and is mentioned in our FAQ:
> <http://trac.xapian.org/wiki/FAQ/Snippets>.
> It's not available in the 1.2 series, but as I understand it should work
> out of the box in 1.3.3.
I tried it. This approach returns snippets that have nothing to do with the
search string. Moreover, it takes too long to generate a snippet.

> Note that your suggested approach of going from terms to snippet doesn't
> work in the general case, because of issues like stemming.
Actually, it works just fine. I am using the following indexing scheme:
First, I index the unstemmed text.
Next, I add a term with a unique prefix to the database. This term is used as a
delimiter between the stemmed and unstemmed terms.
Finally, I index the stemmed text.


When generating a snippet (if a stemmer is being used), I get the positions of
the stemmed terms (that the snippet should consist of) and the position of the
delimiter. Next, I apply the appropriate shift to get the positions of the
corresponding unstemmed terms.
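
To make the shift concrete, here is a minimal sketch of the layout described above. It is not the actual indexing code: it uses plain in-memory vectors instead of a Xapian document, and it assumes that every raw word yields exactly one stemmed term, so the two regions have the same length. The names PositionLayout, build_layout, raw_position and the delimiter term "XDELIM" are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a document's positional term stream.
// slots[i] holds the term stored at position i + 1 (positions are 1-based).
struct PositionLayout {
    std::vector<std::string> slots;
    std::size_t delimiter_pos = 0;  // 1-based position of the delimiter term
};

// Lay out the raw words, then one delimiter term, then the stemmed words.
// Assumption: `raw` and `stemmed` are parallel (stemmed[i] is the stem of
// raw[i]); the delimiter name "XDELIM" is made up for this sketch.
PositionLayout build_layout(const std::vector<std::string>& raw,
                            const std::vector<std::string>& stemmed,
                            const std::string& delimiter = "XDELIM") {
    assert(raw.size() == stemmed.size());
    PositionLayout layout;
    layout.slots = raw;
    layout.slots.push_back(delimiter);
    layout.delimiter_pos = layout.slots.size();
    layout.slots.insert(layout.slots.end(), stemmed.begin(), stemmed.end());
    return layout;
}

// Map the position of a stemmed term back to the position of its raw word.
// Returns 0 if `stemmed_pos` does not lie in the stemmed region.
std::size_t raw_position(const PositionLayout& layout,
                         std::size_t stemmed_pos) {
    if (stemmed_pos <= layout.delimiter_pos ||
        stemmed_pos > layout.slots.size()) {
        return 0;
    }
    return stemmed_pos - layout.delimiter_pos;
}
```

With raw words {"Cats", "running"} and stems {"cat", "run"}, the delimiter sits at position 3, and the stemmed term at position 5 maps back to raw position 2. If the stemmed pass ever emits a different number of positions than the raw pass, the shift no longer lines up.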

This approach works fine, except that I have to cycle through the terms to get
a term by position, and this operation is time-consuming.

Let me note that Recoll ( http://www.lesbonscomptes.com/recoll/ ) uses a
similar approach to generate snippets (actually, I am using their method with
some modifications). To get a term by position, they cycle through all of the
terms too.
While it works, it takes 1-2 seconds to generate snippets (about 10 snippets). I
think that if there were a fast way to get a term by position, then snippet
generation would be much faster.
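
One possible way to avoid cycling for every lookup is to walk the term list once per document and build a position-to-term table, so that each later lookup is a plain array access. The sketch below is only an illustration under assumptions: the std::map input stands in for what one would collect from Xapian::Document::termlist_begin() and each term's positionlist_begin(), and the names PositionLists, invert_positions and term_at are hypothetical. Whether this helps in practice depends on how much of the 1-2 seconds is spent reading the position lists themselves.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Term -> list of 1-based positions; a self-contained stand-in for the
// per-term position lists of one document.
using PositionLists = std::map<std::string, std::vector<std::size_t>>;

// Invert the position lists into a table indexed by position.
// Index 0 is unused. If several terms share a position, the last one
// visited wins; unused positions stay empty.
std::vector<std::string> invert_positions(const PositionLists& lists) {
    std::size_t max_pos = 0;
    for (const auto& entry : lists) {
        for (std::size_t pos : entry.second) {
            if (pos > max_pos) max_pos = pos;
        }
    }
    std::vector<std::string> by_position(max_pos + 1);
    for (const auto& entry : lists) {
        for (std::size_t pos : entry.second) {
            by_position[pos] = entry.first;
        }
    }
    return by_position;
}

// Constant-time lookup; returns an empty string for unknown positions.
std::string term_at(const std::vector<std::string>& by_position,
                    std::size_t pos) {
    return pos < by_position.size() ? by_position[pos] : std::string();
}
```

The table costs one pass over all position lists plus memory proportional to the highest position, and it can be reused for every snippet taken from the same document.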



> 
> > Hello. Is there any FAST way to get a term from the xapian document
> > by its position, something like
> > std::string term = Xapian::Document::GetTermByPosition(int position) ?
>
> Not that I'm aware of. Snippet highlighting is something that was worked on
> for a GSoC project a few years ago, and is mentioned in our FAQ:
> <http://trac.xapian.org/wiki/FAQ/Snippets>. It's not available in the 1.2
> series, but as I understand it should work out of the box in 1.3.3.
>
> Note that your suggested approach of going from terms to snippet doesn't
> work in the general case, because of issues like stemming. Instead, Mihai's
> approach was to use the matcher information to generate a snippet from the
> original, unstemmed and untermed, text.

James Aylett

2015-Jul-26 14:41 UTC

[Xapian-discuss] Get term from document by position

On 26 Jul 2015, at 15:36, john.alveris at Safe-mail.net wrote:
>> Snippet highlighting is something that was worked on for a GSoC project a
>> few years ago, and is mentioned in our FAQ:
>> <http://trac.xapian.org/wiki/FAQ/Snippets>.
>> It's not available in the 1.2 series, but as I understand it should work
>> out of the box in 1.3.3.
>
> I tried it. This approach returns snippets that have nothing to do with the
> search string. Moreover, it takes too long to generate a snippet.

Can you file a bug with some example outputs that are unrelated to the search
string?

>> Note that your suggested approach of going from terms to snippet doesn't
>> work in the general case, because of issues like stemming.
>
> Actually, it works just fine. I am using the following indexing scheme:
> First, I index the unstemmed text.
> Next, I add a term with a unique prefix to the database. This term is used
> as a delimiter between the stemmed and unstemmed terms.
> Finally, I index the stemmed text.

Right, but that's not the general case. It's absolutely possible to do things in
other ways, of course. (In this case I assume you're indexing completely
untransformed text, just word splitting; you aren't normalising case for the
'raw' terms, for instance. What do you do about punctuation, out of interest?)

J

-- 
 James Aylett, occasional trouble-maker
 xapian.org
