john.alveris at Safe-mail.net
2015-Jul-26 14:36 UTC
[Xapian-discuss] Get term from document by position
> Snippet highlighting is something that was worked on for a GSoC project a
> few years ago, and is mentioned in our FAQ: <http://trac.xapian.org/wiki/FAQ/Snippets>.
> It's not available in the 1.2 series, but as I understand it should work out of the
> box in 1.3.3.

I tried it; this approach returns snippets that have nothing to do with the search string. Moreover, it takes too long to generate a snippet.

> Note that your suggested approach of going from terms to snippet doesn't work in the general
> case, because of issues like stemming.

Actually, it works just fine. I am using the following indexing scheme:
First, I index the unstemmed text.
Next, I add a term with a unique prefix to the database. This term is used as a delimiter between stemmed and unstemmed terms.
Finally, I index the stemmed text.

When generating a snippet (if a stemmer is being used), I get the positions of the stemmed terms (that the snippet should consist of) and the position of the delimiter. Next, I apply the appropriate shift and get the positions of the corresponding unstemmed terms. This approach works fine, except that I have to cycle through the terms to get a term by position, and this operation is time-consuming.

Let me note that Recoll ( http://www.lesbonscomptes.com/recoll/ ) uses a similar approach to generate snippets (actually, I am using their method with some modifications). To get a term by position, they cycle through all of the terms too. While it works, it takes 1-2 seconds to generate snippets (about 10 snippets). I think that if there were a fast way to get a term by position, snippet generation would be much faster.

> > > Hello. Is there any FAST way to get a term from the xapian document by its position, something like
> > > std::string term = Xapian::Document::GetTermByPosition(int position) ?
> >
> > Not that I'm aware of. Snippet highlighting is something that was worked on for a GSoC project a few years ago, and is mentioned in our FAQ: <http://trac.xapian.org/wiki/FAQ/Snippets>.
> > It's not available in the 1.2 series, but as I understand it should work out of the box in 1.3.3.
> >
> > Note that your suggested approach of going from terms to snippet doesn't work in the general case, because of issues like stemming. Instead, Mihai's approach was to use the matcher information to generate a snippet from the original, unstemmed and untermed, text.
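
The position shift described in the message above, and one way to avoid cycling through every term for each lookup, can be sketched roughly as follows. This is only an illustration in plain C++ with no Xapian calls: `TermPositions`, `build_position_index` and `unstemmed_term_at` are hypothetical names, and the layout (unstemmed words at positions 1..D-1, the delimiter at D, stemmed words from D+1 onwards, in the same order) is an assumption about the scheme, not something the thread spells out. The idea is to walk the document's terms and their positions once, build a position-to-term map, and then answer each lookup from that map.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One entry per term: the term text and every position at which it occurs.
// This mirrors what you would collect by walking a document's term list and,
// for each term, its position list (hypothetical container, not a Xapian type).
using TermPositions =
    std::vector<std::pair<std::string, std::vector<unsigned>>>;

// Invert term -> positions into position -> term in a single pass,
// so that later lookups by position do not need to scan all terms.
std::unordered_map<unsigned, std::string>
build_position_index(const TermPositions& terms) {
    std::unordered_map<unsigned, std::string> by_pos;
    for (const auto& entry : terms) {
        for (unsigned pos : entry.second) {
            by_pos[pos] = entry.first;
        }
    }
    return by_pos;
}

// Map the position of a stemmed term back to the unstemmed term.
// ASSUMED layout: unstemmed words at 1..D-1, delimiter at D,
// stemmed words at D+1.., so the shift is simply D.
// Returns an empty string if there is no term at the shifted position.
std::string unstemmed_term_at(
    const std::unordered_map<unsigned, std::string>& by_pos,
    unsigned stemmed_pos, unsigned delimiter_pos) {
    assert(delimiter_pos > 0);  // positions are 1-based in this sketch
    if (stemmed_pos <= delimiter_pos) return std::string();
    auto it = by_pos.find(stemmed_pos - delimiter_pos);
    return it == by_pos.end() ? std::string() : it->second;
}
```

With the map built once per document, each of the roughly ten snippets only pays for hash lookups; the cost of walking all terms is paid a single time rather than once per position.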
On 26 Jul 2015, at 15:36, john.alveris at Safe-mail.net wrote:

>> Snippet highlighting is something that was worked on for a GSoC project a
>> few years ago, and is mentioned in our FAQ: <http://trac.xapian.org/wiki/FAQ/Snippets>.
>> It's not available in the 1.2 series, but as I understand it should work out of the
>> box in 1.3.3.
>
> I tried it, this approach returns snippet that have nothing to do with the search string. Moreover, it takes too long to generate a snippet.

Can you file a bug with some example outputs that are unrelated to the search string?

>> Note that your suggested approach of going from terms to snippet doesn't work in the general
>> case, because of issues like stemming.
>
> Actually, it works just fine. I am using the following indexing scheme:
> First, i index unstemmed text.
> Next, i add a term with a unique prefix to the database. This term is used as a delimiter between stemmed and unstemmed
> terms.
> Finally, i index stemmed text.

Right, but that's not the general case. It's absolutely possible to do things in other ways, of course. (In this case I assume you're indexing completely untransformed text, just word splitting; you aren't normalising case for the "raw" terms, for instance. What do you do about punctuation, out of interest?)

J

-- 
James Aylett, occasional trouble-maker
xapian.org