Michael Decerbo
2020-Sep-20 02:56 UTC
help improving relevance of snippets displayed by Omega
Olly,

Thanks again very much for helping me improve my understanding of Xapian and Omega. Thanks especially for pointing out that my idea of trying to generate a snippet from stemmed text lacking capitalization and punctuation would probably not produce a user-friendly result.

But I'm still doubtful that expanding the sample size could be the right way to obtain excerpts from the document that are relevant to the query. Suppose that the sample size were even as big as 10% of the average document size, queries contained only a single term, and a typical query term appeared on average only once per document. In that case, it seems to me that nine out of ten samples would not contain the single query term, so that nine times out of ten the snippet generated from the sample would not contain the query term. Is my thinking accurate about this, or am I again missing something?

In general, I'm wondering how best to use Xapian so that, at query time, my application can display an excerpt that is relevant to the query, not a sample chosen at indexing time without regard to the query that may or may not contain the query term(s). For example, TheyWorkForYou.com is listed on xapian.org as a site using Xapian, and when I enter a single-term query on that site the document excerpts provided as part of the search results invariably include highlighted words, possibly stemmed, responsive to the query. That's the effect I would like to achieve.

If you can think of any sample code that I should refer to, or even if you could just suggest the broad outlines of a solution, I would be very grateful.

Thanks again!
Michael
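The nine-out-of-ten estimate above can be checked with a little arithmetic. The sketch below is only an illustration under a simplifying assumption that is not from the thread: each occurrence of the term is treated as landing at an independent, uniformly random position in the document, and the stored sample is treated as covering a fixed fraction of the text. `hit_probability` is a hypothetical helper name.

```python
def hit_probability(sample_fraction: float, occurrences: int) -> float:
    """Chance that at least one occurrence falls inside the stored sample.

    Assumption (illustration only): occurrences are placed independently
    and uniformly at random within the document.
    """
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    # P(at least one hit) = 1 - P(every occurrence misses the sample)
    return 1.0 - (1.0 - sample_fraction) ** occurrences
```

Under these assumptions, `hit_probability(0.10, 1)` gives the one-in-ten chance described above. The chance rises as the term occurs more often, but it only becomes certain when the sample covers the whole text.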
Matthew Somerville
2020-Sep-21 08:28 UTC
help improving relevance of snippets displayed by Omega
Hi,

Ha, I was reading this thread thinking TheyWorkForYou (which I help maintain) does highlight terms wherever they are, and then you mentioned it :)

The code is open source; it is quite old and probably hair-raising, but as you say it does basically do what you want.

Our Xapian database stores terms/boolean terms/values for the text, and for the document itself stores only an identifier. It works by doing the Xapian search, then fetching the resultant IDs from the database, then it boils down to calling prepare_search_result_for_display on each result:

https://github.com/mysociety/theyworkforyou/blob/master/www/includes/easyparliament/hansardlist.php#L1279-L1306

Which uses two functions to then work out the extract, position_of_first_word:

https://github.com/mysociety/theyworkforyou/blob/master/www/includes/easyparliament/searchengine.php#L562

and highlight:

https://github.com/mysociety/theyworkforyou/blob/master/www/includes/easyparliament/searchengine.php#L475

They stem the entered words and loop through the speeches to find the match/thing to highlight.

ATB,
Matthew

On Sun, 20 Sep 2020 at 03:57, Michael Decerbo <michaeldecerbo at gmail.com> wrote:
> In general, I'm wondering how best to use Xapian so that, at query time, my
> application can display an excerpt that is relevant to the query, not a
> sample chosen at indexing time without regard to the query that may or may
> not contain the query term(s). For example, TheyWorkForYou.com is listed on
> xapian.org as a site using Xapian, and when I enter a single-term query on
> that site the document excerpts provided as part of the search results
> invariably include highlighted words, possibly stemmed, responsive to the
> query. That's the effect I would like to achieve.
>
> If you can think of any sample code that I should refer to, or even if you
> could just suggest the broad outlines of a solution, I would be very
> grateful.
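The "stem the entered words and loop through the text" idea described in this message can be sketched roughly as follows. This is not the TheyWorkForYou code (which is PHP); it is a minimal Python illustration. `toy_stem` and `make_excerpt` are hypothetical names, and `toy_stem` is a crude stand-in for a real stemmer. In practice you would want the same stemmer that was used at indexing time.

```python
import html
import re

WORD_RE = re.compile(r"[A-Za-z0-9']+")


def toy_stem(word: str) -> str:
    """Stand-in stemmer (NOT Xapian's): lowercase and strip one common suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def make_excerpt(text: str, query: str, width: int = 80) -> str:
    """Return an HTML-escaped window of `text` around the first query match,
    wrapping each word whose stem matches a query stem in <b>...</b>."""
    stems = {toy_stem(w) for w in WORD_RE.findall(query)}
    # Offset of the first word whose stem matches; fall back to the start.
    first = next(
        (m.start() for m in WORD_RE.finditer(text) if toy_stem(m.group()) in stems),
        0,
    )
    start = max(0, first - width // 2)
    end = min(len(text), start + width)
    window = text[start:end]
    # Rebuild the window piece by piece, escaping text and marking matches.
    out, pos = [], 0
    for m in WORD_RE.finditer(window):
        out.append(html.escape(window[pos:m.start()]))
        word = html.escape(m.group())
        out.append(f"<b>{word}</b>" if toy_stem(m.group()) in stems else word)
        pos = m.end()
    out.append(html.escape(window[pos:]))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + "".join(out) + suffix
```

For example, `make_excerpt(speech_text, "debate")` would centre the excerpt on the first word stemming to the same form as "debate" and highlight every such word in the window. The window edges are not aligned to word boundaries here, which a real front end would probably want to tidy up.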
Matthew Somerville
2020-Sep-21 08:30 UTC
help improving relevance of snippets displayed by Omega
Whoops, forgot to say, as Olly said, much of that could probably now be simplified with Xapian's snippet() function, which I can only assume did not exist back when all this was written! :)

ATB,
Matthew
On Sat, Sep 19, 2020 at 10:56:30PM -0400, Michael Decerbo wrote:
> But I'm still doubtful that expanding the sample size could be the right
> way to obtain excerpts from the document that are relevant to the query.
> Suppose that the sample size were even as big as 10% of the average
> document size, queries contained only a single term, and a typical query
> term appeared on average only once per document.

FWIW, that's too low - for highlighting purposes only documents the term appears in are interesting, so the relevant average to consider here is only over documents with 1 or more occurrences. You'll never find the term in a document it doesn't occur in, no matter how much of it you store.

> In that case, it seems to
> me that nine out of ten samples would not contain the single query term, so
> that nine times out of ten the snippet generated from the sample would not
> contain the query term. Is my thinking accurate about this, or am I again
> missing something?

I'm suggesting you set the sample size so that ALL of the text is stored for MOST documents (there are usually outliers, so having a limit is a good idea so that if someone adds a terabyte file of random ASCII to your system that doesn't result in pointless bloat).

> In general, I'm wondering how best to use Xapian so that, at query time, my
> application can display an excerpt that is relevant to the query, not a
> sample chosen at indexing time without regard to the query that may or may
> not contain the query term(s). For example, TheyWorkForYou.com is listed on
> xapian.org as a site using Xapian, and when I enter a single-term query on
> that site the document excerpts provided as part of the search results
> invariably include highlighted words, possibly stemmed, responsive to the
> query. That's the effect I would like to achieve.

You need the document text in order to select a dynamic sample, so either you need to store that text in Xapian, or obtain it at search time from some external source (which needs to be reasonably efficient as you need to do this for each document in a page of results - if it takes 0.1 seconds per document to get the text, you're adding a second to the time to render a page of 10 results).

With Omega, storing in Xapian is well supported (via setting a large sample size) so that's what I'm suggesting for the situation you described. If you have the text easily accessible somewhere else you can make use of that, but as I already said you'll need to write some code - either your own front end (which is what TheyWorkForYou has) or modifying Omega.

Cheers,
Olly
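One way to pick a limit in the spirit of "ALL of the text for MOST documents" is to take a high percentile of the extracted text sizes in your collection. The sketch below is only an illustration: the helper name `sample_size_for_coverage` and the default 95% coverage are assumptions, and how the resulting value is passed to the indexer depends on your setup.

```python
import math


def sample_size_for_coverage(text_sizes: list[int], coverage: float = 0.95) -> int:
    """Smallest size limit that stores the full text of at least
    `coverage` of the documents (nearest-rank percentile)."""
    if not text_sizes:
        raise ValueError("need at least one document size")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(text_sizes)
    # 1-based nearest rank; the small epsilon guards against float round-up.
    rank = max(1, math.ceil(coverage * len(ordered) - 1e-9))
    return ordered[rank - 1]
```

With nine ordinary documents of a few kilobytes and one multi-gigabyte outlier, a coverage of 0.9 would return the size of the largest ordinary document, so the outlier is truncated while every other document is stored in full.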