Michael Decerbo
2020-Sep-19 00:33 UTC
help improving relevance of snippets displayed by Omega
Thanks Olly! But expanding the sample seems like the wrong solution. Is there a way to instead pass a hit or hits from the document to snippet generation?

Michael
On Fri, Sep 18, 2020 at 08:33:49PM -0400, Michael Decerbo wrote:
> But expanding the sample seems like the wrong solution. Is there a way to
> instead pass a hit or hits from the document to snippet generation?

I'm not sure what you have in mind, but the only way I can see that working is if it read all the positional data for all the terms in the document, and then sorted it to essentially reconstruct the document text.

However (a) that gives you the text without capitalisation and without punctuation, which doesn't look very good, and (b) it tends to be rather slow because the positional data is primarily ordered by document for efficient searching, so there's poor locality of reference for this use (and large documents would make that worse).

The "xapian-pos" debug tool effectively does this text reconstruction to help visualise the positional data, so you can see what the reconstructed text would look like using that, e.g.:

    Gap of 1 unused positions
    1 Sbath
    2 Ssomerset
    3 bath
    4 somerset
    5 coordinates
    6 51
    7 23
    8 n
    9 2
    10 22
    11 w
    12 51.38
    13 n
    14 2.36
    15 w
    16 51.38
    17 2.36
    18 bath
    19 ?b???
    20 or
    21 ?b??
    22 latin
    23 aquae
    24 sulis
    25 welsh
    26 caerfaddon
    27 is
    28 a
    29 city
    ...

I've tried this approach on a project, but it didn't work out.

Storing a larger sample is definitely what I'd recommend (or if you have the text stored in another system, you could pass that to the MSet::snippet() method, but there isn't a way to do that with omega unless you modify the code).

Cheers,
Olly
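The reconstruction described in this reply can be illustrated with a minimal plain-Python sketch. It does not use the Xapian API: the function name `reconstruct_text` and the dictionary of term positions are hypothetical stand-ins for positional data read from a database, with a few values borrowed from the xapian-pos listing. It only shows why the result comes back lower-cased and unpunctuated.

```python
def reconstruct_text(positions_by_term):
    """Rebuild a rough text from a mapping of term -> word positions.

    `positions_by_term` is a hypothetical stand-in for the positional
    data held per term; it is not a Xapian data structure.
    """
    pairs = []
    for term, positions in positions_by_term.items():
        for pos in positions:
            pairs.append((pos, term))
    # Sorting by position puts the terms back in document order.
    pairs.sort()
    return " ".join(term for _, term in pairs)


# A few (term, positions) entries taken from the listing in the message.
sample_positions = {
    "bath": [3, 18],
    "somerset": [4],
    "coordinates": [5],
    "is": [27],
    "a": [28],
    "city": [29],
}

print(reconstruct_text(sample_positions))
# -> bath somerset coordinates bath is a city
```

The output has no capitals or punctuation because only the indexed terms and their positions are available, which is point (a) in the message.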
Michael Decerbo
2020-Sep-20 02:56 UTC
help improving relevance of snippets displayed by Omega
Olly,

Thanks again very much for helping me improve my understanding of Xapian and Omega. Thanks especially for pointing out that my idea of trying to generate a snippet from stemmed text lacking capitalization and punctuation would probably not produce a user-friendly result.

But I'm still doubtful that expanding the sample size could be the right way to obtain excerpts from the document that are relevant to the query. Suppose that the sample size were even as big as 10% of the average document size, queries contained only a single term, and a typical query term appeared on average only once per document. In that case, it seems to me that nine out of ten samples would not contain the single query term, so that nine times out of ten the snippet generated from the sample would not contain the query term. Is my thinking accurate about this, or am I again missing something?

In general, I'm wondering how best to use Xapian so that, at query time, my application can display an excerpt that is relevant to the query, not a sample chosen at indexing time without regard to the query that may or may not contain the query term(s).

For example, TheyWorkForYou.com is listed on xapian.org as a site using Xapian, and when I enter a single-term query on that site the document excerpts provided as part of the search results invariably include highlighted words, possibly stemmed, responsive to the query. That's the effect I would like to achieve.

If you can think of any sample code that I should refer to, or even if you could just suggest the broad outlines of a solution, I would be very grateful.

Thanks again!

Michael
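The contrast drawn in this message, between a sample fixed at indexing time and an excerpt chosen at query time, can be sketched in plain Python. This is an illustration under stated assumptions, not Xapian's or Omega's snippet algorithm: the function names, the 80-character window, and the prefix match used as a crude stand-in for stemming are all hypothetical. With Xapian itself, the earlier reply suggests passing stored full text to MSet::snippet().

```python
import re


def excerpt_around_hit(text, query_term, width=80,
                       hi_start="<b>", hi_end="</b>"):
    """Return a highlighted window around the first hit, or None.

    A word counts as a hit if it starts with `query_term`
    (case-insensitive); this is only a rough stand-in for stemming.
    """
    pattern = re.compile(r"\b" + re.escape(query_term) + r"\w*",
                         re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    hit = match.group(0)
    half = max(0, (width - len(hit)) // 2)
    start = max(0, match.start() - half)
    end = min(len(text), match.end() + half)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return (prefix + text[start:match.start()] + hi_start + hit + hi_end
            + text[match.end():end] + suffix)


def sample_contains(text, query_term, fraction=0.1):
    """Check whether a leading sample of the text contains a hit."""
    sample = text[: int(len(text) * fraction)]
    return excerpt_around_hit(sample, query_term) is not None


# Hypothetical document whose only hit lies outside the first 10%.
document = ("filler " * 50 + "The Roman baths at Bath are famous. "
            + "filler " * 50)

print(sample_contains(document, "roman"))     # the 10% sample misses it
print(excerpt_around_hit(document, "roman"))  # the full text finds it
```

Searching the full text at query time always finds a term that occurs in the document, whereas a fixed leading sample finds it only when the term happens to fall inside the sampled portion.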