Hello all,

Is there a way to access an indexed document's contents sequentially, starting from a given position in the document? I've been banging my thick head against Xapian's documentation and wading through the depths of the Internet for a day or so without getting anywhere.

I'm using the Python bindings for Xapian. Indexing and searching work fine, but I cannot figure out how to show a bit of textual context around the terms found in a given document (just like Google does).

When indexing, I am including posting information. When searching, I am able to get the position information for a term using database.positionlist(). But how do I get the text at the positions around the term?

Matti

--
Matti Heinonen | email: matti.heinonen@uta.fi
Atk-erikoistutkija | tel: +358 3 215 8523
Yhteiskuntatieteellinen tietoarkisto FSD | fax: +358 3 215 8519
FIN-33014 TAMPEREEN YLIOPISTO | WWW: http://www.fsd.uta.fi/
Felix Antonius Wilhelm Ostmann
2007-Mar-05 11:38 UTC
[Xapian-discuss] Getting document's context
I want to do the same thing, but I think we have to fetch the whole document data and do it ourselves, without help from Xapian. Right?

--
Best regards,
Felix Antonius Wilhelm Ostmann
Websuche Search Technology GmbH & Co. KG
Martinistraße 3 - D-49080 Osnabrück - Germany
Tel.: +49 541 40666-0 - Fax: +49 541 40666-22
Email: info@websuche.de - Website: www.websuche.de
On Mon, Mar 05, 2007 at 01:29:22PM +0200, Matti Heinonen wrote:
> When indexing, I am including posting information. When searching, I am
> able to get the position information for a term using
> database.positionlist(). But how to get the text in the positions around
> the term?

We store positional information per term+document, so it isn't possible to answer the question "which terms occur between positions N1 and N2 in document D" without opening the position lists for every term in document D and doing a "skip_to" on each.

I'd generally suggest storing a cleaned-up copy of the document text in the document data and generating dynamic samples from that. Xapian doesn't currently have a mechanism to do that, though (it's something I'd like to add).

Alternatively, Jean-Francois Dockes posted some C++ code to recreate the whole document by looking at position list data. It would be easy to adapt that to only look at a restricted range of document positions:

http://article.gmane.org/gmane.comp.search.xapian.general/2187

Cheers,
Olly
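To make the position-list approach concrete, here is a minimal Python sketch of the "restricted range" idea. It is not the C++ code linked above. It assumes you have already collected, for one document, a mapping from each term to its positions, for example by walking the document's term list and calling database.positionlist() for each term. The function name terms_in_range and the sample data are made up for illustration. Note that this recovers indexed terms (possibly stemmed or prefixed), not the original text.

```python
from typing import Dict, Iterable, List


def terms_in_range(positions_by_term: Dict[str, Iterable[int]],
                   first: int, last: int) -> List[str]:
    """Return the terms found at positions first..last (inclusive), in position order.

    positions_by_term is a hypothetical, pre-built mapping for ONE document:
    term -> positions at which that term occurs.
    """
    slots: Dict[int, List[str]] = {}
    for term, positions in positions_by_term.items():
        for pos in positions:
            if first <= pos <= last:
                # Several terms may share a position (e.g. stemmed and
                # unstemmed forms), so keep all of them.
                slots.setdefault(pos, []).append(term)
    # Join terms that share a position with "/" and order by position.
    return ["/".join(sorted(slots[pos])) for pos in sorted(slots)]


# Made-up position lists standing in for a real document.
example = {
    "the": [1, 7],
    "quick": [2],
    "brown": [3],
    "fox": [4],
    "jumps": [5],
    "over": [6],
    "lazy": [8],
    "dog": [9],
}

print(" ".join(terms_in_range(example, 3, 7)))
```

Every term's position list still has to be visited, which is the cost described above; the range check only limits how much is kept.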
On 3/6/07, Olly Betts <olly@survex.com> wrote:
> Alternatively, Jean-Francois Dockes posted some C++ code to recreate the
> whole document by looking at position list data - it would be easy to
> adapt that to only look at a restricted range of document positions:
>
> http://article.gmane.org/gmane.comp.search.xapian.general/2187

I wrote something similar for Pinot that tries to find the best "window", i.e. the place in the document where the largest number of query terms is found. I can't claim it's optimal, but it works well enough for me.

You can find the code at
http://svn.berlios.de/wsvn/pinot/tags/version_0_7_0/Search/AbstractGenerator.cpp?op=file&rev=0&sc=0

Fabrice
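For readers who want a feel for the "best window" idea without reading the C++, here is a small Python sketch. It is not Pinot's algorithm, just one simple way to pick a window. It assumes you already have the positions of each query term in the document (e.g. from database.positionlist()). The name best_window, the window width and the sample data are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def best_window(query_positions: Dict[str, List[int]], width: int) -> Tuple[int, int]:
    """Return (start, count) for the window [start, start + width) that
    covers the most distinct query terms. Ties keep the earliest window.
    Returns (0, 0) if there are no hits at all.
    """
    # Flatten to (position, term) pairs sorted by position.
    hits = sorted((pos, term)
                  for term, plist in query_positions.items()
                  for pos in plist)
    best_start, best_count = 0, 0
    # Only windows that start on a hit need to be considered.
    for start, _ in hits:
        covered = {term for pos, term in hits if start <= pos < start + width}
        if len(covered) > best_count:
            best_start, best_count = start, len(covered)
    return best_start, best_count


# Made-up positions of three query terms in one document.
example_hits = {
    "xapian": [3, 40, 42],
    "context": [45, 90],
    "snippet": [47],
}

print(best_window(example_hits, 10))
```

This is quadratic in the number of hits, which is usually fine for a single document. The returned start position can then be used to choose which part of the document to display.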