thr3ads.net - Xapian discuss - [Xapian-discuss] Getting document's context [Mar 2007]

Matti Heinonen

2007-Mar-05 11:29 UTC

[Xapian-discuss] Getting document's context

Hello all,

Is there a way to access an indexed document's contents sequentially 
starting from a given position in the document? I've been banging my 
thick head on Xapian's documentation and wading through the depths of 
the Internet for a day or so without getting anywhere.

I'm using Python bindings for Xapian. Indexing and searching work fine, 
but I cannot figure out how to show a bit of textual context around 
terms found in a certain document (just like Google does).

When indexing, I am including posting information. When searching, I am 
able to get the position information for a term using 
database.positionlist(). But how do I get the text at the positions 
around the term?



Matti
-- 
Matti Heinonen                           | email: matti.heinonen@uta.fi
Atk-erikoistutkija                       | tel: +358 3 215 8523
Yhteiskuntatieteellinen tietoarkisto FSD | fax: +358 3 215 8519
FIN-33014 TAMPEREEN YLIOPISTO            | WWW: http://www.fsd.uta.fi/

Felix Antonius Wilhelm Ostmann

2007-Mar-05 11:38 UTC

I want to do the same thing, but I think we must fetch the whole data and
do it ourselves, without help from Xapian. Right?

Matti Heinonen wrote:
> Hello all,
>
> Is there a way to access an indexed document's contents sequentially 
> starting from a given position in the document? I've been banging my 
> thick head on xapian's documentation and wading my feet in the depths 
> of the Internet for a day or so and not getting anywhere.
>
> I'm using Python bindings for Xapian. Indexing and searching work 
> fine, but I cannot figure out how to show a bit of textual context 
> around terms found in a certain document (just like Google does).
>
> When indexing, I am including posting information. When searching, I 
> am able to get the position information for a term using 
> database.positionlist(). But how to get the text in the positions 
> around  the term?
>
>
>
> Matti

-- 
Kind regards

Felix Antonius Wilhelm Ostmann
--------------------------------------------------
Websuche   Search   Technology   GmbH   &   Co. KG
Martinistraße 3  -  D-49080  Osnabrück  -  Germany
Tel.:   +49 541 40666-0 - Fax:    +49 541 40666-22
Email: info@websuche.de - Website: www.websuche.de
--------------------------------------------------
AG Osnabrück - HRA 200252 - Ust-Ident: DE814737310
Komplementärin:     Websuche   Search   Technology
Verwaltungs GmbH   -  AG Osnabrück  -   HRB 200359
Geschäftsführer:  Diplom Kaufmann Martin Steinkamp
--------------------------------------------------

Olly Betts

2007-Mar-06 01:15 UTC

On Mon, Mar 05, 2007 at 01:29:22PM +0200, Matti Heinonen wrote:
> When indexing, I am including posting information. When searching, I am 
> able to get the position information for a term using 
> database.positionlist(). But how to get the text in the positions around 
> the term?

We store positional information per term+document so it isn't possible
to answer the question "which terms occur between positions N1 and N2 in
document D" without opening the position lists for every term in
document D and doing a "skip_to" on each.

I'd generally suggest storing a cleaned up copy of the document text in
the document data and generating dynamic samples from that.  Xapian
doesn't currently have a mechanism to do that though (it's something
I'd like to add).
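
As a rough illustration of the "store the text, build the sample yourself" approach, here is a minimal Python sketch. It assumes the cleaned-up text has already been read back from the document data as a string; `make_snippet` and its `width` parameter are hypothetical names, not part of Xapian. Note that it matches the query words literally, so stemmed index terms would need to be mapped back to surface forms first.

```python
import re


def make_snippet(text, query_terms, width=40):
    """Return a fragment of `text` around the first occurrence of any query term.

    Hypothetical helper: `text` is assumed to be the cleaned-up copy of the
    document that was stored in the document data at indexing time.
    """
    if not query_terms:
        return text[: 2 * width].strip()
    # Match any query term as a whole word, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in query_terms) + r")\b",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        # No term found: fall back to the start of the document.
        return text[: 2 * width].strip()
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    snippet = text[start:end].strip()
    # Mark truncation on either side.
    if start > 0:
        snippet = "..." + snippet
    if end < len(text):
        snippet = snippet + "..."
    return snippet
```

A call such as `make_snippet(stored_text, ["context"], width=60)` would return up to 60 characters on either side of the first hit.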

Alternatively, Jean-Francois Dockes posted some C++ code to recreate the
whole document by looking at position list data - it would be easy to
adapt that to only look at a restricted range of document positions:

http://article.gmane.org/gmane.comp.search.xapian.general/2187
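
The restricted-range idea can be sketched in Python as follows. This is not the C++ code linked above, only an illustration under assumptions: `positions_by_term` is a hypothetical dict mapping each term of one document to its list of positions, which with the Python bindings could presumably be filled by walking the document's terms and reading database.positionlist() for each. The result is a sequence of index terms (possibly stemmed or lowercased), not the original text.

```python
def terms_in_range(positions_by_term, first, last):
    """Return the terms found at positions first..last (inclusive), in order.

    `positions_by_term` is an assumed structure: {term: [pos, pos, ...]}
    for a single document.
    """
    found = []
    for term, positions in positions_by_term.items():
        for pos in positions:
            if first <= pos <= last:
                found.append((pos, term))
    # Sort by position so the terms come out in document order.
    found.sort()
    return [term for _, term in found]
```

Every term's position list still has to be visited, which is the cost described above; the range check only limits how much is kept.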

Cheers,
    Olly

Fabrice Colin

2007-Mar-06 12:10 UTC

On 3/6/07, Olly Betts <olly@survex.com> wrote:
> On Mon, Mar 05, 2007 at 01:29:22PM +0200, Matti Heinonen wrote:
> > When indexing, I am including posting information. When searching, I am
> > able to get the position information for a term using
> > database.positionlist(). But how to get the text in the positions around
> > the term?
>
> We store positional information per term+document so it isn't possible
> to answer the question "which terms occur between positions N1 and N2 in
> document D" without opening the position lists for every term in
> document D and doing a "skip_to" on each.
>
> I'd generally suggest storing a cleaned up copy of the document text in
> the document data and generating dynamic samples from that.  Xapian
> doesn't currently have a mechanism to do that though (it's something
> I'd like to add).
>
> Alternatively, Jean-Francois Dockes posted some C++ code to recreate the
> whole document by looking at position list data - it would be easy to
> adapt that to only look at a restricted range of document positions:
>
> http://article.gmane.org/gmane.comp.search.xapian.general/2187

I wrote something similar for Pinot that tries to find the best "window",
i.e. where in the document the largest number of query terms is to be found.

I can't claim it's optimal but it works well enough for me. You can
find the code at
http://svn.berlios.de/wsvn/pinot/tags/version_0_7_0/Search/AbstractGenerator.cpp?op=file&rev=0&sc=0
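
For readers who just want the gist of the "best window" idea, here is a small Python sketch. It is not the Pinot code, only one possible reading of it: `positions_by_term` is a hypothetical dict of query term to positions within one document, `window` is an assumed width in term positions, and "largest number of query terms" is interpreted here as the most distinct terms.

```python
def best_window(positions_by_term, window=10):
    """Return (start, count) for the window covering the most distinct query terms.

    `start` is the position of the first hit inside the best window and
    `count` is how many distinct query terms fall within `window` positions
    of it. Returns (None, 0) when there are no hits at all.
    """
    # Flatten into a list of (position, term) hits sorted by position.
    hits = sorted(
        (pos, term)
        for term, plist in positions_by_term.items()
        for pos in plist
    )
    best_start, best_count = None, 0
    counts = {}  # term -> number of hits currently inside the window
    left = 0
    for pos, term in hits:
        counts[term] = counts.get(term, 0) + 1
        # Drop hits from the left until the span fits inside the window.
        while pos - hits[left][0] >= window:
            old = hits[left][1]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            left += 1
        if len(counts) > best_count:
            best_count = len(counts)
            best_start = hits[left][0]
    return best_start, best_count
```

The returned start position could then be used to pick which stretch of the document to show as context.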

Fabrice
