Arne Georg Gleditsch
2005-Jan-19 21:30 UTC
[Xapian-discuss] Search::Xapian and term positions
Hi,
I'm fooling around with the Xapian engine (via the Perl modules). I'm
wondering if I can get Xapian to tell me where in a document my
queries match -- so, as a first approach I'm trying to look at the
positionlist. It's not working for me. The following snippet:
  my $db = Search::Xapian::Database->new("test");
  my $enq = $db->enquire("xapian");
  my @matches = $enq->matches(0, 100);
  foreach my $match (@matches) {
      my $terms = $enq->get_matching_terms_begin($match);
      my $pos = $terms->positionlist_begin();
  }
bombs out. Firstly, get_matching_terms_begin1 and
get_matching_terms_begin2 seem to be switched around in Enquire.pm, but
even if I rectify that things crash and burn. Under gdb:
Program received signal SIGABRT, Aborted.
[Switching to Thread -1209842944 (LWP 12984)]
0xb7e8aed9 in raise () from /lib/tls/libc.so.6
(gdb) bt
#0 0xb7e8aed9 in raise () from /lib/tls/libc.so.6
#1 0xb7f98fcc in ?? () from /lib/tls/libc.so.6
#2 0xbffff7e0 in ?? ()
#3 0xb7e8c771 in abort () from /lib/tls/libc.so.6
[..]
#43 0x080c30c6 in Perl_pp_entersub ()
#44 0xb7c2cf84 in std::terminate () from /usr/lib/libstdc++.so.5
#45 0xb7c2d0f6 in __cxa_throw () from /usr/lib/libstdc++.so.5
#46 0xb7c93015 in Xapian::TermIterator::Internal::positionlist_begin ()
    from /usr/lib/libxapian.so.5
#47 0xb7d2a19d in Xapian::TermIterator::positionlist_begin ()
    from /usr/lib/libxapian.so.5
#48 0xb7dc8029 in XS_Search__Xapian__TermIterator_positionlist_begin ()
    from /usr/local/lib/perl/5.8.4/auto/Search/Xapian/Xapian.so
#49 0x080c30c6 in Perl_pp_entersub ()
#50 0x080bbbb9 in Perl_runops_standard ()
#51 0x080635e8 in perl_run ()
#52 0x080633f5 in perl_run ()
#53 0x0805fb9f in main ()
Has anyone seen this before? This is Xapian 0.8.5 and Search::Xapian
0.8.4.
Perhaps, before I walk this line further: is the positionlist going to
be useful for me? Not having gotten far enough to see what it looks
like, I get the impression that it is an index into the sequence of
tokens that a file is parsed to, is that correct? Can this number be
manipulated when a file is indexed, and what would be the consequence
of doing so? (I.e. letting it be <line number>*100 + <token position
in current line> or something?)
Thanks,
Arne.

On Wed, Jan 19, 2005 at 10:26:00PM +0100, Arne Georg Gleditsch wrote:
> Has anyone seen this before? This is Xapian 0.8.5 and Search::Xapian
> 0.8.4.

I've not seen it. Ideally all the methods should have feature tests, but
I've not managed to add these at quite the same rate I've been adding
wrappers.

> Perhaps, before I walk this line further: is the positionlist going to
> be useful for me? Not having gotten far enough to see what it looks
> like, I get the impression that it is an index into the sequence of
> tokens that a file is parsed to, is that correct?

It returns the position values you passed to add_posting() for the
specified term and document, in ascending order.

> Can this number be manipulated when a file is indexed, and what would
> be the consequence of doing so? (I.e. letting it be <line number>*100
> + <token position in current line> or something?)

Sure - you can pass whatever values you like. Phrase searching relies on
them being adjacent for terms which should be treatable as a phrase, but
the technique you suggest would work fine.

In fact, omindex and scriptindex do something similar to prevent phrases
overlapping between different fields. E.g. a title of "Hello" and first
paragraph starting "World" shouldn't match a phrase search for
"hello world". It's better to make the gap size modest to help the
compression (100 is reasonable).

Cheers,
Olly
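
The <line number>*100 + <token position> scheme discussed above can be
sketched in plain Perl. This is only an illustration under stated
assumptions: encode_position, decode_position and positions_for_text are
hypothetical helper names (not part of Search::Xapian), the stride of 100
assumes no line has 100 or more tokens, and the tokeniser is a naive
split on non-word characters. The resulting numbers are what one would
pass as the position argument to add_posting().

```perl
use strict;
use warnings;

# Assumed gap between lines; must exceed the longest line's token count.
use constant STRIDE => 100;

# Hypothetical helper: map (line number, token index) to one position value.
sub encode_position {
    my ($line, $token) = @_;
    die "token index $token does not fit in a stride of " . STRIDE . "\n"
        if $token < 0 || $token >= STRIDE;
    return $line * STRIDE + $token;
}

# Hypothetical helper: recover (line number, token index) from a position.
sub decode_position {
    my ($pos) = @_;
    return (int($pos / STRIDE), $pos % STRIDE);
}

# Naive tokeniser: lower-case, split on non-word characters, number lines
# from 1. Returns a list of [term, position] pairs.
sub positions_for_text {
    my ($text) = @_;
    my @postings;
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        my $tok = 0;
        for my $word (grep { length } split /\W+/, lc $line) {
            push @postings, [ $word, encode_position($line_no, $tok++) ];
        }
    }
    return @postings;
}

# "Hello" ends line 1 and "World" starts line 2, so their positions
# (100 and 200) are not adjacent.
printf "%-6s %d\n", @$_ for positions_for_text("Hello\nWorld says hi");
```

With positions assigned like this, terms on the same line stay adjacent
(200, 201, 202), while the last term of one line and the first term of
the next are separated by a gap, so a phrase should not match across the
line break.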

On Wed, Jan 19, 2005 at 10:26:00PM +0100, Arne Georg Gleditsch wrote:
> Firstly, get_matching_terms_begin1 and
> get_matching_terms_begin2 seem to be switched around in Enquire.pm

Oops, so they are!

I've had a quick look, and I think the problem is that this isn't
implemented in C++ yet. When I say "this" I mean creating a
PositionIterator from the TermIterator which you get from
Enquire::get_matching_terms_begin(). That should be fixed, but it'll
take me a while to get round to it.

However, you can just create one from the database by passing the docid
and termname explicitly:

  my $db = Search::Xapian::Database->new("test");
  my $enq = $db->enquire("xapian");
  my @matches = $enq->matches(0, 100);
  foreach my $match (@matches) {
      my $terms = $enq->get_matching_terms_begin($match);
      my $pos = $db->positionlist_begin($match->get_docid(),
                                        $terms->get_termname());
  }

That's probably all that would happen internally anyway, so this
shouldn't be any less efficient.

Cheers,
Olly