Arne Georg Gleditsch
2005-Jan-19 21:30 UTC
[Xapian-discuss] Search::Xapian and term positions
Hi,

I'm fooling around with the Xapian engine (via the Perl modules). I'm wondering if I can get Xapian to tell me where in a document my queries match -- so, as a first approach I'm trying to look at the positionlist. It's not working for me. The following snippet:

    my $db = Search::Xapian::Database->new("test");
    my $enq = $db->enquire("xapian");
    my @matches = $enq->matches(0, 100);
    foreach my $match (@matches) {
        my $terms = $enq->get_matching_terms_begin($match);
        my $pos = $terms->positionlist_begin();
    }

bombs out. Firstly, get_matching_terms_begin1 and get_matching_terms_begin2 seem to be switched around in Enquire.pm, but even if I rectify that, things crash and burn. Under gdb:

    Program received signal SIGABRT, Aborted.
    [Switching to Thread -1209842944 (LWP 12984)]
    0xb7e8aed9 in raise () from /lib/tls/libc.so.6
    (gdb) bt
    #0  0xb7e8aed9 in raise () from /lib/tls/libc.so.6
    #1  0xb7f98fcc in ?? () from /lib/tls/libc.so.6
    #2  0xbffff7e0 in ?? ()
    #3  0xb7e8c771 in abort () from /lib/tls/libc.so.6
    [..]
    #43 0x080c30c6 in Perl_pp_entersub ()
    #44 0xb7c2cf84 in std::terminate () from /usr/lib/libstdc++.so.5
    #45 0xb7c2d0f6 in __cxa_throw () from /usr/lib/libstdc++.so.5
    #46 0xb7c93015 in Xapian::TermIterator::Internal::positionlist_begin () from /usr/lib/libxapian.so.5
    #47 0xb7d2a19d in Xapian::TermIterator::positionlist_begin () from /usr/lib/libxapian.so.5
    #48 0xb7dc8029 in XS_Search__Xapian__TermIterator_positionlist_begin () from /usr/local/lib/perl/5.8.4/auto/Search/Xapian/Xapian.so
    #49 0x080c30c6 in Perl_pp_entersub ()
    #50 0x080bbbb9 in Perl_runops_standard ()
    #51 0x080635e8 in perl_run ()
    #52 0x080633f5 in perl_run ()
    #53 0x0805fb9f in main ()

Has anyone seen this before? This is Xapian 0.8.5 and Search::Xapian 0.8.4.

Perhaps, before I walk this line further: is the positionlist going to be useful for me? Not having gotten far enough to see what it looks like, I get the impression that it is an index into the sequence of tokens that a file is parsed into, is that correct?

Can this number be manipulated when a file is indexed, and what would be the consequence of doing so? (I.e. letting it be <line number>*100 + <token position in current line> or something?)

Thanks,
Arne.
On Wed, Jan 19, 2005 at 10:26:00PM +0100, Arne Georg Gleditsch wrote:
> Has anyone seen this before? This is Xapian 0.8.5 and Search::Xapian
> 0.8.4.

I've not seen it. Ideally all the methods should have feature tests, but I've not managed to add these at quite the same rate I've been adding wrappers.

> Perhaps, before I walk this line further: is the positionlist going to
> be useful for me? Not having gotten far enough to see what it looks
> like, I get the impression that it is an index into the sequence of
> tokens that a file is parsed to, is that correct?

It returns the position values you passed to add_posting() for the specified term and document, in ascending order.

> Can this number be manipulated when a file is indexed, and what would
> be the consequence of doing so? (I.e. letting it be <line number>*100
> + <token position in current line> or something?)

Sure - you can pass whatever values you like. Phrase searching relies on them being adjacent for terms which should be treatable as a phrase, but the technique you suggest would work fine. In fact, omindex and scriptindex do something similar to prevent phrases overlapping between different fields. E.g. a title of "Hello" and first paragraph starting "World" shouldn't match a phrase search for "hello world". It's better to make the gap size modest to help the compression (100 is reasonable).

Cheers,
Olly
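To make the position scheme discussed above concrete, here is a minimal sketch in plain Perl of the "<line number>*100 + <token position in current line>" idea. It does not touch Xapian at all; the names LINE_STRIDE, postings_for_text and decode_position are hypothetical helpers invented for this illustration, and the tokenizer (splitting on \w+) is only an assumption about how a file might be parsed. In a real indexer, each (term, position) pair would presumably be handed to add_posting().

```perl
use strict;
use warnings;

# Hypothetical constant: distance between the position ranges of two
# consecutive lines. 100 follows the "modest gap" suggestion in the thread.
use constant LINE_STRIDE => 100;

# Turn a text into a list of [term, position] pairs, where
# position = line_number * LINE_STRIDE + token_index_in_line.
# Line numbers start at 1, token indexes at 0.
sub postings_for_text {
    my ($text) = @_;
    my @postings;
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        my $tok = 0;
        for my $word ($line =~ /(\w+)/g) {
            # A line with LINE_STRIDE or more tokens would spill into the
            # next line's range, so refuse it rather than index wrongly.
            die "line $line_no has too many tokens\n" if $tok >= LINE_STRIDE;
            push @postings, [lc($word), $line_no * LINE_STRIDE + $tok];
            $tok++;
        }
    }
    return @postings;
}

# Recover (line number, token index) from a stored position value.
sub decode_position {
    my ($pos) = @_;
    return (int($pos / LINE_STRIDE), $pos % LINE_STRIDE);
}

my @postings = postings_for_text("Hello\nWorld of Xapian\n");
for my $p (@postings) {
    my ($term, $pos) = @$p;
    my ($line, $tok) = decode_position($pos);
    # With a real index, this is where add_posting($term, $pos) would go.
    print "$term pos=$pos line=$line token=$tok\n";
}
```

With this scheme "hello" ends up at position 100 and "world" at 200, so the two are never adjacent and a phrase search for "hello world" would not match across the line break, while tokens within one line keep consecutive positions. Decoding a position from the positionlist then gives back the line a match occurred on.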
On Wed, Jan 19, 2005 at 10:26:00PM +0100, Arne Georg Gleditsch wrote:
> Firstly, get_matching_terms_begin1 and
> get_matching_terms_begin2 seem to be switched around in Enquire.pm

Oops, so they are!

I've had a quick look, and I think the problem is that this isn't implemented in C++ yet. When I say "this" I mean creating a PositionIterator from the TermIterator which you get from Enquire::get_matching_terms_begin(). That should be fixed, but it'll take me a while to get round to it.

However, you can just create one from the database by passing the docid and termname explicitly:

    my $db = Search::Xapian::Database->new("test");
    my $enq = $db->enquire("xapian");
    my @matches = $enq->matches(0, 100);
    foreach my $match (@matches) {
        my $terms = $enq->get_matching_terms_begin($match);
        # Ask the database for the position list directly, rather than
        # going through the TermIterator.
        my $pos = $db->positionlist_begin($match->get_docid(),
                                          $terms->get_termname());
    }

That's probably all that would happen internally anyway, so this shouldn't be any less efficient.

Cheers,
Olly