Kevin SoftDev
2006-Mar-08 17:44 UTC
[Xapian-discuss] Perl example: parse terms, search , get total, get result, parse result
Hi,
Thank you all for emailing answers to my question. I put together a
simple Perl script so we do not keep asking the same thing over. As you can
see, I had to parse the document data to find the title, body
and url. If someone knows of a way, not yet documented, to retrieve a
specific document attribute (title, body, url) directly, let me know.
#------------------------- begin of the script -------------------------#
use Search::Xapian;
my $db = Search::Xapian::Database->new( '/europa' );
my $qp = Search::Xapian::QueryParser->new();
my $enq = $db->enquire($qp->parse_query($terms));
my $total = $db->get_termfreq($terms);
printf "Searching for: '%s' ", $terms;
print "Total matches found" . $total;
#--- display only range of documents for pagination ----#
# matches() takes the first result to return and the maximum number of results
my @matches = $enq->matches($start, $end);
my($doc,$html,$body,$title,$url);
foreach my $match ( @matches )
{
$doc = $match->get_document();
$html = $doc->get_data();
$html =~ m/body=(.*)/; $body = $1;
$html =~ m/title=(.*)/; $title = $1;
$html =~ m/url=(.*)/; $url = $1;
printf "<table border=0 width=95%%><tr><td><font size=2 face=Verdana>Relevance: %s%% ",
$match->get_percent();
print "<a href=\"$url\"
target=_blank><b>$title</b><BR><i>$url</i></a><BR>$body";
print "</font></td></tr></table><P>";
}
#-------------------------- end of the script --------------------------#
Olly Betts
2006-Mar-09 00:38 UTC
[Xapian-discuss] Perl example: parse terms, search , get total, get result, parse result
On Wed, Mar 08, 2006 at 09:43:53AM -0800, Kevin SoftDev wrote:
> my $total = $db->get_termfreq($terms);

This looks up the frequency of a single term, so it'll be fine for a
one term query, but will return zero for anything more complicated
(unless you happen to have terms with spaces, etc in). As I explained
just now, you want MSet::get_matches_estimated().

> $html = $doc->get_data();
>
> $html =~ m/body=(.*)/; $body = $1;

That's kind of risky - you only want to match body at the start of a
line, but this doesn't specify that, so it'll match wrongly if there's
an earlier line containing "body=" anywhere in it. I suggest:

    my ($body) = $html =~ m/^body=(.*)/m;

> print "<a href=\"$url\"
> target=_blank><b>$title</b><BR><i>$url</i></a><BR>$body";

You really want to be escaping values put into HTML output, unless
you've carefully sanitised them at indexing time. Otherwise you're
opening yourself to cross-site scripting type exploits.

Cheers,
    Olly
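
A minimal sketch of the last two suggestions (line-anchored field matching and HTML escaping) might look like the following. It assumes the document data holds one name=value pair per line, as the original script implies. The helpers parse_fields and escape_html and the sample data are hypothetical, written for illustration, and are not part of Search::Xapian.

```perl
use strict;
use warnings;

# Hypothetical helper: pull url, title and body out of document data that
# stores one "name=value" pair per line (an assumption about the layout).
sub parse_fields {
    my ($data) = @_;
    my %field;
    for my $name (qw(url title body)) {
        # ^ with the /m flag matches at the start of any line, so "body="
        # appearing in the middle of another line is not picked up.
        my ($value) = $data =~ m/^\Q$name\E=(.*)/m;
        $field{$name} = defined $value ? $value : '';
    }
    return %field;
}

# Hypothetical helper: escape characters that are special in HTML text and
# in quoted attribute values.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Sample data standing in for what $doc->get_data() might return.
# Note the url line contains "body=" in the middle of the line.
my $data = "url=http://example.com/?a=1&body=2\n"
         . "title=Tom & <Jerry>\n"
         . "body=Some text\n";

my %f = parse_fields($data);
printf "<a href=\"%s\"><b>%s</b></a><br>%s\n",
    map { escape_html($f{$_}) } qw(url title body);
```

With the unanchored pattern from the original script, the sample data above would yield "2" as the body, because "body=" first appears inside the url line. Escaping only neutralises markup characters; it does not validate the URL scheme.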