Kevin SoftDev
2006-Mar-08 17:44 UTC
[Xapian-discuss] Perl example: parse terms, search, get total, get result, parse result
Hi,

Thank you all for emailing some answers to my question. I put together a simple Perl script so we do not keep asking the same thing over. As you can see, I had to parse the document data to find the title, body and url. If someone knows of an as-yet undocumented way to retrieve a specific document attribute (title, body, url) directly, let me know.

#--------------------------------------------------------- begin of the script -----------------------------------------------#
my $db = Search::Xapian::Database->new( '/europa' );
my $qp = Search::Xapian::QueryParser->new();
my $enq = $db->enquire($qp->parse_query($terms));
my $total = $db->get_termfreq($terms);

printf "Searching for: '%s' ", $terms;
print "Total matches found" . $total;

#--- display only range of documents for pagination ----#
my @matches = $enq->matches($start, $end);
my($doc,$html,$body,$title,$url);
foreach my $match ( @matches ) {
    $doc = $match->get_document();
    $html = $doc->get_data();

    $html =~ m/body=(.*)/; $body = $1;
    $html =~ m/title=(.*)/; $title = $1;
    $html =~ m/url=(.*)/; $url = $1;

    # A literal percent sign inside a printf format must be written as %%
    printf "<table border=0 width=95%%><tr><td><font size=2 face=Verdana>Relevance: %s%% ", $match->get_percent();
    print "<a href=\"$url\" target=_blank><b>$title</b><BR><i>$url</i></a><BR>$body";
    print "</font></td></tr></table><P>";
}
#--------------------------------------------------------- end of the script -----------------------------------------------#
Olly Betts
2006-Mar-09 00:38 UTC
[Xapian-discuss] Perl example: parse terms, search, get total, get result, parse result
On Wed, Mar 08, 2006 at 09:43:53AM -0800, Kevin SoftDev wrote:
> my $total = $db->get_termfreq($terms);

This looks up the frequency of a single term, so it'll be fine for a one term query, but will return zero for anything more complicated (unless you happen to have terms with spaces, etc in). As I explained just now, you want MSet::get_matches_estimated().

> $html = $doc->get_data();
>
> $html =~ m/body=(.*)/; $body = $1;

That's kind of risky - you only want to match body at the start of a line, but this doesn't specify that, so it'll match wrongly if there's an earlier line containing "body=" anywhere in it. I suggest:

    my ($body) = $html =~ m/^body=(.*)/m;

> print "<a href=\"$url\"
> target=_blank><b>$title</b><BR><i>$url</i></a><BR>$body";

You really want to be escaping values put into HTML output, unless you've carefully sanitised them at indexing time. Otherwise you're opening yourself to cross-site scripting type exploits.

Cheers,
    Olly
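
The last two suggestions (line-anchored matching and escaping before output) can be sketched together in plain Perl. This is only an illustration under a few assumptions: the document data is taken to hold one "name=value" field per line, the helper names parse_fields and escape_html are made up for this example and are not part of Search::Xapian, and the sample data is invented. In a real script the data would come from $doc->get_data().

```perl
use strict;
use warnings;

# Hypothetical helper: pull "name=value" fields out of the document data.
# The ^ anchor together with /m only matches at the start of a line, so a
# "title=" buried inside another field's text is not picked up by mistake.
sub parse_fields {
    my ($data) = @_;
    my %field;
    for my $name (qw(url title body)) {
        my ($value) = $data =~ m/^\Q$name\E=(.*)/m;
        $field{$name} = defined $value ? $value : '';
    }
    return %field;
}

# Hypothetical helper: escape the characters that are special in HTML text
# and in quoted attribute values. A single pass avoids double escaping.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Invented sample data; note that the body line contains "title=" mid-line.
my $data = "url=http://example.com/?a=1&b=2\n"
         . "body=See title=old for details\n"
         . "title=Fish <b>&</b> chips\n";

my %f = parse_fields($data);

# Every value is escaped on its way into the HTML output.
printf "<a href=\"%s\"><b>%s</b></a><br>%s\n",
    map { escape_html($f{$_}) } qw(url title body);
```

Escaping at output time rather than at indexing time keeps the stored data unchanged, at the cost of having to remember the call wherever a value is printed. If HTML::Entities from CPAN is available, its encode_entities function could stand in for the hand-written escape_html.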