On Fri, Jul 02, 2004 at 01:45:29AM -0700, Lee Johnson wrote:
> Hi,
> i have read the future of xapian thread today. One
> item in that thread that is especially interesting to
> me is adding a web spider. We all know that
> Xapian is not designed exclusively for that purpose,
> but a web spider could greatly increase the usage of
> Xapian. I'm not a programmer, but writing a web spider
> is rather simple compared to writing xapian itself.
I happen to call myself a programmer and have written some
crawlers myself. No, writing a _good_ crawler is _far_ from
simple: you need a rather error-tolerant HTML/XHTML parser,
a good HTTP library, smart tracking of ETag headers and content
hash sums, and, more and more, a rather capable ECMAScript
interpreter (for those stupid javascript links ...).
I agree, it's pretty trivial to hack a _bad_ crawler in
one of the P languages, but there are already quite a few
out there in the wild you can catch and abuse :-)
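The content-hash-sum trick above is the easy part, by the way. A rough sketch in Python (just an illustration; the URL and helper name are made up), so you only re-index a page when its bytes actually changed:

```python
import hashlib

# Remember one digest per URL; re-index only when the content changed.
seen_hashes = {}

def content_changed(url, body):
    """Return True if this fetch differs from the last stored copy."""
    digest = hashlib.sha1(body).hexdigest()
    if seen_hashes.get(url) == digest:
        return False          # identical content -- skip re-indexing
    seen_hashes[url] = digest
    return True

first = content_changed("http://example.org/", b"<html>hello</html>")
second = content_changed("http://example.org/", b"<html>hello</html>")
```

Combine this with the ETag/If-None-Match headers and you avoid most redundant fetches and re-indexing.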
> In turn,
> xapian can gain lots of users, and those users become
> familiar with xapian, use it in other areas,
> tell others about xapian, and so on.
??? Is this really how it works? People use Xapian because
they need powerful IR technology. Word of mouth is a bad
advisor in such areas. I doubt that mifluz is used more
often because of its use in htdig -- actually, i'm still
not sure anyone is using it outside htdig.
> I'm saying this because i also need a crawler for
> xapian. I have hand-picked a rather big list of URLs
> (just the URLs, not the contents) and need a crawler to
> crawl all pages beneath those URLs and put the
> contents into a db, so i can use xapian to index and
> search that db. I'm very open to suggestions.
Use Perl with the LWP library to fetch the documents,
parse them with the Perl libxml2 parser (which has a pretty
good HTML mode), use libxml2's Reader API to extract all
URLs and push them onto a stack of jobs. Use Xapian's
Perl bindings to do the actual indexing. Nothing too
hard. But: if the resources you grab aren't on your servers,
you might want to honor robots.txt, add delays to the
job queue, check for dynamic content, etc.
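If Perl isn't a must, the fetch/parse/queue loop looks roughly like this in Python (a sketch of the control flow only -- the fetch function is pluggable, and in real use it would be an HTTP GET that honors robots.txt and delays; the example site is made up):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base, value))

def crawl(seeds, fetch, limit=100):
    """Breadth-first crawl. `fetch` is a callable url -> html text
    (or None on failure); returned pages go to the indexer later."""
    queue = deque(seeds)
    seen = set()
    pages = {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        queue.extend(parser.links)
    return pages

# Tiny fake "web" instead of real HTTP, just to show the control flow:
site = {
    "http://example.org/": '<a href="/a">a</a>',
    "http://example.org/a": '<a href="/">back</a>',
}
result = crawl(["http://example.org/"], site.get)
```

The `seen` set is what keeps you out of link loops; the `limit` is a crude politeness cap. Feeding each entry of `pages` to Xapian is then the easy part.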
> I looked
> at nutch, heritrix and larbin (this one probably just
> fetches the URLs, not the contents; i asked the
> developer about this but no answer yet), but with those
> i cannot use xapian (if i use one of them then i will
> probably use mnogosearch). Another thing with nutch and
> heritrix is that they are written in java, which, imho,
> is not a good idea.
If it does the job ...
> Also for those interested, a good read may be
> http://acmqueue.com/modules.php?name=Content&pa=list_pages_issues&issue_id=12
> which devoted that month's issue to the topic of search.
>
> Regards
>
Just my $0.02,
Ralf Mattes
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss@lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss