On Fri, Jul 02, 2004 at 01:45:29AM -0700, Lee Johnson wrote:
> Hi,
> i have read the future of xapian thread today. One
> item in that thread that is especially interesting to
> me is adding a web spider. We all know that
> Xapian is not designed exclusively for that purpose,
> but a web spider could greatly increase the usage of
> Xapian. I'm not a programmer, but writing a web spider
> is rather simple compared to writing xapian itself.
I happen to call myself a programmer and have written some
crawlers myself. No, writing a _good_ crawler is _far_ from
simple: you need a rather error-tolerant HTML/XHTML parser,
a good HTTP library, smart tracking of ETag headers and content
hash sums, and, more and more, a rather capable ECMAScript
interpreter (for those stupid javascript links ...).
I agree, it's pretty trivial to hack a _bad_ crawler in
one of the P languages, but there are already quite a few
out there in the wild you can catch and abuse :-)
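The content-hash-sum trick above is the easy part, by the way. A rough sketch in Python (just an illustration; the URL and helper name are made up), so you only re-index a page when its bytes actually changed:

```python
import hashlib

# Remember one digest per URL; re-index only when the content changed.
seen_hashes = {}

def content_changed(url, body):
    """Return True if this fetch differs from the last stored copy."""
    digest = hashlib.sha1(body).hexdigest()
    if seen_hashes.get(url) == digest:
        return False          # identical content -- skip re-indexing
    seen_hashes[url] = digest
    return True

first = content_changed("http://example.org/", b"<html>hello</html>")
second = content_changed("http://example.org/", b"<html>hello</html>")
```

Combine this with the ETag/If-None-Match headers and you avoid most redundant fetches and re-indexing.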
> In turn,
> xapian can gain lots of users, and those users become
> familiar with xapian, use it in other areas,
> tell others about xapian, and so on.
??? Is this really how it works? People use Xapian because
they need powerful IR technology. Word of mouth is a bad
advisor in such areas. I doubt that mifluz is used more
often because of its use in htdig -- actually, i'm still
not sure anyone is using it outside htdig.
> I'm saying this because i also need a crawler for
> xapian. I have hand-picked a rather big list of URLs
> (just the URLs, not the contents) and need a crawler to
> crawl all pages beneath those URLs and put the
> contents into a db, so i can use xapian to index and
> search that db. I'm very open to suggestions.
Use Perl with the LWP library to fetch the documents,
parse them with the Perl libxml2 parser (which has a pretty
good HTML mode), use libxml2's Reader API to extract all
URLs and push them onto a stack of jobs. Use Xapian's
Perl bindings to do the actual indexing. Nothing too
hard. But: if the resources you grab aren't on your servers,
you might want to honor robots.txt, add delays to the
job queue, check for dynamic content, etc.
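If Perl isn't a must, the fetch/parse/queue loop looks roughly like this in Python (a sketch of the control flow only -- the fetch function is pluggable, and in real use it would be an HTTP GET that honors robots.txt and delays; the example site is made up):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base, value))

def crawl(seeds, fetch, limit=100):
    """Breadth-first crawl. `fetch` is a callable url -> html text
    (or None on failure); returned pages go to the indexer later."""
    queue = deque(seeds)
    seen = set()
    pages = {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        queue.extend(parser.links)
    return pages

# Tiny fake "web" instead of real HTTP, just to show the control flow:
site = {
    "http://example.org/": '<a href="/a">a</a>',
    "http://example.org/a": '<a href="/">back</a>',
}
result = crawl(["http://example.org/"], site.get)
```

The `seen` set is what keeps you out of link loops; the `limit` is a crude politeness cap. Feeding each entry of `pages` to Xapian is then the easy part.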
> I looked
> at nutch, heritrix and larbin (this one probably just
> fetches the URLs, not the contents; i asked the
> developer about this but no answer yet), but with those
> i cannot use xapian (if i use one of them then i will
> probably use mnogosearch). Another thing with nutch and
> heritrix is that they are written in java, which, imho,
> is not a good idea.
If it does the job ...
> Also for those interested, a good read may be
> http://acmqueue.com/modules.php?name=Content&pa=list_pages_issues&issue_id=12
> which devoted that month's issue to the topic of search.
>
> Regards
>
Just my $0.02,
Ralf Mattes
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss@lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss