Haniff Din
2005-Feb-03 01:33 UTC
[Xapian-discuss] newbie help, group site crawl, index and search
Hello xapian-discuss,

Forgive the newbie question.

I want to crawl a subset of business sites on a regular basis, in a specific niche market (about 100-200 sites). In effect this is for a parent charity organisation's website, so that you can search all of the group's independent sites from the parent site as if they were one website. This means I always crawl and index the same group of sites, a group which is expanding slowly, so the crawler is always given the same set of URLs to crawl.

What is the best approach to crawl and build an index for searching? Are crawl and/or index tools part of this tool-set? Are they available?

Appreciate any help; as this is for a non-profit charity, it will really help people.

--
Best regards,
Haniff
mailto:mail@haniff.net
www.haniff.co.uk
Samuel Liddicott
2005-Feb-03 10:15 UTC
[Xapian-discuss] newbie help, group site crawl, index and search
Haniff Din wrote:
> What is the best approach to crawl and build an index
> for searching? Are crawl and/or index tools part
> of this tool-set? Are they available?

Something like this, perhaps (shell script stuff):

spider() {
    dir="$1"
    shift

    # Mirror the site into $dir. Cookies are kept between runs,
    # login/logout/search pages are skipped, and --html-extension
    # gives saved files a .html suffix so indexers recognise them.
    wget --progress=bar "--directory-prefix=$dir" --html-extension \
        --cookies=on --save-cookies "$dir/cookies" --load-cookies "$dir/cookies" \
        --exclude-directories=logout,login,search \
        --recursive --convert-links --mirror "$@"
}

then call:

spider site/dump-directory http://site/

Then once the files are on disk you can index them using whatever indexing tools can index flat files.

Sam
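Since the set of sites is fixed, the whole crawl can be driven from a plain list of URLs. A minimal sketch building on Sam's spider() function; sites.txt and the mirrors/ directory are example names, not part of his script:

while read -r url; do
    # Derive a per-site dump directory from the host part of the URL
    host=$(echo "$url" | sed 's|^https\?://||; s|/.*||')
    spider "mirrors/$host" "$url"
done < sites.txt

Run from cron on whatever schedule suits, this re-mirrors the same group of sites each time; wget's --mirror turns on timestamping, so only pages newer than the local copies are re-fetched.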
Jim Lynch
2005-Feb-03 13:13 UTC
[Xapian-discuss] newbie help, group site crawl, index and search
From the Xapian web page:

"The indexer supplied can index HTML, PHP, PDF, PostScript, and plain text. Adding support for indexing other formats is easy where conversion filters are available (e.g. Microsoft Word). This indexer works using the filing system, but we also provide a script to allow the htdig web crawler to be hooked in, allowing remote sites to be searched using Omega."

Jim.
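Combining the two answers: once wget has mirrored a site to disk, the indexer Jim mentions (omindex, shipped with Omega) can build the Xapian database from the flat files. A minimal sketch, assuming current omindex option names; the database and mirror paths are placeholders:

# Index one mirrored site into the Xapian database used by Omega
omindex --db /var/lib/omega/data/default \
        --url http://site/ \
        mirrors/site

The --url argument records which base URL the mirrored directory corresponds to, so search results link back to the live site rather than the local copy; with 100-200 sites, the same command would go inside the per-site loop above.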