Haniff Din
2005-Feb-03 01:33 UTC
[Xapian-discuss] newbie help, group site crawl, index and search
Hello xapian-discuss,

Forgive the newbie question.

I want to crawl a subset of business sites on a regular basis, in a specific niche market (about 100-200 sites). In effect this is for a parent charity organisation's website, so that you can search all of the group's independent sites from the parent site as if they were one website. This means I always crawl and index the same group of sites, a group which is expanding slowly, so the crawler is always given the same set of URLs to crawl.

What is the best approach to crawl and build an index for searching? Are crawl and/or index tools part of this tool-set? Are they available?

Appreciate any help; as this is for a non-profit charity, it will really help people.

--
Best regards,
Haniff
mailto:mail@haniff.net
www.haniff.co.uk
Samuel Liddicott
2005-Feb-03 10:15 UTC
[Xapian-discuss] newbie help, group site crawl, index and search
Haniff Din wrote:
> What is the best approach to crawl and build an index
> for searching? Are crawl and/or index tools part
> of this tool-set? Are they available?

Something like this, perhaps (shell script stuff):

spider() {
    dir="$1"
    shift

    # Mirror the site into $dir. Cookies are kept between runs,
    # login/logout/search pages are skipped, and --html-extension
    # gives saved files a .html suffix so indexers recognise them.
    wget --progress=bar "--directory-prefix=$dir" --html-extension \
        --cookies=on --save-cookies "$dir/cookies" --load-cookies "$dir/cookies" \
        --exclude-directories=logout,login,search \
        --recursive --convert-links --mirror "$@"
}

then call:

spider site/dump-directory http://site/

Then once the files are on disk you can index them using whatever indexing tools can index flat files.

Sam
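Since the set of sites is fixed, the whole crawl can be driven from a plain list of URLs. A minimal sketch building on Sam's spider() function; sites.txt and the mirrors/ directory are example names, not part of his script:

while read -r url; do
    # Derive a per-site dump directory from the host part of the URL
    host=$(echo "$url" | sed 's|^https\?://||; s|/.*||')
    spider "mirrors/$host" "$url"
done < sites.txt

Run from cron on whatever schedule suits, this re-mirrors the same group of sites each time; wget's --mirror turns on timestamping, so only pages newer than the local copies are re-fetched.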
Jim Lynch
2005-Feb-03 13:13 UTC
[Xapian-discuss] newbie help, group site crawl, index and search
From the Xapian web page:

"The indexer supplied can index HTML, PHP, PDF, PostScript, and plain text. Adding support for indexing other formats is easy where conversion filters are available (e.g. Microsoft Word). This indexer works using the filing system, but we also provide a script to allow the htdig web crawler to be hooked in, allowing remote sites to be searched using Omega."

Jim.
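Combining the two answers: once wget has mirrored a site to disk, the indexer Jim mentions (omindex, shipped with Omega) can build the Xapian database from the flat files. A minimal sketch, assuming current omindex option names; the database and mirror paths are placeholders:

# Index one mirrored site into the Xapian database used by Omega
omindex --db /var/lib/omega/data/default \
        --url http://site/ \
        mirrors/site

The --url argument records which base URL the mirrored directory corresponds to, so search results link back to the live site rather than the local copy; with 100-200 sites, the same command would go inside the per-site loop above.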