Peter Masiar
2006-Mar-29 18:43 UTC
[Xapian-discuss] htdig with omega for multiple URLs (websites)
Olly, many thanks for suggesting htdig, you saved me a lot of time.
Htdig looks better than my original idea (wget); you were right.

Using htdig, I can crawl and search a single website, but I need to
integrate search of pages spread over 100+ sites. Learning, learning....

Htdig uses a separate document database for every website (one database
per URL used to start crawling). Htdig can also merge the resulting
databases to allow searching the integrated results.

If you still have around the script you said you wrote to use htdig as a
crawler front-end for omega, I would be really interested to see it.

My htdig crawls a single site. I need to learn how to crawl multiple sites
and merge the results. Do you recall your htdig2omega script handling this
merging? Or did you search one htdig-crawled database? Or can I merge
using htdig and then search using omega?

Thanks for any insight into which way to start looking.

Also, if anyone on the list has experience with using htdig to crawl
multiple websites, I would really appreciate insight or sample scripts.

My current approach would be:
1) generate 100+ config files (one per URL), creating 100+ databases
2) generate a script to merge the results.

Is there a better way?

--
Peter Masiar, Yale Center for Medical Informatics
A: Because it messes up the flow of reading.
Q: Why is top-posting often frowned upon?
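Step 1 of the approach described in the message above (one config file and one database directory per start URL) could be sketched roughly as follows. This is only an illustration, not anything posted to the list: the function name gen_configs, the URL list file, and the site.N.conf / index.N naming are all assumptions. The only htdig configuration attributes used are start_url and database_dir.

```shell
#!/bin/sh
# Hypothetical helper: gen_configs URL_LIST CONF_DIR DB_ROOT
# Reads one start URL per line from URL_LIST and writes one minimal
# htdig config file per URL, each pointing at its own database directory.
# File and directory naming (site.N.conf, index.N) is an assumption.
gen_configs() {
    url_list=$1
    conf_dir=$2
    db_root=$3
    n=0
    mkdir -p "$conf_dir"
    # "|| [ -n "$url" ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r url || [ -n "$url" ] ; do
        [ -n "$url" ] || continue          # skip blank lines
        n=$((n + 1))
        mkdir -p "$db_root/index.$n"
        {
            printf 'start_url:\t%s\n' "$url"
            printf 'database_dir:\t%s\n' "$db_root/index.$n"
        } > "$conf_dir/site.$n.conf"
    done < "$url_list"
}

# Example call (paths are placeholders):
#   gen_configs urls.txt /etc/htdig/sites /var/htdig
```

Each generated file could then be handed to htdig through its -c option. A real setup would probably need further attributes in every file, which this sketch leaves out.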
Olly Betts
2006-Mar-30 14:28 UTC
[Xapian-discuss] htdig with omega for multiple URLs (websites)
On Wed, Mar 29, 2006 at 12:41:31PM -0500, Peter Masiar wrote:
> If you still have around the script you said you wrote to use htdig as
> crawler front-end for omega, I would be really interested to see it.

The script is htdig2omega, which is distributed with omega.

> My htdig crawls single site. I need to learn how to crawl multiple sites
> and merge results. Do you recall your htdig2omega script handling this
> merging? Or you searched one htdig-crawled database? Or can I merge
> using htdig and then search using omega?

I don't know about merging with htdig. You could convert each htdig
database to a single Xapian database, and then search them together.

Or you could combine them at index time. Assuming your htdig indexes are
/var/htdig/index.N and you want the Xapian index to be /var/xapian/default,
then this Bourne shell snippet should do the job:

    for htdigdir in /var/htdig/index.* ; do
        htdig2omega "$htdigdir"|scriptindex /var/xapian/default htdig2omega.script
    done

Cheers,
    Olly
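For comparison, the first option mentioned above (one Xapian database per htdig index, searched together afterwards) might look something like the sketch below. It reuses the htdig2omega and scriptindex invocations exactly as they appear in the snippet above. The function name index_each and the site.N naming of the output databases are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: index_each HTDIG_ROOT XAPIAN_ROOT
# For every HTDIG_ROOT/index.N directory, build a separate Xapian
# database at XAPIAN_ROOT/site.N (the site.N naming is an assumption).
index_each() {
    htdig_root=$1
    xapian_root=$2
    for htdigdir in "$htdig_root"/index.* ; do
        [ -d "$htdigdir" ] || continue     # glob matched nothing, or not a directory
        suffix=${htdigdir##*.}             # the N in index.N
        htdig2omega "$htdigdir" | scriptindex "$xapian_root/site.$suffix" htdig2omega.script
    done
}

# Example call, using the paths from the message above:
#   index_each /var/htdig /var/xapian
```

Keeping the databases separate means a single site can be re-crawled and re-indexed without touching the others, at the cost of having to name every database at search time. How Omega is pointed at several databases at once is not shown here and should be checked against the Omega documentation.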