similar to: omega crawler: ht://dig or wget?

Displaying 20 results from an estimated 2000 matches similar to: "omega crawler: ht://dig or wget?"

2006 Mar 29
1
htdig with omega for multiple URLs (websites)
Olly, many thanks for suggesting htdig - you saved me a lot of time. Htdig looks better than my original idea, wget; you were right. Using htdig I can crawl and search a single website, but I need to integrate search across pages spread over 100+ sites. Learning, learning.... Htdig uses a separate document database for every website (one database per start URL). Htdig also can merge
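A sketch of the per-site-then-merge workflow being described, as a Ruby driver script. The per-site config files and especially htmerge's -m merge option are assumptions based on the ht://Dig 3.x documentation; verify the exact flags against your installed version.

    # Hypothetical: one ht://Dig config file per site to crawl.
    sites = %w[site1.conf site2.conf site3.conf]

    # Crawl and index each site into its own database set.
    sites.each do |conf|
      system('htdig', '-i', '-c', conf) or abort "htdig failed for #{conf}"
      system('htmerge', '-c', conf)     or abort "htmerge failed for #{conf}"
    end

    # Merge the remaining databases into the first site's database
    # (htmerge -m is an assumption; check your ht://Dig docs).
    sites.drop(1).each do |conf|
      system('htmerge', '-c', sites.first, '-m', conf)
    end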
2006 May 26
1
Unicode troubles
Hi, I've tried to follow all the helpful tips I've found in the mailing list, and I've applied these two UTF-8 patches: http://article.gmane.org/gmane.comp.search.xapian.general/2324 http://article.gmane.org/gmane.comp.search.xapian.general/1927 Now the QueryParser works as I want it to and creates the terms correctly. But sadly I can't find any documents. If I do this: $ quest
2011 Mar 03
6
Developing a web crawler
Hi, I wish to develop a web crawler in R. I have been using the functionality available in the RCurl package. I am able to extract the HTML content of the site, but I don't know how to go about analyzing the HTML-formatted document. I wish to know the frequency of a word in the document. I am only acquainted with analyzing data sets. So how should I go about analyzing data that is not
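The underlying task is "fetch a page, strip the markup, tally words". A minimal sketch of that approach (the poster was working in R with RCurl; the same steps map onto gsub() and table() there - shown here in Ruby, with a placeholder URL and word):

    require 'net/http'
    require 'uri'

    # Fetch a page, strip markup, and count word frequencies.
    html = Net::HTTP.get(URI('http://www.example.com/'))   # placeholder URL
    text = html.gsub(%r{<script.*?</script>}m, ' ')        # drop script bodies
               .gsub(/<[^>]+>/, ' ')                       # drop remaining tags

    freq = Hash.new(0)
    text.downcase.scan(/[a-z']+/) { |w| freq[w] += 1 }

    puts freq['crawler']                                   # one word's frequency
    puts freq.sort_by { |_, n| -n }.first(10).inspect      # ten most frequent words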
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler using the garbage collection features of LLVM. https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ I'm interested in any feedback, particularly on: - Explanations that aren't clear. - Spelling errors. - Technical errors. - Suggestions for ways in which things could be
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi! RDig is a small tool to build a Ferret index for the contents of a website or intranet. It contains a simple HTTP crawler and some support for extracting textual content from the fetched pages. I built this to implement a site-wide search for a recent project that combined a Rails application with lots of static HTML files generated by a CMS. Any feedback is very welcome! Rubyforge
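For anyone evaluating RDig, a configuration sketch. The option names (crawler.start_urls, crawler.include_hosts, index.path, verbose) and the rdig -c invocation appear in another post in these results; the RDig.configuration block form and all hosts/paths here are assumptions:

    # rdig_config.rb - hypothetical hosts and paths
    RDig.configuration do |cfg|
      cfg.crawler.start_urls    = ['http://www.example.com/']
      cfg.crawler.include_hosts = ['www.example.com']
      cfg.index.path            = '/var/index/example'
      cfg.verbose               = true
    end

    # Then build the index with: rdig -c rdig_config.rb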
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi, I'm a newbie to RoR. I wanted to write some code which can help me list and then index all the paths on a remote server. Is there an FTP server crawler in Ruby? Thanks, Narayana
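Not in Rails itself, but Ruby's standard library includes Net::FTP, which is enough for a recursive path lister. A minimal sketch; the host and credentials are placeholders, and the chdir-to-detect-directories trick is a heuristic that some servers may not honor:

    require 'net/ftp'

    # Recursively collect file paths from an FTP server.
    def crawl(ftp, dir, paths = [])
      ftp.chdir(dir)
      ftp.nlst.each do |entry|
        next if entry == '.' || entry == '..'
        full = File.join(dir, entry)
        begin
          ftp.chdir(full)          # succeeds only for directories
          crawl(ftp, full, paths)
        rescue Net::FTPPermError
          paths << full            # not a directory, so record it as a file
        end
      end
      paths
    end

    Net::FTP.open('ftp.example.com') do |ftp|    # placeholder host
      ftp.login('anonymous', 'me@example.com')
      puts crawl(ftp, '/')
    end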
2004 Nov 22
3
Test builds for CYGWIN and IRIX?
I'm starting to prepare the next release. Since 0.8.3 I've made a number of changes to get builds working on HPUX and OSF, and made some of the Windows-specific bits more robust. I'd like to check that these haven't broken the CYGWIN or IRIX builds, but I don't have access to these platforms. If you are able to test, it'd be most appreciated if you could. Download a
2007 Feb 08
1
Getting custom field data from the page through crawling
Now on to my next question... I've got the search and indexing working well for now. My next quest is to implement a system of creating custom fields in the index. Our site is fully dynamic. That is, every page is generated in PHP, and there are enough different kinds of pages that I wouldn't want to get into the business of indexing the DB directly, so I think that using htdig to crawl
2013 Feb 26
4
Sieve filters on folders, different from INBOX
Hi all, Is it possible to configure Dovecot's sieve plugin to act on message arrival in folders other than INBOX? I wish to move messages fetched by the POP3 fetcher to a special folder, or sort outgoing mail into folders specific to their recipients. Thanks in advance, WBR, valery
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to capture data from websites
RegexpCrawler is a crawler which uses regular expressions to capture data from websites. It is easy to use and needs little code if you are familiar with regular expressions. The project site is: http://github.com/flyerhzm/regexp_crawler/tree I give an example: a script to synchronize your github projects except fork projects; please check example/github_projects.rb require 'rubygems'
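To illustrate the general technique (plain Ruby with a named-capture regexp, not the gem's actual API):

    require 'net/http'
    require 'uri'

    # Pull link text and targets out of a page with one regexp.
    html    = Net::HTTP.get(URI('http://github.com/flyerhzm'))   # placeholder target
    pattern = %r{<a[^>]*href="(?<href>[^"]+)"[^>]*>(?<text>[^<]+)</a>}

    html.scan(pattern) do
      m = Regexp.last_match
      puts "#{m[:text]} -> #{m[:href]}"
    end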
2004 Nov 30
5
RE: [Shorewall-devel] SFTP
On Tue, 2004-11-30 at 12:17 +0700, Matthew Hodgett wrote: > > As for the 169.254 issue I tried to search the archives but got nothing. > I then tried to search on generic words, nothing. I then tried some > really common words like 'help', 'initiated', 'masq' - nothing. I think > the index might be corrupt because I get no
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using :post => true on a link_to() was supposed to prevent search engine crawlers from triggering the link. However, this does not seem to be working for me. Is there something else that I should be/can be doing to accomplish this? Thanks. -Matt
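The catch is that :post => true only wraps the link in JavaScript that builds and submits a form; a crawler that ignores JavaScript still sees an ordinary href and follows it with a plain GET. Rendering a real form avoids this, since well-behaved crawlers do not submit forms. A sketch for that Rails era (the action and instance variable are hypothetical):

    <%# button_to renders an actual <form method="post">, which a
        GET-only crawler cannot trigger. %>
    <%= button_to "Delete", :action => "destroy", :id => @item %>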
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler, and the closest thing to a tutorial I've found so far is this article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182 which I think I found via RubyInside or some blog. It uses some of the Amazon Web Services, mainly SQS, but this would be my first time outsourcing a process to a third party, and I would like to know if someone in the
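The SQS half of such a crawler is a simple pull-fetch-ack loop. A sketch using the current aws-sdk-sqs gem (the Ruby bindings available in 2008 differed; region, queue name, and output directory are placeholders):

    require 'aws-sdk-sqs'
    require 'net/http'
    require 'uri'
    require 'digest'
    require 'fileutils'

    sqs       = Aws::SQS::Client.new(region: 'us-east-1')    # placeholder region
    queue_url = sqs.get_queue_url(queue_name: 'crawl-frontier').queue_url

    FileUtils.mkdir_p('pages')
    loop do
      resp = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 1,
                                 wait_time_seconds: 20)       # long poll
      break if resp.messages.empty?
      msg  = resp.messages.first
      body = Net::HTTP.get(URI(msg.body))                     # each message body is a URL
      File.write(File.join('pages', Digest::MD5.hexdigest(msg.body)), body)
      sqs.delete_message(queue_url: queue_url,                # ack only after saving
                         receipt_handle: msg.receipt_handle)
    end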
2010 Oct 26
2
Opensource Websearch Engine Project
Hi, I'm Pierre-Louis Dehapiot from Paris, France. I am studying computer programming at the ECE (a French school), and this year the topic of my project is "google and indexing". To summarize, it deals with creating my own Google in only one year :p ! I saw that you made yourselves an open-source web search engine written in C++ (Xapian). I already made the PHP/CSS interface for my own
2009 Apr 20
2
Dovecot -> Gmail (via POP Mail Fetcher)
I'm trying to move my entire email store from my Dovecot installation (which I normally access via IMAP) into Gmail using Gmail's Mail Fetcher (which functions over POP); and I'm running into two problems: 1. Gmail only imported 78 out of 1000+ mails in my inbox, which I'm taking to mean that Dovecot is reporting only those 78 as new. How can I get Dovecot to send all mail over as
2006 Jul 25
1
RDig document processing error
Hi all, I am having problems using RDig. With this rdig config... cfg.crawler.start_urls = ['http://www.defensetech.org'] cfg.crawler.include_hosts = ['www.defensetech.org'] cfg.index.path = '/my/path/to/index' cfg.verbose = true ...I get this output: $ rdig -c config/rdig_config.rb /usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45:
2009 Nov 14
1
Filmon HDI Player
Hello, I wonder if anyone can help here - perhaps a winetrick (I tried DIVX - it doesn't help), plus there are slight problems with Ubuntu 9.10, as some other Linux programs have. I have tried to install an MS Windows program to watch TV, films etc. over the Internet from filmon.com. The program gives a "network problem" message after trying to authenticate the log-in name and password. There is no
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run: $ rsync --remove-source-files speed:/var/crawldir . but I worry that rsync will unlink a source file that hasn't
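rsync has no built-in "skip files still being written", so one conservative workaround is to use mtime age as a proxy for "finished" and feed only settled files to --files-from. A sketch run on speed, pushing to mass; the source directory, destination, and age threshold are guesses to tune:

    # Only hand rsync files whose mtime is old enough that the
    # crawler has presumably finished writing them.
    SRC  = '/var/crawldir'
    DEST = 'mass:/var/crawldir/'
    AGE  = 600  # seconds

    settled = Dir.glob("#{SRC}/**/*")
                 .select { |f| File.file?(f) && Time.now - File.mtime(f) > AGE }
                 .map    { |f| f.sub("#{SRC}/", '') }  # --files-from paths are relative to SRC

    IO.popen(['rsync', '--remove-source-files', '--files-from=-', SRC, DEST], 'w') do |io|
      settled.each { |line| io.puts line }
    end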
2010 Sep 28
4
Mailman - searchable archive
Mailman works well for our mailing lists, but the archive is unacceptable - the worst thing is the lack of a search function. I got one tip for this: 1) convert the emails to HTML format with MHonArc; 2) searching can then be done with htdig. Opinions? Maybe there are better software solutions for this - I hope. - Jussi
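The two steps wired together as a sketch (the Mailman mbox path, output directory, and list name are hypothetical; mhonarc's -outdir option is real, and rundig is the wrapper script ht://Dig ships for the crawl-and-index cycle, assuming its config points start_url at the generated pages):

    MBOX = '/var/lib/mailman/archives/private/mylist.mbox/mylist.mbox'  # hypothetical
    HTML = '/var/www/lists/mylist'                                      # hypothetical

    # 1) Render the mbox archive as browsable HTML pages.
    system('mhonarc', '-outdir', HTML, MBOX) or abort 'mhonarc failed'

    # 2) Crawl and index those pages with ht://Dig.
    system('rundig') or abort 'rundig failed'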