similar to: omega crawler: ht://dig or wget?

Displaying 20 results from an estimated 2000 matches similar to: "omega crawler: ht://dig or wget?"

2006 Mar 29
1
htdig with omega for multiple URLs (websites)
Olly, many thanks for suggesting htdig - you saved me a lot of time. Htdig looks better than my original idea, wget; you were right. Using htdig I can crawl and search a single website, but I need to integrate search across pages spread over 100+ sites. Learning, learning.... Htdig uses a separate document database for every website (one database per start URL). Htdig also can merge
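A sketch of the per-site-then-merge workflow being described, as a Ruby driver script. The per-site config files and especially htmerge's -m merge option are assumptions based on the ht://Dig 3.x documentation; verify the exact flags against your installed version.

    # Hypothetical: one ht://Dig config file per site to crawl.
    sites = %w[site1.conf site2.conf site3.conf]

    # Crawl and index each site into its own database set.
    sites.each do |conf|
      system('htdig', '-i', '-c', conf) or abort "htdig failed for #{conf}"
      system('htmerge', '-c', conf)     or abort "htmerge failed for #{conf}"
    end

    # Merge the remaining databases into the first site's database
    # (htmerge -m is an assumption; check your ht://Dig docs).
    sites.drop(1).each do |conf|
      system('htmerge', '-c', sites.first, '-m', conf)
    end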
2006 May 26
1
Unicode troubles
Hi, I've tried to follow all the helpful tips I've found in the mailing list, and I've applied these two UTF-8 patches: http://article.gmane.org/gmane.comp.search.xapian.general/2324 http://article.gmane.org/gmane.comp.search.xapian.general/1927 Now the QueryParser works as I want it to and creates the terms correctly. But sadly I can't find any documents. If I do this: $ quest
2011 Mar 03
6
Developing a web crawler
Hi, I wish to develop a web crawler in R. I have been using the functionality available in the RCurl package. I am able to extract the HTML content of the site, but I don't know how to go about analyzing the HTML-formatted document. I wish to know the frequency of a word in the document. I am only acquainted with analyzing data sets. So how should I go about analyzing data that is not
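The underlying task is "fetch a page, strip the markup, tally words". A minimal sketch of that approach (the poster was working in R with RCurl; the same steps map onto gsub() and table() there - shown here in Ruby, with a placeholder URL and word):

    require 'net/http'
    require 'uri'

    # Fetch a page, strip markup, and count word frequencies.
    html = Net::HTTP.get(URI('http://www.example.com/'))   # placeholder URL
    text = html.gsub(%r{<script.*?</script>}m, ' ')        # drop script bodies
               .gsub(/<[^>]+>/, ' ')                       # drop remaining tags

    freq = Hash.new(0)
    text.downcase.scan(/[a-z']+/) { |w| freq[w] += 1 }

    puts freq['crawler']                                   # one word's frequency
    puts freq.sort_by { |_, n| -n }.first(10).inspect      # ten most frequent words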
2010 Oct 03
1
[LLVMdev] Tutorial: Building a stack crawler in LLVM
As promised, here is a document describing how to build a stack crawler using the garbage collection features of LLVM. https://docs.google.com/document/pub?id=1-ws0KYo47S0CgqpwkjfWDBJ8wFhW_0UYKxPIJ0TyKrQ I'm interested in any feedback, particularly on: - Explanations that aren't clear. - Spelling errors. - Technical errors. - Suggestions for ways in which things could be
2006 Mar 25
1
RDig - ferret-based website crawler/indexer
Hi! RDig is a small tool to build a Ferret index for the contents of a website or intranet. It contains a simple HTTP crawler and some support for extracting textual content from the fetched pages. I built this to implement a site-wide search for a recent project that combined a Rails application with lots of static HTML files generated by a CMS. Any feedback is very welcome! Rubyforge
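For anyone evaluating RDig, a configuration sketch. The option names (crawler.start_urls, crawler.include_hosts, index.path, verbose) and the rdig -c invocation appear in another post in these results; the RDig.configuration block form and all hosts/paths here are assumptions:

    # rdig_config.rb - hypothetical hosts and paths
    RDig.configuration do |cfg|
      cfg.crawler.start_urls    = ['http://www.example.com/']
      cfg.crawler.include_hosts = ['www.example.com']
      cfg.index.path            = '/var/index/example'
      cfg.verbose               = true
    end

    # Then build the index with: rdig -c rdig_config.rb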
2012 Jun 01
4
Is there an FTP crawler in Ruby on Rails?
Hi, I'm a newbie to RoR. I wanted to write some code which can help me list and then index all the paths on a remote server. Is there an FTP server crawler in Ruby? Thanks, Narayana
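Not in Rails itself, but Ruby's standard library includes Net::FTP, which is enough for a recursive path lister. A minimal sketch; the host and credentials are placeholders, and the chdir-to-detect-directories trick is a heuristic that some servers may not honor:

    require 'net/ftp'

    # Recursively collect file paths from an FTP server.
    def crawl(ftp, dir, paths = [])
      ftp.chdir(dir)
      ftp.nlst.each do |entry|
        next if entry == '.' || entry == '..'
        full = File.join(dir, entry)
        begin
          ftp.chdir(full)          # succeeds only for directories
          crawl(ftp, full, paths)
        rescue Net::FTPPermError
          paths << full            # not a directory, so record it as a file
        end
      end
      paths
    end

    Net::FTP.open('ftp.example.com') do |ftp|    # placeholder host
      ftp.login('anonymous', 'me@example.com')
      puts crawl(ftp, '/')
    end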
2004 Nov 22
3
Test builds for CYGWIN and IRIX?
I'm starting to prepare the next release. Since 0.8.3 I've made a number of changes to get builds working on HPUX and OSF, and made some of the Windows-specific bits more robust. I'd like to check that these haven't broken the CYGWIN or IRIX builds, but I don't have access to these platforms. If you are able to test, it'd be most appreciated if you could. Download a
2007 Feb 08
1
Getting custom field data from the page through crawling
Now on to my next question... I've got the search and indexing working well for now. My next quest is to implement a system of creating custom fields in the index. Our site is fully dynamic. That is, every page is generated in PHP, and there are enough different kinds of pages that I wouldn't want to get into the business of indexing the DB directly, so I think that using htdig to crawl
2013 Feb 26
4
Sieve filters on folders, different from INBOX
Hi all, Is it possible to configure Dovecot's sieve plugin to act on message arrival in folders other than INBOX? I wish to move messages fetched by the POP3 fetcher to a special folder, or sort outgoing mail into folders specific to their recipients. Thanks in advance, WBR, valery
2009 Sep 13
0
regexp_crawler -- a crawler which uses regular expressions to capture data from websites
RegexpCrawler is a crawler which uses regular expressions to capture data from websites. It is easy to use and needs little code if you are familiar with regular expressions. The project site is: http://github.com/flyerhzm/regexp_crawler/tree I give an example: a script to synchronize your github projects except fork projects; please check example/github_projects.rb require 'rubygems'
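To illustrate the general technique (plain Ruby with a named-capture regexp, not the gem's actual API):

    require 'net/http'
    require 'uri'

    # Pull link text and targets out of a page with one regexp.
    html    = Net::HTTP.get(URI('http://github.com/flyerhzm'))   # placeholder target
    pattern = %r{<a[^>]*href="(?<href>[^"]+)"[^>]*>(?<text>[^<]+)</a>}

    html.scan(pattern) do
      m = Regexp.last_match
      puts "#{m[:text]} -> #{m[:href]}"
    end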
2004 Nov 30
5
RE: [Shorewall-devel] SFTP
On Tue, 2004-11-30 at 12:17 +0700, Matthew Hodgett wrote: > > As for the 169.254 issue I tried to search the archives but got nothing. > I then tried to search on generic words, nothing. I then tried some > really common words like 'help', 'initiated', 'masq' - nothing. I think > the index might be corrupt because I get no
2006 Apr 16
4
Preventing crawlers on link_to's
My understanding was that using :post => true on a link_to() was supposed to prevent search engine crawlers from triggering the link. However, this does not seem to be working for me. Is there something else that I should be/can be doing to accomplish this? Thanks. -Matt
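The catch is that :post => true only wraps the link in JavaScript that builds and submits a form; a crawler that ignores JavaScript still sees an ordinary href and follows it with a plain GET. Rendering a real form avoids this, since well-behaved crawlers do not submit forms. A sketch for that Rails era (the action and instance variable are hypothetical):

    <%# button_to renders an actual <form method="post">, which a
        GET-only crawler cannot trigger. %>
    <%= button_to "Delete", :action => "destroy", :id => @item %>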
2008 Jul 18
0
Web crawler - spider and Amazon Web Servces (AWS)
I need to create a web crawler, and the closest thing to a tutorial I've found so far is this article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1182 which I think I found via RubyInside or some blog. It uses some of the Amazon Web Services, mainly SQS, but this would be my first time outsourcing a process to a third party, and I would like to know if someone in the
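The SQS half of such a crawler is a simple pull-fetch-ack loop. A sketch using the current aws-sdk-sqs gem (the Ruby bindings available in 2008 differed; region, queue name, and output directory are placeholders):

    require 'aws-sdk-sqs'
    require 'net/http'
    require 'uri'
    require 'digest'
    require 'fileutils'

    sqs       = Aws::SQS::Client.new(region: 'us-east-1')    # placeholder region
    queue_url = sqs.get_queue_url(queue_name: 'crawl-frontier').queue_url

    FileUtils.mkdir_p('pages')
    loop do
      resp = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 1,
                                 wait_time_seconds: 20)       # long poll
      break if resp.messages.empty?
      msg  = resp.messages.first
      body = Net::HTTP.get(URI(msg.body))                     # each message body is a URL
      File.write(File.join('pages', Digest::MD5.hexdigest(msg.body)), body)
      sqs.delete_message(queue_url: queue_url,                # ack only after saving
                         receipt_handle: msg.receipt_handle)
    end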
2010 Oct 26
2
Opensource Websearch Engine Project
Hi, I'm Pierre-Louis Dehapiot from Paris, France. I am studying computer programming at the ECE (a French school), and this year the topic of my project is "google and indexing". To summarize, it deals with creating my own Google in only one year :p ! I saw that you made yourselves an open-source web search engine written in C++ (Xapian). I already made the PHP/CSS interface for my own
2009 Apr 20
2
Dovecot -> Gmail (via POP Mail Fetcher)
I'm trying to move my entire email store from my Dovecot installation (which I normally access via IMAP) into Gmail using Gmail's Mail Fetcher (which functions over POP); and I'm running into two problems: 1. Gmail only imported 78 out of 1000+ mails in my inbox, which I'm taking to mean that Dovecot is reporting only those 78 as new. How can I get Dovecot to send all mail over as
2006 Jul 25
1
RDig document processing error
Hi all, I am having problems using RDig. With this rdig config... cfg.crawler.start_urls = ['http://www.defensetech.org'] cfg.crawler.include_hosts = ['www.defensetech.org'] cfg.index.path = '/my/path/to/index' cfg.verbose = true ...I get this output: $ rdig -c config/rdig_config.rb /usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45:
2009 Nov 14
1
Filmon HDI Player
Hello, I wonder if anyone can help here - perhaps a winetrick (I tried DIVX - it doesn't help), plus there are slight problems with Ubuntu 9.10, as some other Linux programs have. I have tried to install an MS Windows program to watch TV, films etc. over the Internet from filmon.com. The program gives a "network problem" message after trying to authenticate the log-in name and password. There is no
2008 Sep 07
2
keep rsync from removing unfinished source files?
I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run: $ rsync --remove-source-files speed:/var/crawldir . but I worry that rsync will unlink a source file that hasn't
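rsync has no built-in "skip files still being written", so one conservative workaround is to use mtime age as a proxy for "finished" and feed only settled files to --files-from. A sketch run on speed, pushing to mass; the source directory, destination, and age threshold are guesses to tune:

    # Only hand rsync files whose mtime is old enough that the
    # crawler has presumably finished writing them.
    SRC  = '/var/crawldir'
    DEST = 'mass:/var/crawldir/'
    AGE  = 600  # seconds

    settled = Dir.glob("#{SRC}/**/*")
                 .select { |f| File.file?(f) && Time.now - File.mtime(f) > AGE }
                 .map    { |f| f.sub("#{SRC}/", '') }  # --files-from paths are relative to SRC

    IO.popen(['rsync', '--remove-source-files', '--files-from=-', SRC, DEST], 'w') do |io|
      settled.each { |line| io.puts line }
    end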
2010 Sep 28
4
Mailman - searchable archive
Mailman works well for our mailing lists, but the archive is unacceptable - the worst thing is the lack of a search function. I got one tip for this: 1) convert the emails to HTML format with MHonArc; 2) searching can then be done with htdig. Opinions? Maybe there are better software solutions for this - I hope. - Jussi
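The two steps wired together as a sketch (the Mailman mbox path, output directory, and list name are hypothetical; mhonarc's -outdir option is real, and rundig is the wrapper script ht://Dig ships for the crawl-and-index cycle, assuming its config points start_url at the generated pages):

    MBOX = '/var/lib/mailman/archives/private/mylist.mbox/mylist.mbox'  # hypothetical
    HTML = '/var/www/lists/mylist'                                      # hypothetical

    # 1) Render the mbox archive as browsable HTML pages.
    system('mhonarc', '-outdir', HTML, MBOX) or abort 'mhonarc failed'

    # 2) Crawl and index those pages with ht://Dig.
    system('rundig') or abort 'rundig failed'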