flyerhzm
2009-Sep-13 12:57 UTC
regexp_crawler -- a crawler which uses regular expressions to catch data from websites
RegexpCrawler is a crawler which uses regular expressions to catch data
from websites. It is easy to use and needs little code if you are familiar
with regular expressions.
The project site is: http://github.com/flyerhzm/regexp_crawler/tree
Here is an example: a script to synchronize your github projects, except
forked projects. Please check example/github_projects.rb
  require 'rubygems'
  require 'regexp_crawler'

  crawler = RegexpCrawler::Crawler.new(
    :start_page => "http://github.com/flyerhzm",
    # links to my project pages, which the crawler should follow
    :continue_regexp => %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m,
    # capture groups, in order: title, description, body
    :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m,
    :named_captures => ['title', 'description', 'body'],
    :save_method => Proc.new do |result, page|
      puts '============================='
      puts page
      puts result[:title]
      puts result[:description]
      puts result[:body][0..100] + "..."
    end,
    # only parse project pages, and skip forked projects
    :need_parse => Proc.new do |page, response_body|
      page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?<a href=".*?">/)
    end)
  crawler.start

The results are as follows:
============================
http://github.com/flyerhzm/bullet/tree/master
bullet
A rails plugin/gem to kill N+1 queries and unused eager loading
<div class="wikistyle"><h1>Bullet</h1>
<p>The Bullet plugin/gem is designed to help you increase your...
============================
http://github.com/flyerhzm/regexp_crawler/tree/master
regexp_crawler
A crawler which use regular expression to catch data.
<div class="wikistyle"><h1>RegexpCrawler</h1>
<p>RegexpCrawler is a crawler which use regex expressi...
============================
http://github.com/flyerhzm/sitemap/tree/master
sitemap
This plugin will generate a sitemap.xml from sitemap.rb whose format
is very similar to routes.rb
<div class="wikistyle"><h1>Sitemap</h1>
<p>This plugin will generate a sitemap.xml or sitemap.xml.gz ...
============================
http://github.com/flyerhzm/visual_partial/tree/master
visual_partial
This plugin provides a way that you can see all the partial pages
rendered. So it can prevent you from using partial page too much,
which hurts the performance.
<div class="wikistyle"><h1>VisualPartial</h1>
<p>This plugin provides a way that you can see all the ...
============================
http://github.com/flyerhzm/chinese_regions/tree/master
chinese_regions
provides all chinese regions, cities and districts
<div class="wikistyle"><h1>ChineseRegions</h1>
<p>Provides all chinese regions, cities and districts<...
============================
http://github.com/flyerhzm/chinese_permalink/tree/master
chinese_permalink
This plugin adds a capability for ar model to create a seo permalink
with your chinese text. It will translate your chinese text to english
url based on google translate.
<div class="wikistyle"><h1>ChinesePermalink</h1>
<p>This plugin adds a capability for ar model to cre...
============================
http://github.com/flyerhzm/codelinestatistics/tree/master
codelinestatistics
The code line statistics takes files and directories from GUI, counts
the total files, total sizes of files, total lines, lines of codes,
lines of comments and lines of blanks in the files, displays the
results and can also export results to html file.
<div class="plain"><pre>codelinestatistics README file:
Wha…
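
If you want to see what the three regexp options in the script do without touching the network, here is a minimal offline sketch in plain Ruby. It is only an illustration and not how the gem is implemented internally. The helper names (links_to_follow, need_parse?, capture) and the sample HTML strings are made up for this example; only the regular expressions and the capture names come from the script above.

```ruby
# Offline sketch: no network access and no regexp_crawler internals are used.
# The HTML below is hypothetical sample markup, not real GitHub output.

INDEX_HTML = <<~HTML
  <div class="title"><b><a href="/flyerhzm/bullet/tree">bullet</a></b></div>
  <div class="title"><b><a href="/flyerhzm/rails/tree">rails</a></b></div>
HTML

PROJECT_HTML = <<~HTML
  <a href="http://github.com/flyerhzm/bullet/tree">bullet</a>
  <span id="repository_description" class="edit">A rails plugin</span>
  <div class="wikistyle"><h1>Bullet</h1></div></div>
HTML

FORK_HTML = 'Fork of <a href="/rails/rails">rails/rails</a>'

# Regular expressions and capture names copied from the script above.
CONTINUE_REGEXP = %r{<div class="title"><b><a href="(/flyerhzm/.*?)">}m
CAPTURE_REGEXP = %r{<a href="http://github.com/flyerhzm/[^/"]*?(?:/tree)?">(.*?)</a>.*<span id="repository_description".*?>(.*?)</span>.*(<div class="(?:wikistyle|plain)">.*?</div>)</div>}m
NAMED_CAPTURES = ['title', 'description', 'body']

# Step 1 (hypothetical helper): collect the links that continue_regexp captures.
def links_to_follow(html, continue_regexp)
  html.scan(continue_regexp).flatten
end

# Step 2 (hypothetical helper): the same test as the :need_parse proc,
# normalized to true/false.
def need_parse?(page, response_body)
  !!(page =~ %r{http://github.com/flyerhzm/\w+}) &&
    !response_body.index(/Fork of.*?<a href=".*?">/)
end

# Step 3 (hypothetical helper): pair each name with the capture group
# at the same position, using symbol keys like result[:title].
def capture(html, capture_regexp, names)
  match = capture_regexp.match(html)
  return nil unless match
  Hash[names.map(&:to_sym).zip(match.captures)]
end

p links_to_follow(INDEX_HTML, CONTINUE_REGEXP)
p need_parse?('http://github.com/flyerhzm/bullet/tree', PROJECT_HTML)
p need_parse?('http://github.com/flyerhzm/rails/tree', FORK_HTML)
p capture(PROJECT_HTML, CAPTURE_REGEXP, NAMED_CAPTURES)
```

The fork page is rejected in step 2 because its body contains a "Fork of" link, which is how the script leaves out forked projects. Step 3 shows why the order of :named_captures has to match the order of the groups in :capture_regexp.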
