thr3ads.net - Rails - How to scrape a page without knowing its html structure [Dec 2009]

kalyan

2009-Dec-12 10:56 UTC

How to scrape a page without knowing its html structure

Hi,

I'm building a module for my site that needs to import a user's blog.
I can read the blog through its RSS feed, but the feed doesn't give me
the entire content, so I need to scrape the blog page itself. How can
I scrape a page without knowing its HTML structure? Can anyone help me
with this? Thanks in advance.

--

You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
To post to this group, send email to
rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to
rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at
http://groups.google.com/group/rubyonrails-talk?hl=en.

Hassan Schroeder

2009-Dec-12 17:20 UTC

Re: How to scrape a page without knowing its html structure

On Sat, Dec 12, 2009 at 2:56 AM, kalyan
<kalyan.allampalli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> I'm building a module for my site that needs to import a user's blog.
> I can read the blog through its RSS feed, but the feed doesn't give me
> the entire content, so I need to scrape the blog page itself. How can
> I scrape a page without knowing its HTML structure?

Unless you want the entire page, you need to know something about
the page structure.

Well. If the page is even reasonably marked up (DIVs/Ps-wise) and
you create an array of block elements, you *might* get away with the
assumption that the ones with significant amounts of text (for some
value of "significant") are the actual blog post.

Might. I'd imagine a lot more going into that heuristic, since you're
looking for an AI solution :-)

Good luck,
-- 
Hassan Schroeder ------------------------
hassan.schroeder-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
twitter: @hassan

Joe McGlynn

2009-Dec-12 17:34 UTC

RE: How to scrape a page without knowing its html structure

I think you'll find you need to know _something_ about the page layout.
If there is a finite number of sites you need to scrape, you could do
this pretty simply.

Assume you had a CSS selector to find the desired content in each URL of
interest, and it was stored in an ActiveRecord(-ish) model.

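require 'nokogiri'
require 'open-uri'

# ...
# Look up the stored selector record for this URL.
# (Selector is the model described above; `css` is an assumed name for
# the attribute that holds the selector string.)
@selector = Selector.find_by_url(@the_url_to_scrape)

# Fetch and parse the page. URI.open needs Ruby 2.5+; the bare
# Kernel#open no longer fetches URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open(@the_url_to_scrape))

# Print the text of every node that matches the selector.
doc.css(@selector.css).each do |node|
  puts node.content
end
# ...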


I did a write-up on simple scraping with Nokogiri and SelectorGadget here:
http://joemcglynn.wordpress.com/2009/12/10/five-minute-introduction-to-nokogiri/


Kalyan

2009-Dec-16 06:12 UTC

How to scrape a page without knowing its html structure

I'm building a module for my site that needs to import a user's blog.
I can read the blog through its RSS feed, but the feed doesn't give me
the entire content, so I need to scrape the blog page itself. How can
I scrape a page without knowing its HTML structure? Can anyone help me
with this? Thanks in advance.

-- 
Thanks & regards
Kalyan

kallu

2009-Dec-16 06:14 UTC

How to scrape a page without knowing its html structure

Hi,

I'm building a module for my site that needs to import a user's blog.
I can read the blog through its RSS feed, but the feed doesn't give me
the entire content, so I need to scrape the blog page itself. How can
I scrape a page without knowing its HTML structure? Can anyone help me
with this? Thanks in advance.

Hassan Schroeder

2009-Dec-16 06:18 UTC

Re: How to scrape a page without knowing its html structure

On Tue, Dec 15, 2009 at 10:12 PM, Kalyan
<kalyan.allampalli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> How can I scrape a page without knowing its HTML structure?

You asked this exact question 4 days ago and got 2 answers saying that
basically you can't -- you have to know *something* about the way the
pages are marked up.

It's still true. :-)

-- 
Hassan Schroeder ------------------------
hassan.schroeder-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
twitter: @hassan

jman

2009-Dec-16 22:40 UTC

Re: How to scrape a page without knowing its html structure

It seems that looking at the structure would be the easiest way, but
if you wanted something more complex, your scraping program could
infer the layout structure and separate it from the content. Your
program would need to be fed multiple pages from the same site and
would assume the layout to be the portion that stays mostly the same
from page to page. That's an oversimplification, but that's the
general idea.
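
A very rough sketch of that idea, using only the Ruby standard library,
might look like the following. Everything in it is an assumption made
for illustration: pages are compared line by line rather than as parsed
HTML, a line counts as "layout" when it appears in at least 80% of the
pages, and the method names (page_lines, layout_lines, content_lines)
are hypothetical. A real version would more likely compare DOM nodes.

```ruby
require 'set'

# Splits a page into trimmed, non-empty lines. Lines stand in for
# "blocks" in this sketch.
def page_lines(html)
  html.lines.map(&:strip).reject(&:empty?)
end

# Returns the set of lines that appear in at least `min_share` of the
# given pages; these are assumed to be the shared layout.
def layout_lines(pages, min_share: 0.8)
  counts = Hash.new(0)
  pages.each do |html|
    page_lines(html).uniq.each { |line| counts[line] += 1 }
  end
  needed = (pages.size * min_share).ceil
  counts.select { |_line, count| count >= needed }.keys.to_set
end

# Returns the lines of one page that are not part of the layout.
def content_lines(html, layout)
  page_lines(html).reject { |line| layout.include?(line) }
end

# Three made-up pages that share a nav bar and a footer.
template = lambda do |body|
  "<div id='nav'>Home | Archive</div>\n#{body}\n" \
    "<div id='footer'>Powered by Blog</div>"
end
pages = [
  '<p>First post about Ruby.</p>',
  '<p>Second post about Rails.</p>',
  '<p>Third post about scraping.</p>'
].map(&template)

layout = layout_lines(pages)
puts content_lines(pages.first, layout)
```

Note that this needs several pages from the same site to work at all:
given a single page, every line appears in 100% of the input and is
treated as layout.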

Good luck.
