Hi,

I'm building a module for my site that needs to import a user's blog. I can read the blog through its RSS feed, but the feed doesn't give me the entire content, so I need to scrape the user's blog page. How can I scrape a page without knowing its HTML structure? Can anyone help me with this issue?

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group. To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.
Hassan Schroeder
2009-Dec-12 17:20 UTC
Re: How to scrape a page without knowing its html structure
On Sat, Dec 12, 2009 at 2:56 AM, kalyan <kalyan.allampalli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> I'm doing one module in my site, there I need to import user blog into
> my site. I can use RSS feeds to read the blog information but using
> RSS feeds I'm not getting entire information. So, I need to scrape the
> user blog page. How to scrape a pages without knowing its html
> structure of a page?

Unless you want the entire page, you need to know something about the page structure.

Well. If the page is even reasonably marked up (DIVs/Ps-wise) and you create an array of block elements, you *might* get away with the assumption that the ones with significant amounts of text (for some value of "significant") are the actual blog post. Might.

I'd imagine a lot more going into that heuristic, since you're looking for an AI solution :-)

Good luck,
--
Hassan Schroeder ------------------------ hassan.schroeder-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
twitter: @hassan
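For what it's worth, the "blocks with significant amounts of text" heuristic could be sketched roughly like this. This is only an illustration under loud assumptions: it looks at `<p>` elements only, pulls them out with a regex to stay dependency-free (a real HTML parser such as Nokogiri would be more robust), and the 80-character threshold and the helper names `paragraph_texts` and `likely_post_paragraphs` are made up for the example.

```ruby
# Minimum number of characters for a block to count as "significant".
# This threshold is an arbitrary assumption; tune it per site.
MIN_TEXT_LENGTH = 80

# Crude, dependency-free extraction of the text inside each <p> element.
# Inline tags are dropped and runs of whitespace are collapsed.
def paragraph_texts(html)
  html.scan(%r{<p\b[^>]*>(.*?)</p>}mi).flatten.map do |inner|
    inner.gsub(/<[^>]+>/, "").gsub(/\s+/, " ").strip
  end
end

# Keep only the paragraphs long enough to look like post content.
def likely_post_paragraphs(html, min_length = MIN_TEXT_LENGTH)
  paragraph_texts(html).select { |text| text.length >= min_length }
end
```

Short blocks such as navigation links or "Posted by ..." lines fall under the threshold and are dropped. Long sidebar or comment text would still slip through, which is why this stays a "might".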
Joe McGlynn
2009-Dec-12 17:34 UTC
RE: How to scrape a page without knowing its html structure
I think you'll find you need to know _something_ about the page layout.

If there are a finite number of places you need to scrape from, you could do this pretty simply. Assume you had a CSS selector to find the desired content in each URL of interest, and it was stored in an ActiveRecord(ish) model.

    require 'nokogiri'
    require 'open-uri'

    # ...
    # Look up the stored selector record for this URL.
    # (`css` is an assumed attribute holding the CSS selector string.)
    @selector = Selector.find_by_url(@the_url_to_scrape)
    doc = Nokogiri::HTML(URI.open(@the_url_to_scrape))

    # Search for nodes by CSS and print each node's text
    doc.css(@selector.css).each do |node|
      puts node.content
    end
    # ...

I did a write-up on simple scraping with Nokogiri and SelectorGadget here:
http://joemcglynn.wordpress.com/2009/12/10/five-minute-introduction-to-nokogiri/
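One possible variation, since every post on a given blog usually shares the same markup: key the stored selector by host rather than by full URL. A minimal sketch, assuming a plain Hash instead of a model. The hosts, the selectors, and the `selector_for` helper are all hypothetical placeholders.

```ruby
require "uri"

# Hypothetical mapping from blog host to the CSS selector that wraps a post body.
# Both the hosts and the selectors are made-up placeholders.
SELECTORS_BY_HOST = {
  "blog.example.com" => "div.entry-content",
  "example.org"      => "article .post-body"
}.freeze

# Returns the stored selector for the URL's host, or nil when the host is unknown.
def selector_for(url)
  host = URI.parse(url).host
  SELECTORS_BY_HOST[host && host.downcase]
end
```

A nil result is the signal that someone still has to look at that blog and add a selector for it.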
I'm building a module for my site that needs to import a user's blog. I can read the blog through its RSS feed, but the feed doesn't give me the entire content, so I need to scrape the user's blog page. How can I scrape a page without knowing its HTML structure? Can anyone help me with this issue?

Thanks in advance.

--
Thanks & regards
Kalyan
Hassan Schroeder
2009-Dec-16 06:18 UTC
Re: How to scrape a page without knowing its html structure
On Tue, Dec 15, 2009 at 10:12 PM, Kalyan <kalyan.allampalli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> How to scrape a pages without knowing its html structure of a page?

You asked this exact question 4 days ago and got 2 answers, which were basically that you can't: you have to know *something* about the way the pages are marked up.

It's still true. :-)

--
Hassan Schroeder ------------------------ hassan.schroeder-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
twitter: @hassan
It seems that looking at the structure would be the easiest way, but if you wanted something more complex, your scraping program could infer the layout structure and separate it from the content. Your program would need to be fed multiple pages and would assume the layout to be the portion that stays mostly the same from page to page. That's an oversimplification, but that's the general idea.

Good luck.
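To make the general idea a bit more concrete, here is a very rough sketch. It assumes a "block" is simply a non-empty line of the page source and that anything appearing in at least 80% of the pages is layout. Both are simplifications chosen for illustration, and the helper names are hypothetical.

```ruby
# Splits a page into rough "blocks": here simply its non-empty, stripped lines.
def blocks(page)
  page.lines.map(&:strip).reject(&:empty?)
end

# Blocks that occur in at least `ratio` of the pages are treated as layout.
def layout_blocks(pages, ratio = 0.8)
  counts = Hash.new(0)
  # Count each block once per page, however often it repeats within that page.
  pages.each { |page| blocks(page).uniq.each { |block| counts[block] += 1 } }
  needed = (pages.size * ratio).ceil
  counts.select { |_, seen| seen >= needed }.keys
end

# Content is whatever remains after removing the inferred layout blocks.
def content_blocks(page, layout)
  blocks(page) - layout
end
```

Usage would be along the lines of `layout = layout_blocks(pages)` followed by `content_blocks(pages.first, layout)`. Real pages would need a smarter notion of "block" (DOM subtrees rather than lines) and some tolerance for layout that changes slightly from page to page.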