I'm trying to scrape images from a page. I'm using Hpricot to scrape the actual image URLs into an array but I've encountered a problem regarding resolving the full image paths.

Example:

The src of the images can be like any of the following:

http://external.site.com/images/image.jpg (Full URL)
/images/image.jpg (Absolute Path)
../images/image.jpg (Relative Path)
images/image.jpg (Relative Path)

Is there a way to resolve these paths to the proper URLs? So I can copy the images to my server or whatever else I need to do with them?

Hope that makes sense.

Cheers,

Jim
-- 
Posted via http://www.ruby-forum.com/.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---

On Dec 3, 2007 10:45 AM, Jim Neath <rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> wrote:
>
> I'm trying to scrape images from a page. I'm using Hpricot to scrape the
> actual image URLs into an array but I've encountered a problem regarding
> resolving the full image paths.
>
> Example:
>
> The src of the images can be like any of the following:
>
> http://external.site.com/images/image.jpg (Full URL)
> /images/image.jpg (Absolute Path)
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
>
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?

You might try making a local mirror of the site using `wget -m -np http://external.site.com`. That will resolve all the URLs for you and download the images.

-- 
Greg Donald
http://destiney.com/

I would do something similar, but the problem is that the script is going to be working on lots of different URLs. It's for a social bookmarking site that I'm currently working on. The user bookmarks a page, a script scrapes all the images from the page and resizes them, and then the user can choose which thumbnail they want to use for their bookmark.

Running wget against every site probably isn't the best plan for so many sites.

-- 
Posted via http://www.ruby-forum.com/.

On Dec 3, 2007 2:21 PM, Philip Hallstrom <rails-SUcgGwS4C16SUMMaM/qcSw@public.gmane.org> wrote:
> Parse the url into pieces... extract the domain name and the "directory"
> part of the path.
>
> Then just match them up. If your image starts with http just use that.
> If it starts with a slash then prepend the domain name. Otherwise domain
> + directory_path + image.

/me watches while wget gets reinvented.

-- 
Greg Donald
http://destiney.com/

On Dec 3, 2007, at 11:45 AM, Jim Neath wrote:
> I'm trying to scrape images from a page. I'm using Hpricot to scrape the
> actual image URLs into an array but I've encountered a problem regarding
> resolving the full image paths.
>
> Example:
>
> The src of the images can be like any of the following:
>
> http://external.site.com/images/image.jpg (Full URL)
> /images/image.jpg (Absolute Path)
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
>
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?
>
> Hope that makes sense.
>
> Cheers,
>
> Jim

You use URI.join

irb> require 'uri'
=> true
irb> page_and_images = {
?>   'http://external.site.com/somedir/somepage.html' => ['http://external.site.com/images/image.jpg',
?>                                                       '/images/image.jpg',
?>                                                       '../images/image.jpg'],
?>   'http://external.site.com/sometoppage.html' => ['http://external.site.com/images/image.jpg',
?>                                                  'images/image.jpg'],
?> }
irb> page_and_images.each do |page,images|
?>   page_url = URI.parse(page)
irb>   puts "Starting from: #{page}"
irb>   images.each do |image|
?>     image_url = URI.join(page, image)
irb>     puts "  #{image} becomes #{image_url}"
irb>   end
irb> end; nil
Starting from: http://external.site.com/sometoppage.html
  http://external.site.com/images/image.jpg becomes http://external.site.com/images/image.jpg
  images/image.jpg becomes http://external.site.com/images/image.jpg
Starting from: http://external.site.com/somedir/somepage.html
  http://external.site.com/images/image.jpg becomes http://external.site.com/images/image.jpg
  /images/image.jpg becomes http://external.site.com/images/image.jpg
  ../images/image.jpg becomes http://external.site.com/images/image.jpg
=> nil

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob-xa9cJyRlE0mWcWVYNo9pwxS2lgjeYSpx@public.gmane.org

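For the bookmarking use case, the URI.join call above could be wrapped in a small helper that takes the page URL and the array of scraped src values. This is only a sketch: the name resolve_image_urls is made up for illustration, and skipping unparseable values (rather than reporting them) is an assumption about what the scraper would want.

```ruby
require 'uri'

# Hypothetical helper (not from the thread): resolve every scraped src
# against the URL of the page it was found on.
def resolve_image_urls(page_url, srcs)
  resolved = srcs.map do |src|
    begin
      URI.join(page_url, src.strip).to_s
    rescue URI::Error
      nil # assumption: silently skip values that are not valid URIs
    end
  end
  resolved.compact.uniq # drop the skipped values and any duplicates
end

srcs = [
  'http://external.site.com/images/image.jpg',
  '/images/image.jpg',
  '../images/image.jpg',
  'images/image.jpg'
]

page = 'http://external.site.com/somedir/somepage.html'
resolve_image_urls(page, srcs).each { |url| puts url }
```

With that page URL, the first three src values all resolve to http://external.site.com/images/image.jpg, and the last one resolves to http://external.site.com/somedir/images/image.jpg, so two distinct URLs are printed.
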
> I'm trying to scrape images from a page. I'm using Hpricot to scrape the
> actual image URLs into an array but I've encountered a problem regarding
> resolving the full image paths.
>
> Example:
>
> The src of the images can be like any of the following:
>
> http://external.site.com/images/image.jpg (Full URL)
> /images/image.jpg (Absolute Path)
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
>
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?

Parse the url into pieces... extract the domain name and the "directory" part of the path.

Then just match them up. If your image starts with http just use that. If it starts with a slash then prepend the domain name. Otherwise domain + directory_path + image.

-philip
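
The three rules above might look roughly like this in Ruby. It is a sketch only: naive_resolve is a made-up name, and the code ignores ports, query strings and protocol-relative URLs. The last lines compare it with URI.join from the standard library for the "../" case.

```ruby
require 'uri'

# Hypothetical helper that follows the three rules literally.
# Assumption: the page URL uses the default port and has no query string.
def naive_resolve(page_url, src)
  page = URI.parse(page_url)
  root = "#{page.scheme}://#{page.host}"
  # "Directory" part: the path up to and including its last slash.
  directory = page.path[%r{\A.*/}] || '/'

  if src =~ %r{\Ahttps?://}i
    src                    # already a full URL, use it as-is
  elsif src.start_with?('/')
    root + src             # absolute path: prepend the domain
  else
    root + directory + src # relative path: domain + directory + image
  end
end

page = 'http://external.site.com/somedir/somepage.html'

puts naive_resolve(page, 'images/image.jpg')
puts naive_resolve(page, '../images/image.jpg')
puts URI.join(page, '../images/image.jpg')
```

For '../images/image.jpg' the manual version leaves the '..' segment in place (http://external.site.com/somedir/../images/image.jpg), so it would still need a normalization step, whereas URI.join returns http://external.site.com/images/image.jpg directly.
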