thr3ads.net - Rails - Resolving image URLs [Dec 2007]

Jim Neath

2007-Dec-03 16:45 UTC

Resolving image URLs

I'm trying to scrape images from a page. I'm using Hpricot to scrape the
actual image URLs into an array but I've encountered a problem regarding
resolving the full image paths.

Example:

The src of the images can be like any of the following:

http://external.site.com/images/image.jpg (Full URL)
/images/image.jpg (Absolute Path)
../images/image.jpg (Relative Path)
images/image.jpg (Relative Path)

Is there a way to resolve these paths to the proper URLs? So I can copy
the images to my server or whatever else I need to do with them?

Hope that makes sense.

Cheers,

Jim
-- 
Posted via http://www.ruby-forum.com/.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
To post to this group, send email to
rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to
rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at
http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---

Greg Donald

2007-Dec-03 16:48 UTC

Re: Resolving image URLs

On Dec 3, 2007 10:45 AM, Jim Neath
<rails-mailing-list-ARtvInVfO7ksV2N9l4h3zg@public.gmane.org> wrote:
>
> I'm trying to scrape images from a page. I'm using Hpricot to scrape the
> actual image URLs into an array but I've encountered a problem regarding
> resolving the full image paths.
>
> Example:
>
> The src of the images can be like any of the following:
>
> http://external.site.com/images/image.jpg (Full URL)
> /images/image.jpg (Absolute Path)
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
>
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?

You might try making a local mirror of the site using `wget -m -np
http://external.site.com`.  That will resolve all the URLs for you and
download the images.


-- 
Greg Donald
http://destiney.com/

Jim Neath

2007-Dec-03 17:00 UTC

Re: Resolving image URLs

I would do something similar, but the problem with that is that the
script is going to be working on lots of different URLs.

It's for a social bookmarking site that I'm currently working on. The
user bookmarks a page, a script scrapes all the images from the page and
resizes them, then the user can choose which thumbnail they want to use
for their bookmark.

Running wget against every site probably isn't the best plan for so many
sites.
-- 
Posted via http://www.ruby-forum.com/.

Greg Donald

2007-Dec-03 18:31 UTC

Re: Resolving image URLs

On Dec 3, 2007 2:21 PM, Philip Hallstrom
<rails-SUcgGwS4C16SUMMaM/qcSw@public.gmane.org> wrote:
> Parse the url into pieces... extract the domain name and the "directory"
> part of the path.
>
> Then just match them up.  If your image starts with http just use that.
> If it starts with a slash then prepend the domain name.  Otherwise domain
> + directory_path + image.

/me watches while wget gets reinvented.


-- 
Greg Donald
http://destiney.com/

Rob Biedenharn

2007-Dec-03 18:37 UTC

Re: Resolving image URLs

On Dec 3, 2007, at 11:45 AM, Jim Neath wrote:
> I'm trying to scrape images from a page. I'm using Hpricot to scrape the
> actual image URLs into an array but I've encountered a problem regarding
> resolving the full image paths.
>
> Example:
>
> The src of the images can be like any of the following:
>
> http://external.site.com/images/image.jpg (Full URL)
> /images/image.jpg (Absolute Path)
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
>
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?
>
> Hope that makes sense.
>
> Cheers,
>
> Jim

You use URI.join

irb> require 'uri'
=> true
irb> page_and_images = {
?>     'http://external.site.com/somedir/somepage.html' =>
?>       ['http://external.site.com/images/image.jpg',
?>        '/images/image.jpg',
?>        '../images/image.jpg'],
?>     'http://external.site.com/sometoppage.html' =>
?>       ['http://external.site.com/images/image.jpg',
?>        'images/image.jpg'],
?>   }

irb> page_and_images.each do |page, images|
?>     page_url = URI.parse(page)
irb>     puts "Starting from:  #{page}"
irb>     images.each do |image|
?>       image_url = URI.join(page, image)
irb>       puts "   #{image} becomes #{image_url}"
irb>     end
irb>   end; nil
Starting from:  http://external.site.com/sometoppage.html
    http://external.site.com/images/image.jpg becomes http://external.site.com/images/image.jpg
    images/image.jpg becomes http://external.site.com/images/image.jpg
Starting from:  http://external.site.com/somedir/somepage.html
    http://external.site.com/images/image.jpg becomes http://external.site.com/images/image.jpg
    /images/image.jpg becomes http://external.site.com/images/image.jpg
    ../images/image.jpg becomes http://external.site.com/images/image.jpg
=> nil

-Rob

Rob Biedenharn		http://agileconsultingllc.com
Rob-xa9cJyRlE0mWcWVYNo9pwxS2lgjeYSpx@public.gmane.org
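
For a scraper that handles many pages, the URI.join call above could be wrapped in a small helper. What follows is only a minimal sketch, not anything posted in the thread: the method name resolve_image_urls is made up for illustration, and skipping src values that fail to parse (rather than raising) is an assumption about what a thumbnail picker would want.

```ruby
require 'uri'

# Hypothetical helper (not from the thread): resolve every src value
# found on a page against that page's own URL.
def resolve_image_urls(page_url, srcs)
  resolved = srcs.map do |src|
    begin
      # URI.join applies the standard relative-reference rules, so full
      # URLs, absolute paths and relative paths are all handled here.
      URI.join(page_url, src.strip).to_s
    rescue URI::Error
      nil # assumption: silently skip src values that are not valid URIs
    end
  end
  resolved.compact.uniq
end

page = 'http://external.site.com/somedir/somepage.html'
srcs = [
  'http://external.site.com/images/image.jpg',
  '/images/image.jpg',
  '../images/image.jpg',
  'images/image.jpg',
  'not a valid src'
]

puts resolve_image_urls(page, srcs)
```

Note that images/image.jpg resolves relative to the page's directory, so from somedir/somepage.html it ends up under /somedir/images/, while the other three forms all collapse to the same URL.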


Philip Hallstrom

2007-Dec-03 20:21 UTC

Re: Resolving image URLs

> I'm trying to scrape images from a page. I'm using Hpricot to scrape the
> actual image URLs into an array but I've encountered a problem regarding
> resolving the full image paths.
>
> Example:
>
> The src of the images can be like any of the following:
>
> http://external.site.com/images/image.jpg (Full URL)
> /images/image.jpg (Absolute Path)
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
>
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?

Parse the url into pieces... extract the domain name and the "directory"
part of the path.

Then just match them up.  If your image starts with http just use that.
If it starts with a slash then prepend the domain name.  Otherwise domain
+ directory_path + image.

-philip
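
The manual approach described above might look roughly like this. It is a rough sketch under assumptions, not Philip's code: the method name naive_resolve and the regex used to pull out the "directory" part are illustrative choices.

```ruby
require 'uri'

# Hypothetical sketch of the three-way match described in the post.
def naive_resolve(page_url, src)
  page = URI.parse(page_url)

  # "Domain" part: scheme and host, plus the port when it is non-default.
  root = "#{page.scheme}://#{page.host}"
  root += ":#{page.port}" unless page.port == page.default_port

  # "Directory" part: everything in the path up to the last slash.
  directory = page.path[%r{\A.*/}] || '/'

  if src =~ %r{\Ahttps?://}i
    src                    # already a full URL
  elsif src.start_with?('/')
    root + src             # absolute path: prepend the domain
  else
    root + directory + src # relative path: domain + directory + image
  end
end

page = 'http://external.site.com/somedir/somepage.html'
puts naive_resolve(page, '/images/image.jpg')
puts naive_resolve(page, 'images/image.jpg')
puts naive_resolve(page, '../images/image.jpg')
```

The last call shows the limit of plain concatenation: the ../ segment is left in the result instead of being collapsed, and a protocol-relative src such as //host/image.jpg would be treated as an absolute path. URI.join handles both cases.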
