thr3ads.net - Mechanize users - [Mechanize-users] determining whether a link might be a file? [Jun 2010]

Soyoung Shin

2010-Jun-25 22:53 UTC

[Mechanize-users] determining whether a link might be a file?

Hi again. I figured out how to use ruby and solve that last problem :)

On another note, I'm trying to build a crawler that will generally
avoid hitting (but maybe still get the url for) non-html (downloadable) files
like csv, xml, exe, etc. It's simple enough to avoid links that end in
.csv or .xml, but when there are intermediate redirects, it can be difficult.
For example, this was linked from cnet:

http://dw.com.com/redir?edId=3&siteId=4&oId=3000-18502_4-10976868&ontId=18502_4&spi=a10bbf0aa9a8a3315fe085cb27966826&lop=link&tag=tdw_dltext&ltype=dl_dlnow&pid=11422203&mfgId=74349&merId=74349&pguid=kDCrwgoPjAYAAFAMm8gAAADn&destUrl=http%3A%2F%2Fdownload.cnet.com%2F3001-18502_4-10976868.html%3Fspi%3Da10bbf0aa9a8a3315fe085cb27966826

which redirects to

http://software-files-l.cnet.com/s/software/11/42/22/03/install_virtualdj_trial_v6.1.dmg?e=1277527830&h=1aca74c88927f3f981bfb5d756764454&lop=link&ptype=1901&ontid=18502&siteId=4&edId=3&spi=a10bbf0aa9a8a3315fe085cb27966826&pid=11422203&psid=10976868&fileName=install_virtualdj_trial_v6.1.dmg

which downloads a dmg for virtual dj. has anyone got a solution to this?

Thanks
Soyoung
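
For reference, the simple suffix check described above could look roughly like the sketch below. It uses only Ruby's standard library; the method name download_extension? and the exact extension list are assumptions made for illustration (csv, xml, exe and dmg are the types mentioned in this thread). It looks only at the path, so a query string does not hide the suffix, but a redirector URL such as the dw.com.com link above still passes as "not a download", which is exactly the problem being asked about.

```ruby
require 'uri'

# Extensions treated as downloads. csv, xml, exe and dmg come from the
# thread; the remaining entries are illustrative assumptions.
DOWNLOAD_EXTENSIONS = %w[.csv .xml .exe .dmg .zip .pdf].freeze

# Returns true when the path component of the URL (query string ignored)
# ends in one of the extensions above. Hypothetical helper name.
def download_extension?(url)
  path = URI.parse(url).path.to_s
  DOWNLOAD_EXTENSIONS.include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  # Unparseable URLs are simply reported as "not a known download".
  false
end
```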

Matthias -apoc- Hecker

2010-Jun-26 00:51 UTC

Hello list, Soyoung Shin,

Well, you could deactivate automatic redirection and fetch the Location
header:

agent.redirect_ok = false
page = agent.get 'http://example.com/'
puts page.header['location']

But I guess that's not really more practical: you cannot rely on the
URL to determine the content type (there can be mod_rewrite rules,
etc.). Another possible solution would be to send a HEAD request first to
determine the Content-Type/Content-Disposition response headers:

page = agent.head 'http://example.com/'
puts page.header['content-type']

That would work.
I think in theory it should be possible to send a normal GET request,
then retrieve the response header and *then* decide to proceed or stop
(if it isn't text/html, for instance). However, I don't think that this is
possible with Mechanize: the post_connect_hook is triggered after the
file is downloaded, so that isn't an option.
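
As a rough sketch of the header-based decision, the check on Content-Type and Content-Disposition could be written as a small helper like the one below. The name html_response? and the list of accepted media types are assumptions for illustration, and the helper expects a plain Hash with lower-case header names (adapt the lookup if your header object behaves differently).

```ruby
# Media types treated as "a page worth crawling". Illustrative assumption.
HTML_TYPES = %w[text/html application/xhtml+xml].freeze

# Decide from response headers alone whether a URL looks like an HTML page.
# `headers` is assumed to be a Hash with lower-case header names.
def html_response?(headers)
  # A server that asks for the body to be saved is offering a download.
  disposition = headers['content-disposition'].to_s
  return false if disposition =~ /\A\s*attachment/i

  # Drop parameters such as "; charset=utf-8" before comparing.
  media_type = headers['content-type'].to_s.split(';').first.to_s.strip.downcase
  HTML_TYPES.include?(media_type)
end
```

A missing Content-Type is treated as "not HTML" here, which errs on the side of skipping the link; a crawler that prefers to fetch unknown types would flip that default.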

I hope this helps a little.

Matthias

-- 
(a) (p)roof (o)f (c)oncept ..
  http://apoc.sixserv.org/

Mihael

2010-Jun-26 05:44 UTC

Hey, maybe u could use something like this:

          head = a.head(img_url)
          content_type = head.response["content-type"]
          if head.kind_of?(WWW::Mechanize::File) && (content_type =~ /image/)
            image = a.get(img_url)
            filename = img_url.split('/').last
            path = @temp_path/filename
            image.save_as(path)
            asset = Asset.new(:original_url => img_url, :mayo_id => @source.id,
                              :uploaded_data => ActionController::TestUploadedFile.new(path, content_type))
            asset.save
            File.delete(path) #cleanup
          else
            warn "skipped image url: #{img_url} !!! not an image url"
            return nil #this is not an image url
          end

Soyoung Shin

2010-Jun-28 16:54 UTC

That works, but unfortunately it still downloads the entire file before
inspecting the headers. At this point, I think a better option
will be to use a mixture of curl/wget + mechanize.

require 'shellwords'

headers = `curl --silent --head #{Shellwords.escape(url)}`
if headers =~ /\AHTTP\/\S+ 30[12]\b/   # status line says 301 or 302
	# inspect the redirect url for suffixes like .jpeg
end
# continue as normal
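
To get at the redirect URL rather than just detecting the status, the text printed by curl --head can be split into a status code and a header hash. The sketch below is one way to do that with plain string handling; parse_head_output is a hypothetical helper name, and it only reads the first response block, so it assumes curl was run without -L.

```ruby
# Parse the raw text printed by `curl --head` into [status, headers].
# Header names are lower-cased; only the first response block is read.
def parse_head_output(raw)
  lines = raw.split(/\r?\n/)
  status_line = lines.shift.to_s
  # Matches e.g. "HTTP/1.1 301 Moved Permanently" or "HTTP/2 301".
  status = status_line[/\AHTTP\/\S+\s+(\d{3})/, 1].to_i

  headers = {}
  lines.each do |line|
    break if line.empty?              # a blank line ends the header block
    name, value = line.split(':', 2)  # split on the first colon only
    next unless value
    headers[name.strip.downcase] = value.strip
  end
  [status, headers]
end
```

With this, headers['location'] holds the redirect target whenever the status is a 3xx, and it can then be checked for suffixes like .jpeg or .dmg.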

~Soyoung

Aaron Starr

2010-Jun-28 17:07 UTC

Seems like, in that case, you should just take Matthias' suggestion:

agent.redirect_ok = false
page = agent.get url
# inspect page.header['location'] for suffixes like .jpeg
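
The same hop-by-hop idea can be sketched without Mechanize, using Net::HTTP from the standard library and HEAD requests only, so that no file body is transferred. This is an illustration under assumptions, not what Mechanize does internally: redirect_target and redirect_chain are made-up helper names, and the hop limit of 5 is arbitrary.

```ruby
require 'net/http'
require 'uri'

# Resolve a Location header value (absolute or relative) against the URL
# that produced it.
def redirect_target(base_url, location)
  URI.join(base_url, location).to_s
end

# Follow redirects with HEAD requests only and return every URL visited,
# starting with the one passed in. Stops at the first non-redirect
# response or after `limit` hops.
def redirect_chain(url, limit = 5)
  chain = [url]
  limit.times do
    uri = URI.parse(chain.last)
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.head(uri.request_uri)
    end
    break unless response.is_a?(Net::HTTPRedirection) && response['location']
    chain << redirect_target(chain.last, response['location'])
  end
  chain
end
```

Each entry in the returned list can be inspected for a file suffix before deciding whether to hand the original link to the crawler.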


Soyoung Shin

2010-Jun-28 17:17 UTC

ah, duh. thanks! :3
