Soyoung Shin
2010-Jun-25 22:53 UTC
[Mechanize-users] determining whether a link might be a file?
Hi again. I figured out how to use Ruby and solve that last problem :)

On another note, I'm trying to build a crawler that will generally avoid hitting (but maybe still get the URL for) non-HTML (downloadable) files like csv, xml, exe, etc. It's simple enough to avoid links that end in .csv or .xml, but when there are intermediate redirects, it can be difficult. For example, this was linked from cnet:

http://dw.com.com/redir?edId=3&siteId=4&oId=3000-18502_4-10976868&ontId=18502_4&spi=a10bbf0aa9a8a3315fe085cb27966826&lop=link&tag=tdw_dltext&ltype=dl_dlnow&pid=11422203&mfgId=74349&merId=74349&pguid=kDCrwgoPjAYAAFAMm8gAAADn&destUrl=http%3A%2F%2Fdownload.cnet.com%2F3001-18502_4-10976868.html%3Fspi%3Da10bbf0aa9a8a3315fe085cb27966826

which redirects to

http://software-files-l.cnet.com/s/software/11/42/22/03/install_virtualdj_trial_v6.1.dmg?e=1277527830&h=1aca74c88927f3f981bfb5d756764454&lop=link&ptype=1901&ontid=18502&siteId=4&edId=3&spi=a10bbf0aa9a8a3315fe085cb27966826&pid=11422203&psid=10976868&fileName=install_virtualdj_trial_v6.1.dmg

which downloads a dmg for Virtual DJ. Has anyone got a solution to this?

Thanks,
Soyoung
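The suffix check described in the message above can be sketched in a few lines of Ruby. This is only an illustration: the helper name `download_suffix?` and the extension list are assumptions, not part of Mechanize, and the two URLs are shortened versions of the ones in the message. It also shows why the check misses the redirector link, whose path carries no file extension.

```ruby
require 'uri'

# Hypothetical list of extensions the crawler should not fetch.
SKIP_EXTENSIONS = %w[.csv .xml .exe .dmg .zip .pdf].freeze

# Returns true when the URL's path (the query string is ignored) ends
# in one of the listed extensions.
def download_suffix?(url)
  path = URI.parse(url).path.to_s
  SKIP_EXTENSIONS.include?(File.extname(path).downcase)
end

# Shortened forms of the two URLs from the message.
redirector = 'http://dw.com.com/redir?edId=3&siteId=4'
target     = 'http://software-files-l.cnet.com/s/software/11/42/22/03/' \
             'install_virtualdj_trial_v6.1.dmg?e=1277527830'

puts download_suffix?(redirector) # false: "/redir" has no extension
puts download_suffix?(target)     # true: the path ends in ".dmg"
```

Note that `URI.parse` raises `URI::InvalidURIError` on malformed links, so a real crawler would probably want to rescue that and decide how to treat such URLs.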
Matthias -apoc- Hecker
2010-Jun-26 00:51 UTC
[Mechanize-users] determining whether a link might be a file?
Hello list, Soyoung Shin,

Well, you could deactivate automatic redirection and fetch the Location header:

    agent.redirect_ok = false
    page = agent.get 'http://example.com/'
    puts page.header['location']

But I guess that's not really more practical: you cannot really use the URL to determine the file's content type (there can be mod_rewrite rules etc.). Another possible solution would be to send a HEAD request first and check the Content-Type/Content-Disposition response headers:

    page = agent.head 'http://example.com/'
    puts page.header['content-type']

That would work. I think in theory it should be possible to send a normal GET request, then retrieve the response header and *then* decide to proceed or stop (if it isn't text/html, for instance). However, I don't think that this is possible with Mechanize: the post_connect_hook is triggered after the file is downloaded, so that isn't an option.

I hope this helps a little.

Matthias

-- 
(a) (p)roof (o)f (c)oncept .. http://apoc.sixserv.org/
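The HEAD-based idea in this reply can be sketched with Ruby's standard Net::HTTP instead of Mechanize. Everything named here (`html_page?`, `head_headers`) is a hypothetical helper, and treating a missing Content-Type as "not HTML" is an assumption about what the crawler wants rather than anything stated in the thread.

```ruby
require 'net/http'
require 'uri'

# Decide from response headers alone whether a resource looks like an
# HTML page. `headers` is a Hash with lowercase header names as keys.
def html_page?(headers)
  type        = headers['content-type'].to_s.downcase
  disposition = headers['content-disposition'].to_s.downcase
  # An attachment disposition means the server intends a download.
  return false if disposition.start_with?('attachment')
  type.start_with?('text/html', 'application/xhtml+xml')
end

# Issue a HEAD request (no response body is transferred) and return the
# headers as a Hash of lowercase names to comma-joined values.
# Net::HTTP does not follow redirects, so a 3xx answer comes back as-is.
def head_headers(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.head(uri.request_uri)
    response.to_hash.transform_values { |values| values.join(', ') }
  end
end

# Offline demonstration with hand-written header hashes.
puts html_page?({ 'content-type' => 'text/html; charset=utf-8' })
puts html_page?({ 'content-type' => 'application/x-apple-diskimage' })
```

In a crawler, the two pieces would be combined as `html_page?(head_headers(url))`. Some servers answer HEAD differently from GET, so the result is best treated as a hint.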
Mihael
2010-Jun-26 05:44 UTC
[Mechanize-users] determining whether a link might be a file?
Hey, maybe you could use something like this:

    head = a.head(img_url)
    content_type = head.response["content-type"]
    # only fetch the body when the HEAD response says it is an image
    if head.kind_of?(WWW::Mechanize::File) && (content_type =~ /image/)
      image = a.get(img_url)
      filename = img_url.split('/').last
      path = @temp_path/filename # works if @temp_path is a Pathname
      image.save_as(path)
      asset = Asset.new(:original_url => img_url, :mayo_id => @source.id,
                        :uploaded_data => ActionController::TestUploadedFile.new(path, content_type))
      asset.save
      File.delete(path) # cleanup
    else
      warn "skipped image url: #{img_url} !!! not an image url"
      return nil # this is not an image url
    end
Soyoung Shin
2010-Jun-28 16:54 UTC
[Mechanize-users] determining whether a link might be a file?
That works, but unfortunately it still downloads the entire file before inspecting the headers. At this point, I think a better option is to use a mixture of curl/wget and Mechanize:

    require 'shellwords'

    # escape the URL so that "&" and "?" are not interpreted by the shell
    headers = `curl --head #{Shellwords.escape(url)}`
    if headers.include? "301 something something" # placeholder for the real status line
      # inspect the redirect url for suffixes like .jpeg
    end
    # continue as normal

~Soyoung
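To make the "301 something something" step concrete, here is a rough sketch of parsing the text that `curl --head` prints. The helper name `parse_head_output` and the sample response are made up for illustration; only the first response block is read.

```ruby
# Parse the raw text printed by `curl --head` into a status code and a
# Hash of lowercase header names. Returns status 0 when no status line
# is found.
def parse_head_output(raw)
  lines = raw.split(/\r?\n/)
  status = lines.shift.to_s[/\AHTTP\/\S+\s+(\d{3})/, 1].to_i
  headers = {}
  lines.each do |line|
    break if line.empty? # a blank line ends the header block
    name, value = line.split(':', 2)
    headers[name.strip.downcase] = value.strip if value
  end
  [status, headers]
end

# Hypothetical output, shaped like what curl prints for a redirect.
sample = "HTTP/1.1 301 Moved Permanently\r\n" \
         "Location: http://example.com/files/setup.dmg\r\n" \
         "Content-Length: 0\r\n\r\n"

status, headers = parse_head_output(sample)
if (300..399).cover?(status) && headers['location']
  puts "redirects to #{headers['location']}"
end
```

Checking the whole 3xx range rather than only 301 also catches temporary redirects such as 302, which download links may use as well.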
Aaron Starr
2010-Jun-28 17:07 UTC
[Mechanize-users] determining whether a link might be a file?
Seems like, in that case, you should just take Matthias' suggestion:

    agent.redirect_ok = false
    page = agent.get url
    # inspect page.header['location'] for suffixes like .jpeg
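With automatic redirects switched off, the crawler has to walk the redirect chain itself. A minimal sketch of that loop follows, assuming the caller supplies a block that returns the Location header for a URL (or nil when there is no redirect). The name `follow_redirects`, the extension list, and the hop limit are all hypothetical.

```ruby
require 'uri'

# Hypothetical list of suffixes that mark a link as a file download.
FILE_EXTENSIONS = %w[.csv .xml .exe .dmg .jpeg .jpg].freeze

# Follow redirects one hop at a time without fetching the final target.
# The block receives a URL and must return its Location header, or nil
# when the response is not a redirect. Returns [:file, url] as soon as
# a hop points at a listed extension, [:page, url] when the chain ends,
# and [:too_many_redirects, url] once `limit` hops have been followed.
def follow_redirects(url, limit: 5)
  hops = 0
  loop do
    ext = File.extname(URI.parse(url).path.to_s).downcase
    return [:file, url] if FILE_EXTENSIONS.include?(ext)
    return [:too_many_redirects, url] if hops >= limit

    location = yield(url)
    return [:page, url] if location.nil?

    url = URI.join(url, location).to_s # resolves relative Location values
    hops += 1
  end
end

# Hypothetical redirect table standing in for real HTTP responses.
fake_locations = {
  'http://example.com/redir?id=1' => '/files/setup.dmg?token=abc'
}
p follow_redirects('http://example.com/redir?id=1') { |u| fake_locations[u] }
```

With Mechanize, the block could presumably be the suggestion from this thread: set `agent.redirect_ok = false` once, then return `agent.get(u).header['location']`. That wiring is not tested here.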
Soyoung Shin
2010-Jun-28 17:17 UTC
[Mechanize-users] determining whether a link might be a file?
ah, duh. thanks! :3