Hi all,

I've been using mechanize for a while and it rocks. The docs are pretty clear, and so far I've been able to get by on my own. However, I'm stuck on a weird problem in a script that downloads my contact list from Hotmail. I used Firebug to check all the URLs, and tested them by hand while logged in via the browser. In the script everything works well until the last 'agent.get_file', which dies with a strange error:

------ snip ------
$ ruby msn-scrap.rb
#<URI::HTTP:0xfdbc850b8 URL:http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true>
"http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
Err: unexpected end of file
Trace:
/usr/lib/ruby/1.8/mechanize.rb:372:in `read'
/usr/lib/ruby/1.8/mechanize.rb:372:in `fetch_page'
/usr/lib/ruby/1.8/net/http.rb:1050:in `request'
/usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
/usr/lib/ruby/1.8/net/http.rb:1049:in `request'
/usr/lib/ruby/1.8/mechanize.rb:345:in `fetch_page'
/usr/lib/ruby/1.8/net/http.rb:543:in `start'
/usr/lib/ruby/1.8/mechanize.rb:339:in `fetch_page'
/usr/lib/ruby/1.8/mechanize.rb:139:in `get'
/usr/lib/ruby/1.8/mechanize.rb:146:in `get_file'
msn-scrap.rb:32
----- snip ------

The relevant part of mech.log:

D, [2007-11-12T12:22:35.925521 #24540] DEBUG -- : request-header: referer => http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
D, [2007-11-12T12:22:36.589708 #24540] DEBUG -- : response-header: cache-control => private,max-age=86400
D, [2007-11-12T12:22:36.589853 #24540] DEBUG -- : response-header: vary => Accept-Encoding
D, [2007-11-12T12:22:36.589934 #24540] DEBUG -- : response-header: connection => keep-alive
D, [2007-11-12T12:22:36.590012 #24540] DEBUG -- : response-header: expires => Wed, 01 Jan 1997 12:00:00 GMT, Wed, 01 Jan 1997 12:00:00 GMT
D, [2007-11-12T12:22:36.590089 #24540] DEBUG -- : response-header: p3p => CP="BUS CUR CONo FIN IVDo ONL OUR PHY SAMo TELo"
D, [2007-11-12T12:22:36.590166 #24540] DEBUG -- : response-header: date => Mon, 12 Nov 2007 14:28:34 GMT
D, [2007-11-12T12:22:36.590241 #24540] DEBUG -- : response-header: xxn => W4
D, [2007-11-12T12:22:36.590344 #24540] DEBUG -- : response-header: content-type => text/csv
D, [2007-11-12T12:22:36.590430 #24540] DEBUG -- : response-header: msnserver => H: BAY124-W4 V: 12.0.1190.927 D: 2007-09-27T23:27:08
D, [2007-11-12T12:22:36.590509 #24540] DEBUG -- : response-header: content-encoding => gzip
D, [2007-11-12T12:22:36.590586 #24540] DEBUG -- : response-header: content-disposition => attachment; filename="WLMContacts.csv"
D, [2007-11-12T12:22:36.590663 #24540] DEBUG -- : response-header: server => Microsoft-IIS/6.0
D, [2007-11-12T12:22:36.590738 #24540] DEBUG -- : response-header: content-length => 4285
D, [2007-11-12T12:22:36.591732 #24540] DEBUG -- : gunzip body

I've tried some ugly hacks, such as altering headers and so on (BTW, how do I change request headers without inheriting from WWW::Mechanize?), but with no result.

Am I doing something wrong? It seems to me that the server gzip-encodes the file (Firebug shows it too), but Mechanize hits a weird error while trying to fetch it. Any ideas?

I did another contact scrape for GMail and it worked wonders. There is a post of mine at http://zenmachine.wordpress.com where I show how to use Firebug and Mechanize to find the right URLs.

Best regards, and keep up the excellent work.

---- msn-scrap.rb ----

#!/usr/bin/env ruby
# download msn contacts

require 'rubygems'
require 'mechanize'
require 'logger'

begin
  agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
  agent.user_agent_alias = "Windows IE 6"

  page = agent.get("https://login.live.com/login.srf")

  form = page.forms.name("f1").first
  form.login  = 'user'
  form.passwd = 'pass'

  page = agent.submit(form)

  pageContact = agent.get('http://g.live.com/1MBAMen-us/sc_mail')
  p pageContact.uri

  baseURL    = pageContact.uri.host
  contactURL = 'http://' + baseURL + '/mail/GetContacts.aspx'
  p contactURL

  page = agent.get_file(contactURL)
  p page

  if page.code == '200'
    puts "saving contacts.csv"
    page.save_as('contacts_msn.csv')
  else
    puts "error downloading contacts"
  end
rescue
  puts "Err: " + $!.to_s
  puts "Trace:"
  $@.each { |tl| puts tl }
end

--
More cowbell, please !
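[Editorial note: one possible workaround while the automatic decompression fails is to take the raw gzipped body and gunzip it yourself with the stdlib's Zlib. This is a minimal, self-contained sketch of that step; the CSV content here is made up, and in practice `body` would come from the HTTP response instead of being compressed in memory.]

```ruby
require 'zlib'
require 'stringio'

# Stand-in for the gzip-compressed CSV body the server returns;
# here we compress a made-up CSV in memory just to have bytes to work with.
csv = "Name,Email\nAlice,alice@example.com\n"
buf = StringIO.new
gz  = Zlib::GzipWriter.new(buf)
gz.write(csv)
gz.close
body = buf.string

# Decompress the raw body ourselves instead of relying on
# Mechanize's built-in "gunzip body" handling.
decompressed = Zlib::GzipReader.new(StringIO.new(body)).read
puts decompressed
```

If `get_file` keeps raising before the body is fully read, this only helps once you can obtain the raw bytes some other way, but the gunzip step itself is this simple.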
Mike Mondragon
2007-Nov-14 08:18 UTC
[Mechanize-users] Weird error downloading a gzip'ed file
On Nov 12, 2007 6:33 AM, gmoraes <gsmoraes2 at gmail.com> wrote:
> Hi all,
>
> I've been using mechanize for a while and it rocks. [...]
> However, I'm stuck in a weird situation in a script to download my
> contact list from hotmail.
> [...]
> Err: unexpected end of file
> [...]

gmoraes,

Even though the "Scraping AOL Webmail to login and fetch contacts?" thread is about scraping contacts from AOL, it might prove helpful for your problem with Hotmail.

However, we do have Hotmail solved. The Blackbook gem will be released shortly; it scrapes contacts from GMail, Hotmail, AOL, and Yahoo! and returns them through a convenient interface that your application can use. I'll post a note to the Mechanize list when Blackbook is released.

Thanks

Mike

--
Mike Mondragon
Work> http://sas.quat.ch/
Blog> http://blog.mondragon.cc/
Small URLs> http://hurl.it/
Mike Mondragon
2008-Feb-04 20:06 UTC
[Mechanize-users] Weird error downloading a gzip'ed file
On 11/12/07, gmoraes <gsmoraes2 at gmail.com> wrote:
> Hi all,
>
> I've been using mechanize for a while and it rocks. [...]
> However, I'm stuck in a weird situation in a script to download my
> contact list from hotmail.
> [...]
> Err: unexpected end of file
> [...]

I just wanted to follow up that I ran into this same issue when scraping Hotmail. There is a form on /mail/options.aspx?subsection=26&n=XXXXX that, when posted, returns a CSV file of your contacts; the response carries a Content-Disposition attachment header with a content type of text/csv. But when you mimic that interaction with Mechanize, the underlying Net::HTTP reads a number of bytes and then unexpectedly raises an EOF exception.

Anyway, Hotmail also pretties up the same CSV as HTML on this page: /mail/PrintShell.aspx?type=contact. Mechanize can fetch that without any problems, and you can then use Hpricot to get at the contact attributes. That is how the Blackbook gem is handling Hotmail.

Blackbook gem: http://rubyforge.org/frs/?group_id=4311

--
Mike Mondragon
Work> http://sas.quat.ch/
Blog> http://blog.mondragon.cc/
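[Editorial note: for anyone wanting to try the PrintShell route by hand, the idea is just to fetch that page and walk its contact rows. The markup below is invented, since the real page structure isn't shown in this thread, and the stdlib's REXML is used in place of Hpricot so the snippet stands alone; with Hpricot the selectors would look different but the extraction is the same shape.]

```ruby
require 'rexml/document'

# Hypothetical markup standing in for /mail/PrintShell.aspx?type=contact.
# The real page's structure is unknown here, so this is illustrative only.
html = <<-HTML
<table>
  <tr><td>Alice</td><td>alice@example.com</td></tr>
  <tr><td>Bob</td><td>bob@example.com</td></tr>
</table>
HTML

doc = REXML::Document.new(html)

# Walk each table row and pull the cell text into a contact hash.
contacts = doc.elements.to_a('//tr').map do |row|
  cells = row.elements.to_a('td').map { |td| td.text }
  { :name => cells[0], :email => cells[1] }
end

contacts.each { |c| puts "#{c[:name]} <#{c[:email]}>" }
```

In a real scrape the `html` string would be `agent.get('/mail/PrintShell.aspx?type=contact').body`, and the XPath would need adjusting to whatever Hotmail actually emits.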
Aaron Patterson
2008-Feb-04 20:28 UTC
[Mechanize-users] Weird error downloading a gzip'ed file
On Mon, Feb 04, 2008 at 12:06:50PM -0800, Mike Mondragon wrote:
> I just wanted to follow up that I ran into this same issue when
> scraping Hotmail. [...] But when you mimic that interaction with
> Mechanize, the underlying Net::HTTP reads a number of bytes and then
> unexpectedly raises an EOF exception.
> [...]
> Blackbook gem: http://rubyforge.org/frs/?group_id=4311

I think I've finally tracked down this error (thanks to postmodern). It's a bug in net/http. I've submitted a patch for Ruby here:

http://rubyforge.org/tracker/index.php?func=detail&aid=17778&group_id=426&atid=1700

And I'll add a monkey patch to Mechanize to fix this in 0.7.1.

--
Aaron Patterson
http://tenderlovemaking.com/
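[Editorial note: until the net/http patch or the 0.7.1 monkey patch lands, another angle is to side-step the buggy gzip path entirely by asking the server not to compress. Whether Hotmail honors `Accept-Encoding: identity` is not verified in this thread, but the request-side change is just a header, sketched here with bare Net::HTTP; no request is actually sent.]

```ruby
require 'net/http'
require 'uri'

uri = URI.parse('http://by124w.bay124.mail.live.com/mail/GetContacts.aspx')

# Ask for an uncompressed body so Net::HTTP never hits its gzip code path.
req = Net::HTTP::Get.new(uri.request_uri)
req['Accept-Encoding'] = 'identity'

puts req['Accept-Encoding']   # prints "identity"

# The request would then be sent with something like:
#   res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```

The server is free to ignore the header and gzip anyway (the Vary: Accept-Encoding header in the log suggests it does pay attention to it), so treat this as a thing to try rather than a guaranteed fix.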