Hi all,

I've been using mechanize for a while and it rocks. The docs are pretty clear, and so far I've been able to get by on my own. However, I'm stuck on a weird problem in a script that downloads my contact list from Hotmail. I used Firebug to check all the URLs, and tested them by hand while logged in via the browser. In the script everything works well until the last 'agent.get_file', which dies with a strange error:

------ snip ------
$ ruby msn-scrap.rb
#<URI::HTTP:0xfdbc850b8 URL:http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true>
"http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
Err: unexpected end of file
Trace:
/usr/lib/ruby/1.8/mechanize.rb:372:in `read'
/usr/lib/ruby/1.8/mechanize.rb:372:in `fetch_page'
/usr/lib/ruby/1.8/net/http.rb:1050:in `request'
/usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
/usr/lib/ruby/1.8/net/http.rb:1049:in `request'
/usr/lib/ruby/1.8/mechanize.rb:345:in `fetch_page'
/usr/lib/ruby/1.8/net/http.rb:543:in `start'
/usr/lib/ruby/1.8/mechanize.rb:339:in `fetch_page'
/usr/lib/ruby/1.8/mechanize.rb:139:in `get'
/usr/lib/ruby/1.8/mechanize.rb:146:in `get_file'
msn-scrap.rb:32
----- snip ------

The relevant part of mech.log:

D, [2007-11-12T12:22:35.925521 #24540] DEBUG -- : request-header: referer => http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
D, [2007-11-12T12:22:36.589708 #24540] DEBUG -- : response-header: cache-control => private,max-age=86400
D, [2007-11-12T12:22:36.589853 #24540] DEBUG -- : response-header: vary => Accept-Encoding
D, [2007-11-12T12:22:36.589934 #24540] DEBUG -- : response-header: connection => keep-alive
D, [2007-11-12T12:22:36.590012 #24540] DEBUG -- : response-header: expires => Wed, 01 Jan 1997 12:00:00 GMT, Wed, 01 Jan 1997 12:00:00 GMT
D, [2007-11-12T12:22:36.590089 #24540] DEBUG -- : response-header: p3p => CP="BUS CUR CONo FIN IVDo ONL OUR PHY SAMo TELo"
D, [2007-11-12T12:22:36.590166 #24540] DEBUG -- : response-header: date => Mon, 12 Nov 2007 14:28:34 GMT
D, [2007-11-12T12:22:36.590241 #24540] DEBUG -- : response-header: xxn => W4
D, [2007-11-12T12:22:36.590344 #24540] DEBUG -- : response-header: content-type => text/csv
D, [2007-11-12T12:22:36.590430 #24540] DEBUG -- : response-header: msnserver => H: BAY124-W4 V: 12.0.1190.927 D: 2007-09-27T23:27:08
D, [2007-11-12T12:22:36.590509 #24540] DEBUG -- : response-header: content-encoding => gzip
D, [2007-11-12T12:22:36.590586 #24540] DEBUG -- : response-header: content-disposition => attachment; filename="WLMContacts.csv"
D, [2007-11-12T12:22:36.590663 #24540] DEBUG -- : response-header: server => Microsoft-IIS/6.0
D, [2007-11-12T12:22:36.590738 #24540] DEBUG -- : response-header: content-length => 4285
D, [2007-11-12T12:22:36.591732 #24540] DEBUG -- : gunzip body

I've tried some ugly hacks, such as altering headers and so on (BTW, how do I change request headers without inheriting from WWW::Mechanize?), but with no result.

Am I doing something wrong? It seems to me that the server gzip-encodes the file (Firebug shows it too), but Mechanize hits a weird error while trying to fetch it. Any ideas?

I did another contact scrape for GMail and it worked wonders. There is a post of mine at http://zenmachine.wordpress.com where I show how to use Firebug and Mechanize to find the right URLs.

Best regards, and keep up the excellent work.

---- msn-scrap.rb ----

#!/usr/bin/env ruby
# download msn contacts

require 'rubygems'
require 'mechanize'
require 'logger'

begin
  agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
  agent.user_agent_alias = "Windows IE 6"

  page = agent.get("https://login.live.com/login.srf")

  form = page.forms.name("f1").first
  form.login  = 'user'
  form.passwd = 'pass'

  page = agent.submit(form)

  pageContact = agent.get('http://g.live.com/1MBAMen-us/sc_mail')
  p pageContact.uri

  baseURL    = pageContact.uri.host
  contactURL = 'http://' + baseURL + '/mail/GetContacts.aspx'
  p contactURL

  page = agent.get_file(contactURL)
  p page

  if page.code == '200'
    puts "saving contacts.csv"
    page.save_as('contacts_msn.csv')
  else
    puts "error downloading contacts"
  end
rescue
  puts "Err: " + $!.to_s
  puts "Trace:"
  $@.each { |tl| puts tl }
end

--
More cowbell, please !
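[Editorial note: one possible workaround while the automatic decompression fails is to take the raw gzipped body and gunzip it yourself with the stdlib's Zlib. This is a minimal, self-contained sketch of that step; the CSV content here is made up, and in practice `body` would come from the HTTP response instead of being compressed in memory.]

```ruby
require 'zlib'
require 'stringio'

# Stand-in for the gzip-compressed CSV body the server returns;
# here we compress a made-up CSV in memory just to have bytes to work with.
csv = "Name,Email\nAlice,alice@example.com\n"
buf = StringIO.new
gz  = Zlib::GzipWriter.new(buf)
gz.write(csv)
gz.close
body = buf.string

# Decompress the raw body ourselves instead of relying on
# Mechanize's built-in "gunzip body" handling.
decompressed = Zlib::GzipReader.new(StringIO.new(body)).read
puts decompressed
```

If `get_file` keeps raising before the body is fully read, this only helps once you can obtain the raw bytes some other way, but the gunzip step itself is this simple.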
Mike Mondragon
2007-Nov-14 08:18 UTC
[Mechanize-users] Weird error downloading a gzip'ed file
On Nov 12, 2007 6:33 AM, gmoraes <gsmoraes2 at gmail.com> wrote:
> Hi all,
>
> I've been using mechanize for a while and it rocks. [...]
> However, I'm stuck in a weird situation in a script to download my
> contact list from hotmail.
> [...]
> Err: unexpected end of file
> [...]

gmoraes,

Even though the "Scraping AOL Webmail to login and fetch contacts?" thread is about scraping contacts from AOL, it might prove helpful for your problem with Hotmail.

However, we do have Hotmail solved. The Blackbook gem will be released shortly; it scrapes contacts from GMail, Hotmail, AOL, and Yahoo! and returns them through a convenient interface that your application can use. I'll post a note to the Mechanize list when Blackbook is released.

Thanks

Mike

--
Mike Mondragon
Work> http://sas.quat.ch/
Blog> http://blog.mondragon.cc/
Small URLs> http://hurl.it/
Mike Mondragon
2008-Feb-04 20:06 UTC
[Mechanize-users] Weird error downloading a gzip'ed file
On 11/12/07, gmoraes <gsmoraes2 at gmail.com> wrote:
> Hi all,
>
> I've been using mechanize for a while and it rocks. [...]
> However, I'm stuck in a weird situation in a script to download my
> contact list from hotmail.
> [...]
> Err: unexpected end of file
> [...]

I just wanted to follow up that I ran into this same issue when scraping Hotmail. There is a form on /mail/options.aspx?subsection=26&n=XXXXX that, when posted, returns a CSV file of your contacts; the response carries a Content-Disposition attachment header with a content type of text/csv. But when you mimic that interaction with Mechanize, the underlying Net::HTTP reads a number of bytes and then unexpectedly raises an EOF exception.

Anyway, Hotmail also pretties up the same CSV as HTML on this page: /mail/PrintShell.aspx?type=contact. Mechanize can fetch that without any problems, and you can then use Hpricot to get at the contact attributes. That is how the Blackbook gem is handling Hotmail.

Blackbook gem: http://rubyforge.org/frs/?group_id=4311

--
Mike Mondragon
Work> http://sas.quat.ch/
Blog> http://blog.mondragon.cc/
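[Editorial note: for anyone wanting to try the PrintShell route by hand, the idea is just to fetch that page and walk its contact rows. The markup below is invented, since the real page structure isn't shown in this thread, and the stdlib's REXML is used in place of Hpricot so the snippet stands alone; with Hpricot the selectors would look different but the extraction is the same shape.]

```ruby
require 'rexml/document'

# Hypothetical markup standing in for /mail/PrintShell.aspx?type=contact.
# The real page's structure is unknown here, so this is illustrative only.
html = <<-HTML
<table>
  <tr><td>Alice</td><td>alice@example.com</td></tr>
  <tr><td>Bob</td><td>bob@example.com</td></tr>
</table>
HTML

doc = REXML::Document.new(html)

# Walk each table row and pull the cell text into a contact hash.
contacts = doc.elements.to_a('//tr').map do |row|
  cells = row.elements.to_a('td').map { |td| td.text }
  { :name => cells[0], :email => cells[1] }
end

contacts.each { |c| puts "#{c[:name]} <#{c[:email]}>" }
```

In a real scrape the `html` string would be `agent.get('/mail/PrintShell.aspx?type=contact').body`, and the XPath would need adjusting to whatever Hotmail actually emits.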
Aaron Patterson
2008-Feb-04 20:28 UTC
[Mechanize-users] Weird error downloading a gzip'ed file
On Mon, Feb 04, 2008 at 12:06:50PM -0800, Mike Mondragon wrote:
> I just wanted to follow up that I ran into this same issue when
> scraping Hotmail. [...] But when you mimic that interaction with
> Mechanize, the underlying Net::HTTP reads a number of bytes and then
> unexpectedly raises an EOF exception.
> [...]
> Blackbook gem: http://rubyforge.org/frs/?group_id=4311

I think I've finally tracked down this error (thanks to postmodern). It's a bug in net/http. I've submitted a patch for Ruby here:

http://rubyforge.org/tracker/index.php?func=detail&aid=17778&group_id=426&atid=1700

And I'll add a monkey patch to Mechanize to fix this in 0.7.1.

--
Aaron Patterson
http://tenderlovemaking.com/
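[Editorial note: until the net/http patch or the 0.7.1 monkey patch lands, another angle is to side-step the buggy gzip path entirely by asking the server not to compress. Whether Hotmail honors `Accept-Encoding: identity` is not verified in this thread, but the request-side change is just a header, sketched here with bare Net::HTTP; no request is actually sent.]

```ruby
require 'net/http'
require 'uri'

uri = URI.parse('http://by124w.bay124.mail.live.com/mail/GetContacts.aspx')

# Ask for an uncompressed body so Net::HTTP never hits its gzip code path.
req = Net::HTTP::Get.new(uri.request_uri)
req['Accept-Encoding'] = 'identity'

puts req['Accept-Encoding']   # prints "identity"

# The request would then be sent with something like:
#   res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```

The server is free to ignore the header and gzip anyway (the Vary: Accept-Encoding header in the log suggests it does pay attention to it), so treat this as a thing to try rather than a guaranteed fix.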