Robert Poor
2011-Dec-20 18:27 UTC
[Mechanize-users] Mechanize GETting twice without redirect?
[Cross-posted to the Ruby on Rails Forum and the Mechanize mailing list.]

I'm using Mechanize for page scraping (Ruby 1.9.2 / Rails 3.0.5 / Mechanize 2.0.1). I'm seeing a case where a single agent.get(url) generates two HTTP GETs. Why is this happening? The response to the first GET is a 200 (no redirect) and doesn't contain a meta-refresh. I don't see why Mechanize is issuing the second GET (which happens to fail with an EOFError caused by a Content-Length / body length mismatch).

Details: I'm using the nifty Charles web proxy debugger to monitor browser / server interactions.

==== In the original browser + server exchange, I see:

Req: POST /login/Login HTTP/1.1
Rsp: sets two cookies + HTTP/1.1 302 Moved Temporarily => https://online.nationalgridus.com/eservice_enu/

Req: GET /eservice_enu/ HTTP/1.1
Rsp: sets a cookie + HTTP/1.1 200 OK
     The body contains onLoad Javascript to set this.location = 'start.swe?SWECmd=Start'

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

==== In the Mechanize + server exchange:

My code: page2 = agent.submit(login_form)

Req: POST /login/Login HTTP/1.1
Rsp: sets two cookies + HTTP/1.1 302 Moved Temporarily => https://online.nationalgridus.com/eservice_enu/

Req: GET /eservice_enu/ HTTP/1.1
Rsp: sets a cookie + HTTP/1.1 200 OK
     The body contains onLoad Javascript to set this.location = 'start.swe?SWECmd=Start', but Mechanize can't follow that automatically. So I do an agent.get() to emulate it:

My code: page3 = agent.get("https://online.nationalgridus.com/eservice_enu/start.swe?SWECmd=Start")

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

Note that at this point the user-driven and Mechanize-driven interactions appear to be identical. But Mechanize then appears to generate another GET all by itself:

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

... and this response throws an EOFError:

Content-Length (536) does not match response body length (524) - EOFError

==== So:

Why did Mechanize generate that last GET without me asking it to? Was the EOFError actually raised by the first GET, with Mechanize then doing a retry? If so, how do I work around the length mismatch?
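
For reference, the manual emulation of the onLoad redirect described above could be sketched roughly as follows. This is only an illustration using Ruby's standard library: the helper names (onload_redirect_target, resolve_redirect) and the sample body are made up for this sketch, since the real page's markup is not shown here, and the regex assumes the Javascript assigns a quoted string to location.

```ruby
require "uri"

# Hypothetical helper: pull the target of an onLoad "location = '...'"
# assignment out of an HTML body. Returns nil when nothing matches.
def onload_redirect_target(html)
  match = html.match(/location\s*=\s*['"]([^'"]+)['"]/)
  match && match[1]
end

# Hypothetical helper: resolve the (possibly relative) target against
# the URL of the page that contained it.
def resolve_redirect(base_url, html)
  target = onload_redirect_target(html)
  target && URI.join(base_url, target).to_s
end

# Made-up body for illustration only (assumption, not the real page).
sample_body = %q{<body onLoad="this.location = 'start.swe?SWECmd=Start'"></body>}
base = "https://online.nationalgridus.com/eservice_enu/"

next_url = resolve_redirect(base, sample_body)
puts next_url
```

The resulting URL is what would then be passed to agent.get(next_url) in place of the hard-coded string; when no assignment is found the helper returns nil, so the caller can skip the extra GET.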