Andrew Stewart
2006-Dec-07 14:21 UTC
[Mechanize-users] Response To Form Submission Hanging
Hello, I am using Mechanize to post a form to a website. When I do this by hand in my browser the response takes about 35s to come back (it''s a long page full of tables and graphics). When I do this with Mechanize, the server starts to respond and then appears to hang. The obvious conclusion is that my code is wrong but I am reasonably sure that I haven''t altered it since it was working earlier in the week (famous last words!). And I don''t know how to observe exactly what my browser sends to the server to compare with my Mechanize log. What''s the best way to debug the problem? In fact, more importantly, what''s the problem? :) Here''s some of the evidence I have gathered: The log of the response: D, [2006-12-07T14:04:29.213035 #374] DEBUG -- : response-header: content-type => text/html; charset=utf-8 D, [2006-12-07T14:04:29.213337 #374] DEBUG -- : response-header: date => Thu, 07 Dec 2006 14:04:27 GMT D, [2006-12-07T14:04:29.213465 #374] DEBUG -- : response-header: server => Apache D, [2006-12-07T14:04:29.213561 #374] DEBUG -- : response-header: set- cookie => backend=web-2-06; path=/; expires=Fri, 08-Dec-2006 14:04:28 GMT D, [2006-12-07T14:04:29.213657 #374] DEBUG -- : response-header: transfer-encoding => chunked D, [2006-12-07T14:04:29.214291 #374] DEBUG -- : saved cookie: backend=web-2-06 (Let me know if it would help to see the log of the request.) The stack trace produced when I interrupt my code after several minutes: ^C/usr/local/lib/ruby/1.8/net/protocol.rb:133:in `sysread'': Interrupt from /usr/local/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'' from /usr/local/lib/ruby/1.8/timeout.rb:56:in `timeout'' from /usr/local/lib/ruby/1.8/timeout.rb:76:in `timeout'' from /usr/local/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'' from /usr/local/lib/ruby/1.8/net/protocol.rb:86:in `read'' from /usr/local/lib/ruby/1.8/net/http.rb:2200:in `read_chunked'' from /usr/local/lib/ruby/1.8/net/http.rb:2175:in `read_body_0'' from /usr/local/lib/ruby/1.8/net/http.rb:2141:in `read_body'' ... 7 levels... from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/ mechanize.rb:265:in `post_form'' from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/ mechanize.rb:201:in `submit'' from lib/grabber.rb:49 from lib/grabber.rb:28 And some of my code: ... 21 agent.read_timeout = 120 # clutching at straws ... 45 form = page.forms[1] 46 form.checkboxes.each { |c| c.checked = true } 47 options.merge(''stage'' => ''refine'').each { |k,v| form.add_field! (k, v) } 48 puts "about to submit at #{Time.now}" 49 page = agent.submit(form) 50 puts "submitted at #{Time.now}" ... The ''about to submit...'' statement is written out to the console. The ''submitted at...'' line isn''t. I''ve left it running for up to 5 minutes. Any help would be much appreciated. Thanks and regards, Andy Stewart
On Dec 7, 2006, at 9:21 AM, Andrew Stewart wrote:> I am using Mechanize to post a form to a website. When I do this by > hand in my browser the response takes about 35s to come back (it''s a > long page full of tables and graphics). When I do this with > Mechanize, the server starts to respond and then appears to hang.I don''t have any specific ideas based off of your logs. But this problem reminded me that I''ve seen Net::HTTP (specifically mongrel and webrick, but they both use it) do some non-standard things and trip up browsers. Maybe you''re seeing something similar? Personally I would use Charles to debug it to see if there are any marked differences between what the browser is doing and what mechanize is doing. It''s shareware, so you can only run it for like 30 minutes at a time. But I find that''s long enough to debug most issues. http://www.xk72.com/charles/ Good luck! Maybe Aaron will have some more ideas for you. -Mat
Andrew Stewart
2006-Dec-08 09:46 UTC
[Mechanize-users] Response To Form Submission Hanging
Mat,> I don''t have any specific ideas based off of your logs. But this > problem reminded me that I''ve seen Net::HTTP (specifically mongrel > and webrick, but they both use it) do some non-standard things and > trip up browsers. Maybe you''re seeing something similar?Intriguing....> Personally I would use Charles to debug it to see if there are any > marked differences between what the browser is doing and what > mechanize is doing. It''s shareware, so you can only run it for like > 30 minutes at a time. But I find that''s long enough to debug most > issues.Thanks for the pointer to Charles -- I didn''t know of it and it looks useful.> Good luck! Maybe Aaron will have some more ideas for you.Let''s hope! Thanks and regards, Andy Stewart
Aaron Patterson
2006-Dec-08 23:20 UTC
[Mechanize-users] Response To Form Submission Hanging
Hi Andrew, On Thu, Dec 07, 2006 at 02:21:04PM +0000, Andrew Stewart wrote:> Hello, > > I am using Mechanize to post a form to a website. When I do this by > hand in my browser the response takes about 35s to come back (it''s a > long page full of tables and graphics). When I do this with > Mechanize, the server starts to respond and then appears to hang.How long exactly is the page? In the mega-bytes?> > The obvious conclusion is that my code is wrong but I am reasonably > sure that I haven''t altered it since it was working earlier in the > week (famous last words!). And I don''t know how to observe exactly > what my browser sends to the server to compare with my Mechanize log. > > What''s the best way to debug the problem? In fact, more importantly, > what''s the problem? :) > > Here''s some of the evidence I have gathered: > > The log of the response: > > D, [2006-12-07T14:04:29.213035 #374] DEBUG -- : response-header: > content-type => text/html; charset=utf-8 > D, [2006-12-07T14:04:29.213337 #374] DEBUG -- : response-header: date > => Thu, 07 Dec 2006 14:04:27 GMT > D, [2006-12-07T14:04:29.213465 #374] DEBUG -- : response-header: > server => Apache > D, [2006-12-07T14:04:29.213561 #374] DEBUG -- : response-header: set- > cookie => backend=web-2-06; path=/; expires=Fri, 08-Dec-2006 14:04:28 > GMT > D, [2006-12-07T14:04:29.213657 #374] DEBUG -- : response-header: > transfer-encoding => chunked > D, [2006-12-07T14:04:29.214291 #374] DEBUG -- : saved cookie: > backend=web-2-06 > > (Let me know if it would help to see the log of the request.) > > The stack trace produced when I interrupt my code after several minutes: > > ^C/usr/local/lib/ruby/1.8/net/protocol.rb:133:in `sysread'': Interrupt > from /usr/local/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'' > from /usr/local/lib/ruby/1.8/timeout.rb:56:in `timeout'' > from /usr/local/lib/ruby/1.8/timeout.rb:76:in `timeout'' > from /usr/local/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'' > from /usr/local/lib/ruby/1.8/net/protocol.rb:86:in `read'' > from /usr/local/lib/ruby/1.8/net/http.rb:2200:in `read_chunked'' > from /usr/local/lib/ruby/1.8/net/http.rb:2175:in `read_body_0'' > from /usr/local/lib/ruby/1.8/net/http.rb:2141:in `read_body'' > ... 7 levels... > from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/ > mechanize.rb:265:in `post_form'' > from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/ > mechanize.rb:201:in `submit'' > from lib/grabber.rb:49 > from lib/grabber.rb:28I''ve seen stack traces similar to this when you get read timeouts. If the server is slow, that may be your problem..... Also, there is sort of a bug in ruby net/http where large file will take a long time to download and max out your CPU. Try adding the following snippet and see if that helps. Note that it will only work for ruby versions > 1.8.2. class Net::InternetMessageIO #:nodoc: alias :old_rbuf_fill :rbuf_fill def rbuf_fill begin @rbuf << @io.read_nonblock(65536) rescue Errno::EWOULDBLOCK if IO.select([@io], nil, nil, @read_timeout) @rbuf << @io.read_nonblock(65536) else raise Timeout::TimeoutError end end end end If you can send a short sample that will reproduce the problem, I can help more. -- Aaron Patterson http://tenderlovemaking.com/
Andrew Stewart
2006-Dec-11 10:15 UTC
[Mechanize-users] Response To Form Submission Hanging
Hi Aaron, On 8 Dec 2006, at 23:20, Aaron Patterson wrote:> How long exactly is the page? In the mega-bytes?Just under 3.2 MB.> I''ve seen stack traces similar to this when you get read timeouts. If > the server is slow, that may be your problem.....I wouldn''t call the server fast but it seems good enough. When I do what the script does myself in the browser, the page comes back in under a minute.> Also, there is sort of a bug in ruby net/http where large file will > take a long time to download and max out your CPU. Try adding the > following snippet and see if that helps. Note that it will only work > for ruby versions > 1.8.2.Thanks for the snippet. I''m on Ruby 1.8.4 and unfortunately the snippet didn''t appear to make any difference.> If you can send a short sample that will reproduce the problem, I can > help more.I''ve pasted my code below. The part which hangs is the final line using Mechanize: "page = agent.submit(form)". ---- require ''rubygems'' require ''mechanize'' require ''logger'' class Net::InternetMessageIO #:nodoc: alias :old_rbuf_fill :rbuf_fill def rbuf_fill begin @rbuf << @io.read_nonblock(65536) rescue Errno::EWOULDBLOCK if IO.select([@io], nil, nil, @read_timeout) @rbuf << @io.read_nonblock(65536) else raise Timeout::TimeoutError end end end end class SangerParser < WWW::Mechanize::Page def initialize(uri = nil, response = nil, body = nil, code = nil) # Ditch any runs of table data which contain only and at least 3 "<td>" or "</td>"s body.gsub!(/(<\/?td>){3,}/, '''') # Ditch any title attributes which equal nothing at all body.gsub!(/ title=>/i, ''>'') super(uri, response, body, code) end end logger = Logger.new(''mech.log'') logger.level = Logger::DEBUG agent = WWW::Mechanize.new { |a| a.log = logger } agent.pluggable_parser.html = SangerParser agent.user_agent_alias = ''Mac Safari'' agent.read_timeout = 120 # Does this make any difference? # Tissue selection screen url = ''http://www.sanger.ac.uk/cgi-bin/genetics/CGP/genotyping/lohmap'' page = agent.get(url) # Just choose one tissue for now #tissues = page.forms[1].fields.name(''site_1'').options[1..-1].map { | o| o.value } tissues = %w( Colorectal ) tissues.each { |tissue| # chrom: chromosome to display # site_1: tissue to display # freqval: number of consecutive homozygous markers to colour (1-6) # failmark: extend homozygous markup (yes = 1, no = 0) # sort: sort the cell lines (by loss [high to low] = 1, by cell line [A-Z] = 0) options = { ''site_1'' => tissue, ''chrom'' => ''1'', ''freqval'' => ''1'', ''failmark'' => ''0'', ''sort'' => ''1'' } # Post directly without going via the tissue selection form page = agent.post(url, options.merge(''stage'' => ''display'')) # LOH data selection screen form = page.forms[1] form.checkboxes.each { |c| c.checked = true } options.merge(''stage'' => ''refine'').each { |k, v| form.add_field! (k, v) } puts "about to submit at #{Time.now}" page = agent.submit(form) puts "submitted at #{Time.now}" # Write out results page to disk open("data/#{tissue.downcase}.html", ''w'') { |f| f << page.body } } ---- Thanks for your help, Andy Stewart
Aaron Patterson
2006-Dec-13 03:43 UTC
[Mechanize-users] Response To Form Submission Hanging
On Mon, Dec 11, 2006 at 10:15:40AM +0000, Andrew Stewart wrote:> Hi Aaron, > > On 8 Dec 2006, at 23:20, Aaron Patterson wrote: > > > How long exactly is the page? In the mega-bytes? > > Just under 3.2 MB. > > > I''ve seen stack traces similar to this when you get read timeouts. If > > the server is slow, that may be your problem..... > > I wouldn''t call the server fast but it seems good enough. When I do > what the script does myself in the browser, the page comes back in > under a minute. > > > Also, there is sort of a bug in ruby net/http where large file will > > take a long time to download and max out your CPU. Try adding the > > following snippet and see if that helps. Note that it will only work > > for ruby versions > 1.8.2. > > Thanks for the snippet. I''m on Ruby 1.8.4 and unfortunately the > snippet didn''t appear to make any difference. > > > If you can send a short sample that will reproduce the problem, I can > > help more.[snip] I think I''ve solved the problem. I added some extra debugging to mechanize and found that the server was returning too much data! I think it was caught in some sort of infinite loop. Mechanize downloaded over 100 megs of data before the socket read timed out. Basically the problem was sending the "sort" argument to the server. When I removed that, things seemed to work OK. I found using LiveHTTPHeaders for Firefox that the form you submit isn''t supposed to send a "sort" argument. I noticed that the html returned is not clean, so I modified your pluggable parser to use Tidy. This let Hpricot parse everything properly and the form object seemed to be populated correctly. Anyway, here is my version of your script (you''ll need to verify it is returning the correct data from the form): require ''rubygems'' require ''mechanize'' require ''logger'' require ''tidy'' # Change this depending on your system Tidy.path = ''/usr/lib/libtidy.dylib'' class SangerParser < WWW::Mechanize::Page def initialize(uri = nil, response = nil, body = nil, code = nil) xml = Tidy.open() do |tidy| tidy.options.output_xml = true xml = tidy.clean(body) xml end super(uri, response, xml, code) end end logger = Logger.new(''mech.log'') logger.level = Logger::DEBUG agent = WWW::Mechanize.new { |a| a.log = logger } agent.pluggable_parser.html = SangerParser agent.user_agent_alias = ''Mac Safari'' agent.read_timeout = 120 # Does this make any difference? # Tissue selection screen url = ''http://www.sanger.ac.uk/cgi-bin/genetics/CGP/genotyping/lohmap'' page = agent.get(url) # Just choose one tissue for now tissues = page.forms[1].fields.name(''site_1'').options[1..-1].map { |o| o.value } tissues = %w( Colorectal ) tissues.each { |tissue| # chrom: chromosome to display # site_1: tissue to display # freqval: number of consecutive homozygous markers to colour (1-6) # failmark: extend homozygous markup (yes = 1, no = 0) # sort: sort the cell lines (by loss [high to low] = 1, by cell line [A-Z] = 0) options = { ''site_1'' => tissue, ''chrom'' => ''1'', ''freqval'' => ''1'', ''failmark'' => ''0'', ''sort'' => ''1'' } # Post directly without going via the tissue selection form page = agent.post(url, options.merge(''stage'' => ''display'')) # LOH data selection screen form = page.forms[1] form.checkboxes.each { |c| c.checked = true } puts "about to submit at #{Time.now}" page = agent.submit(form, form.buttons.first) puts "submitted at #{Time.now}" # Write out results page to disk open("data/#{tissue.downcase}.html", ''w'') { |f| f << page.body } } -- Aaron Patterson http://tenderlovemaking.com/
Andrew Stewart
2006-Dec-13 17:12 UTC
[Mechanize-users] Response To Form Submission Hanging
Hi Aaron, On 13 Dec 2006, at 03:43, Aaron Patterson wrote:> I think I''ve solved the problem. I added some extra debugging to > mechanize and found that the server was returning too much data! I > think it was caught in some sort of infinite loop. Mechanize > downloaded > over 100 megs of data before the socket read timed out. > > Basically the problem was sending the "sort" argument to the server. > When I removed that, things seemed to work OK. I found using > LiveHTTPHeaders for Firefox that the form you submit isn''t supposed to > send a "sort" argument.It works perfectly now. Thank you! I never thought that the "sort" argument was causing the problem. Well spotted.> I noticed that the html returned is not clean, so I modified your > pluggable > parser to use Tidy. This let Hpricot parse everything properly and > the > form object seemed to be populated correctly.Using Tidy sounds much better than laboriously and erroneously doing my own tidying. And as you say, it allows the form object to be populated correctly. Super!> Anyway, here is my version of your script (you''ll need to verify it is > returning the correct data from the form):I''ve checked the data and it''s 100% accurate. Thanks again for your help -- I am really grateful. And I''ve learned a few things along the way too :) Regards, Andy Stewart