Jack Royal-Gordon
2013-Mar-21 05:13 UTC
[Mechanize-users] Problems parsing page encoded in Shift-JIS
I'm posting this question to both mailing lists as I'm not sure whether it's a Mechanize problem or a Nokogiri problem. I'm using Nokogiri and Mechanize to load and parse a web page encoded in Shift-JIS. I have an HTML construct like:

    <head>
      <meta http-equiv="content-type" content="text/html; charset=Shift_JIS">
    </head>
    <body>
    ...
    <ul id="test">
      <li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
      <li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
      <li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
    </ul>
    ...
    </body>

In case it's relevant, the response header has "Content-Type = text/html; charset=Shift_JIS".

I'm trying to parse this with the following Ruby code:

    page = mechanize_agent.get(url)
    # Build one " > "-joined string of link texts per <li>
    list = page.search("ul#test li").map { |item|
      item.search("a").map { |a| a.content }.join(" > ")
    }

I'd expect this to return

    ["abc > def > ghi", "abc > def > ghi", "abc > def > ghi"]

but it returns

    ["abc > def > ghi123 > abc > def > ghi123 > abc > def > ghi123"]

However, if I save page.body, then do

    page = Nokogiri::parse(saved_body)

and repeat the code, it behaves as expected.

This is a simplified example. The actual HTML is (you can get this at "http://www.amazon.co.jp/dp/B006QP63LI"):

    <ul class="zg_hrsr">
      <li class="zg_hrsr_item">
        <span class="zg_hrsr_rank">487?</span>
        <span class="zg_hrsr_ladder">?
          <a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/ref=pd_zg_hrsr_kinc_1_1">Kindle???</a> >
          <a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2275256051/ref=pd_zg_hrsr_kinc_1_2">Kindle?</a> >
          <a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2275259051/ref=pd_zg_hrsr_kinc_1_3">Kindle??</a> >
          <a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2312932051/ref=pd_zg_hrsr_kinc_1_4">Romance</a> >
          <b><a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2312943051/ref=pd_zg_hrsr_kinc_1_5_last">Historical</a></b></span>
      </li>
      <li class="zg_hrsr_item">
        <span class="zg_hrsr_rank">1532?</span>
        <span class="zg_hrsr_ladder">?
          <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/ref=pd_zg_hrsr_fb_2_1">??</a> >
          <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/100925011/ref=pd_zg_hrsr_fb_2_2">Romance</a> >
          <b><a href="http://www.amazon.co.jp/gp/bestsellers/english-books/101338011/ref=pd_zg_hrsr_fb_2_3_last">Historical</a></b></span>
      </li>
      <li class="zg_hrsr_item">
        <span class="zg_hrsr_rank">2613?</span>
        <span class="zg_hrsr_ladder">?
          <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/ref=pd_zg_hrsr_fb_3_1">??</a> >
          <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/93834011/ref=pd_zg_hrsr_fb_3_2">Literature & Fiction</a> >
          <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/95083011/ref=pd_zg_hrsr_fb_3_3">Genre Fiction</a> >
          <b><a href="http://www.amazon.co.jp/gp/bestsellers/english-books/95796011/ref=pd_zg_hrsr_fb_3_4_last">Historical</a></b></span>
      </li>
    </ul>

and the result was:

    Kindle? > Kindle?? > Romance > Historical 1542? ? ?? > Romance > Historical 2627? ? ?? > Literature & Fiction > Genre Fiction > Historical

but when the page was reloaded, I got the expected result:

    ["Kindle?X?g?A > Kindle?{ > Kindle?m?? > Romance > Historical",
     "?m?? > Romance > Historical",
     "?m?? > Literature & Fiction > Genre Fiction > Historical"]