Jack Royal-Gordon
2013-Mar-21 05:13 UTC
[Mechanize-users] Problems parsing page encoded in Shift-JIS
I'm posting this question to both mailing lists because I'm not
sure whether it's a Mechanize problem or a Nokogiri problem.
I'm using Nokogiri and Mechanize to load and parse a web page encoded in
Shift-JIS. The HTML looks like this:
<head>
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS">
</head>
<body>
...
<ul id="test">
<li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
<li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
<li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
</ul>
...
</body>
In case it's relevant, the response header has "Content-Type =
text/html; charset=Shift_JIS".
I'm trying to parse this with the following Ruby code:
page = mechanize_agent.get(url)
# map (not each) so the joined strings are collected into an array
list = page.search("ul#test li").map {|item|
  item.search("a").map {|a| a.content}.join(" > ") }
I'd expect this to return
["abc > def > ghi", "abc > def > ghi", "abc > def > ghi"]
but it returns
["abc > def > ghi123 > abc > def > ghi123 > abc > def > ghi123"].
However, if I save page.body, re-parse it with page = Nokogiri::parse(saved_body),
and repeat the code, it behaves as expected. This is a simplified example.
The actual HTML is (you can get this at
"http://www.amazon.co.jp/dp/B006QP63LI"):
<ul class="zg_hrsr">
<li class="zg_hrsr_item">
<span class="zg_hrsr_rank">487?</span>
<span class="zg_hrsr_ladder">? <a
href="http://www.amazon.co.jp/gp/bestsellers/digital-text/ref=pd_zg_hrsr_kinc_1_1">Kindle???</a>
> <a
href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2275256051/ref=pd_zg_hrsr_kinc_1_2">Kindle?</a>
> <a
href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2275259051/ref=pd_zg_hrsr_kinc_1_3">Kindle??</a>
> <a
href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2312932051/ref=pd_zg_hrsr_kinc_1_4">Romance</a>
> <b><a
href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2312943051/ref=pd_zg_hrsr_kinc_1_5_last">Historical</a></b></span>
</li>
<li class="zg_hrsr_item">
<span class="zg_hrsr_rank">1532?</span>
<span class="zg_hrsr_ladder">? <a
href="http://www.amazon.co.jp/gp/bestsellers/english-books/ref=pd_zg_hrsr_fb_2_1">??</a>
> <a
href="http://www.amazon.co.jp/gp/bestsellers/english-books/100925011/ref=pd_zg_hrsr_fb_2_2">Romance</a>
> <b><a
href="http://www.amazon.co.jp/gp/bestsellers/english-books/101338011/ref=pd_zg_hrsr_fb_2_3_last">Historical</a></b></span>
</li>
<li class="zg_hrsr_item">
<span class="zg_hrsr_rank">2613?</span>
<span class="zg_hrsr_ladder">? <a
href="http://www.amazon.co.jp/gp/bestsellers/english-books/ref=pd_zg_hrsr_fb_3_1">??</a>
> <a
href="http://www.amazon.co.jp/gp/bestsellers/english-books/93834011/ref=pd_zg_hrsr_fb_3_2">Literature
& Fiction</a> > <a
href="http://www.amazon.co.jp/gp/bestsellers/english-books/95083011/ref=pd_zg_hrsr_fb_3_3">Genre
Fiction</a> > <b><a
href="http://www.amazon.co.jp/gp/bestsellers/english-books/95796011/ref=pd_zg_hrsr_fb_3_4_last">Historical</a></b></span>
</li>
</ul>
and the result was:
Kindle? > Kindle?? > Romance > Historical 1542? ? ?? > Romance >
Historical 2627? ? ?? > Literature & Fiction > Genre Fiction >
Historical
but when the page was reloaded, I got the expected result:
["Kindle?X?g?A > Kindle?{ > Kindle?m?? > Romance >
Historical", "?m?? > Romance > Historical", "?m?? >
Literature & Fiction > Genre Fiction > Historical"]