thr3ads.net - Mechanize users - [Mechanize-users] Problems parsing page encoded in Shift-JIS [Mar 2013]

Jack Royal-Gordon

2013-Mar-21 05:13 UTC

[Mechanize-users] Problems parsing page encoded in Shift-JIS

I'm posting this question to both mailing lists because I'm not sure
whether it's a Mechanize problem or a Nokogiri problem.

I'm using Nokogiri and Mechanize to load and parse a web page encoded in
Shift-JIS. The page has an HTML construct like this:

<head>
	<meta http-equiv="content-type" content="text/html; charset=Shift_JIS">
</head>
<body>
...
	<ul id="test">
		<li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
		<li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
		<li><a>abc</a><a>def</a><a>ghi</a><span>123</span></li>
	</ul>
...
</body>

In case it's relevant, the response header has "Content-Type =
text/html; charset=Shift_JIS".
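To illustrate what that header implies for the raw bytes, here is a minimal sketch that uses only the Ruby standard library. It assumes the body arrives as an untagged byte string. The helper names `charset_from_content_type` and `to_utf8` are hypothetical (they are not part of Mechanize or Nokogiri), and the sample bytes are a made-up stand-in for a real response body.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type value.
def charset_from_content_type(value)
  match = value.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1]
end

# Hypothetical helper: tag the raw bytes with the declared charset,
# then transcode them to UTF-8 (bad sequences are replaced, not raised).
def to_utf8(raw_bytes, charset)
  raw_bytes.dup.force_encoding(charset)
           .encode("UTF-8", invalid: :replace, undef: :replace)
end

header  = "text/html; charset=Shift_JIS"
charset = charset_from_content_type(header)

# Stand-in for a response body: "Kindle" followed by the Shift_JIS
# bytes of three katakana characters (assumed sample data).
raw  = "Kindle\x83X\x83g\x83A".b
text = to_utf8(raw, charset)

puts charset
puts text
```

If the bytes are tagged with the wrong charset at this step, everything downstream (including the parser) sees different characters, which is one reason to check the declared encoding first.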

I'm trying to parse this with the following Ruby code:

	page = mechanize_agent.get(url)
	# Build one "a > b > c" string per <li> from the text of its <a> children.
	list = page.search("ul#test li").map { |item|
		item.search("a").map { |a| a.content }.join(" > ")
	}

I'd expect this to return
["abc > def > ghi", "abc > def > ghi", "abc > def > ghi"]

but it returns
["abc > def > ghi123 > abc > def > ghi123 > abc > def > ghi123"].

However, if I save page.body, run page = Nokogiri::parse(saved_body), and
repeat the code, it behaves as expected.

This is a simplified example. The actual HTML (you can get this at
"http://www.amazon.co.jp/dp/B006QP63LI") is:

<ul class="zg_hrsr">
    <li class="zg_hrsr_item">
    <span class="zg_hrsr_rank">487?</span>
    <span class="zg_hrsr_ladder">?&nbsp;<a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/ref=pd_zg_hrsr_kinc_1_1">Kindle???</a>
    &gt; <a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2275256051/ref=pd_zg_hrsr_kinc_1_2">Kindle?</a>
    &gt; <a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2275259051/ref=pd_zg_hrsr_kinc_1_3">Kindle??</a>
    &gt; <a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2312932051/ref=pd_zg_hrsr_kinc_1_4">Romance</a>
    &gt; <b><a href="http://www.amazon.co.jp/gp/bestsellers/digital-text/2312943051/ref=pd_zg_hrsr_kinc_1_5_last">Historical</a></b></span>
    </li>
    <li class="zg_hrsr_item">
    <span class="zg_hrsr_rank">1532?</span>
    <span class="zg_hrsr_ladder">?&nbsp;<a href="http://www.amazon.co.jp/gp/bestsellers/english-books/ref=pd_zg_hrsr_fb_2_1">??</a>
    &gt; <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/100925011/ref=pd_zg_hrsr_fb_2_2">Romance</a>
    &gt; <b><a href="http://www.amazon.co.jp/gp/bestsellers/english-books/101338011/ref=pd_zg_hrsr_fb_2_3_last">Historical</a></b></span>
    </li>
    <li class="zg_hrsr_item">
    <span class="zg_hrsr_rank">2613?</span>
    <span class="zg_hrsr_ladder">?&nbsp;<a href="http://www.amazon.co.jp/gp/bestsellers/english-books/ref=pd_zg_hrsr_fb_3_1">??</a>
    &gt; <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/93834011/ref=pd_zg_hrsr_fb_3_2">Literature & Fiction</a>
    &gt; <a href="http://www.amazon.co.jp/gp/bestsellers/english-books/95083011/ref=pd_zg_hrsr_fb_3_3">Genre Fiction</a>
    &gt; <b><a href="http://www.amazon.co.jp/gp/bestsellers/english-books/95796011/ref=pd_zg_hrsr_fb_3_4_last">Historical</a></b></span>
    </li>
</ul>

and the result was:

Kindle? > Kindle?? > Romance > Historical 1542? ? ?? > Romance >
Historical 2627? ? ?? > Literature & Fiction > Genre Fiction >
Historical
 
but when the page was reloaded, I got the expected result:

["Kindle?X?g?A > Kindle?{ > Kindle?m?? > Romance >
Historical", "?m?? > Romance > Historical", "?m?? >
Literature & Fiction > Genre Fiction > Historical"]
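Shift_JIS text that passes through an ASCII-only channel typically degrades to a pattern such as `Kindle?X?g?A`: each lead byte above 0x7F is replaced by `?`, while trail bytes that happen to fall in the ASCII range survive as letters. The following is a minimal stdlib-only sketch of that effect; `ascii_only_view` is a hypothetical helper written for this illustration, and the sample string is assumed data rather than output captured from the page.

```ruby
# Hypothetical helper: mimic an ASCII-only channel by replacing every
# byte above 0x7F with "?" and keeping ASCII bytes unchanged.
def ascii_only_view(byte_string)
  byte_string.bytes.map { |b| b < 0x80 ? b.chr : "?" }.join
end

original = "Kindle\u30B9\u30C8\u30A2"      # "Kindle" plus three katakana
sjis     = original.encode("Shift_JIS")    # two bytes per katakana
garbled  = ascii_only_view(sjis)
restored = sjis.encode("UTF-8")            # lossless while the bytes are intact

puts garbled
puts restored
```

The point of the sketch is that the damage happens when bytes are displayed or transported, not when they are stored: as long as the original Shift_JIS bytes are kept, transcoding back to UTF-8 recovers the text.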
