Hello, I'm trying to find a way to convert everything returned by
mechanize to utf-8, whether the original page is utf-8 or iso, and I
really don't know where to start...

  agent = WWW::Mechanize.new { |a|
    a.log = Logger.new(File::join(RAILS_ROOT, "log/mechanize.log"))
  }
  one_page = agent.get("www.google.fr")

My first problem is that the encoding of one_page should be utf-8 (as stated
by Firefox's page properties), but one_page.content_type is
"text/html; charset=ISO-8859-1" and displaying the text content gives
wrongly converted accents.
Second problem: when scraping data from a REAL ISO-8859-1 website, how
should I convert it to utf-8?
Mechanize 0.7.6, ruby 1.8.5, CentOS with utf-8 console
Thanks

Christophe,

If you're doing this within Rails (which it appears you are), just use
string.toutf8. This method is part of the Kconv module, which Rails appears
to include by default. Output from my script/console:

  >> Kconv.toutf8 "string"
  => "string"
  >> "string".toutf8
  => "string"

HTH,
Matt White

----- Original Message ----
From: Christophe <anaema_ml at yahoo.fr>
To: mechanize-users at rubyforge.org
Sent: Thursday, July 17, 2008 3:42:23 AM
Subject: [Mechanize-users] Convert data to utf-8

_______________________________________________
Mechanize-users mailing list
Mechanize-users at rubyforge.org
http://rubyforge.org/mailman/listinfo/mechanize-users

Thanks Matt, but it does not work...
I've investigated again:

  agent = WWW::Mechanize.new
  page = agent.get("http://www.google.fr")

page.content_type gives me "text/html; charset=ISO-8859-1", which is
WRONG and should be UTF-8. I would appreciate it if somebody else could
run the same test.
I followed your advice, and page.body.toutf8 has the same effect as

  Iconv.conv('ISO-8859-1//IGNORE', 'UTF-8', page.body)

and removes all accented characters from the body.
I really don't understand.
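
A possible explanation for the lost accents: Iconv.conv takes the target encoding first and the source encoding second, so the call above converts from UTF-8 to ISO-8859-1, and //IGNORE silently drops any bytes that are not valid UTF-8. The sketch below reproduces that effect without Iconv, assuming Ruby 2.1 or later (String#b, String#force_encoding, String#encode and String#scrub are not available on the ruby 1.8.5 used in this thread). The sample string is made up for illustration.

```ruby
# "cafe" with an accented e, as a single ISO-8859-1 byte (0xE9),
# the way a Latin-1 server would send it.
latin1_body = "caf\xE9".b

# Wrong direction: pretend the bytes are UTF-8 and drop whatever
# does not decode. The lone 0xE9 byte is invalid UTF-8, so it vanishes.
stripped = latin1_body.dup.force_encoding("UTF-8").scrub("")

# Right direction: declare the bytes as ISO-8859-1, then transcode to UTF-8.
converted = latin1_body.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts stripped   # accent is gone
puts converted  # accent survives as a two-byte UTF-8 sequence
```

If the body really is ISO-8859-1, the equivalent Iconv call would put "UTF-8" first and "ISO-8859-1" second.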
On 17/07/2008 16:57, Matt White wrote:

I cheated the whole system and just monkey patched mechanize to
return everything it scrapes as UTF-8, but that might be frowned upon:

  require 'iconv'

  module UTF8Mechanize
    # Iconv.new takes (to, from): convert ISO-8859-1 input to UTF-8.
    @@converter = Iconv.new("UTF-8", "ISO-8859-1")

    def utf8_value
      @@converter.iconv(iso88591_value)
    end
  end

  class WWW::Mechanize::File
    include UTF8Mechanize
    # Keep the original body reachable, then make body return UTF-8.
    alias_method :iso88591_value, :body
    alias_method :body, :utf8_value
  end

/Johan
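
One caveat with the patch above: it assumes every page is ISO-8859-1, so a page that is already UTF-8 would be converted a second time. A minimal sketch of a charset-aware variant follows, assuming Ruby 2.1 or later (on ruby 1.8.5 the same idea would need Iconv instead). The helper names charset_from and body_to_utf8 are hypothetical, not part of mechanize, and the ISO-8859-1 fallback for a missing or unknown charset is an assumption.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type value such as
# "text/html; charset=ISO-8859-1". Falls back to the given default.
def charset_from(content_type, default = "ISO-8859-1")
  match = content_type.to_s.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match ? match[1] : default
end

# Hypothetical helper: return the body as UTF-8, whatever charset was declared.
def body_to_utf8(body, content_type)
  source = begin
    Encoding.find(charset_from(content_type))
  rescue ArgumentError
    Encoding::ISO_8859_1 # assumed fallback for charset names Ruby does not know
  end

  body.dup
      .force_encoding(source)                                 # label the raw bytes
      .encode("UTF-8", invalid: :replace, undef: :replace)    # transcode if needed
      .scrub                                                  # replace bad UTF-8 bytes
end

puts body_to_utf8("caf\xE9".b, "text/html; charset=ISO-8859-1")
puts body_to_utf8("caf\xC3\xA9".b, "text/html; charset=UTF-8")
```

Both calls print the same accented word. Note that this only trusts the declared charset; it does not detect a server that declares one encoding and sends another.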

On Thu, Jul 17 2008 at 5:42 AM, Christophe <anaema_ml at yahoo.fr> wrote: