Hello,

I'm trying to find a way to convert everything returned by mechanize to utf-8, whether the original page is utf-8 or iso, and I really don't know where to start...

    agent = WWW::Mechanize.new { |a| a.log = Logger.new(File::join(RAILS_ROOT, "log/mechanize.log")) }
    one_page = agent.get("www.google.fr")

My first problem is that the encoding of one_page should be utf-8 (as stated in Firefox's page properties), but one_page.content_type is "text/html; charset=ISO-8859-1" and displaying the text content gives wrong accent conversion.

Second problem: when scraping data from a REAL ISO-8859-1 website, how should I convert it to utf-8?

Mechanize 0.7.6, ruby 1.8.5, CentOS with utf-8 console

Thanks
Christophe,

If you're doing this within Rails (which it appears you are), just use string.toutf8. This method is part of the Kconv module, which Rails appears to include by default. Output from my script/console:

    >> Kconv.toutf8 "string"
    => "string"
    >> "string".toutf8
    => "string"
    >> toutf8

HTH,
Matt White

----- Original Message ----
From: Christophe <anaema_ml at yahoo.fr>
To: mechanize-users at rubyforge.org
Sent: Thursday, July 17, 2008 3:42:23 AM
Subject: [Mechanize-users] Convert data to utf-8

_______________________________________________
Mechanize-users mailing list
Mechanize-users at rubyforge.org
http://rubyforge.org/mailman/listinfo/mechanize-users
Thanks Matt, but it does not work...

I've investigated again:

    agent = WWW::Mechanize.new
    page = agent.get("http://www.google.fr")

page.content_type gives me "text/html; charset=ISO-8859-1", which is WRONG and should be UTF-8. I would appreciate it if somebody else could run the same test.

I followed your advice, and page.body.toutf8 has the same effect as

    Iconv.conv('ISO-8859-1//IGNORE', 'UTF-8', page.body)

and removes all accented characters from the body.

I really don't understand.

On 17/07/2008 at 16:57, Matt White wrote:
I cheated the whole system and just monkey patched mechanize to return everything scraped as UTF-8, but that might be frowned upon:

    require 'iconv'

    module UTF8Mechanize
      # Shared converter; treats every body as ISO-8859-1 input.
      @@converter = Iconv.new("UTF-8", "ISO-8859-1")

      def utf8_value
        @@converter.iconv(iso88591_value)
      end
    end

    class WWW::Mechanize::File
      include UTF8Mechanize
      # Keep the original accessor under a new name, then route body through the converter.
      alias_method :iso88591_value, :body
      alias_method :body, :utf8_value
    end

/Johan
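The patch above converts every body as if it were ISO-8859-1. A rough alternative is to read the charset from the Content-Type value and convert from whatever it declares. The sketch below assumes Ruby 1.9 or later (String#encode does not exist on the 1.8.5 used in this thread). The helper names charset_from and body_to_utf8 and the sample bytes are made up for illustration and are not part of Mechanize. It also shows one possible reason the accents vanished earlier in the thread: Iconv.conv takes the target encoding first and the source encoding second, so the quoted call converts from UTF-8 to ISO-8859-1, and ISO-8859-1 accent bytes read as UTF-8 are invalid sequences that //IGNORE discards.

```ruby
# Hypothetical helper: extract the charset from a Content-Type value,
# falling back to ISO-8859-1 when none is declared.
def charset_from(content_type, default = "ISO-8859-1")
  content_type.to_s[/charset=["']?([\w.:-]+)/i, 1] || default
end

# Hypothetical helper: return a UTF-8 copy of body, decoded from the declared charset.
def body_to_utf8(body, content_type)
  body.dup.force_encoding(charset_from(content_type)).encode("UTF-8")
end

# "déjà vu" as an ISO-8859-1 server would send it (0xE9 = é, 0xE0 = à).
latin1_bytes = [0x64, 0xE9, 0x6A, 0xE0, 0x20, 0x76, 0x75].pack("C*")

# The declared charset is honoured: ISO-8859-1 bytes become valid UTF-8.
converted = body_to_utf8(latin1_bytes, "text/html; charset=ISO-8859-1")

# Reverse direction for comparison: read the same bytes as UTF-8 and
# drop invalid sequences while converting to ISO-8859-1.
stripped = latin1_bytes.dup.force_encoding("UTF-8")
                       .encode("ISO-8859-1", invalid: :replace, replace: "")

puts converted
puts stripped
```

With Mechanize this would presumably be called as body_to_utf8(page.body, page.content_type). A charset name Ruby does not recognise raises ArgumentError in force_encoding, so real use would need a rescue or a whitelist. Note also that a server may pick its charset based on the request headers it receives, so the Content-Type Mechanize sees can legitimately differ from what Firefox reports for the same URL.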