lukens
2007-Aug-07 15:38 UTC
Hpricot problems when scraping sites with accented characters
Hi,

I'm using Hpricot to try to grab reviews from various sites. Some of these reviews are in French, German, etc., and so contain accented characters.

However, these are coming out the other end as a load of question marks.

I suspect this is some kind of encoding issue, and the best of my googling has revealed that Ruby and Rails kinda suck at character encodings.

I've tried blindly adding the following to my environment.rb:

    $KCODE = 'u'
    require 'jcode'

but it doesn't seem to have helped at all.

Any idea what I can try next? Am I likely to be able to get this working?

Thanks,
Luke.

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en
Lionel Bouton
2007-Aug-07 15:42 UTC
Re: Hpricot problems when scraping sites with accented characters
lukens wrote:
> Hi,
>
> I'm using Hpricot to try and grab reviews from various sites, some of
> these reviews are in French or German, etc, and so contain accented
> characters.
>
> However, these are coming out the other end as a load of question
> marks.
>
> I'm suspecting this is some kind of encoding issue, and the best of my
> googling has revealed that Ruby and Rails kinda suck at character
> encodings.
>
> I've tried blindly adding the following to my environment.rb:
>
> $KCODE = 'u'
> require 'jcode'
>
> but it doesn't seem to have helped at all.
>
> Any idea what I can try next? am I likely to be able to get this
> working?

- detect the encoding with chardet,
- use Iconv to convert the original content to utf-8,
- only then use Hpricot to parse it.

Lionel
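The three steps above can be sketched roughly as follows. This is a minimal sketch for Ruby 1.9 or later, where `String#encode` stands in for Iconv (Iconv left the standard library in Ruby 2.0, so this is a swapped-in technique, not what the post used). The helper name `to_utf8` and its `source_encoding` argument are assumptions for illustration; the detection step is left to whatever detector is available, such as chardet.

```ruby
# Sketch only: String#encode replaces Iconv here (Ruby 1.9+).
# `to_utf8` is a hypothetical helper; `source_encoding` is assumed to come
# from a detector such as chardet or from the HTTP response headers.
def to_utf8(raw, source_encoding)
  # Label the raw bytes with their real encoding, then transcode to UTF-8.
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

latin1_bytes = "caf\xE9".b                     # "café" as ISO-8859-1 bytes
utf8 = to_utf8(latin1_bytes, "ISO-8859-1")

# Only at this point would the text be handed to the parser,
# for example Hpricot(utf8).
puts utf8.valid_encoding?
```

`encode` raises an exception when the bytes do not match the claimed source encoding or cannot be represented in UTF-8, so a wrong guess from the detector tends to fail loudly rather than silently produce question marks.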
Phlip
2007-Aug-07 17:31 UTC
Re: Hpricot problems when scraping sites with accented characters
Lionel Bouton wrote:
> - detect the encoding with chardet,
> - use Iconv to convert the original content to utf-8,
> - only then use Hpricot to parse it.

Would Tidy -utf do the first two steps automatically?

--
Phlip
http://www.oreilly.com/catalog/9780596510657/
^ assert_xpath
http://tinyurl.com/yrc77g <-- assert_latest Model
Lionel Bouton
2007-Aug-07 17:44 UTC
Re: Hpricot problems when scraping sites with accented characters
Phlip wrote:
> Lionel Bouton wrote:
>
>> - detect the encoding with chardet,
>> - use Iconv to convert the original content to utf-8,
>> - only then use Hpricot to parse it.
>
> Would Tidy -utf do the first two steps automatically?

Never tested (and already coded a working chardet + Iconv implementation). From tidy's documentation, it seems it would request an encoding, but not force it, so buggy servers will still crash your code. In fact I had to use a

    begin
      Iconv.iconv('utf-8', 'utf-8')
    rescue
      ....
    end

to make absolutely sure results are really utf-8 (you don't want bad encoding trying to enter a database set to use UTF-8...)

Lionel
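The utf-8 to utf-8 round trip described above is a validity check: Iconv raises when it meets bytes that are not legal UTF-8. A rough equivalent for Ruby 1.9 or later, sketched here with `String#valid_encoding?` in place of Iconv, might look like this. The helper name `valid_utf8?` is an assumption, and what to do when the check fails is left open, as it is in the post.

```ruby
# Sketch only: String#valid_encoding? stands in for the
# Iconv.iconv('utf-8', 'utf-8') round trip (Ruby 1.9+).
# `valid_utf8?` is a hypothetical helper name.
def valid_utf8?(text)
  # Work on a copy so the caller's string keeps its own encoding label.
  text.dup.force_encoding("UTF-8").valid_encoding?
end

good = "caf\u00E9"    # well-formed UTF-8
bad  = "caf\xE9".b    # a lone ISO-8859-1 byte, not legal UTF-8

puts valid_utf8?(good)
puts valid_utf8?(bad)
```

Running a check like this just before saving keeps malformed byte sequences away from a database that is set to use UTF-8.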
lukens
2007-Aug-07 18:20 UTC
Re: Hpricot problems when scraping sites with accented characters
Thanks for the responses, but could you elaborate a little, please?

At the moment I have:

    Hpricot(open(uri))

(with a "require 'open-uri'" at the top)

What do I need to do, and where?
Lionel Bouton
2007-Aug-07 20:11 UTC
Re: Hpricot problems when scraping sites with accented characters
lukens wrote the following on 07.08.2007 20:20 :
> Thanks for the responses, but could you elaborate a little please?
>
> at the moment I have:
>
> Hpricot(open(uri))
>
> (with a "require 'open-uri'" at the top)
>
> What do I need to do, and where?

open(uri) gives you a String in an unknown encoding. Hpricot expects UTF-8, so you must make sure that the String you get is converted to UTF-8. To do so you must use the Iconv library, but it expects you to know which encoding the source is in. The chardet library will be able to guess the original encoding.

For the details, look up the documentation of chardet and Iconv. Iconv is in the standard library; chardet is a separate download.

Lionel
lukens
2007-Aug-08 13:06 UTC
Re: Hpricot problems when scraping sites with accented characters
Thanks for the help.

open(uri) returns a File rather than a String, and after playing with various options for detecting the encoding, I found that the file object has a charset method, which returns the encoding (I think this is only on a file returned by open-uri).

This was handy, as chardet seemed pretty crap at detecting the encoding correctly. It was slightly better when I tried doing it a line at a time, but for the whole file it just sucked. I still have a fallback to chardet if the file object doesn't respond to 'charset'.

I should note that I was using rchardet, as I couldn't get the chardet gem to play ball at all.
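The "charset first, detector as fallback" approach described above could be sketched like this. The names `source_charset`, `FakeResponse`, and `guess` are all hypothetical: `FakeResponse` stands in for the object open-uri returns, and `guess` stands in for a detector such as rchardet, so the sketch runs without network access or extra gems.

```ruby
# Sketch only: prefer the charset reported by the response object,
# and fall back to a detector when it is missing.
# `detector` is a hypothetical callable standing in for rchardet or similar.
def source_charset(io, body, detector)
  if io.respond_to?(:charset) && io.charset
    io.charset
  else
    detector.call(body)
  end
end

# Stand-in for an open-uri result, which exposes a `charset` method.
FakeResponse = Struct.new(:charset)

# Stand-in detector that always guesses ISO-8859-1.
guess = ->(_body) { "ISO-8859-1" }

from_header   = source_charset(FakeResponse.new("utf-8"), "", guess)
from_fallback = source_charset(Object.new, "caf\xE9".b, guess)

puts from_header
puts from_fallback
```

One caveat, as far as the open-uri documentation goes: for text content fetched over HTTP without a charset in the Content-Type header, `charset` is documented to return "iso-8859-1" as a default, so a value from the header is not always a real declaration by the server.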