Hi all,

I am scraping a table off another site and inserting it into my site. You can see an example on the initial page at http://mthosts.heroku.com. I'm referring to the green box with the Snowbird weather and snowfall information.

This box has been scraped from the Snowbird site at http://www.snowbird.com/ski_board/snowreport.php

The problem is that the Snowbird site shows degree symbols (°), but on my page they show up as (�).

I think it has something to do with the encoding, but I'm pretty new to HTML and am not sure what I can do to fix this. I've tried substituting the characters and some other things but haven't had any success yet.

Any ideas?

Thanks,
Max

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.
Everaldo Gomes
2011-Nov-27 11:54 UTC
Re: problem scraping using nokogiri - getting wrong characters
Hi!

I opened the HTML source of the snowreport.php page and noticed that the strange symbols you mentioned are HTML-encoded characters. The symbol is °

I had a similar problem last Monday, but I couldn't completely solve it.

Try the htmlentities library: http://htmlentities.rubyforge.org/

Or use a regular expression (sub, gsub) to replace the HTML-encoded ° with the degree symbol.

Regards,
Everaldo
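For anyone who wants to try the gsub route suggested above, here is a minimal sketch. It assumes the scraped markup still carries the degree sign as an entity (`&deg;`, `&#176;` or `&#xB0;`). The sample fragment, the constants and the method name are made up for illustration and do not come from the thread.

```ruby
# Hypothetical fragment standing in for two scraped table cells.
fragment = '<td>Base: 28&deg;F</td><td>Peak: 19&#176;F</td>'

# The literal degree sign (U+00B0), written as an escape so the
# example does not depend on the source file's encoding.
DEGREE = "\u00B0"

# Named, decimal and hexadecimal spellings of the degree entity.
DEGREE_ENTITIES = /&deg;|&#176;|&#x[bB]0;/

# Replace every entity spelling with the literal character.
def replace_degree_entities(html)
  html.gsub(DEGREE_ENTITIES, DEGREE)
end

cleaned = replace_degree_entities(fragment)
puts cleaned
```

If the page contains the raw character rather than an entity, there is nothing for this pattern to match, and the problem is more likely one of encoding.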
I tried that, but it didn't work for me. What did work was explicitly setting the encoding property in Nokogiri:

    require 'nokogiri'
    require 'open-uri'

    url = 'http://www.snowbird.com/ski_board/snowreport.php'
    page = Nokogiri::HTML(open(url))  # on Ruby 3.0+, use URI.open(url)
    page.encoding = 'utf-8'

Worked great after that!

Thx,
Max
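One plausible reading of why the encoding fix helped: the degree sign was travelling as a single non-UTF-8 byte and ended up on a page that browsers read as UTF-8. The sketch below illustrates that mismatch with Ruby's own String methods, not Nokogiri. It assumes the source page uses ISO-8859-1, which the thread does not state, and the sample text is invented.

```ruby
# Assumption: in ISO-8859-1 the degree sign is the single byte 0xB0.
raw = "28\xB0 F".b   # raw bytes as they might arrive over the wire

# Labelling those bytes as UTF-8 produces an invalid string.
mislabeled = raw.dup.force_encoding('UTF-8')
puts mislabeled.valid_encoding?   # false

# Scrubbing swaps the bad byte for U+FFFD, the replacement
# character that shows up on the broken page.
broken = mislabeled.scrub
puts broken

# Declaring the real source encoding and transcoding to UTF-8
# yields the two-byte sequence C2 B0 for the degree sign.
fixed = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts fixed
puts fixed.bytes.map { |b| format('%02X', b) }.join(' ')
```

Under that assumption, telling Nokogiri to use UTF-8 for the document would have the same effect as the last step here: the markup is written out in the encoding the host page expects.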