R.T.A.J.Leenders
2011-May-05 09:33 UTC
[R] issue with "strange" characters (readHTMLTable)
Thank you. The line of code you gave certainly resolves several of the
issues. I didn't realize that font support is such a tough matter to
implement. Let me express my gratitude to those who provide this for us
in R.

On 04-05-11, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:

> Oh, please!
>
> This is about the contributed package XML, not R and not Windows.
>
> Some of us have worked very hard to provide reasonable font support in
> R, including on Windows. We are given exceedingly little credit, just
> the brickbats for things for which we are not responsible. (We even
> work hard to port XML to Windows for you, again with almost zero
> credit.)
>
> That URL is a page in UTF-8, as its header says. We have provided many
> ways to work with UTF-8 on Windows, but it seems readHTMLTable() is not
> making use of them.
>
> You need to run iconv() on the strings in your object (which, as it has
> factors, are the levels). When you do so, you will discover that page
> contains characters not in your native charset (I presume, not having
> your locale).
>
> What you can do, in Rgui only, is
>
>     for (n in names(Islands)) Encoding(levels(Islands[[n]])) <- "UTF-8"
>
> but likely there are still characters it will not know how to display.
>
> On Wed, 4 May 2011, R.T.A.J.Leenders wrote:
>
> > WinXP-x32, R-2.13.0
> >
> > Dear list,
> >
> > I have a problem that (I think) relates to the interaction between
> > Windows and R.
> >
> > I am trying to scrape a table with data on the Hawai'ian Islands.
> > This is my code:
> >
> >     library(XML)
> >     u <- "http://en.wikipedia.org/wiki/Hawaii"
> >     tables <- readHTMLTable(u)
> >     Islands <- tables[[5]]
> >
> > The output is (first set of columns):
> >
> > > Islands
> >   Island               Nickname             Location
> > 1 Hawai???????i[7]     The Big Island       19???????34????????????N 155???????30????????????W???????????? / ????????????19.567???????N 155.5???????W???????????? / 19.567; -155.5
> > 2 Maui[8]              The Valley Isle      20???????48????????????N 156???????20????????????W???????????? / ????????????20.8???????N 156.333???????W???????????? / 20.8; -156.333
> > 3 Kaho???????olawe[9]  The Target Isle      20???????33????????????N 156???????36????????????W???????????? / ????????????20.55???????N 156.6???????W???????????? / 20.55; -156.6
> > 4 L???na???????i[10]   The Pineapple Isle   20???????50????????????N 156???????56????????????W???????????? / ????????????20.833???????N 156.933???????W???????????? / 20.833; -156.933
> > 5 Moloka???????i[11]   The Friendly Isle    21???????08????????????N 157???????02????????????W???????????? / ????????????21.133???????N 157.033???????W???????????? / 21.133; -157.033
> > 6 O???????ahu[12]      The Gathering Place  21???????28????????????N 157???????59????????????W???????????? / ????????????21.467???????N 157.983???????W???????????? / 21.467; -157.983
> > 7 Kaua???????i[13]     The Garden Isle      22???????05????????????N 159???????30????????????W???????????? / ????????????22.083???????N 159.5???????W???????????? / 22.083; -159.5
> > 8 Ni???????ihau[14]    The Forbidden Isle   21???????54????????????N 160???????10????????????W???????????? / ????????????21.9???????N 160.167???????W???????????? / 21.9; -160.167
> >
> > As you can see, there are "weird" characters in there. I have also
> > tried readHTMLTable(u, encoding = "UTF-16") and
> > readHTMLTable(u, encoding = "UTF-8"), but that didn't help.
> >
> > It seems to me that there may be an issue with the interaction of the
> > Windows settings of the character set.
> >
> > sessionInfo() gives
> >
> > > sessionInfo()
> > R version 2.13.0 (2011-04-13)
> > Platform: i386-pc-mingw32/i386 (32-bit)
> >
> > locale:
> > [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
> > [3] LC_MONETARY=Dutch_Netherlands.1252
> > [4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] XML_3.2-0.2
> >
> > I have also attempted to let R use another setting by entering
> > Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> >
> > > Sys.setlocale("LC_ALL", "en_US.UTF-8")
> > [1] ""
> > Warning message:
> > In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
> >   OS reports request to set locale to "en_US.UTF-8" cannot be honored
> >
> > In addition, I have attempted to make the change directly from the
> > Windows command prompt, using "chcp 65001" and variations of that,
> > but that didn't change anything.
> >
> > I have searched the list and the web and have found others reporting
> > similar issues, but have not been able to find a solution. It looks
> > like this is an issue of how Windows and R interact. Unfortunately,
> > all three computers at my disposal have this problem. It occurs both
> > under WinXP-x32 and under Win7-x86.
> >
> > Is there a way to make R override the Windows settings, or can the
> > issue be solved otherwise?
> >
> > I have also tried other websites, and the issue occurs every time
> > there is an é, ü, ä, î, et cetera in the text to be scraped.
> >
> > Thank you,
> > Roger
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
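For anyone following along, here is a minimal sketch in R of the two ideas in the quoted reply: marking factor levels as UTF-8, and running iconv() to see which characters a smaller charset cannot hold. It starts with one common way this kind of garbage can arise (UTF-8 bytes carrying the wrong encoding label); whether that is exactly what happened inside readHTMLTable() is not established in this thread. The data frame `islands` is a hypothetical stand-in rather than the real scraped table, the is.factor() guard is an addition to the loop, and "latin1" is an assumption standing in for the poster's code page 1252.

```r
# One common way such garbage arises: UTF-8 bytes carrying the wrong label.
e_acute <- "\u00e9"            # e with acute accent, bytes C3 A9 in UTF-8
misread <- e_acute
Encoding(misread) <- "latin1"  # same bytes, now declared as Latin-1
mojibake <- enc2utf8(misread)  # two characters, U+00C3 then U+00A9

# Hypothetical stand-in for the scraped table (not the real Wikipedia data):
# a factor whose levels hold UTF-8 bytes without an encoding mark.
island_names <- c("Hawai\u02bbi", "Maui", "L\u0101na\u02bbi")
Encoding(island_names) <- "unknown"   # drop the UTF-8 mark, keep the bytes
islands <- data.frame(Island = factor(island_names, levels = island_names))

# The loop from the reply, with a guard so non-factor columns are skipped.
for (n in names(islands)) {
  if (is.factor(islands[[n]])) {
    Encoding(levels(islands[[n]])) <- "UTF-8"
  }
}

# Converting to a smaller charset shows which characters it lacks.
# "latin1" is an assumption standing in for the poster's code page 1252;
# sub = "byte" replaces unconvertible bytes with <xx> hex escapes.
converted <- iconv(levels(islands$Island), from = "UTF-8", to = "latin1",
                   sub = "byte")
print(Encoding(levels(islands$Island)))
print(converted)
```

Pure ASCII levels such as "Maui" are never given an encoding mark, so seeing "unknown" for them after the loop is expected and harmless.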
Janko Thyson
2011-May-05 14:45 UTC
[R] Tone in mailing lists (was "issue with "strange" characters (readHTMLTable)")
I did read How To Ask Questions The Smart Way
<http://www.catb.org/%7Eesr/faqs/smart-questions.html>, and I don't mind
being called stupid and/or a whiner for posting this...

What's the purpose of an R mailing list (especially r-help), after all?
IMHO it should be about R users being able to ask questions and getting
help from those who would like to provide some. Unless a post has been
written in a clearly offensive tone and/or shows a major amount of
ignorance on the part of the person asking, why would you not simply
answer the question posed, instead of adding sarcasm, personal
criticism, sharp remarks or the like on top?

I also believe that it's in the nature of things that R users are more
likely to post about problems/bugs they encountered than to simply drop
a line of acknowledgment for the people who continuously develop R and
its packages. It's nice if they do explicitly express their gratitude in
their posts (and this *does* happen a lot), but it is not *demanded*,
and if they don't, it doesn't automatically mean that they are not
grateful...

Please don't get me wrong: all the people working on pushing R forward
do fantastic work and I greatly appreciate the input from the entire
community. I sometimes just wonder about the tone in the mailing lists.
But that's probably just me; maybe it's just the way the cookie crumbles
in technical mailing lists...

Regards,
Janko