Lauri Nikkinen
2009-Dec-31 13:09 UTC
[R] XML and RCurl: problem with encoding (htmlTreeParse)
Hi,

I'm trying to get data from a web page and modify it in R. I have a problem with encoding: I'm not able to get the encoding right in the htmlTreeParse command. See below.

> library(RCurl)
> library(XML)
>
> site <- getURL("http://www.aarresaari.net/jobboard/jobs.html")
> txt <- readLines(tc <- textConnection(site)); close(tc)
> txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE)
>
> g <- xpathSApply(txt, "//p", function(x) xmlValue(x))
> head(grep(" ", g, value=T))
[1] "????PART-TIME EXPORT SALES ASSOCIATES (ALSO SUMMER WORK) ? Valuatum Oy ??Helsinki ??Ilmoitus lis??tty: 31.12.2009. Viimeinen hakup??iv??: 28.02.2010"
[2] "????MSN EDITOR / ONLINE PRODUCER ??Manpower Oy ??Espoo ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 15.1.2010"
[3] "????MYYNTINEUVOTTELIJA ??Rand Customer Contact Oy ??Helsinki ? Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 30.1.2010"
[4] "????HALUATKO IT-ARKKITEHDIKSI SHANGHAIHIN? ??HALUATKO IT-ARKKITEHDIKSI SHANGHAIHIN? ??Shanghai, China ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 28.2.2010"
[5] "????HALUATKO J2EE-OHJELMISTOKEHITT??J??KSI SHANGHAIHIN? ? HALUATKO J2EE-OHJELMISTOKEHITT??J??KSI SHANGHAIHIN? ??Shanghai, China ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 28.2.2010"
[6] "????Korkeakouluharjoittelija/ ty??el??m??valmennettava ??Suomen suurl??hetyst?? Pristina, Kosovo ??Pristina, Kosovo ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 20.1.2010"

Passing encoding="latin1" doesn't help:

> txt <- readLines(tc <- textConnection(site)); close(tc)
> txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE, encoding="latin1")
> g <- xpathSApply(txt, "//p", function(x) xmlValue(x))
> head(grep(" ", g, value=T))
[1] "????PART-TIME EXPORT SALES ASSOCIATES (ALSO SUMMER WORK) ? Valuatum Oy ??Helsinki ??Ilmoitus lis??tty: 31.12.2009. Viimeinen hakup??iv??: 28.02.2010"
[2] "????MSN EDITOR / ONLINE PRODUCER ??Manpower Oy ??Espoo ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 15.1.2010"
[3] "????MYYNTINEUVOTTELIJA ??Rand Customer Contact Oy ??Helsinki ? Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 30.1.2010"
[4] "????HALUATKO IT-ARKKITEHDIKSI SHANGHAIHIN? ??HALUATKO IT-ARKKITEHDIKSI SHANGHAIHIN? ??Shanghai, China ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 28.2.2010"
[5] "????HALUATKO J2EE-OHJELMISTOKEHITT??J??KSI SHANGHAIHIN? ? HALUATKO J2EE-OHJELMISTOKEHITT??J??KSI SHANGHAIHIN? ??Shanghai, China ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 28.2.2010"
[6] "????Korkeakouluharjoittelija/ ty??el??m??valmennettava ??Suomen suurl??hetyst?? Pristina, Kosovo ??Pristina, Kosovo ??Ilmoitus lis??tty: 30.12.2009. Viimeinen hakup??iv??: 20.1.2010"

Any ideas?

Thanks,
Lauri

> sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=Finnish_Finland.1252 LC_CTYPE=Finnish_Finland.1252 LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C
[5] LC_TIME=Finnish_Finland.1252

attached base packages:
[1] grDevices datasets splines graphics utils grid stats methods base

other attached packages:
[1] RDCOMClient_0.92-0 XML_2.6-0 RCurl_1.3-1 Hmisc_3.7-0 survival_2.35-8 ggplot2_0.8.5 digest_0.4.2 reshape_0.8.3
[9] plyr_0.1.9 proto_0.3-8 gplots_2.7.4 caTools_1.10 bitops_1.0-4.1 gtools_2.6.1 gmodels_2.15.0 gdata_2.6.1
[17] lattice_0.17-26

loaded via a namespace (and not attached):
[1] cluster_1.12.1 MASS_7.3-4 tools_2.10.0
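The doubled "??" where a single Finnish letter is expected (for example "lis??tty") is what one would typically see when a two-byte UTF-8 character is read one byte at a time. The thread does not state the page's actual encoding, so the following is only a sketch under the assumption that the page is served as UTF-8. It uses base R only, and the word "lisätty" is built from raw bytes so the example does not depend on how this text itself is encoded.

```r
# UTF-8 bytes for "lisätty": the letter "ä" is the two bytes 0xC3 0xA4.
utf8_bytes <- as.raw(c(0x6c, 0x69, 0x73, 0xc3, 0xa4, 0x74, 0x74, 0x79))

# rawToChar() gives a string with no encoding mark.
raw_text <- rawToChar(utf8_bytes)
Encoding(raw_text)

# Misreading: treat every byte as one latin1 character.
# The two bytes of "ä" turn into two separate characters.
as_latin1 <- iconv(raw_text, from = "latin1", to = "UTF-8")
nchar(as_latin1, type = "chars")

# Correct reading: declare that the bytes are UTF-8.
as_utf8 <- raw_text
Encoding(as_utf8) <- "UTF-8"
nchar(as_utf8, type = "chars")
```

If the character counts differ between the two readings, the text was most likely decoded with the wrong encoding somewhere between the download and the parser.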
Duncan Temple Lang
2009-Dec-31 15:32 UTC
[R] XML and RCurl: problem with encoding (htmlTreeParse)
Hi Lauri.

I am in the process of making some changes to the encoding in the XML package. I'll take a look over the next few days. (Not certain precisely when.)

D.
Eduardo Leoni
2009-Dec-31 18:34 UTC
[R] XML and RCurl: problem with encoding (htmlTreeParse)
In the meantime, try this.

library(XML)
theurl <- "http://www.aarresaari.net/jobboard/jobs.html"
download.file(theurl, "tmp.html")
txt <- readLines("tmp.html")
txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE)
g <- xpathSApply(txt, "//p", function(x) xmlValue(x))
head(grep(" ", g, value=T))

It works for me.
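A variation that may be worth trying is to tell the parser the encoding explicitly. The sketch below assumes the page is UTF-8, which the thread does not confirm. It parses a small in-memory HTML string (a made-up stand-in for the real page) with the same htmlTreeParse arguments used above, plus asText and encoding.

```r
library(XML)

# A tiny stand-in document, built from raw bytes so that the "ä" in
# "lisätty" is guaranteed to be the UTF-8 byte pair 0xC3 0xA4.
html_bytes <- c(
  charToRaw("<html><body><p>lis"),
  as.raw(c(0xc3, 0xa4)),
  charToRaw("tty</p></body></html>")
)
html_text <- rawToChar(html_bytes)

# asText = TRUE: the first argument is the document itself, not a file name.
# encoding = "UTF-8": assumed encoding of the bytes handed to the parser.
doc <- htmlTreeParse(html_text, asText = TRUE, useInternalNodes = TRUE,
                     encoding = "UTF-8")

vals <- xpathSApply(doc, "//p", xmlValue)
Encoding(vals)
nchar(vals, type = "chars")
```

If the declared encoding matches the bytes, the extracted value should have one character per letter. With a mismatched encoding (as with encoding="latin1" on UTF-8 input), accented letters come out as two characters each.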
Lauri Nikkinen
2010-Jan-01 14:14 UTC
[R] XML and RCurl: problem with encoding (htmlTreeParse)
Thanks. Interestingly, your code works on my Mac (OS X 10.6.1) but not on my Win XP machine. See the sessionInfo() output for both below.

Mac R:

> sessionInfo()
R version 2.9.2 (2009-08-24)
i386-apple-darwin8.11.1

locale:
fi_FI.UTF-8/fi_FI.UTF-8/C/C/fi_FI.UTF-8/fi_FI.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] XML_2.6-0

WinXP:

> sessionInfo()
R version 2.9.2 (2009-08-24)
i386-pc-mingw32

locale:
LC_COLLATE=Finnish_Finland.1252;LC_CTYPE=Finnish_Finland.1252;LC_MONETARY=Finnish_Finland.1252;LC_NUMERIC=C;LC_TIME=Finnish_Finland.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] XML_2.6-0 RCurl_1.2-1 bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.9.2

-L
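The two sessions differ in locale: the Mac runs in fi_FI.UTF-8, while the Windows machine uses code page 1252. A string without an encoding mark is interpreted in the native encoding, so the same bytes can look right on one machine and garbled on the other. The sketch below, again assuming the file is UTF-8 (not confirmed in the thread), shows how readLines can mark the text explicitly so the result does not depend on the locale. The temporary file stands in for the downloaded tmp.html.

```r
# Write the UTF-8 bytes for "lisätty" plus a newline to a temporary file.
utf8_bytes <- as.raw(c(0x6c, 0x69, 0x73, 0xc3, 0xa4, 0x74, 0x74, 0x79, 0x0a))
tmp <- tempfile(fileext = ".html")
writeBin(utf8_bytes, tmp)

# Without an encoding argument the string is left unmarked and will be
# interpreted in whatever the session's native encoding happens to be.
unmarked <- readLines(tmp)
Encoding(unmarked)

# With encoding = "UTF-8" the non-ASCII string is marked as UTF-8.
marked <- readLines(tmp, encoding = "UTF-8")
Encoding(marked)

# TRUE in a UTF-8 locale (like the Mac session), FALSE under CP1252.
l10n_info()[["UTF-8"]]

# Convert to the native encoding where a native-encoded string is needed.
native <- enc2native(marked)

unlink(tmp)
```

Marking the encoding at read time keeps the character count of the text the same on both platforms, whereas the unmarked version depends on the session locale.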