Hello dear R-help mailing list,

I would like htmlParse to work well with Hebrew, but it keeps scrambling the Hebrew text in the pages I feed into it.

For example:

    # why can't I parse the Hebrew correctly?
    library(RCurl)
    library(XML)
    u = "http://humus101.com/?p=2737"
    a = getURL(u)
    a # Here - the hebrew is fine.
    a2 <- htmlParse(a)
    a2 # Here it is a mess...

Neither of these seems to fix it:

    htmlParse(a, encoding = "utf-8")
    htmlParse(a, encoding = "iso8859-8")

This is my locale:

    > Sys.getlocale()
    [1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"

Any suggestions?

Thanks in advance,
Tal
Duncan Temple Lang
2012-Jan-31 01:33 UTC
[R] Getting htmlParse to work with Hebrew? (on windows)
After some off-line interaction and testing by Tal, the latest version of the XML package (3.9-4) should resolve these issues. The encoding declared in the document is now used as the default in more cases.

It is often important to specify the encoding for HTML files explicitly in the call to htmlParse(), and to write it as "UTF-8" rather than the lower-case "utf-8". I'll add code to make this simpler when I get a chance.

Thanks, Tal.

D.
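As a minimal sketch of the advice above, the following passes the encoding explicitly as upper-case "UTF-8". To stay self-contained it parses a small inline HTML string instead of the page from the original post; the string, the variable names, and the use of asText = TRUE are assumptions made for illustration, not part of the original exchange.

```r
library(XML)

# Hypothetical stand-in for a downloaded page: a tiny HTML document
# containing one Hebrew word, written with \u escapes so the source
# file itself does not depend on the session's locale.
html <- paste0(
  "<html><head>",
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">",
  "</head><body><p>\u05e9\u05dc\u05d5\u05dd</p></body></html>"
)
html <- enc2utf8(html)  # make sure the bytes handed to the parser are UTF-8

# State the encoding explicitly, in upper case, as suggested above.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Extract the paragraph text and inspect how R has marked its encoding.
txt <- xpathSApply(doc, "//p", xmlValue)
print(Encoding(txt))
print(txt)
```

For a page already fetched with getURL(), as in the original post, the analogous call would presumably be htmlParse(a, asText = TRUE, encoding = "UTF-8"). Whether the text then prints correctly at the console can still depend on the locale and on the installed version of the XML package.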