Hello dear R-help mailing list.

Looks like the same issue in Russian:

library(RCurl)
library(XML)
u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
a = getURL(u)
a # Here - the Russian is fine.
a2 <- htmlParse(a)
a2 # Here it is a mess...

None of these seem to fix it:

htmlParse(a, encoding = "windows-1251")
htmlParse(a, encoding = "CP1251")
htmlParse(a, encoding = "cp1251")
htmlParse(a, encoding = "iso8859-5")

This is my locale:

Sys.getlocale()
"LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"

Any suggestions?

Thank you very much in advance,

Lavrentiy Eskin
<http://www.eng.nvg.ru>
Milan Bouchet-Valat
2013-Feb-21 10:08 UTC
[R] Getting htmlParse to work with Hebrew? (on windows)
On Thursday, 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> Hello dear R-help mailing list.
>
> Looks like the same issue in Russian:
> [...]
> Any suggestions?

What does Encoding(a) say?

(FWIW, here on Linux even a is not in the correct encoding:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head>
<title>???????????? ????????????????? ???????? ????? ?????????? ?? ????? ??????? ?? 11430 ???????????????????? ?? ????????? ???? ???????????????? ? ???????? ????? ????????</title>
[...])

Regards
Hi Milan!

> Encoding(a)
[1] "unknown"

2013/2/21 Milan Bouchet-Valat <nalimilan@club.fr>
> What does Encoding(a) say?
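For context on that answer: Encoding() reports only the mark R has attached to a string ("unknown", "latin1", "UTF-8" or "bytes"), not what the bytes actually are. A minimal base-R sketch of the difference, assuming the text is stored as windows-1251 bytes (the sample word and the variable names are illustrative and do not come from the page in question):

```r
# Build the Russian word "Москва" from its windows-1251 bytes, so the example
# does not depend on the encoding of this file or of the current locale.
cp1251_bytes <- as.raw(c(0xCC, 0xEE, 0xF1, 0xEA, 0xE2, 0xE0))
x <- rawToChar(cp1251_bytes)

# No mark is attached: R will treat the bytes as being in the locale encoding.
Encoding(x)  # "unknown"

# Converting explicitly produces a string that is marked as UTF-8.
# "CP1251" is used here as the iconv name for windows-1251.
y <- iconv(x, from = "CP1251", to = "UTF-8")
Encoding(y)  # "UTF-8"

# The code points are now the Cyrillic letters, whatever the locale is.
utf8ToInt(y)
```

So "unknown" is not an error in itself; it only means that anything reading the string has to guess, and the guess is the locale encoding.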
Hi Milan,

a <- getURL(con, .encoding = "UTF-8")
Encoding(a)
> [1] "UTF-8"
a # Here the UTF-8 text looks fine.
htmlParse(a, encoding = "UTF-8") ### again, the same encoding issue

>> why didn't getURL() detect and set a's encoding correctly?
I think the problem is with this particular page, because other sites work fine.

2013/2/21 Milan Bouchet-Valat <nalimilan@club.fr>
> On Thursday, 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > Hi Milan!
> >
> > > Encoding(a)
> > [1] "unknown"
>
> Hm, here I get "UTF-8", which is my locale encoding.
>
> I've tried a little more, and I discovered that using
> a <- getURL(u, .encoding="UTF-8")
> ensures that a is in the correct encoding here. I know this is not your
> problem, but it might help: check whether Encoding(a) is set to "UTF-8"
> or not in that case, and whether this fixes things.
>
> I'm not sure how htmlParse() detects the encoding when you pass it a
> character vector, but it probably uses Encoding(a), since that's the
> only reliable information; if it is missing, maybe it falls back to what
> the contents of the file say (maybe even before what the "encoding"
> argument says), which is windows-1251, and may not be the encoding in
> which getURL() saved the character vector. The question would then be:
> why didn't getURL() detect and set a's encoding correctly?
>
> My two cents
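The hypothesis quoted above is that the page declares windows-1251 in its own markup, and that this declaration may win over both the string's mark and the encoding argument. A rough way to see what a page claims about itself, as a base-R sketch: declared_charset() is a hypothetical helper (it is not part of RCurl or XML), and the sample page is a stand-in rather than the real document.

```r
# Hypothetical helper: return the charset named in the page's own markup,
# or NA if no "charset=..." declaration is found.
declared_charset <- function(html) {
  pattern <- "charset\\s*=\\s*[\"']?[A-Za-z0-9_-]+"
  # useBytes = TRUE avoids errors when the text is not valid in the locale.
  m <- regmatches(html, regexpr(pattern, html, ignore.case = TRUE,
                                perl = TRUE, useBytes = TRUE))
  if (length(m) == 0) return(NA_character_)
  # Strip the "charset=" prefix and an optional opening quote.
  sub("charset\\s*=\\s*[\"']?", "", m, ignore.case = TRUE, perl = TRUE)
}

# Stand-in page (ASCII only) with the same kind of declaration.
page <- paste0(
  "<html><head>",
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1251\">",
  "</head><body></body></html>"
)

declared_charset(page)
```

If the declared charset and the actual encoding of the string disagree, a parser that trusts the declaration will decode the bytes with the wrong table, which is consistent with the symptom described in this thread.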
Milan Bouchet-Valat
2013-Feb-21 14:43 UTC
[R] Getting htmlParse to work with Hebrew? (on windows)
On Thursday, 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> Hi Milan,
>
> a <- getURL(con, .encoding = "UTF-8")
> Encoding(a)
> > [1] "UTF-8"
> a # Here the UTF-8 text looks fine.
> htmlParse(a, encoding = "UTF-8") ### again, the same encoding issue

And what if you try this:
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))

or this:
a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))

Cheers

> >> why didn't getURL() detect and set a's encoding correctly?
> I think the problem is with this particular page, because other sites work fine.
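The two suggestions can also be combined: convert the bytes to UTF-8 and rewrite the page's charset declaration so that the two agree. The sketch below is only one possible reading of that idea, and the thread does not report whether either suggestion fixed the original page. to_utf8_html() is a hypothetical helper, the page is a small stand-in, and the conversion assumes the input really is windows-1251.

```r
# Hypothetical helper: convert the document to UTF-8 and make its own
# charset declaration say the same thing.
to_utf8_html <- function(html, from = "CP1251") {
  utf8 <- iconv(html, from = from, to = "UTF-8")
  sub("(charset\\s*=\\s*[\"']?)[A-Za-z0-9_-]+", "\\1UTF-8", utf8,
      ignore.case = TRUE, perl = TRUE)
}

# Stand-in page: ASCII markup plus "Москва" as windows-1251 bytes.
title_bytes <- rawToChar(as.raw(c(0xCC, 0xEE, 0xF1, 0xEA, 0xE2, 0xE0)))
page <- paste0(
  "<html><head>",
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1251\">",
  "<title>", title_bytes, "</title></head><body></body></html>"
)

fixed <- to_utf8_html(page)

# Parse only if the XML package is installed. asText = TRUE tells htmlParse()
# that the argument is the document itself rather than a file name or URL.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(fixed, asText = TRUE, encoding = "UTF-8")
  print(XML::xpathSApply(doc, "//title", XML::xmlValue))
}
```

Doing the conversion and the declaration rewrite together avoids a half-fixed state, where the bytes are UTF-8 but the markup still says windows-1251, or the other way round.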