(I am reposting this question after a few months without a solution...)
Hi all,
I am trying to read a .txt file with Hebrew column names, but without
success.
I uploaded an example file to: http://www.talgalili.com/files/aa.txt
And tried the command:
read.table("http://www.talgalili.com/files/aa.txt", header = TRUE, sep = "\t")
This returns:
X.....ª X...ª...... X...œ....
1 12 97 6
2 123 354 44
3 6 1 3
Instead of:
אחת שתיים שלוש
12 97 6
123 354 44
6 1 3
Trying to use something like:
read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8")
resulted in:
V1
1 ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
  invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
  incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
I also tried this:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Or this:
Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
Which gets me this:
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
My output for:
l10n_info()
Is:
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
And for:
Sys.getlocale()
Is:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Finally, here is my sessionInfo():
R version 2.10.1 (2009-12-14)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1255 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.10.1
Any suggestion or clarification will be appreciated.
Best,
Tal
William Dunlap
2010-Mar-18 22:42 UTC
[R] How to read.table with “Hebrew” column names (in R)?
I tried this on R 2.11.0 unstable (2010-03-07 r51225) using
encoding="UTF-8" and check.names=FALSE in read.table().
It seemed to basically work, except that the data.frame/matrix printing
routine wants to print the Unicode codes for the characters
in the names:
> data1 <- read.table("http://www.talgalili.com/files/aa.txt",
+     header = TRUE, sep = "\t", encoding = "UTF-8",
+     check.names = FALSE)
> data1 # I see Unicode codes, presumably the correct ones
  <U+05D0><U+05D7><U+05EA> <U+05E9><U+05EA><U+05D9><U+05D9><U+05DD>
1                       12                                       97
2                      123                                      354
3                        6                                        1
  <U+05E9><U+05DC><U+05D5><U+05E9>
1                                6
2                               44
3                                3
> colnames(data1) # I see Hebrew strings (in R the first starts with aleph)
[1] "אחת" "שתיים" "שלוש"
> colnames(data1)[1]
[1] "אחת"
> strsplit(colnames(data1)[1], "")[[1]][1]
[1] "א"
> data1[,"שתיים"]
[1] 97 354 1
I'm writing this in Outlook in the English (American) locale,
and copying and pasting the Hebrew letters from the R GUI window
into the Outlook window reversed the whole line of them (reversing
the characters in each name and the names in the line), which is
why I showed a subset of the names and a substring of the first name.
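
For anyone who wants to reproduce this without the remote file, here is a minimal sketch. It assumes the file is UTF-8, which the result above suggests but does not prove. The temporary file and the names hebrew_names, tmp and name_codes are made up for illustration. Note that encoding= only marks the strings that are read as UTF-8, whereas fileEncoding= asks R to re-encode the input, so the two are not interchangeable. Comparing code points with utf8ToInt() avoids relying on how the console or a mail client renders the names.

```r
# Hebrew column names built from code points, so this script is ASCII-only.
hebrew_names <- c(
  intToUtf8(c(0x05D0, 0x05D7, 0x05EA)),
  intToUtf8(c(0x05E9, 0x05EA, 0x05D9, 0x05D9, 0x05DD)),
  intToUtf8(c(0x05E9, 0x05DC, 0x05D5, 0x05E9))
)

# Write a small tab-separated file as UTF-8 bytes (a stand-in for aa.txt).
tmp <- tempfile(fileext = ".txt")
lines <- c(paste(hebrew_names, collapse = "\t"),
           "12\t97\t6", "123\t354\t44", "6\t1\t3")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(lines), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the header strings as UTF-8;
# check.names = FALSE stops read.table from mangling them into X..... names.
data1 <- read.table(tmp, header = TRUE, sep = "\t",
                    encoding = "UTF-8", check.names = FALSE)

# Inspect the names by code point rather than by their printed form.
name_codes <- lapply(colnames(data1), utf8ToInt)
```

If name_codes[[1]] comes back as 1488, 1495, 1514 (that is, U+05D0, U+05D7, U+05EA), the names were read correctly even when the console prints them as escapes or question marks.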
However, when I try to use lm() with this data.frame then I run into
trouble, which is probably the same problem as I see in the
data.frame printing:
> lm(`שתיים` ~ `שלוש`)
Error: \uxxxx sequences not supported inside backticks (line 1)
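
One possible workaround for the modelling step, sketched under the assumption that it is acceptable to fit the model with ASCII column names and keep the Hebrew names only as labels. The replacement names (one, two, three) and the objects labels and model_data are hypothetical, and the small data frame below is a stand-in for data1 rather than the result of reading the real file.

```r
# Stand-in for data1 from the thread; names are built from code points
# so the script itself contains no non-ASCII characters.
hebrew_names <- c(
  intToUtf8(c(0x05D0, 0x05D7, 0x05EA)),
  intToUtf8(c(0x05E9, 0x05EA, 0x05D9, 0x05D9, 0x05DD)),
  intToUtf8(c(0x05E9, 0x05DC, 0x05D5, 0x05E9))
)
data1 <- data.frame(c(12, 123, 6), c(97, 354, 1), c(6, 44, 3))
colnames(data1) <- hebrew_names

# Hypothetical ASCII names used only for modelling.
ascii_names <- c("one", "two", "three")

# Lookup table from ASCII name to the original Hebrew name.
labels <- setNames(colnames(data1), ascii_names)

# Copy of the data with ASCII names; the formula then needs no backticks.
model_data <- setNames(data1, ascii_names)
fit <- lm(two ~ three, data = model_data)

# The Hebrew name of a modelled column can be recovered for reporting.
response_label <- labels[["two"]]
```

This avoids putting non-ASCII names inside backticks in a formula at all, at the cost of carrying the lookup table around for plots and reports.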
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com