Stefan Th. Gries
2008-May-30 15:14 UTC
[R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)
Hi all

Four questions regarding Unicode.

Three Windows questions. I am using

- a PC with Windows XP (Build 20600.xpsp080413-2111 (Service Pack 3));
- the following R version:

> R.version
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          7.0
year           2008
month          04
day            22
svn rev        45424
language       R
version.string R version 2.7.0 (2008-04-22)

- the following locale:

> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

# I loaded the file
# <http://www.linguistics.ucsb.edu/faculty/stgries/teaching/russ_corp.txt>
# into R, and this works fine.
corpus.file<-scan(choose.files(), what="char", sep="\n", quote="", comment.char="", encoding="UTF-8")

# My problems are the following:

# 1 strsplit
# This does not work:
words.1<-unlist(strsplit(corpus.file, "[-!;:\'\"\\?\\. ]+", perl=T))
# - words.1[173] should be "?????", as in corpus.file[6]
#   but it is "??????????"
# - words.1[208] should be "????????", as in corpus.file[13]
#   but it is "????????????????"
# - words.1[214] should be "????????", as in corpus.file[14]
#   but it is "????????????????"

# 2 entering Unicode characters into R: I want to search for,
# say, "для". So I try to define it as follows,
# but this doesn't work:
(x123<-"\u0434\u043b\u044F")
# I can define each individual character
(x1<-"\u0434"); (x2<-"\u043b"); (x3<-"\u044F")
# and each pair of characters
(x12<-"\u0434\u043b")
(x13<-"\u0434\u044F")
(x23<-"\u043b\u044F")
# but not all three ... the last one gets skipped.
# why's that and how do I do it?
# 3 defining Unicode character ranges: in each of the following,
# the last bracket does not get included (even if it gets defined
# as a Unicode character, too):
russ.char.yes<-"[\u0401\u0410-\u044F\u0451]" # all Russian Cyrillics
russ.char.no<-"[^\u0401\u0410-\u044F\u0451]" # other characters
russ.char.capit<-"[\u0410-\u042F\u0401]" # capital Russian Cyrillics
russ.char.small<-"[\u0430-\u044F\u0451]" # small Russian Cyrillics
# I can do that all on Linux, but this arises in a context where
# many other character processing issues are explained for Mac,
# Linux, *and* Windows, and I'd hate to have to say "this one
# thing, you can't do on Windows"

One Linux question. I am using Ubuntu Hardy Heron:

> sessionInfo()
R version 2.7.0 (2008-04-22)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

# strange(?) behavior of word boundary characters:
# I understand why these work ...
grep("\\b?????", "? ?????????", perl=F, value=T) # OK
# [1] "? ?????????"
gsub("\\b?????", ">XX<", "? ?????????", perl=F) # OK
# [1] "? >XX<????"
# but why does "\\b" not work with perl=T?
grep("\\b?????", "? ?????????", perl=T, value=T) # FAIL
# character(0)
gsub("\\b?????", ">XX<", "? ?????????", perl=T) # FAIL
# [1] "? ?????????"

Any pointers would be much appreciated and acknowledged ...
STG
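Editorial sketch, not part of the original post: questions 2 and 3 both build strings from \u escapes. One alternative that avoids escapes in string literals is to build the same strings from their code points with intToUtf8(). The sketch assumes a reasonably recent R (paste0() did not exist in R 2.7.0), and whether this approach sidesteps the problem on R 2.7.0 under Windows is untested. The names cp and n.chars are illustrative only.

```r
# Code points of the three Cyrillic letters from question 2
cp <- c(0x0434, 0x043B, 0x044F)

# intToUtf8() builds one UTF-8 string from a vector of code points
x123 <- intToUtf8(cp)

# Round trip: the string decodes back to the same code points
stopifnot(identical(utf8ToInt(x123), as.integer(cp)))

# Count characters rather than bytes
n.chars <- nchar(x123, type = "chars")

# Build the "all Russian Cyrillics" class from question 3 the same way:
# U+0401, then the range U+0410 to U+044F, then U+0451
russ.char.yes <- paste0("[", intToUtf8(c(0x0401, 0x0410)), "-",
                        intToUtf8(c(0x044F, 0x0451)), "]")

# Does the search string contain a character from that class?
has.cyrillic <- grepl(russ.char.yes, x123)
```

Checking utf8ToInt() on a string that misbehaves is also a quick way to see which code points actually arrived in it.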
Hans-Jörg Bibiko
2008-May-30 16:58 UTC
[R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)
Hi,

to put it simply: Windows cannot handle UTF-8 data. There is no UTF-8 locale available. If your corpus contains only Russian data, perhaps with English glosses etc., you can try setting the language of Rgui.exe to Russian. Then you can at least use grep and strsplit, because they depend on the chosen locale.

On 30.05.2008, at 17:14, Stefan Th. Gries wrote:
> # I can do that all on Linux, but this arises in a context where
> # many other character processing issues are explained for Mac,
> # Linux, *and* Windows, and I'd hate to have to say "this one
> # thing, you can't do on Windows"
>
Unfortunately I have to say this quite often :)

Cheers,
--Hans
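Editorial sketch, not part of the original reply: one possible reading of "set lang of Rgui.exe to Russian" is to switch the character-type locale from within the session using Sys.setlocale(). The locale name "Russian" is an assumption: it is a Windows-style name and will usually not be accepted on Linux or macOS, where the call returns an empty string. The variable names are illustrative only.

```r
# Remember the current character-type locale so it can be restored
old.ctype <- Sys.getlocale("LC_CTYPE")

# Try to switch; on failure Sys.setlocale() returns "" and warns
# ("Russian" is an assumed Windows locale name)
new.ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", "Russian"))

if (nzchar(new.ctype)) {
  message("LC_CTYPE is now: ", new.ctype)
  # Locale-dependent functions such as grep() and strsplit()
  # would be run here
} else {
  message("Russian locale not available; LC_CTYPE unchanged")
}

# Restore the previous setting
invisible(Sys.setlocale("LC_CTYPE", old.ctype))
```

Restoring the old locale at the end keeps the change from affecting the rest of the session.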