Hello,

I am using R 2.15.2 on a 64-bit Linux box. I run R through Emacs' ESS.

R runs in a Canadian-French locale, and lately I got surprising results from functions that make factor variables from character variables. Many of the variables in the input data.frames are character variables and contain accented Latin characters, for example the "é" in "Montréal". I wasted several days playing with coding systems and trying to understand why some code gives the expected result when run one command at a time from the command line, but not when cut and pasted into a function.

For example, the following code:

=============================================================================
ttt.rmr <- sima.31122012$rmrnom
ttt.rmr.2 <- ifelse(ttt.rmr %in% c("Edmonton", "Edmundston",
                                   "Charlottetown", "Calgary", "Winnipeg",
                                   "Victoria", "Vancouver", "Toronto",
                                   "St. John's", "Saskatoon", "Regina",
                                   "Québec", "Ottawa - Gatineau (Ontario",
                                   "Ottawa - Gatineau (partie",
                                   "Montréal",
                                   "Halifax", "Fredericton"),
                    "Grandes villes", ifelse(ttt.rmr == "", "Manquant",
                                             "Autres"))
unique(ttt.rmr.2)
ttt.rmr.2 <- factor(ttt.rmr.2, levels = c("Grandes villes", "Autres",
                                          "Manquant"),
                    labels = c("Grandes villes", "Autres", "Manquant"))
=============================================================================

puts "Montréal" and "Québec" in the "Grandes villes" level of the factor variable, while running the same code inside a function puts them in "Autres". The variable "rmr.Merged" in the data.frame "test2.sima.31122012.DataPrep" is the output of the function, which, of course, does a lot of other stuff.

=============================================================================
ttt.w <- which(ttt.rmr.2 != test2.sima.31122012.DataPrep$rmr.Merged)
frequence(test2.sima.31122012.DataPrep$rmrnom[ttt.w])
          Frequency  Percent Cum.Freq Cum.Percent
Montréal    1301254 79.57173  1301254    79.57173
Québec       334068 20.42827  1635322   100.00000
=============================================================================

All other city names, which have no accents, were correctly classified. Only "Montréal" and "Québec" were not, and together they represent over 1.6M records, which is not negligible!

Following is my ".Renviron" file, where I set up environment variables for R.

=============================================================================
R_PROFILE_USER="/home/jeg002/MyRwork/StartUp/profile.R"
# export R_PROFILE_USER
R_HISTFILE="/home/jeg002/MyRwork/.Rhistory"
## Default editor
EDITOR=${EDITOR-${VISUAL-'/usr/local/bin/emacsclient'}}
## Default pager
PAGER=${PAGER-'/usr/local/bin/emacsclient'}

## Setting locale, hoping it will be OK "all" the time!!!
LANG=fr_CA
LANGUAGE=fr_CA
LC_ADDRESS=fr_CA
LC_COLLATE=fr_CA
LC_TYPE=fr_CA
LC_IDENTIFICATION=fr_CA
LC_MEASUREMENT=fr_CA
LC_MESSAGES=fr_CA
LC_NAME=fr_CA
LC_PAPER=en_US
LC_NUMERIC=en_US
LC_TELEPHONE=fr_CA
LC_MONETARY=fr_CA
LC_TIME=fr_CA
R_PAPERSIZE='letter'
=============================================================================

and:

> Sys.getlocale()
[1] "LC_CTYPE=fr_CA;LC_NUMERIC=C;LC_TIME=fr_CA;LC_COLLATE=fr_CA;LC_MONETARY=fr_CA;LC_MESSAGES=fr_CA;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_CA;LC_IDENTIFICATION=C"
> Sys.getenv(c("LANGUAGE", "LANG"))
LANGUAGE     LANG
 "fr_CA"  "fr_CA"

I must be missing something. Maybe someone can make sense of this.

Thanks for your support,

Gérald Jean, M. Sc. in statistics
Senior statistical advisor, Mouvement Desjardins, Lévis

Could it be that your R script is saved in a different encoding than the one used by R (which will probably be UTF-8, since you're working on Linux)?

--
Jan

gerald.jean at dgag.ca wrote:
> [...]
> I wasted several days playing with coding systems and trying to understand
> why some code gives the expected result when run one command at a time
> from the command line, but not when cut and pasted into a function.
> [...]
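
One way to test the encoding hypothesis is to look at the declared encoding and the raw bytes of a city name, rather than at how it prints. The sketch below is only an illustration under assumptions: the names city_utf8, city_latin1, unmarked and fixed are made up for the example and do not come from the thread, and Latin-1 is assumed as the "other" encoding because the fr_CA locale shown above carries no UTF-8 suffix.

```r
# The same visible name, "Montréal", stored in two encodings.
# The \u escape keeps this example independent of the script file's encoding.
city_utf8   <- "Montr\u00e9al"                                   # marked UTF-8
city_latin1 <- iconv(city_utf8, from = "UTF-8", to = "latin1")   # marked latin1

# 1. Same text on screen, different declared encodings and bytes.
Encoding(c(city_utf8, city_latin1))   # "UTF-8" "latin1"
charToRaw(city_utf8)                  # 4d 6f 6e 74 72 c3 a9 61 6c
charToRaw(city_latin1)                # 4d 6f 6e 74 72 e9 61 6c

# 2. Simulate a value whose encoding mark was lost, as can happen when
#    data or a script file is read without declaring its encoding.
unmarked <- rawToChar(charToRaw(city_latin1))
Encoding(unmarked)                    # "unknown"

# Whether this matches depends on the session locale, because R has to
# assume the unmarked bytes are in the native encoding.
unmarked %in% city_utf8

# 3. Declaring the real encoding explicitly removes the ambiguity.
fixed <- iconv(unmarked, from = "latin1", to = "UTF-8")
identical(fixed, city_utf8)           # TRUE
```

If the bytes of the literal in the function's source file differ from the bytes in the data, the mismatch for accented names only would be explained. In that case, possible remedies (again as assumptions to try, not a confirmed diagnosis) are to declare the file encoding when sourcing, for example source("script.R", encoding = "UTF-8") with a hypothetical file name, or to write the accented literals with \u escapes as above.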