Anthony Damico
2015-Mar-14 11:07 UTC
[R] iconv() replaces invalid characters with "  " instead of " " (two spaces instead of one) on unix?
hello, i am trying to replace non-ASCII characters in a character string with a single space. the iconv() function works as i expect it to on windows, but on unix, non-ASCII characters are getting replaced with two spaces instead of one. i suppose i could write a workaround for my code, but i'm wondering if i'm making some other mistake?

in the output below, this is the result i'm getting:

[1] "cancelaci  n"

and this is the result i want:

[1] "cancelaci n"

thanks!!

================

> getOption( "encoding" )
[1] "windows-1252"

> a <- "cancelación"
> iconv(a,"","ASCII")
[1] NA
> iconv(a,"","ASCII",sub=" ")
[1] "cancelaci  n"

================

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] R.utils_1.34.0    R.oo_1.18.0       R.methodsS3_1.6.1 descr_1.0.4
 [5] SAScii_1.0        downloader_0.3    foreign_0.8-61    MonetDB.R_0.9.5
 [9] digest_0.6.6      DBI_0.3.1

loaded via a namespace (and not attached):
[1] xtable_1.7-4
Prof Brian Ripley
2015-Mar-14 15:06 UTC
[R] iconv() replaces invalid characters with "  " instead of " " (two spaces instead of one) on unix?
On 14/03/2015 11:07, Anthony Damico wrote:
> hello, i am trying to replace non-ASCII characters in a character string
> with a single space. the iconv() function works as i expect it to on
> windows, but on unix, non-ASCII characters are getting replaced with two
> spaces instead of one. i suppose i could write a workaround for my code,
> but i'm wondering if i'm making some other mistake?

You are (not reading the help, not writing legible English) ...

     sub: character string.  If not ‘NA’ it is used to replace any
          non-convertible bytes in the input.

Note *bytes* not characters. In UTF-8 'ó' is two bytes, other non-ASCII characters can be 2, 3, 4 (in the current Unicode standard, originally in principle up to 6). We do not know what locale you used on Windows, but in non-CJK locales characters == bytes.

I guess chartr() will do what you want using a character range.

>
> in the output below, this is the result i'm getting:
> [1] "cancelaci  n"
>
> and this is the result i want:
> [1] "cancelaci n"
>
> thanks!!
>
> ================
>
>> getOption( "encoding" )
> [1] "windows-1252"

What is the relevance of that?

>
>> a <- "cancelación"
>> iconv(a,"","ASCII")
> [1] NA
>> iconv(a,"","ASCII",sub=" ")
> [1] "cancelaci  n"
>
> ================
>
>> sessionInfo()
> R version 3.1.2 (2014-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] R.utils_1.34.0    R.oo_1.18.0       R.methodsS3_1.6.1 descr_1.0.4
>  [5] SAScii_1.0        downloader_0.3    foreign_0.8-61    MonetDB.R_0.9.5
>  [9] digest_0.6.6      DBI_0.3.1
>
> loaded via a namespace (and not attached):
> [1] xtable_1.7-4

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
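
The bytes-versus-characters point in the reply can be illustrated with a short sketch. It is not code from the thread, and it does not use the chartr() route suggested above; instead it works on Unicode code points with the base R functions utf8ToInt() and intToUtf8(). The helper name replace_non_ascii and its replacement argument are hypothetical, and the sketch assumes the input is non-NA and either ASCII or valid UTF-8.

```r
# Hypothetical helper (not from the thread): replace each non-ASCII
# *character* with one replacement character, however many bytes the
# original character occupies in UTF-8.
# Assumes every element of x is non-NA and is ASCII or valid UTF-8.
replace_non_ascii <- function(x, replacement = " ") {
  repl_cp <- utf8ToInt(replacement)    # code point of the replacement
  stopifnot(length(repl_cp) == 1L)     # must be exactly one character
  vapply(x, function(s) {
    cp <- utf8ToInt(s)                 # one integer per character, not per byte
    cp[cp > 127L] <- repl_cp           # anything above U+007F is non-ASCII
    intToUtf8(cp)                      # rebuild a single string
  }, character(1), USE.NAMES = FALSE)
}

a <- "cancelaci\u00f3n"                # "cancelación", written with an escape
nchar(a, type = "bytes")               # 12: the accented letter takes two bytes
replace_non_ascii(a)                   # "cancelaci n" (a single space)
```

Because the replacement is done per code point, a character that needs three or four bytes in UTF-8 would also become a single space, whereas iconv(..., sub = " ") substitutes once per non-convertible byte, as the quoted help text says.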