Christophe Dutang
2016-Apr-04 19:19 UTC
[R] Find the dataset(s) that contain(s) non-ASCII characters
Dear list,

I'm maintaining a package containing only datasets (152 of them): http://dutangc.free.fr/pub/RRepos/web/CASdatasets-index.html

When running R CMD check on the package, I get the following NOTE:

    * checking data for non-ASCII characters ... NOTE
      Note: found 4 marked UTF-8 strings

I wonder how to find which dataset(s) (all stored as rda files) contain(s) non-ASCII characters.

The iconv function lets us find or replace non-ASCII characters:

    iconv(x, "UTF-8", "ASCII", sub="I_WAS_NOT_ASCII")

I use the following function to detect non-ASCII characters:

    testASCII <- function(idata)
    {
      # indices of the factor columns, then of the character columns
      col <- (1:NCOL(idata))[sapply(idata, is.factor)]
      col <- c(col, (1:NCOL(idata))[sapply(idata, is.character)])
      for(i in col)
      {
        x <- idata[, i]
        cat(colnames(idata)[i], "\n")
        # positions of the strings that cannot be converted to ASCII
        res <- grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
        res <- c(res, grep("I_WAS_NOT_ASCII", iconv(x, "UTF-8", "ASCII", sub="I_WAS_NOT_ASCII")))
        if(length(res) > 0)
          cat(res, "\n")
      }
    }

Unfortunately, I have not yet found which rda file contains non-ASCII characters among the 56 most recent datasets.

Is there a faster way to detect non-ASCII characters than manually loading each dataset and running testASCII() on it? For example, directly on the rda files?

Any comment is welcome.

Regards, Christophe

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

---------------------------------------
Christophe Dutang
LMM, UdM, Le Mans, France
web: http://dutangc.free.fr
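
One way the manual load-and-test loop could be automated is sketched below. It is only an illustration under stated assumptions, not a tested answer for this package: the helper names find_non_ascii() and scan_rda_dir() are hypothetical, and so is the directory path in the usage line. The sketch loads every rda file into its own environment and walks each object recursively, including attributes such as factor levels and column names, which a column-by-column test on the data values alone may not reach. A string is flagged when any of its bytes exceeds 127.

```r
# Collect every non-ASCII string found in an object, recursing through
# list elements (e.g. data frame columns) and attributes (e.g. factor
# levels, names). Hypothetical helper, not part of any package.
find_non_ascii <- function(x) {
  hits <- character(0)
  if (is.character(x)) {
    s <- unname(as.vector(unclass(x)))
    s <- s[!is.na(s)]
    # A string is non-ASCII if at least one of its bytes exceeds 127.
    bad <- vapply(s, function(z) any(as.integer(charToRaw(z)) > 127L),
                  logical(1), USE.NAMES = FALSE)
    hits <- s[bad]
  }
  if (is.list(x)) {
    for (el in unclass(x)) hits <- c(hits, find_non_ascii(el))
  }
  for (a in attributes(x)) hits <- c(hits, find_non_ascii(a))
  hits
}

# Load each rda file of a directory into a fresh environment and return
# a named list holding, per file, the distinct non-ASCII strings found.
# Files without any non-ASCII string are dropped from the result.
scan_rda_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.rda$", full.names = TRUE,
                      ignore.case = TRUE)
  res <- lapply(files, function(f) {
    env <- new.env()
    objs <- load(f, envir = env)
    unique(unlist(lapply(objs, function(nm)
      find_non_ascii(get(nm, envir = env)))))
  })
  names(res) <- basename(files)
  res[lengths(res) > 0]
}
```

With the package sources at hand, a call such as scan_rda_dir("CASdatasets/data") (path assumed) would list only the files that hold non-ASCII strings, together with the strings themselves, so they can be inspected or re-encoded one by one.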