Dear helpers,

I'm trying to replace a character with a unicode code inside a data
frame using gsub(), but unsuccessfully.

> data.frame(animals=c("dog","wolf","cat"))->my.data
> gsub("o","\u0254",my.data$animals)->my.data$animals
> my.data$animals
[1] "d??g" "w??lf" "cat"

It's not that a data frame cannot have unicode codes, cf. e.g.

> data.frame(animals=c("d\u0254g","w\u0254lf","cat"))->my.data.2
> my.data.2$animals
[1] d?g w?lf cat
Levels: cat d<U+0254>g w<U+0254>lf

I've done the best I can based on what ?gsub and ?enc2utf8 tell me,
but I haven't found a solution.

Unrelated to that problem, but related to gsub() is that I can't find
a way for gsub() to interpret the backslash as a character. In regular
expression, \\ should represent "the character \", but gsub() doesn't:

> data.frame(animals=c("dog","wolf","cat"))->my.data
> gsub("d","\\",my.data$animals)
[1] "og" "wolf" "cat"

Thank you
Sverre
To put a backslash in the replacement expression of sub or gsub
(when fixed=FALSE) use 4 backslashes. The rationale is that, in the
replacement expression, backslash-digit means to use the digit'th
parenthesized subpattern as the replacement and backslash-backslash
means to put in a literal backslash. However, the R parser also uses
backslashes to signify things like unicode characters (that backslash
is not in the string stored by R, but is just a signal to the parser)
and it requires a doubled backslash to enter a backslash. 2*2 is 4
backslashes. E.g.,

> gsub("([[:digit:]]+)([[:alpha:]]+)", "alpha=<<\\2>>\\\\numeric=<<\\1>>", c("12P", "34Cat"))
[1] "alpha=<<P>>\\numeric=<<12>>"   "alpha=<<Cat>>\\numeric=<<34>>"
> cat(.Last.value, sep="\n") # see what is really in the strings
alpha=<<P>>\numeric=<<12>>
alpha=<<Cat>>\numeric=<<34>>

I don't know about your unicode/encoding problem.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Sverre Stausland
> Sent: Saturday, July 16, 2011 7:20 PM
> To: r-help at r-project.org
> Subject: [R] gsub() with unicode and escape character
>
> Unrelated to that problem, but related to gsub() is that I can't find
> a way for gsub() to interpret the backslash as a character. In regular
> expression, \\ should represent "the character \", but gsub() doesn't:
>
> > data.frame(animals=c("dog","wolf","cat"))->my.data
> > gsub("d","\\",my.data$animals)
> [1] "og" "wolf" "cat"
>
> Thank you
> Sverre
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
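The "(when fixed=FALSE)" qualifier in Bill Dunlap's reply suggests an alternative worth noting. The following is a minimal sketch, not something proposed in the thread: it assumes that with fixed = TRUE gsub() inserts the replacement literally, so two typed backslashes are enough. The variable names are illustrative.

```r
animals <- c("dog", "wolf", "cat")

# Regex mode (the default, fixed = FALSE): four typed backslashes are needed.
via_regex <- gsub("d", "\\\\", animals)

# Literal mode: the replacement is not scanned for backreferences, so the
# two typed backslashes (one stored character) are inserted as they are.
via_fixed <- gsub("d", "\\", animals, fixed = TRUE)

identical(via_regex, via_fixed)   # both approaches give the same strings
writeLines(via_fixed)             # shows the raw contents, one per line
```

The trade-off is that fixed = TRUE also makes the pattern literal, so this only helps when the pattern does not need regular-expression syntax.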
Don't know the answer to your first question, but for the \\ see below.

On Sat, Jul 16, 2011 at 7:19 PM, Sverre Stausland <johnsen at fas.harvard.edu> wrote:
> Unrelated to that problem, but related to gsub() is that I can't find
> a way for gsub() to interpret the backslash as a character. In regular
> expression, \\ should represent "the character \", but gsub() doesn't:
>
>> data.frame(animals=c("dog","wolf","cat"))->my.data
>> gsub("d","\\",my.data$animals)
> [1] "og"   "wolf" "cat"

Use \\\\ (yes, that's 4 backslashes).

> gsub("d","\\\\",my.data$animals)
[1] "\\og" "wolf" "cat"
> cat(paste(gsub("d","\\\\",my.data$animals)))
\og wolf cat

The reason is that the backslashes get interpreted twice: once when the
R parser reads the string literal, and a second time when gsub()
processes the replacement string.

HTH
Peter
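The two rounds of interpretation Peter describes can be made visible by counting characters at each stage. This is a small sketch with illustrative variable names (repl and out are not from the thread):

```r
# Layer 1: the R parser turns each typed pair of backslashes into one.
repl <- "\\\\"      # four backslashes typed in the source
nchar(repl)         # the stored string holds only two characters
writeLines(repl)    # shows the raw string contents

# Layer 2: gsub() (with fixed = FALSE) reads those two characters as an
# escaped backslash and inserts a single literal backslash.
out <- gsub("d", repl, "dog")
nchar(out)          # one backslash plus "og"
writeLines(out)     # shows the raw result
```

print() re-escapes backslashes for display, which is why the result appears as "\\og" at the prompt even though it contains only one backslash; writeLines() and cat() show the actual contents.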
You forgot the 'at a minimum' information required by the posting
guide. Most likely this is a limitation of the locale you used (and
failed to tell us about) on the OS you used (...).

On Sat, 16 Jul 2011, Sverre Stausland wrote:

> Dear helpers,
>
> I'm trying to replace a character with a unicode code inside a data
> frame using gsub(), but unsuccessfully.
>
>> data.frame(animals=c("dog","wolf","cat"))->my.data
>> gsub("o","\u0254",my.data$animals)->my.data$animals
>> my.data$animals
> [1] "d??g" "w??lf" "cat"
>
> It's not that a data frame cannot have unicode codes, cf. e.g.
>
>> data.frame(animals=c("d\u0254g","w\u0254lf","cat"))->my.data.2
>> my.data.2$animals
> [1] d?g w?lf cat
> Levels: cat d<U+0254>g w<U+0254>lf
>
> I've done the best I can based on what ?gsub and ?enc2utf8 tell me,
> but I haven't found a solution.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
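The locale hypothesis in this reply can be checked from within R. What follows is a rough diagnostic sketch, assuming a reasonably recent R version; the variable names are illustrative, and behaviour on the R version and OS used in the original post may differ.

```r
animals <- c("dog", "wolf", "cat")
replaced <- gsub("o", "\u0254", animals)

# What character set does the session claim to support?
Sys.getlocale("LC_CTYPE")
l10n_info()[["UTF-8"]]    # TRUE only in a UTF-8 locale

# How are the results marked, and which code points do they hold?
Encoding(replaced)
utf8ToInt(enc2utf8(replaced[1]))   # U+0254 is 596 in decimal
```

If the code points come back as 100, 596, 103 while printing still shows "?" or <U+0254>, the string itself is likely intact and only its display is limited by the locale. If the code points differ, the replacement was probably mangled during substitution, which is the situation the original output ("d??g") points to.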