Hello,

I have a simple question, but I don't know which method is best to use for my problem.

I have the following strings:

str1 <- "My_name_is_peter"
str2 <- "what_is_your_surname_peter"

I would like to apply the predefined abbreviations peter=p and name=n to both strings, so that the new strings look like the following:

str1: "My_n_is_p"
str2: "what_is_your_surn_p"

Which method is best to use for this particular problem?

syrvn

--
View this message in context: http://r.789695.n4.nabble.com/Which-function-to-use-grep-replace-substr-etc-tp3909871p3909871.html
Sent from the R help mailing list archive at Nabble.com.
David Winsemius
2011-Oct-16 16:42 UTC
[R] Which function to use: grep, replace, substr etc.?
On Oct 16, 2011, at 12:35 PM, syrvn wrote:

> Hello,
>
> I have a simple question but I don't know which method is best to use for my
> problem.
>
> I have the following strings:
>
> str1 <- "My_name_is_peter"
> str2 <- "what_is_your_surname_peter"
>
> I would like to apply predefined abbreviations for peter=p and name=n to
> both strings
> so that the new strings look like the followings:
>
> str1: "My_n_is_p"
> str2: "what_is_your_surn_p"
>
> Which method is the best to use for that particular problem?

?sub   # on same page as grep

> vec <- c(str1, str2)   # assumed definition: the two strings from the question
> sub("(p)eter", "\\1", vec)
[1] "My_name_is_p"           "what_is_your_surname_p"

--
David Winsemius, MD
West Hartford, CT
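A minimal sketch of how the sub() suggestion above might be extended to cover both abbreviations from the question. The named vector abbr is an assumed layout chosen for illustration, and gsub() is used in place of sub() so that every occurrence is replaced, not just the first one.

```r
# The two strings from the original question.
vec <- c("My_name_is_peter", "what_is_your_surname_peter")

# Abbreviation table as a named character vector (assumed layout):
# names are the patterns, values are the replacements.
abbr <- c(peter = "p", name = "n")

out <- vec
for (pat in names(abbr)) {
  # fixed = TRUE treats the pattern as a literal string, not a regex.
  out <- gsub(pat, abbr[[pat]], out, fixed = TRUE)
}
print(out)
# [1] "My_n_is_p"           "what_is_your_surn_p"
```

Because this is plain substring replacement, "surname" becomes "surn", which is what the question asks for. The same behaviour is unwanted when one pattern is contained in another word that should be left alone.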
Hi,

thanks for the tip! I do it as follows now, but I still have a problem I do not understand:

abbrvs <- data.frame(c("peter", "name", "male", "female"),
                     c("P", "N", "m", "f"))

colnames(abbrvs) <- c("pattern", "replacement")

str <- "My name is peter and I am male"

for(m in 1:nrow(abbrvs)) {
  str <- sub(abbrvs$pattern[m], abbrvs$replacement[m], str, fixed=TRUE)
  print(str)
}

This works perfectly fine, as I get: "My N is P and I am m"

However, when I replace "male" with "female" in the string, I get the following: "My N is P and I am fem"

but I want to have "My N is P and I am f".

Even with the parameter fixed=TRUE I get the same result. Why is that?
William Dunlap
2011-Oct-17 01:25 UTC
[R] Which function to use: grep, replace, substr etc.?
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
> Sent: Sunday, October 16, 2011 1:59 PM
> To: Jeff Newmiller
> Cc: r-help at r-project.org; syrvn
> Subject: Re: [R] Which function to use: grep, replace, substr etc.?
>
> On Oct 16, 2011, at 1:32 PM, Jeff Newmiller wrote:
>
> > Note that "male" comes before "female" in your data frame.
> > ---------------------------------------------------------------------------
> > Jeff Newmiller The ..... ..... Go Live...
> >
> > syrvn <mentor_ at gmx.net> wrote:
> >
> > Hi,
> >
> > thanks for the tip! I do it as follows now but I still have a
> > problem I do not understand:
> >
> > abbrvs <- data.frame(c("peter", "name", "male", "female"),
> >                      c("P", "N", "m", "f"))
> >
> > colnames(abbrvs) <- c("pattern", "replacement")
> >
> > str <- "My name is peter and I am male"
> >
> > for(m in 1:nrow(abbrvs)) {
> >   str <- sub(abbrvs$pattern[m], abbrvs$replacement[m], str,
> >              fixed=TRUE)
> >   print(str)
> > }
> >
> > This works perfectly fine as I get: "My N is P and I am m"
> >
> > However, when I replace male by female then I get the following:
> > "My N is P and I am fem"
> >
> > but I want to have "My N is P and I am f".
> >
> > Even with the parameter fixed=true I get the same result. Why is that?
>
> Because "male" is in "female"? This reminds me of a comment on a
> posting I made this morning on SO.
> http://stackoverflow.com/questions/7782113/counting-keyword-occurrences-in-r
>
> The problem was slightly different, but the greppish principle was
> that in order to match only complete words, you need to specify "^",
> "$" or " " at each end of the word:
>
> dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
> grep("^corn$|^corn | corn$", dataset)
> [1] 1 3

You can use the two-character sequences "\\<" and "\\>" to match the beginning and end of a "word" (where the match takes up zero characters):

> dataset <- c("corn", "cornmeal", "corn on the cob", "popcorn", "this corn is sweet")
> grep("^corn$|^corn | corn$", dataset)
[1] 1 3
> grep("\\<corn\\>", dataset)
[1] 1 3 5
> gsub("\\<corn\\>", "CORN", dataset)
[1] "CORN"
[2] "cornmeal"
[3] "CORN on the cob"
[4] "popcorn"
[5] "this CORN is sweet"

If your definition of a "word" is more expansive, it gets complicated. E.g., if words might include letters, numbers, and periods but not underscores or anything else, you could use:

> gsub("(^|[^.[:alpha:][:digit:]])corn($|[^.[:alpha:][:digit:]])", "\\1CORN.BY.ITSELF\\2",
+      c("corn.1", "corn_2", " corn", "4corn", "1.corn"))
[1] "corn.1"
[2] "CORN.BY.ITSELF_2"
[3] " CORN.BY.ITSELF"
[4] "4corn"
[5] "1.corn"

Moving to perl regular expressions would probably make this simpler.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> In such cases you may want to look at the gsubfn package. It offers
> higher level matching functions and I think strapply might be more
> efficient and expressive here. I can imagine construction in a loop
> such as yours, but you would probably want to build a pattern outside
> the sub() call.
>
> After struggling to fix your loop (and your data.frame which
> definitely should not be using factor variables), I am even more
> convinced you should be learning "gsubfn" facilities. (Take out the
> debugging print statements.)
>
> > abbrvs <- data.frame(c("peter", "name", "male", "female"),
> +                      c(" P ", " N ", " m ", " f "), stringsAsFactors=FALSE)
> >
> > colnames(abbrvs) <- c("pattern", "replacement")
> >
> > for(m in 1:nrow(abbrvs)) { patt <- paste("^",abbrvs$pattern[m], "$| ",
> +                                          abbrvs$pattern[m], " | ",
> +                                          abbrvs$pattern[m], "$", sep="")
> +     print(c( patt, abbrvs$replacement[m]))
> +     str <- sub(patt, abbrvs$replacement[m], str)
> +     print(str)
> + }
> [1] "^peter$| peter | peter$" " P "
> [1] "My name is P and I am female"
> [1] "^name$| name | name$" " N "
> [1] "My N is P and I am female"
> [1] "^male$| male | male$" " m "
> [1] "My N is P and I am female"
> [1] "^female$| female | female$" " f "
> [1] "My N is P and I am f "
>
> --
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
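The two suggestions in this thread, the "\\<" and "\\>" word anchors and Perl-style regular expressions, could be combined with the abbreviation table roughly as sketched here. This is only an illustration under stated assumptions: the helper name abbreviate_words and the named-column layout of abbrvs are hypothetical and do not come from the thread, and the lookaround pattern is one possible reading of "moving to perl regular expressions", not Bill Dunlap's own code.

```r
# Abbreviation table; stringsAsFactors = FALSE keeps the columns as character.
abbrvs <- data.frame(pattern     = c("peter", "name", "male", "female"),
                     replacement = c("P", "N", "m", "f"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: replace each pattern only where it is a whole word.
abbreviate_words <- function(x, table) {
  for (i in seq_len(nrow(table))) {
    # "\\<" and "\\>" match the empty string at the start and end of a word.
    patt <- paste0("\\<", table$pattern[i], "\\>")
    x <- gsub(patt, table$replacement[i], x)
  }
  x
}

res1 <- abbreviate_words("My name is peter and I am female", abbrvs)
print(res1)
# [1] "My N is P and I am f"

# Perl-style lookarounds for the broader "word" definition (letters, digits
# and periods): "corn" must not be preceded or followed by such a character.
# The lookarounds consume nothing, so no backreferences are needed.
res2 <- gsub("(?<![[:alnum:].])corn(?![[:alnum:].])", "CORN.BY.ITSELF",
             c("corn.1", "corn_2", " corn", "4corn", "1.corn"), perl = TRUE)
print(res2)
# [1] "corn.1"           "CORN.BY.ITSELF_2" " CORN.BY.ITSELF"  "4corn"
# [5] "1.corn"
```

With whole-word matching, the order of "male" and "female" in the table no longer matters, and no padding spaces are needed around the replacements. Note that "_" is normally treated as a word character by "\\<" and "\\>", so the first part of this sketch would probably leave the underscore-joined strings from the first message unchanged.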