Stefan Th. Gries
2006-Jul-23 01:48 UTC
[R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences
Dear all

I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and I have two related regular expression problems.

platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)

I would like to find cases of words in elements of character vectors that end in the same character sequences; if I find such cases, I want to add <r> to both potentially rhyming sequences. An example:

INPUT: This is my dog.
DESIRED OUTPUT: This<r> is<r> my dog.

I found a solution for cases where the potentially rhyming words are adjacent:

text<-"This is my dog."
gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

However, with another text vector, I came across two problems I cannot seem to solve and for which I would love to get some input.

(i) While I know what to do for non-adjacent words in general

gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog", perl=TRUE) # I know this is not proper English ;-)

this runs into problems with overlapping matches:

text<-"And this is the second sentence"
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
[1] "And<r> this is the second<r> sentence"

It finds the "nd" match, but since the "is" match is within the two "nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?

(ii) How would one tell R to match only when there are 2+ characters matching? If the above expression is applied to another character string

text<-"this is an example sentence."
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

it also matches the "e"'s at the end of example and sentence. It's not possible to get rid of that by specifying a range such as {2,}

text<-"this is an example sentence."
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

because, as I understand it, this requires the 2+ cases of \\w to be identical characters:

text<-"doo yoo see mee?"
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

Again, any ideas?

I'd really appreciate any snippets of codes, pointers, etc.
Thanks so much,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
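The overlap problem described under (i) can be made visible by looking at what the pattern consumes rather than at the substituted result. The following is only an illustrative sketch: it assumes an R version that provides regmatches(), and the variable name consumed is made up for this example.

```r
# Illustration only: inspect what the non-adjacent pattern actually consumes.
text <- "And this is the second sentence"
pat  <- "(\\w+?)(\\W.+?)\\1(\\W)"

# gregexpr() reports all non-overlapping matches; regmatches() extracts them.
consumed <- regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]]
print(consumed)
```

On this input there should be a single match that runs from the "nd" of "And" through the space after "second". Both occurrences of "is" lie inside that span, so gsub has nothing left to match them against once the outer pair has been replaced.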
Gabor Grothendieck
2006-Jul-23 04:05 UTC
[R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences
The following requires more than just a single gsub but it does solve the problem. Modify to suit.

The first gsub places <...> around the first occurrence of any duplicated suffixes. We use the (?=...) zero width regexp to circumvent the nesting problem. Then we use strapply from the gsubfn package to extract the suffixes so marked and paste them together to pass to a second gsub which locates them in the original string appending an <r> to each. Uncomment the commented pat if you only want to match 2+ character suffixes.

library(gsubfn)

# places <...> around first occurrences of repeated suffixes
text <- "And this is the second sentence"
pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
out <- gsub(pat, "\\<\\1\\>", text, perl = TRUE)

suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)
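The two-pass idea above (a zero-width lookahead to find suffixes that recur, then one substitution over the original string) can also be sketched without the gsubfn package. This is not the code from the message: it swaps strapply for base R's gregexpr() and regmatches(), and the function name mark_rhymes and its min_len argument are made up for illustration.

```r
# Sketch only: base-R variant of the lookahead approach (no gsubfn needed).
# mark_rhymes() and min_len are hypothetical names, not part of any package.
mark_rhymes <- function(text, min_len = 2) {
  # A run of at least min_len word characters that ends a word and that
  # shows up again later in the string, also at the end of a word.
  # The lookahead (?=...) checks this without consuming any text.
  pat <- sprintf("(\\w{%d,})(?=\\b.+\\1\\b)", min_len)
  suff <- unique(regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]])
  if (length(suff) == 0) return(text)

  # One alternation of all recurring suffixes, one substitution pass.
  alt <- paste(suff, collapse = "|")
  gsub(sprintf("(%s)\\b", alt), "\\1<r>", text, perl = TRUE)
}

mark_rhymes("And this is the second sentence")
```

With min_len = 2 the single-letter endings (such as the final "e" of "the" and "sentence") are ignored, which corresponds to the commented-out pat in the message. The suffixes are assumed to consist of word characters only, so they need no escaping before being pasted into the second pattern.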
Greg Snow
2006-Jul-25 16:56 UTC
[R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences
Using regular expression matching for this case may be overkill (the RE engine will be doing a lot of backtracking looking at a lot of non-matches). Here is an alternative that splits the text into a vector of words, extracts the last 2 letters of each word (remember if the last 3 letters match, then the last 2 have to match, so we only need to consider the last 2), then looks at all pairwise comparisons for matches, then pastes everything back together with the marked matches:

text<-"And this is a second rand sentence"

tmp1 <- strsplit(text, ' ')[[1]]
tmp2 <- nchar(tmp1)
tmp3 <- substr(tmp1,tmp2-1,tmp2)

tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]

tmp6 <- rep('', length(tmp1))
count <- 1
for( i in which(tmp5) ){
  tmp6[ tmp4[i,1] ] <- paste(tmp6[ tmp4[i,1] ], '<r',count,'>',sep='')
  tmp6[ tmp4[i,2] ] <- paste(tmp6[ tmp4[i,2] ], '<r',count,'>',sep='')
  count <- count + 1
}

out.text <- paste( tmp1,tmp6, sep='',collapse=' ')

If you are doing a lot of text processing like this, I would suggest doing it in Perl rather than R. S Poetry by Dr. Burns has a function to take a vector of character strings in R and run a Perl script on it and return the results.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
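If only a plain <r> flag is wanted, the split-and-compare idea above can be sketched without building all pairwise comparisons, by asking which endings occur more than once. This is a variation rather than the code from the message: the function name mark_by_ending and the argument n are made up for illustration, and words shorter than n characters are deliberately left unmarked, which differs slightly from the original.

```r
# Sketch only: flag words whose last n characters are shared with another word.
# mark_by_ending() and n are hypothetical names used for this illustration.
mark_by_ending <- function(text, n = 2) {
  words  <- strsplit(text, " ", fixed = TRUE)[[1]]
  len    <- nchar(words)
  ending <- substr(words, pmax(len - n + 1, 1), len)

  # An ending is shared if it appears again before or after this word.
  # Words shorter than n characters are excluded (an assumption of this sketch).
  shared <- len >= n &
    (duplicated(ending) | duplicated(ending, fromLast = TRUE))

  paste(ifelse(shared, paste0(words, "<r>"), words), collapse = " ")
}

mark_by_ending("And this is a second rand sentence")
```

As in the message, the text is split on single spaces only, so trailing punctuation stays attached to the word and becomes part of its ending.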
Greg Snow
2006-Jul-25 19:38 UTC
[R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences
Before comparing times we should make sure that the functions return the same thing. My original function (f1 below) labels the potential rhymes with match numbers as well as finding possible rhymes; if you just want the <r> flag, then the for loop can be eliminated, giving f4 as follows:

f4 <- function(text) {
  tmp1 <- strsplit(text, ' ')[[1]]
  tmp2 <- nchar(tmp1)
  tmp3 <- substr(tmp1,tmp2-1,tmp2)
  tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
  tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]
  tmp6 <- rep('', length(tmp1))
  tmp6[ unique(c(tmp4[tmp5,])) ] <- '<r>'
  paste( tmp1,tmp6, sep='',collapse=' ')
}

The speed of f4 is similar to the speed of f3 (even after correcting f3; the original one just returns the original text string). But that is on the sample string; what if a longer string is used (more potential for backtracking)? Try the string generated by:

set.seed(1)
text <- paste( sample(c(letters,' ',' ',' '), 1000, replace=TRUE), collapse='')
text <- gsub(" {2,}"," ",text)

Now f4 is much faster than f3. However, f3 can be optimized by replacing \\w+ in pat by \\w{2}, and that makes it faster than f4 again.

It would probably be even faster to use gregexpr to just find the matching endings, then create the new regexp based on those endings and do one substitution rather than using multiple gsubs.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Tuesday, July 25, 2006 11:41 AM
To: Greg Snow
Cc: Stefan Th. Gries; r-help at stat.math.ethz.ch
Subject: Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

Regarding having to do a lot of backtracking, one can just look at the relative comparison of speeds and we see that they are comparable in speed. In fact the bottleneck is not the backtracking but strapply. I had coded the regexp version for compactness of code, but if we replace the strapply with custom gsub/strsplit code for speed, the new regexp version is twice as fast as the for loop version.

Below f1 is the for loop version, f2 is the original regexp version with strapply and f3 is the revised version using gsub/strsplit instead.

f1 <- function() {
  tmp1 <- strsplit(text, ' ')[[1]]
  tmp2 <- nchar(tmp1)
  tmp3 <- substr(tmp1,tmp2-1,tmp2)
  tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
  tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]
  tmp6 <- rep('', length(tmp1))
  count <- 1
  for( i in which(tmp5) ){
    tmp6[ tmp4[i,1] ] <- paste(tmp6[ tmp4[i,1] ], '<r',count,'>',sep='')
    tmp6[ tmp4[i,2] ] <- paste(tmp6[ tmp4[i,2] ], '<r',count,'>',sep='')
    count <- count + 1
  }
  out.text <- paste( tmp1,tmp6, sep='',collapse=' ')
}

# places <...> around first occurrences of repeated suffixes
library(gsubfn)
f2 <- function() {
  text <- "And this is the second sentence"
  pat <- "(\\w+)(?=\\b.+\\1\\b)"
  # pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
  out <- gsub(pat, "\\<\\1\\>", text, perl = TRUE)
  suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
  gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)
}

f3 <- function() {
  text <- "And this is the second sentence"
  pat <- "(\\w+)(?=\\b.+\\1\\b)"
  # pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
  out <- gsub(pat, "\\<\\1\\>", text, perl = TRUE)
  # redo this strapply by hand for speed purposes
  # suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
  suff <- gsub("[^<>]*<|>[^<>]*<|>[^<>]*$", "<", out)
  suff <- gsub("^<|<$", "", suff)
  suff <- strsplit(suff, "<")[[1]]
  gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)
}

# for loop version
system.time(for (i in 1:100) f1())
# 0.32 0.00 0.36 NA NA

# original regexp version with strapply
system.time(for (i in 1:100) f2())
# 0.36 0.00 0.38 NA NA

# regexp version with strapply replaced with gsub/strsplit
system.time(for (i in 1:100) f3())
# 0.15 0.00 0.16 NA NA
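The closing suggestion in the thread (find the word endings with gregexpr, build one regexp from the endings that repeat, then substitute once) was not spelled out in code. A minimal sketch of how it might look follows; the function name mark_shared_endings, the argument n, and the use of table() to count endings are assumptions of this sketch and do not come from the thread.

```r
# Sketch only: one gregexpr pass to collect endings, one gsub pass to mark them.
# mark_shared_endings() and n are hypothetical names used for this illustration.
mark_shared_endings <- function(text, n = 2) {
  # Exactly n word characters directly before a word boundary, i.e. the
  # last n characters of every word that has at least n characters.
  pat <- sprintf("\\w{%d}\\b", n)
  endings <- regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]]

  # Keep only the endings that occur in more than one word.
  counts <- table(endings)
  shared <- names(counts)[counts > 1]
  if (length(shared) == 0) return(text)

  # One alternation, one substitution; no backreferences or lookaheads needed.
  gsub(sprintf("(%s)\\b", paste(shared, collapse = "|")), "\\1<r>",
       text, perl = TRUE)
}

mark_shared_endings("And this is the second sentence")
mark_shared_endings("this is an example sentence.")
```

Because neither pattern contains a backreference, the engine has no reason to scan ahead for a later repeat, which is the backtracking cost discussed above. Whether this is actually faster than f3 or f4 on long strings has not been measured here.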