Mark Heckmann
2009-Jun-30 15:30 UTC
[R] Using regular expressions to detect clusters of consonants in a string
Hi,
I want to parse a string and count the occurrences where exactly two
consonants clump together. Consider for example the word "hallo". Here I
want the algorithm to return 1. For "chess" I want it to return 2. For the
word "screw" the result should be negative, as it is a clump of three
consonants, not two. Also, for the word "abstraction" I do not want the
algorithm to detect a two-consonant cluster twice. In this case the result
should be negative as well, as it has four consonants in a row.
str <- "hallo"
gregexpr("[bcdfghjklmnpqrstvwxyz]{2}[aeiou]{1}", str, ignore.case = TRUE,
         extended = TRUE)[[1]]
[1] 3
attr(,"match.length")
[1] 3
The result is correct. Now I change the word to "hall"
str <- "hall"
gregexpr("[bcdfghjklmnpqrstvwxyz]{2}[aeiou]{1}", str, ignore.case = TRUE,
         extended = TRUE)[[1]]
[1] -1
attr(,"match.length")
[1] -1
Here my expression fails. How can I write a correct regex to do this? I
always encounter problems at the beginning or end of a string.
Also:
str <- "abstraction"
gregexpr("[bcdfghjklmnpqrstvwxyz]{2}[aeiou]{1}", str, ignore.case = TRUE,
         extended = TRUE)[[1]]
[1] 4 7
attr(,"match.length")
[1] 3 3
This also fails.
Thanks in advance,
Mark
-------------------------------
Mark Heckmann
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com
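Editor's sketch, not part of the original post: the pattern above requires a vowel after the two consonants, so it cannot match at the end of a string ("hall"), and {2} alone does not stop it from matching inside a longer run ("abstraction"). One possible way around both issues, assuming the PCRE engine is acceptable (perl = TRUE), is to use lookarounds so that the pair is neither preceded nor followed by another consonant. The helper name count_two_clusters is hypothetical, and this version only counts isolated pairs; it does not reject words that also contain longer runs.

```r
# Consonant class copied from the thread; "y" is treated as a consonant.
cons <- "[bcdfghjklmnpqrstvwxyz]"

# Exactly two consonants: not preceded and not followed by another consonant.
# Lookbehind and lookahead need the PCRE engine, hence perl = TRUE below.
pattern <- paste0("(?<!", cons, ")", cons, "{2}(?!", cons, ")")

# count_two_clusters is a hypothetical helper name used for illustration.
count_two_clusters <- function(word) {
  m <- gregexpr(pattern, word, perl = TRUE, ignore.case = TRUE)[[1]]
  # gregexpr signals "no match" with a single -1
  if (m[1] == -1) 0L else length(m)
}

sapply(c("hallo", "hall", "chess", "screw", "abstraction"), count_two_clusters)
```

Because the lookarounds also succeed at the start and end of the string, "hall" and "chess" are handled without special cases.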
Greg Hirson
2009-Jun-30 16:29 UTC
[R] Using regular expressions to detect clusters of consonants in a string
Mark,
"Abstraction" also has a valid two consonant cluster ("ct"). Some logic
could be added to reject words that have valid twos if they also have
longer strings of consonants.
This may work as a starting off point, using strsplit:
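twocons = function(word){
  # split on vowels; the pieces left over are the consonant runs
  chars = strsplit(word, "[aeiou]")
  conlengths = lapply(chars, nchar)
  # count the runs that are exactly two characters long
  numtwos = sum(conlengths[[1]] == 2)
  return(numtwos)
}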
words = c("test", "hello", "fail", "pass", "assess")
lapply(words, twocons)
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 1
[[5]]
[1] 2
I hope this is helpful,
Greg
--
Greg Hirson
ghirson at ucdavis.edu
Graduate Student
Agricultural and Environmental Chemistry
1106 Robert Mondavi Institute North
One Shields Avenue
Davis, CA 95616
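Editor's sketch, not part of Greg's message: the rejection logic mentioned above could be added to the strsplit approach roughly as follows. The function name twocons_strict and the use of -1 as the "rejected" value are assumptions made for illustration, as is the assumption that the input contains letters only.

```r
# twocons_strict is a hypothetical name; returning -1 for rejected words
# is an assumption, the thread only says the result "should be negative".
twocons_strict <- function(word) {
  # Split on vowels of either case; the remaining pieces are consonant runs.
  runs <- strsplit(word, "[aeiouAEIOU]")[[1]]
  lens <- nchar(runs)
  # Any run of three or more consonants rejects the whole word.
  if (any(lens > 2)) return(-1L)
  sum(lens == 2)
}

sapply(c("hallo", "chess", "screw", "abstraction", "Hall"), twocons_strict)
```

Checking the longest run first means "abstraction" is rejected even though it also contains the valid pair "ct".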
Gabor Grothendieck
2009-Jun-30 16:30 UTC
[R] Using regular expressions to detect clusters of consonants in a string
Try this:
library(gsubfn)
s <- "mystring"
strapply(s, "[bcdfghjklmnpqrstvwxyz]+", nchar)[[1]]
which returns a vector of consonant string lengths.
Now apply your algorithm to that.
See http://gsubfn.googlecode.com for more.
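Editor's sketch, not part of Gabor's message: if installing gsubfn is not an option, the same vector of consonant run lengths can probably be obtained in base R with gregexpr and regmatches. This is a stand-in for the strapply call, not its implementation, and the helper name cons_run_lengths is hypothetical.

```r
# cons_run_lengths is a hypothetical helper name used for illustration.
cons_run_lengths <- function(word) {
  # Locate every maximal run of consonants, ignoring case.
  m <- gregexpr("[bcdfghjklmnpqrstvwxyz]+", word, ignore.case = TRUE)
  # Extract the matched substrings and measure them.
  nchar(regmatches(word, m)[[1]])
}

cons_run_lengths("mystring")     # 5 2
cons_run_lengths("abstraction")  # 4 2 1

# Applying a counting rule to the lengths is then a one-liner:
lens <- cons_run_lengths("chess")
sum(lens == 2)
```

Working on run lengths rather than match positions keeps the regular expression simple and leaves the counting rule to ordinary vector code.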
Mark Heckmann
2009-Jul-01 09:07 UTC
[R] Using regular expressions to detect clusters of consonants in a string
Hi Gabor,
thanks for this great advice. Just one more question:
I cannot find how to switch off case sensitivity for the regex in the
documentation for gsubfn or strapply, as is possible in gregexpr with the
ignore.case = TRUE argument. Is there a way?
TIA,
Mark
Gabor Grothendieck
2009-Jul-01 11:57 UTC
[R] Using regular expressions to detect clusters of consonants in a string
strapply and gsubfn pass the ... argument on to gsub, so they accept all the
same arguments. See ?strapply and ?gsubfn. E.g.
> strapply("MyString", "[bcdfghjklmnpqrstvwxyz]+", nchar, ignore.case = TRUE)[[1]]
[1] 5 2
> gsubfn("[bcdfghjklmnpqrstvwxyz]+", "X", "MyString", ignore.case = TRUE)
[1] "XiX"