thr3ads.net - R help - [R] Using regular expressions to detect clusters of consonants in a string [Jun 2009]

Mark Heckmann

2009-Jun-30 15:30 UTC

[R] Using regular expressions to detect clusters of consonants in a string

Hi,

I want to parse a string and count the occurrences where exactly two
consonants clump together. Consider for example the word "hallo". Here I
want the algorithm to return 1. For "chess" I want it to return 2. For the
word "screw" the result should be negative, as it is a clump of three
consonants, not two. Also, for the word "abstraction" I do not want the
algorithm to detect a two-consonant cluster twice. In this case the result
should be negative as well, as it has four consonants in a row.

str <- "hallo"
gregexpr("[bcdfghjklmnpqrstvwxyz]{2}[aeiou]{1}", str, ignore.case = TRUE,
         extended = TRUE)[[1]]

[1] 3
attr(,"match.length")
[1] 3

The result is correct. Now I change the word to "hall"

str <- "hall"
gregexpr("[bcdfghjklmnpqrstvwxyz]{2}[aeiou]{1}", str, ignore.case = TRUE,
         extended = TRUE)[[1]]

[1] -1
attr(,"match.length")
[1] -1

Here my expression fails. How can I write a correct regex to do this? I
always encounter problems at the beginning or end of a string.

Also:

str <- "abstraction"
gregexpr("[bcdfghjklmnpqrstvwxyz]{2}[aeiou]{1}", str, ignore.case = TRUE,
         extended = TRUE)[[1]]

[1] 4 7
attr(,"match.length")
[1] 3 3

This also fails.

Thanks in advance,
Mark

-------------------------------
Mark Heckmann
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com
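
Editorial note: the pattern above requires a vowel after the two consonants, so a cluster at the end of a string ("hall") cannot match, and nothing prevents it from matching the last two consonants of a longer run (as in "abstraction"). One possible sketch, not taken from the thread, uses Perl-style lookarounds (perl = TRUE) to require that the pair is neither preceded nor followed by another consonant. The helper name count_twos is hypothetical, and the extended argument is left out because gregexpr() in current R versions no longer has it. Note that this returns 1 for "abstraction" because of the separate "ct" cluster; rejecting such words entirely would need an extra check.

```r
# Consonant class from the post above (note that "y" counts as a consonant).
cons <- "[bcdfghjklmnpqrstvwxyz]"

# Exactly two consonants: not preceded and not followed by another consonant.
# Lookbehind (?<!...) and lookahead (?!...) require perl = TRUE.
pat <- paste0("(?<!", cons, ")", cons, "{2}(?!", cons, ")")

# Hypothetical helper: number of two-consonant clusters in one word.
count_twos <- function(word) {
  m <- gregexpr(pat, word, ignore.case = TRUE, perl = TRUE)[[1]]
  # gregexpr() signals "no match" with a single -1.
  if (m[1] == -1) 0L else length(m)
}

sapply(c("hallo", "hall", "chess", "screw", "abstraction"), count_twos)
```

Because the lookarounds also succeed at the start and end of the string, clusters at the word boundaries ("chess", "hall") are counted.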

Greg Hirson

2009-Jun-30 16:29 UTC


Mark,

"Abstraction" also has a valid two-consonant cluster ("ct"). Some logic
could be added to reject words that have valid twos if they also have
longer strings of consonants.

This may work as a starting off point, using strsplit:

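# Count consonant clusters of exactly two letters in a single word.
twocons = function(word){
    # Splitting on vowels leaves the consonant runs (possibly empty strings).
    chars = strsplit(word, "[aeiou]")
    conlengths = lapply(chars, nchar)
    # Count the runs whose length is exactly 2.
    numtwos = sum(conlengths[[1]] == 2)
    return(numtwos)
    }

words = c("test", "hello", "fail", "pass", "assess")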

lapply(words, twocons)
[[1]]
[1] 1

[[2]]
[1] 1

[[3]]
[1] 0

[[4]]
[1] 1

[[5]]
[1] 2



I hope this is helpful,

Greg

-- 
Greg Hirson
ghirson at ucdavis.edu

Graduate Student
Agricultural and Environmental Chemistry

1106 Robert Mondavi Institute North
One Shields Avenue
Davis, CA 95616
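
Editorial note: the rejection logic Greg mentions could be sketched roughly as below. This is only an illustration under two assumptions that are not in the thread: the "negative" result is encoded as -1, and the input is a single word made of letters only. The name twocons_strict is hypothetical.

```r
# Hypothetical variant of twocons(): returns -1 (an assumed encoding of
# "negative") if the word contains a run of three or more consonants,
# otherwise the number of runs of exactly two consonants.
twocons_strict <- function(word) {
  # Lower-case first so that upper-case vowels also split the word.
  runs <- strsplit(tolower(word), "[aeiou]")[[1]]
  lens <- nchar(runs)
  # Any run longer than two consonants rejects the whole word.
  if (any(lens > 2)) return(-1L)
  sum(lens == 2)
}

sapply(c("hallo", "chess", "screw", "abstraction"), twocons_strict)
```

With this encoding "screw" and "abstraction" are both rejected, while "hallo" and "chess" give 1 and 2.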

Gabor Grothendieck

2009-Jun-30 16:30 UTC


Try this:

library(gsubfn)
s <- "mystring"
strapply(s, "[bcdfghjklmnpqrstvwxyz]+", nchar)[[1]]

which returns a vector of consonant string lengths.
Now apply your algorithm to that.
See http://gsubfn.googlecode.com for more.
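
Editorial note: for readers without gsubfn installed, a comparable vector of run lengths can probably be obtained in base R with gregexpr() and regmatches(). The sketch below is not from the thread; run_lengths is a hypothetical helper name, and it assumes a single input string.

```r
# Base-R sketch: lengths of all maximal consonant runs in a single string.
# run_lengths is a hypothetical helper name, not part of gsubfn.
run_lengths <- function(s) {
  m <- gregexpr("[bcdfghjklmnpqrstvwxyz]+", s, ignore.case = TRUE)
  # regmatches() extracts the matched substrings; nchar() gives their lengths.
  nchar(regmatches(s, m)[[1]])
}

lens <- run_lengths("MyString")   # runs "MyStr" and "ng"
sum(lens == 2)                    # number of runs of exactly two consonants
```

Counting the runs of length 2, or testing for any run longer than 2, then only needs ordinary vector comparisons on the result.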


Mark Heckmann

2009-Jul-01 09:07 UTC


Hi Gabor,

thanks for this great advice. Just one more question:
I cannot find in the documentation for gsubfn or strapply how to switch off
case sensitivity for the regex, as with the ignore.case = TRUE argument of
gregexpr. Is there a way?

TIA,
Mark 

-------------------------------

Mark Heckmann
+ 49 (0) 421 - 1614618
www.markheckmann.de
R-Blog: http://ryouready.wordpress.com





Gabor Grothendieck

2009-Jul-01 11:57 UTC


strapply and gsubfn pass the ... argument to gsub, so they accept
all the same arguments.  See ?strapply and ?gsubfn.  e.g.

> strapply("MyString", "[bcdfghjklmnpqrstvwxyz]+", nchar, ignore.case = TRUE)[[1]]
[1] 5 2
> gsubfn("[bcdfghjklmnpqrstvwxyz]+", "X", "MyString", ignore.case = TRUE)
[1] "XiX"


