thr3ads.net - R help - [R] Frequency of a character in a string [Nov 2016]

Charles C. Berry

2016-Nov-14 17:26 UTC

[R] Frequency of a character in a string

On Mon, 14 Nov 2016, Bert Gunter wrote:
> Yes, but it needs some help, since nchar gives the length of the
> *entire* string; e.g.
>
> ## to count "a" 's  :
>
>> x <-(c("abbababba","bbabbabbaaaba"))
>> nchar(gsub("[^a]","",x))
> [1] 4 6
>
> This is one of about 8 zillion ways to do this in base R if you don't
> want to use a specialized package.
>
> Just for curiosity: Can anyone comment on what is the most efficient
> way to do this using base R pattern matching?
>
Most efficient? There probably is no uniformly most efficient way to do
this, as the timing will depend on the distribution of "a" in the atoms
of any vector as well as the length of the vector.

But here is one way to avoid the regular expression matching:

lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1
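
The "X" padding in the idiom above matters because strsplit() returns no trailing empty piece when a string ends with the separator, so an unpadded split undercounts strings that end in "a". A minimal sketch of the difference, using the example vector from earlier in the thread (the names unpadded and padded are illustrative only, and the sentinel is assumed to be any character other than the one being counted):

```r
x <- c("abbababba", "bbabbabbaaaba")

# Without padding: a match at the very end of a string yields no
# trailing empty piece, so strings ending in "a" are undercounted.
unpadded <- lengths(strsplit(x, "a", fixed = TRUE)) - 1

# With a non-"a" sentinel at both ends, the number of pieces is
# always (number of "a") + 1.
padded <- lengths(strsplit(paste0("X", x, "X"), "a", fixed = TRUE)) - 1

print(unpadded)
print(padded)  # agrees with nchar(gsub("[^a]", "", x))
```

Both example strings end in "a", so the unpadded version comes out one short for each of them.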


Chuck

Marc Schwartz

2016-Nov-14 17:48 UTC

[R] Frequency of a character in a string

> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
> 
> On Mon, 14 Nov 2016, Bert Gunter wrote:
> 
>> Yes, but it needs some help, since nchar gives the length of the
>> *entire* string; e.g.
>> 
>> ## to count "a" 's  :
>> 
>>> x <-(c("abbababba","bbabbabbaaaba"))
>>> nchar(gsub("[^a]","",x))
>> [1] 4 6
>> 
>> This is one of about 8 zillion ways to do this in base R if you don't
>> want to use a specialized package.
>> 
>> Just for curiosity: Can anyone comment on what is the most efficient
>> way to do this using base R pattern matching?
>> 
> 
> Most efficient? There probably is no uniformly most efficient way to do
> this as the timing will depend on the distribution of "a" in the atoms
> of any vector as well as the length of the vector.
> 
> But here is one way to avoid the regular expression matching:
> 
> lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1
> 
> 
> Chuck
> 

Hi,

Both gsub() and strsplit() are using regex based pattern matching internally.
That being said, they are ultimately calling .Internal code, so both are pretty
fast.

For comparison:

## Create a 1,000,000 character vector
set.seed(1)
Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")

> nchar(Vec)
[1] 1000000

## Split the vector into single characters and tabulate
> table(strsplit(Vec, split = "")[[1]])

    a     b     c     d     e     f     g     h     i     j     k     l 
38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621 
    m     n     o     p     q     r     s     t     u     v     w     x 
38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310 
    y     z 
38265 38299 


## Get just the count of "a"
> table(strsplit(Vec, split = "")[[1]])["a"]
    a 
38664 

> nchar(gsub("[^a]", "", Vec))
[1] 38664


## Check performance
> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
   user  system elapsed 
  0.100   0.007   0.107 

> system.time(nchar(gsub("[^a]", "", Vec)))
   user  system elapsed 
  0.270   0.001   0.272 


So, the above would suggest that using strsplit() is somewhat faster than using
gsub(). However, as Chuck notes, in the absence of more exhaustive benchmarking,
the difference may or may not be more generalizable.
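
A single system.time() call can be noisy, so one way to make the comparison above a little more robust is to repeat each timing and take the median. The following is only a sketch under stated assumptions: the helper names count_table, count_gsub and time_it are hypothetical and not from the thread, and the repetition count is arbitrary. The exact counts and timings may differ from those shown above, since results depend on the machine and on the R version (the default sample() algorithm changed in R 3.6.0).

```r
set.seed(1)
Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")

# Hypothetical wrappers around the two approaches compared above
count_table <- function(s) as.integer(table(strsplit(s, split = "")[[1]])["a"])
count_gsub  <- function(s) nchar(gsub("[^a]", "", s))

# Median elapsed time (seconds) over a few repetitions
time_it <- function(f, s, reps = 3) {
  median(replicate(reps, system.time(f(s))[["elapsed"]]))
}

# Both approaches should agree before their speed is compared
stopifnot(count_table(Vec) == count_gsub(Vec))

timings <- c(table = time_it(count_table, Vec),
             gsub  = time_it(count_gsub, Vec))
print(timings)
```

Checking that the two functions return the same count first avoids benchmarking approaches that do not actually compute the same thing.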

Regards,

Marc Schwartz

Charles C. Berry

2016-Nov-14 19:55 UTC

[R] Frequency of a character in a string

On Mon, 14 Nov 2016, Marc Schwartz wrote:
>
>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>
>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>[stuff deleted]
> Hi,
>
> Both gsub() and strsplit() are using regex based pattern matching 
> internally. That being said, they are ultimately calling .Internal code, 
> so both are pretty fast.
>
> For comparison:
>
> ## Create a 1,000,000 character vector
> set.seed(1)
> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>
>> nchar(Vec)
> [1] 1000000
>
> ## Split the vector into single characters and tabulate
>> table(strsplit(Vec, split = "")[[1]])
>
>    a     b     c     d     e     f     g     h     i     j     k     l
> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>    m     n     o     p     q     r     s     t     u     v     w     x
> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>    y     z
> 38265 38299
>
>
> ## Get just the count of "a"
>> table(strsplit(Vec, split = "")[[1]])["a"]
>    a
> 38664
>
>> nchar(gsub("[^a]", "", Vec))
> [1] 38664
>
>
> ## Check performance
>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>   user  system elapsed
>  0.100   0.007   0.107
>
>> system.time(nchar(gsub("[^a]", "", Vec)))
>   user  system elapsed
>  0.270   0.001   0.272
>
>
> So, the above would suggest that using strsplit() is somewhat faster 
> than using gsub(). However, as Chuck notes, in the absence of more 
> exhaustive benchmarking, the difference may or may not be more 
> generalizable.

Whether splitting on fixed strings rather than treating them as
regexes (i.e., `fixed=TRUE') makes a big difference seems to depend on
what you split:

First repeating what Marc did...
> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
   user  system elapsed
  0.132   0.010   0.139
> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
   user  system elapsed
  0.130   0.010   0.138

... fixed=TRUE hardly matters. But the idiom I proposed...
> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.017   0.000   0.018
> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
   user  system elapsed
  0.104   0.000   0.104

... is 5 times faster with fixed=TRUE for this case.

This result matches Marc's count:
> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
[1] 38664

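Beyond speed, fixed=TRUE also affects correctness when the character being counted is a regex metacharacter. A small sketch, assuming a made-up example vector y that is not from the thread: counting "." with fixed=FALSE treats the pattern as "any character" and so gives a different answer.

```r
# Illustrative input only; y is not from the thread
y <- c("a.b.c", "no dots")

# Literal match: counts actual "." characters
count_fixed <- lengths(strsplit(paste0("X", y, "X"), ".", fixed = TRUE)) - 1

# Regex match: "." matches any character, so this does not
# count literal dots
count_regex <- lengths(strsplit(paste0("X", y, "X"), ".", fixed = FALSE)) - 1

print(count_fixed)
print(count_regex)
```

For a plain letter such as "a" the two settings agree, which is why only the timing differed in the runs above.
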
Chuck
