thr3ads.net - R help - [R] interval between specific characters in a string... [Dec 2022]


Hervé Pagès

2022-Dec-03 23:49 UTC

[R] interval between specific characters in a string...

On 03/12/2022 07:21, Bert Gunter wrote:
> Perhaps it is worth pointing out that looping constructs like lapply() can
> be avoided and the procedure vectorized by mimicking Martin Morgan's
> solution:
>
> ## s is the string to be searched.
> diff(c(0,grep('b',strsplit(s,'')[[1]])))
>
> However, Martin's solution is simpler and likely even faster as the regex
> engine is unneeded:
>
> diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
>
> This seems much preferable to me.

Of all the proposed solutions, Andrew Hart's solution seems the most
efficient:

   big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

   system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
   #    user  system elapsed
   #   0.736   0.028   0.764

   system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
   #    user  system elapsed
   #   2.100   0.356   2.455

The bigger the string, the bigger the gap in performance.

Also, the bigger the average gap between 2 successive b's, the bigger 
the gap in performance.
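
As a quick sanity check, the two timed expressions can be run on the short string from the original question. This is only a sketch: the names by_split, by_which and the string no_trailing_b are made up for illustration. It also hints at a caveat of the split-based approach: when the string does not end in 'b', the trailing run of other characters comes back as one extra piece.

```r
s <- "abaaabbaaaaabaaab"

# Split-based approach: split on "b" and measure each piece.
by_split <- nchar(strsplit(s, split = "b", fixed = TRUE)[[1]]) + 1

# Position-based approach: locate every "b", then difference from 0.
by_which <- diff(c(0, which(strsplit(s, "", fixed = TRUE)[[1]] == "b")))

print(by_split)   # 2 4 1 6 4
print(by_which)   # 2 4 1 6 4

# Caveat (illustrative string, not from the thread): no "b" at the end.
no_trailing_b <- "abaa"
split_result <- nchar(strsplit(no_trailing_b, "b", fixed = TRUE)[[1]]) + 1
which_result <- diff(c(0, which(strsplit(no_trailing_b, "", fixed = TRUE)[[1]] == "b")))
print(split_result)  # 2 3  (the final "aa" is counted as an interval)
print(which_result)  # 2
```

If the input may end with characters other than the target, the last element of the split-based result would presumably need to be dropped.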

Finally: always use fixed=TRUE in strsplit() if you don't need to use 
the regex engine.
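
Apart from speed, fixed=TRUE also protects against split strings that happen to be regex metacharacters. A small illustration with a made-up input string (x, regex_pieces and fixed_pieces are names chosen for this sketch only):

```r
x <- "a.b.c"

# Default: split is a regular expression, and "." matches any character,
# so every character is treated as a separator.
regex_pieces <- strsplit(x, ".")[[1]]

# fixed = TRUE: split is taken literally, so only the dots separate pieces.
fixed_pieces <- strsplit(x, ".", fixed = TRUE)[[1]]

print(regex_pieces)  # five empty strings
print(fixed_pieces)  # "a" "b" "c"
```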

Cheers,

H.

> -- Bert
>
> On Sat, Dec 3, 2022 at 12:49 AM Rui Barradas <ruipbarradas at sapo.pt> wrote:
>
>> At 17:18 on 02/12/2022, Evan Cooch wrote:
>>> Was wondering if there is an 'efficient/elegant' way to do the following
>>> (without tidyverse). Take a string
>>>
>>> abaaabbaaaaabaaab
>>>
>>> It's easy enough to count the number of times the character 'b' shows up
>>> in the string, but...what I'm looking for is outputting the 'intervals'
>>> between occurrences of 'b' (starting the counter at the beginning of the
>>> string). So, for the preceding example, 'b' shows up in positions
>>>
>>> 2, 6, 7, 13, 17
>>>
>>> So, the interval data would be: 2, 4, 1, 6, 4
>>>
>>> My main approach has been to simply output positions (say, something
>>> like unlist(gregexpr('b', target_string))), and 'do the math' between
>>> successive positions. Can anyone suggest a more elegant approach?
>>>
>>> Thanks in advance...
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> Hello,
>>
>> I don't find your solution inelegant, it's even easy to write it as a
>> one-line function.
>>
>>
>> char_interval <- function(x, s) {
>>     lapply(gregexpr(x, s), \(y) c(head(y, 1), diff(y)))
>> }
>>
>> target_string <-"abaaabbaaaaabaaab"
>> char_interval('b', target_string)
>> #> [[1]]
>> #> [1] 2 4 1 6 4
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.github at gmail.com

@vi@e@gross m@iii@g oii gm@ii@com

2022-Dec-04 00:21 UTC


This may be a fairly dumb and often-asked question about functions like
strsplit() that return a list of things, often a list of ONE thing that may be
another list or a vector and needs to be made into something simpler.

The examples in this thread have used various methods to convert the result to
a vector, but why is there no built-in option for such a function to simplify
the result, either when possible or always?

Sure, you can subset it with "[[1]]" or use unlist() or as.vector() to
coerce it back to a vector. But given how common this idiom is, and how much
time many people waste debugging before they figure out they had a LIST
containing a single vector, maybe it would have made sense to have either a
sister function like strsplit_v() that returns what is actually wanted, or to
allow strsplit(whatever, output="vector") or something giving the same
result.

Yes, I understand that when there is a workaround, adding more just complicates
the base, but there could be a package that consistently does things like this
to make the use of such functions easier.
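
For illustration only, the wrapper suggested above might look like the sketch below. strsplit_v() is hypothetical and not part of base R; the sketch assumes it should accept exactly one string. The first lines show why strsplit() returns a list at all: it is vectorised over its input and the pieces can differ in number per string.

```r
# strsplit() is vectorised over its first argument, so it returns one
# list element per input string, even when there is only one input.
pieces <- strsplit(c("a-b", "c-d-e"), "-", fixed = TRUE)
print(lengths(pieces))  # 2 3

# Note: as.vector() on its own leaves the result a list; unlist() or
# [[1]] is what actually yields a character vector.
print(is.list(as.vector(pieces)))  # TRUE

# Hypothetical helper (not part of base R): accept exactly one string
# and return the character vector directly.
strsplit_v <- function(x, split, ...) {
  stopifnot(is.character(x), length(x) == 1L)
  strsplit(x, split, ...)[[1]]
}

print(strsplit_v("abaaab", "b", fixed = TRUE))  # "a" "aaa"
```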


Bert Gunter

2022-Dec-04 02:33 UTC


Thanks. Very informative.
I certainly missed this.

-- Bert


Hadley Wickham

2022-Dec-04 08:25 UTC


On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès <hpages.on.github at gmail.com> wrote:
>
> Of all the proposed solutions, Andrew Hart's solution seems the most
> efficient:
>
>    big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
>
>    system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>    #    user  system elapsed
>    #   0.736   0.028   0.764
>
>    system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
>    #    user  system elapsed
>    #   2.100   0.356   2.455
>
> The bigger the string, the bigger the gap in performance.
>
> Also, the bigger the average gap between 2 successive b's, the bigger
> the gap in performance.
>
> Finally: always use fixed=TRUE in strsplit() if you don't need to use
> the regex engine.

You can do a bit better if you are willing to use stringr:

library(stringr)
big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#>    user  system elapsed
#>   0.126   0.002   0.128

system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
#>    user  system elapsed
#>   0.103   0.004   0.107

(And my timings also suggest that it's time for Hervé to get a new computer :P)

It feels like an approach that uses locations should be faster since
you wouldn't have to construct all the intermediate strings.

system.time(pos <- str_locate_all(big_string, fixed("b"))[[1]][,1])
#>    user  system elapsed
#>   0.075   0.004   0.080
# I suspect this could be optimised with a little thought making this approach
# faster overall
system.time(c(0, diff(pos)))
#>    user  system elapsed
#>   0.022   0.006   0.027
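
A base-R sketch of the same location-based idea, checked against the short string from the original question. It is not the code timed above: gregexpr() with fixed = TRUE stands in for str_locate_all() so that it runs without stringr, and intervals_from_positions() is a hypothetical helper name. The last lines show that, to count from the start of the string, the 0 probably needs to be prepended before differencing rather than after.

```r
# Hypothetical helper: intervals between successive occurrences of `ch`
# in the single string `s`, counting from the start of the string.
intervals_from_positions <- function(s, ch) {
  pos <- as.integer(gregexpr(ch, s, fixed = TRUE)[[1]])
  pos <- pos[pos > 0L]        # gregexpr() reports -1 when nothing matches
  diff(c(0L, pos))
}

print(intervals_from_positions("abaaabbaaaaabaaab", "b"))  # 2 4 1 6 4
print(intervals_from_positions("aaaa", "b"))               # integer(0)

# For comparison: prepending 0 after differencing loses the first interval.
pos <- c(2L, 6L, 7L, 13L, 17L)
print(c(0L, diff(pos)))   # 0 4 1 6 4
print(diff(c(0L, pos)))   # 2 4 1 6 4
```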

Hadley

-- 
http://hadley.nz
