On 03/12/2022 07:21, Bert Gunter wrote:
> Perhaps it is worth pointing out that looping constructs like lapply() can
> be avoided and the procedure vectorized by mimicking Martin Morgan's
> solution:
>
> ## s is the string to be searched.
> diff(c(0,grep('b',strsplit(s,'')[[1]])))
>
> However, Martin's solution is simpler and likely even faster as the regex
> engine is unneeded:
>
> diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
>
> This seems much preferable to me.

Of all the proposed solutions, Andrew Hart's solution seems the most efficient:

  big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

  system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
  #    user  system elapsed
  #   0.736   0.028   0.764

  system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
  #    user  system elapsed
  #   2.100   0.356   2.455

The bigger the string, the bigger the gap in performance.

Also, the bigger the average gap between 2 successive b's, the bigger the gap in performance.

Finally: always use fixed=TRUE in strsplit() if you don't need to use the regex engine.

Cheers,
H.

> -- Bert
>
> On Sat, Dec 3, 2022 at 12:49 AM Rui Barradas <ruipbarradas at sapo.pt> wrote:
>
>> At 17:18 on 02/12/2022, Evan Cooch wrote:
>>> Was wondering if there is an 'efficient/elegant' way to do the following
>>> (without tidyverse). Take a string
>>>
>>> abaaabbaaaaabaaab
>>>
>>> It's easy enough to count the number of times the character 'b' shows up
>>> in the string, but...what I'm looking for is outputting the 'intervals'
>>> between occurrences of 'b' (starting the counter at the beginning of the
>>> string). So, for the preceding example, 'b' shows up in positions
>>>
>>> 2, 6, 7, 13, 17
>>>
>>> So, the interval data would be: 2, 4, 1, 6, 4
>>>
>>> My main approach has been to simply output positions (say, something
>>> like unlist(gregexpr('b', target_string))), and 'do the math' between
>>> successive positions. Can anyone suggest a more elegant approach?
>>>
>>> Thanks in advance...
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> Hello,
>>
>> I don't find your solution inelegant; it's even easy to write it as a
>> one-line function.
>>
>> char_interval <- function(x, s) {
>>   lapply(gregexpr(x, s), \(y) c(head(y, 1), diff(y)))
>> }
>>
>> target_string <- "abaaabbaaaaabaaab"
>> char_interval('b', target_string)
>> #> [[1]]
>> #> [1] 2 4 1 6 4
>>
>> Hope this helps,
>>
>> Rui Barradas

--
Hervé Pagès
Bioconductor Core Team
hpages.on.github at gmail.com
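The two approaches timed above can be checked against each other on the short string from the original question. The following is only a sketch: the variable names (by_position, by_split, s2 and so on) are made up for illustration and do not come from any of the posts. It also shows one caveat of the split-on-"b" approach that the benchmark string hides because it ends in "b": when the string does not end in "b", the trailing piece yields one extra value that is not an interval between two b's.

```r
s <- "abaaabbaaaaabaaab"

# Position-based: split into single characters, locate "b",
# then difference the positions (with 0 prepended as the starting point).
by_position <- diff(c(0L, which(strsplit(s, "", fixed = TRUE)[[1]] == "b")))

# Split-based: split on "b" and count the characters in each piece.
by_split <- nchar(strsplit(s, split = "b", fixed = TRUE)[[1]]) + 1L

by_position                       # 2 4 1 6 4
identical(by_position, by_split)  # TRUE for this string

# Caveat: a string that does NOT end in "b" (hypothetical example).
s2 <- "abaaabaa"
pos2   <- diff(c(0L, which(strsplit(s2, "", fixed = TRUE)[[1]] == "b")))
split2 <- nchar(strsplit(s2, split = "b", fixed = TRUE)[[1]]) + 1L
pos2    # 2 4
split2  # 2 4 3  (the last value comes from the trailing "aa")
```

If the split-based version is used on arbitrary input, the last element would need to be dropped whenever the string does not end in the target character.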
@vi@e@gross m@iii@g oii gm@ii@com
2022-Dec-04 00:21 UTC
[R] interval between specific characters in a string...
This may be a fairly dumb and often-asked question about functions like strsplit() that return a list of things, often a list of ONE thing that may be another list or a vector and needs to be made into something simpler.

The examples in this thread have used various methods to convert the result to a vector, but why is there no built-in option for such a function to simplify the result, either when possible or always?

Sure, you can subset it with "[[1]]" or use unlist() to coerce it back to a vector. But when you have a very common idiom, and many people waste lots of time figuring out that they had a LIST containing a single vector, maybe it would have made sense to have either a sister function like strsplit_v() that returns what is actually wanted, or to allow strsplit(whatever, output="vector") or something giving the same result.

Yes, I understand that when there is a workaround, adding this just complicates the base, but there could be a package that consistently does things like this to make such functions easier to use.

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Hervé Pagès
Sent: Saturday, December 3, 2022 6:50 PM
To: Bert Gunter <bgunter.4567 at gmail.com>; Rui Barradas <ruipbarradas at sapo.pt>
Cc: r-help at r-project.org; Evan Cooch <evan.cooch at gmail.com>
Subject: Re: [R] interval between specific characters in a string...
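For context on the question above: strsplit() is vectorised over its first argument, so it returns one list element per input string, and a list is the only shape that works when the inputs split into different numbers of pieces. A minimal sketch follows. Note that strsplit_v() here is purely hypothetical (it is the name floated in the message, not an existing base R function), and the wrapper is just one way such a helper might look.

```r
# One list element per input string; the pieces can differ in number.
x <- c("a b c", "d e")
pieces <- strsplit(x, " ", fixed = TRUE)
lengths(pieces)  # 3 2

# Hypothetical convenience wrapper for the single-string case.
# It refuses vector input rather than silently dropping elements.
strsplit_v <- function(x, split, ...) {
  stopifnot(is.character(x), length(x) == 1L)
  strsplit(x, split, ...)[[1]]
}

strsplit_v("a b c", " ", fixed = TRUE)  # "a" "b" "c"

# unlist() flattens the list; as.vector() does not, because a list
# already counts as a vector in R.
unlist(pieces)              # "a" "b" "c" "d" "e"
is.list(as.vector(pieces))  # TRUE
```

The length check is a design choice: with several input strings, "[[1]]" would quietly keep only the first result, which is arguably worse than the list it replaces.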
Thanks. Very informative. I certainly missed this.

-- Bert

On Sat, Dec 3, 2022 at 3:49 PM Hervé Pagès <hpages.on.github at gmail.com> wrote:
Hadley Wickham
2022-Dec-04 08:25 UTC
[R] interval between specific characters in a string...
On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès <hpages.on.github at gmail.com> wrote:
>
> Of all the proposed solutions, Andrew Hart's solution seems the most
> efficient:
>
>    big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
>
>    system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>    #    user  system elapsed
>    #   0.736   0.028   0.764
>
>    system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]]
> == "b"))))
>    #    user  system elapsed
>    #   2.100   0.356   2.455
>
> The bigger the string, the bigger the gap in performance.
>
> Also, the bigger the average gap between 2 successive b's, the bigger
> the gap in performance.
>
> Finally: always use fixed=TRUE in strsplit() if you don't need to use
> the regex engine.

You can do a bit better if you are willing to use stringr:

library(stringr)
big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#>    user  system elapsed
#>   0.126   0.002   0.128

system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
#>    user  system elapsed
#>   0.103   0.004   0.107

(And my timings also suggest that it's time for Hervé to get a new computer :P)

It feels like an approach that uses locations should be faster since you wouldn't have to construct all the intermediate strings.

system.time(pos <- str_locate_all(big_string, fixed("b"))[[1]][,1])
#>    user  system elapsed
#>   0.075   0.004   0.080

# I suspect this could be optimised with a little thought making this approach
# faster overall
system.time(c(0, diff(pos)))
#>    user  system elapsed
#>   0.022   0.006   0.027

Hadley

--
http://hadley.nz
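For readers who want to try the location-based idea without stringr, here is a base R sketch. It is not taken from the thread: b_positions() is a hypothetical helper name, and the charToRaw() variant assumes every character in the string is a single byte (plain ASCII), which is an assumption about the input rather than something the posts state. No timings are claimed for either variant. One detail worth noting: to reproduce the intervals 2, 4, 1, 6, 4 from the original question, the 0 goes inside diff(), as in diff(c(0, pos)); c(0, diff(pos)) has the same length but starts with 0 instead of the first position.

```r
s <- "abaaabbaaaaabaaab"

# Variant 1: literal (non-regex) search for match positions.
# gregexpr() returns -1 when there is no match, so map that to an empty vector.
b_positions <- function(s, ch = "b") {
  pos <- as.integer(gregexpr(ch, s, fixed = TRUE)[[1]])
  if (identical(pos, -1L)) integer(0) else pos
}

pos <- b_positions(s)
intervals <- diff(c(0L, pos))
intervals        # 2 4 1 6 4

# The zero placed outside diff() gives a different first element.
c(0L, diff(pos)) # 0 4 1 6 4

# Variant 2: compare raw bytes directly, with no intermediate strings.
# Only valid when each character occupies exactly one byte (ASCII input).
intervals_raw <- diff(c(0L, which(charToRaw(s) == charToRaw("b"))))
intervals_raw    # 2 4 1 6 4
```

Both variants avoid building one string per character or per piece, which is the cost the location-based approach is meant to sidestep.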