On Mon, 14 Nov 2016, Marc Schwartz wrote:

>
>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>
>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>

[stuff deleted]

> Hi,
>
> Both gsub() and strsplit() are using regex based pattern matching
> internally. That being said, they are ultimately calling .Internal code,
> so both are pretty fast.
>
> For comparison:
>
> ## Create a 1,000,000 character vector
> set.seed(1)
> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>
>> nchar(Vec)
> [1] 1000000
>
> ## Split the vector into single characters and tabulate
>> table(strsplit(Vec, split = "")[[1]])
>
>     a     b     c     d     e     f     g     h     i     j     k     l
> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>     m     n     o     p     q     r     s     t     u     v     w     x
> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>     y     z
> 38265 38299
>
>
> ## Get just the count of "a"
>> table(strsplit(Vec, split = "")[[1]])["a"]
>     a
> 38664
>
>> nchar(gsub("[^a]", "", Vec))
> [1] 38664
>
>
> ## Check performance
>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>    user  system elapsed
>   0.100   0.007   0.107
>
>> system.time(nchar(gsub("[^a]", "", Vec)))
>    user  system elapsed
>   0.270   0.001   0.272
>
>
> So, the above would suggest that using strsplit() is somewhat faster
> than using gsub(). However, as Chuck notes, in the absence of more
> exhaustive benchmarking, the difference may or may not be more
> generalizable.

Whether splitting on fixed strings rather than treating them as
regexes (i.e. `fixed=TRUE') makes a big difference seems to depend on
what you split.

First, repeating what Marc did...

> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
   user  system elapsed
  0.132   0.010   0.139
> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
   user  system elapsed
  0.130   0.010   0.138

... fixed=TRUE hardly matters. But the idiom I proposed...

> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.017   0.000   0.018
> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
   user  system elapsed
  0.104   0.000   0.104
>

... is more than 5 times faster with fixed=TRUE for this case.

This result matches Marc's count:

> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
[1] 38664
>

Chuck
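
The padding with "X" in the strsplit() idiom is easy to overlook, so here is a minimal sketch of why it seems to be needed: strsplit() does not return a trailing empty piece when the string ends with the split character, and it returns no pieces at all for an empty string. The helper names `count_split_padded` and `count_split_naive` are hypothetical (they are not from the thread), and the sketch assumes the character being counted is never "X" itself.

```r
# Hypothetical helpers, for illustration only.
# Assumption: `ch` is a single character other than the padding letter "X".
count_split_padded <- function(x, ch) {
  # Padding guarantees every string starts and ends with a non-matching
  # character, so the number of pieces is always (matches + 1).
  lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1L
}

count_split_naive <- function(x, ch) {
  # Without padding, a match at the very end of the string produces no
  # trailing empty piece, so the count can come out too low.
  lengths(strsplit(x, ch, fixed = TRUE)) - 1L
}

x <- c("banana", "aaa", "xyz", "")

count_split_padded(x, "a")  # 3 3 0 0
count_split_naive(x, "a")   # undercounts "banana" and "aaa"; negative for ""
```

On the benchmark vector the two versions agree whenever the string does not happen to end in "a", which is why the unpadded form can appear correct in a quick test.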
Chuck, Marc, and anyone else who still has interest in this odd little
discussion ...

Yes, and with fixed = TRUE my approach took 1/3 as much time as
Chuck's on a 10-element vector, each element of which is a character
string of length 1e5:

> set.seed(1001)
> x <- sapply(1:10, function(x)paste0(sample(letters,1e5,rep=TRUE),collapse = ""))

> system.time(sum(lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.012   0.000   0.012
> system.time(nchar(gsub("[^a]", "", x,fixed = TRUE)))
   user  system elapsed
  0.004   0.000   0.004

Best,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Nov 14, 2016 at 11:55 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
> [stuff deleted]
Hi,

FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
or strsplit( , fixed=TRUE):

  set.seed(1)
  Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")

  system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
  #    user  system elapsed
  #   0.585   0.000   0.586

  system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
  #    user  system elapsed
  #   0.061   0.000   0.061

  system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
  #    user  system elapsed
  #   0.039   0.000   0.039

  identical(res1, res2)
  # [1] TRUE
  identical(res1, res3)
  # [1] TRUE

The gsub( , fixed=TRUE) solution also uses slightly less memory than the
strsplit( , fixed=TRUE) solution.

Cheers,
H.


On 11/14/2016 11:55 AM, Charles C. Berry wrote:
> [stuff deleted]

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
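
The nchar() difference idea above can be wrapped in a small function. The following is only a sketch: the name `count_fixed` is hypothetical, and the division by `nchar(pattern)` is an added assumption (not something proposed in the thread) so that the same helper also counts non-overlapping occurrences of a multi-character literal.

```r
# Hypothetical helper, for illustration only.
# Counts non-overlapping literal occurrences of `pattern` in each element of x
# by measuring how many characters disappear when the pattern is deleted.
count_fixed <- function(x, pattern) {
  stopifnot(length(pattern) == 1L, nzchar(pattern))
  removed <- nchar(x) - nchar(gsub(pattern, "", x, fixed = TRUE))
  removed / nchar(pattern)
}

count_fixed(c("banana", "aaa", ""), "a")  # 3 3 0
count_fixed("banana", "an")               # 2
```

Because fixed = TRUE treats the pattern as a plain string, this only applies when the thing being counted is a literal, not a character class or other regex.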
(Sheepishly)...

Yes, thank you Hervé. It would have been nice if I had given correct
solutions. Of course, fixed = TRUE could not have worked with the
"[^a]" character class!

Here's what I found with a 10-element vector, each member of which is a
string of length 1e5:

> system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.013   0.000   0.013
> system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))
   user  system elapsed
  0.251   0.000   0.252   ## WAYYYY slower
> system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))
   user  system elapsed
  0.007   0.000   0.007   ## twice as fast

Clearly and unsurprisingly, the message is to avoid fixed = FALSE when a
fixed pattern will do; after that, it seems mostly to be: who cares?!

Cheers,
Bert


On Mon, Nov 14, 2016 at 12:26 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> [stuff deleted]
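
To make the fixed = TRUE pitfall acknowledged above concrete, here is a small sketch on a toy string (the variable names are illustrative, not from the thread). With fixed = TRUE the pattern "[^a]" is treated as the literal four-character text, not as a character class, so on ordinary input nothing is removed and nchar() simply returns the full string length.

```r
# Toy input that happens to contain the literal text "[^a]".
s <- "banana[^a]"

# fixed = TRUE: only the literal substring "[^a]" is removed.
literal_result <- gsub("[^a]", "", s, fixed = TRUE)   # "banana"

# Default (regex): every character that is not "a" is removed.
regex_result <- gsub("[^a]", "", s)                   # "aaaa"

# On a string without the literal text, fixed = TRUE changes nothing,
# so this is the string length (6), not the number of "a"s (3).
not_a_count <- nchar(gsub("[^a]", "", "banana", fixed = TRUE))
```

That would also explain why the fixed = TRUE version of the regex call looked so fast: it was doing almost no work.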
Here is another variant, v3, and a change to your first example so it
returns the same value as your second example.

> set.seed(1001)
> x <- sapply(1:100,function(x)paste0(sample(letters,rpois(1,1e5),rep=TRUE),collapse = ""))
> system.time(v1 <- lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1)
   user  system elapsed
   0.47    0.00    0.49
> system.time(v2 <- nchar(gsub("[^a]", "", x)))
   user  system elapsed
   2.53    0.00    2.53
> system.time(v3 <- nchar(x) - nchar(gsub("a", "", x, fixed=TRUE)))
   user  system elapsed
   0.08    0.00    0.08
>
> all.equal(v1,v2)
[1] TRUE
> all.equal(v1,v3)
[1] TRUE


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Nov 14, 2016 at 12:23 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> [stuff deleted]
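
Since one of the earlier timings turned out to be measuring an incorrect solution, it may help to check that all variants agree before timing them. The following is a rough sketch of such a harness, not a rigorous benchmark: the list name `variants`, the reduced data size, and the repetition count of 50 are all arbitrary assumptions chosen so the example runs quickly.

```r
set.seed(1001)

# Smaller test data than in the thread, so the sketch runs in a moment.
x <- vapply(
  1:20,
  function(i) paste0(sample(letters, rpois(1, 1e3), replace = TRUE), collapse = ""),
  character(1)
)

# The three per-element counting approaches discussed in the thread.
variants <- list(
  split_fixed = function(x) lengths(strsplit(paste0("X", x, "X"), "a", fixed = TRUE)) - 1L,
  gsub_regex  = function(x) nchar(gsub("[^a]", "", x)),
  gsub_fixed  = function(x) nchar(x) - nchar(gsub("a", "", x, fixed = TRUE))
)

# Step 1: correctness. Every variant must match the first one.
results <- lapply(variants, function(f) f(x))
agree <- vapply(results, function(r) isTRUE(all.equal(r, results[[1]])), logical(1))
stopifnot(all(agree))

# Step 2: timing. Repeat each call so that elapsed time is measurable.
timings <- vapply(
  variants,
  function(f) system.time(for (i in 1:50) f(x))[["elapsed"]],
  numeric(1)
)
print(timings)
```

Timings from a loop like this will vary by machine and R version, so they are best read as a rough ordering rather than exact ratios.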