On 11/29/2017 04:15 PM, T?th D?nes wrote:> Hi, > > A benchmarking study with an additional (data.table-based) solution.I don't think speed is the right benchmark (I do agree that correctness is!). For the R-help list, maybe something about least specialized R knowledge required would be appropriate? I'd say there were some 'hard' solutions -- Michael (deep understanding of Bioconductor and IRanges), Toth (deep understanding of data.table), Jim (at least for me moderate understanding of dplyr,especially the .$ notation; a simpler dplyr answer might have moved this response out of the 'difficult' category, especially given the familiarity of the OP with dplyr). I'd vote for Bill's as requiring the least specialized knowledge of R (though the +/- 1 indexing is an easy thing to get wrong). A different criteria might be reuse across analysis scenarios. Bill seems to win here again, since the principles are very general and at least moderately efficient (both Bert and Martin's solutions are essentially R-level iterations and have poor scalability, as demonstrated in the microbenchmarks; Bill's is mostly vectorized). Certainly data.table, dplyr, and IRanges are extremely useful within the confines of the problem domains they address. Martin> Enjoy! ;) > > Cheers, > Denes > > > -------------------------- > > > ## packages ########################## > > library(dplyr) > library(data.table) > library(IRanges) > library(microbenchmark) > > ## prepare example dataset ########### > > ## use Bert's example, with 2000 stations instead of 2 > d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L), > ??????????????????? from = as.integer(c(60,61,71,72,76,60,65,82,83)), > ??????????????????? to = as.integer(c(60,70,71,76,83,64, 81, 82,83)), > ??????????????????? record = c("A","B","C","B","D","B","B","D","E"), > ??????????????????? stringsAsFactors = FALSE) > stations <- rle(d_df$station) > stations$value <- gsub( > ? " ", "0", > ? paste0("station", format(1:length(stations$value), width = 6))) > d_df$station <- rep(stations$value, stations$lengths) > > ## prepare tibble and data.table versions > d_tbl <- as_tibble(d_df) > d_dt <- as.data.table(d_df) > > ## solutions ########################## > > ## Bert - by > fun_bert <- function(d) { > ? out <- by( > ??? d, d$station, function(x) with(x, { > ????? i <- to - from +1 > ????? data.frame(record =rep(record,i), > ???????????????? year =sequence(i) -1 + rep(from,i), > ???????????????? stringsAsFactors = FALSE) > ??? })) > ? data.frame(station = rep(names(out), sapply(out,nrow)), > ???????????? do.call(rbind,out), > ???????????? row.names = NULL, > ???????????? stringsAsFactors = FALSE) > } > > ## Bill - transform > fun_bill <- function(d) { > ? i <- rep(seq_len(nrow(d)), d$to-d$from+1) > ? j <- sequence(d$to-d$from+1) > ? transform(d[i,], year=from+j-1, from=NULL, to=NULL) > } > > ## Michael - IRanges > fun_michael <- function(d) { > ? df <- with(d, DataFrame(station, record, year=IRanges(from, to))) > ? expand(df, "year") > } > > ## Jim - dplyr > fun_jim <- function(d) { > ? d %>% > ??? rowwise() %>% > ??? do(tibble(station = .$station, > ????????????? record = .$record, > ????????????? year = seq(.$from, .$to)) > ??? ) > } > > ## Martin - Map > fun_martin <- function(d) { > ? d$year <- with(d, Map(seq, from, to)) > ? res0 <- with(d, Map(data.frame, > ????????????????????? station=station, > ????????????????????? record=record, > ????????????????????? year=year, > ????????????????????? MoreArgs = list(stringsAsFactors = FALSE))) > ? do.call(rbind, unname(res0)) > } > > ## Denes - simple data.table > fun_denes <- function(d) { > ? out <- d[, .(year = from:to), by = .(station, from, record)] > ? out[, from := NULL] > } > > ## Check equality ################################ > all.equal(fun_bill(d_df), fun_bert(d_df), > ????????? check.attributes = FALSE) > all.equal(fun_bill(d_df), fun_martin(d_df), > ????????? check.attributes = FALSE) > all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)), > ????????? check.attributes = FALSE) > all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)), > ????????? check.attributes = FALSE) > # Be prepared: this solution is super slow > all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)), > ????????? check.attributes = FALSE) > > ## Benchmark ##################################### > > ## Martin > print(system.time(fun_martin(d_df))) > > ## Bert > print(system.time(fun_bert(d_df))) > > ## Top 3 > print( > ? microbenchmark( > ??? fun_bill(d_df), > ??? fun_michael(d_df), > ??? fun_denes(d_dt), > ??? times = 100L > ? ) > ) > > > ------------------------- > > On 11/28/2017 06:49 PM, Michael Lawrence wrote: >> Or with the Bioconductor IRanges package: >> >> df <- with(input, DataFrame(station, year=IRanges(from, to), record)) >> expand(df, "year") >> >> DataFrame with 24 rows and 3 columns >> ???????? station???? year????? record >> ???? <character> <integer> <character> >> 1?????? 07EA001????? 1960???????? QMS >> 2?????? 07EA001????? 1961???????? QMC >> 3?????? 07EA001????? 1962???????? QMC >> 4?????? 07EA001????? 1963???????? QMC >> 5?????? 07EA001????? 1964???????? QMC >> ...???????? ...?????? ...???????? ... >> 20????? 07EA001????? 1979???????? QRC >> 21????? 07EA001????? 1980???????? QRC >> 22????? 07EA001????? 1981???????? QRC >> 23????? 07EA001????? 1982???????? QRC >> 24????? 07EA001????? 1983???????? QRC >> >> If you tell the computer more about your data, it can do more things for >> you. >> >> Michael >> >> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan < >> martin.morgan at roswellpark.org> wrote: >> >>> On 11/26/2017 08:42 PM, jim holtman wrote: >>> >>>> try this: >>>> >>>> ########################################## >>>> >>>> library(dplyr) >>>> >>>> input <- tribble( >>>> ??? ~station, ~from, ~to, ~record, >>>> ?? "07EA001" ,??? 1960? ,? 1960? , "QMS", >>>> ?? "07EA001"? ,?? 1961 ,?? 1970? , "QMC", >>>> ?? "07EA001" ,??? 1971? ,? 1971? , "QMM", >>>> ?? "07EA001" ,??? 1972? ,? 1976? , "QMC", >>>> ?? "07EA001" ,??? 1977? ,? 1983? , "QRC" >>>> ) >>>> >>>> result <- input %>% >>>> ??? rowwise() %>% >>>> ??? do(tibble(station = .$station, >>>> ????????????? year = seq(.$from, .$to), >>>> ????????????? record = .$record) >>>> ??? ) >>>> >>>> ########################### >>>> >>> >>> In a bit more 'base R' mode I did >>> >>> ?? input$year <- with(input, Map(seq, from, to)) >>> ?? res0 <- with(input, Map(data.frame, station=station, year=year, >>> ?????? record=record)) >>> ??? as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3 >>> >>> resulting in >>> >>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3 >>> ??? station? year record >>> ???? <fctr> <int> <fctr> >>> ? 1 07EA001? 1960??? QMS >>> ? 2 07EA001? 1961??? QMC >>> ? 3 07EA001? 1962??? QMC >>> ? 4 07EA001? 1963??? QMC >>> ? 5 07EA001? 1964??? QMC >>> ? 6 07EA001? 1965??? QMC >>> ? 7 07EA001? 1966??? QMC >>> ? 8 07EA001? 1967??? QMC >>> ? 9 07EA001? 1968??? QMC >>> 10 07EA001? 1969??? QMC >>> # ... with 14 more rows >>> >>> I though I should have been able to use `tibble` in the second step, but >>> that leads to a (cryptic) error >>> >>>> res0 <- with(input, Map(tibble, station=station, year=year, >>> record=record))Error in captureDots(strict = `__quosured`) : >>> ?? the argument has already been evaluated >>> >>> The 'station' and 'record' columns are factors, so different from the >>> original input, but this seems the appropriate data type for theses >>> columns. >>> >>> It's interesting to compare the 'specialized' knowledge needed for each >>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(), maybe >>> rbind() and Map() for base R. >>> >>> Martin >>> >>> >>> >>>> >>>> >>>> Jim Holtman >>>> Data Munger Guru >>>> >>>> What is the problem that you are trying to solve? >>>> Tell me what you want to do, not how you want to do it. >>>> >>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at gmail.com> >>>> wrote: >>>> >>>> To David W.'s point about lack of a suitable reprex ("reproducible >>>>> example"), Bill's solution seems to be for only one station. >>>>> >>>>> Here is a reprex and modification that I think does what was requested >>>>> for >>>>> multiple stations, again using base R and data frames, not dplyr and >>>>> tibbles. >>>>> >>>>> First the reprex with **two** stations: >>>>> >>>>> d <- data.frame( station = rep(c("one","two"),c(5,4)), >>>>>> >>>>> ???????????????? from = c(60,61,71,72,76,60,65,82,83), >>>>> ????????????????? to = c(60,70,71,76,83,64, 81, 82,83), >>>>> ????????????????? record = c("A","B","C","B","D","B","B","D","E")) >>>>> >>>>> d >>>>>> >>>>> ??? station from to record >>>>> 1???? one?? 60 60????? A >>>>> 2???? one?? 61 70????? B >>>>> 3???? one?? 71 71????? C >>>>> 4???? one?? 72 76????? B >>>>> 5???? one?? 76 83????? D >>>>> 6???? two?? 60 64????? B >>>>> 7???? two?? 65 81????? B >>>>> 8???? two?? 82 82????? D >>>>> 9???? two?? 83 83????? E >>>>> >>>>> ## Now the conversion code using base R, especially by(): >>>>> >>>>> out <- by(d, d$station, function(x) with(x, { >>>>>> >>>>> +??? i <- to - from +1 >>>>> +??? data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD >>>>> =rep(record,i)) >>>>> + })) >>>>> >>>>> >>>>> out <- data.frame(station >>>>>> >>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names = NULL) >>>>> >>>>> >>>>> out >>>>>> >>>>> ???? station YEAR RECORD >>>>> 1????? one?? 60????? A >>>>> 2????? one?? 61????? B >>>>> 3????? one?? 62????? B >>>>> 4????? one?? 63????? B >>>>> 5????? one?? 64????? B >>>>> 6????? one?? 65????? B >>>>> 7????? one?? 66????? B >>>>> 8????? one?? 67????? B >>>>> 9????? one?? 68????? B >>>>> 10???? one?? 69????? B >>>>> 11???? one?? 70????? B >>>>> 12???? one?? 71????? C >>>>> 13???? one?? 72????? B >>>>> 14???? one?? 73????? B >>>>> 15???? one?? 74????? B >>>>> 16???? one?? 75????? B >>>>> 17???? one?? 76????? B >>>>> 18???? one?? 76????? D >>>>> 19???? one?? 77????? D >>>>> 20???? one?? 78????? D >>>>> 21???? one?? 79????? D >>>>> 22???? one?? 80????? D >>>>> 23???? one?? 81????? D >>>>> 24???? one?? 82????? D >>>>> 25???? one?? 83????? D >>>>> 26???? two?? 60????? B >>>>> 27???? two?? 61????? B >>>>> 28???? two?? 62????? B >>>>> 29???? two?? 63????? B >>>>> 30???? two?? 64????? B >>>>> 31???? two?? 65????? B >>>>> 32???? two?? 66????? B >>>>> 33???? two?? 67????? B >>>>> 34???? two?? 68????? B >>>>> 35???? two?? 69????? B >>>>> 36???? two?? 70????? B >>>>> 37???? two?? 71????? B >>>>> 38???? two?? 72????? B >>>>> 39???? two?? 73????? B >>>>> 40???? two?? 74????? B >>>>> 41???? two?? 75????? B >>>>> 42???? two?? 76????? B >>>>> 43???? two?? 77????? B >>>>> 44???? two?? 78????? B >>>>> 45???? two?? 79????? B >>>>> 46???? two?? 80????? B >>>>> 47???? two?? 81????? B >>>>> 48???? two?? 82????? D >>>>> 49???? two?? 83????? E >>>>> >>>>> Cheers, >>>>> Bert >>>>> >>>>> >>>>> >>>>> >>>>> Bert Gunter >>>>> >>>>> "The trouble with having an open mind is that people keep coming along >>>>> and >>>>> sticking things into it." >>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) >>>>> >>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help < >>>>> r-help at r-project.org> wrote: >>>>> >>>>> dplyr may have something for this, but in base R I think the following >>>>>> >>>>> does >>>>> >>>>>> what you want.? I've shortened the name of your data set to 'd'. >>>>>> >>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1) >>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1) >>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL, YEAR_TO=NULL) >>>>>> >>>>>> >>>>>> Bill Dunlap >>>>>> TIBCO Software >>>>>> wdunlap tibco.com >>>>>> >>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC) < >>>>>> david.hutchinson at canada.ca> wrote: >>>>>> >>>>>> I have a returned tibble of station operational record similar to the >>>>>>> following: >>>>>>> >>>>>>> data.collection >>>>>>>> >>>>>>> # A tibble: 5 x 4 >>>>>>> ??? STATION_NUMBER YEAR_FROM YEAR_TO RECORD >>>>>>> ???????????? <chr>???? <int>?? <int>? <chr> >>>>>>> 1??????? 07EA001????? 1960??? 1960??? QMS >>>>>>> 2??????? 07EA001????? 1961??? 1970??? QMC >>>>>>> 3??????? 07EA001????? 1971??? 1971??? QMM >>>>>>> 4??????? 07EA001????? 1972??? 1976??? QMC >>>>>>> 5??????? 07EA001????? 1977??? 1983??? QRC >>>>>>> >>>>>>> I would like to reshape this to one operational record (row) per >>>>>>> year >>>>>>> >>>>>> per >>>>> >>>>>> station. Something like: >>>>>>> >>>>>>> 07EA001????????????? 1960????? QMS >>>>>>> 07EA001????????????? 1961????? QMC >>>>>>> 07EA001????????????? 1962????? QMC >>>>>>> 07EA001????????????? 1963????? QMC >>>>>>> ... >>>>>>> 07EA001????????????? 1971????? QMM >>>>>>> >>>>>>> Can this be done in dplyr easily? >>>>>>> >>>>>>> Thanks in advance, >>>>>>> >>>>>>> David >>>>>>> >>>>>>> ????????? [[alternative HTML version deleted]] >>>>>>> >>>>>>> ______________________________________________ >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>>> posting-guide.html >>>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>>> >>>>>>> >>>>>> ????????? [[alternative HTML version deleted]] >>>>>> >>>>>> ______________________________________________ >>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>> posting-guide.html >>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>> >>>>>> >>>>> ????????? [[alternative HTML version deleted]] >>>>> >>>>> ______________________________________________ >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>> posting-guide.html >>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>>>> >>>> ???????? [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide http://www.R-project.org/posti >>>> ng-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >>>> >>> >>> This email message may contain legally privileged and/or...{{dropped:2}} >>> >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posti >>> ng-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >> >> ????[[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >This email message may contain legally privileged and/or...{{dropped:2}}
Hi Martin, On 11/29/2017 10:46 PM, Martin Morgan wrote:> On 11/29/2017 04:15 PM, T?th D?nes wrote: >> Hi, >> >> A benchmarking study with an additional (data.table-based) solution. > > I don't think speed is the right benchmark (I do agree that correctness > is!).Well, agree, and sorry for the wording. It was really just an exercise and not a full evaluation of the approaches. When I read the avalanche of solutions neither of which mentioning data.table (my first choice for data.frame-manipulations), I became curious how a one-liner data.table code performs against the other solutions in terms of speed and readability. Second, I quite often have the feeling that dplyr is extremely overused among novice (and sometimes even experienced) R users nowadays. This is unfortunate, as the present example also illustrates. Regards, Denes> > For the R-help list, maybe something about least specialized R knowledge > required would be appropriate? I'd say there were some 'hard' solutions > -- Michael (deep understanding of Bioconductor and IRanges), Toth (deep > understanding of data.table), Jim (at least for me moderate > understanding of dplyr,especially the .$ notation; a simpler dplyr > answer might have moved this response out of the 'difficult' category, > especially given the familiarity of the OP with dplyr). I'd vote for > Bill's as requiring the least specialized knowledge of R (though the +/- > 1 indexing is an easy thing to get wrong). > > A different criteria might be reuse across analysis scenarios. Bill > seems to win here again, since the principles are very general and at > least moderately efficient (both Bert and Martin's solutions are > essentially R-level iterations and have poor scalability, as > demonstrated in the microbenchmarks; Bill's is mostly vectorized). > Certainly data.table, dplyr, and IRanges are extremely useful within the > confines of the problem domains they address. > > Martin > >> Enjoy! ;) >> >> Cheers, >> Denes >> >> >> -------------------------- >> >> >> ## packages ########################## >> >> library(dplyr) >> library(data.table) >> library(IRanges) >> library(microbenchmark) >> >> ## prepare example dataset ########### >> >> ## use Bert's example, with 2000 stations instead of 2 >> d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L), >> from = as.integer(c(60,61,71,72,76,60,65,82,83)), >> to = as.integer(c(60,70,71,76,83,64, 81, 82,83)), >> record = c("A","B","C","B","D","B","B","D","E"), >> stringsAsFactors = FALSE) >> stations <- rle(d_df$station) >> stations$value <- gsub( >> " ", "0", >> paste0("station", format(1:length(stations$value), width = 6))) >> d_df$station <- rep(stations$value, stations$lengths) >> >> ## prepare tibble and data.table versions >> d_tbl <- as_tibble(d_df) >> d_dt <- as.data.table(d_df) >> >> ## solutions ########################## >> >> ## Bert - by >> fun_bert <- function(d) { >> out <- by( >> d, d$station, function(x) with(x, { >> i <- to - from +1 >> data.frame(record =rep(record,i), >> year =sequence(i) -1 + rep(from,i), >> stringsAsFactors = FALSE) >> })) >> data.frame(station = rep(names(out), sapply(out,nrow)), >> do.call(rbind,out), >> row.names = NULL, >> stringsAsFactors = FALSE) >> } >> >> ## Bill - transform >> fun_bill <- function(d) { >> i <- rep(seq_len(nrow(d)), d$to-d$from+1) >> j <- sequence(d$to-d$from+1) >> transform(d[i,], year=from+j-1, from=NULL, to=NULL) >> } >> >> ## Michael - IRanges >> fun_michael <- function(d) { >> df <- with(d, DataFrame(station, record, year=IRanges(from, to))) >> expand(df, "year") >> } >> >> ## Jim - dplyr >> fun_jim <- function(d) { >> d %>% >> rowwise() %>% >> do(tibble(station = .$station, >> record = .$record, >> year = seq(.$from, .$to)) >> ) >> } >> >> ## Martin - Map >> fun_martin <- function(d) { >> d$year <- with(d, Map(seq, from, to)) >> res0 <- with(d, Map(data.frame, >> station=station, >> record=record, >> year=year, >> MoreArgs = list(stringsAsFactors = FALSE))) >> do.call(rbind, unname(res0)) >> } >> >> ## Denes - simple data.table >> fun_denes <- function(d) { >> out <- d[, .(year = from:to), by = .(station, from, record)] >> out[, from := NULL] >> } >> >> ## Check equality ################################ >> all.equal(fun_bill(d_df), fun_bert(d_df), >> check.attributes = FALSE) >> all.equal(fun_bill(d_df), fun_martin(d_df), >> check.attributes = FALSE) >> all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)), >> check.attributes = FALSE) >> all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)), >> check.attributes = FALSE) >> # Be prepared: this solution is super slow >> all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)), >> check.attributes = FALSE) >> >> ## Benchmark ##################################### >> >> ## Martin >> print(system.time(fun_martin(d_df))) >> >> ## Bert >> print(system.time(fun_bert(d_df))) >> >> ## Top 3 >> print( >> microbenchmark( >> fun_bill(d_df), >> fun_michael(d_df), >> fun_denes(d_dt), >> times = 100L >> ) >> ) >> >> >> ------------------------- >> >> On 11/28/2017 06:49 PM, Michael Lawrence wrote: >>> Or with the Bioconductor IRanges package: >>> >>> df <- with(input, DataFrame(station, year=IRanges(from, to), record)) >>> expand(df, "year") >>> >>> DataFrame with 24 rows and 3 columns >>> station year record >>> <character> <integer> <character> >>> 1 07EA001 1960 QMS >>> 2 07EA001 1961 QMC >>> 3 07EA001 1962 QMC >>> 4 07EA001 1963 QMC >>> 5 07EA001 1964 QMC >>> ... ... ... ... >>> 20 07EA001 1979 QRC >>> 21 07EA001 1980 QRC >>> 22 07EA001 1981 QRC >>> 23 07EA001 1982 QRC >>> 24 07EA001 1983 QRC >>> >>> If you tell the computer more about your data, it can do more things for >>> you. >>> >>> Michael >>> >>> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan < >>> martin.morgan at roswellpark.org> wrote: >>> >>>> On 11/26/2017 08:42 PM, jim holtman wrote: >>>> >>>>> try this: >>>>> >>>>> ########################################## >>>>> >>>>> library(dplyr) >>>>> >>>>> input <- tribble( >>>>> ~station, ~from, ~to, ~record, >>>>> "07EA001" , 1960 , 1960 , "QMS", >>>>> "07EA001" , 1961 , 1970 , "QMC", >>>>> "07EA001" , 1971 , 1971 , "QMM", >>>>> "07EA001" , 1972 , 1976 , "QMC", >>>>> "07EA001" , 1977 , 1983 , "QRC" >>>>> ) >>>>> >>>>> result <- input %>% >>>>> rowwise() %>% >>>>> do(tibble(station = .$station, >>>>> year = seq(.$from, .$to), >>>>> record = .$record) >>>>> ) >>>>> >>>>> ########################### >>>>> >>>> >>>> In a bit more 'base R' mode I did >>>> >>>> input$year <- with(input, Map(seq, from, to)) >>>> res0 <- with(input, Map(data.frame, station=station, year=year, >>>> record=record)) >>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3 >>>> >>>> resulting in >>>> >>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3 >>>> station year record >>>> <fctr> <int> <fctr> >>>> 1 07EA001 1960 QMS >>>> 2 07EA001 1961 QMC >>>> 3 07EA001 1962 QMC >>>> 4 07EA001 1963 QMC >>>> 5 07EA001 1964 QMC >>>> 6 07EA001 1965 QMC >>>> 7 07EA001 1966 QMC >>>> 8 07EA001 1967 QMC >>>> 9 07EA001 1968 QMC >>>> 10 07EA001 1969 QMC >>>> # ... with 14 more rows >>>> >>>> I though I should have been able to use `tibble` in the second step, >>>> but >>>> that leads to a (cryptic) error >>>> >>>>> res0 <- with(input, Map(tibble, station=station, year=year, >>>> record=record))Error in captureDots(strict = `__quosured`) : >>>> the argument has already been evaluated >>>> >>>> The 'station' and 'record' columns are factors, so different from the >>>> original input, but this seems the appropriate data type for theses >>>> columns. >>>> >>>> It's interesting to compare the 'specialized' knowledge needed for each >>>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(), maybe >>>> rbind() and Map() for base R. >>>> >>>> Martin >>>> >>>> >>>> >>>>> >>>>> >>>>> Jim Holtman >>>>> Data Munger Guru >>>>> >>>>> What is the problem that you are trying to solve? >>>>> Tell me what you want to do, not how you want to do it. >>>>> >>>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at gmail.com> >>>>> wrote: >>>>> >>>>> To David W.'s point about lack of a suitable reprex ("reproducible >>>>>> example"), Bill's solution seems to be for only one station. >>>>>> >>>>>> Here is a reprex and modification that I think does what was >>>>>> requested >>>>>> for >>>>>> multiple stations, again using base R and data frames, not dplyr and >>>>>> tibbles. >>>>>> >>>>>> First the reprex with **two** stations: >>>>>> >>>>>> d <- data.frame( station = rep(c("one","two"),c(5,4)), >>>>>>> >>>>>> from = c(60,61,71,72,76,60,65,82,83), >>>>>> to = c(60,70,71,76,83,64, 81, 82,83), >>>>>> record = c("A","B","C","B","D","B","B","D","E")) >>>>>> >>>>>> d >>>>>>> >>>>>> station from to record >>>>>> 1 one 60 60 A >>>>>> 2 one 61 70 B >>>>>> 3 one 71 71 C >>>>>> 4 one 72 76 B >>>>>> 5 one 76 83 D >>>>>> 6 two 60 64 B >>>>>> 7 two 65 81 B >>>>>> 8 two 82 82 D >>>>>> 9 two 83 83 E >>>>>> >>>>>> ## Now the conversion code using base R, especially by(): >>>>>> >>>>>> out <- by(d, d$station, function(x) with(x, { >>>>>>> >>>>>> + i <- to - from +1 >>>>>> + data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD >>>>>> =rep(record,i)) >>>>>> + })) >>>>>> >>>>>> >>>>>> out <- data.frame(station >>>>>>> >>>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names = >>>>>> NULL) >>>>>> >>>>>> >>>>>> out >>>>>>> >>>>>> station YEAR RECORD >>>>>> 1 one 60 A >>>>>> 2 one 61 B >>>>>> 3 one 62 B >>>>>> 4 one 63 B >>>>>> 5 one 64 B >>>>>> 6 one 65 B >>>>>> 7 one 66 B >>>>>> 8 one 67 B >>>>>> 9 one 68 B >>>>>> 10 one 69 B >>>>>> 11 one 70 B >>>>>> 12 one 71 C >>>>>> 13 one 72 B >>>>>> 14 one 73 B >>>>>> 15 one 74 B >>>>>> 16 one 75 B >>>>>> 17 one 76 B >>>>>> 18 one 76 D >>>>>> 19 one 77 D >>>>>> 20 one 78 D >>>>>> 21 one 79 D >>>>>> 22 one 80 D >>>>>> 23 one 81 D >>>>>> 24 one 82 D >>>>>> 25 one 83 D >>>>>> 26 two 60 B >>>>>> 27 two 61 B >>>>>> 28 two 62 B >>>>>> 29 two 63 B >>>>>> 30 two 64 B >>>>>> 31 two 65 B >>>>>> 32 two 66 B >>>>>> 33 two 67 B >>>>>> 34 two 68 B >>>>>> 35 two 69 B >>>>>> 36 two 70 B >>>>>> 37 two 71 B >>>>>> 38 two 72 B >>>>>> 39 two 73 B >>>>>> 40 two 74 B >>>>>> 41 two 75 B >>>>>> 42 two 76 B >>>>>> 43 two 77 B >>>>>> 44 two 78 B >>>>>> 45 two 79 B >>>>>> 46 two 80 B >>>>>> 47 two 81 B >>>>>> 48 two 82 D >>>>>> 49 two 83 E >>>>>> >>>>>> Cheers, >>>>>> Bert >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Bert Gunter >>>>>> >>>>>> "The trouble with having an open mind is that people keep coming >>>>>> along >>>>>> and >>>>>> sticking things into it." >>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) >>>>>> >>>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help < >>>>>> r-help at r-project.org> wrote: >>>>>> >>>>>> dplyr may have something for this, but in base R I think the >>>>>> following >>>>>>> >>>>>> does >>>>>> >>>>>>> what you want. I've shortened the name of your data set to 'd'. >>>>>>> >>>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1) >>>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1) >>>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL, YEAR_TO=NULL) >>>>>>> >>>>>>> >>>>>>> Bill Dunlap >>>>>>> TIBCO Software >>>>>>> wdunlap tibco.com >>>>>>> >>>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC) < >>>>>>> david.hutchinson at canada.ca> wrote: >>>>>>> >>>>>>> I have a returned tibble of station operational record similar to >>>>>>> the >>>>>>>> following: >>>>>>>> >>>>>>>> data.collection >>>>>>>>> >>>>>>>> # A tibble: 5 x 4 >>>>>>>> STATION_NUMBER YEAR_FROM YEAR_TO RECORD >>>>>>>> <chr> <int> <int> <chr> >>>>>>>> 1 07EA001 1960 1960 QMS >>>>>>>> 2 07EA001 1961 1970 QMC >>>>>>>> 3 07EA001 1971 1971 QMM >>>>>>>> 4 07EA001 1972 1976 QMC >>>>>>>> 5 07EA001 1977 1983 QRC >>>>>>>> >>>>>>>> I would like to reshape this to one operational record (row) per >>>>>>>> year >>>>>>>> >>>>>>> per >>>>>> >>>>>>> station. Something like: >>>>>>>> >>>>>>>> 07EA001 1960 QMS >>>>>>>> 07EA001 1961 QMC >>>>>>>> 07EA001 1962 QMC >>>>>>>> 07EA001 1963 QMC >>>>>>>> ... >>>>>>>> 07EA001 1971 QMM >>>>>>>> >>>>>>>> Can this be done in dplyr easily? >>>>>>>> >>>>>>>> Thanks in advance, >>>>>>>> >>>>>>>> David >>>>>>>> >>>>>>>> [[alternative HTML version deleted]] >>>>>>>> >>>>>>>> ______________________________________________ >>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>>>> posting-guide.html >>>>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>>>> >>>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>>> ______________________________________________ >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>>> posting-guide.html >>>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>>> >>>>>>> >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> ______________________________________________ >>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>> posting-guide.html >>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>> >>>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> ______________________________________________ >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>> PLEASE do read the posting guide http://www.R-project.org/posti >>>>> ng-guide.html >>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>>>> >>>> >>>> This email message may contain legally privileged >>>> and/or...{{dropped:2}} >>>> >>>> >>>> ______________________________________________ >>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide http://www.R-project.org/posti >>>> ng-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >> > > > This email message may contain legally privileged and/or confidential > information. If you are not the intended recipient(s), or the employee > or agent responsible for the delivery of this message to the intended > recipient(s), you are hereby notified that any disclosure, copying, > distribution, or use of this email message is prohibited. If you have > received this message in error, please notify the sender immediately by > e-mail and delete this email message from your computer. Thank you. >-- Dr. T?th D?nes ?gyvezet? Kogentum Kft. Tel.: 06-30-2583723 Web: www.kogentum.hu
On 11/29/2017 05:47 PM, T?th D?nes wrote:> Hi Martin, > > On 11/29/2017 10:46 PM, Martin Morgan wrote: >> On 11/29/2017 04:15 PM, T?th D?nes wrote: >>> Hi, >>> >>> A benchmarking study with an additional (data.table-based) solution. >> >> I don't think speed is the right benchmark (I do agree that >> correctness is!). > > Well, agree, and sorry for the wording. It was really just an exercise > and not a full evaluation of the approaches. When I read the avalanche > of solutions neither of which mentioning data.table (my first choice for > data.frame-manipulations), I became curious how a one-liner data.table > code performs against the other solutions in terms of speed and > readability. > Second, I quite often have the feeling that dplyr is extremely overused > among novice (and sometimes even experienced) R users nowadays. This is > unfortunate, as the present example also illustrates.Another solution is Bill's approach and dplyr's implementation (adding the 1L to keep integers integers!) fun_bill1 <- function(d) { i <- rep(seq_len(nrow(d)), d$to - d$from + 1L) j <- sequence(d$to - d$from + 1L) ## d[i,] %>% mutate(year = from + j - 1L, from = NULL, to = NULL) mutate(d[i,], year = from + j - 1L, from = NULL, to = NULL) } which is competitive with IRanges and data.table (the more dplyr-ish? solution d[i, ] %>% mutate(year = from + j - 1L) %>% select(station, record, year)) has intermediate performance) and might appeal to those introduced to R through dplyr but wanting more base R knowledge, and vice versa. I think if dplyr introduces new users to R, or exposes R users to new approaches for working with data, that's great! Martin> > Regards, > Denes > >> >> For the R-help list, maybe something about least specialized R >> knowledge required would be appropriate? I'd say there were some >> 'hard' solutions -- Michael (deep understanding of Bioconductor and >> IRanges), Toth (deep understanding of data.table), Jim (at least for >> me moderate understanding of dplyr,especially the .$ notation; a >> simpler dplyr answer might have moved this response out of the >> 'difficult' category, especially given the familiarity of the OP with >> dplyr). I'd vote for Bill's as requiring the least specialized >> knowledge of R (though the +/- 1 indexing is an easy thing to get wrong). >> >> A different criteria might be reuse across analysis scenarios. Bill >> seems to win here again, since the principles are very general and at >> least moderately efficient (both Bert and Martin's solutions are >> essentially R-level iterations and have poor scalability, as >> demonstrated in the microbenchmarks; Bill's is mostly vectorized). >> Certainly data.table, dplyr, and IRanges are extremely useful within >> the confines of the problem domains they address. >> >> Martin >> >>> Enjoy! ;) >>> >>> Cheers, >>> Denes >>> >>> >>> -------------------------- >>> >>> >>> ## packages ########################## >>> >>> library(dplyr) >>> library(data.table) >>> library(IRanges) >>> library(microbenchmark) >>> >>> ## prepare example dataset ########### >>> >>> ## use Bert's example, with 2000 stations instead of 2 >>> d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L), >>> ???????????????????? from = as.integer(c(60,61,71,72,76,60,65,82,83)), >>> ???????????????????? to = as.integer(c(60,70,71,76,83,64, 81, 82,83)), >>> ???????????????????? record = c("A","B","C","B","D","B","B","D","E"), >>> ???????????????????? stringsAsFactors = FALSE) >>> stations <- rle(d_df$station) >>> stations$value <- gsub( >>> ?? " ", "0", >>> ?? paste0("station", format(1:length(stations$value), width = 6))) >>> d_df$station <- rep(stations$value, stations$lengths) >>> >>> ## prepare tibble and data.table versions >>> d_tbl <- as_tibble(d_df) >>> d_dt <- as.data.table(d_df) >>> >>> ## solutions ########################## >>> >>> ## Bert - by >>> fun_bert <- function(d) { >>> ?? out <- by( >>> ???? d, d$station, function(x) with(x, { >>> ?????? i <- to - from +1 >>> ?????? data.frame(record =rep(record,i), >>> ????????????????? year =sequence(i) -1 + rep(from,i), >>> ????????????????? stringsAsFactors = FALSE) >>> ???? })) >>> ?? data.frame(station = rep(names(out), sapply(out,nrow)), >>> ????????????? do.call(rbind,out), >>> ????????????? row.names = NULL, >>> ????????????? stringsAsFactors = FALSE) >>> } >>> >>> ## Bill - transform >>> fun_bill <- function(d) { >>> ?? i <- rep(seq_len(nrow(d)), d$to-d$from+1) >>> ?? j <- sequence(d$to-d$from+1) >>> ?? transform(d[i,], year=from+j-1, from=NULL, to=NULL) >>> } >>> >>> ## Michael - IRanges >>> fun_michael <- function(d) { >>> ?? df <- with(d, DataFrame(station, record, year=IRanges(from, to))) >>> ?? expand(df, "year") >>> } >>> >>> ## Jim - dplyr >>> fun_jim <- function(d) { >>> ?? d %>% >>> ???? rowwise() %>% >>> ???? do(tibble(station = .$station, >>> ?????????????? record = .$record, >>> ?????????????? year = seq(.$from, .$to)) >>> ???? ) >>> } >>> >>> ## Martin - Map >>> fun_martin <- function(d) { >>> ?? d$year <- with(d, Map(seq, from, to)) >>> ?? res0 <- with(d, Map(data.frame, >>> ?????????????????????? station=station, >>> ?????????????????????? record=record, >>> ?????????????????????? year=year, >>> ?????????????????????? MoreArgs = list(stringsAsFactors = FALSE))) >>> ?? do.call(rbind, unname(res0)) >>> } >>> >>> ## Denes - simple data.table >>> fun_denes <- function(d) { >>> ?? out <- d[, .(year = from:to), by = .(station, from, record)] >>> ?? out[, from := NULL] >>> } >>> >>> ## Check equality ################################ >>> all.equal(fun_bill(d_df), fun_bert(d_df), >>> ?????????? check.attributes = FALSE) >>> all.equal(fun_bill(d_df), fun_martin(d_df), >>> ?????????? check.attributes = FALSE) >>> all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)), >>> ?????????? check.attributes = FALSE) >>> all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)), >>> ?????????? check.attributes = FALSE) >>> # Be prepared: this solution is super slow >>> all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)), >>> ?????????? check.attributes = FALSE) >>> >>> ## Benchmark ##################################### >>> >>> ## Martin >>> print(system.time(fun_martin(d_df))) >>> >>> ## Bert >>> print(system.time(fun_bert(d_df))) >>> >>> ## Top 3 >>> print( >>> ?? microbenchmark( >>> ???? fun_bill(d_df), >>> ???? fun_michael(d_df), >>> ???? fun_denes(d_dt), >>> ???? times = 100L >>> ?? ) >>> ) >>> >>> >>> ------------------------- >>> >>> On 11/28/2017 06:49 PM, Michael Lawrence wrote: >>>> Or with the Bioconductor IRanges package: >>>> >>>> df <- with(input, DataFrame(station, year=IRanges(from, to), record)) >>>> expand(df, "year") >>>> >>>> DataFrame with 24 rows and 3 columns >>>> ???????? station???? year????? record >>>> ???? <character> <integer> <character> >>>> 1?????? 07EA001????? 1960???????? QMS >>>> 2?????? 07EA001????? 1961???????? QMC >>>> 3?????? 07EA001????? 1962???????? QMC >>>> 4?????? 07EA001????? 1963???????? QMC >>>> 5?????? 07EA001????? 1964???????? QMC >>>> ...???????? ...?????? ...???????? ... >>>> 20????? 07EA001????? 1979???????? QRC >>>> 21????? 07EA001????? 1980???????? QRC >>>> 22????? 07EA001????? 1981???????? QRC >>>> 23????? 07EA001????? 1982???????? QRC >>>> 24????? 07EA001????? 1983???????? QRC >>>> >>>> If you tell the computer more about your data, it can do more things >>>> for >>>> you. >>>> >>>> Michael >>>> >>>> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan < >>>> martin.morgan at roswellpark.org> wrote: >>>> >>>>> On 11/26/2017 08:42 PM, jim holtman wrote: >>>>> >>>>>> try this: >>>>>> >>>>>> ########################################## >>>>>> >>>>>> library(dplyr) >>>>>> >>>>>> input <- tribble( >>>>>> ??? ~station, ~from, ~to, ~record, >>>>>> ?? "07EA001" ,??? 1960? ,? 1960? , "QMS", >>>>>> ?? "07EA001"? ,?? 1961 ,?? 1970? , "QMC", >>>>>> ?? "07EA001" ,??? 1971? ,? 1971? , "QMM", >>>>>> ?? "07EA001" ,??? 1972? ,? 1976? , "QMC", >>>>>> ?? "07EA001" ,??? 1977? ,? 1983? , "QRC" >>>>>> ) >>>>>> >>>>>> result <- input %>% >>>>>> ??? rowwise() %>% >>>>>> ??? do(tibble(station = .$station, >>>>>> ????????????? year = seq(.$from, .$to), >>>>>> ????????????? record = .$record) >>>>>> ??? ) >>>>>> >>>>>> ########################### >>>>>> >>>>> >>>>> In a bit more 'base R' mode I did >>>>> >>>>> ?? input$year <- with(input, Map(seq, from, to)) >>>>> ?? res0 <- with(input, Map(data.frame, station=station, year=year, >>>>> ?????? record=record)) >>>>> ??? as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3 >>>>> >>>>> resulting in >>>>> >>>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3 >>>>> ??? station? year record >>>>> ???? <fctr> <int> <fctr> >>>>> ? 1 07EA001? 1960??? QMS >>>>> ? 2 07EA001? 1961??? QMC >>>>> ? 3 07EA001? 1962??? QMC >>>>> ? 4 07EA001? 1963??? QMC >>>>> ? 5 07EA001? 1964??? QMC >>>>> ? 6 07EA001? 1965??? QMC >>>>> ? 7 07EA001? 1966??? QMC >>>>> ? 8 07EA001? 1967??? QMC >>>>> ? 9 07EA001? 1968??? QMC >>>>> 10 07EA001? 1969??? QMC >>>>> # ... with 14 more rows >>>>> >>>>> I though I should have been able to use `tibble` in the second >>>>> step, but >>>>> that leads to a (cryptic) error >>>>> >>>>>> res0 <- with(input, Map(tibble, station=station, year=year, >>>>> record=record))Error in captureDots(strict = `__quosured`) : >>>>> ?? the argument has already been evaluated >>>>> >>>>> The 'station' and 'record' columns are factors, so different from the >>>>> original input, but this seems the appropriate data type for theses >>>>> columns. >>>>> >>>>> It's interesting to compare the 'specialized' knowledge needed for >>>>> each >>>>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(), >>>>> maybe >>>>> rbind() and Map() for base R. >>>>> >>>>> Martin >>>>> >>>>> >>>>> >>>>>> >>>>>> >>>>>> Jim Holtman >>>>>> Data Munger Guru >>>>>> >>>>>> What is the problem that you are trying to solve? >>>>>> Tell me what you want to do, not how you want to do it. >>>>>> >>>>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at gmail.com> >>>>>> wrote: >>>>>> >>>>>> To David W.'s point about lack of a suitable reprex ("reproducible >>>>>>> example"), Bill's solution seems to be for only one station. >>>>>>> >>>>>>> Here is a reprex and modification that I think does what was >>>>>>> requested >>>>>>> for >>>>>>> multiple stations, again using base R and data frames, not dplyr and >>>>>>> tibbles. >>>>>>> >>>>>>> First the reprex with **two** stations: >>>>>>> >>>>>>> d <- data.frame( station = rep(c("one","two"),c(5,4)), >>>>>>>> >>>>>>> ???????????????? from = c(60,61,71,72,76,60,65,82,83), >>>>>>> ????????????????? to = c(60,70,71,76,83,64, 81, 82,83), >>>>>>> ????????????????? record = c("A","B","C","B","D","B","B","D","E")) >>>>>>> >>>>>>> d >>>>>>>> >>>>>>> ??? station from to record >>>>>>> 1???? one?? 60 60????? A >>>>>>> 2???? one?? 61 70????? B >>>>>>> 3???? one?? 71 71????? C >>>>>>> 4???? one?? 72 76????? B >>>>>>> 5???? one?? 76 83????? D >>>>>>> 6???? two?? 60 64????? B >>>>>>> 7???? two?? 65 81????? B >>>>>>> 8???? two?? 82 82????? D >>>>>>> 9???? two?? 83 83????? E >>>>>>> >>>>>>> ## Now the conversion code using base R, especially by(): >>>>>>> >>>>>>> out <- by(d, d$station, function(x) with(x, { >>>>>>>> >>>>>>> +??? i <- to - from +1 >>>>>>> +??? data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD >>>>>>> =rep(record,i)) >>>>>>> + })) >>>>>>> >>>>>>> >>>>>>> out <- data.frame(station >>>>>>>> >>>>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names = >>>>>>> NULL) >>>>>>> >>>>>>> >>>>>>> out >>>>>>>> >>>>>>> ???? station YEAR RECORD >>>>>>> 1????? one?? 60????? A >>>>>>> 2????? one?? 61????? B >>>>>>> 3????? one?? 62????? B >>>>>>> 4????? one?? 63????? B >>>>>>> 5????? one?? 64????? B >>>>>>> 6????? one?? 65????? B >>>>>>> 7????? one?? 66????? B >>>>>>> 8????? one?? 67????? B >>>>>>> 9????? one?? 68????? B >>>>>>> 10???? one?? 69????? B >>>>>>> 11???? one?? 70????? B >>>>>>> 12???? one?? 71????? C >>>>>>> 13???? one?? 72????? B >>>>>>> 14???? one?? 73????? B >>>>>>> 15???? one?? 74????? B >>>>>>> 16???? one?? 75????? B >>>>>>> 17???? one?? 76????? B >>>>>>> 18???? one?? 76????? D >>>>>>> 19???? one?? 77????? D >>>>>>> 20???? one?? 78????? D >>>>>>> 21???? one?? 79????? D >>>>>>> 22???? one?? 80????? D >>>>>>> 23???? one?? 81????? D >>>>>>> 24???? one?? 82????? D >>>>>>> 25???? one?? 83????? D >>>>>>> 26???? two?? 60????? B >>>>>>> 27???? two?? 61????? B >>>>>>> 28???? two?? 62????? B >>>>>>> 29???? two?? 63????? B >>>>>>> 30???? two?? 64????? B >>>>>>> 31???? two?? 65????? B >>>>>>> 32???? two?? 66????? B >>>>>>> 33???? two?? 67????? B >>>>>>> 34???? two?? 68????? B >>>>>>> 35???? two?? 69????? B >>>>>>> 36???? two?? 70????? B >>>>>>> 37???? two?? 71????? B >>>>>>> 38???? two?? 72????? B >>>>>>> 39???? two?? 73????? B >>>>>>> 40???? two?? 74????? B >>>>>>> 41???? two?? 75????? B >>>>>>> 42???? two?? 76????? B >>>>>>> 43???? two?? 77????? B >>>>>>> 44???? two?? 78????? B >>>>>>> 45???? two?? 79????? B >>>>>>> 46???? two?? 80????? B >>>>>>> 47???? two?? 81????? B >>>>>>> 48???? two?? 82????? D >>>>>>> 49???? two?? 83????? E >>>>>>> >>>>>>> Cheers, >>>>>>> Bert >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Bert Gunter >>>>>>> >>>>>>> "The trouble with having an open mind is that people keep coming >>>>>>> along >>>>>>> and >>>>>>> sticking things into it." >>>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) >>>>>>> >>>>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help < >>>>>>> r-help at r-project.org> wrote: >>>>>>> >>>>>>> dplyr may have something for this, but in base R I think the >>>>>>> following >>>>>>>> >>>>>>> does >>>>>>> >>>>>>>> what you want.? I've shortened the name of your data set to 'd'. >>>>>>>> >>>>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1) >>>>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1) >>>>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL, YEAR_TO=NULL) >>>>>>>> >>>>>>>> >>>>>>>> Bill Dunlap >>>>>>>> TIBCO Software >>>>>>>> wdunlap tibco.com >>>>>>>> >>>>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC) < >>>>>>>> david.hutchinson at canada.ca> wrote: >>>>>>>> >>>>>>>> I have a returned tibble of station operational record similar >>>>>>>> to the >>>>>>>>> following: >>>>>>>>> >>>>>>>>> data.collection >>>>>>>>>> >>>>>>>>> # A tibble: 5 x 4 >>>>>>>>> ??? STATION_NUMBER YEAR_FROM YEAR_TO RECORD >>>>>>>>> ???????????? <chr>???? <int>?? <int>? <chr> >>>>>>>>> 1??????? 07EA001????? 1960??? 1960??? QMS >>>>>>>>> 2??????? 07EA001????? 1961??? 1970??? QMC >>>>>>>>> 3??????? 07EA001????? 1971??? 1971??? QMM >>>>>>>>> 4??????? 07EA001????? 1972??? 1976??? QMC >>>>>>>>> 5??????? 07EA001????? 1977??? 1983??? QRC >>>>>>>>> >>>>>>>>> I would like to reshape this to one operational record (row) >>>>>>>>> per year >>>>>>>>> >>>>>>>> per >>>>>>> >>>>>>>> station. Something like: >>>>>>>>> >>>>>>>>> 07EA001????????????? 1960????? QMS >>>>>>>>> 07EA001????????????? 1961????? QMC >>>>>>>>> 07EA001????????????? 1962????? QMC >>>>>>>>> 07EA001????????????? 1963????? QMC >>>>>>>>> ... >>>>>>>>> 07EA001????????????? 1971????? QMM >>>>>>>>> >>>>>>>>> Can this be done in dplyr easily? >>>>>>>>> >>>>>>>>> Thanks in advance, >>>>>>>>> >>>>>>>>> David >>>>>>>>> >>>>>>>>> ????????? [[alternative HTML version deleted]] >>>>>>>>> >>>>>>>>> ______________________________________________ >>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>>>>> posting-guide.html >>>>>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>>>>> >>>>>>>>> >>>>>>>> ????????? [[alternative HTML version deleted]] >>>>>>>> >>>>>>>> ______________________________________________ >>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>>>> posting-guide.html >>>>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>>>> >>>>>>>> >>>>>>> ????????? [[alternative HTML version deleted]] >>>>>>> >>>>>>> ______________________________________________ >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>>> PLEASE do read the posting guide http://www.R-project.org/ >>>>>>> posting-guide.html >>>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>>> >>>>>>> >>>>>> ???????? [[alternative HTML version deleted]] >>>>>> >>>>>> ______________________________________________ >>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>> PLEASE do read the posting guide http://www.R-project.org/posti >>>>>> ng-guide.html >>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>> >>>>>> >>>>> >>>>> This email message may contain legally privileged >>>>> and/or...{{dropped:2}} >>>>> >>>>> >>>>> ______________________________________________ >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>> PLEASE do read the posting guide http://www.R-project.org/posti >>>>> ng-guide.html >>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>>> >>>> ??? [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >>> >> >> >> This email message may contain legally privileged and/or confidential >> information.? If you are not the intended recipient(s), or the >> employee or agent responsible for the delivery of this message to the >> intended recipient(s), you are hereby notified that any disclosure, >> copying, distribution, or use of this email message is prohibited.? If >> you have received this message in error, please notify the sender >> immediately by e-mail and delete this email message from your >> computer. Thank you. >> >This email message may contain legally privileged and/or...{{dropped:2}}