thr3ads.net - R help - [R] dplyr - add/expand rows [Nov 2017]

If this information is useful, please help other people find it:
Share via:

Michael Lawrence

2017-Nov-28 17:49 UTC

[R] dplyr - add/expand rows

Or with the Bioconductor IRanges package:

df <- with(input, DataFrame(station, year=IRanges(from, to), record))
expand(df, "year")

DataFrame with 24 rows and 3 columns
        station     year      record
    <character> <integer> <character>
1       07EA001      1960         QMS
2       07EA001      1961         QMC
3       07EA001      1962         QMC
4       07EA001      1963         QMC
5       07EA001      1964         QMC
...         ...       ...         ...
20      07EA001      1979         QRC
21      07EA001      1980         QRC
22      07EA001      1981         QRC
23      07EA001      1982         QRC
24      07EA001      1983         QRC

If you tell the computer more about your data, it can do more things for
you.

Michael

On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan <
martin.morgan at roswellpark.org> wrote:
> On 11/26/2017 08:42 PM, jim holtman wrote:
>
>> try this:
>>
>> ##########################################
>>
>> library(dplyr)
>>
>> input <- tribble(
>>    ~station, ~from, ~to, ~record,
>>   "07EA001" ,    1960  ,  1960  , "QMS",
>>   "07EA001"  ,   1961 ,   1970  , "QMC",
>>   "07EA001" ,    1971  ,  1971  , "QMM",
>>   "07EA001" ,    1972  ,  1976  , "QMC",
>>   "07EA001" ,    1977  ,  1983  , "QRC"
>> )
>>
>> result <- input %>%
>>    rowwise() %>%
>>    do(tibble(station = .$station,
>>              year = seq(.$from, .$to),
>>              record = .$record)
>>    )
>>
>> ###########################
>>
>
> In a bit more 'base R' mode I did
>
>   input$year <- with(input, Map(seq, from, to))
>   res0 <- with(input, Map(data.frame, station=station, year=year,
>       record=record))
>    as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>
> resulting in
>
> > as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>    station  year record
>     <fctr> <int> <fctr>
>  1 07EA001  1960    QMS
>  2 07EA001  1961    QMC
>  3 07EA001  1962    QMC
>  4 07EA001  1963    QMC
>  5 07EA001  1964    QMC
>  6 07EA001  1965    QMC
>  7 07EA001  1966    QMC
>  8 07EA001  1967    QMC
>  9 07EA001  1968    QMC
> 10 07EA001  1969    QMC
> # ... with 14 more rows
>
> I though I should have been able to use `tibble` in the second step, but
> that leads to a (cryptic) error
>
> > res0 <- with(input, Map(tibble, station=station, year=year,
> record=record))Error in captureDots(strict = `__quosured`) :
>   the argument has already been evaluated
>
> The 'station' and 'record' columns are factors, so
different from the
> original input, but this seems the appropriate data type for theses
columns.
>
> It's interesting to compare the 'specialized' knowledge needed
for each
> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(), maybe
> rbind() and Map() for base R.
>
> Martin
>
>
>
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at
gmail.com>
>> wrote:
>>
>> To David W.'s point about lack of a suitable reprex
("reproducible
>>> example"), Bill's solution seems to be for only one
station.
>>>
>>> Here is a reprex and modification that I think does what was
requested
>>> for
>>> multiple stations, again using base R and data frames, not dplyr
and
>>> tibbles.
>>>
>>> First the reprex with **two** stations:
>>>
>>> d <- data.frame( station =
rep(c("one","two"),c(5,4)),
>>>>
>>>                 from = c(60,61,71,72,76,60,65,82,83),
>>>                  to = c(60,70,71,76,83,64, 81, 82,83),
>>>                  record =
c("A","B","C","B","D","B","B","D","E"))
>>>
>>> d
>>>>
>>>    station from to record
>>> 1     one   60 60      A
>>> 2     one   61 70      B
>>> 3     one   71 71      C
>>> 4     one   72 76      B
>>> 5     one   76 83      D
>>> 6     two   60 64      B
>>> 7     two   65 81      B
>>> 8     two   82 82      D
>>> 9     two   83 83      E
>>>
>>> ## Now the conversion code using base R, especially by():
>>>
>>> out <- by(d, d$station, function(x) with(x, {
>>>>
>>> +    i <- to - from +1
>>> +    data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD
=rep(record,i))
>>> + }))
>>>
>>>
>>> out <- data.frame(station >>>>
>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names =
NULL)
>>>
>>>
>>> out
>>>>
>>>     station YEAR RECORD
>>> 1      one   60      A
>>> 2      one   61      B
>>> 3      one   62      B
>>> 4      one   63      B
>>> 5      one   64      B
>>> 6      one   65      B
>>> 7      one   66      B
>>> 8      one   67      B
>>> 9      one   68      B
>>> 10     one   69      B
>>> 11     one   70      B
>>> 12     one   71      C
>>> 13     one   72      B
>>> 14     one   73      B
>>> 15     one   74      B
>>> 16     one   75      B
>>> 17     one   76      B
>>> 18     one   76      D
>>> 19     one   77      D
>>> 20     one   78      D
>>> 21     one   79      D
>>> 22     one   80      D
>>> 23     one   81      D
>>> 24     one   82      D
>>> 25     one   83      D
>>> 26     two   60      B
>>> 27     two   61      B
>>> 28     two   62      B
>>> 29     two   63      B
>>> 30     two   64      B
>>> 31     two   65      B
>>> 32     two   66      B
>>> 33     two   67      B
>>> 34     two   68      B
>>> 35     two   69      B
>>> 36     two   70      B
>>> 37     two   71      B
>>> 38     two   72      B
>>> 39     two   73      B
>>> 40     two   74      B
>>> 41     two   75      B
>>> 42     two   76      B
>>> 43     two   77      B
>>> 44     two   78      B
>>> 45     two   79      B
>>> 46     two   80      B
>>> 47     two   81      B
>>> 48     two   82      D
>>> 49     two   83      E
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>>
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep
coming along
>>> and
>>> sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County"
comic strip )
>>>
>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help <
>>> r-help at r-project.org> wrote:
>>>
>>> dplyr may have something for this, but in base R I think the
following
>>>>
>>> does
>>>
>>>> what you want.  I've shortened the name of your data set to
'd'.
>>>>
>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1)
>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1)
>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL,
YEAR_TO=NULL)
>>>>
>>>>
>>>> Bill Dunlap
>>>> TIBCO Software
>>>> wdunlap tibco.com
>>>>
>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC) <
>>>> david.hutchinson at canada.ca> wrote:
>>>>
>>>> I have a returned tibble of station operational record similar
to the
>>>>> following:
>>>>>
>>>>> data.collection
>>>>>>
>>>>> # A tibble: 5 x 4
>>>>>    STATION_NUMBER YEAR_FROM YEAR_TO RECORD
>>>>>             <chr>     <int>   <int> 
<chr>
>>>>> 1        07EA001      1960    1960    QMS
>>>>> 2        07EA001      1961    1970    QMC
>>>>> 3        07EA001      1971    1971    QMM
>>>>> 4        07EA001      1972    1976    QMC
>>>>> 5        07EA001      1977    1983    QRC
>>>>>
>>>>> I would like to reshape this to one operational record
(row) per year
>>>>>
>>>> per
>>>
>>>> station. Something like:
>>>>>
>>>>> 07EA001              1960      QMS
>>>>> 07EA001              1961      QMC
>>>>> 07EA001              1962      QMC
>>>>> 07EA001              1963      QMC
>>>>> ...
>>>>> 07EA001              1971      QMM
>>>>>
>>>>> Can this be done in dplyr easily?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> David
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>> posting-guide.html
>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>>>
>>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>> posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible
code.
>>>>
>>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/
>>> posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posti
>> ng-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
> This email message may contain legally privileged and/or...{{dropped:2}}
>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posti
> ng-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Tóth Dénes

2017-Nov-29 21:15 UTC

head link

[R] dplyr - add/expand rows

Hi,

A benchmarking study with an additional (data.table-based) solution. 
Enjoy! ;)

Cheers,
Denes


--------------------------


## packages ##########################

library(dplyr)
library(data.table)
library(IRanges)
library(microbenchmark)

## prepare example dataset ###########

## use Bert's example, with 2000 stations instead of 2
d_df <- data.frame( station =
rep(rep(c("one","two"),c(5,4)), 1000L),
                     from = as.integer(c(60,61,71,72,76,60,65,82,83)),
                     to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
                     record =
c("A","B","C","B","D","B","B","D","E"),
                     stringsAsFactors = FALSE)
stations <- rle(d_df$station)
stations$value <- gsub(
   " ", "0",
   paste0("station", format(1:length(stations$value), width = 6)))
d_df$station <- rep(stations$value, stations$lengths)

## prepare tibble and data.table versions
d_tbl <- as_tibble(d_df)
d_dt <- as.data.table(d_df)

## solutions ##########################

## Bert - by
fun_bert <- function(d) {
   out <- by(
     d, d$station, function(x) with(x, {
       i <- to - from +1
       data.frame(record =rep(record,i),
                  year =sequence(i) -1 + rep(from,i),
                  stringsAsFactors = FALSE)
     }))
   data.frame(station = rep(names(out), sapply(out,nrow)),
              do.call(rbind,out),
              row.names = NULL,
              stringsAsFactors = FALSE)
}

## Bill - transform
fun_bill <- function(d) {
   i <- rep(seq_len(nrow(d)), d$to-d$from+1)
   j <- sequence(d$to-d$from+1)
   transform(d[i,], year=from+j-1, from=NULL, to=NULL)
}

## Michael - IRanges
fun_michael <- function(d) {
   df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
   expand(df, "year")
}

## Jim - dplyr
fun_jim <- function(d) {
   d %>%
     rowwise() %>%
     do(tibble(station = .$station,
               record = .$record,
               year = seq(.$from, .$to))
     )
}

## Martin - Map
fun_martin <- function(d) {
   d$year <- with(d, Map(seq, from, to))
   res0 <- with(d, Map(data.frame,
                       station=station,
                       record=record,
                       year=year,
                       MoreArgs = list(stringsAsFactors = FALSE)))
   do.call(rbind, unname(res0))
}

## Denes - simple data.table
fun_denes <- function(d) {
   out <- d[, .(year = from:to), by = .(station, from, record)]
   out[, from := NULL]
}

## Check equality ################################
all.equal(fun_bill(d_df), fun_bert(d_df),
           check.attributes = FALSE)
all.equal(fun_bill(d_df), fun_martin(d_df),
           check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
           check.attributes = FALSE)
all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
           check.attributes = FALSE)
# Be prepared: this solution is super slow
all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
           check.attributes = FALSE)

## Benchmark #####################################

## Martin
print(system.time(fun_martin(d_df)))

## Bert
print(system.time(fun_bert(d_df)))

## Top 3
print(
   microbenchmark(
     fun_bill(d_df),
     fun_michael(d_df),
     fun_denes(d_dt),
     times = 100L
   )
)


-------------------------

On 11/28/2017 06:49 PM, Michael Lawrence wrote:> Or with the Bioconductor IRanges package:
> 
> df <- with(input, DataFrame(station, year=IRanges(from, to), record))
> expand(df, "year")
> 
> DataFrame with 24 rows and 3 columns
>          station     year      record
>      <character> <integer> <character>
> 1       07EA001      1960         QMS
> 2       07EA001      1961         QMC
> 3       07EA001      1962         QMC
> 4       07EA001      1963         QMC
> 5       07EA001      1964         QMC
> ...         ...       ...         ...
> 20      07EA001      1979         QRC
> 21      07EA001      1980         QRC
> 22      07EA001      1981         QRC
> 23      07EA001      1982         QRC
> 24      07EA001      1983         QRC
> 
> If you tell the computer more about your data, it can do more things for
> you.
> 
> Michael
> 
> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan <
> martin.morgan at roswellpark.org> wrote:
> 
>> On 11/26/2017 08:42 PM, jim holtman wrote:
>>
>>> try this:
>>>
>>> ##########################################
>>>
>>> library(dplyr)
>>>
>>> input <- tribble(
>>>     ~station, ~from, ~to, ~record,
>>>    "07EA001" ,    1960  ,  1960  , "QMS",
>>>    "07EA001"  ,   1961 ,   1970  , "QMC",
>>>    "07EA001" ,    1971  ,  1971  , "QMM",
>>>    "07EA001" ,    1972  ,  1976  , "QMC",
>>>    "07EA001" ,    1977  ,  1983  , "QRC"
>>> )
>>>
>>> result <- input %>%
>>>     rowwise() %>%
>>>     do(tibble(station = .$station,
>>>               year = seq(.$from, .$to),
>>>               record = .$record)
>>>     )
>>>
>>> ###########################
>>>
>>
>> In a bit more 'base R' mode I did
>>
>>    input$year <- with(input, Map(seq, from, to))
>>    res0 <- with(input, Map(data.frame, station=station, year=year,
>>        record=record))
>>     as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>
>> resulting in
>>
>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>     station  year record
>>      <fctr> <int> <fctr>
>>   1 07EA001  1960    QMS
>>   2 07EA001  1961    QMC
>>   3 07EA001  1962    QMC
>>   4 07EA001  1963    QMC
>>   5 07EA001  1964    QMC
>>   6 07EA001  1965    QMC
>>   7 07EA001  1966    QMC
>>   8 07EA001  1967    QMC
>>   9 07EA001  1968    QMC
>> 10 07EA001  1969    QMC
>> # ... with 14 more rows
>>
>> I though I should have been able to use `tibble` in the second step,
but
>> that leads to a (cryptic) error
>>
>>> res0 <- with(input, Map(tibble, station=station, year=year,
>> record=record))Error in captureDots(strict = `__quosured`) :
>>    the argument has already been evaluated
>>
>> The 'station' and 'record' columns are factors, so
different from the
>> original input, but this seems the appropriate data type for theses
columns.
>>
>> It's interesting to compare the 'specialized' knowledge
needed for each
>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(), maybe
>> rbind() and Map() for base R.
>>
>> Martin
>>
>>
>>
>>>
>>>
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>>
>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at
gmail.com>
>>> wrote:
>>>
>>> To David W.'s point about lack of a suitable reprex
("reproducible
>>>> example"), Bill's solution seems to be for only one
station.
>>>>
>>>> Here is a reprex and modification that I think does what was
requested
>>>> for
>>>> multiple stations, again using base R and data frames, not
dplyr and
>>>> tibbles.
>>>>
>>>> First the reprex with **two** stations:
>>>>
>>>> d <- data.frame( station =
rep(c("one","two"),c(5,4)),
>>>>>
>>>>                  from = c(60,61,71,72,76,60,65,82,83),
>>>>                   to = c(60,70,71,76,83,64, 81, 82,83),
>>>>                   record =
c("A","B","C","B","D","B","B","D","E"))
>>>>
>>>> d
>>>>>
>>>>     station from to record
>>>> 1     one   60 60      A
>>>> 2     one   61 70      B
>>>> 3     one   71 71      C
>>>> 4     one   72 76      B
>>>> 5     one   76 83      D
>>>> 6     two   60 64      B
>>>> 7     two   65 81      B
>>>> 8     two   82 82      D
>>>> 9     two   83 83      E
>>>>
>>>> ## Now the conversion code using base R, especially by():
>>>>
>>>> out <- by(d, d$station, function(x) with(x, {
>>>>>
>>>> +    i <- to - from +1
>>>> +    data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD
=rep(record,i))
>>>> + }))
>>>>
>>>>
>>>> out <- data.frame(station >>>>>
>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names
= NULL)
>>>>
>>>>
>>>> out
>>>>>
>>>>      station YEAR RECORD
>>>> 1      one   60      A
>>>> 2      one   61      B
>>>> 3      one   62      B
>>>> 4      one   63      B
>>>> 5      one   64      B
>>>> 6      one   65      B
>>>> 7      one   66      B
>>>> 8      one   67      B
>>>> 9      one   68      B
>>>> 10     one   69      B
>>>> 11     one   70      B
>>>> 12     one   71      C
>>>> 13     one   72      B
>>>> 14     one   73      B
>>>> 15     one   74      B
>>>> 16     one   75      B
>>>> 17     one   76      B
>>>> 18     one   76      D
>>>> 19     one   77      D
>>>> 20     one   78      D
>>>> 21     one   79      D
>>>> 22     one   80      D
>>>> 23     one   81      D
>>>> 24     one   82      D
>>>> 25     one   83      D
>>>> 26     two   60      B
>>>> 27     two   61      B
>>>> 28     two   62      B
>>>> 29     two   63      B
>>>> 30     two   64      B
>>>> 31     two   65      B
>>>> 32     two   66      B
>>>> 33     two   67      B
>>>> 34     two   68      B
>>>> 35     two   69      B
>>>> 36     two   70      B
>>>> 37     two   71      B
>>>> 38     two   72      B
>>>> 39     two   73      B
>>>> 40     two   74      B
>>>> 41     two   75      B
>>>> 42     two   76      B
>>>> 43     two   77      B
>>>> 44     two   78      B
>>>> 45     two   79      B
>>>> 46     two   80      B
>>>> 47     two   81      B
>>>> 48     two   82      D
>>>> 49     two   83      E
>>>>
>>>> Cheers,
>>>> Bert
>>>>
>>>>
>>>>
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep
coming along
>>>> and
>>>> sticking things into it."
>>>> -- Opus (aka Berkeley Breathed in his "Bloom County"
comic strip )
>>>>
>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help <
>>>> r-help at r-project.org> wrote:
>>>>
>>>> dplyr may have something for this, but in base R I think the
following
>>>>>
>>>> does
>>>>
>>>>> what you want.  I've shortened the name of your data
set to 'd'.
>>>>>
>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1)
>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1)
>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL,
YEAR_TO=NULL)
>>>>>
>>>>>
>>>>> Bill Dunlap
>>>>> TIBCO Software
>>>>> wdunlap tibco.com
>>>>>
>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC)
<
>>>>> david.hutchinson at canada.ca> wrote:
>>>>>
>>>>> I have a returned tibble of station operational record
similar to the
>>>>>> following:
>>>>>>
>>>>>> data.collection
>>>>>>>
>>>>>> # A tibble: 5 x 4
>>>>>>     STATION_NUMBER YEAR_FROM YEAR_TO RECORD
>>>>>>              <chr>     <int>   <int> 
<chr>
>>>>>> 1        07EA001      1960    1960    QMS
>>>>>> 2        07EA001      1961    1970    QMC
>>>>>> 3        07EA001      1971    1971    QMM
>>>>>> 4        07EA001      1972    1976    QMC
>>>>>> 5        07EA001      1977    1983    QRC
>>>>>>
>>>>>> I would like to reshape this to one operational record
(row) per year
>>>>>>
>>>>> per
>>>>
>>>>> station. Something like:
>>>>>>
>>>>>> 07EA001              1960      QMS
>>>>>> 07EA001              1961      QMC
>>>>>> 07EA001              1962      QMC
>>>>>> 07EA001              1963      QMC
>>>>>> ...
>>>>>> 07EA001              1971      QMM
>>>>>>
>>>>>> Can this be done in dplyr easily?
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> David
>>>>>>
>>>>>>           [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE
and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
http://www.R-project.org/
>>>>>> posting-guide.html
>>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>>>>
>>>>>>
>>>>>           [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>> posting-guide.html
>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>>>
>>>>>
>>>>           [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>> posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible
code.
>>>>
>>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>> ng-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>> This email message may contain legally privileged
and/or...{{dropped:2}}
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posti
>> ng-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
-- 
Dr. T?th D?nes ?gyvezet?
Kogentum Kft.
Tel.: 06-30-2583723
Web: www.kogentum.hu

Martin Morgan

2017-Nov-29 21:46 UTC

head link

[R] dplyr - add/expand rows

On 11/29/2017 04:15 PM, T?th D?nes wrote:> Hi,
> 
> A benchmarking study with an additional (data.table-based) solution. 
I don't think speed is the right benchmark (I do agree that correctness 
is!).

For the R-help list, maybe something about least specialized R knowledge 
required would be appropriate? I'd say there were some 'hard'
solutions
-- Michael (deep understanding of Bioconductor and IRanges), Toth (deep 
understanding of data.table), Jim (at least for me moderate 
understanding of dplyr,especially the .$ notation; a simpler dplyr 
answer might have moved this response out of the 'difficult' category, 
especially given the familiarity of the OP with dplyr). I'd vote for 
Bill's as requiring the least specialized knowledge of R (though the +/- 
1 indexing is an easy thing to get wrong).

A different criteria might be reuse across analysis scenarios. Bill 
seems to win here again, since the principles are very general and at 
least moderately efficient (both Bert and Martin's solutions are 
essentially R-level iterations and have poor scalability, as 
demonstrated in the microbenchmarks; Bill's is mostly vectorized). 
Certainly data.table, dplyr, and IRanges are extremely useful within the 
confines of the problem domains they address.

Martin
> Enjoy! ;)
> 
> Cheers,
> Denes
> 
> 
> --------------------------
> 
> 
> ## packages ##########################
> 
> library(dplyr)
> library(data.table)
> library(IRanges)
> library(microbenchmark)
> 
> ## prepare example dataset ###########
> 
> ## use Bert's example, with 2000 stations instead of 2
> d_df <- data.frame( station =
rep(rep(c("one","two"),c(5,4)), 1000L),
>  ??????????????????? from = as.integer(c(60,61,71,72,76,60,65,82,83)),
>  ??????????????????? to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
>  ??????????????????? record =
c("A","B","C","B","D","B","B","D","E"),
>  ??????????????????? stringsAsFactors = FALSE)
> stations <- rle(d_df$station)
> stations$value <- gsub(
>  ? " ", "0",
>  ? paste0("station", format(1:length(stations$value), width =
6)))
> d_df$station <- rep(stations$value, stations$lengths)
> 
> ## prepare tibble and data.table versions
> d_tbl <- as_tibble(d_df)
> d_dt <- as.data.table(d_df)
> 
> ## solutions ##########################
> 
> ## Bert - by
> fun_bert <- function(d) {
>  ? out <- by(
>  ??? d, d$station, function(x) with(x, {
>  ????? i <- to - from +1
>  ????? data.frame(record =rep(record,i),
>  ???????????????? year =sequence(i) -1 + rep(from,i),
>  ???????????????? stringsAsFactors = FALSE)
>  ??? }))
>  ? data.frame(station = rep(names(out), sapply(out,nrow)),
>  ???????????? do.call(rbind,out),
>  ???????????? row.names = NULL,
>  ???????????? stringsAsFactors = FALSE)
> }
> 
> ## Bill - transform
> fun_bill <- function(d) {
>  ? i <- rep(seq_len(nrow(d)), d$to-d$from+1)
>  ? j <- sequence(d$to-d$from+1)
>  ? transform(d[i,], year=from+j-1, from=NULL, to=NULL)
> }
> 
> ## Michael - IRanges
> fun_michael <- function(d) {
>  ? df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
>  ? expand(df, "year")
> }
> 
> ## Jim - dplyr
> fun_jim <- function(d) {
>  ? d %>%
>  ??? rowwise() %>%
>  ??? do(tibble(station = .$station,
>  ????????????? record = .$record,
>  ????????????? year = seq(.$from, .$to))
>  ??? )
> }
> 
> ## Martin - Map
> fun_martin <- function(d) {
>  ? d$year <- with(d, Map(seq, from, to))
>  ? res0 <- with(d, Map(data.frame,
>  ????????????????????? station=station,
>  ????????????????????? record=record,
>  ????????????????????? year=year,
>  ????????????????????? MoreArgs = list(stringsAsFactors = FALSE)))
>  ? do.call(rbind, unname(res0))
> }
> 
> ## Denes - simple data.table
> fun_denes <- function(d) {
>  ? out <- d[, .(year = from:to), by = .(station, from, record)]
>  ? out[, from := NULL]
> }
> 
> ## Check equality ################################
> all.equal(fun_bill(d_df), fun_bert(d_df),
>  ????????? check.attributes = FALSE)
> all.equal(fun_bill(d_df), fun_martin(d_df),
>  ????????? check.attributes = FALSE)
> all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
>  ????????? check.attributes = FALSE)
> all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
>  ????????? check.attributes = FALSE)
> # Be prepared: this solution is super slow
> all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
>  ????????? check.attributes = FALSE)
> 
> ## Benchmark #####################################
> 
> ## Martin
> print(system.time(fun_martin(d_df)))
> 
> ## Bert
> print(system.time(fun_bert(d_df)))
> 
> ## Top 3
> print(
>  ? microbenchmark(
>  ??? fun_bill(d_df),
>  ??? fun_michael(d_df),
>  ??? fun_denes(d_dt),
>  ??? times = 100L
>  ? )
> )
> 
> 
> -------------------------
> 
> On 11/28/2017 06:49 PM, Michael Lawrence wrote:
>> Or with the Bioconductor IRanges package:
>>
>> df <- with(input, DataFrame(station, year=IRanges(from, to),
record))
>> expand(df, "year")
>>
>> DataFrame with 24 rows and 3 columns
>> ???????? station???? year????? record
>> ???? <character> <integer> <character>
>> 1?????? 07EA001????? 1960???????? QMS
>> 2?????? 07EA001????? 1961???????? QMC
>> 3?????? 07EA001????? 1962???????? QMC
>> 4?????? 07EA001????? 1963???????? QMC
>> 5?????? 07EA001????? 1964???????? QMC
>> ...???????? ...?????? ...???????? ...
>> 20????? 07EA001????? 1979???????? QRC
>> 21????? 07EA001????? 1980???????? QRC
>> 22????? 07EA001????? 1981???????? QRC
>> 23????? 07EA001????? 1982???????? QRC
>> 24????? 07EA001????? 1983???????? QRC
>>
>> If you tell the computer more about your data, it can do more things
for
>> you.
>>
>> Michael
>>
>> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan <
>> martin.morgan at roswellpark.org> wrote:
>>
>>> On 11/26/2017 08:42 PM, jim holtman wrote:
>>>
>>>> try this:
>>>>
>>>> ##########################################
>>>>
>>>> library(dplyr)
>>>>
>>>> input <- tribble(
>>>> ??? ~station, ~from, ~to, ~record,
>>>> ?? "07EA001" ,??? 1960? ,? 1960? , "QMS",
>>>> ?? "07EA001"? ,?? 1961 ,?? 1970? , "QMC",
>>>> ?? "07EA001" ,??? 1971? ,? 1971? , "QMM",
>>>> ?? "07EA001" ,??? 1972? ,? 1976? , "QMC",
>>>> ?? "07EA001" ,??? 1977? ,? 1983? , "QRC"
>>>> )
>>>>
>>>> result <- input %>%
>>>> ??? rowwise() %>%
>>>> ??? do(tibble(station = .$station,
>>>> ????????????? year = seq(.$from, .$to),
>>>> ????????????? record = .$record)
>>>> ??? )
>>>>
>>>> ###########################
>>>>
>>>
>>> In a bit more 'base R' mode I did
>>>
>>> ?? input$year <- with(input, Map(seq, from, to))
>>> ?? res0 <- with(input, Map(data.frame, station=station,
year=year,
>>> ?????? record=record))
>>> ??? as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>>
>>> resulting in
>>>
>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>> ??? station? year record
>>> ???? <fctr> <int> <fctr>
>>> ? 1 07EA001? 1960??? QMS
>>> ? 2 07EA001? 1961??? QMC
>>> ? 3 07EA001? 1962??? QMC
>>> ? 4 07EA001? 1963??? QMC
>>> ? 5 07EA001? 1964??? QMC
>>> ? 6 07EA001? 1965??? QMC
>>> ? 7 07EA001? 1966??? QMC
>>> ? 8 07EA001? 1967??? QMC
>>> ? 9 07EA001? 1968??? QMC
>>> 10 07EA001? 1969??? QMC
>>> # ... with 14 more rows
>>>
>>> I though I should have been able to use `tibble` in the second
step, but
>>> that leads to a (cryptic) error
>>>
>>>> res0 <- with(input, Map(tibble, station=station, year=year,
>>> record=record))Error in captureDots(strict = `__quosured`) :
>>> ?? the argument has already been evaluated
>>>
>>> The 'station' and 'record' columns are factors, so
different from the
>>> original input, but this seems the appropriate data type for theses
>>> columns.
>>>
>>> It's interesting to compare the 'specialized' knowledge
needed for each
>>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(),
maybe
>>> rbind() and Map() for base R.
>>>
>>> Martin
>>>
>>>
>>>
>>>>
>>>>
>>>> Jim Holtman
>>>> Data Munger Guru
>>>>
>>>> What is the problem that you are trying to solve?
>>>> Tell me what you want to do, not how you want to do it.
>>>>
>>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567
at gmail.com>
>>>> wrote:
>>>>
>>>> To David W.'s point about lack of a suitable reprex
("reproducible
>>>>> example"), Bill's solution seems to be for only
one station.
>>>>>
>>>>> Here is a reprex and modification that I think does what
was requested
>>>>> for
>>>>> multiple stations, again using base R and data frames, not
dplyr and
>>>>> tibbles.
>>>>>
>>>>> First the reprex with **two** stations:
>>>>>
>>>>> d <- data.frame( station =
rep(c("one","two"),c(5,4)),
>>>>>>
>>>>> ???????????????? from = c(60,61,71,72,76,60,65,82,83),
>>>>> ????????????????? to = c(60,70,71,76,83,64, 81, 82,83),
>>>>> ????????????????? record =
c("A","B","C","B","D","B","B","D","E"))
>>>>>
>>>>> d
>>>>>>
>>>>> ??? station from to record
>>>>> 1???? one?? 60 60????? A
>>>>> 2???? one?? 61 70????? B
>>>>> 3???? one?? 71 71????? C
>>>>> 4???? one?? 72 76????? B
>>>>> 5???? one?? 76 83????? D
>>>>> 6???? two?? 60 64????? B
>>>>> 7???? two?? 65 81????? B
>>>>> 8???? two?? 82 82????? D
>>>>> 9???? two?? 83 83????? E
>>>>>
>>>>> ## Now the conversion code using base R, especially by():
>>>>>
>>>>> out <- by(d, d$station, function(x) with(x, {
>>>>>>
>>>>> +??? i <- to - from +1
>>>>> +??? data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD 
>>>>> =rep(record,i))
>>>>> + }))
>>>>>
>>>>>
>>>>> out <- data.frame(station >>>>>>
>>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out),
row.names = NULL)
>>>>>
>>>>>
>>>>> out
>>>>>>
>>>>> ???? station YEAR RECORD
>>>>> 1????? one?? 60????? A
>>>>> 2????? one?? 61????? B
>>>>> 3????? one?? 62????? B
>>>>> 4????? one?? 63????? B
>>>>> 5????? one?? 64????? B
>>>>> 6????? one?? 65????? B
>>>>> 7????? one?? 66????? B
>>>>> 8????? one?? 67????? B
>>>>> 9????? one?? 68????? B
>>>>> 10???? one?? 69????? B
>>>>> 11???? one?? 70????? B
>>>>> 12???? one?? 71????? C
>>>>> 13???? one?? 72????? B
>>>>> 14???? one?? 73????? B
>>>>> 15???? one?? 74????? B
>>>>> 16???? one?? 75????? B
>>>>> 17???? one?? 76????? B
>>>>> 18???? one?? 76????? D
>>>>> 19???? one?? 77????? D
>>>>> 20???? one?? 78????? D
>>>>> 21???? one?? 79????? D
>>>>> 22???? one?? 80????? D
>>>>> 23???? one?? 81????? D
>>>>> 24???? one?? 82????? D
>>>>> 25???? one?? 83????? D
>>>>> 26???? two?? 60????? B
>>>>> 27???? two?? 61????? B
>>>>> 28???? two?? 62????? B
>>>>> 29???? two?? 63????? B
>>>>> 30???? two?? 64????? B
>>>>> 31???? two?? 65????? B
>>>>> 32???? two?? 66????? B
>>>>> 33???? two?? 67????? B
>>>>> 34???? two?? 68????? B
>>>>> 35???? two?? 69????? B
>>>>> 36???? two?? 70????? B
>>>>> 37???? two?? 71????? B
>>>>> 38???? two?? 72????? B
>>>>> 39???? two?? 73????? B
>>>>> 40???? two?? 74????? B
>>>>> 41???? two?? 75????? B
>>>>> 42???? two?? 76????? B
>>>>> 43???? two?? 77????? B
>>>>> 44???? two?? 78????? B
>>>>> 45???? two?? 79????? B
>>>>> 46???? two?? 80????? B
>>>>> 47???? two?? 81????? B
>>>>> 48???? two?? 82????? D
>>>>> 49???? two?? 83????? E
>>>>>
>>>>> Cheers,
>>>>> Bert
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Bert Gunter
>>>>>
>>>>> "The trouble with having an open mind is that people
keep coming along
>>>>> and
>>>>> sticking things into it."
>>>>> -- Opus (aka Berkeley Breathed in his "Bloom
County" comic strip )
>>>>>
>>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help
<
>>>>> r-help at r-project.org> wrote:
>>>>>
>>>>> dplyr may have something for this, but in base R I think
the following
>>>>>>
>>>>> does
>>>>>
>>>>>> what you want.? I've shortened the name of your
data set to 'd'.
>>>>>>
>>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1)
>>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1)
>>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL,
YEAR_TO=NULL)
>>>>>>
>>>>>>
>>>>>> Bill Dunlap
>>>>>> TIBCO Software
>>>>>> wdunlap tibco.com
>>>>>>
>>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David
(EC) <
>>>>>> david.hutchinson at canada.ca> wrote:
>>>>>>
>>>>>> I have a returned tibble of station operational record
similar to the
>>>>>>> following:
>>>>>>>
>>>>>>> data.collection
>>>>>>>>
>>>>>>> # A tibble: 5 x 4
>>>>>>> ??? STATION_NUMBER YEAR_FROM YEAR_TO RECORD
>>>>>>> ???????????? <chr>???? <int>??
<int>? <chr>
>>>>>>> 1??????? 07EA001????? 1960??? 1960??? QMS
>>>>>>> 2??????? 07EA001????? 1961??? 1970??? QMC
>>>>>>> 3??????? 07EA001????? 1971??? 1971??? QMM
>>>>>>> 4??????? 07EA001????? 1972??? 1976??? QMC
>>>>>>> 5??????? 07EA001????? 1977??? 1983??? QRC
>>>>>>>
>>>>>>> I would like to reshape this to one operational
record (row) per
>>>>>>> year
>>>>>>>
>>>>>> per
>>>>>
>>>>>> station. Something like:
>>>>>>>
>>>>>>> 07EA001????????????? 1960????? QMS
>>>>>>> 07EA001????????????? 1961????? QMC
>>>>>>> 07EA001????????????? 1962????? QMC
>>>>>>> 07EA001????????????? 1963????? QMC
>>>>>>> ...
>>>>>>> 07EA001????????????? 1971????? QMM
>>>>>>>
>>>>>>> Can this be done in dplyr easily?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> ????????? [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
http://www.R-project.org/
>>>>>>> posting-guide.html
>>>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>>>>>
>>>>>>>
>>>>>> ????????? [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE
and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
http://www.R-project.org/
>>>>>> posting-guide.html
>>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>>>>
>>>>>>
>>>>> ????????? [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>> posting-guide.html
>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>>>
>>>>>
>>>> ???????? [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>>> ng-guide.html
>>>> and provide commented, minimal, self-contained, reproducible
code.
>>>>
>>>>
>>>
>>> This email message may contain legally privileged
and/or...{{dropped:2}}
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>> ng-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> ????[[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 

This email message may contain legally privileged and/or...{{dropped:2}}

Apparently Analagous Threads

Search for more reasonably related threads

R help - Nov 2017 - dplyr - add/expand rows

[R] dplyr - add/expand rows

[R] dplyr - add/expand rows

[R] dplyr - add/expand rows

Apparently Analagous Threads