thr3ads.net - R help - [R] Processing key_column, begin_date, end

If this information is useful, please help other people find it:
Share via:

Matt Gross

2015-Feb-25 21:18 UTC

[R] Processing key_column, begin_date, end_date in R

Hi,

I am trying to process a large dataset in R.  The dataset contains the
following three columns:

key_column - a unique key identifier
begin_date - the start date of the active period
end_date - the end date of the active period


Example data is here:

key_column,begin_date,end_date
123456,2013-01-01,2014-01-01
123456,2013-07-01,2014-07-01
789102,2012-03-01,2014-03-01
789102,2015-02-01,2016-02-01
789102,2015-02-06,2016-02-06

I want to build a condensed table of key_column and begin_date's and
end_date's.  As you can see in the example data above, some begin and end
date periods overlap with begin_date and end_date pairs for the same
key_column.  In situations where overlap exists I want to have one record
for the key_column with the min(begin_date) and the max(end_date).

Can anyone help me build the commands to process this data in R?

Thanks,
Matt

-- 
Matt Gross
grossm at gmail.com
503.329.4545

	[[alternative HTML version deleted]]

JS Huang

2015-Feb-26 04:18 UTC

head link

[R] Processing key_column, begin_date, end_date in R

Hi,
  
  Here is an implemenation:
> date  key_column begin_date    end_date
1     123456 2013-01-01 2014-01-01 
2     123456 2013-07-01 2014-07-01 
3     789102 2012-03-01 2014-03-01 
4     789102 2015-02-01 2016-02-01 
5     789102 2015-02-06  2016-02-06> y <- t(sapply(unique(date$key_column),function(x)
>
c(x,min(as.character(date[date$key_column==x,"begin_date"])),max(as.character(date[date$key_column==x,"end_date"])))))
> y     [,1]     [,2]         [,3]         
[1,] "123456" "2013-01-01" "2014-07-01 "
[2,] "789102" "2012-03-01" "2016-02-06"
> colnames(y)
NULL> colnames(y) <-
c("key_column","begin_date","end_date")
> y     key_column begin_date   end_date     
[1,] "123456"   "2013-01-01" "2014-07-01 "
[2,] "789102"   "2012-03-01" "2016-02-06" 



--
View this message in context:
http://r.789695.n4.nabble.com/Processing-key-column-begin-date-end-date-in-R-tp4703835p4703850.html
Sent from the R help mailing list archive at Nabble.com.

JS Huang

2015-Feb-26 05:15 UTC

head link

[R] Processing key_column, begin_date, end_date in R

Hi,

  It's not as easy as I originally thought.  Here is a revision with the
function beginEnd to get it done.
> date  key_column begin_date    end_date
1     123456 2013-01-01 2014-01-01 
2     123456 2013-07-01 2014-07-01 
3     789102 2012-03-01 2014-03-01 
4     789102 2015-02-01 2016-02-01 
5     789102 2015-02-06  2016-02-06> beginEndfunction()
{
  date[order(date$key_column,date$begin_date),]
  key <- numeric(0)
  begin <- character(0)
  end <- character(0)
  currentKey <- as.numeric(date$key_column[1])
  key <- c(key, currentKey)
  currentBegin <- as.character(date$begin_date[1])
  begin <- c(begin, currentBegin)
  currentEnd <- as.character(date$end_date[1])
  for (i in 2:length(date$key_column))
  {
    if (currentKey == as.numeric(date$key_column[i]))
    {
      if (currentEnd >= as.character(date$begin_date[i]))
      {
        currentEnd <- max(currentEnd, as.character(date$end_date[i]))
      }
      else
      {
        end <- c(end, currentEnd)
        currentKey <- as.numeric(date$key_column[i])
        key <- c(key, currentKey)
        currentBegin <- as.character(date$begin_date[i])
        begin <- c(begin, currentBegin)
        currentEnd <- as.character(date$end_date[i])        
      }
      if (i == length(date$key_column))
      {
        end <- c(end, currentEnd)
      }
    }
    else
    {
      end <- c(end, currentEnd)
      currentKey <- as.numeric(date$key_column[i])
      key <- c(key, currentKey)
      currentBegin <- as.character(date$begin_date[i])
      begin <- c(begin, currentBegin)
      currentEnd <- as.character(date$end_date[i])
      if (i == length(date$key_column))
      {
        end <- list(end, currentEnd)
      }
    }
  }
  result <- cbind(key, begin, end)
  colnames(result) <-
c("key.column","begin.date","end.date")
  return(result)
}> beginEnd()     key.column begin.date   end.date     
[1,] "123456"   "2013-01-01" "2014-07-01 "
[2,] "789102"   "2012-03-01" "2014-03-01 "
[3,] "789102"   "2015-02-01" "2016-02-06" 




--
View this message in context:
http://r.789695.n4.nabble.com/Processing-key-column-begin-date-end-date-in-R-tp4703835p4703852.html
Sent from the R help mailing list archive at Nabble.com.

Jeff Newmiller

2015-Feb-26 14:30 UTC

head link

[R] Processing key_column, begin_date, end_date in R

Here is another way. Have not tested for large scale efficiency, but if you
convert dta to a data.table that might improve things.

library(dplyr)
dta <- read.csv( text"key_column,begin_date,end_date
123456,2013-01-01,2014-01-01
123456,2013-07-01,2014-07-01
789102,2012-03-01,2014-03-01
789102,2015-02-01,2016-02-01
789102,2015-02-06,2016-02-06
789102,2015-02-28,2015-03-31
789102,2015-04-30,2015-05-31
", as.is=TRUE)
( dta
%>% mutate( begin_date = as.Date( begin_date ),
end_date = as.Date( end_date ) )
%>% arrange( key_column, begin_date )
) -> dta

mkgp <- function( begin_date, cend ) {
  ix <- c( TRUE, cend[ -length( begin_date ) ] < begin_date[ -1 ] )
  cumsum( ix )
}

result <- ( dta
          %>% group_by( key_column )
          %>% mutate( cend = as.Date( cummax( as.numeric( end_date ) )
                                    , origin="1970-01-01" )
                      , gp = mkgp( begin_date, cend )
                      )
          %>% ungroup
          %>% group_by( key_column, gp )
          %>% summarise(  begin_date = begin_date[ 1 ]
                        , end_date = cend[ length( cend ) ]
                        )
          %>% ungroup
          %>% select( -gp )
          %>% as.data.frame
          )
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On February 25, 2015 1:18:58 PM PST, Matt Gross <grossm at gmail.com>
wrote:>Hi,
>
>I am trying to process a large dataset in R.  The dataset contains the
>following three columns:
>
>key_column - a unique key identifier
>begin_date - the start date of the active period
>end_date - the end date of the active period
>
>
>Example data is here:
>
>key_column,begin_date,end_date
>123456,2013-01-01,2014-01-01
>123456,2013-07-01,2014-07-01
>789102,2012-03-01,2014-03-01
>789102,2015-02-01,2016-02-01
>789102,2015-02-06,2016-02-06
>
>I want to build a condensed table of key_column and begin_date's and
>end_date's.  As you can see in the example data above, some begin and
>end
>date periods overlap with begin_date and end_date pairs for the same
>key_column.  In situations where overlap exists I want to have one
>record
>for the key_column with the min(begin_date) and the max(end_date).
>
>Can anyone help me build the commands to process this data in R?
>
>Thanks,
>Matt

jim holtman

2015-Feb-26 15:23 UTC

head link

[R] Processing key_column, begin_date, end_date in R

here is yet another way:
> dta <- read.csv( text+ "key_column,begin_date,end_date+  123456,2013-01-01,2014-01-01
+  123456,2013-07-01,2014-07-01
+  789102,2012-03-01,2014-03-01
+  789102,2015-02-01,2016-02-01
+  789102,2015-02-06,2016-02-06
+ 789102,2015-02-28,2015-03-31
+  789102,2015-04-30,2015-05-31"
+  , as.is=TRUE)>
> # check for overlap by sorting into time order and the adding 1 for
> # begin and -1 for end and create cumsum
> # select only resulting entries with begin @ 1 and end @ 0
> dta <- dta %>%+     mutate(begin_date = as.Date(begin_date)  # convert the times
+         , end_date = as.Date(end_date)
+         ) %>%
+     gather(time, value, -key_column) %>%  # create 'long' data
+     mutate(oper = ifelse(grepl('^b', time), 1, -1)) %>%  # value
for begin/end
+     arrange(value) %>%  # sort by time
+     group_by(key_column) %>%  # separate into groups
+     mutate(depth = cumsum(oper)) %>%
+     filter((grepl("^b", time) & depth == 1) |  # filter on begin
at 1 and end at 0
+           (grepl("^e", time) & depth == 0)
+           ) %>%
+     ungroup() %>%
+     arrange(key_column, value)> # now have pairs of lines for the times
> indx <- seq(1, nrow(dta), 2)
> result <- data.frame(key_column = dta$key_column[indx]+                 , begin_time = dta$value[indx]
+                 , end_time = dta$value[indx + 1L]
+                 , stringsAsFactors = FALSE
+                 )> result  key_column begin_time   end_time
1     123456 2013-01-01 2014-07-01
2     789102 2012-03-01 2014-03-01
3     789102 2015-02-01 2016-02-06>
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Wed, Feb 25, 2015 at 4:18 PM, Matt Gross <grossm at gmail.com>
wrote:> Hi,
>
> I am trying to process a large dataset in R.  The dataset contains the
> following three columns:
>
> key_column - a unique key identifier
> begin_date - the start date of the active period
> end_date - the end date of the active period
>
>
> Example data is here:
>
> key_column,begin_date,end_date
> 123456,2013-01-01,2014-01-01
> 123456,2013-07-01,2014-07-01
> 789102,2012-03-01,2014-03-01
> 789102,2015-02-01,2016-02-01
> 789102,2015-02-06,2016-02-06
>
> I want to build a condensed table of key_column and begin_date's and
> end_date's.  As you can see in the example data above, some begin and
end
> date periods overlap with begin_date and end_date pairs for the same
> key_column.  In situations where overlap exists I want to have one record
> for the key_column with the min(begin_date) and the max(end_date).
>
> Can anyone help me build the commands to process this data in R?
>
> Thanks,
> Matt
>
> --
> Matt Gross
> grossm at gmail.com
> 503.329.4545
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

R help - Feb 2015 - Processing key_column, begin_date, end_date in R

[R] Processing key_column, begin_date, end_date in R

[R] Processing key_column, begin_date, end_date in R

[R] Processing key_column, begin_date, end_date in R

[R] Processing key_column, begin_date, end_date in R

[R] Processing key_column, begin_date, end_date in R