Hi, I am trying to process a large dataset in R. The dataset contains the following three columns: key_column - a unique key identifier begin_date - the start date of the active period end_date - the end date of the active period Example data is here: key_column,begin_date,end_date 123456,2013-01-01,2014-01-01 123456,2013-07-01,2014-07-01 789102,2012-03-01,2014-03-01 789102,2015-02-01,2016-02-01 789102,2015-02-06,2016-02-06 I want to build a condensed table of key_column and begin_date's and end_date's. As you can see in the example data above, some begin and end date periods overlap with begin_date and end_date pairs for the same key_column. In situations where overlap exists I want to have one record for the key_column with the min(begin_date) and the max(end_date). Can anyone help me build the commands to process this data in R? Thanks, Matt -- Matt Gross grossm at gmail.com 503.329.4545 [[alternative HTML version deleted]]
Hi, Here is an implemenation:> datekey_column begin_date end_date 1 123456 2013-01-01 2014-01-01 2 123456 2013-07-01 2014-07-01 3 789102 2012-03-01 2014-03-01 4 789102 2015-02-01 2016-02-01 5 789102 2015-02-06 2016-02-06> y <- t(sapply(unique(date$key_column),function(x) > c(x,min(as.character(date[date$key_column==x,"begin_date"])),max(as.character(date[date$key_column==x,"end_date"]))))) > y[,1] [,2] [,3] [1,] "123456" "2013-01-01" "2014-07-01 " [2,] "789102" "2012-03-01" "2016-02-06"> colnames(y)NULL> colnames(y) <- c("key_column","begin_date","end_date") > ykey_column begin_date end_date [1,] "123456" "2013-01-01" "2014-07-01 " [2,] "789102" "2012-03-01" "2016-02-06" -- View this message in context: http://r.789695.n4.nabble.com/Processing-key-column-begin-date-end-date-in-R-tp4703835p4703850.html Sent from the R help mailing list archive at Nabble.com.
Hi, It's not as easy as I originally thought. Here is a revision with the function beginEnd to get it done.> datekey_column begin_date end_date 1 123456 2013-01-01 2014-01-01 2 123456 2013-07-01 2014-07-01 3 789102 2012-03-01 2014-03-01 4 789102 2015-02-01 2016-02-01 5 789102 2015-02-06 2016-02-06> beginEndfunction() { date[order(date$key_column,date$begin_date),] key <- numeric(0) begin <- character(0) end <- character(0) currentKey <- as.numeric(date$key_column[1]) key <- c(key, currentKey) currentBegin <- as.character(date$begin_date[1]) begin <- c(begin, currentBegin) currentEnd <- as.character(date$end_date[1]) for (i in 2:length(date$key_column)) { if (currentKey == as.numeric(date$key_column[i])) { if (currentEnd >= as.character(date$begin_date[i])) { currentEnd <- max(currentEnd, as.character(date$end_date[i])) } else { end <- c(end, currentEnd) currentKey <- as.numeric(date$key_column[i]) key <- c(key, currentKey) currentBegin <- as.character(date$begin_date[i]) begin <- c(begin, currentBegin) currentEnd <- as.character(date$end_date[i]) } if (i == length(date$key_column)) { end <- c(end, currentEnd) } } else { end <- c(end, currentEnd) currentKey <- as.numeric(date$key_column[i]) key <- c(key, currentKey) currentBegin <- as.character(date$begin_date[i]) begin <- c(begin, currentBegin) currentEnd <- as.character(date$end_date[i]) if (i == length(date$key_column)) { end <- list(end, currentEnd) } } } result <- cbind(key, begin, end) colnames(result) <- c("key.column","begin.date","end.date") return(result) }> beginEnd()key.column begin.date end.date [1,] "123456" "2013-01-01" "2014-07-01 " [2,] "789102" "2012-03-01" "2014-03-01 " [3,] "789102" "2015-02-01" "2016-02-06" -- View this message in context: http://r.789695.n4.nabble.com/Processing-key-column-begin-date-end-date-in-R-tp4703835p4703852.html Sent from the R help mailing list archive at Nabble.com.
Here is another way. Have not tested for large scale efficiency, but if you convert dta to a data.table that might improve things. library(dplyr) dta <- read.csv( text"key_column,begin_date,end_date 123456,2013-01-01,2014-01-01 123456,2013-07-01,2014-07-01 789102,2012-03-01,2014-03-01 789102,2015-02-01,2016-02-01 789102,2015-02-06,2016-02-06 789102,2015-02-28,2015-03-31 789102,2015-04-30,2015-05-31 ", as.is=TRUE) ( dta %>% mutate( begin_date = as.Date( begin_date ), end_date = as.Date( end_date ) ) %>% arrange( key_column, begin_date ) ) -> dta mkgp <- function( begin_date, cend ) { ix <- c( TRUE, cend[ -length( begin_date ) ] < begin_date[ -1 ] ) cumsum( ix ) } result <- ( dta %>% group_by( key_column ) %>% mutate( cend = as.Date( cummax( as.numeric( end_date ) ) , origin="1970-01-01" ) , gp = mkgp( begin_date, cend ) ) %>% ungroup %>% group_by( key_column, gp ) %>% summarise( begin_date = begin_date[ 1 ] , end_date = cend[ length( cend ) ] ) %>% ungroup %>% select( -gp ) %>% as.data.frame ) --------------------------------------------------------------------------- Jeff Newmiller The ..... ..... Go Live... DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/Batteries O.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --------------------------------------------------------------------------- Sent from my phone. Please excuse my brevity. On February 25, 2015 1:18:58 PM PST, Matt Gross <grossm at gmail.com> wrote:>Hi, > >I am trying to process a large dataset in R. The dataset contains the >following three columns: > >key_column - a unique key identifier >begin_date - the start date of the active period >end_date - the end date of the active period > > >Example data is here: > >key_column,begin_date,end_date >123456,2013-01-01,2014-01-01 >123456,2013-07-01,2014-07-01 >789102,2012-03-01,2014-03-01 >789102,2015-02-01,2016-02-01 >789102,2015-02-06,2016-02-06 > >I want to build a condensed table of key_column and begin_date's and >end_date's. As you can see in the example data above, some begin and >end >date periods overlap with begin_date and end_date pairs for the same >key_column. In situations where overlap exists I want to have one >record >for the key_column with the min(begin_date) and the max(end_date). > >Can anyone help me build the commands to process this data in R? > >Thanks, >Matt
here is yet another way:> dta <- read.csv( text+ "key_column,begin_date,end_date+ 123456,2013-01-01,2014-01-01 + 123456,2013-07-01,2014-07-01 + 789102,2012-03-01,2014-03-01 + 789102,2015-02-01,2016-02-01 + 789102,2015-02-06,2016-02-06 + 789102,2015-02-28,2015-03-31 + 789102,2015-04-30,2015-05-31" + , as.is=TRUE)> > # check for overlap by sorting into time order and the adding 1 for > # begin and -1 for end and create cumsum > # select only resulting entries with begin @ 1 and end @ 0 > dta <- dta %>%+ mutate(begin_date = as.Date(begin_date) # convert the times + , end_date = as.Date(end_date) + ) %>% + gather(time, value, -key_column) %>% # create 'long' data + mutate(oper = ifelse(grepl('^b', time), 1, -1)) %>% # value for begin/end + arrange(value) %>% # sort by time + group_by(key_column) %>% # separate into groups + mutate(depth = cumsum(oper)) %>% + filter((grepl("^b", time) & depth == 1) | # filter on begin at 1 and end at 0 + (grepl("^e", time) & depth == 0) + ) %>% + ungroup() %>% + arrange(key_column, value)> # now have pairs of lines for the times > indx <- seq(1, nrow(dta), 2) > result <- data.frame(key_column = dta$key_column[indx]+ , begin_time = dta$value[indx] + , end_time = dta$value[indx + 1L] + , stringsAsFactors = FALSE + )> resultkey_column begin_time end_time 1 123456 2013-01-01 2014-07-01 2 789102 2012-03-01 2014-03-01 3 789102 2015-02-01 2016-02-06>Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Wed, Feb 25, 2015 at 4:18 PM, Matt Gross <grossm at gmail.com> wrote:> Hi, > > I am trying to process a large dataset in R. The dataset contains the > following three columns: > > key_column - a unique key identifier > begin_date - the start date of the active period > end_date - the end date of the active period > > > Example data is here: > > key_column,begin_date,end_date > 123456,2013-01-01,2014-01-01 > 123456,2013-07-01,2014-07-01 > 789102,2012-03-01,2014-03-01 > 789102,2015-02-01,2016-02-01 > 789102,2015-02-06,2016-02-06 > > I want to build a condensed table of key_column and begin_date's and > end_date's. As you can see in the example data above, some begin and end > date periods overlap with begin_date and end_date pairs for the same > key_column. In situations where overlap exists I want to have one record > for the key_column with the min(begin_date) and the max(end_date). > > Can anyone help me build the commands to process this data in R? > > Thanks, > Matt > > -- > Matt Gross > grossm at gmail.com > 503.329.4545 > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.