Hi All, I've come up with a solution for this problem that relies on a for loop, and I was wondering if anybody had any insight into a more elegant method: I have two data frames, each has a column for categorical data and a column for date. What I'd like to do, ideally, is calculate the number of days between all pairs of dates in data frame 1 and data frame 2 (*but only for members of the same category*). The number of members of each category varies between the two data frames. For example:> d <- seq(as.Date("2000-02-12"), as.Date("2009-08-18"), by="weeks")> df1 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE))> df2 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE))> df1A date category 1 93 2004-02-28 3 2 105 2001-03-17 3 3 189 2009-07-04 2 4 130 2003-07-05 2 5 160 2005-09-24 2 6 32 2004-11-06 2 7 117 2007-03-17 1 8 161 2003-07-19 4 9 153 2001-09-15 3 10 173 2005-08-27 1> df2A date category 1 102 2006-08-19 3 2 68 2004-11-27 2 3 137 2003-01-11 1 4 39 2002-12-28 2 5 127 2004-03-06 4 6 125 2002-02-23 2 7 150 2002-05-18 4 8 19 2003-02-22 1 9 80 2000-08-05 1 10 94 2003-12-27 1 Within a loop, I'd do the following (i is my counter; for the example, I set it to 1):> i<-1# Create the data frames:> yeari_1 <- df1[which(df1['category']==i),]; yeari_2 <- df2[which(df2['category']==i),]# Select only the data from category i> yeari_1A date category 7 117 2007-03-17 1 10 173 2005-08-27 1> yeari_2A date category 3 137 2003-01-11 1 8 19 2003-02-22 1 9 80 2000-08-05 1 10 94 2003-12-27 1 # Convert dates to integers year1_i[[2]] <- as.integer(as.Date(yeari_1[[2]])); yeari_2[[2]] <- as.integer(as.Date(yeari_2[[2]]));> yeari_1A date category 7 117 13589 1 10 173 13022 1> yeari_2A date category 3 137 12063 1 8 19 12105 1 9 80 11174 1 10 94 12413 1 # Get differences of all pairs:> result <- outer(yeari_1[[2]],yeari_2[[2]],'-') > result[,1] [,2] [,3] [,4] [1,] 1526 1484 2415 1176 [2,] 959 917 1848 609 # Now, merge the results with the results from all the earlier iterations for previous values of i, increment i to the next value, and repeat. ---- Ideally, I could accomplish this in some sort of vectorized manner, although the Force is not yet strong with me. Any ideas would be appreciated! Regards, Jonathan [[alternative HTML version deleted]]
Jonathan- When you provide an example that uses sample() or random number generators, it's a good idea to call set.seed() before generating the data, so that others can reproduce it. But to answer your question: mapply(function(d1,d2)outer(d1$date,d2$date,'-'), split(df1,df1$category), split(df2,df2$category)) will return a list with the information you want. - Phil Spector Statistical Computing Facility Department of Statistics UC Berkeley spector at stat.berkeley.edu On Fri, 14 May 2010, Jonathan wrote:> Hi All, > I've come up with a solution for this problem that relies on a for loop, > and I was wondering if anybody had any insight into a more elegant method: > > I have two data frames, each has a column for categorical data and a column > for date. What I'd like to do, ideally, is calculate the number of days > between all pairs of dates in data frame 1 and data frame 2 (*but only for > members of the same category*). The number of members of each category > varies between the two data frames. > > For example: > > >> d <- seq(as.Date("2000-02-12"), as.Date("2009-08-18"), by="weeks") > >> df1 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE)) > >> df2 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE)) > > >> df1 > A date category > 1 93 2004-02-28 3 > 2 105 2001-03-17 3 > 3 189 2009-07-04 2 > 4 130 2003-07-05 2 > 5 160 2005-09-24 2 > 6 32 2004-11-06 2 > 7 117 2007-03-17 1 > 8 161 2003-07-19 4 > 9 153 2001-09-15 3 > 10 173 2005-08-27 1 > > >> df2 > A date category > 1 102 2006-08-19 3 > 2 68 2004-11-27 2 > 3 137 2003-01-11 1 > 4 39 2002-12-28 2 > 5 127 2004-03-06 4 > 6 125 2002-02-23 2 > 7 150 2002-05-18 4 > 8 19 2003-02-22 1 > 9 80 2000-08-05 1 > 10 94 2003-12-27 1 > > > Within a loop, I'd do the following (i is my counter; for the example, > I set it to 1): > > >> i<-1 > > # Create the data frames: > >> yeari_1 <- df1[which(df1['category']==i),]; yeari_2 <- df2[which(df2['category']==i),] > > # Select only the data from category i > >> yeari_1 > A date category > 7 117 2007-03-17 1 > 10 173 2005-08-27 1 > >> yeari_2 > A date category > 3 137 2003-01-11 1 > 8 19 2003-02-22 1 > 9 80 2000-08-05 1 > 10 94 2003-12-27 1 > > # Convert dates to integers > > year1_i[[2]] <- as.integer(as.Date(yeari_1[[2]])); yeari_2[[2]] <- > as.integer(as.Date(yeari_2[[2]])); > >> yeari_1 > A date category > 7 117 13589 1 > 10 173 13022 1 >> yeari_2 > A date category > 3 137 12063 1 > 8 19 12105 1 > 9 80 11174 1 > 10 94 12413 1 > > # Get differences of all pairs: > >> result <- outer(yeari_1[[2]],yeari_2[[2]],'-') >> result > [,1] [,2] [,3] [,4] > [1,] 1526 1484 2415 1176 > [2,] 959 917 1848 609 > > # Now, merge the results with the results from all the earlier > iterations for previous values of i, increment i to the next value, > and repeat. > > > ---- > > Ideally, I could accomplish this in some sort of vectorized manner, > although the Force is not yet strong with me. Any ideas would be > appreciated! > > > Regards, > > Jonathan > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Gabor Grothendieck
2010-May-14 22:38 UTC
[R] operations between two aggregated data frames?
Generating df1 and df2 as in your post try this (and see http://sqldf.googlecode.com for more info):> library(sqldf) > out <- sqldf("select category,+ df1.date date1, + df2.date date2, + df1.date - df2.date datediff + from df1 join df2 using(category) + order by category, date1, date2")> > out[[2]] <- as.Date(out[[2]], origin = "1970-01-01") > out[[3]] <- as.Date(out[[3]], origin = "1970-01-01") > outcategory date1 date2 datediff 1 1 2003-02-08 2003-05-17 -98 2 1 2003-02-08 2003-11-29 -294 3 1 2008-08-16 2003-05-17 1918 4 1 2008-08-16 2003-11-29 1722 5 1 2008-11-29 2003-05-17 2023 6 1 2008-11-29 2003-11-29 1827 7 2 2001-12-01 2005-02-26 -1183 8 2 2001-12-01 2008-10-25 -2520 9 2 2002-01-12 2005-02-26 -1141 10 2 2002-01-12 2008-10-25 -2478 11 3 2003-09-13 2005-10-01 -749 12 3 2003-09-13 2007-08-18 -1435 13 3 2009-08-01 2005-10-01 1400 14 3 2009-08-01 2007-08-18 714 15 4 2000-11-04 2000-11-18 -14 16 4 2000-11-04 2003-05-10 -917 17 4 2000-11-04 2003-07-26 -994 18 4 2000-11-04 2008-11-22 -2940 19 4 2004-03-13 2000-11-18 1211 20 4 2004-03-13 2003-05-10 308 21 4 2004-03-13 2003-07-26 231 22 4 2004-03-13 2008-11-22 -1715 23 4 2007-06-02 2000-11-18 2387 24 4 2007-06-02 2003-05-10 1484 25 4 2007-06-02 2003-07-26 1407 26 4 2007-06-02 2008-11-22 -539 On Fri, May 14, 2010 at 6:13 PM, Jonathan <jonsleepy at gmail.com> wrote:> Hi All, > ? I've come up with a solution for this problem that relies on a for loop, > and I was wondering if anybody had any insight into a more elegant method: > > I have two data frames, each has a column for categorical data and a column > for date. ?What I'd like to do, ideally, is calculate the number of days > between all pairs of dates in data frame 1 and data frame 2 (*but only for > members of the same category*). ?The number of members of each category > varies between the two data frames. > > For example: > > >> d <- seq(as.Date("2000-02-12"), as.Date("2009-08-18"), by="weeks") > >> df1 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE)) > >> df2 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE)) > > >> df1 > ? ? A ? ? ? date category > 1 ? 93 2004-02-28 ? ? ? ?3 > 2 ?105 2001-03-17 ? ? ? ?3 > 3 ?189 2009-07-04 ? ? ? ?2 > 4 ?130 2003-07-05 ? ? ? ?2 > 5 ?160 2005-09-24 ? ? ? ?2 > 6 ? 32 2004-11-06 ? ? ? ?2 > 7 ?117 2007-03-17 ? ? ? ?1 > 8 ?161 2003-07-19 ? ? ? ?4 > 9 ?153 2001-09-15 ? ? ? ?3 > 10 173 2005-08-27 ? ? ? ?1 > > >> df2 > ? ? A ? ? ? date category > 1 ?102 2006-08-19 ? ? ? ?3 > 2 ? 68 2004-11-27 ? ? ? ?2 > 3 ?137 2003-01-11 ? ? ? ?1 > 4 ? 39 2002-12-28 ? ? ? ?2 > 5 ?127 2004-03-06 ? ? ? ?4 > 6 ?125 2002-02-23 ? ? ? ?2 > 7 ?150 2002-05-18 ? ? ? ?4 > 8 ? 19 2003-02-22 ? ? ? ?1 > 9 ? 80 2000-08-05 ? ? ? ?1 > 10 ?94 2003-12-27 ? ? ? ?1 > > > Within a loop, I'd do the following (i is my counter; for the example, > I set it to 1): > > >> i<-1 > > # Create the data frames: > >> yeari_1 <- df1[which(df1['category']==i),]; yeari_2 <- df2[which(df2['category']==i),] > > # Select only the data from category i > >> yeari_1 > ? ? A ? ? ? date category > 7 ?117 2007-03-17 ? ? ? ?1 > 10 173 2005-08-27 ? ? ? ?1 > >> yeari_2 > ? ? A ? ? ? date category > 3 ?137 2003-01-11 ? ? ? ?1 > 8 ? 19 2003-02-22 ? ? ? ?1 > 9 ? 80 2000-08-05 ? ? ? ?1 > 10 ?94 2003-12-27 ? ? ? ?1 > > # Convert dates to integers > > year1_i[[2]] <- as.integer(as.Date(yeari_1[[2]])); yeari_2[[2]] <- > as.integer(as.Date(yeari_2[[2]])); > >> yeari_1 > ? ? A ?date category > 7 ?117 13589 ? ? ? ?1 > 10 173 13022 ? ? ? ?1 >> yeari_2 > ? ? A ?date category > 3 ?137 12063 ? ? ? ?1 > 8 ? 19 12105 ? ? ? ?1 > 9 ? 80 11174 ? ? ? ?1 > 10 ?94 12413 ? ? ? ?1 > > # Get differences of all pairs: > >> result <- outer(yeari_1[[2]],yeari_2[[2]],'-') >> result > ? ? [,1] [,2] [,3] [,4] > [1,] 1526 1484 2415 1176 > [2,] ?959 ?917 1848 ?609 > > # Now, merge the results with the results from all the earlier > iterations for previous values of i, increment i to the next value, > and repeat. > > > ---- > > Ideally, I could accomplish this in some sort of vectorized manner, > although the Force is not yet strong with me. ?Any ideas would be > appreciated! > > > Regards, > > Jonathan > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Possibly Parallel Threads
- difference of two data frames
- merging or joining 2 dataframes: merge, rbind.fill, etc.?
- Best way/practice to create a new data frame from two given ones with last column computed from the two data frames?
- Help with loop
- rbind: inconsistent behaviour with empty data frames?