Wow.. thanks for the deluge of responses! Aggregate seems like the way to go here. But, suppose that instead of integers in column V2, I actually have dates (and instead of keeping the minimum integer, I want to keep the earliest date):> df = data.frame(V1=c(1,1,2,3,1,2),V2=c('2002-03-13','1989-03-10','1988-01-20','1997-05-15','1996-11-18','2000-01-12')); > aggregate(df$V2,df['V1'],min) #thanks, PhilError in Summary.factor(c(6L, 2L, 3L), na.rm = FALSE) : min not meaningful for factors So I checked:> min(df$V2)Error in Summary.factor(c(6L, 2L, 1L, 4L, 3L, 5L), na.rm = FALSE) : min not meaningful for factors This can be fixed with:> min(as.Date(df$V2))[1] "1988-01-20" However, I get a strange response when I try:> aggregate(as.Date(df$V2),df['V1'],min)V1 x 1 1 7008 2 2 6593 3 3 9996 What I'm looking for, of course, is:> aggregate(as.Date(df$V2),df['V1'],min)V1 V2 1 1 1996-11-18 2 2 1988-01-20 3 3 1997-05-15 Any idea how to produce the desired output? Thanks, Jonathan On Mon, Feb 8, 2010 at 12:11 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:> Here are 3 solutions assuming DF contains the data frame: > >> # 1. aggregate >> aggregate(DF[2], DF[1], min) > ?V1 V2 > 1 ?a ?2 > 2 ?b ?9 > 3 ?c ?4 > >> # 2. aggregate.formula - requires R 2.11.x >> aggregate(V2 ~ V1, DF, min) > ?V1 V2 > 1 ?a ?2 > 2 ?b ?9 > 3 ?c ?4 > >> # 3. SQL using sqldf >> library(sqldf) >> sqldf("select V1, min(V2) V2 from DF group by V1") > ?V1 V2 > 1 ?a ?2 > 2 ?b ?9 > 3 ?c ?4 > >> # 4. summaryBy in the doBy package >> library(doBy) >> summaryBy(V2 ~., DF, FUN = min, keep.names = TRUE) > ?V1 V2 > 1 ?a ?2 > 2 ?b ?9 > 3 ?c ?4 > > On Mon, Feb 8, 2010 at 11:39 AM, Jonathan <jonsleepy at gmail.com> wrote: >> Hi all, >> ? ?I'm feeling a little guilty to ask this question, since I've >> written a solution using a rather clunky for loop that gets the job >> done. ?But I'm convinced there must be a faster (and probably more >> elegant) way to accomplish what I'm looking to do (perhaps using the >> "merge" function?). ?I figured somebody out there might've already >> figured this out: >> >> I have a dataframe with two columns (let's call them V1 and V2). ?All >> rows are unique, although column V1 has several redundant entries. >> >> Ex: >> >> ? ? V1 ? ? V2 >> 1 ? ?a ? ? ? ?3 >> 2 ? ?a ? ? ? ?2 >> 3 ? ?b ? ? ? ?9 >> 4 ? ?c ? ? ? ?4 >> 5 ? ?a ? ? ? ?7 >> 6 ? ?b ? ? ? ?11 >> >> >> What I'd like is to return a dataframe cut down to have only unique >> entires in V1. ?V2 should contain a vector, for each V1, that is the >> minimum of all the possible choices from the set of redundant V1's. >> >> Example output: >> >> ? ? ?V1 ? ? V2 >> 1 ? ? a ? ? ? ?2 >> 2 ? ? b ? ? ? ?9 >> 3 ? ? c ? ? ? ?4 >> >> >> If somebody could (relatively easily) figure out how to get closer to >> a solution, I'd appreciate hearing how. ?Also, I'd be interested to >> hear how you came upon the answer (so I can get better at searching >> the R resources myself). >> >> Regards, >> Jonathan >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >
Gabor Grothendieck
2010-Feb-08 18:56 UTC
[R] Follow-up Question: data frames; matching/merging
It works if you use character class rather than factors for your date strings. Note last arg to data.frame:> DF <- data.frame(V1=c(1,1,2,3,1,2),+ V2=c('2002-03-13','1989-03-10','1988-01-20','1997-05-15','1996-11-18','2000-01-12'), + stringsAsFactors = FALSE)> aggregate(DF[2], DF[1], min)V1 V2 1 1 1989-03-10 2 2 1988-01-20 3 3 1997-05-15 On Mon, Feb 8, 2010 at 1:05 PM, Jonathan <jonsleepy at gmail.com> wrote:> Wow.. thanks for the deluge of responses! > > Aggregate seems like the way to go here. > > But, suppose that instead of integers in column V2, I actually have > dates (and instead of keeping the minimum integer, I want to keep the > earliest date): > >> df = data.frame(V1=c(1,1,2,3,1,2),V2=c('2002-03-13','1989-03-10','1988-01-20','1997-05-15','1996-11-18','2000-01-12')); >> aggregate(df$V2,df['V1'],min) #thanks, Phil > Error in Summary.factor(c(6L, 2L, 3L), na.rm = FALSE) : > ?min not meaningful for factors > > So I checked: >> min(df$V2) > Error in Summary.factor(c(6L, 2L, 1L, 4L, 3L, 5L), na.rm = FALSE) : > ?min not meaningful for factors > > This can be fixed with: >> min(as.Date(df$V2)) > [1] "1988-01-20" > > > However, I get a strange response when I try: >> aggregate(as.Date(df$V2),df['V1'],min) > ?V1 ? ?x > 1 ?1 7008 > 2 ?2 6593 > 3 ?3 9996 > > What I'm looking for, of course, is: >> aggregate(as.Date(df$V2),df['V1'],min) > ?V1 ? ?V2 > 1 ?1 ? 1996-11-18 > 2 ?2 ? 1988-01-20 > 3 ?3 ? 1997-05-15 > > > Any idea how to produce the desired output? > > Thanks, > Jonathan > > > > On Mon, Feb 8, 2010 at 12:11 PM, Gabor Grothendieck > <ggrothendieck at gmail.com> wrote: >> Here are 3 solutions assuming DF contains the data frame: >> >>> # 1. aggregate >>> aggregate(DF[2], DF[1], min) >> ?V1 V2 >> 1 ?a ?2 >> 2 ?b ?9 >> 3 ?c ?4 >> >>> # 2. aggregate.formula - requires R 2.11.x >>> aggregate(V2 ~ V1, DF, min) >> ?V1 V2 >> 1 ?a ?2 >> 2 ?b ?9 >> 3 ?c ?4 >> >>> # 3. SQL using sqldf >>> library(sqldf) >>> sqldf("select V1, min(V2) V2 from DF group by V1") >> ?V1 V2 >> 1 ?a ?2 >> 2 ?b ?9 >> 3 ?c ?4 >> >>> # 4. summaryBy in the doBy package >>> library(doBy) >>> summaryBy(V2 ~., DF, FUN = min, keep.names = TRUE) >> ?V1 V2 >> 1 ?a ?2 >> 2 ?b ?9 >> 3 ?c ?4 >> >> On Mon, Feb 8, 2010 at 11:39 AM, Jonathan <jonsleepy at gmail.com> wrote: >>> Hi all, >>> ? ?I'm feeling a little guilty to ask this question, since I've >>> written a solution using a rather clunky for loop that gets the job >>> done. ?But I'm convinced there must be a faster (and probably more >>> elegant) way to accomplish what I'm looking to do (perhaps using the >>> "merge" function?). ?I figured somebody out there might've already >>> figured this out: >>> >>> I have a dataframe with two columns (let's call them V1 and V2). ?All >>> rows are unique, although column V1 has several redundant entries. >>> >>> Ex: >>> >>> ? ? V1 ? ? V2 >>> 1 ? ?a ? ? ? ?3 >>> 2 ? ?a ? ? ? ?2 >>> 3 ? ?b ? ? ? ?9 >>> 4 ? ?c ? ? ? ?4 >>> 5 ? ?a ? ? ? ?7 >>> 6 ? ?b ? ? ? ?11 >>> >>> >>> What I'd like is to return a dataframe cut down to have only unique >>> entires in V1. ?V2 should contain a vector, for each V1, that is the >>> minimum of all the possible choices from the set of redundant V1's. >>> >>> Example output: >>> >>> ? ? ?V1 ? ? V2 >>> 1 ? ? a ? ? ? ?2 >>> 2 ? ? b ? ? ? ?9 >>> 3 ? ? c ? ? ? ?4 >>> >>> >>> If somebody could (relatively easily) figure out how to get closer to >>> a solution, I'd appreciate hearing how. ?Also, I'd be interested to >>> hear how you came upon the answer (so I can get better at searching >>> the R resources myself). >>> >>> Regards, >>> Jonathan >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >> >