Adeel Amin
2013-May-02 06:28 UTC
[R] R issue with unequal large data frames with multiple columns
I'm a bit of an amateur R programmer. I can do simple R scenarios, but my handle on complex grammatical issues isn't steady.

I have 12 CSV files that I've read into data frames. Each has 8 columns and over 2000000 rows. Each data frame has data keyed by a date component and a time component, X.DATE and X.TIME. X.DATE is in the format MMDDYYYY and X.TIME is in the format HHMM. The issue is that even though each data frame begins and ends with the same X.DATE and X.TIME values, each data frame has a different number of rows. One may have as many as 100000 rows more than another.

I want to do two things:

1) I want to extract a certain portion of data depending on date and time (easy).

2) In lock step with number 1, I want to eliminate values from the data frame that are a) redundant or b) do not appear in the other data sets.

When step 2 is done, all the time/date data within all 12 data frames will be the same.

Suggestions? Thanks, R community.
Jim Holtman
2013-May-02 09:43 UTC
[R] R issue with unequal large data frames with multiple columns
?duplicated
?intersect
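
A minimal sketch of how duplicated() and intersect() might be combined for this problem. The two small data frames, the list name `frames`, and the helper `make_key` are made up for illustration; only the X.DATE and X.TIME column names come from the question. It assumes both columns were read as character strings.

```r
# Toy stand-ins for two of the twelve data frames (values are made up).
df_a <- data.frame(
  X.DATE = c("01052007", "01072007", "01072007", "02182007"),
  X.TIME = c("0230", "0330", "0330", "0440"),
  VALUE  = c(37, 42, 42, 45),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  X.DATE = c("01052007", "01182007", "02182007"),
  X.TIME = c("0230", "0330", "0440"),
  VALUE  = c(34, 41, 45),
  stringsAsFactors = FALSE
)
frames <- list(a = df_a, b = df_b)

# Hypothetical helper: one key per row, date and time pasted together.
make_key <- function(d) paste(d$X.DATE, d$X.TIME)

# a) Drop rows whose date/time key was already seen (the first one is kept).
frames <- lapply(frames, function(d) d[!duplicated(make_key(d)), , drop = FALSE])

# b) Find the keys that are present in every data frame.
common <- Reduce(intersect, lapply(frames, make_key))

# Keep only the rows with a common key in each data frame.
aligned <- lapply(frames, function(d) d[make_key(d) %in% common, , drop = FALSE])

sapply(aligned, nrow)
```

After this, every element of `aligned` has the same set of date/time keys. The rows line up one-to-one only if each data frame is already sorted the same way, which the question suggests but does not state. Which duplicate to keep (first, last, or an average) is a choice the data owner has to make.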
PIKAL Petr
2013-May-02 09:47 UTC
[R] R issue with unequal large data frames with multiple columns
Hi,

Without real data I can only suggest that you look at ?merge, or maybe ?aggregate.

Regards
Petr
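
A small sketch of what the merge/aggregate route could look like, under the assumption that repeated date/time rows should be averaged (the question does not say how duplicates should be resolved). The data frames `df_a` and `df_b` and their values are invented for illustration.

```r
# Toy data: df_a has a repeated date/time pair, df_b has a pair df_a lacks.
df_a <- data.frame(
  X.DATE = c("01052007", "01072007", "02182007", "02182007"),
  X.TIME = c("0230", "0330", "0440", "0440"),
  VALUE  = c(37, 42, 45, 47),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  X.DATE = c("01052007", "01182007", "02182007"),
  X.TIME = c("0230", "0330", "0440"),
  VALUE  = c(34, 41, 45),
  stringsAsFactors = FALSE
)

# Collapse repeated date/time rows to a single row each (here: the mean).
df_a1 <- aggregate(VALUE ~ X.DATE + X.TIME, data = df_a, FUN = mean)

# Inner join: only date/time pairs found in both data frames survive.
both <- merge(df_a1, df_b, by = c("X.DATE", "X.TIME"),
              suffixes = c(".a", ".b"))
both
```

With all 8 columns, a formula such as `. ~ X.DATE + X.TIME` would aggregate every remaining column, provided they are all numeric. For 12 data frames the merge can be repeated pairwise, for example with Reduce(), although at 2 million rows each this may be slow and memory-hungry.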
arun
2013-May-02 14:08 UTC
[R] R issue with unequal large data frames with multiple columns
Hi,
Maybe this helps:

dat1 <- structure(list(
  X.DATE = c("01052007", "01072007", "01072007", "02182007", "02182007", "02242007", "03252007"),
  X.TIME = c("0230", "0330", "0440", "0440", "0440", "0330", "0230"),
  VALUE = c(37, 42, 45, 45, 45, 42, 45),
  VALUE2 = c(29, 24, 28, 27, 35, 32, 32)),
  .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"),
  class = "data.frame", row.names = c(NA, -7L))

dat2 <- structure(list(
  X.DATE = c("01052007", "01182007", "01242007", "02142007", "02182007", "03242007", "03252007"),
  X.TIME = c("0230", "0330", "0430", "0330", "0440", "0230", "0230"),
  VALUE = c(34, 41, 42, 44, 45, 21, 42),
  VALUE2 = c(28, 25, 26, 28, 32, 35, 36)),
  .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"),
  class = "data.frame", row.names = c(NA, -7L))

dat3 <- structure(list(
  X.DATE = c("01052007", "01182007", "01252007", "02142007", "02182007", "03222007", "03252007"),
  X.TIME = c("0230", "0330", "0430", "0330", "0440", "0230", "0230"),
  VALUE = c(32, 42, 44, 44, 47, 42, 46),
  VALUE2 = c(24, 29, 32, 34, 38, 39, 42)),
  .Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"),
  class = "data.frame", row.names = c(NA, -7L))

library(xts)

# convert each data frame to an xts object indexed by date and time
lst1 <- lapply(list(dat1, dat2, dat3), function(x) {
  xts(x[, -c(1, 2)],
      order.by = as.POSIXct(paste0(x[, 1], x[, 2]), format = "%m%d%Y%H%M"))
})

# subset by date and time
lapply(lst1, function(x) x['2007-01-05 02:30:00/2007-01-25 04:30:00'])
#[[1]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-01-07 03:30:00    42     24
#2007-01-07 04:40:00    45     28
#
#[[2]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    34     28
#2007-01-18 03:30:00    41     25
#2007-01-24 04:30:00    42     26
#
#[[3]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    32     24
#2007-01-18 03:30:00    42     29
#2007-01-25 04:30:00    44     32

# subset by time
lapply(lst1, function(x) x['T02:30/T03:30'])

# merge on the time index and keep only the timestamps present in all three
res <- na.omit(Reduce(function(...) merge(...), lst1))
res
#                    VALUE VALUE2 VALUE.1 VALUE2.1 VALUE.2 VALUE2.2
#2007-01-05 02:30:00    37     29      34       28      32       24
#2007-02-18 04:40:00    45     27      45       32      47       38
#2007-03-25 02:30:00    45     32      42       36      46       42

# split the merged result back into one object per original data set
lst2 <- as.list(res)
lst3 <- lapply(list(c("VALUE", "VALUE2"), c("VALUE.1", "VALUE2.1"), c("VALUE.2", "VALUE2.2")),
               function(x) do.call(cbind, lst2[x]))
# or
lst3 <- lapply(split(names(lst2), ((seq_along(names(lst2)) - 1) %/% 2) + 1),
               function(x) do.call(cbind, lst2[x]))  # change according to the number of columns
lst3
#$`1`
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-02-18 04:40:00    45     27
#2007-03-25 02:30:00    45     32
#
#$`2`
#                    VALUE.1 VALUE2.1
#2007-01-05 02:30:00      34       28
#2007-02-18 04:40:00      45       32
#2007-03-25 02:30:00      42       36
#
#$`3`
#                    VALUE.2 VALUE2.2
#2007-01-05 02:30:00      32       24
#2007-02-18 04:40:00      47       38
#2007-03-25 02:30:00      46       42

A.K.
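
One practical caveat about the as.POSIXct() conversion in the reply above: it only works if X.DATE and X.TIME are character strings with their leading zeros intact. read.csv() would normally read "01052007" as the number 1052007, so it may be worth forcing those columns to character when the files are loaded. A sketch under that assumption follows; the inline CSV text is invented and stands in for one of the real files, and the UTC time zone is an assumption, since the question does not say which zone the data use.

```r
# Inline CSV text stands in for one of the real files (contents are made up).
csv_text <- "X.DATE,X.TIME,VALUE
01052007,0230,37
01072007,0330,42
02182007,0440,45
03252007,0230,45"

# colClasses keeps the leading zeros that a numeric read would drop.
dat <- read.csv(text = csv_text,
                colClasses = c(X.DATE = "character", X.TIME = "character"))

# Parse MMDDYYYY + HHMM; a fixed time zone (assumed here to be UTC)
# avoids daylight-saving gaps turning valid rows into NA.
dat$stamp <- as.POSIXct(paste0(dat$X.DATE, dat$X.TIME),
                        format = "%m%d%Y%H%M", tz = "UTC")

# Keep the rows inside a closed date/time window (step 1 of the question).
from <- as.POSIXct("2007-01-05 02:30", tz = "UTC")
to   <- as.POSIXct("2007-01-25 04:30", tz = "UTC")
win  <- dat[dat$stamp >= from & dat$stamp <= to, ]
win
```

For a real file, `text = csv_text` would be replaced by the file path. If the columns were already read as numbers, sprintf("%08d", ...) and sprintf("%04d", ...) can restore the padding before parsing.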