http://www.nabble.com/file/p18018170/subdata.csv subdata.csv

I've attached 100 rows of a data frame I am working with. I have one
factor, id, with 27 levels. There are two columns of reference data, x and
y (UTM coordinates), one column "date" in POSIXct format, and one column
"diff" in times format (chron package).

What I am trying to do is as follows: for each day of the year (date,
irrespective of time), select the row for each id that contains the
smallest "diff" value, resulting in an output containing, in general, one
value per id per day.

"aggregate" has been suggested, but it only produces the columns considered
in the function, and I need all columns intact. My data frame contains
almost 70,000 entries, so manual sorting is not an option. I know R is
robust, but my programming skills are elementary. The only way I know to
approach it is to first separate every id, then filter, then recombine
somehow. Is there not a more efficient way to do this relatively
straightforward filtering exercise?

Tyler
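For anyone following along without the attachment, here is a minimal sketch
of a data frame with the structure described above; all values are
invented, and the derived "day" column (the date irrespective of time) is
assumed from the replies below:

library(chron)

set.seed(1)
n <- 12
df <- data.frame(
  id   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x    = runif(n, 300000, 400000),    # UTM easting (invented)
  y    = runif(n, 5000000, 5100000),  # UTM northing (invented)
  date = as.POSIXct("2005-01-01", tz = "UTC") + sample(0:(3 * 86400 - 1), n)
)
df$day  <- format(df$date, "%d-%m-%y")        # day, irrespective of time
df$diff <- times(sample(0:86399, n) / 86400)  # chron "times" values
head(df)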
Hi Tyler,

I don't know if I understood well. Try this; if it doesn't work, I'll try
again and again :-)

df <- read.csv("http://www.nabble.com/file/p18018170/subdata.csv")
df.min.diff <- aggregate(df["diff"], df[c("day")], min)
df.subset <- subset(df, paste(df$day, df$diff) %in%
                        paste(df.min.diff$day, df.min.diff$diff))

On 6/19/08, T.D.Rudolph <prairie.picker@gmail.com> wrote:
>
> What I am trying to do is as follows:
> For each day of the year (date, irrespective of time), select that row
> for each id which contains the smallest "diff" value, resulting in an
> output containing in general one value per id per day.
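Note that, as written, this keeps the single smallest "diff" per day across
all ids. Since Tyler wants one row per id per day, the same trick should
work grouping by both columns (a sketch, assuming the file has "id" and
"day" columns as above):

df.min.diff <- aggregate(df["diff"], df[c("id", "day")], min)
df.subset <- subset(df, paste(df$id, df$day, df$diff) %in%
                        paste(df.min.diff$id, df.min.diff$day,
                              df.min.diff$diff))

One caveat with the paste() matching: if two rows within the same id and
day tie for the smallest "diff", both are kept.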
Hi Tyler,

> What I am trying to do is as follows:
> For each day of the year (date, irrespective of time), select that row
> for each id which contains the smallest "diff" value, resulting in an
> output containing in general one value per id per day.

There's a basic strategy that makes solving this type of problem much
easier. I call it split-apply-combine. The basic idea is that if you had
a single day, the problem would be pretty easy:

df <- read.csv("http://www.nabble.com/file/p18018170/subdata.csv")

oneday <- subset(df, day == "01-01-05")
oneday[which.min(oneday$diff), ]

# Let's make that into a function to make it easier to apply to all days
mindiff <- function(df) df[which.min(df$diff), ]

# Now we split up the data frame so that we have a data frame for
# each day
pieces <- split(df, df$day)

# And use lapply to apply that function to each piece:
results <- lapply(pieces, mindiff)

# Then finally join all the pieces back together
df_done <- do.call("rbind", results)

So we split the data frame into individual days, picked the correct row
for each day, and then joined all the pieces back together. This isn't
the most efficient solution, but I think it's easy to see how each part
works, and how you can apply it to new situations.

If you aren't familiar with lapply or do.call, it's worth having a look
at their examples to get a feel for how they work (although for this case
you can of course just copy and paste them without caring how they work).

Hadley

--
http://had.co.nz/
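The same caveat as above applies here: split(df, df$day) keeps one row per
day overall. To get one row per id per day, as asked, the pattern extends
to a split on both factors (a sketch, assuming columns named "id" and
"day"):

# Split on id *and* day so the result has one row per id per day.
# drop = TRUE avoids creating empty pieces for id/day combinations
# that never occur in the data.
pieces  <- split(df, list(df$id, df$day), drop = TRUE)
results <- lapply(pieces, mindiff)
df_done <- do.call("rbind", results)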