Hi R Users, Some individuals recorded multiple within a time period. But, I want to select the row of last site within each time period for each individual. I spent a substantial time, but no luck in selecting the rows. Would you give me a hint for this one? I have a very large data set, but this is just an example. Thanks for your help. I want to get dat2 from dat1. dat1<-structure(list(sn = 1:16, Species = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L), .Label = c("Sp1", "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 3L, 1L, 2L, 1L, 2L, 4L, 3L, 1L, 2L, 2L, 3L, 5L, 4L, 1L, 3L), .Label = c("SiteA", "SiteB", "SiteC", "SiteD", "SIteD"), class = "factor"), Time = structure(c(1L, 5L, 6L, 6L, 6L, 2L, 3L, 11L, 8L, 10L, 1L, 1L, 4L, 4L, 7L, 9L), .Label = c("1/15/15", "1/17/15", "1/29/15", "2/17/15", "2/25/15", "2/27/15", "2/28/15", "3/27/15", "3/5/15", "4/19/15", "7/3/15"), class = "factor"), Month = structure(c(3L, 2L, 2L, 2L, 2L, 3L, 2L, 4L, 4L, 1L, 3L, 3L, 2L, 2L, 2L, 4L), .Label = c("April", "Feb", "Jan", "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -16L)) dat1 #--- dat2<-structure(list(sn = c(1L, 5L, 6L, 9L, 10L, 11L, 12L, 15L, 16L), Species = structure(c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 5L), .Label = c("Sp1", "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 3L), .Label = c("SiteA", "SiteB", "SiteC"), class = "factor"), Time = structure(c(1L, 3L, 2L, 5L, 7L, 1L, 1L, 4L, 6L), .Label = c("1/15/15", "1/17/15", "2/27/15", "2/28/15", "3/27/15", "3/5/15", "4/19/15"), class = "factor"), Month = structure(c(3L, 2L, 3L, 4L, 1L, 3L, 3L, 2L, 4L), .Label = c("April", "Feb", "Jan", "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -9L)) dat2 [[alternative HTML version deleted]]
Dear Kristi You do not say what you have tried but I would have thought the key to a solution was to split your problem up by site and write a function to select the last observation within each site. I am not quite clear whether this is the last in time or the last occurring row so I just give a hint below as to functions ?aggregate ?by ?strplit ?lapply which may all be relevant here. On 20/03/2016 14:15, Kristi Glover wrote:> Hi R Users, > Some individuals recorded multiple within a time period. But, I want to select the row of last site within each time period for each individual. I spent a substantial time, but no luck in selecting the rows. Would you give me a hint for this one? I have a very large data set, but this is just an example. > Thanks for your help. > > I want to get dat2 from dat1. > > dat1<-structure(list(sn = 1:16, Species = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L), .Label = c("Sp1", > "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 3L, 1L, 2L, 1L, 2L, 4L, 3L, 1L, 2L, 2L, 3L, 5L, 4L, 1L, 3L), .Label = c("SiteA", > "SiteB", "SiteC", "SiteD", "SIteD"), class = "factor"), Time = structure(c(1L, 5L, 6L, 6L, 6L, 2L, 3L, 11L, 8L, 10L, 1L, 1L, 4L, 4L, 7L, 9L), .Label = c("1/15/15", > "1/17/15", "1/29/15", "2/17/15", "2/25/15", "2/27/15", "2/28/15", "3/27/15", "3/5/15", "4/19/15", "7/3/15"), class = "factor"), > Month = structure(c(3L, 2L, 2L, 2L, 2L, 3L, 2L, 4L, 4L, 1L, 3L, 3L, 2L, 2L, 2L, 4L), .Label = c("April", "Feb", "Jan", > "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -16L)) > dat1 > #--- > dat2<-structure(list(sn = c(1L, 5L, 6L, 9L, 10L, 11L, 12L, 15L, 16L), Species = structure(c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 5L), .Label = c("Sp1", > "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 3L), .Label = c("SiteA", "SiteB", > "SiteC"), class = "factor"), Time = structure(c(1L, 3L, 2L, 5L, 7L, 1L, 1L, 4L, 6L), .Label = c("1/15/15", "1/17/15", "2/27/15", > "2/28/15", "3/27/15", "3/5/15", "4/19/15"), class = "factor"), Month = structure(c(3L, 2L, 3L, 4L, 1L, 3L, 3L, 2L, 4L), .Label = c("April", > "Feb", "Jan", "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -9L)) > dat2 > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Michael http://www.dewey.myzen.co.uk/home.html
Don't post in HTML, but thanks for providing dput() outout. This code tries to follow your specifications, assuming by "time period" you mean days. The result doesn't look like your desired example though, because that is not compatible with your specs and I can't guess what you want to do differently. E.g your species 2 is observed on three different days: 6 6 Sp2 SiteB 2015-01-17 Jan 7 7 Sp2 SiteD 2015-01-29 Feb 8 8 Sp2 SiteC 2015-07-03 March ... yet your dat2 contains only the first observation in Jan. Also note that there are conflicts like this: 3 3 Sp1 SiteA 2015-02-27 4 4 Sp1 SiteB 2015-02-27 ... that you have not specified how to resolve. Is there an ordering implied such that SiteB is the "last site"? Finally, your data is messy: youhave Sp2 and SP3, SiteC and SIteD - if you don't pay better attention that's going to bite you one day. Here's working code, you should be able to solve it entirely based on this: # Step 1: most of your columns are factors. This doesn't matter much # except for Time. For time this is the Wrong Way, since time # is ordered. We replace ... datNew <- dat1 datNew$Time <- strptime(as.character(dat1$Time), format="%m/%d/%y") # Step 2: Your problem is solved with a combination of order() and duplicated(). # In order to make the ordering of species/site unique we generate # an auxiliary column. datNew$key <- paste(datNew$Species, datNew$Time, sep=".") # Step 3: Now we order the rows by $key and $Observed.site, in decreasing order datNew <- datNew[ order(datNew$key, datNew$Observed.site, decreasing=TRUE), ] # Step 4: Then we drop all but the first row from each species/site # block. The first row is the last date (decreasing sort, remember!) datNew <- datNew[!duplicated(datNew$key), ] # Step 5: We can drop the keys and order ascending # to make it look more like your example. Now we have # one site on each day for each individual. datNew <- datNew[order(datNew$key), colnames(datNew) != "key"] datNew B. On Mar 20, 2016, at 11:14 AM, Michael Dewey <lists at dewey.myzen.co.uk> wrote:> Dear Kristi > > You do not say what you have tried but I would have thought the key to a solution was to split your problem up by site and write a function to select the last observation within each site. I am not quite clear whether this is the last in time or the last occurring row so I just give a hint below as to functions > > ?aggregate > ?by > ?strplit > ?lapply > > which may all be relevant here. > > On 20/03/2016 14:15, Kristi Glover wrote: >> Hi R Users, >> Some individuals recorded multiple within a time period. But, I want to select the row of last site within each time period for each individual. I spent a substantial time, but no luck in selecting the rows. Would you give me a hint for this one? I have a very large data set, but this is just an example. >> Thanks for your help. >> >> I want to get dat2 from dat1. >> >> dat1<-structure(list(sn = 1:16, Species = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L), .Label = c("Sp1", >> "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 3L, 1L, 2L, 1L, 2L, 4L, 3L, 1L, 2L, 2L, 3L, 5L, 4L, 1L, 3L), .Label = c("SiteA", >> "SiteB", "SiteC", "SiteD", "SIteD"), class = "factor"), Time = structure(c(1L, 5L, 6L, 6L, 6L, 2L, 3L, 11L, 8L, 10L, 1L, 1L, 4L, 4L, 7L, 9L), .Label = c("1/15/15", >> "1/17/15", "1/29/15", "2/17/15", "2/25/15", "2/27/15", "2/28/15", "3/27/15", "3/5/15", "4/19/15", "7/3/15"), class = "factor"), >> Month = structure(c(3L, 2L, 2L, 2L, 2L, 3L, 2L, 4L, 4L, 1L, 3L, 3L, 2L, 2L, 2L, 4L), .Label = c("April", "Feb", "Jan", >> "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -16L)) >> dat1 >> #--- >> dat2<-structure(list(sn = c(1L, 5L, 6L, 9L, 10L, 11L, 12L, 15L, 16L), Species = structure(c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 5L), .Label = c("Sp1", >> "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 3L), .Label = c("SiteA", "SiteB", >> "SiteC"), class = "factor"), Time = structure(c(1L, 3L, 2L, 5L, 7L, 1L, 1L, 4L, 6L), .Label = c("1/15/15", "1/17/15", "2/27/15", >> "2/28/15", "3/27/15", "3/5/15", "4/19/15"), class = "factor"), Month = structure(c(3L, 2L, 3L, 4L, 1L, 3L, 3L, 2L, 4L), .Label = c("April", >> "Feb", "Jan", "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -9L)) >> dat2 >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > -- > Michael > http://www.dewey.myzen.co.uk/home.html > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Your example is wrong: What happened to Sp2 on 1/29? You have also apparently mixed up lower and upper case: "Sp1", "SP3" . This will likely cause you great grief, so try to avoid or fix this in your work. Anyway, there are tons of ways to do this. dplyR is particularly good at this sort of thing, I believe, so you might want to learn it. However, I tend to use just base R, in which it is also pretty straightforward. 1) I assume "time period" = Month. Please be clear in what you mean in future. To get the final result neatly arranged, you could use date functions, after first converting your Time column to POSIX. You could then use the month() function to get the month, properly orderd. However, as this is a bit complicated, I'll just use brute force to order the Month factor manually: dat1$Month <- with(dat1,ordered(as.character(Month),lev=c("Jan","Feb","March","April"))) 2) Then this does what you want I think (there are more elegant ways, certainly): do.call(rbind, with(dat1,by(dat1, list(Species,Month), FUN =function(x)x[nrow(x),]))) (You can use the order() function to order the data frame by Species if you want to do this) Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Sun, Mar 20, 2016 at 7:15 AM, Kristi Glover <kristi.glover at hotmail.com> wrote:> Hi R Users, > Some individuals recorded multiple within a time period. But, I want to select the row of last site within each time period for each individual. I spent a substantial time, but no luck in selecting the rows. Would you give me a hint for this one? I have a very large data set, but this is just an example. > Thanks for your help. > > I want to get dat2 from dat1. > > dat1<-structure(list(sn = 1:16, Species = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L), .Label = c("Sp1", > "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 3L, 1L, 2L, 1L, 2L, 4L, 3L, 1L, 2L, 2L, 3L, 5L, 4L, 1L, 3L), .Label = c("SiteA", > "SiteB", "SiteC", "SiteD", "SIteD"), class = "factor"), Time = structure(c(1L, 5L, 6L, 6L, 6L, 2L, 3L, 11L, 8L, 10L, 1L, 1L, 4L, 4L, 7L, 9L), .Label = c("1/15/15", > "1/17/15", "1/29/15", "2/17/15", "2/25/15", "2/27/15", "2/28/15", "3/27/15", "3/5/15", "4/19/15", "7/3/15"), class = "factor"), > Month = structure(c(3L, 2L, 2L, 2L, 2L, 3L, 2L, 4L, 4L, 1L, 3L, 3L, 2L, 2L, 2L, 4L), .Label = c("April", "Feb", "Jan", > "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -16L)) > dat1 > #--- > dat2<-structure(list(sn = c(1L, 5L, 6L, 9L, 10L, 11L, 12L, 15L, 16L), Species = structure(c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 5L), .Label = c("Sp1", > "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 3L), .Label = c("SiteA", "SiteB", > "SiteC"), class = "factor"), Time = structure(c(1L, 3L, 2L, 5L, 7L, 1L, 1L, 4L, 6L), .Label = c("1/15/15", "1/17/15", "2/27/15", > "2/28/15", "3/27/15", "3/5/15", "4/19/15"), class = "factor"), Month = structure(c(3L, 2L, 3L, 4L, 1L, 3L, 3L, 2L, 4L), .Label = c("April", > "Feb", "Jan", "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -9L)) > dat2 > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.