Dear List, I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. My problem is from a fish tracking study. We put acoustic tags in fish and track them for several days. Location data is supposed to be automatically recorded every time we detect a "ping" from the fish. Unfortunately the GPS had some problems and sometimes the fishes depth was recorded but not its location. I fortunately had a back-up GPS that was taking location data every five minutes. I would like to merge the two files, replacing the missing value in the vscan (automatic) file with the location from the garmin file. Since we were getting vscan records every 1-2 seconds and garmin records every 5 minutes, I need to find the right place in the vscan file to place the garmin record - i.e. the closest in time, but not greater than 5 minutes. I have written a function that does this. However, it works with my test data but locks up my computer with my real data. I have several million vscan records and several thousand garmin records. Is there a better way to do this? My function and test data: myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00","12:20:00"))) names(myvscan)<-c("Latitude","DateTime") mygarmin<-data.frame(c(20,30,40),times(("12:00:00","12:10:00","12:15:00"))) names(mygarmin)<-c("Latitude","DateTime") minute.diff<-1/24/12 #Time diff is in days, so this is 5 minutes for (k in 1:nrow(myvscan)) { if (is.na(myvscan$Latitude[k])) { if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < minute.diff ) { index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date] }}} I appreciate your help and advice. Aloha, Tim Tim Clark Department of Zoology University of Hawaii
Tim Clark <mudiver1200 <at> yahoo.com> writes:> > I need some help in coming up with a function that will take two data sets,determine if a value is missing in> one, find a value in the second that was taken at about the same time, andsubstitute the second value in for> where the first should have been.This is the type of job I would do with a database, not R (alone). The main advantage is that you have to do the cleanup job only once and can retrieve the data in a rather well-documented way later (it's possible with R, I know).>> Put the 5 minutes data into one table. I would two new columns giving thedelta to the next value for easier linear interpolation, but that's secondary. Make sure to index the table.>> Put the 1 seconds data into another table, adding values rounded to5 seconds, and giving these an index.>>From R/ODBC or with RSQLite, make a Join of all values in Table 1that do have NULL values in the coordinates. If you do not want to do a linear interpolation, you could even do this within the database and SQL alone.>> Compute the linear interpolation, and write the data back intothe database. If you want to be careful, you might mark the interpolated values in a separate field as "computed" When at a later time new data come in, you can run the procedure again without penalty. Dieter
Here are 2 functions, which.just.above and which.just.below, which may help you. They will tell which element in a reference dataset is the first just above (or just below) each element in the main dataset (x). They return NA if there is no reference element above (or below) an element of x. The strict argument lets you say if the inequalities are strict or if equality is acceptable. They are vectorized so are pretty quick. E.g., > which.just.below(c(14,14.5), 11:15, strict=TRUE) [1] 3 4 > which.just.above(c(14,14.5), 11:15, strict=FALSE) [1] 4 5 They should work with any class of data that order() and sort() work on. In particular, POSIXct times work. The attached file has a demonstration function called 'test' with some examples. In your case the 'reference' data would be the times at which your backup measurements were taken and the 'x' data would be the times of the pings. You can look at the elements of 'reference' just before and just after each ping (or just the pings that are missing locations) and decide how to combine the data from the bracketing reference elements to inpute a location for the ping. Here are the functions, in case the attachment doesn't make it through. I'm sure some mailer will throw in some newlines so it will be corrupted. "which.just.above" <- function(x, reference, strict = T) { # output[k] will be index of smallest value in reference vector # larger than x[k]. If strict=F, replace 'larger than' by # 'larger than or equal to'. # We should allow NA's in x (but we don't). NA's in reference # should not be allowed. if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input") if(strict) i <- c(rep(T, length(reference)), rep(F, length(x)))[order( c(reference, x))] else i <- c(rep(F, length(x)), rep(T, length(reference)))[order(c( x, reference))] i <- cumsum(i)[!i] + 1. i[i > length(reference)] <- NA # i is length of x and has values in range 1:length(reference) or NA # following needed if reference is not sorted i <- order(reference)[i] # following needed if x is not sorted i[order(order(x))] } "which.just.below" <- function(x, reference, strict = T) { # output[k] will be index of largest value in reference vector # less than x[k]. If strict=F, replace 'less than' by # 'less than or equal to'. Neither x nor reference need be # sorted, although they should not have NA's (in theory, NA's # in x are ok, but not in reference). if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input") if(!strict) i <- c(rep(T, length(reference)), rep(F, length(x)))[order( c(reference, x))] else i <- c(rep(F, length(x)), rep(T, length(reference)))[order(c( x, reference))] i <- cumsum(i)[!i] i[i <= 0] <- NA # i is length of x and has values in range 1:length(reference) or NA # following needed if reference is not sorted i <- order(reference)[i] # following needed if x is not sorted i[order(order(x))] } Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com> -----Original Message----- > From: r-help-bounces at r-project.org > [mailto:r-help-bounces at r-project.org] On Behalf Of Tim Clark > Sent: Thursday, May 21, 2009 9:45 PM > To: r-help at r-project.org > Subject: [R] Need a faster function to replace missing data > > > Dear List, > > I need some help in coming up with a function that will take > two data sets, determine if a value is missing in one, find a > value in the second that was taken at about the same time, > and substitute the second value in for where the first should > have been. My problem is from a fish tracking study. We put > acoustic tags in fish and track them for several days. > Location data is supposed to be automatically recorded every > time we detect a "ping" from the fish. Unfortunately the GPS > had some problems and sometimes the fishes depth was recorded > but not its location. I fortunately had a back-up GPS that > was taking location data every five minutes. I would like to > merge the two files, replacing the missing value in the vscan > (automatic) file with the location from the garmin file. > Since we were getting vscan records every 1-2 seconds and > garmin records every 5 minutes, I need to find the right > place in the vscan file to place the garmin record - i.e. the > closest in time, but not greater than 5 minutes. I have > written a function that does this. However, it works with my > test data but locks up my computer with my real data. I have > several million vscan records and several thousand garmin > records. Is there a better way to do this? > > > My function and test data: > > myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00", > "12:20:00"))) > names(myvscan)<-c("Latitude","DateTime") > mygarmin<-data.frame(c(20,30,40),times(("12:00:00","12:10:00", > "12:15:00"))) > names(mygarmin)<-c("Latitude","DateTime") > > minute.diff<-1/24/12 #Time diff is in days, so this is 5 minutes > for (k in 1:nrow(myvscan)) > { > if (is.na(myvscan$Latitude[k])) > { > if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < minute.diff ) > { > index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) > myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date] > }}} > > I appreciate your help and advice. > > Aloha, > > Tim > > > > > Tim Clark > Department of Zoology > University of Hawaii > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: bracket.R.txt URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090522/2339ce29/attachment-0002.txt>
I think this does what you want. It uses 'findInterval' to determine where a possible match is:>myvscan<-data.frame(c(1,NA,1.5),as.POSIXct(c("12:00:00","12:14:00","12:20:00"), format="%H:%M:%S"))> # convert to numeric > names(myvscan)<-c("Latitude","DateTime") > myvscan$tn <- as.numeric(myvscan$DateTime) # numeric for findInterval >mygarmin<-data.frame(c(20,30,40),as.POSIXct(c("12:00:00","12:10:00","12:15:00"), format="%H:%M:%S"))> names(mygarmin)<-c("Latitude","DateTime") > mygarmin$tn <- as.numeric(mygarmin$DateTime) > > # use 'findInterval' > na.indx <- which(is.na(myvscan$Latitude)) # find NAs > # replace with garmin latitude > myvscan$Latitude[na.indx] <-mygarmin$Latitude[findInterval(myvscan$tn[na.indx], mygarmin$tn)]> > > myvscanLatitude DateTime tn 1 1.0 2009-05-22 12:00:00 1243008000 2 30.0 2009-05-22 12:14:00 1243008840 3 1.5 2009-05-22 12:20:00 1243009200>On Fri, May 22, 2009 at 12:45 AM, Tim Clark <mudiver1200@yahoo.com> wrote:> > Dear List, > > I need some help in coming up with a function that will take two data sets, > determine if a value is missing in one, find a value in the second that was > taken at about the same time, and substitute the second value in for where > the first should have been. My problem is from a fish tracking study. We > put acoustic tags in fish and track them for several days. Location data is > supposed to be automatically recorded every time we detect a "ping" from the > fish. Unfortunately the GPS had some problems and sometimes the fishes > depth was recorded but not its location. I fortunately had a back-up GPS > that was taking location data every five minutes. I would like to merge the > two files, replacing the missing value in the vscan (automatic) file with > the location from the garmin file. Since we were getting vscan records > every 1-2 seconds and garmin records every 5 minutes, I need to find the > right place in the vscan file to place the garmin record - i.e. the > closest in time, but not greater than 5 minutes. I have written a > function that does this. However, it works with my test data but locks up my > computer with my real data. I have several million vscan records and > several thousand garmin records. Is there a better way to do this? > > > My function and test data: > > myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00","12:20:00"))) > names(myvscan)<-c("Latitude","DateTime") > mygarmin<-data.frame(c(20,30,40),times(("12:00:00","12:10:00","12:15:00"))) > names(mygarmin)<-c("Latitude","DateTime") > > minute.diff<-1/24/12 #Time diff is in days, so this is 5 minutes > for (k in 1:nrow(myvscan)) > { > if (is.na(myvscan$Latitude[k])) > { > if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < minute.diff ) > { > index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) > myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date] > }}} > > I appreciate your help and advice. > > Aloha, > > Tim > > > > > Tim Clark > Department of Zoology > University of Hawaii > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html> > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? [[alternative HTML version deleted]]
Jim, Thanks! I like the way you use indexing instead of the loops. However, the find.Interval function does not give the right result. I have been playing with it and it seems to give the closest number that is less than the one of interest. In this case, the correct replacement should have been 40, not 30, since 12:15 from mygarmin is closer to 12:14 in myvscan than 12:10. Is there a way to get the function to find the closest in value instead of the next smaller value? I was trying to use which.min to get the closet date but can't seem to get it to work right either. Aloha, Tim Tim Clark Department of Zoology University of Hawaii --- On Fri, 5/22/09, jim holtman <jholtman at gmail.com> wrote:> From: jim holtman <jholtman at gmail.com> > Subject: Re: [R] Need a faster function to replace missing data > To: "Tim Clark" <mudiver1200 at yahoo.com> > Cc: r-help at r-project.org > Date: Friday, May 22, 2009, 7:24 AM > I think this does what you > want.? It uses 'findInterval' to determine where a > possible match is: > ? > > > myvscan<-data.frame(c(1,NA,1.5),as.POSIXct(c("12:00:00","12:14:00","12:20:00"), > format="%H:%M:%S")) > > # convert to numeric > > > names(myvscan)<-c("Latitude","DateTime") > > > myvscan$tn <- as.numeric(myvscan$DateTime)? # > numeric for findInterval > > > mygarmin<-data.frame(c(20,30,40),as.POSIXct(c("12:00:00","12:10:00","12:15:00"), > format="%H:%M:%S")) > > > > names(mygarmin)<-c("Latitude","DateTime") > > mygarmin$tn <- as.numeric(mygarmin$DateTime) > > > > # use 'findInterval' > > na.indx <- which(is.na(myvscan$Latitude))? # find > NAs > > > # replace with garmin latitude > > myvscan$Latitude[na.indx] <- > mygarmin$Latitude[findInterval(myvscan$tn[na.indx], > mygarmin$tn)] > > > > > > myvscan > ? Latitude??????????? DateTime???????? > tn > > 1????? 1.0 2009-05-22 12:00:00 1243008000 > 2???? 30.0 2009-05-22 12:14:00 1243008840 > 3????? 1.5 2009-05-22 12:20:00 1243009200 > > > > > > On Fri, May 22, 2009 at 12:45 AM, > Tim Clark <mudiver1200 at yahoo.com> > wrote: > > > Dear List, > > I need some help in coming up with a function that will > take two data sets, determine if a value is missing in one, > find a value in the second that was taken at about the same > time, and substitute the second value in for where the first > should have been. ?My problem is from a fish tracking > study. ?We put acoustic tags in fish and track them for > several days. ?Location data is supposed to be > automatically recorded every time we detect a > "ping" from the fish. ?Unfortunately the GPS had > some problems and sometimes the fishes depth was recorded > but not its location. ?I fortunately had a back-up GPS that > was taking location data every five minutes. ?I would like > to merge the two files, replacing the missing value in the > vscan (automatic) file with the location from the garmin > file. ?Since we were getting vscan records every 1-2 > seconds and garmin records every 5 minutes, I need to find > the right place in the vscan file to place the garmin record > - i.e. the > > ?closest in time, but not greater than 5 minutes. ?I have > written a function that does this. However, it works with my > test data but locks up my computer with my real data. ?I > have several million vscan records and several thousand > garmin records. ?Is there a better way to do this? > > > > My function and test data: > > myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00","12:20:00"))) > names(myvscan)<-c("Latitude","DateTime") > > mygarmin<-data.frame(c(20,30,40),times(("12:00:00","12:10:00","12:15:00"))) > names(mygarmin)<-c("Latitude","DateTime") > > minute.diff<-1/24/12 ? #Time diff is in days, so this > is 5 minutes > > for (k in 1:nrow(myvscan)) > { > if (is.na(myvscan$Latitude[k])) > { > if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < > minute.diff ) > { > index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) > > myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date] > }}} > > I appreciate your help and advice. > > Aloha, > > Tim > > > > > Tim Clark > Department of Zoology > University of Hawaii > > ______________________________________________ > > R-help at r-project.org > mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, > reproducible code. > > > > > -- > Jim Holtman > Cincinnati, OH > +1 513 646 9390 > > What is the problem that you are trying to solve? > >
Many thanks to Jim, Bill, and Carl. Using indexes instead of the for loop gave me my answer in minutes instead of hours! Thanks for all of your great suggestions! Aloha, Tim Tim Clark Department of Zoology University of Hawaii --- On Fri, 5/22/09, jim holtman <jholtman at gmail.com> wrote:> From: jim holtman <jholtman at gmail.com> > Subject: Re: [R] Need a faster function to replace missing data > To: "Tim Clark" <mudiver1200 at yahoo.com> > Cc: r-help at r-project.org > Date: Friday, May 22, 2009, 4:59 PM > Here is a modification that should > now find the closest: > ? > > > myvscan<-data.frame(c(1,NA,1.5),as.POSIXct(c("12:00:00","12:14:00","12:20:00"), > + format="%H:%M:%S")) > > # convert to numeric > > > > > names(myvscan)<-c("Latitude","DateTime") > > > > > myvscan$tn <- as.numeric(myvscan$DateTime)? # > numeric for findInterval > > > > > mygarmin<-data.frame(c(20,30,40),as.POSIXct(c("12:00:00","12:10:00","12:15:00"), > > + format="%H:%M:%S")) > > > > > > > names(mygarmin)<-c("Latitude","DateTime") > > mygarmin$tn <- as.numeric(mygarmin$DateTime) > > > > # use 'findInterval' > > > na.indx <- which(is.na(myvscan$Latitude))? # find > NAs > > > > # create matrix of values to test the range > > indices <- > findInterval(myvscan$tn[na.indx],mygarmin$tn) > > > x <- cbind(indices, > +??????????? abs(myvscan$tn[na.indx] - > mygarmin$tn[indices]), # lower > +??????????? abs(myvscan$tn[na.indx] - > mygarmin$tn[indices + 1]))? #higher > > # now determine which index is closer > > > closest <- x[,1] + (x[,2] > x[,3])? # determine > the proper index > > # replace with garmin latitude > > myvscan$Latitude[na.indx] <- > mygarmin$Latitude[closest] > > > > > > > > myvscan > > ? Latitude??????????? DateTime???????? > tn > 1????? 1.0 2009-05-23 12:00:00 1243080000 > 2???? 40.0 2009-05-23 12:14:00 1243080840 > 3????? 1.5 2009-05-23 12:20:00 1243081200 > > > > > > On Fri, May 22, 2009 at 7:39 PM, > Tim Clark <mudiver1200 at yahoo.com> > wrote: > > > Jim, > > Thanks! ?I like the way you use indexing instead of the > loops. ?However, the find.Interval function does not give > the right result. ?I have been playing with it and it seems > to give the closest number that is less than the one of > interest. ?In this case, the correct replacement should > have been 40, not 30, since 12:15 from mygarmin is closer to > 12:14 in myvscan than 12:10. ?Is there a way to get the > function to find the closest in value instead of the next > smaller value? ?I was trying to use which.min to get the > closet date but can't seem to get it to work right > either. > > > > Aloha, > > Tim > > > Tim Clark > Department of Zoology > University of Hawaii > > > --- On Fri, 5/22/09, jim holtman <jholtman at gmail.com> > wrote: > > > > From: jim holtman <jholtman at gmail.com> > > Subject: Re: [R] Need a faster function to replace > missing data > > To: "Tim Clark" <mudiver1200 at yahoo.com> > > > Cc: r-help at r-project.org > > Date: Friday, May 22, 2009, 7:24 AM > > > > > I think this does what you > > want.? It uses 'findInterval' to determine > where a > > possible match is: > > ? > > > > > > myvscan<-data.frame(c(1,NA,1.5),as.POSIXct(c("12:00:00","12:14:00","12:20:00"), > > > format="%H:%M:%S")) > > > # convert to numeric > > > > > > names(myvscan)<-c("Latitude","DateTime") > > > > > myvscan$tn <- as.numeric(myvscan$DateTime)? > # > > > numeric for findInterval > > > > > > mygarmin<-data.frame(c(20,30,40),as.POSIXct(c("12:00:00","12:10:00","12:15:00"), > > format="%H:%M:%S")) > > > > > > > > > names(mygarmin)<-c("Latitude","DateTime") > > > mygarmin$tn <- as.numeric(mygarmin$DateTime) > > > > > > # use 'findInterval' > > > na.indx <- which(is.na(myvscan$Latitude))? # find > > > NAs > > > > > # replace with garmin latitude > > > myvscan$Latitude[na.indx] <- > > mygarmin$Latitude[findInterval(myvscan$tn[na.indx], > > mygarmin$tn)] > > > > > > > > > myvscan > > > ? Latitude??????????? > DateTime???????? > > tn > > > > 1????? 1.0 2009-05-22 12:00:00 1243008000 > > 2???? 30.0 2009-05-22 12:14:00 1243008840 > > 3????? 1.5 2009-05-22 12:20:00 1243009200 > > > > > > > > > > > > On Fri, May 22, 2009 at 12:45 AM, > > Tim Clark <mudiver1200 at yahoo.com> > > wrote: > > > > > > Dear List, > > > > I need some help in coming up with a function that > will > > > take two data sets, determine if a value is missing in > one, > > find a value in the second that was taken at about the > same > > time, and substitute the second value in for where the > first > > should have been. ?My problem is from a fish > tracking > > > study. ?We put acoustic tags in fish and track them > for > > several days. ?Location data is supposed to be > > automatically recorded every time we detect a > > "ping" from the fish. ?Unfortunately the > GPS had > > > some problems and sometimes the fishes depth was > recorded > > but not its location. ?I fortunately had a back-up > GPS that > > was taking location data every five minutes. ?I would > like > > to merge the two files, replacing the missing value in > the > > > vscan (automatic) file with the location from the > garmin > > file. ?Since we were getting vscan records every 1-2 > > seconds and garmin records every 5 minutes, I need to > find > > the right place in the vscan file to place the garmin > record > > > - i.e. the > > > > ?closest in time, but not greater than 5 minutes. ?I > have > > written a function that does this. However, it works > with my > > test data but locks up my computer with my real data. > ?I > > > have several million vscan records and several > thousand > > garmin records. ?Is there a better way to do this? > > > > > > > > My function and test data: > > > > > myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00","12:20:00"))) > > > > names(myvscan)<-c("Latitude","DateTime") > > > > > mygarmin<-data.frame(c(20,30,40),times(("12:00:00","12:10:00","12:15:00"))) > > > names(mygarmin)<-c("Latitude","DateTime") > > > > > minute.diff<-1/24/12 ? #Time diff is in days, so > this > > is 5 minutes > > > > for (k in 1:nrow(myvscan)) > > { > > if (is.na(myvscan$Latitude[k])) > > > { > > if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) > < > > minute.diff ) > > { > > > index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) > > > > > myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date] > > > }}} > > > > I appreciate your help and advice. > > > > Aloha, > > > > Tim > > > > > > > > > > Tim Clark > > Department of Zoology > > University of Hawaii > > > > > ______________________________________________ > > > > R-help at r-project.org > > mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > > > and provide commented, minimal, self-contained, > > > reproducible code. > > > > > > > > > > -- > > Jim Holtman > > Cincinnati, OH > > +1 513 646 9390 > > > > What is the problem that you are trying to solve? > > > > > > > > > > > > > -- > Jim Holtman > Cincinnati, OH > +1 513 646 9390 > > What is the problem that you are trying to solve? > >