I appreciate the help I've been given so far. The issue I face is that the data I'm working with has 53000 rows, so in calculating distance, finding all recids that fall within 2km and summing the population, etc. - a) takes too long and b) have no sense of progress. Below is a loop that reads each recid one at a time, calculates the distance and identifies the recids that fall within 2 km. It iterates through all records successfully. Where I'm stuck is how to get the sum of population and dwellings and the mean age for the records that are selected. Also, the desired output should have the following fields: recid, sum(pop), sum(dwell), mean(age). I don't know how to write only those fields out to the file. Any suggestions? Thank you for your help, Danny ##### library(fields) d <- as.matrix( read.csv("filein.csv") ) for(i in 1:nrow(d)){ lonlat1 <- d[i,2:3] lonlat2 <- d[,2:3] distval <- d[,1] [which(rdist.earth( t( as.matrix(lonlat1) ), as.matrix(lonlat2), miles=F ) < 2)] write(distval,file="C:\\outfile.out",ncol=1, append=TRUE) } ##### -------------- Sample Input Data -------------- recid,lat,long,pop,dwell,age 10010265,47.5971174,-52.7039227,584,219,38 10010260,47.5971574,-52.7039147,488,188,34 10010263,47.5936538,-52.7037037,605,232,43 10010287,47.5739426,-52.7035365,548,256,29 10010290,47.5703333,-52.703182,559,336,36 10010284,47.5958782,-52.7013245,394,261,61 10010191,47.5322617,-52.7037037,892,323,23 10010291,47.5700412,-52.7009,0,0,0 10010289,47.5714152,-52.70023,0,0,0 10010285,47.5832183,-52.6995828,469,239,44 10010273,47.5800199,-52.6984875,855,283,28 10010190,47.472353,-52.697991,0,0,0 10010274,47.6018197,-52.6978362,344,117,51 10010288,47.5755249,-52.6978207,33,0,19 10010275,47.6005037,-52.697991,232,93,43 10010279,47.5915368,-52.6954916,983,437,33 10010276,47.5993086,-52.6954808,329,131,28 10010278,47.5958782,-52.6934253,251,107,27 10010354,47.5991086,-52.6934037,27,14,47 10010277,47.5968782,-52.6914148,515,194,37 10010293,47.5778754,-52.6954808,58,0,40 10010292,47.5722183,-52.6899332,1112,523,28 10010353,47.6356972,-52.6896838,1387,471,32 10010283,47.5958439,-52.6884621,531,296,41 10010281,47.5983891,-52.6880528,307,113,52 10010280,47.5958439,-52.6878177,374,129,18 10010282,47.5999645,-52.6880528,637,226,22 10010286,47.5797909,-52.6872042,446,280,32 10010355,47.5797609,-52.6872055,197,72,39
dsheuman at rogers.com wrote:> I appreciate the help I've been given so far. The issue I face is > that the data I'm working with has 53000 rows, so in calculating > distance, finding all recids that fall within 2km and summing the > population, etc. - a) takes too long and b) have no sense of progress. >Well, there are ways to speed this up.> Below is a loop that reads each recid one at a time, calculates the > distance and identifies the recids that fall within 2 km. It iterates > through all records successfully. >But you don't need to subset d[, 2:3] every time, e.g. Also, my experience is that writing a single line appended to a file every time around the loop is very inefficient.> Where I'm stuck is how to get the sum of population and dwellings and > the mean age for the records that are selected. Also, the desired > output should have the following fields: recid, sum(pop), sum(dwell), > mean(age). I don't know how to write only those fields out to the > file. >Well, that part is easy, essentially, you have to save the indices, then extract the relevant rows of pop, dwell and age. So I would modify your code as follows: library(fields) d <- as.matrix( read.csv("filein.csv") ) lonlat2 <- d[,2:3] results <- matrix(0, nrow=nrow(d), ncol=4) for(i in 1:nrow(d)){ lonlat1 <- d[i, 2:3] whichdist <- which(rdist.earth(t(as.matrix(lonlat1)), as.matrix(lonlat2), miles=F) < 2) distval <- d[, 1][whichdist] sumpop <- sum(data[, "pop"][whichdist]) sumdwell <- sum(data[, "dwell"][whichdist]) meanage <- mean(data[, "age"][whichdist]) results[i, ] <- c(d[i, "recid"], sumpop, sumdwell, meanage) } write(t(results), file="C:\\outfile.out", ncol=ncol(results)) Then there is a trick you can use to speed up the process. Essentially you reduce the 'inner loop' which is inside rdist.earth(). This is achieved by initially computing a single distance vector with reference to a fixed point [(0, 0) seems reasonable since it is in the Atlantic Ocean]. Then you select a subset of your points that are the same distance is your circle centre from the fixed point plus or minus the radius you want (and a bit of tolerance). This will generate an annulus of points rather than a circle, but in particular all the points within the circle of interest will also be within this annulus. Then you apply rdist.earth() to find the distance of each of these points from your circle centre. This is what I have used to process a 52000 length matrix in about 40 minutes (and ~50MB memory): DistAgg <- function(data, dist=2) { results <- matrix(0, nrow=nrow(data), ncol=4) colnames(results) <- c("recid", "sumpop", "sumdwell", "meanage") lonlat2 <- data[, 2:3] basedist <- rdist.earth(t(as.matrix(c(0, 0))), as.matrix(lonlat2)) for(i in 1:nrow(data)){ lonlat1 <- data[i, 2:3] approxval <- which(abs(basedist - basedist[i]) < dist*1.001) if (length(approxval) > 1) { whichval <- approxval[which(rdist.earth(t(as.matrix(lonlat1)), as.matrix(lonlat2[approxval, ]), miles=F) < dist)] } else { whichval <- approxval } sumpop <- sum(data[, "pop"][whichval]) sumdwell <- sum(data[, "dwell"][whichval]) meanage <- mean(data[, "age"][whichval]) results[i, ] <- c(data[i, "recid"], sumpop, sumdwell, meanage) } write(t(results), file="C:\\outfile.out", ncol=ncol(results)) } The data I used was based on yours, but randomised latitude and longitude (in restricted ranges). Hope this helps, Ray Brownrigg