thr3ads.net - R help - [R] Distance and Aggregate Data

If this information is useful, please help other people find it:
Share via:

Danny Heuman

2004-Feb-26 05:50 UTC

[R] Distance and Aggregate Data - Again...

I appreciate the help I've been given so far.  The issue I face is
that the data I'm working with has 53000 rows, so in calculating
distance, finding all recids that fall within 2km and summing the
population, etc. - a) takes too long and b) have no sense of progress.

Below is a loop that reads each recid one at a time, calculates the
distance and identifies the recids that fall within 2 km.  It iterates
through all records successfully.

Where I'm stuck is how to get the sum of population and dwellings and
the mean age for the records that are selected.  Also, the desired
output should have the following fields:  recid, sum(pop), sum(dwell),
mean(age).  I don't know how to write only those fields out to the
file.

Any suggestions?

Thank you for your help,

Danny


#####
library(fields)

d <- as.matrix( read.csv("filein.csv") )

for(i in 1:nrow(d)){
	lonlat1 <- d[i,2:3]
	lonlat2 <- d[,2:3]
	distval <- d[,1] [which(rdist.earth( t( as.matrix(lonlat1) ),
as.matrix(lonlat2), miles=F ) < 2)]
	write(distval,file="C:\\outfile.out",ncol=1, append=TRUE)
}
#####


--------------
Sample Input Data
--------------
recid,lat,long,pop,dwell,age
10010265,47.5971174,-52.7039227,584,219,38
10010260,47.5971574,-52.7039147,488,188,34
10010263,47.5936538,-52.7037037,605,232,43
10010287,47.5739426,-52.7035365,548,256,29
10010290,47.5703333,-52.703182,559,336,36
10010284,47.5958782,-52.7013245,394,261,61
10010191,47.5322617,-52.7037037,892,323,23
10010291,47.5700412,-52.7009,0,0,0
10010289,47.5714152,-52.70023,0,0,0
10010285,47.5832183,-52.6995828,469,239,44
10010273,47.5800199,-52.6984875,855,283,28
10010190,47.472353,-52.697991,0,0,0
10010274,47.6018197,-52.6978362,344,117,51
10010288,47.5755249,-52.6978207,33,0,19
10010275,47.6005037,-52.697991,232,93,43
10010279,47.5915368,-52.6954916,983,437,33
10010276,47.5993086,-52.6954808,329,131,28
10010278,47.5958782,-52.6934253,251,107,27
10010354,47.5991086,-52.6934037,27,14,47
10010277,47.5968782,-52.6914148,515,194,37
10010293,47.5778754,-52.6954808,58,0,40
10010292,47.5722183,-52.6899332,1112,523,28
10010353,47.6356972,-52.6896838,1387,471,32
10010283,47.5958439,-52.6884621,531,296,41
10010281,47.5983891,-52.6880528,307,113,52
10010280,47.5958439,-52.6878177,374,129,18
10010282,47.5999645,-52.6880528,637,226,22
10010286,47.5797909,-52.6872042,446,280,32
10010355,47.5797609,-52.6872055,197,72,39

Ray Brownrigg

2004-Feb-27 02:31 UTC

head link

[R] Distance and Aggregate Data - Again...

dsheuman at rogers.com wrote:
> I appreciate the help I've been given so far.  The issue I face is
> that the data I'm working with has 53000 rows, so in calculating
> distance, finding all recids that fall within 2km and summing the
> population, etc. - a) takes too long and b) have no sense of progress.
>Well, there are ways to speed this up.
> Below is a loop that reads each recid one at a time, calculates the
> distance and identifies the recids that fall within 2 km.  It iterates
> through all records successfully.
>But you don't need to subset d[, 2:3] every time, e.g.  Also, my
experience is that writing a single line appended to a file every time
around the loop is very inefficient.
> Where I'm stuck is how to get the sum of population and dwellings and
> the mean age for the records that are selected.  Also, the desired
> output should have the following fields:  recid, sum(pop), sum(dwell),
> mean(age).  I don't know how to write only those fields out to the
> file.
>Well, that part is easy, essentially, you have to save the indices, then
extract the relevant rows of pop, dwell and age. So I would modify your
code as follows:

library(fields)
d <- as.matrix( read.csv("filein.csv") )

lonlat2 <- d[,2:3]
results <- matrix(0, nrow=nrow(d), ncol=4)
for(i in 1:nrow(d)){
        lonlat1 <- d[i, 2:3]
        whichdist <- which(rdist.earth(t(as.matrix(lonlat1)),
                as.matrix(lonlat2), miles=F) < 2)
        distval <- d[, 1][whichdist]
        sumpop <- sum(data[, "pop"][whichdist])
        sumdwell <- sum(data[, "dwell"][whichdist])
        meanage <- mean(data[, "age"][whichdist])
        results[i, ] <- c(d[i, "recid"], sumpop, sumdwell, meanage)
}
write(t(results), file="C:\\outfile.out", ncol=ncol(results))

Then there is a trick you can use to speed up the process.  Essentially
you reduce the 'inner loop' which is inside rdist.earth().  This is
achieved by initially computing a single distance vector with reference
to a fixed point [(0, 0) seems reasonable since it is in the Atlantic
Ocean].  Then you select a subset of your points that are the same
distance is your circle centre from the fixed point plus or minus the
radius you want (and a bit of tolerance).  This will generate an
annulus of points rather than a circle, but in particular all the
points within the circle of interest will also be within this annulus.
Then you apply rdist.earth() to find the distance of each of these
points from your circle centre.

This is what I have used to process a 52000 length matrix in about 40
minutes (and ~50MB memory):

DistAgg <-
function(data, dist=2) {
  results <- matrix(0, nrow=nrow(data), ncol=4)
  colnames(results) <- c("recid", "sumpop",
"sumdwell", "meanage")
  lonlat2 <- data[, 2:3]
  basedist <- rdist.earth(t(as.matrix(c(0, 0))), as.matrix(lonlat2))
  for(i in 1:nrow(data)){
    lonlat1 <- data[i, 2:3]
    approxval <- which(abs(basedist - basedist[i]) < dist*1.001)
    if (length(approxval) > 1) {
      whichval <- approxval[which(rdist.earth(t(as.matrix(lonlat1)),
        as.matrix(lonlat2[approxval, ]), miles=F) < dist)]
    } else {
      whichval <- approxval
    }
    sumpop <- sum(data[, "pop"][whichval])
    sumdwell <- sum(data[, "dwell"][whichval])
    meanage <- mean(data[, "age"][whichval])
    results[i, ] <- c(data[i, "recid"], sumpop, sumdwell, meanage)
  }
  write(t(results), file="C:\\outfile.out", ncol=ncol(results))
}

The data I used was based on yours, but randomised latitude and
longitude (in restricted ranges).

Hope this helps,
Ray Brownrigg

Maybe Matching Threads

Search for more maybe matching threads

R help - Feb 2004 - Distance and Aggregate Data - Again...

[R] Distance and Aggregate Data - Again...

[R] Distance and Aggregate Data - Again...

Maybe Matching Threads