Morway, Eric
2015-Apr-08 18:47 UTC
[R] data-mining with multiple filters applied to multiple columns
Using this representative dataset of a much larger dataset: dat <- read.table(textConnection("ISEG IRCH div gw 1 1 265 229 1 2 260 298 1 3 234 196 54 1 432 485 54 39 467 485 54 40 468 468 54 41 460 381 54 42 489 502 1 1 265 317 1 2 276 225 1 3 217 164 54 1 430 489 54 39 456 495 54 40 507 607 54 41 483 424 54 42 457 404 1 1 265 278 1 2 287 370 1 3 224 274 54 1 412 585 54 39 473 532 54 40 502 595 54 41 497 441 54 42 447 467 1 1 230 258 1 2 251 152 1 3 199 179 54 1 412 415 54 39 439 538 54 40 474 486 54 41 477 484 54 42 413 346 1 1 230 171 1 2 262 171 1 3 217 263 54 1 432 485 54 39 455 482 54 40 493 419 54 41 489 536 54 42 431 504 1 1 1002 1090 1 2 1222 1178 1 3 1198 1177 54 1 1432 1485 54 39 1876 1975 54 40 1565 1646 54 41 1455 1451 54 42 1427 1524 1 1 1002 968 1 2 1246 1306 1 3 1153 1158 54 1 1532 1585 54 39 1790 1889 54 40 1490 1461 54 41 1518 1536 54 42 1486 1585 1 1 1002 1081 1 2 1229 1262 1 3 1142 1241 54 1 1632 1659 54 39 1797 1730 54 40 1517 1466 54 41 1527 1589 54 42 1514 1612"),header=TRUE) dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0) tmp <- diff(dat[dat$seq==1,]$div)!=0 dat$idx <- 0 dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1 dat$ts <- cumsum(dat$idx) dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum) dat$ct <- seq(1:length(dat[,1])) dat[1,] # ISEG IRCH div gw seq idx ts iter ct # 1 1 265 229 1 1 1 1 1 I'm attempting to carry out 2 data mining tasks: 1) for each time step ('ts') and each ISEG within those time steps, I want to difference the max and min 'div' values and store the result. I got close with this command, but the results were repeated: with(subset(dat,ts==1 & IRCH==1), ave(div,ISEG,FUN=function(x){max(x)-min(x)})) # 0 20 0 20 0 20 There are a few shortcomings with this line of script that I'm hoping could be improved upon: (1) I hard-coded ts==1, ideally, the R script would iterate over all 'ts', (2) I really only need the results printed once ("0 20"), and (3) it would be nice to store the results in something like this (the results shown next are from a brute-force hand-calculation on the dat data.frame): ISEG IRCH ts div_diff 1 1 1 0 54 1 1 20 1 1 2 0 54 1 2 20 1 1 3 0 54 1 3 200 2) The second data-mining attempt is a bit more convoluted and to demonstrate what I'd like to get, here it is in parts. First: aggregate(gw ~ ISEG + iter, subset(dat,ts==1), sum) ISEG iter gw 1 1 1 723 2 54 1 2321 3 1 2 706 4 54 2 2419 5 1 3 922 6 54 3 2620 Once again, ts==1 is hard-coded, but this would ideally loop through all unique 'ts' in the dataset. Next, with this result, I'd like to difference the maximum and minimum 'gw' values associated with each ISEG. I tried: with(aggregate(gw ~ ISEG + iter, subset(dat,ts==1), sum), ave(gw, ISEG, function(x){max(x)-min(x)})) #Error in unique.default(x, nmax = nmax) : # unique() applies only to vectors but didn't know what to do with the error. for the result above, the answer I'm seeking is: ISEG diff 1 216 54 289 For ts==1, the value of 216 results from 922 - 706 [max(gw) - min(gw) for ISEG==1] and the value of 289 results from 2620 - 2321 [max(gw) - min(gw) for ISEG==54]. So, the culmination of what I'm after would be the result from data-mining effort 1 + data-mining effort 2: ISEG IRCH ts div_diff gw_diff 1 1 1 0 216 54 1 1 20 289 1 1 2 0 16 54 1 2 20 157 1 1 3 0 152 54 1 3 200 25 [[alternative HTML version deleted]]