Morway, Eric
2016-Nov-28 17:38 UTC
[R] Manipulating groups of boolean data subject to group size and distance from other groups
The example below is a pared-down version of a much larger dataset. My goal is to use the binary data contained in DF$col2 to guide manipulation of the binary data itself, subject to the following: - Groups of '1' that are separated from other, larger groups of "1's" in 'col2' by 2 or more years should be converted to "0" - Groups of '1' need to be at least 2 consecutive years to be preserved So in the example provided below, DF$col2 would be manipulated such that its values are overrided to: c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1) That is, the first group of 1's in positions 2 through 6 are separated from other groups of 1's by 2 (or more) years, and the second group of 1's (positions 11 & 12) span only a single year and do not meet the criteria of being at least 2 years long. The example R script below shows a small example I'm working with, called "DF". The code that comes after the first line is my attempt to go through some R-gymnastics to append a column to DF called "isl2" that reflects the number of consecutive years in the 0/1 groups, where the +/- sign acts as (or denotes) the original binary condition: 0 = negative, 1 = positive. However, I'm stuck with how to proceed further. Could someone please help me come up with script that modifies DF$col2 shown below to be like that shown above? DF <- data.frame(col1=rep(1991:2004, each=2),col2=c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)) DF$inc <- c(0, abs(diff(DF$col2))) DF$cum <- cumsum(DF$inc) ex1 <- aggregate(col1 ~ cum, data=DF, function(x) length(unique(x))) names(ex1) <- c('cum','isl') tmp1a <- merge(DF, ex1, by="cum", all.x=TRUE) tmp1a$isl2 <- (-1*tmp1a$col2) * tmp1a$isl tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0] DF$grpng <- tmp1a$isl2 At this point I was thinking I could use DF$grpng to sweep through col2 and make adjustments, but I didn't know how to proceed. For debugging purposes, a slightly different example would go from: DF <- data.frame(col1=rep(1991:2004, each=2),col2=c(1,1,1,1, 1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)) to 'col2' looking like: c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1) That is, even though the first group of 1's is greater than two consecutive years, it is separated from a larger group of 1's by 2 (or more years). [[alternative HTML version deleted]]
David Winsemius
2016-Nov-28 20:25 UTC
[R] Manipulating groups of boolean data subject to group size and distance from other groups
> On Nov 28, 2016, at 9:38 AM, Morway, Eric <emorway at usgs.gov> wrote: > > The example below is a pared-down version of a much larger dataset. My > goal is to use the binary data contained in DF$col2 to guide manipulation > of the binary data itself, subject to the following: > > - Groups of '1' that are separated from other, larger groups of "1's" in > 'col2' by 2 or more years should be converted to "0" > - Groups of '1' need to be at least 2 consecutive years to be preserved > > So in the example provided below, DF$col2 would be manipulated such that > its values are overrided to: > > c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1) > > That is, the first group of 1's in positions 2 through 6 are separated from > other groups of 1's by 2 (or more) years, and the second group of 1's > (positions 11 & 12) span only a single year and do not meet the criteria of > being at least 2 years long. > > The example R script below shows a small example I'm working with, called > "DF". The code that comes after the first line is my attempt to go through > some R-gymnastics to append a column to DF called "isl2" that reflects the > number of consecutive years in the 0/1 groups, where the +/- sign acts as > (or denotes) the original binary condition: 0 = negative, 1 = positive. > However, I'm stuck with how to proceed further. Could someone please help > me come up with script that modifies DF$col2 shown below to be like that > shown above? > > DF <- data.frame(col1=rep(1991:2004, > each=2),col2=c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1))It's not clear from you verbal description why the first group pf 1's with length 4 is discarded while the second group of ones also of length 4 is preserved. There's ambiguity in the rules about "how large" a run must be in order to be "safe" from removal. In any case the answer will almost surely involve the use of the rle function which if you have not encountered it should be your next visit to the help pages. -- David,> > DF$inc <- c(0, abs(diff(DF$col2))) > DF$cum <- cumsum(DF$inc) > > ex1 <- aggregate(col1 ~ cum, data=DF, function(x) length(unique(x))) > names(ex1) <- c('cum','isl') > > tmp1a <- merge(DF, ex1, by="cum", all.x=TRUE) > tmp1a$isl2 <- (-1*tmp1a$col2) * tmp1a$isl > tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0] > > DF$grpng <- tmp1a$isl2 > > At this point I was thinking I could use DF$grpng to sweep through col2 and > make adjustments, but I didn't know how to proceed. > > For debugging purposes, a slightly different example would go from: > > DF <- data.frame(col1=rep(1991:2004, each=2),col2=c(1,1,1,1, > 1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)) > > to 'col2' looking like: > > c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1) > > That is, even though the first group of 1's is greater than two consecutive > years, it is separated from a larger group of 1's by 2 (or more years).> > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius Alameda, CA, USA
Morway, Eric
2016-Nov-28 22:20 UTC
[R] Manipulating groups of boolean data subject to group size and distance from other groups
To help with the clarification, I renamed 'col1' to 'year' and 'col2' to 'origDat'. With that said... The reason the second 'block' of 1's (four consecutive 1's appearing in DF$origDat[11:14]) is preserved is because they are only separated by a total of 1 year (1998 in DF$year) from a larger group of consecutive 1's (years 1999 through 2002). Because the first block of 1's are separated from from any other block of ones by at least 2 years, which I have deemed to be too large of a gap in data (0's are a surrogate for missing data), the 1's appearing in DF$year[3:6] should be reset to 0. I modified the script based on David's suggestion of rle (I was previously unaware of it) to that shown below, and it works for all three example DF's provided at the top of the script. That is, after running the script with any of the first 3 DF's provided, the data in DF$finalDat (as compared to DF$origDat) is reflective of what I'm after. HOWEVER, the use of nested while loops and if statements strikes me as antithetical to elegant R scripting. Second, my script, as currently constituted, has a significant bug in that the rules I've set forth are not completely satisfied. If DF4 is used (uncomment the line: "DF <- DF4") the blocks of 1's at the beginning and end of DF$origDat are preserved, whereas the middle (and largest continuous) block of 1's appearing in the middle of DF$origDat are reset to 0. Thus, I think I'm in need of a more elegant way of pursuing this problem...should anyone be so inclined to offer of additional thoughts. The (semi-) working script using rle is: DF <- data.frame(year=rep(1991:2004, each=2), origDat=c(0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1)) #DF <- data.frame(year=rep(1991:2004, each=2), # origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)) #DF <- data.frame(year=rep(1991:2004, each=2), # origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)) # An example that doesn't work DF4 <- data.frame(year=rep(1991:2004, each=2), origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1)) #DF <- DF4 DF$inc <- c(1, abs(diff(DF$origDat))) DF$cumsum <- cumsum(DF$inc) ex1 <- aggregate(year ~ cumsum, data=DF, function(x) length(unique(x))) names(ex1) <- c('cumsum','isl') tmp1a <- merge(DF, ex1, by="cumsum", all.x=TRUE) tmp1a$isl2 <- (-1*tmp1a$origDat) * tmp1a$isl tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0] tmp1a$isl2 <- -1 * tmp1a$isl2 DF$grpng <- tmp1a$isl2 runlen <- data.frame(cumsum = seq(1:length(rle(DF$grpng)$lengths)), len = rle(DF$grpng)$lengths, val = rle(DF$grpng)$values) i <- 1 while(i <= nrow(runlen)){ if(runlen[i,'val'] >= 2){ # As long as a '-2' or smaller doesn't follow, # then the current group of data is NOT # too 'distant' from other data and should be # preserved. Otherwise, the current grp of # 1's should be reset to 0 j <- i + 1 while(j <= nrow(runlen)){ if(runlen[j,'val'] <= -2){ # If code enters here, then swich the sign of 'val' to # effectively inactivate this block of 1's runlen[i,'val'] <- -1 * runlen[i,'val'] } #print(paste0("j: ",as.character(j))) j <- j + 1 } } else if (runlen[i,'val'] > 0 & runlen[i,'val'] < 2){ # If the script enters here, then the current group of data # doesn't meet the minimum continuous length requirement of # 2 or more years (in this example a check of >0 & <2 seems # silly, but in the real-world dataset 2 will be replaced with # a much larger example. runlen[i,'val'] <- -1 * runlen[i,'val'] } #print(paste0("i: ",as.character(i))) i <- i + 1 } runlen$finalDat <- ifelse(runlen$val < 0, 0, 1) DF <- merge(DF, runlen, by = 'cumsum', all.x = TRUE) DF [[alternative HTML version deleted]]