thr3ads.net - R help - [R] Manipulating groups of boolean data subject to group size and distance from other groups [Nov 2016]

If this information is useful, please help other people find it:
Share via:

Morway, Eric

2016-Nov-28 17:38 UTC

[R] Manipulating groups of boolean data subject to group size and distance from other groups

The example below is a pared-down version of a much larger dataset.  My
goal is to use the binary data contained in DF$col2 to guide manipulation
of the binary data itself, subject to the following:

   - Groups of '1' that are separated from other, larger groups of
"1's" in
   'col2' by 2 or more years should be converted to "0"
   - Groups of '1' need to be at least 2 consecutive years to be
preserved

So in the example provided below, DF$col2 would be manipulated such that
its values are overrided to:

c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)

That is, the first group of 1's in positions 2 through 6 are separated from
other groups of 1's by 2 (or more) years, and the second group of 1's
(positions 11 & 12) span only a single year and do not meet the criteria of
being at least 2 years long.

The example R script below shows a small example I'm working with, called
"DF".  The code that comes after the first line is my attempt to go
through
some R-gymnastics to append a column to DF called "isl2" that reflects
the
number of consecutive years in the 0/1 groups, where the +/- sign acts as
(or denotes) the original binary condition: 0 = negative, 1 = positive.
However, I'm stuck with how to proceed further.  Could someone please help
me come up with script that modifies DF$col2 shown below to be like that
shown above?

DF <- data.frame(col1=rep(1991:2004,
each=2),col2=c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1))

DF$inc <- c(0, abs(diff(DF$col2)))
DF$cum <- cumsum(DF$inc)

ex1 <- aggregate(col1 ~ cum, data=DF, function(x) length(unique(x)))
names(ex1) <- c('cum','isl')

tmp1a <- merge(DF, ex1, by="cum", all.x=TRUE)
tmp1a$isl2 <- (-1*tmp1a$col2) * tmp1a$isl
tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0]

DF$grpng <- tmp1a$isl2

At this point I was thinking I could use DF$grpng to sweep through col2 and
make adjustments, but I didn't know how to proceed.

For debugging purposes, a slightly different example would go from:

DF <- data.frame(col1=rep(1991:2004, each=2),col2=c(1,1,1,1,
1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1))

to 'col2' looking like:

c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)

That is, even though the first group of 1's is greater than two consecutive
years, it is separated from a larger group of 1's by 2 (or more years).

	[[alternative HTML version deleted]]

David Winsemius

2016-Nov-28 20:25 UTC

head link

[R] Manipulating groups of boolean data subject to group size and distance from other groups

> On Nov 28, 2016, at 9:38 AM, Morway, Eric <emorway at usgs.gov>
wrote:
> 
> The example below is a pared-down version of a much larger dataset.  My
> goal is to use the binary data contained in DF$col2 to guide manipulation
> of the binary data itself, subject to the following:
> 
>   - Groups of '1' that are separated from other, larger groups of
"1's" in
>   'col2' by 2 or more years should be converted to "0"
>   - Groups of '1' need to be at least 2 consecutive years to be
preserved
> 
> So in the example provided below, DF$col2 would be manipulated such that
> its values are overrided to:
> 
> c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)
> 
> That is, the first group of 1's in positions 2 through 6 are separated
from
> other groups of 1's by 2 (or more) years, and the second group of
1's
> (positions 11 & 12) span only a single year and do not meet the
criteria of
> being at least 2 years long.
> 
> The example R script below shows a small example I'm working with,
called
> "DF".  The code that comes after the first line is my attempt to
go through
> some R-gymnastics to append a column to DF called "isl2" that
reflects the
> number of consecutive years in the 0/1 groups, where the +/- sign acts as
> (or denotes) the original binary condition: 0 = negative, 1 = positive.
> However, I'm stuck with how to proceed further.  Could someone please
help
> me come up with script that modifies DF$col2 shown below to be like that
> shown above?
> 
> DF <- data.frame(col1=rep(1991:2004,
> each=2),col2=c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1))
It's not clear from you verbal description why the first group pf 1's
with length 4 is discarded while the second group of ones also of length 4 is
preserved. There's ambiguity in the rules about "how large" a run
must be in order to be "safe" from removal.

In any case the answer will almost surely involve the use of the rle function
which if you have not encountered it should be your next visit to the help
pages.

-- 
David,> 
> DF$inc <- c(0, abs(diff(DF$col2)))
> DF$cum <- cumsum(DF$inc)
> 
> ex1 <- aggregate(col1 ~ cum, data=DF, function(x) length(unique(x)))
> names(ex1) <- c('cum','isl')
> 
> tmp1a <- merge(DF, ex1, by="cum", all.x=TRUE)
> tmp1a$isl2 <- (-1*tmp1a$col2) * tmp1a$isl
> tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0]
> 
> DF$grpng <- tmp1a$isl2
> 
> At this point I was thinking I could use DF$grpng to sweep through col2 and
> make adjustments, but I didn't know how to proceed.
> 
> For debugging purposes, a slightly different example would go from:
> 
> DF <- data.frame(col1=rep(1991:2004, each=2),col2=c(1,1,1,1,
> 1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1))
> 
> to 'col2' looking like:
> 
> c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)
> 
> That is, even though the first group of 1's is greater than two
consecutive
> years, it is separated from a larger group of 1's by 2 (or more years).
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA

Morway, Eric

2016-Nov-28 22:20 UTC

head link

[R] Manipulating groups of boolean data subject to group size and distance from other groups

To help with the clarification, I renamed 'col1' to 'year' and
'col2' to
'origDat'.  With that said...

The reason the second 'block' of 1's (four consecutive 1's
appearing in
DF$origDat[11:14]) is preserved is because they are only separated by a
total of 1 year (1998 in DF$year) from a larger group of consecutive 1's
(years 1999 through 2002).  Because the first block of 1's are separated
from from any other block of ones by at least 2 years, which I have deemed
to be too large of a gap in data (0's are a surrogate for missing data),
the 1's appearing in DF$year[3:6] should be reset to 0.

I modified the script based on David's suggestion of rle (I was previously
unaware of it) to that shown below, and it works for all three example DF's
provided at the top of the script. That is, after running the script with
any of the first 3 DF's provided, the data in DF$finalDat (as compared to
DF$origDat) is reflective of what I'm after.

HOWEVER, the use of nested while loops and if statements strikes me as
antithetical to elegant R scripting.  Second, my script, as currently
constituted, has a significant bug in that the rules I've set forth are not
completely satisfied.  If DF4 is used (uncomment the line: "DF <-
DF4") the
blocks of 1's at the beginning and end of DF$origDat are preserved, whereas
the middle (and largest continuous) block of 1's appearing in the middle of
DF$origDat are reset to 0.  Thus, I think I'm in need of a more elegant way
of pursuing this problem...should anyone be so inclined to offer of
additional thoughts.

The (semi-) working script using rle is:

DF <- data.frame(year=rep(1991:2004, each=2),

 origDat=c(0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1))

#DF <- data.frame(year=rep(1991:2004, each=2),
#
origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1))

#DF <- data.frame(year=rep(1991:2004, each=2),
#
origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1))

# An example that doesn't work
DF4 <- data.frame(year=rep(1991:2004, each=2),

 origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1))
#DF <- DF4

DF$inc <- c(1, abs(diff(DF$origDat)))
DF$cumsum <- cumsum(DF$inc)

ex1 <- aggregate(year ~ cumsum, data=DF, function(x) length(unique(x)))
names(ex1) <- c('cumsum','isl')

tmp1a <- merge(DF, ex1, by="cumsum", all.x=TRUE)
tmp1a$isl2 <- (-1*tmp1a$origDat) * tmp1a$isl
tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0]
tmp1a$isl2 <- -1 * tmp1a$isl2

DF$grpng <- tmp1a$isl2

runlen <- data.frame(cumsum = seq(1:length(rle(DF$grpng)$lengths)),
                     len = rle(DF$grpng)$lengths,
                     val = rle(DF$grpng)$values)

i <- 1
while(i <= nrow(runlen)){
  if(runlen[i,'val'] >= 2){  # As long as a '-2' or smaller
doesn't follow,
                            # then the current group of data is NOT
                            # too 'distant' from other data and should
be
                            # preserved.  Otherwise, the current grp of
                            # 1's should be reset to 0
    j <- i + 1
    while(j <= nrow(runlen)){
      if(runlen[j,'val'] <= -2){
        # If code enters here, then swich the sign of 'val' to
        # effectively inactivate this block of 1's
        runlen[i,'val'] <- -1 * runlen[i,'val']
      }
      #print(paste0("j: ",as.character(j)))
      j <- j + 1
    }
  } else if (runlen[i,'val'] > 0 & runlen[i,'val'] <
2){
    # If the script enters here, then the current group of data
    # doesn't meet the minimum continuous length requirement of
    # 2 or more years (in this example a check of >0 & <2 seems
    # silly, but in the real-world dataset 2 will be replaced with
    # a much larger example.
    runlen[i,'val'] <- -1 * runlen[i,'val']
  }
  #print(paste0("i: ",as.character(i)))
  i <- i + 1
}

runlen$finalDat <- ifelse(runlen$val < 0, 0, 1)
DF <- merge(DF, runlen, by = 'cumsum', all.x = TRUE)
DF

	[[alternative HTML version deleted]]

R help - Nov 2016 - Manipulating groups of boolean data subject to group size and distance from other groups

[R] Manipulating groups of boolean data subject to group size and distance from other groups

[R] Manipulating groups of boolean data subject to group size and distance from other groups

[R] Manipulating groups of boolean data subject to group size and distance from other groups