Max Brondfield
2012-May-23 20:42 UTC
[R] Using NA as a break point for indicator variable?
Hi all,

I am working with a spatial data set for which I am only interested in high concentration values ("leaks"). The low values (< 90th percentile) have already been turned into NAs, leaving me with a matrix like this:

> CH4_leak
         lon      lat      CH4
1  -71.11954 42.35068 2.595834
2  -71.11954 42.35068 2.595688
3         NA       NA       NA
4         NA       NA       NA
5         NA       NA       NA
6  -71.11948 42.35068 2.435762
7  -71.11948 42.35068 2.491003
8         NA       NA       NA
9  -71.11930 42.35068 2.464475
10 -71.11932 42.35068 2.470865

Every time an NA comes up, it means the "leak" is gone, and the next valid value represents a different leak (at a different location). My goal is to tag all of the remaining values with an indicator variable to spatially distinguish the leaks. I am envisioning a simple numeric indicator such as:

         lon      lat      CH4 leak_num
1  -71.11954 42.35068 2.595834        1
2  -71.11954 42.35068 2.595688        1
3         NA       NA       NA       NA
4         NA       NA       NA       NA
5         NA       NA       NA       NA
6  -71.11948 42.35068 2.435762        2
7  -71.11948 42.35068 2.491003        2
8         NA       NA       NA       NA
9  -71.11930 42.35068 2.064475        3
10 -71.11932 42.35068 2.070865        3

Does anyone have any thoughts on how to code this, perhaps using the NA values as a "break point"? The data set is far too large to do this manually, and I must admit I'm completely at a loss. Any help would be much appreciated!

Best,
Max
Hello,

Assuming that 'd' is your original data.frame and that you've set entire rows to NA, try this:

    ## compute ix before adding leak_num, so that the apply() alternative
    ## below does not see the all-NA leak_num column
    ix <- !is.na(d[, 1])  # any column will do, entire row is NA
    ## alternative, if other rows may have NAs, due to something else
    #ix <- apply(d, 1, function(x) all(!is.na(x)))
    d$leak_num <- NA
    r <- rle(ix)
    v <- cumsum(r$values)
    d$leak_num[ix] <- rep(v[r$values], r$lengths[r$values])
    d

Hope this helps,

Rui Barradas
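To make the rle() idea easier to follow, here is a minimal sketch that rebuilds the example data from the original post and shows the intermediate values. The data frame name `d` follows the reply; `run_id` is a name introduced here for illustration only, and the ifelse() step is a small variation rather than the code as posted.

```r
# Example data from the original post (first table).
d <- data.frame(
  lon = c(-71.11954, -71.11954, NA, NA, NA,
          -71.11948, -71.11948, NA, -71.11930, -71.11932),
  lat = c(42.35068, 42.35068, NA, NA, NA,
          42.35068, 42.35068, NA, 42.35068, 42.35068),
  CH4 = c(2.595834, 2.595688, NA, NA, NA,
          2.435762, 2.491003, NA, 2.464475, 2.470865)
)

ix <- !is.na(d$lon)         # TRUE for rows that hold a measurement
r  <- rle(ix)               # runs of consecutive TRUE / FALSE values
# r$values : TRUE FALSE TRUE FALSE TRUE
# r$lengths: 2 3 2 1 2

run_id <- cumsum(r$values)  # 1 1 2 2 3: counts the TRUE runs seen so far

# Keep the counter for TRUE runs, use NA for the NA runs,
# then expand each run back to its original length.
d$leak_num <- rep(ifelse(r$values, run_id, NA), r$lengths)
d
```

Because rle() collapses each stretch of identical values into one entry, a run of several NA rows counts as a single break, which is what the question asks for.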
William Dunlap
2012-May-24 14:59 UTC
[R] Using NA as a break point for indicator variable?
> Does anyone have any thoughts on how to code this, perhaps using the NA
> values as a "break point"?

You can count the cumulative number of NA breakpoints in a vector with cumsum(is.na(vector)), as in

  > cbind(d, LeakNo=with(d, cumsum(is.na(lon)|is.na(lat)|is.na(CH4))))
           lon      lat      CH4 LeakNo
  1  -71.11954 42.35068 2.595834      0
  2  -71.11954 42.35068 2.595688      0
  3         NA       NA       NA      1
  4         NA       NA       NA      2
  5         NA       NA       NA      3
  6  -71.11948 42.35068 2.435762      3
  7  -71.11948 42.35068 2.491003      3
  8         NA       NA       NA      4
  9  -71.11930 42.35068 2.464475      4
  10 -71.11932 42.35068 2.470865      4

Add 1 if you want to start with 1. If you only want to increase the count after each sequence of NA's then you could use rle() or

  > na <- with(d, is.na(lon)|is.na(lat)|is.na(CH4))
  > cbind(d, LeakNo=cumsum(c(TRUE, na[-1] < na[-length(na)])))
           lon      lat      CH4 LeakNo
  1  -71.11954 42.35068 2.595834      1
  2  -71.11954 42.35068 2.595688      1
  3         NA       NA       NA      1
  4         NA       NA       NA      1
  5         NA       NA       NA      1
  6  -71.11948 42.35068 2.435762      2
  7  -71.11948 42.35068 2.491003      2
  8         NA       NA       NA      2
  9  -71.11930 42.35068 2.464475      3
  10 -71.11932 42.35068 2.470865      3

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
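The cumsum() approach can also be adapted to give exactly the layout requested in the original post, with NA in the indicator column on the NA rows. The following is only a sketch of one way to do that: the function name `tag_leaks` and the column name `leak_num` are chosen here for illustration, and it assumes a row counts as a break when any of lon, lat or CH4 is NA.

```r
# Hypothetical helper: number each block of consecutive non-NA rows.
tag_leaks <- function(d) {
  # TRUE for rows that act as a break (any of the three columns is NA)
  na <- is.na(d$lon) | is.na(d$lat) | is.na(d$CH4)

  # A leak starts at a non-NA row whose predecessor is an NA row;
  # row 1 is treated as if it were preceded by a break.
  starts <- !na & c(TRUE, na[-length(na)])

  leak_num <- cumsum(starts)  # running count of leak starts
  leak_num[na] <- NA          # blank out the break rows themselves
  cbind(d, leak_num = leak_num)
}

# Example data from the original post (first table).
d <- data.frame(
  lon = c(-71.11954, -71.11954, NA, NA, NA,
          -71.11948, -71.11948, NA, -71.11930, -71.11932),
  lat = c(42.35068, 42.35068, NA, NA, NA,
          42.35068, 42.35068, NA, 42.35068, 42.35068),
  CH4 = c(2.595834, 2.595688, NA, NA, NA,
          2.435762, 2.491003, NA, 2.464475, 2.470865)
)

res <- tag_leaks(d)
res
```

Unlike the c(TRUE, ...) form above, this variant only starts counting at the first non-NA row, so a data set that begins with NA rows would still number its first leak as 1.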