hind lazrak
2011-Feb-26 23:37 UTC
[R] how to remove rows in which 2 or more observations are smaller than a given threshold?
Hello,

The data set I am examining has 7425 observations (rows with unique identifiers) and 46 samples (columns).

I have been trying to generate a data set that filters out observations that are "negligible". The definition of "negligible" is an absolute value less than or equal to 1.58.

The rule that I would like to adopt to create the new data set is: drop rows in which 2 or more observations have absolute values <= 1.58.

Since I have a unique identifier per row, I have tried to reshape the data so I could create a new variable using an ifelse statement that would flag observations <= 1.58, but I am not getting anywhere with this approach.

I could not come up with an apply function that counts the number of observations for which the absolute values are below the cutoff I've specified.

All observations are numerical and I don't have missing values.

Thank you in advance for the help,
Hind
Phil Spector
2011-Feb-26 23:49 UTC
[R] how to remove rows in which 2 or more observations are smaller than a given threshold?
If the matrix in question is named "mymat", then

    mymat[apply(mymat, 1, function(x) sum(abs(x) <= 1.58) < 2), ]

(untested due to the lack of a reproducible example) should give you a matrix without any rows containing two or more values with absolute value less than or equal to 1.58. I'm not sure ifelse would be of much use here.

- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spector at stat.berkeley.edu
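Since the one-liner above was posted untested, here is a small reproducible check, offered only as a sketch. The matrix values and the row and column names are made up for illustration; only the cutoff of 1.58 and the "fewer than 2 small values" rule come from the thread.

```r
# Toy data (hypothetical values): 4 rows, 3 columns, row names act as identifiers
mymat <- matrix(c( 2.0,  -3.1,  4.2,   # no small values    -> keep
                   0.5,   2.7, -1.9,   # one small value    -> keep
                   1.58, -0.2,  5.0,   # two small values   -> drop
                  -1.0,   0.3,  0.1),  # three small values -> drop
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("id", 1:4), c("s1", "s2", "s3")))

# Count, per row, how many entries have absolute value <= 1.58
n_small <- apply(mymat, 1, function(x) sum(abs(x) <= 1.58))

# Keep rows with fewer than 2 such entries.
# drop = FALSE keeps the result a matrix even if only one row survives.
kept <- mymat[n_small < 2, , drop = FALSE]

print(n_small)
print(kept)
```

In this toy case the rows named id1 and id2 are kept. Adding `drop = FALSE` is a small deviation from the one-liner: without it, a result with a single surviving row would be returned as a plain vector.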
William Dunlap
2011-Feb-27 00:34 UTC
[R] how to remove rows in which 2 or more observations are smaller than a given threshold?
You didn't say if your data set was a matrix or a data.frame. Here are 2 functions that do the job on either, and one that only works with data.frames but is faster (a similar speedup is available for matrices as well). They all compute the number of small values in each row, nSmall, and extract the rows for which nSmall is less than 2.

    f0 <- function (x) {
        # row-by-row count via apply()
        nSmall <- apply(x, 1, function(row) sum(abs(row) <= 1.58))
        x[nSmall < 2, , drop = FALSE]
    }
    f1 <- function (x) {
        # vectorized count via rowSums()
        nSmall <- rowSums(abs(x) <= 1.58)
        x[nSmall < 2, , drop = FALSE]
    }
    f2 <- function (x) {
        # data.frame only: accumulate the count one column at a time
        stopifnot(is.data.frame(x))
        nSmall <- 0
        for (column in x) {
            nSmall <- nSmall + (abs(column) <= 1.58)
        }
        x[nSmall < 2, , drop = FALSE]
    }

For a 10^5 row by 50 column data.frame I got the following times:

    > system.time(r0 <- f0(z))
       user  system elapsed
       2.39    0.04    2.51
    > system.time(r1 <- f1(z))
       user  system elapsed
       0.42    0.08    0.51
    > system.time(r2 <- f2(z))
       user  system elapsed
       0.21    0.05    0.24
    > identical(r0, r1) && identical(r0, r2)
    [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
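The data frame `z` behind the timings above was not posted, so the following is only a sketch of how such a comparison might be set up. The simulated data (normal draws with `sd = 10`, 10^4 rows), the function names `count_small_df` and `count_small_mat`, and the `cutoff` argument are all assumptions made for illustration. The matrix variant is one possible reading of the remark that a similar speedup is available for matrices, not necessarily what was meant.

```r
# Simulated stand-in for z (assumption: the original data were not shown)
set.seed(1)
z <- as.data.frame(matrix(rnorm(1e4 * 50, sd = 10), ncol = 50))

# Column-wise count for a data.frame, mirroring the loop in f2 above
count_small_df <- function(x, cutoff = 1.58) {
  stopifnot(is.data.frame(x))
  n_small <- integer(nrow(x))
  for (column in x) n_small <- n_small + (abs(column) <= cutoff)
  n_small
}

# Hypothetical matrix variant: the same loop, indexing columns explicitly
count_small_mat <- function(x, cutoff = 1.58) {
  stopifnot(is.matrix(x))
  n_small <- integer(nrow(x))
  for (j in seq_len(ncol(x))) n_small <- n_small + (abs(x[, j]) <= cutoff)
  n_small
}

zm <- as.matrix(z)
print(system.time(n_df  <- count_small_df(z)))
print(system.time(n_mat <- count_small_mat(zm)))
print(system.time(n_ref <- rowSums(abs(zm) <= 1.58)))

# All three counts should agree; keep rows with fewer than 2 small values
stopifnot(all(n_df == n_mat), all(n_df == n_ref))
filtered <- z[n_df < 2, , drop = FALSE]
print(dim(filtered))
```

Timings will vary by machine and R version, so none are quoted here. If the identifiers live in a column rather than in the row names, that column would need to be excluded before counting, since `abs()` fails on non-numeric columns.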