Hi.

I have a problem that I can't seem to find a good way of solving other than by doing things manually. I'm trying to subset a data frame by the number of observations that occurred at a given row, but I want to take into account the number of observations in preceding rows. Here's an example.

I'm looking at intervals of data [10,20), [10,30), ..., [10,120) which contain a certain number of observations for treatment A and treatment B. An example is given by the following code.

> int <- as.factor(paste("[", rep(10, 11), ",", seq(20,120, by=10), ")"))
> nsamA <- c(62, 83, 118, 151, 180, 201, 212, 215, 216, 217, 218)
> nsamB <- c(65, 90, 128, 163, 190, 199, 209, 214, 215, 216, 218)
> df0 <- data.frame(int, nsamA, nsamB)
> df0

Since the interval [10, s) with n_s samples is nested in [10, t) with n_t samples for s < t, we know n_t - n_s samples exist in the interval [s, t). If this difference is small I want to exclude the interval [10,s). This can be done by comparing adjacent rows using the following.

> df0$itagA <- ifelse(c(10, diff(nsamA)) <= 4, 1, 0)
> df0$itagB <- ifelse(c(10, diff(nsamB)) <= 4, 1, 0)
> df0
> # Subset df0 on the tag results
> df1 <- df0[df0$itagA != 1 & df0$itagB != 1,]
> df1

This works fine, but here is my problem. It looks only at the immediately preceding row and not at rows further "down the line". What I would like to do is include the next interval that adds 5 or more samples from each group, since earlier intervals are nested in the later intervals. In the example given this would include the final interval [10,120), as this adds more than 4 samples for each treatment. I can do this by hand using something like

> df0[c(1:7,11),]

But this is not an attractive solution, as it requires me to actually look at the data set each time and determine the row numbers. This works for this case, but I have many intervals (rows of data) to look at and this would be cumbersome.

I've considered using diff with different lag arguments, but this still doesn't seem to work. I also want to note that I need to keep the int factor (as used in the example above), as this is used throughout my analysis (i.e. this is a true factor variable and not simply denoting an interval).

I'd be grateful for any possible suggestions as I'm stumped at this moment.

Thanks,
Mat

R v. 2.0.1 on Windows XP

Disclaimer: The views and opinions expressed in this email are of the author and not of the Food and Drug Administration.

***********************************************************************
Mat Soukup, Ph.D.
Mathematical Statistician, Biometrics III
Center for Drug Evaluation and Research
9201 Corporate Blvd. Rm. N250
Phone: 301.827.2081
***********************************************************************
Gabor Grothendieck
2005-Feb-05 04:36 UTC
[R] interval partition problem [was: (no subject)]
Soukup, Matt <SoukupM <at> cder.fda.gov> writes:

: Hi.
:
: I have a problem that I can't seem to find an optimal way of solving other
: than by doing things manually. I'm trying to subset a data frame by the
: number of observations that occurred at a given row but want to take into
: account the number of observations of preceding rows. Here's an example.
:
: I'm looking at intervals of data [10,20), [10, 30), ....., [10,120) which
: contain a certain number of observations for treatment A and treatment B. An
: example is given by the following code.
:
: > int <- as.factor(paste("[", rep(10, 11), ",", seq(20,120, by=10), ")"))
: > nsamA <- c(62, 83, 118, 151, 180, 201, 212, 215, 216, 217, 218)
: > nsamB <- c(65, 90, 128, 163, 190, 199, 209, 214, 215, 216, 218)
: > df0 <- data.frame(int, nsamA, nsamB)
: > df0
:
: Since the interval [10, s) with n_s samples is nested in [10, t) with n_t
: samples for s < t, we know n_t - n_s samples exist in the interval [s, t). If
: this sample size of the difference is small I want to exclude the interval
: [10,s). This can be done comparing adjacent preceding rows using the
: following.
:
: > df0$itagA <- ifelse(c(10, diff(nsamA)) <= 4, 1, 0)
: > df0$itagB <- ifelse(c(10, diff(nsamB)) <= 4, 1, 0)
: > df0
: > # Subset df0 on the tag results
: > df1 <- df0[df0$itagA != 1 & df0$itagB != 1,]
: > df1
:
: This works fine, but here is my problem. This simply looks at only the
: immediate preceding row and not at rows further "down the line". What I
: would like to do is include the next interval that includes 5 or more
: samples from each group since earlier intervals are nested in the latter
: intervals. In the example given this would include the final interval [10,
: 120) as this contains more than 4 samples for each treatment. I can do this
: by hand using something like
:
: > df0[c(1:7,11),]
:
: But this is not an attractive solution as it requires me to actually look at
: the data set each time and determine the row numbers. This works for this
: case, but I have many intervals (rows of data) to look at and this would be
: cumbersome. I've considered using diff with different lag arguments, but
: this still doesn't seem to work. I also want to note that I need to keep the
: int factor (as used in the example above) as this is used throughout my
: analysis (i.e. this is a true factor variable and not simply denoting an
: interval). I'd be grateful for any possible suggestions as I'm stumped at
: this moment.
:

Delete the rows one by one and recalculate the differences after each deletion (rather than diff'ing all at once and then deleting all at once). Also, assuming you want every interval to be covered, force the last interval to end at the last row.

Assume too.few(df0, i) is a function, not shown here, which returns TRUE if there are too few As or Bs in row i minus row i-1 of df0 and otherwise FALSE. Then:

    last.row <- df0[nrow(df0),]
    i <- 1
    while (i < nrow(df0)) {
        if (too.few(df0, i)) df0 <- df0[-i,] else i <- i + 1
    }
    df0[nrow(df0),] <- last.row

P.S. Please start a new thread rather than replying to an existing thread, and please use a meaningful subject.
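For anyone who wants to run the loop above, here is a minimal sketch of one way too.few could be written. The thread leaves the function undefined, so the body below, the min.new argument and the choice to always keep row 1 are assumptions for illustration only. The 5-sample cutoff is taken from the original post, and the result is built in a copy named res so that df0 is left untouched.

```r
# Example data from the original post.
int <- as.factor(paste("[", rep(10, 11), ",", seq(20, 120, by = 10), ")"))
nsamA <- c(62, 83, 118, 151, 180, 201, 212, 215, 216, 217, 218)
nsamB <- c(65, 90, 128, 163, 190, 199, 209, 214, 215, 216, 218)
df0 <- data.frame(int, nsamA, nsamB)

# Hypothetical helper (not from the thread): TRUE when row i adds fewer
# than min.new observations over row i - 1 for either treatment.
# Row 1 has no predecessor, so it is never flagged (an assumption).
too.few <- function(df, i, min.new = 5) {
  if (i == 1) return(FALSE)
  (df$nsamA[i] - df$nsamA[i - 1] < min.new) ||
    (df$nsamB[i] - df$nsamB[i - 1] < min.new)
}

# Row-by-row deletion: the difference is recomputed against the last
# row that was kept, not against the original neighbour.
res <- df0
last.row <- res[nrow(res), ]
i <- 1
while (i < nrow(res)) {
  if (too.few(res, i)) res <- res[-i, ] else i <- i + 1
}
res[nrow(res), ] <- last.row  # the last interval is always retained
res
```

With this helper the example keeps original rows 1 to 7, 10 and 11, which is not the same as df0[c(1:7,11),]: row 10 adds 5 As and 7 Bs over row 7 and so passes the check, and the last row is kept regardless of how few samples it adds. If the final gap must also meet the threshold, the stopping rule would need to change. The int column stays a factor with its original levels throughout.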