I need to run binomial tests (binom.test) on a large set of data, stored in a table - 600 tests in total.

The values of x are stored in a column, as are the values of n. The data for each test are on a separate row.

For example:

X  N
11 19
9  26
13 21
13 27
18 30

It is a two-tailed test, and P in all cases is 0.5.

My question is: Is there a quicker way of running these tests without having to type an individual command for each test - and ideally also to store the resulting p-values in a single data vector?

Many thanks for any pointers,

Andrew Wilson
Hi:

Here's one approach (not unique), dragged out a bit to illustrate its different components.

1. Create a list object, something like

    l <- vector('list', 600)

2. Populate it. There are several ways to do this, but one is to first create a vector of names and then populate the list by looping over those names. If your names have a simple format (dat001 - dat600, say), then it's easy to create the name vector with paste(); otherwise, you may need to do more work. Then run a loop that assigns to each list component the corresponding data frame, something like

    for(i in seq_along(filenames)) l[[i]] <- get(filenames[i])

   Note that get() looks up R objects by name, so this assumes the data frames already exist in your workspace under those names.

3. Create a function for one of the data sets, under the obvious proviso that you intend to process each data frame in the list the same way. To return only the p-values from a binomial test applied to each row of your input data frame, the following works for me (explanation below):

    f <- function(df) do.call(c, with(df, mapply(binom.test, x = X, n = N))[3, ])

4. Use lapply() to map the function to each component data frame in your list; the result will also be a list.

    pvlist <- lapply(l, f)

5. *IF* each of your data frames has the same number of rows, you can use the following to slurp together all the p-values into a matrix:

    do.call(rbind, pvlist)

   OTOH, if the number of rows varies from one data frame to another, it may be best to keep the p-value results in list form, or perhaps you could flatten them into a numeric vector, depending on your purposes.

----

The function f: mapply() allows you, in this case, to apply the non-vectorized function binom.test() to a pair of vector arguments supplied from the input data frame. The result is a 9 x n matrix where each column comprises the list of output from one of the n calls to binom.test() [where n = number of rows of the input data frame]. Since you wanted the p-values (component/row 3), we pull out the third row of the matrix.
This will return a list, so passing it to the concatenation function c() through do.call() coerces it into a numeric vector for output. The lapply() call maps the function f to each component of the list of data frames created in (2).

-----

An alternative approach to this problem would be to use the plyr package (and perhaps reshape, too), since it was designed to handle this 'split-apply-combine' strategy.

HTH,
Dennis

On Thu, Jul 29, 2010 at 1:05 AM, Wilson, Andrew <a.wilson@lancaster.ac.uk> wrote:

> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
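To make the explanation of f more concrete, here is a minimal sketch that walks through the same steps on the five example rows from the original question. The object names dat, res and pvals are only illustrative, and the p-value row is selected by name rather than by position, which should be equivalent to the [3, ] used in f.

```r
# Example rows from the original question
dat <- data.frame(X = c(11, 9, 13, 13, 18),
                  N = c(19, 26, 21, 27, 30))

# mapply() calls binom.test() once per row; p = 0.5 and a two-sided
# alternative are the defaults of binom.test()
res <- with(dat, mapply(binom.test, x = X, n = N))

dim(res)        # one row per component of the test result, one column per test
rownames(res)   # component names; the third one is "p.value"

# The selected row is a list, so unlist() flattens it to a numeric vector
pvals <- unlist(res["p.value", ])
pvals
```

Selecting by name is slightly more robust than selecting by row number, since it does not depend on the order of the components returned by binom.test().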
Hi:

As it turns out, this is pretty straightforward using plyr's ldply() function. Here's a toy example:

    d1 <- structure(list(X = c(11L, 9L, 13L, 13L, 18L),
                         N = c(19L, 26L, 21L, 27L, 30L)),
                    .Names = c("X", "N"), class = "data.frame",
                    row.names = c(NA, -5L))
    w <- sample(1:50, 5)
    d2 <- data.frame(X = mapply(rbinom, 1, w, 0.5), N = w)
    w <- sample(1:50, 5)
    d3 <- data.frame(X = mapply(rbinom, 1, w, 0.5), N = w)

    # Combine data frames into a list - since these are already R objects,
    # the call is easy:
    l <- list(d1, d2, d3)

    # the function:
    f <- function(df) do.call(c, with(df, mapply(binom.test, x = X, n = N))[3, ])

    # do.call + lapply:
    do.call(rbind, lapply(l, f))
              [,1]      [,2]       [,3]      [,4]      [,5]
    [1,] 0.6476059 0.1686375 0.38331032 1.0000000 0.3615946
    [2,] 0.3019956 0.6515878 0.02944937 0.5600646 1.0000000
    [3,] 1.0000000 1.0000000 0.81452942 0.0390625 0.4050322

    # plyr approach:
    library(plyr)
    ldply(l, f)
             V1        V2         V3        V4        V5
    1 0.6476059 0.1686375 0.38331032 1.0000000 0.3615946
    2 0.3019956 0.6515878 0.02944937 0.5600646 1.0000000
    3 1.0000000 1.0000000 0.81452942 0.0390625 0.4050322

Since d2 and d3 are randomly generated, rows 2 and 3 will differ from run to run.

ldply() takes a list as input, applies the supplied function to each component (the lapply() step above), and returns the results as a data frame. So the plyr approach can be summarized as:

1. Create a list of data frames.
2. Create a function to apply to each data frame.
3. Load the plyr package.
4. Run ldply().

Essentially, the plyr package provides a number of convenient 'wrapper' functions to simplify the 'split-apply-combine' strategy of data analysis for various combinations of input and output objects.

HTH,
Dennis
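If all 600 tests sit in a single table, as the original question describes, the list-of-data-frames machinery may not be needed at all. The following is a minimal sketch under that assumption; the data frame name dat and the column name p_value are hypothetical, and only the five example rows from the question are used as stand-in data.

```r
# Hypothetical data frame holding all tests, one test per row
# (only the five example rows from the question are shown here)
dat <- data.frame(X = c(11, 9, 13, 13, 18),
                  N = c(19, 26, 21, 27, 30))

# Run one two-sided binomial test per row against p = 0.5 and keep only
# the p-value; vapply() guarantees a plain numeric vector as the result
p_values <- vapply(
  seq_len(nrow(dat)),
  function(i) binom.test(dat$X[i], dat$N[i], p = 0.5,
                         alternative = "two.sided")$p.value,
  numeric(1)
)

# Optionally store the p-values alongside the data
dat$p_value <- p_values
```

vapply() is used here instead of sapply() so that an unexpected return value raises an error rather than silently changing the shape of the result.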