I need to run binomial tests (binom.test) on a large set of data, stored
in a table - 600 tests in total.

The values of x are stored in a column, as are the values of n. The
data for each test are on a separate row.

For example:

X N
11 19
9 26
13 21
13 27
18 30

It is a two-tailed test, and P in all cases is 0.5.

My question is: Is there a quicker way of running these tests without
having to type an individual command for each test - and ideally also to
store the resulting p-values in a single data vector?

Many thanks for any pointers,

Andrew Wilson
Hi:
Here's one approach (not unique), and dragged out a bit to illustrate its
different components.
1. Create a list object, something like
l <- vector('list', 600)
2. Populate it. There are several ways to do this, but one is to initially
create a vector of names and then populate the list by looping over the
names. If your names have a simple format (dat001 - dat600, say),
then it's easy to create the name vector with paste(); otherwise, you
may need to do more work. Then run a loop that assigns to each list
component the corresponding data frame. Note that get() looks up an
existing R object by name, so the loop below assumes the data frames are
already in your workspace under those names; if they are files on disk,
read them in instead (e.g., with read.table()). Something like
for(i in seq_along(filenames)) l[[i]] <- get(filenames[i])
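Steps 1 and 2 can be sketched as follows. This is only an illustration: the
three small data frames and their names (dat001 to dat003) are made up here,
and sprintf() is used in place of paste() to get the zero-padded numbers.

```r
# Hypothetical setup: three small data frames already in the workspace
dat001 <- data.frame(X = c(11, 9),  N = c(19, 26))
dat002 <- data.frame(X = c(13, 13), N = c(21, 27))
dat003 <- data.frame(X = c(18, 5),  N = c(30, 12))

# Build the vector of object names: "dat001" "dat002" "dat003"
objnames <- sprintf("dat%03d", 1:3)

# Pre-allocate the list, then fill it by looking up each object by name
l <- vector("list", length(objnames))
for (i in seq_along(objnames)) l[[i]] <- get(objnames[i])
names(l) <- objnames

# mget(objnames) is a one-call alternative to the loop above
```

Naming the list components is optional, but it makes it easier to trace each
result back to its source later.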
3. Create a function for one of the data sets, under the obvious proviso
that you intend to process each data frame in the list the same way. To
return only the p-values from a binomial test applied to each row of your
input data frame, the following works for me (explanation below):
f <- function(df)
do.call(c, with(df, mapply(binom.test, x = X, n = N))[3, ])
4. Use lapply() to map the function to each component data frame in your
list; the result will also be a list.
pvlist <- lapply(l, f)
5. *IF* each of your data frames has the same number of rows, you can use
the following to slurp together all the p-values into a matrix:
do.call(rbind, pvlist)
OTOH, if the number of rows varies from one data frame to another, it may be
best to keep the p-value results in list form, or perhaps you could flatten
them into a numeric vector, depending on your purposes.
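A minimal sketch of the two ways of combining the results. The p-values
below are made-up numbers for illustration, and combine_pvalues() is a
hypothetical helper name, not part of any package. With unequal lengths,
rbind() would recycle the shorter vectors (with a warning), so the sketch
falls back to unlist() in that case.

```r
# Made-up p-values for illustration only (not results from real data)
equal_rows  <- list(c(0.65, 0.17, 0.38), c(0.30, 0.65, 0.03))
ragged_rows <- list(c(0.65, 0.17, 0.38), c(0.30, 0.65))

# Hypothetical helper: matrix if all lengths agree, flat vector otherwise
combine_pvalues <- function(pvlist) {
  if (length(unique(lengths(pvlist))) == 1) {
    do.call(rbind, pvlist)   # one row per data frame
  } else {
    unlist(pvlist)           # one long numeric vector
  }
}

pvmat <- combine_pvalues(equal_rows)    # 2 x 3 matrix
pvvec <- combine_pvalues(ragged_rows)   # numeric vector of length 5

# Index of the originating data frame for each element of pvvec
src <- rep(seq_along(ragged_rows), times = lengths(ragged_rows))
```

Keeping an index such as src alongside the flattened vector preserves the
link between each p-value and the data frame it came from.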
----
The function f:
mapply() allows you, in this case, to apply the non-vectorized function
binom.test() to a pair of vector arguments supplied from the input data
frame. The result is a 9 x n matrix of mode list, where each column holds
the nine output components of one of the n calls to binom.test() [where
n = number of rows of the input data frame]. Since you wanted the p-values
(component/row 3), we pull out the third row of the matrix. This returns a
list, so passing it to the concatenation function c() via do.call() turns
it into a numeric vector for output.
The lapply() call maps the function f to each component of the list of data
frames created in (2).
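To make the structure of the mapply() result concrete, here is a small
sketch using the five example rows from the original question. It also
shows a variant (my suggestion, not part of the function f above) that
extracts $p.value inside an anonymous function, which avoids relying on the
position of the p-value in the output. For a single table, the resulting
vector can be stored directly as a new column.

```r
# The five example rows from the original question
df <- data.frame(X = c(11, 9, 13, 13, 18), N = c(19, 26, 21, 27, 30))

# Each column of res holds the components of one binom.test() result
res <- with(df, mapply(binom.test, x = X, n = N))
dim(res)           # components by tests
rownames(res)[3]   # name of the third component

# Extract the p-values by name rather than by position
pv_from_matrix <- unlist(res["p.value", ])

# Variant: pull out the p-value inside the function passed to mapply()
pv_direct <- mapply(function(x, n) binom.test(x, n)$p.value, df$X, df$N)

# Single-table case: keep the p-values next to the data
df$p.value <- pv_direct
```

binom.test() defaults to a two-sided test with p = 0.5, which matches the
setting described in the question, so no extra arguments are needed here.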
-----
An alternative approach to this problem would be to use the plyr (and
perhaps reshape, too) package, since it was designed to handle this
'split-apply-combine' strategy.
HTH,
Dennis
On Thu, Jul 29, 2010 at 1:05 AM, Wilson, Andrew
<a.wilson@lancaster.ac.uk>wrote:
> I need to run binomial tests (binom.test) on a large set of data, stored
> in a table - 600 tests in total.
>
> The values of x are stored in a column, as are the values of n. The
> data for each test are on a separate row.
>
> For example:
>
> X N
> 11 19
> 9 26
> 13 21
> 13 27
> 18 30
>
> It is a two-tailed test, and P in all cases is 0.5.
>
> My question is: Is there a quicker way of running these tests without
> having to type an individual command for each test - and ideally also to
> store the resulting p-values in a single data vector?
>
> Many thanks for any pointers,
>
> Andrew Wilson
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Hi:
As it turns out, this is pretty straightforward using plyr's ldply()
function. Here's a toy example:
d1 <- structure(list(X = c(11L, 9L, 13L, 13L, 18L),
                     N = c(19L, 26L, 21L, 27L, 30L)),
                .Names = c("X", "N"), class = "data.frame",
                row.names = c(NA, -5L))
w <- sample(1:50, 5)
d2 <- data.frame(X = mapply(rbinom, 1, w, 0.5), N = w)
w <- sample(1:50, 5)
d3 <- data.frame(X = mapply(rbinom, 1, w, 0.5), N = w)
# Combine data frames into a list - since these are already R objects, the
# call is easy:
l <- list(d1, d2, d3)
# the function:
f <- function(df)
do.call(c, with(df, mapply(binom.test, x = X, n = N))[3, ])
# do.call + lapply:
do.call(rbind, lapply(l, f))
          [,1]      [,2]       [,3]      [,4]      [,5]
[1,] 0.6476059 0.1686375 0.38331032 1.0000000 0.3615946
[2,] 0.3019956 0.6515878 0.02944937 0.5600646 1.0000000
[3,] 1.0000000 1.0000000 0.81452942 0.0390625 0.4050322
# plyr approach:
library(plyr)
ldply(l, f)
         V1        V2         V3        V4        V5
1 0.6476059 0.1686375 0.38331032 1.0000000 0.3615946
2 0.3019956 0.6515878 0.02944937 0.5600646 1.0000000
3 1.0000000 1.0000000 0.81452942 0.0390625 0.4050322
ldply() takes a list as input, applies the supplied function to each
component (the lapply() step), and returns the results combined into a
data frame. So the plyr approach can be summarized as:
1. Create a list of data frames.
2. Create a function to apply to each data frame.
3. Load the plyr package.
4. Run ldply().
Essentially, the plyr package provides a number of convenient 'wrapper'
functions to simplify the 'split-apply-combine' strategy of data analysis
for various combinations of input and output objects.
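The four steps can be put together in one short script. This is a sketch
under a few assumptions: the second data frame is made up (fixed values
instead of random draws, so the script is repeatable), and the base-R
fallback is included only so the script still runs when plyr is not
installed.

```r
# Step 1: a list of data frames (d2 contains made-up values)
d1 <- data.frame(X = c(11, 9, 13, 13, 18), N = c(19, 26, 21, 27, 30))
d2 <- data.frame(X = c(4, 20, 7, 15, 2),   N = c(10, 41, 12, 33, 9))
l <- list(d1, d2)

# Step 2: the function applied to each data frame (same f as above)
f <- function(df)
  do.call(c, with(df, mapply(binom.test, x = X, n = N))[3, ])

# Steps 3 and 4: use plyr's ldply() if available, else a base-R equivalent
if (requireNamespace("plyr", quietly = TRUE)) {
  res <- plyr::ldply(l, f)
} else {
  res <- as.data.frame(do.call(rbind, lapply(l, f)))
}
res   # one row per data frame, one column per test
```

Either branch yields a data frame with one row per input data frame, which
is the shape shown in the ldply() output above.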
HTH,
Dennis