Gustavo Fernandez Bayon
2017-Aug-24 14:42 UTC
[Rd] Are r2dtable and C_r2dtable behaving correctly?
Hello,
While doing some enrichment tests using chisq.test() with simulated
p-values, I noticed some strange behaviour. The computed p-value was
extremely small, so I decided to dig a little deeper and debug
chisq.test(). I noticed then that the simulated statistics returned by the
following call
tmp <- .Call(C_chisq_sim, sr, sc, B, E)
were all the same, very small numbers. This, at first, seemed strange to
me. So I decided to do some simulations myself, and started playing around
with the r2dtable() function. The problem is that, with my row and column
marginals, r2dtable() always returns the same matrix. Here is a minimal
example:
rr <- c(209410, 276167)
cc <- c(25000, 460577)
ms <- r2dtable(3, rr, cc)
I have tested this code on two machines, and it always returned the same
list of length three containing the same matrix three times. The repeated
matrix is the following:
[[1]]
[,1] [,2]
[1,] 10782 198628
[2,] 14218 261949
[[2]]
[,1] [,2]
[1,] 10782 198628
[2,] 14218 261949
[[3]]
[,1] [,2]
[1,] 10782 198628
[2,] 14218 261949
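
For reference, the chisq.test() behaviour described at the start of this
message can be reproduced directly from these margins. The following is a
minimal sketch; the observed table is made up for illustration (any 2x2 table
with the margins rr and cc will do) and only shifts the [1, 1] cell slightly
away from its expected value:

  obs <- matrix(c(11000,  14000,      # column 1
                  198410, 262167),    # column 2
                nrow = 2)
  rowSums(obs)   # 209410 276167  (the rr above)
  colSums(obs)   # 25000 460577   (the cc above)

  ## With simulate.p.value = TRUE, chisq.test() draws B tables with the same
  ## C routine (C_chisq_sim).  If every simulated statistic collapses to the
  ## same minimal value while the observed statistic is larger, none of the
  ## draws reaches the observed statistic, and the reported p-value falls to
  ## its floor of 1/(B + 1), however mild the observed deviation actually is.
  chisq.test(obs, simulate.p.value = TRUE, B = 2000)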
I also coded a small function returning the value of the chi-squared
statistic using the previous fixed marginals and taking the value at [1, 1]
as input. This helped me to plot a curve and notice that the repeating
matrix was the one that yielded the minimum chi-squared statistic.
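
A minimal sketch of such a function (an illustrative reconstruction, not the
exact code used): for 2x2 margins rr and cc, the whole table is determined by
the [1, 1] cell, so the statistic can be written as a function of that single
value.

  chisq_stat_11 <- function(x11, rr, cc) {
    n   <- sum(rr)                                         # grand total; sum(cc) must match
    obs <- matrix(c(x11,         rr[1] - x11,              # row 1
                    cc[1] - x11, n - rr[1] - cc[1] + x11), # row 2
                  nrow = 2, byrow = TRUE)
    E <- outer(rr, cc) / n                                 # expected counts under independence
    sum((obs - E)^2 / E)                                   # Pearson chi-squared statistic
  }

  rr <- c(209410, 276167); cc <- c(25000, 460577)
  xs <- 9000:13000
  plot(xs, sapply(xs, chisq_stat_11, rr = rr, cc = cc), type = "l",
       xlab = "table[1, 1]", ylab = "chi-squared statistic")

The minimum of this curve sits near x11 = 10782, the value in the repeated
table above.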
This behaviour persists if I use larger marginals (for example, adding
100000 to every element of the marginals),
> rr <- c(309410, 376167)
> cc <- c(125000, 560577)
> r2dtable(3, rr, cc)
[[1]]
[,1] [,2]
[1,] 56414 252996
[2,] 68586 307581
[[2]]
[,1] [,2]
[1,] 56414 252996
[2,] 68586 307581
[[3]]
[,1] [,2]
[1,] 56414 252996
[2,] 68586 307581
but not if we use smaller ones:
> rr <- c(9410, 76167)
> cc <- c(25000, 60577)
> r2dtable(3, rr, cc)
[[1]]
[,1] [,2]
[1,] 2721 6689
[2,] 22279 53888
[[2]]
[,1] [,2]
[1,] 2834 6576
[2,] 22166 54001
[[3]]
[,1] [,2]
[1,] 2778 6632
[2,] 22222 53945
I have looked inside the C code for the C_r2dtable() and rcont2()
functions, but I cannot do much more than guess where this behaviour could
originate, so I would like to ask for help from anybody more experienced
with the R implementation. My guess is that there is some kind of inflection
point depending on the total sample size of the table, or that the generation
algorithm tends to output matrices concentrated around the minimum of the
chi-squared statistic.
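
For what it is worth, the repeated table is essentially the table of expected
counts under independence, which fits the observation above that it minimises
the chi-squared statistic; a quick check:

  rr <- c(209410, 276167); cc <- c(25000, 460577)
  outer(rr, cc) / sum(rr)   # approx. 10781.5 198628.5
                            #         14218.5 261948.5  -- cf. the repeated table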
This is the output from my sessionInfo():
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8
 [4] LC_COLLATE=es_ES.UTF-8     LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] profvis_0.3.3                           bindrcpp_0.2
 [3] FDb.InfiniumMethylation.hg19_2.2.0      org.Hs.eg.db_3.4.1
 [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.28.4
 [7] AnnotationDbi_1.38.2                    Biobase_2.36.2
 [9] GenomicRanges_1.28.4                    GenomeInfoDb_1.12.2
[11] IRanges_2.10.2                          S4Vectors_0.14.3
[13] BiocGenerics_0.22.0                     epian_0.1.0

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.6.3 purrr_0.2.3                reshape2_1.4.2
 [4] lattice_0.20-35            htmltools_0.3.6            rtracklayer_1.36.4
 [7] blob_1.1.0                 XML_3.98-1.9               rlang_0.1.2
[10] foreign_0.8-67             glue_1.1.1                 DBI_0.7
[13] BiocParallel_1.10.1        bit64_0.9-7                matrixStats_0.52.2
[16] GenomeInfoDbData_0.99.0    bindr_0.1                  plyr_1.8.4
[19] stringr_1.2.0              zlibbioc_1.22.0            Biostrings_2.44.2
[22] htmlwidgets_0.9            psych_1.7.5                memoise_1.1.0
[25] biomaRt_2.32.1             broom_0.4.2                Rcpp_0.12.12
[28] DelayedArray_0.2.7         XVector_0.16.0             bit_1.1-12
[31] Rsamtools_1.28.0           mnormt_1.5-5               digest_0.6.12
[34] stringi_1.1.5              dplyr_0.7.2                grid_3.4.0
[37] tools_3.4.0                bitops_1.0-6               magrittr_1.5
[40] RCurl_1.95-4.8             tibble_1.3.3               RSQLite_2.0
[43] tidyr_0.7.0                pkgconfig_2.0.1            Matrix_1.2-9
[46] assertthat_0.2.0           R6_2.2.2                   GenomicAlignments_1.12.1
[49] nlme_3.1-131               compiler_3.4.0
Any hint or help would be much appreciated. We do not use the simulated
version of chisq.test() a lot at the lab, but I would like to understand
better what is happening.
Kind regards,
Gustavo
Martin Maechler
2017-Aug-25 08:30 UTC
[Rd] Are r2dtable and C_r2dtable behaving correctly?
>>>>> Gustavo Fernandez Bayon <gbayon at gmail.com>
>>>>> on Thu, 24 Aug 2017 16:42:36 +0200 writes:

 > [...]
 > rr <- c(209410, 276167)
 > cc <- c(25000, 460577)
 > ms <- r2dtable(3, rr, cc)
 > [...]

Yes. You can also do

  unique(r2dtable(100, rr, cc))

and see that the result is constant.

I'm pretty sure this is still due to some integer overflow, in spite of the
fact that I had spent quite some time fixing such a problem in Dec 2003; see
the 14-year-old bug PR#5701,
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2

It has to be said that this is based on an algorithm published in 1981,
specifically -- from help(r2dtable) --

  Patefield, W. M. (1981) Algorithm AS 159. An efficient method of
  generating r x c tables with given row and column totals.
  _Applied Statistics_ *30*, 91-97.

For those with JSTOR access (typically via your University), it is available
at http://www.jstor.org/stable/2346669

When I start reading it, the algorithm indeed seems to start from the
expected value of a cell entry and then "explore from there" ... and I do
wonder if there is not a flaw somewhere in the algorithm.

I've now found that, a bit more than a year ago, 'paljenczy' found on SO
(https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated)
that the generated tables do indeed seem to be concentrated too much around
the mean. Basically, his example:

> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
   user  system elapsed
  0.218   0.025   0.244
> table(A11)
    34     35     36     37     38     39     40     41     42     43
     2     17     40    129    334    883   2026   4522   8766  15786
    44     45     46     47     48     49     50     51     52     53
 26850  42142  59535  78851  96217 107686 112438 108237  95761  78737
    54     55     56     57     58     59     60     61     62     63
 59732  41474  26939  16006   8827   4633   2050    865    340    116
    64     65     66     67
    38     13      7      1

For a 2x2 table, there's really only one degree of freedom, hence the above
characterizes the full distribution for that case.

I would have expected to see all possible values in 0:100 instead of such a
"normal like" distribution with carrier only in [34, 67].

There are newer publications and maybe newer algorithms. So maybe the
algorithm is "flawed by design" for really large total numbers of
observations, rather than wrong.

Seems interesting ...

Martin Maechler
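
As a point of comparison, here is a small sketch, assuming that the [1, 1]
cell of a random 2x2 table with these fixed margins should follow a
hypergeometric distribution (the point the replies below also make): it draws
a reference sample for the margins in the original post and sets it against
what r2dtable() returns.

  rr <- c(209410, 276167); cc <- c(25000, 460577)

  ref <- rhyper(1e4, rr[1], rr[2], cc[1])  # [1, 1] cell under the hypergeometric null
  summary(ref)                             # mean near 10781, spread of a few hundred counts

  sim <- vapply(r2dtable(1e4, rr, cc), function(x) x[1, 1], numeric(1))
  summary(sim)                             # on an affected build: a single constant value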
It is not about a "really large total number of observations", but:

  set.seed(4711)
  tabs <- r2dtable(1e6, c(2, 2), c(2, 2))
  A11  <- vapply(tabs, function(x) x[1, 1], numeric(1))
  table(A11)

  A11
       0      1      2
  166483 666853 166664

There are three possible matrices, and these come out in proportions 1:4:1,
the one with all cells filled with ones being the most common.
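
Indeed, 1:4:1 is exactly the hypergeometric null distribution of the [1, 1]
cell for these margins; a quick check:

  dhyper(0:2, 2, 2, 2)               # 0.1666667 0.6666667 0.1666667
  c(166483, 666853, 166664) / 1e6    # simulated frequencies from the run above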
Cheers, Jari O.
________________________________________
From: R-devel <r-devel-bounces at r-project.org> on behalf of Martin Maechler <maechler at stat.math.ethz.ch>
Sent: 25 August 2017 11:30
To: Gustavo Fernandez Bayon
Cc: r-devel at r-project.org
Subject: Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
[...]
> On 25 Aug 2017, at 10:30 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> [...]
>
> https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated
>
>> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
>    user  system elapsed
>   0.218   0.025   0.244
>> table(A11)
>
>     34     35     36     37     38     39     40     41     42     43
>      2     17     40    129    334    883   2026   4522   8766  15786
>     44     45     46     47     48     49     50     51     52     53
>  26850  42142  59535  78851  96217 107686 112438 108237  95761  78737
>     54     55     56     57     58     59     60     61     62     63
>  59732  41474  26939  16006   8827   4633   2050    865    340    116
>     64     65     66     67
>     38     13      7      1
>
> For a 2x2 table, there's really only one degree of freedom,
> hence the above characterizes the full distribution for that
> case.
>
> I would have expected to see all possible values in 0:100
> instead of such a "normal like" distribution with carrier only
> in [34, 67].

Hmm, am I missing a point here?

> round(dhyper(0:100,100,100,100)*1e6)
  [1]      0      0      0      0      0      0      0      0      0      0
 [11]      0      0      0      0      0      0      0      0      0      0
 [21]      0      0      0      0      0      0      0      0      0      0
 [31]      0      0      0      1      4     13     43    129    355    897
 [41]   2087   4469   8819  16045  26927  41700  59614  78694  95943 108050
 [51] 112416 108050  95943  78694  59614  41700  26927  16045   8819   4469
 [61]   2087    897    355    129     43     13      4      1      0      0
 [71]      0      0      0      0      0      0      0      0      0      0
 [81]      0      0      0      0      0      0      0      0      0      0
 [91]      0      0      0      0      0      0      0      0      0      0
[101]      0

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
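
A compact way to set the two distributions side by side, reusing the tabs and
A11 objects from the example quoted above (a sketch):

  emp  <- table(A11)
  theo <- round(dhyper(as.integer(names(emp)), 100, 100, 100) * length(A11))
  data.frame(A11       = as.integer(names(emp)),
             simulated = as.integer(emp),
             expected  = theo)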