Dear R-listers, I want to reperfrom a poisson distribution test that presented in a recent-published biological research paper (Plant Physiology 2008, vol 148, pp. 1189-1200). That test is about the occurrence number of a kind of gene in separate chromosomes. For instance: The observed gene number in chromosome A is 36. The expected gene number in chromosome A is 30. Then, the authors got a probability 0.137 by distribution test on this trial. In this test, a Poisson distribution was used to determine the significance of the gene distribution. Questions: How can I reperform this test in R? Thank you in advance. Mao Jian-Feng Institue of Botany, CAS, China [[alternative HTML version deleted]]
On 28-Jul-09 10:03:41, Mao Jianfeng wrote:> Dear R-listers, > I want to reperfrom a poisson distribution test that presented > in a recent-published biological research paper (Plant Physiology > 2008, vol 148, pp. 1189-1200). That test is about the occurrence > number of a kind of gene in separate chromosomes. > > For instance: > > The observed gene number in chromosome A is 36. > The expected gene number in chromosome A is 30. > > Then, the authors got a probability 0.137 by distribution test on this > trial. In this test, a Poisson distribution was used to determine the > significance of the gene distribution. > > Questions: > How can I reperform this test in R? > > Thank you in advance. > Mao Jian-Feng > Institue of Botany, > CAS, ChinaSince it is not clear what test procedure they used, I have done a couple of numerical experiments in R: 1. Compare the upper-tail probability of the POisson distribution with mean mu = 30 of the event that at least 36 are observed: 1-ppois(36,30) # [1] 0.1196266 Not quite the 0.137 that they got. 2. A similar comparison, using a Normal approximation to the Poisson (mean mu = 30, SD = sqrt(mu)): 1 - pnorm(6/sqrt(30)) # [1] 0.1366608 which, after rounding, is exactly the 0.137 that they got. So it seems they have used an upper-tail test based on the Normal approximation to the Poisson distribution. Method 1 (using the exact Poisson distribution) is preferable, since it is accurate (given the assumption of Poisson distribution). So that would, in principle, be the best way to do it in R (as illustrated). Possibly their adoption of Method 2 is based on a naive acceptance of the "rule-of-thumb" from some textbook; or maybe their available software does not offer ready access to the exact Poisson distribution (which wouldn't happen if they used R -- see Method 1). As stated, it is inaacurate compared with Method 1, so is not to be preferred. However, if you need to reproduce their method (regardless of merit), then use Method 2. Hoping this helps, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 28-Jul-09 Time: 11:42:08 ------------------------------ XFMail ------------------------------
I have gone to have a look at the paper. (http:// www.plantphysiol.org/cgi/content/full/148/3/1189 ) to try to work out what they're actually doing, in the hope that I might be able to figure out their procedure so we can give a more complete answer to the question. Unfortunately, I'm more confused now than before. The numbers the OP refers to are in table I: http:// www.plantphysiol.org/cgi/content/full/148/3/1189/TBL1 The first four rows are (abbreviating the column titles): Chromosome length(Mb) Obs_No._Genes Exp_No._Genes Distribution_Test LG I 32.16 36 30 0.137 LG II 23.44 25 22 0.235 LG III 17.45 13 17 0.235 LG IV 15.08 10 14 0.159 They are (somehow) working out the probability above or below the observed count, depending on wther or not the observed count happened to be above or below the expected! This means their p-values are on average about half what they should be. However, they don't seem to be using Poisson probabilities (as they claim) - I can't reproduce the results with a Poisson distribution. Nor can I reproduce them with a normal approximation, nor with a normal approximation with continuity correction. It's not at all clear how they've got their numbers, but they definitely don't seem to have done what they claim to have done (see below). The note under Table 1 says (using some sort of "pseudo" LaTeX to indicate the greek letters) "for distribution test $P(m(i,j) < \lambda(i,j)) <= \alpha$ or $P(m(i,j) > \lambda(i,j)) < \alpha$, a Poisson distribution was used to determine the significance of the F-box gene distribution in the Populus genome" -- that is, if they did what this says, they're working out probabilities of MORE extreme, not "at least as extreme", a second error. [Note that (it's stated later) the expected numbers are estimated. The don't seem to take this into account (though the effect may be small). They're also doing a whole bunch of tests in this paper - 19 "p-values" in table 1, another 273 in table 3, 18 more in table 4, 28 more in table 5, another 24 in table 7 ... and so on.] Under the "Materials and Methods" section, subsection "Localization of F-Box Genes in the Genome", they claim that the "probabilities $P(m(i) < \lambda(i))$ and $P(m(i) > \lambda(i))$ were evaluated under the cumulative Posson distribution at \alpha <= 0.05 and \alpha <= 0.01 significance levels." Which is weird, because the actual p-values they seem to regard as worth mentioning in the paper are the ones at or below 0.0001 [Is this sort of thing pretty typical for papers in biology? Don't referees even do a basic check of one or two numbers? Apparently I put too much effort into refereeing.] Can anyone see what they're actually doing? -- View this message in context: http://www.nabble.com/How-to-do-poisson-distribution-test-like-this--tp24696413p24711764.html Sent from the R help mailing list archive at Nabble.com.
Corrected links (the originals somehow aquired an extra space, sorry) Paper: http://www.plantphysiol.org/cgi/content/full/148/3/1189 Table I: http://www.plantphysiol.org/cgi/content/full/148/3/1189/TBL1 -- View this message in context: http://www.nabble.com/How-to-do-poisson-distribution-test-like-this--tp24696413p24711818.html Sent from the R help mailing list archive at Nabble.com.
Follow-up: On 29-Jul-09 03:28:24, glen_b wrote:> Corrected links (the originals somehow aquired an extra space, sorry) > > Paper: http://www.plantphysiol.org/cgi/content/full/148/3/1189 > Table I: http://www.plantphysiol.org/cgi/content/full/148/3/1189/TBL1 > --I have now scanned though the Paper, and find the explanation of their calculation of Expected Length (which agrees with the method I described in my previous email) in a somewhat garbled description on page 1198 (page 10/12 of the PDF file): "The expected gene number lambda_i on chromosome i would be a sample from a Poisson distribution lambda_i = m*L_i/Sum L_i where m is the total number of genes detected within the assembled sequences and L_i is the length of chromosome i." I agree with your earlier comment, Glen, that the paper would have benefited from more careful scrutiny by referees/editor! Best wishes, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 29-Jul-09 Time: 12:43:26 ------------------------------ XFMail ------------------------------