Hello R-users,

I am looking for an elegant way to calculate p-values for each row of a data frame. My situation is as follows: I have gene expression results from a microarray with 64 samples looking at 25626 genes. The results are in a data frame with the dimensions 64 by 25626. I want to create a volcano plot of difference of means vs. -log(10) of the p-values, comparing normal samples to abnormal samples. The results of both types of samples are all in my data frame.

Now, I have found a way to calculate the p-value using a "for (i in 1:25626)" loop (see below):

    df.normal   # data frame which only contains the normal samples
    df.samples  # data frame which only contains abnormal samples

    DM=rowMeans(df.normal)-rowMeans(df.samples)  # gives me the difference of means for each row

    PV=array(1,c(25626,1))
    for (i in 1:25626){
      VL=t.test(matrix.b[i,],matrix.a[i,])
      V=as.numeric(VL[3])
      V=-log10(V)
      PV[i,1]=V}

    plot(DM, PV, main=title,xlab=x.lab, ylab="-log(10) P-Values",pch=20)}

It takes around 3-5 minutes to generate the volcano plot this way. I will be running arrays which will look at 2.2 million sites, so this approach will then take way too long. I was wondering if there is a more elegant way to calculate the p-values for an array/data frame/matrix in a row-by-row fashion, similar to "rowMeans".

I thought writing a function to get the p-value and then using apply(x,1,function) would be the best.

I have the function which will give me the p-value

    p.value = function (x,y){
      PV=as.numeric(t.test(x,y)[3])
    }

and I can get a result if I test it only on one row (below is 6 by 10 data frame example of my original data)

    RRR
                         X259863    X267862     X267906    X300875    X300877     X300878
    MSPI0406S00000183 -3.2257205 -3.2248899  2.85590082 -2.6293602 -3.5054348 -2.62817269
    MSPI0406S00000238 -2.6661903 -3.1135020  2.17073881 -3.2357307 -2.3309775 -1.76078452
    MSPI0406S00000239 -1.7636439 -0.6702877  0.19471126 -0.7397132 -1.4332662 -0.24822470
    MSPI0406S00000300  0.6471381 -0.2638928 -0.61876054 -0.9180127  0.2539848 -0.63122203
    MSPI0406S00000301  0.9207208  0.2164267 -0.33238846 -1.1450717 -0.2935584 -1.01659802
    MSPI0406S00000321 -0.4073272 -0.2852402 -0.08085746 -0.4109428 -0.2185432 -0.39736137
    MSPI0406S00000352 -0.7074175 -0.6987548 -1.22004647 -0.8570551 -0.5083861 -0.09267928
    MSPI0406S00000353 -0.2745682  0.3012990 -0.64787221 -0.5654195  0.4265007 -0.65963404
    MSPI0406S00000354 -1.1858394 -1.4388609 -0.07329722 -2.0010785 -1.3245696 -1.43216984
    MSPI0406S00000360 -1.4599809 -1.4929059  0.63453235 -1.1476760 -1.5849922 -1.03187399

    > zz=p.value(RRR[1,1:3],RRR[1,4:6])
    > zz
    $p.value
    [1] 0.485727

but I cannot do this row by row using apply

    > xxx=apply(RRR,1,p.value(RRR[,1:3],RRR[,4:6]))
    Error in match.fun(FUN) :
      'p.value(RRR[, 1:3], RRR[, 4:6])' is not a function, character or symbol

Does anyone have any suggestions?
Thanks in advance

Christoph Heuck
Albert Einstein College of Medicine
Meyners, Michael, LAUSANNE, AppliedMathematics
2009-Oct-16 06:04 UTC
[R] calculating p-values by row for data frames
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Christoph Heuck
> Sent: Thursday, 15 October 2009 17:51
> To: r-help at r-project.org
> Subject: [R] calculating p-values by row for data frames
>
> [...]
>
> but I cannot do this row by row using apply
>
> > xxx=apply(RRR,1,p.value(RRR[,1:3],RRR[,4:6]))

xxx <- apply(RRR, 1, function(x) p.value(x[1:3],x[4:6]))

works for me. Check the examples in ?apply.

HTH, Michael

> Error in match.fun(FUN) :
>   'p.value(RRR[, 1:3], RRR[, 4:6])' is not a function, character or symbol
>
> Does anyone have any suggestions?
> Thanks in advance
>
> Christoph Heuck
> Albert Einstein College of Medicine
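
For the 2.2 million site case, where calling t.test() once per row may still be too slow, one possible approach is to compute the Welch t statistic and its degrees of freedom for all rows at once from rowMeans() and rowSums(). The sketch below is not from the thread: row_welch_p is a hypothetical helper name, the data are simulated, and it assumes no missing values and at least two samples per group. It should match the default two-sided t.test(), which uses the Welch correction.

```r
# Hypothetical helper (not part of base R): Welch two-sample t-test
# p-values for every row of `a` versus the same row of `b`.
# Assumes numeric input, no NAs, and at least 2 columns per group.
row_welch_p <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  na <- ncol(a)
  nb <- ncol(b)
  ma <- rowMeans(a)
  mb <- rowMeans(b)
  # Unbiased row variances; `a - ma` subtracts each row's own mean
  # because the vector is recycled down the columns.
  va <- rowSums((a - ma)^2) / (na - 1)
  vb <- rowSums((b - mb)^2) / (nb - 1)
  se2 <- va / na + vb / nb
  tstat <- (ma - mb) / sqrt(se2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- se2^2 / ((va / na)^2 / (na - 1) + (vb / nb)^2 / (nb - 1))
  2 * pt(-abs(tstat), df)
}

# Simulated stand-ins for the normal and abnormal sample matrices
set.seed(1)
normal   <- matrix(rnorm(200 * 5), nrow = 200)
abnormal <- matrix(rnorm(200 * 7, mean = 0.5), nrow = 200)

p_fast <- row_welch_p(normal, abnormal)

# Reference: the apply() form with an anonymous function, one t.test per row
p_apply <- apply(cbind(normal, abnormal), 1,
                 function(x) t.test(x[1:5], x[6:12])$p.value)

# The two sets of p-values should agree up to floating-point tolerance
print(all.equal(p_fast, p_apply))

# Quantities for the volcano plot
DM <- rowMeans(normal) - rowMeans(abnormal)
PV <- -log10(p_fast)
```

The apply() call works here because the third argument is itself a function that receives one row at a time; the original attempt passed the result of calling p.value(...), which is why match.fun() complained. The vectorised version avoids the per-row overhead of t.test() entirely, at the cost of giving up its input checking.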