thr3ads.net - R help - [R] Help on averaging sets of rows defined by row name [Apr 2007]

Booman, M

2007-Apr-20 13:26 UTC

[R] Help on averaging sets of rows defined by row name

Dear all,

This is my problem: I have a table of gene expression data, where the 1st
column is the gene name and the 2nd to 39th columns each hold expression data
for one of 38 samples. There are multiple measurements per sample for each
gene, so there are multiple rows for each gene name. I want to average these
measurements so I end up with one value per sample for each gene name. The
output data frame (table.averaged) is used further in another R script. The
code I use now (see below) takes 20 secs for each loop, so it takes 45 minutes
to average my files of 13500 unique genes. Can anyone help me do this faster?

Cheers, marije

Code I use: 


# table.imputed is a data.frame: 1st column = gene name (class factor),
# remaining columns = expression data (class numeric)
table.imputed[,1] <- as.character(table.imputed[,1])

# Make a list of the unique genes in the set
genesunique <- unique(table.imputed[,1])

# Return one row holding the gene name and the average for each sample (column)
givemean <- function (gene, table.imputed) {
   # Subset containing only the rows for one gene name
   thisgene <- table.imputed[table.imputed[,1]==gene,]
   data.frame(gene, t(sapply(thisgene[,2:ncol(thisgene)], mean, na.rm=TRUE)))
}

table.averaged <- NULL
for (j in seq_along(genesunique)) {
   # Report progress
   if (j%%100 == 0) {
      cat(j, "genes finished", sep=" ", fill=TRUE)
   }
   # Collect each row of average values and bind it onto the result data frame
   table.averaged <- rbind(table.averaged, givemean(genesunique[j], table.imputed))
}


The contents of this message are confidential and only intended for the eyes of
the addressee(s). Others than the addressee(s) are not allowed to use this
message, to make it public or to distribute or multiply this message in any way.
The UMCG cannot be held responsible for incomplete reception or delay of this
transferred message.

ONKELINX, Thierry

2007-Apr-20 13:53 UTC

Dear Marije,

I think that aggregate() would make your life a lot easier.

aggregate(table.imputed[, -1], by = list(gene = table.imputed[, 1]), FUN = mean, na.rm = TRUE)
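
aggregate() expects its by argument to be a list, and it applies FUN to every column it is given, so the gene-name column is best kept out of the data argument. Below is a minimal sketch on made-up data; the names toy and toy.averaged are hypothetical stand-ins for table.imputed and table.averaged, and the values are invented purely for illustration.

```r
# Made-up stand-in for table.imputed: gene name first, then sample columns.
toy <- data.frame(gene = c("a", "a", "b", "b", "b"),
                  s1   = c(1, 3, 2, NA, 4),
                  s2   = c(10, 20, 5, 7, 9),
                  stringsAsFactors = FALSE)

# Average only the numeric columns, grouped by gene.
# 'by' must be a list; na.rm = TRUE is passed through to mean().
toy.averaged <- aggregate(toy[, -1], by = list(gene = toy[, 1]),
                          FUN = mean, na.rm = TRUE)
print(toy.averaged)
```

The result has one row per gene and one column per sample, which appears to be the shape the original loop was building.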

Cheers,

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
Thierry.Onkelinx op inbo.be
www.inbo.be 

Do not put your faith in what statistics say until you have carefully
considered what they do not say.  ~William W. Watt
A statistical analysis, properly conducted, is a delicate dissection of
uncertainties, a surgery of suppositions. ~M.J.Moroney

 

Liaw, Andy

2007-Apr-20 14:09 UTC

You might want to check which of the following scales better for the
size of data you have.

## Make up some data to try.
R> dat <- data.frame(gene=rep(letters[1:3], each=3), s1=runif(9), s2=runif(9))
R> dat
  gene        s1        s2
1    a 0.9959172 0.9531052
2    a 0.2064497 0.4257022
3    a 0.4791100 0.5977923
4    b 0.1307096 0.8256453
5    b 0.7887983 0.8904983
6    b 0.7841745 0.6901540
7    c 0.3356583 0.7125086
8    c 0.5859311 0.0509323
9    c 0.7681325 0.8677725

## Use aggregate():
R> aggregate(dat[-1], dat[1], mean)
  gene        s1        s2
1    a 0.5604923 0.6588666
2    b 0.5678941 0.8020992
3    c 0.5632407 0.5437378

## Do it "by hand": need a bit more work if there are NAs.
R> rowsum(dat[-1], dat[[1]]) / table(dat[[1]])
         s1        s2
a 0.5604923 0.6588666
b 0.5678941 0.8020992
c 0.5632407 0.5437378
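
One possible way to do the extra work for NAs in the "by hand" approach is to divide the per-gene sums by the per-gene counts of non-missing values rather than by the raw row counts. This is only a sketch on made-up data; dat.na, x, g, ok, sums, counts and means are hypothetical names, not part of the original post.

```r
# Made-up data with one missing value.
dat.na <- data.frame(gene = rep(letters[1:3], each = 3),
                     s1 = c(1, 2, 3, 4, NA, 6, 7, 8, 9),
                     s2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))

x <- as.matrix(dat.na[-1])   # numeric matrix of expression values
g <- dat.na[[1]]             # grouping vector (gene names)

ok     <- !is.na(x)                    # TRUE where a value is present
sums   <- rowsum(x, g, na.rm = TRUE)   # per-gene sums, skipping NAs
counts <- rowsum(ok + 0, g)            # per-gene counts of non-missing values
means  <- sums / counts                # NaN if a gene has no values for a sample
print(means)

# Cross-check against aggregate() with na.rm = TRUE.
agg <- aggregate(dat.na[-1], dat.na[1], mean, na.rm = TRUE)
print(agg)
```

To see which approach scales better on real data, each call can be wrapped in system.time().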

Andy
 

