Dear all,

This is my problem: I have a table of gene expression data, where the 1st column is the gene name and the 2nd to 39th columns each hold expression data for one of 38 samples. There are multiple measurements per sample for each gene, so there are multiple rows for each gene name. I want to average these measurements so that I end up with one value per sample for each gene name. The output data frame (table.averaged) is then used in another R script.

The code I use now (see below) takes about 20 secs per 100 genes (one progress report), so it takes 45 minutes to average my files of 13500 unique genes. Can anyone help me do this faster?

Cheers,
marije

Code I use:

    # table.imputed is a data.frame: 1st column = gene name (class factor),
    # remaining columns = expression data (class numeric)
    table.imputed[,1] <- as.character(table.imputed[,1])

    # List of unique genes in the set
    genesunique <- unique(table.imputed[,1])

    # Average all rows of one gene; returns a one-row data frame holding
    # the gene name and the mean of each sample (column).
    # Defined before the loop so the script runs from top to bottom.
    givemean <- function (gene, table.imputed) {
      # subset containing only the rows for one gene name
      thisgene <- table.imputed[table.imputed[,1]==gene,]
      data.frame(gene, t(sapply(thisgene[,2:ncol(thisgene)], mean, na.rm=TRUE)))
    }

    table.averaged <- NULL
    for (j in 1:length(genesunique)) {
      if (j%%100 == 0){   # report progress
        cat(j, "genes finished", sep=" ", fill=TRUE)
      }
      # collect the rows of average values and bind them into one data frame
      table.averaged <- rbind(table.averaged, givemean(genesunique[j], table.imputed))
    }
ONKELINX, Thierry
2007-Apr-20 13:53 UTC
[R] Help on averaging sets of rows defined by row name
Dear Marije,

I think that aggregate() would make your life a lot easier.

    # 'by' has to be a list; the gene-name column is left out of the data to average
    aggregate(table.imputed[, -1], by = list(gene = table.imputed[, 1]), FUN = mean, na.rm = TRUE)

Cheers,

Thierry

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
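A minimal sketch of the aggregate() approach on a made-up data frame may help when checking the result by hand. The object `toy` and its values are hypothetical stand-ins for table.imputed (one gene-name column followed by numeric sample columns); `na.rm = TRUE` is handed on to mean() so that missing measurements are skipped, as in the original givemean().

```r
# Hypothetical stand-in for table.imputed: gene name plus two sample columns.
toy <- data.frame(
  gene = c("g1", "g1", "g2", "g2", "g2"),
  s1   = c(1, 3, 2, NA, 4),
  s2   = c(10, 20, 5, 5, 5),
  stringsAsFactors = FALSE
)

# Average every sample column within each gene.
# 'by' must be a list; extra arguments (na.rm) are passed on to FUN.
toy.averaged <- aggregate(toy[, -1], by = list(gene = toy[, 1]),
                          FUN = mean, na.rm = TRUE)
print(toy.averaged)
```

The result has one row per gene (g1 and g2) and one column per sample. For g2 the NA in s1 is dropped, so its mean is taken over the two remaining values.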
You might want to check which of the following scales better for the size of data you have.

## Make up some data to try.
R> dat <- data.frame(gene=rep(letters[1:3], each=3), s1=runif(9), s2=runif(9))
R> dat
  gene        s1        s2
1    a 0.9959172 0.9531052
2    a 0.2064497 0.4257022
3    a 0.4791100 0.5977923
4    b 0.1307096 0.8256453
5    b 0.7887983 0.8904983
6    b 0.7841745 0.6901540
7    c 0.3356583 0.7125086
8    c 0.5859311 0.0509323
9    c 0.7681325 0.8677725

## Use aggregate():
R> aggregate(dat[-1], dat[1], mean)
  gene        s1        s2
1    a 0.5604923 0.6588666
2    b 0.5678941 0.8020992
3    c 0.5632407 0.5437378

## Do it "by hand": need a bit more work if there are NAs.
R> rowsum(dat[-1], dat[[1]]) / table(dat[[1]])
         s1        s2
a 0.5604923 0.6588666
b 0.5678941 0.8020992
c 0.5632407 0.5437378

Andy
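The "bit more work" for missing values in the by-hand version could look roughly like the sketch below. It is only one possible way to do it, not necessarily what Andy had in mind: the sums skip NAs via the `na.rm` argument of rowsum(), and the divisor is the per-gene count of non-missing values rather than the plain row count from table(). The object `toy` and its values are made up for illustration.

```r
# Hypothetical data: gene name plus two sample columns, one value missing.
toy <- data.frame(
  gene = c("g1", "g1", "g2", "g2", "g2"),
  s1   = c(1, 3, 2, NA, 4),
  s2   = c(10, 20, 5, 5, 5),
  stringsAsFactors = FALSE
)

x <- as.matrix(toy[, -1])                       # numeric matrix of expression values

sums   <- rowsum(x, toy$gene, na.rm = TRUE)     # per-gene column sums, NAs skipped
counts <- rowsum((!is.na(x)) * 1, toy$gene)     # per-gene count of non-missing values

toy.means <- sums / counts                      # matrix: genes in rows, samples in columns
print(toy.means)
```

A gene with no non-missing value for some sample gets 0/0, that is NaN, in that cell, which mirrors what mean(..., na.rm=TRUE) returns for an all-NA input.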