Hello all,

I'm new to R and trying to figure out how to perform calculations on a large dataset (300 000 data points). I have already written some code to do this, but it is awfully slow. What I want to do is add a new column for each "rep_" column, in which each value is divided by the mean of all values that have the same "plateNo". My data is in the following format:

> data
plateNo Well rep_1 rep_2 rep_3
      1  A01  1312   963  1172
      1  A02 10464  6715  5628
      1  A03  3301  3257  3281
      1  A04  3895  3350  3496
      1  A05  8731  7389  5701
      2  A01  7893  6748  5920
      2  A02  2912  2385  2586
      2  A03   985   785   809
      2  A04  1346  1018  1001
      2  A05   794   314   486

To generate it, copy:

a <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
b <- c("A01", "A02", "A03", "A04", "A05", "A01", "A02", "A03", "A04", "A05")
c <- c(1312, 10464, 3301, 3895, 8731, 7893, 2912, 985, 1346, 794)
d <- c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018, 314)
e <- c(1172, 5628, 3281, 3496, 5701, 5920, 2586, 809, 1001, 486)
data <- data.frame(plateNo = a, Well = b, rep_1 = c, rep_2 = d, rep_3 = e)

Here is the code I have come up with:

rows <- length(data$plateNo)
reps <- 3
norm <- list()
for (rep in 1:reps) {
  x <- paste("rep_", rep, sep = "")
  normx <- paste("normalised_", rep, sep = "")
  for (row in 1:rows) {
    # mean of this replicate over all wells on the same plate as this row
    plateMean <- mean(data[[x]][data$plateNo == data$plateNo[row]])
    wellData <- data[[x]][row]
    norm[[normx]][row] <- wellData / plateMean
  }
}

Any help or tips would be greatly appreciated!

Thanks,
Haakon
To get the equivalent of what your loop does, you could use

lapply(data[,3:5], function(x) x / ave(x, data$plateNo, FUN = mean))

but you might find the output of

sapply(data[,3:5], function(x) x / ave(x, data$plateNo, FUN = mean))

to be more useful.

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spector at stat.berkeley.edu

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
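Since the original goal was to add one new column per replicate, the sapply() result can be bound back onto the data frame. What follows is only a minimal sketch on the example data from the first post; the helper names rep_cols and normalised are illustrative, and the normalised_1, normalised_2, ... column names are borrowed from the loop in the question, not from the answer above.

```r
# Example data from the original post
data <- data.frame(
  plateNo = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Well    = c("A01", "A02", "A03", "A04", "A05",
              "A01", "A02", "A03", "A04", "A05"),
  rep_1   = c(1312, 10464, 3301, 3895, 8731, 7893, 2912, 985, 1346, 794),
  rep_2   = c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018, 314),
  rep_3   = c(1172, 5628, 3281, 3496, 5701, 5920, 2586, 809, 1001, 486)
)

# Pick the replicate columns by name rather than by position
# (assumption: every replicate column starts with "rep_")
rep_cols <- grep("^rep_", names(data), value = TRUE)

# Divide each value by the mean of its plate; ave() returns the group
# mean repeated for every row of that group
normalised <- sapply(data[rep_cols],
                     function(x) x / ave(x, data$plateNo, FUN = mean))

# Name the new columns as in the original loop and append them
colnames(normalised) <- sub("^rep_", "normalised_", rep_cols)
data <- cbind(data, normalised)
```

Selecting the columns with grep() instead of 3:5 is a design choice so that the same lines keep working if the number of replicates changes.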
Haakon,

as replicates imply that they all have the same data type, you can put them into a matrix, which is often faster and needs less memory (though whether that really matters depends on the number of replicates you have: for a small number of replicates you won't see much effect anyway). But I find it handy to have the matrix of replicates available as data$rep.

data <- data.frame (plateNo = a, Well = b, rep = I (cbind (c, d, e)))

> data
   plateNo Well rep.c rep.d rep.e
1        1  A01  1312   963  1172
2        1  A02 10464  6715  5628
3        1  A03  3301  3257  3281
4        1  A04  3895  3350  3496
5        1  A05  8731  7389  5701
6        2  A01  7893  6748  5920
7        2  A02  2912  2385  2586
8        2  A03   985   785   809
9        2  A04  1346  1018  1001
10       2  A05   794   314   486
> dim (data)
[1] 10  3

Then:

data$norm <- data$rep / apply (data$rep, 2, ave, plateNo = data$plateNo)

You can also move the division into the apply:

data$norm <- apply (data$rep, 2, function (x) x / ave (x, plateNo = data$plateNo))

If you always have the same number of wells per plate and the rows are sorted by plate, you could also "fold" the data$rep matrix into an array (wells x plates x replicates):

arep <- array (data$rep, dim = c (5, 2, 3))
anorm <- arep / rep (colMeans (arep), each = 5)
dim (anorm) <- dim (data$rep)
data$norm <- anorm

Here are some microbenchmark results:

Unit: nanoseconds
         min      lq  median      uq     max
[1,] 1525160 1561280 1627620 1685020 3575719
[2,] 1505641 1539500 1560301 1649081 3538001
[3,]  113321  115041  115821  116881  155681
[4,] 2589800 2627280 2662540 2794920 4646399

1 and 2 are the two apply versions above,
3 is the array version,
4 is your loops.

HTH

Claudia
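Before relying on the array version for speed, it may be worth checking that it gives the same numbers as the ave() version. The sketch below does that on the example data using base R only. It assumes the rows are sorted by plate with five wells on each of two plates, so the matrix is folded as wells x plates x replicates; the names reps, n_wells, n_plates, norm_ave, norm_arr and agree are illustrative.

```r
# Replicates from the original post as a 10 x 3 matrix (rows sorted by plate)
plateNo <- rep(1:2, each = 5)
reps <- cbind(
  rep_1 = c(1312, 10464, 3301, 3895, 8731, 7893, 2912, 985, 1346, 794),
  rep_2 = c(963, 6715, 3257, 3350, 7389, 6748, 2385, 785, 1018, 314),
  rep_3 = c(1172, 5628, 3281, 3496, 5701, 5920, 2586, 809, 1001, 486)
)

# Reference: plate means via ave(), one replicate column at a time
norm_ave <- apply(reps, 2, function(x) x / ave(x, plateNo))

# Array version: fold the matrix into wells x plates x replicates
# (assumption: equal number of wells per plate, rows sorted by plate)
n_wells  <- 5
n_plates <- 2
arep <- array(reps, dim = c(n_wells, n_plates, ncol(reps)))

# colMeans() averages over the first dimension (wells), giving one mean per
# plate and replicate; rep(..., each = n_wells) stretches the means back out
norm_arr <- arep / rep(colMeans(arep), each = n_wells)
dim(norm_arr) <- dim(reps)

# Both approaches should agree up to floating-point error
agree <- isTRUE(all.equal(norm_arr, norm_ave, check.attributes = FALSE))
print(agree)
```

If the rows are not sorted by plate, or the plates have different numbers of wells, the fold no longer lines up with the plates and the ave() version is the safer choice.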