Ernesto Guerra
2015-Jun-18 12:16 UTC
[R] adjusted values for CI in a within-subjects design
Dear R-ers,

I am trying to adjust the values of a within-items, within-subjects design (the experimental conditions are within subjects) in order to calculate between-subjects confidence intervals (CIs). I am following the recommendations of O'Brien & Cousineau (2014; see also Cousineau, 2005, and Morey, 2008 for similar solutions). The formula is the following:

# Formula for corrected CIs:
#   Y = Xsj - Xj. + X..
# where...
#   Xsj = single value of a trial in a time window for one participant
#   Xj. = that participant's mean for the condition across trials
#   X.. = overall mean for the condition
#
#   W = J*(Y - Y.j) + Y.j
# where...
#   J   = sqrt(f/(f-1)), the correction factor
#   f   = total number of measures per subject
#   Y.j = mean for a condition across participants
#   W   = the corrected value from which the between-subjects CI is calculated

I've written code that does this, using a dataset of random values (0/1) with the same structure as the actual dataset for which I hope to calculate corrected CIs:

  fixprop subj trial time
        1    1     1    1
        0    1     1    2
        1    1     1    3
        0    1     1    4
        0    1     1    5
        0    1     1    6

The experiments deliver the time course of an effect (similar to longitudinal data), meaning we have N time steps in which the effect is modulated. I've tested the script with this dummy dataset of 4 participants, 10 items, and 400 time steps, and it works nicely.

The tricky part is that in the real experiments we have many more participants, items and time steps, so the adjustment needs to be done many, many times. With the dummy dataset the process takes about 6 seconds:

> proc.time() - ptm
   user  system elapsed
   4.53    0.06    6.03

but when I add a bit more data (10 participants, 125 trials, 400 time steps), the script takes more than an hour:

> proc.time() - ptm
    user   system  elapsed
 3483.64   879.31  4456.86

So I don't even want to try this with real data, where we have thousands of time steps and generally over 50 participants (although fewer items, perhaps 40 or 50).
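To make the two steps concrete, here is a tiny base-R illustration on toy data (not my real files; `ave()` computes group means aligned to each row, and the variable names are just my own labels):

```r
# Toy worked example of the two-step correction (base R only)
set.seed(1)
d <- expand.grid(trial = 1:5, subj = 1:4, time = 1:3)
d$fixprop <- rbinom(nrow(d), 1, 0.5)
f <- 2                                  # measures/conditions per subject

## Step 1:  Y = Xsj - Xj. + X..
Xj. <- ave(d$fixprop, d$subj, d$time)   # participant mean per time window
X.. <- ave(d$fixprop, d$time)           # overall mean per time window
d$Y <- d$fixprop - Xj. + X..

## Step 2:  W = J*(Y - Y.j) + Y.j,  with J = sqrt(f/(f-1))
Y.j <- ave(d$Y, d$time)                 # condition mean across participants
d$W <- sqrt(f / (f - 1)) * (d$Y - Y.j) + Y.j
```

Note that the correction leaves the per-condition means untouched (the mean of W per time step equals the mean of Y per time step); only the spread around those means is rescaled.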
QUESTION: does anyone know how I could optimize my script so that it does not take forever? Here is the script:

library(doBy)         # summaryBy
library(plotrix)
library(matrixStats)
library(bear)
library(ggplot2)
library(reshape)

rm(list = ls())                      # clear memory
setwd(??)                            # set working directory
infile <- "test.txt"                 # "test.txt" is the name of the fixation report
data <- read.delim(file = infile, header = TRUE, sep = "\t")   # load the file
data <- data[with(data, order(subj, trial)), ]   # data must be ordered by participant, then trial
head(data)

subj <- unique(data$subj); np <- length(subj); np             # how many participants
trial <- unique(data$trial); nt <- length(trial); nt          # how many items
timewindows <- unique(data$time); twsn <- length(timewindows); twsn   # how many time steps

critcoln <- 1   # column in which we find the dependent variable
ncoln    <- 4   # total number of columns in the file
f        <- 2   # total number of conditions per subject

tm     <- cbind(rep(critcoln:twsn, each = nt * np))
newvar <- cbind(rep(critcoln + ncoln, each = nt * np * twsn))
subj   <- cbind(rep(1:np, each = nt, times = twsn))
count  <- cbind(rep(1, each = nt * np * twsn))

####################################

X..data <- summaryBy(fixprop ~ time, FUN = mean, keep.names = TRUE, data = data)
Xj.data <- summaryBy(fixprop ~ subj + time, FUN = mean, keep.names = TRUE, data = data)

ptm <- proc.time()
prev_tw <- 0
prev_subj <- 0
j <- 0
t <- 0
for (i in 1:nrow(data)) {
  curr_tw <- tm[i]
  curr_subj <- subj[i]
  if (prev_subj < curr_subj) { j <- j + 1 }
  Y. <- data[i, critcoln] - Xj.data[j, 3]
  if (prev_tw < curr_tw) { t <- t + 1 }
  Y <- Y. + X..data[t, 2]
  data[i, newvar[i]] <- Y
  prev_tw <- curr_tw
  prev_subj <- curr_subj
}
proc.time() - ptm
colnames(data)[ncoln + 1] <- 'fixprop_adj'

Y.jdata <- summaryBy(fixprop_adj ~ subj + time, FUN = mean, keep.names = TRUE, data = data)
J <- sqrt(f / (f - 1))   # correction factor
newvar <- cbind(rep(critcoln + ncoln + 1, each = nt * np * twsn))
prev_tw <- 0
t <- 0
for (i in 1:nrow(data)) {
  curr_tw <- tm[i]
  if (prev_tw < curr_tw) { t <- t + 1 }
  W <- J * (data[i, ncoln + 1] - Y.jdata[t, 3]) + Y.jdata[t, 3]
  data[i, newvar[i]] <- W
  prev_tw <- curr_tw
}
proc.time() - ptm
colnames(data)[ncoln + 2] <- 'fixprop_final'

That's all. The parts that really take long are the "for" loops. I know loops are not the best in R, but so far I haven't been able to think of a better way to do this.

Any comments, suggestions, criticisms and questions are welcome.

Cheers,
Ernesto.
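For concreteness, here is a rough sketch of what a loop-free version might look like, on toy data with the same column layout as my files (untested on the real data; the helper column names 'X..' and 'Xj.' are my own, and I use base-R aggregate/merge/ave in place of the row-by-row loops, so I may well be missing something):

```r
# Loop-free sketch of the whole adjustment on toy data (base R only)
set.seed(1)
data <- expand.grid(trial = 1:10, subj = 1:4, time = 1:400)
data$fixprop <- rbinom(nrow(data), 1, 0.5)
f <- 2                                       # conditions per subject

# Group means, then joined back onto every row via merge()
X..data <- aggregate(fixprop ~ time, data = data, FUN = mean)         # overall mean per time step
names(X..data)[2] <- "X.."
Xj.data <- aggregate(fixprop ~ subj + time, data = data, FUN = mean)  # participant mean per time step
names(Xj.data)[3] <- "Xj."
data <- merge(data, X..data, by = "time")
data <- merge(data, Xj.data, by = c("subj", "time"))

# Step 1: Y = Xsj - Xj. + X..   (one vectorized subtraction/addition)
data$fixprop_adj <- data$fixprop - data$Xj. + data$X..

# Step 2: W = J*(Y - Y.j) + Y.j
J <- sqrt(f / (f - 1))
Y.j <- ave(data$fixprop_adj, data$time)      # mean per time step across participants
data$fixprop_final <- J * (data$fixprop_adj - Y.j) + Y.j
```

Merging the summary tables back onto the data replaces the per-row index bookkeeping (j, t, prev_subj, prev_tw) entirely, and the two adjustments then run as single vectorized expressions over all rows at once.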