##I have 2 columns of data. The first column is unique "event IDs" that represent a phone call made to a customer.
###So, if you see 3 entries together in the first column like follows:

matrix(c("call1a","call1a","call1a") )

##then this means that this particular phone call (the first call that's logged in the data set) was transferred
##between 3 different "modules" before the call was terminated.

##The second column is a numerical description of the module the call started with and then got transferred to prior to
##call termination. Now, I'll construct a representative array of the type of data I'm dealing with
##(the real data set goes on for X00,000s of rows):
##(Ignore how I construct the following array, it's completely unrelated to how the actual data set was constructed).

a<-sapply(1:50,function(i){paste("call",i,sep="",collapse="")})
development.a<-seq(1,40,3)
development.a2<-seq(1,40,5)
a[development.a]<-a[development.a+1]
a[development.a2]<-a[development.a2+1]
a[1:2]<-"call2a";a[3]<-"call3a";a[4:5]<-"call5a";a[6:8]<-"call8a";a[9]<-"call9a"
b<-c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,
     930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,
     930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,
     960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,
     970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
data<-as.data.frame(cbind(a,b))
colnames(data)<-c("phone calls","modules")
dim(data)
print(data[1:10,]) #sample of 10 rows

# Note that in the real data set, data[,2] ranges from 810,000 to 999,999. I've been tasked with the following:
# "For each phone call that BEGINS with the module which is denoted by 81 (i.e. of the form 81X,XXX),
# what is the expected number of modules in these calls?"
#Then it's the same question for each module beginning with 82, 83, 84..... all the way until 99.
#I've created code that I think works for this, but I can't actually run it on the whole data set. I left it for
#30 minutes and it only had about 5% of the task completed (I clicked "STOP" then checked my output to see if I
#did it properly, and it seems correct).
#I know the apply() family specializes in vector operations, but I can't figure out how to complete the above
#question in any way other than loops.

L<-data

A<-array(0,dim=c(19,2));rownames(A)<-seq(81,99,1)
A<-data.frame(A)

for(i in 1:(nrow(L)-1))
{
  if(L[(i+1),1]!=L[i,1])
  {
    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]<-
    {
      A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]+length(grep(as.character(L[i+1,1]),L[,1],value=FALSE))
      #aggregate number of modules in the calls that begin with XX (not yet averaged).
    }

    A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]<-
    {
      A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]+1
    }
  }
}

#If I can get this code to be more memory efficient such that I can do it on a 400,000 row data set,
#I can do, for example,

A[17,1]/A[17,2]

#and I'll arrive at the mean number of modules per call where the call starts with a module that starts with 97.

A[17,1]
#is 10, which means that, across all the calls that started with a module of 97X,XXX,
#they went through 10 modules in total.

A[17,2]
#is 6, which means that there were 6 calls in total that began with a 97X,XXX module.

#Hence,

A[17,1]/A[17,2]

#is the average number of modules that were executed in all the calls that began with a 97X,XXX module.

-----
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS
--
View this message in context: http://r.789695.n4.nabble.com/How-to-make-this-for-loop-memory-efficient-tp4283594p4283594.html
Sent from the R help mailing list archive at Nabble.com.
On Wed, 11 Jan 2012, iliketurtles wrote:
> [...]

I don't see any need for you to use data frames.
If you make A and data (not a good choice of variable name, since data() is also an R function) just matrices, you get the same answers at about 10 times the speed (using your example).

Hope this helps,
Ray Brownrigg
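
As a rough illustration of the matrix suggestion, here is a minimal sketch that keeps the loop but stores everything in plain matrices. The toy vectors `calls` and `modules` are made up for the example, and the loop body is a simplified single-pass rewrite (no grep() over the whole column), so it is not the original poster's code and it also counts the first call in the data. It assumes that all rows belonging to one call are adjacent and that module codes are six-character strings.

```r
# Hypothetical toy data in the same layout as the thread's example.
calls   <- c("c1", "c1", "c1", "c2", "c3", "c3", "c4", "c4")
modules <- c("920010", "960010", "970050", "820009",
             "920010", "960500", "970050", "930010")

# A character matrix instead of a data frame: element access is much cheaper.
L <- cbind(calls, modules)

# Accumulator matrix: one row per two-digit prefix 81..99.
A <- matrix(0, nrow = 19, ncol = 2,
            dimnames = list(as.character(81:99), c("modules", "calls")))

for (i in seq_len(nrow(L))) {
  # A new call starts on the first row or whenever the call ID changes.
  if (i == 1 || L[i, 1] != L[i - 1, 1]) {
    key <- substr(L[i, 2], 1, 2)          # prefix of the call's first module
    A[key, "calls"] <- A[key, "calls"] + 1
  }
  # Every row of the current call counts as one module for that prefix.
  A[key, "modules"] <- A[key, "modules"] + 1
}

# Mean modules per call; prefixes with no calls give NaN (0/0).
avg <- A[, "modules"] / A[, "calls"]
```

With this layout, `avg["92"]` plays the role of `A[12,1]/A[12,2]` in the original post.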
I'm having a really difficult time understanding what you're trying to get. Copying and pasting your code fails to run, and your question isn't clear, i.e.:

"For each phone call that BEGINS with the module which is denoted by 81 (i.e. of the form 81X,XXX), what is the expected number of modules in these calls?"

How does one calculate the expected number of "modules" in this module? What does that even mean?

Anyway, here's some code using your `data` data.frame that calculates the number of unique calls and other statistics on the "call id" within each module prefix. I'm using both data.table and plyr ... there are no for loops.

You will want to do `whatever it is you really want to do` inside the "blocks" below.

## R code
library(data.table)

## the code below refers to the columns as `calls` and `modules`,
## so rename them first (the post named them "phone calls" and "modules")
names(data) <- c("calls", "modules")
data <- transform(data, module.prefix=substring(modules, 1, 2))
## take a look at `data` now

## calculate "stuff" inside each module.prefix using data.table
xx <- data.table(data, key="module.prefix")
ans <- xx[, {
  ## the columns of the particular subset of your data.table
  ## are "injected" into the scope for this expression block
  ## which is where the `calls` variable below comes from
  tabled <- table(as.character(calls))
  list(unique.calls=length(tabled), min=min(tabled),
       median=as.numeric(median(tabled)), max=max(tabled))
  ## you will want to return your own list of "stuff"
}, by='module.prefix']

## with plyr
library(plyr)
ans <- ddply(data, "module.prefix", function(x) {
  ## `x` is a data.frame whose rows all share the same module.prefix
  ## do whatever you want with it here
  tabled <- table(as.character(x$calls))
  c(unique.calls=length(tabled), min=min(tabled),
    median=median(tabled), max=max(tabled))
})

You'll have to read up on the particulars of data.table and plyr. Both are really powerful packages ... you should get familiar with at least one. plyr is a bit more flexible in some ways. data.table is a bit more strict (cf. the need for `as.numeric(median(tabled))`), but also tends to be (much) faster when working over large datasets.

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
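
The statistics above are grouped by the prefix of every row's module. If the goal is instead the mean number of modules per call, grouped by the prefix of the module each call STARTS with, one possible loop-free sketch in base R is shown below. The toy vectors are made up for illustration, and the sketch assumes that rows of the same call are contiguous and that module codes are six-digit numbers; if `modules` is stored as character or factor, it would need `as.numeric(as.character(modules))` first.

```r
# Hypothetical toy data: call IDs and numeric six-digit module codes.
calls   <- c("c1", "c1", "c1", "c2", "c3", "c3", "c4", "c4")
modules <- c(920010, 960010, 970050, 820009,
             920010, 960500, 970050, 930010)

# Run-length encode the call IDs: one run per call, assuming contiguous rows.
runs <- rle(calls)

# Index of the first row of each call.
first.row <- cumsum(runs$lengths) - runs$lengths + 1

# Two-digit prefix of each call's starting module (integer division
# avoids as.character() turning e.g. 900000 into "9e+05").
prefix <- modules[first.row] %/% 10000

# Mean number of modules per call, grouped by starting prefix.
mean.modules <- tapply(runs$lengths, prefix, mean)
```

Here `runs$lengths` is the number of modules in each call, so `tapply()` does the averaging that the `A[17,1]/A[17,2]` division does in the original post, in a single pass over the data.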