thr3ads.net - R help - [R] How to make this for() loop memory efficient? [Jan 2012]


iliketurtles

2012-Jan-10 22:02 UTC

[R] How to make this for() loop memory efficient?

##I have 2 columns of data. The first column is unique "event IDs" that
##represent a phone call made to a customer.
###So, if you see 3 entries together in the first column like follows:

matrix(c("call1a","call1a","call1a") )

##then this means that this particular phone call (the first call that's
##logged in the data set) was transferred
##between 3 different "modules" before the call was terminated.

##The second column is a numerical description of the module the call
##started with and then got transferred to prior to call termination. Now,
##I'll construct a representative array of the type of data I'm dealing with
##(the real data set goes on for X00,000s of rows).
##(Ignore how I construct the following array, it's completely unrelated to
##how the actual data set was constructed).


a<-sapply(1:50,function(i){paste("call",i,sep="",collapse="")})
development.a<-seq(1,40,3)
development.a2<-seq(1,40,5)
a[development.a]<-a[development.a+1]
a[development.a2]<-a[development.a2+1]
a[1:2]<-"call2a";a[3]<-"call3a";a[4:5]<-"call5a";a[6:8]<-"call8a";a[9]<-"call9a"
b<-c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
data<-as.data.frame(cbind(a,b))
colnames(data)<-c("phone calls","modules")
dim(data)
print(data[1:10,]) #sample of 10 rows

# Note that in the real data set, data[,2] ranges from 810,000 to 999,999.
# I've been tasked with the following:
# "For each phone call that BEGINS with the module which is denoted by 81
# (i.e. of the form 81X,XXX), what is the expected number of modules in these
# calls?"
# Then it's the same question for each module beginning with 82, 83, 84.....
# all the way until 99.
# I've created code that I think works for this, but I can't actually run it
# on the whole data set. I left it for 30 minutes and it only had about 5% of
# the task completed (I clicked "STOP" then checked my output to see if I did
# it properly, and it seems correct).
# I know the apply() family specializes in vector operations, but I can't
# figure out how to complete the above question in any way other than loops.

L<-data

A<-array(0,dim=c(19,2));rownames(A)<-seq(81,99,1)
A<-data.frame(A)

 for(i in 1:(nrow(L)-1))
 {
  if(L[(i+1),1]!=L[i,1])
  {
   #first two digits of the module that the next call starts with
   prefix<-paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse="")
   #aggregate number of modules in the calls that begin with XX (not yet
   #averaged).
   A[prefix,1]<-A[prefix,1]+length(grep(as.character(L[i+1,1]),L[,1],value=FALSE))
   #number of calls that begin with XX
   A[prefix,2]<-A[prefix,2]+1
  }
 }

#If I can get this code to be more memory efficient such that I can do it on
#a 400,000 row data set, I can do, for example,

A[17,1]/A[17,2]

#and I'll arrive at the mean number of modules per call where the call
#starts with a module that starts with 97.

A[17,1]
#is 10, which means that the calls that started with a module of 97X,XXX
#went through 10 modules in total.

A[17,2]
#is 6, which means that there were 6 calls in total that began with a 97X,XXX
#module.

#Hence,

A[17,1]/A[17,2]

#is the average number of modules that were executed in all the calls that
#began with a 97X,XXX module.


-----
----

Isaac
Research Assistant
Quantitative Finance Faculty, UTS
--
View this message in context:
http://r.789695.n4.nabble.com/How-to-make-this-for-loop-memory-efficient-tp4283594p4283594.html
Sent from the R help mailing list archive at Nabble.com.

Ray Brownrigg

2012-Jan-10 23:31 UTC


I don't see any need for you to use data frames.

If you make A and data (not a good use of a variable name) just matrices, you
get the same answers at about 10 times the speed (using your example).
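
A minimal sketch of the matrix idea, under some assumptions: the data below
are made up (not the original data set), the variable names are chosen for
illustration, and call IDs are compared with == rather than grep(), which
assumes the IDs are meant to match exactly.

```r
# Small made-up data set (hypothetical values, one row per module visited)
calls   <- c("c1", "c1", "c2", "c3", "c3", "c3", "c4")
modules <- c(920010, 960010, 970050, 970050, 930010, 920010, 810010)

# A character matrix instead of a data frame: element access inside the
# loop is much cheaper than indexing a data frame row by row
L <- cbind(calls, modules)

# Numeric matrix of accumulators, one row per two-digit prefix 81..99
A <- matrix(0, nrow = 19, ncol = 2,
            dimnames = list(as.character(81:99), c("modules", "calls")))

for (i in 1:(nrow(L) - 1)) {
  # A new call starts at row i + 1 when the ID changes
  if (L[i + 1, 1] != L[i, 1]) {
    prefix <- substr(L[i + 1, 2], 1, 2)
    # Exact ID comparison (assumption: replaces the grep() of the original)
    A[prefix, 1] <- A[prefix, 1] + sum(L[, 1] == L[i + 1, 1])
    A[prefix, 2] <- A[prefix, 2] + 1
  }
}

print(A[c("81", "92", "97"), ])
```

Note that, as in the original loop, the very first call in the data is never
counted, because the test only fires when row i + 1 differs from row i.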

Hope this helps,
Ray Brownrigg

Steve Lianoglou

2012-Jan-11 00:18 UTC


I'm having a really difficult time understanding what you're trying to
get -- copy and pasting your code is failing to run, and your question
isn't clear, ie:

"For each phone call that BEGINS with the module which is denoted by 81
(i.e. of the form 81X,XXX), what is the expected number of modules in these
calls?"

How does one calculate the expected number of "modules" in this
module? What does that even mean?

Anyway, here's some code using your `data` data.frame that calculates the
number of unique calls and other statistics on the "call id" within
each module prefix. I'm using both data.table and plyr ... there are
no for loops.

You will want to do `whatever it is you really want to do` inside the
"blocks" below.

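## R code
library(data.table)

## the blocks below expect the two columns to be named `calls` and `modules`
colnames(data) <- c("calls", "modules")
data <- transform(data, module.prefix=substring(modules, 1, 2))

## take a look at `data` now

## calculate "stuff" inside each module.prefix using data.table
xx <- data.table(data, key="module.prefix")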

ans <- xx[, {
  ## the columns of the particular subset of your data.table
  ## are "injected" into the scope for this expression block
  ## which is where the `calls` variable below comes from
  tabled <- table(as.character(calls))
  list(unique.calls=length(tabled), min=min(tabled),
       median=as.numeric(median(tabled)), max=max(tabled))
  ## you will want to return your own list of "stuff"
}, by='module.prefix']


## with plyr
library(plyr)
ans <- ddply(data, "module.prefix", function(x) {
  ## `x` is a data.frame that all share the same module.prefix
  ## do whatever you want with it here
  tabled <- table(as.character(x$calls))
  c(unique.calls=length(tabled), min=min(tabled),
    median=median(tabled), max=max(tabled))
})

You'll have to read up on the particulars of data.table and plyr. Both
are really powerful packages ... you should get familiar with at least
one.

plyr is a bit more flexible in some ways.

data.table is a bit more strict (cf. the need for
`as.numeric(median(tabled))`), but also tends to be (much) faster when
working over large datasets.
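
If the question is read as "mean number of modules per call, grouped by the
two-digit prefix of the module each call starts with" (an assumption about
what is being asked), a loop-free sketch in base R could look like the
following. It assumes the rows of each call are adjacent, as in the example
data; the values and variable names are made up for illustration.

```r
# Made-up data in the same shape as the example: one row per module visited,
# rows belonging to the same call are adjacent (hypothetical values)
calls   <- c("c1", "c1", "c2", "c3", "c3", "c3", "c4")
modules <- c(920010, 960010, 970050, 970050, 930010, 920010, 810010)

# Run-length encode the call IDs: one run per call, and the run length is
# the number of modules that call went through
# (use as.character(calls) first if the IDs are stored as a factor)
runs <- rle(calls)

# Row index of the first module of every call
first.row <- cumsum(c(1, head(runs$lengths, -1)))

# Two-digit prefix of the module each call starts with
start.prefix <- modules[first.row] %/% 10000

# Mean number of modules per call, grouped by starting prefix
ans <- tapply(runs$lengths, start.prefix, mean)
print(ans)
```

Everything here works on whole vectors, so the cost grows roughly linearly
with the number of rows instead of rescanning the ID column for every call.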

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
