Hi all,I know this topic has came up multiple times, but I've never fully
understand the apply() function.
Anyway, I'm here asking for your help again to convert this loop to apply().
I have 2 data frames with the following information: a1 is the fragment that
is need to be covered, a2 is the probes that cover the specific fragment.
I need to count the number of probes cover every given fragment (they need
to have the same cat ID to be on the same fragment)
a1<-data.frame(id=c(1:6), cat=c('cat 1','cat 1','cat
2','cat 2','cat 2','cat
3'), st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
a2<-data.frame(id=paste('probe',c(1:8)), cat=c('cat
1','cat 1','cat 2','cat
2','cat 2','cat 3','cat 3','cat 3'),
st=c(1,9,20,38,53,70,80,95),
en=c(6,15,36,43,58,75,85,98));
a1$coverage<-NULL;
I came up with this for loop (basically, if a probe starts before the
fragment end, and end after a fragment start, it cover that fragment)
for (i in 1:length(a1$id))
{
a1$coverage[i]<-length(a2[a2$st<=a1$en[i]&a2$en>=a1$st[i]&a2$cat==a1$cat[i],]$id);
}
> a1$coverage
[1] 1 1 2 2 0 1
This loop runs awefully slow when I have 200,000 probes and 30,000
fragments. Is there anyway I can speed this up with apply()?
This is the time for my for loop to scan through the first 20 record of my
dataset:
user system elapsed
2.264 0.501 2.770
I think there is room for improvement here. Any idea?
Thanks
--
Regards,
Anh Tran
[[alternative HTML version deleted]]
One way to make it faster is to remove the accessing of the dataframe
in the loop. By converting the values you need to compare to
matrices, the compares will be faster. Try the following to see if
there is a speedup:
a1<-data.frame(id=c(1:6), cat=c('cat 1','cat 1','cat
2','cat 2','cat
2','cat 3'), st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
a2<-data.frame(id=paste('probe',c(1:8)), cat=c('cat
1','cat 1','cat
2','cat 2','cat 2','cat 3','cat 3','cat
3'),
st=c(1,9,20,38,53,70,80,95), en=c(6,15,36,43,58,75,85,98));
# convert to matrices for faster access than data frames
# make sure that the 'cat' has the same numeric values in both matrices
uniqCAT <- unique(levels(a1$cat), levels(a2$cat))
a1.m <- cbind(a1$st, a1$en, as.numeric(factor(a1$cat, levels=uniqCAT)))
a2.m <- cbind(a2$st, a2$en, as.numeric(factor(a2$cat, levels=uniqCAT)))
# now do the comparison
a1$coverage <- apply(a1.m, 1, function(.row){
sum((.row[1] < a2.m[,2]) & (.row[2] > a2.m[,1]) & (.row[3] ==
a2.m[,3]))
})
On Sat, Aug 2, 2008 at 2:04 AM, Anh Tran <anhnttran at ucla.edu>
wrote:> Hi all,I know this topic has came up multiple times, but I've never
fully
> understand the apply() function.
>
> Anyway, I'm here asking for your help again to convert this loop to
apply().
>
> I have 2 data frames with the following information: a1 is the fragment
that
> is need to be covered, a2 is the probes that cover the specific fragment.
>
> I need to count the number of probes cover every given fragment (they need
> to have the same cat ID to be on the same fragment)
>
> a1<-data.frame(id=c(1:6), cat=c('cat 1','cat 1','cat
2','cat 2','cat 2','cat
> 3'), st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
> a2<-data.frame(id=paste('probe',c(1:8)), cat=c('cat
1','cat 1','cat 2','cat
> 2','cat 2','cat 3','cat 3','cat 3'),
st=c(1,9,20,38,53,70,80,95),
> en=c(6,15,36,43,58,75,85,98));
> a1$coverage<-NULL;
>
> I came up with this for loop (basically, if a probe starts before the
> fragment end, and end after a fragment start, it cover that fragment)
>
> for (i in 1:length(a1$id))
> {
>
a1$coverage[i]<-length(a2[a2$st<=a1$en[i]&a2$en>=a1$st[i]&a2$cat==a1$cat[i],]$id);
> }
>
>> a1$coverage
> [1] 1 1 2 2 0 1
>
>
> This loop runs awefully slow when I have 200,000 probes and 30,000
> fragments. Is there anyway I can speed this up with apply()?
>
> This is the time for my for loop to scan through the first 20 record of my
> dataset:
> user system elapsed
> 2.264 0.501 2.770
>
> I think there is room for improvement here. Any idea?
>
> Thanks
> --
> Regards,
> Anh Tran
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
Here is a way to speed up your toy example:
_________________
a1 <- data.frame(id = 1:6,
cat = paste('cat', rep(1:3, c(2,3,1))),
st = c(1, 7, 30, 40, 59, 91),
en = c(5, 25, 39, 55, 70, 120))
a2 <- data.frame(id = paste('probe', 1:8),
cat = paste('cat', rep(1:3, c(2,3,3))),
st = c(1, 9, 20, 38, 53, 70, 80, 95),
en = c(6, 15, 36, 43, 58, 75, 85, 98))
c1 <- outer(a2$st , a1$en , "<=")
c2 <- outer(a2$en , a1$st , ">=")
c3 <- outer(a2$cat, a1$cat, "==")
a1$coverage <- colSums(c1*c2*c3)
__________________
This won't work in one step if a1 has 30000 rows and a2 has 200000,
unless you memory size is approximately infinite, so you will need a
loop. Suppose you can handle 1000 probes at a time. You might be able
to get away with something like this:
__________________
chunk <- 1000 ### make as large as possible!
checkCoverage <- function(a1, a2)
colSums(outer(a2$st , a1$en , "<=") *
outer(a2$en , a1$st , ">=") *
outer(a2$cat, a1$cat, "=="))
coverage <- numeric(N <- nrow(a2))
m2 <- 0
while(m2 < N) {
m1 <- m2 + 1
m2 <- min(m2 + chunk, N)
coverage[m1:m2] <- checkCoverage(a1, a2[m1:m2, ])
}
a1$coverage <- coverage
__________________
(Warning: untested code.)
Failing that, go to C code and be done with it.
BTW, why did you think apply() was going to be useful here?
Bill Venables
CSIRO Laboratories
PO Box 120, Cleveland, 4163
AUSTRALIA
Office Phone (email preferred): +61 7 3826 7251
Fax (if absolutely necessary): +61 7 3826 7304
Mobile: +61 4 8819 4402
Home Phone: +61 7 3286 7700
mailto:Bill.Venables at csiro.au
http://www.cmis.csiro.au/bill.venables/
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Anh Tran
Sent: Saturday, 2 August 2008 4:04 PM
To: rlist
Subject: [R] convert for loop into apply()
Hi all,I know this topic has came up multiple times, but I've never
fully
understand the apply() function.
Anyway, I'm here asking for your help again to convert this loop to
apply().
I have 2 data frames with the following information: a1 is the fragment
that
is need to be covered, a2 is the probes that cover the specific
fragment.
I need to count the number of probes cover every given fragment (they
need
to have the same cat ID to be on the same fragment)
a1<-data.frame(id=c(1:6), cat=c('cat 1','cat 1','cat
2','cat 2','cat
2','cat
3'), st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
a2<-data.frame(id=paste('probe',c(1:8)), cat=c('cat
1','cat 1','cat
2','cat
2','cat 2','cat 3','cat 3','cat 3'),
st=c(1,9,20,38,53,70,80,95),
en=c(6,15,36,43,58,75,85,98));
a1$coverage<-NULL;
I came up with this for loop (basically, if a probe starts before the
fragment end, and end after a fragment start, it cover that fragment)
for (i in 1:length(a1$id))
{
a1$coverage[i]<-length(a2[a2$st<=a1$en[i]&a2$en>=a1$st[i]&a2$cat==a1$cat
[i],]$id);
}
> a1$coverage
[1] 1 1 2 2 0 1
This loop runs awefully slow when I have 200,000 probes and 30,000
fragments. Is there anyway I can speed this up with apply()?
This is the time for my for loop to scan through the first 20 record of
my
dataset:
user system elapsed
2.264 0.501 2.770
I think there is room for improvement here. Any idea?
Thanks
--
Regards,
Anh Tran
[[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Also, your request can easily be formulated as an SQL statement,
for example utilizing the 'sqldf' package:
----
library(sqldf)
a1 <- data.frame(id = 1:6,
cat = paste('cat', rep(1:3, c(2,3,1))),
st = c(1, 7, 30, 40, 59, 91),
en = c(5, 25, 39, 55, 70, 120))
a2 <- data.frame(id = paste('probe', 1:8),
cat = paste('cat', rep(1:3, c(2,3,3))),
st = c(1, 9, 20, 38, 53, 70, 80, 95),
en = c(6, 15, 36, 43, 58, 75, 85, 98))
sqldf("select a1.id as id, count(*) from a1, a2 where a1.cat = a2.cat
and a2.st <= a1.en
and a2.en >= a1.st
group by a1.id")
# id count(*)
# 1 1
# 2 1
# 3 2
# 4 2
# 6 1
----
Of course, it needs some overhead in generating the SQLite tables.
Therefore I would very much like to hear whether there is some
significant improvement -- or the contrary.
// Hans Werner Borchers
Anh Tran-2 wrote:>
> Hi all,I know this topic has came up multiple times, but I've never
fully
> understand the apply() function.
>
> Anyway, I'm here asking for your help again to convert this loop to
> apply().
>
> I have 2 data frames with the following information: a1 is the fragment
> that
> is need to be covered, a2 is the probes that cover the specific fragment.
>
> I need to count the number of probes cover every given fragment (they need
> to have the same cat ID to be on the same fragment)
>
> a1<-data.frame(id=c(1:6), cat=c('cat 1','cat 1','cat
2','cat 2','cat
> 2','cat
> 3'), st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
> a2<-data.frame(id=paste('probe',c(1:8)), cat=c('cat
1','cat 1','cat
> 2','cat
> 2','cat 2','cat 3','cat 3','cat 3'),
st=c(1,9,20,38,53,70,80,95),
> en=c(6,15,36,43,58,75,85,98));
> a1$coverage<-NULL;
>
> I came up with this for loop (basically, if a probe starts before the
> fragment end, and end after a fragment start, it cover that fragment)
>
> for (i in 1:length(a1$id))
> {
>
a1$coverage[i]<-length(a2[a2$st<=a1$en[i]&a2$en>=a1$st[i]&a2$cat==a1$cat[i],]$id);
> }
>
>> a1$coverage
> [1] 1 1 2 2 0 1
>
>
> This loop runs awefully slow when I have 200,000 probes and 30,000
> fragments. Is there anyway I can speed this up with apply()?
>
> This is the time for my for loop to scan through the first 20 record of my
> dataset:
> user system elapsed
> 2.264 0.501 2.770
>
> I think there is room for improvement here. Any idea?
>
> Thanks
> --
> Regards,
> Anh Tran
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
--
View this message in context:
http://www.nabble.com/convert-for-loop-into-apply%28%29-tp18786483p18796799.html
Sent from the R help mailing list archive at Nabble.com.