Hi, dear R community,
I have a data set like the one below. What I want to do is calculate the
cumulative coverage. The following code works for a small data set (~100
rows), but when I feed it the whole data set, it is still running after 24
hours. Can someone give me some suggestions for handling a long vector?
id reads
Contig79:1 4
Contig79:2 8
Contig79:3 13
Contig79:4 14
Contig79:5 17
Contig79:6 20
Contig79:7 25
Contig79:8 27
Contig79:9 32
Contig79:10 33
Contig79:11 34
matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
dim(matt)
[1] 3384766 2
matt_plot <- function(matt, outputfile) {
  names(matt) <- c("id", "reads")

  cover <- matt$reads

  # calculate the cumulative coverage.
  cover_per <- function(data) {
    output <- numeric(0)
    for (i in data) {
      x <- (100 * sum(ifelse(data >= i, 1, 0)) / length(data))
      output <- c(output, x)
    }
    return(output)
  }

  result <- cover_per(cover)
Thanks so much!
--
Sincerely,
Changbin
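For readers who want to run the example without access to the site-specific
depth file, here is a minimal self-contained sketch of the setup above; the
11-row sample is typed in directly as a stand-in for the full file:

    # Recreate the posted sample and the posted cover_per function.
    matt <- data.frame(id    = paste("Contig79:", 1:11, sep = ""),
                       reads = c(4, 8, 13, 14, 17, 20, 25, 27, 32, 33, 34))

    cover_per <- function(data) {
      output <- numeric(0)
      for (i in data) {
        output <- c(output, 100 * sum(ifelse(data >= i, 1, 0)) / length(data))
      }
      output
    }

    cover_per(matt$reads)
    # With 11 distinct values this is simply 100 * (11:1) / 11.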
Try this:

    rev(100 * cumsum(matt$reads > 1) / length(matt$reads))

On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du <changbind@gmail.com> wrote:
> Hi, dear R community,
>
> I have a data set like the one below. What I want to do is calculate the
> cumulative coverage. [...]

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
Is this what you want:

> x
            id reads
1   Contig79:1     4
2   Contig79:2     8
3   Contig79:3    13
4   Contig79:4    14
5   Contig79:5    17
6   Contig79:6    20
7   Contig79:7    25
8   Contig79:8    27
9   Contig79:9    32
10 Contig79:10    33
11 Contig79:11    34
> x$percent <- x$reads / max(x$reads) * 100
> x
            id reads   percent
1   Contig79:1     4  11.76471
2   Contig79:2     8  23.52941
3   Contig79:3    13  38.23529
4   Contig79:4    14  41.17647
5   Contig79:5    17  50.00000
6   Contig79:6    20  58.82353
7   Contig79:7    25  73.52941
8   Contig79:8    27  79.41176
9   Contig79:9    32  94.11765
10 Contig79:10    33  97.05882
11 Contig79:11    34 100.00000

On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du <changbind at gmail.com> wrote:
> Hi, dear R community,
>
> I have a data set like the one below. What I want to do is calculate the
> cumulative coverage. [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
Thanks, Jim! This is not what I want. What I want is to calculate the
percentage of reads bigger than or equal to the reads at each position. My
output is like the following: for row 1, all of the reads are >= 4, so the
cover_per is 100; for row 2, 99% of the reads are >= 8, so the cover_per is
99.

> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

I attached the input file with this email. This file is only 100 rows, very
small. My original data set is 3384766 rows.

> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T)
> dim(matt)
[1] 3384766       2

Thanks so much for your time!

> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
skip=0, header=F,fill=T)
> names(matt)<-c("id","reads")
> dim(matt)
[1] 100   2
> cover<-matt$reads
>
> #calculate the cumulative coverage.
> cover_per<-function (data) {
+   output<-numeric(0)
+   for (i in data) {
+     x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
+     output<-c(output, x)
+   }
+   return(output)
+ }
>
> result<-cover_per(cover)
> head(result)
[1] 100  99  98  97  96  95
>
> final<-data.frame(result, cover)
> names(final)<-c("cover_per", "reads")
> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

On Thu, Nov 4, 2010 at 9:18 AM, jim holtman <jholtman at gmail.com> wrote:
> Is this what you want:
> [...]

--
Sincerely,
Changbin
--
Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856

-------------- next part --------------
[Attachment matt_test.txt: 100 tab-separated rows, from "Contig79:1  4"
through "Contig79:100  199".]
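Given that definition (for each position, the percentage of reads greater
than or equal to the reads at that position), the whole calculation can be
vectorized. Here is a minimal sketch using base R's rank(); the name
cover_per_vec is only illustrative, and this is offered as an alternative
formulation rather than as part of the original code:

    # rank(x, ties.method = "min") is 1 + the number of strictly smaller
    # entries, so n - rank + 1 is the number of entries >= each value.
    cover_per_vec <- function(data) {
      n <- length(data)
      100 * (n - rank(data, ties.method = "min") + 1) / n
    }

    # On the attached 100-row test file this should reproduce the head(final)
    # values 100, 99, 98, 97, 96, 95, and it handles ties (e.g. the repeated
    # 64s and 68s) the same way as the sum(data >= i) loop. Because rank()
    # only sorts once, it also scales to the full 3384766-row file.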
Changbin -
Does

    100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) / length(matt$reads)

give what you want?
By the way, if you want to use a loop (there's nothing wrong with
that),
then try to avoid the most common mistake that people make with loops in R:
having your result grow inside the loop. Here's a better way to use a loop
to solve your problem:
cover_per_1 <- function(data){
    l = length(data)
    output = numeric(l)
    for(i in 1:l)
        output[i] = 100 * sum(ifelse(data >= data[i], 1, 0)) / length(data)
    output
}
Using some random data, and comparing to your original cover_per function:
> dat = rnorm(1000)
> system.time(one <- cover_per(dat))
   user  system elapsed
  0.816   0.000   0.824
> system.time(two <- cover_per_1(dat))
   user  system elapsed
  0.792   0.000   0.805
Not that big a speedup, but it does increase quite a bit as the problem gets
larger.
There are two obvious ways to speed up your function:
1) Eliminate the ifelse function, since automatic coercion from
logical to numeric does the same thing.
2) Multiply by 100 and divide by the length outside the loop:
cover_per_2 <- function(data){
l = length(data)
output = numeric(l)
for(i in 1:l)output[i] = sum(data >= data[i])
100 * output / l
}
> system.time(three <- cover_per_2(dat))
   user  system elapsed
  0.024   0.000   0.027
That makes the loop just about equivalent to the sapply solution:
> system.time(four <- 100 * sapply(dat, function(x) sum(dat >= x)) / length(dat))
   user  system elapsed
  0.024   0.000   0.026
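For scale: both of these versions still compare every element with every
other, so the cost grows with the square of the number of rows. A rough,
machine-dependent extrapolation of the 0.024 s measured above for 1,000 rows
to the full 3,384,766-row file (illustrative arithmetic only; the rank()-based
sketch earlier in the thread avoids the quadratic cost entirely):

    # Quadratic cost scales with (n_big / n_small)^2.
    0.024 * (3384766 / 1000)^2 / 3600   # roughly 76 hours

That is consistent with a job still running after 24 hours.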
- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spector at stat.berkeley.edu
On Thu, 4 Nov 2010, Changbin Du wrote:
> Hi, dear R community,
>
> I have a data set like the one below. What I want to do is calculate the
> cumulative coverage. The following code works for a small data set (~100
> rows), but when I feed it the whole data set, it is still running after
> 24 hours. Can someone give me some suggestions for handling a long vector?
> [...]