Hi, dear R community,

I have a data set like the one below. What I want to do is to calculate the
cumulative coverage. The following code works for a small data set (about 100
rows), but when fed the whole data set it is still running after 24 hours.
Can someone give some suggestions for a long vector?

id             reads
Contig79:1     4
Contig79:2     8
Contig79:3     13
Contig79:4     14
Contig79:5     17
Contig79:6     20
Contig79:7     25
Contig79:8     27
Contig79:9     32
Contig79:10    33
Contig79:11    34

matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
                   sep = "\t", skip = 0, header = F, fill = T)
dim(matt)
[1] 3384766       2

matt_plot <- function(matt, outputfile) {
  names(matt) <- c("id", "reads")

  cover <- matt$reads

  # calculate the cumulative coverage: for each position, the percentage of
  # positions whose read depth is >= the depth at that position
  cover_per <- function(data) {
    output <- numeric(0)
    for (i in data) {
      x <- 100 * sum(ifelse(data >= i, 1, 0)) / length(data)
      output <- c(output, x)
    }
    return(output)
  }

  result <- cover_per(cover)
  # ... (rest of the function not shown)

Thanks so much!

--
Sincerely,
Changbin
Try this:

  rev(100 * cumsum(matt$reads > 1) / length(matt$reads))

On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du <changbind@gmail.com> wrote:
> I have a data set like the one below. What I want to do is to calculate
> the cumulative coverage. The following code works for a small data set
> (about 100 rows), but when fed the whole data set it is still running
> after 24 hours. Can someone give some suggestions for a long vector?
> [...]

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
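A quick check of this one-liner on the 11-row sample, as a minimal sketch: it
assumes the reads column is sorted in ascending order, has no tied depths,
and that every depth is above the threshold of 1, exactly as in the posted
data.

  reads <- c(4, 8, 13, 14, 17, 20, 25, 27, 32, 33, 34)  # sample depths from the post
  rev(100 * cumsum(reads > 1) / length(reads))
  # 100, 90.9, 81.8, ..., 9.09: the percentage of positions at or after each
  # row, which equals "percent of reads >= this row's depth" only because the
  # column is sorted ascending and contains no ties

With tied depths, the ">=" definition in the original cover_per would give
equal percentages for the tied rows, while this expression keeps decreasing.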
Is this what you want:

> x
            id reads
1   Contig79:1     4
2   Contig79:2     8
3   Contig79:3    13
4   Contig79:4    14
5   Contig79:5    17
6   Contig79:6    20
7   Contig79:7    25
8   Contig79:8    27
9   Contig79:9    32
10 Contig79:10    33
11 Contig79:11    34
> x$percent <- x$reads / max(x$reads) * 100
> x
            id reads   percent
1   Contig79:1     4  11.76471
2   Contig79:2     8  23.52941
3   Contig79:3    13  38.23529
4   Contig79:4    14  41.17647
5   Contig79:5    17  50.00000
6   Contig79:6    20  58.82353
7   Contig79:7    25  73.52941
8   Contig79:8    27  79.41176
9   Contig79:9    32  94.11765
10 Contig79:10    33  97.05882
11 Contig79:11    34 100.00000

On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du <changbind at gmail.com> wrote:
> I have a data set like the one below. What I want to do is to calculate
> the cumulative coverage. The following code works for a small data set
> (about 100 rows), but when fed the whole data set it is still running
> after 24 hours. Can someone give some suggestions for a long vector?
> [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
Thanks, Jim! This is not what I want. What I want is to calculate, for each
position, the percentage of reads that are bigger than or equal to the read
count at that position. My output looks like the following: for row 1, all
the reads are >= 4, so cover_per is 100; for row 2, 99% of the reads are
>= 8, so cover_per is 99.

> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

I attached the input file with this email. The file is only 100 rows, very
small. My original data set is 3384766 rows.

> matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
+                    sep = "\t", skip = 0, header = F, fill = T)
> dim(matt)
[1] 3384766       2

Thanks so much for your time!

> matt <- read.table("/home/cdu/operon/dimer5_0623/matt_test.txt",
+                    sep = "\t", skip = 0, header = F, fill = T)
> names(matt) <- c("id", "reads")
> dim(matt)
[1] 100   2
> cover <- matt$reads
>
> # calculate the cumulative coverage
> cover_per <- function(data) {
+   output <- numeric(0)
+   for (i in data) {
+     x <- 100 * sum(ifelse(data >= i, 1, 0)) / length(data)
+     output <- c(output, x)
+   }
+   return(output)
+ }
> result <- cover_per(cover)
> head(result)
[1] 100  99  98  97  96  95
> final <- data.frame(result, cover)
> names(final) <- c("cover_per", "reads")
> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

On Thu, Nov 4, 2010 at 9:18 AM, jim holtman <jholtman at gmail.com> wrote:
> Is this what you want:
>
> > x$percent <- x$reads / max(x$reads) * 100
> [...]

--
Sincerely,
Changbin
--
Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856

-------------- next part --------------
Contig79:1    4
Contig79:2    8
Contig79:3    13
Contig79:4    14
Contig79:5    17
Contig79:6    20
Contig79:7    25
Contig79:8    27
Contig79:9    32
Contig79:10    33
Contig79:11    34
Contig79:12    36
Contig79:13    39
Contig79:14    40
Contig79:15    44
Contig79:16    49
Contig79:17    55
Contig79:18    56
Contig79:19    59
Contig79:20    60
Contig79:21    62
Contig79:22    64
Contig79:23    64
Contig79:24    68
Contig79:25    68
Contig79:26    68
Contig79:27    70
Contig79:28    73
Contig79:29    76
Contig79:30    77
Contig79:31    78
Contig79:32    78
Contig79:33    79
Contig79:34    80
Contig79:35    80
Contig79:36    84
Contig79:37    87
Contig79:38    87
Contig79:39    88
Contig79:40    88
Contig79:41    89
Contig79:42    93
Contig79:43    94
Contig79:44    98
Contig79:45    99
Contig79:46    99
Contig79:47    102
Contig79:48    103
Contig79:49    108
Contig79:50    112
Contig79:51    112
Contig79:52    113
Contig79:53    116
Contig79:54    118
Contig79:55    120
Contig79:56    124
Contig79:57    124
Contig79:58    126
Contig79:59    128
Contig79:60    130
Contig79:61    133
Contig79:62    134
Contig79:63    136
Contig79:64    139
Contig79:65    144
Contig79:66    145
Contig79:67    146
Contig79:68    148
Contig79:69    149
Contig79:70    151
Contig79:71    156
Contig79:72    157
Contig79:73    158
Contig79:74    159
Contig79:75    159
Contig79:76    159
Contig79:77    160
Contig79:78    160
Contig79:79    161
Contig79:80    163
Contig79:81    164
Contig79:82    165
Contig79:83    165
Contig79:84    165
Contig79:85    165
Contig79:86    166
Contig79:87    170
Contig79:88    170
Contig79:89    172
Contig79:90    174
Contig79:91    178
Contig79:92    180
Contig79:93    181
Contig79:94    184
Contig79:95    184
Contig79:96    187
Contig79:97    190
Contig79:98    192
Contig79:99    194
Contig79:100    199
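For the cover_per output just described, the same percentages can also be
computed without comparing every position against every other one. A minimal
sketch using rank(), not taken from the thread (it assumes matt holds the id
and reads columns exactly as in the post):

  n <- length(matt$reads)
  # with ties.method = "min", rank() is 1 plus the number of strictly smaller
  # depths, so n - rank + 1 counts the depths greater than or equal to each one
  cover_pct <- 100 * (n - rank(matt$reads, ties.method = "min") + 1) / n
  final <- data.frame(cover_per = cover_pct, reads = matt$reads)

On the attached 100-row sample, head(final) reproduces the 100, 99, 98, 97,
96, 95 shown above, and tied depths get identical percentages, matching the
original loop.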
Changbin -
   Does

  100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) / length(matt$reads)

give what you want?

   By the way, if you want to use a loop (there's nothing wrong with that),
then try to avoid the most common mistake that people make with loops in R:
having your result grow inside the loop. Here's a better way to use a loop
to solve your problem:

cover_per_1 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = 100 * sum(ifelse(data >= data[i], 1, 0)) / length(data)
  output
}

Using some random data, and comparing to your original cover_per function:

> dat = rnorm(1000)
> system.time(one <- cover_per(dat))
   user  system elapsed
  0.816   0.000   0.824
> system.time(two <- cover_per_1(dat))
   user  system elapsed
  0.792   0.000   0.805

Not that big a speedup, but it does increase quite a bit as the problem gets
larger. There are two obvious ways to speed up your function:

  1) Eliminate the ifelse function, since automatic coercion from logical to
     numeric does the same thing.
  2) Multiply by 100 and divide by the length outside the loop:

cover_per_2 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = sum(data >= data[i])
  100 * output / l
}

> system.time(three <- cover_per_2(dat))
   user  system elapsed
  0.024   0.000   0.027

That makes the loop just about equivalent to the sapply solution:

> system.time(four <- 100*sapply(dat, function(x) sum(dat >= x))/length(dat))
   user  system elapsed
  0.024   0.000   0.026

                                        - Phil Spector
                                          Statistical Computing Facility
                                          Department of Statistics
                                          UC Berkeley
                                          spector at stat.berkeley.edu

On Thu, 4 Nov 2010, Changbin Du wrote:
> I have a data set like the one below. What I want to do is to calculate
> the cumulative coverage. The following code works for a small data set
> (about 100 rows), but when fed the whole data set it is still running
> after 24 hours. Can someone give some suggestions for a long vector?
> [...]
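One last note on scale: the sapply call and both loop versions above still
compare every row against the full vector, so the work grows with the square
of the number of rows, which is what makes 3384766 rows so slow. Counting how
many positions have each distinct depth and taking a reverse cumulative sum
gives the same ">=" percentages in essentially one pass. A minimal sketch,
not from the thread, again assuming matt$reads as in the original post:

  depth_counts <- table(matt$reads)              # positions per distinct depth
  depths <- as.integer(names(depth_counts))      # distinct depths, ascending
  # reverse cumulative sum: for each distinct depth, the number of positions
  # whose depth is greater than or equal to it
  ge_counts <- rev(cumsum(rev(as.integer(depth_counts))))
  cover_per <- 100 * ge_counts[match(matt$reads, depths)] / nrow(matt)

The match() lookup against the sorted distinct depths keeps the sketch
simple; findInterval(matt$reads, depths) is an alternative way to do the
same lookup.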