Hi, dear R community,
I have a data set like the one below. What I want to do is calculate the
cumulative coverage. The following code works for a small data set (~100
rows), but when I feed it the whole data set, it is still running after 24
hours. Can someone give me some suggestions for handling a long vector?
id reads
Contig79:1 4
Contig79:2 8
Contig79:3 13
Contig79:4 14
Contig79:5 17
Contig79:6 20
Contig79:7 25
Contig79:8 27
Contig79:9 32
Contig79:10 33
Contig79:11 34
matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
dim(matt)
[1] 3384766 2
matt_plot <- function(matt, outputfile) {
  names(matt) <- c("id", "reads")

  cover <- matt$reads

  # calculate the cumulative coverage.
  cover_per <- function(data) {
    output <- numeric(0)
    for (i in data) {
      x <- (100 * sum(ifelse(data >= i, 1, 0)) / length(data))
      output <- c(output, x)
    }
    return(output)
  }

  result <- cover_per(cover)
Thanks so much!
--
Sincerely,
Changbin
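For readers who want to run the example without access to the site-specific
depth file, here is a minimal self-contained sketch of the setup above; the
11-row sample is typed in directly as a stand-in for the full file:

    # Recreate the posted sample and the posted cover_per function.
    matt <- data.frame(id    = paste("Contig79:", 1:11, sep = ""),
                       reads = c(4, 8, 13, 14, 17, 20, 25, 27, 32, 33, 34))

    cover_per <- function(data) {
      output <- numeric(0)
      for (i in data) {
        output <- c(output, 100 * sum(ifelse(data >= i, 1, 0)) / length(data))
      }
      output
    }

    cover_per(matt$reads)
    # With 11 distinct values this is simply 100 * (11:1) / 11.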
Try this:

    rev(100 * cumsum(matt$reads > 1) / length(matt$reads))

On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du <changbind@gmail.com> wrote:
> Hi, dear R community,
>
> I have a data set like the one below. What I want to do is calculate the
> cumulative coverage. [...]

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
Is this what you want:

> x
            id reads
1   Contig79:1     4
2   Contig79:2     8
3   Contig79:3    13
4   Contig79:4    14
5   Contig79:5    17
6   Contig79:6    20
7   Contig79:7    25
8   Contig79:8    27
9   Contig79:9    32
10 Contig79:10    33
11 Contig79:11    34
> x$percent <- x$reads / max(x$reads) * 100
> x
            id reads   percent
1   Contig79:1     4  11.76471
2   Contig79:2     8  23.52941
3   Contig79:3    13  38.23529
4   Contig79:4    14  41.17647
5   Contig79:5    17  50.00000
6   Contig79:6    20  58.82353
7   Contig79:7    25  73.52941
8   Contig79:8    27  79.41176
9   Contig79:9    32  94.11765
10 Contig79:10    33  97.05882
11 Contig79:11    34 100.00000

On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du <changbind at gmail.com> wrote:
> Hi, dear R community,
>
> I have a data set like the one below. What I want to do is calculate the
> cumulative coverage. [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
Thanks, Jim! This is not what I want. What I want is to calculate the
percentage of reads bigger than or equal to the reads at each position. My
output is like the following: for row 1, all of the reads are >= 4, so the
cover_per is 100; for row 2, 99% of the reads are >= 8, so the cover_per is
99.

> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

I attached the input file with this email. This file is only 100 rows, very
small. My original data set is 3384766 rows.

> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T)
> dim(matt)
[1] 3384766       2

Thanks so much for your time!

> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
skip=0, header=F,fill=T)
> names(matt)<-c("id","reads")
> dim(matt)
[1] 100   2
> cover<-matt$reads
>
> #calculate the cumulative coverage.
> cover_per<-function (data) {
+   output<-numeric(0)
+   for (i in data) {
+     x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
+     output<-c(output, x)
+   }
+   return(output)
+ }
>
> result<-cover_per(cover)
> head(result)
[1] 100  99  98  97  96  95
>
> final<-data.frame(result, cover)
> names(final)<-c("cover_per", "reads")
> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

On Thu, Nov 4, 2010 at 9:18 AM, jim holtman <jholtman at gmail.com> wrote:
> Is this what you want:
> [...]

--
Sincerely,
Changbin
--
Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856

-------------- next part --------------
[Attachment matt_test.txt: 100 tab-separated rows, from "Contig79:1  4"
through "Contig79:100  199".]
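Given that definition (for each position, the percentage of reads greater
than or equal to the reads at that position), the whole calculation can be
vectorized. Here is a minimal sketch using base R's rank(); the name
cover_per_vec is only illustrative, and this is offered as an alternative
formulation rather than as part of the original code:

    # rank(x, ties.method = "min") is 1 + the number of strictly smaller
    # entries, so n - rank + 1 is the number of entries >= each value.
    cover_per_vec <- function(data) {
      n <- length(data)
      100 * (n - rank(data, ties.method = "min") + 1) / n
    }

    # On the attached 100-row test file this should reproduce the head(final)
    # values 100, 99, 98, 97, 96, 95, and it handles ties (e.g. the repeated
    # 64s and 68s) the same way as the sum(data >= i) loop. Because rank()
    # only sorts once, it also scales to the full 3384766-row file.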
Changbin -
Does

    100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) / length(matt$reads)

give what you want?
By the way, if you want to use a loop (there's nothing wrong with
that),
then try to avoid the most common mistake that people make with loops in R:
having your result grow inside the loop. Here's a better way to use a loop
to solve your problem:
cover_per_1 <- function(data){
    l = length(data)
    output = numeric(l)
    for(i in 1:l)
        output[i] = 100 * sum(ifelse(data >= data[i], 1, 0)) / length(data)
    output
}
Using some random data, and comparing to your original cover_per function:
> dat = rnorm(1000)
> system.time(one <- cover_per(dat))
   user  system elapsed
  0.816   0.000   0.824
> system.time(two <- cover_per_1(dat))
   user  system elapsed
  0.792   0.000   0.805
Not that big a speedup, but it does increase quite a bit as the problem gets
larger.
There are two obvious ways to speed up your function:
1) Eliminate the ifelse function, since automatic coercion from
logical to numeric does the same thing.
2) Multiply by 100 and divide by the length outside the loop:
cover_per_2 <- function(data){
l = length(data)
output = numeric(l)
for(i in 1:l)output[i] = sum(data >= data[i])
100 * output / l
}
> system.time(three <- cover_per_2(dat))
   user  system elapsed
  0.024   0.000   0.027
That makes the loop just about equivalent to the sapply solution:
> system.time(four <- 100 * sapply(dat, function(x) sum(dat >= x)) / length(dat))
   user  system elapsed
  0.024   0.000   0.026
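For scale: both of these versions still compare every element with every
other, so the cost grows with the square of the number of rows. A rough,
machine-dependent extrapolation of the 0.024 s measured above for 1,000 rows
to the full 3,384,766-row file (illustrative arithmetic only; the rank()-based
sketch earlier in the thread avoids the quadratic cost entirely):

    # Quadratic cost scales with (n_big / n_small)^2.
    0.024 * (3384766 / 1000)^2 / 3600   # roughly 76 hours

That is consistent with a job still running after 24 hours.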
- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spector at stat.berkeley.edu
On Thu, 4 Nov 2010, Changbin Du wrote:
> Hi, dear R community,
>
> I have a data set like the one below. What I want to do is calculate the
> cumulative coverage. The following code works for a small data set (~100
> rows), but when I feed it the whole data set, it is still running after
> 24 hours. Can someone give me some suggestions for handling a long vector?
> [...]