Hi folks,

I am working on a project that requires subsetting a found file based on some known file. The known file contains several lines like below:

chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +

where the first column can be chr2, chr1, chr12, etc., and the second and third are numbers (coordinates). The found file contains lines like:

chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G

where the first column, again, can be chr1, chr2, chr12, etc., and the second is a number. What I have to do is separate the found file into two files: one (foundY) containing the lines whose first column matches the first column of some line of the known file and whose second column lies in the range given by columns 2 and 3 of that line, and one (foundN) containing the lines that do not meet that condition. For the two examples above, foundN will be the first line, and foundY will be the next 4 lines.

What I came up with is this algorithm:

* get the unique items in the first column of the found file (chr1, chr2, chr12, chr13, etc.)
* for each of the unique items, take the subsets of the known file and the found file that have that same first column, then scan each line of the found subset to see whether any line of the known subset meets the condition

The code is like below:

########## CODE START ###########
# import known and found files into data frames
known <- read.table( "known.txt", sep="\t", header=FALSE )
found <- read.table( "found.txt", sep="\t", header=FALSE, fill=TRUE )

# get the unique items in the first column of the found file
found.Chr <- as.character(found[!duplicated(found[[1]]),1])

# create two empty result data frames
foundN <- found[0,]
foundY <- found[0,]

# scan for each of the unique items
for ( iChr in found.Chr ) {
    # subsets of known and found with this specific item
    found.iChr <- found[found[[1]]==iChr,]
    known.iChr <- known[known[[1]]==iChr,]

    # scan through all found subset items
    if ( nrow(known.iChr)>0 ) {
        for ( i in 1:nrow(found.iChr) ) {
            if ( nrow(known.iChr[known.iChr[[3]]>=found.iChr[i,2] &
                                 known.iChr[[2]]<=found.iChr[i,2],])==0 ) {
                foundN <- rbind( foundN, found.iChr[i,] )
            } else {
                foundY <- rbind( foundY, found.iChr[i,] )
            }
        }
    } else {
        # no known lines for this item: none of these rows can match
        foundN <- rbind( foundN, found.iChr )
    }
}
########## CODE END ###########

The code works well, but I have tested it only with small known and found files. When trying it with larger files (the known file can contain ~15 million lines, the found file ~15k lines), it takes hours to run.

I want to speed up the process, and I believe there must be a better algorithm to do this with R. My questions are:

* does anybody have a better algorithm, or any comments or suggestions?
* I read (via Google) that matrices work faster than data frames. Can I use matrices for this case? (Are matrices for numbers only?)
* I read (via Google) that I should avoid rbind and preallocate the data frames for speed. How would I do that in this case?

Thank you very much in advance,

Bests,

D.
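A minimal base-R sketch of the preallocation idea behind the last question (not from the thread itself; the hit vector and the loop layout are only illustrative): flag each row of found once, then subset the data frame a single time at the end, so no rbind is needed inside the loop.

# preallocate one logical flag per row of found; no rbind in the loop
hit <- logical(nrow(found))
for ( iChr in unique(as.character(found[[1]])) ) {
    known.iChr <- known[known[[1]]==iChr,]
    if ( nrow(known.iChr)==0 ) next   # flags stay FALSE -> rows go to foundN
    for ( i in which(found[[1]]==iChr) ) {
        hit[i] <- any( known.iChr[[2]]<=found[i,2] &
                       known.iChr[[3]]>=found[i,2] )
    }
}
foundY <- found[hit,]    # one subset at the end
foundN <- found[!hit,]

On the matrix question: an atomic matrix in R holds values of a single type, so a mixed character/numeric table like found is more naturally kept as a data frame.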
On 1/12/2011 2:52 PM, Duke wrote:
> Hi folks,
>
> I am working on a project that requires subsetting a found file based
> on some known file. [...]
>
> * does anybody have a better algorithm, or any comments or suggestions?

The Bioconductor project has many tools for dealing with sequence-related data. With the data

k <- read.table(textConnection(
"chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +"))

f <- read.table(textConnection(
"chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G"))

one might use the GenomicRanges package as

library(GenomicRanges)
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
olaps <- findOverlaps(fgr, kgr)
idx <- countOverlaps(fgr, kgr) != 0

resulting in

> idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast.
One could write foundY with as.data.frame(fgr[idx]) (maybe a little editing), but likely one would want to stay in R / Bioc and do something more interesting...

See http://bioconductor.org/install/index.html

Martin

> [...]

--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024
Seattle, WA 98109
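One way to finish from Martin's idx (a sketch, not from the thread; the output file names are invented): since countOverlaps() returns one value per element of fgr, and fgr was built from f in row order, idx is aligned with the rows of f, so the original data frame can be split and written directly, preserving the input columns unchanged.

foundY <- f[idx,]     # rows whose position falls inside some known range
foundN <- f[!idx,]    # everything else
write.table(foundY, "foundY.txt", sep="\t", quote=FALSE,
            row.names=FALSE, col.names=FALSE)
write.table(foundN, "foundN.txt", sep="\t", quote=FALSE,
            row.names=FALSE, col.names=FALSE)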
On 1/12/11 6:12 PM, Martin Morgan wrote:
> The Bioconductor project has many tools for dealing with
> sequence-related data. [...]
>
> This will be fast.

Thanks so much for your suggestion Martin. I had Bioconductor installed, but I honestly do not know all its applications. Anyway, I am testing GenomicRanges with my data now. I will report back when I get the result.

> One could write foundY with as.data.frame(fgr[idx]) (maybe a little
> editing), but likely one would want to stay in R / Bioc and do
> something more interesting...

I suppose foundN <- as.data.frame(fgr[!idx]) and foundY <- as.data.frame(fgr[idx]) as you suggested, but I don't really understand your last comment :).

Thanks,

D.
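The "little editing" Martin alludes to is presumably column cleanup: as.data.frame() on a GRanges yields seqnames, start, end, width, and strand ahead of the metadata columns. A sketch of recovering the original four-column layout (column names assumed from the GRanges construction above):

df <- as.data.frame(fgr[idx])
# drop the derived end/width/strand columns; keep chromosome, position,
# and the two allele columns carried along as metadata
foundY <- df[, c("seqnames", "start", "V3", "V4")]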
On 1/12/11 6:44 PM, Duke wrote:
> Thanks so much for your suggestion Martin. [...] I am testing
> GenomicRanges with my data now. I will report back when I get the
> result.

I got the results. My code took ~580 min (~10 hrs) to finish, whereas using GenomicRanges as Martin suggested took only 22 min (almost 30 times faster!). Thanks so much for this improvement Martin.

D.