何尧
2016-Apr-05 17:27 UTC
[R] Is that an efficient way to find the overlapped , upstream and downstream ranges for a bunch of ranges
I have a set of genes (~50000) from the whole genome, read in as genomic ranges.

A range (gene) can be seen as an observation with three columns (chromosome, start and end), like this:

          seqnames start end width strand
    gene1     chr1     1   5     5      +
    gene2     chr1    10  15     6      +
    gene3     chr1    12  17     6      +
    gene4     chr1    20  25     6      +
    gene5     chr1    30  40    11      +

I am wondering whether there is an efficient way to find the overlapped, upstream and downstream genes for each gene in the GRanges.

For example, assuming all_genes_gr is a genomic range of ~50000 genes, the result I want looks like this:

    gene_name  upstream_gene  downstream_gene  overlapped_gene
    gene1      NA             gene2            NA
    gene2      gene1          gene4            gene3
    gene3      gene1          gene4            gene2
    gene4      gene3          gene5            NA

Currently, the strategy I use is:

    library(GenomicRanges)

    find_overlapped_gene <- function(idx, all_genes_gr) {
      #cat(idx, "\n")
      # Compare one gene against all the others
      curr_gene <- all_genes_gr[idx]
      other_genes <- all_genes_gr[-idx]
      # Number of other genes that overlap the current gene
      n <- countOverlaps(curr_gene, other_genes)
      # Note: this returns curr_gene itself if it overlaps any other gene
      gene <- subsetByOverlaps(curr_gene, other_genes)
      return(list(n, gene))
    }

    system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))

However, for 100 genes this takes ~8 s according to system.time(). That means that for 50000 genes it would take about an hour just to find the overlapped genes.

I am wondering whether there is an algorithm or strategy to do this efficiently, perhaps 50000 genes in ~10 min or even less.
David Winsemius
2016-Apr-06 01:21 UTC
[R] Is that an efficient way to find the overlapped , upstream and downstream ranges for a bunch of ranges
> On Apr 5, 2016, at 10:27 AM, 何尧 <heyao at pku.edu.cn> wrote:
>
> I am wondering whether there is an efficient way to find the overlapped, upstream and downstream genes for each gene in the GRanges.

The data.table package (on CRAN) and the IRanges package (in Bioconductor) have formalized efficient approaches to those problems.

> However, for 100 genes this takes ~8 s according to system.time(). That means that for 50000 genes it would take about an hour just to find the overlapped genes.
>
> I am wondering whether there is an algorithm or strategy to do this efficiently, perhaps 50000 genes in ~10 min or even less.

With those tools I would expect this to run much faster than that on such a small dataset.
--
David Winsemius
Alameda, CA, USA
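As a rough illustration of the vectorized style that the reply above points to, here is a minimal sketch using GenomicRanges on the five example genes from the question. It is not taken from the thread: the object names (`gr`, `up`, `down`, `result`) are made up for illustration, and it assumes that "upstream" and "downstream" mean the nearest non-overlapping gene as reported by `follow()` and `precede()`, which take strand into account. Each step is one call over the whole set of ranges instead of one call per gene.

```r
library(GenomicRanges)

# Example genes from the question (hypothetical object name `gr`)
gr <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(1, 10, 12, 20, 30),
                   end   = c(5, 15, 17, 25, 40)),
  strand = "+"
)
names(gr) <- paste0("gene", 1:5)

# Nearest non-overlapping neighbours, computed for all genes at once.
# follow(x, subject): index of the subject range that x comes after (upstream).
# precede(x, subject): index of the subject range that x comes before (downstream).
up   <- follow(gr, gr)
down <- precede(gr, gr)

# All pairwise overlaps in one call, then drop each gene's hit against itself
hits <- findOverlaps(gr, gr)
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collapse the overlapping gene names per query gene (NA when there are none)
overlapped <- rep(NA_character_, length(gr))
ov <- tapply(names(gr)[subjectHits(hits)], queryHits(hits),
             paste, collapse = ",")
overlapped[as.integer(names(ov))] <- as.vector(ov)

result <- data.frame(
  gene_name       = names(gr),
  upstream_gene   = names(gr)[up],    # NA index gives NA name
  downstream_gene = names(gr)[down],
  overlapped_gene = overlapped,
  stringsAsFactors = FALSE
)
print(result)
```

Because nothing here loops over genes in R, the same code should scale to the full all_genes_gr object far better than the per-gene lapply() in the question, although the actual timing depends on the data and has not been measured here.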
Michael Lawrence
2016-Apr-11 14:57 UTC
[R] Is that an efficient way to find the overlapped , upstream and downstream ranges for a bunch of ranges
For the sake of posterity, this question was asked and answered here: https://support.bioconductor.org/p/80448