Yao He
2016-Apr-05 17:29 UTC
[R] Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges
I have a set of genes (about 50,000) from the whole genome, which I
read in as genomic ranges (a GRanges object).
Each range (gene) can be seen as an observation with three columns: chromosome,
start and end, like this:
      seqnames start end width strand
gene1     chr1     1   5     5      +
gene2     chr1    10  15     6      +
gene3     chr1    12  17     6      +
gene4     chr1    20  25     6      +
gene5     chr1    30  40    11      +
I am wondering whether there is an efficient way to find the *overlapping,
upstream and downstream genes for each gene in the GRanges*.
For example, assuming all_genes_gr is a GRanges object of ~50,000 genes, the
result I want looks like this:
gene_name  upstream_gene  downstream_gene  overlapped_gene
gene1      NA             gene2            NA
gene2      gene1          gene4            gene3
gene3      gene1          gene4            gene2
gene4      gene3          gene5            NA
Currently, the strategy I use is this:
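library(GenomicRanges)

# For the gene at position idx, count and extract the other genes it overlaps.
find_overlapped_gene <- function(idx, all_genes_gr) {
  # cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  # Number of other genes that overlap the current gene
  n <- countOverlaps(curr_gene, other_genes)
  # The overlapping genes themselves (the ranges being subset come first)
  gene <- subsetByOverlaps(other_genes, curr_gene)
  return(list(n, gene))
}

# Time the first 100 genes only
system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))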
However, for 100 genes this takes about 8 s according to system.time(). That
means that for 50,000 genes it would take roughly an hour just to find the
overlapping genes.
Is there an algorithm or strategy that does this efficiently, say
50,000 genes in about 10 minutes or less?
Yao He
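
One possible direction, sketched here rather than offered as a definitive answer: GenomicRanges functions are vectorised, so the per-gene lapply loop can probably be replaced by a single call each to findOverlaps(), follow() and precede() on the whole object. The toy data below is copied from the example table above; the helper variables (hits, q, s, keep, ov, overlapped, result) are illustrative names only. Note that follow() and precede() are strand-aware by default, so for genes on the "-" strand "upstream" and "downstream" follow the direction of transcription; whether that is the intended meaning is an assumption.

```r
library(GenomicRanges)

# Toy data copied from the example table in the question.
all_genes_gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 10, 12, 20, 30),
                     end   = c(5, 15, 17, 25, 40)),
  strand   = "+"
)
names(all_genes_gr) <- paste0("gene", 1:5)
gene_names <- names(all_genes_gr)

# All overlapping pairs in one call; every gene trivially overlaps itself,
# so those self-hits are filtered out afterwards.
hits <- findOverlaps(all_genes_gr, all_genes_gr)
q <- queryHits(hits)
s <- subjectHits(hits)
keep <- q != s

# Collapse the overlapping gene names per query gene (NA if there are none).
overlapped <- rep(NA_character_, length(all_genes_gr))
ov <- split(gene_names[s[keep]], q[keep])
overlapped[as.integer(names(ov))] <-
  vapply(ov, paste, character(1), collapse = ",")

# follow():  nearest non-overlapping gene that comes before each gene.
# precede(): nearest non-overlapping gene that comes after each gene.
# Both return NA where no such gene exists.
result <- data.frame(
  gene_name       = gene_names,
  upstream_gene   = gene_names[follow(all_genes_gr, all_genes_gr)],
  downstream_gene = gene_names[precede(all_genes_gr, all_genes_gr)],
  overlapped_gene = overlapped,
  stringsAsFactors = FALSE
)
print(result)
```

On the toy data this should give the same four rows as the desired table above, plus a row for gene5. Because each function is called once on the full object instead of once per gene, the run time on ~50,000 genes would be expected to be far below the loop version, although this has not been timed here.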
