Hello,

I have start and end coordinates from different experiments (DNase hypersensitivity data) and now I would like to combine overlapping intervals. For instance (see my test data below), (2) 30-52 and (3) 49-101 are combined to 30-101. But 49-101 and 70-103 would not be combined because they are on different chromosomes (chr a and chr b).

Does anybody have an idea?

Thanks
Hermann

> df
  chr start end
1   a     5  10
2   a    30  52
3   a    49 101
4   b    70 103
5   b   100 130
6   b   129 140
> dput(df)
structure(list(chr = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), start = c(5, 30, 49, 70, 100, 129),
    end = c(10, 52, 101, 103, 130, 140)), .Names = c("chr", "start",
"end"), row.names = c(NA, -6L), class = "data.frame")
On 11/05/2012 09:14 AM, Hermann Norpois wrote:
> Hello,
>
> I have start and end coordinates from different experiments (DNase
> hypersensitivity data) and now I would like to combine overlapping
> intervals. For instance (see my test data below) (2) 30-52 and (3) 49-101
> are combined to 30-101. But 49-101 and 70-103 would not be combined because
> they are on different chromosomes (chr a and chr b).
> Does anybody have an idea?

This data is very naturally handled by the "GRanges" class in Bioconductor's GenomicRanges package

   source("http://bioconductor.org/biocLite.R")
   biocLite("GenomicRanges")
   library(GenomicRanges)
   gr = GRanges(rep(c("a", "b"), each=3),
                IRanges(c(5, 30, 49, 70, 100, 129),
                        c(10, 52, 101, 103, 130, 140)),
                strand="*")

and then

   > reduce(gr)
   GRanges with 3 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]        a [ 5,  10]      *
     [2]        a [30, 101]      *
     [3]        b [70, 140]      *
     ---
     seqlengths:
       a  b
      NA NA

There are vignettes

   vignette(package="GenomicRanges")

and additional training material, e.g.,

   http://bioconductor.org/help/course-materials/2012/CSC2012/

If you pursue this solution then please follow up with questions on the Bioconductor mailing list

   http://bioconductor.org/help/mailing-list/

Martin
-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
Hi,

Maybe you should check this link (http://r.789695.n4.nabble.com/R-overlapping-intervals-td810061.html).

dat1 <- structure(list(chr = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), start = c(5, 30, 49, 70, 100, 129),
    end = c(10, 52, 101, 103, 130, 140)), .Names = c("chr", "start",
"end"), row.names = c(NA, -6L), class = "data.frame")

Using Jim's code:

fun1 <- function(x) {
  # mark every integer position covered by at least one interval
  covered <- logical(max(x[, 2], x[, 3]))
  covered[unlist(mapply(seq, x[, 2], x[, 3]))] <- TRUE
  # each run of TRUE values is one merged interval
  r <- rle(covered)
  offset <- cumsum(r$lengths)
  cbind(offset[r$values] - r$lengths[r$values] + 1, offset[r$values])
}

# apply fun1 to each chromosome separately, then stack the results
list1 <- split(dat1, dat1$chr)
res <- do.call(rbind, lapply(names(list1), function(nm)
  data.frame(chr = nm, fun1(list1[[nm]]))))
rownames(res) <- 1:nrow(res)
colnames(res) <- colnames(dat1)
res
#   chr start end
# 1   a     5  10
# 2   a    30 101
# 3   b    70 140

A.K.

----- Original Message -----
From: Hermann Norpois <hnorpois at googlemail.com>
To: r-help at r-project.org
Sent: Monday, November 5, 2012 12:14 PM
Subject: [R] fusion of overlapping intervals
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
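For readers who want to stay in base R without allocating one logical value per genomic position, the merge asked for in the question can also be sketched as a sort-and-sweep within each chromosome. This is only an illustration under stated assumptions: the helper name `merge_intervals` is hypothetical and not from the thread, the column names `chr`, `start` and `end` follow the posted test data, and two intervals are treated as overlapping when one starts at or before the furthest end seen so far.

```r
# Hypothetical helper (not from the thread): merge overlapping intervals
# per chromosome by sorting on start and sweeping with a running maximum end.
merge_intervals <- function(d) {
  pieces <- lapply(split(d, d$chr, drop = TRUE), function(x) {
    x <- x[order(x$start, x$end), ]
    n <- nrow(x)
    # furthest end among all earlier intervals (-Inf for the first one)
    prev_end <- c(-Inf, cummax(x$end)[-n])
    # a new group starts when an interval begins after everything before it ends
    grp <- cumsum(x$start > prev_end)
    data.frame(chr   = x$chr[1],
               start = as.vector(tapply(x$start, grp, min)),
               end   = as.vector(tapply(x$end, grp, max)))
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

# Test data as posted in the question
df <- data.frame(chr   = factor(c("a", "a", "a", "b", "b", "b")),
                 start = c(5, 30, 49, 70, 100, 129),
                 end   = c(10, 52, 101, 103, 130, 140))
merge_intervals(df)
```

On the posted test data this gives the same three intervals as the other answers (a 5-10, a 30-101, b 70-140). Note that this sketch merges only intervals that share at least one coordinate; to also fuse directly adjacent intervals such as 5-10 and 11-20, as the position-marking approach in fun1 does, the comparison could be changed to `x$start > prev_end + 1`.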